# Maximal Marginal Relevance (with Sliding Window)

The introduction can be found in my [blog].

In [None]:
import pandas as pd
import numpy as np

## Prepare Dataset
- We use an open dataset from Taiwan's [government open data platform](https://data.gov.tw/dataset/43388).
- The CAS dataset has some interesting columns, such as:
    * Emblem_ID (emblem number)
    * Factory_CName (manufacturer name)
    * Factory_Address (address)
    * Factory_Tel (telephone)
    * Factory_Fax (fax)
    * Factory_Director (person in charge)
    * Material_Name (product category)
    * PType_Name (product type)
    * **Product_Name (product name)**
- Assuming users search for products by `Product_Name`, we return the search results re-ranked with MMR.


In [5]:
df = pd.read_csv("COA_OpenData.csv")
df.head()

| | Emblem_ID | Factory_CName | Factory_Address | Factory_Tel | Factory_Fax | Factory_Director | Material_Name | PType_Name | Product_Name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 162001 | 德豐木業股份有限公司 | 南投縣竹山鎮延平一路二號 | 049-2642094 | 049-2647894 | 李岳峰 | 林產加工品 | 集成材 | 柳杉結構用集成材(異等級結構用(對稱構成、非對稱構成)、同等級結構用、大斷面、中斷面、小斷面... |
| 1 | 161902 | 明昇木業 | 嘉義市西區世賢路一段580巷310號 | 052326186 | 052339102 | 李明生 | 林產加工品 | 木製材品 | 柳杉底材用製材品(針葉樹底材用製材(角材、板材)，材面品質2等以上，含水率為30％以下，乾燥... |
| 2 | 161901 | 明昇木業 | 嘉義市西區世賢路一段580巷310號 | 052326186 | 052339102 | 李明生 | 林產加工品 | 木製材品 | 柳杉製材品(乙種結構用製材(板材)，材面品質3等以上，含水率SD25，乾燥處理(天然)，尺度... |
| 3 | 161802 | 昆儀實業股份有限公司 | 宜蘭縣蘇澳鎮自強路6號 | 03-9903188 | 03-9905088 | 郭宗欽 | 林產加工品 | 木製材品 | 防腐柳杉規格品(裝修用木材(板材)，材面品質大節(含)以上，含水率D18，板材，防腐處理AC... |
| 4 | 161802 | 昆儀實業股份有限公司 | 宜蘭縣蘇澳鎮自強路6號 | 03-9903188 | 03-9905088 | 郭宗欽 | 林產加工品 | 木製材品 | 防腐柳杉規格品(乙種結構用材(角材、圓柱)，材面品質3等(含)以上，含水率SD20，防腐處理... |


## Embedding and Similarity
- Embedding: `Qwen3-Embedding-0.6B` (Dimension: 1024)
- Package: `sentence-transformers`
- Similarity: cosine similarity
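
As a quick illustration of the similarity measure listed above, here is a minimal sketch of cosine similarity on toy 2-D vectors. The helper name `cosine_sim` and the vectors are assumptions made up for this example; the notebook itself relies on library implementations.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 2.0])

same_direction = cosine_sim(u, u)   # vectors pointing the same way
forty_five_deg = cosine_sim(u, v)   # vectors 45 degrees apart
orthogonal = cosine_sim(u, w)       # perpendicular vectors
print(same_direction, forty_five_deg, orthogonal)
```

Because the measure depends only on the angle between vectors, the length of `w` does not affect its score.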

In [6]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Embed the product names (the documents to search)
documents = df["Product_Name"].tolist()
document_embeddings = model.encode(documents)

## Search with MMR
- Take the top 30 documents by similarity as the retrieval result
- Re-rank them with MMR and keep the top 10

For MMR, we implement a version with a sliding window.
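
As a sketch of what the sliding-window variant computes at each step, the selection rule can be written as below. The notation is an assumption introduced here for illustration: $q$ is the query, $C$ the set of remaining candidates, $S_k$ the $k$ most recently selected documents, $\lambda$ the `diversity` weight, and $\mathrm{sim}$ the cosine similarity.

$$
d^{*} = \arg\max_{d \in C} \Big[ (1 - \lambda)\,\mathrm{sim}(q, d) \;-\; \lambda \max_{s \in S_k} \mathrm{sim}(d, s) \Big]
$$

With $\lambda = 0$ the rule reduces to plain similarity ranking; larger values of $\lambda$ penalize candidates that resemble recent selections more heavily.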

In [7]:
def mmr(
    query_embedding: np.ndarray,
    document_embeddings: np.ndarray,
    diversity: float = 0.1,
    top_n: int = 10,
    window_size: int | None = None
) -> list[int]:
    """Maximal Marginal Relevance (with sliding window).

    Arguments:
        query_embedding: The query embedding
        document_embeddings: The embeddings of the candidate documents
        diversity: The weight of the diversity penalty, between 0 and 1.
        top_n: The number of documents to return
        window_size: The number of most recently selected documents each
            candidate is compared against; defaults to min(10, number of documents)

    Returns:
        list[int]: The indices of the selected documents, in selection order
    """
    from sklearn.metrics.pairwise import cosine_similarity

    # compute similarity(Q, D) and similarity(D, D)
    query_doc_similarity = cosine_similarity([query_embedding], document_embeddings)[0]
    pair_similarity = cosine_similarity(document_embeddings)

    if window_size is None:
        window_size = min(10, len(document_embeddings))

    # doc_idx collects the selected indices (the result);
    # candidates_idx tracks the documents not yet selected
    doc_idx = [int(np.argmax(query_doc_similarity))]
    candidates_idx = [i for i in range(len(document_embeddings)) if i != doc_idx[0]]
    # never try to select more documents than exist
    for _ in range(min(top_n, len(document_embeddings)) - 1):
        # in each iteration, select one document from the candidates using MMR
        candidate_similarities = query_doc_similarity[candidates_idx]
        # redundancy: highest similarity to any document in the sliding window
        target_similarities = np.max(pair_similarity[candidates_idx][:, doc_idx[-window_size:]], axis=1)

        # calculate the MMR scores (named mmr_scores to avoid shadowing the function)
        mmr_scores = (1 - diversity) * candidate_similarities - diversity * target_similarities
        mmr_idx = candidates_idx[int(np.argmax(mmr_scores))]

        # Update doc_idx & candidates
        doc_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return doc_idx

In [8]:
def search(
    query: str,
    using_mmr: bool = True,
    window_size: int | None = None,
    diversity: float = 0.1
) -> pd.Series:
    query_embedding = model.encode(query, prompt_name="query")
    similarity_scores = model.similarity(query_embedding, document_embeddings)[0]
    
    indices = np.argsort(similarity_scores.tolist())[::-1]
    
    if not using_mmr:
        return df.Product_Name.iloc[indices[:10]]
    
    doc_idx = mmr(
        query_embedding,
        document_embeddings[indices[:30]],
        top_n=10,
        window_size=window_size,
        diversity=diversity
    )
    return df.Product_Name.iloc[indices[doc_idx]]
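
One detail of `search` is easy to miss: `mmr` returns positions within the top-30 subset, so they have to be mapped back to positions in the full dataset through `indices`. A minimal sketch of that mapping, using made-up scores and picks purely for illustration:

```python
import numpy as np

# Toy similarity scores for five documents (illustrative values only)
scores = np.array([0.1, 0.9, 0.4, 0.8, 0.3])

# Rank all documents from most to least similar
indices = np.argsort(scores)[::-1]

# Stage 1: keep the top 3 as the retrieval result
top3 = indices[:3]

# Stage 2: suppose a re-ranker picked positions 0 and 2 *within* top3
local_pick = [0, 2]

# Map the local positions back to positions in the full dataset
global_pick = top3[local_pick]
print(top3.tolist(), global_pick.tolist())
```

Indexing the original data with `local_pick` directly would return the wrong rows, which is why `search` goes through `indices[doc_idx]`.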

In [12]:
pd.DataFrame({
    "Diversity=0.1": search("鮮乳", diversity=0.1).to_list(),
    "Diversity=0.9": search("鮮乳", diversity=0.9).to_list()
})

| | Diversity=0.1 | Diversity=0.9 |
|---|---|---|
| 0 | 四方鮮乳全脂鮮乳 | 四方鮮乳全脂鮮乳 |
| 1 | 四方鮮乳低脂鮮乳 | 酪農戶限定鮮乳.萬丹 |
| 2 | 四方鮮乳全脂鮮乳 | 牧鄉極致鮮乳 |
| 3 | 四方鮮乳低脂鮮乳 | 豐新鮮100%鮮羊乳 |
| 4 | 四方鮮乳低脂鮮乳 | 光泉鮮乳-成分無調整 |
| 5 | 鮮配家鮮羊乳 | 崙背鮮乳 |
| 6 | 低脂鮮乳 | 85度C高品質鮮乳 |
| 7 | 豐新鮮100%鮮羊乳 | 福樂一番鮮低脂鮮乳 |
| 8 | 統一鮮乳 | 統一鮮乳 |
| 9 | 低脂鮮乳 | 華南鮮羊乳 |


As the table above shows for the query `鮮乳` (fresh milk), a high diversity weight (0.9) returns ten distinct products, whereas a low weight (0.1) returns many duplicates, such as `四方鮮乳全脂鮮乳` and `四方鮮乳低脂鮮乳`.

In [11]:
pd.DataFrame({
    "window_size=None": search("鮮乳", diversity=0.6, window_size=None).to_list(),
    "window_size=4": search("鮮乳", diversity=0.6, window_size=4).to_list(),
})

| | window_size=None | window_size=4 |
|---|---|---|
| 0 | 四方鮮乳全脂鮮乳 | 四方鮮乳全脂鮮乳 |
| 1 | 豐新鮮100%鮮羊乳 | 豐新鮮100%鮮羊乳 |
| 2 | 酪農戶限定鮮乳.萬丹 | 酪農戶限定鮮乳.萬丹 |
| 3 | 牧鄉極致鮮乳 | 牧鄉極致鮮乳 |
| 4 | 光泉鮮乳-成分無調整 | 光泉鮮乳-成分無調整 |
| 5 | 崙背鮮乳 | 四方鮮乳全脂鮮乳 |
| 6 | 85度C高品質鮮乳 | 崙背鮮乳 |
| 7 | 福樂一番鮮特極鮮乳 | 鮮配家鮮羊乳 |
| 8 | 統一鮮乳 | 85度C高品質鮮乳 |
| 9 | 鮮配家鮮羊乳 | 光泉鮮乳-成分無調整 |


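If `window_size` is set to k, each MMR iteration measures diversity only against the k most recently selected documents. When `window_size` is left as `None`, the window defaults to 10, which covers every selected document here because `top_n` is 10. As shown in the table above, with `window_size=4`, `四方鮮乳全脂鮮乳` appears at indices 0 and 5 and `光泉鮮乳-成分無調整` appears at indices 4 and 9: by the time the second copy is chosen, the first copy has already fallen out of the four-document window.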
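
The window effect can be reproduced on a toy example. The sketch below is not the notebook's `mmr` function: `mmr_from_similarities` is a hypothetical helper that takes precomputed similarities, treats `window_size=None` as "compare against every selected document", and uses made-up values in which documents 0 and 1 are near-duplicates.

```python
import numpy as np


def mmr_from_similarities(query_sim, pair_sim, diversity, top_n, window_size=None):
    """Greedy MMR over precomputed similarities (hypothetical helper)."""
    query_sim = np.asarray(query_sim, dtype=float)
    pair_sim = np.asarray(pair_sim, dtype=float)
    # Start from the document most similar to the query
    selected = [int(np.argmax(query_sim))]
    candidates = [i for i in range(len(query_sim)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        # None means "penalize against every selected document"
        window = selected if window_size is None else selected[-window_size:]
        # Redundancy: highest similarity to any document in the window
        redundancy = pair_sim[np.ix_(candidates, window)].max(axis=1)
        scores = (1 - diversity) * query_sim[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates (pairwise similarity 0.98)
query_sim = [0.90, 0.88, 0.60, 0.50]
pair_sim = [
    [1.00, 0.98, 0.20, 0.20],
    [0.98, 1.00, 0.20, 0.20],
    [0.20, 0.20, 1.00, 0.30],
    [0.20, 0.20, 0.30, 1.00],
]

full = mmr_from_similarities(query_sim, pair_sim, diversity=0.5, top_n=3)
windowed = mmr_from_similarities(query_sim, pair_sim, diversity=0.5, top_n=3, window_size=1)
print(full)      # [0, 2, 3]
print(windowed)  # [0, 2, 1]
```

With the full history, document 1 stays penalized for resembling document 0 and is left out of the top 3. With a window of one, document 0 has dropped out of the window by the third pick, so its near-duplicate is selected again.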