# Hybrid Search with Milvus

This notebook demonstrates hybrid search capabilities using Milvus with BGE-M3 embeddings, combining dense and sparse vector search.


In [31]:
# Import required libraries
import pandas as pd
from IPython.display import Markdown, display
from pymilvus import (
    AnnSearchRequest,
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
    utility,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction


## Load and Prepare Data

Load the Quora duplicate questions dataset and prepare the documents for embedding.


In [32]:
# Load data from TSV file
file_path = "data/quora_duplicate_questions.tsv"
df = pd.read_csv(file_path, sep="\t")

# Extract unique questions, truncated to 512 characters each.
# For demo purposes, stop once more than 500 have been collected;
# the check runs after each row, so the final count can slightly exceed 500.
questions = set()
for _, row in df.iterrows():
    obj = row.to_dict()
    questions.add(obj["question1"][:512])
    questions.add(obj["question2"][:512])
    if len(questions) > 500:  # Remove this check to use the full dataset
        break

docs = list(questions)
print(f"Total documents loaded: {len(docs)}")
print(f"Sample document: {docs[0]}")


Total documents loaded: 502
Sample document: Almost two weeks in on Prozac, why do I still feel antisocial and still don't want to go to events? Why does my social behavior and mood change daily?
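The loading loop above can be sketched as a small standalone helper. This is a minimal sketch, not the notebook's own code: `collect_unique_questions` and its parameters are hypothetical names, and the rows are plain dicts rather than DataFrame rows. It also skips non-string values, on the assumption that a full TSV may contain missing questions (which pandas reads as float NaN and which cannot be sliced).

```python
def collect_unique_questions(rows, cap=500, max_chars=512):
    """Collect unique, truncated questions until more than `cap` are gathered."""
    questions = set()
    for row in rows:
        for key in ("question1", "question2"):
            value = row.get(key)
            # Skip missing values; pandas represents them as float NaN.
            if not isinstance(value, str):
                continue
            questions.add(value[:max_chars])
        # The check runs after both questions of a row are added,
        # so the final count can exceed `cap` by one or two.
        if len(questions) > cap:
            break
    return sorted(questions)


rows = [
    {"question1": "a", "question2": "b"},
    {"question1": float("nan"), "question2": "c"},
    {"question1": "d", "question2": "e"},
]
print(collect_unique_questions(rows, cap=2))
```

Because the cap is checked once per row, a cap of 500 can yield 502 documents, which matches the count printed above.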


## Initialize Embedding Function

Set up the BGE-M3 embedding function to generate both dense and sparse embeddings.


In [33]:
# Initialize BGE-M3 embedding function
ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
dense_dim = ef.dim["dense"]
print(f"Dense vector dimension: {dense_dim}")

# Generate embeddings for all documents
docs_embeddings = ef(docs)
print("Embeddings generated successfully!")



Dense vector dimension: 1024


Embeddings generated successfully!





## Connect to Milvus and Create Collection

Establish a connection to Milvus and create a collection with both dense and sparse vector fields.


In [34]:
# Connect to Milvus
uri = "http://localhost:19530"
connections.connect(uri=uri, token="root:Milvus")
print("Connected to Milvus successfully!")

# Define collection schema
fields = [
    # Use an auto-generated ID as the primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    # Store the original text so it can be returned with the search results
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=512),
    # Milvus supports both sparse and dense vectors;
    # store each in its own field so hybrid search can query both
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields)

# Create or recreate collection
col_name = "hybrid_demo"
if utility.has_collection(col_name):
    Collection(col_name).drop()
    print(f"Dropped existing collection: {col_name}")

col = Collection(col_name, schema, consistency_level="Bounded")
print(f"Created collection: {col_name}")


Connected to Milvus successfully!
Dropped existing collection: hybrid_demo
Created collection: hybrid_demo
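One detail worth checking for the `text` field: the documents were truncated to 512 characters, while the field is declared with `max_length=512`. To my understanding Milvus measures `VARCHAR` length in bytes rather than characters, so non-ASCII text could exceed the limit; treat that as an assumption to verify against your Milvus version. Under that assumption, a hypothetical helper such as `truncate_utf8` below could be applied before insertion.

```python
def truncate_utf8(text, max_bytes=512):
    """Truncate `text` so its UTF-8 encoding fits in `max_bytes` bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit and drop any partial trailing character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("ééé", max_bytes=5))
```

For pure ASCII text this is equivalent to slicing by character count, so the English-only demo data is unaffected.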


## Create Indexes and Insert Data

Create indexes for both dense and sparse vectors, then insert the documents and their embeddings.


In [35]:
# Create indexes
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
col.create_index("sparse_vector", sparse_index)
print("Created sparse vector index")

dense_index = {"index_type": "AUTOINDEX", "metric_type": "IP"}
col.create_index("dense_vector", dense_index)
print("Created dense vector index")

# Load collection
col.load()
print("Collection loaded successfully")


Created sparse vector index
Created dense vector index
Collection loaded successfully
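Both indexes above use the inner product (`IP`) metric. As a rough illustration of what that score means for each vector type, the sketch below computes it by hand. The list and dict representations and the toy values are assumptions for illustration only; the embedding function returns array types, and Milvus computes these scores internally.

```python
def dense_ip(a, b):
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def sparse_ip(a, b):
    """Inner product of two sparse vectors stored as {token_id: weight}."""
    # Iterate over the smaller dict; only shared token ids contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[token] for token, weight in a.items() if token in b)


print(dense_ip([1.0, 2.0], [3.0, 4.0]))
print(sparse_ip({1: 0.5, 7: 2.0}, {7: 3.0, 9: 1.0}))
```

The sparse score is non-zero only when the query and document share tokens, which is why sparse search tends to favor lexical overlap while dense search can match paraphrases.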


In [36]:
# Insert documents in batches
for i in range(0, len(docs), 50):
    batched_entities = [
        docs[i : i + 50],
        docs_embeddings["sparse"][i : i + 50],
        docs_embeddings["dense"][i : i + 50],
    ]
    col.insert(batched_entities)
    print(f"Inserted batch {i//50 + 1}/{(len(docs)-1)//50 + 1}")

# Flush to ensure data is written to disk
col.flush()
print(f"Number of entities inserted: {col.num_entities}")


Inserted batch 1/11
Inserted batch 2/11
Inserted batch 3/11
Inserted batch 4/11
Inserted batch 5/11
Inserted batch 6/11
Inserted batch 7/11
Inserted batch 8/11
Inserted batch 9/11
Inserted batch 10/11
Inserted batch 11/11
Number of entities inserted: 502
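The batching arithmetic used in the insert loop can be isolated into two small helpers. This is a sketch with hypothetical names (`iter_batches`, `batch_count`), independent of Milvus, showing why 502 documents with a batch size of 50 produce 11 batches.

```python
def iter_batches(items, batch_size=50):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def batch_count(n_items, batch_size=50):
    """Number of batches needed, using ceiling division."""
    return (n_items + batch_size - 1) // batch_size


sizes = [len(batch) for batch in iter_batches(list(range(502)))]
print(batch_count(502), sizes[-1])
```

The last batch holds only the two remaining documents, which slicing handles without any special case.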


## Get Search Query

Enter your search query to test different search methods.


In [37]:
# Get search query from user
query = input("Enter your search query: ")
print(f"Search query: {query}")

# Generate embeddings for the query
query_embeddings = ef([query])
print("Query embeddings generated successfully!")


Enter your search query:  How to start learning programming?


Search query: How to start learning programming?
Query embeddings generated successfully!


## Define Search Functions

Define functions for dense search, sparse search, and hybrid search.


In [38]:
def dense_search(col, query_dense_embedding, limit=10):
    """Perform dense vector search"""
    search_params = {"metric_type": "IP", "params": {}}
    res = col.search(
        [query_dense_embedding],
        anns_field="dense_vector",
        limit=limit,
        output_fields=["text"],
        param=search_params,
    )[0]
    return [hit.get("text") for hit in res]


def sparse_search(col, query_sparse_embedding, limit=10):
    """Perform sparse vector search"""
    search_params = {
        "metric_type": "IP",
        "params": {},
    }
    res = col.search(
        [query_sparse_embedding],
        anns_field="sparse_vector",
        limit=limit,
        output_fields=["text"],
        param=search_params,
    )[0]
    return [hit.get("text") for hit in res]


def hybrid_search(
    col,
    query_dense_embedding,
    query_sparse_embedding,
    sparse_weight=1.0,
    dense_weight=1.0,
    limit=10,
):
    """Perform hybrid search combining dense and sparse vectors"""
    dense_search_params = {"metric_type": "IP", "params": {}}
    dense_req = AnnSearchRequest(
        [query_dense_embedding], "dense_vector", dense_search_params, limit=limit
    )
    sparse_search_params = {"metric_type": "IP", "params": {}}
    sparse_req = AnnSearchRequest(
        [query_sparse_embedding], "sparse_vector", sparse_search_params, limit=limit
    )
    rerank = WeightedRanker(sparse_weight, dense_weight)
    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=rerank, limit=limit, output_fields=["text"]
    )[0]
    return [hit.get("text") for hit in res]

print("Search functions defined successfully!")


Search functions defined successfully!
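To build intuition for what a weighted reranker does with the two result lists, here is a simplified pure-Python sketch. It is not Milvus' implementation: the arctan-based normalization, the dict-based result format, and the names `normalize_ip` and `weighted_fuse` are assumptions made for illustration, and the actual scoring inside `WeightedRanker` may differ.

```python
import math


def normalize_ip(score):
    """Squash an unbounded inner-product score into (0, 1) using arctan."""
    return 0.5 + math.atan(score) / math.pi


def weighted_fuse(result_lists, weights, limit=10):
    """Fuse several {doc_id: raw_score} result sets with per-list weights."""
    fused = {}
    for hits, weight in zip(result_lists, weights):
        for doc_id, score in hits.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * normalize_ip(score)
    # Highest combined score first.
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


sparse_hits = {"a": 0.2, "b": 0.9}
dense_hits = {"b": 0.8, "c": 0.7}
fused = weighted_fuse([sparse_hits, dense_hits], weights=[0.7, 1.0])
print([doc_id for doc_id, _ in fused])
```

In this sketch a document that appears in both lists accumulates score from each, so it tends to rise above documents found by only one search. The weights are given in the same order as the result lists, mirroring how `hybrid_search` above passes `sparse_weight` first and `sparse_req` first.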


## Perform Different Types of Search

Execute dense search, sparse search, and hybrid search with the query.


In [39]:
# Perform different types of search
dense_results = dense_search(col, query_embeddings["dense"][0])
sparse_results = sparse_search(col, query_embeddings["sparse"][[0]])
hybrid_results = hybrid_search(
    col,
    query_embeddings["dense"][0],
    query_embeddings["sparse"][[0]],
    sparse_weight=0.7,
    dense_weight=1.0,
)

print(f"Dense search returned {len(dense_results)} results")
print(f"Sparse search returned {len(sparse_results)} results")
print(f"Hybrid search returned {len(hybrid_results)} results")


Dense search returned 10 results
Sparse search returned 10 results
Hybrid search returned 10 results
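One simple way to quantify how much the three result lists differ is the Jaccard overlap of their top-k entries. The helper below is a hypothetical sketch (`jaccard_at_k` is not part of pymilvus) and uses placeholder lists rather than the actual search results.

```python
def jaccard_at_k(a, b, k=10):
    """Jaccard overlap between the top-k items of two ranked lists."""
    top_a, top_b = set(a[:k]), set(b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


print(jaccard_at_k(["x", "y", "z"], ["y", "z", "w"], k=3))
```

In the notebook it could be called as `jaccard_at_k(dense_results, sparse_results)`; a low value would indicate that the two searches surface largely different documents, which is the situation where hybrid search has the most to combine.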


## Text Formatting Function

Define a function to highlight query terms in the search results for better visualization.


In [40]:
def doc_text_formatting(ef, query, docs):
    """Highlight in red the document tokens that also appear in the query.

    Matching is done on subword tokens, so fragments of a word may be highlighted.
    """
    tokenizer = ef.model.tokenizer
    # Only the token strings of the query are needed, not their offsets
    query_tokens_ids = tokenizer.encode(query)
    query_tokens = tokenizer.convert_ids_to_tokens(query_tokens_ids)
    formatted_texts = []

    for doc in docs:
        ldx = 0
        landmarks = []
        encoding = tokenizer.encode_plus(doc, return_offsets_mapping=True)
        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])[1:-1]
        offsets = encoding["offset_mapping"][1:-1]
        for token, (start, end) in zip(tokens, offsets):
            if token in query_tokens:
                if len(landmarks) != 0 and start == landmarks[-1]:
                    landmarks[-1] = end
                else:
                    landmarks.append(start)
                    landmarks.append(end)
        close = False
        formatted_text = ""
        for i, c in enumerate(doc):
            if ldx == len(landmarks):
                pass
            elif i == landmarks[ldx]:
                if close:
                    formatted_text += "</span>"
                else:
                    formatted_text += "<span style='color:red'>"
                close = not close
                ldx = ldx + 1
            formatted_text += c
        if close:
            formatted_text += "</span>"
        formatted_texts.append(formatted_text)
    return formatted_texts

print("Text formatting function defined successfully!")


Text formatting function defined successfully!
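The core of the formatting function is independent of the tokenizer: merge character spans that touch, then wrap each merged span in a tag. The sketch below shows that step in isolation. `merge_adjacent` and `highlight` are hypothetical helpers, and they assume the spans are sorted and non-overlapping, as tokenizer offsets normally are.

```python
def merge_adjacent(spans):
    """Merge (start, end) spans where one begins exactly where the previous ends."""
    merged = []
    for start, end in spans:
        if merged and start == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def highlight(text, spans, open_tag="<span style='color:red'>", close_tag="</span>"):
    """Wrap each merged span of `text` in the given tags."""
    parts = []
    cursor = 0
    for start, end in merge_adjacent(spans):
        parts.append(text[cursor:start])
        parts.append(open_tag + text[start:end] + close_tag)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


print(highlight("How to start?", [(0, 3), (3, 6)], "[", "]"))
```

Merging adjacent spans keeps the output compact: consecutive matching tokens share one `<span>` instead of each getting its own.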


## Display Search Results

Display the results from all three search methods with highlighted query terms.


In [41]:
# Display dense search results
display(Markdown("**Dense Search Results:**"))
formatted_results = doc_text_formatting(ef, query, dense_results)
for result in formatted_results:
    display(Markdown(result))


**Dense Search Results:**

What's the best way<span style='color:red'> to start learning</span> robotics<span style='color:red'>?</span>

<span style='color:red'>How</span> do I learn a computer language like java<span style='color:red'>?</span>

<span style='color:red'>How</span> can I get started<span style='color:red'> to</span> learn information security<span style='color:red'>?</span>

What is Java<span style='color:red'> programming? How</span> To Learn Java Programming Language ?

<span style='color:red'>How</span> can I learn computer security<span style='color:red'>?</span>

What is the best way<span style='color:red'> to start</span> robotics<span style='color:red'>?</span> Which is the best development board that I can<span style='color:red'> start</span> working on it<span style='color:red'>?</span>

<span style='color:red'>How</span> can I learn<span style='color:red'> to</span> speak English fluently<span style='color:red'>?</span>

What are the best ways<span style='color:red'> to</span> learn French<span style='color:red'>?</span>

<span style='color:red'>How</span> can you make physics easy<span style='color:red'> to</span> learn<span style='color:red'>?</span>

<span style='color:red'>How</span> do we prepare for UPSC<span style='color:red'>?</span>

In [42]:
# Display sparse search results
display(Markdown("**Sparse Search Results:**"))
formatted_results = doc_text_formatting(ef, query, sparse_results)
for result in formatted_results:
    display(Markdown(result))


**Sparse Search Results:**

What is Java<span style='color:red'> programming? How</span> To Learn Java Programming Language ?

What's the best way<span style='color:red'> to start learning</span> robotics<span style='color:red'>?</span>

What is the alternative<span style='color:red'> to</span> machine<span style='color:red'> learning?</span>

<span style='color:red'>How</span> do I create a new Terminal and new shell in Linux using C<span style='color:red'> programming?</span>

<span style='color:red'>How</span> do I create a new shell in a new terminal using C<span style='color:red'> programming</span> (Linux terminal)<span style='color:red'>?</span>

Which business is better<span style='color:red'> to start</span> in Hyderabad<span style='color:red'>?</span>

Which business is good<span style='color:red'> start</span> up in Hyderabad<span style='color:red'>?</span>

What is the best way<span style='color:red'> to start</span> robotics<span style='color:red'>?</span> Which is the best development board that I can<span style='color:red'> start</span> working on it<span style='color:red'>?</span>

What math does a complete newbie need<span style='color:red'> to</span> understand algorithms for computer<span style='color:red'> programming?</span> What books on algorithms are suitable for a complete beginner<span style='color:red'>?</span>

<span style='color:red'>How</span> do you make life suit you and stop life from abusi<span style='color:red'>ng</span> you mentally and emotionally<span style='color:red'>?</span>

In [43]:
# Display hybrid search results
display(Markdown("**Hybrid Search Results:**"))
formatted_results = doc_text_formatting(ef, query, hybrid_results)
for result in formatted_results:
    display(Markdown(result))


**Hybrid Search Results:**

What is Java<span style='color:red'> programming? How</span> To Learn Java Programming Language ?

What's the best way<span style='color:red'> to start learning</span> robotics<span style='color:red'>?</span>

What is the best way<span style='color:red'> to start</span> robotics<span style='color:red'>?</span> Which is the best development board that I can<span style='color:red'> start</span> working on it<span style='color:red'>?</span>

<span style='color:red'>How</span> do I learn a computer language like java<span style='color:red'>?</span>

<span style='color:red'>How</span> can I get started<span style='color:red'> to</span> learn information security<span style='color:red'>?</span>

<span style='color:red'>How</span> can I learn computer security<span style='color:red'>?</span>

<span style='color:red'>How</span> can I learn<span style='color:red'> to</span> speak English fluently<span style='color:red'>?</span>

What are the best ways<span style='color:red'> to</span> learn French<span style='color:red'>?</span>

<span style='color:red'>How</span> can you make physics easy<span style='color:red'> to</span> learn<span style='color:red'>?</span>

<span style='color:red'>How</span> do we prepare for UPSC<span style='color:red'>?</span>