 Install 'pymilvus', the Python SDK for Milvus, an open-source vector database.
 Milvus is used for managing, storing, and searching large-scale vector data.

In [None]:
!pip install pymilvus

 Install 'milvus', a high-performance vector database used for similarity search in large datasets.

In [None]:
!pip install milvus

Install 'sentence-transformers', a library for computing dense vector representations of sentences.
This library provides easy-to-use models for tasks like semantic search, clustering, and more.

In [21]:
!pip install sentence-transformers



 Install 'grpcio', a package that provides support for gRPC (gRPC Remote Procedure Calls).
 gRPC is a high-performance, open-source framework for building distributed systems and APIs.
 By specifying '1.60.0', we ensure compatibility with other packages that might require this version.

In [None]:
!pip install grpcio==1.60.0


In [None]:
!pip show grpcio


In [None]:
!pip install --force-reinstall grpcio==1.60.0 pymilvus


In [1]:
from sentence_transformers import SentenceTransformer
from milvus import default_server
from pymilvus import connections, utility, FieldSchema, CollectionSchema, DataType, Collection
from time import time

Starting the default milvus server instance
Establish a connection to the running Milvus server. The connection parameters include the host
(localhost) and the port number on which the server is listening, provided by default_server.listen_port.
This connection allows performing various operations on the Milvus server, such as creating collections,
inserting data, and running queries.

In [2]:
default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)

This model ("paraphrase-multilingual-MiniLM-L12-v2") is pre-trained to handle multiple languages
and is commonly used for tasks like paraphrase identification and semantic search.

This model ("paraphrase-multilingual-MiniLM-L12-v2-fine-tuned-3") is fine-tuned to improve performance
on specific tasks or datasets, potentially enhancing the accuracy of the embeddings it generates.

This model ("psais-paraphrase-multilingual-MiniLM-L12-v2-5shot") is fine-tuned using a few-shot learning
approach, which means it is optimized with a small number of training examples to perform well on specific tasks.

In [None]:
v12 = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
ft3_v12 = SentenceTransformer("Sprylab/paraphrase-multilingual-MiniLM-L12-v2-fine-tuned-3")
ft5_v12 = SentenceTransformer("hroth/psais-paraphrase-multilingual-MiniLM-L12-v2-5shot")

In [4]:
# COLLECTION_V12: This collection will store the embeddings generated by the base model.
# COLLECTION_V12_Q: This collection will store the quantized embeddings generated by the base model.
COLLECTION_V12 = "Multilingual_MiniLM_V12"
COLLECTION_V12_Q = "Multilingual_MiniLM_V12_Quantized"
DIMENSION = 384
# start with a clean slate
if utility.has_collection(COLLECTION_V12):
    utility.drop_collection(COLLECTION_V12)
if utility.has_collection(COLLECTION_V12_Q):
    utility.drop_collection(COLLECTION_V12_Q)

In [6]:
import pandas as pd
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Load your dataset
df = pd.read_csv('./game-of-thrones.csv')

# Assuming 'dialogue' is the column with text data
sentences = df['Text'].dropna().tolist()[:100]

In [None]:
sentences = [
    "I am not the kind of girl, who should be rudely barging in on a white veil occasion, but you are not the kind of boy who should be marrying the wrong girl",
    "I sneak in and see your friends and her snotty little family all dressed in pastel and she is yelling at a bridesmaid somewhere back inside a room wearing a gown shaped like a pastry",
    "This is surely not what you thought it would be.",
    "I lose myself in a daydream where I stand and say, 'Don't say yes, run away now I'll meet you when you're out of the church at the back door.'",
    "Don't wait, or say a single vow you need to hear me out and they said, 'Speak now'.",
    "Fond gestures are exchanged.",
    "And the organ starts to play a song that sounds like a death march.",
    "And I am hiding in the curtains, it seems that I was uni`nvited by your lovely bride-to-be.",
    "She floats down the aisle like a pageant queen.",
    "But I know you wish it was me you wish it was me don't you?",
    "I hear the preacher say, 'Speak now or forever hold your peace'",
    "There's the silence, there's my last chance.",
    "I stand up with shaky hands, all eyes on me",
    "Horrified looks from everyone in the room but I'm only looking at you.",
    "And you'll say, 'Let's run away now' I'll meet you when I'm out of my tux at the back door",
    "Baby, I didn't say my vows So glad you were around When they said, 'Speak now'",
    "I said, 'Oh my, what a marvelous tune'",
    "It was the best night, never would forget how we moved.",
    "The whole place was dressed to the nines and we were dancing, dancing like we're made of starlight",
    "I met Bobby on the boardwalk summer of '45",
    "Picked me up late one night out the window we were seventeen and crazy running wild, wild.",
    "Can't remember what song he was playing when we walked in.",
    "The night we snuck into a yacht club party pretending to be a duchess and a prince.",
    "He said, 'Look at you, worrying so much about things you can't change You'll spend your whole life singing the blues If you keep thinking that way'",
    "He was tryna to skip rocks on the ocean saying to me 'Don't you see the starlight, starlight don't you dream impossible things'",
    "Ooh, ooh he's talking crazy Ooh, ooh dancing with me Ooh, ooh we could get married Have ten kids and teach 'em how to dream",
    "The way you move is like a full on rainstorm.",
    "And I'm a house of cards",
    "You're the kind of reckless that should send me running but I kinda know that I won't get far",
    "And you stood there in front of me just close enough to touch",
    "Close enough to hope you couldn't see what I was thinking of",
    "Drop everything now",
    "Meet me in the pouring rain",
    "Kiss me on the sidewalk",
    "Take away the pain",
    "'Cause I see sparks fly, whenever you smile",
    "Get me with those green eyes, baby as the lights go down",
    "Gimme something that'll haunt me when you're not around",
    "My mind forgets to remind me you're a bad idea",
    "You touch me once and it's really something you find I'm even better than you imagined I would be",
    "I'm on my guard for the rest of the world but with you, I know it's no good"
    "And I could wait patiently but I really wish you would"
    "I run my fingers through your hair and watch the lights go wild",
    "Just keep on keeping your eyes on me it's just wrong enough to make it feel right",
    "And lead me up the staircase won't you whisper soft and slow, I'm captivated by you, baby like a fireworks show",
    "You and I walk a fragile line I have known it all this time, But I never thought I'd live to see it break",
    "It's getting dark and it's all too quiet And I can't trust anything now And it's coming over you like it's all a big mistake",
    "Oh, I'm holding my breath Won't lose you again",
    "Something's made your eyes go cold",
    "Come on, come on, don't leave me like this I thought I had you figured out",
    "Something's gone terribly wrong you're all I wanted",
    "Can't breathe whenever you're gone can't turn back now, I'm haunted",
    "I just know You're not gone, you can't be gone, no",
]

In [7]:
len(sentences)


100

In [8]:
# Define the fields for the collection schema.
# The schema includes an 'id' field which is a primary key of type INT64 and auto-generated,
# and an 'embedding' field which is a FLOAT_VECTOR with the specified dimensionality.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION),
]

# Create a CollectionSchema with the defined fields.
# The enable_dynamic_field=True parameter allows adding new fields dynamically if needed.
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)

# Create the base collection in Milvus with the specified schema.
# This collection will store the embeddings generated by the base model.
collection_v12 = Collection(name=COLLECTION_V12, schema=schema)

# Create another collection in Milvus using the same schema.
# This collection is intended to store the quantized embeddings or embeddings from the fine-tuned model.
collection_v12_ft5 = Collection(name=COLLECTION_V12_Q, schema=schema)


In [22]:
# Initialize dictionaries to store embeddings for each sentence.
# v12_embeds: Stores embeddings generated by the base model (v12).
# v12_q_embeds: Stores embeddings generated by the fine-tuned model (ft5_v12).
v12_embeds = {}
v12_q_embeds = {}

# Loop through each sentence in the 'sentences' list to generate and store embeddings.
for sentence in sentences:
    # Encode the sentence using the base model (v12) and store the result in v12_embeds.
    v12_embeds[sentence] = v12.encode(sentence)

    # Encode the sentence using the fine-tuned model (ft5_v12) and store the result in v12_q_embeds.
    v12_q_embeds[sentence] = ft5_v12.encode(sentence)


In [23]:
# Define the parameters for creating the index.
# index_type: The type of index to create. "IVF_FLAT" is a common index type for vector similarity search.
# metric_type: The metric used to measure similarity. "L2" refers to Euclidean distance.
# params: Additional parameters for the index. "nlist" specifies the number of clusters (or inverted lists) to use.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 4},
}

# Create an index on the "embedding" field of the base collection (collection_v12) using the specified index parameters.
# Indexing improves the efficiency of similarity searches by organizing the data in a way that allows for faster retrieval.
collection_v12.create_index(field_name="embedding", index_params=index_params)

# Load the base collection (collection_v12) into memory, making it ready for search operations.
collection_v12.load()

# Create an index on the "embedding" field of the quantized collection (collection_v12_ft5) using the same index parameters.
# This prepares the quantized collection for efficient similarity searches as well.
collection_v12_ft5.create_index(field_name="embedding", index_params=index_params)

# Load the quantized collection (collection_v12_ft5) into memory, making it ready for search operations.
collection_v12_ft5.load()


In [24]:
# Loop through each sentence in the 'sentences' list to prepare and insert embeddings into the collections.
for sentence in sentences:
    # Prepare the data to insert into the base collection (collection_v12).
    # Each entry includes the original sentence and its corresponding embedding generated by the base model (v12).
    v12_insert = [
        {
            "sentence": sentence,
            "embedding": v12_embeds[sentence]
        }
    ]

    # Prepare the data to insert into the quantized collection (collection_v12_ft5).
    # Each entry includes the original sentence and its corresponding embedding generated by the fine-tuned model (ft5_v12).
    ft_insert = [
        {
            "sentence": sentence,
            "embedding": v12_q_embeds[sentence]
        }
    ]

    # Insert the prepared data into the base collection (collection_v12).
    collection_v12.insert(v12_insert)

    # Insert the prepared data into the quantized collection (collection_v12_ft5).
    collection_v12_ft5.insert(ft_insert)

# Flush the base collection to ensure all inserted data is persisted and made available for querying.
collection_v12.flush()

# Flush the quantized collection to ensure all inserted data is persisted and made available for querying.
collection_v12_ft5.flush()


In [25]:
# Initialize dictionaries and lists to store search embeddings and data.
# search_embeds: A dictionary to store the embeddings for the sentences to be searched.
# search_data: A list to collect the embeddings to be used in the search query.
search_embeds = {}
search_data = []

# Loop through a subset of sentences (from index 5 to 6) to generate and store embeddings for search.
for sentence in sentences[5:7]:
    # Encode the sentence using the fine-tuned model (ft3_v12) to generate its vector embedding.
    vector_embedding = ft3_v12.encode(sentence)

    # Store the generated embedding in the search_embeds dictionary with the sentence as the key.
    search_embeds[sentence] = vector_embedding

    # Append the generated embedding to the search_data list for use in the search query.
    search_data.append(vector_embedding)


In [26]:
# Measure the time taken for searching in the base collection (collection_v12).
start1 = time()  # Start the timer

# Perform a search in the base collection (collection_v12).
# data: The embeddings to search for, stored in search_data.
# anns_field: The field to search across, which is "embedding".
# param: Search parameters including the metric type ("L2" for Euclidean distance) and nprobe (number of probes).
# limit: The number of top results to return per search query.
# output_fields: Fields to include in the search results, here we want to retrieve the original sentence.
res_v12 = collection_v12.search(
    data=search_data,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 2}},
    limit=3,
    output_fields=["sentence"]
)

# Calculate and print the time taken for the first search.
time1 = time() - start1
print(f"Time for first search: {time1}")

# Measure the time taken for searching in the quantized collection (collection_v12_ft5).
start2 = time()  # Start the timer

# Perform a search in the quantized collection (collection_v12_ft5).
# The parameters are the same as the previous search.
res_v12_ft5 = collection_v12_ft5.search(
    data=search_data,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 2}},
    limit=3,
    output_fields=["sentence"]
)

# Calculate and print the time taken for the second search.
time2 = time() - start2
print(f"Time for second search: {time2}")


Time for first search: 0.012391805648803711
Time for second search: 0.009213924407958984


In [27]:
# Iterate over the search results from the base collection (res_v12).
# i: Index of the current query in the search results.
# hits: List of search results (nearest neighbors) for the current query.
for i, hits in enumerate(res_v12):
    # Iterate over each hit (nearest neighbor) in the search results for the current query.
    for j, hit in enumerate(hits):
        # Print the original query sentence from the 'sentences' list using the index 'i'.
        print(f"Query sentence: {sentences[i]}")

        # Print the nearest neighbor's sentence and the distance to the query sentence.
        # hit.entity.get('sentence') retrieves the sentence of the nearest neighbor.
        # hit.distance provides the distance between the query sentence and the nearest neighbor.
        print(f"Nearest Neighbor Number {j}: {hit.entity.get('sentence')} ---- {hit.distance}\n")


Query sentence: [First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]
Nearest Neighbor Number 0:  We should head back to the wall. ---- 17.235008239746094

Query sentence: [First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]
Nearest Neighbor Number 1:  We should head back to the wall. ---- 17.235008239746094

Query sentence: [First scene opens with three Rangers riding through a tunne

In [28]:
# Iterate over the search results from the quantized collection (res_v12_ft5).
# i: Index of the current query in the search results.
# hits: List of search results (nearest neighbors) for the current query.
for i, hits in enumerate(res_v12_ft5):
    # Iterate over each hit (nearest neighbor) in the search results for the current query.
    for j, hit in enumerate(hits):
        # Print the original query sentence from the 'sentences' list using the index 'i'.
        print(f"Query sentence: {sentences[i]}")

        # Print the nearest neighbor's sentence and the distance to the query sentence.
        # hit.entity.get('sentence') retrieves the sentence of the nearest neighbor.
        # hit.distance provides the distance between the query sentence and the nearest neighbor.
        print(f"Nearest Neighbor Number {j}: {hit.entity.get('sentence')} ---- {hit.distance}\n")


Query sentence: [First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]
Nearest Neighbor Number 0:  Right. Give it here. ---- 17.779422760009766

Query sentence: [First scene opens with three Rangers riding through a tunnel, leaving the Wall, and going into the woods. (Eerie music in background) One Ranger splits off and finds a campsite full of mutilated bodies, including a child hanging from a tree branch. A birds-eye view shows the bodies arranged in a shield-like pattern. The Ranger rides back to the other two.]
Nearest Neighbor Number 1:  Right. Give it here. ---- 17.779422760009766

Query sentence: [First scene opens with three Rangers riding through a tunnel, leaving the Wall, and