<a href="https://colab.research.google.com/github/StrategicalIT/PiedPiperAIv2/blob/main/Lab05B-Milvus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB 5B: Exploring Milvus Vector Database
In this lab we are going to use Python to get familiar with vector databases, and with Milvus in particular.

The most basic operations in a vector database are adding embeddings and querying the database to find the stored embeddings most similar to a given one. It is also important to configure an index, which speeds up queries. Milvus provides 12 different index types, each suited to different use cases.

## Install dependencies

The first step is to install the necessary libraries. In this case we will install the Milvus Lite Python library. Milvus Lite runs locally as part of our own Python process, with no separate server to manage. This is perfect for this tutorial, but for production environments consider using either the standalone or the cluster version. The good thing is that the skills you learn with Milvus Lite carry over to the other versions.

In [None]:
!pip install pymilvus[milvus-lite]

Embeddings are stored in vector databases like Milvus alongside the chunks they represent, but they are generated outside the vector database using an embedding model. Luckily, PyMilvus offers an optional "model" package that wraps embedding models, which we install next.

In [None]:
!pip install "pymilvus[model]"

## Connect to Milvus

To create a local Milvus vector database, simply instantiate a MilvusClient and give it a file name in which to store all the data. After you run the cell below you should find a file called "milvus.db". You can see it by clicking on the "folder" icon in the left menu. This file stores the index, the embeddings and the chunks of documents that the embeddings represent.

In [None]:
from pymilvus import MilvusClient
client = MilvusClient("./milvus.db")

## Create a collection and load documents

First, you have to create a collection, which is similar to a table in a relational database or to a namespace in other products.

The "create_collection" method below doesn't complain if the collection "my_collection" already exists. It will try to load it if it exists; otherwise it will create it.

Notice also how we have defined the number of dimensions, i.e. the size of the embeddings we will store. This needs to match the size the embedding model produces.


In [None]:
client.create_collection(
  collection_name="my_collection",
  dimension=384,  # The vectors we will use in this demo have 384 dimensions
  )
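
Because the collection's "dimension" must equal the length of every vector we insert, a cheap guard before loading data can catch a mismatch early. The following is a minimal sketch in plain Python. `check_dimensions` is a hypothetical helper, not part of PyMilvus, and the short vectors are made-up stand-ins for real 384-dimensional embeddings.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )
    return True


# Toy 3-dimensional vectors standing in for real embeddings.
good = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
bad = [[0.1, 0.2, 0.3], [0.4, 0.5]]

print(check_dimensions(good, expected_dim=3))
try:
    check_dimensions(bad, expected_dim=3)
except ValueError as err:
    print("Rejected:", err)
```

In the lab itself the same idea would apply with `expected_dim=384`, the value passed to "create_collection".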

We can use "```list_collections()```" to verify that the collection was created.

In [None]:
print(client.list_collections())

Next we are going to create embeddings. The following code downloads the "all-MiniLM-L12-v2" model from HuggingFace. You can look it up there to see its properties, for example how many parameters it has.

NOTE: You might see a warning about downloading from HuggingFace without providing a token, but the model should still download. It is roughly 140MB in size, so it should take less than a minute.

In [None]:
from pymilvus import model

embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(
    model_name='all-MiniLM-L12-v2',
    device='cpu'
)

We will use these 4 simple documents for this lab.

In [None]:
docs = [
    "This is a document about pineapples",
    "This is a document about oranges",
    "This is a document about planes",
    "This is a document about cars"
]

Next, let's create embeddings for our 4 documents and build a list called "data" that we'll insert into the collection. For each document we store the text chunk, its embedding and a unique id that we assign sequentially: 0, 1, 2 ...

In [None]:
vectors = embedding_fn.encode_documents(docs)
print("Vector dimensions:", embedding_fn.dim)

data = []
for i in range(len(vectors)):
  data.append(
      {
          "id": i,
          "vector": vectors[i],
          "text": docs[i]
      }
  )

We can use the "pprint" function to show one of the elements in "data". This will show all 3 fields we provided, including the embedding with its 384 dimensions.

In [None]:
from pprint import pprint
pprint(data[0])
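
The loop that builds "data" can also be written as a list comprehension. This is only a sketch of an alternative style: the two short vectors and shortened texts below are made-up stand-ins so the snippet runs without the embedding model.

```python
# Stand-in inputs: tiny fake vectors instead of real 384-dimensional embeddings.
docs = ["doc about pineapples", "doc about oranges"]
vectors = [[0.1, 0.2], [0.3, 0.4]]

# enumerate() supplies the sequential id; zip() pairs each vector with its text.
data = [
    {"id": i, "vector": vector, "text": text}
    for i, (vector, text) in enumerate(zip(vectors, docs))
]

print(data[1])
```

Pairing with `zip` avoids indexing two lists by position, which makes it harder to mismatch a vector with the wrong text.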

Milvus supports multiple indexes for different fields in the same collection. The first command below lists all the indexes the collection has. Notice that there is a single index, named "vector".

The second command returns information about a given index, including interesting properties such as "index_type" and "metric_type". When we created our collection we skipped two important settings, the schema and the index, to keep this exercise simple. Milvus picked default settings for both, but in a production deployment it is better to define them explicitly. Bear in mind that Milvus Lite only supports the FLAT index type, so there is no point in defining an index here. Milvus also accepts an "index_type" of AUTOINDEX, which automatically selects a suitable indexing algorithm and tunes its parameters based on the field's data type and the data distribution.

Finally, notice that "metric_type" is set to "COSINE", which means searches use cosine similarity.

In [None]:
indexes = client.list_indexes(collection_name="my_collection")
print(indexes)

# Get index details
index_info = client.describe_index(
    collection_name="my_collection",
    index_name="vector"
)
pprint(index_info)
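
To make the "COSINE" metric concrete, here is a minimal pure-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. It illustrates the formula only and is not Milvus's implementation; `cosine_similarity` is a name chosen for this example and it assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: similarity -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(round(same_direction, 6), orthogonal, opposite)
```

Because only the direction matters, a long document and a short one about the same topic can still score as very similar.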

Now we can load "data" into Milvus. Notice that we need to name the target collection, as there could be several.

In [None]:
result = client.upsert(
    collection_name="my_collection",
    data=data
)
pprint(result)  # Show the result of the upsert operation

In the previous code we used "upsert", which is short for "update or insert". In other words, if a document with the same id already exists it is updated; otherwise it is created. If we had used "insert" instead and repeated the same command, Milvus would treat the rows as separate documents and add them again.

When you insert documents you need to make sure the id you provide is unique.
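
The difference between the two operations can be sketched with a toy in-memory model. This only mimics the behaviour described above; it is not how Milvus stores data, and the `insert` and `upsert` functions here are hypothetical stand-ins written for this example.

```python
def insert(rows, new_rows):
    """Append rows without checking whether an id is already present."""
    return rows + list(new_rows)


def upsert(rows, new_rows):
    """Replace rows that share an id with a new row, then add the rest."""
    by_id = {row["id"]: row for row in rows}
    for row in new_rows:
        by_id[row["id"]] = row
    return list(by_id.values())


batch = [{"id": 0, "text": "pineapples"}, {"id": 1, "text": "oranges"}]

inserted_twice = insert(insert([], batch), batch)
upserted_twice = upsert(upsert([], batch), batch)

print(len(inserted_twice))  # 4: the same two rows were added twice
print(len(upserted_twice))  # 2: the second call overwrote the first
```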

We can check how many documents, or rows, we have in the collection:

In [None]:
print(client.get_collection_stats(collection_name="my_collection"))

## Query the database

Now we use the "```search()```" method to perform a query. First we embed the query. Notice how we request the 2 best matches with "limit=2". We also return only the "text" field, because we don't want to show the embedding itself.

In [None]:
query = ["I need information about fruits"]
query_embedding = embedding_fn.encode_queries(query)

results = client.search(
    collection_name="my_collection",
    data=query_embedding,
    limit=2,
    output_fields=["text"],
)

It should have retrieved 2 documents from the database that are related to our query text. Let's see if the results make sense.

In [None]:
for r in results[0]:
    pprint(r)

The output should include the documents that are relevant to fruits. Try changing the query to other topics, like "transportation", and check what output you get.

Also, you can add more documents and repeat the queries.
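
Conceptually, a search against a FLAT index compares the query with every stored vector and keeps the best matches. The sketch below imitates that in plain Python with made-up 2-dimensional vectors; the `search` function is a toy written for this example, not the PyMilvus method, and the numbers are invented so that fruit-like documents point in a similar direction.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(rows, query_vector, limit):
    """Score every row against the query and return the `limit` best rows."""
    return heapq.nlargest(
        limit,
        rows,
        key=lambda row: cosine_similarity(row["vector"], query_vector),
    )


# Invented vectors: first axis loosely means "fruit", second "transport".
rows = [
    {"id": 0, "vector": [0.9, 0.1], "text": "pineapples"},
    {"id": 1, "vector": [0.8, 0.2], "text": "oranges"},
    {"id": 2, "vector": [0.1, 0.9], "text": "planes"},
    {"id": 3, "vector": [0.2, 0.8], "text": "cars"},
]

fruit_query = [1.0, 0.0]
top_two = search(rows, fruit_query, limit=2)
print([row["text"] for row in top_two])
```

Scanning every row is exact but grows linearly with the collection size, which is why larger deployments rely on approximate index types.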

## Working with metadata

Now we are going to explore how to leverage metadata to filter results. In Milvus we do this by storing an extra field with each record; because our collection was created with the default settings, no explicit schema change is needed. In this case we will add "climate" to help us filter results.

With the "upsert" method we can update existing documents and insert new ones in one go. Notice how we reuse ids 0 and 1 to update two documents that already exist and use ids 4 and 5 to create two new documents.



In [None]:
docs = [
    "This is a document about pineapples",
    "This is a document about oranges",
    "This is a document about coconuts",
    "This is a document about pears"]

ids = [0, 1, 4, 5]  # Update existing ids 0 and 1; insert new ids 4 and 5

climate = ["tropical", "mediterranean", "tropical", "mediterranean"]

vectors = embedding_fn.encode_documents(docs)

data = []
for i in range(len(vectors)):
  data.append(
      {
          "id": ids[i],
          "vector": vectors[i],
          "text": docs[i],
          "climate": climate[i]
      }
  )

Let's upsert the "data" list into "my_collection".

In [None]:
result = client.upsert(
    collection_name="my_collection",
    data=data
)
pprint(result)  # Show the result of the upsert operation
print(client.get_collection_stats(collection_name="my_collection"))

If we repeat the same query as before, we might get fruits irrespective of their climate.

In [None]:
results = client.search(
    collection_name="my_collection",
    data=embedding_fn.encode_queries(["I need information about fruits"]),
    limit=2,
    output_fields=["text", "climate"],
)

pprint(results)

However, we can use the "filter" parameter to retrieve only fruits from "tropical" climates.

In [None]:
results = client.search(
    collection_name="my_collection",
    data=embedding_fn.encode_queries(["I need information about fruits"]),
    filter="climate == 'tropical'",
    limit=2,
    output_fields=["text", "climate"],
)

pprint(results)
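
One way to picture a filtered search is as two steps: keep only the rows whose metadata matches, then rank the survivors by similarity. The toy below follows that picture with invented 2-dimensional vectors. It assumes the filter is applied before ranking, which is a simplification; `filtered_search` is a hypothetical function, and Milvus's actual execution strategy may differ.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def filtered_search(rows, query_vector, limit, predicate=None):
    """Drop rows that fail the predicate, then return the best matches."""
    candidates = [row for row in rows if predicate is None or predicate(row)]
    return heapq.nlargest(
        limit,
        candidates,
        key=lambda row: cosine_similarity(row["vector"], query_vector),
    )


# Invented vectors; rows without a "climate" key never match the filter.
rows = [
    {"text": "pineapples", "climate": "tropical", "vector": [0.9, 0.1]},
    {"text": "oranges", "climate": "mediterranean", "vector": [0.95, 0.05]},
    {"text": "coconuts", "climate": "tropical", "vector": [0.7, 0.3]},
    {"text": "pears", "climate": "mediterranean", "vector": [0.8, 0.2]},
    {"text": "planes", "vector": [0.1, 0.9]},
]

query = [1.0, 0.0]
unfiltered = filtered_search(rows, query, limit=2)
tropical = filtered_search(
    rows, query, limit=2, predicate=lambda row: row.get("climate") == "tropical"
)

print([row["text"] for row in unfiltered])
print([row["text"] for row in tropical])
```

In this toy data, "oranges" is the closest match overall but is excluded once the climate filter is applied, so "coconuts" takes its place.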

You can experiment further by adding more records and even additional fields, e.g.:

```"colour": "yellow", ...```
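
With more than one metadata field, filter conditions can be combined with "and". The helper below is a small sketch that builds such an expression string. `build_filter` is a hypothetical convenience function, not part of PyMilvus, and it assumes plain string values that contain no quote characters.

```python
def build_filter(**conditions):
    """Join field == 'value' clauses with 'and' into one filter string."""
    clauses = [f"{field} == '{value}'" for field, value in conditions.items()]
    return " and ".join(clauses)


expression = build_filter(climate="tropical", colour="yellow")
print(expression)  # climate == 'tropical' and colour == 'yellow'
```

The resulting string could then be passed as the "filter" argument of "```search()```", in the same way as the single-condition example earlier in this section.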

Can you think of how you would use metadata for a real-world use case at your business?

## Next Steps
Milvus provides other powerful search features like "full-text search" with the BM25 algorithm to perform keyword-based search on text data. It can also perform "hybrid search" by searching through multiple vector fields within a single collection to improve retrieval accuracy.


## End of Lab 5B