
# **Milvus tutorial**

This is the sample notebook from the Milvus site.

# **How to run**

This notebook can be run on Google Colab or in a standalone Python development environment.




# **References**

https://milvus.io/docs/quickstart.md



In [None]:
!pip3 install -U milvus pymilvus "pymilvus[model]"

Collecting milvus-model>=0.1.0 (from pymilvus)
  Downloading milvus_model-0.2.3-py3-none-any.whl (29 kB)
Collecting onnxruntime (from milvus-model>=0.1.0->pymilvus)
  Downloading onnxruntime-1.18.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)
Collecting coloredlogs (from onnxruntime->milvus-model>=0.1.0->pymilvus)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime->milvus-model>=0.1.0->pymilvus)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Installing collected packages: humanfriendly, coloredlogs, onnxruntime, milvus-mode

**NOTE:** It is a good idea to restart the session runtime and run the cells from the very beginning.

In [None]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")
print(" client: ", client)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 1ac84a51e1a847f98b265088cbf4930f


 client:  <pymilvus.milvus_client.milvus_client.MilvusClient object at 0x7d8205c96560>


In [None]:
if client.has_collection(collection_name="demo_collection"):
    client.drop_collection(collection_name="demo_collection")
client.create_collection(
    collection_name="demo_collection",
    dimension=768,  # The vectors we will use in this demo have 768 dimensions
)

DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: demo_collection
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: demo_collection


In [None]:
from pymilvus import model

# If the connection to https://huggingface.co/ fails, uncomment the following lines
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# This will download a small embedding model "paraphrase-albert-small-v2" (~50MB).
embedding_fn = model.DefaultEmbeddingFunction()

# Text strings to search from.
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = embedding_fn.encode_documents(docs)
# The output vector has 768 dimensions, matching the collection that we just created.
print("Dim:", embedding_fn.dim, vectors[0].shape)  # Dim: 768 (768,)

# Each entity has id, vector representation, raw text, and a subject label that we use
# to demo metadata filtering later.
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"}
    for i in range(len(vectors))
]

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dim: 768 (768,)
Data has 3 entities, each with fields:  dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim: 768


In [None]:
result = client.insert(collection_name="demo_collection", data=data)
print(" result: ", result)

 result:  {'insert_count': 3, 'ids': [0, 1, 2], 'cost': 0}


# **Vector search**

In [None]:
query_vectors = embedding_fn.encode_queries(["Who is Alan Turing?"])
# If you don't have the embedding function, you can use a random vector to finish the demo:
# import random
# query_vectors = [[random.uniform(-1, 1) for _ in range(768)]]

result = client.search(collection_name="demo_collection",  # target collection
                       data=query_vectors,  # query vectors
                       limit=2,  # number of returned entities
                       output_fields=["text", "subject"],  # specifies fields to be returned
                       )
print(" result: ", result)

 result:  data: ["[{'id': 2, 'distance': 0.5859944820404053, 'entity': {'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}}, {'id': 1, 'distance': 0.5118255019187927, 'entity': {'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}}]"] , extra_info: {'cost': 0}
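
The `distance` values in the output above are easier to read with the metric in mind. The following is a minimal sketch, assuming the collection uses cosine similarity (believed to be the default metric when `create_collection` is given only a name and a dimension, but check your pymilvus version), under which a higher score means a closer match. It uses hypothetical 3-dimensional toy vectors rather than real 768-dimensional embeddings, and `cosine_similarity` and `top_k` are illustrative helpers, not part of pymilvus.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, stored, k):
    """Brute-force search: score every stored vector, return the k best as (id, score)."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical toy vectors keyed by primary key (not real embeddings).
stored = {
    0: [1.0, 0.0, 0.0],
    1: [0.6, 0.8, 0.0],
    2: [0.0, 1.0, 0.0],
}
query = [0.0, 1.0, 0.0]

hits = top_k(query, stored, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 4))
```

Here the query points in the same direction as the vector stored under id 2, so that entity ranks first, followed by id 1. A real vector database typically uses an index to avoid scoring every stored vector, but the ranking idea is the same.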


# **Vector Search with Metadata Filtering**

In [None]:
# Insert more docs in another subject.
docs = [
    "Machine learning has been used for drug design.",
    "Computational synthesis with AI algorithms predicts molecular properties.",
    "DDR1 is involved in cancers and fibrosis.",
]
vectors = embedding_fn.encode_documents(docs)
data = [
    {"id": 3 + i, "vector": vectors[i], "text": docs[i], "subject": "biology"}
    for i in range(len(vectors))
]

client.insert(collection_name="demo_collection", data=data)

# This excludes any text in the "history" subject, even if it is close to the query vector.
result = client.search(collection_name="demo_collection",
                       data=embedding_fn.encode_queries(["tell me AI related information"]),
                       filter="subject == 'biology'",
                       limit=2,
                       output_fields=["text", "subject"])
print(" result: ", result)

 result:  data: ["[{'id': 4, 'distance': 0.27030572295188904, 'entity': {'text': 'Computational synthesis with AI algorithms predicts molecular properties.', 'subject': 'biology'}}, {'id': 3, 'distance': 0.1642588973045349, 'entity': {'text': 'Machine learning has been used for drug design.', 'subject': 'biology'}}]"] , extra_info: {'cost': 0}
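
Conceptually, a filtered search restricts the candidate set with a scalar condition and ranks only what remains. The sketch below illustrates that idea with hypothetical toy entities and a plain dot product as the score. Milvus's actual execution strategy may differ, and `filtered_search` is an illustrative helper, not a pymilvus function.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query, entities, subject, limit):
    """Keep entities whose subject matches, then rank the survivors by score."""
    candidates = [e for e in entities if e["subject"] == subject]
    ranked = sorted(candidates, key=lambda e: dot(query, e["vector"]), reverse=True)
    return [e["id"] for e in ranked[:limit]]


# Hypothetical toy entities: id 0 is the closest to the query overall,
# but it belongs to the "history" subject and is filtered out.
entities = [
    {"id": 0, "vector": [0.9, 0.1], "subject": "history"},
    {"id": 3, "vector": [0.2, 0.8], "subject": "biology"},
    {"id": 4, "vector": [0.7, 0.3], "subject": "biology"},
    {"id": 5, "vector": [0.1, 0.9], "subject": "biology"},
]
query = [1.0, 0.0]

result_ids = filtered_search(query, entities, subject="biology", limit=2)
print(result_ids)
```

Only the "biology" entities are scored, so the best overall match (id 0) never appears in the result, which mirrors the behaviour of the `filter` argument in the cell above.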


# **Query**

A query() is an operation that retrieves all entities matching some criteria, such as a filter expression or a list of ids.

For example, retrieving all entities whose scalar field has a particular value:

In [None]:
result = client.query(collection_name="demo_collection",
                      filter="subject == 'history'",
                      output_fields=["text", "subject"])
print(" result: ", result)

 result:  data: ["{'id': 0, 'text': 'Artificial intelligence was founded as an academic discipline in 1956.', 'subject': 'history'}", "{'id': 1, 'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}", "{'id': 2, 'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}"] , extra_info: {'cost': 0}


Directly retrieve entities by primary key:

In [None]:
# Assign to `result` so that the print below shows this query's output
# rather than the result of the previous cell.
result = client.query(collection_name="demo_collection",
                      ids=[0, 2],
                      output_fields=["vector", "text", "subject"])
print(" result: ", result)


# **Delete entities**

If you'd like to purge data, you can delete entities by primary key or delete all entities matching a particular filter expression.

In [None]:
# Delete entities by primary key
result = client.delete(collection_name="demo_collection", ids=[0, 2])
print(" result: ", result)

 result:  [0, 2]


In [None]:
# Delete entities by a filter expression
result = client.delete(collection_name="demo_collection",
                       filter="subject == 'biology'")
print(" result: ", result)

 result:  [3, 4, 5]
