# Lesson 28: Build the Vector Database with Milvus for Data

## Introduction (2 minutes)

Welcome to our lesson on building a vector database with Milvus. In this 37-minute session, we'll set up Milvus, create a collection, insert data, build an index for efficient similarity search, and run a sample query against it for our RAG system.

## Lesson Objectives

By the end of this lesson, you will be able to:
1. Set up and connect to a Milvus instance
2. Create a collection in Milvus for storing document embeddings
3. Insert vector data into Milvus
4. Build an index for efficient similarity search

## 1. Setting up Milvus (5 minutes)

First, let's set up Milvus using Docker and connect to it using the Python SDK:

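```bash
# Pull and run Milvus using Docker
# Note: recent Milvus 2.x releases may need extra configuration (for example
# the official standalone script or Docker Compose file); check the Milvus
# installation guide if the container exits right after starting.
docker pull milvusdb/milvus:latest
docker run -d --name milvus_standalone -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest
```
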
Now, let's install the Milvus Python SDK and connect to our Milvus instance:

```bash
pip install pymilvus
```

```python
from pymilvus import connections, utility

def connect_to_milvus():
    connections.connect("default", host="localhost", port="19530")
    print("Connected to Milvus")

connect_to_milvus()
```
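
A freshly started container can take a while before it accepts connections, so the connect call may fail if it runs too early. The following is a minimal sketch of a readiness check that uses only the standard library. The helper name `wait_for_port` and its defaults are hypothetical, and an open port only shows that something is listening, not that Milvus has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for this lesson; it is not part of PyMilvus.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

One way to use it would be to call `wait_for_port("localhost", 19530)` and run `connect_to_milvus()` only if it returns `True`.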

## 2. Creating a Collection (8 minutes)

Let's create a collection to store our document embeddings:

```python
from pymilvus import Collection, FieldSchema, CollectionSchema, DataType

def create_collection(collection_name, dim):
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535)
    ]
    schema = CollectionSchema(fields, "Document embeddings for RAG")
    collection = Collection(collection_name, schema)
    print(f"Collection '{collection_name}' created")
    return collection

# Usage
collection_name = "rag_documents"
dim = 384  # Dimension of your embeddings, adjust as needed
collection = create_collection(collection_name, dim)
```
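
The schema fixes the vector dimension and the maximum text length, so rows that do not match will be rejected at insert time. Here is a small sketch of a pre-insert check in plain Python. The function `find_schema_problems` is a hypothetical helper, and comparing `max_length` against the UTF-8 byte length is an assumption you should confirm against the Milvus documentation for your version.

```python
def find_schema_problems(embeddings, texts, dim, max_length=65535):
    """Return a list of readable problems; an empty list means the rows look valid.

    Hypothetical helper for this lesson; it is not part of PyMilvus.
    """
    problems = []
    if len(embeddings) != len(texts):
        problems.append(f"{len(embeddings)} embeddings but {len(texts)} texts")
    for i, (embedding, text) in enumerate(zip(embeddings, texts)):
        if len(embedding) != dim:
            problems.append(f"row {i}: embedding has {len(embedding)} values, expected {dim}")
        # Assumption: the VARCHAR limit is compared against the UTF-8 byte length.
        if len(text.encode("utf-8")) > max_length:
            problems.append(f"row {i}: text exceeds {max_length} bytes")
    return problems


# A 3-value vector does not fit a 4-dimensional field.
print(find_schema_problems([[0.1, 0.2, 0.3]], ["short text"], dim=4))
```

Running a check like this before inserting makes dimension mismatches easier to spot, for example after switching to a different embedding model.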

## 3. Inserting Data (10 minutes)

Now, let's insert some document embeddings into our collection:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

def generate_embeddings(texts, model_name='all-MiniLM-L6-v2'):
    # all-MiniLM-L6-v2 produces 384-dimensional vectors, matching dim above
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts)
    return embeddings

def insert_data(collection, texts, embeddings):
    # One dict per row; the auto-generated "id" field is omitted.
    # Row-based inserts need a recent PyMilvus release; older versions may
    # expect column-based lists instead.
    entities = [
        {"embedding": embedding.tolist(), "text": text}
        for embedding, text in zip(embeddings, texts)
    ]
    insert_result = collection.insert(entities)
    collection.flush()  # Persist the inserted rows
    print(f"Inserted {insert_result.insert_count} entities")

# Usage
texts = [
    "Milvus is a vector database for similarity search.",
    "RAG systems combine retrieval and generation.",
    "Vector databases are crucial for efficient embedding storage.",
]
embeddings = generate_embeddings(texts)
insert_data(collection, texts, embeddings)
```
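
The example inserts three texts in a single call. For a larger corpus, it is usually safer to embed and insert in chunks so that no single request grows too large. The sketch below shows one way to split a list into batches. The helper `batched` and the batch size are illustrative assumptions, not Milvus requirements.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items.

    Hypothetical helper for this lesson; it is not part of PyMilvus.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
for chunk in batched(docs, 2):
    print(len(chunk), chunk)
```

Each chunk could then be passed to `generate_embeddings` and `insert_data` in turn. Since `generate_embeddings` loads the model on every call, you may want to load the model once and reuse it when processing many batches.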

## 4. Building an Index (5 minutes)

To enable efficient similarity search, let's build an index on our collection:

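```python
def create_index(collection):
    index_params = {
        "metric_type": "L2",       # Rank by Euclidean distance
        "index_type": "IVF_FLAT",  # Partition vectors into clusters
        # Number of clusters; 1024 suits a large corpus and is far more than
        # the three demo rows need
        "params": {"nlist": 1024}
    }
    collection.create_index("embedding", index_params)
    print("Index created")

# Usage
create_index(collection)
```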
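
To build some intuition for `nlist` (set when building the index) and `nprobe` (set when searching), here is a toy inverted-file index in plain Python. It is only an illustration of the idea, not how Milvus implements IVF_FLAT. As a simplification, the first `nlist` vectors stand in for centroids, whereas a real index learns them by clustering. All function names are hypothetical.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_toy_ivf(vectors, nlist):
    """Group vector ids into nlist buckets by nearest centroid.

    Simplification: the first nlist vectors serve as centroids.
    """
    centroids = [list(v) for v in vectors[:nlist]]
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def search_toy_ivf(vectors, centroids, buckets, query, top_k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [idx for c in probe_order[:nprobe] for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: l2(query, vectors[idx]))
    return ranked[:top_k]


vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.2], [4.9, 5.2]]
centroids, buckets = build_toy_ivf(vectors, nlist=3)
query = [0.06, 0.0]

# Probing one bucket is fastest but can miss neighbors stored in other buckets.
approx = search_toy_ivf(vectors, centroids, buckets, query, top_k=2, nprobe=1)
# Probing every bucket is equivalent to an exhaustive scan.
exact = search_toy_ivf(vectors, centroids, buckets, query, top_k=2, nprobe=3)
print("nprobe=1:", approx)
print("nprobe=3:", exact)
```

With `nprobe=1` the search scans a single bucket and returns fewer neighbors than the exhaustive scan, which is the speed-versus-recall trade-off these two parameters control.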

## Demonstration: Performing a Search (5 minutes)

Let's perform a simple search to demonstrate the functionality of our vector database:

```python
def search_similar_documents(collection, query_text, top_k=3):
    query_embedding = generate_embeddings([query_text])[0]
    # nprobe: number of index clusters to scan per query
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = collection.search(
        data=[query_embedding.tolist()],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=["text"]
    )
    return results

# Usage
collection.load()  # A collection must be loaded into memory before searching
query = "How does RAG use vector databases?"
results = search_similar_documents(collection, query)

print(f"Query: {query}")
print("Results:")
for i, hit in enumerate(results[0]):
    print(f"{i+1}. {hit.entity.get('text')} (Distance: {hit.distance})")
```
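
The search ranks results by L2 distance, where smaller means more similar. For unit-length vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so L2 and cosine similarity produce the same ranking. The sketch below checks this identity on two hand-made vectors. Whether your embedding model returns unit-length vectors is an assumption to verify for the model you use.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# For unit vectors, these two values agree up to floating-point error.
print(squared_l2(a, b), 2 - 2 * cosine(a, b))
```

If your vectors are not normalized, the two metrics can rank results differently, so the metric chosen for the index should match how the embedding model is meant to be compared.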

## Conclusion and Next Steps (2 minutes)

In this lesson, we've set up Milvus, created a collection for our document embeddings, inserted data, and built an index for efficient similarity search. This vector database will serve as a crucial component in our RAG system, enabling fast and accurate retrieval of relevant documents.

In our next lesson, we'll focus on integrating this vector database with our RAG pipeline and implementing the retrieval mechanism.

Are there any questions about building the vector database with Milvus?

## Additional Resources

1. Milvus documentation: https://milvus.io/docs
2. PyMilvus API Reference: https://milvus.io/api-reference/pymilvus/v2.2.6/About.md
3. "Vector Similarity Search: From Basics to Production" article: https://milvus.io/blog/vector-similarity-search-from-basics-to-production.md

For the next lesson, please ensure your Milvus instance is running and you have the PyMilvus library installed.