![Top <](./images/watsonxdata.png "watsonxdata")

# Lab 1: Getting familiar with Milvus

This notebook demonstrates how to connect to a Milvus service from a Jupyter notebook. It then shows how to create a simple collection with an index.

The first step is to make sure that the Milvus client library (`pymilvus`) is installed in the notebook environment.

In [None]:
!pip install pymilvus==2.6.2

## Local Connection

A local connection assumes that you are running your Jupyter notebook inside the same server that is running watsonx.data and the Milvus server. The connection user is the default watsonx.data userid (ibmlhadmin). You need to generate the certificate that will be used by the connection.

### Generate the Connection Certificate

In [None]:
!rm -f /tmp/presto.crt
!echo QUIT | openssl s_client -showcerts -connect localhost:8443 | awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > /tmp/presto.crt

In [None]:
rc = %system echo QUIT | openssl s_client -showcerts -connect watsonxdata:8443 | \
        awk '/-----BEGIN CERTIFICATE-----/ {p=1}; p; /-----END CERTIFICATE-----/ {p=0}' > /tmp/presto.crt 
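
Before using the certificate file, it can help to confirm that the `openssl` pipeline actually captured at least one certificate. The following is a minimal sketch using only the Python standard library; the helper name `count_pem_certificates` and the shortened sample text are hypothetical and not part of the original lab.

```python
import re

# Matches one complete PEM certificate block, including its delimiters.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s.*?-----END CERTIFICATE-----",
    re.DOTALL,
)

def count_pem_certificates(pem_text):
    """Return the number of complete PEM certificate blocks in pem_text."""
    return len(PEM_BLOCK.findall(pem_text))

# Hypothetical, shortened sample; a real file holds Base64-encoded certificate data.
sample = (
    "-----BEGIN CERTIFICATE-----\n"
    "MIIBszCCAVmgAwIBAgIUexample\n"
    "-----END CERTIFICATE-----\n"
)

print(count_pem_certificates(sample))   # one complete block
print(count_pem_certificates(""))       # an empty file contains none
```

In the notebook, passing the contents of `/tmp/presto.crt` to this helper should give a count of at least 1; a count of 0 suggests that the `openssl` command could not reach the server.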

### Local Connection Parameters

In [None]:
host            = 'watsonxdata'
port            = 19530
apiuser         = 'xxxxxxxxxx'
apikey          = 'xxxxxxxx'
server_pem_path = '/tmp/presto.crt'

## Milvus Connection

In [None]:
from pymilvus import(
    IndexType,
    Status,
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

connections.connect(alias='default',
                   host=host,
                   port=port,
                   user=apiuser,
                   password=apikey,
                   server_pem_path=server_pem_path,
                   server_name='watsonxdata',
                   secure=True)

### Check Connection Status

In [None]:
print("\nList connections:")
print(connections.list_connections())

## Create a Collection in Milvus
The following cells drop the `wiki_articles` collection if it exists and then recreate it. When they succeed, you should see the following output.
```
Status(code=0, message=)
```

#### Make various utility commands available

In [None]:
from pymilvus import utility

#### Clean up previous collection if one already exists

In [None]:
utility.drop_collection("wiki_articles")

#### Create a sample collection

In [None]:
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True), # Primary key
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=2500,),
    FieldSchema(name="article_title", dtype=DataType.VARCHAR, max_length=200,),
    FieldSchema(name="article_subtopic", dtype=DataType.VARCHAR, max_length=10,),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),
]

schema = CollectionSchema(fields, "wikipedia article collection schema")

wiki_collection = Collection("wiki_articles", schema)

#### Create an index for this collection

- metric_type specifies the distance metric used in the vector space. L2 is the Euclidean distance.
- index_type specifies the type of vector index to use. IVF stands for inverted file index, which means clustering the vector space and representing each cluster by its centroid. FLAT means that vectors are stored directly without any compression or quantization, so precise distance calculations are possible.
- params specifies several parameters relevant for our index. For instance, nlist defines the number of clusters to use for the inverted file index.
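
The ideas behind these parameters can be sketched in a few lines of plain Python. This is a toy illustration, not Milvus code: the 2-D vectors and the hand-picked centroids are assumptions made for the example, whereas a real IVF index learns its `nlist` centroids by clustering the data.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(a, b)

def nearest(point, candidates):
    """Index of the candidate closest to point under the L2 metric."""
    return min(range(len(candidates)), key=lambda i: l2(point, candidates[i]))

# Toy data: four 2-D vectors and two hand-picked centroids (nlist = 2).
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

# Build the inverted file: centroid index -> ids of the vectors in that cluster.
inverted_file = {i: [] for i in range(len(centroids))}
for vec_id, vec in enumerate(vectors):
    inverted_file[nearest(vec, centroids)].append(vec_id)

def search(query):
    """Probe only the closest cluster, then compare exact distances (the FLAT part)."""
    cluster = nearest(query, centroids)
    return min(inverted_file[cluster], key=lambda i: l2(query, vectors[i]))

print(inverted_file)         # {0: [0, 1], 1: [2, 3]}
print(search((4.9, 5.0)))    # 2
```

The search only computes distances to the vectors in one cluster instead of all vectors, which is where the speed-up of an inverted file index comes from. A larger `nlist` gives smaller clusters and therefore fewer comparisons per probed cluster.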

In [None]:
index_params = {
        'metric_type':'L2',
        'index_type':"IVF_FLAT",
        'params':{"nlist":2048}
}

wiki_collection.create_index(field_name="vector", index_params=index_params)

#### Double-check that the collection exists

In [None]:
from pymilvus import utility
utility.list_collections()

## Get data from Wikipedia for loading into our collection

In [None]:
import wikipedia
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# search
search_results = wikipedia.search("Climate")

articles = []
for i in range (0,len(search_results)):
    try:
        summary = wikipedia.summary(search_results[i],auto_suggest=False)
    except Exception as err:
        # Report the actual error; ambiguity is only one possible cause
        print(f"Skipped article '{search_results[i]}': {err}")
        continue
    try:
        page = wikipedia.page(search_results[i],auto_suggest=False).content
    except Exception as err:
        # Report the actual error; ambiguity is only one possible cause
        print(f"Skipped article '{search_results[i]}': {err}")
        continue

    
    articles.append({
        "title"   : search_results[i],
        "summary" : summary,
        "page"    : page
    })

df = pd.DataFrame.from_dict(articles)
df.style.set_properties(**{'text-align': 'left'})
print(df)

In [None]:
print(articles)

## Split Articles into chunks

### Define function for splitting article into chunks

In [None]:
# Chunk data
def split_into_chunks(text, chunk_size):
    words = text.split()
    #print('text:',text)
    #print('words:',words)
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
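
To see what the function produces, here is a small worked example. The function is repeated so that the snippet runs on its own; the sample sentence and the chunk size of 4 are chosen purely for illustration.

```python
def split_into_chunks(text, chunk_size):
    # Split on any whitespace, then regroup the words chunk_size at a time.
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

sample = "Climate is the long-term weather pattern in a region"
chunks = split_into_chunks(sample, 4)
print(chunks)
# ['Climate is the long-term', 'weather pattern in a', 'region']
```

The last chunk is shorter when the word count is not a multiple of `chunk_size`. Because the text is split on whitespace and rejoined with single spaces, line breaks and repeated spaces in the article are not preserved.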

### Create a list of chunks for all articles and parallel lists for the metadata corresponding to each chunk (title, subtopic)

In [None]:
chunk_size=255
passages=[]
passages_titles=[]
passages_subtopic=[]

for a in articles:
    print('title',a['title'])
    if a['title'] == "Climate":
        subtopic="false"
    else:
        subtopic="true"

    p = a['page']
    cl = split_into_chunks(p,chunk_size)

    print("number of chunks=",len(cl))
    for c in cl:
        passages.append(c)
        passages_titles.append(a['title'])
        passages_subtopic.append(subtopic)
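
The chunks are cut by word count, while the `article_text` field in the schema is limited by `max_length=2500`. The following is a hedged sketch for spotting chunks that might not fit. The helper `find_oversized` is hypothetical, and it assumes that the limit applies to the UTF-8 byte length of the string, which is worth confirming against the Milvus documentation for your version.

```python
def find_oversized(texts, max_bytes):
    """Return (index, byte_length) for every text whose UTF-8 size exceeds max_bytes."""
    oversized = []
    for i, text in enumerate(texts):
        size = len(text.encode("utf-8"))
        if size > max_bytes:
            oversized.append((i, size))
    return oversized

# Illustrative data: a short chunk, one ASCII chunk that is one byte too long,
# and one chunk of 1300 two-byte characters (2600 bytes in UTF-8).
sample_passages = ["short chunk", "x" * 2501, "\u00e9" * 1300]
print(find_oversized(sample_passages, 2500))
# [(1, 2501), (2, 2600)]
```

In the notebook, calling `find_oversized(passages, 2500)` before the insert step would list any chunks that need a smaller `chunk_size` or truncation.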

### Create the embeddings for the chunks

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passages_embeddings = model.encode(passages)


### Insert all data into the collection created above

In [None]:
basic_collection = Collection("wiki_articles") 
data = [
    passages,
    passages_titles,
    passages_subtopic,
    passages_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence
print("Done")
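
The insert cell passes the data as a list of columns, so the order of the lists has to follow the order of the fields in the schema, with the auto-generated `id` left out. A row-oriented layout keyed by field name is less sensitive to ordering mistakes. The sketch below builds such rows in plain Python; the helper `build_rows` is hypothetical, and passing the result to `insert` assumes that your pymilvus version accepts a list of dictionaries.

```python
def build_rows(texts, titles, subtopics, vectors):
    """Combine four parallel lists into one dict per entity, keyed by field name."""
    if not (len(texts) == len(titles) == len(subtopics) == len(vectors)):
        raise ValueError("all input lists must have the same length")
    return [
        {
            "article_text": text,
            "article_title": title,
            "article_subtopic": subtopic,
            "vector": vector,
        }
        for text, title, subtopic, vector in zip(texts, titles, subtopics, vectors)
    ]

# Illustrative data with the 384-dimensional vector size used in the schema.
rows = build_rows(
    ["chunk one", "chunk two"],
    ["Climate", "Climate"],
    ["false", "false"],
    [[0.1] * 384, [0.2] * 384],
)
print(len(rows), sorted(rows[0]))

# Assumed usage in the notebook (requires the Milvus connection from above):
# basic_collection.insert(build_rows(passages, passages_titles,
#                                    passages_subtopic, passages_embeddings))
```

The length check also catches the case where the parallel lists have drifted out of step, which would otherwise attach a title or vector to the wrong chunk.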

## Get information about the Milvus system

The following commands show how you can get information about your Milvus system.

#### Server Version

In [None]:
res = utility.get_server_version()
print(res)

#### Server Type

In [None]:
res = utility.get_server_type()
print(res)

#### Check the connections

In [None]:
connections.list_connections()

## Get information about the collection we created

Milvus offers several options to get more information about collections. In the following cells we explore a few of them.

#### Get the name of the collection

In [None]:
wiki_collection.name

#### Get the description of the collection

In [None]:
res = wiki_collection.description
print(res)

#### Get the schema of the collection

The following command returns the schema of the collection. In a notebook it is displayed as a JSON-like description of the fields.

In [None]:
wiki_collection.schema

#### Check if the collection contains any data

In [None]:
wiki_collection.is_empty

#### How many entities (rows) are in the collection

In [None]:
wiki_collection.num_entities

#### What is the primary field of the collection

Like a relational table, a collection can have a primary field. The following command shows which field that is.

In [None]:
wiki_collection.primary_field

#### Get all the partitions of this collection

A collection can consist of several partitions. In this lab we will only use one partition per collection.

In [None]:
wiki_collection.partitions

#### Get the indexes of this collection

Each vector field in a collection should have an index. The following command lists the indexes of a collection.

In [None]:
wiki_collection.indexes

#### Get the names of the indexes of this collection

In [None]:
utility.list_indexes(collection_name="wiki_articles")

#### Get information about the replicas of this collection

A collection can have replicas. Replicas are not explored further in this lab.

In [None]:
wiki_collection.get_replicas()

#### Get a JSON document with the metadata about this collection

In [None]:
wiki_collection.describe()

#### Credits: IBM 2025, Wilfried Hoge [hoge@de.ibm.com] and Andreas Weininger [andreas.weininger@de.ibm.com] based on a notebook by George Baklarz [baklarz@ca.ibm.com]