## Load Vector Embeddings to Milvus

In this step we take the data we loaded into watsonx.data in the previous step and load it into the Milvus vector database. The data was already chunked and stored in a watsonx.data Hive table, so we read the text chunks from that table, vectorize them, and load the resulting embeddings into Milvus.

Before we can load the data, we need to create a collection in Milvus to hold it. We'll call this collection `wiki_articles`. It holds the vector embedding for each chunk of text, along with the original text and the article title. A second collection, `wiki_articles_de`, holds the German articles in the same way.

Let's get started!

#### Load credentials 


In [None]:
import os
from dotenv import load_dotenv
from ibm_cloud_sdk_core import IAMTokenManager
import warnings
warnings.filterwarnings('ignore')

load_dotenv('config.env')

# Connection variables
api_key = os.getenv("API_KEY", None)
ibm_cloud_url = os.getenv("IBM_CLOUD_URL", None) 
project_id = os.getenv("PROJECT_ID", None)

creds = {
    "url": ibm_cloud_url,
    "apikey": api_key 
}
access_token = IAMTokenManager(
    apikey = api_key,
    url = "https://iam.cloud.ibm.com/identity/token"
).get_token()
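
Because `os.getenv(name, None)` returns `None` when a variable is missing, a gap in `config.env` only shows up later, when a connection fails. As a minimal sketch, the check below fails fast instead. `require_env` is a hypothetical helper written for this illustration; it is not part of `python-dotenv` or the IBM SDK.

```python
import os

def require_env(*names):
    """Return the values of the named environment variables, in order.

    Raises RuntimeError listing every variable that is unset or empty.
    """
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]
```

After `load_dotenv('config.env')`, it could be called as `api_key, ibm_cloud_url, project_id = require_env("API_KEY", "IBM_CLOUD_URL", "PROJECT_ID")`.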

#### Create Lakehouse Connection

We will use this watsonx.data connection to load the Wikipedia articles.

In [None]:
import ssl
import urllib3
import os
from sqlalchemy import create_engine
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) # disable https warning

LH_HOST_NAME=os.getenv("LH_HOST_NAME", None)
LH_PORT=os.getenv("LH_PORT", None) 
LH_USER=os.getenv("LH_USER", None)
LH_PW=os.getenv("LH_PW", None)
LH_CATALOG='tpch'
LH_SCHEMA='tiny'

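# Dispose of any engine left over from a previous run of this cell
try:
    quick_engine.dispose()
except NameError:
    pass  # first run: no engine exists yet

# Print the connection string with the password masked
print(f"presto://{LH_USER}:***@{LH_HOST_NAME}:{LH_PORT}/{LH_CATALOG}/{LH_SCHEMA}")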

quick_engine = create_engine(
    f"presto://{LH_USER}:{LH_PW}@{LH_HOST_NAME}:{LH_PORT}/{LH_CATALOG}/{LH_SCHEMA}",
    connect_args={
        'protocol': 'https',
        # ssl.CERT_NONE is falsy, so TLS certificate verification is disabled
        'requests_kwargs': {'verify': ssl.CERT_NONE}
    }
)
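
Interpolating the password straight into the URL breaks if it contains characters such as `@`, `:` or `/`, and printing the full URL exposes the secret in the notebook output. The sketch below shows one way to handle both. `presto_url` is a hypothetical helper and the host, user and password values are placeholders, not values from this environment.

```python
from urllib.parse import quote_plus

def presto_url(user, password, host, port, catalog, schema, mask=False):
    """Build a presto:// URL with percent-encoded credentials.

    With mask=True the password is replaced by '***' so the URL is safe to print.
    """
    secret = "***" if mask else quote_plus(password)
    return f"presto://{quote_plus(user)}:{secret}@{host}:{port}/{catalog}/{schema}"

# Placeholder values for illustration only
print(presto_url("ibmlhadmin", "p@ss:word", "lh.example.com", 8443, "tpch", "tiny", mask=True))
# presto://ibmlhadmin:***@lh.example.com:8443/tpch/tiny
```

The unmasked form is the one to pass to `create_engine`; the masked form is only for display.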

#### Create Milvus Collection & Index

Creating a Milvus collection involves first connecting to the Milvus server, then creating a collection with a defined schema and index. 

In [None]:
import os

# Import only the names used below
from pymilvus import (
    connections,
    FieldSchema,
    DataType,
    Collection,
    CollectionSchema,
)

host = os.getenv("MILVUS_HOST", None)
port = os.getenv("MILVUS_PORT", None)
password = os.getenv("LH_PW", None)
user = os.getenv("LH_USER", None)
server_pem_path = os.getenv("LH_CERT", None)

connections.connect(alias = 'default',
                   host = host,
                   port = port,
                   user = user,
                   password = password,
                   server_pem_path = server_pem_path,
                   server_name = host,
                   secure = True)



In [None]:
# Create collection - define fields + schema

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),  # primary key, generated by Milvus
    FieldSchema(name="article_text", dtype=DataType.VARCHAR, max_length=2500),  # text chunk
    FieldSchema(name="article_title", dtype=DataType.VARCHAR, max_length=200),  # title of the source article
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=384),  # embedding of the text chunk
]

schema = CollectionSchema(fields, "wikipedia article collection schema")

wiki_collection = Collection("wiki_articles", schema)

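# Create an IVF_FLAT index on the vector field, using L2 (Euclidean) distance
index_params = {
    'metric_type': 'L2',
    'index_type': "IVF_FLAT",
    'params': {"nlist": 2048}  # number of clusters the vectors are partitioned into
}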

wiki_collection.create_index(field_name="vector", index_params=index_params)
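
The `nlist` parameter sets how many clusters IVF_FLAT partitions the vectors into. A commonly cited rule of thumb is roughly `4 * sqrt(n)` for `n` vectors. The sketch below applies that heuristic as an illustration only; it is not necessarily how the value 2048 above was chosen. `suggest_nlist` is a hypothetical helper, and the clamp range of 1 to 65536 is an assumption about the limits Milvus accepts.

```python
import math

def suggest_nlist(num_entities, lo=1, hi=65536):
    """Rule-of-thumb nlist: about 4 * sqrt(n), clamped to [lo, hi]."""
    return max(lo, min(hi, round(4 * math.sqrt(num_entities))))

for n in (10_000, 262_144, 1_000_000):
    print(n, suggest_nlist(n))
# 10000 400
# 262144 2048
# 1000000 4000
```

A larger `nlist` means fewer vectors per cluster, which speeds up each probe but can lower recall unless more clusters are probed at search time.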

In [None]:
# Create the German collection, reusing the fields defined above

schema = CollectionSchema(fields, "German wikipedia article collection schema")

wiki_collection_de = Collection("wiki_articles_de", schema)

# Create the same IVF_FLAT / L2 index as for 'wiki_articles'
index_params = {
    'metric_type': 'L2',
    'index_type': "IVF_FLAT",
    'params': {"nlist": 2048}
}

wiki_collection_de.create_index(field_name="vector", index_params=index_params)

In [None]:
# An output of Status(code=0, message=) from the cells above means success!

In [None]:
# List the collections in our Milvus instance; 'wiki_articles' and 'wiki_articles_de' should both appear

from pymilvus import utility
utility.list_collections()

#### Insert Vectors into Milvus

Here we read data from the lakehouse table using the connection we created earlier. We pull the text chunks and titles from the database into separate lists, then vectorize the text chunks with the `sentence-transformers/all-MiniLM-L6-v2` sentence transformer model, which produces 384-dimensional embeddings. Learn more about this Hugging Face sentence transformer here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

We then assemble the article text, article titles and vector embeddings into a `data` object, in the same order as the fields in the collection schema (the auto-generated `id` is left out). This object is what we load into Milvus.
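
Since the `data` object is a list of columns, a mismatch in length, text size or vector dimension is only reported when the insert is attempted. The sketch below checks the columns up front against the limits used in the schema above (2500, 200 and 384). `check_columns` is a hypothetical helper, and it assumes that the VARCHAR `max_length` is counted in UTF-8 bytes, which is the stricter reading.

```python
def check_columns(texts, titles, vectors, max_text=2500, max_title=200, dim=384):
    """Validate column-ordered data against the limits of the wiki_articles schema.

    Assumption: VARCHAR max_length is counted in UTF-8 bytes.
    """
    if not (len(texts) == len(titles) == len(vectors)):
        raise ValueError("all columns must have the same number of rows")
    for i, (text, title, vector) in enumerate(zip(texts, titles, vectors)):
        if len(text.encode("utf-8")) > max_text:
            raise ValueError(f"row {i}: article_text longer than {max_text} bytes")
        if len(title.encode("utf-8")) > max_title:
            raise ValueError(f"row {i}: article_title longer than {max_title} bytes")
        if len(vector) != dim:
            raise ValueError(f"row {i}: vector has {len(vector)} dims, expected {dim}")
    # Same order as the schema fields, without the auto-generated id
    return [texts, titles, vectors]

data = check_columns(["some text"], ["Some title"], [[0.0] * 384])
print(len(data), len(data[2][0]))
# 3 384
```

Rows that fail the check could be truncated or skipped before inserting, depending on what the downstream application needs.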

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
import warnings
warnings.filterwarnings('ignore')

# Download Wikipedia articles from watsonx.data using the engine we created earlier 

articles_df = pd.read_sql_query("select * from hive_data.watsonxai.wikipedia",quick_engine)

# extract text + titles
passages = articles_df['text'].tolist()
passage_titles = articles_df['title'].tolist()

# Create vector embeddings + data
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passage_embeddings = model.encode(passages)

basic_collection = Collection("wiki_articles")

# Column order must match the schema: article_text, article_title, vector
data = [
    passages,
    passage_titles,
    passage_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence
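
The cell above sends every row in a single `insert` call. For larger tables it may be preferable to insert in batches so that each request stays small. The sketch below shows only the slicing step. `batch_columns` is a hypothetical helper and the batch size is an arbitrary illustrative value.

```python
def batch_columns(columns, batch_size):
    """Yield column-ordered slices of at most batch_size rows each."""
    total = len(columns[0])
    for start in range(0, total, batch_size):
        yield [column[start:start + batch_size] for column in columns]

# Tiny stand-in for [passages, passage_titles, passage_embeddings]
demo = [["t1", "t2", "t3"], ["a", "b", "c"], [[0.1], [0.2], [0.3]]]
for chunk in batch_columns(demo, 2):
    print(len(chunk[0]), chunk[1])
# 2 ['a', 'b']
# 1 ['c']
```

Each `chunk` has the same column layout as `data`, so it could be passed to `basic_collection.insert(chunk)` inside the loop, with a single `basic_collection.flush()` after the last batch.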

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
import warnings
warnings.filterwarnings('ignore')

# Download the German Wikipedia articles from watsonx.data using the engine we created earlier

articles_df = pd.read_sql_query("select * from hive_data.watsonxai.wikipedia_de",quick_engine)

# extract text + titles
passages = articles_df['text'].tolist()
passage_titles = articles_df['title'].tolist()

# Create vector embeddings + data
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 384 dim
passage_embeddings = model.encode(passages)

basic_collection = Collection("wiki_articles_de") 
data = [
    passages,
    passage_titles,
    passage_embeddings
]
out = basic_collection.insert(data)
basic_collection.flush()  # Ensures data persistence

In [None]:
## check to ensure entities have been loaded into 'wiki_articles' collection

basic_collection = Collection("wiki_articles") 

basic_collection.num_entities 

In [None]:
## check to ensure entities have been loaded into 'wiki_articles_de' collection

basic_collection = Collection("wiki_articles_de") 

basic_collection.num_entities 