## Creating an index and populating it with documents using Milvus and Nomic AI Embeddings

A simple example of how to ingest PDF documents, and then web page content, into a Milvus vector store. In this example, the embeddings are the fully open-source ones released by Nomic AI, [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

As described in [this blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1), those embeddings feature an "8192 context-length that outperforms OpenAI Ada-002 and text-embedding-3-small on both short and long context tasks". In addition, they are:

- Open source
- Open data
- Open training code
- Fully reproducible and auditable

Requirements:
- A Milvus instance, either standalone or cluster.

### Needed packages and imports

In [1]:
!pip install -q einops==0.7.0 langchain==0.1.9 pypdf==4.0.2 pymilvus==2.3.6 sentence-transformers==2.4.0


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
import requests
import os
from langchain.document_loaders import PyPDFDirectoryLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

### Base parameters, the Milvus connection info

In [3]:
MILVUS_HOST = "vectordb-milvus.milvus-standalone.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = os.getenv('MILVUS_COLLECTION_NAME')
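Because `os.getenv` silently returns `None` when a variable is missing, a connection attempt can fail later with a less obvious error. The following is a minimal sketch of an early check, assuming the same three environment variable names as above; `require_env` and `load_milvus_settings` are hypothetical helpers, not part of LangChain or pymilvus.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def load_milvus_settings() -> dict:
    """Collect the credentials and collection name used by this notebook."""
    return {
        "user": require_env("MILVUS_USERNAME"),
        "password": require_env("MILVUS_PASSWORD"),
        "collection": require_env("MILVUS_COLLECTION_NAME"),
    }
```

Calling `load_milvus_settings()` at the top of the notebook would surface a missing variable before any model is downloaded or any connection is opened.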

## Initial index creation and document ingestion

#### Load the PDFs

In [4]:
pdf_folder_name = 'ea'
pdf_folder_path = f"./{pdf_folder_name}"

pdf_loader = PyPDFDirectoryLoader(pdf_folder_path)
pdf_docs = pdf_loader.load()

In [5]:
pdf_docs[-1].metadata

{'source': 'ea/3210-EIA-simplificado.pdf', 'page': 73}

#### Inject metadata

In [6]:
# List all files in the PDF folder
files_in_folder = os.listdir(pdf_folder_path)

# Keep only the PDF files (case-insensitive extension check)
pdf_files = [file for file in files_in_folder if file.lower().endswith('.pdf')]

pdf_files

['3166-EIA.pdf', '3210-EIA-simplificado.pdf']

In [7]:
import re

# Define a regular expression pattern to match "<ID>*.pdf"
pattern = re.compile(r'^' + re.escape(pdf_folder_name + '/') + r'([0-9]+).*\.pdf$', re.IGNORECASE)
# pattern = re.compile(r'^ea/([0-9]+).*\.pdf$', re.IGNORECASE)
print(pattern)
for doc in pdf_docs:
    match = pattern.match(doc.metadata["source"])
    if match:
        doc.metadata["id"] = match.group(1)

pdf_docs[-1]

re.compile('^ea/([0-9]+).*\\.pdf$', re.IGNORECASE)


Document(page_content='E. I. A.  DEL PROYECTO DE PLANTACION DE FRUTOS SECOS Y TRANSFORMACIÓN DE 142,50 HA. DE RIEGO POR \nASPERSIÓN EN RIEGO POR GOTEO, EN EL TÉRMINO MUNICIPAL DE PEPINO (TOLEDO).  \nLA ISLA DEL POSTUERO , S.L.  MEMORIA  - 73 - Se define:  \n \n- Número de conatos: Indica el número de conatos iniciados en el término municipal. \nSe define como CONATO aquel incendio f orestal cuya superficie total es inferior a 1 \nHa. \n- Número de incendios: Indica el número de incendios forestales en el término \nmunicipal. Se define como INCENDIO aquel cuya superficie es igual o superior a 1 \nHa. \n- Frecuencia de incendios totales: Número total  de conatos e incendios iniciados en el \nmunicipio.  \n \nSerá necesario, por tanto, maximizar las precauciones para evitar incendios derivados  de \nlas obras dado el elevado riesgo de incendios de la zona según los datos de los  años de \nestudio.  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ', metadata={'source':
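To see which paths the pattern above accepts, the same regular expression can be tried on a few sample strings. This is a small sketch: `extract_doc_id` is a hypothetical helper, and the last two sample paths are made up to show the non-matching cases.

```python
import re


def extract_doc_id(source: str, folder: str = "ea"):
    """Return the leading numeric ID of a path such as 'ea/3166-EIA.pdf', or None."""
    pattern = re.compile(
        r"^" + re.escape(folder + "/") + r"([0-9]+).*\.pdf$", re.IGNORECASE
    )
    match = pattern.match(source)
    return match.group(1) if match else None


samples = [
    "ea/3166-EIA.pdf",
    "ea/3210-EIA-simplificado.pdf",
    "ea/notes.pdf",        # made-up: no leading digits, so no ID
    "other/3166-EIA.pdf",  # made-up: different folder, so no match
]
for path in samples:
    print(path, "->", extract_doc_id(path))
```

Documents whose path does not match keep their metadata unchanged, so they would have no `id` key to filter on later.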

#### Load websites

#### Merge both types of docs

In [8]:
# No web pages are loaded in this run, so the merged list is just the PDF documents
docs = pdf_docs

#### Split documents into chunks with some overlap

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024,
                                               chunk_overlap=40)
all_splits = text_splitter.split_documents(docs)
all_splits[0]

Document(page_content='DOCUMENTO AMBIENTAL PARA “AUTORIZACIÓN APROVECHAMIE NTO DE AGUAS SUBTERRÁNEAS MEDIANTE  \nCONCESIÓN DE RIEGO DE  25,75 HAS EN VILLATOBAS (TOL EDO)”. \n EXPEDIENTE DE CHT: C-0376-2021  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nAbril 2022 \n \n \n \n PROMOTOR : JUAN ANTONIO GONZÁLEZ GÓMEZ.  45310 -\nVILLATOBAS (TOLEDO)', metadata={'source': 'ea/3166-EIA.pdf', 'page': 0, 'id': '3166'})

In [10]:
len(all_splits)

491
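The effect of `chunk_size` and `chunk_overlap` is easier to see on a tiny input. The sketch below is a simplified fixed-width splitter written for illustration only. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the kind of shared context the 40-character overlap above is meant to preserve across chunk boundaries.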

#### Create the index and ingest the documents

In [12]:
# This cell runs on the CPU. To use a GPU instead, add 'device': 'cuda' as in the commented line:
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
model_kwargs = {'trust_remote_code': True}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=True
)


db = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    auto_id=True,
    drop_old=True
    )

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>
Failed to create new connection using: 6ff40ebe8d7a478d878ae19da9849a75


MilvusException: <MilvusException: (code=2, message=Fail connecting to server on vectordb-milvus.milvus-standalone.svc.cluster.local:19530. Timeout)>
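A timeout like the one above usually means the host and port are not reachable from where the notebook runs. A quick way to check this before building the vector store is a plain TCP connection test. This is only a sketch using the standard library: `can_reach` is a hypothetical helper, and a successful TCP connection does not prove that Milvus itself is healthy or that the credentials are valid.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Hypothetical usage with the variables defined earlier in this notebook:
# if not can_reach(MILVUS_HOST, MILVUS_PORT):
#     print("Milvus endpoint is not reachable from this environment")
```

If the check fails for a cluster-internal name such as the one used here, the notebook is probably running outside the cluster or in a namespace without network access to the service.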

In [None]:
db.add_documents(all_splits)

#### Alternatively, add new documents

In [None]:
# If you don't want to use a GPU, you can remove the 'device': 'cuda' argument
# model_kwargs = {'trust_remote_code': True, 'device': 'cuda'}
# embeddings = HuggingFaceEmbeddings(
#     model_name="nomic-ai/nomic-embed-text-v1",
#     model_kwargs=model_kwargs,
#     show_progress=True
# )

# db = Milvus(
#     embedding_function=embeddings,
#     connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
#     collection_name=MILVUS_COLLECTION,
#     metadata_field="metadata",
#     text_field="page_content",
#     auto_id=True,
#     drop_old=False
#     )

# db.add_documents(all_splits)

#### Test query

In [None]:
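# Query in Spanish, matching the documents: "Where is the project located?"
query = "Dónde se localiza el proyecto?"
# `param` holds index search parameters, not a metadata filter. To restrict the search
# to document 3166, pass a Milvus boolean expression on the JSON metadata field instead.
docs_with_score = db.similarity_search_with_score(query, expr='metadata["id"] == "3166"')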

In [None]:
# Query in Spanish: "Who is the promoter of the project?"
query = "Quién es el promotor del proyecto?"
docs_with_score = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.metadata)
    print(doc.page_content)
    print("-" * 80)
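When a query is run without a server-side filter, the `(doc, score)` pairs can also be narrowed down in Python using the `id` injected into the metadata earlier. The sketch below assumes each result exposes a `metadata` dictionary, as the documents printed above do; `filter_by_id` is a hypothetical helper, and the stand-in results use made-up text and scores only to exercise it.

```python
from types import SimpleNamespace


def filter_by_id(results, doc_id):
    """Keep only (doc, score) pairs whose metadata 'id' equals doc_id."""
    return [(doc, score) for doc, score in results if doc.metadata.get("id") == doc_id]


# Stand-in objects with made-up content and scores (not real query output)
fake_results = [
    (SimpleNamespace(page_content="text A", metadata={"id": "3166", "page": 0}), 0.42),
    (SimpleNamespace(page_content="text B", metadata={"id": "3210", "page": 73}), 0.57),
    (SimpleNamespace(page_content="text C", metadata={"page": 5}), 0.61),
]

only_3166 = filter_by_id(fake_results, "3166")
print(len(only_3166))  # 1
```

Filtering after the search can leave fewer results than requested, since the top matches may belong to other documents. Filtering inside the search avoids that when the store supports it.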