# Research on Vector Databases

Vector search retrieves data by comparing vector representations instead of matching keywords. Traditional keyword-based search matches documents on the occurrence of specific terms, whereas vector search compares the semantic meaning of data points. Because data is represented as vectors (embeddings) in a high-dimensional space, items with similar meaning lie close together, so a query can retrieve relevant results even when they share no exact terms with it.
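
To make "similarity between vectors" concrete, here is a minimal sketch of cosine similarity, one common similarity measure (and the metric chosen for the Pinecone index and Cosmos DB container later in this notebook). The three-dimensional vectors are hypothetical values picked for illustration; real embedding models produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: not produced by any real model).
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # points in the same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

A score near 1 means the vectors point in the same direction, and a score of 0 means they are orthogonal. Note that cosine similarity ignores vector length, which is why `doc_close` matches the query despite being twice as long.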

In [None]:
pip install python-dotenv==1.0.1 langchain==0.2.11 langchain-community==0.2.10 scikit-learn==1.5.0

In [1]:
import os
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

## Data loader

First, we need to load a document. Let's try with a PDF and a Markdown file.

Then, we will split it into smaller chunks of text. This serves several important purposes:
- Granularity: By splitting a document into smaller chunks, you can retrieve more specific and relevant pieces of information. If you work with large documents as single chunks, retrieval can become inefficient and less precise.
- Search Performance: Smaller chunks improve the performance of search algorithms. It's easier to match a query against smaller, more focused pieces of text rather than a large document.
- Computational efficiency: Working with entire documents as single units can be memory-expensive and slow. Splitting documents into chunks allows for more efficient use of memory and computational resources.

Practical example: Imagine we have a large document, such as a product manual or a scientific paper. If a user asks a question like "How do I reset the device?" or "What is the conclusion of the study?", splitting the document into smaller chunks allows the chatbot to:
- Efficiently locate and retrieve the most relevant section about device resetting instructions or the conclusion of the study.
- Avoid returning irrelevant parts of the document, which might confuse the user.
- Provide a faster response by only processing a small portion of the document.
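
The idea of chunking with overlap can be sketched in plain Python. This is a simplified fixed-window splitter written for illustration only; it is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` is hypothetical, and the default sizes mirror the 512/64 values used in this notebook.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=64):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample text of 1200 characters.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts 448 characters after the previous one, so the last 64 characters of a chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.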

In [None]:
pip install pypdf==4.2.0 unstructured==0.14.7 markdown==3.6

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define how the text should be split:
#  - Each chunk should be up to 512 characters long.
#  - There should be an overlap of 64 characters between consecutive chunks. 
#  - This overlap helps maintain context across the chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

In [16]:
# PDF

from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("../knowledge-base/folder_1/file_1.pdf")

# Load the PDF and split it into chunks.
pdf_chunks = pdf_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(pdf_chunks)}")

# Print the first 5 chunks
for chunk in pdf_chunks[:5]:
    print(chunk)

Number of chunks: 311
page_content='GENERATIVE AI: HYPE,  OR 
TRUL
Y TRANSFORMATIVE?ISSUE 120 | July 5, 2023 | 12:28 PM EDT”P   Global Macro  
Research
Investors should consider this report as only a single factor in making their investment decision. For 
Reg AC certiﬁcation and other important disclosures, see the Disclosure Appendix, or go to www.gs.com/research/hedge.html.
The Goldman Sachs Group, Inc.Since the release of OpenAI’s generative AI tool ChatGPT in November, investor' metadata={'source': '../knowledge-base/folder_1/file_1.pdf', 'page': 0}
page_content='interest in generative AI technology has surged. The disruptive potential of  this technology, and whether the hype around it—and market pricing—has gone too far, is Top of Mind. We speak with Conviction’s Sarah Guo, NYU’s Gary Marcus, and GS GIR’s US software and internet analysts Kash Rangan and Eric Sheridan about what the technology can—and can’t—do at this stage. GS economists then assess the technol

In [4]:
# Markdown

from langchain_community.document_loaders import UnstructuredMarkdownLoader

md_loader = UnstructuredMarkdownLoader("../knowledge-base/folder_1/file_2.md")

# Load the Markdown file and split it into chunks.
md_chunks = md_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(md_chunks)}")

# Print the first 5 chunks
for chunk in md_chunks[:5]:
    print(chunk)

Number of chunks: 7
page_content='The Football World Cup

The FIFA World Cup, often simply referred to as the World Cup, is the premier international football competition. Organized by the Fédération Internationale de Football Association (FIFA), it features national teams from around the globe competing for the most coveted trophy in the sport. The tournament is held every four years and is one of the most widely viewed and followed sporting events in the world.

History' metadata={'source': '../knowledge-base/folder_1/file_2.md'}
page_content='History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages:' metadata={'source': '../knowledge-base/folder_1/file_2.md'}


## Vector store

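A common way to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the query is embedded the same way and the store retrieves the vectors that are 'most similar' to it. A vector store takes care of storing embedded data and performing vector search for you.

Note: the cells below pass an embedding model object named `google_embeddings` to each vector store. Its definition does not appear in this notebook, so it must be created beforehand. Given the index settings used below, it is expected to produce 768-dimensional vectors.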
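
The embed, store, and retrieve cycle can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: `TinyVectorStore` is not a LangChain class, and `toy_embed` is a bag-of-words stand-in for a real embedding model, so it only captures word overlap and not semantic meaning.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory store: keeps (vector, text) pairs and ranks them at query time."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add_texts(self, texts):
        # Embed each text once at insert time and keep the vector with it.
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=3):
        # Embed the query, then return the k texts with the highest similarity.
        query_vec = self.embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore(toy_embed)
store.add_texts([
    "The first World Cup was held in 1930 in Uruguay.",
    "Generative AI carries regulatory and cybersecurity risks.",
    "Qualification takes place over the three years before the tournament.",
])
print(store.similarity_search("risks of generative AI", k=1))
```

This sketch compares the query against every stored vector, which is fine for a handful of documents. The hosted and local stores tested below typically rely on indexes to avoid scanning the whole collection.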

In [13]:
# Write data into the vector store
def insert_data(docs, vstore):
    print(f"Documents to insert: {len(docs)}.")

    inserted_ids = vstore.add_documents(docs)

    print(f"Inserted {len(inserted_ids)} documents.")

In [14]:
# Run a similarity search for the given query
def search_vstore(query, vstore):
    results = vstore.similarity_search(query, k=3)
    for res in results:
        print(f"* {res.page_content} \n[{res.metadata}] \n")

#### Astra DB

To use DataStax Astra DB:
- Create a database at https://astra.datastax.com/
- Obtain your database API endpoint, located under Database Details > API Endpoint, and save it as an environment variable called: ASTRA_DB_API_ENDPOINT
- Generate a token and save it as an environment variable called: ASTRA_DB_APPLICATION_TOKEN

In [None]:
pip install langchain-astradb==0.3.3

In [31]:
from langchain_astradb import AstraDBVectorStore

In [32]:
astradb = AstraDBVectorStore(
    embedding=google_embeddings,
    collection_name="vector_db_test", 
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN")
)

In [33]:
insert_data(pdf_chunks, astradb)

Documents to insert: 311.
Inserted 311 documents.


In [34]:
search_vstore("what are the risks of AI?", astradb)

* Allison Nathan: How should investors navigate this risk?  
Sarah Guo: My advice to investors is to focus on choice of 
technical partners, concrete plans, and outcomes . When AI 
products account for a significant share of incremental revenue, 
it's hard to argue with that performance. Or, in consumer 
businesses, if the metrics investors typically use to assess a 
company’s performance— engagement , transactions, ad 
inventory, etc. —materially improve after introducing a new AI 
[{'source': 'example_data/example_pdf.pdf', 'page': 4}] 

* inventory, etc. —materially improve after introducing a new AI 
product, that’s what you want to see.  
Another significant risk is public and regulatory backlash against 
AI technology due to concerns around abuse of these 
technologies in t he areas of bias, disinformation, cybersecurity, 
etc. Just like the internet, general tools like generative AI can 
be used for good and for bad, so investment in risk mitigation 
must occur alongside investm

In [35]:
insert_data(md_chunks, astradb)

Documents to insert: 7.
Inserted 7 documents.


In [36]:
search_vstore("when was the first tournament played?", astradb)

* History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages: 
[{'source': 'example_data/example_markdown.md'}] 

* Format

The World Cup is divided into two main stages:

Qualification: This stage occurs over the three years preceding the tournament, with national teams competing within their respective confederations (e.g., UEFA, CONMEBOL, AFC) to secure a spot in the finals.

Final Tournament: Held over approximately one month, the final tournament features: 
[{'source': 'example_data/example_markdown.md'}] 

* The Football World Cup

The FIFA World Cup, often simply referred to as the World Cup, is the premier international football competition. Organized by th

✅ Pros:
- Friendly UI
- Can run similarity search through the UI (but you need to have the embedding beforehand)
- Has a CQL console to run queries on the database

❌ Cons:
- Probably not suitable for storing private information, since the data is hosted by a third party
- Cannot run similarity search on raw text directly (the query has to be embedded first)
- The database goes into hibernation if it's not used for 2 days

#### Pinecone

To use Pinecone:
- Create a project at https://app.pinecone.io/
- Create an index with dimensions=768 (this must match the length of the embedding vectors) and metric=cosine
- Generate an API key and save it as an environment variable called: PINECONE_API_KEY
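
Because the index dimension is fixed when the index is created, it can help to confirm that the vectors match it before inserting anything. The helper below is a hypothetical sketch (the name `check_dimensions` is an assumption, not part of Pinecone or LangChain), and the vectors are placeholders standing in for real embeddings.

```python
def check_dimensions(vectors, expected_dim=768):
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )
    return len(vectors)


# Hypothetical placeholder vectors (assumption: stand-ins for model output).
good = [[0.0] * 768, [0.1] * 768]
print(check_dimensions(good))
```

With a LangChain embeddings object, the same check could be applied to the result of its `embed_documents` method on a short probe text before creating or writing to the index.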

In [None]:
pip install langchain-pinecone==0.1.1

In [21]:
from langchain_pinecone import PineconeVectorStore

In [27]:
pinecone = PineconeVectorStore(
    embedding=google_embeddings,
    index_name="vector-db-test",
    pinecone_api_key=os.getenv("PINECONE_API_KEY")
)

In [37]:
insert_data(pdf_chunks, pinecone)

Documents to insert: 311.
Inserted 311 documents.


In [38]:
search_vstore("what are the risks of AI?", pinecone)

* Allison Nathan: How should investors navigate this risk?  
Sarah Guo: My advice to investors is to focus on choice of 
technical partners, concrete plans, and outcomes . When AI 
products account for a significant share of incremental revenue, 
it's hard to argue with that performance. Or, in consumer 
businesses, if the metrics investors typically use to assess a 
company’s performance— engagement , transactions, ad 
inventory, etc. —materially improve after introducing a new AI 
[{'page': 4.0, 'source': 'example_data/example_pdf.pdf'}] 

* inventory, etc. —materially improve after introducing a new AI 
product, that’s what you want to see.  
Another significant risk is public and regulatory backlash against 
AI technology due to concerns around abuse of these 
technologies in t he areas of bias, disinformation, cybersecurity, 
etc. Just like the internet, general tools like generative AI can 
be used for good and for bad, so investment in risk mitigation 
must occur alongside inves

In [39]:
insert_data(md_chunks, pinecone)

Documents to insert: 7.
Inserted 7 documents.


In [40]:
search_vstore("when was the first tournament played?", pinecone)

* History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages: 
[{'source': 'example_data/example_markdown.md'}] 

* Format

The World Cup is divided into two main stages:

Qualification: This stage occurs over the three years preceding the tournament, with national teams competing within their respective confederations (e.g., UEFA, CONMEBOL, AFC) to secure a spot in the finals.

Final Tournament: Held over approximately one month, the final tournament features: 
[{'source': 'example_data/example_markdown.md'}] 

* The Football World Cup

The FIFA World Cup, often simply referred to as the World Cup, is the premier international football competition. Organized by th

✅ Pros:
- Friendly UI
- Can run similarity search through the UI (but you need to have the embedding beforehand)

❌ Cons:
- Probably not suitable for storing private information, since the data is hosted by a third party
- Cannot run similarity search on raw text directly (the query has to be embedded first)

#### Chroma

In [None]:
pip install langchain-chroma==0.1.1

In [9]:
from langchain_chroma import Chroma

In [10]:
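# Keyword arguments make explicit which value is the collection name and which is the embedding model.
chroma = Chroma(collection_name="vector_db_test", embedding_function=google_embeddings)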

In [41]:
insert_data(pdf_chunks, chroma)

Documents to insert: 311.
Inserted 311 documents.


In [44]:
search_vstore("what are the risks of AI?", chroma)

* Allison Nathan: How should investors navigate this risk?  
Sarah Guo: My advice to investors is to focus on choice of 
technical partners, concrete plans, and outcomes . When AI 
products account for a significant share of incremental revenue, 
it's hard to argue with that performance. Or, in consumer 
businesses, if the metrics investors typically use to assess a 
company’s performance— engagement , transactions, ad 
inventory, etc. —materially improve after introducing a new AI 
[{'page': 4, 'source': 'example_data/example_pdf.pdf'}] 

* Allison Nathan: How should investors navigate this risk?  
Sarah Guo: My advice to investors is to focus on choice of 
technical partners, concrete plans, and outcomes . When AI 
products account for a significant share of incremental revenue, 
it's hard to argue with that performance. Or, in consumer 
businesses, if the metrics investors typically use to assess a 
company’s performance— engagement , transactions, ad 
inventory, etc. —materially im

In [43]:
insert_data(md_chunks, chroma)

Documents to insert: 7.
Inserted 7 documents.


In [45]:
search_vstore("when was the first tournament played?", chroma)

* History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages: 
[{'source': 'example_data/example_markdown.md'}] 

* History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages: 
[{'source': 'example_data/example_markdown.md', 'text': 'History\n\nThe first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Sinc

✅ Pros:
- Suitable for private information as it runs locally

❌ Cons:
- Need to handle persistence on my own (by default the data is stored in memory; passing `persist_directory` to `Chroma` writes it to disk)
- Doesn't have a UI

#### Azure Cosmos DB

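To use Azure Cosmos DB:
- Deploy an Azure Cosmos DB for NoSQL account and enable vector search: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/vector-search
- Save the account URI and key as environment variables called: AZURE_COSMOS_DB_URI and AZURE_COSMOS_DB_KEY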

In [None]:
pip install azure-cosmos==4.7.0

In [37]:
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import AzureCosmosDBNoSqlVectorSearch

In [30]:
cosmos_client = CosmosClient(os.getenv("AZURE_COSMOS_DB_URI"), os.getenv("AZURE_COSMOS_DB_KEY"))
database_name = "vector_db_test"
container_name = "test_collection"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}
cosmos_database_properties = {"id": database_name}

In [None]:
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": '/"_etag"/?'}],
    "vectorIndexes": [{"path": "/embedding", "type": "quantizedFlat"}],
}

vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 768,
        }
    ]
}

In [32]:
cosmos_db = AzureCosmosDBNoSqlVectorSearch(
    embedding=google_embeddings,
    cosmos_client=cosmos_client,
    database_name=database_name,
    container_name=container_name,
    vector_embedding_policy=vector_embedding_policy,
    indexing_policy=indexing_policy,
    cosmos_container_properties=cosmos_container_properties,
    cosmos_database_properties=cosmos_database_properties,
    create_container=True
)

In [33]:
insert_data(pdf_chunks, cosmos_db)

Documents to insert: 311.
Inserted 311 documents.


In [34]:
search_vstore("what are the risks of AI?", cosmos_db)

* Allison Nathan: How should investors navigate this risk?  
Sarah Guo: My advice to investors is to focus on choice of 
technical partners, concrete plans, and outcomes . When AI 
products account for a significant share of incremental revenue, 
it's hard to argue with that performance. Or, in consumer 
businesses, if the metrics investors typically use to assess a 
company’s performance— engagement , transactions, ad 
inventory, etc. —materially improve after introducing a new AI 
[{'id': '22b55aaa-cac4-42c2-a607-ed6918a6a109', 'embedding': [-0.010116511955857277, -0.06860020011663437, -0.043540891259908676, 0.0015710726147517562, 0.040942516177892685, 0.00905861146748066, 0.045039474964141846, 0.007032771594822407, 0.010039553977549076, 0.02657351829111576, 0.07548939436674118, 0.008177440613508224, -0.03559532016515732, -0.015411953441798687, 0.010103081353008747, -0.11607693135738373, 0.014570698142051697, 0.057302236557006836, 0.03229697048664093, -0.009306125342845917, 0.0054496

In [35]:
insert_data(md_chunks, cosmos_db)

Documents to insert: 7.
Inserted 7 documents.


In [36]:
search_vstore("when was the first tournament played?", cosmos_db)

* History

The first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.

Format

The World Cup is divided into two main stages: 
[{'id': '200b0788-22a6-44bc-b517-cc8e6e59bebd', 'embedding': [0.05687127634882927, -0.044950079172849655, -0.08449845761060715, -0.007109180558472872, 0.02278314344584942, 0.050022922456264496, 0.027620889246463776, -0.022218625992536545, 0.007175700273364782, 0.06564831733703613, -0.028561297804117203, 0.04098201543092728, 0.019916323944926262, 0.03842863067984581, -0.016354287043213844, -0.004110174253582954, -0.012827101163566113, 0.049804940819740295, -0.006501531694084406, -0.008129394613206387, -0.007895036600530148, 0.021091509610414505, -0.005252905655652285, 0.017993811517953873, -0.01908742822
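
The Cosmos DB results above include the full embedding vector in each document's metadata, which floods the printed output. One possible workaround is to filter the metadata before printing it. The helper below is a hypothetical sketch (the name `compact_metadata` and the example values are assumptions); it could be applied to `res.metadata` inside `search_vstore`.

```python
def compact_metadata(metadata, drop=("embedding", "text"), max_len=80):
    """Return a copy of metadata without bulky keys and with long values truncated."""
    compact = {}
    for key, value in metadata.items():
        if key in drop:
            continue  # skip fields that are too large to be worth printing
        rendered = str(value)
        # Keep short values unchanged; replace long ones with a truncated string.
        compact[key] = value if len(rendered) <= max_len else rendered[:max_len] + "..."
    return compact


# Hypothetical metadata shaped like the results above (values are made up).
example = {"id": "doc-1", "source": "file_2.md", "embedding": [0.1] * 768}
print(compact_metadata(example))
```

The function builds a new dictionary and leaves the original metadata untouched, so the stored documents are not affected.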

✅ Pros:
- Suitable for private information as you can deploy it in your own subscription

❌ Cons:
- It's paid