# Vector Databases - Research

Vector search retrieves data by comparing vector similarity. Unlike traditional keyword-based search, which matches documents on the occurrence of specific terms, vector search compares the semantic meaning of data points. Because data is represented as vectors in a high-dimensional space, a query can match content that expresses the same idea in different words.

In [None]:
pip install python-dotenv==1.0.1 langchain==0.2.1 langchain-community==0.2.1 scikit-learn==1.5.0

In [36]:
import os
from dotenv import load_dotenv

In [37]:
load_dotenv()

True

## Load and split the data

First, we need to load a document. Let's try with a PDF and a Markdown file.

Then, we will split it into smaller chunks of text. This serves several important purposes:
- Granularity: By splitting a document into smaller chunks, you can retrieve more specific and relevant pieces of information. If you work with large documents as single chunks, retrieval can become inefficient and less precise.
- Search Performance: Smaller chunks improve the performance of search algorithms. It's easier to match a query against smaller, more focused pieces of text rather than a large document.
- Computational efficiency: Working with entire documents as single units can be memory-expensive and slow. Splitting documents into chunks allows for more efficient use of memory and computational resources.

Practical example: Imagine we have a large document, such as a product manual or a scientific paper. If a user asks a question like "How do I reset the device?" or "What is the conclusion of the study?", splitting the document into smaller chunks allows the chatbot to:
- Efficiently locate and retrieve the most relevant section about device resetting instructions or the conclusion of the study.
- Avoid returning irrelevant parts of the document, which might confuse the user.
- Provide a faster response by only processing a small portion of the document.
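To make the idea of overlapping chunks concrete, here is a minimal sketch that cuts a string into fixed-size character windows. It is a simplification: the helper name `split_with_overlap` is hypothetical, and it slices at fixed offsets, whereas `RecursiveCharacterTextSplitter` (used below) prefers natural boundaries such as paragraphs, lines, and words.

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Cut text into character windows of at most chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers so the overlap is easy to see.
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The last 3 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary readable in at least one chunk.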

In [None]:
pip install pypdf==4.2.0 unstructured==0.14.7 markdown==3.6

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define how the text should be split:
#  - Each chunk should be up to 512 characters long.
#  - There should be an overlap of 64 characters between consecutive chunks. 
#  - This overlap helps maintain context across the chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

In [4]:
# PDF

from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("example_data/example_pdf.pdf")

# Load pdf and split into chunks.
pdf_chunks = pdf_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(pdf_chunks)}")

# Print the first 5 chunks
for chunk in pdf_chunks[:5]:
    print(chunk)

Number of chunks: 311
page_content='GENERATIVE AI: HYPE,  OR \nTRUL\nY TRANSFORMATIVE?ISSUE 120 | July 5, 2023 | 12:28 PM EDT”\x01\x01\x01\x01\x01P\x01\x01\x01\x01\x01\x01\x01\x01\x01 \x01\x01 \x01 \x01Global Macro  \nResearch\nInvestors should consider this report as only a single factor in making their investment decision. For \nReg AC certiﬁcation and other important disclosures, see the Disclosure Appendix, or go to www.gs.com/research/hedge.html.\nThe Goldman Sachs Group, Inc.Since the release of OpenAI’s generative AI tool ChatGPT in November, investor' metadata={'source': 'example_data/example_pdf.pdf', 'page': 0}
page_content='interest in generative AI technology has surged. The disruptive potential of  this technology, and whether the hype around it—and market pricing—has gone too far, is Top of Mind. We speak with Conviction’s Sarah Guo, NYU’s Gary Marcus, and GS GIR’s US software and internet analysts Kash Rangan and Eric Sheridan about what the technology can—and can’t—do a

In [5]:
# Markdown

from langchain_community.document_loaders import UnstructuredMarkdownLoader

md_loader = UnstructuredMarkdownLoader("example_data/example_markdown.md")

# Load markdown and split into chunks.
md_chunks = md_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(md_chunks)}")

# Print the first 5 chunks
for chunk in md_chunks[:5]:
    print(chunk)

Number of chunks: 7
page_content='The Football World Cup\n\nThe FIFA World Cup, often simply referred to as the World Cup, is the premier international football competition. Organized by the Fédération Internationale de Football Association (FIFA), it features national teams from around the globe competing for the most coveted trophy in the sport. The tournament is held every four years and is one of the most widely viewed and followed sporting events in the world.\n\nHistory' metadata={'source': 'example_data/example_markdown.md'}
page_content='History\n\nThe first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.\n\nFormat\n\nThe World Cup is divided into two main stages:' metadata={'source': 'example_data/example_markdown.md'

## Embeddings

Now we will transform our text chunks into vector embeddings.

Vector embeddings are a technique for transforming complex data into numerical forms that can be easily processed and analyzed by machine learning algorithms. They allow us to take virtually any data type and represent it as vectors.

But it isn't as simple as just turning data into vectors. We want to ensure that we can perform tasks on this transformed data without losing the data's original meaning. For example, if we want to compare two sentences, we don't just want to compare the words they contain but rather whether or not they mean the same thing.

To preserve the data's meaning, we need to understand how to produce vectors where relationships between the vectors make sense. To do this, we need what's known as an embedding model. We apply a pre-trained machine learning model that will produce a representation of this data that is more compact while preserving what's meaningful about the data.

The goal of embeddings is to capture the semantic meaning or relationships between data points in a way that similar items are close together in the vector space, and dissimilar items are far apart. For example, consider two words "king" and "queen". An embedding might map these words to vectors such that the difference between the "king" and "queen" vectors is similar to the difference between the "man" and "woman" vectors. This reflects the underlying semantic relationships.
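The "king"/"queen" relationship can be illustrated with a toy example. The vectors below are hand-made for illustration, not produced by a real embedding model, and the three axes (royalty, masculinity, femininity) are an assumption chosen to make the arithmetic readable.

```python
# Hand-made 3-D toy vectors (NOT output from a real embedding model).
# Assumed axes: [royalty, masculinity, femininity]
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


# The offset from "queen" to "king" and from "woman" to "man".
royal_offset = subtract(toy_vectors["king"], toy_vectors["queen"])
common_offset = subtract(toy_vectors["man"], toy_vectors["woman"])

print("king - queen:", royal_offset)
print("man - woman: ", common_offset)
```

In this toy space both offsets come out the same: the royalty component cancels and only the gender components remain. Real embedding models have hundreds of dimensions without such clean labels, so the match is approximate at best.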

Key characteristics of embeddings:
- Dimensionality: The number of elements in the vector. Higher dimensions can capture more complex relationships but are computationally more expensive.
- Similarity: Measured using metrics like cosine similarity or euclidean distance, which help in finding how close or far two vectors are from each other.
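As a rough sketch of how the two metrics differ, the following computes both in plain Python. The helper names `cosine_sim` and `euclidean_dist` and the example vectors are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` further down.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_dist(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]        # same direction as v, twice the length
orthogonal = [3.0, 0.0, -1.0]   # dot product with v is 0
opposite = [-1.0, -2.0, -3.0]   # v pointing the other way

for name, other in [("scaled", scaled), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(f"{name:>10}: cosine={cosine_sim(v, other):+.2f}  euclidean={euclidean_dist(v, other):.2f}")
```

Cosine similarity ignores length, so `v` and `scaled` count as essentially identical even though their Euclidean distance is not zero. It also shows the full range of the metric: close to 1 for the same direction, 0 for orthogonal vectors, and close to -1 for opposite ones.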

In [1]:
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Calculate cosine similarity between vectors.
# The score ranges from -1 to 1: values close to 1 indicate vectors pointing in nearly the same
# direction (very similar), values close to 0 indicate unrelated vectors, and negative values
# indicate vectors pointing in opposite directions.
# Cosine similarity is particularly useful when the magnitude of the vectors is not as important as their direction.
# It is often used in text analysis and information retrieval where the orientation (semantic meaning) of vectors
# matters more than their magnitude.
def compare_embeddings(vector_1, vector_2):
    similarity = cosine_similarity([vector_1], [vector_2])
    print(f"Cosine similarity: {similarity[0][0]}")
    

In [34]:
# Some sentences to test different embedding models
sentences_list = [
    # Include the same words but have different semantics. Similarity should be low.
    ["I want a new watch for my birthday", "I like to watch the TV on weekends"],      
    ["The bank will close at 5 PM", "The river bank is a great place to relax"],
    # Convey the same meaning without sharing words. Similarity should be high.
    ["I have to go to the mechanic", "My car is broken"], 
    ["She went to the store to buy some groceries", "She went shopping for food"],
    # Completely unrelated in terms of their content and context. Similarity should be low.
    ["The stock market experienced a significant decline last week", "She enjoys painting landscapes in her free time"],
    ["The sun rises in the east every morning", "Pizza is a popular food in Italy"]
]

In [7]:
# Function to test an embedding model applying it to a set of sentences and analyzing similarity
def test_embeddings(model):
    for sentences_pair in sentences_list:
        vector_1 = model.embed_query(sentences_pair[0])
        print(f"\"{sentences_pair[0]}\"")
        print(f" Dimensionality: {len(vector_1)}")
        print(f" Sample: {vector_1[:5]}")
        print()
        
        vector_2 = model.embed_query(sentences_pair[1])
        print(f"\"{sentences_pair[1]}\"")
        print(f" Dimensionality: {len(vector_2)}")
        print(f" Sample: {vector_2[:5]}")
        print()

        compare_embeddings(vector_1, vector_2)
        print("---")

#### Google AI

In [None]:
pip install langchain-google-genai

In [9]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [33]:
google_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

test_embeddings(google_embeddings)

"I want a new watch for my birthday."
 Dimensionality: 768
 Sample: [-0.0089229391887784, -0.05183125287294388, -0.06784740835428238, -0.04327891394495964, 0.04272729903459549]

"I like to watch the TV on weekends."
 Dimensionality: 768
 Sample: [-0.02537923865020275, -0.02598225325345993, -0.058093760162591934, 0.009168527089059353, 0.008474522270262241]

Cosine similarity: 0.6764325794481864
---
"The bank will close at 5 PM."
 Dimensionality: 768
 Sample: [0.022659489884972572, 0.01701500453054905, -0.016876468434929848, -0.05867573246359825, 0.029572682455182076]

"The river bank is a great place to relax."
 Dimensionality: 768
 Sample: [0.02937248907983303, -0.0046739946119487286, -0.015860099345445633, -0.02378806844353676, 0.0422411672770977]

Cosine similarity: 0.6503056535628654
---
"I have to go to the mechanic."
 Dimensionality: 768
 Sample: [0.01886001229286194, -0.04712492972612381, -0.04557660594582558, -0.0175259280949831, -0.01612073928117752]

"My car is broken."
 Dimen

#### Hugging Face

In [None]:
pip install langchain_huggingface==0.0.3 huggingface_hub==0.23.4

In [22]:
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

In [35]:
hface_embeddings = HuggingFaceEndpointEmbeddings()

test_embeddings(hface_embeddings)

"I want a new watch for my birthday"
 Dimensionality: 768
 Sample: [-0.043364789336919785, 0.04182625934481621, -0.0038558896631002426, 0.0431031733751297, 0.01737719029188156]

"I like to watch the TV on weekends"
 Dimensionality: 768
 Sample: [-0.04785402491688728, 0.0326734222471714, 0.0009588559623807669, -0.0176838506013155, -0.03530512750148773]

Cosine similarity: 0.176958964124023
---
"The bank will close at 5 PM"
 Dimensionality: 768
 Sample: [-0.04836975410580635, 0.0051988218910992146, -0.029005667194724083, 0.00420030765235424, 0.03131011128425598]

"The river bank is a great place to relax"
 Dimensionality: 768
 Sample: [-0.08330544829368591, -0.014301057904958725, -0.029551850631833076, -0.00013575299817603081, -0.01434556394815445]

Cosine similarity: 0.23080635655292628
---
"I have to go to the mechanic"
 Dimensionality: 768
 Sample: [-0.023156633600592613, -0.02565646730363369, -0.012464101426303387, 0.03722125664353371, -0.024508243426680565]

"My car is broken"
 Dime

#### Results

Embedding model | Sentence 1 | Sentence 2 | Cosine similarity 
:---: | :---: | :---: | :---:
Google | "I want a new watch for my birthday" | "I like to watch the TV on weekends" | 0.67
Google | "The bank will close at 5 PM" | "The river bank is a great place to relax" | 0.65
Google | "I have to go to the mechanic" | "My car is broken" | 0.84
Google | "She went to the store to buy some groceries" | "She went shopping for food" | 0.95
Google | "The stock market experienced a significant decline last week" | "She enjoys painting landscapes in her free time" | 0.52
Google | "The sun rises in the east every morning" | "Pizza is a popular food in Italy" | 0.59
Hugging Face | "I want a new watch for my birthday" | "I like to watch the TV on weekends" | 0.17
Hugging Face | "The bank will close at 5 PM" | "The river bank is a great place to relax" | 0.23
Hugging Face | "I have to go to the mechanic" | "My car is broken" | 0.60
Hugging Face | "She went to the store to buy some groceries" | "She went shopping for food" | 0.88
Hugging Face | "The stock market experienced a significant decline last week" | "She enjoys painting landscapes in her free time" | -0.09
Hugging Face | "The sun rises in the east every morning" | "Pizza is a popular food in Italy" | 0.17

- The Google embedding model generally shows moderate to high cosine similarity scores for semantically related sentences. For example:
  - "I have to go to the mechanic" and "My car is broken" (0.84) suggests good recognition of related topics.
  - "She went to the store to buy some groceries" and "She went shopping for food" (0.95) indicates strong similarity in meaning.

- The Hugging Face embedding model shows more varied cosine similarity scores across sentence pairs:
  - While "I have to go to the mechanic" and "My car is broken" (0.60) are somewhat related, they show a lower score compared to Google's model.
  - "She went to the store to buy some groceries" and "She went shopping for food" (0.88) shows higher similarity, but not as high as Google's score (0.95).

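Both models score the semantically related pairs highest, and the Google model produces higher scores overall. However, the Google model also assigns fairly high scores (0.52 to 0.67) to pairs that should be dissimilar, so its scores fall in a narrow band. The Hugging Face model separates the two groups more clearly: related pairs score 0.60 to 0.88, while the same-word and unrelated pairs score between -0.09 and 0.23.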
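One way to compare the models beyond raw scores is to look at the gap between the pairs expected to score high and the pairs expected to score low. The sketch below uses the rounded values from the table above; the grouping into "related" and "unrelated" follows the comments in `sentences_list`, and the mean-gap measure is just one possible summary, chosen here for simplicity.

```python
from statistics import mean

# Scores copied from the results table. "related" holds the two pairs expected
# to score high (mechanic/car, groceries/food); "unrelated" holds the four
# pairs expected to score low (watch, bank, stock market, sun/pizza).
scores = {
    "Google": {
        "related": [0.84, 0.95],
        "unrelated": [0.67, 0.65, 0.52, 0.59],
    },
    "Hugging Face": {
        "related": [0.60, 0.88],
        "unrelated": [0.17, 0.23, -0.09, 0.17],
    },
}

# Gap between the average "related" score and the average "unrelated" score.
gaps = {}
for model, groups in scores.items():
    related_mean = mean(groups["related"])
    unrelated_mean = mean(groups["unrelated"])
    gaps[model] = related_mean - unrelated_mean
    print(f"{model}: related={related_mean:.2f} unrelated={unrelated_mean:.2f} gap={gaps[model]:.2f}")
```

A larger gap means a similarity threshold is easier to pick. With only six sentence pairs this is indicative, not a benchmark.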

## Connection with the database

#### Astra DB

To use DataStax Astra DB:
- Create a database at https://astra.datastax.com/
- Obtain your database API endpoint, located under Database Details > API Endpoint, and save it as an environment variable called: ASTRA_DB_API_ENDPOINT
- Generate a token and save it as an environment variable called: ASTRA_DB_APPLICATION_TOKEN
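Before connecting, it can help to confirm that both variables are actually set, since `os.getenv` silently returns `None` for a missing name. The helper below is a hypothetical convenience, not part of the Astra DB or LangChain APIs.

```python
import os

# The two variable names described in the setup steps above.
REQUIRED_VARS = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Astra DB credentials found.")
```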

In [None]:
pip install langchain-astradb==0.3.3

In [38]:
from langchain_astradb import AstraDBVectorStore

In [39]:
vstore = AstraDBVectorStore(
    embedding=google_embeddings,
    collection_name="vector_db_test", 
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN")
)
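Once the store exists, a typical next step is to index the chunks and query them. The sketch below assumes the store exposes the standard LangChain vector store methods `add_documents` and `similarity_search`; the helper name `index_and_search` and the example query are hypothetical, and the usage lines are commented out because they need a reachable Astra DB instance.

```python
def index_and_search(store, chunks, query, k=3):
    """Add the chunks to the store, then return the k chunks most similar to the query."""
    store.add_documents(chunks)
    return store.similarity_search(query, k=k)


# Hypothetical usage with the objects defined earlier in this notebook:
# for doc in index_and_search(vstore, md_chunks, "Where was the first World Cup held?"):
#     print(doc.page_content[:80])
```

Note that calling this repeatedly inserts the same chunks again, so in practice indexing and searching are usually done as separate steps.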