# Vector Databases - Research

Vector search is a cutting-edge approach to searching and retrieving data that leverages the power of vector similarity calculations. Unlike traditional keyword-based search, which matches documents based on the occurrence of specific terms, vector search focuses on the semantic meaning and similarity of data points. By representing data as vectors in a high-dimensional space, vector search enables more accurate and intuitive search results.

In [None]:
pip install python-dotenv==1.0.1 langchain==0.2.1 langchain-community==0.2.1

In [21]:
import os
from dotenv import load_dotenv

In [20]:
load_dotenv()

True

## Load and split the data

First, we need to load a document. Let's try with a pdf and a markdown file.

Then, we will split it into smaller chunks of text. This serves several important purposes:
- Granularity: By splitting a document into smaller chunks, you can retrieve more specific and relevant pieces of information. If you work with large documents as single chunks, retrieval can become inefficient and less precise.
- Search Performance: Smaller chunks improve the performance of search algorithms. It's easier to match a query against smaller, more focused pieces of text rather than a large document.
- Computational efficiency: Working with entire documents as single units can be memory-expensive and slow. Splitting documents into chunks allows for more efficient use of memory and computational resources.

Practical example: Imagine we have a large document, such as a product manual or a scientific paper. If a user asks a question like "How do I reset the device?" or "What is the conclusion of the study?", splitting the document into smaller chunks allows the chatbot to:
- Efficiently locate and retrieve the most relevant section about device resetting instructions or the conclusion of the study.
- Avoid returning irrelevant parts of the document, which might confuse the user.
- Provide a faster response by only processing a small portion of the document.

In [None]:
pip install pypdf==4.2.0 unstructured==0.14.7 markdown==3.6

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define how the text should be split:
#  - Each chunk should be up to 512 characters long.
#  - There should be an overlap of 64 characters between consecutive chunks. 
#  - This overlap helps maintain context across the chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)

In [6]:
# PDF

from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("example_data/example_pdf.pdf")

# Load pdf and split into chunks.
pdf_chunks = pdf_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(pdf_chunks)}")

# Print the first 5 chunks
for chunk in pdf_chunks[:5]:
    print(chunk)

Number of chunks: 311
page_content='GENERATIVE AI: HYPE,  OR \nTRUL\nY TRANSFORMATIVE?ISSUE 120 | July 5, 2023 | 12:28 PM EDT”\x01\x01\x01\x01\x01P\x01\x01\x01\x01\x01\x01\x01\x01\x01 \x01\x01 \x01 \x01Global Macro  \nResearch\nInvestors should consider this report as only a single factor in making their investment decision. For \nReg AC certiﬁcation and other important disclosures, see the Disclosure Appendix, or go to www.gs.com/research/hedge.html.\nThe Goldman Sachs Group, Inc.Since the release of OpenAI’s generative AI tool ChatGPT in November, investor' metadata={'source': 'example_data/example_pdf.pdf', 'page': 0}
page_content='interest in generative AI technology has surged. The disruptive potential of  this technology, and whether the hype around it—and market pricing—has gone too far, is Top of Mind. We speak with Conviction’s Sarah Guo, NYU’s Gary Marcus, and GS GIR’s US software and internet analysts Kash Rangan and Eric Sheridan about what the technology can—and can’t—do a

In [15]:
# Markdown

from langchain_community.document_loaders import UnstructuredMarkdownLoader

md_loader = UnstructuredMarkdownLoader("example_data/example_markdown.md")

# Load pdf and split into chunks.
md_chunks = md_loader.load_and_split(text_splitter=splitter)

# Get the number of chunks
print(f"Number of chunks: {len(md_chunks)}")

# Print the first 5 chunks
for chunk in md_chunks[:5]:
    print(chunk)

Number of chunks: 7
page_content='The Football World Cup\n\nThe FIFA World Cup, often simply referred to as the World Cup, is the premier international football competition. Organized by the Fédération Internationale de Football Association (FIFA), it features national teams from around the globe competing for the most coveted trophy in the sport. The tournament is held every four years and is one of the most widely viewed and followed sporting events in the world.\n\nHistory' metadata={'source': 'example_data/example_markdown.md'}
page_content='History\n\nThe first World Cup was held in 1930 in Uruguay, with the host nation emerging as the champions. Since then, the tournament has grown significantly in size and popularity. Originally featuring just 13 teams, the World Cup now includes 32 teams in the final tournament, with plans to expand to 48 teams in the near future.\n\nFormat\n\nThe World Cup is divided into two main stages:' metadata={'source': 'example_data/example_markdown.md'

## Embeddings

Now we will proceed to transform our text junks into vector embeddings.

Vector embeddings is a powerful technique for transforming complex data into numerical forms that can be easily processed and analyzed by machine learning algorithms. It basically allows us to take virtually any data type and represent it as vectors.

But it isn't as simple as just turning data into vectors. We want to ensure that we can perform tasks on this transformed data without losing the data's original meaning. For example, if we want to compare two sentences, we don't want just to compare the words they contain but rather whether or not they mean the same thing. 

To preserve the data's meaning, we need to understand how to produce vectors where relationships between the vectors make sense. To do this, we need what's known as an embedding model. We apply a pre-trained machine learning model that will produce a representation of this data that is more compact while preserving what's meaningful about the data.

The goal of embeddings is to capture the semantic meaning or relationships between data points in a way that similar items are close together in the vector space, and dissimilar items are far apart. For example, consider two words "king" and "queen". An embedding might map these words to vectors such that the difference between the "king" and "queen" vectors is similar to the difference between the "man" and "woman" vectors. This reflects the underlying semantic relationships.

Key characteristics of embeddings:
- Dimensionality: The number of elements in the vector. Higher dimensions can capture more complex relationships but are computationally more expensive.
- Similarity: Measured using metrics like cosine similarity or euclidean distance, which help in finding how close or far two vectors are from each other.

In [43]:
sample_text = "hello, world!"

#### Google AI

In [None]:
pip install langchain-google-genai

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [44]:
google_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
google_vector = google_embeddings.embed_query(sample_text)
print(f"Dimensionality: {len(google_vector)}")
print(f"Sample: {google_vector[:5]}")

Dimensionality: 768
Sample: [0.05168594419956207, -0.030764883384108543, -0.03062233328819275, -0.02802734449505806, 0.01813092641532421]


#### Hugging Face

In [None]:
pip install sentence_transformers==3.0.1

In [25]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [47]:
hf_embeddings = HuggingFaceEmbeddings()
hf_vector = hf_embeddings.embed_query(sample_text)
print(f"Dimensionality: {len(hf_vector)}")
print(f"Sample: {hf_vector[:5]}")

Dimensionality: 768
Sample: [0.03492263704538345, 0.018829984590411186, -0.017854740843176842, 0.0001388351374771446, 0.0740736797451973]


# Astra DB

For using Datastax Astra DB:
- Create a database in https://astra.datastax.com/
- Obtain your database API endpoint, located under Database Details > API Endpoint, and save it as an environment variable called: ASTRA_DB_API_ENDPOINT
- Generate a token and save it as an environment variable called: ASTRA_DB_APPLICATION_TOKEN

In [None]:
pip install langchain-astradb==0.3.3

In [17]:
from langchain_astradb import AstraDBVectorStore

In [19]:
vstore = AstraDBVectorStore(
    embedding=embe,
    collection_name="astra_vector_db",
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
)

ValueError: Either an `embedding` or a `collection_vector_service_options`                    must be provided.