# **Vector Stores and Retrievers**

## **What's Covered?**
- Introduction
- Vector Store Key Attributes
- Setting a connection with ChromaDB
- Similarity Search
- Similarity Search with Score
- Similarity Search by Vector

## **Introduction**
One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded the same way, and the stored vectors that are "most similar" to the query vector are retrieved. A vector store takes care of storing the embedded data and performing the vector search for you.

<img src="images/langchain_rag.jpg">

This walkthrough showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vectors to put in them, which is usually done with an embedding model. It is therefore recommended that you familiarize yourself with the text embedding model interfaces before diving into this.
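
The "most similar" retrieval described above can be illustrated with a minimal sketch in plain Python. The three-dimensional vectors and document names below are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and a vector store would use an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" keyed by a made-up document name.
stored = {
    "doc_about_cats": [0.9, 0.1, 0.0],
    "doc_about_dogs": [0.8, 0.3, 0.1],
    "doc_about_taxes": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

# Rank stored documents from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(stored[name], query_vector),
    reverse=True,
)
print(ranked)
```

The two pet-related vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one.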

There are many great vector store options. Here are a few that are free, open-source, and run entirely on your local machine. Review the full list of integrations for the many hosted offerings.
- Chroma
- FAISS (Facebook AI Similarity Search)

This walkthrough uses the `chroma` vector database, which runs on your local machine as a library.

## **Vector Store Key Attributes**
- Can store large d-dimensional vectors.
- Can map each embedded vector directly to the text document it was created from.
- Can be queried, allowing a cosine similarity search between a new vector and the stored vectors.
- Can easily add, update, or delete vectors.
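
To make these attributes concrete, here is a minimal in-memory sketch. `ToyVectorStore` and its method names are hypothetical and are not how Chroma is implemented; the sketch only shows the vector-to-text mapping, add/update/delete by ID, and a cosine similarity query.

```python
import math

class ToyVectorStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self):
        self._rows = {}  # id -> (vector, text)

    def upsert(self, doc_id, vector, text):
        # Adding and updating are the same operation: the ID is the key.
        self._rows[doc_id] = (list(vector), text)

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def query(self, vector, k=2):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = [(cosine(vec, vector), text) for vec, text in self._rows.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], "Ross and Rachel")
store.upsert("b", [0.0, 1.0], "Chandler's job")
store.upsert("c", [0.9, 0.1], "Ross and Julie")

print(store.query([1.0, 0.05], k=2))  # the two vectors closest in direction
store.delete("a")
print(store.query([1.0, 0.05], k=2))  # "a" no longer appears
```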


## **Steps**
1. Load a Document
2. Split it into Chunks
3. Generate a vector (a numerical embedding) for each Chunk
4. Store the vectors in a Vector Store
5. Run a similarity search over the stored vectors

In [1]:
# Install ChromaDB Library

# !pip install chromadb
# !pip install langchain-chroma

## **Step 1: Initialize an Embedding Model**

In [2]:
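# Read the OpenAI API key from a local file (keep this file out of version control)
with open('keys/.openai_api_key.txt') as f:
    # strip() removes a trailing newline that would otherwise end up in the key
    OPENAI_API_KEY = f.read().strip()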
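
As an optional variation, the key can be looked up in an environment variable first, with the key file as a fallback. The helper name `load_api_key` is hypothetical, and the default file path simply mirrors the one used in this notebook.

```python
import os
from pathlib import Path

def load_api_key(env_var="OPENAI_API_KEY", fallback_path="keys/.openai_api_key.txt"):
    """Return the key from the environment, else from a file, with whitespace stripped."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    path = Path(fallback_path)
    if path.is_file():
        return path.read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or {fallback_path}")

# Usage in the notebook would look like:
# OPENAI_API_KEY = load_api_key()
```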

In [3]:
# Step 1 - Initialize an embedding_model
# We are just loading OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)



## **Step 2: Setting a Connection with the ChromaDB**

In [4]:
# Step 2 - Initialize a ChromaDB Connection
from langchain_chroma import Chroma

# Initialize the database connection
# If the database already exists, this connects to collection_name inside persist_directory
# Otherwise a new collection is created
db = Chroma(collection_name="vector_database", 
            embedding_function=embedding_model, 
            persist_directory="./chroma_db_")

In [5]:
# Initially the database is empty

db.get()

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

## **Step 3: Prepare Documents by Splitting them into Chunks**

In [6]:
# Step 3 a - Load a document
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 4229.83it/s]


In [7]:
# Step 3 b - Split the document into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300)

chunks = text_splitter.split_documents(data)

print(len(chunks))
print()
print(type(chunks[0]))
print()
print(chunks[0])

1004

<class 'langchain_core.documents.base.Document'>

page_content='1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...Iike cats, Italian guys.

5
00:00:12,736 --> 00:00:15,922
And finally, Chandler was,
like, "Forget about her."

6
00:00:16,166 --> 00:00:20,762
When Ross was in China, Chandler
let it slip that Ross loved Rachel.

7
00:00:20,975 --> 00:00:22,818
She was, like, "Oh, my God!"

8
00:00:23,061 --> 00:00:25,845
So she went to the airport to meet him.

9
00:00:26,089 --> 00:00:29,710
She didn 't know Ross was getting
off the plane with another woman.' metadata={'source': 'data/subtitles/Friends_2x01.srt'}


In [8]:
print(chunks[-1])

page_content='346
00:22:02,745 --> 00:22:05,407
Is that what you did
while I got gas?

347
00:22:05,781 --> 00:22:07,442
You guys!

348
00:22:08,551 --> 00:22:09,643
And for Ross...

349
00:22:09,919 --> 00:22:11,580
... Mr. Sweet Tooth.

350
00:22:15,391 --> 00:22:17,325
You got me a cola drink?

351
00:22:18,494 --> 00:22:20,291
And a lemon-lime!

352
00:22:22,331 --> 00:22:27,200
Well, this is too much! I feel like
I should get you another sweater.

353
00:22:29,438 --> 00:22:31,531
And last, but not least....

354
00:22:33,943 --> 00:22:36,104
They're ribbed for your pleasure!' metadata={'source': 'data/subtitles/Friends_2x09.srt'}
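
To give a rough feel for what the splitter does, here is a simplified sketch: split on blank lines (which separate subtitle blocks) and greedily pack the pieces into chunks under a size budget. It is not the `RecursiveCharacterTextSplitter` algorithm: it counts whitespace-separated words rather than tiktoken tokens, ignores chunk overlap, and does not fall back to smaller separators when a single piece exceeds the budget. The function name and `max_words` parameter are made up for this illustration.

```python
def split_into_chunks(text, max_words=25, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most max_words words."""
    chunks, current, current_len = [], [], 0
    for piece in text.split(separator):
        piece_len = len(piece.split())
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_len + piece_len > max_words:
            chunks.append(separator.join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += piece_len
    if current:
        chunks.append(separator.join(current))
    return chunks

# The first three subtitle blocks from the output above.
sample = (
    "1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n"
    "2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever.\n\n"
    "3\n00:00:07,423 --> 00:00:10,437\nEvery time he tried to tell her,\nsomething got in the way..."
)

for chunk in split_into_chunks(sample, max_words=25):
    print(repr(chunk))
```

Because whole subtitle blocks are kept together, a chunk never cuts a timestamp away from its dialogue line, which is the same reason the real splitter tries paragraph-level separators first.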


## **Step 4: Add Chunks to Vector DB**

In [9]:
db.add_documents(chunks)

['00203423-8aa1-4dc9-9c9f-5911402efdf5',
 'fb169cce-fba7-43f5-a9b4-e66e5e0c470e',
 '3000b91e-88b1-460e-837b-f86dd98d917f',
 '537fd001-d609-431f-924f-6cc416d5667c',
 '7f056922-84b6-47c9-94c8-d8de0086b9e4',
 'ef65584b-1b0a-42fd-9a03-f0ee01f1e37d',
 'fda8dbf6-6534-4bff-8259-01d715e875d7',
 '19cbb8f7-4098-489e-a040-cafe765298f7',
 '0af0d6ad-915b-4f54-8145-1a371477b955',
 '1443d7d3-66b3-4650-a6cc-1b2237050dba',
 'd5dbdb2b-0c3a-4c25-95c9-89e39d974eff',
 '8d6ffda4-d7cd-4baf-86c7-a1abc1675223',
 'cd33a1df-c6da-45c4-8eef-768e38a70d57',
 'f6c41e30-e554-4226-8228-c4bec122bd26',
 '23f5af16-c166-4fe9-bd54-89b130be8253',
 '6f34b360-059a-4b77-b632-73f81a3f84d0',
 'b4e7d5b7-209e-40d6-8625-1850f79da4cb',
 'a4f810d9-e871-4007-beae-e7d9847250b3',
 '9d5d5bab-20c2-453d-9532-6cc9c925c749',
 '4badfa81-fa47-484d-ab4d-598d3abdbc4d',
 '035fbf4f-0707-408b-a4db-74009ea6d226',
 'c58fb4e5-52ad-4292-ab08-fc0ededbd105',
 'cde89f0a-5f70-4ef3-ba62-5278765634a4',
 '666a02ed-c88f-4507-8bf7-eb433ac15c77',
 'd198e74d-3ffa-

In [10]:
# You can check whether the documents have been indexed
# db.get()

# You can also count the entries already stored
# print(len(db.get()["ids"]))
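
Because the IDs above are random UUIDs, re-running the cell would add the same chunks again under new IDs. One possible way around this, sketched here under assumptions, is to derive a repeatable ID from each chunk's source and text. `stable_chunk_id` is a hypothetical helper; passing `ids=` to `add_documents`, and how an existing ID is handled, should be checked against the installed `langchain-chroma` version.

```python
import hashlib

def stable_chunk_id(source, content):
    """Derive a repeatable ID from a chunk's source path and text."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
    return digest[:32]

# Hypothetical stand-ins for (metadata["source"], page_content) pairs.
pairs = [
    ("data/subtitles/Friends_2x01.srt", "This is pretty much\nwhat's happened so far."),
    ("data/subtitles/Friends_2x01.srt", "Ross was in love\nwith Rachel since forever."),
]
ids = [stable_chunk_id(src, text) for src, text in pairs]

# Same inputs always give the same IDs, so a second run can be detected.
print(ids == [stable_chunk_id(src, text) for src, text in pairs])

# With the notebook's objects this might look like (assumption, not run here):
# ids = [stable_chunk_id(c.metadata["source"], c.page_content) for c in chunks]
# db.add_documents(chunks, ids=ids)
```

Note that two chunks with identical text from the same file would receive the same ID under this scheme.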

## **Similarity Search**

In [11]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search(query=query, k=5)

print(len(relevant_chunks))

5


In [12]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))
print()
print("****PRINTING THE DOCUMENT:\n", relevant_chunks[0].page_content)

Type of output: <class 'list'>

Type of each item in output: <class 'langchain_core.documents.base.Document'>

Number of output docs: 5

****PRINTING THE DOCUMENT:
 126
00:07:28,029 --> 00:07:29,621
He's gonna stay with Julie.

127
00:07:29,864 --> 00:07:31,957
He's gonna stay with her
and she'll be:

128
00:07:32,233 --> 00:07:34,463
"Hi, I'm Julie. Ross picked me.

129
00:07:34,736 --> 00:07:38,797
We'll get married and have lots
of kids and dig up stuff together!"

130
00:07:40,475 --> 00:07:43,137
No offense, but that
sounds nothing like her.

131
00:07:46,080 --> 00:07:50,073
What am I gonna do?
This is like a complete nightmare!

132
00:07:50,318 --> 00:07:54,448
I know. This must be so hard.
"Oh, no! Two women love me!

133
00:07:55,790 --> 00:07:59,055
They're both gorgeous,
my wallet's too small for my 50s...

134
00:07:59,260 --> 00:08:01,751
...and my diamond shoes are too tight!"


In [13]:
print(relevant_chunks[0].metadata)

{'source': 'data/subtitles/Friends_2x08.srt'}


In [14]:
[doc.metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]
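
The metadata makes it easy to see which files the results came from. A small sketch using the source paths from the output above:

```python
from collections import Counter

# Source paths copied from the metadata output above.
sources = [
    "data/subtitles/Friends_2x08.srt",
    "data/subtitles/Friends_2x08.srt",
    "data/subtitles/Friends_2x08.srt",
    "data/subtitles/Friends_2x04.srt",
    "data/subtitles/Friends_2x08.srt",
]

counts = Counter(sources)
for source, n in counts.most_common():
    print(f"{n} chunk(s) from {source}")
```

In the notebook itself, the same list could be built with `[doc.metadata["source"] for doc in relevant_chunks]`.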

## **Similarity Search With Score**

In [15]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search_with_score(query=query, k=5)

print(len(relevant_chunks))

5


In [16]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))
print()
print("****PRINTING THE DOCUMENT:\n", relevant_chunks[0])

Type of output: <class 'list'>

Type of each item in output: <class 'tuple'>

Number of output docs: 5

****PRINTING THE DOCUMENT:
 (Document(metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='126\n00:07:28,029 --> 00:07:29,621\nHe\'s gonna stay with Julie.\n\n127\n00:07:29,864 --> 00:07:31,957\nHe\'s gonna stay with her\nand she\'ll be:\n\n128\n00:07:32,233 --> 00:07:34,463\n"Hi, I\'m Julie. Ross picked me.\n\n129\n00:07:34,736 --> 00:07:38,797\nWe\'ll get married and have lots\nof kids and dig up stuff together!"\n\n130\n00:07:40,475 --> 00:07:43,137\nNo offense, but that\nsounds nothing like her.\n\n131\n00:07:46,080 --> 00:07:50,073\nWhat am I gonna do?\nThis is like a complete nightmare!\n\n132\n00:07:50,318 --> 00:07:54,448\nI know. This must be so hard.\n"Oh, no! Two women love me!\n\n133\n00:07:55,790 --> 00:07:59,055\nThey\'re both gorgeous,\nmy wallet\'s too small for my 50s...\n\n134\n00:07:59,260 --> 00:08:01,751\n...and my diamond shoes are too tight!"')

In [17]:
[doc[0].metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]

In [18]:
# Score for each result (Chroma returns a distance, so a lower score means more similar)

[doc[1] for doc in relevant_chunks]

[0.3837057650089264,
 0.3851888179779053,
 0.39051416516304016,
 0.39062416553497314,
 0.39110222458839417]
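
If the collection uses squared L2 distance (believed to be Chroma's default, but worth verifying for your setup) and the embeddings are unit length, the distance and the cosine similarity are related by `distance = 2 - 2 * cosine`. The sketch below checks that identity on toy vectors and then applies it to the scores above; the converted values are only meaningful under those two assumptions.

```python
import math

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Two unit-length toy vectors.
a = [1.0, 0.0]
b = [0.6, 0.8]
d = squared_l2(a, b)
print(math.isclose(cosine(a, b), 1 - d / 2))  # the identity holds for unit vectors

# Scores copied from the output above, converted under the stated assumptions.
scores = [
    0.3837057650089264,
    0.3851888179779053,
    0.39051416516304016,
    0.39062416553497314,
    0.39110222458839417,
]
approx_cosines = [1 - s / 2 for s in scores]
print([round(c, 3) for c in approx_cosines])
```

The five scores are very close to each other, so the ranking among these chunks is not a strong signal on its own.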

## **Similarity Search By Vector**

In [19]:
query = "What is their on Julie vs Rachels List?"

query_vector = embedding_model.embed_query(query)

relevant_chunks = db.similarity_search_by_vector(embedding=query_vector, k=5)

print(len(relevant_chunks))

5


In [20]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))

Type of output: <class 'list'>

Type of each item in output: <class 'langchain_core.documents.base.Document'>

Number of output docs: 5


In [21]:
[doc.metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]
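
The results match the earlier text-based search because searching by text amounts to embedding the query and then searching by vector. The toy sketch below makes that relationship explicit. `toy_embed` and `ToyStore` are hypothetical stand-ins, and the character-bucket "embedding" carries no semantic meaning.

```python
import math

def toy_embed(text, dims=8):
    """Hypothetical embedding: counts of alphanumeric characters bucketed by code point."""
    vec = [0.0] * dims
    for ch in text.lower():
        if ch.isalnum():
            vec[ord(ch) % dims] += 1.0
    return vec

class ToyStore:
    def __init__(self, texts):
        self._rows = [(toy_embed(t), t) for t in texts]

    def similarity_search_by_vector(self, embedding, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.hypot(*a) * math.hypot(*b)
            return dot / norm if norm else 0.0

        ranked = sorted(self._rows, key=lambda row: cosine(row[0], embedding), reverse=True)
        return [text for _, text in ranked[:k]]

    def similarity_search(self, query, k=1):
        # Text search is just: embed the query, then search by vector.
        return self.similarity_search_by_vector(toy_embed(query), k=k)

store = ToyStore(["Ross picked Julie", "Chandler bought cola", "Rachel made a list"])
query = "Julie"
by_text = store.similarity_search(query, k=2)
by_vector = store.similarity_search_by_vector(toy_embed(query), k=2)
print(by_text == by_vector)
```

Searching by vector is useful when the query embedding is already available, for example when it has been cached or computed elsewhere.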