# **Vector Stores and Retrievers**

## **What's Covered?**
- Introduction
- Vector Store Key Attributes
- Setting a connection with ChromaDB
- Similarity Search
- Similarity Search with Score
- Similarity Search by Vector

## **Introduction**
One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the embedding vectors that are 'most similar' to the embedded query are retrieved. A vector store takes care of storing embedded data and performing vector search for you.

<img src="images/langchain_rag.jpg">

This walkthrough showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vectors to put in them, which are usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the text embedding model interfaces before diving into this.

There are many great vector store options. Here are a few that are free, open-source, and run entirely on your local machine. Review the full list of integrations for the many hosted offerings.
- Chroma
- FAISS (Facebook AI Similarity Search)

This walkthrough uses the `chroma` vector database, which runs on your local machine as a library.

## **Vector Store Key Attributes**
- Can store large d-dimensional vectors.
- Can map each embedded vector directly to its associated text document.
- Can be queried, allowing for a cosine similarity search between a new vector and the stored vectors.
- Can easily add, update, or delete vectors.
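
The querying attribute above rests on comparing vectors. As a minimal sketch (plain Python, with made-up 3-dimensional vectors standing in for real embeddings), cosine similarity can be computed like this; the function name `cosine_similarity` and the toy data are assumptions for illustration, not part of any library used in this walkthrough.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values, for illustration only)
query_vec = [1.0, 0.0, 1.0]
stored = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every stored vector against the query and pick the best match
scores = {name: cosine_similarity(query_vec, vec) for name, vec in stored.items()}
best = max(scores, key=scores.get)
print(best)
```

A vector pointing in the same direction as the query scores close to 1, while an orthogonal one scores 0, regardless of vector length.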


## **Steps**
1. Load a Document
2. Split it into Chunks
3. Generate vectors for each Chunk i.e. numerical embeddings
4. Store the vectors in a Vector Store
5. Apply similarity search, etc...

In [1]:
# Install ChromaDB Library

# !pip install chromadb
# !pip install langchain-chroma

In [7]:
# Install ChromaDB Library
!pip install chromadb

Collecting chromadb
  Using cached chromadb-0.5.15-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Using cached chroma_hnswlib-0.7.6.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi>=0.95.2 (from chromadb)
  Using cached fastapi-0.115.2-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Using cached uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Using cached posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting opentelemetr

  error: subprocess-exited-with-error
  
  Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  exit code: 1
  
  [5 lines of output]
  running bdist_wheel
  running build
  running build_ext
  building 'hnswlib' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for chroma-hnswlib
ERROR: Could not build wheels for chroma-hnswlib, which is required to install pyproject.toml-based projects



In [15]:
!pip install --pre chromadb

Collecting chromadb
  Using cached chromadb-0.5.15-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Using cached chroma_hnswlib-0.7.6.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi>=0.95.2 (from chromadb)
  Using cached fastapi-0.115.2-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Using cached uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Using cached posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting opentelemetr

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-status 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.5 which is incompatible.
langchain 0.3.3 requires tenacity!=8.4.0,<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.
langchain-community 0.3.2 requires tenacity!=8.4.0,<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.
langchain-core 0.3.10 requires tenacity!=8.4.0,<9.0.0,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.
pycaret 3.3.2 requires pandas<2.2.0, but you have pandas 2.2.3 which is incompatible.
streamlit 1.38.0 requires tenacity<9,>=8.1.0, but you have tenacity 9.0.0 which is incompatible.


In [19]:
#!python --version

Python 3.12.4


In [21]:
#!pip cache purge

Files removed: 3321


In [23]:
#!pip install langchain-chroma

Collecting langchain-chroma
  Downloading langchain_chroma-0.1.4-py3-none-any.whl.metadata (1.6 kB)
Collecting tenacity>=8.2.3 (from chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Downloading langchain_chroma-0.1.4-py3-none-any.whl (10 kB)
Downloading tenacity-8.5.0-py3-none-any.whl (28 kB)
Installing collected packages: tenacity, langchain-chroma
  Attempting uninstall: tenacity
    Found existing installation: tenacity 9.0.0
    Uninstalling tenacity-9.0.0:
      Successfully uninstalled tenacity-9.0.0
Successfully installed langchain-chroma-0.1.4 tenacity-8.5.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pycaret 3.3.2 requires pandas<2.2.0, but you have pandas 2.2.3 which is incompatible.



## **Step 1: Initialize an Embedding Model**

In [5]:
# Read the OpenAI API key from a local file and strip any trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

In [7]:
# Step 1 - Initialize an embedding_model
# We are just loading OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

## **Step 2: Setting a Connection with the ChromaDB**

In [29]:
!pip install langchain chromadb



In [9]:
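# Note: this import path is deprecated in recent LangChain releases;
# `from langchain_chroma import Chroma` is the maintained replacement.
from langchain.vectorstores import Chroma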


In [11]:
# Step 2 - Initialize a ChromaDB Connection
#from langchain_chroma import Chroma

# Initialize the database connection.
# If the database already exists, this connects to the given collection_name
# inside persist_directory; otherwise a new collection is created.
db = Chroma(collection_name="vector_database",
            embedding_function=embedding_model,
            persist_directory="./chroma_db_")

  db = Chroma(collection_name="vector_database",
⚠️ It looks like you upgraded from a version below 0.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


In [35]:
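# Inspect the current contents of the collection.
# On a first run the database is empty; the output below comes from a
# persisted database that already holds entries from an earlier run.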

db.get()

{'ids': ['00203423-8aa1-4dc9-9c9f-5911402efdf5',
  'fb169cce-fba7-43f5-a9b4-e66e5e0c470e',
  '3000b91e-88b1-460e-837b-f86dd98d917f',
  '537fd001-d609-431f-924f-6cc416d5667c',
  '7f056922-84b6-47c9-94c8-d8de0086b9e4',
  'ef65584b-1b0a-42fd-9a03-f0ee01f1e37d',
  'fda8dbf6-6534-4bff-8259-01d715e875d7',
  '19cbb8f7-4098-489e-a040-cafe765298f7',
  '0af0d6ad-915b-4f54-8145-1a371477b955',
  '1443d7d3-66b3-4650-a6cc-1b2237050dba',
  'd5dbdb2b-0c3a-4c25-95c9-89e39d974eff',
  '8d6ffda4-d7cd-4baf-86c7-a1abc1675223',
  'cd33a1df-c6da-45c4-8eef-768e38a70d57',
  'f6c41e30-e554-4226-8228-c4bec122bd26',
  '23f5af16-c166-4fe9-bd54-89b130be8253',
  '6f34b360-059a-4b77-b632-73f81a3f84d0',
  'b4e7d5b7-209e-40d6-8625-1850f79da4cb',
  'a4f810d9-e871-4007-beae-e7d9847250b3',
  '9d5d5bab-20c2-453d-9532-6cc9c925c749',
  '4badfa81-fa47-484d-ab4d-598d3abdbc4d',
  '035fbf4f-0707-408b-a4db-74009ea6d226',
  'c58fb4e5-52ad-4292-ab08-fc0ededbd105',
  'cde89f0a-5f70-4ef3-ba62-5278765634a4',
  '666a02ed-c88f-4507-8bf7-

## **Step 3: Prepare Documents by Splitting them into Chunks**

In [13]:
# Step 3 a - Load a document
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|█████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 605.33it/s]


In [15]:
# Step 3 b - Split the document into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300)

chunks = text_splitter.split_documents(data)

print(len(chunks))
print()
print(type(chunks[0]))
print()
print(chunks[0])

1004

<class 'langchain_core.documents.base.Document'>

page_content='1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:00:07,179
Ross was in love
with Rachel since forever.

3
00:00:07,423 --> 00:00:10,437
Every time he tried to tell her,
something got in the way...

4
00:00:10,651 --> 00:00:12,529
...Iike cats, Italian guys.

5
00:00:12,736 --> 00:00:15,922
And finally, Chandler was,
like, "Forget about her."

6
00:00:16,166 --> 00:00:20,762
When Ross was in China, Chandler
let it slip that Ross loved Rachel.

7
00:00:20,975 --> 00:00:22,818
She was, like, "Oh, my God!"

8
00:00:23,061 --> 00:00:25,845
So she went to the airport to meet him.

9
00:00:26,089 --> 00:00:29,710
She didn 't know Ross was getting
off the plane with another woman.' metadata={'source': 'data\\subtitles\\Friends_2x01.srt'}


In [41]:
print(chunks[-1])

page_content='385
00:23:23,532 --> 00:23:25,056
How are you doing?

386
00:23:25,267 --> 00:23:28,430
I don't know. It's weird
not being with Ross...

387
00:23:28,671 --> 00:23:30,764
... but I guess I'm doing okay.

388
00:23:31,040 --> 00:23:33,838
Actually, I've got some
of his stuff that he....

389
00:23:42,038 --> 00:23:44,438
-Have we met?
-I don't think so.

390
00:23:44,438 --> 00:23:46,038
You look so familiar.

391
00:23:47,838 --> 00:23:52,538
May be we can talk about it
over a cup of coffee.

392
00:23:52,538 --> 00:23:57,538
You know what I'm sick of coffe.
Let's go and get some juice.

393
00:23:58,038 --> 00:24:02,738
all right! bye!' metadata={'source': 'data\\subtitles\\Friends_2x10.srt'}
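
The chunks above always break between subtitle entries, because the splitter tries blank lines as a separator before anything finer. A rough sketch of that idea follows: it greedily packs blank-line-separated blocks into chunks under a size budget. This is a simplification, not the `RecursiveCharacterTextSplitter` implementation: the function name `pack_blocks` is hypothetical, and size is counted in whitespace-separated words as a stand-in for tiktoken tokens.

```python
def pack_blocks(text: str, max_words: int = 300) -> list[str]:
    """Greedily pack blank-line-separated blocks into chunks of at most max_words words.

    A single block longer than max_words is kept whole as its own chunk.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks, current, current_len = [], [], 0
    for block in blocks:
        n = len(block.split())
        # Start a new chunk when adding this block would exceed the budget
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Two subtitle entries taken from the first chunk shown above (12 words each)
sample = (
    "1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far."
    "\n\n"
    "2\n00:00:04,395 --> 00:00:07,179\nRoss was in love\nwith Rachel since forever."
)

two_chunks = pack_blocks(sample, max_words=15)  # budget too small for both entries
one_chunk = pack_blocks(sample, max_words=30)   # both entries fit in one chunk
print(len(two_chunks), len(one_chunk))
```

With a budget of 15 words the two entries land in separate chunks; with 30 they share one. The real splitter additionally falls back to finer separators when a single block is too large.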


## **Step 4: Add Chunks to Vector DB**

In [None]:
db.add_documents(chunks)

In [10]:
# Optionally verify that the documents were indexed
# db.get()

# Count the entries currently stored in the collection
# print(len(db.get()["ids"]))
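
Because the database persists on disk, re-running `db.add_documents(chunks)` stores the same chunks again under fresh random IDs. One possible way to keep re-runs from piling up duplicates is to derive a stable ID from each chunk. The sketch below is only an illustration: `chunk_id` is a hypothetical helper, the sample pairs are stand-ins for real chunks, and passing `ids=` to `add_documents` is assumed to overwrite entries with matching IDs (worth verifying against your installed version).

```python
import hashlib

def chunk_id(source: str, text: str) -> str:
    """Derive a stable ID from a chunk's source path and its text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]

# Stand-in (source, text) pairs; with real chunks these would come from
# chunk.metadata["source"] and chunk.page_content.
pairs = [
    ("data/subtitles/Friends_2x01.srt", "This is pretty much what's happened so far."),
    ("data/subtitles/Friends_2x01.srt", "Ross was in love with Rachel since forever."),
]
ids = [chunk_id(source, text) for source, text in pairs]
print(ids)

# Hypothetical usage with the store from this walkthrough (not executed here):
# db.add_documents(
#     chunks,
#     ids=[chunk_id(c.metadata["source"], c.page_content) for c in chunks],
# )
```

Two chunks with identical text from the same file would produce the same ID, so a chunk index could be mixed into the hash if that case matters.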

## **Similarity Search**

In [11]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search(query=query, k=5)

print(len(relevant_chunks))

5


In [12]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))
print()
print("****PRINTING THE DOCUMENT:\n", relevant_chunks[0].page_content)

Type of output: <class 'list'>

Type of each item in output: <class 'langchain_core.documents.base.Document'>

Number of output docs: 5

****PRINTING THE DOCUMENT:
 126
00:07:28,029 --> 00:07:29,621
He's gonna stay with Julie.

127
00:07:29,864 --> 00:07:31,957
He's gonna stay with her
and she'll be:

128
00:07:32,233 --> 00:07:34,463
"Hi, I'm Julie. Ross picked me.

129
00:07:34,736 --> 00:07:38,797
We'll get married and have lots
of kids and dig up stuff together!"

130
00:07:40,475 --> 00:07:43,137
No offense, but that
sounds nothing like her.

131
00:07:46,080 --> 00:07:50,073
What am I gonna do?
This is like a complete nightmare!

132
00:07:50,318 --> 00:07:54,448
I know. This must be so hard.
"Oh, no! Two women love me!

133
00:07:55,790 --> 00:07:59,055
They're both gorgeous,
my wallet's too small for my 50s...

134
00:07:59,260 --> 00:08:01,751
...and my diamond shoes are too tight!"


In [13]:
print(relevant_chunks[0].metadata)

{'source': 'data/subtitles/Friends_2x08.srt'}


In [14]:
[doc.metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]
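
Conceptually, a similarity search ranks every stored vector by its distance to the query vector and returns the `k` closest documents. The brute-force sketch below shows that ranking on made-up 2-dimensional vectors. It is an illustration under simplifying assumptions: `brute_force_search` and the toy index are hypothetical, and a real store such as Chroma uses an approximate nearest-neighbour index (HNSW) rather than scanning every vector.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query_vec, index, k=5):
    """index: list of (text, vector) pairs. Returns the k texts nearest to query_vec."""
    nearest = heapq.nsmallest(k, index, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in nearest]

# Toy index with made-up vectors (not real embeddings)
index = [
    ("chunk about Julie", [0.9, 0.1]),
    ("chunk about coffee", [0.1, 0.9]),
    ("chunk about Rachel", [0.8, 0.3]),
]

results = brute_force_search([1.0, 0.0], index, k=2)
print(results)
```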

## **Similarity Search With Score**

In [15]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search_with_score(query=query, k=5)

print(len(relevant_chunks))

5


In [16]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))
print()
print("****PRINTING THE DOCUMENT:\n", relevant_chunks[0])

Type of output: <class 'list'>

Type of each item in output: <class 'tuple'>

Number of output docs: 5

****PRINTING THE DOCUMENT:
 (Document(metadata={'source': 'data/subtitles/Friends_2x08.srt'}, page_content='126\n00:07:28,029 --> 00:07:29,621\nHe\'s gonna stay with Julie.\n\n127\n00:07:29,864 --> 00:07:31,957\nHe\'s gonna stay with her\nand she\'ll be:\n\n128\n00:07:32,233 --> 00:07:34,463\n"Hi, I\'m Julie. Ross picked me.\n\n129\n00:07:34,736 --> 00:07:38,797\nWe\'ll get married and have lots\nof kids and dig up stuff together!"\n\n130\n00:07:40,475 --> 00:07:43,137\nNo offense, but that\nsounds nothing like her.\n\n131\n00:07:46,080 --> 00:07:50,073\nWhat am I gonna do?\nThis is like a complete nightmare!\n\n132\n00:07:50,318 --> 00:07:54,448\nI know. This must be so hard.\n"Oh, no! Two women love me!\n\n133\n00:07:55,790 --> 00:07:59,055\nThey\'re both gorgeous,\nmy wallet\'s too small for my 50s...\n\n134\n00:07:59,260 --> 00:08:01,751\n...and my diamond shoes are too tight!"')

In [17]:
[doc[0].metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]

In [18]:
# Scores: Chroma returns a distance here, so a lower score means more similar

[doc[1] for doc in relevant_chunks]

[0.3837057650089264,
 0.3851888179779053,
 0.39051416516304016,
 0.39062416553497314,
 0.39110222458839417]
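
The scores above can be related back to cosine similarity under two assumptions that are not stated in this notebook: that the collection uses Chroma's default squared L2 distance, and that the embedding vectors have unit length. For unit vectors, the squared distance equals `2 - 2 * cos`, so `cos = 1 - d / 2`. The sketch below checks that identity on toy vectors; `cosine_from_squared_l2` is a hypothetical helper name.

```python
import math

def cosine_from_squared_l2(distance: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - distance / 2.0

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy unit vectors (made-up values) to check the identity numerically
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

dist = sum((x - y) ** 2 for x, y in zip(a, b))   # squared L2 distance
cos_direct = sum(x * y for x, y in zip(a, b))    # cosine via dot product
cos_from_dist = cosine_from_squared_l2(dist)     # cosine via the identity
print(cos_direct, cos_from_dist)

# Applied to the top score above, valid only if both assumptions hold
top_cos = cosine_from_squared_l2(0.3837057650089264)
print(top_cos)
```

If the collection was created with a different distance metric, the conversion does not apply and the raw scores should be read in that metric instead.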

## **Similarity Search By Vector**

In [19]:
query = "What is their on Julie vs Rachels List?"

query_vector = embedding_model.embed_query(query)

relevant_chunks = db.similarity_search_by_vector(embedding=query_vector, k=5)

print(len(relevant_chunks))

5


In [20]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))

Type of output: <class 'list'>

Type of each item in output: <class 'langchain_core.documents.base.Document'>

Number of output docs: 5


In [21]:
[doc.metadata for doc in relevant_chunks]

[{'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'},
 {'source': 'data/subtitles/Friends_2x04.srt'},
 {'source': 'data/subtitles/Friends_2x08.srt'}]
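
The results match the text-based search because searching by text amounts to embedding the query and then searching by vector. The toy store below illustrates that call flow only. It is not Chroma's implementation: `ToyVectorStore` and `toy_embed` are hypothetical, and the "embedding" is just a count of three letters.

```python
class ToyVectorStore:
    """Tiny in-memory store; illustrates the call flow, not Chroma's internals."""

    def __init__(self, embed):
        self.embed = embed  # function: str -> list[float]
        self.texts = []
        self.vectors = []

    def add_texts(self, texts):
        for text in texts:
            self.texts.append(text)
            self.vectors.append(self.embed(text))

    def similarity_search_by_vector(self, embedding, k=5):
        def sq_l2(v):
            return sum((x - y) ** 2 for x, y in zip(embedding, v))
        order = sorted(range(len(self.vectors)), key=lambda i: sq_l2(self.vectors[i]))
        return [self.texts[i] for i in order[:k]]

    def similarity_search(self, query, k=5):
        # Text search = embed the query, then search by vector
        return self.similarity_search_by_vector(self.embed(query), k=k)


def toy_embed(text):
    """Hypothetical embedding: counts of the letters a, e, o (not a real model)."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "aeo"]


store = ToyVectorStore(toy_embed)
store.add_texts(["coffee", "banana", "tree"])

by_text = store.similarity_search("see", k=2)
by_vector = store.similarity_search_by_vector(toy_embed("see"), k=2)
print(by_text, by_vector)
```

Searching by vector is useful when the query embedding is already available, for example when it was computed once and reused across several stores.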