# Vector Stores: Leveraging FAISS for Efficient Similarity Search

This Jupyter Notebook explores **FAISS (Facebook AI Similarity Search)**, a high-performance library for efficient similarity search and clustering of dense vectors. FAISS is designed to handle very large datasets, including those that may not fit entirely in RAM, making it a powerful tool for large-scale information retrieval and RAG applications.

We'll demonstrate how to integrate FAISS with LangChain to create, query, and persist vector stores.

## What You Will Learn:

1.  **Understanding FAISS**: Its purpose and key features for vector search.
2.  **Building a FAISS Vector Store**: Loading documents, splitting them, embedding them with Ollama, and storing them in FAISS.
3.  **Querying the Vector Store**: Performing basic similarity searches.
4.  **Using FAISS as a LangChain Retriever**: Integrating FAISS into the standard LangChain `Retriever` interface.
5.  **Similarity Search with Scores**: Retrieving documents along with their distance scores (L2 distance) to the query.
6.  **Direct Vector Search**: Performing a similarity search directly with an embedding vector.
7.  **Saving and Loading FAISS Indexes**: Persisting your vector store to disk and loading it back, emphasizing the `allow_dangerous_deserialization` flag.

## Key Concepts:

* **FAISS**: An open-source library by Facebook AI Research for fast nearest-neighbor search in high-dimensional vector spaces.
* **Dense Vectors**: Numerical representations (embeddings) of data, where each dimension holds a continuous value.
* **Similarity Search**: Finding vectors in a database that are "closest" to a given query vector, based on a distance metric.
* **L2 Distance (Euclidean Distance)**: A common distance metric used by FAISS. A *lower* L2 score indicates *higher* similarity (closer vectors).
* **Vector Store**: A database or index optimized for storing and querying vector embeddings.
* **Retriever**: A LangChain interface that defines a method for retrieving documents given a query.
* **Persistence**: The ability to save the state of the vector store to disk and load it later, avoiding re-embedding documents.
* **`allow_dangerous_deserialization`**: An explicit opt-in flag required when loading a FAISS vector store that LangChain saved to disk. LangChain uses `pickle` to save the document store that accompanies the FAISS index, and unpickling data from an untrusted source can execute arbitrary code. Set the flag to `True` only if you trust the source of the saved index.
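To make the similarity-search and L2-distance concepts concrete, here is a minimal pure-Python sketch of a brute-force nearest-neighbor search. It is not how FAISS is implemented (FAISS uses optimized native code and optional approximate indexes); the function names `l2_distance` and `nearest_neighbors` and the 2-D vectors are illustrative assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k=2):
    """Return the k (index, distance) pairs closest to `query`, nearest first."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # lower distance = more similar
    return scored[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
result = nearest_neighbors([0.9, 1.2], vectors, k=2)
print(result)  # index 1 comes first, then index 0
```

Comparing the query against every stored vector, as this sketch does, costs time proportional to the number of vectors. Avoiding that cost on large collections is the main reason dedicated libraries such as FAISS exist.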

## Prerequisites:
Before running the code in this notebook, ensure you have:
1.  **Ollama installed and the `llama3` model pulled**: Run `ollama pull llama3`, the model that the code below passes to `OllamaEmbeddings`. If you prefer a different embedding model, pull it and change the `model` argument accordingly.
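2.  **Required libraries installed**: The imports below need `langchain-community`, `langchain-text-splitters`, and a FAISS build such as `faiss-cpu` (for example, `pip install langchain-community langchain-text-splitters faiss-cpu`).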

#### Faiss
Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

In [2]:
# Import necessary classes from LangChain and other libraries
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# 1. Load Documents
# Initialize TextLoader to load content from 'speech.txt'
loader = TextLoader("speech.txt")
documents = loader.load()

# 2. Split Documents
# Initialize CharacterTextSplitter to break down documents into smaller chunks
# chunk_size: maximum characters per chunk.
# chunk_overlap: overlapping characters between chunks for context.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=30)
docs = text_splitter.split_documents(documents)

# Display the split document chunks.
print("--- Split Documents ---")
print(docs)

# 3. Initialize Embeddings
# Initialize OllamaEmbeddings with the 'llama3' model.
# Ensure Ollama is running and the model has been pulled (`ollama pull llama3`).
# To use a different embedding model, pull it and change the `model` argument.
embeddings = OllamaEmbeddings(model="llama3")

# 4. Create FAISS Vector Store
# Create a FAISS vector store from the document chunks and the embeddings.
# FAISS will embed each chunk and store its vector.
db = FAISS.from_documents(docs, embeddings)

# Display the FAISS database object.
print("\n--- FAISS Database Created ---")
print(db)

--- Split Documents ---
[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…'), Document(metadata={'source': 'speech.txt'}, page_content='…\n\nIt will be all the easie
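The `chunk_size` and `chunk_overlap` parameters are easier to picture on a tiny input. The sketch below is a deliberately simplified sliding-window splitter and not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (a blank line by default) and then merges the pieces, so its chunk boundaries follow paragraphs. The function name `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is how overlap preserves context across a boundary.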

### Querying the FAISS Vector Store

Once the documents are embedded and stored in FAISS, we can perform similarity searches to retrieve relevant documents based on a query.

In [3]:
# Define a query string to search for similar content
query = "How does the speaker describe the desired outcome of the war?"

# Perform a similarity search on the FAISS database.
# This embeds the query and returns the most similar document chunks,
# most similar first (the top `k` chunks, where `k` defaults to 4).
docs_from_search = db.similarity_search(query)

# Print the page content of the first retrieved document.
print("--- Result from direct similarity_search ---")
print(docs_from_search[0].page_content)

--- Result from direct similarity_search ---
It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing you. There are, it may be, many months of fiery trial and sacrifice ahead of us. It is a fearful thing to lead this great peaceful people into war, into the most terrible and disastrous of all wars, civilization itself seeming to be in the balance. But the right is more precious than peace, and we shall fight for the things which we have always carried nearest our hearts—for democracy, for the right of those who submit to authority to have a voice in their own governments, for the rights and liberties of small nations, for a universal dominion of right by such a concert of free peoples as shall bring peace and safety to all nations and make the world itself at last free.


### Using FAISS as a LangChain Retriever

We can wrap the FAISS vector store in a LangChain retriever by calling `as_retriever()`. A retriever exposes a standardized interface (a query string goes in, a list of documents comes out), which makes it easy to plug the store into other LangChain components such as chains and agents.

In [4]:
# Convert the FAISS database into a LangChain Retriever.
# This abstracts away the underlying vector store implementation.
retriever = db.as_retriever()

# Invoke the retriever with the query.
# The `invoke` method is the standard way to use a LangChain retriever.
docs_from_retriever = retriever.invoke(query)

# Print the page content of the first document retrieved by the retriever.
print("--- Result from retriever.invoke ---")
print(docs_from_retriever[0].page_content)

--- Result from retriever.invoke ---
It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing you. There are, it may be, many months of fiery trial and sacrifice ahead of us. It is a fearful thing to lead this great peaceful people into war, into the most terrible and disastrous of all wars, civilization itself seeming to be in the balance. But the right is more precious than peace, and we shall fight for the things which we have always carried nearest our hearts—for democracy, for the right of those who submit to authority to have a voice in their own governments, for the rights and liberties of small nations, for a universal dominion of right by such a concert of free peoples as shall bring peace and safety to all nations and make the world itself at last free.
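The value of the retriever abstraction is that downstream code depends only on an `invoke` method, not on the storage backend. The following sketch illustrates that idea with a hypothetical `KeywordRetriever` that ranks texts by shared words. It is a stand-in for illustration only and does not reproduce LangChain's classes.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that maps a query string to a list of texts."""

    def invoke(self, query: str) -> List[str]: ...


class KeywordRetriever:
    """Hypothetical stand-in for a vector-store retriever: ranks by shared words."""

    def __init__(self, texts: List[str], k: int = 1) -> None:
        self.texts = texts
        self.k = k

    def invoke(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda t: len(query_words & set(t.lower().split())),
            reverse=True,  # most shared words first
        )
        return ranked[: self.k]


def answer_context(retriever: Retriever, question: str) -> str:
    """Downstream code only calls `invoke`; the backend can be swapped freely."""
    return "\n".join(retriever.invoke(question))


toy = KeywordRetriever(["peace and safety to all nations", "the price of bread"], k=1)
context = answer_context(toy, "what brings peace to nations")
print(context)  # peace and safety to all nations
```

With the real store, the number of returned documents can be set when creating the retriever, for example `db.as_retriever(search_kwargs={"k": 2})`.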


### Similarity Search with Score

The `similarity_search_with_score` method returns each document together with a distance score that indicates how close the document's embedding is to the query's embedding. LangChain's FAISS wrapper uses L2 (Euclidean) distance by default, so **a lower score means higher similarity**.

In [5]:
# Perform a similarity search and return the documents along with their scores.
# The score here is the L2 distance.
docs_and_score = db.similarity_search_with_score(query)

# Print the retrieved documents and their scores.
# You'll see tuples of (Document, score). Lower score is better.
print("--- Results from similarity_search_with_score (L2 Distance) ---")
print(docs_and_score)

--- Results from similarity_search_with_score (L2 Distance) ---
[(Document(id='07ddcdf7-5b77-4669-b88c-a60edee37b88', metadata={'source': 'speech.txt'}, page_content='It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing you. There are, it may be, many months of fiery trial and sacrifice ahead of us. It is a fearful thing to lead this great peaceful people into war, into the most terrible and disastrous of all wars, civilization itself seeming to be in the balance. But the right is more precious than peace, and we shall fight for the things which we have always carried nearest our hearts—for democracy, for the right of those who submit to authority to have a voice in their own governments, for the rights and liberties of small nations, for a universal dominion of right by such a concert of free peoples as shall bring peace and safety to all nations and make the world itself at last free.'), np.float32(21774.94)), (Document(id='f45
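A score such as `21774.94` can look alarming, but L2 scores are only meaningful relative to each other. Their magnitude depends on the length of the embedding vectors, and FAISS flat L2 indexes report the squared distance (no square root). The sketch below, using hypothetical 2-D vectors, shows how vector length inflates the score and how, after normalizing to unit length, squared L2 distance equals `2 - 2 * cosine_similarity`.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: 1 for the same direction, 0 for perpendicular vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as `a`, ten times longer
c = [4.0, -3.0]   # perpendicular to `a`

raw_score = squared_l2(a, b)                               # large despite identical direction
normalized_score = squared_l2(normalize(a), normalize(b))  # approximately 0
perpendicular_score = squared_l2(normalize(a), normalize(c))

print(raw_score, normalized_score)
print(perpendicular_score, 2 - 2 * cosine(a, c))  # both approximately 2
```

If scores need to be comparable across queries, normalizing embeddings before indexing is one common option. Whether that suits a given embedding model is a judgment call.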

### Similarity Search by Vector

Instead of providing a text query, you can also perform a similarity search directly with an embedding vector if you already have one.

In [7]:
# Embed the query string into a vector.
embedding_vector = embeddings.embed_query(query)

# Print the embedding vector.
print("--- Query Embedding Vector ---")
print(embedding_vector)

# Perform a similarity search directly using the embedding vector.
# This skips the query-embedding step, because the vector is already computed.
# Note: this method returns documents only, without scores.
docs_by_vector = db.similarity_search_by_vector(embedding_vector)

# Print the retrieved documents when searching by vector.
print("\n--- Results from similarity_search_by_vector ---")
print(docs_by_vector)

--- Query Embedding Vector ---
[-1.884633183479309, -4.870863437652588, 4.2678303718566895, -0.8921294212341309, -1.3342052698135376, -2.934204578399658, 1.6880780458450317, 0.7167788743972778, -2.3720297813415527, 2.7495880126953125, -0.9061035513877869, 1.4485152959823608, -1.8969908952713013, 0.7298180460929871, -1.914393663406372, 0.5735440254211426, -3.890010356903076, 0.6529111862182617, -2.3091747760772705, 1.7317464351654053, -1.9717049598693848, -0.5709164142608643, 1.4429186582565308, 0.3986818194389343, -0.28688907623291016, 1.167830467224121, 0.4088518023490906, -1.0090652704238892, 1.7255064249038696, 0.03896672651171684, 1.011396050453186, 0.838218092918396, 2.0766167640686035, 1.7169619798660278, 4.368852138519287, -1.1385560035705566, -0.4761849641799927, 1.4079034328460693, 4.454478740692139, 3.1783499717712402, 3.6795008182525635, -0.8825981020927429, -0.9273936748504639, -0.20241717994213104, -0.08269576728343964, -0.4905602037906647, 0.7613548040390015, 1.1347775459
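Conceptually, a text search is a vector search with one extra embedding step in front. The sketch below shows that relationship with a hypothetical `ToyVectorStore` and a hash-based `toy_embed` function. The hash "embedding" carries no semantic meaning and only makes the example deterministic and runnable without Ollama.

```python
import hashlib


def toy_embed(text, dim=8):
    """Hypothetical deterministic 'embedding': SHA-256 bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


class ToyVectorStore:
    """Minimal in-memory store; not LangChain's FAISS class."""

    def __init__(self, texts, embed):
        self.embed = embed
        self.texts = list(texts)
        self.vectors = [embed(t) for t in self.texts]

    def similarity_search_by_vector(self, vector, k=1):
        def squared_l2(stored):
            return sum((x - y) ** 2 for x, y in zip(vector, stored))

        order = sorted(range(len(self.texts)), key=lambda i: squared_l2(self.vectors[i]))
        return [self.texts[i] for i in order[:k]]

    def similarity_search(self, query, k=1):
        # Text search = embed the query, then delegate to the vector search.
        return self.similarity_search_by_vector(self.embed(query), k=k)


store = ToyVectorStore(["alpha", "beta", "gamma"], toy_embed)
vector = toy_embed("beta")  # computed once; can be cached and reused
by_text = store.similarity_search("beta")
by_vector = store.similarity_search_by_vector(vector)
print(by_text, by_vector)  # ['beta'] ['beta']
```

Searching by vector is useful when the query embedding is already available, for instance because it was cached or produced by another part of the pipeline.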

### Saving and Loading FAISS Indexes

FAISS indexes can be saved to and loaded from local disk. This is a crucial feature for persistence, allowing you to reuse your indexed data without having to re-embed all documents every time you start your application.
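A common way to exploit persistence is a "load if present, otherwise build and save" pattern. The sketch below shows the control flow with a toy JSON file and a hypothetical `build_or_load` helper. With the real store, the same branches would call `FAISS.load_local` and `FAISS.from_documents` followed by `save_local`.

```python
import json
import tempfile
from pathlib import Path


def build_or_load(index_path, texts, embed):
    """Return (vectors, was_loaded); embed only when no saved file exists."""
    path = Path(index_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    vectors = [embed(t) for t in texts]  # the expensive step we want to avoid repeating
    path.write_text(json.dumps(vectors), encoding="utf-8")
    return vectors, False


embed_calls = []


def counting_embed(text):
    """Toy embedding that records each call so we can count them."""
    embed_calls.append(text)
    return [float(len(text)), float(text.count("a"))]


with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "toy_index.json"
    first, loaded_first = build_or_load(index_file, ["alpha", "beta"], counting_embed)
    second, loaded_second = build_or_load(index_file, ["alpha", "beta"], counting_embed)

print(loaded_first, loaded_second, len(embed_calls))  # False True 2
```

The second call reads the saved vectors and never invokes the embedding function, which is the saving that persistence buys.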

**Important Note on `allow_dangerous_deserialization`:**
When loading a saved store with `FAISS.load_local`, you must pass `allow_dangerous_deserialization=True`. LangChain uses Python's `pickle` module to save the document store associated with the FAISS index, and unpickling untrusted data can lead to arbitrary code execution.
* **Set `True`**: Only if you **trust the source** of the saved `faiss_index` files (e.g., you created them yourself and no one else could have tampered with them).
* **Leave `False` (default)**: If the index comes from an unknown or untrusted source. In that case `load_local` refuses to load and raises an error, and you should rebuild the index from the original documents or obtain it from a source you trust.
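The risk is easy to demonstrate with the standard library alone. In the sketch below, a pickled object instructs the loader to call a function while loading. A harmless `eval("6 * 7")` stands in for attacker-chosen code. The `load_blob` guard is a hypothetical helper that mimics the opt-in style of the flag, not LangChain's implementation.

```python
import pickle


class Payload:
    """Object whose pickle stream asks the loader to call a function."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        # A harmless expression stands in for attacker-chosen code.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the call happens here, during loading
print(type(result).__name__, result)  # int 42


def load_blob(data, allow_dangerous_deserialization=False):
    """Hypothetical guard: refuse to unpickle unless the caller opts in."""
    if not allow_dangerous_deserialization:
        raise ValueError(
            "Refusing to unpickle; pass allow_dangerous_deserialization=True "
            "only for data you trust."
        )
    return pickle.loads(data)
```

Note that `pickle.loads` did not return a `Payload` at all: it returned whatever the embedded call produced. The flag cannot make a malicious file safe; it only forces the caller to acknowledge the risk.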

In [8]:
# Save the FAISS index to a local directory named "faiss_index".
# This writes the index and its pickled document store into that directory
# (`index.faiss` and `index.pkl` by default).
db.save_local("faiss_index")
print("--- FAISS Index Saved Locally ---")
print("Check your directory for a folder named 'faiss_index'.")

# Load the FAISS index back from the local directory.
# You must provide the original embedding function (`embeddings`) used to create the index.
# `allow_dangerous_deserialization=True` is required because the saved store includes a pickle file.
# Passing it confirms that you trust the files being loaded; here we created them ourselves.
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
print("\n--- FAISS Index Loaded Successfully ---")
print(new_db)

# Perform a similarity search on the newly loaded database to verify it works.
docs_from_new_db = new_db.similarity_search(query)

# Print the content of the first retrieved document from the loaded database.
print("\n--- Result from Loaded FAISS Database ---")
print(docs_from_new_db[0].page_content)

--- FAISS Index Saved Locally ---
Check your directory for a folder named 'faiss_index'.

--- FAISS Index Loaded Successfully ---
<langchain_community.vectorstores.faiss.FAISS object at 0x107c24cd0>

--- Result from Loaded FAISS Database ---
It is a distressing and oppressive duty, gentlemen of the Congress, which I have performed in thus addressing you. There are, it may be, many months of fiery trial and sacrifice ahead of us. It is a fearful thing to lead this great peaceful people into war, into the most terrible and disastrous of all wars, civilization itself seeming to be in the balance. But the right is more precious than peace, and we shall fight for the things which we have always carried nearest our hearts—for democracy, for the right of those who submit to authority to have a voice in their own governments, for the rights and liberties of small nations, for a universal dominion of right by such a concert of free peoples as shall bring peace and safety to all nations and make