# FAISS Vector Store Implementation

This notebook demonstrates how to set up and use a FAISS vector store with LangChain and OpenAI embeddings. FAISS (Facebook AI Similarity Search) is an efficient similarity search library that allows for quick retrieval of vectors similar to a query vector.

## Setup and Imports

First, we'll import all the required libraries and load environment variables.

In [1]:
# Import necessary libraries
import os
from dotenv import load_dotenv

import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load environment variables from .env file
load_dotenv()

True

## Initialize OpenAI Embeddings

Next, we initialize the OpenAI embeddings model that will convert our text into vector representations.

In [2]:
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.environ.get("OPENAI_API_KEY")
)

## Create FAISS Index

Now we create a flat FAISS index that uses the L2 (Euclidean) distance metric. FAISS needs the vector dimension up front, so we embed a sample query and take the length of the result (1536 by default for `text-embedding-3-small`).

In [3]:
# Create a FAISS index with L2 distance metric
# We determine the dimension by embedding a sample query
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
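
To make the role of `IndexFlatL2` concrete, here is a minimal NumPy sketch of an exhaustive (brute-force) search, which is conceptually what a flat index does. It is not FAISS's implementation; the function name `flat_l2_search` and the toy vectors are made up for illustration. Like FAISS, it reports squared L2 distances and skips the square root.

```python
import numpy as np

def flat_l2_search(stored: np.ndarray, query: np.ndarray, k: int):
    """Exhaustive nearest-neighbour search using squared L2 distance."""
    # Squared Euclidean distance from the query to every stored vector
    diffs = stored - query
    distances = np.einsum("ij,ij->i", diffs, diffs)
    # Indices of the k smallest distances, closest first
    order = np.argsort(distances)[:k]
    return distances[order], order

# Three toy 2-D vectors standing in for document embeddings
stored = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
query = np.array([1.0, 1.0], dtype="float32")

distances, indices = flat_l2_search(stored, query, k=2)
print(indices, distances)
```

Because every stored vector is compared with the query, results are exact, and search time grows linearly with the number of vectors.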

## Initialize FAISS Vector Store

Finally, we create the FAISS vector store using our embeddings model and index.

In [4]:
# Initialize the FAISS vector store
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
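
The three pieces of state passed in above work together: the index holds vectors by row position, `index_to_docstore_id` maps each position to a document ID, and the docstore maps that ID to the document. The toy sketch below mimics this bookkeeping with plain dictionaries. The helpers `add` and `resolve` are hypothetical names used for illustration and are not LangChain functions.

```python
# Toy stand-ins for the vector store's state; plain Python, not LangChain classes
docstore = {}               # document ID -> document text
index_to_docstore_id = {}   # row position in the index -> document ID
vectors = []                # stands in for the rows held by the FAISS index

def add(doc_id, text, vector):
    """Append a vector and record which document it belongs to."""
    position = len(vectors)
    vectors.append(vector)
    index_to_docstore_id[position] = doc_id
    docstore[doc_id] = text

def resolve(position):
    """Map a row position returned by a search back to its document."""
    return docstore[index_to_docstore_id[position]]

add("a1", "first document", [0.1, 0.2])
add("b2", "second document", [0.3, 0.4])
print(resolve(1))
```

Starting with an empty docstore and an empty mapping, as in the cell above, simply means no documents have been added yet.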

## Example Usage

Below are some examples of how to use the FAISS vector store for document storage and similarity search.

In [5]:
# Example: Adding documents to the vector store
from langchain_core.documents import Document

# Create sample documents
documents = [
    Document(page_content="FAISS is a library for efficient similarity search.", metadata={"source": "doc1"}),
    Document(page_content="Vector databases store embeddings for quick retrieval.", metadata={"source": "doc2"}),
    Document(page_content="LangChain provides tools for building LLM applications.", metadata={"source": "doc3"}),
    Document(page_content="LangChain provides tools for building LLM applications.", metadata={"source": "tweet"}),
    Document(page_content="FAISS can handle large datasets efficiently.", metadata={"source": "doc4"}),
    Document(page_content="OpenAI's embeddings are useful for various NLP tasks.", metadata={"source": "doc5"}),
    Document(page_content="FAISS supports both CPU and GPU for indexing.", metadata={"source": "tweet"}),
    Document(
        page_content="Robbers broke into the city bank and stole $1 million in cash.",
        metadata={"source": "news"},
    )
]

# Add documents to the vector store
vector_store.add_documents(documents)

['edcd9bba-1d5e-4521-a459-46ecc59cb7c4',
 '6b9f9980-ec02-40e6-80e1-d20c92e177cb',
 'ec0a896a-82bc-4189-a3a7-28c5d180d275',
 'c3bedc75-f326-4c50-9032-cb4aaead4153',
 '23080406-6637-461f-b18a-ee941ccf5ec2',
 '29951a3e-e88f-4322-aeb6-5d800c491ecf',
 '4077f257-2112-484d-a647-8d92c69c8b13',
 '67a37c13-8b8f-4128-a107-ee87fc12a55b']

In [6]:
# Example: Performing a similarity search
query = "How do vector databases work?"
results = vector_store.similarity_search(query, k=2)

# Display results
for doc in results:
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content}")
    print("-" * 50)

Source: doc2
Content: Vector databases store embeddings for quick retrieval.
--------------------------------------------------
Source: doc4
Content: FAISS can handle large datasets efficiently.
--------------------------------------------------


In [7]:
# Example: Performing a similarity search with scores
results_with_scores = vector_store.similarity_search_with_score(query, k=2)

# Display results with scores; with an L2 index these are distances, so lower means more similar
for doc, score in results_with_scores:
    print(f"Source: {doc.metadata['source']}")
    print(f"Content: {doc.page_content}")
    print(f"Similarity Score: {score}")
    print("-" * 50)

Source: doc2
Content: Vector databases store embeddings for quick retrieval.
Similarity Score: 0.6612452268600464
--------------------------------------------------
Source: doc4
Content: FAISS can handle large datasets efficiently.
Similarity Score: 1.395352840423584
--------------------------------------------------
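
The scores printed above are squared L2 distances, so a smaller value means a closer match. If the embeddings have unit length (OpenAI documents its embeddings as normalized to length 1), a squared distance `d` relates to cosine similarity through `d = 2 - 2 * cos`. The sketch below checks that identity on random unit vectors; `l2_squared_to_cosine` is a hypothetical helper name, not part of LangChain or FAISS.

```python
import numpy as np

def l2_squared_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0

# Build two random unit vectors and compare both measures
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_distance = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# The two values agree up to floating-point error
print(l2_squared_to_cosine(squared_distance), cosine)
```

Under that unit-length assumption, the two scores above (about 0.66 and 1.40) correspond to cosine similarities of roughly 0.67 and 0.30.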


## Query with filters

Pass a `filter` dictionary to restrict results to documents whose metadata matches it.

In [8]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)

print(results)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

[Document(id='c3bedc75-f326-4c50-9032-cb4aaead4153', metadata={'source': 'tweet'}, page_content='LangChain provides tools for building LLM applications.'), Document(id='4077f257-2112-484d-a647-8d92c69c8b13', metadata={'source': 'tweet'}, page_content='FAISS supports both CPU and GPU for indexing.')]
* LangChain provides tools for building LLM applications. [{'source': 'tweet'}]
* FAISS supports both CPU and GPU for indexing. [{'source': 'tweet'}]


Some MongoDB query and projection operators are supported for more advanced metadata filtering. The currently supported operators are:

* $eq (equals)
* $neq (not equals)
* $gt (greater than)
* $lt (less than)
* $gte (greater than or equal)
* $lte (less than or equal)
* $in (membership in list)
* $nin (not in list)
* $and (all conditions must match)
* $or (any condition must match)
* $not (negation of condition)
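
To illustrate how these operators compose, the sketch below evaluates a filter dictionary against a metadata dictionary in plain Python. It is an illustrative approximation, not LangChain's implementation: the `matches` helper and the `year` field are hypothetical, and the choice that a missing field never matches is an assumption.

```python
import operator

# Comparison operators applied to a single metadata value
COMPARISONS = {
    "$eq": operator.eq,
    "$neq": operator.ne,
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata: dict, condition: dict) -> bool:
    """Return True if the metadata satisfies every entry in the condition."""
    for key, expected in condition.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in expected)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in expected)
        elif key == "$not":
            ok = not matches(metadata, expected)
        elif isinstance(expected, dict):
            # Field with operators, e.g. {"source": {"$neq": "tweet"}}
            # Assumption: a field missing from the metadata never matches
            ok = key in metadata and all(
                COMPARISONS[op](metadata[key], operand)
                for op, operand in expected.items()
            )
        else:
            # A bare value is shorthand for equality
            ok = metadata.get(key) == expected
        if not ok:
            return False
    return True

# Hypothetical metadata with an extra numeric field
metadata = {"source": "news", "year": 2024}
condition = {"$and": [{"source": {"$in": ["news", "tweet"]}}, {"year": {"$gte": 2020}}]}
print(matches(metadata, condition))
```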

In [9]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": {"$eq": "tweet"}},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

* LangChain provides tools for building LLM applications. [{'source': 'tweet'}]
* FAISS supports both CPU and GPU for indexing. [{'source': 'tweet'}]


## Similarity search with score

The same filters work with `similarity_search_with_score`. Although the label in the output reads `SIM`, the value is an L2 distance, so lower means closer.

In [10]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": {"$neq": "tweet"}}
)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")

* [SIM=1.866577] FAISS can handle large datasets efficiently. [{'source': 'doc4'}]


## Query by turning vector store to a retriever

In [11]:
# Pass the filter through search_kwargs so it is applied on every call
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 1, "filter": {"source": "news"}},
)
retriever.invoke("Stealing from the bank is a crime")

[Document(id='67a37c13-8b8f-4128-a107-ee87fc12a55b', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
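
The retriever above uses `search_type="mmr"`, which stands for maximal marginal relevance: results are picked one at a time, trading relevance to the query against similarity to results already chosen. The NumPy sketch below shows the idea on toy 2-D vectors. It is an illustration rather than LangChain's implementation, and `mmr_select` is a hypothetical name; the `lambda_mult` parameter mirrors the option of the same name that LangChain accepts in `search_kwargs`.

```python
import numpy as np

def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance over cosine similarity."""
    # Normalise so that dot products are cosine similarities
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    relevance = c @ q
    selected = []
    remaining = list(range(len(c)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: highest similarity to anything already selected
            redundancy = max((float(c[i] @ c[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere
candidates = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.2])

print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=0.5` the second pick skips the near-duplicate in favour of the more distinct vector, while `lambda_mult=1.0` reduces to a plain relevance ranking.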

## Saving and loading

In [13]:
vector_store.save_local("faiss_index")

new_vector_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

docs = new_vector_store.similarity_search("qux")

In [None]:
docs[0]


Document(id='ec0a896a-82bc-4189-a3a7-28c5d180d275', metadata={'source': 'doc3'}, page_content='LangChain provides tools for building LLM applications.')
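
The `allow_dangerous_deserialization=True` flag is required because `save_local` stores the docstore and ID mapping with Python's `pickle` alongside the FAISS index file, and unpickling data can execute code. The standard-library sketch below shows why that matters; the `Payload` class is a contrived example that evaluates a harmless expression, where a malicious file could run anything.

```python
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild the object;
    # the callable it returns is invoked while loading
    def __reduce__(self):
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading runs eval("6 * 7") instead of rebuilding a Payload instance
restored = pickle.loads(blob)
print(restored)
```

For that reason, only load index directories that you created yourself or that come from a source you trust.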

## Conclusion

This notebook demonstrated how to set up a FAISS vector store with LangChain and OpenAI embeddings, add documents, run filtered similarity searches, use the store as a retriever, and persist it to disk. You can extend this implementation by:

1. Loading documents from various sources
2. Implementing more complex retrieval strategies
3. Integrating with LLMs for question answering

For more information, refer to the [FAISS documentation](https://github.com/facebookresearch/faiss) and [LangChain documentation](https://python.langchain.com/docs/integrations/vectorstores/faiss).