### FAISS VectorStore

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. We can use it both for similarity search and as a vector store.
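To make "similarity search over dense vectors" concrete, here is a minimal brute-force sketch in plain Python. It is not how FAISS is implemented (FAISS uses optimized index structures); the function names and the tiny 2-dimensional vectors are made up for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vec, vectors, k=2):
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    scored = [(squared_l2(query_vec, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [(i, d) for d, i in scored[:k]]


# Hypothetical "embeddings" for three documents
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
hits = brute_force_search([0.9, 0.1], vectors, k=2)
print(hits)  # vector 1 is closest, then vector 0
```

A library such as FAISS answers the same question (which stored vectors are nearest to the query?) but is designed to stay fast when there are millions of vectors.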

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()
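os.environ["LANGCHAIN_TRACING_V2"] = "true"
#Only set the key if it was found; assigning None to os.environ raises a TypeError
api_key = os.getenv("LANGCHAIN_API_KEY")
if api_key:
    os.environ["LANGCHAIN_API_KEY"] = api_key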


In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS


In [3]:
#Ingestion of documents
loader = TextLoader("speech.txt")
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content="Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.\n\nWatermelons are champions! They can have hundreds of seeds, sometimes even 600!\n\nStrawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.\n\nBananas are special because they usually don't have any seeds at all.")]

In [4]:
#Split the document into chunks of at most ~100 characters (the default separator is a blank line)
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10)
final_docs = text_splitter.split_documents(docs)
final_docs

[Document(metadata={'source': 'speech.txt'}, page_content='Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Watermelons are champions! They can have hundreds of seeds, sometimes even 600!'),
 Document(metadata={'source': 'speech.txt'}, page_content='Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.'),
 Document(metadata={'source': 'speech.txt'}, page_content="Bananas are special because they usually don't have any seeds at all.")]
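The first chunk above contains two sentences while the others contain one. A rough way to see why is to split on blank lines and greedily merge neighbouring pieces while they still fit in `chunk_size`. The sketch below is a simplified stand-in, not LangChain's actual implementation: `greedy_chunks` is a hypothetical helper and it ignores `chunk_overlap` entirely.

```python
def greedy_chunks(text, separator="\n\n", chunk_size=100):
    """Split on `separator`, then merge adjacent pieces while they fit in chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate      # still fits, keep merging
        else:
            chunks.append(current)   # would overflow, close the current chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Same text as speech.txt in the cells above
speech = "\n\n".join([
    "Apples usually have a few seeds, maybe 5 to 10.",
    "Oranges have a bit more, around 10 to 15.",
    "Watermelons are champions! They can have hundreds of seeds, sometimes even 600!",
    "Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.",
    "Bananas are special because they usually don't have any seeds at all.",
])

chunks = greedy_chunks(speech, chunk_size=100)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The apple and orange sentences together are 90 characters including the separator, so they share a chunk; adding the watermelon sentence would exceed 100, so it starts a new one.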

In [5]:
#Create the embedding model and build a FAISS vector store from the split documents
embeddings = OllamaEmbeddings(model="llama3.2")
db = FAISS.from_documents(final_docs, embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x1c189181340>

#### The query below is reused for every search method

In [6]:
query = "seeds in apple?"

In [7]:
#Plain similarity search (returns the 4 closest documents by default)
result = db.similarity_search(query)
result

[Document(id='0e83385c-54a6-4d49-b9d2-012489a3efa5', metadata={'source': 'speech.txt'}, page_content='Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.'),
 Document(id='1d4532a5-de5f-43ee-90ab-398f34caafc8', metadata={'source': 'speech.txt'}, page_content='Watermelons are champions! They can have hundreds of seeds, sometimes even 600!'),
 Document(id='9b92730f-dc8e-49a8-a488-b835d508ae40', metadata={'source': 'speech.txt'}, page_content="Bananas are special because they usually don't have any seeds at all."),
 Document(id='28fd271c-79dc-462c-ad34-73e37a7e89d7', metadata={'source': 'speech.txt'}, page_content='Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.')]

In [14]:
#Similarity search with score. The score is a distance, so a lower score means a closer (more similar) match
result2 = db.similarity_search_with_score(query)
result2


[(Document(id='0e83385c-54a6-4d49-b9d2-012489a3efa5', metadata={'source': 'speech.txt'}, page_content='Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.'),
  0.7833785),
 (Document(id='1d4532a5-de5f-43ee-90ab-398f34caafc8', metadata={'source': 'speech.txt'}, page_content='Watermelons are champions! They can have hundreds of seeds, sometimes even 600!'),
  0.8181919),
 (Document(id='9b92730f-dc8e-49a8-a488-b835d508ae40', metadata={'source': 'speech.txt'}, page_content="Bananas are special because they usually don't have any seeds at all."),
  1.173011),
 (Document(id='28fd271c-79dc-462c-ad34-73e37a7e89d7', metadata={'source': 'speech.txt'}, page_content='Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.'),
  1.2151376)]
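Because each result is a `(document, score)` pair and the score is a distance, one common follow-up is to keep only the results under some cutoff. The sketch below reuses the scores printed above, with the documents shortened to their fruit names. The helper `within_distance` and the cutoff of `1.0` are arbitrary choices for illustration; a sensible cutoff depends on the embedding model.

```python
# (label, score) pairs copied from the output above; labels are shortened
scored_hits = [
    ("Strawberries", 0.7833785),
    ("Watermelons", 0.8181919),
    ("Bananas", 1.173011),
    ("Apples/Oranges", 1.2151376),
]


def within_distance(hits, max_distance):
    """Keep only hits whose distance score is at or below max_distance."""
    return [(label, score) for label, score in hits if score <= max_distance]


close_hits = within_distance(scored_hits, max_distance=1.0)
print(close_hits)
```

Note that for this query the apple chunk has the largest distance, so a cutoff alone would not rescue the result; it only trims weak matches.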

In [None]:
#We can also search with an embedding vector directly instead of a text query
vector = embeddings.embed_query("Seeds in banana")
result3 = db.similarity_search_with_score_by_vector(vector)
result3


[(Document(id='0e83385c-54a6-4d49-b9d2-012489a3efa5', metadata={'source': 'speech.txt'}, page_content='Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.'),
  1.5516274),
 (Document(id='1d4532a5-de5f-43ee-90ab-398f34caafc8', metadata={'source': 'speech.txt'}, page_content='Watermelons are champions! They can have hundreds of seeds, sometimes even 600!'),
  1.5685515),
 (Document(id='9b92730f-dc8e-49a8-a488-b835d508ae40', metadata={'source': 'speech.txt'}, page_content="Bananas are special because they usually don't have any seeds at all."),
  1.5944684),
 (Document(id='28fd271c-79dc-462c-ad34-73e37a7e89d7', metadata={'source': 'speech.txt'}, page_content='Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.'),
  1.7485698)]

### Retriever

The retriever is an important component. A vector store can be converted into a retriever with `as_retriever()`. Most LangChain querying components work with retrievers rather than with vector stores directly, so this conversion lets the store plug into them.

In [10]:
retriever1 = db.as_retriever()
retriever1.invoke(query)

[Document(id='0e83385c-54a6-4d49-b9d2-012489a3efa5', metadata={'source': 'speech.txt'}, page_content='Strawberries are tiny, but they have lots of tiny seeds too, about 200 in each one.'),
 Document(id='1d4532a5-de5f-43ee-90ab-398f34caafc8', metadata={'source': 'speech.txt'}, page_content='Watermelons are champions! They can have hundreds of seeds, sometimes even 600!'),
 Document(id='9b92730f-dc8e-49a8-a488-b835d508ae40', metadata={'source': 'speech.txt'}, page_content="Bananas are special because they usually don't have any seeds at all."),
 Document(id='28fd271c-79dc-462c-ad34-73e37a7e89d7', metadata={'source': 'speech.txt'}, page_content='Apples usually have a few seeds, maybe 5 to 10.\n\nOranges have a bit more, around 10 to 15.')]
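The idea behind a retriever can be sketched as a thin wrapper that exposes a single `invoke(query)` method, so callers do not need to know which store sits behind it. Everything below is a hypothetical toy (`ToyVectorStore`, `ToyRetriever`, `keyword_embed`, and the shortened sample texts), not LangChain's classes.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def keyword_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of three keywords in the text."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in ("apple", "banana", "seed")]


@dataclass
class ToyVectorStore:
    """Stand-in for a vector store: texts paired with their vectors."""
    embed: Callable[[str], Vector]
    entries: List[Tuple[str, Vector]]

    @classmethod
    def from_texts(cls, texts: List[str], embed: Callable[[str], Vector]) -> "ToyVectorStore":
        return cls(embed=embed, entries=[(t, embed(t)) for t in texts])

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        q = self.embed(query)
        ranked = sorted(self.entries, key=lambda entry: squared_l2(q, entry[1]))
        return [text for text, _ in ranked[:k]]

    def as_retriever(self, k: int = 4) -> "ToyRetriever":
        return ToyRetriever(store=self, k=k)


@dataclass
class ToyRetriever:
    """Uniform interface: callers only need invoke(query)."""
    store: ToyVectorStore
    k: int

    def invoke(self, query: str) -> List[str]:
        return self.store.similarity_search(query, k=self.k)


texts = [
    "Apples usually have a few seeds.",
    "Bananas usually have no seeds.",
    "Watermelons can have hundreds of seeds.",
]
toy_retriever = ToyVectorStore.from_texts(texts, keyword_embed).as_retriever(k=2)
top = toy_retriever.invoke("seeds in apple?")
print(top)
```

In LangChain itself, the number of results is typically set when creating the retriever, for example `db.as_retriever(search_kwargs={"k": 2})`.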

#### We can also save the database to disk (the saved format uses pickle) and load it back

In [11]:
#Saving a database
db.save_local("faiss_index")

#Loading a database
#load_local de-serializes a pickle file, so it requires an explicit opt-in.
#Without allow_dangerous_deserialization=True it raises a ValueError, because a tampered pickle file can execute arbitrary code.
#Only set this for index files you created yourself and that nobody else has modified.
newdb = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
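The `allow_dangerous_deserialization` opt-in exists because unpickling is not a passive read: a pickle file can instruct Python to call a function while it is being loaded. The sketch below shows this with the standard library only. The class name `Innocent` is made up, and `os.getcwd` is used as a deliberately harmless stand-in for whatever callable an attacker would choose.

```python
import os
import pickle


class Innocent:
    """Looks like plain data, but tells pickle to call a function on load."""

    def __reduce__(self):
        # pickle stores "call os.getcwd()" instead of this object's state.
        # os.getcwd is harmless; a malicious file could name any importable callable.
        return (os.getcwd, ())


payload = pickle.dumps(Innocent())

# Loading the bytes runs the callable; we get its return value, not an Innocent object.
loaded = pickle.loads(payload)
print(type(loaded), loaded)
```

This is why the flag should only be enabled for index files you saved yourself and that have not left your control.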