### Perform RAG on a CSV dataset using FAISS as vector database

### FAISS

FAISS, short for **Facebook AI Similarity Search**, is an open-source library built for similarity search and clustering of dense vectors.

Read more about it here - https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

We will use FAISS as the vector database to store embeddings and run similarity search. We install the `faiss-cpu` package, so all processing runs on the CPU.
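
To make "similarity search over dense vectors" concrete, here is a minimal pure-Python sketch of an exact nearest-neighbour lookup by Euclidean distance. It is only an illustration of the idea, not how FAISS is implemented; the helper names and the toy vectors are made up for this example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, vectors, k):
    """Return the indices of the k vectors closest to the query (brute force)."""
    ranked = sorted(enumerate(vectors), key=lambda pair: l2_distance(query, pair[1]))
    return [index for index, _ in ranked[:k]]

# Toy 2-dimensional "embeddings" (hypothetical data).
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
nearest = top_k([1.0, 1.0], toy_vectors, k=2)
print(nearest)
```

A vector database does the same job, but with index structures that stay fast when there are thousands or millions of vectors.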

In [2]:
%pip install -q python-dotenv pandas langchain langchain-community faiss-cpu langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


Note - `load_dotenv` is provided by the `python-dotenv` package. If installing it raises a subprocess error (for example because it is already installed or in use in the same environment), remove it from the command above and rerun.

### Create documents using CSVLoader where each row is considered a separate document

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader

csv_loader = CSVLoader(file_path="../data/employee_data.csv")
csv_docs = csv_loader.load()

In [6]:
len(csv_docs) # each row is again a document

3000

In [7]:
csv_docs[0]

Document(metadata={'source': '../data/employee_data.csv', 'row': 0}, page_content='ï»¿EmpID: 3427\nFirstName: Uriah\nLastName: Bridges\nStartDate: 20-Sep-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Peter Oneill\nADEmail: uriah.bridges@bilearner.com\nBusinessUnit: CCDR\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone C\nEmployeeClassificationType: Temporary\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Finance & Accounting\nDOB: 07-10-1969\nState: MA\nJobFunctionDescription: Accounting\nGenderCode: Female\nLocationCode: 34904\nRaceDesc: White\nMaritalDesc: Widowed\nPerformance Score: Fully Meets\nCurrent Employee Rating: 4')

Each object is again of type `Document`, but its **metadata** differs: where the DataFrame loader stored the remaining columns as metadata, `CSVLoader` stores only the source file name and the row number.

In [7]:
vars(csv_docs[0])

{'id': None,
 'metadata': {'source': '../data/employee_data.csv', 'row': 0},
 'page_content': 'ï»¿EmpID: 3427\nFirstName: Uriah\nLastName: Bridges\nStartDate: 20-Sep-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Peter Oneill\nADEmail: uriah.bridges@bilearner.com\nBusinessUnit: CCDR\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone C\nEmployeeClassificationType: Temporary\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Finance & Accounting\nDOB: 07-10-1969\nState: MA\nJobFunctionDescription: Accounting\nGenderCode: Female\nLocationCode: 34904\nRaceDesc: White\nMaritalDesc: Widowed\nPerformance Score: Fully Meets\nCurrent Employee Rating: 4',
 'type': 'Document'}
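
The `ï»¿` prefix before `EmpID` in the output above looks like a UTF-8 byte-order mark (BOM) that was decoded with a Windows code page. The sketch below reproduces the effect with the standard library only; the two-column sample bytes are hypothetical and stand in for the real file.

```python
import csv
import io

# Bytes of a tiny CSV saved "with BOM" (hypothetical sample, not the real dataset).
raw = b"\xef\xbb\xbfEmpID,FirstName\r\n3427,Uriah\r\n"

def first_row(data: bytes, encoding: str) -> dict:
    """Decode the bytes with the given encoding and return the first CSV row."""
    text = data.decode(encoding)
    return next(csv.DictReader(io.StringIO(text, newline="")))

garbled = first_row(raw, "cp1252")    # BOM bytes become part of the first column name
clean = first_row(raw, "utf-8-sig")   # utf-8-sig strips the BOM while decoding
print(list(garbled))
print(list(clean))
```

`CSVLoader` accepts an `encoding` argument, so passing `encoding="utf-8-sig"` when creating the loader should remove the prefix from every document. This was not run against the notebook's data, so treat it as a suggestion to verify.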

Note - To parse CSV data you can use either `DataFrameLoader` or `CSVLoader`, but be aware that with `DataFrameLoader` **only one column is passed as `page_content` and the rest is treated as metadata**.

The metadata field is not considered during similarity search; instead it is useful for filtering data before or after the search.
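
As a rough sketch of filtering after a search, the snippet below keeps only the retrieved documents whose metadata matches given values. `Doc` is a hypothetical stand-in for LangChain's `Document`, and the sample rows are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for LangChain's Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(docs, **wanted):
    """Keep only documents whose metadata matches every key/value in `wanted`."""
    return [d for d in docs if all(d.metadata.get(k) == v for k, v in wanted.items())]

# Pretend these came back from a similarity search (invented sample data).
retrieved = [
    Doc("FirstName: Uriah", {"source": "employee_data.csv", "row": 0}),
    Doc("FirstName: Lin", {"source": "employee_data.csv", "row": 1531}),
    Doc("FirstName: Ana", {"source": "other.csv", "row": 0}),
]
kept = filter_by_metadata(retrieved, source="employee_data.csv")
print([d.metadata["row"] for d in kept])
```

The LangChain FAISS wrapper also takes a `filter` argument on `similarity_search`, which may be the more direct option; check the documentation of your installed version for its exact behaviour.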

### Initialise Hugging Face client for embeddings and LLM API calls

- Create a file called `.env`
- Log in to the Hugging Face Hub and go to your profile => Settings => Access Tokens
- Generate a new token and save it in the `.env` file as `HUGGINGFACEHUB_API_TOKEN=hf_token`, where `hf_token` is your token value

In [9]:
import os
from dotenv import load_dotenv

load_dotenv()

hugging_face_api_key = os.environ["HUGGINGFACEHUB_API_TOKEN"]

In [10]:
# print to verify if key exists
# print(hugging_face_api_key)
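
Printing the full token leaves it in the saved notebook output. A slightly safer check, sketched below, reports only whether the variable is set, its length, and its first three characters. The helper name `describe_token` is made up for this example.

```python
import os

def describe_token(name: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Report whether an environment variable is set without revealing its value."""
    token = os.environ.get(name)
    if not token:
        return f"{name} is not set"
    return f"{name} is set ({len(token)} characters, starts with {token[:3]!r})"

print(describe_token())
```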

#### Initialize Hugging Face Inference Endpoint client to use Mistral 7b Instruct v0.2

In [11]:
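from langchain_community.llms import HuggingFaceEndpoint

# model served through the Hugging Face Inference API
repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

# the client reads HUGGINGFACEHUB_API_TOKEN from the environment loaded above
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.7,
)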



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\VARUN ARORA\.cache\huggingface\token
Login successful


#### Create vector database FAISS

In [12]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# use the open-source embedding function to convert text to embeddings, can choose another function as per leaderboard - https://huggingface.co/spaces/mteb/leaderboard
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



Embed `csv_docs` with the embedding function and store the resulting vectors in a FAISS index

In [14]:
import faiss
from langchain_community.vectorstores import FAISS

csv_faiss_index = "faiss_rag_csvloader"

faiss_index_client = FAISS.from_documents(
    documents=csv_docs,
    embedding=embedding_function,
)

Now this client holds the vector embeddings for our data. We should persist it in a folder to avoid rebuilding the vector database every time.

Provide the folder in which to save the index, and an index name that uniquely identifies it when the folder contains multiple indices

In [15]:
faiss_index_client.save_local(folder_path="faiss_vector_stores", index_name=csv_faiss_index)
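
To actually skip the rebuild on later runs, you can check whether the index files are already on disk. The sketch below assumes `save_local` writes `<index_name>.faiss` and `<index_name>.pkl` into the folder; the helper name `index_exists` is made up for this example.

```python
from pathlib import Path

def index_exists(folder_path: str, index_name: str) -> bool:
    """Return True if both files of a saved index are present in the folder."""
    folder = Path(folder_path)
    # Assumption: a saved index consists of a .faiss file and a .pkl file.
    return all((folder / f"{index_name}{ext}").is_file() for ext in (".faiss", ".pkl"))

print(index_exists("faiss_vector_stores", "faiss_rag_csvloader"))
```

When this returns `True`, you can load the index from disk instead of calling `FAISS.from_documents` again.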

#### Loading vector store from local

Pass the embedding function as a parameter and set the `allow_dangerous_deserialization` flag to `True`. The flag is required because **loading reads a `.pkl` file, which can be modified to deliver malicious code. Setting the flag to `True` confirms that you trust the source**.

In [18]:
faiss_client = FAISS.load_local(folder_path="faiss_vector_stores", index_name=csv_faiss_index, embeddings=embedding_function, allow_dangerous_deserialization=True)
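
To see why loading a pickle file needs this explicit opt-in, the sketch below shows that unpickling can call an arbitrary function chosen by whoever created the file. The payload here is deliberately harmless (it only calls `len`), and the `Payload` class is hypothetical.

```python
import pickle

class Payload:
    """Benign stand-in for a tampered object (hypothetical)."""
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call len('payload')".
        # A malicious file could name any importable function here instead.
        return (len, ("payload",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # the function call happens during loading
print(result)
```

The loaded value is the return value of the function call, not a `Payload` instance. This is why you should only load `.pkl` files that you created yourself or that come from a source you trust.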

#### Similarity search to get the k (here 7) documents semantically closest to the input query

In [19]:
faiss_client.similarity_search(query="Which married employees have best performance?", k=7)

[Document(metadata={'source': '../data/employee_data.csv', 'row': 1531}, page_content='ï»¿EmpID: 1958\nFirstName: Lin\nLastName: Chan\nStartDate: 04-Jul-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Neil Aguilar\nADEmail: lin.chan@bilearner.com\nBusinessUnit: PYZ\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone B\nEmployeeClassificationType: Full-Time\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Sales & Marketing\nDOB: 18-10-1990\nState: MA\nJobFunctionDescription: Assistant\nGenderCode: Female\nLocationCode: 2170\nRaceDesc: Other\nMaritalDesc: Married\nPerformance Score: Fully Meets\nCurrent Employee Rating: 3'),
 Document(metadata={'source': '../data/employee_data.csv', 'row': 1594}, page_content='ï»¿EmpID: 2021\nFirstName: Marlee\nLastName: Woods\nStartDate: 29-Nov-22\nExitDate: \nTitle: Production Technician II\nSupervisor: Christine Salas\nADEmail: marlee.woods@bilearner.com\nBusinessUnit: WBL\nEmployeeStatus: A

#### Build simple RAG chain

In [20]:
# Retrieve the relevant rows of the CSV and generate an answer from them.
from langchain import hub

# set k to 10 to retrieve the 10 most similar docs
retriever = faiss_client.as_retriever(search_kwargs={'k': 10})
prompt = hub.pull("rlm/rag-prompt") # pull common RAG prompt here - https://smith.langchain.com/hub/rlm/rag-prompt



In [21]:
# format_docs joins the page content of the documents retrieved from the vector store into one string, which is passed to the LLM as context

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
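
A quick way to see what `format_docs` produces is to call it on two stand-in objects. `SimpleNamespace` is used here only to mimic the `page_content` attribute of a `Document`; the contents are shortened from the rows shown earlier.

```python
from types import SimpleNamespace

def format_docs(docs):
    # Same function as in the cell above, repeated so this example runs on its own.
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for retrieved documents (only page_content is needed here).
docs = [
    SimpleNamespace(page_content="EmpID: 3427\nFirstName: Uriah"),
    SimpleNamespace(page_content="EmpID: 1958\nFirstName: Lin"),
]
context = format_docs(docs)
print(context)
```

Each document's text is separated from the next by a blank line, and the metadata is dropped.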

In [22]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

csvloader_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
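
The data flow of the chain above can be sketched with plain functions: the question goes to the retriever and is also passed through unchanged, both are put into the prompt, and the prompt goes to the LLM. Every function below is a hypothetical stand-in, not the LangChain implementation.

```python
def fake_retrieve(question):
    """Stand-in retriever: always returns one hard-coded snippet."""
    return ["EmpID: 1958\nMaritalDesc: Married"]

def join_docs(docs):
    """Stand-in for format_docs, working on plain strings."""
    return "\n\n".join(docs)

def build_prompt(inputs):
    """Stand-in prompt template with a context slot and a question slot."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def echo_llm(prompt_text):
    """Stand-in LLM: just echoes the last line of the prompt."""
    return prompt_text.splitlines()[-1]

def run_chain(question):
    # Step 1: build the dict; both values are computed from the same input question.
    inputs = {"context": join_docs(fake_retrieve(question)), "question": question}
    # Steps 2 to 4: prompt, then model, then plain-string output.
    return str(echo_llm(build_prompt(inputs)))

answer = run_chain("Which married employees have best performance?")
print(answer)
```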

In [23]:
# 10 documents retrieved confirmed
retriever.invoke("Which married employees have best performance?")

[Document(metadata={'source': '../data/employee_data.csv', 'row': 1531}, page_content='ï»¿EmpID: 1958\nFirstName: Lin\nLastName: Chan\nStartDate: 04-Jul-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Neil Aguilar\nADEmail: lin.chan@bilearner.com\nBusinessUnit: PYZ\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone B\nEmployeeClassificationType: Full-Time\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Sales & Marketing\nDOB: 18-10-1990\nState: MA\nJobFunctionDescription: Assistant\nGenderCode: Female\nLocationCode: 2170\nRaceDesc: Other\nMaritalDesc: Married\nPerformance Score: Fully Meets\nCurrent Employee Rating: 3'),
 Document(metadata={'source': '../data/employee_data.csv', 'row': 1594}, page_content='ï»¿EmpID: 2021\nFirstName: Marlee\nLastName: Woods\nStartDate: 29-Nov-22\nExitDate: \nTitle: Production Technician II\nSupervisor: Christine Salas\nADEmail: marlee.woods@bilearner.com\nBusinessUnit: WBL\nEmployeeStatus: A

In [24]:
response = csvloader_rag_chain.invoke("Which married employees have best performance?")

In [25]:
print(response)

 Lin Chan and Michael Albert, both Married employees, have a Performance Score of "Fully Meets" and Current Employee Rating of 3 and 2 respectively.


For an explanation of how the chain works, refer to the bottom of `chroma_rag_hface_csv.ipynb`.