### Perform RAG on a CSV dataset using ChromaDB as vector database

Install the required dependencies first.

In [12]:
%pip install -q python-dotenv pandas langchain langchain-community langchain-chroma langchain-text-splitters sentence-transformers huggingface_hub

Note: you may need to restart the kernel to use updated packages.


There are two common ways to handle CSV data when building RAG apps:
1. Load the CSV into a dataframe (with explicit or automatically inferred column types), then convert the rows to documents and split them into chunks
2. Use `CSVLoader` to load the CSV directly, with each row taken as one document

Here we will use `CSVLoader`.

### Create documents using CSVLoader

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader

csv_loader = CSVLoader(file_path="../../data/employee_data.csv")
csv_docs = csv_loader.load()

In [7]:
len(csv_docs) # each row is again a document

3000

In [8]:
csv_docs[0]

Document(metadata={'source': '../../data/employee_data.csv', 'row': 0}, page_content='ï»¿EmpID: 3427\nFirstName: Uriah\nLastName: Bridges\nStartDate: 20-Sep-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Peter Oneill\nADEmail: uriah.bridges@bilearner.com\nBusinessUnit: CCDR\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone C\nEmployeeClassificationType: Temporary\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Finance & Accounting\nDOB: 07-10-1969\nState: MA\nJobFunctionDescription: Accounting\nGenderCode: Female\nLocationCode: 34904\nRaceDesc: White\nMaritalDesc: Widowed\nPerformance Score: Fully Meets\nCurrent Employee Rating: 4')

Each object is again of type `Document`, but its **metadata** is different: where the dataframe loader stored the remaining fields as metadata, `CSVLoader` stores only the source file name and the row number.

In [9]:
vars(csv_docs[0])

{'id': None,
 'metadata': {'source': '../../data/employee_data.csv', 'row': 0},
 'page_content': 'ï»¿EmpID: 3427\nFirstName: Uriah\nLastName: Bridges\nStartDate: 20-Sep-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Peter Oneill\nADEmail: uriah.bridges@bilearner.com\nBusinessUnit: CCDR\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone C\nEmployeeClassificationType: Temporary\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Finance & Accounting\nDOB: 07-10-1969\nState: MA\nJobFunctionDescription: Accounting\nGenderCode: Female\nLocationCode: 34904\nRaceDesc: White\nMaritalDesc: Widowed\nPerformance Score: Fully Meets\nCurrent Employee Rating: 4',
 'type': 'Document'}
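
To make the shape of these documents concrete, here is a minimal standard-library sketch of roughly how a CSV row becomes `key: value` lines plus `source` and `row` metadata. It is an illustration, not `CSVLoader`'s actual implementation; `rows_to_docs` and the three-column sample are hypothetical. It also shows the likely origin of the `ï»¿` prefix in the output above: the bytes of a UTF-8 byte order mark (BOM) decoded as cp1252.

```python
import csv
import io

def rows_to_docs(raw_bytes, encoding, source="employee_data.csv"):
    """Hypothetical helper: one dict per CSV row, shaped like the Documents above."""
    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=""))
    docs = []
    for i, row in enumerate(reader):
        # Each column becomes one "key: value" line of the page content.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs

# A tiny CSV saved with a UTF-8 BOM, as some spreadsheet tools do.
raw = "EmpID,FirstName,State\n3427,Uriah,MA\n".encode("utf-8-sig")

with_bom = rows_to_docs(raw, "cp1252")         # BOM bytes survive as stray characters
without_bom = rows_to_docs(raw, "utf-8-sig")   # BOM is stripped while decoding

print(repr(with_bom[0]["page_content"]))
print(repr(without_bom[0]["page_content"]))
```

If the prefix is unwanted, `CSVLoader` accepts an `encoding` argument, so passing `encoding="utf-8-sig"` should remove it (not re-run against this dataset).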

One way to decide between these two techniques is to ask whether similarity search needs the entire row (all columns) or just one column.

If you need all columns, `CSVLoader` should be the choice; if you want only one column for similarity search, use `DataFrameLoader`.
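
The difference between the two document shapes can be sketched with plain dictionaries. This is an illustration only: `row_as_full_document` and `row_as_single_column_document` are hypothetical helpers, not LangChain APIs, and the sample row is abbreviated from the first record above.

```python
row = {
    "EmpID": "3427",
    "FirstName": "Uriah",
    "Title": "Production Technician I",
    "Performance Score": "Fully Meets",
}

def row_as_full_document(row):
    """CSVLoader-style: every column goes into the text that gets embedded."""
    return {
        "page_content": "\n".join(f"{k}: {v}" for k, v in row.items()),
        "metadata": {},
    }

def row_as_single_column_document(row, content_column):
    """DataFrameLoader-style: one column is embedded, the rest become metadata."""
    metadata = {k: v for k, v in row.items() if k != content_column}
    return {"page_content": row[content_column], "metadata": metadata}

full = row_as_full_document(row)
single = row_as_single_column_document(row, content_column="Title")

print(full["page_content"])
print(single["page_content"], single["metadata"])
```

In the first shape a query can match on any column; in the second only `Title` influences similarity, while the other columns remain available as metadata.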

### Initialise Hugging Face client for embeddings and LLM API calls

- Create a file called `.env`
- Log in to the Hugging Face Hub and go to your profile => Settings => Access Tokens
- Generate a new token and save it in the `.env` file as `HUGGINGFACEHUB_API_TOKEN=hf_token`, replacing `hf_token` with the token value

In [10]:
import os
from dotenv import load_dotenv

load_dotenv()

hugging_face_api_key = os.environ["HUGGINGFACEHUB_API_TOKEN"]

In [11]:
# print to verify if key exists
# print(hugging_face_api_key)
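
`os.environ[...]` raises a bare `KeyError` when the variable is missing, and printing the full token is easy to leak into a shared notebook. A minimal sketch of a friendlier check, assuming two hypothetical helpers, `require_env` and `mask`; the demo uses a throwaway variable name (`DEMO_TOKEN`) rather than the real token.

```python
import os

def require_env(name):
    """Hypothetical helper: return the variable's value or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file and call load_dotenv() again")
    return value

def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demo with a placeholder value; in the notebook the real value comes from .env.
os.environ["DEMO_TOKEN"] = "hf_example_token"
token = require_env("DEMO_TOKEN")
print(mask(token))
```

In the notebook, `require_env("HUGGINGFACEHUB_API_TOKEN")` could replace the direct `os.environ` lookup, and `mask(...)` could replace the commented-out `print`.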

#### Initialize Hugging Face Inference Endpoint client to use Mistral 7B Instruct v0.2

In [12]:
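from langchain_community.llms import HuggingFaceEndpoint

# model repository on the Hugging Face Hub
repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=0.7,
    huggingfacehub_api_token=hugging_face_api_key,  # token loaded from .env above
)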

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\VARUN ARORA\.cache\huggingface\token
Login successful


#### Create the ChromaDB vector database

In [13]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_chroma import Chroma

# use the open-source embedding function to convert text to embeddings, can choose another function as per leaderboard - https://huggingface.co/spaces/mteb/leaderboard
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Choose the directory in which the embedded `csv_docs` will be persisted

In [14]:
csv_chromadb_directory = "chroma_rag_csvloader"

Create the vector DB client; `from_documents` embeds the documents and stores them in the vector database in the same step

In [15]:
csvloader_db_client = Chroma.from_documents(documents=csv_docs, embedding=embedding_function, persist_directory=csv_chromadb_directory)

If the directory already exists, you have to call a different method to create the client:

`db = Chroma(persist_directory="my_directory", embedding_function=embedding_function)`
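
The "load if it exists, otherwise build" decision can be wrapped in a small helper. This is a sketch under assumptions: `open_or_build` is a hypothetical function, it treats any non-empty directory as a persisted store, and it takes the two creation paths as callables so the demo runs without Chroma installed.

```python
import os
import tempfile

def open_or_build(persist_dir, load_existing, build_new):
    """Hypothetical helper: reuse a persisted store if present, else build it.

    load_existing and build_new are zero-argument callables, which keeps this
    sketch independent of any particular vector-store library.
    """
    # Assumption: a non-empty directory means a store was persisted there.
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_existing()
    return build_new()

# Stand-in demo with a temporary directory instead of a real Chroma store.
with tempfile.TemporaryDirectory() as tmp:
    first = open_or_build(tmp, lambda: "loaded", lambda: "built")
    open(os.path.join(tmp, "marker"), "w").close()  # simulate persisted files
    second = open_or_build(tmp, lambda: "loaded", lambda: "built")

print(first, second)
```

In the notebook, `load_existing` would wrap the `Chroma(persist_directory=..., embedding_function=...)` call and `build_new` would wrap `Chroma.from_documents(...)`. The point of the check is that calling `from_documents` again on a populated directory is likely to add the same rows a second time.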

#### Build simple RAG chain for csvloader vector database

In [36]:
# Retrieve the relevant CSV rows and generate an answer from them.
from langchain import hub

# set k to 10 to retrieve the 10 most similar docs
retriever = csvloader_db_client.as_retriever(search_kwargs={'k': 10})
prompt = hub.pull("rlm/rag-prompt") # pull common RAG prompt here - https://smith.langchain.com/hub/rlm/rag-prompt
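
What "the `k` most similar docs" means can be shown with a toy ranking. This is a simplified sketch, not Chroma's implementation: `cosine_similarity` and `top_k` are hypothetical helpers, the vectors are made up, and the distance metric Chroma actually uses depends on the collection settings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query_vec = [1.0, 0.1]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```

A larger `k` gives the LLM more rows to work with but also makes the prompt longer, so the value is a trade-off rather than a fixed rule.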



In [37]:
# format_docs joins the page content of every retrieved document into one string, which is passed to the LLM as context

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
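
A quick way to see what `format_docs` produces without touching the vector store is to feed it stand-in objects. `SimpleNamespace` here is only a stub with a `page_content` attribute, not a LangChain `Document`, and the contents are abbreviated from fields visible in the retriever output.

```python
from types import SimpleNamespace

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Stub documents carrying only the attribute that format_docs reads.
stub_docs = [
    SimpleNamespace(page_content="FirstName: Lin\nLastName: Chan"),
    SimpleNamespace(page_content="FirstName: Ivan\nLastName: Hull"),
]

context = format_docs(stub_docs)
print(context)
```

The blank line between records is the only separator the LLM gets, and the metadata (`source`, `row`) is dropped at this step.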

In [38]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

csvloader_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [40]:
# confirm that 10 documents are retrieved
retriever.invoke("Which married employees have best performance?")

[Document(metadata={'row': 1531, 'source': '../../data/employee_data.csv'}, page_content='ï»¿EmpID: 1958\nFirstName: Lin\nLastName: Chan\nStartDate: 04-Jul-19\nExitDate: \nTitle: Production Technician I\nSupervisor: Neil Aguilar\nADEmail: lin.chan@bilearner.com\nBusinessUnit: PYZ\nEmployeeStatus: Active\nEmployeeType: Contract\nPayZone: Zone B\nEmployeeClassificationType: Full-Time\nTerminationType: Unk\nTerminationDescription: \nDepartmentType: Production\nDivision: Sales & Marketing\nDOB: 18-10-1990\nState: MA\nJobFunctionDescription: Assistant\nGenderCode: Female\nLocationCode: 2170\nRaceDesc: Other\nMaritalDesc: Married\nPerformance Score: Fully Meets\nCurrent Employee Rating: 3'),
 Document(metadata={'row': 1911, 'source': '../../data/employee_data.csv'}, page_content='ï»¿EmpID: 2338\nFirstName: Ivan\nLastName: Hull\nStartDate: 10-Oct-18\nExitDate: 02-May-22\nTitle: Production Manager\nSupervisor: Robert Sullivan\nADEmail: ivan.hull@bilearner.com\nBusinessUnit: EW\nEmployeeStatus:

In [None]:
response = csvloader_rag_chain.invoke("Which married employees have best performance?")

In [41]:
print(response)

 Based on the provided context, Lin Chan and Ivan Hull are the married employees with the best performance scores of "Fully Meets" and current employee ratings of 3 and 4 respectively.


You can try creating the same chain for `chromadb_rag_dfloader`.

### Chain Working Explanation

The chain of operations can be visualized as follows:

1. **Retrieve and Format Context**:
    - The input query is sent to the `retriever`.
    - The `retriever` fetches relevant documents.
    - These documents are passed through `format_docs` to prepare them for the prompt.

2. **Combine with Question**:
    - Simultaneously, the question is passed through `RunnablePassthrough`.
    - The `prompt` combines the formatted context and the question into a single prompt.

3. **Generate Response**:
    - This prompt is sent to the `llm`.
    - The `llm` generates a response based on the prompt.

4. **Parse Output**:
    - The generated response is parsed by `StrOutputParser` to ensure it is in a clean string format.

Here's the same explanation with the original code snippet for context:

```python
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```

### Breakdown

1. `{"context": retriever | format_docs, "question": RunnablePassthrough()}`:
    - This dictionary contains two keys: `context` and `question`.
    - `context` is formed by passing the input query through `retriever` and `format_docs` in sequence.
    - `question` is directly passed through `RunnablePassthrough`.

2. `| prompt`:
    - The combined context and question are then passed to the `prompt`, which structures them into a single input for the LLM.

3. `| llm`:
    - The structured prompt is processed by the `llm`, which generates a response.

4. `| StrOutputParser()`:
    - The response from the LLM is parsed into a string format by `StrOutputParser`.
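
The data flow in this breakdown can be imitated in plain Python. This is a sketch of the flow only, not how LangChain implements the `|` operator; `fake_retriever`, `fake_prompt`, `fake_llm`, and `parse_output` are hypothetical stand-ins for the real components.

```python
def fake_retriever(question):
    # Stand-in for retriever.invoke(question): returns page_content strings.
    return ["FirstName: Lin\nLastName: Chan", "FirstName: Ivan\nLastName: Hull"]

def format_context(pages):
    # Same joining rule as format_docs, applied to plain strings.
    return "\n\n".join(pages)

def fake_prompt(inputs):
    # Stand-in for the RAG prompt template: fills in context and question.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"

def fake_llm(prompt_text):
    # Stand-in for the LLM call: reports how many rows it was given.
    return f"received {prompt_text.count('FirstName')} rows"

def parse_output(raw):
    # Stand-in for the output parser: hands back plain text.
    return str(raw)

def run_chain(question):
    # Step 1: the same input feeds both entries of the dictionary.
    inputs = {
        "context": format_context(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                  # RunnablePassthrough()
    }
    # Steps 2-4: prompt, then llm, then output parser.
    return parse_output(fake_llm(fake_prompt(inputs)))

answer = run_chain("Which married employees have best performance?")
print(answer)
```

Swapping any stand-in for the real component leaves the order of steps unchanged, which is what the pipe syntax expresses.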

This RAG chain grounds the generated answer in the retrieved context. Because the LLM sees only the `k` rows that the retriever returns, the answer can only be as relevant as those rows.