# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains dummy customer data with attributes such as first name, last name, and company. This dataset serves a RAG use case: building a customer information Q&A system.

## Key Components

1. Loading and splitting CSV files
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and text embeddings (Hugging Face or Ollama embeddings in the code below)
3. Retriever setup for querying the processed documents
4. A question-answering chain over the CSV data

## Method Details

### Document Preprocessing

1. The CSV file is loaded with LangChain's `CSVLoader`, which produces one document per row.
2. The documents are split into chunks.
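
To make the preprocessing step concrete, here is a minimal sketch of how a CSV row can become a text document. It uses only the standard library and mimics the `column: value` layout visible in the retrieved context later in this notebook. The `rows_to_documents` helper, the dict-based document shape, and the two-row sample are assumptions for illustration, not LangChain's actual implementation.

```python
import csv
import io

# Hypothetical two-row sample that mirrors a few columns of the customer CSV.
SAMPLE_CSV = """Index,First Name,Last Name,Company
1,Sheryl,Baxter,Rasmussen Group
9,Sheryl,Meyers,Browning-Simon
"""


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(documents[0]["page_content"])
```

Keeping one row per document means that each retrieved chunk is a complete customer record, which suits lookups such as "which company does this person work for".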


### Vector Store Creation

1. An embedding model (Hugging Face or Ollama in the code below) creates vector representations of the text chunks.
2. A FAISS vector store is built from these embeddings for efficient similarity search.
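
The idea behind a flat L2 index can be sketched without FAISS. The toy `FlatL2Index` class below is a hypothetical stand-in: it stores vectors and answers a query by exhaustively comparing squared Euclidean distances, which is the kind of exact search `faiss.IndexFlatL2` performs. The vectors are hand-picked integers, not real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class FlatL2Index:
    """Toy exact-search index (illustrative only, not the FAISS API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} values, got {len(vector)}"
            )
        self.vectors.append(list(vector))

    def search(self, query, k):
        # Compare the query against every stored vector, then keep the k closest.
        scored = [
            (squared_l2(query, vector), position)
            for position, vector in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]


index = FlatL2Index(dimension=3)
index.add([4, 0, 0])  # position 0
index.add([0, 4, 0])  # position 1
index.add([3, 1, 0])  # position 2

# Each hit is a (squared distance, position) pair, closest first.
hits = index.search([4, 0, 0], k=2)
print(hits)
```

The index only knows positions, so a real vector store also keeps a mapping from position to document; that is the role of the `docstore` and `index_to_docstore_id` arguments in the notebook.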

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.
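
As a rough illustration of what a retriever does, the sketch below embeds the query, ranks every document by distance, and returns the top `k`. The word-count `embed` function is a deliberately crude assumption that stands in for a real embedding model, and `retrieve` is a hypothetical helper, not a LangChain function. The sample records reuse a few fields from the customer data.

```python
from collections import Counter

DOCUMENTS = [
    "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group",
    "First Name: Sheryl\nLast Name: Meyers\nCompany: Browning-Simon",
    "First Name: Carl\nLast Name: Schroeder\nCompany: Oconnell, Meza and Everett",
]


def tokenize(text):
    return text.lower().split()


def embed(text, vocabulary):
    """Toy embedding: one word-count entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(query, documents, k=4):
    """Return the k documents whose toy embeddings are closest to the query."""
    vocabulary = sorted(
        {word for text in [query, *documents] for word in tokenize(text)}
    )
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        documents,
        key=lambda doc: squared_l2(query_vector, embed(doc, vocabulary)),
    )
    return ranked[:k]


top_two = retrieve("sheryl baxter", DOCUMENTS, k=2)
for doc in top_two:
    print(doc, end="\n\n")
```

With the LangChain vector store used in this notebook, the number of results is usually controlled through `vector_store.as_retriever(search_kwargs={"k": 2})` rather than a hand-written function.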

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Pluggable models: The embedding model and LLM can be swapped (for example Hugging Face, Ollama, or OpenAI) without changing the rest of the pipeline.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications that need quick access to specific records within a CSV file.

Option one - Meta Llama 3 8B - on cpu - SUPER SLOW

In [None]:
# Option one - Meta Llama 3 8B - on cpu - SUPER SLOW
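from langchain_community.document_loaders.csv_loader import CSVLoader
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# GPT4All and HuggingFaceEmbeddings are imported from langchain_community;
# the older langchain.llms / langchain.embeddings paths are deprecated
from langchain_community.llms import GPT4All
from langchain_community.embeddings import HuggingFaceEmbeddings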
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

llm = GPT4All(model="Meta-Llama-3-8B-Instruct.Q4_0.gguf")
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Option two - Ollama - faster with GPU

In [2]:
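# Option two - Ollama - faster with GPU
from langchain_ollama import ChatOllama, OllamaEmbeddings

model_id = "llama3.1"  # must already be available locally, e.g. via `ollama pull llama3.1`
llm = ChatOllama(model=model_id, temperature=0)  # temperature=0 for more deterministic answers
embeddings = OllamaEmbeddings(model=model_id)  # the same model also produces the embeddings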

In [14]:
response = llm.invoke("do you remember what I said?")
print(response.content)

This is the beginning of our conversation, so I don't have any prior memory or context. I'm happy to chat with you now, though! What would you like to talk about?


## CSV File Structure and Use Case

In [None]:
import pandas as pd

file_path = '../data/customers-100.csv'  # path to the CSV file
data = pd.read_csv(file_path)

# Preview the first rows of the CSV file
data.head()

Load and process the CSV data

In [52]:
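# Imported here too, so this cell also runs when only option two was executed
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path=file_path)  # one document per CSV row
docs = loader.load_and_split()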

Initialize the FAISS vector store with the embedding model

In [53]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# Embed a dummy string to discover the embedding dimension, then build an exact L2 index
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)



Add the split CSV data to the vector store

In [54]:
vector_store.add_documents(documents=docs)

['7282ee4c-f4b4-4e02-b4b4-608320d4a0ff',
 'f5794194-468e-4fc5-893a-52a6fa031bdd',
 '4fe947d6-f5e2-4c19-adeb-4eb54728b7ff',
 '9e88dcf5-8773-43ef-b17b-6b4c39399be6',
 '0b0eb398-4e2b-4ded-a648-b998c18fa08c',
 '18ea91c0-70b6-47d5-99ec-2f34063b1930',
 '50f07c63-ce96-44c7-879f-d7c216e5ee00',
 '46607f4f-fb44-4820-ae3c-68e81ef82031',
 '2520167f-6c0e-4b78-8568-97c5f083a6d8',
 'fc47cab5-7f72-4802-a99b-6d1fb374a87d',
 '3e90c62e-7c03-4c07-a64c-dc97137414b5',
 '3c21258d-174b-4f88-8d35-2c614d65ec03',
 '9de4923f-0be1-48f7-9805-fc8f430ef0ec',
 '2551567e-c2bc-4480-ad53-1a847b6b92ef',
 '186af085-5648-4140-a253-41141ffd87cf',
 '8c8d5c72-6600-42b6-a332-5b3aee2dd2ff',
 'b2cc1719-3b58-4826-8331-3e89a0e3480c',
 'd0c39537-2feb-45f9-94f4-8ec546de6bdc',
 '32a5b976-e2ef-4304-ab6a-23b2bcbe70a1',
 'fd4b495b-9c80-4ab4-9da5-d9a3dc7ae3b4',
 '318c65ec-bc0d-455d-a5f7-4d0d6397f714',
 '86e980ca-2644-485f-a2dc-f0b0658b41e4',
 'efd0c379-5c9d-44c8-9d2a-ee885ba8f5ed',
 '92793092-7996-4260-bed0-ee0ce6ae059c',
 '631407a6-22b7-

Create the retrieval chain

In [131]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vector_store.as_retriever()

# Set up system prompt
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
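
To show what filling the `{context}` placeholder amounts to, here is a minimal standard-library sketch of the "stuff" approach: the retrieved texts are joined with blank lines and substituted into the system prompt. The helpers `stuff_documents` and `build_messages`, the shortened template, and the trimmed sample records are assumptions for illustration; the LangChain chain above handles this internally.

```python
# Shortened, hypothetical version of the system prompt used in the notebook.
SYSTEM_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def stuff_documents(page_contents, separator="\n\n"):
    """Concatenate all retrieved texts into one context string."""
    return separator.join(page_contents)


def build_messages(question, page_contents):
    """Return (role, text) pairs similar to what a chat prompt template renders."""
    context = stuff_documents(page_contents)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


retrieved = [
    "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group",
    "First Name: Sheryl\nLast Name: Meyers\nCompany: Browning-Simon",
]
messages = build_messages("which company does sheryl Baxter work for?", retrieved)
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```

Because every retrieved record is pasted into a single prompt, this approach works best when the number of retrieved documents is small enough to fit in the model's context window.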

Query the RAG bot with a question based on the CSV data

In [132]:
answer = rag_chain.invoke({"input": "which company does sheryl Baxter work for?"})
print(answer['answer'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.

Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/

Index: 9
Customer Id: C2dE4dEEc489ae0
First Name: Sheryl
Last Name: Meyers
Company: Browning-Simon
City: Robersonstad
Country: Cyprus
Phone 1: 854-138-4911x5772
Phone 2: +1-448-910-2276x729
Email: mariokhan@ryan-pope.org
Subscription Date: 2020-01-13
Website: https://www.bullock.net/

Index: 11
Customer Id: 216E205d6eBb815
First Name: Carl
Last Name: Schroeder
Company: Oconnell, Meza and Everett
City: Shannonville
Country: Guernsey
Phone 1: 637-854-0256x825
Phone 2: 114.33