# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains dummy customer data with attributes such as first name, last name, and company. It serves as the knowledge base for a customer-information Q&A system built with RAG.

## Key Components

1. Loading and splitting CSV files
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and text embeddings (OpenAI, or the local HuggingFace model used in the code below)
3. Retriever setup for querying the processed documents
4. Creating a question-answering chain over the CSV data

## Method Details

### Document Preprocessing

1. The CSV file is loaded using LangChain's `CSVLoader`.
2. The loaded data is split into chunks.
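
To make the loading step concrete, here is a minimal standard-library sketch of the row-to-document conversion. It assumes, based on `CSVLoader`'s documented behavior, that each CSV row becomes one document whose text is a series of `column: value` lines; the two-row sample and the `rows_to_documents` helper are hypothetical and only mirror the customer CSV's column names.

```python
import csv
import io

# Hypothetical two-row sample; the column names mirror the customer CSV.
SAMPLE_CSV = """First Name,Last Name,Company
Sheryl,Baxter,Rasmussen Group
Preston,Lozano,Vega-Gentry
"""


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into one text document plus row metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field of the row.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(documents[0]["page_content"])
```

Keeping the column name next to each value is what lets a question such as "which company does Sheryl Baxter work for?" match the right row later.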


### Vector Store Creation

1. An embedding model (OpenAI embeddings, or the local HuggingFace model used in the code) creates vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.
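
Conceptually, retrieval here is a nearest-neighbor search over the stored vectors. The sketch below is a pure-Python illustration of exhaustive search by squared L2 distance, which is the kind of comparison a flat L2 index performs; the `top_k` helper and the 3-dimensional vectors are hypothetical stand-ins for real embeddings, not the FAISS API.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, stored_vectors, k=2):
    """Return (distance, position) pairs for the k closest stored vectors."""
    scored = [
        (squared_l2(query_vector, vector), position)
        for position, vector in enumerate(stored_vectors)
    ]
    scored.sort()  # smallest distance first
    return scored[:k]


# Hypothetical 3-dimensional "embeddings" of four chunks.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

results = top_k(query, chunk_vectors, k=2)
print([position for _, position in results])
```

The `k` argument plays the role of the "number of retrieved results" setting: a larger `k` passes more context to the language model at the cost of a longer prompt.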

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with pretrained embedding models: Uses off-the-shelf embeddings (OpenAI or HuggingFace sentence-transformers) for text representation, so no model training is required.
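
As an illustration of the chunk-size parameter mentioned in the list above, the following sketch splits text into fixed-size windows with a configurable overlap. It is a deliberately simplified splitter, not LangChain's own implementation; the `split_text` function and its parameter names are assumptions chosen for the example.

```python
def split_text(text, chunk_size=30, chunk_overlap=10):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters so that content cut
    at a boundary still appears intact in one of the chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "First Name: Sheryl, Last Name: Baxter, Company: Rasmussen Group"
chunks = split_text(text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Smaller chunks give more precise matches but less context per hit, so the right value depends on how long each CSV row is once rendered as text.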

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file.

Import libraries

In [34]:
from langchain_community.document_loaders.csv_loader import CSVLoader
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.llms import GPT4All
from langchain_community.embeddings import HuggingFaceEmbeddings
import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# Export the OpenAI API key only if one is configured. It is needed only for the
# commented-out OpenAI models, and assigning None to os.environ raises a TypeError.
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key:
    os.environ["OPENAI_API_KEY"] = openai_api_key

# llm = ChatOpenAI(model="gpt-3.5-turbo-0125")
# Local model loaded through GPT4All; the model file must be fully downloaded first
llm = GPT4All(model="mistral-7b-openorca.Q4_0")

llama_model_load: error loading model: tensor 'blk.1.ffn_down.weight' data is not within the file bounds, model is corrupted or incomplete
llama_load_model_from_file: failed to load model
LLAMA ERROR: failed to load model from /home/dor/.cache/gpt4all/mistral-7b-openorca.Q4_0.gguf


Exception: Model not loaded

Preview the CSV file with pandas

In [18]:
import pandas as pd

file_path = '../data/customers-100.csv'  # insert the path of the CSV file
data = pd.read_csv(file_path)

# Preview the first rows of the CSV file
data.head()

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |


Load and process the CSV data

In [19]:
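# CSVLoader produces one document per CSV row; load_and_split() then passes
# the documents through LangChain's default text splitter
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()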

Initialize the FAISS vector store and the embedding model

In [20]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# embeddings = OpenAIEmbeddings()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Embed a dummy string once to find the vector dimension the flat L2 index needs
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)
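
The role of the `docstore` and `index_to_docstore_id` arguments can be pictured with a toy stand-in. The sketch below is not the LangChain `FAISS` class; `ToyVectorStore`, `toy_embed`, and the `lookup` method are hypothetical names, and the assumed bookkeeping is simply that the index stores vectors by position, the docstore stores documents by ID, and the mapping connects the two.

```python
import uuid


class ToyVectorStore:
    """Illustrative stand-in for a vector store's bookkeeping."""

    def __init__(self, embed):
        self.embed = embed              # function: text -> list of floats
        self.vectors = []               # plays the role of the vector index
        self.docstore = {}              # document ID -> document text
        self.index_to_docstore_id = {}  # vector position -> document ID

    def add_documents(self, texts):
        """Embed and store each text, returning the generated IDs."""
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())
            position = len(self.vectors)
            self.vectors.append(self.embed(text))
            self.docstore[doc_id] = text
            self.index_to_docstore_id[position] = doc_id
            ids.append(doc_id)
        return ids

    def lookup(self, position):
        """Resolve a position returned by the index back to its document."""
        return self.docstore[self.index_to_docstore_id[position]]


def toy_embed(text):
    # Hypothetical 2-dimensional embedding: text length and vowel count.
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


store = ToyVectorStore(toy_embed)
ids = store.add_documents(["Company: Rasmussen Group", "Company: Vega-Gentry"])
print(store.lookup(1))
```

Under this picture, starting with an empty index, an empty docstore, and an empty mapping means the store holds nothing until documents are added, and adding documents yields one generated ID per document.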



Add the split CSV data to the vector store

In [21]:
vector_store.add_documents(documents=docs)

['79e59249-1d80-4961-9ada-b836c6b48ed3',
 'aca3add5-54d4-4f8d-861d-dd4a4e4bdb63',
 'ef31a008-1c64-44b0-95d4-78dbf3bb0fa6',
 'fe3519a8-519f-4a31-a699-a18abb0316d5',
 '0c4d2b25-4091-49b9-b9ef-fe2421af8438',
 'c7906fdd-804c-457a-9939-9ef8f0df42ad',
 '85033680-856a-40ba-8d49-690391ecbe38',
 'bb9f35cc-387c-417e-a839-9963e24482b8',
 'd5e95c0f-d208-4a91-a714-02c906bc5460',
 'd5d22ae7-178b-4922-b9e0-199ffd7c602b',
 'd2d6ea46-5df1-4816-901c-0c17bb10ea57',
 'cfcfc394-cf27-4af0-a83a-6e8b4a7e7a15',
 'd725aa03-aa94-4363-83df-cb359b072a5a',
 'afbe43c1-c105-4dc5-9d74-0b21a175b5f3',
 'e341b382-793c-426f-99df-930bbf29d9f8',
 'b6cd2152-2de8-412c-9fbd-7439ad0226b2',
 '213b27bd-5247-463d-8f8a-0af3de52c7e6',
 'e8cf1ddc-ee92-461f-b6b9-7268f2330868',
 'c53f8eee-56e1-4405-9dec-de3dbaa858ff',
 '1d4066be-4d1a-4302-a68a-6b7f9dcd5b10',
 '496d323f-4837-4f3d-86d1-fcdd47e12473',
 'b2c7cc11-52a7-4c1c-987b-3035053ff45b',
 '033591d7-faeb-437a-86c9-e96032325215',
 'de953219-93ce-427f-a706-3df753adb8da',
 '351e112d-cf86-

Create the retrieval chain

In [26]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vector_store.as_retriever()

# Set up system prompt
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
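
What the combined chain does with the retrieved chunks can be sketched without any LangChain code. The example below assumes the "stuff" strategy simply joins the retrieved chunks with blank lines and substitutes them for `{context}` in the system prompt; `build_messages`, the shortened template, and the sample chunks are hypothetical illustrations.

```python
# Shortened version of the system prompt used in the chain.
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "{context}"
)


def build_messages(question, retrieved_chunks):
    """Stuff all retrieved chunks into the prompt and pair it with the question."""
    context = "\n\n".join(retrieved_chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Hypothetical chunks, as a retriever might return them for the question below.
retrieved = [
    "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group",
    "First Name: Preston\nLast Name: Lozano\nCompany: Vega-Gentry",
]

messages = build_messages("Which company does Sheryl Baxter work for?", retrieved)
print(messages[0][1])
```

Because every retrieved chunk ends up in a single prompt, this strategy works best when the number of retrieved chunks is small enough to fit in the model's context window.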

Query the RAG chain with a question based on the CSV data

In [27]:
answer = rag_chain.invoke({"input": "Which company does Sheryl Baxter work for?"})
answer['answer']

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}