## Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline for a business question-answering (QA) bot. The pipeline retrieves relevant passages from a dataset indexed in Pinecone, a vector database, and passes them to one of Cohere's large language models, which generates a coherent answer to the query. It is built with LangChain, Pinecone, and the Cohere API.

## System Components

1. Document Loader: A CSV document loader is used to extract textual content from a dataset, in this case, the SQuAD dataset.

2. Text Chunking: To ensure efficient retrieval, large text blocks are split into smaller chunks.

3. Embedding Generation: Cohere's embedding model is used to convert text chunks into vectors that represent the meaning of the text in a high-dimensional space.

4. Vector Storage (Pinecone): The text embeddings are stored in Pinecone, a vector database that allows for efficient retrieval using similarity search.

5. Retrieval QA Chain: The QA chain combines the retrieval mechanism (Pinecone) with a generative language model (Cohere) to answer queries based on the retrieved documents.

## How It Works

1. Data Loading: The CSV file is read, and the text content is split into chunks.
2. Vectorization: Each chunk is transformed into a vector embedding using Cohere.
3. Indexing: The embeddings are stored in Pinecone for efficient retrieval.
4. Retrieval: For any given query, the bot retrieves the most relevant text chunks using Pinecone.
5. Answer Generation: Based on the retrieved chunks, Cohere generates a coherent response to the query.
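
The retrieval step above can be illustrated with a small sketch. It is not the Pinecone implementation: the three-dimensional vectors and the chunk names are made up for illustration, whereas the real pipeline uses 1024-dimensional Cohere embeddings and Pinecone's own index. It only shows the idea of ranking stored chunks by cosine similarity to the query vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings, invented for this example.
toy_chunks = {
    "chopin_bio": [0.9, 0.1, 0.0],
    "itunes_drm": [0.0, 0.2, 0.9],
    "warsaw_history": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_chunks, k=2))
```

With these toy vectors the two chunks that point in roughly the same direction as the query are ranked first, and the unrelated chunk is left out.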

In [None]:
!pip install pinecone-client langchain-community pypdf langchain-cohere langchain_pinecone



In [None]:
import os
from google.colab import userdata
os.environ["COHERE_API_KEY"] = userdata.get("COHERE_API_KEY")
os.environ["PINECONE_API_KEY"] = userdata.get("PINECONE_API_KEY")

In [None]:
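from langchain_core.documents import Document
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_text_from_csv(uploaded_file):
    """Load a CSV file and return its text as a list of chunked Documents."""
    # CSVLoader yields one Document per row, formatted as "column: value" lines.
    loader = CSVLoader(file_path=uploaded_file)
    documents = loader.load()
    # Merge all rows into one string so chunks can span row boundaries.
    text = " ".join([doc.page_content for doc in documents])
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=20000,
        chunk_overlap=2000,
        separators=["\n\n", ".", " ", ""],
    )
    chunks = text_splitter.split_text(text)
    # Wrap each chunk in a Document, the type the vector store expects.
    docs = [Document(page_content=chunk) for chunk in chunks]
    return docs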
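
To make the roles of chunk_size and chunk_overlap concrete, here is a simplified sketch. It is a plain fixed-width sliding window, not the algorithm RecursiveCharacterTextSplitter uses (that splitter also tries to break on the configured separators), and the function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.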

In [None]:
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-english-v3.0", cohere_api_key= os.environ["COHERE_API_KEY"])
embeddings

CohereEmbeddings(client=<cohere.client.Client object at 0x7b891dbf9e40>, async_client=<cohere.client.AsyncClient object at 0x7b89201d7cd0>, model='embed-english-v3.0', truncate=None, cohere_api_key=SecretStr('**********'), embedding_types=['float'], max_retries=3, request_timeout=None, user_agent='langchain:partner', base_url=None)

In [None]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
namespace = "angelee"

# Create the index on first run; dimension=1024 matches embed-english-v3.0.
if 'dataindex' not in pc.list_indexes().names():
    pc.create_index(
        name='dataindex',
        dimension=1024,
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Get the index handle only after the index is known to exist.
index = pc.Index('dataindex')

In [None]:
docs = get_text_from_csv("/content/SQuAD_csv.csv")
docs[100]

Document(metadata={}, page_content='.\nquestion: In what category could files without digital rights management be found on the iTunes store?\nid: 56cc89c46d243a140015f010\nanswer_start: 512\ntext: iTunes Plus : 166\ncontext: At the time the store was introduced, purchased audio files used the AAC format with added encryption, based on the FairPlay DRM system. Up to five authorized computers and an unlimited number of iPods could play the files. Burning the files with iTunes as an audio CD, then re-importing would create music files without the DRM. The DRM could also be removed using third-party software. However, in a deal with Apple, EMI began selling DRM-free, higher-quality songs on the iTunes Stores, in a category called "iTunes Plus." While individual songs were made available at a cost of US$1.29, 30¢ more than the cost of a regular DRM song, entire albums were available for the same price, US$9.99, as DRM encoded albums. On October 17, 2007, Apple lowered the cost of individua

In [None]:
from langchain_pinecone import PineconeVectorStore
vectorstore = PineconeVectorStore.from_documents(docs, embeddings, index_name='dataindex', namespace=namespace)

In [None]:
from langchain_cohere import ChatCohere
from langchain.chains import RetrievalQA

llm = ChatCohere(
    model="command-xlarge-nightly",
    cohere_api_key=os.environ["COHERE_API_KEY"],
)

knowledge = PineconeVectorStore.from_existing_index(
    index_name= 'dataindex',
    namespace= namespace,
    embedding= embeddings
)

retriever = knowledge.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever= retriever
)
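
The "stuff" chain type places all retrieved chunks into a single prompt for the language model. The sketch below illustrates that idea only: the function name build_stuff_prompt and the prompt wording are assumptions made for this example, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, invented for this example.
example_chunks = [
    "Chopin was born in 1810 in Zelazowa Wola.",
    "He settled in Paris at the age of 21.",
]
print(build_stuff_prompt("Where was Chopin born?", example_chunks))
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason to keep an eye on the chunk size and the number of retrieved documents.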

In [None]:
query = "Who is Frédéric Chopin?"

retrieved_docs = retriever.invoke(query)
print("Retrieved Document Chunks:")
for doc in retrieved_docs:
    print(doc.page_content)

In [None]:
result = qa.invoke(query).get('result')
print("Answer:", result)

Answer: Frédéric Chopin was a Polish and French composer and virtuoso pianist of the Romantic era. He was born in 1810 in Żelazowa Wola, a village located 29 miles west of Warsaw. Chopin's father, Nicolas Chopin, was a Frenchman from Lorraine who had emigrated to Poland in 1787. Chopin's mother, Justyna Krzyżanowska, was a poor relative of the Skarbeks, one of the families for whom Nicolas worked. Chopin was the couple's second child and only son. He had an elder sister, Ludwika, and two younger sisters, Izabela and Emilia.

Chopin was a child prodigy who completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising. He settled in Paris at the age of 21 and obtained French citizenship in 1835. Chopin formed a friendship with Franz Liszt and was admired by many of his musical contemporaries, including Robert Schumann.

Chopin's works remain popular, and he has been the 