
# Retrieval-Augmented Generation (RAG) Model for QA Bot on P&L Data
This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline using LangChain, FAISS, and Google Generative AI. 
The process involves extracting text from PDFs, creating a vector store, and performing QA (Question Answering) with generative AI. 
Additionally, we calculate the accuracy of the responses by comparing them against ground truth data.


Install the required dependencies first. The cell below must run successfully before the rest of the pipeline.


In [None]:

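# Required installations
# `os` is part of the Python standard library, so it is not installed with pip.
%pip install langchain
%pip install langchain_google_genai
%pip install PyPDF2
%pip install faiss-cpu
%pip install langchain_community
# scikit-learn provides accuracy_score, which is used in Step 5
%pip install scikit-learn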



## Step 1: Import Necessary Libraries
This step imports all the required libraries, including:
- `os` for managing environment variables and file paths.
- `PyPDF2` for extracting text from PDF files.
- `LangChain` modules for text processing, embeddings, vector storage, and question-answering.
- `FAISS` for efficient similarity search.


In [1]:

import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS  # vector stores live in langchain_community in recent LangChain releases
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from PyPDF2 import PdfReader





## Step 2: Define Helper Functions
We define helper functions to:
1. Extract text from a PDF file.
2. Split the text into smaller chunks for easier processing and embedding.
3. Create a FAISS vector store for similarity search, storing the index in a specified folder.
4. Create the RAG model for question-answering using Google Generative AI and LangChain.
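
The overlap between chunks (item 2 above) can be pictured with a simplified sliding-window splitter. This is only a sketch: `sliding_window_chunks` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences rather than at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; the real splitter also respects
    natural separators (paragraphs, lines, words).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a figure that straddles a chunk boundary still appears whole in at least one chunk.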


In [2]:

def extract_text_from_pdf(pdf_file):
    """
    Extracts text from a PDF file using PyPDF2.
    """
    pdf_reader = PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        # Guard against pages with no extractable text (e.g. scanned images)
        text += page.extract_text() or ""
    return text


In [3]:

def get_text_chunks(text):
    """
    Splits the loaded text into chunks for embedding and retrieval.
    """
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
    return text_splitter.split_text(text)


In [4]:

def get_vector_store(text_chunks, index_path="faiss_index"):
    """
    Embeds the text chunks into a vector store for similarity search.
    Saves the FAISS index to the specified folder.
    """
    # Create the index folder if it does not exist yet
    os.makedirs(index_path, exist_ok=True)

    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local(index_path)
    return vector_store


In [10]:

def get_rag_model():
    """
    Creates a question-answering chain ("stuff" type) using LangChain and Google Generative AI.
    """
    prompt_template = """
    Answer the question as detailed as possible from the provided context. Make sure to provide all the details. 
    If the answer is not in the provided context, just say, \"Answer is not available in the context.\" Don't provide a wrong answer.

    Context:\n{context}\n
    Question:\n{question}\n
    Answer:
    """
    model = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.3)
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    return load_qa_chain(model, chain_type="stuff", prompt=prompt)
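
For intuition, the "stuff" chain type can be thought of as placing all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that with plain string formatting. `build_stuff_prompt`, the shortened template, and the placeholder chunk texts are hypothetical, and joining chunks with a blank line is an assumption about the default separator rather than something this notebook configures.

```python
# Shortened, hypothetical version of the prompt template used above
PROMPT_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)


def build_stuff_prompt(chunks, question, template=PROMPT_TEMPLATE):
    """Concatenate all chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Placeholder chunk texts, not taken from any real document
prompt = build_stuff_prompt(
    ["<text of chunk A>", "<text of chunk B>"],
    "What is the gross profit for Q3 2024?",
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while the combined chunks fit in the model's context window.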



## Step 3: Define Input Variables
In this step, we set up the input variables:
- `api_key`: The Google API key used to access Generative AI services. To get a key, visit https://aistudio.google.com/app/apikey and create your own API key.
- `pdf_path`: The path to the PDF file.
- `user_question`: An example question to test the RAG pipeline.
- `index_folder`: The folder where FAISS index files will be stored.
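
To keep the key out of the notebook itself, one option is to read it from the environment and prompt only when it is missing. This is a minimal sketch, not part of the original pipeline: `resolve_api_key` and its `env` and `prompt_fn` parameters are hypothetical names introduced here so the logic can be exercised without a real key.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt_fn=getpass):
    """Return the Google API key from the environment, or ask for it.

    `env` and `prompt_fn` are injectable so the function can be tried
    without touching the real environment or blocking on input.
    """
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        # Falls back to an interactive prompt that hides the typed value
        key = prompt_fn("Google API key: ").strip()
    if not key:
        raise ValueError("No Google API key provided")
    return key


# Demonstration with a fake environment; "demo-key" is a placeholder
print(resolve_api_key(env={"GOOGLE_API_KEY": "  demo-key  "}))
```

In the notebook, this could replace the hard-coded string with `os.environ["GOOGLE_API_KEY"] = resolve_api_key()`.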


In [48]:

api_key = "Enter Your own API key here"  # Replace with your actual API key
os.environ["GOOGLE_API_KEY"] = api_key

pdf_path = (r"\Sample Financial Statement.pdf")  # Replace with the path to your PDF file
user_question = "What is the gross profit for Q3 2024?"  # Example question
index_folder = "faiss_index"  # Folder to save the FAISS index




## Step 4: Process the PDF and Run the QA Pipeline
This step involves:
1. Extracting text from the PDF.
2. Splitting the text into chunks.
3. Creating a FAISS vector store.
4. Performing similarity search on the vector store.
5. Running the RAG model to generate an answer to the question.
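
Step 4 above, the similarity search, amounts to ranking stored chunk embeddings by their distance to the question embedding and keeping the closest few. The toy sketch below shows the idea with two-dimensional vectors. The vectors, chunk labels, and helper names are made up for illustration, it assumes Euclidean distance, and a real FAISS index performs this search far more efficiently over high-dimensional embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the texts of the k entries closest to the query vector."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _vec in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for the vector store
toy_index = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.0, 1.0]),
    ("chunk C", [0.9, 0.1]),
]

nearest = top_k([1.0, 0.0], toy_index, k=2)
print(nearest)
# ['chunk A', 'chunk C']
```

Only the returned chunks are passed to the model, so the quality of this ranking bounds the quality of the final answer.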


In [49]:

try:
    # Extract text from the PDF
    raw_text = extract_text_from_pdf(pdf_path)

    # Split text into chunks
    text_chunks = get_text_chunks(raw_text)

    # Create the vector store and save it to the folder
    vector_store = get_vector_store(text_chunks, index_path=index_folder)

    # Perform similarity search
    docs = vector_store.similarity_search(user_question)

    # Initialize the RAG model
    chain = get_rag_model()

    # Get the response
    response = chain({"input_documents": docs, "question": user_question}, return_only_outputs=True)

    # Display the response
    print("Response:")
    print(response.get("output_text", "No response generated."))

except Exception as e:
    print(f"An error occurred: {e}")


Response:
    11,175



## Step 5: Evaluate the Model's Accuracy
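We calculate the accuracy of the RAG model by comparing each generated response with its expected answer. The comparison is an exact string match, so a correct answer that is worded differently counts as wrong.
The cell below defines a small `ground_truth` dictionary of questions and their expected answers; extend it with more pairs for a more meaningful estimate.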
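
Because generative models often wrap a figure in a sentence, comparing whole strings can understate accuracy. One possible alternative, sketched below, is to compare the numbers found in each response. `extract_numbers`, `is_match`, and `lenient_accuracy` are hypothetical helpers, not part of scikit-learn or LangChain, and the matching rule is an assumption that suits numeric answers like the ones in this notebook.

```python
import re


def extract_numbers(text):
    """Return numeric tokens with thousands separators removed."""
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return [tok.replace(",", "") for tok in tokens]


def is_match(predicted, expected):
    """True if every number in the expected answer appears in the prediction.

    Falls back to a case-insensitive string comparison when the
    expected answer contains no digits.
    """
    expected_nums = extract_numbers(expected)
    if expected_nums:
        predicted_nums = extract_numbers(predicted)
        return all(num in predicted_nums for num in expected_nums)
    return predicted.strip().lower() == expected.strip().lower()


def lenient_accuracy(predictions, expected_answers):
    """Fraction of predictions that match their expected answer."""
    if not expected_answers:
        return 0.0
    hits = sum(is_match(p, e) for p, e in zip(predictions, expected_answers))
    return hits / len(expected_answers)


print(is_match("The gross profit for Q3 2024 is 11,175.", "11,175"))
# True
print(is_match("Answer is not available in the context.", "11,175"))
# False
```

This rule is deliberately loose: a response that lists several figures, including the right one, still counts as a match, so it is worth inspecting the individual predictions as well.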


In [50]:

# Accuracy Calculation
# Define a ground truth dataset as a dictionary of questions and their expected answers
ground_truth = {
    "What is the gross profit for Q3 2024?": "11,175",
    "What is the net revenue for the year 2023?": "146,767",
    # Add more questions and their expected answers here
}

# Generate predictions using the pipeline
predictions = []
true_labels = []

for question, true_answer in ground_truth.items():
    docs = vector_store.similarity_search(question)  # Perform similarity search
    response = chain({"input_documents": docs, "question": question}, return_only_outputs=True)
    predicted_answer = response.get("output_text", "No response generated.")
    
    predictions.append(predicted_answer.strip())
    true_labels.append(true_answer.strip())

# Calculate accuracy as the share of exact string matches (requires scikit-learn)
from sklearn.metrics import accuracy_score

try:
    accuracy = accuracy_score(true_labels, predictions)
    print(f"Model Accuracy: {accuracy * 100:.2f}%")
except Exception as e:
    print("Error in calculating accuracy:", e)

# Optionally, display predictions and true answers for inspection
for i, question in enumerate(ground_truth.keys()):
    print(f"Question: {question}")
    print(f"Predicted Answer: {predictions[i]}")
    print(f"True Answer: {true_labels[i]}")
    print("-" * 50)


Model Accuracy: 100.00%
Question: What is the gross profit for Q3 2024?
Predicted Answer: 11,175
True Answer: 11,175
--------------------------------------------------
Question: What is the net revenue for the year 2023?
Predicted Answer: 146,767
True Answer: 146,767
--------------------------------------------------



## Conclusion
This notebook demonstrated a RAG pipeline built with LangChain, FAISS, and Google Generative AI.
We processed a PDF, created a FAISS index, performed similarity search, generated responses, and measured exact-match accuracy on a two-question ground-truth set.
