<a href="https://colab.research.google.com/github/dayody/RAG_System_QA/blob/main/RAGQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Retrieval-Augmented Generation (RAG) System for AI Research Papers (Google Colab)
This notebook implements a RAG system to answer questions based on a collection of AI research papers, adapted to run in Google Colab.




In [2]:
# CELL 1: Install Required Libraries
# ------------------------------------
# We use '!' to run shell commands in Colab. This installs all necessary packages.

!pip install -q langchain langchain-openai langchain-community pypdf faiss-cpu tiktoken


In [3]:
# CELL 2: Import Libraries and Set OpenAI API Key
# ---------------------------------------------------------
# This cell imports all necessary packages and securely prompts you to enter your
# OpenAI API key.

import os
import getpass

# Core LangChain and community module imports.
# Prompts, runnables, output parsers, and Document are imported from
# langchain_core, which is their current home; the older langchain.schema and
# langchain.prompts paths are legacy aliases.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

# Securely get the OpenAI API key from the user (input is hidden while typing)
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')

# Check the value, not just the variable's presence: an empty entry would
# otherwise be reported as a success.
if os.environ.get('OPENAI_API_KEY'):
    print("\nOpenAI API key set successfully.")
else:
    print("\nError: OpenAI API key was not set.")


# %%

Enter your OpenAI API key: ··········

OpenAI API key set successfully.


In [4]:
# CELL 3: Upload PDF Files
# --------------------------
# This cell allows you to upload your research papers to the Colab environment.
# It will create a 'papers' directory to store them.

import shutil
from google.colab import files

print("Please upload your PDF research papers.")
# Create a directory to store the papers
papers_dir = 'papers'
if os.path.exists(papers_dir):
    shutil.rmtree(papers_dir) # Clean up previous uploads
os.makedirs(papers_dir)

# Upload files
uploaded = files.upload()

# Move uploaded files to the 'papers' directory
for filename in uploaded.keys():
    shutil.move(filename, os.path.join(papers_dir, filename))

print(f"\nUploaded {len(uploaded)} files to the '{papers_dir}' directory.")


# %%

Please upload your PDF research papers.


Saving 1706.03762v7.pdf to 1706.03762v7.pdf
Saving 2005.11401v4.pdf to 2005.11401v4.pdf
Saving 2005.14165v4.pdf to 2005.14165v4.pdf

Uploaded 3 files to the 'papers' directory.


In [5]:
# CELL 4: Deliverable 1 - Document Preprocessing
# ------------------------------------------------
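# We load the PDF documents from the 'papers/' directory and then split them
# into smaller, overlapping chunks (at most 1000 characters each, with 150
# characters of overlap). Note that PyPDFDirectoryLoader returns one Document
# per PDF page, so the "documents" count printed below is a page count.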

print("\nLoading and preprocessing documents...")

try:
    loader = PyPDFDirectoryLoader("papers/")
    docs_before_split = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=150,
        length_function=len,
        is_separator_regex=False,
    )
    docs_chunked = text_splitter.split_documents(docs_before_split)

    print(f"Successfully loaded {len(docs_before_split)} documents.")
    print(f"Split documents into {len(docs_chunked)} chunks.")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure you have uploaded valid PDF files in the previous step.")


# %%


Loading and preprocessing documents...
Successfully loaded 109 documents.
Split documents into 450 chunks.
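
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraph and line breaks, while this hypothetical `chunk_text` helper cuts purely by character position.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one. The overlap in Cell 4 serves the same purpose: a sentence that straddles a chunk boundary still appears intact in at least one chunk.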


In [6]:
# CELL 5: Deliverable 2 - Building the Retrieval System
# -------------------------------------------------------
# This cell converts the text chunks into numerical vectors (embeddings) and
# stores them in a FAISS vector store for efficient similarity search. The
# retriever returns the 4 chunks most similar to each query (k=4).

print("\nCreating vector store and retriever...")

try:
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs_chunked, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    print("Vector store and retriever created successfully.")
except Exception as e:
    print(f"An error occurred during vector store creation: {e}")


# %%


Creating vector store and retriever...
Vector store and retriever created successfully.
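
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force search in plain Python. The two-dimensional vectors, the `top_k` helper, and the choice of Euclidean distance are all assumptions made for illustration; real embeddings have many more dimensions, and FAISS performs the search far more efficiently.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec.

    Hypothetical brute-force helper: computes the Euclidean distance to
    every stored vector and keeps the k smallest.
    """
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy "embeddings" for four chunks (made-up values).
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
toy_query = [1.0, 0.05]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
# [0, 1]
```

The indices returned correspond to the chunks that would be handed to the language model as context.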


In [9]:
# CELL 6: Deliverables 3 & 4 - Answer Generation and Source Attribution
# ----------------------------------------------------------------------
# Here, we define the complete RAG chain. The retriever is called once with
# the question string. Its output (a list of Document objects) is kept under
# the "context" key for source attribution, and a formatted string built from
# the same documents is passed to the prompt for answer generation.

print("\nDefining the RAG chain...")

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

prompt_template = """
You are an expert assistant for question-answering tasks.
Use the following retrieved context to answer the question.
If you don't know the answer from the context, just say that you don't know.
Keep the answer concise and use a maximum of three sentences.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

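# RunnableParallel sends the question string to the retriever and also passes
# it through unchanged. The "answer" step then formats the retrieved documents
# into a single string for the prompt only, so the original Document objects
# remain available under "context" in the final output.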
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(
    answer=(
        RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
        | prompt
        | llm
        | StrOutputParser()
    )
)

print("RAG chain with source attribution is ready.")



Defining the RAG chain...
RAG chain with source attribution is ready.
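
The data flow through the chain can be sketched in plain Python, without LangChain. Everything here is a stand-in invented for illustration: `fake_retriever` and `fake_llm` replace the real retriever and model, documents are plain dicts rather than `Document` objects, and `run_rag` only mimics the shape of the result, not how runnables are executed.

```python
def fake_retriever(question):
    """Stand-in retriever: always returns the same two toy documents."""
    return [
        {"page_content": "Chunk A text.", "metadata": {"source": "papers/a.pdf", "page": 2}},
        {"page_content": "Chunk B text.", "metadata": {"source": "papers/b.pdf", "page": 5}},
    ]


def fake_llm(prompt_text):
    """Stand-in model: returns a fixed string instead of calling an API."""
    return "stub answer"


def run_rag(question, retrieve, generate):
    # Step 1: retrieve once and keep the raw documents alongside the question.
    state = {"context": retrieve(question), "question": question}

    # Step 2: build a formatted copy of the context for the prompt only.
    context_text = "\n\n".join(doc["page_content"] for doc in state["context"])
    prompt_text = f"CONTEXT:\n{context_text}\n\nQUESTION:\n{question}\n\nANSWER:"

    # Step 3: add the answer without overwriting the raw documents.
    state["answer"] = generate(prompt_text)
    return state


result = run_rag("What is attention?", fake_retriever, fake_llm)
print(sorted(result.keys()))
# ['answer', 'context', 'question']
```

Because `context` still holds the retrieved documents with their metadata, a caller can list each source file and page next to the answer.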


In [10]:
# CELL 7: Testing the RAG System
# --------------------------------
# This cell provides a function to query the RAG system and then runs the
# sample questions provided in the project description.

def ask_question(query: str):
    """
    Invokes the RAG chain with a query and prints the answer and its sources.
    """
    if 'rag_chain_with_sources' not in globals():
        print("RAG chain is not defined. Please run the previous cells.")
        return

    print(f"\n{'='*20}\nQuery: {query}\n{'='*20}")

    try:
        response = rag_chain_with_sources.invoke(query)
        answer = response["answer"]
        sources = response["context"]

        print("ANSWER:")
        print(answer)

        print("\nSOURCES:")
        if sources:
            for i, source_doc in enumerate(sources):
                source_name = os.path.basename(source_doc.metadata.get('source', 'Unknown'))
                page_number = source_doc.metadata.get('page', 'N/A')
                print(f"  - Source {i+1}: {source_name}, Page: {page_number}")
        else:
            print("  - No sources were retrieved for this answer.")

    except Exception as e:
        print(f"An error occurred while processing the query: {e}")


# List of sample questions to test the system
sample_questions = [
    "What are the main components of a RAG model, and how do they interact?",
    "What are the two sub-layers in each encoder layer of the Transformer model?",
    "Explain how positional encoding is implemented in Transformers and why it is necessary.",
    "Describe the concept of multi-head attention in the Transformer architecture. Why is it beneficial?",
    "What is few-shot learning, and how does GPT-3 implement it during inference?"
]

# Iterate through the questions and get answers
for q in sample_questions:
    ask_question(q)




Query: What are the main components of a RAG model, and how do they interact?
ANSWER:
The main components of a RAG model are a retrieval component and a generation component. They interact by retrieving relevant information from a large corpus of documents and using that information to generate specific and factual answers.

SOURCES:
  - Source 1: 2005.11401v4.pdf, Page: 8
  - Source 2: 2005.11401v4.pdf, Page: 4
  - Source 3: 2005.11401v4.pdf, Page: 8
  - Source 4: 2005.11401v4.pdf, Page: 1

Query: What are the two sub-layers in each encoder layer of the Transformer model?
ANSWER:
The two sub-layers in each encoder layer of the Transformer model are a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network.

SOURCES:
  - Source 1: 1706.03762v7.pdf, Page: 2
  - Source 2: 1706.03762v7.pdf, Page: 1
  - Source 3: 1706.03762v7.pdf, Page: 1
  - Source 4: 1706.03762v7.pdf, Page: 4

Query: Explain how positional encoding is implemented in Transform