<a href="https://colab.research.google.com/github/adi1bioinfo/NLP-Projects/blob/master/ragQA_researchPapers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Q&A tool using LangChain

## Installing all required libraries

In [None]:
# The main framework for building the application
!pip install langchain

# Specific LangChain packages for community integrations and Google's models
!pip install langchain_community langchain_core langchain_google_genai

# The library for turning text into numerical vectors (embeddings)
!pip install sentence-transformers

# The library for our local vector database (a super-fast search index)
!pip install faiss-cpu

# The library for loading text from PDF files
!pip install pypdf

Collecting pypdf
  Using cached pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Using cached pypdf-6.0.0-py3-none-any.whl (310 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0


In [None]:
!pip install -U langchain-huggingface

Collecting langchain-huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Downloading langchain_huggingface-0.3.1-py3-none-any.whl (27 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.3.1


## Setting the API key and selecting the Gemini model

In [None]:
import os
from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

# Securely load the API key from Colab's secrets store
os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

# Initializes the connection to the Google Gemini model
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
# llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Now, let's test it by asking a simple question
try:
    response = llm.invoke("Can you explain what a transformer model is in one sentence?")
    print("Connection successful!")
    print(response.content)
except Exception as e:
    print("An error occurred. Please check your API key and permissions.")
    print(e)

Connection successful!
A transformer is a neural network architecture that excels at understanding context and relationships in sequential data, like text, by using a self-attention mechanism to weigh the importance of all input elements relative to each other.
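
`google.colab.userdata` is only available inside Colab. If the notebook is run elsewhere, one possible fallback is to read the key from the environment and prompt for it only when it is missing. This is a minimal sketch; `get_api_key` is a hypothetical helper name, not part of any library.

```python
import os
from getpass import getpass

def get_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is absent."""
    key = os.environ.get(name)
    if not key:
        # Hidden prompt, so the key is not echoed into the notebook output
        key = getpass(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

Calling `get_api_key()` before creating `ChatGoogleGenerativeAI` would leave the key in `os.environ`, which is where the cell above places it as well.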


## Loading all research papers from a folder

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

folder_path = "research_papers/"

# Create the loader pointing at our folder, then load every PDF in it
loader = PyPDFDirectoryLoader(folder_path)
all_docs = loader.load()

print(f"Successfully loaded {len(all_docs)} pages from your documents.")

# Inspect the metadata of the second page (index 1) to see its source file
print("\n--- Metadata of the second page ---\n")
print(all_docs[1].metadata)

Successfully loaded 323 pages from your documents.

--- Metadata of the second page ---

{'producer': 'GPL Ghostscript 9.15', 'creator': 'Arbortext Advanced Print Publisher 9.0.215/W Unicode', 'creationdate': '2021-07-22T15:50:14+01:00', 'moddate': '2021-07-22T15:50:14+01:00', 'title': '16269654912414 1..29', 'source': 'research_papers/A signal capture and proofreading.pdf', 'total_pages': 29, 'page': 1, 'page_label': '2'}
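
Because every loaded page carries its `source` path in the metadata, a quick tally can show how many pages came from each PDF. The sketch below uses plain dictionaries in place of `doc.metadata`; the file names are invented for the example and `pages_per_source` is a hypothetical helper.

```python
from collections import Counter

def pages_per_source(metadatas) -> Counter:
    """Count how many page-level metadata dicts share each 'source' path."""
    return Counter(m.get("source", "unknown") for m in metadatas)

# Made-up metadata standing in for [doc.metadata for doc in all_docs]
example_metadata = [
    {"source": "research_papers/paper_a.pdf", "page": 0},
    {"source": "research_papers/paper_a.pdf", "page": 1},
    {"source": "research_papers/paper_b.pdf", "page": 0},
]

counts = pages_per_source(example_metadata)
for source, n_pages in counts.most_common():
    print(f"{n_pages:>4}  {source}")
```

In the notebook this could be called as `pages_per_source(doc.metadata for doc in all_docs)`.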


## Splitting documents into Chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# This splitter breaks text into chunks of at most 2000 characters,
# with a 200-character overlap between consecutive chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200
)

# Split the loaded documents into chunks
chunks = text_splitter.split_documents(all_docs)

# Check the results
print(f"Split the {len(all_docs)} pages into {len(chunks)} chunks.\n")

# You can also see what a single chunk looks like
print("\n--- Example Chunk ---\n")
print(chunks[0].page_content)

Split the 323 pages into 851 chunks.


--- Example Chunk ---

*For correspondence:
francis.barr@bioch.ox.ac.uk (FAB);
simon.newstead@bioch.ox.ac.uk
(SN)
†These authors contributed
equally to this work
Competing interests:The
authors declare that no
competing interests exist.
Funding:
See page 26
Received: 14 March 2021
Accepted: 16 June 2021
Published: 17 June 2021
Reviewing editor: Adam
Linstedt, Carnegie Mellon
University, United States
Copyright Gerondopoulos et
al. This article is distributed under
the terms of the
Creative
Commons Attribution License,
which permits unrestricted use
and redistribution provided that
the original author and source are
credited.
A signal capture and proofreading
mechanism for the KDEL-receptor
explains selectivity and dynamic range in
ER retrieval
Andreas Gerondopoulos†, Philipp Bra¨ uer†, Tomoaki Sobajima, Zhiyi Wu,
Joanne L Parker, Philip C Biggin, Francis A Barr*, Simon Newstead*
Department of Biochemistry, University of Oxford, Oxford, United King
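
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunks are often shorter than the limit. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays in the cell above: a sentence cut at a chunk boundary still appears whole in at least one chunk.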

## Creating Vector database

Embedding Model: The Hugging Face model all-MiniLM-L6-v2, which is specifically trained to read a piece of text and convert its meaning into a list of numbers called a vector.

Vector Store (FAISS): The database is built with FAISS (Facebook AI Similarity Search), an efficient library for storing thousands of vectors and quickly finding the ones most similar to a new query vector.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Run each chunk through the embedding model to get a vector, then store all the vectors in a FAISS index
db = FAISS.from_documents(chunks, embeddings)
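
As a rough illustration of what the vector store does at query time, the sketch below runs a brute-force nearest-neighbour search in plain Python. The three-dimensional vectors are made up for the example (all-MiniLM-L6-v2 produces 384-dimensional embeddings), and FAISS implements the same idea with heavily optimized index structures. `top_k_nearest` is a hypothetical helper name.

```python
import math

def top_k_nearest(query, vectors, k=2):
    """Return indices of the k stored vectors closest to query (Euclidean distance)."""
    distances = [math.dist(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: distances[i])
    return ranked[:k]

# Toy "embeddings" for three chunks (invented numbers, for illustration only)
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.05, 0.0]

nearest = top_k_nearest(query_vector, chunk_vectors, k=2)
print(nearest)
```

The first two vectors point in nearly the same direction as the query, so they are returned ahead of the third. Chunks whose embeddings lie close to the question's embedding are treated as the most relevant in the same way.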

## Building Q&A chain

The chain takes the user's question, retrieves the relevant document chunks from the FAISS database, inserts those chunks and the question into a prompt, and sends the complete prompt to the LLM to generate the final answer.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain

# LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")

# Create a retriever from our vector store; it is responsible for fetching relevant documents.
retriever = db.as_retriever()

# Create a prompt template that tells the LLM how to use the retrieved documents and the user's question.
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Provide a detailed and well-structured answer. If the answer is not in the context, say that you don't know.

<context>
{context}
</context>

Question: {input}
""")

# Create the document chain, which takes the retrieved documents and formats them into the prompt.
document_chain = create_stuff_documents_chain(llm, prompt)

# Create the main retrieval chain, which orchestrates the entire process.
retrieval_chain = create_retrieval_chain(retriever, document_chain)
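
Conceptually, the chain built above performs three steps: retrieve, "stuff" the retrieved text into the prompt, and call the model. The sketch below mimics that flow with plain Python stand-ins so the data flow is visible. `fake_retriever`, `fake_llm`, and `run_rag` are hypothetical placeholders, not LangChain APIs; only the output keys (`input`, `context`, `answer`) mirror what `create_retrieval_chain` returns.

```python
PROMPT_TEMPLATE = """Answer the following question based only on the provided context.

<context>
{context}
</context>

Question: {input}"""

def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever; returns canned chunks."""
    return ["Chunk A about the KDEL receptor.", "Chunk B about ER retrieval."]

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the chat model; only reports the prompt size."""
    return f"(model answer based on a {len(prompt)}-character prompt)"

def run_rag(question: str) -> dict:
    docs = fake_retriever(question)                  # 1. retrieve relevant chunks
    context = "\n\n".join(docs)                      # 2. join chunks into one context string
    prompt = PROMPT_TEMPLATE.format(context=context, input=question)
    answer = fake_llm(prompt)                        # 3. generate the answer
    return {"input": question, "context": docs, "answer": answer}

result = run_rag("What does the KDEL receptor do?")
print(result["answer"])
```

Because the model only sees what step 2 places in the prompt, the quality of the final answer depends directly on what the retriever returns.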

In [None]:
# ask a question!
question = "What were the primary conclusions of the study on CRISPR-Cas9?"
response = retrieval_chain.invoke({"input": question})

# Print the question and final answer
print("--- Question ---")
print(question)
print("--- Answer ---")
print(response["answer"])

--- Question ---
What were the primary conclusions of the study on CRISPR-Cas9?
--- Answer ---
Based on the context provided, I don't know what the primary conclusions of the study on CRISPR-Cas9 were. The provided text does not mention CRISPR-Cas9. The context discusses qPCR methodology, the phylogenetics and functions of PQ-loop proteins, and lists various laboratory reagents and equipment.
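
When the model answers that it does not know, as it does here, it helps to check which chunks the retriever actually returned. The dictionary returned by `create_retrieval_chain` includes them under the `context` key. The helper below is a small sketch for listing their sources; `list_sources` is a hypothetical name, and the stand-in objects only imitate the `metadata` attribute of LangChain `Document` objects.

```python
from types import SimpleNamespace

def list_sources(docs) -> list[str]:
    """Return 'source (page N)' labels for retrieved documents."""
    labels = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        labels.append(f"{source} (page {page})")
    return labels

# Stand-ins for response["context"]; the file names are made up
retrieved = [
    SimpleNamespace(metadata={"source": "research_papers/paper_a.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "research_papers/paper_b.pdf", "page": 11}),
]

source_labels = list_sources(retrieved)
for label in source_labels:
    print(label)
```

In the notebook, `list_sources(response["context"])` would show which papers and pages the answer was drawn from, which makes it easier to tell a retrieval miss from a topic that is simply absent from the corpus.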


In [None]:
# ask a question!
question = "What were the important residues in KDEL receptor and what are their functioning"
response = retrieval_chain.invoke({"input": question})

# Print the question and final answer
print("--- Question ---")
print(question)
print("--- Answer ---")
print(response["answer"])

--- Question ---
What were the important residues in KDEL receptor and what are their functioning
--- Answer ---
Based on the provided context, the following are the important residues in the KDEL receptor and their functions:

**1. Residues Involved in Ligand/Signal Binding and Selectivity:**

*   **D112:** This residue interacts with a threonine at position -7 of the KDEL motif on the natural ligand.
*   **I56 and L116:** These are two of four amino acids in the receptor that interact with an isoleucine at position -6 of the ligand.
*   **E117:** In KDELR3, the loop containing this residue is altered compared to KDELR1 and KDELR2. This residue is located close to other residues on the receptor's surface that are important for signal selectivity.
*   **H12:** This residue is involved in creating a salt-hydrogen bond (SHB) that is destabilized in the ER, leading to the release of the KDEL peptide.

**2. Residues Involved in Vesicle Trafficking (COP-I and COP-II Binding):**

*   **K201,

In [None]:
# ask a question!
question = "SHB is made between which residues? and how His12 is playing important role?"
response = retrieval_chain.invoke({"input": question})

# Print the question and final answer
print("--- Question ---")
print(question)
print("--- Answer ---")
print(response["answer"])

--- Question ---
SHB is made between which residues? and how His12 is playing important role?
--- Answer ---
Based on the provided context, here is a detailed and well-structured answer:

### Residues Involved in the Short Hydrogen Bond (SHB)

The short hydrogen bond (SHB) is formed between the following two residues:
*   **Y158**, located on the transmembrane helix 6 (TM6).
*   **E127**, located on the transmembrane helix 5 (TM5).

### The Important Role of His12 (H12)

His12 (H12), a conserved histidine residue on transmembrane helix 1 (TM1), plays a crucial and multi-faceted role in the KDEL protein retrieval mechanism. Its importance is derived from its location and its chemical properties.

1.  **pH Sensor:** Due to its ability to be protonated, H12 is considered the **pH sensor** for the entire process. The pH difference between the Golgi (acidic) and the ER (near-neutral) is the key trigger for binding and release, and H12 is the residue that detects this change.

2.  **Enabling