# RAG-based AI assistant for scientific paper summarization and suggestion
![Alt text](https://miro.medium.com/v2/resize:fit:1200/0*ffG7IPkdztO6BARk.png)

## Project Overview
This project builds a smart PDF question-answering system using:

- LangChain for chaining components
- FAISS for vector-based document retrieval
- SentenceTransformers for document embedding
- ChatGroq (Qwen-QWQ-32B) as the LLM
- LangChain PromptTemplate for custom response formatting

The system lets users ask questions about a scientific paper (PDF) and returns concise, technical answers grounded in the document's content.

## Project Structure

```bash
scientific_paper_summarization_and_suggestion
├── data/paper.pdf            # Your input scientific paper
├── notebook/notebook.ipynb   # Your notebook
├── vector_index/             # FAISS index folder (auto-generated)
├── .env                      # Your API key
└── main.py                   # Main project script
```

## Step-by-Step Code Explanation
### 1. Environment Setup



```bash
pip install langchain langchain_community langchain_groq pypdf sentence-transformers faiss-cpu
```



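```python
from google.colab import userdata

# Read the Groq API key from Colab's secret storage
api_key = userdata.get('GROQ_API_KEY')
```

Loads your Groq API key from Colab's secret storage (`userdata`). When running `main.py` outside Colab, read the key from the `.env` file instead.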

Always keep your keys secret and never hardcode them.
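
For a local run outside Colab, the key has to come from the environment or from the `.env` file listed in the project structure. The following is a minimal standard-library sketch, not the code in `main.py`: the helper name `load_env_value` is hypothetical, and it assumes the `.env` file holds plain `KEY=VALUE` lines. The `python-dotenv` package's `load_dotenv()` is a common alternative.

```python
import os
from pathlib import Path


def load_env_value(name, env_path=".env"):
    """Return `name` from the environment, falling back to a KEY=VALUE .env file."""
    value = os.environ.get(name)
    if value:
        return value

    path = Path(env_path)
    if not path.exists():
        return None

    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes
            return raw.strip().strip('"').strip("'")
    return None
```

With this helper, `api_key = load_env_value("GROQ_API_KEY")` could replace the Colab-specific lookup; it returns `None` when the key is not found, so the caller can fail early with a clear message.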
![Alt text](https://miro.medium.com/v2/resize:fit:1400/1*hcvunNJ4IolhZ6Qav5UPjQ.png)


### 2. PDF Loading and Chunking





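```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_split_pdf(file_path):
    # Load the PDF; PyPDFLoader returns one Document per page
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    # Split pages into chunks of up to 500 characters with a 50-character overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_documents(documents)
```

Loads the PDF file using `PyPDFLoader`, which returns one document per page.

Splits the pages into chunks of at most 500 characters with a 50-character overlap using `RecursiveCharacterTextSplitter`, so each chunk is small enough to embed and retrieve on its own and several chunks fit into the LLM context window together.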
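
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker in plain Python. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on paragraph, sentence, and word boundaries, which this sketch ignores. The function name `chunk_text` and the sample string are made up for the example.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.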

### 3. Embedding and Creating a FAISS Vector Index


```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def create_vector_store(docs, index_path="vector_index"):
    # Embed each chunk with a SentenceTransformers model
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    # Build the FAISS index in memory, then persist it to disk
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local(index_path)
    return vectorstore
```

Converts each text chunk into a vector using the `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face.

Builds a FAISS index over those vectors to enable fast similarity search and saves it to the `vector_index/` folder.
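
Similarity search means ranking stored vectors by their distance to the query vector. The sketch below does this by brute force with Euclidean distance on made-up 3-dimensional vectors; it illustrates the idea only and is not how FAISS is implemented. The metric your index actually uses depends on how it was configured.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, euclidean(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions
indexed = [
    ("reaction procedure", [0.9, 0.1, 0.0]),
    ("bibliography entry", [0.0, 0.2, 0.9]),
    ("results discussion", [0.7, 0.6, 0.1]),
]
query = [0.8, 0.3, 0.0]

for text, dist in top_k(query, indexed):
    print(f"{text}: {dist:.3f}")
```

The two chunks whose vectors lie closest to the query come back first; the retriever passes exactly this kind of ranked list to the LLM as context.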

### 4. Load an Existing Vector Store


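```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def load_vector_store(index_path="vector_index"):
    # Use the same embedding model that was used to build the index
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    # The saved index includes a pickle file, so loading must be explicitly allowed
    return FAISS.load_local(index_path, embeddings, allow_dangerous_deserialization=True)
```

Reloads a previously created FAISS vector index from disk. Because part of the index is stored as a pickle file, `allow_dangerous_deserialization=True` is required; only load index folders that you created yourself.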


### 5. Custom Prompt Template


```python
from langchain.prompts import PromptTemplate

CUSTOM_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are a scientific research assistant.
Given the following context from scientific papers, answer the user's question.
Be concise, technical, and include any relevant research suggestions.

Context:
{context}

Question:
{question}

Answer:
"""
)
```

Defines how the LLM should respond.

It assigns the model a research-assistant role, asks for concise, technical answers with research suggestions, and marks where the retrieved `{context}` and the user's `{question}` are inserted.
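
To see what the model actually receives, the placeholders can be filled by hand. This sketch uses plain `str.format` on a shortened copy of the template, on the assumption that simple named placeholders are substituted the same way by `PromptTemplate`'s default f-string style formatting. `TEMPLATE` and `render_prompt` are names invented for the example.

```python
# Shortened copy of the template, for illustration only
TEMPLATE = """You are a scientific research assistant.

Context:
{context}

Question:
{question}

Answer:
"""


def render_prompt(context, question):
    """Fill the two named placeholders and return the final prompt text."""
    return TEMPLATE.format(context=context, question=question)


prompt = render_prompt(
    context="(retrieved chunks go here)",
    question="Summarize the main findings of the paper.",
)
print(prompt)
```

Ending the template with `Answer:` leaves the model to continue from that point, which helps keep the reply focused on the answer itself.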

### 6. Create Retrieval-Based QA Chain


```python
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq

def create_qa_chain(vectorstore):
    # temperature=0 keeps answers as deterministic as possible
    llm = ChatGroq(model_name="qwen-qwq-32b", temperature=0, api_key=api_key)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        chain_type="stuff",
        chain_type_kwargs={"prompt": CUSTOM_PROMPT}
    )
    return qa_chain
```

Initializes the Qwen QwQ-32B model (`qwen-qwq-32b`) served by Groq.

Uses `RetrievalQA` to combine vector-based context retrieval with the LLM.

`chain_type="stuff"` puts all retrieved chunks directly into the `{context}` slot of the prompt.
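
Because "stuff" concatenates everything the retriever returns, the context size grows with the number and size of the chunks. The sketch below mimics that concatenation in plain Python. It assumes four retrieved chunks and a blank-line separator, which are the usual LangChain defaults but worth verifying for your installed version; `stuff_context` and the dummy chunks are invented for the example.

```python
def stuff_context(chunks, separator="\n\n"):
    """Join all retrieved chunks into one context string."""
    return separator.join(chunks)


# Four made-up chunks, each exactly at the 500-character chunk limit
retrieved = [f"chunk {i}: " + "x" * 491 for i in range(4)]
context = stuff_context(retrieved)
print(len(context))  # 2006 = 4 * 500 characters + 3 separators of 2 characters
```

At roughly 2,000 characters the context is small, so "stuff" is a reasonable choice here. With many more or much larger chunks, the combined prompt could exceed the model's context window.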

### 7. Main Function: Build or Load Vector Store & Run Query


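```python
pdf_path = "/content/paper.pdf"
user_query = "Summarize the main findings of the paper."

# Build the index from the PDF and save it to disk (this cell rebuilds it on every run)
docs = load_and_split_pdf(pdf_path)
vectorstore = create_vector_store(docs)

# Reload the saved index to confirm it round-trips from disk
vectorstore = load_vector_store()

qa_chain = create_qa_chain(vectorstore)
# RetrievalQA takes its input under "query" and returns the answer under "result"
response = qa_chain.invoke({"query": user_query})["result"]
print("\n💡 AI Response:\n", response)
```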


```text
💡 AI Response:

<think>
Okay, I need to summarize the main findings of the paper based on the provided context. Let me look at the information given. The user provided references from the paper's bibliography and a general procedure for a chemical reaction. The references include works from Takeuchi, Bailey, and others, mostly from the 90s and late 90s. The general procedure describes a chemical synthesis method: they're mixing compounds like 3a, NBu4Br, NaH, and ethyl bromofluoroacetate in THF, then adding NH4Cl afterward.

Hmm, the question is asking for the main findings, but the context given doesn't explicitly state the results or conclusions. The references might hint at the paper's focus. Takeuchi's work in 1997 and 1998 could be related to organic synthesis, maybe of some compounds like antibiotics or pharmaceuticals, given the journal names (J. Antibiot., Chem. Pharm. Bull.). The procedure outlined is a synthesis step, perhaps for creating a specific compound. Since the proc
```
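
The response above begins with a `<think>` block, which is the model's intermediate reasoning rather than the answer itself. If only the final answer should be shown, the block can be removed before printing. This is a minimal sketch, assuming the reasoning is always wrapped in `<think>` tags; `strip_reasoning` is a hypothetical helper, not part of LangChain or Groq.

```python
import re


def strip_reasoning(text):
    """Remove <think>...</think> blocks; drop everything after an unclosed <think>."""
    # Remove complete reasoning blocks
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # If the output was cut off inside a block, drop the unfinished reasoning too
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()
```

Calling `print(strip_reasoning(response))` would then show only the text that follows the reasoning. An empty result indicates the output ended before the model finished thinking.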


### Workflow
![Alt text](https://s3.amazonaws.com/samples.clarifai.com/rag_template-image1.webp)

1. Check if a vector index already exists.
2. If not, load the PDF → chunk → embed → save FAISS index.
3. Create a QA chain with retrieval + Groq LLM.
4. Ask the user query and print the response.
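
The existence check in the workflow can be sketched as a small helper that decides between loading and building. This is an illustration under stated assumptions, not the code in `main.py`: `get_vector_store` is a hypothetical name, the build and load steps are passed in as functions so the sketch runs without LangChain, and a directory at `index_path` is taken to mean the index was saved earlier.

```python
import os


def get_vector_store(index_path, build_fn, load_fn):
    """Load the index at index_path if it exists; otherwise build (and save) it."""
    if os.path.isdir(index_path):
        return load_fn(index_path)
    return build_fn(index_path)


# Possible wiring with the functions defined earlier in this README:
# vectorstore = get_vector_store(
#     "vector_index",
#     build_fn=lambda path: create_vector_store(load_and_split_pdf(pdf_path), path),
#     load_fn=load_vector_store,
# )
```

Skipping the build step on later runs avoids re-reading and re-embedding the PDF, which is usually the slowest part of the pipeline. Delete the `vector_index/` folder whenever the input paper changes.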
