Generative AI in Research: using Large Language Models (LLMs) to enhance and streamline the academic literature review process.

leverage RAG Techniques for summarizing papers, identifying connections across papers (authors, references, methods), uncovering key themes in them.

Download 2 papers (related to diffusion model) and convert them to .txt files in a directory named "data". Use these .txt files as input papers and evaluate if the RAG technique is giving good results.

In [1]:
!pip install pymupdf requests



### download any 2 papers of diffusion model and convert their pdf to .txt files

In [2]:
!rm -r data

In [3]:
import os
import requests
import fitz  # PyMuPDF

# Sample arXiv paper IDs related to diffusion models
arxiv_ids = [
    "2006.11239",  # Denoising Diffusion Probabilistic Models
    "2105.05233",  # Improved Denoising Diffusion Probabilistic Models
]

def download_pdf(arxiv_id, output_folder):
    url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    pdf_path = os.path.join(output_folder, f"{arxiv_id}.pdf")
    response = requests.get(url)
    with open(pdf_path, "wb") as f:
        f.write(response.content)
    print(f"Downloaded {arxiv_id}")
    return pdf_path

def pdf_to_text(pdf_path, txt_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Converted to text: {txt_path}")

def main():
    data_dir = "data"
    os.makedirs(data_dir, exist_ok=True)

    for arxiv_id in arxiv_ids:
        pdf_path = download_pdf(arxiv_id, data_dir)
        txt_path = os.path.join(data_dir, f"{arxiv_id}.txt")
        pdf_to_text(pdf_path, txt_path)

if __name__ == "__main__":
    main()


Downloaded 2006.11239
Converted to text: data/2006.11239.txt
Downloaded 2105.05233
Converted to text: data/2105.05233.txt


In [4]:
!rm data/*.pdf

In [5]:
# Install dependencies
!pip install langchain langchain_community faiss-cpu sentence-transformers transformers networkx matplotlib spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m76.1 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
import os, glob
import gc
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import networkx as nx
import matplotlib.pyplot as plt
import spacy
import re
gc.collect()
import torch
torch.cuda.empty_cache()

In [9]:
if 'results' in locals():
    del results

In [10]:
def clean_text(text):
    import re
    # Remove inline citations like [14], [14, 27]
    text = re.sub(r"\[[0-9,\s]+\]", "", text)
    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # Remove LaTeX math expressions
    text = re.sub(r"\$.*?\$", "", text)
    # Remove repeated words
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)
    # Remove special characters
    text = re.sub(r"[^a-zA-Z0-9.,;:?!\s]", "", text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Load a document and return its content
def load_document(file_path):
    loader = TextLoader(file_path)
    docs = loader.load()
    for doc in docs:
        doc.page_content = clean_text(doc.page_content)  # Apply cleaning here
    return docs

# Split the document into chunks ensuring each chunk is under the token limit
def split_document(docs, chunk_size=1000, chunk_overlap=50):
    # Make sure docs is always a list
    if not isinstance(docs, list):
        docs = [docs]

    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(docs)

# Vector store (use sentence embeddings)
def create_faiss_index(docs):
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    return FAISS.from_documents(docs, embeddings)

# Load HuggingFace LLM
def load_llm():
    model_id = "google/flan-t5-xl"  # Better GPU utilization, faster than flan-t5-large
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to("cuda")  # Move model to GPU
    #pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
    pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0, max_new_tokens=512)

    return HuggingFacePipeline(pipeline=pipe)


# Build RAG chain
def build_qa_chain(llm, vectorstore):
    return RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(), chain_type="stuff")

# Analyze a single document in chunks and store results with a chunk limit
def analyze_document(llm, docs, batch_size=1, max_chunks=10):
    results = {"Summary": [], "Connections": [], "Themes": []}
    chunks = split_document(docs)
    chunks = chunks[:max_chunks]

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]

        for label, q in [
            ("Summary", "Summarize the following scientific paper text in concise bullet points. Include: Main contribution, Dataset used, method, Evaluation metrics and Key results."),
            ("Connections", "You are reading several research papers. Based on the passage below, what connections or similarities can you identify with other papers on diffusion models? Mention common techniques, models, datasets, or evaluation strategies."),
            ("Themes", "What are the central *research themes* in the following paper? List them as concise topics.")
        ]:
            prompts = [f"{q}\n\n---\n{chunk.page_content.strip()}" for chunk in batch]
            try:
                responses = llm.pipeline(prompts)
                for response in responses:
                    text = response['generated_text'].strip()
                    results[label].append(text)
                    print(f"\n🔍 {label}:\n{text}")
            except Exception as e:
                print(f"Error during {label} batch: {e}")
                results[label].extend(["Error"] * len(batch))

        del batch
        torch.cuda.empty_cache()
        gc.collect()

    return results


# Main pipeline
def process_all_documents(data_dir="data", max_chunks=10):
    files = glob.glob(os.path.join(data_dir, "*.txt"))
    results = {"Summary": [], "Connections": [], "Themes": []}

    llm = load_llm()

    for file_path in files:
        print(f"\n📄 Processing: {file_path}")
        # Load and clean the document
        doc = load_document(file_path)

        # Process the document in chunks
        doc_results = analyze_document(llm, docs=doc, max_chunks=max_chunks)

        # Collect results
        for label in results:
            results[label].extend(doc_results.get(label, []))

        # Free up GPU memory after processing each document
        del doc
        torch.cuda.empty_cache()
        gc.collect()  # Run garbage collection to free memory

    # Clean-up results (e.g., remove empty strings or redundant entries)
    for label in results:
        flat = [str(item).strip() for sublist in results[label] for item in (sublist if isinstance(sublist, list) else [sublist])]
        cleaned = [s for s in flat if s and s.lower() != "error"]
        combined_text = "\n".join(cleaned)

        if label == "Themes":
          final_theme = summarize_combined_output(llm, combined_text, label) # Remove duplicates
          themes = list(dict.fromkeys([line.strip() for line in final_theme.split("\n") if line.strip()]))
          results[label] = "\n".join(themes)
        else:
          results[label] = summarize_combined_output(llm, combined_text, label)


    return results

# Summarize combined chunk outputs into a single final output
def summarize_combined_output(llm, text, label):
    prompts = {
        "Summary": "You are a helpful scientific assistant. Based on the following combined summaries of a scientific paper, provide a single concise overall summary. Mention the main contribution, dataset used, method, evaluation metrics, and key results.",
        "Connections": "You are reading several research papers. Based on the following notes, summarize the common connections or similarities across papers, focusing on shared techniques, datasets, or models.",
        "Themes": "Summarize the central research themes mentioned in the combined text below. List them as concise, broad topics."
    }
    prompt = f"{prompts[label]}\n\n{text}"
    try:
        response = llm.pipeline(prompt)
        return response[0]["generated_text"].strip()
    except Exception as e:
        print(f"Error generating final {label}: {e}")
        return "Error"


# Main execution
results = process_all_documents(data_dir="data", max_chunks=10)  # Limit the number of chunks for faster processing
# Final outputs
final_connections = results.get("Connections", "")
final_themes = results.get("Themes", "")
final_summary = results.get("Summary", "")

# Display summaries
print("\nFinal Summary:")
print(final_summary)
print("\nFinal Connections Across Papers:")
print(final_connections)
print("\nFinal Themes:")
print(final_themes)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0



📄 Processing: data/2105.05233.txt

🔍 Summary:
We show that diffusion models can achieve image sample quality superior to the current stateoftheart generative models. We achieve this on unconditional im age synthesis by nding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classier guid ance: a simple, computeefcient method for trading off diversity for delity using gradients from a classier. We achieve an FID of 2.97 on ImageNet 128128, 4.59 on ImageNet 256256, and 7.72 on ImageNet 512512, and we match BigGANdeep even with as few as 25 forward passes per sample. Finally, we nd that classier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256256 and 3.85 on ImageNet 512512. We release our code at 1 Introduction

🔍 Connections:
---

🔍 Themes:
Diffusion Models Beat GANs on Image Synthesis

🔍 Summary:
--- 512512. We release our code at 1 Introduction Figure 1: Se

Token indices sequence length is longer than the specified maximum sequence length for this model (2658 > 512). Running this sequence through the model will result in indexing errors



Final Summary:
Main contribution: We propose a new method for predicting the t-axis of a t-spline based on the t-spline. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that the t-spline is a t-spline with a t-spline-like structure. We show that t