# Research Paper Summarizer for Literature Surveys

A literature survey means going through dozens of research papers to find the ones relevant to your topic, which can take hours or even days of reading. This tool automates the first pass: it analyzes each PDF, extracts key information, and generates a structured Excel table with a relevance score and color coding, so you can see at a glance which papers are worth reading in depth.

## What this tool does
- Reads all PDF research papers from a folder
- Extracts key fields: objective, methodology, validation method, standards, limitations and more
- Scores each paper by relevance to your research topic (1-10)
- Generates a color-coded Excel table: green (highly relevant, score 8 or above), yellow (moderate, score 6 up to 8), red (low relevance, score below 6)
- Sorts papers from most to least relevant automatically
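
The color bands follow fixed score thresholds. As a minimal sketch, assuming the cutoffs used in the formatting step of this notebook (8 and above is green, 6 up to 8 is yellow, everything else is red), the mapping looks roughly like this. The helper name `color_for_score` is hypothetical and is not part of the notebook.

```python
def color_for_score(score):
    """Map a 1-10 relevance score to a color band (hypothetical helper)."""
    try:
        value = float(score)
    except (TypeError, ValueError):
        return "yellow"  # the notebook falls back to yellow when a score cannot be read
    if value >= 8:
        return "green"
    if value >= 6:
        return "yellow"
    return "red"


for example in (9, 6.5, 3, None):
    print(example, "->", color_for_score(example))
```

To shift the bands, for instance to treat 7 and above as highly relevant, change the two numeric cutoffs.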

## How to Use
1. Add your PDF papers to a folder called `papers/` in the same directory as this notebook
2. Update the `research_topic` variable below with your research area
3. Run all cells
4. Check the output Excel file generated in the same directory
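
Before running all cells, it can help to confirm that the folder from step 1 actually contains PDFs. The following is a small sketch using only the standard library. `list_pdfs` is a hypothetical helper name, and the sketch matches the `.pdf` extension case-insensitively.

```python
from pathlib import Path


def list_pdfs(folder="papers/"):
    """Return the sorted PDF filenames in `folder`, or an empty list if it is missing."""
    path = Path(folder)
    if not path.is_dir():
        return []
    # Compare the suffix in lowercase so files named .PDF are also found
    return sorted(p.name for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")


pdfs = list_pdfs("papers/")
print(f"Found {len(pdfs)} PDF file(s)")
for name in pdfs:
    print(" -", name)
```

An empty result usually means the folder is missing or sits in a different directory from the notebook.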

## Requirements
- OpenAI API key in a `.env` file, under the variable name `OPENAI_API_KEY`
- Install dependencies: `uv pip install pdfplumber pandas openpyxl openai python-dotenv`
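
The OpenAI client reads the key from the `OPENAI_API_KEY` environment variable, which `python-dotenv` populates from the `.env` file. A quick check along these lines can catch a missing or malformed entry before any paper is processed. This is a sketch only: `check_api_key` is a hypothetical helper, and it tests just for presence and stray whitespace, not for whether the key is valid.

```python
import os


def check_api_key(env=os.environ):
    """Return a short status message about OPENAI_API_KEY (hypothetical helper)."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        return "No OPENAI_API_KEY found; check the .env file"
    if key != key.strip():
        return "OPENAI_API_KEY has leading or trailing whitespace; remove it"
    return "OPENAI_API_KEY found"


print(check_api_key())
```

In the notebook, the natural place to call this is right after `load_dotenv(override=True)`.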

---
*Built as a Day 1 extension project for the Udemy course: AI Engineer Core Track — LLM Engineering, RAG, QLoRA, Agents by Ed Donner.*

### Install required libraries
Run the cell below before running the rest of the notebook:
- `pdfplumber`: extracts text from PDF files
- `pandas`: creates and manipulates the data table
- `openpyxl`: creates and formats the Excel output file
- `openai`: connects to the OpenAI API to analyze papers
- `python-dotenv`: loads your OpenAI API key from a `.env` file
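
Some install names differ from import names (for example, `python-dotenv` is imported as `dotenv`). Once the install cell has run, a check along these lines can report anything still missing. It is a hedged sketch built on the standard library's `importlib.util.find_spec`, and `missing_modules` and `REQUIRED_MODULES` are hypothetical names.

```python
from importlib.util import find_spec

# Import names for the packages listed above (python-dotenv is imported as dotenv)
REQUIRED_MODULES = ["pdfplumber", "pandas", "openpyxl", "openai", "dotenv"]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```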

In [None]:
!uv pip install pdfplumber pandas openpyxl openai python-dotenv

In [None]:
# Standard library imports
import os
import json

# Third party imports
import pdfplumber          # extracts text from PDF files
import pandas as pd        # creates and manipulates the data table
from openai import OpenAI  # connects to the OpenAI API
from openpyxl import load_workbook                    # opens and edits Excel files
from openpyxl.styles import PatternFill, Alignment    # formats Excel cells
from dotenv import load_dotenv                        # loads API key from .env file


In [None]:
# Load the OpenAI API key securely from the .env file
load_dotenv(override=True)

In [None]:
# Initialize the OpenAI client — this is what we use to make API calls throughout the notebook
openai = OpenAI()

In [None]:
# CHANGE THIS TO YOUR RESEARCH TOPIC

research_topic = """
UAV intelligence quality assurance, standards, and validation methods. 
Topics of interest include: UAV system reliability, fault detection, 
testing frameworks, quality standards, validation methodologies, 
and intelligent UAV systems.
"""

# System prompt instructs the LLM how to behave and what to extract
# We pass the research topic so it can judge relevance accurately
# The LLM is asked to respond in JSON so we can parse it into a structured table

system_prompt = f"""
You are a research assistant helping with a literature survey on this topic: {research_topic}

Extract the following fields from the research paper and respond ONLY in valid JSON format with these exact keys:
{{
    "Year": "Look carefully for the publication year in the copyright notice, journal header, submission date, or first page. Return only the 4-digit year. If truly not found, write Unknown",
    "Paper Title": "",
    "Authors": "",
    "Application Domain": "",
    "AI / Intelligence Component": "",
    "Objective": "",
    "Validation Method": "",
    "Test Environment": "",
    "Evaluation Metrics": "",
    "Robustness / Safety Testing": "",
    "Standards Mentioned": "",
    "Standards Body Referenced": "",
    "Limitations": "",
    "Is Relevant": "Yes or No only",
    "Relevance Score": "Rate strictly from 1 to 10 based on how directly the paper addresses UAV intelligence quality assurance, validation standards, or certification methods. 9-10: paper directly addresses UAV QA frameworks, validation standards, or certification as its PRIMARY contribution. 7-8: paper covers UAV fault detection, reliability, or safety testing but QA/standards is not the main focus. 5-6: paper uses UAVs as a tool for another application like inspection, agriculture, or mapping with minimal QA focus. 1-4: paper has little or no connection to UAV QA or validation standards."
}}
"""

# Extracts the text of every page of the PDF and joins the pages with newlines
# If your model has a smaller context window, limit the text by adding: return text[:15000]
def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # Call extract_text() once per page; pages without a text layer yield None or ""
        page_texts = [page.extract_text() for page in pdf.pages]
    text = "\n".join(t for t in page_texts if t)
    return text

# Sends the extracted PDF text to OpenAI and gets back structured JSON
# The JSON is then parsed into a Python dictionary for easy table conversion
def analyze_paper(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Here is the research paper:\n\n{text}"}
    ]
    response = openai.chat.completions.create(
        model="gpt-5-nano",   # fastest and cheapest GPT-5 model, great for summarization
        messages=messages
    )
    raw = response.choices[0].message.content
    # Strip any markdown formatting the model may have added around the JSON
    raw = raw.strip().replace("```json", "").replace("```", "").strip()
    return json.loads(raw)


# CHANGE THIS TO YOUR PAPERS FOLDER PATH
# Place all your PDF research papers inside this folder

papers_folder = "papers/"
results = []

# Loop through all PDFs in the folder and analyze each one
for filename in sorted(os.listdir(papers_folder)):
    if filename.lower().endswith(".pdf"):  # also matches files named .PDF
        print(f"Processing: {filename}")
        pdf_path = os.path.join(papers_folder, filename)
        try:
            data = analyze_paper(pdf_path)
            data["Filename"] = filename  # track which PDF each row came from
            results.append(data)
        except Exception as e:
            print(f"Error with {filename}: {e}")


# Convert results to a pandas DataFrame
df = pd.DataFrame(results)

# Standardize the "Is Relevant" column so it always contains Yes or No
def standardize_relevant(val):
    text = str(val).strip().lower()
    if text in ["true", "yes", "1"]:
        return "Yes"
    elif text in ["false", "no", "0"]:
        return "No"
    return "Yes"  # any other value is treated as Yes

df["Is Relevant"] = df["Is Relevant"].apply(standardize_relevant)

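# Standardize the "Relevance Score" column to the 1-10 scale
# The LLM sometimes returns a 0-1 fraction instead of 1-10, so we rescale those values

def standardize_score(val):
    try:
        score = float(val)
    except (TypeError, ValueError):
        return None  # non-numeric score, e.g. "Unknown"
    # A fraction such as 0.8 becomes 8.0; a score of exactly 1 is a valid 1-10 value and is kept
    if 0 < score < 1.0:
        return round(score * 10, 1)
    return round(score, 1)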

df["Relevance Score"] = df["Relevance Score"].apply(standardize_score)

# Sort papers from most relevant to least relevant
df = df.sort_values("Relevance Score", ascending=False).reset_index(drop=True)

# Save to Excel
output_file = "literature_survey_output.xlsx"
df.to_excel(output_file, index=False)

# Apply color coding and formatting to the Excel file
wb = load_workbook(output_file)
ws = wb.active

# Color definitions: green = highly relevant, yellow = moderate, red = low relevance
green  = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid")
yellow = PatternFill(start_color="FFEB9C", end_color="FFEB9C", fill_type="solid")
red    = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")

headers = [cell.value for cell in ws[1]]
score_col = headers.index("Relevance Score") + 1

# Keep header row white and uncolored
for cell in ws[1]:
    cell.fill = PatternFill(fill_type=None)

# Color code each data row based on its relevance score
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
    score_cell = row[score_col - 1]
    try:
        score = float(score_cell.value)
        if score >= 8:
            fill = green
        elif score >= 6:
            fill = yellow
        else:
            fill = red
    except (TypeError, ValueError):
        fill = yellow  # missing or non-numeric score: fall back to yellow
    for cell in row:
        cell.fill = fill

# Wrap text and align content to top for better readability in Excel
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
    for cell in row:
        cell.alignment = Alignment(wrap_text=True, vertical="top")

# Auto-fit column widths based on the longest content in each column
# Short columns stay compact, long columns are capped at width 60
for col in ws.columns:
    max_length = 0
    col_letter = col[0].column_letter
    for cell in col:
        if cell.value:
            max_length = max(max_length, len(str(cell.value)))
    if max_length < 15:
        adjusted_width = max_length + 4
    elif max_length < 50:
        adjusted_width = max_length + 2
    else:
        adjusted_width = 60
    ws.column_dimensions[col_letter].width = adjusted_width

# Save the final formatted Excel file
wb.save(output_file)
print(f"Done! Check {output_file}")
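
`analyze_paper` cleans the model reply by deleting Markdown fence markers before calling `json.loads`. If a reply also carries a sentence before or after the JSON object, that call raises an error and the paper is skipped. One possible alternative, sketched here under the assumption that the reply contains exactly one top-level JSON object, is to parse the span between the first `{` and the last `}`. The name `parse_json_reply` is hypothetical.

```python
import json


def parse_json_reply(raw):
    """Parse the JSON object embedded in a model reply (hypothetical helper)."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(raw[start:end + 1])


fence = "`" * 3  # built at runtime so this example contains no literal fence
reply = f'Here is the result:\n{fence}json\n{{"Year": "2021", "Relevance Score": 7}}\n{fence}'
print(parse_json_reply(reply))
```

This still fails on truncated or otherwise invalid JSON, so the `try`/`except` around `analyze_paper` in the processing loop remains useful.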