# 🚀 ResearchGPT Quickstart

This notebook walks through the full pipeline on the sample paper:
- Load PDF
- Extract metadata
- Clean & chunk text
- Build index & run search
- Summarize & analyze chunks
- Save metadata JSON


In [46]:
import os, sys, json
from pathlib import Path
from dotenv import load_dotenv

# --- 1) Project root handling ---
project_root = Path.cwd().parent  # from notebooks/ → go up one level
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("✅ Project root:", project_root)

# --- 2) Load environment variables ---
load_dotenv()
print("✅ MISTRAL_API_KEY loaded?", bool(os.getenv("MISTRAL_API_KEY")))

# --- 3) Import local modules ---
from src.config import MISTRAL_API_KEY
from src.pdf_utils import load_all_pdfs_text
from src.text_utils import clean_text, chunk_text
from src.indexer import build_index, search
from src.summarizer import summarize_chunks
from src.analyst import analyze_chunks
from src.metadata_utils import extract_metadata
from src.io_utils import safe_stem

print("✅ Imports successful")

✅ Project root: /home/isa/code/research_gpt_assistant
✅ MISTRAL_API_KEY loaded? True
✅ Imports successful
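
The setup cell assumes the notebook is launched from `notebooks/`, so that `Path.cwd().parent` is the project root. A minimal sketch of a less position-dependent lookup follows: it walks upward until it finds a directory containing `src/`. The helper name `find_project_root` and the choice of `src` as the marker are assumptions for illustration, not part of the project code.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`/."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}/' at or above {start}")


# Demo on a throwaway tree: <tmp>/src and <tmp>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks").mkdir()
    found = find_project_root(root / "notebooks")
    print(found == root)
```

In the notebook this would be called as `find_project_root(Path.cwd())`, which gives the same result whether the kernel starts in the project root or in `notebooks/`.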


In [47]:
# Path to the sample paper
pdf_path = project_root / "data/sample_papers/attention_is_all_you_need.pdf"

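# Load all PDFs in that folder as (path, text) pairs
pairs = load_all_pdfs_text(pdf_path.parent)

if not pairs:
    raise FileNotFoundError(f"No PDFs found in {pdf_path.parent.resolve()}")

# Pick the sample paper by name; fall back to the first PDF if it is absent
pdf_path, raw_text = next(
    (pair for pair in pairs if Path(pair[0]).name == pdf_path.name),
    pairs[0],
)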
print("✅ Loaded PDF:", pdf_path.name)
print("\n--- First 500 chars of raw text ---\n")
print(raw_text[:500])

✅ Loaded PDF: attention_is_all_you_need.pdf

--- First 500 chars of raw text ---

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaise
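
In the raw text above, each email address runs straight into the next author's name (for example `avaswani@google.comNoam Shazeer`). A small sketch of one way to separate them before cleaning is shown below. The helper `split_fused_emails` and its list of top-level domains are hypothetical and are not taken from `src.text_utils`; the pattern is only tuned to the lines visible in this output.

```python
import re

# Hypothetical pattern: an email domain ending in a common TLD, immediately
# followed by a letter that is not lowercase ASCII (i.e. the start of a name).
_EMAIL_THEN_NAME = re.compile(r"(@[a-z0-9.-]+\.(?:com|edu|org|net))(?=[^\W\d_a-z])")


def split_fused_emails(text: str) -> str:
    """Insert a newline between an email address and a name glued onto it."""
    return _EMAIL_THEN_NAME.sub(r"\1\n", text)


sample = "avaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗"
print(split_fused_emails(sample))
```

Addresses that are already followed by a line break are left untouched, because the lookahead only fires when a letter follows the domain directly.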


In [48]:
meta = extract_metadata(pdf_path)

print("✅ Metadata extracted:")
print(json.dumps(meta, indent=2))

✅ Metadata extracted:
{
  "title": "Attention Is All You Need",
  "authors": null,
  "abstract": "The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-\nto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight

In [49]:
# Clean text
cleaned = clean_text(raw_text)

# Chunk text
chunks = chunk_text(cleaned, max_chars=1500, overlap=150)

print(f"✅ Total chunks: {len(chunks)}")
print("\n--- First 2 chunks ---\n")
for i, ch in enumerate(chunks[:2]):
    print(f"[Chunk {i+1}]\n{ch[:400]}...\n")

✅ Total chunks: 30

--- First 2 chunks ---

[Chunk 1]
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Goog...

[Chunk 2]
rench translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it successfully to English constituency parsing both with
large and limited training data.
∗Equal contri...
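
Chunk 2 starts mid-word (`rench translation task`), which is what fixed-width character windows with overlap would produce. The sketch below illustrates that general technique. It is an assumption about how such a chunker can work, not the actual implementation of `chunk_text` in `src.text_utils`, and the name `chunk_with_overlap` is made up for this example.

```python
def chunk_with_overlap(text: str, max_chars: int = 1500, overlap: int = 150) -> list[str]:
    """Split text into windows of at most max_chars that share `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", max_chars=4, overlap=1))
```

Splitting on sentence or paragraph boundaries instead of raw character offsets would avoid cutting words in half, at the cost of uneven chunk sizes.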



In [50]:
index = build_index([(f"chunk {i+1}", ch) for i, ch in enumerate(chunks)])
hits = search(index, "Summarize contributions and limitations.", k=5)

print("✅ Top hits:")
for score, (lbl, txt) in hits:
    print(f"- {lbl} (score={score:.3f})")
    print(txt[:200], "\n")

✅ Top hits:
- chunk 30 (score=0.000)
n
should
be
just
-
this
is
what
we
are
missing
,
in
my
opinion
.
<EOS>
<pad>Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the
sentence. We give two suc 

- chunk 29 (score=0.000)
rent colors represent different heads. Best viewed in color.
13

Input-Input Layer5
The
Law
will
never
be
perfect
,
but
its
application
should
be
just
-
this
is
what
we
are
missing
,
in
my
opinion
.
< 

- chunk 28 (score=0.000)
 Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine
translation system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144 , 2016.
[39] Jie  

- chunk 27 (score=0.000)
 Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,
and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts
layer. arXiv preprint arX 

- chunk 26 (score=0.000)
ural machine translation. arXiv preprint arXiv:1508
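
Every hit above has a score of 0.000, and the labels simply count down from the last chunk. Since `search` is not shown here, the cause is an assumption, but this pattern suggests that no chunk matched the query at all and the ranking is arbitrary, so the summary step would be fed appendix and reference text. A minimal guard, using the same `(score, (label, text))` hit shape as the loop above, might look like this (`keep_matching_hits` is a hypothetical helper):

```python
def keep_matching_hits(hits, min_score=1e-9):
    """Keep only hits whose score exceeds min_score.

    `hits` is a list of (score, (label, text)) tuples.
    """
    return [(score, item) for score, item in hits if score > min_score]


# Two zero-score hits, mirroring the output above (text shortened)
hits_demo = [(0.0, ("chunk 30", "...")), (0.0, ("chunk 29", "..."))]
matching = keep_matching_hits(hits_demo)

if not matching:
    print("No chunk matched the query; try query terms that occur in the paper.")
```

A query phrased with words from the paper itself (for instance terms used in its abstract) is more likely to produce non-zero scores than an instruction-style query.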

In [51]:
top_chunks = [txt for _s, (_lbl, txt) in hits]

summary = summarize_chunks(MISTRAL_API_KEY, "Attention Is All You Need", top_chunks)
analysis = analyze_chunks(MISTRAL_API_KEY, "Attention Is All You Need", top_chunks)

print("✅ Summary (first 500 chars):\n", summary[:500])
print("\n---\n")
print("✅ Analysis (first 500 chars):\n", analysis[:500])

✅ Summary (first 500 chars):
 - The paper "Attention Is All You Need" discusses a new model for neural machine translation that uses self-attention mechanisms.
- The self-attention mechanism allows the model to focus on different parts of the input sequence when generating each output word.
- The authors provide visualizations of the attention weights for different layers and heads, showing that some heads seem to perform tasks related to sentence structure or anaphora resolution.
- One example provided is the sentence "The 

---

✅ Analysis (first 500 chars):
 # Analysis: Attention Is All You Need


## Methods

- Method: Transformer model (as described in "Attention is All You Need" by Vaswani et al.)
- Architecture: Not explicitly stated, but it's a Transformer model with 6 layers.
- Datasets: Not explicitly stated, but the context suggests it could be related to machine translation or language understanding tasks.
- Training Setup:
  - Optimizer: Not explicitly stated.
  - Learning 

In [52]:
meta_out = {
    "file": pdf_path.name,
    "title": meta.get("title") or pdf_path.stem,  # `or` also covers a key that is present but None
    "authors": meta.get("authors") or "Unknown",
    "abstract": meta.get("abstract"),
    "query_used": "Summarize contributions and limitations.",
    "outputs": {
        "summary_md": str(project_root / "results/summaries" / f"{safe_stem(pdf_path)}_summary.md"),
        "analysis_md": str(project_root / "results/analyses" / f"{safe_stem(pdf_path)}_analysis.md"),
    }
}

meta_dir = project_root / "results/metadata"
meta_dir.mkdir(parents=True, exist_ok=True)

meta_path = meta_dir / f"{safe_stem(pdf_path)}_meta.json"
meta_path.write_text(json.dumps(meta_out, indent=2), encoding="utf-8")

print("✅ Metadata JSON saved to:", meta_path)
print(json.dumps(meta_out, indent=2))

✅ Metadata JSON saved to: /home/isa/code/research_gpt_assistant/results/metadata/attention_is_all_you_need_meta.json
{
  "file": "attention_is_all_you_need.pdf",
  "title": "Attention Is All You Need",
  "authors": "Unknown",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-\nto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-Fr

In [53]:
import pandas as pd

# Read the report relative to project_root instead of probing two relative paths
report_path = project_root / "results" / "batch_report.csv"
df = pd.read_csv(report_path)

print("✅ Batch Report Loaded Successfully!\n")
print(df.head())


✅ Batch Report Loaded Successfully!

             timestamp                           file  \
0  2025-10-11T17:12:07  attention_is_all_you_need.pdf   
1  2025-10-11T17:24:45  attention_is_all_you_need.pdf   

                                 query_used  \
0  Summarize contributions and limitations.   
1  Summarize contributions and limitations.   

                                        summary_path  \
0  results/summaries/attention_is_all_you_need_su...   
1  results/summaries/attention_is_all_you_need_su...   

                                       analysis_path  duration_sec  
0  results/analyses/attention_is_all_you_need_ana...          6.54  
1  results/analyses/attention_is_all_you_need_ana...          6.59  


In [54]:
print("\n📊 Summary Insights:")

# Normalize column names
df.columns = [c.strip().lower() for c in df.columns]

if "duration_sec" in df.columns:
    # Each row is one pipeline run, so the same PDF can appear more than once
    print(f"Total runs logged: {len(df)}")
    print(f"Average runtime: {df['duration_sec'].mean():.2f} seconds")
    print(f"Fastest run: {df['duration_sec'].min():.2f} seconds")
    print(f"Slowest run: {df['duration_sec'].max():.2f} seconds")
else:
    print("⚠️ No 'duration_sec' column found; check CSV headers below:")
    print(df.columns.tolist())



📊 Summary Insights:
Total runs logged: 2
Average runtime: 6.56 seconds
Fastest run: 6.54 seconds
Slowest run: 6.59 seconds
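
Both rows in the report refer to the same file, so the figures above describe runs rather than distinct PDFs. The sketch below groups runtimes per file to make that distinction visible. It uses only the standard library and an inline copy of the two rows shown earlier (restricted to the columns it needs), so it runs without the CSV on disk; the variable names are illustrative.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Inline copy of the two report rows shown above (only the columns used here)
report_csv = """timestamp,file,duration_sec
2025-10-11T17:12:07,attention_is_all_you_need.pdf,6.54
2025-10-11T17:24:45,attention_is_all_you_need.pdf,6.59
"""

# Collect every recorded duration under the file it belongs to
durations = defaultdict(list)
for row in csv.DictReader(io.StringIO(report_csv)):
    durations[row["file"]].append(float(row["duration_sec"]))

# Map each file to (number of runs, mean duration in seconds)
per_file = {name: (len(vals), mean(vals)) for name, vals in durations.items()}

for name, (runs, avg) in per_file.items():
    print(f"{name}: {runs} run(s), mean {avg:.2f} s")
```

With the DataFrame already loaded in the notebook, `df.groupby("file")["duration_sec"].agg(["count", "mean"])` gives the same per-file view.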
