# 🚀 ResearchGPT Quickstart

This notebook walks through the full pipeline on the sample paper:
- Load PDF
- Extract metadata
- Clean & chunk text
- Build index & run search
- Summarize & analyze chunks
- Save metadata JSON

Run the cells in order: each step reuses variables defined by the cells before it.


In [66]:
import os, sys, json
from pathlib import Path
from dotenv import load_dotenv

# Assumes the notebook is launched from notebooks/, so the project root is one level up
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("✅ Project root added:", project_root)
print("sys.path[0]:", sys.path[0])

# Load environment variables
load_dotenv()
print("✅ MISTRAL_API_KEY loaded?", bool(os.getenv("MISTRAL_API_KEY")))



✅ Project root added: /home/isa/code/research_gpt_assistant
sys.path[0]: /home/isa/code/research_gpt_assistant
✅ MISTRAL_API_KEY loaded? True
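
The setup cell relies on the working directory being `notebooks/`. If the notebook is started from another folder, `Path.cwd().parent` points to the wrong place. A minimal sketch of a more tolerant lookup follows; `find_project_root` and its marker names are hypothetical and not part of this project.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_project_root(start: Path, markers: Sequence[str] = ("src", ".env")) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains any marker.

    Hypothetical helper: the marker names are an assumption about the repo layout.
    """
    start = start.resolve()
    # Check the starting directory first, then walk upward through its parents.
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return None
```

One possible use is `project_root = find_project_root(Path.cwd()) or Path.cwd().parent`, which falls back to the original behavior when no marker is found.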


In [67]:
from src.config import MISTRAL_API_KEY
from src.pdf_utils import load_all_pdfs_text
from src.text_utils import clean_text, chunk_text
from src.indexer import build_index, search
from src.summarizer import summarize_chunks
from src.analyst import analyze_chunks
from src.metadata_utils import extract_metadata
from src.io_utils import safe_stem

print("✅ Imports successful!")


✅ Imports successful!


In [68]:
pdf_path = project_root / "data/sample_papers/attention_is_all_you_need.pdf"

print("Looking for PDFs in:", pdf_path.parent.resolve())
pdfs = list(pdf_path.parent.glob("*.pdf"))
print("Found PDFs:", pdfs)


Looking for PDFs in: /home/isa/code/research_gpt_assistant/data/sample_papers
Found PDFs: [PosixPath('/home/isa/code/research_gpt_assistant/data/sample_papers/attention_is_all_you_need.pdf')]


In [69]:
pairs = load_all_pdfs_text(pdf_path.parent)

if not pairs:
    raise FileNotFoundError(f"No PDFs found in {pdf_path.parent.resolve()}")

# Take the first (path, text) pair; this rebinds pdf_path to the path returned by the loader
pdf_path, raw_text = pairs[0]
print("✅ Loaded PDF:", pdf_path)
print("First 500 characters:\n", raw_text[:500])


✅ Loaded PDF: /home/isa/code/research_gpt_assistant/data/sample_papers/attention_is_all_you_need.pdf
First 500 characters:
 Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaise


In [70]:
meta = extract_metadata(pdf_path)
print("✅ Extracted metadata:\n")
print(json.dumps(meta, indent=2))


✅ Extracted metadata:

{
  "title": "Attention Is All You Need",
  "authors": null,
  "abstract": "The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-\nto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eigh

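The metadata above reports `"authors": null`, so the key is present but holds `None`. Note that `dict.get(key, default)` returns the default only when the key is missing, not when its value is `None`. The sketch below shows the difference; `value_or_default` and `meta_example` are hypothetical names used for illustration.

```python
import json
from typing import Any, Mapping


def value_or_default(mapping: Mapping[str, Any], key: str, default: Any) -> Any:
    """Return mapping[key], or `default` when the key is missing or its value is None."""
    value = mapping.get(key)
    return default if value is None else value


# Stand-in for the extracted metadata, with the same null authors field.
meta_example = json.loads('{"title": "Attention Is All You Need", "authors": null}')

print(meta_example.get("authors", "Unknown"))                # None: the key exists
print(value_or_default(meta_example, "authors", "Unknown"))  # Unknown
print(value_or_default(meta_example, "title", "Untitled"))   # Attention Is All You Need
```

Checking for `None` explicitly, rather than using `or`, keeps legitimate falsy values such as an empty list intact.
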
In [71]:
txt = clean_text(raw_text)
chunks = chunk_text(txt, max_chars=1500, overlap=150)

print(f"✅ Total chunks: {len(chunks)}")
print("\n--- First 2 chunks ---\n")
for i, ch in enumerate(chunks[:2], 1):
    print(f"Chunk {i}:\n{ch[:300]}...\n")



✅ Total chunks: 30

--- First 2 chunks ---

Chunk 1:
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parma...

Chunk 2:
rench translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it s...
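
Chunk 2 begins mid-word ("rench translation task"), which is what fixed-width character slicing with overlap would produce. The implementation of `chunk_text` is not shown here, so the following is only a sketch of that technique under the same `max_chars` and `overlap` parameters; `chunk_by_chars` is a hypothetical name.

```python
from typing import List


def chunk_by_chars(text: str, max_chars: int = 1500, overlap: int = 150) -> List[str]:
    """Split text into windows of at most max_chars, each sharing `overlap` chars with the previous one.

    Hypothetical sketch; the project's chunk_text may behave differently.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # distance between consecutive window starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + max_chars >= len(text):
            break
    return chunks
```

With `max_chars=1500` and `overlap=150`, each new window starts 1350 characters after the previous one. Splitting on sentence or paragraph boundaries instead would avoid cutting words, at the cost of uneven chunk sizes.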



In [72]:
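# Label each chunk so search hits can be traced back to their position in the paper
labeled_chunks = [(f"{pdf_path.stem} [chunk {i}]", ch) for i, ch in enumerate(chunks, 1)]
index = build_index(labeled_chunks)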
hits = search(index, "What problem does this paper solve?", k=3)

print("✅ Top hits:\n")
for score, (lbl, text) in hits:
    print(f"- {lbl} (score {score:.3f})\n{text[:200]}...\n")



✅ Top hits:

- attention_is_all_you_need [chunk 14] (score 0.072)
 easier it is to learn long-range dependencies [ 12]. Hence we also compare
the maximum path length between any two input and output positions in networks composed of the
different layer types.
As not...

- attention_is_all_you_need [chunk 16] (score 0.043)
e sequence length. Each training
batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000
target tokens.
5.2 Hardware and Schedule
We trained our models on one ma...

- attention_is_all_you_need [chunk 1] (score 0.041)
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
...
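
The scoring method behind `build_index` and `search` is not shown in this notebook. As one plausible illustration of lexical retrieval over labeled chunks, here is a standard-library sketch of TF-IDF weighting with cosine similarity. It returns hits in the same `(score, (label, text))` shape used above. All function names are hypothetical, and the real indexer may score differently.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Doc = Tuple[str, str]  # (label, text)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf_index(docs: List[Doc]) -> dict:
    """Compute smoothed IDF weights and one sparse TF-IDF vector per document."""
    counts_per_doc = [Counter(tokenize(text)) for _label, text in docs]
    doc_freq: Counter = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in counts.items()} for counts in counts_per_doc]
    return {"docs": docs, "idf": idf, "vectors": vectors}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_search(index: dict, query: str, k: int = 3) -> List[Tuple[float, Doc]]:
    """Return the k highest-scoring (score, (label, text)) pairs for the query."""
    q_counts = Counter(tokenize(query))
    # Query terms never seen in the corpus carry no weight.
    q_vec = {t: c * index["idf"][t] for t, c in q_counts.items() if t in index["idf"]}
    scored = [(cosine(q_vec, vec), doc) for vec, doc in zip(index["vectors"], index["docs"])]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A purely lexical scorer like this rewards shared words, so a broad question with few words in common with the paper can yield low scores across all chunks.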



In [73]:
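# Keep only the chunk text; scores and labels are not passed to the model
top_chunks = [text for _s, (_lbl, text) in hits]

# Both calls receive the same top-3 chunks, so the outputs cover only those passages
summary = summarize_chunks(MISTRAL_API_KEY, "Attention Is All You Need", top_chunks)
analysis = analyze_chunks(MISTRAL_API_KEY, "Attention Is All You Need", top_chunks)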

print("✅ Summary preview:\n", summary[:500])
print("\n---\n")
print("✅ Analysis preview:\n", analysis[:500])



✅ Summary preview:
 - The paper introduces a new model architecture called the Transformer, which is based solely on attention mechanisms and does not use recurrence or convolutions.
- The Transformer is faster than recurrent layers when the sequence length is smaller than the representation dimensionality, especially for sentence representations used in machine translations.
- To improve computational performance for tasks involving very long sequences, the Transformer could be restricted to considering only a nei

---

✅ Analysis preview:
 # Analysis: Attention Is All You Need


## Methods

- Methods:
  - Transformer architecture, based solely on attention mechanisms, without recurrence or convolutions.
  - Self-attention layers for connecting positions in the network, with computational complexity O(n) when considering all positions, or O(n/r) when considering a neighborhood of size r.
  - Convolutional layers with kernel width k < n, requiring a stack of O(n/k) convolutional layers

In [74]:
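meta_out = {
    "file": pdf_path.name,
    "title": meta.get("title", pdf_path.stem),
    # .get() falls back to the default only when the key is missing; extract_metadata
    # returned "authors": None, so this field stays null in the saved JSON.
    "authors": meta.get("authors", "Unknown"),
    "abstract": meta.get("abstract"),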
    "query_used": "What problem does this paper solve?",
    "outputs": {
        "summary_preview": summary[:200] + "...",
        "analysis_preview": analysis[:200] + "..."
    }
}

print("✅ Metadata object:\n")
print(json.dumps(meta_out, indent=2))

out_path = project_root / "results/metadata/attention_is_all_you_need_demo_meta.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(meta_out, indent=2), encoding="utf-8")

print("✅ Saved to:", out_path)



✅ Metadata object:

{
  "file": "attention_is_all_you_need.pdf",
  "title": "Attention Is All You Need",
  "authors": null,
  "abstract": "The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-\nto-German translation task, improving over the existing best results, including\nensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 4