# Introduction

Keeping up with the rapid growth of scientific publications is a major challenge for researchers. Traditional literature reviews are manual, time-consuming, and difficult to scale. To address this, we built an AI Research Assistant powered by LLaMA models served through the Groq API and by the arXiv API.

The system automates the entire pipeline of a literature review:

* 🔍 Search relevant papers from arXiv.
* 📄 Extract & summarize content from PDFs.
* ✨ Generate concise summaries for quick understanding.
* 🔄 Iteratively suggest new research directions, enabling continuous exploration.

This project showcases how agentic AI systems can go beyond simple question answering to perform autonomous academic exploration, assisting researchers, students, and professionals in navigating vast research landscapes.

# Installing Dependencies

In [1]:
!pip install -q groq arxiv PyPDF2 termcolor requests tenacity



# Importing Necessary Libraries

In [2]:
import os
import re
import time
import json
import requests
from dataclasses import dataclass
from typing import List, Tuple, Optional

import arxiv
from PyPDF2 import PdfReader
from termcolor import colored
from tenacity import retry, wait_exponential_jitter, stop_after_attempt, retry_if_exception_type
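# The exception classes are exported from the top-level package,
# so the private groq._exceptions module does not need to be imported.
from groq import Groq, RateLimitError, APIStatusError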
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
groq_api_key = user_secrets.get_secret("GROQ_API_KEY")

# Only export the key when one was returned; os.environ rejects None values.
if groq_api_key:
    os.environ["GROQ_API_KEY"] = groq_api_key

GROQ_MODEL = "llama-3.1-70b-versatile"

# Pipeline parameters
MAX_RESULTS = 10
NUMBER_OF_TURNS = 10
INITIAL_SEARCH_TERM = "coding ability of large language models"
PAPERS_DIR = "research_papers"

# Chunking — character-based
CHUNK_SIZE = 6000
CHUNK_OVERLAP = 400       
MAX_CHUNKS = 30

# Output controls
MAX_SUMMARY_TOKENS = 512 
TEMPERATURE = 0.3         

# Basic checks
if not groq_api_key:
    print(colored("⚠️  Missing GROQ_API_KEY. Set it in your environment or this cell.", "yellow"))
else:
    print(colored("✓ GROQ API key detected", "green"))
print(colored(f"Using Groq model: {GROQ_MODEL}", "cyan"))


✓ GROQ API key detected
Using Groq model: llama-3.1-70b-versatile
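
The chunking parameters in the configuration cell put a ceiling on how much of each paper can reach the model. As a rough sketch (the helper name `max_covered_chars` is hypothetical and not part of the notebook), the ceiling follows from the chunk size, the overlap, and the chunk limit:

```python
def max_covered_chars(chunk_size: int, overlap: int, max_chunks: int) -> int:
    """Upper bound on the characters reached by overlapping character chunks."""
    if max_chunks <= 0:
        return 0
    # Every chunk after the first contributes only (chunk_size - overlap) new characters.
    stride = chunk_size - overlap
    return chunk_size + (max_chunks - 1) * stride


# Values copied from CHUNK_SIZE, CHUNK_OVERLAP and MAX_CHUNKS above.
print(max_covered_chars(6000, 400, 30))  # → 168400
```

With these settings, text beyond roughly 168,000 characters of a paper would not be summarized at all, which is worth keeping in mind for long documents.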


# Helper Functions

In [3]:
def sanitize_folder_name(name: str, max_length: int = 60) -> str:
    s = re.sub(r'[\\/*?:"<>|]', "_", name).strip()
    s = re.sub(r"\s+", "_", s)
    if len(s) > max_length:
        s = s[:max_length].rsplit("_", 1)[0]
    return s or "untitled"

def ensure_dir(path: str) -> str:
    os.makedirs(path, exist_ok=True)
    return path

def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP, max_chunks: int = MAX_CHUNKS) -> List[str]:
    if not text:
        return []
    chunks = []
    i = 0
    n = len(text)
    while i < n and len(chunks) < max_chunks:
        end = min(i + chunk_size, n)
        chunks.append(text[i:end])
        if end == n:
            break
        # Step back by `overlap`, but always advance so the loop cannot stall
        # when overlap >= chunk_size.
        i = max(end - overlap, i + 1)
    return chunks

def save_text(path: str, content: str):
    ensure_dir(os.path.dirname(path))
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
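
To make the overlap behaviour of `chunk_text` concrete, here is a small self-contained sketch of the same sliding-window idea. The function name `chunk_with_overlap` and the explicit argument validation are assumptions added for illustration; they are not part of the notebook.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, overlap: int, max_chunks: int) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window moves each step
    chunks: List[str] = []
    start = 0
    while start < len(text) and len(chunks) < max_chunks:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += stride
    return chunks


demo = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1, max_chunks=10)
print(demo)  # → ['abcd', 'defg', 'ghij']
```

Each window repeats the last character of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.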


# ArXiv tools (Client API)

In [4]:
def get_arxiv_papers(query: str, max_results: int = 5):
    try:
        client = arxiv.Client(page_size=max_results)
        search = arxiv.Search(query=query, max_results=max_results, sort_by=arxiv.SortCriterion.Relevance)
        return list(client.results(search))
    except Exception as e:
        print(f"Error: {e}")
        return []

def download_pdf(url: str, filename: str, folder: str) -> str:
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    with open(path, "wb") as f:
        f.write(r.content)
    return path

def extract_text_from_pdf(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    return "\n".join([page.extract_text() or "" for page in reader.pages]).strip()
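
Two small checks can make the download step more robust. The sketch below is illustrative only, and both helper names (`pdf_filename_for`, `looks_like_pdf`) are hypothetical. It assumes that old-style arXiv identifiers such as `hep-th/9901001v1` contain a slash, which is unsafe in a file name, and that a valid PDF begins with the `%PDF-` header.

```python
def pdf_filename_for(short_id: str) -> str:
    """Build a flat file name from an arXiv short ID (old-style IDs contain '/')."""
    return short_id.replace("/", "_") + ".pdf"


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files start with the '%PDF-' header."""
    return data[:1024].lstrip().startswith(b"%PDF-")


print(pdf_filename_for("hep-th/9901001v1"))
print(looks_like_pdf(b"%PDF-1.7\n..."), looks_like_pdf(b"<html>error page</html>"))
```

A check like `looks_like_pdf(r.content)` could run in `download_pdf` before writing to disk, so that an HTML error page is not handed to `PdfReader`.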


# Groq (LLaMA) Chat Helpers

In [5]:
client = None
if groq_api_key:
    client = Groq(api_key=groq_api_key)

@retry(
    wait=wait_exponential_jitter(initial=1, max=20),
    stop=stop_after_attempt(5),
    retry=retry_if_exception_type((RateLimitError, APIStatusError, requests.exceptions.RequestException))
)
def llama_chat(messages, model=GROQ_MODEL, temperature=TEMPERATURE, max_tokens=MAX_SUMMARY_TOKENS):
    if client is None:
        raise RuntimeError("Groq client not initialized (missing key).")
    resp = client.chat.completions.create(
        model=model, messages=messages, temperature=temperature, max_tokens=max_tokens
    )
    return resp.choices[0].message.content.strip()

def choose_paper_with_llama(papers, search_term: str = "") -> Tuple[Optional[int], str]:
    # Number the candidates so the model can answer with an index into `papers`.
    listing = [f"{i}. {p.title.strip()}\nAbstract: {(p.summary or '').strip()}" for i, p in enumerate(papers)]
    prompt = (
        "You are a careful academic assistant. From the numbered list of papers, "
        f"pick the SINGLE best paper for the search term: {search_term!r}. "
        "Respond in exactly this JSON format:\n"
        '{ "choice": <number>, "reason": "<brief reason without numeric lists>" }\n\n'
        "Papers:\n" + "\n".join(listing)
    )
    messages = [
        {"role": "system", "content": "You select the most promising academic paper and explain briefly."},
        {"role": "user", "content": prompt},
    ]
    try:
        out = llama_chat(messages, max_tokens=256)
        m = re.search(r'\{.*\}', out, re.S)
        if m:
            j = json.loads(m.group(0))
            return int(j.get("choice")), j.get("reason", "")
        # Fallback: take the first integer that appears in a free-text reply.
        num = re.search(r'\d+', out)
        return (int(num.group(0)) if num else None), out
    except Exception:
        # Covers API failures as well as malformed or incomplete JSON.
        return None, ""
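
Model replies do not always follow the requested JSON format, and a returned index may fall outside the list. The following is a minimal sketch of a stricter parser, assuming the reply format requested in the prompt above. The name `parse_choice` and its `n_papers` parameter are hypothetical and not part of the notebook.

```python
import json
import re
from typing import Optional, Tuple


def parse_choice(reply: str, n_papers: int) -> Tuple[Optional[int], str]:
    """Return (index, reason); index is None when the reply is unusable."""
    match = re.search(r"\{.*\}", reply, re.S)
    if not match:
        return None, reply.strip()
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None, reply.strip()
    if not isinstance(data, dict):
        return None, reply.strip()
    choice = data.get("choice")
    reason = str(data.get("reason", ""))
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(choice, bool) or not isinstance(choice, int):
        return None, reason
    # Reject indices that do not point at one of the listed papers.
    if not 0 <= choice < n_papers:
        return None, reason
    return choice, reason


print(parse_choice('Sure! {"choice": 2, "reason": "most relevant"}', 5))
```

Returning `None` for an out-of-range index lets the caller fall back to a default, such as the top search result, instead of raising an `IndexError` later.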


# Summarization Helpers

In [12]:
def safe_summarize_text(text: str):
    chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP)
    summaries = []

    for i, chunk in enumerate(chunks, 1):
        if not chunk.strip():
            print(f"⚠️ Skipping empty chunk {i}")
            continue
        try:
            summary = groq_generate(
                f"Summarize the following paper content:\n\n{chunk}",
                max_tokens=MAX_SUMMARY_TOKENS,
                temperature=TEMPERATURE,
            )
            if summary and summary.strip():
                summaries.append(summary.strip())
        except Exception as e:
            print(f"✗ Summarization failed on chunk {i}: {e}")

    return "\n\n".join(summaries) if summaries else None

def search_arxiv(query, max_results=10):
    try:
        search = arxiv.Search(
            query=query,
            max_results=max_results,
            sort_by=arxiv.SortCriterion.Relevance,
        )
        # Search.results() is deprecated in arxiv 2.x; go through a Client instead.
        return list(arxiv.Client().results(search))
    except Exception as e:
        print(f"✗ Arxiv search failed: {e}")
        return []

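def groq_generate(prompt, max_tokens=200, temperature=0.7):
    # Single-prompt helper. Note that it calls the smaller "llama-3.1-8b-instant"
    # model rather than GROQ_MODEL, and it does not go through the retry wrapper.
    if not prompt or not prompt.strip():
        print("⚠️ Empty prompt given to Groq, skipping.")
        return None
    try:
        groq_client = Groq(api_key=groq_api_key)
        resp = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=temperature,
        )
        return resp.choices[0].message.content.strip()
    except Exception as e:
        # Failures are reported and signalled with None so the pipeline keeps running.
        print(f"✗ Groq call failed: {e}")
        return None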

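def extract_paper_text(paper, folder=PAPERS_DIR):
    # Title and abstract are always included; the PDF body is appended when readable.
    header = f"Title: {paper.title.strip()}\n\nAbstract: {paper.summary.strip()}"
    try:
        pdf_path = download_pdf(paper.pdf_url, f"{paper.get_short_id()}.pdf", folder)
        body = extract_text_from_pdf(pdf_path)
    except Exception as e:
        print(f"✗ Failed to extract PDF text, falling back to the abstract: {e}")
        body = ""
    return f"{header}\n\n{body}".strip()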

def save_summary(paper, summary, folder=PAPERS_DIR):
    try:
        fname = os.path.join(folder, f"{paper.get_short_id()}_summary.txt")
        os.makedirs(folder, exist_ok=True)
        with open(fname, "w", encoding="utf-8") as f:
            f.write(summary)
        print(f"✓ Saved summary to {fname}")
    except Exception as e:
        print(f"✗ Failed to save summary: {e}")
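
`safe_summarize_text` joins the per-chunk summaries as they are, so a long paper produces a long, possibly repetitive result. One common alternative is a second pass that merges the partial summaries. The sketch below is not the notebook's method: `summarize_in_two_passes` is a hypothetical name, and the `summarize` argument stands in for a call such as `groq_generate`.

```python
from typing import Callable, List, Optional


def summarize_in_two_passes(
    chunks: List[str],
    summarize: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Summarize each chunk, then ask for one merged summary of the partials."""
    partials: List[str] = []
    for chunk in chunks:
        if not chunk.strip():
            continue  # nothing to summarize in a whitespace-only chunk
        result = summarize(f"Summarize the following paper content:\n\n{chunk}")
        if result and result.strip():
            partials.append(result.strip())
    if not partials:
        return None
    if len(partials) == 1:
        return partials[0]  # a single partial needs no merge pass
    combined = "\n\n".join(partials)
    merged = summarize(
        "Merge these partial summaries into one coherent summary:\n\n" + combined
    )
    # Fall back to the joined partials if the merge call returns nothing.
    return merged.strip() if merged and merged.strip() else combined


def fake_summarize(prompt: str) -> str:
    """Stand-in for a real model call, used only for this demonstration."""
    return f"[summary of {len(prompt)} prompt characters]"


print(summarize_in_two_passes(["first chunk", "   ", "second chunk"], fake_summarize))
```

Passing the summarizer as an argument keeps the function testable without network access. In the notebook, something like `lambda p: groq_generate(p, max_tokens=MAX_SUMMARY_TOKENS, temperature=TEMPERATURE)` could be supplied instead.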


# Iterative Paper Pipeline

In [13]:
def run_iteration(search_term, prev_paper_title=""):
    print(colored(f"Searching arXiv for: {search_term}", "green"))
    papers = search_arxiv(search_term)

    if not papers:
        print(colored("⚠️ No papers found. Keeping same search term.", "yellow"))
        return search_term, None

    chosen_paper = papers[0]  # take top result
    print(colored(f"Chosen paper: {chosen_paper.title}", "magenta"))

    text = extract_paper_text(chosen_paper)
    if text:
        final_summary = safe_summarize_text(text)
        if final_summary:
            save_summary(chosen_paper, final_summary)
        
    # --- KEY CHANGE: force Groq to output only a topic ---
    new_prompt = (
        f"Based on the paper '{chosen_paper.title}', suggest ONE concise related "
        f"research topic (5-8 words max, no punctuation)."
    )
    new_term = groq_generate(new_prompt, max_tokens=30)
    # Drop stray quotes and trailing periods the model sometimes adds anyway.
    if new_term:
        new_term = new_term.strip().strip("\"'.")

    if not new_term or len(new_term.split()) < 2:
        print(colored("⚠️ Invalid search term generated, reusing old term.", "yellow"))
        new_term = search_term

    return new_term, chosen_paper
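
Because `run_iteration` always takes the top search result, consecutive turns can select the same paper. The sketch below shows one possible way to avoid that, together with a cleanup step for the generated search term. `pick_unseen`, `clean_search_term`, and `FakePaper` are hypothetical names introduced for illustration. The only assumption about paper objects is that they expose `get_short_id()`, as the arXiv results used above do.

```python
import re
from typing import Iterable, Optional, Set


def pick_unseen(papers: Iterable, seen_ids: Set[str]):
    """Return the first paper whose short ID has not been chosen before."""
    for paper in papers:
        paper_id = paper.get_short_id()
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)  # remember it for later turns
            return paper
    return None


def clean_search_term(raw: Optional[str], max_words: int = 8) -> str:
    """Keep the first line, drop punctuation, and cap the number of words."""
    if not raw or not raw.strip():
        return ""
    first_line = raw.strip().splitlines()[0]
    words = re.sub(r"[^\w\s-]", " ", first_line).split()
    return " ".join(words[:max_words])


class FakePaper:
    """Stand-in for an arXiv result, used only for this demonstration."""

    def __init__(self, short_id: str):
        self._short_id = short_id

    def get_short_id(self) -> str:
        return self._short_id


seen: Set[str] = set()
results = [FakePaper("1811.02034v1"), FakePaper("2404.03114v1")]
print(pick_unseen(results, seen).get_short_id())  # first turn
print(pick_unseen(results, seen).get_short_id())  # next turn skips the repeat
print(clean_search_term("Assessing the Impact of Code Comments on Debugging."))
```

In the main loop, a single `seen` set shared across turns would let the assistant move to the next-ranked result whenever the top hit was already summarized.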


# Main Loop

In [14]:
print(colored("=== LLaMA Research Assistant (Groq) ===", "cyan"))
if not groq_api_key:
    raise SystemExit("Please set GROQ_API_KEY and re-run.")

ensure_dir(PAPERS_DIR)

chosen_paper = None
search_term = INITIAL_SEARCH_TERM or "artificial intelligence research"

for turn in range(1, NUMBER_OF_TURNS + 1):
    print(colored(f"\n=== Research Turn {turn}/{NUMBER_OF_TURNS} ===", "cyan"))

    prev_title = chosen_paper.title if chosen_paper else "No previous paper"

    search_term, chosen_paper = run_iteration(
        search_term,
        prev_paper_title=prev_title
    )


=== LLaMA Research Assistant (Groq) ===

=== Research Turn 1/10 ===
Searching arXiv for: coding ability of large language models
Chosen paper: Testing the Effect of Code Documentation on Large Language Model Code Understanding




✓ Saved summary to research_papers/2404.03114v1_summary.txt

=== Research Turn 2/10 ===
Searching arXiv for: Assessing the Impact of Code Comments on Debugging.
Chosen paper: Out-Of-Place debugging: a debugging architecture to reduce debugging interference
✓ Saved summary to research_papers/1811.02034v1_summary.txt

=== Research Turn 3/10 ===
Searching arXiv for: Reducing Debugging Interference in Cloud Computing Environments
Chosen paper: Out-Of-Place debugging: a debugging architecture to reduce debugging interference
✓ Saved summary to research_papers/1811.02034v1_summary.txt

=== Research Turn 4/10 ===
Searching arXiv for: Investigating Real-Time Debugging Techniques for High-Performance Systems
Chosen paper: Out-Of-Place debugging: a debugging architecture to reduce debugging interference
✓ Saved summary to research_papers/1811.02034v1_summary.txt

=== Research Turn 5/10 ===
Searching arXiv for: Improving debugging techniques for distributed and concurrent systems.
Chosen paper: A

# Conclusion

This project demonstrates how LLMs (LLaMA via Groq API) can be combined with arXiv search, automated summarization, and iterative refinement to build an autonomous AI Research Assistant. By integrating paper retrieval, PDF parsing, text chunking, summarization, and topic expansion into a single pipeline, the system can:

* Continuously explore new research directions.
* Generate concise, human-readable summaries of academic papers.
* Suggest fresh and relevant research topics for further investigation.

The assistant streamlines the literature review process and illustrates the potential of agentic AI systems that explore a topic iteratively, with each paper steering the next search.

While the current version focuses on arXiv papers, the pipeline can be extended to other sources (PubMed, Semantic Scholar, IEEE Xplore) and enhanced with advanced summarization techniques, semantic ranking, and interactive interfaces (voice/chat apps).

In short, this project is a step toward creating autonomous research companions that augment human intelligence in navigating the ever-growing landscape of scientific knowledge.

# THANK YOU