Abstractor

Automated classification of academic research papers into 8 knowledge domains using machine learning.

Abstractor is a 4-stage NLP pipeline that trains a machine learning model to categorize arXiv research papers automatically.

The 4-Stage Pipeline

┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Data Acquisition                                   │
├─────────────────────────────────────────────────────────────┤
│ • Fetches papers from arXiv API                             │
│ • Balances classes (handles cross-listing duplication)      │
│ • Outputs: arxiv_dataset.parquet + .csv                     │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: NLP Feature Engineering                            │
├─────────────────────────────────────────────────────────────┤
│ • Text cleaning                                             │
│ • Lemmatization & tokenization                              │
│ • TF-IDF vectorization (convert text → numbers)             │
│ • Extract metadata features (author count, date, etc.)      │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: ML Model Training                                  │
├─────────────────────────────────────────────────────────────┤
│ • Train XGBoost classifier                                  │
│ • Evaluate on test set (accuracy, F1, confusion matrix)     │
│ • Hyperparameter tuning                                     │
│ • Save final model                                          │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: Deployment                                         │
├─────────────────────────────────────────────────────────────┤
│ • Web interface for real-time classification                │
│ • API endpoint for programmatic access                      │
│ • Performance dashboard                                     │
└─────────────────────────────────────────────────────────────┘

Stage 1: Data Acquisition

The fetch_data.py script automates paper collection. It queries the arXiv API for papers in 8 broad categories, handles cross-listing duplication, and saves the results in both Parquet and CSV formats.
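A cross-listed paper can be returned by more than one category query, so it may appear several times in the raw results. The sketch below shows one way such duplicates could be collapsed. It is not taken from fetch_data.py: the function name `dedupe_cross_listed`, the dict-based records, the keep-first rule, and the sample labels other than "Computer Science" are assumptions made for illustration.

```python
def dedupe_cross_listed(records):
    """Keep one record per arXiv id, preferring the first one seen.

    `records` is an iterable of dicts with at least an "id" key,
    mirroring the dataset columns. Hypothetical helper, not part
    of the repository.
    """
    seen = set()
    unique = []
    for record in records:
        if record["id"] in seen:
            continue  # same paper already returned by another category query
        seen.add(record["id"])
        unique.append(record)
    return unique


# Made-up sample rows: the first two describe the same cross-listed paper.
raw = [
    {"id": "2310.12345", "query_label": "Computer Science"},
    {"id": "2310.12345", "query_label": "Statistics"},
    {"id": "2311.00001", "query_label": "Physics"},
]
papers = dedupe_cross_listed(raw)
```

Keeping the first occurrence is the simplest rule. Assigning each paper to the query that matches its primary_category would be another reasonable choice.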

Dataset Structure

Each row is one paper with these columns:

id                  → "2310.12345" (unique arXiv identifier)
versioned_id        → "2310.12345v2" (includes revision number)
title               → "Neural Networks for Climate Prediction"
abstract            → "We propose a novel approach to..."
primary_category    → "cs.LG" (Computer Science, Machine Learning)
all_categories      → "cs.LG, stat.ML, physics.geo-ph" (where else it's listed)
published_date      → 2023-10-15 (submission date)
updated_date        → 2023-11-02 (last revision)
authors             → "Alice Smith, Bob Jones, Carol White"
journal_ref         → "Nature 2024" (if published; empty if preprint)
doi                 → "10.1234/nature.56789" (if available)
query_label         → "Computer Science" (the category we queried for)
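To illustrate how these columns might be consumed, here is a small sketch that assumes all_categories and authors are stored as comma-separated strings, as the example values suggest. The helper `split_field` and the derived values are hypothetical and do not come from the repository.

```python
def split_field(value):
    """Split a comma-separated column value into a list of clean strings."""
    return [part.strip() for part in value.split(",") if part.strip()]


# Example row built from the sample values in the column list.
row = {
    "id": "2310.12345",
    "versioned_id": "2310.12345v2",
    "primary_category": "cs.LG",
    "all_categories": "cs.LG, stat.ML, physics.geo-ph",
    "authors": "Alice Smith, Bob Jones, Carol White",
}

categories = split_field(row["all_categories"])
authors = split_field(row["authors"])

# Categories other than the primary one are the cross-listings.
cross_listed = [c for c in categories if c != row["primary_category"]]

# The revision number is the integer after the final "v".
version = int(row["versioned_id"].rsplit("v", 1)[1])
```

The length of `authors` gives the author count that Stage 2 lists among its metadata features.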

Installation & Running

# 1. Clone the repository
git clone git@github.com:chiraqL/Abstractor.git
cd Abstractor

# 2. Create a virtual environment
python -m venv .venv

# Activate it:
# On macOS / Linux:
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the data fetcher (expect 25-40 minutes)
python fetch_data.py

Stage 2: NLP Feature Engineering

Run notebooks/02_eda.ipynb
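For readers new to TF-IDF, the sketch below shows the idea in plain Python. It is a standalone illustration, not the notebook's code: the function name `tfidf`, the unsmoothed formula, and the three toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the term's share of the document's tokens and
    idf = log(N / df), one common textbook variant. Libraries such
    as scikit-learn add smoothing and normalization on top of this.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy, already-tokenized abstracts (made up for this example).
docs = [
    "neural network climate prediction".split(),
    "neural network image classification".split(),
    "galaxy cluster mass estimate".split(),
]
weights = tfidf(docs)
```

In the first document, "climate" gets a higher weight than "neural" because it appears in only one of the three documents, which is what makes rare terms useful for telling categories apart.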

Stage 3: ML Model Training

Run notebooks/03_modeling.ipynb
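The evaluation metrics named in the pipeline overview (accuracy, F1, confusion matrix) can be computed as in the sketch below. It uses plain Python rather than the notebook's actual code. The function name `evaluate`, the choice of macro averaging for F1, and the toy labels are assumptions.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy, macro-averaged F1, and confusion counts."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    f1_scores = []
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1, confusion


# Made-up labels: one "cs" paper is misclassified as "math".
y_true = ["cs", "cs", "math", "physics"]
y_pred = ["cs", "math", "math", "physics"]
accuracy, macro_f1, confusion = evaluate(y_true, y_pred)
```

Macro averaging weights every category equally, which suits a class-balanced dataset. A weighted average would be the usual alternative if the classes were imbalanced.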

Stage 4: Deployment

streamlit run app/app.py

Last updated: May 2026
