Automated classification of academic research papers into 8 knowledge domains using machine learning.
Abstractor is a 4-stage NLP pipeline that trains a machine learning model to categorize arXiv research papers automatically.
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Data Acquisition │
├─────────────────────────────────────────────────────────────┤
│ • Fetches papers from arXiv API │
│ • Balances classes (handles cross-listing duplication) │
│ • Outputs: arxiv_dataset.parquet + .csv │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: NLP Feature Engineering │
├─────────────────────────────────────────────────────────────┤
│ • Text cleaning │
│ • Lemmatization & tokenization │
│ • TF-IDF vectorization (convert text → numbers) │
│ • Extract metadata features (author count, date, etc.) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: ML Model Training │
├─────────────────────────────────────────────────────────────┤
│ • Train XGBoost classifier │
│ • Evaluate on test set (accuracy, F1, confusion matrix) │
│ • Hyperparameter tuning │
│ • Save final model │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: Demo │
├─────────────────────────────────────────────────────────────┤
│ • Web interface for real-time classification │
│ • API endpoint for programmatic access │
│ • Performance dashboard │
└─────────────────────────────────────────────────────────────┘
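To make the "convert text → numbers" step in Stage 2 concrete, here is a minimal hand-rolled TF-IDF sketch using only the standard library. It is an illustration, not the project's code: the `tokenize` and `tfidf` helpers and the sample abstracts are hypothetical, and the regex tokenizer stands in for the real cleaning and lemmatization steps.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphabetic runs (a stand-in for cleaning + lemmatization)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        vectors.append(
            {
                term: (count / total) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Hypothetical sample abstracts, for illustration only.
abstracts = [
    "We train neural networks for climate prediction",
    "We prove a theorem about prime numbers",
]
vectors = tfidf(abstracts)
```

A word that appears in every abstract (here "we") gets weight 0, while a word unique to one abstract (here "neural") gets a positive weight. Library implementations typically add smoothing and vector normalization on top of this basic formula.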
The fetch_data.py script automates paper collection. It queries the arXiv API for papers in 8 broad categories, handles cross-listing duplication, and saves the results in both Parquet and CSV formats.
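The README does not spell out how fetch_data.py deduplicates and balances, so the following is only one plausible approach, sketched under assumptions: a cross-listed paper can be returned by more than one category query, so records are deduplicated on `id` (first occurrence wins), and classes are then downsampled to the size of the smallest one. The `dedupe_by_id` and `balance` helpers and the sample records are hypothetical.

```python
import random
from collections import defaultdict


def dedupe_by_id(papers: list[dict]) -> list[dict]:
    """Keep the first record seen for each arXiv id."""
    seen = set()
    unique = []
    for paper in papers:
        if paper["id"] not in seen:
            seen.add(paper["id"])
            unique.append(paper)
    return unique


def balance(papers: list[dict], seed: int = 0) -> list[dict]:
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for paper in papers:
        by_label[paper["query_label"]].append(paper)
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Made-up records; only the two columns used here are filled in.
papers = [
    {"id": "2310.00001", "query_label": "Computer Science"},
    {"id": "2310.00002", "query_label": "Computer Science"},
    {"id": "2310.00001", "query_label": "Statistics"},  # cross-listed duplicate
    {"id": "2310.00003", "query_label": "Statistics"},
    {"id": "2310.00004", "query_label": "Computer Science"},
]
unique = dedupe_by_id(papers)
balanced = balance(unique)
```

Deduplicating before balancing matters: if the order were reversed, removing duplicates afterwards could leave the classes uneven again.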
Each row is one paper with these columns:
id → "2310.12345" (unique arXiv identifier)
versioned_id → "2310.12345v2" (includes revision number)
title → "Neural Networks for Climate Prediction"
abstract → "We propose a novel approach to..."
primary_category → "cs.LG" (Computer Science, Machine Learning)
all_categories → "cs.LG, stat.ML, physics.geo-ph" (where else it's listed)
published_date → 2023-10-15 (submission date)
updated_date → 2023-11-02 (last revision)
authors → "Alice Smith, Bob Jones, Carol White"
journal_ref → "Nature 2024" (if published; empty if preprint)
doi → "10.1234/nature.56789" (if available)
query_label → "Computer Science" (the category we queried for)
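As an illustration of how these columns could feed the metadata features mentioned in Stage 2 (author count, dates, and so on), here is a small sketch. It assumes that `authors` and `all_categories` are comma-separated strings as in the examples above and that the date columns are already `date` objects (when reading the CSV they would need parsing first). The `metadata_features` helper and its output names are hypothetical.

```python
import re
from datetime import date

# One row, using the example values from the column list above.
row = {
    "id": "2310.12345",
    "versioned_id": "2310.12345v2",
    "all_categories": "cs.LG, stat.ML, physics.geo-ph",
    "published_date": date(2023, 10, 15),
    "updated_date": date(2023, 11, 2),
    "authors": "Alice Smith, Bob Jones, Carol White",
}


def metadata_features(row: dict) -> dict:
    """Derive simple numeric features from one dataset row."""
    # The revision number is the trailing "v<digits>" of versioned_id.
    match = re.search(r"v(\d+)$", row["versioned_id"])
    authors = [a.strip() for a in row["authors"].split(",") if a.strip()]
    categories = [c.strip() for c in row["all_categories"].split(",") if c.strip()]
    return {
        "version": int(match.group(1)) if match else 1,
        "author_count": len(authors),
        "category_count": len(categories),
        "days_to_last_revision": (row["updated_date"] - row["published_date"]).days,
    }


features = metadata_features(row)
# {'version': 2, 'author_count': 3, 'category_count': 3, 'days_to_last_revision': 18}
```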
# 1. Clone the repository
git clone git@github.com:chiraqL/Abstractor.git
cd Abstractor
# 2. Create a virtual environment
python -m venv .venv
# Activate it:
# On macOS / Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Run the data fetcher (expect 25-40 minutes)
python fetch_data.py
# 5. Run notebooks/02_eda.ipynb
# 6. Run notebooks/03_modeling.ipynb
# 7. Start the Streamlit demo
streamlit run app/app.py
Last updated: May 2026