Abstractor

Automated classification of academic research papers into 8 knowledge domains using machine learning.

Abstractor is a 4-stage NLP pipeline that trains a machine learning model to categorize arXiv research papers automatically.

The 4-Stage Pipeline

┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: Data Acquisition                                   │
├─────────────────────────────────────────────────────────────┤
│ • Fetches papers from arXiv API                             │
│ • Balances classes (handles cross-listing duplication)      │
│ • Outputs: arxiv_dataset.parquet + .csv                     │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: NLP Feature Engineering                            │
├─────────────────────────────────────────────────────────────┤
│ • Text cleaning                                             │
│ • Lemmatization & tokenization                              │
│ • TF-IDF vectorization (convert text → numbers)             │
│ • Extract metadata features (author count, date, etc.)      │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: ML Model Training                                  │
├─────────────────────────────────────────────────────────────┤
│ • Train XGBoost classifier                                  │
│ • Evaluate on test set (accuracy, F1, confusion matrix)     │
│ • Hyperparameter tuning                                     │
│ • Save final model                                          │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: Demo                                               │
├─────────────────────────────────────────────────────────────┤
│ • Web interface for real-time classification                │
│ • API endpoint for programmatic access                      │
│ • Performance dashboard                                     │
└─────────────────────────────────────────────────────────────┘

Stage 1: Data Acquisition

The fetch_data.py script automates paper collection. It queries the arXiv API for papers in 8 broad categories, handles cross-listing duplication, and saves the results in both Parquet and CSV formats.
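As a rough illustration of the collection step, the sketch below builds an arXiv API query URL for one category and drops papers that appear under more than one query because of cross-listing. It is not the code in fetch_data.py. The helper names `build_query_url` and `dedupe_cross_listed`, the page size, and the "first occurrence wins" rule are assumptions made for the example. Only the `export.arxiv.org/api/query` endpoint and its `search_query`, `start`, and `max_results` parameters come from the public arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category: str, start: int = 0, max_results: int = 100) -> str:
    """Return an arXiv API URL for one page of papers in `category`."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def dedupe_cross_listed(rows: list[dict]) -> list[dict]:
    """Keep one row per arXiv id (assumed rule: first occurrence wins)."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique
```

A paper fetched under both "Computer Science" and "Statistics" would keep only the row from whichever query ran first. A real fetcher might instead prefer the row whose `query_label` matches the paper's `primary_category`.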

Dataset Structure

Each row is one paper with these columns:

id                  → "2310.12345" (unique arXiv identifier)
versioned_id        → "2310.12345v2" (includes revision number)
title               → "Neural Networks for Climate Prediction"
abstract            → "We propose a novel approach to..."
primary_category    → "cs.LG" (Computer Science, Machine Learning)
all_categories      → "cs.LG, stat.ML, physics.geo-ph" (where else it's listed)
published_date      → 2023-10-15 (submission date)
updated_date        → 2023-11-02 (last revision)
authors             → "Alice Smith, Bob Jones, Carol White"
journal_ref         → "Nature 2024" (if published; empty if preprint)
doi                 → "10.1234/nature.56789" (if available)
query_label         → "Computer Science" (the category we queried for)
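
To make the column layout concrete, here is a minimal sketch that derives `id` from `versioned_id` and splits `all_categories` into a list. The function names and the regular expression are illustrative assumptions, not part of the repository, and the pattern only covers new-style identifiers like the ones shown above.

```python
import re

# Matches new-style identifiers such as "2310.12345v2" (assumed format).
_VERSIONED_ID = re.compile(r"^(?P<id>\d{4}\.\d{4,5})v(?P<version>\d+)$")


def split_versioned_id(versioned_id: str) -> tuple[str, int]:
    """Split "2310.12345v2" into the base id and the revision number."""
    match = _VERSIONED_ID.match(versioned_id)
    if match is None:
        raise ValueError(f"unrecognized arXiv id: {versioned_id!r}")
    return match.group("id"), int(match.group("version"))


def parse_categories(all_categories: str) -> list[str]:
    """Turn "cs.LG, stat.ML" into ["cs.LG", "stat.ML"]."""
    return [c.strip() for c in all_categories.split(",") if c.strip()]


def is_cross_listed(row: dict) -> bool:
    """A paper is cross-listed when it has categories beyond its primary one."""
    categories = parse_categories(row["all_categories"])
    return any(c != row["primary_category"] for c in categories)
```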

Installation & Running

# 1. Clone the repository
git clone git@github.com:chiraqL/Abstractor.git
cd Abstractor

# 2. Create a virtual environment
python -m venv .venv

# Activate it:
# On macOS / Linux:
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the data fetcher (expect 25-40 minutes)
python fetch_data.py

Stage 2: NLP Feature Engineering

Run notebooks/02_eda.ipynb
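
The TF-IDF step listed for this stage can be illustrated with a from-scratch sketch that uses only the standard library. It is not the notebook's code, which more likely relies on a library vectorizer. The function names, the simple regex tokenizer (no lemmatization), and the add-one smoothed IDF are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphabetic tokens (no lemmatization in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document.

    Uses tf = count / doc_length and the smoothed
    idf = ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors
```

Terms that appear in every abstract (such as "neural" in a machine learning corpus) get the minimum IDF of 1, while terms confined to a few abstracts are weighted higher, which is what makes them useful for telling domains apart.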

Stage 3: ML Model Training

Run notebooks/03_modeling.ipynb
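
The evaluation metrics named for this stage (accuracy, F1, confusion matrix) can be sketched in plain Python as follows. This is an illustration of the definitions, not the notebook's implementation. The function names are hypothetical, and macro averaging is an assumption, since the README does not say which F1 variant is reported.

```python
from collections import Counter


def confusion_counts(y_true: list[str], y_pred: list[str]) -> Counter:
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    cm = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(n for (t, p), n in cm.items() if p == label and t != label)
        fn = sum(n for (t, p), n in cm.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro F1 treats all 8 domains equally regardless of how many test papers each has, so it complements accuracy when some classes are harder than others.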

Stage 4: Demo

streamlit run app/app.py

Last updated: May 2026
