# CS Subfield Classifier using SPECTER + XGBoost

This notebook builds a subfield classifier for Computer Science research papers using:
- SPECTER embeddings (from Title + Abstract)
- XGBoost for classification
- LabelEncoder for encoding subfields

Target: at least 90% classification accuracy across CS subfields such as AI and CV.
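To make the label-encoding step concrete, here is a minimal plain-Python sketch of the kind of mapping that `sklearn.preprocessing.LabelEncoder` produces: distinct labels are sorted and assigned integer ids starting at 0, which is the form a multiclass XGBoost model works with. The function name `fit_label_mapping` and the sample labels are illustrative assumptions, not part of the notebook.

```python
def fit_label_mapping(labels):
    """Return (classes, to_id): sorted distinct labels and a label -> integer id dict."""
    classes = sorted(set(labels))
    to_id = {label: i for i, label in enumerate(classes)}
    return classes, to_id


# Hypothetical sample labels, only for illustration.
sample = ["CV", "AI", "NLP", "AI"]
classes, to_id = fit_label_mapping(sample)

# Encode labels as integers, then decode them back to the original strings.
encoded = [to_id[label] for label in sample]
decoded = [classes[i] for i in encoded]
```

Keeping `classes` around matters because it is what turns a predicted integer id back into a subfield name.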

In [1]:
import sentence_transformers
import xgboost
import sklearn
import joblib

print("All packages are already installed and ready ✅")

All packages are already installed and ready ✅


## 1. Load and Prepare CS Subfield Dataset

We load the `CS_subfields.csv` file from the `Data/` directory. Each row contains:

- **Title**: Research paper title
- **Abstract**: Paper abstract
- **Subfield**: CS subfield label (e.g., AI, CV, NLP, SE)
- **Link**: Source URL for the paper

We also concatenate the Title and Abstract into a new `input_text` column, which will be used for generating SPECTER embeddings.

In [4]:
import pandas as pd

# Step 1: Load dataset
df = pd.read_csv("Data/CS_subfields.csv")

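# Create input_text for embedding.
# fillna("") keeps missing values from becoming the literal string "nan".
df["input_text"] = (
    df["Title"].fillna("").astype(str).str.strip()
    + " "
    + df["Abstract"].fillna("").astype(str).str.strip()
).str.strip()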

# Show shape and preview only the 4 relevant columns
print(df.shape)
df[["Title", "Abstract", "Subfield", "Link"]].head()

(1498, 5)


|   | Title | Abstract | Subfield | Link |
|---|---|---|---|---|
| 0 | Beyond Frameworks: Unpacking Collaboration Str... | Multi-agent collaboration has emerged as a piv... | AI | http://arxiv.org/abs/2505.12467v1 |
| 1 | Any-to-Any Learning in Computational Pathology... | Recent advances in computational pathology and... | AI | http://arxiv.org/abs/2505.12711v1 |
| 2 | AutoMat: Enabling Automated Crystal Structure ... | Machine learning-based interatomic potentials ... | AI | http://arxiv.org/abs/2505.12650v1 |
| 3 | ACU: Analytic Continual Unlearning for Efficie... | The development of artificial intelligence dem... | AI | http://arxiv.org/abs/2505.12239v1 |
| 4 | Empowering Sustainable Finance with Artificial... | This chapter explores the convergence of two m... | AI | http://arxiv.org/abs/2505.12012v1 |
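The preview only shows rows labelled AI, so it says nothing about how the 1,498 papers are spread across subfields. Since an overall accuracy figure is easier to interpret when the class sizes are known, a quick count per label can help. The sketch below uses only the standard library; `summarize_labels` is a hypothetical helper name and the sample list is made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Made-up labels, only to show the output shape.
summary = summarize_labels(["AI", "CV", "AI", "NLP"])
for label, n, share in summary:
    print(f"{label}: {n} papers ({share:.0%})")
```

On the notebook's data this would be called as `summarize_labels(df["Subfield"])`; `df["Subfield"].value_counts()` gives the same counts directly in pandas.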


In [5]:
# Preview the constructed input_text
df["input_text"].head()

0    Beyond Frameworks: Unpacking Collaboration Str...
1    Any-to-Any Learning in Computational Pathology...
2    AutoMat: Enabling Automated Crystal Structure ...
3    ACU: Analytic Continual Unlearning for Efficie...
4    Empowering Sustainable Finance with Artificial...
Name: input_text, dtype: object
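The notebook joins Title and Abstract with a single space. The published usage example for SPECTER instead joins them with the tokenizer's separator token, which is `[SEP]` for its BERT-style tokenizer. The sketch below makes the separator a parameter so both variants can be compared; `join_title_abstract` is a hypothetical helper, it assumes missing values are `None` or empty strings, and whether `[SEP]` changes accuracy on this dataset is not something the notebook establishes.

```python
def join_title_abstract(title, abstract, sep=" "):
    """Join a title and an abstract with `sep`, skipping whichever part is empty."""
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if not abstract:
        return title
    if not title:
        return abstract
    return f"{title}{sep}{abstract}"


# Same inputs, two separators: the notebook's space and the [SEP] variant.
with_space = join_title_abstract(" A title ", " An abstract. ")
with_sep = join_title_abstract(" A title ", " An abstract. ", sep="[SEP]")
```

Applied to the DataFrame, this could look like `df["input_text"] = [join_title_abstract(t, a, sep="[SEP]") for t, a in zip(df["Title"], df["Abstract"])]`, provided missing values have been replaced with empty strings first.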