# 🧠 Discipline Classifier: SciBERT + LoRA (PyTorch Loop)

This notebook fine-tunes **`allenai/scibert_scivocab_uncased`** with **LoRA** on a dataset of 5,402 computing research abstracts labeled as **CS**, **IS**, or **IT**.  
The notebook proceeds in three preparatory steps:
1. Load and preprocess the combined dataset (Title + Abstract → `text`, `label`).
2. Tokenize it with `AutoTokenizer`.
3. Wrap SciBERT for sequence classification via PEFT (LoRA).

Below, instead of using `Trainer`, we implement a **pure PyTorch training loop** to avoid version mismatches between `transformers`/`accelerate`/`peft`.
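
Since the motivation here is version mismatches, it can help to record which versions are actually installed before running anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are assumptions for illustration, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages this notebook imports (list assembled by hand; adjust as needed).
versions = installed_versions(
    ["torch", "transformers", "accelerate", "peft", "datasets"]
)
for name, ver in versions.items():
    print(f"{name:>12}: {ver or 'not installed'}")
```

Saving this output alongside the results makes it easier to reproduce a run later.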


In [1]:
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score
import joblib
import os

## 1. Load and Preprocess Dataset

We load the dataset from `Expanded Discipline Dataset.csv`, encode labels (CS, IS, IT), combine Title + Abstract into a single `text` field, and prepare for Hugging Face’s `datasets.Dataset` format.
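
One detail of the label-encoding step is easy to miss: mapping through a dictionary silently yields a missing value for any discipline string outside `CS`, `IS`, and `IT` (pandas `Series.map` produces `NaN` in that case). The pure-Python sketch below mirrors the text and label construction on two made-up rows, which are not taken from the dataset; `build_example` is a hypothetical helper.

```python
label2id = {"CS": 0, "IS": 1, "IT": 2}

# Made-up rows standing in for the CSV; "SE" is deliberately not in label2id.
rows = [
    {"Title": " A ", "Abstract": "First abstract. ", "Discipline": "CS"},
    {"Title": "B", "Abstract": "Second abstract.", "Discipline": "SE"},
]


def build_example(row):
    """Combine title and abstract, and look up the label (None if unmapped)."""
    text = row["Title"].strip() + ". " + row["Abstract"].strip()
    label = label2id.get(row["Discipline"])
    return {"text": text, "label": label}


examples = [build_example(row) for row in rows]
unmapped = [ex for ex in examples if ex["label"] is None]

print(examples[0])
print(f"{len(unmapped)} row(s) with an unmapped discipline")
```

In the pandas version, the equivalent check would be counting missing values in the `label` column after the mapping.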


In [2]:
# 1.1  Standard imports for DataFrame manipulation
import pandas as pd

# 1.2  Read CSV into DataFrame
df = pd.read_csv("Data/Expanded Discipline Dataset.csv")

# 1.3  Drop rows where Title, Abstract, or Discipline is missing
df = df.dropna(subset=["Title", "Abstract", "Discipline"])

# 1.4  Combine Title + Abstract into one 'text' column
df["text"] = df["Title"].str.strip() + ". " + df["Abstract"].str.strip()

# 1.5  Encode discipline labels
label2id = {"CS": 0, "IS": 1, "IT": 2}
id2label = {v: k for k, v in label2id.items()}
df["label"] = df["Discipline"].map(label2id)

# 1.6  Keep only the 'text' and 'label' columns
df = df[["text", "label"]]

# 1.7  Preview
df.head()

| | text | label |
|---|---|---|
| 0 | VITA-Audio: Fast Interleaved Cross-Modal Token... | 0 |
| 1 | AMO: Adaptive Motion Optimization for Hyper-De... | 0 |
| 2 | FlexiAct: Towards Flexible Action Control in H... | 0 |
| 3 | Actor-Critics Can Achieve Optimal Sample Effic... | 0 |
| 4 | Demonstrating ViSafe: Vision-enabled Safety fo... | 0 |


## 2. Import & Load SciBERT Tokenizer

Before tokenizing, we import and instantiate SciBERT’s tokenizer.

In [3]:
from transformers import AutoTokenizer

# 2.1  Load the SciBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

## 3. Tokenize Dataset

Now that we have `tokenizer`, we will:
1. Convert our pandas `df` into a Hugging Face `datasets.Dataset`.  
2. Tokenize each `text` into `input_ids` + `attention_mask`.  
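3. Remove the raw `text` column. The model inputs are `input_ids`, `attention_mask`, and `label`; the tokenizer also emits `token_type_ids`, which stays in the dataset but is not selected when the PyTorch format is set.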

In [4]:
from datasets import Dataset

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
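        max_length=512,  # SciBERT's max sequence length
        # No return_tensors here: `Dataset.map` stores Arrow columns, and
        # `set_format(type="torch")` handles the tensor conversion later.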
    )

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove the original text column (token_type_ids stays in the dataset, as the output shows)
tokenized_dataset = tokenized_dataset.remove_columns(["text"])

print(f"Dataset size: {len(tokenized_dataset)}")
print(f"Features: {tokenized_dataset.features}")

Map:   0%|          | 0/5402 [00:00<?, ? examples/s]

Dataset size: 5402
Features: {'label': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
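
The tokenization cell pads or truncates every example to exactly 512 positions and marks real tokens in `attention_mask`. The toy sketch below shows that mechanic on plain lists of integers. It is not SciBERT's tokenizer: the ids are made up, `pad_or_truncate` is a hypothetical helper, the pad id of 0 is an assumption, and the real tokenizer also keeps the closing `[SEP]` token when truncating, which this sketch does not model.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut to max_length, then right-pad; the mask is 1 for real tokens, 0 for padding."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Because every sequence is padded to the full 512 positions regardless of its real length, each batch costs the same compute as a batch of maximum-length abstracts.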


In [5]:
# Split into train/test (80/20) with a fixed seed for reproducibility
train_test_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

Training samples: 4321
Test samples: 1081
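
The split sizes printed above are consistent with rounding the test share up to a whole example and giving the remainder to training. A quick arithmetic check, where the rounding rule is an assumption inferred from the printed numbers:

```python
import math

n_total = 5402      # dataset size reported by the tokenization cell
test_size = 0.2     # fraction passed to train_test_split

n_test = math.ceil(test_size * n_total)   # 1080.4 rounds up
n_train = n_total - n_test

print(f"train: {n_train}, test: {n_test}")
```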


In [6]:
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Load the base SciBERT model
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=3,  # CS, IS, IT
    id2label=id2label,
    label2id=label2id
)

# Configure LoRA
lora_config = LoraConfig(
    task_type="SEQ_CLS",  # Sequence Classification
    r=16,  # Low-rank dimension
    lora_alpha=32,  # LoRA scaling parameter
    lora_dropout=0.1,
    target_modules=["query", "value"]  # Apply LoRA to attention layers
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'NoneType' object has no attribute 'cadam32bit_grad_fp32'
trainable params: 592,131 || all params: 110,512,902 || trainable%: 0.5358025979627248


  warn("The installed version of bitsandbytes was compiled without GPU support. "
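
The trainable-parameter count reported above can be reproduced by hand. The sketch below assumes the BERT-base shape that SciBERT uses (hidden size 768, 12 encoder layers) and assumes that, for the `SEQ_CLS` task type, the classification head stays trainable alongside the LoRA matrices; neither is stated in the notebook, but together they match the printed total.

```python
hidden_size = 768          # assumed BERT-base hidden size
num_layers = 12            # assumed BERT-base encoder depth
r = 16                     # LoRA rank from lora_config
lora_alpha = 32            # LoRA alpha from lora_config
targets_per_layer = 2      # "query" and "value"
num_labels = 3             # CS, IS, IT

# Each adapted linear layer gets A (r x hidden) and B (hidden x r).
lora_per_module = r * hidden_size + hidden_size * r
lora_total = lora_per_module * targets_per_layer * num_layers

# Classification head: weight (num_labels x hidden) plus bias.
classifier = hidden_size * num_labels + num_labels

trainable = lora_total + classifier
all_params = 110_512_902   # taken from the output above
scaling = lora_alpha / r   # factor applied to the low-rank update

print(f"LoRA params:       {lora_total:,}")
print(f"classifier params: {classifier:,}")
print(f"trainable:         {trainable:,} ({100 * trainable / all_params:.4f}%)")
print(f"LoRA scaling:      {scaling}")
```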


In [7]:
from torch.utils.data import DataLoader

# Set format for PyTorch
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)
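
With a batch size of 16 and the default behavior of keeping the final partial batch, the number of batches per pass follows directly from the split sizes. A short check (the 271 training batches match the progress bar shown at the end of the notebook):

```python
import math

batch_size = 16
n_train, n_test = 4321, 1081   # split sizes printed earlier

train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(f"train batches per epoch: {train_batches}")
print(f"test batches:            {test_batches}")
print(f"size of last train batch: {last_train_batch}")
```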

In [8]:
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
from tqdm import tqdm

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")

# Training hyperparameters
num_epochs = 3
learning_rate = 5e-5
weight_decay = 0.01

# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
num_training_steps = len(train_dataloader) * num_epochs
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=num_training_steps)

# Loss function: the model computes cross-entropy internally when `labels` is
# passed, so `criterion` is defined for reference only and is not used in the loop.
criterion = nn.CrossEntropyLoss()

Using device: cpu
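
The scheduler configured above decays the learning rate linearly from its initial value to 10% of it over all training steps, with no warmup phase. The sketch below re-implements that linear interpolation in plain Python to show the resulting values; `linear_lr` is a hypothetical helper, and the step count assumes 271 batches per epoch for 3 epochs.

```python
def linear_lr(step, base_lr=5e-5, start_factor=1.0, end_factor=0.1, total_iters=813):
    """Learning rate after `step` scheduler steps; constant once total_iters is reached."""
    progress = min(step, total_iters) / total_iters
    return base_lr * (start_factor + (end_factor - start_factor) * progress)


steps_per_epoch = 271
for epochs_done in range(4):
    step = epochs_done * steps_per_epoch
    print(f"after {step:>3} steps: lr = {linear_lr(step):.3e}")
```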


In [9]:
# Training function
def train_epoch(model, dataloader, optimizer, scheduler, device):
    model.train()
    total_loss = 0
    predictions = []
    true_labels = []
    
    progress_bar = tqdm(dataloader, desc="Training")
    
    for batch in progress_bar:
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        
        # Backward pass
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        # Accumulate loss and predictions
        total_loss += loss.item()
        predictions.extend(torch.argmax(logits, dim=-1).cpu().numpy())
        true_labels.extend(labels.cpu().numpy())
        
        # Update progress bar
        progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='weighted')
    
    return avg_loss, accuracy, f1

# Evaluation function
def evaluate(model, dataloader, device):
    model.eval()
    total_loss = 0
    predictions = []
    true_labels = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits
            
            total_loss += loss.item()
            predictions.extend(torch.argmax(logits, dim=-1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
    
    avg_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='weighted')
    
    return avg_loss, accuracy, f1, predictions, true_labels
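
Both functions report F1 with `average='weighted'`, which computes F1 per class and averages the results weighted by how many true examples each class has. A minimal pure-Python sketch of that calculation on made-up labels (`weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1
    return score


# Made-up example: three CS (0), one IS (1), one IT (2); one CS misread as IS.
y_true = [0, 0, 0, 1, 2]
y_pred = [0, 0, 1, 1, 2]
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

Because the weights follow class frequency, a large class can dominate this score; macro averaging is the usual alternative when minority classes matter equally.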

In [None]:
# Training loop
print("Starting training...")
train_losses = []
train_accuracies = []
train_f1s = []

for epoch in range(num_epochs):
    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    print("-" * 50)
    
    # Train
    train_loss, train_acc, train_f1 = train_epoch(model, train_dataloader, optimizer, scheduler, device)
    
    print(f"Train Loss: {train_loss:.4f}")
    print(f"Train Accuracy: {train_acc:.4f}")
    print(f"Train F1-Score: {train_f1:.4f}")
    
    # Store metrics
    train_losses.append(train_loss)
    train_accuracies.append(train_acc)
    train_f1s.append(train_f1)

print("\nTraining completed!")

Starting training...

Epoch 1/3
--------------------------------------------------


Training:  18%|███              | 48/271 [50:43<1:22:07, 22.10s/it, loss=1.0416]
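
The progress bar above gives enough information to estimate the total CPU training time. This is a rough back-of-the-envelope sketch that assumes the 22.10 s/it rate stays constant, which it may not:

```python
steps_per_epoch = 271      # from the progress bar
steps_done = 48            # from the progress bar
seconds_per_step = 22.10   # from the progress bar
num_epochs = 3


def fmt(seconds):
    """Format a duration in seconds as h:mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


remaining_s = (steps_per_epoch - steps_done) * seconds_per_step
epoch_s = steps_per_epoch * seconds_per_step
total_s = epoch_s * num_epochs

print(f"remaining in this epoch: {fmt(remaining_s)}")
print(f"one full epoch:          {fmt(epoch_s)}")
print(f"all {num_epochs} epochs:            {fmt(total_s)}")
```

At this rate the three epochs take roughly five hours on CPU, so a GPU runtime or a shorter `max_length` is worth considering before a full run.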