In [37]:
import json

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)

# Overview

This is the first of three notebooks for this project.

1. **Model building and testing using SciBERT**
2. Trend analysis
3. Author network analysis

# Warning

**This notebook takes approximately 5 days to run (at least on a MacBook Pro M1 with 64 GB of memory). Use caution when executing.**

# Conclusions and comments

The goal was to produce a model that could predict the paper categories with >90% accuracy. None of the models tested met this goal. In fact, given that it took about a week to fine-tune the model and tune the hyperparameters, the gain was probably not worth the effort.

Reasons for this likely include:

* Class imbalance: model performance for the 3 largest classes was ~80%, 93% and 94%. For many of the smaller classes, it was quite poor. 
* Overlap between categories: based on the category descriptions, some of these categories likely overlap.
    + One approach could be to train on the category names rather than the category codes
    + Another approach could be to lump some of the categories together. For example, is there really a distinction between cs.LG (Machine Learning) and stat.ML (Machine Learning (statistics))?
* This data set only has the title and abstract available, yet SciBERT was pretrained on full-text papers. Including the full text in the data set would likely improve model performance.
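
The "lump categories together" idea could be prototyped with a relabelling step before the split. This is a minimal sketch, assuming `stat.ML` is folded into `cs.LG`. The `MERGE_MAP` dictionary and the `merge_categories` helper are hypothetical names, and the toy series stands in for `df_model['category_code']`; none of this was run as part of the results reported here.

```python
import pandas as pd

# Hypothetical mapping: which codes to fold together is a judgment call,
# not something this analysis settled.
MERGE_MAP = {"stat.ML": "cs.LG"}


def merge_categories(codes: pd.Series, merge_map: dict) -> pd.Series:
    """Return a copy of `codes` with overlapping categories folded together."""
    return codes.replace(merge_map)


# Toy data standing in for df_model['category_code']
toy_codes = pd.Series(["cs.LG", "stat.ML", "cs.AI", "stat.ML", "cs.CV"])
merged = merge_categories(toy_codes, MERGE_MAP)

# Compare class sizes before and after merging
print(toy_codes.value_counts())
print(merged.value_counts())
```

Merging has to happen before the label mapping is built, so that the number of labels passed to the model matches the merged set.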

# Overview

SciBERT is a BERT-based model for analysing scientific text. The paper describing the model is at arxiv.org/abs/1903.10676, and the model itself is available at https://github.com/allenai/scibert?tab=readme-ov-file and on Hugging Face.

For this project I:
* Loaded the cleaned dataset from my EDA notebook: https://github.com/deannachurch/Springboard/blob/main/Capstone3/notebooks/EDA.ipynb
* Filtered out category codes with fewer than 5 rows (perhaps I should have been more aggressive)
* Split the data into train, test and validation sets
* Tokenized the data using the SciBERT tokenizer
* Created a SciBERT model using the Hugging Face library
* Trained the model using the training set
* Tuned the hyperparameters using the training set
* Evaluated the model using the test set
* Evaluated all three models (original SciBERT, tuned SciBERT, tuned SciBERT + tuned hyperparameters) using the validation set
* Saved the model
* Compared results of the three models to determine whether class imbalance was an issue in model performance.
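
One way to obtain the three sets listed above is two successive stratified splits. This is a minimal sketch on toy data: the `three_way_split` helper, the 70/10/20 proportions and the toy frame are assumptions for illustration, not the exact code used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, label_col, test_frac=0.2, val_frac=0.1, seed=42):
    """Split df into train/validation/test while preserving label proportions."""
    # Use absolute row counts so the second split is not affected by
    # rescaling a fraction of a fraction.
    n_test = round(len(df) * test_frac)
    n_val = round(len(df) * val_frac)

    train_val, test = train_test_split(
        df, test_size=n_test, random_state=seed, stratify=df[label_col]
    )
    train, val = train_test_split(
        train_val, test_size=n_val, random_state=seed, stratify=train_val[label_col]
    )
    return train, val, test


# Toy frame standing in for df_model_filtered
toy = pd.DataFrame({
    "text": [f"paper {i}" for i in range(100)],
    "category_code": ["cs.LG"] * 60 + ["cs.AI"] * 40,
})

train, val, test = three_way_split(toy, "category_code")
print(len(train), len(val), len(test))
```

Stratifying twice needs every class to have enough rows to appear in each split, so a rare-category threshold of 5 may be too low for this scheme.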

# Model building and testing with SciBERT

In [2]:
df= pd.read_parquet('../data/processed/arXiv_scientific_dataset_final.parquet')
display(df)

Unnamed: 0,id,title,category,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count,...,title_count,author_count_boxcox,title_count_sqrt,published_year,published_quarter,published_month,updated_year,updated_quarter,updated_month,year_period
0,cs-9308101v1,Dynamic Backtracking,Artificial Intelligence,cs.AI,1993-08-01,1993-08-01,['M. L. Ginsberg'],'M. L. Ginsberg',Because of their occasional need to return to ...,79,...,2,0.000000,1.414214,1993,1993Q3,1993-08,1993,1993Q3,1993-08,1990s
1,cs-9308102v1,A Market-Oriented Programming Environment and ...,Artificial Intelligence,cs.AI,1993-08-01,1993-08-01,['M. P. Wellman'],'M. P. Wellman',Market price systems constitute a well-underst...,119,...,12,0.000000,3.464102,1993,1993Q3,1993-08,1993,1993Q3,1993-08,1990s
2,cs-9309101v1,An Empirical Analysis of Search in GSAT,Artificial Intelligence,cs.AI,1993-09-01,1993-09-01,"['I. P. Gent', 'T. Walsh']",'I. P. Gent',We describe an extensive study of search in GS...,167,...,7,0.715010,2.645751,1993,1993Q3,1993-09,1993,1993Q3,1993-09,1990s
3,cs-9311101v1,The Difficulties of Learning Logic Programs wi...,Artificial Intelligence,cs.AI,1993-11-01,1993-11-01,"['F. Bergadano', 'D. Gunetti', 'U. Trinchero']",'F. Bergadano',As real logic programmers normally use cut (!)...,174,...,8,1.154208,2.828427,1993,1993Q4,1993-11,1993,1993Q4,1993-11,1990s
4,cs-9311102v1,Software Agents: Completing Patterns and Const...,Artificial Intelligence,cs.AI,1993-11-01,1993-11-01,"['J. C. Schlimmer', 'L. A. Hermens']",'J. C. Schlimmer',To support the goal of allowing users to recor...,187,...,8,0.715010,2.828427,1993,1993Q4,1993-11,1993,1993Q4,1993-11,1990s
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112516,abs-2501.18184v1,Genetic Algorithm with Border Trades (GAB),Machine Learning,cs.LG,2025-01-30,2025-01-30,['Qingchuan Lyu'],'Qingchuan Lyu',This paper introduces a novel approach to impr...,74,...,6,0.000000,2.449490,2025,2025Q1,2025-01,2025,2025Q1,2025-01,2020s
112517,abs-2501.18280v1,Jailbreaking LLMs' Safeguard with Universal Ma...,Computation and Language (Natural Language Pro...,cs.CL,2025-01-30,2025-01-30,"['Haoyu Liang', 'Youran Sun', 'Yunfeng Cai', '...",'Haoyu Liang',The security issue of large language models (L...,150,...,11,1.730617,3.316625,2025,2025Q1,2025-01,2025,2025Q1,2025-01,2020s
108722,abs-2405.20132v4,LLaMEA: A Large Language Model Evolutionary Al...,Neural and Evolutionary Computing,cs.NE,2024-05-30,2025-01-30,"['Niki van Stein', 'Thomas Bäck']",'Niki van Stein',Large Language Models (LLMs) such as GPT-4 hav...,177,...,11,0.715010,3.316625,2024,2024Q2,2024-05,2025,2025Q1,2025-01,2020s
112519,abs-2501.18504v1,CLEAR: Cue Learning using Evolution for Accura...,Computer Vision and Pattern Recognition,cs.CV,2025-01-30,2025-01-30,"['Peter J. Bentley', 'Soo Ling Lim', 'Fuyuki I...",'Peter J. Bentley',Large Language Model (LLM) image recognition i...,170,...,13,1.154208,3.605551,2025,2025Q1,2025-01,2025,2025Q1,2025-01,2020s


In [3]:
# we only need a subset of columns for training
# specifically removing category- we will use category code as the target, but category is highly correlated with category_code

cols_to_keep=['id', 'title', 'category_code', 'published_date', 'updated_date', 'authors', 'first_author', 'summary', 'summary_word_count', 'author_count_boxcox', 'title_count_sqrt']
df_model=df[cols_to_keep]
display(df_model.head())

Unnamed: 0,id,title,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count,author_count_boxcox,title_count_sqrt
0,cs-9308101v1,Dynamic Backtracking,cs.AI,1993-08-01,1993-08-01,['M. L. Ginsberg'],'M. L. Ginsberg',Because of their occasional need to return to ...,79,0.0,1.414214
1,cs-9308102v1,A Market-Oriented Programming Environment and ...,cs.AI,1993-08-01,1993-08-01,['M. P. Wellman'],'M. P. Wellman',Market price systems constitute a well-underst...,119,0.0,3.464102
2,cs-9309101v1,An Empirical Analysis of Search in GSAT,cs.AI,1993-09-01,1993-09-01,"['I. P. Gent', 'T. Walsh']",'I. P. Gent',We describe an extensive study of search in GS...,167,0.71501,2.645751
3,cs-9311101v1,The Difficulties of Learning Logic Programs wi...,cs.AI,1993-11-01,1993-11-01,"['F. Bergadano', 'D. Gunetti', 'U. Trinchero']",'F. Bergadano',As real logic programmers normally use cut (!)...,174,1.154208,2.828427
4,cs-9311102v1,Software Agents: Completing Patterns and Const...,cs.AI,1993-11-01,1993-11-01,"['J. C. Schlimmer', 'L. A. Hermens']",'J. C. Schlimmer',To support the goal of allowing users to recor...,187,0.71501,2.828427


# Need to remove some rows

When I first tried the stratified train/test split, I got an error because some category codes appear too few times. These rare categories need to be removed before the split.


In [4]:
# Count samples per category
category_counts = df_model['category_code'].value_counts()
print(f"{len(category_counts[category_counts < 5])} with category counts less than 5")
print("Categories with very few samples:")
print(category_counts[category_counts < 5])

34 with category counts less than 5
Categories with very few samples:
category_code
nucl-th              4
physics.acc-ph       4
cs.SC                4
cond-mat.str-el      4
math.AG              4
cs.GL                4
astro-ph.EP          3
astro-ph.GA          3
physics.ed-ph        3
cs.OS                3
math.GT              3
astro-ph.HE          3
econ.GN              3
astro-ph             3
physics.space-ph     3
math.CT              3
math.MG              3
physics.gen-ph       3
q-bio.OT             2
q-bio.CB             2
math-ph              2
math.GR              2
math.RT              2
q-fin.EC             2
math.SP              2
nlin.PS              1
physics.hist-ph      1
math.GM              1
math.NT              1
nucl-ex              1
cond-mat.supr-con    1
q-bio.TO             1
physics.class-ph     1
math.CV              1
Name: count, dtype: int64


In [5]:
print(f"Dataframe shape before filtering: {df_model.shape}")
rare_categories = category_counts[category_counts < 5].index
df_model_filtered = df_model[~df_model['category_code'].isin(rare_categories)]
print(f"Dataframe shape after filtering: {df_model_filtered.shape}")

Dataframe shape before filtering: (136160, 11)
Dataframe shape after filtering: (136077, 11)


In [6]:
category_counts = df_model_filtered['category_code'].value_counts()
print(f"{len(category_counts[category_counts < 5])} with category counts less than 5")
print("Categories with very few samples:")
print(category_counts[category_counts < 5])

0 with category counts less than 5
Categories with very few samples:
Series([], Name: count, dtype: int64)


In [7]:
# Split into training and testing dataframes (80/20 split)
train_df, test_df = train_test_split(
    df_model_filtered, 
    test_size=0.2, 
    random_state=42, 
    stratify=df_model_filtered['category_code']  # Preserve each category's proportion in both splits
)

In [8]:
# Create standard variables
y_train = train_df['category_code']
y_test = test_df['category_code']
X_train = train_df.drop(columns=['category_code'])
X_test = test_df.drop(columns=['category_code'])

In [9]:
#Check shapes
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (108861, 10)
y_train shape: (108861,)
X_test shape: (27216, 10)
y_test shape: (27216,)


# Start training and tuning hyperparameters with SciBERT

In [10]:
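# Combine title and summary for richer context.
# Work on explicit copies so adding a column cannot trigger pandas' SettingWithCopyWarning.
train_df = train_df.copy()
test_df = test_df.copy()
train_df['text'] = train_df['title'] + " " + train_df['summary'].fillna("")
test_df['text'] = test_df['title'] + " " + test_df['summary'].fillna("")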

In [11]:
# load the sciBert tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

In [12]:
#Tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

In [13]:
#Define metrics for evaluation
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [14]:
# Convert to HuggingFace Datasets format
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [15]:
# Apply tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)


In [16]:
# Get unique categories and create a label mapping
unique_categories = train_df['category_code'].unique()
label_to_id = {label: i for i, label in enumerate(unique_categories)}
id_to_label = {i: label for label, i in label_to_id.items()}

In [17]:
# Map category names to IDs
tokenized_train = tokenized_train.map(lambda x: {'label': label_to_id[x['category_code']]})
tokenized_test = tokenized_test.map(lambda x: {'label': label_to_id[x['category_code']]})


In [19]:
# Define training arguments (these hyperparameters can be tuned)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
)

In [20]:
# Initialize the model
num_labels = len(unique_categories)
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", 
    num_labels=num_labels
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [22]:
#Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

In [23]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8418,0.830313,0.748457,0.725991,0.720663,0.748457
2,0.6807,0.831189,0.754666,0.732556,0.729094,0.754666
3,0.5656,0.903743,0.75463,0.746296,0.74113,0.75463


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=40824, training_loss=0.7526785128815792, metrics={'train_runtime': 37465.4341, 'train_samples_per_second': 8.717, 'train_steps_per_second': 1.09, 'total_flos': 8.600706324317491e+16, 'train_loss': 0.7526785128815792, 'epoch': 3.0})

In [24]:
#Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

Evaluation results: {'eval_loss': 0.9037431478500366, 'eval_accuracy': 0.7546296296296297, 'eval_f1': 0.7462958656061841, 'eval_precision': 0.7411295148116699, 'eval_recall': 0.7546296296296297, 'eval_runtime': 836.2316, 'eval_samples_per_second': 32.546, 'eval_steps_per_second': 2.034, 'epoch': 3.0}




In [28]:
# Save the model and tokenizer to a directory
output_dir = "../models/first_train_saved_scibert_model"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Optionally, save the training arguments
with open(f"{output_dir}/training_args.json", 'w') as f:
    json.dump(trainer.args.to_dict(), f)

print(f"Model successfully saved to {output_dir}")

Model successfully saved to ../models/first_train_saved_scibert_model
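
A note on the trial values reported by the hyperparameter search that follows: no `compute_objective` is passed to `hyperparameter_search`, so the `Trainer` falls back to its default objective. When `compute_metrics` is defined, that default is, to the best of my reading of the transformers source, the sum of the returned metrics after dropping the loss, the epoch and the speed statistics. The sketch below re-implements that rule and applies it to the final-epoch metrics of Trial 1 from the log. `sum_of_metrics_objective` and `f1_objective` are my own hypothetical helpers, not library functions.

```python
def sum_of_metrics_objective(metrics: dict) -> float:
    """Sum the quality metrics, ignoring loss, epoch and speed statistics.

    Assumption: this mirrors the Trainer's default objective when
    compute_metrics returns extra metrics; it is not the library code itself.
    """
    ignored_keys = {"eval_loss", "epoch"}
    ignored_suffixes = ("_runtime", "_per_second")
    kept = [
        value
        for key, value in metrics.items()
        if key not in ignored_keys and not key.endswith(ignored_suffixes)
    ]
    return sum(kept)


def f1_objective(metrics: dict) -> float:
    """Alternative objective: optimise weighted F1 only."""
    return metrics["eval_f1"]


# Final-epoch validation metrics of Trial 1, copied from the search log
trial_1 = {
    "eval_loss": 0.7576424479484558,
    "eval_accuracy": 0.7718988830099941,
    "eval_f1": 0.7586121839584354,
    "eval_precision": 0.7530633566918324,
    "eval_recall": 0.7718988830099941,
    "eval_runtime": 807.7761,
    "eval_samples_per_second": 33.693,
    "eval_steps_per_second": 2.106,
    "epoch": 2.0,
}

print(sum_of_metrics_objective(trial_1))  # compare with the "value" logged for Trial 1
print(f1_objective(trial_1))
```

This explains why the logged trial values are close to 3 rather than between 0 and 1. If a single metric should drive the search, a function such as `f1_objective` can be passed as `compute_objective=f1_objective` to `hyperparameter_search`.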


In [29]:
# Tune hyperparameters
class PrinterCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics is not None:
            print(f"Validation metrics: {metrics}")

# Build a fresh model (with a newly initialized classification head) for every trial
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "allenai/scibert_scivocab_uncased", 
        num_labels=len(unique_categories)
    )

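# Candidate values considered for the search. This dict is kept for reference only:
# the space actually searched is the hp_space lambda passed to hyperparameter_search below.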
hyperparameter_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [4, 8],
    "num_train_epochs": [2, 3, 4],
    "weight_decay": [0.01, 0.1]
}

# Create trainer for hyperparameter search
hp_trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    processing_class=tokenizer,  # same argument as the first Trainer; `tokenizer=` is deprecated
    compute_metrics=compute_metrics,
    callbacks=[PrinterCallback(), EarlyStoppingCallback(early_stopping_patience=2)]
)

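# Run hyperparameter search.
# No compute_objective is passed, so each trial's value is the sum of the metrics
# returned by compute_metrics (accuracy + F1 + precision + recall).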
best_run = hp_trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    n_trials=10,
    hp_space=lambda trial: {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
        "weight_decay": trial.suggest_float("weight_decay", 0.01, 0.1, log=True),
    }
)

print(f"Best hyperparameters: {best_run.hyperparameters}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-04-06 16:57:16,989] A new study created in memory with name: no-name-a07ac6e0-49e8-415f-905a-098cbd0b2d8d
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8363,0.79413,0.754078,0.736869,0.7347,0.754078
2,0.6212,0.806271,0.757569,0.734513,0.734911,0.757569
3,0.5391,0.865173,0.756687,0.749345,0.746015,0.756687
4,0.4537,1.115623,0.751984,0.745286,0.74101,0.751984


Validation metrics: {'eval_loss': 0.7941299676895142, 'eval_accuracy': 0.7540784832451499, 'eval_f1': 0.736869050750117, 'eval_precision': 0.7347002448356736, 'eval_recall': 0.7540784832451499, 'eval_runtime': 789.3692, 'eval_samples_per_second': 34.478, 'eval_steps_per_second': 2.155, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.8062713742256165, 'eval_accuracy': 0.7575690770135215, 'eval_f1': 0.7345128492493839, 'eval_precision': 0.7349105125058728, 'eval_recall': 0.7575690770135215, 'eval_runtime': 811.768, 'eval_samples_per_second': 33.527, 'eval_steps_per_second': 2.095, 'epoch': 2.0}




Validation metrics: {'eval_loss': 0.8651725053787231, 'eval_accuracy': 0.7566872427983539, 'eval_f1': 0.7493454107045417, 'eval_precision': 0.7460149432419484, 'eval_recall': 0.7566872427983539, 'eval_runtime': 814.3765, 'eval_samples_per_second': 33.419, 'eval_steps_per_second': 2.089, 'epoch': 3.0}




Validation metrics: {'eval_loss': 1.1156232357025146, 'eval_accuracy': 0.751984126984127, 'eval_f1': 0.7452861248046245, 'eval_precision': 0.7410101849091518, 'eval_recall': 0.751984126984127, 'eval_runtime': 784.2863, 'eval_samples_per_second': 34.702, 'eval_steps_per_second': 2.169, 'epoch': 4.0}


[I 2025-04-07 06:31:29,955] Trial 0 finished with value: 2.99026456368203 and parameters: {'learning_rate': 2.9436025690262455e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 4, 'weight_decay': 0.0115531419876304}. Best is trial 0 with value: 2.99026456368203.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.7369,0.778948,0.761574,0.745257,0.736715,0.761574
2,0.5389,0.757642,0.771899,0.758612,0.753063,0.771899


Validation metrics: {'eval_loss': 0.7789477705955505, 'eval_accuracy': 0.7615740740740741, 'eval_f1': 0.7452567060670364, 'eval_precision': 0.7367148960585395, 'eval_recall': 0.7615740740740741, 'eval_runtime': 1043.012, 'eval_samples_per_second': 26.094, 'eval_steps_per_second': 1.631, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.7576424479484558, 'eval_accuracy': 0.7718988830099941, 'eval_f1': 0.7586121839584354, 'eval_precision': 0.7530633566918324, 'eval_recall': 0.7718988830099941, 'eval_runtime': 807.7761, 'eval_samples_per_second': 33.693, 'eval_steps_per_second': 2.106, 'epoch': 2.0}


[I 2025-04-07 13:22:42,486] Trial 1 finished with value: 3.0554733066702564 and parameters: {'learning_rate': 1.542075482849585e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 2, 'weight_decay': 0.021889860653894203}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8246,0.802475,0.751066,0.731251,0.727315,0.751066
2,0.6496,0.811515,0.757716,0.734622,0.733394,0.757716
3,0.5517,0.87181,0.756504,0.748673,0.74688,0.756504
4,0.506,1.106066,0.750992,0.744227,0.740424,0.750992


Validation metrics: {'eval_loss': 0.8024747371673584, 'eval_accuracy': 0.7510655496766608, 'eval_f1': 0.7312506535537816, 'eval_precision': 0.7273153159677715, 'eval_recall': 0.7510655496766608, 'eval_runtime': 1098.3513, 'eval_samples_per_second': 24.779, 'eval_steps_per_second': 1.549, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.8115150928497314, 'eval_accuracy': 0.7577160493827161, 'eval_f1': 0.7346222724184841, 'eval_precision': 0.7333942342937588, 'eval_recall': 0.7577160493827161, 'eval_runtime': 810.1876, 'eval_samples_per_second': 33.592, 'eval_steps_per_second': 2.1, 'epoch': 2.0}




Validation metrics: {'eval_loss': 0.8718098998069763, 'eval_accuracy': 0.7565035273368607, 'eval_f1': 0.7486726387370474, 'eval_precision': 0.7468795007115948, 'eval_recall': 0.7565035273368607, 'eval_runtime': 790.4627, 'eval_samples_per_second': 34.43, 'eval_steps_per_second': 2.152, 'epoch': 3.0}




Validation metrics: {'eval_loss': 1.1060656309127808, 'eval_accuracy': 0.7509920634920635, 'eval_f1': 0.7442272671055937, 'eval_precision': 0.7404237301907711, 'eval_recall': 0.7509920634920635, 'eval_runtime': 790.5939, 'eval_samples_per_second': 34.425, 'eval_steps_per_second': 2.152, 'epoch': 4.0}


[I 2025-04-08 03:15:53,393] Trial 2 finished with value: 2.9866351242804923 and parameters: {'learning_rate': 2.9431471855670324e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 4, 'weight_decay': 0.04554799408161034}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.1238,0.971706,0.729314,0.700931,0.699719,0.729314
2,1.1915,1.077697,0.738463,0.717561,0.710984,0.738463
3,0.7004,1.20803,0.740447,0.733333,0.731122,0.740447
4,0.6282,1.457963,0.737875,0.72994,0.724783,0.737875


Validation metrics: {'eval_loss': 0.9717056751251221, 'eval_accuracy': 0.7293136390358612, 'eval_f1': 0.700931251394988, 'eval_precision': 0.6997186580715208, 'eval_recall': 0.7293136390358612, 'eval_runtime': 790.0155, 'eval_samples_per_second': 34.45, 'eval_steps_per_second': 2.153, 'epoch': 1.0}




Validation metrics: {'eval_loss': 1.0776968002319336, 'eval_accuracy': 0.7384626690182245, 'eval_f1': 0.7175609417712006, 'eval_precision': 0.7109835813211023, 'eval_recall': 0.7384626690182245, 'eval_runtime': 806.427, 'eval_samples_per_second': 33.749, 'eval_steps_per_second': 2.109, 'epoch': 2.0}




Validation metrics: {'eval_loss': 1.2080297470092773, 'eval_accuracy': 0.7404467960023515, 'eval_f1': 0.7333331690069669, 'eval_precision': 0.7311219086875841, 'eval_recall': 0.7404467960023515, 'eval_runtime': 843.0027, 'eval_samples_per_second': 32.285, 'eval_steps_per_second': 2.018, 'epoch': 3.0}




Validation metrics: {'eval_loss': 1.4579625129699707, 'eval_accuracy': 0.7378747795414462, 'eval_f1': 0.7299399077266049, 'eval_precision': 0.7247828005185604, 'eval_recall': 0.7378747795414462, 'eval_runtime': 860.8631, 'eval_samples_per_second': 31.615, 'eval_steps_per_second': 1.976, 'epoch': 4.0}


[I 2025-04-08 20:08:02,282] Trial 3 finished with value: 2.9304722673280574 and parameters: {'learning_rate': 4.6304175973525386e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 4, 'weight_decay': 0.02753754920538231}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.761,0.845557,0.748383,0.726982,0.722209,0.748383
2,0.6983,0.816681,0.759223,0.737433,0.732961,0.759223
3,0.4627,0.892972,0.754152,0.745994,0.741016,0.754152


Validation metrics: {'eval_loss': 0.8455565571784973, 'eval_accuracy': 0.7483833039388595, 'eval_f1': 0.7269817669708576, 'eval_precision': 0.7222088491473485, 'eval_recall': 0.7483833039388595, 'eval_runtime': 789.3687, 'eval_samples_per_second': 34.478, 'eval_steps_per_second': 2.155, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.8166805505752563, 'eval_accuracy': 0.7592225161669606, 'eval_f1': 0.7374326891889951, 'eval_precision': 0.7329613745934893, 'eval_recall': 0.7592225161669606, 'eval_runtime': 789.1812, 'eval_samples_per_second': 34.486, 'eval_steps_per_second': 2.155, 'epoch': 2.0}




Validation metrics: {'eval_loss': 0.8929721117019653, 'eval_accuracy': 0.7541519694297472, 'eval_f1': 0.7459944927327967, 'eval_precision': 0.7410161547646051, 'eval_recall': 0.7541519694297472, 'eval_runtime': 791.8495, 'eval_samples_per_second': 34.37, 'eval_steps_per_second': 2.148, 'epoch': 3.0}


[I 2025-04-09 06:16:45,166] Trial 4 finished with value: 2.995314586356896 and parameters: {'learning_rate': 4.954751741566973e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.06483969562767664}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8139,0.79113,0.757055,0.739398,0.734341,0.757055
2,0.6018,0.775994,0.766865,0.748047,0.745362,0.766865
3,0.548,0.842794,0.764293,0.75658,0.751561,0.764293


Validation metrics: {'eval_loss': 0.7911298274993896, 'eval_accuracy': 0.7570546737213404, 'eval_f1': 0.7393980773002204, 'eval_precision': 0.7343410448904039, 'eval_recall': 0.7570546737213404, 'eval_runtime': 794.9418, 'eval_samples_per_second': 34.236, 'eval_steps_per_second': 2.14, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.7759935855865479, 'eval_accuracy': 0.7668650793650794, 'eval_f1': 0.7480474999480715, 'eval_precision': 0.7453624482244374, 'eval_recall': 0.7668650793650794, 'eval_runtime': 847.2125, 'eval_samples_per_second': 32.124, 'eval_steps_per_second': 2.008, 'epoch': 2.0}




Validation metrics: {'eval_loss': 0.8427939414978027, 'eval_accuracy': 0.764293062904174, 'eval_f1': 0.7565804274710398, 'eval_precision': 0.7515608893806157, 'eval_recall': 0.764293062904174, 'eval_runtime': 829.5346, 'eval_samples_per_second': 32.809, 'eval_steps_per_second': 2.051, 'epoch': 3.0}


[I 2025-04-09 16:45:07,168] Trial 5 finished with value: 3.0367274426600037 and parameters: {'learning_rate': 2.2325150424130563e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 3, 'weight_decay': 0.04643672138786191}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.7599,0.784489,0.760949,0.744398,0.736477,0.760949
2,0.5742,0.758341,0.771384,0.757401,0.751178,0.771384


Validation metrics: {'eval_loss': 0.7844890356063843, 'eval_accuracy': 0.7609494415049971, 'eval_f1': 0.7443978013167994, 'eval_precision': 0.7364765347376072, 'eval_recall': 0.7609494415049971, 'eval_runtime': 799.3559, 'eval_samples_per_second': 34.047, 'eval_steps_per_second': 2.128, 'epoch': 1.0}




Validation metrics: {'eval_loss': 0.7583407759666443, 'eval_accuracy': 0.7713844797178131, 'eval_f1': 0.7574008134035475, 'eval_precision': 0.7511777522956509, 'eval_recall': 0.7713844797178131, 'eval_runtime': 798.6488, 'eval_samples_per_second': 34.078, 'eval_steps_per_second': 2.13, 'epoch': 2.0}


[I 2025-04-09 23:34:34,645] Trial 6 finished with value: 3.0513475251348243 and parameters: {'learning_rate': 1.2280142065595493e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 2, 'weight_decay': 0.02250965129934576}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8,0.811768,0.750882,0.731518,0.728104,0.750882


Validation metrics: {'eval_loss': 0.8117676973342896, 'eval_accuracy': 0.7508818342151675, 'eval_f1': 0.7315178111941577, 'eval_precision': 0.7281040253776916, 'eval_recall': 0.7508818342151675, 'eval_runtime': 798.2984, 'eval_samples_per_second': 34.093, 'eval_steps_per_second': 2.131, 'epoch': 1.0}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[I 2025-04-10 02:57:52,726] Trial 7 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.7521,0.789716,0.757716,0.73815,0.73625,0.757716
2,0.5618,0.764351,0.768849,0.756613,0.751465,0.768849


Validation metrics: {'eval_loss': 0.7897156476974487, 'eval_accuracy': 0.7577160493827161, 'eval_f1': 0.7381497464503939, 'eval_precision': 0.7362496240023345, 'eval_recall': 0.7577160493827161, 'eval_runtime': 797.7648, 'eval_samples_per_second': 34.115, 'eval_steps_per_second': 2.132, 'epoch': 1.0}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Validation metrics: {'eval_loss': 0.7643511891365051, 'eval_accuracy': 0.7688492063492064, 'eval_f1': 0.7566126083262934, 'eval_precision': 0.751465096022372, 'eval_recall': 0.7688492063492064, 'eval_runtime': 794.6541, 'eval_samples_per_second': 34.249, 'eval_steps_per_second': 2.141, 'epoch': 2.0}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[I 2025-04-10 09:52:51,458] Trial 8 finished with value: 3.0457761170470783 and parameters: {'learning_rate': 3.695497949539519e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 2, 'weight_decay': 0.010931181691060046}. Best is trial 1 with value: 3.0554733066702564.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8235,0.818901,0.746105,0.728277,0.723108,0.746105


Validation metrics: {'eval_loss': 0.8189014792442322, 'eval_accuracy': 0.7461052322163433, 'eval_f1': 0.7282771559063943, 'eval_precision': 0.7231075498302315, 'eval_recall': 0.7461052322163433, 'eval_runtime': 826.4159, 'eval_samples_per_second': 32.933, 'eval_steps_per_second': 2.058, 'epoch': 1.0}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[I 2025-04-10 13:20:40,508] Trial 9 pruned. 


Best hyperparameters: {'learning_rate': 1.542075482849585e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 2, 'weight_decay': 0.021889860653894203}


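# Hyperparameter tuning
The Optuna search logged above took 3.8 days to run. The cells below retrain a final model with the best hyperparameters it found.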
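
The `value` Optuna reports for each trial does not look like a single metric. The logged numbers are consistent with the default objective of `Trainer.hyperparameter_search`, which sums the evaluation metrics other than the loss when no `compute_objective` is passed. This is an inference from the log, not something confirmed by the search cell. A quick check against trial 6:

```python
# Final-epoch validation metrics for trial 6, copied from the log above.
trial_6_metrics = {
    "eval_accuracy": 0.7713844797178131,
    "eval_f1": 0.7574008134035475,
    "eval_precision": 0.7511777522956509,
    "eval_recall": 0.7713844797178131,
}
reported_value = 3.0513475251348243  # "Trial 6 finished with value: ..."

# Assumed objective: the plain sum of the four metrics.
objective = sum(trial_6_metrics.values())

print(f"sum of metrics = {objective:.6f}, reported = {reported_value:.6f}")
print("match:", abs(objective - reported_value) < 1e-9)
```

If that reading is right, accuracy is effectively counted twice, because weighted recall equals accuracy for single-label classification (the two columns are identical in every log table). Passing a `compute_objective` that returns only `eval_f1` would align the search target with `metric_for_best_model="f1"`.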

In [31]:
# If you used hp_trainer.hyperparameter_search()
best_hyperparameters = best_run.hyperparameters
print("Best hyperparameters:", best_hyperparameters)

# Create training arguments with the best hyperparameters
best_training_args = TrainingArguments(
    output_dir='./results_final',
    **best_hyperparameters,
    # Add any other arguments not covered in hyperparameter search
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False
)

Best hyperparameters: {'learning_rate': 1.542075482849585e-05, 'per_device_train_batch_size': 8, 'num_train_epochs': 2, 'weight_decay': 0.021889860653894203}


In [32]:
# Initialize the model with best hyperparameters
final_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", 
    num_labels=len(unique_categories)
)

# Create the final trainer
# Note: tokenized_test is also used for per-epoch evaluation and best-checkpoint selection
final_trainer = Trainer(
    model=final_model,
    args=best_training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Train the final model
final_trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.8119,0.777088,0.761868,0.745437,0.738134,0.761868
2,0.6596,0.757403,0.770025,0.756649,0.750831,0.770025


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=27216, training_loss=0.8034150927014942, metrics={'train_runtime': 25415.1735, 'train_samples_per_second': 8.567, 'train_steps_per_second': 1.071, 'total_flos': 5.733804216211661e+16, 'train_loss': 0.8034150927014942, 'epoch': 2.0})

In [33]:
# Save the model
final_output_dir = "../models/final_scibert_model"
final_trainer.save_model(final_output_dir)
tokenizer.save_pretrained(final_output_dir)

# Save the training arguments
import json
with open(f"{final_output_dir}/training_args.json", 'w') as f:
    json.dump(final_trainer.args.to_dict(), f)

print(f"Final model successfully saved to {final_output_dir}")

Final model successfully saved to ../models/final_scibert_model


# Test the models

In [34]:
def predict_category(text, model, tokenizer, id_to_label=None):
    """Predict category for a scientific paper using any BERT model"""
    # Prepare input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Move inputs to the same device as the model
    device = model.device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get prediction (eval mode disables dropout; no_grad skips gradient tracking)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get probabilities and predicted class
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)[0]
    predicted_class_id = outputs.logits.argmax(-1).item()
    
    # If id_to_label mapping is provided, convert ID to category name
    if id_to_label:
        predicted_category = id_to_label[predicted_class_id]
        return predicted_category, probs[predicted_class_id].item()
    else:
        return predicted_class_id, probs[predicted_class_id].item()
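
The confidence returned by `predict_category` is the softmax probability of the winning class. A minimal pure-Python sketch of that step, using made-up logits for a hypothetical three-class problem (the real model produces one logit per category):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes (illustrative values only).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# argmax over the logits picks the same class as argmax over the probabilities.
predicted_class_id = max(range(len(logits)), key=lambda i: logits[i])
confidence = probs[predicted_class_id]

print(predicted_class_id, round(confidence, 3))
```

Softmax confidence is not a calibrated probability. In the results further down, average confidence (0.83 to 0.88) is higher than accuracy (0.75 to 0.77), so the models are somewhat overconfident.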

In [38]:
def evaluate_model(model, test_df, tokenizer, id_to_label):
    predictions = []
    true_labels = []
    confidences = []
    
    for i in range(len(test_df)):
        example = test_df.iloc[i]
        text = example['title'] + " " + (example['summary'] if not pd.isna(example['summary']) else "")
        true_category = example['category_code']
        
        predicted_category, confidence = predict_category(
            text, 
            model, 
            tokenizer, 
            id_to_label
        )
        
        predictions.append(predicted_category)
        true_labels.append(true_category)
        confidences.append(confidence)
    
    # Calculate accuracy
    accuracy = accuracy_score(true_labels, predictions)
    
    # Generate classification report
    report = classification_report(true_labels, predictions, output_dict=True)
    
    # Average confidence
    avg_confidence = sum(confidences) / len(confidences)
    
    return {
        'accuracy': accuracy,
        'classification_report': report,
        'avg_confidence': avg_confidence,
        'predictions': predictions,
        'confidences': confidences
    }

In [39]:
# `model` (defined in an earlier cell) is the original SciBERT model; the other two are reloaded from disk
tuned_model = AutoModelForSequenceClassification.from_pretrained("../models/first_train_saved_scibert_model")
hypertuned_model = AutoModelForSequenceClassification.from_pretrained("../models/final_scibert_model")

In [40]:
# Evaluate all three models
base_results = evaluate_model(model, test_df, tokenizer, id_to_label)
tuned_results = evaluate_model(tuned_model, test_df, tokenizer, id_to_label)
hypertuned_results = evaluate_model(hypertuned_model, test_df, tokenizer, id_to_label)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [41]:
# Compare results
print("Base SciBERT accuracy:", base_results['accuracy'])
print("Tuned SciBERT accuracy:", tuned_results['accuracy'])
print("Hyperparameter-tuned SciBERT accuracy:", hypertuned_results['accuracy'])

print("\nBase SciBERT average confidence:", base_results['avg_confidence'])
print("Tuned SciBERT average confidence:", tuned_results['avg_confidence'])
print("Hyperparameter-tuned SciBERT average confidence:", hypertuned_results['avg_confidence'])

# Compare aggregate F1 scores (macro and weighted averages) for each model
print("\nF1 scores for each model:")
for model_name, results in [("Base", base_results), ("Tuned", tuned_results), ("Hypertuned", hypertuned_results)]:
    macro_f1 = results['classification_report']['macro avg']['f1-score']
    weighted_f1 = results['classification_report']['weighted avg']['f1-score']
    print(f"{model_name} model - Macro F1: {macro_f1:.4f}, Weighted F1: {weighted_f1:.4f}")

Base SciBERT accuracy: 0.7546296296296297
Tuned SciBERT accuracy: 0.7546296296296297
Hyperparameter-tuned SciBERT accuracy: 0.7700249853027631

Base SciBERT average confidence: 0.8765491222927696
Tuned SciBERT average confidence: 0.8765491356807369
Hyperparameter-tuned SciBERT average confidence: 0.8336762553880288

F1 scores for each model:
Base model - Macro F1: 0.2274, Weighted F1: 0.7463
Tuned model - Macro F1: 0.2274, Weighted F1: 0.7463
Hypertuned model - Macro F1: 0.1809, Weighted F1: 0.7566
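
The gap between macro F1 (about 0.2) and weighted F1 (about 0.75) is what class imbalance looks like. Macro F1 averages the per-class scores equally, so many small, poorly predicted categories drag it down, while weighted F1 is dominated by the large classes. A toy illustration in plain Python, with invented per-class counts (not the notebook's data):

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented counts for illustration: one large class and two tiny ones.
# class name -> (tp, fp, fn); support (true examples of the class) = tp + fn
counts = {
    "big":    (900, 19, 100),
    "tiny_a": (0, 0, 10),
    "tiny_b": (1, 100, 9),
}

per_class = {name: f1(*c) for name, c in counts.items()}
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}
total = sum(support.values())

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(per_class.values()) / len(per_class)
weighted_f1 = sum(per_class[n] * support[n] for n in counts) / total

print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```

The same effect explains why the hypertuned model can have the best accuracy and weighted F1 yet the lowest macro F1: it gains on the large classes and loses on the small ones.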


In [48]:
# Create a copy of the test dataframe to avoid modifying the original
results_df = test_df.copy()

# Add predictions from each model
results_df['base_prediction'] = base_results['predictions']
results_df['base_confidence'] = base_results['confidences']

results_df['tuned_prediction'] = tuned_results['predictions']
results_df['tuned_confidence'] = tuned_results['confidences']

results_df['hypertuned_prediction'] = hypertuned_results['predictions']
results_df['hypertuned_confidence'] = hypertuned_results['confidences']

# Add a column to show if predictions match the true label
results_df['base_correct'] = results_df['base_prediction'] == results_df['category_code']
results_df['tuned_correct'] = results_df['tuned_prediction'] == results_df['category_code']
results_df['hypertuned_correct'] = results_df['hypertuned_prediction'] == results_df['category_code']

# Calculate category frequencies in the original training dataset
category_counts = train_df['category_code'].value_counts().to_dict()

# Create a mapping function to add category size
def get_category_size(category):
    return category_counts.get(category, 0)

# Add category size to the results dataframe
results_df['category_size'] = results_df['category_code'].apply(get_category_size)

# Print a sample of the results with category size
display(results_df[['title', 'category_code', 'category_size', 'base_prediction', 'tuned_prediction', 'hypertuned_prediction']].head(50))

# Analyze performance by category size
# Let's create bins of category sizes
results_df['size_bin'] = pd.cut(
    results_df['category_size'], 
    bins=[0, 10, 100, 1000, 10000, float('inf')],
    labels=['Very Small (1-10)', 'Small (11-100)', 'Medium (101-1000)', 'Large (1001-10000)', 'Very Large (10000+)']
)

# Analyze accuracy by category size
size_analysis = results_df.groupby('size_bin', observed=False).agg({
    'base_correct': 'mean',
    'tuned_correct': 'mean',
    'hypertuned_correct': 'mean',
    'category_code': 'count'
}).rename(columns={'category_code': 'count'})

print("\nAccuracy by category size:")
display(size_analysis)

# Detailed analysis by category
category_analysis = results_df.groupby('category_code').agg({
    'base_correct': 'mean',
    'tuned_correct': 'mean',
    'hypertuned_correct': 'mean',
    'category_size': 'first',  # Grab the size for each category
    'category_code': 'count'
}).rename(columns={'category_code': 'test_count'})

# Sort by category size to see performance on small vs. large categories
print("\nAccuracy by category (sorted by size):")
display(category_analysis.sort_values('category_size').head(15))

# Also look at the largest categories
print("\nAccuracy for largest categories:")
display(category_analysis.sort_values('category_size', ascending=False).head(15))

# Find categories where performance varies significantly between models
model_cols = ['base_correct', 'tuned_correct', 'hypertuned_correct']
category_analysis['max_diff'] = (
    category_analysis[model_cols].max(axis=1) - category_analysis[model_cols].min(axis=1)
)

print("\nCategories with largest performance differences between models:")
display(category_analysis.sort_values('max_diff', ascending=False).head(10))

# Save the results to CSV for further analysis
results_df.to_csv('model_comparison_results.csv', index=False)
category_analysis.to_csv('category_performance_analysis.csv')
print("\nFull results saved to CSV files")

index,title,category_code,category_size,base_prediction,tuned_prediction,hypertuned_prediction
101869,A working likelihood approach to support vecto...,cs.LG,31973,stat.ML,stat.ML,stat.ML
7237,A Surprisingly Simple Continuous-Action POMDP ...,cs.AI,10354,cs.AI,cs.AI,cs.AI
108484,Memory-Efficient Reversible Spiking Neural Net...,cs.CV,23240,cs.NE,cs.NE,cs.NE
124544,Some Languages are More Equal than Others: Pro...,cs.CL,20158,cs.CL,cs.CL,cs.CL
51639,Lie Algebrized Gaussians for Image Representation,cs.CV,23240,cs.CV,cs.CV,cs.CV
76005,Doubly Stochastic Variational Inference for De...,stat.ML,8345,stat.ML,stat.ML,stat.ML
93916,Inference with Discriminative Posterior,stat.ML,8345,cs.LG,cs.LG,stat.ML
81504,Weighted Distributed Differential Privacy ERM:...,cs.LG,31973,cs.LG,cs.LG,cs.LG
116818,Detecting Machine-Translated Text using Back T...,cs.CL,20158,cs.CL,cs.CL,cs.CL
18236,Disentangled Contrastive Learning for Social R...,cs.IR,721,cs.IR,cs.IR,cs.IR



Accuracy by category size:




size_bin,base_correct,tuned_correct,hypertuned_correct,count
Very Small (1-10),0.04878,0.04878,0.0,41
Small (11-100),0.181592,0.181592,0.10199,402
Medium (101-1000),0.413761,0.413761,0.411437,2151
Large (1001-10000),0.518971,0.518971,0.541235,3189
Very Large (10000+),0.836001,0.836001,0.854057,21433
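
As a sanity check, the per-bin accuracies in this table should average back to the overall accuracy once each bin is weighted by its test count. A short sketch using the rounded values from the table (so agreement is only expected to about five decimal places):

```python
# (bin label, base accuracy, hypertuned accuracy, test count), copied from the table.
size_bins = [
    ("Very Small (1-10)",   0.04878,  0.0,      41),
    ("Small (11-100)",      0.181592, 0.10199,  402),
    ("Medium (101-1000)",   0.413761, 0.411437, 2151),
    ("Large (1001-10000)",  0.518971, 0.541235, 3189),
    ("Very Large (10000+)", 0.836001, 0.854057, 21433),
]

total = sum(count for *_, count in size_bins)

# Count-weighted mean of the per-bin accuracies.
base_acc = sum(base * count for _, base, _, count in size_bins) / total
hyper_acc = sum(hyper * count for _, _, hyper, count in size_bins) / total

print(f"test examples: {total}")
print(f"base accuracy from bins:       {base_acc:.4f}")
print(f"hypertuned accuracy from bins: {hyper_acc:.4f}")
```

The results agree with the overall accuracies printed earlier (0.7546 and 0.7700). About 79% of the test set (21,433 of 27,216 papers) sits in the largest bin, which is why overall accuracy tracks the big categories so closely.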



Accuracy by category (sorted by size):


category_code,base_correct,tuned_correct,hypertuned_correct,category_size,test_count
hep-lat,0.0,0.0,0.0,4,1
math.AT,0.0,0.0,0.0,5,1
nlin.CD,0.0,0.0,0.0,5,1
cond-mat.soft,0.0,0.0,0.0,5,1
physics.ins-det,0.0,0.0,0.0,5,1
stat.OT,0.0,0.0,0.0,6,2
hep-th,0.5,0.5,0.0,6,2
math.DG,0.0,0.0,0.0,6,1
math.HO,0.0,0.0,0.0,6,1
nlin.CG,0.0,0.0,0.0,6,2



Accuracy for largest categories:


category_code,base_correct,tuned_correct,hypertuned_correct,category_size,test_count
cs.LG,0.77283,0.77283,0.803102,31973,7994
cs.CV,0.919793,0.919793,0.931497,23240,5810
cs.CL,0.939484,0.939484,0.943452,20158,5040
cs.AI,0.64156,0.64156,0.663577,10354,2589
stat.ML,0.439387,0.439387,0.448011,8345,2087
cs.NE,0.669691,0.669691,0.717786,4405,1102
cs.RO,0.740331,0.740331,0.729282,722,181
cs.IR,0.461111,0.461111,0.511111,721,180
stat.ME,0.316384,0.316384,0.367232,705,177
math.OC,0.373333,0.373333,0.393333,599,150



Categories with largest performance differences between models:


category_code,base_correct,tuned_correct,hypertuned_correct,category_size,test_count,max_diff
astro-ph.CO,1.0,1.0,0.0,18,4,1.0
astro-ph.SR,0.5,0.5,0.0,8,2,0.5
hep-th,0.5,0.5,0.0,6,2,0.5
q-bio.GN,0.5,0.5,0.0,33,8,0.5
physics.optics,0.4,0.4,0.0,21,5,0.4
cond-mat.mtrl-sci,0.5,0.5,0.1,42,10,0.4
physics.comp-ph,0.333333,0.333333,0.066667,61,15,0.266667
cs.SY,0.222222,0.222222,0.0,71,18,0.222222
cs.MA,0.314286,0.314286,0.114286,142,35,0.2
q-bio.PE,0.3,0.3,0.1,38,10,0.2



Full results saved to CSV files
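
The base and tuned columns are identical in every table above, and their overall accuracies match to 16 digits. This suggests the two evaluations may have run on the same weights (for example, if `model` had already been fine-tuned in memory before the comparison). That is an inference from the outputs, not something verified here. One way to check is to count how often two prediction lists agree; the helper name below is illustrative:

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two equal-length prediction lists agree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        return 0.0
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

# Predictions taken from the first seven rows of the sample table above.
base_preds = ["stat.ML", "cs.AI", "cs.NE", "cs.CL", "cs.CV", "stat.ML", "cs.LG"]
hyper_preds = ["stat.ML", "cs.AI", "cs.NE", "cs.CL", "cs.CV", "stat.ML", "stat.ML"]

print(agreement_rate(base_preds, base_preds))   # identical lists
print(agreement_rate(base_preds, hyper_preds))  # one disagreement out of seven
```

In the notebook this would be `agreement_rate(base_results['predictions'], tuned_results['predictions'])`. A value of exactly 1.0 over all 27,216 test papers would point to duplicated weights rather than a genuine tie.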
