# CS 440 Final Project
Project: **NLBSE 2025 Code Comment Classification**

Members (Team Overfitted):
 - Erik Cooper
 - Seth Harling
## Vision
This tool competition involves building and training 3 multi-label classification models for code comments. These comments fall in to different categories for the different languages (Java, Python, and Pharo). These categories include comments for summarization, development notes, intent, and usage.

For our project, we chose to run with a strongly langauge processing approach, trying our best to ignore formatting as hints to the category. This is one of the reasons why we chose to use GPT-2 as our base model.

When searching for pretrained NLP models, we were presented with many options. The main ones we considered were variations of BERT (RoBERTa for one), ELMo, UniLM, and GPT. There were a couple properties we were specifically looking for in a model for this project. We wanted a model that would work well on a classification problem, and we wanted a smaller model that would take less time to train so we could spend more time learning how to work the NLPs than waiting for it to work.

With this, we settled on the smallest size of [GPT-2](https://huggingface.co/openai-community/gpt2) with 124M parameters. This number of parameters felt in line with the scope of our project, plus GPT is a rather famous model that most people have at least heard about. We thought that working with something somewhat familiar would help with learning how this all works. Though, to make the tuning process a little quicker, we use [distilgpt2](https://huggingface.co/distilbert/distilgpt2) to reduce our training times by 50% with its lower 82M parameters.

### Imports

In [1]:
from datasets import load_dataset
import numpy as np
import torch
from tqdm.notebook import tqdm
from sklearn.metrics import accuracy_score, classification_report
from transformers import (GPT2Config,
                          GPT2Tokenizer,
                          GPT2Model,
                          AdamW, 
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)
import re
from lion_pytorch import Lion

Using TensorFlow backend.


### Defining string constants

In [2]:
# From challenge ()
classification_langs = ['java', 'python', 'pharo']
classification_labels = {
    'java': ['summary', 'Ownership', 'Expand', 'usage', 'Pointer', 'deprecation', 'rational'],
    'python': ['Usage', 'Parameters', 'DevelopmentNotes', 'Expand', 'Summary'],
    'pharo': ['Keyimplementationpoints', 'Example', 'Responsibilities', 'Classreferences', 'Intent', 'Keymessages', 'Collaborators']
}

### Parameters

In [3]:
# Hyperparameters
hyperparameters = {
    'num_epochs': [2, 3, 4],
    'batch_size': [4, 8, 16],
    'learning_rate': [1e-5, 2e-5, 3e-5],
    'optimizer': ['Lion', 'AdamW'],
    'max_length': [256, 512, 1024],
}

In [4]:
batch_size = 8   # batch size for training
max_length = 512 # max length of the text that can be passed to the model
num_epochs = 4   # number of epochs
model_name = 'distilgpt2' # model name
add_class_name = True # add class name to the input text

### Preprocessing functions

In [5]:
def preprocess_java(comment, class_name, labels=None):
    # Create dataset object
    output = []
    
    for i in range(len(comment)):
        text = comment[i]
        
        # remove entirety of html lists
        # text = re.sub(r'<ol>[.\s\S]*?<\/ol>', '', text)
        
        # remove html tags
        #text = re.sub(r'<.*?>', '', text)
        
        # remove bullets
        #text = re.sub(r'\s\*', '', text)
        
        # remove bulleted lines
        #text = re.sub(r'\n\s*\*.*', '', text)
        
        # remove curly braced sections
        #text = re.sub(r'\{.*?\}', '', text)
        
        # remove // comments
        # text = re.sub(r'\s*\/\/.*', '', text)
        
        # remove formatting for // comments
        # text = re.sub(r'\/\/', '', text)
        
        # remove formatting for /* */ comments
        # text = re.sub(r'\/\*.|\*\/', '', text)
        
        # remove multiple spaces
        text = re.sub(r'\s+', ' ', text)
        
        # Add class name
        if add_class_name:
            text = class_name[i] + ': ' + text
        
        # truncate middle
        #if (len(text) > max_length):
        #    text = text[:(int(max_length/2)-4)] + ' ... ' + text[-(int(max_length/2)-4):]
        
        # truncate end
        if (len(text) > max_length):
            text = text[:max_length]
        
        # Build dictionary
        if labels is not None:
            output.append({
                'text': text,
                'label': labels[i]
            })
        else:
            output.append({
                'text': text,
                'label': 0
            })
    
    return output

In [6]:
def preprocess_python(comment, class_name, labels=None):
    output = []
    
    for i in range(len(comment)):
        text = comment[i]
        
        # Add class name
        text = class_name[i] + ': ' + text
        
        # truncate
        if (len(text) > max_length):
            text = text[:(int(max_length/2)-4)] + ' ... ' + text[-(int(max_length/2)-4):]
        
        # Build dictionary
        if labels is not None:
            output.append({
                'text': text,
                'label': labels[i]
            })
        else:
            output.append({
                'text': text,
                'label': 0
            })
    
    return output

In [7]:
def preprocess_pharo(comment, class_name, labels=None):
    output = []
    
    for i in range(len(comment)):
        text = comment[i]
        
        # Add class name
        text = class_name[i] + ': ' + text
        
        # truncate
        if (len(text) > max_length):
            text = text[:(int(max_length/2)-4)] + ' ... ' + text[-(int(max_length/2)-4):]
        
        # Build dictionary
        if labels is not None:
            output.append({
                'text': text,
                'label': labels[i]
            })
        else:
            output.append({
                'text': text,
                'label': 0
            })
    
    return output

### GPT2 Collator

In [8]:
# Collator object for GPT2, which will tokenize the text
class GPT2_collator(object):
    def __init__(self, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, sequences):
        texts = [sequence['text'] for sequence in sequences]
        labels = [sequence['label'] for sequence in sequences]
        
        inputs = self.tokenizer(text=texts, return_tensors='pt', padding='max_length', truncation=True, max_length=self.max_length)
        inputs.update({'labels': torch.tensor(labels)})
        
        return inputs

### Train, Validate, and Predict

In [9]:
# Training function, updates model weights
def train(model, dataloader, optimizer, scheduler, max_batches=None):
    global device
    
    model.train()
    
    pred_labels = []
    true_labels = []
    total_loss = 0
    
    batches_processed = 0
    
    for batch in tqdm(dataloader, total=len(dataloader)):
        if max_batches is not None and batches_processed >= max_batches:
            break
        
        inputs = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**inputs)
        
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
        pred_labels.extend(torch.argmax(outputs.logits, dim=-1).cpu().numpy())
        true_labels.extend(inputs['labels'].cpu().numpy())
        
        batches_processed += 1
    
    avg_epoch_loss = total_loss / batches_processed
    
    return pred_labels, true_labels, avg_epoch_loss

In [10]:
# For validation, no updating
def validate(model, dataloader, max_batches=None):
    global device
    
    model.eval()
    
    pred_labels = []
    true_labels = []
    total_loss = 0
    
    batches_processed = 0
    
    for batch in tqdm(dataloader, total=len(dataloader)):
        if max_batches is not None and batches_processed >= max_batches:
            break
        
        inputs = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**inputs)
        
        loss = outputs.loss
        total_loss += loss.item()
        
        pred_labels.extend(torch.argmax(outputs.logits, dim=-1).cpu().numpy())
        true_labels.extend(inputs['labels'].cpu().numpy())
        
        batches_processed += 1
    
    avg_epoch_loss = total_loss / batches_processed
    
    return pred_labels, true_labels, avg_epoch_loss

In [11]:
# For prediction, no update and no original labels
def predict(model, dataloader):
    global device
    
    model.eval()
    
    pred_labels = []
    
    for batch in tqdm(dataloader, total=len(dataloader)):
        inputs = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**inputs)
        
        pred_labels.extend(torch.argmax(outputs.logits, dim=-1).cpu().numpy())
    
    return pred_labels

### Setup

In [12]:
# Loading dataset
ds = load_dataset('NLBSE/nlbse25-code-comment-classification')

In [13]:
# Setting up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cuda


In [14]:
# Model setup and training
def build_model(lang, save=False, optimizer='AdamW', learning_rate=5e-5, num_epochs=4, batch_size=8, max_length=512):
    global device
    
    num_labels = len(classification_labels[lang])
    
    # Setup
    print('Setting config...')
    model_config = GPT2Config.from_pretrained(model_name, num_labels=num_labels, id2label={str(i): label for i, label in enumerate(classification_labels[lang])}, label2id={label: i for i, label in enumerate(classification_labels[lang])})

    print('Loading tokenizer...')
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    tokenizer.padding_side = 'left'
    tokenizer.pad_token = tokenizer.eos_token

    print('Loading model...')
    model = GPT2ForSequenceClassification.from_pretrained(model_name, config=model_config)
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.eos_token_id
    model.to(device)
    
    collator = GPT2_collator(tokenizer, max_length)
    
    # Prepare data
    print('Preparing data...')
    if lang == 'java':
        train_data = preprocess_java(ds['java_train']['comment_sentence'], ds['java_train']['class'], np.argmax(ds['java_train']['labels'], axis=1))
        eval_data = preprocess_java(ds['java_test']['comment_sentence'], ds['java_train']['class'], np.argmax(ds['java_test']['labels'], axis=1))
    elif lang == 'python':
        train_data = preprocess_python(ds['python_train']['comment_sentence'], ds['python_train']['class'], np.argmax(ds['python_train']['labels'], axis=1))
        eval_data = preprocess_python(ds['python_test']['comment_sentence'], ds['python_train']['class'], np.argmax(ds['python_test']['labels'], axis=1))
    elif lang == 'pharo':
        train_data = preprocess_pharo(ds['pharo_train']['comment_sentence'], ds['pharo_train']['class'], np.argmax(ds['pharo_train']['labels'], axis=1))
        eval_data = preprocess_pharo(ds['pharo_test']['comment_sentence'], ds['pharo_train']['class'], np.argmax(ds['pharo_test']['labels'], axis=1))
    
    train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collator)
    eval_dataloader = torch.utils.data.DataLoader(eval_data, batch_size=batch_size, shuffle=False, collate_fn=collator)
    
    # Training
    print('Training...')
    opt = None
    if optimizer == 'AdamW':
        opt = AdamW(model.parameters(), lr = learning_rate, eps = 1e-8, weight_decay=0.01)
    elif optimizer == 'Lion':
        opt = Lion(model.parameters(), lr = learning_rate)
    else:
        raise ValueError('Invalid optimizer ' + optimizer)
    total_steps = len(train_dataloader) * num_epochs
    scheduler = get_linear_schedule_with_warmup(opt, num_warmup_steps=0, num_training_steps=total_steps)

    loss_list = []
    accuracy_list = []
    eval_loss_list = []
    eval_accuracy_list = []

    max_batches = None # Set to None to run all batches, or a number to run a limited number of batches for testing
    for epoch in tqdm(range(num_epochs)):
        train_labels, true_labels, train_loss = train(model, train_dataloader, opt, scheduler, max_batches)
        train_accuracy = accuracy_score(true_labels, train_labels)
        loss_list.append(train_loss)
        accuracy_list.append(train_accuracy)
        print(f'Epoch {epoch+1}/{num_epochs} - Train Loss: {train_loss}, Train Accuracy: {train_accuracy}')
        
        eval_labels, true_labels, eval_loss = validate(model, eval_dataloader, max_batches)
        eval_accuracy = accuracy_score(true_labels, eval_labels)
        eval_loss_list.append(eval_loss)
        eval_accuracy_list.append(eval_accuracy)
        print(f'Epoch {epoch+1}/{num_epochs} - Eval Loss:  {eval_loss}, Eval Accuracy:  {eval_accuracy}')
    
    # Save model
    if save:
        model.save_pretrained(f'./models/gpt2-{lang}')
        tokenizer.save_pretrained(f'./models/gpt2-{lang}')
    
    return model, tokenizer, loss_list, accuracy_list, eval_loss_list, eval_accuracy_list

In [15]:
# Run tests on java lang with all hyperparameters
java_results = {}
for num_epochs in hyperparameters['num_epochs']:
    for batch_size in hyperparameters['batch_size']:
        for learning_rate in hyperparameters['learning_rate']:
            for optimizer in hyperparameters['optimizer']:
                for max_length in hyperparameters['max_length']:
                    print(f'Running test with hyperparameters: num_epochs={num_epochs}, batch_size={batch_size}, learning_rate={learning_rate}, optimizer={optimizer}, max_length={max_length}')
                    model, tokenizer, loss_list, accuracy_list, eval_loss_list, eval_accuracy_list = build_model('java', save=False, optimizer=optimizer, learning_rate=learning_rate, num_epochs=num_epochs, batch_size=batch_size, max_length=max_length)
                    
                    # Do classification report
                    eval_data = preprocess_java(ds['java_test']['comment_sentence'], ds['java_train']['class'], np.argmax(ds['java_test']['labels'], axis=1))
                    eval_dataloader = torch.utils.data.DataLoader(eval_data, batch_size=batch_size, shuffle=False, collate_fn=GPT2_collator(tokenizer, max_length))
                    eval_labels = predict(model, eval_dataloader)
                    true_labels = [sequence['label'] for sequence in eval_data]
                    report = classification_report(true_labels, eval_labels)
                    
                    # Save results
                    java_results[(num_epochs, batch_size, learning_rate, optimizer, max_length)] = report

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=Lion, max_length=256
Setting config...
Loading tokenizer...
Loading model...

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.6397435572520056, Train Accuracy: 0.7852639873916469


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.629224700980843, Eval Accuracy:  0.7965217391304348


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.23635876673068257, Train Accuracy: 0.9198844234305227


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7113719995243488, Eval Accuracy:  0.7878260869565218


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=Lion, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5483554800804782, Train Accuracy: 0.8170475439978986


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6135114677846205, Eval Accuracy:  0.7994202898550725


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.23133857889800638, Train Accuracy: 0.9194904123982138


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7771916943639318, Eval Accuracy:  0.7843478260869565


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=Lion, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5562578586077106, Train Accuracy: 0.8144208037825059


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.752224933457809, Eval Accuracy:  0.7942028985507247


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.2401930083123483, Train Accuracy: 0.9152876280535855


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6917992350101351, Eval Accuracy:  0.8028985507246377


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=AdamW, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 1.0281132760217038, Train Accuracy: 0.6565537168374048


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6530293427969338, Eval Accuracy:  0.7866666666666666


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.5211112448285792, Train Accuracy: 0.8283425269240872


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6202000713519737, Eval Accuracy:  0.8017391304347826


  0%|          | 0/432 [00:00<?, ?it/s]

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=AdamW, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.922301430120172, Train Accuracy: 0.696742842132913


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6837685561476974, Eval Accuracy:  0.7947826086956522


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.5010096177604527, Train Accuracy: 0.8341213553979512


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6425566440480502, Eval Accuracy:  0.8005797101449276


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=1e-05, optimizer=AdamW, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.8177066580998954, Train Accuracy: 0.7297084318360915


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.7084728347951939, Eval Accuracy:  0.7797101449275362


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.4560506397326567, Train Accuracy: 0.8430522721302863


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6583370498088631, Eval Accuracy:  0.7878260869565218


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=Lion, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5507755134775139, Train Accuracy: 0.8173102180194379


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6092473932257471, Eval Accuracy:  0.807536231884058


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.21404676003643958, Train Accuracy: 0.9285526661413186


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7548675712959165, Eval Accuracy:  0.7744927536231884


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=Lion, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5748513391652135, Train Accuracy: 0.8112687155240347


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.8582519151564892, Eval Accuracy:  0.7663768115942029


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.21275356243432905, Train Accuracy: 0.9280273180982401


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7600421617341824, Eval Accuracy:  0.776231884057971


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=Lion, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5643955723589439, Train Accuracy: 0.8153401628578933


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.7646414309748205, Eval Accuracy:  0.7408695652173913


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.23710541630015458, Train Accuracy: 0.9214604675597583


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7682910554986948, Eval Accuracy:  0.7605797101449275


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=AdamW, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.6771455861304245, Train Accuracy: 0.7693722090885211


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.601005663205022, Eval Accuracy:  0.7953623188405797


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.342960861694956, Train Accuracy: 0.8819280273180983


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6638030577849315, Eval Accuracy:  0.7930434782608695


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=AdamW, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.812909219069152, Train Accuracy: 0.7286577357499343


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6450040359411813, Eval Accuracy:  0.7930434782608695


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.40411465927722967, Train Accuracy: 0.8645915418965064


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7180579445275557, Eval Accuracy:  0.784927536231884


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=2e-05, optimizer=AdamW, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.6582835657402732, Train Accuracy: 0.7882847386393486


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.91139795571723, Eval Accuracy:  0.7281159420289856


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.36500569349722234, Train Accuracy: 0.8779879169950092


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7353009356199807, Eval Accuracy:  0.7733333333333333


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=Lion, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5745934056158607, Train Accuracy: 0.8120567375886525


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6711301167545874, Eval Accuracy:  0.7942028985507247


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.2134706364278422, Train Accuracy: 0.9305227213028632


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.8489049937584662, Eval Accuracy:  0.7808695652173913


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=Lion, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5294329033345208, Train Accuracy: 0.8282111899133175


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.665312615013742, Eval Accuracy:  0.7942028985507247


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.20612770796956933, Train Accuracy: 0.9307853953244024


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.7274097578461233, Eval Accuracy:  0.7808695652173913


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=Lion, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5257560894244017, Train Accuracy: 0.8309692671394799


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6202169941076405, Eval Accuracy:  0.7982608695652174


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.22264529391711274, Train Accuracy: 0.9265826109797741


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.685672871724381, Eval Accuracy:  0.7860869565217391


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=AdamW, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.6952705782591272, Train Accuracy: 0.7742316784869976


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.6051369224335303, Eval Accuracy:  0.7959420289855073


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.3234412178773572, Train Accuracy: 0.8886262148673496


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6767144058007565, Eval Accuracy:  0.7884057971014493


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=AdamW, max_length=512
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.6945469860159186, Train Accuracy: 0.7723929603362227


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.7083710895275106, Eval Accuracy:  0.7768115942028986


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.3436202162519286, Train Accuracy: 0.8846861045442606


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.6686041199695022, Eval Accuracy:  0.7918840579710145


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=4, learning_rate=3e-05, optimizer=AdamW, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...




  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.5582553682154661, Train Accuracy: 0.8132387706855791


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.9975926474875855, Eval Accuracy:  0.7257971014492753


  0%|          | 0/1904 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.29727720972971095, Train Accuracy: 0.8936170212765957


  0%|          | 0/432 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.9026581015168631, Eval Accuracy:  0.7269565217391304


  0%|          | 0/432 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=8, learning_rate=1e-05, optimizer=Lion, max_length=256
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/952 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.7889480246798363, Train Accuracy: 0.7277383766745469


  0%|          | 0/216 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.7104200621856793, Eval Accuracy:  0.7530434782608696


  0%|          | 0/952 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.29589059836651344, Train Accuracy: 0.8980824796427633


  0%|          | 0/216 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.9732985745970466, Eval Accuracy:  0.7263768115942029


  0%|          | 0/216 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=8, learning_rate=1e-05, optimizer=Lion, max_length=512
Setting config...
Loading tokenizer...
Loading model...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/952 [00:00<?, ?it/s]

Epoch 1/2 - Train Loss: 0.7058066360749297, Train Accuracy: 0.7677961649592855


  0%|          | 0/216 [00:00<?, ?it/s]

Epoch 1/2 - Eval Loss:  0.8489354523052397, Eval Accuracy:  0.7379710144927536


  0%|          | 0/952 [00:00<?, ?it/s]

Epoch 2/2 - Train Loss: 0.2886104336650548, Train Accuracy: 0.9028106120304702


  0%|          | 0/216 [00:00<?, ?it/s]

Epoch 2/2 - Eval Loss:  0.9770047848105605, Eval Accuracy:  0.6985507246376812


  0%|          | 0/216 [00:00<?, ?it/s]

Running test with hyperparameters: num_epochs=2, batch_size=8, learning_rate=1e-05, optimizer=Lion, max_length=1024
Setting config...
Loading tokenizer...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading model...
Preparing data...
Training...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/952 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
# Show results
for key, value in java_results.items():
    print(f'Hyperparameters: {key}')
    print(value)
    print('')

Hyperparameters: (2, 4, 1e-05, 'AdamW', 256)
              precision    recall  f1-score   support

           0       0.82      0.90      0.86       892
           1       0.96      1.00      0.98        45
           2       0.24      0.09      0.13       100
           3       0.79      0.82      0.81       427
           4       0.79      0.94      0.86       178
           5       1.00      0.60      0.75        15
           6       0.50      0.04      0.08        68

    accuracy                           0.80      1725
   macro avg       0.73      0.63      0.64      1725
weighted avg       0.77      0.80      0.78      1725


Hyperparameters: (2, 4, 1e-05, 'AdamW', 512)
              precision    recall  f1-score   support

           0       0.81      0.93      0.87       892
           1       0.98      1.00      0.99        45
           2       0.19      0.08      0.11       100
           3       0.86      0.78      0.82       427
           4       0.77      0.94      0.