# Deep learning in Human Language Technology Project

- Student(s) name(s): Nouman Bashir
- Date: 07/11/2025
- Chosen Corpus: Rotten Tomatoes (sentence-level sentiment)
- Contributions (if group project): None

### Corpus information

- Description of the chosen corpus: The dataset has two fields, `text` and `label`, where label 1 marks a positive review and label 0 a negative one. The training split contains 8,530 rows; the validation and test splits contain 1,066 rows each.
- Paper(s) and other published materials related to the corpus: Bo Pang and Lillian Lee. 2005. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115-124. ArXiv: https://arxiv.org/abs/cs/0506075
This project uses the Rotten Tomatoes movie review dataset (Pang and Lee, 2005), which contains 10,662 sentences labeled with binary sentiment (positive/negative). The dataset was originally introduced for sentiment classification research and has become a standard benchmark in the NLP community.
- Random baseline performance and expected performance for recent machine learned models: the random baseline is ~50%, the expected performance of a pre-trained BERT model is 87-89%, and SOTA is ~92%.


---

## 1. Setup: Installation and Imports

In [None]:
# Install required packages

!pip install -q transformers datasets torch accelerate evaluate scikit-learn

# Import libraries
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import random
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

# Check and display device information
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
GPU: Tesla T4


---

## 2. Data download, sampling and preprocessing



### 2.1. Download the corpus

In [None]:
# Your code to download the corpus here

dataset = load_dataset("rotten_tomatoes")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


In [None]:
#print(dataset['train'].column_names)

In [None]:
for split in dataset:
    print(f"Number of examples in {split}: {len(dataset[split])}")

Number of examples in train: 8530
Number of examples in validation: 1066
Number of examples in test: 1066


### 2.2. Sampling and preprocessing

In [None]:
# Count negative and positive examples in each split
for split in dataset:
    labels = [ex['label'] for ex in dataset[split]]
    neg_count = labels.count(0)
    pos_count = labels.count(1)
    total = len(labels)

In [None]:
# This cell runs after the loop above has finished, so it reports only the last split (test)
print(f"\n{split.upper()} SET:")
print(f"Total examples: {total}")
print(f"Negative (0): {neg_count} ({neg_count/total*100:.1f}%)\nPositive (1): {pos_count} ({pos_count/total*100:.1f}%)")


TEST SET:
Total examples: 1066
Negative (0): 533 (50.0%)
Positive (1): 533 (50.0%)
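
Because the print statements run after the loop, only the test split is reported above. A minimal sketch of a per-split report follows. The `splits` dictionary holds toy labels and only stands in for the real `dataset` object, and `label_distribution` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, share)} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Toy stand-in for the label column of each split (assumption, not the real data).
splits = {
    "train": [0, 1, 1, 0, 1, 0],
    "validation": [1, 0],
    "test": [0, 0, 1, 1],
}

# Report inside the loop so every split is printed, not just the last one.
for name, labels in splits.items():
    print(f"{name.upper()} SET: {len(labels)} examples")
    for label, (n, share) in label_distribution(labels).items():
        print(f"  label {label}: {n} ({share * 100:.1f}%)")
```

With the real corpus, the label list for a split could be taken from `dataset[split]['label']`.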


In [None]:
# Calculate random baseline

def calculate_random_baseline(dataset_split):
    labels = [ex['label'] for ex in dataset_split]
    random_predictions = [random.randint(0, 1) for _ in labels]
    return accuracy_score(labels, random_predictions)

random_baseline = calculate_random_baseline(dataset['test'])
print(f"\nRandom Baseline on Test Set: {random_baseline:.4f}")


Random Baseline on Test Set: 0.4719
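
A single run of random guessing fluctuates around 50%, so 0.4719 is not surprising. The sketch below estimates how large that fluctuation is on 1,066 examples, assuming each guess is an independent coin flip (a binomial model); `random_baseline_stats` is a hypothetical helper.

```python
import math

def random_baseline_stats(n_examples, p_correct=0.5):
    """Mean and standard deviation of accuracy for independent random guesses."""
    std = math.sqrt(p_correct * (1 - p_correct) / n_examples)
    return p_correct, std

mean, std = random_baseline_stats(1066)
observed = 0.4719  # value printed by the cell above
z = (observed - mean) / std

print(f"expected accuracy: {mean:.4f} +/- {std:.4f}")
print(f"observed {observed:.4f} is {abs(z):.2f} standard deviations from the mean")
```

Averaging over several seeds would give a more stable baseline estimate than a single draw.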


---

## 3. Prompting a generative model



### 3.1 Prompt optimization

In [None]:
# Your code and experiments relating to the prompt optimization here

gen_model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
print(f"\nModel: {gen_model_name}")
print("Description: 135M parameter instruction-tuned generative model")

# Load the tokenizer and the model
gen_tokenizer = AutoTokenizer.from_pretrained(gen_model_name)
gen_model = AutoModelForCausalLM.from_pretrained(gen_model_name).to(device)

if gen_tokenizer.pad_token is None:
    gen_tokenizer.pad_token = gen_tokenizer.eos_token

prompt_templates = {
    "Simple": """Classify sentiment: {text}
Answer (positive/negative):""",

    "Few-shot": """Classify movie review sentiments.

Review: a masterpiece of form and execution
Sentiment: positive

Review: simplistic , silly and tedious.
Sentiment: negative

Review: {text}
Sentiment:"""
}

def evaluate_prompt(template, model, tokenizer, val_samples, max_samples=100):
    predictions = []
    labels = []
    invalid = 0
    samples = val_samples.select(range(min(max_samples, len(val_samples))))
    for i, example in enumerate(samples):
        prompt = template.format(text=example['text'])
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=5,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        generated = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip().lower()
        if 'positive' in generated:
            pred = 1
        elif 'negative' in generated:
            pred = 0
        else:
            pred = random.randint(0, 1)
            invalid += 1
        predictions.append(pred)
        labels.append(example['label'])
    accuracy = accuracy_score(labels, predictions)
    return accuracy, invalid

print("\nTesting prompt templates:")
prompt_results = []

for name, template in prompt_templates.items():
    print(f"\n  Testing '{name}' prompt...")
    acc, invalid = evaluate_prompt(template, gen_model, gen_tokenizer, dataset['validation'])
    prompt_results.append({
        'Template': name,
        'Accuracy': acc,
    })
    print(f"Accuracy: {acc:.4f}")

# Select best prompt
prompt_df = pd.DataFrame(prompt_results).sort_values('Accuracy', ascending=False)
best_prompt_name = prompt_df.iloc[0]['Template']
best_prompt_template = prompt_templates[best_prompt_name]

print(f"\nBest prompt template: '{best_prompt_name}'")
print(f"Validation accuracy: {prompt_df.iloc[0]['Accuracy']:.4f}")


Model: HuggingFaceTB/SmolLM-135M-Instruct
Description: 135M parameter instruction-tuned generative model



Testing prompt templates:

  Testing 'Simple' prompt...
Accuracy: 0.8400

  Testing 'Few-shot' prompt...
Accuracy: 0.9400

Best prompt template: 'Few-shot'
Validation accuracy: 0.9400
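
The prompts above are compared on the first 100 rows of the validation split. If the rows happen to be ordered by label (an assumption that was not checked here), that subset would be unbalanced and could skew the prompt ranking. A hedged sketch of drawing a class-balanced, shuffled subset of indices follows; `stratified_indices` is a hypothetical helper, not part of the `datasets` library.

```python
import random

def stratified_indices(labels, n_total, seed=42):
    """Pick about n_total indices with an equal share per class, in shuffled order."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    per_class = n_total // len(by_class)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        chosen.extend(rng.sample(indices, min(per_class, len(indices))))
    rng.shuffle(chosen)
    return chosen

# Toy labels ordered by class, mimicking a split that is sorted by label.
toy_labels = [1] * 533 + [0] * 533
subset = stratified_indices(toy_labels, n_total=100)
print(len(subset), sum(toy_labels[i] for i in subset))
```

With the real data, the resulting indices could be passed to `dataset['validation'].select(subset)`.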


### 3.2 Evaluation on test set

In [None]:
def evaluate_model_prompting(model, tokenizer, dataset_split, template, model_name="Model"):
    predictions = []
    labels = []
    invalid_count = 0
    print(f"\nEvaluating {model_name} with prompting on {len(dataset_split)} examples.")
    for i, example in enumerate(dataset_split):
        if i % 100 == 0:
            print(f"  Progress: {i}/{len(dataset_split)}", end='\r')
        prompt = template.format(text=example['text'])
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=5,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        generated = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip().lower()
        if 'positive' in generated:
            pred = 1
        elif 'negative' in generated:
            pred = 0
        else:
            pred = random.randint(0, 1)
            invalid_count += 1
        predictions.append(pred)
        labels.append(example['label'])
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'invalid_count': invalid_count,
        'predictions': predictions
    }

prompting_results_135m = evaluate_model_prompting(
    gen_model, gen_tokenizer, dataset['test'],
    best_prompt_template, "SmolLM-135M"
)

print(f"\nTest Set Results:")
print(f"Accuracy:  {prompting_results_135m['accuracy']:.4f}")
print(f"Precision: {prompting_results_135m['precision']:.4f}")
print(f"Recall:    {prompting_results_135m['recall']:.4f}")
print(f"F1 Score:  {prompting_results_135m['f1']:.4f}")


Evaluating SmolLM-135M with prompting on 1066 examples.

Test Set Results:
Accuracy:  0.6782
Precision: 0.6227
Recall:    0.9043
F1 Score:  0.7376
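
High recall combined with low precision suggests that the prompted model leans toward predicting "positive". The sketch below reconstructs approximate confusion-matrix counts from the reported precision and recall and the 533/533 class balance. Rounding to whole counts is an assumption, and `confusion_from_metrics` is a hypothetical helper.

```python
def confusion_from_metrics(precision, recall, n_pos, n_neg):
    """Recover approximate (tp, fp, fn, tn) from precision, recall and class sizes."""
    tp = round(recall * n_pos)             # recall = tp / n_pos
    predicted_pos = round(tp / precision)  # precision = tp / predicted_pos
    fp = predicted_pos - tp
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_from_metrics(0.6227, 0.9043, n_pos=533, n_neg=533)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.4f}")
```

The reconstructed accuracy agrees with the reported 0.6782, and roughly 774 of the 1,066 predictions come out as positive.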


---

## 4. Fine-tuning a generative model



### 4.1. Model training

In [None]:
if gen_tokenizer.pad_token is None:
    gen_tokenizer.pad_token = gen_tokenizer.eos_token

# Preparing training data
def prepare_gen_data(examples):
    texts = []
    for text, label in zip(examples['text'], examples['label']):
        label_text = "positive" if label == 1 else "negative"
        formatted = f"Classify this movie review sentiment as positive or negative.\n\nReview: {text}\nSentiment: {label_text}{gen_tokenizer.eos_token}"
        texts.append(formatted)
    tokenized = gen_tokenizer(texts, truncation=True, max_length=256, padding='max_length')
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized

print("Tokenizing datasets\n")
gen_train_dataset = dataset['train'].map(prepare_gen_data, batched=True, remove_columns=dataset['train'].column_names)
gen_val_dataset = dataset['validation'].map(prepare_gen_data, batched=True, remove_columns=dataset['validation'].column_names)

gen_train_dataset.set_format('torch')
gen_val_dataset.set_format('torch')

print(f"Training samples: {len(gen_train_dataset)}")
print(f"Validation samples: {len(gen_val_dataset)}")

Tokenizing datasets



Training samples: 8530
Validation samples: 1066
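
In `prepare_gen_data`, `labels` is a plain copy of `input_ids`, so padded positions also contribute to the loss. A common alternative is to set the label at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists and assumes `attention_mask` is 0 at padding; `mask_padding_labels` is a hypothetical helper, and this change was not used for the results reported in this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: token id 2 plays the role of both EOS and padding.
batch = {
    "input_ids": [[5, 8, 13, 2, 2, 2], [7, 9, 2, 2, 2, 2]],
    "attention_mask": [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]],
}
labels = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(labels)
```

Masking by `attention_mask` rather than by token id keeps the real end-of-sequence token in the loss even when the pad token and the EOS token are the same.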


### 4.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

#hyperparameters used
lr = 2e-5
bs = 8
epochs = 3

# Training arguments
training_args = TrainingArguments(
    num_train_epochs=epochs,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    learning_rate=lr,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    push_to_hub=False,
    report_to="none",
    seed=42
)

# Reloading the model again
gen_model = AutoModelForCausalLM.from_pretrained(gen_model_name).to(device)

# Initialize trainer
trainer = Trainer(
    model=gen_model,
    args=training_args,
    train_dataset=gen_train_dataset,
    eval_dataset=gen_val_dataset,
)

# Model training
print("\nTraining generative model...")
trainer.train()

print("\nTraining complete!")


Training generative model...


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.4 | 0.392612 |
| 2 | 0.3722 | 0.388089 |
| 3 | 0.3523 | 0.388228 |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].



Training complete!
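
Only one configuration (learning rate 2e-5, batch size 8, 3 epochs) was trained in this section. A small grid search could be organized as in the sketch below. `train_and_evaluate` is a hypothetical placeholder that returns a made-up score so that the loop runs on its own; in practice it would wrap a `Trainer` run and return the validation loss.

```python
from itertools import product

def train_and_evaluate(lr, batch_size):
    """Hypothetical stand-in: returns a dummy validation loss, not a real result."""
    return abs(lr - 3e-5) * 1e4 + abs(batch_size - 16) / 100

def grid_search(learning_rates, batch_sizes, objective):
    """Evaluate every (lr, batch size) pair and return the best run and all runs."""
    runs = []
    for lr, bs in product(learning_rates, batch_sizes):
        runs.append({"lr": lr, "bs": bs, "val_loss": objective(lr, bs)})
    best = min(runs, key=lambda r: r["val_loss"])
    return best, runs

best, all_runs = grid_search([1e-5, 2e-5, 5e-5], [8, 16], train_and_evaluate)
print(f"tried {len(all_runs)} configurations; best: {best}")
```

Selecting on the validation split, as here, keeps the test split untouched for the final evaluation.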


### 4.3. Evaluation on test set

In [None]:
def evaluate_finetuned_gen(model, tokenizer, dataset_split):
    predictions = []
    labels = []
    invalid_count = 0
    model.eval()
    print(f"\nEvaluating on {len(dataset_split)} test examples:")
    for i, example in enumerate(dataset_split):
        if i % 100 == 0:
            print(f"  Progress: {i}/{len(dataset_split)}", end='\r')
        prompt = f"Classify this movie review sentiment as positive or negative.\n\nReview: {example['text']}\nSentiment:" # the only difference from the prompting code
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=5,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        generated = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip().lower()
        if 'positive' in generated:
            pred = 1
        elif 'negative' in generated:
            pred = 0
        else:
            pred = random.randint(0, 1)
            invalid_count += 1
        predictions.append(pred)
        labels.append(example['label'])

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Evaluate on test set
finetuned_gen_results_135m = evaluate_finetuned_gen(gen_model, gen_tokenizer, dataset['test'])

print(f"\n\nTest Set Results:")
print(f"Accuracy:  {finetuned_gen_results_135m['accuracy']:.4f}")
print(f"Precision: {finetuned_gen_results_135m['precision']:.4f}")
print(f"Recall:    {finetuned_gen_results_135m['recall']:.4f}")
print(f"F1 Score:  {finetuned_gen_results_135m['f1']:.4f}")


Evaluating on 1066 test examples:


Test Set Results:
Accuracy:  0.8630
Precision: 0.8772
Recall:    0.8443
F1 Score:  0.8604
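
The three evaluation functions in this notebook repeat the same label-parsing branch. A minimal sketch of factoring it out follows. Returning `None` for output that contains no label, instead of a random guess as in the notebook, is a design assumption that keeps the invalid count explicit; `parse_sentiment` is a hypothetical helper.

```python
def parse_sentiment(generated):
    """Map generated text to 1 (positive), 0 (negative) or None (no label found)."""
    text = generated.strip().lower()
    if 'positive' in text:
        return 1
    if 'negative' in text:
        return 0
    return None

# Toy generations, made up for illustration.
examples = [" Positive", "negative.", "I think", "NEGATIVE\nReview:"]
parsed = [parse_sentiment(g) for g in examples]
invalid = sum(p is None for p in parsed)
print(parsed, invalid)
```

As in the original code, "positive" is checked first, so an output containing both words is counted as positive.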


---

## 5. Fine-tuning a bidirectional model



### 5.1. Model training

In [None]:
# Your code to train the transformer-based model on the training set and evaluate the performance on the validation set here

bert_model_name = "google-bert/bert-base-cased"
print(f"\nModel: {bert_model_name}")

bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModelForSequenceClassification.from_pretrained(bert_model_name,num_labels=2).to(device)

# Preparing data
def prepare_bert_data(examples):
    return bert_tokenizer(examples['text'], truncation=True, max_length=128, padding='max_length')

print("\nTokenizing datasets for BERT:")
bert_train_dataset = dataset['train'].map(prepare_bert_data, batched=True)
bert_val_dataset = dataset['validation'].map(prepare_bert_data, batched=True)
bert_test_dataset = dataset['test'].map(prepare_bert_data, batched=True)

bert_train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print(f"Training samples: {len(bert_train_dataset)}")
print(f"Validation samples: {len(bert_val_dataset)}")


Model: google-bert/bert-base-cased


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Tokenizing datasets for BERT:



Training samples: 8530
Validation samples: 1066


### 5.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

# Used hyperparameters
lr = 2e-5
bs = 16
epochs = 3

# Defining metrics (optional)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

bert_training_args = TrainingArguments(
    num_train_epochs=epochs,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    learning_rate=lr,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    report_to="none",
    seed=42
)

bert_trainer = Trainer(
    model=bert_model,
    args=bert_training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_val_dataset,
    compute_metrics=compute_metrics,
)

bert_trainer.train()

print("Training complete!")

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3452 | 0.349173 | 0.858349 | 0.865778 | 0.822635 | 0.913696 |
| 2 | 0.2038 | 0.41267 | 0.875235 | 0.874647 | 0.878788 | 0.870544 |
| 3 | 0.0794 | 0.590018 | 0.860225 | 0.861653 | 0.852941 | 0.870544 |


Training complete!
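
In the BERT training log, validation loss rises after epoch 1 while validation accuracy peaks at epoch 2, so the choice of `metric_for_best_model` decides which checkpoint `load_best_model_at_end=True` restores. The sketch below only re-reads the logged numbers to make that difference visible; it does not retrain anything.

```python
# Validation metrics copied from the BERT-base-cased training log.
log = [
    {"epoch": 1, "val_loss": 0.349173, "accuracy": 0.858349},
    {"epoch": 2, "val_loss": 0.412670, "accuracy": 0.875235},
    {"epoch": 3, "val_loss": 0.590018, "accuracy": 0.860225},
]

best_by_accuracy = max(log, key=lambda row: row["accuracy"])
best_by_loss = min(log, key=lambda row: row["val_loss"])

print(f"best epoch by accuracy: {best_by_accuracy['epoch']}")
print(f"best epoch by validation loss: {best_by_loss['epoch']}")
```

The growing gap between training loss and validation loss over the three epochs is a typical sign of overfitting on a corpus of this size.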


### 5.3 Evaluation on test set

In [None]:
# Your code to evaluate the final model on the test set here

test_results = bert_trainer.predict(bert_test_dataset)

finetuned = {
    'accuracy': test_results.metrics['test_accuracy'],
    'precision': test_results.metrics['test_precision'],
    'recall': test_results.metrics['test_recall'],
    'f1': test_results.metrics['test_f1'],
    'predictions': test_results.predictions.argmax(-1).tolist()
}

print("\nTest Set Results:")
print(f"Accuracy:  {finetuned['accuracy']:.4f}")
print(f"Precision: {finetuned['precision']:.4f}")
print(f"Recall:    {finetuned['recall']:.4f}")
print(f"F1 Score:  {finetuned['f1']:.4f}")


Test Set Results:
Accuracy:  0.8480
Precision: 0.8673
Recall:    0.8218
F1 Score:  0.8439


---

## 6. Bonus Task (optional)

Repeat sections 3 through 5 here for a second generative and a second bidirectional model. When summarizing your results below (Section 7), include also comparison of the two generative models and the two bidirectional models.

### 6.1 SmolLM-360M-Instruct

#### 6.1.1 Prompting

In [None]:
gen2_model_name = "HuggingFaceTB/SmolLM-360M-Instruct"
print(f"\nModel: {gen2_model_name}")

gen2_tokenizer = AutoTokenizer.from_pretrained(gen2_model_name)
gen2_model = AutoModelForCausalLM.from_pretrained(gen2_model_name).to(device)

if gen2_tokenizer.pad_token is None:
    gen2_tokenizer.pad_token = gen2_tokenizer.eos_token

prompting_results_360m = evaluate_model_prompting(
    gen2_model, gen2_tokenizer, dataset['test'],
    best_prompt_template, "SmolLM-360M"
)

print("\nTest Set Results:")
print(f"Accuracy:  {prompting_results_360m['accuracy']:.4f}")
print(f"Precision: {prompting_results_360m['precision']:.4f}")
print(f"Recall:    {prompting_results_360m['recall']:.4f}")
print(f"F1 Score:  {prompting_results_360m['f1']:.4f}")


Model: HuggingFaceTB/SmolLM-360M-Instruct


Evaluating SmolLM-360M with prompting on 1066 examples.

Test Set Results:
Accuracy:  0.8255
Precision: 0.8916
Recall:    0.7411
F1 Score:  0.8094


####  6.1.2 Fine Tuning

#####  6.1.2.1 Dataset Preparation

In [None]:
# Dataset preparation

def prepare_gen2_data(examples):
    texts = []
    for text, label in zip(examples['text'], examples['label']):
        label_text = "positive" if label == 1 else "negative"
        formatted = f"Classify this movie review sentiment as positive or negative.\n\nReview: {text}\nSentiment: {label_text}{gen2_tokenizer.eos_token}"
        texts.append(formatted)
    tokenized = gen2_tokenizer(texts, truncation=True, max_length=128, padding='max_length')
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized

print("Tokenizing datasets for SmolLM-360M:")
gen2_train_dataset = dataset['train'].map(prepare_gen2_data, batched=True, remove_columns=dataset['train'].column_names)
gen2_val_dataset = dataset['validation'].map(prepare_gen2_data, batched=True, remove_columns=dataset['validation'].column_names)

gen2_train_dataset.set_format('torch')
gen2_val_dataset.set_format('torch')

Tokenizing datasets for SmolLM-360M:



##### 6.1.2.2 Training

In [None]:
# Training arguments (bs and lr carry over from Section 5.2: bs = 16, lr = 2e-5)
gen2_training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    learning_rate=lr,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    push_to_hub=False,
    report_to="none",
    seed=42
)

gen2_trainer = Trainer(
    model=gen2_model,
    args=gen2_training_args,
    train_dataset=gen2_train_dataset,
    eval_dataset=gen2_val_dataset,
)

print("\nTraining Model")
gen2_trainer.train()


Training Model


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.7549 | 0.741505 |
| 2 | 0.7113 | 0.736643 |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=1068, training_loss=0.7614618187093556, metrics={'train_runtime': 1982.7381, 'train_samples_per_second': 8.604, 'train_steps_per_second': 0.539, 'total_flos': 4122375561216000.0, 'train_loss': 0.7614618187093556, 'epoch': 2.0})

##### 6.1.2.3 Evaluation

In [None]:
finetuned_gen_results_360m = evaluate_finetuned_gen(gen2_model, gen2_tokenizer, dataset['test'])

print(f"\nTest Set Results:\n")
print(f"  Accuracy:  {finetuned_gen_results_360m['accuracy']:.4f}")
print(f"  Precision: {finetuned_gen_results_360m['precision']:.4f}")
print(f"  Recall:    {finetuned_gen_results_360m['recall']:.4f}")
print(f"  F1 Score:  {finetuned_gen_results_360m['f1']:.4f}")


Evaluating on 1066 test examples:

Test Set Results:

  Accuracy:  0.8865
  Precision: 0.9023
  Recall:    0.8668
  F1 Score:  0.8842


### 6.2 DistilBERT base-cased

#### 6.2.1 Data Preparation

In [None]:
bert_model_name = "distilbert/distilbert-base-cased"
print(f"\nModel: {bert_model_name}")

# Load the DistilBERT tokenizer and model
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModelForSequenceClassification.from_pretrained(bert_model_name,num_labels=2).to(device)

# Prepare BERT data
def prepare_bert_data(examples):
    return bert_tokenizer(examples['text'], truncation=True, max_length=128, padding='max_length')

print("\nTokenizing datasets for BERT:")
bert_train_dataset = dataset['train'].map(prepare_bert_data, batched=True)
bert_val_dataset = dataset['validation'].map(prepare_bert_data, batched=True)
bert_test_dataset = dataset['test'].map(prepare_bert_data, batched=True)

bert_train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print(f"Training samples: {len(bert_train_dataset)}")
print(f"Validation samples: {len(bert_val_dataset)}")


Model: distilbert/distilbert-base-cased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Tokenizing datasets for BERT:



Training samples: 8530
Validation samples: 1066


#### 6.2.2 Fine Tuning

##### 6.2.2.1 Configuration and Training

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary'
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
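    # Note: load_best_model_at_end is not set, so the weights from the last completed epoch are evaluated
    metric_for_best_model="accuracy",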
    greater_is_better=True,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    seed=42,
    report_to="none",
    push_to_hub=False,
)

print("\nTraining Configuration:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Mixed precision (FP16): {training_args.fp16}")

trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

train_result = trainer.train()

print('Training Completed.')


Training Configuration:
Epochs: 3
Batch size: 8
Learning rate: 2e-05
Mixed precision (FP16): True


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3866 | 0.395504 | 0.825516 | 0.831826 | 0.802792 | 0.863039 |
| 2 | 0.2534 | 0.629152 | 0.834897 | 0.84029 | 0.813708 | 0.868668 |
| 3 | 0.1575 | 0.755594 | 0.834897 | 0.838828 | 0.81932 | 0.859287 |


Training Completed.


##### 6.2.2.2 Evaluation

In [None]:
test_results = trainer.predict(bert_test_dataset)

finetuned_bert_results = {
    'accuracy': test_results.metrics['test_accuracy'],
    'precision': test_results.metrics['test_precision'],
    'recall': test_results.metrics['test_recall'],
    'f1': test_results.metrics['test_f1'],
    'predictions': test_results.predictions.argmax(-1).tolist()
}

print("\nTest Set Results (DistilBERT-base-cased):")
print(f"Accuracy:  {finetuned_bert_results['accuracy']:.4f}")
print(f"Precision: {finetuned_bert_results['precision']:.4f}")
print(f"Recall:    {finetuned_bert_results['recall']:.4f}")
print(f"F1 Score:  {finetuned_bert_results['f1']:.4f}")

del bert_model, trainer
torch.cuda.empty_cache()


Test Set Results (DistilBERT-base-cased):
Accuracy:  0.8358
Precision: 0.8303
Recall:    0.8443
F1 Score:  0.8372


### 6.3 TurkuNLP/finnish-modernbert-large

#### 6.3.1 Data Preparation

In [None]:
bert_model_name = "TurkuNLP/finnish-modernbert-large"
print(f"\nModel: {bert_model_name}")

# Load BERT tokenizer and model
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModelForSequenceClassification.from_pretrained(bert_model_name,num_labels=2).to(device)

# Prepare BERT data
def prepare_bert_data(examples):
    return bert_tokenizer(examples['text'], truncation=True, max_length=128, padding='max_length')

print("\nTokenizing datasets for BERT:")
bert_train_dataset = dataset['train'].map(prepare_bert_data, batched=True)
bert_val_dataset = dataset['validation'].map(prepare_bert_data, batched=True)
bert_test_dataset = dataset['test'].map(prepare_bert_data, batched=True)

bert_train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
bert_test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print(f"Training samples: {len(bert_train_dataset)}")
print(f"Validation samples: {len(bert_val_dataset)}")


Model: TurkuNLP/finnish-modernbert-large


Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at TurkuNLP/finnish-modernbert-large and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Tokenizing datasets for BERT:



Training samples: 8530
Validation samples: 1066


#### 6.3.2 Fine Tuning

##### 6.3.2.1 Configuration and Training

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary'
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    logging_dir='./logs',
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
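    # Note: load_best_model_at_end is not set, so the weights from the last completed epoch are evaluated
    metric_for_best_model="accuracy",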
    greater_is_better=True,
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    seed=42,
    report_to="none",
    push_to_hub=False,
)

print("\nTraining Configuration:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Mixed precision (FP16): {training_args.fp16}")

trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

train_result = trainer.train()

print('Training Completed.')


Training Configuration:
Epochs: 3
Batch size: 8
Learning rate: 2e-05
Mixed precision (FP16): True


Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3022 | 0.371642 | 0.86773 | 0.873769 | 0.835616 | 0.915572 |
| 2 | 0.1715 | 0.543947 | 0.88743 | 0.886578 | 0.893333 | 0.879925 |
| 3 | 0.0298 | 0.841778 | 0.883677 | 0.882798 | 0.889524 | 0.876173 |


Training Completed.


##### 6.3.2.2 Evaluation

In [None]:
test_results = trainer.predict(bert_test_dataset)

finetuned_bert_results = {
    'accuracy': test_results.metrics['test_accuracy'],
    'precision': test_results.metrics['test_precision'],
    'recall': test_results.metrics['test_recall'],
    'f1': test_results.metrics['test_f1'],
    'predictions': test_results.predictions.argmax(-1).tolist()
}

print(f"\nTest Set Results (TurkuNLP/finnish-modernbert-large):")
print(f"Accuracy:  {finetuned_bert_results['accuracy']:.4f}")
print(f"Precision: {finetuned_bert_results['precision']:.4f}")
print(f"Recall:    {finetuned_bert_results['recall']:.4f}")
print(f"F1 Score:  {finetuned_bert_results['f1']:.4f}")

del bert_model, trainer
torch.cuda.empty_cache()


Test Set Results (TurkuNLP/finnish-modernbert-large):
Accuracy:  0.8762
Precision: 0.8848
Recall:    0.8649
F1 Score:  0.8748


---

## 7. Results and summary

### 7.1 Corpus insights

I learned how to prepare the provided datasets for a language model and how to set its hyperparameters, which is the key step before training. I had heard that training can take hours, and I experienced that first-hand in this project, which was frustrating on every training run. The corpus is split into 80% training, 10% validation and 10% test data. The test set is balanced (533 negative and 533 positive examples), which was checked before any prompting or fine-tuning was done.

### 7.2 Results

Recommended models:

*   With prompting, SmolLM-135M-Instruct reached 67.82% accuracy on the test set; after fine-tuning, the accuracy rose to 86.30%, which I found quite interesting.
*   BERT-base-cased reached 84.80% accuracy after fine-tuning.

Extra models:

*   SmolLM-360M-Instruct reached 82.55% accuracy with prompting and 88.65% after fine-tuning.
*   DistilBERT-base-cased reached 83.58% accuracy after fine-tuning.
*   A Finnish model was also tested: finnish-modernbert-large reached 87.62% accuracy after fine-tuning. Finnish-modernbert-small was also tested, but its accuracy was only around 80%.
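
To compare the models at a glance, the test accuracies reported in this notebook can be ranked with a few lines of code. This is only a convenience sketch: the numbers are copied from the test-set outputs of Sections 3 to 6, and the 0.5 random baseline is the theoretical value for a balanced binary task.

```python
# Test accuracies copied from the outputs of Sections 3 to 6.
results = [
    ("SmolLM-135M-Instruct (prompting)", 0.6782),
    ("SmolLM-135M-Instruct (fine-tuned)", 0.8630),
    ("BERT-base-cased (fine-tuned)", 0.8480),
    ("SmolLM-360M-Instruct (prompting)", 0.8255),
    ("SmolLM-360M-Instruct (fine-tuned)", 0.8865),
    ("DistilBERT-base-cased (fine-tuned)", 0.8358),
    ("finnish-modernbert-large (fine-tuned)", 0.8762),
]
random_baseline = 0.5  # theoretical value for a balanced binary task

# Sort from best to worst and show the margin over random guessing.
ranked = sorted(results, key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    gain = (acc - random_baseline) * 100
    print(f"{name:<40} {acc:.4f}  (+{gain:.1f} points over random)")
```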



### 7.3 Relation to random baseline / expected performance / state of the art

Compared with the random baseline (~50%), all of my models perform far better. Compared with the expected performance, the results are as follows:

*   SOTA accuracy is around 92%, while the highest accuracy I obtained is 88.65%, for the fine-tuned SmolLM-360M-Instruct model. This is below SOTA but, I would say, still not bad.
*   The fine-tuned BERT-base-cased model reached 84.80%, which is below the expected 87-89% for a pre-trained BERT model.

**I used the Finnish ModernBERT model just to check how it performs, and it did well beyond my expectations.**

---

## 8. Error analysis (group projects only)

(Present the error analysis results here)