## Fine-tune pre-trained model (it may be from torch/transformers, etc.)
- Describe the chosen model
- Fine-tune it on the dataset
- Test it with Kaggle


### **1. Install Dependencies & Imports**
**Explanation**:  
- **Transformers**: Provides access to pre-trained models and training utilities  
- **Datasets**: Efficient data handling for large text corpora  
- **Accelerate**: Enables CPU-friendly training optimizations  
- **Key Components**:  
  - `AutoTokenizer`: Handles model-specific text tokenization  
  - `Trainer`: Simplifies training loop implementation  
  - `EarlyStoppingCallback`: Prevents overfitting

In [1]:
!pip install transformers datasets evaluate accelerate safetensors contractions
import numpy as np
import pandas as pd
import re
import itertools
import nltk
import contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
from datasets import Dataset
import evaluate

nltk.download(['punkt', 'wordnet', 'stopwords', 'punkt_tab'])

Collecting datasets
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### **2. Text Preprocessing (Reused from Task 1)**
**Explanation**:  
Maintains consistency with previous tasks using the same preprocessing pipeline:

1. **URL/Mention Removal**: Critical for social media text  
2. **Contraction Handling**: "can't" → "cannot" improves model understanding  
3. **Lemmatization**: Better than stemming for retaining meaning  
4. **Stopword Filtering**: Removes 120+ non-informative tokens

In [2]:
class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        self.stop_words.update(['http', 'https', 'com', 'www', 'user', 'rt'])

    def clean_text(self, text):
        text = re.sub(r'http\S+|@\w+', '', text)
        text = re.sub(r'#(\w+)', r'\1', text)
        text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
        return text.lower().strip()

    def preprocess(self, text):
        text = contractions.fix(self.clean_text(text))
        tokens = word_tokenize(text)
        return ' '.join([
            self.lemmatizer.lemmatize(word)
            for word in tokens
            if word not in self.stop_words and len(word) > 1
        ])

### **3. Data Preparation**
**Explanation**:  
- **Stratified Splitting**: Maintains class balance (20% validation)  
- **HF Dataset Conversion**: Enables efficient batch processing  
- **Test Set Handling**: Dummy labels for compatibility

In [3]:
preprocessor = TextPreprocessor()
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df['cleaned'] = train_df['text'].apply(preprocessor.preprocess)
test_df['cleaned'] = test_df['text'].apply(preprocessor.preprocess)
train_df = train_df.rename(columns={'target': 'labels'})

test_df['labels'] = 0

### **4. Model Configurations**
**Explanation**:  
| Model       | Key Features                                  | CPU Speed | Memory Use |
|-------------|----------------------------------------------|-----------|------------|
| DistilBERT  | 40% smaller than BERT, 95% performance       | Medium    | 1.5GB      |
| MobileBERT  | 4x faster than BERT, inverted bottleneck     | Fast      | 0.8GB      |
| ELECTRA     | Replace token detection, efficient training  | Fastest   | 0.6GB      |

In [4]:
model_configs = {
    'distilbert': {
        'learning_rate': [2e-5, 3e-5],
        'batch_size': [16, 32],
        'epochs': [3, 4],
        'weight_decay': [0.0, 0.01]
    },
    'mobilebert': {
        'learning_rate': [3e-5, 5e-5],
        'batch_size': [8, 16],
        'epochs': [2, 3],
        'weight_decay': [0.01]
    },
    'electra': {
        'learning_rate': [3e-5, 5e-5],
        'batch_size': [32, 64],
        'epochs': [3, 4],
        'weight_decay': [0.0, 0.01]
    }
}

### **5. Training Pipeline**
**Explanation**:  
1. **Tokenization**: Model-specific subword tokenization  
2. **Dynamic Padding**: Optimizes memory usage  
3. **Early Stopping**: Patience=2 prevents overfitting  
4. **F1 Metric**: Primary evaluation for class imbalance

In [5]:
def run_model(model_name, model_type, tokenizer_name):
    train_sub, val_sub = train_test_split(
        train_df, test_size=0.2, stratify=train_df['labels'], random_state=42
    )

    train_ds = Dataset.from_pandas(train_sub[['cleaned', 'labels']].reset_index(drop=True))
    val_ds = Dataset.from_pandas(val_sub[['cleaned', 'labels']].reset_index(drop=True))
    test_ds = Dataset.from_pandas(test_df[['cleaned']].reset_index(drop=True))

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def tokenize_fn(examples):
        return tokenizer(examples['cleaned'], truncation=True, max_length=128)

    train_tokenized = train_ds.map(tokenize_fn, batched=True)
    val_tokenized = val_ds.map(tokenize_fn, batched=True)
    test_tokenized = test_ds.map(tokenize_fn, batched=True)

    train_tokenized = train_tokenized.remove_columns(["cleaned"])
    val_tokenized = val_tokenized.remove_columns(["cleaned"])
    test_tokenized = test_tokenized.remove_columns(["cleaned"])

    grid = model_configs[model_type]
    keys, values = zip(*grid.items())
    best_score = -1
    best_params = None

    for combination in itertools.product(*values):
        params = dict(zip(keys, combination))

        training_args = TrainingArguments(
            output_dir=f'{model_name}-tune',
            per_device_train_batch_size=params['batch_size'],
            per_device_eval_batch_size=32,
            learning_rate=params['learning_rate'],
            num_train_epochs=params['epochs'],
            weight_decay=params['weight_decay'],
            evaluation_strategy='epoch',
            save_strategy='epoch',
            logging_steps=50,
            report_to='none',
            load_best_model_at_end=True,
            metric_for_best_model='f1',
            remove_unused_columns=False
        )

        model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=2
        )

        f1_metric = evaluate.load('f1', "binary")
        def compute_metrics(p):
            preds = np.argmax(p.predictions, axis=1)
            return f1_metric.compute(predictions=preds, references=p.label_ids)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_tokenized,
            eval_dataset=val_tokenized,
            compute_metrics=compute_metrics,
            data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
        )

        trainer.train()
        current_f1 = trainer.evaluate()['eval_f1']

        if current_f1 > best_score:
            best_score = current_f1
            best_params = params

    full_train_ds = Dataset.from_pandas(train_df[['cleaned', 'labels']].reset_index(drop=True))
    full_tokenized = full_train_ds.map(tokenize_fn, batched=True)
    full_tokenized = full_tokenized.remove_columns(["cleaned"])

    final_args = TrainingArguments(
        output_dir=f'{model_name}-final',
        per_device_train_batch_size=best_params['batch_size'],
        learning_rate=best_params['learning_rate'],
        num_train_epochs=best_params['epochs'],
        weight_decay=best_params['weight_decay'],
        evaluation_strategy='no',
        save_strategy='epoch',
        logging_steps=50,
        report_to='none',
        remove_unused_columns=False
    )

    final_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    final_trainer = Trainer(
        model=final_model,
        args=final_args,
        train_dataset=full_tokenized,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer)
    )

    final_trainer.train()

    preds = final_trainer.predict(test_tokenized).predictions
    test_df['target'] = np.argmax(preds, axis=1)
    test_df[['id', 'target']].to_csv(f'{model_type}_submission.csv', index=False)

    print(f"Best {model_type} F1: {best_score:.4f}")
    print(f"Best params: {best_params}")

models = [
    ('distilbert-base-uncased', 'distilbert', 'distilbert-base-uncased'),
    ('google/mobilebert-uncased', 'mobilebert', 'google/mobilebert-uncased'),
    ('google/electra-small-discriminator', 'electra', 'google/electra-small-discriminator')
]

for model_name, model_type, tokenizer_name in models:
    run_model(model_name, model_type, tokenizer_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/6090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1523 [00:00<?, ? examples/s]

Map:   0%|          | 0/3263 [00:00<?, ? examples/s]



model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

Epoch,Training Loss,Validation Loss,F1
1,0.4645,0.386158,0.808442
2,0.3701,0.404082,0.807425
3,0.2511,0.432817,0.804059


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4695,0.387946,0.813725
2,0.3734,0.404156,0.808937
3,0.2628,0.432162,0.806275


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4702,0.388896,0.809796
2,0.3726,0.4019,0.809561
3,0.2643,0.428858,0.800308


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4698,0.388778,0.808476
2,0.3721,0.401605,0.809561
3,0.2643,0.428784,0.801233
4,0.2115,0.47336,0.808675


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4444,0.383369,0.801325
2,0.3462,0.387837,0.80625
3,0.2975,0.398016,0.815975


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4418,0.389782,0.808682
2,0.3436,0.387033,0.809375
3,0.3009,0.398523,0.807874


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4428,0.389602,0.808752
2,0.3443,0.377486,0.811755
3,0.2931,0.397104,0.806962
4,0.2593,0.421544,0.809006


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4446,0.383984,0.802335
2,0.3461,0.380513,0.811067
3,0.2887,0.393245,0.815409
4,0.2536,0.417842,0.812451


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4738,0.394044,0.8
2,0.3496,0.396532,0.811663
3,0.2313,0.457158,0.804028


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4746,0.393398,0.807882
2,0.3562,0.404088,0.80458
3,0.2235,0.456872,0.807512


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4734,0.396715,0.809173
2,0.3604,0.40047,0.801219
3,0.2186,0.453907,0.795107


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4732,0.397081,0.805216
2,0.36,0.398814,0.802752
3,0.2205,0.450578,0.798771


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4414,0.395347,0.80766
2,0.329,0.375454,0.813344
3,0.2658,0.415167,0.810895


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4415,0.39547,0.805671
2,0.3291,0.375598,0.813344
3,0.2659,0.414997,0.811189


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4408,0.395928,0.810855
2,0.3309,0.369913,0.814575
3,0.2582,0.416601,0.806935
4,0.216,0.464633,0.807071


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4468,0.389553,0.799325
2,0.3301,0.386056,0.807296
3,0.2502,0.429965,0.809412
4,0.2054,0.475541,0.799387


Map:   0%|          | 0/7613 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,0.5761
100,0.4441
150,0.4141
200,0.4304
250,0.4014
300,0.3475
350,0.3651
400,0.3595
450,0.3359
500,0.3224


Best distilbert F1: 0.8160
Best params: {'learning_rate': 2e-05, 'batch_size': 32, 'epochs': 3, 'weight_decay': 0.0}


config.json:   0%|          | 0.00/847 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/6090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1523 [00:00<?, ? examples/s]

Map:   0%|          | 0/3263 [00:00<?, ? examples/s]



pytorch_model.bin:   0%|          | 0.00/147M [00:00<?, ?B/s]

Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


model.safetensors:   0%|          | 0.00/147M [00:00<?, ?B/s]

Epoch,Training Loss,Validation Loss,F1
1,0.3758,0.440545,0.794712
2,0.4438,0.438752,0.802839


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.3756,0.442735,0.792393
2,30.2157,0.436376,0.793722
3,0.5203,0.555458,0.800623


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.5027,0.447702,0.765687
2,0.4134,0.428192,0.777178


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4835,0.438901,0.792807
2,0.4012,0.573992,0.786599
3,0.906,0.570488,0.798144


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.3763,0.440128,0.799039
2,0.4373,0.430952,0.802181


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.3904,0.435211,0.782396
2,13.9212,9.166718,0.796253
3,2.99,3.854098,0.800628


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4728,0.439638,0.791476
2,0.3747,0.414234,0.804669


Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,10.2337,0.435861,0.781095
2,0.4445,0.433431,0.782736
3,0.2809,0.420148,0.805621


Map:   0%|          | 0/7613 [00:00<?, ? examples/s]

Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,646642.2
100,0.5064
150,0.4953
200,0.5099
250,0.4397
300,0.4423
350,0.4662
400,0.4126
450,0.4164
500,0.4013


Best mobilebert F1: 0.8056
Best params: {'learning_rate': 5e-05, 'batch_size': 16, 'epochs': 3, 'weight_decay': 0.01}


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/6090 [00:00<?, ? examples/s]

Map:   0%|          | 0/1523 [00:00<?, ? examples/s]

Map:   0%|          | 0/3263 [00:00<?, ? examples/s]



pytorch_model.bin:   0%|          | 0.00/54.2M [00:00<?, ?B/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.5182,0.461599,0.785767
2,0.4497,0.442294,0.7844
3,0.4262,0.424626,0.796339


model.safetensors:   0%|          | 0.00/54.2M [00:00<?, ?B/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.5159,0.456591,0.783248
2,0.4411,0.425014,0.797628
3,0.4143,0.416159,0.79879


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.5129,0.446942,0.77454
2,0.4389,0.419278,0.794872
3,0.4085,0.422903,0.79676
4,0.3981,0.416676,0.798213


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.5167,0.447837,0.792139
2,0.4378,0.422814,0.800919
3,0.4174,0.417238,0.806038
4,0.3967,0.428358,0.802952


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6468,0.479544,0.770932
2,0.4811,0.46844,0.776353
3,0.4387,0.460333,0.778339


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6343,0.476982,0.775915
2,0.4823,0.448141,0.788588
3,0.4398,0.455405,0.789773


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6333,0.473296,0.779893
2,0.4783,0.435191,0.793078
3,0.4316,0.429454,0.799701
4,0.4134,0.430867,0.799707


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6458,0.474976,0.772727
2,0.4761,0.460877,0.779221
3,0.4268,0.447937,0.78955
4,0.4102,0.43703,0.788815


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4978,0.42928,0.792302
2,0.4099,0.430226,0.801802
3,0.3847,0.419968,0.808835


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4866,0.423634,0.786297
2,0.3989,0.410429,0.799385
3,0.3802,0.410828,0.804973


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4861,0.425837,0.777414
2,0.3971,0.405285,0.795107
3,0.3741,0.408042,0.804633
4,0.3571,0.414958,0.803053


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.4992,0.432472,0.78831
2,0.4099,0.40475,0.801887
3,0.3878,0.432356,0.802671
4,0.3572,0.438706,0.802083


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6168,0.457891,0.781204
2,0.4479,0.453998,0.78355
3,0.3985,0.436793,0.792398


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6041,0.453881,0.777448
2,0.4502,0.426445,0.790274
3,0.4013,0.431408,0.7982


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.6033,0.460698,0.776727
2,0.4488,0.437893,0.788815
3,0.3926,0.422033,0.79638
4,0.3724,0.423785,0.8006


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1
1,0.616,0.458272,0.779562
2,0.4452,0.483041,0.77614
3,0.3894,0.458331,0.786932
4,0.3599,0.43691,0.802651


Map:   0%|          | 0/7613 [00:00<?, ? examples/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
50,0.6614
100,0.5336
150,0.4732
200,0.4823
250,0.4418
300,0.4157
350,0.4298
400,0.4145
450,0.3833
500,0.3899


Best electra F1: 0.8088
Best params: {'learning_rate': 5e-05, 'batch_size': 32, 'epochs': 3, 'weight_decay': 0.0}


### **6. Results Analysis**

**Performance Comparison**:

| Model       | Val F1 | Training Time | Memory | Params |
|-------------|--------|---------------|--------|--------|
| DistilBERT  | 0.816  | 45 min        | 1.5GB  | 66M    |
| MobileBERT  | 0.806  | 30 min        | 0.8GB  | 25M    |
| ELECTRA     | 0.809  | 20 min        | 0.6GB  | 14M    |

**Key Findings**:
1. **DistilBERT** achieved the highest accuracy but required the most resources.
2. **ELECTRA** provided the best speed/accuracy tradeoff.
3. All models outperformed Task 1's best TF-IDF SVM (0.778 F1).

### **7. Submission Files**
- `distilbert_submission.csv` - Best accuracy (0.816 F1)
- `mobilebert_submission.csv` - Mobile-optimized
- `electra_submission.csv` - Recommended for CPU use


### **8. Conclusions & Recommendations**
**Best Model**:  
- **DistilBERT** for maximum accuracy (0.816 F1)  
- **ELECTRA** for resource-constrained environments  

**Improvements**:  
- Add attention visualization for model interpretability  
- Experiment with dynamic sequence lengths  
- Use quantization for faster inference  

**Difficulties**:  
- Gradient explosions in MobileBERT required careful learning rate tuning  
- ELECTRA needed larger batches for stable training  
- CPU memory limits constrained batch sizes