# **Introduction**

This notebook focuses on the fine-tuning of a pre-trained Transformer model (DistilBERT) to classify eco-friendly product descriptions using the dataset amazon_eco-friendly_products.csv. Fine-tuning allows a language model that has already learned general language representations to adapt its understanding for a specific task—in this case, identifying whether a product is environmentally friendly based on its textual details.

The dataset consists of various product attributes such as titles, materials, and descriptions. These fields are combined and processed using the DistilBERT tokenizer to convert natural language text into tokens that can be interpreted by the model. The task is treated as a binary text classification problem: predicting whether a product is eco-friendly (1) or not eco-friendly (0).

This notebook represents the “Training Strategy Experiments” portion of the fine-tuning exercise. The experiments systematically adjust key hyperparameters that affect model learning behavior:

***1. Number of epochs*** – controls how many times the model passes through the training data.

***2. Learning rate*** – determines how quickly the model updates its weights during training.

***3. Batch size*** – defines how many samples are processed before updating model weights.

By varying these parameters, the notebook aims to identify the optimal combination that yields the highest F1-Score and Accuracy on the evaluation dataset. The results will help demonstrate how training strategies directly influence the model’s performance and generalization ability.

The structure of this notebook includes:

1. Setup and Data Preparation
2. Tokenizer and Model Loading
3. Training Configuration and Metrics Definition
4. Fine-Tuning Experiments (three configurations tested)
5. Evaluation and Comparison of Results

# **F2**

# **1. Setup and Data Preparation**

In [2]:
!pip install transformers datasets torch scikit-learn pandas openpyxl

import pandas as pd
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import numpy as np



This code installs and imports the libraries that I will use in my fine-tuning project. The transformers library provides the DistilBERT model and tokenizer, datasets wraps the text data in a format the Trainer accepts, torch runs the model and its training, scikit-learn splits the data and computes accuracy and F1 score to see how good the model is, pandas reads and handles the dataset from the CSV file, and openpyxl lets pandas save my results to an Excel file later. I import all of this first so that the rest of the cells can run properly.

In [3]:
# Check for GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

Using GPU: Tesla T4


This code checks whether my computer or Colab session has a GPU available. A GPU is much faster than a CPU when training models like DistilBERT. If a GPU is available, the code prints its name and uses it for training; if not, it falls back to the CPU, which is slower. The purpose of this code is to make sure that the model trains as fast as the hardware allows. In this run, the session used a Tesla T4.

In [4]:
# Load your dataset
df = pd.read_csv("amazon_eco-friendly_products.csv")

# Display sample data
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (3587, 14)


Unnamed: 0,id,title,name,category,material,brand,price,rating,reviewsCount,description,url,img_url,inStock,inStockText
0,B0CWH366KJ,"Agfabric Natural Jute Erosion Control, 16yard(...",Weed Barrier Fabric,"Patio, Lawn & Garden",,Agfabric,$87.3,,,Protect your yard and garden with our biodegra...,https://www.amazon.com/dp/B0CWH366KJ,https://m.media-amazon.com/images/I/71t3FD5KjH...,True,Only 5 left in stock - order soon.
1,B086L692VC,SAFAVIEH Braided Collection 4' Round Light Blu...,Area Rugs,Home & Kitchen,"50%jute, 25% Wool, 25% Cotton",Safavieh,$40.63,4.2,59.0,Country style is perfect for a casual cottage ...,https://www.amazon.com/dp/B086L692VC,https://m.media-amazon.com/images/I/A1Q73Cheh2...,True,Only 3 left in stock - order soon.
2,B01J6JELTG,Eyeseals 4.0 Sleep Mask – Clear – Moisturizing...,Sleeping Masks,Health & Household,Plastic,EYEECO,$65.95,3.7,1075.0,Locks moisture in: Eyeseals 4.0 eye mask for d...,https://www.amazon.com/dp/B01J6JELTG,https://m.media-amazon.com/images/I/61Uz393xlp...,True,
3,B07HQSKK36,Lucky Monet 25/50/100PCS Burlap Gift Bags Wedd...,Gift Bags,Health & Household,Burlap,Lucky Monet,$29.99,4.6,2492.0,❤ Premium Burlap Material❤ These small burlap ...,https://www.amazon.com/dp/B07HQSKK36,https://m.media-amazon.com/images/I/71DrHIU1aM...,True,In Stock In Stock
4,B0C3Y8WJDR,St. Boniface Bag Company | Burlap Bags - Size:...,Grow Bags,"Patio, Lawn & Garden",5.0 Count,Generic,$29.99,4.4,11.0,100% Burlap > 100% BIODEGRADABLE AND ECO FRIEN...,https://www.amazon.com/dp/B0C3Y8WJDR,https://m.media-amazon.com/images/I/81q3el899U...,True,In Stock


This code loads my dataset, amazon_eco-friendly_products.csv, using pandas. The pd.read_csv() call reads the CSV file and stores it in a DataFrame named df. Then, df.shape shows how many rows and columns the dataset has (3,587 rows and 14 columns), and df.head() displays the first five rows so I can check that the data was loaded correctly. The purpose of this code is to make sure that the dataset is read properly and to preview the kind of data I will be working with.

In [5]:
# Combine textual fields into a single text column
df['text'] = df['title'].astype(str) + " " + df['material'].astype(str) + " " + df['description'].astype(str)

# Dummy binary labels for demonstration:
# If description contains eco/sustainable terms → 1 (eco-friendly), else 0
df['label'] = df['description'].str.contains("eco|recycl|sustain|biodegrad|organic", case=False, na=False).astype(int)

print(df['label'].value_counts())
df[['text','label']].head()

label
1    2257
0    1330
Name: count, dtype: int64


Unnamed: 0,text,label
0,"Agfabric Natural Jute Erosion Control, 16yard(...",1
1,SAFAVIEH Braided Collection 4' Round Light Blu...,0
2,Eyeseals 4.0 Sleep Mask – Clear – Moisturizing...,1
3,Lucky Monet 25/50/100PCS Burlap Gift Bags Wedd...,0
4,St. Boniface Bag Company | Burlap Bags - Size:...,1


This code combines the title, material, and description of each product into a single column called “text” so that the model can analyze everything together instead of separately. Then it creates a dummy label for training: if the description contains any of the fragments “eco”, “recycl”, “sustain”, “biodegrad”, or “organic” (case-insensitive), the row is marked 1 (eco-friendly), otherwise 0 (not eco-friendly). Because these are substring matches, “eco” also matches unrelated words such as “decor” or “second”, so the labels are only a rough proxy. The value_counts() output shows 2,257 eco-friendly and 1,330 non-eco-friendly samples, while head() displays a few examples of the new text and label columns. Note that the label is derived from the description, which is also part of the text column, so the model can score well largely by picking up the same keywords. The purpose of this code is to prepare the dataset in a format that can be used for text classification.
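
To make the substring behavior of the labeling rule concrete, here is a minimal sketch that uses Python's standard `re` module in place of pandas' `str.contains`. The helper name `keyword_label` and the sample descriptions are hypothetical; they are not taken from the dataset.

```python
import re

# Same pattern as the labeling cell; re.IGNORECASE mirrors case=False.
ECO_PATTERN = re.compile(r"eco|recycl|sustain|biodegrad|organic", re.IGNORECASE)


def keyword_label(description):
    """Return 1 if any keyword fragment occurs anywhere in the text, else 0."""
    if description is None:  # mirrors na=False: missing descriptions get label 0
        return 0
    return int(ECO_PATTERN.search(description) is not None)


# Hypothetical descriptions for illustration only.
samples = [
    "Made from 100% recycled paper",    # intended match ("recycl")
    "Modern home decor accent rug",     # accidental match: "decor" contains "eco"
    "Stainless steel water bottle",     # no keyword fragment
    None,                               # missing description
]
sample_labels = [keyword_label(text) for text in samples]
print(sample_labels)
```

The second sample shows why the dummy labels should be read as a demonstration target and not as ground truth.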

In [6]:
# Split into training and evaluation sets
train_texts, eval_texts, train_labels, eval_labels = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

train_data = pd.DataFrame({'text': train_texts, 'labels': train_labels})
eval_data = pd.DataFrame({'text': eval_texts, 'labels': eval_labels})

This code splits the dataset into two parts: one for training the model and one for evaluation. It uses an 80-20 split, meaning 80% of the data is used to train the model while 20% is held out to test how well it performs. The stratify=df['label'] argument makes sure that both sets keep the same ratio of eco-friendly to non-eco-friendly samples as the full dataset (about 63% to 37%); it does not balance the classes. After splitting, the code creates two new DataFrames, train_data and eval_data, which contain the text and the corresponding labels. This step is important because it lets the model learn from one part of the data and then be tested on another to check whether it generalizes well.
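
The effect of stratification can be illustrated with a small hand-rolled split. This is a simplified stand-in written for illustration, not scikit-learn's implementation, so the exact set sizes may differ by a sample from `train_test_split`; the function name `stratified_split` is hypothetical. The class counts are the ones printed by value_counts() above.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each class contributes about test_fraction to the eval set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, eval_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_eval = round(len(indices) * test_fraction)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return train_idx, eval_idx


# 2,257 eco-friendly (1) and 1,330 non-eco-friendly (0) rows, as in the notebook.
labels = [1] * 2257 + [0] * 1330
train_idx, eval_idx = stratified_split(labels)


def positive_share(indices):
    """Fraction of label-1 rows among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


overall_share = sum(labels) / len(labels)
print(f"overall={overall_share:.3f} "
      f"train={positive_share(train_idx):.3f} "
      f"eval={positive_share(eval_idx):.3f}")
```

All three shares come out at roughly 0.63, which is what "stratified" means here: the class ratio is preserved, not equalized.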

# **2. Load Tokenizer and Model**

In [7]:
MODEL_NAME = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=128)

# Convert to Hugging Face dataset format
from datasets import Dataset
train_ds = Dataset.from_pandas(train_data)
eval_ds = Dataset.from_pandas(eval_data)

tokenized_train = train_ds.map(tokenize_function, batched=True)
tokenized_eval = eval_ds.map(tokenize_function, batched=True)

model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

Map:   0%|          | 0/2869 [00:00<?, ? examples/s]

Map:   0%|          | 0/718 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This code sets up the DistilBERT tokenizer and model so the text data can be processed before training. MODEL_NAME selects the distilbert-base-uncased checkpoint; DistilBERT is a smaller and faster version of BERT that still gives good accuracy. The tokenizer converts each combined product text into tokens, basically turning words into numbers that the model can understand. The function tokenize_function() applies this to the text in batches: truncation=True with max_length=128 cuts any text longer than 128 tokens, and padding=True pads the shorter texts up to the longest one in the batch, so no sequence is longer than 128 tokens. Then, the training and evaluation data are turned into Hugging Face Dataset objects (2,869 and 718 examples) so they can work smoothly with the Trainer. After tokenizing everything, the DistilBertForSequenceClassification model is loaded with two labels (0 for non-eco-friendly and 1 for eco-friendly) and sent to the device (GPU or CPU). The message about newly initialized weights is expected, because the classification head is not part of the pre-trained checkpoint and is learned during fine-tuning. This step basically gets the data and model ready for training.
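
The combined effect of truncation and padding can be sketched without the tokenizer itself. The toy function below works on lists of placeholder token IDs and is only an illustration of the idea: the real tokenizer also handles special tokens when truncating, and `pad_id=0` and the function name are assumptions made for this example.

```python
def pad_and_truncate(batch, max_length=128, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder IDs; a small max_length keeps the output readable.
toy_batch = [
    [101, 2054, 102],
    [101, 2023, 2003, 1037, 2146, 3793, 102],
    [101, 102],
]
encoded = pad_and_truncate(toy_batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which here is the truncation limit because one sequence was longer than it.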

# **3. Define Metrics and Baseline Trainer**

In [8]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1}

This part of the code defines how the model’s performance will be measured after training. The function compute_metrics() takes the model’s predictions (p.predictions) and compares them to the real labels (p.label_ids). It finds which class (0 or 1) the model predicted most confidently using np.argmax(). Then, it calculates two important metrics — accuracy, which tells how many predictions were correct, and F1-score, which balances precision and recall to give a better measure of performance, especially if the classes are imbalanced. In short, this function helps us know how well the model is doing during evaluation.
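
As a worked example of what compute_metrics() calculates, the sketch below reproduces the argmax, accuracy, and binary F1 steps in plain Python on a handful of made-up logits. The helper names and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(labels, preds, positive=1):
    """Accuracy over all samples and F1 for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up logits for six samples: column 0 = not eco-friendly, column 1 = eco-friendly.
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9], [0.0, 2.2]]
labels = [1, 1, 1, 0, 0, 1]

preds = argmax_rows(logits)
metrics = accuracy_and_f1(labels, preds)
print(preds, metrics)
```

Here three positives are found, one is missed, and one negative is wrongly flagged, so precision and recall are both 3/4 and the F1 score is 0.75, while accuracy is 4/6.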

In [9]:
# Baseline Experiment (3 epochs, LR=5e-5, batch=16)
training_args = TrainingArguments(
    output_dir="./results_exp1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    eval_strategy="epoch", # Set eval_strategy to "epoch"
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 1 (Baseline)…")
trainer.train()
exp1_results = trainer.evaluate()
print(exp1_results)

Running Experiment 1 (Baseline)…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.325042,0.842618,0.882902
2,No log,0.299017,0.848189,0.875713
3,0.312500,0.396289,0.846797,0.878049


{'eval_loss': 0.2990168333053589, 'eval_accuracy': 0.8481894150417827, 'eval_f1': 0.8757126567844926, 'eval_runtime': 0.8714, 'eval_samples_per_second': 823.959, 'eval_steps_per_second': 51.641, 'epoch': 3.0}


This code runs the baseline experiment, which is the first training setup used to compare later changes. It starts by setting up the TrainingArguments, where we define how the model will train: num_train_epochs=3 means the model goes through the training data three times, learning_rate=5e-5 controls how large each weight update is, and per_device_train_batch_size=16 means it processes 16 samples per training step. The code also evaluates and saves the model at the end of every epoch, and if a GPU is available, it uses faster fp16 mixed precision. Because load_best_model_at_end=True is set without a metric_for_best_model, the Trainer reloads the checkpoint with the lowest validation loss when training finishes.

Then, a Trainer object is created, which connects the model, data, and training settings together. Finally, trainer.train() actually trains the model, and trainer.evaluate() checks how well it performs on the evaluation set. The printed results therefore match epoch 2 in the table (validation loss 0.2990, accuracy 0.8482, F1 0.8757), not the final epoch, and they serve as the baseline before we change any hyperparameters.
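
The “No log” entries in the Training Loss column of the baseline table can be explained with a little arithmetic. The sketch below assumes the Trainer's default of logging the training loss every 500 optimizer steps (logging_steps is not set in this cell), a single device, and no gradient accumulation; the helper name `logging_schedule` is hypothetical. The training set size of 2,869 comes from the tokenization output above.

```python
import math


def logging_schedule(num_samples, batch_size, epochs, logging_steps=500):
    """Return steps per epoch, total steps, and the epochs in which a loss is logged."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    logged_epochs = [
        math.ceil(step / steps_per_epoch)
        for step in range(logging_steps, total_steps + 1, logging_steps)
    ]
    return steps_per_epoch, total_steps, logged_epochs


# Baseline: 2,869 training samples, batch size 16, 3 epochs.
steps_per_epoch, total_steps, logged_epochs = logging_schedule(2869, 16, 3)
print(steps_per_epoch, total_steps, logged_epochs)
```

Under these assumptions the first training loss is recorded at step 500, which falls inside the third epoch, so the first two rows of the table have nothing to show yet.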

# **4. Hyperparameter Experiments (Training Strategy)**

In [10]:
training_args2 = TrainingArguments(
    output_dir="./results_exp2",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer2 = Trainer(
    model=model,
    args=training_args2,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 2…")
trainer2.train()
exp2_results = trainer2.evaluate()
print(exp2_results)

Running Experiment 2…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.380963,0.852368,0.888186
2,No log,0.598659,0.846797,0.878319
3,0.113300,0.674007,0.852368,0.88651
4,0.113300,0.809928,0.857939,0.890323
5,0.113300,0.859554,0.860724,0.887387


{'eval_loss': 0.3809630870819092, 'eval_accuracy': 0.8523676880222841, 'eval_f1': 0.8881856540084389, 'eval_runtime': 0.9451, 'eval_samples_per_second': 759.708, 'eval_steps_per_second': 95.228, 'epoch': 5.0}


This code runs the second experiment, where we change some hyperparameters to see if the model can perform better. In this setup, the number of epochs is increased from 3 to 5, meaning the model trains longer and can possibly learn more patterns from the data. The learning rate is also lowered from 5e-5 to 3e-5, so the model updates its weights in smaller steps, which can make training more stable. The training batch size, per-epoch evaluation, and best-checkpoint loading stay the same, but weight_decay and per_device_eval_batch_size are not set here, so they fall back to the library defaults. Note that trainer2 receives the same model object that Experiment 1 already fine-tuned, so this run continues from those weights instead of starting from the pre-trained checkpoint; the very low training loss (0.1133) and the validation loss that rises from 0.381 to 0.860 across epochs are consistent with that. The printed results come from the checkpoint with the lowest validation loss (epoch 1: accuracy 0.8524, F1 0.8882), which is slightly above the baseline, although the comparison is not independent of Experiment 1.
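
The relationship between the per-epoch table and the final evaluate() output can be reproduced in a few lines. The sketch assumes the Trainer's default behavior of ranking checkpoints by validation loss when load_best_model_at_end=True is set without a metric_for_best_model; the numbers are copied from the Experiment 2 table above.

```python
# Per-epoch evaluation results from the Experiment 2 output table.
epoch_metrics = [
    {"epoch": 1, "eval_loss": 0.380963, "accuracy": 0.852368, "f1": 0.888186},
    {"epoch": 2, "eval_loss": 0.598659, "accuracy": 0.846797, "f1": 0.878319},
    {"epoch": 3, "eval_loss": 0.674007, "accuracy": 0.852368, "f1": 0.886510},
    {"epoch": 4, "eval_loss": 0.809928, "accuracy": 0.857939, "f1": 0.890323},
    {"epoch": 5, "eval_loss": 0.859554, "accuracy": 0.860724, "f1": 0.887387},
]

# Checkpoint that would be reloaded when ranking by lowest validation loss.
best_by_loss = min(epoch_metrics, key=lambda row: row["eval_loss"])

# Checkpoint that would be picked if F1 were the selection criterion instead.
best_by_f1 = max(epoch_metrics, key=lambda row: row["f1"])

print("lowest loss:", best_by_loss["epoch"], best_by_loss["f1"])
print("highest F1: ", best_by_f1["epoch"], best_by_f1["f1"])
```

The lowest-loss checkpoint is epoch 1, which matches the printed eval_loss of 0.381, while the highest F1 occurs at epoch 4. If selecting by F1 is preferred, TrainingArguments accepts metric_for_best_model="f1" together with greater_is_better=True; that is a suggestion, not what this notebook ran.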

In [11]:
training_args3 = TrainingArguments(
    output_dir="./results_exp3",
    num_train_epochs=8,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer3 = Trainer(
    model=model,
    args=training_args3,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 3…")
trainer3.train()
exp3_results = trainer3.evaluate()
print(exp3_results)

Running Experiment 3…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.568939,0.860724,0.896694
2,0.233200,0.605348,0.860724,0.893617
3,0.107300,0.593819,0.883008,0.908297
4,0.107300,0.911981,0.881616,0.907909
5,0.039600,0.999452,0.877437,0.90287
6,0.003300,1.049279,0.881616,0.9069
7,0.002100,1.024863,0.880223,0.905495
8,0.002100,1.03437,0.881616,0.906696


{'eval_loss': 0.5689390301704407, 'eval_accuracy': 0.8607242339832869, 'eval_f1': 0.8966942148760331, 'eval_runtime': 0.9609, 'eval_samples_per_second': 747.197, 'eval_steps_per_second': 93.66, 'epoch': 8.0}


This code performs the third experiment, where the training setup is changed again to test another combination of hyperparameters. Here, the model is trained for 8 epochs, which means it goes through the entire training set more times. The batch size is reduced to 8, meaning the model processes fewer samples per training step; this doubles the number of weight updates per epoch and makes each update noisier, which sometimes helps generalization. The learning rate is set back to 5e-5, so each update is larger than in Experiment 2. As in Experiment 2, trainer3 reuses the model object left over from the previous runs, so training does not start from the pre-trained checkpoint. The table shows the training loss falling to about 0.002 while the validation loss climbs from 0.569 to above 1.0, a sign of overfitting, and the printed results come from the lowest-validation-loss checkpoint (epoch 1: accuracy 0.8607, F1 0.8967) even though epoch 3 reached a higher F1 (0.9083).

# **5. Save and Compare Results**

In [12]:
results = pd.DataFrame([
    {"Experiment": "Exp1 (3ep, 5e-5, 16)", **exp1_results},
    {"Experiment": "Exp2 (5ep, 3e-5, 16)", **exp2_results},
    {"Experiment": "Exp3 (8ep, 5e-5, 8)", **exp3_results},
])
results.to_excel("Lequin_training_experiments.xlsx", index=False)
results

Unnamed: 0,Experiment,eval_loss,eval_accuracy,eval_f1,eval_runtime,eval_samples_per_second,eval_steps_per_second,epoch
0,"Exp1 (3ep, 5e-5, 16)",0.299017,0.848189,0.875713,0.8714,823.959,51.641,3.0
1,"Exp2 (5ep, 3e-5, 16)",0.380963,0.852368,0.888186,0.9451,759.708,95.228,5.0
2,"Exp3 (8ep, 5e-5, 8)",0.568939,0.860724,0.896694,0.9609,747.197,93.66,8.0


This code collects the results from the three experiments into a single table using pandas. Each experiment’s settings (number of epochs, learning rate, and batch size) are written into the “Experiment” column, and every value returned by evaluate() (loss, accuracy, F1, and runtime statistics) is added next to it. The code then saves this table as an Excel file named “Lequin_training_experiments.xlsx” so it can be used later for reporting or comparison. Finally, the results expression displays the summary directly in the notebook, making it easier to see which experiment performed best: in this run Exp3 has the highest accuracy (0.8607) and F1 (0.8967), and Exp1 has the lowest validation loss (0.2990).

________________________________________________________________________________

# **F3**

## Hyperparameter Optimization

This section adds **Grid Search** and **Random Search** methods to optimize the hyperparameters of the DistilBERT model fine-tuned in Exercise F2.


### Grid Search

In [16]:
def train_and_evaluate_model(learning_rate, batch_size, epochs, output_dir_suffix=""):
    # Re-initialize the model to ensure a fresh start for each trial.
    # This is crucial for hyperparameter search to compare different settings fairly.
    model_for_current_run = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

    training_args = TrainingArguments(
        output_dir=f"./results_search/{output_dir_suffix}",
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        fp16=torch.cuda.is_available(),
        report_to=[],
        logging_dir='./logs',
        logging_steps=10,
    )

    trainer = Trainer(
        model=model_for_current_run,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
    )

    trainer.train()
    eval_results = trainer.evaluate()
    # Return only the relevant metrics for hyperparameter search comparison
    return {'f1': eval_results.get('eval_f1'), 'accuracy': eval_results.get('eval_accuracy')}

This code defines a function named train_and_evaluate_model that trains and evaluates a DistilBERT model for sequence classification using specified hyperparameters such as learning rate, batch size, and number of epochs. Each time the function runs, it reinitializes the model to ensure fair and independent evaluation of each hyperparameter combination. The function configures training parameters through the TrainingArguments class, including settings for evaluation, model saving, and mixed precision if GPU support is available. A Trainer object handles the training and evaluation processes using tokenized training and evaluation datasets, a defined metric computation function, and the model’s tokenizer. After training, the function evaluates the model and returns key performance metrics—F1 score and accuracy—useful for comparing results during hyperparameter tuning.

In [18]:
import itertools
import pandas as pd

# Define hyperparameter grid
param_grid = {
    'learning_rate': [5e-5, 3e-5, 2e-5],
    'batch_size': [8, 16],
    'epochs': [2, 3, 4]
}

grid_results = []
for lr, bs, ep in itertools.product(param_grid['learning_rate'], param_grid['batch_size'], param_grid['epochs']):
    output_dir_name = f"grid_lr{lr}_bs{bs}_ep{ep}"
    print(f'Running Grid Search combo: lr={lr}, bs={bs}, epochs={ep}')
    result = train_and_evaluate_model(learning_rate=lr, batch_size=bs, epochs=ep, output_dir_suffix=output_dir_name)
    grid_results.append({
        'learning_rate': lr,
        'batch_size': bs,
        'epochs': ep,
        'f1': result.get('f1', None),
        'accuracy': result.get('accuracy', None)
    })

grid_results_df = pd.DataFrame(grid_results)
display(grid_results_df.sort_values(by='f1', ascending=False))

Running Grid Search combo: lr=5e-05, bs=8, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3328,0.303345,0.852368,0.885776
2,0.355,0.313144,0.874652,0.903017


Running Grid Search combo: lr=5e-05, bs=8, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.357,0.315602,0.837047,0.865979
2,0.3987,0.327667,0.860724,0.892934
3,0.2305,0.482123,0.859331,0.889858


Running Grid Search combo: lr=5e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3403,0.301555,0.849582,0.882863
2,0.2725,0.330479,0.867688,0.90135
3,0.1288,0.496995,0.855153,0.882086
4,0.1476,0.576545,0.869081,0.897826


Running Grid Search combo: lr=5e-05, bs=16, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4169,0.33735,0.837047,0.880734
2,0.2821,0.310761,0.844011,0.881104


Running Grid Search combo: lr=5e-05, bs=16, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4169,0.332479,0.842618,0.878885
2,0.2809,0.3098,0.844011,0.873874
3,0.218,0.381466,0.852368,0.884026


Running Grid Search combo: lr=5e-05, bs=16, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4178,0.330768,0.846797,0.886831
2,0.3085,0.286881,0.859331,0.886389
3,0.2157,0.354341,0.862117,0.891089
4,0.0596,0.484568,0.867688,0.895719


Running Grid Search combo: lr=3e-05, bs=8, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3175,0.32399,0.842618,0.874584
2,0.3103,0.328483,0.855153,0.887931


Running Grid Search combo: lr=3e-05, bs=8, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3573,0.322286,0.841226,0.875546
2,0.3524,0.352899,0.866295,0.899791
3,0.1525,0.460169,0.866295,0.895197


Running Grid Search combo: lr=3e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3718,0.321834,0.842618,0.876772
2,0.3407,0.291328,0.871866,0.899344
3,0.1665,0.484684,0.856546,0.885173
4,0.1293,0.56584,0.871866,0.899563


Running Grid Search combo: lr=3e-05, bs=16, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4022,0.334153,0.852368,0.887473
2,0.2926,0.317364,0.844011,0.880342


Running Grid Search combo: lr=3e-05, bs=16, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4247,0.336973,0.850975,0.88507
2,0.2899,0.313589,0.845404,0.875421
3,0.2589,0.34806,0.860724,0.89011


Running Grid Search combo: lr=3e-05, bs=16, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3825,0.318503,0.845404,0.881789
2,0.2498,0.333373,0.848189,0.88728
3,0.2314,0.374926,0.849582,0.885835
4,0.0924,0.41793,0.856546,0.886439


Running Grid Search combo: lr=2e-05, bs=8, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3436,0.320807,0.846797,0.878319
2,0.3404,0.329704,0.852368,0.884783


Running Grid Search combo: lr=2e-05, bs=8, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3689,0.334922,0.85376,0.885496
2,0.3678,0.367264,0.86351,0.89749
3,0.1653,0.419306,0.86351,0.894168


Running Grid Search combo: lr=2e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.333,0.326991,0.839833,0.870641
2,0.3335,0.314252,0.867688,0.900938
3,0.1497,0.444689,0.870474,0.899023
4,0.0865,0.514784,0.869081,0.898925


Running Grid Search combo: lr=2e-05, bs=16, epochs=2


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4172,0.353635,0.842618,0.880927
2,0.3227,0.331619,0.849582,0.885593


Running Grid Search combo: lr=2e-05, bs=16, epochs=3


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.4399,0.357518,0.842618,0.881427
2,0.3203,0.325185,0.846797,0.875566
3,0.2825,0.337207,0.857939,0.888158


Running Grid Search combo: lr=2e-05, bs=16, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3957,0.332786,0.839833,0.879833
2,0.3442,0.301041,0.848189,0.874279
3,0.2702,0.336116,0.850975,0.886291
4,0.1337,0.360484,0.859331,0.887902


index,learning_rate,batch_size,epochs,f1,accuracy
14,2e-05,8,4,0.900938,0.867688
8,3e-05,8,4,0.899344,0.871866
5,5e-05,16,4,0.886389,0.859331
0,5e-05,8,2,0.885776,0.852368
15,2e-05,16,2,0.885593,0.849582
13,2e-05,8,3,0.885496,0.85376
2,5e-05,8,4,0.882863,0.849582
11,3e-05,16,4,0.881789,0.845404
3,5e-05,16,2,0.881104,0.844011
9,3e-05,16,2,0.880342,0.844011


This cell runs a grid search over the DistilBERT training hyperparameters. It defines a parameter grid of learning rates, batch sizes, and epoch counts, then uses itertools.product to enumerate every combination. Each combination is passed to train_and_evaluate_model, which returns the F1 score and accuracy; these are appended to a list together with the hyperparameters that produced them. Once every combination has run, the list is converted to a pandas DataFrame and displayed in descending order of F1 score. In the table above, the highest F1 score (0.900938) comes from lr=2e-05, batch size 8, and 4 epochs, while the highest accuracy (0.871866) comes from lr=3e-05 with the same batch size and epoch count.
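
The enumeration step described above can be sketched in isolation. This is a minimal illustration, not the notebook's actual cell: the grid values are inferred from the hyperparameters and row indices in the result table, and `evaluate_stub` is a hypothetical stand-in for train_and_evaluate_model that returns dummy metrics so the loop runs without training anything.

```python
import itertools

# Grid values inferred from the result table (an assumption; the cell that
# defines the real grid is not shown in this section).
param_grid = {
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "batch_size": [8, 16],
    "epochs": [2, 3, 4],
}


def evaluate_stub(learning_rate, batch_size, epochs):
    """Hypothetical placeholder for train_and_evaluate_model (no training)."""
    return {"f1": 0.0, "accuracy": 0.0}


def run_grid(grid, evaluate):
    """Call `evaluate` once per combination and collect params plus metrics."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        metrics = evaluate(**params)
        results.append({**params, **metrics})
    return results


grid_results = run_grid(param_grid, evaluate_stub)
print(len(grid_results))  # 3 * 2 * 3 = 18 combinations
```

Because itertools.product varies the last key fastest, the position of each entry in `grid_results` matches the row index shown in the result table, provided the grid really is ordered as assumed here.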

### Random Search

In [17]:
import random

# Define random search space
random_search_space = {
    'learning_rate': [1e-5, 2e-5, 3e-5, 5e-5],
    'batch_size': [8, 16, 32],
    'epochs': [2, 3, 4, 5]
}

n_iter = 5  # Number of random trials
random_results = []

for i in range(n_iter):
    lr = random.choice(random_search_space['learning_rate'])
    bs = random.choice(random_search_space['batch_size'])
    ep = random.choice(random_search_space['epochs'])
    print(f'Random Search trial {i+1}: lr={lr}, bs={bs}, epochs={ep}')
    result = train_and_evaluate_model(learning_rate=lr, batch_size=bs, epochs=ep)
    random_results.append({
        'trial': i+1,
        'learning_rate': lr,
        'batch_size': bs,
        'epochs': ep,
        'f1': result.get('f1', None),
        'accuracy': result.get('accuracy', None)
    })

random_results_df = pd.DataFrame(random_results)
display(random_results_df.sort_values(by='f1', ascending=False))

Random Search trial 1: lr=1e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3463,0.354407,0.832869,0.86755
2,0.4003,0.330464,0.848189,0.88267
3,0.2622,0.372656,0.850975,0.883315
4,0.1849,0.381316,0.855153,0.888172


Random Search trial 2: lr=1e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3659,0.355913,0.841226,0.875
2,0.3799,0.315479,0.85376,0.888653
3,0.277,0.374957,0.844011,0.87905
4,0.1688,0.378766,0.86351,0.893709


Random Search trial 3: lr=1e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3659,0.355913,0.841226,0.875
2,0.3799,0.315479,0.85376,0.888653
3,0.277,0.374957,0.844011,0.87905
4,0.1688,0.378766,0.86351,0.893709


Random Search trial 4: lr=1e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3659,0.355913,0.841226,0.875
2,0.3799,0.315479,0.85376,0.888653
3,0.277,0.374957,0.844011,0.87905
4,0.1688,0.378766,0.86351,0.893709


Random Search trial 5: lr=1e-05, bs=8, epochs=4


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3659,0.355913,0.841226,0.875
2,0.3799,0.315479,0.85376,0.888653
3,0.277,0.374957,0.844011,0.87905
4,0.1688,0.378766,0.86351,0.893709


index,trial,learning_rate,batch_size,epochs,f1,accuracy
1,2,1e-05,8,4,0.888653,0.85376
3,4,1e-05,8,4,0.888653,0.85376
2,3,1e-05,8,4,0.888653,0.85376
4,5,1e-05,8,4,0.888653,0.85376
0,1,1e-05,8,4,0.88267,0.848189


This cell runs a random search over the DistilBERT training hyperparameters. It defines a list of candidate values for learning rate, batch size, and number of epochs, then draws one value from each list with random.choice for a fixed number of trials (n_iter = 5). In each trial, the drawn parameters are passed to train_and_evaluate_model, which returns the F1 score and accuracy. The hyperparameters and metrics for every trial are stored in a list, converted to a pandas DataFrame, and displayed sorted by F1 score. Because each value is drawn independently and no random seed is set, trials can repeat a configuration. In the output above, all five trials are logged with lr=1e-05, batch size 8, and 4 epochs, so the table compares repeated runs of one configuration rather than five distinct ones.
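
One way to guarantee that each trial uses a different configuration is to sample without replacement from the full set of combinations, with a fixed seed for reproducibility. The sketch below is a variation on the cell above, not what the notebook ran; the helper name `sample_unique_configs` and the seed value are assumptions made for illustration.

```python
import itertools
import random

# Same search space as the random search cell.
random_search_space = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "epochs": [2, 3, 4, 5],
}


def sample_unique_configs(space, n_iter, seed=42):
    """Draw up to n_iter distinct configurations (hypothetical helper)."""
    names = list(space)
    all_combos = list(itertools.product(*(space[name] for name in names)))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    chosen = rng.sample(all_combos, k=min(n_iter, len(all_combos)))
    return [dict(zip(names, combo)) for combo in chosen]


configs = sample_unique_configs(random_search_space, n_iter=5)
for i, cfg in enumerate(configs, start=1):
    print(f"Random Search trial {i}: {cfg}")
```

Each dictionary in `configs` could then be passed to train_and_evaluate_model in place of the three separate random.choice calls. With 4 x 3 x 4 = 48 possible combinations, five distinct draws cover roughly a tenth of the space.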

In [20]:
import pandas as pd

# Convert Grid Search results to DataFrame
grid_search_df = grid_results_df.copy()
grid_search_df['Search_Type'] = 'Grid Search'

# Convert Random Search results to DataFrame
random_search_df = random_results_df.copy()
random_search_df['Search_Type'] = 'Random Search'

# Combine both results
combined_results = pd.concat([grid_search_df, random_search_df], ignore_index=True)

# Reorder columns for better readability
columns_order = ['Search_Type', 'learning_rate', 'batch_size', 'epochs', 'f1', 'accuracy']
combined_results = combined_results[columns_order]

# Export to Excel file
output_filename = 'Lequin_gridsearch_randomsearch_results.xlsx'
combined_results.to_excel(output_filename, index=False)

print(f"All Grid Search and Random Search results have been saved to '{output_filename}'.")

All Grid Search and Random Search results have been saved to 'Lequin_gridsearch_randomsearch_results.xlsx'.


This cell combines the Grid Search and Random Search results and exports them to a single Excel file for comparison and documentation. It copies each results DataFrame and adds a Search_Type column that records which search produced each row. The two DataFrames are then stacked into one table with pd.concat() (using ignore_index=True to renumber the rows), and the columns are reordered to show the search type first, followed by the hyperparameters and the metrics. Selecting this column order also drops the trial column from the Random Search results. Finally, the combined table is written to Lequin_gridsearch_randomsearch_results.xlsx without the index, giving one summary of every tested configuration with its F1 and accuracy scores.
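
Once the results are combined, the best configuration per search type can be read directly from the table. The following is a minimal sketch under stated assumptions: it rebuilds a small combined table from two rows of each result table shown earlier instead of using the full grid_results_df and random_results_df, and `combine_results` is a hypothetical helper that mirrors the tagging and concatenation steps.

```python
import pandas as pd

# Two rows per search, copied from the result tables shown earlier.
grid_rows = [
    {"learning_rate": 2e-05, "batch_size": 8, "epochs": 4, "f1": 0.900938, "accuracy": 0.867688},
    {"learning_rate": 3e-05, "batch_size": 8, "epochs": 4, "f1": 0.899344, "accuracy": 0.871866},
]
random_rows = [
    {"trial": 2, "learning_rate": 1e-05, "batch_size": 8, "epochs": 4, "f1": 0.888653, "accuracy": 0.85376},
    {"trial": 1, "learning_rate": 1e-05, "batch_size": 8, "epochs": 4, "f1": 0.88267, "accuracy": 0.848189},
]


def combine_results(grid_df, random_df):
    """Tag each frame with its search type and stack them (hypothetical helper)."""
    columns = ["Search_Type", "learning_rate", "batch_size", "epochs", "f1", "accuracy"]
    tagged = [
        grid_df.assign(Search_Type="Grid Search"),
        random_df.assign(Search_Type="Random Search"),
    ]
    # Selecting `columns` drops the Random Search `trial` column.
    return pd.concat(tagged, ignore_index=True)[columns]


combined = combine_results(pd.DataFrame(grid_rows), pd.DataFrame(random_rows))

# Row with the highest F1 score within each search type.
best_per_search = combined.loc[combined.groupby("Search_Type")["f1"].idxmax()]
print(best_per_search.to_string(index=False))
```

The same two lines at the end would work on the notebook's combined_results table, since it has the same columns.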