# **Introduction**

This notebook focuses on the fine-tuning of a pre-trained Transformer model (DistilBERT) to classify eco-friendly product descriptions using the dataset amazon_eco-friendly_products.csv. Fine-tuning allows a language model that has already learned general language representations to adapt its understanding for a specific task—in this case, identifying whether a product is environmentally friendly based on its textual details.

The dataset consists of various product attributes such as titles, materials, and descriptions. These fields are combined and processed using the DistilBERT tokenizer to convert natural language text into tokens that can be interpreted by the model. The task is treated as a binary text classification problem: predicting whether a product is eco-friendly (1) or not eco-friendly (0).

This notebook represents the “Training Strategy Experiments” portion of the fine-tuning exercise. The experiments systematically adjust key hyperparameters that affect model learning behavior:

***1. Number of epochs*** – controls how many times the model passes through the training data.

***2. Learning rate*** – determines how quickly the model updates its weights during training.

***3. Batch size*** – defines how many samples are processed before updating model weights.

By varying these parameters, the notebook aims to identify which of the three tested configurations yields the highest F1-score and accuracy on the evaluation set. The results help show how the training strategy influences the model’s performance and its ability to generalize.

The structure of this notebook includes:

1. Setup and Data Preparation
2. Load Tokenizer and Model
3. Define Metrics and Baseline Trainer (first configuration)
4. Hyperparameter Experiments (two further configurations)
5. Save and Compare Results

# **1. Setup and Data Preparation**

In [2]:
!pip install transformers datasets torch scikit-learn pandas openpyxl

import pandas as pd
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import numpy as np



This cell installs and imports the libraries I use in the fine-tuning project. The transformers library provides the DistilBERT tokenizer, the model, and the Trainer; datasets converts the pandas DataFrames into the Dataset objects that the Trainer consumes; torch runs the model and its training; scikit-learn splits the data and computes accuracy and F1-score; numpy is used for the argmax over the model outputs; pandas reads and handles the CSV file; and openpyxl lets pandas save the results to an Excel file later. I import everything first so that the rest of the code can run.

In [4]:
# Check for GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU.")

Using GPU: Tesla T4


This cell checks whether my computer or Colab session has a GPU. A GPU trains models like DistilBERT much faster than a CPU. If a GPU is available, the code prints its name (here a Tesla T4) and selects it for training; otherwise it falls back to the CPU, which is slower. The purpose is to make sure the model trains as quickly and efficiently as the hardware allows.

In [6]:
# Load your dataset
df = pd.read_csv("amazon_eco-friendly_products.csv")

# Display sample data
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (3587, 14)


Unnamed: 0,id,title,name,category,material,brand,price,rating,reviewsCount,description,url,img_url,inStock,inStockText
0,B0CWH366KJ,"Agfabric Natural Jute Erosion Control, 16yard(...",Weed Barrier Fabric,"Patio, Lawn & Garden",,Agfabric,$87.3,,,Protect your yard and garden with our biodegra...,https://www.amazon.com/dp/B0CWH366KJ,https://m.media-amazon.com/images/I/71t3FD5KjH...,True,Only 5 left in stock - order soon.
1,B086L692VC,SAFAVIEH Braided Collection 4' Round Light Blu...,Area Rugs,Home & Kitchen,"50%jute, 25% Wool, 25% Cotton",Safavieh,$40.63,4.2,59.0,Country style is perfect for a casual cottage ...,https://www.amazon.com/dp/B086L692VC,https://m.media-amazon.com/images/I/A1Q73Cheh2...,True,Only 3 left in stock - order soon.
2,B01J6JELTG,Eyeseals 4.0 Sleep Mask – Clear – Moisturizing...,Sleeping Masks,Health & Household,Plastic,EYEECO,$65.95,3.7,1075.0,Locks moisture in: Eyeseals 4.0 eye mask for d...,https://www.amazon.com/dp/B01J6JELTG,https://m.media-amazon.com/images/I/61Uz393xlp...,True,
3,B07HQSKK36,Lucky Monet 25/50/100PCS Burlap Gift Bags Wedd...,Gift Bags,Health & Household,Burlap,Lucky Monet,$29.99,4.6,2492.0,❤ Premium Burlap Material❤ These small burlap ...,https://www.amazon.com/dp/B07HQSKK36,https://m.media-amazon.com/images/I/71DrHIU1aM...,True,In Stock In Stock
4,B0C3Y8WJDR,St. Boniface Bag Company | Burlap Bags - Size:...,Grow Bags,"Patio, Lawn & Garden",5.0 Count,Generic,$29.99,4.4,11.0,100% Burlap > 100% BIODEGRADABLE AND ECO FRIEN...,https://www.amazon.com/dp/B0C3Y8WJDR,https://m.media-amazon.com/images/I/81q3el899U...,True,In Stock


This cell loads my dataset, amazon_eco-friendly_products.csv, using pandas. pd.read_csv() reads the CSV file into a DataFrame named df. Then df.shape reports the number of rows and columns (3,587 rows and 14 columns here), and df.head() displays the first five rows so I can check that the data loaded correctly. The purpose is to confirm that the dataset is read properly and to preview the kind of data I will be working with.

In [7]:
# Combine textual fields into a single text column
df['text'] = df['title'].astype(str) + " " + df['material'].astype(str) + " " + df['description'].astype(str)

# Dummy binary labels for demonstration:
# If description contains eco/sustainable terms → 1 (eco-friendly), else 0
df['label'] = df['description'].str.contains("eco|recycl|sustain|biodegrad|organic", case=False, na=False).astype(int)

print(df['label'].value_counts())
df[['text','label']].head()

label
1    2257
0    1330
Name: count, dtype: int64


Unnamed: 0,text,label
0,"Agfabric Natural Jute Erosion Control, 16yard(...",1
1,SAFAVIEH Braided Collection 4' Round Light Blu...,0
2,Eyeseals 4.0 Sleep Mask – Clear – Moisturizing...,1
3,Lucky Monet 25/50/100PCS Burlap Gift Bags Wedd...,0
4,St. Boniface Bag Company | Burlap Bags - Size:...,1


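This cell combines the title, material, and description of each product into a single column called “text” so that the model can analyze everything together instead of separately. Note that astype(str) typically turns a missing value (such as the empty material in row 0) into the literal string "nan", which then becomes part of the text. The cell then creates a dummy label for training: if the description contains “eco”, “recycl”, “sustain”, “biodegrad”, or “organic” (case-insensitive), the row is marked 1 (eco-friendly), otherwise 0 (not eco-friendly). These are substring matches, so “eco” also matches words such as “decor” or “recommended”. value_counts() shows 2,257 eco-friendly and 1,330 non-eco-friendly samples, and head() displays a few examples of the new text and label columns. Because the label is derived from keywords in the description and the description is part of the text, the model is mainly learning to re-detect those keywords; that is acceptable for this demonstration, but it is not a true eco-friendliness label.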
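The labeling rule can be tried on a few strings with the standard `re` module, which is what `str.contains` uses for regular-expression patterns. This is only a sketch: the sample descriptions are made up, and the `STRICT` pattern with a word boundary is a hypothetical alternative, not something the notebook uses.

```python
import re

# Same pattern as the notebook's str.contains call (case-insensitive substring search).
PATTERN = re.compile(r"eco|recycl|sustain|biodegrad|organic", re.IGNORECASE)

# Hypothetical stricter variant: each stem must start at a word boundary.
STRICT = re.compile(r"\b(?:eco|recycl|sustain|biodegrad|organic)", re.IGNORECASE)


def label(description, pattern=PATTERN):
    """Return 1 if the pattern occurs anywhere in the description, else 0."""
    if not isinstance(description, str):  # mirrors na=False for missing values
        return 0
    return int(pattern.search(description) is not None)


# Made-up descriptions, not rows from the dataset.
samples = [
    "Made from 100% recycled ocean plastic",
    "Elegant home decor for any room",
    "Recommended by dentists",
    "Eco-friendly bamboo toothbrush",
]

for text in samples:
    print(label(text), label(text, STRICT), text)
```

The first column is the notebook's rule and the second is the stricter variant. The word boundary removes matches inside “decor” and “Recommended”, although it would still match unrelated words that begin with a stem, such as “economy”.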

In [8]:
# Split into training and evaluation sets
train_texts, eval_texts, train_labels, eval_labels = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

train_data = pd.DataFrame({'text': train_texts, 'labels': train_labels})
eval_data = pd.DataFrame({'text': eval_texts, 'labels': eval_labels})

This cell splits the dataset into two parts: one for training the model and one for evaluating it. It uses an 80-20 split, so 2,869 rows are used for training and 718 rows are held out to test how well the model performs. The stratify=df['label'] argument keeps the class proportions the same in both sets (about 63% eco-friendly in each), so the evaluation set reflects the same class mix as the training set; it does not make the two classes equal in size. After splitting, the cell creates two new DataFrames, train_data and eval_data, which contain the text and the corresponding labels. This step is important because the model learns from one part of the data and is then tested on another part to check whether it generalizes.
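To see what stratification does, here is a minimal pure-Python sketch that splits each class separately. The `stratified_split` function is a hypothetical helper written for illustration; scikit-learn's `train_test_split` uses its own shuffling and rounding, so its split sizes can differ by a sample from this sketch.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def positive_share(labels, idx):
    """Fraction of the selected rows that have label 1."""
    return sum(labels[i] for i in idx) / len(idx)


# Class counts taken from the value_counts() output above.
labels = [1] * 2257 + [0] * 1330
train_idx, test_idx = stratified_split(labels)

print(f"train share of label 1: {positive_share(labels, train_idx):.3f}")
print(f"eval share of label 1:  {positive_share(labels, test_idx):.3f}")
```

Both shares come out at roughly 0.63, the same proportion as in the full dataset.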

# **2. Load Tokenizer and Model**

In [9]:
MODEL_NAME = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=128)

# Convert to Hugging Face dataset format
from datasets import Dataset
train_ds = Dataset.from_pandas(train_data)
eval_ds = Dataset.from_pandas(eval_data)

tokenized_train = train_ds.map(tokenize_function, batched=True)
tokenized_eval = eval_ds.map(tokenize_function, batched=True)

model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2).to(device)

Map:   0%|          | 0/2869 [00:00<?, ? examples/s]

Map:   0%|          | 0/718 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This cell sets up the DistilBERT tokenizer and model so the text data can be processed before training. MODEL_NAME selects distilbert-base-uncased, a smaller and faster version of BERT that still gives good accuracy. The tokenizer converts each product text into tokens, that is, integer IDs that the model can understand. The function tokenize_function() applies this to the text column: truncation=True with max_length=128 cuts any text longer than 128 tokens, and padding=True pads shorter texts to the longest sequence in the batch being tokenized (not always to 128). The training and evaluation DataFrames are turned into Hugging Face Dataset objects so they work smoothly with the Trainer, and map() tokenizes them in batches. Finally, DistilBertForSequenceClassification is loaded with two labels (0 for non-eco-friendly and 1 for eco-friendly) and moved to the device (GPU or CPU). The warning about newly initialized classifier weights is expected, because the classification head is not part of the pre-trained checkpoint and is learned during fine-tuning. This step gets the data and model ready for training.
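The truncation and padding behavior described above can be illustrated with a toy encoder. This is a simplified sketch, not the DistilBERT tokenizer: it splits on whitespace instead of using WordPiece, uses a made-up five-word vocabulary, and omits the special [CLS] and [SEP] tokens. The `encode` function and its `vocab` are hypothetical.

```python
def encode(texts, vocab, max_length=8, pad_id=0, unk_id=1):
    """Toy encoder: truncate each row to max_length, then pad to the longest row."""
    rows = [
        [vocab.get(word, unk_id) for word in text.lower().split()][:max_length]
        for text in texts
    ]
    width = max(len(row) for row in rows)  # longest row in this batch
    input_ids = [row + [pad_id] * (width - len(row)) for row in rows]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up vocabulary for illustration only.
vocab = {"bamboo": 2, "toothbrush": 3, "organic": 4, "cotton": 5, "bag": 6}

batch = encode(["Organic cotton bag", "Bamboo toothbrush"], vocab)
print(batch["input_ids"])       # [[4, 5, 6], [2, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

The second text is padded only up to the length of the first one, which is the same "pad to the longest in the batch" behavior that padding=True requests from the real tokenizer.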

# **3. Define Metrics and Baseline Trainer**

In [10]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1}

This cell defines how the model’s performance is measured. The function compute_metrics() takes the model’s raw output scores (p.predictions) and compares them to the true labels (p.label_ids). np.argmax() picks, for each sample, the class (0 or 1) with the highest score. The function then calculates two metrics: accuracy, the fraction of predictions that are correct, and F1-score, the harmonic mean of precision and recall, which is more informative than accuracy when the classes are imbalanced. With average="binary", the F1-score is computed for the positive class (label 1, eco-friendly). In short, this function tells us how well the model is doing at each evaluation.
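To make the two metrics concrete, the sketch below recomputes them by hand for five made-up predictions. It is a pure-Python illustration of the standard definitions, with hypothetical logits and labels; the notebook itself relies on scikit-learn's accuracy_score and f1_score.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 for the positive class, from raw scores."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)

    accuracy = sum(p == y for p, y in pairs) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up scores for five samples: [score for class 0, score for class 1].
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
labels = [1, 0, 0, 1, 1]

metrics = accuracy_and_f1(logits, labels)
print(metrics)
```

Here the predictions are [1, 0, 1, 0, 1], so three of five are correct (accuracy 0.6), and with two true positives, one false positive, and one false negative, precision and recall are both 2/3, which gives an F1-score of 2/3.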

In [11]:
# Baseline Experiment (3 epochs, LR=5e-5, batch=16)
training_args = TrainingArguments(
    output_dir="./results_exp1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
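    eval_strategy="epoch",  # evaluate on the evaluation set after every epoch
    save_strategy="epoch",  # save a checkpoint after every epoch
    load_best_model_at_end=True,  # reload the checkpoint with the lowest eval loss (the default criterion)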
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 1 (Baseline)…")
trainer.train()
exp1_results = trainer.evaluate()
print(exp1_results)

Running Experiment 1 (Baseline)…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.322781,0.842618,0.876503
2,No log,0.286105,0.86351,0.891593
3,0.301400,0.384816,0.86351,0.891832


{'eval_loss': 0.2861045002937317, 'eval_accuracy': 0.8635097493036211, 'eval_f1': 0.8915929203539823, 'eval_runtime': 0.8198, 'eval_samples_per_second': 875.781, 'eval_steps_per_second': 54.889, 'epoch': 3.0}


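This cell runs the baseline experiment, the first training setup against which the later ones are compared. It starts by setting up the TrainingArguments, which define how the model trains: num_train_epochs=3 means the model goes through the training data three times, learning_rate=5e-5 controls the size of each weight update, per_device_train_batch_size=16 means 16 samples are processed per training step, and weight_decay=0.01 adds a small amount of regularization. The model is evaluated and saved at the end of every epoch, and if a GPU is available, faster fp16 precision is used. Because load_best_model_at_end=True and no other metric is specified, the checkpoint with the lowest validation loss is reloaded after training. That is why trainer.evaluate() reports the epoch 2 values (loss 0.2861, accuracy 0.8635, F1 0.8916) and not those of epoch 3. The “No log” entries in the Training Loss column appear because the training loss is logged only every 500 steps by default, and the first two epochs finish before step 500.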

Then, a Trainer object is created, which connects the model, data, and training settings together. Finally, trainer.train() actually trains the model, and trainer.evaluate() checks how well it performs on the evaluation set. The results (accuracy and F1-score) are printed out so we can see how the baseline model performs before we change any hyperparameters.
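The number of optimizer steps behind the baseline run can be worked out with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the Trainer's default of logging the training loss every 500 steps; `steps` and `first_logged_epoch` are hypothetical helper names.

```python
import math

# 80/20 split of 3,587 rows: the evaluation set gets ceil(0.2 * 3587) = 718 rows.
N_TRAIN = 3587 - math.ceil(0.2 * 3587)
LOGGING_STEPS = 500  # assumed default, since logging_steps is not set in the notebook


def steps(epochs, batch_size, n_train=N_TRAIN):
    """Return (steps per epoch, total steps) for one device without accumulation."""
    per_epoch = math.ceil(n_train / batch_size)
    return per_epoch, per_epoch * epochs


def first_logged_epoch(epochs, batch_size):
    """Epoch during which the first training-loss log is written, or None."""
    per_epoch, total = steps(epochs, batch_size)
    if total < LOGGING_STEPS:
        return None
    return math.ceil(LOGGING_STEPS / per_epoch)


per_epoch, total = steps(epochs=3, batch_size=16)
print(f"training rows: {N_TRAIN}")
print(f"baseline: {per_epoch} steps per epoch, {total} steps in total")
print(f"first training-loss log falls in epoch {first_logged_epoch(3, 16)}")
```

For the baseline this gives 180 steps per epoch and 540 steps in total, so step 500 is reached during epoch 3, which agrees with the first logged training loss appearing in the epoch 3 row.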

# **4. Hyperparameter Experiments (Training Strategy)**

In [12]:
training_args2 = TrainingArguments(
    output_dir="./results_exp2",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer2 = Trainer(
    model=model,  # same object already trained in Experiment 1, so training continues from those weights
    args=training_args2,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 2…")
trainer2.train()
exp2_results = trainer2.evaluate()
print(exp2_results)

Running Experiment 2…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.501808,0.845404,0.885685
2,No log,0.5802,0.866295,0.895652
3,0.103000,0.638563,0.864903,0.895587
4,0.103000,0.68738,0.873259,0.90011
5,0.103000,0.731015,0.871866,0.897321


{'eval_loss': 0.5018078684806824, 'eval_accuracy': 0.8454038997214485, 'eval_f1': 0.8856848609680742, 'eval_runtime': 1.023, 'eval_samples_per_second': 701.828, 'eval_steps_per_second': 87.973, 'epoch': 5.0}


This cell runs the second experiment, which changes two hyperparameters to see whether the model can perform better. The number of epochs is increased from 3 to 5, so the model trains longer, and the learning rate is lowered from 5e-5 to 3e-5, so each weight update is smaller and training tends to be more stable. The training batch size, per-epoch evaluation, and checkpoint saving are the same as before, but weight_decay and per_device_eval_batch_size are not set here, so they fall back to the library defaults (0.0 and 8). Note also that trainer2 receives the same model object that was already fine-tuned in Experiment 1, so this run continues from those weights and does not start from the pre-trained checkpoint. In the log, validation loss rises every epoch (from 0.50 to 0.73) while accuracy and F1 peak at epoch 4 (0.8733 and 0.9001), a pattern that suggests overfitting. Because load_best_model_at_end selects the checkpoint with the lowest validation loss, trainer2.evaluate() reports the epoch 1 checkpoint (accuracy 0.8454, F1 0.8857), which is slightly below the baseline result.

In [13]:
training_args3 = TrainingArguments(
    output_dir="./results_exp3",
    num_train_epochs=8,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    report_to=[]
)

trainer3 = Trainer(
    model=model,  # same object already trained in Experiments 1 and 2, so training continues from those weights
    args=training_args3,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

print("Running Experiment 3…")
trainer3.train()
exp3_results = trainer3.evaluate()
print(exp3_results)

Running Experiment 3…


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.511038,0.860724,0.896266
2,0.231800,0.540654,0.86351,0.891832
3,0.108200,0.716098,0.862117,0.899083
4,0.108200,0.843148,0.876045,0.904609
5,0.039300,0.991103,0.869081,0.896018
6,0.005700,1.034713,0.867688,0.893378
7,0.002900,1.063686,0.881616,0.907909
8,0.002900,1.065143,0.87883,0.905332


{'eval_loss': 0.5110384821891785, 'eval_accuracy': 0.8607242339832869, 'eval_f1': 0.8962655601659751, 'eval_runtime': 1.02, 'eval_samples_per_second': 703.932, 'eval_steps_per_second': 88.237, 'epoch': 8.0}


This cell performs the third experiment, which tests another combination of hyperparameters. The model is trained for 8 epochs, so it passes over the training data more times. The batch size is reduced to 8, so each training step uses fewer samples and there are about twice as many weight updates per epoch; this makes training slower and the updates noisier, which sometimes helps generalization. The learning rate is set back to 5e-5, so each update is larger than in Experiment 2. As in Experiment 2, weight_decay and per_device_eval_batch_size fall back to their defaults, and trainer3 receives the same model object that was already trained in Experiments 1 and 2. In the log, the training loss falls to 0.0029 while the validation loss climbs from 0.51 to 1.07, which is a clear sign of overfitting, even though accuracy and F1 peak late, at epoch 7 (0.8816 and 0.9079). Because the checkpoint with the lowest validation loss is reloaded at the end, trainer3.evaluate() reports the epoch 1 checkpoint (accuracy 0.8607, F1 0.8963).
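One simple way to look at the Experiment 3 log is to compute the gap between validation loss and training loss for each epoch. The sketch below copies the values from the table above, using only the epochs that have a logged training loss. Keep in mind that the Training Loss column shows the most recently logged value, not an average over the epoch, so the gap is only a rough indicator.

```python
# (epoch, training loss, validation loss) copied from the Experiment 3 log.
EXP3 = [
    (2, 0.2318, 0.540654),
    (3, 0.1082, 0.716098),
    (4, 0.1082, 0.843148),
    (5, 0.0393, 0.991103),
    (6, 0.0057, 1.034713),
    (7, 0.0029, 1.063686),
    (8, 0.0029, 1.065143),
]

# Gap between how well the model fits held-out data and training data.
gaps = [(epoch, round(val - train, 4)) for epoch, train, val in EXP3]

for epoch, gap in gaps:
    print(f"epoch {epoch}: validation loss - training loss = {gap:.4f}")

# True if the gap grows from every epoch to the next.
widening = all(later[1] > earlier[1] for earlier, later in zip(gaps, gaps[1:]))
print("gap widens every epoch:", widening)
```

The gap grows at every epoch, which fits the reading that the longer run mostly memorizes the training data.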

# **5. Save and Compare Results**

In [14]:
results = pd.DataFrame([
    {"Experiment": "Exp1 (3ep, 5e-5, 16)", **exp1_results},
    {"Experiment": "Exp2 (5ep, 3e-5, 16)", **exp2_results},
    {"Experiment": "Exp3 (8ep, 5e-5, 8)", **exp3_results},
])
results.to_excel("Lequin_training_experiments.xlsx", index=False)
results

Unnamed: 0,Experiment,eval_loss,eval_accuracy,eval_f1,eval_runtime,eval_samples_per_second,eval_steps_per_second,epoch
0,"Exp1 (3ep, 5e-5, 16)",0.286105,0.86351,0.891593,0.8198,875.781,54.889,3.0
1,"Exp2 (5ep, 3e-5, 16)",0.501808,0.845404,0.885685,1.023,701.828,87.973,5.0
2,"Exp3 (8ep, 5e-5, 8)",0.511038,0.860724,0.896266,1.02,703.932,88.237,8.0


This cell collects the results of the three experiments into a single pandas DataFrame. The “Experiment” column labels each run with its number of epochs, learning rate, and batch size, and the remaining columns hold the metrics returned by evaluate(). The table is saved as an Excel file named “Lequin_training_experiments.xlsx” so it can be used later for reporting, and the final results line displays it in the notebook. When reading the table, keep in mind that each row describes the checkpoint with the lowest validation loss, not the last or the highest-scoring epoch. On that basis, Experiment 1 has the lowest loss (0.2861) and the highest accuracy (0.8635), while Experiment 3 has the highest F1-score (0.8963). Because the three runs share one model object, they are not independent, so the comparison is indicative only.
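The effect of the checkpoint-selection rule can be checked directly from the per-epoch logs. The sketch below copies the logged values and compares the epoch with the lowest validation loss (the rule the notebook's settings follow by default) with the epoch that has the highest F1-score. Selecting by F1 is shown only as a hypothetical alternative; in the Trainer it would correspond to setting metric_for_best_model="f1", which this notebook does not do.

```python
# (epoch, validation loss, accuracy, F1) copied from the three training logs.
LOGS = {
    "Exp1": [
        (1, 0.322781, 0.842618, 0.876503),
        (2, 0.286105, 0.863510, 0.891593),
        (3, 0.384816, 0.863510, 0.891832),
    ],
    "Exp2": [
        (1, 0.501808, 0.845404, 0.885685),
        (2, 0.580200, 0.866295, 0.895652),
        (3, 0.638563, 0.864903, 0.895587),
        (4, 0.687380, 0.873259, 0.900110),
        (5, 0.731015, 0.871866, 0.897321),
    ],
    "Exp3": [
        (1, 0.511038, 0.860724, 0.896266),
        (2, 0.540654, 0.863510, 0.891832),
        (3, 0.716098, 0.862117, 0.899083),
        (4, 0.843148, 0.876045, 0.904609),
        (5, 0.991103, 0.869081, 0.896018),
        (6, 1.034713, 0.867688, 0.893378),
        (7, 1.063686, 0.881616, 0.907909),
        (8, 1.065143, 0.878830, 0.905332),
    ],
}

summary = {}
for name, rows in LOGS.items():
    by_loss = min(rows, key=lambda row: row[1])  # lowest validation loss
    by_f1 = max(rows, key=lambda row: row[3])    # highest F1-score
    summary[name] = {"loss_epoch": by_loss[0], "f1_epoch": by_f1[0], "best_f1": by_f1[3]}
    print(f"{name}: lowest loss at epoch {by_loss[0]} (F1 {by_loss[3]:.4f}), "
          f"highest F1 at epoch {by_f1[0]} (F1 {by_f1[3]:.4f})")
```

The lowest-loss epochs (2, 1, and 1) match the rows in the results table, while the highest F1-scores occur later in each run (epochs 3, 4, and 7), so the ranking of the experiments depends on which selection rule is used.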