# Install Required Packages

This notebook requires several Python packages. The commands below install them:

- **datasets:** Provides access to a wide range of datasets and tools to load and process them.
- **transformers:** Contains pre-trained models and tools for working with transformer-based architectures.
- **evaluate:** Offers utilities to compute evaluation metrics.
- **numpy:** A fundamental package for numerical computing in Python.
- **codecarbon:** Tracks energy consumption and emissions during model training.



In [3]:
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install numpy
!pip install codecarbon



# Import Necessary Libraries

This cell imports the libraries used for loading the dataset, preprocessing, training, evaluation, and energy tracking. Here is what each import does:

- **load_dataset (from datasets):**  
  - Loads datasets from Hugging Face's Datasets library. It supports a variety of datasets for natural language processing (NLP) tasks.

- **AutoTokenizer, AutoModelForSequenceClassification (from transformers):**  
  - `AutoTokenizer` automatically loads the appropriate tokenizer for a given model.
  - `AutoModelForSequenceClassification` loads a pretrained transformer model designed for text classification.

- **TrainingArguments, Trainer (from transformers):**  
  - `TrainingArguments` allows you to configure parameters like batch size, learning rate, and number of epochs.
  - `Trainer` handles the training and evaluation processes using the Hugging Face Transformers library.

- **evaluate:**  
  - Provides tools for computing evaluation metrics such as accuracy, F1-score, or recall.

- **numpy:**  
  - Used for efficient numerical operations and matrix manipulations.

- **DataCollatorWithPadding (from transformers):**  
  - Automatically pads sequences to the maximum length in a batch, ensuring consistent input size for the model.

- **EmissionsTracker (from codecarbon):**  
  - Tracks the carbon footprint of the training process, providing insights into energy consumption and CO₂ emissions.

These imports are essential for loading data, building models, training with transformers, evaluating results, and tracking environmental impact.



In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np
from transformers import DataCollatorWithPadding
from codecarbon import EmissionsTracker

# Load the Dataset

This cell loads the **Phishing Site Classification** dataset with the Hugging Face Datasets library.

***About the Dataset:***

- **Dataset Name:** Phishing Site Classification
- **Source:** Hugging Face (Published by Shawhin)
- **Task:** Binary Classification — Classify URLs as either Safe or Not Safe.
- **Objective:** The dataset is specifically designed to train and evaluate machine learning models for detecting phishing sites.

In [5]:
dataset_dict = load_dataset("shawhin/phishing-site-classification")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## View the Loaded Dataset




In [6]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 2100
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 450
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 450
    })
})
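The row counts printed above correspond to a 70/15/15 partition of 3,000 examples. A quick check of that arithmetic, using only the numbers from the printed `DatasetDict` (nothing is downloaded here):

```python
# Row counts copied from the DatasetDict printed above.
split_sizes = {"train": 2100, "validation": 450, "test": 450}

total = sum(split_sizes.values())
fractions = {name: rows / total for name, rows in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.0%})")
```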

# Load Pretrained BERT Model and Tokenizer

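This cell loads a pretrained **BERT (Bidirectional Encoder Representations from Transformers)** model and its tokenizer with Hugging Face's `transformers` library. The `id2label` and `label2id` mappings attach readable names to the two classes, and the classification head is newly initialized, so it must be trained before the model can make useful predictions.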


In [7]:
# Load model directly
model_path = "google-bert/bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_path)

id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           num_labels=2,
                                                           id2label=id2label,
                                                           label2id=label2id,)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
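The `id2label` and `label2id` dictionaries let predictions be reported by name rather than by index. The sketch below shows that lookup in plain Python; `logits_to_label` and the example logits are hypothetical and only illustrate the idea, they are not part of the `transformers` API.

```python
# Label mappings copied from the cell above.
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}

# The two mappings should be exact inverses of each other.
assert all(label2id[label] == idx for idx, label in id2label.items())

def logits_to_label(logits):
    """Return the label name of the highest-scoring class (hypothetical helper)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]

# Made-up logits for a single URL, not real model output.
example_logits = [-1.3, 2.1]
print(logits_to_label(example_logits))
```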


# Print Model Layers and Trainable Parameters

This code iterates through all the parameters of the model and prints their names along with their `requires_grad` status.



In [8]:
# print layers
for name, param in model.named_parameters():
   print(name, param.requires_grad)

bert.embeddings.word_embeddings.weight True
bert.embeddings.position_embeddings.weight True
bert.embeddings.token_type_embeddings.weight True
bert.embeddings.LayerNorm.weight True
bert.embeddings.LayerNorm.bias True
bert.encoder.layer.0.attention.self.query.weight True
bert.encoder.layer.0.attention.self.query.bias True
bert.encoder.layer.0.attention.self.key.weight True
bert.encoder.layer.0.attention.self.key.bias True
bert.encoder.layer.0.attention.self.value.weight True
bert.encoder.layer.0.attention.self.value.bias True
bert.encoder.layer.0.attention.output.dense.weight True
bert.encoder.layer.0.attention.output.dense.bias True
bert.encoder.layer.0.attention.output.LayerNorm.weight True
bert.encoder.layer.0.attention.output.LayerNorm.bias True
bert.encoder.layer.0.intermediate.dense.weight True
bert.encoder.layer.0.intermediate.dense.bias True
bert.encoder.layer.0.output.dense.weight True
bert.encoder.layer.0.output.dense.bias True
bert.encoder.layer.0.output.LayerNorm.weight True


# Freezing and Unfreezing Model Parameters

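This cell freezes every parameter of the pretrained BERT base model and then unfreezes only the pooler layer. The classification head is not part of `base_model`, so it stays trainable; fine-tuning therefore updates only the pooler and the classifier.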




In [9]:
# freeze base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
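The effect of the two loops can be sketched without loading the model by applying the same rule to parameter names. This is an illustration only: `stays_trainable` is a hypothetical helper, the name list is a hand-picked subset, and the parameter count assumes bert-base's hidden size of 768 with a two-label head.

```python
# A hand-picked subset of names in the form printed by model.named_parameters().
# The pooler and classifier names follow BERT's standard naming.
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "bert.pooler.dense.bias",
    "classifier.weight",
    "classifier.bias",
]

def stays_trainable(name: str) -> bool:
    """Mirror the freeze/unfreeze rule applied in the cell above."""
    if not name.startswith("bert."):
        # Outside the base model (the classification head): never frozen.
        return True
    # Inside the base model: frozen unless it belongs to the pooler.
    return "pooler" in name

trainable = [name for name in param_names if stays_trainable(name)]
print(trainable)

# Rough size of the trainable part (assumes hidden size 768 and 2 labels).
hidden_size, num_labels = 768, 2
pooler_params = hidden_size * hidden_size + hidden_size    # dense weight + bias
classifier_params = hidden_size * num_labels + num_labels  # weight + bias
print(pooler_params + classifier_params)
```

Under these assumptions only a small fraction of the network is updated, which keeps each training step cheap.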

## Verify Which Layers Are Trainable

This code prints the names of all the parameters in the model along with their `requires_grad` status to verify which layers are trainable and which are frozen.


In [10]:

# print layers
for name, param in model.named_parameters():
   print(name, param.requires_grad)

bert.embeddings.word_embeddings.weight False
bert.embeddings.position_embeddings.weight False
bert.embeddings.token_type_embeddings.weight False
bert.embeddings.LayerNorm.weight False
bert.embeddings.LayerNorm.bias False
bert.encoder.layer.0.attention.self.query.weight False
bert.encoder.layer.0.attention.self.query.bias False
bert.encoder.layer.0.attention.self.key.weight False
bert.encoder.layer.0.attention.self.key.bias False
bert.encoder.layer.0.attention.self.value.weight False
bert.encoder.layer.0.attention.self.value.bias False
bert.encoder.layer.0.attention.output.dense.weight False
bert.encoder.layer.0.attention.output.dense.bias False
bert.encoder.layer.0.attention.output.LayerNorm.weight False
bert.encoder.layer.0.attention.output.LayerNorm.bias False
bert.encoder.layer.0.intermediate.dense.weight False
bert.encoder.layer.0.intermediate.dense.bias False
bert.encoder.layer.0.output.dense.weight False
bert.encoder.layer.0.output.dense.bias False
bert.encoder.layer.0.output.Lay

# Text Preprocessing Function

This cell defines a preprocessing function that tokenizes the `text` field and truncates inputs that exceed the model's maximum sequence length.



In [11]:
# define text preprocessing
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

# Tokenizing All Datasets

This cell applies the previously defined `preprocess_function` to tokenize all the datasets using the `map` function from the Hugging Face Datasets library.



In [12]:
# tokenize all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/450 [00:00<?, ? examples/s]
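With `batched=True`, `map` hands the function a dictionary of column names to lists and expects a dictionary of lists back. The sketch below imitates that contract in plain Python. Both `toy_tokenizer` and `batched_map` are hypothetical stand-ins written for illustration; they are not the real tokenizer or the Datasets implementation, and the rows are made up.

```python
def toy_tokenizer(texts, max_length=4):
    """Hypothetical stand-in for the real tokenizer: split on '/' and truncate."""
    return {"input_ids": [text.split("/")[:max_length] for text in texts]}

def preprocess_function(examples):
    # Same shape as the notebook's function: a batch of columns in, new columns out.
    return toy_tokenizer(examples["text"])

def batched_map(columns, function, batch_size=2):
    """Simplified illustration of map(..., batched=True) over a dict of columns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Append the new columns produced for this batch.
        for name, values in function(batch).items():
            result.setdefault(name, []).extend(values)
    return result

# Made-up rows, not taken from the phishing dataset.
data = {"text": ["a.com/login", "b.org/x/y/z/w/v", "c.net"], "labels": [1, 1, 0]}
tokenized = batched_map(data, preprocess_function)
print(tokenized["input_ids"])
```

Processing whole batches lets the tokenizer work on many texts per call, which is why the batched form is usually preferred.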

# Creating a Data Collator

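This cell creates a data collator with the `DataCollatorWithPadding` class from Hugging Face Transformers. It pads each batch to the length of its longest sequence at training time rather than padding the whole dataset up front.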



In [13]:
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
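Dynamic padding can be sketched in a few lines of plain Python. The `pad_batch` helper below is hypothetical and only mimics the idea behind the collator; it assumes a padding id of 0, which matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased;
# the ids in between are arbitrary values chosen for illustration.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is chosen per batch, a batch of short URLs is not padded out to the longest URL in the dataset.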

# Loading Metrics and Computing Evaluation Metrics

This cell loads the accuracy and ROC AUC (area under the ROC curve) metrics and defines `compute_metrics`, which converts the model's logits to class probabilities and returns both scores.



In [14]:
# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

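    # apply softmax to get probabilities (subtract the row max for numerical stability)
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    probabilities = np.exp(shifted) / np.exp(shifted).sum(-1, keepdims=True)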
    # use probabilities of the positive class for ROC AUC
    positive_class_probs = probabilities[:, 1]
    # compute auc
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs, references=labels)['roc_auc'],3)

    # predict most probable class
    predicted_classes = np.argmax(predictions, axis=1)
    # compute accuracy
    acc = np.round(accuracy.compute(predictions=predicted_classes, references=labels)['accuracy'],3)

    return {"Accuracy": acc, "AUC": auc}
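The steps inside `compute_metrics` can be checked by hand on a tiny example. The sketch below uses only the standard library: `softmax` subtracts the row maximum for numerical stability, and `pairwise_auc` computes ROC AUC as the fraction of (positive, negative) pairs that are ranked correctly. Both helpers, and the logits and labels, are hypothetical illustrations rather than the `evaluate` implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]

def pairwise_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up logits and labels for four URLs, not real model output.
logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.4], [1.5, -0.5]]
labels = [0, 1, 1, 1]

positive_probs = [softmax(row)[1] for row in logits]
predicted = [max(range(len(row)), key=lambda i: row[i]) for row in logits]

accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
auc = pairwise_auc(positive_probs, labels)
print(accuracy, round(auc, 3))
```

Accuracy depends on the hard class decision, while AUC depends only on how the positive-class scores are ordered, so the two numbers can differ noticeably on the same predictions.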

# Hyperparameters and TrainingArguments

This cell sets the learning rate, batch size, and number of epochs, then builds the training configuration with the `TrainingArguments` class from Hugging Face Transformers. Logging, evaluation, and checkpointing run once per epoch, and the best checkpoint is reloaded when training ends.


In [15]:
# hyperparameters
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
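These hyperparameters determine how many optimizer steps the run takes. A rough calculation, assuming a single device, no gradient accumulation, and that the final partial batch is kept:

```python
import math

# Values from the notebook: training-split size and hyperparameters.
num_train_examples = 2100
batch_size = 8
num_epochs = 10

# The last, smaller batch still counts as one step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```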

# Emissions Tracking and Model Training with CodeCarbon

This cell performs two key functions:  
1. **Training the Model**: It uses the Hugging Face `Trainer` class to fine-tune the BERT model on the phishing site classification dataset.  
2. **Tracking Emissions**: The `EmissionsTracker` from CodeCarbon monitors and reports the carbon footprint generated during the training process.


In [16]:
from codecarbon import EmissionsTracker
from transformers import Trainer

# Set up the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model with emissions tracking
with EmissionsTracker(output_dir='/content/output', output_file='emissions.csv', allow_multiple_runs=True) as tracker:
    trainer.train()

# Get total emissions directly from the tracker
total_emissions = tracker.final_emissions
print(f"Total emissions: {total_emissions} kgCO2eq")

[codecarbon ERROR @ 20:11:08] Error: Another instance of codecarbon is probably running as we find `/tmp/.codecarbon.lock`. Turn off the other instance to be able to run this one or use `allow_multiple_runs` or delete the file. Exiting.
[codecarbon INFO @ 20:11:08] [setup] RAM Tracking...
[codecarbon INFO @ 20:11:08] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist at \sys\class\powercap\intel-rapl to measure CPU

[codecarbon INFO @ 20:11:09] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 20:11:09] [setup] GPU Tracking...
[codecarbon INFO @ 20:11:09] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 20:11:09] >>> Tracker's metadata:
[codecarbon INFO @ 20:11:09]   Platform system: Linux-6.1.85+-x86_64-with-glibc2.35
[codecarbon INFO @ 20:11:09]   Python version: 3.11.11
[codecarbon INFO @ 20:11:09]   CodeCarbon version: 2.8.3
[codecarbon INFO @ 20:11:09]   Available RAM : 12.675 GB
[codecarbon INFO



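| Epoch | Training Loss | Validation Loss | Accuracy | AUC |
|-------|---------------|-----------------|----------|-----|
| 1 | 0.5076 | 0.390735 | 0.804 | 0.912 |
| 2 | 0.409 | 0.341644 | 0.833 | 0.93 |
| 3 | 0.3566 | 0.314357 | 0.851 | 0.939 |
| 4 | 0.3574 | 0.354922 | 0.849 | 0.946 |
| 5 | 0.3504 | 0.335314 | 0.86 | 0.948 |
| 6 | 0.3493 | 0.289885 | 0.867 | 0.95 |
| 7 | 0.3343 | 0.288745 | 0.876 | 0.95 |
| 8 | 0.3122 | 0.288694 | 0.869 | 0.95 |
| 9 | 0.312 | 0.285123 | 0.867 | 0.951 |
| 10 | 0.3133 | 0.290017 | 0.867 | 0.951 |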


[codecarbon INFO @ 20:11:25] Energy consumed for RAM : 0.000020 kWh. RAM Power : 4.7530388832092285 W
[codecarbon INFO @ 20:11:25] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 20:11:25] Energy consumed for all GPUs : 0.000219 kWh. Total GPU Power : 52.48887440242671 W
[codecarbon INFO @ 20:11:25] 0.000416 kWh of electricity used since the beginning.
[codecarbon INFO @ 20:11:40] Energy consumed for RAM : 0.000040 kWh. RAM Power : 4.7530388832092285 W
[codecarbon INFO @ 20:11:40] Energy consumed for all CPUs : 0.000354 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 20:11:40] Energy consumed for all GPUs : 0.000470 kWh. Total GPU Power : 60.32000823669473 W
[codecarbon INFO @ 20:11:40] 0.000864 kWh of electricity used since the beginning.
[codecarbon INFO @ 20:11:55] Energy consumed for RAM : 0.000059 kWh. RAM Power : 4.7530388832092285 W
[codecarbon INFO @ 20:11:55] Energy consumed for all CPUs : 0.000531 kWh. Total CPU Power : 42.5 W
[codeca

Total emissions: 0.0016696896840862562 kgCO2eq
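The energy figures in the CodeCarbon log can be cross-checked with simple arithmetic. The sketch below sums the per-component energy from the first report and recomputes the CPU energy from its constant power draw. The 15-second elapsed time is an assumption inferred from the spacing of the log timestamps.

```python
# Figures copied from the first CodeCarbon report in the log above (20:11:25).
ram_kwh, cpu_kwh, gpu_kwh = 0.000020, 0.000177, 0.000219
total_kwh = ram_kwh + cpu_kwh + gpu_kwh
print(f"{total_kwh:.6f} kWh")

# Energy is power multiplied by time. Assumption: about 15 seconds elapsed,
# inferred from the spacing of the log timestamps.
cpu_power_watts = 42.5
elapsed_seconds = 15
cpu_energy_kwh = cpu_power_watts * elapsed_seconds / 3_600_000  # watt-seconds -> kWh
print(f"{cpu_energy_kwh:.6f} kWh")
```

Both results agree with the logged values of 0.000416 kWh in total and 0.000177 kWh for the CPU.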


# Validation and Emissions Evaluation

This cell performs the following tasks:  
1. **Model Validation**: Applies the trained BERT model to the validation split, which was not used during training (per-epoch evaluation used the test split).  
2. **Metrics Calculation**: Computes accuracy and AUC using the `compute_metrics()` function.  
3. **Emissions Reporting**: Reads and prints the total carbon emissions generated during training using the data stored in the `emissions.csv` file.  


In [17]:
import pandas as pd

# Apply model to validation dataset
predictions = trainer.predict(tokenized_data["validation"])

# Extract the logits and labels from the predictions object
logits = predictions.predictions
labels = predictions.label_ids

# Compute accuracy and AUC with the compute_metrics function defined earlier
metrics = compute_metrics((logits, labels))

# Read the emissions of the most recent run (last row) from the CSV written by the tracker
emissions_df = pd.read_csv('/content/output/emissions.csv')
total_emissions = emissions_df['emissions'].iloc[-1]

# Print the validation metrics and total emissions together
print("Validation Metrics and Total Emissions:")
print(f"Metrics: {metrics}")
print(f"Total emissions: {total_emissions} kgCO2eq")

Validation Metrics and Total Emissions:
Metrics: {'Accuracy': np.float64(0.893), 'AUC': np.float64(0.946)}
Total emissions: 0.0016696896840862 kgCO2eq
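Taking the last row with `.iloc[-1]` reports the most recent run, assuming the CSV holds one row per tracked run. The sketch below shows the same lookup with the standard library on a hypothetical, heavily reduced stand-in for `emissions.csv`. The values are made up and the real file has more columns.

```python
import csv
import io

# Hypothetical, reduced stand-in for emissions.csv with made-up values.
csv_text = """run_id,emissions
run-a,0.0012
run-b,0.0017
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Emissions of the latest run: the same idea as .iloc[-1] in the cell above.
latest_emissions = float(rows[-1]["emissions"])

# If several runs were recorded, their sum gives the cumulative footprint instead.
all_runs_total = sum(float(row["emissions"]) for row in rows)
print(latest_emissions, all_runs_total)
```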
