<a href="https://colab.research.google.com/github/eitanaka/DATS6401_Final_Project/blob/main/ELECTRA__Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ELECTRA Experiment

[Intro to NLP](https://huggingface.co/learn/nlp-course/chapter1/1)

[Intro to ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)


### What is ELECTRA?

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a pre-training approach for transformer-based language models, designed to be more computationally efficient than masked language modeling.

ELECTRA is pre-trained differently from models like BERT or GPT. BERT uses masked language modeling: some of the input tokens are randomly masked and the model is trained to predict the original tokens. ELECTRA instead uses a method called replaced token detection.

### How does it work?

1. Similar to BERT, some tokens in the input are randomly selected. But instead of leaving them masked, ELECTRA replaces them with plausible alternatives sampled from a small masked language model, called the generator.
2. The ELECTRA model, which is referred to as the discriminator, is then trained to determine, for each token, whether it is the original token or a replacement.
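
The two steps above can be sketched in a few lines of plain Python. This is only an illustration of how the discriminator's per-token labels arise, not the actual ELECTRA pre-training code: the `corrupt` helper, the toy vocabulary, and the 30% replacement probability are all assumptions, and a uniform random choice stands in for the generator.

```python
import random

def corrupt(tokens, replace_prob, vocab, rng):
    """Replace some tokens and return (corrupted_tokens, labels).

    labels[i] is 1 if position i now differs from the original token, else 0.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)  # stand-in for a sample from the generator
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # If the sampled token happens to equal the original, it counts as original.
        labels.append(int(new_tok != tok))
    return corrupted, labels

rng = random.Random(42)
tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "ate", "meal", "dog"]  # hypothetical toy vocabulary
corrupted, labels = corrupt(tokens, 0.3, vocab, rng)
print(corrupted)
print(labels)
```

The discriminator receives `corrupted` as input and is trained to predict `labels`, so every position contributes to the loss rather than only the selected ones.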

### Key points about ELECTRA
- Since it makes predictions about each token in the input (as opposed to BERT, which only predicts the masked tokens), ELECTRA makes more efficient use of the data.
- The generator-discriminator setup is similar to the idea of Generative Adversarial Networks (GANs), but in ELECTRA, the generator and discriminator are not adversarial; the generator does not try to fool the discriminator.
- ELECTRA requires fewer computational resources than comparable models while providing competitive performance.


In [None]:
!pip install transformers datasets evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Using cached transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting datasets
  Using cached datasets-2.13.0-py3-none-any.whl (485 kB)
Collecting evaluate
  Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting aiohttp (from datasets)
  Using cached aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting aiosignal>=1.1.2 (from aiohttp->datasets)
  Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Installing collected packages: aiosignal, aiohttp, transformers, datasets, evaluate
Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 datasets-2.13.0 evaluate-0.4.0 transformers-4.30.2


# Notebook Configuration

## Google Drive

In [None]:
from google.colab import drive
import sys

# Mount Google Drive
drive.mount('/content/drive')

# Get the absolute path of the current folder
abspath_curr = '/content/drive/'

# Get the absolute path of the deep utilities folder
abspath_util_deep = '/content/drive/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Warning

In [None]:
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

## TensorFlow

In [None]:
# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


## PyTorch

In [None]:
import torch

## Random Seeds

In [None]:
import numpy as np

# The random seed
random_seed = 42

# Set random seed in TensorFlow
tf.random.set_seed(random_seed)

# Set random seed in PyTorch
torch.manual_seed(random_seed)

# Set random seed in NumPy
np.random.seed(random_seed)

# Data Preprocessing

## Loading Data

In [None]:
import pandas as pd

# URLs of the raw datasets
url1 = 'https://raw.githubusercontent.com/eitanaka/DATS6202_Final_Project/main/datasets/train.csv' # training set from our GitHub repo
url2 = "https://raw.githubusercontent.com/eitanaka/DATS6202_Final_Project/main/datasets/test.csv" # test set from our GitHub repo

# import the data
test_raw = pd.read_csv(url2)
train_raw = pd.read_csv(url1)

# transform to strings
test_raw['text'] = test_raw['text'].astype(str)
train_raw['text'] = train_raw['text'].astype(str)
test_raw['keyword'] = test_raw['keyword'].astype(str)
train_raw['keyword'] = train_raw['keyword'].astype(str)

# transform into dataframes and remove index
df_test = pd.DataFrame(test_raw)
df_train = pd.DataFrame(train_raw)
df_test.reset_index(drop=True, inplace=True)
df_train.reset_index(drop=True, inplace=True)

## Downloading the data to the directory

In [None]:
import tensorflow_datasets as tfds

# Get the name of the data
data_name = 'tweets'

## Getting the training, validation and test set

In [None]:
from sklearn.model_selection import train_test_split

# Divide the training data into training (80%) and validation (20%)
df_train, df_val = train_test_split(df_train, train_size=0.8, random_state=random_seed)

# Reset the index
df_train, df_val = df_train.reset_index(drop=True), df_val.reset_index(drop=True)

In [None]:
# Print the dimension of df_train
pd.DataFrame([[df_train.shape[0], df_train.shape[1]]], columns=['# rows', '# columns'])

| | # rows | # columns |
|---|---|---|
| 0 | 6090 | 5 |


In [None]:
# Print the dimension of df_val
pd.DataFrame([[df_val.shape[0], df_val.shape[1]]], columns=['# rows', '# columns'])

| | # rows | # columns |
|---|---|---|
| 0 | 1523 | 5 |


## Handling Identifier

In [None]:
df_train = df_train.drop('id', axis=1)
df_val = df_val.drop('id', axis=1)
df_test = df_test.drop('id', axis=1)

## Handling Missing Values

In [None]:
print(df_train.isnull().sum())
print(df_val.isnull().sum())
print(df_test.isnull().sum())

keyword        0
location    2020
text           0
target         0
dtype: int64
keyword       0
location    513
text          0
target        0
dtype: int64
keyword        0
location    1105
text           0
dtype: int64


In [None]:
df_train = df_train.drop('location', axis=1)
df_val = df_val.drop('location', axis=1)
df_test = df_test.drop('location', axis=1)

## Convert pandas dataframe to datasets object for transformer

In [None]:
from datasets import Dataset

ds_train = Dataset.from_pandas(df_train)
ds_val = Dataset.from_pandas(df_val)

# Tokenization

In [None]:
from transformers import ElectraTokenizer, ElectraForSequenceClassification

tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator', num_labels=2)

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier

## Tokenization for training set and validation set

What is the input format for the model?
- The model expects the inputs as a dictionary with the keys input_ids, attention_mask, and labels.
- input_ids: a sequence of integers mapping each input token to its index in the tokenizer vocabulary
- attention_mask: a sequence of 1s and 0s that distinguishes real tokens (1) from padding (0)
- labels: a sequence of integers giving the classification label to predict for each example (taken from the target column)
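
To make the relationship between input_ids and attention_mask concrete, here is a minimal sketch of padding a batch by hand. It is not what the tokenizer does internally; `pad_batch` is a hypothetical helper, and the token ids and the pad id of 0 are made-up values for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id sequences to the same length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized tweets of different lengths
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is padded to the length of the longer one, and the mask marks which positions the model should ignore.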

When labels are provided, the model's output can be indexed like a tuple of (loss, logits), where loss is the cross-entropy loss and logits are the raw class scores before an activation function such as softmax is applied.
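
As a small worked example of how logits relate to probabilities, the predicted class, and the cross-entropy loss, the sketch below uses plain Python. The logit values and the true label are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-0.4, 1.2]   # hypothetical scores for classes 0 (not disaster) and 1 (disaster)
true_label = 1         # hypothetical ground-truth label

probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
loss = -math.log(probs[true_label])                              # cross-entropy for one example

print(probs, predicted_class, loss)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same predicted class, which is what the evaluation code in this notebook does.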

TensorDataset: Dataset wrapping tensors. Each sample will be retrieved by indexing tensors along the first dimension.

DataLoader: Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.
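
The idea of combining a dataset with a sampler can be sketched without PyTorch. The `iter_batches` function below is a hypothetical, simplified stand-in for what a data loader does (it is not the PyTorch implementation), and the tiny dataset is made up.

```python
import random

def iter_batches(dataset, batch_size, shuffle, rng=None):
    """Yield lists of samples; shuffle=True mimics a random sampler, False a sequential one."""
    indices = list(range(len(dataset)))
    if shuffle:
        (rng or random).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

# Hypothetical dataset of (features, label) pairs, indexed along the first dimension
dataset = list(zip([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], [0, 1, 0, 1, 1]))

sequential = list(iter_batches(dataset, 2, shuffle=False))
shuffled = list(iter_batches(dataset, 2, shuffle=True, rng=random.Random(42)))
print(len(sequential), len(sequential[-1]))
```

Note that the last batch is shorter when the dataset size is not a multiple of the batch size. Training typically uses the shuffled order, while validation and test use the sequential one so that predictions stay aligned with the rows.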

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# 'text' is the tweet text and 'target' is 1 if it's about disaster and 0 otherwise
# tokenization
inputs = tokenizer(df_train['text'].tolist(), return_tensors='pt', truncation=True, padding=True, max_length=512)
labels = torch.tensor(df_train['target'].tolist())

# data loader
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=32)

In [None]:
# df_val is our validation dataframe with 'text' and 'target' columns
val_inputs = tokenizer(df_val['text'].tolist(), return_tensors='pt', truncation=True, padding=True, max_length=512)
val_labels = torch.tensor(df_val['target'].tolist())

# Validation data loader
val_dataset = TensorDataset(val_inputs['input_ids'], val_inputs['attention_mask'], val_labels)
val_dataloader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=32)

## Set an initial learning rate

In [None]:
from transformers import AdamW

# optimizer (the learning rate is chosen arbitrarily for now; it is tuned later with Optuna)
optimizer = AdamW(model.parameters(), lr=1e-5)

In [None]:
# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluation

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

def flat_accuracy_and_f1(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    accuracy = accuracy_score(labels_flat, pred_flat)
    f1 = f1_score(labels_flat, pred_flat, average='weighted')  # average parameter could be 'macro' or 'micro' based on your need
    return accuracy, f1

# Train

In [None]:
def train(model, dataloader, optimizer):
    model.train()
    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        outputs = model(**inputs)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In [None]:
def validate(model, dataloader):
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    total_eval_f1 = 0

    # Maintain lists to store the predictions and labels for each batch, which will be used to calculate the F1 score
    predictions , true_labels = [], []

    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)
        with torch.no_grad():
            inputs = {'input_ids': batch[0],
                      'attention_mask': batch[1],
                      'labels': batch[2]}
            outputs = model(**inputs)

        loss = outputs[0]
        total_eval_loss += loss.item()

        logits = outputs[1].detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()

        # Store predictions and true labels for each batch
        predictions.append(logits)
        true_labels.append(label_ids)

    # For each batch, calculate the accuracy and F1 score and add them to the totals
    for batch_logits, batch_labels in zip(predictions, true_labels):
        batch_accuracy, batch_f1 = flat_accuracy_and_f1(batch_logits, batch_labels)
        total_eval_accuracy += batch_accuracy
        total_eval_f1 += batch_f1

    # Report the final accuracy and f1 score for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(dataloader)
    avg_val_f1 = total_eval_f1 / len(dataloader)

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(dataloader)

    return avg_val_loss, avg_val_accuracy, avg_val_f1
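
One property of `validate` worth keeping in mind: it averages per-batch metrics, so a short final batch counts as much as a full one. The sketch below uses made-up labels and an exaggerated batch split to show how the batch average can differ from the metric computed over all examples at once. In this notebook the effect is likely small, since only the last validation batch (19 of 1523 rows at batch size 32) is short.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical (labels, predictions) per batch
batches = [
    ([1, 0, 1, 0], [1, 0, 1, 0]),  # full batch, all correct
    ([1], [0]),                    # short last batch, one wrong prediction
]

# Average of per-batch accuracies, as validate() does
per_batch = [accuracy(t, p) for t, p in batches]
batch_avg = sum(per_batch) / len(per_batch)

# Accuracy over all examples pooled together
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
pooled = accuracy(all_true, all_pred)

print(batch_avg, pooled)  # 0.5 0.8
```

The same caveat applies to F1, which is not an average of per-batch F1 scores in general. Concatenating the stored predictions and labels before scoring would give the dataset-level values.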

# Hyperparameter Tuning

In [None]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.2.0-py3-none-any.whl (390 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 390.6/390.6 kB 8.6 MB/s eta 0:00:00
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.11.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 27.4 MB/s eta 0:00:00
Collecting cmaes>=0.9.1 (from optuna)
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.7/78.7 kB 7.8 MB/s eta 0:00:00
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully i

I used [Optuna](https://optuna.org/), an open-source hyperparameter optimization framework, to automate the hyperparameter search.

Here's a step-by-step summary:

It first defines an objective function that Optuna will aim to minimize. This function represents the model's validation loss.

Within this objective function, the hyperparameters to tune are defined: the learning rate ('lr') and the number of epochs ('epochs'). For each trial, Optuna generates a set of hyperparameters:

- The learning rate is a float sampled from a log-uniform distribution between 1e-6 and 1e-4.
- The number of epochs is an integer between 1 and 5.

Using these hyperparameters, an ELECTRA model for sequence classification is trained. The learning rate suggested by Optuna is used to create an instance of the AdamW optimizer.
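
To illustrate what a log-uniform distribution means for the learning rate, the sketch below samples uniformly in log space with the standard library. `sample_loguniform` is a hypothetical helper written for this illustration; it is not Optuna's sampler, only the distribution it draws the learning rate from.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_loguniform(1e-6, 1e-4, rng) for _ in range(1000)]

# 1e-5 is the midpoint of the range in log space, so about half the samples fall below it
below_1e5 = sum(s < 1e-5 for s in samples) / len(samples)
print(min(samples), max(samples), below_1e5)
```

Sampling this way gives each order of magnitude (1e-6 to 1e-5, and 1e-5 to 1e-4) roughly equal attention, whereas a plain uniform distribution would place about 90% of the samples above 1e-5.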

The model is then trained for the suggested number of epochs, using the train and validate functions defined above.

The function returns the validation loss after the final epoch, which is the metric that Optuna tries to minimize.

Outside of the objective function, a study is created. The direction is set to 'minimize', which indicates that the goal of the study is to find hyperparameters that result in the minimum validation loss.

The objective function is optimized using the study.optimize() function with 20 trials. This means that 20 sets of hyperparameters will be tried, and for each set, the objective function will be computed.

The parameters of the best trial - the one with the lowest validation loss - are printed. This is the optimal set of hyperparameters found by Optuna.

In summary, this script uses Optuna to find the optimal learning rate and number of epochs for training an Electra sequence classification model, with the aim of minimizing the validation loss. It does this by performing a specified number of trials, each with a different set of hyperparameters, and selecting the hyperparameters that result in the lowest validation loss.

In [None]:
# Note: this cell takes a long time to run (about 10 minutes)
import optuna

def objective(trial):
    lr = trial.suggest_loguniform('lr', 1e-6, 1e-4)
    epochs = trial.suggest_int('epochs', 1, 5)

    model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator', num_labels=2)
    optimizer = AdamW(model.parameters(), lr=lr)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    for epoch in range(epochs):
        train(model, dataloader, optimizer)
        val_loss, val_acc, val_f1 = validate(model, val_dataloader)

    return val_loss

# Create a study object and optimize the objective function
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

print("Best trial:")
trial = study.best_trial

print(f"\tValue: {trial.value}")
print("\tParams: ")
for key, value in trial.params.items():
    print(f"\t\t{key}: {value}")

[I 2023-06-18 18:47:30,281] A new study created in memory with name: no-name-7fa16adf-cc47-4a65-a41b-f428c39256b3

Best trial:
	Value: 0.40473867238809663
	Params: 
		lr: 3.347714238183574e-05
		epochs: 2


In the objective function, we define the range of the learning rate and epochs that Optuna will use to tune the model. The suggest_loguniform function samples the learning rate from a log-uniform distribution, and the suggest_int function chooses an integer within the given range. The objective function returns the validation loss, which Optuna tries to minimize.

By calling study.optimize, Optuna runs the objective function 20 times (n_trials=20) and saves the hyperparameters that give the minimum validation loss.
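
The overall shape of such a study (try a set of hyperparameters, score it, remember the best) can be sketched with a plain random search. This is a simplified stand-in: Optuna's default sampler is more sophisticated than uniform random sampling, and `toy_objective` is a made-up function that replaces the real training run so the loop finishes instantly.

```python
import math
import random

def toy_objective(lr, epochs):
    """Hypothetical stand-in for validation loss; it is NOT a real training run."""
    return (math.log10(lr) - math.log10(3e-5)) ** 2 + 0.1 * (epochs - 2) ** 2

rng = random.Random(42)
best = None
for trial_number in range(20):  # mirrors n_trials=20
    params = {
        "lr": math.exp(rng.uniform(math.log(1e-6), math.log(1e-4))),  # log-uniform sample
        "epochs": rng.randint(1, 5),                                  # integer in [1, 5]
    }
    value = toy_objective(**params)
    if best is None or value < best["value"]:  # keep the lowest value seen so far
        best = {"value": value, "params": params}

print(best)
```

The result is the best of the sampled candidates only, so a different seed or more trials may find a lower value.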

Finally, we print out the best parameters found by Optuna. These are the best hyperparameters among the 20 trials, not necessarily the global optimum.

## Evaluate using the best hyperparameters

In [None]:
# Best parameters found by Optuna
best_lr = trial.params['lr']
best_epochs = trial.params['epochs']

tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
model = ElectraForSequenceClassification.from_pretrained('google/electra-small-discriminator', num_labels=2)

optimizer = AdamW(model.parameters(), lr=best_lr)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

for epoch in range(best_epochs):
    train(model, dataloader, optimizer)
    val_loss, val_acc, val_f1 = validate(model, val_dataloader)
    print("Epoch: {0}, Validation Loss: {1:.4f}, Accuracy: {2:.4f}, F1 score: {3:.4f}".format(epoch, val_loss, val_acc, val_f1))


Epoch: 0, Validation Loss: 0.4202, Accuracy: 0.8179, F1 score: 0.8170
Epoch: 1, Validation Loss: 0.4272, Accuracy: 0.8079, F1 score: 0.8086


In the code above, we replaced lr=1e-5 in the AdamW optimizer with lr=best_lr and the range of the for loop with best_epochs. The best_lr and best_epochs values are the best learning rate and number of epochs found by Optuna, respectively. This way, we are training our model using the optimal hyperparameters.

## Prediction

In [None]:
def predict(model, dataloader):
    model.eval()
    predictions = []

    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1]}

        with torch.no_grad():
            outputs = model(**inputs)

        logits = outputs[0].detach().cpu().numpy()
        batch_predictions = np.argmax(logits, axis=1)
        predictions.extend(batch_predictions)

    return predictions

In [None]:
# Prepare the test data (the tweet text is in the 'text' column; there is no 'text_clean' column)
test_inputs = tokenizer(df_test['text'].tolist(), return_tensors='pt', truncation=True, padding=True, max_length=512)

# Test data loader
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=32)

# Generate predictions
test_predictions = predict(model, test_dataloader)

df_test['predicted_label'] = test_predictions

In [None]:
import matplotlib.pyplot as plt

# Sort by class index so the counts line up with the labels (0 = not a disaster, 1 = disaster)
target_count = df_test['predicted_label'].value_counts().sort_index()
pie_labels = ['Not a Disaster', 'Disaster']
plt.figure()
plt.pie(target_count, labels=pie_labels, autopct='%1.1f%%')
plt.legend(title="Labels")
plt.show()

In [None]:
df_test.head(10)

In [None]:
# saving the fine tuned model
model.save_pretrained("./disaster_electra")