## Quantitative Text Analysis Lab Session: Week 12

### Topic: Fine-tuning a Transformer on Irish Environmental Policies


-----

- Instructors: Yen-Chieh Liao and Stefan Müller
- Date: 22 April 2024


All Packages

In [4]:
import os
import numpy as np
import pandas as pd
from torch import cuda
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DistilBertForSequenceClassification
)
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from collections import Counter

Check if GPU is Available

In [11]:
# check if GPU is available
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps' 
else:
    device = 'cpu'

print('GPU Device:',device)

GPU Device: mps


Load the Formatted Dataset

In [12]:
from datasets import DatasetDict
dataset = DatasetDict.load_from_disk('irish_environmental_policies')
train, validation, test = dataset['train'], dataset['validation'], dataset['test']

Access Features and Create the Mapping

In [14]:
labels = train.features['label'].names
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in enumerate(labels)}

NUM_LABELS = len(labels)
print("label2id to Label Mapping:", label2id)
print("id2label to Label Mapping:", id2label)
print("NUM_LABELS:", NUM_LABELS)

label2id to Label Mapping: {'None Environment Policy': 0, 'Environment Policy': 1}
id2label to Label Mapping: {0: 'None Environment Policy', 1: 'Environment Policy'}
NUM_LABELS: 2


In [15]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=NUM_LABELS, label2id=label2id, id2label=id2label)
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Tokenize the Text

In [16]:
from transformers import AutoTokenizer, DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [17]:
train = train.map(tokenize_function, batched=True, batch_size=1000) 
test = test.map(tokenize_function, batched=True, batch_size=1000) 
validation = validation.map(tokenize_function, batched=True, batch_size=1000) 
# train = train.shuffle(seed=42).select(range(1000))
# validation = validation.shuffle(seed=42).select(range(200))
# test = test.shuffle(seed=42).select(range(200))

Map: 100%|██████████| 627/627 [00:00<00:00, 5634.88 examples/s]


View Dataset Structure

In [23]:
validation

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 627
})

In [24]:
import pandas as pd
pd.DataFrame(test).head(10)

Unnamed: 0,text,label,input_ids,attention_mask
0,These compare with 1980 rates of £30 in the Di...,0,"[101, 2122, 12826, 2007, 3150, 6165, 1997, 281...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Struggling Artists: We will review the mechani...,0,"[101, 8084, 3324, 1024, 2057, 2097, 3319, 1996...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,Upgrade public transport services in provincia...,0,"[101, 12200, 2270, 3665, 2578, 1999, 4992, 365...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,A beef farmer qualifying for the headage grant...,0,"[101, 1037, 12486, 7500, 6042, 2005, 1996, 213...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,It will have a primary role in decisions on f...,0,"[101, 2009, 2097, 2031, 1037, 3078, 2535, 1999...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
5,Pyramid schemes which fall outside current le...,0,"[101, 11918, 11683, 2029, 2991, 2648, 2783, 60...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ..."
6,Labour is committed to tackling Ireland's fin...,0,"[101, 4428, 2003, 5462, 2000, 26997, 2989, 316...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
7,Changing the politicians around the cabinet ta...,0,"[101, 5278, 1996, 8801, 2105, 1996, 5239, 2795...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
8,Children are being drawn into gambling through...,0,"[101, 2336, 2024, 2108, 4567, 2046, 12219, 208...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
9,It is evil in itself and it is a continuing ob...,0,"[101, 2009, 2003, 4763, 1999, 2993, 1998, 2009...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
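
The `attention_mask` column marks which positions hold real tokens (1) and which hold padding (0); row 5 above switches to 0 once its short sentence ends. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The helper `pad_to_length`, the toy token IDs, and the target length of 8 are assumptions for illustration; the padding ID 0 matches `padding_idx=0` in the model summary printed earlier.

```python
def pad_to_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token IDs and build the matching attention mask.

    Naive illustration only: a real tokenizer keeps the closing [SEP]
    token when it truncates, which this slice does not.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical short input: 101 is [CLS] and 102 is [SEP] in the BERT uncased vocabulary
toy_ids = [101, 2009, 2003, 4763, 102]
input_ids, attention_mask = pad_to_length(toy_ids, max_length=8)

print(input_ids)       # real tokens followed by pad IDs
print(attention_mask)  # 1 for real tokens, 0 for padding
```

With `padding="max_length"` and no explicit `max_length`, every example is padded to the model's maximum length (512 positions for this checkpoint), so short sentences carry many masked positions. That probably contributes to the slow training steps logged below.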


Create training_args

In [26]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir='./qta_hf_python_model',
    do_train=True,
    do_eval=True,
    num_train_epochs=3,
    learning_rate=0.3,            # NOTE: far above the usual fine-tuning range (about 2e-5 to 5e-5); the logged run below used this value
    logging_strategy='steps',
    logging_dir='./logs',
    evaluation_strategy="steps",  # Evaluate the model periodically
    save_strategy="steps",        # Save the model periodically
    save_total_limit=1,           # Limit how many checkpoints are kept on disk
    load_best_model_at_end=True,  # Load the best model at the end of training
)

Function for Evaluation Metrics

In [27]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'Accuracy': acc,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }

In [28]:
trainer = Trainer(
    model=model,
    args=training_args,
    # training and validation datasets
    train_dataset=train,
    eval_dataset=validation,
    compute_metrics=compute_metrics
)



In [29]:
results=trainer.train()

 71%|███████   | 500/705 [21:37<07:54,  2.32s/it]

{'loss': 1199.7005, 'grad_norm': 0.14355729520320892, 'learning_rate': 0.08723404255319148, 'epoch': 2.13}


  _warn_prf(average, modifier, msg_start, len(result))
                                                 
 71%|███████   | 500/705 [22:40<07:54,  2.32s/it]

{'eval_loss': 0.2923997640609741, 'eval_Accuracy': 0.9282296650717703, 'eval_F1': 0.4813895781637717, 'eval_Precision': 0.46411483253588515, 'eval_Recall': 0.5, 'eval_runtime': 62.4466, 'eval_samples_per_second': 10.041, 'eval_steps_per_second': 1.265, 'epoch': 2.13}


100%|██████████| 705/705 [31:14<00:00,  2.66s/it]  

{'train_runtime': 1874.3317, 'train_samples_per_second': 3.009, 'train_steps_per_second': 0.376, 'train_loss': 850.9188574283681, 'epoch': 3.0}
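
The logged `learning_rate` of about 0.0872 at step 500 is consistent with a linear decay from the initial 0.3 down to 0 over the 705 total steps, which is the `Trainer` default schedule when no warmup is configured. A minimal sketch that reproduces the number (the helper `linear_lr` is hypothetical, not a library function):

```python
def linear_lr(step, total_steps, base_lr):
    """Learning rate under linear decay to zero, assuming no warmup."""
    remaining_fraction = max(0, total_steps - step) / total_steps
    return base_lr * remaining_fraction


# Values taken from the training log above: step 500 of 705, initial rate 0.3
lr_at_500 = linear_lr(step=500, total_steps=705, base_lr=0.3)
print(round(lr_at_500, 6))  # 0.087234
```

The same log line reports a training loss near 1,200, while the evaluation loss is about 0.29, so the rate the optimizer is using at this point deserves a close look.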





## Evaluate Prediction and Performance

__View Performance__

Because `load_best_model_at_end=True`, the `Trainer` reloads the best checkpoint when training finishes. Let's evaluate that model on the train, validation, and test sets.

In [30]:
q=[trainer.evaluate(eval_dataset=data) for data in [train, validation, test]]
pd.DataFrame(q, index=["train","val","test"]).iloc[:,:5]

  _warn_prf(average, modifier, msg_start, len(result))
100%|██████████| 235/235 [03:16<00:00,  1.20it/s]
  _warn_prf(average, modifier, msg_start, len(result))
100%|██████████| 79/79 [01:00<00:00,  1.30it/s]
  _warn_prf(average, modifier, msg_start, len(result))
100%|██████████| 79/79 [01:15<00:00,  1.05it/s]


Unnamed: 0,eval_loss,eval_Accuracy,eval_F1,eval_Precision,eval_Recall
train,0.238682,0.942553,0.485214,0.471277,0.5
val,0.2924,0.92823,0.48139,0.464115,0.5
test,0.286418,0.929825,0.481818,0.464912,0.5
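
A macro recall of exactly 0.5 on a two-class problem, together with the `_warn_prf` warnings, is what a classifier produces when it predicts the same class for every example. The sketch below reproduces the validation row from a constant prediction. The class counts (582 negative, 45 positive) are inferred from the reported accuracy and the 627 validation rows rather than read from the dataset, and `macro_scores` is a hypothetical plain-Python helper, not the scikit-learn call used in `compute_metrics`.

```python
def macro_scores(labels, preds, classes=(0, 1)):
    """Accuracy plus macro-averaged precision, recall and F1.

    Undefined ratios (no predicted or no true samples) are scored as 0.
    """
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    n = len(classes)
    return {
        'Accuracy': accuracy,
        'F1': sum(f1s) / n,
        'Precision': sum(precisions) / n,
        'Recall': sum(recalls) / n,
    }


# Inferred validation split: 582 negatives, 45 positives (an assumption, see above)
labels = [0] * 582 + [1] * 45
# A degenerate classifier that always predicts class 0
preds = [0] * len(labels)

scores = macro_scores(labels, preds)
print(scores)
```

The scores match the `val` row to the reported precision, so accuracy here is simply the majority-class share and hides the failure. One plausible cause is the `learning_rate=0.3` setting, which is several orders of magnitude above the values commonly used for fine-tuning (around 2e-5 to 5e-5).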


In [31]:
trainer.evaluate(test)

  _warn_prf(average, modifier, msg_start, len(result))
100%|██████████| 79/79 [01:09<00:00,  1.13it/s]


{'eval_loss': 0.2864184081554413,
 'eval_Accuracy': 0.9298245614035088,
 'eval_F1': 0.4818181818181818,
 'eval_Precision': 0.4649122807017544,
 'eval_Recall': 0.5,
 'eval_runtime': 70.6224,
 'eval_samples_per_second': 8.878,
 'eval_steps_per_second': 1.119,
 'epoch': 3.0}

Save the Model

In [218]:
# Save the best fine-tuned model and tokenizer to the directory reloaded below
model_save_path = "qta_toy_model"
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

('qta_toy_model/tokenizer_config.json',
 'qta_toy_model/special_tokens_map.json',
 'qta_toy_model/vocab.txt',
 'qta_toy_model/added_tokens.json',
 'qta_toy_model/tokenizer.json')

## After Training

- Inference Task
- Reload the Model

In [228]:
def get_prediction(text):
    # Assumes `tokenizer`, `model`, `device`, and `id2label` are already defined
    inputs = tokenizer(text, padding=True, truncation=True, max_length=250, return_tensors="pt").to(device)

    # Inference only: switch off dropout and skip gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert logits to probabilities and pick the most likely class
    probs = outputs.logits.softmax(dim=1)
    max_prob_index = probs.argmax(dim=1).item()  # .item() returns a Python int (single input only)

    # Convert the predicted index to its label
    predicted_label = id2label[max_prob_index]

    return probs, max_prob_index, predicted_label


In [229]:
model.to(device)
text = "I didn't like the movie since it bored me "
_, _, predicted_label = get_prediction(text)

In [230]:
predicted_label

'None Environment Policy'
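
`get_prediction` turns the model's raw logits into probabilities with a softmax and then takes the index of the largest one. A minimal pure-Python sketch of those two steps follows. The logit values and the `softmax` helper are made up for illustration; the label names are copied from the `id2label` mapping printed earlier.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: 'None Environment Policy', 1: 'Environment Policy'}

# Hypothetical logits for one input, one score per class
toy_logits = [2.0, 0.0]
probs = softmax(toy_logits)

# Index of the highest probability, then its human-readable label
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, id2label[predicted_index])
```

The `score` returned by `pipeline` further below is this maximum probability.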

Use the model with `pipeline` 

In [234]:
from transformers import pipeline, DistilBertForSequenceClassification, DistilBertTokenizerFast
model = DistilBertForSequenceClassification.from_pretrained("qta_toy_model")
tokenizer = DistilBertTokenizerFast.from_pretrained("qta_toy_model")
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [236]:
nlp("Reducing our carbon footprint through renewable energy sources and enhanced energy efficiency is crucial")

[{'label': 'None Environment Policy', 'score': 0.9776235222816467}]