
# Lab session 4: Transformer models


This lab covers sequence-to-sequence modeling with Transformer models. It is designed to give you some first practical experience with Transformers. We have limited the amount of required input to keep the time and effort for this lab within limits.

General instructions:
- Complete the code where needed
- Provide answers to questions only in the cell where indicated
- **Do not alter the evaluation cells** (`## evaluation`) in any way as they are needed for the partly automated evaluation process


# **Section 1: Introduction to HuggingFace and Basic Usage of Transformers**



We will use Transformer neural networks and explore their capabilities on some popular NLP tasks.

## Huggingface
For this lab session, we’ll use [Huggingface's](https://huggingface.co/) library to build an encoder-decoder architecture. Huggingface provides a quick way to use pretrained, Transformer-based NLP models. [BERT](https://huggingface.co/transformers/model_doc/bert.html), [T5](https://huggingface.co/transformers/model_doc/t5.html), [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html), [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) and many others are readily available in this library.

Install the `transformers` and `sentencepiece` (required for tokenization) libraries:


In [1]:
!pip install transformers=="4.28.1" sentencepiece=="0.1.95"


### Usage
HuggingFace's Transformers library is built around three types of classes for each pretrained model:

* **model** classes, e.g., `BertModel`, which inherits from `torch.nn.Module` and handles loading pretrained weights.

* **configuration** classes which store all the parameters required to build a model, e.g., `BertConfig`. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

* **tokenizer** classes which store the vocabulary for each model and provide methods for encoding (and decoding) strings into a list of token embedding indices to be fed to a model, e.g., `BertTokenizer`.

All these classes can be instantiated from pretrained instances and saved locally using two methods: 

1. `from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself or stored locally.

2. `save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.


For example you can load a pretrained model/config/tokenizer with:

  ```
  # import library
  from transformers import BertModel, BertConfig, BertTokenizer

  # load config
  configuration = BertConfig.from_pretrained('bert-base-uncased')
  
  # load tokenizer
  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
  
  # load model
  model = BertModel.from_pretrained("bert-base-uncased")
  # or build a model with randomly initialized weights from the configuration
  model = BertModel(configuration)
  ```
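
The two methods can also be combined into a save/reload round trip. The following is a minimal sketch that avoids downloading any weights: it builds a tiny, randomly initialized BERT from a hand-written configuration. The small sizes and the temporary directory are assumptions made purely for illustration and do not correspond to `bert-base-uncased`.

```python
import tempfile

import torch
from transformers import BertConfig, BertModel

# A deliberately tiny configuration (illustrative values, not those of bert-base-uncased)
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# Building a model directly from a configuration gives randomly initialized weights
tiny_model = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the configuration and the weights to disk ...
    tiny_model.save_pretrained(tmp_dir)
    # ... and read them back from the local directory
    reloaded_model = BertModel.from_pretrained(tmp_dir)

# The reloaded weights should be identical to the saved ones
weights_match = all(
    torch.equal(p, q)
    for p, q in zip(tiny_model.parameters(), reloaded_model.parameters())
)
print(weights_match)
```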

Note that in this session we will focus on using and finetuning a pretrained model, not the (pre)training itself. 

### Question-1

- What's the difference between pretraining and finetuning?   

Pretraining trains a model from (usually randomly) initialized weights on a large, generic corpus, typically with a self-supervised objective, so that it learns general-purpose language representations.
Finetuning starts from these pretrained weights instead of a random initialization and continues training on a (usually much smaller) dataset for a specific task, which gives the model a head start. If your task differs from the one the model was trained on, you cut off the head of the model and add one that fits your task. You can then update all weights, or freeze most of them and train only a small percentage on your task. It is important to use a low learning rate and a low number of epochs, because the model can otherwise easily overfit to the task.

We will use a pretrained T5 model to perform some initial experiments. As introduced in the theory lecture, T5 is a Transformer-based model with an encoder-decoder structure. It uses the same basic architecture as proposed in the original Transformer paper, [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762), with some minor variations. It is based on the core idea that most problems in NLP can be formulated as text-to-text transformations: given a sequence of words as input, the model produces another sequence of words as output. The figure below shows how the input and output are formulated for performing a variety of NLP tasks using the T5 model (also see DL lecture 8).


<img src="https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s640/image3.gif">

Let's see how T5 actually works. As always, we import the necessary modules and initialize with a specific random seed (for reproducibility). Your device should be set to "cuda", not "cpu". If it is not, you can change this in "Edit" > "Notebook settings".

In [2]:
# import necessary libraries

import torch
import transformers

import random
import numpy as np

# for reproducibility

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


## Tokenizer playground

To build an encoder-decoder pipeline we should prepare input data. We shall use the `tokenizer` class, which offers a clean way to convert raw text into ids. For this part, we'll ask you to:

- load the `T5` tokenizer
- tokenize the given sentence into subwords (e.g., `*love NLP*` will convert to `['▁love', '▁N', 'LP']` according to the pretrained tokenizer)
- encode and then decode the given sentence  

**Note**: You might get the following warning: "FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5". Feel free to ignore it.   


In [3]:
from transformers import T5Tokenizer

dummy_sentence = "Don't you love the NLP course? We sure do."

# 1) load t5 tokenizer (with `T5Tokenizer.from_pretrained`, based on "t5-base")

# 2) tokenize dummy_sentence into subwords (use `tokenizer.tokenize`)
# dummy_tokens = ...

# 3) encode dummy_sentence into a pytorch tensor (use `tokenizer.encode_plus` with the argument return_tensors='pt',
# to return torch.Tensor objects). You can also just `__call__` the tokenizer.
# dummy_tensor = ...

# 4) decode the first 6 input_ids [0,6) from the encoded input again (use `tokenizer.decode`)
# dummy_decode = ...

############### for student ################
tokenizer = T5Tokenizer.from_pretrained("t5-base")

dummy_tokens = tokenizer.tokenize(dummy_sentence)

dummy_tensor = tokenizer.encode_plus(dummy_sentence, return_tensors='pt')

dummy_decode = tokenizer.decode(dummy_tensor.input_ids[0][0:6])
############################################

print(dummy_tokens)
print('-' * 100)
print(dummy_tensor)
print('-' * 100)
print(dummy_decode)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


['▁Don', "'", 't', '▁you', '▁love', '▁the', '▁N', 'LP', '▁course', '?', '▁We', '▁sure', '▁do', '.']
----------------------------------------------------------------------------------------------------
{'input_ids': tensor([[1008,   31,   17,   25,  333,    8,  445, 6892,  503,   58,  101,  417,
          103,    5,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
----------------------------------------------------------------------------------------------------
Don't you love the
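
To make the token/id round trip above more concrete, here is a toy reconstruction in plain Python. It is only a sketch: the vocabulary contains just the six token/id pairs visible in the output above, and `toy_encode` / `toy_decode` are hypothetical helpers, not the SentencePiece implementation that `T5Tokenizer` actually uses.

```python
# Token/id pairs copied from the output above (first six tokens only)
toy_vocab = {
    "▁Don": 1008,
    "'": 31,
    "t": 17,
    "▁you": 25,
    "▁love": 333,
    "▁the": 8,
}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def toy_encode(tokens):
    """Map subword tokens to their integer ids."""
    return [toy_vocab[tok] for tok in tokens]


def toy_decode(ids):
    """Map ids back to tokens, glue them together, and turn '▁' into spaces."""
    text = "".join(id_to_token[i] for i in ids)
    return text.replace("▁", " ").strip()


ids = toy_encode(["▁Don", "'", "t", "▁you", "▁love", "▁the"])
print(ids)              # [1008, 31, 17, 25, 333, 8]
print(toy_decode(ids))  # Don't you love the
```

The `▁` symbol marks the start of a new word, which is why decoding only needs to concatenate the tokens and replace the marker with a space.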


In [4]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

assert tokenizer is not None
assert tokenizer.name_or_path.find('t5')!=-1, "load t5 tokenizer"
assert len(tokenizer) == 32100, "load base tokenizer"
assert len(dummy_tokens) == 14
assert dummy_tokens[4] == '▁love'
assert isinstance(dummy_tensor, transformers.tokenization_utils_base.BatchEncoding), 'use encode_plus!'
assert dummy_decode == "Don't you love the"

del dummy_tokens
del dummy_decode
del dummy_tensor

print('Well done!')

Well done!


In [5]:
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import T5EncoderModel

class BinaryClassifierWithT5(nn.Module):

    def __init__(self):
        super().__init__()

        # We only load encoder part
        # (ignore the warning message)
        self.t5_model = T5EncoderModel.from_pretrained('t5-base')
        
        # 1) Get the output dimension of the T5-base model. Huggingface refers 
        #    to this dimension as `d_model`. You can either look up its value
        #    online (https://huggingface.co/t5-base/blob/main/config.json), 
        #    or get it via `self.t5_model.config.d_model`.
        # t5_output_dim = ...

        # 2) Create the linear layer with input dimension = t5_output_dim and a scalar output
        # self.classifier_head = ... (use a linear layer: `nn.Linear`)
        ############### for student ################
        t5_output_dim = self.t5_model.config.d_model

        self.classifier_head = nn.Linear(t5_output_dim, 1)
        ############################################


    def forward(self, input_ids=None, attention_mask=None):
        # The T5 model outputs a sequence of vectors of size `t5_output_dim`
        # (one vector for each token). The dimensions of this tensor are:
        # <batch_size, sequence length, t5_output_dim>.
        sequence_output = self.t5_model(input_ids, attention_mask)['last_hidden_state']

        # 1) To end up with one vector for each sentence in the batch,
        #    we want to average the embeddings over all tokens.
        # averaged_sequence_output = ...  (use `torch.mean`)

        # 2) Pass the averaged sentence embeddings through the linear layer
        # lm_logits = ...  (use `self.classifier_head(...)`)
        
        ############### for student ################
        averaged_sequence_output = torch.mean(sequence_output, dim=1)

        lm_logits = self.classifier_head(averaged_sequence_output)
        ############################################

        return lm_logits
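
Note that `torch.mean` over the sequence dimension also averages any padded positions when a batch contains sentences of different lengths. A possible variant, which is not required for this lab, is to weight the average by the attention mask. The sketch below uses made-up tensors, and `masked_mean` is a hypothetical helper name.

```python
import torch


def masked_mean(sequence_output, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    sequence_output: <batch_size, sequence_length, hidden_dim>
    attention_mask:  <batch_size, sequence_length>, 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(sequence_output.dtype)  # <batch, seq, 1>
    summed = (sequence_output * mask).sum(dim=1)                   # <batch, hidden>
    counts = mask.sum(dim=1).clamp(min=1.0)                        # avoid division by zero
    return summed / counts


# Two "sentences" of length 3 with hidden size 2; the last position of the second is padding
sequence_output = torch.tensor(
    [[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
     [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]]
)
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

plain = torch.mean(sequence_output, dim=1)              # padding leaks into the second row
masked = masked_mean(sequence_output, attention_mask)   # both rows average to 3.0
print(plain)
print(masked)
```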


In [6]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY

dummy_model = BinaryClassifierWithT5()

dummy_inps = tokenizer.encode_plus("This is a simple example", return_tensors='pt')

dummy_output = dummy_model(input_ids=dummy_inps['input_ids'], attention_mask=dummy_inps['attention_mask'])

assert isinstance(dummy_model.classifier_head, nn.Linear) 
assert dummy_model.classifier_head.out_features == 1, 'Is it binary?'


del dummy_model
del dummy_inps
del dummy_output

print('Well done!')


Well done!


## Prompt or task specification

Before jumping into the main part of this lab, we should be familiar with prompting. Prompting, or task specification, is a way to steer the generation of pretrained language models to solve a (natural language) query of your choice. For example, [T. Brown et al.](https://arxiv.org/pdf/2005.14165.pdf) used prompting for grammar correction (the task of correcting different kinds of errors in text, such as spelling and punctuation). They gave prompts of the form `Poor English Input: <inp_sentence>\n Good English Output: <out_sentence>`:

<img src="https://www.dropbox.com/s/ezfh1p891h7qes6/Screenshot%20from%202021-05-01%2009-40-00%20%28edited-Pixlr%29.png?raw=1">


In this scenario, the encoder receives a sentence of the form `Poor English Input: <inp_sentence>\n` and the decoder predicts `Good English Output: <out_sentence>`, where each `<x_sentence>` is an example from our dataset.


T5 has some built-in prompts such as:

- translate English to French: `YOUR_INPUT_SENTENCE`
- translate English to German: `YOUR_INPUT_SENTENCE`
- cola sentence: `YOUR_INPUT_SENTENCE`
- ...
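
In each of these prompts, the task prefix and the input sentence are simply concatenated into one string. A small sketch of that convention (`build_prompt` is a hypothetical helper, not part of `transformers`):

```python
def build_prompt(task_prefix, sentence):
    """Prepend a task prefix to the input sentence, separated by ': '."""
    return f"{task_prefix}: {sentence}"


prefixes = ["translate English to French", "translate English to German"]
prompts = [build_prompt(prefix, "I am a student") for prefix in prefixes]
for prompt in prompts:
    print(prompt)
```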

Let's see how we can use the T5 model to translate "`I am a student`" into French and German using prompts in combination with a pretrained language model.

In [7]:
from transformers import T5ForConditionalGeneration

# load t5 model
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')


In [8]:
examples = [
 "translate English to French: I am a student", 
 "translate English to German: I am a student",         
]

translation_list = []

# for each example:
# 1. encode your inputs and return a tensor with `tokenizer.encode`
# 2. pass the encoded input through the T5 model with the `generate` function
# 3. decode the generated output with the tokenizer (convert ids to tokens) with `tokenizer.decode`.
#    make sure to retain only the translation itself, not the special tokens such as padding
# 4. append this decoded output to the translation_list

for e in examples:
    ############### for student ################
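    # 1. encode the prompt into a tensor of token ids
    input_ids = tokenizer.encode(e, return_tensors='pt')
    # 2. generate the output token ids with the pretrained model
    generated_ids = t5_model.generate(input_ids)
    # 3. decode the first (and only) sequence, dropping special tokens such as padding
    decoded_e = tokenizer.decode(generated_ids[0], clean_up_tokenization_spaces=True, skip_special_tokens=True)
    # 4. store the translation
    translation_list.append(decoded_e)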
    ############################################
print(translation_list)



['Je suis un étudiant', 'Ich bin Studentin']


In [9]:
## evaluation
## DON'T CHANGE THIS CELL IN ANY WAY
assert len(translation_list) == 2, "decode both examples?"
assert translation_list[0] == "Je suis un étudiant"
assert translation_list[1] == "Ich bin Studentin"

print('Well done!')

Well done!


# **Section 2: Fine-tuning pretrained DistilBERT for classification**


In this experiment we'll fine-tune a pretrained DistilBERT model for a classification task. 
DistilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. Knowledge distillation is performed during the pre-training phase to reduce the size of the original BERT model by 40%. Here's an interesting [blog post](https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f) on the approach behind DistilBERT and knowledge distillation in BERT-like models in general.

We use a Twitter dataset of complaints from airline customers to build our classifiers.

### Question-2

- What's the difference between finetuning and freezing transformers?   

Fine-tuning and freezing are two techniques that can be used when adapting a pre-trained transformer model to a new task. Fine-tuning involves training the model on new data specific to your task. During fine-tuning, you can choose to update all the weights of the model or only a subset of them. In the latter case, the layers that are not being fine-tuned are “frozen”. On the other hand, freezing refers to the process of preventing the weights of a model from being updated during training. When you freeze all the weights of a model, you are essentially using it as a fixed feature extractor.
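
The effect of freezing can be checked on a toy model. The sketch below uses two small linear layers as stand-ins for a pretrained encoder and a new task head (the sizes and the random data are assumptions for illustration). After one optimizer step, the frozen weights are unchanged while the head has been updated.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins for a pretrained encoder and a new task head (toy sizes)
encoder = nn.Linear(4, 3)
head = nn.Linear(3, 2)
model = nn.Sequential(encoder, head)

# Freeze the encoder: its weights no longer receive gradients
for param in encoder.parameters():
    param.requires_grad = False

encoder_before = encoder.weight.detach().clone()
head_before = head.weight.detach().clone()

# One training step on random data, optimizing only the trainable parameters
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.1)
inputs, labels = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(inputs), labels)
loss.backward()
optimizer.step()

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(n_trainable, n_total)                          # 8 trainable out of 23
print(torch.equal(encoder.weight, encoder_before))   # frozen weights stay the same
print(torch.equal(head.weight, head_before))         # head weights have been updated
```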

In [10]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(device)

cuda


In [11]:
!pip install Sentencepiece
!pip install datasets


In [12]:
complaint_train = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/twitter_train.csv", encoding='latin-1')
complaint_test = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/twitter_test.csv", encoding='latin-1')

complaint_train.sample(5)

| Unnamed: 0 | id | tweet | label |
|---|---|---|---|
| 291 | 95181 | @JLJeffLewis @AmericanAir no excuse for lost l... | 0 |
| 755 | 20845 | I thought airport wi-fi was ridiculous until I... | 0 |
| 2096 | 32473 | @DanniAllen14 @united @RunLikeAGirl_ca @just_t... | 1 |
| 432 | 165082 | My @united flight to LA had no electricity for... | 0 |
| 479 | 37552 | Poop. _@stevethebikeguy: @JetBlue announces ne... | 0 |


In [13]:
X = complaint_train.tweet.values
y = complaint_train.label.values

X_train, X_val, y_train, y_val =\
    train_test_split(X, y, test_size=0.1, random_state=42)

In [14]:
def preprocess_tweet(text):
  
    text = re.sub(r'(@.*?)[\s]', ' ', text)    # Remove '@name'
    text = re.sub(r'&amp;', '&', text)  # Replace '&amp;' with '&'
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace and strip both ends

    return text
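
To see what these three substitutions do, here is the same function applied to two made-up tweets (the example strings are invented for illustration and are not taken from the dataset). Note that the first pattern only matches a mention that is followed by whitespace, so a mention at the very end of a tweet is left in place.

```python
import re


def preprocess_tweet(text):
    text = re.sub(r'(@.*?)[\s]', ' ', text)   # Remove '@name' when followed by whitespace
    text = re.sub(r'&amp;', '&', text)        # Replace '&amp;' with '&'
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse whitespace and strip both ends
    return text


first = preprocess_tweet("@united  my flight &amp; my bag were late   ")
second = preprocess_tweet("thanks @JetBlue")
print(first)   # my flight & my bag were late
print(second)  # thanks @JetBlue
```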


First, we need to tokenize the input text into token IDs, before it can be fed into DistilBERT. The figure below illustrates the tokenization process.



![img](http://jalammar.github.io/images/distilBERT/bert-distilbert-input-tokenization.png)
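
Besides mapping tokens to ids, the tokenizer adds special tokens and brings every sentence to the same length. The following toy function sketches that bookkeeping in plain Python. The special-token ids are those of the BERT uncased vocabulary, while `pad_or_truncate` and the input ids are hypothetical and only serve as an illustration of the idea, not as the tokenizer's actual implementation.

```python
# Special-token ids of the BERT uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(token_ids, max_len):
    """Add [CLS]/[SEP], then truncate or pad to exactly `max_len` ids.

    Returns the id list and an attention mask (1 = real token, 0 = padding).
    """
    body = token_ids[: max_len - 2]            # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


# Hypothetical token ids, chosen only for illustration
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_len=8)
print(short_ids, short_mask)  # padded with 0s, mask marks the padding
print(long_ids, long_mask)    # truncated to 6 tokens plus [CLS] and [SEP]
```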

In [15]:
from transformers import DistilBertTokenizer, DistilBertModel

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)


MAX_LEN = 64

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained DistilBERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    for sent in data:
        # `tokenizer` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Return an attention mask
        #    (6) Return a dictionary of tokens mapped to IDs
        

        encoded_sent = tokenizer(
            text=preprocess_tweet(sent),           # Preprocess the tweet
            add_special_tokens=True,               # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,                    # Max length to truncate/pad
            padding='max_length',                  # Pad each sequence to `max_length`
            truncation=True,                       # Truncate each sequence to `max_length`
            return_attention_mask=True             # Also return the attention mask
            )

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        
        attention_masks.append(encoded_sent.get('attention_mask'))
    
    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks


In [16]:
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

Tokenizing data...


Now, we will create a PyTorch DataLoader. This allows us to easily load batches of our tokenized dataset during training and validation.

In [17]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
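
To illustrate what such a DataLoader yields, the sketch below builds one over toy tensors (10 made-up "sentences" of 4 token ids each, an assumption for illustration) and prints the shape of every batch. Each batch is a tuple in the same order as the tensors passed to `TensorDataset`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

# Toy stand-ins: 10 "sentences" of 4 token ids each, with masks and labels
toy_inputs = torch.arange(40).reshape(10, 4)
toy_masks = torch.ones(10, 4, dtype=torch.long)
toy_labels = torch.tensor([0, 1] * 5)

toy_data = TensorDataset(toy_inputs, toy_masks, toy_labels)
toy_loader = DataLoader(toy_data, sampler=SequentialSampler(toy_data), batch_size=4)

batch_sizes = []
for b_input_ids, b_attn_mask, b_labels in toy_loader:
    batch_sizes.append(b_input_ids.shape[0])
    print(b_input_ids.shape, b_attn_mask.shape, b_labels.shape)

print(batch_sizes)  # the last batch is smaller: [4, 4, 2]
```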

Next, we need to define the architecture of our classifier, which is built upon DistilBERT. Please follow the instructions in the code below to add a feed-forward classifier after the pretrained DistilBERT model. You are also instructed to freeze the parameters of the pretrained DistilBERT model we have loaded from the `transformers` library. This ensures that only the newly defined classification layers are trained, while the parameters of the pretrained model are kept constant.

In [18]:
%%time
import torch
import torch.nn as nn
from transformers import DistilBertTokenizer, DistilBertModel


# Create the DistilBertClassifier class
class DistilBertClassifier(nn.Module):

    def __init__(self, unfreeze_layers=None):
        super(DistilBertClassifier, self).__init__()


        D_in, H, D_out = 768, 50, 2          # Specify hidden size of DistilBERT, hidden size of our classifier, and number of labels

        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")    # Load DistilBERT model
        
        # Instantiate a feed-forward classifier with a single hidden layer,
        # with nn.ReLU() between the hidden and output layer
        # self.classifier = nn.Sequential ( ... )
            
        ############### for student ################
        self.classifier = nn.Sequential(nn.Linear(D_in, H), nn.ReLU(), nn.Linear(H, D_out))

        ############################################

        # Freeze all the trainable layers in the DistilBertModel
        # (1) loop through all the parameters in self.bert (you should look up how to access the parameters of a layer/model in PyTorch)
        # (2) for each parameter, set requires_grad to False 

        ############### for student ################
        for param in self.bert.parameters():
            param.requires_grad = False

        ############################################

        # unfreeze/train the specific layers of the transformer 
        if unfreeze_layers is not None:
            assert isinstance(unfreeze_layers, list), "unfreeze_layers expects a list of layers to unfreeze"
          
            for layer_no in unfreeze_layers:
                for param in list(self.bert.transformer.layer[layer_no].parameters()):
                    param.requires_grad = True
        
    def forward(self, input_ids, attention_mask):
        
        outputs = self.bert(input_ids=input_ids,            # input_ids.shape = attention_mask.shape (batch_size, max_length)
                            attention_mask=attention_mask)
             
        last_hidden_state_cls = outputs[0][:, 0, :]         # Extract the last hidden state of the token `[CLS]` as an input for the classification task    
        logits = self.classifier(last_hidden_state_cls)     #logits.shape (batch_size, num_labels)

        return logits


CPU times: user 44 µs, sys: 10 µs, total: 54 µs
Wall time: 58.7 µs
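
The indexing `outputs[0][:, 0, :]` in `forward` picks, for every sentence in the batch, the hidden state at position 0, which is where the `[CLS]` token sits. A minimal shape check on a random stand-in tensor (the batch size and sequence length are arbitrary assumptions; 768 matches `D_in` above):

```python
import torch

batch_size, max_length, hidden_size = 2, 5, 768
# Random stand-in for `outputs[0]`, the last hidden states of the encoder
last_hidden_state = torch.randn(batch_size, max_length, hidden_size)

# Position 0 of every sequence holds the `[CLS]` token
cls_vectors = last_hidden_state[:, 0, :]
print(cls_vectors.shape)  # torch.Size([2, 768])
```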


We have implemented the training loop for you. Please study the code of the training and evaluation functions so you understand what is going on.

In [19]:
import random
import time
import torch.nn as nn
# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def mytrainer(model, optimizer, train_dataloader,  val_dataloader=None, epochs=4, evaluation=False):
    """Train the  model.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
   
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        t0_epoch, t0_batch = time.time(), time.time()
        total_loss, batch_loss, batch_counts = 0, 0, 0
        
        model.train()   # Put the model into the training mode

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)  # Load batch to GPU            
            model.zero_grad()    # Zero out any previously calculated gradients
            logits = model(b_input_ids, b_attn_mask)  # Perform a forward pass. 

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            loss.backward()    # Perform a backward pass to calculate gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"

            # Update parameters
            optimizer.step()
 
            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # Measure model's performance on the validation set after each epoch
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print the average training loss and the validation results for this epoch
            time_elapsed = time.time() - t0_epoch
            
            print(f"Eval {epoch_i + 1:^2} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance on the validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
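
The gradient clipping step in the training loop can be examined in isolation. The sketch below, on a toy linear layer with a deliberately inflated loss (both are assumptions for illustration), compares the overall gradient norm before and after calling `clip_grad_norm_` with the same maximum norm of 1.0. `grad_norm` is a hypothetical helper.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(10, 1)

# A deliberately large loss so that the gradients are large too
loss = (layer(torch.randn(16, 10)) * 1000).pow(2).mean()
loss.backward()


def grad_norm(parameters):
    """L2 norm over the gradients of all parameters."""
    return torch.norm(torch.stack([p.grad.norm(2) for p in parameters]), 2)


norm_before = grad_norm(list(layer.parameters()))
# `clip_grad_norm_` rescales the gradients in place and returns the norm before clipping
returned_norm = torch.nn.utils.clip_grad_norm_(layer.parameters(), 1.0)
norm_after = grad_norm(list(layer.parameters()))

print(norm_before.item(), norm_after.item())
```

All gradients are scaled by the same factor, so their direction is preserved and only the size of the update step is limited.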

In [20]:

set_seed(42)    # Set seed for reproducibility

bert_classifier = DistilBertClassifier()
bert_classifier.to(device)

# Create the optimizer
optimizer = torch.optim.AdamW(bert_classifier.parameters(),
                  lr=5e-5,    # Learning rate (a common choice for fine-tuning; AdamW's default is 1e-3)
                  eps=1e-8    # Epsilon (AdamW's default value)
                  )

mytrainer(bert_classifier, optimizer, train_dataloader, val_dataloader, epochs=8, evaluation=True)



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   20    |   0.695247   |     -      |     -     |   5.49   
   1    |   40    |   0.691375   |     -      |     -     |   1.12   
   1    |   60    |   0.686200   |     -      |     -     |   1.11   
   1    |   80    |   0.683507   |     -      |     -     |   1.09   
   1    |   95    |   0.680904   |     -      |     -     |   0.79   
----------------------------------------------------------------------
Eval 1  |    -    |   0.687869   |  0.676163  |   64.26   |   10.14  
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   20    |   0.675791   |     -      |     -     |   1.14   
   2    |   40    |   0.670133   |     -      |     -     |   1.0

We will now fine-tune the DistilBert model by unfreezing specific layers in the pre-trained model, and training our full model (encoder + classifier) again.

In [21]:
set_seed(42)    # Set seed for reproducibility

unfreeze_layers = [5]   # We'll fine-tune the last layer (update the weights of that specific layer) of distilBERT

unfrozen_classifier = DistilBertClassifier(unfreeze_layers=unfreeze_layers)

unfrozen_classifier.to(device)

# Create the optimizer
optimizer = torch.optim.AdamW(unfrozen_classifier.parameters() )
mytrainer(unfrozen_classifier, optimizer, train_dataloader, val_dataloader, epochs=8, evaluation=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   20    |   0.651182   |     -      |     -     |   1.45   
   1    |   40    |   0.558613   |     -      |     -     |   1.39   
   1    |   60    |   0.565137   |     -      |     -     |   1.39   
   1    |   80    |   0.498318   |     -      |     -     |   1.38   
   1    |   95    |   0.539594   |     -      |     -     |   1.02   
----------------------------------------------------------------------
Eval 1  |    -    |   0.564689   |  0.479649  |   77.33   |   7.19   
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   20    |   0.447765   |     -      |     -     |   1.48   
   2    |   40    |   0.453068   |     -      |     -     |   1.4

### Question-3

- Which classifier (`bert_classifier` or `unfrozen_classifier`) takes more time to train? Which one performs better in terms of validation accuracy? Explain. 

Comparing training times, `bert_classifier` takes slightly less time than `unfrozen_classifier` (56 s vs. 60 s). In terms of performance, however, `unfrozen_classifier` does much better on the validation set than `bert_classifier` (78.74% vs. 71.65% accuracy). This makes sense: the frozen model has far fewer parameters to learn (only those in the output layer we added), so it trains faster but fits our data worse. The unfrozen model, in contrast, has more trainable parameters, so it takes more time to train but, as shown, performs much better.
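
To make the "fewer parameters to learn" argument concrete, the sketch below counts trainable parameters by hand. It assumes the standard `distilbert-base-uncased` dimensions (hidden size 768, feed-forward size 3072) and, purely for illustration, a single linear 768 → 2 classification head; the head actually used in `DistilBertClassifier` may differ.

```python
# Back-of-the-envelope count of trainable parameters (illustrative assumptions:
# hidden size 768, feed-forward size 3072, a single linear 768 -> 2 head).
HIDDEN, FFN, NUM_CLASSES = 768, 3072, 2

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim

# One Transformer encoder layer: Q, K, V and output projections,
# a two-layer feed-forward block, and two LayerNorms.
attention = 4 * linear_params(HIDDEN, HIDDEN)
feed_forward = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)
norms = 2 * layer_norm_params(HIDDEN)
encoder_layer = attention + feed_forward + norms

head = linear_params(HIDDEN, NUM_CLASSES)

frozen_trainable = head                    # encoder fully frozen
unfrozen_trainable = head + encoder_layer  # last encoder layer also trained

print(f"head only:              {frozen_trainable:,}")
print(f"head + 1 encoder layer: {unfrozen_trainable:,}")
```

Under these assumptions, unfreezing a single encoder layer increases the number of trainable parameters by several orders of magnitude, which is consistent with the longer training time and the better fit.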


# **Section 3: Multi-lingual transformers: LaBSE** 

In this exercise, we'll use yet another pre-trained transformer, namely the `Language-agnostic BERT Sentence Embedding` model [LaBSE](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html). This is a multilingual sentence embedding model that encodes text from different languages into a shared embedding space. This allows it to be applied to a range of downstream tasks, such as text classification and clustering, while also leveraging semantic information for language understanding.

We will fine-tune the LaBSE model for a sentiment classification task based on a small sample of the [YELP](https://www.yelp.com/dataset) dataset in English. 

We will then evaluate our fine-tuned model on test sets in other languages (Dutch, French, Italian) and inspect how agnostic our model really is to the language change.

In [22]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [23]:
import torch
from transformers import AutoTokenizer, AutoModel

# Set up the device: use the GPU if available, otherwise the CPU
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

import pandas as pd
from sklearn.model_selection import train_test_split


In [24]:
#Load your data into a dataframe
yelp = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/yelp_train.csv", encoding='latin-1')

X = yelp.text.values
y = yelp.label.values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

As in the previous exercise, we will first create a preprocessing function that tokenizes the input text and tensorizes it. 

In [25]:
MAX_LEN = 64
mytokenizer = AutoTokenizer.from_pretrained("pvl/labse_bert", do_lower_case=False)

# Create a function to tokenize a set of texts
def preprocessing_for_labse(data):
    input_ids = []
    attention_masks = []
    for sent in data:

        encoded_sent = mytokenizer(
            text=sent,           
            add_special_tokens=True,               
            max_length=MAX_LEN,                    
            padding='max_length',                        
            truncation =True,                       
            return_attention_mask = True
            )

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
    
    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

In [26]:
train_inputs, train_masks = preprocessing_for_labse(X_train)
val_inputs, val_masks = preprocessing_for_labse(X_val)
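
As a rough illustration of what `padding='max_length'` and `truncation=True` do to each sentence inside `preprocessing_for_labse`, here is a tokenizer-free sketch. The helper `pad_or_truncate`, the pad ID of 0, and the toy token IDs are assumptions made for this example and are not part of the `transformers` API.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len.

    Simplification: a real tokenizer keeps the closing special token when
    truncating; this sketch simply cuts the list.
    """
    kept = token_ids[:max_len]                       # truncation: drop tokens beyond max_len
    n_pad = max_len - len(kept)                      # padding: how many slots remain
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short_ids, short_mask)   # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)     # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Because every sequence ends up with the same length, the lists can be stacked into a single tensor, and the attention mask tells the model which positions to ignore.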

Again, we create dataloaders, which allow us to easily extract batches for training and validation.

In [27]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 16

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)


Now, we define the LabSeClassifier model architecture. Once again, we initially opt to freeze all parameters in the pre-trained model. Please complete the code where indicated.

In [48]:
class LabSeClassifier(torch.nn.Module):
    def __init__(self):
        super(LabSeClassifier, self).__init__()
        H, D_out = 768, 2 

        self.labse_encoder = AutoModel.from_pretrained("pvl/labse_bert")
         
        #Freeze all the parameters of the LabSe encoder

        ############### for student ################ 
        for param in self.labse_encoder.parameters():
            param.requires_grad = False

        ############################################

        self.linear = torch.nn.Linear(H, D_out)       

    def forward(self, input_ids, attention_mask):

        output = self.labse_encoder(input_ids=input_ids, attention_mask=attention_mask)

        ### Mean pool over the hidden representations of each token 
        ### to get a single vector representation for the whole sentence 

        token_embeddings = output[0]                                            
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        labse_representation = sum_embeddings/sum_mask
         
        logits = self.linear(labse_representation)
        return logits
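
The masked mean pooling in `forward` can be checked by hand on a tiny example. The sketch below redoes the same computation in plain Python for a single sentence; the function name `masked_mean_pool` and the toy numbers are made up for illustration.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep      # padded positions contribute 0
    n_real = max(sum(attention_mask), 1e-9)    # same guard as torch.clamp(..., min=1e-9)
    return [s / n_real for s in summed]

# Three token positions with 2-dimensional embeddings; the last one is padding.
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(embeddings, mask)
print(pooled)  # [2.0, 3.0]
```

The large values at the padded position do not affect the result, which is exactly why the attention mask is multiplied in before summing.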



Since we only defined another model and don't aim to change anything about the training strategy, we can reuse the `mytrainer()` function from section 2 (initially used to train the DistilBERT-based tweet classifier). Below, we train the LabSeClassifier. 

In [49]:
import random
import numpy as np
def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

set_seed(42)    # Set seed for reproducibility


labse_classifier = LabSeClassifier()

labse_classifier.to(device)

optimizer = torch.optim.AdamW(labse_classifier.parameters(),
                  lr=5e-5,    # Learning rate (a typical value for fine-tuning BERT-style models; AdamW's own default is 1e-3)
                  eps=1e-8    # Default epsilon value
                  )

mytrainer(labse_classifier, optimizer, train_dataloader, val_dataloader, epochs=5, evaluation=True)      


Some weights of the model checkpoint at pvl/labse_bert were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   20    |   0.718796   |     -      |     -     |   1.24   
   1    |   40    |   0.699815   |     -      |     -     |   1.12   
   1    |   60    |   0.690503   |     -      |     -     |   1.12   
   1    |   80    |   0.688534   |     -      |     -     |   1.12   
   1    |   100   |   0.679065   |     -      |     -     |   1.12   
   1    |   112   |   0.664355   |     -      |     -     |   0.65   
----------------------------------------------------------------------
Eval 1  |    -    |   0.692259   |  0.674546  |   58.65   |   7.03   
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   20    |   0.655523   |     -      |     -     |   1.1

The `labse_predict` function below takes the trained model as an input, together with a test set of unseen instances, and predicts the sentiment of the examples in the test set. It does this by simply passing the inputs through the trained model, and transforming the obtained logits into a probability distribution of sentiment classes.

We also define a function to determine the accuracy of the model's sentiment predictions over the test set. 

In [50]:
import torch.nn.functional as F

def labse_predict(model, test_dataloader):
    """Perform a forward pass on the trained model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)

    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()

    return probs


 

In [51]:
from sklearn.metrics import accuracy_score

def accuracy_function(yelp_test):
 
    test_x = yelp_test.text.values      
    test_y = yelp_test.label.values
  

    test_labels = torch.tensor(test_y)

    test_inputs, test_masks = preprocessing_for_labse(test_x)      #Preprocess the test instance 

    #Prepare testdataloader that will be used by the trained model
    test_data = TensorDataset(test_inputs, test_masks, test_labels)
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
  

    # Predict the probabilities
    probs = labse_predict(labse_classifier, test_dataloader)
    preds = probs[:, 1]

    y_pred = np.where(preds >= 0.5, 1, 0)     # if probability prediction is >=0.5 then it's class 1, and 0 otherwise 

    accuracy = accuracy_score(test_labels, y_pred)
    return accuracy 
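
For a two-class model, thresholding the class-1 probability at 0.5 gives the same decision as picking the larger of the two logits (apart from exact ties). The sketch below checks this with a hand-written softmax; the logit values are invented for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One [class 0, class 1] logit pair per example (made-up values).
toy_logits = [[2.0, 0.5], [-1.0, 1.0], [0.3, 0.4]]

by_threshold = [1 if softmax(pair)[1] >= 0.5 else 0 for pair in toy_logits]
by_argmax = [pair.index(max(pair)) for pair in toy_logits]
print(by_threshold, by_argmax)
```

Both lists contain the same labels, so the softmax step matters mainly when the probabilities themselves are needed, for example to choose a different decision threshold.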
  
     

Now that we have trained a LabSe classifier, which is fine-tuned on English sentences for the task of sentiment prediction, we can evaluate it. Not only will we evaluate its performance on an English test set, we will also test out several other languages (Dutch, French and Italian).

In [52]:

yelp_test_nl = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/nyelp_test_nl.csv", encoding='latin-1')
yelp_test_fr = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/nyelp_test_fr.csv", encoding='latin-1')
yelp_test_en = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/yelp_test.csv", encoding='latin-1')
yelp_test_it = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/nyelp_test_it.csv", encoding='latin-1')

nl_accuracy = accuracy_function(yelp_test_nl)
fr_accuracy = accuracy_function(yelp_test_fr)
en_accuracy = accuracy_function(yelp_test_en)
it_accuracy = accuracy_function(yelp_test_it)

print(f'Accuracy English: {en_accuracy*100:.2f}%')
print(f'Accuracy Dutch: {nl_accuracy*100:.2f}%')
print(f'Accuracy French: {fr_accuracy*100:.2f}%')
print(f'Accuracy Italian: {it_accuracy*100:.2f}%')

Accuracy English: 46.50%
Accuracy Dutch: 52.00%
Accuracy French: 48.50%
Accuracy Italian: 46.50%


### Question-4

Why would the model work on other languages than the one it was fine-tuned for (English)? Describe the pre-training strategy that leads to this 'language-agnostic' property of the LabSe model. You can consult the linked [blog post from the creators of the LabSe model](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html) and other online sources to solve this question (please list the ones you used). 

-  LaBSE works on languages other than the one it was fine-tuned on because it is pre-trained on a large amount of monolingual sentences and bilingual sentence pairs. More precisely, the model uses two parallel encoders to encode a pair of sequences and compute their compatibility score. This strategy encourages the model to learn language-agnostic representations that generalize well to many different languages: a sentence and its translation end up close together in the shared embedding space, so a classifier trained on English embeddings can also be applied to embeddings of sentences in other languages.
https://www.youtube.com/watch?v=7tAWk_Coj-s&ab_channel=Rasa
https://towardsdatascience.com/labse-language-agnostic-bert-sentence-embedding-by-google-ai-531f677d775f
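
The "compatibility score" idea can be sketched as a translation ranking loss with in-batch negatives: each source sentence should score higher with its own translation than with the other translations in the batch. The LaBSE paper describes an additive-margin softmax of this kind; the code below is only a simplified, one-directional sketch, and the 2-dimensional "embeddings", the margin value, and the function names are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(source_vecs, target_vecs, margin=0.3):
    """Mean softmax cross-entropy where target i is the positive for source i.

    The margin is subtracted from the positive score only, so a pair must
    beat the in-batch negatives by a clear gap to reach a low loss.
    """
    total = 0.0
    for i, src in enumerate(source_vecs):
        scores = [cosine(src, tgt) for tgt in target_vecs]
        scores[i] -= margin
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[i]          # -log softmax(scores)[i]
    return total / len(source_vecs)

# Made-up embeddings: english[i] and dutch[i] stand for translations of each other.
english = [[1.0, 0.1], [0.1, 1.0]]
dutch = [[0.9, 0.2], [0.2, 0.9]]

aligned = ranking_loss(english, dutch)
misaligned = ranking_loss(english, dutch[::-1])
print(f"aligned: {aligned:.3f}  misaligned: {misaligned:.3f}")
```

The loss is lower when translations are paired correctly, so minimizing it pushes translation pairs together and unrelated sentences apart, regardless of language.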

# **Section 4: Fine-tuning T5 For Seq-to-Seq Task**

So far, we have focused only on *classifiers* built on top of Transformers. In this last exercise, we'll use a pretrained *sequence-to-sequence* transformer model to generate a summary for a given news article. Instead of outputting a probability distribution over classes (as is the case for classification), this model will take a text as an input, and output another text (hence, sequence-to-sequence). 

In [53]:
!pip install Sentencepiece
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [54]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration


In [55]:
# Set up the device: use the GPU if available, otherwise the CPU
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'


In [56]:
df = pd.read_csv("https://raw.githubusercontent.com/semerekiros/ML_NLP/main/news_summary_small.csv", encoding='latin-1')

df.head()

Unnamed: 0,text,ctext
0,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In the previous two exercises, we wrote our own preprocessing functions before creating our dataloaders. Here we'll directly use the `Dataset` class provided by `PyTorch`. It defines how the text is pre-processed and stores the instances with their corresponding labels. A `DataLoader` then wraps an iterable around the `Dataset` to give easy access to batches of samples before they are sent to the neural network. 

In [57]:
class SummaryDataset(Dataset):

    def __init__(self, df, tokenizer, source_len, summary_len):
        self.tokenizer = tokenizer
        self.data = df
        self.source_len = source_len
        self.summary_len = summary_len
        self.summarys = self.data.text
        self.articles = self.data.ctext

    def __len__(self):
        return len(self.summarys)

    def __getitem__(self, index):
        article = str(self.articles[index])
        article = ' '.join(article.split())

        summary = str(self.summarys[index])
        summary = ' '.join(summary.split())

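        # Pad to the maximum length and truncate longer texts
        # (replaces the deprecated pad_to_max_length=True and makes truncation explicit)
        source = self.tokenizer.batch_encode_plus([article], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([summary], max_length=self.summary_len, padding='max_length', truncation=True, return_tensors='pt')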


        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask']

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
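
The division of labor between a `Dataset` and a `DataLoader` can be mimicked without PyTorch. In the sketch below, `ToyDataset` and `iterate_batches` are hypothetical stand-ins: the dataset only needs `__len__` and `__getitem__`, and the loader-like generator groups indexed items into batches.

```python
class ToyDataset:
    """Map-style dataset: preprocessing happens lazily in __getitem__."""

    def __init__(self, articles, summaries):
        self.articles = articles
        self.summaries = summaries

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, index):
        # Same whitespace normalization as SummaryDataset.__getitem__.
        return {
            "article": " ".join(self.articles[index].split()),
            "summary": " ".join(self.summaries[index].split()),
        }

def iterate_batches(dataset, batch_size):
    """Yield lists of items in order, like an unshuffled loader without collation."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

toy = ToyDataset(["a  b", "c\nd", "e f"], ["s1", "s2", "s3"])
batches = list(iterate_batches(toy, batch_size=2))
print([len(batch) for batch in batches])   # [2, 1]
print(batches[0][0]["article"])            # a b
```

The real `DataLoader` additionally shuffles (if requested) and collates the per-item tensors into one batch tensor per key.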


As opposed to the previous exercises, we won't add any layers on top of the pre-trained transformer. This is because the pre-trained transformer we will use is already a sequence-to-sequence model, as is the task of providing a summary for a news article. 

We now immediately advance to defining the training loop. Note that the model's forward function (which we will load in later) takes the following arguments as an input: 
- the input sequence IDs
- the attention mask
- the decoder input IDs
- the labels 

For more information about these arguments, please refer to [T5Model](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model).

In [58]:
def train(epoch, tokenizer, model, device, loader, optimizer):

    #Freeze the encoder part of the t5, to limit required computational resources
    for par in model.get_encoder().parameters():    
        par.requires_grad = False
  
    model.train()
    for _, data in enumerate(loader,0):
        y = data['target_ids'].to(device, dtype=torch.long)
        y_ids = y[:,:-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels [y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype=torch.long)
        mask = data['source_mask'].to(device, dtype=torch.long)

        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]

        if _%10 == 0:
            print(f'Training Loss: {loss.item()}')
        if _%500==0:
            print(f'Epoch:{epoch}, Loss: {loss.item()}')

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
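
The slicing in `train` implements teacher forcing: the decoder receives the target sequence without its last token and is asked to predict the same sequence shifted one position to the left, with padding positions set to -100 so that the loss ignores them. A plain-Python sketch of that bookkeeping follows; the token IDs are arbitrary, and a pad ID of 0 is assumed here.

```python
PAD_ID = 0           # assumed padding ID
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def shift_for_teacher_forcing(target_ids):
    """Split one padded target sequence into decoder inputs and labels."""
    decoder_input_ids = target_ids[:-1]                 # mirrors y[:, :-1]
    labels = [IGNORE_INDEX if tok == PAD_ID else tok    # mirrors y[:, 1:] with pads masked
              for tok in target_ids[1:]]
    return decoder_input_ids, labels

# Arbitrary token IDs; here 1 plays the role of end-of-sequence, followed by padding.
target = [37, 1782, 19, 1, 0, 0]
decoder_input_ids, labels = shift_for_teacher_forcing(target)
print(decoder_input_ids)  # [37, 1782, 19, 1, 0]
print(labels)             # [1782, 19, 1, -100, -100]
```

At each position the decoder sees the true previous tokens and is trained to predict the next one, and positions that only contain padding do not contribute to the loss.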


The following function validates the performance of the model. It asks the model to generate an output sequence, based on a given input sequence. In our case, the input sequence will be a news article, and the output sequence will be its summary. To evaluate the model, we will calculate the BLEU-score between the predicted summary and the target summary. We will do this for all the news articles in our summary validation set, while training on the train set.

In [59]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()

    predictions = []
    actuals = []
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype=torch.long)
            ids = data['source_ids'].to(device, dtype =torch.long)
            mask = data['source_mask'].to(device, dtype =torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask,
                max_length = 150,
                num_beams=2,
                repetition_penalty = 2.5,
                length_penalty=1.0,
                early_stopping=True
            )

            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]

            if _%10 == 0:
                print(f'Completed {_}')
            predictions.extend(preds)
            actuals.extend(target)

    return predictions, actuals


In [60]:
from torchtext.data.metrics import bleu_score

def calculate_bleu(predictions, actuals):
    predictions = [prediction.split(" ") for prediction in predictions]
    actuals = [[actual.split(" ")] for actual in actuals]

    score = bleu_score(predictions, actuals)


    return score
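
BLEU is built from clipped n-gram precisions between a candidate and its reference, combined with a brevity penalty. The sketch below computes only the clipped precision for a single sentence pair, to show the core idea; it is not a replacement for `bleu_score`, and the example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    counting each n-gram at most as often as the reference contains it."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

candidate = "the hotel staff will be trained".split(" ")
reference = "hotel staff will be trained to spot trafficking".split(" ")

p1 = clipped_precision(candidate, reference, 1)   # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)   # 4 of 5 bigrams match
print(p1, p2)
```

The clipping prevents a candidate from scoring well by repeating a single matching word, which plain precision would reward.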

Now, we instantiate the `Dataset` class defined above and prepare the training and validation `DataLoader`s. 

In [61]:
model_config={
    "MODEL":"t5-base",             # model_type: t5-base/t5-large
    "TRAIN_BATCH_SIZE":8,          # training batch size
    "VALID_BATCH_SIZE":20,          # validation batch size
    "TRAIN_EPOCHS":2,              # number of training epochs
    "VAL_EPOCHS":1,                # number of validation epochs
    "LEARNING_RATE":1e-4,          # learning rate
    "MAX_LEN":512,  # max length of source text
    "SUMMARY_LEN":150,   # max length of target text
    "SEED": 42                     # set seed for reproducibility 
}


# Set random seeds and deterministic pytorch for reproducibility
torch.manual_seed(model_config["SEED"]) # pytorch random seed
np.random.seed(model_config["SEED"]) # numpy random seed
torch.backends.cudnn.deterministic = True

# Tokenizer for encoding the text
mytokenizer = T5Tokenizer.from_pretrained("t5-base")


df.ctext = 'summarize: ' + df.ctext   # prepend the 'summarize: ' prompt to each article


# Creation of Dataset and Dataloader
# Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state = model_config["SEED"])
val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("VAL Dataset: {}".format(val_dataset.shape))



# Creating the Training and Validation dataset for further creation of Dataloader
training_set = SummaryDataset(train_dataset, mytokenizer, model_config["MAX_LEN"], model_config["SUMMARY_LEN"])
val_set = SummaryDataset(val_dataset, mytokenizer, model_config["MAX_LEN"], model_config["SUMMARY_LEN"])

# Defining the parameters for creation of dataloaders
train_params = {
    'batch_size': model_config["TRAIN_BATCH_SIZE"],
    'shuffle': True,
    'num_workers': 0
    }

val_params = {
    'batch_size': model_config["VALID_BATCH_SIZE"],
    'shuffle': False,
    'num_workers': 0
    }

# Creation of DataLoaders for training and validation. These are used below in the training and validation stages of the model.
training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)



FULL Dataset: (1000, 2)
TRAIN Dataset: (800, 2)
VAL Dataset: (200, 2)


We now take the pre-trained T5 language generation model, and finetune it on our summary dataset to create the `T5 Summarizer`.

In [62]:
# Define the model. T5ForConditionalGeneration is the T5 encoder-decoder with a language-modeling head on top, used here to generate the summary.
# The model is then moved to the device (GPU if available, otherwise CPU).
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = model.to(device)

# Defining the optimizer that will be used to tune the weights of the network in the training session. 
optimizer = torch.optim.Adam(params =  model.parameters(), lr=model_config["LEARNING_RATE"])

# Training loop
print('Initiating Fine-Tuning for the model on our dataset')

for epoch in range(model_config["TRAIN_EPOCHS"]):
    train(epoch, mytokenizer, model, device, training_loader, optimizer)



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Initiating Fine-Tuning for the model on our dataset
Training Loss: 6.1493144035339355
Epoch:0, Loss: 6.1493144035339355
Training Loss: 2.4344069957733154
Training Loss: 2.351414203643799
Training Loss: 1.8651801347732544
Training Loss: 2.1591875553131104
Training Loss: 2.250520944595337
Training Loss: 1.9304931163787842
Training Loss: 2.1510298252105713
Training Loss: 1.6705858707427979
Training Loss: 1.9244059324264526
Training Loss: 1.7637704610824585
Epoch:1, Loss: 1.7637704610824585
Training Loss: 1.5756059885025024
Training Loss: 1.8394076824188232
Training Loss: 1.629102110862732
Training Loss: 1.4356831312179565
Training Loss: 1.9609870910644531
Training Loss: 1.8042199611663818
Training Loss: 1.8361130952835083
Training Loss: 1.8112372159957886
Training Loss: 1.528156042098999


Now, let's evaluate the performance of our summarizer in terms of the BLEU score metric. We can use the `validate` and `calculate_bleu` functions we defined earlier. You can ignore the warnings below. 

In [63]:
predictions, actuals = validate(1, mytokenizer, model, device, val_loader)   #Use the model to get predictions
score = calculate_bleu(predictions, actuals)           #Calculate the bleu score between the predictions and the ground truth summaries
print(score)

Completed 0
0.11014135872578089


# Let's test it!

We will now have a look at the news article summaries our model comes up with. We take one instance from our validation set (the one at index 1) and ask our model to generate a summary. To do this, we'll use the `generate` function. As an input, we provide the same prompt as was used during finetuning of the model (`summarize: <news article>`). Thanks to the training process, the language generation model will know how to generate a summary of the article. 

If you are looking for a good read on the underlying techniques for text generation (greedy search, beam search, ...) and some examples of how to use the `generate` method, please have a look at this [blog post from huggingface](https://huggingface.co/blog/how-to-generate).

In [64]:
test_sent = val_dataset.ctext.values[1]
print(test_sent)

summarize: Hotels in Mumbai and other Indian cities are to train their staff to spot signs of sex trafficking such as frequent requests for bed linen changes or a "Do not disturb" sign left on the door for days on end. The group behind the initiative is also developing a mobile phone app - Rescue Me - which hotel staff can use to alert local police and senior anti-trafficking officers if they see suspicious behavior. "Hotels are breeding grounds for human trade," said Sanee Awsarmmel, chairman of the alumni group of Maharashtra State Institute of Hotel Management and Catering Technology. "(We) have hospitality professionals working in hotels across the country. We are committed to this cause."The initiative, spearheaded by the alumni group and backed by the Maharashtra state government, comes amid growing international recognition that hotels have a key role to play in fighting modern day slavery. MAHARASHTRA MAJOR DESTINATION FOR TRAFFICKED GIRLS Maharashtra, of which Mumbai is the ca

In [65]:
import warnings
warnings.filterwarnings("ignore")


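# Pad to the maximum length and truncate longer inputs (replaces the deprecated pad_to_max_length=True)
test_tokenized = mytokenizer.encode_plus(test_sent, max_length=model_config["MAX_LEN"], padding='max_length', truncation=True, return_tensors="pt")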

test_input_ids  = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

test_input_ids = test_input_ids.to(device, dtype=torch.long)
test_attention_mask = test_attention_mask.to(device, dtype=torch.long)

model.eval()

beam_outputs = model.generate(
    input_ids = test_input_ids,
    attention_mask = test_attention_mask,
    max_length = 150,
    num_beams=2,
    repetition_penalty = 2.5,
    length_penalty=1.0,
    early_stopping=True
)

for beam_output in beam_outputs:
    predicted_summary = mytokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print(f' Decoding strategy: Beam search, \n Generated summary:  {predicted_summary}')



 Decoding strategy: Beam search, 
 Generated summary:  hotels in Mumbai and other Indian cities are to train staff to spot signs of sex trafficking such as frequent requests for bed linen changes or a "Do not disturb" sign left on the door for days on end. The initiative comes amid growing international recognition that hotels have a key role to play in fighting modern day slavery.


Beam search was used to generate the previous summary. As mentioned in the [blog post](https://huggingface.co/blog/how-to-generate), many other decoding strategies can be used to generate an output sequence from an input sequence. Here, we ask you to use nucleus sampling, also called top-p sampling. Have a look at how we implemented the beam-search decoding strategy in the code above (`beam_outputs = model.generate(...)`), and add arguments that ensure top-p sampling is used instead of beam search.
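As a rough illustration of what top-p filtering does to a single next-token distribution, consider the sketch below. The function names and the toy probabilities are assumptions made for this example. It mirrors the idea rather than the library's internal implementation, which operates on logits.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize that set to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return {token: p / cumulative for token, p in nucleus}


def sample_top_p(probs, top_p, rng):
    """Draw one token from the renormalized nucleus."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution (values chosen to sum to exactly 1).
toy_probs = {"hotels": 0.5, "staff": 0.25, "police": 0.125, "app": 0.0625, "linen": 0.0625}

nucleus = top_p_filter(toy_probs, top_p=0.88)
print(sorted(nucleus))  # ['app', 'hotels', 'police', 'staff']

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sampled_token = sample_top_p(toy_probs, top_p=0.88, rng=rng)
print(sampled_token)
```

With `top_p=0.88`, the three most probable tokens cover only 0.875 of the probability mass, so a fourth token is needed to reach the threshold. The remaining tail ("linen") can never be sampled.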

In [73]:
# Write the decoding strategy with nucleus sampling 
#  
# sample_outputs = model.generate (
#    ...
#    specify some additional arguments to implement top-p sampling with a probability of 0.88
#  )

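sample_outputs = model.generate(
    input_ids = test_input_ids,
    attention_mask = test_attention_mask,
    max_length = 150,
    num_beams=1,               # a single beam: no beam search, so the beam-only arguments are dropped
    repetition_penalty = 2.5,
    ############### for student ###################
    do_sample=True,            # sample the next token instead of taking the most probable one
    top_k=0,                   # disable top-k filtering so that only top-p filtering applies
    top_p=0.88,                # sample from the smallest token set with cumulative probability >= 0.88
    num_return_sequences=2     # draw two independent samples
    ##############################################
)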


for sample_output in sample_outputs:
    predicted_summary = mytokenizer.decode(sample_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(f' Decoding strategy: Nucleus sampling, \n Generated summary:  {predicted_summary}')

 Decoding strategy: Nucleus sampling, 
 Generated summary:  hotel staff will be trained to spot signs of sex trafficking, including frequent requests for bed linen changes and a "Do not disturb" sign left on the door for days on end. The initiative comes amid growing international recognition that hotels have a key role to play in fighting modern day slavery.
 Decoding strategy: Nucleus sampling, 
 Generated summary:  hotels in Mumbai and other Indian cities are to train staff to spot signs of sex trafficking such as frequent requests for bed linen changes or a "Do not disturb" sign left on the door for days on end. The initiative is also developing a mobile phone app which staff can use to alert local police and senior anti-trafficking officers if they see suspicious behavior.


### Acknowledgment

If you received help or feedback from fellow students, please acknowledge that here. We count on your academic honesty:

**<font color=blue><<< LIST POTENTIAL COLLABORATORS HERE >>></font>**
