***Deep Learning Applications 2023** course, held by Professor **Andrew David Bagdanov** - University of Florence, Italy*

*Notebook and code created by **Giovanni Colombo** - Mat. 7092745*

Check the dedicated [Repository on GitHub](https://github.com/giovancombo/DLA-Labs/tree/main/lab2).

# Deep Learning Applications: Laboratory #2 - LLMs

In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended to:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along*) with this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY) which shows you how to build an autoregressive GPT model from the ground up.

## Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.

In [203]:
# Imports and dependencies
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb

# %pip install emoji
from emoji import emojize
emojizer = True             # Bringing some color to the code

# Hyperparameters
text = 'taylor_swift'

train_size = 0.7

batch_size = 64             # Batch size = number of independent sequences of text, analyzed in parallel
block_size = 512            # Length of an input sequence of characters, for next character prediction
n_embd = 100                # Embedding dimension for each token

n_heads = 4                 # Number of Self-Attention heads in a Multi-Head Attention block
n_layers = 4                # Number of Blocks of the Transformer
learning_rate = 5e-4
dropout = 0.4

eval_iters = 200
total_steps = 10
log_interval = 100

# Creating a configuration dictionary for logging in wandb
config = dict(
    text = text,
    batch_size = batch_size,
    block_size = block_size,
    n_embd = n_embd,
    n_heads = n_heads,
    n_layers = n_layers,
    learning_rate = learning_rate,
    dropout = dropout,
    eval_iters = eval_iters,
    total_steps = total_steps,
    log_interval = log_interval,
    train_size = train_size,
)

# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

print(emojize(f"Device: {device} :eagle:") if torch.cuda.is_available() else emojize(f"Device: {device} :snail:")) if emojizer else print(f"Device: {device}")

Device: cuda:0 🦅


In [191]:
# Downloading the Dante's Divina Commedia txt file from the internet
#!wget https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt

# Opening and reading the content of the input text file
with open(text + '.txt', 'r', encoding = 'utf-8') as f:
    text = f.read()
print("Length of dataset in characters:", len(text))

# Creating a sorted set of all the unique characters present in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)             # Number of unique characters in the text = size of the vocabulary

Length of dataset in characters: 298350


In [192]:
# Creating dictionaries for mapping characters to integers and vice versa
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}

encode = lambda s: [stoi[i] for i in s]
decode = lambda l: ''.join([itos[i] for i in l])

# text = list of characters
# data = list of integers of all the text --> it's our dataset
data = torch.tensor(encode(text), dtype = torch.long)

# Splitting our dataset in train and validation sets
n = int(train_size*len(data))
train_data, val_data = data[:n], data[n:]

In [193]:
# Let's implement a single Self-Attention Head = creating communication between tokens
class SelfAttention(nn.Module):
    def __init__(self, head_size):
        super(SelfAttention, self).__init__()

        self.query = nn.Linear(n_embd, head_size, bias=False)       # Q,K,V = matrices (n_embd, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape                       # (B,T,C) = (batch_size, block_size, n_embd)
        q = self.query(x)                     # (B,T,n_embd) @ (n_embd,head_size) = (B,T,head_size)
        k = self.key(x)                       # (batch_size, block_size, head_size)
        v = self.value(x)                     # (batch_size, block_size, head_size)
        # Dot-Product and Scaling
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5    # (B,T,head_size) @ (B,head_size,T) = (B,T,T), scaled by 1/sqrt(head_size) as in the paper
        # Masking (only for Decoder)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        # Softmax                         
        wei = F.softmax(wei, dim = -1)
        # Applying dropout for randomly inhibiting some communication between tokens
        wei = self.dropout(wei)
        # Dot-Product with Values
        out = wei @ v                                       # (B,T,T) @ (B,T,head_size) = (B,T,head_size)
        return out

In [194]:
# Let's implement a Multi-Head Attention block = Multiple Self-Attention Heads in parallel and concatenated
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, head_size):
        super(MultiHeadAttention, self).__init__()

        self.heads = nn.ModuleList([SelfAttention(head_size) for _ in range(n_heads)])
        self.projection = nn.Linear(n_embd, n_embd)             # For handling Residual Connections
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim = -1)   # Concatenation of the outputs of the heads
        out = self.dropout(self.projection(out))                # Dropout always at the end
        return out

In [195]:
# Let's implement the Feed Forward block = a simple MLP, allowing tokens to do some computation after communication
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super(FeedForward, self).__init__()

        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),    # The original paper uses 4*n_embd as hidden dimension
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.ffwd(x)

In [196]:
# Let's implement a Decoder block = Multi-Head Attention + Feed Forward, including Residual Connections & Layer Normalization
class DecoderBlock(nn.Module):
    def __init__(self, n_embd, n_heads):
        super(DecoderBlock, self).__init__()

        head_size = n_embd // n_heads       # As the original paper does

        self.attention = MultiHeadAttention(n_heads, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.attention(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [197]:
# Let's implement the Transformer Decoder: many Decoder blocks in sequence, with a final Linear layer
class TransformerDecoder(nn.Module):
    def __init__(self):
        super(TransformerDecoder, self).__init__()
        # Creating Lookup Tables for storing Token and Positional Encodings
        self.tok_embedding = nn.Embedding(vocab_size, n_embd)
        self.pos_embedding = nn.Embedding(block_size, n_embd)

        # In the original paper, authors used 6 layers of 8 heads each
        self.decoder_blocks = nn.Sequential(*[DecoderBlock(n_embd, n_heads = n_heads) for _ in range(n_layers)])

        self.ln = nn.LayerNorm(n_embd)
        self.fc = nn.Linear(n_embd, vocab_size)     # Final Linear layer for predicting the next character

    def forward(self, idx, targets = None):
        B,T = idx.shape         # idx and targets are both (B,T) tensors of integers: B = batch_size, T = block_size
        # Creating the Embeddings for the specific input tokens
        tok_emb = self.tok_embedding(idx)                               # (B,T,C) = (batch_size, block_size, n_embd)
        pos_emb = self.pos_embedding(torch.arange(T, device = device))  # (T,C) = (block_size, n_embd)
        x = tok_emb + pos_emb                       # (B,T,C) + (T,C) = (B,T,C) for Broadcasting Semantics

        x = self.decoder_blocks(x)                  # (B,T,n_embd)
        x = self.ln(x)                              # (B,T,n_embd)        
        logits = self.fc(x)                         # (B,T,vocab_size)

        if targets is None:
            loss = None
            return logits
        else:
            B,T,C = logits.shape                    # (B,T,vocab_size)
            # Reshaping logits and targets for dimension issues with PyTorch cross_entropy function
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
            return logits, targets, loss

    # Let's create a function for generating new text!
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits = self(idx_cond)                     # Calling the Forward method: no targets provided, we're generating
            logits = logits[:, -1, :]                   # Focusing only on the last character                -> (B,vocab_size)
            probs = F.softmax(logits, dim = -1)         # Getting probabilities distribution through Softmax -> (B,vocab_size)           
            idx_next = torch.multinomial(probs, num_samples = 1)     # Sampling from the distribution -> (B,1)
            idx = torch.cat((idx, idx_next), dim = 1)   # Adding the new character to the sequence -> (B,T+1)  
        return idx

In [198]:
# Function for instantiating Model, Loss and Optimizer
def build_model():
    model = TransformerDecoder().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr = learning_rate)

    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(emojize(f":puzzle_piece: Model {model.__class__.__name__} instantiated!\n" +
                        f":beans: Number of parameters: {n_params}") if emojizer else f"Model {model.__class__.__name__} instantiated!\nNumber of parameters: {n_params}")
    print(emojize(f":fire: Optimizer: {optimizer.__class__.__name__}") if emojizer else f"Optimizer: {optimizer.__class__.__name__}")
    print(optimizer)
    print(emojize(f"Device: {device} :eagle:") if torch.cuda.is_available() else emojize(f"Device: {device} :snail:")) if emojizer else print(f"Device: {device}")

    return model, criterion, optimizer

In [199]:
# Let's define a function for getting a new batch of random sequences of characters in the text
def get_batch(split):
    data = train_data if split == 'train' else val_data

    idx = torch.randint(len(data) - block_size, (batch_size,))      # Drawing a set of batch_size indexes in the text
    x = torch.stack([data[i : i+block_size] for i in idx])          # Stacking block_size characters from each index
    y = torch.stack([data[i+1 : i+block_size+1] for i in idx])      # Creating the targets (= inputs shifted by 1)
    x, y = x.to(device), y.to(device)

    return x, y

# Let's create a function for saving and visualizing train and validation losses
@torch.no_grad()                            # Decorator for disabling gradient calculation: better memory usage
def estimate_loss():
    outloss = {}
    outacc = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        accuracies = torch.zeros(eval_iters)
        for k in range(eval_iters):         # Evaluating the losses eval_iters times on different batches
            X, Y = get_batch(split)
            logits, targets, loss = model(X, Y)

            # Computing accuracy
            total, correct = 0, 0
            _, pred = torch.max(logits.data, 1)
            total += targets.size(0)
            correct += (pred == targets).sum().item()
            accuracy = 100 * correct / total

            losses[k] = loss.item()
            accuracies[k] = accuracy
        outloss[split] = losses.mean()
        outacc[split] = accuracies.mean()

    return outloss, outacc                  # out = dictionary of train and validation mean losses

# Function for log of validation data at the end of an epoch
def log_validation(epoch, mean_loss, val_loss, mean_accuracy, val_accuracy, step):
    wandb.log({"Train Loss": mean_loss, 
               "Validation Loss": val_loss,
               "Epoch": epoch + 1,
               "Train Accuracy": mean_accuracy, 
               "Validation Accuracy": val_accuracy,}, step = step)

In [200]:
# Training Loop      
def train(model, criterion, optimizer):
    # Telling W&B to watch gradients and the model parameters
    wandb.watch(model, criterion, log = "all", log_freq = log_interval)
    example_ct = 0

    print(emojize("\nStarting Training...:flexed_biceps_medium-light_skin_tone:") if emojizer else "Starting Training...")
    for step in range(total_steps):
        model.train()
        xb, yb = get_batch('train')                 # Sampling a batch of data
        _, _, loss = model(xb, yb)
        
        optimizer.zero_grad(set_to_none = True)
        loss.backward()
        optimizer.step()

        example_ct += batch_size

        if step % log_interval == 0 or step == total_steps - 1:
            losses, accuracies = estimate_loss()
            log_validation(step, losses['train'], losses['val'], accuracies['train'], accuracies['val'], step)
            print(emojize(f":scroll: Step {step+1}/{total_steps}:\t:dotted_line_face: Train Loss = {losses['train']:.4f}; Val Loss = {losses['val']:.4f}\t:bullseye: Train Accuracy = {accuracies['train']:.2f}%; Val Accuracy = {accuracies['val']:.2f}%") if emojizer
                  else f"Step {step+1}/{total_steps}:\tTrain Loss = {losses['train']:.4f}; Val Loss = {losses['val']:.4f}\tTrain Accuracy = {accuracies['train']:.2f}%; Val Accuracy = {accuracies['val']:.2f}%")

    print(emojize("\nTraining completed!:OK_hand_medium-light_skin_tone:") if emojizer else "\nTraining completed!")

In [201]:
# Function for generating new text!
def text_generator(model, new_tokens):
    # context = First character of the generated sequence = (1,1) Tensor of value 0 --> Token embedding for New Line
    context = torch.zeros((1,1), dtype = torch.long, device = device)

    print(emojize("\n:sparkles: TEXT GENERATION ACTIVATED! Generating new text...") if emojizer else "\nTEXT GENERATION ACTIVATED! Generating new text...")
    generated_text = decode(model.generate(context, max_new_tokens = new_tokens)[0].tolist())
    
    print(emojize("Text generated!:magic_wand:") if emojizer else "Text generated!")
    print(generated_text)

In [205]:
# Generation configuration
generation = False
new_tokens = 500           # Number of tokens generated

# Saving configuration
save_model = False
# `text` was overwritten with the file contents when the dataset was read, so the dataset name is taken from config
folder = f"1_transformers/{config['text']}"
model_name = f"model_{config['text']}_bs{batch_size}_bl{block_size}_ne{n_embd}_nh{n_heads}_nl{n_layers}_lr{learning_rate}"


wandb.login()
print(emojize(":sun: Initializing Weights & Biases run...") if emojizer else "Initializing Weights & Biases run...")

with wandb.init(project = "DLA_Lab2_LLM", config = config):
    config = wandb.config
    
    # Building model and optimizer
    model, criterion, optimizer = build_model()

    # Training the model
    train(model, criterion, optimizer)

    # Generating new text from the model trained (optional)
    if generation:
        text_generator(model, new_tokens)

    # Saving the model (optional)
    if save_model:
        print(emojize(f"\nSaving Model parameters in \'{folder}\'... :writing_hand_medium-light_skin_tone:") if emojizer else f"\nSaving Model parameters in \'{folder}\'...")
        if not os.path.exists(folder):
            os.makedirs(folder)
            print(emojize(f"Folder \'{folder}\' created :file_folder:") if emojizer else f"Folder \'{folder}\' created!")

        torch.save(model.state_dict(), f"{folder}/{model_name}.pt")
        print(emojize(f"\nModel parameters saved! :floppy_disk:") if emojizer else '\nModel parameters saved!')

☀️ Initializing Weights & Biases...


🧩 Model TransformerDecoder instantiated!
🫘 Number of parameters: 553289
🔥 Optimizer: 
AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.0005
    maximize: False
    weight_decay: 0.01
)
Device: cuda:0 🦅

Starting Training...💪🏼
📜 Step 1/10:	🫥 Train Loss = 4.4583; Val Loss = 4.4586	🎯 Train Accuracy = 2.87%; Val Accuracy = 2.83%
📜 Step 10/10:	🫥 Train Loss = 3.4851; Val Loss = 3.5022	🎯 Train Accuracy = 17.09%; Val Accuracy = 16.59%

Training completed!👌🏼


Run history:
Epoch                 ▁█
Train Accuracy        ▁█
Train Loss            █▁
Validation Accuracy   ▁█
Validation Loss       █▁

Run summary:
Epoch                 10.0
Train Accuracy        17.08778
Train Loss            3.48508
Validation Accuracy   16.58801
Validation Loss       3.50219


## Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

### Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face transformer library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text). 

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [119]:
# Imports and dependencies
import os
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from tqdm.auto import tqdm

# Device configuration
device = torch.device('cpu')

print(emojize(f"Device: {device} :snail:")) if emojizer else print(f"Device: {device}")

Device: cpu 🐌


In [120]:
# Custom input text to be tokenized
input_text = "Hello World"

# Creating a subword tokenizer from GPT2 pretrained model: new vocab_size = 50,257!
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Converting string inputs to sequences of tokens: let's compare the input length with the encoded sequence length
# Note the difference: calling the tokenizer directly returns a dict-like object (input_ids + attention_mask),
# encode() returns only the token ids, and tokenize() returns the subword strings
tokenized_text = tokenizer.encode(input_text, return_tensors = 'pt')

print(emojize(f":input_latin_lowercase: Input text:\t\t{input_text}\n:bookmark_tabs: Tokenized text:\t{tokenized_text}") if emojizer else f"Input string:\t\t{input_text}\nTokenized string:\t{tokenized_text}")

🔡 Input text:		Hello World
📑 Tokenized text:	tensor([[15496,  2159]])


The GPT2 tokenizer works at the subword level: typically, the input is split into chunks of a few characters each, and every chunk is encoded as a single integer.
As the input string gets longer, the encoded sequence gets longer too, but it usually stays much shorter than the character count, since a chunk of 2, 3, 4 or more characters can map to a single integer.

With the input *"Hello World"* (11 characters), the tokenizer produces just two integers, one for "Hello" and one for "World". This suggests that many common English words (and possibly words from other languages) have a dedicated token.

Passing the `return_tensors = 'pt'` argument makes the tokenizer output PyTorch tensors instead of lists.

From the original [Hugging Face Documentation](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Tokenizer), we can read that:
> This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not.

Trying to see the behaviour of slightly different versions of the sequence *"Hello World"*, it's possible to notice that:
- *"Hello World"* encodes to tensor([[15496,  2159]])
- *"hello World"* encodes to tensor([[31373,  2159]]) --> Case matters
- *" hello World"* encodes to tensor([[23748,  2159]]) --> Space matters!
- *"HelloWorld"* encodes to tensor([[15496, 10603]])
- *"Hello World "* encodes to tensor([[15496,  2159,   220]]) --> A trailing space with nothing after it gets its own token
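
The variants above can be reproduced in miniature without downloading anything. The sketch below is only an illustration: `toy_vocab` and `toy_encode` are hypothetical names, the ids are made up, and greedy longest-match is a simplification of the byte-level BPE merges that GPT2 actually applies. It only shows why a vocabulary that stores the leading space *inside* the token assigns different ids to the same word in different positions.

```python
# Toy vocabulary with made-up ids (NOT GPT2's): "hello" and " hello" are distinct entries,
# and a bare space has its own entry, mirroring the behaviour observed above
toy_vocab = {"Hello": 0, " World": 1, "World": 2, "hello": 3, " hello": 4, " ": 5}

def toy_encode(text, vocab):
    """Greedy longest-match encoding (a simplification of real BPE)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate starting at `pos` first, then shrink it
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"No token covers {text[pos]!r}")
    return ids

for s in ["Hello World", "hello World", " hello World", "HelloWorld", "Hello World "]:
    ids = toy_encode(s, toy_vocab)
    print(f"{s!r}: {len(s)} characters -> {len(ids)} tokens {ids}")
```

In this toy setting, as with the real tokenizer, case and leading spaces change the ids, and only a space with nothing after it is encoded on its own.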

### Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy* which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.

In [146]:
# Saving configuration
save_generation = False
folder = '2_text_generation'

# We can even ask the user to input a text prompt
input_text = "The main goal in life is"         # input("What do you want to say?\n")

# Hyperparameters for text generation
max_new_tokens = 100
do_sample = True
temperature = 1.2
early_stopping = True
no_repeat_ngram_size = 2

# Loading the pretrained model: setting the padding token as the end of sequence token
model = GPT2LMHeadModel.from_pretrained('gpt2', pad_token_id = tokenizer.eos_token_id)

print(emojize(f"{model.__class__.__name__} instantiated!:OK_hand_medium-light_skin_tone:\n") if emojizer else f"{model.__class__.__name__} instantiated!\n")
print(emojize(f":input_latin_lowercase: Input text:\t{input_text}\n:sparkles: Waiting for the generation of new text...") if emojizer else f"Input text: {input_text}\nWaiting for the generation of new text...")

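# tokenized_text is a dict-like BatchEncoding holding PyTorch tensors ('input_ids' and 'attention_mask')
tokenized_text = tokenizer(input_text, return_tensors = 'pt')

# Let's generate some text from the input sequence: generate() returns a tensor of token ids, which is decoded back to a string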
generated_text = tokenizer.decode(model.generate(tokenized_text['input_ids'],
                                                 max_new_tokens = max_new_tokens,
                                                 do_sample = do_sample,
                                                 temperature = temperature,
                                                 early_stopping = early_stopping,
                                                 no_repeat_ngram_size = no_repeat_ngram_size)[0].tolist())

print(emojize("\nText generated!:magic_wand:") if emojizer else "\nText generated!")
print(f"{generated_text}")


# Saving the generated text in a txt file
if save_generation:
    if not os.path.exists(folder):
        os.makedirs(folder)
        print(emojize(f"Folder \'{folder}\' created!:file_folder:") if emojizer else f"Folder \'{folder}\' created!")

    with open(f"{folder}/generation_log.txt", 'a') as f:
        f.write(f"HPs: max_new_tokens = {max_new_tokens}, do_sample = {do_sample}, temperature = {temperature}, early_stopping = {early_stopping}, no_repeat_ngram_size = {no_repeat_ngram_size}\n\n")
        f.write(f"Input text: {input_text}\nGenerated text: {generated_text}\n\n- - - - - - - - - - - - - - - -\n")

    print(emojize(f"\nText generated saved in '{folder}/generation_log.txt' :floppy_disk:") if emojizer else f"\nText generated saved in '{folder}/generation_log.txt'")

GPT2LMHeadModel instantiated!👌🏼

🔡 Input text:	The main goal in life is
✨ Waiting for the generation of new text...

Text generated!🪄
The main goal in life is to be as creative as possible in your career – so to see this as fulfilling when you can do so without procrastinating, isn't that amazing?

Yes, it is! It is all about taking your imagination but it's all important, and it gives you a boost without necessarily thinking about it in the slightest! You'd never know, but seeing someone take advantage of the creativity of their subconscious without feeling ashamed about doing so is something really awesome to experience. You really give it away

Text generated saved in 'text_generation/generation_log.txt' 💾


In order to qualitatively evaluate the performance of the `generate()` function and the effect of its arguments, I decided to fix the text prompt to be the same at every run: *"Who knows if God exists, but for sure I"*

- `do_sample = False, temperature = 1.0`: with the default arguments, generation is fully greedy and makes little sense. The generated text is a single sentence repeated over and over until the `max_new_tokens` limit is reached. Setting `do_sample` to `False` behaves like a temperature close to zero: sampling (the source of randomness) is switched off in favour of a deterministic, greedy choice of the most likely token.
- Setting `do_sample` to `True` unlocks the generation by sampling from the predicted distribution, which yields more diverse and original sequences instead of always the same greedy text. `temperature` then controls how adventurous the sampling is.
- The higher the `temperature`, the noisier and less predictable the generation. Very high temperatures lead, again, to nonsensical output, with wrong words and sequences of random symbols.
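
The effect of `temperature` described in the list above can be sketched with plain PyTorch. This is a minimal illustration under stated assumptions: the logits are hypothetical values over a 4-token vocabulary, and `next_token_distribution` is an illustrative helper, not part of the Hugging Face API. It applies the standard recipe of dividing the logits by the temperature before the softmax.

```python
import torch
import torch.nn.functional as F

# Hypothetical next-token logits over a 4-token vocabulary (illustrative values only)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

def next_token_distribution(logits, temperature):
    # T < 1 sharpens the distribution towards the most likely token, T > 1 flattens it
    return F.softmax(logits / temperature, dim=-1)

for temperature in (0.1, 1.0, 5.0):
    probs = next_token_distribution(logits, temperature)
    print(f"T = {temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Greedy decoding (do_sample = False) always picks the most likely token
greedy_token = torch.argmax(logits).item()

# Sampling (do_sample = True) draws from the temperature-scaled distribution instead
torch.manual_seed(0)
sampled_tokens = torch.multinomial(next_token_distribution(logits, 1.2), num_samples=5, replacement=True)
print("Greedy token:", greedy_token, "| Sampled tokens:", sampled_tokens.tolist())
```

At a very low temperature almost all of the probability mass sits on the greedy token, which is why greedy decoding and low-temperature sampling behave alike, while at a high temperature the distribution approaches uniform.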

In [143]:
from transformers import pipeline

# An alternative way for generating text is to use a Hugging Face pipeline
generator = pipeline('text-generation', model = model, tokenizer = tokenizer)

print(emojize(f":railway_track: Generating through the \x1B[3mpipeline\x1B[0m\n\n:input_latin_lowercase: Input text:\t{input_text}\n:sparkles: Waiting for the generation of new text...\n") if emojizer else f"Generating through the pipeline\n\nInput text:\t{input_text}\nWaiting for the generation of new text...\n")
print(emojize("Text generated!:magic_wand:\n") if emojizer else "Text generated!\n",
      generator(input_text,
                max_new_tokens = max_new_tokens,
                do_sample = do_sample,
                temperature = temperature,
                early_stopping = early_stopping,
                no_repeat_ngram_size = no_repeat_ngram_size)[0]['generated_text'])

🛤️ Generating through the pipeline

🔡 Input text:	The main goal in life is
✨ Waiting for the generation of new text...

Text generated!🪄
 The main goal in life is always the pursuit of happiness, and that involves a profound level of compassion with every human. Even if others make a mockery of your moral high conduct to stop you from pursuing your dreams - don't even try to get involved.

Being an adult, you do not need to ask anyone "How'd you know I was attractive if a celebrity made a video." Rather, it makes sense for your mind to become an organ with a great deal of attention given towards each individual's unique and fascinating personal


## Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) prepend a special [CLS] token to every input sequence, so the output of each transformer layer contains a [CLS] representation at the first position. You can directly use this as a representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning - that is, just training a shallow MLP to classify or represent with the appropriate loss function.
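
The aggregation options mentioned in the comments above can be sketched on a dummy tensor. This is a minimal illustration, not necessarily the method used later in this notebook: `hidden` is a random stand-in for a model's last hidden state, the three function names are hypothetical, and right padding is assumed.

```python
import torch

# Random stand-in for a model's last hidden state: (batch, tokens, hidden_size)
torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)
# 1 = real token, 0 = padding; the second sequence ends with 2 padding positions
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

def first_token_pool(hidden):
    # BERT-style: use the representation at position 0, where [CLS] sits
    return hidden[:, 0, :]

def last_token_pool(hidden, attention_mask):
    # GPT-style: the last real token has attended to the whole sequence (assumes right padding)
    last_idx = attention_mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), last_idx, :]

def masked_mean_pool(hidden, attention_mask):
    # Average over real tokens only, so that padding does not bias the representation
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

for name, pooled in [("first token", first_token_pool(hidden)),
                     ("last token", last_token_pool(hidden, attention_mask)),
                     ("masked mean", masked_mean_pool(hidden, attention_mask))]:
    print(f"{name}: {tuple(pooled.shape)}")
```

Each strategy turns a `(batch, tokens, hidden_size)` tensor into one `(batch, hidden_size)` vector per sequence, which is the shape a shallow classifier expects.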

### Exercise 3.1: Training a Text Classifier

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

In [233]:
# Imports and dependencies
import os
import numpy as np
import torch
import torch.nn as nn
from datasets import load_dataset
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from tqdm.auto import tqdm

# Generalizing code for handling different datasets and models
datasets = ['ag_news', 'dair-ai/emotion']
dataset_name = datasets[1]
pretrained_model = 'distilbert-base-uncased'

#Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(emojize(f"Device: {device} :eagle:") if torch.cuda.is_available() else emojize(f"Device: {device} :snail:")) if emojizer else print(f"Device: {device}")

Device: cuda 🦅


So far, I just tried to fine-tune a pretrained DistilBERT model for the Sequence Classification task.

But for this specific exercise, fine-tuning can be avoided! One hint is to use DistilBERT *only* as a mere feature extractor, and to use a very shallow model (an MLP, or even a Logistic Regression!) on the final representation for the multi-class classification.

Let's try to do so!

The choice of the dataset is crucial for what we're trying to achieve: text classification without having to fine-tune DistilBERT.

Looking at the Hugging Face Datasets, one of the best candidates is *ag_news*: a moderately sized multi-class dataset with perfectly balanced classes.
But, as always, I cannot feel satisfied with easy things: my attention was also captured by the *dair-ai/emotion* dataset which, compared to the previous one, looks like a real mess! It has 6 skewed classes and not that much data available.

I'll try to face the same challenge using the two datasets, in order to check and report any difference encountered.

In [234]:
# 1) Loading the selected dataset using the load_dataset function from Hugging Face datasets
data = load_dataset(dataset_name)
print(emojize(f"Dataset \'{dataset_name}\' loaded!:OK_hand_medium-light_skin_tone:\n") if emojizer else f"Dataset \'{dataset_name}\' loaded!")

# 2) Instantiate the Tokenizer to tokenize the raw data
tokenizer = DistilBertTokenizer.from_pretrained(pretrained_model)

# 3) Instantiate the pre-trained Model
model = DistilBertModel.from_pretrained(pretrained_model).to(device)

print(emojize(f"\n{tokenizer.__class__.__name__} and {model.__class__.__name__} instantiated!:OK_hand_medium-light_skin_tone:") if emojizer else f"\n{tokenizer.__class__.__name__} and {model.__class__.__name__} instantiated!")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset 'dair-ai/emotion' loaded!👌🏼



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



DistilBertTokenizer and DistilBertModel instantiated!👌🏼


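Looking at the DistilBertTokenizer, I can see that it behaves almost like a word-level tokenizer for common English words. Strictly speaking, it is a WordPiece (subword) tokenizer, so rarer words are split into several pieces.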

In [235]:
# Features Extraction from the dataset (to be done only the first time: following times we can load the saved features)
extract_features = False         # If True, extracts the features from the dataset and saves them again
batch_size = 64

# Saving configuration
save_classification = False
folder = f'3_1_text_classification/{dataset_name}'

if extract_features:
    
    train_features = []
    train_labels = []
    test_features = []
    test_labels = []

    # Extracting features from the Train dataset
    for i in tqdm(range(0, len(data['train']), batch_size), desc = emojize('Extracting Train Features :hammer_and_wrench:') if emojizer else 'Extracting Train Features'):
        
        encoded_inputs = tokenizer(data['train']['text'][i : i + batch_size], padding = True, truncation = True, return_tensors = 'pt').to(device)
        with torch.no_grad():
            outputs = model(**encoded_inputs)
        encoded_features = outputs.last_hidden_state.mean(dim = 1).detach().cpu().numpy()

        train_features.append(encoded_features)
        train_labels.append(data['train']['label'][i : i + batch_size])

    # Extracting features from the Test dataset
    for i in tqdm(range(0, len(data['test']), batch_size), desc = emojize('Extracting Test Features :hammer_and_wrench:') if emojizer else 'Extracting Test Features'):
        
        encoded_inputs = tokenizer(data['test']['text'][i : i + batch_size], padding = True, truncation = True, return_tensors = 'pt').to(device)
        with torch.no_grad():
            outputs = model(**encoded_inputs)
        encoded_features = outputs.last_hidden_state.mean(dim = 1).detach().cpu().numpy()
        
        test_features.append(encoded_features)
        test_labels.append(data['test']['label'][i : i + batch_size])

    # Concatenating the lists of features and labels into a single NumPy array
    # train_features is now a (16000, 768) array: 768 is the embedding dimension
    train_features = np.concatenate(train_features, axis = 0)
    train_labels = np.concatenate(train_labels, axis = 0)

    test_features = np.concatenate(test_features, axis = 0)
    test_labels = np.concatenate(test_labels, axis = 0)

    print(emojize("Done!:OK_hand_medium-light_skin_tone:") if emojizer else 'Done!')

    if save_classification:
        print(emojize(f"\nSaving Features and Labels in \'{folder}\'... :writing_hand_medium-light_skin_tone:") if emojizer else f"\nSaving features and labels in \'{folder}\'...")
        if not os.path.exists(folder):
            os.makedirs(folder)
            print(emojize(f"Folder \'{folder}\' created :file_folder:") if emojizer else f"Folder \'{folder}\' created!")

        np.save(f'{folder}/train_features.npy', train_features)
        np.save(f'{folder}/test_features.npy', test_features)
        np.save(f'{folder}/train_labels.npy', train_labels)
        np.save(f'{folder}/test_labels.npy', test_labels)

        print(emojize(f"Features and Labels saved! :floppy_disk:") if emojizer else 'Features and Labels saved!')
else:
    print(emojize(f"Loading Features and Labels from \'{folder}\'...:open_file_folder:") if emojizer else f"\nLoading features and labels from \'{folder}\'...")

    train_features = np.load(f'{folder}/train_features.npy')
    test_features = np.load(f'{folder}/test_features.npy')
    train_labels = np.load(f'{folder}/train_labels.npy')
    test_labels = np.load(f'{folder}/test_labels.npy')
    
    print(emojize(f"Features and Labels loaded!:OK_hand_medium-light_skin_tone:") if emojizer else 'Features and Labels loaded!')

Loading Features and Labels from 'text_classification/dair-ai/emotion'...📂
Features and Labels loaded!👌🏼
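
A side note on the pooling step: the extraction cell averages `last_hidden_state` over every position, including the padding tokens introduced by `padding = True`. One possible variant is to weight the average by the attention mask so that only real tokens contribute. The sketch below is an illustration with toy tensors; `masked_mean_pool` is a hypothetical helper, not part of the notebook, and whether it changes the final accuracy was not tested here.

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real tokens only, ignoring padded positions."""
    # (batch, tokens) -> (batch, tokens, 1), so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1), guards against division by zero
    return summed / counts

# Toy batch: 2 sequences, 2 tokens each, hidden size 2.
# The second sequence has one padded position holding a large junk value.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0]],
                       [[2.0, 2.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1],
                     [1, 0]])

pooled = masked_mean_pool(hidden, mask)   # padding is ignored
plain = hidden.mean(dim=1)                # padding leaks into the average
```

In the extraction loop, this would correspond to calling `masked_mean_pool(outputs.last_hidden_state, encoded_inputs['attention_mask'])` in place of the plain `.mean(dim = 1)`.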


In [236]:
# Hyperparameters for the shallow classifier (MLP) instantiated
hidden_size = 512
epochs = 20
learning_rate = 1e-3

# Weighted Loss for handling class imbalance
weighted = False

class MLP(nn.Module):
    def __init__(self, hidden_size):
        super(MLP, self).__init__()

        self.fc1 = nn.Linear(768, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 6)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    

# Instantiating Model, Optimizer and Loss
shallowcls = MLP(hidden_size).to(device)
optimizer = torch.optim.Adam(shallowcls.parameters(), lr = learning_rate)

if weighted:
    weights = torch.tensor([3.43, 2.98, 12.27, 7.41, 8.26, 27.97]).to(device)
else:
    weights = None
criterion = nn.CrossEntropyLoss(weight = weights)


# Training Loop
print(emojize("Starting Training...:flexed_biceps_medium-light_skin_tone:") if emojizer else "Starting Training...")
for epoch in range(epochs):
    shallowcls.train()
    for i in range(0, len(train_features), batch_size):
        X = torch.tensor(train_features[i : i + batch_size], dtype = torch.float).to(device)
        Y = torch.tensor(train_labels[i : i + batch_size], dtype = torch.long).to(device)

        outputs = shallowcls(X)
        loss = criterion(outputs, Y)

        optimizer.zero_grad(set_to_none = True)
        loss.backward()
        optimizer.step()

    # Time for Testing
    total, correct = 0, 0
    shallowcls.eval()
    for i in range(0, len(test_features), batch_size):
        Xval = torch.tensor(test_features[i : i + batch_size], dtype = torch.float).to(device)
        Yval = torch.tensor(test_labels[i : i + batch_size], dtype = torch.long).to(device)

        with torch.no_grad():           # No gradients needed at evaluation time
            outputs = shallowcls(Xval)
        pred = outputs.argmax(dim = 1)
        total += Yval.size(0)
        correct += (pred == Yval).sum().item()

    test_accuracy = 100 * correct / total

    print(emojize(f":scroll: Epoch {epoch+1}/{epochs}:\t:dotted_line_face: Training Loss = {loss.item():.4f}   :bullseye: Test Accuracy = {test_accuracy:.2f}%") if emojizer else f"Epoch {epoch+1}/{epochs}:\tTraining Loss = {loss.item():.4f}   Test Accuracy = {test_accuracy:.2f}%")

print(emojize("\nTraining completed!:OK_hand_medium-light_skin_tone:") if emojizer else "\nTraining completed!")

Starting Training...💪🏼
📜 Epoch 1/20:	🫥 Training Loss = 1.0757   🎯 Test Accuracy = 60.40%
📜 Epoch 2/20:	🫥 Training Loss = 1.0004   🎯 Test Accuracy = 61.95%
📜 Epoch 3/20:	🫥 Training Loss = 0.9655   🎯 Test Accuracy = 62.80%
📜 Epoch 4/20:	🫥 Training Loss = 0.9377   🎯 Test Accuracy = 63.70%
📜 Epoch 5/20:	🫥 Training Loss = 0.9017   🎯 Test Accuracy = 64.40%
📜 Epoch 6/20:	🫥 Training Loss = 0.8737   🎯 Test Accuracy = 65.45%
📜 Epoch 7/20:	🫥 Training Loss = 0.8416   🎯 Test Accuracy = 65.15%
📜 Epoch 8/20:	🫥 Training Loss = 0.8069   🎯 Test Accuracy = 65.20%
📜 Epoch 9/20:	🫥 Training Loss = 0.7749   🎯 Test Accuracy = 65.65%
📜 Epoch 10/20:	🫥 Training Loss = 0.7397   🎯 Test Accuracy = 65.80%
📜 Epoch 11/20:	🫥 Training Loss = 0.7071   🎯 Test Accuracy = 66.00%
📜 Epoch 12/20:	🫥 Training Loss = 0.6701   🎯 Test Accuracy = 66.75%
📜 Epoch 13/20:	🫥 Training Loss = 0.6322   🎯 Test Accuracy = 66.50%
📜 Epoch 14/20:	🫥 Training Loss = 0.5930   🎯 Test Accuracy = 66.65%
📜 Epoch 15/20:	🫥 Training Loss = 0.5613   🎯 Test

In [240]:
print(emojize("Training a Logistic Regression model...:flexed_biceps_medium-light_skin_tone:") if emojizer else "Training a Logistic Regression model...")
logreg = LogisticRegression(max_iter = 1500).fit(train_features, train_labels)

print(emojize("Model trained! Making predictions on new data...:face_with_monocle:\n") if emojizer else "Model trained! Making predictions on new data...\n")
pred = logreg.predict(test_features)

accuracy = accuracy_score(test_labels, pred)*100
f1 = f1_score(test_labels, pred, average = 'weighted')
precision = precision_score(test_labels, pred, average = 'weighted')
recall = recall_score(test_labels, pred, average = 'weighted')

print(emojize(f"Prediction completed!\n:bullseye: Test Accuracy = {accuracy:.2f}%\n" +
                    f":firecracker: F1 Score = {f1:.4f}\n" +
                    f":sewing_needle: Precision = {precision:.4f}\n" +
                    f":megaphone: Recall = {recall:.4f}") if emojizer else f"Prediction completed! Test Accuracy = {accuracy:.2f}%\nF1 Score = {f1:.4f}\nPrecision = {precision:.4f}\nRecall = {recall:.4f}")

Training a Logistic Regression model...💪🏼
Model trained! Making predictions on new data...🧐

Prediction completed!
🎯 Test Accuracy = 65.25%
🧨 F1 Score = 0.6436
🪡 Precision = 0.6416
📣 Recall = 0.6525


- On the *dair-ai/emotion* dataset, the multi-class classifier struggles to reach 67% Test Accuracy, with or without loss weights to address the class imbalance problem (while a fine-tuned DistilBERT can reach up to 94% accuracy).
- On the *ag_news* dataset, instead, the classifier effortlessly reaches 92% Test Accuracy.

Furthermore, a simple MLP and an even simpler Logistic Regression give basically the same results.
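
The weighted loss in the MLP cell relies on hard-coded class weights. As a rough sketch, such weights could be derived directly from the training labels with an inverse-frequency scheme (total number of samples divided by the per-class count). That the original values were obtained this way is an assumption, and `inverse_frequency_weights` is a hypothetical helper shown on toy labels.

```python
import numpy as np
import torch

def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class: total samples / samples in that class."""
    counts = np.bincount(labels, minlength=num_classes)
    # np.maximum avoids a division by zero for classes that never appear
    weights = len(labels) / np.maximum(counts, 1)
    return torch.tensor(weights, dtype=torch.float)

# Toy labels: class 0 appears 4 times, classes 1 and 2 appear twice each
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])
toy_weights = inverse_frequency_weights(toy_labels, num_classes=3)

# The weights can then be passed to the loss, as in the MLP cell
criterion = torch.nn.CrossEntropyLoss(weight=toy_weights)
```

With the real data, the call would be `inverse_frequency_weights(train_labels, num_classes = 6)`, so that rare classes contribute more to the loss than frequent ones.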

### Exercise 3.2: Training a Question Answering Model

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).

In [247]:
# Imports and dependencies
import os
import numpy as np
import torch
import torch.nn as nn
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from tqdm.auto import tqdm

# Generalizing code for handling different datasets and models
pretrained_model = 'distilbert-base-uncased'

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(emojize(f"Device: {device} :eagle:") if torch.cuda.is_available() else emojize(f"Device: {device} :snail:")) if emojizer else print(f"Device: {device}")

# Loading the selected dataset (race-middle), Model and Tokenizer
data = load_dataset('race', 'middle')
print(emojize(f"Dataset \'race-middle\' loaded!:OK_hand_medium-light_skin_tone:\n") if emojizer else "Dataset \'race-middle\' loaded!")

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
# AutoModel loads the bare DistilBertModel encoder (DistilBERT has no seq2seq head)
model = AutoModel.from_pretrained(pretrained_model).to(device)

print(emojize(f"\n{tokenizer.__class__.__name__} and {model.__class__.__name__} instantiated!:OK_hand_medium-light_skin_tone:") if emojizer else f"\n{tokenizer.__class__.__name__} and {model.__class__.__name__} instantiated!")

Device: cuda 🦅


Downloading data: 100%|██████████| 405k/405k [00:00<00:00, 786kB/s]
Downloading data: 100%|██████████| 6.97M/6.97M [00:02<00:00, 2.65MB/s]
Downloading data: 100%|██████████| 407k/407k [00:00<00:00, 1.39MB/s]
Generating test split: 100%|██████████| 1436/1436 [00:00<00:00, 20448.77 examples/s]
Generating train split: 100%|██████████| 25421/25421 [00:00<00:00, 515151.11 examples/s]
Generating validation split: 100%|██████████| 1436/1436 [00:00<00:00, 287220.82 examples/s]


Dataset 'race-middle' loaded!👌🏼



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



DistilBertTokenizer and DistilBertModel instantiated!👌🏼


In [269]:
data['train']

# input1 = article
# input2 = question

# output1 = answer predicted from article
# output2 = answer predicted from question

# criterion = nn.MarginRankingLoss()
# loss = criterion(output1, output2, target)

Dataset({
    features: ['example_id', 'article', 'answer', 'question', 'options'],
    num_rows: 25421
})
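
The commented plan above points towards `nn.MarginRankingLoss`. One way it could be used, sketched here under explicit assumptions, is to score each option with a small trainable layer and push the score of the correct option above every wrong one by a margin. The random `features` tensor is a stand-in for frozen DistilBERT features of each (article, question, option) triple, and `scorer` and `ranking_loss` are hypothetical names; none of this was run on RACE.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_questions, num_options, feat_dim = 8, 4, 768

# Stand-in data (assumption): one feature vector per option, one correct index per question
features = torch.randn(num_questions, num_options, feat_dim)
answers = torch.randint(0, num_options, (num_questions,))

scorer = nn.Linear(feat_dim, 1)                 # shallow model: one score per option
criterion = nn.MarginRankingLoss(margin=1.0)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)

def ranking_loss(features, answers):
    scores = scorer(features).squeeze(-1)                      # (questions, options)
    correct = scores.gather(1, answers.unsqueeze(1))           # (questions, 1)
    # Boolean mask selecting the wrong options of every question
    wrong_mask = torch.arange(scores.size(1)).unsqueeze(0) != answers.unsqueeze(1)
    wrong = scores[wrong_mask].view(scores.size(0), -1)        # (questions, options - 1)
    correct = correct.expand_as(wrong)
    target = torch.ones_like(wrong)                            # +1: first input should rank higher
    return criterion(correct.reshape(-1), wrong.reshape(-1), target.reshape(-1))

initial_loss = ranking_loss(features, answers).item()
for _ in range(50):
    loss = ranking_loss(features, answers)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
final_loss = ranking_loss(features, answers).item()

# At inference time, the predicted answer is the highest-scoring option
predictions = scorer(features).squeeze(-1).argmax(dim=1)
```

On the real dataset, the `answer` field would first have to be converted to an integer index (assuming it is stored as a letter, something like `ord(answer) - ord('A')`), and the features would come from the frozen encoder rather than from `torch.randn`.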

### Exercise 3.3: Training a Retrieval Model

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.
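
As a starting point for the cosine-similarity idea mentioned in the exercise text, here is a minimal sketch of the ranking step alone. The toy vectors stand in for sentence embeddings (for example [CLS] or mean-pooled features from a pre-trained model), and `retrieve_top_k` is a hypothetical helper; no retrieval dataset is loaded here.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 2):
    """Rank documents by cosine similarity to the query and return the k best."""
    q = F.normalize(query_emb, dim=-1)   # unit-length query, shape (dim,)
    d = F.normalize(doc_embs, dim=-1)    # unit-length documents, shape (num_docs, dim)
    sims = d @ q                         # dot product of unit vectors = cosine similarity
    scores, indices = sims.topk(k)
    return indices.tolist(), scores.tolist()

# Toy corpus of 3 "documents" in a 3-dimensional embedding space
docs = torch.tensor([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0]])
query = torch.tensor([1.0, 0.0, 0.0])

top_idx, top_scores = retrieve_top_k(query, docs, k=2)
```

Documents 0 and 2 point in nearly the same direction as the query, so they are ranked first, while the orthogonal document 1 is left out.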

In [None]:
# Your code here.