# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended to:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along* with) this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

For the first exercise we use the code from Andrej Karpathy's video ([GitHub](https://github.com/karpathy/ng-video-lecture)).

In [2]:
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 1500
eval_interval = 100
learning_rate = 3e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 300
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

#torch.manual_seed(1337)

with open('/content/LA DIVINA COMMEDIA.txt') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # better init, not covered in the original GPT video, but important, will cover in followup video
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

10.783546 M parameters
step 0: train loss 4.1406, val loss 4.1432
step 100: train loss 2.5886, val loss 2.6254
step 200: train loss 2.4083, val loss 2.4400
step 300: train loss 2.3034, val loss 2.3375
step 400: train loss 2.1545, val loss 2.1923
step 500: train loss 2.0066, val loss 2.0398
step 600: train loss 1.9444, val loss 1.9808
step 700: train loss 1.9071, val loss 1.9435
step 800: train loss 1.8797, val loss 1.9165
step 900: train loss 1.8963, val loss 1.9332
step 1000: train loss 1.9074, val loss 1.9460
step 1100: train loss 1.8983, val loss 1.9323
step 1200: train loss 1.9149, val loss 1.9484
step 1300: train loss 1.9219, val loss 1.9568
step 1400: train loss 2.1243, val loss 2.1544
step 1499: train loss 1.9807, val loss 2.0161

se   fagil valebridonti polan alo bar av
  peri mo co vond
  l'ne par dever ' esti come so ver 'l poi pa O,
  doro creviso tute soi ila beducolso,

do ba piu mie la fu no sel la tra?
 iliccole ntembronto 'no me,

ch'a a l'un p'rettantio fuin piovbal io

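> The text is not really satisfying: most of the words don't make sense at all! Moreover, the model stops improving after roughly the 800th iteration, where the lowest validation loss (1.9165) is found. After that point the training and validation losses rise together and stay close to each other, which suggests unstable optimization (possibly due to the fairly high learning rate of 3e-3) rather than overfitting. Due to the computational resources and time needed to run this cell, we won't rerun it to reproduce that best result.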
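To make the reading of the loss curve concrete, the logged values can be scanned for the step with the lowest validation loss. This is a minimal sketch that simply copies the numbers printed above into a list; `log`, `best_step` and `gap_at_best` are illustrative names and not part of the original script.

```python
# (step, train_loss, val_loss) triples copied from the training log above
log = [
    (0, 4.1406, 4.1432), (100, 2.5886, 2.6254), (200, 2.4083, 2.4400),
    (300, 2.3034, 2.3375), (400, 2.1545, 2.1923), (500, 2.0066, 2.0398),
    (600, 1.9444, 1.9808), (700, 1.9071, 1.9435), (800, 1.8797, 1.9165),
    (900, 1.8963, 1.9332), (1000, 1.9074, 1.9460), (1100, 1.8983, 1.9323),
    (1200, 1.9149, 1.9484), (1300, 1.9219, 1.9568), (1400, 2.1243, 2.1544),
    (1499, 1.9807, 2.0161),
]

# pick the row with the lowest validation loss
best_step, best_train, best_val = min(log, key=lambda row: row[2])
# difference between validation and training loss at that step
gap_at_best = best_val - best_train

print(f"best val loss {best_val:.4f} at step {best_step}")
print(f"train/val gap at that step: {gap_at_best:.4f}")
```

In a longer run, one option would be to save the model weights whenever the validation loss improves and to generate text from that checkpoint instead of the final weights.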



# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face transformer library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers

The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text).

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [3]:
pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-knr75_dz
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-knr75_dz
  Resolved https://github.com/huggingface/transformers to commit 35eac0df75c692c5b93c12f7eaf3279cab8bd7ce
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.31.0.dev0)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31.0.dev0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [5]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
short_text = "I love LLM!"
context = tokenizer(short_text, return_tensors='pt')

long_text = "Each termite is simply an oblivious cog in a tremendous machine programmed by millions of years of termite DNA. \n It is doubtful an individual termite has any idea what its contributions are helping to create. But a human does. \n We can appreciate the elegant forms of their alien cathedrals... We can see the simply beauty of their perfect functionality... \n We can understand the splendid planning of their structure... \n In other words, only an intelligence of a higher order can understand the beauty of what a termite builds"
long_context = tokenizer(long_text , return_tensors='pt')

print(f'{short_text} -> length: {len(short_text)}, \n encoding: {context}') # returns the encoding and the attention mask
print(f"\n\nlong text length: {len(long_text)}, \n encoding: {long_context['input_ids']}")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

I love LLM! -> length: 11, 
 encoding: {'input_ids': tensor([[   40,  1842, 27140,    44,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


long text length: 529, 
 encoding: tensor([[10871,  3381,   578,   318,  2391,   281, 38603, 43072,   287,   257,
         12465,  4572, 27402,   416,  5242,   286,   812,   286,  3381,   578,
          7446,    13,   220,   198,   632,   318, 31608,   281,  1981,  3381,
           578,   468,   597,  2126,   644,   663,  9284,   389,  5742,   284,
          2251,    13,   887,   257,  1692,   857,    13,   220,   198,   775,
           460,  9144,   262, 19992,  5107,   286,   511,  8756,  3797,   704,
         30691,   986,   775,   460,   766,   262,  2391,  8737,   286,   511,
          2818, 11244,   986,   220,   198,   775,   460,  1833,   262, 37196,
          5410,   286,   511,  4645,   986,   220,   198,   554,   584,  2456,
            11,   691,   281,  4430,   286,   257,  2440,  1502,   460,  1833,
           262,  8737,   286

> A longer text produces a longer token sequence. The number of tokens is nevertheless smaller than the number of characters: the short text is encoded as only 5 tokens.

> By default, the tokenizer returns both `input_ids` and `attention_mask`. `input_ids` is the encoded tensor, in which each sub-word token corresponds to an integer. `attention_mask` is a binary tensor of the same length as the token sequence: it is all 1s here, and it tells the model which tokens should be attended to and which should not, which matters when a shorter sentence is padded to match a longer one in the same batch.
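A minimal sketch of how an attention mask arises when sequences of different lengths are batched together. It uses plain Python lists rather than the Hugging Face tokenizer; `pad_batch` and `pad_id` are hypothetical names chosen for illustration, and the second sequence is an arbitrary shorter example.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# first row: the ids printed above for "I love LLM!"; second row: a shorter, arbitrary sequence
batch = [[40, 1842, 27140, 44, 0], [40, 1842]]
# 50256 is the GPT-2 end-of-text id, used here as the padding id (an assumption of this sketch)
ids, mask = pad_batch(batch, pad_id=50256)
print(ids)
print(mask)
```

The padding id cannot be 0 here, because 0 is already the id of a real token (`!`) in the encoding shown above.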

In [6]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
outputs = model(**long_context, labels=long_context["input_ids"])
loss = outputs.loss
logits = outputs.logits

#print(logits)

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

> From the Hugging Face documentation we get a bit more information about the classes used. The tokenizer has been trained to treat spaces as part of the tokens (a bit like SentencePiece), so a word is encoded differently depending on whether it is at the beginning of the sentence (without a leading space) or not.
> The GPT2 model used here (`GPT2LMHeadModel`) is a transformer with a language modeling head on top (a linear layer with weights tied to the input embeddings).
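The `loss` obtained in the cell above (by passing `labels`) is a mean next-token cross-entropy, so its exponential can be read as the perplexity of the model on `long_text`. The sketch below reproduces that relationship on made-up numbers using only the standard library; `token_probs` is a hypothetical list of probabilities that a model assigned to the correct next tokens.

```python
import math

# hypothetical probabilities assigned by a model to each correct next token
token_probs = [0.5, 0.25, 0.125, 0.5]

# cross-entropy: mean negative log-probability of the correct tokens
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
# perplexity is the exponential of the mean cross-entropy
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.4f}  perplexity: {perplexity:.4f}")
```

With the notebook's variables, the same quantity would be `torch.exp(outputs.loss)`.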

## Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.

> We generate text based on the `long_text` defined before, varying the `temperature` parameter.

In [7]:
print(tokenizer.decode(model.generate(long_context['input_ids'], do_sample=True, temperature=0.1, attention_mask=long_context['attention_mask'], max_length=200)[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Each termite is simply an oblivious cog in a tremendous machine programmed by millions of years of termite DNA. 
 It is doubtful an individual termite has any idea what its contributions are helping to create. But a human does. 
 We can appreciate the elegant forms of their alien cathedrals... We can see the simply beauty of their perfect functionality... 
 We can understand the splendid planning of their structure... 
 In other words, only an intelligence of a higher order can understand the beauty of what a termite builds. 

The termite is a very complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex organism. 

It is a complex


In [8]:
print(tokenizer.decode(model.generate(long_context['input_ids'], do_sample=True, temperature=0.7, attention_mask=long_context['attention_mask'], max_length=200)[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Each termite is simply an oblivious cog in a tremendous machine programmed by millions of years of termite DNA. 
 It is doubtful an individual termite has any idea what its contributions are helping to create. But a human does. 
 We can appreciate the elegant forms of their alien cathedrals... We can see the simply beauty of their perfect functionality... 
 We can understand the splendid planning of their structure... 
 In other words, only an intelligence of a higher order can understand the beauty of what a termite builds. 
This is why we must not be so slow to acknowledge the potential of this system. 
It is the way we must understand our own biology. 
Because in our ignorance of our biology, we see only what we know.
The only way to truly understand a termite is to work with it.
To understand a termite is to work with its own DNA.
We can understand both the language of its DNA and the


In [9]:
print(tokenizer.decode(model.generate(long_context['input_ids'], do_sample=True, temperature=1.0, attention_mask=long_context['attention_mask'], max_length=200)[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Each termite is simply an oblivious cog in a tremendous machine programmed by millions of years of termite DNA. 
 It is doubtful an individual termite has any idea what its contributions are helping to create. But a human does. 
 We can appreciate the elegant forms of their alien cathedrals... We can see the simply beauty of their perfect functionality... 
 We can understand the splendid planning of their structure... 
 In other words, only an intelligence of a higher order can understand the beauty of what a termite builds. The human may know only that this is a workaday process where the process is almost a silent thing, no longer producing ideas.  Even so, however wonderful its architecture might be, those who have learned the art of the termite will only know that it has been programmed into the processes that help to create them. 
This isn't to suggest that, as with anything in human existence, the human mind may be more advanced than just an intelligent


> `temperature` (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.

> For very low temperatures, the model isn't able to generate satisfying text and simply repeats the same sentence over and over. The more we increase it, the more complex the text becomes; although it makes sense grammatically, it doesn't blend too well with the original text.
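To see why a low temperature leads to repetitive text, here is a minimal sketch of temperature scaling: the logits are divided by the temperature before the softmax. The logits are made up, and `softmax_with_temperature` is an illustrative helper, not a function of the Transformers library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.0):  # the three temperatures tried above
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all probability mass lands on the top-scoring token, so sampling behaves much like greedy decoding; at 1.0 the distribution is left unchanged and lower-ranked tokens are drawn more often.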

> Note that all the generated text repeats the input: `generate()` returns the prompt tokens followed by the newly generated ones. The continuation also seems to struggle with generating new lines and empty spaces.

> Now we generate text by setting `do_sample = False`. This parameter enables decoding strategies based on sampling; when it is turned off, decoding is greedy (so the `temperature` argument has no effect), and the model struggles significantly to generate new text and simply repeats the same sentence over and over.

In [10]:
print(tokenizer.decode(model.generate(long_context['input_ids'], do_sample=False, temperature=1.0, attention_mask=long_context['attention_mask'], max_length=200)[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Each termite is simply an oblivious cog in a tremendous machine programmed by millions of years of termite DNA. 
 It is doubtful an individual termite has any idea what its contributions are helping to create. But a human does. 
 We can appreciate the elegant forms of their alien cathedrals... We can see the simply beauty of their perfect functionality... 
 We can understand the splendid planning of their structure... 
 In other words, only an intelligence of a higher order can understand the beauty of what a termite builds. 

The termite is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism. 

It is a very complex organism
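
The loop seen with `do_sample=False` can be reproduced with a toy model. The sketch below is not GPT2: `next_token_probs` is a hypothetical, hand-written table of next-token probabilities, and `generate` is a small illustrative function that decodes either greedily or by sampling.

```python
import random

# hypothetical next-token distributions over a five-word vocabulary
next_token_probs = {
    "it":       {"is": 0.9, "organism": 0.1},
    "is":       {"a": 0.6, "complex": 0.4},
    "a":        {"complex": 0.55, "organism": 0.45},
    "complex":  {"organism": 0.7, "it": 0.3},
    "organism": {"it": 0.5, "is": 0.3, "a": 0.2},
}

def generate(start, steps, do_sample, rng=None):
    tokens = [start]
    for _ in range(steps):
        dist = next_token_probs[tokens[-1]]
        if do_sample:
            # draw the next token in proportion to its probability
            nxt = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        else:
            # greedy decoding: always take the most probable token
            nxt = max(dist, key=dist.get)
        tokens.append(nxt)
    return tokens

greedy = generate("it", 10, do_sample=False)
sampled = generate("it", 10, do_sample=True, rng=random.Random(0))
print(" ".join(greedy))
print(" ".join(sampled))
```

Greedy decoding is deterministic, so once it revisits a context it has already seen it must follow the same path again; sampling can leave such a cycle because lower-probability tokens are occasionally chosen.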


# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out that you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) have a special [CLS] token prepended to the input sequence, so its latent representation is available in the output of each self-attention block. You can directly use this as a representation for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.
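
The aggregation options mentioned in the comments above can be sketched in plain Python. The hidden states below are made up, the sequence is assumed to be right-padded, and `mean_pool`, `last_token` and `first_token` are hypothetical helper names.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def last_token(hidden_states, attention_mask):
    """Pick the vector of the last non-padded token (one option for GPT-style models)."""
    last_index = max(i for i, keep in enumerate(attention_mask) if keep)
    return hidden_states[last_index]

def first_token(hidden_states):
    """Pick the vector at position 0, where BERT-style models place [CLS]."""
    return hidden_states[0]

# made-up 2-dimensional hidden states for 4 positions; the last position is padding
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(hidden, mask))
print(last_token(hidden, mask))
print(first_token(hidden))
```

For an autoregressive model the last real token is the only position that has attended to the whole input, which is why it is a natural single token to select.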

## Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

## Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).
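
As a rough illustration of the ranking idea, the sketch below writes out the per-pair formula documented for `torch.nn.MarginRankingLoss`, `max(0, -y * (x1 - x2) + margin)`, in plain Python. The scores are made up, and the margin of 1.0 is an arbitrary choice for this example (the PyTorch default is 0).

```python
def margin_ranking_loss(score_a, score_b, target, margin=1.0):
    """max(0, -target * (score_a - score_b) + margin) for a single pair.

    target = +1 means score_a should rank higher, -1 means score_b should.
    """
    return max(0.0, -target * (score_a - score_b) + margin)

# hypothetical scores a ranking model gave to the correct answer and to three distractors
correct_score = 2.5
distractor_scores = [0.5, 2.0, 3.0]

losses = [margin_ranking_loss(correct_score, s, target=1) for s in distractor_scores]
print(losses)
print(sum(losses) / len(losses))
```

The loss is zero only when the correct answer beats a distractor by at least the margin, so training pushes the correct choice to the top of the ranking.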

## Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.
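
A minimal sketch of the naive approach, ranking documents by cosine similarity to a query. The three-dimensional vectors are made up and stand in for [CLS] embeddings; `cosine_similarity` is written out by hand so the example needs only the standard library.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-dimensional embeddings standing in for [CLS] vectors
query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [0.9, 0.1, 1.1],
    "doc_b": [-1.0, 0.5, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

# sort document names from most to least similar to the query
ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranking)
```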

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.

> We choose Exercise 3.1.

In [11]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.13.

In [12]:
import torch
from datasets import load_dataset
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

from tabulate import tabulate
from tqdm import trange
import random

In [13]:
raw_tweets = load_dataset("tweet_eval", "emotion")

emotions = ('anger', 'joy', 'optimism', 'sadness')

Downloading builder script:   0%|          | 0.00/9.72k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/30.4k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/21.9k [00:00<?, ?B/s]

Downloading and preparing dataset tweet_eval/emotion to /root/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Downloading data files:   0%|          | 0/6 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/134k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/60.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/569 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/183 [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/6 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/3257 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1421 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/374 [00:00<?, ? examples/s]

Dataset tweet_eval downloaded and prepared to /root/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

> From Hugging Face we import the emotion subset of the TweetEval dataset, in which each tweet is labeled with one emotion: `anger`, `joy`, `optimism`, or `sadness`. An example is shown below:

In [14]:
raw_tweets["train"][1]

{'text': "My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs",
 'label': 0}

In [15]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # the tokenizer already adds the special symbols [CLS] and [SEP] (ids 101 and 102)

context = tokenizer(raw_tweets["train"][1]['text'], return_tensors='pt')
print(context)
print(tokenizer.decode(context["input_ids"][0]))

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

{'input_ids': tensor([[  101,  2026, 18328,  1024,  2009,  1005,  1055,  3100,  2008,  2057,
          2064,  1005,  1056,  6297,  2138,  2057,  2031,  8285, 27108,  2890,
          6593,  1012,  1001,  6659,  1001,  2034, 11108, 21572,  5910,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1]])}
[CLS] my roommate : it's okay that we can't spell because we have autocorrect. # terrible # firstworldprobs [SEP]


In [16]:
train_dl = DataLoader(raw_tweets["train"], batch_size=16)
val_dl = DataLoader(raw_tweets["validation"], batch_size=16)

In [17]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = len(emotions), output_attentions = False, output_hidden_states = False)

optimizer = torch.optim.AdamW(model.parameters(),
                              lr = 5e-5,
                              eps = 1e-08
                              )

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [18]:
epochs = 5

def base_training(model, train_dl, val_dl, tokenizer):
  for _ in trange(epochs, desc = 'Epoch'):

      model.train()

      tr_loss = 0
      tr_steps = 0

      for batch in train_dl:
          cxc = tokenizer(batch['text'], return_tensors='pt', padding=True)
          b_input_ids = cxc['input_ids']
          b_attention_mask = cxc['attention_mask']
          b_labels = batch['label']
          b_input_ids, b_attention_mask, b_labels = b_input_ids.to(device), b_attention_mask.to(device), b_labels.to(device)
          optimizer.zero_grad()

          train_output = model(b_input_ids,
                              token_type_ids = None,
                              attention_mask = b_attention_mask,
                              labels = b_labels)

          train_output.loss.backward()
          optimizer.step()

          tr_loss += train_output.loss.item()
          tr_steps += 1

      # ========== Validation ==========

      # Set model to evaluation mode
      model.eval()


      for batch in val_dl:
          cxc = tokenizer(batch['text'], return_tensors='pt', padding=True)
          b_input_ids = cxc['input_ids']
          b_attention_mask = cxc['attention_mask']
          b_labels = batch['label']
          b_input_ids, b_attention_mask, b_labels = b_input_ids.to(device), b_attention_mask.to(device), b_labels.to(device)
          with torch.no_grad():

            eval_output = model(b_input_ids,
                                token_type_ids = None,
                                attention_mask = b_attention_mask)

          logits = eval_output.logits.detach().cpu().numpy()
          label_ids = b_labels.to('cpu').numpy()

      print('\n\t - Train loss: {:.4f}'.format(tr_loss / tr_steps))


base_training(model, train_dl, val_dl, tokenizer)


Epoch:  20%|██        | 1/5 [00:34<02:16, 34.11s/it]


	 - Train loss: 0.7902


Epoch:  40%|████      | 2/5 [01:06<01:39, 33.31s/it]


	 - Train loss: 0.3700


Epoch:  60%|██████    | 3/5 [01:39<01:06, 33.23s/it]


	 - Train loss: 0.2110


Epoch:  80%|████████  | 4/5 [02:12<00:33, 33.14s/it]


	 - Train loss: 0.1197


Epoch: 100%|██████████| 5/5 [02:45<00:00, 33.14s/it]


	 - Train loss: 0.0871
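
The validation loop in `base_training` collects `logits` and `label_ids` for every batch but does not turn them into a metric. A minimal sketch of how accuracy could be accumulated across batches follows; it uses plain Python lists with made-up values, and `count_correct` is a hypothetical helper.

```python
def count_correct(logits, label_ids):
    """Return (number of correct predictions, number of examples) for one batch."""
    correct = 0
    for row, label in zip(logits, label_ids):
        # predicted class = index of the largest logit
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct, len(label_ids)

# two made-up validation batches with 4 classes (anger, joy, optimism, sadness)
batches = [
    ([[2.0, 0.1, -1.0, 0.3], [0.0, 3.0, 0.5, -0.2]], [0, 1]),
    ([[0.2, 0.1, 0.0, 1.5], [1.0, 0.9, 0.8, 0.7]], [3, 2]),
]

total_correct = total_seen = 0
for logits, label_ids in batches:
    correct, seen = count_correct(logits, label_ids)
    total_correct += correct
    total_seen += seen

val_accuracy = total_correct / total_seen
print(f"validation accuracy: {val_accuracy:.2f}")
```

With the NumPy arrays already available in the loop, the per-batch comparison could be written as `np.argmax(logits, axis=1) == label_ids`.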





In [19]:
#Let's predict a custom tweet!

def twt_prediction(tweet, model):
  cxc = tokenizer(tweet, return_tensors='pt', padding=True)
  b_input_ids = cxc['input_ids']
  b_attention_mask = cxc['attention_mask']
  output = model(b_input_ids, token_type_ids = None, attention_mask = b_attention_mask)
  logits = output.logits.detach().cpu().numpy()
  print(logits[0])
  print(f'The max is in position {np.argmax(logits[0])}. \n The tweet "{tweet}" is classified as: {emotions[np.argmax(logits[0])]}')
  return np.argmax(logits[0])

In [20]:
print(emotions)

('anger', 'joy', 'optimism', 'sadness')


In [22]:
model.cpu()

new_tweet="i love my cat!!!"
twt_prediction(new_tweet, model)

[-2.3638327  5.720331  -1.8241957 -2.165705 ]
The max is in position 1. 
 The tweet "i love my cat!!!" is classified as: joy


1

In [23]:
another_tweet="this heat in july makes me wish i was at the beach instead"
twt_prediction(another_tweet, model)

[-2.3008757  -0.55922985 -0.4059425   4.046811  ]
The max is in position 3. 
 The tweet "this heat in july makes me wish i was at the beach instead" is classified as: sadness


3
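
The raw logits printed by `twt_prediction` are easier to read as probabilities. As a small sketch, the values below are copied from the two outputs above and passed through a softmax implemented with the standard library; `softmax` is written out by hand here.

```python
import math

emotions = ('anger', 'joy', 'optimism', 'sadness')

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# logits printed above for the two example tweets
cat_logits = [-2.3638327, 5.720331, -1.8241957, -2.165705]
beach_logits = [-2.3008757, -0.55922985, -0.4059425, 4.046811]

for name, logits in (("cat", cat_logits), ("beach", beach_logits)):
    probs = softmax(logits)
    print(name, {e: round(p, 4) for e, p in zip(emotions, probs)})
```

Both predictions are confident, with the second one slightly less so: `joy` and `optimism` together receive a small share of the probability mass for the beach tweet.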