# Introduction
In this laboratory we will get our hands dirty working with Large Language Models (e.g. GPT and BERT) to do various useful things. If you haven't already, it is highly recommended to:

+ Read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper, which is the basis for all transformer-based LLMs.
+ Watch (and potentially *code along*) with this [Andrej Karpathy video](https://www.youtube.com/watch?v=kCc8FmEb1nY) which shows you how to build an autoregressive GPT model from the ground up.

# Exercise 1: Warming Up
In this first exercise you will train a *small* autoregressive GPT model for character generation (the one used by Karpathy in his video) to generate text in the style of Dante Alighieri. Use [this file](https://archive.org/stream/ladivinacommedia00997gut/1ddcd09.txt), which contains the entire text of Dante's Inferno (**note**: you will have to delete some introductory text at the top of the file before training). Train the model for a few epochs, monitor the loss, and generate some text at the end of training. Qualitatively evaluate the results.
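
A character-level model treats every distinct character in the training text as one token. The sketch below shows the idea on a single verse; the names `encode_chars` and `decode_chars` are hypothetical and are not the helpers from `lab2.models.gpt`, whose implementation is not shown here.

```python
# Toy stand-in for the cleaned Inferno text (assumption: the real file is read into one string).
text = "Nel mezzo del cammin di nostra vita"

# The vocabulary is the sorted set of characters that occur in the text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode_chars(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[c] for c in s]


def decode_chars(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode_chars("cammin")
print(len(chars), "distinct characters")
print(ids, "->", decode_chars(ids))
```

With this encoding the sequence length equals the number of characters, which is why a character-level model needs a long context to cover even a few verses.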

In [1]:
import os
import torch
from pathlib import Path

from lab2.models.gpt import decode, estimate_loss, get_batch, GPTLanguageModel

def make_paths_relative_to_root():
    """Always use the same, absolute (relative to root) paths
    which makes moving the notebooks around easier.
    """
    top_level = Path(__file__).parent
    os.chdir(top_level)

In [2]:
batch_size = 64 # how many independent sequences will we process in parallel?

max_iters = 1500
eval_interval = 100
learning_rate = 3e-4

eval_iters = 200 # how many iterations to use for evaluation
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Using device: cuda
10.783546 M parameters


In [13]:
for step in range(max_iters):  # `step` avoids shadowing the built-in iter()

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
torch.save(model.state_dict(), 'gpt.pth' )

step 0: train loss 1.4181, val loss 1.5813
step 100: train loss 1.3261, val loss 1.5519
step 200: train loss 1.2284, val loss 1.5358


KeyboardInterrupt: 

In [3]:
# generate from the model
model.load_state_dict(torch.load('runs/nanogpt/gpt.pth', map_location=device))  # map_location lets a GPU checkpoint load on CPU
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))


Udivi ai' dentro paro` che quei pion, speragia
  vangiandom'io conorposto so buoro.

Come farso lei gin.  Corto a costa partia:
  Feggia` che ira dirro
  - per la e scena Folgrimentici.

<<O a milla diserrar ci tua calde>>.
  dispirtoritto al suoi quel Sel eti cotiace!
  zur dentro fingra infernon posa.

<<I' volta che vol'omo sa sua quenta,
  di Vuto di quando Sannomando` trai>>.

Li` ne' si de' l folsen gralicevigio,
  come i faron, avelto perro cossominci
  e che io Alo mo,rte non largizz'amb


# Exercise 2: Working with Real LLMs

Our toy GPT can only take us so far. In this exercise we will see how to use the [Hugging Face](https://huggingface.co/) model and dataset ecosystem to access a *huge* variety of pre-trained transformer models.

## Exercise 2.1: Installation and text tokenization

First things first, we need to install the [Hugging Face transformer library](https://huggingface.co/docs/transformers/index):

    conda install -c huggingface -c conda-forge transformers
    
The key classes that you will work with are `GPT2Tokenizer` to encode text into sub-word tokens, and the `GPT2LMHeadModel`. **Note** the `LMHead` part of the class name -- this is the version of the GPT2 architecture that has the text prediction heads attached to the final hidden layer representations (i.e. what we need to **generate** text). 

Instantiate the `GPT2Tokenizer` and experiment with encoding text into integer tokens. Compare the length of input with the encoded sequence length.

**Tip**: Pass the `return_tensors='pt'` argument to the tokenizer to get PyTorch tensors as output (instead of lists).

In [73]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# load the tokenizer; return_tensors is an argument of the encode/call methods, not of from_pretrained
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='left')
# define model backbone
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

GPT uses a BPE (byte-pair encoding) tokenizer, a subword-based tokenizer that strikes a balance between character-level and word-level tokenization: the most common words are represented as single tokens, while rarer words are split into sequences of subword units.
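
To make the merge idea concrete, here is a minimal sketch of BPE training on a toy word-frequency table. It is a simplified character-level illustration, not GPT-2's actual byte-level tokenizer; the corpus and the function names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

Frequent character sequences such as `est` end up as single symbols after a few merges, while rare words stay split into smaller pieces.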


In [74]:
prompt = 'Large Language Models will become sentient because'  # `prompt` avoids shadowing the built-in input()

input_ids = tokenizer.encode(prompt)
print(input_ids)
print(tokenizer.convert_ids_to_tokens(input_ids))

[21968, 15417, 32329, 481, 1716, 36834, 780]
['Large', 'ĠLanguage', 'ĠModels', 'Ġwill', 'Ġbecome', 'Ġsentient', 'Ġbecause']


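In this example every word maps to exactly one token (seven words, seven tokens), because they are all common English words; rarer words would be split into several sub-word tokens. The character Ġ marks a token that begins with a space, i.e. a token that starts a new word. The first token, `Large`, has no Ġ because the prompt does not start with a space.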

## Exercise 2.2: Generating Text

There are a lot of ways we can, given a *prompt* in input, sample text from a GPT2 model. Instantiate a pre-trained `GPT2LMHeadModel` and use the [`generate()`](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/text_generation#transformers.GenerationMixin.generate) method to generate text from a prompt.

**Note**: The default inference mode for GPT2 is *greedy*, which might not result in satisfying generated text. Look at the `do_sample` and `temperature` parameters.

### Greedy search
In greedy search we always pick the token with the highest probability as the next token. This strategy can lead to repetitive and not very interesting text, as we can see in the following example.
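
The repetition is easy to reproduce on a toy model. The sketch below uses a hypothetical next-token table (`NEXT`) that conditions only on the previous token, which is far simpler than GPT2, but it shows how always taking the arg-max can lock the output into a cycle.

```python
# Hypothetical next-token probabilities, conditioned only on the previous token.
NEXT = {
    "the": {"AI": 0.6, "model": 0.4},
    "AI": {"will": 0.7, "can": 0.3},
    "will": {"learn": 0.55, "talk": 0.45},
    "learn": {"the": 0.9, ".": 0.1},
}


def greedy_decode(start, steps):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(steps):
        dist = NEXT[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
    return tokens


print(" ".join(greedy_decode("the", 8)))
```

Because the choice at each step is deterministic, the same four tokens repeat forever once the sequence returns to `the`.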

In [75]:
# encode the context the generation is conditioned on
# option 1: unconditional generation, starting from the BOS token only
# prompt = tokenizer.bos_token
# option 2: condition on a prompt
prompt = 'Large Language Models will become sentient because'

input_ids = tokenizer.encode(prompt, return_tensors='pt')

output = model.generate(input_ids, max_length=128)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Large Language Models will become sentient because they will be able to communicate with each other.

The AI will be able to communicate with the human race through the use of the language of the AI.

The AI will be able to communicate with the human race through the use of the language of the AI. The AI will be able to communicate with the human race through the use of the language of the AI. The AI will be able to communicate with the human race through the use of the language of the AI. The AI will be able to communicate with the human race through the use of the language of the AI. The AI


### Multinomial sampling
In multinomial sampling we randomly select the next token according to the probability distribution over the vocabulary. This yields more diverse text, because less probable tokens can also be sampled. To enable multinomial sampling, set `do_sample=True`.

We can control the diversity of the sampling by using the `temperature` parameter. A low temperature will result in a more conservative sampling, while a high temperature will result in a more diverse sampling.
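
Temperature works by dividing the logits by a constant before the softmax. The following sketch, with made-up logits and a hypothetical helper `softmax_with_temperature`, shows how the resulting distribution sharpens or flattens.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (numerically stabilised)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a temperature close to zero almost all probability mass sits on the top token, so sampling behaves much like greedy search; this is probably why the `temperature=0.1` output below resembles the greedy one.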

In [76]:
output = model.generate(input_ids,temperature=0.1,do_sample=True, max_length=128)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Large Language Models will become sentient because of the fact that they are able to communicate with each other.

The AI will be able to communicate with its own language, and will be able to communicate with other AI's language. The AI will be able to communicate with its own language, and will be able to communicate with other AI's language. The AI will be able to communicate with its own language, and will be able to communicate with other AI's language. The AI will be able to communicate with its own language, and will be able to communicate with other AI's language. The AI will be able to communicate with its own


In [77]:
output = model.generate(input_ids,temperature=0.9,do_sample=True,max_length=512)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Large Language Models will become sentient because they have learned how to communicate with each others. This can become even more valuable in the future when these groups are both very interested in one another (and they often share many knowledge about the environment).

Language models can also be used to show the difference between words in a world with different language systems – e.g. we can use the model to show all words are related to each other. Similarly, we can show which language is related to the other. Such language models can be used to show the differences between individuals to show that their language systems are related to them.

These models can also be used to show how people think of the world infruit different from normal human speech. A good example of this can be seen with a simple language model that looks at a group of people who talk about the world

### Beam search
In beam search we keep several sequence hypotheses at each time step, and eventually choose the one with the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that are not necessarily the most probable at each step. To perform beam search we set the `num_beams` parameter.

We can also combine the beam search with multinomial sampling by setting the `do_sample` parameter.
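
A minimal sketch of beam search on a toy model illustrates why keeping several hypotheses can beat greedy search. The probability table `PROBS` is invented for this example and the function is not how `generate()` is implemented internally.

```python
import math

# Hypothetical model: P(next token | prefix) for sequences of length 2.
PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(num_beams, length):
    """Keep the `num_beams` best prefixes at each step; return the best one."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, p in PROBS[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


for k in (1, 2):
    seq, score = beam_search(num_beams=k, length=2)
    print(f"num_beams={k}: {seq} with probability {math.exp(score):.2f}")
```

With a single beam (greedy search) the first step commits to `a`, and the best completion has probability 0.30. With two beams the less likely first token `b` survives and leads to the sequence `b x` with probability 0.36.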

In [78]:
output = model.generate(input_ids, num_beams=5, temperature=0.9,do_sample=True, max_length=128)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
Large Language Models will become sentient because of the way they interact with the world around them.

In addition to the language models, there are many other models that can be used in order to interact with the world around them. These include:

Human Language Models

Human Language Models

Human Language Models

Human Language Models

Human Language Models


# Exercise 3: Reusing Pre-trained LLMs (choose one)

Choose **one** of the following exercises (well, *at least* one). In each of these you are asked to adapt a pre-trained LLM (`GPT2Model` or `DistilBERT` are two good choices) to a new Natural Language Understanding task. A few comments:

+ Since GPT2 is an *autoregressive* model, there is no latent space aggregation at the last transformer layer (you get the same number of tokens out as you give in input). To use a pre-trained model for a classification or retrieval task, you should aggregate these tokens somehow (or opportunistically select *one* to use).

+ BERT models (including DistilBERT) prepend a special [CLS] token to every input sequence, so the output of each transformer layer contains a corresponding [CLS] representation. You can use the final-layer [CLS] representation directly for classification (or retrieval).

+ The first *two* exercises below can probably be done *without* any fine-tuning -- that is, just training a shallow MLP to classify or represent with the appropriate loss function.
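
Two common ways to aggregate per-token outputs are sketched below: averaging over the non-padding tokens, or selecting the last non-padding token. The sketch uses plain Python lists for one sequence so that it runs without dependencies; with PyTorch the same logic would be a masked sum over the sequence dimension. The function names are hypothetical.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, skipping padding positions."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]


def last_token(hidden_states, attention_mask):
    """Select the vector of the last non-padding token."""
    last = max(i for i, keep in enumerate(attention_mask) if keep)
    return hidden_states[last]


# Three token vectors of dimension 2; the third position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(mean_pool(hidden, mask))
print(last_token(hidden, mask))
```

For an autoregressive model the last token is a natural choice, since it is the only position that has attended to the whole input.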

## Exercise 3.1: Training a Text Classifier (easy)

Peruse the [text classification datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=downloads). Choose a *moderately* sized dataset and use an LLM to train a classifier to solve the problem.

**Note**: A good first baseline for this problem is certainly to use an LLM *exclusively* as a feature extractor and then train a shallow model.

## Exercise 3.2: Training a Question Answering Model (harder)

Peruse the [multiple choice question answering datasets on Hugging Face](https://huggingface.co/datasets?task_categories=task_categories:multiple-choice&sort=downloads). Choose a *moderately* sized one and train a model to answer contextualized multiple-choice questions. You *might* be able to avoid fine-tuning by training a simple model to *rank* the multiple choices (see margin ranking loss in PyTorch).

## Exercise 3.3: Training a Retrieval Model (hardest)

The Hugging Face dataset repository contains a large number of ["text retrieval" problems](https://huggingface.co/datasets?task_categories=task_categories:text-retrieval&p=1&sort=downloads). These tasks generally require that the model measure *similarity* between text in some metric space -- naively, just a cosine similarity between [CLS] tokens can get you pretty far. Find an interesting retrieval problem and train a model (starting from a pre-trained LLM of course) to solve it.
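
The naive baseline mentioned above can be sketched in a few lines: embed the query and every document, then sort documents by cosine similarity. The vectors here are made-up two-dimensional stand-ins for [CLS] embeddings, and `rank_documents` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]


query = [1.0, 0.0]
docs = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]
print(rank_documents(query, docs))
```

Cosine similarity ignores vector length, so document 1 ranks first even though its norm differs from the query's.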

**Tip**: Sometimes identifying the *retrieval* problems in these datasets can be half the challenge. [This dataset](https://huggingface.co/datasets/BeIR/scifact) might be a good starting point.

## 3.2 Question Answering on the RACE Dataset
Here I train a QA model on the [RACE](https://huggingface.co/datasets/race) dataset.
The dataset consists of 28K passages and nearly 100K questions collected from English examinations in China, designed for middle and high school students.

In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset("race","middle")
batch_size = 8

Found cached dataset race (C:/Users/Dario/.cache/huggingface/datasets/race/middle/0.1.0/5839ff74a429622f5f20cca69c5fcf0e87ac6d5fd2777c42b948000684829f7b)


  0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
train_data = dataset['train']
val_data = dataset['validation']
print("Train dataset length:", len(train_data))
dataiter = iter(val_data)

Train dataset length: 25421


In [7]:
row = next(dataiter)
# print(row)
print(f"CONTEXT: {row['article'][0:1000]} ...")
print(f"QUESTION: {row['question']}")
print(f"OPTIONS: {row['options']} - CORRECT: {row['answer']}")

CONTEXT: APRIL 5 marks the 100thanniversary of the sinking of the Titanic. In 1997, the movieTitanicwas a huge hit all around the world. Now, the 3-D version of the movie will come to Chinese theaters on April 10.
As anyone who has seen the movie knows, the Titanic struck an iceberg  and sank on its first _ in 1912, killing 1,517 people. A century after the accident, scientists have found something new to blame  for the sinking: the moon. Donald Olson, a Texas State University physicist, led a team of astronomers  to examine the moon's role, according to Reuters.
Normally, icebergs stay in place and cannot move until they melt enough or a high tide  frees them. A single iceberg can become stuck many times on its journey southward. The process can take several years.
According to Olson, a hundred years ago the moon made its closest approach to the Earth in more than 1,400 years. This caused the moon to have a much stronger pull on the Earth's oceans than usual, which created a super-hig

Each row consists of a **context** (the article), a question, and four candidate options, exactly one of which is correct.
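
One plausible way to feed such a row to an encoder is to build four text pairs, one per option, and to convert the answer letter into a class index. The sketch below is an assumption about the preprocessing, not the actual code of `RACEDataModule`, which is not shown here; the toy row and the function name `build_choice_pairs` are made up.

```python
def build_choice_pairs(row):
    """Turn one row into four (context, question + option) pairs and a label index."""
    pairs = [(row["article"], f"{row['question']} {option}") for option in row["options"]]
    label = ord(row["answer"]) - ord("A")  # 'A'..'D' -> 0..3
    return pairs, label


# Toy row with the same fields as the dataset rows printed above.
toy_row = {
    "article": "The Titanic sank in 1912 after striking an iceberg.",
    "question": "When did the Titanic sink?",
    "options": ["1812", "1905", "1912", "1997"],
    "answer": "C",
}

pairs, label = build_choice_pairs(toy_row)
print(len(pairs), label)
print(pairs[label][1])
```

Tokenizing each of the four pairs to a fixed length would give one `[4, max_length]` block per example, which matches the `[batch, 4, 512]` shape of `input_ids` seen in the next cells.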

In [13]:
from lab2.data_loader.race_datamodule import RACEDataModule

dm = RACEDataModule("distilbert-base-uncased","middle",512,batch_size,batch_size)

dm.prepare_data()
dm.setup("fit")

train_dl,valid_dl = dm.train_dataloader(), dm.val_dataloader()

Found cached dataset race (C:/Users/Dario/.cache/huggingface/datasets/race/middle/0.1.0/5839ff74a429622f5f20cca69c5fcf0e87ac6d5fd2777c42b948000684829f7b)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at C:\Users\Dario\.cache\huggingface\datasets\race\middle\0.1.0\5839ff74a429622f5f20cca69c5fcf0e87ac6d5fd2777c42b948000684829f7b\cache-0034193d0877ac79.arrow
Loading cached processed dataset at C:\Users\Dario\.cache\huggingface\datasets\race\middle\0.1.0\5839ff74a429622f5f20cca69c5fcf0e87ac6d5fd2777c42b948000684829f7b\cache-8218c613ed21a888.arrow


Map:   0%|          | 0/1436 [00:00<?, ? examples/s]

In [21]:
dataiter = iter(valid_dl)
row = next(dataiter)

for k,v in row.items():
    print(k,v.shape)

input_ids = row['input_ids']
labels = row.pop("label")

print(dm.tokenizer.decode(input_ids[0,0,:]))

input_ids torch.Size([8, 4, 512])
attention_mask torch.Size([8, 4, 512])
label torch.Size([8])
[CLS] april 5 marks the 100thanniversary of the sinking of the titanic. in 1997, the movietitanicwas a huge hit all around the world. now, the 3 - d version of the movie will come to chinese theaters on april 10. as anyone who has seen the movie knows, the titanic struck an iceberg and sank on its first _ in 1912, killing 1, 517 people. a century after the accident, scientists have found something new to blame for the sinking : the moon. donald olson, a texas state university physicist, led a team of astronomers to examine the moon's role, according to reuters. normally, icebergs stay in place and cannot move until they melt enough or a high tide frees them. a single iceberg can become stuck many times on its journey southward. the process can take several years. according to olson, a hundred years ago the moon made its closest approach to the earth in more than 1, 400 years. this caused the 

## Model

In [23]:
from lab2.models.qa_lightning import QATransformer

model = QATransformer("distilbert-base-uncased",
                      num_choices=4,
                      criterion="CrossEntropyLoss",
                      task_name="middle",
                      learning_rate=1e-3
                      )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Feature extraction

## Margin ranking loss
PyTorch's `MarginRankingLoss` learns a relative ordering between inputs rather than absolute target values, a setting often referred to as *metric learning*. It takes two scores $x_1$ and $x_2$ and a label $y \in \{1, -1\}$:
- If $y=1$, $x_1$ should be ranked higher than $x_2$.
- Conversely, if $y=-1$, $x_2$ should be ranked higher.

The loss function is defined as follows:

$$\text{loss}(x_1, x_2, y) = \max(0, -y \cdot (x_1 - x_2) + \text{margin})$$

The cell below uses the multi-class variant `F.multi_margin_loss`, which applies the same hinge to the score of the correct option against the score of every other option.
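
The formula can be checked by hand with a direct transcription into plain Python. The second function is one possible way (an assumption, not necessarily what the lab code does) to apply it to a multiple-choice question: the correct option's score is compared with each wrong option's score and the hinge terms are averaged.

```python
def margin_ranking_loss(x1, x2, y, margin=0.0):
    """max(0, -y * (x1 - x2) + margin) for a single pair of scores."""
    return max(0.0, -y * (x1 - x2) + margin)


def multiple_choice_ranking_loss(scores, correct_idx, margin=0.5):
    """Average hinge loss of the correct option against every wrong option."""
    wrong = [s for i, s in enumerate(scores) if i != correct_idx]
    terms = [margin_ranking_loss(scores[correct_idx], s, 1, margin) for s in wrong]
    return sum(terms) / len(terms)


# x1 is ahead by more than the margin: no loss.
print(margin_ranking_loss(2.0, 1.0, y=1, margin=0.5))
# x1 should be ahead but is behind: the loss grows with the gap.
print(margin_ranking_loss(1.0, 2.0, y=1, margin=0.5))
# Made-up scores for four options; option 1 is correct, option 2 is too close.
print(multiple_choice_ranking_loss([0.0, 2.0, 1.75, -1.0], correct_idx=1))
```

Only option 2 contributes to the last loss, because it is the only wrong option within the margin of the correct one.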

In [24]:
import torch.nn.functional as F

logits = model(**row)
criterion = F.multi_margin_loss

loss = criterion(logits,labels,margin=0.5)

print(loss)

tensor(0.3739, grad_fn=<MultiMarginLossBackward0>)


## Training


In [None]:
import lightning
import torch
trainer = lightning.Trainer(
    max_epochs=5,
    accelerator="auto",
    devices=1,  # a single device (GPU if available, otherwise CPU) is enough for IPython runs
    limit_train_batches=0.1,
    limit_val_batches=0.1
)
trainer.fit(model, datamodule=dm)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Found cached dataset race (C:/Users/Dario/.cache/huggingface/datasets/race/middle/0.1.0/5839ff74a429622f5f20cca69c5fcf0e87ac6d5fd2777c42b948000684829f7b)


  0%|          | 0/3 [00:00<?, ?it/s]

Map:   0%|          | 0/1436 [00:00<?, ? examples/s]

Map:   0%|          | 0/25421 [00:00<?, ? examples/s]

Map:   0%|          | 0/1436 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type               | Params
--------------------------------------------------
0 | model      | DistilBertModel    | 66.4 M
1 | classifier | Sequential         | 591 K 
2 | criterion  | CrossEntropyLoss   | 0     
3 | train_acc  | MulticlassAccuracy | 0     
4 | valid_acc  | MulticlassAccuracy | 0     
--------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.817   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

## Evaluation

In [12]:
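batch = next(iter(dm.val_dataloader()))  # `dm` is the RACEDataModule created above
ys = batch.pop("label")

device = "cpu"

# for each sample, pick the tokenized sequence of its correct option
labels = batch['input_ids'][torch.arange(len(ys)), ys]

print(dm.tokenizer.batch_decode(labels,skip_special_tokens=True)[0])

batch = {k:v.to(device)for k,v in batch.items()}

# inference only: keep model and batch on the same device, disable dropout and gradients
model = model.to(device).eval()
with torch.no_grad():
    preds = torch.argmax(model(**batch), dim=1)

print(ys,preds)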

april 5 marks the 100thanniversary of the sinking of the titanic. in 1997, the movietitanicwas a huge hit all around the world. now, the 3 - d version of the movie will come to chinese theaters on april 10. as anyone who has seen the movie knows, the titanic struck an iceberg and sank on its first _ in 1912, killing 1, 517 people. a century after the accident, scientists have found something new to blame for the sinking : the moon. donald olson, a texas state university physicist, led a team of astronomers to examine the moon's role, according to reuters. normally, icebergs stay in place and cannot move until they melt enough or a high tide frees them. a single iceberg can become stuck many times on its journey southward. the process can take several years. according to olson, a hundred years ago the moon made its closest approach to the earth in more than 1, 400 years. this caused the moon to have a much stronger pull on the earth's oceans than usual, which created a super - high tide