In this exercise, you'll train GPT-2 on a dataset of Shakespeare! You'll be using Hugging Face to download the dataset and pre-trained model - there's a tutorial located in the same folder as this in the GitHub repository. That tutorial will explain how to load models and datasets, pre-process them, and fine-tune the model!

The starter code here is quite lightweight - you should refer back to the tutorial often. You will likely want to copy-paste some code from it, just make sure you know what it is doing!

Once the model is trained, try generating some text! Hopefully it looks (somewhat) like Shakespeare!

If you finish this and want to explore further, try training a different model or use a different dataset. This exercise is meant for you to be able to explore, so do what you find interesting!

Download required libraries:

In [1]:
!pip install transformers
!pip install datasets



Load the dataset you'll be using:

In [2]:
from datasets import load_dataset

ds = load_dataset('tiny_shakespeare')
ds

Found cached dataset tiny_shakespeare (C:/Users/Frank/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

Let's take a quick look at what's in the dataset!

In [3]:
print(ds['train']['text'][0][:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us



And from here it's up to you! Make sure to look at the tutorial, it will hopefully be a good guide. If you have any questions be sure to ping someone in the Discord!

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

def token_func(example):
  return tokenizer(example['text'])

tokenized_ds = ds.map(token_func, batched=True)
tokenized_ds

Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-babfbdcf4b3186c8.arrow
Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-d9a53b4a56b68156.arrow
Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-05b60ed0fdec609c.arrow


DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
})

In [6]:
rem_tokenized_ds = tokenized_ds.remove_columns(['text'])
rem_tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
})

In [7]:
from itertools import chain

# group texts into blocks of block_size
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

batched_ds = rem_tokenized_ds.map(group_texts, batched=True)

batched_ds['train'] = batched_ds['train'].add_column('labels', batched_ds['train']['input_ids'])
batched_ds

Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-6f89cb6b5a97da3a.arrow
Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-0de79604e8c4b309.arrow
Loading cached processed dataset at C:\Users\Frank\.cache\huggingface\datasets\tiny_shakespeare\default\1.0.0\b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e\cache-305a09f8c6a665ec.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2359
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 141
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 140
    })
})

In [8]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dl = DataLoader(
  batched_ds['train'],
  shuffle=True,
  batch_size=16,
  collate_fn=default_data_collator
)

In [12]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
dl_iter = iter(train_dl)

for i in range(10):
  for batch in train_dl:
    # push all to device
    batch = {k: batch[k].to(device) for k in batch.keys()}

    outputs = model(**batch)
    optimizer.zero_grad()
    loss = outputs.loss
    loss.backward()
    print (f'Loss: {loss.item():.4f}')
    optimizer.step()

Loss: 3.3795
Loss: 3.6152
Loss: 3.4189
Loss: 3.6499
Loss: 3.3276
Loss: 3.6957
Loss: 3.3236
Loss: 3.4749
Loss: 3.5228
Loss: 3.3541
Loss: 3.2301
Loss: 3.4461
Loss: 3.0970
Loss: 3.2144
Loss: 3.3786
Loss: 3.4664
Loss: 3.5698
Loss: 3.4169
Loss: 3.4750
Loss: 3.3923
Loss: 3.4117
Loss: 3.6177
Loss: 3.3912
Loss: 3.4630
Loss: 3.3955
Loss: 3.5567
Loss: 3.4228
Loss: 3.6344
Loss: 3.3220
Loss: 3.4118
Loss: 3.5458
Loss: 3.5157
Loss: 3.4853
Loss: 3.3305
Loss: 3.5574
Loss: 3.4092
Loss: 3.4761
Loss: 3.4871
Loss: 3.5109
Loss: 3.2790
Loss: 3.3625
Loss: 2.9946
Loss: 3.5450
Loss: 3.5391
Loss: 3.1538
Loss: 3.2277
Loss: 3.5986
Loss: 3.4985
Loss: 3.5277
Loss: 3.6161
Loss: 3.6001
Loss: 3.2655
Loss: 3.5254
Loss: 3.4515
Loss: 3.2421
Loss: 3.5881
Loss: 3.2949
Loss: 3.4625
Loss: 3.4916
Loss: 3.5714
Loss: 3.6535
Loss: 3.3874
Loss: 3.4360
Loss: 3.6170
Loss: 3.3610
Loss: 3.4255
Loss: 3.4381
Loss: 3.5244
Loss: 3.1968
Loss: 3.4660
Loss: 3.4384
Loss: 3.5274
Loss: 3.5693
Loss: 3.4242
Loss: 3.2391
Loss: 3.2034
Loss: 3.3489

In [21]:
# generate text
prompt = tokenizer('Washington:\n', return_tensors='pt')
prompt = prompt.input_ids.to(device)
out = model.generate(
  prompt,
  min_new_tokens=30,
  max_new_tokens=120,
  no_repeat_ngram_size=3
)
print(tokenizer.decode(out[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Washington:
I will not be so long in saying.

First Senator:
You are a senator, and you must be so.
You have been forsworn to the people's house:
And you have not been so long to demand it.
Your honours both,--

SICINIUS:
We have been so forswaken, and so long.
I would be so brief to say the truth.
The people are not so much in their love
As they are in fearing to hear me tell it. They
are not as much in fear
