In this exercise, you'll fine-tune GPT-2 on a dataset of Shakespeare! You'll use Hugging Face to download the dataset and the pre-trained model. There's a tutorial in the same folder as this notebook in the GitHub repository; it explains how to load models and datasets, pre-process them, and fine-tune the model.

The starter code here is quite lightweight, so you should refer back to the tutorial often. You will likely want to copy-paste some code from it; just make sure you understand what it does!

Once the model is trained, try generating some text! Hopefully it looks (somewhat) like Shakespeare!

If you finish this and want to explore further, try training a different model or using a different dataset. This exercise is meant for exploration, so do what you find interesting!

Install the required libraries:

In [1]:
!pip install transformers
!pip install datasets



Load the dataset you'll be using:

In [2]:
from datasets import load_dataset

ds = load_dataset('tiny_shakespeare')
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

Let's take a quick look at what's in the dataset!

In [3]:
print(ds['train']['text'][0][:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us



From here it's up to you! Make sure to look at the tutorial; it should be a good guide. If you have any questions, ping someone in the Discord!

In [4]:
ds

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

def token_func(example):
  return tokenizer(example['text'])

tokenized_ds = ds.map(token_func, batched=True)
tokenized_ds

Token indices sequence length is longer than the specified maximum sequence length for this model (17995 > 1024). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 1
    })
})

In [6]:
rem_tokenized_ds = tokenized_ds.remove_columns(['text'])
rem_tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1
    })
})

In [7]:
from itertools import chain

# group texts into blocks of block_size
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

batched_ds = rem_tokenized_ds.map(group_texts, batched=True)
batched_ds['train'] = batched_ds['train'].add_column('labels', batched_ds['train']['input_ids'])
batched_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2359
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 141
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 140
    })
})
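
To see what `group_texts` does without running it on the full corpus, here is a minimal sketch on a toy batch. The token ids, the tiny `block_size` of 4, and the helper name `group_into_blocks` are made up for illustration; the logic mirrors the function above.

```python
from itertools import chain

def group_into_blocks(examples, block_size):
    """Concatenate every column, then cut it into equal blocks, dropping the remainder."""
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(next(iter(concatenated.values())))
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Toy batch: two "documents" of 6 and 5 made-up token ids (11 tokens in total).
toy = {
    "input_ids": [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11]],
    "attention_mask": [[1] * 6, [1] * 5],
}
blocks = group_into_blocks(toy, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The second block crosses the boundary between the two toy documents, and the last three tokens are dropped because they do not fill a whole block. The same thing happens at `block_size = 128` on the real data, which is how a single long row turns into 2359 training rows.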

In [8]:
print(len(batched_ds['train'][0]['input_ids']))
print(len(batched_ds['train'][0]['labels']))

128
128
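
The `labels` column is an unshifted copy of `input_ids`. That works because Hugging Face causal LM heads such as GPT-2's shift the labels by one position inside the forward pass, so each position is scored against the token that follows it. A minimal pure-Python sketch of that pairing, using made-up token ids and a hypothetical helper name:

```python
def next_token_pairs(input_ids):
    """Pair each position's token with the token the model should predict next."""
    return list(zip(input_ids[:-1], input_ids[1:]))

block = [10, 20, 30, 40]  # made-up token ids
pairs = next_token_pairs(block)
print(pairs)  # [(10, 20), (20, 30), (30, 40)]

# A block of 128 tokens therefore contributes 127 prediction targets.
targets_per_block = len(next_token_pairs(list(range(128))))
print(targets_per_block)  # 127
```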


In [9]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dl = DataLoader(
  batched_ds['train'],
  shuffle=True,
  batch_size=16,
  collate_fn=default_data_collator
)
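
As a quick sanity check on the loader, the number of optimizer steps in one epoch follows from the row count reported above and the batch size. This sketch assumes `DataLoader`'s default of keeping the final, smaller batch (`drop_last=False`); the variable names are only illustrative.

```python
import math

num_rows = 2359   # training rows after grouping into blocks of 128
batch_size = 16

steps_per_epoch = math.ceil(num_rows / batch_size)
last_batch_size = num_rows - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)  # 148 7
```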

In [12]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

# The step counter in the printout restarts after every `log_group` optimizer steps.
log_group = 16

# One pass (epoch) over the training set.
for step, batch in enumerate(train_dl):
  # push all to device
  batch = {k: v.to(device) for k, v in batch.items()}

  optimizer.zero_grad()
  outputs = model(**batch)  # passing `labels` makes the model return the LM loss
  loss = outputs.loss
  loss.backward()
  optimizer.step()
  print(f'Step [{step % log_group + 1}/{log_group}], Loss: {loss.item():.4f}')

Step [1/16], Loss: 2.6431
Step [2/16], Loss: 2.9191
Step [3/16], Loss: 2.7397
Step [4/16], Loss: 2.7720
Step [5/16], Loss: 2.6491
Step [6/16], Loss: 2.8077
Step [7/16], Loss: 2.6606
Step [8/16], Loss: 2.8092
Step [9/16], Loss: 2.6795
Step [10/16], Loss: 2.6024
Step [11/16], Loss: 2.6891
Step [12/16], Loss: 2.9350
Step [13/16], Loss: 2.7233
Step [14/16], Loss: 2.7324
Step [15/16], Loss: 2.8585
Step [16/16], Loss: 2.7384
Step [1/16], Loss: 2.7024
Step [2/16], Loss: 2.7754
Step [3/16], Loss: 2.6712
Step [4/16], Loss: 2.9204
Step [5/16], Loss: 2.8153
Step [6/16], Loss: 2.8420
Step [7/16], Loss: 2.8336
Step [8/16], Loss: 3.0135
Step [9/16], Loss: 2.7976
Step [10/16], Loss: 2.7232
Step [11/16], Loss: 2.8511
Step [12/16], Loss: 2.9437
Step [13/16], Loss: 2.8820
Step [14/16], Loss: 2.6424
Step [15/16], Loss: 2.7303
Step [16/16], Loss: 2.6960
Step [1/16], Loss: 2.7519
Step [2/16], Loss: 2.7385
Step [3/16], Loss: 2.7152
Step [4/16], Loss: 2.7813
Step [5/16], Loss: 2.5900
Step [6/16], Loss: 2.912
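
The values printed above are mean cross-entropy losses per token, measured in nats. One rough way to read them is as perplexity, which is `exp(loss)`. The sketch below uses the first five losses from the log; averaging over five steps is an arbitrary choice for illustration.

```python
import math

# First five losses copied from the training log above.
losses = [2.6431, 2.9191, 2.7397, 2.7720, 2.6491]

mean_loss = sum(losses) / len(losses)
perplexity = math.exp(mean_loss)
print(f'mean loss: {mean_loss:.4f}, perplexity: {perplexity:.2f}')
```

Individual batch losses are noisy, so a running average like this is usually easier to interpret than any single step.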

In [18]:
# generate text
model.eval()  # switch off dropout, which is still active after model.train()
out = model.generate(min_new_tokens=10, max_new_tokens=30)
print(tokenizer.decode(out[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


<|endoftext|>I'll be brief, and not be long.

DUKE VINCENTIO:
I am not so brief, nor so long
