# GPT2 Language Model Fine-tuning with Texts from Shakespeare
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fcakyon/gpt2-shakespeare/blob/main/gpt2-shakespeare.ipynb)

## 0. Install requirements

In [None]:
!pip install -U transformers datasets torch sentencepiece

## 1. Initialize Model and Tokenizer

- Import required modules:

In [2]:
import torch
import math
from transformers import GPT2Tokenizer, GPT2LMHeadModel, HfArgumentParser, TrainingArguments, Trainer, default_data_collator
from datasets import load_dataset

- Initialize a GPT2 model with a language modelling head:

In [4]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

- Initialize GPT2 tokenizer:

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

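- Resize the model's token embeddings to match the tokenizer's vocabulary size (a no-op here, since both are 50257):

In [6]: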
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

## 2. Initialize Dataset

- Download the Shakespeare dataset from the [Hugging Face datasets hub](https://github.com/huggingface/datasets/blob/master/datasets/tiny_shakespeare/tiny_shakespeare.py):

In [7]:
# download and load the dataset from the hub
dataset_name = "tiny_shakespeare"
cache_dir = "lm_dataset/"
datasets = load_dataset(dataset_name, cache_dir=cache_dir)

Using custom data configuration default


Downloading and preparing dataset tiny_shakespeare/default (download: 1.06 MiB, generated: 1.06 MiB, post-processed: Unknown size, total: 2.13 MiB) to lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e...



Dataset tiny_shakespeare downloaded and prepared to lm_dataset/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e. Subsequent calls will reuse this data.


- Tokenize all the texts:

In [9]:
column_names = datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

def tokenize_function(examples):
    # tokenize without truncation; the long sequences are split into blocks in the next step
    output = tokenizer(examples[text_column_name])
    return output

# tokenize dataset
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset",
)

Token indices sequence length is longer than the specified maximum sequence length for this model (301966 > 1024). Running this sequence through the model will result in indexing errors




- Split whole dataset into smaller sets of blocks:

In [10]:
# get block size (max input length of the model)
block_size = tokenizer.model_max_length
if block_size > 1024:
    block_size = 1024
    
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# split total dataset into smaller sets of length block_size
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)
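
To make the grouping step concrete, here is a minimal sketch of the same concatenate-then-chunk logic on a toy batch. The function name `group_into_blocks`, the toy token IDs, and the block size of 4 are illustrative assumptions, not part of the notebook; the real run uses a block size of 1024.

```python
from itertools import chain

def group_into_blocks(examples, block_size):
    """Concatenate each column's lists and cut them into equal-sized blocks."""
    # Flatten every column, e.g. [[1, 2], [3]] -> [1, 2, 3].
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling the labels are a copy of the inputs;
    # the model shifts them by one position internally.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

# Hypothetical toy batch: three "documents" with 10 tokens in total.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The last two tokens (9 and 10) are dropped because they do not fill a complete block, which mirrors the remainder handling in `group_texts`.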





In [11]:
train_dataset = lm_datasets["train"]
eval_dataset = lm_datasets["validation"]

## 3. Initialize Trainer

In [12]:
training_args = TrainingArguments(
    output_dir="output/",
    per_device_train_batch_size=1,
    num_train_epochs=20,
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    # Data collator will default to DataCollatorWithPadding, so we change it.
    data_collator=default_data_collator,
)

## 4. Perform Training

In [None]:
# perform training
train_result = trainer.train()

# save the model (and the tokenizer, since it was passed to the Trainer)
trainer.save_model()

# save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

# save training state
trainer.save_state()

## 5. Evaluate Model

In [None]:
# perform evaluation over validation data
metrics = trainer.evaluate()

# calculate perplexity
try:
    perplexity = math.exp(metrics["eval_loss"])
except OverflowError:
    perplexity = float("inf")
    
# save perplexity
metrics["perplexity"] = perplexity
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
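
As a small worked example of the perplexity step: perplexity is the exponential of the mean per-token cross-entropy loss, assuming the loss is measured in nats. The helper name `perplexity_from_loss` and the sample loss values below are hypothetical and chosen only for illustration.

```python
import math

def perplexity_from_loss(mean_nll):
    """Convert a mean negative log-likelihood (in nats) to perplexity."""
    try:
        return math.exp(mean_nll)
    except OverflowError:
        # Very large losses overflow a float; report infinity instead.
        return float("inf")

# A loss of 0 means every token was predicted with probability 1.
print(perplexity_from_loss(0.0))

# A model that guesses uniformly over GPT2's 50257 tokens has a loss of
# log(50257), so its perplexity is approximately the vocabulary size.
print(perplexity_from_loss(math.log(50257)))
```

Lower perplexity is better: it can be read as the effective number of tokens the model is choosing between at each position.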

## 6. Generate Samples

In [None]:
# fix seed
torch.manual_seed(3)

# tokenize start of a sentence and move it to the model's device
ids = tokenizer.encode('One does not simply walk into',
                      return_tensors='pt').to(model.device)

# generate samples by top-p sampling
sample_output = model.generate(
    ids, 
    do_sample=True, 
    max_length=100, 
    top_p=0.92, 
    top_k=0
)

# print generated texts
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
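
To illustrate what `top_p=0.92` does, the sketch below applies the idea of top-p (nucleus) filtering to a made-up next-token distribution: keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`, then renormalize. This is a simplified pure-Python illustration, not the library's implementation; the function name `top_p_filter` and the toy probabilities are assumptions.

```python
def top_p_filter(probs, top_p):
    """Return the renormalized nucleus: the smallest set of most likely
    tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution after "One does not simply walk into".
toy_probs = {"Mordor": 0.5, "the": 0.3, "a": 0.15, "zebra": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.92)
print(sorted(nucleus))  # ['Mordor', 'a', 'the']
```

The unlikely token `zebra` falls outside the nucleus and can never be sampled. In the cell above, `top_k=0` disables top-k filtering, so only the top-p cutoff restricts the candidates.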

## 7. Push to Hub

In [None]:
kwargs = {"finetuned_from": "gpt2", "tasks": "text-generation"}
kwargs["dataset_tags"] = "tiny_shakespeare"
kwargs["dataset_args"] = "default"
kwargs["dataset"] = "tiny_shakespeare default"

trainer.push_to_hub(**kwargs)