# Finetune GPT-2 on wiki-text

In this Lab, we are using a series of library from Hugging Face (i.e. tranformers, datasets, peft). You may need to go through the document of these library to learn the usage. (Hint: you may use the imported contents in the code cell below, other contents is not necessary for this lab)

In [1]:
import os
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

import inspect
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch.nn as nn

cuda


In [2]:
# added by me to fix cuda memory issue
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

## Lab 2(a) Generate text with GPT2

Using the API provided by hugging face, we can easily load the pre-trained GPT2 model and generate text. (GPT2 is a early generative model, the quality of the generated text is not as good as the later model like GPT3.)

In [3]:
# your code here: load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

def generate_text(model, tokenizer, prompt, max_length):


    # your code here: tokenize the prompt
    inputs = tokenizer(prompt, return_tensors = "pt", padding = True).to(model.device)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask

    # your code here: generate token using the model
    gen_tokens = model.generate(input_ids = input_ids,
                               attention_mask = attention_mask,
                               max_length = max_length,
                               do_sample = True,
                               top_p = 0.9,
                               top_k = 50,
                               temperature = 0.8,
                               pad_token_id = tokenizer.eos_token_id)

    # your code here: decode the generated tokens
    gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens = True)
    
    print(gen_text)

generate_text(model, tokenizer, "GPT-2 is a language model based on transformer developed by OpenAI", 100)

GPT-2 is a language model based on transformer developed by OpenAI and developed in collaboration with the AI Research Institute at the University of Arizona.

OpenAI is a project of the Center for Artificial Intelligence, University of Phoenix, Arizona, USA.

OpenAI is published by the Center for Artificial Intelligence, University of Phoenix, Arizona, USA.


## Lab 2(b) Prepare dataset for training

Please fill the code cell below to download the dataset and prepare the dataset for finetuning.


In [4]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# your code here: load the dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# get 10% of dataset
dataset_train = dataset["train"].select(range(len(dataset["train"]) // 10))
dataset_valid = dataset["validation"].select(range(len(dataset["validation"]) // 10))

# your code here: implement function that tokenize the dataset and set labels to be the same as input_ids
def tokenize_function(examples):
    tokenized = tokenizer(examples["text"],
                         padding = "max_length",
                         truncation = True,
                         max_length = 1024)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# your code here: tokenize the dataset (you may need to remove columns that are not needed)
tokenized_datasets_train = dataset_train.map(tokenize_function, batched = True, remove_columns = dataset_train.column_names)
tokenized_datasets_valid = dataset_valid.map(tokenize_function, batched = True, remove_columns = dataset_train.column_names)

tokenized_datasets_train.set_format("torch")
tokenized_datasets_valid.set_format("torch")

# your code here: create datacollator for training and validation dataset
data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm = False, pad_to_multiple_of = None)

train_dataloader = DataLoader(tokenized_datasets_train, shuffle=True, batch_size=4, collate_fn=data_collator)
valid_dataloader = DataLoader(tokenized_datasets_valid, batch_size=4, collate_fn=data_collator)

# Test the DataLoader
for batch in train_dataloader:
    print(batch['input_ids'].shape)
    print(batch['attention_mask'].shape)
    print(batch['labels'].shape)
    break

print("DataLoader is working correctly!")

torch.Size([4, 1024])
torch.Size([4, 1024])
torch.Size([4, 1024])
DataLoader is working correctly!


## Lab 2(c) Evaluate perplexity on wiki-text

Before finetuning, we evaluate the pre-trained GPT2 model on the wiki-text dataset. The perplexity is a common metric to evaluate the performance of language model. The lower the perplexity, the better the model. To compute the perplexity in practice, we use the formula as follows, which is a transformation of the formula in class:
$PP(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i|\text{context})\right)$

In [5]:
def evaluate_perplexity(model, dataloader):
    model.eval()
    total_loss = 0
    total_length = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')

    with torch.no_grad():
        for batch in dataloader:
            # your code here: get the input_ids, attention_mask, and labels from the batch
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            labels = batch['labels'].to(model.device)

            # your code here: forward pass
            outputs = model(input_ids = input_ids, attention_mask = attention_mask, labels = labels)
            logits = outputs.logits

            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()

            # your code here: calculate the loss
            loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))

            total_loss += loss.item()
            total_length += attention_mask.sum().item()

    # Calculate perplexity
    perplexity = torch.exp(torch.tensor(total_loss / total_length))

    return perplexity.item()


perplexity = evaluate_perplexity(model, valid_dataloader)
print(f"Initial perplexity: {perplexity}")

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Initial perplexity: 42.9958610534668


## Lab 2(d) Fine-tune GPT2 on wiki-text



In [6]:
# Tip: Print out transformer version and training arguments
# print("transformers version:", transformers.__version__)
# print("TrainingArguments signature:")
# print(inspect.signature(TrainingArguments.__init__))

In [7]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    
    # your code here: report validation and training loss every epoch
    eval_strategy = "epoch",
    save_strategy = "epoch",
    logging_steps = 50,
    report_to = None,
    load_best_model_at_end = True
)

# print("transformers version:", transformers.__version__)
# print("TrainingArguments signature:")
# print(inspect.signature(TrainingArguments.__init__))

# your code here: create a Trainer object
trainer = Trainer(model = model,
                  args = training_args,
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  train_dataset = tokenized_datasets_train,
                  eval_dataset = tokenized_datasets_valid)

trainer.train()
trainer.save_model()

  trainer = Trainer(model = model,
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


Epoch,Training Loss,Validation Loss
1,3.4364,3.341595
2,3.1316,3.368648
3,2.8906,3.394768


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


# Test fine-tuned model

In [8]:
# your code here: load the fine-tuned model
model_finetuned = AutoModelForCausalLM.from_pretrained("gpt2-wikitext-2").to(device)
perplexity = evaluate_perplexity(model_finetuned, valid_dataloader)
print(f"fine-tuned perplexity: {perplexity}")

fine-tuned perplexity: 25.921186447143555


# Generate some text using the fine-tuned model

In [9]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# generate text
generate_text(model_finetuned, tokenizer, "GPT-2 is a language model based on transformers developed by OpenAI", 100)

GPT-2 is a language model based on transformers developed by OpenAI . It is a simple and efficient mechanism for generating new combinations of monadic and non-adic languages . The monadic and non-adic features of the language can be expressed using the new features of the monadic language . The monadic features of the language can be expressed using the new features of the non-adic language . The monadic features can be expressed using the new features of the non-adic language .


## Lab 2(e) Parameter efficient fine-tuning (LoRA)

finetune the base gpt model through LoRA

In [10]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.2,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# your code here: load GPT2 model and add the lora adapter

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model_lora = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="./gpt2-lora-wikitext-2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    
    # your code here: report validation and training loss every epoch
    eval_strategy = "epoch",
    save_strategy = "epoch",
    logging_steps = 50,
    report_to = None,
    load_best_model_at_end = True
)

# your code here: set trainer and train the model
trainer = Trainer(model = model_lora,
                  args = training_args,
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  train_dataset = tokenized_datasets_train,
                  eval_dataset = tokenized_datasets_valid)

ppl = evaluate_perplexity(model_lora, valid_dataloader)
print(f"Perplexity after lora finetuning: {ppl}")


  trainer = Trainer(model = model_lora,


Perplexity after lora finetuning: 25.921186447143555


# Evaluate lora fine-tuned model on wiki-text

compare the text generated by the fully fine-tuned model and LoRA fine-tuned model and the pre-trained model. Do you see any difference in the quality of the generated text? Try to explain why. (Hint: trust your result and report as it is.)

In [11]:
generate_text(model_lora, tokenizer, "GPT-2 is a language model based on transformers developed by OpenAI", 100)

GPT-2 is a language model based on transformers developed by OpenAI for the production of a more efficient and efficient language . 

OpenAI is not a " new " language , nor is it a new language for language engineering . 

Instead, it is a new system , and it uses a different approach to learning . 

The goal of this approach is to create a more efficient language and improve it at a faster rate . 

There are two main


Compare the perplexity of the fully fine-tuned model and LoRA fine-tuned model. Do you see any difference in the perplexity? Try to explain why.

In [13]:
ppl = evaluate_perplexity(model_lora, valid_dataloader)

print(f"Perplexity after lora finetuning: {ppl}")

Perplexity after lora finetuning: 25.921186447143555
