# Text Analysis for Digital Humanities: Fine-Tuning GPT

We can compare GPT-2 models that have been fine-tuned using different approaches.

## What is fine-tuning?
Fine-tuning GPT-2 on a specific dataset, like a collection of Irish drama texts, customizes the model's responses to reflect the themes, style, language, idioms, and character types found within that corpus. This process tailors the model's generative capabilities, making it more likely to produce outputs that are stylistically and thematically aligned with the fine-tuning material.

The model's adaptation will be more pronounced when generating text related to or prompted by the domain we train it on. This might include specific narrative styles, dialogue structures, and dramatic conventions unique to the genre and cultural context.

## Weight Adjustment
Fine-tuning adjusts the weights of the neural network to minimize the loss on the new data. The changes in weights help the model better predict or generate sequences that resemble the fine-tuning data.
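
As a rough illustration of what "adjusting weights to minimize the loss" means, here is a minimal sketch with a single weight and a squared-error loss. The toy data, the learning rate, and the function name `fit_single_weight` are assumptions made for illustration. GPT-2 fine-tuning applies the same idea to millions of weights with a cross-entropy loss.

```python
def fit_single_weight(xs, ys, lr=0.01, steps=200):
    """Gradient descent on one weight w for the toy model y = w * x."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # step against the gradient to reduce the loss
    return w

# Toy data generated by y = 3x, so the weight should move towards 3
w = fit_single_weight([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
print(round(w, 3))
```

Each step nudges the weight in the direction that lowers the loss on the data, which is what happens to every weight of the network during fine-tuning.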


## Importing packages

If you want to run this code yourself, I **highly** recommend doing so in a new environment. Read more [here](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) if you want to know how this works. The PyTorch and Transformers packages require very specific versions that might clash with other installed packages.

After importing, I check which versions of the packages I have.

In [1]:
import torch
import transformers
print(torch.__version__)
print(transformers.__version__)


2.1.1
4.35.2


# Finetuning
Time to start the fine-tuning process. We will be using the smallest GPT-2 model, which has about 124M parameters (the original release reported 117M).

We load in our dataset of Irish drama, and initialize a tokenizer.

The tokenizer performs several critical tasks to convert raw text into a format that the GPT-2 model can understand:

- Splitting Text into Tokens: The tokenizer breaks down input text into tokens. For GPT-2, these tokens are usually subwords or characters, allowing the model to handle a wide range of words and vocabularies efficiently.
- Converting Tokens to IDs: Each token is mapped to a unique integer ID based on the GPT-2 vocabulary. This conversion is necessary because neural networks operate on numerical data, not raw text.
- Adding Special Tokens: GPT-2 defines an end-of-text token (`<|endoftext|>`) that marks document boundaries. By default, the GPT-2 tokenizer does not insert it automatically, so it must be added explicitly where it is needed.
- Padding & Truncation: To process batches of data efficiently, all input sequences must be of the same length. The tokenizer can pad shorter sequences with a padding token or truncate longer ones to a maximum length. GPT-2 has no padding token of its own, which is why the code below reuses the end-of-text token for padding.
- Creating Attention Masks: The tokenizer generates attention masks to differentiate real tokens from padding tokens. This helps the model pay attention to relevant tokens and ignore padded areas.
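
The steps in the list above can be sketched with a toy example. This is not GPT-2's tokenizer: it splits on whitespace instead of using byte-pair encoding, and the vocabulary, the `<pad>` entry, and the function name `encode_batch` are all hypothetical. It only illustrates how IDs, truncation, padding, and attention masks fit together.

```python
def encode_batch(texts, vocab, pad_id, max_length):
    """Toy whitespace tokenizer: map words to IDs, truncate, pad, and build masks."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab[word] for word in text.split()][:max_length]  # truncate
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad to max_length
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical five-word vocabulary
vocab = {"<pad>": 0, "the": 1, "curtain": 2, "rises": 3, "slowly": 4}
batch = encode_batch(
    ["the curtain rises", "the curtain rises slowly slowly"],
    vocab, pad_id=0, max_length=4,
)
print(batch["input_ids"])       # [[1, 2, 3, 0], [1, 2, 3, 4]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The first text is padded by one position and the second is truncated by one word, so both end up with the same length.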

In [2]:
import os
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

# Load texts from files
def load_texts_from_folder(folder_path):
    texts = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts

folder_path = 'data/data_cleaned'
texts = load_texts_from_folder(folder_path)

In [3]:
len(texts[0])

44653

## Approach 1: Using Truncation

Let's fine-tune a GPT-2 model. First, we need to tokenize our texts using GPT-2's **tokenizer**, which converts raw text into a sequence of numerical token IDs and handles padding and truncation so that the model can process the texts in batches.

For our first approach, we will tokenize the entire texts with truncation and padding to a fixed maximum length. This method is straightforward and treats each text as an individual sequence for the model to learn from. The main characteristics include:

- `truncation`: Texts longer than `max_length=512` are cut off, potentially losing important information at the end.
- `padding`: Texts shorter than `max_length=512` are padded to ensure uniform sequence length, usually with the pad_token. This is not relevant to us as all texts we are feeding into the model are much longer than 512 tokens.

During fine-tuning, the task is to predict the next token in a sequence from the preceding tokens. Each input sequence consists of the first 512 tokens of one Irish drama text, and at every position the model predicts the token that follows. In practice, this means the model predicts the tokens at positions 2 through 512 of each sequence.
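
To make the prediction task concrete, the following sketch lists the (context, target) pairs for a short sequence. The token IDs and the helper name `next_token_pairs` are made up for illustration. During training the model handles all positions of a sequence in one pass rather than building these pairs explicitly.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs for next-token prediction."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

pairs = next_token_pairs([10, 11, 12, 13])
for context, target in pairs:
    print(context, "->", target)

# A 512-token sequence yields 511 prediction targets (positions 2 to 512)
print(len(next_token_pairs(list(range(512)))))
```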

In [9]:
# DIY

# Initialize tokenizer with padding token set
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Tokenize texts
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    
    def __len__(self):
        return len(self.encodings.input_ids)
    
    def __getitem__(self, idx):
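        # The encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx].clone() for key, val in self.encodings.items()}
        # For language modeling, the labels are a copy of input_ids;
        # the model shifts them by one position internally when computing the loss
        item["labels"] = item["input_ids"].clone()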
        return item

# Initialize the dataset
train_dataset = TextDataset(encodings)

In the following cell, we initialize the fine-tuning process using Hugging Face's `Trainer` class.

The first parameter, `model`, is the pre-trained GPT-2 model that we intend to fine-tune. It is loaded at the top of the cell and then trained further on our dataset, which adjusts its weights so that it is better at generating text similar to our training corpus.

`TrainingArguments` further specifies various configuration settings for the training process:
- `output_dir`: The directory where the training outputs (like the fine-tuned model checkpoints) will be saved.
- `num_train_epochs`: The number of times the training process should iterate over the entire dataset. Here, it's set to 3, meaning the model will see the dataset three times.
- `per_device_train_batch_size`: The number of training examples processed per device (e.g., GPU) per training step. A batch size of 4 is specified, balancing the computational load and memory usage.
- `logging_dir`: Directory where training logs will be saved, enabling monitoring of the training process through metrics like loss over time.

Finally, `trainer.train()` starts the training process based on the specified model, training arguments, and dataset. The Trainer handles various training aspects, including feeding the input data to the model, performing backpropagation to adjust the model's weights, saving checkpoints, and logging training progress.

In [10]:
# Initialize the model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir='./logs',
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Start training
trainer.train()



Step,Training Loss


TrainOutput(global_step=120, training_loss=3.4367586771647134, metrics={'train_runtime': 252.5598, 'train_samples_per_second': 1.877, 'train_steps_per_second': 0.475, 'total_flos': 123852423168000.0, 'train_loss': 3.4367586771647134, 'epoch': 3.0})
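
The `global_step=120` value in this output can be sanity-checked with a little arithmetic. The sketch below assumes training on a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. The helper `total_steps` is hypothetical and not part of the `Trainer` API.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

# Which dataset sizes give 120 steps with batch size 4 and 3 epochs?
candidates = [n for n in range(1, 1000) if total_steps(n, 4, 3) == 120]
print(candidates[0], candidates[-1])
```

Under these assumptions, 120 steps correspond to 40 steps per epoch, which is consistent with a training set of 157 to 160 texts.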

In [12]:
model_save_path = 'models/finetuned_model_1'
tokenizer_save_path = 'models/finetuned_tokenizer'

# Save the model
model.save_pretrained(model_save_path)

# Save the tokenizer
tokenizer.save_pretrained(tokenizer_save_path)


('models/finetuned_tokenizer/tokenizer_config.json',
 'models/finetuned_tokenizer/special_tokens_map.json',
 'models/finetuned_tokenizer/vocab.json',
 'models/finetuned_tokenizer/merges.txt',
 'models/finetuned_tokenizer/added_tokens.json')

## Approach 2: Sliding Windows

We have a model. The only problem is that we trained the model on a pretty limited amount of text from our corpus.

Recall that we set the `max_length=512` parameter. If each token represents roughly 4 characters on average (a rough estimate, since GPT-2 tokens range from single characters to whole words), then 512 tokens cover around 2048 characters. In practice, that is only the opening lines of the first scene of each work.
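
The estimate can be written out as a short calculation. The figure of 4 characters per token is the rough assumption stated above, and 44653 is the character count of the first text as printed by `len(texts[0])` earlier.

```python
chars_per_token = 4        # rough average, an assumption
max_length = 512           # tokens kept per text after truncation
first_text_chars = 44653   # len(texts[0]) from the earlier cell

covered_chars = max_length * chars_per_token
coverage = covered_chars / first_text_chars
print(covered_chars, f"{coverage:.1%}")
```

For this text, truncation keeps less than a twentieth of the characters, which is the motivation for the approach that follows.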

To make use of more of our data during finetuning, we will now implement a "sliding windows approach". This involves segmenting our Irish drama texts into smaller, overlapping portions (windows).

In this approach, each text is processed in "chunks". The first chunk holds the first 512 tokens, as before. The window then moves forward by `step_size`. Note that the code below splits each text on whitespace and advances the window by 200 words rather than 200 tokens. Because a 512-token window typically spans more than 200 words, adjacent chunks overlap, allowing the model to capture context from nearby tokens. This process is repeated until the entire text is covered. Each chunk is treated as an independent sequence, and the model predicts tokens within that sequence.
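
The windowing idea itself can be sketched on a plain list of token IDs. This is a simplified token-level variant for illustration, not the function used in this notebook (which steps over whitespace-split words). The name `sliding_windows` and the toy input are assumptions.

```python
def sliding_windows(token_ids, window_size, step_size):
    """Split a token sequence into overlapping windows of at most window_size tokens."""
    windows = []
    for start in range(0, len(token_ids), step_size):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return windows

# Toy "text" of 1000 tokens, numbered 0..999
windows = sliding_windows(list(range(1000)), window_size=512, step_size=200)
print([(w[0], w[-1]) for w in windows])
# [(0, 511), (200, 711), (400, 911), (600, 999)]
```

With a window of 512 and a step of 200, consecutive windows share 312 tokens, and the last window is shorter than the others and would need padding.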

In [13]:
window_size = 512  # Max tokens per chunk
step_size = 200  # Words to move the window each time (the function below steps over whitespace-split words)

def create_sliding_windows(tokenizer, text, window_size, step_size):
    # First, split the text into words or smaller units
    words = text.split()
    max_tokens_for_window = window_size - tokenizer.num_special_tokens_to_add()

    # Initialize
    windows = []
    start_index = 0
    
    while start_index < len(words):
        # Dynamically determine the end index by tokenizing a slice of words and checking the length
        end_index = start_index + 1
        while end_index <= len(words):
            tokens = tokenizer.encode(' '.join(words[start_index:end_index]), add_special_tokens=True)
            if len(tokens) > max_tokens_for_window:
                break
            end_index += 1
        
        # Adjust end_index to fit within limits, then encode
        end_index -= 1
        window_tokens = tokenizer.encode(' '.join(words[start_index:end_index]), add_special_tokens=True)
        windows.append(window_tokens)
        
        start_index += step_size

    return windows


In [14]:
all_windows = []  # Initialize an empty list to hold all window segments

for text in texts:
    windows = create_sliding_windows(tokenizer, text, window_size, step_size)
    all_windows.extend(windows)  # Add the segments from this text to the collection


This resulting `all_windows` object is a list where each element is another list. Each inner list contains the sequence of token IDs representing a segment (window) of the original text after tokenization.

The `SlidingWindowDataset` class below ensures that the data fed into the model during training has the required format. It pads each window to a consistent length and prepares the input and label tensors for each window. Note that it relies on the `tokenizer` and `window_size` variables defined in earlier cells.

In [15]:
class SlidingWindowDataset(Dataset):
    def __init__(self, token_windows):
        # Expecting token_windows to be a list of lists (token IDs for each window)
        self.token_windows = token_windows

    def __len__(self):
        # Return the total number of windows
        return len(self.token_windows)

    def __getitem__(self, idx):
        # Accessing the token IDs for the window at the given index
        window_token_ids = self.token_windows[idx]

        # Padding: Ensure each sequence is of the same length
        padded_window_token_ids = window_token_ids + [tokenizer.pad_token_id] * (window_size - len(window_token_ids))

        # Converting the list of token IDs into a PyTorch tensor
        input_ids = torch.tensor(padded_window_token_ids, dtype=torch.long)

        # For language modeling, the labels are the same as input_ids, also padded
        labels = input_ids.clone()

        # Return a dictionary with input_ids and labels
        return {'input_ids': input_ids, 'labels': labels}


# Correctly creating the dataset instance with the list of windows
train_dataset = SlidingWindowDataset(all_windows)
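
In the class above, the labels are a copy of the padded `input_ids`, so padded positions also count towards the loss. A common alternative, shown here as a hedged sketch rather than as what this notebook does, is to set the labels at padded positions to -100, the value the loss function ignores, and to pass an attention mask. The helper `pad_window` is hypothetical and works on plain lists to keep the example small.

```python
IGNORE_INDEX = -100  # label value skipped by the loss in PyTorch / Hugging Face models

def pad_window(window_token_ids, window_size, pad_token_id):
    """Pad one window and mark padded positions so they do not contribute to the loss."""
    n_real = len(window_token_ids)
    n_pad = window_size - n_real
    return {
        "input_ids": window_token_ids + [pad_token_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
        "labels": window_token_ids + [IGNORE_INDEX] * n_pad,
    }

# 50256 is the ID of GPT-2's end-of-text token, which this notebook reuses for padding
example = pad_window([5, 6, 7], window_size=5, pad_token_id=50256)
print(example)
```

Inside a `Dataset`, each of these lists would be converted with `torch.tensor(..., dtype=torch.long)` before being returned.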


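Now, we initiate training for our second model. Note that `model` is still the object we fine-tuned in Approach 1, so this run continues from those weights rather than from the original pre-trained GPT-2. To compare the two approaches from the same starting point, reload the model with `GPT2LMHeadModel.from_pretrained('gpt2')` before creating the `Trainer`.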

In [16]:
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        logging_dir="./logs",
    ),
    train_dataset=train_dataset,
)

trainer.train()




| Step | Training Loss |
|------|---------------|
| 500  | 3.56   |
| 1000 | 3.4443 |
| 1500 | 3.4053 |
| 2000 | 3.3437 |
| 2500 | 3.3427 |
| 3000 | 3.2626 |
| 3500 | 3.2215 |
| 4000 | 3.1926 |
| 4500 | 3.1958 |
| 5000 | 3.1716 |


TrainOutput(global_step=7677, training_loss=3.2442199540073027, metrics={'train_runtime': 19106.5249, 'train_samples_per_second': 1.607, 'train_steps_per_second': 0.402, 'total_flos': 8022187966464000.0, 'train_loss': 3.2442199540073027, 'epoch': 3.0})

In [17]:
model_save_path = 'models/finetuned_model_2'

# Save the model
model.save_pretrained(model_save_path)

# Save the tokenizer
tokenizer.save_pretrained(tokenizer_save_path)


('models/finetuned_tokenizer/tokenizer_config.json',
 'models/finetuned_tokenizer/special_tokens_map.json',
 'models/finetuned_tokenizer/vocab.json',
 'models/finetuned_tokenizer/merges.txt',
 'models/finetuned_tokenizer/added_tokens.json')

## Evaluating Outputs: Perplexity

One way to evaluate the model is by calculating a **perplexity** score: a measure of how well the probability distribution predicted by the model matches the actual distribution of the words in the text. Lower perplexity indicates better performance.

Perplexity is usually calculated based on the so-called "cross-entropy loss" of the model when predicting the next token in a sequence.

Basically, we feed the model a few texts it hasn't seen yet. For each text, the function below selects a random 512-token segment, and the model predicts each next token within it. These predictions are compared with the actual tokens in the input sequences to compute the loss. We average the loss over all segments, and the perplexity is the exponential of that average.
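
The relationship between loss and perplexity can be shown with a minimal sketch. The per-token loss values below are hypothetical and only serve to illustrate the formula: perplexity is the exponential of the mean cross-entropy loss per token.

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token cross-entropy losses (in nats) for a short sequence
losses = [3.1, 2.8, 3.6, 3.3]
print(round(perplexity(losses), 2))

# A model that guesses uniformly among 50 tokens has a perplexity of about 50
print(round(perplexity([math.log(50)] * 10), 2))
```

One way to read the score: a perplexity of about 32 means the model is, on average, as uncertain as if it were choosing uniformly among about 32 tokens at each step.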

In [38]:
import torch
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def calculate_perplexity(model, tokenizer, texts, batch_size=4):
    # Ensure the model is in evaluation mode.
    model.eval()
    
    tokenizer.pad_token = tokenizer.eos_token
    
    modified_texts = []
    for text in texts:
        # Tokenize the text to find out its total length in tokens.
        tokens = tokenizer.encode(text, add_special_tokens=False)
        text_len = len(tokens)
        
        # If the text is longer than 512 tokens, choose a random start point.
        if text_len > 512:
            start_index = np.random.randint(0, text_len - 512)
            end_index = start_index + 512
            tokens = tokens[start_index:end_index]
        else:
            tokens = tokens[:512]  # Ensure not longer than 512 tokens
        
        # Decode tokens back to text.
        modified_text = tokenizer.decode(tokens, clean_up_tokenization_spaces=True)
        modified_texts.append(modified_text)
    
    # Proceed as before but with modified_texts.
    encodings = tokenizer(modified_texts, return_tensors='pt', padding=True, truncation=True, max_length=512, add_special_tokens=True)
    
    dataset = TensorDataset(encodings.input_ids, encodings.attention_mask)
    
    dataloader = DataLoader(dataset, batch_size=batch_size)

    total_loss = 0.0
    total_length = 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids, attention_mask = batch[0].to(model.device), batch[1].to(model.device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            loss = outputs.loss
            total_loss += loss.item() * input_ids.size(0)
            total_length += input_ids.size(0)

    average_loss = total_loss / total_length
    perplexity = torch.exp(torch.tensor(average_loss))

    return perplexity.item()

# Example usage
model_1 = GPT2LMHeadModel.from_pretrained('models/finetuned_model_1').to('cpu')
model_2 = GPT2LMHeadModel.from_pretrained('models/finetuned_model_2').to('cpu')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

eval_texts = load_texts_from_folder('data/eval_texts')
perplexity_1 = calculate_perplexity(model_1, tokenizer, eval_texts)
print(f"Model 1 Perplexity: {perplexity_1}")
perplexity_2 = calculate_perplexity(model_2, tokenizer, eval_texts)
print(f"Model 2 Perplexity: {perplexity_2}")


Token indices sequence length is longer than the specified maximum sequence length for this model (20446 > 1024). Running this sequence through the model will result in indexing errors


Model 1 Perplexity: 32.5595703125
Model 2 Perplexity: 39.908390045166016


Even though `model 2` was fine-tuned on far more data, it has a higher perplexity score, meaning its predictions on the test data are less accurate. There could be many reasons for this; for instance, `model 2` might be overfitting to less relevant, more specific patterns in the data. Keep in mind that each call to `calculate_perplexity` draws its own random 512-token segments, so the two scores are not computed on exactly the same text.

## Evaluating Outputs: Interpretation

While perplexity offers one approach to comparing model performance, we might be more interested in the kinds of texts these models actually generate, and how informative, surprising, or inspiring they are. 

Let's compare the performance of the two models ourselves. We'll enter a prompt and have a look at how the two models complete it.

When `do_sample=True`, the model generates text by sampling from the probability distribution of the next token given the context. This distribution is determined by the model's predictions. Instead of simply picking the most probable next token (deterministic), the model randomly selects the next token based on this probability distribution, which can introduce variety and creativity in the generated text.

Parameters like `temperature`, `top_k`, and `top_p` modify this distribution to control diversity and coherence:

- Temperature: Controls randomness; higher values increase diversity.
- Top-p: The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus.
- Top-k: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens.
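
The effect of these three parameters can be sketched on a toy distribution. This is a simplified illustration, not the implementation used by `model.generate`, and it may differ from it in details such as the order of filtering and renormalization. The logits and the function name `filtered_distribution` are assumptions.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, top-k and top-p to toy logits; return {token: probability}."""
    # Temperature: divide logits before the softmax (higher = flatter distribution)
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break  # smallest set whose probability mass reaches top_p
        ranked = kept

    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}  # renormalize the survivors

toy_logits = {"the": 2.0, "a": 1.0, "stage": 0.5, "curtain": -1.0}
print(filtered_distribution(toy_logits, top_k=2))
print(filtered_distribution(toy_logits, temperature=2.0))
```

Sampling then draws the next token from the returned probabilities, so removing the unlikely tail (lower `top_k` or `top_p`) or sharpening the distribution (lower `temperature`) makes the output more predictable.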

In [35]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Assuming the tokenizer is the same for both models and has been loaded previously
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def generate_text(model, prompt, do_sample=True, max_length=50, temperature=1, top_k=50, top_p=0.95, repetition_penalty=1.1):
    """
    Generates text based on a given prompt using the specified model.
    
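    Parameters:
    - model: The fine-tuned model to use for text generation.
    - prompt: The initial text to start generating from.
    - do_sample: If True, sample the next token; if False, use greedy decoding.
    - max_length: Maximum length of the output in tokens, including the prompt.
    - temperature: Sampling temperature for generating text.
    - top_k: The number of highest probability vocabulary tokens to keep for top-k filtering.
    - top_p: Nucleus sampling's cumulative probability cutoff to keep for top-p filtering.
    - repetition_penalty: Penalty applied to already generated tokens; values above 1 discourage repetition.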
    
    Returns:
    - generated_text: The generated text as a string.
    """
    # Encode the prompt text to tensor
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    # Generate a sequence of tokens following the prompt
    output_ids = model.generate(input_ids, max_length=max_length, 
                                temperature=temperature, 
                                do_sample=do_sample, 
                                top_k=top_k, 
                                top_p=top_p, 
                                repetition_penalty=repetition_penalty, 
                                pad_token_id=tokenizer.eos_token_id)
    
    # Decode the generated tokens to a string
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    return generated_text

# Load your fine-tuned models
model_1 = GPT2LMHeadModel.from_pretrained('models/finetuned_model_1')
model_2 = GPT2LMHeadModel.from_pretrained('models/finetuned_model_2')

# Prompt to generate text from - change this!
prompt = "Once upon a time"

# Generate texts
generated_text_1 = generate_text(model_1, prompt, max_length=150, temperature=1)
generated_text_2 = generate_text(model_2, prompt, max_length=150, temperature=1)
print("Generated text from model 1:", generated_text_1, '\n')
print("Generated text from model 2:", generated_text_2)


Generated text from model 1: Once upon a time, it was clear that the young King had left town and stayed in Sir Tristram's flat. But he never brought his father to see her again until after breakfast at twelve o'clock; not even on purpose for an explanation of what happened during dinner before midnight but as she continued asleep: there is no mention given where they have gone from thence except according their custom how night fell (for if God were known so all those things might be hidden); then perhaps this woman will give me further indications concerning some curious event which occurred nine days since last we saw each other--in case nothing should happen between now and evening? I cannot help dreaming more than three-quarters or six hours into our conversation about something else worthy mentioning 

Generated text from model 2: Once upon a time he would have found me and we were alone. A short while later (to the men at their table) Mr Marwood is about to be seized, but before

What do you notice about the difference between the output of `model 1` and `model 2`?