# Fine-Tuning GPT-2 for Joke Generation

This notebook contains the complete code to fine-tune a pre-trained GPT-2 model to generate jokes. We will use the `transformers` and `datasets` libraries from Hugging Face.

**Steps:**
1.  **Setup**: Install and import the required libraries.
2.  **Load Dataset**: Load a dataset of short jokes from the Hugging Face Hub.
3.  **Preprocessing**: Tokenize the dataset and format it for training.
4.  **Training**: Fine-tune the GPT-2 model on our joke dataset.
5.  **Inference**: Use the fine-tuned model within a `pipeline` to generate new jokes.

## 1. Setup

First, let's install the necessary libraries. We need `transformers` for the model, `datasets` to handle the data, `torch` as the backend, and `accelerate`, which recent versions of `Trainer` depend on.

In [None]:
!pip install transformers datasets torch accelerate -q

Now, let's import everything we'll need.

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    pipeline
)

## 2. Load Dataset

We'll use the `short-jokes-dataset` from the Hugging Face Hub. It's a simple dataset with a single column, `text`, containing the jokes.

In [None]:
dataset = load_dataset("short-jokes-dataset", split="train")

# Let's take a look at a few examples
print(dataset)
for i in range(3):
    print(f"Joke {i+1}: {dataset[i]['text']}")

## 3. Preprocessing

Next, we need to prepare the data for the model. This involves:
1.  Loading the GPT-2 tokenizer.
2.  Setting a padding token to handle jokes of different lengths.
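3.  Creating a function to tokenize the text. We will wrap each joke in a start marker (`<|startoftext|>`) and GPT-2's end-of-text token (`<|endoftext|>`) to teach the model the structure of a complete joke. Only `<|endoftext|>` is a real special token in GPT-2's vocabulary; `<|startoftext|>` is split into ordinary subword pieces, which still works as a consistent prefix.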
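
Padding every joke to a fixed length also affects the labels: if the padded positions are left in, the loss is computed on padding as well as on the joke. A common refinement is to set the labels at padded positions to `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea in plain Python; the helper name `pad_and_mask` and the example token ids are hypothetical, and right-side padding is assumed.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad one sequence and build its attention mask and labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # real tokens keep their id as the label; padding is ignored by the loss
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


# Hypothetical ids for a short joke; 50256 is GPT-2's <|endoftext|> id,
# which also serves as the pad id when pad_token = eos_token.
example = pad_and_mask([27, 91, 9688, 50256], max_length=8, pad_id=50256)
print(example["input_ids"])
print(example["attention_mask"])
print(example["labels"])
```

Because the pad token and the end-of-text token share an id here, the mask is built from positions rather than from token values, so the real end-of-text token still contributes to the loss.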

In [None]:
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# GPT-2 doesn't have a default pad token, so we'll reuse the end-of-sequence (EOS) token.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Format each joke with start and end tokens
    formatted_jokes = [f"<|startoftext|>{joke}<|endoftext|>" for joke in examples["text"]]
    
    # Tokenize the formatted jokes
    tokenized_output = tokenizer(
        formatted_jokes,
        truncation=True,
        padding="max_length",
        max_length=128 # You can adjust this based on joke length
    )
    
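    # For causal language modeling the labels are the input_ids themselves
    # (the model shifts them internally). Padded positions are set to -100
    # so that the loss ignores them.
    tokenized_output["labels"] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attn)]
        for ids, attn in zip(tokenized_output["input_ids"], tokenized_output["attention_mask"])
    ]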
    
    return tokenized_output

# Apply the tokenization to the entire dataset
# We use batched=True for faster processing
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

print("\nTokenized dataset sample:")
print(tokenized_dataset[0])

## 4. Training

Now we are ready to fine-tune the model, in three steps:
1.  Load the pre-trained `GPT2LMHeadModel`.
2.  Define `TrainingArguments` to configure the training process (e.g., number of epochs, learning rate, output directory).
3.  Create a `Trainer` instance and start training.
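
To get a feel for how the number of epochs, the batch size, and the warmup steps interact, the following sketch estimates the number of optimizer steps and the learning rate at a few points in training. It assumes a single device, no gradient accumulation, and what we believe are the `Trainer` defaults (a base learning rate of 5e-5 with linear warmup followed by linear decay). The dataset size and the function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 200_000  # hypothetical dataset size
total = total_training_steps(num_examples, batch_size=8, num_epochs=1)
print(f"total optimizer steps: {total}")

for step in (0, 250, 500, total // 2, total):
    lr = linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

If the dataset is small, 500 warmup steps can cover a large share of a one-epoch run, in which case a smaller warmup value may be worth trying.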

In [None]:
# Load the pre-trained model
model = GPT2LMHeadModel.from_pretrained(model_name)

# Make sure the embedding matrix matches the tokenizer's vocabulary size.
# Reusing EOS as the pad token adds no new tokens, so this is a no-op here.
model.resize_token_embeddings(len(tokenizer))

# Define the output directory for our fine-tuned model
output_dir = "./gpt2-joker"

# Define training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,  # For a quick demonstration. Increase to 3-5 for better results.
    per_device_train_batch_size=8, # Adjust based on your GPU memory
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    fp16=torch.cuda.is_available(), # Use mixed precision if a GPU is available
    report_to="none", # Can be set to "wandb" or "tensorboard"
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

# Start training
print("Starting the training process...")
trainer.train()
print("Training finished!")

# Save the final model and tokenizer
print(f"Saving model to {output_dir}")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

## 5. Inference with Pipeline

With our model fine-tuned and saved, we can now use it to generate jokes. The `pipeline` function from `transformers` makes this incredibly easy.

We will load our saved model into a `text-generation` pipeline and use it to generate a few jokes.
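
The generation call uses three sampling controls: `temperature`, `top_k`, and `top_p`. The sketch below illustrates what each one does to the next-token distribution, using plain Python and toy logits. It is a simplified illustration under our own assumptions, not the library's implementation, and the function name `sample_next_token` is hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Pick one token id from raw logits using temperature, top-k, and top-p."""
    # 1. Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # 2. Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # 3. Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    norm = sum(weights)
    probs = [w / norm for w in weights]

    # 4. Top-p (nucleus): keep the smallest prefix whose cumulative
    #    probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # 5. Sample from what is left; choices() renormalizes the weights.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


toy_logits = [5.0, 4.0, 0.0, -3.0]  # toy scores for a 4-token vocabulary
rng = random.Random(0)  # seeded for repeatable draws
print([sample_next_token(toy_logits, rng=rng) for _ in range(10)])
```

Lowering `top_k` or `top_p` restricts sampling to the most likely tokens and tends to give safer, more repetitive jokes; raising `temperature` does the opposite.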

In [None]:
# Load the fine-tuned model using the pipeline
joke_generator = pipeline("text-generation", model=output_dir, tokenizer=output_dir)

# The prompt should be the start-of-text marker we used during training
prompt = "<|startoftext|>"

print("--- Generating Jokes ---\n")

generated_jokes = joke_generator(
    prompt,
    max_length=80, # Max total length in tokens, including the prompt
    num_return_sequences=5, # Number of jokes to generate
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id # GPT-2 has no pad token, so pad with EOS
)

for i, joke in enumerate(generated_jokes):
    # Clean up the output by removing the prompt and end token
    joke_text = joke['generated_text'].replace(prompt, "").replace("<|endoftext|>", "").strip()
    print(f"Joke {i+1}: {joke_text}\n")