# SFT Training for Tweet Generation

This notebook applies Supervised Fine-Tuning (SFT) to GPT-2 so that it generates tweets from short instructions such as "Write a funny tweet about programming".

## Setup and Installation


In [4]:
# Install required packages
%pip install -q transformers datasets trl wandb accelerate

# Import libraries
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset
import torch
import wandb
import os
import json

print("✅ Packages installed and imported successfully!")


✅ Packages installed and imported successfully!


## GPU Setup and Device Detection


In [5]:
# Check GPU availability
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"🚀 Using GPU: {torch.cuda.get_device_name(0)}")
    print(f"📊 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"🔧 CUDA Version: {torch.version.cuda}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🍎 Using Apple Silicon GPU (MPS)")
else:
    device = torch.device("cpu")
    print("💻 Using CPU (training will be slower)")

# In Colab, select T4 GPU
print(f"\n🎯 Selected device: {device}")


🚀 Using GPU: Tesla T4
📊 GPU Memory: 15.8 GB
🔧 CUDA Version: 12.6

🎯 Selected device: cuda


## Data Setup


In [6]:
# Upload the dataset (it is not stored in the GitHub repository)
from google.colab import files
uploaded = files.upload()

Saving tweet_sft_dataset_10k.jsonl to tweet_sft_dataset_10k.jsonl


In [7]:
dataset_path = "tweet_sft_dataset_10k.jsonl"

# Check if dataset file exists
if os.path.exists(dataset_path):
    print(f"✅ Found dataset: {dataset_path}")
    data_path = dataset_path
else:
    print("📝 Creating sample dataset for testing...")
    # Create a small sample dataset (AI-suggested)
    sample_data = [
        {"instruction": "Write a personal_story tweet about coding", "response": "Spent 2 hours debugging a typo. It was a missing semicolon 😅"},
        {"instruction": "Write a classic tweet about wisdom", "response": "The most dangerous phrase in programming: 'Just a small change'"},
        {"instruction": "Write a funny tweet about technology", "response": "My computer is so slow, it's still processing my thoughts from yesterday"},
        {"instruction": "Write a motivational tweet about learning", "response": "Every expert was once a beginner. Keep coding! 💪"},
        {"instruction": "Write a relatable tweet about work", "response": "Me: I'll just fix this one small bug. Also me: 3 hours later..."}
    ]

    # Save sample data
    with open(dataset_path, 'w') as f:
        for item in sample_data:
            f.write(json.dumps(item) + '\n')

    data_path = dataset_path
    print(f"✅ Created sample dataset: {dataset_path}")

# Load dataset
dataset = load_dataset("json", data_files=data_path)
print(f"📊 Dataset loaded: {len(dataset['train'])} examples")


✅ Found dataset: tweet_sft_dataset_10k.jsonl


📊 Dataset loaded: 10000 examples
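
Before training, it can help to confirm that every line of the JSONL file parses and carries the two fields the notebook relies on (`instruction` and `response`). The following is a minimal sketch and not part of the original notebook; `find_bad_lines` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import json

# Fields the formatting step reads from each record.
REQUIRED_KEYS = ("instruction", "response")

def find_bad_lines(lines):
    """Return (line_number, reason) pairs for records that would break formatting."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append((lineno, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            bad.append((lineno, "not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or only whitespace.
        missing = [
            key for key in REQUIRED_KEYS
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if missing:
            bad.append((lineno, f"missing or empty: {', '.join(missing)}"))
    return bad

# Illustrative records (made up for this sketch, not taken from the dataset).
sample_lines = [
    '{"instruction": "Write a funny tweet about technology", "response": "Hello"}',
    '{"instruction": "Write a classic tweet about wisdom"}',
    'not json at all',
]
print(find_bad_lines(sample_lines))
```

To check the real file, the same function could be called as `find_bad_lines(open(dataset_path, encoding="utf-8"))`.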


## Model Setup


In [8]:
# Model configuration
model_name = "gpt2"

print(f"🤖 Loading model: {model_name}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Move model to device
model = model.to(device)

print(f"✅ Model loaded and moved to {device}")
print(f"📏 Model parameters: {model.num_parameters():,}")


🤖 Loading model: gpt2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Model loaded and moved to cuda
📏 Model parameters: 124,439,808


## Data Preprocessing


In [10]:
def format_dataset(examples):
    """Format the dataset for the model"""
    texts = [
        inst + "\nResponse: " + resp
        for inst, resp in zip(examples["instruction"], examples["response"])
    ]
    return {"text": texts}

# Pre-format the dataset and remove all other columns
print("🔄 Formatting dataset...")
formatted_dataset = dataset["train"].map(
    format_dataset,
    batched=True,
    remove_columns=dataset["train"].column_names
)

print(f"✅ Formatted dataset columns: {formatted_dataset.column_names}")
print(f"📝 First example: {formatted_dataset[0]}")
print(f"📊 Total examples: {len(formatted_dataset)}")


🔄 Formatting dataset...
✅ Formatted dataset columns: ['text']
📝 First example: {'text': 'Write a personal_story tweet about coding\nResponse: Spent 2 hours debugging a typo. It was a missing semicolon 😅'}
📊 Total examples: 10000
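
The template above joins the instruction and the response with `"\nResponse: "`. A prompt used at generation time should be an exact prefix of that training text, otherwise the model sees a format it was never trained on. The sketch below keeps both strings in one place; `RESPONSE_MARKER`, `build_training_text`, and `build_inference_prompt` are hypothetical helpers introduced for illustration, not functions from the notebook.

```python
# Separator between instruction and response, matching format_dataset.
RESPONSE_MARKER = "\nResponse:"

def build_training_text(instruction: str, response: str) -> str:
    """Full training example: instruction, marker, one space, response."""
    return f"{instruction}{RESPONSE_MARKER} {response}"

def build_inference_prompt(instruction: str) -> str:
    """Prompt for generation: stops right after the marker, with no trailing space."""
    return f"{instruction}{RESPONSE_MARKER}"

instruction = "Write a motivational tweet about learning"
text = build_training_text(instruction, "Every expert was once a beginner. Keep coding! 💪")
prompt = build_inference_prompt(instruction)

# The inference prompt must be a prefix of the corresponding training text.
assert text.startswith(prompt)
print(repr(prompt))
```

Ending the prompt without a trailing space is a deliberate choice in this sketch: GPT-2's tokenizer usually attaches a leading space to the following word, so the model is expected to produce that space itself as part of the first response token.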


In [11]:
# Split the dataset
train_dataset, val_dataset = formatted_dataset.train_test_split(test_size=1000, seed=42).values()

print(f"📊 Training dataset size: {len(train_dataset)}")
print(f"📊 Validation dataset size: {len(val_dataset)}")

📊 Training dataset size: 9000
📊 Validation dataset size: 1000


## Weights & Biases Setup


In [16]:
# Initialize W&B
wandb.init(
    project="rlhf-learning-sft",
    name="tweet-generation-sft-colab-3epochs-validation",
    config={
        "train_size": len(train_dataset),
        "val_size": len(val_dataset),
        "num_epochs": 3,
        "experiment": "validation_split",
        "batch_size": 4,
        "gradient_accumulation_steps": 4,
        "learning_rate": 5e-5,
        "warmup_steps": 100,
        "max_length": 512,
        "eval_steps": 100,
        "device": str(device),
        "cuda_available": torch.cuda.is_available(),
        "mps_available": torch.backends.mps.is_available(),
        "gpu_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "Apple Silicon (MPS)" if torch.backends.mps.is_available() else "CPU",
        "cuda_version": torch.version.cuda if torch.cuda.is_available() else None,
    }
)

print("✅ W&B initialized successfully!")


✅ W&B initialized successfully!


## Training Configuration


In [17]:
# Training arguments
training_args = TrainingArguments(
    output_dir="../sft_results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,

    # Evaluation settings
    eval_strategy="steps",  # Evaluate during training
    eval_steps=100,  # Evaluate every 100 steps

    # Checkpoint settings: with load_best_model_at_end=True,
    # save_steps must be a round multiple of eval_steps
    save_strategy="steps",
    save_steps=2000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # Use validation loss

    report_to="wandb",
    run_name="exp2-validation-split-3epochs",
    logging_dir="../logs",
)

print("✅ Training arguments configured!")
print(f"📊 Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"💾 Output directory: {training_args.output_dir}")


✅ Training arguments configured!
📊 Effective batch size: 16
💾 Output directory: ../sft_results
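
The effective batch size of 16 is the per-device batch size (4) multiplied by the gradient accumulation steps (4). The same arithmetic gives the number of optimizer steps for the 9,000-example training split. The sketch below is a rough estimate only; `optimizer_steps` is a hypothetical helper, and how a partial final accumulation group is rounded may differ between library versions, so both roundings are returned.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate optimizer steps, returning (rounded down, rounded up)."""
    # Number of micro-batches the dataloader yields per epoch.
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer step happens every `grad_accum` micro-batches.
    steps_floor = (micro_batches // grad_accum) * epochs
    steps_ceil = math.ceil(micro_batches / grad_accum) * epochs
    return steps_floor, steps_ceil

low, high = optimizer_steps(num_examples=9000, per_device_batch=4, grad_accum=4, epochs=3)
print(f"Estimated optimizer steps: between {low} and {high}")

# Share of training spent in warmup, given warmup_steps=100.
print(f"Warmup share: {100 / low:.1%}")
```

With `warmup_steps=100`, warmup covers roughly the first 6% of training under either rounding.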


## Trainer Setup


In [18]:
# SFT Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

print("✅ SFT Trainer configured!")
print(f"🎯 Training dataset size: {len(train_dataset)}")
print(f"🔄 Total training steps: {len(train_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs}")


✅ SFT Trainer configured!
🎯 Training dataset size: 9000
🔄 Total training steps: 1686


## Training


In [19]:
# Start training
print("🚀 Starting SFT training...")
print("📊 Check your W&B dashboard for real-time metrics!")

trainer.train()

print("✅ Training completed successfully!")
print("📁 Check the sft_results folder for saved models")


🚀 Starting SFT training...
📊 Check your W&B dashboard for real-time metrics!


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|------|---------------|-----------------|---------|------------|---------------------|
| 100 | 0.8053 | 0.678122 | 1.010408 | 40561.0 | 0.824944 |
| 200 | 0.455 | 0.374534 | 0.665175 | 81332.0 | 0.870371 |
| 300 | 0.3994 | 0.356035 | 0.594075 | 122015.0 | 0.873447 |
| 400 | 0.3468 | 0.331604 | 0.58066 | 162920.0 | 0.872608 |
| 500 | 0.3297 | 0.331336 | 0.570523 | 203465.0 | 0.873785 |
| 600 | 0.3236 | 0.33143 | 0.555294 | 243857.0 | 0.874127 |
| 700 | 0.3284 | 0.327272 | 0.566601 | 284639.0 | 0.873152 |
| 800 | 0.3116 | 0.321753 | 0.563056 | 325624.0 | 0.875432 |
| 900 | 0.3273 | 0.318654 | 0.551397 | 366000.0 | 0.873473 |
| 1000 | 0.3198 | 0.315559 | 0.557615 | 406584.0 | 0.875867 |
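
Validation loss is easier to interpret as perplexity, which is the exponential of the mean per-token cross-entropy. The sketch below converts a few of the logged values; it assumes the reported loss is a mean negative log-likelihood in nats, which is the usual convention for causal language model loss but is not confirmed by the log itself. `perplexity` is a hypothetical helper name.

```python
import math

# Validation losses copied from the training log (step -> eval loss).
eval_losses = {100: 0.678122, 500: 0.331336, 1000: 0.315559}

def perplexity(mean_nll: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)

for step, loss in eval_losses.items():
    print(f"step {step:>4}: eval loss {loss:.4f} -> perplexity {perplexity(loss):.3f}")
```

Because the training text contains both the instruction and the response, these values likely mix prompt tokens and response tokens unless the trainer was configured to score completions only.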


✅ Training completed successfully!
📁 Check the sft_results folder for saved models


## Cleanup and Finalization


In [22]:
# Finish W&B run
wandb.finish()

print("🎉 Training session completed!")
print("📊 Check your W&B dashboard for detailed results")
print(f"💾 Model checkpoints saved in {training_args.output_dir}/")

# Display final model info
print(f"\n📈 Final model info:")
print(f"   Device: {device}")
print(f"   Parameters: {model.num_parameters():,}")
print(f"   Training examples: {len(formatted_dataset)}")


🎉 Training session completed!
📊 Check your W&B dashboard for detailed results
💾 Model checkpoints saved in ../sft_results/

📈 Final model info:
   Device: cuda
   Parameters: 124,439,808
   Training examples: 10000


## Test the Trained Model (Optional)

Test your fine-tuned model with some sample prompts!


In [23]:
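# Test the trained model
def generate_tweet(prompt, max_length=100):
    """Generate a tweet using the trained model.

    max_length counts the prompt tokens plus the generated tokens.
    """
    input_text = f"{prompt}\nResponse:"
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # Switch to eval mode so dropout is disabled; the model may still be
    # in training mode after trainer.train()
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # The decoded text includes the prompt followed by the generated response
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text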

# Test with sample prompts
test_prompts = [
    "Write a funny tweet about programming",
    "Write a motivational tweet about learning",
    "Write a personal story tweet about coding"
]

print("🧪 Testing the trained model...")
print("=" * 50)

for prompt in test_prompts:
    print(f"\n📝 Prompt: {prompt}")
    response = generate_tweet(prompt)
    print(f"🤖 Generated: {response}")
    print("-" * 30)


🧪 Testing the trained model...

📝 Prompt: Write a funny tweet about programming
🤖 Generated: Write a funny tweet about programming
Response: Just realized that debugging is 90% reading your own code and wondering who wrote this garbage 🤔
------------------------------

📝 Prompt: Write a motivational tweet about learning
🤖 Generated: Write a motivational tweet about learning
Response: Learning machine learning is like learning a new language - confusing until it makes sense
------------------------------

📝 Prompt: Write a personal story tweet about coding
🤖 Generated: Write a personal story tweet about coding
Response: Spent 2 hours chasing down an error. It was a missing semicolon 😅
------------------------------
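
The generated text above repeats the prompt before the response, because the decoded sequence includes the input tokens. If only the tweet is wanted, the response can be cut out after the `Response:` marker. This is a minimal sketch; `extract_response` is a hypothetical helper that is not part of the notebook, and it assumes the marker string matches the one used in the prompt.

```python
def extract_response(generated_text: str, marker: str = "\nResponse:") -> str:
    """Return only the text that follows the first marker.

    If the model emits further marker blocks, only the first one is kept.
    If the marker is absent, the whole text is returned stripped.
    """
    _, found, tail = generated_text.partition(marker)
    if not found:
        return generated_text.strip()
    # Drop anything after a second marker, should the model keep going.
    first_block, _, _ = tail.partition(marker)
    return first_block.strip()

generated = (
    "Write a personal story tweet about coding\n"
    "Response: Spent 2 hours chasing down an error. It was a missing semicolon 😅"
)
print(extract_response(generated))
```

Splitting on the marker rather than slicing by prompt length avoids problems when decoding does not reproduce the prompt string character for character.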


In [20]:
# Load the base GPT-2 model for comparison
print("🤖 Loading base GPT-2 model for comparison...")
base_model_name = "gpt2"
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Set pad token for base tokenizer
if base_tokenizer.pad_token is None:
    base_tokenizer.pad_token = base_tokenizer.eos_token

# Move base model to device
base_model = base_model.to(device)

print("✅ Base GPT-2 model loaded.")


def generate_tweet_base(prompt, max_length=100):
    """Generate a tweet using the base model"""
    input_text = f"{prompt}\nResponse:"
    inputs = base_tokenizer(input_text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = base_model.generate(
            **inputs,
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=base_tokenizer.eos_token_id
        )

    generated_text = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Test with sample prompts using the base model
test_prompts = [
    "Write a funny tweet about programming",
    "Write a motivational tweet about learning",
    "Write a personal story tweet about coding"
]

print("\n🧪 Testing the BASE model...")
print("=" * 50)

for prompt in test_prompts:
    print(f"\n📝 Prompt: {prompt}")
    response = generate_tweet_base(prompt)
    print(f"🤖 Generated (Base Model): {response}")
    print("-" * 30)

🤖 Loading base GPT-2 model for comparison...
✅ Base GPT-2 model loaded.

🧪 Testing the BASE model...

📝 Prompt: Write a funny tweet about programming
🤖 Generated (Base Model): Write a funny tweet about programming
Response: @joe_larsen: I'm not sure if you're aware that I'm a programmer, but I'm a very busy person and I've been doing a lot of research on programming languages. I'm a bit obsessed with programming, and I think I might have gotten a little too obsessed with it because I was really excited about it. But I'm not sure if you realize what's going on. It's a really big gap.
------------------------------

📝 Prompt: Write a motivational tweet about learning
🤖 Generated (Base Model): Write a motivational tweet about learning
Response:
"I love math. I love being able to see what I want to see, understand it, and explain it to others. I don't usually do this, but I enjoy doing it. And when I do it, it is really great."
Response:
"That's really nice and I'll do that again."
-------