# Hamletmachine LLM Training on Google Colab Pro

This notebook trains the hamletmachine language model using Google Colab's GPU.

## 🎓 Colab Pro Benefits (Free for Students)

- **Better GPUs**: T4, P100, V100, or A100, depending on availability (vs T4 only on free tier)
- **Longer Sessions**: Up to 24 hours (vs ~9-12 hours on free tier)
- **More RAM**: Better for larger models and datasets
- **Better Availability**: Priority GPU access

## Setup Instructions

1. **Enable GPU**: Runtime → Change runtime type → Hardware accelerator → **GPU**
2. **Mount Google Drive** (recommended, for saving checkpoints): Run the mount cell below
3. **Clone from GitHub**: Run the clone cell to get the latest code
4. **Run all cells** in order

## Notes
- Checkpoints are saved to Google Drive (if mounted) for persistence
- Training progress is logged to TensorBoard
- Settings are auto-optimized based on your GPU type


## 1. Setup Environment & GPU Detection

In [None]:
# Check GPU availability and detect GPU type
import torch
import os

print("🔍 Detecting GPU...")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    
    print(f"✅ GPU: {gpu_name}")
    print(f"   Memory: {gpu_memory:.2f} GB")
    
    # Detect GPU type and set optimal defaults
    gpu_type = None
    if 'T4' in gpu_name:
        gpu_type = 'T4'
        print("   Type: T4 (16GB) - Good for GPT-2 small/medium")
    elif 'P100' in gpu_name:
        gpu_type = 'P100'
        print("   Type: P100 (16GB) - Excellent! Can handle GPT-2 medium/large")
    elif 'V100' in gpu_name:
        gpu_type = 'V100'
        print("   Type: V100 (16GB/32GB) - Excellent! Can handle larger models")
    elif 'A100' in gpu_name:
        gpu_type = 'A100'
        print("   Type: A100 - Premium! Can handle very large models")
    else:
        gpu_type = 'UNKNOWN'
        print(f"   Type: Unknown GPU - Using conservative settings")
    
    # Store GPU info for later use
    GPU_TYPE = gpu_type
    GPU_MEMORY_GB = gpu_memory
    
    print(f"\n💡 Recommended settings for {gpu_type}:")
    if gpu_type == 'T4':
        print("   - Model: GPT-2 small (124M) or medium (350M)")
        print("   - Batch size: 4-8 per device")
        print("   - FP16: Enabled (recommended)")
    elif gpu_type in ['P100', 'V100']:
        print("   - Model: GPT-2 medium (350M) or large (774M)")
        print("   - Batch size: 8-16 per device")
        print("   - FP16: Enabled (recommended)")
    elif gpu_type == 'A100':
        print("   - Model: GPT-2 large (774M) or XL (1.5B)")
        print("   - Batch size: 16-32 per device")
        print("   - FP16: Enabled (recommended)")
else:
    print("⚠️  No GPU detected! Please enable GPU: Runtime → Change runtime type → GPU")
    GPU_TYPE = None
    GPU_MEMORY_GB = 0
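
The model-size recommendations above follow from a rough memory budget. As a back-of-the-envelope sketch (not a measurement), training in FP32 with AdamW under automatic mixed precision keeps roughly 16 bytes per parameter: 4 for weights, 4 for gradients, and 8 for the two optimizer moments, before counting activations. The function name and the 16-byte figure are assumptions for illustration.

```python
BYTES_PER_PARAM = 16  # assumed: 4 (weights) + 4 (gradients) + 8 (AdamW moments)

def estimate_training_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Rough lower bound on GPU memory for weights, gradients, and optimizer state."""
    return num_params * bytes_per_param / 1e9

# Parameter counts as quoted in this notebook
for name, params in [("gpt2", 124e6), ("gpt2-medium", 350e6),
                     ("gpt2-large", 774e6), ("gpt2-xl", 1.5e9)]:
    print(f"{name}: ~{estimate_training_memory_gb(params):.1f} GB before activations")
```

Under these assumptions, gpt2-xl needs about 24 GB for state alone, which is why it is only suggested for the A100.
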

In [None]:
# Mount Google Drive (recommended - for saving checkpoints)
from google.colab import drive
drive.mount('/content/drive')

# Set checkpoint directory
CHECKPOINT_DIR = '/content/drive/MyDrive/hamletmachine/checkpoints'
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
print(f"✅ Checkpoints will be saved to: {CHECKPOINT_DIR}")
print("💡 Checkpoints saved to Drive will persist after session ends!")
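
Because checkpoints persist on Drive, an interrupted session can pick up from the most recent one. The sketch below shows one way to locate it, assuming the Hugging Face `Trainer` convention of naming checkpoint folders `checkpoint-<step>`. The helper name `find_latest_checkpoint` is hypothetical and not part of the repository.

```python
import os
import re

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers as integers so checkpoint-1000 beats checkpoint-500
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

If a path is returned, it can be passed to `trainer.train(resume_from_checkpoint=path)` to continue from that step.
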

In [None]:
# Clone from GitHub
!git clone https://github.com/danzhechen/hamletmachine.git
%cd hamletmachine

print("✅ Repository cloned!")

## 2. Install Dependencies

In [None]:
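# Install project dependencies
# Version specifiers are quoted so the shell does not treat ">" as an output redirect.
# transformers>=4.41 is required for the `eval_strategy` argument used in the training cell.
!pip install -q "transformers>=4.41.0" "datasets>=2.14.0" "accelerate>=0.24.0" "tokenizers>=0.15.0"
!pip install -q "torch>=2.1.0" "numpy>=1.24.0" "pandas>=2.0.0" "pyyaml>=6.0"
!pip install -q "tensorboard>=2.15.0" "tqdm>=4.66.0"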

# Optional: Install WandB for experiment tracking
# !pip install -q wandb

print("✅ Dependencies installed!")

In [None]:
# Add project to Python path
import sys
sys.path.insert(0, '/content/hamletmachine')

print("✅ Project added to Python path!")

## 3. Upload or Process Data

In [None]:
# Dataset files are automatically available from GitHub!
# The train.jsonl, validation.jsonl, and test.jsonl files are included in the repository.

# Verify data files exist (they should be there after cloning from GitHub)
data_dir = '/content/hamletmachine/data/processed'
required_files = ['train.jsonl', 'validation.jsonl', 'test.jsonl']

if os.path.exists(data_dir):
    files = os.listdir(data_dir)
    jsonl_files = [f for f in files if f.endswith('.jsonl')]
    
    # Check for required files
    missing_files = [f for f in required_files if f not in files]
    
    if not missing_files:
        print(f"✅ All dataset files found: {required_files}")
        # Show file sizes
        for f in required_files:
            file_path = os.path.join(data_dir, f)
            if os.path.exists(file_path):
                size_mb = os.path.getsize(file_path) / (1024 * 1024)
                print(f"   {f}: {size_mb:.2f} MB")
    else:
        print(f"⚠️  Missing files: {missing_files}")
        print("   These should be in the repository. If missing, you can:")
        print("   1. Re-clone the repository")
        print("   2. Or process data on Colab (see Option 2 below)")
    
    if jsonl_files:
        print(f"\n📁 All JSONL files in directory: {jsonl_files}")
else:
    print(f"⚠️  Data directory not found: {data_dir}")
    print("   This should exist after cloning from GitHub.")

# Option 2: Process data on Colab (if you need to regenerate)
# Uncomment below if you want to process raw training materials:
# from hamletmachine.data.pipeline import DataPipeline
# pipeline = DataPipeline(config_path='configs/data_config.yaml')
# pipeline.run()
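
Beyond checking that the files exist, it can help to confirm that each line is valid JSON with the `text` field the tokenization step reads. This is a minimal sketch, assuming one JSON object per line. The helper name `validate_jsonl` is hypothetical.

```python
import json

def validate_jsonl(path, required_field="text"):
    """Count valid records and collect 1-based line numbers of bad ones."""
    valid, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_no)
                continue
            # A usable record is a JSON object whose required field is a string
            if isinstance(record, dict) and isinstance(record.get(required_field), str):
                valid += 1
            else:
                bad_lines.append(line_no)
    return valid, bad_lines
```

For example, `validate_jsonl('/content/hamletmachine/data/processed/train.jsonl')` returns the count of usable records and the line numbers that need attention.
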

## 4. Configure Training (Auto-Optimized for Your GPU)

To train a **bigger model** (e.g. GPT-2 large 774M or XL 1.5B), set `USE_BIGGER_MODEL = True` and `BIGGER_MODEL_NAME = 'gpt2-large'` (or `'gpt2-xl'`) in the next cell. Batch sizes are reduced automatically to avoid OOM.
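
To see what the override does to the effective batch size, the clamping rules from the next cell can be reproduced in isolation. This is a sketch mirroring that logic. The names `OVERRIDE_LIMITS` and `apply_bigger_model_override` are hypothetical.

```python
# Caps mirror the override logic in the configuration cell:
# model name -> (maximum batch size, minimum gradient accumulation steps)
OVERRIDE_LIMITS = {
    "gpt2-xl": (2, 8),
    "gpt2-large": (4, 6),
    "gpt2-medium": (8, 1),  # only the batch size is capped for gpt2-medium
}

def apply_bigger_model_override(batch_size, grad_accum, model_name):
    """Return (batch_size, grad_accum) after applying the memory-safe caps."""
    if model_name not in OVERRIDE_LIMITS:
        return batch_size, grad_accum
    max_batch, min_accum = OVERRIDE_LIMITS[model_name]
    return min(batch_size, max_batch), max(grad_accum, min_accum)

# T4 defaults are batch_size=8, gradient_accumulation=4 (effective batch size 32)
bs, ga = apply_bigger_model_override(8, 4, "gpt2-large")
print(f"batch_size={bs}, grad_accum={ga}, effective={bs * ga}")
```

With the T4 defaults and `gpt2-large`, the per-device batch drops to 4 and accumulation rises to 6, so the effective batch size goes from 32 to 24.
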

In [None]:
# Auto-configure training based on GPU type
import yaml

# ========== OPTIONAL: Use a bigger model (override GPU auto-selection) ==========
# Set to True to train a larger model; batch sizes will be reduced to avoid OOM.
USE_BIGGER_MODEL = False  # Set True to use BIGGER_MODEL_NAME below
BIGGER_MODEL_NAME = 'gpt2-large'  # Options: 'gpt2-medium' (350M), 'gpt2-large' (774M), 'gpt2-xl' (1.5B)
# GPU guidance: T4 → gpt2-medium max; P100/V100 → gpt2-large; A100 → gpt2-large or gpt2-xl
# ================================================================================

# GPU-specific optimizations
def get_gpu_optimized_config(gpu_type, gpu_memory_gb):
    """Get optimized config based on GPU type."""
    
    if gpu_type == 'T4':
        # T4: Good for small/medium models
        return {
            'model_architecture': 'gpt2',  # 124M parameters
            'batch_size': 8,  # Can go up to 8 on T4 with FP16
            'gradient_accumulation': 4,  # Effective batch size: 32
            'fp16': True,
            'max_seq_length': 1024,
        }
    elif gpu_type == 'P100':
        # P100: Excellent for medium/large models
        return {
            'model_architecture': 'gpt2-medium',  # 350M parameters
            'batch_size': 12,  # P100 can handle larger batches
            'gradient_accumulation': 4,  # Effective batch size: 48
            'fp16': True,
            'max_seq_length': 1024,
        }
    elif gpu_type == 'V100':
        # V100: Excellent for large models
        return {
            'model_architecture': 'gpt2-medium',  # Can try gpt2-large (774M)
            'batch_size': 16,  # V100 can handle even larger batches
            'gradient_accumulation': 4,  # Effective batch size: 64
            'fp16': True,
            'max_seq_length': 1024,
        }
    elif gpu_type == 'A100':
        # A100: Premium, can handle very large models
        return {
            'model_architecture': 'gpt2-large',  # 774M parameters
            'batch_size': 24,  # A100 can handle very large batches
            'gradient_accumulation': 4,  # Effective batch size: 96
            'fp16': True,
            'max_seq_length': 1024,
        }
    else:
        # Conservative defaults for unknown GPUs
        return {
            'model_architecture': 'gpt2',  # Start small
            'batch_size': 4,
            'gradient_accumulation': 4,
            'fp16': True,
            'max_seq_length': 1024,
        }

# Get optimized config
if 'GPU_TYPE' in globals() and GPU_TYPE:
    gpu_config = get_gpu_optimized_config(GPU_TYPE, GPU_MEMORY_GB)
    print(f"🎯 Auto-configured for {GPU_TYPE} GPU:")
    print(f"   Model: {gpu_config['model_architecture']}")
    print(f"   Batch size: {gpu_config['batch_size']} per device")
    print(f"   Gradient accumulation: {gpu_config['gradient_accumulation']}")
    print(f"   Effective batch size: {gpu_config['batch_size'] * gpu_config['gradient_accumulation']}")
    print(f"   FP16: {gpu_config['fp16']}")
else:
    # Fallback if GPU not detected
    gpu_config = get_gpu_optimized_config('T4', 16)
    print("⚠️  Using default T4 settings (GPU not detected)")

# Apply bigger-model override (reduces batch size to avoid OOM)
if USE_BIGGER_MODEL and BIGGER_MODEL_NAME:
    bigger = BIGGER_MODEL_NAME
    gpu_config['model_architecture'] = bigger
    # Memory-safe batch sizes for larger models (reduce if you hit OOM)
    if bigger == 'gpt2-xl':  # 1.5B
        gpu_config['batch_size'] = min(gpu_config['batch_size'], 2)
        gpu_config['gradient_accumulation'] = max(gpu_config['gradient_accumulation'], 8)
    elif bigger == 'gpt2-large':  # 774M
        gpu_config['batch_size'] = min(gpu_config['batch_size'], 4)
        gpu_config['gradient_accumulation'] = max(gpu_config['gradient_accumulation'], 6)
    elif bigger == 'gpt2-medium':  # 350M
        gpu_config['batch_size'] = min(gpu_config['batch_size'], 8)
    print(f"📌 Override: using bigger model {bigger} (batch_size={gpu_config['batch_size']}, grad_accum={gpu_config['gradient_accumulation']})")

# Create full training config
config_path = '/content/hamletmachine/configs/train_config.yaml'
if not os.path.exists(config_path):
    config = {
        'model': {
            'architecture': gpu_config['model_architecture'],
        },
        'training': {
            'output_dir': CHECKPOINT_DIR if 'CHECKPOINT_DIR' in globals() else '/content/models/checkpoints',
            'num_train_epochs': 3,
            'per_device_train_batch_size': gpu_config['batch_size'],
            'per_device_eval_batch_size': gpu_config['batch_size'],
            'gradient_accumulation_steps': gpu_config['gradient_accumulation'],
            'learning_rate': 5.0e-5,
            'warmup_steps': 100,
            'logging_steps': 10,
            'save_steps': 500,
            'eval_steps': 500,
            'save_total_limit': 3,
            'fp16': gpu_config['fp16'],
            'dataloader_pin_memory': True,
        },
        'data': {
            'train_file': '/content/hamletmachine/data/processed/train.jsonl',
            'validation_file': '/content/hamletmachine/data/processed/validation.jsonl',
            'max_seq_length': gpu_config['max_seq_length'],
        },
        'tokenizer': {
            'tokenizer_name': 'gpt2',
        },
        'logging': {
            'logger': 'tensorboard',
            'logging_dir': '/content/logs',
        }
    }
    
    # Save config
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)
    print(f"\n✅ Created optimized config at {config_path}")
else:
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    print(f"\n✅ Loaded existing config from {config_path}")
    # Overwrite the loaded values with the GPU-optimized settings
    if 'GPU_TYPE' in globals() and GPU_TYPE:
        config['model']['architecture'] = gpu_config['model_architecture']
        config['training']['per_device_train_batch_size'] = gpu_config['batch_size']
        config['training']['per_device_eval_batch_size'] = gpu_config['batch_size']
        config['training']['gradient_accumulation_steps'] = gpu_config['gradient_accumulation']
        config['training']['fp16'] = gpu_config['fp16']
        print("   (Updated with GPU-optimized settings)")

print("\n📋 Training Configuration:")
print(yaml.dump(config, default_flow_style=False))

## 5. Train Model

In [None]:
# Import training modules
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
import torch

print("✅ Training modules imported!")

In [None]:
# Load tokenizer
tokenizer_name = config['tokenizer']['tokenizer_name']
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Tokenizer loaded: {tokenizer_name}")
print(f"Vocabulary size: {len(tokenizer)}")

In [None]:
# Load model (let TrainingArguments handle mixed precision)
model_name = config['model']['architecture']
use_bf16 = config['training'].get('bf16', False)

print(f"📥 Loading model: {model_name}")
print(f"   Mixed precision: BF16={use_bf16}")

# Load model in float32 - TrainingArguments will handle mixed precision conversion
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32  # Let trainer handle mixed precision
)

# Move to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Calculate model size
num_params = sum(p.numel() for p in model.parameters())
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"✅ Model loaded: {model_name}")
print(f"   Parameters: {num_params / 1e6:.2f}M")
print(f"   Trainable: {num_trainable / 1e6:.2f}M")
print(f"   Device: {device}")

# Check GPU memory usage
if torch.cuda.is_available():
    memory_allocated = torch.cuda.memory_allocated(0) / 1e9
    memory_reserved = torch.cuda.memory_reserved(0) / 1e9
    print(f"   GPU Memory - Allocated: {memory_allocated:.2f} GB, Reserved: {memory_reserved:.2f} GB")

In [None]:
# Load and tokenize datasets
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=config['data']['max_seq_length'],
        padding='max_length'
    )

print("📥 Loading datasets...")

# Load JSONL files
train_dataset = load_dataset('json', data_files=config['data']['train_file'], split='train')
val_dataset = load_dataset('json', data_files=config['data']['validation_file'], split='train')

print(f"   Training examples: {len(train_dataset)}")
print(f"   Validation examples: {len(val_dataset)}")

# Tokenize
print("🔤 Tokenizing datasets...")
train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing training set"
)
val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    desc="Tokenizing validation set"
)

print("✅ Datasets loaded and tokenized!")
print(f"   Training examples: {len(train_dataset)}")
print(f"   Validation examples: {len(val_dataset)}")
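
Padding every example to `max_length` does not distort the loss, because the causal-LM collator configured in the next cell (`mlm=False`) masks pad positions in the labels. The sketch below reproduces that masking in plain Python, assuming the pad token id equals the EOS id (50256 for GPT-2), as set up in the tokenizer cell. The function name is hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def build_causal_lm_labels(input_ids, pad_token_id):
    """Copy input ids to labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

padded = [464, 2068, 7586, 50256, 50256, 50256]  # toy ids, padded with EOS (50256)
print(build_causal_lm_labels(padded, pad_token_id=50256))
# prints [464, 2068, 7586, -100, -100, -100]
```

One consequence of sharing an id between pad and EOS is that genuine end-of-text tokens are masked as well, so the model may receive little signal about when to stop generating.
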

In [None]:
# Setup data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal LM, not masked LM
)

# Setup training arguments
training_args = TrainingArguments(
    output_dir=config['training']['output_dir'],
    num_train_epochs=config['training']['num_train_epochs'],
    per_device_train_batch_size=config['training']['per_device_train_batch_size'],
    per_device_eval_batch_size=config['training']['per_device_eval_batch_size'],
    gradient_accumulation_steps=config['training']['gradient_accumulation_steps'],
    learning_rate=config['training']['learning_rate'],
    warmup_steps=config['training']['warmup_steps'],
    logging_steps=config['training']['logging_steps'],
    save_steps=config['training']['save_steps'],
    eval_steps=config['training']['eval_steps'],
    save_total_limit=config['training']['save_total_limit'],
    fp16=config['training'].get('fp16', False),
    bf16=config['training'].get('bf16', False),  # BF16 needs Ampere or newer (e.g. A100); T4/P100/V100 do not support it
    dataloader_pin_memory=config['training'].get('dataloader_pin_memory', True),
    logging_dir=config['logging']['logging_dir'],
    eval_strategy='steps',
    save_strategy='steps',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    report_to='tensorboard' if config['logging']['logger'] == 'tensorboard' else 'none',  # 'none' disables all reporting integrations
    # Gradient clipping for stability
    max_grad_norm=1.0,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("✅ Trainer setup complete!")
print(f"   Effective batch size: {config['training']['per_device_train_batch_size'] * config['training']['gradient_accumulation_steps']}")
print(f"   Total training steps: {len(train_dataset) // (config['training']['per_device_train_batch_size'] * config['training']['gradient_accumulation_steps']) * config['training']['num_train_epochs']}")
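
The step count printed above is an approximation. It is worth comparing it with `save_steps` and `eval_steps` (both 500 here), since a short run may never reach the first checkpoint. The sketch below is one way to estimate optimizer updates on a single device. The helper name and the 10,000-example dataset are hypothetical, and the `Trainer`'s own count may differ slightly between library versions.

```python
import math

def estimate_optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer updates on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

# Hypothetical dataset of 10,000 examples with the T4 defaults
steps = estimate_optimizer_steps(10_000, batch_size=8, grad_accum=4, epochs=3)
print(f"{steps} updates, {steps // 500} intermediate checkpoint(s) at save_steps=500")
```

If the estimate is below 500, lowering `save_steps` and `eval_steps` in the config gives intermediate checkpoints and evaluation points.
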

In [None]:
# Start training!
print("🚀 Starting training...")
print(f"   Model: {config['model']['architecture']}")
print(f"   GPU: {GPU_TYPE if 'GPU_TYPE' in globals() else 'Unknown'}")
print(f"   Checkpoints: {config['training']['output_dir']}")
print(f"   TensorBoard: {config['logging']['logging_dir']}")
print(f"   Epochs: {config['training']['num_train_epochs']}")
print("\n💡 With Colab Pro, you have up to 24 hours - training should complete comfortably!")
print("\n" + "="*60)

trainer.train()

print("\n" + "="*60)
print("\n✅ Training complete!")

In [None]:
# Save final model
final_model_dir = os.path.join(config['training']['output_dir'], 'final_model')
trainer.save_model(final_model_dir)
tokenizer.save_pretrained(final_model_dir)

print(f"✅ Final model saved to: {final_model_dir}")
print("💡 Model is saved to Google Drive and will persist after session ends!")

## 6. Monitor Training (TensorBoard)

In [None]:
# Load TensorBoard extension
%load_ext tensorboard

# Start TensorBoard
%tensorboard --logdir {config['logging']['logging_dir']} --port 6006

## 7. Download Checkpoints (Optional)

In [None]:
# If checkpoints are saved to Colab storage (not Drive), download them
# This creates a zip file you can download

import shutil

checkpoint_dir = config['training']['output_dir']
if os.path.exists(checkpoint_dir) and not checkpoint_dir.startswith('/content/drive'):
    zip_path = '/content/hamletmachine_checkpoints.zip'
    shutil.make_archive(
        zip_path.replace('.zip', ''),
        'zip',
        checkpoint_dir
    )
    print(f"✅ Checkpoints zipped: {zip_path}")
    print("Download from: Files → hamletmachine_checkpoints.zip")
else:
    print("✅ Checkpoints are saved to Google Drive - no download needed!")
    print(f"   Access them at: {checkpoint_dir}")

## 8. Evaluate and Test the Model

In [None]:
# Load the trained model for evaluation
# Option 1: Use the model already in memory (if you just trained)
if 'model' in globals() and 'tokenizer' in globals():
    print("✅ Using model already loaded in memory")
    eval_model = model
    eval_tokenizer = tokenizer
else:
    # Option 2: Load from saved checkpoint
    # The model is saved in the 'final_model' subdirectory
    final_model_dir = os.path.join(config['training']['output_dir'], 'final_model')
    
    # Alternative: Load from a specific checkpoint (e.g., checkpoint-500)
    # final_model_dir = os.path.join(config['training']['output_dir'], 'checkpoint-500')
    
    if not os.path.exists(final_model_dir):
        print(f"⚠️  Model directory not found: {final_model_dir}")
        print("   Available checkpoints:")
        checkpoint_dir = config['training']['output_dir']
        if os.path.exists(checkpoint_dir):
            for item in os.listdir(checkpoint_dir):
                item_path = os.path.join(checkpoint_dir, item)
                if os.path.isdir(item_path):
                    print(f"     - {item}")
    else:
        print(f"📥 Loading model from: {final_model_dir}")
        eval_tokenizer = AutoTokenizer.from_pretrained(final_model_dir)
        eval_model = AutoModelForCausalLM.from_pretrained(final_model_dir)
        
        # Move to GPU if available
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        eval_model = eval_model.to(device)
        eval_model.eval()  # Set to evaluation mode
        
        print("✅ Model loaded successfully!")
        print(f"   Device: {device}")
        print(f"   Parameters: {sum(p.numel() for p in eval_model.parameters()) / 1e6:.2f}M")

In [None]:
# Quantitative Evaluation: Calculate metrics on validation and test sets
import math

print("📊 Running quantitative evaluation...")

# Evaluate on validation set (if available)
if 'val_dataset' in globals() and val_dataset is not None:
    print("\n🔍 Evaluating on validation set...")
    val_metrics = trainer.evaluate(eval_dataset=val_dataset)
    val_loss = val_metrics.get('eval_loss')

    if isinstance(val_loss, (int, float)):
        # Perplexity is exp(mean cross-entropy loss)
        print(f"   Validation Loss: {val_loss:.4f}")
        print(f"   Validation Perplexity: {math.exp(val_loss):.4f}")
    else:
        print("   Validation Loss: N/A")
else:
    print("⚠️  Validation dataset not available in memory")

# Evaluate on test set
print("\n🔍 Evaluating on test set...")
test_file = config['data'].get('test_file', '/content/hamletmachine/data/processed/test.jsonl')

if os.path.exists(test_file):
    # Load and tokenize test set
    test_ds = load_dataset('json', data_files=test_file, split='train')
    
    def tokenize_function(examples):
        return eval_tokenizer(
            examples['text'],
            truncation=True,
            max_length=config['data']['max_seq_length'],
            padding='max_length'
        )
    
    test_dataset = test_ds.map(
        tokenize_function,
        batched=True,
        remove_columns=test_ds.column_names
    )
    
    # Create a temporary trainer for evaluation
    from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

    eval_args = TrainingArguments(
        output_dir='/tmp/eval',
        per_device_eval_batch_size=config['training']['per_device_eval_batch_size'],
        fp16=False,
        bf16=False,
        report_to='none',  # no logging integrations needed for a one-off evaluation
    )

    eval_trainer = Trainer(
        model=eval_model,
        args=eval_args,
        eval_dataset=test_dataset,
        # The collator builds `labels` from `input_ids`; without it no eval_loss is returned
        data_collator=DataCollatorForLanguageModeling(tokenizer=eval_tokenizer, mlm=False),
    )

    test_metrics = eval_trainer.evaluate()
    test_loss = test_metrics.get('eval_loss')

    if isinstance(test_loss, (int, float)):
        # Perplexity is exp(mean cross-entropy loss)
        print(f"   Test Loss: {test_loss:.4f}")
        print(f"   Test Perplexity: {math.exp(test_loss):.4f}")
    else:
        print("   Test Loss: N/A")
else:
    print(f"⚠️  Test file not found: {test_file}")
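
Perplexity in this cell is the exponential of the mean per-token cross-entropy loss (in nats) reported as `eval_loss`. The sketch below shows the conversion with guards for a missing metric or a loss large enough to overflow. The helper name `perplexity_from_loss` is hypothetical.

```python
import math

def perplexity_from_loss(loss):
    """Convert mean cross-entropy loss (nats per token) to perplexity."""
    if not isinstance(loss, (int, float)):
        return None  # metric missing
    try:
        return math.exp(loss)
    except OverflowError:
        return float("inf")  # loss too large to exponentiate

print(perplexity_from_loss(3.0))
```

A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens at each step, so lower is better.
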

In [None]:
# Qualitative Testing: Generate text samples
import torch

def generate_text(prompt, max_new_tokens=150, temperature=0.8, top_p=0.95, top_k=50):
    """Generate text from a prompt."""
    # Tokenize input
    inputs = eval_tokenizer(prompt, return_tensors="pt").to(eval_model.device)
    
    # Generate
    with torch.no_grad():
        output_ids = eval_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            pad_token_id=eval_tokenizer.eos_token_id,
            eos_token_id=eval_tokenizer.eos_token_id,
        )
    
    # Decode output
    generated_text = eval_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    
    # Return only the newly generated part (remove the prompt)
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

print("🎭 Testing text generation...")
print("="*60)

# Test prompts related to your training data
test_prompts = [
    "HAMLETMASCHINE:\n\n",
    "Heiner Müller writes about Hamletmachine as",
    "The critique of political economy begins with",
    "In the beginning was the word, and the word was",
]

for i, prompt in enumerate(test_prompts, 1):
    print(f"\n📝 Prompt {i}: {repr(prompt)}")
    print("-" * 60)
    generated = generate_text(prompt, max_new_tokens=200, temperature=0.7)
    print(generated)
    print("=" * 60)

print("\n✅ Text generation testing complete!")
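
The `top_p` argument passed to `generate` controls nucleus sampling: only the most likely tokens whose probabilities add up to about `top_p` remain candidates. The following is a simplified plain-Python sketch of that idea, not the library's implementation. The function name `nucleus_filter` and the toy probabilities are hypothetical.

```python
def nucleus_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution over four tokens
print(nucleus_filter(toy_probs, top_p=0.9))
```

With `top_p=0.9`, the least likely token (index 3) is dropped and the remaining three are rescaled. A lower `top_p` prunes more of the tail.
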

In [None]:
# Interactive text generation - customize your prompts here

# Try your own prompts!
custom_prompt = "HAMLETMASCHINE:\n\n"  # Change this to test different prompts

print(f"📝 Generating text from prompt: {repr(custom_prompt)}")
print("-" * 60)

generated = generate_text(
    custom_prompt,
    max_new_tokens=300,  # Adjust length
    temperature=0.8,      # Lower = more focused, Higher = more creative
    top_p=0.95,          # Nucleus sampling
    top_k=50            # Top-k sampling
)

print(generated)
print("-" * 60)
print("\n💡 Tips:")
print("   - Lower temperature (0.5-0.7) = more conservative, focused text")
print("   - Higher temperature (0.8-1.2) = more creative, diverse text")
print("   - Adjust max_new_tokens to control output length")
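
The temperature tips above come from how sampling rescales the model's scores before turning them into probabilities. The sketch below applies temperature to a toy set of logits in plain Python. The logits and the function name `softmax_with_temperature` are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
for t in (0.5, 0.8, 1.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures the top token takes a larger share of the probability mass, which is why the output becomes more focused. Higher temperatures flatten the distribution and give less likely tokens more of a chance.
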