<a href="https://colab.research.google.com/github/bharathbolla/The-LLM-Cookbook-Practical-Recipes-for-Fine-Tuning-Optimization-and-Deployment/blob/main/Chapter_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Recipe-1. Formatting Instructions with a Prompt Template


In [None]:
pip install datasets

Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading fsspec-2024.12.0-py3-none-any.whl (183 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.12.0 which is incompatible.
torch 2.5.1+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.8.4.1 which is 

In [None]:
# --- Recipe: Formatting the Instructions ---
# Goal: Apply a specific prompt template to structure raw instruction data.
# Method: Uses a simple Python function and demonstrates the Alpaca template.

import json

# --- 1. Sample Raw Data (List of Dictionaries) ---
# Represents data you might load from a file or API
raw_data = [
    {
        "instruction": "Provide a brief summary of the concept of transfer learning in machine learning.",
        "input": "", # No additional input needed
        "output": "Transfer learning is a machine learning technique where a model developed for a task is reused as the starting point for a model on a second, related task. It leverages knowledge gained from the source task to improve performance on the target task, often reducing the need for large amounts of target-specific data."
    },
    {
        "instruction": "Translate the following sentence to Spanish.",
        "input": "Hello, how are you?",
        "output": "Hola, ¿cómo estás?"
    },
    {
        "instruction": "List three common types of renewable energy sources.",
        "input": "",
        "output": "1. Solar Energy\n2. Wind Energy\n3. Hydroelectric Energy"
    }
]

# --- 2. Define Prompt Templates ---
# Using the Alpaca format as an example

# Template for instructions WITH input context
PROMPT_WITH_INPUT_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
""" # Output will be appended here during training/inference

# Template for instructions WITHOUT input context
PROMPT_NO_INPUT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
""" # Output will be appended here during training/inference

# --- 3. Formatting Function ---
def format_instruction_data(example):
    """Applies the appropriate Alpaca prompt template."""
    instruction = example.get("instruction", "")
    input_context = example.get("input", "")
    output = example.get("output", "") # Output is needed for training labels

    if input_context and input_context.strip():
        # Use the template with input
        prompt_start = PROMPT_WITH_INPUT_TEMPLATE.format(
            instruction=instruction,
            input=input_context
        )
    else:
        # Use the template without input
        prompt_start = PROMPT_NO_INPUT_TEMPLATE.format(
            instruction=instruction
        )

    # For training, we concatenate the prompt start and the expected output
    # For inference, we only use prompt_start
    formatted_text_for_training = prompt_start + output

    return {
        "formatted_prompt": prompt_start, # Useful for inference later
        "formatted_training_text": formatted_text_for_training
    }

# --- 4. Apply Formatting ---
print("--- Formatting Raw Data ---")
formatted_data = []
for example in raw_data:
    formatted_example = format_instruction_data(example)
    formatted_data.append(formatted_example)

# --- 5. Display Results ---
for i, item in enumerate(formatted_data):
    print(f"\n--- Example {i+1} ---")
    print(f"Original: {raw_data[i]}")
    print("-" * 20)
    print(f"Formatted Prompt (for Inference):\n{item['formatted_prompt']}")
    print("-" * 20)
    print(f"Formatted Text (for Training):\n{item['formatted_training_text']}")
    print("=" * 40)

# --- Notes ---
# - This formatted_training_text is what you would tokenize for Causal LM fine-tuning.
# - During tokenization for training, identify which tokens belong to the prompt and set
#   their labels to -100, so the loss is computed only on the 'output' tokens.
# - Remember to add the EOS token to the end of formatted_training_text before tokenization.

# --- End of Recipe ---

--- Formatting Raw Data ---

--- Example 1 ---
Original: {'instruction': 'Provide a brief summary of the concept of transfer learning in machine learning.', 'input': '', 'output': 'Transfer learning is a machine learning technique where a model developed for a task is reused as the starting point for a model on a second, related task. It leverages knowledge gained from the source task to improve performance on the target task, often reducing the need for large amounts of target-specific data.'}
--------------------
Formatted Prompt (for Inference):
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Provide a brief summary of the concept of transfer learning in machine learning.

### Response:

--------------------
Formatted Text (for Training):
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Provide a brief summary of the concept of transf
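
The notes at the end of this recipe say that only the output tokens should contribute to the loss and that an EOS token belongs at the end of the training text. The sketch below illustrates that idea with a toy whitespace tokenizer rather than a real one. `toy_tokenize`, `build_labels`, and the `<eos>` string are hypothetical names chosen for illustration; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default
EOS = "<eos>"        # placeholder end-of-sequence marker (assumption for this sketch)


def toy_tokenize(text):
    """Stand-in tokenizer: splits on whitespace. A real tokenizer returns integer ids."""
    return text.split()


def build_labels(prompt, output):
    """Return (tokens, labels) with every prompt position masked by IGNORE_INDEX."""
    prompt_tokens = toy_tokenize(prompt)
    output_tokens = toy_tokenize(output) + [EOS]  # EOS teaches the model where to stop
    tokens = prompt_tokens + output_tokens
    labels = [IGNORE_INDEX] * len(prompt_tokens) + output_tokens
    return tokens, labels


prompt = "### Instruction:\nTranslate the following sentence to Spanish.\n\n### Response:\n"
tokens, labels = build_labels(prompt, "Hola, ¿cómo estás?")

# Only the response tokens (plus EOS) remain as training targets.
supervised = [lab for lab in labels if lab != IGNORE_INDEX]
print(supervised)
```

With a real tokenizer the same pattern applies to integer ids: the prompt is tokenized on its own (without padding) to find how many leading positions to mask.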

## Recipe-2. The IFT Training Loop with Loss Masking

In [None]:
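import os
from huggingface_hub import HfApi, login

# Read the access token from the environment rather than hardcoding it in the notebook.
# HF_TOKEN must be set in the environment before this cell runs.
hf_token = os.environ["HF_TOKEN"]

api = HfApi()
whoami = api.whoami(token=hf_token)
print(whoami)
login(token=hf_token)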

{'type': 'user', 'id': '65feba1b57cc48d9d30d11cf', 'name': 'kalpasubbaiah', 'fullname': 'Kalpa Subbaiah', 'email': 'kalpa.subbaiah@gmail.com', 'emailVerified': True, 'canPay': False, 'periodEnd': None, 'isPro': False, 'avatarUrl': '/avatars/319094e0eb55ce89334d7bd3685ceeb0.svg', 'orgs': [], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'hugging_face_token_read', 'role': 'read', 'createdAt': '2025-04-22T09:03:46.223Z'}}}


In [None]:
# --- Recipe: The IFT Training Loop ---
# Goal: Fine-tune a base Causal LM on a formatted instruction dataset using Trainer.
# Method: Includes prompt formatting and loss masking for instruction tokens.
# Libraries: transformers, datasets, torch, accelerate
# Note: Requires significant VRAM, especially for models like Gemma-2B.
#       Uses a tiny dummy dataset for demonstration. Replace with a real dataset.

import torch
from datasets import Dataset # To create a dummy dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
import copy

# --- Configuration ---
MODEL_CHECKPOINT = "distilgpt2" # Use small model for demo; Replace with e.g., "google/gemma-2b" for better results
# DATASET_NAME = "databricks/databricks-dolly-15k" # Example real dataset (requires loading)
NUM_EPOCHS = 1
BATCH_SIZE = 2 # Keep very small for demo
GRADIENT_ACCUMULATION_STEPS = 2 # Simulate batch size 4
LEARNING_RATE = 2e-5
OUTPUT_DIR = "./instruction_finetune_output"
MAX_LENGTH = 256 # Max sequence length for tokenization

# --- 1. Load Tokenizer & Prepare Prompt Templates (Alpaca Style) ---
print(f"Loading tokenizer for checkpoint: {MODEL_CHECKPOINT}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        print(f"Set PAD token to EOS token: {tokenizer.pad_token}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    exit()

PROMPT_WITH_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# --- 2. Create/Load and Format Dataset ---
# Using a tiny dummy dataset for demonstration
dummy_data = [
    {"instruction": "Say hello.", "input": "", "output": "Hello!"},
    {"instruction": "Add two numbers.", "input": "5 and 3", "output": "5 + 3 = 8"},
    {"instruction": "Describe the sun.", "input": "", "output": "The sun is a star at the center of our solar system."},
    {"instruction": "Translate to German.", "input": "Thank you", "output": "Danke schön"},
]
# Convert to Hugging Face Dataset object
raw_dataset = Dataset.from_list(dummy_data)
print("Dummy Dataset:")
print(raw_dataset)

def format_and_tokenize(example):
    """Formats, tokenizes, and prepares labels with masking for IFT."""
    instruction = example.get("instruction", "")
    input_context = example.get("input", "")
    output = example.get("output", "")

    # Choose prompt template
    if input_context and input_context.strip():
        prompt_start = PROMPT_WITH_INPUT_TEMPLATE.format(instruction=instruction, input=input_context)
    else:
        prompt_start = PROMPT_NO_INPUT_TEMPLATE.format(instruction=instruction)

    # Concatenate prompt and output, add EOS token
    full_text = prompt_start + output + tokenizer.eos_token

    # Tokenize the full text
    tokenized_full = tokenizer(full_text, truncation=True, padding="max_length", max_length=MAX_LENGTH)

    # Tokenize the prompt part *only*, without padding, to find its true token length
    # (padding to max_length here would make every prompt look MAX_LENGTH tokens long)
    tokenized_prompt = tokenizer(prompt_start, truncation=True, max_length=MAX_LENGTH)
    prompt_length = len(tokenized_prompt["input_ids"])

    # Create labels - initially same as input_ids
    labels = copy.deepcopy(tokenized_full["input_ids"])

    # --- Crucial Step: Mask prompt tokens in labels ---
    # Set label IDs for prompt tokens to -100 so they are ignored in loss calculation
    for i in range(min(prompt_length, len(labels))):
        labels[i] = -100

    # Also mask padding positions. PAD shares its id with EOS here, so rely on the
    # attention mask; the real EOS after the output keeps its label.
    for i, attended in enumerate(tokenized_full["attention_mask"]):
        if attended == 0:
            labels[i] = -100

    # Attach labels; input_ids and attention_mask are already in tokenized_full
    tokenized_full["labels"] = labels

    # Sanity check (optional): Decode labels, ignoring -100
    # decoded_labels = tokenizer.decode([l for l in labels if l != -100])
    # print(f"Decoded Labels (should match output + EOS): {decoded_labels}")

    return tokenized_full

print("\nFormatting and tokenizing dataset...")
try:
    # Apply formatting and tokenization
    # Remove original columns as they are now part of the tokenized structure
    tokenized_dataset = raw_dataset.map(
        format_and_tokenize,
        remove_columns=raw_dataset.column_names
    )
    tokenized_dataset.set_format("torch")
    # Split (if needed - using full small dataset for train/eval here for demo)
    # train_val_split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
    # final_datasets = {"train": train_val_split["train"], "validation": train_val_split["test"]}
    final_datasets = {"train": tokenized_dataset, "validation": tokenized_dataset} # Use same for demo

    print("Dataset processed. Sample tokenized input and labels:")
    print(f"Input IDs: {final_datasets['train'][0]['input_ids']}")
    print(f"Labels:    {final_datasets['train'][0]['labels']}") # Note the -100 values
except Exception as e:
    print(f"Error processing dataset: {e}")
    exit()

# --- 3. Data Collator ---
# The labels built in preprocessing must reach the model unchanged.
# DataCollatorForLanguageModeling(mlm=False) rebuilds labels from input_ids, which would
# discard the prompt masking, so use default_data_collator (it only stacks the features).
from transformers import default_data_collator
data_collator = default_data_collator

# --- 4. Load Model ---
print(f"\nLoading base Causal LM: {MODEL_CHECKPOINT}")
try:
    model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
    # Ensure model pad token id is set if tokenizer's was changed
    if model.config.pad_token_id is None:
         model.config.pad_token_id = tokenizer.pad_token_id
         print(f"Set model.config.pad_token_id to: {model.config.pad_token_id}")
    print(f"Model loaded. Initial device: {model.device}")
except Exception as e:
    print(f"Error loading model: {e}")
    exit()

# --- 5. Training Arguments ---
print("\nDefining Training Arguments...")
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=1, # Log frequently for small dataset
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    fp16=torch.cuda.is_available(),
    push_to_hub=False,
    report_to="none"
)
print(f"Training on device: {training_args.device}")

# --- 6. Initialize Trainer ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=final_datasets["train"],
    eval_dataset=final_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    # No compute_metrics needed unless calculating perplexity explicitly
)

# --- 7. Train ---
print("\nStarting instruction fine-tuning...")
try:
    train_result = trainer.train()
    print("Training finished.")
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    trainer.save_model(OUTPUT_DIR) # Save final best model
    print(f"Model and training state saved to {OUTPUT_DIR}")
except Exception as e:
    print(f"Error during training: {e}")
    exit()

# --- 8. Evaluate (Optional - reports loss by default) ---
print("\nEvaluating final model...")
try:
    eval_results = trainer.evaluate()
    print("Evaluation Results (Loss):")
    print(eval_results)
    # Calculate perplexity if desired
    # perplexity = math.exp(eval_results['eval_loss'])
    # print(f"Perplexity: {perplexity:.2f}")
except Exception as e:
    print(f"Error during evaluation: {e}")

# --- End of Recipe ---


Loading tokenizer for checkpoint: distilgpt2
Set PAD token to EOS token: <|endoftext|>
Dummy Dataset:
Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 4
})

Formatting and tokenizing dataset...


Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Dataset processed. Sample tokenized input and labels:
Input IDs: tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 25515, 23748,    13,   198,   198, 21017,
        18261,    25,   198, 15496,     0, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.2038        | 3.903072        |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Training finished.
***** train metrics *****
  epoch                    =        1.0
  total_flos               =      243GF
  train_loss               =     2.2038
  train_runtime            = 0:00:02.42
  train_samples_per_second =      1.651
  train_steps_per_second   =      0.413
Model and training state saved to ./instruction_finetune_output

Evaluating final model...




Evaluation Results (Loss):
{'eval_loss': 3.9030723571777344, 'eval_runtime': 0.0921, 'eval_samples_per_second': 43.441, 'eval_steps_per_second': 10.86, 'epoch': 1.0}


Device set to use cuda:0


Loading fine-tuned IFT model from: ./instruction_finetune_output
IFT Model and Tokenizer loaded.

--- Testing IFT Model with Unseen Instructions ---

--- Test Case 1 ---
Instruction: What is the capital of France?
Formatted Prompt (Input to Model):
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the capital of France?

### Response:

--------------------
Generated Response:
How do we know that we have a capital?
### Response:
What is the capital of France?
### Response:
How do we know that we have a capital?
### Response:
How do we know that we have a capital?
### Response:
What is the capital of France?
### Response:
What is the capital of France?
### Response:
What is the capital of France?
### Response:
How do we know that we have

--- Test Case 2 ---
Instruction: Write a short story about a friendly robot.
Formatted Prompt (Input to Model):
Below is an instruction that describes a task. Write a resp
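
Loss masking means that positions labelled `-100` are left out of the average, and the commented-out perplexity lines in the evaluation step rely on perplexity being the exponential of that mean loss. The following is a minimal plain-Python sketch of both ideas, not the Trainer's actual implementation; `masked_mean_nll` is a hypothetical helper and the probabilities are made-up values.

```python
import math

IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def masked_mean_nll(target_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE_INDEX.

    target_probs[i] is the probability the model assigned to the correct token
    at position i.
    """
    losses = [
        -math.log(p)
        for p, lab in zip(target_probs, labels)
        if lab != IGNORE_INDEX
    ]
    if not losses:
        raise ValueError("every position is masked; nothing to average")
    return sum(losses) / len(losses)


# Made-up example: 3 prompt positions (masked) followed by 2 response positions.
probs = [0.9, 0.8, 0.7, 0.5, 0.25]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 42, 7]

loss = masked_mean_nll(probs, labels)
perplexity = math.exp(loss)
print(f"loss={loss:.4f} perplexity={perplexity:.4f}")
```

Applying the same `math.exp` step to the `eval_loss` of 3.9031 reported above gives a perplexity of roughly 49.6.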

## Recipe-3: Qualitative Evaluation of IFT Model

In [None]:
# Recipe: Qualitative Evaluation of IFT Model
# --- Recipe: Checking for Understanding (Qualitative Evaluation) ---
# Goal: Test the instruction-following capabilities of the fine-tuned model.
# Method: Load the IFT model and generate responses for unseen instructions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# --- Configuration ---
# Path to the saved instruction-fine-tuned model
IFT_MODEL_PATH = "./instruction_finetune_output" # From ch7_recipe_ift_trainer

# Base model checkpoint (needed to load tokenizer if not saved with model)
# Or load tokenizer from IFT_MODEL_PATH if it was saved there
TOKENIZER_CHECKPOINT = "distilgpt2" # Must match the base model used for IFT

# Use GPU if available
device_index = 0 if torch.cuda.is_available() else -1

# --- 1. Load Fine-Tuned Model and Tokenizer ---
print(f"Loading fine-tuned IFT model from: {IFT_MODEL_PATH}")
try:
    model = AutoModelForCausalLM.from_pretrained(IFT_MODEL_PATH)
    # Try loading tokenizer from the fine-tuned path first, fallback to base
    try:
        tokenizer = AutoTokenizer.from_pretrained(IFT_MODEL_PATH)
    except OSError:
        print(f"Tokenizer not found in {IFT_MODEL_PATH}, loading from {TOKENIZER_CHECKPOINT}")
        tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_CHECKPOINT)

    # Ensure PAD token is set correctly (important for generation)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        print(f"Set PAD token to EOS token: {tokenizer.pad_token}")
    # Ensure model config has pad token id
    if model.config.pad_token_id is None:
         model.config.pad_token_id = tokenizer.pad_token_id

    print("IFT Model and Tokenizer loaded.")
except Exception as e:
    print(f"Error loading model/tokenizer: {e}")
    print("Ensure the IFT training recipe ran successfully and saved the model.")
    exit()

# --- 2. Create Inference Pipeline ---
# Using pipeline simplifies generation
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device_index
)

# --- 3. Define Prompt Formatting Function (Must match training template!) ---
# Re-use the templates from the formatting recipe
PROMPT_WITH_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_inference_prompt(instruction, input_context=""):
    """Formats the prompt for inference using the Alpaca template."""
    if input_context and input_context.strip():
        return PROMPT_WITH_INPUT_TEMPLATE.format(instruction=instruction, input=input_context)
    else:
        return PROMPT_NO_INPUT_TEMPLATE.format(instruction=instruction)

# --- 4. Test with Unseen Instructions ---
test_instructions = [
    {"instruction": "What is the capital of France?"},
    {"instruction": "Write a short story about a friendly robot.", "input": ""},
    {"instruction": "Convert the following temperature from Celsius to Fahrenheit.", "input": "25°C"},
    {"instruction": "List the planets in our solar system."},
    {"instruction": "Generate a python function to calculate factorial"} # Task likely unseen in dummy data
]

print("\n--- Testing IFT Model with Unseen Instructions ---")
for i, test_case in enumerate(test_instructions):
    instruction = test_case["instruction"]
    input_context = test_case.get("input", "")

    # Format the prompt exactly as done during training (excluding the output)
    prompt = format_inference_prompt(instruction, input_context)

    print(f"\n--- Test Case {i+1} ---")
    print(f"Instruction: {instruction}")
    if input_context: print(f"Input: {input_context}")
    print(f"Formatted Prompt (Input to Model):\n{prompt}")
    print("-" * 20)

    try:
        # Generate response
        # Adjust generation parameters as needed
        outputs = generator(
            prompt,
            max_new_tokens=100, # Limit generated length
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id # Often needed for Causal LMs
        )
        # Extract only the generated part (the response)
        # The output includes the prompt, so we split/slice based on prompt length
        response = outputs[0]['generated_text'][len(prompt):].strip()

        print(f"Generated Response:\n{response}")

    except Exception as e:
        print(f"Error during generation for this test case: {e}")

    print("=" * 40)

# --- 5. Qualitative Assessment ---
# Review the generated responses. Does the model:
# - Understand the instruction?
# - Provide a relevant and coherent answer?
# - Follow formatting requests (if any were given)?
# - Avoid simply repeating the prompt?
# - Handle tasks it likely didn't see in the (dummy) training data?
#
# Note: With the dummy dataset and small model used in the training recipe,
# the results here will likely be poor. Using a larger base model and a real
# instruction dataset (like Dolly) would yield much better instruction following.

# --- End of Recipe ---
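
The recipe slices the prompt off the generated text by length. A weakly tuned model may then keep emitting further `### Response:` blocks, as seen in the Recipe-2 output. One possible post-processing step, sketched here under the assumption that the Alpaca section headers are reasonable stop markers, is to cut the continuation at the first such header. `extract_response` and its `stop_markers` default are hypothetical, not part of the original recipe.

```python
def extract_response(generated_text, prompt,
                     stop_markers=("### Instruction:", "### Input:", "### Response:")):
    """Strip the prompt prefix, then cut the continuation at the earliest stop marker."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        # Fall back to the full text if the prompt is not an exact prefix
        continuation = generated_text

    cut = len(continuation)
    for marker in stop_markers:
        idx = continuation.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut].strip()


prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
generated = (
    prompt
    + "How do we know that we have a capital?\n"
    + "### Response:\nWhat is the capital of France?"
)
print(extract_response(generated, prompt))
```

This only tidies the displayed text; it does not make the model stop earlier, which depends on the EOS token being learned during training.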


## Recipe-4: Comparing Self-Instruct vs Human-Curated


In [None]:
# --- Recipe: Comparing IFT Performance: Self-Instruct vs. Human-Curated ---
# Goal: Fine-tune the same base model on two dataset types (Alpaca-style vs Dolly-style)
#       and compare their performance qualitatively and via validation loss.
# Method: Uses subsets of yahma/alpaca-cleaned and databricks/databricks-dolly-15k.
# Libraries: transformers, datasets, torch, accelerate
# Note: Uses distilgpt2 for speed; results differ greatly with larger models.
#       Requires sufficient disk space for datasets and model checkpoints.

import torch
import copy
import numpy as np
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
import os
import random

# --- Configuration ---
MODEL_CHECKPOINT = "distilgpt2" # Small model for faster demo
# Dataset 1: Self-Instruct style (Alpaca cleaned subset)
DATASET_1_NAME = "yahma/alpaca-cleaned"
OUTPUT_DIR_1 = "./ift_alpaca_output"
# Dataset 2: Human-Generated style (Dolly subset)
DATASET_2_NAME = "databricks/databricks-dolly-15k"
OUTPUT_DIR_2 = "./ift_dolly_output"

# Training Params (keep consistent for comparison)
NUM_EPOCHS = 1
BATCH_SIZE = 2 # Keep small for demo
GRADIENT_ACCUMULATION_STEPS = 4 # Effective batch size 8
LEARNING_RATE = 2e-5
MAX_LENGTH = 256 # Max sequence length
NUM_SAMPLES_PER_DATASET = 500 # Use small subsets for faster demo run

# --- 1. Load Tokenizer & Define Prompt Template ---
# Use the same tokenizer and template for both fine-tuning runs
print(f"Loading tokenizer for checkpoint: {MODEL_CHECKPOINT}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        print(f"Set PAD token to EOS token: {tokenizer.pad_token}")
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    exit()

# Using Alpaca-style template for consistency
PROMPT_WITH_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# --- 2. Data Loading and Preprocessing Function ---
def format_and_tokenize(example, dataset_type):
    """Formats (Alpaca style), tokenizes, and prepares labels with masking."""
    # Adapt field names based on dataset
    if dataset_type == 'alpaca':
        instruction = example.get("instruction", "")
        input_context = example.get("input", "")
        output = example.get("output", "")
    elif dataset_type == 'dolly':
        instruction = example.get("instruction", "")
        input_context = example.get("context", "") # Dolly uses 'context'
        output = example.get("response", "")      # Dolly uses 'response'
    else:
        raise ValueError("Unknown dataset_type")

    # Choose prompt template
    if input_context and input_context.strip():
        prompt_start = PROMPT_WITH_INPUT_TEMPLATE.format(instruction=instruction, input=input_context)
    else:
        prompt_start = PROMPT_NO_INPUT_TEMPLATE.format(instruction=instruction)

    # Concatenate prompt and output, add EOS token
    full_text = prompt_start + output + tokenizer.eos_token

    # Tokenize the full text (padded to MAX_LENGTH so every example has the same shape)
    tokenized_full = tokenizer(full_text, truncation=True, padding="max_length", max_length=MAX_LENGTH)

    # Tokenize the prompt part *only*, without padding, to find its true token length.
    # (Padding here would make prompt_length equal MAX_LENGTH and mask every label.)
    tokenized_prompt = tokenizer(prompt_start, truncation=True, max_length=MAX_LENGTH)
    prompt_length = len(tokenized_prompt["input_ids"])

    # Create labels - initially a copy of input_ids
    labels = list(tokenized_full["input_ids"])

    # --- Crucial Step: Mask prompt tokens and padding in labels ---
    # -100 is the ignore index of the cross-entropy loss, so masked positions add nothing to it.
    for i, attended in enumerate(tokenized_full["attention_mask"]):
        if i < prompt_length or attended == 0:
            labels[i] = -100

    # input_ids and attention_mask are already in tokenized_full; attach the labels
    tokenized_full["labels"] = labels
    return tokenized_full
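
# --- Optional sanity check (illustrative sketch, not part of the original recipe) ---
# The label-masking idea in format_and_tokenize can be checked in isolation on toy
# token IDs, with no tokenizer or model download. `mask_prompt_labels` and the toy
# IDs below are hypothetical names/values used only for this illustration. The sketch
# assumes right-side padding and an attention mask of 1 for real tokens, 0 for padding.
def mask_prompt_labels(input_ids, attention_mask, prompt_length, ignore_index=-100):
    """Return labels that keep only response tokens; prompt and padding get ignore_index."""
    return [
        ignore_index if (i < prompt_length or attended == 0) else token_id
        for i, (token_id, attended) in enumerate(zip(input_ids, attention_mask))
    ]

# Toy example: 3 prompt tokens, 2 response tokens, 1 real EOS token, 2 padding positions.
# The pad ID equals the EOS ID (9), mirroring pad_token = eos_token above.
_toy_input_ids = [11, 12, 13, 21, 22, 9, 9, 9]
_toy_attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
_toy_labels = mask_prompt_labels(_toy_input_ids, _toy_attention_mask, prompt_length=3)
# Only the response tokens and the real EOS should remain as training targets.
assert _toy_labels == [-100, -100, -100, 21, 22, 9, -100, -100]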

# --- 3. Fine-Tune on Dataset 1 (Alpaca - Self-Instruct) ---
print(f"\n--- Processing Dataset 1: {DATASET_1_NAME} ---")
model_1_results = {}
try:
    # Load subset
    raw_dataset_1 = load_dataset(DATASET_1_NAME, split=f"train[:{NUM_SAMPLES_PER_DATASET}]")
    # Filter out examples that might be too long after formatting (optional but good practice)
    # raw_dataset_1 = raw_dataset_1.filter(lambda x: len(x['instruction']) + len(x['input']) + len(x['output']) < 1500)
    tokenized_dataset_1 = raw_dataset_1.map(
        lambda x: format_and_tokenize(x, 'alpaca'),
        remove_columns=raw_dataset_1.column_names
    )
    # Create dummy validation set for demo if needed
    split_ds_1 = tokenized_dataset_1.train_test_split(test_size=0.1, seed=42)
    final_datasets_1 = {"train": split_ds_1["train"], "validation": split_ds_1["test"]}
    print("Dataset 1 processed.")

    # Load Base Model
    model_1 = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
    if model_1.config.pad_token_id is None: model_1.config.pad_token_id = tokenizer.pad_token_id

    # Training Args
    training_args_1 = TrainingArguments(
        output_dir=OUTPUT_DIR_1, num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE, gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE, weight_decay=0.01, eval_strategy="epoch",
        save_strategy="epoch", logging_strategy="steps", logging_steps=10,
        load_best_model_at_end=True, metric_for_best_model="loss",
        fp16=torch.cuda.is_available(), push_to_hub=False, report_to="none"
    )

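    # Data Collator
    # Note: with mlm=False this collator rebuilds `labels` from `input_ids` and masks only
    # positions equal to the pad token ID, so the per-example labels from format_and_tokenize
    # are overwritten at batch time. Training with prompt masking needs a collator that keeps
    # existing labels (e.g. transformers.default_data_collator) together with correct labels.
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)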

    # Trainer
    trainer_1 = Trainer( model=model_1, args=training_args_1,
        train_dataset=final_datasets_1["train"], eval_dataset=final_datasets_1["validation"],
        tokenizer=tokenizer, data_collator=data_collator )

    # Train
    print(f"\nStarting fine-tuning on {DATASET_1_NAME}...")
    train_result_1 = trainer_1.train()
    trainer_1.save_model(OUTPUT_DIR_1)
    print(f"Model 1 (Alpaca) saved to {OUTPUT_DIR_1}")
    # Store eval results
    eval_results_1 = trainer_1.evaluate(eval_dataset=final_datasets_1["validation"])
    model_1_results = {"eval_loss": eval_results_1.get("eval_loss", None)}
    print(f"Model 1 (Alpaca) Validation Results: {model_1_results}")

except Exception as e:
    print(f"Error during Dataset 1 processing or training: {e}")
    # Clean up potentially loaded model
    if 'model_1' in locals(): del model_1
    torch.cuda.empty_cache()


# --- 4. Fine-Tune on Dataset 2 (Dolly - Human-Generated) ---
print(f"\n--- Processing Dataset 2: {DATASET_2_NAME} ---")
model_2_results = {}
# Skip the run if a previous one already wrote to OUTPUT_DIR_2.
# In that case no metrics are loaded, so the comparison below reports 'N/A' for Model 2.
if os.path.exists(OUTPUT_DIR_2):
    print(f"Output directory {OUTPUT_DIR_2} already exists. Skipping training for Model 2 assuming it's done.")
    print("Cannot load previous results in this demo script. Re-run required if comparison needed.")

else:
    try:
        # Load subset
        raw_dataset_2 = load_dataset(DATASET_2_NAME, split=f"train[:{NUM_SAMPLES_PER_DATASET}]")
         # Filter out examples that might be too long after formatting
        # raw_dataset_2 = raw_dataset_2.filter(lambda x: len(x['instruction']) + len(x['context']) + len(x['response']) < 1500)
        tokenized_dataset_2 = raw_dataset_2.map(
            lambda x: format_and_tokenize(x, 'dolly'),
            remove_columns=raw_dataset_2.column_names
        )
        split_ds_2 = tokenized_dataset_2.train_test_split(test_size=0.1, seed=42)
        final_datasets_2 = {"train": split_ds_2["train"], "validation": split_ds_2["test"]}
        print("Dataset 2 processed.")

        # --- IMPORTANT: Reload the BASE model ---
        print(f"Reloading BASE model: {MODEL_CHECKPOINT}")
        model_2 = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
        if model_2.config.pad_token_id is None: model_2.config.pad_token_id = tokenizer.pad_token_id

        # Training Args (use different output dir)
        training_args_2 = TrainingArguments(
            output_dir=OUTPUT_DIR_2, num_train_epochs=NUM_EPOCHS,
            per_device_train_batch_size=BATCH_SIZE, gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
            learning_rate=LEARNING_RATE, weight_decay=0.01, eval_strategy="epoch",
            save_strategy="epoch", logging_strategy="steps", logging_steps=10,
            load_best_model_at_end=True, metric_for_best_model="loss",
            fp16=torch.cuda.is_available(), push_to_hub=False, report_to="none"
        )

        # Data Collator (can reuse)
        data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

        # Trainer
        trainer_2 = Trainer( model=model_2, args=training_args_2,
            train_dataset=final_datasets_2["train"], eval_dataset=final_datasets_2["validation"],
            tokenizer=tokenizer, data_collator=data_collator )

        # Train
        print(f"\nStarting fine-tuning on {DATASET_2_NAME}...")
        train_result_2 = trainer_2.train()
        trainer_2.save_model(OUTPUT_DIR_2)
        print(f"Model 2 (Dolly) saved to {OUTPUT_DIR_2}")
        # Store eval results
        eval_results_2 = trainer_2.evaluate(eval_dataset=final_datasets_2["validation"])
        model_2_results = {"eval_loss": eval_results_2.get("eval_loss", None)}
        print(f"Model 2 (Dolly) Validation Results: {model_2_results}")

    except Exception as e:
        print(f"Error during Dataset 2 processing or training: {e}")
        # Clean up
        if 'model_2' in locals(): del model_2
        torch.cuda.empty_cache()


# --- 5. Comparison ---
print("\n--- Comparison Summary ---")

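# Compare Validation Loss (lower is generally better).
# Caveat: each loss is measured on that dataset's own validation split, so the two
# numbers are not a like-for-like comparison of model quality.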
loss1 = model_1_results.get('eval_loss', 'N/A')
loss2 = model_2_results.get('eval_loss', 'N/A')
print(f"Model 1 (Alpaca) Final Validation Loss: {loss1}")
print(f"Model 2 (Dolly) Final Validation Loss: {loss2}")
if isinstance(loss1, float) and isinstance(loss2, float):
    if loss1 < loss2:
        print("Model 1 (Alpaca) had lower validation loss.")
    elif loss2 < loss1:
        print("Model 2 (Dolly) had lower validation loss.")
    else:
        print("Validation losses were equal.")
else:
    print("Could not compare losses numerically.")
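
# --- Optional: report perplexity (illustrative sketch, not part of the original recipe) ---
# Perplexity is exp(mean cross-entropy loss), which some readers find easier to interpret
# than the raw loss. `loss_to_perplexity` is a hypothetical helper added for illustration.
# The same caveat applies as for the losses: the values come from different validation sets.
import math

def loss_to_perplexity(loss):
    """Return exp(loss) for a float loss, or None when no numeric loss is available."""
    return math.exp(loss) if isinstance(loss, float) else None

# globals().get(...) keeps this snippet runnable even if loss1/loss2 were never defined.
for _name, _loss in (("Model 1 (Alpaca)", globals().get("loss1")),
                     ("Model 2 (Dolly)", globals().get("loss2"))):
    _ppl = loss_to_perplexity(_loss)
    print(f"{_name} validation perplexity: {_ppl:.2f}" if _ppl is not None
          else f"{_name} validation perplexity: N/A")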


# Qualitative Evaluation on Sample Prompts
print("\n--- Qualitative Evaluation ---")
# Define a few diverse test prompts (ensure they weren't in the small training subsets)
test_prompts = [
    {"instruction": "What are the primary colors?"},
    {"instruction": "Write a haiku about a cat."},
    {"instruction": "Explain the concept of recursion in programming."},
]

# Load models for inference if training was successful
model_inf_1 = None
model_inf_2 = None
if os.path.exists(OUTPUT_DIR_1):
    try:
        model_inf_1 = AutoModelForCausalLM.from_pretrained(OUTPUT_DIR_1)
        model_inf_1.to(training_args_1.device if 'training_args_1' in locals() else 'cpu') # Move to device
    except Exception as e: print(f"Failed to load Model 1: {e}")
if os.path.exists(OUTPUT_DIR_2):
    try:
        model_inf_2 = AutoModelForCausalLM.from_pretrained(OUTPUT_DIR_2)
        model_inf_2.to(training_args_2.device if 'training_args_2' in locals() else 'cpu') # Move to device
    except Exception as e: print(f"Failed to load Model 2: {e}")


# Helper function for generation
def generate_response(model, prompt_text):
    if model is None: return "Model not loaded."
    try:
        inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
        # Ensure pad token ID is set for generation
        gen_kwargs = {"max_new_tokens": 75, "pad_token_id": tokenizer.eos_token_id, "do_sample": True, "temperature": 0.7}
        outputs = model.generate(**inputs, **gen_kwargs)
        response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) # Decode only new tokens
        return response.strip()
    except Exception as e:
        return f"Generation Error: {e}"
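
# --- Optional post-processing (illustrative sketch, not part of the original recipe) ---
# A small model may keep generating after its answer and emit further "### ..." section
# headers from the prompt template. One simple mitigation is to cut the decoded text at
# the first such header. `truncate_at_marker` and its default marker are hypothetical
# choices for this demo, not something the recipe itself prescribes.
def truncate_at_marker(text, marker="###"):
    """Return the text before the first occurrence of marker, stripped of whitespace."""
    return text.split(marker, 1)[0].strip()

# Usage (assumption: applied to the string returned by generate_response):
#   answer = truncate_at_marker(generate_response(model_inf_1, formatted_prompt))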

# Format inference prompt (Alpaca style)
def format_inference_prompt(instruction, input_context=""):
    if input_context and input_context.strip():
        return PROMPT_WITH_INPUT_TEMPLATE.format(instruction=instruction, input=input_context)
    else:
        return PROMPT_NO_INPUT_TEMPLATE.format(instruction=instruction)


# Generate and compare
for i, p in enumerate(test_prompts):
    print(f"\n--- Test Prompt {i+1} ---")
    instruction = p['instruction']
    input_ctx = p.get('input', '')
    formatted_prompt = format_inference_prompt(instruction, input_ctx)
    print(f"Instruction: {instruction}")
    if input_ctx: print(f"Input: {input_ctx}")
    print("-" * 20)

    print("Model 1 (Alpaca) Response:")
    print(generate_response(model_inf_1, formatted_prompt))
    print("-" * 20)

    print("Model 2 (Dolly) Response:")
    print(generate_response(model_inf_2, formatted_prompt))
    print("=" * 40)

print("\nCompare the responses qualitatively: Which model seems more helpful, accurate, creative, or better follows the instruction nuances?")
print("Note: Results heavily depend on base model size, dataset size/quality, and training parameters.")

# --- End of Recipe ---


Loading tokenizer for checkpoint: distilgpt2


Set PAD token to EOS token: <|endoftext|>

--- Processing Dataset 1: yahma/alpaca-cleaned ---


Dataset 1 processed.



Starting fine-tuning on yahma/alpaca-cleaned...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
0,1.4327,2.569823


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Model 1 (Alpaca) saved to ./ift_alpaca_output




Model 1 (Alpaca) Validation Results: {'eval_loss': 2.5698230266571045}

--- Processing Dataset 2: databricks/databricks-dolly-15k ---



Dataset 2 processed.
Reloading BASE model: distilgpt2





Starting fine-tuning on databricks/databricks-dolly-15k...


Epoch,Training Loss,Validation Loss
0,1.6086,3.056088


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Model 2 (Dolly) saved to ./ift_dolly_output




Model 2 (Dolly) Validation Results: {'eval_loss': 3.0560877323150635}

--- Comparison Summary ---
Model 1 (Alpaca) Final Validation Loss: 2.5698230266571045
Model 2 (Dolly) Final Validation Loss: 3.0560877323150635
Model 1 (Alpaca) had lower validation loss.

--- Qualitative Evaluation ---

--- Test Prompt 1 ---
Instruction: What are the primary colors?
--------------------
Model 1 (Alpaca) Response:
When the color is red, the color is white. The color is blue.
### Response:
What is the value of a given color?
### Response:
What is the value of a given color?
### Response:
What is the value of a given color?
### Response:
What is the value of a given color?
### Response
--------------------
Model 2 (Dolly) Response:
When will the color be used?
### Response:
What is a word like?
### Response:
What is a word like?
### Response:
What is a word like?
### Response:
What is a word like?
### Response:
What is a word like?
### Response:
What is a word like?

--- Test Prompt 2 ---
Instruction: