# 🍳 PantryPilot: Llama 3.2 3B Fine-tuning on Lambda Labs A100

**Optimized for Lambda Labs A100 GPU**

- Expected training time: **45-60 minutes**
- Expected cost: **$0.83 - $1.10**
- GPU: A100 40GB
- Speed: ~0.5-0.8 it/s

## 📋 Prerequisites
1. Lambda Labs account with credits
2. A100 instance running
3. Training data files:
   - `recipes_train_chat.jsonl` (13 MB)
   - `recipes_val_chat.jsonl` (1.7 MB)
4. HuggingFace token with Llama 3.2 access

## 🔧 Setup & Installation

In [None]:
# Install required libraries
!pip install -q transformers==4.57.1 peft==0.18.0 accelerate bitsandbytes datasets tensorboard

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 🔐 HuggingFace Authentication

Replace `YOUR_HUGGINGFACE_TOKEN_HERE` in the cell below with your HuggingFace token. The token's account must already have access to Llama 3.2.

In [None]:
# Login to HuggingFace
from huggingface_hub import login

# Option 1: Direct token login (for automation)
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN_HERE"  # Replace with your token
login(token=HF_TOKEN)

# Option 2: Interactive login (uncomment to use)
# from huggingface_hub import notebook_login
# notebook_login()

print("✅ Logged in to HuggingFace")
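
A token pasted into a notebook cell is easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token was exported as an environment variable before Jupyter was launched. The helper name `read_hf_token` is hypothetical; `HF_TOKEN` is the variable name that `huggingface_hub` itself recognizes.

```python
import os

def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell.")
    return token

# Usage in the notebook (left as a comment so the sketch has no third-party imports):
# from huggingface_hub import login
# login(token=read_hf_token())
```

Failing loudly on a missing token keeps the error close to its cause instead of surfacing later as a gated-model download failure.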

## 📁 Upload Training Data

**Upload files using one of these methods:**

### Method 1: JupyterLab File Upload
1. Click the upload button in JupyterLab file browser
2. Upload `recipes_train_chat.jsonl` and `recipes_val_chat.jsonl`
3. Move them to `data/` folder

### Method 2: SCP from local machine
```bash
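# From your local terminal. Create the target directory first; scp will not create it.
# The notebook reads from data/ relative to its working directory (assumed here to be ~).
ssh ubuntu@<lambda-ip> 'mkdir -p ~/data'
scp recipes_train_chat.jsonl ubuntu@<lambda-ip>:~/data/
scp recipes_val_chat.jsonl ubuntu@<lambda-ip>:~/data/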
```

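### Verify the upload
The cell below creates `data/` if needed and reports any missing files. It does not upload anything itself, so use Method 1 or 2 first.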

In [None]:
# Create data directory
!mkdir -p data

# Check if files exist
import os
train_file = "data/recipes_train_chat.jsonl"
val_file = "data/recipes_val_chat.jsonl"

if os.path.exists(train_file) and os.path.exists(val_file):
    print("✅ Training data found!")
    !ls -lh data/
else:
    print("⚠️ Please upload training data files to the data/ directory")
    print("Missing files:")
    if not os.path.exists(train_file):
        print(f"  - {train_file}")
    if not os.path.exists(val_file):
        print(f"  - {val_file}")
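
Checking that the files exist does not show that they are usable. The dataset class in the next section expects one JSON object per line with a string `text` field, so a quick pre-flight check can catch a bad export before the model is loaded. This is a sketch under that assumption; `validate_jsonl` is a hypothetical helper, not part of the original notebook.

```python
import json

def validate_jsonl(path: str, required_key: str = "text") -> list[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are tolerated
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            # Each record must be an object carrying the text the dataset will tokenize.
            if not isinstance(record, dict) or not isinstance(record.get(required_key), str):
                problems.append(f"line {line_no}: missing string field '{required_key}'")
    return problems
```

In the notebook this could be called as `validate_jsonl(train_file)` and `validate_jsonl(val_file)`; an empty list means no problems were found.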

## 📊 Dataset Preparation

In [None]:
import json
from typing import List
from torch.utils.data import Dataset

class RecipeDataset(Dataset):
    """Custom dataset for recipe generation using pre-formatted ChatML data."""

    def __init__(self, data_path: str, tokenizer, max_length: int = 1024):
        """Initialize dataset."""
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = self.load_data(data_path)
        print(f"Loaded {len(self.data):,} samples from {data_path}")

    def load_data(self, data_path: str) -> List[dict]:
        """Load ChatML data from JSONL file."""
        data = []
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines instead of failing in json.loads
                    data.append(json.loads(line))
        return data

    def __len__(self):
        """Return dataset length."""
        return len(self.data)

    def __getitem__(self, idx):
        """Get single item."""
        item = self.data[idx]
        prompt = item['text']

        # Tokenize
        encodings = self.tokenizer(
            prompt,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )

        return {
            "input_ids": encodings["input_ids"].squeeze(),
            "attention_mask": encodings["attention_mask"].squeeze(),
            "labels": encodings["input_ids"].squeeze(),
        }

# Preview sample data
with open('data/recipes_train_chat.jsonl', 'r') as f:
    sample = json.loads(f.readline())
    print("\n📝 Sample data:")
    print(f"Text length: {len(sample['text'])} chars")
    print(f"Scenario: {sample.get('scenario', 'N/A')}")
    print("\nFirst 500 chars:")
    print(sample['text'][:500] + "...")
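
One detail of `__getitem__` is worth spelling out. It returns `labels` as a copy of `input_ids`, padding included, but the `DataCollatorForLanguageModeling(mlm=False)` used in Training Setup rebuilds the labels and sets every position equal to the pad token id to `-100`, which the loss ignores. Because this notebook sets the pad token to the EOS token, any genuine EOS token in a sample is masked as well. The sketch below imitates that masking with plain lists; the token ids are made up and `mask_pad_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def mask_pad_labels(input_ids: list[int], pad_token_id: int) -> list[int]:
    """Copy input_ids into labels, replacing every pad position with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Made-up ids: 2 plays the role of both EOS and pad, as in this notebook.
EOS = PAD = 2
sample_ids = [11, 12, 13, EOS, PAD, PAD]
print(mask_pad_labels(sample_ids, PAD))  # [11, 12, 13, -100, -100, -100]
```

The real EOS at position 3 is masked along with the padding, so the model gets no training signal for emitting it. Whether that matters here depends on whether the ChatML samples rely on EOS or only on the literal `<|im_end|>` text to mark the end of a turn.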

## 🤖 Load Model & Apply LoRA

**A100 Optimizations:**
- 4-bit quantization for memory efficiency
- LoRA for parameter-efficient fine-tuning
- Only ~0.28% of parameters trainable
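
The ~0.28% figure can be checked by hand. A LoRA adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below applies that to the four attention projections targeted in this notebook. The layer dimensions and the base parameter count are assumptions taken from the published Llama 3.2 3B configuration, not from this notebook.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)

# Assumed Llama 3.2 3B dimensions (not stated in this notebook).
HIDDEN = 3072
KV_DIM = 1024                # 8 key/value heads x head dimension 128
NUM_LAYERS = 28
RANK = 16                    # matches r=16 in the LoRA config
BASE_PARAMS = 3_212_749_824  # assumed base model parameter count

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_param_count(HIDDEN, HIDDEN, RANK)  # o_proj
)
trainable = per_layer * NUM_LAYERS
percent = 100 * trainable / (BASE_PARAMS + trainable)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of all parameters: {percent:.2f}%")
```

Under these assumptions the adapters hold 9,175,040 parameters, about 0.28% of the total. The number printed by `model.print_trainable_parameters()` is the authoritative one for a given run.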

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"

print(f"Loading model: {MODEL_NAME}")
print("This may take 2-3 minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

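# Prepare for training. Pass use_gradient_checkpointing=False explicitly:
# the helper enables gradient checkpointing by default, which would override
# the speed-first "no gradient checkpointing" setting used below.
# If you hit out-of-memory errors, drop the argument to restore checkpointing.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

print("✅ Model loaded successfully!")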

In [None]:
# Apply LoRA configuration
print("Applying LoRA configuration...\n")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # LoRA rank
    lora_alpha=32,           # LoRA alpha (scaling factor)
    lora_dropout=0.05,       # Dropout for regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention layers
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("\n✅ LoRA applied! Only 0.28% of parameters will be trained.")

## 🚀 Training Setup

**A100-Optimized Settings:**
- Batch size: 8 (utilize A100 memory)
- Gradient accumulation: 2 (effective batch = 16)
- No gradient checkpointing (speed priority)
- FP16 mixed precision
- Expected speed: 0.5-0.8 it/s

In [None]:
# Load datasets
print("Loading datasets...\n")

train_dataset = RecipeDataset(
    "data/recipes_train_chat.jsonl",
    tokenizer,
    max_length=1024  # Optimized for speed/quality balance
)

val_dataset = RecipeDataset(
    "data/recipes_val_chat.jsonl",
    tokenizer,
    max_length=1024
)

print("\n📊 Dataset Statistics:")
print(f"Train samples: {len(train_dataset):,}")
print(f"Val samples: {len(val_dataset):,}")

# Calculate training parameters
batch_size = 8
gradient_accumulation_steps = 2
num_epochs = 3
effective_batch_size = batch_size * gradient_accumulation_steps
total_steps = (len(train_dataset) // effective_batch_size) * num_epochs

print("\n⚙️ Training Configuration:")
print(f"Per-device batch size: {batch_size}")
print(f"Gradient accumulation steps: {gradient_accumulation_steps}")
print(f"Effective batch size: {effective_batch_size}")
print(f"Total epochs: {num_epochs}")
print(f"Total training steps: {total_steps:,}")
print("\n⏱️ Estimated time: 45-60 minutes on A100")
print("💰 Estimated cost: $0.83 - $1.10")
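
The `max_length=1024` setting trades speed against truncation, and because the dataset pads every sample to `max_length`, short recipes also cost a full 1024 tokens of compute. One way to check the choice is to look at the distribution of tokenized lengths. The sketch below is illustrative only: `length_report` is a hypothetical helper, and the demo uses a whitespace split as a stand-in for the real tokenizer.

```python
import math

def length_report(texts, count_tokens, max_length=1024):
    """Summarise tokenized lengths, truncation, and padding for a given max_length."""
    lengths = sorted(count_tokens(t) for t in texts)
    if not lengths:
        raise ValueError("texts is empty")
    p95_index = max(0, math.ceil(0.95 * len(lengths)) - 1)
    truncated = sum(1 for n in lengths if n > max_length)
    kept_tokens = sum(min(n, max_length) for n in lengths)
    return {
        "max": lengths[-1],
        "p95": lengths[p95_index],
        "truncated_fraction": truncated / len(lengths),
        "padding_fraction": 1 - kept_tokens / (max_length * len(lengths)),
    }

# Stand-in tokenizer for illustration: whitespace split.
# In the notebook: count_tokens = lambda t: len(tokenizer(t)["input_ids"])
demo = ["one two three", "four five", "six"]
print(length_report(demo, lambda t: len(t.split()), max_length=2))
```

A high `truncated_fraction` suggests raising `max_length`; a high `padding_fraction` suggests lowering it or padding per batch instead of to a fixed length.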

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# A100-optimized training arguments
training_args = TrainingArguments(
    output_dir="./llama3b_recipe_lora",
    
    # Training parameters
    num_train_epochs=3,
    per_device_train_batch_size=8,        # Optimized for A100
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,        # Effective batch = 16
    
    # Optimization
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=50,
    lr_scheduler_type="cosine",
    
    # Speed optimization
    fp16=True,                             # Mixed precision
    gradient_checkpointing=False,          # Disable for speed
    optim="adamw_torch",                   # Fast optimizer
    
    # Logging & evaluation
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    
    # Checkpointing
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # System
    dataloader_num_workers=4,
    remove_unused_columns=False,
    report_to="tensorboard",
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

print("✅ Trainer initialized!")
print("\n🚀 Ready to start training!")
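
The combination of `warmup_steps=50`, `lr_scheduler_type="cosine"` and `learning_rate=2e-4` means the learning rate climbs linearly for the first 50 optimizer steps and then decays along half a cosine wave to zero. The sketch below reproduces that shape in plain Python so the values can be inspected; the function name and the total step count are hypothetical.

```python
import math

def cosine_lr_with_warmup(step: int, total_steps: int,
                          base_lr: float = 2e-4, warmup_steps: int = 50) -> float:
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

TOTAL = 1500  # hypothetical step count; the real value depends on the dataset size
for s in (0, 25, 50, 775, 1500):
    print(s, f"{cosine_lr_with_warmup(s, TOTAL):.2e}")
```

With these numbers the rate is 0 at step 0, peaks at 2e-4 at step 50, is back to half the peak midway through the decay (step 775), and reaches 0 at the final step.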

## 🎯 Start Training

**Expected Performance on A100:**
- Speed: ~0.5-0.8 it/s (iterations per second)
- Time: 45-60 minutes
- Cost: $0.83 - $1.10

- **Progress is logged every 10 steps**
- **Validation runs every 100 steps**
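
The time estimate follows from the step count and the throughput. A rough sketch, assuming one "iteration" in the progress bar is one optimizer step and ignoring the pauses for validation and checkpointing every 100 steps; `estimate_minutes` and the step count used below are hypothetical.

```python
def estimate_minutes(total_steps: int, it_per_s: float) -> float:
    """Wall-clock minutes for total_steps at a constant throughput."""
    if it_per_s <= 0:
        raise ValueError("it_per_s must be positive")
    return total_steps / it_per_s / 60

# Hypothetical step count; substitute the total_steps value printed in Training Setup.
steps = 2000
for speed in (0.5, 0.8):
    print(f"{speed} it/s -> {estimate_minutes(steps, speed):.0f} min")
```

Comparing this figure with the speed shown in the first few hundred steps gives an early signal of whether the run will land inside the 45-60 minute window.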

In [None]:
import time

print("🚀 Starting training...")
print(f"Start time: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

# Train the model
trainer.train()

print("="*60)
print(f"End time: {time.strftime('%Y-%m-%d %H:%M:%S')}")
print("\n✅ Training complete!")

## 💾 Save Model

In [None]:
# Save the fine-tuned model
output_dir = "./llama3b_recipe_lora_final"

print(f"💾 Saving model to {output_dir}...")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print("\n✅ Model saved!")
print("\n📁 Saved files:")
!ls -lh {output_dir}

# Get model size
!du -sh {output_dir}

## 🧪 Test the Model

In [None]:
# Test inference
print("🧪 Testing the fine-tuned model...\n")
model.eval()

test_prompts = [
    {
        "name": "Korean Recipe Test",
        "prompt": """<|im_start|>system
You are a recipe generation AI that creates recipes based on user inventory and preferences.<|im_end|>
<|im_start|>user
I have chicken, rice, onion, and garlic. I want a Korean recipe.<|im_end|>
<|im_start|>assistant
"""
    },
    {
        "name": "Vegan Recipe Test",
        "prompt": """<|im_start|>system
You are a recipe generation AI that creates recipes based on user inventory and preferences.<|im_end|>
<|im_start|>user
I have tofu, broccoli, and soy sauce. I want a vegan recipe.<|im_end|>
<|im_start|>assistant
"""
    }
]

for test in test_prompts:
    print(f"{'='*60}")
    print(f"Test: {test['name']}")
    print(f"{'='*60}\n")
    
    inputs = tokenizer(test['prompt'], return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    
    # Extract just the assistant response
    if "<|im_start|>assistant" in response:
        assistant_response = response.split("<|im_start|>assistant")[-1]
        assistant_response = assistant_response.replace("<|im_end|>", "").strip()
        print(assistant_response)
    else:
        print(response)
    
    print("\n")
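
The extraction step above removes every `<|im_end|>` marker but keeps whatever follows it, so if the model keeps generating past the end of its turn, the extra text is printed too. A hedged alternative is to cut the reply at the first end marker after the assistant tag. `extract_assistant_reply` is a hypothetical helper, and it assumes the prompt contains exactly one assistant tag, as the test prompts here do.

```python
def extract_assistant_reply(decoded: str) -> str:
    """Return the first assistant turn, cut at its first end-of-turn marker."""
    start_tag, end_tag = "<|im_start|>assistant", "<|im_end|>"
    # Everything after the first assistant tag is model output.
    _, found, tail = decoded.partition(start_tag)
    if not found:
        return decoded.strip()  # no assistant tag: fall back to the full text
    # Stop at the first end marker so any generated follow-up turns are dropped.
    reply, _, _ = tail.partition(end_tag)
    return reply.strip()
```

In the test loop this would replace the `split` and `replace` calls with `print(extract_assistant_reply(response))`.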

## 📥 Download Model

Download the fine-tuned LoRA adapter to use locally.

In [None]:
# Zip the model for download
print("📦 Zipping model files...")
!zip -r llama3b_recipe_lora_final.zip llama3b_recipe_lora_final/

print("\n✅ Model zipped!")
!ls -lh llama3b_recipe_lora_final.zip

print("\n📥 Download the file using one of these methods:")
print("1. Right-click on llama3b_recipe_lora_final.zip in JupyterLab file browser → Download")
print("2. Use SCP from your local machine:")
print("   scp ubuntu@<lambda-ip>:~/llama3b_recipe_lora_final.zip .")

## 📊 View Training Metrics

In [None]:
# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir llama3b_recipe_lora/runs

## 🎉 Done!

### Next Steps:
1. Download the model (`llama3b_recipe_lora_final.zip`)
2. Stop the Lambda instance to avoid charges
3. Use the model locally for inference

### Cost Summary:
- Training time: ~45-60 minutes
- A100 rate: $1.10/hour
- Total cost: **$0.83 - $1.10**
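
The cost range is the hourly rate multiplied by the run time. A small sketch of that arithmetic, assuming billing is proportional to instance time and using `Decimal` so that rounding to the cent is exact; `training_cost` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

def training_cost(minutes: int, hourly_rate: str = "1.10") -> Decimal:
    """Cost in dollars for `minutes` of instance time, rounded to the nearest cent."""
    cost = Decimal(hourly_rate) * Decimal(minutes) / Decimal(60)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(training_cost(45), training_cost(60))  # 0.83 1.10
```

The instance is billed for as long as it is running, so setup, data upload and model download time add to these figures.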

### Model Usage:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./llama3b_recipe_lora_final")

# Generate recipe
# ... (use as shown in test section)
```