# 01 - Environment Setup and Model Verification

This notebook sets up the environment for fine-tuning Llama 3.2 1B on CPU.

## What we'll do:
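1. Install required packages
2. Verify hardware and PyTorch configuration
3. Authenticate with Hugging Face to access Llama 3.2 1B
4. Download and load the model with 8-bit quantization
5. Run a simple generation test
6. Verify the LoRA configuration
7. Save a setup summary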

## 1. Install Required Packages

**Note:** Run this once, then restart the kernel before continuing.

In [None]:
# Install packages from requirements.txt into the running kernel's environment
%pip install -q -r ../requirements.txt

print("✓ Installation complete! Please restart the kernel before continuing.")

## 2. Verify Environment

In [None]:
import torch
import transformers
import peft
import trl
import platform
import sys

print("=" * 60)
print("ENVIRONMENT INFORMATION")
print("=" * 60)

# Python and System Info
print(f"Python Version: {sys.version.split()[0]}")
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Processor: {platform.processor()}")
print()

# Package Versions
print("PACKAGE VERSIONS:")
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"PEFT: {peft.__version__}")
print(f"TRL: {trl.__version__}")
print()

# Device Information
print("DEVICE INFORMATION:")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
else:
    print("Running on CPU (expected for Intel MacBook)")

# Check MPS (Apple Silicon) - won't be available on Intel Mac
if hasattr(torch.backends, 'mps'):
    print(f"MPS Available: {torch.backends.mps.is_available()}")

print("=" * 60)

## 3. Authenticate with Hugging Face

**Important:** You'll need to:
1. Have a Hugging Face account
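2. Request access to the gated model by accepting the Llama 3.2 license at: https://huggingface.co/meta-llama/Llama-3.2-1B (wait until access is granted before loading the model)
3. Create an access token (read access is sufficient) at: https://huggingface.co/settings/tokens
4. Log in using the token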

In [None]:
from huggingface_hub import login

# Log in to Hugging Face
# You'll be prompted to enter your token
login()
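If an interactive prompt is inconvenient (for example in a headless session), you can log in non-interactively instead. A minimal sketch, assuming your token is stored in an environment variable named `HF_TOKEN`; the helper name `login_from_env` is hypothetical. Recent `huggingface_hub` versions also read `HF_TOKEN` on their own, so this mainly makes the step explicit.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in to Hugging Face using a token from the environment, if one is set.

    Returns True if a login was attempted, False if the variable is unset or empty.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        print(f"{var_name} is not set; use the interactive login() cell instead.")
        return False
    # Imported lazily so the helper can be defined even before huggingface_hub is installed
    from huggingface_hub import login

    login(token=token)
    return True
```

Call `login_from_env()` in place of `login()`; it returns `False` without touching the network when no token is set.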

## 4. Load Model with 8-bit Quantization

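We'll try 8-bit quantization (via `bitsandbytes`) to reduce memory usage. Note that `bitsandbytes` has historically required a CUDA GPU; CPU support depends on the installed `bitsandbytes` and `transformers` versions and may not work on an Intel Mac. If 8-bit loading fails on your machine, load the model without `quantization_config` instead; a 1B model needs roughly 5 GB in float32.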
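As a rough sanity check on why quantization helps, weight memory scales with the number of bits stored per parameter. A back-of-the-envelope sketch, assuming about 1.24 billion parameters for Llama 3.2 1B (compare with the parameter count printed after loading). It ignores activations, the KV cache, and quantization overhead, and `bitsandbytes` typically leaves some layers (such as the embeddings) unquantized, so the reported footprint will be higher than the int8 figure.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~1.24e9 parameters for Llama 3.2 1B (check model.num_parameters())
NUM_PARAMS = 1.24e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```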

In [None]:
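import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import warnings
warnings.filterwarnings('ignore')

model_id = "meta-llama/Llama-3.2-1B"

print("Loading model with 8-bit quantization...")
print("This may take a few minutes on first run (downloading model)...\n")

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # Outlier threshold for LLM.int8() (6.0 is the default)
)

try:
    # Load model in 8-bit (Llama is natively supported, so trust_remote_code is not needed)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
except (ImportError, RuntimeError, ValueError) as err:
    # bitsandbytes may not support this machine (e.g. no CUDA GPU); use full precision
    print(f"8-bit loading failed ({err}).\nFalling back to float32 on CPU...\n")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True,
    )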

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print("✓ Model loaded successfully!")
print(f"\nModel size: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"Total parameters: {model.num_parameters() / 1e6:.0f}M")

## 5. Test Model Generation

Let's verify the model works by generating some text.

In [None]:
test_prompt = """Extract the project estimation from this BRD:

Business Requirements Document
Project: Mobile App Development

The project requires 3 developers for 8 weeks.
Estimated effort: 480 hours
Budget: $48,000

Answer in JSON format:"""

print("Testing model generation...\n")
print("Prompt:")
print("-" * 60)
print(test_prompt)
print("-" * 60)

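# Generate (move inputs to the model's device in case it was placed on a GPU)
inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)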

print("\nGenerated Output:")
print("-" * 60)
print(generated_text)
print("-" * 60)
print("\n✓ Model generation working!")
print("Note: The base model may not produce valid JSON yet.")
print("Fine-tuning is intended to make it return structured data reliably.")
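Because the base model's output may wrap JSON in extra text or contain no valid JSON at all, a small helper can check whether a usable object is present. A minimal sketch using only the standard library's `json.JSONDecoder.raw_decode`; the function name and the sample field names are hypothetical, not part of this project's schema.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object in `text` that parses cleanly, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next opening brace
            start = text.find("{", start + 1)
    return None


sample = 'Sure! {"developers": 3, "weeks": 8, "effort_hours": 480} Done.'
print(extract_first_json_object(sample))
```

For example, `extract_first_json_object(generated_text)` returns `None` when the base model's answer contains no parseable object, which gives a simple pass/fail signal to compare before and after fine-tuning.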

## 6. Verify LoRA Configuration

Test that we can prepare the model for PEFT training.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

print("Testing LoRA configuration...\n")

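# Prepare the model for training: freezes the base weights and enables
# gradient checkpointing. This modifies `model` in place and returns it.
model_for_training = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # Rank of the low-rank update matrices
    lora_alpha=16,  # Adapter output is scaled by lora_alpha / r (= 2 here)
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA (adapters are injected into the same underlying model object)
peft_model = get_peft_model(model_for_training, lora_config)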

print("LoRA Configuration:")
print("-" * 60)
peft_model.print_trainable_parameters()
print("-" * 60)
print("\n✓ LoRA setup successful!")
print("Only a small fraction of parameters (under 1% for this configuration) will be trained.")
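Each adapted linear layer mapping d_in → d_out gains two LoRA matrices, A (r × d_in) and B (d_out × r), so it adds r·(d_in + d_out) trainable parameters. A worked count, assuming the published Llama 3.2 1B shapes (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64, 16 layers); verify these against `model.config` before relying on the number. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(rank: int, shapes: dict[str, tuple[int, int]], num_layers: int) -> int:
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted layer, per block."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed Llama 3.2 1B shapes (verify against model.config):
# hidden_size=2048, intermediate_size=8192, 8 KV heads x head_dim 64 = 512, 16 layers
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 2048, 8192, 512, 16
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

trainable = lora_param_count(rank=8, shapes=shapes, num_layers=LAYERS)
print(f"LoRA parameters: {trainable:,}")  # 5,636,096 under these assumptions
# Assumption: ~1.24e9 base parameters
print(f"Share of base parameters: {trainable / 1.24e9:.2%}")
print(f"Scaling factor lora_alpha / r = {16 / 8}")
```

Under these assumptions the count is 5,636,096, roughly 0.45% of the base weights, which should agree with the trainable figure reported by `print_trainable_parameters()` if the shapes are right.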

## 7. Setup Summary

Let's save a summary of our setup to `configs/setup_info.json`.

In [None]:
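import json
import os
from datetime import datetime

setup_info = {
    "timestamp": datetime.now().isoformat(),
    "model_id": model_id,
    # Record what was actually loaded (attribute set by transformers' bitsandbytes integration)
    "quantization": "8-bit" if getattr(model, "is_loaded_in_8bit", False) else "none",
    "device": str(model.device),
    "pytorch_version": torch.__version__,
    "transformers_version": transformers.__version__,
    "peft_version": peft.__version__,
    # Note: after the LoRA cell, these figures include the injected adapter weights
    "model_size_gb": round(model.get_memory_footprint() / 1e9, 2),
    "total_parameters": model.num_parameters(),
    "lora_config": {
        "rank": lora_config.r,
        "alpha": lora_config.lora_alpha,
        "dropout": lora_config.lora_dropout,
        # PEFT may store target_modules as a set, which json cannot serialize
        "target_modules": sorted(lora_config.target_modules),
    },
}

# Save setup info (create the configs directory if it does not exist yet)
os.makedirs("../configs", exist_ok=True)
with open("../configs/setup_info.json", "w") as f:
    json.dump(setup_info, f, indent=2)

print("Setup Information:")
print("=" * 60)
print(json.dumps(setup_info, indent=2))
print("=" * 60)
print("\n✓ Setup information saved to configs/setup_info.json")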
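In recent PEFT releases, `LoraConfig` stores `target_modules` as a Python set, and `json.dump` rejects sets with a `TypeError`. A minimal sketch of a generic converter (the name `to_json_safe` is hypothetical) that turns sets into sorted lists before saving, assuming the set elements are mutually comparable, such as module-name strings:

```python
import json
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively convert sets and tuples to lists so json.dump accepts them."""
    if isinstance(value, (set, frozenset)):
        # Sorting gives a stable order across runs (set iteration order is not guaranteed)
        return sorted(to_json_safe(v) for v in value)
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, dict):
        return {k: to_json_safe(v) for k, v in value.items()}
    return value


config_like = {"lora_config": {"target_modules": {"v_proj", "q_proj"}, "rank": 8}}
print(json.dumps(to_json_safe(config_like)))
# {"lora_config": {"target_modules": ["q_proj", "v_proj"], "rank": 8}}
```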

## Setup Complete!

### What we've verified:
- ✓ All required packages installed
- ✓ Llama 3.2 1B downloaded and loaded
- ✓ Model loaded (in 8-bit where `bitsandbytes` supports this machine)
- ✓ Model can generate text
- ✓ LoRA configuration tested

### Next Steps:
Move on to `02_data_generation.ipynb` to create synthetic BRD training data.

### Notes:
- The base model may not produce valid JSON yet; that's expected
- Fine-tuning is intended to make structured extraction reliable
- The model is loaded in 8-bit (where `bitsandbytes` supports the hardware) to reduce memory usage
- Training will be slow on CPU but feasible for a 1B model