# HumanEval Evaluation for LLaMA2-7B FFT Model

This notebook evaluates the Full Fine-Tuned (FFT) LLaMA2-7B model on the HumanEval benchmark.

**Goal:** Match the paper's FFT baseline of **29.3% pass@1**

**Model:** `Chrisfrancisque/llama2-7b-coding-fft`

**Expected Runtime:** ~30-45 minutes on Colab T4 GPU

---

## Setup Instructions

1. **Runtime → Change runtime type → GPU (T4)**
2. Run all cells in order
3. Results will be displayed at the bottom

## 1. Install Dependencies

In [None]:
%%capture
# Install required packages
!pip install evalplus transformers torch accelerate bitsandbytes

## 2. Import Libraries

In [None]:
import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm
from evalplus.data import get_human_eval_plus
import subprocess
from pathlib import Path

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 3. Configuration

In [None]:
# Model configuration
MODEL_NAME = "Chrisfrancisque/llama2-7b-coding-fft"

# Evaluation configuration
NUM_SAMPLES_PER_TASK = 1  # For pass@1
TEMPERATURE = 0.2  # Low temperature for more deterministic outputs
MAX_NEW_TOKENS = 512  # Maximum tokens to generate
TOP_P = 0.95

# Output directory
OUTPUT_DIR = Path("./humaneval_results")
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Model: {MODEL_NAME}")
print(f"Samples per task: {NUM_SAMPLES_PER_TASK}")
print(f"Temperature: {TEMPERATURE}")
print(f"Output directory: {OUTPUT_DIR}")
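
To see why a low temperature gives more deterministic outputs: sampling divides the logits by the temperature before the softmax, so values below 1 concentrate probability on the highest-scoring token. The following is a minimal sketch in plain Python with made-up logits; `softmax_with_temperature` is an illustrative helper and not part of the notebook or of `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

With these logits the top token gets roughly 0.67 of the probability mass at temperature 1.0 and more than 0.99 at temperature 0.2. Sampling is still random, so pass@1 can vary slightly between runs.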

## 4. Load Model and Tokenizer

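This downloads the model weights from Hugging Face. Expect roughly 6.5 GB or more, depending on the precision the checkpoint was saved in (a 7B-parameter model in 16-bit precision is about 13 GB).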

In [None]:
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

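print("Loading model...")
# Note: the T4 (compute capability 7.5) has no native bfloat16 support;
# torch.float16 is a common alternative on that GPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # place layers on the available device(s) automatically
    trust_remote_code=True
)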

model.eval()
print("✓ Model loaded successfully!")
print(f"Model dtype: {model.dtype}")
print(f"Model device: {next(model.parameters()).device}")
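
As a rough check on whether the weights fit on a given GPU, the memory needed for the parameters alone is the parameter count times the bytes per parameter. The sketch below assumes about 6.74 billion parameters for LLaMA2-7B and ignores activations and the KV cache; `weight_memory_gib` is an illustrative helper, not part of the notebook.

```python
# Bytes used to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

LLAMA2_7B_PARAMS = 6.74e9  # approximate parameter count (assumption)

def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB (no activations, no KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(LLAMA2_7B_PARAMS, dtype):5.1f} GiB")
```

In 16-bit precision the weights alone come to roughly 12.6 GiB, which leaves little headroom on a 16 GB T4.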

## 5. Load HumanEval Dataset

In [None]:
print("Loading HumanEval dataset...")
problems = get_human_eval_plus()
print(f"✓ Loaded {len(problems)} problems")

# Show example problem
example_task_id = "HumanEval/0"
example_problem = problems[example_task_id]
print(f"\nExample problem: {example_task_id}")
print("="*60)
print(example_problem["prompt"][:200] + "...")

## 6. Generate Code Completions

This will take ~30-45 minutes for all 164 problems.

In [None]:
def generate_completion(prompt, num_samples=1):
    """Generate `num_samples` code completions for a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1536)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    prompt_length = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
            top_p=TOP_P,
            num_return_sequences=num_samples,
            do_sample=TEMPERATURE > 0,  # sample only when the temperature is positive
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Each output holds the prompt tokens followed by the new tokens;
    # decode only the newly generated part.
    completions = []
    for output in outputs:
        generated_tokens = output[prompt_length:]
        completion = tokenizer.decode(generated_tokens, skip_special_tokens=True)
        completions.append(completion)

    return completions

print("Generating completions...")
print("This will take approximately 30-45 minutes")
print("="*60)

samples = []
for task_id, problem in tqdm(problems.items(), desc="Generating"):
    prompt = problem["prompt"]
    
    for sample_idx in range(NUM_SAMPLES_PER_TASK):
        completions = generate_completion(prompt, num_samples=1)
        
        sample = {
            "task_id": task_id,
            "completion": completions[0]
        }
        samples.append(sample)

print(f"\n✓ Generated {len(samples)} completions")
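
Because generation runs until `MAX_NEW_TOKENS` or an end-of-sequence token, a completion can continue past the target function into extra definitions or test code, which may then fail evaluation. One common remedy is to cut each completion at the first stop sequence. The sketch below is a hypothetical post-processing helper, not part of the original notebook, and the stop strings are an assumed convention that may need adjusting.

```python
# Assumed stop strings: each marks the start of new top-level code
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def truncate_at_stop(completion, stop_sequences=STOP_SEQUENCES):
    """Cut the completion at the earliest stop sequence, if any occurs."""
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Hypothetical raw output: the function body followed by an unrelated function
raw = "    return sorted(xs)\n\ndef unrelated():\n    pass\n"
print(repr(truncate_at_stop(raw)))
```

It could be applied to `completions[0]` before a sample is appended. One trade-off: a solution that legitimately defines a second top-level helper function after the body would also be cut.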

## 7. Save Completions

In [None]:
samples_file = OUTPUT_DIR / "samples.jsonl"
print(f"Saving completions to {samples_file}")

with open(samples_file, 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print("✓ Completions saved")

# Show example completion
print("\nExample completion:")
print("="*60)
print(f"Task: {samples[0]['task_id']}")
print(f"Completion:\n{samples[0]['completion'][:300]}...")
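
Before running the evaluator, it can be worth checking that the file is well formed: one JSON object per line, each with a `task_id` and a `completion`. The following is a minimal sketch using only the standard library; `check_samples` is a hypothetical helper, and the demo writes a tiny temporary file rather than touching the real `samples.jsonl`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"task_id", "completion"}

def check_samples(path):
    """Return the task_ids found in a samples file; raise on malformed lines."""
    task_ids = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            task_ids.append(record["task_id"])
    return task_ids

# Demo on a one-line temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.jsonl"
    demo.write_text(json.dumps({"task_id": "HumanEval/0", "completion": "    pass\n"}) + "\n")
    print(check_samples(demo))
```

For a full run with `NUM_SAMPLES_PER_TASK = 1`, calling `check_samples(samples_file)` should return 164 task IDs.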

## 8. Run evalplus Evaluation

This executes the generated code in separate, time-limited processes and checks it against the benchmark's test cases.

In [None]:
print("Running evalplus evaluation...")
print("This will execute the generated code against the benchmark tests")
print("="*60)

eval_command = [
    "evalplus.evaluate",
    "--dataset", "humaneval",
    "--samples", str(samples_file),
]

result = subprocess.run(
    eval_command,
    capture_output=True,
    text=True
)

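print(result.stdout)
if result.returncode != 0:
    print(f"evalplus.evaluate exited with code {result.returncode}")
if result.stderr:
    # stderr can also carry warnings or progress output, not only errors
    print("stderr:", result.stderr)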
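
For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: with n samples per task, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. With one sample per task, as in this notebook, pass@1 reduces to the fraction of tasks whose single completion passes. The following is a minimal sketch in plain Python; the function names are illustrative and this is not evalplus's own code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_task, k):
    """Average pass@k over tasks; per_task is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)

# Hypothetical outcome: three tasks, one sample each, one task solved
print(mean_pass_at_k([(1, 1), (1, 0), (1, 0)], k=1))
```

Estimating pass@k for larger k needs at least k samples per task, which is why `NUM_SAMPLES_PER_TASK = 1` only supports pass@1.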

## 9. Display Results

In [None]:
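# Find the results file written next to samples.jsonl.
# The exact file name can differ between evalplus versions, so match loosely.
candidates = sorted(OUTPUT_DIR.glob("samples*eval_results.json"))
results_file = candidates[0] if candidates else OUTPUT_DIR / "samples_eval_results.json"

PAPER_FFT_PASS_AT_1 = 29.3  # paper's FFT baseline, in percent

if results_file.exists():
    with open(results_file, 'r') as f:
        results = json.load(f)

    print("="*60)
    print("FINAL RESULTS")
    print("="*60)

    # Assumption: recent evalplus releases store aggregate scores under
    # "pass_at_k" ({"base": {"pass@1": ...}, "plus": {...}}), while
    # results["eval"] holds per-task records rather than metrics.
    pass_at_k = results.get("pass_at_k", {})
    base_metrics = pass_at_k.get("base", {})

    if "pass@1" in base_metrics:
        # "base" refers to the original HumanEval tests, which is the
        # number to compare against the paper.
        pass_at_1 = base_metrics["pass@1"] * 100

        print(f"\n✓ Pass@1 (base HumanEval tests): {pass_at_1:.2f}%")
        print(f"\nPaper's FFT Baseline: {PAPER_FFT_PASS_AT_1}%")

        difference = pass_at_1 - PAPER_FFT_PASS_AT_1
        if difference >= 0:
            print(f"Difference: +{difference:.2f} points (at or above the paper)")
        else:
            print(f"Difference: {difference:.2f} points")

        print("\nAll metrics:")
        for test_set, metrics in pass_at_k.items():
            for metric, value in metrics.items():
                print(f"  {test_set} {metric}: {value*100:.2f}%")
    else:
        print("No aggregate pass@1 found in the results file.")
        print("Use the pass@1 printed by evalplus in the previous cell.")
        print(f"Top-level keys: {list(results)}")
else:
    print("Results file not found. Check the evaluation output above.")
    print(f"Expected file: {results_file}")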

## 10. Summary

### Evaluation Complete!

Compare your results to the paper:

| Model | HumanEval Pass@1 |
|-------|------------------|
| LLaMA2-7B (Base) | ~15% |
| **LLaMA2-7B FFT (Paper)** | **29.3%** |
| **Your FFT Model** | **(See above)** |
| LLaMA2-7B MFT (Paper) | 31.7% (+2.4 points) |

---

### Next Steps:

1. **If pass@1 ≈ 29.3%**: Your FFT training was successful! ✓
   - Proceed with MFT (Mask Fine-Tuning) training
   - Goal: Achieve >31% pass@1 by removing 10% of parameters

2. **If pass@1 < 25%**: Training may need adjustment
   - Check training logs for issues
   - Consider training longer or adjusting hyperparameters

3. **Download results for your records**:
   - Files are in `./humaneval_results/`:
     - `samples.jsonl` (all generated code)
     - `samples_eval_results.json` (evaluation results)
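
How close counts as "≈ 29.3%" depends on the size of the benchmark. With 164 problems, each problem is worth about 0.61 percentage points, and a binomial approximation gives a rough standard error for a single run. The following is a back-of-the-envelope sketch; treating problems as independent trials is a simplifying assumption, and `pass1_standard_error` is an illustrative helper.

```python
import math

NUM_PROBLEMS = 164  # number of HumanEval problems

def pass1_standard_error(pass_rate, n=NUM_PROBLEMS):
    """Binomial standard error of a pass@1 estimate over n problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n)

paper_rate = 0.293  # paper's FFT baseline as a fraction
se = pass1_standard_error(paper_rate)

print(f"Problems solved at 29.3%: about {paper_rate * NUM_PROBLEMS:.0f} of {NUM_PROBLEMS}")
print(f"One problem is worth: {100 / NUM_PROBLEMS:.2f} points")
print(f"Standard error: about {100 * se:.1f} points")
```

Under these assumptions the standard error is about 3.6 points, so a difference of a few points in a single run is within sampling noise.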

## Optional: Download Results

In [None]:
# Download results to your local machine
from google.colab import files

if results_file.exists():
    print("Downloading results...")
    files.download(str(results_file))
    print("✓ Download started")
else:
    print("No results file to download")