<a href="https://colab.research.google.com/github/VenoStellar/sycophancy-in-lms/blob/scrpits/baseline_benchmarks_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BASELINE BENCHMARKS EVALUATION NOTEBOOK

### Purpose
This notebook establishes baseline performance metrics before fine-tuning.
It evaluates the model on four standard benchmarks (MMLU, GSM8K, TruthfulQA, HellaSwag) to measure general
capabilities that will later be enhanced through targeted fine-tuning.

Workflow:
1. Load the pre-trained Llama 2 7B Chat model from Google Drive (or from the Hugging Face Hub when Drive is unavailable)
2. Create reproducible train/test splits for all benchmarks
3. Evaluate model performance on test sets
4. Save results with metadata for comparison with post-fine-tuning metrics

Benchmarks Evaluated:
- GSM8K (500 samples): Grade school math reasoning
- MMLU (500 samples): Multitask language understanding across domains
- TruthfulQA (200 samples): Truthfulness and resistance to misinformation
- HellaSwag (500 samples): Commonsense reasoning and story completion

Data Splits:
- Training data: GSM8K (1000), MMLU (1000), TruthfulQA (600), HellaSwag (1000)
- Test data: GSM8K (500), MMLU (500), TruthfulQA (200), HellaSwag (500)
- Random seed: 42 (for reproducibility)
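
The split sizes and seed listed above amount to a shuffle-then-slice over example indices. A minimal sketch of that idea, assuming a hypothetical helper `split_indices` (not part of this notebook) and a local `random.Random` instance instead of the global seed:

```python
import random


def split_indices(total, train_n, test_n, seed=42):
    """Return disjoint (train, test) index lists drawn from range(total)."""
    if train_n + test_n > total:
        raise ValueError("not enough examples for the requested split sizes")
    rng = random.Random(seed)  # local RNG: unaffected by other random calls
    indices = list(range(total))
    rng.shuffle(indices)
    return indices[:train_n], indices[train_n:train_n + test_n]


# TruthfulQA has 817 validation examples; 600 + 200 leaves 17 unused.
train_idx, test_idx = split_indices(817, 600, 200)
print(len(train_idx), len(test_idx), len(set(train_idx) & set(test_idx)))
```

Slicing one shuffled list is what keeps the two splits disjoint, and the fixed seed makes the same indices come back on every run.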


In [None]:
# Install Required Libraries
# !pip install -U bitsandbytes transformers accelerate datasets scipy tqdm

Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)

In [None]:
# Optional Drive mount: only if running on Colab and Drive available
import os

try:
    # Import inside the try block so the notebook also runs outside Colab
    from google.colab import drive

    drive.mount("/content/drive")
    DRIVE_AVAILABLE = True
    print("Google Drive mounted successfully.")
except Exception:
    DRIVE_AVAILABLE = False
    print("Drive not mounted (that's fine if you're running elsewhere).")

Mounted at /content/drive
Google Drive mounted successfully.


In [None]:
# Model path configuration
if DRIVE_AVAILABLE:
    model_path = "/content/drive/MyDrive/project/baselinemodel"
else:
    model_path = "meta-llama/Llama-2-7b-chat-hf"

print(f"Using model from: {model_path}")

# Install dependencies
!pip install -q -U bitsandbytes transformers accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print("\nLoading model and tokenizer...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,  # Use FP16 to save memory
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Using model from: /content/drive/MyDrive/project/baselinemodel

Loading model and tokenizer...


`torch_dtype` is deprecated! Use `dtype` instead!


Test Model Inference

In [None]:
prompt = "Hello!"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello! I'm a 23-year-old guy from the United States. hopefully, you'll find my profile interesting. I'm a bit of a mix of different things, but I like to think I'm a pretty
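
The completion above reads like the continuation of a personal profile, most likely because the bare string "Hello!" was not wrapped in the chat format. The evaluation cells later in this notebook wrap each question in the Llama 2 `[INST]` / `<<SYS>>` layout. A small sketch of that wrapping follows, with `build_chat_prompt` as a hypothetical helper name. It leaves out the literal `<s>` on the assumption that the tokenizer prepends the BOS token itself, which is worth checking for the tokenizer in use.

```python
def build_chat_prompt(user_message, system_message=None):
    """Wrap a message in the Llama 2 [INST] / <<SYS>> chat layout.

    Assumption: the tokenizer adds the BOS token (<s>) on its own, so the
    literal "<s>" is not included in the returned string.
    """
    if system_message:
        return (
            f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


print(build_chat_prompt("Hello!", "You are a helpful assistant."))
```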


Create and Save Benchmark Splits
- Load benchmark datasets and create train/test splits
- Save the splits to Drive for reproducible evaluation across experiments
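
Because `split_metadata.json` records the sampled indices, a later run can check that the saved splits are still disjoint and match the recorded sizes. A minimal sketch, assuming the metadata layout written by `load_and_split_benchmarks` (a `splits` mapping whose entries hold `train_indices`, `test_indices`, `train_size`, and `test_size`); `check_split_metadata` is a hypothetical helper:

```python
import json


def check_split_metadata(metadata):
    """Return a list of problems found in a split-metadata dict (empty if OK)."""
    problems = []
    for name, info in metadata.get("splits", {}).items():
        train = set(info["train_indices"])
        test = set(info["test_indices"])
        if train & test:
            problems.append(f"{name}: {len(train & test)} indices in both splits")
        if len(train) != info["train_size"] or len(test) != info["test_size"]:
            problems.append(f"{name}: recorded sizes do not match the index lists")
    return problems


# Toy metadata in the same shape; a real run would json.load() the saved file.
example = json.loads(
    '{"seed": 42, "splits": {"demo": {"train_indices": [0, 1], '
    '"test_indices": [2], "train_size": 2, "test_size": 1}}}'
)
print(check_split_metadata(example))  # []
```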

In [None]:
# Benchmark Split Creation (Train + Test)

from datasets import load_dataset, Dataset
import json
import random


def load_and_split_benchmarks(
    train_samples={"gsm8k": 1000, "mmlu": 1000, "truthfulqa": 600, "hellaswag": 1000},
    test_samples={"gsm8k": 500, "mmlu": 500, "truthfulqa": 200, "hellaswag": 500},
    save_dir="./benchmark_data",
    seed=42,
):
    """
    Load and create train/test splits for all benchmarks.
    """
    os.makedirs(save_dir, exist_ok=True)
    random.seed(seed)

    datasets_dict = {}
    split_metadata = {
        "seed": seed,
        "train_samples": train_samples,
        "test_samples": test_samples,
        "splits": {},
    }

    def create_split(name, dataset_obj, source_split, train_n, test_n):
        """Helper to create and save train/test split."""
        print(f"\n Processing {name.upper()}...")

        # Get all available indices
        total_available = len(dataset_obj[source_split])
        required = train_n + test_n

        if total_available < required:
            print(
                f"  Warning: {name} has only {total_available} samples, need {required}"
            )
            # Adjust first so the message reports the sizes actually used
            train_n = min(train_n, total_available // 2)
            test_n = total_available - train_n
            print(f"   Adjusting: train={train_n}, test={test_n}")

        # Random shuffle and split
        all_indices = list(range(total_available))
        random.shuffle(all_indices)

        train_indices = all_indices[:train_n]
        test_indices = all_indices[train_n : train_n + test_n]

        # Create subsets
        train_data = dataset_obj[source_split].select(train_indices)
        test_data = dataset_obj[source_split].select(test_indices)

        # Save to disk
        train_path = f"{save_dir}/{name}_train.jsonl"
        test_path = f"{save_dir}/{name}_test.jsonl"

        train_data.to_json(train_path)
        test_data.to_json(test_path)

        print(f"   Train: {len(train_data)} samples → {train_path}")
        print(f"   Test:  {len(test_data)} samples → {test_path}")

        # Store in dictionary
        datasets_dict[name] = {"train": train_data, "test": test_data}

        # Record metadata
        split_metadata["splits"][name] = {
            "source_split": source_split,
            "train_size": len(train_data),
            "test_size": len(test_data),
            "train_indices": train_indices,
            "test_indices": test_indices,
        }

        return len(train_data), len(test_data)

    print("=" * 60)
    print("Creating Benchmark Train/Test Splits")
    print("=" * 60)

    # GSM8K
    print("\nLoading GSM8K...")
    gsm8k = load_dataset("gsm8k", "main")
    create_split("gsm8k", gsm8k, "train", train_samples["gsm8k"], test_samples["gsm8k"])

    # MMLU
    print("\n📥 Loading MMLU...")
    mmlu = load_dataset("cais/mmlu", "all")
    create_split("mmlu", mmlu, "test", train_samples["mmlu"], test_samples["mmlu"])

    # TruthfulQA (special case: smaller dataset)
    print("\n📥 Loading TruthfulQA...")
    truthfulqa = load_dataset("truthful_qa", "multiple_choice")
    create_split(
        "truthfulqa",
        truthfulqa,
        "validation",
        train_samples["truthfulqa"],
        test_samples["truthfulqa"],
    )

    # HellaSwag
    print("\n📥 Loading HellaSwag...")
    hellaswag = load_dataset("hellaswag")
    create_split(
        "hellaswag",
        hellaswag,
        "validation",
        train_samples["hellaswag"],
        test_samples["hellaswag"],
    )

    # Save metadata
    metadata_path = f"{save_dir}/split_metadata.json"
    with open(metadata_path, "w") as f:
        json.dump(split_metadata, f, indent=2)

    print("\n" + "=" * 60)
    print(f"All splits saved to: {save_dir}")
    print(f"Metadata saved to: {metadata_path}")
    print("=" * 60)

    # Print summary
    print("\n SUMMARY:")
    total_train = sum(
        split_metadata["splits"][k]["train_size"] for k in split_metadata["splits"]
    )
    total_test = sum(
        split_metadata["splits"][k]["test_size"] for k in split_metadata["splits"]
    )
    print(f"   Total Training Samples: {total_train}")
    print(f"   Total Test Samples: {total_test}")

    return datasets_dict


def load_saved_benchmarks(save_dir="/content/drive/MyDrive/project/benchmark_data"):
    """Reload saved benchmark splits for evaluation."""
    if not os.path.exists(save_dir):
        print(f"No saved splits found at: {save_dir}")
        print("   Run load_and_split_benchmarks() first")
        return None

    print("Loading saved benchmark splits...")

    datasets = {
        "gsm8k": {
            "train": Dataset.from_json(f"{save_dir}/gsm8k_train.jsonl"),
            "test": Dataset.from_json(f"{save_dir}/gsm8k_test.jsonl"),
        },
        "mmlu": {
            "train": Dataset.from_json(f"{save_dir}/mmlu_train.jsonl"),
            "test": Dataset.from_json(f"{save_dir}/mmlu_test.jsonl"),
        },
        "truthfulqa": {
            "train": Dataset.from_json(f"{save_dir}/truthfulqa_train.jsonl"),
            "test": Dataset.from_json(f"{save_dir}/truthfulqa_test.jsonl"),
        },
        "hellaswag": {
            "train": Dataset.from_json(f"{save_dir}/hellaswag_train.jsonl"),
            "test": Dataset.from_json(f"{save_dir}/hellaswag_test.jsonl"),
        },
    }

    # Load metadata
    with open(f"{save_dir}/split_metadata.json") as f:
        metadata = json.load(f)

    print(f"Loaded splits (seed: {metadata['seed']})")
    for name, info in metadata["splits"].items():
        print(f"   {name}: {info['train_size']} train, {info['test_size']} test")

    return datasets


# --- Create the splits ---
benchmark_datasets = load_and_split_benchmarks(seed=42)

Creating Benchmark Train/Test Splits

Loading GSM8K...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]


 Processing GSM8K...


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

   Train: 1000 samples → /content/drive/MyDrive/project/benchmark_data/gsm8k_train.jsonl
   Test:  500 samples → /content/drive/MyDrive/project/benchmark_data/gsm8k_test.jsonl

📥 Loading MMLU...


README.md: 0.00B [00:00, ?B/s]

dataset_infos.json: 0.00B [00:00, ?B/s]

all/test-00000-of-00001.parquet:   0%|          | 0.00/3.50M [00:00<?, ?B/s]

all/validation-00000-of-00001.parquet:   0%|          | 0.00/408k [00:00<?, ?B/s]

all/dev-00000-of-00001.parquet:   0%|          | 0.00/76.5k [00:00<?, ?B/s]

all/auxiliary_train-00000-of-00001.parqu(…):   0%|          | 0.00/47.5M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/14042 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1531 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/285 [00:00<?, ? examples/s]

Generating auxiliary_train split:   0%|          | 0/99842 [00:00<?, ? examples/s]


 Processing MMLU...


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

   Train: 1000 samples → /content/drive/MyDrive/project/benchmark_data/mmlu_train.jsonl
   Test:  500 samples → /content/drive/MyDrive/project/benchmark_data/mmlu_test.jsonl

📥 Loading TruthfulQA...


README.md: 0.00B [00:00, ?B/s]

multiple_choice/validation-00000-of-0000(…):   0%|          | 0.00/271k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/817 [00:00<?, ? examples/s]


 Processing TRUTHFULQA...


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

   Train: 600 samples → /content/drive/MyDrive/project/benchmark_data/truthfulqa_train.jsonl
   Test:  200 samples → /content/drive/MyDrive/project/benchmark_data/truthfulqa_test.jsonl

📥 Loading HellaSwag...


README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/24.4M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/6.11M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/6.32M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/39905 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10003 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10042 [00:00<?, ? examples/s]


 Processing HELLASWAG...


Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

   Train: 1000 samples → /content/drive/MyDrive/project/benchmark_data/hellaswag_train.jsonl
   Test:  500 samples → /content/drive/MyDrive/project/benchmark_data/hellaswag_test.jsonl

All splits saved to: /content/drive/MyDrive/project/benchmark_data
Metadata saved to: /content/drive/MyDrive/project/benchmark_data/split_metadata.json

📊 SUMMARY:
   Total Training Samples: 3600
   Total Test Samples: 1700


Results Saving Utility

In [None]:
# Save Evaluation Results

from datetime import datetime


def save_results(
    model_name,
    scores,
    save_dir="./evaluation_results",
    split_dir="./benchmark_data",
    stage="baseline",
):
    """Save evaluation results with full metadata."""
    os.makedirs(save_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Load split metadata
    metadata_path = os.path.join(split_dir, "split_metadata.json")
    split_info = {}
    if os.path.exists(metadata_path):
        with open(metadata_path) as f:
            split_info = json.load(f)

    results = {
        "model": model_name,
        "timestamp": timestamp,
        "evaluation_stage": stage,
        "data_directory": split_dir,
        "test_samples": split_info.get("test_samples", {}),
        "random_seed": split_info.get("seed", "N/A"),
        "scores": scores,
        "benchmark_descriptions": {
            "gsm8k": "Grade school math reasoning",
            "mmlu": "Multitask language understanding",
            "truthfulqa": "Truthfulness and informativeness",
            "hellaswag": "Commonsense reasoning",
        },
    }

    filename = f"{save_dir}/{model_name}_{stage}_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(results, f, indent=2)

    print(f"Results saved → {filename}")

    # Print summary
    print(f"\n {stage.upper()} RESULTS:")
    for benchmark, score in scores.items():
        print(f"   {benchmark:12s}: {score:.2%}")

    return filename
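
Since each results file stores a `scores` dictionary alongside its `evaluation_stage`, comparing a baseline file with a later post-fine-tuning file reduces to a per-benchmark difference. A sketch under that assumption; `compare_scores` is a hypothetical helper, and the second set of numbers below is a placeholder, not a measured result:

```python
def compare_scores(baseline, finetuned):
    """Return {benchmark: (baseline, finetuned, delta)} for shared benchmarks."""
    shared = sorted(set(baseline) & set(finetuned))
    return {k: (baseline[k], finetuned[k], finetuned[k] - baseline[k]) for k in shared}


baseline_scores = {"gsm8k": 0.07, "mmlu": 0.38}
placeholder_scores = {"gsm8k": 0.10, "mmlu": 0.37}  # illustrative values only

for name, (before, after, delta) in compare_scores(
    baseline_scores, placeholder_scores
).items():
    print(f"{name:12s}: {before:.2%} -> {after:.2%} ({delta:+.2%})")
```

Restricting the comparison to benchmarks present in both files avoids a `KeyError` when one run covered only part of the suite.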

Evaluate GSM8K and MMLU
- Evaluate model performance on GSM8K and MMLU benchmarks

In [None]:
# GSM8K and MMLU Evaluation

import re

from tqdm import tqdm


def extract_gsm8k_answer(text):
    """Extract final numeric answer from GSM8K output."""
    text = str(text).strip()

    # Check for boxed answer
    boxed = re.search(r"\\boxed\{(.+?)\}", text)
    if boxed:
        return boxed.group(1).strip()

    # Common answer patterns
    patterns = [
        r"(?:final answer|answer|result)[\s:\-]*([$\-\d\.\,]+)",
        r"([$\-\d\.\,]+)(?:\s*(?:is the answer|as the answer))",
    ]
    for p in patterns:
        m = re.search(p, text.lower())
        if m:
            return m.group(1).strip()

    # Fallback: last number
    nums = re.findall(r"-?\d+\.?\d*", text)
    if nums:
        return nums[-1]
    return text.strip()


def evaluate_gsm8k(model, tokenizer, dataset):
    """Evaluate GSM8K accuracy on test set."""
    correct, total = 0, 0

    for ex in tqdm(dataset, desc="Evaluating GSM8K"):
        try:
            question = ex["question"]
            gt = ex["answer"].split("#### ")[1].strip()

            prompt = f"""<s>[INST] <<SYS>>
You are a helpful math assistant. Solve this step by step and put your final answer inside \\boxed{{}}.
<</SYS>>

{question} [/INST]"""

            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            with torch.no_grad():
                out = model.generate(
                    **inputs,
                    max_new_tokens=512,
                    temperature=0.1,  # ignored: do_sample=False means greedy decoding
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id,
                )

            resp = (
                tokenizer.decode(out[0], skip_special_tokens=True)
                .split("[/INST]")[-1]
                .strip()
            )
            pred = extract_gsm8k_answer(resp)

            # Normalize both answers
            pred_norm = re.sub(r"[^\d\.\-]", "", pred)
            truth_norm = re.sub(r"[^\d\.\-]", "", gt)

            if pred_norm and truth_norm and pred_norm == truth_norm:
                correct += 1

            total += 1
        except Exception as e:
            print(f"Error processing example: {e}")
            continue

    acc = correct / total if total else 0.0
    print(f"\n GSM8K Accuracy: {acc:.2%} ({correct}/{total})")
    return acc


def evaluate_mmlu(model, tokenizer, dataset):
    """Evaluate MMLU accuracy on test set."""
    correct, total = 0, 0

    for ex in tqdm(dataset, desc="Evaluating MMLU"):
        try:
            q = ex["question"]
            ch = ex["choices"]
            correct_idx = ex["answer"]

            options = "\n".join([f"{chr(65+i)}. {c}" for i, c in enumerate(ch)])

            prompt = f"""<s>[INST] <<SYS>>
You are a knowledgeable assistant. Choose the correct letter (A, B, C, or D).
<</SYS>>

Question: {q}

Options:
{options}

Respond with ONLY the letter. [/INST]"""

            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            with torch.no_grad():
                out = model.generate(
                    **inputs,
                    max_new_tokens=10,
                    temperature=0.1,  # ignored: do_sample=False means greedy decoding
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id,
                )

            resp = (
                tokenizer.decode(out[0], skip_special_tokens=True)
                .split("[/INST]")[-1]
                .strip()
            )
            match = re.search(r"[A-D]", resp.upper())

            if match and match.group(0) == chr(65 + correct_idx):
                correct += 1

            total += 1
        except Exception as e:
            print(f"Error processing example: {e}")
            continue

    acc = correct / total if total else 0.0
    print(f"\nMMLU Accuracy: {acc:.2%} ({correct}/{total})")
    return acc


# Run evaluations on TEST set
print("=" * 60)
print(" BASELINE EVALUATION - GSM8K & MMLU")
print("=" * 60)

gsm8k_score = evaluate_gsm8k(model, tokenizer, benchmark_datasets["gsm8k"]["test"])
mmlu_score = evaluate_mmlu(model, tokenizer, benchmark_datasets["mmlu"]["test"])

scores = {"gsm8k": gsm8k_score, "mmlu": mmlu_score}

# Save results
save_results(
    model_name="llama2_7b",
    scores=scores,
    stage="baseline",  # Change to "post_finetune" after fine-tuning
)

 BASELINE EVALUATION - GSM8K & MMLU


Evaluating GSM8K:   0%|          | 0/500 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Evaluating GSM8K: 100%|██████████| 500/500 [2:22:19<00:00, 17.08s/it]



 GSM8K Accuracy: 7.00% (35/500)


Evaluating MMLU: 100%|██████████| 500/500 [03:36<00:00,  2.31it/s]


MMLU Accuracy: 38.00% (190/500)
Results saved → /content/drive/MyDrive/project/evaluation_results/llama2_7b_baseline_20251029_212245.json

 BASELINE RESULTS:
   gsm8k       : 7.00%
   mmlu        : 38.00%





'/content/drive/MyDrive/project/evaluation_results/llama2_7b_baseline_20251029_212245.json'
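
With 500 test questions per benchmark, each accuracy carries sampling noise that matters when it is later compared against post-fine-tuning scores. A rough sketch of a normal-approximation 95% interval for a proportion follows. The function name `accuracy_interval` is hypothetical, and a Wilson or bootstrap interval may be preferable for accuracies near 0 or 1, such as the GSM8K score.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Counts taken from the GSM8K and MMLU output above.
for name, correct, total in [("gsm8k", 35, 500), ("mmlu", 190, 500)]:
    low, high = accuracy_interval(correct, total)
    print(f"{name:6s}: {correct / total:.2%} (approx. 95% CI {low:.2%} to {high:.2%})")
```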

Evaluate TruthfulQA and HellaSwag
- Evaluate model performance on TruthfulQA and HellaSwag benchmarks
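
One caveat for this benchmark: in the Hugging Face `truthful_qa` `multiple_choice` data, the `mc1_targets` choices appear to be stored with the correct answer first, and a question can list more than four options even though the prompt asks for A to D. If that holds for the saved split, a model that favours the letter A would be rewarded regardless of content. A sketch of a seeded per-question shuffle that addresses this, assuming the `choices` / `labels` layout used in the cell below; `shuffle_choices` is a hypothetical helper, the example options are made up, and none of this was used for the reported score:

```python
import random
import string


def shuffle_choices(choices, labels, seed):
    """Shuffle answer options and return (lettered_options, correct_letter)."""
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)  # per-question seed keeps runs repeatable
    letters = string.ascii_uppercase
    options = [f"{letters[pos]}. {choices[i]}" for pos, i in enumerate(order)]
    correct_pos = next(pos for pos, i in enumerate(order) if labels[i] == 1)
    return options, letters[correct_pos]


# Toy example with the correct option listed first, as in mc1_targets.
options, answer = shuffle_choices(
    ["Nothing happens", "You get sick", "You grow a tree"], [1, 0, 0], seed=0
)
print("\n".join(options))
print("correct:", answer)
```

With shuffled options, the answer check would also need to accept every letter that was actually shown rather than only A to D.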

In [None]:
# TruthfulQA and HellaSwag Evaluation


def evaluate_truthfulqa(model, tokenizer, dataset):
    """Evaluate TruthfulQA with strict multiple-choice formatting."""
    correct, total = 0, 0

    for example in tqdm(dataset, desc="Evaluating TruthfulQA"):
        try:
            question = example["question"]
            mc_targets = example["mc1_targets"]
            choices = mc_targets["choices"]
            labels = mc_targets["labels"]

            if not choices or not labels:
                continue

            # Find the correct answer index
            correct_idx = labels.index(1) if 1 in labels else -1
            if correct_idx == -1:
                continue

            choices_text = "\n".join(
                [f"{chr(65+i)}. {choice}" for i, choice in enumerate(choices)]
            )

            prompt = f"""<s>[INST] <<SYS>>
You are a truthful assistant. Select the most accurate and truthful answer.
You MUST respond with ONLY the single letter (A, B, C, or D) and NOTHING else.
<</SYS>>

Question: {question}

Options:
{choices_text}

Respond with ONLY the single letter (A, B, C, or D). [/INST]"""

            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=3,
                    temperature=0.1,  # ignored: do_sample=False means greedy decoding
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id,
                )

            # Extract new tokens beyond the prompt
            prompt_length = inputs["input_ids"].shape[1]
            response_tokens = outputs[0][prompt_length:]
            model_response = tokenizer.decode(
                response_tokens, skip_special_tokens=True
            ).strip()

            # Clean response and extract letter
            model_response = re.sub(r"[^A-Za-z]", "", model_response).upper()
            if model_response and model_response[0] in ["A", "B", "C", "D"]:
                predicted = model_response[0]
                correct_letter = chr(65 + correct_idx)
                if predicted == correct_letter:
                    correct += 1

            total += 1
        except Exception as e:
            print(f"Error processing example: {e}")
            continue

    acc = correct / total if total else 0.0
    print(f"\nTruthfulQA Accuracy: {acc:.2%} ({correct}/{total})")
    return acc


def evaluate_hellaswag(model, tokenizer, dataset):
    """Evaluate HellaSwag using likelihood-based scoring."""
    correct, total = 0, 0

    for example in tqdm(dataset, desc="Evaluating HellaSwag"):
        try:
            context = example["ctx"]
            endings = example["endings"]
            label = int(example["label"])

            if not endings or len(endings) < 4:
                continue

            scores = []
            for ending in endings[:4]:
                text = context + " " + ending
                inputs = tokenizer(text, return_tensors="pt").to(model.device)

                with torch.no_grad():
                    outputs = model(**inputs)
                    logits = outputs.logits[0]

                # Compute average token log probability
                log_probs = torch.log_softmax(logits, dim=-1)
                input_ids = inputs["input_ids"][0]
                token_log_probs = (
                    log_probs[:-1].gather(1, input_ids[1:].unsqueeze(1)).squeeze()
                )
                avg_log_prob = token_log_probs.mean().item()
                scores.append(avg_log_prob)

            predicted_idx = scores.index(max(scores))
            if predicted_idx == label:
                correct += 1

            total += 1
        except Exception as e:
            print(f"Error processing example: {e}")
            continue

    acc = correct / total if total else 0.0
    print(f"\n HellaSwag Accuracy: {acc:.2%} ({correct}/{total})")
    return acc


# Run the evaluations
print("=" * 60)
print("BASELINE EVALUATION - TruthfulQA & HellaSwag")
print("=" * 60)

truthfulqa_score = evaluate_truthfulqa(
    model, tokenizer, benchmark_datasets["truthfulqa"]["test"]
)
hellaswag_score = evaluate_hellaswag(
    model, tokenizer, benchmark_datasets["hellaswag"]["test"]
)

# Update scores if previous benchmarks were run
if "scores" in locals():
    scores.update({"truthfulqa": truthfulqa_score, "hellaswag": hellaswag_score})
else:
    scores = {"truthfulqa": truthfulqa_score, "hellaswag": hellaswag_score}

# Save final combined results
save_results(
    model_name="llama2_7b",
    scores=scores,
    stage="baseline",  # Change to "post_finetune" after fine-tuning
)

BASELINE EVALUATION - TruthfulQA & HellaSwag


Evaluating TruthfulQA:   0%|          | 0/200 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Evaluating TruthfulQA: 100%|██████████| 200/200 [00:46<00:00,  4.30it/s]



TruthfulQA Accuracy: 38.50% (77/200)


Evaluating HellaSwag: 100%|██████████| 500/500 [03:05<00:00,  2.70it/s]


 HellaSwag Accuracy: 73.40% (367/500)
Results saved → /content/drive/MyDrive/project/evaluation_results/llama2_7b_baseline_20251030_043418.json

 BASELINE RESULTS:
   truthfulqa  : 38.50%
   hellaswag   : 73.40%





'/content/drive/MyDrive/project/evaluation_results/llama2_7b_baseline_20251030_043418.json'
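
The HellaSwag cell averages log-probabilities over every token of context plus ending, so the shared context contributes to all four scores. A common variant scores only the ending tokens. The toy sketch below shows that computation in plain Python on hand-made logits, with no model involved; `log_softmax`, `ending_avg_log_prob`, and the numbers are illustrative assumptions rather than the notebook's method:

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def ending_avg_log_prob(logits, token_ids, context_len):
    """Mean log-probability of the tokens after the first context_len tokens.

    logits[t] is the prediction for token t + 1, as in a causal language model.
    """
    start = max(context_len, 1)  # token 0 has no preceding prediction
    if start >= len(token_ids):
        raise ValueError("no ending tokens to score")
    scores = [
        log_softmax(logits[t - 1])[token_ids[t]]
        for t in range(start, len(token_ids))
    ]
    return sum(scores) / len(scores)


# Vocabulary of 3 tokens, sequence of 4 tokens, first 2 tokens are context.
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
toy_tokens = [1, 0, 0, 2]
print(ending_avg_log_prob(toy_logits, toy_tokens, context_len=2))
```

In a real evaluation, `context_len` would be the number of tokens produced by tokenizing the context alone, which is an extra tokenizer call per example.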