# Mistral 7B F1 QA Evaluation

This notebook evaluates Mistral 7B (`mistralai/Mistral-7B-v0.1`) on the F1 QA dataset using multiple-choice questions.

It loads 500 JSON files (1,500 QA pairs in total, three per file), formats each pair as a multiple-choice question, and measures the model's accuracy.


In [None]:
# Install required packages
%pip install transformers torch accelerate huggingface_hub tqdm




In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted successfully!")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully!


In [None]:
# Import project utilities; f1_qa_utils.py is expected to be in /content
import sys
sys.path.append('/content')
from f1_qa_utils import load_qa_dataset, format_multiple_choice, extract_answer_letter, extract_justification

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
import json
from pathlib import Path
import time
import gc
from typing import List, Dict, Any
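
The source of `f1_qa_utils` is not included in this notebook. As a rough illustration of what `extract_answer_letter` might do, here is a minimal sketch. It assumes the response contains a line such as `Answer: B`, or starts with a bare option letter; the real helper and the exact response format may differ. The name `extract_answer_letter_sketch` is hypothetical. Like the evaluator below expects, it returns an empty string when no letter can be parsed.

```python
import re

# Hypothetical stand-in for f1_qa_utils.extract_answer_letter; the real helper
# is not shown in this notebook and may parse responses differently.
_ANSWER_RE = re.compile(
    r"\banswer\s*[:\-]?\s*\(?([A-D])\)?(?![A-Za-z])", re.IGNORECASE
)

def extract_answer_letter_sketch(response: str) -> str:
    """Return the chosen option letter (A-D), or "" if none can be parsed."""
    # Preferred form (assumed): "Answer: B" or "answer: (d)"
    match = _ANSWER_RE.search(response)
    if match:
        return match.group(1).upper()
    # Fallback: response starts with a bare uppercase letter, e.g. "C." or "B) ..."
    match = re.match(r"\s*\(?([A-D])\)?[\).:\s]", response + " ")
    return match.group(1) if match else ""
```

A response that only names a driver, such as `"Charles Leclerc"`, yields `""` under this sketch and would be counted as invalid by the evaluator.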


## Configuration

Set paths and model to evaluate.


In [None]:
# Configuration for Mistral 7B
MODEL_ID = "mistralai/Mistral-7B-v0.1"
MODEL_NAME = "mistral-7b"

DATASET_PATH = "/content/drive/MyDrive/Data_Collection_Code/f1_refined_data/f1_qa_outputs"
RESULTS_PATH = "/content/drive/MyDrive/CS6220_Project/results"

# Create results directory if needed
Path(RESULTS_PATH).mkdir(parents=True, exist_ok=True)

print(f"✅ Configuration loaded:")
print(f"   Model: {MODEL_NAME}")
print(f"   Model ID: {MODEL_ID}")
print(f"   Dataset: {DATASET_PATH}")
print(f"   Results: {RESULTS_PATH}")
print(f"   📝 Note: Model will be loaded directly from Hugging Face Hub")


✅ Configuration loaded:
   Model: mistral-7b
   Model ID: mistralai/Mistral-7B-v0.1
   Dataset: /content/drive/MyDrive/Data_Collection_Code/f1_refined_data/f1_qa_outputs
   Results: /content/drive/MyDrive/CS6220_Project/results
   📝 Note: Model will be loaded directly from Hugging Face Hub


## Load QA Dataset


In [None]:
# Load all QA pairs
qa_dataset = load_qa_dataset(DATASET_PATH)

print(f"📊 Dataset Statistics:")
print(f"   Total QA pairs: {len(qa_dataset)}")
print(f"   Expected: 1500 (500 files × 3 QA pairs each)")

# Show example
if qa_dataset:
    print(f"\n📝 Example QA pair:")
    example = qa_dataset[0]
    print(f"   Question: {example['question']}")
    print(f"   Correct Answer: {example['correct_answer']}")
    print(f"   Wrong Options: {example['wrong_options']}")

    # Show example of formatted prompt
    print(f"\n📝 Example formatted prompt:")
    prompt, correct_letter = format_multiple_choice(
        example['question'],
        example['correct_answer'],
        example['wrong_options']
    )
    print(f"   Correct letter: {correct_letter}")
    print(f"   Prompt preview: {prompt[:200]}...")


✅ Loaded 1500 QA pairs from /content/drive/MyDrive/Data_Collection_Code/f1_refined_data/f1_qa_outputs
📊 Dataset Statistics:
   Total QA pairs: 1500
   Expected: 1500 (500 files × 3 QA pairs each)

📝 Example QA pair:
   Question: Which driver achieved a double win at the US Grand Prix in Austin?
   Correct Answer: Max Verstappen
   Wrong Options: ['Lewis Hamilton', 'Charles Leclerc', 'Sergio Pérez']

📝 Example formatted prompt:
   Correct letter: D
   Prompt preview: You are an AI assistant specializing in Formula 1. Answer the following multiple choice question using the exact format specified below.

Question: Which driver achieved a double win at the US Grand P...
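
The example above reports correct letter `D`, which suggests the options are shuffled before lettering. The following is a minimal sketch of how such formatting could work, not the actual `format_multiple_choice` from `f1_qa_utils`. The function name, the `rng` parameter, the option layout, and the closing instruction line are all assumptions; the real prompt wording is truncated in the preview and is not reproduced here.

```python
import random
from typing import List, Optional, Tuple

LETTERS = "ABCD"

def format_multiple_choice_sketch(
    question: str,
    correct_answer: str,
    wrong_options: List[str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Shuffle the options and return (prompt, letter of the correct option).

    Hypothetical helper: assumes exactly three wrong options, none of which
    equals the correct answer.
    """
    rng = rng or random.Random()
    options = [correct_answer] + list(wrong_options)
    rng.shuffle(options)
    correct_letter = LETTERS[options.index(correct_answer)]

    lines = [f"Question: {question}", ""]
    lines += [f"{letter}) {option}" for letter, option in zip(LETTERS, options)]
    # Assumed instruction line; the notebook's real prompt text is not shown in full.
    lines += ["", "Answer with a single letter (A, B, C, or D)."]
    return "\n".join(lines), correct_letter
```

Passing a seeded `random.Random` makes the option order reproducible across runs, which helps when comparing results between models.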


## Model Evaluator Class


In [None]:
class F1QAEvaluator:
    """Evaluator for F1 QA models."""

    def __init__(self, model_id: str, model_name: str):
        self.model_id = model_id
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self):
        """Load model and tokenizer directly from Hugging Face Hub."""
        try:
            print(f"🔄 Loading {self.model_name} from {self.model_id}...")

            # Load tokenizer
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_id,
                trust_remote_code=True
            )

            # Set padding token if not set
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token

            # Load model; recent transformers releases accept `dtype` (older releases use `torch_dtype`)
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id,
                trust_remote_code=True,
                dtype=torch.float16 if self.device == "cuda" else torch.float32,
                device_map="auto" if self.device == "cuda" else "cpu"
            )

            print(f"✅ {self.model_name} loaded successfully on {self.device}")
            return True

        except Exception as e:
            print(f"❌ Failed to load {self.model_name}: {e}")
            return False

    def generate_response(self, prompt: str, max_new_tokens: int = 250) -> str:
        """Generate response from the model with proper error handling and attention mask."""
        try:
            # Tokenize with proper attention mask handling
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=512,
                padding=True
            )

            if self.device == "cuda":
                inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                # Base generation settings. Sampling is enabled (do_sample=True), so
                # responses and the resulting accuracy can vary between runs.
                generation_kwargs = {
                    "input_ids": inputs["input_ids"],
                    "attention_mask": inputs["attention_mask"],
                    "max_new_tokens": max_new_tokens,
                    "do_sample": True,
                    "temperature": 0.7,
                    "top_p": 0.9,
                    "pad_token_id": self.tokenizer.eos_token_id,
                    "eos_token_id": self.tokenizer.eos_token_id,
                }

                # Add model-specific parameters; with MODEL_NAME = "mistral-7b" only the default branch runs
                if "phi" in self.model_name.lower():
                    # Phi-3 models work better with these settings
                    generation_kwargs.update({
                        "use_cache": False,  # Disable cache to avoid DynamicCache issues
                        "repetition_penalty": 1.1
                    })
                elif "falcon" in self.model_name.lower():
                    # Falcon models work better with these settings
                    generation_kwargs.update({
                        "repetition_penalty": 1.1,
                        "no_repeat_ngram_size": 2
                    })
                else:
                    # Default settings for other models
                    generation_kwargs.update({
                        "repetition_penalty": 1.1
                    })

                outputs = self.model.generate(**generation_kwargs)

            # Check if outputs is valid
            if outputs is None or len(outputs) == 0:
                return ""

            # Decode only the new tokens (excluding the input prompt)
            input_length = inputs["input_ids"].shape[1]
            if len(outputs[0]) > input_length:
                new_tokens = outputs[0][input_length:]
                response = self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
            else:
                # If no new tokens generated, return empty string
                response = ""

            return response

        except Exception as e:
            print(f"❌ Error generating response: {e}")
            return ""

    def evaluate_dataset(self, qa_dataset: List[Dict]) -> Dict:
        """Evaluate model on entire dataset."""
        results = {
            'model_name': self.model_name,
            'total_questions': len(qa_dataset),
            'correct': 0,
            'incorrect': 0,
            'invalid': 0,
            'accuracy': 0.0,
            'details': []
        }

        print(f"\n🚀 Evaluating {self.model_name} on {len(qa_dataset)} questions...")

        for i, qa_pair in enumerate(tqdm(qa_dataset, desc=f"Evaluating {self.model_name}")):
            question = qa_pair['question']
            correct_answer = qa_pair['correct_answer']
            wrong_options = qa_pair['wrong_options']

            # Format as multiple choice
            prompt, correct_letter = format_multiple_choice(question, correct_answer, wrong_options)

            # Generate response
            response = self.generate_response(prompt)

            # Extract answer letter and justification
            predicted_letter = extract_answer_letter(response)
            justification = extract_justification(response)

            # Check correctness
            is_correct = predicted_letter == correct_letter
            is_invalid = predicted_letter == ""

            if is_correct:
                results['correct'] += 1
            elif is_invalid:
                results['invalid'] += 1
            else:
                results['incorrect'] += 1

            # Store details
            results['details'].append({
                'question': question,
                'correct_answer': correct_answer,
                'correct_letter': correct_letter,
                'predicted_letter': predicted_letter,
                'is_correct': is_correct,
                'is_invalid': is_invalid,
                'response': response,
                'justification': justification,
                'prompt': prompt
            })

        # Calculate accuracy over valid responses only; invalid (unparseable)
        # responses are excluded from the denominator
        valid_responses = results['total_questions'] - results['invalid']
        if valid_responses > 0:
            results['accuracy'] = results['correct'] / valid_responses

        print(f"✅ {self.model_name} evaluation complete:")
        print(f"   Correct: {results['correct']}")
        print(f"   Incorrect: {results['incorrect']}")
        print(f"   Invalid: {results['invalid']}")
        print(f"   Accuracy: {results['accuracy']:.2%}")

        return results

    def cleanup(self):
        """Clear model from memory."""
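        # Drop the references instead of deleting the attributes, so a second
        # cleanup() call or a later load_model() does not hit an AttributeError
        self.model = None
        self.tokenizer = None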
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        print(f"🧹 Cleaned up {self.model_name} from memory")
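
`evaluate_dataset` divides the number of correct answers by the number of valid responses, so questions with an unparseable answer do not lower the reported accuracy. The sketch below shows how that figure relates to accuracy over all questions. The function name `summarize_counts` and the counts passed to it are made up for illustration; they are not results from this notebook.

```python
from typing import Dict

def summarize_counts(correct: int, incorrect: int, invalid: int) -> Dict[str, float]:
    """Compute accuracy with and without invalid responses in the denominator."""
    total = correct + incorrect + invalid
    valid = total - invalid
    return {
        # Same definition as results['accuracy'] in evaluate_dataset
        "accuracy_valid_only": correct / valid if valid else 0.0,
        # Treats every invalid response as a wrong answer
        "accuracy_overall": correct / total if total else 0.0,
        "invalid_rate": invalid / total if total else 0.0,
    }

# Illustrative counts only, not measured values
summary = summarize_counts(correct=60, incorrect=20, invalid=20)
print(summary)
```

With these illustrative counts, accuracy over valid responses is 60/80 = 75%, while accuracy over all questions is 60/100 = 60%. Reporting the invalid rate alongside either figure makes the two comparable.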


## Run Evaluation


In [None]:
# Evaluate Mistral 7B
print(f"\n{'='*60}")
print(f"🤖 Evaluating Model: {MODEL_NAME}")
print(f"{'='*60}")

evaluator = F1QAEvaluator(MODEL_ID, MODEL_NAME)

if evaluator.load_model():
    results = evaluator.evaluate_dataset(qa_dataset)
    evaluator.cleanup()

    # Save results to JSON
    output_file = Path(RESULTS_PATH) / f"mistral_7b_evaluation_results_{int(time.time())}.json"

    results_summary = {
        'metadata': {
            'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
            'model_name': MODEL_NAME,
            'model_id': MODEL_ID,
            'total_qa_pairs': len(qa_dataset),
            'device': evaluator.device
        },
        'results': results
    }

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(results_summary, f, indent=2, ensure_ascii=False)

    print(f"\n💾 Results saved to: {output_file}")

    # Print summary
    print(f"\n{'='*60}")
    print(f"📊 MISTRAL 7B EVALUATION SUMMARY")
    print(f"{'='*60}")
    print(f"   Accuracy (valid responses only): {results['accuracy']:.2%}")
    print(f"   Correct: {results['correct']}/{results['total_questions']}")
    print(f"   Incorrect: {results['incorrect']}")
    print(f"   Invalid: {results['invalid']}")
    print(f"\n💾 Full results saved to: {output_file}")

else:
    print(f"❌ Failed to load {MODEL_NAME}")



🤖 Evaluating Model: mistral-7b
🔄 Loading mistral-7b from mistralai/Mistral-7B-v0.1...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ mistral-7b loaded successfully on cuda

🚀 Evaluating mistral-7b on 1500 questions...


Evaluating mistral-7b:  35%|███▌      | 526/1500 [56:20<1:18:00,  4.81s/it]