# MCQA Concise Explanation Evaluation Pipeline

This notebook implements a pipeline that evaluates how rewriting explanations more concisely affects Multiple-Choice Question Answering (MCQA) performance.

## **Pipeline Overview**

For each MCQA sample, the following steps are performed:

1. **Concise Explanation Generation**  
   - Uses an LLM (e.g., GPT-4) to rewrite the original explanation in a more concise form, preserving logical reasoning and correctness.

2. **Masking for Label Leakage Prevention**  
   - Masks answer letters and choice texts in both the verbose and concise explanations to prevent label leakage.

3. **Sufficiency Evaluation**  
   - Assesses whether the (masked) explanation alone is sufficient for a model to select the correct answer, using log-probability calculations over answer choices.

4. **Conciseness Measurement**  
   - Calculates token counts and reduction percentage to quantify how much shorter the concise explanation is compared to the verbose version.

5. **End-to-End Prediction**  
   - Tests whether the model can still predict the correct answer when using the concise explanation.

6. **Perplexity Analysis** (Optional)  
   - Computes language model perplexity for both verbose and concise explanations as an additional measure of informativeness and clarity.

7. **Reflection** (Optional)  
   - Prompts the model to self-assess the sufficiency and completeness of the concise explanation.
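
The masking step (step 2) can be illustrated with a minimal sketch. The project's real `mask_explanation` and `mask_explanation_fa` helpers live in `src.utils.helpers` and are not shown in this notebook, so the function below is a hypothetical stand-in, not their implementation. It assumes `choices` is a list of strings and that answer letters run from `A` upward.

```python
import re

def mask_choices_sketch(explanation, choices, mask_token="[MASK]"):
    """Hypothetical masking helper; NOT the project's mask_explanation."""
    masked = explanation
    # Mask longer choice texts first so that a short choice contained in a
    # longer one does not leave fragments behind.
    for text in sorted(choices, key=len, reverse=True):
        if not text:
            continue  # an empty pattern would match everywhere
        masked = re.sub(re.escape(text), lambda _: mask_token, masked, flags=re.IGNORECASE)
    # Mask answer-letter references such as "(B)", "option C", or "answer is D".
    # Letters are matched in upper case only, to avoid masking the article "a".
    letters = "".join(chr(ord("A") + i) for i in range(len(choices)))
    letter_pattern = (
        rf"\([{letters}]\)"
        rf"|\b(?:[Oo]ption|[Cc]hoice|[Aa]nswer(?: is)?)\s+[{letters}]\b"
    )
    return re.sub(letter_pattern, lambda _: mask_token, masked)

print(mask_choices_sketch(
    "Paris is the capital of France, so the answer is B.",
    ["London", "Paris", "Berlin", "Madrid"],
))
```

Applying the same function to both the verbose and the concise explanation keeps the two sufficiency scores comparable, since neither version can then leak the label through a literal choice text or letter.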

## **Key Metrics Reported**

- **Sufficiency Delta:** Difference in log-probabilities between the correct answer and competing choices, for both verbose and concise explanations.
- **Token Reduction:** Number and percentage of tokens saved by conciseness.
- **Prediction Accuracy:** Whether the model selects the correct answer using the concise explanation.
- **Perplexity:** Language model perplexity scores for each explanation.
- **Reflection Output:** Model's self-assessment of explanation sufficiency (if enabled).
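
The three numeric metrics above can be sketched in a few lines. This is one possible reading of the definitions, not the code in `src.evaluation.metrics`: the sufficiency delta is taken here as the gap to the strongest competing choice (a mean over competitors would be another valid reading), and the function names and inputs are assumptions made for illustration.

```python
import math

def sufficiency_delta(choice_logprobs, correct_idx):
    """Log-prob gap between the correct choice and the strongest competitor."""
    competitors = [lp for i, lp in enumerate(choice_logprobs) if i != correct_idx]
    return choice_logprobs[correct_idx] - max(competitors)

def token_reduction(verbose_tokens, concise_tokens):
    """Return (tokens saved, percent saved) relative to the verbose count."""
    saved = verbose_tokens - concise_tokens
    percent = 100.0 * saved / verbose_tokens if verbose_tokens else 0.0
    return saved, percent

def perplexity(token_logprobs):
    """Exponential of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative numbers only, not results from this pipeline.
print(sufficiency_delta([-2.0, -0.5, -3.0], correct_idx=1))
print(token_reduction(120, 45))
print(perplexity([math.log(0.5)] * 4))
```

A positive delta means the scorer prefers the correct choice given the masked explanation; comparing the verbose and concise deltas shows how much of that preference survives the rewrite.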

In [3]:
import sys
import os

# Add project root to sys.path
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

In [4]:
import logging
from src.setup.hf_auth import hf_login
from src.utils.prompts import CONCISE_REWRITE_PROMPT_TEMPLATE, REFLECTION_PROMPT_TEMPLATE
from src.utils.helpers import mask_explanation, mask_explanation_fa
from src.evaluation.metrics import (
    load_model, 
    calc_perplexity_single, 
    compute_conciseness_metrics,
    scorer_logprobs,
    final_prediction
)

def rewrite_explanation(explanation, rewrite_model):
    """Rewrite explanation using a language model and a concise prompt."""
    prompt = CONCISE_REWRITE_PROMPT_TEMPLATE.format(explanation=explanation)
    # The rewrite_model must have a `.generate_text(prompt)` method (LLM API or local model)
    return rewrite_model.generate_text(prompt)

def reflect_on_explanation(explanation, reflection_model):
    """Optional: ask model to reflect on sufficiency and logic."""
    prompt = REFLECTION_PROMPT_TEMPLATE.format(explanation=explanation)
    return reflection_model.generate_text(prompt)

def process_mcqa_sample(sample, rewrite_model, scorer_model, scorer_tokenizer, logger, mask_fa=False, reflect=False):
    """
    Pipeline for a single MCQA sample.
    sample: dict with keys 'question', 'choices', 'answer', 'verbose_explanation' (and *_fa for Farsi)
    rewrite_model: model for rewriting explanations (e.g., GPT-4)
    scorer_model, scorer_tokenizer: model/tokenizer for scoring answers
    logger: logging.Logger
    mask_fa: bool, use Farsi masking
    reflect: bool, ask for meta-reflection
    """
    # Step 1: Concise Rewrite
    concise_explanation = rewrite_explanation(
        sample["verbose_explanation"], rewrite_model
    )
    logger.info(f"Concise explanation: {concise_explanation}")

    # Step 2: Mask Choices (to avoid label leakage)
    mask_func = mask_explanation_fa if mask_fa else mask_explanation
    row = sample.copy()
    row["explanation"] = sample["verbose_explanation"]
    verbose_masked = mask_func(row)
    row["explanation"] = concise_explanation
    concise_masked = mask_func(row)
    logger.info(f"Masked verbose: {verbose_masked}")
    logger.info(f"Masked concise: {concise_masked}")

    # Step 3: Evaluate sufficiency (log-prob gap)
    suff_delta_verbose = scorer_logprobs(
        scorer_model, scorer_tokenizer, 
        sample["question"], verbose_masked, sample["choices"]
    )
    suff_delta_concise = scorer_logprobs(
        scorer_model, scorer_tokenizer, 
        sample["question"], concise_masked, sample["choices"]
    )
    logger.info(f"Sufficiency delta (verbose): {suff_delta_verbose}")
    logger.info(f"Sufficiency delta (concise): {suff_delta_concise}")

    # Step 4: Measure conciseness
    verbose_tokens, concise_tokens, reduction, percent = compute_conciseness_metrics(
        sample["verbose_explanation"], concise_explanation
    )
    logger.info(f"Tokens: verbose={verbose_tokens}, concise={concise_tokens}, reduction={reduction}, percent={percent:.2f}")

    # Step 5: Final prediction using concise explanation
    final_pred = final_prediction(
        scorer_model, scorer_tokenizer, 
        sample["question"], concise_explanation, sample["choices"]
    )
    is_correct = (final_pred == sample["answer"])
    logger.info(f"Final prediction with concise: {final_pred} (Correct? {is_correct})")

    # Step 6: Perplexity of both explanations (the overview lists this as
    # optional, but this implementation always computes it)
    verbose_ppl = calc_perplexity_single(sample["verbose_explanation"], scorer_tokenizer, scorer_model, logger)
    concise_ppl = calc_perplexity_single(concise_explanation, scorer_tokenizer, scorer_model, logger)

    # Step 7 (optional): Reflection, using the rewrite model as the reflector
    reflection = None
    if reflect:
        reflection = reflect_on_explanation(concise_explanation, rewrite_model)
        logger.info(f"Reflection: {reflection}")

    # Aggregate results
    return {
        "concise_explanation": concise_explanation,
        "verbose_masked": verbose_masked,
        "concise_masked": concise_masked,
        "suff_delta_verbose": suff_delta_verbose,
        "suff_delta_concise": suff_delta_concise,
        "verbose_tokens": verbose_tokens,
        "concise_tokens": concise_tokens,
        "token_reduction": reduction,
        "percent_reduction": percent,
        "final_pred": final_pred,
        "is_correct": is_correct,
        "reflection": reflection,
        "verbose_ppl": verbose_ppl,
        "concise_ppl": concise_ppl
    }

def process_batch(samples, rewrite_model, scorer_model, scorer_tokenizer, logger, mask_fa=False, reflect=False):
    """Process a batch of MCQA samples and collect results as a DataFrame."""
    import pandas as pd
    results = []
    for sample in samples:
        result = process_mcqa_sample(
            sample, rewrite_model, scorer_model, scorer_tokenizer, logger, mask_fa=mask_fa, reflect=reflect
        )
        results.append(result)
    return pd.DataFrame(results)

# Usage:
# logger = logging.getLogger("pipeline")
# hf_login()  # Authenticate to HuggingFace
# scorer_tokenizer, scorer_model = load_model("YourScorerModelName", logger)
# batch_results = process_batch(samples, rewrite_model, scorer_model, scorer_tokenizer, logger)
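
Before calling `process_batch`, it may help to check that every sample carries the keys named in the `process_mcqa_sample` docstring. The sketch below is a hypothetical helper, not part of the project; the example sample is made up, and its answer format (a letter such as `"B"`) is an assumption, since the notebook does not show what `final_prediction` returns.

```python
REQUIRED_KEYS = ("question", "choices", "answer", "verbose_explanation")

def validate_sample(sample):
    """Return a list of problems found in one MCQA sample (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in sample]
    if not problems and not str(sample["verbose_explanation"]).strip():
        problems.append("verbose_explanation is empty")
    return problems

# Made-up sample for illustration; the answer format is an assumption.
example = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Mars", "Earth"],
    "answer": "B",
    "verbose_explanation": "Mercury has the smallest orbit of the four listed planets.",
}

print(validate_sample(example))
print(validate_sample({"question": "Incomplete sample"}))
```

Filtering out samples with a non-empty problem list avoids a `KeyError` partway through a batch, after rewrite calls have already been paid for.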


In [None]:
import requests
import time

class OpenRouterModel:
    def __init__(self, api_key, model_name, base_url="https://openrouter.ai/api/v1/chat"):
        self.api_key = api_key
        self.model_name = model_name
        self.base_url = base_url
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "X-Title": "ConciseExplanationProject"       # Optional, appears in dashboard
        }
        self.default_temperature = 0.3
        self.default_max_tokens = 512

    def set_model(self, model_name):
        self.model_name = model_name

    def set_temperature(self, temperature: float):
        self.default_temperature = temperature

    def set_token_limit(self, max_tokens: int):
        self.default_max_tokens = max_tokens

    def chat(self, prompt: str, system_prompt: str = "", temperature: float = None, max_tokens: int = None, retries: int = 3, sleep_between_retries: float = 2.0) -> str:
        payload = {
            "model": self.model_name,
            "messages": [],
            "temperature": temperature if temperature is not None else self.default_temperature,
            "max_tokens": max_tokens if max_tokens is not None else self.default_max_tokens,
        }

        if system_prompt:
            payload["messages"].append({"role": "system", "content": system_prompt})
        payload["messages"].append({"role": "user", "content": prompt})

        last_error = None
        for attempt in range(1, retries + 1):
            try:
                # A timeout keeps a stalled connection from hanging the notebook
                # (60 seconds is an arbitrary choice).
                response = requests.post(
                    self.base_url + "/completions", headers=self.headers, json=payload, timeout=60
                )
                if response.status_code == 200:
                    return response.json()["choices"][0]["message"]["content"].strip()
                raise ValueError(f"[OpenRouter Error] {response.status_code}: {response.text}")
            except Exception as e:
                last_error = e
                print(f"Attempt {attempt}/{retries} failed: {e}")
                # Do not sleep after the final attempt
                if attempt < retries:
                    time.sleep(sleep_between_retries)

        raise RuntimeError(f"OpenRouter request failed after {retries} attempts.") from last_error

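    def generate(self, prompt: str, **kwargs) -> str:
        return self.chat(prompt=prompt, **kwargs)

    def generate_text(self, prompt: str, **kwargs) -> str:
        """Alias so an instance can serve as `rewrite_model`, which calls `.generate_text(prompt)`."""
        return self.generate(prompt, **kwargs)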
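
The `chat` method waits a fixed interval between attempts. A common alternative for rate-limited APIs is exponential backoff, sketched below as a standalone helper. `call_with_backoff` and its parameters are hypothetical names introduced here, not part of `OpenRouterModel`; the `sleep` argument is injectable so the waiting behavior can be checked without actually pausing.

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt, then try again."""
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:  # sketch only; narrow the exception type in real code
            last_error = exc
            # Skip the wait once no attempts remain.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"failed after {retries} attempts") from last_error

# Demonstration with a function that fails twice and then succeeds.
state = {"calls": 0}
recorded_delays = []

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("temporary failure")
    return "ok"

print(call_with_backoff(flaky, retries=3, base_delay=0.5, sleep=recorded_delays.append))
print(recorded_delays)
```

With a model instance, this could wrap a single-attempt call such as `call_with_backoff(lambda: model.chat(prompt, retries=1))`, so that only one retry policy is active at a time.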


In [7]:
import os

# Create an instance. The key is read from the environment rather than hardcoded;
# OPENROUTER_API_KEY is an assumed variable name.
model = OpenRouterModel(
    api_key=os.environ["OPENROUTER_API_KEY"],
    model_name="openai/gpt-4.1-mini"
)

# Simple usage
response = model.generate("Explain why the sky is blue.")
print(response)

# Custom parameters and system message
response = model.chat(
    prompt="Rewrite this explanation concisely: Mitochondria produce energy in cells through respiration.",
    system_prompt="You're a helpful assistant that rewrites verbose explanations more concisely.",
    temperature=0.3
)
print(response)


The sky appears blue because of a phenomenon called **Rayleigh scattering**. Here's how it works:

1. **Sunlight and the atmosphere:** Sunlight, or white light, is made up of many colors, each with different wavelengths. When sunlight enters Earth's atmosphere, it interacts with the molecules and small particles in the air.

2. **Scattering of light:** These molecules scatter the sunlight in different directions. However, shorter wavelengths of light (blue and violet) are scattered much more strongly than longer wavelengths (red, orange, yellow).

3. **Why blue and not violet?:** Although violet light is scattered even more than blue light, our eyes are more sensitive to blue, and some of the violet light is absorbed by the upper atmosphere, making the sky appear predominantly blue.

4. **Effect during sunrise and sunset:** When the Sun is low on the horizon, sunlight travels through more atmosphere, scattering away most of the blue light and leaving reds and oranges, which is why sunr