# Model Performance Evaluation
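This notebook compares models by the correctness of their predictions. It loads JSON result files from zip archives, combines them into a single DataFrame, and builds a question-by-model pivot table of correctness. It then embeds the questions so that, for a new input question, the models that performed best on the most similar questions can be identified.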

In [4]:
import json
import pandas as pd
import zipfile
import os
import logging
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

## Load JSON Files from Zip Archives
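This function reads every JSON file in a zip archive and returns the records as a flat list of dictionaries. A top-level dictionary is kept as one record, a top-level list contributes each dictionary it contains, and anything else is skipped with a warning.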

In [5]:
def load_json_from_zip(zip_path):
    """Return every dictionary found in the JSON files of a zip archive."""
    data = []
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for file_name in zip_ref.namelist():
            # Directory entries have no JSON content to parse
            if file_name.endswith('/'):
                continue
            with zip_ref.open(file_name) as file:
                json_data = json.load(file)
                if isinstance(json_data, dict):
                    # A single record
                    data.append(json_data)
                elif isinstance(json_data, list):
                    # A list of records: keep only the dictionaries
                    for item in json_data:
                        if isinstance(item, dict):
                            data.append(item)
                        else:
                            logging.warning(f"Skipping non-dictionary item of type {type(item)} in list in file {file_name}")
                else:
                    logging.warning(f"Skipping non-dictionary or non-list item of type {type(json_data)} in file {file_name}")
    return data
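
To make the flattening rule concrete, here is a minimal sketch that builds a small archive in memory and reads it back. The helper name `read_json_records` and the member names are hypothetical. The sketch also skips members that do not end in `.json`, which is an assumption about the archive layout rather than something the notebook states.

```python
import io
import json
import zipfile

def read_json_records(zip_bytes):
    """Return every dict found in the .json members of an in-memory zip."""
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith('.json'):
                continue  # assumption: only .json members hold results
            with zf.open(name) as fh:
                payload = json.load(fh)
            # Treat a single top-level value as a one-element list
            items = payload if isinstance(payload, list) else [payload]
            records.extend(item for item in items if isinstance(item, dict))
    return records

# Build a toy archive: one dict, one mixed list, one non-JSON member
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('single.json', json.dumps({'question': 'q1', 'pred': 'A'}))
    zf.writestr('many.json', json.dumps([{'question': 'q2', 'pred': 'B'}, 42]))
    zf.writestr('notes.txt', 'not JSON')

records = read_json_records(buffer.getvalue())
print(records)
```

The integer `42` and the text file are dropped, so only the two dictionaries remain.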

## Combine Data from All Zip Files
This section lists all zip files in the `eval_results` directory, loads their data, and combines it into a single DataFrame.

In [6]:
# Directory containing the zip files
eval_results_dir = 'eval_results'

# List all zip files in the directory
zip_files = [f for f in os.listdir(eval_results_dir) if f.endswith('.zip')]

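# Build one DataFrame per zip file and tag every row with its model name
all_data = []
for zip_file in zip_files:
    zip_path = os.path.join(eval_results_dir, zip_file)
    data = load_json_from_zip(zip_path)
    # Treat each dictionary as one record (row); calling pd.DataFrame on a
    # single record would expand list-valued fields such as 'options' into rows
    df = pd.DataFrame(data)
    # The model name is the third underscore-separated field of the file name
    model_name = os.path.splitext(zip_file)[0].split('_')[2]
    df['model'] = model_name
    all_data.append(df)

# Combine all DataFrames into a single DataFrame
combined_df = pd.concat(all_data, ignore_index=True)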
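
The model name comes from the third underscore-separated field of the file name, which is worth checking against the actual archive names. The sketch below applies the same rule to two hypothetical archive names. The first mirrors the JSON file name used later in this notebook, and the second is invented to show what happens when the model name itself contains an underscore. The helper name `model_name_from_archive` is also hypothetical.

```python
import os

def model_name_from_archive(file_name):
    """Apply the notebook's rule: third underscore-separated field of the stem."""
    stem = os.path.splitext(file_name)[0]
    return stem.split('_')[2]

# Hypothetical archive names, used only for illustration
clean = model_name_from_archive('model_outputs_Meta-Llama-3-8B-Instruct_5shots.zip')
truncated = model_name_from_archive('model_outputs_my_model_5shots.zip')

print(clean)      # Meta-Llama-3-8B-Instruct
print(truncated)  # my
```

If model names can contain underscores, the rule truncates them, so a stricter naming scheme or a different separator may be needed.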

## Ensure Required Columns Exist
Check that the combined DataFrame contains the 'question', 'answer', and 'pred' columns, which the pivot table below relies on.

In [7]:
# Ensure the columns used by the pivot table exist
required_columns = {'question', 'answer', 'pred'}
missing_columns = required_columns - set(combined_df.columns)
if missing_columns:
    raise ValueError(f"The data is missing required columns: {sorted(missing_columns)}")

## Create Pivot Table
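Create a pivot table with questions as rows and models as columns. Each cell is 1 if at least one of the model's predictions for that question matches the reference answer and 0 otherwise; question-model pairs with no prediction are filled with 0. The table is then saved as a pickle file for later use.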

In [8]:
# Create a pivot table with questions as rows and models as columns
pivot_df = combined_df.pivot_table(
    index='question', 
    columns='model', 
    values='pred', 
    aggfunc=lambda x: (x == combined_df.loc[x.index, 'answer']).astype(int).max(), 
    fill_value=0
)

# Save the pivot table to a pickle file
pivot_df.to_pickle('model_performance.pkl')
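
One way to see what the pivot produces is to compute correctness as its own column first and then aggregate it with `max`. The sketch below does this on a toy DataFrame. The data and the `correct` column are invented for illustration, and this is an alternative formulation under those assumptions, not the cell above.

```python
import pandas as pd

# Toy predictions: m1 answers q2 twice, and m2 has no prediction for q3
toy = pd.DataFrame({
    'question': ['q1', 'q1', 'q2', 'q2', 'q2', 'q3'],
    'model':    ['m1', 'm2', 'm1', 'm1', 'm2', 'm1'],
    'answer':   ['A',  'A',  'B',  'B',  'B',  'D'],
    'pred':     ['A',  'C',  'C',  'B',  'C',  'D'],
})

# 1 where the prediction matches the reference answer, else 0
toy['correct'] = (toy['pred'] == toy['answer']).astype(int)

# A question counts as solved if any of the model's attempts is correct
toy_pivot = toy.pivot_table(
    index='question',
    columns='model',
    values='correct',
    aggfunc='max',
    fill_value=0,
)
print(toy_pivot)
```

Precomputing the column keeps the aggregation independent of the outer DataFrame's index, which can make the step easier to test in isolation.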

## Load and Display Model Outputs
Load the JSON file containing model outputs and display the first few rows of the DataFrame.

In [9]:
# Load the JSON file
with open('model_outputs_Meta-Llama-3-8B-Instruct_5shots.json', 'r') as file:
    data = json.load(file)

# Convert the data to a DataFrame
df = pd.DataFrame(data)
df.head()

| | question_id | question | options | answer | answer_index | cot_content | category | src | pred | generated_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | Typical advertising regulatory bodies suggest,... | [Safe practices, Fear, Jealousy, Trivial, Unsa... | I | 8 | | business | ori_mmlu-business_ethics | I | We refer to Wikipedia articles on advertising... |
| 1 | 71 | Managers are entrusted to run the company in t... | [Shareholders, Diligence, Self-interest, Share... | F | 5 | | business | ori_mmlu-business_ethics | B | We refer to Wikipedia articles on business et... |
| 2 | 72 | There are two main issues associated with ____... | [Down, Autonomy, Remuneration, Benefit, Down, ... | J | 9 | | business | ori_mmlu-business_ethics | A | We refer to Wikipedia articles on business et... |
| 3 | 73 | _______ locate morality beyond the sphere of r... | [Ethical egoism, Ethics of duty, Postmodern et... | C | 2 | | business | ori_mmlu-business_ethics | H | We refer to Wikipedia articles on ethics for ... |
| 4 | 74 | Some of key differences between Islamic finan... | [Interest, Certain, Assured, Both tangible and... | G | 6 | | business | ori_mmlu-business_ethics | A | We refer to Wikipedia articles on Islamic fin... |


## Compute Embeddings for Questions
Load the embedding model and compute embeddings for the questions.

In [10]:
# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings for all existing questions in one batched call
df['embedding'] = list(model.encode(df['question'].tolist()))
# Create a new DataFrame with only 'question' and 'embedding' columns
df_questions_embeddings = df[['question', 'embedding']]

df_questions_embeddings.to_pickle('questions_embeddings.pkl')

## Find Similar Questions
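Prompt the user for an input question, compute its embedding, and find the 20 most similar stored questions by cosine similarity. Then average each model's correctness over those questions and report the three models with the highest mean.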

In [11]:
df_similarity = pd.read_pickle('questions_embeddings.pkl')

# Prompt user for input question
input_question = input("Enter your question: ")

# Compute embedding for input question
input_embedding = model.encode(input_question)

# Compute similarities
df_similarity['similarity'] = df_similarity['embedding'].apply(lambda x: cosine_similarity([input_embedding], [x])[0][0])

# Find the top 20 most similar questions
top20 = df_similarity.nlargest(20, 'similarity')

# Load the model performance data
df_performance = pd.read_pickle('model_performance.pkl')

# Filter the performance data for the top 20 questions
top20_questions = top20['question'].values
df_top20_performance = df_performance[df_performance.index.isin(top20_questions)]

# Calculate the mean performance for each model
model_performance = df_top20_performance.mean().sort_values(ascending=False)

# Select the top 3 models
top3_models = model_performance.head(3)

# Display the top 3 models
print("Top 3 models for the input question:")
print(top3_models)

Top 3 models for the input question:
model
Meta-Llama-3         0.80
arx                  0.80
gpt-4o-2024-08-06    0.75
dtype: float64
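
The ranking step can be sketched end to end without the embedding model by using toy vectors. The example below computes cosine similarity against all stored embeddings in one matrix operation, keeps the nearest questions, and averages correctness per model. The two-dimensional embeddings, the performance table, and the helper `cosine_to_all` are all invented for illustration.

```python
import numpy as np
import pandas as pd

def cosine_to_all(query, matrix):
    """Cosine similarity between one vector and every row of a matrix.

    Assumes no vector has zero length.
    """
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

# Toy stand-ins for the stored question embeddings
questions = pd.DataFrame({
    'question': ['q1', 'q2', 'q3'],
    'embedding': [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
})

# Toy stand-in for the question-by-model correctness table
performance = pd.DataFrame(
    {'m1': [1, 0, 1], 'm2': [0, 1, 1]},
    index=pd.Index(['q1', 'q2', 'q3'], name='question'),
)

query = np.array([1.0, 0.0])
stacked = np.vstack(questions['embedding'].tolist())
questions['similarity'] = cosine_to_all(query, stacked)

# Keep the 2 nearest questions and average correctness per model
nearest = questions.nlargest(2, 'similarity')['question']
ranking = performance[performance.index.isin(nearest)].mean().sort_values(ascending=False)
print(ranking)
```

Here `q1` and `q3` are the nearest questions, so `m1` (correct on both) ranks above `m2` (correct on one). Stacking the embeddings avoids one `cosine_similarity` call per row.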
