# Evaluation Script for Autocast Competition
Step 1: Generate `submission.zip` in the same format as the Autocast competition, and download `autocast_test_set_w_answers.csv` from [Google Drive](https://docs.google.com/spreadsheets/d/1O8kcHuLN7BbklXdHfRpVnJ65C5_qpCehBZPYQ6DRVl0/edit?usp=share_link).

Step 2: Put `submission.zip` and `autocast_test_set_w_answers.csv` in the same directory as `evaluation.ipynb`.

Step 3: Launch `evaluation.ipynb` and run its cells.

We validated this evaluation script with random predictions and obtained the same scores as the official Autocast leaderboard (`Combined Metric: 87.41, T/F: 25.00, MCQ: 39.13, NUM: 23.28`). Feel free to use it for the class competition.
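
As a rough illustration of Step 1, the sketch below writes a uniform baseline to `submission.zip`. It assumes, based on how the notebook reads the archive, that the zip holds a single top-level `predictions.pkl` with one entry per test question in test-set order: a probability vector for `t/f` and `mc` questions and a single float for `num` questions. The `questions` list and the `make_uniform_predictions` helper are hypothetical, and the guess of 0.5 for numeric questions is an assumption rather than a documented requirement.

```python
import pickle
import zipfile

import numpy as np


def make_uniform_predictions(questions):
    """Return one uniform prediction per question (hypothetical helper)."""
    preds = []
    for q in questions:
        if q["qtype"] in ("t/f", "mc"):
            n = len(q["choices"])
            preds.append(np.full(n, 1.0 / n))  # equal probability per choice
        else:
            preds.append(0.5)  # assumed midpoint guess for numeric questions
    return preds


# Hypothetical questions; in practice these come from the test set.
questions = [
    {"qtype": "t/f", "choices": ["no", "yes"]},
    {"qtype": "mc", "choices": ["A", "B", "C", "D"]},
    {"qtype": "num", "choices": None},
]
preds = make_uniform_predictions(questions)

# Store predictions.pkl at the top level of the archive.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.writestr("predictions.pkl", pickle.dumps(preds))
```

Writing the pickle straight into the archive avoids leaving a stray `predictions.pkl` on disk; any other way of zipping the same file should work equally well.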



In [None]:
import pandas as pd
import numpy as np
import pickle
import os

In [None]:
def brier_score(probabilities, answer_probabilities):
    return ((probabilities - answer_probabilities) ** 2).sum() / 2
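
For intuition, the function above halves the summed squared error, so a uniform two-way guess scores 0.25, which is consistent with the `T/F: 25.00` reported for random predictions. A small self-contained check with made-up inputs:

```python
import numpy as np


def brier_score(probabilities, answer_probabilities):
    # Same formula as the notebook cell: summed squared error, halved.
    return ((probabilities - answer_probabilities) ** 2).sum() / 2


answer = np.array([0.0, 1.0])  # one-hot: the correct choice is index 1

uniform = brier_score(np.array([0.5, 0.5]), answer)  # 0.25
perfect = brier_score(np.array([0.0, 1.0]), answer)  # 0.0
worst = brier_score(np.array([1.0, 0.0]), answer)    # 1.0
print(uniform, perfect, worst)
```

With this normalization the score ranges from 0 (all mass on the correct choice) to 1 (all mass on a wrong choice).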

In [None]:
import ast

answers_csv = pd.read_csv('autocast_test_set_w_answers.csv')
answers = []
qtypes = []
for _, question in answers_csv.iterrows():
    qtype = question['qtype']
    if qtype == 't/f':
        # One-hot answer: 'no' maps to index 0, any other answer to index 1.
        ans_idx = 0 if question['answers'] == 'no' else 1
        ans = np.zeros(len(ast.literal_eval(question['choices'])))
        ans[ans_idx] = 1
    elif qtype == 'mc':
        # Answers are letters: 'A' maps to index 0, 'B' to index 1, and so on.
        ans_idx = ord(question['answers']) - ord('A')
        ans = np.zeros(len(ast.literal_eval(question['choices'])))
        ans[ans_idx] = 1
    elif qtype == 'num':
        ans = float(question['answers'])
    else:
        raise ValueError(f"unexpected qtype: {qtype!r}")
    qtypes.append(qtype)
    answers.append(ans)

In [None]:
! mkdir -p submission
! unzip submission.zip -d submission
with open(os.path.join('submission', 'predictions.pkl'), 'rb') as f:
    preds = pickle.load(f)

Archive:  submission.zip
  inflating: submission/predictions.pkl  
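
Before scoring, it can help to sanity-check the loaded predictions, because `zip` in the scoring loop stops silently at the shorter sequence. The sketch below is one possible check and is not part of the official script: `check_predictions` is a hypothetical helper, and the requirement that probabilities sum to 1 (within a small tolerance) is an assumption.

```python
import numpy as np


def check_predictions(preds, answers, qtypes):
    """Raise ValueError if predictions do not line up with the answers."""
    if not (len(preds) == len(answers) == len(qtypes)):
        raise ValueError(
            f"expected {len(answers)} predictions, got {len(preds)}"
        )
    for i, (p, a, qtype) in enumerate(zip(preds, answers, qtypes)):
        if qtype in ("t/f", "mc"):
            p = np.asarray(p, dtype=float)
            if p.shape != np.shape(a):
                raise ValueError(f"question {i}: wrong number of choices")
            if not np.isclose(p.sum(), 1.0, atol=1e-6):
                raise ValueError(f"question {i}: probabilities do not sum to 1")
        elif np.ndim(p) != 0:
            raise ValueError(f"question {i}: numeric prediction must be a scalar")


# Made-up example: one t/f, one mc, and one numeric question.
answers = [np.array([0.0, 1.0]), np.array([0.0, 0.0, 1.0]), 0.3]
qtypes = ["t/f", "mc", "num"]
preds = [np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5]), 0.5]
check_predictions(preds, answers, qtypes)  # passes silently
```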


In [None]:
# T/F and MCQ questions are scored with the Brier score.
tf_results, mc_results, num_results = [], [], []
for p, a, qtype in zip(preds, answers, qtypes):
    if qtype == 't/f':
        tf_results.append(brier_score(p, a))
    elif qtype == 'mc':
        mc_results.append(brier_score(p, a))
    else:
        # Numeric questions are scored by absolute error.
        num_results.append(np.abs(p - a))

print(f"T/F: {np.mean(tf_results)*100:.2f}, MCQ: {np.mean(mc_results)*100:.2f}, NUM: {np.mean(num_results)*100:.2f}")
print(f"Combined Metric: {(np.mean(tf_results) + np.mean(mc_results) + np.mean(num_results))*100:.2f}")

T/F: 25.00, MCQ: 39.13, NUM: 23.28
Combined Metric: 87.41
