# Finding and using anchor points

In this notebook, we show how to find anchor points based on your training set and how to use them to estimate the performance of new models on the test set.

## Preparing data

Loading packages

In [1]:
import numpy as np
import pickle
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances
from irt import *
from utils import *

random_state = 42

The leaderboard dataset we will use is composed of six scenarios (sub-datasets):
1. TruthfulQA
1. GSM8K
1. Winogrande
1. ARC
1. HellaSwag
1. MMLU

MMLU is further divided into sub-scenarios (e.g., abstract algebra, anatomy). Let's check the scenarios and sub-scenarios:

In [2]:
scenarios

{'harness_truthfulqa_mc_0': ['harness_truthfulqa_mc_0'],
 'gsm8k': ['harness_gsm8k_5'],
 'winogrande': ['harness_winogrande_5'],
 'arc': ['harness_arc_challenge_25'],
 'hellaswag': ['harness_hellaswag_10'],
 'mmlu': ['harness_hendrycksTest_abstract_algebra_5',
  'harness_hendrycksTest_anatomy_5',
  'harness_hendrycksTest_astronomy_5',
  'harness_hendrycksTest_business_ethics_5',
  'harness_hendrycksTest_clinical_knowledge_5',
  'harness_hendrycksTest_college_biology_5',
  'harness_hendrycksTest_college_chemistry_5',
  'harness_hendrycksTest_college_computer_science_5',
  'harness_hendrycksTest_college_mathematics_5',
  'harness_hendrycksTest_college_medicine_5',
  'harness_hendrycksTest_college_physics_5',
  'harness_hendrycksTest_computer_security_5',
  'harness_hendrycksTest_conceptual_physics_5',
  'harness_hendrycksTest_econometrics_5',
  'harness_hendrycksTest_electrical_engineering_5',
  'harness_hendrycksTest_elementary_mathematics_5',
  'harness_hendrycksTest_formal_logic_5',
 

Loading leaderboard data:

In [3]:
with open('data/lb.pickle', 'rb') as handle:
    data = pickle.load(handle)

In this dataset, we have data from 395 models. The names of the first ten are shown below:

In [4]:
len(data['models']),data['models'][:10]

(395,
 ['open-llm-leaderboard/details_zhengr__MixTAO-7Bx2-MoE-DPO',
  'open-llm-leaderboard/details_alignment-handbook__zephyr-7b-sft-full',
  'open-llm-leaderboard/details_rombodawg__Leaderboard-killer-MoE_4x7b',
  'open-llm-leaderboard/details_FelixChao__ExtremeDolphin-MoE',
  'open-llm-leaderboard/details_LoSboccacc__orthogonal-2x7B-base',
  'open-llm-leaderboard/details_moreh__MoMo-70B-lora-1.8.6-DPO',
  'open-llm-leaderboard/details_deepseek-ai__deepseek-moe-16b-base',
  'open-llm-leaderboard/details_Swisslex__Mixtral-Orca-v0.1',
  'open-llm-leaderboard/details_wang7776__Mistral-7B-Instruct-v0.2-sparsity-20',
  'open-llm-leaderboard/details_nfaheem__Marcoroni-7b-DPO-Merge'])

Below, we will process the data so all correctness scores (for all scenarios) are stored in $Y$. The dictionaries `scenarios_position` and `subscenarios_position` give the position of scenarios/subscenarios correctness scores in $Y$.

In [5]:
scenarios_position, subscenarios_position = prepare_data(scenarios, data)
Y = create_responses(scenarios, data)
Y.shape

(395, 28659)

For example, below you can see the scores for MMLU:

In [6]:
Y[:,scenarios_position['mmlu']], Y[:,scenarios_position['mmlu']].shape

(array([[0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        ...,
        [0., 0., 1., ..., 1., 1., 0.],
        [0., 0., 1., ..., 1., 1., 0.],
        [1., 0., 1., ..., 1., 1., 0.]]),
 (395, 14042))

For scenarios that have multiple subscenarios, we usually want to give equal importance to each subscenario when computing the aggregated performance in that scenario. This is equivalent to a weighted average over items in which every item of subscenario $i$ receives weight $N/(n_{sub} \cdot n_i)$, where $N$ is the number of items in the scenario, $n_{sub}$ is the number of subscenarios, and $n_i$ is the number of items in subscenario $i$. We will create `balance_weights`, a vector of weights to help us compute those weighted averages. These weights differ from one only for MMLU, which is the only scenario with multiple subscenarios.

In [7]:
balance_weights = np.ones(Y.shape[1])

N = len(scenarios_position['mmlu'])
n_sub = len(scenarios['mmlu'])
for sub in scenarios['mmlu']:
    n_i = len(subscenarios_position['mmlu'][sub])
    balance_weights[subscenarios_position['mmlu'][sub]] = N/(n_sub*n_i)  

We can see below that first averaging within subscenarios and then computing a simple average is equivalent to using a weighted average from the beginning:

In [8]:
accs1 = np.mean([Y[:,subscenarios_position['mmlu'][sub]].mean(axis=1) for sub in scenarios['mmlu']], axis=0)
accs2 = (balance_weights*Y)[:,scenarios_position['mmlu']].mean(axis=1)

np.abs(accs1 - accs2).mean()

2.322333605307685e-14
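
The same equivalence can be checked by hand on a small example. The sketch below is only an illustration: the toy matrix `Y_toy` and the subscenario names `sub_a` and `sub_b` are made up and are not part of the leaderboard data.

```python
import numpy as np

# Hypothetical toy scenario: 3 models, 6 items, split into two subscenarios
rng = np.random.default_rng(0)
Y_toy = rng.integers(0, 2, size=(3, 6)).astype(float)
sub_positions = {"sub_a": [0, 1], "sub_b": [2, 3, 4, 5]}

N = Y_toy.shape[1]          # total number of items in the scenario
n_sub = len(sub_positions)  # number of subscenarios

# Same formula as above: items of subscenario i get weight N / (n_sub * n_i)
w = np.ones(N)
for pos in sub_positions.values():
    w[pos] = N / (n_sub * len(pos))

# Mean of per-subscenario accuracies vs. weighted mean over all items
macro = np.mean([Y_toy[:, pos].mean(axis=1) for pos in sub_positions.values()], axis=0)
weighted = (w * Y_toy).mean(axis=1)

print(w)
print(np.abs(macro - weighted).max())
```

Here the two items of `sub_a` get weight 1.5 and the four items of `sub_b` get weight 0.75, so each subscenario contributes half of the total weight, and the weights still sum to `N`.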

## Getting and using anchor points

Let's split the data into training and test sets. The first 100 rows of $Y$, which correspond to the most recent models, form the test set:

In [9]:
Y_test = Y[:100]
Y_train = Y[100:]

In [10]:
(balance_weights*Y_train)[:,scenarios_position['mmlu']].mean(axis=1).max()

0.7825506758303491

The variable `number_item` gives the number of anchor points we want to find in each scenario:

In [11]:
number_item = 100

The variable `clustering` specifies how the clustering is run. If `clustering="correct."`, the correctness scores of the training models are used as features for each example. If `clustering="irt"`, the IRT embeddings of the examples are used instead.

In [12]:
clustering = 'irt' # 'correct.' or 'irt'

Computing anchor points and their weights for each scenario:

In [13]:
anchor_points = {}
anchor_weights = {}

for scenario in scenarios.keys():

    if clustering=='correct.':
        # Each example is represented by the correctness of all training models on it
        X = Y_train[:,scenarios_position[scenario]].T
    elif clustering=='irt':
        # Each example is represented by its IRT parameters (A stacked with B)
        A, B, _ = load_irt_parameters('data/irt_model/')
        X = np.vstack((A.squeeze(), B.squeeze().reshape((1,-1)))).T
        X = X[scenarios_position[scenario]]
    else:
        raise NotImplementedError(f"Unknown clustering option: {clustering}")

    # Normalizing balance_weights so that they sum to one within each scenario
    norm_balance_weights = balance_weights[scenarios_position[scenario]]
    norm_balance_weights /= norm_balance_weights.sum()

    # Fitting the KMeans model
    kmeans = KMeans(n_clusters=number_item, n_init="auto", random_state=random_state)
    kmeans.fit(X, sample_weight=norm_balance_weights)

    # Calculating anchor points
    anchor_points[scenario] = pairwise_distances(kmeans.cluster_centers_, X, metric='euclidean').argmin(axis=1)

    # Calculating anchor weights
    anchor_weights[scenario] = np.array([np.sum(norm_balance_weights[kmeans.labels_==c]) for c in range(number_item)])
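
To make the two quantities computed in this loop concrete, here is a minimal, self-contained sketch on synthetic data. The sizes, the random correctness matrix `X_toy`, and the uniform weights `w_toy` are assumptions made purely for illustration; `n_init=10` is used instead of `"auto"` only so that the snippet also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.default_rng(0)
n_items, n_models, k = 200, 30, 10  # hypothetical sizes

# Rows are items, columns are the training models' correctness on that item
X_toy = rng.integers(0, 2, size=(n_items, n_models)).astype(float)
w_toy = np.full(n_items, 1.0 / n_items)  # uniform item weights summing to one

km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(X_toy, sample_weight=w_toy)

# Anchor point of a cluster: index of the item closest to the cluster centroid
anchors = pairwise_distances(km.cluster_centers_, X_toy, metric="euclidean").argmin(axis=1)

# Anchor weight of a cluster: total item weight assigned to that cluster
weights = np.array([w_toy[km.labels_ == c].sum() for c in range(k)])

print(anchors)
print(weights.sum())
```

Because every item belongs to exactly one cluster, the anchor weights sum to one, so they can be used directly as the coefficients of a weighted average over the anchor items.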

Saving

In [14]:
anchor = {'anchor_points':anchor_points,
          'anchor_weights':anchor_weights}

with open('data/anchor.pickle', 'wb') as handle:
    pickle.dump(anchor, handle, protocol=pickle.HIGHEST_PROTOCOL)

Checking results

In [15]:
anchor_points['mmlu']

array([ 6737, 14025, 13614, 10782,  2426, 12017, 11354,  7706,   142,
        1843,   181,  5738,  4678,  3765,  5034,  3233,  6516,  6383,
         291,  9383, 12945,  4562,    62,   344,   658,  2674, 11234,
        4252,  1087,   614, 11788,  2649,   758,   629,  6597,  5626,
        4608,  2662,  2028,  9185,  3090,  4864,  7011,  3754,  9916,
        6844,  2725, 11787,  3305,  9671,  1322, 13547,  1765, 11920,
        3323,  7792,  8155, 10401,  8513, 10111,  7406,  8616,  5856,
        8889,  1808,  1631,  4148,  1013, 11356,  1110,  6380,  3777,
       10627,  3038,  6659,  7355,  2682,  8111,  5097, 13835, 10257,
        3064, 12424,  1177,   633,  6655,  6342,  3743,  4222,  8880,
       13712,  4451,  5277,  2079,  2386, 12375,  5915,  7750, 10024,
       10493])

In [16]:
anchor_weights['mmlu']

array([0.00825099, 0.01018247, 0.01093607, 0.00596034, 0.00826069,
       0.00648536, 0.02536688, 0.01830385, 0.00895905, 0.00945692,
       0.00790874, 0.00938025, 0.00797504, 0.0325375 , 0.00995397,
       0.00853033, 0.00718956, 0.00496484, 0.01163898, 0.00594733,
       0.00847858, 0.00478626, 0.00913403, 0.00936453, 0.0107204 ,
       0.00852264, 0.0096986 , 0.00947315, 0.00777175, 0.00833743,
       0.00810972, 0.00752109, 0.01260261, 0.00855586, 0.00787091,
       0.00923388, 0.01315827, 0.01334997, 0.00804081, 0.00876967,
       0.00726678, 0.00872653, 0.00672563, 0.00935088, 0.01420798,
       0.00823026, 0.01041884, 0.00915723, 0.01092038, 0.00954358,
       0.00824789, 0.01266745, 0.00881998, 0.00694057, 0.00852781,
       0.00871871, 0.00712828, 0.00750783, 0.0026459 , 0.00984187,
       0.0054721 , 0.01061534, 0.03047136, 0.00507286, 0.01021506,
       0.01056058, 0.01038674, 0.00877227, 0.00988953, 0.00913583,
       0.0104087 , 0.02252575, 0.00774421, 0.00885257, 0.00805

Using anchor points to estimate performance on the test set and reporting the mean absolute prediction error across the test models:

In [17]:
for scenario in scenarios.keys():
    Y_anchor = Y_test[:,scenarios_position[scenario]][:,anchor_points[scenario]]
    Y_hat = (Y_anchor*anchor_weights[scenario]).sum(axis=1)
    Y_true = (balance_weights*Y_test)[:,scenarios_position[scenario]].mean(axis=1)

    print(f"scenario: {scenario}, avg. error: {np.abs(Y_hat-Y_true).mean():.3f}")

scenario: harness_truthfulqa_mc_0, avg. error: 0.016
scenario: gsm8k, avg. error: 0.019
scenario: winogrande, avg. error: 0.024
scenario: arc, avg. error: 0.023
scenario: hellaswag, avg. error: 0.020
scenario: mmlu, avg. error: 0.028
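
The estimate used above is a weighted sum of correctness on the anchor items. The toy example below sketches why this can work: when all items in a cluster behave like their anchor, the weighted sum recovers the full average exactly. The helper name `estimate_accuracy` and the toy arrays are hypothetical and chosen only for illustration.

```python
import numpy as np

def estimate_accuracy(Y_new, anchor_idx, anchor_w):
    """Weighted sum of correctness on the anchor items, one estimate per model."""
    return (Y_new[:, anchor_idx] * anchor_w).sum(axis=1)

# Hypothetical toy data: 2 models, 6 items; items 0-3 behave alike, as do items 4-5
Y_new = np.array([[1., 1., 1., 1., 0., 0.],
                  [0., 0., 0., 0., 1., 1.]])
anchor_idx = np.array([0, 4])        # one representative item per cluster
anchor_w = np.array([4 / 6, 2 / 6])  # share of items in each cluster

y_hat = estimate_accuracy(Y_new, anchor_idx, anchor_w)
y_true = Y_new.mean(axis=1)

print(y_hat, y_true)
```

On real data the items in a cluster only approximately match their anchor, which is where the prediction errors reported above come from.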
