# Training Item Response Theory (IRT) models

In this notebook, we show how to train your own Item Response Theory (IRT) models.

## Preparing data

Loading packages:

In [24]:
import numpy as np
import pickle
from tqdm import tqdm
from irt import *
from utils import *

random_state = 42

The leaderboard dataset we will use is composed of six scenarios (sub-datasets):
1. TruthfulQA
1. GSM8K
1. Winogrande
1. ARC
1. HellaSwag
1. MMLU

MMLU is further divided into sub-scenarios (e.g., abstract algebra, anatomy, etc.). Let's check the scenarios and sub-scenarios:

In [25]:
scenarios

{'harness_truthfulqa_mc_0': ['harness_truthfulqa_mc_0'],
 'gsm8k': ['harness_gsm8k_5'],
 'winogrande': ['harness_winogrande_5'],
 'arc': ['harness_arc_challenge_25'],
 'hellaswag': ['harness_hellaswag_10'],
 'mmlu': ['harness_hendrycksTest_abstract_algebra_5',
  'harness_hendrycksTest_anatomy_5',
  'harness_hendrycksTest_astronomy_5',
  'harness_hendrycksTest_business_ethics_5',
  'harness_hendrycksTest_clinical_knowledge_5',
  'harness_hendrycksTest_college_biology_5',
  'harness_hendrycksTest_college_chemistry_5',
  'harness_hendrycksTest_college_computer_science_5',
  'harness_hendrycksTest_college_mathematics_5',
  'harness_hendrycksTest_college_medicine_5',
  'harness_hendrycksTest_college_physics_5',
  'harness_hendrycksTest_computer_security_5',
  'harness_hendrycksTest_conceptual_physics_5',
  'harness_hendrycksTest_econometrics_5',
  'harness_hendrycksTest_electrical_engineering_5',
  'harness_hendrycksTest_elementary_mathematics_5',
  'harness_hendrycksTest_formal_logic_5',
 

In [26]:
SELECTED_SCENARIOS = ['gsm8k', 'arc', 'hellaswag', 'harness_truthfulqa_mc_0']

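# Group the subscenarios of the selected scenarios (TruthfulQA, GSM8K, ARC, HellaSwag) under a single 'lb' scenario
lb_scenarios = {'lb': []}
for scenario in scenarios.keys():
    if scenario in SELECTED_SCENARIOS:
        # extend (rather than append(*...)) also works for scenarios with several subscenarios
        lb_scenarios['lb'].extend(scenarios[scenario])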

Loading leaderboard data:

In [27]:
with open('data/lb_scenarios.pickle', 'rb') as handle:
    data = pickle.load(handle)
    data['models'] = data['models'][0]

In this dataset, we have data from 393 models. Let's see the names of some of them below:

In [28]:
len(data['models']), data['models'][:10]

(393,
 ['open-llm-leaderboard/details_moreh__MoMo-70B-lora-1.8.6-DPO',
  'open-llm-leaderboard/details_cloudyu__Yi-34Bx3-MoE-90B',
  'open-llm-leaderboard/details_Weyaxi__Helion-4x34B',
  'open-llm-leaderboard/details_Weyaxi__Bagel-Hermes-34B-Slerp',
  'open-llm-leaderboard/details_Weyaxi__Bagel-Hermes-2x34b',
  'open-llm-leaderboard/details_nfaheem__Marcoroni-7b-DPO-Merge',
  'open-llm-leaderboard/details_jondurbin__bagel-dpo-34b-v0.2',
  'open-llm-leaderboard/details_udkai__Turdus',
  'open-llm-leaderboard/details_gagan3012__MetaModel_moe',
  'open-llm-leaderboard/details_jeonsworld__CarbonVillain-en-10.7B-v3'])

Below, we will process the data so all correctness scores (for all scenarios) are stored in $Y$. The dictionaries `scenarios_position` and `subscenarios_position` give the position of scenarios/subscenarios correctness scores in $Y$.

In [29]:
scenarios_position, subscenarios_position = prepare_data(lb_scenarios, data)
Y = create_responses(lb_scenarios, data)
Y.shape

(393, 13350)

In [30]:
subscenarios_position['lb'].keys()

dict_keys(['harness_truthfulqa_mc_0', 'harness_gsm8k_5', 'harness_arc_challenge_25', 'harness_hellaswag_10'])
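To make the role of the position dictionaries concrete, here is a minimal sketch on toy data. The names `toy_Y`, `toy_subscenarios_position` and `toy_scenarios_position` are made up for illustration, and we assume (consistent with how the dictionaries are used to index `Y` in this notebook) that each entry is a list of column indices.

```python
import numpy as np

# Toy response matrix: 3 models (rows) x 5 items (columns).
toy_Y = np.array([
    [1.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
])

# Hypothetical layout: subscenario "a" owns columns 0-1, "b" owns columns 2-4.
toy_subscenarios_position = {"lb": {"a": [0, 1], "b": [2, 3, 4]}}
# The scenario's columns are the union of its subscenarios' columns.
toy_scenarios_position = {
    "lb": [i for cols in toy_subscenarios_position["lb"].values() for i in cols]
}

# Per-model accuracy on subscenario "b" and on the whole scenario.
acc_b = toy_Y[:, toy_subscenarios_position["lb"]["b"]].mean(axis=1)
acc_lb = toy_Y[:, toy_scenarios_position["lb"]].mean(axis=1)
print(acc_b, acc_lb)
```

Indexing the columns of the response matrix with these lists is all that is needed to recover per-scenario or per-subscenario accuracies.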

In [31]:
# fill nan values with 0
Y[np.isnan(Y)] = 0

# print stats of Y
print('Y stats:')
print('min:', np.min(Y))
print('max:', np.max(Y))
print('mean:', np.mean(Y))
print('std:', np.std(Y))

Y stats:
min: 0.0
max: 1.0000000000000002
mean: 0.7231160348021274
std: 0.443520043003996


For example, below you can see the scores for the combined `lb` scenario:

In [32]:
Y[:,scenarios_position['lb']], Y[:,scenarios_position['lb']].shape

(array([[0.99999999, 0.99994655, 1.        , ..., 1.        , 0.        ,
         1.        ],
        [0.99984026, 0.99999999, 0.99908489, ..., 1.        , 1.        ,
         1.        ],
        [0.99937792, 0.99999997, 0.99732544, ..., 1.        , 1.        ,
         1.        ],
        ...,
        [0.59677787, 0.99801392, 0.20934594, ..., 0.        , 0.        ,
         0.        ],
        [0.67288482, 0.99814252, 0.3865014 , ..., 0.        , 1.        ,
         1.        ],
        [0.42345796, 0.99926741, 0.82057698, ..., 0.        , 0.        ,
         0.        ]]),
 (393, 13350))

For scenarios that have multiple subscenarios, we usually want to give equal importance to each subscenario when computing the aggregated performance in that scenario. This is equivalent to using a weighted average over individual items. We will create `balance_weights`, a vector of weights to help us compute those weighted averages. In the full leaderboard, these weights differ from one only for MMLU, the only scenario with multiple subscenarios; here, the combined `lb` scenario groups four subscenarios, so its items are reweighted as well.

We will use this when choosing the IRT dimension.

In [33]:
balance_weights = np.ones(Y.shape[1])

selected_scenarios = lb_scenarios['lb']

N = len(scenarios_position['lb'])
n_sub = len(selected_scenarios)
for sub in selected_scenarios:
    n_i = len(subscenarios_position['lb'][sub])
    balance_weights[subscenarios_position['lb'][sub]] = N/(n_sub*n_i)  

We can see below that first averaging within subscenarios and then computing a simple average is equivalent to using a weighted average from the beginning:

In [34]:
accs1 = np.mean([Y[:,subscenarios_position['lb'][sub]].mean(axis=1) for sub in lb_scenarios['lb']], axis=0)
accs2 = (balance_weights*Y)[:,scenarios_position['lb']].mean(axis=1)

np.abs(accs1 - accs2).mean()

1.1356762293373007e-14
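The equivalence follows directly from the definition of the weights. The notation below is ours, not the notebook's: $S$ is the number of subscenarios (`n_sub`), $I_s$ the item indices of subscenario $s$, $n_s = |I_s|$ (`n_i`), and $N = \sum_s n_s$, which assumes the subscenarios partition the scenario's items. For model $j$:

```latex
\frac{1}{N}\sum_{s=1}^{S}\sum_{i\in I_s}\frac{N}{S\,n_s}\,Y_{ji}
\;=\;
\frac{1}{S}\sum_{s=1}^{S}\left(\frac{1}{n_s}\sum_{i\in I_s}Y_{ji}\right)
```

The left-hand side is the weighted average (`accs2`) and the right-hand side is the average of per-subscenario means (`accs1`), so the two can differ only by floating-point error.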

## Training IRT

Let's split the data into train and test sets (recent models are placed in the test set):

In [35]:
Y_test = Y[:100]
Y_train = Y[100:]

To train the IRT model, we first need to binarize the values in $Y$ because correctness is not binary for TruthfulQA.

In [36]:
Y_bin_train = np.zeros(Y_train.shape)
Y_bin_test = np.zeros(Y_test.shape)

cs = np.linspace(0.01,.99,100)  # Threshold values to consider
for scenario in lb_scenarios.keys():
    ind = scenarios_position[scenario]
    # Find the best threshold value that minimizes the difference between averages
    c = cs[np.argmin([np.mean((np.abs((Y_train[:,ind]>c).mean(axis=1)-Y_train[:,ind].mean(axis=1)))) for c in tqdm(cs)])]
    # Apply the threshold to train and test responses
    Y_bin_train[:,ind] = (Y_train[:,ind]>c).astype(int)
    Y_bin_test[:,ind] = (Y_test[:,ind]>c).astype(int)

100%|██████████| 100/100 [00:01<00:00, 59.16it/s]
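The threshold search above can be reproduced on toy data. This is only a sketch: `toy_scores` is synthetic (drawn from a Beta distribution purely for illustration) and `mean_abs_gap` is a hypothetical helper that spells out the expression inside the `np.argmin` call.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy continuous correctness scores in [0, 1]: 20 models x 200 items.
toy_scores = rng.beta(2.0, 1.0, size=(20, 200))

thresholds = np.linspace(0.01, 0.99, 100)

def mean_abs_gap(scores, c):
    """Average |binarized accuracy - continuous accuracy| over models."""
    binarized_acc = (scores > c).mean(axis=1)
    continuous_acc = scores.mean(axis=1)
    return np.abs(binarized_acc - continuous_acc).mean()

# Pick the threshold whose binarized accuracies best match the continuous ones.
gaps = np.array([mean_abs_gap(toy_scores, c) for c in thresholds])
best_c = thresholds[np.argmin(gaps)]
toy_bin = (toy_scores > best_c).astype(int)
print(best_c, gaps.min())
```

The point of choosing the threshold this way is that the binarized responses keep, on average, the same per-model accuracy as the original continuous scores.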


Choosing the dimension for the IRT model based on a simple validation heuristic:

In [37]:
Ds = [2,5,10,15] # Dimensions to try
device = 'cuda' # Either 'cuda' or 'cpu' 
epochs = 2000  # Number of epochs for IRT model training (py-irt default is 2000)
lr = .1  # Learning rate for IRT model training (py-irt default is .1)

val_ind = list(range(0,Y_bin_train.shape[0],5)) # Validation indices
train_ind = [i for i in range(Y_bin_train.shape[0]) if i not in val_ind]

# Saving the training dataset in the needed format
create_irt_dataset(Y_bin_train[train_ind], 'data/irt_val_dataset.jsonlines')

# Trying different Ds
errors = []  
errors2 = []

for D in tqdm(Ds):
    dataset_name = 'data/irt_val_dataset.jsonlines'
    model_name = 'data/irt_val_model/'
    
    # Train the IRT model and load its parameters
    train_irt_model(dataset_name, model_name, D, lr, epochs, device)
    A, B, Theta = load_irt_parameters(model_name)
    
    # Determine seen and unseen items for validation
    seen_items = list(range(0, Y_bin_train.shape[1], 2))
    unseen_items = list(range(1, Y_bin_train.shape[1], 2))

    # Estimate ability parameters for the validation set
    thetas = [estimate_ability_parameters(Y_bin_train[val_ind][j][seen_items], A[:, :, seen_items], B[:, :, seen_items]) for j in range(len(val_ind))]

    # Compute validation errors for each scenario and update the errors list (in the end, we give the same weight for all scenarios)
    errors2.append([])
    for scenario in lb_scenarios.keys():
        ind = [u for u in unseen_items if u in scenarios_position[scenario]]
        errors2[-1].append(np.mean([abs((balance_weights*item_curve(thetas[j], A, B))[0,ind].mean()-Y_train[val_ind][j,ind].mean()) for j in range(len(val_ind))]))
    errors.append(np.mean(errors2[-1]))

  0%|          | 0/4 [00:00<?, ?it/s]

[20:33:08] config: model_type='multidim_2pl' epochs=2000              cli.py:109
           priors='hierarchical' initializers=[] dims=2 lr=0.1                  
           lr_decay=0.9999 dropout=0.5 hidden=100 vocab_size=None               
           log_every=200 seed=42 deterministic=True                             
           data_path: data/irt_val_dataset.jsonlines                  cli.py:111
           output directory: data/irt_val_model/                      cli.py:112
[20:33:08] amortized: False                                       dataset.py:112
[20:33:17] Vocab size: None                                       training.py:90
[20:33:17] Training Model...                                          cli.py:116
           args: {'device': 'cuda', 'num_items': 13350,          training.py:134
           'num_subjects': 234}                                                 
           Parsed Model Args: {'device': 'cuda', 'num_items':    training.py:147
           13350, 'num_subje

 25%|██▌       | 1/4 [00:56<02:49, 56.43s/it]

[20:34:04] config: model_type='multidim_2pl' epochs=2000              cli.py:109
           priors='hierarchical' initializers=[] dims=5 lr=0.1                  
           lr_decay=0.9999 dropout=0.5 hidden=100 vocab_size=None               
           log_every=200 seed=42 deterministic=True                             
           data_path: data/irt_val_dataset.jsonlines                  cli.py:111
           output directory: data/irt_val_model/                      cli.py:112
[20:34:05] amortized: False                                       dataset.py:112
[20:34:12] Vocab size: None                                       training.py:90
[20:34:13] Training Model...                                          cli.py:116
[20:34:13] args: {'device': 'cuda', 'num_items': 13350,          training.py:134
           'num_subjects': 234}                                                 
           Parsed Model Args: {'device': 'cuda', 'num_items':    training.py:147
           13350, 'num_subje

 50%|█████     | 2/4 [01:58<01:59, 59.50s/it]

[20:35:06] config: model_type='multidim_2pl' epochs=2000              cli.py:109
           priors='hierarchical' initializers=[] dims=10 lr=0.1                 
           lr_decay=0.9999 dropout=0.5 hidden=100 vocab_size=None               
           log_every=200 seed=42 deterministic=True                             
           data_path: data/irt_val_dataset.jsonlines                  cli.py:111
           output directory: data/irt_val_model/                      cli.py:112
[20:35:06] amortized: False                                       dataset.py:112
[20:35:14] Vocab size: None                                       training.py:90
[20:35:14] Training Model...                                          cli.py:116
           args: {'device': 'cuda', 'num_items': 13350,          training.py:134
           'num_subjects': 234}                                                 
           Parsed Model Args: {'device': 'cuda', 'num_items':    training.py:147
           13350, 'num_subje

In [15]:
ind_D = np.argmin(np.array(errors))
D = Ds[ind_D]

Saving the training dataset in the needed format:

In [16]:
create_irt_dataset(Y_bin_train, 'data/irt_dataset.jsonlines')

To train the IRT model, we use an adapted version of the `py-irt` code (please check the README in the tutorials directory for more details).

In [17]:
train_irt_model(dataset_name='data/irt_dataset.jsonlines', 
                model_name='data/irt_model', 
                D=D, lr=lr, epochs=epochs, device=device)               

[18:43:13] config: model_type='multidim_2pl' epochs=2000              cli.py:109
           priors='hierarchical' initializers=[] dims=10 lr=0.1                 
           lr_decay=0.9999 dropout=0.5 hidden=100 vocab_size=None               
           log_every=200 seed=42 deterministic=True                             
           data_path: data/irt_dataset.jsonlines                      cli.py:111
           output directory: data/irt_model                           cli.py:112
[18:43:14] amortized: False                                       dataset.py:112
[18:43:23] Vocab size: None                                       training.py:90
[18:43:24] Training Model...                                          cli.py:116
[18:43:24] args: {'device': 'cuda', 'num_items': 13350,          training.py:134
           'num_subjects': 293}                                                 
           Parsed Model Args: {'device': 'cuda', 'num_items':    training.py:147
           13350, 'num_subje

## Get the values for $\lambda$ which will be used to get the gp-IRT estimates (in conjunction with anchor methods)

In [18]:
def get_lambda(b, v):
    return (b**2)/(v+(b**2))

The variable `number_item` gives the number of data points we will sample per scenario:

In [19]:
number_item = 30

In [20]:
def estimate_lambdas(errors2, Y_train, scenarios_position, lb_scenarios, number_item):
    lambds = {} 

    for i,scenario in enumerate(lb_scenarios.keys()):
        v = np.var(Y_train[:,scenarios_position[scenario]], axis=1).mean()
        b = np.mean(errors2[ind_D][i]) 
        lambds[scenario] = get_lambda(b, v/(4*number_item))

    return lambds
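To get a feel for how $\lambda$ behaves, the sketch below evaluates the same formula for several sample sizes. The values of `b` and `v` are hypothetical and chosen only for illustration; in `estimate_lambdas` they come from the validation errors of the selected dimension and from the average per-model variance of the responses.

```python
def get_lambda(b, v):
    # Same formula as the notebook's get_lambda.
    return (b**2) / (v + (b**2))

# Hypothetical values, chosen only for illustration.
b = 0.02   # stand-in for the validation error of the IRT-based estimate
v = 0.2    # stand-in for the average per-model variance of the responses

ns = [10, 15, 20, 30, 50]
# As in estimate_lambdas, the variance term is scaled by 1 / (4 * n).
lams = [get_lambda(b, v / (4 * n)) for n in ns]
for n, lam in zip(ns, lams):
    print(n, lam)
```

For fixed `b` and positive `v`, $\lambda$ stays strictly between 0 and 1 and increases with the number of sampled items, because the `v/(4*n)` term in the denominator shrinks.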

In [23]:
for num_item in [10, 15, 20, 30, 50]:
    lambds = estimate_lambdas(errors2, Y_train, scenarios_position, lb_scenarios, num_item)
    
    # Load the anchor data for this number of items
    with open(f'data/lb_anchor_{num_item}.pickle', 'rb') as handle:
        anchor_data = pickle.load(handle) # contains 'seen_examples', 'examples_weights', 'scenarios_position', 'subscenarios_position'
    # Save everything the tiny benchmark needs in a single file.
    # Note: A and B are the parameters most recently returned by load_irt_parameters (the validation loop above);
    # reload them from the final model's output directory first if those are the ones you intend to save.
    with open(f'data/tinybenchmark_lb_{num_item}.pickle', 'wb') as handle:
        pickle.dump({'seen_examples': anchor_data['seen_examples'], 
                     'examples_weights': anchor_data['examples_weights'], 
                     'irt_parameters': {'A': A, 'B': B}, 
                     'scenarios_position': anchor_data['scenarios_position'], 
                     'subscenarios_position': anchor_data['subscenarios_position'], 
                     'optimal_lambdas': lambds}, handle, protocol=pickle.HIGHEST_PROTOCOL)
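As a sanity check, a saved file can be read back and inspected. The sketch below does a round trip on a toy dictionary with the same top-level keys; the contents, array shapes and the file name `toy_tinybenchmark.pickle` are made up for illustration and do not reflect the real data.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-in with the same top-level keys as the dictionary saved above.
toy_bundle = {
    "seen_examples": {"toy_sub": np.array([2, 0])},
    "examples_weights": {"toy_sub": np.array([0.5, 0.5])},
    "irt_parameters": {"A": np.ones((1, 2, 3)), "B": np.zeros((1, 1, 3))},  # made-up shapes
    "scenarios_position": {"lb": [0, 1, 2]},
    "subscenarios_position": {"lb": {"toy_sub": [0, 1, 2]}},
    "optimal_lambdas": {"lb": 0.1},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_tinybenchmark.pickle")
    # Write with the same protocol as the notebook, then read the file back.
    with open(path, "wb") as handle:
        pickle.dump(toy_bundle, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

print(sorted(loaded.keys()))
```

Checking the keys (and, for the real files, the shapes of `A` and `B`) right after saving is a cheap way to catch a mismatch before the file is used downstream.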

In [22]:
anchor_data['seen_examples']

{'harness_truthfulqa_mc_0': array([305,  77, 532, 788, 231, 746, 715, 454, 354, 411, 249, 728, 260,
        448,  21,  66,  60, 545, 520, 457, 633, 559, 399, 758, 160, 196,
        668, 273, 636,  82, 566, 546, 292,  92, 666, 711, 561, 644, 127,
        518, 271, 234, 699, 171, 449, 186, 491,  96, 219, 496]),
 'harness_gsm8k_5': array([ 494, 1257,   75,  236, 1290,  906,  675,   83,  956,   38,  683,
        1234,  664, 1187, 1239,  331, 1141,  427,  311,  943,  144,  834,
         536,  362, 1112,  945, 1140, 1103, 1272,  863,  506,  638,  490,
         251,  196,  217,  442,  794,  723,  435, 1011,  775,  198,  999,
          90,  473, 1284,  990,  980, 1224]),
 'harness_arc_challenge_25': array([ 339,  672, 1140, 1136,  787,  512, 1082,  182,   31,   35, 1132,
         126,  228,  412,   87,  131,  693,   70,  270, 1160,  832, 1063,
         692,   58, 1047,  435,  753,  613,  422,  585, 1065, 1151,  309,
         821,  679,   14,  929, 1022,  906,  397, 1033, 1153,  184,  266,
    