# Fluid Benchmarking Demo

The core entry point to Fluid Benchmarking is `evaluation.fluid_benchmarking()`. It needs two inputs:

- `lm_responses_r`: an rpy2 vector of binary LM responses (0 = incorrect, 1 = correct), in the same item order as the IRT model.
- `irt_model_r`: an rpy2 dataframe with the benchmark's IRT item parameters.

Key hyperparameters:

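- `start_ability`: initial ability estimate (this demo starts at 0).
- `n_max`: maximum number of items to administer.
- `selection_method`: item selection method (this demo uses `"MFI"`, maximum Fisher information).
- `estimation_method`: ability estimation method (this demo uses `"BM"`, Bayes modal estimation, i.e. MAP).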

Outputs:

- `abilities_fb`: list of provisional ability estimates, one after each administered item.
- `items_fb`: list of row indices (into the IRT model) of the administered items, in order of administration.
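
To make the `"MFI"` selection rule more concrete, here is a minimal sketch of maximum-Fisher-information item selection. It is an illustration only, not the library's implementation: it assumes a two-parameter logistic (2PL) IRT model, and the names `p_correct`, `fisher_information`, `select_next_item` and the toy `(a, b)` item parameters are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, administered):
    """Return the index of the not-yet-administered item that is most
    informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Toy item bank of (discrimination a, difficulty b) pairs (made-up values).
items = [(1.0, -2.0), (1.5, 0.1), (0.8, 2.5)]
next_item = select_next_item(0.0, items, administered=set())
```

With an ability estimate of 0, the sketch picks the item whose difficulty is closest to 0 and whose discrimination is highest, which is the intuition behind adaptive item selection: administer the item that tells you the most about the current estimate.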

In [1]:
import numpy as np
import rpy2.robjects as ro

from fluid_benchmarking import config, datasets, evaluation, indexing, rutils

In [2]:
# Choose LM and benchmark
lm = "olmo2-7b"
benchmark = "hellaswag"

In [3]:
# Load IRT model for selected benchmark
irt_model = datasets.load_irt_model(
    config.HF_REPO_ID,
    config.IRT_MODELS_PATH.format(benchmark)
)

In [4]:
# Load evaluation results and filter to selected benchmark
lm_eval_results = datasets.load_lm_eval_results(
    config.HF_REPO_ID,
    config.LM_EVAL_RESULTS_PATH.format(lm)
)
lm_eval_results_benchmark = indexing.filter_benchmark(
    lm_eval_results, 
    benchmark
)

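# Check that item order is identical to IRT model
# (Index.equals also handles indexes of different lengths)
assert lm_eval_results_benchmark.index.equals(irt_model.index), \
    "Item order of LM responses does not match the IRT model"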

# Use the last checkpoint (final column)
last_checkpoint = lm_eval_results_benchmark.columns[-1]
lm_responses = np.array(lm_eval_results_benchmark[last_checkpoint])
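
Because `fluid_benchmarking()` expects binary responses in the same item order as the IRT model, a quick sanity check at this point can catch problems early. The helper below is a hypothetical sketch, not part of the `fluid_benchmarking` package; it verifies length and values and returns the plain accuracy for reference.

```python
def check_binary_responses(responses, n_items):
    """Validate a response vector and return its mean accuracy.

    Hypothetical helper: raises ValueError if the vector is empty, has the
    wrong length, or contains anything other than 0 and 1.
    """
    values = list(responses)
    if not values:
        raise ValueError("No responses provided")
    if len(values) != n_items:
        raise ValueError(f"Expected {n_items} responses, got {len(values)}")
    bad = [v for v in values if v not in (0, 1)]
    if bad:
        raise ValueError(f"Non-binary responses found, e.g. {bad[:5]}")
    return sum(values) / len(values)


# Toy example; in this notebook the call would be
# check_binary_responses(lm_responses, len(irt_model)).
accuracy = check_binary_responses([1, 0, 1, 1], n_items=4)
```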

In [5]:
# Convert LM responses and IRT model to rpy2 objects
lm_responses_r = ro.IntVector(lm_responses)
irt_model_r = rutils.pandas2r(irt_model.reset_index(drop=True))

In [6]:
# Set hyperparameters
start_ability = 0
n_max = min(500, len(irt_model))
selection_method = "MFI"  # Maximum Fisher information
estimation_method = "BM"  # Bayes modal estimation (MAP)
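
For intuition about the `"BM"` setting, the following is a minimal sketch of Bayes modal (MAP) ability estimation. It is not the library's implementation: it assumes a 2PL IRT model and a standard normal prior on ability, and `map_ability` and the toy item parameters are hypothetical. The estimate is found with Newton's method on the log-posterior.

```python
import math


def map_ability(responses, items, start=0.0, n_iter=50, tol=1e-8):
    """MAP ability estimate under an assumed 2PL model and N(0, 1) prior.

    responses: iterable of 0/1 outcomes for the administered items.
    items: iterable of (discrimination a, difficulty b) pairs, same order.
    """
    theta = start
    for _ in range(n_iter):
        # Gradient and Hessian of the log-prior -theta^2 / 2.
        grad = -theta
        hess = -1.0
        # Add the log-likelihood contributions of each administered item.
        for u, (a, b) in zip(responses, items):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            grad += a * (u - p)
            hess -= a * a * p * (1.0 - p)
        step = grad / hess
        theta -= step  # Newton update
        if abs(step) < tol:
            break
    return theta


# One correct answer on an item with a = 1, b = 0 pulls the estimate above 0;
# one incorrect answer pulls it below 0 by the same amount.
theta_up = map_ability([1], [(1.0, 0.0)])
theta_down = map_ability([0], [(1.0, 0.0)])
```

The prior keeps the estimate finite even when every response so far is correct (or incorrect), which is one reason a Bayesian estimator is convenient in the first few steps of an adaptive test.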

In [7]:
# Run Fluid Benchmarking
abilities_fb, items_fb = evaluation.fluid_benchmarking(
    lm_responses_r=lm_responses_r,
    irt_model_r=irt_model_r,
    start_ability=start_ability,
    n_max=n_max,
    selection_method=selection_method,
    estimation_method=estimation_method,
)

In [8]:
# First k provisional ability estimates and administered items
k = 5
print(f"First {k} ability estimates:", [round(x, 3) for x in abilities_fb[:k]])
print(f"First {k} administered items:", [irt_model.index[i] for i in items_fb[:k]])

First 5 ability estimates: [0.404, 0.818, 1.035, 1.357, 1.587]
First 5 administered items: ['hellaswag_760', 'hellaswag_3081', 'hellaswag_3518', 'hellaswag_8158', 'hellaswag_347']


In [9]:
# Load mapping from item IDs to item content
id2item = datasets.load_id_to_item_map()

In [10]:
# Inspect the first administered item
id2item[irt_model.index[items_fb[0]]]

{'example': 'Education and Communications: How to beat the pink tax. Read labels carefully. Before making a purchase, scrutinize the label. This can help you identify the pink tax.',
 'choices': ["You may think certain products are marked up fairly, because they contain different ingredients or come in greater amounts. However, you may find many women's products are similar to men's products but needlessly marked up.",
  'In particular, watch out for words and abbreviations which indicate a high quality product and rich background. Looking for this information can be a good indicator of whether or not the individual or business should buy this product.',
  'Make sure it states in red letters exactly how many days the policies have expired.. Do not buy products or companies that claim to be pink tax-free.',
  'This knowledge will help you be able to analyze the products you buy and plan those purchases accordingly. Here are some of the known labels that can help you locate pink tax : La