# Benchmarks

Benchmarks combine a dataset, a metric, and an experimental paradigm.
They take a model candidate as input and output a score that measures how closely the model's predictions, obtained under the experimental paradigm, match the data, as evaluated by the metric.

### Pre-defined benchmarks

The Brain-Score community has defined many benchmarks on which models can be scored, so we can easily test a new or existing model on a variety of experimental datasets. This scales because all models implement the same model interface, and all benchmarks in turn use only methods from this interface.
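
The decoupling between models and benchmarks can be pictured with a toy sketch. Every name below (`ToyModel`, `ToyBenchmark`, a `look_at` that returns a plain list, the error-based score) is a hypothetical stand-in for illustration and is not the Brain-Score API. The only point is that the benchmark calls nothing beyond the shared interface, so any model implementing it can be scored.

```python
from typing import Protocol, Sequence


class ToyModel(Protocol):
    """Hypothetical minimal model interface (not the real BrainModel)."""

    def look_at(self, stimuli: Sequence[float]) -> Sequence[float]: ...


class DoublingModel:
    def look_at(self, stimuli):
        return [2.0 * s for s in stimuli]


class IdentityModel:
    def look_at(self, stimuli):
        return [float(s) for s in stimuli]


class ToyBenchmark:
    """Bundles a dataset, a metric, and a paradigm behind __call__."""

    def __init__(self, stimuli, measurements):
        self.stimuli = list(stimuli)            # dataset: what was shown
        self.measurements = list(measurements)  # dataset: what was recorded

    def __call__(self, candidate: ToyModel) -> float:
        # paradigm: show the same stimuli to the model
        predictions = candidate.look_at(self.stimuli)
        # metric: mean absolute error, mapped so that 1.0 means a perfect match
        errors = [abs(p - m) for p, m in zip(predictions, self.measurements)]
        mean_error = sum(errors) / len(errors)
        return 1.0 / (1.0 + mean_error)


benchmark = ToyBenchmark(stimuli=[1.0, 2.0, 3.0], measurements=[2.0, 4.0, 6.0])
toy_scores = {type(m).__name__: benchmark(m) for m in (DoublingModel(), IdentityModel())}
```

Both toy models are scored by the same benchmark object without any model-specific code, which is the property that lets a single benchmark be applied to many models.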

In [None]:
from brainscore_vision import score

similarity_score = score(model_identifier='alexnet', benchmark_identifier='MajajHong2015public.IT-pls')
similarity_score



cross-validation: 100%|██████████| 10/10 [04:16<00:00, 25.70s/it]


This score already aggregates over neural sites and cross-validation splits, and normalizes with respect to an estimated noise ceiling.
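
To make the aggregation concrete, here is a minimal NumPy sketch on made-up numbers. It assumes a median over neural sites, a mean over splits, and a plain division by the ceiling; the values, the ceiling of 0.82, and these particular aggregation choices are assumptions for illustration, and the benchmark's actual aggregation and normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-split, per-site similarity values, shaped (split, neuroid)
raw_values = rng.uniform(0.2, 0.8, size=(10, 168))
ceiling = 0.82  # made-up noise-ceiling estimate, for illustration only

per_split = np.median(raw_values, axis=1)  # aggregate over neural sites (assumed: median)
unceiled = per_split.mean()                # aggregate over cross-validation splits (assumed: mean)
ceiled = unceiled / ceiling                # normalize by the ceiling (assumed: plain division)
```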

We can also check the un-ceiled, per-split, and per-neural-site values:

In [None]:
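# the first `.raw` steps back from the ceiling-normalized score to the un-ceiled score
unceiled_score = similarity_score.raw
# the second `.raw` steps back from the aggregate to one value per split and neural site
individual_values = unceiled_score.raw
print(individual_values)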

<xarray.Score (split: 10, neuroid: 168)>
array([[0.34887369, 0.46601301, 0.45883724, ..., 0.65668614, 0.70176186,
        0.72235326],
       [0.31144404, 0.52688285, 0.50384677, ..., 0.66639365, 0.67149065,
        0.68811384],
       [0.34135641, 0.43913901, 0.48682732, ..., 0.69500215, 0.7260355 ,
        0.71180254],
       ...,
       [0.21055581, 0.50588129, 0.49815291, ..., 0.67644017, 0.66275208,
        0.68777016],
       [0.30868527, 0.3778854 , 0.47125337, ..., 0.70168043, 0.72231718,
        0.73980201],
       [0.26001712, 0.44765998, 0.42908116, ..., 0.6753077 , 0.72488176,
        0.75500281]])
Coordinates:
  * split       (split) int64 0 1 2 3 4 5 6 7 8 9
  * neuroid     (neuroid) MultiIndex
  - neuroid_id  (neuroid) object 'Chabo_L_A_2_4' ... 'Tito_L_M_9_8'
  - arr         (neuroid) object 'A' 'A' 'A' 'A' 'A' 'A' ... 'M' 'M' 'M' 'M' 'M'
  - col         (neuroid) int64 4 3 5 0 1 2 3 4 5 6 2 ... 4 5 6 7 8 1 3 4 5 7 8
  - hemisphere  (neuroid) object 'L' 'L' 'L' 'L' 'L' 

### Custom benchmarks

We can also define our own benchmarks.
To interface with Brain-Score, each benchmark needs to implement the [`Benchmark` interface](https://brain-score-core.readthedocs.io/en/latest/modules/benchmarks.html). In particular, we need to write a `__call__` method that takes a model candidate as input and outputs a score.

Each benchmark follows these steps:
1. reproduce the primate experiment on the model (e.g. show the same stimuli)
2. apply a similarity metric to compare model predictions with biological measurements
3. normalize the similarity score with the ceiling, i.e. an upper bound on how well we would expect the best possible model to perform
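
The ceiling in step 3 is typically estimated from how consistent the biological measurements are with themselves. The sketch below shows one common approach, split-half consistency with a Spearman-Brown correction, on simulated responses of a single recording site. It is an illustration under stated assumptions (simulated data, one random split) and not the implementation behind any particular Brain-Score ceiling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_repetitions = 200, 20
signal = rng.normal(size=n_stimuli)  # hypothetical true response of one site per stimulus
# every repetition is the signal plus independent trial noise
responses = signal[:, None] + rng.normal(size=(n_stimuli, n_repetitions))

# split the repetitions into two random halves and average within each half
order = rng.permutation(n_repetitions)
half_a = responses[:, order[: n_repetitions // 2]].mean(axis=1)
half_b = responses[:, order[n_repetitions // 2:]].mean(axis=1)

# consistency between the two halves
r_half = np.corrcoef(half_a, half_b)[0, 1]
# Spearman-Brown correction: estimate the reliability of the full set of repetitions
ceiling = 2 * r_half / (1 + r_half)
```

A model that predicted the noiseless signal perfectly could not be expected to correlate with the noisy measurements much better than this ceiling, which is why scores are reported relative to it.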

The following example implements a simple benchmark that showcases these three steps.

In [7]:
from brainscore_vision.benchmark_helpers.neural_common import average_repetition
from brainscore_core import Score
from brainscore_core.benchmarks import Benchmark
from brainscore_vision import load_dataset, load_model, load_metric, load_ceiling, BrainModel
from brainscore_vision.benchmark_helpers.screen import place_on_screen


# Let's say we want to test models' match to IT recordings with an RDM metric.
# We'll use the public Majaj, Hong, et al. 2015 data.

class MyBenchmark(Benchmark):
    def __init__(self):
        self._assembly = load_dataset('MajajHong2015.public').sel(region='IT').squeeze('time_bin')
        self._metric = load_metric('rdm_cv')
        self._ceiler = load_ceiling('internal_consistency')

    @property
    def identifier(self):
        return "my-benchmark-name"

    def __call__(self, candidate: BrainModel) -> Score:
        # All candidate models follow the BrainModel interface, so we can easily treat all models the same way.
        # (1) reproduce the experiment on the model. 
        candidate.start_task(task=BrainModel.Task.passive)
        candidate.start_recording(recording_target="IT", time_bins=[(70, 170)])
        # since different models can have different fields of view, we adjust the image sizes accordingly.
        stimulus_set = place_on_screen(self._assembly.stimulus_set,
                                       target_visual_degrees=candidate.visual_degrees(), source_visual_degrees=8)
        predictions = candidate.look_at(stimuli=stimulus_set)
        # (2) compute similarity between predictions and measurements
        assembly = average_repetition(self._assembly)  # average over repetitions
        unceiled_score = self._metric(predictions, assembly)
        # (3) normalize by our estimate of how well the ideal model could do
        ceiled_score = unceiled_score / self.ceiling
        return ceiled_score

    @property
    def ceiling(self):
        print("Computing ceiling")
        return self._ceiler(self._assembly)


my_benchmark = MyBenchmark()
model = load_model('alexnet')
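# use a distinct name so we do not shadow the `score` function imported from brainscore_vision
benchmark_score = my_benchmark(model)
print(benchmark_score)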



cross-validation: 100%|██████████| 10/10 [00:42<00:00,  4.29s/it]


Computing ceiling


cross-validation: 100%|██████████| 10/10 [00:48<00:00,  4.85s/it]

<xarray.Score ()>
array(0.36174365)
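
To illustrate what an RDM-based comparison in step (2) measures, here is a NumPy-only sketch of the core idea, without cross-validation. It assumes `1 - Pearson correlation` as the dissimilarity and a Pearson correlation between the upper triangles of the two matrices; the helper names `rdm` and `rdm_similarity` and the simulated data are hypothetical, and the registered `rdm_cv` metric may make different choices.

```python
import numpy as np


def rdm(responses: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix for responses shaped (stimuli, units):
    1 - Pearson correlation between the response patterns of each stimulus pair."""
    return 1.0 - np.corrcoef(responses)


def rdm_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Correlate the upper triangles (diagonal excluded) of two RDMs."""
    upper = np.triu_indices(a.shape[0], k=1)
    return float(np.corrcoef(a[upper], b[upper])[0, 1])


rng = np.random.default_rng(0)
brain = rng.normal(size=(20, 50))                          # simulated recordings: 20 stimuli x 50 sites
good_model = brain + 0.1 * rng.normal(size=brain.shape)    # noisy copy of the recorded patterns
random_model = rng.normal(size=(20, 30))                   # unrelated features, different unit count

good = rdm_similarity(rdm(good_model), rdm(brain))
bad = rdm_similarity(rdm(random_model), rdm(brain))
```

Because both RDMs are stimuli-by-stimuli matrices, the comparison works even when the model and the recordings have different numbers of units, as with `random_model` here.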



