## Model-to-model comparison

You can also use Brain-Score to compare how similar models are to one another.

### Behavioral comparison

Let's compare the reading-time predictions of two models:

In [3]:
from brainscore_language import load_model, ArtificialSubject, load_metric

# load models
model1 = load_model('distilgpt2')
model2 = load_model('gpt2-xl')

# perform task
model1.perform_behavioral_task(ArtificialSubject.Task.reading_times)
model2.perform_behavioral_task(ArtificialSubject.Task.reading_times)
text = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
reading_times1 = model1.digest_text(text)['behavior']
reading_times2 = model2.digest_text(text)['behavior']

# compare
metric = load_metric('pearsonr')
score = metric(reading_times1, reading_times2)
print(score)

ValueError: array must not contain infs or NaNs
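
The `ValueError` above indicates that at least one of the two prediction arrays contains a NaN or infinite value; which entries are affected, and why, is not shown in the output. One possible workaround is to drop every word for which either model lacks a finite prediction before correlating. The following is a minimal sketch under that assumption, using plain NumPy rather than the Brain-Score metric; the function name `pearson_ignoring_nonfinite` and the example values are hypothetical.

```python
import numpy as np


def pearson_ignoring_nonfinite(x, y):
    """Pearson correlation over the positions where both inputs are finite."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # keep only the pairs where neither value is NaN or +/-inf
    keep = np.isfinite(x) & np.isfinite(y)
    if keep.sum() < 2:
        raise ValueError("need at least two finite pairs to correlate")
    return float(np.corrcoef(x[keep], y[keep])[0, 1])


# hypothetical stand-ins for the per-word predictions of two models
predictions_a = [np.nan, 1.0, 2.0, 3.0, 4.0]
predictions_b = [5.0, 2.0, 4.0, 6.0, 8.0]
r = pearson_ignoring_nonfinite(predictions_a, predictions_b)
print(r)
```

If the objects returned by `digest_text(...)['behavior']` expose a `.values` array (an assumption about the returned assembly type), the same function could be applied to `reading_times1.values` and `reading_times2.values`.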

### Neural comparison

Similarly, you can compare the neural activity of two models.
Here, we compare two artificial subjects derived from the same base model (`distilgpt2`) by recording from different layers, but you can also compare entirely different models, as above.

In [10]:
from brainscore_language import ArtificialSubject, load_metric
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

# load models
model1 = HuggingfaceSubject(model_id='distilgpt2', region_layer_mapping={
        ArtificialSubject.RecordingTarget.language_system: 'transformer.h.4.ln_1'})
model2 = HuggingfaceSubject(model_id='distilgpt2', region_layer_mapping={
        ArtificialSubject.RecordingTarget.language_system: 'transformer.h.5.ln_1'})

# record neural activity
model1.perform_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
                                recording_type=ArtificialSubject.RecordingType.fMRI)
model2.perform_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
                                recording_type=ArtificialSubject.RecordingType.fMRI)
text = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
activity1 = model1.digest_text(text)['neural']
activity2 = model2.digest_text(text)['neural']

# compare
metric = load_metric('linear_pearsonr')
score = metric(activity1, activity2)
print(score)

KeyError: 'stimulus_id'
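
The `KeyError` above suggests that the metric looks up a `stimulus_id` coordinate that the recorded activity does not carry here; the cell output does not show enough to say more. Independently of that error, it may help to see what a linear-predictivity comparison computes conceptually: fit a linear map from one set of activations to the other on training stimuli, then correlate predictions with held-out activations. The sketch below illustrates this on synthetic arrays with plain NumPy. It is not the Brain-Score implementation; the function name, the fold count, and the median-then-mean aggregation are all assumptions.

```python
import numpy as np


def linear_predictivity(source, target, n_splits=5, seed=0):
    """Cross-validated linear map from source to target, scored with Pearson r.

    source: array of shape (n_presentations, n_source_units)
    target: array of shape (n_presentations, n_target_units)
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(source)), n_splits)
    fold_scores = []
    for test_idx in folds:
        train_mask = np.ones(len(source), dtype=bool)
        train_mask[test_idx] = False
        # append a column of ones so the fit includes an intercept
        x_train = np.column_stack([source[train_mask], np.ones(train_mask.sum())])
        x_test = np.column_stack([source[test_idx], np.ones(len(test_idx))])
        # ordinary least squares fit on the training presentations
        weights, *_ = np.linalg.lstsq(x_train, target[train_mask], rcond=None)
        predicted = x_test @ weights
        held_out = target[test_idx]
        # Pearson r between prediction and held-out data, per target unit
        unit_rs = [np.corrcoef(predicted[:, i], held_out[:, i])[0, 1]
                   for i in range(target.shape[1])]
        fold_scores.append(np.median(unit_rs))
    return float(np.mean(fold_scores))


# synthetic example: layer_b is a noisy linear function of layer_a
rng = np.random.RandomState(1)
layer_a = rng.randn(100, 8)
layer_b = layer_a @ rng.randn(8, 4) + 0.1 * rng.randn(100, 4)
neural_score = linear_predictivity(layer_a, layer_b)
print(neural_score)
```

Because `layer_b` is constructed as an almost exact linear function of `layer_a`, the score should be close to 1; two unrelated activation matrices would score near 0.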

## Human-to-human comparison

As with the model-to-model comparisons, you can compare humans to one another.
In this case, the data has already been recorded, so we simply compare two sets of data with one another.


### Behavioral comparison

Using data from Futrell et al. (2018), we can compare the mean reading times of one half of the subjects with those of the other half. This split-half comparison is also part of how the ceiling is estimated in the [Futrell2018 reading times benchmark](https://github.com/brain-score/language/blob/main/brainscore_language/benchmarks/futrell2018/__init__.py):

In [4]:
from numpy.random import RandomState
from brainscore_language import load_dataset, load_metric

# load data
data = load_dataset('Futrell2018')
print(data)

<xarray.NeuroidAssembly 'data' (presentation: 10256, subject: 180)>
array([[ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       ...,
       [512., 334., 283., ...,  nan,  nan,  nan],
       [432., 390., 590., ...,  nan,  nan,  nan],
       [576., 750., 862., ...,  nan,  nan,  nan]])
Coordinates:
  * presentation             (presentation) MultiIndex
  - word                     (presentation) object 'If' 'you' ... "Tourette's."
  - word_core                (presentation) object 'If' 'you' ... 'Tourettes'
  - story_id                 (presentation) int64 1 1 1 1 1 1 ... 10 10 10 10 10
  - word_id                  (presentation) int64 1 2 3 4 5 ... 936 937 938 939
  - word_within_sentence_id  (presentation) int64 1 2 3 4 5 6 ... 12 13 14 15 16
  - sentence_id              (presentation) int64 1 1 1 1 1 ... 481 481 481 481
  - stimulus_id              (presentation) int64 1 2 3 4 ... 10254 102

In [9]:
# split into halves
random = RandomState(0)
subjects = data['subject_id'].values
half1_subjects = random.choice(subjects, size=len(subjects) // 2, replace=False)
half2_subjects = set(subjects) - set(half1_subjects)
half1 = data[{'subject': [subject_id in half1_subjects for subject_id in subjects]}]
half2 = data[{'subject': [subject_id in half2_subjects for subject_id in subjects]}]

# mean within each half
half1 = half1.mean('subject')
half2 = half2.mean('subject')

# compare
metric = load_metric('pearsonr')
score = metric(half1, half2)
print(score)

<xarray.Score ()>
array(0.61563052)
Attributes:
    rvalue:   0.6156305242624502
    pvalue:   0.0
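
The split-half procedure above can also be sketched on a plain array, which makes the handling of missing values explicit. The example below uses synthetic reading times with NaNs for unobserved word and subject pairs, averages within each random half while skipping NaNs, and correlates the two half-means. The function name `split_half_correlation` and all numbers in the synthetic data are hypothetical, and the sketch is not the benchmark's ceiling code.

```python
import warnings

import numpy as np


def split_half_correlation(data, seed=0):
    """Correlate the per-word means of two random halves of the subjects.

    data: array of shape (n_presentations, n_subjects), NaN where missing
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.RandomState(seed)
    n_subjects = data.shape[1]
    order = rng.permutation(n_subjects)
    first, second = order[: n_subjects // 2], order[n_subjects // 2:]
    with warnings.catch_warnings():
        # a word with no data in one half yields NaN and a RuntimeWarning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        half1 = np.nanmean(data[:, first], axis=1)
        half2 = np.nanmean(data[:, second], axis=1)
    # drop words that have no data in at least one half
    keep = np.isfinite(half1) & np.isfinite(half2)
    return float(np.corrcoef(half1[keep], half2[keep])[0, 1])


# synthetic data: a shared per-word effect plus subject-specific noise
rng = np.random.RandomState(0)
word_effect = 350 + 50 * rng.randn(500)
reading_times = word_effect[:, None] + 100 * rng.randn(500, 20)
# mark roughly 30% of the entries as missing
reading_times[rng.rand(500, 20) < 0.3] = np.nan
consistency = split_half_correlation(reading_times)
print(consistency)
```

The correlation grows with the number of subjects per half and with the size of the shared per-word effect relative to the subject noise.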
