## Model-to-model comparison

You can also use Brain-Score to compare how similar models are to one another.

### Behavioral comparison

Let's compare the reading-time predictions of two models:

In [13]:
from brainscore_language import load_model, ArtificialSubject, load_metric

# load models
model1 = load_model('distilgpt2')
model2 = load_model('gpt2-xl')

# start task
model1.start_behavioral_task(ArtificialSubject.Task.reading_times)
model2.start_behavioral_task(ArtificialSubject.Task.reading_times)
text = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
reading_times1 = model1.digest_text(text)['behavior'][1:]  # use all reading times except for first word
reading_times2 = model2.digest_text(text)['behavior'][1:]

# compare
metric = load_metric('pearsonr')
score = metric(reading_times1, reading_times2)
print(score)

<xarray.Score ()>
array(0.75950604)
Attributes:
    rvalue:   0.7595060423938943
    pvalue:   0.028803341212904062
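
To make explicit what a Pearson correlation between two prediction vectors measures, here is a minimal NumPy sketch. The reading-time values are made up for illustration, and the helper `pearson_r` is hypothetical rather than part of `brainscore_language`; the library's `pearsonr` metric may differ in details such as coordinate alignment or the handling of missing values.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # center both vectors, then normalize their dot product by the product of their norms
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered @ y_centered)
                 / (np.linalg.norm(x_centered) * np.linalg.norm(y_centered)))


# made-up per-word predictions for two hypothetical models (not real model output)
predictions_a = [3.1, 5.4, 4.8, 6.0, 2.2, 7.5, 3.9, 5.1]
predictions_b = [2.9, 5.0, 5.5, 5.8, 2.5, 6.9, 4.4, 4.7]
r = pearson_r(predictions_a, predictions_b)
print(round(r, 3))
```

A value near 1 means the two models rank and scale words similarly; a value near 0 means their predictions are linearly unrelated.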


### Neural comparison

Similarly, you can compare the neural activity of two models.
Here, we build two artificial subjects from the same base model (`distilgpt2`) by recording from different layers, but you can also compare entirely different models, as in the behavioral comparison above.

In [17]:
import numpy as np
from brainscore_language import ArtificialSubject, load_metric
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

# load models
model1 = HuggingfaceSubject(model_id='distilgpt2', region_layer_mapping={
    ArtificialSubject.RecordingTarget.language_system: 'transformer.h.4.ln_1'})
model2 = HuggingfaceSubject(model_id='distilgpt2', region_layer_mapping={
    ArtificialSubject.RecordingTarget.language_system: 'transformer.h.5.ln_1'})

# record neural activity
model1.start_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
                                recording_type=ArtificialSubject.RecordingType.fMRI)
model2.start_neural_recording(recording_target=ArtificialSubject.RecordingTarget.language_system,
                                recording_type=ArtificialSubject.RecordingType.fMRI)
text = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog',
        'Waltz', 'bad', 'nymph', 'for', 'quick', 'jigs', 'vex',
        "Glib", "jocks", "quiz", "nymph", "to", "vex", "dwarf",
        "Sphinx", "of", "black", "quartz,", "judge", "my", "vow",
        "How", "vexingly", "quick", "daft", "zebras", "jump!"]
activity1 = model1.digest_text(text)['neural']
activity2 = model2.digest_text(text)['neural']
activity1['stimulus_id'] = activity2['stimulus_id'] = 'presentation', np.arange(len(text))

# compare
metric = load_metric('linear_pearsonr')
score = metric(activity1, activity2)
print(score)


<xarray.Score ()>
array(0.62209412)
Attributes:
    raw:      <xarray.Score (split: 10, neuroid: 768)>\narray([[ 0.95910202, ...
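
The `raw` attribute above has the dimensions `split: 10` and `neuroid: 768`, which suggests one correlation per target unit per cross-validation split. The following NumPy sketch illustrates that general idea on synthetic data. The function names, the use of ordinary least squares, the random fold assignment, and the final mean are all assumptions of this sketch, not a description of how `linear_pearsonr` is implemented.

```python
import numpy as np


def columnwise_pearson(a, b):
    """Pearson correlation between matching columns of two (samples x targets) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))


def cross_validated_linear_r(source, target, n_splits=10, seed=0):
    """Fit a linear map source -> target on training folds; correlate predictions on held-out folds."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(source)), n_splits)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(source)), test_idx)
        # ordinary least squares with an intercept column (an assumption of this sketch)
        X_train = np.column_stack([source[train_idx], np.ones(len(train_idx))])
        X_test = np.column_stack([source[test_idx], np.ones(len(test_idx))])
        weights, *_ = np.linalg.lstsq(X_train, target[train_idx], rcond=None)
        scores.append(columnwise_pearson(X_test @ weights, target[test_idx]))
    return np.array(scores)  # shape: (n_splits, n_targets)


# synthetic stand-ins for two sets of activations: `target` is a noisy linear function of `source`
rng = np.random.RandomState(1)
source = rng.randn(200, 5)
target = source @ rng.randn(5, 3) + 0.1 * rng.randn(200, 3)
raw_scores = cross_validated_linear_r(source, target)
print(raw_scores.shape, raw_scores.mean())
```

Scoring only on held-out stimuli keeps the regression from being rewarded for overfitting, which matters when the number of source features approaches the number of stimuli.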





## Human-to-human comparison

As with the model-to-model comparisons, you can compare humans to one another.
Because the human data have already been recorded, we simply compare two subsets of the data to one another.


### Behavioral comparison

Using data from Futrell et al. 2018, we can compare how similar the mean reading times of one half of the subjects are to those of the other half (this split-half comparison is also part of how the ceiling is estimated in the [Futrell2018 reading times benchmark](https://github.com/brain-score/language/blob/main/brainscore_language/benchmarks/futrell2018/__init__.py)):

In [4]:
from brainscore_language import load_dataset

data = load_dataset('Futrell2018')
print(data)  # will show lots of nans because not every subject has a reading time for every word

<xarray.NeuroidAssembly 'data' (presentation: 10256, subject: 180)>
array([[ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       ...,
       [512., 334., 283., ...,  nan,  nan,  nan],
       [432., 390., 590., ...,  nan,  nan,  nan],
       [576., 750., 862., ...,  nan,  nan,  nan]])
Coordinates:
  * presentation             (presentation) MultiIndex
  - word                     (presentation) object 'If' 'you' ... "Tourette's."
  - word_core                (presentation) object 'If' 'you' ... 'Tourettes'
  - story_id                 (presentation) int64 1 1 1 1 1 1 ... 10 10 10 10 10
  - word_id                  (presentation) int64 1 2 3 4 5 ... 936 937 938 939
  - word_within_sentence_id  (presentation) int64 1 2 3 4 5 6 ... 12 13 14 15 16
  - sentence_id              (presentation) int64 1 1 1 1 1 ... 481 481 481 481
  - stimulus_id              (presentation) int64 1 2 3 4 ... 10254 102

In [9]:
from numpy.random import RandomState
from brainscore_language import load_metric

# split into halves
random = RandomState(0)
subjects = data['subject_id'].values
half1_subjects = random.choice(subjects, size=len(subjects) // 2, replace=False)
half2_subjects = set(subjects) - set(half1_subjects)
half1 = data[{'subject': [subject_id in half1_subjects for subject_id in subjects]}]
half2 = data[{'subject': [subject_id in half2_subjects for subject_id in subjects]}]

# mean within each half
half1 = half1.mean('subject')
half2 = half2.mean('subject')

# compare
metric = load_metric('pearsonr')
score = metric(half1, half2)
print(score)

<xarray.Score ()>
array(0.61563052)
Attributes:
    rvalue:   0.6156305242624502
    pvalue:   0.0
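
The split-half logic can also be sketched without the dataset. The example below uses synthetic reading times with missing entries and plain NumPy; the array sizes, noise levels, and the rule of dropping words that lack data in either half are assumptions made for illustration and do not describe the benchmark's exact procedure.

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic reading times (words x subjects): a shared per-word signal plus subject noise
n_words, n_subjects = 500, 20
word_effect = rng.normal(350, 50, size=(n_words, 1))
times = word_effect + rng.normal(0, 80, size=(n_words, n_subjects))
# blank out roughly 30% of entries to mimic subjects who did not read every word
times[rng.rand(n_words, n_subjects) < 0.3] = np.nan

# split subjects into two disjoint random halves
order = rng.permutation(n_subjects)
half1_idx, half2_idx = order[:n_subjects // 2], order[n_subjects // 2:]

# average within each half, ignoring missing values
half1_mean = np.nanmean(times[:, half1_idx], axis=1)
half2_mean = np.nanmean(times[:, half2_idx], axis=1)

# correlate only over words for which both halves have at least one reading time
valid = ~np.isnan(half1_mean) & ~np.isnan(half2_mean)
split_half_r = np.corrcoef(half1_mean[valid], half2_mean[valid])[0, 1]
print(round(split_half_r, 3))
```

Averaging within each half reduces subject-specific noise, so the split-half correlation is typically higher than the correlation between any two individual subjects.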


### Neural comparison

Using data from Pereira et al. 2018, we can test how well a pool of subjects can linearly predict a held-out subject (this is also part of how the ceiling is estimated in the [Pereira2018 linear predictivity benchmark](https://github.com/brain-score/language/blob/main/brainscore_language/benchmarks/pereira2018/__init__.py)):

In [24]:
from brainscore_language import load_dataset

data = load_dataset('Pereira2018.language')
data = data.sel(experiment='384sentences')
data = data.dropna('neuroid')
print(data)

<xarray.NeuroidAssembly 'data' (presentation: 384, neuroid: 12155)>
array([[-0.61048431, -0.76491186, -0.79946189, ..., -1.02996304,
        -0.42042251,  0.44600733],
       [-0.57701107, -0.24646438, -0.28553028, ..., -0.29127415,
        -0.10866586,  1.67496226],
       [ 0.5322871 ,  0.69422809,  0.29570084, ...,  0.64426824,
         0.0268965 ,  5.96437518],
       ...,
       [ 0.4911479 ,  0.97394189,  0.14704561, ...,  0.97622657,
         1.07466326,  0.65786844],
       [ 1.0331004 ,  1.5348565 ,  0.84328902, ..., -0.8361398 ,
        -0.52408963,  0.73715778],
       [ 0.53970481,  0.98636439,  0.53409886, ..., -0.03957355,
        -0.06988947,  2.10925683]])
Coordinates:
  * presentation      (presentation) MultiIndex
  - stimulus_num      (presentation) int64 0 1 2 3 4 5 ... 379 380 381 382 383
  - sentence          (presentation) object 'An accordion is a portable music...
  - stimulus          (presentation) object 'An accordion is a portable music...
  - passage_index

In [26]:
from brainscore_language import load_metric

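# hold out one subject; the remaining subjects form the predictor pool
heldout_subject = '426'

# every neuroid (voxel) carries a subject label, so we select along the neuroid dimension
subject_labels = data['subject'].values
pool = data[{'neuroid': [subject != heldout_subject for subject in subject_labels]}]
heldout = data[{'neuroid': [subject == heldout_subject for subject in subject_labels]}]

# compare: linearly predict the held-out subject's voxels from the pool's voxels
metric = load_metric('linear_pearsonr')
score = metric(pool, heldout)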
print(score)



<xarray.Score ()>
array(0.35985278)
Attributes:
    raw:      <xarray.Score (split: 10, neuroid: 1357)>\narray([[ 0.275236  ,...
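
The cell above holds out a single subject. One possible extension, sketched here on synthetic data, is to repeat the split for every subject so that each one serves as the held-out target once. The subject labels and array shapes are invented for this sketch, and it only builds the `(pool, heldout)` pairs; each pair could then be passed to a metric such as `linear_pearsonr`. Whether the benchmark's ceiling estimate iterates in exactly this way is not claimed here.

```python
import numpy as np

rng = np.random.RandomState(0)

# synthetic recordings (stimuli x voxels) with one subject label per voxel
subject_labels = np.array(['s1'] * 4 + ['s2'] * 3 + ['s3'] * 5)
recordings = rng.randn(30, len(subject_labels))

splits = {}
for heldout_subject in np.unique(subject_labels):
    is_heldout = subject_labels == heldout_subject
    pool = recordings[:, ~is_heldout]    # voxels of all other subjects
    heldout = recordings[:, is_heldout]  # voxels of the held-out subject
    splits[heldout_subject] = (pool.shape, heldout.shape)

for subject, (pool_shape, heldout_shape) in splits.items():
    print(subject, pool_shape, heldout_shape)
```

Boolean masks over the voxel axis keep the stimulus axis untouched, so pool and held-out arrays stay aligned row by row.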



