# TAPE: evaluation demo 

See how to:
- Evaluate your model on TAPE
- Report model performance on chosen tasks

## Model Evaluation

### Install

In [1]:
# !git clone https://github.com/RussianNLP/TAPE
# %cd TAPE
# !pip install .

### Load Modules

In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset
from TAPE.utils.episodes import get_episode_data


### Read Data

We will be working with **RuWorldTree**, a multiple-choice QA task:

In [2]:
data = load_dataset("RussianNLP/tape", "ru_worldtree.episodes")
train_data = data["train"].data.to_pandas()
test_data = data["test"].data.to_pandas()

In [3]:
train_data.head()

Unnamed: 0,question,exam_name,school_grade,knowledge_type,answer,perturbation,episode
0,"Тунец - это океаническая рыба, которая хорошо ...",MCAS,5,"CAUSAL,MODEL",A,ru_worldtree,"[10, 11]"
1,Какая часть растения больше всего отвечает за ...,MCAS,5,"PROP,PROCESS",B,ru_worldtree,[11]
2,Тамара купила скейт и комплект для изготовлени...,MCAS,5,"CAUSAL,MODEL",D,ru_worldtree,"[15, 16]"
3,"Кальмар - животное, обитающее в океане. Он выт...",MCAS,5,MODEL,C,ru_worldtree,"[18, 19]"
4,Цвет глаз - это физическая черта. Какое утверж...,Maryland School Assessment - Science,4,"MODEL,PROP",B,ru_worldtree,"[15, 17]"


In [4]:
test_data

Unnamed: 0,question,exam_name,school_grade,knowledge_type,answer,perturbation,episode
0,Что из следующего является примером формы энер...,MCAS,5,EXAMPLE,,ru_worldtree,[]
1,Дима проверил физические свойства минерала. Он...,MCAS,5,MODEL,,ru_worldtree,[]
2,"В течение большей части года воздух над Сочи, ...",MCAS,5,MODEL,,ru_worldtree,[]
3,Какое из следующих утверждений лучше всего объ...,MCAS,5,EXAMPLE,,ru_worldtree,[]
4,"Кипрей болотный - растение, лучше всего растущ...",MCAS,5,MODEL,,ru_worldtree,[]
...,...,...,...,...,...,...,...
4398,разные Есть виды пустыни . Что у них общего ? ...,TIMSS,4,NO TYPE,,swap,[]
4399,"Алина измерила , сколько сахара растворяется с...",TIMSS,4,NO TYPE,,swap,[]
4400,Растения используют энергию непосредственно от...,TIMSS,4,NO TYPE,,swap,[]
4401,Что следующего описывает конденсацию ? (A) тел...,TIMSS,4,NO TYPE,,swap,[]


Original test set with no perturbations:

In [5]:
test_data[test_data.perturbation == "ru_worldtree"]

Unnamed: 0,question,exam_name,school_grade,knowledge_type,answer,perturbation,episode
0,Что из следующего является примером формы энер...,MCAS,5,EXAMPLE,,ru_worldtree,[]
1,Дима проверил физические свойства минерала. Он...,MCAS,5,MODEL,,ru_worldtree,[]
2,"В течение большей части года воздух над Сочи, ...",MCAS,5,MODEL,,ru_worldtree,[]
3,Какое из следующих утверждений лучше всего объ...,MCAS,5,EXAMPLE,,ru_worldtree,[]
4,"Кипрей болотный - растение, лучше всего растущ...",MCAS,5,MODEL,,ru_worldtree,[]
...,...,...,...,...,...,...,...
624,Есть разные виды пустыни. Что у них общего? (A...,TIMSS,4,NO TYPE,,ru_worldtree,[]
625,"Алина измерила, сколько сахара растворяется в ...",TIMSS,4,NO TYPE,,ru_worldtree,[]
626,Растения используют энергию непосредственно от...,TIMSS,4,NO TYPE,,ru_worldtree,[]
627,Что из следующего описывает конденсацию? (A) ж...,TIMSS,4,NO TYPE,,ru_worldtree,[]


Test data with `BackTranslation` perturbation:

In [6]:
test_data[test_data.perturbation == "back_translation"]

Unnamed: 0,question,exam_name,school_grade,knowledge_type,answer,perturbation,episode
1258,Что из следующего является примером формы энер...,MCAS,5,EXAMPLE,,back_translation,[]
1259,Дима Проверил физические свойства минерала.. О...,MCAS,5,MODEL,,back_translation,[]
1260,"В течение большей части года воздух над Сочи, ...",MCAS,5,MODEL,,back_translation,[]
1261,Какое из следующих заявлений является лучшим о...,MCAS,5,EXAMPLE,,back_translation,[]
1262,Водно-болотные угодья Кипра являются самыми лу...,MCAS,5,MODEL,,back_translation,[]
...,...,...,...,...,...,...,...
1882,Есть разные виды пустыни.. Что у них общего? (...,TIMSS,4,NO TYPE,,back_translation,[]
1883,"Алина измеряют, сколько сахара растворяется в ...",TIMSS,4,NO TYPE,,back_translation,[]
1884,Растения используют энергию непосредственно от...,TIMSS,4,NO TYPE,,back_translation,[]
1885,Что из следующего описывает конденсацию? (A) ж...,TIMSS,4,NO TYPE,,back_translation,[]


### Model prediction

To evaluate your model on TAPE, first create a prediction function that takes the shots and a test set and returns one prediction per test example. Here we use a simple random model that ignores the shots:

In [7]:
def evaluate_model(k_shots, test_data):
    predictions = np.random.choice(["A", "B", "C", "D"], size=test_data.shape[0])
    return predictions
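
As a slightly more informative baseline that actually uses the shots, here is a minimal sketch of a majority-answer model. It assumes the shots DataFrame keeps the `answer` column of the train split; the name `majority_baseline` and the toy data are hypothetical and not part of TAPE.

```python
import numpy as np
import pandas as pd


def majority_baseline(k_shots: pd.DataFrame, test_data: pd.DataFrame) -> np.ndarray:
    """Predict the most frequent shot answer for every test question."""
    # Assumption: the shots keep the `answer` column of the train split.
    answers = k_shots["answer"].dropna()
    # Fall back to "A" when no labelled shots are available.
    label = answers.mode().iloc[0] if len(answers) else "A"
    return np.full(test_data.shape[0], label)


# Toy data, for illustration only.
toy_shots = pd.DataFrame({"answer": ["B", "B", "D"]})
toy_test = pd.DataFrame({"question": ["q1", "q2"]})
toy_preds = majority_baseline(toy_shots, toy_test)
```

The function has the same `(k_shots, test_data)` signature as `evaluate_model`, so it could be swapped into the evaluation loop unchanged.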

### Predict

The evaluation loop works as follows: each episode corresponds to `k` samples from the train data, which are used as shots. The model is then evaluated on the original test data and on each of its perturbed versions.

For a more detailed explanation of episodes, refer to [notebooks/episode_example.ipynb](https://github.com/RussianNLP/TAPE/notebooks/episode_example.ipynb).
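
To make the idea of an episode concrete, the sketch below selects the train rows whose `episode` list contains a given episode ID. This is only an assumption about what shot selection might look like; it is not the implementation of `get_episode_data`, and `select_shots` and the toy data are hypothetical.

```python
import pandas as pd


def select_shots(train: pd.DataFrame, episode: int) -> pd.DataFrame:
    """Return the train rows whose `episode` list contains `episode`."""
    # Each train row stores the IDs of the episodes it belongs to.
    mask = train["episode"].apply(lambda ids: episode in list(ids))
    return train[mask].reset_index(drop=True)


# Toy train split, for illustration only.
toy_train = pd.DataFrame(
    {
        "question": ["q0", "q1", "q2", "q3"],
        "answer": ["A", "B", "D", "C"],
        "episode": [[10, 11], [11], [15, 16], [18, 19]],
    }
)
shots_11 = select_shots(toy_train, 11)
```

In this toy example, episode 11 yields two shots (`q0` and `q1`), while an unknown episode ID yields an empty DataFrame.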

In [8]:
# iterate over episodes
evaluation_results = []
for episode in sorted(train_data.episode.apply(lambda x: x[0]).unique()):

    k_shots = get_episode_data(train_data, episode)

    # iterate over transformed and original test datasets
    for perturbation, test in test_data.groupby("perturbation"):

        # get model predictions
        predictions = evaluate_model(k_shots, test)

        # save predictions
        evaluation_results.append(
            {
                "episode": episode,
                "shot": k_shots.shape[0],
                "slice": perturbation,
                "preds": predictions,
            }
        )

evaluation_results = pd.DataFrame(evaluation_results)
evaluation_results.head()

Unnamed: 0,episode,shot,slice,preds
0,5,1,addsent,"[C, B, A, B, C, A, C, A, C, A, C, C, A, B, D, ..."
1,5,1,back_translation,"[B, A, C, D, C, B, B, A, A, D, D, A, C, D, A, ..."
2,5,1,butter_fingers,"[B, A, C, A, C, D, D, D, A, A, B, D, D, A, D, ..."
3,5,1,del,"[D, D, D, C, B, B, C, C, D, D, C, B, C, D, C, ..."
4,5,1,emojify,"[C, B, D, D, B, D, C, C, D, A, D, C, B, A, D, ..."


## Reports

Now that we have the model's predictions, we can evaluate its performance.

**Note:** We generate targets randomly to avoid data leakage. Submit your predictions to see true model performance on TAPE.
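
One way to produce such placeholder targets is sketched below. The helper `make_random_targets` is hypothetical, and the label set assumes the four answer options used by the random model above. Scores computed against these targets say nothing about real model quality.

```python
import numpy as np


def make_random_targets(n_examples, labels=("A", "B", "C", "D"), seed=0):
    """Draw one random placeholder label per test example."""
    # A fixed seed keeps the placeholder targets reproducible across runs.
    rng = np.random.default_rng(seed)
    return rng.choice(list(labels), size=n_examples)


# Toy usage: five placeholder targets.
placeholder_targets = make_random_targets(5)
```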

### Load Modules

In [9]:
from TAPE.reports import report_transformations, report_subpopulations, Report
from TAPE.subpopulations import LengthSubpopulation

### Preparing Data

To plot your evaluation report, create a `prepare_report` function. It should take the predictions of your model and evaluate it across several paradigms, including perturbations and subpopulations.

If your predictions are in the correct format (see above), you can simply use the function below.

In [10]:
# this file is only available to the TAPE authors
# but you can evaluate your model on the public sample, published in BigBench

test_answers = np.load("../TAPE_test_answers/episodes/worldtree_answers.npy")

FileNotFoundError: [Errno 2] No such file or directory: '../TAPE_test_answers/episodes/worldtree_answers.npy'

In [11]:
def prepare_report(
    predictions: pd.DataFrame,
    task_name: str,
    subpopulations: list,
    original_data: pd.DataFrame,
    label_ids: str = "label",
):

    # report perturbations
    transformation_res = report_transformations(
        predictions, task_name, label_ids
    )

    # report subpopulations
    sub_res = report_subpopulations(
        subpopulations=subpopulations,
        org_data=original_data,
        preds=predictions,
        dataset_name=task_name,
        label_ids=label_ids,
    )

    results = pd.concat([transformation_res, sub_res])
    return results

In [12]:
task_name = "ru_worldtree"
subpopulations = [LengthSubpopulation("question")]

# aggregate results
results = prepare_report(
    predictions=evaluation_results,  # model predictions
    task_name=task_name,  # name of the task
    original_data=train_data[
        train_data.perturbation == "ru_worldtree"
    ],  # original (not perturbed) data
    label_ids=test_answers,  # targets for the original test set
    subpopulations=subpopulations,  # list of subpopulations to analyse
)
results

NameError: name 'test_answers' is not defined

Now we have our aggregated results in the correct format, with the following information:

- `shot`: number of shots used for evaluation
- `slice`: slice name to appear in the report (either a perturbation or a subpopulation)
- `size`: size of the slice
- `asr`: attack success rate score (percentage of correct predictions that changed to incorrect after perturbation)
- `macro_f1`: macro F1 score / token overlap (SQuAD F1)
- `accuracy`: accuracy / exact match score
- `std_<x>`: standard deviation of metric `x` over the episodes
- `category`: type of the slice (evalset, perturbations, subpopulation)
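
The sketch below illustrates how accuracy and attack success rate follow from the definitions in the list above. It is a standalone example on toy labels, not TAPE's own implementation, which may aggregate differently (for instance across episodes); the function names are hypothetical. Both functions return fractions, so multiply by 100 for a percentage.

```python
import numpy as np


def accuracy(targets, preds):
    """Fraction of predictions that match the targets."""
    return float(np.mean(np.asarray(targets) == np.asarray(preds)))


def attack_success_rate(targets, orig_preds, pert_preds):
    """Fraction of originally correct predictions that become incorrect."""
    targets = np.asarray(targets)
    was_correct = np.asarray(orig_preds) == targets
    if not was_correct.any():
        return 0.0  # nothing to attack if no prediction was correct
    now_wrong = np.asarray(pert_preds) != targets
    return float(np.mean(now_wrong[was_correct]))


# Toy labels, for illustration only.
targets = ["A", "B", "C", "D"]
orig = ["A", "B", "C", "A"]  # predictions on the original test set
pert = ["A", "D", "C", "D"]  # predictions on a perturbed test set
toy_accuracy = accuracy(targets, orig)
toy_asr = attack_success_rate(targets, orig, pert)
```

Here three of the four original predictions are correct, and one of those three flips to an incorrect answer after the perturbation, so the accuracy is 0.75 and the attack success rate is 1/3.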

Now we can plot the report:

In [13]:
report = Report(results)
report.figure(shot=4)

NameError: name 'results' is not defined

(The reports do not always render in notebooks, so we attach an example image below.)
![eval_report](https://github.com/RussianNLP/TAPE/blob/main/images/eval_report_example.png?raw=true)