
## Setup



In [1]:
%%capture

!git clone https://github.com/oughtinc/raft-baselines
%cd raft-baselines
!python -m pip install -r requirements.txt
!python setup.py develop

# hack to make module work
import sys
sys.path.insert(0, r"/content/raft-baselines/src")

## Getting started with the RAFT benchmark

In this notebook, we will walk through:

1. Loading the tasks from the [RAFT dataset](https://huggingface.co/datasets/ought/raft)
2. Creating a classifier using any CausalLM from the [Hugging Face Hub](https://huggingface.co/models)
3. Generating predictions using that classifier for RAFT test examples

This should provide you with the steps needed to make a submission to the [RAFT leaderboard](https://huggingface.co/spaces/ought/raft-leaderboard)!

In [2]:
import datasets

datasets.logging.set_verbosity_error()

## Loading RAFT datasets


We'll focus on the ADE corpus V2 task in this starter kit, but similar code could be run for all of the tasks in RAFT. To see the possible tasks, we can use the following function from `datasets`:

In [3]:
from datasets import get_dataset_config_names

RAFT_TASKS = get_dataset_config_names("ought/raft")
RAFT_TASKS

['ade_corpus_v2',
 'banking_77',
 'terms_of_service',
 'tai_safety_research',
 'neurips_impact_statement_risks',
 'overruling',
 'systematic_review_inclusion',
 'one_stop_english',
 'tweet_eval_hate',
 'twitter_complaints',
 'semiconductor_org_types']

Each task in RAFT consists of a training set of only **_50 labeled examples_** and an unlabeled test set. All labels have a textual version associated with them. Let's load the corpus associated with the `ade_corpus_v2` task:

In [4]:
from datasets import load_dataset

TASK = "ade_corpus_v2"
raft_dataset = load_dataset("ought/raft", name=TASK)
raft_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'ID', 'Label'],
        num_rows: 50
    })
    test: Dataset({
        features: ['Sentence', 'ID', 'Label'],
        num_rows: 5000
    })
})

The `raft_dataset` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) with one key for each of the training and test splits. For this task we have 50 labelled examples to work with and 5,000 test examples to generate predictions for. To access an example, specify the name of the split and then the index:

In [5]:
raft_dataset["train"][0]

{'ID': 0, 'Label': 2, 'Sentence': 'No regional side effects were noted.'}

Here we can see that each example is assigned a label ID, which denotes its class in this particular task. Let's check how many classes we have in the training set:

In [6]:
label_ids = raft_dataset["train"].unique("Label")
label_ids

[2, 1]

Okay, this indicates that `ade_corpus_v2` is a binary classification task and we can extract the human-readable label names as follows:

In [7]:
features = raft_dataset["train"].features["Label"]
id2label = {idx : features.int2str(idx) for idx in label_ids}
id2label

{1: 'ADE-related', 2: 'not ADE-related'}
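
With only 50 training examples, it also helps to know how the classes are balanced. The sketch below counts examples per class name; `toy_label_ids` is a made-up list standing in for `raft_dataset["train"]["Label"]`, so the counts are illustrative only and are not the real distribution.

```python
from collections import Counter

# Mapping taken from the notebook output above
id2label = {1: "ADE-related", 2: "not ADE-related"}

# Hypothetical label IDs, used here instead of raft_dataset["train"]["Label"]
toy_label_ids = [2, 2, 1, 2, 1, 2]

# Count how many examples fall into each human-readable class
counts = Counter(id2label[label_id] for label_id in toy_label_ids)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

In the notebook itself, the same `Counter` call should work on the real label column in place of the toy list.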

Note that the test set also has a `Label` entry, but it is always 0, a dummy value standing in for the label your model needs to predict:

In [8]:
raft_dataset["test"].unique("Label")

[0]

To get a broader sense of what kind of data we are dealing with, we can use the following function to randomly sample from the corpus and display the results as a table:

In [9]:
import random
import pandas as pd

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct row indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class IDs with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    print(df)

show_random_elements(raft_dataset["train"])

                                            Sentence  ID            Label
0  The mechanism by which sunitinib induces gynae...  46      ADE-related
1  This report demonstrates the increased risk of...  37  not ADE-related
2  MRI has a high sensitivity and specificity in ...  14  not ADE-related
3  The treatment of Toxoplasma encephalitis in pa...   7  not ADE-related
4  CONCLUSION: Pancreatic enzyme intolerance, alt...   6  not ADE-related
5  NEH must be considered in lupus patients recei...  17  not ADE-related
6  This case report describes a 13-year-old male ...  38  not ADE-related
7  In 1991 the patient were found to be seroposit...  13  not ADE-related
8  Considerable improvement of myasthenic symptom...  34  not ADE-related
9  METHODS: This study is a case report description.  49  not ADE-related


## Creating a classifier from the Hugging Face Model Hub

We provide a class that uses the same prompt construction method as our GPT-3 baseline but works with any CausalLM on the [Hugging Face Model Hub](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads). The classifier automatically uses a GPU if one is available. The arguments for configuring the classifier are documented briefly in the comments below:

In [10]:
from raft_baselines.classifiers import TransformersCausalLMClassifier

classifier = TransformersCausalLMClassifier(
    model_type="distilgpt2",             # The model to use from the HF hub
    training_data=raft_dataset["train"], # The training data
    num_prompt_training_examples=25,     # See raft_predict.py for the number of training examples used on a per-dataset basis in the GPT-3 baselines run.
                                         # Note that it may be better to use fewer training examples and/or shorter instructions with other models with smaller context windows.
    add_prefixes=(TASK=="banking_77"),   # Set to True when using banking_77 since multiple classes start with the same token
    config=TASK,                         # For task-specific instructions and field ordering
    use_task_specific_instructions=True,
    do_semantic_selection=True,
)
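
The comment on `num_prompt_training_examples` notes that models with smaller context windows may need fewer demonstrations. As a rough illustration of that trade-off, the sketch below keeps demonstrations until an approximate token budget is used up. `fit_examples` is a hypothetical helper, not part of `raft_baselines`, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def fit_examples(examples, budget, count_tokens=lambda text: len(text.split())):
    """Keep demonstrations, in order, until the approximate token budget is exhausted."""
    kept, used = [], 0
    for text in examples:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # the next demonstration would overflow the budget
        kept.append(text)
        used += cost
    return kept

# Hypothetical demonstrations and budget, for illustration only
demos = ["a b c", "d e", "f g h i"]
selected = fit_examples(demos, budget=5)
print(len(selected), "of", len(demos), "demonstrations fit")
```

For a real model, a function that counts tokens with that model's own tokenizer could be passed as `count_tokens`, and the budget would also have to leave room for the instructions and the test sentence.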

## Generating predictions for RAFT test examples

In order to generate predictions on the test set, we need to provide the model with an appropriate prompt with the instructions. Let's take a look at how this works on a single example from the test set.
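
To make the idea of a few-shot prompt concrete, here is a minimal sketch that assembles instructions, a label list, labelled demonstrations, and the unlabelled target into one string. `build_prompt` is a hypothetical function written for illustration. It follows the general `Sentence:` / `Label:` layout used by the baseline but is not the `raft_baselines` implementation, and ending the prompt with an open `Label:` line is an assumption.

```python
def build_prompt(instructions, labels, demonstrations, target_sentence):
    """Assemble a few-shot classification prompt as a single string."""
    lines = [instructions, "Possible labels:"]
    # Numbered list of the label names the model may choose from
    lines += [f"{i}. {name}" for i, name in enumerate(labels, start=1)]
    # One labelled block per demonstration
    for sentence, label in demonstrations:
        lines += ["", f"Sentence: {sentence}", f"Label: {label}"]
    # The target is left unlabelled so the model completes the label
    lines += ["", f"Sentence: {target_sentence}", "Label:"]
    return "\n".join(lines)

prompt = build_prompt(
    instructions="Label the sentence based on whether it is related to an adverse drug effect (ADE).",
    labels=["ADE-related", "not ADE-related"],
    demonstrations=[
        ("Treatment of silastic catheter-induced central vein septic thrombophlebitis", "not ADE-related"),
    ],
    target_sentence="No regional side effects were noted.",
)
print(prompt)
```

A causal language model then scores each candidate label as a continuation of this text, which is where the per-label probabilities come from.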

### Example prompt and prediction

The `TransformersCausalLMClassifier` has a `classify` method that returns the model's predicted probabilities for each label. We'll set `should_print_prompt=True` so that we can see which prompt is being used to instruct the model:

In [11]:
test_dataset = raft_dataset["test"]
first_test_example = test_dataset[0]

# Remove the dummy Label (always 0 in the test set) before classifying
del first_test_example["Label"]

# probabilities for all classes
output_probs = classifier.classify(first_test_example, should_print_prompt=True)
output_probs

Label the sentence based on whether it is related to an adverse drug effect (ADE). Details are described below:
Drugs: Names of drugs and chemicals that include brand names, trivial names, abbreviations and systematic names were annotated. Mentions of drugs or chemicals should strictly be in a therapeutic context. This category does not include the names of metabolites, reaction byproducts, or hospital chemicals (e.g. surgical equipment disinfectants).
Adverse effect: Mentions of adverse effects include signs, symptoms, diseases, disorders, acquired abnormalities, deficiencies, organ damage or death that strictly occur as a consequence of drug intake.
Possible labels:
1. ADE-related
2. not ADE-related

Sentence: Treatment of silastic catheter-induced central vein septic thrombophlebitis
Label: not ADE-related

Sentence: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg
Label: ADE-related

Sentence: In 1991 the patient were found to be seroposit

{'ADE-related': 0.31358153, 'not ADE-related': 0.68641853}

In this example we can see the model predicts that the example is not related to an adverse drug effect. We can use this technique to generate predictions across the whole test set! Let's take a look.
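
Turning the probability dictionary into a single predicted label is an argmax over its values. A small sketch, reusing the probabilities printed above as a literal so that it runs on its own:

```python
# Probabilities copied from the classifier output shown above
output_probs = {"ADE-related": 0.31358153, "not ADE-related": 0.68641853}

# The predicted label is the key with the highest probability
predicted_label = max(output_probs, key=output_probs.get)
confidence = output_probs[predicted_label]

print(predicted_label, round(confidence, 3))  # not ADE-related 0.686
```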

### Creating a submission file of predictions

To submit to the RAFT leaderboard, you'll need to provide a CSV file of predictions on the test set for each task (see [here](https://huggingface.co/datasets/ought/raft-submission) for detailed instructions). The following code snippet builds a DataFrame of predictions for the first $N$ test examples with the two columns required for submission, $(ID, Label)$, ready to be saved as a CSV.

Note that, with the code as written, this is expected to predict "not ADE-related" for all 10 test examples; few-shot classification is pretty hard!

In [12]:
# Increase this to len(test_dataset) to generate predictions over the full test set
N_TEST = 10
test_examples_to_predict = test_dataset.select(range(N_TEST))

def predict_one(clf, test_example):
    # Drop the dummy Label without mutating the caller's example
    features = {k: v for k, v in test_example.items() if k != "Label"}
    output_probs = clf.classify(features)
    # Pick the label with the highest predicted probability
    output_label = max(output_probs, key=output_probs.get)
    return output_label

data = []
for example in test_examples_to_predict:
    data.append({"ID": example["ID"], "Label": predict_one(classifier, example)})

result_df = pd.DataFrame(data=data, columns=["ID", "Label"]).astype({"ID": int, "Label": str})
print(result_df)

   ID            Label
0  50  not ADE-related
1  51  not ADE-related
2  52  not ADE-related
3  53  not ADE-related
4  54  not ADE-related
5  55  not ADE-related
6  56  not ADE-related
7  57  not ADE-related
8  58  not ADE-related
9  59  not ADE-related
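
The table above still has to be saved as a CSV before it can be submitted. The sketch below writes `ID,Label` rows with the standard-library `csv` module. `write_predictions` is a hypothetical helper, and the two rows are copied from the output above so that the example is self-contained.

```python
import csv
import io

def write_predictions(rows, handle):
    """Write prediction rows as CSV with an ID,Label header."""
    writer = csv.DictWriter(handle, fieldnames=["ID", "Label"])
    writer.writeheader()
    writer.writerows(rows)

# Two rows copied from the printed predictions above
rows = [
    {"ID": 50, "Label": "not ADE-related"},
    {"ID": 51, "Label": "not ADE-related"},
]

# An in-memory buffer keeps the example free of side effects; for a real
# file, use open(path, "w", newline="") with a path of your choosing.
buffer = io.StringIO()
write_predictions(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, `result_df.to_csv(path, index=False)` should produce the same two-column layout; check the submission instructions for the expected file name and location.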


Good luck with the rest of the benchmark!