## The GEESE Challenge: running the experiment

This notebook walks through the steps of the GEESE task, using the e-rte-3-it dataset [1] as the data source.

### Install Miniconda and LM-Eval
(First, we suggest creating a [Miniconda](https://docs.anaconda.com/miniconda/) environment and working within it.)

If you don't have it already, install the LM-Eval library by running the following cell.

In [None]:
%pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

### Step 1: Generate the explanations for the given labels with M1
We will use Meta-Llama-3-8B-Instruct as M1 to generate the explanations for the entailment labels in e-rte-3-it.

**IMPORTANT NOTE**

This step is implemented within the lm_eval framework. We provide custom code to generate explanations with Llama 3; you can change the model, the prompt and explanation types (in the `utils.py` script), and the generation parameters as you wish.
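
The generation configs point `doc_to_text` at functions in `utils.py` that take one dataset record and return the prompt string. As a minimal sketch of what such a function could look like (the prompt wording here is hypothetical and is not the one shipped with the challenge; the field names `text_t`, `text_h` and `label` are the ones used by the prediction prompts in Step 3):

```python
def generate_whyexp_explanation(doc: dict) -> str:
    """Build a prompt asking M1 to explain why the gold label holds.

    Hypothetical wording: adapt it to your own explanation type.
    """
    return (
        f"Sentence 1: {doc['text_t']}\n"
        f"Sentence 2: {doc['text_h']}\n"
        f"Entailment label: {doc['label']}\n"
        "Explain in one sentence why this label is correct.\n"
        "Explanation:"
    )


# Invented record for illustration only; it is not taken from e-rte-3-it.
example_doc = {
    "text_t": "Anna was born in Rome.",
    "text_h": "Anna was born in Italy.",
    "label": "YES",
}
print(generate_whyexp_explanation(example_doc))
```

A new explanation type would follow the same shape: one function per type, named so that it matches the `generate_{exp_type}_explanation` reference in the YAML file.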

In [None]:
# Create YAML configurations for each explanation type

EXPLANATION_TYPES = ["whyexp", "whynot", "implicit"]

YAML_geese_gen_template = '''
task: geese_generation_template
dataset_path: azaninello/e-RTE-3-it # the name of the dataset on the HF Hub.
dataset_name: null # the dataset configuration to use. Leave `null` if your dataset does not require a config to be passed. See https://huggingface.co/docs/datasets/load_hub#configurations for more info.
dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
training_split: null
validation_split: validation
test_split: test
fewshot_split: validation
output_type: generate_until
doc_to_text: !function utils.generate_whyexp_explanation
doc_to_target: "{{explanation}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
generation_kwargs:
  until:
    - "</s>"
    - "<|eot_id|>"
  max_gen_toks: 128
  do_sample: false
  temperature: 0
repeats: 1
metadata:
  version: 1.3
'''
with open('geese_generation_template.yaml', 'w') as f:
    f.write(YAML_geese_gen_template)

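for exp_type in EXPLANATION_TYPES:
    task = f"geese_generation_{exp_type}"
    # Each task inherits the template and overrides only its name and prompt function.
    # The YAML is built line by line so that the file has no leading indentation.
    YAML_geese_gen = (
        "include: geese_generation_template.yaml\n"
        "tag: geese_generation_tasks\n"
        f"task: {task}\n"
        f"doc_to_text: !function utils.generate_{exp_type}_explanation\n"
    )

    print(YAML_geese_gen)

    with open(f'generation-{task}.yaml', 'w') as f:
        f.write(YAML_geese_gen)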
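
Each per-type file relies on `include` to inherit everything from the template. As far as we understand the harness, the included file is loaded first and the keys of the including file are then applied on top as a shallow update. The sketch below mimics that behaviour with plain dictionaries; `resolve_include` is a hypothetical helper, not an lm_eval function, and only a few template keys are shown.

```python
def resolve_include(template: dict, child: dict) -> dict:
    """Return the template updated with the child's keys (the child wins)."""
    merged = dict(template)
    merged.update({k: v for k, v in child.items() if k != "include"})
    return merged


template = {
    "task": "geese_generation_template",
    "output_type": "generate_until",
    "doc_to_text": "utils.generate_whyexp_explanation",
}
child = {
    "include": "geese_generation_template.yaml",
    "tag": "geese_generation_tasks",
    "task": "geese_generation_whynot",
    "doc_to_text": "utils.generate_whynot_explanation",
}

merged = resolve_include(template, child)
print(merged["task"])         # geese_generation_whynot
print(merged["output_type"])  # generate_until
```

Under this assumption, anything you want to change for a single explanation type (for example `generation_kwargs`) can be overridden in its own file without touching the template.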

In [None]:
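# Run Step 1 with lm_eval.
# CUDA_VISIBLE_DEVICES is set on the same line as the command: each `!` line runs
# in its own shell, so setting it on a separate `!` line would have no effect.
# Note: Step 2 reads from `generation_outputs`; use that directory as --output_path
# (and drop --limit) for a full run.

!CUDA_VISIBLE_DEVICES=0,1 lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks geese_generation_tasks \
    --batch_size 24 \
    --include_path ./ \
    --output_path generation_outputs_test \
    --limit 48 \
    --log_samples \
    --write_out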

### Step 2: Post-process explanations
After saving your explanations into a dataset, they must be post-processed to prevent label leakage. This step is mandatory.

Save the anonymized dataset with your new explanations and push it to the HF Hub.

(Tip: label your anonymized explanations with a memorable name: you will use them in the next evaluation step.)
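
To make the idea of label leakage concrete, here is a small self-contained sketch of the masking applied in the post-processing cell. The pattern is the one used in that cell; `mask_labels` is a name chosen for this illustration, and the two sentences are invented examples.

```python
import re

# Label words (English and Italian stems) that would reveal the gold label to M2.
ANON_PATTERN = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutr\w*\b|\bimpl\w*\b|\bcontradd\w*\b)"


def mask_labels(text: str) -> str:
    """Replace every label word with the placeholder XXX (case-insensitive)."""
    return re.sub(ANON_PATTERN, "XXX", text, flags=re.IGNORECASE)


# An explanation that states the label outright
print(mask_labels("YES, the first sentence entails the second."))
# XXX, the first sentence XXX the second.

# Ordinary words that share a stem with a label are masked too
print(mask_labels("There is no implicit claim."))
# There is XXX XXX claim.
```

The second example shows the trade-off: a broad pattern removes leakage reliably but also masks harmless words, so the pattern should be tuned to the task, labels, and language at hand.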

In [None]:
import os
import re

from datasets import load_dataset

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
DATASET_NAME = "azaninello/e-RTE-3-it"
EXPLANATION_TYPES = ["human", "whyexp", "whynot", "implicit"]
# Sub-directory that lm_eval creates for the model inside the output path
GENERATION_DIR = os.path.join("generation_outputs", MODEL_NAME.replace("/", "__"))
EXPLAINED_DATASET_DIR = "./explained_data"

dataset = load_dataset(DATASET_NAME, split='test')
# Keep only the per-sample logs written in Step 1
pattern = fr"samples_geese_generation_({'|'.join(EXPLANATION_TYPES)})_.*?\.jsonl"
files = [file for file in os.listdir(GENERATION_DIR) if re.match(pattern, file)]

def anonymize(text):
    """Mask label words so that an explanation does not leak the gold label."""
    if text is None:
        return ""  # an empty string keeps the column type consistent
    # the anon_pattern needs to be customized according to task, labels, and language
    anon_pattern = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutr\w*\b|\bimpl\w*\b|\bcontradd\w*\b)"
    subst_str = "XXX"
    return re.sub(anon_pattern, subst_str, text, flags=re.IGNORECASE).strip()

for file in files:
    # The explanation type is the first group captured by the file-name pattern
    exp_type = re.match(pattern, file).group(1)
    explained_dataset = load_dataset('json', data_files=os.path.join(GENERATION_DIR, file), split='train')
    explanations = {item['doc']['id']: item['resps'][0][0].strip() for item in explained_dataset}
    # Add the raw explanation and its anonymized version as new columns
    dataset = dataset.map(lambda x: {exp_type: explanations.get(x["id"], '')})
    dataset = dataset.map(lambda x: {f"anon_{exp_type}": anonymize(x[exp_type])})

# Anonymize the human explanations that ship with the dataset
dataset = dataset.map(lambda x: {"anon_human": anonymize(x['explanation'])})

# Save locally, check the first record, and publish to the Hub
dataset.save_to_disk(EXPLAINED_DATASET_DIR)
print(dataset[0])
public_name = "explained-full-llama-3"
dataset.push_to_hub(public_name, private=False)

In [None]:
dataset[700]

### Step 3: Predict labels with explanations with M2

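We configure the prediction tasks for the M2 model as LM-Eval tasks through YAML configs, which let you set up a task by filling in preset fields. Here, we write one YAML config file per explanation set for multiple-choice evaluation on the e-rte-3-it labels (read from our newly generated dataset). The files are grouped under the *geese_prediction_baseline* and *geese_prediction_tasks* tags. Besides the four configurations listed below, the cell also writes configs for the other generated explanation types (anon_whynot, anon_implicit) and for an "obvious" control whose Hint is the gold label itself.

The four main configurations correspond to four explanation sets given as "Hint":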
1. no explanation (Hint = "Not given.")
2. dummy explanation (Hint = text_h)
3. the anonymized human explanations (Hint = anon_human)
4. the explanations generated in Step 1 (in our case Hint = anon_whyexp)
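
The four configurations differ only in what is placed after "Hint:". The sketch below renders the prompt for each of them with plain Python string formatting as a stand-in for the Jinja2 templating that lm_eval applies to `doc_to_text`. `build_prompt` and `HINT_FIELDS` are names made up for this illustration, and the record is an invented example.

```python
PROMPT_TEMPLATE = (
    "Your task is to predict the entailment label between two sentences, "
    "selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\n"
    "Sentence 1:{text_t}\n"
    "Sentence 2: {text_h}\n"
    "Hint: {hint}\n"
    "Entailment label:"
)

# Dataset column used as Hint in each configuration; None means no explanation.
HINT_FIELDS = {
    "noexp": None,
    "dummy": "text_h",
    "human": "anon_human",
    "whyexp": "anon_whyexp",
}


def build_prompt(doc: dict, config: str) -> str:
    """Fill the shared template with the Hint that belongs to `config`."""
    field = HINT_FIELDS[config]
    hint = "Not given." if field is None else f"{doc[field]}."
    return PROMPT_TEMPLATE.format(text_t=doc["text_t"], text_h=doc["text_h"], hint=hint)


# Invented record for illustration only; it is not taken from e-rte-3-it.
doc = {
    "text_t": "Anna was born in Rome.",
    "text_h": "Anna was born in Italy",
    "anon_human": "Rome is a city in Italy",
    "anon_whyexp": "Being born in Rome XXX being born in Italy",
}
for config in HINT_FIELDS:
    print(f"--- {config} ---")
    print(build_prompt(doc, config))
```

Printing the rendered prompts side by side is a quick way to check that a new explanation column ends up where M2 expects it before launching a full evaluation.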

In the gist configuration file presented at the CALAMITA Challenge, we currently report the **geese_human** configuration, but you are encouraged to propose your own explanations, **after post-processing** them (see above), and to read the anonymized explanations from the dataset produced in Step 2.

In this step you can only change the model used as M2 and select your own configuration(s) through the task name (see above).

**IMPORTANT NOTE**

This step is implemented within lm_eval for the geese_human configuration described in the gist [configuration file](https://gist.github.com/andreazaninello/6d92daeaa1c264477b32fd22acfbd818). 

In [None]:
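# Raw strings keep "\n" as a YAML escape sequence; in a plain string it would become
# a real line break, which YAML folds into a space inside a double-quoted value.
YAML_geese_noexp_string = r'''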
tag: geese_prediction_baseline
task: geese_prediction_noexp
dataset_path: azaninello/explained-full-llama-3
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: Not given.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('prediction-no-exp.yaml', 'w') as f:
    f.write(YAML_geese_noexp_string)

YAML_geese_dummy_string = r'''
include: prediction-no-exp.yaml
task: geese_prediction_dummy
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{text_h}}.\nEntailment label:"
'''
with open('prediction-dummy-exp.yaml', 'w') as f:
    f.write(YAML_geese_dummy_string)

YAML_geese_human_string = r'''
include: prediction-no-exp.yaml
task: geese_prediction_human
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_human}}.\nEntailment label:"
'''
with open('prediction-human-exp.yaml', 'w') as f:
    f.write(YAML_geese_human_string)

YAML_geese_whyexp_string = r'''
include: prediction-no-exp.yaml
task: geese_prediction_whyexp
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_whyexp}}.\nEntailment label:"
'''
with open('prediction-whyexp-pred.yaml', 'w') as f:
    f.write(YAML_geese_whyexp_string)

YAML_geese_whynot_string = r'''
include: prediction-no-exp.yaml
tag: geese_prediction_tasks
task: geese_prediction_whynot
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_whynot}}.\nEntailment label:"
'''
with open('prediction-whynot-exp.yaml', 'w') as f:
    f.write(YAML_geese_whynot_string)

YAML_geese_implicit_string = r'''
include: prediction-no-exp.yaml
tag: geese_prediction_tasks
task: geese_prediction_implicit
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_implicit}}.\nEntailment label:"
'''
with open('prediction-implicit-exp.yaml', 'w') as f:
    f.write(YAML_geese_implicit_string)

YAML_geese_obvious_string = r'''
include: prediction-no-exp.yaml
task: geese_prediction_obvious
doc_to_text: "Your task is to predict the entailment label between two sentences, selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).\nSentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{label}}.\nEntailment label:"
'''
with open('prediction-obvious-exp.yaml', 'w') as f:
    f.write(YAML_geese_obvious_string)

In [None]:
# Run on GPU

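# CUDA_VISIBLE_DEVICES is set on the same line as the command: each `!` line runs
# in its own shell, so setting it on a separate `!` line would have no effect.
!CUDA_VISIBLE_DEVICES=0,1 lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --include_path ./ \
    --tasks geese_prediction_baseline \
    --batch_size 24 \
    --output_path prediction_output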
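
Once the run has finished, the accuracy of each configuration can be compared. The sketch below assumes that lm_eval writes `results_*.json` files somewhere under the output directory and stores accuracy under the key `acc,none` inside a top-level `results` mapping; both details are assumptions about the harness output and may differ between versions. `collect_accuracy` is a helper written for this illustration.

```python
import glob
import json
import os


def collect_accuracy(results: dict) -> dict:
    """Map each task name to its accuracy, skipping entries without one."""
    table = {}
    for task, metrics in results.get("results", {}).items():
        acc = metrics.get("acc,none")  # assumed metric key
        if acc is not None:
            table[task] = acc
    return table


# Assumed layout: prediction_output/<model directory>/results_<timestamp>.json
pattern = os.path.join("prediction_output", "**", "results_*.json")
for path in glob.glob(pattern, recursive=True):
    with open(path) as f:
        accuracies = collect_accuracy(json.load(f))
    for task, acc in sorted(accuracies.items()):
        print(f"{task}: {acc:.3f}")
```

Comparing each explanation configuration against `geese_prediction_noexp` then shows how much a given set of explanations helps M2.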

## References
[1] A. Zaninello, S. Brenna, B. Magnini, *Textual entailment with natural language explanations: The Italian e-rte-3 dataset*, in: CLiC-it, 2023. URL: https://ceur-ws.org/Vol-3596/short21.pdf.