# The GEESE Challenge: running the experiment

In this notebook we walk through the three steps of the GEESE task, using the e-rte-3-it dataset [1] as our data source.

### Install Miniconda and LM-Eval
(Tip: we suggest creating a [Miniconda](https://docs.anaconda.com/miniconda/) environment and working within it.)

First, if you don't have it already, install the LM-Eval library by running the following cell.

In [None]:
%pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

### Step 1: Generate the explanations for the given labels with M1
We will use llama-3-8B-instruct as M1 to generate the explanations for the entailment labels in e-rte-3-it.

The script currently runs on CPU to avoid out-of-memory issues, but it can be run on GPU by uncommenting the relevant lines.

**IMPORTANT NOTE**

This step is **not** implemented within lm_eval and is not included in the gist configuration file. Here we provide custom code to generate explanations with llama-3. Change the model, prompt and generation parameters as you wish, and save the result as a Hugging Face Dataset.

In [None]:
import os, tqdm, torch, gc
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset, load_from_disk

os.environ["HF_TOKEN"] = "<your_hf_token>"  # Replace with your HuggingFace token

#os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Ensure enough GPU memory is available

def step_1_generate_explanation(model_name, dataset_name):
    # Run on CPU by default; to use the first GPU (if available), swap in the commented expression
    device = "cpu" #torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    #model = model.to(device) # to use GPU

    # Create a pipeline for text generation
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, pad_token_id=tokenizer.eos_token_id, device=device)

    # Load the dataset
    dataset = load_dataset(dataset_name, split='test')
    model_nick = "llama3"
    
    # Function to generate explanations
    def add_generated_explanation(example):
        text_t = example['text_t']
        text_h = example['text_h']
        label_str = example['label']
        
        def generate_explanation(sentence_1, sentence_2, label, exp_type="Explain why."):
            example_sentences = f"Sentence 1: {sentence_1.strip()}\nSentence 2: {sentence_2.strip()}"
            prompt = f"Your task is to provide an explanation for the label assigned for the entailment relationship between two sentences.\n{example_sentences}\nEntailment label: {label}\n{exp_type}"

            #with torch.no_grad(), torch.cuda.amp.autocast():
            explanation = generator(prompt, max_length=410, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

            return explanation[0]['generated_text'].split(f'{exp_type}')[-1]

        explanation_model = generate_explanation(sentence_1=text_t, sentence_2=text_h, label=label_str)
        example[f'anon_{model_nick}'] = explanation_model.strip()

        return example
    
    # Shuffle (and optionally select a smaller chunk of the dataset)
    dataset = dataset.shuffle(123) # dataset.shuffle(123).select(range(20))

    # Process the dataset (not in batches for memory control)
    updated_dataset = dataset.map(add_generated_explanation, batched=False)

    output_dir = f"./explained_{model_nick}"
    updated_dataset.save_to_disk(output_dir)

    print(f"Generation completed. Dataset saved to {output_dir}")

def main():
    step_1_generate_explanation(model_name="meta-llama/Meta-Llama-3-8B-Instruct", dataset_name="azaninello/e-RTE-3-it")

if __name__ == "__main__":
    main()
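
To make the prompt format and the post-generation split concrete, here is a minimal sketch that runs without loading any model. The helper names `build_prompt` and `extract_explanation` and the `fake_output` string are illustrative assumptions, not part of the script above; only the prompt text is copied from it.

```python
EXP_TYPE = "Explain why."

def build_prompt(sentence_1, sentence_2, label, exp_type=EXP_TYPE):
    """Assemble the Step 1 prompt from a sentence pair and its gold label."""
    example_sentences = f"Sentence 1: {sentence_1.strip()}\nSentence 2: {sentence_2.strip()}"
    return (
        "Your task is to provide an explanation for the label assigned for the "
        "entailment relationship between two sentences.\n"
        f"{example_sentences}\nEntailment label: {label}\n{exp_type}"
    )

def extract_explanation(generated_text, exp_type=EXP_TYPE):
    """Keep only the text that follows the last occurrence of the instruction."""
    return generated_text.split(exp_type)[-1].strip()

prompt = build_prompt(" A man is sleeping. ", "A man is awake.", "NO")

# Hypothetical model output: by default the text-generation pipeline
# returns the prompt followed by the continuation.
fake_output = prompt + " The two sentences describe incompatible states."

explanation = extract_explanation(fake_output)
print(explanation)
```

Note that if the model repeats the instruction inside its own continuation, the split keeps only the text after the last occurrence, so part of the explanation may be dropped.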


### Step 1b: Post-process explanations
After saving your explanations into a dataset, they must be post-processed to prevent label leakage. This step is mandatory. 

Save the anonymized dataset with your new explanations and push it to the HF hub.

(Tip: label your anonymized explanations with a memorable name: you will use them in the next evaluation step.)

In [4]:
# 1. Check the first example of the locally saved dataset
# 2. *** IMPORTANT STEP *** Anonymize the explanations to prevent label leakage
# 3. Save the new dataset to the Hugging Face Hub
# 4. Check the anonymized example

import re
from datasets import load_from_disk

model_nick = "llama3"
output_dir = f"./explained_{model_nick}"

# Load the saved dataset
loaded_dataset = load_from_disk(output_dir)

print("Original:", loaded_dataset[0][f'anon_{model_nick}'])  # Display the generated explanation of the first example, before anonymization

anon_pattern = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutral\w*\b)"
subst_str = "XXX"
    
def anonymize_explanations(example):
    # Mask label words in both the human and the generated explanation
    example['anon_human'] = re.sub(anon_pattern, subst_str, example['explanation'].strip(), flags=re.IGNORECASE)
    example[f'anon_{model_nick}'] = re.sub(anon_pattern, subst_str, example[f'anon_{model_nick}'].strip(), flags=re.IGNORECASE)
    return example

anon_dataset = loaded_dataset.map(anonymize_explanations, batched=False)

public_name = "geese-llama-3"
anon_dataset.push_to_hub(public_name, private=False)

print("Post-processed:", anon_dataset[0][f'anon_{model_nick}'])

print(f"Upload completed. Explanations saved as {public_name}")


Original: The entailment label is UNKNOWN because the two sentences do not have a direct entailment relationship. Sentence 1 states that it is rumored that Metin Kaplan ordered the murder of Ibrahim Sofu, while Sentence 2 states that Ibrahim Sofu was killed by Metin Kaplan. The two sentences present different information, with Sentence 1 providing a rumor or unverified information, and Sentence 2 presenting a factual statement. There is no logical connection between the two sentences that would imply one sentence is a direct consequence of the other. Therefore, the entailment label is UNKNOWN, indicating that the relationship between the two sentences is not clear or cannot be determined. 

Please note that this explanation is based on the assumption that the entailment label is assigned based on the logical relationship between the two sentences. If the entailment label is assigned based on other criteria, such as the similarity or overlap between the two sentences, a different explan

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 46.99ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  3.29it/s]
No files have been modified since last commit. Skipping to prevent empty commit.


Post-processed: The XXX label is XXX because the two sentences do not have a direct XXX relationship. Sentence 1 states that it is rumored that Metin Kaplan ordered the murder of Ibrahim Sofu, while Sentence 2 states that Ibrahim Sofu was killed by Metin Kaplan. The two sentences present different information, with Sentence 1 providing a rumor or unverified information, and Sentence 2 presenting a factual statement. There is XXX logical connection between the two sentences that would imply one sentence is a direct consequence of the other. Therefore, the XXX label is XXX, indicating that the relationship between the two sentences is not clear or cannot be determined. 

Please note that this explanation is based on the assumption that the XXX label is assigned based on the logical relationship between the two sentences. If the XXX label is assigned based on other criteria, such as the similarity or overlap between the two sentences, a different explanation may be necessary.
Upload compl

In [5]:
print("Example of anonymized explanation:", anon_dataset[0][f'anon_{model_nick}'])

Example of anonymized explanation: The XXX label is XXX because the two sentences do not have a direct XXX relationship. Sentence 1 states that it is rumored that Metin Kaplan ordered the murder of Ibrahim Sofu, while Sentence 2 states that Ibrahim Sofu was killed by Metin Kaplan. The two sentences present different information, with Sentence 1 providing a rumor or unverified information, and Sentence 2 presenting a factual statement. There is XXX logical connection between the two sentences that would imply one sentence is a direct consequence of the other. Therefore, the XXX label is XXX, indicating that the relationship between the two sentences is not clear or cannot be determined. 

Please note that this explanation is based on the assumption that the XXX label is assigned based on the logical relationship between the two sentences. If the XXX label is assigned based on other criteria, such as the similarity or overlap between the two sentences, a different explanation may be nece
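
Before pushing a dataset, it can help to verify that no label word survives the masking. The following is a minimal sketch of such a check; it reuses the pattern from the post-processing cell, while the helper names `anonymize` and `find_leaks` and the sample sentence are assumptions made for illustration.

```python
import re

# Same pattern as in the post-processing cell.
ANON_PATTERN = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutral\w*\b)"

def anonymize(text, placeholder="XXX"):
    """Mask label words, matching case-insensitively."""
    return re.sub(ANON_PATTERN, placeholder, text.strip(), flags=re.IGNORECASE)

def find_leaks(text):
    """Return every label word still present in the text."""
    return re.findall(ANON_PATTERN, text, flags=re.IGNORECASE)

# Hypothetical explanation, used only to exercise the helpers.
sample = "The entailment label is UNKNOWN because there is no direct link."
masked = anonymize(sample)

print(find_leaks(sample))
print(masked)
print(find_leaks(masked))
```

Because matching is case-insensitive, `\bNO\b` also masks the ordinary word "no", which is why the output above reads "There is XXX logical connection".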

### Step 2: Predict labels with explanations with M2

We configure the prediction tasks of the M2 model as LM-EVAL tasks with YAML configs. With configs, you can fill preset fields to easily set up a task. Here, we write 4 different YAML config files under the *geese_tasks* tag for multiple-choice evaluation on the e-rte-3-it labels (read from our newly generated dataset). 

Here we define four configurations corresponding to four explanation sets given as "Hint": 
1. no explanation (Hint = "Not given.")
2. dummy explanation (Hint = text_h)
3. the anonymized human explanations (Hint = anon_human)
4. the explanations generated in Step 1 (in our case Hint = anon_llama3)

For evaluation, you cannot change the prompt or the generation parameters. In the gist configuration file presented at the CALAMITA Challenge, we currently report the **geese_human** configuration. You are encouraged to propose your own explanations, provided you **post-process** them first (see above), and to read the anonymized explanations from the dataset produced in Step 1.
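
To see how the four hint variants change what M2 reads, here is a rough preview that fills the prompt template for a single record. It is only a sketch: the `render` helper is a simplified stand-in for the Jinja2 rendering done by the harness, the record in `doc` is hypothetical, and we assume the `\n` in the template ends up as a line break.

```python
import re

# Hint expressions for the four configurations.
HINTS = {
    "geese_noexp": "Not given.",
    "geese_dummy": "{{text_h}}.",
    "geese_human": "{{anon_human}}.",
    "geese_llama3": "{{anon_llama3}}.",
}

# Shared prompt template; HINT marks where the hint expression goes.
PROMPT = "Sentence 1:{{text_t}}Sentence 2: {{text_h}}Hint: HINT\nEntailment label:"

def render(template, doc):
    """Fill {{field}} placeholders from a record (simplified stand-in for Jinja2)."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(doc[m.group(1)]), template)

# Hypothetical record with the fields the templates read.
doc = {
    "text_t": "It is rumored that A ordered B.",
    "text_h": "A ordered B",
    "anon_human": "A rumor is not a fact",
    "anon_llama3": "The XXX label is XXX",
}

prompts = {task: render(PROMPT.replace("HINT", hint), doc) for task, hint in HINTS.items()}
for task, text in prompts.items():
    print(f"--- {task}\n{text}")
```

Only the text after `Hint:` differs between the four prompts, so any difference in M2 accuracy can be attributed to the explanation set.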

In [2]:
YAML_geese_noexp_string = '''
tag: geese_tasks
task: geese_noexp
dataset_path: azaninello/geese-llama-3
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}Sentence 2: {{text_h}}Hint: Not given.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('no-exp.yaml', 'w') as f:
    f.write(YAML_geese_noexp_string)

YAML_geese_dummy_string = '''
tag: geese_tasks
task: geese_dummy
dataset_path: azaninello/geese-llama-3
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}Sentence 2: {{text_h}}Hint: {{text_h}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('dummy-exp.yaml', 'w') as f:
    f.write(YAML_geese_dummy_string)

YAML_geese_human_string = '''
tag: geese_tasks
task: geese_human
dataset_path: azaninello/geese-llama-3
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}Sentence 2: {{text_h}}Hint: {{anon_human}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('human-exp.yaml', 'w') as f:
    f.write(YAML_geese_human_string)

YAML_geese_llama_string = '''
tag: geese_tasks
task: geese_llama3
dataset_path: azaninello/geese-llama-3
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}Sentence 2: {{text_h}}Hint: {{anon_llama3}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('llama3-exp.yaml', 'w') as f:
    f.write(YAML_geese_llama_string)
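
The `metric_list` entries above request accuracy averaged over examples. As a rough sketch of how we understand this for `multiple_choice` tasks, each candidate label is scored by its log-likelihood as a continuation of the prompt, the highest-scoring label is taken as the prediction, and `acc` is the mean of the per-example hits. The code below is a simplification for illustration, not the harness's own implementation, and the scores in `toy` are invented.

```python
CHOICES = ["YES", "NO", "UNKNOWN"]

def predict(loglikelihoods):
    """Pick the choice whose continuation received the highest log-likelihood."""
    best = max(range(len(CHOICES)), key=lambda i: loglikelihoods[i])
    return CHOICES[best]

def accuracy(scored_examples):
    """Mean of per-example 0/1 correctness (metric `acc`, aggregation `mean`)."""
    hits = [predict(scores) == gold for scores, gold in scored_examples]
    return sum(hits) / len(hits)

# Toy scores invented for illustration only; these are not model outputs.
toy = [
    ([-0.2, -2.1, -1.7], "YES"),
    ([-1.9, -0.4, -1.2], "NO"),
    ([-0.9, -1.5, -1.1], "UNKNOWN"),
    ([-2.3, -1.8, -0.3], "UNKNOWN"),
]

print(accuracy(toy))
```

Comparing this number across the four configurations is what tells you whether a given explanation set helps M2.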

### Step 3: Evaluate explanations on downstream task (M2 performance)

Here we provide custom code to test the generated explanations, using llama-3-8B as M2 for label prediction.

In this step you can only change the model used as M2 and select your own configuration(s) through the task name (see above).

**IMPORTANT NOTE**

This step **is** the one implemented within lm_eval for the geese_human configuration described in the gist [configuration file](https://gist.github.com/andreazaninello/6d92daeaa1c264477b32fd22acfbd818).

In [4]:
# Run on GPU
import gc
import torch

# Free cached GPU memory left over from earlier cells
torch.cuda.empty_cache()
gc.collect()

!CUDA_VISIBLE_DEVICES=0 accelerate launch --no_python lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --include_path ./ \
    --tasks geese_tasks \
    --limit 800 \
    --output_path output \
    --log_samples

In [None]:
# Run on CPU
!accelerate launch --cpu --no_python lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B,device_map=cpu \
    --include_path ./ \
    --tasks geese_tasks \
    --limit 800 \
    --output_path output \
    --log_samples


## References
[1] A. Zaninello, S. Brenna, B. Magnini, *Textual entailment with natural language explanations: The Italian e-rte-3 dataset* in: CLiC-it, 2023. URL: https://ceur-ws.org/Vol-3596/short21.pdf.