# The GEESE Challenge

This notebook is a short tutorial on the steps needed to perform the GEESE task using the e-rte-3-it dataset [1].

### Install LM-Eval
First, if you don't have it already, install the LM-Eval library by uncommenting and running the following cell.

In [1]:
#!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

In [29]:
import os, tqdm, torch, gc
torch.cuda.empty_cache()
gc.collect()

9116

### Step 1: Generate the explanations for the given labels with M1
We will use Llama-3-8B-Instruct (`meta-llama/Meta-Llama-3-8B-Instruct`) as M1 to generate the explanations for the entailment labels in e-rte-3-it.

In [30]:
import os, tqdm, torch, gc
import re
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset, load_from_disk

os.environ["HF_TOKEN"] = "<your_hf_token>"  # Replace with your own Hugging Face token; never commit a real one

#os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Ensure enough GPU memory is available

def step_1_generate_explanation(model_name, dataset_name):
    # Use the first GPU if available, otherwise fall back to the CPU
    device = 0 if torch.cuda.is_available() else -1

    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Create a pipeline for text generation on the selected device
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, pad_token_id=tokenizer.eos_token_id, device=device)

    # Load the dataset
    dataset = load_dataset(dataset_name, split='test')
    model_nick = "llama3"
    
    # Function to generate explanations
    def add_generated_explanation(example):
        text_t = example['text_t']
        text_h = example['text_h']
        label_str = example['label']
        
        def generate_explanation(sentence_1, sentence_2, label, exp_type="Explain why."):
            example_sentences = f"Sentence 1: {sentence_1.strip()}\nSentence 2: {sentence_2.strip()}"
            prompt = f"Your task is to provide an explanation for the label assigned for the entailment relationship between two sentences.\n{example_sentences}\nEntailment label: {label}\n{exp_type}"

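            # max_length counts the prompt tokens too, so long sentence pairs leave less room for the explanation
            explanation = generator(prompt, max_length=410, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

            # The pipeline returns prompt + continuation; keep only the text after the instruction
            return explanation[0]['generated_text'].split(exp_type)[-1]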

        explanation_model = generate_explanation(sentence_1=text_t, sentence_2=text_h, label=label_str)
        example[f'anon_{model_nick}'] = explanation_model.strip()

        return example
    
    # Shuffle with a fixed seed; append .select(range(20)) to try a smaller chunk first
    dataset = dataset.shuffle(seed=123)

    # Process the dataset (not in batches for memory control)
    updated_dataset = dataset.map(add_generated_explanation, batched=False)

    output_dir = f"./explained_{model_nick}"
    updated_dataset.save_to_disk(output_dir)

    print(f"Generation completed. Dataset saved to {output_dir}")

def main():
    step_1_generate_explanation(model_name="meta-llama/Meta-Llama-3-8B-Instruct", dataset_name="azaninello/e-RTE-3-it")

if __name__ == "__main__":
    main()


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.23it/s]
Map:   0%|          | 0/800 [00:00<?, ? examples/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Map: 100%|██████████| 800/800 [2:39:27<00:00, 11.96s/ examples]  
Saving the dataset (1/1 shards): 100%|██████████| 800/800 [00:00<00:00, 173749.13 examples/s]

Generation completed. Dataset saved to ./explained_llama3
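
The prompt construction and the extraction of the explanation in Step 1 can be checked without loading M1. The following is a minimal sketch under stated assumptions: `build_prompt` and `extract_explanation` are hypothetical helper names that mirror the logic of `generate_explanation` above, the sentence pair is made up, and `fake_output` stands in for the text a generation pipeline would return (prompt followed by the continuation).

```python
EXP_TYPE = "Explain why."

def build_prompt(sentence_1, sentence_2, label, exp_type=EXP_TYPE):
    # Same prompt layout as in generate_explanation above
    example_sentences = f"Sentence 1: {sentence_1.strip()}\nSentence 2: {sentence_2.strip()}"
    return (
        "Your task is to provide an explanation for the label assigned for the "
        "entailment relationship between two sentences.\n"
        f"{example_sentences}\nEntailment label: {label}\n{exp_type}"
    )

def extract_explanation(generated_text, exp_type=EXP_TYPE):
    # Keep only what follows the last occurrence of the instruction
    return generated_text.split(exp_type)[-1].strip()

# Made-up example pair, not taken from e-rte-3-it
prompt = build_prompt(" The cat sleeps. ", "An animal sleeps.", "YES")

# Stand-in for the pipeline output: the prompt followed by the continuation
fake_output = prompt + " A cat is an animal, so the second sentence follows."
explanation = extract_explanation(fake_output)
print(explanation)
```

Because the split keeps the text after the last occurrence of the instruction, a continuation in which the model repeats "Explain why." would lose everything before that repetition.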





In [31]:
# Check the first example of the locally saved dataset, anonymize the explanations to prevent label-leakage, and save the new dataset to the Hugging Face Hub

import re
from datasets import load_from_disk

model_nick = "llama3"
output_dir = f"./explained_{model_nick}"

# Load the saved dataset
loaded_dataset = load_from_disk(output_dir)

print(loaded_dataset[0][f'anon_{model_nick}'])  # Display the first generated explanation (not yet anonymized at this point)

anon_pattern = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutral\w*\b)"
subst_str = "XXX"
    
def anonymize_explanations(example):
    # Mask label words in both the human and the generated explanations
    example['anon_human'] = re.sub(anon_pattern, subst_str, example['explanation'].strip(), flags=re.IGNORECASE)
    example[f'anon_{model_nick}'] = re.sub(anon_pattern, subst_str, example[f'anon_{model_nick}'].strip(), flags=re.IGNORECASE)
    return example

anon_dataset = loaded_dataset.map(anonymize_explanations, batched=False)

public_name = "geese-llama-3-full"
anon_dataset.push_to_hub(public_name, private=False)
print(f"Upload completed. Explanations saved as {public_name}")


The entailment label is UNKNOWN because the two sentences do not have a direct entailment relationship. Sentence 1 states that it is rumored that Metin Kaplan ordered the murder of Ibrahim Sofu, while Sentence 2 states that Ibrahim Sofu was killed by Metin Kaplan. The two sentences present different information, with Sentence 1 providing a rumor or unverified information, and Sentence 2 presenting a factual statement. There is no logical connection between the two sentences that would imply one sentence is a direct consequence of the other. Therefore, the entailment label is UNKNOWN, indicating that the relationship between the two sentences is not clear or cannot be determined. 

Please note that this explanation is based on the assumption that the entailment label is assigned based on the logical relationship between the two sentences. If the entailment label is assigned based on other criteria, such as the similarity or overlap between the two sentences, a different explanation may 

Map: 100%|██████████| 800/800 [00:00<00:00, 4535.20 examples/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 46.95ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.39s/it]


Upload completed. Explanations saved as geese-llama-3-full
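
To see what the anonymization does to a single explanation, the pattern can be applied to one string. This is a small illustrative sketch: the pattern is copied from the cell above, while `anonymize` and the sample sentence are hypothetical and not part of the dataset.

```python
import re

# Same pattern as in the anonymization cell above
anon_pattern = r"(\bYES\b|\bNO\b|\bUNKNOWN\b|\bentail\w*\b|\bcontradict\w*\b|\bneutral\w*\b)"

def anonymize(text, subst="XXX"):
    # Case-insensitive replacement of label words and their derivatives
    return re.sub(anon_pattern, subst, text.strip(), flags=re.IGNORECASE)

sample = "The entailment label is UNKNOWN because there is no evidence."
masked = anonymize(sample)
print(masked)  # The XXX label is XXX because there is XXX evidence.
```

Since the match is case-insensitive, the ordinary word "no" is masked together with the label `NO`, which is worth keeping in mind when reading the anonymized explanations.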


## Create "new" evaluation task with lm_eval

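We will configure the prediction tasks of the M2 model as LM-Eval tasks with YAML configs, whose preset fields make it easy to set up a task.

Here, we write four YAML configs for a multiple-choice evaluation on e-rte-3-it: a baseline with no hint (`geese_noexp`), a dummy hint that simply repeats the second sentence (`geese_dummy`), the anonymized human explanation (`geese_human`), and the anonymized M1 explanation (`geese_llama3`).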

In [33]:
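# Raw string: keeps "\n" as an escape sequence for the YAML parser instead of a literal line break
YAML_geese_noexp_string = r'''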
tag: geese_tasks
task: geese_noexp
dataset_path: azaninello/geese-llama-3-full
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: Not given.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('no-exp.yaml', 'w') as f:
    f.write(YAML_geese_noexp_string)

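# Raw string: keeps "\n" as an escape sequence for the YAML parser instead of a literal line break
YAML_geese_dummy_string = r'''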
tag: geese_tasks
task: geese_dummy
dataset_path: azaninello/geese-llama-3-full
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{text_h}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('dummy-exp.yaml', 'w') as f:
    f.write(YAML_geese_dummy_string)

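# Raw string: keeps "\n" as an escape sequence for the YAML parser instead of a literal line break
YAML_geese_human_string = r'''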
tag: geese_tasks
task: geese_human
dataset_path: azaninello/geese-llama-3-full
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_human}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('human-exp.yaml', 'w') as f:
    f.write(YAML_geese_human_string)

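# Raw string: keeps "\n" as an escape sequence for the YAML parser instead of a literal line break
YAML_geese_llama_string = r'''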
tag: geese_tasks
task: geese_llama3
dataset_path: azaninello/geese-llama-3-full
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: {{anon_llama3}}.\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''
with open('llama3-exp.yaml', 'w') as f:
    f.write(YAML_geese_llama_string)
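
The four configs above differ only in the task name and in the hint line, so they can also be generated from a single template. The sketch below is one possible way to do that, not part of the original notebook: `YAML_TEMPLATE`, `VARIANTS`, `build_configs`, `write_configs`, and the `TASK_NAME` / `HINT_TEXT` placeholders are hypothetical names, and the files are written to a scratch directory rather than to `./`. The template is a raw string, so `\n` is written to the file as an escape sequence and left for the YAML parser to interpret.

```python
import os
import tempfile

# Raw string: "\n" stays an escape sequence for the YAML parser to interpret
YAML_TEMPLATE = r'''
tag: geese_tasks
task: geese_TASK_NAME
dataset_path: azaninello/geese-llama-3-full
dataset_name: default
output_type: multiple_choice
test_split: test
validation_split: test
doc_to_text: "Sentence 1:{{text_t}}\nSentence 2: {{text_h}}\nHint: HINT_TEXT\nEntailment label:"
doc_to_target: label
doc_to_choice: ["YES", "NO", "UNKNOWN"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
'''

# File name -> (task suffix, hint shown to M2), as in the four cells above
VARIANTS = {
    "no-exp.yaml": ("noexp", "Not given."),
    "dummy-exp.yaml": ("dummy", "{{text_h}}."),
    "human-exp.yaml": ("human", "{{anon_human}}."),
    "llama3-exp.yaml": ("llama3", "{{anon_llama3}}."),
}

def build_configs():
    # Fill the two placeholders for each variant
    return {
        filename: YAML_TEMPLATE.replace("TASK_NAME", name).replace("HINT_TEXT", hint)
        for filename, (name, hint) in VARIANTS.items()
    }

def write_configs(out_dir):
    # Write one YAML file per variant into out_dir
    for filename, text in build_configs().items():
        with open(os.path.join(out_dir, filename), "w") as f:
            f.write(text)

configs = build_configs()
demo_dir = tempfile.mkdtemp()  # scratch directory for the demo; the notebook itself writes to ./
write_configs(demo_dir)
print(sorted(os.listdir(demo_dir)))
```

Calling `write_configs("./")` instead would place the files where the `--include_path ./` option of the evaluation command expects them.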

In [35]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --include_path ./ \
    --tasks geese_tasks \
    --limit 800 \
    --output_path output \
    --log_samples


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2024-09-24:04:08:45,190 INFO     [__main__.py:279] Verbosity set to INFO
2024-09-24:04:08:45,190 INFO     [__main__.py:303] Including path: ./
2024-09-24:04:08:45,219 INFO     [__init__.py:491] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-09-24:04:08:49,530 INFO     [__main__.py:376] Selected Tasks: ['geese_tasks']
2024-09-24:04:08:49,531 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting 