## Hyperparameter testing

#### **Verbs used in the study**
- 4 experiencer-stimulus (ES), 4 stimulus-experiencer (SE); 4 with positive sentiment, 4 with negative sentiment:
    - faszinieren ("fascinate") (SE)
    - inspirieren ("inspire") (SE)
    - enttäuschen ("disappoint") (SE)
    - schockieren ("shock") (SE)
    - bewundern ("admire") (ES)
    - respektieren ("respect") (ES)
    - verabscheuen ("despise") (ES)
    - hassen ("hate") (ES)


#### **Verbs used for hyperparameter testing**
- 4 ES, 4 SE; 4 with positive sentiment, 4 with negative sentiment:
    - stören ("disturb"/"bother") (SE)
    - langweilen ("bore") (SE)
    - entzücken ("delight") (SE)
    - amüsieren ("amuse") (SE)
    - fürchten ("fear") (ES)
    - beneiden ("envy") (ES)
    - Mitleid mit DP haben ("have pity on DP") (ES)
    - vergöttern ("idolize"/"adore") (ES)


### Import packages


In [3]:
import os
import pandas as pd
import numpy as np
import json
import torch
import pickle

# Evaluation
# (the unused `load_metric` / `list_metrics` imports from `datasets` are dropped:
# they are deprecated in favour of the `evaluate` package)
import evaluate
from evaluate import load

### Load and preprocess the data

In [5]:
with open("../data/pkl_files/parameter_tests/gpt2_hptest_01.pkl", "rb") as fp: 
    test01 = pickle.load(fp)
    
with open("../data/pkl_files/parameter_tests/gpt2_hptest_02.pkl", "rb") as fp: 
    test02 = pickle.load(fp)
       
with open("../data/pkl_files/parameter_tests/gpt2_hptest_03.pkl", "rb") as fp: 
    test03 = pickle.load(fp)   

with open("../data/pkl_files/parameter_tests/gpt2_hptest_04.pkl", "rb") as fp: 
    test04 = pickle.load(fp)
    
with open("../data/pkl_files/parameter_tests/gpt2_hptest_05.pkl", "rb") as fp: 
    test05 = pickle.load(fp)

with open("../data/pkl_files/parameter_tests/gpt2_hptest_06.pkl", "rb") as fp: 
    test06 = pickle.load(fp)
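
The six loading blocks above follow one file-naming pattern, so they could also be written as a loop. The following is a minimal sketch, not the code that was run: the helper name `load_runs` and the dictionary keys are assumptions made for illustration, while the file-name pattern is the one used above.

```python
import pickle
from pathlib import Path


def load_runs(directory, n_runs=6):
    """Load gpt2_hptest_01.pkl through gpt2_hptest_<n_runs>.pkl into one dict."""
    runs = {}
    for i in range(1, n_runs + 1):
        path = Path(directory) / f"gpt2_hptest_{i:02d}.pkl"
        with open(path, "rb") as fp:
            # keys mirror the variable names used in this notebook (test01, test02, ...)
            runs[f"test{i:02d}"] = pickle.load(fp)
    return runs


# Usage with the notebook's directory layout:
# runs = load_runs("../data/pkl_files/parameter_tests")
# test01 = runs["test01"]
```

Keeping the runs in a dictionary makes it easier to iterate over them later instead of repeating each call six times.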


In [5]:
## preprocess the data for automatic evaluation
import ast


def strip_prompt(generation):
    """Drop the prompt: everything up to the first comma plus the first word after it."""
    return " ".join(generation.partition(",")[2].split()[1:])


def create_lists_for_eval(datadict):
    """
    Extracts the references and predictions from the datadict and returns them as lists,
    so that they can be used for the automatic evaluation measures.

    Returns one list each for references, diverse beam search, nucleus sampling and
    typical sampling (in that order).
    """
    all_references = []
    all_divbeam_predictions = []
    all_nucsamp_predictions = []
    all_typsamp_predictions = []

    for verb in datadict:
        entries = datadict[verb]
        for i in range(len(entries["prompt"])):
            # the references are assumed to be stored as the string form of a Python
            # literal; ast.literal_eval parses them without executing arbitrary code
            all_references.append(ast.literal_eval(entries["human_reference"][i]))
            # the model generations still contain the prompt, so remove it:
            all_divbeam_predictions.append(strip_prompt(entries["model_generation_diversebeam"][i]))
            all_nucsamp_predictions.append(strip_prompt(entries["model_generation_nucleus_sampling"][i]))
            all_typsamp_predictions.append(strip_prompt(entries["model_generation_typical_sampling"][i]))

    return all_references, all_divbeam_predictions, all_nucsamp_predictions, all_typsamp_predictions
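
To make the prompt-removal step in the function above easier to follow, here is the same expression, `partition(",")[2].split()[1:]`, wrapped in a small helper `strip_prompt` and applied to one sentence. The example sentence is hypothetical and only assumes that a prompt ends with a comma followed by one connective word (for instance "weil"); the real prompts may look different.

```python
def strip_prompt(generation):
    """Keep only the text after the first comma, minus the first word that follows it."""
    after_comma = generation.partition(",")[2]   # "" if there is no comma at all
    return " ".join(after_comma.split()[1:])     # drop the first word, normalise spaces


# Hypothetical generation: prompt "Maria bewundert Peter, weil" plus a continuation.
example = "Maria bewundert Peter, weil er immer hilft."
continuation = strip_prompt(example)
print(continuation)
```

Note that a generation without any comma yields an empty string, and that any later commas in the continuation are preserved because `partition` splits only at the first one.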
    


In [6]:
## call the function for each dictionary:

test01_refs, test01_divbeam, test01_nucsamp, test01_typsamp = create_lists_for_eval(test01)

test02_refs, test02_divbeam, test02_nucsamp, test02_typsamp = create_lists_for_eval(test02)

test03_refs, test03_divbeam, test03_nucsamp, test03_typsamp = create_lists_for_eval(test03)

test04_refs, test04_divbeam, test04_nucsamp, test04_typsamp = create_lists_for_eval(test04)

test05_refs, test05_divbeam, test05_nucsamp, test05_typsamp = create_lists_for_eval(test05)

test06_refs, test06_divbeam, test06_nucsamp, test06_typsamp = create_lists_for_eval(test06)

In [7]:
# the references are identical across all six runs, so one reference list can be compared with every run's predictions
test01_refs == test02_refs == test03_refs == test04_refs == test05_refs == test06_refs

True
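
Because the references are shared, every run and decoding strategy can be scored against a single reference list in one loop instead of re-running a cell by hand. The sketch below is only an illustration under stated assumptions: `score_grid`, the nested dictionary layout and the strategy labels are hypothetical names, and `score_fn` stands for any function that maps references and predictions to one number.

```python
def score_grid(references, predictions_by_run, score_fn):
    """Apply score_fn(references, predictions) to every run and decoding strategy."""
    return {
        (run, strategy): round(score_fn(references, predictions), 2)
        for run, by_strategy in predictions_by_run.items()
        for strategy, predictions in by_strategy.items()
    }


# Usage with the notebook's variables (the metric must be loaded first):
# predictions_by_run = {
#     "test01": {"divbeam": test01_divbeam, "nucsamp": test01_nucsamp, "typsamp": test01_typsamp},
#     # and likewise for test02 through test06
# }
# unigram_precision = lambda refs, preds: bleu_metric.compute(
#     references=refs, predictions=preds)["precisions"][0]
# scores = score_grid(test01_refs, predictions_by_run, unigram_precision)
```

The result is a dictionary keyed by `(run, strategy)`, which can be turned into a table for comparing hyperparameter settings.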

### BLEU

In [6]:
bleu_metric = evaluate.load("bleu")

In [27]:
bleuscore = bleu_metric.compute(references=test01_refs,
                                predictions=test06_typsamp)

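# note: precisions[0] is the modified unigram precision, not the overall BLEU score (bleuscore['bleu'])
round(bleuscore['precisions'][0], 2)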

0.39
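
The value reported above is the first entry of `precisions`, i.e. the modified (clipped) unigram precision. As a rough illustration of what that number measures, here is a minimal sketch with plain whitespace tokenization. It is not the implementation behind `evaluate.load("bleu")`, which uses its own tokenizer, so values will not match exactly; the function name and the toy sentences are made up for the example.

```python
from collections import Counter


def clipped_unigram_precision(predictions, references):
    """Corpus-level unigram precision; each token count is clipped by its
    maximum count in any single reference of the same item."""
    matched = 0
    total = 0
    for prediction, refs in zip(predictions, references):
        pred_counts = Counter(prediction.split())
        max_ref_counts = Counter()
        for ref in refs:
            for token, count in Counter(ref.split()).items():
                max_ref_counts[token] = max(max_ref_counts[token], count)
        matched += sum(min(count, max_ref_counts[token])
                       for token, count in pred_counts.items())
        total += sum(pred_counts.values())
    return matched / total if total else 0.0


# Toy data: "ist" occurs twice in the prediction but at most once per reference,
# so only one of the two occurrences is counted as a match.
toy_predictions = ["er ist ist nett"]
toy_references = [["er ist nett", "sie ist nett"]]
toy_precision = clipped_unigram_precision(toy_predictions, toy_references)
print(toy_precision)
```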

### BERTScore

In [28]:
bertscore_metric = evaluate.load("bertscore")

In [50]:
bertscore = bertscore_metric.compute(references=test01_refs,
                                     predictions=test06_typsamp, 
                                     lang="de")

# note: 'f1' holds one score per prediction, so this is the F1 of the first prediction only
round(bertscore['f1'][0], 2)

0.77
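
BERTScore returns lists of per-prediction `precision`, `recall` and `f1` values, so indexing with `[0]` gives the score of the first prediction rather than a corpus-level figure. A minimal sketch of how the two differ, using a stand-in dictionary with made-up placeholder numbers (not results from this study):

```python
from statistics import mean

# Stand-in for the dict returned by bertscore_metric.compute(...);
# the three values are placeholders, one per prediction.
fake_bertscore = {"f1": [0.5, 0.75, 1.0]}

first_f1 = fake_bertscore["f1"][0]    # score of the first prediction only
mean_f1 = mean(fake_bertscore["f1"])  # average over all predictions

print(first_f1, mean_f1)
```

For comparing hyperparameter settings, the mean over all predictions is usually the more informative of the two.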

### GLEU (Google BLEU)

In [52]:
gleu_metric = evaluate.load("google_bleu")

In [73]:
gleuscore = gleu_metric.compute(references=test01_refs,
                                predictions=test06_typsamp)

round(gleuscore["google_bleu"],2)

0.11

### METEOR

In [74]:
meteor_metric = evaluate.load("meteor")



In [96]:
meteorscore = meteor_metric.compute(references=test01_refs,
                                    predictions=test06_typsamp)

round(meteorscore["meteor"],2)

0.3

### ROUGE

In [104]:
rouge_metric = evaluate.load('rouge')

In [130]:
rougescore = rouge_metric.compute(references=test01_refs,
                                  predictions=test06_typsamp)

round(rougescore["rougeL"],2)

0.26
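
ROUGE-L is based on the longest common subsequence (LCS) between a prediction and a reference. The following is a simplified sketch for a single prediction/reference pair with whitespace tokenization; it is not the `rouge` metric loaded above, which applies its own tokenization and aggregates over all examples. The function names and the toy strings are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """F-measure over LCS-based precision and recall for one pair."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy pair: the common subsequence "a c d" has length 3.
toy_rouge_l = rouge_l_f1("a b c d", "a c d e")
print(toy_rouge_l)
```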