# CodeBert Grid Experiment Evaluation

**A full run of this Notebook takes about 40 minutes on my machine.**

Make sure to have all required dependencies installed - they are listed in the [environment.yml](./environment.yml). 
Create a conda environment from the yml using 

```
conda env create -f environment.yml
conda activate Lampion-Codebert-Evaluation
```

Make sure to run your Jupyter Notebook from that environment! 
Otherwise you are (still) missing the dependencies. 

**OPTIONALLY** you can use the environment in which your Jupyter Notebook is already running: start a new terminal (from Jupyter) and run 

```
conda env update --prefix ./env --file environment.yml  --prune
```

Manual Steps

---------------------------------------

The following steps need to be adjusted for the run to finish: 

1. Run the metric runner externally if you are on Windows
2. Change the data directories to match your setup
3. Change the config archetypes to match your configurations; they are used for printing

Please be aware that by the end of this notebook we create **a big .csv file.**

Some of the statistical tests were easier to do in R; these are provided in a separate file that starts from the bleus.csv created by the end of the notebook.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

import nltk
nltk.download("punkt")
# Homebrew Imports (python-file next to this)
import bleu_evaluator as foreign_bleu

# Set Jupyter vars
# %matplotlib notebook
plt.rcParams.update({'font.size': 35})

%matplotlib inline

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Leonh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data-Loading / Preparation

Make sure that your dataset is laid out as described in the [Readme](./README.md), that is 

```
./data
    /PaperResults
        /configs
            /reference
                test_0.gold
                test_0.output
                bleu.txt (optional, can be created below)
            /config_0
                config.properties
                test_0.gold
                test_0.output
                bleu.txt (optional, can be created below)
            /config_1
                config.properties
                test_0.gold
                test_0.output
                bleu.txt (optional, can be created below)
    ...
```

where the configs **must** be numbered to be correctly detected. 
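
As a rough illustration of what "numbered" means for detection, the sketch below walks a directory tree and keeps only `reference` and folders named `config_<number>` that contain both `test_0.gold` and `test_0.output`. The function name `find_config_dirs` and the regular expression are assumptions made for this example; they are not part of the notebook or the Readme.

```python
import os
import re
import tempfile

# Assumed naming scheme, taken from the layout shown above: config_0, config_1, ...
CONFIG_PATTERN = re.compile(r"^config_(\d+)$")

def find_config_dirs(data_directory):
    """Return {directory_name: path} for 'reference' and every numbered config folder."""
    found = {}
    for root, _dirs, files in os.walk(data_directory):
        name = os.path.basename(root)
        is_config = name == "reference" or CONFIG_PATTERN.match(name)
        if is_config and "test_0.gold" in files and "test_0.output" in files:
            found[name] = root
    return found

# Small self-check on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    for name in ["reference", "config_0", "config_10", "notes"]:
        folder = os.path.join(tmp, "configs", name)
        os.makedirs(folder)
        for file_name in ["test_0.gold", "test_0.output"]:
            open(os.path.join(folder, file_name), "w").close()
    discovered = sorted(find_config_dirs(tmp))
    print(discovered)
```

The `notes` folder is skipped because its name does not follow the numbered scheme, even though it contains the expected files.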

In [3]:
# This runs the bleu-score upon the config files, creating the bleu.txt's 
# If your data package was provided including the txt you don't need to do this. 
# Existing bleu.txt's will be overwritten. 

# Note: This did not behave as intended on Windows - run the command in an external bash instead

#!./metric_runner.sh ./data/java_results/

'.' is not recognized as an internal or external command,
operable program or batch file.


The following cells first walk the given data directories and collect all paths, 
then read the properties, and finally load all of the data.

The bleu.txt files are required at this stage.
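
The loading cells in this notebook read `test_0.output` and `test_0.gold` as lines of the form `index<TAB>text`. The sketch below shows one possible way to parse such lines; the helper name `parse_indexed_lines` and the sample sentences are made up for illustration. It splits only at the first tab, assuming that any further tabs belong to the text.

```python
def parse_indexed_lines(lines):
    """Map the leading integer of each 'index<TAB>text' line to its stripped text."""
    parsed = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing empty line at the end of a file
        # partition splits at the first tab only, so tabs inside the text survive
        index, _, text = line.partition("\t")
        parsed[int(index)] = text.strip()
    return parsed

# Hypothetical sample lines, not taken from the real data
sample = ["0\tReturns the name .\n", "1\tSets the\tvalue .\n", "\n"]
parsed_sample = parse_indexed_lines(sample)
print(parsed_sample)
```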

In [8]:
# The directory where to look for the data, default are the paper results
# Expected format: 
# [("Prefix","Path"),("Prefix2","Path2")]
data_directories = [("java","./data/java_results"),("python","./data/python_results")]

# These archetypes are later used to group the configurations
# While the grouping is up to you, right here it simply is one archetype for each transformation type, 
# Grouping together different configs with the same transformations applied (but different #Transformations)
config_archetypes = {
    "config_0":"if","config_1":"if","config_2":"if",
    "config_3":"neutral-element","config_4":"neutral-element","config_5":"neutral-element",
    "config_6":"mixed names(pseudo)","config_7":"mixed names(pseudo)","config_8":"mixed names(pseudo)",
    "config_9":"mixed-names(random)","config_10":"mixed-names(random)","config_11":"mixed-names(random)",
    "config_12": "add-var(pseudo)","config_13": "add-var(pseudo)","config_14": "add-var(pseudo)",
    "config_15": "add-var(random)","config_16": "add-var(random)","config_17": "add-var(random)",
    "config_18": "if & neutral-element","config_19": "if & neutral-element","config_20": "if & neutral-element"
}
print(f"looking for results in {data_directories}" )

results={}

for (prefix,data_directory) in data_directories:
    print(f"Looking for {prefix} at {data_directory}")
    results[prefix]={}
    for root,dirs,files in os.walk(data_directory):
        for name in files:
            if ".gold" in name:
                directory = os.path.basename(root)
                results[prefix][directory]={}
                results[prefix][directory]["prefix"]=prefix
                results[prefix][directory]["result_file"]=os.path.join(root,"test_0.output")
                results[prefix][directory]["gold_file"]=os.path.join(root,"test_0.gold")
                results[prefix][directory]["bleu_file"]=os.path.join(root,"bleu.txt")
                if os.path.exists(os.path.join(root,"config.properties")):
                    results[prefix][directory]["property_file"]=os.path.join(root,"config.properties")
                
print(f"Found {len(results.keys())} configuration folders in {data_directory}")

looking for results in [('java', './data/java_results'), ('python', './data/python_results')]
Looking for java at ./data/java_results
Looking for python at ./data/python_results
Found 2 configuration folders in ./data/python_results


In [16]:
def load_properties(filepath, sep='=', comment_char='#'):
    """
    Read the file passed as parameter as a properties file.
    """
    props = {}
    with open(filepath, "rt") as f:
        for line in f:
            l = line.strip()
            if l and not l.startswith(comment_char):
                key_value = l.split(sep)
                key = key_value[0].strip()
                value = sep.join(key_value[1:]).strip().strip('"') 
                props[key] = value 
    return props

print("reading in property-files")

for prefix in results.keys():
    for key in results[prefix].keys():
        if "property_file" in results[prefix][key].keys():
            results[f"{prefix}"][f"{key}"]["properties"]=load_properties(results[prefix][key]["property_file"])

print("done reading the properties")

reading in property-files
done reading the properties


In [24]:
print("reading in result-files")

for prefix in results.keys():
    for key in results[prefix].keys():
        result_file = results[prefix][key]["result_file"]
        f = open(result_file)
        lines=f.readlines()
        results[prefix][key]["results"]={}
        for l in lines:
            num = int(l.split("\t")[0])
            content = l.split("\t")[1]
            content = content.strip()
            results[prefix][key]["results"][num] = content
        f.close()

        gold_file = results[prefix][key]['gold_file']
        gf = open(gold_file)
        glines=gf.readlines()
        results[prefix][key]["gold_results"]={}
        for gl in glines:
            num = int(gl.split("\t")[0])
            content = gl.split("\t")[1]
            content = content.strip()
            results[prefix][key]["gold_results"][num] = content
        gf.close()

print("done reading the result files")
# Comment this in for inspection of results
#results

reading in result-files
done reading the result files


In [31]:
print("reading in the bleu-scores")

for prefix in results.keys():
    for key in results[prefix].keys():
        bleu_file = results[prefix][key]["bleu_file"]
        f = open(bleu_file)
        score=f.readlines()[0]
        results[prefix][key]["bleu"]=float(score)
        f.close()
    
print("done reading the bleu-scores")

#results["java"]["config_0"]["bleu"]

reading in the bleu-scores
done reading the bleu-scores


The following are little helpers and wrappers to make the notebook a bit smoother.

In [39]:
"""
There is a small issue with the configs being named config_0, config_1, config_10:
As they are treated as strings, config_10 is "smaller than" config_2, making the sort unintuitive
This method should help to sort configs in the intended way: config_1,config_2,...,config_9,config_10,config_11,...

config_num can be used to sort the configs where necessary. It can be used e.g. as 
    sorted(non_reference_configs,key=config_num)
"""
def config_num(c):
    # Fallback: If we are not trying to sort configs, just do a normal compare
    if not "config_" in c:
        return -1
    else:
        c_part = int(c.split("_")[1])
        return c_part

# The non reference configs are all result-keys that are not "reference"
# Additionally, they are sorted to match the above behaviour (config10>config2)
non_reference_configs = [] 

for prefix in results.keys():
    # The entries are (prefix, config) pairs, so the sort key has to look at the config name
    non_reference_configs += sorted([(prefix,k) for k in results[prefix].keys() if "reference" != k],key=lambda pair: config_num(pair[1]))

print(non_reference_configs)
# Set the Archetypes also into the results using the Archetype Dictionary defined at the beginning of the notebook
for (prefix, key) in non_reference_configs:
        if "property_file" in results[prefix][key].keys():
            results[prefix][key]["archetype"]=config_archetypes[key]

# This helps looking up archetype+transformations per configuration
def archetype_info(config):
    archetype = config_archetypes[config]
    transforms = int(results[config]["properties"]["transformations"])
    return (archetype,transforms)

# Pretty Print archetype info for a given config
print_archetype_info = lambda config: f"{(archetype_info(config))[0]}@{(archetype_info(config))[1]}"

# Another Set of archetypes used e.g. for grouping and printing
all_archetypes = set(config_archetypes.values())

# archetype-MT-Mapping for Paper (Where we use MT)
archetype_mt_mapping = {
    "if":"MT-IF",
    "neutral-element":"MT-NE",
    "mixed names(pseudo)": "MT-REP + MT-UVP",
    "mixed-names(random)": "MT-RER + MT-UVR",
    "add-var(pseudo)":"MT-UVP",
    "add-var(random)":"MT-UVR",
    "if & neutral-element":"MT-IF + MT-NE"
}

[('java', 'config_0'), ('java', 'config_1'), ('java', 'config_2'), ('java', 'config_3'), ('java', 'config_4'), ('java', 'config_5'), ('java', 'config_6'), ('java', 'config_7'), ('java', 'config_8'), ('java', 'config_9'), ('java', 'config_10'), ('java', 'config_11'), ('java', 'config_12'), ('java', 'config_13'), ('java', 'config_14'), ('python', 'config_0'), ('python', 'config_1'), ('python', 'config_2'), ('python', 'config_3'), ('python', 'config_4'), ('python', 'config_5'), ('python', 'config_6'), ('python', 'config_7'), ('python', 'config_8'), ('python', 'config_9'), ('python', 'config_10'), ('python', 'config_11'), ('python', 'config_12'), ('python', 'config_13'), ('python', 'config_14')]
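
The sorting problem described in the docstring of this cell can be illustrated in isolation. The following is a minimal sketch, not the notebook's own helper: `config_sort_key` is a hypothetical name, and it assumes that config names always have the form `config_<number>`. When the items are `(prefix, config)` pairs, the key has to be applied to the second element of each pair.

```python
def config_sort_key(name):
    """Sort key: non-config names first, then configs by their numeric suffix."""
    prefix, _, suffix = name.rpartition("_")
    if prefix == "config" and suffix.isdigit():
        return (1, int(suffix))
    return (0, 0)

names = ["config_10", "config_2", "reference", "config_1"]
ordered = sorted(names, key=config_sort_key)
print(ordered)

# For (prefix, config) pairs, point the key at the config name
pairs = [("java", "config_10"), ("java", "config_2")]
ordered_pairs = sorted(pairs, key=lambda pair: config_sort_key(pair[1]))
print(ordered_pairs)
```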


In [40]:
# These two wrappers are adapters to the nltk library. 
# In addition, they cover often-occurring errors with a default behaviour
# (instead of throwing errors)

def jaccard_wrapper(sentenceA,sentenceB,ngram=1,lowercasing=True):
    a = sentenceA.lower() if lowercasing else sentenceA
    b = sentenceB.lower() if lowercasing else sentenceB
    tokensA = nltk.word_tokenize(a)
    tokensB = nltk.word_tokenize(b)

    ngA_tokens = set(nltk.ngrams(tokensA, n=ngram))
    ngB_tokens = set(nltk.ngrams(tokensB, n=ngram))
    
    if (len(ngB_tokens)==0) and (len(ngA_tokens)==0):
        return 0
    if (len(ngB_tokens)==0) or (len(ngA_tokens)==0):
        return 1
    
    return nltk.jaccard_distance(ngA_tokens, ngB_tokens)

def bleu_wrapper(sentence_to_check,reference):
    check_tokens = nltk.word_tokenize(sentence_to_check)
    ref_tokens = nltk.word_tokenize(reference)
    
    # From comparing the foreign_bleu and nltk the method4 seems to match
    # The Paper names the BLEU-4-Score with a citation to chen & cherry
    # I wish I could be named chen & cherry, its a very cool name. 
    chencherry = nltk.translate.bleu_score.SmoothingFunction()
    smooth_fn = chencherry.method4
    
    try:
        return nltk.translate.bleu_score.sentence_bleu([ref_tokens],check_tokens,smoothing_function=smooth_fn)
    except Exception:
        # Fall back to a score of 0 if nltk cannot score this pair
        return 0

In [41]:
results

{'java': {'config_0': {'prefix': 'java',
   'result_file': './data/java_results\\configs\\config_0\\test_0.output',
   'gold_file': './data/java_results\\configs\\config_0\\test_0.gold',
   'bleu_file': './data/java_results\\configs\\config_0\\bleu.txt',
   'property_file': './data/java_results\\configs\\config_0\\config.properties',
   'properties': {'version': '1.1',
    'transformationscope': 'perClassEach',
    'transformations': '1',
    'inputDirectory': '/usr/app/obfuscator_input',
    'outputDirectory': '/usr/app/obfuscator_output',
    'seed': '2022',
    'compilingTransformers': 'false',
    'setAutoImports': 'false',
    'removeAllComments': 'false',
    'IfTrueTransformer': 'True',
    'IfFalseElseTransformer': 'True',
    'AddNeutralElementTransformer': 'False',
    'LambdaIdentityTransformer': 'False',
    'RandomParameterNameTransformer': 'False',
    'RandomParameterNameStringRandomness': 'pseudo',
    'AddUnusedVariableTransformer': 'False',
    'UnusedVariableStringRa

## Bleu-Scores

In the following, the BLEU-scores will be calculated using the foreign library. 
While there have been minor changes to standard-BLEU, it is the same as used in the original experiment.

The aggregated BLEU-Scores will be stored to the results.

In [None]:
bleu_data = {}
archetypes = set([results[k]["archetype"] for k in results.keys() if "archetype" in results[k].keys()])
for archetype in archetypes:
    bleu_data[archetype]={}
    bleu_data[archetype][0]=results["reference"]["bleu"]
    relevant_configs = [k for k 
                        in results.keys() 
                        if "archetype" in results[k].keys() 
                        and results[k]["archetype"]==archetype]
    for c in relevant_configs:
        bleu_data[archetype][int(results[c]["properties"]["transformations"])]=results[c]["bleu"]

bleu_data_df = pd.DataFrame.from_dict(bleu_data)
bleu_data_df = bleu_data_df.sort_index()
bleu_data_df = bleu_data_df.applymap(lambda cell: round(cell,3))
bleu_data_df.columns = [archetype_mt_mapping[n] for n in bleu_data_df.columns]

with open("./exports/bleu_table.tex","w") as f: 
    f.write(
        bleu_data_df.to_latex(
            caption="BLEU4-Scores for increasing number of metamorphic transformations \n (applied n-times per datapoint)"
            ,label="tab:bleus"
            ,position="tbh"
            #,column_format={rrrrrrr}
        )         
    )

bleu_data_df

In [None]:
#bleu_data_df.columns = [archetype_mt_mapping[a] for a in bleu_data_df.columns]

plt.figure(figsize=(14,7))
plt.ylabel("BLEU-Score",fontsize=20)
#plt.xlabel("# Transformations")
plt.xlabel("Order",fontsize=22)
#for latex, its nicer to have the title set from latex itself
#plt.title("BLEU4-Scores for increasing number of metamorphic transformations \n (applied n-times per datapoint)")

plot = sns.lineplot(data=bleu_data_df,markers=True,style=None,dashes=False)
plt.xticks([0,1,5,10],fontsize=20)
plt.yticks(fontsize=20)
plt.xlim(-0.025,10.1)
plt.legend(bleu_data_df.columns,fontsize=16)
plt.savefig('images/bleu_scores.png')
plt.show()

In [None]:
bleu_data_df_transposed = bleu_data_df.transpose()
bleu_data_df_transposed = bleu_data_df_transposed.drop(axis=1,columns=0)
with open("./exports/transposed_bleu_table.tex","w") as f: 
    f.write(
        bleu_data_df_transposed.to_latex(
            caption="BLEU4-Scores for increasing order of metamorphic transformations \n (applied n-times per datapoint)"
            ,label="tab:bleus"
            ,position="th"
            #,column_format={rrrrrrr}
        )          
)

#bleu_data_df_transposed

## Per Entry Bleu

Now we use the nltk-provided BLEU score to calculate the BLEU scores for all entries.
We store them per result, always as bleu(gold,config).

The nltk BLEU score ranges from 0 to 1 instead of 0 to 100; the two scales differ only by a factor of 100.

In [None]:
# This wrapper applies the "bleu_wrapper" to every element of a configurations results.
# The result is a list of [bleu-score(config[i],gold[i])]
# Entries are in ascending order
calculate_bleus = lambda config_id : [
    bleu_wrapper(results[config_id]["results"][i],results[config_id]["gold_results"][i]) 
    for i 
    in results[config_id]["results"].keys()
]

In [None]:
"""
These plots, while not necessarily the best, try to compare the bleus of the reference to the bleus of a config.
They don't take very long, the actual bleu-calculation is what takes time in the cell below.
"""

def plot_bleu_histogram(config_data,reference_data,title):
    plt.figure(figsize=(14,7))
    
    histo_df=pd.DataFrame.from_dict(
        {"reference":reference_data,
            title:config_data }
    )

    sns.displot(
        data=histo_df,
        kind="hist", kde=True,
        height=6, aspect=10/6
               )
    plt.title(f"Histogram of Bleu-Scores for {title}")
    plt.xlabel("Bleu-Score")
    #plt.ylabel("# of Entries")
    plt.xlim(0,1)
    plt.savefig(f'images/{title}_bleu_histogram.png')
    plt.show()
    
def plot_bleu_boxplot(config_data,reference_data,title=None):
    fig = plt.figure(figsize=(6,4))
    ax = fig.add_subplot(1, 1, 1)
    box_df=pd.DataFrame.from_dict(
        {"reference":reference_data,
            title:config_data }
    )    
    sns.boxplot(
        data=box_df)
    
    plt.title(f"Boxplot of Bleu-Scores for {title}")
    plt.ylabel("Bleu-Score")
    
    major_ticks = np.arange(0, 1, 0.2)
    minor_ticks = np.arange(0, 1, 0.05)

    ax.set_yticks(major_ticks)
    ax.set_yticks(minor_ticks, minor=True)

    # And a corresponding grid
    ax.grid(which='both')

    #plt.grid()
    # Set the limits before saving, so the saved image uses them too
    plt.ylim(0,1)
    plt.savefig(f'images/{title}_bleu_box.png')
    
    plt.show()
    
def plot_bleu_violinplot(config_data,reference_data,title):
    plt.figure(figsize=(6,4))
    violin_df=pd.DataFrame.from_dict(
        {"reference":reference_data,
            title:config_data }
    )
    
    sns.violinplot(data=violin_df)
    
    plt.grid()
    plt.title(f"ViolinPlot of Bleu-Scores for {title}")
    plt.ylabel("Bleu-Score")
    
    plt.savefig(f'images/{title}_bleu_violin.png')
    plt.show()
    
#plot_bleu_violinplot(sample_bleus_config_data,bleus_reference_data,"config_20")
#plot_bleu_boxplot(sample_bleus_config_data,bleus_reference_data,"config_20")
#plot_bleu_histogram(sample_bleus_config_data,bleus_reference_data,"config_20")

In [None]:
%%time
# Calculate the reference bleus and store them
bleus_reference_data = calculate_bleus("reference")
results["reference"]["bleu_values"]=bleus_reference_data

# For every entry in the config, calculate bleu and make comparison plots
for config in non_reference_configs:
    bleus_data = calculate_bleus(config)
    # Set the bleu values to only calculate them once
    results[config]["bleu_values"]=bleus_data
    # Use the bleu-data to make some plots
    plot_bleu_violinplot(bleus_data,bleus_reference_data,config)
    plot_bleu_boxplot(bleus_data,bleus_reference_data,config)
    plot_bleu_histogram(bleus_data,bleus_reference_data,config)
    # Delete the bleu data to free some memory and not collide on names
    del bleus_data

## Samples

Before the samples can be inspected, the items need to be re-indexed. 
While all config_results are in the reference_results, the data might be shuffled, so the same index can point to different entries. 

To fix this, the config entries are re-indexed to match the reference.
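
One possible way to do such a re-indexing is to invert the reference's gold dictionary once and then look up every config entry in it, which avoids scanning the whole reference for each sentence. The sketch below is only an illustration on toy data: the function name `reindex_by_gold` and the sample sentences are assumptions, and it assumes gold sentences are unique (for duplicates, the first reference index wins).

```python
def reindex_by_gold(config_gold, config_results, reference_gold):
    """Re-key a config's gold/result dicts so they share the reference's indices."""
    # Invert the reference once: gold sentence -> reference index.
    # Assumption: gold sentences are unique; for duplicates the first index is kept.
    sentence_to_ref_index = {}
    for ref_index, sentence in reference_gold.items():
        sentence_to_ref_index.setdefault(sentence, ref_index)

    new_gold, new_results = {}, {}
    for config_index, sentence in config_gold.items():
        ref_index = sentence_to_ref_index.get(sentence)
        if ref_index is None:
            continue  # no counterpart in the reference, so the entry is dropped
        new_gold[ref_index] = sentence
        new_results[ref_index] = config_results[config_index]
    return new_gold, new_results

# Toy data, not taken from the experiment
reference_gold = {0: "gets the name", 1: "sets the name", 2: "closes the stream"}
config_gold = {0: "closes the stream", 1: "gets the name", 2: "unknown entry"}
config_results = {0: "close stream", 1: "get name", 2: "???"}

new_gold, new_results = reindex_by_gold(config_gold, config_results, reference_gold)
print(new_gold)
print(new_results)
```

After the call, index 2 of the config points to the same gold sentence as index 2 of the reference, and the entry without a match is gone.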

In [None]:
%%time
#Reindexing Pseudocode

def lookup_index(sentence, comparison_dict):
    for (gold_key,gold_value) in comparison_dict.items():
        if sentence == gold_value:
            return gold_key
    return -1

# Pseudocode:
# For each config (that is not reference)
    # Create a lookup of reference_gold_index -> config_gold_index
    # Invert the lookup 
    # Make a new dictionary where
        # For every key of the config_gold
        # lookup the key of the reference_gold
        # And fill it with {reference_gold_key : config_gold_value}
        # Do the same with the non-gold results
        # Fill it with {reference_gold_key : config_result_value}
    # Set result[config_X]["gold_results"] to the newly created, matching index one 
    # same for non-gold-results
    
for config in non_reference_configs:
    keyMapping={}
    for (k,v) in results[config]["gold_results"].items():
        gk = lookup_index(v,results["reference"]["gold_results"])
        keyMapping[k]=gk
    new_gold_results={}
    new_results={}
    for (config_key,gold_key) in keyMapping.items():
        if gold_key != -1:
            new_gold_results[gold_key]=results[config]["gold_results"][config_key]
            new_results[gold_key]=results[config]["results"][config_key]
    results[config]["gold_results"]=new_gold_results
    results[config]["results"]=new_results

In [None]:
# Short Example that the reindexing worked and looks about right
sample_index = 250
print(results["reference"]["gold_results"][sample_index] )
print()
print(results["reference"]["results"][sample_index])
print(results["config_2"]["results"][sample_index])
del sample_index

## Probing and Sampling

These cells look into the entries and find outstanding / most prominent results given diverse criteria. 
As they are qualitative inspections, they are not being plotted but only printed.

(Previously *hall of shame*)

In [None]:
%%time
biggest_len_inc = 0
biggest_len_inc_pos = ()

biggest_len_dec = 0
biggest_len_dec_pos = ()

biggest_jaccard_dist = 0
biggest_jaccard_dist_pos = ()

smallest_jaccard_dist = 1 
smallest_jaccard_dist_pos = ()

for config in non_reference_configs:
    for index in list(results[config]["results"].keys()):
        gold = results["reference"]["gold_results"][index]
        reference = results["reference"]["results"][index]
        altered = results[config]["results"][index]
        
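        # Note: despite its name, biggest_len_inc tracks the biggest *decrease* in length
        # (reference longer than altered); it is printed as such further below.
        if len(reference)-len(altered)>biggest_len_inc:
            biggest_len_inc = len(reference)-len(altered)
            biggest_len_inc_pos = (index,config)
        # Likewise, biggest_len_dec tracks the biggest *increase* in length.
        if len(altered)-len(reference)>biggest_len_dec:
            biggest_len_dec = len(altered)-len(reference)
            biggest_len_dec_pos = (index,config)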
            
        jacc_dist = jaccard_wrapper(altered,reference)
        if jacc_dist > biggest_jaccard_dist and jacc_dist < 1:
            biggest_jaccard_dist = jacc_dist
            biggest_jaccard_dist_pos = (index,config)
        if jacc_dist < smallest_jaccard_dist and jacc_dist > 0:
            smallest_jaccard_dist = jacc_dist
            smallest_jaccard_dist_pos = (index,config)
            
# This method prints the i-th entry of config X as well as the gold and reference entry for it.
def print_config_item_with_reference(index,config):
    print("Gold:")
    print(results[config]["gold_results"][index])
    print("Reference:")
    print(results["reference"]["results"][index])
    print(f"Altered ({config}@{index}):")
    print(results[config]["results"][index])

In [None]:
print("Biggest jaccard Distance (that is not 1):\n")
print_config_item_with_reference(biggest_jaccard_dist_pos[0],biggest_jaccard_dist_pos[1])

In [None]:
print("Biggest decrease in length:\n")
print_config_item_with_reference(biggest_len_inc_pos[0],biggest_len_inc_pos[1])

In [None]:
print("Biggest increase in length:\n")
print_config_item_with_reference(biggest_len_dec_pos[0],biggest_len_dec_pos[1])

In [None]:
print("Smallest Jaccard Distance (that is not 0):\n")
print_config_item_with_reference(smallest_jaccard_dist_pos[0],smallest_jaccard_dist_pos[1])

**Fishy example from a kids' Java-learning book.** 
The code is actually about learning switch-case statements and sets an image to the corresponding fishes (e.g. empty fish glass, fish glass with 2 fishes etc.)

The code examples are put into the paper repository as a separate artefact.

In [None]:
fishyKey = -1
for (key,value) in results["reference"]["gold_results"].items():
    #print(value)
    if "makeAFishyDecision " in value:
        fishyKey = key

print("Fishy Results! \n")
print("Gold:")
print(results["reference"]["gold_results"][fishyKey])
print("Reference:")
print(results["reference"]["results"][fishyKey],"\n")
#for config in non_reference_configs:
for config in ["config_0","config_1","config_20","config_10"]:
    print(f"Altered({config},{print_archetype_info(config)}):")
    print(results[config]["results"][fishyKey])

In [None]:
entries_to_look_at = 3
longest_gold = sorted(list(results["reference"]["gold_results"].items()),reverse=True,key=lambda pair: len(pair[1]))[:entries_to_look_at]
#longest_gold
for l_gold in longest_gold:
    #for config in non_reference_configs:
    for config in ["config_1","config_7","config_14"]:
        print_config_item_with_reference(l_gold[0],config)
        print()

In [None]:
shortest_gold = sorted(list(results["reference"]["gold_results"].items()),reverse=True,key=lambda pair: len(pair[1]))[-3:]
#shortest_gold
for s_gold in shortest_gold:
    #for config in non_reference_configs:
    for config in ["config_1","config_7","config_14"]:
        print_config_item_with_reference(s_gold[0],config)
        print()

For the shortest gold standard you can clearly see that the gold-standard is cut at the first @-Sign. 

Looking for certain keywords in the altered configs

We want to inspect 

- which entry contains keyword x most often
- how often keyword x appears in the results for config y

In [None]:
def find_entry_with_most_frequent_keyword(keyword):
    most_keywords=0
    most_keywords_pos=()

    for config in non_reference_configs:
        for index in list(results[config]["results"].keys()):
            altered = results[config]["results"][index]

            keywords = altered.lower().count(keyword)
            if keywords>most_keywords:
                most_keywords = keywords
                most_keywords_pos = (index,config)
    return most_keywords_pos

In [None]:
most_adds = find_entry_with_most_frequent_keyword("add")
print(f"Most occurrences of 'add':\n")
print_config_item_with_reference(most_adds[0],most_adds[1])
print()

most_gets = find_entry_with_most_frequent_keyword("get")
print(f"Most occurrences of 'get':\n")
print_config_item_with_reference(most_gets[0],most_gets[1])
print()

most_configs = find_entry_with_most_frequent_keyword("config")
print(f"Most occurrences of 'config':\n")
print_config_item_with_reference(most_configs[0],most_configs[1])
print()

In [None]:
"""
looks for a certain keyword in the results.
If a config is specified, it only tries to look for that config.
Searches in all configs otherwise.
Returns the entries containing the keyword as a list of pairs (index,config)
"""
def find_entries_with_keyword(keyword,config=None):
    entries=[]

    if config:
        for index in list(results[config]["results"].keys()):
                altered = results[config]["results"][index]
                if keyword in altered.lower():
                    entries.append((index,config)) 
    else:    
        for config in non_reference_configs:
            for index in list(results[config]["results"].keys()):
                altered = results[config]["results"][index]
                if keyword in altered.lower():
                    entries.append((index,config))
    return entries

In [None]:
print(f"Altered-Entries with 'add':\t{len(find_entries_with_keyword('add'))}")
print(f"Altered-Entries with 'get':\t{len(find_entries_with_keyword('get'))}")
print(f"Altered-Entries with 'set':\t{len(find_entries_with_keyword('set'))}")

In [None]:
# The configs 6 7 and 8 are the "Add Neutral" Transformations
print(f"Entries with 'add' in 'reference':\t{len(find_entries_with_keyword('add','reference'))}")

print(f"Entries with 'add' in 'config_6':\t{len(find_entries_with_keyword('add','config_6'))}")
print(f"Entries with 'add' in 'config_7':\t{len(find_entries_with_keyword('add','config_7'))}")
print(f"Entries with 'add' in 'config_8':\t{len(find_entries_with_keyword('add','config_8'))}")
print()

keyword="mock"
for config in non_reference_configs:
    print(f"Entries with '{keyword}' in '{config}':\t{len(find_entries_with_keyword(keyword,config))}")
  

There seems to be no significant change in which entries are "getters", "setters" and similar items. 
They appear mostly evenly distributed and stay that way across configs.

**Differences in AddVar5 to AddVar10**

The next examples look into BLEU differences and "why" addvar10 (full random) is doing better than addvar5.

In [None]:
%%time
# Config 16 and 17 are add_var(5,random) and add_var(10,random)

add_var_diffs = []

for index in list(results["config_16"]["results"].keys()):
    addvar5result=results["config_16"]["results"][index]
    addvar10result=results["config_17"]["results"][index]
    reference=results["reference"]["results"][index]
    gold=results["reference"]["gold_results"][index]
    
    addvar5bleu = bleu_wrapper(addvar5result,gold)
    addvar10bleu = bleu_wrapper(addvar10result,gold)
    diff = (addvar5bleu-addvar10bleu,index)
    add_var_diffs.append(diff)
    
add_var_diffs=sorted(add_var_diffs,key=lambda p:p[0])

for worsties in add_var_diffs[-5:]:
    print(f"Worsened bleu by {worsties[0]}")
    print("Gold:")
    print(f"\t{results['reference']['gold_results'][worsties[1]]}")
    print("Reference:")
    print(f"\t{results['reference']['results'][worsties[1]]}")
    print("AddVar(5):")
    print(f"\t{results['config_16']['results'][worsties[1]]}")
    print("AddVar(10):")
    print(f"\t{results['config_17']['results'][worsties[1]]}")
    print()

for besties in add_var_diffs[:5]:
    print(f"Bettered bleu by {besties[0]}")
    print("Gold:")
    print(f"\t{results['reference']['gold_results'][besties[1]]}")
    print("Reference:")
    print(f"\t{results['reference']['results'][besties[1]]}")
    print("AddVar(5):")
    print(f"\t{results['config_16']['results'][besties[1]]}")
    print("AddVar(10):")
    print(f"\t{results['config_17']['results'][besties[1]]}")
    print()

However, these are not helpful: they only show that the biggest BLEU movements are in getters and setters, which is the same behaviour as in other non-addvar entries.

## Jaccard Distances

The following cells inspect the Jaccard distances. 

For now, I looked mostly into jaccard(config,reference), but the same plots can be redone for jaccard(config,gold). 
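
For orientation, the Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. The sketch below computes it over word n-grams in plain Python. It is a simplified stand-in: it tokenizes on whitespace instead of using `nltk.word_tokenize`, and the function names are chosen for this example only. The handling of empty inputs follows the convention of `jaccard_wrapper` (both empty gives 0, exactly one empty gives 1).

```python
def ngram_set(tokens, n=1):
    """Return the set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(sentence_a, sentence_b, n=1):
    """Jaccard distance over lowercase, whitespace-separated word n-grams."""
    set_a = ngram_set(sentence_a.lower().split(), n)
    set_b = ngram_set(sentence_b.lower().split(), n)
    if not set_a and not set_b:
        return 0.0  # two empty inputs count as identical
    if not set_a or not set_b:
        return 1.0  # exactly one empty input counts as maximally distant
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)

# Two of the three words are shared, out of four distinct words in total
unigram_distance = jaccard_distance("Returns the name", "returns the value")
# Only the bigram ("returns", "the") is shared, out of three distinct bigrams
bigram_distance = jaccard_distance("Returns the name", "returns the value", n=2)
print(unigram_distance, bigram_distance)
```

Bigrams are stricter than single words, so the same sentence pair is further apart for `n=2` than for `n=1`.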

In [None]:
"""
This method requires the xs and ys to be sorted! 
Without matching indices it does not make any sense.
"""
def calculate_jaccard_distances(xs,ys,ngrams=1):
    agg = []
    indX = len(xs)
    indY = len(ys)
    if indX != indY:
        raise IndexError()
    else:
        running_index = 0
        while running_index < indX:
            agg.append(jaccard_wrapper(xs[running_index],ys[running_index],ngrams))
            running_index = running_index + 1
    return agg

In [None]:
jaccs = {}
for config in non_reference_configs:
    distances = calculate_jaccard_distances(results["reference"]["results"],results[config]["results"])
    
    jaccs[config]=distances
    plt.figure(figsize=(20,12))
    sns.displot(
        distances,
        kind="hist", kde=True,
        bins=20
    )
    plt.title(f"Histogram of JaccardDistances for {config}\n({print_archetype_info(config)})")
    plt.xlabel("JaccardDistance \n Reference to Altered")
    plt.ylabel("# of Entries")
    plt.xlim(0,1)
    plt.ylim(0,10000)
    
    plt.savefig(f'images/{config}_jaccard_histogram.png')
    plt.show()

In [None]:
jaccs_n2 = {}
for config in non_reference_configs:
    distances = calculate_jaccard_distances(results["reference"]["results"],results[config]["results"],ngrams=2)
    
    jaccs_n2[config]=distances
    plt.figure(figsize=(20,12))
    sns.displot(
        distances,
        kind="hist", kde=True,
        bins=20
    )
    plt.title(f"Histogram of JaccardDistances for {config}\n({print_archetype_info(config)})")
    plt.xlabel("JaccardDistance (ngram=2) \n Reference to Altered")
    plt.ylabel("# of Entries")
    plt.xlim(0,1)
    plt.ylim(0,10000)
    
    plt.savefig(f'images/{config}_jaccard_ngram2_histogram.png')
    plt.show()

In [None]:
jacc_data = []
for config in jaccs.keys():
    jacc_data.append((config,config_archetypes[config],jaccs[config]))

df = pd.DataFrame(jacc_data)
df.columns=["config","archetype","jacc_dist"]
df = df.explode('jacc_dist')
df['jacc_dist'] = df['jacc_dist'].astype('float')
df = df.dropna()

plt.figure(figsize=(30,12))

sns.boxplot(
    x="config",
    y="jacc_dist",
    hue="archetype",
    #width=4.5,
    dodge =False,
    data=df)

plt.grid()
plt.title(f"Boxplot of Jaccard_Distances")
plt.ylabel("Jaccard Distance")
plt.ylim(0,1)

plt.savefig(f'images/jaccard_distances_boxplot.png')
plt.show()

plt.figure(figsize=(30,12))
sns.violinplot(
    x="config",
    hue="archetype",
    y="jacc_dist",
    data=df,
    #width=5.5,
    # NaN values were already removed by df.dropna() above
    inner=None,
    dodge =False
)
plt.ylim(0,1)

plt.savefig(f'images/jaccard_distances_violinplot.png')

plt.show()

del df

## Pandas

This is a different approach: gather all data in a single pandas DataFrame, which then allows for multi-dimensional plots and further analysis.



In [None]:
%%time
# Driver for the time is the jaccard distance 

result_df_data = []

for config in non_reference_configs:
    arch = config_archetypes[config]
    ts = results[config]["properties"]["transformations"]
    index = 0
    while index < len(results[config]["results"]):
        ref = results["reference"]["results"][index]
        res = results[config]["results"][index]
        gold = results["reference"]["gold_results"][index]
        bleu = results[config]["bleu_values"][index]
        ref_bleu = results["reference"]["bleu_values"][index]
        
        diff = res != ref
        perfect = gold == res
        
        # Distance Gold<>ConfigText
        jacc_1 = jaccard_wrapper(res,gold,ngram=1)
        jacc_2 = jaccard_wrapper(res,gold,ngram=2)
        # Distance Reference<>ConfigText
        jacc_1_ref = jaccard_wrapper(ref,res,ngram=1)
        jacc_2_ref = jaccard_wrapper(ref,res,ngram=2)
        
        result_df_data.append(
            (config,arch,archetype_mt_mapping[arch],ts,index,
             bleu,ref_bleu,
             diff,perfect,
             jacc_1,jacc_2,
             jacc_1_ref,jacc_2_ref,
             gold,ref,res)
        )
        index = index + 1

result_df = pd.DataFrame(result_df_data)

result_df.columns=[
    "config","archetype","MT","transformations","index",
    "bleu","reference_bleu",
    "difference","perfect_match",
    "jaccard_n1","jaccard_n2","jaccard_n1_reference","jaccard_n2_reference",
    "gold_result","reference_result","config_result"
]

#result_df = result_df.dropna()

result_df.head()

### Differences

Looking for Differences in results - similar to Jaccard Distance

In [None]:
plt.figure(figsize=(21, 7))
plt.grid()

plt.title('Result-Differences per Configuration')


sns.barplot(
    x="config",y="difference",
    data=result_df,
    hue="MT",
    dodge =False
)


plt.savefig(f'images/number_of_diffs_by_config.png')
plt.show()

### RQ2 Results
For RQ2 we first need simple counts and percentages of the changed entries per order (number of transformations).
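
The same counts and percentages can also be obtained for all orders at once with a `groupby`. This is only a sketch on a toy frame: `toy_df` is a hypothetical stand-in that reproduces just the `transformations` and `difference` columns of `result_df`, and its values are made up.

```python
import pandas as pd

# Toy stand-in for result_df; only the two columns used here are reproduced.
toy_df = pd.DataFrame({
    "transformations": ["1", "1", "1", "1", "5", "5", "5", "5"],
    "difference": [True, False, False, False, True, True, False, False],
})

# For a boolean column, sum counts the True values, count the rows,
# and mean is the share of True values.
summary = (
    toy_df.groupby("transformations")["difference"]
    .agg(changed="sum", total="count", share="mean")
)
summary["percent"] = (summary["share"] * 100).round(1)
print(summary)
```

Because the mean of a boolean column is the fraction of `True` values, it has to be multiplied by 100 before it is labelled as a percentage.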

In [None]:
%%time
totalPerO = result_df[(result_df["transformations"]=='1')].count()[0]
firstOdiff = result_df[(result_df["transformations"]=='1') & (result_df["difference"])].count()[0]
fifthOdiff = result_df[(result_df["transformations"]=='5') & (result_df["difference"])].count()[0]
tenthOdiff = result_df[(result_df["transformations"]=='10') & (result_df["difference"])].count()[0]

print("Total number of entries per Order:",totalPerO)
print(f"Changes in first order {firstOdiff}({firstOdiff/totalPerO:.1%})")
print(f"Changes in fifth order {fifthOdiff}({fifthOdiff/totalPerO:.1%})")
print(f"Changes in tenth order {tenthOdiff}({tenthOdiff/totalPerO:.1%})")

### RQ1 Results
This section gives some information on the changed and affected results for the first-order MT changes.

In [None]:
plot_df = result_df.copy()
plot_df = plot_df[plot_df["transformations"]=='1']

plot_df["jacc1_diff"] = plot_df["jaccard_n1"]-plot_df["jaccard_n1_reference"]
plot_df["abs_jacc1_diff"] = abs(plot_df["jacc1_diff"])
plot_df["bleu_diff"] = plot_df["bleu"]-plot_df["reference_bleu"]
plot_df["abs_bleu_diff"]=abs(plot_df["bleu_diff"])

diffed_df = plot_df[plot_df["jaccard_n1_reference"]>0]

plot_df.head(3)

In [None]:
post_1stMT_count = plot_df.count()[0]
count_jacc_samsies = plot_df[plot_df["jaccard_n1_reference"]==0].count()[0]
count_jacc_diffs = diffed_df.count()[0]
count_bleu_diffs = plot_df[plot_df["abs_bleu_diff"]>0].count()[0]

avg_bleu_diff = np.mean(plot_df[plot_df["abs_bleu_diff"]>0]["abs_bleu_diff"])

print("Entries for first order",post_1stMT_count)
print("Jaccard Changes:",count_jacc_diffs)
print(f"BLEU Changes: {count_bleu_diffs}({count_bleu_diffs/post_1stMT_count:.1%})")
print("Average Bleu-Diff:",avg_bleu_diff)

# Mean over the same column as the median and IQR below
avg_jacc_diff = np.mean(plot_df["jaccard_n1_reference"])
median_jacc_diff = np.median(plot_df["jaccard_n1_reference"])
iqr_jacc_diff = stats.iqr(plot_df["jaccard_n1_reference"])

print("Average Jacc Diff:",avg_jacc_diff)
print("Median Jacc Diff:",median_jacc_diff)
print("IQR Jacc Diffs:",iqr_jacc_diff)

Histogram of changes 
(To show nice non-null changes)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16.5, 8.8))

sns.histplot(ax=axes[0],data=plot_df,
             x="jaccard_n1_reference",
             bins=25)
axes[0].set(xlim=(0,1.01))

axes[0].set_xlabel('Difference in Jaccard Distance \n Reference <> First Order MTs', fontsize=23)
axes[0].set_ylabel('Number of Entries', fontsize=19)

axes[0].set_xticks(np.arange(0,1.2,0.2))
axes[0].set_xticklabels([round(x,1) for x in np.arange(0,1.2,0.2)],fontsize=17)

axes[0].set_yticklabels([int(a) for a in axes[0].get_yticks()],fontsize=17)

sns.histplot(ax=axes[1],data=diffed_df,
             x="abs_bleu_diff",
             bins=50)

axes[1].set(xlim=(0,1.01))
axes[1].set_xlabel('Absolute Difference in BLEU4-Score \n for Summaries with Jaccard-Delta', fontsize=23)
axes[1].set_ylabel('Number of Entries', fontsize=19)

axes[1].set_xticks(np.arange(0,1.2,0.2))
axes[1].set_xticklabels([round(x,1) for x in np.arange(0,1.2,0.2)],fontsize=17)

axes[1].set_yticklabels([int(a) for a in axes[1].get_yticks()],fontsize=17)

plt.savefig(f'images/overview_plot_changes_of_firstorder_mts_small.png')

plt.show()

#del plot_df

In [None]:
# One example where MT-IF creates "growth" in the result
result_df[(result_df["MT"]=="MT-IF") & (result_df["index"]==327)]["reference_bleu"]

In [None]:
plt.figure(figsize=(21, 7))
plt.grid()

sns.scatterplot(x="jaccard_n1",y="bleu",hue="config",style="archetype",data=result_df)

In [None]:
plt.figure(figsize=(15, 15))
plt.grid()

plt.title("Scatterplot of Entries \n Bleu<>ReferenceBleu")

sns.scatterplot(x="reference_bleu",y="bleu",hue="config",size="jaccard_n1",style="archetype",data=result_df[result_df.index % 10 == 0])

plt.savefig(f'images/scatterplot_bleu_reference.png')

## Shapiro Tests

In [None]:
from scipy import stats

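# config_1 is only used to select every entry once; reference_bleu is the same for all configs
ref_bleu_values = result_df[result_df["config"]=="config_1"]["reference_bleu"].to_numpy()
shapiro_test = stats.shapiro(ref_bleu_values)
print("reference bleu score",shapiro_test)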

In [None]:
for config in non_reference_configs:
    df_mask=result_df['config']==config
    
    jaccs1 = result_df[df_mask]["jaccard_n1"].to_numpy()
    jaccs2 = result_df[df_mask]["jaccard_n2"].to_numpy()

    shapiro_test1 = stats.shapiro(jaccs1)
    shapiro_test2 = stats.shapiro(jaccs2)
    print(f"jacc1_dist {config}",shapiro_test1)
    print(f"jacc2_dist {config}",shapiro_test2)

In [None]:
agg_df_data = []
for config in non_reference_configs:
    df_mask=result_df['config']==config

    bleu_data = result_df[df_mask]["bleu"].to_numpy()
    jacc_1_data = result_df[df_mask]["jaccard_n1"].to_numpy()
    jacc_2_data = result_df[df_mask]["jaccard_n2"].to_numpy()
    
    arch = config_archetypes[config]
    ts = results[config]["properties"]["transformations"]
    
    shapiro_test = stats.shapiro(bleu_data)
    
    bleu_median = np.median(bleu_data)
    bleu_mean = np.mean(bleu_data)
    bleu_iqr = stats.iqr(bleu_data)
    
    jacc1_median = np.median(jacc_1_data)
    jacc1_mean = np.mean(jacc_1_data)
    jacc1_iqr = stats.iqr(jacc_1_data)    
    
    jacc2_median = np.median(jacc_2_data)
    jacc2_mean = np.mean(jacc_2_data)
    jacc2_iqr = stats.iqr(jacc_2_data)
    
    config_entry = (config,arch,ts,
                    shapiro_test,
                    bleu_median,bleu_mean,bleu_iqr,
                    jacc1_median,jacc1_mean,jacc1_iqr,
                    jacc2_median,jacc2_mean,jacc2_iqr)
    
    agg_df_data.append(config_entry)
    #print(f"delta-bleus {config}",median,mean)
    
agg_df = pd.DataFrame(agg_df_data) 
del agg_df_data
agg_df.columns=[
    "config","archetype","transformations",
    "bleu_shapiro_test",
    "bleu_median","bleu_mean","bleu_iqr",
    "jacc1_median","jacc1_mean","jacc1_iqr",
    "jacc2_median","jacc2_mean","jacc2_iqr",
    
]
agg_df.head()

In [None]:
plt.figure(figsize=(12, 10))

plt.title("BLEU IQR per archetype and number of transformations")

pivoted_data = agg_df.pivot(index='transformations', columns='archetype', values='bleu_iqr')
pivoted_data = pivoted_data.sort_values("transformations",key=lambda col:col.astype(int),ascending=True)
sns.heatmap(pivoted_data, annot=True, fmt="g",cmap='viridis')


plt.savefig(f'images/heatmap_nonzero_shapiro_pvalues.png')

plt.show()

#sns.heatmap(x="transformations",y="archetype",hue="delta_tscore_iqr",center=0,data=filtered_agg_df)

In [None]:
plt.figure(figsize=(21, 7))
plt.grid()

plt.title('bleu IQR')


sns.barplot(
    x="config",y="bleu_iqr",
    data=agg_df,
    hue="archetype",
    dodge =False
)

plt.savefig(f'images/barplot_deltatscore_iqrs.png')

plt.show()

## Non - Setter / Getter Split 

While looking at the data, we noticed a lot of entries for setters and getters whose gold standard is just a text like "set the XY".
This is rather noisy, so we split the data into "Getter", "Setter", "Low Words" and "Other" and have a look at each group.
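
The split rules used in the next cell can be summarized as a single function. This is a minimal sketch: `categorize_summary` is a hypothetical helper that is not part of the notebook, and the example sentences are made up. The labels follow the ones used for the CSV export at the end of this notebook.

```python
def categorize_summary(gold):
    """Assign a gold summary to 'Getter', 'Setter', 'Low_Words' or 'Normal'."""
    text = gold.lower()
    words = len(gold.split())
    if "get" in text and words < 10:
        return "Getter"
    if "set" in text and "setting" not in text and words < 10:
        return "Setter"
    if words < 5:
        return "Low_Words"
    return "Normal"


examples = [
    "Gets the name of the user",
    "Set the timeout",
    "Close stream",
    "Parses the configuration file and returns all entries as a list",
    "Returns the target",
]
categories = [categorize_summary(e) for e in examples]
print(categories)
```

Note that the rules match substrings, not whole words: the last example is counted as a getter only because "target" contains "get". The groups are therefore a rough heuristic.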

In [None]:
get_indizes=[]
set_indizes=[]
low_word_indizes=[]
other_indizes=[]


for index in list(results["reference"]["results"].keys()):
        gold = results["reference"]["gold_results"][index]     
        words = len(gold.split())
        
        if "get" in gold.lower() and words < 10:
            get_indizes.append(index) 
        elif "set" in gold.lower() and not "setting" in gold.lower() and words < 10:
            set_indizes.append(index)
        elif words < 5:
            low_word_indizes.append(index)
            
print("gets:",len(get_indizes))
print("sets:",len(set_indizes))
print("low_words:",len(low_word_indizes))


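# Build the lookup once as a set, instead of concatenating the lists for every index
categorized_indizes = set(get_indizes + set_indizes + low_word_indizes)
other_indizes = [ i for i in results["reference"]["results"].keys()
                  if i not in categorized_indizes ]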

print("remaining indizes:",len(other_indizes))

# Comment this in for sampling the remaining indizes
#for i in other_indizes[:50]:
#    print(results["reference"]["gold_results"][i])

In [None]:
ref_get_bleus = [results["reference"]["bleu_values"][index] for index in get_indizes]
ref_getter_bleu = np.mean(ref_get_bleus)
print("get:",ref_getter_bleu)

ref_set_bleus = [results["reference"]["bleu_values"][index] for index in set_indizes]
ref_setter_bleu = np.mean(ref_set_bleus)
print("set:",ref_setter_bleu)

ref_low_word_bleus = [results["reference"]["bleu_values"][index] for index in low_word_indizes]
ref_lowwords_bleu = np.mean(ref_low_word_bleus)
print("low words:",ref_lowwords_bleu)

ref_cleaned_bleus = [results["reference"]["bleu_values"][index] for index in other_indizes]
ref_remaining_bleu = np.mean(ref_cleaned_bleus)
print("remaining indizes:",ref_remaining_bleu)

In [None]:
split_bleus_data = []

# For every archetype, add as the 0 transformation point the reference
for archetype in set(config_archetypes.values()):
    datapoint = ("reference",archetype,0,
                 results["reference"]["bleu"]/100,
                 ref_getter_bleu,ref_setter_bleu,ref_lowwords_bleu,ref_remaining_bleu)
    split_bleus_data.append(datapoint)

# For all configs, make a datapoint with the separated bleus
for config in non_reference_configs:

    archetype = config_archetypes[config]
    transformations = results[config]["properties"]["transformations"]
    
    getter_bleus = [results[config]["bleu_values"][index] for index in get_indizes]
    getter_agg_bleu = np.mean(getter_bleus)

    setter_bleus = [results[config]["bleu_values"][index] for index in set_indizes]
    setter_agg_bleu = np.mean(setter_bleus)

    low_word_bleus = [results[config]["bleu_values"][index] for index in low_word_indizes]
    lowwords_agg_bleu = np.mean(low_word_bleus)

    other_bleus = [results[config]["bleu_values"][index] for index in other_indizes]
    other_agg_bleu = np.mean(other_bleus)
    
    datapoint = (config,archetype,transformations,
                 results[config]["bleu"]/100,
                 getter_agg_bleu,setter_agg_bleu,lowwords_agg_bleu,other_agg_bleu)
    split_bleus_data.append(datapoint)

# Make a dataframe from the values 
split_bleus_df = pd.DataFrame(split_bleus_data)
split_bleus_df.columns = [
    "config","archetype","transformations",
    "bleu",
    "getter_bleu","setter_bleu","low_word_bleu","remaining_bleu"
]
split_bleus_df["transformations"] = split_bleus_df["transformations"].astype("int")
split_bleus_df = split_bleus_df.sort_values(["archetype","transformations"])
split_bleus_df.head()

In [None]:
split_bleus_data_type_b = []

# For every archetype, add as the 0 transformation point the reference
for archetype in set(config_archetypes.values()):
    split_bleus_data_type_b.append(
        ("reference",archetype,0,"getter_bleu",ref_getter_bleu)
    )
    split_bleus_data_type_b.append(
        ("reference",archetype,0,"setter_bleu",ref_setter_bleu)
    )
    split_bleus_data_type_b.append(
        ("reference",archetype,0,"low_word_bleu",ref_lowwords_bleu)
    )
    split_bleus_data_type_b.append(
        ("reference",archetype,0,"remaining_bleu",ref_remaining_bleu)
    )
    split_bleus_data_type_b.append(
        ("reference",archetype,0,"bleu",results["reference"]["bleu"]/100)
    )

# For all configs, make a datapoint with the separated bleus
for config in non_reference_configs:

    archetype = config_archetypes[config]
    transformations = results[config]["properties"]["transformations"]
    
    getter_bleus = [results[config]["bleu_values"][index] for index in get_indizes]
    getter_agg_bleu = np.mean(getter_bleus)
    split_bleus_data_type_b.append(
        (config,archetype,transformations,"getter_bleu",getter_agg_bleu)
    )
    
    setter_bleus = [results[config]["bleu_values"][index] for index in set_indizes]
    setter_agg_bleu = np.mean(setter_bleus)
    split_bleus_data_type_b.append(
        (config,archetype,transformations,"setter_bleu",setter_agg_bleu)
    )

    low_word_bleus = [results[config]["bleu_values"][index] for index in low_word_indizes]
    lowwords_agg_bleu = np.mean(low_word_bleus)
    split_bleus_data_type_b.append(
        (config,archetype,transformations,"low_word_bleu",lowwords_agg_bleu)
    )
    other_bleus = [results[config]["bleu_values"][index] for index in other_indizes]
    other_agg_bleu = np.mean(other_bleus)
    split_bleus_data_type_b.append(
        (config,archetype,transformations,"remaining_bleu",other_agg_bleu)
    )

    split_bleus_data_type_b.append(
        (config,archetype,transformations,"bleu",results[config]["bleu"]/100 )
    )

# Make a dataframe from the values 
split_bleus_df_type_b = pd.DataFrame(split_bleus_data_type_b)
split_bleus_df_type_b.columns = [
    "config","archetype","transformations","type","value"
]
split_bleus_df_type_b["transformations"] = split_bleus_df_type_b["transformations"].astype("int")
split_bleus_df_type_b = split_bleus_df_type_b.sort_values(["archetype","type","transformations"])
split_bleus_df_type_b.head(10)

In [None]:
plt.figure(figsize=(22, 7))
plt.grid()

sns.lineplot(
    data=split_bleus_df_type_b,
    x="transformations",
    y="value",
    hue="archetype",
    style="type",
    markers=True)

plt.xticks([0,1,5,10])
plt.ylabel("Averaged Bleu-Score")

plt.savefig(f'images/bleu_score_per_category_per_archetype.png')
plt.show()

In [None]:
plt.figure(figsize=(22, 7))
plt.grid()

plt.xticks([0,1,5,10])
plt.ylabel("Averaged Bleu-Score")

sns.lineplot(
    data=split_bleus_df_type_b,
    x="transformations",
    y="value",
    style="type")
plt.title("Average Bleu Score categorized into getters, setters, low words and others")
plt.xlim(0,10)

plt.savefig(f'images/bleu_score_per_category.png')

plt.show()

Word count in gold 

In [None]:
data = []
for index in results["reference"]["gold_results"].keys():
    words = len(results["reference"]["gold_results"][index].split())
    data.append(words)
    
    
plt.figure(figsize=(15, 6))
plt.grid()    
sns.histplot(data,bins=50)

plt.title("Distribution of words in gold standard")
plt.xlabel("# of words")
plt.ylabel("# of entries")
plt.xlim(0,100)

plt.xticks(np.arange(0,100,5))

plt.savefig(f'images/word_distribution_goldstandard.png')

plt.show()

del words,data

In [None]:
data = []
for index in results["reference"]["results"].keys():
    words = len(results["reference"]["results"][index].split())
    data.append(words)
    
    
plt.figure(figsize=(15, 6))
plt.grid()    
sns.histplot(data,bins=50)

plt.title("Distribution of words in reference standard")
plt.xlabel("# of words")
plt.ylabel("# of entries")
plt.xlim(0,100)

plt.xticks(np.arange(0,100,5))

plt.savefig(f'images/word_distribution_reference.png')

plt.show()

del words,data

## Wilcoxon and Friedman Tests

To check whether there are (statistically) significant differences in the groups
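
The two tests used in the following cells can be tried out on synthetic data first. This is only a sketch under assumptions: the arrays `reference`, `config_a` and `config_b` are randomly generated stand-ins and not results of the experiment. The Wilcoxon signed-rank test compares two paired samples, the Friedman test compares three or more related samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic paired scores: every entry is scored once per variant.
reference = rng.uniform(0.2, 0.8, size=200)
config_a = reference - rng.uniform(0.0, 0.1, size=200)  # consistently lower
config_b = reference + rng.normal(0.0, 0.05, size=200)  # noise around the reference

# Wilcoxon signed-rank test: paired comparison of two related samples.
wilcoxon_result = stats.wilcoxon(reference, config_a, alternative="two-sided")

# Friedman test: compares three or more related samples at once.
friedman_result = stats.friedmanchisquare(reference, config_a, config_b)

print("Wilcoxon p-value:", wilcoxon_result.pvalue)
print("Friedman p-value:", friedman_result.pvalue)
```

Both tests are non-parametric and do not assume normally distributed scores.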


In [None]:
ref_bleus = results["reference"]["bleu_values"]

wilcoxon_data = []

for config in non_reference_configs:
    config_bleus = results[config]["bleu_values"]
    archetype = config_archetypes[config]
    transformations = results[config]["properties"]["transformations"]
    
    
    wilcoxon_result = stats.wilcoxon(ref_bleus,config_bleus)
    statistic = wilcoxon_result[0]
    pvalue = wilcoxon_result[1]
    
    twosided_wilcoxon_result = stats.wilcoxon(ref_bleus,config_bleus,alternative="two-sided")
    twosided_statistic = twosided_wilcoxon_result[0]
    twosided_pvalue = twosided_wilcoxon_result[1]
    
    datapoint = (config,archetype,transformations,
                 #statistic,pvalue,
                 twosided_statistic,twosided_pvalue)
    
    wilcoxon_data.append(datapoint)

wilcoxon_df = pd.DataFrame(wilcoxon_data)
wilcoxon_df.columns = ["config","archetype","transformations",
                       #"wilcoxon_statistics","wilcoxon_pvalue",
                       "twosided_wilcoxon_statistics","twosided_wilcoxon_pvalue"]
    
del config_bleus, ref_bleus

wilcoxon_df.head(7)

In [None]:
%%time
ref_bleus = results["reference"]["bleu_values"]

friedman_data = []

for configA in non_reference_configs:
    configA_bleus = results[configA]["bleu_values"]
    archetypeA = config_archetypes[configA]
    transformationsA = results[configA]["properties"]["transformations"]
    
    for configB in non_reference_configs:
        configB_bleus = results[configB]["bleu_values"]
        archetypeB = config_archetypes[configB]
        transformationsB = results[configB]["properties"]["transformations"]
        friedman_result = stats.friedmanchisquare(ref_bleus,configA_bleus,configB_bleus)
        #print(friedman_result)
        statistic = friedman_result[0]
        pvalue = friedman_result[1]

        datapoint = (configA,archetypeA,transformationsA,
                     configB,archetypeB,transformationsB,
                     statistic,pvalue)

        friedman_data.append(datapoint)

friedman_df = pd.DataFrame(friedman_data)
friedman_df.columns = [
    "configA","archetypeA","transformationsA",
    "configB","archetypeB","transformationsB",
    "friedman_statistics","friedman_pvalue"]
    
del configB_bleus,configA_bleus, ref_bleus

friedman_df.head()

In [None]:
plt.figure(figsize=(12, 10))

plt.title("Friedman-PValue \nof ConfigA<>ConfigB<>Reference")

ffriedman_df = friedman_df.copy()
# astype returns a new Series, so the result has to be assigned
ffriedman_df['configA'] = friedman_df['configA'].apply(config_num).astype(int)
ffriedman_df['configB'] = friedman_df['configB'].apply(config_num).astype(int)

pivoted_data = ffriedman_df.pivot(index='configA', columns='configB', values='friedman_pvalue')
sns.heatmap(pivoted_data, annot=False, fmt="g",cmap='viridis')


plt.savefig(f'images/friedman_pvalues_bleuscore.png')

plt.show()

In [None]:
plot_df = result_df[(result_df["difference"])]
plot_df = plot_df[(plot_df["transformations"]=='1')]

plot_df.info()

# Export 

This can be used to print a pdf (or html). Comment it in if you want to do so. 

--to=pdf takes quite a while, --to=html is pretty fast. 



In [None]:
%%time
# export to csv as annibale wants
method_dict = {}
for i in get_indizes:
    method_dict[i]="Getter"
for i in set_indizes:
    method_dict[i]="Setter"
for i in low_word_indizes:
    method_dict[i]="Low_Words"
for i in other_indizes:
    method_dict[i]="Normal"
    
csv_export_data  = []

for index in results["reference"]["results"].keys():
    ref_data = results["reference"]["results"][index]
    gold_data = results["reference"]["gold_results"][index]
    ref_bleu =  results["reference"]["bleu_values"][index]
    ref_length = len(ref_data)
    ref_word_length = len(ref_data.split())
    ref_jacc1_to_ref = 0 
    
    ref_jacc_1_to_gold = jaccard_wrapper(ref_data,gold_data)
    ref_perfect = gold_data == ref_data
    
    method_type = method_dict[index]
    
    ref_datapoint = (
        "reference","none","none","0", method_type,index,
        #ref_data,
        ref_bleu,ref_jacc1_to_ref,ref_jacc_1_to_gold,
        ref_length,ref_word_length,
        False,ref_perfect 
    )
    csv_export_data.append(ref_datapoint)
    
    for config in non_reference_configs:
        conf_data = results[config]["results"][index]
        #print(config,conf_data,ref_data)
        arch = config_archetypes[config]
        mt = archetype_mt_mapping[arch]
        ts = results[config]["properties"]["transformations"]
        
        bleu =  results[config]["bleu_values"][index]
        length = len(conf_data)
        word_length = len(conf_data.split())
        
        diff = conf_data != ref_data
        perfect = conf_data == gold_data
        
        result_df_line = result_df[(result_df["config"]==config)  &  (result_df["index"]==index)]
        
        jacc_1_to_ref = result_df_line["jaccard_n1_reference"].iloc[0]
        jacc_1_to_gold = result_df_line["jaccard_n1"].iloc[0]
        
        conf_datapoint = (
            config,arch,mt,ts, method_type,index,
            #ref_data,
            bleu,jacc_1_to_ref,jacc_1_to_gold,
            length,word_length,
            diff, perfect
        )
        csv_export_data.append(conf_datapoint)
    #print(index)
csv_export_df = pd.DataFrame(csv_export_data)
csv_export_df.columns = [
    "config","archetype","MT","transformations","method_type","entry",
    "bleu_score",
    # Order matches the datapoints: distance to reference first, then distance to gold
    "jaccard_distance_to_reference","jaccard_distance_to_gold",
    "length_in_characters", "length_in_words",
    "different_to_ref", "perfect_match_with_gold"
]
csv_export_df.to_csv("./exports/bleus.csv")
csv_export_df.head(5)

In [None]:
del csv_export_df
!jupyter nbconvert --to=pdf --output-dir=./exports Evaluation.ipynb