# Protocol to compare information extraction with the ground truth

<div class="alert alert-block alert-info">

#### General information

This script was designed incrementally to compute metrics for predictions made by generative models. The general idea is to compare the predictions of different models with a ground truth, for a variety of variable types (numerical, categorical). Since a generative model can produce an answer that is close to the expected one without matching it exactly, a human-in-the-loop step is used to check whether each disagreement is real or just a small variation in wording.

For this reason, the process is divided into 3 steps:
1. comparison with the gold standard
2. human-in-the-loop check of whether each disagreement is real (with 3 possibilities: disagreement, agreement, partial agreement)
3. computation of the metrics using the results of the human check

Small adaptations were made to handle non-homogeneous data structures:
- cleaning the prediction by extracting part of the JSON data
- adapting to a specific file format for regex predictions

</div>

#### Files required to run the script

To run, the script needs different elements in the `input` folder.

With `C` the number of variables, `N` the number of elements predicted, and `M` the number of predictions, the files should be structured as follows:

- `variables.csv`: a file with C lines and 2 columns, the first with the name of the variable and the second with its type (numerical, categorical, open, list, or structured, giving the field to use and the type of that field, e.g. `dictionnary|field[list]`). **Variable names must be the same as the column names in the ground truth.**
- `groundtruth.xlsx`: the correct values for the variables, with C columns (one per variable) and N lines
- `predictions.xlsx`: a file with M lines (one for each model prediction of the set of variables) and a column for the name of the prediction, which should be the same as the name of the file in the `predictions` folder
- a `predictions` folder that contains the CSV files of each prediction run
  - `name.csv`: a prediction file with the unique name
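
The structured type string can be hard to read at first, so here is a minimal sketch of how a value such as `dictionnary|field[list]` splits into a field name and a base type. It mirrors the string handling used later in the script; the function name `parse_variable_type` and the example field `cause` are hypothetical and only serve the illustration. The keyword is spelled `dictionnary` because that is the literal string the script looks for.

```python
from typing import Optional, Tuple


def parse_variable_type(type_str: str) -> Tuple[Optional[str], str]:
    """Split a `type` cell of variables.csv into (field, base_type).

    Hypothetical helper: for a plain type the field is None.
    """
    if "dictionnary" in type_str:
        # "dictionnary|cause[list]" -> field "cause", base type "list"
        field = type_str.replace("dictionnary|", "").split("[")[0]
        base_type = type_str.split("[")[1].replace("]", "")
        return field, base_type
    return None, type_str


print(parse_variable_type("dictionnary|cause[list]"))
print(parse_variable_type("categorical"))
```

A structured variable is therefore compared with the rule of its base type, once the named field has been extracted from the prediction.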

Once executed, the script writes the metrics to `output/scores.xlsx` and produces the following files, which the code below stores in the `feedback` folder:

- `resolution.xlsx`, which lists the disagreements between prediction and ground truth (one line per disagreement, indexed by model/variable/line). A human can edit it to check each disagreement. The automatic equality rules are: strict equality for numbers, strict equality for lists, and for a categorical element a maximum difference of 1 or 2 characters in the name of the category (1 for names of up to 6 characters, 2 for longer ones).
- if new models are added (in `predictions.xlsx` and in the folder) and an edited `resolution_mod.xlsx` already exists, the script generates a `resolution_mod_updated.xlsx` that keeps the previous annotations and adds the new elements for the annotator to check. To use it, delete the old file and rename the new one to `resolution_mod.xlsx`.
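
To make the categorical tolerance concrete, here is a small self-contained sketch of the rule, assuming the threshold depends on the length of the ground-truth string as in the comparison function defined further down. The script itself relies on the `Levenshtein` package and cleans both strings first; this sketch swaps in a plain dynamic-programming edit distance so that it runs without dependencies, and the names `edit_distance` and `categories_match` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def categories_match(groundtruth: str, prediction: str) -> bool:
    """Allow 1 edit for names of up to 6 characters, 2 edits beyond."""
    max_diff = 1 if len(groundtruth) <= 6 else 2
    return edit_distance(groundtruth, prediction) <= max_diff


print(categories_match("male", "males"))         # one extra character
print(categories_match("male", "female"))        # two edits on a short name
print(categories_match("accident", "acidents"))  # two edits on a long name
```

The tighter threshold on short names limits false matches between categories that are themselves only a couple of characters apart.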

#### Human annotation

In `resolution_mod.xlsx`, fill in the `modification` column:

- E == error
- P == partial equality
- nothing == correct
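
As an illustration, the codes above could be interpreted as in the following sketch. The function name `interpret_annotation` is hypothetical. The extra codes `EE`, `U` and `EP` are not documented above but are treated as errors by the `eq_human` method in the code cell below, so they are included here; treating any other value as correct is likewise taken from that method.

```python
# "E" is the documented error code; "EE", "U" and "EP" are also
# handled as errors by `eq_human` in the notebook code.
ERROR_CODES = {"E", "EE", "U", "EP"}
PARTIAL_CODES = {"P"}


def interpret_annotation(code) -> str:
    """Return 'error', 'partial' or 'correct' for one `modification` cell."""
    if code is None:
        return "correct"  # empty cell: the disagreement is not a real one
    code = str(code).strip()
    if code in ERROR_CODES:
        return "error"
    if code in PARTIAL_CODES:
        return "partial"
    return "correct"


for cell in ["E", " P ", None]:
    print(repr(cell), "->", interpret_annotation(cell))
```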

### Install packages

In [1]:
# pip install -r requirements.txt

### Functions

Utility functions (transform a string into a list), comparison functions (compare two elements), and main functions (compute metrics) are defined in the following cells.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.metrics import f1_score
import Levenshtein
import json
import re
import warnings
from scipy.stats import sem, t
from statsmodels.stats.weightstats import DescrStatsW
import os

warnings.filterwarnings('ignore')

def clean_cat(pred:str) -> str|None:
    """
    Clean text from special characters
    """
    to_remove = ["[","]",'"', '.', '-',"”", "“"]
    for i in to_remove:
        pred = str(pred).replace(i,"").lower().strip()
    if pred == "none" or pred == "" or pred=="not mentioned":
        pred = None
    return pred

def extract_list(cell:str) -> list:
    """
    Extract list from string
    """
    cell = clean_cat(str(cell))
    if not cell:
        return []
    # split
    elements = (cell.replace(";",",")).split(",")
    # clean and return
    return [clean_cat(i) for i in elements]

def extract_field(cell:str, entry:str, info:str) -> dict|None:
    """
    Rule-based approach to extract elements in json-like
    (generated json is not always well formatted)
    Different steps to try different strategies
    """
    # first try as a json format
    try:
        return json.loads(cell)[entry]
    except Exception:
        # not valid JSON or entry missing: fall back to the rules below
        pass

    # then try to deconstruct the JSON with correct spacing
    pattern = r'"'+entry+'": "(.*?)"'
    match = re.search(pattern, cell)
    if match:
        return match.group(1)

    # with no double quote
    pattern = entry+': "(.*?)"'
    match = re.search(pattern, cell)
    if match:
        return match.group(1)
        
    # if the element is at the end of the string (specific to the files)
    if entry in cell:
        end = (cell.replace('"'+entry+'"',entry).
               split(f"{entry}: ")[-1].
               replace("\n","").
               replace("}",""))
        return end
    else:
        # all the cell if there is no entry
        return cell


def fuzzy_equality(str1:str,str2:str, max_diff:int) -> bool:
    """
    Compare 2 strings taking into account small variations
    """
    if not str1 and not str2: #case 2xNone
        return True
    if (not str1 and str2) or (not str2 and str1):
        return False
    distance = Levenshtein.distance(str1, str2)
    return distance <= max_diff

def compare_text(x1: str, x2:str) -> bool:
    """
    Compare 2 texts with rules
    """
    # cleaning, same rule for the 2 elements
    x1 = clean_cat(x1)
    x2 = clean_cat(x2)
    
    # exact equality (also covers the case where both are null)
    if x1 == x2:
        return True

    # case of only one is null
    if x1 is None or x2 is None:
        return False

    # fuzzy equality: 1 edit allowed up to 6 characters, 2 edits beyond
    max_diff = 1 if len(x1) <= 6 else 2
    return fuzzy_equality(x1, x2, max_diff=max_diff)

def compare_list(x1: list,x2:list) -> bool:
    """
    Compare 2 lists
    """
    # case both null
    if x1 is None and x2 is None:
        return True
    
    # case one is null, not the other
    if x1 is None and x2 is not None:
        return False
    if x2 is None and x1 is not None:
        return False

    # sort the element
    x1 = sorted([i for i in x1 if i is not None])
    x2 = sorted([i for i in x2 if i is not None])

    # equality of content
    if set(x1) == set(x2):
        return True

    # different number of distinct elements in the lists
    if len(set(x1)) != len(set(x2)):
        return False

    # fuzzy comparison, only if same number of elements, to catch small character variations
    if sum([compare_text(i,j) for i,j in zip(x1,x2)]) == len(x1):
        return True

    # otherwise disagreement
    return False

def eq(x1:str|list, x2:str|list, eq_type:str) -> bool|None:
    """
    Apply an equality rule depending on the type
    """
    # case of text
    if eq_type == "text":
        return compare_text(x1, x2)

    # case of list
    if eq_type == "list":
        return compare_list(x1, x2)
        
    return None

def mean_bootstrap(s:pd.Series, frac:float, n:int=100) -> float:
    """
    Mean of the means of n random subsamples (drawn without replacement)
    """
    m = []
    for i in range(n):
        ss = s.sample(frac=frac)
        m.append(ss.sum()/len(ss))
    return np.mean(m)
        
def confidence_interval(data:list, confidence:float=0.95) -> tuple[float,float]:
    """
    Computing a confidence interval
    """
    data = np.array(data)
    n = len(data)
    mean = np.mean(data)
    stderr = sem(data)
    t_value = t.ppf((1 + confidence) / 2, n - 1)
    margin_of_error = t_value * stderr
    lower_bound = mean - margin_of_error
    upper_bound = mean + margin_of_error
    return round(lower_bound,4),round(upper_bound,4)

class Resolution:
    """
    Class to build the disagreement file for human annotation
    + utility functions to use it to correct equality based on the human annotation
    """

    content: list[pd.DataFrame] # build the table of automatic disagreement
    checked: pd.DataFrame | None # available human annotation
    correct_pred_cat: dict[str,dict[str,str]] # dictionary to correct predictions with goldstandard categories

    def __init__(self):
        """
        Load files and initialize variables
        """
        self.content = [] # list for false prediction

        # load human equivalence
        if Path("feedback/resolution_mod.xlsx").exists():
            df = pd.read_excel("feedback/resolution_mod.xlsx")
            self.checked = df.dropna(subset=["modification"])
        else:
            # os.mkdir does not accept exist_ok, os.makedirs does
            os.makedirs("feedback", exist_ok=True)
            self.checked = None

        # dictionary to correct predictions with goldstandard categories
        self.correct_pred_cat = {}
        if Path("feedback/reco_predict_cat_reco.xlsx").exists():
            tmp = pd.read_excel("feedback/reco_predict_cat_reco.xlsx")
            for i,j in tmp.dropna().groupby("var"):
                self.correct_pred_cat[i] = dict(j.set_index("predict")["reco"])

        # create output folder if not existing
        if not Path("output").exists():
            os.mkdir("output")

    def add(self, er_strict, variable, file):
        """
        Add element in the table of disagreement
        """
        disagreements = er_strict.copy() # copy to avoid writing on a slice of the caller's table
        disagreements["partial"] = None
        disagreements["variable"] = variable
        disagreements["file"] = file
        self.content.append(disagreements.reset_index())

    def write(self):
        """
        Write the file with the disagreement to annotate
        """
        content = pd.concat(self.content)
        content["modification"] = None
        content.to_excel("feedback/resolution.xlsx")

    def mod(self, id_run, variable, id_pred):
        """
        Check if there is a human annotation for a specific element
        """
        # no annotation file available yet (first run)
        if self.checked is None:
            return None
        # keep only modified
        df = self.checked
        f = (df["variable"] == variable) & (df["file"] == id_run) & (df["Article_ID"] == id_pred)
        if len(df[f]) == 0:
            return None
        if len(df[f]) > 1:
            print("Error in the identification")
            return "error"
        return str(df[f]["modification"].iloc[0]).strip()

    def eq_human(self, id_run, variable, id_pred):
        """
        Check if there is a human equality decision for an element
        """
        r = self.mod(id_run, variable, id_pred)
        if r in ["E","EE", "U", "EP"]:
            return None
        if r in ["P"]:
            return "partial"
        return "equal"
        
    def update_checked(self):
        """
        Update annotated file database if new entries
        """
        # open files with both the global unannotated data + the previous annotated data
        if not Path("feedback/resolution_mod.xlsx").exists():
            print("No resolution_mod.xlsx file")
            return None
        if not Path("feedback/resolution.xlsx").exists():
            print("No resolution.xlsx file")
            return None            

        # load files
        df_all = pd.read_excel("feedback/resolution.xlsx")
        df_prev = pd.read_excel("feedback/resolution_mod.xlsx")

        # only take elements missing in the resolution_mod file
        files_to_add = [i for i in list(df_all["file"].unique()) if i not in list(df_prev["file"].unique())]

        if len(files_to_add)==0:
            print("No new model added")
            return None
        else:       
            # add them in the resolution_mod content and create new file
            new_resolution = pd.concat([df_prev, df_all[df_all["file"].isin(files_to_add)]])
            new_resolution.to_excel("feedback/resolution_mod_updated.xlsx")
            print("Added new models to annotate in resolution_mod_updated.xlsx. Please delete the old one and rename the new",files_to_add)


## Script

This script checks the general configuration, then loops over the predictions, identifies the disagreements with the gold standard, generates or applies the human corrections, and computes the metrics.

In normal use, the script is run at least twice: once to build the `resolution.xlsx` file from which the annotator prepares `resolution_mod.xlsx`, and once more to compute the metrics after human annotation.
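
To show how the human annotations change the scores between the two runs, here is a minimal sketch of the three accuracies on a toy example. It assumes the documented codes only (`E` for error, `P` for partial, empty for correct); the function name `accuracies` and the data are hypothetical, while the dictionary keys follow the names used in the script's output table.

```python
def accuracies(auto_equal, annotations):
    """Compute strict, human-strict and human-partial accuracy.

    auto_equal:  list of booleans, result of the automatic comparison
    annotations: list of codes ("E", "P" or None), one per element
    """
    strict = human_strict = human_partial = 0
    for is_equal, code in zip(auto_equal, annotations):
        if is_equal:
            # already equal for the automatic rules
            strict += 1
            human_strict += 1
            human_partial += 1
        elif code == "P":
            # partial agreement only counts for the partial score
            human_partial += 1
        elif code != "E":
            # disagreement left empty by the annotator: counted as correct
            human_strict += 1
            human_partial += 1
    n = len(auto_equal)
    return {
        "accuracy_eq_strict": strict / n,
        "accuracy_eq_human_s": human_strict / n,
        "accuracy_eq_human_p": human_partial / n,
    }


scores = accuracies(
    auto_equal=[True, True, False, False, False],
    annotations=[None, None, "E", "P", None],
)
print(scores)
```

On the first run no annotation exists, so every automatic disagreement falls in the last branch and is counted as correct; the human-based scores only become meaningful once the annotator has marked the real errors.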

In [2]:
# Load files
n_round = 4 # decimal rounding
df_gt = pd.read_excel("input/groundtruth.xlsx",index_col="Article_ID")
variables = pd.read_csv("input/variables.csv",index_col=0)
predictions = pd.read_excel("input/predictions.xlsx",index_col=0)

# Initialize the tables
global_table = {}
resolution = Resolution()

# Check if the variables exist in the ground truth
for i in variables.index:
    # specific case for model regex to manage them as specific format
    if i.replace("_regex","") not in df_gt.columns: 
        print(f"The {i} variable is not in the ground truth")

# Add directory for files of comparison per variable after human annotation
if not Path("output").exists():
    os.mkdir("output")
if not Path("output/tables_human_eq").exists():
    os.mkdir("output/tables_human_eq")

# Loop on all predictions
print("Start looping on models")
for i in predictions.index:

    # Test if file exists
    if not Path(f"input/predictions/{i}.csv").exists():
        print(f"predictions/{i}.csv does not exist")
        continue
    
    # Load the data for the prediction
    print("Current prediction:",i)
    df = pd.read_csv(f"input/predictions/{i}.csv", index_col="Article_ID")
    if "Unnamed: 0" in df.columns:
        df = df.drop(columns=["Unnamed: 0"])

    # Test the size of the file
    if len(df) != len(df_gt):
        print(f"Problem in the number of elements of the prediction {i}", len(df), len(df_gt))

    # Loop on variables
    run_table = {}
    eq_table = {}

    for v in variables.index:
        v_m = v
        # fix specific case for regex for specific variables
        if  "_regex" not in v: 
            v_m = v+"_model" 
        if v_m not in df.columns:
            print(f"Variable {v} not in the prediction")
            continue

        # create the paired dataset to compare variable prediction/groundtruth
        df_s = df_gt[[v.replace("_regex","")]].join(df[v_m], rsuffix="pred")
        df_s.columns = ["groundtruth", "prediction"]

        # preprocess the prediction in the case for structured data to clean it
        if "dictionnary" in variables.loc[v,"type"]:
            type_v = variables.loc[v,"type"].split("[")[1].replace("]","")
            entry = variables.loc[v,"type"].replace("dictionnary|","").split("[")[0]
            df_s["prediction"] = df_s["prediction"].apply(lambda x : extract_field(x,entry, i+";"+v))
        else:
            type_v = variables.loc[v,"type"]

        # evaluating equality between GT and prediction regarding the type of variable + cleaning
        if  type_v == "categorical" or type_v == "open":
            df_s = df_s.map(clean_cat)
            strict_eq = df_s.apply(lambda x: eq(x['groundtruth'],x["prediction"], "text"),axis=1)
        if type_v == "list":
            df_s = df_s.map(extract_list)
            strict_eq = df_s.apply(lambda x : eq(x['groundtruth'],x["prediction"], "list"),axis=1)
            
        # add the automatic table of disagreements for building annotator dataset
        resolution.add(df_s[~strict_eq], v, i)

        #------------------------------------------------------------
        # Build different vectors of equality based on human feedback
        #------------------------------------------------------------
        
        human_eq_s = [] # boolean vector strict equality
        human_eq_p = [] # boolean vector partial equality
        human_eq_s_cat = [] # cat vector strict equality with cat
        human_eq_p_cat = [] # cat vector partial equality with cat

        # Loop on strict equality vector
        for idx,value in strict_eq.items():
            # if already equal by computer evaluation
            if value:  
                human_eq_s.append(value)
                human_eq_p.append(value)
                human_eq_p_cat.append(df_s.loc[idx,"groundtruth"])
                human_eq_s_cat.append(df_s.loc[idx,"groundtruth"])
            # else use the human feedback to check
            else:
                # strict human
                if resolution.eq_human(i,v,idx)=="equal": # if equal by human
                    human_eq_s.append(True)
                    human_eq_s_cat.append(df_s.loc[idx,"groundtruth"])
                else:
                    human_eq_s.append(False)
                    human_eq_s_cat.append(df_s.loc[idx,"prediction"])
                
                # partial human
                if resolution.eq_human(i,v,idx) in ["equal","partial"]: # if equal or partial by human
                    human_eq_p.append(True)
                    human_eq_p_cat.append(df_s.loc[idx,"groundtruth"])
                else:
                    human_eq_p.append(False)
                    human_eq_p_cat.append(df_s.loc[idx,"prediction"])

        # transform in series
        human_eq_s = pd.Series(human_eq_s, index=strict_eq.index)
        human_eq_p = pd.Series(human_eq_p, index=strict_eq.index)
        human_eq_s_cat = pd.Series(human_eq_s_cat, index=strict_eq.index)
        human_eq_p_cat = pd.Series(human_eq_p_cat, index=strict_eq.index)

        # add for file output
        eq_table[v] = human_eq_s

        #----------------
        # compute metrics
        #----------------

        
        f1_spe, f1_micro, f1_macro  = None, None, None

        # steps to compute F1 for categoricals
        if type_v == "categorical":
            
            # add corrected column for equivalent
            df_s["corrected"] = human_eq_s_cat
            df_s = df_s.fillna("NA")
            
            # correct prediction to ground truth
            if v in resolution.correct_pred_cat:
                df_s["corrected"] = df_s["corrected"].apply(lambda x: resolution.correct_pred_cat[v][x] 
                                                            if x in resolution.correct_pred_cat[v] else x)

            # compute f1 as the mean of binary f1 for each GT cat (sort of curated macro f1)
            list_f1 = []
            for cat in list(df_s["groundtruth"].unique()):
                list_f1.append(f1_score(df_s["groundtruth"]==cat, df_s["corrected"]==cat,average = "binary"))
            f1_spe = np.mean(list_f1)
            
            f1_micro = f1_score(df_s["groundtruth"].fillna("NA").apply(str), 
                                human_eq_s_cat.fillna("NA").apply(str), 
                                average='micro')
            f1_macro = f1_score(df_s["groundtruth"].fillna("NA").apply(str), 
                                human_eq_s_cat.fillna("NA").apply(str), 
                                average='macro')

        # for statistics
        vec_for_stats = DescrStatsW(human_eq_s)
        
        # build table for dataset      
        run_table[v] = {
            "accuracy_eq_strict":strict_eq.sum()/len(strict_eq),
            "accuracy_eq_human_s":vec_for_stats.mean, #human_eq_s.sum()/len(human_eq_s), 
            "accuracy_eq_human_p":human_eq_p.sum()/len(human_eq_p),
            "accuracy_eq_human_s_boostrap":mean_bootstrap(human_eq_s, frac=0.5),
            "CI_student": [round(i,n_round) for i in vec_for_stats.tconfint_mean()],
            "f1_spe":round(f1_spe, n_round) if f1_spe is not None else None ,
            "f1_micro":round(f1_micro, n_round) if f1_micro is not None else None ,
            "f1_macro":round(f1_macro, n_round) if f1_macro is not None else None ,
        }

    # build global table
    global_table[i] = pd.DataFrame(run_table).T

    # output file for each prediction after human annotation
    eq_table = pd.concat(eq_table, axis=1)
    eq_table.to_csv(f"output/tables_human_eq/{i}.csv")

# write general output files
resolution.write()
resolution.update_checked()
df = pd.concat(global_table)
df.to_excel("output/scores.xlsx")
print("Results saved in scores.xlsx")

Start looping on models
Current prediction: TestSet200_v2_plus_blinded_8B_JSON_yesCoT_0SHOT_n
Variable cause_of_death_regex not in the prediction
Variable age_in_years_regex not in the prediction
Current prediction: TestSet200_v2_plus_blinded_8B_noJSON_yesCoT_0SHOT
Variable cause_of_death_regex not in the prediction
Variable age_in_years_regex not in the prediction
Current prediction: TestSet200_v2_plus_blinded_70B_JSON_yesCoT_0SHOT_n
Variable cause_of_death_regex not in the prediction
Variable age_in_years_regex not in the prediction
Current prediction: TestSet200_v2_plus_blinded_70B_noJSON_yesCoT_0SHOT
Variable cause_of_death_regex not in the prediction
Variable age_in_years_regex not in the prediction
Current prediction: TestSet200_v2_plus_blinded_8B_JSON_noCoT_0SHOT
Variable cause_of_death_regex not in the prediction
Variable age_in_years_regex not in the prediction
Current prediction: TestSet200_v2_plus_blinded_70B_JSON_noCoT_0SHOT
Variable cause_of_death_regex not in the predicti

Comment: it is normal that the `_regex` variables are missing from most predictions, since they exist in only one prediction.