# General protocol

#### Logic

The script will take predictions by model (listed in `predictions.xlsx`) for a set of variables (listed in variables.csv) to compute agreement metrics with the groundtruth.

#### Files requirement

- `groundtruth.xlsx` with C columns for each variable and N lines
- a `variables.csv` file with C lines and 2 columns : one with the name of the variable, and the second with the type (numerical, categorical, open, list, or structured with the field to use and the type of the field, i.e. dictionnary|field[list]) **the name of the groundtruth and variables should be the same**
- a `predictions.xlsx` with M lines (one for each model prediction of the set of variables), a column for the NAME of the prediction (also to add : date, parameters, etc. *to discuss*)
- a predictions folder that contains the CSV files of each prediction run
      - `NAME.csv` a prediction file with the unique NAME
- a `resolution.xlsx`file is generated each time to show disagreement
- if exists, `resolution_mod.xlsx` is used to fix disagreement problems
- if new models are added, a `resolution_mod_updated.xlsx` is created to keep previous annotation and add the new one to annotate
  - to use it, delete the old one and rename the new one

#### Modify the resolution_mod file

In the modification column :

- E == error
- P == partial equity
- nothing == correct

#### Metrics

Different kind of equalities

- strict equality : 1/2 max characters diff for cat element / strict equality for list
- strict human equality: after human reading and feedback
- partial human equality : after human reading and feedback

Metrics

- agreement for every types
- micro f1 for cat

### Comments

Current file for annotation : https://docs.google.com/spreadsheets/d/1urxN8BR8p7neAo3LkH-zB3B95gqW_vSf6_tPgGau26c/edit?gid=123938045#gid=123938045

### Install

In [1]:
# pip install python-Levenshtein
# pip install openpyxl

## Functions

In [2]:
import pandas as pd
from pathlib import Path
from sklearn.metrics import f1_score
import Levenshtein
import json
import re
import warnings
warnings.filterwarnings('ignore')

# utility functions

def clean_cat(pred):
    """
    Clean text
    """
    to_remove = ["[","]",'"', '.', '-',"”", "“"]
    for i in to_remove:
        pred = str(pred).replace(i,"").lower().strip()
    if pred == "none" or pred == "" or pred=="not mentioned":
        pred = None
    return pred

def extract_list(cell):
    """
    Extract list from string
    (seems ok but to check)
    """
    cell = clean_cat(str(cell))
    if not cell:
        return []
    l = (cell.replace(";",",")).split(",") # split
    return [clean_cat(i) for i in l] # clean and return

def extract_field(cell, entry):
    """
    Extract specific field in structured string
    """
    # first try a json format
    try:
        return json.loads(cell)[entry]
    except:
        pass

    # then try to deconstruct the JSON with correct spacing
    pattern = r'"'+entry+'": "(.*?)"'
    match = re.search(pattern, cell)
    if match:
        return match.group(1)

    # with no double quote
    pattern = entry+': "(.*?)"'
    match = re.search(pattern, cell)
    if match:
        return match.group(1)
        
    # end of the element (with 2 variations of RESPONSE)
    end = cell.replace('"RESPONSE"','RESPONSE').split('RESPONSE: ')[-1].replace("\n","").replace("}","")
    return end


def fuzzy_equality(str1,str2, max_diff=3):
    """
    Compare 2 strings with character diff
    """
    if not str1 and not str2: #case 2 None
        return True
    if (not str1 and str2) or (not str2 and str1):
        return False
    distance = Levenshtein.distance(str1, str2)
    return distance <= max_diff

def compare_text(x1, x2):
    """
    Compare 2 texts with rules
    """
    # cleaning, same rule for the 2 elements
    x1 = clean_cat(x1)
    x2 = clean_cat(x2)
    
    # exact equity
    if x1==x2:
        return True

    # one is null and not the other
    if (x1 is None) and (x2 is not None):
        return False

    # fuzzy equality
    if len(x1) <= 6: # case of few characters
        t = fuzzy_equality(x1, x2, max_diff=1)
        if t:
            return True
    if len(x1) > 6: # case of many characters
        t = fuzzy_equality(x1, x2, max_diff=2)
        if t:
            return True
    return False

def compare_list(x1,x2):
    """
    Compare 2 lists
    """
    # case one is null, not the other
    if x1 is None and x2 is not None:
        return False
    if x2 is None and x1 is not None:
        return False
    # Case both null
    if x1 is None and x2 is None:
        return True

    # sort the element
    x1 = sorted([i for i in x1 if i is not None])
    x2 = sorted([i for i in x2 if i is not None])

    # equity of content
    if set(x1) == set(x2):
        return True

    # different elements in the list
    if len(set(x1)) != len(set(x2)):
        return False

    # comparaison with fuzzyness
    if sum([compare_text(i,j) for i,j in zip(x1,x2)]) == len(x1):
        return True
        
    return False

def eq(x1, x2, eq_type):
    """
    Apply a rule of equity
    """
    # case of text
    if eq_type == "text":
        return compare_text(x1, x2)

    # case of list
    if eq_type == "list":
        return compare_list(x1, x2)
        
    return None

class Resolution:
    """
    Class to build the file of disagreement for external check
    """
    def __init__(self):
        self.content = []
        self.checked = None
        if Path("resolution_mod.xlsx").exists():
            self.checked = pd.read_excel("resolution_mod.xlsx")

    def add(self, er_strict, variable, file):
        disagreements = er_strict
        disagreements["partial"] = None
        disagreements["variable"] = variable
        disagreements["file"] = file
        self.content.append(disagreements.reset_index())

    def write(self):
        content = pd.concat(self.content)
        content["modification"] = None
        content.to_excel("resolution.xlsx")

    def mod(self, id_run, variable, id_pred):
        """
        Check if there is a modification
        """
        if self.checked is None:
            return None
        # keep only modified
        df = self.checked.dropna(subset=["modification"])
        f = (df["variable"] == variable) & (df["file"] == id_run) & (df["Article_ID"] == id_pred)
        if len(df[f]) == 0:
            return None
        if len(df[f]) > 1:
            print("Error in the identification")
            return "error"
        return str(df[f]["modification"].iloc[0]).strip()

    def eq_human(self, id_run, variable, id_pred):
        r = self.mod(id_run, variable, id_pred)
        if r in ["E","EE"]:
            return None
        if r in ["P"]:
            return "partial"
        return "equal"
        
    def update_checked(self):
        """
        Update already annotated file
        """
        if not Path("resolution_mod.xlsx").exists():
            print("No modification_mod.xlsx file")
            return None
        if not Path("resolution.xlsx").exists():
            print("No modification.xlsx file")
            return None            

        # load files
        df_all = pd.read_excel("resolution.xlsx")
        df_prev = pd.read_excel("resolution_mod.xlsx")

        # only take elements missing in the resolution_mod file
        files_to_add = [i for i in list(df_all["file"].unique()) if i not in list(df_prev["file"].unique())]

        if len(files_to_add)==0:
            print("No new model added")
            return None
        
        # add them in the resolution_mod content and create new file
        new_resolution = pd.concat([df_prev, df_all[df_all["file"].isin(files_to_add)]])
        new_resolution.to_excel("resolution_mod_updated.xlsx")
        print("Added new models to annotate in resolution_mod_updated.xlsx. Please delete the old one and rename the new",files_to_add)


## Script

In [38]:
# Load files
df_gt = pd.read_excel("./groundtruth.xlsx",index_col="Article_ID")
variables = pd.read_csv("./variables.csv",index_col=0)
predictions = pd.read_excel("./predictions.xlsx",index_col=0)

# General test if the variables exist in the ground truth
for i in variables.index:
    if i not in df_gt.columns:
        print(f"The {i} variable is not in the ground truth")

# Loop on predictions
global_table = {}
resolution = Resolution()
for i in predictions.index:
    run_table = {}
    
    # Test if files/variable exist
    if not Path(f"predictions/{i}.csv").exists():
        print(f"predictions/{i}.csv does not exist")
        continue
    
    # Load the data for the prediction
    print("Current set:",i)
    df = pd.read_csv(f"predictions/{i}.csv", index_col="Article_ID")

    # Test the size of the file
    if len(df) != len(df_gt):
        print(f"Problem in the number of elements of the prediction {i}")
    
    # Loop on variables
    for v in variables.index:
        # print("Variable:",v)
        v_m = v+"_model"
        if v_m not in df.columns:
            print(f"Variable {v} not in the prediction")
            continue

        # Create the specific dataset to compare variable prediction/groundtruth
        df_s = df_gt[[v]].join(df[v_m], rsuffix="pred")
        df_s.columns = ["groundtruth", "prediction"]

        # Preprocess the prediction in the case for structured data
        if "dictionnary" in variables.loc[v,"type"]:
            type_v = variables.loc[v,"type"].split("[")[1].replace("]","")
            entry = variables.loc[v,"type"].replace("dictionnary|","").split("[")[0]
            df_s["prediction"] = df_s["prediction"].apply(lambda x : extract_field(x,entry))
        else:
            type_v = variables.loc[v,"type"]

        # Managing equality regarding the type of variable
        if  type_v == "categorical" or type_v == "open":
            df_s = df_s.map(clean_cat)
            strict_eq = df_s.apply(lambda x: eq(x['groundtruth'],x["prediction"], "text"),axis=1)
        if type_v == "list":
            df_s = df_s.map(extract_list)
            strict_eq = df_s.apply(lambda x : eq(x['groundtruth'],x["prediction"], "list"),axis=1)
            
        # Add in the record disagreement from strict equality
        resolution.add(df_s[~strict_eq], v, i)

        # Build human equality
        
        # Vectors
        human_eq_s = [] # boolean vector strict equality
        human_eq_p = [] # boolean vector partial equality
        human_eq_s_cat = [] # cat vector strict equality with cat
        human_eq_p_cat = [] # cat vector partial equality with cat

        # Loop on strict equity vector
        for idx,value in strict_eq.items():
            # if already eq
            if value: 
                human_eq_s.append(value)
                human_eq_p.append(value)
                human_eq_p_cat.append(df_s.loc[idx,"groundtruth"])
                human_eq_s_cat.append(df_s.loc[idx,"groundtruth"])
            # else use the human feedback
            else:
                # strict
                if resolution.eq_human(i,v,idx)=="equal": # if equal by human
                    human_eq_s.append(True)
                    human_eq_s_cat.append(df_s.loc[idx,"groundtruth"])
                else:
                    human_eq_s.append(False)
                    human_eq_s_cat.append(df_s.loc[idx,"prediction"])

                # partial
                if resolution.eq_human(i,v,idx) in ["equal","partial"]: # if equal by human
                    human_eq_p.append(True)
                    human_eq_p_cat.append(df_s.loc[idx,"groundtruth"])
                else:
                    human_eq_p.append(False)
                    human_eq_p_cat.append(df_s.loc[idx,"prediction"])

        # transform in series
        human_eq_s = pd.Series(human_eq_s, index=strict_eq.index)
        human_eq_p = pd.Series(human_eq_p, index=strict_eq.index)
        human_eq_s_cat = pd.Series(human_eq_s_cat, index=strict_eq.index)
        human_eq_p_cat = pd.Series(human_eq_p_cat, index=strict_eq.index)

        # f1 on brut and corrected data (strict and partial)
        f1_s, f1_p, f1_b  = None,
        if type_v == "categorical":
            f1_b = f1_score(df_s["groundtruth"].apply(str), df_s["prediction"].apply(str), average='micro')
            f1_s = f1_score(df_s["groundtruth"].apply(str), human_eq_s_cat.apply(str), average='micro')
            f1_p = f1_score(df_s["groundtruth"].apply(str), human_eq_p_cat.apply(str), average='micro')

        # build table for dataset      
        run_table[v] = {
            "eq_strict":strict_eq.sum()/len(strict_eq),
            "eq_human_s":human_eq_s.sum()/len(human_eq_s), 
            "eq_human_p":human_eq_p.sum()/len(human_eq_p), 
            "f1_brut":f1_b, 
            "f1_human_s":f1_s, 
            "f1_human_p":f1_p
        }

    # build global table
    global_table[i] = pd.DataFrame(run_table).T

resolution.write()
resolution.update_checked()
df = pd.concat(global_table)
df.to_excel("scores.xlsx")
print("Results saved in scores.xlsx")
df

Current set: TestSet200_v2_plus_blinded_8B_JSON_yesCoT_0SHOT


TypeError: cannot unpack non-iterable NoneType object

### Test resolution file

In [96]:
resolution.checked["modification"].value_counts()

modification
E     484
P     132
EE      1
Name: count, dtype: int64