# Inter-annotator Agreement for INLG 2020

The annotation team formalized our annotation guidelines in July 2020.
We then annotated 10 papers from ACL 2020 according to these new guidelines to get an idea of the degree of inter-annotator agreement;
this gives us a sense of how reliable the new guidelines and our annotations are.

In this notebook, we import data from our spreadsheets and do a bit of preprocessing so that we can calculate IAA easily using `nltk`.

We calculate [Krippendorff's alpha]() using [MASI distance]() and [Jaccard distance]().
We also include raw pair-wise agreement scores.

Note that we are not doing any hypothesis testing here, so you will not see any significance scores.
These are strictly descriptive statistics.
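
For reference, the standard definitions of the quantities named above can be written as follows. The symbols $D_o$, $D_e$ and $m$ are introduced here for illustration only and are not used elsewhere in this notebook; the MASI weights are the ones from the usual definition of the measure.

$$
\alpha = 1 - \frac{D_o}{D_e}, \qquad
d_{\mathrm{Jaccard}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}, \qquad
d_{\mathrm{MASI}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \cdot m
$$

Here $D_o$ is the observed disagreement between annotators on the same item, $D_e$ is the disagreement expected by chance, and $m$ is 1 if $A = B$, 0.67 if one set contains the other, 0.33 if the sets overlap without either containing the other, and 0 if they are disjoint.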

## Preliminaries

Our original annotations were collected using [Google Sheets](), so we used `gspread` to interact with Google Sheets, `nltk` to calculate $\alpha$, and `pandas` to manage our data.
These spreadsheets are not public, but the data from them is released in the CSV files in this repo.

In [None]:
import gspread
import pandas as pd

from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import edit_distance, jaccard_distance, masi_distance

from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import iaa_utilities


# URLs for the actual spreadsheets
# NB: These spreadsheets are not public, but the data is released in the included CSV files.
urls = ["https://docs.google.com/spreadsheets/d/1UwJaPkuYh8rDk7YyjkquuqVz9rBsxe-g1GQIKRFcPYQ/edit#gid=0", # SH
        "https://docs.google.com/spreadsheets/d/1lA2M2Ds9qmJrzqd18EFxWhplq5uO7oaC4nzVzzOmYws/edit#gid=0", # AB
        "https://docs.google.com/spreadsheets/d/1R4xpYOAuLoqPfqooH_DxJjYjaXmeB4QXvOMXGvjokLg/edit#gid=0", # MC
        "https://docs.google.com/spreadsheets/d/1hmYGD21eQozGTV-rY5m1KGsbgtGpQurK2tinyOTaUM4/edit#gid=0", # SM
        "https://docs.google.com/spreadsheets/d/1f7-OAL1HE8ON-97Snd5HsUhnqV_zs6WRTaAHNBxU7W0/edit#gid=0"] #, # SMi

annotation_df = iaa_utilities.IAAv1SpreadsheetScheme.load_locally_fallback_to_web("iaa-v1.csv", urls)

### Cleaning the dataset

There were a couple of superficial differences between the different annotators that we need to handle for the first round of IAA:

1. some annotators left whole rows blank; and
2. the annotator paraphrase of the definition was often left blank, as was the column for statistics. Blank entries compare poorly on set-distance metrics, so we replace these with "~*EMPTY*~".
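
The placeholder substitution in the second point can be sketched without `pandas` as follows. This is a minimal illustration on a hypothetical row; the cell values and the helper name `fill_blank` are invented for the example and are not part of `iaa_utilities`.

```python
import re

EMPTY_MARKER = "~*EMPTY*~"

def fill_blank(cell: str) -> str:
    """Return the placeholder for empty or whitespace-only cells, else the cell unchanged."""
    return EMPTY_MARKER if re.fullmatch(r"\s*", cell) else cell

# Hypothetical row: two blank cells and one real annotation
row = ["", "   ", "text: sentence"]
cleaned = [fill_blank(cell) for cell in row]
print(cleaned)
```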

In [None]:
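# Keep only rows where at least one data column is non-empty
has_values = annotation_df.loc[:, 'system_language':'op_statistics'].any(axis=1)
annotation_df = annotation_df[has_values].copy()

# Replace blank or whitespace-only cells with a placeholder (regex=True is required for pattern matching)
annotation_df.replace(r"^\s*$", "~*EMPTY*~", regex=True, inplace=True)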

for column in iaa_utilities.IAAv1SpreadsheetScheme.OPEN_CLASS_COLUMNS:
    annotation_df[column] = annotation_df[column].str.lower()

## Extracting the relevant information

Now that we've prepared the primary dataframe, we can easily extract smaller dataframes which facilitate the analysis.
In particular, `extract_iaa_df_by_column_name` gives us a three-column dataframe with the `source_spreadsheet`, `key` (= paper identifier), and the target column,
where we have aggregated all labels given in that column for that paper in the spreadsheet `source_spreadsheet` into a set.
(Using a set means that each label will appear only once; using a `frozenset` makes it immutable.)

This code appears in `iaa_utilities.py`:

    def extract_iaa_df_by_column_name(annotation_df: pd.DataFrame, column_name: str) -> pd.DataFrame:
        """Extract a three-column dataframe with `column_name` items grouped by `source_spreadsheet` and `key`."""
        return annotation_df[['source_spreadsheet', 'key', column_name]]\
            .groupby(['source_spreadsheet', 'key'])[column_name]\
            .apply(frozenset).reset_index()

    def extract_records_for_nltk(iaa_df: pd.DataFrame) -> List[Tuple]:
        """The first column in the `to_records()` representation is an index, which we don't need for `nltk`."""
        return [(b, c, d) for _, b, c, d in iaa_df.to_records()]


In [None]:
extract_iaa_df_by_column_name = iaa_utilities.extract_iaa_df_by_column_name
extract_records_for_nltk = iaa_utilities.extract_records_for_nltk
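
To make the shape of the extracted records concrete, here is a minimal pure-Python sketch of the same grouping. The rows, annotator initials and paper key are hypothetical toy data, and the sketch uses a plain dictionary rather than the `pandas` group-by that `iaa_utilities` relies on.

```python
from collections import defaultdict

# Hypothetical annotation rows: (source_spreadsheet, key, label)
rows = [
    ("SH", "paper-01", "fluency"),
    ("SH", "paper-01", "grammaticality"),
    ("SH", "paper-01", "fluency"),  # duplicate label collapses in the set
    ("AB", "paper-01", "fluency"),
]

# Collect all labels given by one annotator for one paper
grouped = defaultdict(set)
for annotator, key, label in rows:
    grouped[(annotator, key)].add(label)

# One (coder, item, labels) triple per annotator and paper
records = [(annotator, key, frozenset(labels)) for (annotator, key), labels in grouped.items()]
print(records)
```

Each triple has the coder, item, label layout that `AnnotationTask.load_array` takes, with the label being a `frozenset` so that set-based distances can be applied to it.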

## Calculating agreement

We will first test our approach with a single column and MASI distance,
before running it for all the columns for both MASI and Jaccard.
In this example we focus on `criterion_paraphrase`, a mostly closed-class column.

In [None]:
# Grab the three-column version
criterion_paraphrase_df = extract_iaa_df_by_column_name(annotation_df, "criterion_paraphrase")
# Define an AnnotationTask with our preferred distance metric
criterion_paraphrase_task = AnnotationTask(distance=masi_distance)
# Extract the records from our three-column dataframe and load them into the AnnotationTask
criterion_paraphrase_task.load_array(extract_records_for_nltk(criterion_paraphrase_df))
# See what we get for Krippendorff's alpha
print(criterion_paraphrase_task.alpha())
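
For readers who want to see what goes into the number printed above, the following is a minimal sketch of Krippendorff's alpha for (coder, item, labels) triples, written from the standard formula rather than copied from `nltk`. The function names and the toy records are assumptions made for illustration, and the Jaccard helper assumes non-empty label sets.

```python
from itertools import permutations

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard distance between two non-empty label sets."""
    return 1 - len(a & b) / len(a | b)

def krippendorff_alpha(records, distance) -> float:
    """Compute alpha = 1 - D_o / D_e from (coder, item, labels) triples."""
    by_item = {}
    for _coder, item, labels in records:
        by_item.setdefault(item, []).append(labels)
    # Only items rated by at least two coders contribute
    units = [ratings for ratings in by_item.values() if len(ratings) >= 2]
    n = sum(len(ratings) for ratings in units)
    # Observed disagreement: pairwise distances within each item
    observed = sum(
        sum(distance(a, b) for a, b in permutations(ratings, 2)) / (len(ratings) - 1)
        for ratings in units
    ) / n
    # Expected disagreement: pairwise distances across all pooled ratings
    pooled = [labels for ratings in units for labels in ratings]
    expected = sum(distance(a, b) for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return 1 - observed / expected

x, y = frozenset({"fluency"}), frozenset({"clarity"})
agree = [("SH", "p1", x), ("AB", "p1", x), ("SH", "p2", y), ("AB", "p2", y)]
disagree = [("SH", "p1", x), ("AB", "p1", y), ("SH", "p2", y), ("AB", "p2", x)]
alpha_agree = krippendorff_alpha(agree, jaccard)
alpha_disagree = krippendorff_alpha(disagree, jaccard)
print(alpha_agree, alpha_disagree)
```

On these toy records, perfect agreement gives an alpha of 1, while systematic disagreement gives a negative value, i.e. worse than chance.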

Now we want to repeat the exercise for *all* of the closed-class columns **and** using both MASI and Jaccard distance.
We'll store the three-column dataframes and resulting alpha values in a dict for easy access later.

The following code appears in `iaa_utilities.py`, converted to a method for each spreadsheet format:

    def run_closed_class_jaccard_and_masi(df: pd.DataFrame) -> Dict:
        iaa_by_column = {column: {"df": extract_iaa_df_by_column_name(df, column)} for column in iaa_utilities.IAAv1SpreadsheetScheme.CLOSED_CLASS_COLUMNS}
        for column in iaa_by_column:
            task = AnnotationTask(distance=jaccard_distance)
            task.load_array(extract_records_for_nltk(iaa_by_column[column]['df']))
            iaa_by_column[column]['alpha_jaccard'] = task.alpha()
            task = AnnotationTask(distance=masi_distance)
            task.load_array(extract_records_for_nltk(iaa_by_column[column]['df']))
            iaa_by_column[column]['alpha_masi'] = task.alpha()
        return iaa_by_column
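
To illustrate why the two metrics can give different alpha values for the same column, here is a small pure-Python sketch of both distances. The helper names and label sets are hypothetical, the weights follow the usual MASI definition, and the sets are assumed to be non-empty.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: one minus intersection size over union size."""
    return 1 - len(a & b) / len(a | b)

def masi(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    overlap = len(a & b)
    if a == b:
        weight = 1.0
    elif overlap == min(len(a), len(b)):  # one set contains the other
        weight = 0.67
    elif overlap > 0:  # partial overlap, neither contains the other
        weight = 0.33
    else:  # disjoint sets
        weight = 0.0
    return 1 - overlap / len(a | b) * weight

subset_pair = (frozenset({"fluency"}), frozenset({"fluency", "clarity"}))
overlap_pair = (frozenset({"fluency", "clarity"}), frozenset({"clarity", "coherence"}))
print(jaccard(*subset_pair), masi(*subset_pair))
print(jaccard(*overlap_pair), masi(*overlap_pair))
```

Whenever two label sets are not identical, MASI assigns a distance at least as large as Jaccard does, so it is the stricter of the two.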

In [None]:
iaa_by_column = iaa_utilities.IAAv1SpreadsheetScheme.run_closed_class_jaccard_and_masi(annotation_df)
print(iaa_by_column['criterion_paraphrase']['df'].head())

In [None]:
iaa_utilities.pretty_print_iaa_by_column(iaa_by_column)

This did reasonable things for our dev data in a strict-agreement mode, but we should also produce a version which relaxes some of the restrictions.
In particular, we want to collapse categories which will be trivially different.
For example, the "other" and "multiple" options let annotators add further details, and those details will almost always differ between annotators.
In these instances it is still informative to know that annotators agreed that, e.g., the annotation scheme did not cover this paper (choosing "other").

We also want to leverage hierarchical structure in the annotation scheme when possible, such as for the `criterion_paraphrase` column.

## Broad Agreement

We will call the exact-matching (at the string level) version of agreement which we have used so far *narrow* and now define *broad* agreement.
Broad agreement uses the natural hierarchies in the annotation scheme to group elements together which we might want to consider as equivalent.

For example, if two annotators disagree about the output type of a system, with one saying *text: paragraph* and the other saying *text: document*,
we want to penalize this less than if one of them were to say *multi-modal* instead.

### Input/Output Columns

For the `system_input` and `system_output` columns, we will consider the following equivalence classes:

* text = {text: subsentential units of text, text: sentence, text: paragraph, text: document, text: dialogue, text: other (please specify)},
* multiple = {all variations of *multiple (list all)*}, and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. raw data = {*raw data*}).
In the narrow agreement calculations, the elements of an equivalence class are all distinct under a nominal distance metric (i.e. identical strings are at distance 0 and all other pairs are at distance 1).
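
The difference between narrow and broad agreement for these columns can be sketched as follows. The helper names `to_broad` and `nominal_distance` are hypothetical and exist only for this illustration; the patterns mirror the equivalence classes listed above.

```python
import re

def to_broad(label: str) -> str:
    """Collapse a narrow input/output label into its broad equivalence class."""
    for pattern, broad in ((r"^[Tt]ext:.*", "text"),
                           (r"^[Mm]ultiple.*", "multiple"),
                           (r"^[Oo]ther.*", "other")):
        if re.match(pattern, label):
            return broad
    return label  # singleton classes keep their own label

def nominal_distance(a: str, b: str) -> int:
    """Distance 0 for identical strings, 1 otherwise."""
    return 0 if a == b else 1

narrow = nominal_distance("text: paragraph", "text: document")
broad = nominal_distance(to_broad("text: paragraph"), to_broad("text: document"))
cross = nominal_distance(to_broad("text: paragraph"), to_broad("multi-modal"))
print(narrow, broad, cross)
```

Under narrow agreement the two text labels count as a full disagreement; under broad agreement they fall into the same class, while *multi-modal* remains distinct from both.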


### Task Column

For the `system_task` column, we use the following equivalence classes:

* multiple = {all variations of *multiple (list all)*} and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. aggregation = {*aggregation*}).
In the narrow agreement calculations, the elements of an equivalence class are all distinct under a nominal distance metric (i.e. identical strings are at distance 0 and all other pairs are at distance 1).

### Paraphrase of Criterion Name Column

For the `criterion_paraphrase` column, we use two sets of equivalence classes:
one for simple string-level differences related to annotator-specified details as in the above cases and
another based on the hierarchy of criteria.

#### String-level equivalence classes

* Detectability of Text Property (specify property here) = {all variations of *Detectability of Text Property*}
* Effect on listener (specify effect here) = {all variations of *Effect on listener*}
* Inferrability of Speaker Stance (specify object of stance here) = {all variations of *Inferrability of Speaker Stance*}
* Inferrability of Speaker Trait (specify trait here) = {all variations of *Inferrability of Speaker Trait*}

#### Hierarchy-based equivalence classes

* Quantitative Criteria = {*Quantitative Criteria*, *Input Surface Form Retention*, *Input Content Retention*}
* Quality of Surface Form = {*Quality of Surface Form*, *Correctness of Surface Form*, *Grammaticality*, *Spelling Accuracy*, *Quality of Expression ('well-written')*, *Speech Quality*, *Aesthetic Quality of Surface Form*, *Appropriateness of form given context*}
* Quality of Content = {*Quality of Content*, *Correctness of Content*, *Correctness relative to input*, *Correctness relative to External Reference*, *Answerability from input*, *Adequacy/Appropriateness*, *Adequacy (does text have all and only relevant content given the input)*, *Adequacy Precision (does text have only relevant content given the input)*, *Adequacy Recall (does text have all relevant content given the input)*, *Appropriateness given context*, *Quality of Content*, *Informativeness*, *Too much / not enough information*}
* Quality of text as a whole = {*Quality of text as a whole*, *Coherence*, *Cohesiveness*, *Well-orderedness*, *Referent Resolvability*, *Complexity*, *Complexity/Simplicity of Form*, *Complexity/Simplicity of Content*, *Technicality/requires subject expertise*, *Naturalness*, *Naturalness (=likelihood)*, *Naturalness (form)*, *Naturalness (content)*, *Conversationality*, *Ease of Communication*, *Readability*, *Fluency*, *Clarity*, *Understandability*, *Nonredundancy*, *Nonredundancy (form)*, *Nonredundancy (content)*, *Vagueness/Specificity*, *Vagueness/Specificity (form)*, *Vagueness/Specificity (content)*, *Variedness*, *Variedness (form)*, *Variedness (content)*, *Originality*, *Originality (form)*, *Originality (content)*,  *Intended Property*, *Detectability of Text Property*}
* Extralinguistic Quality = {*Extralinguistic Quality*, *Criteria related to Listener/Reader*, *Effect on listener*, *Inferrability of Speaker Stance*, *Inferrability of Speaker Trait*, *Learnability*, *Visualisability*, *Humanlikeness*, *Humanlikeness (form)*, *Humanlikeness (content)*, *Usefulness (nonspecific)*, *Usefulness for task / information need*, *Criteria related to system*, *User Satisfaction*, *Usability*}
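
One way to use such a hierarchy is to invert it into a lookup from each lower-level criterion to its higher-level class. The sketch below uses a hypothetical excerpt copied from two of the classes listed above; the actual `HIERARCHY_DICT` in `iaa_utilities` may be organized differently.

```python
# Hypothetical excerpt of the hierarchy, taken from two of the classes listed above
hierarchy_excerpt = {
    "Quantitative Criteria": ["Input Surface Form Retention", "Input Content Retention"],
    "Quality of Surface Form": ["Correctness of Surface Form", "Grammaticality", "Spelling Accuracy"],
}

# Invert it so that each lower-level criterion points to its higher-level class
broad_class = {lower: higher for higher, lowers in hierarchy_excerpt.items() for lower in lowers}
# Higher-level names map to themselves
broad_class.update({higher: higher for higher in hierarchy_excerpt})

same_class = broad_class["Grammaticality"] == broad_class["Spelling Accuracy"]
different_class = broad_class["Grammaticality"] == broad_class["Input Content Retention"]
print(same_class, different_class)
```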

### Form of Response Elicitation

For this column, we use a single equivalence class:

* other = {all variations of *other (please specify)*}

### Performing the updates and the calculations

We work with a fresh copy of `annotation_df` so that the original data is still accessible in the notebook.
For each of the columns where *other (please specify)* is a valid annotation, we replace any annotation beginning with "other" with "other",
which collapses the distinctions created by the further specifications.
We do the same thing for annotations of *multiple (list all)*.

In [None]:
broad_anno_df = annotation_df.copy(deep = True)
for column in ("system_input", "system_output", "system_task", "op_form"):
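    # regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Oo]ther.*", "other", regex=True)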
broad_anno_df['system_input']

In [None]:
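for column in ("system_input", "system_output"):
    # regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Mm]ultiple.*", "multiple", regex=True)
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Tt]ext:.*", "text", regex=True)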
broad_anno_df['system_input']

When it comes to the `criterion_paraphrase` column, however, we can take either of the two approaches described above.
We will create one copy of the broad annotation dataframe for each of them and then apply our fixes to the copies.

For the version focused only on discrepancies caused by 'please specify' lists,
we can use the same kind of approach we used earlier for 'other' and 'multiple':
look for the keyphrase at the beginning of the cell and remove any other cell contents.

In [None]:
broad_anno_string_df = broad_anno_df.copy(deep = True)
for string_prefix in ("Detectability of Text Property", "Effect on listener", "Inferrability of Speaker Stance", "Inferrability of Speaker Trait"):
    # Keep only the keyphrase and drop any annotator-specified details that follow it
    broad_anno_string_df['criterion_paraphrase'] = broad_anno_string_df['criterion_paraphrase'].str.replace(f"^{string_prefix}.*", string_prefix, regex=True)
broad_anno_string_df['criterion_paraphrase']

For the hierarchical version, we need to define the hierarchy as a replacement dictionary first.
Here we define a dictionary where the key is the 'higher-level' which we will use as a replacement for each of the values associated with it (the 'lower-level').

In [None]:
broad_anno_hierarchical_df = broad_anno_df.copy(deep = True)
hierarchy_dict = iaa_utilities.IAAv1SpreadsheetScheme.HIERARCHY_DICT

for higher_level in hierarchy_dict:
    for lower_level in hierarchy_dict[higher_level]:
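        # NB: each lower-level label is used as a regex pattern, so special characters in it are interpreted by the regex engine
        broad_anno_hierarchical_df['criterion_paraphrase'] = broad_anno_hierarchical_df['criterion_paraphrase'].str.replace(f"^{lower_level}.*", higher_level, regex=True)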
broad_anno_hierarchical_df['criterion_paraphrase']
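
Several labels in the hierarchy listed earlier contain parentheses, e.g. *Humanlikeness (form)*, which have a special meaning when a label is used as a regular expression. The sketch below shows the effect with the standard `re` module. Whether the entries of `HIERARCHY_DICT` are already escaped is not shown in this notebook, so treat this as a caveat to check, not a description of the actual behaviour.

```python
import re

label = "Humanlikeness (form)"

# Used directly as a pattern, the parentheses form a regex group,
# so the literal label (which contains the parentheses) does not match
unescaped = re.match(f"^{label}.*", label)

# re.escape makes the parentheses literal, so the label matches itself
escaped = re.match(f"^{re.escape(label)}.*", label)

print(unescaped is None, escaped is not None)
```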

We can then repeat our calculations of $\alpha$ using the broad versions of the spreadsheet.

#### For the string-based calculations

In [None]:
broad_string_dict = iaa_utilities.IAAv1SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_string_df)
iaa_utilities.pretty_print_iaa_by_column(broad_string_dict)

In [None]:
broad_string_dict['system_output']['df']

In [None]:
at = AnnotationTask(data = extract_records_for_nltk(broad_string_dict['system_output']['df']), distance=jaccard_distance)
at.alpha()

#### For the hierarchical calculations

In [None]:
broad_hier_dict = iaa_utilities.IAAv1SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_hierarchical_df)
iaa_utilities.pretty_print_iaa_by_column(broad_hier_dict)

## Pairwise interannotator agreement

In [None]:

iaa_utilities.IAAv1SpreadsheetScheme.print_absolute_agreement(annotation_df, iaa_by_column)
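
The implementation of `print_absolute_agreement` is not shown in this notebook. One plausible reading of raw pair-wise agreement, sketched here under that assumption, is the fraction of shared papers on which two annotators chose exactly the same set of labels. The data and the function name `pairwise_agreement` are hypothetical.

```python
from itertools import combinations

# Hypothetical label sets per annotator and paper
labels = {
    "SH": {"p1": frozenset({"fluency"}), "p2": frozenset({"clarity"})},
    "AB": {"p1": frozenset({"fluency"}), "p2": frozenset({"coherence"})},
    "MC": {"p1": frozenset({"fluency"}), "p2": frozenset({"clarity"})},
}

def pairwise_agreement(labels: dict) -> dict:
    """Fraction of shared papers on which each annotator pair chose identical label sets."""
    scores = {}
    for a, b in combinations(sorted(labels), 2):
        shared = labels[a].keys() & labels[b].keys()  # assumes at least one shared paper
        matches = sum(labels[a][key] == labels[b][key] for key in shared)
        scores[(a, b)] = matches / len(shared)
    return scores

scores = pairwise_agreement(labels)
print(scores)
```

Unlike $\alpha$, this score is not corrected for chance agreement, which is why it is reported alongside $\alpha$ and does not replace it.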

### Pair-wise agreement on the broad string-based annotations

In [None]:
iaa_utilities.IAAv1SpreadsheetScheme.print_absolute_agreement(broad_anno_string_df)

### Pair-wise agreement on the broad hierarchical annotations

In [None]:
iaa_utilities.IAAv1SpreadsheetScheme.print_absolute_agreement(broad_anno_hierarchical_df)

## Agreement tables for each column

In [None]:
for column in iaa_utilities.IAAv1SpreadsheetScheme.ALL_DATA_COLUMNS:
    display(column)
    display(extract_iaa_df_by_column_name(annotation_df, column).pivot(index="key", columns="source_spreadsheet", values=column))
