# Inter-annotator Agreement v2.0 for INLG 2020

The annotation team formalized our annotation guidelines in July 2020.
We then annotated 10 papers from ACL 2020 according to these new guidelines to get an idea of the degree of inter-annotator agreement;
this gives us a sense of how reliable the new guidelines and our annotations are.
The results were disappointing, so we revised our guidelines and annotation spreadsheets and annotated a second round of 10 papers.
This document reports our inter-annotator agreement (IAA) for that second round of 10 papers.

In this notebook, we import data from our spreadsheets and do a bit of preprocessing so that we can calculate IAA easily using `nltk`.

We calculate [Krippendorff's alpha]() using [MASI distance]() and [Jaccard distance]().
We also include raw pair-wise agreement scores.
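
For readers unfamiliar with the two set distances, the following is a minimal plain-Python sketch of how they are usually defined. The function names are ours, the MASI weights (1, 0.67, 0.33, 0) follow our understanding of what `nltk` uses, and returning 0 for two empty sets is an assumption of this sketch rather than `nltk` behaviour.

```python
def jaccard_distance_sketch(a: frozenset, b: frozenset) -> float:
    """Jaccard distance: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption of this sketch: two empty sets count as identical
    return 1 - len(a & b) / len(union)


def masi_distance_sketch(a: frozenset, b: frozenset) -> float:
    """MASI distance: Jaccard similarity scaled by a monotonicity weight."""
    union = a | b
    if not union:
        return 0.0  # same assumption as above
    overlap = a & b
    if a == b:
        weight = 1.0    # identical sets
    elif a <= b or b <= a:
        weight = 0.67   # one set contains the other
    elif overlap:
        weight = 0.33   # partial overlap, neither contains the other
    else:
        weight = 0.0    # disjoint sets
    return 1 - (len(overlap) / len(union)) * weight


narrow = frozenset({"text: sentence"})
wide = frozenset({"text: sentence", "raw data"})
print(jaccard_distance_sketch(narrow, wide))
print(masi_distance_sketch(narrow, wide))
```

MASI is never smaller than Jaccard distance for the same pair of sets, because it additionally penalizes sets that overlap without being identical.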

Note that we are not doing any hypothesis testing here, so you will not see any significance scores.
These are strictly descriptive statistics.

## Preliminaries

Our original annotations were collected using [Google Sheets](), so we used `gspread` to interact with Google Sheets, `nltk` to calculate $\alpha$, and `pandas` to manage our data.
These spreadsheets are not public, but the data from them is released in the CSV files in this repo.

In [None]:
import pandas as pd
import re

from nltk.metrics import edit_distance, jaccard_distance

from IPython.display import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import iaa_utilities


# URLs for the actual spreadsheets
urls = ["https://docs.google.com/spreadsheets/d/17SXR1PMFiavwNUgg3q6EDHWuIiuK2jGaFgb8N-EUukA/edit",
        "https://docs.google.com/spreadsheets/d/18FhVQ0h1kRYuoSGF1DVHMyu-Ag3ZHsDS2ppOIf8z3mc/edit",
        "https://docs.google.com/spreadsheets/d/1gRyvyJT-6rnMhPyvSD82JJHAXEaL-7e2IjHkFPV8kX0/edit",
        "https://docs.google.com/spreadsheets/d/1DmnT-4ab4wCh9NczY7IEolussQhYXZu5q7mP72Kv0rY/edit"]


# If you want to force loading from the web instead of trying to load locally, uncomment the next line and comment the one after that.
# annotation_df = iaa_utilities.IAAv2SpreadsheetScheme.prepare_df_from_google_sheets(urls)
annotation_df = iaa_utilities.IAAv2SpreadsheetScheme.load_locally_fallback_to_web("iaa-v2_double-annotations.csv", urls)

### Cleaning the dataset

We keep the code that cleans up superficial differences between annotators, which we first needed for the first round of IAA:

1. some annotators left whole rows blank; and
2. the annotator's paraphrase of the definition was often left blank, as was the column for statistics. Blank entries compare poorly on set-distance metrics, so we replace them with "~*EMPTY*~".

In [None]:
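# Keep only rows in which at least one annotation column has a value.
has_values = annotation_df.loc[:, 'system_language':'op_statistics'].any(axis=1)
annotation_df = annotation_df[has_values].copy()

# Replace blank or whitespace-only cells with an explicit placeholder (regex=True is required for pattern matching).
annotation_df.replace(r"^\s*$", "~*EMPTY*~", regex=True, inplace=True)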

for column in iaa_utilities.IAAv2SpreadsheetScheme.OPEN_CLASS_COLUMNS:
    annotation_df[column] = annotation_df[column].str.lower()

For the second round of IAA, in addition to updating the guidelines, we updated the spreadsheet to have dropdown menus for the criteria names and other columns.
For the `criterion_paraphrase` column, this included the enumeration from the guidelines, as well as value-initial hyphens to get some degree of indentation indicative of the overall hierarchy.
We therefore normalize the `criterion_paraphrase` column to remove the hyphens and the enumeration.

In [None]:
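annotation_df['criterion_paraphrase'] = (
    annotation_df['criterion_paraphrase']
    .str.lower()
    .str.replace("-", "", regex=False)                       # indentation hyphens
    .str.replace(r"[0123456789a-z\/]+\.", "", regex=True)    # enumeration such as "1a."
    .str.replace(r"\s+", " ", regex=True)                    # collapse runs of whitespace
    .str.replace(";", ",", regex=False)
    .str.strip()
)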
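
To make the effect of this normalization easier to inspect, here is a sketch that applies the same steps to a single string with the standard `re` module. The helper name and the example dropdown value are hypothetical; the real values come from the spreadsheets.

```python
import re


def normalize_criterion(value: str) -> str:
    """Apply the same normalization steps as the pandas chain, to one string."""
    value = value.lower()
    value = value.replace("-", "")               # indentation hyphens (and any others)
    value = re.sub(r"[0-9a-z/]+\.", "", value)   # enumeration such as "1a."
    value = re.sub(r"\s+", " ", value)           # collapse runs of whitespace
    value = value.replace(";", ",")
    return value.strip()


example = "-- 1a. Correctness of outputs; form"  # hypothetical dropdown value
print(normalize_criterion(example))
```

Two side effects are worth keeping in mind: hyphens inside a criterion name are removed as well, and any lowercase word immediately followed by a full stop is deleted along with the enumeration.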

We also need to deal with columns where `multiple (please specify):` is a valid value, so that the likelihood of spurious differences is as low as possible.
Ideally we should standardise the order of the values listed, but we did not do that for the initial INLG submission (hence the empty code cell).
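
One way the ordering could be standardised is sketched below. This was not part of the INLG submission; the helper name is hypothetical, and the sketch assumes that the specified values follow the prefix as a comma-separated list.

```python
def standardise_multiple(value: str, prefix: str = "multiple (please specify):") -> str:
    """Sort and de-duplicate the values listed after the 'multiple' prefix."""
    if not value.lower().startswith(prefix):
        return value  # not a 'multiple' annotation: leave untouched
    # Assumption: the listed values are separated by commas.
    items = {item.strip().lower() for item in value[len(prefix):].split(",")}
    items.discard("")
    return f"{prefix} " + ", ".join(sorted(items))


first = standardise_multiple("multiple (please specify): text: sentence, raw data")
second = standardise_multiple("multiple (please specify): raw data, text: sentence")
print(first == second)
```

With a step like this, two annotators who list the same values in a different order would produce identical strings and would no longer count as disagreeing.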

## Extracting the relevant information

Now that we've prepared the primary dataframe, we can easily extract smaller dataframes which facilitate the analysis.
In particular, this function gives us a three-column DF with the `source_spreadsheet`, `key` (= paper identifier), and the target column,
where we have aggregated all labels given in that column for that paper in the spreadsheet `source_spreadsheet` into a set.
(Using a set means that each label will appear only once; using a `frozenset` makes it immutable.)

This code appears in `iaa_utilities.py`:

    def extract_iaa_df_by_column_name(annotation_df: pd.DataFrame, column_name: str) -> pd.DataFrame:
        """Extract a three-column dataframe with `column_name` items grouped by `source_spreadsheet` and `key`."""
        return annotation_df[['source_spreadsheet', 'key', column_name]]\
            .groupby(['source_spreadsheet', 'key'])[column_name]\
            .apply(frozenset).reset_index()

    def extract_records_for_nltk(iaa_df: pd.DataFrame) -> List[Tuple]:
        """The first column in the `to_records()` representation is an index, which we don't need for `nltk`."""
        return [(b, c, d) for _, b, c, d in iaa_df.to_records()]
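
The grouping step can also be illustrated without `pandas`. The sketch below uses made-up rows and a hypothetical helper name; it produces `(source_spreadsheet, key, labels)` triples of the same shape as those returned by `extract_records_for_nltk`.

```python
from collections import defaultdict

# Made-up rows: (source_spreadsheet, key, label)
rows = [
    ("annotator_1", "paper_a", "text: sentence"),
    ("annotator_1", "paper_a", "raw data"),
    ("annotator_1", "paper_a", "raw data"),   # duplicate label, absorbed by the set
    ("annotator_2", "paper_a", "raw data"),
]


def group_labels(rows):
    """Collect all labels per (source_spreadsheet, key) pair into a frozenset."""
    grouped = defaultdict(set)
    for source, key, label in rows:
        grouped[(source, key)].add(label)
    return [
        (source, key, frozenset(labels))
        for (source, key), labels in sorted(grouped.items())
    ]


records = group_labels(rows)
for record in records:
    print(record)
```

Each triple pairs one annotator with one paper and the set of labels that annotator used for it, which is the unit that the set-distance metrics compare.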


In [None]:
extract_iaa_df_by_column_name = iaa_utilities.extract_iaa_df_by_column_name
extract_records_for_nltk = iaa_utilities.extract_records_for_nltk

## Calculating agreement

We use the same setup for calculating Krippendorff's alpha with Jaccard distance and MASI distance for the closed-class columns.
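
For orientation, Krippendorff's alpha is $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement within items and $D_e$ is the disagreement expected when labels are paired at random. The sketch below is a plain-Python illustration of that definition under an arbitrary distance function; it is not the `nltk` implementation that the notebook calls, and the function names and toy data are ours.

```python
from itertools import permutations


def krippendorff_alpha_sketch(units, distance):
    """units: one list of labels per item (one label per annotator who coded it)."""
    units = [labels for labels in units if len(labels) >= 2]  # unpairable items are skipped
    n = sum(len(labels) for labels in units)
    # Observed disagreement: ordered label pairs within each item.
    observed = sum(
        sum(distance(a, b) for a, b in permutations(labels, 2)) / (len(labels) - 1)
        for labels in units
    ) / n
    # Expected disagreement: ordered label pairs across the pooled labels.
    pooled = [label for labels in units for label in labels]
    expected = sum(distance(a, b) for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return 1 - observed / expected  # undefined (division by zero) if all labels are identical


def nominal(a, b):
    return 0.0 if a == b else 1.0


toy_units = [["a", "a"], ["a", "a"], ["b", "b"], ["a", "b"]]
print(round(krippendorff_alpha_sketch(toy_units, nominal), 4))  # 0.5333
```

Passing a set distance such as Jaccard or MASI in place of `nominal` gives the set-valued variants, with each label being a `frozenset`.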


In [None]:
iaa_by_column = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(annotation_df)
print(iaa_by_column['criterion_paraphrase']['df'].head())

In [None]:
iaa_utilities.pretty_print_iaa_by_column(iaa_by_column)

This does reasonable things for our dev data in a strict-agreement mode, but we should also produce a version which relaxes some of the restrictions.
We should also do one for open-class columns with a different distance measure.

## Broad Agreement

We will call the exact-matching (at the string level) version of agreement which we have used so far *narrow* and now define *broad* agreement.
Broad agreement uses the natural hierarchies in the annotation scheme to group elements together which we might want to consider as equivalent.

For example, if two annotators disagree about the output type of a system, with one saying *text: paragraph* and the other saying *text: document*,
we want to penalize this less than if one of them were to say *multi-modal* instead.

### Input/Output Columns

For the `system_input` and `system_output` columns, we will consider the following equivalence classes:

* text = {text: subsentential units of text, text: sentence, text: paragraph, text: document, text: dialogue, text: other (please specify)},
* multiple = {all variations of *multiple (list all)*}, and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. raw data = {*raw data*}).
In the narrow agreement calculations, labels are compared with a nominal distance metric, so even members of the same equivalence class differ from one another (identical strings are at distance 0 and all other pairs are at distance 1).
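
The difference between narrow and broad agreement for these columns can be sketched as follows. The helper names are hypothetical, and the prefix patterns are a simplified stand-in for the equivalence classes listed above.

```python
import re

# Simplified stand-in for the equivalence classes: (prefix pattern, class name)
BROAD_IO_CLASSES = (
    (r"^text:.*", "text"),
    (r"^multiple.*", "multiple"),
    (r"^other.*", "other"),
)


def to_broad_io_label(label: str) -> str:
    """Map a label to its equivalence class; unmatched labels form their own class."""
    for pattern, class_name in BROAD_IO_CLASSES:
        if re.match(pattern, label, flags=re.IGNORECASE):
            return class_name
    return label


def nominal_distance(a: str, b: str) -> int:
    return 0 if a == b else 1


first, second, third = "text: paragraph", "text: document", "multi-modal"
print(nominal_distance(first, second))                                        # narrow: 1
print(nominal_distance(to_broad_io_label(first), to_broad_io_label(second)))  # broad: 0
print(nominal_distance(to_broad_io_label(first), to_broad_io_label(third)))   # broad: 1
```

Under broad agreement the two `text:` labels no longer count as a disagreement, while the disagreement with *multi-modal* is preserved.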


### Task Column

For the `system_task` column, we use the following equivalence classes:

* multiple = {all variations of *multiple (list all)*} and
* other = {all variations of *other (please specify)*}

with all other allowed labels belonging individually to their own equivalence class containing only one element (e.g. aggregation = {*aggregation*}).
In the narrow agreement calculations, labels are compared with a nominal distance metric, so even members of the same equivalence class differ from one another (identical strings are at distance 0 and all other pairs are at distance 1).

### Paraphrase of Criterion Name Column

For the `criterion_paraphrase` column, we use two sets of equivalence classes:
one for simple string-level differences related to annotator-specified details as in the above cases and
another based on the hierarchy of criteria.

#### String-level equivalence classes

* Detectability of Text Property (specify property here) = {all variations of *Detectability of Text Property*}
* Effect on listener (specify effect here) = {all variations of *Effect on listener*}
* Inferrability of Speaker Stance (specify object of stance here) = {all variations of *Inferrability of Speaker Stance*}
* Inferrability of Speaker Trait (specify trait here) = {all variations of *Inferrability of Speaker Trait*}

#### Hierarchy-based equivalence classes

These can be read in version 2.0 of the annotation guidelines.
We use all the immediate children of `Quality of outputs` as the top level categories and map all of their children to them.
This gives us four equivalence classes:

* `Quality of outputs` (containing only itself),
* `Correctness of outputs`,
* `Goodness of outputs (excluding correctness)`, and
* `Feature-type criteria`

### Form of Response Elicitation

* other = {all variations of *other (please specify)*}

### Performing the updates and the calculations

We work with a fresh copy of `annotation_df` so that the original data is still accessible in the notebook.
For each of the columns where *other (please specify)* is a valid annotation, we replace any annotation beginning with "other" with the bare label "other":
this collapses the distinctions created by the further specifications.
We do the same thing for annotations of *multiple (list all)*.

In [None]:
broad_anno_df = annotation_df.copy(deep = True)
for column in ("system_input", "system_output", "system_task", "op_form"):
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Oo]ther.*", "other", regex=True)
broad_anno_df['system_input']

In [None]:
for column in ("system_input", "system_output"):
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Mm]ultiple.*", "multiple", regex=True)
    broad_anno_df[column] = broad_anno_df[column].str.replace(r"^[Tt]ext:.*", "text", regex=True)
broad_anno_df['system_input']

# broad_anno_df['criterion_paraphrase'] = broad_anno_df['criterion_paraphrase'].str.replace("^[Mm]ultiple.*", "multiple")

When it comes to the `criterion_paraphrase` column, however, we can take either of the two approaches described above.
We will create one copy of the broad annotation dataframe for each of them and then apply our fixes to the copies.

For the version focused only on discrepancies caused by 'please specify' lists,
we can use the same kind of approach we used earlier for 'other' and 'multiple':
look for the keyphrase at the beginning of the cell and remove any other cell contents.

In [None]:
broad_anno_string_df = broad_anno_df.copy(deep = True)
for string_prefix in ("Text Property", "Detectability of controlled feature", "Effect on reader/listener", "Inferrability of speaker/author stance", "Inferrability of speaker/author trait"):
    broad_anno_string_df['criterion_paraphrase'] = broad_anno_string_df['criterion_paraphrase'].str.replace(f"{string_prefix}.*", string_prefix, case=False, regex=True)

broad_anno_string_df['criterion_paraphrase']

For the hierarchical version, we need to define the hierarchy as a replacement dictionary first.
Here we define a dictionary in which each key is a higher-level criterion that we use as the replacement for any of the lower-level values associated with it.

In [None]:
broad_anno_hierarchical_df = broad_anno_df.copy(deep = True)

hierarchy_dict = iaa_utilities.IAAv2SpreadsheetScheme.HIERARCHY_DICT

for higher_level in hierarchy_dict:
    for lower_level in hierarchy_dict[higher_level]:
        broad_anno_hierarchical_df['criterion_paraphrase'] = broad_anno_hierarchical_df['criterion_paraphrase'].str.replace(f".*{re.escape(lower_level)}.*", higher_level, case=False, regex=True)
broad_anno_hierarchical_df['criterion_paraphrase']
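
The replacement loop can be tried out on single strings as in the sketch below. The toy dictionary uses placeholder child names and is not the content of `HIERARCHY_DICT`; the helper name is also hypothetical.

```python
import re

# Placeholder hierarchy: the child names are invented for illustration only.
toy_hierarchy = {
    "correctness of outputs": ["hypothetical child a", "hypothetical child b"],
    "feature-type criteria": ["hypothetical child c"],
}


def to_top_level(label: str, hierarchy: dict) -> str:
    """Replace a label that mentions a lower-level name with its higher-level name."""
    for higher_level, lower_levels in hierarchy.items():
        for lower_level in lower_levels:
            label = re.sub(
                f".*{re.escape(lower_level)}.*", higher_level, label, flags=re.IGNORECASE
            )
    return label


print(to_top_level("Hypothetical child B (with annotator details)", toy_hierarchy))
print(to_top_level("quality of outputs", toy_hierarchy))
```

Because the replacements are applied one after another, the result can depend on iteration order if a higher-level name happens to contain a lower-level name as a substring.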

We can then repeat our calculations of $\alpha$ using the broad versions of the spreadsheet.

#### For the string-based calculations

In [None]:
broad_string_dict = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_string_df)
iaa_utilities.pretty_print_iaa_by_column(broad_string_dict)

#### For the hierarchical calculations

In [None]:
broad_hier_dict = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(broad_anno_hierarchical_df)
iaa_utilities.pretty_print_iaa_by_column(broad_hier_dict)

## Pairwise interannotator agreement

In [None]:

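from typing import Dict, Optional


def print_absolute_agreement(dataframe: pd.DataFrame, iaa_by_column_dict: Optional[Dict] = None) -> None:
    """Print, for each closed-class column, a matrix of mean pairwise Jaccard agreement between annotators."""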
    if iaa_by_column_dict is None:
        iaa_by_column_dict = iaa_utilities.IAAv2SpreadsheetScheme.run_closed_class_jaccard_and_masi(dataframe)
    for column in iaa_utilities.IAAv2SpreadsheetScheme.CLOSED_CLASS_COLUMNS:
        df = iaa_by_column_dict[column]['df']
        print(f"Interannotator agreement for {column}")
        annotator_list = dataframe.source_spreadsheet.unique()
        print(" \t" + "\t".join([str(annotator) for annotator in annotator_list]))
        for a1 in annotator_list:
            a1_vals = list(df[df.source_spreadsheet == a1][column])
            print(f"{a1}", end="\t")
            pairwise_agreements = []
            for a2 in annotator_list:
                a2_vals = list(df[df.source_spreadsheet == a2][column])
                agreement_sum = 0
                for a1_val, a2_val in zip(a1_vals, a2_vals):
                    agreement_sum += 1 - jaccard_distance(a1_val, a2_val)
                pairwise_agreements.append(agreement_sum/min(len(a1_vals), len(a2_vals)))
                print(f"{pairwise_agreements[-1]:.2f}", end="\t")
            # Row mean, excluding the annotator's comparison with themselves (which contributes 1).
            print(f"\t{(sum(pairwise_agreements) - 1)/(len(pairwise_agreements) - 1):.2f}")
        print()
        print()

print_absolute_agreement(annotation_df, iaa_by_column)

### Pair-wise agreement on the broad string-based annotations

In [None]:
print_absolute_agreement(broad_anno_string_df)

### Pair-wise agreement on the broad hierarchical annotations

In [None]:
print_absolute_agreement(broad_anno_hierarchical_df)

### Agreement tables for each column

In [None]:
for column in iaa_utilities.IAAv2SpreadsheetScheme.ALL_DATA_COLUMNS:
    display(column)
    display(extract_iaa_df_by_column_name(annotation_df, column).pivot(index="key", columns="source_spreadsheet", values=column))

In [None]:
for column in iaa_utilities.IAAv2SpreadsheetScheme.ALL_DATA_COLUMNS:
    display(column)
    display(extract_iaa_df_by_column_name(broad_anno_hierarchical_df, column).pivot(index="key", columns="source_spreadsheet", values=column))