# Calculating Inter-Annotator Agreement

This notebook focuses on calculating inter-annotator (a.k.a. inter-rater, inter-assessor) agreement, i.e. the degree of agreement between multiple human assessors.




## 0. Confirming the Environment Before Start

In [None]:
# Just a reminder of how to make sure you operate in the correct environment.
# conda info is a very useful tool.
# -e shows us the current environments
# -a shows us all available information
!conda info -a

## 1. Provisioning the Data

### 1.1 Load the Data

We will use the annotated Swiss SMS set as the basis for calculating the agreements.
The SMS texts were annotated by students with the following classes:
* Content_Type (what kind of message was sent)
    * Appointment [APP]
    * News [NEWS]
    * No Content [NC]
* Age (if the author of the text message was rather young or old)
    * young [JUNG]
    * old [ALT]

In order to do some calculations on top of these assessments we will have to
* Load the CSV (created via an export from Google Sheets) file into a dataframe
* Replace the String labels (the annotations) with numeric values (this is called `encoding`)

In [None]:
# Loading the CSV as a dataframe

import pandas as pd

df = pd.read_csv('Annotation_Swiss_SMS - Annotations_03_05_2019.csv', header=None)

### 1.2 Exercise: Explore the Data

Let's get some orientation in the dataframe.
Use the commands

* shape
* head()
* tail()

in order to see what we have loaded.

In [None]:
# Explore the dataframe with shape, head(), tail() commands
df

## 2. Calculating the Agreement for Age

In order to calculate the agreement for the class `age` we have to complete the following steps:

1. Create a df with the age rows 
2. Apply the analysis of unique values
3. Calculate how often the annotators agreed (unique values = 1)

### 2.1 Create a Dataframe with age rows

In [None]:
# a) Create a df with the age rows only 

df_age_annotations = df.loc[50:]
df_age_annotations.head(5)

In [None]:
# b) Reset the index to start with 0 (this is optional at this point) and select only the columns with the labels

# Assigning the result avoids modifying a slice of df in place
df_age_annotations = df_age_annotations.reset_index(drop=True)
df_age_annotations.loc[:,[1,2,3,4,5,6]].head(10)


#### Pandas Dataframe Selection with loc()

In the previous cell we used the `loc` indexer to select specific rows and columns from the dataframe. 

The indexer expects us to define rows and columns by their labels. Depending on the selection, pandas may return either a view on the original dataframe or a copy, so we should not rely on changes to the selection being written back to the original. 

In the above example `df_age_annotations.loc[:,[1,2,3,4,5,6]].head(10)`:
* `[:` selects all rows of the `df_age_annotations` dataframe. 
    * `[5:` would select the rows from label `5` until the end of the dataframe
    * `[:5` would select the rows `0, 1, 2, 3, 4, 5` of the dataframe (with `loc` the end label is included, unlike in regular Python slicing)
* `[1,2,3,4,5,6]]` selects the columns with the labels `1`, `2`, `3`, `4`, `5`, `6`. `loc` always interprets values as labels (as opposed to interpreting the `2` as the third column from the left or something like this)

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html for the official documentation. 
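
To make the difference between label-based and position-based selection concrete, here is a minimal sketch on a made-up toy dataframe (the values are illustrative and not taken from the SMS data):

```python
import pandas as pd

# Toy dataframe with integer column labels, similar in shape to the annotation table
toy = pd.DataFrame(
    {0: ["sms a", "sms b", "sms c", "sms d"],
     1: ["JUNG", "ALT", "JUNG", "ALT"],
     2: ["JUNG", "ALT", "ALT", "ALT"]}
)

# All rows, only the columns labelled 1 and 2
labels_only = toy.loc[:, [1, 2]]

# Label slices include the end label: rows 0, 1 and 2
first_rows = toy.loc[:2, [1, 2]]

# Position-based slicing with iloc excludes the end: rows 0 and 1
first_rows_by_position = toy.iloc[:2, [1, 2]]

print(labels_only.shape)            # (4, 2)
print(len(first_rows))              # 3
print(len(first_rows_by_position))  # 2
```

Because the toy index happens to run from 0 upwards, labels and positions coincide here; after slicing a dataframe without resetting the index they no longer do.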

### 2.2 Analysing the Unique Values

One way to look for disagreement in our annotations is to look at the number of unique values in a row.

`60 	HDMFG , pfus guet und danke vil vil mal H... 	JUNG 	JUNG 	JUNG 	JUNG 	JUNG 	JUNG`

If all annotators agree, as is the case in row `60` the number of unique values in columns 1-6 is `1` ("JUNG").
If the number of unique values is `> 1`, then this indicates disagreement between the annotators.

We will use this approach for our first calculation of the inter-annotator agreement. 
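
As a small illustration of the idea, the following sketch counts the unique labels per row on made-up annotations from three annotators (the data is hypothetical):

```python
import pandas as pd

# Made-up annotations from three annotators for three messages
toy = pd.DataFrame(
    {1: ["JUNG", "JUNG", "ALT"],
     2: ["JUNG", "ALT", "ALT"],
     3: ["JUNG", "JUNG", "ALT"]}
)

# Number of distinct labels per row (axis=1 means "across the columns")
unique_per_row = toy.nunique(axis=1)

# True where all annotators chose the same label
full_agreement = unique_per_row == 1

print(unique_per_row.tolist())  # [1, 2, 1]
print(full_agreement.tolist())  # [True, False, True]
```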


In [None]:
# nunique() (number of unique) allows us to identify the number of unique values per row.

# this import is just for displaying the output as an HTML table
from IPython.display import HTML

# we make our selection on the dataframe and use nunique() with axis=1 (unique values per row).
# calling nunique(axis=0) or just nunique() instead would give us the unique values per column.
unique_values = df_age_annotations.loc[:,[1,2,3,4,5,6]].nunique(axis=1)


# the below line just formats the output of unique_values series to html, and in addition puts out
# the corresponding rows of the dataframe
HTML(unique_values.head(5).to_frame().to_html() + df_age_annotations.loc[:4,[1,2,3,4,5,6]].to_html())

### 2.3 Calculate the Agreement

Now that we have a way to identify all rows where the annotators agreed we can easily calculate the annotator agreement with:

$
 \frac{\text{Number of Samples with Agreement}}{\text{Number of Samples}} 
$

This then translates to the cell below, where:

* `(df_age_annotations.loc[:,[1,2,3,4,5,6]].nunique(1) == 1).sum()` is the number of samples where annotators agreed.
* `len(df_age_annotations)` gives us the number of samples

Based on this calculation we see that the agreement is `0.6`. In 60% of the rows (samples) all 6 assessors agreed completely in their judgement of `JUNG/ALT`.


In [None]:
simple_annotator_agreement_age = (df_age_annotations.loc[:,[1,2,3,4,5,6]].nunique(1) == 1).sum()/len(df_age_annotations)
simple_annotator_agreement_age
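
The same calculation can be wrapped in a small helper. This is only a sketch on made-up data; the function name `simple_agreement` is a hypothetical choice and not part of the notebook:

```python
import pandas as pd

def simple_agreement(annotations: pd.DataFrame) -> float:
    """Fraction of rows in which every annotator chose the same label."""
    all_agree = annotations.nunique(axis=1) == 1
    # The mean of a boolean Series is the share of True values,
    # which equals all_agree.sum() / len(annotations)
    return float(all_agree.mean())

# Made-up labels: 5 messages, 3 annotators
toy = pd.DataFrame(
    {1: ["JUNG", "JUNG", "ALT", "ALT", "JUNG"],
     2: ["JUNG", "ALT", "ALT", "ALT", "JUNG"],
     3: ["JUNG", "JUNG", "ALT", "JUNG", "JUNG"]}
)

print(simple_agreement(toy))  # 3 of 5 rows agree -> 0.6
```

Such a helper could be reused for the `content_type` rows once they have been selected into their own dataframe.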

#### Exercise: Calculate the Agreement for content_type

Calculate the agreement for the assessments of class `content_type` analogous to what we have done above for `age`.

## 3. Discussion: Simple Annotator Agreement

The tendency in the two calculated agreements fits well with the initial intuition most people had when performing the annotation. 

Most people found it easier to annotate for age than for the content_type.
This can be the result of several factors:

* Lack of a good definition for the meaning of the classes in `content_type`
* Mismatch between the classes in `content_type` and the "reality" reflected by the SMS
* `simple_annotator_agreement_age > simple_annotator_agreement_content_type` is of course also a reflection of the first being a binary classification and the second a multinomial classification scheme with 3 classes
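
To get a feeling for how much the number of classes alone matters, the sketch below computes the probability that all annotators agree purely by chance. It assumes that every annotator picks a label independently and uniformly at random, which is a deliberate simplification and not a model of the real annotators:

```python
from fractions import Fraction

def chance_of_full_agreement(num_classes: int, num_annotators: int) -> Fraction:
    """Probability that all annotators pick the same label purely by chance,
    assuming independent, uniformly random choices (a simplifying assumption)."""
    p_single_label = Fraction(1, num_classes) ** num_annotators
    return num_classes * p_single_label

binary = chance_of_full_agreement(num_classes=2, num_annotators=6)
three_way = chance_of_full_agreement(num_classes=3, num_annotators=6)

print(binary, float(binary))        # 1/32 0.03125
print(three_way, float(three_way))  # 1/243, roughly 0.0041
```

Under these assumptions, complete agreement by chance is several times more likely with two classes than with three, so part of any gap between the two agreements can come from the class scheme itself.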

The important question to ask with regard to the calculated agreement is how to react to these observations.

Generally it can be said that any annotator agreement below 0.5 should make us reconsider the annotation setup:

1. Clear Instructions for Annotators?
2. Annotator Fit for Task?
     * Do they have the required knowledge?
     * Do they have the required motivation?
3. Defined Classes Make Sense?





## 4. Calculating the Agreement Based on Statistical Measures

### 4.1 Encoding Categorical Data


When we are working with annotated data, we often encounter categorical data.


<img src="./images/data_types.png" alt="Drawing" style="width: 600px;"/>

#### Categorical Data

Categorical data as shown in the graphic above, refers to data that represents categories in the widest sense.
In supervised ML we will often encounter categorical data such as:

* relevance: "relevant/ not relevant"
* class: "spam/not spam"
* contract type: "lease/mortgage/..."

or as in our case with the SMS:

* content_type: "NC/APP/NEWS"
* age: "JUNG/ALT"

When data is annotated this is often done with the categorical labels.
If we want to make computations based on these labels then it is often advantageous to map those categories to numeric values (e.g. removes necessity for String handling, smaller size of integer type). This process is called encoding. 

#### Encoding

Encoding is a straightforward process. 
If we have labels in the form of Strings that we want to map to Ints, we simply create a mapping between the labels and integer numbers.
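
Such a mapping can be sketched by hand with a plain dictionary. The concrete numbers assigned below are a hypothetical choice for illustration; any consistent assignment would work:

```python
import pandas as pd

# Hypothetical mapping chosen for illustration; any consistent assignment works
label_to_int = {"ALT": 0, "JUNG": 1}
int_to_label = {number: label for label, number in label_to_int.items()}

labels = pd.Series(["JUNG", "ALT", "JUNG", "JUNG"])

encoded = labels.map(label_to_int)   # strings -> integers
decoded = encoded.map(int_to_label)  # integers -> strings

print(encoded.tolist())  # [1, 0, 1, 1]
print(decoded.tolist())  # ['JUNG', 'ALT', 'JUNG', 'JUNG']
```

Keeping the inverse mapping around makes it possible to translate numeric results back into the original labels.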

Sklearn supports this with the `LabelEncoder` as shown in the cell below.



In [None]:
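# There are some handy tools for transforming data in the preprocessing
# package of scikit-learn.

from sklearn.preprocessing import LabelEncoder

df_age = df[50:]
encoder = LabelEncoder()
# apply() calls fit_transform once per column, so the encoder is refitted each time
# and afterwards holds the mapping learned from the last column.
df_age_numeric = df_age[[1,2,3,4,5,6]].apply(encoder.fit_transform)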

In [None]:
df_age_numeric

In [None]:
# we can use the encoder to map back to the original values

encoder.inverse_transform([0, 1])

### 4.2 Statistical Tools to Calculate Inter-Annotator Agreement

Statistical tooling is often applied to calculate the level of agreement between annotators.
The reason to bring in statistical tooling is the following:

"We would like to consider how likely it is, that n people agree on their assessments."

In a nutshell these tools take the distribution of the labels and the number of assessors into account when calculating the agreement.
This can be useful in cases where the distribution of labels is extremely skewed towards some labels (e.g. 2 out of 10 labels that make up 90% of the annotations).

Therein lies the advantage of these tools. Their disadvantage is that it is harder to interpret the results. 
In our simple calculation it is completely clear how to interpret the result.

### 4.3 Installing `disagree` via pip

We will use a library called "disagree" for the calculation of the agreement. 

The library is only available via pip and not via conda channels.

In [None]:
# One has to be careful with installing pip packages into a conda environment.
# conda package management and pip will not be aware of each other.
# This can lead to situations where pip updates packages that are required
# by packages under conda's control.
# To reduce this risk we could use `pip freeze` to record the currently installed versions
# and pass that list to pip as a constraints file, so that pip sticks with them.


In [None]:
# As a first step for using pip when operating in a conda environment, we have to make sure we use the correct pip.
# This can be done with the command below on Linux or macOS.
# The path that is printed has to be inside the correct environment.
!which pip

In [None]:
!pip install disagree

### 4.4 Calculating Agreement

#### Agreement Between Two Annotators - Cohen's Kappa

Cohen's Kappa is probably the most commonly used approach to calculate the agreement between two annotators (exactly two, and not more).
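
Cohen's Kappa compares the observed agreement with the agreement expected by chance, given each annotator's own label frequencies. The following plain-Python sketch on made-up, heavily skewed labels shows why this matters; it is meant as an illustration of the formula, not as a replacement for the library implementations used in the cell further down:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both annotators pick the same label,
    # given each annotator's own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Note: undefined (division by zero) if both annotators only ever use one label
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up, heavily skewed labels: both annotators choose "JUNG" 9 times out of 10
ann_1 = ["JUNG"] * 9 + ["ALT"]
ann_2 = ["ALT"] + ["JUNG"] * 9

raw_agreement = sum(a == b for a, b in zip(ann_1, ann_2)) / len(ann_1)
kappa = cohens_kappa(ann_1, ann_2)

print(raw_agreement)    # 0.8
print(round(kappa, 3))  # -0.111
```

Although the two toy annotators agree on 80% of the messages, the expected chance agreement is 82%, so the kappa value ends up slightly below zero.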

In [None]:
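# Cohen's Kappa
# Note: `mets` is the disagree Metrics object created in the Fleiss Kappa cell below;
# run that cell (with the indices fixed) before this one.
cohens = mets.cohens_kappa(ann1=1, ann2=2)
print("Cohen's kappa: {:.2f}".format(cohens))

from sklearn.metrics import cohen_kappa_score
# cohen_kappa_score expects two 1-D label sequences, so we pass single columns (Series)
cohens_sklearn = cohen_kappa_score(df_age_numeric[1], df_age_numeric[2])
print("Cohen's kappa from sklearn: {:.2f}".format(cohens_sklearn))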

#### Agreement Between N Annotators - Fleiss' Kappa

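Fleiss' Kappa carries the idea of a chance-corrected agreement over to a fixed number of `N` annotators. It is often described as a generalisation of Cohen's Kappa, although strictly speaking it generalises the closely related Scott's pi.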
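
For intuition, the statistic can be computed by hand: per message we take the share of annotator pairs that agree, average this over all messages, and correct it by the agreement expected from the overall label shares. The sketch below does this in plain Python on made-up labels; the function name and the data are hypothetical, and it is not the implementation used by `disagree`:

```python
from collections import Counter

def fleiss_kappa(rows):
    """Fleiss' kappa; `rows` holds one list of labels per item,
    each with the same number of annotators."""
    num_items = len(rows)
    num_raters = len(rows[0])
    counts = [Counter(row) for row in rows]

    # Per-item agreement: share of annotator pairs that agree on the item
    per_item = [
        (sum(c * c for c in item.values()) - num_raters)
        / (num_raters * (num_raters - 1))
        for item in counts
    ]
    p_observed = sum(per_item) / num_items

    # Expected agreement from the overall share of each label
    totals = Counter()
    for item in counts:
        totals.update(item)
    p_expected = sum(
        (total / (num_items * num_raters)) ** 2 for total in totals.values()
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up labels: 4 messages, 3 annotators each
toy_rows = [
    ["JUNG", "JUNG", "JUNG"],
    ["JUNG", "JUNG", "ALT"],
    ["ALT", "ALT", "ALT"],
    ["ALT", "JUNG", "ALT"],
]
print(round(fleiss_kappa(toy_rows), 3))  # 0.333
```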

In [None]:
# Note: This cell will give an error if you execute it as shown below.
# To fix the error you will have to reset the indices of the dataframe to start at 0.
from disagree import metrics

mets = metrics.Metrics(df_age_numeric)
fleiss = mets.fleiss_kappa()
print("Fleiss kappa: {:.2f}".format(fleiss))


In [None]:
# Note: This cell will only work once you have fixed the indices.
from disagree.agreements import BiDisagreements

bidis = BiDisagreements(df_age_numeric)
bidis.agreements_summary()



## Exercise 1: Calculate Fleiss' and Cohen's Kappa for content_type

Calculate Fleiss' and Cohen's Kappa for `content_type` by using the `disagree` methods as shown in the cells above. 