# Calculating Inter-Annotator Agreement

This notebook shows how to calculate inter-annotator (a.k.a. inter-rater, inter-assessor) agreement.


## Confirming the Environment Before Starting

In [None]:
# A reminder of how to check that you are working in the correct environment.
# conda info is a very useful tool for this.
# -e lists the known environments (the active one is marked with *)
# -a shows all available information
!conda info -a

## Load the Data

In [2]:
import pandas as pd

df = pd.read_csv('Annotation_Swiss_SMS - Annotations_03_05_2019.csv', header=None)

In [3]:
# The output should look like this.
# Use pd.read_csv? to look up the necessary parameters in case it does not look like below
df[:50].tail()

Unnamed: 0,0,1,2,3,4,5,6
45,"Dänn sägi mal 7 ni , Viertelab . Ok ? Sölli ...",APP,APP,APP,APP,APP,APP
46,15 nov ! Es konzert vo synthesis ! Chunsch au ...,APP,APP,NEWS,APP,APP,APP
47,Hi sockeloch ! Jaaa isch au schomal besser ga...,NC,NEWS,NEWS,NEWS,NEWS,NEWS
48,Sorry hans uf lutlos gha und im zimrliege lah...,NEWS,NEWS,NEWS,NEWS,NEWS,NEWS
49,Mini guet ! Tusigdank ! Sind gad gsi go marok...,NEWS,NEWS,NEWS,NEWS,NEWS,NEWS


In [5]:
# nunique(axis=1) gives the number of unique values per row.
# Because each row also contains the SMS text, 2 unique values per row
# (the text plus a single label) means complete agreement of all 6 student assessment groups.
(df[:51].nunique(axis=1) == 2)

0      True
1      True
2      True
3     False
4      True
5      True
6     False
7      True
8     False
9     False
10    False
11     True
12     True
13    False
14     True
15    False
16     True
17    False
18    False
19    False
20     True
21    False
22     True
23    False
24    False
25     True
26     True
27    False
28     True
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36    False
37    False
38    False
39     True
40    False
41     True
42     True
43     True
44    False
45     True
46    False
47    False
48     True
49     True
50    False
dtype: bool

In [6]:
# sum() of the above output gives the number of True values (True == 1, False == 0).
# Dividing it by the number of rows gives the share of rows with complete agreement.

simple_annotator_agreement_content_type = (df[:51].nunique(1) == 2).sum()/len(df[:51])
simple_annotator_agreement_content_type

0.45098039215686275

In [7]:
# And the same for the age annotations
simple_annotator_agreement_age = (df[51:].nunique(1) == 2).sum()/len(df[51:])
simple_annotator_agreement_age

0.631578947368421
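
The `== 2` test above only works because the SMS text counts as one of the unique values. A minimal sketch of an alternative, assuming the same layout as `df` (column 0 holds the text, the remaining columns hold the labels): drop the text column and test for exactly one unique label. The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy data with the same layout as df: column 0 is the SMS text,
# the remaining columns are annotator labels (rows are invented).
toy = pd.DataFrame([
    ["sms a", "APP", "APP", "APP"],
    ["sms b", "APP", "NEWS", "APP"],
    ["sms c", "NC", "NC", "NC"],
    ["sms d", "NEWS", "NEWS", "APP"],
])

labels = toy.drop(columns=0)                   # keep only the label columns
full_agreement = labels.nunique(axis=1) == 1   # one unique label == everyone agrees
agreement_rate = full_agreement.mean()         # share of True values

print(agreement_rate)
```

Note that `nunique` ignores missing values by default (`dropna=True`), so a row with a missing annotation can still count as full agreement unless it is handled separately.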

## `Discussion: Simple Annotator Agreement`

The tendency in the two calculated agreements fits well with the initial intuition most people had when performing the annotation.

Most people found it easier to annotate for age than for the content_type.
This can be the result of several factors:

* Lack of a good definition for the meaning of the classes in `content_type`
* Mismatch between the classes in `content_type` and the "reality" reflected by the SMS
* `simple_annotator_agreement_age` > `simple_annotator_agreement_content_type` is of course also a reflection of the first being a binary classification and the second a multinomial classification scheme with 3 classes
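
To make the last point concrete, here is a rough sketch of how often all annotators would agree purely by chance. It assumes annotators label independently and pick each class with a fixed probability; the uniform probabilities used below are an illustrative assumption, not estimates from the data.

```python
def chance_unanimity(class_probs, n_annotators):
    """Probability that all annotators pick the same class by chance,
    assuming independent annotators with identical class probabilities."""
    return sum(p ** n_annotators for p in class_probs)

# Six annotators, uniform guessing (assumption for illustration only)
binary = chance_unanimity([1 / 2, 1 / 2], 6)           # 2 * (1/2)**6 = 1/32
three_way = chance_unanimity([1 / 3, 1 / 3, 1 / 3], 6)  # 3 * (1/3)**6 = 1/243

print(f"2 classes: {binary:.4f}")
print(f"3 classes: {three_way:.4f}")
```

Under these assumptions chance unanimity is several times more likely with two classes than with three, which is one reason raw agreement rates are not directly comparable across annotation schemes.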

The important question to ask with regard to the calculated agreement is how to react to these observations.

Generally it can be said that any annotator agreement below 0.5 should make us reconsider the annotation setup:

1. Clear Instructions for Annotators?
2. Annotator Fit for Task?
     * Do they have the required knowledge?
     * Do they have the required motivation?
3. Defined Classes Make Sense?





## Encoding Example

In [10]:
# There are some handy tools for transforming data in the preprocessing
# package of scikit-learn.

from sklearn.preprocessing import LabelEncoder

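# Rows from 51 onward hold the age annotations; columns 1-6 are the six annotator groups.
df_age = df[51:]
# Note: apply() refits the encoder on each column separately.
df_age_numeric = df_age[[1,2,3,4,5,6]].apply(LabelEncoder().fit_transform)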

In [12]:
df_age_numeric.tail()

Unnamed: 0,1,2,3,4,5,6
65,0,0,0,0,0,0
66,0,0,1,0,0,0
67,0,0,0,0,0,0
68,0,0,0,0,0,0
69,1,1,1,0,0,0
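
Because `apply` fits the encoder on each column separately, a column that happens to contain only one of the classes would get different integer codes than the other columns (`LabelEncoder` numbers the sorted classes it sees in that column). A minimal sketch of a shared mapping that avoids this is shown below; the labels `"old"` and `"young"` are hypothetical placeholders, since the actual age labels are not shown here.

```python
import pandas as pd

# Hypothetical labels; column 2 contains only one class.
toy = pd.DataFrame({
    1: ["young", "old", "old"],
    2: ["young", "young", "young"],
})

# Build one mapping from all labels across all columns ...
classes = sorted(set(toy.to_numpy().ravel()))
mapping = {label: code for code, label in enumerate(classes)}

# ... and apply the same mapping to every column.
encoded = toy.apply(lambda col: col.map(mapping))

print(mapping)
print(encoded)
```

With a per-column fit, `"young"` in column 2 would be encoded as 0, while in column 1 it would be 1; the shared mapping keeps the codes comparable across annotators.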


## Installing `disagree` via pip

In [None]:
!conda install pip -y

In [None]:
# One has to be careful when installing pip packages into a conda environment.
# conda package management and pip are not aware of each other.
# This can lead to situations where pip updates packages that
# are required by packages under conda's control.
# To guard against this we could use pip freeze to save the current package versions.


In [None]:
!which pip

In [None]:
!pip install disagree

In [13]:
from disagree import metrics 

mets = metrics.Metrics(df_age_numeric, [0,1])
fleiss = mets.fleiss_kappa()
print("Fleiss kappa: {:.2f}".format(fleiss))


Fleiss kappa: -0.92
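
As a sanity check it can help to compute Fleiss' kappa by hand. When every item is rated by the same number of annotators n, the statistic cannot fall below -1/(n-1), which is -0.2 for six annotators, so the -0.92 printed above deserves a second look. The following is a minimal pure-Python sketch of the standard formula, not the `disagree` implementation; the function name and the toy ratings are made up for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels
    given by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(row) for row in ratings]
    categories = set().union(*counts)

    # Observed agreement: mean share of agreeing annotator pairs per item
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions
    total = n_items * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    # Undefined (division by zero) if only one category is ever used
    return (p_bar - p_e) / (1 - p_e)

toy_ratings = [[0, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1]]
toy_kappa = fleiss_kappa(toy_ratings)
print(f"Fleiss kappa (toy data): {toy_kappa:.2f}")
```

Assuming `df_age_numeric` as defined earlier, it could be passed in as `fleiss_kappa(df_age_numeric.values.tolist())` to compare against the library result.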


In [14]:
cohens = mets.cohens_kappa(ann1=1, ann2=2)
print("Cohens kappa: {:.2f}".format(cohens))

Cohens kappa: 0.51
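
Cohen's kappa compares exactly two annotators and corrects their observed agreement for the agreement expected from each annotator's own label frequencies. A minimal pure-Python sketch of the standard formula follows, as a cross-check rather than a description of what `disagree` does internally; the function name and toy labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement from each annotator's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)

    # Undefined (division by zero) if both annotators only ever use one identical label
    return (p_o - p_e) / (1 - p_e)

toy_a = [0, 0, 1, 1]
toy_b = [0, 0, 1, 0]
toy_cohen = cohens_kappa(toy_a, toy_b)
print(f"Cohen's kappa (toy data): {toy_cohen:.2f}")
```

Assuming `df_age_numeric` as defined earlier, two annotator columns could be compared with `cohens_kappa(df_age_numeric[1].tolist(), df_age_numeric[2].tolist())`.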


In [15]:
from disagree import BiDisagreements

bidis = BiDisagreements(df_age_numeric, [0,1])


bidis.agreements_summary()



Number of instances with:
No disagreement: 12
Bidisagreement: 7
Tridisagreement: 0
More disagreements: 0


(12, 7, 0, 0)
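
The summary is consistent with the simple agreement computed earlier: 12 of 19 instances without disagreement is 12/19 ≈ 0.63, the value of `simple_annotator_agreement_age`. The sketch below reproduces this kind of count by hand, under the assumption that a bidisagreement means exactly two distinct labels on an instance and a tridisagreement exactly three (this reading is inferred from the output, not checked against the library source). The function name and toy rows are made up for illustration.

```python
from collections import Counter

def disagreement_summary(rows):
    """Count rows by the number of distinct labels they contain."""
    distinct = Counter(len(set(row)) for row in rows)
    return {
        "no_disagreement": distinct[1],
        "bidisagreement": distinct[2],
        "tridisagreement": distinct[3],
        "more": sum(v for k, v in distinct.items() if k > 3),
    }

toy_rows = [
    [0, 0, 0, 0],   # everyone agrees
    [0, 1, 0, 0],   # two distinct labels
    [1, 1, 1, 1],   # everyone agrees
    [0, 1, 2, 0],   # three distinct labels
]
summary = disagreement_summary(toy_rows)
print(summary)
```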

## `Exercise 1: Calculate Fleiss' and Cohen's Kappa for content_type`

Calculate Fleiss' and Cohen's kappa for `content_type` using the `disagree` methods shown in the cells above.