# Lab 3.5 Inter-annotator agreement

This notebook shows how to calculate the inter-annotator agreement (IAA) for the annotation of conversations with Llama3 for Lab0. There are three well-known measures of IAA: Cohen's kappa, Fleiss' kappa and Krippendorff's alpha. All three correct the observed agreement for the agreement that annotators would reach by accident. Cohen and Fleiss compare the observed agreement against the agreement expected by chance; Krippendorff compares the observed disagreement against the disagreement expected by chance. Cohen's kappa can only compare two annotators on nominal data, Fleiss' kappa handles more than two annotators on nominal data, and Krippendorff's alpha handles any number of annotators and many different data types. As a rule of thumb, scores of 0.6 or higher are considered to be of sufficient quality. If scores are lower, the annotations should not be trusted.
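To make the chance correction concrete, Cohen's kappa can be written as kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and P_e the agreement expected by chance. The following is a minimal sketch in plain Python, not the NLTK implementation; the function name `cohen_kappa` and the toy labels are hypothetical and only serve as illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both annotators pick the same label,
    # given each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, not taken from the lab data.
toy_a = ["joy", "joy", "anger", "neutral", "joy", "anger"]
toy_b = ["joy", "neutral", "anger", "neutral", "joy", "joy"]
print(round(cohen_kappa(toy_a, toy_b), 3))
```

In this toy case the annotators agree on 4 of 6 items (P_o = 24/36), while their label distributions predict P_e = 13/36 by chance, so kappa = 11/23, which is clearly lower than the raw agreement of 0.67.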

We will use the **metrics** module from NLTK, which provides an **agreement.AnnotationTask** class that takes the data to be compared. We will take the data from the annotated Llama conversations that are stored as JSON files. We first import **json** and the NLTK **agreement** module.

In [4]:
import json
from nltk.metrics import agreement

In the data folder, you will find two annotations of the same conversation. The fake annotators are "Piek" and "Hennie". We read each file to load the conversation as a list of turns, once for each annotator.

In [17]:
# Assume JSON files with a conversation as a list of turns;
# each turn has an Annotator and a Gold label field.
# All conversations should have an equal number of turns.

# Use context managers so the files are closed after reading
with open("./data/iaa/annotator_Piek_human_Piek_chat_with_llama.json") as file1:
    conversation1 = json.load(file1)

with open("./data/iaa/annotator_Hennie_human_Piek_chat_with_llama.json") as file2:
    conversation2 = json.load(file2)

print(len(conversation1), len(conversation2))

35 35


By checking the length, we verify that the conversations have the same number of turns. The next function extracts a list of annotations from the data, each as a tuple consisting of the name of the annotator, the turn identifier and the label. We ignore the turns uttered by Llama, as these were annotated automatically as "neutral".

In [18]:
## This function takes as parameter a conversation in JSON format
## and returns a list of annotations, each in the form of a tuple:
## (annotator, turn id, label)
def get_chat_annotations(conversation):
    annotations = []
    for turn in conversation:
        speaker = turn['speaker']
        turn_id = turn['turn_id']
        # Skip Llama's turns: they were labelled "neutral" automatically
        if speaker != 'Llama':
            annotator = turn['Annotator']
            gold = turn['Gold']
            annotations.append((annotator, turn_id, gold))
    return annotations

In [20]:
anno1 = get_chat_annotations(conversation1)
anno2 = get_chat_annotations(conversation2)
print(len(anno1), len(anno2))
for a1, a2 in zip(anno1, anno2):
    print(a1[1], a2[1])

17 17
2 2
4 4
6 6
8 8
10 10
12 12
14 14
16 16
18 18
20 20
22 22
24 24
26 26
28 28
30 30
32 32
34 34


We extracted two lists of annotations of equal length. By zipping both lists and printing the second element of each tuple, we can check that they have the same turn ids. Our data appears to be compatible. Now we can concatenate the two lists and create an **AnnotationTask** from the combined data.

In [21]:
data = anno1+anno2
iaa = agreement.AnnotationTask(data=data)

print("Cohen's Kappa:", iaa.kappa())
print("Fleiss's Kappa:", iaa.multi_kappa())
print("Krippendorff's alpha:", iaa.alpha())

Cohen's Kappa: 0.47345132743362833
Fleiss's Kappa: 0.47345132743362833
Krippendorff's alpha: 0.48206278026905836


All three measures score below 0.6, so the agreement between these two annotators is insufficient. This means the annotations are not trustworthy and the annotation task needs to be reconsidered.
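When reconsidering the task, it can help to look at the turns on which the annotators disagree. The following is a minimal sketch, assuming annotations in the (annotator, turn id, label) tuple format used in this notebook; the function name `disagreements` and the toy data are hypothetical.

```python
from collections import Counter


def disagreements(annotations_a, annotations_b):
    """Return (turn_id, label_a, label_b) for every turn labelled differently."""
    labels_b = {turn_id: label for _, turn_id, label in annotations_b}
    return [
        (turn_id, label, labels_b[turn_id])
        for _, turn_id, label in annotations_a
        if turn_id in labels_b and labels_b[turn_id] != label
    ]


# Hypothetical toy annotations, not taken from the lab data.
toy_a = [("A", 2, "joy"), ("A", 4, "anger"), ("A", 6, "neutral")]
toy_b = [("B", 2, "joy"), ("B", 4, "sadness"), ("B", 6, "joy")]

diff = disagreements(toy_a, toy_b)
print(diff)
# Count which label pairs are confused most often.
print(Counter((label_a, label_b) for _, label_a, label_b in diff))
```

Label pairs that are confused frequently may point to categories that need a sharper definition in the annotation guidelines.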

You can do the same for more than two annotators by extracting more annotations and concatenating them in data: data = anno1 + anno2 + anno3 + anno4, etc.
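With more annotators it becomes easier to overlook a list that covers different turns. The sketch below concatenates any number of annotation lists after checking that they contain the same turn ids in the same order. The helper `combine_annotations` and the annotators "A", "B" and "C" are hypothetical; the tuple format (annotator, turn id, label) is the one used in this notebook.

```python
from itertools import chain


def combine_annotations(*annotation_lists):
    """Concatenate per-annotator lists after checking they cover the same turns."""
    if not annotation_lists:
        return []
    reference_ids = [turn_id for _, turn_id, _ in annotation_lists[0]]
    for annotations in annotation_lists[1:]:
        ids = [turn_id for _, turn_id, _ in annotations]
        if ids != reference_ids:
            raise ValueError("annotators did not label the same turns")
    return list(chain.from_iterable(annotation_lists))


# Hypothetical toy annotations from three annotators.
anno_a = [("A", 2, "joy"), ("A", 4, "anger")]
anno_b = [("B", 2, "joy"), ("B", 4, "neutral")]
anno_c = [("C", 2, "joy"), ("C", 4, "anger")]

data = combine_annotations(anno_a, anno_b, anno_c)
print(len(data))
```

The resulting list has the same shape as the data passed to **AnnotationTask** earlier in this notebook, so it can be used in the same way.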

If you have a large crowd of annotators, you could take the majority vote as the adjudicated label. Next, you can compare each annotator's labels against the adjudicated data set to validate the quality of that annotator.
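Majority voting and the comparison against the adjudicated labels could be sketched as follows. This is only an illustration under assumptions: the functions `majority_vote` and `accuracy_against` and the toy data are hypothetical, ties are left unresolved (None) rather than broken automatically, and plain accuracy is used as the quality score.

```python
from collections import Counter, defaultdict


def majority_vote(data):
    """Map each turn id to its most frequent label; ties resolve to None."""
    labels_per_turn = defaultdict(list)
    for _, turn_id, label in data:
        labels_per_turn[turn_id].append(label)
    adjudicated = {}
    for turn_id, labels in labels_per_turn.items():
        ranked = Counter(labels).most_common()
        # A tie for first place has no majority; leave it for manual adjudication.
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        adjudicated[turn_id] = None if tied else ranked[0][0]
    return adjudicated


def accuracy_against(adjudicated, annotations):
    """Fraction of an annotator's labels that match the adjudicated labels."""
    scored = [
        (turn_id, label)
        for _, turn_id, label in annotations
        if adjudicated.get(turn_id) is not None
    ]
    if not scored:
        return 0.0
    return sum(adjudicated[turn_id] == label for turn_id, label in scored) / len(scored)


# Hypothetical toy annotations from three annotators.
anno_a = [("A", 2, "joy"), ("A", 4, "anger"), ("A", 6, "neutral")]
anno_b = [("B", 2, "joy"), ("B", 4, "neutral"), ("B", 6, "neutral")]
anno_c = [("C", 2, "joy"), ("C", 4, "anger"), ("C", 6, "joy")]

gold = majority_vote(anno_a + anno_b + anno_c)
print(gold)
print(accuracy_against(gold, anno_b))
```

Note that each annotator's own vote contributes to the majority label, so with a small crowd this score is somewhat optimistic.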

### End of notebook