# Lab 1.5 Inter-annotator agreement

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Data for supervised machine learning is only as good as its annotation. If the annotations are not correct or consistent, then the data are no good for training and testing.

This notebook shows how you can calculate the inter-annotator agreement (IAA) for the annotation of conversations with Llama3 for Lab0. There are three well-known methods to calculate the IAA: Cohen's kappa, Fleiss' kappa and Krippendorff's alpha. All three measures correct the agreement for the chance that the annotators agree by accident. Cohen and Fleiss compare the observed agreement against the agreement expected by chance, while Krippendorff compares the observed disagreement against the disagreement expected by chance. Cohen's kappa can only compare two annotators on nominal data, Fleiss' kappa handles more than two annotators on nominal data, and Krippendorff's alpha handles more than two annotators and many different data types. As a rule of thumb, scores of 0.6 or higher are considered to be of sufficient quality. If scores are lower, the annotations cannot be trusted.
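As a reminder of what these measures compute, the standard definitions can be written as follows. The symbols are introduced here for illustration only and are not used elsewhere in this notebook: $A_o$ is the observed agreement, $A_e$ the agreement expected by chance, and $D_o$ and $D_e$ the observed and expected disagreement.

$$
\kappa = \frac{A_o - A_e}{1 - A_e}, \qquad \alpha = 1 - \frac{D_o}{D_e}
$$

In both cases a score of 1 means perfect agreement and a score of 0 means agreement no better than chance.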

We will use the **metrics** module from NLTK, which provides the **agreement.AnnotationTask** class to hold the data that needs to be compared. We will take the data from the annotated Llama conversations, which are stored as JSON files. We first import **json** and the NLTK **agreement** module.

In [1]:
import json
from nltk.metrics import agreement

In the ```iaa``` folder, you will find two annotations of the same conversation. The fake annotators are "Piek" and "Hennie". We read each file into a JSON object that holds the original conversation together with one annotator's labels.

In [3]:
# Assume JSON files with a conversation as a list of turns;
# each turn has an Annotator and a Gold label field.
# All conversations should have an equal number of turns.

# Use context managers so that the files are closed after reading
with open("./iaa/annotator_Piek_human_Piek_chat_with_llama.json") as file1:
    conversation1 = json.load(file1)

with open("./iaa/annotator_Hennie_human_Piek_chat_with_llama.json") as file2:
    conversation2 = json.load(file2)

print(len(conversation1), len(conversation2))

35 35


By checking the length, we verify that the conversations have the same number of turns. The next function extracts a list of annotations from the data. Each annotation is a tuple that consists of the name of the annotator, the turn identifier and the label. We ignore the turns uttered by an LLM, as these were annotated automatically as "neutral". In this example, the LLM is ```Llama```.

In [6]:
# This function takes a conversation in JSON format as its parameter
# and returns a list of annotations, each a tuple of the form
# (annotator, turn id, label). Turns uttered by the LLM are skipped.
def get_chat_annotations(conversation):
    annotations = []
    for turn in conversation:
        speaker = turn['speaker']
        turn_id = turn['turn_id']
        if speaker != 'Llama':
            annotator = turn['Annotator']
            gold = turn['Gold']
            annotations.append((annotator, turn_id, gold))
    return annotations

In [7]:
anno1 = get_chat_annotations(conversation1)
anno2 = get_chat_annotations(conversation2)
print(len(anno1), len(anno2))
for a1, a2 in zip(anno1, anno2):
    print(a1, a2)

17 17
('Piek', 2, 'neutral') ('Hennie', 2, 'neutral')
('Piek', 4, 'sadness') ('Hennie', 4, 'joy')
('Piek', 6, 'sadness') ('Hennie', 6, 'fear')
('Piek', 8, 'joy') ('Hennie', 8, 'neutral')
('Piek', 10, 'joy') ('Hennie', 10, 'neutral')
('Piek', 12, 'neutral') ('Hennie', 12, 'joy')
('Piek', 14, 'sadness') ('Hennie', 14, 'neutral')
('Piek', 16, 'sadness') ('Hennie', 16, 'sadness')
('Piek', 18, 'neutral') ('Hennie', 18, 'surprise')
('Piek', 20, 'fear') ('Hennie', 20, 'fear')
('Piek', 22, 'joy') ('Hennie', 22, 'joy')
('Piek', 24, 'joy') ('Hennie', 24, 'joy')
('Piek', 26, 'fear') ('Hennie', 26, 'fear')
('Piek', 28, 'disgust') ('Hennie', 28, 'disgust')
('Piek', 30, 'fear') ('Hennie', 30, 'fear')
('Piek', 32, 'neutral') ('Hennie', 32, 'neutral')
('Piek', 34, 'neutral') ('Hennie', 34, 'neutral')


We extracted two lists of annotations of equal length. By zipping both lists and comparing the second element of the tuples in each pair, we can check whether they have the same turn ids. Our data seems to be compatible. Now we can concatenate the two lists and create an **AnnotationTask** with the combined data.
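Before combining the lists, the turn-id check described above can also be done programmatically instead of by eye. The following is a minimal sketch; the helper names `find_misaligned` and `is_aligned` and the sample data are hypothetical and only mirror the `(annotator, turn_id, label)` format used in this notebook.

```python
# Hypothetical sample in the same (annotator, turn_id, label) format as above
sample1 = [("Piek", 2, "neutral"), ("Piek", 4, "sadness"), ("Piek", 6, "sadness")]
sample2 = [("Hennie", 2, "neutral"), ("Hennie", 4, "joy"), ("Hennie", 6, "fear")]


def find_misaligned(annotations_a, annotations_b):
    """Return the positions at which the two lists refer to different turn ids."""
    return [
        position
        for position, (a, b) in enumerate(zip(annotations_a, annotations_b))
        if a[1] != b[1]
    ]


def is_aligned(annotations_a, annotations_b):
    """True if both lists have the same length and the same turn ids in order."""
    return (len(annotations_a) == len(annotations_b)
            and not find_misaligned(annotations_a, annotations_b))


print(is_aligned(sample1, sample2))
```

With the real data, `is_aligned(anno1, anno2)` would play the same role as the visual inspection of the printed pairs.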

In [8]:
data = anno1+anno2
iaa = agreement.AnnotationTask(data=data)

print("Cohen's kappa:", iaa.kappa())
print("Fleiss' kappa:", iaa.multi_kappa())
print("Krippendorff's alpha:", iaa.alpha())

Cohen's kappa: 0.47345132743362833
Fleiss' kappa: 0.47345132743362833
Krippendorff's alpha: 0.48206278026905836


All three measures score below 0.6, so these two annotators disagree considerably. This means the annotation is not trustworthy and the annotation task needs to be reconsidered.
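To see where these numbers come from, the scores can be recomputed by hand from the label pairs printed above. This is a sketch using only the standard library, not the NLTK implementation; the function names `cohen_kappa` and `nominal_alpha` are hypothetical, and `nominal_alpha` assumes exactly two annotators, nominal labels and no missing values.

```python
from collections import Counter

# Labels copied from the printed annotation pairs above, in turn order
piek = ["neutral", "sadness", "sadness", "joy", "joy", "neutral", "sadness",
        "sadness", "neutral", "fear", "joy", "joy", "fear", "disgust",
        "fear", "neutral", "neutral"]
hennie = ["neutral", "joy", "fear", "neutral", "neutral", "joy", "neutral",
          "sadness", "surprise", "fear", "joy", "joy", "fear", "disgust",
          "fear", "neutral", "neutral"]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def nominal_alpha(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing values."""
    n_items = len(labels_a)
    observed_disagreement = sum(a != b for a, b in zip(labels_a, labels_b)) / n_items
    # Expected disagreement is estimated from the pooled labels of both annotators
    pooled = Counter(labels_a) + Counter(labels_b)
    total = 2 * n_items
    same_label_pairs = sum(count * (count - 1) for count in pooled.values())
    expected_disagreement = 1 - same_label_pairs / (total * (total - 1))
    return 1 - observed_disagreement / expected_disagreement


print(round(cohen_kappa(piek, hennie), 4))    # 0.4735
print(round(nominal_alpha(piek, hennie), 4))  # 0.4821
```

The annotators agree on 10 of the 17 turns (about 0.59), but roughly 0.22 agreement would be expected by chance, which is why the chance-corrected scores end up well below the raw agreement.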

You can do the same for more than two annotators by extracting more annotations and concatenating them in data: ```data = anno1 + anno2 + anno3 + anno4```, etc.

If you have a large crowd of annotators, you could take the majority vote as the adjudicated label. Next, you can compare each annotator's labels against the adjudicated data set to validate the quality of that annotator.
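The majority-vote idea can be sketched as follows. The `crowd` data, the annotator names and the helper functions are hypothetical and only illustrate the procedure; ties are resolved in favour of the label that was seen first, which is one possible choice among several.

```python
from collections import Counter

# Hypothetical crowd annotations: item id -> {annotator: label}
crowd = {
    "t1": {"ann_a": "joy", "ann_b": "joy", "ann_c": "neutral"},
    "t2": {"ann_a": "fear", "ann_b": "fear", "ann_c": "fear"},
    "t3": {"ann_a": "sadness", "ann_b": "neutral", "ann_c": "neutral"},
}


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Adjudicated label per item, taken as the majority vote
adjudicated = {item: majority_vote(list(votes.values()))
               for item, votes in crowd.items()}


def accuracy_against_adjudicated(annotator):
    """Fraction of an annotator's items that match the adjudicated label."""
    items = [item for item, votes in crowd.items() if annotator in votes]
    hits = sum(crowd[item][annotator] == adjudicated[item] for item in items)
    return hits / len(items)


print(adjudicated)
for annotator in ("ann_a", "ann_b", "ann_c"):
    print(annotator, accuracy_against_adjudicated(annotator))
```

Note that each annotator's own vote contributes to the majority, so this simple accuracy is somewhat optimistic. A chance-corrected score between the annotator and the adjudicated labels would be a stricter check.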

### IAA for student annotations

In [2]:
import pandas as pd

annotation_file = "/Users/piek/Desktop/t-MA-HLT-introduction-2025/ma-hlt-labs/lab1.chat/annotation/chats_2025/adjudicated_annotations.json"

df = pd.read_json(annotation_file)
df.head()

| | utterance | speaker | turn_id | Gold | Annotator | Votes | Annotators | Adjudication |
|---|---|---|---|---|---|---|---|---|
| 0 | Can you give me ideas on how to deal with some... | Bill | 4 | neutral | Rico | [neutral, neutral, sadness, sadness, neutral, ... | [Rico, Ella, Vera Langeberg, Giulia, zia, Thom... | sadness |
| 1 | I am so angry. I am blind with rage. I want to... | Bill | 6 | anger | Rico | [anger, anger, anger, anger, anger, anger] | [Rico, Ella, Vera Langeberg, Giulia, zia, Thom... | anger |
| 2 | But at the same time I am happy that it is ove... | Bill | 8 | joy | Rico | [joy, joy, anger, anger, joy, joy] | [Rico, Ella, Vera Langeberg, Giulia, zia, Thom... | joy |
| 3 | When I found out I was so surprised. I saw no ... | Bill | 10 | surprise | Rico | [surprise, surprise, surprise, surprise, surpr... | [Rico, Ella, Vera Langeberg, Giulia, zia, Thom... | surprise |
| 4 | Please do give me some steps to get over this | Bill | 12 | neutral | Rico | [neutral, neutral, neutral, neutral, neutral, ... | [Rico, Ella, Vera Langeberg, Giulia, zia, Thom... | neutral |


In [3]:
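# Collect (annotator, item key, label) triples for up to three annotators per turn.
# NOTE: only rows with exactly two or three votes are unpacked. Rows with any
# other number of votes (including more than three) get "missing" placeholders
# for annotators 2 and 3. AnnotationTask treats "missing" as an ordinary label,
# so these placeholders count towards the agreement scores.
anno1 = []
anno2 = []
anno3 = []
for ind in df.index:
    vote = df['Votes'][ind]
    annotators = df['Annotators'][ind]
    turn_id = df['turn_id'][ind]
    speaker = df['speaker'][ind]
    # Item key that is unique per speaker and turn
    key = speaker + "_" + str(turn_id)
    anno1.append((annotators[0], key, vote[0]))
    if len(vote) == 3:
        anno2.append((annotators[1], key, vote[1]))
        anno3.append((annotators[2], key, vote[2]))
    elif len(vote) == 2:
        anno2.append((annotators[1], key, vote[1]))
        anno3.append(("missing3", key, "missing"))
    else:
        anno2.append(("missing2", key, "missing"))
        anno3.append(("missing3", key, "missing"))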


('Rico', 'Bill_4', 'neutral') ('missing2', 'Bill_4', 'missing') ('missing3', 'Bill_4', 'missing')
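The cell above only unpacks rows with exactly two or three votes, while the `Votes` column shown earlier can hold longer lists. An alternative is to emit one triple per vote, however many there are, and to leave out placeholders altogether. This is a sketch under the assumption that `Votes` and `Annotators` are parallel lists; the function name `rows_to_triples` and the sample rows are hypothetical.

```python
# Hypothetical rows mirroring the columns shown in df.head() above
rows = [
    {"speaker": "Bill", "turn_id": 4,
     "Annotators": ["Rico", "Ella", "Giulia"],
     "Votes": ["neutral", "neutral", "sadness"]},
    {"speaker": "Bill", "turn_id": 6,
     "Annotators": ["Rico", "Ella"],
     "Votes": ["anger", "anger"]},
]


def rows_to_triples(rows):
    """Build (annotator, item key, label) triples for every vote in every row."""
    triples = []
    for row in rows:
        key = f"{row['speaker']}_{row['turn_id']}"
        # Pair each annotator with their own vote; no placeholder labels are added
        for annotator, label in zip(row["Annotators"], row["Votes"]):
            triples.append((annotator, key, label))
    return triples


triples = rows_to_triples(rows)
print(len(triples))  # 5
print(triples[0])    # ('Rico', 'Bill_4', 'neutral')
```

With a pandas DataFrame, the rows could be obtained with `df.to_dict("records")`. Whether a given agreement implementation accepts items with differing numbers of annotations should be checked against its documentation before relying on the scores.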


In [6]:
# Compare each pair of annotators, then all three together
comparisons = [
    ("anno1 vs anno2", anno1 + anno2),
    ("anno1 vs anno3", anno1 + anno3),
    ("anno2 vs anno3", anno2 + anno3),
    ("all three", anno1 + anno2 + anno3),
]
for name, data in comparisons:
    iaa = agreement.AnnotationTask(data=data)
    print(f"Krippendorff's alpha ({name}):", iaa.alpha())

Krippendorff's alpha (anno1 vs anno2): 0.377630235276729
Krippendorff's alpha (anno1 vs anno3): 0.27840142401883383
Krippendorff's alpha (anno2 vs anno3): 0.6103997909589758
Krippendorff's alpha (all three): 0.42834256293857953


In [5]:
for a1, a2, a3 in zip(anno1, anno2, anno3):
    print(a1, a2, a3)

('Rico', 'Bill_4', 'neutral') ('missing2', 'Bill_4', 'missing') ('missing3', 'Bill_4', 'missing')
('Rico', 'Bill_6', 'anger') ('missing2', 'Bill_6', 'missing') ('missing3', 'Bill_6', 'missing')
('Rico', 'Bill_8', 'joy') ('missing2', 'Bill_8', 'missing') ('missing3', 'Bill_8', 'missing')
('Rico', 'Bill_10', 'surprise') ('missing2', 'Bill_10', 'missing') ('missing3', 'Bill_10', 'missing')
('Rico', 'Bill_12', 'neutral') ('missing2', 'Bill_12', 'missing') ('missing3', 'Bill_12', 'missing')
('Rico', 'Bill_14', 'neutral') ('missing2', 'Bill_14', 'missing') ('missing3', 'Bill_14', 'missing')
('Rico', 'Bill_16', 'joy') ('missing2', 'Bill_16', 'missing') ('missing3', 'Bill_16', 'missing')
('Rico', 'Bill_18', 'sadness') ('missing2', 'Bill_18', 'missing') ('missing3', 'Bill_18', 'missing')
('Rico', 'Bill_20', 'anger') ('missing2', 'Bill_20', 'missing') ('missing3', 'Bill_20', 'missing')
('Rico', 'Bill_22', 'anger') ('missing2', 'Bill_22', 'missing') ('missing3', 'Bill_22', 'missing')
('Rico', 'Bi