# Fleiss' kappa

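As the examples below show, Fleiss' kappa is **not** a suitable metric for our IAA. See especially the difference between BER and ENR: in both domains the raw agreement is very good, with only one note in which a single annotator disagrees with the other five. Yet the kappas are -0.03 and 0.86, respectively. The reason is that kappa corrects observed agreement for the agreement expected by chance, which is estimated from the overall label distribution. In BER almost every label is False, so the expected agreement is nearly as high as the observed agreement and kappa collapses to around zero.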
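To make the BER/ENR contrast concrete, here is a minimal sketch that recomputes Fleiss' kappa by hand from the two count tables shown further down. The function `fleiss_kappa` and its return tuple are illustrative and not part of the notebook, which uses `statsmodels` instead. The sketch only exposes the intermediate quantities (observed and expected agreement) that the library call hides.

```python
def fleiss_kappa(table):
    """Return (observed, expected, kappa) for a notes x categories count table."""
    n_subjects = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every note needs the same number of ratings")

    total = n_subjects * n_raters
    # Share of all ratings that fall in each category
    p_cat = [sum(col) / total for col in zip(*table)]
    # Per-note agreement: fraction of annotator pairs that agree
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_obs = sum(p_subj) / n_subjects
    # Chance agreement implied by the category shares
    p_exp = sum(p * p for p in p_cat)
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)


# Columns are (False, True), rows are the five IAA notes
ber = [[6, 0], [6, 0], [5, 1], [6, 0], [6, 0]]
enr = [[0, 6], [0, 6], [6, 0], [6, 0], [5, 1]]

for name, table in [("BER", ber), ("ENR", enr)]:
    p_obs, p_exp, kappa = fleiss_kappa(table)
    print(f"{name}: observed={p_obs:.2f} expected={p_exp:.2f} kappa={kappa:.2f}")
# BER: observed=0.93 expected=0.94 kappa=-0.03
# ENR: observed=0.93 expected=0.51 kappa=0.86
```

Both domains have the same observed agreement. Only the expected agreement differs, and that difference alone moves kappa from -0.03 to 0.86.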

In [1]:
import pandas as pd
from statsmodels.stats import inter_rater as ir

import sys
sys.path.insert(0, '..')
from utils.latex import show_latex, TABLES
from utils.config import PATHS
from utils.data_process import drop_disregard

In [2]:
week = 'week_14'

In [3]:
path = PATHS.getpath('data') / 'iaa'
df = pd.read_pickle(path / f'annotated_df_{week}_parsed.pkl')

In [4]:
path = PATHS.getpath('data_to_inception_conll')
batch_info = pd.read_pickle(path / f'{week}.pkl')

In [5]:
iaa = batch_info.query("samp_meth == 'kwd_iaa'").NotitieID.unique()
df = df.pipe(drop_disregard).loc[lambda d: d.NotitieID.isin(iaa)]

In [6]:
domains = ['ADM', 'ATT', 'BER', 'ENR', 'ETN', 'FAC', 'INS', 'MBW', 'STM']
df[domains] = df[domains].fillna(False)

In [7]:
# check that every annotator labelled the same number of IAA notes;
# Fleiss' kappa assumes each note receives the same number of ratings

df.pivot_table(
    index='annotator',
    values='NotitieID',
    aggfunc='nunique',
)

annotator,NotitieID
avelli,5
katsburg,5
meskers,5
opsomer,5
swartjes,5
vervaart,5
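The pivot above counts notes per annotator, whereas the assumption behind Fleiss' kappa is stated per note: each note must receive the same number of ratings. A small sketch of that complementary check follows. The `pairs` set is a stand-in that assumes every annotator labelled every IAA note, which is consistent with the counts shown here. In the notebook it would presumably be derived from the `NotitieID` and `annotator` columns of `df`.

```python
from collections import Counter
from itertools import product

annotators = ["avelli", "katsburg", "meskers", "opsomer", "swartjes", "vervaart"]
notes = [407393461, 408103746, 414924160, 428039733, 444788701]

# Assumption: one (note, annotator) pair per annotator per IAA note
pairs = set(product(notes, annotators))

# Count how many distinct annotators rated each note
ratings_per_note = Counter(note for note, _ in pairs)
distinct_counts = set(ratings_per_note.values())

# Fleiss' kappa needs a single rating count shared by all notes
balanced = len(distinct_counts) == 1
print(balanced, distinct_counts)  # True {6}
```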


# ADM

In [8]:
tab = (
    df.groupby(['NotitieID', 'annotator'])['ADM'].any()
    .groupby('NotitieID').value_counts()
    .unstack().fillna(0).astype(int)
)
tab

NotitieID,ADM=False,ADM=True
407393461,0,6
408103746,0,6
414924160,4,2
428039733,0,6
444788701,0,6


In [9]:
ir.fleiss_kappa(tab).round(2)

0.54

# ATT

In [10]:
tab = (
    df.groupby(['NotitieID', 'annotator'])['ATT'].any()
    .groupby('NotitieID').value_counts()
    .unstack().fillna(0).astype(int)
)
tab

NotitieID,ATT=False
407393461,6
408103746,6
414924160,6
428039733,6
444788701,6
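No kappa is reported for ATT. With a single observed category the statistic is not defined: observed and expected agreement are both 1, so the formula reduces to 0/0. The sketch below illustrates this with a hand-written helper. The name `fleiss_kappa_or_nan` is hypothetical, and the choice to return `nan` is an assumption of this sketch, not a statement about what `statsmodels` does in this case.

```python
import math


def fleiss_kappa_or_nan(table):
    """Fleiss' kappa for a count table, or nan when chance agreement is 1."""
    n_subjects = len(table)
    n_raters = sum(table[0])
    total = n_subjects * n_raters
    # Share of all ratings per category
    p_cat = [sum(col) / total for col in zip(*table)]
    # Mean fraction of agreeing annotator pairs per note
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    if math.isclose(p_exp, 1.0):
        # Denominator 1 - p_exp is zero: kappa is undefined
        return math.nan
    return (p_obs - p_exp) / (1 - p_exp)


# ATT: all six annotators answered False on all five notes
att = [[6], [6], [6], [6], [6]]
print(fleiss_kappa_or_nan(att))  # nan
```

Perfect agreement therefore yields no usable score here, which is another way in which kappa fails to reflect the agreement visible in the table.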


# BER

In [11]:
tab = (
    df.groupby(['NotitieID', 'annotator'])['BER'].any()
    .groupby('NotitieID').value_counts()
    .unstack().fillna(0).astype(int)
)
tab

NotitieID,BER=False,BER=True
407393461,6,0
408103746,6,0
414924160,5,1
428039733,6,0
444788701,6,0


In [12]:
ir.fleiss_kappa(tab).round(2)

-0.03

# ENR

In [13]:
tab = (
    df.groupby(['NotitieID', 'annotator'])['ENR'].any()
    .groupby('NotitieID').value_counts()
    .unstack().fillna(0).astype(int)
)
tab

NotitieID,ENR=False,ENR=True
407393461,0,6
408103746,0,6
414924160,6,0
428039733,6,0
444788701,5,1


In [14]:
ir.fleiss_kappa(tab).round(2)

0.86