# Inter-Annotator Agreement

Our aim is to analyse inter-annotator agreement for different subject-verb-object (SVO) prediction tasks.
In these tasks, an annotator must judge, based on his or her knowledge of the world, whether a proposed subject-verb-object triple is plausible. A triple is considered plausible when it occurs naturally in the real world and is not merely a hypothetical possibility. Such judgements are inherently subjective, as different individuals draw the line between practical plausibility and hypothetical possibility in different places.
In the following, we analyse the agreement between the annotators.

**Sources:**
* [statsmodels.stats.inter_rater](https://www.statsmodels.org/stable/_modules/statsmodels/stats/inter_rater.html) 
* [NLTK: Agreement metrics](https://www.nltk.org/api/nltk.metrics.agreement.html)
* [Common pitfalls in statistical analysis: Measures of agreement](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5654219/)
* [Agree or Disagree? A Demonstration of An Alternative Statistic to Cohen’s Kappa for Measuring the Extent and Reliability of Agreement between Observers](https://nces.ed.gov/FCSM/pdf/J4_Xie_2013FCSM.pdf)



In [1]:
import numpy as np
import pandas as pd

from pandas import DataFrame
from pandas import read_csv
from pandas import concat, merge

from statsmodels.stats.inter_rater import cohens_kappa

from estimators import observed_agreement
from estimators import get_confusion_matrix

## Detection of invalid SVO triples

All SVO triples were generated by sampling sentences from the corpora.
This introduced two types of errors that must be corrected:
* incorrect triples caused by an incorrect syntax tree;
* triples where the subject is equal to the object.

We are going to remove the SVO-prediction tasks corresponding to these sentences.

In [2]:
df = read_csv('original_sentences_annotation.csv')
assert df.shape == (500, 5), 'The file must contain 500 rows and 5 columns'
assert all(df['hinnang'].isin(['yes', 'no'])), 'Incorrect judgements' 
assert all(df.columns == ['sentence','subject', 'verb', 'object', 'hinnang']), 'Incorrect column names'
invalid_tasks = df[(df['hinnang'] == 'no') | (df['subject'] == df['object'])].iloc[:, :4].reset_index()   

print(f'Number of incorrect sentences: {len(invalid_tasks)} ({len(invalid_tasks)/len(df)*100 :.2f}%)')
display(invalid_tasks)    

Number of incorrect sentences: 33 (6.60%)


| | index | sentence | subject | verb | object |
|---|---|---|---|---|---|
| 0 | 0 | Mind ei võlu eriti väljavaade saada järgmiseks... | väljavaade | saama | võlu |
| 1 | 5 | Eesti kohta nii hulle arve esile tuua ei saa . | arve | tooma | hull |
| 2 | 11 | Modelliagentuuri MASS juhivad Oleg ja Raul kah... | mass | juhtima | Oleg |
| 3 | 16 | Vähemalt nii on väitnud lepingu lõpetanud Nõmm... | direktor | lõpetama | leping |
| 4 | 26 | Tema neli last oksendasid öö läbi , kui ema ke... | laps | oksendama | öö |
| 5 | 27 | Aga sõda ei vallandanud serbia ega bosnia rahv... | sõda | vallandama | rahvas |
| 6 | 28 | Ent paljuks ei pea suu korrashoiu eest rohkem ... | õpetaja | maksma | suu |
| 7 | 30 | Kuid miks peab üks inimene justkui Jeesus Kris... | inimene | lunastama | pangandus |
| 8 | 79 | “ Vägivald sünnitab vägivalda , ” võtab kokku ... | vägivald | sünnitama | vägivald |
| 9 | 105 | “ Hommikul poole üheksast õhtul poole üheksani... | pool | hoidma | ise |
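
The removal step described above could be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it assumes that a task table carries the originating `sentence` column (as the files of the with-background task do), and the toy rows and the names `sentences`, `tasks` and `valid_tasks` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the sentence-level validity table (hypothetical rows).
sentences = pd.DataFrame({
    'sentence': ['s1', 's2', 's3', 's4'],
    'subject':  ['dog', 'violence', 'child', 'man'],
    'object':   ['bone', 'violence', 'night', 'book'],
    'hinnang':  ['yes', 'yes', 'no', 'yes'],
})

# Same rule as in the cell above: rejected parse, or subject equal to object.
is_invalid = (sentences['hinnang'] == 'no') | (sentences['subject'] == sentences['object'])
invalid = sentences[is_invalid]

# Toy task table. Assumption: every task row records its source sentence.
tasks = pd.DataFrame({
    'sentence':  ['s1', 's1', 's2', 's3', 's4'],
    'hinnang_1': ['yes', 'no', 'yes', 'no', 'yes'],
    'hinnang_2': ['yes', 'no', 'no', 'no', 'yes'],
})

# Keep only the tasks whose sentence was not flagged as invalid.
valid_tasks = tasks[~tasks['sentence'].isin(invalid['sentence'])]
print(f'{len(tasks)} tasks, {len(valid_tasks)} left after filtering')
```

Filtering by sentence text only works if each sentence is unique in the table; otherwise a positional or index-based key would be needed.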


## I. SVO prediction without background

In this task, annotators have to predict whether a subject-verb-object triple is possible without any background information, i.e., the annotator sees only an SVO triple of lemmas and must make a binary decision.

### Agreement between human annotators

In [3]:
annotation = {}
idx_range = range(1, 6)


# The first annotator
dfs = [read_csv(f'annotator1/svo_100_{i}.csv', index_col=0, encoding='utf-8') for i in idx_range]
assert all(df.shape == (200, 4) for df in dfs), 'Each file must contain 200 rows and 4 columns'
assert all(all(df.columns == ['subject', 'verb', 'object', 'hindaja1']) for df in dfs), 'Incorrect column names'
assert all(all(df['hindaja1'].isin(['no', 'yes'])) for df in dfs), 'Incorrect judgements'

annotation[1] = concat(dfs, axis=0).rename(columns={'hindaja1': 'hinnang_1'})


# The second annotator
dfs = [read_csv(f'annotator2/svo_{i}.csv', index_col=0, encoding='latin1') for i in idx_range]
assert all(df.shape == (200, 4) for df in dfs), 'Each file must contain 200 rows and 4 columns'
assert all(all(df.columns == ['subject', 'verb', 'object', 'hinnang']) for df in dfs), 'Incorrect column names'
assert all(all(df['hinnang'].isin(['no', 'yes'])) for df in dfs), 'Incorrect judgements'

annotation[2] = concat(dfs, axis=0).rename(columns={'hinnang': 'hinnang_2'})

Surprisingly, some SVO triples occur more than once in the dataset, so we cannot merge the tables on the SVO columns. Instead, we concatenate the annotations as they are and check that the rows are in the same order in all files.
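
The problem with duplicated keys can be seen on a small toy example. The rows below are hypothetical; only `merge` and `concat` from pandas are used. A repeated triple is matched with every copy of itself during a merge, so the merged table has more rows than either input, whereas positional concatenation keeps one row per task.

```python
import pandas as pd

keys = ['subject', 'verb', 'object']

# Toy annotations (hypothetical): the triple dog-eat-bone occurs twice.
a = pd.DataFrame({
    'subject':   ['dog', 'cat', 'dog'],
    'verb':      ['eat', 'see', 'eat'],
    'object':    ['bone', 'bird', 'bone'],
    'hinnang_1': ['yes', 'yes', 'no'],
})
b = a[keys].assign(hinnang_2=['yes', 'no', 'no'])

# Merging on the SVO columns pairs each duplicate with every duplicate:
# 2 x 2 rows for dog-eat-bone plus 1 row for cat-see-bird.
merged = a.merge(b, on=keys)
print(len(merged))

# Positional alignment: first verify the row order, then concatenate.
same_order = (a[keys].to_numpy() == b[keys].to_numpy()).all()
assert same_order, 'Rows must be in the same order in both tables'
aligned = pd.concat([a, b['hinnang_2']], axis=1)
print(len(aligned))
```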


In [4]:
svo_cols = ['subject', 'verb', 'object']
df = concat((annotation[1][svo_cols], annotation[2][svo_cols]), axis=1)
assert all(df['subject'].nunique(axis = 1) == 1), 'All subjects must be equal'
assert all(df['verb'].nunique(axis = 1) == 1), 'All verbs must be equal'
assert all(df['object'].nunique(axis = 1) == 1), 'All objects must be equal'

In [5]:
tbl = concat([annotation[1], annotation[2]['hinnang_2']], axis=1)
assert len(tbl) == len(annotation[1]), 'The number of rows is off'
assert all(tbl['hinnang_1'].isin(['no', 'yes'])), 'Some triples are missing from the first table'
assert all(tbl['hinnang_2'].isin(['no', 'yes'])), 'Some triples are missing from the second table'

In [6]:
print('Task with incorrect sentences')

confusion_matrix = get_confusion_matrix(tbl, ['hinnang_1', 'hinnang_2'])
display(confusion_matrix.rename_axis(None, axis=0).rename_axis(None, axis=1))

print(f'Observed agreement: {observed_agreement(confusion_matrix) * 100 :.2f}%')
kappa = cohens_kappa(confusion_matrix)['kappa']
print(f"Cohen's kappa:      {kappa * 100 :.2f}%")

Task with incorrect sentences


| | no | yes |
|---|---|---|
| no | 325 | 78 |
| yes | 26 | 571 |


Observed agreement: 89.60%
Cohen's kappa:      77.92%
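
The two reported numbers can be reproduced by hand from the confusion matrix. Cohen's kappa is defined as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from the marginal label frequencies. The sketch below uses plain Python and makes no assumption about how the local `estimators` helpers are implemented; the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Kappa is symmetric, so it does not matter which annotator is on the rows.
matrix = [[325, 78],
          [26, 571]]

total = sum(sum(row) for row in matrix)

# Observed agreement: share of items on the diagonal.
p_o = sum(matrix[i][i] for i in range(2)) / total

# Marginal label frequencies of each annotator.
row_marginals = [sum(row) / total for row in matrix]
col_marginals = [sum(matrix[i][j] for i in range(2)) / total for j in range(2)]

# Agreement expected if both annotators labelled independently.
p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))

kappa = (p_o - p_e) / (1 - p_e)

print(f'Observed agreement: {p_o * 100:.2f}%')
print(f'Expected agreement: {p_e * 100:.2f}%')
print(f"Cohen's kappa:      {kappa * 100:.2f}%")
```

The observed agreement and kappa printed by this sketch match the 89.60% and 77.92% reported above.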


In [7]:
print('Task without incorrect sentences')

# To complete

Task without incorrect sentences


### Agreement between human and svo-prediction

TODO: Birgit, please add the collocation_net-based SVO annotations here.
Then we can compute the agreement between the different annotators and the SVO algorithm.
To make the analysis more rigorous, we can check whether these kappa scores differ significantly from the kappa scores between the human annotators.
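
One possible way to check whether two kappa scores differ significantly is a paired bootstrap over the annotated items. The sketch below is only an illustration under stated assumptions: the labels are synthetic, the algorithm annotations do not exist yet, and the helper names `cohen_kappa` and `bootstrap_kappa_difference` are hypothetical. It is not the method chosen by the authors.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label lists.

    Assumes that more than one label occurs, so that p_e < 1.
    """
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_difference(x, y, z, n_boot=1000, seed=1):
    """95% percentile interval for kappa(x, y) - kappa(x, z).

    Items are resampled with replacement; the same resample is used for
    both kappas because all three label lists refer to the same items.
    """
    rng = random.Random(seed)
    n = len(x)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        zs = [z[i] for i in idx]
        diffs.append(cohen_kappa(xs, ys) - cohen_kappa(xs, zs))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Synthetic labels (hypothetical): two careful humans and a noisier algorithm.
rng = random.Random(0)
truth = [rng.choice(['yes', 'no']) for _ in range(200)]


def noisy(labels, flip_prob):
    """Flip each label with the given probability."""
    flip = {'yes': 'no', 'no': 'yes'}
    return [flip[l] if rng.random() < flip_prob else l for l in labels]


human_1 = noisy(truth, 0.05)
human_2 = noisy(truth, 0.05)
algorithm = noisy(truth, 0.20)

low, high = bootstrap_kappa_difference(human_1, human_2, algorithm)
print(f'95% interval for the kappa difference: [{low:.3f}, {high:.3f}]')
```

If the interval excludes zero, the human-human agreement differs from the human-algorithm agreement at roughly the 5% level.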

## II. SVO prediction with background

In this task, annotators have to predict whether a subject-verb-object triple is possible given the surrounding sentence as context. To avoid priming, the triple is still given as a pair of subject and object lemmas, because it is not always possible to synthesise the subject-object pair in the correct grammatical cases.



In [8]:
bg_annotation = {}
idx_range = range(1, 6)

In [9]:
# The first annotator
dfs = [read_csv(f'annotator1/svo_100_sentences_{i}.csv', index_col=0, encoding='utf-8').iloc[:, :4] for i in idx_range]
assert all(df.shape == (200, 4) for df in dfs), 'Each file must contain 200 rows and 4 columns'
assert all(all(df.columns == ['sentence','subject', 'object', 'hinnang']) for df in dfs), 'Incorrect column names'
assert all(all(df['hinnang'].isin(['no', 'yes'])) for df in dfs), 'Incorrect judgements'

bg_annotation[1] = concat(dfs, axis=0).rename(columns={'hinnang': 'hinnang_1'})

# The second annotator
dfs = [read_csv(f'annotator2/svo_sentences_{i}.csv', index_col=0, encoding='utf-8') for i in idx_range]

# Quick hack to correct data errors
dfs[0].loc[26:27, 'hinnang'] = 'no'
dfs[0].loc[26:27, 'sentence'] = '***** on laenanud välisturgudelt 800 miljonit ***** , Tallinna Pank 400 miljonit , Ühispank 400 miljonit , Hoiupank 560 miljonit ; sama teed lähevad varsti ka väiksemad pangad .'
dfs[0].loc[26:27, 'subject'] = ['Hansapank', 'kroon']
dfs[0].loc[26:27, 'object'] = ['kroon', 'Hansapank']
dfs[3].loc[120,'hinnang'] = 'yes'

# Data validation
assert all(df.shape == (200, 4) for df in dfs), 'Each file must contain 200 rows and 4 columns'
assert all(all(df.columns == ['sentence', 'subject', 'object', 'hinnang']) for df in dfs), 'Incorrect column names'
assert all(all(df['hinnang'].isin(['no', 'yes'])) for df in dfs), 'Incorrect judgements'

bg_annotation[2] = concat(dfs, axis=0).rename(columns={'hinnang': 'hinnang_2'})

In [10]:
svo_cols = ['sentence', 'subject', 'object']
df = concat((bg_annotation[1][svo_cols], bg_annotation[2][svo_cols]), axis=1)
assert all(df['sentence'].nunique(axis = 1) == 1), 'All sentences must be equal'
assert all(df['subject'].nunique(axis = 1) == 1), 'All subjects must be equal'
assert all(df['object'].nunique(axis = 1) == 1), 'All objects must be equal'

In [11]:
tbl = concat([bg_annotation[1], bg_annotation[2]['hinnang_2']], axis=1)
assert len(tbl) == len(bg_annotation[1]), 'The number of rows is off'
assert all(tbl['hinnang_1'].isin(['no', 'yes'])), 'Some triples are missing from the first table'
assert all(tbl['hinnang_2'].isin(['no', 'yes'])), 'Some triples are missing from the second table'

In [12]:
print('Task with incorrect sentences')
confusion_matrix = get_confusion_matrix(tbl, ['hinnang_1', 'hinnang_2'])
display(confusion_matrix.rename_axis(None, axis=0).rename_axis(None, axis=1))

print(f'Observed agreement: {observed_agreement(confusion_matrix) * 100 :.2f}%')
kappa = cohens_kappa(confusion_matrix)['kappa']
print(f"Cohen's kappa:      {kappa * 100 :.2f}%")

Task with incorrect sentences


| | no | yes |
|---|---|---|
| no | 474 | 34 |
| yes | 27 | 465 |


Observed agreement: 93.90%
Cohen's kappa:      87.80%


In [13]:
print('Task without incorrect sentences')

# To complete

Task without incorrect sentences
