<a href="https://colab.research.google.com/github/alexiamhe93/goldStandard-tutorial/blob/main/Tutorial_1_Inter_rater_reliability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Title: Tutorial 1: Running Inter-Rater Reliability in Python

Authors: Goddard, A. & Gillespie, A.

Date: November 2022

In this tutorial, we provide instructions on how to run an inter-rater reliability analysis on a dataset manually coded for a psychological variable. To train a classifier that measures a psychological construct automatically, it is necessary to create a gold-standard dataset of hand-coded texts. Inter-rater reliability provides an estimate of the quality of these human-coded scores by estimating the degree of agreement between coders' scores and the reliability of this agreement.

This tutorial uses a dataset (from Reddit, Twitter, and Wikipedia Talk Pages) coded for misunderstandings by five coders (including one author), each with a minimum of an MSc-level qualification in psychology.



# Install packages

> **NOTE** Run this cell first.

In [None]:
!pip install agreement

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting agreement
  Downloading agreement-0.1.1-py3-none-any.whl (19 kB)
Installing collected packages: agreement
Successfully installed agreement-0.1.1


# Load packages & data

In [None]:
# for downloading data
import requests, zipfile, io
# for loading data into a dataframe
import pandas as pd
# for mathematical operations
import numpy as np
# for conducting Inter-rater reliability assessments
from nltk import agreement
from agreement.utils.kernels import linear_kernel
from agreement.metrics import gwets_gamma
from agreement.utils.transform import pivot_table_frequency

This cell downloads the data from GitHub and unzips it.


In [None]:
r = requests.get( 'https://github.com/alexiamhe93/goldStandard-tutorial/blob/main/Data/Tutorial-data.zip?raw=true' ) 
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

This cell loads the data into a dataframe.

In [None]:
df0 = pd.read_csv("IRR_misunderstandings_data.csv")

In this cell, we look at the top five rows of the loaded data to inspect its structure. The coder columns use the prefix `M_` to denote that a text has been coded for misunderstanding (binary), followed by a number identifying one of the five coders.

In [None]:
df0.head()

Unnamed: 0,group,turn_id,turn,author,sentences,M_1,M_2,M_3,M_4,M_5
0,group_101723,la0mdz,turn_1,Darryl Harris,CMV: Territorial integrity should be taken les...,0,0,0,0,0
1,group_101723,la0mdz,turn_1,Darryl Harris,"However, in my opinion, the insistence on part...",0,0,0,0,0
2,group_101723,la0mdz,turn_1,Darryl Harris,Take the Western Jessicatown as an example.,0,0,0,0,0
3,group_101723,la0mdz,turn_1,Darryl Harris,An independent state in the region has no prec...,0,0,0,0,0
4,group_101723,la0mdz,turn_1,Darryl Harris,The basis of such claims is that the Spanish u...,0,0,0,0,0


# Running inter-rater reliability

In this section, we generate different inter-rater reliability statistics. In our final case study, we reported Krippendorff's alpha (1970) and Gwet's AC1 (2008), as they belong to different families of inter-rater reliability statistics (see Artstein & Poesio, 2008). We also include code to generate Cohen's kappa (1960) and Scott's pi (1955) as examples of alternative statistics. Finally, we provide an estimate of absolute agreement, which simply calculates the degree to which the five coders agree with each other.

References:

Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596. https://doi.org/10.1162/coli.07-034-R2

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600

Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1), 61–70. https://doi.org/10.1177/001316447003000105

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3), 321–325. https://doi.org/10.1086/266577
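
To make the statistics named in this section more concrete, the following is a minimal sketch of absolute (observed) agreement and Scott's pi for two coders only. The scores and the function names `observed_agreement` and `scotts_pi` are hypothetical and chosen for illustration; the tutorial itself relies on NLTK for the five-coder case, and this sketch is not that implementation.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which two coders give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def scotts_pi(a, b):
    """Scott's pi for two coders: (A_o - A_e) / (1 - A_e).

    A_e is the agreement expected by chance, estimated from the
    category proportions pooled over both coders.
    """
    n = len(a)
    a_o = observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    a_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (a_o - a_e) / (1 - a_e)


# hypothetical binary scores from two coders on ten sentences
coder_a = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
coder_b = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

print("Observed agreement:", observed_agreement(coder_a, coder_b))
print("Scott's pi:", round(scotts_pi(coder_a, coder_b), 3))
```

In this toy example the coders agree on nine of ten sentences, but the chance-corrected value is noticeably lower because most agreement falls on the majority category.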


## Prepare data

Before running the inter-rater reliability statistics, the data needs to be in a specific format. All statistics except Gwet's AC1 are calculated with the Natural Language Toolkit (NLTK) package, which expects the data as a list of lists, where each sublist is composed of the following:

> `[Coder id, Sentence id, Score]`

For Gwet's AC1 (computed with the `gwets_gamma` function from the `agreement` package), the input data is similarly structured but is an array rather than a list of lists, with each row in the following order:

> `[Sentence id, Coder id, Score]`

Both input formats only require the five coder columns (`M_1 ... M_5`).

In [None]:
# create new dataframe with only relevant columns
df1 = df0[[x for x in df0.columns if "M_" in x]]
# drop any NA values
df1 = df1.dropna()
# print number of rows removed
print(f"Rows removed = {len(df0) - len(df1)}")

Rows removed = 0


In [None]:
# inspect dataframe
df1.head()

Unnamed: 0,M_1,M_2,M_3,M_4,M_5
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0


### Prepare data for all Inter-rater reliability statistics bar Gwet's AC1

In [None]:
# Create empty list 
IRR_out = []
# iterate over the rows of df1
for i, row in df1.iterrows():
  # iterate over each column name (i.e. Coder id)
  for k in list(df1.columns):
    # Populate empty list with coder id, index, and score 
    # (index value expects a string not an integer)
    IRR_out.append([k, str(i), row[k]])
# inspect first 10 values in sublist
IRR_out[:10]

[['M_1', '0', 0],
 ['M_2', '0', 0],
 ['M_3', '0', 0],
 ['M_4', '0', 0],
 ['M_5', '0', 0],
 ['M_1', '1', 0],
 ['M_2', '1', 0],
 ['M_3', '1', 0],
 ['M_4', '1', 0],
 ['M_5', '1', 0]]

Observe how the list of lists relates to the original dataframe. It is now in "long" format, meaning that each row (sublist) is a single score given by one coder to one sentence, rather than one row per sentence.
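
As an alternative illustration of the same wide-to-long reshaping, the sketch below uses pandas `melt` on a small hypothetical dataframe. The names `toy`, `long_df`, and `triples` are assumptions made for this example; the loop above remains the tutorial's own approach.

```python
import pandas as pd

# hypothetical wide data: two sentences scored by three coders
toy = pd.DataFrame({"M_1": [0, 1], "M_2": [0, 1], "M_3": [0, 0]})

# move the row index into a column, then stack the coder columns
long_df = (
    toy.reset_index()
    .melt(id_vars="index", var_name="coder", value_name="score")
    .sort_values(["index", "coder"])
)
# the sentence id is stored as a string, as in the loop-based version
long_df["index"] = long_df["index"].astype(str)

# reorder columns to [coder id, sentence id, score]
triples = long_df[["coder", "index", "score"]].values.tolist()
print(triples)
```

Each sublist in `triples` follows the `[Coder id, Sentence id, Score]` order described earlier.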

### Prepare data for Gwet's AC1

In [None]:
# Repeat process from previous cell
AC1_data = []
for i, row in df1.iterrows():
  for k in list(df1.columns):
    # Change the order in which the list is populated (sentence id, coder id, score)
    AC1_data.append([i, k, row[k]])
# inspect first 10 entries of the output list
AC1_data[0:10]

[[0, 'M_1', 0],
 [0, 'M_2', 0],
 [0, 'M_3', 0],
 [0, 'M_4', 0],
 [0, 'M_5', 0],
 [1, 'M_1', 0],
 [1, 'M_2', 0],
 [1, 'M_3', 0],
 [1, 'M_4', 0],
 [1, 'M_5', 0]]

We can observe that the list of lists created for Gwet's AC1 differs from the data for the other inter-rater reliability statistics only in the ordering of the variables.

In [None]:
# create array
AC1_array = np.array(AC1_data)
# inspect first 10 rows
AC1_array[:10]

array([['0', 'M_1', '0'],
       ['0', 'M_2', '0'],
       ['0', 'M_3', '0'],
       ['0', 'M_4', '0'],
       ['0', 'M_5', '0'],
       ['1', 'M_1', '0'],
       ['1', 'M_2', '0'],
       ['1', 'M_3', '0'],
       ['1', 'M_4', '0'],
       ['1', 'M_5', '0']], dtype='<U21')

A NumPy array stores all values with the same data type, in this case strings (i.e. text). We now pivot the array, converting the data from long to "wide" format, with each row representing a sentence and the columns representing the possible scores (in our case `0` for no misunderstanding and `1` for misunderstanding). The values in the columns are the number of coders who selected each score. For instance, if four coders picked misunderstanding and one picked no misunderstanding, then the row would look like:
> `[1,4]`

See: https://pypi.org/project/agreement/ 
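
To illustrate what such a frequency table contains, here is a minimal sketch that builds the same kind of sentence-by-score count table with pandas `crosstab` on hypothetical data. This is a stand-in for illustration only, not the implementation of `pivot_table_frequency`; the names `long_df` and `counts` are assumptions.

```python
import pandas as pd

# hypothetical long-format data: two sentences, five coders each
long_df = pd.DataFrame({
    "sentence": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "score":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# count how many coders chose each score for each sentence
counts = pd.crosstab(long_df["sentence"], long_df["score"])
print(counts)
```

In this toy table, sentence 0 has all five coders on score `0`, while sentence 1 has one coder on `0` and four on `1`, matching the `[1,4]` pattern described above.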

In [None]:
qat = pivot_table_frequency(AC1_array[:, 0], AC1_array[:, 2])
# inspect output
pd.DataFrame(qat).head(5)

Unnamed: 0,0,1
0,5.0,0.0
1,5.0,0.0
2,5.0,0.0
3,5.0,0.0
4,1.0,4.0


## Calculate Inter-rater reliability statistics from data

In [None]:
# Calculate Gwet's AC1 and round to three decimal places
gg = round(gwets_gamma(qat, linear_kernel), 3)

In [None]:
# Create an NLTK AnnotationTask object from which the other IRR statistics are calculated
# see: https://www.nltk.org/api/nltk.metrics.agreement.html 
ratingtask = agreement.AnnotationTask(data=IRR_out)

## Calculate IRR on dataset and interpret results

In [None]:
# extract the statistics
kappa = round(ratingtask.kappa(),3)
alpha = round(ratingtask.alpha(),3)
scott = round(ratingtask.pi(),3)
Absolute_agreement = round(ratingtask.avg_Ao(),3)

In [None]:
# Print all the statistics
print("Cohen's Kappa " +str(kappa))
print("Scott's pi " + str(scott))
print("Krippendorf's Alpha " +str(alpha))
print("Gwet's AC1: " + str(gg))
print("Absolute Agreement: "+ str(Absolute_agreement))

Cohen's Kappa 0.801
Scott's pi 0.803
Krippendorf's Alpha 0.803
Gwet's AC1: 0.979
Absolute Agreement: 0.981


We can observe that our data has moderate inter-rater reliability according to Krippendorff's alpha (0.803), but very high inter-rater reliability according to Gwet's AC1 (0.979). However, the latter is close to the absolute agreement between coders (0.981), which is highly inflated due to the distribution of misunderstandings in the dataset:

In [None]:
# print the percentage of sentences each coder scored as a misunderstanding
for n in range(1, 6):
  pct = round(len(df0[df0[f"M_{n}"] == 1]) / len(df0) * 100, 2)
  print(f"Percentage of sentences coded as misunderstandings by Coder {n}: {pct}%")

Percentage of sentences coded as misunderstandings by Coder 1: 5.75%
Percentage of sentences coded as misunderstandings by Coder 2: 4.58%
Percentage of sentences coded as misunderstandings by Coder 3: 5.1%
Percentage of sentences coded as misunderstandings by Coder 4: 5.1%
Percentage of sentences coded as misunderstandings by Coder 5: 4.45%


We observe that each coder scored only between 4% and 6% of the sentences as misunderstandings. This means that the majority of cases are not misunderstandings, i.e., coded as zero. This inflates the absolute agreement and Gwet's AC1 statistic, while Krippendorff's alpha is less affected.
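
The effect of a skewed distribution can be sketched with hypothetical scores from two coders rather than the tutorial's five-coder data. The example assumes the standard two-coder, two-category expressions for chance agreement (pooled squared proportions for Scott's pi, and `2 * q * (1 - q)` for Gwet's AC1, where `q` is the pooled proportion of ones). The function name `chance_corrected` and the toy data are assumptions for illustration, and this is not how NLTK or the `agreement` package compute their multi-coder statistics.

```python
def chance_corrected(a, b):
    """Return (observed agreement, Scott's pi, Gwet's AC1) for two binary coders."""
    n = len(a)
    a_o = sum(x == y for x, y in zip(a, b)) / n
    # pooled proportion of sentences scored 1 across both coders
    q = (sum(a) + sum(b)) / (2 * n)
    pi_e = q ** 2 + (1 - q) ** 2   # chance agreement under Scott's pi
    ac1_e = 2 * q * (1 - q)        # chance agreement under Gwet's AC1
    pi = (a_o - pi_e) / (1 - pi_e)
    ac1 = (a_o - ac1_e) / (1 - ac1_e)
    return a_o, pi, ac1


# hypothetical data: 100 sentences, each coder scores 4% as misunderstandings
coder_a = [0] * 94 + [1, 1] + [1, 1] + [0, 0]
coder_b = [0] * 94 + [1, 1] + [0, 0] + [1, 1]

a_o, pi, ac1 = chance_corrected(coder_a, coder_b)
print(f"Observed agreement: {a_o:.3f}")
print(f"Scott's pi: {pi:.3f}")
print(f"Gwet's AC1: {ac1:.3f}")
```

With so few positive cases, the two coders agree on 96 of 100 sentences mostly by agreeing on zeros. In this toy example, AC1 stays close to the raw agreement, while pi drops sharply because the coders agree on only two of the six sentences that either of them scored as a misunderstanding.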

# Conclusion

In this tutorial, we calculated inter-rater reliability statistics for five coders' misunderstanding scores on sentences from online dialogues. We find the data to have moderate inter-rater reliability (Krippendorff's alpha = 0.803). The tutorial is limited in using a binary classification procedure for the coded data. For use with categorical codes, Krippendorff's alpha and Gwet's AC2 (using the same functions as in this tutorial) may still be used to estimate inter-rater reliability.