<a href="https://colab.research.google.com/github/alexiamhe93/goldStandard-tutorial/blob/main/Tutorial_1_Inter_rater_reliability_1Nov2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Title: Tutorial 1: Running Inter-Rater Reliability in Python

Authors: Goddard, A. & Gillespie, A.

Date: November 2022

In this tutorial, we provide instructions on how to run an inter-rater reliability analysis on a dataset manually coded for a psychological variable. Training a classifier to measure a psychological construct automatically requires a gold-standard dataset of hand-coded texts. Inter-rater reliability provides an estimate of the quality of these human-coded scores by estimating the degree of agreement between coders' scores and the reliability of this agreement.

This tutorial uses a dataset (from Reddit, Twitter, & Wikipedia Talk Pages) coded for misunderstandings by five coders (including one author), each with at least an MSc-level qualification in psychology. The dataset can be downloaded here: https://osf.io/rneks/download




# Install packages

> **NOTE** Run this cell first.

In [None]:
!pip install agreement

# Load packages & data

In [None]:
from google.colab import drive
import os
import pandas as pd
import numpy as np
from nltk import agreement
from agreement.utils.kernels import linear_kernel
from agreement.metrics import gwets_gamma
from agreement.utils.transform import pivot_table_frequency

Running the next cell prompts you to authorise access to your Google Drive. You then need to point the notebook to the folder containing the dataset.

In [None]:
drive.mount('/content/drive/')

Mounted at /content/drive/


This cell sets the working directory within Google Drive and loads the data used for calculating inter-rater reliability.

In [None]:
# pick path to data
#os.chdir("/content/drive/My Drive/YOUR/PATH/HERE")
os.chdir("/content/drive/My Drive/Colab Notebooks/Datasets/_PhDdata/Paper2_Data_19Oct2022/Anonymized/")
# load manually coded dataset
df0 = pd.read_csv('misunderstanding_sentences_for_IRR.csv')

In this cell, we display the top five rows of the loaded data to examine its structure. The coder columns use the prefix `M_` (for the binary misunderstanding code) followed by a number identifying one of the five coders.

In [None]:
df0.head()

Unnamed: 0,group,turn_id,turn,author,sentences,M_1,M_2,M_3,M_4,M_5
0,group_101723,la0mdz,turn_1,Darryl Harris,CMV: Territorial integrity should be taken les...,0,0,0,0,0
1,group_101723,la0mdz,turn_1,Darryl Harris,"However, in my opinion, the insistence on part...",0,0,0,0,0
2,group_101723,la0mdz,turn_1,Darryl Harris,Take the Western Jessicatown as an example.,0,0,0,0,0
3,group_101723,la0mdz,turn_1,Darryl Harris,An independent state in the region has no prec...,0,0,0,0,0
4,group_101723,la0mdz,turn_1,Darryl Harris,The basis of such claims is that the Spanish u...,0,0,0,0,0
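The `M_` naming convention makes it straightforward to separate the coder columns from the metadata columns. The sketch below illustrates this on a small toy frame; the values and the names `toy`, `coder_cols`, and `votes` are hypothetical and are not part of the tutorial dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) that mimics the layout shown above
toy = pd.DataFrame({
    "turn_id": ["a", "a", "b"],
    "sentences": ["s1", "s2", "s3"],
    "M_1": [0, 1, 0],
    "M_2": [0, 1, 1],
    "M_3": [0, 0, 0],
})

# Keep only the columns whose name starts with the coder prefix
coder_cols = [c for c in toy.columns if c.startswith("M_")]
codes = toy[coder_cols]

# Number of coders who flagged each sentence as a misunderstanding
votes = codes.sum(axis=1)
print(coder_cols)
print(votes.tolist())
```

Rows where `votes` is neither zero nor equal to the number of coders are the sentences on which the coders disagreed.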


# Running inter-rater reliability

In this section, we create a function to generate different inter-rater reliability statistics. In our final case study, we reported Krippendorff's Alpha (1970) and Gwet's AC1 (2008), as they belong to different families of inter-rater reliability statistics (see Artstein & Poesio, 2008). We also include code to generate Cohen's Kappa (1960) and Scott's pi (1955) as examples of alternative statistics. Finally, we provide an estimate of absolute agreement, which simply calculates the degree to which the five coders agree with each other.

References:

Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596. https://doi.org/10.1162/coli.07-034-R2

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104

Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600

Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1), 61–70. https://doi.org/10.1177/001316447003000105

Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3), 321–325. https://doi.org/10.1086/266577
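To make the idea of absolute agreement concrete, the following minimal sketch computes the average pairwise agreement between coders on a toy set of ratings. It assumes that absolute agreement means the share of items on which two coders give the same label, averaged over all coder pairs; the function name `average_pairwise_agreement` and the toy values are hypothetical.

```python
from itertools import combinations


def average_pairwise_agreement(ratings):
    """Average share of items on which each pair of coders agrees.

    ratings: list of items, each a list holding one label per coder.
    """
    n_coders = len(ratings[0])
    pair_scores = []
    for a, b in combinations(range(n_coders), 2):
        # proportion of items where coder a and coder b gave the same label
        matches = sum(1 for item in ratings if item[a] == item[b])
        pair_scores.append(matches / len(ratings))
    return sum(pair_scores) / len(pair_scores)


# Four items rated by three coders (hypothetical values)
toy_ratings = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
agreement_score = average_pairwise_agreement(toy_ratings)
print(round(agreement_score, 3))
```

Note that this figure does not correct for agreement expected by chance, which is what the other statistics in this section attempt to do.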


## Create function to calculate statistics

In [None]:
# create a function to calculate inter-rater reliability (IRR)
def get_IRR(df, # data with codes 
            variable_prefix): # prefix for the variable
  # select only the columns whose name starts with variable_prefix
  df_ = df[[x for x in df.columns if x.startswith(variable_prefix)]]
  # drop any empty values in the data
  df_ = df_.dropna()
  # prepare data for calculating Gwet's AC1
  # this creates a matrix with three columns:
  # text identifier, coder identifier, coder's score
  # (converts data from wide to long format)
  AC1_data = []
  for i, row in df_.iterrows():
    for k in list(df_.columns):
      AC1_data.append([i, k, row[k]])
  df_AC1 = np.array(AC1_data)
  qat = pivot_table_frequency(df_AC1[:, 0], df_AC1[:, 2])
  # calculates Gwet's AC1 and rounds to three places
  gg = round(gwets_gamma(qat, linear_kernel), 3)
  # prepare data for calculating other IRR statistics
  # This is the same process as for Gwet's AC1, but
  # the function expects a list not a matrix (and
  # the order of values is different: Coder, Index, 
  # score).
  out = []
  for i, row in df_.iterrows():
      for j,k in zip(row.index,row):
          out.append([str(j),str(i),int(k)])
  # Calculate the remaining IRR statistics using
  # NLTK's agreement.AnnotationTask() function
  ratingtask = agreement.AnnotationTask(data=out)
  # extract the relevant statistics
  kappa = round(ratingtask.kappa(),3)
  alpha = round(ratingtask.alpha(),3)
  scott = round(ratingtask.pi(),3)
  ag_general = round(ratingtask.avg_Ao(),3)
  print("Cohen's Kappa " +str(kappa))
  print("Scott's pi " + str(scott))
  print("Krippendorf's Alpha " +str(alpha))
  print("Gwet's AC1: " + str(gg))
  print("Absolute Agreement: "+ str(ag_general))
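The wide-to-long conversion performed by the loops in `get_IRR` can also be expressed with pandas' `melt`. The sketch below is an alternative illustration on toy data, not the function used in this tutorial; the names `wide`, `long_df`, and `triples` are hypothetical. The triples come out grouped by coder rather than by row, which should not matter for an agreement calculation.

```python
import pandas as pd

# Wide format: one row per item, one column per coder (hypothetical values)
wide = pd.DataFrame({"M_1": [0, 1, 0], "M_2": [0, 1, 1]})

# Long format: one row per (item, coder) combination
long_df = (
    wide.rename_axis("item")
        .reset_index()
        .melt(id_vars="item", var_name="coder", value_name="label")
)

# (coder, item, label) triples, the ordering used for the NLTK input above
triples = [
    (str(coder), str(item), int(label))
    for item, coder, label in long_df[["item", "coder", "label"]].itertuples(index=False)
]
print(triples)
```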

## Calculate IRR on dataset and interpret results

In [None]:
get_IRR(df0, "M_")

Cohen's Kappa 0.801
Scott's pi 0.803
Krippendorf's Alpha 0.803
Gwet's AC1: 0.979
Absolute Agreement: 0.981


We can observe that our data has moderate inter-rater reliability according to Krippendorff's Alpha (0.803), but very high inter-rater reliability according to Gwet's AC1 (0.979). However, the latter is closer to the absolute agreement between coders (0.981), which is highly inflated by the skewed distribution of misunderstandings in the dataset:

In [None]:
# report the share of sentences each coder marked as a misunderstanding
for coder in range(1, 6):
  share = round(len(df0[df0[f"M_{coder}"] == 1]) / len(df0) * 100, 2)
  print(f"Percentage of sentences coded as misunderstandings by Coder {coder}: {share}%")

Percentage of sentences coded as misunderstandings by Coder 1: 5.75%
Percentage of sentences coded as misunderstandings by Coder 2: 4.58%
Percentage of sentences coded as misunderstandings by Coder 3: 5.1%
Percentage of sentences coded as misunderstandings by Coder 4: 5.1%
Percentage of sentences coded as misunderstandings by Coder 5: 4.45%


We observe that, for every coder, misunderstandings account for only between 4% and 6% of the coded sentences. This means that the large majority of cases are not misunderstandings, i.e., coded as zero. Because coders agree on these frequent zero cases, the absolute agreement and Gwet's AC1 statistic are inflated, while Krippendorff's Alpha is less affected.
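The effect of a skewed distribution can be illustrated with a small worked example for two coders. The counts below are hypothetical and are not taken from the tutorial dataset. The sketch uses the two-coder, two-category textbook formulas for Scott's pi (which, in the results above, is nearly identical to Krippendorff's Alpha) and for Gwet's AC1; it does not reproduce the library implementations used in `get_IRR`.

```python
def chance_corrected(observed, expected):
    """Generic chance-corrected agreement: (Ao - Ae) / (1 - Ae)."""
    return (observed - expected) / (1 - expected)


# Hypothetical counts for two coders and 100 sentences
both_yes, only_a, only_b, both_no = 3, 2, 2, 93
n_items = both_yes + only_a + only_b + both_no

# Observed (absolute) agreement
observed = (both_yes + both_no) / n_items

# Pooled share of "misunderstanding" labels across both coders
p_yes = (2 * both_yes + only_a + only_b) / (2 * n_items)
p_no = 1 - p_yes

# Expected chance agreement under each statistic (two categories)
expected_pi = p_yes ** 2 + p_no ** 2   # Scott's pi
expected_ac1 = 2 * p_yes * p_no        # Gwet's AC1

scotts_pi = chance_corrected(observed, expected_pi)
gwets_ac1 = chance_corrected(observed, expected_ac1)

print(f"Absolute agreement: {observed:.3f}")
print(f"Scott's pi: {scotts_pi:.3f}")
print(f"Gwet's AC1: {gwets_ac1:.3f}")
```

With only 5% of labels in the rare category, Scott's pi expects a high level of agreement by chance and is therefore much lower than the absolute agreement, whereas Gwet's AC1 expects little chance agreement and stays close to it.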

# Conclusion

In this tutorial, we calculated inter-rater reliability statistics for five coders' misunderstanding scores on sentences from online dialogues. We find the data to have moderate inter-rater reliability (Krippendorff's Alpha = 0.803). The tutorial is limited to a binary classification procedure for the coded data. For categorical codes, Krippendorff's Alpha and Gwet's AC2 (using the same functions as in this tutorial) may still be used to estimate inter-rater reliability.
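As an illustration of the categorical case, the following sketch computes Krippendorff's Alpha for nominal labels by hand, using the standard coincidence-matrix formulation. It is a minimal example under the assumption of nominal (unordered) categories and is not the NLTK implementation used above; the function name `nominal_alpha` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(units):
    """Krippendorff's Alpha for nominal labels.

    units: list of lists; each inner list holds the labels given to one item.
    Assumes at least two different categories occur in the data.
    """
    # Coincidence matrix: every ordered pair of labels within an item,
    # weighted by 1 / (number of labels for that item - 1)
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category and overall number of pairable labels
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    return 1 - observed / expected


# Three items, three coders, three categories (hypothetical values)
toy_units = [
    ["a", "a", "a"],
    ["b", "b", "b"],
    ["c", "c", "a"],
]
alpha_nominal = nominal_alpha(toy_units)
print(round(alpha_nominal, 3))
```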