# Humans vs. computers when annotating data

We've got a classification tool with a mean quality of around 0.75, which means that roughly 3 out of 4 plays are placed correctly. That might seem a fair but not perfect result\*. However, humans are complicated creatures, and the data we annotate is complicated too.

I handed out the directions of several plays to 7 people with different backgrounds, among them linguists, constructors, and medical students. They all used our Annotation Guide and didn't discuss the types with each other. That means that at the end of the experiment I had 7 different annotations for each of the five chosen plays.

This notebook compares their work to my own annotation. To quantify the difference between two annotations, Cohen's kappa is used.
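For reference, Cohen's kappa relates the observed agreement `p_o` to the agreement expected by chance `p_e`: `kappa = (p_o - p_e) / (1 - p_e)`. The sketch below computes it by hand for two made-up label sequences. The function name `kappa_by_hand` and the toy labels are illustrative assumptions, not part of the experiment.

```python
from collections import Counter


def kappa_by_hand(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # observed agreement: share of items labelled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of the marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# made-up labels: 1 = the direction has the type, 0 = it doesn't
gold = [1, 1, 0, 0, 1, 0, 0, 0]
annotator = [1, 0, 0, 0, 1, 0, 1, 0]
print(kappa_by_hand(gold, annotator))
```

Here the two sequences agree on 6 of 8 items, yet kappa is noticeably lower than 0.75 because part of that agreement is expected by chance.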

---

> _Not great, not terrible._

— Anatoly Dyatlov, _Chernobyl_ (an HBO series)

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import cohen_kappa_score
from statistics import mean

## 1 &emsp; Preparation and building helper functions

### 1.1 &emsp; Variables with fixed lists

First, let's fix the **annotators' names** and play titles so that we'll be able to build file names on the go in a loop.

_Acknowledgement:_ I am extremely grateful to Anton B., Darya K., Daria M., Elizaveta L., Olga P., and Vlada B. for making this part of research possible. Without you, this hypothesis and general idea would have never been possible.

In [2]:
annotators = ["OP", "EL", "AB", "DK", "VB"]

Then, the **play titles**.

In [3]:
play_titles = ["chekhov-chaika", "lermontov-maskarad", "fonvizin-nedorosl",
              "ostrovsky-svoi-ljudi", "pushkin-boris-godunov"]

The final part: the list of direction types we have chosen for classification.

In [4]:
dir_original_types = ["setting", "entrance", "exit", "business", "delivery", 
                      "modifier", "location"]

### 1.2 &emsp; Helper functions to understand the code better

First, we load the data of my annotation. We'll assume that, since I wrote the rules and have more experience with them, my annotation is to a certain degree closer to the gold standard.

The algorithm for measuring the annotation agreement is:

1) load the data of both "gold" annotation and the annotator,

2) get the data on a certain direction type,

3) measure it with Cohen kappa.

This is done within `get_cohen_kappa`.

In [5]:
def get_cohen_kappa(original_path, check_annot_path, dtype):
    """Calculates a Cohen kappa for a given pair of annotations.
    
    :arg original_path — path to gold standard to measure 
    against;
    :arg check_annot_path — path to annotator's work to check;
    :arg dtype — (str) direction type to look at.
    
    :returns sklearn.cohen_kappa_score — the agreement measure for two
    annotation instances
    """
    original_annot = pd.read_csv(original_path, sep=";").fillna(0)
    dtype_original_vals = original_annot[dtype].values
    
    check_annot = pd.read_csv(check_annot_path, sep="\t").fillna(0)
    dtype_annot_vals = check_annot[dtype].values
    
    return cohen_kappa_score(dtype_original_vals, dtype_annot_vals)
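To see what `get_cohen_kappa` does without touching the real files, here is a minimal sketch on in-memory data. The column names `direction` and `exit`, the "1 or empty cell" marking, and the helper `kappa_for_type` are assumptions made for illustration. Only the separators (`;` for the gold file, a tab for the annotator's file) and `fillna(0)` are taken from the function above.

```python
import io

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# hypothetical gold file: ';'-separated, empty cell = type not assigned
gold_csv = "direction;exit\nLeaves.;1\nSits down.;\nExit all.;1\nLaughs.;\n"
# hypothetical annotator file: tab-separated, same rows in the same order
annot_tsv = "direction\texit\nLeaves.\t1\nSits down.\t\nExit all.\t\nLaughs.\t\n"


def kappa_for_type(gold_buffer, annot_buffer, dtype):
    """Same steps as get_cohen_kappa, but reading from text buffers."""
    gold = pd.read_csv(gold_buffer, sep=";").fillna(0)
    annot = pd.read_csv(annot_buffer, sep="\t").fillna(0)
    return cohen_kappa_score(gold[dtype].values, annot[dtype].values)


exit_kappa = kappa_for_type(io.StringIO(gold_csv), io.StringIO(annot_tsv), "exit")
print(exit_kappa)
```

Note that the comparison is row by row, so it only makes sense when both files list the same directions in the same order.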

Secondly, we need to update the `overall_kappa` (see the description below) carefully.

In [6]:
def update_name_cohen_dict(name_cohen_dict, dtype, dtype_cohen):
    """Adds a new Cohen kappa corresponding to the direction type.
    Is coded as a separate function to ease refactoring and prevent
    double-adding the values.
    
    :arg name_cohen_dict — (dict [str: list of float]) the dictionary 
    to be updated. Contains the information on all types;
    :arg dtype — (str) direction type being updated;
    :arg dtype_cohen — (float) Cohen kappa for a given type of a fixed
    annotator.
    
    :returns name_cohen_dict — updated dict from args.
    """
    
    # append in place; the dict is returned for convenience
    name_cohen_dict[dtype].append(dtype_cohen)
    
    return name_cohen_dict

`overall_kappa` is used to store information on names, plays, and their types. It's the most useful and comprehensive dict (or table, when converted to `pandas.DataFrame`) that is used for extracting pieces of data later.

In [40]:
overall_kappa = {dtype: [] for dtype in dir_original_types+["average", "name", "play"]}

## 2 &emsp; Assembling code, gathering data

The most important part of work: calculating and assembling all the numbers.

In [41]:
for play in play_titles:
    print("I'm working with {}".format(play))
    for annotator in annotators:
        try:
            print("I'm working now with data from: {}".format(annotator))

            # loading original annotation
            original_path = "./hfi/{}.csv".format(play)
            # loading annotator's data
            check_annot_path = "./hfi/{}_{}.tsv".format(annotator, play)

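            # compute every kappa first, so that a failure halfway through
            # cannot leave the columns of overall_kappa with unequal lengths
            annotator_kappas = [get_cohen_kappa(original_path, check_annot_path, dtype)
                                for dtype in dir_original_types]
            for dtype, dtype_kappa in zip(dir_original_types, annotator_kappas):
                overall_kappa = update_name_cohen_dict(overall_kappa, dtype, dtype_kappa)

            # update non-type fields
            overall_kappa = update_name_cohen_dict(overall_kappa, "name", annotator)
            overall_kappa = update_name_cohen_dict(overall_kappa, "play", play)
            overall_kappa = update_name_cohen_dict(overall_kappa, "average", 
                                                   mean(annotator_kappas))
        except Exception as err:
            # report the skipped pair instead of silently ignoring the error
            print("Skipping {} / {}: {}".format(annotator, play, err))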

I'm working with chekhov-chaika
I'm working now with data from: OP
I'm working now with data from: EL
I'm working now with data from: AB
I'm working now with data from: DK
I'm working now with data from: VB
I'm working with lermontov-maskarad
I'm working now with data from: OP
I'm working now with data from: EL
I'm working now with data from: AB
I'm working now with data from: DK
I'm working now with data from: VB
I'm working with fonvizin-nedorosl
I'm working now with data from: OP
I'm working now with data from: EL
I'm working now with data from: AB
I'm working now with data from: DK
I'm working now with data from: VB
I'm working with ostrovsky-svoi-ljudi
I'm working now with data from: OP
I'm working now with data from: EL
I'm working now with data from: AB
I'm working now with data from: DK
I'm working now with data from: VB
I'm working with pushkin-boris-godunov
I'm working now with data from: OP
I'm working now with data from: EL
I'm working now with data from: AB
I'm working now

This is what we actually get as a result:

In [42]:
df_names_plays = pd.DataFrame.from_dict(overall_kappa)
df_names_plays.head()

|   | setting | entrance | exit | business | delivery | modifier | location | average | name | play |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.709531 | 0.832381 | 0.759423 | 0.937293 | 0.920806 | 0.0 | 0.75445 | 0.701984 | OP | chekhov-chaika |
| 1 | 0.829584 | 0.890884 | 0.896484 | 0.953079 | 0.96366 | 0.0 | 0.809639 | 0.763333 | EL | chekhov-chaika |
| 2 | 0.829584 | 0.970686 | 0.884479 | 0.953508 | 0.949124 | 0.0 | 0.790451 | 0.768262 | AB | chekhov-chaika |
| 3 | 0.731597 | 0.824714 | 0.823577 | 0.937002 | 0.906407 | 0.0 | 0.75445 | 0.711107 | DK | chekhov-chaika |
| 4 | 0.900389 | 0.942795 | 0.964417 | 0.984287 | 0.942674 | 0.0 | 0.895641 | 0.804315 | VB | chekhov-chaika |
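The zeros in the `modifier` column need not mean total disagreement. One possible cause (an assumption, since the underlying annotation files aren't shown here) is that one side never assigned the type in this play: kappa then equals 0 no matter how many rows match. A sketch with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# made-up labels for ten directions: the gold annotation never uses the type,
# the annotator uses it once
gold = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
annotator = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# share of identical labels, ignoring chance agreement
raw_agreement = sum(g == a for g, a in zip(gold, annotator)) / len(gold)
kappa = cohen_kappa_score(gold, annotator)

print(raw_agreement)
print(kappa)
```

With a constant gold column the chance agreement equals the observed agreement, so the numerator of kappa vanishes. Such a zero also pulls the `average` column down.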


## 3 &emsp; Formulating questions, proving hypotheses

### 3.1 &emsp; How similarly did each annotator generally perform on direction types?

We could've asked _how well?_, but that is hardly the appropriate question, given that we're exploring the human factor. Here we need to calculate the mean values of Cohen's kappa over all five plays.

_Possible meaning:_ how explicit and unambiguous is our Annotation Guide?

In [43]:
annotator_types = {dtype: [] for dtype in dir_original_types}
annotator_types["name"] = annotators
annotator_types["average"] = []

for annotator in annotators:
    an_av = []
    for dtype in dir_original_types:
        type_mean = mean(df_names_plays[df_names_plays["name"] == annotator][dtype])
        annotator_types[dtype].append(type_mean)
        an_av.append(type_mean)
    annotator_types["average"].append(mean(an_av))

In [44]:
pd.DataFrame.from_dict(annotator_types)

|   | setting | entrance | exit | business | delivery | modifier | location | name | average |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.522668 | 0.898695 | 0.837279 | 0.94241 | 0.862137 | 0.136314 | 0.572226 | OP | 0.681676 |
| 1 | 0.40338 | 0.688436 | 0.632883 | 0.720664 | 0.690525 | 0.112943 | 0.36225 | EL | 0.515869 |
| 2 | 0.637044 | 0.87274 | 0.823291 | 0.908457 | 0.887004 | 0.035186 | 0.649284 | AB | 0.687572 |
| 3 | 0.613673 | 0.932481 | 0.92376 | 0.970391 | 0.945486 | 0.170561 | 0.751666 | DK | 0.758288 |
| 4 | 0.596763 | 0.947439 | 0.941545 | 0.969571 | 0.943517 | 0.170561 | 0.701279 | VB | 0.752954 |
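The same per-annotator means can also be obtained with `groupby` instead of nested loops. A sketch on a small made-up frame (the numbers, the play names, and the two-type column set are illustrative, not the experiment's data):

```python
import pandas as pd

# made-up kappas: two annotators, two plays, two direction types
toy = pd.DataFrame({
    "name": ["OP", "OP", "EL", "EL"],
    "play": ["play-a", "play-b", "play-a", "play-b"],
    "entrance": [0.8, 0.6, 0.9, 0.5],
    "exit": [0.7, 0.9, 0.4, 0.6],
})
type_columns = ["entrance", "exit"]

# mean kappa per annotator and type, averaged over plays
per_annotator = toy.groupby("name", sort=False)[type_columns].mean()
# mean over the type means, mirroring the "average" column above
per_annotator["average"] = per_annotator[type_columns].mean(axis=1)
print(per_annotator)
```

`sort=False` keeps the annotators in their order of first appearance, which matches the order of the `annotators` list when the frame is built play by play.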


### 3.2 &emsp; How similarly did our annotators generally perform on a certain play?

The reverse study: now we focus on plays instead of people.

_Possible meaning:_ how much do opinions on the same directions really differ?

In [70]:
play_efficiency = {annotator: [] for annotator in annotators}
play_efficiency["title"] = play_titles
play_efficiency["average"] = []

for play in play_titles:
    av = []
    for annotator in annotators:
        play_an = df_names_plays[(df_names_plays["play"] == play) 
                                 & (df_names_plays["name"] == annotator)]["average"].values[0]
        play_efficiency[annotator].append(play_an)
        av.append(play_an)
    play_efficiency["average"].append(mean(av))

In [71]:
pd.DataFrame.from_dict(play_efficiency)

|   | OP | EL | AB | DK | VB | title | average |
|---|---|---|---|---|---|---|---|
| 0 | 0.701984 | 0.763333 | 0.768262 | 0.711107 | 0.804315 | chekhov-chaika | 0.7498 |
| 1 | 0.687754 | 0.697744 | 0.767136 | 0.697687 | 0.705687 | lermontov-maskarad | 0.711202 |
| 2 | 0.697059 | 0.597714 | 0.344511 | 0.787997 | 0.753856 | fonvizin-nedorosl | 0.636227 |
| 3 | 0.551704 | -0.01327 | 0.752463 | 0.791393 | 0.736823 | ostrovsky-svoi-ljudi | 0.563823 |
| 4 | 0.769878 | 0.533822 | 0.80549 | 0.803257 | 0.764087 | pushkin-boris-godunov | 0.735307 |
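A play-by-annotator table of this shape can also be produced with `pivot`. A sketch on made-up data (the play names and numbers are illustrative assumptions, not the experiment's results):

```python
import pandas as pd

# made-up per-play averages for two annotators
toy = pd.DataFrame({
    "name": ["OP", "EL", "OP", "EL"],
    "play": ["play-a", "play-a", "play-b", "play-b"],
    "average": [0.70, 0.76, 0.55, 0.45],
})

# rows = plays, columns = annotators, cells = the annotator's mean kappa
by_play = toy.pivot(index="play", columns="name", values="average")
# mean over annotators for each play
by_play["average"] = by_play.mean(axis=1)
print(by_play)
```

Unlike the loop above, `pivot` sorts the rows and columns alphabetically and raises an error if an annotator and play pair occurs twice, which doubles as a check against double-added rows.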
