# Humans vs. computers when annotating data

We've got a classification tool with a mean quality of around 0.75, which means that roughly 3 out of 4 directions are classified correctly. That might seem a fair but not perfect result\*. However, humans are complicated creatures, and the data we annotate is complicated too.

I handed out the directions of several plays to 7 people with different backgrounds, among them linguists, constructors, and medical students. They all used our Annotation Guide and didn't discuss the types with each other. So at the end of the experiment I had 7 independent annotations of each of the five chosen plays.

This notebook compares their work with my own annotation. To quantify the difference between annotations, Cohen's kappa is used.

---

> _Not great, not terrible._

— Dyatlov, _Chernobyl_ (an HBO series)

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import cohen_kappa_score
from statistics import mean

## 1 &emsp; Preparation and building helper functions

### 1.1 &emsp; Variables with fixed lists

First, let's fix **annotators' names** and play titles so that we can build file names on the fly in a loop.

_Acknowledgement:_ I am extremely grateful to Anton B., Dasha K., Dasha M., Liza L., Olya P., Talgat K., and Vlada B. for making this part of research possible. Without you, this hypothesis and general idea would have never been possible.

In [2]:
annotators = ["olya", "liza", "anton", "dashak", "dasham", "talgat", "vlada"]

Then, the **play titles**.

In [3]:
play_titles = ["chekhov-chaika", "lermontov-maskarad", "fonvizin-nedorosl",
              "ostrovsky-svoi-ljudi", "pushkin-boris-godunov"]

The final part: the list of direction types we have chosen for classification.

In [4]:
dir_original_types = ["setting", "entrance", "exit", "business", "delivery", "modifier", "location", "unknown"]

### 1.2 &emsp; Helper functions to understand the code better

First, we load my own annotation. Since I have more experience and wrote the annotation rules, we'll assume that my annotation is, to a certain degree, closer to the gold standard.

The algorithm for measuring the annotation agreement is:

1) load the data of both "gold" annotation and the annotator,

2) get the data on a certain direction type,

3) measure the agreement on that type with Cohen's kappa.

This is done within `get_cohen_kappa`.

In [5]:
def get_cohen_kappa(original_path, check_annot_path, dtype):
    """Calculates a Cohen kappa for a given pair of annotations.
    
    :arg original_path — path to gold standard to measure 
    against;
    :arg check_annot_path — path to annotator's work to check;
    :arg dtype — (str) direction type to look at.
    
    :returns sklearn.cohen_kappa_score — the agreement measure for two
    annotation instances
    """
    original_annot = pd.read_csv(original_path, sep=";").fillna(0)
    dtype_original_vals = original_annot[dtype].values
    
    check_annot = pd.read_csv(check_annot_path, sep="\t").fillna(0)
    dtype_annot_vals = check_annot[dtype].values
    
    return cohen_kappa_score(dtype_original_vals, dtype_annot_vals)
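
For intuition about the number this function returns, here is a minimal pure-Python sketch of Cohen's kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The name `cohen_kappa_by_hand` and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa_by_hand(labels_a, labels_b):
    """Illustrative (hypothetical) re-implementation of Cohen's kappa."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # p_o: share of items on which both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement derived from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1:
        # both annotators used one and the same label everywhere;
        # kappa is undefined here, and this sketch returns NaN by assumption
        return float("nan")
    return (observed - expected) / (1 - expected)


# toy data: 1 = the direction has the type, 0 = it does not
gold = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
annotator = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
kappa_toy = cohen_kappa_by_hand(gold, annotator)  # p_o = 0.8, p_e = 0.58

# an annotator who never assigns the type, while the gold uses both labels
kappa_constant = cohen_kappa_by_hand([1, 0, 1, 0], [0, 0, 0, 0])
```

The toy pair agrees on 80% of the items, yet kappa is only about 0.52, because a large share of that agreement is expected by chance. In the second case kappa is exactly 0: if an annotator never uses a type, the observed agreement equals the chance agreement. That is one possible way to end up with a kappa of exactly 0 for a type.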

Secondly, we need to update the `overall_kappa` (see the description below) carefully.

In [6]:
def update_name_cohen_dict(name_cohen_dict, dtype, dtype_cohen):
    """Adds a new Cohen kappa corresponding to the direction type.
    Is coded as a separate function to ease refactoring and prevent
    double-adding the values.
    
    :arg name_cohen_dict — (dict [str: list of float]) the dictionary 
    to be updated. Contains the information on all types;
    :arg dtype — (str) direction type being updated;
    :arg dtype_cohen — (float) Cohen kappa for a given type of a fixed
    annotator.
    
    :returns name_cohen_dict — updated dict from args.
    """
    
    # append in place; the list for this key already exists in the dict
    name_cohen_dict[dtype].append(dtype_cohen)
    
    return name_cohen_dict

`overall_kappa` is used to store information on names, plays, and their types. It's the most useful and comprehensive dict (or table, when converted to `pandas.DataFrame`) that is used for extracting pieces of data later.

In [7]:
overall_kappa = {dtype: [] for dtype in dir_original_types+["average", "name", "play"]}
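
`overall_kappa` is column-oriented: every key holds one list, and position `i` in each list belongs to the same (annotator, play) pair. That is why every key has to be updated exactly once per pair. The sketch below illustrates this invariant in plain Python; `columns_to_rows` and the toy numbers are hypothetical and not part of the notebook.

```python
def columns_to_rows(columns):
    """Turn a dict of equally long lists into a list of row dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns are misaligned: lengths {}".format(sorted(lengths)))
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# toy table with two types only; the values are invented
toy_kappa = {key: [] for key in ["setting", "exit", "name", "play"]}
for key, value in [("setting", 0.4), ("exit", 0.8),
                   ("name", "olya"), ("play", "chekhov-chaika")]:
    toy_kappa[key].append(value)

rows = columns_to_rows(toy_kappa)

# adding a value twice for one key breaks the alignment
toy_kappa["exit"].append(0.8)
try:
    columns_to_rows(toy_kappa)
    misaligned = False
except ValueError:
    misaligned = True
```

After one round of updates the toy dict yields a single row; the accidental second `append` makes the columns unequal in length, which is the kind of mistake the careful update is meant to prevent.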

## 2 &emsp; Assembling code, gathering data

The most important part of the work: calculating and assembling all the numbers.

In [8]:
for play in play_titles:
    print("I'm working with {}".format(play))
    for annotator in ["olya"]:
        print("I'm working now with data from: {}".format(annotator.title()))
        
        # path to the original ("gold") annotation
        original_path = "./hfi/{}.csv".format(play)
        # path to the annotator's data
        check_annot_path = "./hfi/{}_{}.tsv".format(annotator, play)
        
        annotator_kappas = []
        for dtype in dir_original_types:
            dtype_kappa = get_cohen_kappa(original_path, check_annot_path, dtype)
            overall_kappa = update_name_cohen_dict(overall_kappa, dtype, dtype_kappa)
            annotator_kappas.append(dtype_kappa)
        
        # update non-type fields
        overall_kappa = update_name_cohen_dict(overall_kappa, "name", annotator)
        overall_kappa = update_name_cohen_dict(overall_kappa, "play", play)
        overall_kappa = update_name_cohen_dict(overall_kappa, "average", 
                                               mean(annotator_kappas))

I'm working with chekhov-chaika
I'm working now with data from: Olya
I'm working with lermontov-maskarad
I'm working now with data from: Olya
I'm working with fonvizin-nedorosl
I'm working now with data from: Olya
I'm working with ostrovsky-svoi-ljudi
I'm working now with data from: Olya
I'm working with pushkin-boris-godunov
I'm working now with data from: Olya


This is what we actually get as a result:

In [9]:
df_names_plays = pd.DataFrame.from_dict(overall_kappa).set_index(["name", "play"])
df_names_plays.head()

| name | play | setting | entrance | exit | business | delivery | modifier | location | unknown | average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| olya | chekhov-chaika | 0.402204 | 0.785461 | 0.827473 | 0.594374 | 0.746619 | 0.0 | 0.318084 | 0.831257 | 0.563184 |
| olya | lermontov-maskarad | 0.140995 | 0.134487 | 0.194148 | 0.273921 | 0.339127 | 0.0 | -0.008969 | -0.008646 | 0.133133 |
| olya | fonvizin-nedorosl | 0.0 | 0.144934 | 0.865838 | 0.679283 | 0.669524 | -0.023085 | 0.663776 | 1.0 | 0.500034 |
| olya | ostrovsky-svoi-ljudi | 0.070335 | 0.470152 | 0.955215 | 0.710591 | 0.882096 | 0.0 | 0.43031 | 0.0 | 0.439838 |
| olya | pushkin-boris-godunov | 0.20563 | 0.560919 | 0.833702 | 0.745733 | 0.906496 | 0.0 | 0.533875 | 1.0 | 0.598294 |


## 3 &emsp; Formulating questions, proving hypotheses

### 3.1 &emsp; How similarly did an annotator generally perform on direction types?

We could have asked _how well?_, but that is not really the appropriate question, given that we're exploring the human factor. Here, we need to calculate the mean Cohen's kappa of each direction type over all five plays.

_Possible meaning:_ how explicit and unambiguous is our Annotation Guide?

In [10]:
# forming a dict to store the data
annotator_types = {dtype: [] for dtype in dir_original_types + ["name", "overall"]}

for annotator in ["olya"]:  # switch to `annotators` once more data is available
    annotator_types["name"] += [annotator.title()]
    types_list = []
    for dtype in dir_original_types:
        # grab this annotator's scores on the type (one per play)
        annotators_on_type = df_names_plays.xs(annotator, level="name")[dtype]
        # get its mean
        type_mean = mean(annotators_on_type)
        print(dtype, "\t", type_mean)
        # append it to the main infodict
        annotator_types[dtype] += [type_mean]
        # get the information on the annotator
        types_list.append(type_mean)
    # calculate the average on all types for a certain annotator
    annotator_types["overall"] += [mean(types_list)]
    print("Overall mean:\t", mean(types_list))

setting 	 0.16383302956386156
entrance 	 0.419190630851948
exit 	 0.7352753048785889
business 	 0.600780447905196
delivery 	 0.7087724693394618
modifier 	 -0.00461707585196045
location 	 0.38741547177893343
unknown 	 0.564522228366418
Overall mean:	 0.4468965633540559


In [11]:
pd.DataFrame.from_dict(annotator_types).set_index("name")

| name | setting | entrance | exit | business | delivery | modifier | location | unknown | overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Olya | 0.163833 | 0.419191 | 0.735275 | 0.60078 | 0.708772 | -0.004617 | 0.387415 | 0.564522 | 0.446897 |


### 3.2 &emsp; How similarly did our annotators generally perform on a certain play?

The reverse study: now we focus on plays instead of people.

_Possible meaning:_ how different the opinions on the same directions really are.

In [None]:
# this is kinda more complicated, TBD
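
One possible way to fill in this cell, offered as a sketch rather than the final analysis: group the scores by play and average each direction type over all annotators. The function `mean_kappa_per_play`, the annotator names, the play names, and the scores below are all invented for illustration; the input only assumes the same dict-of-parallel-lists layout as `overall_kappa`.

```python
from collections import defaultdict
from statistics import mean


def mean_kappa_per_play(kappa_columns, dtypes):
    """Average every direction type over all annotators, play by play.

    kappa_columns is laid out like `overall_kappa`: a dict of parallel
    lists that includes a "play" key.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for i, play in enumerate(kappa_columns["play"]):
        for dtype in dtypes:
            collected[play][dtype].append(kappa_columns[dtype][i])
    return {play: {dtype: mean(values) for dtype, values in by_type.items()}
            for play, by_type in collected.items()}


# invented scores for two hypothetical annotators and two hypothetical plays
toy_columns = {
    "setting": [0.2, 0.4, 0.6, 0.8],
    "exit": [1.0, 0.0, 0.5, 0.5],
    "name": ["annotator-a", "annotator-a", "annotator-b", "annotator-b"],
    "play": ["play-1", "play-2", "play-1", "play-2"],
}
play_means = mean_kappa_per_play(toy_columns, ["setting", "exit"])
```

With the DataFrame built earlier, `df_names_plays.groupby(level="play").mean()` should produce the same kind of table. While only one annotator's data is loaded, the per-play means simply repeat that annotator's rows, so the comparison becomes informative only once the other annotations are added.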