In [1]:
import glob
import pandas as pd
from collections import defaultdict
from statistics import mean

# Krippendorff's alpha

Here are four attempts to do essentially the same thing: generate data files in Python and analyze them using R. We follow [this tutorial](https://rpubs.com/jacoblong/content-analysis-krippendorff-alpha-R) to compute Krippendorff's alpha.

## Attempt 1: create the CSV fully in Python

We first create a matrix in which rows are raters, columns are items, and cells contain ratings. This leaves very little to do in R: we only compute Krippendorff's alpha.

In [2]:
def csv_for_dimension(dimension):
    "Create a CSV file for a specific dimension to compute Krippendorff's alpha."
    answer_name = {"Coherence": "Answer.best_coh",
                   "Grammaticality": "Answer.best_grammar",
                   "Repetition": "Answer.best_redun"}
    answer = answer_name[dimension]
    index = defaultdict(dict)
    for path in glob.glob(f"./Responses/{dimension}/*.xlsx"):
        df = pd.read_excel(path)
        for record in df.to_records():
            # Add worker ID to the dictionary if it's not already in there.
            if record['WorkerId'] not in index:
                index[record['WorkerId']]['worker_id'] = record['WorkerId']
            # Add response for a specific item to the dictionary.
            index[record['WorkerId']][record['Input.code']] = record[answer].upper() # Ensure uppercase
    # Convert all items in the index to a data frame.
    # This data frame will be very sparse: each row represents the responses from one worker.
    # That worker will only have rated a small subset of all items.
    kalpha_table = pd.DataFrame(index.values())
    kalpha_table.to_csv(f"./Stats/kalpha1/{dimension}_kalpha.csv", index=False)

In [3]:
# prepare Krippendorff's alpha.
for dimension in ["Coherence", "Repetition", "Grammaticality"]:
    csv_for_dimension(dimension)

We seem to be getting a very low alpha:

* Coherence: 0.128
* Grammaticality: 0.0363
* Repetition: 0.179

Maybe something is wrong? Either the very sparse CSV is the problem, or there was a coding error. Otherwise, the R script in the folder `kalpha1` seems to do the job.
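As an independent check on the R output, nominal Krippendorff's alpha can also be computed directly in Python. The following is a minimal sketch, not the tutorial's or the paper's method: the function name `nominal_alpha` and the input layout (a dictionary mapping each item to its list of ratings) are assumptions made for illustration. It builds the coincidence matrix, skips items with fewer than two ratings, and compares observed with expected disagreement.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(ratings_per_item):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    ratings_per_item: dict mapping an item ID to a list of category labels.
    The name and input layout are assumptions, not part of the notebook.
    """
    # Coincidence matrix: every ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings for that item.
    coincidences = Counter()
    for ratings in ratings_per_item.values():
        m = len(ratings)
        if m < 2:
            continue  # A single rating cannot be paired with anything.
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per category and overall number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return float("nan")  # Only one category (or no pairable ratings).
    return 1 - observed / expected


toy = {"item1": ["A", "A"], "item2": ["A", "B"], "item3": ["B", "B"]}
print(round(nominal_alpha(toy), 3))  # 0.444
```

In this formulation only the ratings per item enter the computation, not the identity of the raters who gave them.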

## Attempt 2: concatenate files and solve the rest in R

We can also create a simple CSV with just the columns for Item ID, Worker ID, and Rating. This means we can follow the full tutorial in R, using the tidyverse commands to prepare the data.
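For comparison, the reshaping that the tutorial does with tidyverse commands can be sketched in pandas. This is only an illustration on made-up data: the worker and item IDs are invented, and the column name `rating` is a placeholder for the dimension-specific answer column (for example `Answer.best_coh`).

```python
import pandas as pd

# Long format: one row per (worker, item, rating). All values are made up.
long_table = pd.DataFrame({
    "WorkerId": ["w1", "w2", "w3", "w1", "w4", "w5"],
    "Input.code": ["item1", "item1", "item1", "item2", "item2", "item2"],
    "rating": ["A", "A", "B", "B", "B", "A"],
})

# Wide format: one row per worker, one column per item, as in Attempt 1.
# Cells are NaN wherever a worker did not rate an item.
wide_table = long_table.pivot(index="WorkerId", columns="Input.code", values="rating")
print(wide_table)
```

Note that `pivot` raises an error if a worker rated the same item more than once, so duplicates would have to be resolved first.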

In [4]:
def csv_for_dimension(dimension):
    "Create a CSV file for a specific dimension to compute Krippendorff's alpha."
    answer_name = {"Coherence": "Answer.best_coh",
                   "Grammaticality": "Answer.best_grammar",
                   "Repetition": "Answer.best_redun"}
    answer = answer_name[dimension]
    frames = []
    for path in glob.glob(f"./Responses/{dimension}/*.xlsx"):
        df = pd.read_excel(path)
        frames.append(df)
    df = pd.concat(frames)
    kalpha_table = df.filter(['WorkerId', 'Input.code', answer], axis=1)
    # Ensure uppercase:
    kalpha_table[answer] = kalpha_table[answer].str.upper()
    kalpha_table.to_csv(f"./Stats/kalpha2/{dimension}_kalpha.csv", index=False)

# prepare Krippendorff's alpha.
for dimension in ["Coherence", "Repetition", "Grammaticality"]:
    csv_for_dimension(dimension)

Alpha is exactly the same:

* Coherence: 0.128 
* Grammaticality: 0.0363 
* Repetition: 0.179

So this is clearly not a preprocessing issue, but the result is nowhere near the alpha of 0.47 reported in the paper. What did the authors do to achieve this? Perhaps they created three virtual raters instead of using a very sparse matrix?

## Attempt 3: remove bad responses

One possibility is a data issue: some responses may be something other than A or B. We can filter those out and see whether that makes a difference.
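Before filtering, it can help to see which unexpected values actually occur. The sketch below is an assumption-laden illustration: the helper name `count_unexpected` and the sample responses are invented, and the real responses would come from the answer column of the concatenated data frame.

```python
from collections import Counter


def count_unexpected(responses, valid=("A", "B")):
    """Count uppercased responses that are not among the valid labels.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter(str(response).upper() for response in responses)
    return {value: n for value, n in counts.items() if value not in valid}


# Made-up sample; lowercase answers count as valid after uppercasing.
sample = ["a", "B", "A", "b", "A or B", "n/a"]
unexpected = count_unexpected(sample)
print(unexpected)
```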

In [5]:
def csv_for_dimension(dimension):
    "Create a CSV file for a specific dimension to compute Krippendorff's alpha."
    answer_name = {"Coherence": "Answer.best_coh",
                   "Grammaticality": "Answer.best_grammar",
                   "Repetition": "Answer.best_redun"}
    answer = answer_name[dimension]
    frames = []
    for path in glob.glob(f"./Responses/{dimension}/*.xlsx"):
        df = pd.read_excel(path)
        frames.append(df)
    df = pd.concat(frames)
    kalpha_table = df.filter(['WorkerId', 'Input.code', answer], axis=1)
    # Ensure uppercase:
    kalpha_table[answer] = kalpha_table[answer].str.upper()
    # Here comes the filtering step:
    kalpha_table = kalpha_table[kalpha_table[answer].isin(['A','B'])]
    kalpha_table.to_csv(f"./Stats/kalpha3/{dimension}_kalpha.csv", index=False)

# prepare Krippendorff's alpha.
for dimension in ["Coherence", "Repetition", "Grammaticality"]:
    csv_for_dimension(dimension)

Results from R:

* Coherence: 0.131
* Grammaticality: 0.0438 
* Repetition: 0.203

That improves the scores a little, but the effect is small.

## Attempt 4: create a CSV file with only three coders, resulting in a denser matrix

In [6]:
def csv_for_dimension(dimension):
    "Create a CSV file for a specific dimension to compute Krippendorff's alpha."
    answer_name = {"Coherence": "Answer.best_coh",
                   "Grammaticality": "Answer.best_grammar",
                   "Repetition": "Answer.best_redun"}
    answer = answer_name[dimension]
    frames = []
    for path in glob.glob(f"./Responses/{dimension}/*.xlsx"):
        df = pd.read_excel(path)
        frames.append(df)
    df = pd.concat(frames)
    # Collect all responses per item, ignoring which worker gave them.
    rating_index = defaultdict(list)
    for record in df.to_records():
        item = record['Input.code']
        response = record[answer]
        rating_index[item].append(response)
    # Assign each item's responses to virtual raters (rater0, rater1, ...) by position,
    # so the resulting rater-by-item matrix is dense.
    rows = []
    for item, responses in rating_index.items():
        for i, response in enumerate(responses):
            rater = f"rater{i}"
            rows.append(dict(rater=rater, response=response.upper(), item=item))
    df = pd.DataFrame(rows)
    df.to_csv(f"./Stats/kalpha4/{dimension}_kalpha.csv", index=False)

for dimension in ["Coherence", "Repetition", "Grammaticality"]:
    csv_for_dimension(dimension)

Running this through R (see the scripts in the directory) yields numbers similar to before:

* Coherence: 0.128 
* Grammaticality: 0.0355
* Repetition: 0.178

# Percentage agreement

Here's a way to compute percentage agreement with the majority of respondents, for each item.
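The idea can be illustrated on a toy example first. This is a simplified sketch with made-up data and a hypothetical helper name, `majority_agreement`: a response counts as agreeing with the majority if it is a valid label and at least one other worker gave the same answer for that item, which assumes three responses per item.

```python
from collections import Counter


def majority_agreement(answers_by_item, valid=("A", "B")):
    """Share of responses that are valid and shared by at least two workers.

    answers_by_item: dict mapping item ID to a dict of worker ID -> answer.
    Hypothetical helper for illustration only.
    """
    hits = []
    for answers in answers_by_item.values():
        counts = Counter(answers.values())
        for answer in answers.values():
            # With three responses per item, a count of 2 or 3 is the majority.
            hits.append(int(answer in valid and counts[answer] >= 2))
    return sum(hits) / len(hits)


# Made-up data: item1 has one dissenting worker, item2 is unanimous.
toy = {
    "item1": {"w1": "A", "w2": "A", "w3": "B"},
    "item2": {"w1": "B", "w4": "B", "w5": "B"},
}
print(round(majority_agreement(toy), 3))  # 0.833
```

Here five of the six responses fall in the majority. This per-response share plays the same role as the weighted mean computed in the function that follows, while the unweighted mean averages per-worker scores instead.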

In [7]:
def analyse_dimension(dimension):
    "Compute percentage agreement for a particular quality dimension."
    answer_name = {"Coherence": "Answer.best_coh",
                   "Grammaticality": "Answer.best_grammar",
                   "Repetition": "Answer.best_redun"}
    total_items = 600
    answer = answer_name[dimension]
    
    # Build an index of items that workers looked at:
    item_index = defaultdict(list)
    # Build an index of answers that workers have given for each item:
    answer_index = defaultdict(dict)

    # For each batch...
    for path in glob.glob(f"./Responses/{dimension}/*.xlsx"):
        df = pd.read_excel(path)         # Read Excel
        for record in df.to_records():   # Turn rows into records for easy addressing
            # Set shorthand variables:
            worker = record['WorkerId']
            response = record[answer].upper() # Ensure uppercase.
            item = record['Input.code']

            # Add the item to the list.
            item_index[worker].append(item)
            
            # Index the response.
            answer_index[item][worker] = response

    #####################################
    # Now let's determine worker quality.
    weighted_worker_quality = 0
    
    # Could have been a number as well, but we will return this dictionary
    # for future reference. If we need to, we can easily check how good/bad
    # workers' responses are.
    worker_quality_index = dict()
    for worker, items in item_index.items():
        majority = []
        for item in items:
            worker_answer = answer_index[item][worker] # Get the relevant answer for this worker.
            answers = list(answer_index[item].values()) # Get all answers.
            occurrences = answers.count(worker_answer) # Count how often the worker's answer occurs.

            # There are three responses per item. 1 is the minority, 2 and 3 count as the majority.
            # The answer restriction is in place to ensure spam does not reward the annotators.
            if occurrences == 1 or worker_answer not in ['A','B']:
                majority.append(0)
            else:
                majority.append(1)
        
        # The mean score is quite a crude indicator for workers who rated very few items.
        # It gets more accurate with more items.
        # But we just need an overall score, so this is not a big issue.
        worker_quality = mean(majority)
        worker_quality_index[worker] = worker_quality

        # The weighted score is just updated with every worker.
        weighted_worker_quality += (worker_quality * len(items))
    
    # Here we combine our two overall scores.
    # 1. Average performance score. This should represent the quality of all workers.
    # 2. Weighted average performance. 
    # The latter is fairer because workers who contribute more should also count more
    # towards the overall quality.
    avg_worker_quality = mean(worker_quality_index.values())
    weighted_worker_quality = weighted_worker_quality/total_items
    
    return avg_worker_quality, weighted_worker_quality, worker_quality_index

In [8]:
header = ["Category", "Mean", "Weighted Mean"]
data = []
# Compute both agreement scores for each quality dimension.
for dimension in ["Coherence", "Repetition", "Grammaticality"]:
    avg_worker_quality, weighted_worker_quality, worker_quality_index = analyse_dimension(dimension)
    data.append([dimension, avg_worker_quality, weighted_worker_quality])

df = pd.DataFrame(columns=header, data=data)
with open("./Tables/percentage_agreement.tex", 'w') as f:
    f.write(df.style.format(formatter="{:.2f}".format, 
                            subset=["Mean", "Weighted Mean"]).hide(axis=0).to_latex(hrules=True))

df

| | Category | Mean | Weighted Mean |
|---|---|---|---|
| 0 | Coherence | 0.717073 | 0.778333 |
| 1 | Repetition | 0.7342 | 0.786667 |
| 2 | Grammaticality | 0.737685 | 0.756667 |
