# Exercise 1.1: Annotation & reliability

## Data

Here's a sample class wrapping the main abstraction, the `Dataset`.

The following utility reads the 20 newsgroups data into a Dataset object. It sets the label to True if a message comes from the talk.politics.guns group, and to False otherwise.
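
To make the relabelling step concrete, here is a minimal sketch that runs without downloading anything. The three group names, the target indices and the message texts are toy stand-ins for `newsgroups.target_names`, `newsgroups.target` and `newsgroups.data`; they are assumptions for illustration, not values from the real corpus.

```python
# Toy stand-ins for newsgroups.target_names, newsgroups.target and newsgroups.data.
target_names = ['rec.autos', 'sci.space', 'talk.politics.guns']
targets = [2, 0, 1, 2]
texts = ['msg about rifles', 'msg about cars', 'msg about orbits', 'msg about permits']

# Map each class index to True only for the guns group.
labels = {i: name == 'talk.politics.guns' for i, name in enumerate(target_names)}

# Key by text, as the factory does; identical texts would collapse to one entry.
text_to_label = {text: labels[i] for text, i in zip(texts, targets)}
for text, label in text_to_label.items():
    print(label, text)
```

Because the mapping is keyed by message text, any exact duplicate messages in the corpus would end up as a single entry.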

In [None]:
from dataset import Dataset
from sklearn.datasets import fetch_20newsgroups

def guns_dataset_factory(subset='train', labelled=False):
    """ Fetches newsgroup data and returns a Dataset. """
    newsgroups = fetch_20newsgroups(subset=subset)
    
    # Transform to guns or not.
    labels = {i: name == 'talk.politics.guns' for i, name in enumerate(newsgroups.target_names)}
    dataset = Dataset({text: labels[i] for text, i in zip(newsgroups.data, newsgroups.target)})
    return dataset

pool = guns_dataset_factory(subset='train')
test = guns_dataset_factory(subset='test')

## A random sampler

We need a way of choosing which data to annotate next. Let's start with a random sampler. This is how most crowd annotation is set up.

Our `Sampler` base class includes some utilities for sampling, training and scoring. Other samplers inherit from `Sampler` and must implement the `__call__` method, which takes one argument (a dataset).

`Sampler` objects must be initialised with a sklearn classifier or pipeline. We use multinomial naive Bayes as a baseline since it is [fast to train and can achieve competitive accuracy](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html).

Our `Random` sampler simply shuffles the unlabelled data, then returns the first `batch_size` items.
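
The shuffle-then-slice idea can be sketched in a few lines. This is not the `samplers.Random` implementation: the function name `random_batch`, the `seed` parameter, and the representation of the pool as `(text, label)` pairs with `None` meaning "not yet annotated" are all assumptions made for illustration.

```python
import random

def random_batch(items, batch_size, seed=None):
    """Shuffle the unlabelled texts and return the first batch_size of them.

    `items` is assumed to be a list of (text, label) pairs where a label
    of None marks an example that has not been annotated yet.
    """
    rng = random.Random(seed)
    unlabelled = [text for text, label in items if label is None]
    rng.shuffle(unlabelled)
    return unlabelled[:batch_size]

# Hypothetical pool: two annotated messages and four unannotated ones.
toy_pool = [('m0', True), ('m1', None), ('m2', None),
            ('m3', False), ('m4', None), ('m5', None)]
batch = random_batch(toy_pool, batch_size=2, seed=0)
print(batch)
```

Every unlabelled example has the same chance of being picked, which is what makes random sampling a natural baseline for other sampling strategies.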

In [None]:
from samplers import Random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

random_sampler = Random(pipeline, batch_size=10)
for i, (text, label) in enumerate(random_sampler(pool)):
    print(i, label, repr(text[:80]))

## Simulated experiments

Now we can run simulated experiments. `run_simulations` takes a sampler, a pool and a test set, and runs `n` complete simulations. It returns:
* `train_sizes` - the training set sizes for each simulation
* `train_scores` - f1 scores over the training set
* `test_scores` - f1 scores over the test set

Each is a list of tuples, with one tuple per iteration of sampling. Each tuple contains `n` values, one per simulation.
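
As a sketch of how that structure can be summarised, the snippet below computes the mean and standard deviation across simulations for each sampling iteration. The numbers are made-up placeholders with the shape described above (two iterations, `n=3`); they are not output from `run_simulations`.

```python
import statistics

# Made-up placeholder values for n=3 simulations and two sampling iterations.
train_sizes = [(2000, 2000, 2000), (4000, 4000, 4000)]
test_scores = [(0.50, 0.54, 0.46), (0.70, 0.72, 0.68)]

summary = []
for sizes, scores in zip(train_sizes, test_scores):
    # One row per iteration: training set size, mean score, spread across simulations.
    summary.append((sizes[0], statistics.mean(scores), statistics.stdev(scores)))

for size, mean, spread in summary:
    print('{:>6} examples: test F1 {:.3f} +/- {:.3f}'.format(size, mean, spread))
```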

In [None]:
from evaluation import run_simulations

# suppress sklearn FutureWarnings in terminal output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# run a simulated experiment and plot learning curve
random_sampler = Random(pipeline, batch_size=2000)
train_sizes, train_scores, test_scores = run_simulations(random_sampler, pool, test, n=3)

print(train_sizes)
print(train_scores)
print(test_scores)

## Learning curves

Plotting performance against the number of examples gives a quick visual indication of whether more annotation will help.

Here, the lower, green line corresponds to the test f1 score. This is our estimate of generalisation to unseen data, our best guess of live system performance.

When test performance flattens out, more labelled data probably won't help with this classifier.

The plot also shows that our performance estimate has more variance with fewer samples.

In [None]:
from evaluation import plot_learning_curve
import numpy as np

plt = plot_learning_curve(np.asarray(train_sizes), np.asarray(train_scores), np.asarray(test_scores))
plt.show()

## Simulation with bootstrap resampling

In a setting where sampled batches were actually being annotated, we wouldn't run complete simulations.

Instead, we can estimate variance using bootstrap resampling of training data in a given round of sampling.
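
The general idea can be sketched without the `evaluation` module. The names `bootstrap_scores` and `majority_baseline` below are hypothetical, the data is a toy, and the stand-in model reports accuracy rather than f1; `run_bootstraps` may well work differently.

```python
import random
import statistics

def bootstrap_scores(train, test, fit_and_score, n_boot=100, seed=0):
    """Score a model on n_boot resamples (with replacement) of the training data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        resample = rng.choices(train, k=len(train))  # same size as the original
        scores.append(fit_and_score(resample, test))
    return scores

def majority_baseline(train, test):
    """Toy stand-in model: predict the most common training label, return test accuracy."""
    labels = [label for _, label in train]
    prediction = max(set(labels), key=labels.count)
    return sum(label == prediction for _, label in test) / len(test)

# Hypothetical labelled data: one positive in every three examples.
toy_train = [('train %d' % i, i % 3 == 0) for i in range(30)]
toy_test = [('test %d' % i, i % 3 == 0) for i in range(15)]

scores = bootstrap_scores(toy_train, toy_test, majority_baseline)
print('mean:', statistics.mean(scores), 'stdev:', statistics.pstdev(scores))
```

The spread of `scores` estimates how much the result depends on the particular training sample, using only the data labelled so far.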

In [None]:
from evaluation import run_bootstraps

# run a bootstrap experiment (the learning curve is plotted in the next cell)
random_sampler = Random(pipeline, batch_size=2000)
train_sizes, train_scores, test_scores = run_bootstraps(random_sampler, pool.copy, test)

print(train_sizes)
print(train_scores)
print(test_scores)

In [None]:
plt = plot_learning_curve(np.asarray(train_sizes), np.asarray(train_scores), np.asarray(test_scores))
plt.show()

## Manually label some examples

Ultimately, we need labelled data - as much labelled data as possible. Sometimes we have labelled data, e.g., captured from user activity on large platforms. Sometimes we need to annotate.

The `AnnotationPane` class below renders an annotation interface using iPython widgets. It takes two arguments: a Dataset object and a Sampler object. It displays examples selected from the dataset by the sampler.

Try it out. Clicking the Yes button saves a True label, clicking No saves False, and clicking Skip saves None, which is equivalent to leaving the example unlabelled.

Here we are answering the question: Does the message come from the `talk.politics.guns` newsgroup?

After finishing a batch, run the cell again to sample more examples for annotation.

In [None]:
from annotator import AnnotationPane

pane = AnnotationPane(pool, Random(None, batch_size=10))

## See our new labels in the dataset

In [None]:
print(pool.label_distribution)

In [None]:
for text, label in pool.labelled_items:
    print(label, repr(text[:80]))

## Inter-annotator agreement

Now let's have a play with inter-annotator agreement on [a doubly-labelled data set](artstein_poesio_example.txt) from Artstein and Poesio (2007), [Inter-Coder Agreement for Computational Linguistics](http://www.mitpressjournals.org/doi/pdf/10.1162/coli.07-034-R2).

First we'll visualise the label distributions for each annotator and the confusion matrix. The distributions suggest different biases per annotator.

In [None]:
%matplotlib inline
from collections import Counter, OrderedDict
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# Read in the data (annotator_id, instance_id, label)
data = [x.split() for x in open("artstein_poesio_example.txt")]

def filter_by_annotator(annot_id, data):
    for annotator_id, instance_id, label in data:
        if annotator_id == annot_id:
            yield annotator_id, instance_id, label

# sanity check that we get the same instances for a and b
a_instances = set([instance_id for _,instance_id,_ in filter_by_annotator('a', data)])
b_instances = set([instance_id for _,instance_id,_ in filter_by_annotator('b', data)])
assert a_instances == b_instances

# get labels for a and b
a_labels = [label for _,_,label in sorted(filter_by_annotator('a', data))]
b_labels = [label for _,_,label in sorted(filter_by_annotator('b', data))]
# sort so the label order is deterministic and matches sklearn's default (sorted) order
all_labels = sorted(set(a_labels).union(b_labels))

# view bar charts
def make_bar_chart(data, x_labels, y_min, y_max, title):
    c = Counter(data)
    d = OrderedDict([(k,c[k]) if k in c else (k,0) for k in x_labels])
    # bars are 0.5 wide; x positions are offset by 0.5 so they sit inside the axis range [0, len(x_labels)]
    bar_width = 0.5
    index = np.arange(len(x_labels)) + 0.5
    plt.bar(index, list(d.values()), bar_width)
    plt.ylabel('Number of annotations')
    plt.axis([0,len(x_labels),y_min,y_max])
    plt.title(title)
    plt.xticks(index, x_labels)#, rotation='vertical')
    plt.show()

print('Label distributions for annotators a and b')
Y_MIN, Y_MAX = 0, 60
make_bar_chart(a_labels, all_labels, Y_MIN, Y_MAX, 'Distribution for annotator a')
make_bar_chart(b_labels, all_labels, Y_MIN, Y_MAX, 'Distribution for annotator b')

# view confusion matrix
key = ['{}={}'.format(i,label) for i,label in enumerate(all_labels)]
print('Confusion matrix ({}):\n'.format(key))
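# pass labels explicitly so rows and columns follow the same order as the printed key
_ = plt.matshow(confusion_matrix(a_labels, b_labels, labels=all_labels), cmap=plt.cm.binary, interpolation='nearest')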
_ = plt.colorbar()
_ = plt.ylabel('annotator a')
_ = plt.xlabel('annotator b')

### Calculating raw agreement and Cohen's Kappa

The following are implementations of observed agreement and Cohen's Kappa, a chance-corrected measure of agreement commonly used in Computational Linguistics ([Carletta, 1996](http://www.aclweb.org/anthology/J96-2004)).
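
For reference, the standard definitions for two annotators $a$ and $b$ over $N$ instances can be written with the same names the code uses (`Ao`, `Ae`), where $n_{a,k}$ is the number of times annotator $a$ chose label $k$:

$$
A_o = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[a_i = b_i],
\qquad
A_e = \sum_{k} \frac{n_{a,k}}{N} \cdot \frac{n_{b,k}}{N},
\qquad
\kappa = \frac{A_o - A_e}{1 - A_e}
$$

As an illustration with made-up values, $A_o = 0.8$ and $A_e = 0.5$ give $\kappa = (0.8 - 0.5)/(1 - 0.5) = 0.6$. Kappa is 1 for perfect agreement and 0 when agreement is no better than chance.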

In [None]:
# Calculate observed and expected agreement
def observed_agreement(a_labels, b_labels):
    """Return the proportion of instances (between 0 and 1) where we
    observe that a_label==b_label"""
    Ao = 0
    for a, b in zip(a_labels, b_labels):
        if a == b:
            Ao += 1
    return Ao/len(a_labels)

def expected_agreement(a_labels, b_labels):
    """Return the proportion of instances (between 0 and 1) for which
    we expect a_label==b_label by chance"""
    a_freqs = Counter(a_labels)
    b_freqs = Counter(b_labels)
    total = len(a_labels)
    Ae = 0
    for label in set(a_labels).union(b_labels):
        Ae += (a_freqs[label]/total)*(b_freqs[label]/total)
    return Ae

print('Observed agreement:', observed_agreement(a_labels, b_labels))
print('Expected agreement:', expected_agreement(a_labels, b_labels))

# Calculate Cohen's Kappa
def kappa(a_labels, b_labels):
    """Calculate Cohen's Kappa"""
    Ao = observed_agreement(a_labels, b_labels)
    Ae = expected_agreement(a_labels, b_labels)
    return (Ao-Ae)/(1-Ae)

print('\nKappa:', kappa(a_labels, b_labels))

### Sklearn and NLTK implementations

Scikit-learn includes an implementation of Cohen's kappa. NLTK implements kappa as well as Krippendorff's alpha.

In [None]:
from nltk.metrics.agreement import AnnotationTask
from sklearn.metrics import cohen_kappa_score

# Calculate cohen_kappa_score using scikit-learn
print('\nsklearn kappa:', cohen_kappa_score(a_labels, b_labels))

# Calculate agreement with NLTK
t = AnnotationTask(data=[x.split() for x in open("artstein_poesio_example.txt")])
print('\nnltk kappa:', t.kappa())
print('nltk alpha:', t.alpha()) # http://www.aclweb.org/anthology/E14-1058

### Obtaining a single gold labelling

Given multiply-labelled data, we need a way of combining annotations to obtain a single gold labelling. The simplest way to do this is majority vote.

In [None]:
from collections import Counter
import random
import warnings

c_labels = ['stat', 'stat', 'chck', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'stat', 'chck', 'ireq', 'ireq', 'stat', 'stat', 'stat', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'stat', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'stat', 'ireq', 'chck', 'stat', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'stat', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'ireq', 'chck', 'chck', 'chck', 'chck', 'stat', 'ireq', 'chck', 'chck', 'chck', 'chck', 'chck', 'chck', 'ireq', 'chck', 'chck']

print('Ao(c,a):', observed_agreement(c_labels, a_labels))
print('Ao(c,b):', observed_agreement(c_labels, b_labels))
print('Kappa(c,a):', kappa(c_labels, a_labels))
print('Kappa(c,b):', kappa(c_labels, b_labels))

def majority_vote(*args):
    for i,labels in enumerate(zip(*args)):
        top_2 = Counter(labels).most_common(2)
        if len(top_2) == 1:
            # all annotators chose the same label
            yield top_2[0][0]
        elif top_2[0][1] != top_2[1][1]:
            # one label is a clear winner
            yield top_2[0][0]
        else:
            # there is a tie
            l = random.choice(labels)
            warnings.warn('Choosing {} randomly for row {}: {}'.format(l, i,labels))
            yield l
            
gold = list(majority_vote(a_labels, b_labels, c_labels))

### Weighted voting

- Given a single gold labelling, we can compute human performance according to our evaluation metrics. This provides an upper bound, indicating the best we would expect from a model over this data. Use `sklearn.metrics` to calculate a `precision_score` and a `recall_score` for each annotator against the gold. Then use these to compute average precision and recall.
- The above majority voting code generates a warning for row 42, which has a three-way tie with one vote for each label. How could we use Kappa scores to come up with a better voting scheme?
- Using Kappa is not an ideal solution. What problems does it have?
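
One possible shape for such a scheme is sketched below: weight each annotator's vote by their mean pairwise kappa with the other annotators. The function names (`cohen_kappa`, `annotator_weights`, `weighted_vote`) and the toy labels are assumptions for illustration, and this is only one of several reasonable weightings.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cohen_kappa(x, y):
    """Cohen's kappa for two equal-length label lists (undefined if expected agreement is 1)."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    fx, fy = Counter(x), Counter(y)
    expected = sum((fx[k] / n) * (fy[k] / n) for k in set(x) | set(y))
    return (observed - expected) / (1 - expected)

def annotator_weights(annotations):
    """Weight each annotator by their mean pairwise kappa with the others."""
    names = list(annotations)
    totals = defaultdict(float)
    for p, q in combinations(names, 2):
        k = cohen_kappa(annotations[p], annotations[q])
        totals[p] += k
        totals[q] += k
    return {name: totals[name] / (len(names) - 1) for name in names}

def weighted_vote(annotations, weights):
    """Yield, for each row, the label with the largest total annotator weight."""
    names = list(annotations)
    for row in zip(*(annotations[name] for name in names)):
        scores = defaultdict(float)
        for name, label in zip(names, row):
            scores[label] += weights[name]
        yield max(scores, key=scores.get)

# Toy labels; the last row is a three-way tie under plain majority vote.
toy = {
    'a': ['x', 'x', 'y', 'y', 'z', 'z'],
    'b': ['x', 'x', 'y', 'y', 'z', 'y'],
    'c': ['x', 'y', 'y', 'x', 'z', 'x'],
}
weights = annotator_weights(toy)
result = list(weighted_vote(toy, weights))
print(weights)
print(result)
```

In the toy data the tie in the last row goes to annotator `a`, who agrees most with the others. Note that a kappa at or below zero would give an annotator a zero or negative weight, which a real scheme would need to handle.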

In [None]:
# 1 -
from sklearn.metrics import precision_score, recall_score
p_scores, r_scores = [], []
for annot_labels in [a_labels, b_labels, c_labels]:
    p_scores.append(precision_score(gold, annot_labels, average='micro'))
    r_scores.append(recall_score(gold, annot_labels, average='micro'))
average_p = sum(p_scores)/len(p_scores)
average_r = sum(r_scores)/len(r_scores)
f_score = 2*average_p*average_r/(average_p+average_r)
print('upper bound p/r/f:', average_p, average_r, f_score)

# 2 - Majority vote does not handle cases with no clear winner (ties).
#     We could weight each vote by the annotator's average agreement (kappa) score.

# 3 - Neither majority vote nor weighted voting accounts for systematic annotator bias.
#     See https://lingpipe-blog.com/2014/10/29/beckys-and-my-annotation-paper-in-tacl/