# Exercise 1.1: Annotation & reliability

## Data

Here's a sample class wrapping the main abstraction, the `Dataset`.

The following utility reads the 20 newsgroups data into a Dataset object. It sets the label to True if a message comes from the talk.politics.guns group, and to False otherwise.

In [None]:
from dataset import Dataset
from sklearn.datasets import fetch_20newsgroups

def guns_dataset_factory(subset='train', labelled=False):
    """ Fetches newsgroup data and returns a Dataset. """
    newsgroups = fetch_20newsgroups(subset=subset)
    
    # Transform to guns or not.
    labels = {i: name == 'talk.politics.guns' for i, name in enumerate(newsgroups.target_names)}
    dataset = Dataset({text: labels[i] for text, i in zip(newsgroups.data, newsgroups.target)})
    return dataset

pool = guns_dataset_factory(subset='train')
test = guns_dataset_factory(subset='test')

## A random sampler

We need a way of choosing which data to annotate next. Let's start with a random sampler. This is how most crowd annotation is set up.

Our `Sampler` base class includes some utilities for sampling, training and scoring. Other samplers inherit from `Sampler` and must implement the `__call__` method, which takes one argument (a dataset).

`Sampler` objects must be initialised with a sklearn classifier or pipeline. We use multinomial naive Bayes as a baseline since it is [fast to train and can achieve competitive accuracy](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html).

Our `Random` sampler simply shuffles the unlabelled data, then returns the first `batch_size` items.
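To make the shuffle-then-slice idea concrete, here is a minimal standalone sketch. It is not the implementation in `samplers.py`: the function `random_batch` and its `seed` parameter are hypothetical names used only for illustration, and it works on a plain list rather than a `Dataset`.

```python
import random

def random_batch(unlabelled, batch_size, seed=None):
    """Shuffle a copy of the unlabelled items and return the first batch_size.

    Hypothetical helper for illustration; the real Random sampler may differ.
    """
    rng = random.Random(seed)
    items = list(unlabelled)   # copy so the caller's data is not reordered
    rng.shuffle(items)
    return items[:batch_size]

# Toy unlabelled pool standing in for newsgroup messages.
texts = ['message %d' % i for i in range(100)]
batch = random_batch(texts, batch_size=10, seed=0)
print(len(batch))
```

Because every item is equally likely to be chosen, the class balance of a batch tends to mirror the class balance of the pool.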

In [None]:
from samplers import Random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

random_sampler = Random(pipeline, batch_size=10)
for i, (text, label) in enumerate(random_sampler(pool)):
    print(i, label, repr(text[:60]))

## Simulated experiments

Now we can run simulated experiments. `run_simulations` takes a sampler and runs `n` complete simulations. It returns:
* `train_sizes` - the training set sizes for each simulation
* `train_scores` - F1 scores over the training set
* `test_scores` - F1 scores over the test set

Each is a list of tuples, with one tuple per iteration of sampling. Each tuple contains `n` values, one per simulation.
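As a sketch of how this layout can be summarised, the snippet below converts a list of tuples into an array with one row per sampling iteration and one column per simulation. The numbers in `test_scores_example` are made up purely to show the shape; they are not results from this notebook.

```python
import numpy as np

# Made-up scores for 3 sampling iterations and n=3 simulations (illustration only).
test_scores_example = [
    (0.50, 0.54, 0.52),
    (0.70, 0.68, 0.72),
    (0.80, 0.81, 0.79),
]

scores = np.asarray(test_scores_example)   # shape: (iterations, n)
mean_per_iteration = scores.mean(axis=1)   # average over simulations
std_per_iteration = scores.std(axis=1)     # spread over simulations

print(scores.shape)
print(mean_per_iteration)
print(std_per_iteration)
```

Reducing along `axis=1` gives one value per iteration, which is the form a learning curve needs.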

In [None]:
from evaluation import run_simulations

# suppress sklearn FutureWarnings in terminal output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# run a simulated experiment with n=3 complete simulations
random_sampler = Random(pipeline, batch_size=2000)
train_sizes, train_scores, test_scores = run_simulations(random_sampler, pool, test, n=3)

print(train_sizes)
print(train_scores)
print(test_scores)

## Learning curves

Plotting performance against the number of examples gives a quick visual indication of whether more annotation will help.

Here, the lower, green line corresponds to the test F1 score. This is our estimate of generalisation to unseen data.

When test performance flattens out, more labelled data probably won't help with this classifier (multinomial naive Bayes here).
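One rough way to quantify "flattening out" is to look at the score gained per additional labelled example between successive rounds. The sketch below assumes mean test scores have already been computed per round; `marginal_gains` is a hypothetical helper and the numbers are made up for illustration, not taken from this experiment.

```python
import numpy as np

def marginal_gains(sizes, mean_scores):
    """Score gain per additional labelled example between successive rounds."""
    sizes = np.asarray(sizes, dtype=float)
    mean_scores = np.asarray(mean_scores, dtype=float)
    return np.diff(mean_scores) / np.diff(sizes)

# Made-up learning curve (illustration only).
sizes = [2000, 4000, 6000, 8000]
mean_f1 = [0.60, 0.75, 0.80, 0.81]

gains = marginal_gains(sizes, mean_f1)
print(gains)
```

Gains that shrink towards zero suggest that further annotation is buying little with the current classifier. What counts as "small enough" is a judgment call that depends on annotation cost.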

In [None]:
from evaluation import plot_learning_curve
import numpy as np

plt = plot_learning_curve(np.asarray(train_sizes), np.asarray(train_scores), np.asarray(test_scores))
plt.show()

## Simulation with bootstrap resampling

In a setting where sampled batches were actually being annotated, we wouldn't run complete simulations.

Instead, we can estimate variance using bootstrap resampling of training data in a given round of sampling.
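The general idea of bootstrap resampling can be sketched as follows. This is not the code behind `run_bootstraps`: `bootstrap_scores` and `fit_and_score` are hypothetical names, and the toy "score" (fraction of positive labels) merely stands in for training a classifier on each resample and scoring it on the test set.

```python
import numpy as np

def bootstrap_scores(train_items, fit_and_score, n_bootstraps=10, seed=0):
    """Score n_bootstraps resamples of train_items, each drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(train_items)
    scores = []
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)          # n indices, with replacement
        sample = [train_items[i] for i in idx]    # same size as the original set
        scores.append(fit_and_score(sample))
    return np.asarray(scores)

# Toy stand-in for "train then score": the fraction of positive labels.
labels = [True] * 20 + [False] * 80
scores = bootstrap_scores(labels, lambda s: sum(s) / len(s), n_bootstraps=50)
print(scores.mean(), scores.std())
```

The spread of the scores across resamples serves as an estimate of variance without needing any labels beyond those already collected.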

In [None]:
from evaluation import run_bootstraps

# estimate variance with bootstrap resampling instead of complete simulations
random_sampler = Random(pipeline, batch_size=2000)
train_sizes, train_scores, test_scores = run_bootstraps(random_sampler, pool.copy, test)

print(train_sizes)
print(train_scores)
print(test_scores)

In [None]:
plt = plot_learning_curve(np.asarray(train_sizes), np.asarray(train_scores), np.asarray(test_scores))
plt.show()

In [None]:
import numpy as np

# standard deviation of test F1 across bootstrap samples, one value per sampling round
np.std(np.asarray(test_scores), axis=1)

## Manually label some examples

In [None]:
from annotator import AnnotationPane

pane = AnnotationPane(pool, Random(pipeline, batch_size=10))

## See our new labels in the dataset

In [None]:
print(pool.label_distribution)