# Exercise 1.2: Active learning

## Data

First let's load our newsgroup guns/no-guns data.

In [None]:
from dataset import Dataset
from sklearn.datasets import fetch_20newsgroups

def guns_dataset_factory(subset='train', labelled=False):
    """ Fetches newsgroup data and returns a Dataset. """
    newsgroups = fetch_20newsgroups(subset=subset)
    
    # Transform to guns or not.
    labels = {i: name == 'talk.politics.guns' for i, name in enumerate(newsgroups.target_names)}
    dataset = Dataset({text: labels[i] for text, i in zip(newsgroups.data, newsgroups.target)})
    return dataset

pool = guns_dataset_factory(subset='train')
test = guns_dataset_factory(subset='test')

## Sampling by query

In addition to a pipeline and `batch_size`, `Sampler` can take three optional function arguments:
* `query` - a function to filter before prediction
* `key` - a key function to sort predictions
* `accept` - a function to filter predictions

The `query` filter can be used to sample by keyword, e.g., to search for examples containing the word "gun". We'll cover `accept` and `key` below.
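To make the roles of `query`, `key` and `accept` concrete, here is a minimal sketch of how such hooks could fit together. It is not the `samplers` implementation: `sketch_sample` and `fake_predict` are hypothetical names, the hook signatures are simplified (each works on a plain text or a single probability), and the order of the steps is an assumption inferred from the descriptions above.

```python
def sketch_sample(items, predict, batch_size, query=None, key=None, accept=None):
    """Hypothetical sampler: query -> predict -> key -> accept -> batch."""
    # 1. query: cheap filter on the raw items, applied before any prediction
    candidates = [item for item in items if query is None or query(item)]
    # 2. predict: attach a prediction to every remaining item
    scored = [(item, predict(item)) for item in candidates]
    # 3. key: optional sort on the prediction
    if key is not None:
        scored.sort(key=lambda pair: key(pair[1]))
    # 4. accept: optional filter on the prediction
    if accept is not None:
        scored = [pair for pair in scored if accept(pair[1])]
    # 5. keep at most batch_size examples
    return scored[:batch_size]


def fake_predict(text):
    """Stand-in for a trained classifier: P(guns) grows with mentions of 'gun'."""
    return min(1.0, 0.2 + 0.3 * text.lower().count("gun"))


texts = ["Gun laws", "Hockey tonight", "gun gun gun", "My gun safe", "gun, gun"]
batch = sketch_sample(
    texts,
    fake_predict,
    batch_size=2,
    query=lambda text: "gun" in text.lower(),  # keyword search
    key=lambda p: abs(p - 0.5),                # most uncertain first
    accept=lambda p: p < 1.0,                  # drop saturated predictions
)
for text, p in batch:
    print(p, repr(text))
```

Running the cheap `query` filter first means the classifier only has to score the items that survive it, which is presumably why it is applied before prediction.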

In [None]:
from samplers import Random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import re

# use multinomial NB again
pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

# set up a random sampler with a query filter that matches examples containing the word gun
def mentions_gun(item):
    return bool(re.search(r'\bgun\b', item[0], flags=re.IGNORECASE))
query_sampler = Random(pipeline, batch_size=10, query=mentions_gun)

# sample 
for i, (text, label) in enumerate(query_sampler(pool)):
    print(i, label, repr(text[:80]))

## A selective sampler

Here is a straw man active sampler that:
* trains a classifier on the labelled data
* predicts the labels of unlabelled data
* selects the first n examples with a specific label profile
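The three steps above can be sketched directly with scikit-learn on a toy corpus. This is only an illustration under stated assumptions: the sentences are invented, `toy_pipeline` and `in_band` are hypothetical names, and the band width of 0.167 is borrowed from the `accept_uncertain` function used in the next cell. It does not show how `Active` is implemented.

```python
from itertools import islice

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
labelled = [
    ("gun owners debate the new rifle law", True),
    ("handgun permits and gun control", True),
    ("the hockey team won the game", False),
    ("new graphics card for my computer", False),
]
unlabelled = [
    "rifle law and the hockey game",
    "gun control debate",
    "computer graphics",
    "the team debates a new law",
]

# 1. train a classifier on the labelled data
toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
texts, labels = zip(*labelled)
toy_pipeline.fit(texts, labels)

# 2. predict P(True) for the unlabelled data
true_column = list(toy_pipeline.classes_).index(True)
probabilities = toy_pipeline.predict_proba(unlabelled)[:, true_column]

# 3. select the first n examples whose prediction falls inside the band
def in_band(p, width=0.167):
    return abs(p - 0.5) < width

n = 2
selected = list(islice(
    ((text, p) for text, p in zip(unlabelled, probabilities) if in_band(p)),
    n,
))
for text, p in selected:
    print(round(float(p), 3), repr(text))
```

Because selection stops at the first n matches, the result depends on the order of the unlabelled data rather than on which examples are most uncertain, which is what makes this a straw man.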



In [None]:
from samplers import Active

# suppress sklearn FutureWarnings in terminal output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# set up the active sampler to select uncertain examples
def accept_uncertain(pred):
    " Accepts predictions within 0.167 on either side of the decision boundary (0.5). "
    return abs(pred[True] - 0.5) < 0.167
active_sampler = Active(pipeline, batch_size=10, accept=accept_uncertain)

# seed pool with some random labelled examples for initial classifier
p2 = pool.copy
p2.seed(200)
active_sampler.fit(p2)

# sample 
for i, (text, pred) in enumerate(active_sampler(p2)):
    print(i, pred[True], repr(text[:60]))

## Sampling by uncertainty

Here is another active learner. This one ranks examples by the distance of the predicted probability from the decision boundary, so that the most uncertain examples are selected first.
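As a small worked example of this ordering, the sketch below sorts a handful of made-up probabilities by their distance from 0.5. The document names and values are hypothetical stand-ins for classifier output, and `distance_from_boundary` is an illustrative helper, not part of `samplers`.

```python
# Stand-in P(True) values for five unlabelled examples (made up for illustration).
predicted = {"doc_a": 0.95, "doc_b": 0.48, "doc_c": 0.10, "doc_d": 0.62, "doc_e": 0.30}


def distance_from_boundary(p_true, boundary=0.5):
    """Smaller values mean the classifier is less certain."""
    return abs(p_true - boundary)


# Sort ascending so the least certain predictions come first, then take a batch.
ranked = sorted(predicted, key=lambda doc: distance_from_boundary(predicted[doc]))
uncertain_batch = ranked[:3]
print(uncertain_batch)  # ['doc_b', 'doc_d', 'doc_e']
```

Unlike a fixed acceptance band, a sort always returns a full batch, even when no prediction lies close to the boundary.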

In [None]:
# set up the active sampler to rank examples by uncertainty
def uncertainty_sort_key(item):
    pred = dict(zip(pipeline.classes_, item[0]))
    return abs(pred[True] - 0.5)
active_sampler = Active(pipeline, batch_size=10, key=uncertainty_sort_key)

# seed pool with some random labelled examples for initial classifier
p2 = pool.copy
p2.seed(200)
active_sampler.fit(p2)

# sample 
for i, (text, pred) in enumerate(active_sampler(p2)):
    print(i, pred[True], repr(text[:60]))

## Simulation with bootstrap resampling

Let's just use bootstrap resampling this time. It's less reliable, but it's fast.

How does the active test curve compare to the random curve from exercise 1.1?
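For readers unfamiliar with the technique, here is a generic sketch of bootstrap resampling: draw samples with replacement from a set of per-example outcomes and look at the spread of the resulting scores. It is not the implementation of `run_bootstraps`, whose internals are not shown here; `bootstrap_accuracy` and the outcome data are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_accuracy(outcomes, n_bootstraps=200, seed=0):
    """Resample per-example outcomes (1 = correct, 0 = wrong) with replacement."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_bootstraps):
        # each resample has the same size as the original set
        resample = rng.choices(outcomes, k=len(outcomes))
        scores.append(mean(resample))
    return scores


# Made-up outcomes: a classifier that got 80 of 100 test examples right.
outcomes = [1] * 80 + [0] * 20
boot_scores = bootstrap_accuracy(outcomes)
print(f"accuracy {mean(boot_scores):.3f} +/- {stdev(boot_scores):.3f}")
```

Each resample reuses outcomes that have already been computed, so estimating the spread this way needs no additional training runs.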

In [None]:
from evaluation import run_bootstraps
from evaluation import plot_learning_curve
import numpy as np

# run a simulated experiment and plot learning curve
active_sampler = Active(pipeline, batch_size=2000, key=uncertainty_sort_key)
train_sizes, train_scores, test_scores = run_bootstraps(active_sampler, pool.copy, test)

# plot learning curve
plt = plot_learning_curve(np.asarray(train_sizes), np.asarray(train_scores), np.asarray(test_scores))
plt.show()

## Other sampling strategies

Feel free to implement other strategies, e.g.:
* ensemble sampling with random forests,
* ensemble sampling with generative versus discriminative classifiers,
* ensemble sampling with subject versus body features.

What is the speed-accuracy tradeoff for these approaches?
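As a starting point, one common way to sample with an ensemble is to query the examples on which the members disagree (often called query by committee). The sketch below illustrates the subject-versus-body idea with two keyword rules standing in for trained classifiers. All names (`subject_model`, `body_model`, `disagreement`) and the messages are hypothetical, and real members would be fitted models rather than rules.

```python
def subject_model(message):
    """Stand-in classifier that only looks at the subject line."""
    return "gun" in message["subject"].lower()


def body_model(message):
    """Stand-in classifier that only looks at the body."""
    return "gun" in message["body"].lower()


def disagreement(message, committee):
    """Share of votes in the minority: 0.0 (unanimous) up to 0.5 (even split)."""
    votes = [member(message) for member in committee]
    positive = sum(votes)
    return min(positive, len(votes) - positive) / len(votes)


messages = [
    {"subject": "Gun show this weekend", "body": "Bring your gun permits."},
    {"subject": "Re: season opener", "body": "He shot like a gun from the blue line."},
    {"subject": "Monitor for sale", "body": "17 inch, barely used."},
]
committee = [subject_model, body_model]

# Most contested examples first; these are the ones to send for annotation.
contested = sorted(messages, key=lambda m: disagreement(m, committee), reverse=True)
for message in contested:
    print(disagreement(message, committee), repr(message["subject"]))
```

Every committee member has to score every candidate, so the cost of a sampling round grows with the size of the ensemble.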

In [None]:
raise NotImplementedError

## Annotation

When labelling examples now, we should notice that more of them are relevant and/or easily confused than under random sampling.

In [None]:
from annotator import AnnotationPane

# let's use the sampler that returns predictions within 0.167 of the decision boundary
active_sampler = Active(pipeline, batch_size=10, accept=accept_uncertain)

# seed pool with some random labelled examples for initial classifier
pool.seed(200)
active_sampler.fit(pool)

# annotate
pane = AnnotationPane(pool, active_sampler)

In [None]:
print(pool.label_distribution)