# Exercise 2: Live shared task

The challenge is to build, in 60 minutes, a sentence-level classifier for identifying [adverse drug events](https://en.wikipedia.org/wiki/Adverse_event). You are free to use any data and annotation strategy that you think best trades off hacking and labelling. Just please don't look at the test data.

Some strategies to consider:
* Get started with random or query-driven sampling.
* Use the dev data for seeding learning instead of generalisation testing and analysis.
* Tune classifier choice, hyperparameters or feature extraction.
* Use error analysis over the dev data to refine your strategy.
* Active learning by uncertainty or ensembles.
* Collect 10 or more query functions and use them as `snorkel` labelling functions.
* Find additional data, e.g., [Twitter](https://archive.org/details/twitterstream).
* Interactive web search or [Reddit queries](http://minimaxir.com/2015/10/reddit-bigquery/).
* Use external data (e.g., [MAUDE](https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/PostmarketRequirements/ReportingAdverseEvents/ucm127891.htm)) for querying or labelling functions.
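
As one illustration of the active learning option above, here is a minimal sketch of least-confidence sampling. It is not part of the provided `samplers` module. The helper name `least_confident` is hypothetical, and it assumes you can get one row of class probabilities per text, for example from a fitted scikit-learn classifier's `predict_proba`.

```python
def least_confident(texts, probabilities, batch_size=10):
    """Return the batch_size texts whose top class probability is lowest.

    `probabilities` holds one row of class probabilities per text.
    """
    scored = [(max(row), i) for i, row in enumerate(probabilities)]
    scored.sort()  # least confident first; ties keep original order
    return [texts[i] for _, i in scored[:batch_size]]


# Toy example with hand-written probabilities for two classes
texts = ['a', 'b', 'c', 'd']
probabilities = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]]
batch = least_confident(texts, probabilities, batch_size=2)
print(batch)  # ['d', 'b']
```

The items returned are the ones the current model is least sure about, so they are reasonable candidates to annotate next.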

Please don't use data from the following sources, as our held-out data is drawn from them:
* CSIRO CADEC data set
* AskaPatient
* DIEGO Lab Twitter data sets

## Preliminaries

Labels are saved on the following objects. Only run this once, unless you want to delete your annotations and start over.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# for tracking performance
batches = []

In [None]:
from dataset import Dataset

# load dev data
dev = Dataset.from_csv('../shared-task/dev.csv')
print('Loaded {} items to dev dataset'.format(len(dev)))

# get text and label vectors for scikit-learn
X_dev, y_dev = zip(*dev.oracle_items)

In [None]:
# load unlabelled data pools
aska = Dataset.from_csv('../shared-task/aska.csv')
print('Loaded {} items to aska dataset'.format(len(aska)))

#ader = Dataset.from_csv('../shared-task/ader.csv')
#print('Loaded {} items to ader dataset'.format(len(ader)))

#adeb = Dataset.from_csv('../shared-task/adeb.csv')
#print('Loaded {} items to adeb dataset'.format(len(adeb)))

adrc = Dataset.from_csv('../shared-task/adrc.csv')
print('Loaded {} items to adrc dataset'.format(len(adrc)))

DATASETS = [
    ('aska', aska),
    #('ader', ader),
    #('adeb', adeb),
    ('adrc', adrc),
]

## Look at some dev data

In [None]:
for i, (text, label) in enumerate(dev.oracle_items):
    if i > 9:
        break
    print(i, label, repr(text))

## Look at some unlabelled pool data

In [None]:
for i, (text, label) in enumerate(aska.oracle_items):
    if i > 9:
        break
    print(i, label, repr(text))

## Pool data sources

The unlabelled pool data loaded under Preliminaries comes from several sources:
* `aska` - Posts for additional drugs from AskaPatient
* `ader` - Comments mentioning the same drugs from Reddit
* `adeb` - Tweets mentioning the same set of drugs
* `adrc` - Tweets mentioning an overlapping set of drugs

## Annotate

In [None]:
from samplers import Random
import re

# set up a random sampler with a query filter that matches examples containing the word "pain"
def mentions_pain(item):
    return bool(re.search(r'\bpain\b', item[0], flags=re.IGNORECASE))

query_sampler = Random(None, batch_size=10, query=mentions_pain)

# sample
for i, (text, label) in enumerate(query_sampler(aska)):
    print(i, label, repr(text[:80]))
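
If you want several query filters, a small factory saves retyping the regular expression. The sketch below follows the `mentions_pain` convention that `item[0]` is the text. The helper `make_keyword_query` and the keyword list are illustrative assumptions, not part of the provided modules.

```python
import re


def make_keyword_query(*keywords):
    """Build a query function that matches items mentioning any keyword."""
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        flags=re.IGNORECASE,
    )

    def query(item):
        # item[0] is the text, as in mentions_pain
        return bool(pattern.search(item[0]))

    return query


mentions_symptom = make_keyword_query('pain', 'nausea', 'dizzy', 'rash')

# Toy items standing in for (text, label) pairs from a pool
items = [
    ('Severe nausea after the second dose', None),
    ('Works great, no problems', None),
    ('Painless injection', None),
]
matches = [mentions_symptom(item) for item in items]
print(matches)  # [True, False, False]
```

The resulting function can be passed as the `query` argument in the same way as `mentions_pain`. Note that the word boundaries mean "Painless" does not match "pain".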

In [None]:
from annotator import AnnotationPane

# annotate
pane = AnnotationPane(aska, query_sampler)

In [None]:
aska.label_distribution

## Evaluate on dev data

In [None]:
# Collate annotations
from dataset import pool_data
train = pool_data(DATASETS)
print(train.label_distribution)

In [None]:
# define pipeline
from samplers import Random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

In [None]:
# fit pipeline and save train/dev f1 scores
from evaluation import fit_and_score

batches.append(list(fit_and_score(pipeline, train, X_dev, y_dev, n=5)))

In [None]:
# inspect batches
print('Batches:')
for i, batch in enumerate(batches):
    train_sizes, train_scores, test_scores = zip(*batch)
    print('\n..batch', i)
    print('..train_sizes:', train_sizes)
    print('..train_scores:', ['{:.2f}'.format(s) for s in train_scores])
    print('..test_scores:', ['{:.2f}'.format(s) for s in test_scores])

In [None]:
from evaluation import plot_learning_curve

train_sizes, train_scores, test_scores = zip(*[zip(*i) for i in batches])
plt = plot_learning_curve(train_sizes, train_scores, test_scores)
plt.show()

## Error analysis on dev

In [None]:
# collate annotations
from dataset import pool_data
train = pool_data(DATASETS)
print(train.label_distribution)

In [None]:
# train the classifier
X_train, y_train = zip(*train.labelled_items)
pipeline.fit(X_train, y_train)

In [None]:
# predict on a shuffled copy so that X_dev stays aligned with y_dev
import random
X_shuffled = [i[0] for i in dev]
random.shuffle(X_shuffled)
y_pred = pipeline.predict(X_shuffled)
predictions = dict(zip(X_shuffled, y_pred))

In [None]:
# print classification report
from sklearn.metrics import classification_report
y_true = [dev.get_oracle_label(t) for t in X_shuffled]
print(classification_report(y_true, y_pred))

In [None]:
# print some errors
errors = filter(lambda i: i[0] != i[1], zip(y_true, y_pred, X_shuffled))
for i, (true, pred, text) in enumerate(errors):
    if i > 9:
        break
    print(i, true, pred, repr(text))
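
It can help to look at false positives and false negatives separately, since they usually suggest different fixes (tighter queries versus more positive examples). The following is a minimal sketch. The helper `split_errors` is hypothetical, and the positive label value `1` used in the toy data is an assumption, so check the label values in your dev data first.

```python
def split_errors(y_true, y_pred, texts, positive_label):
    """Separate misclassified texts into false positives and false negatives."""
    false_positives, false_negatives = [], []
    for true, pred, text in zip(y_true, y_pred, texts):
        if true == pred:
            continue
        if pred == positive_label:
            false_positives.append(text)
        elif true == positive_label:
            false_negatives.append(text)
    return false_positives, false_negatives


# Toy data: 1 marks an adverse drug event, 0 marks none (assumed labels)
toy_true = [1, 0, 1, 0]
toy_pred = [1, 1, 0, 0]
toy_texts = ['t0', 't1', 't2', 't3']
fps, fns = split_errors(toy_true, toy_pred, toy_texts, positive_label=1)
print(fps, fns)  # ['t1'] ['t2']
```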

## Data programming 

One view of data programming is that it takes the query functions we used in the previous exercise and uses them for weak supervision. It does this by pooling labelling function output using weighted voting.
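
The pooling step can be sketched in a few lines. This is a minimal, unweighted version and not `snorkel`'s implementation: each query function is wrapped so that it votes for a label on a match and abstains otherwise, and the votes are combined by simple majority. The names `ABSTAIN`, `as_labelling_function` and `majority_vote` are hypothetical, as are the label values `1` and `0`.

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker for "this function has no opinion"


def as_labelling_function(query, label):
    """Wrap a boolean query so it votes `label` on a match, else abstains."""
    def labelling_function(item):
        return label if query(item) else ABSTAIN
    return labelling_function


def majority_vote(item, labelling_functions):
    """Return the most common non-abstaining vote, or ABSTAIN if none vote."""
    votes = [lf(item) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


# Toy query functions over items whose first element is the text
lfs = [
    as_labelling_function(lambda item: 'pain' in item[0].lower(), 1),
    as_labelling_function(lambda item: 'nausea' in item[0].lower(), 1),
    as_labelling_function(lambda item: 'no side effects' in item[0].lower(), 0),
]
pool = [
    ('Terrible pain and nausea', None),
    ('No side effects at all', None),
    ('Took it with food', None),
]
weak_labels = [majority_vote(item, lfs) for item in pool]
print(weak_labels)  # [1, 0, None]
```

Items where every function abstains stay unlabelled, which is one reason to collect many functions with different coverage.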

A simple implementation could use the inter-annotator agreement scripts from exercise 1.1 to weight each labelling function by its average agreement score.

In the setting here, where we have dev data, we could also weight each labelling function by its performance on the labelled dev data. Of course, this wouldn't work in an annotation setting where we were starting without labelled data.
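
A minimal sketch of dev-weighted voting might look like the following. It assumes labelling functions take an item whose first element is the text and return a label or `None` to abstain. All function names, the toy dev items and the label values `1` and `0` are illustrative assumptions rather than part of the provided framework.

```python
def dev_accuracy(labelling_function, dev_items):
    """Accuracy over the dev items the function votes on (None = abstain)."""
    scored = [(labelling_function(item), item[1]) for item in dev_items]
    scored = [(vote, label) for vote, label in scored if vote is not None]
    if not scored:
        return 0.0
    return sum(vote == label for vote, label in scored) / len(scored)


def weighted_vote(item, labelling_functions, weights):
    """Return (label, confidence) from weighted votes, or (None, 0.0)."""
    totals = {}
    for lf, weight in zip(labelling_functions, weights):
        vote = lf(item)
        if vote is not None:
            totals[vote] = totals.get(vote, 0.0) + weight
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    label = max(totals, key=totals.get)
    return label, totals[label] / total


# Hypothetical labelling functions; labels 1 and 0 are assumed
def lf_pain(item):
    return 1 if 'pain' in item[0].lower() else None

def lf_fine(item):
    return 0 if 'fine' in item[0].lower() else None

def lf_took(item):
    return 1 if 'took' in item[0].lower() else None  # deliberately noisy

# Toy stand-in for labelled dev items as (text, label) pairs
dev_items = [
    ('pain in my legs', 1),
    ('took it and felt fine', 0),
    ('took it, then chest pain', 1),
    ('took it for a week', 0),
]
lfs = [lf_pain, lf_fine, lf_took]
weights = [dev_accuracy(lf, dev_items) for lf in lfs]
label, confidence = weighted_vote(('Took it and felt fine', None), lfs, weights)
print(label, round(confidence, 2))  # 0 0.75
```

Here the noisy `lf_took` is right on only one of its three dev votes, so the more reliable `lf_fine` outvotes it.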

A key difference from `snorkel` is that the voting approach described here does not go on to train the classifier on a continuous voting confidence value.

Feel free to experiment with voting, or use `snorkel` directly. If you do plan to use `snorkel`, note that it takes a while to [install](https://github.com/HazyResearch/snorkel#installation). It would be a good idea to run the installation in the background while you start annotating and/or writing labelling functions.

Once `snorkel` is installed, the tutorials should help get things up and running. These are in the repo and can also be viewed [on github](https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro).

# Wrapping up

## Short strategy description

Before submitting, please summarise:
* The hacking/labelling strategy you followed
* How you rate this strategy, and why

__TODO Add your summary right here.__

__TODO If you have a list of sampling strategies, please include it here.__

## Submission

Submit your annotation and system output for scoring.
* The union of annotations across all sets (except dev)
* Predictions for dev
* Predictions for test


### Step 1: Set up

First, we'll set up a pipeline. Feel free to use a different classifier here if you like.

__FIXME fix to_csv!__

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# use multinomial NB again
pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

Submissions will be written to a USERNAME directory. This will take USER from your environment by default, but feel free to choose another name.

In [None]:
import os
USERNAME = os.environ.get('USER', 'username')

In [None]:
# IPython expands $USERNAME from the Python variable set above
! mkdir -p ../submissions/$USERNAME
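
As an alternative to the shell call, the directory can be created from Python, so that it always follows whatever name you pass in. This is a minimal sketch. The helper `submission_dir` is hypothetical, and the example uses a temporary directory and a made-up user name to avoid touching your real submissions folder.

```python
import os
import tempfile


def submission_dir(username, root='../submissions'):
    """Create (if needed) and return the submission directory for username."""
    path = os.path.join(root, username)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


# Demonstrate in a throwaway directory with a made-up user name
with tempfile.TemporaryDirectory() as root:
    path = submission_dir('alice', root=root)
    created = os.path.isdir(path)
print(created)  # True
```

In the notebook you would call `submission_dir(USERNAME)` with the default `root`.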

### Step 2: Train and predict

Now let's collate all annotated data into a `train` dataset, use this to train the classifier, and save predictions for dev and test.

__FIXME move this into a function for learning curve use as well!__

In [None]:
# collate annotations
from dataset import pool_data
train = pool_data(DATASETS)
print(train.label_distribution)

In [None]:
# save annotations to csv
for k, d in DATASETS:
    d.to_csv('../submissions/{}/{}.csv'.format(USERNAME, k))
train.to_csv('../submissions/{}/train.csv'.format(USERNAME))

In [None]:
# train the classifier
X_train, y_train = zip(*train.labelled_items)
pipeline.fit(X_train, y_train)

In [None]:
from evaluation import label_for_submission

# prepare system output for dev data
label_for_submission(dev, pipeline, 'dev', USERNAME)

# prepare system output for test data
test = Dataset.from_csv('../shared-task/test.csv')
label_for_submission(test, pipeline, 'test', USERNAME)

### Step 3: Copy notebook and submit

In [None]:
# copy your notebook to your submission directory
! cp exercise_2.ipynb ../submissions/$USERNAME/

In [None]:
# push your submission back to the repo
! git add ../submissions/$USERNAME
! git commit -m "Checkpoint $USERNAME" ../submissions/$USERNAME/
! git push