# Exercise 2: Live shared task

The challenge is to build a sentence-level classifier for identifying [adverse drug events](https://en.wikipedia.org/wiki/Adverse_event) in 60 minutes. You are free to use any data and annotation strategy that you think best trades off hacking against labelling. Just please don't look at the test data.

Some strategies to consider:
* Get started with random or query-driven sampling.
* Use the dev data for seeding learning instead of generalisation testing and analysis.
* Tune classifier choice, hyperparameters or feature extraction.
* Use error analysis over the dev data to refine your strategy.
* Active learning by uncertainty or ensembles.
* Collect 10 or more query functions and use them as `snorkel` labelling functions.
* Find additional data, e.g., [Twitter](https://archive.org/details/twitterstream).
* Interactive web search or [Reddit queries](http://minimaxir.com/2015/10/reddit-bigquery/).
* Use external data (e.g., [MAUDE](https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/PostmarketRequirements/ReportingAdverseEvents/ucm127891.htm)) for querying or labelling functions.
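
As a rough illustration of the uncertainty-based active learning strategy listed above, the sketch below ranks unlabelled texts by how close the predicted probability of the positive class is to 0.5. It uses plain scikit-learn rather than the workshop's `samplers` module, and the helper `most_uncertain`, the toy texts and the toy labels are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def most_uncertain(model, pool_texts, batch_size=5):
    """Return indices of pool items whose P(positive) is closest to 0.5.

    Assumes a fitted binary classifier, so column 1 of predict_proba
    is the probability of the positive class.
    """
    proba = model.predict_proba(pool_texts)[:, 1]
    distance_from_boundary = np.abs(proba - 0.5)
    return np.argsort(distance_from_boundary)[:batch_size].tolist()


# Toy labelled seed data (hypothetical, for illustration only).
seed_texts = [
    "severe muscle pain after taking Lipitor",
    "I take Lipitor every morning",
    "terrible leg cramps since starting the drug",
    "my doctor prescribed a new tablet",
]
seed_labels = [True, False, True, False]

# Toy unlabelled pool (hypothetical).
pool_texts = [
    "pain in my back and legs",
    "picked up my prescription today",
    "started the drug this morning",
    "cramps and muscle pain all night",
]

model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
model.fit(seed_texts, seed_labels)

# Indices of the two pool items the model is least sure about.
to_label = most_uncertain(model, pool_texts, batch_size=2)
print([pool_texts[i] for i in to_label])
```

Labelling the selected items, retraining and repeating gives a basic active learning loop. An ensemble variant would replace the distance from 0.5 with disagreement between several models.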

Please don't use data from the following as they are sources of our held-out data:
* CSIRO CADEC data set
* AskaPatient
* DIEGO Lab Twitter data sets

In [1]:
# load dev data
from dataset import Dataset
dev = Dataset.from_csv('../shared-task/dev.csv')

## Load pool data

Now let's load the unlabelled pool data. We have data from several sources:
* `aska` - Posts for additional drugs from AskaPatient
* `ader` - Comments mentioning the same drugs from Reddit
* `adeb` - Tweets mentioning the same set of drugs
* `adrc` - Tweets mentioning an overlapping set of drugs

In [2]:
# load unlabelled data pools
aska = Dataset.from_csv('../shared-task/aska.csv')
# ader = Dataset.from_csv('../shared-task/ader.csv')
# adeb = Dataset.from_csv('../shared-task/adeb.csv')
adrc = Dataset.from_csv('../shared-task/adrc.csv')

In [3]:
#dev.label_distribution
print(dev.label_distribution)
print(aska.label_distribution)
print(adrc.label_distribution)

{None: 3761}
{None: 14712}
{None: 4409}


In [4]:
from collections import Counter
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
en = set(stopwords.words('english'))

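punctuation = set(string.punctuation)

unigrams = Counter()
for text, label in dev.oracle_items:
    if label:
        tokens = word_tokenize(text)
        # Count tokens that are neither stopwords nor punctuation.
        # Note: the stopword list is lowercase, so capitalised tokens
        # such as 'I' are not filtered out.
        unigrams.update(t for t in tokens
                        if t not in en and t not in punctuation)
for t, count in unigrams.most_common(10):
    print('{}\t{}'.format(count, t))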

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wradford/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
768	I
483	pain
209	muscle
106	back
99	Lipitor
98	loss
97	taking
90	severe
88	cramps
85	legs


In [9]:
from annotator import AnnotationPane
from samplers import Random

def mentions_top_ten(item):
    return bool(re.search(r'\b(I|pain|muscle|back|Lipitor|loss|taking|severe|cramps|legs)\b', item[0], flags=re.IGNORECASE))
query_sampler = Random(None, batch_size=30, query=mentions_top_ten)

# annotate
pane = AnnotationPane(aska, query_sampler)

In [10]:
aska.label_distribution

{False: 41, None: 14652, True: 19}

## Data programming 

One view of data programming is that it takes the query functions we used in the previous exercise and uses them for weak supervision. It does this by pooling labelling function output using weighted voting.

A simple implementation could use the inter-annotator agreement scripts from exercise 1.1 to weight each labelling function by its average agreement score.

In the setting here, where we have dev data, we could also weight each labelling function by its performance on the labelled dev data. Of course, this wouldn't work in an annotation setting where we were starting without labelled data.
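
A minimal sketch of weighted voting, assuming each labelling function maps a text to `True` or `False` and is weighted by its accuracy on labelled dev items. The function names, the toy dev items and the helpers `dev_accuracy` and `weighted_vote` are hypothetical and not part of the workshop code. Note that the query function used earlier takes an `item` tuple rather than a bare text.

```python
import re


def mentions_pain(text):
    return bool(re.search(r'\b(pain|cramps)\b', text, flags=re.IGNORECASE))


def mentions_severe(text):
    return bool(re.search(r'\bsevere\b', text, flags=re.IGNORECASE))


def mentions_taking(text):
    return bool(re.search(r'\btaking\b', text, flags=re.IGNORECASE))


def dev_accuracy(lf, items):
    """Fraction of labelled (text, label) items on which lf agrees with the label."""
    return sum(lf(text) == label for text, label in items) / len(items)


def weighted_vote(text, weighted_lfs):
    """Pool votes: each function adds +weight for True and -weight for False.

    Returns (label, confidence), where confidence is |score| / total weight.
    """
    total = sum(weight for _, weight in weighted_lfs)
    score = sum(weight if lf(text) else -weight for lf, weight in weighted_lfs)
    confidence = abs(score) / total if total else 0.0
    return score > 0, confidence


# Toy labelled dev items (hypothetical).
toy_dev = [
    ("severe pain in my legs", True),
    ("muscle cramps every night", True),
    ("taking it with food", False),
    ("no problems so far", False),
]

weighted_lfs = [(lf, dev_accuracy(lf, toy_dev))
                for lf in (mentions_pain, mentions_severe, mentions_taking)]

label, confidence = weighted_vote("taking it daily", weighted_lfs)
print(label, confidence)
```

Replacing `dev_accuracy` with an average agreement score between labelling functions would give the variant that needs no labelled data.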

A key difference from `snorkel` is that the approach in this annotation framework does not go on to train the classifier on a continuous voting confidence value.
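
One rough way to let vote confidence influence training, without using `snorkel`, is to pass it to scikit-learn as a per-example `sample_weight`. This is a simpler, swapped-in technique and should not be read as a description of what `snorkel` does internally. The weak labels and confidences below are hypothetical toy values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical weak labels: (text, voted label, vote confidence in [0, 1]).
weak_items = [
    ("severe muscle pain after two weeks", True, 0.9),
    ("leg cramps at night", True, 0.6),
    ("taking one tablet daily", False, 0.8),
    ("no side effects so far", False, 0.3),
]
texts, labels, confidences = zip(*weak_items)

weighted_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# The 'clf__' prefix routes sample_weight to the classifier step only,
# so confident votes count for more than uncertain ones during fitting.
weighted_pipeline.fit(texts, labels, clf__sample_weight=list(confidences))

predictions = weighted_pipeline.predict(["severe muscle pain"])
print(predictions)
```

Items with low confidence still contribute, just less. Dropping items below a confidence threshold is another option to try.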

Feel free to experiment with voting, or use `snorkel` directly. If you do plan to use `snorkel`, note that it takes a while to [install](https://github.com/HazyResearch/snorkel#installation). It would be a good idea to run the installation in the background while you start annotating and/or writing labelling functions.

Once `snorkel` is installed, the tutorials should help get things up and running. These are in the repo and can also be viewed [on github](https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro).

# Wrapping up

## Short strategy description

Before submitting, please summarise:
* The hacking/labelling strategy you followed
* How you rate this strategy, and why

TODO Add your summary right here.

TODO If you have a list of sampling strategies, please include it here.

## Submission

Submit your annotation and system output for scoring.

* Union of labels across all sets (except dev).
* Predict dev
* Predict test

### Step 1: set up your submission directory

This will use the environment variable `USER`, but feel free to choose your own name.
```bash
mkdir -p ../submissions/$USER
```

In [13]:
! mkdir -p ../submissions/$USER

### Step 2: Collate your labels; train a classifier; label dev/test

In [14]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from evaluation import label_for_submission

USERNAME = os.environ.get('USER', 'username')

# use multinomial NB again
pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')),
        ('clf', MultinomialNB(alpha=.01)),
    ])

# Collate all labels.
train = Dataset()
# FIXME Add all the datasets.
for d in [aska]:
    train.update(d)
print(train.label_distribution)
train.to_csv('../submissions/{}/train.csv'.format(USERNAME))

# Train a classifier.
X_train, y_train = zip(*train.labelled_items)
pipeline.fit(X_train, y_train)

label_for_submission(dev, pipeline, 'dev', USERNAME)
# FIXME Once test is available, use it.
# label_for_submission(test, pipeline, 'test', USERNAME)

{None: 14652, False: 41, True: 19}
Written submission to ../submissions/wradford/None.csv


### Step 3: save this notebook and copy it to the submission directory

First press `<ctrl>+<s>` to save, then run:

In [15]:
# copy your notebook to your submission directory
! cp exercise_2.ipynb ../submissions/$USER/

In [None]:
# push your submission back to the repo
! git add ../submissions/$USER
! git commit -m "Checkpoint $USER" ../submissions/$USER/
! git push