# Exercise 2: Live shared task

The challenge is to build, in 60 minutes, a sentence-level classifier for identifying [adverse drug events](https://en.wikipedia.org/wiki/Adverse_event). You are free to use any data and annotation strategy that you think best trades off hacking and labelling. Please don't look at the test data.

Some strategies to consider:
* Get started with random or query-driven sampling.
* Use the dev data for seeding learning instead of generalisation testing and analysis.
* Tune classifier choice, hyperparameters or feature extraction.
* Use error analysis over the dev data to refine your strategy.
* Use active learning driven by classifier uncertainty or ensemble disagreement.
* Collect 10 or more query functions and use them as `snorkel` labelling functions.
* Find additional data, e.g., [Twitter](https://archive.org/details/twitterstream).
* Interactive web search or [Reddit queries](http://minimaxir.com/2015/10/reddit-bigquery/).
* Use external data (e.g., [MAUDE](https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/PostmarketRequirements/ReportingAdverseEvents/ucm127891.htm)) for querying or labelling functions.
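
As one illustration of the active learning strategy listed above, here is a minimal sketch of uncertainty sampling. It assumes you already have a positive-class probability for each unlabelled pool sentence from whatever classifier you are using; the function name `uncertainty_sample` and the example probabilities are hypothetical and not part of the exercise framework.

```python
def uncertainty_sample(probs, k):
    """Return indices of the k items whose positive-class probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Hypothetical classifier probabilities for five unlabelled pool sentences.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.70]

# Indices of the two sentences the classifier is least sure about.
to_label = uncertainty_sample(pool_probs, k=2)
print(to_label)
```

The selected sentences are the ones to annotate next; retrain and repeat as time allows.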

Please don't use data from the following as they are sources of our held-out data:
* CSIRO CADEC data set
* AskaPatient
* DIEGO Lab Twitter data sets

In [None]:
# load dev data
dev = Dataset.from_csv('../shared-task/dev.csv')

## Load pool data

Now let's load the unlabelled pool data. We have data from several sources:
* `aska` - Posts for additional drugs from AskaPatient
* `ader` - Comments mentioning the same drugs from Reddit
* `adeb` - Tweets mentioning the same set of drugs
* `adrc` - Tweets mentioning an overlapping set of drugs

In [None]:
# load unlabelled data pools
aska = Dataset.from_csv('../shared-task/aska.csv')
ader = Dataset.from_csv('../shared-task/ader.csv')
adeb = Dataset.from_csv('../shared-task/adeb.csv')
adrc = Dataset.from_csv('../shared-task/adrc.csv')

## Data programming 

One view of data programming is that it takes the query functions we used in the previous exercise and uses them for weak supervision. It does this by pooling labelling function output using weighted voting.
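
A minimal sketch of weighted voting over labelling functions might look like the following. The three labelling functions, their regular expressions, the weights and the `weighted_vote` helper are all assumptions made for illustration; they are not taken from the exercise framework or from `snorkel`.

```python
import re

ABSTAIN = None  # a labelling function returns 1 (ADE), 0 (no ADE) or ABSTAIN


def lf_side_effect(text):
    return 1 if re.search(r"\bside effects?\b", text, re.I) else ABSTAIN


def lf_made_me(text):
    return 1 if re.search(r"\bmade me\b", text, re.I) else ABSTAIN


def lf_no_problems(text):
    return 0 if re.search(r"\bno (problems|issues|side effects)\b", text, re.I) else ABSTAIN


def weighted_vote(text, lfs, weights):
    """Pool labelling function output: positive votes add weight, negative votes subtract it."""
    score = 0.0
    for lf, weight in zip(lfs, weights):
        vote = lf(text)
        if vote is ABSTAIN:
            continue
        score += weight if vote == 1 else -weight
    if score == 0:
        return ABSTAIN  # no votes, or a tie
    return 1 if score > 0 else 0


lfs = [lf_side_effect, lf_made_me, lf_no_problems]
weights = [1.0, 1.0, 2.0]  # hypothetical weights; the negation pattern is trusted more

label_a = weighted_vote("This drug made me dizzy", lfs, weights)
label_b = weighted_vote("No side effects so far", lfs, weights)
label_c = weighted_vote("Took it with breakfast", lfs, weights)
```

Sentences on which every function abstains stay unlabelled, so coverage depends on how many functions you write.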

A simple implementation could use the inter-annotator agreement scripts from exercise 1.1 to weight each labelling function by its average agreement score.
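
The exercise 1.1 scripts are not reproduced here, so the sketch below uses plain pairwise raw agreement as a stand-in. It assumes each labelling function's output is stored as a list of votes over the same items, with `None` for abstentions; `pairwise_agreement` and `agreement_weights` are hypothetical helper names.

```python
def pairwise_agreement(votes_a, votes_b):
    """Fraction of items on which two labelling functions agree, ignoring abstentions."""
    both = [(a, b) for a, b in zip(votes_a, votes_b) if a is not None and b is not None]
    if not both:
        return None  # the two functions never vote on the same item
    return sum(a == b for a, b in both) / len(both)


def agreement_weights(vote_matrix):
    """Weight each labelling function by its average agreement with all the others."""
    weights = []
    for i, votes_i in enumerate(vote_matrix):
        scores = [
            pairwise_agreement(votes_i, votes_j)
            for j, votes_j in enumerate(vote_matrix)
            if j != i
        ]
        scores = [s for s in scores if s is not None]
        weights.append(sum(scores) / len(scores) if scores else 0.0)
    return weights


# Hypothetical votes from three labelling functions over four sentences.
votes = [
    [1, 1, None, 0],
    [1, 1, 0, 0],
    [0, 1, 0, None],
]
weights = agreement_weights(votes)
```

A chance-corrected measure such as Cohen's kappa could be substituted for raw agreement without changing the overall shape of the code.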

In the setting here, where we have dev data, we could also weight each labelling function by its performance on the labelled dev data. Of course, this wouldn't work in an annotation setting where we were starting without labelled data.
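
As a sketch of the dev-based weighting, the snippet below scores a labelling function by its accuracy on the dev items where it does not abstain. It assumes the function's votes and the gold dev labels are available as parallel lists; `dev_weight` is a hypothetical helper, and accuracy is just one possible choice of performance measure.

```python
def dev_weight(lf_votes, gold_labels):
    """Accuracy of a labelling function on the dev items where it casts a vote."""
    voted = [(v, g) for v, g in zip(lf_votes, gold_labels) if v is not None]
    if not voted:
        return 0.0  # a function that never fires on dev gets no weight
    return sum(v == g for v, g in voted) / len(voted)


# Hypothetical gold labels and one labelling function's votes on five dev sentences.
gold = [1, 0, 1, 0, 1]
lf_votes = [1, None, 1, 1, None]

weight = dev_weight(lf_votes, gold)
```

With only a small dev set these estimates are noisy, so it may be worth checking how many dev items each function actually fires on.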

A key difference from `snorkel` is that the voting approach in the annotation framework stops at the pooled label: it does not go on to train the classifier on a continuous voting confidence value.
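
To make the idea of a continuous voting confidence concrete, here is a rough sketch that turns weighted votes into a value between 0 and 1 instead of a hard label. This is only an illustration of the concept and is not how `snorkel` computes its probabilistic labels; `vote_confidence` and the example weights are hypothetical.

```python
def vote_confidence(votes, weights):
    """Share of the voting weight that supports the positive class (0.5 if nobody votes)."""
    positive = sum(w for v, w in zip(votes, weights) if v == 1)
    negative = sum(w for v, w in zip(votes, weights) if v == 0)
    total = positive + negative
    if total == 0:
        return 0.5
    return positive / total


# Hypothetical votes (None = abstain) and weights for three labelling functions.
votes = [1, None, 0]
weights = [0.75, 0.8, 0.25]

confidence = vote_confidence(votes, weights)
hard_label = 1 if confidence > 0.5 else 0
```

A classifier that accepts per-example weights or soft targets could use `confidence` directly, whereas simple voting keeps only `hard_label`.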

Feel free to experiment with voting, or use `snorkel` directly. If you do plan to use `snorkel`, note that it takes a while to [install](https://github.com/HazyResearch/snorkel#installation). It would be a good idea to run the installation in the background while you start annotating and/or writing labelling functions.

Once `snorkel` is installed, the tutorials should help you get things up and running. These are in the repo and can also be viewed [on GitHub](https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro).

# Wrapping up

## Short strategy description

Before submitting, please summarise:
* The hacking/labelling strategy you followed
* How you rate this strategy, and why

TODO Add your summary right here.

TODO If you have a list of sampling strategies, please include it here.

## Submission

Submit your annotation and system output for scoring.

In [None]:
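# run current classifier on the dev and test data.
# TODO classify dev data
# NOTE: copy() is assumed to be a Dataset method; the original line omitted the call parentheses
dev_preds = dev.copy()
# TODO loop over the items in dev_preds and set each predicted label with your classifier
# TODO load test data and classify
test_preds = Dataset.from_csv('../shared-task/test.csv')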

In [None]:
# save annotations to csv
! mkdir -p ../submissions/YOUR_USERNAME_HERE
pool.to_csv('../submissions/YOUR_USERNAME_HERE/pool.csv')
# save system output to csv
# (dev_preds and test_preds hold the classifier output from the cell above)
dev_preds.to_csv('../submissions/YOUR_USERNAME_HERE/dev.csv')
test_preds.to_csv('../submissions/YOUR_USERNAME_HERE/test.csv')

In [None]:
# copy your notebook to your submission directory
! cp exercise_2.ipynb ../submissions/YOUR_USERNAME_HERE/

In [None]:
# push your submission back to the repo
! git add ../submissions/YOUR_USERNAME_HERE
! git commit -m 'Checkpoint YOUR_USERNAME_HERE' ../submissions/YOUR_USERNAME_HERE/
! git push