To start, we'll recreate one of the experiments that the Snorkel authors cite and evaluate how well that experiment justifies the use of Snorkel.

In [4]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
from snorkel import SnorkelSession

session = SnorkelSession()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
from snorkel.models import candidate_subclass

ChemicalDisease = candidate_subclass('ChemicalDisease', ['chemical', 'disease'])

train = session.query(ChemicalDisease).filter(ChemicalDisease.split == 0).all()
dev = session.query(ChemicalDisease).filter(ChemicalDisease.split == 1).all()
test = session.query(ChemicalDisease).filter(ChemicalDisease.split == 2).all()

print('Training set:\t{0} candidates'.format(len(train)))
print('Dev set:\t{0} candidates'.format(len(dev)))
print('Test set:\t{0} candidates'.format(len(test)))

Training set:	8437 candidates
Dev set:	920 candidates
Test set:	4697 candidates


In [6]:
from snorkel.annotations import load_marginals
train_marginals = load_marginals(session, split=0)

IndexError: list index out of range

In [8]:
from snorkel.annotations import load_gold_labels
L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)

In [6]:
from snorkel.learning.pytorch import LSTM_orig

train_kwargs = {
    'lr':              0.01,
    'embedding_dim':   100,
    'hidden_dim':      100,
    'n_epochs':        20,
    'dropout':         0.5,
    'rebalance':       0.25,
    'print_freq':      5,
}

lstm = LSTM_orig(n_threads=None)
lstm.train(train, train_marginals, X_dev=dev, Y_dev=L_gold_dev, **train_kwargs)

[LSTM_orig] Training model
[LSTM_orig] n_train=3587  #epochs=20  batch size=64




[LSTM_orig] Epoch 1 (16.07s)	Average loss=0.688442	Dev F1=51.44
[LSTM_orig] Epoch 6 (96.40s)	Average loss=0.664749	Dev F1=51.26
[LSTM_orig] Epoch 11 (180.14s)	Average loss=0.661062	Dev F1=52.00
[LSTM_orig] Epoch 16 (265.23s)	Average loss=0.658813	Dev F1=52.77
[LSTM_orig] Epoch 20 (333.45s)	Average loss=0.657877	Dev F1=51.66
[LSTM_orig] Model saved as <LSTM_orig>
[LSTM_orig] Training done (334.56s)
[LSTM_orig] Loaded model <LSTM_orig>


In [9]:
from load_external_annotations import load_external_labels
load_external_labels(session, ChemicalDisease, split=2, annotator='gold')
L_gold_test = load_gold_labels(session, annotator_name='gold', split=2)
L_gold_test

AnnotatorLabels created: 0


<4697x1 sparse matrix of type '<class 'numpy.int64'>'
	with 4697 stored elements in Compressed Sparse Row format>

In [8]:
lstm.score(test, L_gold_test)

(0.37417123090227733, 0.85733157199471599, 0.52097130242825618)
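
The tuple printed above is not labeled. A quick sanity check suggests that it is (precision, recall, F1): the third value should equal the harmonic mean of the first two. The variable names below are ours, and the ordering is inferred from the numbers rather than taken from Snorkel's documentation.

```python
# Values copied from the lstm.score(test, L_gold_test) output above.
# Assumption: the tuple is ordered (precision, recall, F1).
precision, recall, f1_reported = (
    0.37417123090227733,
    0.85733157199471599,
    0.52097130242825618,
)

# F1 is the harmonic mean of precision and recall.
f1_recomputed = 2 * precision * recall / (precision + recall)

# Print the recomputed F1 and its absolute difference from the reported value.
print(f1_recomputed, abs(f1_recomputed - f1_reported))
```

If the difference is negligible, the third entry can be read as the test F1, which is the number compared against the baseline below.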

In [10]:
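# Baseline: F1 score for a model that predicts "Yes" for every candidate.
# Every gold positive is then a true positive, every other candidate a false positive.
tp, fp = 0, 0
for label in L_gold_test:  # each row holds the gold label of one candidate
    if label == 1:
        tp += 1
    else:
        fp += 1
precision = tp / (tp + fp)
recall = 1  # nothing is ever predicted "No", so there are no false negatives
f1 = 2 / (1 / precision + 1 / recall)
print(precision, recall, f1)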

0.3236108154140941 1 0.488981824030883
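
The same baseline can be written as a small reusable function. This is a minimal sketch, assuming gold labels are stored as 1 for positive and any other value for negative, and that the label list is non-empty. The helper names `all_yes_scores` and `all_yes_f1_from_rate` and the toy labels are ours, not part of Snorkel. Because recall is 1, the F1 of this baseline reduces to 2p / (1 + p), where p is the fraction of positive candidates.

```python
def all_yes_scores(gold_labels):
    """Precision, recall and F1 of a classifier that labels everything positive."""
    labels = list(gold_labels)
    positives = sum(1 for y in labels if y == 1)
    precision = positives / len(labels)  # every candidate is predicted positive
    recall = 1.0                         # so no gold positive is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def all_yes_f1_from_rate(positive_rate):
    """Closed form of the same F1, given only the fraction of positives."""
    return 2 * positive_rate / (1 + positive_rate)


# Toy labels for illustration only (2 positives out of 5).
toy_gold = [1, -1, -1, 1, -1]
print(all_yes_scores(toy_gold))

# Positive rate taken from the precision printed above.
print(all_yes_f1_from_rate(0.3236108154140941))
```

On the notebook's data, a call such as `all_yes_scores(L_gold_test.toarray().ravel())` should give the same three numbers as the loop above, assuming the gold matrix converts to a dense array in the usual SciPy way.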


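We might train this again later and try to get a worse result. For now, note that although the LSTM scores higher than the baseline (test F1 of 0.521 versus 0.489 for a model that always predicts "Yes"), the difference is small. The "all yes" model is also better than the published result. The LSTM's recall of 0.857 and precision of 0.374, only slightly above the 0.324 positive rate, suggest that it is not learning much beyond the fact that it should say yes to most candidates.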
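
One way to put a number on how small the gap is would be a bootstrap over the test candidates. The sketch below is not something this notebook ran: it uses made-up labels and predictions, and the helpers `f1_score` and `bootstrap_f1_gap` are hypothetical names written for this illustration. It resamples candidates with replacement and reports an approximate 95% interval for the F1 difference between two sets of predictions. An interval that includes 0 would support the reading that the two models are hard to tell apart.

```python
import random


def f1_score(gold, pred):
    """F1 for binary labels where 1 means positive and anything else negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g != 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p != 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_gap(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Approximate 95% percentile interval for F1(pred_a) - F1(pred_b)."""
    rng = random.Random(seed)
    n = len(gold)
    gaps = []
    for _ in range(n_resamples):
        # Resample candidate indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        gaps.append(f1_score(g, a) - f1_score(g, b))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper


# Made-up data for illustration only; not the notebook's labels or predictions.
toy_rng = random.Random(1)
toy_gold = [1 if toy_rng.random() < 0.3 else -1 for _ in range(500)]
toy_all_yes = [1] * len(toy_gold)
# A hypothetical model that copies the gold label 60% of the time, else says "Yes".
toy_model = [g if toy_rng.random() < 0.6 else 1 for g in toy_gold]

print(bootstrap_f1_gap(toy_gold, toy_model, toy_all_yes))
```

To apply this to the experiment above, `pred_a` would be the LSTM's test predictions and `pred_b` a list of all ones, both aligned with the gold test labels.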

In [11]:
precision

0.3236108154140941