Notes

- I installed snorkel from the redux branch on 7 Aug into an environment running Python 3.6 -- 3.5 and 3.7 both didn't work for me. Skimming the git issues seems to indicate this is going to be fixed.
- Snorkel developers are working on a new release - "coming later this summer" - so it may soon be better to install from master or just with pip/conda
- For data, I'm using the version of the Global Terrorism Database on Kaggle (https://www.kaggle.com/START-UMD/gtd/)

Beware of blogs, etc. about snorkel which may be helpful but often discuss outdated versions of the API!
- These tutorials WERE helpful, and then they got deleted: https://github.com/HazyResearch/snorkel/tree/redux/tutorials/workshop.
- Avoid these tutorials, which are outdated: https://github.com/HazyResearch/snorkel/blob/master/tutorials
- These tutorials seem to be up-to-date: https://github.com/snorkel-team/snorkel-tutorials/blob/master

In [26]:
import pandas as pd
import numpy as np
import snorkel
import re 

from sklearn.model_selection import train_test_split

In [27]:
terror_df = pd.read_csv('/Users/awhite/Documents/globalterrorismdb_0718dist.csv',
                       encoding = 'ISO-8859-1')

  interactivity=interactivity, compiler=compiler, result=result)


In [28]:
terror_df.summary.head(20)

0     NaN                                                                                                                                                                                                                                                                                                                                                                                                                             
1     NaN                                                                                                                                                                                                                                                                                                                                                                                                                             
2     NaN                                                                                                                                                 

In [29]:
#drop items w no summary, be sure all summaries are strings
terror_df = terror_df.dropna(subset=['summary'])
len(terror_df)

115562

In [58]:
train, test = train_test_split(terror_df, test_size = 0.2, random_state = 0)

#snorkel calls for a separate validation set (and optionally a dev set for LF development)
train, valid = train_test_split(train, test_size = 0.2, random_state = 0)

Y_train = train["suicide"].values
Y_valid = valid["suicide"].values
Y_test = test["suicide"].values

In [60]:
len(test)

23113

Let's try to predict if an attack was a suicide attack - this seems easy.

In [32]:
pd.set_option("display.max_colwidth", 0)

terror_df[terror_df.suicide == 1].summary.sample(5, random_state=3)

76962    09/04/2004: A Police Academy in Kirkuk Iraq was attacked by a suicide bomber in a car, killing 20 and wounding 36. Tawhid and Jihad claimed responsibility for the attack.                                                                                                                                                                                                                                                                                       
76226    01/18/2004: A pickup truck loaded with 500 kilos of explosives exploded at the Assassin's Gate, the entrance to the main industrial center of Baghdad, Iraq, and also the United States' Military headquarters. Twenty-five were people killed and over 100 were injured in the attack, for which no group claimed responsibility.                                                                                                                               
77565    02/24/2005: An unidentified suicide car bomber wearing a police uniform t

In [33]:
terror_df[terror_df.suicide == 0].summary.sample(5, random_state=3)

112007    10/30/2012: Assailants detonated an explosive device at an electricity tower in the Ghitani area of Mach town, Balochistan province, Pakistan. There were no reported casualties; however, the targeted tower was partially damaged in the blast. No group claimed responsibility for the incident.                                                                                                               
156272    12/05/2015: An explosive device detonated as police personnel were attempting to defuse it in Waghaz district, Ghazni Province, Afghanistan. Two police officers were wounded in the blast. No group claimed responsibility for the incident.                                                                                                                                                                     
158353    01/31/2016: Explosive devices detonated targeting a military patrol in eastern Norte de Santander department, Colombia. Two soldiers were killed in the blast. No gr

In [34]:
from snorkel.labeling.apply import PandasLFApplier
from snorkel.labeling.lf import labeling_function
from snorkel.types import DataPoint

POS = 1
NEG = -1 
ABSTAIN = 0

In [35]:
dev.summary.str.contains("suicide attack").value_counts()

False    46093
True     132  
Name: summary, dtype: int64

It seems snorkel isn't very forgiving about the way you write LFs. I tried a number of different ways for these basic examples - find a substring in text - and couldn't get them to work until I copied the approach in https://github.com/snorkel-team/snorkel-tutorials/blob/master/spam/01_spam_tutorial.ipynb. For instance `str.contains()` and `str.count()` didn't work. Maybe I was doing something wrong.

While there is some disagreement on this, snorkel developers generally suggest that many simple LFs are better than a few complex LFs.

In [36]:
@labeling_function()
def suicide_mentioned(x):
    return POS if "suicide" in x.summary.lower() else ABSTAIN

In [37]:
@labeling_function()
def suicide_attack(x):
    return POS if "suicide attack" in x.summary.lower() else ABSTAIN

In [38]:
@labeling_function()
def suicide_bomb(x):
    return POS if "suicide bomb" in x.summary.lower() else ABSTAIN

In [39]:
@labeling_function()
def unknown_perps(x):
    return NEG if "unknown perpetrator" in x.summary.lower() else ABSTAIN

In [40]:
@labeling_function()
def no_responsibility(x):
    return NEG if "no group claimed responsibility" in x.summary.lower() else ABSTAIN

In [61]:
lfs = [suicide_mentioned,suicide_attack,
       suicide_bomb,unknown_perps,
       no_responsibility]

applier = PandasLFApplier(lfs)

L_train = applier.apply(df=train)
L_valid = applier.apply(df=valid)
L_test = applier.apply(df=test)

100%|██████████| 73959/73959 [00:11<00:00, 6419.39it/s]
100%|██████████| 18490/18490 [00:02<00:00, 6868.09it/s]
100%|██████████| 23113/23113 [00:03<00:00, 6830.25it/s]


In [62]:
from snorkel.labeling.analysis import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
suicide_mentioned,0,"[0, 1]",1.0,1.0,0.053678
suicide_attack,1,"[0, 1]",1.0,1.0,0.053678
suicide_bomb,2,"[0, 1]",1.0,1.0,0.053678
unknown_perps,3,[0],0.983896,0.983896,0.053638
no_responsibility,4,[0],0.341797,0.341797,0.028327


In [64]:
#If you supply gold labels, snorkel calculates empirical accuracy for you!
LFAnalysis(L=L_valid, lfs=lfs).lf_summary(Y=Y_valid)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
suicide_mentioned,0,"[0, 1]",1.0,1.0,0.053488,976,0,0.99616
suicide_attack,1,"[0, 1]",1.0,1.0,0.053488,54,0,0.946944
suicide_bomb,2,"[0, 1]",1.0,1.0,0.053488,880,0,0.991347
unknown_perps,3,[0],0.98318,0.98318,0.053434,17146,1033,0.943176
no_responsibility,4,[0],0.340779,0.340779,0.028448,5746,555,0.911919


So in theory we should stop here, because LFs by themselves give good accuracy. Or, if the description says suicide bomber...it's probably a suicide bomber. But let's work through to the end for fun.

In [65]:
#First, let's try a simple majority vote approach
from snorkel.labeling.model import MajorityLabelVoter

In [66]:
majority_model = MajorityLabelVoter()
Y_pred_train = majority_model.predict(L=L_train)
Y_pred_train

array([0, 1, 0, ..., 0, 0, 0])

In [68]:
majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

Majority Vote Accuracy:   95.7%


In [69]:
#Let's use snorkel's model for probabilistic labels and train a classifier 
from snorkel.labeling.model import LabelModel

In [70]:
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

In [71]:
label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Label Model Accuracy:     99.6%


Snorkel warns against using the label model: 

>it is typically not suitable as an inference-time model to make predictions for unseen examples, due to (among other things) some data points having all abstain labels. 

So let's train a quick bag of words model

In [72]:
from sklearn.feature_extraction.text import CountVectorizer

words_train = [row.summary for i, row in train.iterrows()]
words_valid = [row.summary for i, row in valid.iterrows()]
words_test = [row.summary for i, row in test.iterrows()]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(words_train)
X_valid = vectorizer.transform(words_valid)
X_test = vectorizer.transform(words_test)

In [92]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, Y_train)

In [93]:
predicted = classifier.predict(X_test)
np.mean(predicted == Y_test)            

0.9858953835503829

In [91]:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier().fit(X_train, Y_train)

predicted = classifier.predict(X_test)
np.mean(predicted == Y_test)

0.9951975078959893