Notes
- I installed snorkel from the redux branch into an environment running Python 3.6 -- 3.5 and 3.7 both didn't work for me. Skimming the git issues seems to indicate this is going to be fixed.
- Snorkel developers are working on a new release - so it may soon be better to install from master
- For data, I'm starting with the version of the Global Terrorism Database on Kaggle (https://www.kaggle.com/START-UMD/gtd/) and hopefully moving toa webscraped dataset of Arabic-language RT articles labeled for news category (https://data.mendeley.com/datasets/322pzsdxwy/1)
- These tutorials are helpful: https://github.com/HazyResearch/snorkel/tree/redux/tutorials/workshop. They use this dataset: https://www.dropbox.com/s/jmrvyaqew4zp9cy/spouse_data.zip.

In [12]:
import pandas as pd
import snorkel

from sklearn.model_selection import train_test_split

In [None]:
terror_df = pd.read_csv('/Users/awhite/Documents/globalterrorismdb_0718dist.csv',
                       encoding = 'ISO-8859-1')

In [9]:
terror_df.summary.head(20)

0                                                   NaN
1                                                   NaN
2                                                   NaN
3                                                   NaN
4                                                   NaN
5     1/1/1970: Unknown African American assailants ...
6                                                   NaN
7     1/2/1970: Unknown perpetrators detonated explo...
8     1/2/1970: Karl Armstrong, a member of the New ...
9     1/3/1970: Karl Armstrong, a member of the New ...
10                                                  NaN
11    1/6/1970: Unknown perpetrators threw a Molotov...
12                                                  NaN
13    1/9/1970: Unknown perpetrators set off a fireb...
14    1/9/1970:  The Armed Commandos of Liberation c...
15                                                  NaN
16                                                  NaN
17    1/12/1970: Unknown perpetrators threw a pi

In [None]:
#drop items w no summary
terror_df = terror_df.dropna(subset=['summary'])
len(terror_df)

In [18]:
train, test = train_test_split(terror_df, test_size = 0.2, random_state = 0)

#snorkel calls for a separate development set
development, train = train_test_split(train, test_size = 0.2, random_state = 0)

In [20]:
len(train)

92449

Let's try to predict if an attack was a suicide attack - this seems easy.

In [28]:
terror_df[terror_df.suicide == 1].summary.head(20)

17427    11/11/1982: A suspected suicide car bomb deton...
24745    4/9/1985: A 16-year-old girl drove a car laden...
40835    12/06/1989: At 4:30 p.m. an assailant armed wi...
44662    11/23/1990:  Two members of the Liberation Tig...
46336    05/05/1991:  Approximately six members of the ...
60995    11/24/1995:  Two female cadres of the Liberati...
61066    12/08/1995: An assailant attacked Freddy's Fas...
61097    11/11/1995: Two suicide bombers from the Liber...
63771    10/25/1996:  Members of the Liberation Tigers ...
65013    03/24/1997:  A flotilla of boats staffed by th...
66914    10/19/1997:  A flotilla of approximately 20 sm...
67476    12/18/1997: Three individuals believed to be f...
67571    01/25/1998: Liberation Tigers of Tamil Eelam (...
67578    01/26/1998: Three suicide bombers crashed a tr...
67600    02/06/1998: A female suicide bomber blew herse...
67661    02/23/1998: At least 46 soldiers and sailors w...
67676    02/27/1998: One person died and another two we.

In [21]:
from snorkel.labeling.apply import PandasLFApplier
from snorkel.labeling.lf import labeling_function
from snorkel.types import DataPoint

POS = 1
NEG = -1 
ABSTAIN = 0

In [40]:
def suicide_mentioned(x):
    return POS if x.summary.count('suicide') > 0 else ABSTAIN
def suicide_attack(x):
    return POS if x.summary.count('suicide car|suicide bomb|suicide attack') > 0 else NEG

In [None]:
applier = PandasLFApplier([suicide_mentioned,
                          suicide_attack])
L = applier.apply(terror_df)

In [41]:
from snorkel.model.metrics import coverage_score, f1_score, precision_score, recall_score

print("Coverage: \t", coverage_score(dev_labels,L[:,0]))
print("F1 score:  \t", f1_score(dev_labels,L[:,0]))
print("Precision:  \t", precision_score(dev_labels,L[:,0]))
print("Recall:  \t", recall_score(dev_labels,L[:,0]))




  0%|          | 0/115562 [00:00<?, ?it/s][A[A[A


  0%|          | 1/115562 [00:01<47:36:20,  1.48s/it][A[A[A


  0%|          | 283/115562 [00:01<33:15:14,  1.04s/it][A[A[A


  1%|          | 1382/115562 [00:01<23:03:24,  1.38it/s][A[A[A


  2%|▏         | 2456/115562 [00:02<15:59:19,  1.97it/s][A[A[A


  3%|▎         | 3484/115562 [00:02<11:05:29,  2.81it/s][A[A[A


  4%|▍         | 4695/115562 [00:02<7:40:51,  4.01it/s] [A[A[A


  5%|▌         | 6251/115562 [00:02<5:18:06,  5.73it/s][A[A[A


  7%|▋         | 7859/115562 [00:02<3:39:25,  8.18it/s][A[A[A


  8%|▊         | 9361/115562 [00:02<2:31:29, 11.68it/s][A[A[A


  9%|▉         | 10864/115562 [00:02<1:44:34, 16.69it/s][A[A[A


 11%|█         | 12542/115562 [00:02<1:12:03, 23.83it/s][A[A[A


 12%|█▏        | 14222/115562 [00:02<49:39, 34.02it/s]  [A[A[A


 14%|█▎        | 15853/115562 [00:02<34:13, 48.55it/s][A[A[A


 15%|█▌        | 17411/115562 [00:03<23:37, 69.26it/s][A[A[A


 1