Let's try to identify articles about the Syrian conflict in the same dataset of Arabic news. A few notes on findings here...

- According to some quick testing at the end of this notebook, snorkel appears to perform best when you have one LF in which you're fairly confident, and get high or complete coverage for this LF - i.e. abstain rarely or not at all. Probably would be useful to think more about when this is/isn't true.
- With that in mind, I haven't yet found a case where snorkel's tool for dropping unlabeled rows improves performance much - because getting high coverage (w solid accuracy!) is important.
- Of course, this could really just mean that I need to write better labeling functions. Iterative testing and tweaking of LFs is defnitely important.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import snorkel

%matplotlib inline
from IPython.core.pylabtools import figsize

In [2]:
from snorkel.labeling.apply import PandasLFApplier
from snorkel.labeling.lf import labeling_function

POS = 1
NEG = 0
ABSTAIN = -1

In [3]:
media_df = pd.read_csv('/Users/awhite/Documents/GitHub/snorkel-testing/arabic_news_cleaned.csv')

media_df = media_df.assign(syria = media_df.category.str.contains("سورية") == True)
media_df.syria = media_df.syria.replace({True:1,False:0})

media_df[media_df.syria == 1].head()

Unnamed: 0,text,category,syria
1479,روحاني سوريا اخير جددت طهران تاكيد ستواصل لدمش...,الأزمة_السورية,1
1480,اشنطن ترفض تعاون مكافحه ارهاب اعلنت متحدثه باس...,الأزمة_السورية,1
1481,دمشق تطلب موسكو تنظيم جوله مشاورات ثالثه معارض...,الأزمة_السورية,1
1482,صحيفه جمهوريت توكد تورط تركيا ادخال مسلحين سور...,الأزمة_السورية,1
1483,امكانيه روسيا يجتمع مقاطعه بافاريا المانيه قاد...,الأزمة_السورية,1


In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(media_df, test_size = 0.2, random_state = 0)

train, valid = train_test_split(train, test_size = 0.2, random_state = 0)
train, dev = train_test_split(train, test_size = 0.2, random_state = 0)

Y_train = train["syria"].values
Y_dev = dev["syria"].values
Y_valid = valid["syria"].values
Y_test = test["syria"].values

len(train)

14166

In [5]:
#Let's try with no time-dependent info about the conflict
#so groups that might change their name or disban aren't allowed in LFs

provinces = r"ريف دمشق|السويداء|دمشق|طرطوس|درعا|دير الزور|حلب|حماة|الحسكة|حمص|ادلب|القنيطرة|اللاذقية|الرقة"
syria_terms = r"معارض|محرر|نظام|اسد"
regional_players = r"تركي|لبنان|اسرئيل|اردن"
politics = r"سياسي|اتفاق|مفاوضات|وفد|بعثة"
war = r"حرب|اهلي|اطلاق النار|اشتباك|صراع|معارك|اسلاح|سلح"


#Exclusion terms idea didn't work well for oil - but if Syria isn't mentioned at all,
#probably isn't about Syria
@labeling_function()
def syria(x):
    return POS if re.search(r"سوريا|سوري", x.text) and re.search(syria_terms, x.text) else NEG

@labeling_function()
def provinces_mention(x):
    return POS if re.search(provinces, x.text) else ABSTAIN 

@labeling_function()
def regional_politics(x):
    return POS if re.search(r"سوريا|سوري", x.text) and re.search(regional_players, x.text) else ABSTAIN

@labeling_function()
def syria_politics(x):
    return POS if re.search(r"سوريا|سوري", x.text) and re.search(politics, x.text) else ABSTAIN

@labeling_function()
def syria_war(x):
    return POS if re.search(r"سوريا|سوري", x.text) and re.search(war, x.text) else ABSTAIN

In [6]:
lfs = [syria,provinces_mention,regional_politics,
       syria_politics,syria_war]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)
L_valid = applier.apply(df=valid)
L_test = applier.apply(df=test)

100%|██████████| 14166/14166 [00:02<00:00, 4885.48it/s]
100%|██████████| 3542/3542 [00:00<00:00, 5141.23it/s]
100%|██████████| 4428/4428 [00:00<00:00, 5091.47it/s]
100%|██████████| 5534/5534 [00:01<00:00, 5009.98it/s]


In [8]:
from snorkel.labeling.analysis import LFAnalysis

LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
syria,0,"[0, 1]",1.0,0.230661,0.137775,266,0,0.892998
provinces_mention,1,[1],0.08131,0.08131,0.042349,193,95,0.670139
regional_politics,2,[1],0.094862,0.094862,0.058159,180,156,0.535714
syria_politics,3,[1],0.102767,0.102767,0.04489,257,107,0.706044
syria_war,4,[1],0.140599,0.140599,0.075381,320,178,0.64257


In [9]:
from snorkel.labeling.model import MajorityLabelVoter
from snorkel.labeling.model import LabelModel

majority_model = MajorityLabelVoter()
Y_pred_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   89.2%
Label Model Accuracy:     88.6%


Not seeing a performance jump from the label model here. Could this be another rule of thumb indicator that more work on LFs is needed?

In [11]:
from snorkel.labeling.utils import filter_unlabeled_dataframe
from snorkel.analysis.utils import probs_to_preds

Y_probs_train = label_model.predict_proba(L=L_train)

train_filtered, Y_probs_train_filtered = filter_unlabeled_dataframe(
    X=train, y=Y_probs_train, L=L_train)

Y_preds_train_filtered = probs_to_preds(probs=Y_probs_train_filtered)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

words_train = [row.text for i, row in train_filtered.iterrows()]
words_valid = [row.text for i, row in valid.iterrows()]
words_test = [row.text for i, row in test.iterrows()]

vectorizer = TfidfVectorizer(ngram_range=(2,2))
X_train = vectorizer.fit_transform(words_train)
X_valid = vectorizer.transform(words_valid)
X_test = vectorizer.transform(words_test)

from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier().fit(X_train, Y_preds_train_filtered)

classifier.score(X_test, Y_test)            

Let's run a few more tests. First, what happens if we drop our best LF?

In [18]:
lfs = [provinces_mention,regional_politics,
       syria_politics,syria_war]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)
L_valid = applier.apply(df=valid)
L_test = applier.apply(df=test)

100%|██████████| 14166/14166 [00:02<00:00, 5727.57it/s]
100%|██████████| 3542/3542 [00:00<00:00, 5477.27it/s]
100%|██████████| 4428/4428 [00:00<00:00, 5962.38it/s]
100%|██████████| 5534/5534 [00:01<00:00, 5348.23it/s]


In [24]:
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
provinces_mention,0,[1],0.08131,0.071429,0.0,193,95,0.670139
regional_politics,1,[1],0.094862,0.071429,0.0,180,156,0.535714
syria_politics,2,[1],0.102767,0.076793,0.0,257,107,0.706044
syria_war,3,[1],0.140599,0.102484,0.0,320,178,0.64257


In [25]:
majority_model = MajorityLabelVoter()
Y_pred_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   50.4%
Label Model Accuracy:     50.4%


In [29]:
Y_probs_train = label_model.predict_proba(L=L_train)

train_filtered, Y_probs_train_filtered = filter_unlabeled_dataframe(
    X=train, y=Y_probs_train, L=L_train)

Y_preds_train_filtered = probs_to_preds(probs=Y_probs_train_filtered)
      
words_train = [row.text for i, row in train_filtered.iterrows()]
words_valid = [row.text for i, row in valid.iterrows()]
words_test = [row.text for i, row in test.iterrows()]

vectorizer = TfidfVectorizer(ngram_range=(2,2))
X_train = vectorizer.fit_transform(words_train)
X_valid = vectorizer.transform(words_valid)
X_test = vectorizer.transform(words_test)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, Y_preds_train_filtered)
      
print(f"Test Accuracy: {classifier.score(X=X_test, y=Y_test) * 100:.1f}%")

Test Accuracy: 91.3%


What about ignoring snorkel's advice to drop unlabeled rows?

In [23]:
majority_model = MajorityLabelVoter()
Y_pred_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")
      
Y_probs_train = label_model.predict_proba(L=L_train)

Y_preds_train = probs_to_preds(probs=Y_probs_train)
      
words_train = [row.text for i, row in train.iterrows()]
words_valid = [row.text for i, row in valid.iterrows()]
words_test = [row.text for i, row in test.iterrows()]

vectorizer = TfidfVectorizer(ngram_range=(2,2))
X_train = vectorizer.fit_transform(words_train)
X_valid = vectorizer.transform(words_valid)
X_test = vectorizer.transform(words_test)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, Y_preds_train)
      
print(f"Test Accuracy: {classifier.score(X=X_test, y=Y_test) * 100:.1f}%")

Majority Vote Accuracy:   50.4%
Label Model Accuracy:     50.4%
Test Accuracy: 20.3%


Finally, what about increasing our confidence in the (second) best LF by having it return negative instead of abstaining?

In [26]:
@labeling_function()
def syria_politics2(x):
    return POS if re.search(r"سوريا|سوري", x.text) and re.search(politics, x.text) else NEG

lfs = [provinces_mention,regional_politics,
       syria_politics2,syria_war]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)
L_valid = applier.apply(df=valid)
L_test = applier.apply(df=test)

100%|██████████| 14166/14166 [00:02<00:00, 5924.81it/s]
100%|██████████| 3542/3542 [00:00<00:00, 5715.12it/s]
100%|██████████| 4428/4428 [00:00<00:00, 5966.24it/s]
100%|██████████| 5534/5534 [00:00<00:00, 5992.04it/s]


In [27]:
LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts,Correct,Incorrect,Emp. Acc.
provinces_mention,0,[1],0.08131,0.08131,0.045737,193,95,0.670139
regional_politics,1,[1],0.094862,0.094862,0.056183,180,156,0.535714
syria_politics2,2,"[0, 1]",1.0,0.204687,0.127894,257,0,0.883964
syria_war,3,[1],0.140599,0.140599,0.090627,320,178,0.64257


In [28]:
majority_model = MajorityLabelVoter()
Y_pred_train = majority_model.predict(L=L_train)

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=1000, lr=0.001, log_freq=100, seed=123)

majority_acc = majority_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
label_model_acc = label_model.score(L=L_valid, Y=Y_valid)["accuracy"]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   89.1%
Label Model Accuracy:     88.5%


In [31]:
Y_probs_train = label_model.predict_proba(L=L_train)

Y_preds_train = probs_to_preds(probs=Y_probs_train)

words_train = [row.text for i, row in train.iterrows()]
words_valid = [row.text for i, row in valid.iterrows()]
words_test = [row.text for i, row in test.iterrows()]

vectorizer = TfidfVectorizer(ngram_range=(2,2))
X_train = vectorizer.fit_transform(words_train)
X_valid = vectorizer.transform(words_valid)
X_test = vectorizer.transform(words_test)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, Y_preds_train_filtered)
      
print(f"Test Accuracy: {classifier.score(X=X_test, y=Y_test) * 100:.1f}%")

Test Accuracy: 91.3%
