In [158]:
import pandas as pd
import random
import numpy as np

In [159]:
df = pd.read_csv('./dataset.csv')
df

Unnamed: 0,text,label
0,The cat is sleeping on the couch.,not a question
1,Are you going to the party tonight?,a question
2,I finished reading a book yesterday.,not a question
3,What is your favorite type of cuisine?,a question
4,She went to the store to buy some milk.,not a question
...,...,...
193,She enjoys practicing meditation and mindfulness.,not a question
194,Do you like to go to the museum?,a question
195,He is a construction worker and builds structu...,not a question
196,What is your favorite type of ice cream flavor?,a question


In [160]:
df.describe(include='all')

Unnamed: 0,text,label
count,198,198
unique,181,2
top,What is your favorite TV show?,not a question
freq,2,100


In [161]:
# Drop exact duplicate rows (same text and label), keeping the first occurrence
df = df.drop_duplicates()
df

Unnamed: 0,text,label
0,The cat is sleeping on the couch.,not a question
1,Are you going to the party tonight?,a question
2,I finished reading a book yesterday.,not a question
3,What is your favorite type of cuisine?,a question
4,She went to the store to buy some milk.,not a question
...,...,...
193,She enjoys practicing meditation and mindfulness.,not a question
194,Do you like to go to the museum?,a question
195,He is a construction worker and builds structu...,not a question
196,What is your favorite type of ice cream flavor?,a question


In [162]:
from sklearn.model_selection import train_test_split
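# Note: no random_state is passed, so the split (and every result below) changes between runs.
# Pass random_state=<int> to train_test_split to make the notebook reproducible.
df_train, df_test = train_test_split(df, train_size=0.5)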

print(df_train.shape)
print(df_test.shape)

(90, 2)
(91, 2)


### LabelingFunction class
The `LabelingFunction` class is a core component of Snorkel, a framework for weak supervision and programmatic 
data labeling. A labeling function (LF) wraps a Python function that takes a single data point (such as a text 
snippet or an image) and returns a label for it based on a heuristic or rule, or abstains when the rule does not apply.

In Snorkel, LFs are used to create labeled training data from large amounts of unlabeled data. By writing many 
LFs and combining their outputs with a label model, it is possible to obtain training labels of useful quality 
even though each individual LF is noisy and incomplete.

An LF can be built by instantiating `LabelingFunction` directly or, as in this notebook, by applying the 
`@labeling_function()` decorator to a function. Both accept optional `resources` (external data such as keyword 
lists or lookup tables) and `pre` (preprocessors run on each data point before the function is called), which is 
how domain-specific knowledge or external data sources are incorporated.
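
As a rough illustration of the contract described above, a labeling function is just a callable that maps one data point to a label or to an abstain value. The sketch below uses plain Python rather than Snorkel itself: `SimpleNamespace` stands in for a DataFrame row, and the name `lf_ends_with_question_mark` is an illustrative assumption, not part of this notebook.

```python
from types import SimpleNamespace

YES = 1        # the heuristic thinks the text is a question
ABSTAIN = -1   # the heuristic has no opinion

def lf_ends_with_question_mark(x):
    """Vote YES when the text ends with '?', otherwise abstain."""
    return YES if x.text.strip().endswith("?") else ABSTAIN

# Two stand-in data points that expose a `.text` attribute, like a DataFrame row.
rows = [
    SimpleNamespace(text="Are you going to the party tonight?"),
    SimpleNamespace(text="The cat is sleeping on the couch."),
]
votes = [lf_ends_with_question_mark(row) for row in rows]
print(votes)
```

The important design point is that the function abstains instead of guessing when its rule does not apply.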

### PandasLFApplier class
The `PandasLFApplier` class applies a list of labeling functions (LFs) to a pandas DataFrame. Calling `apply(df)` 
runs every LF on every row and returns a label matrix: a NumPy array with one row per data point and one column 
per LF, where each entry is either a label or the abstain value (-1). This label matrix is the input to 
`LFAnalysis` and to the label model.

`PandasLFApplier` processes the rows sequentially in a single process. For larger datasets, Snorkel also provides 
Dask-based appliers (for example `PandasParallelLFApplier` in `snorkel.labeling.apply.dask`) that split the 
DataFrame into partitions and apply the LFs in parallel.
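
To make the idea of a label matrix concrete, here is a minimal plain-Python sketch of what an applier does conceptually. It is not Snorkel's implementation: `apply_lfs` and the two example functions are hypothetical names, and `SimpleNamespace` objects stand in for DataFrame rows.

```python
from types import SimpleNamespace

YES, ABSTAIN = 1, -1
WH_WORDS = {"what", "why", "when", "where", "who", "how"}

def lf_has_question_mark(x):
    return YES if "?" in x.text else ABSTAIN

def lf_starts_with_wh_word(x):
    words = x.text.lower().split()
    return YES if words and words[0] in WH_WORDS else ABSTAIN

def apply_lfs(lfs, rows):
    """Return one list of votes per data point, with one entry per LF."""
    return [[lf(row) for lf in lfs] for row in rows]

rows = [
    SimpleNamespace(text="What is your favorite type of cuisine?"),
    SimpleNamespace(text="Do you like to go to the museum?"),
    SimpleNamespace(text="She went to the store to buy some milk."),
]
label_matrix = apply_lfs([lf_has_question_mark, lf_starts_with_wh_word], rows)
for votes in label_matrix:
    print(votes)
```

Each row of `label_matrix` corresponds to one text and each column to one LF, which is the same layout as the array returned by `applier.apply`.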


### LFAnalysis class
The `LFAnalysis` class computes summary statistics for a set of labeling functions from their label matrix. 
Its methods include:

- `label_coverage()`: the fraction of data points that receive a label from at least one labeling function.
- `lf_coverages()`: for each labeling function, the fraction of data points it labels.
- `lf_overlaps()`: for each labeling function, the fraction of data points labeled both by it and by at least one other labeling function.
- `lf_conflicts()`: for each labeling function, the fraction of data points where it disagrees with at least one other labeling function.
- `lf_empirical_accuracies(Y)`: the accuracy of each labeling function measured against a set of gold labels `Y`.
- `lf_summary()`: a DataFrame that combines these statistics, as used later in this notebook.

In [163]:
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
import re

# Labels
YES = 1
ABSTAIN = -1

## Creating labeling functions

In [164]:
# The @labeling_function() decorator wraps a plain Python function in a Snorkel LabelingFunction.
@labeling_function()
def question_keyword_tokens(x):
    keywords = ['what', 'why', 'when', 'where', 'who', 'how']
    # Note: `word in text` is a substring test, so 'how' also matches 'show' and 'who' matches 'whole'.
    return YES if any(word in x.text.lower() for word in keywords) else ABSTAIN
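
The keyword check above tests for substrings, not whole words. The following sketch, which uses only the standard library, contrasts that behavior with a word-boundary regex. The helper names `substring_match` and `whole_word_match` and the example sentences are assumptions made for illustration.

```python
import re

KEYWORDS = ["what", "why", "when", "where", "who", "how"]
# \b anchors the match at word boundaries, so "show" no longer matches "how".
WORD_PATTERN = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b", flags=re.I)

def substring_match(text):
    """Same test as the labeling function above: keyword anywhere in the text."""
    return any(word in text.lower() for word in KEYWORDS)

def whole_word_match(text):
    """Keyword must appear as a complete word."""
    return WORD_PATTERN.search(text) is not None

statement = "I will show you the whole house."
question = "How do you like to spend your weekends?"

print(substring_match(statement), whole_word_match(statement))
print(substring_match(question), whole_word_match(question))
```

Whether whole-word matching is preferable depends on the data. It reduces false positives on statements at the cost of a slightly more complex pattern.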

In [165]:
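@labeling_function()
def question_regex_tokens(x):
    # Caution: an unescaped `?` after `.*` makes the quantifier lazy, and `.*?` matches the empty string,
    # so this pattern matches every text (hence the coverage of 1.0 below).
    # To test for a literal question mark, escape it: r"\?".
    return YES if re.search(r".*?", x.text, flags=re.I) else ABSTAIN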
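
The behavior of the pattern `r".*?"` can be checked in isolation. This small sketch compares it with an escaped question mark on one statement and one question from the dataset. The variable names are illustrative.

```python
import re

statement = "The cat is sleeping on the couch."
question = "Are you going to the party tonight?"

# `.*?` is a lazy "match anything" and succeeds on the empty string.
lazy_any = re.compile(r".*?")
# `\?` matches a literal question mark.
literal_qmark = re.compile(r"\?")

matches_lazy = [bool(lazy_any.search(t)) for t in (statement, question)]
matches_literal = [bool(literal_qmark.search(t)) for t in (statement, question)]

print(matches_lazy)
print(matches_literal)
```

Only the escaped pattern distinguishes the two sentences. The unescaped one fires on both, which makes a labeling function built on it uninformative.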

In [166]:
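@labeling_function()
def question_regex_are_tokens_(x):
    # The function name is left over from an earlier r"are.*" pattern (see the Observation at the end).
    # The trailing `.*?` can match nothing, so this is equivalent to testing for the substring "what";
    # `.lower()` and `re.I` are redundant with each other.
    return YES if re.search(r"what.*?", x.text.lower(), flags=re.I) else ABSTAIN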

## Feed labeling functions into PandasLFApplier

In [167]:
l_functions = [question_keyword_tokens, question_regex_tokens, question_regex_are_tokens_]

applier = PandasLFApplier(lfs = l_functions)

## Create label matrix 

In [168]:
L_train = applier.apply(df=df_train)

100%|█████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 29365.02it/s]


In [169]:
# Show the first 5 rows of the label matrix (one row per text, one column per labeling function)
L_train[:5] 

array([[-1,  1, -1],
       [-1,  1, -1],
       [-1,  1, -1],
       [-1,  1, -1],
       [-1,  1, -1]])

## Apply label matrix to LFAnalysis

In [170]:
LFAnalysis(L=L_train, lfs = l_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
question_keyword_tokens,0,[1],0.288889,0.288889,0.0
question_regex_tokens,1,[1],1.0,0.288889,0.0
question_regex_are_tokens_,2,[1],0.155556,0.155556,0.0


- `Polarity`: the set of unique labels each labeling function outputs, excluding abstains
- `Coverage`: the fraction of the dataset each labeling function labels
- `Overlaps`: the fraction of the dataset that each labeling function labels together with at least one other labeling function
- `Conflicts`: the fraction of the dataset where each labeling function and at least one other labeling function both assign a label and the labels disagree
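
The definitions above can be worked through by hand on a toy label matrix. The sketch below is plain Python and is not how `LFAnalysis` is implemented. The matrix `L` and the helper `lf_stats` are made up for illustration, and a second label value (0) is included so that a conflict can occur.

```python
ABSTAIN = -1

# Toy label matrix: 4 data points (rows) x 3 labeling functions (columns).
L = [
    [1, 1, -1],
    [-1, 1, 0],
    [-1, -1, -1],
    [1, -1, -1],
]

def lf_stats(L, j):
    """Return (coverage, overlaps, conflicts) for labeling function j."""
    n = len(L)
    covered = overlap = conflict = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Non-abstain votes cast by the other labeling functions on this row.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlap += 1
        if any(v != row[j] for v in others):
            conflict += 1
    return covered / n, overlap / n, conflict / n

for j in range(3):
    print(j, lf_stats(L, j))
```

In the notebook's own summary every LF has polarity `[1]`, so no two LFs can disagree and the `Conflicts` column is necessarily 0.0.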

## Label model
The label model aggregates the labels from labeling functions to produce a final label. 
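
The simplest form of aggregation is a majority vote over the non-abstaining labeling functions, which is the idea behind Snorkel's `MajorityLabelVoter`. The sketch below is a simplified plain-Python stand-in, not Snorkel's code and not what `LabelModel` does (the latter estimates how reliable each labeling function is and weights its votes accordingly). The function name `majority_vote` is an assumption.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(row):
    """Return the most common non-abstain label in `row`, or ABSTAIN on no votes or a tie."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most frequent labels
    return ranked[0][0]

examples = [[1, 1, -1], [-1, -1, -1], [1, 0, -1], [0, 0, 1]]
predictions = [majority_vote(row) for row in examples]
print(predictions)
```

Abstaining on ties mirrors the `tie_break_policy="abstain"` option used when predicting with the label model in this notebook.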

In [171]:
from snorkel.labeling.model import MajorityLabelVoter, LabelModel

In [172]:
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train = L_train, n_epochs = 500, log_freq=100, seed=123)


INFO:root:Computing O...
INFO:root:Estimating \mu...
  0%|                                                                     | 0/500 [00:00<?, ?epoch/s]INFO:root:[0 epochs]: TRAIN:[loss=0.583]
INFO:root:[100 epochs]: TRAIN:[loss=0.006]
INFO:root:[200 epochs]: TRAIN:[loss=0.000]
INFO:root:[300 epochs]: TRAIN:[loss=0.000]
INFO:root:[400 epochs]: TRAIN:[loss=0.000]
100%|█████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 4615.91epoch/s]
INFO:root:Finished Training


In [173]:
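# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the split result
df_train = df_train.assign(labels=label_model.predict(L=L_train, tie_break_policy="abstain"))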

In [174]:
df_train

Unnamed: 0,text,label,labels
124,She enjoys spending time with her family and f...,not a question,-1
140,She likes to go to the beach and relax in the ...,not a question,-1
181,He is a teacher and educates students.,not a question,-1
137,Can you recommend a good place to go for a run?,a question,-1
87,Would you like to go to the zoo this weekend?,a question,-1
...,...,...,...
83,How do you like to stay motivated?,a question,1
17,How do you like to spend your weekends?,a question,1
98,The coffee shop has great pastries and snacks.,not a question,-1
66,He is an architect and designs buildings.,not a question,-1


In [175]:
# Sanity check: compare the label model's output with a simple count of labeling-function votes.
# With three LFs voting YES (1) or ABSTAIN (-1), the row sum is positive only when at least two vote YES.
print(L_train.shape)
result = L_train.sum(axis=1)
pos_numbers = [i for i in result if i > 0]
print(f'Rows with at least two YES votes: {len(pos_numbers)}')
print(f'Rows with fewer than two YES votes: {len(L_train) - len(pos_numbers)}')
print(f'Rows the label model left as ABSTAIN: {len(df_train[df_train.labels == ABSTAIN])}')

(90, 3)
Rows with at least two YES votes: 26
Rows with fewer than two YES votes: 64
Rows the label model left as ABSTAIN: 64
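
Because the dataset also carries a gold `label` column, the weak labels can be scored against it. The following is only a sketch on made-up data: the `pairs` list is hypothetical and merely mimics the (`label`, `labels`) columns of `df_train`. It computes how many rows received a label and how many of those are actually questions.

```python
YES, ABSTAIN = 1, -1

# Hypothetical (gold label, weak label) pairs in the same format as df_train.
pairs = [
    ("a question", YES),
    ("a question", ABSTAIN),
    ("not a question", ABSTAIN),
    ("not a question", YES),
    ("a question", YES),
]

# Keep only the rows where the label model did not abstain.
labeled = [(gold, weak) for gold, weak in pairs if weak != ABSTAIN]

coverage = len(labeled) / len(pairs)
precision = sum(gold == "a question" for gold, _ in labeled) / len(labeled)

print(f"coverage={coverage:.2f}, precision={precision:.2f}")
```

Reporting both numbers together helps, since a rule set can reach high precision simply by abstaining on most rows.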


In [176]:
df_train[df_train.labels == ABSTAIN]

Unnamed: 0,text,label,labels
124,She enjoys spending time with her family and f...,not a question,-1
140,She likes to go to the beach and relax in the ...,not a question,-1
181,He is a teacher and educates students.,not a question,-1
137,Can you recommend a good place to go for a run?,a question,-1
87,Would you like to go to the zoo this weekend?,a question,-1
...,...,...,...
128,He is a carpenter and builds furniture.,not a question,-1
106,The park is quiet today.,not a question,-1
98,The coffee shop has great pastries and snacks.,not a question,-1
66,He is an architect and designs buildings.,not a question,-1


### Observation:
> Changing the third labeling function's regex from `r"are.*"` to `r"what.*?"` increased its coverage.

- Before (`r"are.*"`)

| lf                         | j | Polarity | Coverage | Overlaps | Conflicts |
|----------------------------|---|----------|----------|----------|-----------|
| question_keyword_tokens    | 0 | [1]      | 0.288889 | 0.288889 | 0.0       |
| question_regex_tokens      | 1 | [1]      | 1.000000 | 0.333333 | 0.0       |
| question_regex_are_tokens_ | 2 | [1]      | 0.044444 | 0.044444 | 0.0       |

- After (`r"what.*?"`)

| lf                         | j | Polarity | Coverage | Overlaps | Conflicts |
|----------------------------|---|----------|----------|----------|-----------|
| question_keyword_tokens    | 0 | [1]      | 0.233333 | 0.233333 | 0.0       |
| question_regex_tokens      | 1 | [1]      | 1.000000 | 0.233333 | 0.0       |
| question_regex_are_tokens_ | 2 | [1]      | 0.144444 | 0.144444 | 0.0       |

> This resulted in an improvement in the final labels, with a more accurate depiction of the ABSTAIN rows.
> The coverage of the unchanged `question_keyword_tokens` also differs between the two tables, most likely because
> the train/test split is not seeded and each run sampled different rows, so the comparison is indicative rather than exact.