In [120]:
import pandas as pd
import random

In [121]:
df = pd.read_csv('./dataset.csv')
df

Unnamed: 0,text,label
0,The cat is sleeping on the couch.,not a question
1,Are you going to the party tonight?,a question
2,I finished reading a book yesterday.,not a question
3,What is your favorite type of cuisine?,a question
4,She went to the store to buy some milk.,not a question
...,...,...
193,She enjoys practicing meditation and mindfulness.,not a question
194,Do you like to go to the museum?,a question
195,He is a construction worker and builds structu...,not a question
196,What is your favorite type of ice cream flavor?,a question


In [122]:
df.describe(include='all')

Unnamed: 0,text,label
count,198,198
unique,181,2
top,What is your favorite TV show?,not a question
freq,2,100


In [123]:
# Get all the unique rows
df = df[~df.duplicated()]
df

Unnamed: 0,text,label
0,The cat is sleeping on the couch.,not a question
1,Are you going to the party tonight?,a question
2,I finished reading a book yesterday.,not a question
3,What is your favorite type of cuisine?,a question
4,She went to the store to buy some milk.,not a question
...,...,...
193,She enjoys practicing meditation and mindfulness.,not a question
194,Do you like to go to the museum?,a question
195,He is a construction worker and builds structu...,not a question
196,What is your favorite type of ice cream flavor?,a question


In [124]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size= 0.5)

print(df_train.shape)
print(df_test.shape)

(90, 2)
(91, 2)


### Labeling_function class
The LabelingFunction class is a key component of the Snorkel framework, which is used for weak supervision 
and programmatic data labeling. A LabelingFunction is a Python function that takes in an input data point 
(such as a text snippet or an image) and returns a label for that data point based on some heuristic or other rule.

In Snorkel, LabelingFunctions are used to create labeled training data from large amounts of unlabeled data. By 
creating many different LabelingFunctions and combining their outputs using a generative model, it is possible to 
create labeled training data that is of high quality, even if each individual LabelingFunction is not perfect.

The LabelingFunction class in Snorkel provides a number of useful features for building LabelingFunctions, including 
a built-in error handling system that allows users to handle noisy or incorrect labels, and the ability to incorporate 
domain-specific knowledge or external data sources to improve label accuracy.

### PandasLFApplier class
The PandasLFApplier class in Snorkel is used to apply labeling functions (LFs) to a pandas DataFrame. It takes a set 
of LFs and a pandas DataFrame as input, and applies the LFs to each row of the DataFrame to produce label matrices. 
The resulting label matrices can then be used to train machine learning models.

The PandasLFApplier class has several useful features, such as the ability to automatically parallelize LF 
application across multiple CPU cores, and the ability to apply LFs to subsets of the DataFrame based on the 
values of certain columns. It also allows for easy integration with other parts of the Snorkel workflow, such as 
the LabelModel class for model training and the LabelingFunction class for defining LFs.


### LFAnalysis class
The Snorkel LFAnalysis class is a utility class that provides various analysis functions for a set of labeling 
functions in a Snorkel project. Some of the analysis functions provided by LFAnalysis include:

- coverage: Computes the percentage of examples that are covered by at least one labeling function.
- overlaps: Computes the overlap between pairs of labeling functions.
- conflicts: Computes the number of times labeling functions disagree on a given example.
- empirical_accuracy: Computes the empirical accuracy of each labeling function, based on a small set of gold labels.

In [125]:
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
import re

# Labels
YES = 1
NO = 0
ABSTAIN = -1

## Creating labeling functions

In [126]:
# python decorator to take function as argument into labeling_function()
@labeling_function()
def question_keyword_tokens(x):
    keywords = ['what', 'why', 'when', 'where', 'who', 'how']
    return YES if any(word in x.text.lower() for word in keywords) else ABSTAIN 

In [127]:
@labeling_function()
def question_regex_tokens(x):
    return YES if re.search(r".*?", x.text, flags=re.I) else ABSTAIN

## Feed labeling functions into PandasLFApplier

In [128]:
l_functions = [question_keyword_tokens, question_regex_tokens]

applier = PandasLFApplier(lfs = l_functions)

## Create label matrix 

In [129]:
L_train = applier.apply(df=df_train)

100%|█████████████████████████████████████████████████████████████| 90/90 [00:00<00:00, 32233.57it/s]


## Apply label matrix to LFAnalysis

In [133]:
LFAnalysis(L=L_train, lfs = l_functions).lf_summary()

Unnamed: 0,j,Polarity,Coverage,Overlaps,Conflicts
question_keyword_tokens,0,[1],0.255556,0.255556,0.0
question_regex_tokens,1,[1],1.0,0.255556,0.0
