# 🐙 Using Rubrix and Snorkel for human-in-the-loop weak supervision

In this tutorial, we will walk through the process of using Rubrix to improve weak supervision and data programming workflows with the amazing Snorkel library.

## Introduction

**Our goal is to show you how you can incorporate Rubrix into data programming workflows** to programmatically build training data with a human-in-the-loop approach. We will use the widely known [Snorkel](https://www.snorkel.org/) library, but a similar approach can be used with other data augmentation libraries such as [TextAttack](https://github.com/QData/TextAttack) or [nlpaug](https://github.com/makcedward/nlpaug).

### What are weak supervision and Snorkel?

Weak supervision is a branch of machine learning that trades label quality for efficiency: instead of hand-labeling every example, we obtain noisier labels from cheaper sources. Snorkel is a library for programmatically building and managing training datasets without manual labeling, which makes it a natural fit for this approach.

### This tutorial

In this tutorial, we will follow the [Spam classification tutorial](https://www.snorkel.org/use-cases/01-spam-tutorial) from Snorkel's documentation and add specific parts where you can leverage Rubrix for an improved workflow.

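The tutorial is organized into:

1. **Loading and exploring data**

2. **Finding and writing good labeling functions**

3. **Combining the labeling functions with the Snorkel Label Model**

4. **Training a classifier and exploring its predictions**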

## Install Snorkel, Textblob and spaCy

In [1]:
#!pip install snorkel textblob -qqq

In [2]:
#!python -m spacy download en_core_web_sm

## Setup Rubrix

If you have not installed and started Rubrix yet, follow the install and setup guide first. You can work against either a local or a remote Rubrix instance.

By default, Rubrix initializes against a local instance (as shown in the setup guide). If you want to specify an API URL and key, you can pass that information via two environment variables: `RUBRIX_API_KEY` and `RUBRIX_API_URL`.
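As a rough sketch, the two environment variables can also be set from Python before the first Rubrix call. The URL and key below are placeholders assumed for illustration, not values taken from this tutorial; replace them with those of your own instance.

```python
import os

# Placeholder values (assumptions): use the URL and key of your own Rubrix instance.
os.environ["RUBRIX_API_URL"] = "http://localhost:6900"
os.environ["RUBRIX_API_KEY"] = "<your-api-key>"

# Set the variables before the first Rubrix call (for example rb.log),
# so that the default initialization can pick them up.
configured = {name: os.environ[name] for name in ("RUBRIX_API_URL", "RUBRIX_API_KEY")}
```

Setting the variables in your shell before starting the notebook works just as well.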

In [1]:
import rubrix as rb

## 1. Loading data

Rubrix allows you to log and track data for different NLP tasks (such as *Token Classification* or *Text Classification*). 

In this tutorial, we will use the [YouTube Spam Collection](http://www.dt.fee.unicamp.br/~tiago//youtubespamcollection/) dataset, which poses a binary classification task: detecting spam comments on YouTube videos.

Let's load it in Pandas and take a look!

In [2]:
import pandas as pd
# The train/test splits were created locally to reproduce the results of the Snorkel tutorial
df_train = pd.read_csv('yt_comments_train.csv')
df_test = pd.read_csv('yt_comments_test.csv')

In [3]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,author,date,text,label,video
0,0,Alessandro leite,2014-11-05T22:21:36,pls http://www10.vakinha.com.br/VaquinhaE.aspx...,-1.0,1
1,1,Salim Tayara,2014-11-02T14:33:30,"if your like drones, plz subscribe to Kamal Ta...",-1.0,1
2,2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,-1.0,1
3,3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",-1.0,1
4,4,css403,2014-11-07T14:25:48,"i am 2,126,492,636 viewer :D﻿",-1.0,1


In [4]:
# For clarity, we define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = -1
HAM = 0
SPAM = 1

## 2. Finding and writing good labeling functions

### What are labeling functions?

Labeling functions (LFs) are heuristics that take a data point as input and assign a label to it or abstain (don’t assign any label).
LFs are noisy (they might not have perfect accuracy) and don't have to label every data point.

Some of the most common LFs rely on:

- Keyword searches: looking for specific words in a sentence
- Pattern matching: looking for specific syntactical patterns
- Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
- Distant supervision: using an external knowledge base
- Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data

More information can be found in the [Snorkel LFs tutorial](https://www.snorkel.org/use-cases/01-spam-tutorial#a-gentle-introduction-to-lfs).

### Typical LF development workflow

As mentioned in the Snorkel tutorial, an LF development cycle looks like this:
    
1. Look at examples to generate ideas for LFs
2. Write an initial version of an LF
3. Spot check its performance by looking at its output on data points in the training set (or development set if available)
4. Refine and debug to improve coverage or accuracy as necessary

Let's go through these steps and use Rubrix along the way.

#### a) Exploring the training set with Rubrix for initial inspiration

First, we have to create the records and log them to Rubrix:

In [5]:
records= []

for index, record in df_train.iterrows():     
    item = rb.TextClassificationRecord(
        id=index,
        inputs={"text": record["text"]},
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    
    records.append(item)

In [6]:
rb.log(records=records, name="yt_spam_snorkel")

Tried to log data without previous initialization. An initialization by default has been performed.


BulkResponse(dataset='yt_spam_snorkel', processed=1586, failed=0)

Once the data is in Rubrix, we can explore it together with the metadata we just added by pressing the "view metadata" button in the bottom right corner of each sample. Take a look at the following picture:

<img src="metadata_view.png">

Now, we can explore the metadata by going to the top left. Let's say we want to check the different authors or inspect all the data with a specific one. By going to the author section, we can easily do all of this.

<img src="metadata_author.png">

After applying the changes, we should only see the comments that belong to the selected author. We can also add more conditions besides the name of the author, but we will keep it simple this time.  
Another option is to use the top right search box to find samples that contain a certain word or a phrase ("put inside parentheses"). We will search for "check", since comments containing it are more likely to be SPAM. 

<img src="search_check.png">

By looking at the comments, it seems like most of them belong to the SPAM class. It's time we create a function that assigns the SPAM label to those comments.  
We will also add a rule that assigns the SPAM label to comments with the expression "check out".

#### b) Writing our initial LFs and analyzing their outputs with Rubrix

In [7]:
from snorkel.labeling import labeling_function


@labeling_function()
def check(x):
    return SPAM if "check" in x.text.lower() else ABSTAIN


@labeling_function()
def check_out(x):
    return SPAM if "check out" in x.text.lower() else ABSTAIN

In [8]:
from snorkel.labeling import PandasLFApplier

# List of labeling functions
lfs = [check_out, check]

# Apply labeling functions to the dataset (df_train)
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1586/1586 [00:00<00:00, 56539.60it/s]


To see how the functions affect our dataset, we can show some information:

In [9]:
# Summarize polarity, coverage, overlaps and conflicts for each LF
from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `check_out` | 0 | [1] | 0.214376 | 0.214376 | 0.0 |
| `check` | 1 | [1] | 0.257881 | 0.214376 | 0.0 |
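To make the Coverage, Overlaps and Conflicts columns easier to read, here is a minimal pure-Python sketch of how we understand these quantities. It is not Snorkel's implementation. `lf_stats` and `L_toy` are hypothetical names, and the label matrix is made up for illustration.

```python
ABSTAIN = -1


def lf_stats(L):
    """Per-LF coverage, overlaps and conflicts for a label matrix.

    L is a list of rows (one per data point); each row holds one label per LF.
    """
    n_points = len(L)
    stats = []
    for j in range(len(L[0])):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Labels from the other LFs that did not abstain on this data point
            others = [lab for k, lab in enumerate(row) if k != j and lab != ABSTAIN]
            if others:
                overlaps += 1
            if any(lab != row[j] for lab in others):
                conflicts += 1
        stats.append({
            "coverage": covered / n_points,
            "overlaps": overlaps / n_points,
            "conflicts": conflicts / n_points,
        })
    return stats


# Toy label matrix: 4 data points (rows), 2 LFs (columns)
L_toy = [
    [1, 1],    # both vote SPAM: overlap, no conflict
    [-1, 1],   # only the second LF votes
    [0, 1],    # the LFs disagree: overlap and conflict
    [-1, -1],  # both abstain
]
stats = lf_stats(L_toy)
```

In the summary above, both LFs only ever vote SPAM, which is why they overlap but never conflict.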


[[Snorkel source]](https://www.snorkel.org/use-cases/01-spam-tutorial#c-evaluate-performance-on-training-set) Let’s look at the data points where the "check out" rule abstained but the "check" rule assigned a label. We can use `get_label_buckets()` to group data points by their predicted and/or true labels.

In [10]:
from snorkel.analysis import get_label_buckets

buckets = get_label_buckets(L_train[:, 0], L_train[:, 1]) # column 0 is the check_out LF, column 1 the check LF
sample_abstain_spam = df_train.iloc[buckets[(ABSTAIN, SPAM)]]
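The bucketing idea can be sketched in a few lines of plain Python: each data point is filed under the tuple of labels it received, and the bucket stores the row positions. This is only an illustration of the concept. `label_buckets` is a hypothetical helper, not part of Snorkel.

```python
from collections import defaultdict

ABSTAIN, SPAM = -1, 1


def label_buckets(*label_columns):
    """Map each tuple of labels to the row positions that received it."""
    buckets = defaultdict(list)
    for i, labels in enumerate(zip(*label_columns)):
        buckets[labels].append(i)
    return dict(buckets)


# Made-up outputs of two LFs over four data points
first_lf = [ABSTAIN, SPAM, ABSTAIN, ABSTAIN]
second_lf = [SPAM, SPAM, SPAM, ABSTAIN]

toy_buckets = label_buckets(first_lf, second_lf)
# Rows where the first LF abstained but the second one voted SPAM
abstain_then_spam = toy_buckets[(ABSTAIN, SPAM)]
```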

In [11]:
sample_abstain_spam

Unnamed: 0.1,Unnamed: 0,author,date,text,label,video
2,2,Phuc Ly,2014-01-20T15:27:47,go here to check the views :3﻿,-1.0,1
3,3,DropShotSk8r,2014-01-19T04:27:18,"Came here to check the views, goodbye.﻿",-1.0,1
16,16,zhichao wang,2013-11-29T02:13:56,i think about 100 millions of the views come f...,-1.0,1
21,21,BeBe Burkey,2013-11-28T16:30:13,and u should.d check my channel and tell me wh...,-1.0,1
38,38,BIGMOFO Tonkatruck,2014-11-12T06:26:42,just came to check the view count﻿,-1.0,1
...,...,...,...,...,...,...
1146,8,Dominic Randall,,Hi everyone. We are a duo and we are starting ...,-1.0,4
1490,352,MrJtill0317,,┏━━━┓┏┓╋┏┓┏━━━┓┏━━━┓┏┓╋╋┏┓ ┃┏━┓┃┃┃╋┃┃┃┏━┓┃┗┓┏...,-1.0,4
1510,372,George Raps,,••••►►My name is George and let me tell u EMIN...,-1.0,4
1554,416,Chelsea Cameron,,*****PLEASE READ***** Hey everyone! I&#39;m a...,-1.0,4


Let's use Rubrix to see how it looks.

In [12]:
records= []

for index, record in sample_abstain_spam.iterrows():     
    item = rb.TextClassificationRecord(
        id=str(index),
        inputs={"text": record["text"]},
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    
    records.append(item)

In [13]:
rb.log(records=records, name="yt_spam_snorkel_sample_abstain_spam")

BulkResponse(dataset='yt_spam_snorkel_sample_abstain_spam', processed=69, failed=0)

Most of these are SPAM, but a good number are false positives. One way to keep precision high without sacrificing much coverage would be to use a regular expression that matches the word "check" followed by something else and then the word "out", for example "check this out".
We could explore the dataset with Rubrix to find more precise patterns, or to discover completely new ones.

In [14]:
import re

# Function that checks if the pattern "check [something_else] out" exists
@labeling_function()
def regex_check_out(x):
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN
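Before applying the rule, it can help to try the same pattern on a few strings. The sketch below uses made-up comments. The last one shows that `.*` is permissive: "out" is also found inside "about", so the rule can fire on text that never says "check ... out".

```python
import re

# Same pattern and flag as in regex_check_out
pattern = re.compile(r"check.*out", flags=re.I)

# Made-up comments, only for illustration
examples = [
    "Check out my channel",
    "please check this out",
    "go here to check the views",
    "check my page, it is about our band",
]
matches = [bool(pattern.search(text)) for text in examples]
```

If such matches turn out to hurt precision, a stricter variant with word boundaries or a bounded gap between the two words is one option to explore in Rubrix.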

In [15]:
# List of labeling functions
lfs = [check_out, regex_check_out]

# Apply labeling functions to the dataset (df_train)
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1586/1586 [00:00<00:00, 44651.70it/s]


In [16]:
buckets = get_label_buckets(L_train[:, 0], L_train[:, 1]) # 0 corresponds to check_out lf and 1 to regex_check_out lf
yt_sample_abstain_spam_regex = df_train.iloc[buckets[(ABSTAIN, SPAM)]].sample(10, random_state=1)

In [17]:
yt_sample_abstain_spam_regex

Unnamed: 0.1,Unnamed: 0,author,date,text,label,video
1146,8,Dominic Randall,,Hi everyone. We are a duo and we are starting ...,-1.0,4
178,178,Imprezzi Vidzz,2014-11-04T03:12:23,"My videos are half way decent, check them out ...",-1.0,1
535,185,David Sean,2014-09-15T10:53:46,Hey guys! I've made a amazing Smiley T-Shirt.O...,-1.0,2
815,115,Mah Productions,2014-07-26T17:07:51.274000,Check the shit out on my channel<br /><br /><b...,-1.0,3
670,320,Nadeen Efein,2014-10-01T21:15:49,hi beaties! i made a new channel please go che...,-1.0,2
630,280,Michael Jurek,2014-11-02T13:37:06,I did a cover if u want to check it out THANK ...,-1.0,2
563,213,Technibility,2014-09-12T21:08:47,Great video by a great artist in Katy Perry! A...,-1.0,2
529,179,Nerdy Peach,2014-10-29T22:44:41,Hey! I'm NERDY PEACH and I'm a new youtuber an...,-1.0,2
270,270,Kyle Jaber,2014-01-19T00:21:29,Check me out! I'm kyle. I rap so yeah ﻿,-1.0,1
1009,309,Hazetrix (EHazardStudio),2015-04-08T00:09:33.033000,hey guys im 17 years old remixer and producer ...,-1.0,3


Load dataset into Rubrix...

In [18]:
records= []

for index, record in yt_sample_abstain_spam_regex.iterrows():     
    item = rb.TextClassificationRecord(
        id=str(index),
        inputs={"text": record["text"]},
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    
    records.append(item)

In [19]:
rb.log(records=records, name="yt_spam_snorkel_sample_abstain_spam_regex")

BulkResponse(dataset='yt_spam_snorkel_sample_abstain_spam_regex', processed=10, failed=0)

It seems the regex rule catches a fair amount of spam that the simple "check out" rule overlooks. Let's discard the two initial rules in favor of the regex rule for the rest of the tutorial.

#### c) Writing Keyword LFs with Rubrix

Just as we created labeling functions for rules before, we can create labeling functions that check a list of keywords using the `LabelingFunction` class.

In [20]:
from snorkel.labeling import LabelingFunction


def keyword_lookup(x, keywords, label):
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

These are from the original Snorkel tutorial:

In [21]:
"""Spam comments talk about 'my channel', 'my video', etc."""
keyword_my = make_keyword_lf(keywords=["my"])

"""Spam comments ask users to subscribe to their channels."""
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])

"""Spam comments post links to other channels."""
keyword_link = make_keyword_lf(keywords=["http"])

"""Spam comments make requests rather than commenting."""
keyword_please = make_keyword_lf(keywords=["please", "plz"])

"""Ham comments actually talk about the video's content."""
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)

We can always try to find better word lists by searching for a candidate word in our Rubrix dataset and deciding whether it is a useful signal for separating SPAM from HAM.
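One thing worth checking while searching for keywords: `keyword_lookup` uses substring matching, so a short keyword such as "my" also matches inside longer words. The sketch below contrasts this with a whole-word match. `contains_substring` and `contains_word` are hypothetical helpers, and the texts are made up.

```python
import re


def contains_substring(text, keyword):
    """Substring test, as in keyword_lookup."""
    return keyword in text.lower()


def contains_word(text, keyword):
    """Whole-word test using regex word boundaries."""
    return re.search(rf"\b{re.escape(keyword)}\b", text.lower()) is not None


spam_like = "Check out my channel"
ham_like = "Amy joined the army"

substring_hits = [contains_substring(t, "my") for t in (spam_like, ham_like)]
word_hits = [contains_word(t, "my") for t in (spam_like, ham_like)]
```

Whether the stricter match is actually better for this dataset is something to verify by browsing the matching comments in Rubrix.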

In [22]:
# List of labeling functions
lfs = [keyword_my, keyword_subscribe, keyword_link, keyword_please, keyword_song, regex_check_out]

# Apply labeling functions to the dataset (df_train)
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1586/1586 [00:00<00:00, 19187.37it/s]


In [23]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `keyword_my` | 0 | [1] | 0.198613 | 0.165826 | 0.030265 |
| `keyword_subscribe` | 1 | [1] | 0.127364 | 0.081967 | 0.008827 |
| `keyword_http` | 2 | [1] | 0.119168 | 0.042875 | 0.005675 |
| `keyword_please` | 3 | [1] | 0.112232 | 0.102774 | 0.013241 |
| `keyword_song` | 4 | [0] | 0.141866 | 0.043506 | 0.043506 |
| `regex_check_out` | 5 | [1] | 0.233922 | 0.101513 | 0.020177 |


#### d) Writing Heuristic LFs with Rubrix

We have already seen how to use keywords to label our data. The next step is to label with heuristics. A simple approach is to look at the length of a comment and consider it HAM if it is shorter than a threshold.  
To choose the right threshold, we explore our data in Rubrix using the metadata field, much as we did before with the author selection. For this example, we will use a threshold of 20 words.

In [24]:
@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.text.split()) < 20 else ABSTAIN
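Note that the `textlen` metadata we logged earlier counts characters, while `short_comment` counts whitespace-separated words. The sketch below shows the difference and how the share of "short" comments changes with the threshold. `fraction_short` is a hypothetical helper and the comments are made up.

```python
def fraction_short(texts, max_words):
    """Fraction of texts with fewer than max_words whitespace-separated tokens."""
    return sum(len(t.split()) < max_words for t in texts) / len(texts)


# Made-up comments, only for illustration
toy_comments = [
    "cool video!",
    "love this song",
    "check out my channel " + "please " * 20,
]

char_len = len(toy_comments[0])          # characters, like the textlen metadata
word_len = len(toy_comments[0].split())  # words, like short_comment

# Share of comments labeled by the heuristic for two candidate thresholds
coverage_by_threshold = {t: fraction_short(toy_comments, t) for t in (3, 20)}
```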

In [25]:
# List of labeling functions
lfs = [short_comment]

# Apply labeling functions to the dataset (df_train)
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1586/1586 [00:00<00:00, 43625.05it/s]


In [26]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `short_comment` | 0 | [0] | 0.810845 | 0.0 | 0.0 |


#### e) Writing and exploring third-party models LFs

Let's explore TextBlob predictions on the training set with Rubrix:

In [27]:
from textblob import TextBlob

records= []
for index, record in df_train.iterrows():   
    scores = TextBlob(record["text"])
    item = rb.TextClassificationRecord(
        id=str(index),
        inputs={"text": record["text"]},
        multi_label= False,
        prediction=[("subjectivity", max(0.0, scores.sentiment.subjectivity))],
        prediction_agent="TextBlob",
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    
    records.append(item)

In [28]:
rb.log(records=records, name="yt_spam_snorkel_textblob")

BulkResponse(dataset='yt_spam_snorkel_textblob', processed=1586, failed=0)

In the dataset view, we can filter our data based on the confidence of the prediction, which here is the TextBlob subjectivity score. This helps because comments tend to be SPAM the lower their subjectivity is. We can take advantage of this by filtering the predictions using confidence intervals. For this example we are going to set a threshold of 0.56 for the subjectivity.  

<img src="confidence_interval.png">

We can do for polarity what we did for subjectivity. In this case we will set the threshold to 0.9.
This comes in handy when we want to stack one restriction on top of another. We won't go much deeper this time, but we could also combine the existing metadata, or additional metadata fields, with the confidence to get a better understanding of the data.

Once we have our conclusions, we can proceed with creating the labeling functions using the thresholds for the confidence.

In [29]:
from snorkel.preprocess import preprocessor
from textblob import TextBlob


@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return HAM if x.subjectivity >= 0.5 else ABSTAIN

@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return HAM if x.polarity >= 0.9 else ABSTAIN

In [30]:
lfs = [textblob_polarity, textblob_subjectivity]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)

LFAnalysis(L_train, lfs).lf_summary()

100%|██████████| 1586/1586 [00:00<00:00, 1760.23it/s]


| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `textblob_polarity` | 0 | [0] | 0.035939 | 0.014502 | 0.0 |
| `textblob_subjectivity` | 1 | [0] | 0.357503 | 0.014502 | 0.0 |


#### SpaCy LF

In [31]:
from snorkel.preprocess.nlp import SpacyPreprocessor

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

We can also use information from a [spaCy](https://spacy.io/) model in our labeling functions:

In [32]:
@labeling_function(pre=[spacy])
def has_person(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN

Because spaCy is such a common preprocessor for NLP applications, Snorkel also provides a prebuilt `labeling_function`-like decorator that uses spaCy:

In [33]:
from snorkel.labeling.lf.nlp import nlp_labeling_function


@nlp_labeling_function()
def has_person_nlp(x):
    # Ham comments usually mention specific people
    if any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN

In [34]:
# List of labeling functions
lfs = [has_person_nlp, regex_check_out]

# Apply labeling functions to the dataset (df_train)
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

100%|██████████| 1586/1586 [00:07<00:00, 222.26it/s]


In [35]:
buckets = get_label_buckets(L_train[:, 0], L_train[:, 1]) # column 0 is the has_person_nlp LF, column 1 the regex_check_out LF
yt_sample_ham_abstain_person = df_train.iloc[buckets[(HAM, SPAM)]].sample(10, random_state=1)

In [36]:
yt_sample_ham_abstain_person

Unnamed: 0.1,Unnamed: 0,author,date,text,label,video
1350,212,101Tele,,yo I know nobody will probably even read this....,-1.0,4
92,92,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",-1.0,1
1293,155,TheJohnRage,,"Alright ladies, if you like this song, then ch...",-1.0,4
1577,439,Andrew Guasch,,"Listen...Check out Andrew Guasch - Crazy, Sick...",-1.0,4
1384,246,media.uploader,,Check out my channel to see Rihanna short mix ...,-1.0,4
1366,228,DanteBTV,,Check Out The New Hot Video By Dante B Called ...,-1.0,4
1512,374,Jacob Johnson,,You guys should check out this EXTRAORDINARY w...,-1.0,4
1459,321,killtheclockhd,,check out our bands page on youtube killtheclo...,-1.0,4
1372,234,marion guy,,Okay trust me I&#39;m doing a favor. You NEED ...,-1.0,4
1039,339,Carren Mangali,2015-03-21T07:01:04.171000,Check out this playlist on YouTube:🍴🍴🏄🏄🏄🍴🏄🏄🏄🏄🏊...,-1.0,3


In [37]:
LFAnalysis(L_train, lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `has_person_nlp` | 0 | [0] | 0.123581 | 0.034678 | 0.034678 |
| `regex_check_out` | 1 | [1] | 0.233922 | 0.034678 | 0.034678 |


As we can see, since one is predicting HAM and the other one SPAM, the overlap is the same as the conflicts, but the amount is pretty small. The coverage might not be impressive but could be considered significant if the has_person_nlp heuristic works well. 

The sample is now ready to be logged to Rubrix so we can see how it looks. We can also combine it with our TextBlob model (or any other) to attach some predictions.

In [41]:
records= []

for index, record in yt_sample_ham_abstain_person.iterrows():   
    scores = TextBlob(record["text"])
    item = rb.TextClassificationRecord(
        id=str(index),
        inputs={"text": record["text"]},
        multi_label= True,
        prediction=[("subjectivity", max(0.0, scores.sentiment.subjectivity)),
                    ("polarity", max(0.0, scores.sentiment.polarity))],
        prediction_agent="TextBlob",
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    records.append(item)

In [42]:
rb.log(records=records, name="yt_spam_snorkel_spacy")

BulkResponse(dataset='yt_spam_snorkel_spacy', processed=10, failed=0)

## 4. Using Snorkel Label Model

We have mentioned multiple functions that could be used to label our data, but we never gave a solution on how to deal with the overlap and conflicts. In this section, we will deal with this problem.  
But first, let's see all the functions we have created.

In [43]:
lfs = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
    regex_check_out,
    short_comment,
    has_person_nlp,
    textblob_polarity,
    textblob_subjectivity,
]

And apply them to our training and test datasets:

In [44]:
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)

100%|██████████| 1586/1586 [00:00<00:00, 10145.38it/s]
100%|██████████| 250/250 [00:01<00:00, 176.34it/s]


In [45]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| | j | Polarity | Coverage | Overlaps | Conflicts |
|---|---|---|---|---|---|
| `keyword_my` | 0 | [1] | 0.198613 | 0.197982 | 0.175914 |
| `keyword_subscribe` | 1 | [1] | 0.127364 | 0.124212 | 0.108449 |
| `keyword_http` | 2 | [1] | 0.119168 | 0.115385 | 0.108449 |
| `keyword_please` | 3 | [1] | 0.112232 | 0.112232 | 0.094578 |
| `keyword_song` | 4 | [0] | 0.141866 | 0.134931 | 0.043506 |
| `regex_check_out` | 5 | [1] | 0.233922 | 0.2314 | 0.216267 |
| `short_comment` | 6 | [0] | 0.810845 | 0.612863 | 0.372636 |
| `has_person_nlp` | 7 | [0] | 0.123581 | 0.121059 | 0.071879 |
| `textblob_polarity` | 8 | [0] | 0.035939 | 0.035939 | 0.005675 |
| `textblob_subjectivity` | 9 | [0] | 0.357503 | 0.339218 | 0.160151 |


A simple baseline for doing this is to take the majority vote on a per-data point basis: if more LFs voted SPAM than HAM, label it SPAM (and vice versa). We can test this with the [MajorityLabelVoter](https://snorkel.readthedocs.io/en/master/packages/_autosummary/labeling/snorkel.labeling.model.baselines.MajorityLabelVoter.html#snorkel.labeling.model.baselines.MajorityLabelVoter) baseline model implemented in Snorkel.

In [46]:
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train) # y_train labels
Y_test = df_test.label.values # y_test labels
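To make the voting rule concrete, here is a minimal pure-Python sketch of a per-data-point majority vote. It is an illustration, not Snorkel's implementation. `majority_vote` is a hypothetical helper, and returning ABSTAIN on ties is a choice made for this sketch.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def majority_vote(row):
    """Return the most voted label in one row of the label matrix."""
    votes = Counter(label for label in row if label != ABSTAIN)
    if not votes:
        return ABSTAIN  # no LF voted
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the top labels
    return ranked[0][0]


toy_rows = [
    [SPAM, SPAM, HAM, ABSTAIN],   # two SPAM votes against one HAM vote
    [SPAM, HAM, ABSTAIN, ABSTAIN],  # tie
    [ABSTAIN, ABSTAIN, ABSTAIN, ABSTAIN],
    [HAM, ABSTAIN, ABSTAIN, ABSTAIN],
]
toy_preds = [majority_vote(row) for row in toy_rows]
```

In the first row, the two SPAM votes win even if they come from near-duplicate rules, because every LF counts the same regardless of how reliable or redundant it is.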

The major inconvenience of this approach is that we defined rules that could be correlated, resulting in certain signals being overrepresented in a majority-vote-based model.  
To handle this issue, we are going to make use of the LabelModel. You can read more about how it works in the [Snorkel tutorial](https://www.snorkel.org/use-cases/01-spam-tutorial#4-combining-labeling-function-outputs-with-the-label-model) and the [documentation](https://snorkel.readthedocs.io/en/master/packages/_autosummary/labeling/snorkel.labeling.model.label_model.LabelModel.html#snorkel.labeling.model.label_model.LabelModel).

In [47]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True) # cardinality = number of classes
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

In [48]:
majority_acc = majority_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")

label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
    "accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")

Majority Vote Accuracy:   76.0%
Label Model Accuracy:     88.8%


As we can see, the LabelModel outperforms the basic MajorityLabelVoter configuration.

#### Filtering unlabeled data

The approach above has one small drawback: some data points may receive no label at all, because every LF abstained on them. We remove those unlabeled data points from the training dataset using a [built-in utility](https://snorkel.readthedocs.io/en/master/packages/_autosummary/labeling/snorkel.labeling.filter_unlabeled_dataframe.html#snorkel.labeling.filter_unlabeled_dataframe).

In [49]:
from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train,
    y=label_model.predict_proba(L_train), # Probabilities of each data point for each class
    L=L_train
)
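Conceptually, the filter keeps a data point only if at least one LF voted on it. The pure-Python sketch below illustrates that idea with made-up data. `filter_unlabeled` is a hypothetical helper, not the Snorkel utility itself.

```python
ABSTAIN = -1


def filter_unlabeled(rows, probs, L):
    """Keep only the rows (and their probabilities) that received at least one vote."""
    keep = [any(label != ABSTAIN for label in lf_row) for lf_row in L]
    kept_rows = [row for row, k in zip(rows, keep) if k]
    kept_probs = [p for p, k in zip(probs, keep) if k]
    return kept_rows, kept_probs


# Made-up data: three comments, their class probabilities and their LF votes
toy_rows = ["check out my channel", "nice", "love this song"]
toy_probs = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
toy_L = [
    [1, 1],    # two SPAM votes
    [-1, -1],  # every LF abstained: dropped
    [0, -1],   # one HAM vote
]
kept_rows, kept_probs = filter_unlabeled(toy_rows, toy_probs, toy_L)
```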

Now that we have our data, we can explore the results in Rubrix and manually relabel those cases that have been wrongly classified or keep exploring the performance of our LFs.

In [51]:
records = []
# enumerate gives the row position i into probs_train_filtered,
# since the DataFrame index no longer matches positions after filtering
for i, (index, record) in enumerate(df_train_filtered.iterrows()):
    item = rb.TextClassificationRecord(
        id=str(index),
        inputs={"text": record["text"]},
        multi_label=True,
        # our confidences/scores come from probs_train_filtered
        # probs_train_filtered[i][j] is the probability the sample i belongs to class j
        prediction=[("HAM", probs_train_filtered[i][0]),   # 0 for HAM
                    ("SPAM", probs_train_filtered[i][1])], # 1 for SPAM
        prediction_agent="LabelModel",
        metadata = {
            "textlen": str(len(record.text)), 
            "author": record.author,
            "video": str(record.video)
        }
    )
    records.append(item)

In [52]:
rb.log(records=records, name="yt_filtered_classified_sample")

BulkResponse(dataset='yt_filtered_classified_sample', processed=1568, failed=0)

To relabel the data, we switch to annotation mode by pressing the bottom right button in the Rubrix UI. 
After that, we can simply click on the correct label for each sample, and it will be marked as a "validated sample".

## 5. Train a classifier and explore its predictions

The last thing we can do with our data is train a classifier using one of the popular libraries, such as scikit-learn, TensorFlow or PyTorch. For simplicity, we will use scikit-learn, a well-known ML library.  
We also recommend checking out [biome.text](https://www.recogn.ai/biome-text/), a simple and powerful tool for NLP problems, developed by the same team behind Rubrix.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 5)) # Bag Of Words (BoW) with n-grams
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

Since we need to tell the model the class for each sample, and we have probabilities, we can assign to each sample the class with the highest probability.

In [54]:
from snorkel.utils import probs_to_preds

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
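The conversion itself is essentially an argmax over the class probabilities of each sample. Here is a minimal pure-Python sketch with made-up probabilities. `to_hard_labels` is a hypothetical helper, and breaking ties in favor of the lowest class index is an assumption of this sketch, not necessarily what `probs_to_preds` does.

```python
def to_hard_labels(probs):
    """Pick the index of the highest probability for each sample."""
    # max returns the first maximal index, so ties go to the lowest class index
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Made-up probabilities: column 0 is HAM, column 1 is SPAM
toy_probs = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
]
toy_labels = to_hard_labels(toy_probs)
```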

And then build the classifier

In [55]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=1e3, solver="liblinear")
sklearn_model.fit(X=X_train, y=preds_train_filtered)

LogisticRegression(C=1000.0, solver='liblinear')

In [56]:
print(f"Test Accuracy: {sklearn_model.score(X=X_test, y=Y_test) * 100:.1f}%")

Test Accuracy: 89.6%


## Summary

In this tutorial, we accomplished the following:

- We introduced the concept of Labeling Functions (LFs) and demonstrated some of the forms they can take.
- We used the Snorkel LabelModel to automatically learn how to combine the outputs of our LFs into strong probabilistic labels.
- We showed that a classifier trained on a weakly supervised dataset can outperform an approach based on the LFs alone as it learns to generalize beyond the noisy heuristics we provide.

## Next steps

If you enjoyed this tutorial, check out the [Snorkel Tutorials](https://www.snorkel.org/use-cases/) page for other tutorials that you may find interesting, including demonstrations of how to use Snorkel:

- [As part of a hybrid crowdsourcing pipeline](https://www.snorkel.org/use-cases/crowdsourcing-tutorial)
- [For visual relationship detection over images](https://www.snorkel.org/use-cases/visual-relation-tutorial)
- [For information extraction over text](https://www.snorkel.org/use-cases/spouse-demo)
- [For data augmentation](https://www.snorkel.org/use-cases/02-spam-data-augmentation-tutorial)

and more! You can also visit the [Snorkel website](https://www.snorkel.org/) or [Snorkel API](https://snorkel.readthedocs.io/en/v0.9.7/) documentation for more info!