# Weak supervision


This guide gives you a brief introduction to weak supervision with Rubrix.

Rubrix currently supports weak supervision for text classification use cases, but we'll be adding support for token classification (e.g., Named Entity Recognition) soon.


.. nbinfo::
   This feature is experimental; expect some changes in the Python API. Please report any issues you encounter on [GitHub](https://github.com/recognai/rubrix).
   
   
   
   
![Labeling workflow](https://raw.githubusercontent.com/recognai/rubrix-materials/main/tutorials/weak_supervision/weak_supervision.svg "Labeling workflow")

## Rubrix weak supervision in a nutshell

Doing weak supervision with Rubrix should be straightforward. Keeping the same spirit as other parts of the library, you can use virtually any weak supervision library or method, such as Snorkel or FlyingSquid.

Rubrix weak supervision support is built around two basic abstractions:


### `Rule`
A rule encodes a heuristic for labeling a record.

Heuristics can be defined using [Elasticsearch's queries](../reference/rubrix_webapp_reference.rst#search-input):

```python
plz = Rule(query="plz OR please", label="SPAM")
```

or with Python functions (similar to Snorkel's labeling functions, which you can use as well):

```python
def contains_http(record: rb.TextClassificationRecord) -> Optional[str]:
    if "http" in record.inputs["text"]:
        return "SPAM"
```

Besides textual features, Python labeling functions can exploit metadata features:

```python
def author_channel(record: rb.TextClassificationRecord) -> Optional[str]:
    # the word channel appears in the comment author name
    if "channel" in record.metadata["author"]:
        return "SPAM"
```

A rule should return either a string (the weak label) or `None` to abstain.
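
To make this return contract concrete, here is a minimal sketch that runs a labeling function over a couple of stand-in records. `FakeRecord` is a hypothetical placeholder that only mimics the `inputs` and `metadata` attributes used above; it is not part of Rubrix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeRecord:
    """Hypothetical stand-in exposing only `inputs` and `metadata`."""
    inputs: Dict[str, str]
    metadata: Dict[str, str] = field(default_factory=dict)


def contains_http(record: FakeRecord) -> Optional[str]:
    # Vote "SPAM" when the text contains a link; otherwise return None,
    # which means the rule abstains for this record.
    if "http" in record.inputs["text"]:
        return "SPAM"
    return None


records = [
    FakeRecord(inputs={"text": "visit http://example.com now"}),
    FakeRecord(inputs={"text": "great song"}),
]

votes: List[Optional[str]] = [contains_http(r) for r in records]
print(votes)  # ['SPAM', None]
```

The explicit `return None` is equivalent to falling off the end of the function, as the shorter examples above do.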


### `Weak Labels`

Weak Labels objects bundle and apply a set of rules to the records of a Rubrix dataset. Applying a rule to a record means assigning a weak label or abstaining.

This abstraction provides you with the building blocks for training and testing weak supervision "denoising", "label" or even "end" models:

```python
rules = [contains_http, author_channel]
weak_labels = WeakLabels(
    rules=rules, 
    dataset="weak_supervision_yt"
)

# returns a summary of the applied rules
weak_labels.summary()
```

More information about these abstractions can be found in [the Python Labeling module docs](../reference/python/python_labeling.rst).
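
Conceptually, applying a set of rules yields a matrix with one row per record and one column per rule. The sketch below builds such a matrix in plain Python, assuming integer-encoded labels with `-1` for abstention. `build_matrix` and the example `label2int` encoding are hypothetical and only illustrate the idea; they are not Rubrix functions or defaults.

```python
from typing import Callable, Dict, List, Optional

# A labeling function maps a text to a label, or None to abstain.
LabelingFunction = Callable[[str], Optional[str]]


def build_matrix(
    texts: List[str],
    rules: List[LabelingFunction],
    label2int: Dict[Optional[str], int],
) -> List[List[int]]:
    """Return a records x rules matrix of integer-encoded weak labels."""
    return [[label2int[rule(text)] for rule in rules] for text in texts]


def contains_http(text: str) -> Optional[str]:
    return "SPAM" if "http" in text else None


def short_comment(text: str) -> Optional[str]:
    return "HAM" if len(text.split()) < 5 else None


# Assumed encoding for this sketch: -1 marks an abstention.
label2int = {None: -1, "HAM": 0, "SPAM": 1}
texts = [
    "check http://spam.example",
    "nice",
    "this one is a longer neutral comment",
]

matrix = build_matrix(texts, [contains_http, short_comment], label2int)
print(matrix)  # [[1, 0], [-1, 0], [-1, -1]]
```

The first row shows a conflict (the two rules disagree), the second a single vote, and the third a record that no rule covers.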

## Built-in label models

To make things even easier for you, we provide wrapper classes around the most common label models, which directly consume a `WeakLabels` object.
This makes working with those models a breeze.
Take a look at the list of built-in models in the [labeling module docs](../reference/python/python_labeling.rst#rubrix.labeling.text_classification.label_models.LabelModel).


## Workflow

A typical workflow to use weak supervision is:

1. Create a Rubrix dataset with your raw dataset. If you have some labelled data, you can log it into the same dataset.
2. Define a set of rules, exploring and trying out different things directly in the Rubrix web app.
3. Create a `WeakLabels` object and apply the rules. Typically, you'll iterate between this step and step 2.
4. Once you are satisfied with your weak labels, use the matrix of the `WeakLabels` instance with your library/method of choice to build a training set or even train a downstream text classification model.


This guide shows you an end-to-end example using Snorkel and FlyingSquid. Let's get started!

## Example dataset

We'll be using a well-known dataset for weak supervision examples, the [YouTube Spam Collection](http://www.dt.fee.unicamp.br/~tiago//youtubespamcollection/) dataset, which is a binary classification task for detecting spam comments in YouTube videos.

In [1]:
import pandas as pd

# load data
train_df = pd.read_csv('../tutorials/data/yt_comments_train.csv')
test_df = pd.read_csv('../tutorials/data/yt_comments_test.csv')

# preview data
train_df.head()

| | Unnamed: 0 | author | date | text | label | video |
|---|---|---|---|---|---|---|
| 0 | 0 | Alessandro leite | 2014-11-05T22:21:36 | pls http://www10.vakinha.com.br/VaquinhaE.aspx... | -1.0 | 1 |
| 1 | 1 | Salim Tayara | 2014-11-02T14:33:30 | if your like drones, plz subscribe to Kamal Ta... | -1.0 | 1 |
| 2 | 2 | Phuc Ly | 2014-01-20T15:27:47 | go here to check the views :3 | -1.0 | 1 |
| 3 | 3 | DropShotSk8r | 2014-01-19T04:27:18 | Came here to check the views, goodbye. | -1.0 | 1 |
| 4 | 4 | css403 | 2014-11-07T14:25:48 | i am 2,126,492,636 viewer :D | -1.0 | 1 |


## 1. Create a Rubrix dataset with unlabelled data and test data

Let's load the train (non-labelled) and the test (containing labels) datasets.

In [None]:
import rubrix as rb

# build records from the train dataset
records = [
    rb.TextClassificationRecord(
        inputs=row.text,
        metadata={"video":row.video, "author": row.author}
    )
    for i,row in train_df.iterrows()
]

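# build records from the test dataset and append them to the train records
# (using `=` here would overwrite the train records built above)
labels = ["HAM", "SPAM"]
records += [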
    rb.TextClassificationRecord(
        inputs=row.text,
        annotation=labels[row.label],
        metadata={"video":row.video, "author": row.author}
    )
    for i,row in test_df.iterrows()
]

# log records to Rubrix
rb.log(records, name="weak_supervision_yt")

After this step, you have a fully browsable dataset available at `http://localhost:6900/weak_supervision_yt` (or the base URL where your Rubrix instance is hosted).

## 2. Defining rules

Let's now define some of the rules proposed in the tutorial [Snorkel Intro Tutorial: Data Labeling](https://www.snorkel.org/use-cases/01-spam-tutorial).


Remember you can use [Elasticsearch's query string DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) and test your queries directly in the web app. Available fields in the query are described in [the Rubrix web app reference](../reference/rubrix_webapp_reference.rst#search-input).

In [None]:
from rubrix.labeling.text_classification import Rule, WeakLabels

# rules defined as Elasticsearch queries
check_out = Rule(query="check out", label="SPAM")
plz = Rule(query="plz OR please", label="SPAM")
subscribe = Rule(query="subscribe", label="SPAM")
my = Rule(query="my", label="SPAM")
song = Rule(query="song", label="HAM")
love = Rule(query="love", label="HAM")

Besides using the UI, if you want to quickly see the effect of a rule, you can do:

In [4]:
# display full length text
pd.set_option('display.max_colwidth', None)

# get the subset for the rule query
rb.load(name="weak_supervision_yt", query="plz OR please")[['inputs']]

Unnamed: 0,inputs
0,{'text': 'For latest movies 2014 please visit this site http://www.networkedblogs.com/p/11cPWb?ref=panorama﻿'}
1,"{'text': 'Could you please check out my covers on my channel? I do covers like Adele, Kodaline, Imagine Dragons...and more. Please if you could spare a few minutes, could you have a listen to one or two of my covers , Feel free to comment and subscribe :) Thank you! '}"
2,{'text': ' Subscribe and like my video please'}
3,{'text': 'plz check out fablife / welcome to fablife for diys and challenges so plz subscribe thx!﻿'}
4,"{'text': 'Anyone Who LOVEs music , please go check out my youtube page and tell me what you think . I just put a video up and will be doing more song. I'm just trying to get myself started. Any love is much Appreciated ﻿'}"
...,...
307,{'text': 'watch youtube video &quot;EMINEM -YTMA artist of the year&quot; plz share to vote!!!'}
308,{'text': 'please subscribe i am a new youtuber and need help please subscribe and i will subscribe back :D hoppa HOPPA GaNgAm StYlE﻿'}
309,"{'text': 'I make guitar covers, please have a look at my channel﻿'}"
310,"{'text': 'My honest opinion. It's a very mediocre song. Nothing unique or special about her music, lyrics or voice. Nothing memorable like Billie Jean or Beat It. Before her millions of fans reply with hate comments, i know this is a democracy and people are free to see what they want. But then don't I have the right to express my opinion? Please don't reply with dumb comments lie ""if you don't like it don't watch it"". I just came here to see what's the buzz about(661 million views??) and didn't like what i saw. OK?﻿'}"


You can also define plain Python labeling functions:

In [None]:
import re

# rules defined as Python labeling functions
def contains_http(record: rb.TextClassificationRecord):
    if "http" in record.inputs["text"]:
        return "SPAM"

def short_comment(record: rb.TextClassificationRecord):
    return "HAM" if len(record.inputs["text"].split()) < 5 else None

def regex_check_out(record: rb.TextClassificationRecord):
    return "SPAM" if re.search(r"check.*out", record.inputs["text"], flags=re.I) else None

## 3. Building and analyzing weak labels

In [8]:
# bundle our rules in a list
rules = [check_out, plz, subscribe, my, song, love, contains_http, short_comment, regex_check_out]

# apply the rules to a dataset to obtain the weak labels
weak_labels = WeakLabels(
    rules=rules, 
    dataset="weak_supervision_yt"
)

# show some stats about the rules, see the `summary()` docstring for details
weak_labels.summary()

| | polarity | coverage | overlaps | conflicts | correct | incorrect | precision |
|---|---|---|---|---|---|---|---|
| check out | {SPAM} | 0.247516 | 0.239918 | 0.030684 | 45 | 0 | 1.0 |
| plz OR please | {SPAM} | 0.091175 | 0.082408 | 0.019871 | 20 | 0 | 1.0 |
| subscribe | {SPAM} | 0.105786 | 0.083577 | 0.02893 | 30 | 0 | 1.0 |
| my | {SPAM} | 0.190824 | 0.166277 | 0.048802 | 41 | 6 | 0.87234 |
| song | {HAM} | 0.12858 | 0.075979 | 0.033022 | 39 | 9 | 0.8125 |
| love | {HAM} | 0.088545 | 0.06692 | 0.031268 | 28 | 7 | 0.8 |
| contains_http | {SPAM} | 0.112215 | 0.078025 | 0.052309 | 6 | 0 | 1.0 |
| short_comment | {HAM} | 0.236119 | 0.109001 | 0.067504 | 84 | 8 | 0.913043 |
| regex_check_out | {SPAM} | 0.229982 | 0.229398 | 0.028346 | 45 | 0 | 1.0 |
| total | {SPAM, HAM} | 0.748977 | 0.449737 | 0.12332 | 338 | 30 | 0.918478 |
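
The per-rule columns of this summary can be read as simple statistics over the weak label matrix. The sketch below computes them for a toy matrix under assumed Snorkel-style definitions: coverage is the fraction of records a rule labels, overlaps the fraction it labels together with at least one other rule, conflicts the fraction where another rule votes differently, and precision is `correct / (correct + incorrect)` on annotated records. `rule_stats` is a hypothetical helper; see the `summary()` docstring for the definitions Rubrix actually uses.

```python
from typing import Dict, List, Optional

ABSTAIN = -1


def rule_stats(
    matrix: List[List[int]], annotations: List[Optional[int]], rule: int
) -> Dict[str, float]:
    """Compute summary statistics for one column of a weak label matrix."""
    n = len(matrix)
    covered = overlaps = conflicts = correct = incorrect = 0
    for row, gold in zip(matrix, annotations):
        vote = row[rule]
        if vote == ABSTAIN:
            continue
        covered += 1
        # Votes cast by the other rules on the same record.
        others = [v for j, v in enumerate(row) if j != rule and v != ABSTAIN]
        overlaps += bool(others)
        conflicts += any(v != vote for v in others)
        if gold is not None:
            correct += vote == gold
            incorrect += vote != gold
    judged = correct + incorrect
    return {
        "coverage": covered / n,
        "overlaps": overlaps / n,
        "conflicts": conflicts / n,
        "correct": correct,
        "incorrect": incorrect,
        "precision": correct / judged if judged else float("nan"),
    }


# Toy data: 4 records x 2 rules; 0 = HAM, 1 = SPAM, None = no annotation.
matrix = [[1, 1], [1, 0], [-1, 0], [1, -1]]
annotations = [1, 0, None, None]

stats = rule_stats(matrix, annotations, rule=0)
print(stats)
```

The precision values in the table above are consistent with this formula: for the `my` rule, 41 / (41 + 6) ≈ 0.87234.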


## 4. Using the weak labels

At this step you have at least two options:

1. Use the weak labels for training a "denoising" or label model to build a less noisy training set. Highly popular options for this are [Snorkel](https://snorkel.org/) or [FlyingSquid](https://github.com/HazyResearch/flyingsquid). After this step, you can train a downstream model with the "clean" labels.

2. Use the weak labels directly with recent "end-to-end" (e.g., [Weasel](https://github.com/autonlab/weasel)) or joint models (e.g., [COSINE](https://github.com/yueyu1030/COSINE)).
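
As a point of reference for what a label model does, the simplest way to combine weak labels is a majority vote over the non-abstaining rules. The sketch below is only an illustrative baseline in plain Python. It is not how Snorkel or FlyingSquid fit their models, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Optional

ABSTAIN = -1


def majority_vote(row: List[int]) -> Optional[int]:
    """Return the most frequent non-abstain vote, or None on no votes or a tie."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return None  # every rule abstained
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent labels
    return ranked[0][0]


# Toy weak label matrix: 4 records x 3 rules; 0 = HAM, 1 = SPAM.
matrix = [[1, 1, -1], [1, 0, -1], [-1, -1, -1], [0, 0, 1]]
print([majority_vote(row) for row in matrix])  # [1, None, None, 0]
```

Label models improve on this baseline by estimating how reliable each rule is instead of weighting all votes equally.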


Let's see some examples:

### Label model with Snorkel

Snorkel is by far the most popular option for using weak supervision, and Rubrix provides built-in support for it. 
Using Snorkel with Rubrix's `WeakLabels` is as simple as:

In [None]:
%pip install snorkel -qqq

In [None]:
from rubrix.labeling.text_classification import Snorkel

# we pass our WeakLabels instance to our Snorkel label model
label_model = Snorkel(weak_labels)

# we train the model
label_model.fit()

# we check its performance
label_model.score()

After fitting your label model, you can quickly explore its predictions, before building a training set for training a downstream text classifier. 

This step is useful for validation, manual revision, or defining score thresholds for accepting labels from your label model (for example, only considering labels with a score greater than 0.8).
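
Applying such a threshold could look like the sketch below. It assumes each prediction is a list of `(label, score)` tuples, as in the FlyingSquid example later in this guide. The `Rec` class and the `keep_confident` helper are hypothetical stand-ins, not Rubrix API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rec:
    """Hypothetical stand-in for a record carrying label model predictions."""
    text: str
    prediction: List[Tuple[str, float]]


def keep_confident(records: List[Rec], threshold: float = 0.8) -> List[Rec]:
    """Keep records whose best-scoring label exceeds the threshold."""
    return [
        rec for rec in records
        if max(score for _, score in rec.prediction) > threshold
    ]


records = [
    Rec("subscribe to my channel", [("SPAM", 0.95), ("HAM", 0.05)]),
    Rec("i like this song", [("SPAM", 0.45), ("HAM", 0.55)]),
]
confident = keep_confident(records)
print([rec.text for rec in confident])  # ['subscribe to my channel']
```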

In [None]:
# get your training records with the predictions of the label model
records_for_training = label_model.predict()

# log the records to a new dataset in Rubrix
rb.log(records_for_training, name="snorkel_results")

### Label model with FlyingSquid

FlyingSquid is a powerful method developed by [Hazy Research](https://hazyresearch.stanford.edu/), a research group from Stanford behind ground-breaking work on programmatic data labeling, including Snorkel. FlyingSquid uses a closed-form solution for fitting the label model with great speed gains and similar performance.

In [None]:
%pip install flyingsquid pgmpy -qqq

By default, the `WeakLabels` class uses `-1` as the value for an abstention.
FlyingSquid, though, expects a value of `0`.
With Rubrix you can define a custom `label2int` mapping like this:

In [None]:
weak_labels = WeakLabels(rules=rules, dataset="weak_supervision_yt", label2int={None: 0, 'SPAM': -1, 'HAM': 1})
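
The effect of such a mapping on the weak label matrix can be illustrated without Rubrix. The sketch below re-encodes a matrix that uses `-1` for abstention into the convention used above (`0` for abstention, `-1` and `1` for the two classes). The source encoding `{-1: abstain, 0: HAM, 1: SPAM}` and the helper name `remap` are assumptions made for illustration.

```python
from typing import Dict, List


def remap(matrix: List[List[int]], old2new: Dict[int, int]) -> List[List[int]]:
    """Re-encode every entry of a weak label matrix."""
    return [[old2new[v] for v in row] for row in matrix]


# Assumed source encoding: -1 = abstain, 0 = HAM, 1 = SPAM.
# Target encoding mirrors the label2int above: abstain -> 0, SPAM -> -1, HAM -> 1.
old2new = {-1: 0, 1: -1, 0: 1}

matrix = [[1, -1], [0, 0], [-1, 1]]
print(remap(matrix, old2new))  # [[-1, 0], [1, 1], [0, -1]]
```

Passing `label2int` to `WeakLabels` avoids this manual step, since the matrix is then produced in the target encoding directly.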

In [None]:
from flyingsquid.label_model import LabelModel

# train our label model
label_model = LabelModel(len(weak_labels.rules))
label_model.fit(L_train=weak_labels.matrix(has_annotation=False),verbose=True)

After fitting your label model, you can quickly explore its predictions, before building a training set for training a downstream text classifier. 

This step is useful for validation, manual revision, or defining score thresholds for accepting labels from your label model (for example, only considering labels with a score greater than 0.8).

In [None]:
# get the part of the weak label matrix that has no corresponding annotation
train_matrix = weak_labels.matrix(has_annotation=False)

# get predictions from our label model
predictions = label_model.predict_proba(L_matrix=train_matrix)
predicted_labels = label_model.predict(L_matrix=train_matrix)
preds = [[('SPAM', pred[0]), ('HAM', pred[1])] for pred in predictions]

# get the records that do not have an annotation
train_records = weak_labels.records(has_annotation=False)

In [None]:
# add the predictions to the records
def add_prediction(record, prediction):
    record.prediction = prediction
    return record

train_records_with_lm_prediction = [
    add_prediction(rec, pred)
    for rec, pred, label in zip(train_records, preds, predicted_labels)
    if label != weak_labels.label2int[None] # exclude records where the label model abstains
]

# log a new dataset to Rubrix
rb.log(train_records_with_lm_prediction, name="flyingsquid_results")

### Joint model with Weasel

[Weasel](https://github.com/autonlab/weasel) lets you train downstream models end-to-end, directly from weak labels.

Coming soon.