<img align="left" src="imgs/logo.jpg" width="50px" style="margin-right:10px">

# Snorkel Workshop: Extracting Spouse Relations <br> from the News
## Part 2: Writing  Labeling Functions

In Snorkel, the primary way we provide training signal to the end extraction model is by writing **labeling functions (LFs)**, as opposed to hand-labeling massive training sets. We'll go through some examples for our spouse classification task below.

A labeling function is a Python function that accepts a candidate (here, a row of the DataFrame) as its input argument and outputs a label for that candidate. For ease of exposition in this notebook, an LF returns `1` if it believes the pair of persons in the candidate were married at some point, `-1` if it believes they were never married, and `0` if it doesn't know how to vote and abstains. In practice, many labeling functions are unipolar: they output only `1`s and `0`s, or only `-1`s and `0`s.
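The voting convention can be illustrated without Snorkel at all. The following is a minimal sketch, assuming a candidate is a simple object with a `between_tokens` list; the two functions and the example candidates are made up for illustration and are not part of the workshop code.

```python
from types import SimpleNamespace

POS, NEG, ABSTAIN = 1, -1, 0

def lf_wife_between(cand):
    """Unipolar LF: votes POS or abstains, never NEG."""
    return POS if "wife" in cand.between_tokens else ABSTAIN

def lf_sibling_between(cand):
    """Unipolar LF in the other direction: votes NEG or abstains."""
    sibling_words = {"brother", "sister"}
    return NEG if sibling_words & set(cand.between_tokens) else ABSTAIN

# Hypothetical candidates, for illustration only
married = SimpleNamespace(between_tokens=["and", "his", "wife"])
related = SimpleNamespace(between_tokens=["and", "her", "brother"])

# One (lf_wife_between, lf_sibling_between) vote pair per candidate
votes = [(lf_wife_between(c), lf_sibling_between(c)) for c in (married, related)]
print(votes)
```

Each function only ever votes in one direction, which is what "unipolar" means here.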

(Note we will change our mapping to use `2` to represent the absence of a relationship to match the multiclass convention in the next notebook for the `LabelModel`. This does not affect this notebook.)
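As a rough sketch of what that remapping could look like (the actual conversion used in the next notebook may differ), NumPy can replace every `-1` vote with `2` while leaving positives and abstains untouched:

```python
import numpy as np

# Toy vector of LF votes using this notebook's convention: 1, -1, 0 (abstain)
votes = np.array([1, -1, 0, -1, 1])

# Replace -1 with 2; every other value is kept as-is
remapped = np.where(votes == -1, 2, votes)
print(remapped)
```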

Recall that our goal is to ultimately train a high-performance classification model that predicts which of our candidates are true spouse relations. It turns out that we can do this by writing potentially low-quality labeling functions!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import re
import sys

import numpy as np
import pandas as pd
import scipy.sparse as sp

##  I. Background

### Preprocessing the Database

In a real application, there is a lot of data preparation, parsing, and database loading that needs to be completed before we dive into writing labeling functions. Here we've pre-generated candidates in a pandas DataFrame object per split (train,dev,test).

###  Using a _Development Set_ of Human-labeled Data

In our setting, we will use the phrase _development set_ to refer to a set of examples (here, a subset of our training set) which we label by hand and use to help us develop and refine labeling functions.  Unlike the _test set_, which we do not look at and reserve for final evaluation, we can inspect the development set while writing labeling functions. The development labels are stored as an array of `{-1,1}` values.

In [2]:
import pickle

with open('dev_data.pkl', 'rb') as f:
    dev_df = pickle.load(f)
    dev_labels = pickle.load(f)
    
with open('train_data.pkl', 'rb') as f:
    train_df = pickle.load(f)

### Labeling Function Helpers

When writing labeling functions, there are several operators you will use over and over again. For text relation extraction, as in this task, common operators include fetching the text between mentions of the two people in a candidate, examining word windows around person mentions, etc. Note that in other domains and tasks, the required preprocessors will be different.

We provide several helper functions in `spouse_preprocessors`:  these are Python helper functions that you can apply to candidates in the DataFrame to return objects that are helpful during LF development. You can (and should!) write your own helper functions to help write LFs.

We provide an example of a preprocessor definition here:

In [49]:
??Preprocessor

In [3]:
from snorkel.labeling.preprocess import Preprocessor, PreprocessorMode, preprocessor

@preprocessor
def get_text_between(cand):
    """
    Returns the text between the two person mentions in the sentence for a candidate
    """
    start = cand.person1_word_idx[1] + 1
    end = cand.person2_word_idx[0]
    cand.text_between = ' '.join(cand.tokens[start:end])
    return cand
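To see what this slicing produces, here is a Snorkel-free sketch of the same logic on a made-up candidate. It assumes that `person1_word_idx` and `person2_word_idx` are `(first_token, last_token)` index pairs with an inclusive end, which is what the `+ 1` in the preprocessor suggests; the sentence itself is hypothetical.

```python
from types import SimpleNamespace

def text_between(cand):
    """Join the tokens strictly between the two person mentions."""
    start = cand.person1_word_idx[1] + 1  # first token after person 1
    end = cand.person2_word_idx[0]        # first token of person 2 (excluded)
    return " ".join(cand.tokens[start:end])

# Hypothetical candidate, for illustration only
cand = SimpleNamespace(
    tokens=["Jane", "Doe", "and", "her", "husband", "John", "Doe", "arrived", "."],
    person1_word_idx=(0, 1),
    person2_word_idx=(5, 6),
)

between = text_between(cand)
print(between)
```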

### Candidate PreProcessors

We provide a set of helper functions for this task in `spouse_preprocessors.py` that take as input a candidate, or row of a DataFrame in our case. For the purpose of the tutorial, we have two of these fields preprocessed in the data, which can be used when creating labeling functions.

`get_between_tokens(cand)`

`get_left_tokens(cand)`

`get_right_tokens(cand)`

# II. Labeling Functions

## A. Pattern Matching Labeling Functions

One powerful form of labeling function design is defining sets of keywords or regular expressions that, as a human labeler, you know are correlated with the true label. For example, we could define a dictionary of terms that occur between person names in a candidate. One simple dictionary of terms indicating a true relation is `spouses`, which we could use in a labeling function as shown below:
    
    spouses = {'spouse', 'wife', 'husband', 'ex-wife', 'ex-husband'}

 
    @labeling_function(resources=dict(spouses=spouses), preprocessors=[get_left_tokens])
    def LF_husband_wife_left_window(x, spouses):
        if len(set(spouses).intersection(set(x.person1_left_tokens))) > 0:
            return POS
        elif len(set(spouses).intersection(set(x.person2_left_tokens))) > 0:
            return POS
        else:
            return ABSTAIN

**Note that:**
1. To access the tokens to the left of each person mention, we can use the **`get_left_tokens` preprocessor!**
2. We use **resources like the spouses dictionary** to encode themes/categories of relationships!

There are a few advantages of having preprocessors and labeling functions in this form: 

**Data Agnostic:**  Operate over multiple data types without rewriting
    
**Incremental Processing:** Can create preprocessors as needed while writing LFs!
     
**Future Use:** Can store them for later for different tasks since they are reproducible and modular
     
**Optimizations:** Allows caching behind-the-scenes

In [4]:
from typing import List

from snorkel.labeling.apply import PandasLFApplier
from snorkel.labeling.lf import labeling_function
from snorkel.types import DataPoint

from spouse_preprocessors import get_left_tokens, get_person_last_names, get_person_text

POS = 1
NEG = -1 
ABSTAIN = 0 

In [54]:
from snorkel.labeling.apply import lf_applier_pandas
#!cat ~/anaconda3/envs/snorkel-v2/lib/python3.6/site-packages/snorkel/labeling/apply/lf_applier_pandas.py

Check for the `spouse` words appearing between the person mentions

In [5]:
spouses = {'spouse', 'wife', 'husband', 'ex-wife', 'ex-husband'}
@labeling_function(resources=dict(spouses=spouses))
def LF_husband_wife(x, spouses):
    return POS if len(spouses.intersection(set(x.between_tokens))) > 0 else ABSTAIN

Check for the `spouse` words appearing to the left of the person mentions

In [6]:
@labeling_function(resources=dict(spouses=spouses), preprocessors=[get_left_tokens])
def LF_husband_wife_left_window(x, spouses):
    if len(set(spouses).intersection(set(x.person1_left_tokens))) > 0:
        return POS
    elif len(set(spouses).intersection(set(x.person2_left_tokens))) > 0:
        return POS
    else:
        return ABSTAIN

Check for the person mentions having the same last name

In [7]:
@labeling_function()
def LF_same_last_name(x):
    p1_ln, p2_ln = get_person_last_names(x)
    
    if p1_ln and p2_ln and p1_ln == p2_ln:
        return POS
    return ABSTAIN

Check for the word `and` between the person mentions and the word `married` to the right of the second mention

In [8]:
@labeling_function()
def LF_and_married(x):
    return POS if 'and' in x.between_tokens and 'married' in x.person2_right_tokens else ABSTAIN    

Check for words that refer to `family` relationships between and to the left of the person mentions

In [9]:
family = ['father', 'mother', 'sister', 'brother', 'son', 'daughter',
              'grandfather', 'grandmother', 'uncle', 'aunt', 'cousin']
family = set(family+[f + '-in-law' for f in family])

@labeling_function(resources=dict(family=family))
def LF_familial_relationship(x, family):
    # A family relationship is evidence against a spouse relation, so vote NEG
    return NEG if len(family.intersection(set(x.between_tokens))) > 0 else ABSTAIN


@labeling_function(resources=dict(family=family), preprocessors=[get_left_tokens])
def LF_family_left_window(x, family):
    if len(set(family).intersection(set(x.person1_left_tokens))) > 0:
        return NEG
    elif len(set(family).intersection(set(x.person2_left_tokens))) > 0:
        return NEG
    else:
        return ABSTAIN

Check for `other` relationship words between person mentions

In [10]:
other = {'boyfriend', 'girlfriend', 'boss', 'employee', 'secretary', 'co-worker'}
@labeling_function(resources=dict(other=other))
def LF_other_relationship(x, other):
    return NEG if len(other.intersection(set(x.between_tokens))) > 0 else ABSTAIN

#### Apply Labeling Functions to the Data
We create a list of labeling functions and apply them to the data

In [50]:
??PandasLFApplier

In [11]:
applier = PandasLFApplier([LF_husband_wife,
                           LF_husband_wife_left_window,
                           LF_same_last_name,
                           LF_and_married, 
                           LF_familial_relationship,
                           LF_family_left_window,
                           LF_other_relationship])
L = applier.apply(dev_df)

100%|██████████| 2811/2811 [00:03<00:00, 837.05it/s]


### Labeling Function Metrics

#### Coverage
One simple metric we can compute quickly is our _coverage_: the fraction of candidates that an LF labels (that is, does not abstain on) in a given set, such as the training or development set.
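Treating coverage as the fraction of non-abstaining votes, it can be computed by hand from one column of the label matrix. This is a sketch on made-up votes, not the implementation behind `coverage_score`:

```python
import numpy as np

ABSTAIN = 0

# Toy votes from a single LF over ten candidates (illustrative values)
lf_votes = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Fraction of candidates on which the LF did not abstain
coverage = np.mean(lf_votes != ABSTAIN)
print(coverage)
```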

#### Precision / Recall / F1
If we have gold labeled data, we can also compute standard precision, recall, and F1 metrics for the output of a single labeling function. These metrics are computed over 4 _error buckets_: _True Positives_ (tp), _False Positives_ (fp), _True Negatives_ (tn), and _False Negatives_ (fn).

\begin{equation*}
precision = \frac{tp}{(tp + fp)}
\end{equation*}

\begin{equation*}
recall = \frac{tp}{(tp + fn)}
\end{equation*}

\begin{equation*}
F1 = 2 \cdot \frac{ (precision \cdot recall)}{(precision + recall)}
\end{equation*}
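The formulas above can be worked through on a small example. The sketch below uses made-up gold labels and votes, and it assumes that an abstain on a true positive counts as a false negative; Snorkel's own metric functions may treat abstains differently.

```python
import numpy as np

POS = 1

# Toy data: gold labels in {-1, 1}, LF votes in {-1, 0, 1}
gold = np.array([1, 1, -1, -1, 1, -1])
votes = np.array([1, 0, 1, 0, 1, -1])

tp = int(np.sum((votes == POS) & (gold == POS)))  # voted POS, truly POS
fp = int(np.sum((votes == POS) & (gold != POS)))  # voted POS, truly NEG
fn = int(np.sum((votes != POS) & (gold == POS)))  # missed a true POS (assumption: abstains count)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(tp, fp, fn, precision, recall, f1)
```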

#### Viewing Performance Metrics
If we have gold labeled data, we can evaluate formal metrics. Below, we'll compute our empirical scores using human-labeled development set data and then look at performance metrics for `LF_husband_wife` LF.

In [12]:
from snorkel.model.metrics import coverage_score, f1_score, precision_score, recall_score

print("LF_husband_wife coverage: \t", coverage_score(dev_labels,L[:,0]))
print("LF_husband_wife F1 score:  \t", f1_score(dev_labels,L[:,0]))
print("LF_husband_wife precision:  \t", precision_score(dev_labels,L[:,0]))
print("LF_husband_wife recall:  \t", recall_score(dev_labels,L[:,0]))

LF_husband_wife coverage: 	 0.08964781216648879
LF_husband_wife F1 score:  	 0.4208144796380091
LF_husband_wife precision:  	 0.36904761904761907
LF_husband_wife recall:  	 0.48947368421052634


## B. Distant Supervision Labeling Functions

In addition to using factories that encode pattern matching heuristics, we can also write labeling functions that _distantly supervise_ examples. Here, we'll load in a list of known spouse pairs and check to see if the pair of persons in a candidate matches one of these.

**DBpedia**
http://wiki.dbpedia.org/
Our database of known spouses comes from DBpedia, which is a community-driven resource similar to Wikipedia but for curating structured data. We'll use a preprocessed snapshot as our knowledge base for all labeling function development.

We can look at some of the example entries from DBpedia and use them in a simple distant supervision labeling function.

Make sure `dbpedia.pkl` is in the `tutorials/workshop/` directory. 

In [13]:
import pickle 

with open('dbpedia.pkl', 'rb') as f:
     known_spouses = pickle.load(f)
        
list(known_spouses)[0:5]

[('Duke George', 'Grand Duchess Catherine Pavlovna'),
 ('Helen Milliken', 'William Milliken'),
 ('Alexander', 'Aspasia Manos'),
 ('Brooke Shields', 'Chris Henchy'),
 ('Bronwyn Bancroft', 'Ned Manning')]

In [25]:
??labeling_function

In [14]:
@labeling_function(resources=dict(known_spouses=known_spouses))
def LF_distant_supervision(x: DataPoint, known_spouses: List[str]) -> int:
    p1, p2 = get_person_text(x)
    return POS if (p1, p2) in known_spouses or (p2, p1) in known_spouses else ABSTAIN

In [30]:
# Helper function to get last name
def last_name(s):
    name_parts = s.split(' ')
    return name_parts[-1] if len(name_parts) > 1 else None 

# Last name pairs for known spouses
last_names = set([(last_name(x), last_name(y)) for x, y in known_spouses if last_name(x) and last_name(y)])

@labeling_function(resources=dict(last_names=last_names))
def LF_distant_supervision_last_names(x: DataPoint, last_names: List[str]) -> int:
    p1_ln, p2_ln = get_person_last_names(x)
    
    return POS if (p1_ln != p2_ln) and ((p1_ln, p2_ln) in last_names or (p2_ln, p1_ln) in last_names) else ABSTAIN
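The core of distant supervision is an order-insensitive lookup in the knowledge base. A minimal standalone sketch, using two of the DBpedia pairs printed above and a hypothetical helper `lf_known_pair` that takes the two names directly rather than a candidate:

```python
POS, ABSTAIN = 1, 0

# Tiny stand-in for the knowledge base: a set of (name, name) tuples
known_spouses = {
    ("Brooke Shields", "Chris Henchy"),
    ("Helen Milliken", "William Milliken"),
}

def lf_known_pair(p1, p2, known=known_spouses):
    """Vote POS if the pair appears in the knowledge base in either order."""
    return POS if (p1, p2) in known or (p2, p1) in known else ABSTAIN

results = [
    lf_known_pair("Chris Henchy", "Brooke Shields"),      # reversed order still matches
    lf_known_pair("Brooke Shields", "William Milliken"),  # both known, but not as a pair
]
print(results)
```

Storing the pairs in a set keeps each membership check constant-time, which matters when the LF is applied to tens of thousands of candidates.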

Every time you write a new labeling function, add it to appliers and make sure to include it in the new L matrix!

In [39]:
fns = [LF_husband_wife,
                           LF_husband_wife_left_window,
                           LF_same_last_name,
                           LF_and_married, 
                           LF_familial_relationship,
                           LF_family_left_window,
                           LF_other_relationship,
                           LF_distant_supervision,
                           LF_distant_supervision_last_names]

In [41]:
import numpy as np
np.random.seed(1)

def corrupt(fn):
    def fnx(x):
        do_abstain = np.random.rand(1)[0] > .5
        do_rand = np.random.rand(1)[0] > .5
        if do_abstain:
            return 0
        if do_rand:
            return np.random.choice([-1, 1], size=1)[0]
        return fn(x)
    # Return the wrapper; without this, corrupt() would return None
    return fnx
corrupted_fns = [corrupt(fn) for fn in fns]

In [42]:
applier = PandasLFApplier(fns)
#applier = PandasLFApplier(corrupted_fns)

In [43]:
dev_L = applier.apply(dev_df)
with open('dev_L.pkl', 'wb') as f:
    pickle.dump(dev_L, f)
    
train_L = applier.apply(train_df)
with open('train_L.pkl', 'wb') as f:
    pickle.dump(train_L, f)

100%|██████████| 2811/2811 [00:04<00:00, 666.02it/s]
100%|██████████| 22254/22254 [00:28<00:00, 774.04it/s]


In [47]:
import pickle

with open('dev_data.pkl', 'rb') as f:
    dev_df = pickle.load(f)
    dev_labels = pickle.load(f)
    
dev_labels.shape, dev_L.toarray().shape

((2811,), (2811, 9))

In [48]:
dev_L.toarray()

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
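A dense label matrix like this one makes it easy to summarize how often each LF votes. The sketch below uses a small made-up matrix in place of `dev_L.toarray()` and treats `0` as abstain, as in the rest of this notebook:

```python
import numpy as np

# Toy label matrix: 4 candidates (rows) x 3 LFs (columns)
L_dense = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, -1],
    [0, 0, 0],
])

# Fraction of candidates each LF labels (column-wise)
per_lf_coverage = (L_dense != 0).mean(axis=0)

# Fraction of candidates labeled by at least one LF
total_coverage = (L_dense != 0).any(axis=1).mean()
print(per_lf_coverage, total_coverage)
```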

## C. Writing Custom Labeling Functions

The strength of LFs is that you can write any arbitrary function and use it to supervise a classification task. This approach can combine many of the same strategies discussed above or encode other information. 

For example, we observe that when the two person mentions occur far apart in a sentence, this is a good indicator that the candidate's label is False. You can write such an LF using the `get_text_between` preprocessor or the existing `between_tokens` field!
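One way such a distance-based LF could look is sketched below in plain Python. The threshold `MAX_GAP` is a hypothetical value that you would tune on the development set, and the candidates are made up:

```python
from types import SimpleNamespace

NEG, ABSTAIN = -1, 0
MAX_GAP = 10  # hypothetical threshold; tune on the development set

def lf_far_apart(cand, max_gap=MAX_GAP):
    """Vote NEG when many tokens separate the two person mentions."""
    return NEG if len(cand.between_tokens) > max_gap else ABSTAIN

# Hypothetical candidates, for illustration only
near = SimpleNamespace(between_tokens=["and", "his", "wife"])
far = SimpleNamespace(between_tokens=["w"] * 15)

votes = [lf_far_apart(near), lf_far_apart(far)]
print(votes)
```

To use it with the applier, the same body could be wrapped with `@labeling_function()` in the style of the LFs above.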


    
**IMPORTANT** Good labeling functions manage a trade-off between high coverage and high precision. When constructing your dictionaries, think about building larger, noisier sets of terms instead of relying on 1 or 2 keywords. Sometimes a single word can be very predictive (e.g., `ex-wife`), but it's almost always better to define something more general, such as a regular expression pattern capturing _any_ string with the `ex-` prefix.
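As an illustration of the `ex-` idea, here is a small sketch using Python's `re` module. The pattern and the helper `lf_ex_prefix` are assumptions made for this example; it operates on a plain string such as the output of `get_text_between`.

```python
import re

POS, ABSTAIN = 1, 0

# Any word starting with "ex-", e.g. ex-wife, ex-husband, ex-spouse
EX_PREFIX = re.compile(r"\bex-\w+", re.IGNORECASE)

def lf_ex_prefix(text):
    """Vote POS if the text contains any 'ex-' prefixed word."""
    return POS if EX_PREFIX.search(text) else ABSTAIN

votes = [
    lf_ex_prefix("and his ex-wife"),
    lf_ex_prefix("met her Ex-Husband at"),
    lf_ex_prefix("an expert witness for"),  # no hyphen, so no match
]
print(votes)
```

This pattern would also fire on terms like `ex-boss`, which is exactly the kind of broader, noisier signal described above; whether that trade-off pays off is something to check against the development set.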

**Try editing and running the cells below!**

In [18]:
# @labeling_function()
# def LF_new(x: DataPoint) -> int:
#     return POS if x.person1_word_idx[0] > 3 else ABSTAIN #TODO: Change this!

# applier = PandasLFApplier([LF_new])

In [19]:
# new_dev_L = applier.apply(dev_df)
# dev_L = sp.hstack((dev_L, new_dev_L), format='csr')  # keep the stacked result
# with open('dev_L.pkl', 'wb') as f:
#     pickle.dump(dev_L, f)
    
# new_train_L = applier.apply(train_df)
# train_L = sp.hstack((train_L, new_train_L), format='csr')  # keep the stacked result
# with open('train_L.pkl', 'wb') as f:
#     pickle.dump(train_L, f)