# Intro. to Snorkel: Extracting Spouse Relations from the News

## Part IV: Training a Model with Data Programming

In this part of the tutorial, we will train a statistical model to differentiate between true and false `Spouse` mentions.

We will train this model using _data programming_, and we will **ignore** the training labels provided with the training data. This is a more realistic scenario; in the wild, hand-labeled training data is rare and expensive. Data programming enables us to train a model using only a modest amount of hand-labeled data for validation and testing. For more information on data programming, see the [NIPS 2016 paper](https://arxiv.org/abs/1605.07723).

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os, sys
os.environ['SNORKELDB'] = "postgres:///stromatolite"

import numpy as np
from snorkel import SnorkelSession
session = SnorkelSession()

In [None]:
from snorkel.models import candidate_subclass

StromStrat = candidate_subclass('StromStrat', ['strom', 'stratname'])

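The cell above repeats the definition of our `Candidate` subclass from Parts II and III. In this notebook the subclass is `StromStrat`, a (`strom`, `stratname`) pair, rather than `Spouse`.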

## Loading `CandidateSet` objects

We reload the training and test `CandidateSet` objects from the previous parts of the tutorial.

In [None]:
from snorkel.models import CandidateSet
train_candidates = session.query(CandidateSet).filter(CandidateSet.name == 'Training Candidates').one()
print len(train_candidates)
test_candidates = session.query(CandidateSet).filter(CandidateSet.name == 'Test Candidates').one()
print len(test_candidates)

## Automatically Creating Features
Recall that our goal is to distinguish between true and false mentions of spouse relations. To train a model for this task, we first embed our `Spouse` candidates in a feature space.

In [None]:
from snorkel.annotations import FeatureManager

feature_manager = FeatureManager()

We can now create a new feature set. Note that we _create_ the set of features from the training candidates only, and then featurize the test set with that same set of features (using _update_).
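
To make the create-then-update idea concrete, here is a minimal sketch in plain Python that does not use Snorkel. It assumes that `update` with a final argument of `False` leaves the feature set fixed, so test-only features are dropped; the helper names and feature strings are hypothetical.

```python
def create_feature_index(train_feature_lists):
    """Build a feature-name -> column-index map from training candidates only."""
    index = {}
    for feats in train_feature_lists:
        for name in feats:
            index.setdefault(name, len(index))
    return index


def featurize(feature_lists, index):
    """Encode each candidate as a sparse row {column: 1}, skipping unknown features."""
    return [{index[name]: 1 for name in feats if name in index}
            for feats in feature_lists]


# Hypothetical feature names, for illustration only.
train_feats = [["WORD_contains", "LEN_short"], ["WORD_found"]]
test_feats = [["WORD_found", "WORD_unseen"]]

index = create_feature_index(train_feats)   # analogous to "create"
X_train = featurize(train_feats, index)
X_test = featurize(test_feats, index)       # analogous to "update" with a fixed feature set
print(index)
print(X_test)
```

Because the index is built from the training data alone, the training and test matrices share the same columns, which is what the discriminative model needs later.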

In [None]:
%time F_train = feature_manager.create(session, train_candidates, 'Training Features')
F_train

In [None]:
%time F_test = feature_manager.update(session, test_candidates, 'Training Features', False)
F_test

**OR**, if we've already created a feature set, we can simply load it as follows:

In [None]:
F_train = feature_manager.load(session, train_candidates, 'Training Features')
F_train

In [None]:
F_test = feature_manager.load(session, test_candidates, 'Training Features')
F_test

Note that the returned matrix is a subclass of `scipy.sparse.csr_matrix` with a few extra methods for looking up the candidate behind a row and the feature key behind a column, which we demonstrate below:

In [None]:
F_train

In [None]:
F_train.get_candidate(0)

In [None]:
F_train.get_key(0)

## Creating Labeling Functions
Labeling functions are a core tool of data programming. They are heuristic functions that aim to classify candidates correctly. Their outputs will be automatically combined and denoised to estimate the probabilities of training labels for the training data.
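
As a rough illustration of the idea, the sketch below defines two toy labeling functions over plain sentences instead of Snorkel candidates. The function names, thresholds, and sentences are hypothetical. The voting convention (1 for true, -1 for false, 0 for abstain) matches the labeling functions defined later in this notebook.

```python
# Toy labeling functions over plain sentences (stand-ins for real candidates).
# Convention: 1 = vote true, -1 = vote false, 0 = abstain.

def lf_contains_verb(sentence):
    """Vote true if a containment verb appears in the sentence."""
    words = set(sentence.lower().split())
    return 1 if words & {"contain", "contains", "include", "includes"} else 0


def lf_too_long(sentence):
    """Vote false for very long sentences, where a pairing is more likely spurious."""
    return -1 if len(sentence.split()) > 40 else 0


toy_lfs = [lf_contains_verb, lf_too_long]
sentences = [
    "The formation contains abundant stromatolites",
    "Stromatolites were described elsewhere",
]

# One row per sentence, one column per labeling function.
label_matrix = [[lf(s) for lf in toy_lfs] for s in sentences]
print(label_matrix)
```

Each labeling function is allowed to be noisy and to abstain often; the generative model fitted later is what reconciles their votes.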

In [None]:
import re
from snorkel.lf_helpers import (
    is_inverted, get_left_tokens, get_right_tokens,
    get_between_tokens, get_text_between, get_tagged_text,
)

## Applying Labeling Functions

First we construct a `LabelManager`.

In [None]:
from snorkel.annotations import LabelManager
label_manager = LabelManager()

Next we run the `LabelManager` to apply the labeling functions to the training `CandidateSet`. We first define the labeling functions and spot-check each one on a single candidate:

In [None]:
from snorkel.models import CandidateSet
all_c = session.query(CandidateSet).filter(CandidateSet.name == 'Candidate Set').one()

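# Pick one candidate to spot-check the labeling functions below: the first
# whose first span has parent_id 11996 (or the last candidate if none matches).
for c in all_c:
    if c[0].parent_id == 11996:
        break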

In [None]:
import yaml, psycopg2
from snorkel.models import Span

good_words = {
    'strom': {'present', 'found', 'abundant'},
    'strat': {'contain', 'contains', 'include', 'includes'},
}

# Loading connection settings from YAML files (currently disabled)
"""
with open('../credentials', 'r') as credential_yaml:
    credentials = yaml.load(credential_yaml)
with open('../config', 'r') as config_yaml:
    config = yaml.load(config_yaml)
"""

# Connect to Postgres
connection = psycopg2.connect(
    dbname= 'stromatolite' #credentials['snorkel_postgres']['database'],
    #user=credentials['snorkel_postgres']['user'],
    #password=credentials['snorkel_postgres']['password'],
    #host=credentials['snorkel_postgres']['host'],
    #port=credentials['snorkel_postgres']['port'])
    )
cursor = connection.cursor()


def LF_num_stratphrase(c):
    cursor.execute("""
        SELECT distinct span.id from span 
        JOIN strom_strat on span.id=strom_strat.stratname_id  
        WHERE span.parent_id=%(parent_id)s;""",
                   {"parent_id": c[0].parent.id
                    })
    # fetchall() returns a list of 1-tuples; unpack them to plain ids for in_()
    span_ids = [row[0] for row in cursor.fetchall()]

    tmp_strat = session.query(Span).filter(Span.id.in_(span_ids)).all()
    num_strat = len({a.get_span() for a in tmp_strat})

    return -1 if num_strat > 1 else 1

test=LF_num_stratphrase(c)
print test

def LF_wordsep_forty(c):
    ws = len(get_between_tokens(c))
    return -1 if ws > 40 else 0

test=LF_wordsep_forty(c)
print test


def LF_wordsep_twenty(c):
    ws = len(get_between_tokens(c))
    return -1 if ws > 20 and ws <= 40 else 0

test=LF_wordsep_twenty(c)
print test

def LF_wordsep_ten(c):
    ws = len(get_between_tokens(c))
    return -1 if ws > 10 and ws <= 20 else 0

test=LF_wordsep_ten(c)
print test


def LF_nlp_parent(c):
    strom_parent = c[0].get_attrib_tokens('dep_parents')
    strom_idx = [c[0].get_word_start()+1,c[0].get_word_end()+1]

    strat_parent = c[1].get_attrib_tokens('dep_parents')
    strat_idx = [c[1].get_word_start()+1,c[1].get_word_end()+1]
    
    nlp_check = [True for a in strom_idx if a in strat_parent] + [True for a in strat_idx if a in strom_parent]
    return 0 if not nlp_check else 1

test=LF_nlp_parent(c)
print test

def LF_goodwords(c):
    # Use the 'strat' cue words when the pair is inverted, 'strom' cue words otherwise
    key = 'strat' if is_inverted(c) else 'strom'
    between = set(get_between_tokens(c))
    return 1 if good_words[key] & between else 0
        
test=LF_goodwords(c)
print test


In [None]:
LFs = [
    #LF_num_stratphrase,
    LF_wordsep_forty,
    LF_wordsep_twenty,
    LF_wordsep_ten,
    LF_nlp_parent,
    LF_goodwords
]

In [None]:
%time L_train = label_manager.create(session, train_candidates, 'Training LF Labels', f=LFs)
L_train

**OR**, if we've already created the labels, load them:

In [None]:
%time L_train = label_manager.load(session, train_candidates, 'Training LF Labels')
L_train

We can view statistics about the resulting label matrix:

In [None]:
L_train.lf_stats()
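
For intuition about what such statistics measure, the following sketch computes three common per-labeling-function quantities on a small dense label matrix: coverage (fraction of candidates labeled), overlaps (fraction labeled by this function and at least one other), and conflicts (fraction where another function disagrees). It is a plain-Python illustration with a hypothetical helper name, not the implementation behind `lf_stats`, whose exact columns may differ.

```python
def lf_summary(L):
    """Per-LF coverage, overlap and conflict fractions.

    L is a list of rows (one per candidate) with entries in {-1, 0, 1}.
    """
    n = float(len(L))
    stats = []
    for j in range(len(L[0])):
        covered = overlap = conflict = 0
        for row in L:
            if row[j] == 0:
                continue  # this LF abstained on this candidate
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlap += 1
            if any(v != row[j] for v in others):
                conflict += 1
        stats.append({"coverage": covered / n,
                      "overlaps": overlap / n,
                      "conflicts": conflict / n})
    return stats


# Toy label matrix: 4 candidates, 3 labeling functions.
L_toy = [
    [1, 1, 0],
    [-1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
toy_stats = lf_summary(L_toy)
for j, s in enumerate(toy_stats):
    print(j, s)
```

A labeling function with zero coverage (like the third column here) contributes nothing, and one with many conflicts is a good candidate for inspection.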

## Fitting the Generative Model
We estimate the accuracies of the labeling functions without supervision. Specifically, we estimate the parameters of a `NaiveBayes` generative model.

In [None]:
from snorkel.learning import NaiveBayes

gen_model = NaiveBayes()
gen_model.train(L_train, n_iter=10000, rate=1e-4)

We now apply the generative model to the training candidates.

In [None]:
train_marginals = gen_model.marginals(L_train)

In [None]:
gen_model.w
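
To see how per-function weights can turn votes into a marginal probability, consider the textbook naive Bayes case: each labeling function has a symmetric accuracy, the class prior is uniform, and abstentions carry no information. Under those assumptions the log-odds of a true label is the weighted sum of the votes, with weight `log(acc / (1 - acc))`. The sketch below illustrates that relationship only; whether `gen_model.w` stores weights in exactly this form is an assumption.

```python
import math


def accuracy_to_weight(acc):
    """Log-odds weight for a labeling function with accuracy `acc`."""
    return math.log(acc / (1.0 - acc))


def marginal(votes, weights):
    """P(label is true | votes) under the naive Bayes assumptions stated above."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1.0 / (1.0 + math.exp(-score))


# Two hypothetical labeling functions with accuracies 0.9 and 0.6.
weights = [accuracy_to_weight(a) for a in (0.9, 0.6)]

print(marginal([0, 0], weights))    # both abstain: no evidence either way
print(marginal([1, -1], weights))   # the more accurate function wins the disagreement
print(marginal([1, 1], weights))    # agreement pushes the probability higher
```

The marginals are soft labels: a candidate on which the functions disagree ends up closer to 0.5 than one on which they agree.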

## Training the Discriminative Model
We use the estimated probabilities to train a discriminative model that classifies each `Candidate` as a true or false mention. One way to choose the model's hyperparameters is a random search evaluated on development set labels. A hyperparameter search needs labels for a development set; if they aren't already available, we can create them manually using the Viewer.

In [None]:
from snorkel.learning import LogReg
disc_model = LogReg(bias_term=True)

**Note: Here we train the model with hand-tuned hyperparameters. Another option, and eventually the better one, is to hold out some of our ground-truth-labeled candidates as a "dev set" and tune the hyperparameters automatically. See the tutorial for this.**
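
For reference, a random hyperparameter search can be sketched in a few lines. The helper below is hypothetical and is not a Snorkel API: it samples `rate` and `mu` log-uniformly from assumed ranges and keeps the best-scoring setting. The toy scoring function stands in for "train on the training set, then return a dev-set score".

```python
import math
import random


def random_search(score_fn, n_trials=20, seed=0):
    """Sample hyperparameters log-uniformly and keep the best-scoring setting."""
    rng = random.Random(seed)
    best_score, best_params = None, None
    for _ in range(n_trials):
        params = {
            "rate": 10 ** rng.uniform(-4, -1),  # learning rate in [1e-4, 1e-1]
            "mu": 10 ** rng.uniform(-6, -2),    # regularization in [1e-6, 1e-2]
        }
        score = score_fn(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_dev_score(rate, mu):
    # Placeholder objective: peaks at rate=1e-2, mu=1e-3.
    return -abs(math.log10(rate) + 2) - abs(math.log10(mu) + 3)


best_score, best_params = random_search(toy_dev_score)
print(best_score, best_params)
```

In practice the scoring function would retrain the model for each sampled setting, so the number of trials trades search quality against compute time.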

In [None]:
disc_model.train(F_train, train_marginals, n_iter=1000, rate=0.01, mu=1e-3)
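
The key difference from ordinary supervised training is that the targets are probabilities rather than hard labels. The sketch below shows one way to do this: logistic regression fitted by gradient descent on the cross-entropy against soft targets, with an L2 penalty. It is a plain-Python illustration, not Snorkel's `LogReg`; the parameter names `n_iter`, `rate`, and `mu` only mirror the call above, and the toy data is made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg_soft(X, marginals, n_iter=1000, rate=0.1, mu=1e-3):
    """Gradient descent on cross-entropy against soft targets, with L2 penalty mu."""
    n = float(len(X))
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iter):
        grad_w = [mu * wj for wj in w]   # gradient of the L2 penalty
        grad_b = 0.0
        for x, p in zip(X, marginals):
            # For soft target p, the per-example gradient is (prediction - p) * x.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - p
            grad_b += err / n
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
        w = [wj - rate * g for wj, g in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b


# Toy data: feature 0 co-occurs with high marginals, feature 1 with low ones.
X_toy = [[1, 0], [1, 1], [0, 1], [0, 0]]
p_toy = [0.9, 0.7, 0.2, 0.5]

w, b = train_logreg_soft(X_toy, p_toy)
preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X_toy]
print([round(p, 2) for p in preds])
```

Training on marginals lets the model weigh confident examples more heavily than ambiguous ones, which is the point of passing `train_marginals` rather than thresholded labels.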

### Scoring against the test set

In [None]:
L_gold_test = label_manager.load(session, test_candidates, 'iross')
L_gold_test

In [None]:
tp, fp, tn, fn = disc_model.score(F_test, L_gold_test, set_unlabeled_as_neg=False)
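
The four groups returned above can be summarized as precision, recall, and F1. The helper below is a hypothetical sketch that works from counts. It assumes `tp`, `fp`, and `fn` are collections of candidates (as their use with the viewer below suggests), in which case it could be called as `prf1(len(tp), len(fp), len(fn))`.

```python
def prf1(n_tp, n_fp, n_fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = n_tp / float(n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / float(n_tp + n_fn) if (n_tp + n_fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for illustration: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = prf1(8, 2, 8)
print(precision, recall, f1)
```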

## Viewing Examples
After evaluating on the development `CandidateSet`, the labeling functions can be modified. Try changing the labeling functions to improve performance. You can view the true positives, false positives, true negatives, and false negatives using the `Viewer`.

In [None]:
from snorkel.viewer import SentenceNgramViewer

sv = SentenceNgramViewer(fn, session)
sv