# Intro. to Context Aware Snorkel: Extracting Spouse Relations from the News Across Sentences

In this tutorial, we walk through using context-aware `Snorkel` to identify mentions of spouses in a corpus of news articles. The tutorial is split into three notebooks, each covering one step of the pipeline:
1. Preprocessing
2. Training
3. Evaluation

This tutorial adapts the Intro tutorial to demonstrate the use of the cross-context candidate extractor.

## Part I: Preprocessing

In this notebook, we preprocess several documents using `Snorkel` utilities, parsing them into a simple hierarchy of component parts of our input data, which we refer to as _contexts_. We'll also create _candidates_ out of these contexts, which are the objects we want to classify, in this case, possible mentions of spouses. Finally, we'll load some gold labels for evaluation.

All of this preprocessed input data is saved to a database. (Connection strings can be specified by setting the `SNORKELDB` environment variable.) If no database is specified, Snorkel creates a SQLite database at `./snorkel.db` by default, so no setup is needed here!
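
As a rough illustration of the fallback described above, the sketch below resolves a connection string from an environment mapping and falls back to a local SQLite file. The helper name `resolve_connection_string` and the exact default string are assumptions made for this example; Snorkel's own logic may differ in detail.

```python
import os

def resolve_connection_string(environ=None, default="sqlite:///snorkel.db"):
    """Hypothetical helper (not part of Snorkel): pick the database to use.

    Returns the value of SNORKELDB if it is set and non-empty,
    otherwise the SQLite default.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("SNORKELDB", "")
    return value if value else default

# No variable set: fall back to the local SQLite file
print(resolve_connection_string({}))
# Variable set: use the given connection string
print(resolve_connection_string({"SNORKELDB": "postgres:///snorkel-intro"}))
```

Passing the environment in as a plain dictionary keeps the example independent of whatever is set in the current shell.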

### Initializing a `SnorkelSession`

First, we initialize a `SnorkelSession`, which manages a connection to a database automatically for us, and will enable us to save intermediate results.  If we don't specify any particular database (see commented-out code below), then it will automatically create a SQLite database in the background for us:

In [1]:
import re
def substring_range(s, substring):
    for i in re.finditer(re.escape(substring), s):
        yield (i.start(), i.end()-1)

s = "felipe"
substring = "Ivana Trump"
print([x for x in substring_range(s, substring)])

[]


In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os

# TO USE A DATABASE OTHER THAN SQLITE, USE THIS LINE
# Note that this is necessary for parallel execution amongst other things...
# os.environ['SNORKELDB'] = 'postgres:///snorkel-intro'

from snorkel import SnorkelSession
session = SnorkelSession()

# Here, we just set how many documents we'll process for automated testing; you can safely ignore this!
n_docs = 500 if 'CI' in os.environ else 2591

## Loading the Corpus

Next, we load and pre-process the corpus of documents.

### Configuring a `DocPreprocessor`

We'll start by creating a `TSVDocPreprocessor` to read in the documents, which are stored in a tab-separated value format as pairs of document names and text.

In [3]:
from snorkel.parser import TSVDocPreprocessor

doc_preprocessor = TSVDocPreprocessor('data/articles.tsv', max_docs=n_docs)
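
To make the input format concrete, here is a minimal sketch of reading name/text pairs from tab-separated data using only the standard library. The function `read_tsv_docs` and the sample rows are made up for illustration; this is not how `TSVDocPreprocessor` is implemented, only a picture of the data it consumes.

```python
import csv
import io

# Made-up sample standing in for a file such as data/articles.tsv
sample_tsv = (
    "doc-001\tAlice married Bob in 1999.\n"
    "doc-002\tCarol met Dave at work.\n"
)

def read_tsv_docs(handle, max_docs=None):
    """Yield (name, text) pairs, stopping after max_docs rows if given."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if max_docs is not None and i >= max_docs:
            break
        yield row[0], row[1]

first_docs = list(read_tsv_docs(io.StringIO(sample_tsv), max_docs=1))
print(first_docs)
```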

### Running a `CorpusParser`

We'll use [spaCy](https://spacy.io/), an NLP preprocessing tool, to split our documents into sentences and tokens, and provide named entity annotations.

In [4]:
from snorkel.parser import CorpusParser
from snorkel.parser.spacy_parser import Spacy

corpus_parser = CorpusParser(parser=Spacy())
%time corpus_parser.apply(doc_preprocessor, count=n_docs)

Clearing existing...
Running UDF...

CPU times: user 2min 4s, sys: 1.79 s, total: 2min 5s
Wall time: 2min 6s


We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which Snorkel uses) to check how many documents and sentences were parsed:

In [5]:
from snorkel.models import Document, Sentence

print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

Documents: 2591
Sentences: 67820
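
For readers less familiar with SQLAlchemy, the counts above correspond roughly to `SELECT COUNT(*)` queries. The sketch below shows the same idea with Python's built-in `sqlite3` module on an in-memory database; the table and column names are assumptions for this example and are not Snorkel's actual schema.

```python
import sqlite3

# In-memory database with two illustrative tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sentence (id INTEGER PRIMARY KEY, document_id INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO document (name) VALUES (?)",
    [("doc-001",), ("doc-002",)],
)
conn.executemany(
    "INSERT INTO sentence (document_id, text) VALUES (?, ?)",
    [(1, "Alice married Bob."), (1, "They live in Paris."), (2, "Carol met Dave.")],
)

# Count the rows in each table
n_documents = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
n_sentences = conn.execute("SELECT COUNT(*) FROM sentence").fetchone()[0]
print("Documents:", n_documents)
print("Sentences:", n_sentences)
```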


## Generating Candidates

The next step is to extract _candidates_ from our corpus. A `Candidate` in Snorkel is an object for which we want to make a prediction. In this case, the candidates are pairs of people mentioned in sentences, and our task is to predict which pairs are described as married in the associated text.

### Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`, and we define it using a helper function. Here we'll define a binary _spouse relation mention_ which connects two `Span` objects of text.  Note that this function will create the table in the database backend if it does not exist:

In [6]:
from snorkel.models import candidate_subclass
    
Spouse = candidate_subclass('Spouse', ['person1', 'person2'])
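
To give a feel for what a helper like this does, the sketch below builds a subclass at runtime with Python's built-in `type()`. Everything here is a stand-in: the `Candidate` base class and `make_candidate_subclass` are hypothetical, and unlike Snorkel's `candidate_subclass` this version creates no database table.

```python
class Candidate:
    """Stand-in base class; not Snorkel's database-backed Candidate."""
    __argnames__ = ()

    def __init__(self, **kwargs):
        # Store one attribute per declared argument name
        for name in self.__argnames__:
            setattr(self, name, kwargs[name])

def make_candidate_subclass(class_name, arg_names):
    """Hypothetical analogue of candidate_subclass: create a subclass at runtime."""
    return type(class_name, (Candidate,), {"__argnames__": tuple(arg_names)})

SpouseSketch = make_candidate_subclass("SpouseSketch", ["person1", "person2"])
pair = SpouseSketch(person1="Alice Smith", person2="Bob Jones")
print(type(pair).__name__, pair.person1, pair.person2)
```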

### Writing a basic `CrossContextCandidateExtractor`

Next, we'll write a basic function to extract **candidate spouse relation mentions** from the corpus.  The [spaCy](https://spacy.io/) parser we used performs _named entity recognition_ for us.

We will extract `Candidate` objects of the `Spouse` type by identifying, for each group of `Sentences` within a specified `window`, all pairs of n-grams (up to 7-grams) that were tagged as people. (An n-gram is a span of text made up of n tokens.) We do this with three objects:

* A `ContextSpace` defines the "space" of all candidates we even potentially consider; in this case we use the `Ngrams` subclass, and look for all n-grams up to 7 words long.

* A `Matcher` heuristically filters the candidates we use.  In this case, we just use a pre-defined matcher which looks for all n-grams tagged by spaCy as "PERSON". The keyword argument `longest_match_only` means that we'll skip n-grams contained in other n-grams.

* A `CrossContextAnnotator` (the cross-context candidate extractor, as it is named in the code below) combines all of this!
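
The first two ideas in the list above can be sketched in plain Python: enumerate every n-gram up to a maximum length, keep those whose tokens are all tagged as a person, and optionally drop spans nested inside a longer match. The function names and the toy tags are assumptions for illustration; Snorkel's `Ngrams` and `PersonMatcher` operate on parsed `Sentence` objects and may behave differently in edge cases.

```python
def ngram_spans(n_tokens, n_max):
    """All (start, end) token spans, end exclusive, of length 1..n_max."""
    return [
        (start, start + n)
        for n in range(1, n_max + 1)
        for start in range(0, n_tokens - n + 1)
    ]

def person_spans(ner_tags, n_max=7, longest_match_only=True):
    """Spans whose tokens are all tagged PERSON, optionally longest only."""
    spans = [
        (s, e)
        for (s, e) in ngram_spans(len(ner_tags), n_max)
        if all(tag == "PERSON" for tag in ner_tags[s:e])
    ]
    if longest_match_only:
        # Drop any span strictly contained in another matching span
        spans = [
            (s, e)
            for (s, e) in spans
            if not any(
                s2 <= s and e <= e2 and (s2, e2) != (s, e) for (s2, e2) in spans
            )
        ]
    return sorted(spans)

tokens = ["Alice", "Smith", "married", "Bob", "Jones", "yesterday"]
tags = ["PERSON", "PERSON", "O", "PERSON", "PERSON", "O"]
people = [" ".join(tokens[s:e]) for (s, e) in person_spans(tags)]
print(people)
```

With `longest_match_only=False`, the single-token spans such as "Alice" and "Smith" would be kept alongside the two-token matches.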

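Next, we'll split up the documents and collect the associated sentences. The pipeline distinguishes train, development, and test splits; the cell below gathers only the documents whose index has remainder 8 or 9 modulo 10 into a single list of `(name, sentences)` pairs. Note that the `CrossContextAnnotator` requires each document's sentences as an ordered list.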

In [7]:
from snorkel.models import Document
from util import number_of_people

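# Sort by name so the selection below is reproducible
docs = session.query(Document).order_by(Document.name).all()

# Keep documents whose index has remainder 8 or 9 modulo 10,
# pairing each document name with its sentences
docs_list = []
for i, doc in enumerate(docs):
    if i % 10 in (8, 9):
        docs_list.append((doc.name, doc.sentences))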
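
The cell above collects only the documents at remainders 8 and 9 into one list. For comparison, a three-way train/development/test split based on the same index arithmetic could look like the sketch below. The mapping of remainder 8 to development and 9 to test is an assumption made for this example, as are the function name and the toy document names.

```python
def assign_split(index):
    """Assumed convention: remainder 8 -> dev (1), 9 -> test (2), else train (0)."""
    remainder = index % 10
    if remainder == 8:
        return 1
    if remainder == 9:
        return 2
    return 0

# Toy document names standing in for the real corpus
doc_names = ["doc-%03d" % i for i in range(20)]

splits = {0: [], 1: [], 2: []}
for i, name in enumerate(sorted(doc_names)):
    splits[assign_split(i)].append(name)

print({split: len(names) for split, names in splits.items()})
```

Sorting before enumerating matters: the split a document lands in depends only on its position in the sorted list, so the assignment is stable across runs.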

Finally, we'll apply the candidate extractor to the collected sentences. `window_size` sets how many adjacent sentences candidates are extracted from at once, and `thresholds` sets the maximum number of person detections per matcher. The results will be persisted in the database backend.

In [None]:
from snorkel.candidates import Ngrams, CrossContextAnnotator
from snorkel.matchers import PersonMatcher

ngrams         = Ngrams(n_max=7)
person_matcher = PersonMatcher(longest_match_only=True)
cand_annotator = CrossContextAnnotator(Spouse, [ngrams, ngrams], [person_matcher, person_matcher])
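# docs_list is the only group here, so it is stored as split 0
for i, sents in enumerate([docs_list]):
    cand_annotator.apply(sents, window_size=2, thresholds=[3, 3], split=i)
    # Report how many candidates were stored for this split
    print("Number of candidates:", session.query(Spouse).filter(Spouse.split == i).count())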
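
To build intuition for `window_size` and `thresholds`, here is a simplified sketch of cross-sentence pairing. It slides a window over adjacent sentences, skips any window containing more person mentions than a threshold, and emits each pair of mentions once. This is an assumption about the general idea, not the `CrossContextAnnotator` implementation: the function name is hypothetical, and it uses a single threshold on the total mentions per window rather than one threshold per matcher.

```python
from itertools import combinations

def cross_sentence_pairs(sentence_mentions, window_size=2, threshold=3):
    """Pair person mentions found within windows of adjacent sentences.

    sentence_mentions: one list of mention strings per sentence, in order.
    Returns pairs of (sentence_index, mention) tuples.
    """
    pairs = []
    seen = set()
    last_start = max(len(sentence_mentions) - window_size, 0)
    for start in range(last_start + 1):
        stop = min(start + window_size, len(sentence_mentions))
        window = [
            (idx, mention)
            for idx in range(start, stop)
            for mention in sentence_mentions[idx]
        ]
        if len(window) > threshold:
            continue  # too many person mentions in this window: skip it
        for pair in combinations(window, 2):
            if pair not in seen:  # overlapping windows can repeat a pair
                seen.add(pair)
                pairs.append(pair)
    return pairs

mentions = [["Alice Smith"], ["Bob Jones"], [], ["Carol", "Dave", "Erin", "Frank"]]
candidate_pairs = cross_sentence_pairs(mentions, window_size=2, threshold=3)
print(candidate_pairs)
```

In this toy input, the two people mentioned in neighbouring sentences form a candidate, while the sentence with four mentions exceeds the threshold and contributes nothing.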

## Loading Gold Labels
Finally, we'll load gold labels for development and evaluation. Even though Snorkel is designed to create labels for data, we still use gold labels to evaluate the quality of our models. Fortunately, we need far less labeled data to _evaluate_ a model than to _train_ it.

In [None]:
from util import load_external_labels

%time missed = load_external_labels(session, Spouse, annotator_name='gold')
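
As a preview of how gold labels are used, the sketch below compares a set of predicted positive candidates against a set of gold positives and computes precision, recall, and F1. The candidate ids and the function name are made up for illustration; the evaluation notebook may use different utilities.

```python
def precision_recall_f1(predicted_positive, gold_positive):
    """Compute precision, recall, and F1 from two collections of candidate ids."""
    predicted_positive = set(predicted_positive)
    gold_positive = set(gold_positive)
    true_positives = len(predicted_positive & gold_positive)
    precision = true_positives / len(predicted_positive) if predicted_positive else 0.0
    recall = true_positives / len(gold_positive) if gold_positive else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up candidate ids: gold says 1-4 are spouses, the model predicts 2, 3, 5
gold_ids = {1, 2, 3, 4}
predicted_ids = {2, 3, 5}
precision, recall, f1 = precision_recall_f1(predicted_ids, gold_ids)
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))
```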