# 1: Candidate Extraction

In [None]:
%load_ext autoreload
%autoreload 2

import os
os.environ['SNORKELDB']="postgres:///stromatolite"

from snorkel import SnorkelSession
session = SnorkelSession()

## Loading the `Sentence` objects

In [None]:
from snorkel.models import Sentence

sentences = session.query(Sentence).all()
len(sentences)

## Defining a `Candidate` schema
We now define the schema of the relation mention we want to extract (which is also the schema of the candidates).  This must be a subclass of `Candidate`, and we define it using a helper function.

Here we'll define a binary relation mention, `StromStrat`, which connects two `Span` objects of text: a stromatolite mention (`strom`) and a stratigraphic name (`stratname`).  Note that this function will create the table in the database backend if it does not exist:

In [None]:
from snorkel.models import candidate_subclass

StromStrat = candidate_subclass('StromStrat', ['strom', 'stratname'])

## Writing a basic `CandidateExtractor`

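Next, we'll write a basic function to extract **candidate stromatolite / stratigraphic name relation mentions** from the corpus.  The `SentenceParser` we used in Part I is built on [CoreNLP](http://stanfordnlp.github.io/CoreNLP/), which performs _named entity recognition_ for us; here, however, we rely on a regular expression and a dictionary of stratigraphic names rather than on entity tags.

We will extract `Candidate` objects of the `StromStrat` type by identifying, for each `Sentence`, all pairs of ngrams in which one ngram (a unigram) matches a stromatolite term and the other (up to nine tokens) matches a stratigraphic name.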

First, we define the child context spaces for our sentences: unigrams for stromatolite mentions and ngrams of up to nine tokens for stratigraphic names.

In [None]:
from snorkel.candidates import Ngrams

ngram_strom = Ngrams(n_max=1)
ngram_strat = Ngrams(n_max=9)

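Next, we define the matchers that filter these spans.  A `RegexMatchSpan` restricts the first argument to spans that match a stromatolite or thrombolite pattern, and a `DictionaryMatch` restricts the second argument to stratigraphic names drawn from the Macrostrat API.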

In [None]:
from snorkel.matchers import RegexMatchSpan

strom_matcher = RegexMatchSpan(rgx="stromatolit|thrombolit")
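
To get a feel for which tokens the pattern is meant to catch, here is a minimal sketch using only Python's standard `re` module. It assumes a case-insensitive search over each token; the anchoring and case handling that `RegexMatchSpan` actually applies may differ between Snorkel versions, so treat this as an illustration of the pattern rather than of the matcher. The token list is made up for the example.

```python
import re

# Same pattern as strom_matcher above; IGNORECASE is an assumption of this sketch.
pattern = re.compile(r"stromatolit|thrombolit", flags=re.IGNORECASE)

# Hypothetical tokens, chosen to show hits and near misses.
tokens = ["stromatolites", "Stromatolitic", "thrombolite", "stromatoporoid", "limestone"]

# Keep the tokens in which the pattern occurs anywhere.
matched = [t for t in tokens if pattern.search(t)]
```

Because the pattern is a stem rather than a full word, it covers plural and adjectival forms while leaving look-alikes such as "stromatoporoid" out.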

In [None]:
from snorkel.matchers import DictionaryMatch
import urllib
import json

request = urllib.urlopen('https://macrostrat.org/api/v2/defs/strat_names?all')
data = json.loads(request.read())

#FULL STRAT NAME
strat_dict_long = { r['strat_name_long'] for r in data['success']['data'] }

#ABBREVIATED STRAT NAME - V1
strat_dict_abV1 = { r['strat_name'] + ' ' + r['rank'] for r in data['success']['data'] }

#ABBREVIATED STRAT NAME - V2
strat_dict_abV2 = { r['strat_name'] + ' ' + r['rank'] + '.' for r in data['success']['data'] }

#LITHOLOGY STRAT NAMES
request = urllib.urlopen('https://macrostrat.org/api/v2/defs/lithologies?all')
lithologies = json.loads(request.read())
lithologies=[l['name'].capitalize() for l in lithologies['success']['data']]

strat_dict_short = { r['strat_name'] for r in data['success']['data'] }

# Keep short names whose last word is a lithology name
strat_dict_lith = { r for r in strat_dict_short if r.split(' ')[-1] in lithologies }

# Union of all name variants
strat_dict = strat_dict_long | strat_dict_abV1 | strat_dict_abV2 | strat_dict_lith

strat_matcher = DictionaryMatch(d=strat_dict, ignore_case=False, longest_match_only=True)
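
The dictionary construction can be tried offline with a few invented records. The sketch below mirrors the logic of the cell above; the records, the lithology list, and the helper name `build_strat_dict` are hypothetical and are not real API output.

```python
# Made-up records that mimic the fields used above (not real API output).
records = [
    {"strat_name": "Example Creek", "rank": "Fm",
     "strat_name_long": "Example Creek Formation"},
    {"strat_name": "Sample Ridge Dolomite", "rank": "Mbr",
     "strat_name_long": "Sample Ridge Dolomite Member"},
]
lithology_names = ["dolomite", "limestone"]  # hypothetical subset of lithologies


def build_strat_dict(records, lithology_names):
    """Return the union of long, abbreviated, and lithology-terminated names."""
    liths = {name.capitalize() for name in lithology_names}
    names = set()
    for r in records:
        names.add(r["strat_name_long"])                     # full name
        names.add(r["strat_name"] + " " + r["rank"])        # abbreviated, no period
        names.add(r["strat_name"] + " " + r["rank"] + ".")  # abbreviated, with period
        if r["strat_name"].split(" ")[-1] in liths:         # short name ends in a lithology
            names.add(r["strat_name"])
    return names


strat_names = build_strat_dict(records, lithology_names)
```

Only the second record contributes its bare short name, because "Dolomite" is in the lithology list while "Creek" is not. This keeps ambiguous short names out of the dictionary.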


Finally, we combine the candidate class, child context spaces, and matchers into an extractor.

In [None]:
from snorkel.candidates import CandidateExtractor

ce = CandidateExtractor(StromStrat, [ngram_strom, ngram_strat], [strom_matcher, strat_matcher],
                        symmetric_relations=True, nested_relations=False, self_relations=False)
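
Conceptually, a binary extractor pairs spans accepted by the first matcher with spans accepted by the second matcher within the same sentence. The sketch below illustrates only that pairing idea in plain Python. The span tuples are hypothetical, and the code does not reproduce how Snorkel handles `symmetric_relations`, `nested_relations`, or `self_relations`.

```python
from itertools import product

# Hypothetical matched spans for one sentence: (start_token, end_token, text).
strom_spans = [(3, 3, "stromatolites")]
strat_spans = [(7, 9, "Example Creek Formation"), (15, 16, "Sample Fm")]

# Every (strom, stratname) combination becomes one candidate pair.
pairs = list(product(strom_spans, strat_spans))
```

With one stromatolite span and two stratigraphic-name spans, the sentence yields two candidate pairs. This is why sentences that list many unit names can produce many candidates.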

## Running the `CandidateExtractor`

We run the `CandidateExtractor` by calling extract with the contexts to extract from, a name for the `CandidateSet` that will contain the results, and the current session.

In [None]:
%time c = ce.extract(sentences, 'Candidate Set', session)
print "Number of candidates:", len(c)

### Saving the extracted candidates

In [None]:
session.add(c)
session.commit()

### Splitting into train / test sets

We split by _document_, so that all candidates from one document land in the same set.  First, let's look at the distribution of candidates per document:

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt
%matplotlib inline

candidates_by_doc = defaultdict(set)
for cand in c:
    candidates_by_doc[cand[0].parent.document.id].add(cand)

plt.hist(map(len, candidates_by_doc.values()))

And the total number of documents:

In [None]:
len(candidates_by_doc.keys())

Now, split the candidates into train / test:

In [None]:
from random import shuffle

from snorkel.models import CandidateSet

# Shuffle the document ids and send roughly two thirds of the documents to training
doc_ids = list(candidates_by_doc.keys())
shuffle(doc_ids)
split = int(0.66 * len(doc_ids))

train = CandidateSet(name='Training Candidates')
session.add(train)
for doc_id in doc_ids[:split]:
    for cand in candidates_by_doc[doc_id]:
        train.append(cand)
print len(train)

test = CandidateSet(name='Test Candidates')
session.add(test)
for doc_id in doc_ids[split:]:
    for cand in candidates_by_doc[doc_id]:
        test.append(cand)
print len(test)

session.commit()
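
The document-level split can also be written as a small standalone function, which makes it easy to check on toy data. This is a sketch, not part of the original workflow: the helper name `split_by_document`, the fixed `seed` (added so the shuffle is repeatable), and the toy dictionary are all assumptions.

```python
import random


def split_by_document(candidates_by_doc, train_fraction=0.66, seed=0):
    """Split candidates so that each document falls entirely in train or test."""
    doc_ids = sorted(candidates_by_doc)       # fixed starting order
    random.Random(seed).shuffle(doc_ids)      # seeded, repeatable shuffle
    cut = int(train_fraction * len(doc_ids))
    train_docs, test_docs = doc_ids[:cut], doc_ids[cut:]
    train = [c for d in train_docs for c in candidates_by_doc[d]]
    test = [c for d in test_docs for c in candidates_by_doc[d]]
    return train_docs, test_docs, train, test


# Toy input: document id -> candidates found in that document.
toy = {"doc1": ["a", "b"], "doc2": ["c"], "doc3": ["d", "e", "f"]}
train_docs, test_docs, train_cands, test_cands = split_by_document(toy)
```

Note that the fraction applies to documents, not candidates, so the candidate counts in the two sets can be uneven when a few documents hold most of the candidates.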

### Reloading the candidates

In [None]:
from snorkel.models import CandidateSet

train = session.query(CandidateSet).filter(CandidateSet.name == 'Training Candidates').one()
print len(train)

test = session.query(CandidateSet).filter(CandidateSet.name == 'Test Candidates').one()
print len(test)

## Using the `Viewer` to inspect candidates

Next, we'll use the `Viewer` class (here, specifically, the `SentenceNgramViewer`) to inspect the data.

Note that our goal here is to **maximize the recall of true candidates** extracted, **not** to extract _only_ the correct candidates. Learning to distinguish true candidates from false candidates is covered in Tutorial 4.

First, we instantiate the `Viewer` object, which groups the input `Candidate` objects by `Sentence`:

In [None]:
from snorkel.viewer import SentenceNgramViewer

sv = SentenceNgramViewer(train, session)
sv

In [None]:
sv.get_selected()

Note that we can **navigate using the provided buttons**, or **using the keyboard (hover over buttons to see controls)**, highlight candidates (even if they overlap), and also **apply binary labels** (more on where to use this later!).  In particular, note that **the Viewer is synced dynamically with the notebook**, so that we can for example get the `Candidate` that is currently selected. Try it out!