# Phase 1: KBC Initialization

In this first phase of `Fonduer`'s pipeline, `Fonduer` uses a user-specified _schema_ to initialize a relational database where the output KB will be stored. Furthermore, `Fonduer` iterates over its input _corpus_ and transforms each document into a unified data model, which captures the variability and multimodality of richly formatted data. This unified data model then serves as an intermediate representation used in the rest of the phases.

This preprocessed data is saved to a database. The connection string can be specified by setting the `SNORKELDB` environment variable. If no database is specified, a SQLite database at `./snorkel.db` is created by default. However, to enable parallel execution, we use PostgreSQL throughout this tutorial.

We initialize several variables for convenience that define what the database should be called and what level of parallelization the `Fonduer` pipeline will be run with. In the code below, we use PostgreSQL as our database backend. 

Before you continue, please make sure that you have PostgreSQL installed and have created a new database named `zeugma`.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import sys

PARALLEL = 4 # assuming a quad-core machine
ATTRIBUTE = "zeugma"

os.environ['FONDUERDBNAME'] = ATTRIBUTE
os.environ['SNORKELDB'] = 'postgres://weijiechen1994@localhost:5432/' + os.environ['FONDUERDBNAME']
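
As a quick sanity check before connecting, the sketch below pulls apart a connection string of the shape used above. It relies only on Python's standard `urllib.parse`; the helper names `resolve_db_url` and `describe_db_url`, the placeholder user name, and the SQLite fallback URL are our own assumptions for illustration and are not part of `Fonduer`.

```python
import os
from urllib.parse import urlsplit

# Assumed fallback URL, mirroring the "SQLite at ./snorkel.db" default described above.
DEFAULT_DB_URL = "sqlite:///snorkel.db"


def resolve_db_url(environ=None):
    """Return the value of SNORKELDB if set and non-empty, else the assumed default."""
    environ = os.environ if environ is None else environ
    return environ.get("SNORKELDB") or DEFAULT_DB_URL


def describe_db_url(url):
    """Split a connection string into the pieces that matter for this tutorial."""
    parts = urlsplit(url)
    return {
        "backend": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# "user" is a placeholder account name, not a real one.
info = describe_db_url("postgres://user@localhost:5432/zeugma")
print(info)
```

If the reported database name is not `zeugma`, the environment variables above were probably set in a different order than intended.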

## 1.1 Defining a Candidate Schema

We first initialize a `SnorkelSession`, which manages the connection to the database automatically, and enables us to save intermediate results.

In [2]:
from fonduer import SnorkelSession

session = SnorkelSession()


Next, we define the _schema_ of the relation we want to extract. This must be a subclass of `Candidate`, and we define it using a helper function. Here, we define a binary relation which connects two `Span` objects of text. Calling the helper also creates the relation's database table if it does not already exist.

In [26]:
from fonduer import candidate_subclass

Cata_Labref = candidate_subclass('Cata_Labref', ['cata','labref'])
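
To make the shape of this schema concrete, here is a plain-Python analogue built with `collections.namedtuple`. It is only an illustration: the real `candidate_subclass` also creates a database table, which this sketch does not, and the example values are invented.

```python
from collections import namedtuple

# Stand-in for the binary relation: one field per argument of the relation.
CataLabrefSketch = namedtuple("CataLabrefSketch", ["cata", "labref"])

# Invented example values, used only to show how the two arguments are accessed.
example = CataLabrefSketch(cata="AM12", labref="12345")
print(example.cata, example.labref)
print(len(example))
```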

## 1.2 Parsing and Transforming the Input Documents into Unified Data Models

Next, we load the corpus of datasheets and transform them into the unified data model. Each datasheet has a PDF and an HTML representation, and both are used together to create a robust unified data model with textual, structural, tabular, and visual modality information. Because the documents are independent of one another, we can parse them in parallel. Parallel execution does not work with SQLite, the default database engine, so we depend on PostgreSQL for this functionality.

### Configuring an `HTMLPreprocessor`
We start by setting the paths to where our documents are stored and defining an `HTMLPreprocessor` to read in the documents found in the specified paths. `max_docs` specifies the maximum number of documents to parse. In the code below it is set to `float('inf')`, so every document found under `docs_path` is parsed.

**Note that you need to have run `download_data.sh` before executing these next steps or you won't have the documents needed for the tutorial.**

In [27]:
from fonduer import HTMLPreprocessor, OmniParser

docs_path = 'data/html/'
pdf_path = 'data/pdf/'

max_docs = float('inf')
doc_preprocessor = HTMLPreprocessor(docs_path, max_docs=max_docs)
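
Because both representations of a document are used together, it can help to confirm that every HTML file has a PDF counterpart before parsing. The sketch below assumes the two files share a base name (for example `v2ch01.html` and `v2ch01.pdf`); that naming convention and the helper `unpaired_documents` are assumptions, not something `Fonduer` provides.

```python
from pathlib import Path


def unpaired_documents(html_names, pdf_names):
    """Return base names that appear in only one of the two collections."""
    html_stems = {Path(name).stem for name in html_names}
    pdf_stems = {Path(name).stem for name in pdf_names}
    return sorted(html_stems ^ pdf_stems)


# Hypothetical listings; in practice these could come from os.listdir(docs_path)
# and os.listdir(pdf_path).
html_files = ["v2ch01.html", "v2ch02.html", "v2ch03.html"]
pdf_files = ["v2ch01.pdf", "v2ch02.pdf"]
missing = unpaired_documents(html_files, pdf_files)
print(missing)
```

An empty result means every document has both representations available.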

### Configuring an `OmniParser`
Next, we configure an `OmniParser`, which serves as our `CorpusParser` for PDF documents. We use [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) as a preprocessing tool to split our documents into phrases and tokens, and to provide annotations such as part-of-speech tags and dependency parse structures for these phrases. In addition, we can specify which modality information to include in the unified data model for each document. Below, we enable all modality information.

In [45]:
corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

CPU times: user 1.5 s, sys: 36.6 ms, total: 1.54 s
Wall time: 5min 29s


We can then use simple database queries (written in the syntax of [SQLAlchemy](http://www.sqlalchemy.org/), which `Fonduer` uses) to check how many documents and phrases (sentences) were parsed, or even check how many phrases and tables are contained in each document.

In [46]:
from fonduer import Document, Phrase

print("Documents:", session.query(Document).count())
print("Phrases:", session.query(Phrase).count())

Documents: 11
Phrases: 25358
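
The per-document counts mentioned above amount to a grouped count. The sketch below shows the idea with raw SQL on an in-memory SQLite database from Python's standard `sqlite3` module instead of SQLAlchemy; the table and column names (`document`, `phrase`, `document_id`) and the rows are invented for illustration and do not necessarily match `Fonduer`'s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phrase (id INTEGER PRIMARY KEY, document_id INTEGER, text TEXT);
    INSERT INTO document (id, name) VALUES (1, 'doc_a'), (2, 'doc_b');
    INSERT INTO phrase (document_id, text) VALUES
        (1, 'first phrase'), (1, 'second phrase'), (2, 'only phrase');
    """
)

# Count the phrases that belong to each document.
rows = conn.execute(
    """
    SELECT d.name, COUNT(p.id)
    FROM document AS d
    LEFT JOIN phrase AS p ON p.document_id = d.id
    GROUP BY d.name
    ORDER BY d.name
    """
).fetchall()
conn.close()
print(rows)
```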


## 1.3 Dividing the Corpus into Test and Train

We'll split the documents 80/10/10 into train/dev/test splits. Note that here we do this in a non-random order to preserve consistency across runs of the tutorial, and we reference the splits by 0/1/2 respectively.

In [47]:
docs = session.query(Document).order_by(Document.name).all()
ld   = len(docs)

train_docs = set()
dev_docs   = set()
test_docs  = set()
splits = (0.8, 0.9)
data = [(doc.name, doc) for doc in docs]
data.sort(key=lambda x: x[0])
for i, (doc_name, doc) in enumerate(data):
    if i < splits[0] * ld:
        train_docs.add(doc)
    elif i < splits[1] * ld:
        dev_docs.add(doc)
    else:
        test_docs.add(doc)
from pprint import pprint
pprint([x.name for x in train_docs])

['v2ch05',
 'v2ch01',
 'v2ch06',
 'v2ch02',
 'v2ch07',
 'v2ch03',
 'v2ch08',
 'v2ch04',
 'v2hawari-plates']
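
The thresholds above explain why 9 of the 11 documents land in the training set: an index `i` goes to train while `i < 0.8 * 11 = 8.8`, that is, for `i = 0..8`. The sketch below repeats the same rule on plain names; `split_by_fraction` is a hypothetical helper and the names are placeholders.

```python
def split_by_fraction(names, splits=(0.8, 0.9)):
    """Assign sorted names to train/dev/test using cumulative fractions."""
    ordered = sorted(names)
    total = len(ordered)
    train, dev, test = [], [], []
    for i, name in enumerate(ordered):
        if i < splits[0] * total:
            train.append(name)
        elif i < splits[1] * total:
            dev.append(name)
        else:
            test.append(name)
    return train, dev, test


# Eleven placeholder names, matching the size of the corpus parsed above.
names = ["doc{:02d}".format(i) for i in range(11)]
train, dev, test = split_by_fraction(names)
print(len(train), len(dev), len(test))
```

Because the cell above stores documents in sets, which are unordered, the printed list of training documents is not sorted even though the assignment itself is deterministic.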


# Phase 2: Candidate Extraction & Multimodal Featurization
Given the unified data model from Phase 1, `Fonduer` extracts relation candidates based on user-provided **matchers** and **throttlers**. Then, `Fonduer` leverages the multimodality information captured in the unified data model to provide multimodal features for each candidate.

## 2.1 Candidate Extraction

The next step is to extract **candidates** from our corpus. A `candidate` is the object for which we want to make predictions. In this case, the candidates are pairs of transistor part numbers and their corresponding maximum storage temperatures as found in their datasheets. Our task is to predict which pairs are true in the associated document.

To do so, we write **matchers** to define which spans of text in the corpus are instances of each entity. Matchers can leverage a variety of information from regular expressions, to dictionaries, to user-defined functions. Furthermore, different techniques can be combined to form higher quality matchers. In general, matchers should seek to be as precise as possible while maintaining complete recall.

In our case, we need to write a matcher that defines a transistor part number and a matcher to define a valid temperature value.

### Writing a simple temperature matcher

Our maximum storage temperature matcher can be a very simple regular expression since we know that we are looking for integers, and by inspecting a portion of our corpus, we see that maximum storage temperatures fall within a fairly narrow range.

In [48]:
from fonduer import RegexMatchSpan, DictionaryMatch, LambdaFunctionMatcher, Intersect, Union

labref1_rgx = r'[0-9]{4,5}'
labref2_rgx = r'[0-9]{4,5}.[0-9]{1,3}'
lab_rgx = '|'.join([labref1_rgx, labref2_rgx])
labref_matcher = RegexMatchSpan(rgx=lab_rgx, longest_match_only=False)
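
To see what these patterns accept, we can try them with Python's `re` module. This sketch uses `re.fullmatch` as an approximation of how a span is tested; the exact matching semantics of `RegexMatchSpan` are not shown here and may differ, and the sample strings are invented. Note that the `.` in `labref2_rgx` is unescaped, so it matches any character rather than only a literal period.

```python
import re

labref1_rgx = r'[0-9]{4,5}'
labref2_rgx = r'[0-9]{4,5}.[0-9]{1,3}'
lab_rgx = '|'.join([labref1_rgx, labref2_rgx])


def looks_like_labref(text):
    """Return True if the whole string matches either pattern."""
    return re.fullmatch(lab_rgx, text) is not None


# Invented samples: plain digits, digits with a period, digits with a letter, too short.
for sample in ["12345", "1234.56", "1234x56", "123"]:
    print(sample, looks_like_labref(sample))
```

If only a literal period is intended, escaping it as `\.` would reject strings such as `1234x56`.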

### Writing an advanced transistor part matcher

In contrast, transistor part numbers are complex expressions. Here, we show how a matcher for transistor part numbers can combine [naming conventions](https://en.wikipedia.org/wiki/Transistor#Part_numbering_standards.2Fspecifications) expressed as regular expressions, a dictionary of known part numbers, and user-defined functions. First, we create a regular expression matcher for standard transistor naming conventions.

In [49]:
from fonduer import RegexMatchSpan, DictionaryMatch, LambdaFunctionMatcher, Intersect, Union

### Catalogue name and lab ref as Regular Expressions ###
cata_rgx = r'(A|AM|B|BR|C|G|GD|IN|IR|L|LW|M|ML|PT|Q|SM|SS|ST|SV|SW|TC|TX|ZB)[0-9]{1,3}'
cata_rgx_matcher = RegexMatchSpan(rgx=cata_rgx, longest_match_only=True)
# part_matcher = Union(part_rgx_matcher, part_dict_matcher, part_file_name_matcher)
cata_matcher = cata_rgx_matcher

In [50]:
# ### Transistor Naming Conventions as Regular Expressions ###
# eeca_rgx = r'([ABC][A-Z][WXYZ]?[0-9]{3,5}(?:[A-Z]){0,5}[0-9]?[A-Z]?(?:-[A-Z0-9]{1,7})?(?:[-][A-Z0-9]{1,2})?(?:\/DG)?)'
# jedec_rgx = r'(2N\d{3,4}[A-Z]{0,5}[0-9]?[A-Z]?)'
# jis_rgx = r'(2S[ABCDEFGHJKMQRSTVZ]{1}[\d]{2,4})'
# others_rgx = r'((?:NSVBC|SMBT|MJ|MJE|MPS|MRF|RCA|TIP|ZTX|ZT|ZXT|TIS|TIPL|DTC|MMBT|SMMBT|PZT|FZT|STD|BUV|PBSS|KSC|CXT|FCX|CMPT){1}[\d]{2,4}[A-Z]{0,5}(?:-[A-Z0-9]{0,6})?(?:[-][A-Z0-9]{0,1})?)'

# part_rgx = '|'.join([eeca_rgx, jedec_rgx, jis_rgx, others_rgx])
# part_rgx_matcher = RegexMatchSpan(rgx=part_rgx, longest_match_only=True)

### Define a relation's `ContextSpaces`

Next, in order to define the "space" of all candidates that are even considered from the document, we need to define a `ContextSpace` for each component of the relation we wish to extract.

In the case of transistor part numbers, the `ContextSpace` can be quite complex due to the need to handle implicit part numbers that are implied in text like "BC546A/B/C...BC548A/B/C", which refers to 9 unique part numbers. In addition, to handle these, we consider all n-grams up to 3 words long.

In contrast, the `ContextSpace` for temperature values is simpler: we only need to process different Unicode representations of a dash (`-`), and don't need to look at more than two words at a time.

When no special preprocessing like this is needed, we can use the default `OmniNgrams` class provided by `snorkel.candidates`. For example, if we were looking to match polarities, which only take the form of "NPN" or "PNP", we could use `attr_ngrams = OmniNgrams(n_max=1)`.

In [51]:
from zeugma_space import OmniNgramsCata, OmniNgramsLabref
    
cata_ngrams = OmniNgramsCata(parts_by_doc=None, n_max=1)
labref_ngrams = OmniNgramsLabref(n_max=1)
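
Conceptually, a `ContextSpace` with `n_max` enumerates every run of up to `n_max` consecutive words in a phrase. The sketch below is a simplified, hypothetical version of that enumeration on a list of tokens; it is not the `OmniNgrams` implementation and ignores the special preprocessing discussed above.

```python
def ngrams(tokens, n_max):
    """Yield every contiguous run of 1..n_max tokens as a tuple."""
    for start in range(len(tokens)):
        for length in range(1, n_max + 1):
            end = start + length
            if end > len(tokens):
                break
            yield tuple(tokens[start:end])


# Invented tokens for illustration.
tokens = ["Lab", "ref", "12345"]
unigrams = list(ngrams(tokens, n_max=1))
up_to_bigrams = list(ngrams(tokens, n_max=2))
print(unigrams)
print(up_to_bigrams)
```

With `n_max=1`, as in the cell above, only single words are considered.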

### Defining candidate `Throttlers`

Next, we need to define **throttlers**, which allow us to further prune excess candidates and avoid unnecessarily materializing invalid candidates. Throttlers, like matchers, act as hard filters, and should be created to have high precision while maintaining complete recall, if possible.

Here, we create a throttler that discards candidates if they are in the same table, but the part and storage temperature are not vertically or horizontally aligned.

In [52]:
from fonduer.lf_helpers import *
import re

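def stg_temp_filter(c):
    # c is a pair of spans, one per argument of the relation.
    (part, attr) = c
    # Within a single table, keep the pair only if the two spans are
    # horizontally or vertically aligned.
    if same_table((part, attr)):
        return (is_horz_aligned((part, attr)) or is_vert_aligned((part, attr)))
    # Pairs that are not in the same table are always kept.
    return True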

candidate_filter = stg_temp_filter
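
The logic of this throttler can be illustrated without a database. In the sketch below, `Cell` is a hypothetical stand-in for a mention with a table id and grid position, and the three helpers are deliberately simplified re-implementations (same row or same column counts as aligned); the real functions in `fonduer.lf_helpers` work on the parsed data model and may behave differently.

```python
from collections import namedtuple

# Hypothetical stand-in for a mention: which table it is in, and its row/column.
Cell = namedtuple("Cell", ["table", "row", "col"])


def same_table_sketch(a, b):
    """Simplified check: both mentions sit in the same (non-null) table."""
    return a.table is not None and a.table == b.table


def is_horz_aligned_sketch(a, b):
    """Simplified check: sharing a row counts as horizontal alignment."""
    return a.row == b.row


def is_vert_aligned_sketch(a, b):
    """Simplified check: sharing a column counts as vertical alignment."""
    return a.col == b.col


def aligned_filter(candidate):
    """Keep a candidate unless both mentions share a table but are not aligned."""
    first, second = candidate
    if same_table_sketch(first, second):
        return is_horz_aligned_sketch(first, second) or is_vert_aligned_sketch(first, second)
    return True


same_row = (Cell(table=1, row=2, col=0), Cell(table=1, row=2, col=3))
diagonal = (Cell(table=1, row=0, col=0), Cell(table=1, row=2, col=3))
other_table = (Cell(table=1, row=0, col=0), Cell(table=2, row=5, col=5))
print(aligned_filter(same_row), aligned_filter(diagonal), aligned_filter(other_table))
```

Only the diagonal pair is discarded: it shares a table with its partner but neither a row nor a column.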

### Running the `CandidateExtractor`

Now we have all the components necessary to perform candidate extraction. We have defined the "space" of things to consider for each candidate, provided matchers that signal when a valid mention is seen, and written a throttler that prunes away excess candidates. We can now define the `CandidateExtractor` with the contexts to extract from, the matchers, and the throttler to use.
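
Putting the pieces together, candidate extraction can be thought of as: enumerate spans, keep those accepted by each matcher, form all pairs, and drop pairs rejected by the throttler. The sketch below mimics this on plain strings with `itertools.product`; every name in it is hypothetical, only the simpler of the two lab reference patterns is used, the throttler is a trivial placeholder, and the actual `CandidateExtractor` additionally handles documents, database storage, and parallelism.

```python
import itertools
import re

cata_rgx = r'(A|AM|B|BR|C|G|GD|IN|IR|L|LW|M|ML|PT|Q|SM|SS|ST|SV|SW|TC|TX|ZB)[0-9]{1,3}'
labref_rgx = r'[0-9]{4,5}'


def extract_pairs(words, throttler=lambda pair: True):
    """Pair every word matching cata_rgx with every word matching labref_rgx."""
    catas = [w for w in words if re.fullmatch(cata_rgx, w)]
    labrefs = [w for w in words if re.fullmatch(labref_rgx, w)]
    return [pair for pair in itertools.product(catas, labrefs) if throttler(pair)]


# Invented sample text, split into single words (the n_max=1 case).
words = "Sample AM12 analysed under 12345 and ZB7 under 67890".split()
pairs = extract_pairs(words)
print(pairs)

# A placeholder throttler that keeps only pairs whose first element is "AM12".
throttled = extract_pairs(words, throttler=lambda pair: pair[0] == "AM12")
print(throttled)
```

Two matches per argument already yield four pairs, which is why throttlers matter: the number of candidates grows multiplicatively with the number of matched spans.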

In [55]:
from fonduer import CandidateExtractor


candidate_extractor = CandidateExtractor(Cata_Labref, 
                        [cata_ngrams, labref_ngrams], 
                        [cata_matcher, labref_matcher], 
                        candidate_filter=candidate_filter)

%time candidate_extractor.apply(train_docs, split=0, parallelism=PARALLEL)

Process CandidateExtractorUDF-29:
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/session.py", line 2380, in _flush
    transaction.rollback(_capture_exception=True)
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/snorkel/udf.py", line 170, in run
    self.session.commit()
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/session.py", line 943, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/session.py", line 467, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/session.py", line 447, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/session.py", line 2254, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/langhelpers.py

KeyboardInterrupt: 

Process CandidateExtractorUDF-32:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/snorkel/udf.py", line 160, in run
    for y in self.apply(x, **self.apply_kwargs):
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/candidates.py", line 93, in apply
    for tc in self.matchers[i].apply(self.candidate_spaces[i].apply(self.session, context)):
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/snorkel/matchers.py", line 75, in apply
    for c in candidates:
  File "/home/weijiechen1994/Projects/Python/fonduer/tutorials/zeugma/zeugma_space.py", line 185, in apply
    for ts in OmniNgrams.apply(self, session, context):
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/candidates.py", line 160, in apply
    doc = session.query(Document).filter(Document.id == context.id).one()
  File "/usr/local/lib/python3.5/dist-package

  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/lf_helpers.py", line 162, in same_table
    for i in range(len(c))))
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/lf_helpers.py", line 162, in <genexpr>
    for i in range(len(c))))
  File "/home/weijiechen1994/Projects/Python/fonduer/fonduer/models/context.py", line 202, in is_tabular
    return self.table is not None
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/attributes.py", line 242, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/attributes.py", line 599, in get
    value = self.callable_(state, passive)
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/strategies.py", line 623, in _load_for_state
    return self._emit_lazyload(session, state, ident_key, passive)
  File "<string>", line 1, in <lambda>
  File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/orm/strategies.py", line 710, in _em

Here we specified that these `Candidates` belong to the training set by specifying `split=0`; recall that we're referring to train/dev/test as splits 0/1/2.

In [56]:
train_cands = session.query(Cata_Labref).filter(Cata_Labref.split == 0).all()
print("Number of candidates:", len(train_cands))

Number of candidates: 0
