# Extracting Transistor Electrical Characteristics

In this notebook, we seek to extract

- maximum storage temperature
- minimum storage temperature
- polarity
- maximum collector emitter voltage
- maximum emitter base voltage
- maximum collector current
- total device dissipation
- minimum dc gain



In [1]:
%load_ext autoreload
%autoreload 1
%matplotlib inline
import os
import sys
import logging

# To allow importing from the general utils
sys.path.insert(0, "..")

# Configure logging for Fonduer
logging.basicConfig(
    stream=sys.stdout,
    format="[%(levelname)s] %(name)s:%(lineno)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# See https://docs.python.org/3/library/os.html#os.cpu_count
PARALLEL = len(os.sched_getaffinity(0))
COMPONENT = "transistors"
conn_string = "postgresql://localhost:5432/" + COMPONENT

In [2]:
# If you've run this before, set FIRST_TIME to False to save time
FIRST_TIME = True

In [3]:
from fonduer import Meta

session = Meta.init(conn_string).Session()

[INFO] fonduer.meta:86 - Connecting user:None to localhost:5432/transistors
[INFO] fonduer.meta:110 - Initializing the storage schema


In [4]:
from utils import parse_dataset
docs, train_docs, dev_docs, test_docs = parse_dataset(session, first_time=False, parallel=PARALLEL)
logger.info(f"# of train Documents: {len(train_docs)}")
logger.info(f"# of dev Documents: {len(dev_docs)}")
logger.info(f"# of test Documents: {len(test_docs)}")

[INFO] utils.utils:60 - Reloading pre-parsed dataset.
[INFO] __main__:3 - # of train Documents: 100
[INFO] __main__:4 - # of dev Documents: 100
[INFO] __main__:5 - # of test Documents: 75


In [5]:
from fonduer.parser.models import Document, Section, Paragraph, Sentence, Figure

logger.info(f"Documents: {session.query(Document).count()}")
logger.info(f"Sections: {session.query(Section).count()}")
logger.info(f"Paragraphs: {session.query(Paragraph).count()}")
logger.info(f"Sentences: {session.query(Sentence).count()}")
logger.info(f"Figures: {session.query(Figure).count()}")

[INFO] __main__:3 - Documents: 275
[INFO] __main__:4 - Sections: 275
[INFO] __main__:5 - Paragraphs: 136992
[INFO] __main__:6 - Sentences: 141638
[INFO] __main__:7 - Figures: 7440


# Phase 2: Mention Extraction, Candidate Extraction, and Multimodal Featurization

Given the unified data model from Phase 1, `Fonduer` extracts relation
candidates based on user-provided **matchers** and **throttlers**. Then,
`Fonduer` leverages the multimodality information captured in the unified data
model to provide multimodal features for each candidate.

## 2.1 Mention Extraction

The first step is to extract **mentions** from our corpus. A `mention` is the
type of object which makes up a `candidate`. For example, if we wanted to
extract pairs of transistor part numbers and their corresponding maximum
storage temperatures, the transistor part number would be one `mention` while
the temperature value would be another. These `mention`s are then combined to
create `candidates`, where our task is to predict which `candidates` are true
in the associated document.

We start by defining and naming our five `mention` types:

In [6]:
from fonduer.candidates.models import mention_subclass

Part = mention_subclass("Part")
StgTempMin = mention_subclass("StgTempMin")
StgTempMax = mention_subclass("StgTempMax")
Polarity = mention_subclass("Polarity")
CeVMax = mention_subclass("CeVMax")

In [7]:
from transistor_matchers import get_matcher
stg_temp_min_matcher = get_matcher("stg_temp_min")
stg_temp_max_matcher = get_matcher("stg_temp_max")
polarity_matcher = get_matcher("polarity")
ce_v_max_matcher = get_matcher("ce_v_max")
part_matcher = get_matcher("part")

These five matchers define each entity in our relation schemas.
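To give a feel for what a matcher does, here is a minimal sketch that treats a matcher as a yes/no test on a text span. The names `POLARITY_RE`, `TEMP_RE`, and `matches` are hypothetical, and the temperature pattern is an assumption made for illustration; the real matchers returned by `get_matcher` live in `transistor_matchers` and may apply different rules.

```python
import re

# Hypothetical stand-ins for illustration only; not the matchers from
# transistor_matchers.
POLARITY_RE = re.compile(r"NPN|PNP")     # polarities are "NPN" or "PNP"
TEMP_RE = re.compile(r"[+-]?\d{1,3}")    # assumed: optionally signed, 1-3 digits


def matches(pattern, span):
    """Return True if the whole span satisfies the pattern."""
    return pattern.fullmatch(span) is not None


print(matches(POLARITY_RE, "NPN"))   # a polarity span
print(matches(TEMP_RE, "-65"))       # a temperature-like span
print(matches(TEMP_RE, "BC546"))     # a part number is rejected
```

A span that passes the matcher becomes a mention; everything else is dropped before candidates are formed.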

### Define a Mention's `MentionSpace`

Next, in order to define the "space" of all mentions that are even considered
from the document, we need to define a `MentionSpace` for each component of the
relation we wish to extract. Fonduer provides a default `MentionSpace` for you
to use, but you can also extend the default `MentionSpace` depending on your
needs.

In the case of transistor part numbers, the `MentionSpace` can be quite complex
due to the need to handle implicit part numbers that are implied in text like
"BC546A/B/C...BC548A/B/C", which refers to 9 unique part numbers. To handle
these, we consider all n-grams up to 3 words long.

In contrast, the `MentionSpace` for temperature values is simpler: we only need
to process different Unicode representations of the minus sign (`-`), and we
don't need to look at more than two words at a time.

When no special preprocessing like this is needed, we can use the default
`MentionNgrams` class provided by `fonduer`. For example, polarities only take
the form of "NPN" or "PNP", so `MentionNgrams(n_max=1)` is sufficient to match
them.
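The ideas behind these mention spaces can be sketched in plain Python. The helpers below (`normalize_minus`, `word_ngrams`, `expand_suffixes`) are hypothetical and are not the implementation in `transistor_spaces`. They only illustrate enumerating n-grams, normalizing minus signs, and expanding a suffix list such as "BC546A/B/C". Expanding a range like "BC546...BC548" is left out.

```python
import re

# Hypothetical helpers for illustration; not part of Fonduer or transistor_spaces.
# Unicode hyphen, dash, and minus characters that may stand in for "-".
MINUS_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2212"


def normalize_minus(text):
    """Map Unicode dash and minus characters to an ASCII hyphen-minus."""
    return re.sub(f"[{MINUS_VARIANTS}]", "-", text)


def word_ngrams(tokens, n_max):
    """Yield every contiguous span of 1..n_max tokens, joined by spaces."""
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def expand_suffixes(part):
    """Expand a suffix list, e.g. 'BC546A/B/C', into individual part numbers."""
    match = re.fullmatch(r"([A-Z0-9]+?)([A-Z])((?:/[A-Z])+)", part)
    if match is None:
        return [part]  # nothing to expand
    base, first, rest = match.groups()
    suffixes = [first] + rest.strip("/").split("/")
    return [base + suffix for suffix in suffixes]


print(expand_suffixes("BC546A/B/C"))
print(list(word_ngrams(normalize_minus("\u221265 to 150").split(), 2)))
```

Keeping `n_max` small matters because the number of spans grows with it, and every span is later tested by a matcher.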

In [8]:
from fonduer.candidates import MentionNgrams
from transistor_spaces import MentionNgramsPart, MentionNgramsTemp, MentionNgramsVolt
    
part_ngrams = MentionNgramsPart(parts_by_doc=None, n_max=3)
temp_ngrams = MentionNgramsTemp(n_max=2)
volt_ngrams = MentionNgramsVolt(n_max=1)
polarity_ngrams = MentionNgrams(n_max=1)

### Running Mention Extraction 

Next, we create a `MentionExtractor` to extract the mentions from all of
our documents based on the `MentionSpace` and matchers we defined above.

View the API for the MentionExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/latest/user/candidates.html#fonduer.candidates.MentionExtractor).


In [9]:
from fonduer.candidates import MentionExtractor

mention_extractor = MentionExtractor(
    session,
    [Part, StgTempMin, StgTempMax, Polarity, CeVMax],
    [part_ngrams, temp_ngrams, temp_ngrams, polarity_ngrams, volt_ngrams],
    [
        part_matcher,
        stg_temp_min_matcher,
        stg_temp_max_matcher,
        polarity_matcher,
        ce_v_max_matcher,
    ],
)

Then, we run the extractor on all of our documents.

In [11]:
from fonduer.candidates.models import Mention

if FIRST_TIME:
    mention_extractor.apply(docs, parallelism=PARALLEL)

logger.info(f"Total Mentions: {session.query(Mention).count()}")
logger.info(f"Total Part: {session.query(Part).count()}")
logger.info(f"Total StgTempMin: {session.query(StgTempMin).count()}")
logger.info(f"Total StgTempMax: {session.query(StgTempMax).count()}")
logger.info(f"Total Polarity: {session.query(Polarity).count()}")
logger.info(f"Total CeVMax: {session.query(CeVMax).count()}")

[INFO] __main__:6 - Total Mentions: 21286
[INFO] __main__:7 - Total Part: 5940
[INFO] __main__:8 - Total StgTempMin: 3438
[INFO] __main__:9 - Total StgTempMax: 3438
[INFO] __main__:10 - Total Polarity: 1475
[INFO] __main__:11 - Total CeVMax: 3557


## 2.2 Candidate Extraction

Now that we have both defined and extracted the Mentions that can be used to compose Candidates, we are ready to move on to extracting Candidates. As we did with the Mentions, we first define what each candidate schema looks like. Each candidate is composed of a `Part` mention and one of the attribute mentions we defined above; for example, `PartStgTempMax` pairs a `Part` with a `StgTempMax`. The rest of this notebook focuses on the maximum storage temperature relation, which the later cells refer to as `PartTemp`.
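Conceptually, candidates are formed by pairing mentions of each type that come from the same document. The sketch below shows that pairing with a hypothetical `SimpleMention` tuple and `make_candidates` helper; it assumes pairing is done per document and does not reproduce how `CandidateExtractor` works internally.

```python
from collections import namedtuple
from itertools import product

# Hypothetical simplified mention: the document it came from and its text.
SimpleMention = namedtuple("SimpleMention", ["doc", "text"])


def make_candidates(parts, temps):
    """Pair each part mention with each temperature mention in the same document."""
    return [(p, t) for p, t in product(parts, temps) if p.doc == t.doc]


parts = [SimpleMention("doc1", "BC546"), SimpleMention("doc2", "BC547")]
temps = [
    SimpleMention("doc1", "150"),
    SimpleMention("doc1", "-65"),
    SimpleMention("doc2", "150"),
]
for part, temp in make_candidates(parts, temps):
    print(part.doc, part.text, temp.text)
```

Because every part is paired with every value, the number of candidates grows quickly, which is why pruning matters.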

In [None]:
from fonduer.candidates.models import candidate_subclass

PartStgTempMin = candidate_subclass("PartStgTempMin", [Part, StgTempMin])
PartStgTempMax = candidate_subclass("PartStgTempMax", [Part, StgTempMax])
PartPolarity = candidate_subclass("PartPolarity", [Part, Polarity])
PartCeVMax = candidate_subclass("PartCeVMax", [Part, CeVMax])

# The remaining cells extract maximum storage temperature and refer to that
# candidate class as PartTemp.
PartTemp = PartStgTempMax

### Defining candidate `Throttlers`

Next, we need to define **throttlers**, which allow us to further prune excess candidates and avoid unnecessarily materializing invalid candidates. Throttlers, like matchers, act as hard filters, and should be created to have high precision while maintaining complete recall, if possible.

Here, we create a throttler that discards candidates if they are in the same table, but the part and storage temperature are not vertically or horizontally aligned.
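As a rough illustration of that rule, a throttler can be thought of as a function that takes a candidate and returns `True` to keep it or `False` to discard it. The `Cell` tuple and `aligned_throttler` below are hypothetical and use simple row and column indices; the actual `stg_temp_filter` is defined in `transistor_throttlers` and may differ.

```python
from collections import namedtuple

# Hypothetical mention with table coordinates; table is None outside any table.
Cell = namedtuple("Cell", ["text", "table", "row", "col"])


def aligned_throttler(candidate):
    """Return True to keep the candidate, False to discard it."""
    part, temp = candidate
    same_table = part.table is not None and part.table == temp.table
    if not same_table:
        return True  # the rule only prunes pairs that share a table
    # Keep the pair only if it shares a row or a column.
    return part.row == temp.row or part.col == temp.col


part = Cell("BC546", table=1, row=2, col=0)
print(aligned_throttler((part, Cell("150", table=1, row=2, col=3))))  # same row
print(aligned_throttler((part, Cell("150", table=1, row=5, col=3))))  # unaligned
```

Pairs that do not share a table pass through untouched, which keeps recall high while still removing many implausible pairs.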

In [None]:
from transistor_throttlers import stg_temp_filter

temp_throttler = stg_temp_filter

### Running the `CandidateExtractor`

Now, we have all the components necessary to perform candidate extraction. We have defined the Mentions that compose each candidate and a throttler to prune away excess candidates. We can now define the `CandidateExtractor` with the candidate subclass and corresponding throttler to use.

View the API for the CandidateExtractor on [ReadTheDocs](https://fonduer.readthedocs.io/en/docstrings/user/candidates.html#fonduer.candidates.CandidateExtractor).

In [None]:
from fonduer.candidates import CandidateExtractor


candidate_extractor = CandidateExtractor(session, [PartTemp], throttlers=[temp_throttler])

When applying the extractor, we assign each set of documents to a split via the `split` argument; recall that we refer to train/dev/test as splits 0/1/2.

In [None]:
if FIRST_TIME:
    # Use a separate loop variable so the full `docs` list is not overwritten
    for i, split_docs in enumerate([train_docs, dev_docs, test_docs]):
        candidate_extractor.apply(split_docs, split=i, parallelism=PARALLEL)
        num_cands = session.query(PartTemp).filter(PartTemp.split == i).count()
        logger.info(f"Number of Candidates in split={i}: {num_cands}")

train_cands = candidate_extractor.get_candidates(split = 0)
dev_cands = candidate_extractor.get_candidates(split = 1)
test_cands = candidate_extractor.get_candidates(split = 2)

logger.info(f"Total train candidate:\t{len(train_cands[0])}")
logger.info(f"Total dev candidate:\t{len(dev_cands[0])}")
logger.info(f"Total test candidate:\t{len(test_cands[0])}")

## 2.3 Multimodal Featurization
Unlike dealing with plain unstructured text, `Fonduer` deals with richly formatted data, and consequently featurizes each candidate with a baseline library of multimodal features. 

### Featurize with `Fonduer`'s optimized Postgres Featurizer
We now annotate the candidates in our training, dev, and test sets with features. The `Featurizer` provided by `Fonduer` allows this to be done in parallel to improve performance.

View the API provided by the `Featurizer` on [ReadTheDocs](https://fonduer.readthedocs.io/en/latest/user/features.html#fonduer.features.Featurizer).

In [None]:
from fonduer.features import Featurizer

featurizer = Featurizer(session, [PartTemp])
if FIRST_TIME:
    %time featurizer.apply(split=0, train=True, parallelism=PARALLEL)
    %time featurizer.apply(split=1, parallelism=PARALLEL)
    %time featurizer.apply(split=2, parallelism=PARALLEL)

%time F_train = featurizer.get_feature_matrices(train_cands)
%time F_dev = featurizer.get_feature_matrices(dev_cands)
%time F_test = featurizer.get_feature_matrices(test_cands)

logger.info(f"Train shape:\t{F_train[0].shape}")
logger.info(f"Test shape:\t{F_test[0].shape}")
logger.info(f"Dev shape:\t{F_dev[0].shape}")

In [None]:
from transistor_utils import load_hardware_labels, TRUE, FALSE, ABSTAIN

if FIRST_TIME:
    load_hardware_labels(session, PartTemp, "stg_temp_max", annotator_name='gold')

In [None]:
from fonduer.supervision import Labeler
from transistor_lfs import stg_temp_lfs

labeler = Labeler(session, [PartTemp])
if FIRST_TIME:
    %time labeler.apply(split=0, lfs=[stg_temp_lfs], train=True, parallelism=PARALLEL)
%time L_train = labeler.get_label_matrices(train_cands)

In [None]:
from fonduer.supervision import get_gold_labels
L_gold_train = get_gold_labels(session, train_cands, annotator_name='gold')

In [None]:
from metal import analysis

analysis.lf_summary(L_train[0], lf_names=labeler.get_keys(), Y=L_gold_train[0].todense().reshape(-1,).tolist()[0])

### Fitting the Generative Model

Now, we'll train a model of the LFs to estimate their accuracies. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor. Intuitively, we'll model the LFs by observing how they overlap and conflict with each other. To do so, we use [MeTaL](https://github.com/HazyResearch/metal)'s single-task label model.
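To make "overlap" and "conflict" concrete, the sketch below computes per-LF coverage, overlap, and conflict fractions for a small dense label matrix. The function `lf_stats` is hypothetical, and the label convention (0 for abstain, 1 for true, 2 for false) is an assumption; this is not how MeTaL computes its summary or fits the label model.

```python
ABSTAIN_LABEL = 0  # assumed convention: 0 = abstain, 1 = true, 2 = false


def lf_stats(L):
    """Return (coverage, overlap, conflict) fractions for each LF.

    L is a list of rows, one per candidate; each row holds one label per LF.
    """
    n_rows, n_lfs = len(L), len(L[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in L:
            if row[j] == ABSTAIN_LABEL:
                continue
            covered += 1
            # Labels from the other LFs that did not abstain on this candidate
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN_LABEL]
            overlaps += bool(others)
            conflicts += any(v != row[j] for v in others)
        stats.append((covered / n_rows, overlaps / n_rows, conflicts / n_rows))
    return stats


L_example = [[1, 1, 0], [1, 2, 0], [0, 0, 2], [0, 0, 0]]
for name, row in zip(["lf_a", "lf_b", "lf_c"], lf_stats(L_example)):
    print(name, row)
```

An LF that often conflicts with the others is a natural first place to look when debugging a labeling function set.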

In [None]:
from metal.label_model import LabelModel

gen_model = LabelModel(k=2)
%time gen_model.train_model(L_train[0], n_epochs=500, print_every=100)

We now apply the generative model to the training candidates to get the noise-aware training label set. We'll refer to these as the training marginals:

In [None]:
train_marginals = gen_model.predict_proba(L_train[0])

We'll look at the distribution of the training marginals:

In [None]:
import matplotlib.pyplot as plt
plt.hist(train_marginals[:, TRUE - 1], bins=20)
plt.show()

We can view the learned accuracy parameters as well.

In [None]:
# gen_model.weights.lf_accuracy
L_train[0].shape

### Using the Model to Iterate on Labeling Functions

Now that we have learned the generative model, we can stop here and use this to potentially debug and/or improve our labeling function set. First, we apply the LFs to our development set:

In [None]:
labeler.apply(split=1, lfs=[stg_temp_lfs], parallelism=PARALLEL)
%time L_dev = labeler.get_label_matrices(dev_cands)

In [None]:
L_dev[0].shape

### Interpreting Generative Model Performance

At this point, we should be getting an F1 score of around 0.6 to 0.7 on the development set, which is pretty good! However, we should be very careful in interpreting this. Since we developed our labeling functions using this development set as a guide, and our generative model is composed of these labeling functions, we expect it to score very well here!

In fact, it is probably somewhat overfit to this set. However, this is fine, since in the next section we'll train a more powerful end extraction model which will generalize beyond the development set, and which we will evaluate on a blind test set (i.e., one we never looked at during development).
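For reference, the F1 score discussed here is the harmonic mean of precision and recall. A minimal sketch, using a hypothetical helper `precision_recall_f1` that takes counts of true positives, false positives, and false negatives:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, purely to show the arithmetic
print(precision_recall_f1(tp=6, fp=2, fn=4))
```

The zero checks avoid a division error when a model predicts no positives or when there are no positives to find.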


### Training the Discriminative Model

Now, we'll use the noisy training labels we generated in the last part to train our end extraction model. For this tutorial, we will be training a simple but fairly effective logistic regression model.

We use the training marginals to train a discriminative model that classifies each Candidate as a true or false mention. 
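One common way to train on probabilistic labels is to minimize the cross-entropy between the model's predicted probability and the marginal, rather than a hard 0/1 label. The sketch below shows that loss for a single candidate. The function `noise_aware_loss` is hypothetical, and whether Fonduer's `LogisticRegression` uses exactly this form is an assumption.

```python
import math


def noise_aware_loss(p_pred, p_label):
    """Cross-entropy between a predicted probability and a probabilistic label."""
    eps = 1e-12
    p_pred = min(max(p_pred, eps), 1 - eps)  # avoid log(0)
    return -(p_label * math.log(p_pred) + (1 - p_label) * math.log(1 - p_pred))


# With a marginal of 0.7, predicting 0.7 costs less than an overconfident 0.99
print(noise_aware_loss(0.7, 0.7))
print(noise_aware_loss(0.99, 0.7))
```

With this loss, a candidate whose marginal is near 0.5 pulls the model only weakly in either direction, while confident marginals behave much like ordinary labels.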

In [None]:
from fonduer.learning import LogisticRegression

disc_model = LogisticRegression()
%time disc_model.train((train_cands[0], F_train[0]), train_marginals, n_epochs=50, lr=0.001)

In [None]:
import numpy as np
from transistor_utils import entity_level_f1
import pickle
pickle_file = 'data/parts_by_doc_dict.pkl'
with open(pickle_file, 'rb') as f:
    parts_by_doc = pickle.load(f)

Now, we score using the discriminative model:

In [None]:
test_score = disc_model.predict((test_cands[0], F_test[0]), b=0.6, pos_label=TRUE)
true_pred = [test_cands[0][_] for _ in np.nditer(np.where(test_score == TRUE))]
%time (TP, FP, FN) = entity_level_f1(true_pred, "stg_temp_max", test_docs, parts_by_doc=parts_by_doc)

We can see that there are actually only a few documents that are causing us problems. In particular, we see that `BC546-D` is giving us many false positives. So, let's inspect one of those candidates. 

In [None]:
from fonduer.utils.visualizer import Visualizer
from transistor_utils import entity_to_candidates
vis = Visualizer("data/test/pdf")

# Get a list of candidates that match the FP[1] entity
fp_cands = entity_to_candidates(FP[1], test_cands[0])
# Display a candidate
vis.display_candidates([fp_cands[0]])