# Snorkel Database Initialization

Snorkel keeps all of its data in a database. In this notebook we initialize the database we are working with. 
We import two annotated datasets:
- **Train set**, used to create the rule set for the generative model
- **Test set**, used to evaluate the quality of the rules (the `devel` part of the split file)

We also import the new data that Snorkel is to tag next. 

First we create a Snorkel session, through which we access the database, and define the candidate schema (`software`) that we want to extract. 

We use Postgres as the database management system, so the database has to be initialized in Postgres before Snorkel can find it. 
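
As a minimal sketch of the connection setup, the helper below builds the connection URL in the same form the notebook assigns to `SNORKELDB` and checks its parts before exporting it. The function name `snorkel_db_url` is hypothetical and not part of Snorkel; the user, password, and host defaults are simply the values used in this notebook. The sketch does not create the database, which still has to exist in Postgres beforehand.

```python
import os
from urllib.parse import urlsplit


def snorkel_db_url(database, user="snorkel", password="snorkel", host="localhost"):
    """Build a Postgres connection URL in the form this notebook uses for SNORKELDB.

    Hypothetical helper for illustration; the defaults mirror the notebook.
    """
    return "postgres://{}:{}@{}/{}".format(user, password, host, database)


url = snorkel_db_url("sosci_ssc_0")

# Sanity-check the pieces of the URL before handing it to Snorkel.
parts = urlsplit(url)
assert parts.scheme == "postgres"
assert parts.hostname == "localhost"
assert parts.path.lstrip("/") == "sosci_ssc_0"

# The notebook sets the variable before importing snorkel, so keep that order.
os.environ["SNORKELDB"] = url
```

Building the URL in one place makes it harder for the database name and the data directory to drift apart when the same steps are run on several fractions of the data.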

In [None]:
%load_ext autoreload
%autoreload 2
# %matplotlib inline
import os
import pickle
import json
import random

from glob import glob
from shutil import copy

BASE_NAME = 'sosci_ssc_0' 
DATABASE_NAME = 'sosci_ssc_0' 
LABELS_NAME = 'sosci_ssc_annotation' 
os.environ['SNORKELDB'] = 'postgres://snorkel:snorkel@localhost/' + DATABASE_NAME

from snorkel import SnorkelSession
from snorkel.models import candidate_subclass

session = SnorkelSession()
software = candidate_subclass('software', ['software'])

set_mapping = {
    'train': 0, 
    'test': 1,
    'new': 2
}

Next we select all documents we want to parse, parse them with spaCy, and set up the candidate extractor.

In [None]:
%%time
from snorkel.parser import TextDocPreprocessor, CorpusParser
from snorkel.parser.spacy_parser import Spacy
from snorkel.models import Document, Sentence
from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import RegexMatchEach, RegexMatchSpan

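# Candidate space: every n-gram of up to 6 tokens.
ngrams_one = Ngrams(n_max=6)
# Catch-all matcher: the regex '.*' accepts every token, so no n-gram is filtered out.
software_matcher = RegexMatchEach(rgx=r'.*', longest_match_only=False)
cand_extractor = CandidateExtractor(software, [ngrams_one], [software_matcher])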

doc_preprocessor = TextDocPreprocessor('../data/{}/'.format(BASE_NAME))  
corpus_parser = CorpusParser(parser=Spacy())
corpus_parser.apply(doc_preprocessor, parallelism=60)
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())

docs = session.query(Document).order_by(Document.name).all()

Now we build all possible n-grams on the data, up to a maximum n-gram length of 6, as candidates. The spaCy features that are built into Snorkel are already calculated during this initialization.
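
To make the size of this candidate space concrete, here is a small, self-contained sketch that enumerates all contiguous n-grams of a tokenized sentence up to a maximum length. It is a simplified illustration and not Snorkel's `Ngrams` implementation; the function name `ngram_spans` and the example sentence are made up for this purpose.

```python
def ngram_spans(tokens, n_max=6):
    """Yield (start, end, text) for every contiguous n-gram with 1 <= n <= n_max.

    `start` is inclusive and `end` is exclusive, both as token indices.
    """
    for start in range(len(tokens)):
        for n in range(1, n_max + 1):
            end = start + n
            if end > len(tokens):
                break  # the n-gram would run past the end of the sentence
            yield start, end, " ".join(tokens[start:end])


# Hypothetical example sentence, already split into tokens.
tokens = ["We", "used", "SPSS", "for", "analysis"]
spans = list(ngram_spans(tokens, n_max=3))
print(len(spans))  # 5 unigrams + 4 bigrams + 3 trigrams = 12
```

Because the matcher above accepts everything, the number of candidates grows roughly with the sentence length times the maximum n-gram length, which is one reason the extraction is run in parallel.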

In [None]:
%%time
with open('../data/sosci_train_dev_test_split.json', 'r') as sosci_data_json:
    train_dev_test_split = json.load(sosci_data_json)

train_sents = set()
dev_sents = set()
new_sents = set()

for doc in docs:
    if doc.name.startswith('sent'):
        # Documents named 'sent...' are assigned according to the split file.
        # Documents listed in neither 'train' nor 'devel' are skipped.
        if doc.name in train_dev_test_split['train']:
            train_sents.update(doc.sentences)
        elif doc.name in train_dev_test_split['devel']:
            dev_sents.update(doc.sentences)
    else:
        # All other documents are treated as new data.
        new_sents.update(doc.sentences)

print("Working on {} training samples".format(len(train_sents)))
print("and on {} testing samples.".format(len(dev_sents)))
print("The set of new sentences contains {} sentences.".format(len(new_sents)))

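# Extract candidates per split; the split index is taken from set_mapping (train=0, test=1, new=2).
for split_name, sents in [('train', train_sents), ('test', dev_sents), ('new', new_sents)]:
    split = set_mapping[split_name]
    cand_extractor.apply(sents, split=split, parallelism=60)
    print("Number of candidates:", session.query(software).filter(software.split == split).count())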

The last step is to import the annotated labels, which we do with an adjusted version of Snorkel's BRAT importer.

In [None]:
%%time
from util.brat_import import BratAnnotator

brat = BratAnnotator(session, software, encoding='utf-8')
# Gold labels exist only for the annotated splits (train and test), so the new data is excluded.
train_cands = session.query(software).filter(software.split != set_mapping['new']).all()
brat.import_gold_labels(session, "../data/{}/".format(LABELS_NAME), train_cands)

Because we split up the files, this process has to be run on each individual fraction of the data. We therefore also wrote a script that performs the same operations on a variable input to automate this process.  
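
The script itself is not shown here. As a rough sketch of how the fixed names at the top of this notebook could be turned into variable input, the following uses `argparse`. All option names (`--base-name`, `--database-name`, `--labels-name`, `--parallelism`) and the function names are hypothetical, and defaulting the database name to the base name is an assumption based on the two being identical in this notebook.

```python
import argparse


def build_arg_parser():
    """Command-line options standing in for the constants defined in the notebook."""
    parser = argparse.ArgumentParser(
        description="Initialize a Snorkel database for one fraction of the data.")
    parser.add_argument("--base-name", required=True,
                        help="name of the data directory under ../data/")
    parser.add_argument("--database-name", default=None,
                        help="Postgres database name (assumed default: the base name)")
    parser.add_argument("--labels-name", default="sosci_ssc_annotation",
                        help="name of the directory holding the BRAT annotations")
    parser.add_argument("--parallelism", type=int, default=1,
                        help="number of parallel workers for parsing and extraction")
    return parser


def resolve_config(argv):
    """Parse the arguments and derive the paths and connection string used above."""
    args = build_arg_parser().parse_args(argv)
    database_name = args.database_name or args.base_name
    return {
        "data_dir": "../data/{}/".format(args.base_name),
        "labels_dir": "../data/{}/".format(args.labels_name),
        "snorkel_db": "postgres://snorkel:snorkel@localhost/" + database_name,
        "parallelism": args.parallelism,
    }


# Example: configuration for a hypothetical second fraction of the data.
config = resolve_config(["--base-name", "sosci_ssc_1", "--parallelism", "8"])
print(config["snorkel_db"])
```

With a configuration like this, the cells above could be wrapped in a function and called once per fraction, with the `SNORKELDB` variable set from `config["snorkel_db"]` before Snorkel is imported.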