# Context Labeling Functions

Here we create labeling functions based on the context surrounding a word. We can build on prior knowledge from existing studies, namely all rule-based (and possibly some supervised) methods for software extraction that we identified in the review. The rules they use are exactly what is needed to create the weak supervision labels.
Combining this set of rules can serve as a starting point.

But first, let's set up the standard environment as always. For the context we can mainly rely on the practical **Helper Functions** provided by Snorkel.

**Important**: the `test_LF` function is not imported, because it has hard coded queries and does not evaluate the results in a meaningful way.

We also load the development set we use for evaluation purposes, so we do not have to calculate it more than once. 

We import **spacy** mainly because it lets us identify stop words and Snorkel is already built with spacy.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import re
import string
import math
import spacy
import numpy as np
import pandas as pd

from glob import glob
from shutil import copy
from functools import partial, update_wrapper 

BASE_NAME = 'Snorkel/SSC_0' 
DATABASE_NAME = 'SSC_0' 
LABELS_NAME = 'Snorkel/SSC_annotation' 
PARALLELISM = 1
# os.environ['SNORKELDB'] = 'postgres://snorkel:snorkel@localhost/' + DATABASE_NAME
# PARALLELISM = 64
os.environ['SNORKELDB'] = 'sqlite:///' + DATABASE_NAME + '.db'

from snorkel import SnorkelSession
from snorkel.models import candidate_subclass
from snorkel.annotations import load_gold_labels
from snorkel.learning.utils import MentionScorer
from snorkel.viewer import SentenceNgramViewer

set_mapping = {
    'train': 0, 
    'test': 1,
    'new': 2
}

from snorkel.lf_helpers import (
    contains_token, get_between_tokens, get_doc_candidate_spans,
    get_left_tokens, get_matches, get_right_tokens, 
    get_sent_candidate_spans, get_tagged_text, get_text_between, 
    get_text_splits, is_inverted
)

session = SnorkelSession()
software = candidate_subclass('software', ['software'])
devel_gold_labels = load_gold_labels(session, annotator_name='gold', split=set_mapping['train'])

test_cands = session.query(software).filter(software.split==set_mapping['train']).all()
test_labels = load_gold_labels(session, annotator_name="gold", split=set_mapping['train'])
scorer = MentionScorer(test_cands, test_labels)

## Context Functions

Getting started:
In the end, it comes down to two articles that really deal with extracting software mentions via rules.
Here is the starting information they provide us with:

### Pan et al.'s Assessing the Impact of Software on Science: most frequently extracted patterns
- `use <> software`
- `perform use <>`
- `be perform use <>`
- `analysis be perform use <>`
- `analyze use <>`
- `analysis be perform with <>`
- `<> statistical software`
- `<> software be use`
- `quantify use <>`
- `be calculate use <>`

Those rules all require lemmatization.
All of them need to be implemented separately.
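
Although each rule gets its own labeling function, most of them boil down to comparing the lemmas directly left of a candidate with a fixed pattern. The following is a minimal sketch of that comparison on plain Python lists. The helper name `matches_left_context` and the hand-lemmatized example sentence are assumptions for illustration and not part of Snorkel.

```python
def matches_left_context(left_lemmas, pattern):
    """Return True if the lemmas left of a candidate end with `pattern`."""
    pattern = list(pattern)
    if not pattern:
        return True
    # Compare only the tokens directly adjacent to the candidate.
    return list(left_lemmas)[-len(pattern):] == pattern


# Hand-lemmatized version of "All processing was performed using <ScopeWin>"
left = ['all', 'processing', 'be', 'perform', 'use']

print(matches_left_context(left, ['perform', 'use']))        # rule 2
print(matches_left_context(left, ['be', 'perform', 'use']))  # rule 3
print(matches_left_context(left, ['analyze', 'use']))        # rule 5
```

Working on lemmas is what lets a single pattern such as `perform use` cover 'performed using', 'performs using' and similar surface forms.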

#### Implementing and testing

##### All matches strategy
Apparently Snorkel does in fact match all candidates, not just the longest or shortest one. The problem becomes apparent with the second rule: for the sentence 'All processing was performed using ScopeWin and Matlab software.' we match 'ScopeWin', 'ScopeWin and', 'ScopeWin and Matlab', ..., where only the first is a true positive and all others are negatives that must not be labeled as software. This means we have to look at the individual tokens and determine whether too much was matched. The strategy was designed with rule 2 in mind; a match is discarded if it contains any of the following:
- stop words
- numbers (versions; regex: `(\d+\.)?(\d+\.)?(\*|\d+)`)
- version statements
- pretty much any punctuation
- other likely keywords (this corresponds to head nouns, since we do not want to match them), so software, program, toolbox, etc., but also others such as method, procedure, etc.
- our old friend 'statistical'
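
The filter criteria listed above can be sketched as a single predicate over the tokens of a candidate. This is only an illustration under stated assumptions: the stop word set is a tiny stand-in for spaCy's `STOP_WORDS`, the keyword set is abbreviated, and the anchored version regex is a simplified variant rather than the one used in the labeling functions below.

```python
import re
import string

# Assumption: small stand-in lists; the notebook uses spaCy's stop words.
STOPWORDS = {'the', 'and', 'a', 'of', 'with'}
HEAD_WORDS = {'software', 'program', 'toolbox', 'method', 'procedure',
              'statistical', 'version'}
# Illustrative, anchored pattern for tokens such as "2", "2.0" or "v2.0.1".
VERSION = re.compile(r'^(v\.?)?\d+(\.\d+){0,2}$')


def is_clean_candidate(tokens):
    """Return False as soon as one token suggests that too much was matched."""
    for tok in (t.lower() for t in tokens):
        if tok in STOPWORDS or tok in HEAD_WORDS:
            return False
        if all(ch in string.punctuation for ch in tok):
            return False
        if VERSION.match(tok):
            return False
    return True


for cand in (['ScopeWin'], ['ScopeWin', 'and'], ['ScopeWin', 'and', 'Matlab']):
    print(cand, is_clean_candidate(cand))
```

For the example sentence only `['ScopeWin']` passes, because the longer spans contain the stop word 'and'.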

A similar problem presents itself with rule one. Since we have a left and a right context, everything in between is matched. This means that for 'we used the Matlab software' it will match 'the Matlab' instead of just 'Matlab'. However, we have to make sure that our matches are as accurate as possible. There are a number of solutions to this problem:
- do not consider matches containing stop words; on the other hand, we then have to view a larger context and remove stop words from it in order to still catch the right candidate.
- aside from stop words, 'statistical' is often mentioned before 'software'; this should also be considered.
- we can also add exclusion cues that reduce the number of false positives, e.g. if the mention itself contains 'computer', 'custom' or other similar words, it is likely that software is referred to in general rather than a specific one, and we can exclude that case.
- one **very important** rule is that everything we take out of the middle match has to be added to the left or right context in order to achieve consistent matches (except things we want to exclude in general). The question is how big the right context is supposed to be. A possible strategy would be to look at the words and see how much we would exclude; this is, however, not doable in Snorkel. So we have to define a maximum right context or **expand it based on the actual match**.
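
The idea of expanding the right context based on the actual match can be sketched on a plain list of lemmas: skip tokens that may legitimately sit between the software name and its context (punctuation, version statements, 'statistical'), up to a maximum number of skips, and return the first tokens that remain. The names `next_meaningful_tokens` and `SKIPPABLE` are hypothetical, and the version regex is a simplified, anchored assumption.

```python
import re
import string

# Assumption: abbreviated list of tokens that may be skipped.
SKIPPABLE = {'statistical', 'version', 'v', 'v.', 'ver.'}
VERSION = re.compile(r'^(v\.?)?\d+(\.\d+){0,2}$')


def next_meaningful_tokens(right_lemmas, context=1, max_skip=4):
    """Skip up to `max_skip` ignorable tokens, then return `context` tokens."""
    for i, tok in enumerate(right_lemmas):
        if i > max_skip:
            return []
        if tok in string.punctuation or tok in SKIPPABLE or VERSION.match(tok):
            continue
        return right_lemmas[i:i + context]
    return []


# Lemmas right of "Matlab" in "Matlab (version 7.1) software was used"
right = ['(', 'version', '7.1', ')', 'software', 'be', 'use']
print(next_meaningful_tokens(right))             # ['software']
print(next_meaningful_tokens(right, context=3))  # ['software', 'be', 'use']
```

The returned tokens can then be compared with a rule's right context such as `['software']` or `['software', 'be', 'use']`.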

##### Individual Rules - Top 1
- Top 1: Of course the rule does have a quite low recall, because it just considers a specific mentioning context (and this is also going to be true for all other rules in this scope) but it actually has a pretty good precision.

In [None]:
import spacy
spacy_nlp = spacy.load('en')
stopwords = spacy.lang.en.stop_words.STOP_WORDS
stopwords_left_context = spacy.lang.en.stop_words.STOP_WORDS
print(stopwords)

In [None]:
def dynamic_growing_right_context(c, max_size=4):
    version_number = re.compile(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)')
    top1_head_words = ['statistical', 
                       'method', 
                       'procedure', 
                       'kit', 
                       'version',
                       'Version',
                       'v',
                       'V',
                       'v.',
                       'V.',
                       'ver.']
    right_context = [x for x in get_right_tokens(c, window=max_size+1, attrib="lemmas")]
    for i,token in enumerate(right_context):
        if i == max_size-1:
            return right_context[max_size:max_size+1]
        if (token in string.punctuation or 
            token in top1_head_words or
            version_number.match(token)):
            pass
        else:
            return right_context[i:i+1]
        
def LF_pan_top_1(c, stopwords):
    '''use <> software'''
    version_number = re.compile(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)')
    top1_head_words = ['statistical', 
                       'method', 
                       'procedure', 
                       'kit', 
                       'version',
                       'Version',
                       'v',
                       'V',
                       'v.',
                       'V.',
                       'ver.']
    left_context = ['use']
    right_context = ['software']
    tokens = [x.lower() for x in c[0].get_attrib_tokens()]
    
    if tokens[0] in stopwords or 'computer' in tokens or 'custom' in tokens or 'and' in tokens or tokens[-1] in ['statistical']:
        return -1 
    for tok in tokens:
        if tok in string.punctuation or tok in top1_head_words or version_number.match(tok):
            return -1
    left_win_1 = [x for x in get_left_tokens(c, window=1, attrib="lemmas")]
    left_win_2 = [x for x in get_left_tokens(c, window=2, attrib="lemmas")]
    if len(left_win_2) > 0 and left_win_2[-1] in stopwords:
        left_features = left_win_2[:-1]
    else:
        left_features = left_win_1
    right_features = dynamic_growing_right_context(c)
    if not right_features or len(left_context) != len(left_features) or len(right_context) != len(right_features):
        return 0
    for cont,feat in zip(left_context, left_features):
        if cont != feat:
            return 0
    for cont,feat in zip(right_context, right_features):
        if cont != feat:
            return 0
    return 1

In [None]:
spacy_nlp = spacy.load('en')
stopwords = spacy.lang.en.stop_words.STOP_WORDS
stopwords_left_context = spacy.lang.en.stop_words.STOP_WORDS

lf = partial(LF_pan_top_1, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)
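
A note on the expression `0.5 * (lf(c) + 1)` used in all test cells: it maps the labeling function outputs -1, 0 and 1 to the pseudo-marginals 0.0, 0.5 and 1.0, so an abstain lands exactly at 0.5. How values exactly at the threshold are scored is then governed by the `set_at_thresh_as_neg` argument. A minimal sketch in plain Python (the helper name `labels_to_marginals` is an assumption):

```python
def labels_to_marginals(labels):
    """Map labeling function outputs {-1, 0, 1} to {0.0, 0.5, 1.0}."""
    return [0.5 * (label + 1) for label in labels]


print(labels_to_marginals([-1, 0, 1]))  # [0.0, 0.5, 1.0]
```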

##### Individual Rules - Top 2

In [None]:
def LF_pan_top_2(c,stopwords):
    '''perform use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['perform', 'use']
    left_features = [x for x in get_left_tokens(c, window=2, attrib="lemmas")]
    #left_win_2 = [x for x in get_left_tokens(c, window=2, attrib="lemmas")]
    #left_win_3 = [x for x in get_left_tokens(c, window=3, attrib="lemmas")]
    #if len(left_win_3) > len(left_win_2) and left_win_3[-1] in stopwords:
    #    left_features = left_win_3[:-1]
    #else:
    #    left_features = left_win_2
    #left_features = left_win_2
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_2, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 3
Output is **identical to rule top 2**: in our data the mention context always includes the **be**.

In [None]:
def LF_pan_top_3(c, stopwords):
    '''be perform use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['be', 'perform', 'use']
    left_features = [x for x in get_left_tokens(c, window=3, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_3, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 4

In [None]:
def LF_pan_top_4(c, stopwords):
    '''analysis be perform use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['analysis', 'be', 'perform', 'use']
    left_features = [x for x in get_left_tokens(c, window=4, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_4, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 5 
The big problem with this rule is that it also matches statistical tests that were performed, hence the additional negative keywords such as 'anova' and 't-test'.

In [None]:
def LF_pan_top_5(c, stopwords):
    '''analyze use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'unpaired',
                           'one-way',
                           'two-way',
                           'anova',
                           't-test',
                           'chi-square',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['analyze', 'use']
    left_features = [x for x in get_left_tokens(c, window=2, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_5, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 6 

In [None]:
def LF_pan_top_6(c, stopwords):
    '''analysis be perform with <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['analysis', 'be', 'perform', 'with']
    left_features = [x for x in get_left_tokens(c, window=4, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_6, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 7 

In [None]:
def dynamic_growing_context_top7(c, max_size=4, context=2, debug=False):
    top7_head_words = ['version',
                       'Version',
                       'v',
                       'V',
                       'v.',
                       'V.',
                       'ver.']
    version_number = re.compile(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)')
    right_context = [x for x in get_right_tokens(c, window=max_size+context, attrib="lemmas")]
    if debug:
        print("right context")
        print(right_context)
    for i,token in enumerate(right_context):
        if i == max_size-1:
            if debug:
                print("Return")
                print(right_context[max_size:max_size+context])
            return right_context[max_size:max_size+context]
        if (token in string.punctuation or 
            token in top7_head_words or
            version_number.match(token)):
            if debug:
                print("passed up token "+ token)
            pass
        else:
            if debug:
                print("Return")
                print(right_context[i:i+context])
            return right_context[i:i+context]

def LF_pan_top_7(c, stopwords):
    '''<> statistical software'''
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    for tok in tokens:
        if tok == 'use' or tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    debug = False
    #if tokens[0] == 'spss':
    #    print("Tokens")
    #    print(tokens)
    #    debug = True
    right_context = ['statistical', 'software']
    right_features = dynamic_growing_context_top7(c, debug=debug)
    if not right_features or len(right_context) != len(right_features):
        return 0
    for c,f in zip(right_context, right_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_7, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 8 

In [None]:
def LF_pan_top_8(c, stopwords):
    '''<> software be use'''
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    for tok in tokens:
        if tok == 'use' or tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    right_context = ['software', 'be', 'use']
    right_features = dynamic_growing_context_top7(c, context=3, debug=False)
    if not right_features or len(right_context) != len(right_features):
        return 0
    for c,f in zip(right_context, right_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_8, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 9 
In our case there was not a single tp from applying this rule, so it should probably be discarded instead of included. 

In [None]:
def LF_pan_top_9(c, stopwords):
    '''quantify use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['quantify', 'use']
    left_features = [x for x in get_left_tokens(c, window=2, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_9, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Individual Rules - Top 10 
Does not yield a single tp in our case, so it should rather be discarded. 

In [None]:
def LF_pan_top_10(c, stopwords):
    '''be calculate use <>'''
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    for tok in tokens:
        if tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return 0 # -1
    left_context = ['be', 'calculate', 'use']
    left_features = [x for x in get_left_tokens(c, window=3, attrib="lemmas")]
    if len(left_context) != len(left_features):
        return 0
    for c,f in zip(left_context, left_features):
        if c != f:
            return 0
    return 1

In [None]:
lf = partial(LF_pan_top_10, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

### Duck et al.'s BioNerDS and further similar points:
- Head nouns: basically we consider keywords appearing with software: **software, package, tool, toolkit, program, framework, web-service, ..**
- Version number: many software mentions come with a version number; if they do, this can be a strong hint.
- References and URLs (footnotes): software can often be cited, or a reference to the publisher could be given. However, this reference might be hidden in footnotes.
- Mention of the developer: this is found especially with commercial software. In this style the producer is mentioned in brackets after the software.
- Duck et al. also use negative head nouns.

#### Considerations
The rules are not quite as clear as the ones provided by Pan et al., therefore we have to make some more considerations on how to best apply them.

Head nouns should appear right around the word; they can be in the left as well as the right context. There might be some space between the artefact name and the head noun. For example, if the artefact is followed by 'version 2.0 software', there are two tokens in between and we have to examine a context of three. The question is: should a larger context be weighted lower than a smaller one?
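
One way to explore the weighting question is to report the distance to the nearest head noun and derive a weight from it. The sketch below is purely illustrative: the `1 / distance` weighting is a hypothetical choice, the function names are assumptions, and since labeling functions return discrete labels, such a weight would still have to be thresholded.

```python
# Head nouns as listed above.
POSITIVE_HEAD_NOUNS = {'software', 'package', 'tool', 'toolkit', 'program',
                       'framework', 'web-service'}


def head_noun_distance(context_lemmas, max_window=3):
    """Return the 1-based distance to the nearest head noun, or None."""
    for distance, tok in enumerate(context_lemmas[:max_window], start=1):
        if tok in POSITIVE_HEAD_NOUNS:
            return distance
    return None


def head_noun_weight(context_lemmas, max_window=3):
    """Hypothetical weighting: closer head nouns count more."""
    distance = head_noun_distance(context_lemmas, max_window)
    return 0.0 if distance is None else 1.0 / distance


print(head_noun_distance(['version', '2.0', 'software']))  # 3
print(head_noun_weight(['software']))                      # 1.0
```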

We always assume that version numbers follow the software name. However, identifying a proper version number is not as easy as it seems: V2, Version 2, Version 2.0, v 2.0, etc. This requires digging up or creating a suitable rule/regex that covers all of these variants.
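
As a starting point, a single anchored regex can cover the variants mentioned above. This is a sketch under assumptions, not the regex used in the labeling functions of this notebook; note that it also accepts bare decimals such as '0.05', so the surrounding context still has to rule out measurements and p-values.

```python
import re

# Optional "v", "ver" or "version" prefix (with optional dot and spaces),
# followed by dot-separated numbers.
VERSION_STATEMENT = re.compile(
    r'^(?:v(?:er(?:sion)?)?\.?\s*)?\d+(?:\.\d+)*$',
    re.IGNORECASE,
)


def is_version_statement(text):
    """Return True if the whole string looks like a version statement."""
    return bool(VERSION_STATEMENT.match(text.strip()))


for example in ['V2', 'Version 2', 'Version 2.0', 'v 2.0', 'ver. 15', 'Matlab']:
    print(example, is_version_statement(example))
```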

URLs in the right context of a word are easily identified because they are easy to discriminate from normal text. References are more difficult, but we will probably match them mainly via the pattern `[num]`. However, references should probably be combined with another mechanism because they are quite a weak indicator in a scientific article.
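
Both cues can be sketched with two small regexes on the tokens right of a candidate. The patterns are deliberately simple assumptions (a URL is any token starting with `http://`, `https://` or `www.`; a reference is a bracketed number or number list), and `context_cue` is a hypothetical helper. Tokens are joined first because a tokenizer may split `[74]` into three tokens.

```python
import re

URL_TOKEN = re.compile(r'^(?:https?://|www\.)\S+$', re.IGNORECASE)
REFERENCE_MARKER = re.compile(r'^\[\d+(?:\s*[,-]\s*\d+)*\]$')


def context_cue(right_tokens, window=3):
    """Classify the first `window` right tokens as 'reference', 'url' or None."""
    head = right_tokens[:window]
    if REFERENCE_MARKER.match(''.join(head)):
        return 'reference'
    for tok in head:
        if URL_TOKEN.match(tok):
            return 'url'
    return None


print(context_cue(['[', '74', ']']))                     # reference
print(context_cue(['(', 'http://www.example.org', ')']))  # url
print(context_cue(['software', 'be', 'use']))            # None
```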

Finding a reference to the publisher is more complex, but should be possible following the same principle, except that regex matching is necessary in the individual steps.

#### 3 Rules
We will try out three rules based on the contexts reported by Duck et al.

##### Head Nouns
Looking for positive head nouns in the right context (it is more common for the right context to contain head nouns). 

One main source of fps is that software names are also partially matched. Checking the POS tag of the token to the left of the candidate, to see whether more nouns precede it, improves performance.

One remaining 'problem' in the data is the matching of abbreviations instead of the full name. But since this can still be considered a correct match, we will not remove it to increase the quality of the data. The case is illustrated by the following example:
`First, genotypes from the exomes were entered into the Copy Number Inference from Exome Reads (CoNIFER) package [74].` where we match `CoNIFER` instead of `Copy Number Inference from Exome Reads`.

In [None]:
def get_normalized_next_right_word(c, max_size=6):
    version_number = re.compile(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)')
    positive_head_nouns = [
        'software',
        'package',
        'program', 
        'tool', 
        'toolbox',
        'web-service',
        'spreadsheet'
    ]
    top1_head_words = ['statistical', 
                       'method', 
                       'procedure', 
                       'kit', 
                       'version',
                       'Version',
                       'v',
                       'V',
                       'v.',
                       'V.',
                       'ver.']
    right_context = [x for x in get_right_tokens(c, window=max_size, attrib="lemmas")]
    for i,token in enumerate(right_context):
        if token in positive_head_nouns:
            return 1
        elif (token in string.punctuation or 
            token in top1_head_words or
            version_number.match(token)):
            pass
        else:
            return 0
        
def LF_software_head_nouns(c, stopwords):
    negative_head_words = ['statistical', 
                           'software', 
                           'method', 
                           'procedure', 
                           'kit', 
                           'program', 
                           'tool', 
                           'toolbox',
                           'version',
                           'Version',
                           'v',
                           'V',
                           'v.',
                           'V.',
                           'ver.']
    tokens = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    for tok in tokens:
        if tok in ['software', 'program', 'tool', 'computer', 'custom'] or tok in stopwords or tok in string.punctuation or tok in negative_head_words or re.match(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)', tok):
            return -1
    poses = [x for x in c[0].get_attrib_tokens(a="pos_tags")]
    for pos in poses:
        if pos not in ['NN', 'NNS', 'NNP', 'NNPS']:
            return -1 # 0
    left_context = [x for x in get_left_tokens(c, window=1, attrib="pos_tags")]
    left_words = [x for x in get_left_tokens(c, window=1, attrib="words")]
    if left_context and left_context[0] in ['nn', 'nns', 'nnp', 'nnps']:
        return 0
    res = get_normalized_next_right_word(c)
    if res:
        return res
    else:
        return 0

In [None]:
lf = partial(LF_software_head_nouns, stopwords=stopwords)
test_marginals  = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

##### Version Numbers
Looking for version numbers in the right context. Version numbers almost exclusively appear right of the software mention. 

Applying the left-context rule introduced above actually hurts performance somewhat here, although it further reduces the number of fps.

In [None]:
def LF_version_number(c):
    version_number = re.compile(r'(v|V|v.|V.)?(\d+\.)?(\d+\.)?(\d+)')
    simple_float = re.compile(r'^0\.\d{1,3}$')
    common_measures = ['nm', 'µm', 'mm', 'cm', 'dm', 'm', 'km', 'mg', 'g', 'kg', 'ml', 'l', 's', 'h', 'y']
    restrictive_version_number = re.compile(r'^(v|V|v.|V.)?(\d{1,3}\.)?(\d{1,3}\.)(\d{1,3})$')
    lemmas = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    pos_tags = [x for x in c[0].get_attrib_tokens(a="pos_tags")]
    for lem in lemmas:
        if (len(lem) <= 1 and not lem.isalpha()) or lem in ['v', 'V', 'v.', 'V.', 'ver.', 'Ver.', 'version', 'Version'] or lem in ['software', 'package', 'program'] or lem == 'ph':
            return -1
    for pos in pos_tags:
        if pos not in ['NN', 'NNS', 'NNP', 'NNPS']:
            return -1
    #left_context = [x for x in get_left_tokens(c, window=1, attrib="pos_tags")]
    #left_words = [x for x in get_left_tokens(c, window=1, attrib="words")]
    #if left_context and left_context[0] in ['nn', 'nns', 'nnp', 'nnps']:
    #    return -1
    right_context = [x for x in get_right_tokens(c, window=4, attrib="lemmas")]
    to_examine = 0
    if not right_context:
        return 0
    if right_context[0] in ['(']:#,'[','{']: #TODO: also exclude software, package, software package, etc.
        to_examine = 1
    if len(right_context) > 1 and right_context[to_examine] in ['v', 'V', 'v.', 'V.', 'ver.', 'Ver.', 'version', 'Version']:
        if len(right_context) > 2:
            potential_version_number = right_context[to_examine+1]
            if version_number.match(potential_version_number):
                return 1
        return 0
    if len(right_context) > 1 and not simple_float.match(right_context[to_examine]) and restrictive_version_number.match(right_context[to_examine]):
        if len(right_context) > 2:
            next_right_context = right_context[to_examine+1]
            if next_right_context in ['%'] or next_right_context in common_measures:
                return -1
        return 1
    return 0


In [None]:
lf = LF_version_number
# Map the LF output {-1, 0, 1} to marginals {0, 0.5, 1} before passing it to the scorer
test_marginals = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)
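
The core test of `LF_version_number` can be illustrated in isolation. The following is a simplified sketch, not the labeling function itself: it ignores the bracket skipping and the explicit version indicators, and the `SIMPLE_FLOAT` pattern is an assumption (the notebook's own `simple_float` may be defined differently). The helper name `looks_like_version` is hypothetical.

```python
import re

# Pattern from LF_version_number, with the dot after v/V escaped so that
# it only matches a literal period.
RESTRICTIVE_VERSION = re.compile(r'^(v|V|v\.|V\.)?(\d{1,3}\.)?(\d{1,3}\.)(\d{1,3})$')
# Assumption: a plain decimal number; the notebook's simple_float may differ.
SIMPLE_FLOAT = re.compile(r'^\d+\.\d+$')
COMMON_MEASURES = ['nm', 'µm', 'mm', 'cm', 'dm', 'm', 'km', 'mg', 'g', 'kg',
                   'ml', 'l', 's', 'h', 'y']


def looks_like_version(tokens):
    """Check whether the first token is a version number and not a measurement."""
    if not tokens:
        return False
    first = tokens[0]
    # Plain floats such as '1.5' are too ambiguous to count as versions.
    if SIMPLE_FLOAT.match(first) or not RESTRICTIVE_VERSION.match(first):
        return False
    # A following unit or percent sign indicates a measurement.
    if len(tokens) > 1 and (tokens[1] == '%' or tokens[1] in COMMON_MEASURES):
        return False
    return True


for context in (['1.5.2', ')'], ['v1.5'], ['1.5', 'mg'], ['2.0.1', 'mm'], ['2018']):
    print(context, looks_like_version(context))
```

Only the first two contexts are accepted: `1.5` is rejected as a plain float, `2.0.1 mm` because of the unit, and `2018` because it contains no dot.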

##### Reference/URL/Developer

Reference: We do not use references because they are too ambiguous in scientific studies and do not serve as a solid feature. (Examining our development set confirms this.)

URL: This rule does not work too well on its own, but might make a nice addition to other rules.
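
As a minimal sketch of the idea, the URL check can be tried on an already tokenized bracket content. The token lists are made-up examples, and the helper name `url_in_brackets` is hypothetical; the regex is the one used in `LF_url`.

```python
import re

# URL pattern as used in LF_url
url_regex = re.compile(r"^((http(s)?:\/\/www\.)|(http(s)?:\/\/)|(www\.))[a-z\.-]+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$")


def url_in_brackets(tokens):
    """Return True if a URL-like token occurs before the first closing bracket.

    `tokens` are the lowercased tokens following the opening bracket.
    """
    for tok in tokens:
        if tok == ')':
            return False
        if url_regex.match(tok):
            return True
    return False


print(url_in_brackets(['version', '3.4', ',', 'https://cran.r-project.org', ')']))
print(url_in_brackets(['chicago', ',', 'il', ')', 'http://example.org']))
```

The second call returns `False` because the closing bracket is reached before the URL, which keeps the rule from reacting to URLs that belong to a later part of the sentence.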

Developer: We are looking for constructs like `(SPSS, Chicago, IL, USA)`, `(v. 15 IBM, Chicago, IL, USA,)` or `(SAS Institute, Cary, North Carolina)`, which should be easy to identify in a text. The basic construction of the rule is the same as for extracting URLs; only the match between the brackets has to be refined.
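
One possible refinement of the bracket match is sketched here under explicit assumptions: the bracket content is split at commas and semicolons, and a bracket counts as a developer mention if it contains a company suffix or at least three segments. The suffixes are the ones checked in `LF_developer`; the segment threshold `min_segments` and both helper names are hypothetical and have not been evaluated on the development set.

```python
# Company suffixes as checked in LF_developer (lowercased lemmas)
COMPANY_SUFFIXES = ('inc', 'ltd', 'corp')


def split_bracket_content(tokens, separators=(',', ';')):
    """Split the tokens between the brackets into separator-delimited segments."""
    segments = [[]]
    for tok in tokens:
        if tok in separators:
            segments.append([])
        else:
            segments[-1].append(tok)
    # Drop empty segments, e.g. the one caused by a trailing comma
    return [seg for seg in segments if seg]


def looks_like_developer(tokens, min_segments=3):
    """Heuristic (assumption): a company suffix or an address-like segment count."""
    segments = split_bracket_content(tokens)
    has_suffix = any(tok in COMPANY_SUFFIXES for seg in segments for tok in seg)
    return has_suffix or len(segments) >= min_segments


print(looks_like_developer(['spss', ',', 'chicago', ',', 'il', ',', 'usa']))
print(looks_like_developer(['e.g', ',', 'r']))
```

The segment count is a crude stand-in for the "name, city, state, country" pattern of the examples above; a list of place names would probably be more precise, but would also need more maintenance.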

In [None]:
def LF_url(c):
    # The following regex certainly matches URLs, but a far less complex pattern might work as well.
    url_regex = re.compile(r"^((http(s)?:\/\/www\.)|(http(s)?:\/\/)|(www\.))[a-z\.-]+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$")
    version_indicators = ['v', 'V', 'v.', 'V.', 'ver.', 'Ver.', 'version', 'Version']
    # Escape the dot after v/V so that it only matches a literal period
    version_number = re.compile(r'(v|V|v\.|V\.)?(\d+\.)?(\d+\.)?(\d+)')
    positive_head_nouns = [
        'software',
        'package',
        'program', 
        'tool', 
        'toolbox',
        'web-service',
        'spreadsheet'
    ]
    lemmas = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    pos_tags = [x for x in c[0].get_attrib_tokens(a="pos_tags")]
    for lem in lemmas:
        if (len(lem) <= 1 and not lem.isalpha()) or version_number.match(lem) or lem in version_indicators or lem in positive_head_nouns or lem in ['computer']:
            return -1
    for pos in pos_tags:
        if pos not in ['NN', 'NNS', 'NNP', 'NNPS']:
            return -1
    left_context = [x for x in get_left_tokens(c, window=1, attrib="pos_tags")]
    left_words = [x for x in get_left_tokens(c, window=1, attrib="words")]
    if left_context and left_context[0] in ['nn', 'nns', 'nnp', 'nnps']:
        return 0 # -1
    # In principle we just want to look at the entire right context and decide for ourselves how big it should be
    right_context = [x for x in get_right_tokens(c, window=15, attrib="lemmas")]
    right_context_size = len(right_context)
    first_token_to_consider = 0
    while (first_token_to_consider < right_context_size and 
           (right_context[first_token_to_consider] in version_indicators or 
            right_context[first_token_to_consider] in positive_head_nouns or right_context[first_token_to_consider] in ['computer'] or 
            version_number.match(right_context[first_token_to_consider]))):
        first_token_to_consider += 1 
        
    if first_token_to_consider == right_context_size or right_context[first_token_to_consider] != '(':
        return 0
    else:
        # Scan the bracket content: abstain at the closing bracket, vote positive on the first URL
        for tok in right_context[first_token_to_consider+1:]:
            if tok == ')':
                return 0
            if url_regex.match(tok):
                return 1
        return 0

def LF_developer(c):
    # Escape the dot after v/V so that it only matches a literal period
    version_number = re.compile(r'(v|V|v\.|V\.)?(\d+\.)?(\d+\.)?(\d+)')
    developer_version_addition = ['v.', 'ver.', 'version']
    version_indicators = ['v', 'V', 'v.', 'V.', 'ver.', 'Ver.', 'version', 'Version']
    positive_head_nouns = [
        'software',
        'package',
        'program', 
        'tool', 
        'toolbox',
        'web-service',
        'spreadsheet'
    ]
    lemmas = [x.lower() for x in c[0].get_attrib_tokens(a="lemmas")]
    pos_tags = [x for x in c[0].get_attrib_tokens(a="pos_tags")]
    for lem in lemmas:
        if (len(lem) <= 1 and not lem.isalpha()) or version_number.match(lem) or lem in version_indicators or lem in positive_head_nouns or lem in ['computer']:
            return -1
    for pos in pos_tags:
        if pos not in ['NN', 'NNS', 'NNP', 'NNPS']:
            return -1
    left_context = [x for x in get_left_tokens(c, window=1, attrib="pos_tags")]
    left_words = [x for x in get_left_tokens(c, window=1, attrib="words")]
    if left_context and left_context[0] in ['nn', 'nns', 'nnp', 'nnps']:
        return 0  # -1
    # In principle we just want to look at the entire right context and decide for ourselves how big it should be
    right_context = [x for x in get_right_tokens(c, window=20, attrib="lemmas")]
    right_context_size = len(right_context)
    first_token_to_consider = 0
    while (first_token_to_consider < right_context_size and 
           (right_context[first_token_to_consider] in version_indicators or 
            right_context[first_token_to_consider] in positive_head_nouns or right_context[first_token_to_consider] in ['computer'] or 
            version_number.match(right_context[first_token_to_consider]))):
        first_token_to_consider += 1 
    # Unlike LF_url, we want to examine the whole bracket content here, so we first extract it
    # by looking for the closing bracket.
    
    if first_token_to_consider == right_context_size or right_context[first_token_to_consider] != '(':
        return 0
    else:
        remaining_tokens = right_context[first_token_to_consider+1:]
        #remaining_words = right_words[first_token_to_consider+1:]
        last_token_to_consider = -1
        for i,tok in enumerate(remaining_tokens):
            if tok == ')':
                last_token_to_consider = i 
                break
        if last_token_to_consider < 0:
            return 0
        else: 
            remaining_tokens = remaining_tokens[:last_token_to_consider]
            #remaining_words = remaining_words[:last_token_to_consider]
            # Here we perform the actual test
            for tok in remaining_tokens:
                if tok in developer_version_addition or tok in ['inc', 'ltd', 'corp', 'apply']:
                    return 1
                if tok in ['such', 'i.e', 'e.g']:
                    return -1
            #for tok in remaining_words:
            #    if tok in us_states:
            #        return 1
            # Split the bracket content at separators; the segments are not used for labeling yet
            token_split = [[]]
            for i in remaining_tokens:
                if i in [',', ';']:
                    token_split.append([])
                else:
                    token_split[-1].append(i)
            
            return 0

In [None]:
lf = LF_developer
test_marginals = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)

In [None]:
lf = LF_url
test_marginals = np.array([0.5 * (lf(c) + 1) for c in test_cands])
tp, fp, tn, fn = scorer.score(test_marginals, set_unlabeled_as_neg=True, set_at_thresh_as_neg=False)
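
The expression `0.5 * (lf(c) + 1)` used in the evaluation cells turns the labeling function output into a marginal. The following sketch shows the mapping, together with our reading of how a marginal at the threshold is treated. Both helpers are hypothetical illustrations and not part of the Snorkel API; the bucketing logic is an assumption about the scorer's behaviour.

```python
def label_to_marginal(label):
    """Map an LF output in {-1, 0, 1} to a marginal in {0.0, 0.5, 1.0}."""
    return 0.5 * (label + 1)


def predicted_class(marginal, threshold=0.5, at_thresh_as_neg=False):
    """Assumed bucketing: above the threshold is positive, below is negative.

    A marginal exactly at the threshold (an abstain) only counts as negative
    if at_thresh_as_neg is set; otherwise it is left out (returned as 0).
    """
    if marginal > threshold:
        return 1
    if marginal < threshold or at_thresh_as_neg:
        return -1
    return 0


for label in (-1, 0, 1):
    marginal = label_to_marginal(label)
    print(label, marginal, predicted_class(marginal))
```

Under this reading, abstains stay out of the error buckets when `set_at_thresh_as_neg=False`, so the reported numbers reflect only the candidates on which the labeling function actually votes.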

In [None]:
SentenceNgramViewer(fp, session)