# NERTools
Code that runs pretrained models from NER tools such as Stanford NER and spaCy, and scores their predictions with precision, recall, and the Matthews correlation coefficient (MCC).

[GitHub repo](https://github.com/berfubuyukoz/NERTools)

### IMPORTANT NOTE:
The default argument values assume that the files are in the same folder as this notebook. For more information about the arguments, please refer to the README in the GitHub repo. A `.py` version is also available for running the code from the command line.

The files used as defaults can be downloaded from the GitHub repo.

In [9]:
import nltk
import sys
from foliaHelper import readFoliaIntoSentences
from conllHelper import readConllIntoSentences
from metricHelper import findMCC
from metricHelper import findPrecisionRecalls
from stanfordNER import runStfModel
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/berfu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Default values of the arguments, assuming the files are in the same folder as this notebook. Change the values here if you need different ones:

In [42]:
ner_tool = 'stanford'
tag_types = ["ORG", "LOC", "PER"]  # Scores over all tags (TOTAL) are always calculated as well, whatever is listed here.
eval_metrics = 0 # 0: Precision and Recall. 1: MCC. 2: Precision, Recall, and MCC
annotation_format = 0  # Conll: 0, Folia: 1

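# testfile must match annotation_format: a folder of Folia files (e.g. './alladjudicated') needs annotation_format = 1,
# a Conll file (e.g. './conll-dataset/test-a.txt') needs annotation_format = 0.
testfile = './alladjudicated'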
tagger = './stanford-ner.jar' 
model = './ner-model-english-conll-4class.ser.gz' # Or another model file on your machine.

# Ignore the lines below; they are paths for my local folder structure.
#testfile = './conll-dataset/test-a.txt'
#tagger = './stanford-ner-files/stanford-ner.jar'
#model = './stanford-ner-files/ner-model-english-conll-4class.ser.gz'


Reading the input data and extracting the sentences as token lists and the actual tags as a flat list. A separate parsing function is used for each annotation format (Conll, Folia, etc.):

The function that reads Conll-formatted files:

In [43]:
import re


def readConllIntoSentences(testfile):
    with open(testfile, 'r') as f:
        lines = []
        sentences = [[]]
        for line in f:
            # A blank (or whitespace-only) line marks a sentence boundary.
            if line.strip():
                sentences[-1].append(line.split(None, 1)[0])
                lines.append(line.split())
            else:
                sentences.append([])
    all_tokens = [line[0] for line in lines]
    actual_tags = [line[-1] for line in lines]
    # Map Conll tags (e.g. 'B-LOC', 'I-PER') to the labels used by the Stanford model.
    actual_tags = ['LOCATION' if 'LOC' in tag else tag for tag in actual_tags]
    actual_tags = ['PERSON' if 'PER' in tag else tag for tag in actual_tags]
    actual_tags = ['ORGANIZATION' if 'ORG' in tag else tag for tag in actual_tags]
    # MISC is not evaluated, so it is treated as 'O'.
    actual_tags = ['O' if 'MISC' in tag else tag for tag in actual_tags]
    return [sentences, all_tokens, actual_tags]
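
To make the tag normalization above concrete, here is a minimal, self-contained sketch of the same mapping applied to a few Conll-style tags. The helper name `normalize_conll_tag` and the sample tags are illustrative assumptions and not part of NERTools.

```python
def normalize_conll_tag(tag):
    """Map a Conll tag such as 'B-LOC' or 'I-PER' to a Stanford-style label."""
    # Hypothetical helper: the substring checks mirror the order used in
    # readConllIntoSentences (LOC, PER, ORG, then MISC -> 'O').
    mapping = [('LOC', 'LOCATION'), ('PER', 'PERSON'),
               ('ORG', 'ORGANIZATION'), ('MISC', 'O')]
    for key, label in mapping:
        if key in tag:
            return label
    return tag  # 'O' and any unknown tag pass through unchanged


sample_tags = ['B-ORG', 'O', 'I-PER', 'I-LOC', 'B-MISC']
normalized = [normalize_conll_tag(t) for t in sample_tags]
print(normalized)
# ['ORGANIZATION', 'O', 'PERSON', 'LOCATION', 'O']
```

Folding `MISC` into `'O'` keeps the gold tags comparable with the three entity types that are scored later.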

The function that reads Folia-formatted files:

In [44]:
from pynlpl.formats import folia
import os
import re


def convertFoliaClass2stfTag(e):
    per = 'PERSON'
    loc = 'LOCATION'
    org = 'ORGANIZATION'
    cls = e.cls
    if re.match('^.*Target.*$', e.set):
        if cls == 'name':
            return per
    elif re.match('^.*Organizer.*$', e.set):
        if cls == 'name':
            return org
    if cls == 'loc' or cls == 'place' or cls == 'place_pub':
        return loc
    if cls == 'pname':
        return per
    if cls == 'fname':
        return org
    return 'O'


def readFoliaIntoSentences(path):
    sentences_as_tokens = []
    ids = []
    id2idx = {}
    idx2id = {}
    all_tokens = []
    actual_stf_tags = []
    if os.path.isdir(path):
        idx = -1
        for filename in os.listdir(path):
            doc = folia.Document(file=os.path.join(path, filename))
            for sentence in doc.sentences():
                sentence_tokenized = sentence.select(folia.Word)
                words_folia = list(sentence_tokenized)
                sentence_tokens = []
                for word in words_folia:
                    w_id = word.id
                    w_text = word.text()
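                    if w_id in id2idx:  # Skip words that were already seen.
                        continue
                    idx = idx + 1
                    # Dataset-specific workaround: drop the '<P>' token at index 16307.
                    if idx == 16307 and w_text == '<P>':
                        idx = idx - 1
                        continue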
                    ids.append(w_id)
                    id2idx[w_id] = idx
                    idx2id[idx] = w_id
                    actual_stf_tags.append('O')
                    sentence_tokens.append(w_text)
                    all_tokens.append(w_text)

                sentences_as_tokens.append(sentence_tokens)
                for layer in sentence.select(folia.EntitiesLayer):
                    for entity in layer.select(folia.Entity):
                        for word in entity.wrefs():
                            word_id = word.id
                            _idx = id2idx[word_id]
                            stf_tag = convertFoliaClass2stfTag(entity)
                            actual_stf_tags[_idx] = stf_tag

    else:
        print("TODO: Handling of a single Folia file instead of a folder of Folia files.")
    return [sentences_as_tokens, ids, id2idx, idx2id, all_tokens, actual_stf_tags]

The code that calls the right file-reader function (one of the two above):

In [55]:
_sentences = []
actual_stf_tokens = []
actual_stf_tags = []

if annotation_format == 0:  # Conll
    [_sentences, actual_stf_tokens, actual_stf_tags] = readConllIntoSentences(testfile)
elif annotation_format == 1:  # Folia
    [_sentences, ids, id2idx, idx2id, actual_stf_tokens, actual_stf_tags] = readFoliaIntoSentences(testfile)

The function that runs the loaded Stanford NER model:

In [56]:
from nltk.tag.stanford import StanfordNERTagger


def runStfModel(sents, tagger, model):
    # Prepare NER tagger with english model
    ner_tagger = StanfordNERTagger(model, tagger, encoding='utf8')
    # Run NER tagger on words
    return ner_tagger.tag_sents(sents)

Running the selected NER tool (stanford, spacy, etc.):

In [57]:
if ner_tool == 'stanford':
    result = runStfModel(_sentences, tagger, model)
    token_predTag = [item for sublist in result for item in sublist]
else:
    print('TODO: Calling other ner tools.')
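
The flattening step above turns the per-sentence output of the tagger into one list of `(token, tag)` pairs, so that it lines up index by index with the flat list of actual tags. A small sketch with made-up tagger output (the sentences below are hypothetical, not real model predictions):

```python
from itertools import chain

# Hypothetical tagger output: one list of (token, tag) pairs per sentence.
result_example = [
    [('Angela', 'PERSON'), ('Merkel', 'PERSON'), ('visited', 'O'), ('Paris', 'LOCATION')],
    [],  # an empty sentence contributes nothing to the flat list
    [('NATO', 'ORGANIZATION'), ('met', 'O')],
]

# Same effect as the nested list comprehension used in the cell above.
flat = list(chain.from_iterable(result_example))
tokens = [token for token, _ in flat]
tags = [tag for _, tag in flat]

print(tokens)
print(tags)
```

The scoring cells rely on this flat order matching the order of the tokens read from the test file, so it is worth checking that both lists have the same length before comparing tags.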

Calculating some intermediate variables for scoring the predictions:

In [60]:
pred_stf_tokens = [tp[0] for tp in token_predTag]
pred_stf_tags = [tp[1] for tp in token_predTag]

actual = actual_stf_tags
pred = pred_stf_tags

In [61]:
# idx_act_pred_same = [(i, actual_stf_tokens[i], actual[i], pred[i]) for i in range(len(pred))
# if actual[i] == pred[i] and actual[i] != 'O']

# All mismatches (fp and fn, including 'O') as (index, token, actual tag, predicted tag) tuples.
idx_token_act_pred_diff = [(i, actual_stf_tokens[i], actual[i], pred[i]) for i in range(len(pred)) if actual[i] != pred[i]]
# Indices of the mismatches, kept in a set for fast membership tests.
idx_diff = {i[0] for i in idx_token_act_pred_diff}
# True positives: correctly tagged tokens, excluding 'O'.
idx_tag_numerator = [(i, actual_stf_tokens[i], actual[i], pred[i]) for i in range(len(pred)) if i not in idx_diff and actual[i] != 'O']


The function that finds precision and recall:

In [62]:
def findPrecisionRecalls(actual_stf_tokens, actual,pred, idx_diff, tag_types):
    tag2scores = {}
    # tp for 'loc'
    idx_tag_numerator_loc = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'LOCATION']
    idx_tag_act_loc = [(i, actual[i], pred[i]) for i in range(len(pred)) if actual[i] == 'LOCATION']
    idx_tag_pred_loc = [(i, actual[i], pred[i]) for i in range(len(pred)) if pred[i] == 'LOCATION']

    # Examine the results for LOC (kept for manual inspection while debugging).
    actual_locs_missed = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_loc if pred[a[0]] != 'LOCATION']  # 558; nearly all of them have lower-cased first letters.
    actual_locs_catched = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_loc if pred[a[0]] == 'LOCATION']  # 17; all of them start with upper-cased letters.

    # tp for 'per'
    idx_tag_numerator_per = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'PERSON']
    idx_tag_act_per = [(i, actual[i], pred[i]) for i in range(len(pred)) if actual[i] == 'PERSON']
    idx_tag_pred_per = [(i, actual[i], pred[i]) for i in range(len(pred)) if pred[i] == 'PERSON']

    # Examine the results for PER (kept for manual inspection while debugging).
    # Actual PERSON tokens that the tagger missed.
    actual_pers_missed = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_per if
                          pred[a[0]] != 'PERSON']
    # Actual PERSON tokens that the tagger found.
    actual_pers_catched = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_per if
                           pred[a[0]] == 'PERSON']

    # tp for 'org'
    idx_tag_numerator_org = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'ORGANIZATION']
    idx_tag_act_org = [(i, actual[i], pred[i]) for i in range(len(pred)) if actual[i] == 'ORGANIZATION']
    idx_tag_pred_org = [(i, actual[i], pred[i]) for i in range(len(pred)) if pred[i] == 'ORGANIZATION']

    # Examine the results for ORG (kept for manual inspection while debugging).
    # Actual ORGANIZATION tokens that the tagger missed.
    actual_orgs_missed = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_org if
                          pred[a[0]] != 'ORGANIZATION']
    # Actual ORGANIZATION tokens that the tagger found.
    actual_orgs_catched = [[actual_stf_tokens[a[0]], a] for a in idx_tag_act_org if
                           pred[a[0]] == 'ORGANIZATION']

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising ZeroDivisionError when a tag never
        # occurs in the actual or predicted tags.
        return numerator / denominator if denominator else 0.0

    total_numerator = len(idx_tag_numerator_loc) + len(idx_tag_numerator_per) + len(idx_tag_numerator_org)
    total_recall = ratio(total_numerator, len(idx_tag_act_loc) + len(idx_tag_act_per) + len(idx_tag_act_org))
    total_prec = ratio(total_numerator, len(idx_tag_pred_loc) + len(idx_tag_pred_per) + len(idx_tag_pred_org))
    tag2scores['TOTAL'] = [total_prec, total_recall]

    if "LOC" in tag_types:
        loc_recall = ratio(len(idx_tag_numerator_loc), len(idx_tag_act_loc))
        loc_prec = ratio(len(idx_tag_numerator_loc), len(idx_tag_pred_loc))
        tag2scores['LOC'] = [loc_prec, loc_recall]
    if "PER" in tag_types:
        per_recall = ratio(len(idx_tag_numerator_per), len(idx_tag_act_per))
        per_prec = ratio(len(idx_tag_numerator_per), len(idx_tag_pred_per))
        tag2scores['PER'] = [per_prec, per_recall]
    if "ORG" in tag_types:
        org_recall = ratio(len(idx_tag_numerator_org), len(idx_tag_act_org))
        org_prec = ratio(len(idx_tag_numerator_org), len(idx_tag_pred_org))
        tag2scores['ORG'] = [org_prec, org_recall]

    return tag2scores
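
As a sanity check on the definitions used above, the following sketch computes token-level precision and recall for a single label on a tiny, made-up pair of tag lists. The function name `precision_recall` and the example data are assumptions for illustration; this is not the implementation in `metricHelper`.

```python
def precision_recall(actual, pred, label):
    """Token-level precision and recall for one label (0.0 when undefined)."""
    tp = sum(1 for a, p in zip(actual, pred) if a == label and p == label)
    n_pred = sum(1 for p in pred if p == label)   # tokens predicted as label
    n_act = sum(1 for a in actual if a == label)  # tokens actually labelled
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_act if n_act else 0.0
    return precision, recall


# Hypothetical gold and predicted tags for six tokens.
actual_example = ['PERSON', 'O', 'LOCATION', 'LOCATION', 'O', 'ORGANIZATION']
pred_example = ['PERSON', 'O', 'LOCATION', 'O', 'LOCATION', 'O']

print(precision_recall(actual_example, pred_example, 'LOCATION'))      # (0.5, 0.5)
print(precision_recall(actual_example, pred_example, 'PERSON'))        # (1.0, 1.0)
print(precision_recall(actual_example, pred_example, 'ORGANIZATION'))  # (0.0, 0.0)
```

Precision divides the true positives by the number of tokens predicted with the label, while recall divides them by the number of tokens that actually carry it, which is why one missed and one spurious `LOCATION` both give 0.5 here.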


The function that computes the Matthews correlation coefficient (MCC), a metric that suits unbalanced data:
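
For reference, the standard binary definition of MCC, written with the usual confusion-matrix counts (true positives $TP$, true negatives $TN$, false positives $FP$, false negatives $FN$), is:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
$$

The function in the next cell evaluates this expression on counts summed over the three entity types. The value ranges from $-1$ to $1$, and it is undefined when any of the four factors under the square root is zero.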

In [63]:
import math

def findMCC(idx_tag_numerator, idx_act_pred_diff, idx_diff, actual, pred):
    # tp for 'loc'
    idx_tag_numerator_loc = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'LOCATION']
    # tp for 'per'
    idx_tag_numerator_per = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'PERSON']
    # tp for 'org'
    idx_tag_numerator_org = [(i, actual[i], pred[i]) for i in range(len(pred)) if
                             i not in idx_diff and actual[i] == 'ORGANIZATION']

    # Each element of idx_act_pred_diff is (index, token, actual tag, predicted tag).
    # fp for loc: predicted LOCATION although the actual tag differs.
    fp_loc = [itd[0]
              for itd in idx_act_pred_diff if itd[3] == 'LOCATION']

    # fn for loc: actual LOCATION although the predicted tag differs.
    fn_loc = [itd[0]
              for itd in idx_act_pred_diff if itd[2] == 'LOCATION']

    # fp for per
    fp_per = [itd[0]
              for itd in idx_act_pred_diff if itd[3] == 'PERSON']

    # fn for per
    fn_per = [itd[0]
              for itd in idx_act_pred_diff if itd[2] == 'PERSON']

    # fp for org
    fp_org = [itd[0]
              for itd in idx_act_pred_diff if itd[3] == 'ORGANIZATION']

    # fn for org
    fn_org = [itd[0]
              for itd in idx_act_pred_diff if itd[2] == 'ORGANIZATION']

    # tn for loc: neither the actual nor the predicted tag is LOCATION.
    tn_loc = [i for i in range(len(pred)) if actual[i] != 'LOCATION' and pred[i] != 'LOCATION']

    # tn for per
    tn_per = [i for i in range(len(pred)) if actual[i] != 'PERSON' and pred[i] != 'PERSON']
    # tn for org
    tn_org = [i for i in range(len(pred)) if actual[i] != 'ORGANIZATION' and pred[i] != 'ORGANIZATION']

    tp_loc = idx_tag_numerator_loc
    tp_per = idx_tag_numerator_per
    tp_org = idx_tag_numerator_org
    total_tp = len(tp_loc) + len(tp_per) + len(tp_org)
    total_tn = len(tn_loc) + len(tn_per) + len(tn_org)
    total_fp = len(fp_loc) + len(fp_per) + len(fp_org)
    total_fn = len(fn_loc) + len(fn_per) + len(fn_org)

    total_pred_p = total_tp + total_fp
    total_pred_n = total_tn + total_fn
    total_actual_n = total_fp + total_tn
    total_actual_p = total_tp + total_fn

    MCC_numerator = total_tp * total_tn - total_fp * total_fn
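    MCC_denominator = math.sqrt(total_pred_p * total_pred_n * total_actual_p * total_actual_n)
    # MCC is undefined when the denominator is zero; return 0 by convention.
    if MCC_denominator == 0:
        return 0.0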

    return MCC_numerator / MCC_denominator
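
To check the arithmetic on something small, here is a self-contained sketch that derives one-vs-rest counts for a single label and plugs them into the MCC formula. The function names and toy tag lists are assumptions for illustration, and returning 0.0 for a zero denominator is a common convention rather than something the original code does.

```python
import math


def one_vs_rest_counts(actual, pred, label):
    """Return (tp, tn, fp, fn) for one label against all other tags."""
    pairs = list(zip(actual, pred))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    tn = sum(1 for a, p in pairs if a != label and p != label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    return tp, tn, fp, fn


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical gold tags and three different prediction lists.
actual_example = ['LOCATION', 'LOCATION', 'O', 'O']
perfect = ['LOCATION', 'LOCATION', 'O', 'O']
half_right = ['LOCATION', 'O', 'LOCATION', 'O']
all_other = ['O', 'O', 'O', 'O']

print(mcc_from_counts(*one_vs_rest_counts(actual_example, perfect, 'LOCATION')))     # 1.0
print(mcc_from_counts(*one_vs_rest_counts(actual_example, half_right, 'LOCATION')))  # 0.0
print(mcc_from_counts(*one_vs_rest_counts(actual_example, all_other, 'LOCATION')))   # 0.0
```

Unlike precision and recall, MCC also uses the true negatives, so a tagger that labels everything `'O'` scores 0 even though most tokens in NER data really are `'O'`.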

Now calculating the scores requested by the user (precision, recall, MCC, etc.) with the functions from the two cells above.

In [64]:
# Calculate Precision and Recall for tags individually, or MCC, depending on the arguments.
if eval_metrics == 0:
    tag2precrec = findPrecisionRecalls(actual_stf_tokens, actual,pred, idx_diff, tag_types)
elif eval_metrics == 1:
    mcc = findMCC(idx_tag_numerator, idx_token_act_pred_diff, idx_diff, actual, pred)
elif eval_metrics == 2:
    tag2precrec = findPrecisionRecalls(actual_stf_tokens, actual,pred, idx_diff, tag_types)
    mcc = findMCC(idx_tag_numerator, idx_token_act_pred_diff, idx_diff, actual, pred)

Printing the scores:

In [65]:
print("Scores: \n")
print("(Type 'other' results are omitted before calculating scores other than MCC.) \n")
if eval_metrics != 0:
    print("Matthews correlation coefficient: " + str(round(mcc, 2)) + "\n\n")

if eval_metrics != 1:
    for t in tag2precrec.keys():
        print(t + " precision: " + str(round(tag2precrec[t][0], 2)) + "\n")
        print(t + " recall: " + str(round(tag2precrec[t][1], 2)) + "\n\n")

Scores: 

(Type 'other' results are omitted before calculating scores other than MCC.) 

TOTAL precision: 0.94

TOTAL recall: 0.93


LOC precision: 0.95

LOC recall: 0.91


PER precision: 0.96

PER recall: 0.97


ORG precision: 0.89

ORG recall: 0.9


