# Feature-based NER tagger


**Model description**

The purpose of this work is to build a feature-based classifier for named entity recognition. 
The classifier below relies on the following features:
  - NLTK part-of-speech tags of the current word
  - Scapy part-of-speech tags of the current, and the two previous words
  - are the current and the previous words tagged as a proper noun? (according to Scapy tags)
  - are the current and the previous words in title case?
  - are the current and the previous words in capital case?
  - do the current and the previous words consist of alphabetic characters only?
  - do the current and the previous words consist of numeric characters only?
  - word length of the current and the previous words

I chose to work with a simple logistic regression, as it appears to be the most performant model while compared with other models, such as SVM, Gradient Boosting or Naive Bayes ones. I also chose to work with all the training data, despite unbalanced classes. To solve this issue, I have specified different class weights in the logistic regression model, so that smaller classes have a bigger weight in the classification task.


**Additional tricks**

In addition to these features, two methods are used to improve the performance of the classifier. 

First, accoring to Lev Ratinov and Dan Roth (Ratinov  & Roth, 2009), the BILOU tagging scheme is more efficient for NER tasks than the BIO scheme. The BILOU scheme assigns the following tags to words:
  - for a multi-token named entity:
    - B for the first token
    - I for intermediate tokens
    - L for the last token
  - U for a unit named entity
  - O for all other words

Therefore, I have converted BIO labels into BILOU tags, trained a classifier with the BILOU tags, and converted the BILOU predictions into BIO predictions to compute the performance of the system.

Then, I have noticed that BIO predictions are sometimes inconsistent. For exemple, a word can be labeled as I, but the previous word is labeled as O. To fix this problem, I have converted the predicted labels in order that the first token of an entity is labeled as B and the next tokens (if they exist) of the same entity is labeled as I. This method significantly improve the final F1-score of the model.


## Import libraries

In [None]:
# import usefule libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.util import ngrams

import spacy
nlp = spacy.load("en_core_web_sm")

from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, RidgeClassifier, RidgeClassifierCV

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Useful functions

In [None]:
# evaluation function
def wnut_evaluate(txt):
  '''row by row entity evaluation: we evaluate by whole named entities'''
  tp = 0; fp = 0; fn = 0
  in_entity = 0
  for i in txt.index:
    if txt['prediction'][i]=='B' and txt['bio_only'][i]=='B':
      if in_entity==1:  # if there's a preceding named entity which didn't have intervening O...
        tp += 1  # count a true positive
      in_entity = 1  # start tracking this entity (don't count it until we know full span of entity)
    elif txt['prediction'][i]=='B':
      fp += 1  # if not a B in gold annotations, it's a false positive
      in_entity = 0
    elif txt['prediction'][i]=='I' and txt['bio_only'][i]=='I':
      next  # correct entity continuation: do nothing
    elif txt['prediction'][i]=='I' and txt['bio_only'][i]=='B':
      fn += 1  # if a new entity should have begun, it's a false negative
      in_entity = 0
    elif txt['prediction'][i]=='I':  # if gold is O...
      if in_entity==1:  # and if tracking an entity, then the span is too long
        fp += 1  # it's a false positive
      in_entity = 0
    elif txt['prediction'][i]=='O':
      if txt['bio_only'][i]=='B':
        fn += 1  # false negative if there's B in gold but no predicted B
        if in_entity==1:  # also check if there was a named entity in progress
          tp += 1  # count a true positive
      elif txt['bio_only'][i]=='I':
        if in_entity==1:  # if this should have been a continued named entity, the span is too short
          fn += 1  # count a false negative
      elif txt['bio_only'][i]=='O':
        if in_entity==1:  # if a named entity has ended in right place
          tp += 1  # count a true positive
      in_entity = 0

  if in_entity==1:  # catch any final named entity
    tp += 1

  prec = tp / (tp+fp)
  rec = tp / (tp+fn)
  f1 = (2*(prec*rec)) / (prec+rec)
  print('Sum of TP and FP = %i' % (tp+fp))
  print('Sum of TP and FN = %i' % (tp+fn))
  print('True positives = %i, False positives = %i, False negatives = %i' % (tp, fp, fn))
  print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (prec, rec, f1))

In [None]:
# function to add nltk and scapy part-of-speech tags to data
def add_tags(train):
  nltktags_tmp = pos_tag(train.token)
  nltktags = [word_tag[1] for word_tag in nltktags_tmp]
  train['nltk_tag'] = nltktags

  scapytags = [nlp(u)[0].tag_ for u in train['token']]
  train['scapy_tag'] = scapytags

  return train

In [None]:
# function to convert BIO labels into BILOU ones
def bio_to_bilou(bio):
  bilou = [0]*len(bio)
  for i in range(len(bio)):
    if i == len(bio)-1 :
      if bio[i] == 'B':
        bilou[i] = 'U'
      elif bio[i] == 'I':
        bilou[i] = 'L'
      else:
        bilou[i] = 'O'
    
    else:
      if bio[i] == 'O':
        bilou[i] = 'O'
      elif bio[i] == 'B':
        if bio[i+1] == 'I':
          bilou[i] = 'B'
        else:
          bilou[i] = 'U'
      else:
        if bio[i+1] == 'I':
          bilou[i] = 'I'
        else:
          bilou[i] = 'L'
    
  return bilou


# function to convert BILOU labels into BIO ones
def bilou_to_bio(bilou):
  bio = [0]*len(bilou)
  for i in range(len((bilou))):
    if (bilou[i] == 'B') or (bilou[i] == 'U'):
      bio[i] = 'B'
    elif (bilou[i] == 'I') or (bilou[i] == 'L'):
      bio[i] = 'L'
    else:
      bio[i] = 'O'
  return bio


# function to convert BIO indices into BIO labels
def reverse_bio(ind):
  bio = 'B'
  if ind==0:
    bio = 'B'
  elif ind==1:
    bio = 'I'
  elif ind==2:
    bio = 'O'
  return bio


# function to convert BILOU indices into BILOU labels
def reverse_bilou(ind):
  bilou = 'B'
  if ind==0:
    bilou = 'B'
  elif ind==1:
    bilou = 'I'
  elif ind==2:
    bilou = 'L'
  elif ind==3:
    bilou = 'O'
  elif ind==4:
    bilou = 'U'
  return bilou


# function to rectify BIO predictions
def correct_preds(preds):
  for i in range(len(preds)):
    if i == 0:
      if preds[i] == 'I':
        preds[i] = 'B'
    
    else:
      if preds[i] == 'B' and preds[i-1] != 'O':
        preds[i] = 'I'
      elif preds[i] == 'I' and preds[i-1] == 'O':
        preds[i] = 'B'
  
  return preds

In [None]:
# feature 1: convert POS-tags to integers
def pos_index(pos, vocab):
  try:
    ind = vocab.index(pos)
  except ValueError:
    ind=len(vocab)
  return ind

# feature 2: is this a proper noun?
def is_propn(pos):
  resp = False
  if pos=='PROPN':
    resp = True
  return resp

# feature 3: is the first character a capital letter?
def title_case(tok):
  resp = False
  if tok[0:1].isupper():
    resp = True
  return resp

# feature 4: is the word in capital letters?
def capital_case(tok):
  if tok.isupper():
    return True
  return False

# feature 6: is the token alphabetic?
def is_alphabetic(tok):
  if tok.isalpha():
    return True
  return False

# feature 6: is the token numeric?
def is_numeric(tok):
  if tok.isnumeric():
    return True
  return False

# training labels: convert BIO to integers
def bio_index(bio):
  ind = 0
  if bio=='B':
    ind = 0
  elif bio=='I':
    ind = 1
  elif bio=='O':
    ind = 2
  return ind

# training labels: convert BILOU to integers
def bilou_index(bio):
  ind = 0
  if bio=='B':
    ind = 0
  elif bio=='I':
    ind = 1
  elif bio=='L':
    ind = 2
  elif bio=='O':
    ind = 3
  elif bio=='U':
    ind = 4
  return ind

# pass a data frame through our feature extractor
def extract_features(txt_orig, test=False):
  txt = txt_orig.copy()
  if not test:
    txt['bilou_only'] = bio_to_bilou(txt['bio_only'])
    bioints = [bio_index(b) for b in txt['bio_only']]
    txt['bio_only'] = bioints
    bilouints = [bilou_index(b) for b in txt['bilou_only']]
    txt['bilou_only'] = bilouints
  nltkinds = [pos_index(u, nltk_vocab) for u in txt['nltk_tag']]
  txt['nltk_indices'] = nltkinds
  scapyinds = [pos_index(u, scapy_vocab) for u in txt['scapy_tag']]
  txt['scapy_indices'] = scapyinds
  isprop_upos = [is_propn(u) for u in txt['upos']]
  txt['is_propn_upos'] = isprop_upos
  isprop_scapy = [is_propn(u) for u in txt['scapy_tag']]
  txt['is_propn_scapy'] = isprop_scapy
  tcase = [title_case(t) for t in txt['token']]
  txt['title_case'] = tcase
  ccase = [capital_case(t) for t in txt['token']]
  txt['capital_case'] = ccase
  isalpha = [is_alphabetic(t) for t in txt['token']]
  txt['is_alpha'] = isalpha
  isnum = [is_numeric(t) for t in txt['token']]
  txt['is_numeric'] = isnum
  wordlen = [len(t) for t in txt['token']]
  txt['word_len'] = wordlen

  # lag features
  txt.loc[1:,'scapy_indices_1'] = txt['scapy_indices'][:len(txt)-1]
  txt.scapy_indices_1.fillna(scapy_vocab.index("."), inplace=True)
  txt.loc[2:,'scapy_indices_2'] = txt['scapy_indices'][:len(txt)-2]
  txt.scapy_indices_2.fillna(scapy_vocab.index("."), inplace=True)

  txt.loc[1:,'is_propn_1'] = txt['is_propn_upos'][:len(txt)-1]
  txt.is_propn_1.fillna(False, inplace=True)
  txt.loc[2:,'is_propn_2'] = txt['is_propn_upos'][:len(txt)-2]
  txt.is_propn_2.fillna(False, inplace=True)

  txt.loc[1:,'title_case_1'] = txt['title_case'][:len(txt)-1]
  txt.title_case_1.fillna(False, inplace=True)
  txt.loc[2:,'title_case_2'] = txt['title_case'][:len(txt)-2]
  txt.title_case_2.fillna(False, inplace=True)

  txt.loc[1:,'capital_case_1'] = txt['capital_case'][:len(txt)-1]
  txt.capital_case_1.fillna(False, inplace=True)
  txt.loc[2:,'capital_case_2'] = txt['capital_case'][:len(txt)-2]
  txt.capital_case_2.fillna(False, inplace=True)

  txt.loc[1:,'is_alpha_1'] = txt['is_alpha'][:len(txt)-1]
  txt.is_alpha_1.fillna(True, inplace=True)
  txt.loc[2:,'is_alpha_2'] = txt['is_alpha'][:len(txt)-2]
  txt.is_alpha_2.fillna(True, inplace=True)

  txt.loc[1:,'is_numeric_1'] = txt['is_numeric'][:len(txt)-1]
  txt.is_numeric_1.fillna(False, inplace=True)
  txt.loc[2:,'is_numeric_2'] = txt['is_numeric'][:len(txt)-2]
  txt.is_numeric_2.fillna(False, inplace=True)

  txt.loc[1:,'word_len_1'] = txt['word_len'][:len(txt)-1]
  txt.word_len_1.fillna(0, inplace=True)
  txt.loc[2:,'word_len_2'] = txt['word_len'][:len(txt)-2]
  txt.word_len_2.fillna(0, inplace=True)

  return txt

## Data preparation

In [None]:
# load training data
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna().reset_index()

In [None]:
train_tags = add_tags(train)

In [None]:
# in order to convert POS tags to integers
upos_vocab = train_tags.upos.unique().tolist()
nltk_vocab = train_tags.nltk_tag.unique().tolist()
scapy_vocab = train_tags.scapy_tag.unique().tolist()

train_copy = extract_features(train_tags)
train_copy.to_csv("train_copy.csv")
train_copy.head(20)

Unnamed: 0,index,token,label,bio_only,upos,nltk_tag,scapy_tag,bilou_only,nltk_indices,scapy_indices,is_propn_upos,is_propn_scapy,title_case,capital_case,is_alpha,is_numeric,word_len,scapy_indices_1,scapy_indices_2,is_propn_1,is_propn_2,title_case_1,title_case_2,capital_case_1,capital_case_2,is_alpha_1,is_alpha_2,is_numeric_1,is_numeric_2,word_len_1,word_len_2
0,0,@paulwalk,O,2,NOUN,VB,NNP,3,0,0,False,False,False,False,False,False,9,11.0,11.0,False,False,False,False,False,False,True,True,False,False,0.0,0.0
1,1,It,O,2,PRON,PRP,PRP,3,1,1,False,False,True,False,True,False,2,1.0,11.0,False,False,True,False,False,False,True,True,False,False,2.0,0.0
2,2,'s,O,2,AUX,VBZ,POS,3,2,2,False,False,False,False,False,False,2,2.0,2.0,False,False,False,False,False,False,False,False,False,False,2.0,2.0
3,3,the,O,2,DET,DT,DT,3,3,3,False,False,False,False,True,False,3,3.0,3.0,False,False,False,False,False,False,True,True,False,False,3.0,3.0
4,4,view,O,2,NOUN,NN,NN,3,4,4,False,False,False,False,True,False,4,4.0,4.0,False,False,False,False,False,False,True,True,False,False,4.0,4.0
5,5,from,O,2,ADP,IN,IN,3,5,5,False,False,False,False,True,False,4,5.0,5.0,False,False,False,False,False,False,True,True,False,False,4.0,4.0
6,6,where,O,2,ADV,WRB,WRB,3,6,6,False,False,False,False,True,False,5,6.0,6.0,False,False,False,False,False,False,True,True,False,False,5.0,5.0
7,7,I,O,2,PRON,PRP,PRP,3,1,1,False,False,True,True,True,False,1,1.0,1.0,False,False,True,True,True,True,True,True,False,False,1.0,1.0
8,8,'m,O,2,X,VBP,``,3,7,7,False,False,False,False,False,False,2,7.0,7.0,False,False,False,False,False,False,False,False,False,False,2.0,2.0
9,9,living,O,2,NOUN,VBG,VBG,3,8,8,False,False,False,False,True,False,6,8.0,8.0,False,False,False,False,False,False,True,True,False,False,6.0,6.0


In [None]:
# load development data, add part-of-speech tags and extract features
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos']).dropna().reset_index()

dev_tags = add_tags(dev)
dev_copy = extract_features(dev_tags)
dev_copy.to_csv("dev_copy.csv")
dev_copy.head()

Unnamed: 0,index,token,label,bio_only,upos,nltk_tag,scapy_tag,bilou_only,nltk_indices,scapy_indices,is_propn_upos,is_propn_scapy,title_case,capital_case,is_alpha,is_numeric,word_len,scapy_indices_1,scapy_indices_2,is_propn_1,is_propn_2,title_case_1,title_case_2,capital_case_1,capital_case_2,is_alpha_1,is_alpha_2,is_numeric_1,is_numeric_2,word_len_1,word_len_2
0,0,Stabilized,O,2,PROPN,VBN,VBN,3,23,16,True,False,True,False,True,False,10,11.0,11.0,False,False,False,False,False,False,True,True,False,False,0.0,0.0
1,1,approach,O,2,NOUN,NN,NNP,3,4,0,False,False,False,False,True,False,8,0.0,11.0,False,False,False,False,False,False,True,True,False,False,8.0,0.0
2,2,or,O,2,CCONJ,CC,CC,3,21,21,False,False,False,False,True,False,2,21.0,21.0,False,False,False,False,False,False,True,True,False,False,2.0,2.0
3,3,not,O,2,PART,RB,RB,3,14,14,False,False,False,False,True,False,3,14.0,14.0,False,False,False,False,False,False,True,True,False,False,3.0,3.0
4,4,?,O,2,PUNCT,.,.,3,11,11,False,False,False,False,False,False,1,11.0,11.0,False,False,False,False,False,False,False,False,False,False,1.0,1.0


In [None]:
columns_list = [#'token', 
                #'label',
                #'bio_only',
                #'bilou_only',
                #'upos',
                #'nltk_tags',
                #'scapy_tags',
                #'is_propn_upos',
                'nltk_indices',
                'scapy_indices',
                'is_propn_scapy',
                'title_case',
                'capital_case',
                'is_alpha',
                'is_numeric',
                'word_len',
                'scapy_indices_1',
                'scapy_indices_2',
                'is_propn_1',
                'is_propn_2',
                'title_case_1',
                'title_case_2', 
                'capital_case_1',
                'capital_case_2', 
                'is_alpha_1',
                'is_alpha_2',
                'is_numeric_1',
                'is_numeric_2', 
                'word_len_1',
                'word_len_2',
                ]

In [None]:
# create X_train and Y_train dataframe for training
train_copy = pd.read_csv("train_copy.csv")
X_train = train_copy[columns_list]
Y_train = train_copy['bilou_only']
print(X_train.head())
Y_train.head()

   nltk_indices  scapy_indices  ...  word_len_1  word_len_2
0             0              0  ...         0.0         0.0
1             1              1  ...         2.0         0.0
2             2              2  ...         2.0         2.0
3             3              3  ...         3.0         3.0
4             4              4  ...         4.0         4.0

[5 rows x 22 columns]


0    3
1    3
2    3
3    3
4    3
Name: bilou_only, dtype: int64

## Classification model

In [41]:
# logistic regression model with specified class weights
model = LogisticRegression(max_iter = 1000, random_state=0, class_weight={0:15,1:15,2:15,3:1,4:10}, multi_class='multinomial', penalty='l2', solver='newton-cg').fit(X_train, Y_train)
model

LogisticRegression(C=1.0, class_weight={0: 15, 1: 15, 2: 15, 3: 1, 4: 10},
                   dual=False, fit_intercept=True, intercept_scaling=1,
                   l1_ratio=None, max_iter=1000, multi_class='multinomial',
                   n_jobs=None, penalty='l2', random_state=0,
                   solver='newton-cg', tol=0.0001, verbose=0, warm_start=False)

In [42]:
# create X_dev and Y_dev dataframe for evaluating the classifier model
dev_copy = pd.read_csv("dev_copy.csv")
X_dev = dev_copy[columns_list]
Y_dev = dev_copy['bilou_only']
preds = model.predict(X_dev)

# labels counts in the development set
(unique, counts) = np.unique(preds, return_counts=True)
print('Predicted label, Count of labels')
print(np.asarray((unique, counts)).T)

Predicted label, Count of labels
[[    0   595]
 [    2   182]
 [    3 14346]
 [    4   259]]


In [43]:
# evaluate the classifier model
bio_labs = [reverse_bio(b) for b in dev_copy['bio_only']]
dev_pred = dev_copy.copy()
dev_pred['bio_only'] = bio_labs
bio_preds = bilou_to_bio([reverse_bilou(p) for p in preds])  # we convert BILOU tags into BIO tags
bio_preds = correct_preds(bio_preds)   # we correct predictions to have consistent results
dev_pred['prediction'] = bio_preds
print(dev_pred.head())

print('New evaluation:')
wnut_evaluate(dev_pred)

   Unnamed: 0  index       token  ... word_len_1 word_len_2 prediction
0           0      0  Stabilized  ...        0.0        0.0          O
1           1      1    approach  ...        8.0        0.0          O
2           2      2          or  ...        2.0        2.0          O
3           3      3         not  ...        3.0        3.0          O
4           4      4           ?  ...        1.0        1.0          O

[5 rows x 33 columns]
New evaluation:
Sum of TP and FP = 655
Sum of TP and FN = 754
True positives = 361, False positives = 294, False negatives = 393
Precision = 0.551, Recall = 0.479, F1 = 0.512


## Predictions on the test set

In [None]:
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
testset = pd.read_table(wnuttest, header=None, names=['token', 'upos']).dropna()

test_tags = add_tags(testset)
test_copy = extract_features(test_tags, test=True)
test_copy.to_csv("test_copy.csv")
test_copy.head()

Unnamed: 0,token,upos,nltk_tag,scapy_tag,nltk_indices,scapy_indices,is_propn_upos,is_propn_scapy,title_case,capital_case,is_alpha,is_numeric,word_len,scapy_indices_1,scapy_indices_2,is_propn_1,is_propn_2,title_case_1,title_case_2,capital_case_1,capital_case_2,is_alpha_1,is_alpha_2,is_numeric_1,is_numeric_2,word_len_1,word_len_2
0,&,CCONJ,CC,CC,21,21,False,False,False,False,False,False,1,11.0,11.0,False,False,False,False,False,False,True,True,False,False,0.0,0.0
1,gt,X,NN,NNP,4,0,False,False,False,False,True,False,2,0.0,11.0,False,False,False,False,False,False,True,True,False,False,2.0,0.0
2,;,PUNCT,:,:,15,15,False,False,False,False,False,False,1,15.0,15.0,False,False,False,False,False,False,False,False,False,False,1.0,1.0
3,*,PUNCT,CC,NFP,21,34,False,False,False,False,False,False,1,34.0,34.0,False,False,False,False,False,False,False,False,False,False,1.0,1.0
4,The,DET,DT,DT,3,3,False,False,True,False,True,False,3,3.0,3.0,False,False,True,True,False,False,True,True,False,False,3.0,3.0


In [44]:
X_test = test_copy[columns_list]
preds_test = model.predict(X_test)

bio_preds_test = bilou_to_bio([reverse_bilou(p) for p in preds_test])  # we convert BILOU tags into BIO tags
bio_preds_test = correct_preds(bio_preds_test)   # we correct predictions to have consistent results
testset['prediction'] = bio_preds_test

In [45]:
testset[['token','upos','prediction']].to_csv('test_set.txt', sep=' ', index=False)