# NER with Neural Networks





# Model description

The purpose of this work is to build a neurak-network classifier for Named-Entity Recognition. This work is extracted from the W-NUT 2017 Shared Task, that challenged competitors with unusual and previously unseen named-entities. To do so, data come from user-generated texts from Twitter, YouTube, Reddit and StackExchange.

In this work, we explore two approaches to improve the performance of a simple neural network. In the first one, we compare the impact on performance of the tagging scheme used to tag named-entities. In the second one, we examine which word embedding representation achieves best results. 

The following code consists of three parts:
  - data processing: data are pre-processed so that they can be used as input of our neural network:
      1. NER labels are converted into the tagging scheme chosen
      2. word tokens and NER labels are encoded as integer values
      3. tokens are converted to lower case
      4. data are transformed into padded sequences of the same length
      5. NER label sequences are encoded with a one-hot scheme and weighted to overcome the imbalanced classesissue
  - neural-network model: 
      1. word embedding vectors are computed according to the method chosen
      2. a bi-LSTM neural network is instantiated and fitted with the training data
  - model evaluation: the model is evaluated with an entity-level F1-score
      1. we remove padding from sentences and convert data into their initial table format 
      2. we convert NER labels into the BIO2 scheme
      3. we evaluate the fitted model on the development dataset


All this work is detailed in my report.


NB: in this notebook, the BIO tagging scheme refers to the BIO2 one, described in the report.


## Import libraries

In [1]:
# import usefule libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import copy
from sklearn.utils.class_weight import compute_class_weight

# load Keras and TensorFlow
from tensorflow import keras
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# word embedding models
import gensim.downloader
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, FastText
from gensim import models

# Global variable

In [2]:
ner_scheme = 'IO'

# Data Pre-processing

Load data from the W-NUT 2017 Shared Task

In [3]:
# load training data
wnuttrain = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17train_clean_tagged.txt'
train = pd.read_table(wnuttrain, header=None, names=['token', 'label', 'bio_only', 'upos'])
train.head()

Unnamed: 0,token,label,bio_only,upos
0,@paulwalk,O,O,NOUN
1,It,O,O,PRON
2,'s,O,O,AUX
3,the,O,O,DET
4,view,O,O,NOUN


### Tagging scheme conversion

All the following functions deal with the NER tags. They whether convert the current tagging scheme into another one, convert labels into integers or retrieve labels from integers.

#### BIO functions

NB: in this notebook, the BIO tagging scheme refers to the BIO2 one described in the report.

In [4]:
# training labels: convert BIO to integers
def bio_index(bio):
  ind = bio
  if not pd.isnull(bio):  # deal with empty lines
    if bio=='B':
      ind = 0
    elif bio=='I':
      ind = 1
    elif bio=='O':
      ind = 2
  return ind


# function to convert BIO indices into BIO labels
def reverse_bio(ind):
  bio = 'B'
  if ind==0:
    bio = 'B'
  elif ind==1:
    bio = 'I'
  elif ind==2:
    bio = 'O'
  return bio


# function to rectify BIO predictions
def correct_preds(preds):
  for i in range(len(preds)):
    if i == 0:
      if preds[i] == 'I':
        preds[i] = 'B'
    
    else:
      if preds[i] == 'B' and preds[i-1] != 'O':
        preds[i] = 'I'
      elif preds[i] == 'I' and preds[i-1] == 'O':
        preds[i] = 'B'
  
  return preds

#### BIO to BIO1

In [5]:
# function to convert BIO labels into BIO1 ones
def bio_to_bio1(bio):
  bio1 = bio.copy()
  for i in range(len(bio)):
    if not pd.isnull(bio[i]):
      if bio[i] == 'O':
        bio1[i] = 'O'
      elif bio[i] == 'I':
        bio1[i] = 'I'
      else:
        if i != 0 and bio[i-1] == 'I':
          bio1[i] = 'B'
        else:
          bio1[i] = 'I'
  return bio1


# function to convert BIO1 labels into BIO ones
def bio1_to_bio(bio1):
  bio = bio1.copy()
  for i in range(len(bio1)):
    if not pd.isnull(bio1[i]):
      if bio1[i] == 'O':
        bio[i] = 'O'
      elif bio1[i] == 'B':
        bio[i] = 'B'
      else:
        if i == 0 or (i != 0 and bio1[i-1] in ['O', None]):
          bio[i] = 'B'
        else:
          bio[i] = 'I'
  return bio


# training labels: convert BIO1 to integers
def bio1_index(bio1):
  ind = bio1
  if not pd.isnull(bio1):  # deal with empty lines
    if bio1=='B':
      ind = 0
    elif bio1=='I':
      ind = 1
    elif bio1=='O':
      ind = 2
  return ind


# function to convert BIO1 indices into BIO1 labels
def reverse_bio1(ind):
  bio1 = 'B'
  if ind==0:
    bio1 = 'B'
  elif ind==1:
    bio1 = 'I'
  elif ind==2:
    bio1 = 'O'
  return bio1

#### BIO to IOE1

In [6]:
# function to convert BIO labels into IOE1 ones
def bio_to_ioe1(bio):
  ioe1 = bio.copy()
  for i in range(len(bio)):
    if not pd.isnull(bio[i]):
      if bio[i] == 'O':
        ioe1[i] = 'O'
      else:
        if i != len(bio)-1 and bio[i+1] == 'B':
          ioe1[i] = 'E'
        else:
          ioe1[i] = 'I'
  return ioe1


# function to convert IOE1 labels into BIO ones
def ioe1_to_bio(ioe1):
  bio = ioe1.copy()
  for i in range(len(ioe1)):
    if not pd.isnull(ioe1[i]):
      if ioe1[i] == 'O':
        bio[i] = 'O'
      else:
        if i == 0 or (i != 0 and ioe1[i-1] in ['O', 'E', None]):
          bio[i] = 'B'
        else:
          bio[i] = 'I'
  return bio


# training labels: convert IOE1 to integers
def ioe1_index(ioe1):
  ind = ioe1
  if not pd.isnull(ioe1):  # deal with empty lines
    if ioe1=='I':
      ind = 0
    elif ioe1=='O':
      ind = 1
    elif ioe1=='E':
      ind = 2
  return ind


# function to convert IOE1 indices into IOE1 labels
def reverse_ioe1(ind):
  ioe1 = 'I'
  if ind==0:
    ioe1 = 'I'
  elif ind==1:
    ioe1 = 'O'
  elif ind==2:
    ioe1 = 'E'
  return ioe1

#### BIO to IOE2

In [7]:
# function to convert BIO labels into IOE2 ones
def bio_to_ioe2(bio):
  ioe2 = bio.copy()
  for i in range(len(bio)):
    if not pd.isnull(bio[i]):
      if bio[i] == 'O':
        ioe2[i] = 'O'
      else:
        if i == len(bio)-1 or bio[i+1] != 'I':
          ioe2[i] = 'E'
        else:
          ioe2[i] = 'I'
  return ioe2


# function to convert IOE2 labels into BIO ones
def ioe2_to_bio(ioe2):
  bio = ioe2.copy()
  for i in range(len(ioe2)):
    if not pd.isnull(ioe2[i]):
      if (ioe2[i] == 'O'):
        bio[i] = 'O'
      else:
        if i == 0 or (i != 0 and ioe2[i-1] in ['O', 'E', None]):
          bio[i] = 'B'
        else:
          bio[i] = 'I'
  return bio


# training labels: convert IOE2 to integers
def ioe2_index(ioe2):
  ind = ioe2
  if not pd.isnull(ioe2):  # deal with empty lines
    if ioe2=='I':
      ind = 0
    elif ioe2=='O':
      ind = 1
    elif ioe2=='E':
      ind = 2
  return ind


# function to convert IOE2 indices into IOE2 labels
def reverse_ioe2(ind):
  ioe2 = 'I'
  if ind==0:
    ioe2 = 'I'
  elif ind==1:
    ioe2 = 'O'
  elif ind==2:
    ioe2 = 'E'
  return ioe2

#### BIO to IO

In [8]:
# function to convert BIO labels into IO ones
def bio_to_io(bio):
  io = bio.copy()
  for i in range(len(bio)):
    if not pd.isnull(bio[i]):
      if bio[i] == 'O':
        io[i] = 'O'
      else:
        io[i] = 'I'
  return io


# function to convert IO labels into BIO ones
def io_to_bio(io):
  bio = io.copy()
  for i in range(len(io)):
    if not pd.isnull(io[i]):
      if (io[i] == 'O'):
        bio[i] = 'O'
      else:
        if i == 0 or (i != 0 and io[i-1] != 'I'):
          bio[i] = 'B'
        else:
          bio[i] = 'I'
  return bio


# training labels: convert BIO to integers
def io_index(io):
  ind = io
  if not pd.isnull(io):  # deal with empty lines
    if io=='I':
      ind = 0
    elif io=='O':
      ind = 1
  return ind


# function to convert IO indices into IO labels
def reverse_io(ind):
  io = 'I'
  if ind==0:
    io = 'I'
  elif ind==1:
    io = 'O'
  return io

#### BIO to BILOU

In [9]:
# function to convert BIO labels into BILOU ones
def bio_to_bilou(bio):
  bilou = bio.copy()
  for i in range(len(bio)):
    if not pd.isnull(bio[i]):
      if i == len(bio)-1 :
        if bio[i] == 'B':
          bilou[i] = 'U'
        elif bio[i] == 'I':
          bilou[i] = 'L'
        else:
          bilou[i] = 'O'
      
      else:
        if bio[i] == 'O':
          bilou[i] = 'O'
        elif bio[i] == 'B':
          if bio[i+1] == 'I':
            bilou[i] = 'B'
          else:
            bilou[i] = 'U'
        else:
          if bio[i+1] == 'I':
            bilou[i] = 'I'
          else:
            bilou[i] = 'L'
    
  return bilou


# function to convert BILOU labels into BIO ones
def bilou_to_bio(bilou):
  bio = bilou.copy()
  for i in range(len(bilou)):
    if not pd.isnull(bilou[i]):
      if (bilou[i] == 'B') or (bilou[i] == 'U'):
        bio[i] = 'B'
      elif (bilou[i] == 'I') or (bilou[i] == 'L'):
        bio[i] = 'I'
      else:
        bio[i] = 'O'
  return bio


# training labels: convert BIO to integers
def bilou_index(bilou):
  ind = bilou
  if not pd.isnull(bilou):  # deal with empty lines
    if bilou=='B':
      ind = 0
    elif bilou=='I':
      ind = 1
    elif bilou=='L':
      ind = 2
    elif bilou=='O':
      ind = 3
    elif bilou=='U':
      ind = 4
  return ind


# function to convert BILOU indices into BILOU labels
def reverse_bilou(ind):
  bilou = 'B'
  if ind==0:
    bilou = 'B'
  elif ind==1:
    bilou = 'I'
  elif ind==2:
    bilou = 'L'
  elif ind==3:
    bilou = 'O'
  elif ind==4:
    bilou = 'U'
  return bilou

### Features extraction

The following cell converts word tokens and NER labels into integers, and transform tokens to lower case.

In [10]:
# in order to convert word tokens to integers: list the set of token types
token_vocab = train.token.unique().tolist()
oov = len(token_vocab)  # OOV (out of vocabulary) token as vocab length (because that's max.index + 1)

# convert word tokens to integers
def token_index(tok):
  ind = tok
  if not pd.isnull(tok):  # new since last time: deal with the empty lines which we didn't drop yet
    if tok in token_vocab:  # if token in vocabulary
      ind = token_vocab.index(tok)
    else:  # else it's OOV
      ind = oov
  return ind


# convert word tokens to lower case
def token_lower(tok):
  low = tok
  if not pd.isnull(tok):  # new since last time: deal with the empty lines which we didn't drop yet
    low = tok.lower()
  return low


# pass a data frame through our feature extractor
def extract_features(txt_orig, ner_scheme=ner_scheme, istest=False):
  txt = txt_orig.copy()
  tokinds = [token_index(u) for u in txt['token']]
  txt['token_indices'] = tokinds
  toklows = [token_lower(u) for u in txt['token']]
  txt['token'] = toklows
  if not istest:  # can't do this with the test set
    if (ner_scheme == 'IO'):
      txt['bio_only'] = bio_to_io(txt['bio_only'])
      bioints = [io_index(b) for b in txt['bio_only']]
    elif (ner_scheme == 'BILOU'):
      txt['bio_only'] = bio_to_bilou(txt['bio_only'])
      bioints = [bilou_index(b) for b in txt['bio_only']]
    elif (ner_scheme == 'IOE1'):
      txt['bio_only'] = bio_to_ioe1(txt['bio_only'])
      bioints = [ioe1_index(b) for b in txt['bio_only']]
    elif (ner_scheme == 'IOE2'):
      txt['bio_only'] = bio_to_ioe2(txt['bio_only'])
      bioints = [ioe2_index(b) for b in txt['bio_only']]
    elif (ner_scheme == 'BIO1'):
      txt['bio_only'] = bio_to_bio1(txt['bio_only'])
      bioints = [bio1_index(b) for b in txt['bio_only']]
    elif (ner_scheme == 'BIO'):
      bioints = [bio_index(b) for b in txt['bio_only']]
    txt['bio_only'] = bioints
  return txt

train_copy = extract_features(train)
train_copy.head(n=30)

Unnamed: 0,token,label,bio_only,upos,token_indices
0,@paulwalk,O,1.0,NOUN,0.0
1,it,O,1.0,PRON,1.0
2,'s,O,1.0,AUX,2.0
3,the,O,1.0,DET,3.0
4,view,O,1.0,NOUN,4.0
5,from,O,1.0,ADP,5.0
6,where,O,1.0,ADV,6.0
7,i,O,1.0,PRON,7.0
8,'m,O,1.0,X,8.0
9,living,O,1.0,NOUN,9.0


### Data formatting

The following functions convert table-format data into sequences and pad sequences to make them of the same length. To do so, we find the longuest sequence in all datasets, and add a padding element to sequences to make them of the length of the longuest sequence.

In [11]:
def tokens2sequences(txt_orig,istest=False):
  '''
  Takes panda dataframe as input, copies, and adds a sequence index based on full-stops.
  Outputs a dataframe with sequences of tokens, named entity labels, and token indices as lists.
  '''
  txt = txt_orig.copy()
  txt['sequence_num'] = 0
  seqcount = 0
  for i in txt.index:  # in each row...
    txt.loc[i,'sequence_num'] = seqcount  # set the sequence number
    if pd.isnull(txt.loc[i,'token']):  # increment sequence counter at empty lines
      seqcount += 1
  # now drop the empty lines, group by sequence number and output df of sequence lists
  txt = txt.dropna()
  if istest:  # test set doesn't have labels
    txt_seqs = txt.groupby(['sequence_num'],as_index=False)[['token', 'token_indices']].agg(lambda x: list(x))
  else:
    txt_seqs = txt.groupby(['sequence_num'],as_index=False)[['token', 'bio_only', 'token_indices']].agg(lambda x: list(x))
  return txt_seqs

print("This cell takes a little while to run: be patient :)")
train_seqs = tokens2sequences(train_copy)
train_seqs.head()

This cell takes a little while to run: be patient :)


Unnamed: 0,sequence_num,token,bio_only,token_indices
0,0,"[@paulwalk, it, 's, the, view, from, where, i,...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ..."
1,1,"[from, green, newsfeed, :, ahfa, extends, dead...","[1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, ...","[26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 10...."
2,2,"[pxleyes, top, 50, photography, contest, pictu...","[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46...."
3,3,"[today, is, my, last, day, at, the, office, .]","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]","[51.0, 52.0, 53.0, 23.0, 54.0, 55.0, 3.0, 56.0..."
4,4,"[4dbling, 's, place, til, monday, ,, party, pa...","[0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[57.0, 2.0, 58.0, 59.0, 60.0, 61.0, 62.0, 62.0..."


In [12]:
def find_longest_sequence(txt,longest_seq):
  '''find the longest sequence in the dataframe'''
  seq = ""
  for i in txt.index:
    seqlen = len(txt['token'][i])
    if seqlen > longest_seq:  # update high water mark if new longest sequence encountered
      longest_seq = seqlen
      seq = txt['token'][i]
  return longest_seq, seq

train_longest, train_seq = find_longest_sequence(train_seqs,0)
print('The longest sequence in the training set is %i tokens long:' % train_longest)
print(train_seq)

The longest sequence in the training set is 41 tokens long:
['re', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 're', ':', 'esther', 'sikkimese', 'is', 'now', 'following', 'me', 'on', 'twitter', '!', 'http://t.co/z58brwgxfp', 'thanks', 'a', 'bunch', '!', '103', 'january', '...']


In [13]:
# the dev set
wnutdev = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17dev_clean_tagged.txt'
dev = pd.read_table(wnutdev, header=None, names=['token', 'label', 'bio_only', 'upos'])
dev_copy = extract_features(dev)
dev_seqs = tokens2sequences(dev_copy)
dev_longest, dev_seq = find_longest_sequence(dev_seqs,0)
print('The longest sequence in the dev set is %i tokens long:' % dev_longest)
print(dev_seq)

# the test set
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
test = pd.read_table(wnuttest, header=None, names=['token', 'upos'])
test_copy = extract_features(test, istest=True)
test_seqs = tokens2sequences(test_copy, True)
test_longest, test_seq = find_longest_sequence(test_seqs,0)
print('The longest sequence in the test set is %i tokens long:' % test_longest)
print(test_seq)

The longest sequence in the dev set is 82 tokens long:
['link', 'should', 'not', 'hold', 'knives', 'or', 'dogs', '.', 'what', "'", 's', 'with', 'all', 'that', 'excessive', 'rubbing', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '/', '/', '/']
The longest sequence in the test set is 105 tokens long:
['in', 'order', 'to', 'calculate', 'anything', ',', 'more', 'input', 'data', 'is', 'required', ';', 'such', 'as', ':', '""""', 'from', 'where', 'do', 'you', 'hit', 'the', 'ball', 'and', 'at', 'what', 'angle', '?', 'do', 'you', 'want', 'to', 'take', 'into', 'account', 'effects', 'due', 'to', 'the', 'spin', 'of', 'the', 'ball', '?', 'should', 'friction', 'be', 'included', '?', '""""', 'generally', '(', 'if', 'you', 'do', 'n', "'

In [14]:
# set maximum sequence length
seq_length = max(train_longest, dev_longest, test_longest)

# a new dummy token index, one more than OOV
padtok = oov+1
print('The padding token index is %i' % padtok)

# use pad_sequences, padding or truncating at the end of the sequence (default is 'pre')
train_seqs_padded = pad_sequences(train_seqs['token_indices'].tolist(), maxlen=seq_length,
                                  dtype='int32', padding='post', truncating='post', value=padtok)
print('Example of padded token sequence:')
print(train_seqs_padded[1])

The padding token index is 14802
Example of padded token sequence:
[   26    27    28    29    30    31    32    10    33    34    35    36
    13    37    38 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802 14802
 14802 14802 14802 14802 14802 14802 14802 14802 14802]


NER labels are converted into a one-hot scheme.

In [15]:
# get lists of named entity labels, padded with a null label
if ner_scheme in ['BIO', 'IOE2', 'IOE1','BIO1']:
  padlab = 3
elif ner_scheme == 'IO':
  padlab = 2
elif ner_scheme == 'BILOU':
  padlab = 5

train_labs_padded = pad_sequences(train_seqs['bio_only'].tolist(), maxlen=seq_length,
                                  dtype='int32', padding='post', truncating='post', value=padlab)

# convert those labels to one-hot encoding
n_labs = padlab + 1  # we have 2, 3 or 4 labels: B, I, O (0, 1, 2) + the pad label 3
train_labs_onehot = [to_categorical(i, num_classes=n_labs) for i in train_labs_padded]

# follow the print outputs below to see how the labels are transformed
print('Example of padded label sequence and one-hot encoding (first 10 tokens):')
print(train_seqs.loc[1])
print('Length of input sequence: %i' % len(train_labs_padded[1]))
print('Length of label sequence: %i' % len(train_labs_onehot[1]))
print(train_labs_padded[1][:11])
print(train_labs_onehot[1][:11])

Example of padded label sequence and one-hot encoding (first 10 tokens):
sequence_num                                                     1
token            [from, green, newsfeed, :, ahfa, extends, dead...
bio_only         [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, ...
token_indices    [26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 10....
Name: 1, dtype: object
Length of input sequence: 105
Length of label sequence: 105
[1 1 1 1 0 1 1 1 1 1 1]
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]


In [16]:
# now process the dev set in the same way: padding the tokens & labels, and one-hot encoding the labels
dev_seqs_padded = pad_sequences(dev_seqs['token_indices'].tolist(), maxlen=seq_length,
                                dtype='int32', padding='post', truncating='post', value=padtok)
dev_labs_padded = pad_sequences(dev_seqs['bio_only'].tolist(), maxlen=seq_length,
                                dtype='int32', padding='post', truncating='post', value=padlab)
dev_labs_onehot = [to_categorical(i, num_classes=n_labs) for i in dev_labs_padded]

print('Dev set padded label sequence and one-hot encoding (first 10 tokens):')
print(dev_seqs.loc[2])
print('Length of input sequence: %i' % len(dev_labs_padded[1]))
print('Length of label sequence: %i' % len(dev_labs_onehot[1]))
print(dev_labs_padded[2][:11])
print(dev_labs_onehot[2][:11])

Dev set padded label sequence and one-hot encoding (first 10 tokens):
sequence_num                                                     2
token            [all, i, ', ve, been, doing, is, binge, watchi...
bio_only         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
token_indices    [405.0, 7.0, 573.0, 12927.0, 90.0, 848.0, 52.0...
Name: 2, dtype: object
Length of input sequence: 105
Length of label sequence: 105
[1 1 1 1 1 1 1 1 1 0 0]
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]


NER labels are weighted to overcome the imbalanced classes issue: weights are inversely proportional to classes occurrences in order that common classes have a lower weight.

In [17]:
# use deep copy to ensure we aren't updating original values
train_weights_onehot = copy.deepcopy(train_labs_onehot)

if ner_scheme == 'BIO':
  y_integers = [0]*1964 + [1]*1177 + [2]*59095 + [3]*292139
if ner_scheme == 'BIO1':
  y_integers = [0]*13 + [1]*3128 + [2]*59095 + [3]*292139
elif ner_scheme == 'IOE1':
  y_integers = [0]*3115 + [1]*59095 + [2]*26 + [3]*292139 
elif ner_scheme == 'IOE2':
  y_integers = [0]*1177 + [1]*59095 + [2]*1964 + [3]*292139 
if ner_scheme == 'IO':
  y_integers = [0]*3141 + [1]*59095 + [2]*292139  
elif ner_scheme == 'BILOU':
  y_integers = [0]*786 + [1]*391 + [2]*786 + [3]*59095 + [4]*1178 + [5]*292139  

class_weights = compute_class_weight('balanced', np.unique(y_integers), y_integers)
class_wts = list(dict(enumerate(class_weights)).values())


# apply our weights to the label lists
for i,labs in enumerate(train_weights_onehot):
  for j,lablist in enumerate(labs):
    lablistaslist = lablist.tolist()
    whichismax = lablistaslist.index(max(lablistaslist))
    train_weights_onehot[i][j][whichismax] = class_wts[whichismax]

# what's this like, before and after?
print('Initial one-hot label encoding:')
print(train_labs_onehot[1][:11])

print('Weighted label encoding:')
print(train_weights_onehot[1][:11])

Initial one-hot label encoding:
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
Weighted label encoding:
[[ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [37.60745  0.       0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]
 [ 0.       1.9989   0.     ]]


# Neural network model

### Word embedding representation

When we work with pre-trained word representations, we first load or train these embedding methods with ```gensim``` and create an embedding matrix for our vocabulary list.



In [18]:
embedding_dim = 300

#word2vec = Word2Vec(sentences=list(train_seqs.token), size=embedding_dim, window=5, min_count=1, workers=4)
#word_emb = word2vec.wv

fasttext = FastText(sentences=list(train_seqs.token), size=embedding_dim, window=5, min_count=1, workers=4)
word_emb = fasttext.wv

#word_emb = gensim.downloader.load('word2vec-ruscorpora-300')
#word_emb = gensim.downloader.load('fasttext-wiki-news-subwords-300')
#word_emb = gensim.downloader.load('glove-twitter-25')
#word_emb = gensim.downloader.load('glove-twitter-200')

embedding_matrix = np.zeros((len(token_vocab)+2, embedding_dim))
for i in range(len(token_vocab)):
  try:
    embedding_vector = fasttext_vectors[str(token_vocab[i]).lower()]
    if embedding_vector is not None:
      # words not found in embedding index will be all-zeros.
      embedding_matrix[i] = embedding_vector
  except:
    continue

### Neural Network Classifier

Definition of our neural network classifier:

We first prepare the input data as numpy arrays, list the metrics for model evaluation, define some important hyperparameters, then build the model layer by layer. It's a sequential model with an embedding layer followed by bidirectional LSTM, then a dropout layer before a final dense layer with softmax activation. We have early stopping criteria to halt training if improvements are not seen after 10 epochs.

In [19]:
# prepare sequences and labels as numpy arrays, check dimensions
X = np.array(train_seqs_padded)
y = np.array(train_weights_onehot)
print('Input sequence dimensions (n.docs, seq.length):')
print(X.shape)
print('Label dimensions (n.docs, seq.length, one-hot encoding of 4 NER labels):')
print(y.shape)

# our final vocab size is the padding token + 1 (OR length of vocab + OOV + PAD)
vocab_size = padtok+1
print(vocab_size==len(token_vocab)+2)
embed_size = 128  # try an embedding size of 128 (could tune this)

# list of metrics to use: true & false positives, negatives, accuracy, precision, recall, area under the curve
METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
]

# our model has the option for an label prediction bias, it's sequential, starts with an embedding layer, then bi-LSTM,
# a dropout layer follows for regularisation, and a dense final layer with softmax activation to output class probabilities
# we compile with the Adam optimizer at a low learning rate, use categorical cross-entropy as our loss function
def make_model(metrics = METRICS, output_bias=None):
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)

  if word_emb == None:
      model = keras.Sequential([
        keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=seq_length, mask_zero=True, trainable=True),
        keras.layers.Bidirectional(keras.layers.LSTM(units=50, return_sequences=True, dropout=0.2, recurrent_dropout=0)),  # 2 directions, 50 units each, concatenated (can change this)
        keras.layers.Dropout(0.5),
        keras.layers.TimeDistributed(keras.layers.Dense(n_labs, activation='softmax', bias_initializer=output_bias)),
      ])
  else:  # if we use pre-trained word representations, we provide the embedding layer with our embedding matrix created previously
    model = keras.Sequential([
      keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[embedding_matrix], input_length=seq_length, trainable=True),
      keras.layers.Bidirectional(keras.layers.LSTM(units=50, return_sequences=True, dropout=0.2, recurrent_dropout=0)),  # 2 directions, 50 units each, concatenated (can change this)
      keras.layers.Dropout(0.5),
      keras.layers.TimeDistributed(keras.layers.Dense(n_labs, activation='softmax', bias_initializer=output_bias)),
    ])
  model.compile(optimizer=keras.optimizers.Adam(lr=1e-3), loss=keras.losses.CategoricalCrossentropy(), metrics=metrics)
  return model

# early stopping criteria based on area under the curve: will stop if no improvement after 10 epochs
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_auc', verbose=1, patience=10, mode='max', restore_best_weights=True)

# the number of training epochs we'll use, and the batch size (how many texts are input at once)
EPOCHS = 100
BATCH_SIZE = 32

print('\n**Defining a neural network**')
model = make_model()
model.summary()

Input sequence dimensions (n.docs, seq.length):
(3375, 105)
Label dimensions (n.docs, seq.length, one-hot encoding of 4 NER labels):
(3375, 105, 3)
True

**Defining a neural network**
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 105, 300)          4440900   
_________________________________________________________________
bidirectional (Bidirectional (None, 105, 100)          140400    
_________________________________________________________________
dropout (Dropout)            (None, 105, 100)          0         
_________________________________________________________________
time_distributed (TimeDistri (None, 105, 3)            303       
Total params: 4,581,603
Trainable params: 4,581,603
Non-trainable params: 0
_________________________________________________________________


Because our dataset is highly imbalanced, we try to help the situation by setting an initial bias based on the label distribution in our dataset.

In [20]:
# figure out the label distribution in our fixed-length texts
all_labs = [l for lab in train_labs_padded for l in lab]
label_count = Counter(all_labs)
total_labs = len(all_labs)
print(label_count)
print(total_labs)

# use this to define an initial model bias
initial_bias=[]
for i in range(len(label_count)):
  initial_bias.append(label_count[i]/total_labs)
print('Initial bias:')
print(initial_bias)

# pass the bias to the model and re-evaluate
model = make_model(output_bias=initial_bias)
results = model.evaluate(X, y, batch_size=BATCH_SIZE, verbose=0)
print("Loss: {:0.4f}".format(results[0]))

Counter({2: 292139, 1: 59095, 0: 3141})
354375
Initial bias:
[0.008863492063492063, 0.1667583774250441, 0.8243781305114638]
Loss: 1.1642


We fit our model with our training set and use the development set to evaluate metrics during training.

In [39]:
# prepare the dev sequences and labels as numpy arrays
dev_X = np.array(dev_seqs_padded)
dev_y = np.array(dev_labs_onehot)

# re-initiate model with bias
model = make_model(output_bias=initial_bias)

# and fit...
model.fit(X, y, batch_size=BATCH_SIZE, epochs=EPOCHS, callbacks = [early_stopping], validation_data=(dev_X, dev_y))

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 00012: early stopping


<tensorflow.python.keras.callbacks.History at 0x7ffb764d8ac8>

# Model evaluation

### Data processing

Predictions on the development set and labels distribution in the predictions

In [40]:
# use argmax to figure out the class with highest probability per token
preds = np.argmax(model.predict(dev_seqs_padded), axis=-1)
flat_preds = [p for pred in preds for p in pred]
print(Counter(flat_preds))

Counter({2: 88890, 1: 13430, 0: 1945})


We remove padding elements from sequences to retrieve the original sequence length.

In [41]:
# start a new column for the model predictions
dev_seqs['prediction'] = ''

# for each text: get original sequence length and trim predictions accordingly
# (_trim_ because we know that our seq length is longer than the longest seq in dev)
for i in dev_seqs.index:
  this_seq_length = len(dev_seqs['token'][i])
  dev_seqs['prediction'][i] = preds[i][:this_seq_length].astype(int)

dev_seqs.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,sequence_num,token,bio_only,token_indices,prediction
0,0,"[stabilized, approach, or, not, ?, that, ´, s,...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[14801.0, 10361.0, 414.0, 556.0, 131.0, 1740.0...","[0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]"
1,1,"[you, should, ', ve, stayed, on, redondo, beac...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...","[151.0, 1018.0, 573.0, 12927.0, 9346.0, 137.0,...","[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ..."
2,2,"[all, i, ', ve, been, doing, is, binge, watchi...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[405.0, 7.0, 573.0, 12927.0, 90.0, 848.0, 52.0...","[1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0]"
3,3,"[wow, emma, and, kaite, is, so, very, cute, an...","[1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","[4777.0, 14801.0, 113.0, 14801.0, 52.0, 79.0, ...","[1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ..."
4,4,"[this, is, so, good]","[1.0, 1.0, 1.0, 1.0]","[2239.0, 1567.0, 1089.0, 9176.0]","[1, 1, 1, 1]"


We convert data into their original table format.

In [42]:
# use sequence number as the index and apply pandas explode to all other columns
dev_long = dev_seqs.set_index('sequence_num').apply(pd.Series.explode).reset_index()
dev_long.head()

Unnamed: 0,sequence_num,token,bio_only,token_indices,prediction
0,0,stabilized,1,14801,0
1,0,approach,1,10361,1
2,0,or,1,414,1
3,0,not,1,556,1
4,0,?,1,131,1


We convert NER labels into the BIO (BIO2) tagging scheme, because it is the one used to label the development and test sets.


In [43]:
if (ner_scheme == 'IO'):
  bio_labs = io_to_bio([reverse_io(b) for b in dev_long['bio_only']])
  dev_long['bio_only'] = bio_labs
  pred_labs = io_to_bio([reverse_io(b) for b in dev_long['prediction']])
elif (ner_scheme == 'BIO'):
  bio_labs = [reverse_bio(b) for b in dev_long['bio_only']]
  dev_long['bio_only'] = bio_labs
  pred_labs = [reverse_bio(b) for b in dev_long['prediction']]
  pred_labs = correct_preds(pred_labs)
elif (ner_scheme == 'BIO1'):
  bio_labs = bio1_to_bio([reverse_bio1(b) for b in dev_long['bio_only']])
  dev_long['bio_only'] = bio_labs
  pred_labs = bio1_to_bio([reverse_bio1(b) for b in dev_long['prediction']])
elif (ner_scheme == 'IOE1'):
  bio_labs = ioe1_to_bio([reverse_ioe1(b) for b in dev_long['bio_only']])
  dev_long['bio_only'] = bio_labs
  pred_labs = ioe1_to_bio([reverse_ioe1(b) for b in dev_long['prediction']])
elif (ner_scheme == 'IOE2'):
  bio_labs = ioe2_to_bio([reverse_ioe2(b) for b in dev_long['bio_only']])
  dev_long['bio_only'] = bio_labs
  pred_labs = ioe2_to_bio([reverse_ioe2(b) for b in dev_long['prediction']])
else:
  bio_labs = bilou_to_bio([reverse_bilou(b) for b in dev_long['bio_only']])
  dev_long['bio_only'] = bio_labs
  pred_labs = bilou_to_bio([reverse_bilou(b) for b in dev_long['prediction']])

dev_long['prediction'] = pred_labs

dev_long.head()
dev_long.prediction.value_counts()

O    13430
B     1314
I      638
Name: prediction, dtype: int64

### Evaluation of predictions

This function aims at computing the precision, recall and F1-score metrics of our model. We use entity-level measures in order to reward only once multi-token named-entities.

In [44]:
# evaluation function
def wnut_evaluate(txt):
  '''row by row entity evaluation: we evaluate by whole named entities'''
  tp = 0; fp = 0; fn = 0
  in_entity = 0
  for i in txt.index:
    if txt['prediction'][i]=='B' and txt['bio_only'][i]=='B':
      if in_entity==1:  # if there's a preceding named entity which didn't have intervening O...
        tp += 1  # count a true positive
      in_entity = 1  # start tracking this entity (don't count it until we know full span of entity)
    elif txt['prediction'][i]=='B':
      fp += 1  # if not a B in gold annotations, it's a false positive
      in_entity = 0
    elif txt['prediction'][i]=='I' and txt['bio_only'][i]=='I':
      next  # correct entity continuation: do nothing
    elif txt['prediction'][i]=='I' and txt['bio_only'][i]=='B':
      fn += 1  # if a new entity should have begun, it's a false negative
      in_entity = 0
    elif txt['prediction'][i]=='I':  # if gold is O...
      if in_entity==1:  # and if tracking an entity, then the span is too long
        fp += 1  # it's a false positive
      in_entity = 0
    elif txt['prediction'][i]=='O':
      if txt['bio_only'][i]=='B':
        fn += 1  # false negative if there's B in gold but no predicted B
        if in_entity==1:  # also check if there was a named entity in progress
          tp += 1  # count a true positive
      elif txt['bio_only'][i]=='I':
        if in_entity==1:  # if this should have been a continued named entity, the span is too short
          fn += 1  # count a false negative
      elif txt['bio_only'][i]=='O':
        if in_entity==1:  # if a named entity has ended in right place
          tp += 1  # count a true positive
      in_entity = 0

  if in_entity==1:  # catch any final named entity
    tp += 1

  prec = tp / (tp+fp)
  rec = tp / (tp+fn)
  f1 = (2*(prec*rec)) / (prec+rec)
  print('Sum of TP and FP = %i' % (tp+fp))
  print('Sum of TP and FN = %i' % (tp+fn))
  print('True positives = %i, False positives = %i, False negatives = %i' % (tp, fp, fn))
  print('Precision = %.3f, Recall = %.3f, F1 = %.3f' % (prec, rec, f1))

In [45]:
wnut_evaluate(dev_long)

Sum of TP and FP = 1240
Sum of TP and FN = 760
True positives = 339, False positives = 901, False negatives = 421
Precision = 0.273, Recall = 0.446, F1 = 0.339


# Predictions on the test set

### Predictions on the test set

In [46]:
wnuttest = 'https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_clean_tagged.txt'
test = pd.read_table(wnuttest, header=None, names=['token', 'upos'])

test_copy = extract_features(test, istest=True)
test_seqs = tokens2sequences(test_copy, True)

In [47]:
# now process the dev set in the same way: padding the tokens & labels, and one-hot encoding the labels
test_seqs_padded = pad_sequences(test_seqs['token_indices'].tolist(), maxlen=seq_length,
                                dtype='int32', padding='post', truncating='post', value=padtok)

print('Dev set padded label sequence and one-hot encoding (first 10 tokens):')
print(test_seqs.loc[2])

Dev set padded label sequence and one-hot encoding (first 10 tokens):
sequence_num                                                     2
token            [&, gt, ;, *, the, army, on, thursday, recover...
token_indices    [14801.0, 14801.0, 1625.0, 1743.0, 191.0, 1480...
Name: 2, dtype: object


In [48]:
test_X = np.array(test_seqs_padded)

Predictions on the test set and labels distribution in the predictions

In [49]:
# use argmax to figure out the class with highest probability per token
preds_test = np.argmax(model.predict(test_seqs_padded), axis=-1)
flat_preds_test = [p for pred in preds_test for p in pred]
print(Counter(flat_preds_test))

Counter({2: 111427, 1: 19180, 0: 4108})


We remove padding elements from sequences to retrieve the original sequence length.

In [50]:
# start a new column for the model predictions
test_seqs['prediction'] = ''

# for each text: get original sequence length and trim predictions accordingly
# (_trim_ because we know that our seq length is longer than the longest seq in dev)
for i in test_seqs.index:
  this_seq_length = len(test_seqs['token'][i])
  test_seqs['prediction'][i] = preds_test[i][:this_seq_length].astype(int)

test_seqs.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,sequence_num,token,token_indices,prediction
0,0,"[&, gt, ;, *, the, soldier, was, killed, when,...","[14801.0, 14801.0, 1625.0, 1743.0, 191.0, 1480...","[0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, ..."
1,1,"[&, gt, ;, *, police, last, week, evacuated, 8...","[14801.0, 14801.0, 1625.0, 1743.0, 14801.0, 23...","[0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, ..."
2,2,"[&, gt, ;, *, the, army, on, thursday, recover...","[14801.0, 14801.0, 1625.0, 1743.0, 191.0, 1480...","[0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, ..."
3,3,"[&, gt, ;, *, the, four, civilians, killed, in...","[14801.0, 14801.0, 1625.0, 1743.0, 191.0, 4012...","[0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, ..."
4,4,"[the, bodies, of, the, soldiers, were, recover...","[191.0, 14801.0, 45.0, 3.0, 14801.0, 225.0, 14...","[0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, ..."


We convert data into their original table format.

In [51]:
# use sequence number as the index and apply pandas explode to all other columns
test_long = test_seqs.set_index('sequence_num').apply(pd.Series.explode).reset_index()
test_long.head()

Unnamed: 0,sequence_num,token,token_indices,prediction
0,0,&,14801,0
1,0,gt,14801,0
2,0,;,1625,1
3,0,*,1743,1
4,0,the,191,0


We convert NER labels into the BIO (BIO2) tagging scheme, because it is the one used to label the development and test sets.


In [52]:
if (ner_scheme == 'IO'):
  test_pred_labs = io_to_bio([reverse_io(b) for b in test_long['prediction']])
elif (ner_scheme == 'BIO'):
  test_pred_labs = [reverse_bio(b) for b in test_long['prediction']]
  test_pred_labs = correct_preds(test_pred_labs)
elif (ner_scheme == 'BIO1'):
  test_pred_labs = bio1_to_bio([reverse_bio1(b) for b in dev_long['prediction']])
elif (ner_scheme == 'IOE1'):
  test_pred_labs = ioe1_to_bio([reverse_ioe1(b) for b in dev_long['prediction']])
elif (ner_scheme == 'IOE2'):
  test_pred_labs = ioe2_to_bio([reverse_ioe2(b) for b in test_long['prediction']])
else:
  test_pred_labs = bilou_to_bio([reverse_bilou(b) for b in test_long['prediction']])

test_long['prediction'] = test_pred_labs

print(test_long.head())
print(test_long.prediction.value_counts())

   sequence_num token token_indices prediction
0             0     &         14801          B
1             0    gt         14801          I
2             0     ;          1625          O
3             0     *          1743          O
4             0   the           191          B
O    19180
B     2764
I     1379
Name: prediction, dtype: int64


We add our predictions to the initial test dataset.

In [53]:
j = 0
test['prediction'] = None
for i in range(len(test)):
  if str(test.token[i]) != "nan":  # all NaN lines have been removed in test_long
    test.loc[i,'prediction'] = test_long.loc[j,'prediction']
    j += 1
print(test)

       token   upos prediction
0          &  CCONJ          B
1         gt      X          I
2          ;  PUNCT          O
3          *  PUNCT          O
4        The    DET          B
...      ...    ...        ...
24601   this   PRON          O
24602  dress   NOUN          O
24603   code   NOUN          O
24604      😂    SYM          O
24605    NaN    NaN       None

[24606 rows x 3 columns]


In [54]:
test.to_csv("/content/drive/My Drive/Cambridge/NER/test_preds.csv", index=False)

### Evaluation on the test set

In [55]:
test_tag_file = "https://storage.googleapis.com/wnut-2017_ner-shared-task/wnut17test_annotated_clean_tagged.txt"
test_tag = pd.read_table(test_tag_file, header=None, names=['token','bio','bio_only','upos'])
test_tag['prediction'] = test['prediction']
test_tag.head()

Unnamed: 0,token,bio,bio_only,upos,prediction
0,&,O,O,CCONJ,B
1,gt,O,O,X,I
2,;,O,O,PUNCT,O
3,*,O,O,PUNCT,O
4,The,O,O,DET,B


In [56]:
wnut_evaluate(test_tag)

Sum of TP and FP = 2682
Sum of TP and FN = 951
True positives = 479, False positives = 2203, False negatives = 472
Precision = 0.179, Recall = 0.504, F1 = 0.264
