# Named Entity Recognition

This notebook is based on the [sklearn_crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html).

We start with loading all the necessary packages:
* **sklearn_crfsuite** for the implementation of the Conditional Random Field
* **pandas** for displaying results as dataframes
* **eli5** for inspecting the learned parameters

In [1]:
import sklearn_crfsuite
import eli5
import pandas as pd
from sklearn_crfsuite import metrics
from sklearn_crfsuite.metrics import flat_classification_report



# Corpus

In this notebook we work with the CoNLL-2003 corpus, which was obtained from https://github.com/Franck-Dernoncourt/NeuroNER/

First, we take a look at the data to inspect the data format.

In [2]:
def print_input_file(filename, lines=10):
    """Print the first `lines` lines of a file, stripped of surrounding whitespace."""
    with open(filename, 'r') as file:
        for line_number, line in enumerate(file):
            if line_number >= lines:
                break
            print(line.strip())

print_input_file('train.txt')

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O


Next, we load the data into train and test corpora and inspect the raw sentences.

In [3]:
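def load_file(filename):
    """Read a CoNLL-style file into a list of sentences of (token, POS tag, NER label) tuples."""
    text = []
    with open(filename, 'r') as f:
        sentence = []
        for line in f:
            # Document markers and blank lines end the current sentence.
            if line.startswith('-DOCSTART-') or len(line.strip()) == 0:
                if len(sentence) > 0:
                    text.append(sentence)
                sentence = []
            else:
                # Columns: token, POS tag, chunk tag, NER label (the chunk tag is dropped).
                columns = line.strip().split(' ')
                sentence.append((columns[0], columns[1], columns[3]))
        # Keep the last sentence if the file does not end with a blank line.
        if len(sentence) > 0:
            text.append(sentence)
    return text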

train = load_file('train.txt')
test = load_file('test.txt')

In [4]:
def print_text(corpus, amount=1):
    for sentence in corpus[:amount]:
        print([l[0] for l in sentence])

print_text(train,4)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['Peter', 'Blackburn']
['BRUSSELS', '1996-08-22']
['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.']


Next, we inspect the loaded tuples (token, POS tag, NER label) of the first sentence.

In [5]:
def print_sentence(corpus, idx=0):
    for token in corpus[idx]:
        print(token)
        
print_sentence(train)

('EU', 'NNP', 'B-ORG')
('rejects', 'VBZ', 'O')
('German', 'JJ', 'B-MISC')
('call', 'NN', 'O')
('to', 'TO', 'O')
('boycott', 'VB', 'O')
('British', 'JJ', 'B-MISC')
('lamb', 'NN', 'O')
('.', '.', 'O')


# Compute features

In order to allow the CRF to learn how to distinguish between different Named Entities, we have to compute features for the individual words.

For the time being, we select information about the word itself, such as
* the lowercased word
* the suffix of the word (its last three characters)
* the shape of the word (upper case, title case, digits)
* the POS tag

We also add information about the word before and the word after. In addition, we mark whether the word is at the beginning (`BOS`) or the end (`EOS`) of the sentence.

In [6]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features

For the first word, the features look as follows:

In [7]:
word2features(train[0], 0)

{'word.lower()': 'eu',
 'word[-3:]': 'EU',
 'word.isupper()': True,
 'word.istitle()': False,
 'word.isdigit()': False,
 'postag': 'NNP',
 'postag[:2]': 'NN',
 'BOS': True,
 '+1:word.lower()': 'rejects',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'VBZ',
 '+1:postag[:2]': 'VB'}

For the first sentence, the features look like:

In [8]:
def sentence2features(corpus, sent_idx):
    sentence_features = []
    for i in range(len(corpus[sent_idx])):
        sentence_features.append(word2features(corpus[sent_idx], i))
    return sentence_features

sentence2features(train, 0)    

[{'word.lower()': 'eu',
  'word[-3:]': 'EU',
  'word.isupper()': True,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  'BOS': True,
  '+1:word.lower()': 'rejects',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:postag': 'VBZ',
  '+1:postag[:2]': 'VB'},
 {'word.lower()': 'rejects',
  'word[-3:]': 'cts',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'VBZ',
  'postag[:2]': 'VB',
  '-1:word.lower()': 'eu',
  '-1:word.istitle()': False,
  '-1:word.isupper()': True,
  '-1:postag': 'NNP',
  '-1:postag[:2]': 'NN',
  '+1:word.lower()': 'german',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'JJ',
  '+1:postag[:2]': 'JJ'},
 {'word.lower()': 'german',
  'word[-3:]': 'man',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'JJ',
  'postag[:2]': 'JJ',
  '-1:word.lower()': 'rejects',
  '-1:word.istitle()': False,
  '-

## Computing features for the entire corpus

In [9]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train]
y_train = [sent2labels(s) for s in train]

X_test = [sent2features(s) for s in test]
y_test = [sent2labels(s) for s in test]

In [10]:
X_train[0]

[{'word.lower()': 'eu',
  'word[-3:]': 'EU',
  'word.isupper()': True,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  'BOS': True,
  '+1:word.lower()': 'rejects',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:postag': 'VBZ',
  '+1:postag[:2]': 'VB'},
 {'word.lower()': 'rejects',
  'word[-3:]': 'cts',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'VBZ',
  'postag[:2]': 'VB',
  '-1:word.lower()': 'eu',
  '-1:word.istitle()': False,
  '-1:word.isupper()': True,
  '-1:postag': 'NNP',
  '-1:postag[:2]': 'NN',
  '+1:word.lower()': 'german',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'JJ',
  '+1:postag[:2]': 'JJ'},
 {'word.lower()': 'german',
  'word[-3:]': 'man',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'JJ',
  'postag[:2]': 'JJ',
  '-1:word.lower()': 'rejects',
  '-1:word.istitle()': False,
  '-

# Training of the CRF

We train the CRF on the computed features using L-BFGS, a gradient-based quasi-Newton optimisation method, with elastic net (combined L1 and L2) regularisation controlled by `c1` and `c2`.
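
As a rough sketch of what is being optimised: the formula below is the textbook elastic-net-regularised negative log-likelihood of a linear-chain CRF. The symbols $\mathbf{w}$ (weights), $\mathbf{f}$ (feature function) and $Z$ (normaliser) are notation introduced here, and the exact scaling of the two penalty terms inside CRFsuite is an assumption rather than something checked in this notebook.

$$
\min_{\mathbf{w}} \; -\sum_{n=1}^{N} \log p\left(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \mathbf{w}\right) + c_1 \lVert \mathbf{w} \rVert_1 + c_2 \lVert \mathbf{w} \rVert_2^2,
\qquad
p(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = \frac{\exp\left(\sum_{t} \mathbf{w}^\top \mathbf{f}(y_{t-1}, y_t, \mathbf{x}, t)\right)}{Z(\mathbf{x}; \mathbf{w})}
$$

With `c1=0.1` and `c2=0.1`, the L1 term pushes many feature weights to exactly zero, while the L2 term keeps the remaining weights small.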

In [11]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1, 
    c2=0.1, 
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train);

CPU times: user 16.2 s, sys: 191 ms, total: 16.4 s
Wall time: 16.4 s




CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=False,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=20,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

CRFsuite CRF models use two kinds of features: state features and transition features. Let's check their weights using `eli5.show_weights`:

In [12]:
eli5.show_weights(crf, top=10)



| From \ To | O | B-LOC | I-LOC | B-MISC | I-MISC | B-ORG | I-ORG | B-PER | I-PER |
|---|---|---|---|---|---|---|---|---|---|
| O | 4.066 | 1.535 | 0.0 | 1.73 | 0.0 | 1.777 | 0.0 | 2.204 | 0.0 |
| B-LOC | 0.952 | -1.435 | 5.22 | -0.143 | 0.0 | -0.348 | 0.0 | -1.955 | 0.0 |
| I-LOC | -0.071 | -0.494 | 1.091 | -0.304 | 0.0 | -0.672 | 0.0 | 0.0 | 0.0 |
| B-MISC | 1.009 | -1.368 | 0.0 | -0.807 | 4.41 | -0.241 | 0.0 | -0.175 | 0.0 |
| I-MISC | -0.246 | -0.807 | 0.0 | -0.439 | 3.323 | -0.386 | 0.0 | -0.878 | 0.0 |
| B-ORG | 1.267 | -2.203 | 0.0 | -1.306 | 0.0 | -0.299 | 5.805 | -2.039 | 0.0 |
| I-ORG | 0.0 | -1.758 | 0.0 | -0.987 | 0.0 | -1.126 | 5.303 | -2.005 | 0.0 |
| B-PER | 0.971 | -1.253 | 0.0 | -1.79 | 0.0 | 0.0 | 0.0 | 0.0 | 6.866 |
| I-PER | -0.19 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.189 |
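
The transition weights in the table above can also be ranked programmatically. The sketch below works on a small dictionary of values copied from that table so that it runs on its own. `rank_transitions` is a hypothetical helper name; with the trained model, the full dictionary should be available as `crf.transition_features_`.

```python
from collections import Counter

# A few transition weights copied from the table above, keyed as
# (label_from, label_to). Assumption: with the trained model from this
# notebook, the complete mapping is available as crf.transition_features_.
transition_weights = {
    ('B-PER', 'I-PER'): 6.866,
    ('B-ORG', 'I-ORG'): 5.805,
    ('I-ORG', 'I-ORG'): 5.303,
    ('B-LOC', 'I-LOC'): 5.220,
    ('O', 'O'): 4.066,
    ('B-ORG', 'B-LOC'): -2.203,
    ('B-ORG', 'B-PER'): -2.039,
    ('I-ORG', 'B-PER'): -2.005,
}


def rank_transitions(weights, n=3):
    """Return the n highest-weighted and the n lowest-weighted transitions."""
    ranked = Counter(weights).most_common()  # sorted by weight, descending
    return ranked[:n], ranked[-n:]


top, bottom = rank_transitions(transition_weights)
for (label_from, label_to), weight in top + bottom:
    print(f'{label_from:6} -> {label_to:6} {weight: .3f}')
```

High positive weights such as B-PER to I-PER show that the model has learned to continue an entity it has just started, while strongly negative weights such as B-ORG to B-LOC penalise starting a new entity of a different type right after another one begins.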

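The following tables list the highest-weighted state features per label. They appear in the same label order as the columns of the transition table (O, B-LOC, I-LOC, B-MISC, I-MISC, B-ORG, I-ORG, B-PER, I-PER).

Weight?,Feature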
+5.048,word[-3:]:day
+3.755,EOS
+3.330,BOS
+2.627,postag[:2]:PR
+2.478,word[-3:]:ent
+2.220,word[-3:]:ber
+2.209,postag:PRP
+2.129,word.lower():thursday
… 13462 more positive …,… 13462 more positive …
… 3960 more negative …,… 3960 more negative …

Weight?,Feature
+3.057,-1:word.lower():at
+2.185,word[-3:]:and
+1.935,word.lower():u.s.
+1.924,word[-3:]:.S.
+1.774,+1:word.lower():1996-08-28
+1.599,word.lower():london
+1.527,word.lower():england
+1.466,postag:NNP
+1.435,word[-3:]:ain
+1.416,word.isupper()

Weight?,Feature
+1.175,-1:word.lower():new
+1.143,word[-3:]:ica
+1.047,-1:word.lower():south
+0.912,word.lower():kong
+0.910,-1:word.lower():hong
+0.877,-1:word.lower():united
+0.857,word.lower():states
+0.840,word[-3:]:lic
… 947 more positive …,… 947 more positive …
… 193 more negative …,… 193 more negative …

Weight?,Feature
+2.833,word[-3:]:ian
+2.739,word[-3:]:ish
+1.824,postag:NNPS
+1.681,word.isupper()
+1.633,+1:word.lower():league
+1.345,word.istitle()
+1.300,word[-3:]:can
+1.260,word.lower():german
+1.248,word[-3:]:ese
+1.175,+1:word.lower():open

Weight?,Feature
+1.567,word.lower():cup
+1.351,word.lower():league
+1.280,word.lower():open
+1.240,word[-3:]:Cup
+0.897,word[-3:]:pen
+0.845,-1:word.lower():world
+0.738,word[-3:]:gue
+0.718,word.lower():division
+0.664,word[-3:]:ION
+0.658,-1:word.istitle()

Weight?,Feature
+2.419,+1:word.lower():3
+1.880,word.isupper()
+1.598,+1:word.lower():2
+1.376,+1:word.lower():0
+1.345,+1:word.lower():1
+1.203,-1:word.lower():1
+1.188,word[-3:]:ire
+1.117,-1:word.lower():0
+1.101,BOS
… 4262 more positive …,… 4262 more positive …

Weight?,Feature
+1.220,+1:word.lower():3
+1.193,word[-3:]:ion
+1.029,+1:word.lower():0
+0.935,+1:word.lower():1
+0.894,+1:word.lower():2
… 3491 more positive …,… 3491 more positive …
… 502 more negative …,… 502 more negative …
-0.908,postag[:2]:VB
-1.002,postag:NNS
-1.029,EOS

Weight?,Feature
+1.555,word.istitle()
+1.480,-1:word.lower():minister
+1.460,-1:word.lower():president
… 6378 more positive …,… 6378 more positive …
… 569 more negative …,… 569 more negative …
-1.089,word[-3:]:and
-1.177,-1:postag[:2]:DT
-1.177,-1:postag:DT
-1.301,+1:word.lower():)
-1.305,+1:postag[:2]:)

Weight?,Feature
+1.021,-1:word.istitle()
… 5094 more positive …,… 5094 more positive …
… 418 more negative …,… 418 more negative …
-0.918,postag:JJ
-0.988,postag[:2]:JJ
-1.003,+1:postag:NNS
-1.023,+1:postag:NN
-1.083,word[-3:]:ion
-1.363,postag:NNS
-1.382,+1:postag[:2]:NN


# Predict and Evaluate

In [13]:
y_pred = crf.predict(X_test)

def show_sentence_results(corpus, prediction, idx):
    df = pd.DataFrame(corpus[idx])
    df['prediction'] = prediction[idx]
    return df

show_sentence_results(test, y_pred, 0)

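| | 0 | 1 | 2 | prediction |
|---|---|---|---|---|
| 0 | SOCCER | NN | O | O |
| 1 | - | : | O | O |
| 2 | JAPAN | NNP | B-LOC | O |
| 3 | GET | VB | O | O |
| 4 | LUCKY | NNP | O | B-LOC |
| 5 | WIN | NNP | O | I-LOC |
| 6 | , | , | O | O |
| 7 | CHINA | NNP | B-PER | B-LOC |
| 8 | IN | IN | O | O |
| 9 | SURPRISE | DT | O | O |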


In [14]:
y_pred_train = crf.predict(X_train)

show_sentence_results(train, y_pred_train, 0)

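| | 0 | 1 | 2 | prediction |
|---|---|---|---|---|
| 0 | EU | NNP | B-ORG | B-ORG |
| 1 | rejects | VBZ | O | O |
| 2 | German | JJ | B-MISC | B-MISC |
| 3 | call | NN | O | O |
| 4 | to | TO | O | O |
| 5 | boycott | VB | O | O |
| 6 | British | JJ | B-MISC | B-MISC |
| 7 | lamb | NN | O | O |
| 8 | . | . | O | O |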


In [15]:
print(flat_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       B-LOC       0.62      0.73      0.67      1668
      B-MISC       0.62      0.66      0.64       702
       B-ORG       0.74      0.46      0.56      1661
       B-PER       0.69      0.79      0.74      1617
       I-LOC       0.58      0.51      0.54       257
      I-MISC       0.28      0.64      0.39       216
       I-ORG       0.67      0.47      0.55       835
       I-PER       0.76      0.90      0.83      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.66      0.68      0.66     46435
weighted avg       0.93      0.92      0.92     46435



In [16]:
labels = list(crf.classes_)

sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

           O       0.98      0.98      0.98     38323
       B-LOC       0.62      0.73      0.67      1668
       I-LOC       0.58      0.51      0.54       257
      B-MISC       0.62      0.66      0.64       702
      I-MISC       0.28      0.64      0.39       216
       B-ORG       0.74      0.46      0.56      1661
       I-ORG       0.67      0.47      0.55       835
       B-PER       0.69      0.79      0.74      1617
       I-PER       0.76      0.90      0.83      1156

    accuracy                           0.92     46435
   macro avg       0.66      0.68      0.66     46435
weighted avg       0.93      0.92      0.92     46435



In [17]:
from sklearn.metrics import confusion_matrix

def flatten(y):
    return [word_label for sentence_labels in y for word_label in sentence_labels]

truth = flatten(y_test)
pred = flatten(y_pred)

cm = confusion_matrix(truth, pred)
cm

array([[ 1220,    50,   133,   143,     1,     6,     5,     4,   106],
       [   53,   466,    29,    36,     1,    13,     6,     3,    95],
       [  333,    81,   756,   236,     2,    10,     5,     7,   231],
       [  158,    24,    25,  1276,     0,     9,    16,    10,    99],
       [    5,     1,     0,     1,   130,    23,    45,    41,    11],
       [    1,     4,     0,     3,     6,   139,    12,    22,    29],
       [   15,     3,     8,    30,    57,    69,   392,   156,   105],
       [    9,     0,     0,    16,    15,    20,    32,  1039,    25],
       [  165,   117,    73,   102,    13,   214,    71,    77, 37491]])
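
The raw matrix is hard to read without row and column names. The sketch below attaches label names and recomputes the per-label scores of the classification report from the counts. It assumes the default class ordering of `confusion_matrix`, which sorts the label strings when no `labels` argument is given. `cm_labels`, `cm_counts` and `per_class_scores` are names introduced here for illustration.

```python
# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels).
# Assumption: classes are in sorted order, as scikit-learn does by default.
cm_labels = ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER',
             'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']
cm_counts = [
    [1220,  50, 133,  143,   1,   6,   5,    4,   106],
    [  53, 466,  29,   36,   1,  13,   6,    3,    95],
    [ 333,  81, 756,  236,   2,  10,   5,    7,   231],
    [ 158,  24,  25, 1276,   0,   9,  16,   10,    99],
    [   5,   1,   0,    1, 130,  23,  45,   41,    11],
    [   1,   4,   0,    3,   6, 139,  12,   22,    29],
    [  15,   3,   8,   30,  57,  69, 392,  156,   105],
    [   9,   0,   0,   16,  15,  20,  32, 1039,    25],
    [ 165, 117,  73,  102,  13, 214,  71,   77, 37491],
]


def per_class_scores(matrix, names):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    scores = {}
    for k, name in enumerate(names):
        true_positives = matrix[k][k]
        support = sum(matrix[k])                   # row sum: tokens with this true label
        predicted = sum(row[k] for row in matrix)  # column sum: tokens predicted as this label
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[name] = (precision, recall, f1, support)
    return scores


for name, (p, r, f1, support) in per_class_scores(cm_counts, cm_labels).items():
    print(f'{name:>6}  {p:.2f}  {r:.2f}  {f1:.2f}  {support:>6}')
```

Reading along the B-ORG row, for example, shows that many true B-ORG tokens are predicted as B-LOC (333), B-PER (236) or O (231), which explains the low recall for that label in the report.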

In [18]:
labels.remove('O')
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

       B-LOC       0.62      0.73      0.67      1668
       I-LOC       0.58      0.51      0.54       257
      B-MISC       0.62      0.66      0.64       702
      I-MISC       0.28      0.64      0.39       216
       B-ORG       0.74      0.46      0.56      1661
       I-ORG       0.67      0.47      0.55       835
       B-PER       0.69      0.79      0.74      1617
       I-PER       0.76      0.90      0.83      1156

   micro avg       0.66      0.67      0.66      8112
   macro avg       0.62      0.64      0.62      8112
weighted avg       0.67      0.67      0.66      8112
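
Because the `O` label dominates the corpus, the averages that exclude it give a more realistic picture of entity tagging quality. As a hedged sketch, the micro average in the last report can be reproduced from the confusion matrix computed earlier: pool the counts of all entity labels before computing precision and recall. `micro_average` is a hypothetical helper, and the class ordering is assumed to be the sorted order that `confusion_matrix` uses by default.

```python
# Confusion matrix from the earlier cell (rows: true labels, columns: predicted labels).
# Assumption: classes are in sorted order, as scikit-learn does by default.
cm_labels = ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER',
             'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']
cm_counts = [
    [1220,  50, 133,  143,   1,   6,   5,    4,   106],
    [  53, 466,  29,   36,   1,  13,   6,    3,    95],
    [ 333,  81, 756,  236,   2,  10,   5,    7,   231],
    [ 158,  24,  25, 1276,   0,   9,  16,   10,    99],
    [   5,   1,   0,    1, 130,  23,  45,   41,    11],
    [   1,   4,   0,    3,   6, 139,  12,   22,    29],
    [  15,   3,   8,   30,  57,  69, 392,  156,   105],
    [   9,   0,   0,   16,  15,  20,  32, 1039,    25],
    [ 165, 117,  73,  102,  13, 214,  71,   77, 37491],
]


def micro_average(matrix, names, excluded=('O',)):
    """Pool counts over all labels except `excluded`, then compute the scores."""
    keep = [k for k, name in enumerate(names) if name not in excluded]
    true_positives = sum(matrix[k][k] for k in keep)
    predicted = sum(row[k] for row in matrix for k in keep)  # column sums of kept labels
    support = sum(sum(matrix[k]) for k in keep)              # row sums of kept labels
    precision = true_positives / predicted
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


precision, recall, f1, support = micro_average(cm_counts, cm_labels)
print(f'micro avg  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}')
```

Note that these are token-level scores: a multi-token entity counts as several independent decisions, so they are not directly comparable to entity-level F1 scores.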

