# NER Using sklearn crfsuite
Link: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

Reference: https://medium.com/data-science-in-your-pocket/named-entity-recognition-ner-using-conditional-random-fields-in-nlp-3660df22e95c

## BIO Annotation for NER
Entity tags are encoded using the BIO annotation scheme. Each entity token is labeled with a B or an I so that multi-word entities can be detected: B denotes the beginning of an entity and I denotes the inside of an entity. O denotes all other words, which are not named entities.
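
The BIO scheme can be sketched as a conversion from entity spans to per-token tags. This is a minimal illustration: the helper `spans_to_bio`, the example sentence, and the `(start, end, type)` span format (token indices, end exclusive) are assumptions made here, not part of any library.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, entity_type) with token indices, end exclusive.
    This span format is a hypothetical convention chosen for this sketch.
    """
    tags = ["O"] * len(tokens)          # every token starts as a non-entity
    for start, end, etype in spans:
        tags[start] = "B-" + etype      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype      # remaining tokens of the same entity
    return tags


tokens = ["Juan", "Carlos", "lives", "in", "Buenos", "Aires", "."]
spans = [(0, 2, "PER"), (4, 6, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

Two adjacent entities of the same type stay distinguishable because each one opens with its own B tag.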

# CRF Basics
Given a sequence $x$, we predict the sequence $y$ of labels for $x$, as follows, where the labels are drawn from a set $\{l_1, l_2, ..., l_k\}$:

$p_\theta(y|x) = \frac{\exp(\sum_{j}w_{j}F_{j}(x,y))}{\sum_{y^{'}}(\exp(\sum_{j}w_{j}F_{j}(x,y^{'})))}$

$F_{j}(x,y)$ is the sum of the values of the $j$-th feature function over all words. The numerator can be written as:

$\exp(\sum_{j}w_{j}\sum_{i}feature\_function_{j}(x,y_{i}, y_{i-1}, i))$

- The inner summation goes from $i=1$ to $i=n$, where $n$ is the length of the sentence. Hence we sum the value of a feature function over all words of the sentence. For the sentence 'Ram is cool', the inner summation adds the outputs of the $j$-th feature function for all 3 words.
- The outer summation goes from $j=1$ to the total number of feature functions and weights each inner sum: $\sum_{j}w_{j}\sum_{i}feature\_function_{j}(x,y_{i}, y_{i-1}, i)$
- $w_{j}$ refers to the weight assigned to a $feature\_function_{j}$.

The denominator sums the same exponentiated score over all possible label sequences $y'$, which normalizes $p_\theta(y|x)$ to a probability.
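
To make the formula concrete, here is a small brute-force sketch that evaluates $p_\theta(y|x)$ for the sentence 'Ram is cool'. The three feature functions, their weights, and the plain B/I/O label set are invented for illustration. Enumerating every label sequence is only feasible for tiny inputs; practical CRF implementations compute the normalizer with dynamic programming instead.

```python
import math
from itertools import product

LABELS = ["B", "I", "O"]  # assumed label set for this toy example


# Hypothetical feature functions with the signature f(x, y_i, y_prev, i).
def f_title_is_b(x, y_i, y_prev, i):
    # fires when a title-cased word is labeled B
    return 1.0 if x[i].istitle() and y_i == "B" else 0.0


def f_i_without_entity(x, y_i, y_prev, i):
    # fires when I follows O or opens the sentence (an invalid BIO pattern)
    return 1.0 if y_i == "I" and y_prev in ("O", None) else 0.0


def f_lower_is_o(x, y_i, y_prev, i):
    # fires when a lowercase word is labeled O
    return 1.0 if x[i].islower() and y_i == "O" else 0.0


FEATURES = [f_title_is_b, f_i_without_entity, f_lower_is_o]
WEIGHTS = [2.0, -3.0, 1.5]  # made-up weights w_j


def score(x, y):
    # outer sum over feature functions j, inner sum over positions i
    total = 0.0
    for w, f in zip(WEIGHTS, FEATURES):
        total += w * sum(
            f(x, y[i], y[i - 1] if i > 0 else None, i) for i in range(len(x))
        )
    return total


def sequence_probabilities(x):
    # numerator exp(score) for every y, divided by the sum over all y'
    scores = {y: score(x, y) for y in product(LABELS, repeat=len(x))}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}


x = ["Ram", "is", "cool"]
all_probs = sequence_probabilities(x)
best = max(all_probs, key=all_probs.get)
print(best, round(all_probs[best], 3))
```

With these toy weights the highest-scoring sequence labels 'Ram' as B and the two lowercase words as O.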

## Feature Functions

- embedding of word $w_{i}$
- embedding of neighboring words
- part of speech of word $w_i$
- part of speech of neighboring words
- presence in a gazetteer
- prefix of word $w_i$ and neighboring words
- suffix of word $w_i$ and neighboring words
- case of word $w_i$ and neighboring words
- shape of word $w_i$ and neighboring words
- and many more...
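
A few of the surface features listed above (prefix, suffix, case, shape) can be sketched in plain Python. The helper names `word_shape` and `surface_features` are hypothetical, and the shape mapping only handles ASCII letters and digits, so accented characters are left unchanged.

```python
import re


def word_shape(word):
    # Map uppercase -> X, lowercase -> x, digits -> d; keep everything else.
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape


def surface_features(word, n=3):
    # n is the prefix/suffix length, an arbitrary choice for this sketch
    return {
        "prefix": word[:n],
        "suffix": word[-n:],
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "shape": word_shape(word),
    }


example = surface_features("Melbourne")
print(example)
```

The same function can be applied to neighboring words to obtain the context variants of each feature.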


# Let's Code

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [None]:
! pip install -U 'scikit-learn<0.24'

Collecting scikit-learn<0.24
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
     |████████████████████████████████| 6.8 MB 4.4 MB/s 
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
Successfully installed scikit-learn-0.23.2


In [None]:
! pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)
     |████████████████████████████████| 965 kB 4.2 MB/s 
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.8 sklearn-crfsuite-0.3.6


In [None]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

## Let's use CoNLL 2002 data to build a NER system

The CoNLL 2002 corpus is available in NLTK. We use the Spanish data.

In [None]:
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.


True

In [None]:
nltk.corpus.conll2002.fileids()

['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']

In [None]:
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

CPU times: user 2.4 s, sys: 175 ms, total: 2.57 s
Wall time: 2.91 s


In [None]:
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]

In [None]:
train_sents[:2]

[[('Melbourne', 'NP', 'B-LOC'),
  ('(', 'Fpa', 'O'),
  ('Australia', 'NP', 'B-LOC'),
  (')', 'Fpt', 'O'),
  (',', 'Fc', 'O'),
  ('25', 'Z', 'O'),
  ('may', 'NC', 'O'),
  ('(', 'Fpa', 'O'),
  ('EFE', 'NC', 'B-ORG'),
  (')', 'Fpt', 'O'),
  ('.', 'Fp', 'O')],
 [('-', 'Fg', 'O')]]

## Features

Next, define some features. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used. 

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.

sklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.

In [None]:
def word2features(sent, i):
    word = sent[i][0] # assume tuples in a sentence (word, pos, tag)
    postag = sent[i][1] # the second element in the tuple is pos tag
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:], # last 3 chars
        'word[-2:]': word[-2:], # last 2 chars
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2], # the first 2 chars in a pos tag        
    }
    if i > 0:
        word1 = sent[i-1][0] # the previous word
        postag1 = sent[i-1][1] # the previous pos tag
        features.update({ # add new features to the words except for the first one
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True # a feature for the first word
        
    if i < len(sent)-1: # words before the last word
        word1 = sent[i+1][0] # the next word
        postag1 = sent[i+1][1] # the next pos
        features.update({ # add features for words except for the last word
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True # a feature for the last word
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

## Custom (self-defined) data: see here

In [None]:
import pandas as pd  # needed here: pd is used before the later import cell

a = pd.read_csv("./extracted_8.csv")
a.head(2)

Unnamed: 0,notesId,SentenceId,Word,Start,End,Label,Lemma,POS,TAG,DEP,Shape,Is_Alpha,Is_Stop,findRxcuiByString,filterByProperty,getApproximateMatch,getDrugs,getSpellingSuggestions
0,108-04,108-04_1,Record,3,9,O,record,NOUN,NN,compound,Xxxxx,True,False,,,219591.0,,
1,108-04,108-04_1,date,10,14,O,date,NOUN,NN,ROOT,xxxx,True,False,899742.0,,899742.0,,date


In [None]:
import pandas as pd


def df2x_y(input_df: pd.DataFrame):

  # extract the labels and keep only the SentenceId and Word columns as features
  labels = input_df["Label"]
  df = input_df.loc[:, ["SentenceId", "Word"]]

  # define stack(list type) to store output features and labels
  f_stack = []  # features
  l_stack = []  # labels

  # initialize an empty predecessor
  predecessor = {"SentenceId": None}

  for index, current in df.iterrows():

    # features of current word, related to current word
    features = dict(current)

    # if the sentenceId of current line and the previous line are the same, then
    if current["SentenceId"] == predecessor["SentenceId"]:

      # features of current word, related to previous word
      features.update(dict(("pre_" + key, value)  # pre_: previous word's
          for key, value in predecessor.items()
          if "pre_" not in key))

      # features of previous word, related to current word
      predecessor.update(dict(("nex_" + key, value) # nex_: next word's
          for key, value in current.items()))
      
      # put the new word to the end of its sentence
      f_stack[-1].append(features)
      l_stack[-1].append(labels[index])

    # otherwise a new sentence starts here
    else:  # note the "[]": each sentence gets its own list
      f_stack.append([features])
      l_stack.append([labels[index]])

    # the current word becomes the predecessor of the next word
    predecessor = features

  return f_stack, l_stack  # X and y

In [None]:
%%time

X_train, y_train = df2x_y(a)

CPU times: user 97.4 ms, sys: 0 ns, total: 97.4 ms
Wall time: 98.9 ms


In [None]:
X_train[0:3]

[[{'SentenceId': '108-04_1',
   'Word': 'Record',
   'nex_SentenceId': '108-04_1',
   'nex_Word': 'date'},
  {'SentenceId': '108-04_1',
   'Word': 'date',
   'nex_SentenceId': '108-04_1',
   'nex_Word': ':',
   'pre_SentenceId': '108-04_1',
   'pre_Word': 'Record'},
  {'SentenceId': '108-04_1',
   'Word': ':',
   'nex_SentenceId': '108-04_1',
   'nex_Word': '2135',
   'pre_SentenceId': '108-04_1',
   'pre_Word': 'date'},
  {'SentenceId': '108-04_1',
   'Word': '2135',
   'nex_SentenceId': '108-04_1',
   'nex_Word': '-',
   'pre_SentenceId': '108-04_1',
   'pre_Word': ':'},
  {'SentenceId': '108-04_1',
   'Word': '-',
   'nex_SentenceId': '108-04_1',
   'nex_Word': '09',
   'pre_SentenceId': '108-04_1',
   'pre_Word': '2135'},
  {'SentenceId': '108-04_1',
   'Word': '09',
   'nex_SentenceId': '108-04_1',
   'nex_Word': '-',
   'pre_SentenceId': '108-04_1',
   'pre_Word': '-'},
  {'SentenceId': '108-04_1',
   'Word': '-',
   'nex_SentenceId': '108-04_1',
   'nex_Word': '08',
   'pre_Sent

In [None]:
y_train[0:3]

[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O'], ['O']]

In [None]:
a["Label"].value_counts()

O    1085
B      42
I       1
Name: Label, dtype: int64

This is what word2features extracts:

In [None]:
train_sents[0][0]

('Melbourne', 'NP', 'B-LOC')

In [None]:
sent2features(train_sents[0])

Extract features from the data:

In [None]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 2.44 s, sys: 114 ms, total: 2.55 s
Wall time: 3.11 s


In [None]:
len(X_train), len(y_train)

(8323, 8323)

In [None]:
X_train[0]

In [None]:
y_train[0:1]

[['B-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O']]

## Training

To see all possible CRF parameters, check its docstring. Here we are using the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.

In [None]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CPU times: user 62.9 ms, sys: 1.62 ms, total: 64.5 ms
Wall time: 67.3 ms


## Evaluation

There are many more O labels in the data set than entity labels, but we're more interested in the entities. To account for this we'll use an averaged F1 score computed over all labels except O. The ``sklearn_crfsuite.metrics`` module provides some useful metrics for sequence classification tasks, including this one.
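
The idea of a support-weighted F1 over a chosen label subset can be sketched in pure Python. This is an illustrative re-implementation under the assumption that sentences are flattened into one token list first; the function name `weighted_f1` and the toy labels are made up here, and the library function should be used for real evaluation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, labels):
    # flatten the per-sentence label lists into single token-level lists
    true_flat = [t for sent in y_true for t in sent]
    pred_flat = [p for sent in y_pred for p in sent]

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[t] += 1  # true label t was missed

    # support = number of true occurrences of each selected label
    total_support = sum(tp[l] + fn[l] for l in labels)
    if total_support == 0:
        return 0.0  # none of the selected labels occurs in y_true

    result = 0.0
    for l in labels:
        denom = 2 * tp[l] + fp[l] + fn[l]
        f1 = 2 * tp[l] / denom if denom else 0.0
        result += f1 * (tp[l] + fn[l]) / total_support
    return result


y_true_toy = [["B-LOC", "O", "B-ORG", "I-ORG"], ["O", "B-LOC"]]
y_pred_toy = [["B-LOC", "O", "B-ORG", "O"], ["O", "O"]]
toy_score = weighted_f1(y_true_toy, y_pred_toy, labels=["B-LOC", "B-ORG", "I-ORG"])
print(round(toy_score, 4))
```

Because O is left out of `labels`, the two correctly predicted O tokens do not inflate the score.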

In [None]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B', 'I']

In [None]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred,average='weighted', labels=labels)

0.0

Inspect per-class results in more detail:

In [None]:
# group B and I results
sorted_labels = sorted(
    labels, 
    key=lambda name: (name[1:], name[0])
)

In [None]:
from sklearn.metrics import classification_report

In [None]:
y_test_flatten = [lab for sent in y_test for lab in sent]
y_pred_flatten = [lab for sent in y_pred for lab in sent]

In [None]:
print(classification_report(
    y_test_flatten, y_pred_flatten, labels=sorted_labels, digits=3
))


              precision    recall  f1-score   support

           B      0.000     0.000     0.000         0
           I      0.000     0.000     0.000         0

   micro avg      0.000     0.000     0.000         0
   macro avg      0.000     0.000     0.000         0
weighted avg      0.000     0.000     0.000         0




## Hyperparameter Optimization

To improve quality, try selecting the regularization parameters using randomized search and 3-fold cross-validation.

It takes quite a lot of CPU time and RAM (we're fitting a model ``50 * 3 = 150`` times), so grab a tea and be patient, or reduce n_iter in RandomizedSearchCV, or fit the model on only a subset of the training data.
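
As a rough picture of what the randomized search samples, the sketch below draws `c1` and `c2` candidates from exponential distributions with scales 0.5 and 0.05, the values used in this notebook's search cell. It uses only the standard library; the helper `sample_params` and the fixed seed are assumptions for illustration and do not reproduce the exact values the search would draw.

```python
import random


def sample_params(n_iter, c1_scale=0.5, c2_scale=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    # random.expovariate takes the rate, which is 1 / scale
    return [
        {
            "c1": rng.expovariate(1.0 / c1_scale),
            "c2": rng.expovariate(1.0 / c2_scale),
        }
        for _ in range(n_iter)
    ]


candidates = sample_params(n_iter=50)
n_fits = len(candidates) * 3  # each candidate is fitted once per CV fold
print(n_fits, candidates[0])
```

Lowering `n_iter` shrinks the number of fits proportionally, which is the cheapest way to shorten the search.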

In [None]:
%%time
# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    max_iterations=100, 
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score, 
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space, 
                        cv=3, 
                        verbose=1, 
                        n_jobs=-1, 
                        n_iter=50, 
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

AttributeError: ignored

Best result:

In [None]:
# crf = rs.best_estimator_
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

### Check parameter space

A chart showing which ``c1`` and ``c2`` values RandomizedSearchCV checked. Red means better results, blue means worse.

In [None]:
_x = [s['c1'] for s in rs.cv_results_['params']]
_y = [s['c2'] for s in rs.cv_results_['params']]
_c = rs.cv_results_['mean_test_score']

fig = plt.figure()
fig.set_size_inches(12, 12)
ax = plt.gca()
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('C1')
ax.set_ylabel('C2')
ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(
    min(_c), max(_c)
))

ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors=[0,0,0])

print("Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c)))

AttributeError: ignored

## Check best estimator on our test data

As you can see, quality is improved.

In [None]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

## Let's check what classifier learned

In [None]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside organization name (I-ORG), but transitions to I-ORG from tokens with other labels are penalized.
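
The transition pattern described above corresponds to the BIO rule that an I-X tag should only follow B-X or I-X. The CRF learns this as soft weights rather than a hard constraint, so a small checker like the following sketch can be used to spot predicted sequences that break the rule. The function names are hypothetical, and the sentence start is treated as O by assumption.

```python
def is_valid_transition(label_from, label_to):
    """I-X may only follow B-X or I-X; B-X and O may follow anything."""
    if not label_to.startswith("I-"):
        return True
    entity = label_to[2:]
    return label_from in ("B-" + entity, "I-" + entity)


def invalid_positions(tags):
    prev = "O"  # assumption: treat the sentence start like an O tag
    bad = []
    for i, tag in enumerate(tags):
        if not is_valid_transition(prev, tag):
            bad.append(i)
        prev = tag
    return bad


checked = invalid_positions(["B-ORG", "I-ORG", "O", "I-ORG", "B-LOC", "I-ORG"])
print(checked)
```

Positions flagged by the checker are the ones where a strongly negative transition weight would be expected.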

Check the state features:

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])



Some observations:

   * **9.385823 B-ORG word.lower():psoe-progresistas** - the model remembered the names of some entities - maybe it is overfit, or maybe our features are not adequate, or maybe remembering is indeed helpful;
   * **4.636151 I-LOC -1:word.lower():calle** - "calle" means "street" in Spanish; the model learns that if the previous word was "calle" then the token is likely part of a location;
   * **-5.632036 O word.isupper()**, **-8.215073 O word.istitle()** - UPPERCASED or TitleCased words are likely entities of some kind;
   * **-2.097561 O postag:NP** - proper nouns (NP is a proper noun in the Spanish tagset) are often entities.

## What to do next

* Load 'testa' Spanish data.
* Use it to develop better features and to find best model parameters.
* Apply the model to 'testb' data again.

The model in this notebook is just a starting point; you certainly can do better!

