<a href="https://colab.research.google.com/github/akaver/NLP2019/blob/master/Lab06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, we will do word classification based on an English named entity recognition (NER) dataset.

First, download and save the following files to the current directory:

  * https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
  * https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa

If you are running Linux, you can do this with the following commands:

In [0]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train

--2019-03-19 13:58:33--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3281528 (3.1M) [text/plain]
Saving to: ‘eng.train’


2019-03-19 13:58:38 (13.0 MB/s) - ‘eng.train’ saved [3281528/3281528]



In [0]:
!wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa

--2019-03-19 13:58:40--  https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827012 (808K) [text/plain]
Saving to: ‘eng.testa’


2019-03-19 13:58:40 (18.6 MB/s) - ‘eng.testa’ saved [827012/827012]



The files are in the 'CoNLL' format: each word is on its own line. The word is accompanied by some linguistic attributes, and the NER class is in the final column. We will use only the words for classification in this lab.


In [0]:
!head -30 eng.testa

-DOCSTART- -X- O O

CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

LONDON NNP I-NP I-LOC
1996-08-30 CD I-NP O

West NNP I-NP I-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP I-PER
Simmons NNP I-NP I-PER
took VBD I-VP O
four CD I-NP O
for IN I-PP O
38 CD I-NP O
on IN I-PP O
Friday NNP I-NP O
as IN I-PP O
Leicestershire NNP I-NP I-ORG


Here is a function that reads the datasets into memory. You don't have to fully follow it.

In [0]:
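from itertools import groupby

def read_conll(filename):
    """Read a CoNLL file into a list of sentences of (word, label) tuples."""
    result = []
    with open(filename) as f:
        lines = (line.strip() for line in f)
        # Blank lines separate sentences: keep only the runs of non-empty lines.
        groups = (grp for nonempty, grp in groupby(lines, bool) if nonempty)

        for group in groups:
            group = list(group)

            # The last column is the NER tag; the first column is the word.
            obs, lbl = zip(*(ln.rsplit(None, 1) for ln in group))
            # Drop the "B-"/"I-" prefix (str.lstrip strips characters, not a prefix).
            lbl = [l.split("-", 1)[-1] for l in lbl]
            word = [x.split()[0] for x in obs]

            result.append(list(zip(word, lbl)))
    return result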
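
The key step in `read_conll` is `groupby(lines, bool)`, which splits the stream of lines into alternating runs of non-empty and empty lines. Here is a minimal sketch of the same idea on a handful of lines taken from the `eng.testa` excerpt above (the `sample_lines` list and the way the lines are grouped into two sentences are made up for illustration):

```python
from itertools import groupby

# A few lines in CoNLL format; "" stands for a blank separator line.
sample_lines = [
    "LONDON NNP I-NP I-LOC",
    "1996-08-30 CD I-NP O",
    "",
    "Phil NNP I-NP I-PER",
    "Simmons NNP I-NP I-PER",
]

sentences = []
# bool("") is False, so groupby yields runs of non-empty and empty lines in turn.
for nonempty, group in groupby(sample_lines, bool):
    if not nonempty:
        continue  # skip the blank separator lines
    sentence = []
    for line in group:
        columns = line.split()
        word, tag = columns[0], columns[-1]
        # Keep only the entity type, dropping an "I-" or "B-" prefix if present.
        sentence.append((word, tag.split("-", 1)[-1]))
    sentences.append(sentence)

print(sentences)
```

Each run of non-empty lines becomes one sentence, which is why the blank lines in the files matter.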

Let's read the train and dev data:

In [0]:
train_data = read_conll("eng.train")
dev_data = read_conll("eng.testa")


Now, the data is represented as a list of sentences. Each sentence consists of tuples (word, class), where class is the corresponding NER class. Let's look at one of the sentences:

In [0]:
print(train_data[100])

[('Port', 'O'), ('conditions', 'O'), ('from', 'O'), ('Lloyds', 'ORG'), ('Shipping', 'ORG'), ('Intelligence', 'ORG'), ('Service', 'ORG'), ('--', 'O')]


In [0]:
print(len(train_data), len(dev_data))

14987 3466


As you can see, in the above sentence, the words 'Lloyds', 'Shipping', 'Intelligence', 'Service' belong to the 'ORG' class and the other words to the 'O' class (meaning 'outside', i.e., not part of any named entity).
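
To read off the entities in such a sentence, consecutive words with the same non-'O' class can be merged. The sketch below is only an illustration (the helper name `entity_spans` is made up). Because the B-/I- prefixes were stripped when reading the data, two adjacent entities of the same class would be merged into one:

```python
from itertools import groupby

# The sentence printed above.
sentence = [('Port', 'O'), ('conditions', 'O'), ('from', 'O'), ('Lloyds', 'ORG'),
            ('Shipping', 'ORG'), ('Intelligence', 'ORG'), ('Service', 'ORG'), ('--', 'O')]

def entity_spans(sentence):
    """Merge runs of consecutive words that share the same non-'O' class."""
    spans = []
    for label, group in groupby(sentence, key=lambda pair: pair[1]):
        if label != "O":
            spans.append((" ".join(word for word, _ in group), label))
    return spans

print(entity_spans(sentence))
```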

Next, we will implement a feature extraction function. The features can look at the current word (i.e., the one that is to be classified) and its neighbouring words. Thus, the feature extraction function takes a sentence and the index of the word as input. It returns the extracted features for the i-th word as a list of strings.

In [0]:
def features(sentence, i):
    result = []
    word = sentence[i][0]
    if (word == word.title()):
        result.append("titlecase")
    result.append("word:" + word.lower())
    return result

Currently, the feature extractor extracts up to two features for the i-th word: `word:<word>` and, if the word is in title case, `titlecase`. Let's test it on one word:

In [0]:
print(train_data[100][3])
print(features(train_data[100], 3))

('Lloyds', 'ORG')
['titlecase', 'word:lloyds']


Currently, our data consists of sentences. In this lab, we will use a normal single-item classifier for NER, not a sequence classifier. Therefore, we need to transform our words, which are grouped by sentences, into a flattened form that consists of the items we want to classify, i.e., words and the corresponding labels.

The next function extracts features for all words in all the sentences in the data and returns a pair X, y, where X consists of the features for each word (joined into a single string) and y holds the corresponding labels:

In [0]:
def flatten_with_features(data):
    X = []
    y = []
    for sentence in data:
        for i in range(len(sentence)):
            X.append(" ".join(features(sentence, i)))
            y.append(sentence[i][1])
    return X, y

In [0]:
X_train, y_train = flatten_with_features(train_data)
X_dev, y_dev = flatten_with_features(dev_data)

In [0]:
X_train[100:110], y_train[100:110]

(['word:van',
  'word:der',
  'titlecase word:pas',
  'word:told',
  'word:a',
  'word:news',
  'word:briefing',
  'titlecase word:.',
  'titlecase word:he',
  'word:said'],
 ['PER', 'PER', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])
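
Note that `titlecase` also appears for `.` in the output above. This happens because `str.title()` leaves strings without letters unchanged, so `word == word.title()` is true for punctuation and numbers as well. If that is not wanted, one possible alternative is `str.istitle()`, which requires at least one cased character. This is only a sketch, not what the lab code uses, and the two helper names are made up for the comparison:

```python
def is_titlecase_loose(word):
    # The check used in features(): also true for "." and "38".
    return word == word.title()

def is_titlecase_strict(word):
    # Alternative check: str.istitle() is False when there are no cased characters.
    return word.istitle()

for word in ["Lloyds", "he", ".", "38", "LONDON"]:
    print(word, is_titlecase_loose(word), is_titlecase_strict(word))
```

Switching the check changes the extracted features, so the scores reported in this notebook would no longer apply exactly.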

Now we have our data in the right format to train a classifier. We will use multinomial logistic regression, but you could try any other classifier (e.g. DecisionTreeClassifier, LinearSVC, MultinomialNB).

As in the document classification example, we first need to convert our textual features to feature vectors. We will use the same transformer as before, i.e., CountVectorizer. However, we will pass the `token_pattern=r"\S+"` argument when constructing it, so that it won't break features like "word:foo" into the pieces "word" and "foo". Also, since we will possibly add a lot of features later, we will use the `max_features=10000` argument so that only the 10000 most frequently occurring features are passed to the classifier. Otherwise, training could be slow and we could run out of memory. Of course, you can experiment with different values.
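
To see what the custom token pattern changes, the sketch below applies both patterns directly with Python's `re` module. It assumes scikit-learn's documented default pattern `(?u)\b\w\w+\b`; the sample string is one of the feature strings produced earlier.

```python
import re

feature_string = "titlecase word:lloyds"

default_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's documented default
whitespace_pattern = r"\S+"         # the pattern used in this lab

# The default pattern splits at ":" and keeps only tokens of 2+ word characters.
print(re.findall(default_pattern, feature_string))
# The whitespace pattern keeps every space-separated feature intact.
print(re.findall(whitespace_pattern, feature_string))
```

With the default pattern, the feature name and its value would become two unrelated tokens, and one-character words such as "a" would be dropped entirely.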

Note that the training takes about one minute.

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.linear_model
from sklearn.linear_model import LogisticRegression

clf_pipeline = Pipeline([('vect', CountVectorizer(token_pattern=r"\S+", max_features=10000)),
                         ('clf', LogisticRegression(solver='lbfgs', multi_class='multinomial', verbose=1, max_iter=500))])
clf_pipeline.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   53.0s finished


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=10000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=1, warm_start=False))])

Let's test it:

In [0]:
print(clf_pipeline.score(X_dev, y_dev))

0.9309977122028772


As you can see, using only two features already gives us pretty good accuracy. Of course, this version of the NER task is quite simple, and most of the words are in class "O", so it is easy to reach relatively high accuracy very quickly.

But let's look at the more detailed report:

In [0]:
from sklearn import metrics
predicted = clf_pipeline.predict(X_dev)
print(metrics.classification_report(y_dev, predicted))

              precision    recall  f1-score   support

         LOC       0.88      0.72      0.79      2094
        MISC       0.89      0.62      0.73      1268
           O       0.94      1.00      0.97     42975
         ORG       0.83      0.57      0.67      2092
         PER       0.95      0.52      0.68      3149

   micro avg       0.93      0.93      0.93     51578
   macro avg       0.90      0.68      0.77     51578
weighted avg       0.93      0.93      0.92     51578



Because the tag “O” (outside) is by far the most common one, it makes our results look much better than they actually are. So we exclude the tag “O” when we compute the classification metrics.
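
To put a number on this, the short calculation below uses the support counts from the report above to work out the accuracy of a trivial baseline that labels every word as “O” (the variable names are just for illustration):

```python
# Support (number of dev tokens) per class, taken from the report above.
support = {"LOC": 2094, "MISC": 1268, "O": 42975, "ORG": 2092, "PER": 3149}

total = sum(support.values())
# A baseline that labels every token "O" gets all "O" tokens right and nothing else.
baseline_accuracy = support["O"] / total
entity_tokens = total - support["O"]

print(total, entity_tokens)
print(round(baseline_accuracy, 3))
```

Such a baseline would reach about 83% accuracy without finding a single entity, which is why the per-class scores on the entity classes are more informative.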


In [0]:
print(metrics.classification_report(y_dev, clf_pipeline.predict(X_dev), labels=["LOC", "MISC", "ORG", "PER"]))

              precision    recall  f1-score   support

         LOC       0.88      0.72      0.79      2094
        MISC       0.89      0.62      0.73      1268
         ORG       0.83      0.57      0.67      2092
         PER       0.95      0.52      0.68      3149

   micro avg       0.89      0.60      0.71      8603
   macro avg       0.89      0.61      0.72      8603
weighted avg       0.89      0.60      0.71      8603



## Exercise

Your task is to improve the feature extraction code. For example, use previous and next words, suffixes, prefixes, word shapes, etc.
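
As a starting point, here is one possible way to compute a word-shape feature. The helper name `word_shape` and the exact mapping (uppercase letters to `X`, lowercase letters to `x`, digits to `d`, repeated symbols collapsed) are assumptions made for illustration; other definitions of word shape exist.

```python
import re

def word_shape(word):
    # Map each character to a symbol for its character class.
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation such as "-" as it is
    # Collapse runs of the same symbol, e.g. "Xxxxxx" -> "Xx".
    return re.sub(r"(.)\1+", r"\1", "".join(shape))

for word in ["Lloyds", "LONDON", "1996-08-30", "all-rounder"]:
    print(word, "shape:" + word_shape(word))
```

Inside `features`, such a feature could be added with `result.append("shape:" + word_shape(word))`.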

# Using CRF

Instead of using simple logistic regression for classifying words, we can also use a method called Conditional Random Field (CRF). CRF is a fancy name for multinomial logistic regression with a slightly different loss function, specifically designed for handling sequences. It can also handle inter-label dependencies much better.
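
The toy sketch below illustrates what handling inter-label dependencies means. All scores are hand-picked, hypothetical numbers (nothing here is learned), and the best sequence is found by brute force rather than with the dynamic programming that real CRF implementations use. The names `emission`, `transition`, `independent_labels` and `best_sequence` are made up for this example.

```python
from itertools import product

labels = ["O", "ORG"]

# Hypothetical per-word scores for "Lloyds Shipping Intelligence".
emission = [
    {"O": 0.2, "ORG": 1.0},  # "Lloyds"
    {"O": 0.6, "ORG": 0.5},  # "Shipping": slightly prefers "O" on its own
    {"O": 0.2, "ORG": 1.0},  # "Intelligence"
]
# Hypothetical scores for label pairs: staying in the same class is rewarded.
transition = {("O", "O"): 0.3, ("ORG", "ORG"): 0.3,
              ("O", "ORG"): -0.3, ("ORG", "O"): -0.3}

def independent_labels(emission):
    # Per-word classification: pick the best label for each word separately.
    return [max(labels, key=lambda label: scores[label]) for scores in emission]

def best_sequence(emission, transition):
    # Sequence classification: score whole label sequences, including transitions.
    def score(seq):
        total = sum(scores[label] for scores, label in zip(emission, seq))
        total += sum(transition[pair] for pair in zip(seq, seq[1:]))
        return total
    return list(max(product(labels, repeat=len(emission)), key=score))

print(independent_labels(emission))
print(best_sequence(emission, transition))
```

Classified word by word, "Shipping" is labelled "O"; once transitions are taken into account, the whole span is labelled "ORG".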

We will train a CRF model for named entity recognition using sklearn-crfsuite on our data set.

In [0]:
!pip install sklearn_crfsuite
import sklearn_crfsuite

Collecting sklearn_crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3 (from sklearn_crfsuite)
  Downloading https://files.pythonhosted.org/packages/2f/86/cfcd71edca9d25d3d331209a20f6314b6f3f134c29478f90559cee9ce091/python_crfsuite-0.9.6-cp36-cp36m-manylinux1_x86_64.whl (754kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.6 sklearn-crfsuite-0.3.6


For CRF, the features must not be "flattened" as we did before; the sentence structure has to be preserved. Let's write a function that converts each sentence to a list of lists, where the i-th inner list holds the features of the i-th word of the sentence:

In [0]:
def sentence2features_and_labels(data):
    X = []
    y = []
    for sentence in data:
        X_i = []
        y_i = []
        for i in range(len(sentence)):
            X_i.append(features(sentence, i))
            y_i.append(sentence[i][1])
        X.append(X_i)
        y.append(y_i)
    return X, y



In [0]:
X_train_crf, y_train_crf = sentence2features_and_labels(train_data)
X_dev_crf, y_dev_crf = sentence2features_and_labels(dev_data)

In [0]:
print(X_train_crf[100])
print(y_train_crf[100])


[['titlecase', 'word:port'], ['word:conditions'], ['word:from'], ['titlecase', 'word:lloyds'], ['titlecase', 'word:shipping'], ['titlecase', 'word:intelligence'], ['titlecase', 'word:service'], ['titlecase', 'word:--']]
['O', 'O', 'O', 'ORG', 'ORG', 'ORG', 'ORG', 'O']


Now, let's train the CRF:

In [0]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train_crf, y_train_crf)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

Let's use the trained CRF for getting predicted labels:

In [0]:
y_pred_crf = crf.predict(X_dev_crf)

The predicted labels are also represented as a list of lists. Let's compare the predicted and true labels of a sentence:

In [0]:
print(y_dev_crf[100])
print(y_pred_crf[100])

['O', 'O', 'LOC', 'O', 'LOC', 'O', 'O', 'O', 'O']
['O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O']


Since the labels are nested like this (one list per sentence), we cannot use scikit-learn's `classification_report` function directly. However, `sklearn_crfsuite` provides a very similar function, `sklearn_crfsuite.metrics.flat_classification_report`:

In [0]:
import sklearn_crfsuite.metrics
print(sklearn_crfsuite.metrics.flat_classification_report(y_dev_crf, y_pred_crf, labels=["LOC", "MISC", "ORG", "PER"]))

              precision    recall  f1-score   support

         LOC       0.86      0.82      0.84      2094
        MISC       0.90      0.70      0.79      1268
         ORG       0.76      0.70      0.73      2092
         PER       0.83      0.76      0.79      3149

   micro avg       0.83      0.75      0.79      8603
   macro avg       0.84      0.74      0.79      8603
weighted avg       0.83      0.75      0.79      8603
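
Conceptually, the `flat_` metrics score the labels as if all sentences were concatenated into one long list (this is an assumption about the helper's behaviour, based on its name and output). The sketch below does the flattening by hand with `itertools.chain`, using the sentence compared above plus a second, made-up sentence:

```python
from itertools import chain

# First sentence: true and predicted labels copied from the comparison above.
# Second sentence: made up, to show that sentence boundaries disappear.
y_true_nested = [['O', 'O', 'LOC', 'O', 'LOC', 'O', 'O', 'O', 'O'], ['PER', 'O']]
y_pred_nested = [['O', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O'], ['PER', 'O']]

# Concatenate the per-sentence lists into one flat list of labels.
y_true_flat = list(chain.from_iterable(y_true_nested))
y_pred_flat = list(chain.from_iterable(y_pred_nested))

# Token-level accuracy on the flattened labels.
correct = sum(t == p for t, p in zip(y_true_flat, y_pred_flat))
print(len(y_true_flat), correct / len(y_true_flat))
```

Flat lists like these can be passed to the ordinary scikit-learn metric functions used in the first part of the lab.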



In [0]:
def word_features(word):
    result = [word.lower()]
    if (word == word.title()):
        result.append("titlecase")
    if (word == word.upper()):
        result.append("uppercase")
    if (word.isdigit()):
        result.append("digits") 

    result.append("suffix2=" + word[-2:])
    result.append("prefix2=" + word[:2])
        
    return result
  

def features(sentence, i):
    result = []
    word = sentence[i][0]
    result.extend(["word[i]:" + x for x in word_features(word)])
    if i == 0:
        result.append("BOS")
    else:
        prev_word = sentence[i-1][0]
        result.extend(["word[i-1]:" + x for x in word_features(prev_word)])
    if i == len(sentence) - 1:
        result.append("EOS")
    else:
        next_word = sentence[i+1][0]
        result.extend(["word[i+1]:" + x for x in word_features(next_word)])

    return result

In [0]:
for i in range(len(y_dev_crf)):
  for j in range(len(y_dev_crf[i])):
    if y_dev_crf[i][j] != y_pred_crf[i][j]:
      print(dev_data[i][j], y_pred_crf[i][j])

('LEICESTERSHIRE', 'ORG') O
('and', 'O') ORG
('Grace', 'LOC') O
('Road', 'LOC') O
('Simmons', 'PER') O
('Such', 'PER') O
('Such', 'PER') O
('ex-England', 'MISC') O
('McCague', 'PER') O
('CHAMPIONSHIP', 'MISC') O
('county', 'O') MISC
('4-38', 'O') PER
('296', 'O') ORG
('4-43', 'O') PER
('214', 'O') ORG
('84', 'O') PER
('4-55', 'O') PER
('108-3', 'O') ORG
('The', 'LOC') O
('429-7', 'O') ORG
('363', 'O') ORG
('4-37', 'O') PER
('197-8', 'O') ORG
('Portsmouth', 'LOC') ORG
('Chesterfield', 'LOC') ORG
('471', 'O') ORG
('123', 'O') PER
("T.O'Gorman", 'PER') O
('T.', 'PER') ORG
('Moody', 'PER') ORG
('6-82', 'O') ORG
('Bristol', 'LOC') ORG
('190', 'O') ORG
('ASHES', 'MISC') O
('Ashes', 'MISC') O
('tour', 'O') MISC
('Test', 'ORG') O
('and', 'ORG') O
('county', 'O') MISC
('British', 'ORG') MISC
('Universities', 'ORG') MISC
('Minor', 'ORG') O
('Counties', 'ORG') O
('Tour', 'O') MISC
('May', 'O') PER
('Lord', 'LOC') O
("'s", 'LOC') O
('Duke', 'ORG') O
('of', 'ORG') O
('Norfolk', 'ORG') O
("'s", 'ORG

In [0]:
print(sklearn_crfsuite.metrics.flat_accuracy_score(y_dev_crf, y_pred_crf))

0.9430571173756253
