# Solutions for Sequence Labeling: Shallow Parsing

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

## 4. Named Entity Recognition with NLTK

### 4.2. Training NLTK Taggers

In [1]:
import nltk
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to /Users/eas/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


True

In [2]:
from nltk.corpus import conll2002
import nltk.tag.hmm as hmm

### Exercise

#### Segmentation 
Train a tagger to perform *segmentation* of input sentences into constituents
- Strip concept information from output labels (i.e. keep only IOB-prefix)
- Train tagger to predict segmentation labels
- Evaluate segmentation performance

Let's define a function to strip concept information.

In [3]:
def split_tag(tag):
    parts = tag.split("-")
    return tag if len(parts) == 1 else parts[0]
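
As a quick sanity check, the snippet below applies the function to a few tags. The tag strings are illustrative examples in the CoNLL IOB format, not output taken from the corpus.

```python
def split_tag(tag):
    # same definition as above, repeated so the snippet runs on its own
    parts = tag.split("-")
    return tag if len(parts) == 1 else parts[0]

# illustrative tags in CoNLL IOB format (assumed examples)
examples = ["B-PER", "I-LOC", "O"]
stripped = [split_tag(t) for t in examples]
print(stripped)  # ['B', 'I', 'O']
```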

In [5]:
train_sents_seg = [[(text, split_tag(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
testa_sents_seg = [[(text, split_tag(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]
testb_sents_seg = [[(text, split_tag(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testb')]

In [20]:
# create a fresh trainer so that no parameters carry over from other models
hmm_seg_model = hmm.HiddenMarkovModelTrainer()
hmm_seg = hmm_seg_model.train(train_sents_seg)
    
# evaluation
accuracy = hmm_seg.accuracy(testa_sents_seg)

print("Accuracy: {:6.4f}".format(accuracy))

Accuracy: 0.4297
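
The accuracy above is a token-level score: the fraction of tokens whose predicted label matches the reference. A minimal sketch of that computation on made-up data follows; `token_accuracy` is a hypothetical helper written for illustration, not NLTK's implementation.

```python
def token_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

# toy data (assumed, not from the corpus): one of four tags is wrong
gold = [[("Melbourne", "B"), ("(", "O"), ("Australia", "B"), (")", "O")]]
pred = [[("Melbourne", "B"), ("(", "O"), ("Australia", "I"), (")", "O")]]
toy_accuracy = token_accuracy(gold, pred)
print(toy_accuracy)  # 0.75
```

Token accuracy says nothing about whether whole segments are recovered, which is why the chunk-level evaluation in the next section is needed.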


#### CoNLL Eval Evaluation

In [8]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

##### Segmentation Evaluation
We need to either retrain the segmentation tagger or post-process its output so that the labels match the input format required by the evaluation script:

`PREFIX-TAG`

###### Post-Processing

In [21]:
def add_ne(tag, ne_tag="NE"):
    return tag if tag == "O" else "-".join([tag, ne_tag])
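
For illustration, this is what the function does to the three segmentation labels. The input list is an assumed example.

```python
def add_ne(tag, ne_tag="NE"):
    # same definition as above, repeated so the snippet runs on its own
    return tag if tag == "O" else "-".join([tag, ne_tag])

seg_tags = ["B", "I", "O"]
restored = [add_ne(t) for t in seg_tags]
print(restored)  # ['B-NE', 'I-NE', 'O']
```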

In [22]:
# tag
hyp_a_seg = [hmm_seg.tag(s) for s in conll2002.sents('esp.testa')]
hyp_b_seg = [hmm_seg.tag(s) for s in conll2002.sents('esp.testb')]

In [23]:
# post-process references
ref_a = [[(text, add_ne(iob)) for text, iob in sent] for sent in testa_sents_seg]
ref_b = [[(text, add_ne(iob)) for text, iob in sent] for sent in testb_sents_seg]

In [24]:
# post-process hypotheses
hyp_a = [[(text, add_ne(iob)) for text, iob in sent] for sent in hyp_a_seg]
hyp_b = [[(text, add_ne(iob)) for text, iob in sent] for sent in hyp_b_seg]

In [25]:
res_a = evaluate(ref_a, hyp_a)
res_b = evaluate(ref_b, hyp_b)
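
To make the chunk-level metrics concrete, here is a minimal sketch of how precision, recall, and F1 can be computed over IOB spans. It is an assumption-laden illustration, not the code of `conll.evaluate`: the helpers `iob_chunks` and `chunk_prf` are hypothetical, work on a single sentence, and count a chunk as correct only when both boundaries and type match exactly.

```python
def iob_chunks(tags):
    """Return the set of (start, end, type) spans encoded by a list of IOB tags."""
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        prefix, _, label = tag.partition("-")
        # an open chunk ends at O, at B, or when the type changes
        if start is not None and (prefix != "I" or label != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # a chunk starts at B, or at an I that does not continue an open chunk
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label
    return chunks

def chunk_prf(ref_tags, hyp_tags):
    """Exact-match chunk precision, recall and F1 for one sentence."""
    ref, hyp = iob_chunks(ref_tags), iob_chunks(hyp_tags)
    tp = len(ref & hyp)
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy example (assumed): the first chunk is cut short, the second is correct
ref_tags = ["B-NE", "I-NE", "O", "B-NE"]
hyp_tags = ["B-NE", "O", "O", "B-NE"]
p, r, f = chunk_prf(ref_tags, hyp_tags)
print(p, r, f)  # 0.5 0.5 0.5
```

Note that three of the four token labels agree in this example, yet only half of the chunks are counted as correct.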

In [26]:
pd_tbl_a = pd.DataFrame.from_dict(res_a, orient='index')
pd_tbl_a.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| NE | 0.087 | 0.679 | 0.153 | 4352 |
| total | 0.087 | 0.679 | 0.153 | 4352 |


In [27]:
pd_tbl_b = pd.DataFrame.from_dict(res_b, orient='index')
pd_tbl_b.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| NE | 0.08 | 0.717 | 0.144 | 3559 |
| total | 0.08 | 0.717 | 0.144 | 3559 |


###### Re-Training

In [28]:
def split_tag_ne(tag, ne_tag="NE"):
    parts = tag.split("-")
    return tag if len(parts) == 1 else "-".join([parts[0], ne_tag])
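
The re-training route relabels the data in one step, whereas the post-processing route strips the concept first and appends `NE` afterwards. The sketch below checks that both routes map every tag to the same label; the tag list is an assumed enumeration of the CoNLL 2002 label set.

```python
def split_tag(tag):
    parts = tag.split("-")
    return tag if len(parts) == 1 else parts[0]

def add_ne(tag, ne_tag="NE"):
    return tag if tag == "O" else "-".join([tag, ne_tag])

def split_tag_ne(tag, ne_tag="NE"):
    parts = tag.split("-")
    return tag if len(parts) == 1 else "-".join([parts[0], ne_tag])

# assumed label set: B-/I- prefixes for PER, LOC, ORG, MISC, plus O
tags = [f"{prefix}-{concept}" for concept in ("PER", "LOC", "ORG", "MISC")
        for prefix in ("B", "I")] + ["O"]

same = all(split_tag_ne(t) == add_ne(split_tag(t)) for t in tags)
print(same)  # True
```

Since the two label sets differ only by a renaming, a tagger trained on either should make the same segmentation decisions.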

In [29]:
train_sents_seg_ne = [[(text, split_tag_ne(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
testa_sents_seg_ne = [[(text, split_tag_ne(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]
testb_sents_seg_ne = [[(text, split_tag_ne(iob)) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testb')]

In [30]:
hmm_seg_ne_model = hmm.HiddenMarkovModelTrainer()
hmm_seg_ne = hmm_seg_ne_model.train(train_sents_seg_ne)
    
# evaluation
acca = hmm_seg_ne.accuracy(testa_sents_seg_ne)
accb = hmm_seg_ne.accuracy(testb_sents_seg_ne)
print("Accuracy A: {:6.4f}".format(acca))
print("Accuracy B: {:6.4f}".format(accb))

Accuracy A: 0.4297
Accuracy B: 0.4432


In [31]:
# getting hypotheses
hyp_a_ne = [hmm_seg_ne.tag(s) for s in conll2002.sents('esp.testa')]
hyp_b_ne = [hmm_seg_ne.tag(s) for s in conll2002.sents('esp.testb')]

res_a_ne = evaluate(testa_sents_seg_ne, hyp_a_ne)
res_b_ne = evaluate(testb_sents_seg_ne, hyp_b_ne)

In [32]:
pd_tbl_a = pd.DataFrame.from_dict(res_a_ne, orient='index')
pd_tbl_a.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| NE | 0.087 | 0.679 | 0.153 | 4352 |
| total | 0.087 | 0.679 | 0.153 | 4352 |


In [34]:
pd_tbl_b = pd.DataFrame.from_dict(res_b_ne, orient='index')
pd_tbl_b.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| NE | 0.08 | 0.717 | 0.144 | 3559 |
| total | 0.08 | 0.717 | 0.144 | 3559 |


## 7. Feature Engineering

### 7.1. SpaCy Token Features

[spaCy](https://spacy.io/) provides a convenient way to augment our feature set with features commonly used in Natural Language Processing.

The list of provided token-level features is available [here](https://spacy.io/api/token#attributes).

## Lab Exercise

- add suffix features to the model and report performances
- try the feature template from the tutorial on CoNLL dataset
- increase the feature window (number of previous and next token) to:
    - `[-1, +1]`
    - `[-2, +2]`
- learn & experiment with model parameters

Let's modify the `sent2features` function to make use of spaCy features.

In [35]:
# get Spanish model of spacy
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.2.0/es_core_news_sm-3.2.0-py3-none-any.whl (14.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [39]:
train_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.train')]
testa_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testa')]
testb_sents = [[(text, iob) for text, pos, iob in sent] for sent in conll2002.iob_sents('esp.testb')]

In [40]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    return {'bias': 1.0, 'word.lower()': word.lower()}
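
Before switching to spaCy, the first and third exercise items can be tried on this baseline directly. The sketch below adds suffix features and a symmetric context window using plain Python only; `word2features_ext` and its `window` parameter are hypothetical names introduced for illustration, and the example sentence is made up.

```python
def word2features_ext(sent, i, window=1):
    """Baseline features plus suffixes and a [-window, +window] context."""
    word = sent[i][0]
    feats = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        # slicing never fails: a short word is simply used whole
        'suffix1': word[-1:],
        'suffix2': word[-2:],
        'suffix3': word[-3:],
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if j < 0:
            feats['BOS'] = True  # window reaches past the sentence start
        elif j >= len(sent):
            feats['EOS'] = True  # window reaches past the sentence end
        else:
            feats[f'{offset:+d}:word.lower()'] = sent[j][0].lower()
    return feats

# toy sentence of (token, label) pairs (assumed example)
sent = [("Melbourne", "B-LOC"), ("(", "O"), ("Australia", "B-LOC"), (")", "O")]
first = word2features_ext(sent, 0)
print(first)
```

Calling it with `window=2` gives the `[-2, +2]` setting from the exercise.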

In [62]:
import spacy
from spacy.tokenizer import Tokenizer
import es_core_news_sm
nlp = es_core_news_sm.load()

# nlp = spacy.load("es_core_news_sm")
nlp.tokenizer = Tokenizer(nlp.vocab)  # to use white space tokenization (generally a bad idea for unknown data)

def sent2spacy_features(sent):
    neg_window = 2 # number of previous tokens to featurize
    pos_window = 2 # number of following tokens to featurize
    spacy_sent = nlp(" ".join(sent2tokens(sent)))
    feats = []
    # this loop replaces word2features
    for token in spacy_sent:
        token_feats = {
            'bias': 1.0,
            'word.lower()': token.lower_,
            'pos': token.pos_,
            'lemma': token.lemma_,
            # affixes added as part of the exercise
            # (a word shorter than the affix length is used whole: e.g. suffix3 of 'we' is 'we')
            'prefix1': token.text[:1],
            'prefix2': token.text[:2],
            'prefix3': token.text[:3],
            'suffix1': token.text[-1:],
            'suffix2': token.text[-2:],
            'suffix3': token.text[-3:],
            # some features from template
            # https://spacy.io/api/token#attributes
            'word.isupper()': token.is_upper,
            'word.istitle()': token.is_title,
            'word.isdigit()': token.is_digit,
        }
        # windowing: add the same features for up to neg_window/pos_window neighbours
        # (slicing a Doc returns a Span)
        prev_tokens = spacy_sent[max(0, token.i - neg_window):token.i]
        next_tokens = spacy_sent[(token.i + 1):(token.i + 1 + pos_window)]
        
        if len(prev_tokens) == 0:
            token_feats['BOS'] = True
        else:
            for t in prev_tokens:
                # unique identifier
                str_id = str((t.i - token.i))
                token_feats.update({
                    f'{str_id}:word.lower()': t.lower_,
                    f'{str_id}:word.istitle()': t.is_title,
                    f'{str_id}:word.isupper()': t.is_upper,
                    f'{str_id}:postag': t.pos_,
                })
                
        if len(next_tokens) == 0:
            token_feats['EOS'] = True
        else:
            for t in next_tokens:
                # unique identifier
                str_id = "+" + str((t.i - token.i))
                token_feats.update({
                    f'{str_id}:word.lower()': t.lower_,
                    f'{str_id}:word.istitle()': t.is_title,
                    f'{str_id}:word.isupper()': t.is_upper,
                    f'{str_id}:postag': t.pos_,
                })

        feats.append(token_feats)
    
    return feats
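
The start of the left window is clamped to 0 in the function above. The sketch below shows why, using a plain Python list as a stand-in for the spaCy `Doc`; the helper `context` is hypothetical.

```python
tokens = ["Melbourne", "(", "Australia", ")"]

def context(tokens, i, neg_window, pos_window):
    """Return the tokens before and after position i within the window."""
    start = max(0, i - neg_window)  # clamp so a negative start does not wrap around
    prev_tokens = tokens[start:i]
    next_tokens = tokens[i + 1:i + 1 + pos_window]  # slicing past the end is safe
    return prev_tokens, next_tokens

print(context(tokens, 0, 2, 2))  # ([], ['(', 'Australia'])
print(context(tokens, 1, 2, 2))  # (['Melbourne'], ['Australia', ')'])

# without clamping, the start index for i=1 would be -1, which counts
# from the end of the list and silently drops the previous token
print(tokens[1 - 2:1])  # []
```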

In [63]:
train_feats = [sent2spacy_features(s) for s in train_sents]
train_label = [sent2labels(s) for s in train_sents]
testa_feats = [sent2spacy_features(s) for s in testa_sents]

In [64]:
print(train_feats[0])

[{'bias': 1.0, 'word.lower()': 'melbourne', 'pos': 'PROPN', 'lemma': 'Melbourne', 'prefix1': 'M', 'prefix2': 'Me', 'prefix3': 'Mel', 'suffix1': 'e', 'suffix2': 'ne', 'suffix3': 'rne', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'BOS': True, '+1:word.lower()': '(', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'PUNCT', '+2:word.lower()': 'australia', '+2:word.istitle()': True, '+2:word.isupper()': False, '+2:postag': 'PROPN'}, {'bias': 1.0, 'word.lower()': '(', 'pos': 'PUNCT', 'lemma': '(', 'prefix1': '(', 'prefix2': '(', 'prefix3': '(', 'suffix1': '(', 'suffix2': '(', 'suffix3': '(', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, '-1:word.lower()': 'melbourne', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'PROPN', '+1:word.lower()': 'australia', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'PROPN', '+2:word.lower()': ')', '+2:word.istitle()': False, '+2:word.is

In [65]:
from sklearn_crfsuite import CRF

crf = CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
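
For the last exercise item, one simple way to experiment with the regularization parameters is to enumerate a small grid of settings. The sketch below only builds the parameter dictionaries; the candidate values and the helper `crf_param_grid` are assumptions chosen for illustration, not recommended settings.

```python
from itertools import product

def crf_param_grid(c1_values, c2_values):
    """Yield one keyword-argument dict per (c1, c2) combination."""
    for c1, c2 in product(c1_values, c2_values):
        yield {
            'algorithm': 'lbfgs',
            'c1': c1,  # L1 regularization coefficient
            'c2': c2,  # L2 regularization coefficient
            'max_iterations': 100,
            'all_possible_transitions': True,
        }

# assumed candidate values, for illustration only
grid = list(crf_param_grid([0.01, 0.1, 1.0], [0.01, 0.1, 1.0]))
print(len(grid))  # 9
```

Each dictionary can be unpacked into the constructor, e.g. `CRF(**params)`, and the resulting model compared on `esp.testa`.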

In [66]:
%%time
# workaround for scikit-learn 1.0
try:
    crf.fit(train_feats, train_label)
except AttributeError:
    pass

CPU times: user 36.9 s, sys: 419 ms, total: 37.3 s
Wall time: 38.5 s


In [67]:
pred = crf.predict(testa_feats)

# pair each token with its predicted tag, in the (token, tag) format expected by evaluate
hyp = [[(token, tag) for (token, _), tag in zip(sent, tags)] for sent, tags in zip(testa_sents, pred)]

In [68]:
results = evaluate(testa_sents, hyp)

pd_tbl = pd.DataFrame.from_dict(results, orient='index')
pd_tbl.round(decimals=3)

| | p | r | f | s |
|---|---|---|---|---|
| ORG | 0.775 | 0.745 | 0.759 | 1700 |
| LOC | 0.666 | 0.798 | 0.726 | 985 |
| MISC | 0.562 | 0.449 | 0.499 | 445 |
| PER | 0.877 | 0.797 | 0.835 | 1222 |
| total | 0.754 | 0.741 | 0.747 | 4352 |
