# Chunking and Named Entity Recognition

This notebook introduces chunking and named entity recognition (NER) with NLTK.

## Initialize NLTK

Download the resources that NLTK needs for this notebook.

In [None]:
import nltk
nltk.download('book')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
)

## Loading the Data and Working on the Data Representation

Labelled data can be loaded through the `nltk.corpus.conll2000` corpus reader. It provides sentences annotated with POS tags and chunk (phrase) labels.

NLTK can represent this data in two formats: the tree format and the conlltags (IOB) format. Choosing the appropriate one matters because some NLTK functions, and libraries outside NLTK, expect one format or the other.

### Loading the Data and Sorting by Length

The sentences are sorted by length so that a sample of manageable size can be picked. Very long sentences are hard to visualize in a notebook environment.

By default, the `nltk.corpus.conll2000` corpus reader provides the data in the tree format.

In [None]:
conll_trees = sorted(nltk.corpus.conll2000.chunked_sents('train.txt'), key=len)
conll_trees[1000]

### Conversion to IOB Tags

A tree can be converted to IOB tags by calling the `nltk.chunk.tree2conlltags` function.

In [None]:
sample_iob = nltk.chunk.tree2conlltags(conll_trees[1000])
sample_iob
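
The flattening that `tree2conlltags` performs can be sketched in plain Python. In the result, the first token of a chunk gets a `B-` prefix, the remaining tokens of the chunk get `I-`, and tokens outside any chunk get `O`. The nested-list layout and the `shallow_tree_to_iob` helper below are assumptions made for illustration; they are not NLTK's tree class or its implementation.

```python
def shallow_tree_to_iob(tree):
    """Flatten a shallow chunk tree into (word, pos, iob) triples.

    `tree` is a plain list standing in for an NLTK tree: each item is either
    a (word, pos) tuple outside any chunk, or a (label, [(word, pos), ...])
    pair for a chunk. This layout is an assumption made for illustration.
    """
    triples = []
    for item in tree:
        if isinstance(item[1], list):  # a chunk: (label, tokens)
            label, tokens = item
            for i, (word, pos) in enumerate(tokens):
                prefix = "B" if i == 0 else "I"
                triples.append((word, pos, f"{prefix}-{label}"))
        else:  # a token outside any chunk
            word, pos = item
            triples.append((word, pos, "O"))
    return triples


# Hand-written example, not taken from the corpus.
toy_tree = [
    ("NP", [("the", "DT"), ("dog", "NN")]),
    ("barked", "VBD"),
    ("NP", [("the", "DT"), ("cat", "NN")]),
    (".", "."),
]
toy_iob = shallow_tree_to_iob(toy_tree)
for triple in toy_iob:
    print(triple)
```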

### Conversion to Tree

To convert IOB tags back to the tree format, use the `nltk.chunk.conlltags2tree` function.

In [None]:
nltk.chunk.conlltags2tree(sample_iob)
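
Going from IOB tags to chunks amounts to grouping each `B-` tag with the `I-` tags that follow it. The sketch below shows that grouping in plain Python on a hand-written sentence. The `iob_to_chunks` helper is hypothetical and returns simple `(label, words)` pairs rather than an NLTK tree.

```python
# Hand-written example in the conlltags style: (word, POS tag, IOB chunk tag).
# The sentence and its tags are illustrative, not taken from the corpus.
sample = [
    ("the", "DT", "B-NP"),
    ("little", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
    ("barked", "VBD", "B-VP"),
    ("at", "IN", "B-PP"),
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
    (".", ".", "O"),
]


def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (label, [words]) chunks.

    Hypothetical helper: B-X opens a chunk of type X, I-X continues it,
    and O (or a change of type) closes it. Tokens tagged O are skipped.
    """
    chunks = []
    current_label, current_words = None, []
    for word, _pos, iob in triples:
        prefix, _, label = iob.partition("-")
        continues = prefix == "I" and label == current_label
        if not continues and current_words:
            # The open chunk ends here.
            chunks.append((current_label, current_words))
            current_label, current_words = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # A stray I-X without a preceding B-X is treated as a chunk start.
            current_label, current_words = label, [word]
        elif continues:
            current_words.append(word)
    if current_words:
        chunks.append((current_label, current_words))
    return chunks


chunks = iob_to_chunks(sample)
for label, words in chunks:
    print(label, words)
```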

### Removing Some Information

If some information needs to be removed from the IOB data, the data can be iterated over directly, as it is just a list of tuples. This is useful for creating training data and for reusing POS taggers as chunkers.

In [None]:
sample_pos = [(w, pos) for w, pos, iob in sample_iob]
sample_pos

## Rule Based Chunking

The `nltk.RegexpParser` class takes a grammar of regular-expression rules over POS tags. Each rule assigns a phrase label to the token sequences it matches. NLTK also allows the patterns to be inverted, which is called chinking.

### Chunking Rules

Multiple rules can be defined for the same phrase label. Even these two examples show how tedious it is to write chunking rules by hand.

In [None]:
grammar = r"""
    NP: {<DT><NN>}
"""
rule1_chunker = nltk.RegexpParser(grammar)
rule1_chunker.parse(sample_pos)

In [None]:
grammar = r"""
    NP: {<DT><NN>}
        {<NNP><NN>+}
"""
rule2_chunker = nltk.RegexpParser(grammar)
rule2_chunker.parse(sample_pos)
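
The idea behind such tag patterns can be sketched without NLTK: write the POS tags of a sentence as one string such as `<DT><NN><VBD>` and run an ordinary regular expression over it. The `match_tag_pattern` helper and the sentence are assumptions for illustration and are not how `RegexpParser` is implemented. One difference to keep in mind is that NLTK's tag patterns treat `<NN>` as a unit, so `<NN>+` repeats the whole tag, whereas a plain regex needs an explicit group such as `(?:<NN>)+`.

```python
import re

# Hand-written POS-tagged sentence, for illustration only.
tagged = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
          ("a", "DT"), ("cat", "NN"), ("yesterday", "NN")]


def match_tag_pattern(tagged_tokens, pattern):
    """Return the word spans whose POS tags match `pattern`.

    `pattern` is an ordinary regular expression over a string such as
    "<DT><NN><VBD>", where every tag is wrapped in angle brackets.
    """
    tag_string = "".join(f"<{pos}>" for _, pos in tagged_tokens)
    # Character offset at which each token's "<TAG>" begins.
    starts, offset = [], 0
    for _, pos in tagged_tokens:
        starts.append(offset)
        offset += len(pos) + 2
    spans = []
    for m in re.finditer(pattern, tag_string):
        first = starts.index(m.start())
        last = max(i for i, s in enumerate(starts) if s < m.end())
        spans.append([w for w, _ in tagged_tokens[first:last + 1]])
    return spans


np_spans = match_tag_pattern(tagged, r"<DT><NN>")
longer_spans = match_tag_pattern(tagged, r"<DT>(?:<NN>)+")
print(np_spans)
print(longer_spans)
```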

### Chinking Rules

A chinking rule is defined by inverting the braces that surround the regular expression: `}...{` instead of `{...}`.

In [None]:
grammar = r"""
    NP: {<.*>+}             # Chunk everything
        }<VBZ|RB|VBG|IN>{   # Chink (remove) tokens with these tags
"""
rule3_chunker = nltk.RegexpParser(grammar)
rule3_chunker.parse(sample_pos)
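
The effect of this chunk-then-chink grammar can be mirrored in a few lines of plain Python: start with everything in one chunk and cut it wherever a token carries one of the chinked tags. The `chink` helper and the sentence are hypothetical, and the sketch only reproduces the outcome for this simple case, not the general rule machinery.

```python
def chink(tagged_tokens, chink_tags):
    """Chunk every token, then remove tokens whose tag is in `chink_tags`.

    Returns the remaining chunks, each a list of (word, pos) pairs.
    Removing a token from the middle of a chunk splits it in two.
    """
    chunks, current = [], []
    for word, pos in tagged_tokens:
        if pos in chink_tags:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:
        chunks.append(current)
    return chunks


# Hand-written POS-tagged sentence, for illustration only.
tagged = [("the", "DT"), ("dog", "NN"), ("is", "VBZ"), ("barking", "VBG"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]
np_chunks = chink(tagged, {"VBZ", "RB", "VBG", "IN"})
print(np_chunks)
```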

## Machine Learning Based Chunking

This part uses N-gram chunkers and a CRF chunker to show how data-driven models can be created for chunking. The focus is on `NP` chunking.

### Loading and Splitting the Data

The data is split into a train set and a test set. A validation set is not created since the models are not tuned here. The goal is only to show how well a basic implementation of each algorithm generalizes.

*   TRAIN: 80%
*   TEST: 20%

In [None]:
conll2000_data = nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=['NP'])
conll2000_train, conll2000_test = train_test_split(conll2000_data, test_size=0.2, random_state=0)
len(conll2000_train), len(conll2000_test)

### Baseline Models

Two baseline chunkers are evaluated to put the later results in perspective: one that outputs no chunks at all, and a simple rule that labels any sequence of tokens whose POS tags start with `D` or `N` as an `NP`.

In [None]:
grammar = r""""""
empty_chunker = nltk.RegexpParser(grammar)
print(empty_chunker.accuracy(conll2000_test))

The example is not rendered as a tree because the parser produces no chunks.

In [None]:
empty_chunker.parse(sample_pos)

In [None]:
grammar = r"""
    NP: {<[DN].*>+}
"""
simple_chunker = nltk.RegexpParser(grammar)
print(simple_chunker.accuracy(conll2000_test))

The example is rendered as a tree because the parser produces chunks.

In [None]:
simple_chunker.parse(sample_pos)
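
When comparing baselines it helps to separate token-level accuracy from chunk-level precision and recall, since a chunker that outputs nothing still gets every `O` token right. The sketch below computes both on hand-written IOB sequences for a single chunk type. The `chunk_spans` and `scores` helpers are assumptions for illustration and not NLTK's scoring code.

```python
def chunk_spans(iob_tags):
    """Return the set of (start, end) index spans of chunks in an IOB sequence.

    Written for a single chunk type, as in NP chunking.
    """
    spans, start = set(), None
    for i, tag in enumerate(iob_tags):
        if tag.startswith("B") or tag == "O":
            if start is not None:
                spans.add((start, i))
            start = i if tag.startswith("B") else None
        elif start is None:  # stray I- tag: treat it as opening a chunk
            start = i
    if start is not None:
        spans.add((start, len(iob_tags)))
    return spans


def scores(gold, pred):
    """Return (token accuracy, chunk precision, chunk recall)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans, pred_spans = chunk_spans(gold), chunk_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return accuracy, precision, recall


# Hand-written sequences, for illustration only.
gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]
no_chunks = ["O"] * len(gold)
one_slip = ["B-NP", "I-NP", "O", "B-NP", "O", "O"]

empty_scores = scores(gold, no_chunks)
slip_scores = scores(gold, one_slip)
print(empty_scores)
print(slip_scores)
```

In this toy case the empty prediction is right on a third of the tokens while finding no chunks, and a single wrong token in `one_slip` costs a whole chunk.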

### Helper Functions

While it is easy to work with the functionality NLTK provides, this part defines several helper functions that allow taggers to be used as chunkers and allow the data to be passed to other machine learning libraries.

*   `tree2ngram`: Converts the data so that N-gram taggers can be used as chunkers. Transforms each token to `(pos_tag, iob)` or `pos_tag`, depending on the `label` flag.
*   `tree2crf`: Converts the data so that the CRF tagger can be used as a chunker. Transforms each token to `((word, pos_tag), iob)` or `(word, pos_tag)`, depending on the `label` flag.
*   `tree2metric`: Collects the IOB tags of all tokens into one flat list for use with the Scikit-Learn metrics functions.

In [None]:
def tree2ngram(data, label):
    if label:
        def func(item):
            _, pos, iob = item
            return (pos, iob)
    else:
        def func(item):
            return item[1]

    return [
        [func(item) for item in nltk.chunk.tree2conlltags(sent)]
        for sent in data
    ]

def tree2crf(data, label):
    if label:
        def func(item):
            w, pos, iob = item
            return ((w, pos), iob)
    else:
        def func(item):
            w, pos, _ = item
            return (w, pos)

    return [
        [func(item) for item in nltk.chunk.tree2conlltags(sent)]
        for sent in data
    ]
    
def tree2metric(data):
    return [
        word[-1] for sent in data
        for word in (
            nltk.chunk.tree2conlltags(sent) if isinstance(sent, nltk.tree.Tree)
            else sent
        )
    ]
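
To make the shapes concrete, the snippet below applies the same per-token transformations to one hand-written sentence that is already in `(word, pos, iob)` form. It skips the tree conversion, so it only illustrates the output layouts of the helpers and does not replace them.

```python
# One sentence as (word, pos, iob) triples; hand-written for illustration.
triples = [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")]

# Layout produced by tree2ngram with label=True and label=False.
ngram_labelled = [(pos, iob) for _, pos, iob in triples]
ngram_unlabelled = [pos for _, pos, _ in triples]

# Layout produced by tree2crf with label=True and label=False.
crf_labelled = [((w, pos), iob) for w, pos, iob in triples]
crf_unlabelled = [(w, pos) for w, pos, _ in triples]

# Layout produced by tree2metric: only the last element of each token.
flat_iob = [item[-1] for item in triples]

print(ngram_labelled)
print(crf_labelled)
print(flat_iob)
```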

### N-Gram Chunkers

Instead of using the taggers to assign POS tags to words, the words are replaced by the POS tags as features and the POS tags are replaced by the IOB tags as targets. This transformation of the data is done with the `tree2ngram` helper function.

In [None]:
ngram_conll2000_train = tree2ngram(conll2000_train, label=True)
ngram_conll2000_test = tree2ngram(conll2000_test, label=False)
ngram_conll2000_true = tree2metric(conll2000_test)

In [None]:
unigram_chunker = nltk.UnigramTagger(ngram_conll2000_train)

unigram_chunker_res = unigram_chunker.tag_sents(ngram_conll2000_test)
unigram_chunker_pred = tree2metric(unigram_chunker_res)

print(classification_report(ngram_conll2000_true, unigram_chunker_pred))
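
What a unigram model learns from this representation can be sketched in plain Python: for every POS tag, remember the IOB tag it was seen with most often. The `train_unigram` and `tag_unigram` helpers and the toy data are assumptions for illustration and are not NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_unigram(train_sents):
    """Map each POS tag to the IOB tag it co-occurs with most often."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def tag_unigram(model, pos_tags):
    """Tag a sentence of POS tags; unseen POS tags get None."""
    return [(pos, model.get(pos)) for pos in pos_tags]


# Toy training data in the (pos, iob) layout; hand-written for illustration.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBD", "O")],
    [("NN", "B-NP"), ("VBD", "O")],
]
model = train_unigram(train)
tagged = tag_unigram(model, ["NN", "VBD", "DT", "NN", "RB"])
print(tagged)
```

The sentence-initial `NN` is tagged `I-NP` because the model has no context to tell it apart from an `NN` inside a chunk, which is the kind of error a bigram model can reduce. The unseen `RB` gets `None`. A tagger without a backoff behaves in a comparable way for unseen contexts, which is worth keeping in mind when the predictions are passed on to metric functions.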

In [None]:
bigram_chunker = nltk.BigramTagger(ngram_conll2000_train)

bigram_chunker_res = bigram_chunker.tag_sents(ngram_conll2000_test)
bigram_chunker_pred = tree2metric(bigram_chunker_res)

print(classification_report(ngram_conll2000_true, bigram_chunker_pred))

### CRF Chunker

To use the CRF chunker, each word is replaced by a tuple of the word and its POS tag. Since the CRF now receives different data than the default (the word only), a function that creates the features must be defined.

The data is converted into this tuple format using the `tree2crf` function. The CRF feature function must then work with this data format, treating each token as a tuple.
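
A feature function for this setup receives the whole sentence plus an index and returns a list of string features for that position. The sketch below uses a hypothetical `tag_window_features` function and a hand-written sentence to show what comes out at the sentence boundaries. One detail is easy to miss in functions of this kind: `tokens[idx - 1]` with `idx == 0` does not raise `IndexError` in Python, it reads the last token, so the bounds are checked explicitly here.

```python
def tag_window_features(tokens, idx):
    """Return string features built from the POS tags around position idx.

    `tokens` is a list of (word, pos) tuples. Bounds are checked explicitly
    because a negative index would wrap around to the end of the sentence.
    """
    features = [f"TAG_{tokens[idx][1]}"]
    if idx > 0:
        features.append(f"TAG-1_{tokens[idx - 1][1]}")
    if idx < len(tokens) - 1:
        features.append(f"TAG+1_{tokens[idx + 1][1]}")
    return features


# Hand-written sentence, for illustration only.
sentence = [("the", "DT"), ("dog", "NN"), ("barked", "VBD")]
first = tag_window_features(sentence, 0)
middle = tag_window_features(sentence, 1)
last = tag_window_features(sentence, 2)
print(first, middle, last)
```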

In [None]:
crf_conll2000_train = tree2crf(conll2000_train, label=True)
crf_conll2000_test = tree2crf(conll2000_test, label=False)
crf_conll2000_true = tree2metric(conll2000_test)

In [None]:
def custom_crf_features(tokens, idx):
    feature_list = []

    # Check bounds explicitly: tokens[idx-1] with idx == 0 would not raise
    # IndexError, it would silently read the last token of the sentence.
    has_prev = idx > 0
    has_next = idx < len(tokens) - 1

    # NEIGHBOR TAGS
    feature_list.append(f'TAG_{tokens[idx][1]}')
    if has_prev:
        feature_list.append(f'TAG-1_{tokens[idx-1][1]}')
    if has_next:
        feature_list.append(f'TAG+1_{tokens[idx+1][1]}')
    if has_prev and has_next:
        feature_list.append(f'TAG-1+1_{tokens[idx-1][1]}_{tokens[idx+1][1]}')

    return feature_list

In [None]:
crf_chunker = nltk.crf.CRFTagger(feature_func=custom_crf_features)
crf_chunker.train(crf_conll2000_train, '../models/crf_chunker.tag')

crf_chunker_res = crf_chunker.tag_sents(crf_conll2000_test)
crf_chunker_pred = tree2metric(crf_chunker_res)

print(classification_report(crf_conll2000_true, crf_chunker_pred))

## Named Entity Recognition

While tagging noun phrases from POS tags alone may give good results, named entity recognition requires a finer level of detail. Word features can be seen to improve the results significantly, as they provide more context about the word in use.
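
Why POS tags alone are not enough here can be shown with a toy comparison: two proper nouns of different entity types share the same POS tag, so a POS-only feature set cannot tell them apart. The sentences, the tag names, and the two feature functions below are hand-written assumptions for illustration.

```python
def pos_features(tokens, idx):
    """Features from the POS tag only."""
    return [f"TAG_{tokens[idx][1]}"]


def pos_word_features(tokens, idx):
    """Features from the POS tag and the word itself."""
    return pos_features(tokens, idx) + [f"WORD_{tokens[idx][0]}"]


# A place name and a company name, both tagged as proper nouns.
# The tag names are illustrative.
sent_a = [("en", "SP"), ("Madrid", "NP")]
sent_b = [("de", "SP"), ("Iberia", "NP")]

same_with_pos = pos_features(sent_a, 1) == pos_features(sent_b, 1)
same_with_words = pos_word_features(sent_a, 1) == pos_word_features(sent_b, 1)
print(same_with_pos, same_with_words)
```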

In [None]:
conll2002_data = nltk.corpus.conll2002.chunked_sents('esp.train')
conll2002_train, conll2002_test = train_test_split(conll2002_data, test_size=0.2, random_state=0)
len(conll2002_train), len(conll2002_test)

In [None]:
conll2002_data[0]

In [None]:
crf_conll2002_train = tree2crf(conll2002_train, label=True)
crf_conll2002_test = tree2crf(conll2002_test, label=False)
crf_conll2002_true = tree2metric(conll2002_test)

In [None]:
def ner_pos_features(tokens, idx):
    feature_list = []

    # Check bounds explicitly: tokens[idx-1] with idx == 0 would not raise
    # IndexError, it would silently read the last token of the sentence.
    has_prev = idx > 0
    has_next = idx < len(tokens) - 1

    # NEIGHBOR TAGS
    feature_list.append(f'TAG_{tokens[idx][1]}')
    if has_prev:
        feature_list.append(f'TAG-1_{tokens[idx-1][1]}')
    if has_next:
        feature_list.append(f'TAG+1_{tokens[idx+1][1]}')
    if has_prev and has_next:
        feature_list.append(f'TAG-1+1_{tokens[idx-1][1]}_{tokens[idx+1][1]}')

    return feature_list

In [None]:
crf_pos_ner = nltk.crf.CRFTagger(feature_func=ner_pos_features)
crf_pos_ner.train(crf_conll2002_train, '../models/crf_ner_pos.tag')

crf_pos_ner_res = crf_pos_ner.tag_sents(crf_conll2002_test)
crf_pos_ner_pred = tree2metric(crf_pos_ner_res)

print(classification_report(crf_conll2002_true, crf_pos_ner_pred))

In [None]:
confusion_matrix(crf_conll2002_true, crf_pos_ner_pred)

In [None]:
def ner_pos_word_features(tokens, idx):
    feature_list = []

    # Check bounds explicitly: tokens[idx-1] with idx == 0 would not raise
    # IndexError, it would silently read the last token of the sentence.
    has_prev = idx > 0
    has_next = idx < len(tokens) - 1

    # NEIGHBOR TAGS
    feature_list.append(f'TAG_{tokens[idx][1]}')
    if has_prev:
        feature_list.append(f'TAG-1_{tokens[idx-1][1]}')
    if has_next:
        feature_list.append(f'TAG+1_{tokens[idx+1][1]}')
    if has_prev and has_next:
        feature_list.append(f'TAG-1+1_{tokens[idx-1][1]}_{tokens[idx+1][1]}')

    # WORDS
    feature_list.append(f'WORD_{tokens[idx][0]}')
    if has_prev:
        feature_list.append(f'WORD-1_{tokens[idx-1][0]}')
    if has_next:
        feature_list.append(f'WORD+1_{tokens[idx+1][0]}')

    # SUFFIX
    token = tokens[idx][0]
    if len(token) > 1:
        feature_list.append("SUF_" + token[-1:])
    if len(token) > 2:
        feature_list.append("SUF_" + token[-2:])
    if len(token) > 3:
        feature_list.append("SUF_" + token[-3:])

    return feature_list

In [None]:
crf_pos_word_ner = nltk.crf.CRFTagger(feature_func=ner_pos_word_features)
crf_pos_word_ner.train(crf_conll2002_train, '../models/crf_ner_pos+word.tag')

crf_pos_word_ner_res = crf_pos_word_ner.tag_sents(crf_conll2002_test)
crf_pos_word_ner_pred = tree2metric(crf_pos_word_ner_res)

print(classification_report(crf_conll2002_true, crf_pos_word_ner_pred))
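
The suffix part of the feature function above can be looked at in isolation. The hypothetical `suffix_features` helper below follows the same rule: it emits suffixes of length 1 to 3, and only those that are strictly shorter than the token.

```python
def suffix_features(token, max_len=3):
    """Return SUF_ features for suffixes of length 1..max_len.

    A suffix is only emitted when the token is longer than the suffix.
    """
    return [f"SUF_{token[-n:]}" for n in range(1, max_len + 1) if len(token) > n]


long_word = suffix_features("Madrid")
short_word = suffix_features("de")
print(long_word)
print(short_word)
```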

In [None]:
confusion_matrix(crf_conll2002_true, crf_pos_word_ner_pred)
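
Scikit-Learn's `confusion_matrix` returns a bare array whose rows are the true labels and whose columns are the predicted labels, in sorted label order when no `labels` argument is given. To make that layout easier to read, the sketch below rebuilds the same kind of table with a `Counter` on a few hand-written tags and prints the label next to each row. The data is illustrative only.

```python
from collections import Counter

# Hand-written true and predicted tags, for illustration only.
y_true = ["B-LOC", "O", "B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["B-LOC", "O", "B-LOC", "O", "O", "B-PER"]

labels = sorted(set(y_true) | set(y_pred))
pairs = Counter(zip(y_true, y_pred))

# Rows: true label. Columns: predicted label, in the order of `labels`.
matrix = [[pairs[(t, p)] for p in labels] for t in labels]

print("columns:", labels)
for label, row in zip(labels, matrix):
    print(f"{label:>6}", row)
```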