# Natural Language Processing

## Window Classifier for NER

## 1. Load data

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:
    
    [PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in [ORG Real Madrid ] .

The shared task of CoNLL-2002 (https://www.clips.uantwerpen.be/conll2002/ner/) concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The data consists of two columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word and the second is the named entity tag. The tags have the same format as in the chunking task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). Note that the NLTK reader used below, iob_sents(), yields (word, part-of-speech tag, entity tag) triples, so the loading code discards the middle element.
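
To make the B/I convention concrete, here is a minimal sketch that groups tagged tokens back into entity phrases. The function name `iob_to_spans` and the toy tags are illustrative assumptions, not part of the corpus tooling; a stray I- tag without a preceding B- is simply treated as the start of a phrase.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged tokens into (entity_type, phrase) pairs.

    Assumes the scheme described above: B-X opens a phrase of type X,
    I-X continues it, and O marks tokens outside any entity.
    """
    spans, current_type, current_words = [], None, []
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition('-')
        if prefix == 'I' and etype == current_type:
            current_words.append(word)        # continue the open phrase
            continue
        if current_type is not None:          # close the phrase that was open
            spans.append((current_type, ' '.join(current_words)))
        if prefix in ('B', 'I'):              # open a new phrase
            current_type, current_words = etype, [word]
        else:                                 # 'O': outside any entity
            current_type, current_words = None, []
    if current_type is not None:              # flush a phrase ending the sentence
        spans.append((current_type, ' '.join(current_words)))
    return spans

# Hand-tagged toy sentence (illustrative, not read from the corpus)
toy_words = ['Wolff', 'played', 'with', 'Del', 'Bosque', 'in', 'Real', 'Madrid']
toy_tags  = ['B-PER', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG']
print(iob_to_spans(toy_words, toy_tags))
```

The window classifier in this notebook predicts one tag per token, so a grouping step like this is only needed when you want whole entity phrases back.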

In [1]:
import nltk
nltk.download('conll2002')

[nltk_data] Downloading package conll2002 to
[nltk_data]     /Users/chaklam/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


True

In [2]:
corpus = nltk.corpus.conll2002.iob_sents()

In [3]:
data = []
for cor in corpus:
    sent, _, tag = list(zip(*cor))
    data.append([sent, tag])

In [4]:
# data[0]

## 2. Tokenization

The dataset is already tokenized, so we can skip this step.

## 3. Numericalization

We need two separate id mappings: one for the vocabulary (the model input) and one for the tags (the prediction targets).

In [5]:
flatten = lambda l: [item for sublist in l for item in sublist]

sents, tags = list(zip(*data))
vocab  = list(set(flatten(sents)))
tagset = list(set(flatten(tags)))

In [6]:
tagset

['I-LOC', 'B-PER', 'I-ORG', 'B-LOC', 'I-PER', 'B-MISC', 'I-MISC', 'B-ORG', 'O']

In [7]:
word2index = {'<UNK>': 0, '<DUMMY>': 1}
for v in vocab:
    if word2index.get(v) is None:
        word2index[v] = len(word2index)
index2word = {v:k for k, v in word2index.items()}

tag2index = {}
for v in tagset:
    if tag2index.get(v) is None:
        tag2index[v] = len(tag2index)
index2tag = {v:k for k, v in tag2index.items()}
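
The same mapping can be tried on a toy corpus to see how unseen words fall back to `<UNK>`. This is a small sketch; the `toy_` names and the `encode` helper are illustrative assumptions. It sorts the vocabulary first, because `list(set(...))` does not guarantee a stable order, so ids can differ between runs.

```python
toy_sents = [('Wolff', 'played', 'in', 'Madrid'), ('Madrid', 'won')]

# Sorting makes the ids reproducible across runs
toy_vocab = sorted({w for sent in toy_sents for w in sent})

toy_word2index = {'<UNK>': 0, '<DUMMY>': 1}
for w in toy_vocab:
    toy_word2index.setdefault(w, len(toy_word2index))
toy_index2word = {i: w for w, i in toy_word2index.items()}

def encode(tokens, word2index):
    """Map tokens to ids, falling back to the <UNK> id for unseen words."""
    return [word2index.get(t, word2index['<UNK>']) for t in tokens]

print(toy_word2index)
print(encode(['Wolff', 'scored', 'in', 'Madrid'], toy_word2index))  # 'scored' is unseen
```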

## 4. Prepare window data

<img src="../figures/ner_win.png" width="400">

In [8]:
window_size = 2
windows = []

In [9]:
for sample in data:
    dummy = ['<DUMMY>'] * window_size
    text, sent_tags = sample #sentence tokens and their tags
    padded_text = dummy + list(text) + dummy
    window = list(nltk.ngrams(padded_text, window_size * 2 + 1))
    
    #one [window, tag] pair per token; text[i] sits at the center of window[i]
    windows.extend([[list(window[i]), sent_tags[i]] for i in range(len(text))])
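
To see what a window looks like, the sketch below builds the same padded windows for one toy sentence with plain list slicing instead of nltk.ngrams. The helper name `make_windows` and the toy sentence are illustrative assumptions.

```python
def make_windows(tokens, tags, window_size, pad='<DUMMY>'):
    """Return one (window, tag) pair per token, padding both sentence ends."""
    padded = [pad] * window_size + list(tokens) + [pad] * window_size
    width = 2 * window_size + 1
    # The window starting at padded[i] has tokens[i] as its center word
    return [(padded[i:i + width], tags[i]) for i in range(len(tokens))]

toy_tokens = ['Wolff', 'lives', 'in', 'Argentina']
toy_labels = ['B-PER', 'O', 'O', 'B-LOC']
toy_windows = make_windows(toy_tokens, toy_labels, window_size=2)

for window, tag in toy_windows:
    print(window, '->', tag)
```

Each token yields exactly one window of length `2 * window_size + 1`, and the label is the tag of the center word only.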

In [10]:
windows = windows[:10000] #keep only the first 10,000 windows, a small subset of the corpus

In [11]:
import random
random.shuffle(windows)

train_data = windows[:int(len(windows) * 0.9)]
test_data  = windows[int(len(windows) * 0.9):]

### 4.1 Prepare batch

In [12]:
import torch
def prepare_sequence(seq, word2index):
    idxs = [word2index.get(w, word2index["<UNK>"]) for w in seq] #unseen words map to <UNK>
    return torch.LongTensor(idxs)

def prepare_tag(tag,tag2index):
    return torch.LongTensor([tag2index[tag]])

In [13]:
def getBatch(batch_size, train_data):
    #shuffle in place, then yield consecutive slices; the last batch may be smaller
    random.shuffle(train_data)
    for sindex in range(0, len(train_data), batch_size):
        yield train_data[sindex:sindex + batch_size]

## 5. Modeling

<img src="../figures/ner_model.png" width="600">

In [14]:
import torch.nn as nn

class NER(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size, window_size, out_size):
        super(NER, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.layer1 = nn.Linear(embed_size * (window_size*2+1), hidden_size)
        self.layer2 = nn.Linear(hidden_size, out_size) #unnormalized score (logit) for each tag
        self.relu   = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, inputs):
        embeds = self.embed(inputs) #(batch_size, 5, emb_size)
        embeds = embeds.view(-1, embeds.size(1) * embeds.size(2)) #(batch_size, 5 * emb_size)
        h0 = self.dropout(self.relu(self.layer1(embeds)))
        out = self.layer2(h0)
        return out
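
As a sanity check on the layer shapes, the following sketch counts the trainable parameters of this architecture by hand. The function name `count_ner_parameters` and the vocabulary size of 1000 are illustrative assumptions; the real vocabulary size is `len(word2index)`.

```python
def count_ner_parameters(vocab_size, embed_size, hidden_size, window_size, out_size):
    """Count weights and biases of embedding -> linear -> linear."""
    window_len = 2 * window_size + 1                 # tokens per window
    flat_dim = embed_size * window_len               # size after flattening the embeddings
    return {
        'flat_dim': flat_dim,
        'embed': vocab_size * embed_size,            # one vector per vocabulary entry
        'layer1': flat_dim * hidden_size + hidden_size,  # weights + biases
        'layer2': hidden_size * out_size + out_size,     # weights + biases
    }

counts = count_ner_parameters(vocab_size=1000, embed_size=4, hidden_size=8,
                              window_size=2, out_size=9)
counts['total'] = counts['embed'] + counts['layer1'] + counts['layer2']
print(counts)
```

With a realistic vocabulary, the embedding table holds the large majority of the parameters; the two linear layers are small by comparison.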
        

## 6. Training 

Training takes a while if you only use the CPU.

In [15]:
batch_size = 2
embed_size = 4
hidden_size = 8
num_epochs  = 5

In [16]:
import torch.optim as optim

model = NER(len(word2index), embed_size, hidden_size, window_size, len(tag2index))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = 0.001)
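
nn.CrossEntropyLoss expects the raw scores (logits) returned by the model: it applies log-softmax and then takes the negative log-likelihood of the correct class, averaged over the batch by default. The sketch below reproduces that computation in plain Python for illustration; the helper names `cross_entropy` and `batch_loss` are assumptions, not PyTorch functions.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)                                   # subtract the max to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def batch_loss(batch_logits, targets):
    """Mean loss over a batch, matching the default 'mean' reduction."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# With 9 tags and identical scores, the loss equals math.log(9)
uniform_loss = cross_entropy([0.0] * 9, target=3)
print(uniform_loss, math.log(9))

# A confident, correct prediction gives a loss close to zero
print(batch_loss([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0]], [0, 1]))
```

A model that has learned nothing about nine tags should therefore start near a loss of `math.log(9)`, which is a useful reference point when reading the training log.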

In [17]:
import numpy as np

model.train() #turn on dropout

for epoch in range(num_epochs):
    losses = []
    for i, batch in enumerate(getBatch(batch_size, train_data)):
        
        x,y = list(zip(*batch))
        
        inputs  = torch.cat([prepare_sequence(sent, word2index).view(1, -1) for sent in x])
        targets = torch.cat([prepare_tag(tag, tag2index) for tag in y])
        
        preds = model(inputs)
        loss  = criterion(preds, targets)
        losses.append(loss.item())
        
        model.zero_grad()
        loss.backward()
        optimizer.step()
        
    print(f"Epoch: {epoch + 1} | Batch: {i: 5.0f} | Loss: {np.mean(losses):.6f}")

Epoch: 1 | Batch:  4499 | Loss: 0.876453
Epoch: 2 | Batch:  4499 | Loss: 0.676201
Epoch: 3 | Batch:  4499 | Loss: 0.622139
Epoch: 4 | Batch:  4499 | Loss: 0.565712
Epoch: 5 | Batch:  4499 | Loss: 0.521096


## 7. Test 

In [29]:
for_f1_score = []

In [30]:
accuracy = 0

model.eval() #this will turn off dropout

with torch.no_grad(): #no gradients are needed for evaluation
    for test in test_data:
        x, y = test[0], test[1]
        inputs = prepare_sequence(x, word2index).view(1, -1)
        preds = model(inputs) #(1, number of tags), unnormalized scores
        
        i = preds.max(1)[1] #index of the highest-scoring tag
        pred = index2tag[i.item()]
        for_f1_score.append([pred, y])
        if pred == y:
            accuracy += 1
    
print(accuracy / len(test_data) * 100)

87.3


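This accuracy looks high only because most tokens are tagged 'O', so a model that predicts 'O' almost everywhere already scores well. Accuracy therefore says little about how well entities are recognized, and we need per-class precision, recall and F1 score.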
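
The effect is easy to reproduce on a toy example. The sketch below computes per-class precision, recall and F1 by hand for a predictor that always outputs 'O'; the data and the helper name `per_class_prf` are illustrative assumptions, not results from this model.

```python
def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, using 0.0 where undefined."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)      # how often the label was predicted
        n_true = sum(t == label for t in y_true)      # how often the label is correct
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return report

toy_true = ['O'] * 8 + ['B-PER', 'B-LOC']
toy_pred = ['O'] * 10                                 # always predict the majority tag

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_report = per_class_prf(toy_true, toy_pred)
toy_macro_f1 = sum(f1 for _, _, f1 in toy_report.values()) / len(toy_report)

print('accuracy:', toy_accuracy)
print('macro F1:', toy_macro_f1)
for label, scores in toy_report.items():
    print(label, scores)
```

Accuracy stays high while the macro-averaged F1 is low, because every entity class scores zero.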

### f1-score

In [31]:
yhat, y = list(zip(*for_f1_score))

In [35]:
set(yhat)

{'O'}

In [36]:
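from sklearn import metrics

# classification_report expects (y_true, y_pred). The report printed below came
# from a run with the two arguments swapped, so its precision and recall columns
# are exchanged and "support" counts predicted labels rather than gold labels.
print(metrics.classification_report(y, yhat))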

              precision    recall  f1-score   support

       B-LOC       0.00      0.00      0.00         0
      B-MISC       0.00      0.00      0.00         0
       B-ORG       0.00      0.00      0.00         0
       B-PER       0.00      0.00      0.00         0
       I-LOC       0.00      0.00      0.00         0
      I-MISC       0.00      0.00      0.00         0
       I-ORG       0.00      0.00      0.00         0
       I-PER       0.00      0.00      0.00         0
           O       1.00      0.87      0.93      1000

    accuracy                           0.87      1000
   macro avg       0.11      0.10      0.10      1000
weighted avg       1.00      0.87      0.93      1000



