## NLP Task 2: Artur Xarles & Enric Azuara - Train notebook

### Named Entity Recognition

Import necessary packages and functions

In [1]:
import pandas as pd
import numpy as np
import skseq
import skseq.readers.pos_corpus
import skseq.sequences.structured_perceptron as spc
from skseq.sequences.extended_feature import ExtendedFeatures
from skseq.sequences.sequence_list import SequenceList
from skseq.sequences.label_dictionary import LabelDictionary
from tqdm import tqdm
from sklearn.metrics import confusion_matrix, f1_score
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
from utils import *

Read the train and test data. The train data is used to fit the different models:

In [3]:
train = pd.read_csv('./data/train_data_ner.csv')
test = pd.read_csv('./data/test_data_ner.csv')
train.head()

Unnamed: 0,sentence_id,words,tags
0,0,Thousands,O
1,0,of,O
2,0,demonstrators,O
3,0,have,O
4,0,marched,O


### Build a corpus from the train data that can construct sequences of words/tags. The Corpus class is declared in the utils.py file

In [4]:
# Define the corpus, build it from the train data, and read the train sequences
corpus = Corpus()
corpus.build_corpus(train)
train_seq = corpus.read_sequence(train)
    

100%|███████████████████████████████████████████████████████████████████████████| 38366/38366 [02:17<00:00, 278.69it/s]


Check that the output of our function is correct for the first sentence:

In [5]:
print(train_seq[0])
print(train.tags[train.sentence_id == 0].values)

0/0 1/0 2/0 3/0 4/0 5/0 6/1 7/0 8/0 9/0 10/0 11/0 12/1 13/0 14/0 9/0 15/0 1/0 16/2 17/0 18/0 19/0 20/0 21/0 
['O' 'O' 'O' 'O' 'O' 'O' 'B-geo' 'O' 'O' 'O' 'O' 'O' 'B-geo' 'O' 'O' 'O'
 'O' 'O' 'B-gpe' 'O' 'O' 'O' 'O' 'O']


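We can see that it returns the correct values for the tags: each token is printed as word ID/tag ID, and the tag IDs 1 and 2 appear exactly where B-geo and B-gpe appear in the original data.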
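The Corpus class itself lives in utils.py and is not shown here. As a rough sketch of the idea only, the snippet below assigns integer IDs to words and tags in order of first appearance and groups rows by `sentence_id`. The helper names `build_dicts` and `to_sequences` and the toy DataFrame are hypothetical and are not taken from utils.py.

```python
import pandas as pd

def build_dicts(df):
    """Assign an integer ID to every word and tag, in order of first appearance."""
    word_dict, tag_dict = {}, {}
    for word, tag in zip(df["words"], df["tags"]):
        word_dict.setdefault(word, len(word_dict))
        tag_dict.setdefault(tag, len(tag_dict))
    return word_dict, tag_dict

def to_sequences(df, word_dict, tag_dict):
    """Turn each sentence into a pair (word IDs, tag IDs)."""
    sequences = []
    for _, group in df.groupby("sentence_id", sort=True):
        x = [word_dict[w] for w in group["words"]]
        y = [tag_dict[t] for t in group["tags"]]
        sequences.append((x, y))
    return sequences

# Toy data with the same columns as train_data_ner.csv (values are illustrative).
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "words": ["Thousands", "of", "demonstrators", "of", "London"],
    "tags": ["O", "O", "O", "O", "B-geo"],
})
word_dict, tag_dict = build_dicts(toy)
sequences = to_sequences(toy, word_dict, tag_dict)
print(sequences)
```

A repeated word such as "of" keeps the ID it received the first time, which matches the repeated IDs visible in the printed first sentence above.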

In [6]:
corpus.tag_dict

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-tim': 3,
 'B-org': 4,
 'I-geo': 5,
 'B-per': 6,
 'I-per': 7,
 'I-org': 8,
 'B-art': 9,
 'I-art': 10,
 'I-tim': 11,
 'I-gpe': 12,
 'B-nat': 13,
 'I-nat': 14,
 'B-eve': 15,
 'I-eve': 16}

The output above shows the constructed dictionary, which maps each tag to an integer ID.
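To read predictions back as tag names, the dictionary can be inverted. This is a minimal sketch that uses only the first three entries of the dictionary above, with a made-up list of predicted IDs:

```python
# First three entries of the tag dictionary shown above.
tag_dict = {"O": 0, "B-geo": 1, "B-gpe": 2}

# Invert the mapping: integer ID -> tag name.
id_to_tag = {tag_id: tag for tag, tag_id in tag_dict.items()}

# Hypothetical predicted tag IDs for a five-word sentence.
predicted_ids = [0, 0, 1, 0, 2]
decoded = [id_to_tag[i] for i in predicted_ids]
print(decoded)
```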

### Structured perceptron without added features

We use the provided IDFeatures class to create a set of features for the train sequences.

In [7]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
feature_mapper.build_features()
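
As an illustration of what identity features look like, the sketch below lists one emission feature per (word, tag) pair, one transition feature per pair of consecutive tags, and initial and final tag features. The function `id_features` and the exact feature-name strings are assumptions made for this example and may not match the names skseq uses internally.

```python
def id_features(words, tags):
    """List identity-style features for one tagged sentence (illustrative only)."""
    feats = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        # Emission feature: this word observed with this tag.
        feats.append(f"id:{word}::{tag}")
        if i == 0:
            # Initial feature: the tag that starts the sentence.
            feats.append(f"init_tag:{tag}")
        else:
            # Transition feature: previous tag followed by the current tag.
            feats.append(f"prev_tag:{tags[i - 1]}::{tag}")
    if tags:
        # Final feature: the tag that ends the sentence.
        feats.append(f"final_prev_tag:{tags[-1]}")
    return feats

feats = id_features(["marched", "through", "London"], ["O", "O", "B-geo"])
for f in feats:
    print(f)
```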

Once the features are constructed, we can initialize the structured perceptron class and train the model. As input it needs the word dictionary, the tag dictionary, and the feature mapper, which determines the features for each sequence. We train it for 15 epochs.

In [8]:
sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)

In [9]:
%%time
num_epochs = 15
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
Epoch: 1 Accuracy: 0.931674
Epoch: 2 Accuracy: 0.940913
Epoch: 3 Accuracy: 0.946175
Epoch: 4 Accuracy: 0.950018
Epoch: 5 Accuracy: 0.952577
Epoch: 6 Accuracy: 0.954425
Epoch: 7 Accuracy: 0.956033
Epoch: 8 Accuracy: 0.957185
Epoch: 9 Accuracy: 0.958481
Epoch: 10 Accuracy: 0.959217
Epoch: 11 Accuracy: 0.960524
Epoch: 12 Accuracy: 0.961121
Epoch: 13 Accuracy: 0.961207
Epoch: 14 Accuracy: 0.961983
Wall time: 2h 4min 40s
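
For intuition about what each epoch does, here is a simplified structured perceptron written from scratch with NumPy. It is a sketch under stated assumptions, not the skseq implementation: it uses only emission and transition weights (no initial or final features, no weight averaging), and the names `viterbi`, `perceptron_epoch` and the toy data are hypothetical. Each sentence is decoded with Viterbi, and wherever the prediction differs from the gold tags, the weights of the gold features are increased and those of the predicted features are decreased.

```python
import numpy as np

def viterbi(emission, transition, words):
    """Return the highest-scoring tag sequence for a list of word IDs."""
    n, k = len(words), transition.shape[0]
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = emission[words[0]]
    for t in range(1, n):
        # cand[i, j]: best score ending in tag i at t-1, then moving to tag j.
        cand = score[t - 1][:, None] + transition
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emission[words[t]]
    # Follow the back-pointers from the best final tag.
    tags = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

def perceptron_epoch(data, emission, transition):
    """One pass over the data; updates the weights in place, returns accuracy."""
    mistakes = total = 0
    for words, gold in data:
        pred = viterbi(emission, transition, words)
        for t, (w, g, p) in enumerate(zip(words, gold, pred)):
            if g != p:
                mistakes += 1
                emission[w, g] += 1   # reward the gold emission feature
                emission[w, p] -= 1   # penalize the predicted one
            if t > 0 and (gold[t - 1], g) != (pred[t - 1], p):
                transition[gold[t - 1], g] += 1
                transition[pred[t - 1], p] -= 1
        total += len(words)
    return 1 - mistakes / total

# Toy data: (word IDs, tag IDs) with 4 distinct words and 2 tags.
data = [([0, 1, 2], [0, 0, 1]), ([1, 3], [0, 1])]
emission = np.zeros((4, 2))
transition = np.zeros((2, 2))
history = [perceptron_epoch(data, emission, transition) for _ in range(5)]
print(history)
```

The accuracy returned here is measured on the predictions made during the pass itself, so it reflects the weights as they were being updated rather than the final weights of the epoch.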


After that, we store the model in the specified directory.

In [10]:
sp.save_model("./fitted_models/model1")

### Structured perceptron adding features

In this case we add several features, explained in the .pdf file, that relate to the current and previous words; these can add valuable information for assigning tags to words.

We use the ExtendedFeatures class, which is an extension of IDFeatures and contains all the added features.

In [7]:
feature_mapper = ExtendedFeatures(train_seq)
feature_mapper.build_features()
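
The actual feature set is described in the accompanying .pdf and implemented in ExtendedFeatures, neither of which is reproduced here. Purely as an illustration of the kind of features that depend on the current and previous word, the sketch below uses a hypothetical helper `extra_features`; the specific features (capitalization, digits, hyphen, suffix, previous word) are assumptions, not the authors' list.

```python
def extra_features(word, prev_word=None):
    """Return illustrative word-level features for NER (hypothetical set)."""
    feats = []
    if word[:1].isupper():
        feats.append("capitalized")
    if word.isupper():
        feats.append("all_caps")
    if any(ch.isdigit() for ch in word):
        feats.append("has_digit")
    if "-" in word:
        feats.append("has_hyphen")
    if len(word) >= 3:
        # Lowercased three-character suffix of the current word.
        feats.append("suffix:" + word[-3:].lower())
    if prev_word is not None:
        # Identity of the previous word, lowercased.
        feats.append("prev_word:" + prev_word.lower())
    return feats

print(extra_features("London", prev_word="in"))
print(extra_features("1990"))
```

In a feature mapper, each of these strings would typically be combined with the candidate tag so that the model learns one weight per (feature, tag) pair.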

Once we have constructed the feature mapper, we initialize the structured perceptron and train it for 15 epochs.

In [8]:
sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)

In [9]:
%%time
num_epochs = 15
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.934751
Epoch: 1 Accuracy: 0.947807
Epoch: 2 Accuracy: 0.951447
Epoch: 3 Accuracy: 0.953468
Epoch: 4 Accuracy: 0.955193
Epoch: 5 Accuracy: 0.956500
Epoch: 6 Accuracy: 0.957469
Epoch: 7 Accuracy: 0.958749
Epoch: 8 Accuracy: 0.959261
Epoch: 9 Accuracy: 0.959952
Epoch: 10 Accuracy: 0.960971
Epoch: 11 Accuracy: 0.961128
Epoch: 12 Accuracy: 0.961820
Epoch: 13 Accuracy: 0.962563
Epoch: 14 Accuracy: 0.962290
Wall time: 2h 1min 23s
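
To compare the two runs, the accuracies can be read straight from the training logs. The sketch below parses the first and last epoch of each log as printed in this notebook; the helper `parse_log` is hypothetical, and the numbers are copied from the outputs above.

```python
import re

# First and last epochs from the two training logs in this notebook.
log_id = """Epoch: 0 Accuracy: 0.893815
Epoch: 14 Accuracy: 0.961983"""
log_ext = """Epoch: 0 Accuracy: 0.934751
Epoch: 14 Accuracy: 0.962290"""

def parse_log(text):
    """Map epoch number -> accuracy for lines like 'Epoch: 3 Accuracy: 0.95'."""
    pattern = re.compile(r"Epoch:\s*(\d+)\s+Accuracy:\s*([\d.]+)")
    return {int(epoch): float(acc) for epoch, acc in pattern.findall(text)}

acc_id, acc_ext = parse_log(log_id), parse_log(log_ext)

# Difference between the extended-feature run and the ID-feature run.
gaps = {epoch: acc_ext[epoch] - acc_id[epoch] for epoch in acc_id}
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: extended - ID = {gap:+.6f}")
```

With these logs, the gap in the reported accuracy is largest in the first epoch and nearly closes by the last one.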


After that, we store the model in the specified directory.

In [10]:
sp.save_model("./fitted_models/model2")