# Train Models
<div style="color:red; font-size:14px;">!! Don't define functions here, import them from utils.py</div>

This notebook contains the code needed to train and store models to disk.

If a function accepts a random state, set it to a fixed number so that the results are reproducible.
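
As a minimal sketch of what fixing a random state looks like, the example below uses only Python's standard `random` module. The names `SEED` and `sample_indices` are illustrative assumptions and are not part of `utils.py`.

```python
import random

SEED = 42  # hypothetical value; any fixed integer works


def sample_indices(n_items, k, seed=SEED):
    """Draw k distinct indices from range(n_items) with a seeded generator."""
    rng = random.Random(seed)  # private generator, global state is untouched
    return rng.sample(range(n_items), k)


first = sample_indices(100, 5)
second = sample_indices(100, 5)
print(first == second)  # the same seed reproduces the same draw
```

Using a local `random.Random` instance rather than `random.seed` keeps the seeding explicit at the call site, which is one reasonable way to avoid hidden dependencies between cells.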

## Imports

In [1]:
# Cython import
!python skseq/setup.py build_ext --build-lib=./skseq

running build_ext


In [2]:
import pandas as pd
import pickle

from skseq.id_feature import IDFeatures
from skseq.extended_feature import ExtendedFeatures

from skseq import structured_perceptron_c
from skseq.structured_perceptron import StructuredPerceptron

from utils.utils import *

## Create Train and Test sets

In [3]:
train = pd.read_csv("data/train_data_ner.csv")

In [4]:
X_train, y_train = get_data_target_sets(train)

Processing: 100%|██████████| 38366/38366 [00:39<00:00, 974.86sentence/s] 


### Create Corpus

We build the corpus from the training data. The corpus consists of two dictionaries: the word dictionary maps each word to an index, and the tag dictionary maps each tag to an index. We also build the reverse mapping of the tag dictionary, which is needed to convert predicted indices back to tag names.

Example:
```python
# One tag list per sentence, aligned word by word
sentences = [['I', 'love', 'Python'], ['Python', 'is', 'great']]
tags = [['O', 'O', 'B'], ['B', 'O', 'O']]
word_dict, tag_dict, tag_dict_rev = create_corpus(sentences, tags)
# word_dict: {'I': 0, 'love': 1, 'Python': 2, 'is': 3, 'great': 4}
# tag_dict: {'O': 0, 'B': 1}
# tag_dict_rev: {0: 'O', 1: 'B'}
```
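
To make the mapping concrete, here is a rough sketch of how such dictionaries could be built. `build_corpus_sketch` is a hypothetical stand-in, not the `create_corpus` from `utils.py`, and it assumes one tag list per sentence and indices assigned in order of first appearance.

```python
def build_corpus_sketch(sentences, tags):
    """Give every new word and tag the next free index (assumed behaviour)."""
    word_dict, tag_dict = {}, {}
    for sentence, sentence_tags in zip(sentences, tags):
        for word, tag in zip(sentence, sentence_tags):
            # setdefault only inserts when the key is not present yet
            word_dict.setdefault(word, len(word_dict))
            tag_dict.setdefault(tag, len(tag_dict))
    # Reverse mapping: index -> tag name, used to decode predictions
    tag_dict_rev = {index: tag for tag, index in tag_dict.items()}
    return word_dict, tag_dict, tag_dict_rev


toy_sentences = [["I", "love", "Python"], ["Python", "is", "great"]]
toy_tags = [["O", "O", "B"], ["B", "O", "O"]]
toy_words, toy_tag_dict, toy_tag_rev = build_corpus_sketch(toy_sentences, toy_tags)
print(toy_words)
print(toy_tag_dict, toy_tag_rev)
```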

In [5]:
word_dict, tag_dict, tag_dict_rev = create_corpus(X_train, y_train)

### Create Training Sequence List

#### No Cython

In [7]:
train_seq = create_sequence_list(word_dict, tag_dict, X_train, y_train)

Adding sequences: 100%|██████████| 38366/38366 [05:28<00:00, 116.81sequence/s]


#### Cython

In [6]:
train_seq = create_sequence_listC(word_dict, tag_dict, X_train, y_train)

Adding sequences: 100%|██████████| 38366/38366 [05:25<00:00, 117.79sequence/s]


In [8]:
print(train_seq[0])
print(train_seq[3].to_words(sequence_list=train_seq))

0/0 1/0 2/0 3/0 4/0 5/0 6/1 7/0 8/0 9/0 10/0 11/0 12/1 13/0 14/0 9/0 15/0 1/0 16/2 17/0 18/0 19/0 20/0 21/0 
U.N./B-geo relief/O coordinator/O Jan/B-per Egeland/I-per said/O Sunday/B-tim ,/O U.S./B-geo ,/O Indonesian/B-gpe and/O Australian/B-gpe military/O helicopters/O are/O ferrying/O out/O food/O and/O supplies/O to/O remote/O areas/O of/O western/O Aceh/B-geo province/O that/O ground/O crews/O can/O not/O reach/O ./O 
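
The first printed line shows tokens as `word_index/tag_index` pairs, while the second shows a sequence rendered as words and tag names. The sketch below illustrates that kind of conversion with plain dictionaries. `decode_sequence` and the toy dictionaries are assumptions made for illustration; this is not how skseq's `to_words` is implemented.

```python
def decode_sequence(word_ids, tag_ids, word_dict, tag_dict_rev):
    """Render index pairs as 'word/tag' tokens using reverse lookups."""
    word_dict_rev = {index: word for word, index in word_dict.items()}
    tokens = [
        f"{word_dict_rev[w]}/{tag_dict_rev[t]}"
        for w, t in zip(word_ids, tag_ids)
    ]
    return " ".join(tokens)


# Toy dictionaries, not the ones built from the training data
toy_word_dict = {"Thousands": 0, "of": 1, "London": 2}
toy_tag_dict_rev = {0: "O", 1: "B-geo"}
decoded = decode_sequence([0, 1, 2], [0, 0, 1], toy_word_dict, toy_tag_dict_rev)
print(decoded)  # Thousands/O of/O London/B-geo
```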


## Train Models

<div class="alert" style="padding: 20px;background-color: #2cbc84; color: white; margin-bottom: 15px;">
<h3>Structured Perceptron w/ Default Features</h3>
</div>

To train the structured perceptron we must create a feature mapper and build it.

In [9]:
feature_mapper = IDFeatures(train_seq)
feature_mapper.build_features()

In [10]:
show_features(feature_mapper, train_seq[0])

Initial features
[0] init_tag:O


Transition features
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[9] prev_tag:O::B-geo
[11] prev_tag:B-geo::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[9] prev_tag:O::B-geo
[11] prev_tag:B-geo::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[21] prev_tag:O::B-gpe
[23] prev_tag:B-gpe::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O


Final features
[28] final_prev_tag:O


Emission features
[1] id:Thousands::O
[2] id:of::O
[4] id:demonstrators::O
[5] id:have::O
[6] id:marched::O
[7] id:through::O
[8] id:London::B-geo
[10] id:to::O
[12] id:protest::O
[13] id:the::O
[14] id:war::O
[15] id:in::O
[16] id:Iraq::B-geo
[17] id:and::O
[18] id:demand::O
[13] id:the::O
[19] id:withdrawal::O
[2] id:of::O
[20] id:British::B-gpe
[22] id:troops::O
[24] id:from::O
[25] id:that::O
[26] id:country::O
[27] id:.::O
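
The listing above pairs each feature name with an integer index. As a rough illustration of how such a name-to-index table could be produced, the sketch below generates the four feature families for one tagged sentence. `index_features` is a hypothetical helper; the visiting order (emission before transition at each position) was chosen to mirror the indices in the output and may not match what `IDFeatures` does internally.

```python
def index_features(words, tags):
    """Map feature names to indices in order of first appearance."""
    feature_index = {}

    def add(name):
        # Existing names keep their index, new names get the next one
        return feature_index.setdefault(name, len(feature_index))

    add(f"init_tag:{tags[0]}")  # initial feature
    for pos, (word, tag) in enumerate(zip(words, tags)):
        add(f"id:{word}::{tag}")  # emission feature
        if pos > 0:
            add(f"prev_tag:{tags[pos - 1]}::{tag}")  # transition feature
    add(f"final_prev_tag:{tags[-1]}")  # final feature
    return feature_index


toy_features = index_features(["Thousands", "of", "demonstrators"], ["O", "O", "O"])
for name, index in toy_features.items():
    print(f"[{index}] {name}")
```

Repeated features such as `prev_tag:O::O` reuse their existing index, which is why the same number appears many times in the transition listing.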




### Train

#### No Cython

In [11]:
num_epochs = 1  # value passed to fit() below
sp = StructuredPerceptron(word_dict, tag_dict, feature_mapper)
sp.num_epochs = 5  # attribute only; the recorded run below logs a single epoch

In [12]:
%%time
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
CPU times: user 4min 7s, sys: 1.77 s, total: 4min 9s
Wall time: 4min 9s


#### Cython

In [11]:
num_epochs = 1  # value passed to fit() below
sp_c = structured_perceptron_c.StructuredPerceptronC(word_dict, tag_dict, feature_mapper)
sp_c.num_epochs = 5  # attribute only; the recorded run below logs a single epoch

In [12]:
%%time
sp_c.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
CPU times: user 3min 59s, sys: 1.27 s, total: 4min
Wall time: 4min
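
For readers who want to see what a training epoch does conceptually, the following is a deliberately small sketch of a structured perceptron: decode each sentence with Viterbi, then add the gold features and subtract the predicted ones when they differ. It is an unaveraged toy version on made-up data with hypothetical names (`viterbi`, `extract`, `train`). It is not the skseq implementation, which may differ in features, averaging, and decoding details.

```python
from collections import defaultdict


def viterbi(words, tags, w):
    """Return the highest-scoring tag sequence under weights w."""
    score = [{t: w[("init", t)] + w[("emit", words[0], t)] for t in tags}]
    back = []  # back[i][t] = best previous tag for tag t at position i + 1
    for word in words[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + w[("trans", p, t)])
            cur[t] = prev[best] + w[("trans", best, t)] + w[("emit", word, t)]
            ptr[t] = best
        score.append(cur)
        back.append(ptr)
    path = [max(tags, key=lambda t: score[-1][t])]
    for ptr in reversed(back):  # follow back-pointers to the first position
        path.append(ptr[path[-1]])
    return path[::-1]


def extract(words, tag_seq):
    """List the initial, emission, and transition features of a tagging."""
    feats = [("init", tag_seq[0])]
    for i, (word, tag) in enumerate(zip(words, tag_seq)):
        feats.append(("emit", word, tag))
        if i > 0:
            feats.append(("trans", tag_seq[i - 1], tag))
    return feats


def train(data, tags, num_epochs):
    """Plain (unaveraged) perceptron updates over the data."""
    w = defaultdict(float)
    for _ in range(num_epochs):
        for words, gold in data:
            pred = viterbi(words, tags, w)
            if pred != gold:
                for f in extract(words, gold):
                    w[f] += 1.0  # reward features of the gold tagging
                for f in extract(words, pred):
                    w[f] -= 1.0  # penalise features of the wrong prediction
    return w


toy_tags = ["O", "B-per", "B-geo"]
toy_data = [
    (["Jan", "visited", "London"], ["B-per", "O", "B-geo"]),
    (["London", "is", "big"], ["B-geo", "O", "O"]),
]
weights = train(toy_data, toy_tags, num_epochs=5)
print(viterbi(["Jan", "visited", "London"], toy_tags, weights))
```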


### Save

In [13]:
sp.save_model("fitted_models/01_SP_Default_Features")
sp_c.save_model("fitted_models/01C_SP_Default_Features")

NameError: name 'sp' is not defined

<div class="alert" style="padding: 20px;background-color: #2cbc84; color: white; margin-bottom: 15px;">
<h3>Structured Perceptron w/ New Features</h3>
</div>

In [11]:
feature_mapper_ext = ExtendedFeatures(train_seq)
feature_mapper_ext.build_features()

In [12]:
show_features(feature_mapper_ext, train_seq[0])

Initial features
[0] init_tag:O


Transition features
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[14] prev_tag:O::B-geo
[16] prev_tag:B-geo::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[14] prev_tag:O::B-geo
[16] prev_tag:B-geo::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[28] prev_tag:O::B-gpe
[30] prev_tag:B-gpe::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O


Final features
[35] final_prev_tag:O


Emission features
[1, 2, 3] id:Thousands::O
[1, 2, 3] firstupper::O
[1, 2, 3] alphanum::O
[4, 5, 3] id:of::O
[4, 5, 3] lower::O
[4, 5, 3] alphanum::O
[7, 5, 3] id:demonstrators::O
[7, 5, 3] lower::O
[7, 5, 3] alphanum::O
[8, 5, 3] id:have::O
[8, 5, 3] lower::O
[8, 5, 3] alphanum::O
[9, 5, 3] id:marched::O
[9, 5, 3] lower::O
[9, 5, 3] alphanum::O
[10, 5, 3] id:through::O
[10, 5, 3] lower::O
[10, 5, 3] alphanum::O
[11, 12, 13] id:London::B-geo
[11, 12, 13] fi
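
The emission listing above attaches word-shape features such as `firstupper`, `lower`, and `alphanum` to each token alongside its `id` feature. The sketch below shows one plausible way to derive those three names from a word. The rules are inferred only from the visible output, `word_shape_features` is a hypothetical helper, and `ExtendedFeatures` may use different or additional rules.

```python
def word_shape_features(word):
    """Return shape feature names for a word (rules are assumptions)."""
    feats = []
    if word[:1].isupper():
        feats.append("firstupper")  # first character is uppercase
    elif word.islower():
        feats.append("lower")  # all cased characters are lowercase
    if word.isalnum():
        feats.append("alphanum")  # only letters and digits
    return feats


for token in ["Thousands", "of", "."]:
    print(token, word_shape_features(token))
```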

### Train

#### No Cython

In [13]:
num_epochs = 15  # value passed to fit() below
sp_ext = StructuredPerceptron(word_dict, tag_dict, feature_mapper_ext)
sp_ext.num_epochs = 5  # attribute only; the recorded run below logs 15 epochs
sp_ext.fit(feature_mapper_ext.dataset, num_epochs)

Epoch: 0 Accuracy: 0.929235
Epoch: 1 Accuracy: 0.944526
Epoch: 2 Accuracy: 0.948609
Epoch: 3 Accuracy: 0.951267
Epoch: 4 Accuracy: 0.953126
Epoch: 5 Accuracy: 0.954476
Epoch: 6 Accuracy: 0.955556
Epoch: 7 Accuracy: 0.956719
Epoch: 8 Accuracy: 0.957269
Epoch: 9 Accuracy: 0.958295
Epoch: 10 Accuracy: 0.958931
Epoch: 11 Accuracy: 0.959925
Epoch: 12 Accuracy: 0.960049
Epoch: 13 Accuracy: 0.960416
Epoch: 14 Accuracy: 0.961072


#### Cython

In [None]:
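num_epochs = 15  # value passed to fit() below
# Use the Cython class name, matching the default-features Cython cell above
sp_ext_c = structured_perceptron_c.StructuredPerceptronC(word_dict, tag_dict, feature_mapper_ext)
sp_ext_c.num_epochs = 5  # attribute only; fit() receives num_epochs explicitly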
sp_ext_c.fit(feature_mapper_ext.dataset, num_epochs)

### Save

In [14]:
sp_ext.save_model("fitted_models/02_SP_Extended_Features")
sp_ext_c.save_model("fitted_models/02C_SP_Extended_Features")