# Train Models
<div style="color:red; font-size:14px;">!! Don't define functions here, import them from utils.py</div>

This notebook contains the code needed to train and store models to disk.

If a function accepts a random state (for example a `random_state` argument), set it to a fixed number so that the results are reproducible.
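As a minimal sketch of what a fixed seed buys you, the block below uses only the standard library. `shuffled_ids` is a hypothetical helper for illustration and is not part of `utils.py`.

```python
import random


def shuffled_ids(n, seed):
    """Return the ids 0..n-1 in a shuffled order determined by `seed`."""
    rng = random.Random(seed)  # local generator, global random state is untouched
    ids = list(range(n))
    rng.shuffle(ids)
    return ids


first = shuffled_ids(10, seed=42)
second = shuffled_ids(10, seed=42)
print(first == second)  # the same seed gives the same order on every run
```

The same idea applies to library calls such as `train_test_split(..., random_state=42)` used later in this notebook.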

## Imports

In [4]:
# Cython import
!python skseq/setup.py build_ext --build-lib=./skseq

running build_ext


In [1]:
import pandas as pd
import pickle

from skseq.id_feature import IDFeatures
from skseq.extended_feature import ExtendedFeatures

from skseq import structured_perceptron_c 
from skseq.structured_perceptron import StructuredPerceptron

from utils.utils import *

## Create Train and Test sets

In [2]:
train = pd.read_csv("data/train_data_ner.csv")

In [7]:
X_train, y_train = get_data_target_sets(train)

Processing: 100%|██████████| 38366/38366 [00:41<00:00, 933.59sentence/s]


### Create Corpus

We need to create our corpus using the training data. The corpus consists of two dictionaries, one for the words and one for the tags. The words dictionary maps each word to an index and the tags dictionary maps each tag to an index. We also need to create the reverse mapping for the tags dictionary. This is needed to convert the predictions back to the tag names.

Example:
```python
sentences = [['I', 'love', 'Python'], ['Python', 'is', 'great']]
# One list of tags per sentence, aligned word by word
tags = [['O', 'O', 'B'], ['B', 'O', 'O']]
word_dict, tag_dict, tag_dict_rev = create_corpus(sentences, tags)
# word_dict: {'I': 0, 'love': 1, 'Python': 2, 'is': 3, 'great': 4}
# tag_dict: {'O': 0, 'B': 1}
# tag_dict_rev: {0: 'O', 1: 'B'}
```
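The real `create_corpus` lives in `utils.py` and is not shown here. As a rough sketch of the mapping described above, assuming indices are assigned in order of first appearance, a hypothetical `build_corpus_sketch` could look like this:

```python
def build_corpus_sketch(sentences, tags):
    """Hypothetical stand-in for create_corpus: index words and tags by first appearance."""
    word_dict, tag_dict = {}, {}
    for words, sentence_tags in zip(sentences, tags):
        for word in words:
            # setdefault only assigns a new index the first time a word is seen
            word_dict.setdefault(word, len(word_dict))
        for tag in sentence_tags:
            tag_dict.setdefault(tag, len(tag_dict))
    # Reverse mapping, used to turn predicted indices back into tag names
    tag_dict_rev = {idx: tag for tag, idx in tag_dict.items()}
    return word_dict, tag_dict, tag_dict_rev


demo_sentences = [["I", "love", "Python"], ["Python", "is", "great"]]
demo_tags = [["O", "O", "B"], ["B", "O", "O"]]
demo_words, demo_tag_dict, demo_tag_rev = build_corpus_sketch(demo_sentences, demo_tags)
print(demo_words)
print(demo_tag_dict, demo_tag_rev)
```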

In [10]:
word_dict, tag_dict, tag_dict_rev = create_corpus(X_train, y_train)

### Create Training Sequence List

#### No Cython

In [7]:
train_seq = create_sequence_list(word_dict, tag_dict, X_train, y_train)

Adding sequences: 100%|██████████| 38366/38366 [05:28<00:00, 116.81sequence/s]


#### Cython

In [6]:
train_seq = create_sequence_listC(word_dict, tag_dict, X_train, y_train)

Adding sequences: 100%|██████████| 38366/38366 [05:25<00:00, 117.79sequence/s]


In [8]:
print(train_seq[0])
print(train_seq[0].to_words(sequence_list=train_seq))

0/0 1/0 2/0 3/0 4/0 5/0 6/1 7/0 8/0 9/0 10/0 11/0 12/1 13/0 14/0 9/0 15/0 1/0 16/2 17/0 18/0 19/0 20/0 21/0 
U.N./B-geo relief/O coordinator/O Jan/B-per Egeland/I-per said/O Sunday/B-tim ,/O U.S./B-geo ,/O Indonesian/B-gpe and/O Australian/B-gpe military/O helicopters/O are/O ferrying/O out/O food/O and/O supplies/O to/O remote/O areas/O of/O western/O Aceh/B-geo province/O that/O ground/O crews/O can/O not/O reach/O ./O 


## Train Models

<div class="alert" style="padding: 20px;background-color: #2cbc84; color: white; margin-bottom: 15px;">
<h3>Structured Perceptron w/ Default Features</h3>
</div>

To train the structured perceptron we must create a feature mapper and build it.

In [9]:
feature_mapper = IDFeatures(train_seq)
feature_mapper.build_features()

In [10]:
show_features(feature_mapper, train_seq[0])

Initial features
[0] init_tag:O


Transition features
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[9] prev_tag:O::B-geo
[11] prev_tag:B-geo::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[9] prev_tag:O::B-geo
[11] prev_tag:B-geo::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[21] prev_tag:O::B-gpe
[23] prev_tag:B-gpe::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O
[3] prev_tag:O::O


Final features
[28] final_prev_tag:O


Emission features
[1] id:Thousands::O
[2] id:of::O
[4] id:demonstrators::O
[5] id:have::O
[6] id:marched::O
[7] id:through::O
[8] id:London::B-geo
[10] id:to::O
[12] id:protest::O
[13] id:the::O
[14] id:war::O
[15] id:in::O
[16] id:Iraq::B-geo
[17] id:and::O
[18] id:demand::O
[13] id:the::O
[19] id:withdrawal::O
[2] id:of::O
[20] id:British::B-gpe
[22] id:troops::O
[24] id:from::O
[25] id:that::O
[26] id:country::O
[27] id:.::O
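The feature names and ids printed above suggest how the mapper numbers features: one initial-tag feature, then for each position an emission feature followed by a transition feature, and one final-tag feature. The sketch below mimics that naming with a hypothetical `index_features` function. It is an illustration inferred from the output, not the `IDFeatures` implementation.

```python
def index_features(tokens, tags, feature_dict=None):
    """Assign an integer id to every feature name seen in one tagged sentence."""
    feature_dict = {} if feature_dict is None else feature_dict

    def add(name):
        # A feature keeps its first id; unseen names get the next free id
        return feature_dict.setdefault(name, len(feature_dict))

    add(f"init_tag:{tags[0]}")
    for pos, (token, tag) in enumerate(zip(tokens, tags)):
        add(f"id:{token}::{tag}")  # emission feature: word paired with its tag
        if pos > 0:
            add(f"prev_tag:{tags[pos - 1]}::{tag}")  # transition feature
    add(f"final_prev_tag:{tags[-1]}")
    return feature_dict


demo_features = index_features(["Thousands", "of", "demonstrators"], ["O", "O", "O"])
for name, idx in demo_features.items():
    print(idx, name)
```

Passing the same `feature_dict` to successive calls extends it, so a feature shared by several sentences keeps a single id.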




### Train

#### No Cython

In [11]:
num_epochs = 15
sp = StructuredPerceptron(word_dict, tag_dict, feature_mapper)
sp.num_epochs = num_epochs  # keep the attribute consistent with the value passed to fit()

In [12]:
%%time
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
CPU times: user 4min 7s, sys: 1.77 s, total: 4min 9s
Wall time: 4min 9s


#### Cython

In [11]:
num_epochs = 15
sp_c = structured_perceptron_c.StructuredPerceptronC(word_dict, tag_dict, feature_mapper)
sp_c.num_epochs = num_epochs  # keep the attribute consistent with the value passed to fit()

In [12]:
%%time
sp_c.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
CPU times: user 3min 59s, sys: 1.27 s, total: 4min
Wall time: 4min
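Each epoch passes over the training sequences and corrects the weights whenever the predicted tag sequence differs from the gold one. The sketch below shows only that update step, in plain Python, with hypothetical helper names. Decoding (for example Viterbi) and any parameter averaging are left out, so this is not the `skseq` implementation.

```python
from collections import Counter


def sequence_features(tokens, tags):
    """Count the feature names that fire for one tagged sentence."""
    feats = Counter()
    feats[f"init_tag:{tags[0]}"] += 1
    for pos, (token, tag) in enumerate(zip(tokens, tags)):
        feats[f"id:{token}::{tag}"] += 1
        if pos > 0:
            feats[f"prev_tag:{tags[pos - 1]}::{tag}"] += 1
    feats[f"final_prev_tag:{tags[-1]}"] += 1
    return feats


def perceptron_update(weights, tokens, gold_tags, predicted_tags, learning_rate=1.0):
    """Reward gold features and penalise predicted ones; shared features cancel."""
    if gold_tags == predicted_tags:
        return weights  # correct prediction, nothing to change
    gold = sequence_features(tokens, gold_tags)
    pred = sequence_features(tokens, predicted_tags)
    for name in gold.keys() | pred.keys():
        delta = gold[name] - pred[name]
        if delta:
            weights[name] = weights.get(name, 0.0) + learning_rate * delta
    return weights


demo_weights = perceptron_update({}, ["in", "London"], ["O", "B-geo"], ["O", "O"])
print(demo_weights)
```

Features that appear in both the gold and the predicted sequence, such as `id:in::O` here, receive no update.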


### Save

In [13]:
sp.save_model("fitted_models/01_SP_Default_Features")
sp_c.save_model("fitted_models/01C_SP_Default_Features")

NameError: name 'sp' is not defined

<div class="alert" style="padding: 20px;background-color: #2cbc84; color: white; margin-bottom: 15px;">
<h3>Structured Perceptron w/ New Features</h3>
</div>

In [11]:
feature_mapper_ext = ExtendedFeatures(train_seq)
feature_mapper_ext.build_features()

In [12]:
show_features(feature_mapper_ext, train_seq[0])

Initial features
[0] init_tag:O


Transition features
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[14] prev_tag:O::B-geo
[16] prev_tag:B-geo::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[14] prev_tag:O::B-geo
[16] prev_tag:B-geo::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[28] prev_tag:O::B-gpe
[30] prev_tag:B-gpe::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O
[6] prev_tag:O::O


Final features
[35] final_prev_tag:O


Emission features
[1, 2, 3] id:Thousands::O
[1, 2, 3] firstupper::O
[1, 2, 3] alphanum::O
[4, 5, 3] id:of::O
[4, 5, 3] lower::O
[4, 5, 3] alphanum::O
[7, 5, 3] id:demonstrators::O
[7, 5, 3] lower::O
[7, 5, 3] alphanum::O
[8, 5, 3] id:have::O
[8, 5, 3] lower::O
[8, 5, 3] alphanum::O
[9, 5, 3] id:marched::O
[9, 5, 3] lower::O
[9, 5, 3] alphanum::O
[10, 5, 3] id:through::O
[10, 5, 3] lower::O
[10, 5, 3] alphanum::O
[11, 12, 13] id:London::B-geo
[11, 12, 13] fi
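The extended mapper adds word-shape features such as `firstupper`, `lower` and `alphanum` next to the word identity. The exact rules inside `ExtendedFeatures` are not shown in this notebook, so the following is only a guess at how such features could be derived, using a hypothetical `word_shape_features` helper:

```python
def word_shape_features(word):
    """Return shape feature names for a word (assumed rules, for illustration only)."""
    shapes = []
    if word[:1].isupper():
        shapes.append("firstupper")  # first character is an uppercase letter
    if word.islower():
        shapes.append("lower")       # all cased characters are lowercase
    if word.isalnum():
        shapes.append("alphanum")    # only letters and digits
    return shapes


for demo_word in ["Thousands", "of", "."]:
    print(demo_word, word_shape_features(demo_word))
```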

### Train

#### No Cython

In [13]:
num_epochs = 15
sp_ext = StructuredPerceptron(word_dict, tag_dict, feature_mapper_ext)
sp_ext.num_epochs = num_epochs  # keep the attribute consistent with the value passed to fit()
sp_ext.fit(feature_mapper_ext.dataset, num_epochs)

Epoch: 0 Accuracy: 0.929235
Epoch: 1 Accuracy: 0.944526
Epoch: 2 Accuracy: 0.948609
Epoch: 3 Accuracy: 0.951267
Epoch: 4 Accuracy: 0.953126
Epoch: 5 Accuracy: 0.954476
Epoch: 6 Accuracy: 0.955556
Epoch: 7 Accuracy: 0.956719
Epoch: 8 Accuracy: 0.957269
Epoch: 9 Accuracy: 0.958295
Epoch: 10 Accuracy: 0.958931
Epoch: 11 Accuracy: 0.959925
Epoch: 12 Accuracy: 0.960049
Epoch: 13 Accuracy: 0.960416
Epoch: 14 Accuracy: 0.961072


#### Cython

In [None]:
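num_epochs = 15
# Use the Cython class, as in the default-features section
sp_ext_c = structured_perceptron_c.StructuredPerceptronC(word_dict, tag_dict, feature_mapper_ext)
sp_ext_c.num_epochs = num_epochs  # keep the attribute consistent with the value passed to fit()
sp_ext_c.fit(feature_mapper_ext.dataset, num_epochs)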

### Save

In [14]:
sp_ext.save_model("fitted_models/02_SP_Extended_Features")
sp_ext_c.save_model("fitted_models/02C_SP_Extended_Features")

# Deep Learning NER

## BERT

In [3]:
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping


import transformers
from transformers import BertTokenizerFast
from transformers import TFBertModel

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

2023-06-02 17:45:37.870907: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Connecting TPU

In [None]:
try:
    # Connect to a TPU if the runtime provides one
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Device:', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except Exception:
    # No usable TPU: fall back to the default strategy (CPU or GPU)
    strategy = tf.distribute.get_strategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

In [4]:
train_BERT = train.copy()  # copy so that later in-place edits do not alter `train`
MAX_LEN = 128
train_BERT

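| Unnamed: 0 | sentence_id | words | tags |
|---|---|---|---|
| 0 | 0 | Thousands | O |
| 1 | 0 | of | O |
| 2 | 0 | demonstrators | O |
| 3 | 0 | have | O |
| 4 | 0 | marched | O |
| ... | ... | ... | ... |
| 839144 | 47957 | officials | O |
| 839145 | 47957 | within | O |
| 839146 | 47957 | the | O |
| 839147 | 47957 | government | O |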


### Grouping, Tokenizing and Padding


In [5]:
train_BERT["sentence_id"] = train_BERT["sentence_id"].ffill()
sentence = train_BERT.groupby("sentence_id")["words"].apply(list).values
tag = train_BERT.groupby("sentence_id")["tags"].apply(list).values

In [8]:
def process_data(df):
    """Encode tags as integers and group words and tags by sentence."""
    df.loc[:, "sentence_id"] = df["sentence_id"].ffill()

    enc_pos = preprocessing.LabelEncoder()  # never fitted here; kept so the return signature is unchanged
    enc_tag = preprocessing.LabelEncoder()

    df.loc[:, "tags"] = enc_tag.fit_transform(df["tags"])

    # Group the DataFrame that was passed in, not the global train_BERT
    sentences = df.groupby("sentence_id")["words"].apply(list).values
    tag = df.groupby("sentence_id")["tags"].apply(list).values
    return sentences, tag, enc_pos, enc_tag

sentences, tag, enc_pos, enc_tag = process_data(train_BERT)

In [11]:
import numpy as np
from tqdm import tqdm

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize(data, max_len=MAX_LEN):
    """Return stacked input ids and attention masks, padded or truncated to max_len."""
    input_ids = []
    attention_mask = []
    for i in tqdm(range(len(data))):
        encoded = tokenizer.encode_plus(data[i],
                                        add_special_tokens=True,
                                        max_length=max_len,  # use the argument, not the global MAX_LEN
                                        is_split_into_words=True,
                                        return_attention_mask=True,
                                        padding='max_length',
                                        truncation=True,
                                        return_tensors='np')
        input_ids.append(encoded['input_ids'])
        attention_mask.append(encoded['attention_mask'])
    return np.vstack(input_ids), np.vstack(attention_mask)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [12]:
X_train,X_test,y_train,y_test = train_test_split(sentence,tag,random_state=42,test_size=0.1)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((34529,), (3837,), (34529,), (3837,))

In [13]:
input_ids,attention_mask = tokenize(X_train,max_len = MAX_LEN)

100%|██████████| 34529/34529 [00:07<00:00, 4422.08it/s]


In [14]:
val_input_ids,val_attention_mask = tokenize(X_test,max_len = MAX_LEN)

100%|██████████| 3837/3837 [00:00<00:00, 4373.87it/s]
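Because the tokenizer adds special tokens such as `[CLS]` and `[SEP]` and may split one word into several word pieces, word-level tags do not map one-to-one onto token positions. Fast tokenizers expose this mapping through `encoded.word_ids()`, which returns one word index per token and `None` for special or padding tokens. The sketch below shows one common way to expand word-level labels to token level. It is not what this notebook does, `align_labels` is a hypothetical helper, and the `word_ids` list is written by hand so the example runs without downloading a tokenizer.

```python
def align_labels(word_ids, word_labels, pad_label):
    """Give every token the label of its word; special and padding tokens get pad_label."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            aligned.append(pad_label)
        else:
            aligned.append(word_labels[word_id])
    return aligned


# Hand-written example: [CLS], word 0, word 1 split in two pieces, word 2, [SEP], [PAD]
demo_word_ids = [None, 0, 1, 1, 2, None, None]
demo_labels = [5, 3, 7]
demo_aligned = align_labels(demo_word_ids, demo_labels, pad_label=-1)
print(demo_aligned)  # [-1, 5, 3, 3, 7, -1, -1]
```

The choice of `pad_label` is an assumption here. It needs to be a value that the loss function or a sample-weight mask can exclude from training.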


### Testing Padding and Truncation Length

In [15]:
# TEST: check padding and truncation lengths
was = list()
for i in range(len(input_ids)):
    was.append(len(input_ids[i]))
set(was)

{128}

In [16]:
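# Test padding: pad each tag list with 0 up to MAX_LEN
# (0 is also a valid id produced by LabelEncoder, so padding shares an id with that class)
test_tag = list()
for i in range(len(y_test)):
    test_tag.append(np.array(y_test[i] + [0] * (MAX_LEN - len(y_test[i]))))

# TEST: check padding length
was = list()
for i in range(len(test_tag)):
    was.append(len(test_tag[i]))
set(was)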

{128}

In [17]:
# Train padding: pad each tag list with 0 up to MAX_LEN
train_tag = list()
for i in range(len(y_train)):
    train_tag.append(np.array(y_train[i] + [0] * (MAX_LEN - len(y_train[i]))))

# TEST: check padding length
was = list()
for i in range(len(train_tag)):
    was.append(len(train_tag[i]))
set(was)

{128}
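The padding cells above only extend short tag lists. If a sentence had more tags than the maximum length, multiplying `[0]` by a negative number would give an empty list and the sequence would keep its original length. That did not happen in this run, since every length is 128. A small sketch of a helper that covers both cases, with the hypothetical name `pad_or_truncate`:

```python
def pad_or_truncate(values, max_len, pad_value=0):
    """Return a list of exactly max_len items: cut long inputs, pad short ones."""
    clipped = list(values[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))


print(pad_or_truncate([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```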

### Building BERT Model: Transfer Learning

In [18]:
# bert_model = TFBertModel.from_pretrained('bert-base-uncased')

def create_model(bert_model, max_len=MAX_LEN):
    input_ids = tf.keras.Input(shape=(max_len,), dtype='int32')
    attention_masks = tf.keras.Input(shape=(max_len,), dtype='int32')
    bert_output = bert_model(input_ids, attention_mask=attention_masks, return_dict=True)
    embedding = tf.keras.layers.Dropout(0.3)(bert_output["last_hidden_state"])
    # One softmax per token over 17 output units (assumed to equal the number of tag classes)
    output = tf.keras.layers.Dense(17, activation='softmax')(embedding)
    model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=[output])
    # `learning_rate` replaces the deprecated `lr` argument
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return model

In [19]:
with strategy.scope():
    bert_model = TFBertModel.from_pretrained('bert-base-uncased')
    model = create_model(bert_model,MAX_LEN)

Downloading tf_model.h5:   0%|          | 0.00/536M [00:00<?, ?B/s]

2023-06-02 17:40:18.004851: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model che

In [None]:
model.summary()

In [None]:
tf.keras.utils.plot_model(model)

In [None]:
# Stop when the validation loss has not improved for 5 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', mode='min', patience=5)
history_bert = model.fit([input_ids, attention_mask],
                         np.array(train_tag),
                         validation_data=([val_input_ids, val_attention_mask], np.array(test_tag)),
                         epochs=25, batch_size=30*2,
                         callbacks=[early_stopping], verbose=1)

Epoch 1/25
  2/576 [..............................] - ETA: 6:14:32 - loss: 3.1938 - accuracy: 0.0570
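To make the role of `patience` concrete, here is a simplified sketch of the early-stopping rule in plain Python. It mirrors the idea of the Keras callback, which stops once the monitored value has not improved for `patience` epochs in a row, but it is not the Keras code. `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


print(stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=2))  # 3
print(stopping_epoch([1.0, 0.9, 0.8], patience=2))               # None
```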

In [None]:
import matplotlib.pyplot as plt

plt.plot(history_bert.history['accuracy'])
plt.plot(history_bert.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
plt.plot(history_bert.history['loss'])
plt.plot(history_bert.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

### Testing Model

In [None]:
def pred(val_input_ids,val_attention_mask):
    return model.predict([val_input_ids,val_attention_mask])

In [None]:
def testing(val_input_ids, val_attention_mask, enc_tag, y_test):
    val_input = val_input_ids.reshape(1, MAX_LEN)
    val_attention = val_attention_mask.reshape(1, MAX_LEN)

    # Print the original sentence (token id 0 is [PAD] in bert-base-uncased)
    sentence = tokenizer.decode(val_input_ids[val_input_ids > 0])
    print("Original Text : ", str(sentence))
    print("\n")
    true_enc_tag = enc_tag.inverse_transform(y_test)

    print("Original Tags : ", str(true_enc_tag))
    print("\n")

    # Most likely class per token; ids equal to 0 are treated as padding and dropped
    pred_with_pad = np.argmax(pred(val_input, val_attention), axis=-1)
    pred_without_pad = pred_with_pad[pred_with_pad > 0]
    pred_enc_tag = enc_tag.inverse_transform(pred_without_pad)
    print("Predicted Tags : ", pred_enc_tag)

In [None]:
testing(val_input_ids[0],val_attention_mask[0],enc_tag,y_test[0])
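Filtering predictions with `> 0` removes padding, but it also removes any genuine prediction of the class encoded as 0. One alternative, sketched below in plain Python with made-up probabilities, is to keep the positions where the attention mask is 1. `decode_predictions` and the small `id_to_tag` mapping are hypothetical and only illustrate the idea.

```python
def decode_predictions(probabilities, attention_mask, id_to_tag):
    """Pick the most likely tag for every token that the attention mask keeps."""
    tags = []
    for token_probs, keep in zip(probabilities, attention_mask):
        if not keep:
            continue  # padding position
        best_id = max(range(len(token_probs)), key=token_probs.__getitem__)
        tags.append(id_to_tag[best_id])
    return tags


demo_probs = [
    [0.7, 0.2, 0.1],    # class 0 wins and is kept
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],  # padding position, masked out
]
demo_mask = [1, 1, 1, 0]
demo_id_to_tag = {0: "B-geo", 1: "I-geo", 2: "O"}
demo_decoded = decode_predictions(demo_probs, demo_mask, demo_id_to_tag)
print(demo_decoded)  # ['B-geo', 'I-geo', 'O']
```

With a real BERT encoding the mask also covers `[CLS]` and `[SEP]`, so those positions would still need to be skipped separately.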