# HW 4 - Neural POS Tagger

In this exercise, you are going to build a set of deep learning models for part-of-speech (POS) tagging using TensorFlow and Keras. TensorFlow is a deep learning framework developed by Google, and Keras is a frontend library built on top of TensorFlow (or Theano, CNTK) that provides an easier way to use standard layers and networks.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following types:

- Neural POS Tagging with Word Embedding using Fixed / non-Fixed Pretrained weights
- Neural POS Tagging with Viterbi / Marginal CRF

A pretrained word embedding is already provided for you to use (albeit a very bad one). Optionally, you can use your best pretrained word embedding from the previous exercise.

We also provide code for data cleaning and preprocessing, plus some Keras starter code, in this notebook. Feel free to modify those parts to suit your needs. You can also complete this exercise using only TensorFlow (without Keras), and you may use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to shut down your instance on Gcloud when you are not using it ###

## 1. Setup and Preprocessing

We use POS data from the [ORCHID corpus](https://www.nectec.or.th/corpus/index.php?league=pm), which is a POS-tagged corpus for the Thai language.
A method that reads the corpus into a list of sentences of (word, POS) pairs has already been implemented. Example usage is shown below.
We also create a random word vector for unknown words.

In [1]:
from data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
import keras.preprocessing
np.random.seed(42)

Using TensorFlow backend.


In [2]:
unk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
print(train_data[0])

[('การ', 'FIXN'), ('ประชุม', 'VACT'), ('ทาง', 'NCMN'), ('วิชาการ', 'NCMN'), ('<space>', 'PUNC'), ('ครั้ง', 'CFQC'), ('ที่ 1', 'DONM')]


Next, we load the pretrained embedding weights using pickle. The pretrained weights are a dictionary that maps a word to its embedding.

In [3]:
import pickle
# The context manager closes the file even if loading fails.
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)

The code below generates an indexed dataset (each word is represented by a number) for the training and testing data. Index 0 is reserved for padding to help with variable-length sequences. (You can read more about padding here: https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/)

## 2. Prepare Data

In [4]:
word_to_idx ={}
idx_to_word ={}
label_to_idx = {}
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
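# Note: word ids start at 1, so len(word_to_idx) equals the id of the last word added and 'UNK' shares that id.
word_to_idx['UNK'] = len(word_to_idx)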

n_classes = len(label_to_idx.keys())+1
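
The indexing scheme can be sketched on a toy corpus. The sentences, tag names, and the `_toy` variable names below are hypothetical and are not part of the ORCHID data. Unlike the cell above, this sketch assumes `UNK` gets its own id, one past the last real word, which means an embedding table built from it would need `len(word_to_idx) + 1` rows.

```python
# Toy corpus in the same (word, tag) format returned by get_sentences.
toy_sentences = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN")],
]

PAD_IDX = 0  # reserved for padding, never assigned to a real word or tag


def build_vocab(sentences):
    """Map each word and each tag to an integer id starting at 1."""
    word_to_idx, label_to_idx = {}, {}
    for sentence in sentences:
        for word, tag in sentence:
            # setdefault keeps the existing id if the key was seen before.
            word_to_idx.setdefault(word, len(word_to_idx) + 1)
            label_to_idx.setdefault(tag, len(label_to_idx) + 1)
    # Assumption: UNK gets a separate id, one past the last real word.
    word_to_idx["UNK"] = len(word_to_idx) + 1
    return word_to_idx, label_to_idx


word_to_idx_toy, label_to_idx_toy = build_vocab(toy_sentences)
vocab_size = len(word_to_idx_toy) + 1      # +1 for the padding row
n_classes_toy = len(label_to_idx_toy) + 1  # +1 for the padding label

print(word_to_idx_toy)
print(vocab_size, n_classes_toy)
```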

This section is tweaked a little from the demo: `word2features` returns a word index instead of features, `sent2features` returns the sequence of word indices for a sentence, and `sent2labels` returns the sequence of label indices for a sentence.

In [5]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return np.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [6]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

Next, we create the train and test datasets, then use Keras to post-pad each sequence with 0 up to the maximum sequence length. The training labels are converted to one-hot vectors.

In [7]:
%%time
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data])
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 362 ms, sys: 15.3 ms, total: 377 ms
Wall time: 376 ms


In [8]:
x_train=keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
x_test=keras.preprocessing.sequence.pad_sequences(x_test, maxlen=102, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)

In [9]:
print(x_train[100],x_train.shape)
print(y_train[100][3],y_train.shape)

[ 29 327   5 328   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] (18500, 102)
[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.] (18500, 102, 48)
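
The padding and one-hot steps can be sketched in plain Python, without Keras or NumPy. The helper names `pad_post` and `one_hot` and the toy label ids are hypothetical. The sketch assumes the same conventions as the cells above: padding is appended at the end, and anything longer than the maximum length is cut from the front.

```python
def pad_post(seq, max_len, value=0):
    """Pad a list of ids at the end, or cut from the front, to max_len items."""
    kept = list(seq)[-max_len:]               # 'pre' truncation keeps the tail
    return kept + [value] * (max_len - len(kept))  # 'post' padding


def one_hot(index, n_classes):
    """Return n_classes zeros with a single 1.0 at position index."""
    row = [0.0] * n_classes
    row[index] = 1.0
    return row


label_ids = [3, 1, 2]  # hypothetical tag ids for a 3-word sentence
padded = pad_post(label_ids, max_len=5)
encoded = [one_hot(i, n_classes=4) for i in padded]

print(padded)
for row in encoded:
    print(row)
```

Padded positions end up one-hot encoded as class 0, which is why `n_classes` counts one more class than there are tags.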


## 3. Evaluate

The output from Keras is a probability distribution over all possible labels. `outputToLabel` returns the index of the maximum probability at each position of the output sequence, up to the true sequence length.

`evaluation_report` is the same as in the demo.

In [10]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out
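
As a small illustration of this decoding step, the sketch below applies the same idea to a hand-written output. The probabilities and the name `output_to_label` are hypothetical. The last two rows stand for padded positions, which are dropped because the true sentence length is 2.

```python
def output_to_label(probs, seq_len):
    """Pick the most probable class per time step, ignoring padded steps."""
    # list.index(max(...)) returns the first maximum, like np.argmax.
    return [step.index(max(step)) for step in probs[:seq_len]]


# Hypothetical model output: 4 time steps, 3 classes, true length 2.
probs = [
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],  # padded position
    [0.9, 0.05, 0.05],  # padded position
]
decoded = output_to_label(probs, seq_len=2)
print(decoded)
```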

In [11]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
    display(df)
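
The per-tag metrics in the report can be checked by hand on a tiny example. The sketch below is a simplified stand-in for `evaluation_report`, not a replacement: the name `per_tag_scores` and the toy tag ids are hypothetical, and it reports 0.0 where the report above prints `-`.

```python
from collections import Counter


def per_tag_scores(y_true, y_pred):
    """Return ({tag: (precision, recall, f1)}, accuracy), all in percent."""
    correct, n_true, n_pred = Counter(), Counter(), Counter()
    for sent_true, sent_pred in zip(y_true, y_pred):
        for t, p in zip(sent_true, sent_pred):
            n_true[t] += 1
            n_pred[p] += 1
            if t == p:
                correct[t] += 1
    scores = {}
    for tag in sorted(set(n_true) | set(n_pred)):
        # Precision: share of predictions of this tag that were right.
        precision = 100 * correct[tag] / n_pred[tag] if n_pred[tag] else 0.0
        # Recall: share of true occurrences of this tag that were found.
        recall = 100 * correct[tag] / n_true[tag] if n_true[tag] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    accuracy = 100 * sum(correct.values()) / sum(n_true.values())
    return scores, accuracy


y_true_toy = [[1, 2, 2], [3, 1]]
y_pred_toy = [[1, 2, 3], [3, 2]]
scores, accuracy = per_tag_scores(y_true_toy, y_pred_toy)
print(scores)
print(accuracy)
```

In this toy case tag 1 is predicted once and correctly (precision 100) but occurs twice (recall 50), and three of the five tokens are tagged correctly.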

## 4. Train a model

In [12]:
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense, GRU, TimeDistributed, Bidirectional, Dropout, Masking
from keras_contrib.layers import CRF
from keras.optimizers import Adam

The models in this section are separated into two groups:

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1.1 Neural POS Tagger  (Example)

We create a simple neural POS tagger as an example for you. This model doesn't use any pretrained word embedding, so it needs an Embedding layer to train the word embedding from scratch.

In [54]:
model = Sequential()
model.add(Embedding(len(word_to_idx), 32, input_length=102, mask_zero=True, weights=[np.array(pre_em)], trainable=False))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_2 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_2 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 15,600
Non-trainable params: 480,608
_________________________________________________________________


In [56]:
%%time
model.fit(x_train,y_train,batch_size=128,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 28min 20s, sys: 4min 24s, total: 32min 45s
Wall time: 10min 18s


<keras.callbacks.History at 0x7f7907c364e0>

In [57]:
%%time
model.save_weights('/data/my_pos_no_crf.h5')
# model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

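|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 93.2294 | 99.0231 | 96.039 | 3649.0 |
| 1 | 2 | 54.6838 | 59.2386 | 56.8702 | 4886.0 |
| 2 | 3 | 49.3192 | 60.0509 | 54.1585 | 10142.0 |
| 3 | 4 | 55.8613 | 87.9895 | 68.3375 | 11370.0 |
| 4 | 5 | - | 0.0 | - | 0.0 |
| 5 | 6 | 0 | 0.0 | - | 0.0 |
| 6 | 7 | 93.0233 | 76.9601 | 84.2327 | 1600.0 |
| 7 | 8 | 84.2105 | 3.85542 | 7.37327 | 16.0 |
| 8 | 9 | - | 0.0 | - | 0.0 |
| 9 | 10 | - | 0.0 | - | 0.0 |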


CPU times: user 44.2 s, sys: 7.43 s, total: 51.6 s
Wall time: 17.9 s


## 4.1.2 Neural POS Tagger - Fix Weight

### #TODO 1
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. The word embedding should stay fixed throughout training. To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

(You may want to read about Keras's Masking layer)

Optionally, you can use your own pretrained word embedding from previous homework

In [58]:
%%time
# Write your code here
pre_em = []
pre_em.append(np.zeros(32))
for i in range(1,len(idx_to_word)+1):
    if(idx_to_word[i] in embeddings.keys()):
        pre_em.append(embeddings[idx_to_word[i]])
    else:
        pre_em.append(np.zeros(32))

CPU times: user 28.3 ms, sys: 170 µs, total: 28.4 ms
Wall time: 27.7 ms
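
The construction of the weight matrix can be sketched with a toy vocabulary. The vectors, the dimension of 4, and the names `pretrained`, `idx_to_word_toy`, and `build_embedding_matrix` are hypothetical. As in the cell above, row 0 is the padding row and words without a pretrained vector get zeros. Listing the missing words is a cheap way to see how much of the vocabulary a frozen embedding cannot represent.

```python
EMB_DIM = 4  # toy size; the notebook uses 32

# Hypothetical pretrained vectors; "dog" is deliberately missing.
pretrained = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.5, 0.6, 0.7, 0.8],
}
idx_to_word_toy = {1: "the", 2: "cat", 3: "dog"}


def build_embedding_matrix(idx_to_word, vectors, dim):
    """Row i holds the vector of word i; row 0 is the padding row."""
    matrix = [[0.0] * dim]  # index 0: padding
    for i in range(1, len(idx_to_word) + 1):
        # Out-of-vocabulary words fall back to a zero vector.
        matrix.append(list(vectors.get(idx_to_word[i], [0.0] * dim)))
    return matrix


matrix = build_embedding_matrix(idx_to_word_toy, pretrained, EMB_DIM)
missing = [w for w in idx_to_word_toy.values() if w not in pretrained]

print(len(matrix), len(matrix[0]))
print(missing)
```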


## 4.1.3 Neural POS Tagger - Trainable pretrained weight

### #TODO 2
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. However, this time the word embedding is trainable (not fixed). To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Please note that the given pretrained word embedding only has weights for the vocabulary of the BEST corpus from the previous homework.

Optionally, you can use your own pretrained word embedding from previous homework.

In [59]:
# Write your code here
model = Sequential()
model.add(Embedding(len(word_to_idx), 32, input_length=102, mask_zero=True, weights=[np.array(pre_em)], trainable=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_3 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 496,208
Non-trainable params: 0
_________________________________________________________________


In [61]:
%%time
model.fit(x_train,y_train,batch_size=256,epochs=10,verbose=1)
model.save_weights('/data/my_pos_w_crf.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 14min 59s, sys: 2min 11s, total: 17min 10s
Wall time: 5min 37s


In [62]:
%%time
model.load_weights('/data/my_pos_w_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8368 | 99.5929 | 99.7147 | 3670.0 |
| 1 | 2 | 95.0721 | 93.5621 | 94.311 | 7717.0 |
| 2 | 3 | 89.6411 | 97.3 | 93.3136 | 16433.0 |
| 3 | 4 | 99.7134 | 99.6363 | 99.6748 | 12875.0 |
| 4 | 5 | 95.2381 | 89.5522 | 92.3077 | 60.0 |
| 5 | 6 | 98.0603 | 87.1648 | 92.2921 | 455.0 |
| 6 | 7 | 97.1278 | 97.595 | 97.3608 | 2029.0 |
| 7 | 8 | 75 | 33.9759 | 46.7662 | 141.0 |
| 8 | 9 | 75.8621 | 47.8261 | 58.6667 | 176.0 |
| 9 | 10 | 60.5787 | 39.9285 | 48.1322 | 335.0 |


CPU times: user 44.6 s, sys: 6.85 s, total: 51.4 s
Wall time: 17.9 s


### #TODO 3
Compare the results of all the neural tagger models in 4.1.x and provide a convincing reason and example for the results (which model performs best or worst, and why?).

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b>
<pre style="background-color: lightgreen">
The model with trainable pretrained embeddings starts from a better initial condition and can keep adjusting its word vectors, so it performed better: 92.73 accuracy versus 57.83. The other model sets `trainable` to `False`, so its embedding weights stay fixed at their initial values. Even though the rest of the network improves with more epochs, the embedding layer cannot change, which leads to the lower accuracy.
</pre>

## 4.2.1 CRF Viterbi

Your next two tasks are to incorporate conditional random fields (CRF) into your model. <b>You do not need to use pretrained weights</b>.

A CRF layer has already been implemented for Keras. However, you need to use the official extension repository for the Keras library, called keras-contrib. You should read about the keras-contrib CRF layer before attempting this section.

### #TODO 4
Use the keras-contrib CRF layer in your model. You should set the layer parameters so that it gives the best performance at test time using the <b>Viterbi algorithm</b>. Your model must use the CRF for its loss function and metric. A CRF is quite complex compared to the previous example model, so you should train it for more epochs so that it can converge.

To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Do not forget to save this model's weights.
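
For intuition about what Viterbi decoding computes, here is a minimal pure-Python sketch over additive scores. It is not the keras-contrib implementation: the function name `viterbi` and the toy `emissions` and `transitions` values are hypothetical, and boundary scores and masking are left out.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence and its total score.

    emissions[t][k]   : score of tag k at step t
    transitions[j][k] : score of moving from tag j to tag k
    Scores are additive (for example log-potentials).
    """
    n_tags = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for arriving at tag k.
            prev = max(range(n_tags), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + scores[k])
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    last = max(range(n_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


emissions = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # staying in the same tag is rewarded
path, score = viterbi(emissions, transitions)
print(path, score)
```

With these toy scores, taking the best emission at each step independently would give `[0, 1, 1]` with a total score of 5.5, while Viterbi returns `[1, 1, 1]` with 6.5 because it also accounts for the transition scores.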

In [23]:
# Write your code here
from keras_contrib.layers import CRF
from keras.callbacks import ReduceLROnPlateau
from keras import regularizers

In [79]:
# Write your code here
import keras.backend as K
K.clear_session()
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes, activation = 'relu')))
crf = CRF(n_classes,
          learn_mode='join',
          test_mode='viterbi',
          sparse_target=False,
          use_boundary=True,
          use_bias=True,
          activation='linear',
          kernel_initializer='glorot_uniform',
          chain_initializer='orthogonal',
          bias_initializer='zeros',
          boundary_initializer='zeros',
          kernel_regularizer=None,
          chain_regularizer=None,
          boundary_regularizer=None,
          bias_regularizer=None,
          kernel_constraint=None,
          chain_constraint=None,
          boundary_constraint=None,
          bias_constraint=None,
          input_dim=None,
          unroll=False)
model.add(crf)
model.summary()
adam  = Adam(lr=0.0015)
model.compile(optimizer=adam,loss=crf.loss_function, metrics=[crf.accuracy])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
_________________________________________________________________
crf_1 (CRF)                  (None, 102, 48)           4752      
Total params: 500,960
Trainable params: 500,960
Non-trainable params: 0
_________________________________________________________________


In [42]:
%%time
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.05, patience=3, min_lr=0.001)
model.fit(x_train,y_train,batch_size=128,epochs=5,verbose=1,shuffle=True,validation_split=0.2,
         callbacks=[reduce_lr])

Train on 14800 samples, validate on 3700 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 15min 57s, sys: 2min 17s, total: 18min 15s
Wall time: 5min 57s


In [43]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 91.9514 | 94.6048 | 93.2592 | 7803.0 |
| 2 | 3 | 91.1332 | 94.5704 | 92.82 | 15972.0 |
| 3 | 4 | 99.9922 | 99.5976 | 99.7945 | 12870.0 |
| 4 | 5 | 86.8421 | 98.5075 | 92.3077 | 66.0 |
| 5 | 6 | 98.6957 | 86.9732 | 92.4644 | 454.0 |
| 6 | 7 | 98.105 | 97.114 | 97.607 | 2019.0 |
| 7 | 8 | 71.6049 | 27.9518 | 40.208 | 116.0 |
| 8 | 9 | 73.8351 | 55.9783 | 63.6785 | 206.0 |
| 9 | 10 | 61.1111 | 40.6436 | 48.8189 | 341.0 |


In [44]:
model.fit(x_train,y_train,batch_size=128,epochs=5,verbose=1,shuffle=True,validation_split=0.2, callbacks=[reduce_lr])

Train on 14800 samples, validate on 3700 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f7907c36a90>

In [45]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8639 | 99.5929 | 99.7283 | 3670.0 |
| 1 | 2 | 94.6768 | 93.5863 | 94.1284 | 7719.0 |
| 2 | 3 | 91.1388 | 95.6717 | 93.3503 | 16158.0 |
| 3 | 4 | 99.9534 | 99.6518 | 99.8024 | 12877.0 |
| 4 | 5 | 91.6667 | 98.5075 | 94.964 | 66.0 |
| 5 | 6 | 99.1266 | 86.9732 | 92.6531 | 454.0 |
| 6 | 7 | 98.2885 | 96.6811 | 97.4782 | 2010.0 |
| 7 | 8 | 72.1992 | 41.9277 | 53.0488 | 174.0 |
| 8 | 9 | 70.8738 | 59.5109 | 64.6972 | 219.0 |
| 9 | 10 | 62.1572 | 40.5244 | 49.062 | 340.0 |


In [53]:
model.save_weights('/data/viterbi_crf_2.h5')

In [75]:
model.load_weights('/data/viterbi_crf.h5')

In [80]:
%%time
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.05, patience=3, min_lr=0.001)
model.fit(x_train,y_train,batch_size=256,epochs=10,verbose=1,shuffle=True,validation_split=0.2, callbacks=[reduce_lr])

Train on 14800 samples, validate on 3700 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 17min 15s, sys: 2min 23s, total: 19min 39s
Wall time: 6min 25s


In [81]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 94.1198 | 92.9559 | 93.5342 | 7667.0 |
| 2 | 3 | 89.9381 | 96.3763 | 93.046 | 16277.0 |
| 3 | 4 | 99.9689 | 99.5898 | 99.779 | 12869.0 |
| 4 | 5 | 79.4521 | 86.5672 | 82.8571 | 58.0 |
| 5 | 6 | 100 | 86.9732 | 93.0328 | 454.0 |
| 6 | 7 | 98.0948 | 96.5849 | 97.334 | 2008.0 |
| 7 | 8 | 69.1943 | 35.1807 | 46.6454 | 146.0 |
| 8 | 9 | 73.9777 | 54.0761 | 62.4804 | 199.0 |
| 9 | 10 | 59.6078 | 36.2336 | 45.0704 | 304.0 |


In [82]:
%%time
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.05, patience=3, min_lr=0.001)
model.fit(x_train,y_train,batch_size=256,epochs=5,verbose=1,shuffle=True,validation_split=0.2, callbacks=[reduce_lr])

Train on 14800 samples, validate on 3700 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 8min 55s, sys: 1min 13s, total: 10min 9s
Wall time: 3min 17s


In [83]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 94.3702 | 92.4709 | 93.4109 | 7627.0 |
| 2 | 3 | 90.2278 | 96.1632 | 93.101 | 16241.0 |
| 3 | 4 | 99.9689 | 99.4583 | 99.7129 | 12852.0 |
| 4 | 5 | 86.9565 | 89.5522 | 88.2353 | 60.0 |
| 5 | 6 | 99.7802 | 86.9732 | 92.9376 | 454.0 |
| 6 | 7 | 98.3831 | 96.5849 | 97.4757 | 2008.0 |
| 7 | 8 | 64.5038 | 40.7229 | 49.9261 | 169.0 |
| 8 | 9 | 73.8019 | 62.7717 | 67.8414 | 231.0 |
| 9 | 10 | 58.498 | 35.2801 | 44.0149 | 296.0 |


In [84]:
model.save_weights('/data/viterbi_crf_3.h5')

In [47]:
%%time
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.05, patience=3, min_lr=0.001)
model.fit(x_train,y_train,batch_size=128,epochs=5,verbose=1,shuffle=True,validation_split=0.2, callbacks=[reduce_lr])

Train on 14800 samples, validate on 3700 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 15min 46s, sys: 2min 20s, total: 18min 7s
Wall time: 5min 52s


In [48]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.891 | 99.5115 | 99.7009 | 3667.0 |
| 1 | 2 | 93.8 | 93.9137 | 93.8568 | 7746.0 |
| 2 | 3 | 90.8744 | 94.3395 | 92.5745 | 15933.0 |
| 3 | 4 | 99.9689 | 99.3809 | 99.674 | 12842.0 |
| 4 | 5 | 83.5443 | 98.5075 | 90.411 | 66.0 |
| 5 | 6 | 96.8553 | 88.5057 | 92.4925 | 462.0 |
| 6 | 7 | 98.0507 | 96.7773 | 97.4098 | 2012.0 |
| 7 | 8 | 63.1737 | 50.8434 | 56.3418 | 211.0 |
| 8 | 9 | 63.5838 | 59.7826 | 61.6246 | 220.0 |
| 9 | 10 | 59.2453 | 37.4255 | 45.8729 | 314.0 |


In [51]:
%%time
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.05, patience=3, min_lr=0.001)
model.fit(x_train,y_train,batch_size=128,epochs=2,verbose=1,shuffle=True,validation_split=0.3, callbacks=[reduce_lr])

Train on 12950 samples, validate on 5550 samples
Epoch 1/2
Epoch 2/2
CPU times: user 5min 54s, sys: 52.4 s, total: 6min 47s
Wall time: 2min 12s


In [52]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8641 | 99.6744 | 99.7691 | 3673.0 |
| 1 | 2 | 93.6979 | 94.0955 | 93.8963 | 7761.0 |
| 2 | 3 | 90.9699 | 94.1856 | 92.5498 | 15907.0 |
| 3 | 4 | 99.93 | 99.4428 | 99.6858 | 12850.0 |
| 4 | 5 | 87.1429 | 91.0448 | 89.0511 | 61.0 |
| 5 | 6 | 97.4522 | 87.931 | 92.4471 | 459.0 |
| 6 | 7 | 97.9116 | 96.9697 | 97.4384 | 2016.0 |
| 7 | 8 | 61.5591 | 55.1807 | 58.1957 | 229.0 |
| 8 | 9 | 59.6059 | 65.7609 | 62.5323 | 242.0 |
| 9 | 10 | 61.2903 | 38.4982 | 47.2914 | 323.0 |


## 4.2.2 CRF Marginal

### #TODO 5

Use the keras-contrib CRF layer in your model. You should set the layer parameters so that it gives the best performance at test time using <b>marginal probabilities</b>. You <b>must not train the model</b> from scratch; use the pretrained weights from the previous CRF Viterbi model instead.

To finish this exercise, you must train the model and show its evaluation report, as shown in the example.
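
For intuition about marginal probabilities, the sketch below computes per-step tag posteriors of a linear-chain model with the forward-backward recursion in log space. It is a pure-Python illustration, not the keras-contrib code: the names `logsumexp` and `marginals` and the toy scores are hypothetical, and boundary scores and masking are left out.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginals(emissions, transitions):
    """Per-step tag posteriors of a linear-chain model via forward-backward."""
    n_steps, n_tags = len(emissions), len(transitions)
    # forward[t][k]: log-sum of scores of all prefixes ending in tag k at step t
    forward = [list(emissions[0])]
    for t in range(1, n_steps):
        forward.append([
            emissions[t][k] + logsumexp(
                [forward[t - 1][j] + transitions[j][k] for j in range(n_tags)])
            for k in range(n_tags)
        ])
    # backward[t][j]: log-sum of scores of all suffixes following tag j at step t
    backward = [[0.0] * n_tags for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        backward[t] = [
            logsumexp([transitions[j][k] + emissions[t + 1][k] + backward[t + 1][k]
                       for k in range(n_tags)])
            for j in range(n_tags)
        ]
    log_z = logsumexp(forward[-1])  # log of the partition function
    return [[math.exp(forward[t][k] + backward[t][k] - log_z)
             for k in range(n_tags)] for t in range(n_steps)]


emissions = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]
posterior = marginals(emissions, transitions)
decoded = [row.index(max(row)) for row in posterior]  # argmax at each step
for row in posterior:
    print(row)
print(decoded)
```

Each row of `posterior` sums to 1, and marginal decoding simply takes the argmax of each row independently of the others.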

In [64]:
# Write your code here
K.clear_session()
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes, activation = 'relu')))
crf = CRF(n_classes,
          learn_mode='marginal',
          test_mode='marginal',
          sparse_target=False,
          use_boundary=True,
          use_bias=True,
          activation='linear',
          kernel_initializer='glorot_uniform',
          chain_initializer='orthogonal',
          bias_initializer='zeros',
          boundary_initializer='zeros',
          kernel_regularizer=None,
          chain_regularizer=None,
          boundary_regularizer=None,
          bias_regularizer=None,
          kernel_constraint=None,
          chain_constraint=None,
          boundary_constraint=None,
          bias_constraint=None,
          input_dim=None,
          unroll=False)
model.add(crf)
model.summary()
adam  = Adam(lr=0.0015)
model.compile(optimizer=adam,loss=crf.loss_function, metrics=[crf.accuracy])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
_________________________________________________________________
crf_1 (CRF)                  (None, 102, 48)           4752      
Total params: 500,960
Trainable params: 500,960
Non-trainable params: 0
_________________________________________________________________


In [65]:
model.fit(x_train,y_train,batch_size=256,epochs=5,verbose=1,shuffle=True,validation_split=0.3, callbacks=[reduce_lr])

Train on 12950 samples, validate on 5550 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f79087c89b0>

In [66]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 94.7627 | 91.2585 | 92.9776 | 7527.0 |
| 2 | 3 | 88.502 | 96.1632 | 92.1737 | 16241.0 |
| 3 | 4 | 99.9766 | 99.3964 | 99.6857 | 12844.0 |
| 4 | 5 | 95.082 | 86.5672 | 90.625 | 58.0 |
| 5 | 6 | 99.1266 | 86.9732 | 92.6531 | 454.0 |
| 6 | 7 | 96.5567 | 97.114 | 96.8345 | 2019.0 |
| 7 | 8 | 78.1818 | 31.0843 | 44.4828 | 129.0 |
| 8 | 9 | 72.4771 | 21.4674 | 33.1237 | 79.0 |
| 9 | 10 | 54.159 | 34.9225 | 42.4638 | 293.0 |


In [67]:
model.fit(x_train,y_train,batch_size=256,epochs=5,verbose=1,shuffle=True,validation_split=0.3, callbacks=[reduce_lr])

Train on 12950 samples, validate on 5550 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f79087c8630>

In [68]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 95.3231 | 91.9253 | 93.5934 | 7582.0 |
| 2 | 3 | 89.4036 | 96.2165 | 92.685 | 16250.0 |
| 3 | 4 | 99.9689 | 99.6053 | 99.7868 | 12871.0 |
| 4 | 5 | 95.2381 | 89.5522 | 92.3077 | 60.0 |
| 5 | 6 | 99.1266 | 86.9732 | 92.6531 | 454.0 |
| 6 | 7 | 97.4818 | 96.8254 | 97.1525 | 2013.0 |
| 7 | 8 | 74.1935 | 38.7952 | 50.9494 | 161.0 |
| 8 | 9 | 73.1602 | 45.9239 | 56.4274 | 169.0 |
| 9 | 10 | 64.1618 | 39.6901 | 49.0427 | 333.0 |


In [69]:
model.fit(x_train,y_train,batch_size=256,epochs=5,verbose=1,shuffle=True,validation_split=0.3, callbacks=[reduce_lr])

Train on 12950 samples, validate on 5550 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f79071dcbe0>

In [70]:
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

|   | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8097 | 99.6472 | 99.7284 | 3672.0 |
| 1 | 2 | 94.8731 | 92.8831 | 93.8675 | 7661.0 |
| 2 | 3 | 89.3545 | 96.9625 | 93.0032 | 16376.0 |
| 3 | 4 | 99.9689 | 99.5434 | 99.7557 | 12863.0 |
| 4 | 5 | 95.2381 | 89.5522 | 92.3077 | 60.0 |
| 5 | 6 | 99.7802 | 86.9732 | 92.9376 | 454.0 |
| 6 | 7 | 96.887 | 97.3064 | 97.0962 | 2023.0 |
| 7 | 8 | 73.5632 | 46.2651 | 56.8047 | 192.0 |
| 8 | 9 | 75.2941 | 52.1739 | 61.6372 | 192.0 |
| 9 | 10 | 66.3286 | 38.975 | 49.0991 | 327.0 |


In [71]:
model.save_weights('/data/marginal_crf.h5')

### #TODO 6

Please pick the best example that shows the difference between a CRF that uses Viterbi and a CRF that uses marginal probabilities. Compare the results and provide a convincing reason (which model performs better, and why?).

<b>Write your answer here :</b>
<pre style="white-space: wrap; padding-right: 10px">
&nbsp;&nbsp;&nbsp;&nbsp;The marginal CRF and the Viterbi CRF yield similar results. As seen above, the best result comes from the Viterbi CRF with 92.87 accuracy, compared with 92.71 accuracy for marginal probabilities, which is a very small difference. `Forward-Backward gives <i>marginal</i> probability for each individual state, <i>Viterbi</i> gives probability of the most likely sequence of states` (<a href="https://stats.stackexchange.com/questions/31746/what-is-the-difference-between-the-forward-backward-and-viterbi-algorithms/222270">reference</a>). Viterbi decoding returns the single most likely tag sequence as a whole, while marginal decoding picks the most probable tag at each position independently, using the per-position marginals. Both are dynamic programming algorithms with the same asymptotic cost; the difference is in what they optimize. Because Viterbi scores the whole sequence, its output always respects the learned transition scores, whereas per-position choices can combine into a sequence that is unlikely as a whole. Viterbi is preferred when whole-sequence consistency matters, and marginal decoding when per-tag accuracy or per-tag confidence is the goal.
</pre>
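
A small worked example, separate from the experiments above, shows that the two decoding strategies can disagree. The chain below is hypothetical: two steps, two tags, zero emission scores, and transition scores chosen by hand. It is small enough to enumerate every tag sequence directly instead of running Viterbi or forward-backward.

```python
import itertools
import math

# Hypothetical 2-step, 2-tag chain. All emission scores are zero, so the
# transition scores alone decide the distribution over tag sequences.
emissions = [[0.0, 0.0], [0.0, 0.0]]
transitions = [[math.log(4.0), math.log(1.0)],
               [math.log(3.0), math.log(3.0)]]


def sequence_score(tags):
    """Sum of emission and transition scores along one tag sequence."""
    score = sum(emissions[t][k] for t, k in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


n_steps, n_tags = len(emissions), len(transitions)
sequences = list(itertools.product(range(n_tags), repeat=n_steps))
weights = {seq: math.exp(sequence_score(seq)) for seq in sequences}
z = sum(weights.values())  # partition function

# Viterbi-style decoding: the single most probable sequence.
best_sequence = max(sequences, key=lambda seq: weights[seq])

# Marginal decoding: the most probable tag at each step, chosen independently.
posterior = [[sum(w for seq, w in weights.items() if seq[t] == k) / z
              for k in range(n_tags)] for t in range(n_steps)]
marginal_sequence = tuple(row.index(max(row)) for row in posterior)

print(best_sequence, weights[best_sequence] / z)
print(marginal_sequence, weights[marginal_sequence] / z)
```

Here the most probable sequence is `(0, 0)` with probability 4/11. Marginal decoding picks tag 1 at the first step (probability 6/11) and tag 0 at the second (7/11), giving `(1, 0)`, whose joint probability is only 3/11.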