# HW 4 - Neural POS Tagger

In this exercise, you are going to build a set of deep learning models for part-of-speech (POS) tagging using TensorFlow and Keras. TensorFlow is a deep learning framework developed by Google, and Keras is a frontend library built on top of TensorFlow (or Theano, CNTK) that provides an easier way to use standard layers and networks.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following types:

- Neural POS Tagging with Word Embedding using Fixed / non-Fixed Pretrained weights
- Neural POS Tagging with Viterbi / Marginal CRF

A pretrained word embedding is already provided for you (albeit a very bad one). Optionally, you can use your best pretrained word embedding from the previous exercise.

We also provide code for data cleaning and preprocessing, plus some starter code for Keras, in this notebook, but feel free to modify those parts to suit your needs. You can also complete this exercise using only TensorFlow (without Keras). Feel free to use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to shut down your instance on Gcloud when you are not using it ###

## 1. Setup and Preprocessing

We use POS data from the [ORCHID corpus](https://www.nectec.or.th/corpus/index.php?league=pm), a POS-tagged corpus for the Thai language.
A function that reads the corpus into a list of sentences, each a list of (word, POS) pairs, is already implemented. Example usage is shown below.
We also create a random word vector for unknown words.

In [1]:
from data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
import keras.preprocessing
np.random.seed(42)

Using TensorFlow backend.


In [2]:
unk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
print(train_data[0] , len(train_data[0]))
print(train_data[1], len(train_data[1]))
print(train_data[-1], len(train_data[-1]))
print(len(train_data))

max_len = max([len(line) for line in train_data])
print(max_len)

[('การ', 'FIXN'), ('ประชุม', 'VACT'), ('ทาง', 'NCMN'), ('วิชาการ', 'NCMN'), ('<space>', 'PUNC'), ('ครั้ง', 'CFQC'), ('ที่ 1', 'DONM')] 7
[('โครงการวิจัยและพัฒนา', 'NCMN'), ('อิเล็กทรอนิกส์', 'NCMN'), ('และ', 'JCRG'), ('คอมพิวเตอร์', 'NCMN')] 4
[('อาจ', 'XVMM'), ('มี', 'VSTA'), ('โรคมะเร็ง', 'NCMN'), ('ได้', 'XVAE'), ('<space>', 'PUNC'), ('มาก', 'ADVN'), ('กว่า', 'JCMP'), ('<space>', 'PUNC'), ('1', 'DCNM'), ('<space>', 'PUNC'), ('ชนิด', 'CLTV'), ('<space>', 'PUNC'), ('ดัง', 'RPRE'), ('ตาราง', 'NCMN'), ('ที่ 2', 'DONM')] 15
18500
102


Next, we load the pretrained embedding weights using pickle. The pretrained weights are stored as a dictionary that maps each word to its embedding.

In [3]:
import pickle
# The context manager closes the file even if loading fails.
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)

The code below generates an indexed dataset (each word is represented by a number) for the training and testing data. Index 0 is reserved for padding to help with variable-length sequences. (You can read more about padding at https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/)

## 2. Prepare Data

In [4]:
word_to_idx ={}
idx_to_word ={}
label_to_idx = {}
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
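# Note: word indices start at 1, so len(word_to_idx) is also the index of the
# last word added above; 'UNK' therefore shares that index.
word_to_idx['UNK'] = len(word_to_idx)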

n_classes = len(label_to_idx.keys())+1
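The indexing scheme can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: the toy sentences and all names prefixed with toy_ are made up, and, unlike the cell above, it assumes UNK should get the next free index so that it does not share an index with a real word.

```python
# Toy data in the same (word, tag) format as get_sentences(); values are made up.
toy_sentences = [
    [("a", "X"), ("b", "Y")],
    [("a", "X"), ("c", "Z")],
]

toy_word_to_idx = {}
toy_label_to_idx = {}
for sentence in toy_sentences:
    for word, pos in sentence:
        # Index 0 is reserved for padding, so real entries start at 1.
        toy_word_to_idx.setdefault(word, len(toy_word_to_idx) + 1)
        toy_label_to_idx.setdefault(pos, len(toy_label_to_idx) + 1)

# Assumption: give UNK the next free index so it collides with no real word.
toy_word_to_idx["UNK"] = len(toy_word_to_idx) + 1

# Sizes an Embedding / softmax layer would need (+1 for the padding index 0).
toy_vocab_size = len(toy_word_to_idx) + 1
toy_n_classes = len(toy_label_to_idx) + 1
print(toy_word_to_idx, toy_vocab_size, toy_n_classes)
```

With this variant the embedding layer would need len(word_to_idx) + 1 rows rather than len(word_to_idx), so the rest of the notebook would have to be adjusted accordingly.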

This section is tweaked a little from the demo: word2features returns a word index instead of features, sent2features returns the sequence of word indices for a sentence, and sent2labels returns the sequence of label indices.

In [5]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [6]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

In [7]:
print(len(word_to_idx), len(label_to_idx), idx_to_word[29],idx_to_word[327],
     idx_to_word[5],idx_to_word[328])

15019 47 รัฐมนตรีว่าการ กระทรวงวิทยาศาสตร์เทคโนโลยีและการพลังงาน <space> ประธานกรรมการ


Next we create the train and test datasets, then use Keras to post-pad each sequence with 0 up to the maximum sequence length. The labels are converted to one-hot vectors.

In [8]:
%%time
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data])
y_train = [sent2labels(sent) for sent in train_data]
y_train_temp = y_train
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 364 ms, sys: 8 ms, total: 372 ms
Wall time: 372 ms


In [9]:
y_train

[array([1, 2, 3, 3, 4, 5, 6], dtype=int32),
 array([3, 3, 7, 3], dtype=int32),
 array([3, 4, 8], dtype=int32),
 array([9, 4, 6], dtype=int32),
 array([10], dtype=int32),
 array([10], dtype=int32),
 array([3, 6, 4, 6, 4, 3, 4, 8], dtype=int32),
 array([3, 4, 9, 4, 8], dtype=int32),
 array([3], dtype=int32),
 array([11,  4,  3, 10], dtype=int32),
 array([10, 12, 13,  1,  2,  3, 14,  1,  2,  3, 14,  3,  4, 14,  3,  3, 14,
         1, 13,  3, 15, 15], dtype=int32),
 array([14,  1,  2, 16, 13,  3, 17,  4, 18, 19, 13,  3, 20,  9, 14,  1, 13,
         3,  7, 13,  3,  4, 14,  4,  1,  2,  3, 21,  2, 14,  1,  2, 14,  3], dtype=int32),
 array([10,  4, 18, 12,  2, 13,  3, 22, 14,  1,  2,  3,  3,  7,  3], dtype=int32),
 array([16,  3, 17, 18, 13,  3, 21, 22, 15, 14,  3], dtype=int32),
 array([16, 13,  3,  4,  7,  3, 21, 22, 14,  1,  2,  3, 20, 20,  9], dtype=int32),
 array([14,  3,  4,  8,  4, 10,  4, 14,  3,  3,  3, 12,  2, 10, 23], dtype=int32),
 array([16,  2,  3,  3, 14,  3,  4,  7,  3, 14,  3,

In [34]:
x_train=keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
x_test=keras.preprocessing.sequence.pad_sequences(x_test, maxlen=102, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)

In [9]:
print(x_train[100],x_train.shape)
print(y_train[100][3],y_train.shape)

[ 29 327   5 328   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] (18500, 102)
[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.] (18500, 102, 48)
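The padding and one-hot steps can be sketched in plain NumPy. This is an illustrative sketch only: pad_post is a hypothetical helper that mimics post-padding with front truncation, not the Keras function, and the toy labels are made up. It also shows that indexing np.eye with the whole padded array produces the one-hot tensor without a Python loop.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Right-pad (and, if needed, cut from the front) each sequence to maxlen."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        seq = list(seq)[-maxlen:]      # keep the last maxlen items
        out[row, :len(seq)] = seq
    return out

toy_labels = [[1, 2, 3], [2, 1]]
toy_n_classes = 4                      # 3 tags + 1 padding class (index 0)

padded = pad_post(toy_labels, maxlen=4)
# Fancy indexing: every label picks one row of the identity matrix,
# turning shape (2, 4) into shape (2, 4, 4).
one_hot = np.eye(toy_n_classes)[padded]
print(padded)
print(one_hot.shape)
```

Note that padding positions become the one-hot vector for class 0, which is why the label indices start at 1 and n_classes includes one extra class.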


## 3. Evaluate

The output from Keras is a probability distribution over all possible labels at each position. outputToLabel returns the index of the most probable label at each position of the output sequence.

evaluation_report is the same as in the demo

In [10]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out
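The same decoding step can be written without an explicit loop. The sketch below is an alternative under the assumption that yt is a NumPy array of shape (time steps, classes); output_to_label_vectorized and the toy probabilities are illustrative names and values, not part of the starter code.

```python
import numpy as np

def output_to_label_vectorized(yt, seq_len):
    """Argmax over the class axis, keeping only the first seq_len time steps."""
    return np.argmax(yt[:seq_len], axis=-1).tolist()

toy_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],   # padding position, dropped when seq_len is 2
])
print(output_to_label_vectorized(toy_probs, 2))
```

Slicing with yt[:seq_len] also covers the case where seq_len is larger than the padded length, in which case every time step is kept.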

In [11]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
    display(df)
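To make the per-tag numbers in the report concrete, here is a small sketch of the same precision, recall and F-score arithmetic for a single tag. The function tag_scores and the toy sequences are hypothetical; it returns None where the report prints '-'.

```python
def tag_scores(y_true, y_pred, tag):
    """Precision, recall and F-score (in percent) for one tag."""
    pairs = [(t, p) for st, sp in zip(y_true, y_pred) for t, p in zip(st, sp)]
    correct = sum(1 for t, p in pairs if t == p == tag)
    n_pred = sum(1 for _, p in pairs if p == tag)   # times the tag was predicted
    n_true = sum(1 for t, _ in pairs if t == tag)   # times the tag was the answer
    precision = 100.0 * correct / n_pred if n_pred else None
    recall = 100.0 * correct / n_true if n_true else 0.0
    if precision is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

toy_true = [[1, 2, 2], [2, 1]]
toy_pred = [[1, 2, 1], [2, 2]]
print(tag_scores(toy_true, toy_pred, 2))
```

In the toy data tag 2 is predicted three times and is the true tag three times, with two of those cases matching, so precision and recall are both two thirds.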

## 4. Train a model

In [14]:
from keras.models import Sequential, Model
from keras.layers import (Embedding, Reshape, Activation, Input, Dense, GRU,
                          TimeDistributed, Bidirectional, Dropout, Masking)
from keras_contrib.layers import CRF
from keras.optimizers import Adam

The models in this section are separated into two groups:

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1.1 Neural POS Tagger  (Example)

We create a simple neural POS tagger as an example for you. This model doesn't use any pretrained word embedding, so it needs an Embedding layer to train the word embedding from scratch.

In [12]:
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 496,208
Non-trainable params: 0
_________________________________________________________________
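The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the classic GRU parameterisation (three gates, one bias vector per gate), which is consistent with the 12,480 reported above; the variable names are illustrative.

```python
vocab_size, emb_dim, units, n_tags = 15019, 32, 32, 48

embedding = vocab_size * emb_dim
# One GRU direction: 3 gates, each with input weights, recurrent weights and a bias.
gru_one_direction = 3 * (emb_dim * units + units * units + units)
bidirectional = 2 * gru_one_direction
# Dense layer applied to the concatenated forward/backward states (2 * units inputs).
dense = (2 * units) * n_tags + n_tags

print(embedding, bidirectional, dense, embedding + bidirectional + dense)
```

Almost all of the parameters sit in the embedding table, which is why freezing it changes the trainable count so drastically.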


In [13]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 55min 5s, sys: 8min 56s, total: 1h 4min 1s
Wall time: 20min 12s


<keras.callbacks.History at 0x7fe82116f048>

In [14]:
%%time
model.save_weights('/data/exp_pos_no_crf.h5')
#model.load_weights('/data/exp_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8092 | 99.3758 | 99.5921 | 3662.0 |
| 1 | 2 | 94.8278 | 94.4714 | 94.6493 | 7792.0 |
| 2 | 3 | 91.062 | 96.5184 | 93.7108 | 16301.0 |
| 3 | 4 | 99.9689 | 99.3654 | 99.6662 | 12840.0 |
| 4 | 5 | 91.6667 | 98.5075 | 94.964 | 66.0 |
| 5 | 6 | 99.7817 | 87.5479 | 93.2653 | 457.0 |
| 6 | 7 | 97.6374 | 97.4026 | 97.5199 | 2025.0 |
| 7 | 8 | 67.3716 | 53.7349 | 59.7855 | 223.0 |
| 8 | 9 | 56.3725 | 62.5 | 59.2784 | 230.0 |
| 9 | 10 | 62.6316 | 42.5507 | 50.6742 | 357.0 |


CPU times: user 45.4 s, sys: 7.66 s, total: 53 s
Wall time: 18.5 s


## 4.1.2 Neural POS Tagger - Fix Weight

### #TODO 1
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. The word embedding should stay fixed throughout training. To finish this exercise you must train the model and show its evaluation report, as in the example.

(You may want to read about Keras's Masking layer)

Optionally, you can use your own pretrained word embedding from previous homework

In [70]:
pre_em = []
pre_em.append(np.zeros(32))
for i in range(1,len(idx_to_word)+1):
    if(idx_to_word[i] in embeddings.keys()):
        pre_em.append(embeddings[idx_to_word[i]])
    else:
        pre_em.append(np.zeros(32))

In [71]:
len(pre_em)

15019
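The cell above fills the rows of words that have no pretrained vector with zeros. The sketch below shows one possible alternative, under the assumption that a small random vector is preferable for such words, and also reports which words are missing. build_embedding_matrix and the toy dictionaries are hypothetical names and data, not part of the starter code.

```python
import numpy as np

def build_embedding_matrix(idx_to_word, embeddings, dim, seed=42):
    """Row i holds the vector for idx_to_word[i]; row 0 stays zero for padding."""
    rng = np.random.RandomState(seed)
    matrix = np.zeros((len(idx_to_word) + 1, dim), dtype="float32")
    missing = []
    for idx, word in idx_to_word.items():
        if word in embeddings:
            matrix[idx] = embeddings[word]
        else:
            # Assumption: small random vector instead of zeros for uncovered words.
            matrix[idx] = rng.normal(scale=0.1, size=dim)
            missing.append(word)
    return matrix, missing

toy_idx_to_word = {1: "a", 2: "b", 3: "c"}
toy_embeddings = {"a": np.ones(4), "c": np.full(4, 2.0)}
toy_matrix, toy_missing = build_embedding_matrix(toy_idx_to_word, toy_embeddings, dim=4)
print(toy_matrix.shape, toy_missing)
```

Counting the missing words on the real data is a quick way to see how much of the ORCHID vocabulary the pretrained embedding actually covers.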

In [19]:
# Write your code here
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True, weights=[np.array(pre_em)], trainable=False))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 15,600
Non-trainable params: 480,608
_________________________________________________________________


In [20]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 55min 8s, sys: 8min 50s, total: 1h 3min 58s
Wall time: 19min 59s


<keras.callbacks.History at 0x7f8837a8a208>

In [21]:
%%time
model.save_weights('/data/fixw_pos_no_crf.h5')
#model.load_weights('/data/exp_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 94.7097 | 99.5929 | 97.0899 | 3670.0 |
| 1 | 2 | 64.1585 | 65.9554 | 65.0445 | 5440.0 |
| 2 | 3 | 54.5816 | 64.2963 | 59.042 | 10859.0 |
| 3 | 4 | 62.9213 | 85.2422 | 72.4004 | 11015.0 |
| 4 | 5 | - | 0.0 | - | 0.0 |
| 5 | 6 | 22.2222 | 0.383142 | 0.753296 | 2.0 |
| 6 | 7 | 93.981 | 85.6181 | 89.6048 | 1780.0 |
| 7 | 8 | 26.3158 | 4.81928 | 8.14664 | 20.0 |
| 8 | 9 | 16.1765 | 2.98913 | 5.04587 | 11.0 |
| 9 | 10 | - | 0.0 | - | 0.0 |


CPU times: user 45.6 s, sys: 7.3 s, total: 52.9 s
Wall time: 18.2 s


## 4.1.3 Neural POS Tagger - Trainable pretrained weight

### #TODO 2
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. However, the word embedding is trainable (not fixed). To finish this exercise you must train the model and show its evaluation report, as in the example.

Please note that the given pretrained word embedding only has weights for the vocabulary of the BEST corpus from the previous homework.

Optionally, you can use your own pretrained word embedding from previous homework.

In [17]:
# Write your code here
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True, weights=[np.array(pre_em)], trainable=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 496,208
Non-trainable params: 0
_________________________________________________________________


In [25]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 55min 25s, sys: 8min 56s, total: 1h 4min 21s
Wall time: 20min 11s


<keras.callbacks.History at 0x7f88377034e0>

In [26]:
%%time
model.save_weights('/data/nfixw_pos_no_crf.h5')
#model.load_weights('/data/exp_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.8911 | 99.5929 | 99.7418 | 3670.0 |
| 1 | 2 | 94.9155 | 93.9258 | 94.418 | 7747.0 |
| 2 | 3 | 91.199 | 96.6961 | 93.8671 | 16331.0 |
| 3 | 4 | 99.9379 | 99.5898 | 99.7636 | 12869.0 |
| 4 | 5 | 84.4156 | 97.0149 | 90.2778 | 65.0 |
| 5 | 6 | 99.7817 | 87.5479 | 93.2653 | 457.0 |
| 6 | 7 | 97.824 | 97.3064 | 97.5645 | 2023.0 |
| 7 | 8 | 69.209 | 59.0361 | 63.7191 | 245.0 |
| 8 | 9 | 70.3264 | 64.4022 | 67.234 | 237.0 |
| 9 | 10 | 63.4752 | 42.6698 | 51.0335 | 358.0 |


CPU times: user 46 s, sys: 7.4 s, total: 53.4 s
Wall time: 18.4 s


### #TODO 3
Compare the results of all neural tagger models in 4.1.x and provide a convincing reason and example for the results (which model performs best or worst, and why?)

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b> 

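The model with trainable pretrained embeddings gave the best results: on most of the tags shown it scores slightly higher than the example model trained from scratch, and both are far ahead of the fixed-weight model.

The pretrained weights were trained on the BEST2010 corpus, which has a larger vocabulary, so they are probably a better starting representation of the words.

Making the embedding trainable helps adapt the BEST2010 weights to ORCHID and lets the model learn vectors for ORCHID words that are not in the BEST2010 vocabulary. With fixed weights those words keep the all-zero vectors assigned above and cannot be told apart, which likely explains why the fixed-weight model performs worst.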

## 4.2.1 CRF Viterbi

Your next two tasks are to incorporate conditional random fields (CRF) into your model. <b>You do not need to use pretrained weights</b>.

A CRF layer for Keras is already implemented for you. However, you need to use the official extension repository for the Keras library, called keras-contrib. You should read about the keras-contrib CRF layer before attempting this section.

### #TODO 4
Use the keras-contrib CRF layer in your model. You should set the layer parameters so that it gives the best test performance using the <b>Viterbi algorithm</b>. Your model must use the CRF's loss function and metric. A CRF is quite complex compared to the previous example model, so you should train it for more epochs so that it can converge.

To finish this exercise you must train the model and show its evaluation report, as in the example.

Do not forget to save this model's weights.
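For intuition about what Viterbi decoding does, here is a minimal NumPy sketch of the algorithm on a linear chain. It is not the keras-contrib implementation: viterbi_decode, the score matrices and their values are illustrative assumptions, and boundary scores are left out for brevity.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions:   (T, K) score of each tag at each time step
    transitions: (K, K) score of moving from tag i to tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()          # best score of a path ending in each tag
    backpointers = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in i, then step i -> j, then emit j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

toy_emissions = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
toy_transitions = np.array([[1.0, -2.0], [-2.0, 1.0]])
print(viterbi_decode(toy_emissions, toy_transitions))
```

In this toy example, taking the best emission at each step independently would give tags 0, 1, 1, but the transition scores penalise switching tags, so the best whole path stays on tag 1 throughout.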

In [18]:
from keras_contrib.layers import CRF
from keras.callbacks import ReduceLROnPlateau
from keras import regularizers

In [123]:
# Write your code here
# import keras.backend as K
# K.clear_session()
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
# model.add(Masking(mask_value=0))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes, activation='tanh')))
crf = CRF(n_classes,
#           learn_mode='join',
#           test_mode='viterbi',
#           sparse_target=True,
#           use_boundary=True,
#           use_bias=True,
#           activation='linear',
#           kernel_initializer='glorot_uniform',
#           chain_initializer='orthogonal',
#           bias_initializer='zeros',
#           boundary_initializer='zeros',
#           kernel_regularizer=regularizers.l1_l2(0.),
#           chain_regularizer=regularizers.l1_l2(0.),
#           boundary_regularizer=regularizers.l1_l2(0.),
#           bias_regularizer=regularizers.l1_l2(0.),
#           kernel_constraint=None,
#           chain_constraint=None,
#           boundary_constraint=None,
#           bias_constraint=None,
#           input_dim=None,
#           unroll=False
         )
model.add(crf)
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,loss=crf.loss_function, metrics=[crf.accuracy])
# model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_13 (Bidirectio (None, 102, 64)           12480     
_________________________________________________________________
dropout_20 (Dropout)         (None, 102, 64)           0         
_________________________________________________________________
time_distributed_13 (TimeDis (None, 102, 48)           3120      
_________________________________________________________________
crf_9 (CRF)                  (None, 102, 48)           4752      
Total params: 500,960
Trainable params: 500,960
Non-trainable params: 0
_________________________________________________________________
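The 4,752 parameters of the CRF layer can be accounted for as follows. This breakdown is an assumption about how the layer is parameterised (an input projection, a tag-to-tag transition matrix, a bias, and start/end boundary scores); it is offered because it adds up to the number in the summary, not as a statement of the library's internals.

```python
n_tags = 48        # CRF units
input_dim = 48     # output size of the TimeDistributed(Dense) layer below the CRF

kernel = input_dim * n_tags    # projects the inputs to per-tag scores
chain = n_tags * n_tags        # tag-to-tag transition scores
bias = n_tags
boundaries = 2 * n_tags        # start and end scores, one per tag

print(kernel + chain + bias + boundaries)
```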


In [124]:
%%time
model.fit(x_train,y_train,batch_size=128,epochs=20,verbose=1,shuffle=True,validation_split=0.15)

Train on 15725 samples, validate on 2775 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
CPU times: user 1h 8min 59s, sys: 10min 20s, total: 1h 19min 20s
Wall time: 25min 36s


<keras.callbacks.History at 0x7f2fe6e12e10>

In [126]:
%%time
model.fit(x_train,y_train,batch_size=128,epochs=5,verbose=1,shuffle=True,validation_split=0.15)

Train on 15725 samples, validate on 2775 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 17min 29s, sys: 2min 37s, total: 20min 6s
Wall time: 6min 28s


<keras.callbacks.History at 0x7f2fe6e12208>

In [128]:
%%time
model.save_weights('/data/crf_viterbi.h5')
#model.load_weights('/data/exp_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.7554 | 99.5929 | 99.6741 | 3670.0 |
| 1 | 2 | 93.5669 | 92.9316 | 93.2482 | 7665.0 |
| 2 | 3 | 91.9416 | 93.9014 | 92.9111 | 15859.0 |
| 3 | 4 | 99.8991 | 99.5821 | 99.7403 | 12868.0 |
| 4 | 5 | 80.7692 | 94.0299 | 86.8966 | 63.0 |
| 5 | 6 | 98.5325 | 90.0383 | 94.0941 | 470.0 |
| 6 | 7 | 97.4916 | 97.2102 | 97.3507 | 2021.0 |
| 7 | 8 | 66.2338 | 49.1566 | 56.4315 | 204.0 |
| 8 | 9 | 65.4867 | 60.3261 | 62.8006 | 222.0 |
| 9 | 10 | 50.4803 | 56.3766 | 53.2658 | 473.0 |


CPU times: user 1min 8s, sys: 11.4 s, total: 1min 19s
Wall time: 27.1 s


## 4.2.2 CRF Marginal

### #TODO 5

Use the keras-contrib CRF layer in your model. You should set the layer parameters so that it gives the best test performance using <b>marginal probabilities</b>. You <b>must not train the model</b> from scratch; instead, start from the pretrained weights of the previous CRF Viterbi model.

To finish this exercise you must train the model and show its evaluation report, as in the example.
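For intuition about marginal probabilities, the sketch below computes per-position tag marginals of a linear-chain model with the forward-backward algorithm in log space. It is an illustration only, not the keras-contrib code: crf_marginals, logsumexp and the toy scores are assumed names and values, and boundary scores are omitted.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def crf_marginals(emissions, transitions):
    """Per-position tag probabilities, shape (T, K), via forward-backward."""
    T, K = emissions.shape
    alpha = np.zeros((T, K))   # log-score of all prefixes ending in each tag
    beta = np.zeros((T, K))    # log-score of all suffixes following each tag
    alpha[0] = emissions[0]
    for t in range(1, T):
        alpha[t] = emissions[t] + logsumexp(alpha[t - 1][:, None] + transitions, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(transitions + (emissions[t + 1] + beta[t + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)   # log of the partition function
    return np.exp(alpha + beta - log_z)

toy_emissions = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
toy_transitions = np.array([[1.0, -2.0], [-2.0, 1.0]])
toy_marginals = crf_marginals(toy_emissions, toy_transitions)
print(np.round(toy_marginals, 3))
print(toy_marginals.argmax(axis=1))   # marginal decoding: best tag per position
```

Each row of the result sums to 1, and marginal decoding simply takes the argmax of each row.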

In [129]:
# Write your code here
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes)))
crf = CRF(n_classes,
          learn_mode='marginal',
          test_mode='marginal',
          sparse_target=False,
          use_boundary=True,
          use_bias=True,
          activation='linear',
          kernel_initializer='glorot_uniform',
          chain_initializer='orthogonal',
          bias_initializer='zeros',
          boundary_initializer='zeros',
          kernel_regularizer=None,
          chain_regularizer=None,
          boundary_regularizer=None,
          bias_regularizer=None,
          kernel_constraint=None,
          chain_constraint=None,
          boundary_constraint=None,
          bias_constraint=None,
          input_dim=None,
          unroll=False)
model.add(crf)
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss=crf.loss_function, metrics=[crf.accuracy])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_14 (Bidirectio (None, 102, 64)           12480     
_________________________________________________________________
dropout_21 (Dropout)         (None, 102, 64)           0         
_________________________________________________________________
time_distributed_14 (TimeDis (None, 102, 48)           3120      
_________________________________________________________________
crf_10 (CRF)                 (None, 102, 48)           4752      
Total params: 500,960
Trainable params: 500,960
Non-trainable params: 0
_________________________________________________________________


In [138]:
%%time
model.load_weights('/data/crf_viterbi.h5')
model.fit(x_train,y_train,batch_size=256,epochs=10,verbose=1,shuffle=True,validation_split=0.2)

Train on 14800 samples, validate on 3700 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 18min 48s, sys: 2min 49s, total: 21min 38s
Wall time: 7min 41s


In [140]:
%%time
model.fit(x_train,y_train,batch_size=256,epochs=5,verbose=1,shuffle=True,validation_split=0.2)

Train on 14800 samples, validate on 3700 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 9min 20s, sys: 1min 24s, total: 10min 45s
Wall time: 3min 49s


<keras.callbacks.History at 0x7f2fec68f9b0>

In [142]:
%%time
model.save_weights('/data/crf_marginal2.h5')
# model.load_weights('/data/crf_marginal.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| | tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|---|
| 0 | 1 | 99.5931 | 99.6201 | 99.6066 | 3671.0 |
| 1 | 2 | 92.0245 | 92.7498 | 92.3857 | 7650.0 |
| 2 | 3 | 90.5401 | 94.6948 | 92.5708 | 15993.0 |
| 3 | 4 | 99.8758 | 99.5589 | 99.7171 | 12865.0 |
| 4 | 5 | 83.7838 | 92.5373 | 87.9433 | 62.0 |
| 5 | 6 | 96.146 | 90.8046 | 93.399 | 474.0 |
| 6 | 7 | 97.7151 | 96.6811 | 97.1954 | 2010.0 |
| 7 | 8 | 63.9394 | 50.8434 | 56.6443 | 211.0 |
| 8 | 9 | 66.3551 | 57.8804 | 61.8287 | 213.0 |
| 9 | 10 | 52.8455 | 46.4839 | 49.461 | 390.0 |


CPU times: user 1min 7s, sys: 12.4 s, total: 1min 19s
Wall time: 28.5 s


### #TODO 6

Please pick the best example that shows the difference between a CRF that uses Viterbi and a CRF that uses marginal probabilities. Compare the results and provide a convincing reason (which model performs better, and why?)

<b>Write your answer here :</b>

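Viterbi decoding returns the single most probable tag sequence as a whole, so its output takes the learned transition scores between adjacent tags into account. Both decoders are dynamic programs whose cost grows with the sequence length and with the square of the number of tags, and the prediction times measured above (27.1 s and 28.5 s wall time) are similar.

Marginal decoding instead picks, at each position separately, the tag with the highest marginal probability. Each choice is the most probable one for its position, but the resulting sequence is not guaranteed to be the most probable sequence overall. In the reports above the two models score very similarly, with the Viterbi model slightly higher on most of the tags shown.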
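A toy illustration of how the two decoding rules can disagree: the joint distribution below is made up for the example (it is not produced by any model in this notebook) and covers tag sequences of length 2 over tags 0 and 1.

```python
# Made-up joint distribution over tag sequences of length 2 with tags 0 and 1.
joint = {(0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3, (0, 0): 0.0}

# Viterbi-style answer: the single most probable whole sequence.
best_sequence = max(joint, key=joint.get)

# Marginal-style answer: the most probable tag at each position on its own.
marginal_choice = []
for position in range(2):
    tag_prob = {tag: sum(p for seq, p in joint.items() if seq[position] == tag)
                for tag in (0, 1)}
    marginal_choice.append(max(tag_prob, key=tag_prob.get))
marginal_choice = tuple(marginal_choice)

print(best_sequence, joint[best_sequence])
print(marginal_choice, joint[marginal_choice])
```

Here the most probable sequence is (0, 1), while tag 1 is the most probable tag at both positions, so marginal decoding returns (1, 1), a sequence with lower joint probability.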