# HW 3 - Neural POS Tagger

In this exercise, you are going to build a set of deep learning models for part-of-speech (POS) tagging using TensorFlow and Keras. TensorFlow is a deep learning framework developed by Google, and Keras is a frontend library built on top of TensorFlow (or Theano, CNTK) that provides an easier way to use standard layers and networks.

To complete this exercise, you will need to build deep learning models for POS tagging in Thai using NECTEC's ORCHID corpus. You will build one model for each of the following types:

- Neural POS Tagging with Word Embedding using Fixed / non-Fixed Pretrained weights
- Neural POS Tagging with Viterbi / Marginal CRF

A pretrained word embedding is already provided for you to use (albeit a very bad one).

We also provide the code for data cleaning and preprocessing, plus some starter code for Keras, in this notebook. Feel free to modify those parts to suit your needs. You can also complete this exercise using only TensorFlow (without Keras), and you may use additional libraries (e.g. scikit-learn) as long as you have a model for each type mentioned above.

### Don't forget to shut down your instance on Gcloud when you are not using it ###

## 1. Setup and Preprocessing

We use POS data from the [ORCHID corpus](https://www.nectec.or.th/corpus/index.php?league=pm), a POS-tagged corpus for the Thai language.
A method that reads the corpus into a list of sentences of (word, POS) pairs has already been implemented. Example usage is shown below.
We also create a random word vector for unknown words.

In [1]:
from data.orchid_corpus import get_sentences
import numpy as np
import numpy.random
import keras.preprocessing
np.random.seed(42)

Using TensorFlow backend.


In [2]:
unk_emb =np.random.randn(32)
train_data = get_sentences('train')
test_data = get_sentences('test')
print(train_data[0])
print(test_data[0])

[('การ', 'FIXN'), ('ประชุม', 'VACT'), ('ทาง', 'NCMN'), ('วิชาการ', 'NCMN'), ('<space>', 'PUNC'), ('ครั้ง', 'CFQC'), ('ที่ 1', 'DONM')]
[('5', 'NLBL'), ('<full_stop>', 'PUNC'), ('การ', 'FIXN'), ('ออกแบบ', 'VACT'), ('คลังข้อมูล', 'NCMN'), ('มาตรฐาน', 'NCMN'), ('<space>', 'PUNC'), ('คลังข้อมูล', 'NCMN'), ('มาตรฐาน', 'NCMN'), ('หมายถึง', 'VSTA'), ('<space>', 'PUNC'), ('แฟ้มข้อมูล', 'NCMN'), ('ที่', 'PREL'), ('เก็บ', 'VACT'), ('รวบรวม', 'VACT'), ('ข้อมูล', 'NCMN')]


Next, we load the pretrained embedding weights using pickle. The pretrained weights are stored as a dictionary that maps each word to its embedding.

In [3]:
import pickle
# the context manager closes the file even if loading fails
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)

The code below generates an indexed dataset (each word is represented by a number) for the training and test data. Index 0 is reserved for padding to handle variable-length sequences. (You can read more about padding here: [https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/])
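
As a rough illustration of why index 0 is reserved, the following pure-Python sketch post-pads variable-length index sequences to a common length. The helper name `post_pad` is hypothetical and is not part of Keras; the notebook itself relies on `keras.preprocessing.sequence.pad_sequences` for this step.

```python
def post_pad(sequences, maxlen=None, value=0):
    """Pad each sequence on the right with `value` up to `maxlen`.

    Sequences longer than `maxlen` lose items from the front, which
    mirrors the padding='post', truncating='pre' settings used later on.
    """
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# two sentences of different length, already converted to word indices
batch = post_pad([[29, 327, 5, 328], [7, 2]])
```

Because no real word is ever assigned index 0, a model can treat every 0 in `batch` as padding and mask it out.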

## 2. Prepare Data

In [4]:
word_to_idx ={}
idx_to_word ={}
label_to_idx = {}
for sentence in train_data:
    for word,pos in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)+1
            idx_to_word[word_to_idx[word]] = word
        if pos not in label_to_idx:
            label_to_idx[pos] = len(label_to_idx)+1
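# note: word indices start at 1, so len(word_to_idx) equals the index of the
# last word added; 'UNK' therefore shares that word's index
word_to_idx['UNK'] = len(word_to_idx)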

n_classes = len(label_to_idx.keys())+1
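
One way to read the indexing scheme above: index 0 is left free for padding, and every new word or tag receives the next unused integer. The sketch below is a standalone toy version; the function `build_index` and the sample sentences are made up for illustration. Unlike the cell above, it gives `'UNK'` an index of its own so that it cannot coincide with a real word.

```python
def build_index(sentences):
    """Map words and tags to integer ids, keeping 0 free for padding."""
    word_to_idx, label_to_idx = {}, {}
    for sentence in sentences:
        for word, pos in sentence:
            # setdefault only stores the new id the first time a key is seen
            word_to_idx.setdefault(word, len(word_to_idx) + 1)
            label_to_idx.setdefault(pos, len(label_to_idx) + 1)
    # give the unknown-word token its own id after all real words
    word_to_idx.setdefault('UNK', len(word_to_idx) + 1)
    return word_to_idx, label_to_idx


toy = [
    [('การ', 'FIXN'), ('ประชุม', 'VACT')],
    [('การ', 'FIXN'), ('ออกแบบ', 'VACT')],
]
words, labels = build_index(toy)
```

With this variant, an embedding table would need `len(words) + 1` rows so that the padding index 0 is covered as well.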

This section is tweaked slightly from the demo: word2features returns a word index instead of features, sent2features returns the sequence of word indices for a sentence, and sent2labels returns the corresponding sequence of label indices.

In [5]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [6]:
sent2features(train_data[100], embeddings)

array([ 29, 327,   5, 328])

Next, we create the training and test datasets, then use Keras to post-pad each sequence with 0 up to the maximum sequence length. The training labels are converted to one-hot vectors.

In [7]:
%%time
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data])
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 301 ms, sys: 0 ns, total: 301 ms
Wall time: 300 ms


In [8]:
x_train=keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
x_test=keras.preprocessing.sequence.pad_sequences(x_test, maxlen=102, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)
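
The `np.eye(n_classes)[...]` indexing above turns each padded label index into a one-hot row. A minimal pure-Python sketch of the same idea follows; the helper `one_hot` and the toy label sequence are hypothetical. Note that the padding label 0 becomes a one-hot vector for class 0, which is why `n_classes` is one larger than the number of real tags.

```python
def one_hot(indices, n_classes):
    """Return one row per index with a single 1.0 at that position."""
    rows = []
    for idx in indices:
        row = [0.0] * n_classes
        row[idx] = 1.0
        rows.append(row)
    return rows


# a padded label sequence: tags 3 and 1, followed by two padding zeros
encoded = one_hot([3, 1, 0, 0], n_classes=4)
```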

In [9]:
print(x_train[100],x_train.shape)
print(y_train[100][3],y_train.shape)

[ 29 327   5 328   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] (18500, 102)
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (18500, 102, 48)


## 3. Evaluate

The output from Keras is a probability distribution over all possible labels at each position. outputToLabel returns the index of the most probable label at each position of the output sequence, up to the true sequence length.

evaluation_report is the same as in the demo

In [10]:
def outputToLabel(yt,seq_len):
    out = []
    for i in range(0,len(yt)):
        if(i==seq_len):
            break
        out.append(np.argmax(yt[i]))
    return out
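
To make the decoding step concrete, here is a small pure-Python sketch of the same idea: take the arg-max at each position and stop at the true sentence length so that padded positions are ignored. The name `output_to_label` and the toy probabilities are illustrative only.

```python
def output_to_label(probabilities, seq_len):
    """Pick the most probable label index for the first `seq_len` steps."""
    labels = []
    for step in probabilities[:seq_len]:
        # index of the largest probability at this time step
        labels.append(max(range(len(step)), key=step.__getitem__))
    return labels


# three time steps over three labels; the last step is padding
toy_output = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]
decoded = output_to_label(toy_output, seq_len=2)
```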

In [11]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    for sent in y_pred:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100 if (tag_info[tag]['y_true'] > 0) else 0
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
    display(df)
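
The per-tag numbers in the report follow the usual definitions: precision is correct / predicted, recall is correct / actual, and the F-score is their harmonic mean. The sketch below computes them for a single tag on toy data. The function `tag_scores` is hypothetical, and it returns 0.0 where `evaluation_report` would display `'-'`.

```python
def tag_scores(y_true, y_pred, tag):
    """Precision, recall and F1 (in percent) for a single tag."""
    pairs = [(t, p) for st, sp in zip(y_true, y_pred) for t, p in zip(st, sp)]
    correct = sum(1 for t, p in pairs if t == p == tag)
    predicted = sum(1 for _, p in pairs if p == tag)
    actual = sum(1 for t, _ in pairs if t == tag)
    precision = 100.0 * correct / predicted if predicted else 0.0
    recall = 100.0 * correct / actual if actual else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# two toy sentences with gold and predicted tag indices
gold = [[1, 2, 2], [1, 1]]
pred = [[1, 2, 1], [2, 1]]
precision, recall, f_score = tag_scores(gold, pred, tag=1)
```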

## 4. Train a model

In [12]:
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Dense,GRU,Reshape,TimeDistributed,Bidirectional,Dropout,Masking,Flatten,Conv1D,InputLayer
from keras_contrib.layers import CRF
from keras.optimizers import Adam
from keras.initializers import Constant

The models in this section are separated into two groups:

- Neural POS Tagger (4.1)
- Neural CRF POS Tagger (4.2)

## 4.1.1 Neural POS Tagger  (Example)

We create a simple neural POS tagger as an example for you. This model doesn't use any pretrained word embedding, so it uses an Embedding layer to train the word embeddings from scratch.

In [42]:
model = Sequential()
model.add(Embedding(len(word_to_idx),32,input_length=102,mask_zero=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 32)           480608    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           12480     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
Total params: 496,208
Trainable params: 496,208
Non-trainable params: 0
_________________________________________________________________
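
The parameter counts in the summary can be checked by hand. The sketch below assumes a GRU formulation with three gates and one bias vector per gate, which is the one that reproduces the printed 12,480; the vocabulary size of 15,019 is inferred from 480,608 / 32. The helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # one dense vector of length `dim` per vocabulary index
    return vocab_size * dim


def gru_params(input_dim, units):
    # three gates, each with input weights, recurrent weights and one bias
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


emb = embedding_params(15019, 32)
bigru = 2 * gru_params(32, 32)   # forward and backward GRU
dense = dense_params(64, 48)     # 64 = concatenated BiGRU output
total = emb + bigru + dense
```

The same formulas with a 64-dimensional input give 2 * `gru_params(64, 32)` = 18,624, the bidirectional count reported for the models that use 64-dimensional embeddings.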


In [14]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 33min 40s, sys: 4min 58s, total: 38min 39s
Wall time: 14min 19s


<keras.callbacks.History at 0x7f984f2fc7f0>

In [15]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.8092 | 99.3758 | 99.5921 | 3662.0 |
| 2 | 94.8284 | 94.4835 | 94.6557 | 7793.0 |
| 3 | 91.0045 | 96.5007 | 93.6721 | 16298.0 |
| 4 | 99.9766 | 99.3654 | 99.6701 | 12840.0 |
| 5 | 91.6667 | 98.5075 | 94.964 | 66.0 |
| 6 | 99.7817 | 87.5479 | 93.2653 | 457.0 |
| 7 | 97.6374 | 97.4026 | 97.5199 | 2025.0 |
| 8 | 67.6647 | 54.4578 | 60.3471 | 226.0 |
| 9 | 57.6441 | 62.5 | 59.9739 | 230.0 |
| 10 | 62.7866 | 42.4315 | 50.6401 | 356.0 |


CPU times: user 53.5 s, sys: 9.87 s, total: 1min 3s
Wall time: 16.9 s


## 4.1.2 Neural POS Tagger - Fix Weight

### #TODO 1
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. The word embedding should stay fixed throughout training. To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

(You may want to read about Keras's Masking layer and Trainable parameter)

In [15]:
# Write your code here

In [13]:
from embeddings import emb_reader
embeddings = emb_reader.get_embeddings()
vector_size = embeddings['การ'].shape[0]

In [16]:
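# use len(word_to_idx) rows so the matrix matches the Embedding layer's input size
# and the largest word index is a valid row
embedding_matrix=np.zeros((len(word_to_idx),vector_size))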
for i in idx_to_word:
    if idx_to_word[i] in embeddings:
        embedding_matrix[i] = embeddings[idx_to_word[i]]

64

In [24]:
model = Sequential()
model.add(Embedding(len(word_to_idx),64,input_length=102,embeddings_initializer=Constant(embedding_matrix),mask_zero=True,trainable=False))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 102, 64)           961216    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 102, 64)           18624     
_________________________________________________________________
dropout_3 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 102, 48)           3120      
Total params: 982,960
Trainable params: 21,744
Non-trainable params: 961,216
_________________________________________________________________


In [25]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 36min 49s, sys: 5min 40s, total: 42min 29s
Wall time: 14min 23s


<keras.callbacks.History at 0x7f89591312b0>

In [26]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.7825 | 99.5929 | 99.6876 | 3670.0 |
| 2 | 78.8364 | 78.8555 | 78.8459 | 6504.0 |
| 3 | 75.0896 | 76.9436 | 76.0053 | 12995.0 |
| 4 | 70.7736 | 87.9276 | 78.4235 | 11362.0 |
| 5 | 95.082 | 86.5672 | 90.625 | 58.0 |
| 6 | 79.2079 | 61.3027 | 69.1145 | 320.0 |
| 7 | 98.7437 | 94.5166 | 96.5839 | 1965.0 |
| 8 | 80 | 3.85542 | 7.35632 | 16.0 |
| 9 | 60.3175 | 41.3043 | 49.0323 | 152.0 |
| 10 | 56.6986 | 28.2479 | 37.7088 | 237.0 |


CPU times: user 57.9 s, sys: 10.6 s, total: 1min 8s
Wall time: 20.9 s


In [159]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in emb :
        return emb[word]
    else :
        return np.zeros(vector_size)

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

In [160]:
%%time
x_train = np.array([sent2features(sent, embeddings) for sent in train_data])
y_train = np.array([sent2labels(sent) for sent in train_data])
x_test = np.array([sent2features(sent, embeddings) for sent in test_data])
y_test = np.array([sent2labels(sent) for sent in test_data])

CPU times: user 476 ms, sys: 34.5 ms, total: 511 ms
Wall time: 508 ms


In [161]:
max_len = max(len(i) for i in x_train)
x_train = np.array([np.concatenate((i,[np.zeros(vector_size)]*(max_len - len(i))),axis = 0) if len(i)<max_len else i for i in x_train])
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=max_len, dtype='int32', padding='post', truncating='pre', value=0.)
x_test = np.array([np.concatenate((i,[np.zeros(vector_size)]*(max_len - len(i))),axis = 0) if len(i)<max_len else i for i in x_test])
print(max_len)

102


In [162]:
#x_train=keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
#x_test=keras.preprocessing.sequence.pad_sequences(x_test, maxlen=103, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)

In [163]:
print(x_train[100][3],x_train.shape)
print(y_train[100][3],y_train.shape)

[ 0.041246   -0.08056918 -0.21603911  0.03531642  0.01896307  0.10224628
 -0.14995357  0.12535487  0.02024684  0.22443148 -0.29895535 -0.19694647
 -0.1635168   0.08557106 -0.17634703  0.01820213 -0.00468827 -0.07651532
 -0.05876088  0.15585257 -0.02346553 -0.11359906 -0.00310849  0.03356488
  0.14015509 -0.09045982  0.01143226  0.00039972  0.07332941  0.08260775
 -0.11846358  0.02441154  0.00845897  0.27604362 -0.04589748  0.00915465
  0.07176109  0.21123503  0.00435497  0.13480981  0.04913695  0.05938303
 -0.08741292  0.22676456 -0.03131349 -0.05550113 -0.088519    0.0824531
  0.04506927 -0.00963591 -0.1833221   0.0277024   0.03430984 -0.02346132
 -0.08337204  0.05305323 -0.03119821  0.03707563 -0.10878314  0.01872645
  0.13266806  0.00048555  0.16408308 -0.21319303] (18500, 102, 64)
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (18500, 102, 48)


In [164]:
print(train_data[100])
print(embeddings['ประธานกรรมการ'])

[('รัฐมนตรีว่าการ', 'NCMN'), ('กระทรวงวิทยาศาสตร์เทคโนโลยีและการพลังงาน', 'NPRP'), ('<space>', 'PUNC'), ('ประธานกรรมการ', 'NCMN')]
[ 0.041246   -0.08056918 -0.2160391   0.03531642  0.01896307  0.10224628
 -0.14995357  0.12535487  0.02024684  0.22443148 -0.29895535 -0.19694647
 -0.1635168   0.08557106 -0.17634703  0.01820213 -0.00468827 -0.07651532
 -0.05876088  0.15585257 -0.02346553 -0.11359906 -0.00310849  0.03356488
  0.14015509 -0.09045982  0.01143226  0.00039972  0.07332941  0.08260775
 -0.11846358  0.02441154  0.00845897  0.27604362 -0.04589748  0.00915465
  0.07176109  0.21123503  0.00435497  0.1348098   0.04913695  0.05938303
 -0.08741292  0.22676456 -0.03131349 -0.05550113 -0.088519    0.0824531
  0.04506927 -0.00963591 -0.1833221   0.0277024   0.03430984 -0.02346132
 -0.08337204  0.05305323 -0.03119821  0.03707563 -0.10878314  0.01872645
  0.13266806  0.00048555  0.16408308 -0.21319303]


In [165]:
model = Sequential()
#model.add(Dense(64, input_shape=x_train.shape, activation='relu'))
model.add(InputLayer(input_shape=(x_train.shape[1],x_train.shape[2])))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
#model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_26 (Bidirectio (None, 102, 64)           18624     
_________________________________________________________________
dropout_18 (Dropout)         (None, 102, 64)           0         
_________________________________________________________________
time_distributed_18 (TimeDis (None, 102, 48)           3120      
Total params: 21,744
Trainable params: 21,744
Non-trainable params: 0
_________________________________________________________________


In [166]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 26min 6s, sys: 3min 47s, total: 29min 53s
Wall time: 12min 30s


<keras.callbacks.History at 0x7fa80c52ebe0>

In [167]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 0 | 0 | 0.0 | - | 0.0 |
| 1 | 99.6741 | 99.5929 | 99.6335 | 3670.0 |
| 2 | 79.6765 | 74.0543 | 76.7626 | 6108.0 |
| 3 | 72.8166 | 72.8166 | 72.8166 | 12298.0 |
| 4 | 65.8297 | 75.0658 | 70.145 | 9700.0 |
| 5 | 100 | 86.5672 | 92.8 | 58.0 |
| 6 | 88.141 | 52.682 | 65.9472 | 275.0 |
| 7 | 98.4947 | 94.4204 | 96.4145 | 1963.0 |
| 8 | - | 0.0 | - | 0.0 |
| 9 | 40.625 | 17.663 | 24.6212 | 65.0 |


CPU times: user 37.9 s, sys: 7.23 s, total: 45.2 s
Wall time: 14.2 s


## 4.1.3 Neural POS Tagger - Trainable pretrained weight

### #TODO 2
We would like you to create a neural POS tagger model with Keras that takes the pretrained word embedding as input. This time, however, the word embedding is trainable (not fixed). To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Please note that the given pretrained word embedding only has weights for the vocabulary of the BEST corpus.
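
Since the pretrained vectors cover only part of the vocabulary, it can help to measure how many training words actually have a vector before interpreting the results. The following is a minimal sketch on toy data; `embedding_coverage`, `toy_vocab` and `toy_embeddings` are hypothetical stand-ins for the notebook's `word_to_idx` and `embeddings`.

```python
def embedding_coverage(vocabulary, embeddings):
    """Split a vocabulary into words with and without a pretrained vector."""
    covered = [word for word in vocabulary if word in embeddings]
    missing = [word for word in vocabulary if word not in embeddings]
    ratio = len(covered) / len(vocabulary) if vocabulary else 0.0
    return ratio, missing


# toy stand-ins for word_to_idx and the pretrained dictionary
toy_vocab = ['การ', 'ประชุม', 'คลังข้อมูล', 'UNK']
toy_embeddings = {'การ': [0.1, 0.2], 'ประชุม': [0.3, 0.4]}
ratio, missing = embedding_coverage(toy_vocab, toy_embeddings)
```

Words listed in `missing` would start from an all-zero row, so a frozen embedding layer gives the model no way to tell them apart.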

Optionally, you can use your own pretrained word embedding.

In [113]:
# Write your code here

In [35]:
def word2features(sent, i, emb):
    word = sent[i][0]
    if word in word_to_idx :
        return word_to_idx[word]
    else :
        return word_to_idx['UNK']

def sent2features(sent, emb_dict):
    return np.asarray([word2features(sent, i, emb_dict) for i in range(len(sent))])

def sent2labels(sent):
    return numpy.asarray([label_to_idx[label] for (word, label) in sent],dtype='int32')

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [36]:
%%time
x_train = np.asarray([sent2features(sent, embeddings) for sent in train_data])
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

CPU times: user 276 ms, sys: 2.33 ms, total: 278 ms
Wall time: 276 ms


In [37]:
x_train=keras.preprocessing.sequence.pad_sequences(x_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
y_train=keras.preprocessing.sequence.pad_sequences(y_train, maxlen=None, dtype='int32', padding='post', truncating='pre', value=0.)
x_test=keras.preprocessing.sequence.pad_sequences(x_test, maxlen=102, dtype='int32', padding='post', truncating='pre', value=0.)
y_temp =[]
for i in range(len(y_train)):
    y_temp.append(np.eye(n_classes)[y_train[i]][np.newaxis,:])
y_train = np.asarray(y_temp).reshape(-1,102,n_classes)
del(y_temp)

In [38]:
print(x_train[100],x_train.shape)
print(y_train[100][3],y_train.shape)

[ 29 327   5 328   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0] (18500, 102)
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (18500, 102, 48)


In [19]:
model = Sequential()
model.add(Embedding(len(word_to_idx),64,input_length=102,embeddings_initializer=Constant(embedding_matrix),mask_zero=True,trainable=True))
model.add(Bidirectional(GRU(32, return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
adam  = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 102, 64)           961216    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 102, 64)           18624     
_________________________________________________________________
dropout_1 (Dropout)          (None, 102, 64)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 102, 48)           3120      
Total params: 982,960
Trainable params: 982,960
Non-trainable params: 0
_________________________________________________________________


In [20]:
%%time
model.fit(x_train,y_train,batch_size=64,epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 34min 34s, sys: 5min 2s, total: 39min 37s
Wall time: 14min 49s


<keras.callbacks.History at 0x7f8949bee4e0>

In [21]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.8098 | 99.6744 | 99.742 | 3673.0 |
| 2 | 95.5415 | 93.2711 | 94.3926 | 7693.0 |
| 3 | 91.3902 | 95.4053 | 93.3546 | 16113.0 |
| 4 | 99.9611 | 99.5202 | 99.7402 | 12860.0 |
| 5 | 88 | 98.5075 | 92.9577 | 66.0 |
| 6 | 98.3158 | 89.4636 | 93.681 | 467.0 |
| 7 | 97.7724 | 97.114 | 97.4421 | 2019.0 |
| 8 | 74.2947 | 57.1084 | 64.5777 | 237.0 |
| 9 | 74.4409 | 63.3152 | 68.4288 | 233.0 |
| 10 | 60.7383 | 43.1466 | 50.453 | 362.0 |


CPU times: user 52.1 s, sys: 9.71 s, total: 1min 1s
Wall time: 16.6 s


### #TODO 3
Compare the results of all neural tagger models in 4.1.x and provide a convincing reason and example for the results (which model performs better, and why?).

(If you use your own weight please state so in the answer)

<b>Write your answer here :</b>
I use get_embeddings from embeddings/emb_reader to get word embedding vectors of size 64.
Judging by F1, precision, and recall, the model with trainable embeddings gives the best results.
I also found that many words cannot be mapped to a word embedding vector, so they fall back to the default [0.]*64 vector.

## 4.2.1 CRF Viterbi

Your next two tasks are to incorporate conditional random fields (CRF) into your model. <b>You do not need to use pretrained weights</b>.

Keras already has a CRF layer implemented for you. However, you need to use the official extension repository for the Keras library, called keras-contrib. You should read about the keras-contrib CRF layer before attempting this section.

### #TODO 4
Use the keras-contrib CRF layer in your model. You should set the layer parameters so that it gives the best performance at test time using the <b>Viterbi algorithm</b>. Your model must use the CRF loss function and metric. A CRF is quite complex compared to the previous example model, so you should train it for more epochs so that it can converge.

To finish this exercise, you must train the model and show its evaluation report, as shown in the example.

Do not forget to save the weights of this model.
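
For intuition about what Viterbi decoding does, here is a generic pure-Python sketch over additive (log-space) scores. It is not the keras-contrib implementation; the function name, score layout and toy numbers are assumptions made for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_tags):
            # best previous tag for reaching tag j at this position
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = new_best
        backpointers.append(pointers)
    # follow the back-pointers from the best final tag
    tag = max(range(n_tags), key=best.__getitem__)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


# two tags; switching tags is penalised by -3
emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
transitions = [[0.0, -3.0], [-3.0, 0.0]]
path = viterbi_decode(emissions, transitions)
```

In this toy case a per-position arg-max over the emissions alone would pick tags 0, 1, 1, while Viterbi keeps tag 0 throughout because the switching penalty outweighs the emission gain. This is the kind of sequence-level consistency a CRF layer adds on top of per-token scores.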

In [13]:
# Write your code here

In [14]:
from keras.callbacks import ModelCheckpoint
weight_path_viterbi='./weight_model/model_weight_viterbi.h5'
callbacks_list_viterbi = [
        ModelCheckpoint(
            weight_path_viterbi,
            save_best_only=True,
            save_weights_only=True,
            monitor='loss',
            mode='min',
            verbose=1)]
#model_feedforward_nn.fit(x, y, epochs, batch_size, verbose,callbacks=callbacks_list_feedforward_nn,validation_data=(x, y))

In [15]:
n_classes,x_train.shape,len(word_to_idx),n_classes

(48, (18500, 102), 15019, 48)

In [13]:
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy
from keras_contrib.layers.advanced_activations import PELU

In [14]:
model = Sequential()
model.add(Embedding(len(word_to_idx), 64, mask_zero=True,input_length=102))
#model.add(Bidirectional(GRU(32, return_sequences=True)))
#model.add(TimeDistributed(Dense(n_classes,activation='relu')))
#model.add(Dropout(0.2))
# use learn_mode = 'join', test_mode = 'viterbi',
# sparse_target = True (label indice output)
model.add(PELU())
model.add(CRF(n_classes,test_mode = 'viterbi'))
# crf_accuracy is default to Viterbi acc if using join-mode (default).
# One can add crf.marginal_acc if interested, but may slow down learning
#model.add(Dropout(0.2))
#model.add(TimeDistributed(Dense(n_classes,activation='softmax')))
model.summary()
model.compile('adam', loss=crf_loss, metrics=[crf_viterbi_accuracy])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 102, 64)           961216    
_________________________________________________________________
pelu_1 (PELU)                (None, 102, 64)           13056     
_________________________________________________________________
crf_1 (CRF)                  (None, 102, 48)           5520      
Total params: 979,792
Trainable params: 979,792
Non-trainable params: 0
_________________________________________________________________


In [19]:
# y_train is one-hot encoded here (shape: samples x 102 x n_classes),
# because the CRF layer above keeps the default `sparse_target=False`
model.fit(x_train, y_train,epochs=30,batch_size=128,verbose=1,callbacks=callbacks_list_viterbi)

Epoch 1/30

Epoch 00001: loss improved from 47.84250 to 47.13446, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 2/30

Epoch 00002: loss improved from 47.13446 to 45.01877, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 3/30

Epoch 00003: loss improved from 45.01877 to 44.68395, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 4/30

Epoch 00004: loss improved from 44.68395 to 44.57676, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 5/30

Epoch 00005: loss improved from 44.57676 to 44.52855, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 6/30

Epoch 00006: loss improved from 44.52855 to 44.50575, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 7/30

Epoch 00007: loss improved from 44.50575 to 44.49368, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 8/30

Epoch 00008: loss improved from 44.49368 to 44.48565, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 9/30

Epoch 00009: loss im

<keras.callbacks.History at 0x7f2dd0544d30>

In [20]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.8368 | 99.5929 | 99.7147 | 3670.0 |
| 2 | 94.7778 | 93.0771 | 93.9197 | 7677.0 |
| 3 | 89.8074 | 96.3586 | 92.9677 | 16274.0 |
| 4 | 99.9534 | 99.5898 | 99.7713 | 12869.0 |
| 5 | 95.6522 | 98.5075 | 97.0588 | 66.0 |
| 6 | 100 | 88.1226 | 93.6864 | 460.0 |
| 7 | 97.3038 | 97.2102 | 97.257 | 2021.0 |
| 8 | 61.5702 | 35.9036 | 45.3577 | 149.0 |
| 9 | 66.3043 | 49.7283 | 56.8323 | 183.0 |
| 10 | 62.1429 | 41.4779 | 49.7498 | 348.0 |


CPU times: user 23.3 s, sys: 3.2 s, total: 26.5 s
Wall time: 7.44 s


In [22]:
model.fit(x_train, y_train,epochs=100,batch_size=128,verbose=1,callbacks=callbacks_list_viterbi)

Epoch 1/100

Epoch 00001: loss improved from 44.43491 to 44.43117, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 2/100

Epoch 00002: loss improved from 44.43117 to 44.42988, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 3/100

Epoch 00003: loss improved from 44.42988 to 44.42938, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 4/100

Epoch 00004: loss improved from 44.42938 to 44.42889, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 5/100

Epoch 00005: loss improved from 44.42889 to 44.42849, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 6/100

Epoch 00006: loss improved from 44.42849 to 44.42811, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 7/100

Epoch 00007: loss improved from 44.42811 to 44.42766, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 8/100

Epoch 00008: loss improved from 44.42766 to 44.42738, saving model to ./weight_model/model_weight_viterbi.h5
Epoch 9/100

Epoch 00009

KeyboardInterrupt: 

In [15]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
model.load_weights('./weight_model/model_weight_viterbi.h5')
y_pred=model.predict(x_test)
ypred = [outputToLabel(y_pred[i],len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

| tag | precision | recall | f_score | correct_count |
|---|---|---|---|---|
| 1 | 99.8367 | 99.5387 | 99.6875 | 3668.0 |
| 2 | 94.0549 | 93.0286 | 93.5389 | 7673.0 |
| 3 | 89.3577 | 96.2994 | 92.6988 | 16264.0 |
| 4 | 99.9301 | 99.5434 | 99.7364 | 12863.0 |
| 5 | 95.6522 | 98.5075 | 97.0588 | 66.0 |
| 6 | 99.3521 | 88.1226 | 93.401 | 460.0 |
| 7 | 97.4001 | 97.3064 | 97.3532 | 2023.0 |
| 8 | 66.6667 | 38.5542 | 48.855 | 160.0 |
| 9 | 60.4167 | 47.2826 | 53.0488 | 174.0 |
| 10 | 62.4549 | 41.2396 | 49.677 | 346.0 |


CPU times: user 22.4 s, sys: 3.2 s, total: 25.6 s
Wall time: 7.39 s


## 4.2.2 CRF Marginal

### #TODO 5

Use the Keras-contrib CRF layer in your model. Set the layer parameters so that it gives the best performance at test time using <b>marginal probabilities</b>. You <b>must not train a new model</b>; use the pretrained weights from #TODO 4 instead.

To finish this exercise, you must use the weights from the model trained in the previous step and show the evaluation report for marginal probability decoding (testing mode).

In [None]:
# Write your code here

In [19]:
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_marginal_accuracy

In [20]:
model = Sequential()
model.add(Embedding(len(word_to_idx), 64, mask_zero=True, input_length=102))
model.add(PELU())
# Same architecture as the Viterbi model, but decode with marginal probabilities at test time
model.add(CRF(n_classes, test_mode='marginal'))
model.summary()
model.compile('adam', loss=crf_loss, metrics=[crf_marginal_accuracy])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 102, 64)           961216    
_________________________________________________________________
pelu_3 (PELU)                (None, 102, 64)           13056     
_________________________________________________________________
crf_3 (CRF)                  (None, 102, 48)           5520      
Total params: 979,792
Trainable params: 979,792
Non-trainable params: 0
_________________________________________________________________


In [21]:
from keras.callbacks import ModelCheckpoint
# Start from the weights trained with the Viterbi model in #TODO 4
model.load_weights('./weight_model/model_weight_viterbi.h5')
weight_path_viterbi2 = './weight_model/model_weight_viterbi2.h5'
callbacks_list_viterbi2 = [
        ModelCheckpoint(
            weight_path_viterbi2,
            save_best_only=True,
            save_weights_only=True,
            monitor='loss',
            mode='min',
            verbose=1)]
#model_feedforward_nn.fit(x, y, epochs, batch_size, verbose,callbacks=callbacks_list_feedforward_nn,validation_data=(x, y))

In [22]:
# y must be label indices (with shape 1 at dim 3) here,
# since `sparse_target=True`
model.fit(x_train, y_train, epochs=30, batch_size=128, verbose=1, callbacks=callbacks_list_viterbi2)

Epoch 1/30

Epoch 00001: loss improved from inf to 44.42443, saving model to ./weight_model/model_weight_viterbi2.h5
Epoch 2/30

Epoch 00002: loss improved from 44.42443 to 44.42305, saving model to ./weight_model/model_weight_viterbi2.h5
Epoch 3/30

Epoch 00003: loss improved from 44.42305 to 44.42270, saving model to ./weight_model/model_weight_viterbi2.h5
Epoch 4/30

Epoch 00004: loss did not improve from 44.42270
Epoch 5/30

Epoch 00005: loss did not improve from 44.42270
Epoch 6/30

KeyboardInterrupt: 

In [23]:
%%time
#model.save_weights('/data/my_pos_no_crf.h5')
#model.load_weights('/data/my_pos_no_crf.h5')
model.load_weights('./weight_model/model_weight_viterbi2.h5')
y_pred = model.predict(x_test)
ypred = [outputToLabel(y_pred[i], len(y_test[i])) for i in range(len(y_pred))]
evaluation_report(y_test, ypred)

Unnamed: 0,tag,precision,recall,f_score,correct_count
0,1,99.7014,99.6744,99.6879,3673.0
1,2,85.6002,89.6581,87.5822,7395.0
2,3,87.3645,86.8317,87.0973,14665.0
3,4,99.9689,99.5512,99.7596,12864.0
4,5,86.8421,98.5075,92.3077,66.0
5,6,89.1473,88.1226,88.632,460.0
6,7,96.8584,96.3925,96.6249,2004.0
7,8,38.7295,45.5422,41.8605,189.0
8,9,6.21242,25.2717,9.97319,93.0
9,10,60.644,40.4052,48.4979,339.0


CPU times: user 34.3 s, sys: 5.68 s, total: 39.9 s
Wall time: 11 s
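
To make the two reports easier to compare, the short sketch below computes the per-tag difference in `f_score` and finds the tag where the gap is largest. The numbers are copied by hand from the first ten rows of the Viterbi and marginal reports shown above; the remaining tags are not displayed and are therefore not included. The variable names (`viterbi_f`, `marginal_f`, `gap`, `worst_tag`) are only illustrative. Note that the marginal run loaded `model_weight_viterbi2.h5`, which was saved after a few extra epochs, so the gap may not be due to the decoding mode alone.

```python
# f_score per tag, copied from the first ten rows of the two reports above
viterbi_f = {1: 99.6875, 2: 93.5389, 3: 92.6988, 4: 99.7364, 5: 97.0588,
             6: 93.401, 7: 97.3532, 8: 48.855, 9: 53.0488, 10: 49.677}
marginal_f = {1: 99.6879, 2: 87.5822, 3: 87.0973, 4: 99.7596, 5: 92.3077,
              6: 88.632, 7: 96.6249, 8: 41.8605, 9: 9.97319, 10: 48.4979}

# Positive gap: Viterbi decoding scored higher than marginal decoding on that tag
gap = {tag: viterbi_f[tag] - marginal_f[tag] for tag in viterbi_f}

# Tag with the largest drop when switching from Viterbi to marginal decoding
worst_tag = max(gap, key=gap.get)

for tag in sorted(gap, key=gap.get, reverse=True):
    print(f"tag {tag:2d}: viterbi={viterbi_f[tag]:8.4f} "
          f"marginal={marginal_f[tag]:8.4f} gap={gap[tag]:+8.4f}")
print("largest gap at tag", worst_tag)
```

Among the listed rows, tag 9 shows the largest drop, which makes it a natural candidate for the example requested in #TODO 6.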


### #TODO 6

Please pick the best example that shows the difference between a CRF that uses Viterbi decoding and a CRF that uses marginal probabilities. Compare the results and provide a convincing reason. (Which model performs better, and why? Which model should be faster? Is that true in this case, and why?)

<b>Write your answer here :</b>

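For prediction (test phase), one can choose either the Viterbi best path (class indices) or marginal probabilities if probabilities are needed. If *join mode* is used for training, Viterbi output is typically better than marginal output, although the marginal output still performs reasonably close. If *marginal mode* is used for training, marginal output usually performs much better.
For accuracy, Viterbi is better than marginal here because it returns the class indices of the single best path. In the reports above, most of the listed tags have a higher f_score with Viterbi; the clearest example is tag 9 (53.05 with Viterbi vs. 9.97 with marginal).
Marginal might be expected to be faster because it outputs probabilities directly. In this case, however, the training time per epoch is about 33 s for both Viterbi and marginal at batch size 128, so there is no difference, and the prediction cell above actually took longer with marginal (wall time 11 s vs. 7.39 s).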
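
The difference between the two decoding modes can be illustrated on a toy linear-chain CRF. The sketch below is a minimal NumPy version written for illustration only, not the Keras-contrib implementation: the function names `viterbi_decode` and `marginal_decode`, the random `unary` and `trans` scores, and the absence of boundary terms and masking are all assumptions. Viterbi returns the single highest-scoring tag sequence, while marginal decoding runs forward-backward to get per-position tag probabilities and then takes the argmax at each position independently.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def viterbi_decode(unary, trans):
    """Return the single best tag sequence (list of class indices)."""
    T, K = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag i at t-1, then moving to tag j
        cand = score[:, None] + trans
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + unary[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

def marginal_decode(unary, trans):
    """Return per-position tag probabilities via forward-backward."""
    T, K = unary.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(trans + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_marg = alpha + beta
    log_marg -= logsumexp(log_marg, axis=1)[:, None]  # normalise each position
    return np.exp(log_marg)

# Toy scores (hypothetical): 4 positions, 3 tags
rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))   # per-position tag scores
trans = rng.normal(size=(3, 3))   # trans[i, j]: score of tag i followed by tag j

best_path = viterbi_decode(unary, trans)
marginals = marginal_decode(unary, trans)
marginal_path = marginals.argmax(axis=1).tolist()

print("Viterbi path :", best_path)
print("Marginal path:", marginal_path)
```

Both routines do one pass over the sequence with a K x K operation per step, so their costs are of the same order; marginal decoding needs a forward and a backward pass, whereas Viterbi needs one forward pass plus a cheap backtrace.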