# Bidirectional RNN
* Use several kinds of pretrained embeddings from TF-Hub
  * nnlm-en-dim128
  * gnews-swivel-20dim-with-oov
  * Wiki-words-500
  * Wiki-words-250
* Tokenize the sentences
* Embed **each** word into a vector
* Use a bidirectional RNN to extract a summary of each sentence
* Classify using a fully connected layer
* We need to pad the sentences to the same length so the model can process them as a batch

In [0]:
# use TF 2.x (keep the comment off the magic line, otherwise it is parsed as part of the version)
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd

print(tf.__version__) # confirm version


TensorFlow is already loaded. Please restart the runtime to change versions.
2.0.0


## Load data
* Use the tokenized data
* Stemming or not?  **-> Result: no stemming yields better results**

In [0]:
STEMMING = False #@param {type:"boolean"}

In [0]:
if STEMMING:
  DATA = pd.read_csv('train_tokenize.csv')
else:
  DATA = pd.read_csv('train_tokenize_nostem.csv')

# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
from ast import literal_eval
print(type(DATA.loc[0,'TOKEN']))

# convert str back to the correct list type; the lists were serialized as strings when the file was stored as .csv
DATA['TOKEN'] = DATA['TOKEN'].apply(literal_eval)
print(type(DATA.loc[0,'TOKEN']))
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

<class 'str'>
<class 'list'>


If sentences with LENGTH==1 are also wanted for training, we should check for empty sentences and replace each of them with an **\<UNK\>** token. Otherwise the whole sentence would be masked out (since the empty string is used as the mask value), which causes an error.
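A minimal sketch of that replacement on plain Python lists. The helper name `replace_empty` and the literal `'<UNK>'` string are assumptions for illustration; whether the chosen embedding maps `'<UNK>'` to anything useful is not checked here.

```python
def replace_empty(token_lists, unk="<UNK>"):
    """Return a new outer list where every empty token list becomes [unk]."""
    return [tokens if len(tokens) > 0 else [unk] for tokens in token_lists]

sentences = [["brain", "data"], [], ["treatments"]]
cleaned = replace_empty(sentences)
print(cleaned)
```

On the notebook's frame the same idea could be applied with `DATA['TOKEN'].apply(lambda t: t if t else ['<UNK>'])`, before the padding step.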

In [0]:
# TRAIN = DATA.loc[DATA['LENGTH']>1,'TOKEN']
# LABEL = DATA.loc[DATA['LENGTH']>1,'BACKGROUND':'OTHERS'] 

TRAIN = DATA.loc[:,'TOKEN']
LABEL = DATA.loc[:,'BACKGROUND':'OTHERS'] 

print(TRAIN.shape)
print(LABEL.shape)

(46867,)
(46867, 6)


In [0]:
#padding to MAX_LENGTH
MAX_LENGTH = 256

for row in TRAIN:
  if len(row) < MAX_LENGTH:
    row.extend(['' for _ in range(MAX_LENGTH-len(row))])
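
The loop above only pads: a row longer than `MAX_LENGTH` keeps its original length, which would not fit the fixed-shape reshape used later (whether such rows exist in this data is not checked here). A small sketch that both pads and truncates; `pad_or_truncate` is a hypothetical helper, not part of the notebook.

```python
def pad_or_truncate(tokens, max_length, pad_token=""):
    """Return a new list of exactly max_length tokens."""
    clipped = tokens[:max_length]
    return clipped + [pad_token] * (max_length - len(clipped))

short = pad_or_truncate(["a", "b"], 4)
long = pad_or_truncate(["a", "b", "c", "d", "e"], 4)
print(short, long)
```

Unlike the in-place `extend`, this returns new lists, so it would be used as `TRAIN = TRAIN.apply(lambda row: pad_or_truncate(row, MAX_LENGTH))`.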

In [0]:
# check 
print(len(TRAIN[0]))
print(len(TRAIN[1000]))
print(TRAIN[0])

256
256
['rapid', 'popularity', 'of', 'internet', 'of', 'things', 'and', 'cloud', 'computing', 'permits', 'neuroscientists', 'to', 'collect', 'multilevel', 'and', 'multichannel', 'brain', 'data', 'to', 'better', 'understand', 'brain', 'functions', 'diagnose', 'diseases', 'and', 'devise', 'treatments', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '

In [0]:
#split to train and val
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(TRAIN, LABEL,  test_size=0.25)
print('X_train.shape: ', X_train.shape)
print('Y_train.shape: ', Y_train.shape)
print('X_val.shape: ', X_val.shape)
print('Y_val.shape: ', Y_val.shape)

X_train.shape:  (35150,)
Y_train.shape:  (35150, 6)
X_val.shape:  (11717,)
Y_val.shape:  (11717, 6)


### Create dataset


In [0]:
BATCH_SIZE = 512 #@param {type:"slider", min:64, max:1024, step:64}

In [0]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train.values))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, Y_val.values))
# inspect the dataset
for feat, targ in train_dataset.take(1):
  print('Features: {}, Target: {}'.format(feat, targ))

print('--------------------------------------------------')
# set batch and prefetch (note: no shuffle is applied to these datasets)
train_dataset = train_dataset.batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
val_dataset = val_dataset.batch(BATCH_SIZE)
val_dataset = val_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
# now take(1) returns one whole batch (only the first 5 rows are shown below)
for feat, targ in train_dataset.take(1):
  print('Features: {}, \nTarget: {}'.format(feat[0:5], targ[0:5]))

Features: [b'we' b'introduce' b'a' b'multilayer' b'ground' b'model' b'for' b'the'
 b'recently-proposed' b'mom-so' b'method' b'suitable' b'to' b'accurately'
 b'predict' b'ground' b'return' b'effects' b'in' b'such' b'scenarios' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b'' b''
 b'' b'' b'' b'' b'' b'' b'' b'' b'' b''

## Embed using pretrained Wiki-words-250/nnlm-128
* [How-To-Embed-in-TensorFlow](https://github.com/FrancescoSaverioZuppichini/How-To-Embed-in-TensorFlow)
* In **Wiki-words-250/500**, an unseen token is output as an all-zero vector
* **Wiki-words-250/500**: no big performance difference between the two
* **Wiki-words-250** can't embed punctuation
* **NNLM** seems to be a network-like, weight-based model that can handle any input, including punctuation (but I don't know whether the resulting embedding is reasonable)
* Both of them are sensitive to the tense and the singular/plural form of a word, so *stemming* might not be a good choice
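
One way to check the all-zero behaviour for unseen tokens is to flag embedding rows that contain only zeros. The sketch below runs on a stand-in array rather than the real module output; `all_zero_rows` and the `fake` array are assumptions for illustration.

```python
import numpy as np

def all_zero_rows(embeddings):
    """Boolean array: True where an embedding row is entirely zero."""
    embeddings = np.asarray(embeddings)
    return ~np.any(embeddings != 0, axis=-1)

# Stand-in for the output of embed(tokens): 3 tokens, 4 dimensions.
fake = np.array([[0.1, -0.2, 0.0, 0.3],
                 [0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.5, 0.0, 0.0]])
flags = all_zero_rows(fake)
print(flags)
```

With the loaded module, something like `all_zero_rows(embed(tf.constant(['the', ','])).numpy())` should show which tokens the embedding does not know; that call is not run here.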

In [0]:
EMBED_SIZE = 250 
if EMBED_SIZE == 128:
  Model_URL = "https://tfhub.dev/google/nnlm-en-dim128/2"
elif EMBED_SIZE == 20:  
  Model_URL = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1"
elif EMBED_SIZE == 500: 
  Model_URL = "https://tfhub.dev/google/Wiki-words-500/2"
else:
  Model_URL = "https://tfhub.dev/google/Wiki-words-250/2"

In [0]:
embed = hub.load(Model_URL)

### Define a layer that can be used later in Keras
We need to reshape the input tensor, since the pretrained embedding layer only accepts 1-D input. It seems that this layer was originally designed to embed whole sentences.
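
The flatten, embed, reshape round trip can be sketched with NumPy and a toy embedder. `embed_tokens_2d` and `toy_embed` are hypothetical stand-ins; the real layer does the same steps with `tf.reshape` and the TF-Hub module.

```python
import numpy as np

def embed_tokens_2d(tokens, embed_1d, embed_size):
    """Embed a (batch, length) array of tokens with a 1-D-only embedder."""
    tokens = np.asarray(tokens)
    batch, length = tokens.shape
    flat = tokens.reshape(-1)          # (batch * length,)
    vectors = embed_1d(flat)           # (batch * length, embed_size)
    return vectors.reshape(batch, length, embed_size)

def toy_embed(flat_tokens, embed_size=3):
    # Hypothetical embedder: '' -> zeros, otherwise len(token) in every dimension.
    return np.array([[float(len(t))] * embed_size for t in flat_tokens])

batch = [["we", "introduce", ""], ["a", "", ""]]
out = embed_tokens_2d(batch, toy_embed, 3)
print(out.shape)
```

Because `reshape` keeps row-major order, position `[i, j]` of the result is the vector of token `j` in sentence `i`.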

In [0]:
def WikiWordsEmbedding(x):
  x = tf.reshape(tf.cast(x, tf.string), [-1])
  result = embed(x)
  return tf.reshape(result, [-1, MAX_LENGTH, EMBED_SIZE]) # reshape back to the Tensor we want

def compute_mask(x,y):  # Keras calls this as mask(inputs, previous_mask); y is None here because no earlier layer produces a mask
  return tf.math.not_equal(x,'')

embed_layer = tf.keras.layers.Lambda(WikiWordsEmbedding, output_shape=(None,MAX_LENGTH,EMBED_SIZE), mask=compute_mask)

In [0]:
# test: check that the shapes and values are the same after reshaping
for feat, targ in train_dataset.take(1):
  print(feat[0].shape)
  print(embed_layer(feat).shape)
  print(embed_layer(feat[0]))
  print(embed(feat[0]))

(256,)
(512, 256, 250)
tf.Tensor(
[[[-0.10056633 -0.01571468  0.04843812 ... -0.00983353 -0.014313
   -0.06990035]
  [-0.00509347  0.00251882  0.08547001 ... -0.01865194  0.08163237
   -0.01506454]
  [-0.04388279 -0.14637887 -0.02515217 ... -0.01614045 -0.00135054
   -0.04300075]
  ...
  [ 0.          0.          0.         ...  0.          0.
    0.        ]
  [ 0.          0.          0.         ...  0.          0.
    0.        ]
  [ 0.          0.          0.         ...  0.          0.
    0.        ]]], shape=(1, 256, 250), dtype=float32)
tf.Tensor(
[[-0.10056633 -0.01571468  0.04843812 ... -0.00983353 -0.014313
  -0.06990035]
 [-0.00509347  0.00251882  0.08547001 ... -0.01865194  0.08163237
  -0.01506454]
 [-0.04388279 -0.14637887 -0.02515217 ... -0.01614045 -0.00135054
  -0.04300075]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0. 

## Build the model

In [0]:
# Function to calculate F1_score
def F1_score(y_true, y_pred):
  DTYPE = tf.float32
  THRESHOLD = 0.5

  y_pred = tf.cast(y_pred > THRESHOLD, DTYPE) 

  true_positives = tf.math.count_nonzero(tf.math.logical_and(tf.math.equal(y_pred,1.0), tf.math.equal(y_true,1.0)), axis=0)
  false_positives = tf.math.count_nonzero(tf.math.logical_and(tf.math.equal(y_pred,1.0), tf.math.equal(y_true,0.0)), axis=0)
  false_negatives = tf.math.count_nonzero(tf.math.logical_and(tf.math.equal(y_pred,0.0), tf.math.equal(y_true,1.0)), axis=0)

  TP = tf.math.reduce_sum(tf.cast(true_positives, DTYPE), axis=0)
  FP = tf.math.reduce_sum(tf.cast(false_positives, DTYPE), axis=0)
  FN = tf.math.reduce_sum(tf.cast(false_negatives, DTYPE), axis=0)

  precision = tf.math.divide_no_nan(TP, TP+FP)
  recall = tf.math.divide_no_nan(TP, TP+FN)

  F1 = tf.math.divide_no_nan(2 * (precision * recall) , (precision + recall))
  return F1
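
For reference, a NumPy sketch of the same pooled (micro-averaged) computation. As far as I can tell, Keras evaluates the metric function per batch and averages the batch values, so the number it reports can differ slightly from a micro-F1 over the full set. `micro_f1` is a hypothetical helper.

```python
import numpy as np

def micro_f1(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1: pool TP/FP/FN over all samples and labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [[1, 0, 0], [0, 1, 1]]
y_prob = [[0.9, 0.6, 0.1], [0.2, 0.8, 0.3]]
score = micro_f1(y_true, y_prob)
print(score)
```

In this toy case there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3.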

### Using variational RNN
* Same dropout mask across the time steps using [RNNCellDropoutWrapper](https://www.tensorflow.org/api_docs/python/tf/nn/RNNCellDropoutWrapper)
* [paper](https://arxiv.org/pdf/1512.05287.pdf)
* [stack overflow](https://stackoverflow.com/questions/43950515/how-to-use-dropoutwrapper-in-lstm-training-and-decoding)
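
The idea of sharing one dropout mask across time steps can be sketched in NumPy. This only illustrates input dropout on a `(batch, steps, features)` tensor; it is an assumption-laden toy, not how Keras or `RNNCellDropoutWrapper` implement it, and it ignores the recurrent connections.

```python
import numpy as np

def variational_dropout(x, rate, rng):
    """Drop the same feature positions at every time step of each sequence."""
    batch, steps, features = x.shape
    keep = 1.0 - rate
    # One mask per sequence, broadcast over the time axis; scaled to keep the expectation.
    mask = rng.binomial(1, keep, size=(batch, 1, features)) / keep
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((2, 5, 8))
dropped = variational_dropout(x, rate=0.5, rng=rng)
print(dropped[0, 0], dropped[0, 1])
```

Standard dropout would instead draw a mask of shape `(batch, steps, features)`, so each time step would lose different features.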

In [0]:
# Construct the forward/backward GRU layers with dropout
Forward = tf.keras.layers.GRU(256, dropout=0.5, name='forward')
Backward = tf.keras.layers.GRU(256, dropout=0.5, go_backwards=True, name='backward')

In [0]:
# use Dropout to reduce overfitting (no kernel_regularizer is set in this version)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LENGTH,),dtype="string", batch_size=BATCH_SIZE),   
    embed_layer,               
    tf.keras.layers.Bidirectional(Forward, backward_layer=Backward, input_shape=(None, EMBED_SIZE),
                         merge_mode='concat', dtype=tf.float32),
    # tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    # tf.keras.layers.Dense(64, activation='relu', dtype=tf.float32),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(6, activation='sigmoid', dtype=tf.float32)
])
#compile the model
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-3),
              metrics=[F1_score])

# build the model to get the weight
model.build((None,MAX_LENGTH))
model.summary()
# store the weight for later usage
weights = model.get_weights()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda_1 (Lambda)            (None, 256, 250)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (512, 512)                780288    
_________________________________________________________________
dropout_1 (Dropout)          (512, 512)                0         
_________________________________________________________________
dense_2 (Dense)              (512, 6)                  3078      
Total params: 783,366
Trainable params: 783,366
Non-trainable params: 0
_________________________________________________________________
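
The parameter counts in the summary can be reproduced by hand. The sketch assumes the TF 2 GRU default `reset_after=True`, which keeps two bias vectors per gate; `gru_params` and `dense_params` are hypothetical helpers.

```python
def gru_params(input_dim, units, reset_after=True):
    """Weights in one GRU layer: 3 gates, each with input, recurrent and bias terms."""
    bias_vectors = 2 if reset_after else 1
    return 3 * (input_dim * units + units * units + bias_vectors * units)

def dense_params(input_dim, units):
    """Kernel plus bias of a fully connected layer."""
    return input_dim * units + units

bidirectional = 2 * gru_params(input_dim=250, units=256)  # forward + backward
dense = dense_params(input_dim=2 * 256, units=6)           # concat doubles the width
print(bidirectional, dense, bidirectional + dense)  # 780288 3078 783366
```

The embedding `Lambda` layer contributes 0 parameters because the TF-Hub module is called as a plain function and is not trained.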


## Training

In [0]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_F1_score',mode='max', patience=10)
history = model.fit(train_dataset, epochs=1000, verbose=1,
                    validation_data=val_dataset, 
                    callbacks = [earlystop],
                    validation_steps=10)

Epoch 1/1000
...
Epoch 47/1000


### Evaluate

In [0]:
result = model.predict(X_val.to_list())

print(result.shape)
print(result[-5:-1])

(11717, 6)
[[0.53796464 0.42478228 0.52777076 0.5120058  0.4802838  0.6152423 ]
 [0.5076143  0.5007042  0.550709   0.5261106  0.42900875 0.5208264 ]
 [0.5185112  0.34719393 0.5887345  0.4804732  0.40590596 0.67691636]
 [0.5297364  0.40212327 0.5193733  0.51374394 0.52558    0.6285825 ]]


In [0]:
from sklearn.metrics import f1_score

greater = (result>=0.5).astype(int)
print(Y_val.shape)
print(greater.shape)
print(f1_score(Y_val, greater, average='micro'))

(11717, 6)
(11717, 6)
0.5690900337024555


### Refit on the whole data with around 40 epochs

In [0]:
print('Training shape:{} and Label shape{}'.format(TRAIN.shape, LABEL.shape))
# Build the dataset
ALL_TRAIN_dataset = tf.data.Dataset.from_tensor_slices((TRAIN, LABEL.values))
# shuffle, set batch and set prefetch
ALL_TRAIN_dataset = ALL_TRAIN_dataset.shuffle(10000).batch(BATCH_SIZE)
ALL_TRAIN_dataset = ALL_TRAIN_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Training shape:(46867,) and Label shape(46867, 6)


In [0]:
# reset the model to the initial weights stored before training
model.set_weights(weights)
# train for the best number of epochs (43)
history = model.fit(ALL_TRAIN_dataset, epochs=43, verbose=0)

## Predict

In [0]:
TEST_DATA = pd.read_csv('./test_tokenize_nostem.csv')
TEST = TEST_DATA['TOKEN'].apply(literal_eval) # convert to list (don't need this step if you use your own method)

In [0]:
# pad to same length
for row in TEST:
  if len(row) < MAX_LENGTH:
    row.extend(['' for _ in range(MAX_LENGTH-len(row))])

print(TEST.shape)
print(TEST[0])

(131166,)
['mobile', 'crowdsensing', 'is', 'a', 'promising', 'paradigm', 'for', 'ubiquitous', 'sensing', 'which', 'explores', 'the', 'tremendous', 'data', 'collected', 'by', 'mobile', 'smart', 'devices', 'with', 'prominent', 'spatial-temporal', 'coverage', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',

In [0]:
TEST_dataset = tf.data.Dataset.from_tensor_slices(TEST)
TEST_dataset = TEST_dataset.batch(BATCH_SIZE)
TEST_dataset = TEST_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

In [0]:
TEST_RESULT = model.predict(TEST_dataset)

print(TEST_RESULT.shape)
print(TEST_RESULT[-5:-1])

(131166, 6)
[[1.3198853e-03 5.6018233e-03 2.6428729e-02 8.2059246e-01 4.9370641e-01
  2.4116337e-03]
 [9.9096978e-01 2.2083253e-02 7.5790286e-04 3.4436584e-04 2.1257997e-04
  1.7622113e-04]
 [1.8643668e-01 5.8309507e-01 2.6996088e-01 1.3433120e-01 1.2393901e-01
  3.6059946e-02]
 [1.4346242e-02 8.9676261e-02 4.6251270e-01 4.9801958e-01 1.5494758e-01
  2.2586972e-02]]


In [0]:
# RESULT = (TEST_RESULT>=0.5).astype(int)
RESULT = pd.DataFrame(TEST_RESULT, columns=LABEL.columns)
RESULT.head()

|   | BACKGROUND | OBJECTIVES | METHODS | RESULTS | CONCLUSIONS | OTHERS |
|---|---|---|---|---|---|---|
| 0 | 0.946022 | 0.079508 | 0.003645 | 0.001922 | 0.002508 | 0.001954 |
| 1 | 0.743929 | 0.22045 | 0.032969 | 0.05648 | 0.041752 | 0.013041 |
| 2 | 0.11944 | 0.380424 | 0.188641 | 0.174334 | 0.254004 | 0.07598 |
| 3 | 0.036066 | 0.628425 | 0.392521 | 0.096206 | 0.042264 | 0.005441 |
| 4 | 0.003079 | 0.024162 | 0.19281 | 0.674457 | 0.322713 | 0.01835 |


In [0]:
# save to csv file
RESULT.to_csv('test_result.csv', index=False)