#### This assignment focuses on text classification using a CNN, an approach that has been picking up pace over the past few years, so I thought it would be a good exercise to try out. The dataset is provided to you, along with specific instructions on how to curate the data, split it into training and validation sets, and the like. You will be using MXNet for this task. The data comprises tweets pertaining to common causes of cancer. The objective is to classify the tweets as medically relevant or not. The dataset is skewed, with the positive class ('yes') being about 6 times less frequent than the negative class ('no'). (Total marks = 50.) Marks for the individual sub-problems are given in brackets.
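
To see why the skew matters, consider a classifier that always answers 'no'. The sketch below assumes a hypothetical 100 'yes' / 600 'no' split (roughly the 1:6 ratio described above, not the real counts) and computes accuracy and positive-class F1 by hand. The function name `always_no_baseline` is illustrative only and is not part of the assignment.

```python
def always_no_baseline(n_pos, n_neg):
    """Accuracy and positive-class F1 of a classifier that always predicts 'no'."""
    tp = 0        # it never predicts 'yes', so there are no true positives
    fp = 0        # ... and no false positives
    fn = n_pos    # every 'yes' tweet is missed
    tn = n_neg    # every 'no' tweet is correct
    accuracy = (tp + tn) / float(n_pos + n_neg)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical counts with a 1:6 class ratio
accuracy, f1 = always_no_baseline(n_pos=100, n_neg=600)
print("accuracy = %.3f, F1 = %.3f" % (accuracy, f1))
```

Such a baseline reaches about 86% accuracy while finding no relevant tweet at all, which is why the F1 score is tracked alongside accuracy later on.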

In [84]:
#Anmesh Choudhury
#16NA30003
# these are the modules you are allowed to work with. 

import nltk
import re
import itertools
import numpy as np
import mxnet as mx
import sys, os
import random
from collections import Counter
from io import open
from sklearn.metrics import f1_score
'''
First job is to clean and preprocess the social media text. (5)

1) Replace URLs and mentions (i.e. strings which are preceded by @)
2) Segment #hashtags 
3) Remove emoticons and other unicode characters
'''

def preprocess_tweet(input_text):
    '''
    Input: The input string read directly from the file
    
    Output: Pre-processed tweet text
    '''
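    # drop URLs
    cleaned_text = re.sub(r"http\S+", "", input_text)
    # drop @mentions, including one at the very start of the tweet
    cleaned_text = re.sub(r'(^|\s)@\w+', r'\1', cleaned_text)
    # drop emoticons and other non-ASCII characters
    cleaned_text = (cleaned_text.encode('ascii', 'ignore')).decode("utf-8")
    # strip the remaining '#' and '@' symbols
    cleaned_text = cleaned_text.replace("#", "")
    cleaned_text = cleaned_text.replace("@", "")
    # segment CamelCase hashtags by inserting a space before inner capitals
    cleaned_text = re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', cleaned_text)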
    return cleaned_text


# read the input file and create the set of positive examples and negative examples. 

pos_data=[]
neg_data=[]

# the with-block closes the file once all lines have been read
with open('cancer_data.tsv',encoding="utf8") as data_file:
    for line in data_file:
        line=line.strip().split('\t')
        text2= preprocess_tweet(line[0]).strip().split()
        if line[1]=='yes':
            pos_data.append(text2)
        elif line[1]=='no':
            neg_data.append(text2)

print(len(pos_data), len(neg_data))     

sentences= list(pos_data)
sentences.extend(neg_data)
pos_labels= [1 for _ in pos_data]
neg_labels= [0 for _ in neg_data]
y=list(pos_labels)
y.extend(neg_labels)
y=np.array(y)

'''
After this you will obtain the following :

1) sentences =  List of sentences having the positive and negative examples with all the positive examples first
2) y = List of labels with the positive labels first.
'''

'''
Before running the CNN there are a few things one needs to take care of: (5)

1) Pad the sentences so that all of them are of the same length
2) Build a vocabulary comprising all unique words that occur in the corpus
3) Convert each sentence into a corresponding vector by replacing each word in the sentence with the index in the vocabulary. 

Example :
S1 = a b a c
S2 = d c a 

Step 1:  S1= a b a c, 
         S2 =d c a </s> 
         (Both sentences are of equal length). 

Step 2:  voc={a:1, b:2, c:3, d:4, </s>: 5}

Step 3:  S1= [1,2,1,3]
         S2= [4,3,1,5]

'''

def create_word_vectors(sentences):
    '''
    Input: List of sentences
    Output: List of word vectors corresponding to each sentence, vocabulary
    '''
    sequence_length = max(len(x) for x in sentences)
    padded_sentences = []
    for i in range(len(sentences)):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        new_sentence = sentence + ["</s>"] * num_padding
        padded_sentences.append(new_sentence)
    word_counts = Counter(itertools.chain(*padded_sentences))
    vocabulary_inv = [x[0] for x in word_counts.most_common()]
    vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
    word_vectors = np.array([[vocabulary[word] for word in sentence] for sentence in padded_sentences])
  
    return word_vectors, vocabulary, padded_sentences


x, vocabulary,padded_sentences = create_word_vectors(sentences)


def create_shuffle(x,y):
    '''
    Create an equal distribution of the positive and negative examples. 
    Please do not change this particular shuffling method.
    '''
    pos_len= len(pos_data)
    neg_len= len(neg_data)
    pos_len_train= int(0.8*pos_len)
    neg_len_train= int(0.8*neg_len)
    train_data= [(x[i],y[i]) for i in range(0, pos_len_train)]
    train_data.extend([(x[i],y[i]) for i in range(pos_len, pos_len+ neg_len_train )])
    test_data=[(x[i],y[i]) for i in range(pos_len_train, pos_len)]
    test_data.extend([(x[i],y[i]) for i in range(pos_len+ neg_len_train, len(x) )])
    
    random.shuffle(train_data)
    x_train=[i[0] for i in train_data]
    y_train=[i[1] for i in train_data]
    random.shuffle(test_data)
    x_test=[i[0] for i in test_data]
    y_test=[i[1] for i in test_data]
    
    x_train=np.array(x_train)
    y_train=np.array(y_train)
    x_test= np.array(x_test)
    y_test= np.array(y_test)
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test= create_shuffle(x,y)




(208, 1298)
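
The three-step example in the docstring above (pad, build a vocabulary, index) can be reproduced with a few lines of plain Python. This is a minimal sketch, not the `create_word_vectors` function itself: the helper name `pad_and_index` is made up, and it numbers words from 1 in order of first appearance to match the docstring, whereas `create_word_vectors` numbers them from 0 in order of frequency.

```python
def pad_and_index(sentences, pad_token="</s>"):
    """Pad tokenised sentences to equal length and map each word to an integer."""
    max_len = max(len(s) for s in sentences)
    # Step 1: pad every sentence on the right up to the longest one
    padded = [s + [pad_token] * (max_len - len(s)) for s in sentences]
    # Step 2: assign indices 1, 2, 3, ... in order of first appearance
    vocab = {}
    for sentence in padded:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    # Step 3: replace each word by its index
    vectors = [[vocab[word] for word in sentence] for sentence in padded]
    return padded, vocab, vectors

padded, vocab, vectors = pad_and_index([["a", "b", "a", "c"], ["d", "c", "a"]])
print(vectors)
```

With the docstring's S1 and S2 as input, the second sentence gains one `</s>` token and the two index lists have the same length, which is what the embedding layer needs.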


In [85]:
'''
We now define the neural architecture of the CNN. The architecture is defined as : (10)

1) Embedding layer that converts the vector representation of the sentence from a one-hot encoding to a fixed sized word embedding
   (mx.sym.Embedding)
   
2) Convolution + activation + max pooling layer 
   (mx.sym.Convolution+ mx.sym.Activation+ mx.sym.Pooling)
   Repeat this for each filter size (a filter of size 2 looks at bigrams, a filter of size 3 at trigrams, etc.).

3) Concat all the filters together (mx.sym.Concat)

4) Pass the results through a fully Connected layer of size 2 and then run softmax on it. 
   (mx.sym.FullyConnected, mx.sym.SoftmaxOutput)
   

We then initialize the intermediate layers of appropriate size and train the model using back prop. (10)
(Look up the mxnet tutorial if you have any doubt)

Run the classifier and for each epoch with a specified batch size observe the accuracy on the training set and test set (5)


Default parameters:

1) No of epochs = 10
2) Batch size = 20
3) Size of word embeddings = 200
4) Size of filters =[2,3,4,5]
5) Filter embedding= 100
6) Optimizer = rmsprop
7) learning rate = 0.005

'''
sentence_size = x_train.shape[1]
vocab_size = len(vocabulary)
batch_size = 20
print('batch size', batch_size)

input_x = mx.sym.Variable('data') 
input_y = mx.sym.Variable('softmax_label') 

num_embed = 200 
print('embedding dimensions', num_embed)


embed_layer = mx.sym.Embedding(data=input_x, input_dim=vocab_size, output_dim=num_embed, name='vocab_embed')

conv_input = mx.sym.Reshape(data=embed_layer, shape=(batch_size, 1, sentence_size, num_embed))

# create convolution + (max) pooling layer for each filter operation
filter_list=[2, 3, 4, 5] # the size of filters to use
print('convolution filters', filter_list)

num_filter=100
pooled_outputs = []
for filter_size in filter_list:
    convi = mx.sym.Convolution(data=conv_input, kernel=(filter_size, num_embed), num_filter=num_filter)
    relui = mx.sym.Activation(data=convi, act_type='relu')
    pooli = mx.sym.Pooling(data=relui, pool_type='max', kernel=(sentence_size - filter_size + 1, 1), stride=(1, 1))
    pooled_outputs.append(pooli)

# combine all pooled outputs
total_filters = num_filter * len(filter_list)
concat = mx.sym.Concat(*pooled_outputs, dim=1)

# reshape for next layer
h_pool = mx.sym.Reshape(data=concat, shape=(batch_size, total_filters))

# dropout layer
dropout = 0.5
print('dropout probability', dropout)

if dropout > 0.0:
    h_drop = mx.sym.Dropout(data=h_pool, p=dropout)
else:
    h_drop = h_pool

# fully connected layer
num_label = 2

cls_weight = mx.sym.Variable('cls_weight')
cls_bias = mx.sym.Variable('cls_bias')

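# feed the dropout output (h_drop), not h_pool, so that dropout is actually applied during training
fc = mx.sym.FullyConnected(data=h_drop, weight=cls_weight, bias=cls_bias, num_hidden=num_label)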

# softmax output
sm = mx.sym.SoftmaxOutput(data=fc, label=input_y, name='softmax')

# set CNN pointer to the "back" of the network
cnn = sm

from collections import namedtuple
import math
import time

# Define the structure of our CNN Model (as a named tuple)
CNNModel = namedtuple("CNNModel", ['cnn_exec', 'symbol', 'data', 'label', 'param_blocks'])

# Define what device to train/test on, use GPU if available
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

arg_names = cnn.list_arguments()

input_shapes = {}
input_shapes['data'] = (batch_size, sentence_size)

arg_shape, out_shape, aux_shape = cnn.infer_shape(**input_shapes)
arg_arrays = [mx.nd.zeros(s, ctx) for s in arg_shape]
args_grad = {}
for shape, name in zip(arg_shape, arg_names):
    if name in ['softmax_label', 'data']: # input, output
        continue
    args_grad[name] = mx.nd.zeros(shape, ctx)

cnn_exec = cnn.bind(ctx=ctx, args=arg_arrays, args_grad=args_grad, grad_req='add')

param_blocks = []
arg_dict = dict(zip(arg_names, cnn_exec.arg_arrays))
initializer = mx.initializer.Uniform(0.1)
for i, name in enumerate(arg_names):
    if name in ['softmax_label', 'data']: # input, output
        continue
    initializer(mx.init.InitDesc(name), arg_dict[name])

    param_blocks.append( (i, arg_dict[name], args_grad[name], name) )

data = cnn_exec.arg_dict['data']
label = cnn_exec.arg_dict['softmax_label']

cnn_model= CNNModel(cnn_exec=cnn_exec, symbol=cnn, data=data, label=label, param_blocks=param_blocks)

'''
Train the cnn_model using back prop
'''

optimizer = 'rmsprop'
max_grad_norm = 5.0
learning_rate = 0.001
epoch = 20

print('optimizer', optimizer)
print('maximum gradient', max_grad_norm)
print('learning rate (step size)', learning_rate)
print('epochs to train for', epoch)

# create optimizer
opt = mx.optimizer.create(optimizer)
opt.lr = learning_rate

updater = mx.optimizer.get_updater(opt)

# For each training epoch
for iteration in range(epoch):
    tic = time.time()
    num_correct = 0
    num_total = 0
    c=0
    F1_train=0
    # Over each batch of training data
    for begin in range(0, x_train.shape[0], batch_size):
        batchX = x_train[begin:begin+batch_size]
        batchY = y_train[begin:begin+batch_size]
        if batchX.shape[0] != batch_size:
            continue

        cnn_model.data[:] = batchX
        cnn_model.label[:] = batchY

        # forward
        cnn_model.cnn_exec.forward(is_train=True)

        # backward
        cnn_model.cnn_exec.backward()

        # eval on training data
        z1=np.argmax(cnn_model.cnn_exec.outputs[0].asnumpy(), axis=1)
        num_correct += sum(batchY == z1)
        F1_train+=f1_score(batchY, z1, average='binary')  # sklearn expects (y_true, y_pred)
        c+=1
        num_total += len(batchY)
       
        # update weights
        norm = 0
        for idx, weight, grad, name in cnn_model.param_blocks:
            grad /= batch_size
            l2_norm = mx.nd.norm(grad).asscalar()
            norm += l2_norm * l2_norm

        norm = math.sqrt(norm)
        for idx, weight, grad, name in cnn_model.param_blocks:
            if norm > max_grad_norm:
                grad *= (max_grad_norm / norm)

            updater(idx, grad, weight)

            # reset gradient to zero
            grad[:] = 0.0
    F1_train/=c
    # End of training loop for this epoch
    toc = time.time()
    train_time = toc - tic
    train_acc = num_correct * 100 / float(num_total)

    # Saving checkpoint to disk
    if (iteration + 1) % 10 == 0:
        prefix = 'cnn'
        cnn_model.symbol.save('./%s-symbol.json' % prefix)
        save_dict = {('arg:%s' % k) : v  for k, v in cnn_model.cnn_exec.arg_dict.items()}
        save_dict.update({('aux:%s' % k) : v for k, v in cnn_model.cnn_exec.aux_dict.items()})
        param_name = './%s-%04d.params' % (prefix, iteration)
        mx.nd.save(param_name, save_dict)
        print('Saved checkpoint to %s' % param_name)


    # Evaluate model after this epoch on dev (test) set
    num_correct = 0
    num_total = 0
    c=0
    F1=0
    # For each test batch
    for begin in range(0, x_test.shape[0], batch_size):
        batchX = x_test[begin:begin+batch_size]
        batchY = y_test[begin:begin+batch_size]

        if batchX.shape[0] != batch_size:
            continue

        cnn_model.data[:] = batchX
        cnn_model.cnn_exec.forward(is_train=False)
        z=np.argmax(cnn_model.cnn_exec.outputs[0].asnumpy(), axis=1)
        num_correct += sum(batchY == z)
        num_total += len(batchY)
        F1+=f1_score(batchY, z, average='binary')  # sklearn expects (y_true, y_pred)
        c+=1
    dev_acc = num_correct * 100 / float(num_total)

    F1/=c
    print('Iter [%d] Train: Time: %.3fs, Training Accuracy: %.3f \
            --- Dev Accuracy thus far: %.3f' % (iteration, train_time, train_acc, dev_acc))
    print("Iter [%d] F1 score  for train  %.3f for DEV %.3f" %(iteration,F1_train,F1))

('batch size', 20)
('embedding dimensions', 200)
('convolution filters', [2, 3, 4, 5])
('dropout probability', 0.5)
('optimizer', 'rmsprop')
('maximum gradient', 5.0)
('learning rate (step size)', 0.001)
('epochs to train for', 20)
Iter [0] Train: Time: 1.100s, Training Accuracy: 86.500             --- Dev Accuracy thus far: 87.333
Iter [0] F1 score  for train  0.019 for DEV 0.098
Iter [1] Train: Time: 1.034s, Training Accuracy: 89.500             --- Dev Accuracy thus far: 88.333
Iter [1] F1 score  for train  0.316 for DEV 0.258
Iter [2] Train: Time: 1.035s, Training Accuracy: 96.667             --- Dev Accuracy thus far: 88.667
Iter [2] F1 score  for train  0.805 for DEV 0.290
Iter [3] Train: Time: 1.047s, Training Accuracy: 99.500             --- Dev Accuracy thus far: 89.333
Iter [3] F1 score  for train  0.916 for DEV 0.330
Iter [4] Train: Time: 1.037s, Training Accuracy: 99.500             --- Dev Accuracy thus far: 89.333
Iter [4] F1 score  for train  0.913 for DEV 0.307
Iter [5]
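
The F1 values printed in this cell are the mean of per-batch F1 scores. That is not the same number as one F1 computed over all predictions of the epoch, and it can be noisy when a batch of 20 tweets holds few or no positives. The sketch below illustrates the difference on two made-up batches. `f1_binary` is a hand-written helper, not the sklearn function, and it returns 0 when there are no true positives.

```python
def f1_binary(y_true, y_pred):
    """F1 of the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two toy batches of (true labels, predicted labels); the values are made up
batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect batch
    ([0, 0, 0, 0], [0, 0, 0, 1]),  # no positives, one false alarm
]

# Variant 1: average the per-batch scores
mean_of_batches = sum(f1_binary(t, p) for t, p in batches) / len(batches)

# Variant 2: pool all predictions first, then score once
all_true = [t for ts, _ in batches for t in ts]
all_pred = [p for _, ps in batches for p in ps]
pooled = f1_binary(all_true, all_pred)

print("mean of batch F1 = %.3f, pooled F1 = %.3f" % (mean_of_batches, pooled))
```

A single false alarm in a batch without positives pulls the batch average down far more than it lowers the pooled score, so the two variants should not be compared with each other directly.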

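The weight update in the training loop rescales all gradients together whenever their joint L2 norm exceeds `max_grad_norm`. Below is a minimal sketch of that clipping step in plain Python, using nested lists instead of MXNet NDArrays. The helper name `clip_by_global_norm` and the toy gradient values are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient vectors so that their joint L2 norm is at most max_norm."""
    # joint norm over every entry of every gradient
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, global_norm

# Toy gradients for two parameter blocks: joint norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print("norm before = %.1f, norm after = %.1f" % (norm_before, norm_after))
```

Because every block is multiplied by the same factor, the direction of the overall update is preserved and only its length is capped.
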
In [86]:
'''
So far, the assignment has been posed so that you can refer directly to the MXNet tutorial on the same problem. 

The final 15 marks are for carrying out experiments of your own and observing how the results change. 

1) Would the results improve if, instead of word embeddings based solely on frequency, you incorporated sub-word information?
   (In short, run fastText on the corpus and use the word embeddings generated by fastText.) (8)
   
2) Accuracy might not be the best way to measure performance on a skewed dataset. What other metrics would you use? Why? 
   Experiment with different hyper-parameters and report the performance in terms of your chosen metric. 
   You can assume that we want to identify all the medically relevant tweets (i.e. the 'yes' class matters more). (7)
    

Deliverables:

The IPython notebook with the results for each part of the question. 


P.S.: This assignment is part of a research question I am working on in my free time. So if you have any insights, I'd love to hear them. 
Happy coding 

Ritam Dutt
14CS30041

'''
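
For question 2, one family of candidate metrics is precision, recall and the F-beta score of the 'yes' class. Choosing beta greater than 1 weights recall more heavily, which fits the stated goal of finding all relevant tweets. The sketch below computes them by hand on made-up labels. The function name `precision_recall_fbeta` and the example data are assumptions, and this is only one possible choice of metric.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Made-up labels: 4 relevant tweets, 3 of them found, plus 2 false alarms
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

p, r, f1 = precision_recall_fbeta(y_true, y_pred, beta=1.0)
_, _, f2 = precision_recall_fbeta(y_true, y_pred, beta=2.0)
print("precision = %.3f, recall = %.3f, F1 = %.3f, F2 = %.3f" % (p, r, f1, f2))
```

Here recall is higher than precision, so F2 comes out above F1; a model that misses relevant tweets would be penalised more by F2 than by F1.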


In [87]:
from gensim.models import FastText
def create_ft_weights(sentences, num_embed, vocabulary, epoch):
    model = FastText(size=num_embed, window=3, min_count=1)  # instantiate
    model.build_vocab(sentences=sentences)
    model.train(sentences=sentences, total_examples=model.corpus_count, epochs=epoch)  # train
    weights = []
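    # iterate in index order so that row i of the matrix belongs to the word with index i
    # (plain dict iteration order is not guaranteed to match on older Python versions)
    for word in sorted(vocabulary, key=vocabulary.get):
        weights.append(model.wv[word])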
    return np.asarray(weights)
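
An embedding layer looks up row i of its weight matrix for word index i, so the pretrained vectors have to be stacked in vocabulary-index order. The sketch below shows that alignment with plain dicts and lists. `build_embedding_matrix` is a hypothetical helper, and `lookup` is a stand-in for the trained model's word vectors, with made-up two-dimensional values.

```python
def build_embedding_matrix(vocabulary, lookup):
    """Return rows ordered so that row i is the vector of the word with index i."""
    words_by_index = sorted(vocabulary, key=vocabulary.get)
    return [lookup[word] for word in words_by_index]

# word -> row index, deliberately written out of order
vocabulary = {"cancer": 1, "</s>": 0, "smoking": 2}

# stand-in for the pretrained vectors (made-up 2-dimensional values)
lookup = {
    "</s>": [0.0, 0.0],
    "cancer": [0.1, 0.2],
    "smoking": [0.3, 0.4],
}

matrix = build_embedding_matrix(vocabulary, lookup)
for word, index in vocabulary.items():
    # every row must hold the vector of the word that maps to it
    assert matrix[index] == lookup[word]
print(matrix)
```

If the rows were stacked in any other order, the network would still train, but each word would start from another word's vector.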

In [88]:

from sklearn.metrics import f1_score
def train_and_test(num_embed, filter_embed, batch_size, vocab_size, sentence_size, filter_list, optimizer, learning_rate, epoch,weight_init, ft = True,f1score = True, verbose = True):
    input_x = mx.sym.Variable('data') # placeholder for input data
    input_y = mx.sym.Variable('softmax_label') # placeholder for output label

    if ft == True:
        weight = mx.sym.Variable('vocab_embed_weight')
    '''
    Define the first network layer (embedding)
    '''

    # create embedding layer to learn representation of words in a lower dimensional subspace (much like word2vec)

    #print('embedding dimensions', num_embed)
    if ft == False:
        embed_layer = mx.sym.Embedding(data=input_x, input_dim=vocab_size, output_dim=num_embed, name='vocab_embed')
    else:
        embed_layer = mx.sym.Embedding(data=input_x, input_dim=vocab_size, output_dim=num_embed, weight = weight, name='vocab_embed')

    # reshape embedded data for next layer
    conv_input = mx.sym.Reshape(data=embed_layer, shape=(batch_size, 1, sentence_size, num_embed))

    # create convolution + (max) pooling layer for each filter operation
     # the size of filters to use
    #print('convolution filters', filter_list)

    num_filter=1
    pooled_outputs = []
    for filter_size in filter_list:
        convi = mx.sym.Convolution(data=conv_input, kernel=(filter_size, filter_embed), num_filter=num_filter)
        relui = mx.sym.Activation(data=convi, act_type='relu')
        pooli = mx.sym.Pooling(data=relui, pool_type='max', kernel=(sentence_size - filter_size + 1, 1), stride=(1, 1))
        pooled_outputs.append(pooli)

    # combine all pooled outputs
    total_filters = num_filter * len(filter_list)
    concat = mx.sym.Concat(*pooled_outputs, dim=1)

    # reshape for next layer
    h_pool = mx.sym.Reshape(data=concat, shape=(batch_size, total_filters*(num_embed - filter_embed + 1)))

    num_label = 2

    cls_weight = mx.sym.Variable('cls_weight')
    cls_bias = mx.sym.Variable('cls_bias')

    fc = mx.sym.FullyConnected(data=h_pool, weight=cls_weight, bias=cls_bias, num_hidden=num_label)

    # softmax output
    sm = mx.sym.SoftmaxOutput(data=fc, label=input_y, name='softmax')

    # set CNN pointer to the "back" of the network
    cnn = sm

    # Define what device to train/test on, use GPU if available
    ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

    arg_names = cnn.list_arguments()
#     print(arg_names)
#     exit()
    input_shapes = {}
    input_shapes['data'] = (batch_size, sentence_size)

    arg_shape, out_shape, aux_shape = cnn.infer_shape(**input_shapes)
    arg_arrays = [mx.nd.zeros(s, ctx) for s in arg_shape]
    args_grad = {}
    for shape, name in zip(arg_shape, arg_names):
        if name in ['softmax_label', 'data']: # input, output
            continue
        args_grad[name] = mx.nd.zeros(shape, ctx)

    cnn_exec = cnn.bind(ctx=ctx, args=arg_arrays, args_grad=args_grad, grad_req='add')

    param_blocks = []
    arg_dict = dict(zip(arg_names, cnn_exec.arg_arrays))
    initializer = mx.initializer.Uniform(0.1)
    for i, name in enumerate(arg_names):
        if name in ['softmax_label', 'data']: # input, output
            continue
        if ft == True:
            if name == 'vocab_embed_weight':
                initializer_2 = mx.initializer.Load({name:weight_init})
                initializer_2(mx.init.InitDesc(name), arg_dict[name])
                print('FastText weights initialized!')
            else:
                initializer(mx.init.InitDesc(name), arg_dict[name])
        else:
            initializer(mx.init.InitDesc(name), arg_dict[name])

        param_blocks.append( (i, arg_dict[name], args_grad[name], name) )

    data = cnn_exec.arg_dict['data']
    label = cnn_exec.arg_dict['softmax_label']

    '''
    Train the cnn_model using back prop
    '''

    # create optimizer
    opt = mx.optimizer.create(optimizer)
    opt.lr = learning_rate

    updater = mx.optimizer.get_updater(opt)

    # For each training epoch
    max_test_acc = 0
    for iteration in range(epoch):
        #tic = time.time()
        score = []
        k_labels = []
        num_correct = 0
        num_total = 0
        # Over each batch of training data
        for begin in range(0, x_train.shape[0], batch_size):
            batchX = x_train[begin:begin+batch_size]
            batchY = y_train[begin:begin+batch_size]
            if batchX.shape[0] != batch_size:
                continue

            data[:] = batchX
            label[:] = batchY

            # forward
            cnn_exec.forward(is_train=True)

            # backward
            cnn_exec.backward()

            # eval on training data
            num_correct += sum(batchY == np.argmax(cnn_exec.outputs[0].asnumpy(), axis=1))
            
            num_total += len(batchY)
            score.extend(np.argmax(cnn_exec.outputs[0].asnumpy(), axis=1).tolist())
            k_labels.extend(batchY.tolist())


            # update weights
            norm = 0
            for idx, weight, grad, name in param_blocks:
                grad /= batch_size
                l2_norm = mx.nd.norm(grad).asscalar()
                norm += l2_norm * l2_norm

            norm = np.sqrt(norm)
            for idx, weight, grad, name in param_blocks:
    #             if norm > max_grad_norm:
    #                 grad *= (max_grad_norm / norm)

                updater(idx, grad, weight)

                # reset gradient to zero
                grad[:] = 0.0

        # End of training loop for this epoch
    #     toc = time.time()
    #     train_time = toc - tic
        train_acc = num_correct * 100 / float(num_total)
        
        train_f1 = f1_score(k_labels, score)  # sklearn expects (y_true, y_pred)
        # Saving checkpoint to disk
    #     if (iteration + 1) % 10 == 0:
    #         prefix = 'cnn'
    #         symbol.save('./%s-symbol.json' % prefix)
    #         save_dict = {('arg:%s' % k) : v  for k, v in cnn_exec.arg_dict.items()}
    #         save_dict.update({('aux:%s' % k) : v for k, v in cnn_exec.aux_dict.items()})
    #         param_name = './%s-%04d.params' % (prefix, iteration)
    #         mx.nd.save(param_name, save_dict)
    #         print('Saved checkpoint to %s' % param_name)


        # Evaluate model after this epoch on dev (test) set
        num_correct = 0
        num_total = 0
        test_score = []
        test_label = []


        # For each test batch
        for begin in range(0, x_test.shape[0], batch_size):
            batchX = x_test[begin:begin+batch_size]
            batchY = y_test[begin:begin+batch_size]

            if batchX.shape[0] != batch_size:
                continue

            data[:] = batchX
            cnn_exec.forward(is_train=False)

            num_correct += sum(batchY == np.argmax(cnn_exec.outputs[0].asnumpy(), axis=1))
            test_score.extend(np.argmax(cnn_exec.outputs[0].asnumpy(), axis=1).tolist())
            test_label.extend(batchY.tolist())
            num_total += len(batchY)

        dev_acc = num_correct * 100 / float(num_total)
        test_f1 = f1_score(test_label, test_score)  # sklearn expects (y_true, y_pred)
        if dev_acc > max_test_acc: 
            max_test_acc = dev_acc
        if verbose == True:
            print('Epoch [%d] Train: Training F1 score: %.3f \
                        --- Test F1 score thus far: %.3f' % (iteration, train_f1, test_f1))

    return(test_f1)
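
The reshape sizes in the two networks follow from the convolution and pooling geometry. The sketch below traces the (batch, channels, height, width) shapes for stride 1 and no padding, which is how the `Convolution` calls above are configured. `text_cnn_shapes` is a hypothetical helper, and the sentence length of 30 is an assumed value rather than the one from the data.

```python
def text_cnn_shapes(batch_size, sentence_size, num_embed, filter_list, num_filter, kernel_width):
    """Trace (N, C, H, W) shapes through conv -> max-pool -> concat -> flatten."""
    conv_shapes = []
    pooled_shapes = []
    for filter_size in filter_list:
        conv_h = sentence_size - filter_size + 1   # positions along the sentence
        conv_w = num_embed - kernel_width + 1      # positions along the embedding axis
        conv_shapes.append((batch_size, num_filter, conv_h, conv_w))
        # the pooling kernel (conv_h, 1) collapses the height and leaves the width alone
        pooled_shapes.append((batch_size, num_filter, 1, conv_w))
    channels = sum(shape[1] for shape in pooled_shapes)   # concat along dim=1
    width = pooled_shapes[0][3]
    return {
        "conv": conv_shapes,
        "concat": (batch_size, channels, 1, width),
        "flat": (batch_size, channels * width),
    }

# First network: kernels span the full embedding, 100 filters per size
full_width = text_cnn_shapes(20, 30, 200, [2, 3, 4, 5], num_filter=100, kernel_width=200)

# train_and_test: kernels span 100 of the 200 embedding columns, 1 filter per size
half_width = text_cnn_shapes(20, 30, 200, [2, 3, 4, 5], num_filter=1, kernel_width=100)

print(full_width["flat"], half_width["flat"])
```

In the second configuration the flattened size is `total_filters * (num_embed - filter_embed + 1)`, which is the expression used in the `h_pool` reshape of `train_and_test`.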

In [89]:
matrix = create_ft_weights(padded_sentences , num_embed, vocabulary, epoch)

In [None]:
filter_list = [2,3,4,5]
epochs = 10
batch_size = 20
word_embeddings = 200  # must match the dimension used to build `matrix`
filter_embed = 100
optimizer = 'rmsprop'
learning_rate = 0.005
# pass the values defined in this cell rather than leftovers from earlier cells
scores = train_and_test(word_embeddings, filter_embed, batch_size, vocab_size, sentence_size, filter_list, optimizer, learning_rate, epochs, matrix)

FastText weights initialized!
Epoch [0] Train: Training F1 score: 0.135                         --- Test F1 score thus far: 0.174
Epoch [1] Train: Training F1 score: 0.472                         --- Test F1 score thus far: 0.393
Epoch [2] Train: Training F1 score: 0.884                         --- Test F1 score thus far: 0.441
Epoch [3] Train: Training F1 score: 0.970                         --- Test F1 score thus far: 0.377
Epoch [4] Train: Training F1 score: 0.982                         --- Test F1 score thus far: 0.386
Epoch [5] Train: Training F1 score: 0.988                         --- Test F1 score thus far: 0.400
Epoch [6] Train: Training F1 score: 1.000                         --- Test F1 score thus far: 0.400
Epoch [7] Train: Training F1 score: 1.000                         --- Test F1 score thus far: 0.400
Epoch [8] Train: Training F1 score: 1.000                         --- Test F1 score thus far: 0.393
Epoch [9] Train: Training F1 score: 1.000                         --- 