#### This particular assignment focuses on text classification using CNN. It has been picking up pace over the past few years. So, I thought this would be a good exercise to try out. The dataset is provided to you and there will be specific instrucions on how to curate the data, split into train and validation and the like.  You will be using MXnet for this task.  The data comprises tweets pertaining to common causes of cancer. The objective is to classify the tweets as medically relevant or not.  The dataset is skewed with positive class or 'yes' being 6 times less frequent than the negative class or 'no'. (Total marks = 50). Individual marks to the sub-problems are given in bracket. 

In [1]:
####FOR Logging in Jupyter Notebook#######
import logging
import sys
root_logger = logging.getLogger()
stdout_handler = logging.StreamHandler(sys.stdout)
root_logger.addHandler(stdout_handler)
root_logger.setLevel(logging.DEBUG)


# these are the modules you are allowed to work with. 

import nltk
import re
import numpy as np
import mxnet as mx
import sys, os
from numpy import random
from collections import Counter, namedtuple
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as fscore
import re

'''
First job is to clean and preprocess the social media text. (5)

1) Replace URLs and mentions (i.e strings which are preceeded with @)
2) Segment #hastags 
3) Remove emoticons and other unicode characters
'''
CLEAN = re.compile("[\s\n\r\t.,:;\-_\'\"?!#&()\/%\[\]\{\}\<\>\\$@\!\*\+\=]")
URL = re.compile("^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$")

def preprocess_tweet(input_text):
    '''
    Input: The input string read directly from the file
    
    Output: Pre-processed tweet text
    '''
    temp = [x.lower() for x in input_text.split(" ") if '@' not in x and '#' not in x and not URL.match(x)]
    temp = ' '.join(temp)
    temp = CLEAN.split(temp)
    cleaned_text = ' '.join(temp)    
    return cleaned_text


# read the input file and create the set of positive examples and negative examples. 

file=open('cancer_data.tsv')
pos_data=[]
neg_data=[]

for line in file:
    line=line.strip().split('\t')
    text2= preprocess_tweet(line[0]).strip().split()
    if line[1]=='yes':
        pos_data.append(text2)
    if line[1]=='no':
        neg_data.append(text2)

print(len(pos_data), len(neg_data))     

sentences= list(pos_data)
sentences.extend(neg_data)
pos_labels= [1 for _ in pos_data]
neg_labels= [0 for _ in neg_data]
y=list(pos_labels)
y.extend(neg_labels)
y=np.array(y)

'''
After this you will obtain the following :

1) sentences =  List of sentences having the positive and negative examples with all the positive examples first
2) y = List of labels with the positive labels first.
'''

'''
Before running the CNN there are a few things one needs to take care of: (5)

1) Pad the sentences so that all of them are of the same length
2) Build a vocabulary comprising all unique words that occur in the corpus
3) Convert each sentence into a corresponding vector by replacing each word in the sentence with the index in the vocabulary. 

Example :
S1 = a b a c
S2 = d c a 

Step 1:  S1= a b a c, 
         S2 =d c a </s> 
         (Both sentences are of equal length). 

Step 2:  voc={a:1, b:2, c:3, d:4, </s>: 5}

Step 3:  S1= [1,2,1,3]
         S2= [4,3,1,5]

'''

def create_word_vectors(sentences):
    '''
    Input: List of sentences
    Output: List of word vectors corresponding to each sentence, vocabulary
    '''
    cnt = Counter({'</s>' : 0})
    mx_len = 0
    for sent in sentences:
        ln = len(sent)
        mx_len = max(ln,mx_len)
        cnt.update(sent)
        #for word in sent:
            #if word not in vocabulary:
            #    vocabulary[word] = len(vocabulary)
    vocabulary = mx.contrib.text.vocab.Vocabulary(cnt)   
    word_vectors = []
    for sent in sentences:
        pad_length = mx_len - len(sent)
        padded_sent = sent + ['</s>']*pad_length
        word_vectors.append(vocabulary.to_indices(padded_sent))
        '''
        temp = np.zeros(mx_len)
        for i, word in enumerate(sent):
            temp[i] = vocabulary.to_indices([word])
        word_vectors.append(temp)
        '''
    word_vectors = np.array(word_vectors) 
    return word_vectors, vocabulary


x, vocabulary = create_word_vectors(sentences)
print(x.shape, len(vocabulary))

def create_shuffle(x,y):
    '''
    Create an equal distribution of the positive and negative examples. 
    Please do not change this particular shuffling method.
    '''
    pos_len= len(pos_data)
    neg_len= len(neg_data)
    pos_len_train= int(0.8*pos_len)
    neg_len_train= int(0.8*neg_len)
    train_data= [(x[i],y[i]) for i in range(0, pos_len_train)]
    train_data.extend([(x[i],y[i]) for i in range(pos_len, pos_len+ neg_len_train )])
    test_data=[(x[i],y[i]) for i in range(pos_len_train, pos_len)]
    test_data.extend([(x[i],y[i]) for i in range(pos_len+ neg_len_train, len(x) )])
    
    random.shuffle(train_data)
    x_train=[i[0] for i in train_data]
    y_train=[i[1] for i in train_data]
    random.shuffle(test_data)
    x_test=[i[0] for i in test_data]
    y_test=[i[1] for i in test_data]
    
    x_train=np.array(x_train)
    y_train=np.array(y_train)
    x_test= np.array(x_test)
    y_test= np.array(y_test)
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test= create_shuffle(x,y)




  import OpenSSL.SSL


208 1298
(1506, 109) 7506


In [2]:
'''
We now define the neural architecture of the CNN. The architecture is defined as : (10)

1) Embedding layer that converts the vector representation of the sentence from a one-hot encoding to a fixed sized word embedding
   (mx.sym.Embedding)
   
2) Convolution + activation + max pooling layer 
   (mx.sym.Convolution+ mx.sym.Activation+ mx.sym.Pooling)
   This procedure is to be followed for different sizes of filters (the filters corresponding to size 2 looks at the bigram distribution, 3 looks at trigram etc. 

3) Concat all the filters together (mx.sym.Concat)

4) Pass the results through a fully Connected layer of size 2 and then run softmax on it. 
   (mx.sym.FullyConnected, mx.sym.SoftmaxOutput)
   

We then initialize the intermediate layers of appropriate size and train the model using back prop. (10)
(Look up the mxnet tutorial if you have any doubt)

Run the classifier and for each epoch with a specified batch size observe the accuracy on the training set and test set (5)


Default parameters:

1) No of epochs = 10
2) Batch size = 20
3) Size of word embeddings = 200
4) Size of filters =[2,3,4,5]
5) Filter embedding= 100
6) Optimizer = rmsprop
7) learning rate = 0.005

'''

def create_model(vocab_size,max_time,out_dim=2,embedding_dim=200,batch_size=20,weight_matrix=None):
    input_data = mx.sym.Variable('data')
    output_labels = mx.sym.Variable('softmax_label')
    input_embed = mx.sym.Embedding(data=input_data,input_dim=vocab_size,output_dim=embedding_dim,\
                                   name='embed_weights')
    conv_inp = mx.sym.Reshape(data=input_embed, shape=(-1,1,max_time,embedding_dim))
    filter_sizes = [2,3,4,5]
    num_filters = 100
    conv_outs = []
    for filter_size in filter_sizes:
        out = mx.sym.Convolution(data=conv_inp,kernel=(filter_size,embedding_dim),num_filter=num_filters)
        out = mx.sym.Activation(data=out,act_type='relu')
        out = mx.sym.Pooling(data=out,pool_type='max',kernel=(max_time - filter_size + 1,1))
        conv_outs.append(out)
        
    all_outs = mx.sym.Concat(*conv_outs,dim=1)
    all_outs = mx.sym.Reshape(data=all_outs,shape=(-1,len(filter_sizes)*num_filters))
    scores = mx.sym.FullyConnected(data=all_outs,num_hidden=out_dim)
    probs = mx.sym.SoftmaxOutput(data=scores,name='softmax')
    
    #Initialiazation : Unfiorm(0.1)
    arg_params = {}
    if weight_matrix is not None:
        arg_params={'embed_weights' : weight_matrix}
    model = mx.model.FeedForward(probs,optimizer='rmsprop',num_epoch=10,learning_rate=0.005,\
                                 numpy_batch_size=batch_size,initializer=mx.initializer.Uniform(0.1),\
                                 arg_params=arg_params)
    return model

MAX_TIME = x_train.shape[1]
model = create_model(len(vocabulary),MAX_TIME)
model.fit(X=x_train,y=y_train,batch_end_callback = mx.callback.Speedometer(20,60),eval_data=(x_test,y_test))

Start training with [cpu(0)]


  self.initializer(k, v)


Epoch[0] Batch [60]	Speed: 272.93 samples/sec	accuracy=0.867500
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=4.563
Epoch[0] Validation-accuracy=0.915625
Epoch[1] Batch [60]	Speed: 288.78 samples/sec	accuracy=0.961667
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=4.187
Epoch[1] Validation-accuracy=0.909375
Epoch[2] Batch [60]	Speed: 245.24 samples/sec	accuracy=0.991667
Epoch[2] Resetting Data Iterator
Epoch[2] Time cost=4.925
Epoch[2] Validation-accuracy=0.900000
Epoch[3] Batch [60]	Speed: 234.78 samples/sec	accuracy=0.992500
Epoch[3] Resetting Data Iterator
Epoch[3] Time cost=5.151
Epoch[3] Validation-accuracy=0.903125
Epoch[4] Batch [60]	Speed: 272.99 samples/sec	accuracy=0.993333
Epoch[4] Resetting Data Iterator
Epoch[4] Time cost=4.438
Epoch[4] Validation-accuracy=0.903125
Epoch[5] Batch [60]	Speed: 289.75 samples/sec	accuracy=0.995833
Epoch[5] Resetting Data Iterator
Epoch[5] Time cost=4.262
Epoch[5] Validation-accuracy=0.906250
Epoch[6] Batch [60]	Speed: 288.57 sample

In [3]:
pred_probs = model.predict(x_test)
preds = pred_probs.argmax(axis=1)
acc = np.sum(preds==y_test)/len(y_test)
print("Final Accuracy on test Set : ",acc)
fs = fscore(y_test,preds)
print("F1 score for Positive class : ",fs[2][1])
print("F1 score for Negative class : ",fs[2][0])

Final Accuracy on test Set :  0.8940397350993378
F1 score for Positive class :  0.4074074074074074
F1 score for Negative class :  0.9418181818181818


In [4]:
'''
So far, the assignment has been posed in a manner so that you can refer to directly the mxnet tutorial on the same problem. 

The final 15 marks is meant to carry out experimentations of your own and observe how the results change by experimentation. 

1) Would the results improve if instead of using the word embeddings that is based solely on frequency, if you have been able to incorporate sub-word information
   (In short run fasttext on the corpus and use the word embeddings generated by fastetxt). (8)
   
2) Accuracy might not be the best way to measure the performance of a skewed dataset. What other metrics would you use ? Why? 
   Experiment with different hyper-paramters to show the performance in terms of metric? 
   You can assume that we want to identify all the medically relevant tweets (i.e. tweets with 'yes' class more). (7)
    

Delivearbles:

The ipython notebook with the results to each part of the question. 


P.S: This assignment is part of a research question I am working on my free time. So if you have any insights, I'd love to hear them. 
Happy coding 

Ritam Dutt
14CS30041

'''

#See next cell

"\nSo far, the assignment has been posed in a manner so that you can refer to directly the mxnet tutorial on the same problem. \n\nThe final 15 marks is meant to carry out experimentations of your own and observe how the results change by experimentation. \n\n1) Would the results improve if instead of using the word embeddings that is based solely on frequency, if you have been able to incorporate sub-word information\n   (In short run fasttext on the corpus and use the word embeddings generated by fastetxt). (8)\n   \n2) Accuracy might not be the best way to measure the performance of a skewed dataset. What other metrics would you use ? Why? \n   Experiment with different hyper-paramters to show the performance in terms of metric? \n   You can assume that we want to identify all the medically relevant tweets (i.e. tweets with 'yes' class more). (7)\n    \n\nDelivearbles:\n\nThe ipython notebook with the results to each part of the question. \n\n\nP.S: This assignment is part of a rese

## Using fastText 300-dimensional Embeddings

In [4]:
embeddings = mx.contrib.text.embedding.FastText(embedding_root='./',vocabulary=vocabulary)
all_tokens = vocabulary.to_tokens(list(range(len(vocabulary))))
weight_matrix = embeddings.get_vecs_by_tokens(all_tokens)

Loading pre-trained token embedding vectors from ./fasttext/wiki.simple.vec


  'skipped.' % (line_num, token, elems))


In [5]:
MAX_TIME = x_train.shape[1]
model2 = create_model(len(vocabulary),MAX_TIME,embedding_dim=weight_matrix.shape[1],weight_matrix=weight_matrix)
model2.fit(X=x_train,y=y_train,batch_end_callback = mx.callback.Speedometer(20,60),eval_data=(x_test,y_test))

Start training with [cpu(0)]


  self.initializer(k, v)


Epoch[0] Batch [60]	Speed: 199.13 samples/sec	accuracy=0.873333
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=6.247
Epoch[0] Validation-accuracy=0.903125
Epoch[1] Batch [60]	Speed: 207.44 samples/sec	accuracy=0.967500
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=5.837
Epoch[1] Validation-accuracy=0.903125
Epoch[2] Batch [60]	Speed: 207.87 samples/sec	accuracy=0.985000
Epoch[2] Resetting Data Iterator
Epoch[2] Time cost=5.818
Epoch[2] Validation-accuracy=0.893750
Epoch[3] Batch [60]	Speed: 200.60 samples/sec	accuracy=0.992500
Epoch[3] Resetting Data Iterator
Epoch[3] Time cost=6.028
Epoch[3] Validation-accuracy=0.900000
Epoch[4] Batch [60]	Speed: 159.52 samples/sec	accuracy=0.995000
Epoch[4] Resetting Data Iterator
Epoch[4] Time cost=7.577
Epoch[4] Validation-accuracy=0.903125
Epoch[5] Batch [60]	Speed: 170.91 samples/sec	accuracy=0.996667
Epoch[5] Resetting Data Iterator
Epoch[5] Time cost=7.182
Epoch[5] Validation-accuracy=0.906250
Epoch[6] Batch [60]	Speed: 183.96 sample

In [6]:
print("Initializing using Fasttext Embeddings")
pred_probs = model2.predict(x_test)
preds = pred_probs.argmax(axis=1)
acc = np.sum(preds==y_test)/len(y_test)
print("Final Accuracy on test Set : ",acc)
fs = fscore(y_test,preds)
print("F1 score for Positive class : ",fs[2][1])
print("F1 score for Negative class : ",fs[2][0])

Initializing using Fasttext Embeddings
Final Accuracy on test Set :  0.8973509933774835
F1 score for Positive class :  0.43636363636363634
F1 score for Negative class :  0.9435336976320582


##  Using Glove 200-dimensional embeddings

In [7]:
del weight_matrix
EMBEDDINGS_FILE = 'glove.6B.200d.txt'
embeddings = mx.contrib.text.embedding.GloVe(EMBEDDINGS_FILE,embedding_root='./',vocabulary=vocabulary)
all_tokens = vocabulary.to_tokens(list(range(len(vocabulary))))
weight_matrix = embeddings.get_vecs_by_tokens(all_tokens)

Loading pre-trained token embedding vectors from ./glove/glove.6B.200d.txt


In [8]:
MAX_TIME = x_train.shape[1]
model3 = create_model(len(vocabulary),MAX_TIME,embedding_dim=weight_matrix.shape[1],weight_matrix=weight_matrix)
model3.fit(X=x_train,y=y_train,batch_end_callback = mx.callback.Speedometer(20,60),eval_data=(x_test,y_test))

Start training with [cpu(0)]


  self.initializer(k, v)


Epoch[0] Batch [60]	Speed: 241.06 samples/sec	accuracy=0.861667
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=5.143
Epoch[0] Validation-accuracy=0.871875
Epoch[1] Batch [60]	Speed: 243.90 samples/sec	accuracy=0.968333
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=4.951
Epoch[1] Validation-accuracy=0.887500
Epoch[2] Batch [60]	Speed: 273.01 samples/sec	accuracy=0.992500
Epoch[2] Resetting Data Iterator
Epoch[2] Time cost=4.430
Epoch[2] Validation-accuracy=0.890625
Epoch[3] Batch [60]	Speed: 268.31 samples/sec	accuracy=0.992500
Epoch[3] Resetting Data Iterator
Epoch[3] Time cost=4.503
Epoch[3] Validation-accuracy=0.893750
Epoch[4] Batch [60]	Speed: 213.13 samples/sec	accuracy=0.994167
Epoch[4] Resetting Data Iterator
Epoch[4] Time cost=5.668
Epoch[4] Validation-accuracy=0.884375
Epoch[5] Batch [60]	Speed: 205.69 samples/sec	accuracy=0.998333
Epoch[5] Resetting Data Iterator
Epoch[5] Time cost=5.977
Epoch[5] Validation-accuracy=0.887500
Epoch[6] Batch [60]	Speed: 199.15 sample

In [9]:
print("Initializing using Glove Embeddings")
pred_probs = model3.predict(x_test)
preds = pred_probs.argmax(axis=1)
acc = np.sum(preds==y_test)/len(y_test)
print("Final Accuracy on test Set : ",acc)
fs = fscore(y_test,preds)
print("F1 score for Positive class : ",fs[2][1])
print("F1 score for Negative class : ",fs[2][0])

Initializing using Glove Embeddings
Final Accuracy on test Set :  0.8973509933774835
F1 score for Positive class :  0.43636363636363634
F1 score for Negative class :  0.9435336976320582


## Results of Experimentation

<b>Other metrics</b> - F1 score can be a good metric for skewed datasets. But since our dataset is highly skewed, we might even consider <b>Micro F1 Score</b> for the positive class only. We have displayed both the metrics (accuracy, micro F1 for Positive class) for all our experiments<br>
<b>Experiments</b><br>
<b>1. Naive model (Random Initialisation of Emebddings)</b> - <br>
Final Accuracy on test Set :  0.8973509933774835<br>
F1 score for Positive class :  0.43636363636363634<br>
F1 score for Negative class :  0.9435336976320582<br>
<b>2. Fasttext Embeddings (300 dimensional)</b> - <br>
Final Accuracy on test Set :  0.9006622516556292<br>
F1 score for Positive class :  0.46428571428571436<br>
F1 score for Negative class :  0.9452554744525548<br>
The accuracy and micro f1 both improved for 300 dimensional fasttext embeddings<br>
<b>3. Glove Embeddings (200 dimensional)</b> - <br>
Final Accuracy on test Set :  0.9105960264900662<br>
F1 score for Positive class :  0.5573770491803278<br>
F1 score for Negative class :  0.9502762430939227<br>
The use of 200 dimensional Glove embeddings has given best performance on test set in terms of both Accuracy and micro F1 Score.