#### This assignment focuses on text classification using a convolutional neural network (CNN). The approach has been picking up pace over the past few years, so I thought it would be a good exercise to try out. The dataset is provided to you, along with specific instructions on how to clean the data, split it into training and validation sets, and so on. You will be using MXNet for this task. The data comprises tweets pertaining to common causes of cancer. The objective is to classify each tweet as medically relevant or not. The dataset is skewed: the positive class ('yes') is about 6 times less frequent than the negative class ('no'). (Total marks = 50.) Marks for the individual sub-problems are given in brackets.

In [4]:
# these are the modules you are allowed to work with. 

import nltk
import re
import numpy as np
import mxnet as mx
import sys, os
from numpy import random
from collections import Counter, namedtuple

'''
The first job is to clean and preprocess the social media text. (5)

1) Replace URLs and mentions (i.e. strings preceded by @)
2) Segment #hashtags
3) Remove emoticons and other Unicode characters
'''

def preprocess_tweet(input_text):
    '''
    Input: The input string read directly from the file
    
    Output: Pre-processed tweet text
    '''
    cleaned_text = input_text
    
    return cleaned_text


# Read the input file and create the lists of positive and negative examples.
# Each line holds the tweet text and its label ('yes' or 'no'), separated by a tab.

pos_data=[]
neg_data=[]

# The with-block closes the file once all lines have been read.
with open('cancer_data.tsv') as file:
    for line in file:
        line=line.strip().split('\t')
        text2= preprocess_tweet(line[0]).strip().split()
        if line[1]=='yes':
            pos_data.append(text2)
        elif line[1]=='no':
            neg_data.append(text2)

print(len(pos_data), len(neg_data))     

sentences= list(pos_data)
sentences.extend(neg_data)
pos_labels= [1 for _ in pos_data]
neg_labels= [0 for _ in neg_data]
y=list(pos_labels)
y.extend(neg_labels)
y=np.array(y)

'''
After this you will have the following:

1) sentences = list of tokenized sentences, with all the positive examples first, followed by the negative examples
2) y = array of labels (1 = positive, 0 = negative), with the positive labels first.
'''

'''
Before running the CNN there are a few things one needs to take care of: (5)

1) Pad the sentences so that all of them are of the same length
2) Build a vocabulary comprising all unique words that occur in the corpus
3) Convert each sentence into a corresponding vector by replacing each word in the sentence with the index in the vocabulary. 

Example :
S1 = a b a c
S2 = d c a 

Step 1:  S1= a b a c, 
         S2 =d c a </s> 
         (Both sentences are of equal length). 

Step 2:  voc={a:1, b:2, c:3, d:4, </s>: 5}

Step 3:  S1= [1,2,1,3]
         S2= [4,3,1,5]

'''

def create_word_vectors(sentences):
    '''
    Input: List of sentences
    Output: Array of word-index vectors corresponding to each sentence, vocabulary
    '''
    # Count word frequencies and find the length of the longest sentence.
    # Padding is added only after counting, so '</s>' keeps a count of 0 in cnt.
    cnt = Counter({'</s>' : 0})
    mx_len = 0
    for sent in sentences:
        mx_len = max(len(sent), mx_len)
        cnt.update(sent)
    vocabulary = mx.contrib.text.vocab.Vocabulary(cnt)

    # Pad every sentence to mx_len with '</s>' and replace each word by its vocabulary index.
    word_vectors = []
    for sent in sentences:
        pad_length = mx_len - len(sent)
        padded_sent = sent + ['</s>']*pad_length
        word_vectors.append(vocabulary.to_indices(padded_sent))
    word_vectors = np.array(word_vectors)
    return word_vectors, vocabulary


x, vocabulary = create_word_vectors(sentences)
print(x.shape, len(vocabulary))

def create_shuffle(x,y):
    '''
    Create an equal distribution of the positive and negative examples. 
    Please do not change this particular shuffling method.
    '''
    pos_len= len(pos_data)
    neg_len= len(neg_data)
    pos_len_train= int(0.8*pos_len)
    neg_len_train= int(0.8*neg_len)
    train_data= [(x[i],y[i]) for i in range(0, pos_len_train)]
    train_data.extend([(x[i],y[i]) for i in range(pos_len, pos_len+ neg_len_train )])
    test_data=[(x[i],y[i]) for i in range(pos_len_train, pos_len)]
    test_data.extend([(x[i],y[i]) for i in range(pos_len+ neg_len_train, len(x) )])
    
    random.shuffle(train_data)
    x_train=[i[0] for i in train_data]
    y_train=[i[1] for i in train_data]
    random.shuffle(test_data)
    x_test=[i[0] for i in test_data]
    y_test=[i[1] for i in test_data]
    
    x_train=np.array(x_train)
    y_train=np.array(y_train)
    x_test= np.array(x_test)
    y_test= np.array(y_test)
    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test= create_shuffle(x,y)




208 1298
(1506, 122) 15070
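The `preprocess_tweet` stub above returns its input unchanged. As a rough illustration of the three cleaning steps, here is a minimal sketch using only `re`. The function name `preprocess_tweet_sketch`, the placeholder tokens `<url>` and `<user>`, and the CamelCase rule for segmenting hashtags are assumptions made for this example; the assignment does not prescribe any of them.

```python
import re

# Assumed patterns; real tweets may need broader rules.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#(\w+)')
CAMEL_RE = re.compile(r'(?<=[a-z])(?=[A-Z])')  # boundary between a lower- and an upper-case letter

def segment_hashtag(match):
    # Drop the '#' and split a CamelCase body into words: "#BreastCancer" -> "Breast Cancer".
    return CAMEL_RE.sub(' ', match.group(1))

def preprocess_tweet_sketch(input_text):
    text = URL_RE.sub('<url>', input_text)         # 1) replace URLs (placeholder is an assumption)
    text = MENTION_RE.sub('<user>', text)          #    replace @mentions (placeholder is an assumption)
    text = HASHTAG_RE.sub(segment_hashtag, text)   # 2) segment hashtags
    # 3) drop emoticons and every other non-ASCII character (this also removes curly quotes)
    text = text.encode('ascii', 'ignore').decode('ascii')
    return ' '.join(text.split())                  # collapse repeated whitespace

example = "Check https://t.co/abc @john #BreastCancer awareness \U0001F600"
print(preprocess_tweet_sketch(example))  # Check <url> <user> Breast Cancer awareness
```

All-lowercase hashtags such as `#breastcancer` pass through unsegmented under this rule; a dictionary-based word segmenter would be needed for those.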


In [7]:
'''
We now define the neural architecture of the CNN. The architecture is defined as follows: (10)

1) An embedding layer that converts each sentence from a sequence of vocabulary indices (equivalent to one-hot encodings) into fixed-size word embeddings
   (mx.sym.Embedding)
   
2) A convolution + activation + max-pooling layer
   (mx.sym.Convolution + mx.sym.Activation + mx.sym.Pooling)
   This procedure is to be repeated for each filter size (filters of size 2 look at bigrams, filters of size 3 look at trigrams, etc.).

3) Concatenate the outputs of all the filters (mx.sym.Concat)

4) Pass the result through a fully connected layer of size 2 and then run softmax on it.
   (mx.sym.FullyConnected, mx.sym.SoftmaxOutput)
   

We then initialize the intermediate layers with appropriate sizes and train the model using backpropagation. (10)
(Look up the MXNet tutorial if you have any doubts.)

Run the classifier and, for each epoch with the specified batch size, observe the accuracy on the training set and the test set. (5)


Default parameters:

1) Number of epochs = 10
2) Batch size = 20
3) Size of word embeddings = 200
4) Filter sizes = [2,3,4,5]
5) Number of filters per filter size = 100
6) Optimizer = rmsprop
7) Learning rate = 0.005

'''
#EMBEDDINGS_FILE = 'glove.6B.200d.txt'

#embeddings = mx.contrib.text.embedding.GloVe(EMBEDDINGS_FILE,embedding_root='./',vocabulary=vocabulary)
def create_model(vocab_size,max_time,out_dim=2,embedding_dim=200,batch_size=20):
    input_data = mx.sym.Variable('data')
    output_labels = mx.sym.Variable('softmax_label')
    input_embed = mx.sym.Embedding(data=input_data,input_dim=vocab_size,output_dim=embedding_dim)
    conv_inp = mx.sym.Reshape(data=input_embed, shape=(-1,1,max_time,embedding_dim))
    filter_sizes = [2,3,4,5]
    num_filters = 100
    conv_outs = []
    for filter_size in filter_sizes:
        out = mx.sym.Convolution(data=conv_inp,kernel=(filter_size,embedding_dim),num_filter=num_filters)
        out = mx.sym.Activation(data=out,act_type='relu')
        out = mx.sym.Pooling(data=out,pool_type='max',kernel=(max_time - filter_size + 1,1))
        conv_outs.append(out)
        
    all_outs = mx.sym.Concat(*conv_outs,dim=1)
    all_outs = mx.sym.Reshape(data=all_outs,shape=(-1,len(filter_sizes)*num_filters))
    scores = mx.sym.FullyConnected(data=all_outs,num_hidden=out_dim)
    probs = mx.sym.SoftmaxOutput(data=scores,name='softmax')
    
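    # Pass batch_size through; otherwise FeedForward falls back to its default batch size for NumPy input.
    model = mx.model.FeedForward(probs,optimizer='rmsprop',num_epoch=10,numpy_batch_size=batch_size,learning_rate=0.005)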
    return model
MAX_TIME = x_train.shape[1]
model = create_model(len(vocabulary),MAX_TIME)
model.fit(X=x_train,y=y_train)

  self.initializer(k, v)
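To make the shapes in the architecture above concrete, here is a minimal NumPy sketch (not MXNet) of what the convolution + ReLU + max-pooling branches compute for a single embedded sentence. The weights are random, there is no bias term, and the name `conv_relu_maxpool` is hypothetical; it only illustrates why the concatenated feature vector has `len(filter_sizes) * num_filters` entries.

```python
import numpy as np

def conv_relu_maxpool(embedded, filters):
    """embedded: (max_time, embedding_dim); filters: (num_filter, filter_size, embedding_dim)."""
    max_time, _ = embedded.shape
    num_filter, filter_size, _ = filters.shape
    n_windows = max_time - filter_size + 1           # same length as the Pooling kernel in create_model
    conv = np.empty((num_filter, n_windows))
    for t in range(n_windows):
        window = embedded[t:t + filter_size]         # one n-gram of word embeddings
        conv[:, t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    relu = np.maximum(conv, 0.0)                     # ReLU activation
    return relu.max(axis=1)                          # max over time: one value per filter

rng = np.random.RandomState(0)
max_time, embedding_dim, num_filter = 122, 200, 100  # 122 is the padded sentence length printed above
embedded = rng.randn(max_time, embedding_dim)        # stand-in for the Embedding layer output
features = np.concatenate([
    conv_relu_maxpool(embedded, rng.randn(num_filter, k, embedding_dim))
    for k in (2, 3, 4, 5)
])
print(features.shape)  # (400,)
```

The 400-dimensional vector is what the final fully connected layer of size 2 receives for each sentence.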


In [8]:
pred = model.predict(x_test)
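The prediction step can be turned into class labels and an accuracy figure as sketched below. This assumes that `pred` is an array of shape `(n_examples, 2)` holding one score per class for each test tweet; the helper names and the toy arrays are hypothetical and stand in for `pred` and `y_test`.

```python
import numpy as np

def predicted_labels(probs):
    # Pick the class with the highest score in each row.
    return np.argmax(probs, axis=1)

def accuracy(y_true, y_pred):
    # Fraction of examples whose predicted label matches the true label.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy stand-ins for model.predict(x_test) and y_test.
toy_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
toy_true = np.array([0, 1, 1, 1])

toy_pred = predicted_labels(toy_probs)
print(toy_pred, accuracy(toy_true, toy_pred))  # [0 1 0 1] 0.75
```

With the real arrays, the equivalent calls would be `predicted_labels(pred)` and `accuracy(y_test, predicted_labels(pred))`.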

In [28]:
corr

array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,

In [91]:

def create_model(vocab_size,max_time,out_dim=2,embedding_dim=200,batch_size=20):
    input_data = mx.sym.Variable('input')
    output_labels = mx.sym.Variable('output')
    input_embed = mx.sym.Embedding(data=input_data,input_dim=vocab_size,output_dim=embedding_dim)
    conv_inp = mx.sym.Reshape(data=input_embed, shape=(-1,1,max_time,embedding_dim))
    filter_sizes = [2,3,4,5]
    num_filters = 100
    conv_outs = []
    for filter_size in filter_sizes:
        out = mx.sym.Convolution(data=conv_inp,kernel=(filter_size,embedding_dim),num_filter=num_filters)
        out = mx.sym.Activation(data=out,act_type='relu')
        out = mx.sym.Pooling(data=out,pool_type='max',kernel=(max_time - filter_size + 1,1))
        conv_outs.append(out)
        
    all_outs = mx.sym.Concat(*conv_outs,dim=1)
    all_outs = mx.sym.Reshape(data=all_outs,shape=(-1,len(filter_sizes)*num_filters))
    scores = mx.sym.FullyConnected(data=all_outs,num_hidden=out_dim)
    probs = mx.sym.SoftmaxOutput(data=scores,label=output_labels)
    
    cnn = probs
    return cnn


ctx = mx.cpu()
BATCH_SIZE = 20
MAX_TIME = x_train.shape[1]
cnn = create_model(len(vocabulary),MAX_TIME,out_dim=2, embedding_dim=200,batch_size=BATCH_SIZE)
CNNModel = namedtuple('CNNModel',['cnn_exec','symbol','input','label','param_blocks'])
input_shapes = {}
input_shapes['input'] = (BATCH_SIZE,MAX_TIME)
arg_shape, out_shape, aux_shape = cnn.infer_shape(**input_shapes)
arg_names = cnn.list_arguments()
arg_arrays = [mx.nd.zeros(s, ctx) for s in arg_shape]
args_grad = {}
for shape, name in zip(arg_shape, arg_names):
    if name not in ['input', 'output']:
        args_grad[name] = mx.nd.zeros(shape, ctx)

cnn_exec = cnn.bind(ctx=ctx,args=arg_arrays,args_grad=args_grad,grad_req='add')
param_blocks = []
arg_dict = dict(zip(arg_names, cnn_exec.arg_arrays))
initializer = mx.initializer.Uniform(0.1)
for i, name in enumerate(arg_names):
    if name not in ['input', 'output']:
        initializer(mx.init.InitDesc(name), arg_dict[name])
        param_blocks.append( (i, arg_dict[name], args_grad[name], name) )

input_data = cnn_exec.arg_dict['input']
output_label = cnn_exec.arg_dict['output']

cnn_model= CNNModel(cnn_exec=cnn_exec, symbol=cnn, input=input_data, label=output_label, param_blocks=param_blocks)



In [2]:
'''
So far, the assignment has been posed so that you can refer directly to the MXNet tutorial on the same problem.

The final 15 marks are for carrying out experiments of your own and observing how the results change.

1) Would the results improve if, instead of using word embeddings based solely on frequency, you incorporated sub-word information?
   (In short, run fastText on the corpus and use the word embeddings it generates.) (8)
   
2) Accuracy might not be the best way to measure performance on a skewed dataset. What other metrics would you use, and why?
   Experiment with different hyper-parameters and report the performance in terms of those metrics.
   You can assume that we care more about identifying all the medically relevant tweets (i.e. tweets of the 'yes' class). (7)
    

Deliverables:

The IPython notebook with the results for each part of the question.


P.S.: This assignment is part of a research question I am working on in my free time. So if you have any insights, I'd love to hear them.
Happy coding!

Ritam Dutt
14CS30041

'''
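For question 2, precision, recall and F1 on the 'yes' class are common choices for skewed data; the assignment does not name a metric, so treat this as one possible answer. The sketch below computes them with NumPy and shows why accuracy alone is misleading: a classifier that always answers 'no' scores high accuracy on a test split with the class counts printed earlier (208 positive, 1298 negative, 80/20 split), yet finds no relevant tweets. The name `binary_metrics` is hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Test-split sizes implied by create_shuffle: the last 20% of each class.
n_pos_test = 208 - int(0.8 * 208)
n_neg_test = 1298 - int(0.8 * 1298)
y_true = np.array([1] * n_pos_test + [0] * n_neg_test)

# Majority-class baseline: predict 'no' for every tweet.
always_no = np.zeros_like(y_true)
baseline_accuracy = float(np.mean(y_true == always_no))
baseline_metrics = binary_metrics(y_true, always_no)
print(round(baseline_accuracy, 3), baseline_metrics)  # 0.861 (0.0, 0.0, 0.0)
```

Because the goal is to identify all the medically relevant tweets, recall on the 'yes' class is the number to watch when comparing hyper-parameter settings, with precision or F1 as a check against predicting 'yes' everywhere.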





['As',
 'people’s',
 'anger',
 'boils',
 'over',
 'in',
 'the',
 'streets',
 'there',
 'are',
 'talks',
 'that',
 'the',
 'Rouhani',
 'government',
 'should',
 'resign.',
 'But',
 'you',
 'do',
 'not',
 'cure',
 'cancer',
 'with',
 'an',
 'aspirin.',
 'The',
 'tumor',
 'needs',
 'to',
 'be',
 'removed-',
 '-',
 'and',
 'tumor',
 'is',
 'IRI',
 'and',
 'the',
 'surgeons',
 'are',
 'the',
 'people.']