# Assignment 5

In this assignment, you will build a multi-class neural network classifier for text classification. You will first need to import the libraries. Then you will need to pre-process your data by removing stop words and stemming. After cleaning the data, you will download a pretrained word embedding and use the embedding to give each word a vector. The vectors will be the features of your classifier. You will split your data into training (80%) and validation (20%). Then, you will train your neural network and find the best model using random search.

N.B.: You must use the TensorFlow library (version 1.4).

(Hint: check tf.keras)

## Import Libraries

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
import pickle
import itertools
from keras.layers import Dense, Embedding, Flatten
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

## Load Data

In [22]:
'''implement your code'''
dataset = pd.read_csv('Tweets.csv', encoding ='latin1').values
X, y = dataset[:, 1], dataset[:, 0]

## Clean Data

Pre-process your data by removing stop words and stemming the remaining tokens.

In [23]:
##### CLEAN HERE
# Remove URLs, handles, and punctuation
def remove_urls_handles_punc(X):
    new_tweets = []
    for tweet in X:
        new_tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
        new_tweets.append(new_tweet)
    return new_tweets

def tokenize(X):
    return [word_tokenize(document) for document in X]

def remove_punc_stopwords_stem(X):
    # see documentation here: http://docs.python.org/2/library/string.html
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # build the stop-word set and the stemmer once instead of once per token
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer("english")
    X_cleaned_stemmed = []

    for tweet in X:
        new_tweet = []
        for token in tweet:
            new_token = regex.sub(u'', token)
            # keep the token only if it is neither punctuation nor a stop word
            if new_token != u'' and token.lower() not in stop_words:
                new_token = stemmer.stem(new_token)
                new_tweet.append(new_token)
        X_cleaned_stemmed.append(new_tweet)
    return X_cleaned_stemmed

def words_list_to_string(X):
    return np.array([[' '.join(word for word in document)] for document in X]).flatten()

def preprocess(X):
#     X = remove_urls_handles_punc(X)
    
    X = tokenize(X)
    
    X = remove_punc_stopwords_stem(X)

    X = words_list_to_string(X)
    
    return X

X = preprocess(X)

print(X)

['virginamerica dhepburn said'
 'virginamerica plus ve ad commerci experi tacki'
 'virginamerica nt today must mean need take anoth trip' ...,
 'americanair pleas bring american airlin blackberry10'
 'americanair money chang flight nt answer phone suggest make commit'
 'americanair 8 ppl need 2 know mani seat next flight plz put us standbi 4 peopl next flight']
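
The helper `remove_urls_handles_punc` is defined above, but its call inside `preprocess` is commented out. As a minimal sketch of what its regular expression does, the snippet below applies the same pattern to a made-up tweet (the sample text and the name `strip_noise` are assumptions for illustration only).

```python
import re

# Same three alternatives as in remove_urls_handles_punc:
#   1. @handles   2. any character that is not alphanumeric, space or tab   3. URLs
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def strip_noise(tweet):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", tweet).split())

# Made-up example tweet (not taken from Tweets.csv)
sample = "@VirginAmerica I love this! http://t.co/abc123 #fly"
cleaned = strip_noise(sample)
print(cleaned)  # I love this fly
```

Because the handle is removed entirely, enabling this step would drop tokens such as `virginamerica` that appear in the output above.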


## Binarize Labels

In [24]:
encoder = LabelBinarizer()
y = encoder.fit_transform(y)
print(y)

[[0 1 0]
 [0 0 1]
 [0 1 0]
 ..., 
 [0 1 0]
 [1 0 0]
 [0 1 0]]
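
To read the one-hot rows above, it helps to know that `LabelBinarizer` orders its columns by the sorted class labels (available as `encoder.classes_`). The dependency-free sketch below mimics that behaviour; the three label names are an assumption, since the notebook does not print the classes.

```python
# Hypothetical label values; check encoder.classes_ for the real ones.
labels = ["neutral", "positive", "negative", "neutral"]

# Columns follow the sorted unique labels, as LabelBinarizer does.
classes = sorted(set(labels))
one_hot = [[int(label == c) for c in classes] for label in labels]

# Decoding: the position of the 1 in each row indexes into `classes`.
decoded = [classes[row.index(1)] for row in one_hot]

print(classes)   # ['negative', 'neutral', 'positive']
print(one_hot)   # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```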


## Word Embedding

Download the word embedding from this link http://nlp.stanford.edu/data/glove.twitter.27B.zip and create the embedding matrix to be used in the embedding layer. You have to use the embedding file of dimension 50.

In [25]:
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(X)

vocab_size = len(t.word_index) + 1

# integer encode the documents
encoded_docs = t.texts_to_sequences(X)

# pad documents to a max length of 200 words
max_length = 200
X = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

# load the whole embedding into memory
embeddings_index = dict()
# GloVe files are UTF-8 encoded; the with-block closes the file automatically
with open('glove.twitter.27B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 50))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

e = Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=max_length, trainable=False)
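
The following dependency-free sketch walks through the same two steps on toy data: parsing GloVe-style lines into a lookup table, then filling a matrix whose row `i` holds the vector of the word with index `i`. The three-dimensional vectors and the vocabulary are made up for illustration. The sketch also counts how many vocabulary words were found, which may be worth checking here because stemmed tokens such as `standbi` are not guaranteed to appear in the pretrained file.

```python
import io

# Three made-up lines in GloVe text format: a token followed by its vector.
# The real file used above has 50 dimensions; 3 are used here for brevity.
glove_text = io.StringIO(
    "flight 0.1 0.2 0.3\n"
    "delay -0.4 0.5 -0.6\n"
    "thank 0.7 -0.8 0.9\n"
)

embeddings_index = {}
for line in glove_text:
    token, *coefs = line.split()
    embeddings_index[token] = [float(c) for c in coefs]

# Hypothetical tokenizer vocabulary; indices start at 1, as in t.word_index.
word_index = {"flight": 1, "thank": 2, "standbi": 3}

dim = 3
vocab_size = len(word_index) + 1  # +1 because row 0 is reserved for padding
embedding_matrix = [[0.0] * dim for _ in range(vocab_size)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Words missing from the embedding file keep an all-zero row.
covered = sum(word in embeddings_index for word in word_index)
print(embedding_matrix[3])
print(f"{covered}/{len(word_index)} vocabulary words have a pretrained vector")
```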

## Split Data

Split your data into training (80%) and validation (20%).

In [26]:
X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size=0.20)

print(X_train.shape)

(11712, 200)


## Exercise 1

You will train a neural network for 100 epochs with a batch size of 32, without any hyperparameter tuning.

The architecture should be as follows:
- One embedding layer (do not retrain the embeddings; use the pretrained ones)
- One hidden layer with 50 units
- One output layer
- The activation of the hidden layer is ReLU
- The activation of the output layer is softmax
- The loss function is categorical cross-entropy
- The optimizer is RMSProp

### Create Model

Create the above neural network architecture.

In [None]:
# define model
model = Sequential()

# add embedding layer
model.add(e)

# flatten the embedding layer into 1d
model.add(Flatten())

# add one hidden layer with 50 units
model.add(Dense(50, activation='relu'))

# add softmax output layer with 3 outputs
model.add(Dense(3, activation='softmax'))

# compile the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 200, 50)           685000    
_________________________________________________________________
flatten_3 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_5 (Dense)              (None, 50)                500050    
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 153       
Total params: 1,185,203
Trainable params: 500,203
Non-trainable params: 685,000
_________________________________________________________________
None
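
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes; `vocab_size` is inferred from the embedding layer's 685,000 parameters rather than stated anywhere in the notebook.

```python
embedding_dim = 50
max_length = 200
hidden_units = 50
num_classes = 3
vocab_size = 685000 // embedding_dim  # 13,700 rows, inferred from the summary

# Embedding: one 50-dimensional vector per vocabulary row (frozen, so non-trainable)
embedding_params = vocab_size * embedding_dim

# Flatten turns (200, 50) into a single vector of 10,000 features
flattened = max_length * embedding_dim

# Dense layers: one weight per input-unit pair, plus one bias per unit
hidden_params = flattened * hidden_units + hidden_units
output_params = hidden_units * num_classes + num_classes

trainable = hidden_params + output_params
total = embedding_params + trainable
print(embedding_params, hidden_params, output_params)  # 685000 500050 153
print(trainable, total)                                # 500203 1185203
```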


### Training

Train your model on the training dataset.

In [None]:
# fit the model
model.fit(X_train, y_train, epochs=100, batch_size=32)

Epoch 1/100
...
Epoch 88/100

### Testing

Test your model on the validation set and compute the F-measure and accuracy.

In [None]:
# evaluate the model
loss, accuracy = model.evaluate(X_validate, y_validate, verbose=0)

print('Accuracy: %f' % (accuracy * 100))
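
The cell above reports accuracy only. As a minimal sketch of the F-measure part, the function below computes a macro-averaged F1 score from class indices. Macro averaging is an assumption, since the assignment does not say which averaging to use, and the name `macro_f1` and the toy labels are illustrative. In the notebook, the inputs could be obtained with `y_validate.argmax(axis=1)` and `model.predict(X_validate).argmax(axis=1)`.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy class indices, not results from the model above
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print('F-measure (macro): %f' % score)
```

For the toy labels, the per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.656.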

## Exercise 2

### Random Search

Write the random search function. You will use it in Exercise 3 to find the best hyperparameters.

In [None]:
'''implement your code'''
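
One possible shape for such a function is sketched below. It is not a reference solution: the search ranges, the names `sample_config` and `random_search`, and the toy scoring function are all assumptions. In the notebook, `evaluate` would build and train a model from the sampled configuration and return its validation accuracy.

```python
import math
import random

def sample_config(rng):
    """Draw one hyperparameter configuration (ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),   # log-uniform in [1e-4, 1e-2]
        "dropout": rng.uniform(0.0, 0.5),
        "hidden_units": rng.choice([25, 50, 100, 200]),
        "batch_size": rng.choice([16, 32, 64, 128]),
        "lr_decay": 10 ** rng.uniform(-6, -3),        # log-uniform in [1e-6, 1e-3]
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Try n_trials random configurations and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy stand-in for "train and return validation accuracy":
# it simply prefers learning rates close to 1e-3.
def toy_evaluate(config):
    return -abs(math.log10(config["learning_rate"]) + 3)

best_config, best_score = random_search(toy_evaluate, n_trials=50, seed=42)
print(best_config, best_score)
```

Sampling the learning rate and decay on a log scale spreads trials evenly across orders of magnitude, which a plain uniform draw would not do.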

## Exercise 3

### Hyperparameters Tuning

You will tune the hyperparameters of the above architecture using random search by validating on the validation dataset.

Plot the learning curve of the best model (loss versus number of epochs). You should show both the training loss and the validation loss.

You should also report the values of the hyperparameters of your best model and the validation accuracy and F-measure.

The hyperparameters that need to be tuned are:
- Learning rate
- Dropout
- Number of hidden units
- Mini-batch size
- Learning rate decay
- Number of layers (tune this manually, not through random search)
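
To get a feel for the learning rate decay hyperparameter, the sketch below assumes the time-based schedule that Keras optimizers of this era apply through their `decay` argument, where the rate shrinks with the number of mini-batch updates. The initial rate and decay value are made-up examples; the steps per epoch follow from the 11,712 training examples and a batch size of 32.

```python
def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: lr = initial_lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)

steps_per_epoch = 11712 // 32  # 366 mini-batch updates per epoch

# Illustrative values, not tuned results
initial_lr, decay = 0.001, 1e-3
for epoch in (0, 10, 100):
    lr = decayed_learning_rate(initial_lr, decay, epoch * steps_per_epoch)
    print('epoch %3d: learning rate %.6f' % (epoch, lr))
```

Because the schedule counts updates rather than epochs, the effective decay also depends on the mini-batch size, so the two hyperparameters interact.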

In [None]:
'''implement your code'''

### Save Model

Save your best model.

In [None]:
'''implement your code'''

Rename the Jupyter notebook to Assignment5_*netid*.ipynb (e.g. Assignment5_xyz01.ipynb) and upload it to Moodle no later than Friday, Dec 1, 11:55 pm.