# 20 Newsgroups dataset modelling using deep learning

The libraries used for building the model are : 
<br>
&ensp; 1) Keras deep learning library
<br>
&ensp; 2) Theano as computational backend 
<br>
&ensp; 3) Intel Distribution for Python, for its optimized Math Kernel Library (MKL)
<br>
&ensp; 4) Gensim library for text mining and building word embeddings
<br>
&ensp; 5) nltk and rake_nltk libraries for tokenizing and text cleaning
<br>
&ensp; 6) sklearn library for metrics 
<br>
&ensp; 7) scipy and numpy for scientific computations
<br>

System configuration used for training : 
<br>
&ensp;      MEMORY : 16384MB RAM
<br>
 &ensp;      CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (8 CPUs), ~3.6GHz
<br>
&ensp;       OS     : Windows 7
<br>
 &ensp;      Addons : Intel Math kernel library
<br>
&ensp;       GPU    : No



    
<img src="architecture.png">

# Design Considerations
&ensp;1) Traditional word-frequency-based approaches do not capture information about the context in which a particular word is used.
<br>
&ensp;2) To capture that context, I used gensim to train a word2vec model, which learns a dense vector for each word from the words that appear around it in the text corpus, so that words used in similar contexts get similar vectors.
<br>
&ensp;3) The next design consideration was which neural network architecture to choose and which type of model to use, i.e. discriminative or generative.
<br>
&ensp;4) Generative models assume that the data comes from an underlying stochastic distribution. The number of parameters to learn therefore becomes very high, and as the corpus size increases the calculation of the latent feature vector becomes intractable. This is one of the reasons not to use generative models such as Restricted Boltzmann Machines (RBMs), which rely on Markov chains.
<br>
&ensp;5) The same applies to Deep Belief Networks and other graphical generative models, as they use RBMs in their initial layers. So, I decided not to proceed with generative modelling.
<br>
&ensp;6) Since the task is text classification and each word has some context associated with its preceding and succeeding words, a sliding-window model is a better fit.
<br>
&ensp;7) This leaves the choice between convolutional networks and LSTM (Long Short-Term Memory), a type of recurrent network.
<br>
&ensp;8) The architecture of a recurrent neural network is shown below; it uses the state from time step T when processing step T+1. The problem with such a model is that, over sequences with thousands of time steps, the gradient propagated back through the 1000th step either shrinks towards 0 or grows towards infinity, so the recurrent network suffers from the so-called vanishing/exploding gradient problem.
    <img src = "recurrentnetwork.PNG">
<br>
&ensp;9) To tackle this issue, we use an LSTM model, whose gates decide at each step how much of the previous state to keep and how much to discard. This lets the network carry information across long sequences without the gradient vanishing.
    <img src = "lstm.png">
<br>
&ensp;10) On the other hand, convolutional neural networks can also work well for text, but a convolutional network is usually set up to operate on fixed-length inputs.
<br>
&ensp;11) Since text sentences are not typically of fixed length, all the input vectors need to be padded to the same size, which increases the number of features. So, I chose LSTM over convolution.
<br>
&ensp;12) The loss function chosen for optimization was categorical cross-entropy, which measures the misclassification error over the categories. It is provided by Keras and computed by the Theano backend.
<br>
&ensp;13) The optimizer chosen was Adam, which I usually use and which tends to outperform the others on the majority of occasions.
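
Consideration 8 can be made concrete with a small numeric sketch. It is a deliberate simplification, not the actual backpropagation used by Keras: the gradient that reaches the first time step is modelled as the product of one identical factor per step, and the function name and the factors 0.9 and 1.1 are illustrative assumptions.

```python
# Sketch only: models the gradient reaching step 0 of a recurrent network as a
# product of one identical factor per time step (a deliberate simplification).
def backpropagated_gradient(factor, steps):
    """Multiply a starting gradient of 1.0 by `factor` once per time step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= factor
    return gradient

shrinking = backpropagated_gradient(0.9, 1000)  # factor below 1: gradient vanishes
growing = backpropagated_gradient(1.1, 1000)    # factor above 1: gradient explodes

print("factor 0.9 over 1000 steps:", shrinking)
print("factor 1.1 over 1000 steps:", growing)
```

Even a factor that is only slightly away from 1 gives a gradient that is either negligible or enormous after 1000 steps, which is the behaviour that motivates the LSTM in consideration 9.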

In [2]:
## Libraries used
import os
os.environ['KERAS_BACKEND'] = 'theano'   # must be set before keras is imported
os.environ['OMP_NUM_THREADS'] = '6'      # number of threads theano can use
os.environ['MKL_NUM_THREADS'] = '6'
import io
import theano
import string
theano.config.openmp = True
from rake_nltk import Rake
from keras import backend as K
import numpy as np
import scipy.sparse as sp
import random
from keras.utils import np_utils
from keras.callbacks import Callback
from nltk import word_tokenize
from nltk.corpus import stopwords
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.models import load_model
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Dropout
import gensim.models.word2vec as w2v
from gensim.corpora.dictionary import Dictionary
from keras.optimizers import Adam, RMSprop
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt


In [2]:
## Global declaration section
corpus_dict = {}
vectorarray = []
x_vector = []
vocab_len = 200   # dimensionality of the word vectors and of the LSTM layer
batch_size = 40
n_epoch = 8
y_train = []
testdatapath = r'C:\Users\bxt160230\PycharmProjects\Zefr\20news-bydate-test'
traindatapath = r'C:\Users\bxt160230\PycharmProjects\Zefr\20news-bydate-train'

In [3]:
## Helper class that is used for logging the loss per batch of training per epoch

class NBatchLogger(Callback):
    def __init__(self, display):
        super().__init__()
        self.seen = 0            # number of samples seen so far
        self.display = display   # log whenever `seen` is a multiple of this value

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.seen += logs.get('size', 0)
        if self.seen % self.display == 0:
            # `logs` holds the metrics of the batch that has just finished
            print("- Batch Loss:", self.seen, logs.get('loss'))

In [None]:
## Method description
## Plots the summary of training / validation accuracy and loss for each epoch.

def plot_charts(history):
    plt.figure()
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.savefig('accuracy.png')
    plt.figure()   # new figure, so the loss curves are not drawn over the accuracy curves
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.savefig('loss.png')
    return

In [None]:
## Method used for training the model.
## Starts by building a sequential model.
## 1) First is the embedding layer, initialised with the word2vec weights
## 2) Second is the LSTM layer
## 3) Dropout is used for regularization: it randomly sets 30% of the LSTM outputs to zero during training
## 4) Dense layer with sigmoid activation for classifying the labels
##    (softmax is the more conventional pairing with categorical cross-entropy)

def train_model(x_train, y_train, n_symbols, embedding_weights, input_length, num_classes):
    model = Sequential()
    model.add(Embedding(output_dim=vocab_len, input_dim=n_symbols, mask_zero=True, weights=[embedding_weights], input_length=input_length))
    model.add(LSTM(vocab_len))
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='sigmoid'))
    model.summary()
    print('Compiling the Model...')
    adam = Adam(lr=0.01)
    model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    print("Train...")
    out_batch = NBatchLogger(display=batch_size)
    y_train = np.array(y_train)
    # 20% of the training data is held out for validation after each epoch
    history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch, shuffle=True, validation_split=0.2, callbacks=[out_batch])
    plot_charts(history)
    return model
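
To make the loss passed to `model.compile` concrete, here is a minimal pure-Python sketch of categorical cross-entropy for a single sample. It is not the Keras/Theano implementation, the function names and example scores are illustrative, and it uses softmax to produce probabilities, whereas the model above uses a sigmoid output layer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0)

# Hypothetical 3-class example: the true class is class 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)

# Reference point: a model that spreads probability evenly over 20 classes.
chance_loss = categorical_cross_entropy([1] + [0] * 19, [1 / 20] * 20)

print("loss for the example:", loss)
print("loss at chance level for 20 classes:", chance_loss)
```

The chance-level value equals ln(20), roughly 3.0, which is a useful reference when reading the per-epoch losses reported in the output window.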

# Charts showing how model accuracy and loss change over each epoch
<table>
<tr>
<td>
<img src = "accuracy.png">
</td>
<td>
<img src = "loss.png">
</td>
</tr>
</table>

In [4]:
## Method description
## Replaces the punctuation marks with spaces, then uses RAKE (Rapid Automatic Keyword Extraction) to rank
## the phrases in the text and keeps those with a score above 5. The stop words and digits in the text are removed.

def clean_sentence(text):
    stop = set(stopwords.words('english'))
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    cleaned_text = text.translate(translator)
    r = Rake()
    r.extract_keywords_from_text(cleaned_text)
    # each ranked entry is a (score, phrase) tuple
    ranked = r.get_ranked_phrases_with_scores()
    cleaned_text = ' '.join([phrase for score, phrase in ranked if score > 5]).strip().lower()
    cleaned_text = ' '.join([x for x in word_tokenize(cleaned_text) if x not in stop])
    return ''.join([i for i in cleaned_text if not i.isdigit()])
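
The non-RAKE steps of `clean_sentence` can be tried in isolation with the standard library alone. The sketch below is an approximation under stated assumptions: it skips the RAKE ranking entirely, splits on whitespace instead of using `word_tokenize`, and uses a tiny hypothetical stop-word set in place of NLTK's English list.

```python
import string

# Hypothetical stand-in for nltk's English stop-word list.
STOP_WORDS = {"the", "is", "a", "of", "in"}

def simple_clean(text):
    """Replace punctuation with spaces, lowercase, drop stop words and digits."""
    translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    no_punctuation = text.translate(translator).lower()
    kept = [word for word in no_punctuation.split() if word not in STOP_WORDS]
    joined = ' '.join(kept)
    # Digits are removed character by character, as in clean_sentence.
    return ''.join(ch for ch in joined if not ch.isdigit())

cleaned = simple_clean("The price of GPUs, in 2017, is high!")
print(cleaned.split())
```

Because digits are stripped after the words are joined, a token that consists only of digits leaves an extra space behind; splitting the result again, as the later tokenization does, removes it.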

In [None]:
## Method description
## Converts each document into the sequence of dictionary indexes of its words.
## Words that are not in the dictionary are mapped to index 0.

def parse_dataset(w2indx, data):
    x_train_vec_wordembd = []
    for item in data:
        # look each word up in the dictionary, falling back to 0 for unknown words
        new_txt = [w2indx.get(word, 0) for word in item.split()]
        x_train_vec_wordembd.append(new_txt)
    return x_train_vec_wordembd

In [3]:
## Method description
##  1- Creates a word to index mapping
##  2- Creates a word to vector mapping from the word embeddings
##  3- Transforms the documents into sequences of word indexes and returns them together with the two mappings

def create_dictionaries(model, x_vector):
    gensim_dict = Dictionary()
    gensim_dict.doc2bow(model.wv.vocab.keys(), allow_update=True)
    # indexes start at 1, so that index 0 stays free for padding and unknown words
    w2indx = {v: k+1 for k, v in gensim_dict.items()}
    w2vec = {word: model.wv[word] for word in w2indx.keys()}
    return w2indx, w2vec, parse_dataset(w2indx, x_vector)
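
The indexing scheme used by `create_dictionaries` and `parse_dataset` can be illustrated without gensim. This is a sketch with a hypothetical three-word vocabulary and made-up function names; it only mirrors the two ideas that matter here, namely that indexes start at 1 and that unknown words fall back to 0.

```python
def build_word_index(vocabulary):
    """Assign indexes starting at 1, keeping 0 free for padding and unknown words."""
    return {word: position + 1 for position, word in enumerate(sorted(vocabulary))}

def to_index_sequences(word_to_index, documents):
    """Replace every word by its index, using 0 for words outside the vocabulary."""
    return [[word_to_index.get(word, 0) for word in doc.split()] for doc in documents]

word_to_index = build_word_index({"graphics", "card", "hockey"})
sequences = to_index_sequences(word_to_index, ["graphics card driver", "hockey"])

print(word_to_index)   # {'card': 1, 'graphics': 2, 'hockey': 3}
print(sequences)       # [[2, 1, 0], [3]]
```

Keeping index 0 unused by real words matters because the embedding layer is created with `mask_zero=True`, which treats 0 as a value to be ignored.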

In [5]:
## Method description
## Function that takes the filename of a document, skips its header (everything up to the first blank line)
## and joins the remaining lines of text into a single line.
## Calls clean_sentence and returns its output to the caller function.

def process_content(page_content):
    single_sentence = ""
    with io.open(page_content) as f:
        # consume the header lines; the default avoids an error if there is no blank line
        next((line for line in f if line.isspace()), None)
        for text in f:
            if not text.isspace():
                single_sentence = single_sentence + text.strip() + ' '
    return clean_sentence(single_sentence)

In [6]:
## Method description
## This function takes each file from the given directory and calls the process_content function defined above.
## Creates X and Y vectors, performs a random shuffle on X and Y and returns x, y and the number of classes to the main function.

def process_data(datapath, split):
    del x_vector[:]   # reset the shared global list so that train and test documents are not mixed
    y_vector = []
    num_files_processed = 0
    class_number = 0
    # sort the directories so that train and test assign the same number to each class
    for sub_dir in sorted(os.listdir(datapath)):
        for file in os.listdir(os.path.join(datapath, sub_dir)):
            num_files_processed += 1
            filename = os.path.join(datapath, sub_dir, file)
            cleaned_sentence = process_content(filename)
            corpus_dict[filename] = cleaned_sentence
            x_vector.append(cleaned_sentence)
            y_vector.append(class_number)
        class_number += 1
    print("number of files processed is", num_files_processed)
    # shuffle documents and labels together so that each pair stays aligned
    combined = list(zip(x_vector, y_vector))
    random.shuffle(combined)
    x_vector[:], y_vector[:] = zip(*combined)
    if split == 'train':
        # one-hot encode the training labels for categorical cross-entropy
        y_vector = np_utils.to_categorical(y_vector)
    return x_vector, y_vector, class_number
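
Two steps in `process_data` are easy to get wrong: shuffling the documents and labels in unison, and one-hot encoding the labels. The standard-library sketch below shows both on toy data; the helper names are hypothetical and `one_hot` is only a stand-in for what `np_utils.to_categorical` produces.

```python
import random

def shuffle_together(xs, ys, seed=None):
    """Shuffle two equally long lists with the same permutation."""
    rng = random.Random(seed)
    combined = list(zip(xs, ys))
    rng.shuffle(combined)
    xs_shuffled, ys_shuffled = zip(*combined)
    return list(xs_shuffled), list(ys_shuffled)

def one_hot(labels, num_classes):
    """Encode each integer label as a row with a single 1."""
    return [[1 if column == label else 0 for column in range(num_classes)]
            for label in labels]

docs = ["doc a", "doc b", "doc c"]
labels = [0, 1, 2]
shuffled_docs, shuffled_labels = shuffle_together(docs, labels, seed=0)
encoded = one_hot(shuffled_labels, num_classes=3)

print(list(zip(shuffled_docs, shuffled_labels)))
print(encoded)
```

Zipping before shuffling is what keeps every document attached to its own label; shuffling the two lists separately would silently destroy the training signal.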


In [7]:
## Method description
## Uses the word2vec implementation of the gensim library to build the word embeddings.
## The window size is 5, which means that each word's vector is trained from the words appearing
## up to 5 positions before and after it, so that the context of the word is not lost.
## The trained vector for each word is available in word2vector.

def word2vec_construct():
    max_sent_length = 0
    print("called word to vec construct")
    word2vector = w2v.Word2Vec(size=vocab_len, min_count=5, window=5, sample=1e-3, iter=2)
    sentences = []
    for line in x_vector:
        tokens = word_tokenize(line)   # tokenize once and reuse the result
        sentences.append(tokens)
        max_sent_length = max(max_sent_length, len(tokens))
    print("length of sentences in vocab is ", max_sent_length)
    word2vector.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vector.wv.vocab))
    word2vector.train(sentences, total_examples=word2vector.corpus_count, epochs=2)
    word2vector.save("word2vector.w2v")
    return word2vector, max_sent_length
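
The role of the `window` parameter can be sketched by listing which context words are paired with each word. This is an illustration only: gensim's actual training is more involved (with the defaults used here it trains a CBOW model, and it also shrinks windows randomly and subsamples frequent words), and the function name and the window of 1 are assumptions chosen to keep the output short.

```python
def context_pairs(tokens, window):
    """List (word, context word) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["deep", "learning", "for", "text"], window=1)
for pair in pairs:
    print(pair)
```

With `window=5`, as used above, each word would be paired with up to ten neighbours, five on each side.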

In [None]:
## Method description
## Driver function

def main():
    x_train_vector, y_train_vector, num_classes = process_data(traindatapath, 'train')
    print("shapes of x and y vectors ", len(x_train_vector), len(y_train_vector))
    word2vector, max_sent_length = word2vec_construct()
    index_dict, word_vectors, x_train_vec_wordembd = create_dictionaries(word2vector, x_train_vector)
    print("maximum sentence length is ", max_sent_length)
    n_symbols = len(index_dict) + 1   # +1 because index 0 is reserved for padding
    ## code for embedding weights with word vectors
    embedding_weights = np.zeros((n_symbols, vocab_len))
    for word, index in index_dict.items():
        embedding_weights[index, :] = word_vectors[word]
    ## pad every document to the same length, as the embedding layer expects a fixed input_length
    x_train_vec_wordembd = sequence.pad_sequences(x_train_vec_wordembd, maxlen=max_sent_length)
    ## calling train function
    model = train_model(x_train_vec_wordembd, y_train_vector, n_symbols, embedding_weights, max_sent_length, num_classes)
    model.save('my_model.h5') ## Saving the model
    ## code for performing testing and prediction using the model
    x_test_vector, y_test_vector, num_classes = process_data(testdatapath, 'test')
    print("shapes of x and y vectors ", len(x_test_vector), len(y_test_vector))
    index_dict, word_vectors, x_test_vec_wordembd = create_dictionaries(word2vector, x_test_vector)
    x_test_vec_wordembd = sequence.pad_sequences(x_test_vec_wordembd, maxlen=max_sent_length)
    y_pred = model.predict_proba(x_test_vec_wordembd)
    y_results = np.argmax(y_pred, axis=1)
    ## prints the classification report (precision, recall and F1 score per class)
    print(classification_report(y_test_vector, y_results))
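
Design consideration 11 notes that all inputs must be padded to the same size. The sketch below shows what that means for index sequences. It is a pure-Python stand-in written for illustration, with a hypothetical function name, that follows the default behaviour of Keras' `pad_sequences` of padding and truncating at the start of each sequence.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]                       # truncate from the front
        padding = [value] * (maxlen - len(trimmed))   # fill the front with zeros
        padded.append(padding + trimmed)
    return padded

padded = pad_left([[4, 7], [1, 2, 3, 9, 5]], maxlen=4)
print(padded)   # [[0, 0, 4, 7], [2, 3, 9, 5]]
```

In this write-up the target length is the longest document, so every shorter document is filled up with zeros to that length, which is a large part of the cost per epoch.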


# output window  
<br>
shapes of x and y vectors  11314 11314
<br>
called word to vec construct
<br>
length of sentences in vocab is  13973
<br>
Word2Vec vocabulary length: 19953
<br>
maximum sentence length is  13973
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
_________________________________________________________________
embedding_1 (Embedding)      (None, 13973, 200)        3990800   
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               320800    
________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                4020      
_________________________________________________________________
<br>
Total params: 4,315,620
<br>
Trainable params: 4,315,620
<br>
Non-trainable params: 0
<br>
Compiling the Model...
<br>
Train...
<br>
Train on 9051 samples, validate on 2263 samples
<br>
The output below is summarized per epoch; the actual run also prints the metrics after each batch.
<br>
 Epoch 1
 <br>
 loss: 2.6549 - categorical_accuracy: 0.1130 - val_loss: 2.4141 - val_categorical_accuracy: 0.1657
 <br>
 Epoch 2
 <br>
 loss: 2.1765 - categorical_accuracy: 0.2163 - val_loss: 2.1450 - val_categorical_accuracy: 0.2762
 <br>
 Epoch 3
 <br>
 loss: 1.8237 - categorical_accuracy: 0.3428 - val_loss: 1.9869 - val_categorical_accuracy: 0.3380
 <br>
 Epoch 4
 <br>
 loss: 1.5530 - categorical_accuracy: 0.4329 - val_loss: 1.9271 - val_categorical_accuracy: 0.3668
 <br>
 Epoch 5
 <br>
 loss: 1.3262 - categorical_accuracy: 0.5081 - val_loss: 1.9690 - val_categorical_accuracy: 0.3915
 <br>
 Epoch 6
 <br>
 loss: 1.2317 - categorical_accuracy: 0.5622 - val_loss: 1.8232 - val_categorical_accuracy: 0.4891
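
The parameter counts in the model summary can be checked by hand. The sketch below uses the usual formulas for embedding, LSTM and dense layers with the sizes reported in the output (a vocabulary of 19953 words plus the reserved index 0, vectors of size 200, and 20 classes); the function names are illustrative.

```python
def embedding_params(n_symbols, dim):
    """One vector of size `dim` per symbol."""
    return n_symbols * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units

n_symbols = 19953 + 1   # vocabulary size plus index 0
embedding = embedding_params(n_symbols, 200)
lstm = lstm_params(200, 200)
dense = dense_params(200, 20)
total = embedding + lstm + dense

print("embedding:", embedding)
print("lstm:", lstm)
print("dense:", dense)
print("total:", total)
```

The result matches the reported total of 4,315,620 trainable parameters, and it shows that the embedding layer accounts for the vast majority of them.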

# Issues faced : 
   &ensp; 1) Each epoch took almost 3 hours to execute on the system configuration mentioned at the start.
<br>
   &ensp; 2) Ran the training for 6 epochs with a batch size of 40, because larger batch sizes led to memory issues on the machine.
<br>
   &ensp; 3) Did not perform hyperparameter tuning using cross validation / grid search, as the training took a lot of time.
# Future enhancements : 
   &ensp; 1) Run on a GPU for more epochs with a larger batch size, which may help reduce over-fitting.
<br>
   &ensp; 2) Perform cross validation and grid search to choose hyperparameters such as the learning rate and the vocab size (the dimension that the original word vectors can be reduced to).
<br>
   &ensp; 3) Model selection: try a multi-layer model and a convolutional network to see which performs better.
<br>
   &ensp; 4) Find out whether there are any ways to reduce the dimensionality. Applying PCA could work after flattening the input, which is of shape (None * sentencesize * 200), but I believe contextual information would be lost for training. So, I need to research dimensionality reduction techniques that can reduce the computational time substantially.

In [None]:
# Code for cross validation:
# wrap the model in a scikit-learn estimator class and use sklearn's GridSearchCV
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class OurEstimator(BaseEstimator, ClassifierMixin):
    """ our base estimator """
    def __init__(self, n_symbols=None, embedding_weights=None, input_length=None, num_classes=None, learning_rate=0.01, activation='sigmoid'):
        self.n_symbols = n_symbols
        self.embedding_weights = embedding_weights
        self.input_length = input_length
        self.num_classes = num_classes
        self.learning_rate = learning_rate
        self.activation = activation

    def fit(self, x_train, y_train):
        # y_train holds integer class labels; they are one-hot encoded for Keras below
        model = Sequential()
        model.add(Embedding(output_dim=vocab_len, input_dim=self.n_symbols, mask_zero=True, weights=[self.embedding_weights], input_length=self.input_length))
        model.add(LSTM(vocab_len))
        model.add(Dropout(0.3))
        model.add(Dense(self.num_classes, activation=self.activation))
        model.summary()
        print('Compiling the Model...')
        adam = Adam(lr=self.learning_rate)
        model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
        print("Train...")
        out_batch = NBatchLogger(display=batch_size)
        y_onehot = np_utils.to_categorical(y_train, self.num_classes)
        model.fit(x_train, y_onehot, batch_size=batch_size, epochs=n_epoch, shuffle=True, validation_split=0.2, callbacks=[out_batch])
        self.model_ = model   # keep the trained model for predict()
        return self

    def predict(self, x):
        y_pred = self.model_.predict(x)
        return np.argmax(y_pred, axis=1)


# n_symbols, embedding_weights, max_sent_length and num_classes are the values computed in main()
tune_params = {'activation': ['sigmoid', 'relu'], 'learning_rate': np.arange(0.03, 0.1, 0.02)}
gs = GridSearchCV(OurEstimator(n_symbols, embedding_weights, max_sent_length, num_classes), tune_params, n_jobs=6)
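
To see why grid search was left as a future enhancement, it helps to count the model fits it implies. The sketch below enumerates the combinations of the two tuned parameters with the standard library; the number of folds is an assumption, the learning rates are the approximate values the `np.arange` call above produces, and the time estimate simply reuses the figure of roughly 3 hours per epoch from the issues list.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of values, as an exhaustive grid search would try."""
    names = sorted(param_grid)
    combos = product(*(param_grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

grid = {
    "activation": ["sigmoid", "relu"],
    "learning_rate": [0.03, 0.05, 0.07, 0.09],
}
candidates = expand_grid(grid)

n_folds = 3           # assumption: number of cross-validation folds
epochs_per_fit = 6    # epochs run in this write-up
hours_per_epoch = 3   # rough figure from the issues list

total_fits = len(candidates) * n_folds
estimated_hours = total_fits * epochs_per_fit * hours_per_epoch

for candidate in candidates:
    print(candidate)
print("model fits:", total_fits, "estimated hours:", estimated_hours)
```

The estimate is a rough upper bound, since each fold trains on only part of the data, but it makes clear that a search of this size is impractical on the CPU-only machine described at the start.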


# References : 
[Microsoft research over generative vs discriminative models in deep learning](https://www.microsoft.com/en-us/research/publication/deep-discriminative-and-generative-models-for-pattern-recognition/)
<br>
[CS 224d : Deep learning for NLP](http://cs224d.stanford.edu/)
<br>
[Theano deep learning](http://deeplearning.net/tutorial/lstm.html)