# Data Set (Training and Test set)

We used the IMDb Large Movie Review Dataset, which can be downloaded from [here](https://ai.stanford.edu/~amaas/data/sentiment/).
The data set contains 50,000 reviews, evenly split into 25,000 reviews for training and 25,000 for testing. The training and test sets cover disjoint sets of movies, so the accuracy measured on the test set should be a reasonable estimate of how the model performs on reviews of other movies.

Each group has the same number of positive and negative reviews: a positive review has a score from 7 to 10 while a negative review has a score from 1 to 4. The reviews having score 5 or 6 are excluded to avoid vagueness.
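
The labeling rule above can be sketched as a small helper. This is only an illustration and not code from the project: the function names are hypothetical, and the `<id>_<rating>.txt` file naming is an assumption about how the downloaded archive names its review files.

```python
import os


def label_from_rating(rating):
    """Map a 1-10 rating to a sentiment label.

    Returns 1 (positive) for ratings 7-10, 0 (negative) for ratings 1-4,
    and None for the neutral ratings 5 and 6, which the data set excludes.
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None


def rating_from_filename(filename):
    """Extract the rating from a name such as '123_8.txt' (assumed naming)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem.rsplit("_", 1)[1])


print(label_from_rating(rating_from_filename("data/train/pos/123_8.txt")))
```

In the notebook itself the label comes from the *pos* and *neg* directory names, so a helper like this is only useful as a sanity check.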


# Environment

For this project, we used our own Linux machine with an AMD Ryzen 7 2700X CPU, 16 GB of memory, and a GeForce RTX 2070 GPU.
In addition, Keras with the TensorFlow backend is used to build the deep learning models.

In [1]:
# standard library
import glob
import os
import string
import sys
from collections import Counter
from multiprocessing import Pool
from operator import itemgetter
from string import punctuation

# Keras (TensorFlow backend)
import keras.backend.tensorflow_backend as K
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Dropout, Embedding, Flatten
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras_tqdm import TQDMNotebookCallback

# numerical and data handling
import numpy as np
import pandas as pd
import tensorflow as tf
from numpy import array

# NLP pre-processing
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# notebook display and progress bars
import tabulate
from IPython.display import HTML, display
from tqdm import tqdm_notebook as tqdm


Using TensorFlow backend.


# Tensorflow initial setup

By default, TensorFlow reserves almost all of the GPU memory when a session starts. Turning on the *allow_growth* option makes it allocate GPU memory on demand instead.

In [2]:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
K.set_session(sess)

# Loading data set files

First, all the documents are loaded. The data sets for training and testing are stored in *data/train* and *data/test*, respectively. Within each data set, positive and negative reviews are stored in the *pos* and *neg* sub-directories.

We attach progress bars using [tqdm](https://github.com/tqdm/tqdm), which is useful when dealing with large data because it lets us estimate how long each stage takes.

Referenced article for tqdm: https://towardsdatascience.com/progress-bars-in-python-4b44e8a4c482

In [3]:
# load all docs in a directory
def load_docs(directory):
    documents = list()
    # walk through all files in the folder
    for filename in tqdm(os.listdir(directory)):
        # create the full path of the file to open
        path = os.path.join(directory, filename)
        # read the review as UTF-8 text
        with open(path, 'r', encoding='utf-8') as f:
            # load the doc
            doc = f.read()
            # add to list
            documents.append(doc)
    return documents

# load all training reviews
print("Loading training-positive-docs")
global_train_positive_docs = load_docs('data/train/pos')
print("Loading training-negative-docs")
global_train_negative_docs = load_docs('data/train/neg')
# load all test reviews
print("Loading test-positive-docs")
global_test_positive_docs = load_docs('data/test/pos')
print("Loading test-negative-docs")
global_test_negative_docs = load_docs('data/test/neg')

Loading training-positive-docs
Loading training-negative-docs
Loading test-positive-docs
Loading test-negative-docs

# Cleaning documents

### Pre-processing techniques

In most NLP-related work, documents are pre-processed to improve model performance.
We applied the following well-known techniques:

**1. Removing punctuation**  
Punctuation marks mostly aid readability and carry little meaning of their own, so they are removed. However, we did not remove the apostrophe (') because removing it caused incorrect stemming.

**2. Removing stop words**  
We filtered out the stop words.
Stop words are words that do not contribute to the deeper meaning of a phrase.
They are the most common words, such as “the”, “a”, and “is”.
NLTK provides a list of commonly agreed-upon stop words for a variety of languages.

**3. Stemming**    
The *PorterStemmer* is provided in the *NLTK* Python package.
We lowercased the words and stemmed them, both to reduce the vocabulary and to focus on the sense or sentiment of a document rather than its deeper meaning.

**4. Removing non-frequent words**   
It is important to define a vocabulary of known words when using a bag-of-words or embedding model.
The more words there are, the larger the representation of each document, so it is important to restrict the vocabulary to the words believed to be predictive.

In this project, **we built the vocabulary by removing the non-frequent words to prevent the model from overfitting.**
This is implemented in [*vocab.ipynb*](https://github.com/ahrimhan/data-science-project/tree/master/project2/src/vocab.ipynb).

* After removing all words with a length of 1 character or less, we first build the vocabulary from the reviews in the training data set only (vocabulary size: 52,826).
* Then, we count the occurrences of each word and remove the words that occur only once, so that every remaining word has two or more occurrences (filtered vocabulary size: 30,819). The filtered vocabulary is saved in [*vocab.txt*](https://github.com/ahrimhan/data-science-project/tree/master/project2/src/vocab.txt).
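
The two steps above can be sketched with a `Counter`. This is a minimal illustration under stated assumptions, not the contents of *vocab.ipynb*: the function name `build_vocab`, its parameters, and the whitespace tokenization are hypothetical, and the three toy documents are made up.

```python
from collections import Counter


def build_vocab(train_docs, min_length=2, min_occurrence=2):
    """Count words in the training documents and keep the frequent ones.

    Returns the full word counts and the filtered vocabulary as a set.
    """
    counts = Counter()
    for doc in train_docs:
        # drop words shorter than min_length before counting
        counts.update(w for w in doc.split() if len(w) >= min_length)
    # keep only words that occur at least min_occurrence times
    vocab = {w for w, c in counts.items() if c >= min_occurrence}
    return counts, vocab


docs = ["a great great movie", "great plot bad act", "bad movie"]
counts, vocab = build_vocab(docs)
print(len(counts), sorted(vocab))
```

Building the vocabulary from the training documents only keeps information from the test set out of the model input.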

In [4]:
# characters deleted from every token: ' " ! . , ? : ;
remove_punctuation_table = str.maketrans('', '', '\'"!.,?:;')
stop_words = set(stopwords.words('english'))

# load the filtered vocabulary as a set for fast membership tests
with open('./vocab/vocab.txt') as f:
    vocab = set(f.read().split())

# create the stemmer once instead of once per document
porter = PorterStemmer()

# turn a doc into a string of clean tokens
def clean_doc(doc):
    # split into word and punctuation tokens with NLTK
    tokens = word_tokenize(doc)

    # remove punctuation from each token
    tokens = [w.translate(remove_punctuation_table) for w in tokens]

    # remove stop words (case-sensitive, since lowercasing happens in the next step)
    tokens = [w for w in tokens if w not in stop_words]

    # lowercase and stem
    tokens = [porter.stem(w.lower()) for w in tokens]

    # filter out tokens not in vocab
    if len(vocab) > 0:
        tokens = [w for w in tokens if w in vocab]

    tokens = ' '.join(tokens)
    return tokens

### Multiprocessing

The pre-processing described above is computationally heavy.
To speed it up, we parallelized the pre-processing using the *Pool* class of the *multiprocessing* package.
Since our CPU has 8 cores, the pool size is set to 8.
With this technique, **we achieved a speed-up of 6~7 times.**
With a single process, cleaning up 12,500 documents takes 10~12 minutes, whereas with 8 worker processes it takes only 1 minute and 20~40 seconds.
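
The reported speed-up can be checked against the measured times with a little arithmetic. The sketch below uses only the timings quoted above. How the ends of the two ranges pair up is an assumption, since individual runs are not listed, and the `speedup` helper is hypothetical.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall-clock time."""
    return serial_seconds / parallel_seconds


serial = (10 * 60, 12 * 60)   # 10~12 minutes, in seconds
parallel = (80, 100)          # 1 min 20 s ~ 1 min 40 s, in seconds
workers = 8

lowest = speedup(serial[0], parallel[1])                      # fastest serial vs slowest parallel
highest = speedup(serial[1], parallel[0])                     # slowest serial vs fastest parallel
typical = speedup(sum(serial) / 2, sum(parallel) / 2)         # midpoints of both ranges
efficiency = typical / workers                                # fraction of ideal linear scaling

print(lowest, highest, round(typical, 2), round(efficiency, 2))
```

The midpoint estimate is a little above 7 times, which is consistent with the reported figure and close to linear scaling for 8 workers.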

In [5]:
# Serial version of clean_docs function
# def clean_docs(documents):
#     for doc in tqdm(documents):
#         clean_doc(doc)

# Parallel version of clean_docs function
def clean_docs(documents):
    # Since we use a CPU having 8 cores, the size of Pool is set as 8
    with Pool(8) as p:
        return list(tqdm(p.imap(clean_doc, documents), total=len(documents)))

print("Cleaning up for training-positive-docs")
cleaned_train_positive_docs = clean_docs(global_train_positive_docs)
print("Cleaning up for training-negative-docs")
cleaned_train_negative_docs = clean_docs(global_train_negative_docs)
print("Cleaning up for test-positive-docs")
cleaned_test_positive_docs = clean_docs(global_test_positive_docs)
print("Cleaning up for test-negative-docs")
cleaned_test_negative_docs = clean_docs(global_test_negative_docs) 

Cleaning up for training-positive-docs
Cleaning up for training-negative-docs
Cleaning up for test-positive-docs
Cleaning up for test-negative-docs

In [6]:
print(cleaned_test_negative_docs[0].split()[0:100])

['nt', 'realli', 'consid', 'conserv', 'nt', 'person', 'offend', 'film', 'pretti', 'clear', 'plot', 'character', 'film', 'secondari', 'messag', 'and', 'messag', 'conserv', 'either', 'evil', 'stupid', 'charact', 'either', 'good', 'american', 'brainless', 'greedi', 'evil', 'conserv', 'there', 'noth', 'clever', 'creativ', 'nt', 'realli', 'mind', 'polit', 'bia', 'nt', 'purpos', 'behind', 'movi', 'and', 'clearli', 'br', 'br', 'on', 'posit', 'side', 'cast', 'wonder', 'chri', 'cooper', 'impress', 'funni', 'first', 'two', 'three', 'time', 'old', 'joke', 'told', 'br', 'br', 'so', 'realli', 'hate', 'conserv', 'probabl', 'enjoy', 'film', 'look', 'someth', 'realist', 'charact', 'stori', 'less', 'better', 'watch', 'someth', 'els']


# Encoding data set into sequence

To use the documents as model input, each document is encoded as a sequence of integer word indices with the Keras tokenizer and padded to a common length.
The function below encodes the documents as sequences and creates the list of labels: '0' for negative reviews and '1' for positive reviews.
We do not need one-hot encoding (the *to_categorical()* function in Keras) because there are only two classes, positive and negative.

In [7]:
def encode_data_set(tokenizer, positive_docs, negative_docs, max_word_length):
    docs = negative_docs + positive_docs
    # sequence encode
    encoded_docs = tokenizer.texts_to_sequences(docs)
    # pad sequences
    x = pad_sequences(encoded_docs, maxlen=max_word_length, padding='post')
    # define training labels
    y = array(([0] * len(negative_docs)) + ([1] * len(positive_docs)))
    return x, y
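
To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences(..., padding='post')` does to a single sequence. It is a simplified stand-in rather than the Keras implementation: `pad_post` and `encode` are hypothetical names, the word indices are made up, and `maxlen` is assumed to be positive. Sequences longer than `maxlen` lose their first items, which mirrors the Keras default `truncating='pre'`.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad with `value` at the end; if too long, keep the last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return truncated + [value] * (maxlen - len(truncated))


def encode(positive_seqs, negative_seqs, maxlen):
    """Pad all sequences and build labels: 0 = negative, 1 = positive."""
    seqs = negative_seqs + positive_seqs
    x = [pad_post(s, maxlen) for s in seqs]
    y = [0] * len(negative_seqs) + [1] * len(positive_seqs)
    return x, y


x, y = encode(positive_seqs=[[4, 7]], negative_seqs=[[3, 9, 2, 8, 5]], maxlen=4)
print(x, y)
```

Index 0 is never assigned to a word by the Keras `Tokenizer`, so it can serve as the padding value. This is also why the embedding layers are created with `len(word_index) + 1` rows.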

# Word Embedding
Word embedding is a common technique for handling text in deep learning.
To evaluate the effect of pre-trained word embeddings, we use both a pre-trained embedding and a new embedding that is trained from scratch.

## Pre-trained word embedding
In this project, we use GloVe embeddings, which you can read about [here](https://nlp.stanford.edu/projects/glove/). GloVe stands for "Global Vectors for Word Representation". It's a somewhat popular embedding technique based on factorizing a matrix of word co-occurrence statistics.

Specifically, we use the 200-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia and the Gigaword 5 corpus. You can download them [here](http://nlp.stanford.edu/data/glove.6B.zip).

In addition, to check whether the pre-trained word embedding benefits from further training, the function below exposes the *trainable* parameter of the Embedding object.

In [8]:
EMBEDDING_DIM = 200
def load_pre_trained_embedding(word_index, max_word_length, trainable_for_embedding):
    embeddings_index = {}
    with open('./glove/glove.6B.200d.txt') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
    return Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=max_word_length, weights=[embedding_matrix], trainable=trainable_for_embedding)
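
The matrix construction can be tried without downloading GloVe. The sketch below repeats the same logic with plain Python lists on a toy input. The two three-dimensional vectors and the word index are made up for illustration, and the helper names are hypothetical.

```python
def parse_glove_lines(lines):
    """Parse lines of the form '<word> <v1> <v2> ...' into a dict of vectors."""
    index = {}
    for line in lines:
        values = line.split()
        index[values[0]] = [float(v) for v in values[1:]]
    return index


def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix


glove_lines = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]   # toy vectors, not real GloVe values
word_index = {"good": 1, "bad": 2, "zzzunknown": 3}         # toy tokenizer index

embeddings_index = parse_glove_lines(glove_lines)
matrix = build_embedding_matrix(word_index, embeddings_index, dim=3)
coverage = sum(w in embeddings_index for w in word_index) / len(word_index)
print(matrix, round(coverage, 2))
```

Checking the coverage is a cheap way to see how many words would start from an all-zero vector, which matters most when the embedding layer is frozen.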

## New (not-trained) word embedding
A new word embedding is created without pre-trained weights, so it must always be trainable.

In [9]:
def new_embedding(word_index, max_word_length, trainable_for_embedding):
    # This is a new embedding layer, so trainable must be True regardless of the value of trainable_for_embedding
    return Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=max_word_length, trainable=True)

# Building a Deep Learning Model

To build a deep learning model, we use the Keras Sequential model.

First comes the Embedding layer. There are two options for it: using the pre-trained word embedding or training a new embedding from scratch.

Second, a series of **1D convolution** and **pooling layers** is added, following the typical CNN architecture for text analysis. 

To examine the effect of the number of convolution layers, the function below takes the number of additional convolution layers as a parameter.

Then, after a flattening layer, fully connected dense layers are added.
Since this is a binary classification problem, we use the sigmoid function as the activation function of the final dense layer. To predict the score of a review as one of several classes instead, the 'softmax' function would be the better choice of activation.

  - **Activation Function**  
  The activation function decides the output of a neuron given its weighted input. Depending on the activation function, the neuron learns linear or non-linear decision boundaries.

  Bounded activation functions such as sigmoid and tanh also normalize the neuron output, which prevents the outputs from growing very large as they cascade through several layers.  
  
  The three most widely used activation functions are:
      - **Sigmoid**: It maps the input (x axis) to values between 0 and 1, which may later result in the [vanishing gradient problem](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484).
      - **Tanh**: It is similar to the sigmoid function but maps the input to values between -1 and 1.
      - **Rectified Linear Unit (ReLU)**: It passes positive values through unchanged and maps negative values to zero.
  - **Dropout**  
  During training, when dropout is applied to a layer, a fraction of its neurons (a hyperparameter, commonly between 20% and 50%) is randomly deactivated, or “dropped out,” along with their connections. The set of dropped neurons is re-sampled randomly throughout training. This forces the network to learn a more balanced representation and **helps prevent overfitting**.
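
The three activation functions are simple enough to write out directly. The sketch below uses only the standard library and scalar inputs for illustration. The function names are hypothetical and this is not how Keras implements them.

```python
import math


def sigmoid(x):
    """Squash x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; it shrinks toward 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh(x):
    """Squash x into the open interval (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Pass positive values through and map negative values to zero."""
    return max(0.0, x)


for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x), round(sigmoid_grad(x), 4))
```

The last column shows why the sigmoid is associated with vanishing gradients: its derivative peaks at 0.25 and is already below 0.01 at an input of 5.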

In [10]:
def build_model(word_index, max_word_length, number_of_additional_conv_layers, use_pre_trained_embedding, trainable_for_embedding, number_of_filters, use_dropout, num_units):
    # define model
    model = Sequential()
    embedding_func = new_embedding
    if use_pre_trained_embedding: 
        embedding_func = load_pre_trained_embedding
    model.add(embedding_func(word_index, max_word_length, trainable_for_embedding))
    if use_dropout:
        model.add(Dropout(0.5))
    
    for i in range(number_of_additional_conv_layers):
        model.add(Conv1D(filters=number_of_filters, kernel_size=5, activation='relu'))
        model.add(MaxPooling1D(pool_size=4))
        if use_dropout:
            model.add(Dropout(0.5))
    
    model.add(Conv1D(filters=number_of_filters, kernel_size=5, activation='relu'))
    model.add(MaxPooling1D(pool_size=10))
    if use_dropout:
        model.add(Dropout(0.5))

    model.add(Flatten())
    #model.add(Dense(64, activation='relu'))
    model.add(Dense(units=num_units, activation='relu')) #num_units= [128,32]
    model.add(Dense(1, activation='sigmoid'))
    model.summary()
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
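
To see how the number of convolution layers changes the size of the network, the sequence length can be traced through the stack by hand. The sketch below assumes the Keras defaults used in `build_model` (no padding and stride 1 for `Conv1D`, pooling stride equal to `pool_size` for `MaxPooling1D`). The helper names and the input length of 2000 words are hypothetical.

```python
def conv1d_length(length, kernel_size):
    """Output length of a 1D convolution with 'valid' padding and stride 1."""
    return length - kernel_size + 1


def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling with 'valid' padding."""
    return length // pool_size


def flattened_size(max_word_length, number_of_additional_conv_layers, number_of_filters):
    """Number of features that reach the first Dense layer after Flatten."""
    length = max_word_length
    # additional blocks: Conv1D(kernel_size=5) followed by MaxPooling1D(pool_size=4)
    for _ in range(number_of_additional_conv_layers):
        length = maxpool1d_length(conv1d_length(length, 5), 4)
    # final block: Conv1D(kernel_size=5) followed by MaxPooling1D(pool_size=10)
    length = maxpool1d_length(conv1d_length(length, 5), 10)
    return length * number_of_filters


print(flattened_size(2000, 0, 96))
print(flattened_size(2000, 2, 96))
```

Under these assumptions, the two additional pooling stages shrink the input of the first Dense layer from 19,104 to 1,056 features, so the deeper model has far fewer dense weights.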

# Building Deep Learning Models using Various Parameters

#### Previous model accuracy

[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

In the paper [1], the accuracy of the machine learning models ranges from 67.42% to 88.89%. The best result was obtained with a cross-validated Support Vector Machine (SVM) when the learned word-vector features were concatenated with a bag-of-words representation.

#### Base model

Now, all the functions defined above are combined in the function *build_and_train_model* below.
We compare different models by varying seven parameters.
Each model is trained with a different combination of these parameters.

The meanings of the parameters are as follows:
* **use_cleaned_docs**: Whether to use the cleaned review documents. 
* **number_of_additional_conv_layers**: Number of additional convolution layers. By default, one convolution layer is used (`number_of_additional_conv_layers`=0). Setting this parameter to a higher number adds more convolution layers. In the experiments, we also set `number_of_additional_conv_layers`=2, which gives 3 convolution layers in total.
* **use_pre_trained_embedding**: Whether to use the pre-trained embedding. If True, the GloVe embedding is used as mentioned above.
* **trainable_for_embedding**: Whether to train the embedding layer on the training data set or freeze it. Note that a new embedding layer must be trained, so `trainable_for_embedding` should be True in that case. 
* **number_of_filters**: Number of filters in the convolution layers (96, 48, or 24).
* **use_dropout**: Whether to use dropout. In the experiments, 50% of the neurons are randomly deactivated.
* **num_units**: Number of units in the Dense layer (128 by default). Smaller values such as 32 reduce the capacity of the network. 

When enumerating model configurations, the combination `use_pre_trained_embedding` = **False** and `trainable_for_embedding` = **False** does not correspond to a meaningful embedding layer (a new embedding must be trained), so it is excluded.
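
Enumerating the parameter combinations and skipping the invalid one can be sketched with `itertools.product`. The candidate values in `grid` are illustrative picks of two values per parameter, and `valid_configurations` is a hypothetical helper. The notebook itself lists its configurations explicitly.

```python
from itertools import product

# two candidate values per parameter (illustrative)
grid = {
    'use_cleaned_docs': [True, False],
    'number_of_additional_conv_layers': [0, 2],
    'use_pre_trained_embedding': [True, False],
    'trainable_for_embedding': [True, False],
    'number_of_filters': [96, 24],
    'use_dropout': [True, False],
    'num_units': [128, 32],
}


def valid_configurations(grid):
    """Return every combination except a new embedding that is frozen."""
    names = list(grid)
    configs = []
    for values in product(*grid.values()):
        config = dict(zip(names, values))
        if not config['use_pre_trained_embedding'] and not config['trainable_for_embedding']:
            continue
        configs.append(config)
    return configs


configs = valid_configurations(grid)
print(len(configs))
```

Each returned dictionary has the same keys as the keyword arguments of `build_and_train_model`, so it could be passed to that function with `**config`.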

In [11]:
MODEL_DIR = './model/'
if not os.path.exists(MODEL_DIR):
    os.mkdir(MODEL_DIR)
    
#modelpath = "./model/{epoch:02d}-{val_loss:.4f}.hdf5"
modelpath = "./model/{epoch:02d}-{val_acc:.4f}.hdf5"
checkpointer = ModelCheckpoint(filepath=modelpath, monitor='val_acc', verbose=0, save_best_only=True)
#early_stopping_callback = EarlyStopping(monitor='val_loss', mode='min', patience=2)
early_stopping_callback = EarlyStopping(monitor='val_acc', mode='max', patience=2)

In [12]:
import gc
def build_and_train_model(use_cleaned_docs=False, 
                          number_of_additional_conv_layers=2, 
                          use_pre_trained_embedding=True,
                          trainable_for_embedding=True,
                          number_of_filters=96, 
                          use_dropout=True,
                          num_units=128):
    
    train_positive_docs = global_train_positive_docs
    train_negative_docs = global_train_negative_docs
    test_positive_docs = global_test_positive_docs
    test_negative_docs = global_test_negative_docs
    
    # clean up documents if required
    if use_cleaned_docs:
        train_positive_docs = cleaned_train_positive_docs
        train_negative_docs = cleaned_train_negative_docs
        test_positive_docs = cleaned_test_positive_docs
        test_negative_docs = cleaned_test_negative_docs
    
    # create the tokenizer
    tokenizer = Tokenizer()
    train_docs = train_positive_docs + train_negative_docs
    # fit the tokenizer on the documents
    tokenizer.fit_on_texts(train_docs)
    #
    print('Fitted tokenizer on {} documents'.format(tokenizer.document_count))
    #print('{} words in dictionary'.format(tokenizer.num_words))
    print('Top 5 most common words are:', Counter(tokenizer.word_counts).most_common(5))
    
    # calculate maximum length of words in training docs
    max_word_length = max([len(s.split()) for s in train_docs])
    # get word_index
    word_index = tokenizer.word_index
    #print('Found %s unique tokens.' % len(word_index))

    # encode data into two sequences: x = input, y = output
    x_train, y_train = encode_data_set(tokenizer, train_positive_docs, train_negative_docs, max_word_length)
    x_test, y_test = encode_data_set(tokenizer, test_positive_docs, test_negative_docs, max_word_length)

    # build a model
    model = build_model(word_index, max_word_length, number_of_additional_conv_layers, 
                        use_pre_trained_embedding, trainable_for_embedding, number_of_filters, use_dropout, num_units)
    # fit network (Training)
    history = model.fit(x_train, y_train, epochs=16, verbose=2, validation_data=(x_test, y_test), batch_size=96, callbacks=[TQDMNotebookCallback(), early_stopping_callback, checkpointer])
    # evaluate
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    print('Test Accuracy: %.2f%%' % (acc*100))
    
    # release the model and the TensorFlow graph before the next experiment
    del model
    K.clear_session()
    gc.collect()
    
    return history

We start with the following parameter configurations.

In [13]:
exp_conditions = []

# Embedding settings as (use_pre_trained_embedding, trainable_for_embedding).
# (False, False) is left out because a new embedding must be trainable.
embedding_settings = [(True, False), (False, True), (True, True)]

for number_of_additional_conv_layers in [0, 2]:
    for use_pre_trained_embedding, trainable_for_embedding in embedding_settings:
        exp_conditions.append({
            'use_cleaned_docs': False,
            'number_of_additional_conv_layers': number_of_additional_conv_layers,
            'use_pre_trained_embedding': use_pre_trained_embedding,
            'trainable_for_embedding': trainable_for_embedding,
            'number_of_filters': 96,
            'use_dropout': False,
            'num_units': 128
        })

columns_display = ['use_cleaned_docs','number_of_additional_conv_layers', 'use_pre_trained_embedding', 'trainable_for_embedding', 'number_of_filters', 'use_dropout', 'num_units']
exp_cond_df = pd.DataFrame(exp_conditions, columns=columns_display)
exp_cond_df

| | use_cleaned_docs | number_of_additional_conv_layers | use_pre_trained_embedding | trainable_for_embedding | number_of_filters | use_dropout | num_units |
|---|---|---|---|---|---|---|---|
| 0 | False | 0 | True | False | 96 | False | 128 |
| 1 | False | 0 | False | True | 96 | False | 128 |
| 2 | False | 0 | True | True | 96 | False | 128 |
| 3 | False | 2 | True | False | 96 | False | 128 |
| 4 | False | 2 | False | True | 96 | False | 128 |
| 5 | False | 2 | True | True | 96 | False | 128 |


We then build and train a model for each configuration.

In [14]:
exp_result = []

for i, exp_cond in enumerate(exp_conditions):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    K.set_session(sess)
    display(exp_cond_df[i:i+1])
    history = build_and_train_model(**exp_cond)
    print(history)
    exp_result.append(history)

| | use_cleaned_docs | number_of_additional_conv_layers | use_pre_trained_embedding | trainable_for_embedding | number_of_filters | use_dropout | num_units |
|---|---|---|---|---|---|---|---|
| 0 | False | 0 | True | False | 96 | False | 128 |


Fitted tokenizer on 25000 documents
Top 5 most common words are: [('the', 336148), ('and', 164097), ('a', 163040), ('of', 145847), ('to', 135708)]
Instructions for updating:
Colocations handled automatically by placer.


InternalError: Dst tensor is not initialized.
	 [[{{node _arg_embedding_1/Placeholder_0_0}}]]
	 [[node embedding_1/Assign (defined at /usr/local/lib/python3.7/dist-packages/keras/backend/tensorflow_backend.py:2465) ]]

In [None]:
%matplotlib notebook

import matplotlib.pyplot as plt
 
columns_display = ['use_cleaned_docs','number_of_additional_conv_layers', 'use_pre_trained_embedding', 'trainable_for_embedding', 'number_of_filters', 'use_dropout', 'num_units', 'val_acc']
                   
def drawModelAcc_Loss(history, exp_cond):
    # record the best validation accuracy of this run in its configuration
    exp_cond['val_acc'] = max(history.history['val_acc'])
    exp_cond_row_df = pd.DataFrame([exp_cond], columns=columns_display)
    display(exp_cond_row_df)
    
    plt.figure()
    # plot training and validation accuracy and loss per epoch
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])

    plt.title('Model accuracy and loss')
    plt.ylabel('accuracy and loss')
    plt.xlabel('epoch')
    plt.legend(['train_acc', 'test_acc', 'train_loss', 'test_loss'], loc='upper left')
    plt.show()

for history, exp_cond in zip(exp_result, exp_conditions):
    drawModelAcc_Loss(history, exp_cond)

In [None]:
exp_cond_df = pd.DataFrame(exp_conditions, columns=columns_display)
exp_cond_df['val_acc'] = exp_cond_df['val_acc'].apply(lambda x:x*100)
exp_cond_df['val_acc'] = exp_cond_df['val_acc'].round(2)

In [None]:
exp_cond_df

In [None]:
def draw_build_and_train_model(exp_cond):
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)
    K.set_session(sess)
    # unpack the configuration dict into keyword arguments
    history = build_and_train_model(**exp_cond)
    exp_result.append(history)
    # plot the curves and store the best val_acc in exp_cond
    drawModelAcc_Loss(history, exp_cond)
    
    # convert the single configuration dict to a DataFrame
    df_exp_cond = pd.DataFrame([exp_cond], columns=columns_display)
    # express val_acc as a percentage with two decimals
    df_exp_cond['val_acc'] = df_exp_cond['val_acc'].apply(lambda x:x*100)
    df_exp_cond['val_acc'] = df_exp_cond['val_acc'].round(2)
    return df_exp_cond

# Preventing Overfitting

We use the Keras callbacks `EarlyStopping` and `ModelCheckpoint`.

One way to avoid overfitting is to stop training early.
We used the `EarlyStopping` callback with `monitor`= val_acc (the accuracy on the test set, which is passed as validation data) and `patience`=2.
`patience` is the number of epochs with no improvement after which training is stopped.

The `ModelCheckpoint` callback saves the model at the end of an epoch. Since we set `save_best_only`=True, a checkpoint is written only when val_acc improves.
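
The effect of `patience` is easiest to see on a concrete accuracy curve. The sketch below is a simplified model of the stopping rule, not the Keras implementation: it assumes that any increase counts as an improvement, uses zero-based epoch numbers, and runs on a made-up list of validation accuracies. The function name is hypothetical.

```python
def simulate_early_stopping(val_accs, patience=2):
    """Return (last epoch that runs, epoch with the best accuracy)."""
    best = float('-inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, acc in enumerate(val_accs):
        if acc > best:
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accs) - 1, best_epoch


# hypothetical validation accuracies per epoch
stopped_epoch, best_epoch = simulate_early_stopping([0.80, 0.85, 0.84, 0.83, 0.90])
print(stopped_epoch, best_epoch)
```

In this example training stops after two epochs without improvement, so the later value of 0.90 is never reached. A larger `patience` trades longer training for a lower risk of stopping too soon.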


Here are the most common ways to prevent overfitting in neural networks:

- Get more training data.
- Reduce the capacity of the network.
- Add weight regularization.
- Add dropout.
- Use data augmentation.
- Use batch normalization.

#### 'number_of_additional_conv_layers' = 0 && network size reduction && dropout true or false

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 0,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 24,
    'use_dropout': False,
    'num_units': 32
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 0,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 24,
    'use_dropout': True,
    'num_units': 32
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Reduce the network size 1. 'number_of_filters' : 96 to 48 and 24

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 48,
    'use_dropout': False,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 24,
    'use_dropout': False,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Reduce the network size 2-1. 'num_units': 128 to 32

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': False,
    'num_units': 32
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Reduce the network size 2-2. 'num_units': 128 to 64

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': False,
    'num_units': 64
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Reduce the network size 2-3. 'num_units': 128 to 8

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': False,
    'num_units': 8
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Reduce the network size 3. 'number_of_filters' : 96 to 24 and 'num_units': 128 to 32

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 24,
    'use_dropout': False,
    'num_units': 32
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### Dropout

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': True,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 48,
    'use_dropout': True,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

#### use_cleaned_docs = True

In [None]:
# use_cleaned_docs = True
exp_cond = {
    'use_cleaned_docs': True, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': True,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
#best
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': False,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
#best
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 96,
    'use_dropout': True,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

In [None]:
#best
exp_cond = {
    'use_cleaned_docs': False, 
    'number_of_additional_conv_layers': 2,
    'use_pre_trained_embedding': True,
    'trainable_for_embedding': True,
    'number_of_filters': 24,
    'use_dropout': True,
    'num_units': 128
}

df_exp_cond_result = draw_build_and_train_model(exp_cond)
exp_cond_df = exp_cond_df.append(df_exp_cond_result)

### Final results

In [None]:
#final
exp_cond_df

In [None]:
exp_cond_df = exp_cond_df.reset_index()

In [None]:
# print("max accuracy\nindex: {}, accuracy: {}".format(exp_cond_df['val_acc'].idxmax(), exp_cond_df['val_acc'].max()))
# exp_cond_df.iloc[exp_cond_df['val_acc'].idxmax()]

In [None]:
# exp_cond_df['val_acc'].nlargest(24)

In [None]:
exp_cond_df.sort_values(by=['val_acc'], ascending=False)

#### Conclusion


1. **Overall accuracy**  
The best accuracy, 90.18%, is obtained under the following condition.

In [None]:
exp_cond_df.iloc[[0]]
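
Note that `sort_values` returns a new DataFrame and leaves `exp_cond_df` unchanged, so positional lookups such as `iloc[[0]]` refer to run order, not to the sorted ranking. A minimal sketch of looking up the best row without relying on row order is shown here. The table and its `val_acc` values are made-up placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder table: values are illustrative only, not measured accuracies.
demo_df = pd.DataFrame({
    'experiment': ['A', 'B', 'C'],
    'val_acc': [0.1, 0.3, 0.2],
})

best_idx = demo_df['val_acc'].idxmax()   # index label of the highest val_acc
best_row = demo_df.loc[[best_idx]]       # one-row DataFrame for display

# A ranked copy whose position 0 is the best experiment.
ranked = demo_df.sort_values(by='val_acc', ascending=False).reset_index(drop=True)

print(best_row)
print(ranked)
```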

2. **Word embedding**  
For the word embedding, **using only the pre-trained word embedding gives the lowest performance**. One interpretation is that a general-purpose embedding model does not generalize fully and needs to be adapted to the context of the task. Thus, **training a new word embedding was the better choice.** Finally, **combining the pre-trained word embedding with a newly trained word embedding gives the best performance.**

In [None]:
exp_cond_df.iloc[[0,1,2]]
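
The three embedding conditions compared here can be read as combinations of the two flags `use_pre_trained_embedding` and `trainable_for_embedding` from `exp_cond`. The mapping below is an interpretation of those flag names, not code taken from the notebook, and `embedding_mode` is a hypothetical helper.

```python
def embedding_mode(use_pre_trained_embedding, trainable_for_embedding):
    """Describe the embedding setup implied by the two flags (assumed semantics)."""
    if use_pre_trained_embedding and trainable_for_embedding:
        return 'pre-trained, then trained further (mixed)'
    if use_pre_trained_embedding:
        return 'pre-trained, frozen'
    if trainable_for_embedding:
        return 'trained from scratch'
    return 'random, frozen'


# The three conditions discussed in the text, in the same order.
for flags in [(True, False), (False, True), (True, True)]:
    print(flags, '->', embedding_mode(*flags))
```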

3. **Network size: number of filters in the convolutional layers and number of units in the Dense layer**  

- Among 96, 48, and 24 filters in the convolutional layers, the larger number gives better performance.

In [None]:
exp_cond_df.iloc[[0,3]]

- More units in the Dense layer (128, 64, 32, 8) generally give better performance, but not in every case.

In [None]:
exp_cond_df.iloc[[0,4,5,6]]

- However, reducing the number of filters in the convolutional layers and the number of units in the Dense layer together gives better performance than reducing either one alone, although the accuracy is still lower than that of the model with the larger values for both.

In [None]:
display(exp_cond_df.iloc[[3,4,5,6,7]])
display(exp_cond_df.iloc[[0,7]])
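
To make "network size" concrete, the number of trainable parameters in a `Conv1D` layer is `(kernel_size * in_channels + 1) * filters`, and in a `Dense` layer it is `(in_features + 1) * units`, where the `+ 1` accounts for the bias. The sketch below applies these standard formulas. The kernel size, input channels, and flattened feature count are assumed values for illustration, since this section does not state them.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Trainable parameters of a Conv1D layer with bias."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(in_features, units):
    """Trainable parameters of a Dense layer with bias."""
    return (in_features + 1) * units


# Assumed values, not taken from the notebook.
KERNEL_SIZE = 5
IN_CHANNELS = 100      # e.g. the embedding dimension
FLAT_FEATURES = 1000   # size of the flattened input to the Dense layer

for filters in (96, 48, 24):
    print('filters', filters, '->', conv1d_params(KERNEL_SIZE, IN_CHANNELS, filters))

for units in (128, 64, 32, 8):
    print('units', units, '->', dense_params(FLAT_FEATURES, units))
```

Both counts scale linearly with `filters` and `units`. In a real model the flattened feature count also depends on the number of filters, so reducing the filters shrinks the Dense layer as well.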

4. **Dropout**  
When the network is large, dropout helps avoid overfitting. When the network is small, however, dropout is not effective, because too much information is lost.

In [None]:
exp_cond_df.iloc[[0,8,9]]
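
As a reminder of what dropout does during training, the sketch below implements inverted dropout on a plain list of activations: each unit is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)`. The rate of 0.5 and the layer sizes are assumptions for illustration, because the rate used inside `draw_build_and_train_model` is not shown in this section.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
for num_units in (128, 8):
    out = dropout([1.0] * num_units, rate=0.5, rng=rng)
    kept = sum(1 for v in out if v != 0.0)
    print(f"{num_units} units: {kept} kept in this step")
```

With only a few units, the fraction kept in a single step varies much more than in a wide layer, which is one way to read the observation about small networks.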

5. **Cleaning documents**  
For this deep learning model, pre-processing (cleaning) the documents does not seem to affect the accuracy.

In [None]:
exp_cond_df.iloc[[0,10]]

6. **Number of convolutional layers**

In [None]:
exp_cond_df.iloc[[0,11,12]]