<center> <div style="text-align: center"> <h1>DNN based Text Classifier using Keras   </h1></div>
<hr></center>

**Anoop K. & Manjary P. Gangan** <br>
CIDA Labs, Department of Computer Science<br>
University of Calicut<br>
https://dcs.uoc.ac.in/~anoop <br>
https://dcs.uoc.ac.in/~manjary <br><br>
________________________

<center><img width="600" height="350" src="https://drive.google.com/uc?id=1TD138tj4cb89_3E62wqsvqR9WUiwNDx_"></center>

# Sentiment Analysis: Predict Sentiment from Movie Reviews
<div style="text-align: justify">Definition 1: Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. </div> <br>
<div style="text-align: justify"> Definition 2: Sentiment analysis is a type of data mining that measures the inclination of people’s opinions through natural language processing (NLP), computational linguistics and text analysis, which are used to extract and analyze subjective information from the Web, mostly social media and similar sources. The analyzed data quantifies the general public's sentiments or reactions toward certain products, people or ideas and reveals the contextual polarity of the information.</div>

# Text to Vector
### 1. TF-IDF
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
<center>
<img width="600" height="500" src="https://drive.google.com/uc?id=18BFjUHxONPZiNT7wogyjA9KRWbiU3RjD"> </center>
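
As a rough illustration of the statistic, the sketch below computes TF-IDF weights in plain Python for a tiny made-up corpus. It assumes one common variant, where tf is the term count divided by the document length and idf is log(N / df); libraries often use smoothed variants, so their numbers will differ. The corpus and the function name `tf_idf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is a list of tokens.
corpus = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "great acting and great story".split(),
]

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumes tf = term count / document length and idf = log(N / df),
    which is only one of several common TF-IDF variants.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(corpus)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the second document, "terrible" receives a higher weight than "the" because it appears in only one document of the corpus, while "the" appears in two.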

### 2. One-Hot encoding
### 3. Word Embedding

# IMDB Movie Review Sentiment Problem Description
## Dataset Description : Large Movie Review Dataset v1.0
<div style="text-align: justify">The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.
The data was collected by Stanford researchers and was used in a 2011 paper <a href="http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf"> [PDF] </a> in which the data was split 50/50 into training and test sets. An accuracy of 88.89% was achieved.
The data was also used as the basis for a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” in late 2014 to early 2015, where accuracy above 97% was achieved, with winners reaching 99%.</div>


## Load the IMDB Dataset With Keras
<div style="text-align: justify">
Keras provides built-in access to the IMDB dataset. The keras.datasets.imdb.load_data() function loads the dataset in a format that is ready for use in neural network and deep learning models. The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers), so every review is a sequence of integers rather than words. For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".<br>

Calling imdb.load_data() the first time downloads the IMDB dataset to your computer and stores it under ~/.keras/datasets/ (as imdb.npz in the run shown here; older Keras releases used imdb.pkl, a file of about 32 megabytes). Usefully, imdb.load_data() provides additional arguments, including the number of top words to load (num_words; less frequent words are replaced by an out-of-vocabulary index in the returned data), the number of top words to skip (skip_top, to avoid very common words such as “the”) and the maximum length of reviews to support (maxlen). </div> <br>
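
The filtering described above can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: it assumes the sequences already hold 1-based frequency ranks, and it replaces filtered-out ranks with a hypothetical `OOV` marker. The function name `filter_by_rank` and the sample values are made up for this example.

```python
# Hypothetical marker for words that fall outside the kept range.
OOV = 0

def filter_by_rank(sequences, num_words=None, skip_top=0, oov=OOV):
    """Keep ranks r with skip_top < r <= num_words; replace the rest with `oov`.

    Ranks are assumed to be 1-based frequency ranks (1 = most frequent word).
    """
    filtered = []
    for seq in sequences:
        filtered.append([
            r if (r > skip_top and (num_words is None or r <= num_words)) else oov
            for r in seq
        ])
    return filtered

# Two toy "reviews" encoded as frequency ranks (made-up values).
reviews = [[1, 14, 2, 530, 12000], [3, 25, 9999, 10001]]

# "Only consider the top 10,000 most common words, but eliminate the top 20".
print(filter_by_rank(reviews, num_words=10000, skip_top=20))
# prints [[0, 0, 0, 530, 0], [0, 25, 9999, 0]]
```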
**Note:** <br>
If you have raw text data, it is better to use the Keras <b>Text Preprocessing</b> package, which includes utilities such as:
1. texts_to_sequences
2. text_to_word_sequence
3. one_hot
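
To make the idea concrete, here is a plain-Python approximation of what this kind of preprocessing does: split text into lowercase words, rank the words by frequency, and map each text to a list of integer indexes. It is a sketch under simplifying assumptions (ASCII punctuation is stripped, words are split on whitespace, ties keep first-seen order) and not the Keras implementation; the function names are made up for this example.

```python
import string
from collections import Counter

def to_words(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Map each word to its 1-based frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in to_words(text))
    # most_common() keeps first-seen order among equally frequent words.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Encode texts as lists of indexes, skipping words not in the index."""
    return [[word_index[w] for w in to_words(t) if w in word_index]
            for t in texts]

texts = ["A great movie, truly great!", "Not a great movie."]
word_index = fit_word_index(texts)
print(word_index)
print(to_sequences(texts, word_index))
# prints [[2, 1, 3, 4, 1], [5, 2, 1, 3]]
```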

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
import numpy
from keras.datasets import imdb
from matplotlib import pyplot
# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
X = numpy.concatenate((X_train, X_test), axis=0)
y = numpy.concatenate((y_train, y_test), axis=0)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
# Summarize size: shape of the combined training and test data
print("Total Data: ")
print(X.shape)
print(y.shape)

Total Data: 
(50000,)
(50000,)


In [0]:
# Summarize number of classes
print("Classes: ")
print(numpy.unique(y))

Classes: 
[0 1]


In [0]:
# Summarize number of words
print("Number of words: ")
print(len(numpy.unique(numpy.hstack(X))))

Number of words: 
88585


In [0]:
# Summarize review length
print("Review length: ")
result = [len(x) for x in X]
print("Mean %.2f words (%f)" % (numpy.mean(result), numpy.std(result)))
# We can see that the average review has about 235 words with a standard deviation of about 173 words.

Review length: 
Mean 234.76 words (172.911495)


## Word Embeddings
<div style="text-align: justify"> This is a technique where words are encoded as real-valued vectors in a high-dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space. Discrete words are mapped to vectors of continuous numbers. This is useful when working with natural language problems, because neural networks and deep learning models require numbers as input. </div>
<br><div style="text-align: justify"> 
Keras provides a convenient way to convert positive integer representations of words into a word embedding by an <b>Embedding layer</b>. The layer takes arguments that define the mapping, including the maximum number of expected words, also called the vocabulary size (i.e. one more than the largest integer index that will be seen in the input). The layer also lets you specify the dimensionality of each word vector, called the output dimension.</div>
<br><div style="text-align: justify"> 
Let’s say that we are only interested in the first 5,000 most used words in the dataset. Therefore our vocabulary size will be 5,000. We can choose to use a 32-dimension vector to represent each word. Finally, we may choose to cap the maximum review length at 500 words, truncating reviews longer than that and padding reviews shorter than that with 0 values. </div>

imdb.load_data(num_words=5000) <br>
X_train = sequence.pad_sequences(X_train, maxlen=500)<br>
X_test = sequence.pad_sequences(X_test, maxlen=500)<br>
Embedding(5000, 32, input_length=500)<br>

The output of this first layer would be a matrix of size 500×32 for a given review (training or test pattern) in integer format: one 32-dimensional vector for each of the 500 word positions. <br>
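
The padding step and the shape of the embedding output can be sketched without Keras. The helper below mimics the default behaviour of `pad_sequences` (zeros are added at the front, and overlong sequences keep their last `maxlen` items), and a small random lookup table stands in for a trained embedding. The sizes are shrunk so the result is easy to inspect, and all names are illustrative.

```python
import random

def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

vocab_size, embed_dim, maxlen = 10, 4, 6  # small stand-ins for 5000, 32, 500

# One random vector per word index; a trained layer would learn these values.
random.seed(0)
embedding = [[random.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

review = [3, 7, 1]                          # a short review, as word indexes
padded = pad_sequence(review, maxlen)       # [0, 0, 0, 3, 7, 1]
embedded = [embedding[i] for i in padded]   # one vector per position

print(padded)
print(len(embedded), "x", len(embedded[0]))  # maxlen x embed_dim
```

Each review therefore becomes a matrix with one row per word position and one column per embedding dimension.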

**Note:**
1. How to Use Word Embedding Layers for Deep Learning with Keras <br>
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

# Simple Multi-Layer Perceptron Model for the IMDB Dataset
## Load the Data

In [0]:
# MLP for the IMDB problem
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.preprocessing import sequence
# load the dataset but only keep the top n words; rarer words are replaced by the out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# pad dataset to a maximum review length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

## Create the Model 

In [0]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               4000250   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 4,160,501
Trainable params: 4,160,501
Non-trainable params: 0
_________________________________________________________________
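
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the layer sizes used above, assuming the usual counting rules: an embedding table holds one vector per vocabulary entry, and a dense layer has one weight per input-output pair plus one bias per output.

```python
top_words, embed_dim, max_words = 5000, 32, 500

embedding_params = top_words * embed_dim      # one 32-d vector per word index
flatten_size = max_words * embed_dim          # 500 positions x 32 values
dense_1_params = flatten_size * 250 + 250     # weights + biases
dense_2_params = 250 * 1 + 1                  # weights + bias

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, flatten_size, dense_1_params, dense_2_params, total)
# prints 160000 16000 4000250 251 4160501
```

Almost all of the parameters sit in the first dense layer, because it is fully connected to the 16,000 flattened embedding values.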


## Fit & Evaluate the Model

In [0]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 86.74%


# One-Dimensional CNN Model for the IMDB Dataset
<b>Ref:</b> https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/


---




## Load the Data

In [0]:
# CNN for the IMDB problem
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Embedding
from keras.preprocessing import sequence
# load the dataset but only keep the top n words; rarer words are replaced by the out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# pad dataset to a maximum review length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [0]:
print('Shape of X train and X validation tensor:', X_train.shape,X_test.shape)
print('Shape of label train and validation tensor:', y_train.shape,y_test.shape)

Shape of X train and X validation tensor: (25000, 500) (25000, 500)
Shape of label train and validation tensor: (25000,) (25000,)


## Create the Model
Keras supports one-dimensional convolutions and pooling by the Conv1D and MaxPooling1D classes respectively.
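
To show what the two layers compute, the plain-Python sketch below applies a single filter of width 3 with 'same' zero padding to a one-channel sequence, applies ReLU, and then max-pools with a window of 2. It is a simplification: the Keras layers in this notebook operate on 32 input channels and learn 32 filters plus biases, and the filter values here are made up.

```python
def conv1d_same(signal, kernel):
    """Slide `kernel` over `signal`, zero-padded so the output length matches.

    Like Keras, this computes a cross-correlation (the kernel is not flipped).
    Assumes an odd kernel width.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
kernel = [1.0, 0.0, -1.0]                  # made-up filter weights

features = relu(conv1d_same(signal, kernel))
pooled = max_pool1d(features)
print(features)  # same length as the input
print(pooled)    # half the length after pooling
```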

In [0]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Conv1D(32, 3, padding='same', activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               2000250   
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 251       
Total params: 2,163,605
Trainable params: 2,163,605
Non-trainable params: 0
____________________________________________
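
As with the MLP, the numbers in this summary can be reproduced by hand. The calculation below assumes the standard counting rule for a 1D convolution (kernel width × input channels × filters, plus one bias per filter) and the default pooling window of 2.

```python
top_words, embed_dim, max_words = 5000, 32, 500
filters, kernel_size = 32, 3

embedding_params = top_words * embed_dim                    # embedding table
conv_params = kernel_size * embed_dim * filters + filters   # weights + biases
pooled_length = max_words // 2                              # pooling halves the length
flatten_size = pooled_length * filters                      # 250 positions x 32 filters
dense_3_params = flatten_size * 250 + 250                   # weights + biases
dense_4_params = 250 * 1 + 1                                # weights + bias

total = embedding_params + conv_params + dense_3_params + dense_4_params
print(conv_params, flatten_size, dense_3_params, total)
# prints 3104 8000 2000250 2163605
```

Pooling halves the flattened size from 16,000 to 8,000, which is why this model has roughly half as many parameters as the MLP despite the extra convolutional layer.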

## Fit and Evaluate the Model

In [0]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=1)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 87.31%


# CNN for Sentence Classification by Zhang & Wallace and Kim

**1. Zhang, Y., & Wallace, B. (2015), A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.** <br>
**2. Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).** 

<center><img width="600" height="500" src="https://drive.google.com/uc?id=1Q90KCeWdyB2vACz2z9rQR27nssYO7Hxi"> </center>
<div style="text-align: justify">
 Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.</div> 
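
The pipeline in the caption (several filter region sizes, 1-max pooling, concatenation) can be sketched in plain Python. This is an illustrative simplification and not the authors' code: the sentence matrix and filter weights are random, there is no bias or nonlinearity, and the final softmax layer is omitted. All names and sizes are assumptions made for the example.

```python
import random

random.seed(0)
embed_dim, sentence_len = 5, 7

def random_matrix(rows, cols):
    """Return a rows x cols matrix of random values in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Sentence matrix: one embedding row per word (random stand-in values).
sentence = random_matrix(sentence_len, embed_dim)

def feature_map(sentence, filt):
    """Slide a (region_size x embed_dim) filter down the sentence matrix."""
    region = len(filt)
    return [
        sum(filt[r][c] * sentence[start + r][c]
            for r in range(region) for c in range(len(filt[0])))
        for start in range(len(sentence) - region + 1)
    ]

# Three region sizes (2, 3, 4) with two filters each, as in the figure.
filters = [random_matrix(size, embed_dim)
           for size in (2, 3, 4) for _ in range(2)]

maps = [feature_map(sentence, f) for f in filters]
# 1-max pooling: keep only the largest value of each feature map.
features = [max(m) for m in maps]

print([len(m) for m in maps])   # map length depends on the region size
print(len(features))            # six pooled values, one per filter
```

Because 1-max pooling keeps a single value per filter, the concatenated feature vector has a fixed length regardless of the sentence length.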
<h1> Let's try to implement this model! </h1> 

References:
1. https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
2. https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/
3. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ <br>



<center><img width="400" height="250" src="https://drive.google.com/uc?id=1LdciBzE4Oc__NE00Bw0TisofYTP0qGc0"> </center>