# Text Classification with Keras
Author: [Valentin Malykh](http://val.maly.hk)

We will start our Natural Language Processing (NLP) journey with classification, because it is one of the basic steps in understanding natural language and is still very practical. Once we can figure out the meaning of words, more complex tasks such as sentiment analysis become possible. Knowing the sentiment of a text is useful in many industries. For example, online reviews and comments are a common way for a big company to track its public image and how customers feel about it.

## Terms
First, let us define technical terms used in NLP to describe the inputs we need to parse. 

*token* - a unit of text. It is almost always a word, but it could also be a group of words like "New York", or a sub-word like "mega" in "megabyte"

*document* - a sequence of tokens; this could be a whole book or a tweet, depending on the task

*corpus* - a set of documents
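
To make these terms concrete, here is a minimal sketch in plain Python. The helper ```tokenize``` and the two example texts are made up for illustration; a whitespace split is only one naive way to produce tokens, and it would treat "New York" as two tokens rather than one.

```python
# Two raw texts that we will turn into a tiny corpus.
raw_texts = [
    "I love New York",
    "This movie was not funny",
]


def tokenize(text):
    """Naive tokenizer (illustrative): lowercase, then split on whitespace."""
    return text.lower().split()


corpus = [tokenize(text) for text in raw_texts]  # a corpus is a set of documents
document = corpus[0]                             # a document is a sequence of tokens
token = document[0]                              # a token is a unit of text

# The vocabulary is the set of distinct tokens seen in the corpus.
vocabulary = sorted({t for doc in corpus for t in doc})

print(corpus)
print(len(vocabulary), "distinct tokens")
```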

## Basic Steps

We will assign each document in a corpus to some class; this is *text classification*, and here we apply it to *sentiment analysis*. The task is to decide which of 3 sentiments a document conveys: "positive", "negative" or "neutral". The idea is to use the words in a document, together with their context, to figure out its emotional tone, whether it is a comment or a paragraph. It is not possible to analyze the whole document in one shot, so the problem has to be broken down. One common strategy is *part of speech tagging*, or simply *PoS-tagging*: marking up every word of a sentence with its part of speech. These per-word tags can then feed into an overall task such as the *text classification* we will do in this lab.

Now, let us import all the libraries to set up our text classification process. We will be using the Keras framework for convenience and utilities like numpy for ease of use. Go ahead and run the following cell to bring in the proper libraries.

In [None]:
import numpy as np
import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Activation, Input
from keras.preprocessing.text import Tokenizer

Now we need a corpus to do text classification on, so let's download the Sentiment Treebank from Stanford's [NLP](https://nlp.stanford.edu/sentiment/) group. They describe the complexity of sentiment analysis in their work as follows:

"Most sentiment prediction systems work just by looking at words in isolation, giving positive points for positive words and negative points for negative words and then summing up these points. That way, the order of words is ignored and important information is lost. In constrast, our new deep learning model actually builds up a representation of whole sentences based on the sentence structure. It computes the sentiment based on how words compose the meaning of longer phrases. This way, the model is not as easily fooled as previous models. For example, our model learned that funny and witty are positive but the following sentence is still negative overall:

*This movie was actually neither that funny, nor super witty.*"

Now let's execute the following two cells to download the [dataset](https://nlp.stanford.edu/sentiment/treebank.html) and unzip it into our workspace.

In [None]:
#! if [ ! -f stanfordSentimentTreebank.zip ]; then wget http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip; fi

In [None]:
#! unzip stanfordSentimentTreebank.zip

## Pandas
We need help to parse the dataset and don't want to do it manually. Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. 

So let's import pandas into our environment and read CSV data:

In [None]:
import pandas
split = pandas.read_csv("stanfordSentimentTreebank/datasetSplit.txt")

When pandas reads a file, it returns its basic object, a DataFrame. A DataFrame is backed by NumPy arrays internally and thus has some useful properties. The following cells use one of them, ```head()```, to look at the first rows of each table:

In [None]:
split.head()

In [None]:
sentences = pandas.read_csv("stanfordSentimentTreebank/datasetSentences.txt", sep="\t")

In [None]:
sentences.head()

Next we use another property of DataFrame: column access. ```sentences["sentence"]``` returns only one specific column of this particular DataFrame, and ```tolist()``` returns a Python list instead of a Series (another base class in pandas). Execute the following cells to prepare the labels.

In [None]:
def sent_labels(sentences):
    """Look up a sentiment score between 0 and 1 for every sentence."""
    # Map each phrase (lowercased) to its phrase id.
    dictionary = dict()
    with open("stanfordSentimentTreebank/dictionary.txt", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            splitted = line.split("|")
            dictionary[splitted[0].lower()] = int(splitted[1])

    # Map each phrase id to its sentiment score; 0.5 (neutral) is the default.
    labels = [0.5] * (max(dictionary.values()) + 1)
    with open("stanfordSentimentTreebank/sentiment_labels.txt", "rt", encoding="utf-8") as f:
        f.readline()  # skip the header line
        for line in f:
            line = line.strip()
            if not line:
                continue
            splitted = line.split("|")
            labels[int(splitted[0])] = float(splitted[1])

    # Sentences that are missing from the dictionary keep the neutral default.
    result = [0.5] * len(sentences)
    for i in range(len(sentences)):
        # Lowercase first so that upper-case "-LRB-" / "-RRB-" markers are replaced too.
        full_sent = sentences[i].lower().replace("-lrb-", "(").replace("-rrb-", ")").replace("\\\\", "")
        try:
            result[i] = labels[dictionary[full_sent]]
        except KeyError:
            pass

    return result

Now we can create labels and check how many sentences there are.

In [None]:
labels = sent_labels(sentences=sentences["sentence"].tolist())

In [None]:
len(sentences)

That's a pretty good dataset for us to start working with. Note that ```concat``` will concatenate DataFrames (and Series) even if they are of different lengths: rows are aligned on their index and missing entries are filled with NaN. This flexibility is another reason we are utilizing pandas.

In [None]:
dataset = pandas.concat([sentences, pandas.DataFrame(labels), split], axis=1)
dataset
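
The behaviour of ```concat``` with frames of different lengths can be seen on a toy example. This is only a sketch with made-up frames (```left``` and ```right``` are illustrative names, not part of the lab data).

```python
import pandas

left = pandas.DataFrame({"sentence": ["good", "bad", "fine"]})
right = pandas.DataFrame({"label": [1.0, 0.0]})  # one row shorter than `left`

# axis=1 places the frames side by side, matching rows by their index labels.
combined = pandas.concat([left, right], axis=1)
print(combined)

# The row that has no partner in `right` gets NaN in the "label" column.
print(combined["label"].isna().tolist())
```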

Here we first select some columns by their names with ```dataset[["sentence",0,"splitset_label"]]```, and then filter the resulting DataFrame by the value of one of its columns with ```d[d["splitset_label"] == 1]```.

Also, if a DataFrame is the last expression in a Jupyter cell, it is displayed as a table; long DataFrames are truncated, much like with ```head()```.

In [None]:
d = dataset[["sentence",0,"splitset_label"]]
d[d["splitset_label"] == 1]

In [None]:
max_words = 20
batch_size = 32
epochs = 5

Here we are going to split the dataset into 3 sets: training, validation and testing. The ```splitset_label``` column tells us which set a sentence belongs to.

In [None]:
import pandas
df_train = d[d["splitset_label"] == 1]
df_test = d[d["splitset_label"] == 2]
df_val = d[d["splitset_label"] == 3]

In [None]:
df_train.head()

The ```Tokenizer``` class from Keras can turn a corpus into vectors in several modes, one of which is the TF-IDF method of text analysis.

## TF-IDF

*term frequency* or *TF*: 
$$TF(w, d) = \frac{count(w, d)}{\sum_{v \in V}count(v, d)}$$
where $w, v$ are tokens (words), $V$ - vocabulary, $d$ - document in corpus

*inverse document frequency* or *IDF*:
$$IDF(w) = \log \frac{|D|}{\sum_{d \in D}\mathbb{1}(w, d)} $$
where $D$ is a corpus, $\mathbb{1}$ is an indicator function of presence of specific token in a document.
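
The two formulas above can be checked by hand on a tiny corpus. The sketch below is plain Python, not the Keras implementation; the function names and the toy corpus are made up, and it assumes the usual convention that the TF-IDF weight is the product of TF and IDF, and that the word occurs in at least one document.

```python
import math
from collections import Counter

# A toy corpus: each document is a list of tokens.
corpus = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]


def tf(word, document):
    """count(w, d) divided by the total number of tokens in d."""
    return Counter(document)[word] / len(document)


def idf(word, corpus):
    """log of |D| over the number of documents that contain w."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / containing)


def tf_idf(word, document, corpus):
    """TF-IDF weight of a word in one document (assumed to be TF * IDF)."""
    return tf(word, document) * idf(word, corpus)


for word in ["good", "movie", "plot"]:
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

In the first document, "plot" gets a higher weight than "movie" even though both occur once there, because "plot" appears in fewer documents of the corpus.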

In [None]:
print("Preparing the Tokenizer...")
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df_train["sentence"])

In [None]:
tokenizer.num_words

In [None]:
tokenizer.word_counts

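Since we've fitted the ```Tokenizer``` on our corpus, it can create a matrix representation of texts. One dimension of the matrix is the index of a text, and the other holds one entry per vocabulary word. With ```mode='binary'```, as used below, an entry is 1 if the word occurs in the text and 0 otherwise; ```mode='tfidf'``` would produce TF-IDF weights instead.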

In [None]:
print('Vectorizing sequence data...')
x_train = tokenizer.texts_to_matrix(df_train["sentence"], mode='binary')
x_test = tokenizer.texts_to_matrix(df_test["sentence"], mode='binary')
x_val = tokenizer.texts_to_matrix(df_val["sentence"], mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('x_val shape:', x_val.shape)

In [None]:
x_train
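
To see what such a matrix contains, here is a simplified pure-Python sketch of a binary bag-of-words matrix. It is an approximation of the idea, not the Keras code: the helpers ```build_word_index``` and ```texts_to_binary_matrix``` are hypothetical, and the tokenization is a plain whitespace split. Like Keras, it numbers words from 1 by frequency and keeps only indices below ```num_words```.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency, starting at 1; index 0 is left unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """One row per text, one column per word index below num_words."""
    matrix = [[0.0] * num_words for _ in texts]
    for row, text in zip(matrix, texts):
        for word in text.lower().split():
            index = word_index.get(word)
            if index is not None and index < num_words:
                row[index] = 1.0  # the word is present, however often it occurs
    return matrix


texts = ["good movie good plot", "bad movie"]
word_index = build_word_index(texts)
matrix = texts_to_binary_matrix(texts, word_index, num_words=4)
for row in matrix:
    print(row)
```

With ```num_words=4``` only the three most frequent words get a column, so the rarest word is silently dropped. The same happens in the lab with ```max_words = 20```.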

In [None]:
# From here on df_train, df_test and df_val hold only the label column (named 0).
df_train = df_train[0].copy()
df_test = df_test[0].copy()
df_val = df_val[0].copy()

Now we need to create a matrix for our labels. One dimension will again be the index of a text, but the other one is a little bit tricky: we need to produce a one-hot encoding of the labels. Here a one-hot encoding is a zero vector whose length is the number of classes, with a one at the position that corresponds to the actual label.

In [None]:
df_train[df_train >= 0.5] = 1.
df_train[df_train < 0.5] = 0.

df_test[df_test >= 0.5] = 1.
df_test[df_test < 0.5] = 0.

df_val[df_val >= 0.5] = 1.
df_val[df_val < 0.5] = 0.

In [None]:
print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
num_classes = 2
y_train = keras.utils.to_categorical(df_train, num_classes)

y_test = keras.utils.to_categorical(df_test, num_classes)

y_val = keras.utils.to_categorical(df_val, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
print('y_val shape:', y_val.shape)

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
import numpy as np
np.where(y_train[:,1] == 0)[0].shape
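
The thresholding and one-hot encoding steps can be reproduced on three made-up scores. This is a plain-Python sketch of the idea, and ```to_one_hot``` is a hypothetical helper, not the Keras function.

```python
def to_one_hot(labels, num_classes):
    """Return one row per label: all zeros except a one at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows


scores = [0.9, 0.2, 0.5]  # sentiment scores between 0 and 1

# Same rule as in the lab: scores of 0.5 and above become class 1, the rest class 0.
binary_labels = [1.0 if score >= 0.5 else 0.0 for score in scores]

one_hot = to_one_hot(binary_labels, num_classes=2)
print(binary_labels)
print(one_hot)
```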

Now we'll create a model in Keras. This model will consist of two ```Dense``` layers, each followed by a non-linear function, which Keras calls an ```Activation```. A ```Dense``` layer is just a matrix multiplication plus a bias vector, and nothing more.

But before we start we need to discuss two more things. The first one is the Rectified Linear Unit, or just __ReLU__. This is a common nonlinearity, which is defined by a simple formula:
$$ReLU(z) = \max(0, z)$$
Here is its graphical representation (and also the sigmoid for comparison):
![](https://cdn-images-1.medium.com/max/1600/1*XxxiA0jJvPrHEJHD4z893g.png)

Also, just a reminder about the SoftMax function we'll be using later in this lab:
$$SoftMax(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{N}e^{x_j}}$$

In [None]:
print('Building Fully-Connected...')
model = Sequential()
model.add(Dense(16, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
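
What these four layers compute can be traced by hand on a tiny example. The sketch below uses plain Python lists and made-up weights (3 inputs, 2 hidden units, 2 classes) instead of the trained Keras model; the functions ```dense```, ```relu``` and ```softmax``` are illustrative stand-ins for the layers.

```python
import math


def dense(inputs, weights, bias):
    """Affine map: weights[j] holds the input weights of output unit j."""
    return [
        sum(x * w for x, w in zip(inputs, unit_weights)) + b
        for unit_weights, b in zip(weights, bias)
    ]


def relu(values):
    """ReLU(z) = max(0, z), applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Exponentiate and normalize so that the outputs sum to 1."""
    shifted = [v - max(values) for v in values]  # shift for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 0.0, 1.0]                           # a binary bag-of-words input
w1 = [[0.5, -1.0, 0.25], [-1.0, 0.5, -0.5]]   # made-up weights, 2 hidden units
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0], [-1.0, 1.0]]               # made-up weights, 2 classes
b2 = [0.0, 0.0]

hidden = relu(dense(x, w1, b1))
probs = softmax(dense(hidden, w2, b2))
print("hidden:", hidden)
print("class probabilities:", probs)
```

The second hidden unit has a negative pre-activation, so ReLU clamps it to zero, and the SoftMax output is a pair of probabilities, one per class.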

In [None]:
print(model.to_yaml())

In [None]:
from keras.models import model_from_yaml

yaml_string = model.to_yaml()
model = model_from_yaml(yaml_string)

Now we want to draw our model:

In [None]:
from keras.utils import plot_model
plot_model(model, to_file='model.png', show_shapes=True)

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))

The final touch: loss function for the model.

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Teaching the network at last!

In [None]:
from keras.callbacks import TensorBoard, EarlyStopping

# Log training for TensorBoard and stop once the validation loss stops improving.
tensorboard = TensorBoard(log_dir='./logs', write_graph=True, write_images=True)
early_stopping = EarlyStopping(monitor='val_loss')


history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[tensorboard, early_stopping])

### Click [here](/tensorboard/) to start TensorBoard.

In [None]:
score = model.evaluate(x_val, y_val, batch_size=batch_size, verbose=1)
print('\n')
print('Validation score:', score[0])
print('Validation accuracy:', score[1])

In [None]:
score2 = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('\n')
print('Test score:', score2[0])
print('Test accuracy:', score2[1])

In [None]:
y_test_predict =  model.predict(x_test, batch_size=batch_size, verbose=1)
y_test_predict

In [None]:
conf_y_p=y_test_predict[:,1]

In [None]:
conf_y_p[conf_y_p >= 0.5] = 1
conf_y_p[conf_y_p < 0.5] = 0

In [None]:
conf_y_test = y_test[:,1]

In [None]:
from sklearn.metrics import confusion_matrix
confusion_test = confusion_matrix(conf_y_test, conf_y_p)
confusion_test

In [None]:
accu= (confusion_test[0][0]+confusion_test[1][1])/confusion_test.sum()
neg_prec = confusion_test[0][0]/(confusion_test[0][0]+confusion_test[1][0])
neg_recall = confusion_test[0][0]/confusion_test[0].sum()
print('accu : ',accu)
print('neg_prec : ',neg_prec)
print('neg_recall : ',neg_recall)
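
The three metrics above can be verified on a small hand-made example. The sketch below builds the confusion matrix in plain Python with the same layout that scikit-learn uses (rows are true classes, columns are predicted classes); ```confusion_counts``` and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1
    return matrix


y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
cm = confusion_counts(y_true, y_pred)

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
# Precision for the negative class: of everything predicted 0, how much is truly 0.
neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])
# Recall for the negative class: of everything truly 0, how much was predicted 0.
neg_recall = cm[0][0] / sum(cm[0])

print(cm)
print(accuracy, neg_precision, neg_recall)
```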

As you may see, this network isn't that great at this task. So we suggest you get acquainted with recurrent neural networks, which are a standard tool for NLP tasks.

## RNN

![](https://cdn-images-1.medium.com/max/759/1*UkI9za9zTR-HL8uM15Wmzw.png)

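A recurrent neural network (RNN) reads a sequence one token at a time. At each step it combines the current input with a hidden state carried over from the previous step, so the output for a token can depend on the tokens that came before it.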
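
As a rough illustration of the recurrence pictured above, here is a sketch of a simple (Elman-style) RNN in plain Python. This formulation, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, is an assumption about what the figure shows; the function names and all weights are made up.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step: h_t = tanh(W_x x_t + W_h h_prev + b), written for plain lists."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w * x for w, x in zip(w_x[j], x_t))
        total += sum(w * h for w, h in zip(w_h[j], h_prev))
        h_t.append(math.tanh(total))
    return h_t


def run_rnn(sequence, w_x, w_h, b):
    """Feed the inputs one by one, carrying the hidden state along."""
    h = [0.0] * len(b)
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
    return h


w_x = [[1.0, 0.0], [0.0, 1.0]]  # input-to-hidden weights (made up)
w_h = [[0.5, 0.0], [0.0, 0.5]]  # hidden-to-hidden weights (made up)
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0]]
forward = run_rnn(sequence, w_x, w_h, b)
backward = run_rnn(list(reversed(sequence)), w_x, w_h, b)
print(forward)
print(backward)
```

The same two inputs in a different order give a different final state, which is exactly the word-order information that the bag-of-words model throws away.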

## LSTM
Formulae:
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2db2cba6a0d878e13932fa27ce6f3fb71ad99cf1)


And as a figure:
![](https://www.researchgate.net/profile/Marijn_Stollenga/publication/304346489/figure/fig13/AS:376211038588933@1466707109201/Figure-74-RNN-and-LSTM-A-graphical-representation-of-the-RNN-and-LSTM-networks-are.png)

In [None]:
from keras.layers import LSTM, Embedding
from keras.datasets import imdb

In [None]:
maxlen = 80  # cut texts after this number of words (among top max_words most common words)

Now we have no need for the matrices described above. Instead we switch to the IMDB movie review dataset that ships with Keras, in which each review is already encoded as a sequence of word indices:

In [None]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')


In [None]:
from keras.preprocessing import sequence
from keras.layers import GlobalAveragePooling1D

We need to pad (or trim) sentences to the maxlen we want so that an RNN can work with them in batches.

In [None]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
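
The effect of padding and trimming is easy to see on two short sequences. The helper ```pad_or_trim``` below is a hypothetical plain-Python stand-in that mimics the default behaviour of ```pad_sequences``` (zeros are added at the front, and overly long sequences lose their first items); it assumes ```maxlen``` is positive.

```python
def pad_or_trim(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


batch = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
padded = [pad_or_trim(seq, maxlen=4) for seq in batch]
for row in padded:
    print(row)
```

After this step every row has the same length, so the batch can be stored as one rectangular array.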


Now we introduce two new layers: ```Embedding```, the layer which learns a vector for each word, and ```LSTM```, which is just the LSTM cell described above.

In [None]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_words, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
#model.add(Embedding(max_words,128 , dropout=0.2))
#model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary() 

#See this : https://gaussic.github.io/2017/03/03/imdb-sentiment-classification/
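
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below shows only the lookup with a randomly initialized table; in Keras the rows are trained together with the rest of the network, which is not shown here. The names ```make_embedding``` and ```embed``` and the sizes are made up.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """A vocab_size x dim table of small random numbers (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace every word index in the sequence by its row of the table."""
    return [table[index] for index in sequence]


table = make_embedding(vocab_size=5, dim=3)
sequence = [1, 3, 1, 0]  # word indices, with 0 used for padding
embedded = embed(sequence, table)

print(len(embedded), "vectors of size", len(embedded[0]))
print(embedded[0] == embedded[2])  # the same word always maps to the same vector
```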

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Run the following training cell for the number of epochs we have specified. This will take about 10 minutes to run.

In [None]:
batch_size=128
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1,
                    #callbacks=[tensorboard, early_stopping]
                   )

In [None]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('\n')
print('model.metrics_names:', model.metrics_names)
print('Test score:', score)
print('Test accuracy:', acc)

In [None]:
score2= model.predict(x_test, batch_size=batch_size)
print('Sentiment value:', score2)

In [None]:
print('label:', y_test)
y_test = y_test[:]

In [None]:
score2[score2 >= 0.5] = 1.
score2[score2 < 0.5] = 0.


In [None]:
from sklearn.metrics import confusion_matrix
confusion_test = confusion_matrix(y_test, score2)
confusion_test

In [None]:
accu= (confusion_test[0][0]+confusion_test[1][1])/confusion_test.sum()
neg_prec = confusion_test[0][0]/(confusion_test[0][0]+confusion_test[1][0])
neg_recall = confusion_test[0][0]/confusion_test[0].sum()
print('accu : ',accu)
print('neg_prec : ',neg_prec)
print('neg_recall : ',neg_recall)

### Exercise 1

The score is better, but not by much. You can improve it dramatically: try adding some layers, or tweak some hyperparameters. Be creative! Your goal is to reach 0.75 on this dataset. That is not the maximum achievable score, just a target that is reachable in the time you have to complete this lab.

Once we can do text classification, many identification tasks open up to the same approach. One example is [MBTI](https://www.kaggle.com/datasnaek/mbti-type), where people's personalities are divided into 16 different types. The dataset includes writing samples from each of the personality types.

### Exercise 2

As a take-home exercise, use the Kaggle dataset to see whether people's personality types can be discerned from their online writing samples.