# MLP Classification with TREC Dataset
<hr>

We will build a text classification model using an MLP model on the TREC dataset.

## Load the library

In [3]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import random
from nltk.corpus import stopwords, twitter_samples
# from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import KFold
from nltk.stem import PorterStemmer
from string import punctuation
from sklearn.preprocessing import OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
import time

%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False
# nltk.download('twitter_samples')

## Load the Dataset

In [4]:
corpus = pd.read_pickle('../../../0_data/TREC/TREC.pkl')
corpus.label = corpus.label.astype(int)
print(corpus.shape)
corpus

(5952, 3)


| | sentence | label | split |
|---|---|---|---|
| 0 | how did serfdom develop in and then leave russ... | 0 | train |
| 1 | what films featured the character popeye doyle ? | 1 | train |
| 2 | how can i find a list of celebrities ' real na... | 0 | train |
| 3 | what fowl grabs the spotlight after the chines... | 1 | train |
| 4 | what is the full form of .com ? | 2 | train |
| ... | ... | ... | ... |
| 5947 | who was the 22nd president of the us ? | 3 | test |
| 5948 | what is the money they use in zambia ? | 1 | test |
| 5949 | how many feet in a mile ? | 5 | test |
| 5950 | what is the birthstone of october ? | 1 | test |


In [5]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5952 entries, 0 to 5951
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  5952 non-null   object
 1   label     5952 non-null   int32 
 2   split     5952 non-null   object
dtypes: int32(1), object(2)
memory usage: 116.4+ KB


In [6]:
corpus.groupby( by=['split','label']).count()

| split | label | sentence |
|---|---|---|
| test | 0 | 138 |
| test | 1 | 94 |
| test | 2 | 9 |
| test | 3 | 65 |
| test | 4 | 81 |
| test | 5 | 113 |
| train | 0 | 1162 |
| train | 1 | 1250 |
| train | 2 | 86 |
| train | 3 | 1223 |


In [7]:
corpus.groupby(by='split').count()

| split | sentence | label |
|---|---|---|
| test | 500 | 500 |
| train | 5452 | 5452 |


In [8]:
# Separate the sentences and the labels for training and testing
train_x = list(corpus[corpus.split=='train'].sentence)
train_y = np.array(corpus[corpus.split=='train'].label)
print(len(train_x))
print(len(train_y))

test_x = list(corpus[corpus.split=='test'].sentence)
test_y = np.array(corpus[corpus.split=='test'].label)
print(len(test_x))
print(len(test_y))

5452
5452
500
500


# Data Preprocessing
<hr>

When preparing data for word embeddings, especially pre-trained embeddings such as Word2Vec or GloVe, __don't use standard preprocessing steps like stemming or stopword removal__. For word-count-based feature extraction (e.g. TF-IDF) we cleaned the text by removing stopwords, stemming, and so on; here we keep these words, because they carry information that may help the model learn better.

__Tomas Mikolov__, one of the developers of Word2Vec, in _word2vec-toolkit: google groups thread., 2015_, suggests that only very minimal text cleaning is required when learning a word embedding model.

In short, what we will do is:
- Punctuation removal
- Lowercasing
- Tokenization

The process above will be handled by the __Tokenizer__ class in TensorFlow.
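To make the three steps concrete, here is a minimal pure-Python sketch of roughly what the Keras `Tokenizer` does with its default settings: lowercase the text, replace the default filter characters with spaces, and split on whitespace. The helper name `simple_clean` is hypothetical and only for illustration; the actual notebook relies on `Tokenizer` itself.

```python
# Default filter characters of the Keras Tokenizer (the apostrophe is NOT filtered,
# which is why tokens such as 's can survive in the vocabulary).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_clean(text, filters=KERAS_DEFAULT_FILTERS):
    """Hypothetical helper: lowercase, blank out filtered characters, split on whitespace."""
    text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

tokens = simple_clean("What is the full form of .com ?")
print(tokens)  # ['what', 'is', 'the', 'full', 'form', 'of', 'com']
```

The example sentence yields seven tokens, which is consistent with the seven-integer sequence the notebook prints for the same sentence.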

- <b>One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.</b>

## Develop Vocabulary

A part of preparing text for text classification involves defining and tailoring the vocabulary of words supported by the model. **We can do this by loading all of the documents in the dataset and building a set of words.**

The larger the vocabulary, the more sparse the representation of each word or document. So, we may decide to support all of these words, or perhaps discard some. The final chosen vocabulary can then be saved to a file for later use, such as filtering words in new documents in the future.
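As a rough illustration of building and trimming a vocabulary, the sketch below counts words over tokenized documents, keeps those that occur at least `min_count` times, and assigns integer ids by descending frequency. Index 0 is left free for padding and index 1 is given to the out-of-vocabulary token, mirroring the layout the `Tokenizer` produces later in this notebook. The function `build_word_index` and the `min_count` parameter are assumptions made for this example, not part of the Keras API.

```python
from collections import Counter

def build_word_index(tokenized_docs, min_count=1, oov_token="<UNK>"):
    """Hypothetical helper: map words to ids by descending frequency.

    Id 0 is reserved for padding and id 1 for the out-of-vocabulary token.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # most_common() sorts by count; ties keep first-seen order
    kept = [word for word, count in counts.most_common() if count >= min_count]
    word_index = {oov_token: 1}
    for idx, word in enumerate(kept, start=2):
        word_index[word] = idx
    return word_index

docs = [["what", "is", "the", "capital"],
        ["what", "is", "the", "time"],
        ["the", "end"]]
word_index = build_word_index(docs, min_count=2)
print(word_index)  # {'<UNK>': 1, 'the': 2, 'what': 3, 'is': 4}

# Words outside the vocabulary fall back to the <UNK> id
encoded = [word_index.get(tok, word_index["<UNK>"]) for tok in ["what", "is", "love"]]
print(encoded)  # [3, 4, 1]
```

Raising `min_count` shrinks the vocabulary at the cost of sending more words to `<UNK>`.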

In [9]:
# Define a function to compute the max length of sequence
def max_length(sequences):
    '''
    input:
        sequences: a 2D list of integer sequences
    output:
        max_length: the max length of the sequences
    '''
    max_length = 0
    for seq in sequences:
        length = len(seq)
        if max_length < length:
            max_length = length
    return max_length

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

trunc_type='post'
padding_type='post'
oov_tok = "<UNK>"

# Separate the sentences and the labels
train_x = list(corpus[corpus.split=='train'].sentence)
train_y = np.array(corpus[corpus.split=='train'].label)
test_x = list(corpus[corpus.split=='test'].sentence)
test_y = np.array(corpus[corpus.split=='test'].label)

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(train_x)

print("Example of sentence: ", train_x[4])

# Turn the text into sequence
training_sequences = tokenizer.texts_to_sequences(train_x)
max_len = max_length(training_sequences)

print('Into a sequence of int:', training_sequences[4])

# Pad the sequence to have the same size
training_padded = pad_sequences(training_sequences, maxlen=max_len, padding=padding_type, truncating=trunc_type)
print('Into a padded sequence:', training_padded[4])

Example of sentence:  what is the full form of .com ?
Into a sequence of int: [3, 4, 2, 471, 261, 5, 372]
Into a padded sequence: [  3   4   2 471 261   5 372   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
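For readers unfamiliar with `pad_sequences`, the following sketch shows what `padding='post'` and `truncating='post'` amount to for plain Python lists. The helper `pad_post` is a hypothetical stand-in written for this example; it is not the Keras implementation and returns lists rather than a NumPy array.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Hypothetical helper: truncate and pad each sequence at the end to length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                      # 'post' truncation keeps the first maxlen ids
        padded.append(seq + [pad_value] * (maxlen - len(seq)))  # 'post' padding appends zeros
    return padded

batch = pad_post([[3, 4, 2, 471, 261, 5, 372], [8, 7]], maxlen=5)
print(batch)  # [[3, 4, 2, 471, 261], [8, 7, 0, 0, 0]]
```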


In [16]:
# See the first 10 words in the vocabulary

word_index = tokenizer.word_index
for i, word in enumerate(word_index):
    print(word, word_index.get(word))
    if i==9:
        break
vocab_size = len(word_index)+1
print(vocab_size)

<UNK> 1
the 2
what 3
is 4
of 5
in 6
a 7
how 8
's 9
was 10
8461


# Model 1: Embedding Random
<hr>

A __standard model__ for document classification is to use (quoted from __Jason Brownlee__, the author of [machinelearningmastery.com](https://machinelearningmastery.com)):
>- Word Embedding: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation.
>- Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.
>- Fully Connected Model: The interpretation of extracted features in terms of a predictive output.


Therefore, the model is comprised of the following elements:
- __Input layer__ that defines the length of input sequences.
- __Embedding layer__ set to the size of the vocabulary and 100-dimensional real-valued representations.
- __Conv1D layer__ with 32 filters and a kernel size set to the number of words to read at once.
- __MaxPooling1D layer__ to consolidate the output from the convolutional layer.
- __Flatten layer__ to reduce the three-dimensional output to two dimensional for concatenation.
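The way shapes flow through these layers can be checked with a little arithmetic. The sketch below assumes a 'valid' convolution with stride 1 and non-overlapping max pooling, and uses illustrative values taken from the model summary printed later in this notebook (100 input tokens, a vocabulary of 8461, 300-dimensional embeddings, 100 filters, kernel size 3, pool size 2). The function `cnn_shapes` is a hypothetical helper written for this example.

```python
def cnn_shapes(seq_len, vocab_size, embed_dim, filters, kernel_size, pool_size):
    """Hypothetical helper: per-sample output shapes and parameter counts."""
    conv_len = seq_len - kernel_size + 1                 # 'valid' convolution, stride 1
    pool_len = (conv_len - pool_size) // pool_size + 1   # pooling with stride = pool_size
    return {
        "embedding_out": (seq_len, embed_dim),
        "embedding_params": vocab_size * embed_dim,
        "conv_out": (conv_len, filters),
        "conv_params": kernel_size * embed_dim * filters + filters,  # weights + biases
        "pool_out": (pool_len, filters),
        "flatten_out": pool_len * filters,
    }

info = cnn_shapes(seq_len=100, vocab_size=8461, embed_dim=300,
                  filters=100, kernel_size=3, pool_size=2)
for name, value in info.items():
    print(name, value)
```

With these inputs the sketch gives 2,538,300 embedding parameters, a `(98, 100)` convolution output with 90,100 parameters, a `(49, 100)` pooled output, and 4,900 flattened features.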

The CNN model is inspired by __Yoon Kim__'s paper on using word embeddings with a CNN for text classification. The hyperparameters we use, based on his study, are as follows:
- Transfer function: rectified linear.
- Kernel sizes: 1-8.
- Number of filters: 100.
- Dropout rate: 0.5.
- L2 Constraint: 3.
- Batch Size: 50.
- Update Rule: Adam
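The "L2 Constraint: 3" item is applied in this notebook's model code through Keras's `MaxNorm(max_value=3)` constraint. As a hedged illustration of the idea, the sketch below rescales a single weight vector whenever its L2 norm exceeds the limit and leaves it essentially unchanged otherwise. The function `max_norm` is a hypothetical pure-Python helper for one vector, not the Keras implementation, which operates on whole weight tensors along a chosen axis.

```python
import math

def max_norm(weights, max_value=3.0, eps=1e-7):
    """Hypothetical helper: rescale a weight vector so its L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    desired = min(norm, max_value)       # clip the norm to the allowed maximum
    scale = desired / (eps + norm)       # eps guards against division by zero
    return [w * scale for w in weights]

clipped = max_norm([3.0, 4.0])     # norm 5.0, rescaled to a norm of about 3.0
unchanged = max_norm([1.0, 2.0])   # norm about 2.24, left almost untouched
print(clipped, unchanged)
```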

We will search for the best parameters using __grid search__ and 10-fold cross-validation.
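The overall procedure can be sketched without any deep learning library: enumerate every parameter combination, score each one on every fold, and keep the combination with the best mean score. Everything below is an illustrative assumption. `kfold_indices` produces contiguous, unshuffled folds similar in spirit to scikit-learn's `KFold`, `grid_search` is a hypothetical helper, and `toy_score` is a stand-in for "train on the training fold, evaluate on the validation fold".

```python
from itertools import product

def kfold_indices(n_samples, n_splits=10):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    indices = list(range(n_samples))
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more sample
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def grid_search(param_grid, n_samples, score_fn, n_splits=10):
    """Return the parameter combination with the highest mean fold score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va)
                  for tr, va in kfold_indices(n_samples, n_splits)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy stand-in for model training; a real score_fn would fit and evaluate a model.
def toy_score(params, train_idx, val_idx):
    return -abs(params["kernel_size"] - 3) - abs(params["filters"] - 100) / 100

grid = {"kernel_size": [1, 3, 5], "filters": [50, 100]}
best_params, best_score = grid_search(grid, n_samples=5452, score_fn=toy_score)
print(best_params)  # {'filters': 100, 'kernel_size': 3}
```

In a real run, each call to the scoring function would build a fresh model so that no weights leak between folds.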

## CNN Model

Now, we will build Convolutional Neural Network (CNN) models to classify encoded questions into one of the six TREC label classes.

The model takes inspiration from `Convolutional Neural Networks for Sentence Classification` by *Yoon Kim*.

Now, we will define our CNN model as follows:
- One Conv layer with 100 filters, kernel size 3 (the default in the code below), and ReLU activation function;
- One MaxPool layer with pool size = 2;
- One Dropout layer after flattening;
- Optimizer: Adam (a widely used adaptive learning-rate optimizer)
- Loss function: sparse categorical cross-entropy (suited to multi-class classification with integer labels)
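The `define_model` code in this notebook compiles the network with `sparse_categorical_crossentropy` and ends in a 6-unit softmax layer. As a small worked example of what that loss computes for one sample, the sketch below turns six raw scores into probabilities and takes the negative log-probability of the true integer label. The function names and the example scores are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample whose true class is given as an integer index."""
    return -math.log(probs[label])

probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0, 0.5])      # six made-up class scores
loss_correct = sparse_categorical_crossentropy(probs, label=0)  # highest-scoring class
loss_wrong = sparse_categorical_crossentropy(probs, label=4)    # lowest-scoring class
print(round(sum(probs), 6), loss_correct < loss_wrong)  # 1.0 True
```

Because the labels stay as integers 0 to 5, no one-hot encoding of `train_y` is needed for this loss.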

**Note**: 
- Dropout layers are there to reduce over-fitting and improve generalization. A dropout rate near 0.5 in hidden layers is a common choice.
- https://missinglink.ai/guides/keras/keras-conv1d-working-1d-convolutional-neural-networks-keras/

In [17]:
from tensorflow.keras import regularizers
from tensorflow.keras.constraints import MaxNorm

def define_model(filters = 100, kernel_size = 3, activation='relu', input_dim = None, output_dim=300, max_length = None ):
    
    model = tf.keras.models.Sequential([
        # Note: the embedding is sized by the global vocab_size; the input_dim argument is not used here
        tf.keras.layers.Embedding(input_dim=vocab_size, 
                                  output_dim=output_dim, 
                                  input_length=max_length, 
                                  input_shape=(max_length, )),
        
        tf.keras.layers.Conv1D(filters=filters, kernel_size = kernel_size, activation = activation, 
                               # set 'axis' value to the first and second axis of conv1D weights (rows, cols)
                               kernel_constraint= MaxNorm( max_value=3, axis=[0,1])),
        
        tf.keras.layers.MaxPool1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=activation, 
                              # set axis to 0 to constrain each weight vector of length (input_dim,) in dense layer
                              kernel_constraint = MaxNorm( max_value=3, axis=0)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(units=6, activation='softmax')
    ])
    
    model.compile( loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
#     model.summary()
    return model

In [18]:
model_0 = define_model( input_dim=1000, max_length=100)
model_0.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 300)          2538300   
_________________________________________________________________
conv1d (Conv1D)              (None, 98, 100)           90100     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 49, 100)           0         
_________________________________________________________________
flatten (Flatten)            (None, 4900)              0         
_________________________________________________________________
dropout (Dropout)            (None, 4900)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                49010     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0