# MLP Classification with CR Dataset
<hr>

We will build a text classification model using a Multi-Layer Perceptron (MLP) on the Customer Reviews (CR) dataset. Since there is no standard train/test split for this dataset, we will use 10-fold cross-validation (CV).
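To make the 10-fold idea concrete, here is a minimal NumPy sketch of how shuffled fold indices can be produced for the 3775 reviews in this dataset. The helper name `kfold_indices` and the fixed seed are assumptions made for illustration; the notebook itself relies on scikit-learn's `KFold` further down.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split (illustrative helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split tolerates sizes that do not divide evenly
    folds = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, test_idx

splits = list(kfold_indices(3775, n_splits=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each review lands in exactly one test fold, so every example is used for evaluation once and for training nine times.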

## Load the Libraries

In [20]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import KFold
from scipy.io import savemat

%config IPCompleter.greedy=True
%config IPCompleter.use_jedi=False


In [2]:
tf.config.list_physical_devices('GPU') 

[]

## Load the Dataset

In [4]:
corpus = pd.read_pickle('../../../0_data/CR/CR.pkl')
corpus.label = corpus.label.astype(int)
print(corpus.shape)
corpus

(3775, 3)


Unnamed: 0,sentence,label,split
0,weaknesses are minor the feel and layout of th...,0,train
1,many of our disney movies do n 't play on this...,0,train
2,player has a problem with dual layer dvd 's su...,0,train
3,i know the saying is you get what you pay for ...,0,train
4,will never purchase apex again .,0,train
...,...,...,...
3770,"so far , the anti spam feature seems to be ver...",1,train
3771,i downloaded a trial version of computer assoc...,1,train
3772,i did not have any of the installation problem...,1,train
3773,their products have been great and have saved ...,1,train


In [5]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3775 entries, 0 to 3774
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  3775 non-null   object
 1   label     3775 non-null   int32 
 2   split     3775 non-null   object
dtypes: int32(1), object(2)
memory usage: 73.9+ KB


In [6]:
corpus.groupby( by='label').count()

label,sentence,split
0,1368,1368
1,2407,2407


In [7]:
# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

In [8]:
sentences[0]

"weaknesses are minor the feel and layout of the remote control are only so so . it does n 't show the complete file names of mp3s with really long names . you must cycle through every zoom setting ( 2x , 3x , 4x , 1 2x , etc . ) before getting back to normal size sorry if i 'm just ignorant of a way to get back to 1x quickly ."

<!--## Split Dataset-->

# Data Preprocessing: Word2Vec Static
<hr>

When preparing text for word embeddings, especially pre-trained ones such as Word2Vec or GloVe, __don't apply standard preprocessing steps like stemming or stopword removal__. For count-based feature extraction (e.g. TF-IDF) we would clean the text by removing stopwords, stemming, and so on. Here we keep those words, because dropping them would lose information that may help the model learn.

__Tomas Mikolov__, one of the developers of Word2Vec, in _word2vec-toolkit: google groups thread., 2015_, suggests that only very minimal text cleaning is required when learning a word embedding model.

In short, what we will do is:
- Punctuation removal
- Lowercasing
- Tokenization

The steps above will be handled by the __Tokenizer__ class in TensorFlow.
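The three steps can be sketched in plain Python as follows. This is only an approximation of the Keras `Tokenizer` defaults, not the library code: the filter string below is assumed to match the `filters` default, and `simple_tokenize` is a hypothetical helper.

```python
# Characters assumed to match the Keras Tokenizer's default `filters` argument.
# Note that the apostrophe is not in this set, so tokens like "'t" survive.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, then split on whitespace."""
    text = text.lower()
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.translate(table).split()

tokens = simple_tokenize("Will never purchase Apex again .")
print(tokens)  # ['will', 'never', 'purchase', 'apex', 'again']

contraction = simple_tokenize("it does n 't show")
print(contraction)  # ['it', 'does', 'n', "'t", 'show']
```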

## Load Pre-trained Word Embedding: Word2Vec

__1. Load `Word2Vec` Pre-trained Word Embedding__

__Using pre-trained embeddings__
* In this part, we will create an Embedding layer in TensorFlow Keras using a pre-trained 300-dimensional word embedding called Word2Vec that has been trained on about 100 billion words from Google News.
* In this part, we will keep the embeddings fixed (static) instead of updating them during training (dynamic).

In [9]:
from gensim.models import KeyedVectors
word2vec = KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

In [10]:
# Access the dense vector value for the word 'handsome'
# word2vec.word_vec('handsome') # 0.11376953
word2vec.word_vec('cool') # 1.64062500e-01

array([ 1.64062500e-01,  1.87500000e-01, -4.10156250e-02,  1.25000000e-01,
       -3.22265625e-02,  8.69140625e-02,  1.19140625e-01, -1.26953125e-01,
        1.77001953e-02,  8.83789062e-02,  2.12402344e-02, -2.00195312e-01,
        4.83398438e-02, -1.01074219e-01, -1.89453125e-01,  2.30712891e-02,
        1.17675781e-01,  7.51953125e-02, -8.39843750e-02, -1.33666992e-02,
        1.53320312e-01,  4.08203125e-01,  3.80859375e-02,  3.36914062e-02,
       -4.02832031e-02, -6.88476562e-02,  9.03320312e-02,  2.12890625e-01,
        1.72119141e-02, -6.44531250e-02, -1.29882812e-01,  1.40625000e-01,
        2.38281250e-01,  1.37695312e-01, -1.76757812e-01, -2.71484375e-01,
       -1.36718750e-01, -1.69921875e-01, -9.15527344e-03,  3.47656250e-01,
        2.22656250e-01, -3.06640625e-01,  1.98242188e-01,  1.33789062e-01,
       -4.34570312e-02, -5.12695312e-02, -3.46679688e-02, -8.49609375e-02,
        1.01562500e-01,  1.42578125e-01, -7.95898438e-02,  1.78710938e-01,
        2.30468750e-01,  

In [11]:
def training_words_in_word2vector(word_to_vec_map, word_to_index):
    '''
    Report how many words of the training vocabulary have a pre-trained vector.

    input:
        word_to_vec_map: a word2vec GoogleNews-vectors-negative300.bin model loaded using gensim.models
        word_to_index: word to index mapping from training set
    '''
    
    # +1 because Tokenizer indices start at 1 (index 0 is reserved for padding)
    vocab_size = len(word_to_index) + 1
    # Count the vocabulary words that also appear in the pre-trained embedding
    count = sum(1 for word in word_to_index if word in word_to_vec_map)
    
    print('Found {} words present from {} training vocabulary in the set of pre-trained word vector'.format(count, vocab_size))

In [12]:
oov_tok = '<UNK>'

# Separate the sentences and the labels
sentences, labels = list(corpus.sentence), list(corpus.label)

# Cleaning and Tokenization
tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
training_words_in_word2vector(word2vec, word_index)

Found 5046 words present from 5336 training vocabulary in the set of pre-trained word vector
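Beyond the raw count, it can be useful to see which vocabulary words are missing from the pre-trained vectors. Below is a minimal sketch on toy data; `split_by_coverage`, `toy_word_index`, and `toy_known` are hypothetical names, and a plain set stands in for the loaded `word2vec` object (which supports the same `in` test).

```python
def split_by_coverage(word_index, known_words):
    """Split the vocabulary into words with and without a pre-trained vector."""
    covered = [w for w in word_index if w in known_words]
    missing = [w for w in word_index if w not in known_words]
    return covered, missing

# Toy stand-ins for tokenizer.word_index and the pre-trained vocabulary
toy_word_index = {'<UNK>': 1, 'the': 2, 'dvd': 3, 'mp3s': 4}
toy_known = {'the', 'dvd'}

covered, missing = split_by_coverage(toy_word_index, toy_known)
coverage = len(covered) / len(toy_word_index)
print(missing, coverage)
```

In the notebook, the same helper could be called as `split_by_coverage(word_index, word2vec)` to inspect the out-of-vocabulary words.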


## Define `clean_doc` function
__2. Define a function to clean the document called `clean_doc()`__

In [13]:
def clean_doc(sentences, word_index):
    clean_sentences = []
    for sentence in sentences:
        sentence = sentence.lower().split()
        clean_word = []
        for word in sentence:
            if word in word_index:
                clean_word.append(word)
        clean_sentence = ' '.join(clean_word)
        clean_sentences.append(clean_sentence)
    return clean_sentences

In [14]:
clean_sentences = clean_doc(sentences, word_index)
clean_sentences[0:3]

["weaknesses are minor the feel and layout of the remote control are only so so it does n 't show the complete file names of mp3s with really long names you must cycle through every zoom setting 2x 3x 4x 1 2x etc before getting back to normal size sorry if i 'm just ignorant of a way to get back to 1x quickly",
 "many of our disney movies do n 't play on this dvd player",
 "player has a problem with dual layer dvd 's such as alias season 1 and season 2"]

## Define `sentence_to_avg` function
__3. Define a `sentence_to_avg` function__

We will use this function to compute the mean of the word embeddings of all words in a sentence.
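As a small worked example of the averaging idea, the sketch below uses made-up 3-dimensional vectors instead of the real 300-dimensional Word2Vec ones. `toy_vectors` and `average_embedding` are illustrative names only.

```python
import numpy as np

# Made-up 3-d vectors standing in for real pre-trained embeddings
toy_vectors = {
    'good':  np.array([1.0, 0.0, 2.0]),
    'sound': np.array([3.0, 2.0, 0.0]),
}

def average_embedding(sentence, vectors, dim):
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

avg = average_embedding('Good sound quality', toy_vectors, dim=3)
print(avg)  # [2. 1. 1.]
```

The word `quality` has no vector in the toy dictionary, so it is skipped and the mean is taken over the two known words only.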

In [15]:
def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the Word2Vec representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- gensim KeyedVectors mapping every word in its vocabulary to a 300-dimensional vector
    
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (300,)
    """
    
    # Step 1: Split sentence into list of lower case words
    words = sentence.lower().split()

    # Initialize the average word vector with the same shape as the word vectors.
    avg = np.zeros(word_to_vec_map.vector_size)
    
    # Step 2: Sum the vectors of the words that have a pre-trained embedding.
    total = 0
    count = 0
    for w in words:
        if w in word_to_vec_map:
            total += word_to_vec_map.word_vec(w)
            count += 1
            
    # Step 3: Average; a sentence with no known words keeps the all-zero vector.
    if count != 0:
        avg = total / count
    return avg

In [16]:
i = word2vec.word_vec('i')[0]
print(word2vec.word_vec('i')[0])
j = word2vec.word_vec('am')[0]
print(word2vec.word_vec('am')[0])
k = word2vec.word_vec('handsome')[0]
print(word2vec.word_vec('handsome')[0])
mean = (i+j+k)/3
print('the mean of word embedding is: ', mean)

-0.22558594
-0.16699219
0.11376953
the mean of word embedding is:  -0.09293619791666667


In [17]:
# Example of the functions used in a sentence
mysentence = 'I am handsome'
sentence_to_avg(mysentence, word2vec)

array([-0.0929362 ,  0.03125   , -0.03914388,  0.09879557,  0.07088598,
        0.03092448, -0.00651042, -0.04437256,  0.08068848,  0.07242838,
        0.00160726, -0.10530599, -0.07389323, -0.08854166,  0.00565592,
        0.15136719, -0.0460612 ,  0.19482422,  0.1101888 ,  0.05924479,
       -0.18457031,  0.00716146,  0.16153972,  0.02437337, -0.01578776,
        0.06119792, -0.25048828,  0.02799479,  0.0853475 , -0.14029948,
        0.13688152, -0.01350911, -0.05493164, -0.01090495,  0.03352864,
        0.09635416, -0.04239909,  0.00777181, -0.1438802 ,  0.06510416,
        0.14560954, -0.11295573,  0.25520834,  0.08833822,  0.14339192,
        0.037028  , -0.02832031, -0.00139872,  0.00309245, -0.17871094,
        0.06852213,  0.07910156,  0.09513346,  0.11425781, -0.00488281,
        0.11051432, -0.01139323, -0.08479818, -0.09277344, -0.03263346,
       -0.00374349,  0.07977295, -0.26416016, -0.05135091,  0.06111654,
       -0.06933594, -0.06486002,  0.18766277, -0.04826609,  0.03

## Encode Sentence into Word2Vec Representation

In [18]:
def encoded_sentences(sentences):
    # Encode every sentence as the average of its word vectors
    encoded = [sentence_to_avg(sentence, word2vec) for sentence in sentences]
    return np.array(encoded)

In [19]:
embedded_sentences = encoded_sentences(clean_sentences)
print(embedded_sentences.shape)
embedded_sentences

(3775, 300)


array([[ 0.02931269,  0.04064355,  0.00335943, ..., -0.05255696,
        -0.00678974, -0.03428983],
       [ 0.02625621,  0.07036244, -0.00320712, ..., -0.0249717 ,
         0.02348744, -0.04393421],
       [ 0.0471889 ,  0.04338728, -0.02501352, ..., -0.07487269,
         0.00302996, -0.02290562],
       ...,
       [-0.01438395,  0.04383342,  0.0245463 , ..., -0.08561961,
         0.0373319 , -0.09303284],
       [-0.01439794,  0.05486043, -0.01898448, ..., -0.08718872,
         0.02372233, -0.02985636],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
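The last row of the output above is all zeros, which is what `sentence_to_avg` returns when none of a sentence's words has a pre-trained vector. A quick way to locate such rows is sketched below on a toy matrix; `find_empty_rows` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def find_empty_rows(matrix):
    """Return the indices of rows that contain only zeros."""
    return np.flatnonzero(~matrix.any(axis=1))

# Toy stand-in for embedded_sentences
toy = np.array([[0.1, -0.2],
                [0.0,  0.0],
                [0.3,  0.0]])

empty = find_empty_rows(toy)
print(empty)  # [1]
```

Applied to `embedded_sentences`, the returned indices would point at the reviews that carry no embedding information at all.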

# Saving to Matlab Files

In [24]:
# Parameter Initialization
oov_tok = "<UNK>"
columns = ['acc1', 'acc2', 'acc3', 'acc4', 'acc5', 'acc6', 'acc7', 'acc8', 'acc9', 'acc10', 'AVG']
record3 = pd.DataFrame(columns = columns)

sentences, labels = list(corpus.sentence), list(corpus.label)

# prepare cross validation with 10 splits and shuffle = True
# (recent scikit-learn versions require these as keyword arguments)
kfold = KFold(n_splits=10, shuffle=True)

exp=0

data = {}

# kfold.split() will return set indices for each split
acc_list = []
for train, test in kfold.split(sentences):
    
    exp+=1
    
    train_x, test_x = [], []
    train_y, test_y = [], []

    for i in train:
        train_x.append(sentences[i])
        train_y.append(labels[i])

    for i in test:
        test_x.append(sentences[i])
        test_y.append(labels[i])

    # Turn the data into a numpy array
    train_y = np.array(train_y)
    test_y = np.array(test_y)
    
    # Define the word_index
    tokenizer = Tokenizer(oov_token=oov_tok)
    tokenizer.fit_on_texts(train_x)
    word_index = tokenizer.word_index
    
    # Clean the sentences
    Xtrain = clean_doc(train_x, word_index)
    Xtest = clean_doc(test_x, word_index)

    # Encode the sentences into word embedding average representation
    Xtrain = encoded_sentences(Xtrain)
    Xtest = encoded_sentences(Xtest)
    
    data['Xtrain'] = Xtrain
    data['ytrain'] = train_y
    data['Xtest'] = Xtest
    data['ytest'] = test_y
    
    filename = 'we_CR_'+str(exp)+'.mat'
    savemat(filename, data)
    print('Fold {} of 10 saved successfully!'.format(exp))

Fold 1 of 10 saved successfully!
Fold 2 of 10 saved successfully!
Fold 3 of 10 saved successfully!
Fold 4 of 10 saved successfully!
Fold 5 of 10 saved successfully!
Fold 6 of 10 saved successfully!
Fold 7 of 10 saved successfully!
Fold 8 of 10 saved successfully!
Fold 9 of 10 saved successfully!
Fold 10 of 10 saved successfully!
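To check that a saved fold can be read back, a round trip through `scipy.io.loadmat` can be sketched as follows. The toy arrays and the temporary file name are assumptions for illustration; in the notebook the files are named `we_CR_<fold>.mat`. One detail worth noting is that 1-D label arrays typically come back as 2-D row vectors, so they are flattened here with `ravel()`.

```python
import os
import tempfile

import numpy as np
from scipy.io import savemat, loadmat

# Toy stand-in for one fold's training data
toy = {'Xtrain': np.zeros((4, 3)), 'ytrain': np.array([0, 1, 1, 0])}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_fold.mat')  # hypothetical file name
    savemat(path, toy)
    loaded = loadmat(path)

X = loaded['Xtrain']
y = loaded['ytrain'].ravel()  # flatten the row vector back to 1-D
print(X.shape, loaded['ytrain'].shape, y.shape)
```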


In [30]:
data

{'Xtrain': array([[ 0.02931269,  0.04064355,  0.00335943, ..., -0.05255696,
         -0.00678974, -0.03428983],
        [ 0.02625621,  0.07036244, -0.00320712, ..., -0.0249717 ,
          0.02348744, -0.04393421],
        [ 0.0471889 ,  0.04338728, -0.02501352, ..., -0.07487269,
          0.00302996, -0.02290562],
        ...,
        [-0.01438395,  0.04383342,  0.0245463 , ..., -0.08561961,
          0.0373319 , -0.09303284],
        [-0.01439794,  0.05486043, -0.01898448, ..., -0.08718872,
          0.02372233, -0.02985636],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]]),
 'ytrain': array([0, 0, 0, ..., 1, 1, 1]),
 'Xtest': array([[-0.00097656,  0.01123047, -0.02592773, ..., -0.146875  ,
         -0.06824951, -0.15471192],
        [ 0.02078346,  0.02795804, -0.00971616, ...,  0.0144406 ,
          0.0066804 , -0.03116312],
        [ 0.0548584 ,  0.02761824, -0.0045105 , ..., -0.06584124,
          0.02709089, -0.03709194],
     