<a href="https://colab.research.google.com/github/beekal/MachieneLearningProjects/blob/master/2.%20Deep%20Learning/4.%20Text%20Classification%20with%20ConvNet%20%2B%20Glove/CNN_based_Classifier_Using_pre_trained_Glove_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Binary Text Classifier using Glove Vectors : 97.9%

In this noteboook, we train a  Convolution Neural Network based text classifier using pre-trained word embeddings from glove and imdb dataset.

### Input Data 
We are using the IMDB "Large Movie Review Dataset v1.0" dataset which contains 50,000 reviews split evenly into 25K train and 25K test set.

The data is balanced with 25K positive review and 25K negative reviews, evvely distributed across train and test dataset.

**To Do**: If you have OOV, in your train text then you can train them with the classification task by enabling the training=True flag for the embedding layer.

However to get word vectors for OOV words, large enough data is required. Hence if you have small data, then  setting the embedding layer to True may be detrimental.



### Import Library

In [3]:
# ! pip install Keras==2.4.3
# ! pip install Keras-Preprocessing==1.1.2
# ! pip install numpy==1.18.5
# ! pip install tensorboard==2.3.0
# ! pip install tensorboard-plugin-wit==1.7.0
# ! pip install tensorflow-estimator==2.3.0
# ! pip install tensorflow-gpu==2.3.1

import sys
import os
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPool1D
from keras.layers import Conv1D, MaxPool1D, Embedding, LSTM
from keras.models import Model, Sequential
from keras.initializers import Constant


### Download Glove / Imdb Data

In [4]:
! mkdir 'text_cnn_1d'
! mkdir 'text_cnn_1d/glove'
! wget -c http://nlp.stanford.edu/data/glove.6B.zip -O  'text_cnn_1d/glove/glove_6B.zip'
! unzip -d 'text_cnn_1d/glove' 'text_cnn_1d/glove/glove_6B.zip'

# BASE_DIR = 'text_cnn_1d'

--2020-09-27 02:17:45--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-09-27 02:17:46--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-09-27 02:17:46--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘text_cnn_1d/glove/glov

In [6]:
! mkdir 'text_cnn_1d'
! mkdir 'text_cnn_1d/imdb'
! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -O 'text_cnn_1d/imdb/aclImdb_v1.tar.gz'
! tar zxvf  'text_cnn_1d/imdb/aclImdb_v1.tar.gz' -C  'text_cnn_1d/imdb/' 

## Define Processing Classes

In [14]:
class TextPreProcessor:

  @staticmethod
  def get_text_to_id_sequence(tokenizer, train_texts, test_texts):
    """ Returns text to sequence of word2id i.e
        I/p : ['A cat', 'A dog']
        O/p : [[3, 1129], [3, 910]]
    """    
    train_sequences = tokenizer.texts_to_sequences(train_texts)
    test_sequences  = tokenizer.texts_to_sequences(test_texts)
    return train_sequences, test_sequences
  
  @staticmethod
  def pad_text_to_constant_length(train_seq, test_seq, maxlen): 
    """ Returns padded seq for given seq e.g
    I/P: [[3, 1129, 1], [3]]
    O/P: [[0, 0, 3, 1129, 1], [0, 0, 0, 0, 3]]
    """
    train_pad_seq = pad_sequences(train_seq, maxlen=maxlen)
    test_pad_seq = pad_sequences(train_seq, maxlen=maxlen)
    return train_pad_seq, test_pad_seq
  
  @staticmethod
  def binary_label_to_dummy_categorical(train_labels, test_labels):
    """ Returns dummy categorical labels for binary label e.g
    I/P: [0, 1]
    O/P:  [[0 , 1],  [1, 0]]
    """
    train_labels = to_categorical(np.asarray(train_labels))
    test_labels  = to_categorical(np.asarray(test_labels))
    return train_labels, test_labels
  
  @staticmethod
  def split_train_data_to_train_n_valid(split_ration, train_data, train_labels):
    """ Simple Random index based split used  for now. Stratified Split to be done """ 

    ## Data split
    num_examples =  train_data.shape[0]
    num_features = train_data.shape[1]
    print(train_data.shape)

    # shuffle index
    shuffled_index = np.arange(num_examples)
    np.random.shuffle(shuffled_index)

    # Shuffle using shuffled index
    train_data = train_data[shuffled_index]
    train_labels = train_labels[shuffled_index]

    # Split
    validation_split = 0.2
    num_valid_examples =  int(split_ration * num_examples)

    X_train = train_data[:-num_valid_examples]
    y_train = train_labels[:-num_valid_examples]

    X_valid = train_data[-num_valid_examples:]
    y_valid = train_labels[-num_valid_examples:]

    return X_train, y_train, X_valid, y_valid


class GloveEmbeddingMatrix:

  @staticmethod
  def get_embedding_matrix(glove_dir, glove_embedding_file):
    """ Returns embedding matrix containing words in training data for efficiency reason
        For Production level model, you should create embedding matrix containing largest possible words, to handle
        all  unseen words that needs to be handled during inference time.
    """
    word2embedding = {}
    with open(os.path.join(glove_dir, glove_embedding_file)) as f:
      for line in f:
        values = line.split()
        word = values[0]
        embedding_vector = np.asarray( values[1:], dtype='float32' )        
        word2embedding[word] = embedding_vector

    return word2embedding    

  @staticmethod
  def link_our_word2index_to_glove(word2index, glove_word2embedding_matrix, embedding_dimension):
    """ Link our word2 index used in our model with its corresponding embedding vectors from glove"""
    num_words = len(glove_word2embedding_matrix)

    our_word2embedding_matrix = np.zeros((num_words, embedding_dimension))

    for word, word_idx in word2index.items():
        cur_word_vector = glove_word2embedding_matrix.get(word, None)
        if cur_word_vector is not None:
            our_word2embedding_matrix[word_idx] = glove_word2embedding_matrix.get(word)

    return our_word2embedding_matrix


class ImdbInputDataHandler:    

    # Pad text, to make all text of equal length 
    MAX_TEXT_LENGTH = 1000

    # MAX Vocab Words Supported
    MAX_WORDS = 20000    
    VALIDATION_SPLIT = 0.2     
    labels_2_binary = {'pos':1, 'neg':0}
    tokenizer = None
    
    @staticmethod
    def _get_data(data_dir):
      """ Scans the directory and returns a list of text[sentence_1, sentence_2] and binary labels[0,1] """
      texts, binary_labels = [], []
      labels_2_binary = {'pos':1, 'neg':0}

      for dir_name in sorted(os.listdir(data_dir)):
        # e.g text_cnn_1d/aclImdb/train/pos
        dir_path = os.path.join(data_dir, dir_name)
        if os.path.isdir(dir_path) and dir_name in labels_2_binary.keys():
          # pos dir_name -> 1 and vice versa
          cur_binary_label = labels_2_binary[dir_name]
          # 44960_0.txt
          for file_name in sorted(os.listdir(dir_path)):
            # e.g text_cnn_1d/aclImdb/train/pos/44960_0.txt
            file_path = os.path.join(dir_path, file_name)
            texts.append(open(file_path).read() )
            binary_labels.append(cur_binary_label)

      return texts, binary_labels


    @staticmethod
    def fit_tokenizer(train_texts):
        tokenizer = Tokenizer(num_words=ImdbInputDataHandler.MAX_WORDS)
        tokenizer.fit_on_texts(train_texts)
        return tokenizer
    
    @staticmethod
    def get_tokenizer():        
        return ImdbInputDataHandler.tokenizer
    
    @staticmethod
    def get_train_texts(imdb_train_dir):       
        train_texts, _ = ImdbInputDataHandler._get_data(imdb_train_dir)
        return train_texts

    @staticmethod
    def get_train_valid_test_data(imdb_train_dir, imdb_test_dir):
        """ Takes in imdb train dir and test dir containing the training and test data.
            Splits Train data into train and validation data.
            O/P : Returns Train dataset, validation data set and test data set
        """
        
        train_texts, train_labels = ImdbInputDataHandler._get_data(imdb_train_dir)
        test_texts, test_labels   = ImdbInputDataHandler._get_data(imdb_test_dir)       
        
        ImdbInputDataHandler.tokenizer = ImdbInputDataHandler.fit_tokenizer(train_texts)
        
        # [ [the cat], [the dog eats] ] => [ [15, 165], [15, 167, 145] ]
        train_sequences, test_sequences = TextPreProcessor.get_text_to_id_sequence(ImdbInputDataHandler.tokenizer, train_texts, test_texts)
        # [ [the cat], [the dog eats] ] => [ [0, 0, 15, 165], [0, 15, 167, 145] ]
        train_padded_data, test_padded_data = TextPreProcessor.pad_text_to_constant_length(train_sequences, test_sequences, ImdbInputDataHandler.MAX_TEXT_LENGTH)
        # [1,0] => [ [1,0], [0,1] ]
        train_labels, test_labels = TextPreProcessor.binary_label_to_dummy_categorical(train_labels, test_labels)
        X_train, y_train, X_valid, y_valid = TextPreProcessor.split_train_data_to_train_n_valid(ImdbInputDataHandler.VALIDATION_SPLIT , train_padded_data, train_labels  )
        X_test, y_test = test_padded_data, test_labels
        
        return X_train, y_train, X_valid, y_valid, X_test, y_test






### Modeling

In [8]:
IMDB_TRAIN_DIR = 'text_cnn_1d/imdb/aclImdb/train'
IMDB_TEST_DIR  = 'text_cnn_1d/imdb/aclImdb/test'
GLOVE_DIR= 'text_cnn_1d/glove'    
GLOVE_FILE = 'glove.6B.100d.txt'
EMBEDDING_DIM = 100


X_train, y_train, X_valid, y_valid, X_test, y_test = ImdbInputDataHandler.get_train_valid_test_data(IMDB_TRAIN_DIR, IMDB_TEST_DIR)

tokenizer = ImdbInputDataHandler.get_tokenizer()
our_word_2_index = tokenizer.word_index
glove_word2embedding_matrix = GloveEmbeddingMatrix.get_embedding_matrix(GLOVE_DIR, GLOVE_FILE )
our_word2embedding_matrix = GloveEmbeddingMatrix.link_our_word2index_to_glove( our_word_2_index , glove_word2embedding_matrix, EMBEDDING_DIM )



(25000, 1000)


### Define Model

In [9]:
# Initialise embedding to the glove word2embedding matrix
# Set trainable to False or else the embedding layer will also be trained
glove_word2embedding_layer = Embedding( our_word2embedding_matrix.shape[0],
                                       EMBEDDING_DIM,
                                        embeddings_initializer=Constant(our_word2embedding_matrix),
                                        input_length=ImdbInputDataHandler.MAX_TEXT_LENGTH,                            
                                        trainable=False)
"""
CNN 1D Text Model
"""
cnnmodel_1d = Sequential()
cnnmodel_1d.add(glove_word2embedding_layer)
cnnmodel_1d.add(Conv1D(filters=128, kernel_size=5, activation='relu' ) )
cnnmodel_1d.add(MaxPool1D(pool_size=5))
cnnmodel_1d.add(Conv1D(filters=128, kernel_size=5, activation='relu') )
cnnmodel_1d.add(MaxPool1D(pool_size=5))
cnnmodel_1d.add(Conv1D(filters=128, kernel_size=5, activation='relu') )
cnnmodel_1d.add(GlobalMaxPool1D())
cnnmodel_1d.add(Dense(units=128, activation='relu' ))
# 2 output, Probability distribution across 2 outputs 
cnnmodel_1d.add(Dense(units=2, activation='softmax'  ))

cnnmodel_1d.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
cnnmodel_1d.fit(X_train, y_train, batch_size=128, epochs=1, validation_data=(X_valid, y_valid))
score, accuracy = cnnmodel_1d.evaluate(X_test, y_test)

print('Test Accuracy of the CNN Text 1D model is ', accuracy)


Test Accuracy of the CNN Text 1D model is  0.7142800092697144


### Creating Cnn1DTextClass

In [10]:
class TextCnn1D(Model):
    def __init__(self, glove_embedding_layer, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.glove_embedding_layer = glove_embedding_layer
        self.conv_layer = Conv1D(filters=128, kernel_size=5, activation='relu')
        self.max_pool_layer = MaxPool1D(pool_size=5 )
        self.global_max_pool = GlobalMaxPool1D()
        self.dense_layer = Dense(units=128, activation='relu')
        self.out_layer = Dense(units=output_dim, activation='softmax')
        
    def call(self, inputs):
        Z = inputs
        Z = self.glove_embedding_layer(Z)
        Z = self.conv_layer(Z)
        Z = self.global_max_pool(Z)
        Z = self.dense_layer(Z)
        return self.out_layer(Z)

output_dim = 2
cnnmodel_1d = TextCnn1D(glove_word2embedding_layer, output_dim)
cnnmodel_1d.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
cnnmodel_1d.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_valid, y_valid))
score, accuracy = cnnmodel_1d.evaluate(X_test, y_test)

print('Test Accuracy of the CNN Text 1D model is ', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy of the CNN Text 1D model is  0.9739199876785278


## UnitTest

In [17]:
import unittest

class InputTester(unittest.TestCase):
  
  def setUp(self):
    print('Setup')

  def test_input_text_must_be_list_of_text(self, train_texts):
      assert train_texts[0] == "Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.", 'ERROR: Expected String Text'
      assert train_texts[0:2] == ["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
                                  , "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for the survivor's as they await help with time fast running out...<br /><br />Also known under the slightly different tile Airport 1977 this second sequel to the smash-hit disaster thriller Airport (1970) was directed by Jerry Jameson & while once again like it's predecessors I can't say Airport '77 is any sort of forgotten classic it is entertaining although not necessarily for the right reasons. Out of the three Airport films I have seen so far I actually liked this one the best, just. It has my favourite plot of the three with a nice mid-air hi-jacking & then the crashing (didn't he see the oil rig?) & sinking of the 747 (maybe the makers were trying to cross the original Airport with another popular disaster flick of the period The Poseidon Adventure (1972)) & submerged is where it stays until the end with a stark dilemma facing those trapped inside, either suffocate when the air runs out or drown as the 747 floods or if any of the doors are opened & it's a decent idea that could have made for a great little disaster flick but bad unsympathetic character's, dull dialogue, lethargic set-pieces & a real lack of danger or suspense or tension means this is a missed opportunity. While the rather sluggish plot keeps one entertained for 108 odd minutes not that much happens after the plane sinks & there's not as much urgency as I thought there should have been. Even when the Navy become involved things don't pick up that much with a few shots of huge ships & helicopters flying about but there's just something lacking here. George Kennedy as the jinxed airline worker Joe Patroni is back but only gets a couple of scenes & barely even says anything preferring to just look worried in the background.<br /><br />The home video & theatrical version of Airport '77 run 108 minutes while the US TV versions add an extra hour of footage including a new opening credits sequence, many more scenes with George Kennedy as Patroni, flashbacks to flesh out character's, longer rescue scenes & the discovery or another couple of dead bodies including the navigator. While I would like to see this extra footage I am not sure I could sit through a near three hour cut of Airport '77. As expected the film has dated badly with horrible fashions & interior design choices, I will say no more other than the toy plane model effects aren't great either. Along with the other two Airport sequels this takes pride of place in the Razzie Award's Hall of Shame although I can think of lots of worse films than this so I reckon that's a little harsh. The action scenes are a little dull unfortunately, the pace is slow & not much excitement or tension is generated which is a shame as I reckon this could have been a pretty good film if made properly.<br /><br />The production values are alright if nothing spectacular. The acting isn't great, two time Oscar winner Jack Lemmon has said since it was a mistake to star in this, one time Oscar winner James Stewart looks old & frail, also one time Oscar winner Lee Grant looks drunk while Sir Christopher Lee is given little to do & there are plenty of other familiar faces to look out for too.<br /><br />Airport '77 is the most disaster orientated of the three Airport films so far & I liked the ideas behind it even if they were a bit silly, the production & bland direction doesn't help though & a film about a sunken plane just shouldn't be this boring or lethargic. Followed by The Concorde ... Airport '79 (1979)."
                                  ], "ERROR: Input expected to be list of text"


class TextPreProcessorTester(unittest.TestCase):
  
  def setUp(self):
    print('Setup')

  def test_get_text_to_sequence(self, _tokenizer):    
    assert TextPreProcessor.get_text_to_id_sequence(_tokenizer, ['A cat', 'A dog'], [] )== ([[3, 1129], [3, 910]], []), 'ERROR: text to sequemcey conversion'
  
  def test_get_text_to_sequence_2(self, _tokenizer, _train_texts):    
    _train_sequences, _ = TextPreProcessor.get_text_to_id_sequence(_tokenizer, _train_texts[0:2], [] )    
    assert _train_sequences[0:2] == [
                                      [62, 4, 3, 129, 34, 44, 7576, 1414, 15, 3, 4252, 514, 43, 16, 3, 633, 133, 12, 6, 3, 1301, 459, 4, 1751, 209, 3, 10785, 7693, 308, 6, 676, 80, 32, 2137, 1110, 3008, 31, 1, 929, 4, 42, 5120, 469, 9, 2665, 1751, 1, 223, 55, 16, 54, 828, 1318, 847, 228, 9, 40, 96, 122, 1484, 57, 145, 36, 1, 996, 141, 27, 676, 122, 1, 13886, 411, 59, 94, 2278, 303, 772, 5, 3, 837, 11037, 20, 3, 1755, 646, 42, 125, 71, 22, 235, 101, 16, 46, 49, 624, 31, 702, 84, 702, 378, 3493, 12997, 2, 16816, 8422, 67, 27, 107, 3348]
                                      , [4517, 19499, 514, 14, 3, 3417, 159, 8595, 12998, 1702, 6, 4892, 53, 16, 4518, 5674, 138, 11926, 5, 1023, 4988, 3050, 4519, 588, 1339, 34, 6, 1544, 95, 3, 758, 4, 5, 24, 3513, 8, 10786, 4, 9, 109, 3051, 5, 1, 1067, 14, 3, 4520, 79, 20, 2086, 6, 4519, 574, 2798, 7262, 38, 489, 1, 8595, 301, 122, 14, 4253, 18, 1693, 942, 1, 1702, 6, 6538, 31, 1, 998, 1807, 667, 24, 104, 14896, 15492, 19500, 2602, 485, 34, 3285, 1, 6539, 1048, 43, 16, 2753, 2547, 33, 1340, 5, 2103, 1, 4518, 11927, 1537, 20, 3, 1702, 3249, 20, 32, 4348, 1105, 18, 134, 228, 24, 4760, 217, 1927, 32, 3230, 17633, 8, 1, 4676, 1975, 1135, 4, 1, 1702, 5675, 9, 6627, 80, 1, 2016, 118, 9, 8169, 5, 1, 1321, 205, 4010, 8, 1, 652, 4, 1, 5924, 16, 942, 8, 343, 6259, 1090, 8, 257, 16115, 117, 6260, 2058, 122, 261, 1, 709, 12258, 15, 1, 14, 33, 12606, 335, 16, 55, 699, 617, 43, 7, 7, 79, 570, 463, 1, 1072, 272, 4517, 6041, 11, 330, 751, 5, 1, 6792, 566, 1685, 705, 4517, 5456, 13, 523, 31, 1513, 9878, 134, 277, 171, 37, 42, 8288, 10, 188, 132, 4517, 19499, 6, 98, 429, 4, 1547, 353, 9, 6, 438, 258, 21, 2696, 15, 1, 205, 1003, 43, 4, 1, 286, 4517, 105, 10, 25, 107, 35, 227, 10, 162, 420, 11, 28, 1, 115, 40, 9, 44, 58, 1636, 111, 4, 1, 286, 16, 3, 324, 1693, 942, 6538, 92, 1, 6627, 158, 26, 64, 1, 3230, 17633, 6261, 4, 1, 12998, 276, 1, 1184, 68, 266, 5, 1656, 1, 201, 4517, 16, 157, 1059, 1685, 506, 4, 1, 807, 1, 1150, 4483, 19501, 6, 118, 9, 2665, 363, 1, 127, 16, 3, 5283, 6540, 4416, 145, 2603, 1001, 342, 51, 1, 942, 1126, 43, 39, 11038, 14, 1, 12998, 39, 45, 98, 4, 1, 3584, 23, 3051, 42, 3, 539, 323, 12, 97, 25, 90, 15, 3, 84, 114, 1685, 506, 18, 75, 6887, 1727, 750, 411, 12607, 267, 1322, 3, 144, 580, 4, 2373, 39, 833, 39, 1071, 814, 11, 6, 3, 1045, 1429, 134, 1, 244, 12999, 111, 938, 28, 2161, 15, 1028, 231, 21, 12, 73, 567, 100, 1, 1702, 8169, 222, 21, 14, 73, 8926, 14, 10, 194, 47, 141, 25, 74, 57, 51, 1, 3022, 410, 571, 180, 89, 1257, 53, 12, 73, 16, 3, 168, 659, 4, 663, 5121, 10787, 1544, 41, 18, 222, 40, 139, 1883, 130, 739, 4201, 14, 1, 12259, 3494, 911, 6, 142, 18, 61, 211, 3, 375, 4, 136, 1196, 57, 555, 229, 16817, 5, 40, 165, 3765, 8, 1, 972, 7, 7, 1, 341, 371, 2246, 307, 4, 4517, 19499, 518, 231, 134, 1, 175, 245, 2052, 759, 32, 1724, 531, 4, 926, 583, 3, 159, 633, 894, 717, 108, 50, 136, 16, 739, 4201, 14, 2178, 5, 2104, 43, 1727, 1203, 2225, 136, 1, 3789, 39, 157, 375, 4, 348, 2345, 583, 1, 134, 10, 59, 37, 5, 64, 11, 1724, 926, 10, 241, 21, 249, 10, 97, 866, 140, 3, 747, 286, 531, 602, 4, 4517, 19499, 14, 870, 1, 19, 44, 1957, 906, 16, 524, 8058, 7476, 1588, 2823, 10, 77, 132, 54, 50, 82, 71, 1, 2876, 1702, 2179, 299, 710, 84, 342, 364, 16, 1, 82, 104, 4517, 2279, 11, 301, 3108, 4, 270, 8, 1, 11307, 2365, 4, 899, 258, 10, 67, 101, 4, 773, 4, 430, 105, 71, 11, 35, 10, 11039, 195, 3, 114, 2485, 1, 202, 136, 23, 3, 114, 750, 469, 1, 1060, 6, 547, 21, 73, 2315, 39, 1071, 6, 4844, 60, 6, 3, 899, 14, 10, 11039, 11, 97, 25, 74, 3, 181, 49, 19, 45, 90, 2877, 7, 7, 1, 362, 1230, 23, 2652, 45, 161, 2087, 1, 113, 215, 84, 104, 55, 731, 2280, 714, 4311, 44, 298, 234, 9, 13, 3, 1319, 5, 320, 8, 11, 28, 55, 731, 2280, 588, 1339, 269, 151, 11308, 79, 28, 55, 731, 2280, 844, 2105, 269, 1816, 134, 2682, 1365, 844, 6, 345, 114, 5, 78, 47, 23, 955, 4, 82, 1076, 1586, 5, 165, 43, 15, 96, 7, 7, 4517, 19499, 6, 1, 88, 1685, 4, 1, 286, 4517, 105, 35, 227, 10, 420, 1, 1005, 493, 9, 57, 45, 33, 68, 3, 224, 706, 1, 362, 1898, 455, 149, 335, 148, 3, 19, 41, 3, 16116, 1702, 40, 1609, 27, 11, 354, 39, 12607, 1474, 31, 1, 4517, 5457]
                                    ], "ERROR: Text sequence expected to be List of List of word_id(int)"
  
  def test_correct_word_to_id_conversion(self, _tokenizer, _train_texts):   
    assert _train_texts[0:2] == ["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."
                                  , "Airport '77 starts as a brand new luxury 747 plane is loaded up with valuable paintings & such belonging to rich businessman Philip Stevens (James Stewart) who is flying them & a bunch of VIP's to his estate in preparation of it being opened to the public as a museum, also on board is Stevens daughter Julie (Kathleen Quinlan) & her son. The luxury jetliner takes off as planned but mid-air the plane is hi-jacked by the co-pilot Chambers (Robert Foxworth) & his two accomplice's Banker (Monte Markham) & Wilson (Michael Pataki) who knock the passengers & crew out with sleeping gas, they plan to steal the valuable cargo & land on a disused plane strip on an isolated island but while making his descent Chambers almost hits an oil rig in the Ocean & loses control of the plane sending it crashing into the sea where it sinks to the bottom right bang in the middle of the Bermuda Triangle. With air in short supply, water leaking in & having flown over 200 miles off course the problems mount for the survivor's as they await help with time fast running out...<br /><br />Also known under the slightly different tile Airport 1977 this second sequel to the smash-hit disaster thriller Airport (1970) was directed by Jerry Jameson & while once again like it's predecessors I can't say Airport '77 is any sort of forgotten classic it is entertaining although not necessarily for the right reasons. Out of the three Airport films I have seen so far I actually liked this one the best, just. It has my favourite plot of the three with a nice mid-air hi-jacking & then the crashing (didn't he see the oil rig?) & sinking of the 747 (maybe the makers were trying to cross the original Airport with another popular disaster flick of the period The Poseidon Adventure (1972)) & submerged is where it stays until the end with a stark dilemma facing those trapped inside, either suffocate when the air runs out or drown as the 747 floods or if any of the doors are opened & it's a decent idea that could have made for a great little disaster flick but bad unsympathetic character's, dull dialogue, lethargic set-pieces & a real lack of danger or suspense or tension means this is a missed opportunity. While the rather sluggish plot keeps one entertained for 108 odd minutes not that much happens after the plane sinks & there's not as much urgency as I thought there should have been. Even when the Navy become involved things don't pick up that much with a few shots of huge ships & helicopters flying about but there's just something lacking here. George Kennedy as the jinxed airline worker Joe Patroni is back but only gets a couple of scenes & barely even says anything preferring to just look worried in the background.<br /><br />The home video & theatrical version of Airport '77 run 108 minutes while the US TV versions add an extra hour of footage including a new opening credits sequence, many more scenes with George Kennedy as Patroni, flashbacks to flesh out character's, longer rescue scenes & the discovery or another couple of dead bodies including the navigator. While I would like to see this extra footage I am not sure I could sit through a near three hour cut of Airport '77. As expected the film has dated badly with horrible fashions & interior design choices, I will say no more other than the toy plane model effects aren't great either. Along with the other two Airport sequels this takes pride of place in the Razzie Award's Hall of Shame although I can think of lots of worse films than this so I reckon that's a little harsh. The action scenes are a little dull unfortunately, the pace is slow & not much excitement or tension is generated which is a shame as I reckon this could have been a pretty good film if made properly.<br /><br />The production values are alright if nothing spectacular. The acting isn't great, two time Oscar winner Jack Lemmon has said since it was a mistake to star in this, one time Oscar winner James Stewart looks old & frail, also one time Oscar winner Lee Grant looks drunk while Sir Christopher Lee is given little to do & there are plenty of other familiar faces to look out for too.<br /><br />Airport '77 is the most disaster orientated of the three Airport films so far & I liked the ideas behind it even if they were a bit silly, the production & bland direction doesn't help though & a film about a sunken plane just shouldn't be this boring or lethargic. Followed by The Concorde ... Airport '79 (1979)."
                                  ], "ERROR: Input expected to be list of text"
    _train_sequences, _ = TextPreProcessor.get_text_to_id_sequence(_tokenizer, _train_texts[0:2], [] )    
    word_index =  tokenizer.word_index
    assert _train_sequences[0:2] == [
                                      [62, 4, 3, 129, 34, 44, 7576, 1414, 15, 3, 4252, 514, 43, 16, 3, 633, 133, 12, 6, 3, 1301, 459, 4, 1751, 209, 3, 10785, 7693, 308, 6, 676, 80, 32, 2137, 1110, 3008, 31, 1, 929, 4, 42, 5120, 469, 9, 2665, 1751, 1, 223, 55, 16, 54, 828, 1318, 847, 228, 9, 40, 96, 122, 1484, 57, 145, 36, 1, 996, 141, 27, 676, 122, 1, 13886, 411, 59, 94, 2278, 303, 772, 5, 3, 837, 11037, 20, 3, 1755, 646, 42, 125, 71, 22, 235, 101, 16, 46, 49, 624, 31, 702, 84, 702, 378, 3493, 12997, 2, 16816, 8422, 67, 27, 107, 3348]
                                      , [4517, 19499, 514, 14, 3, 3417, 159, 8595, 12998, 1702, 6, 4892, 53, 16, 4518, 5674, 138, 11926, 5, 1023, 4988, 3050, 4519, 588, 1339, 34, 6, 1544, 95, 3, 758, 4, 5, 24, 3513, 8, 10786, 4, 9, 109, 3051, 5, 1, 1067, 14, 3, 4520, 79, 20, 2086, 6, 4519, 574, 2798, 7262, 38, 489, 1, 8595, 301, 122, 14, 4253, 18, 1693, 942, 1, 1702, 6, 6538, 31, 1, 998, 1807, 667, 24, 104, 14896, 15492, 19500, 2602, 485, 34, 3285, 1, 6539, 1048, 43, 16, 2753, 2547, 33, 1340, 5, 2103, 1, 4518, 11927, 1537, 20, 3, 1702, 3249, 20, 32, 4348, 1105, 18, 134, 228, 24, 4760, 217, 1927, 32, 3230, 17633, 8, 1, 4676, 1975, 1135, 4, 1, 1702, 5675, 9, 6627, 80, 1, 2016, 118, 9, 8169, 5, 1, 1321, 205, 4010, 8, 1, 652, 4, 1, 5924, 16, 942, 8, 343, 6259, 1090, 8, 257, 16115, 117, 6260, 2058, 122, 261, 1, 709, 12258, 15, 1, 14, 33, 12606, 335, 16, 55, 699, 617, 43, 7, 7, 79, 570, 463, 1, 1072, 272, 4517, 6041, 11, 330, 751, 5, 1, 6792, 566, 1685, 705, 4517, 5456, 13, 523, 31, 1513, 9878, 134, 277, 171, 37, 42, 8288, 10, 188, 132, 4517, 19499, 6, 98, 429, 4, 1547, 353, 9, 6, 438, 258, 21, 2696, 15, 1, 205, 1003, 43, 4, 1, 286, 4517, 105, 10, 25, 107, 35, 227, 10, 162, 420, 11, 28, 1, 115, 40, 9, 44, 58, 1636, 111, 4, 1, 286, 16, 3, 324, 1693, 942, 6538, 92, 1, 6627, 158, 26, 64, 1, 3230, 17633, 6261, 4, 1, 12998, 276, 1, 1184, 68, 266, 5, 1656, 1, 201, 4517, 16, 157, 1059, 1685, 506, 4, 1, 807, 1, 1150, 4483, 19501, 6, 118, 9, 2665, 363, 1, 127, 16, 3, 5283, 6540, 4416, 145, 2603, 1001, 342, 51, 1, 942, 1126, 43, 39, 11038, 14, 1, 12998, 39, 45, 98, 4, 1, 3584, 23, 3051, 42, 3, 539, 323, 12, 97, 25, 90, 15, 3, 84, 114, 1685, 506, 18, 75, 6887, 1727, 750, 411, 12607, 267, 1322, 3, 144, 580, 4, 2373, 39, 833, 39, 1071, 814, 11, 6, 3, 1045, 1429, 134, 1, 244, 12999, 111, 938, 28, 2161, 15, 1028, 231, 21, 12, 73, 567, 100, 1, 1702, 8169, 222, 21, 14, 73, 8926, 14, 10, 194, 47, 141, 25, 74, 57, 51, 1, 3022, 410, 571, 180, 89, 1257, 53, 12, 73, 16, 3, 168, 659, 4, 663, 5121, 10787, 1544, 41, 18, 222, 40, 139, 1883, 130, 739, 4201, 14, 1, 12259, 3494, 911, 6, 142, 18, 61, 211, 3, 375, 4, 136, 1196, 57, 555, 229, 16817, 5, 40, 165, 3765, 8, 1, 972, 7, 7, 1, 341, 371, 2246, 307, 4, 4517, 19499, 518, 231, 134, 1, 175, 245, 2052, 759, 32, 1724, 531, 4, 926, 583, 3, 159, 633, 894, 717, 108, 50, 136, 16, 739, 4201, 14, 2178, 5, 2104, 43, 1727, 1203, 2225, 136, 1, 3789, 39, 157, 375, 4, 348, 2345, 583, 1, 134, 10, 59, 37, 5, 64, 11, 1724, 926, 10, 241, 21, 249, 10, 97, 866, 140, 3, 747, 286, 531, 602, 4, 4517, 19499, 14, 870, 1, 19, 44, 1957, 906, 16, 524, 8058, 7476, 1588, 2823, 10, 77, 132, 54, 50, 82, 71, 1, 2876, 1702, 2179, 299, 710, 84, 342, 364, 16, 1, 82, 104, 4517, 2279, 11, 301, 3108, 4, 270, 8, 1, 11307, 2365, 4, 899, 258, 10, 67, 101, 4, 773, 4, 430, 105, 71, 11, 35, 10, 11039, 195, 3, 114, 2485, 1, 202, 136, 23, 3, 114, 750, 469, 1, 1060, 6, 547, 21, 73, 2315, 39, 1071, 6, 4844, 60, 6, 3, 899, 14, 10, 11039, 11, 97, 25, 74, 3, 181, 49, 19, 45, 90, 2877, 7, 7, 1, 362, 1230, 23, 2652, 45, 161, 2087, 1, 113, 215, 84, 104, 55, 731, 2280, 714, 4311, 44, 298, 234, 9, 13, 3, 1319, 5, 320, 8, 11, 28, 55, 731, 2280, 588, 1339, 269, 151, 11308, 79, 28, 55, 731, 2280, 844, 2105, 269, 1816, 134, 2682, 1365, 844, 6, 345, 114, 5, 78, 47, 23, 955, 4, 82, 1076, 1586, 5, 165, 43, 15, 96, 7, 7, 4517, 19499, 6, 1, 88, 1685, 4, 1, 286, 4517, 105, 35, 227, 10, 420, 1, 1005, 493, 9, 57, 45, 33, 68, 3, 224, 706, 1, 362, 1898, 455, 149, 335, 148, 3, 19, 41, 3, 16116, 1702, 40, 1609, 27, 11, 354, 39, 12607, 1474, 31, 1, 4517, 5457]
                                    ], "ERROR: Text sequence expected to be List of List of word_id(int)"

    assert word_index['story'] == 62, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['of'] == 4, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['a'] == 3, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['man'] == 129, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['who'] == 34, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['has'] == 44, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['unnatural'] == 7576, 'Error: Train_text to sequence conversion does not match word_index dict'
    assert word_index['feelings'] == 1414, 'Error: Train_text to sequence conversion does not match word_index dict'

    print('Len Tokens ', len(word_index))
    # print('word_index ', word_index)

  def test_pad_text_to_constant_lenght_with_zero_padding(self):
    train_sequences, test_sequences = [ [ 2, 1, 10], [21] ], []
    padded_train_seq, padded_test_seq = TextPreProcessor.pad_text_to_constant_length(train_sequences, test_sequences, maxlen=5)
 
    assert list(padded_train_seq[0]) == [ 0,  0,  2,  1, 10], "ERROR in padding sequence"
    assert list(padded_train_seq[1]) == [ 0, 0, 0, 0, 21]   , "ERROR in padding sequence"
    padded_train_seq = [ list(padded_train_seq[0]),
                         list(padded_train_seq[1])
                       ]
    assert list(padded_train_seq) == [
                                [ 0, 0, 2, 1, 10],
                                [ 0, 0, 0, 0, 21]
                              ], "ERROR in padding sequence"
  
  def test_labels_to_categorical(self):
    _train_labels = [1, 1, 0]    
    _train_labels, _ = TextPreProcessor.binary_label_to_dummy_categorical(_train_labels, [0,1] )

    assert list(_train_labels[0]) == [ 0,  1], "ERROR in dummy encoding categorical variable"
    assert list(_train_labels[1]) == [ 0,  1], "ERROR in dummy encoding categorical variable"
    assert list(_train_labels[2]) == [ 1,  0], "ERROR in dummy encoding categorical variable"
    
class GloveEmbeddingMatrixTester(unittest.TestCase):

    def test_our_word_2_index_has_correct_corresponding_glove_vectors(self, our_word_2_index, glove_word2embedding_matrix, our_word2embedding_matrix):
        
        id_for_story = our_word_2_index['story']
        assert  our_word_2_index['story']==62, 'ERROR: Our word 2 index contains diff id. Check id and change  the asse'        
        assert list(our_word2embedding_matrix[id_for_story]) == list(glove_word2embedding_matrix['story']) , 'ERROR:Words have been assigned wrong vectors. There is some error during the vector assigning process'


train_texts = ImdbInputDataHandler.get_train_texts(IMDB_TRAIN_DIR)

"""
Input Tests 
"""
InputTester().test_input_text_must_be_list_of_text(train_texts)
word_index = tokenizer.word_index

"""
Text PreProcessor Tests 
"""
TextPreProcessorTester().test_get_text_to_sequence(tokenizer)
TextPreProcessorTester().test_get_text_to_sequence_2(tokenizer,  train_texts)
TextPreProcessorTester().test_correct_word_to_id_conversion(tokenizer, train_texts)
TextPreProcessorTester().test_pad_text_to_constant_lenght_with_zero_padding()
TextPreProcessorTester().test_labels_to_categorical()

"""
GloveEmbeddingMatrix Tests 
"""
GloveEmbeddingMatrixTester().test_our_word_2_index_has_correct_corresponding_glove_vectors(our_word_2_index,\
                    glove_word2embedding_matrix, our_word2embedding_matrix)

print('Unit Test Run completed !')

Len Tokens  88582
Unit Test Run completed !


## Print preview

In [19]:
# print('Sample')
print('train_text[0] \n\t', train_texts[0])
# print('train_text[0] converted to Sequence  \n\t', train_sequences[0])
# print('train_sequences[0:2] \n\t', train_sequences[0:2]) 

# print('train_padded_data[0] \n\t', train_padded_data[0])
# print('train_labels[0] \n\t', train_labels[0])
# print('\n train_padded_data[0:2] \n', train_padded_data[0:2])
# print('train_labels[0:2] \n\t', train_labels[0:2])

print('X_train.shape : ', X_train.shape) # 20000, 1000
print('X_valid.shape : ', X_valid.shape) # 5000,  1000
print('y_train.shape : ', y_train.shape) # 20000, 2
print('y_valid.shape : ', y_valid.shape) # 5000,  2

train_text[0] 
	 Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
X_train.shape :  (20000, 1000)
X_valid.shape :  (5000, 1000)
y_train.shape :  (20000, 2)
y_valid.shape :  (5000, 2)
