# 🔷 PART 2: Predictive Data Modeling 🔷

In this Jupyter notebook, we analyze the processed data through a **predictive** lens: we train and test machine learning models (and potentially more advanced machine learning and/or deep learning algorithms) on segmented datasets to obtain a well-performing predictor.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the predictive data modeling notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for predictive analytics.

#### 2. [Section B: Data Processing & Finalization](#section-B)

    Data curation and preparation for directed predictive modeling.

#### 3. [Section C: Encoder-Decoder LSTM Model](#section-C)

    Use of a deep learning model to translate French phrases into English.

#### 4. [Section D: Model Evaluation](#section-D)

    Evaluating how well the model performs.

#### 5. [Appendix: Supplementary Custom Objects](#appendix)

    Custom Python object architectures used throughout the data predictions.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from numpy import array
from numpy import argmax
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import LSTM
from keras.utils import np_utils
from keras.utils.vis_utils import plot_model
from keras.preprocessing.sequence import pad_sequences
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.
  return f(*args, **kwds)


Custom Algorithmic Structures for Processed Data Visualization.

In [2]:
import sys
import os

sys.path.append("../source/structures")

# TODO: Place custom structures from `../source/structures` here.
sys.path.insert(0, os.path.abspath('../helper'))


##### [(back to top)](#TOC)

---

## 🔹 Section B: Data Processing & Finalization <a name="section-B"></a>

### Tokenizer Function for Bilingual Pairs

In [3]:
from pickle import load

from keras.preprocessing.text import Tokenizer

def load_clean(filename):
    """Load a pickled dataset from disk."""
    with open(filename, 'rb') as f:
        data = load(f)
    return data

def tokenize_words(lines):
    """Fit a Keras Tokenizer on the given lines of text."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def max_len(lines):
    """Return the word count of the longest line."""
    return max(len(line.split()) for line in lines)

data = load_clean('../datasets/processed/eng-fra-both.pickle')
train = load_clean('../datasets/processed/eng-fra-train.pickle')
test = load_clean('../datasets/processed/eng-fra-test.pickle')


### English Tokenizer

In [4]:
engl_tokens = tokenize_words(data[:, 0])
eng_vocab_size = len(engl_tokens.word_index) + 1
eng_len = max_len(data[:, 0])

print("English Vocabulary size: {}".format(eng_vocab_size))
print("Max Length of English Vocab: {}".format(eng_len))

English Vocabulary size: 2912
Max Length of English Vocab: 5


### French Tokenizer

In [5]:
fra_tokens = tokenize_words(data[:, 1])
fra_vocab_size = len(fra_tokens.word_index) + 1
fra_len = max_len(data[:, 1])

print("French Vocabulary size: {}".format(fra_vocab_size))
print("Max Length of French Vocab: {}".format(fra_len))
print("Shape of Input Vector: {}x{}".format(fra_vocab_size, fra_len))

French Vocabulary size: 5791
Max Length of French Vocab: 10
Shape of Input Vector: 5791x10


### Encode Input and Output to Ints/Pad to Max Phrase Length

In [6]:
def encode_input(tokenizer, length, lines):
    #integer encoding input
    X = tokenizer.texts_to_sequences(lines)
    #padding sequences with 0 to max length
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
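To make the two steps inside `encode_input` concrete, here is a minimal pure-Python sketch of integer encoding followed by post-padding. It is an illustration only: the helper names `fit_word_index` and `encode_and_pad` are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_word_index(lines):
    """Map each word to an integer id, most frequent word first."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # Id 0 is reserved for padding, so real words start at 1.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(word_index, length, lines):
    """Integer-encode each line, then right-pad with zeros to `length`."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.lower().split() if w in word_index]
        # Keep the last `length` ids, mirroring pad_sequences' default
        # truncating='pre' behaviour for over-long sequences.
        ids = ids[-length:]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

lines = ["je suis la", "je pars"]
word_index = fit_word_index(lines)
padded = encode_and_pad(word_index, 4, lines)
print(word_index)
print(padded)
```

Reserving id 0 for padding is also why the vocabulary sizes above are computed as `len(word_index) + 1`.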

In [7]:
def encode_output(sequences, vocab_size):
    y_list = []
    for s in sequences:
        encoded = to_categorical(s, num_classes=vocab_size)
        y_list.append(encoded)
    
    y = array(y_list)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
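As a rough illustration of what `encode_output` produces, the sketch below one-hot encodes padded integer sequences in plain Python. The function name `one_hot_sequences` and the toy input are assumptions made for the example; the notebook itself relies on Keras' `to_categorical`.

```python
def one_hot_sequences(sequences, vocab_size):
    """Replace every integer id with a vocab_size-long 0/1 vector."""
    encoded = []
    for seq in sequences:
        rows = []
        for token_id in seq:
            row = [0] * vocab_size
            row[token_id] = 1  # a single 1 at the position of the word id
            rows.append(row)
        encoded.append(rows)
    return encoded

# Two padded sequences of length 3 over a toy vocabulary of size 4.
toy = [[1, 3, 0], [2, 0, 0]]
y = one_hot_sequences(toy, 4)
shape = (len(y), len(y[0]), len(y[0][0]))
print(shape)  # (samples, time steps, vocabulary size)
```

The resulting three-dimensional layout matches what the `reshape` call in `encode_output` enforces, and it is the shape that the `categorical_crossentropy` loss compares against the model's per-step softmax output.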

### Prepare Data for Training and Testing

In [8]:
X_train = encode_input(fra_tokens, fra_len, train[:, 1])
Y_train = encode_input(engl_tokens, eng_len, train[:, 0])
Y_train = encode_output(Y_train, eng_vocab_size)

X_test = encode_input(fra_tokens, fra_len, test[:, 1])
Y_test = encode_input(engl_tokens, eng_len, test[:, 0])
Y_test = encode_output(Y_test, eng_vocab_size)

### Handy Function to Convert Logits to Text 

This function converts the neural network's output into readable text so that predictions can be inspected. Credit to Tommy Tracey for the code!

In [9]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.
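The decoding step in `logits_to_text` can be sketched without NumPy or Keras as follows. The vocabulary and scores are made up for illustration, and `scores_to_text` is a hypothetical stand-in for the function above. Because the model defined later ends in a softmax layer, the values it receives are probabilities rather than raw logits; the argmax is the same either way.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def scores_to_text(scores, word_index):
    """Pick the highest-scoring word id at each time step and join the words."""
    index_to_word = {i: w for w, i in word_index.items()}
    index_to_word[0] = '<PAD>'
    return ' '.join(index_to_word[argmax(row)] for row in scores)

# Toy vocabulary and one sample with 3 time steps (columns: PAD, i, am, here).
word_index = {'i': 1, 'am': 2, 'here': 3}
scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.9, 0.0, 0.05, 0.05],
]
text = scores_to_text(scores, word_index)
print(text)
```

Note that `logits_to_text` takes the argmax over axis 1, so it expects a single sample of shape (time steps, vocabulary size); with a batch from `model.predict`, one sample would be passed at a time, for example `logits_to_text(predictions[0], engl_tokens)`.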


##### [(back to top)](#TOC)

---

## 🔹 Section C: Encoder-Decoder LSTM Model <a name="section-C"></a>

### Creating the Model

In [10]:
def create_model(input_vocab, output_vocab, input_time_steps, targ_time_steps, n_units):
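    """
    Build an encoder-decoder LSTM for sequence-to-sequence translation.

    The encoder embeds the input tokens and compresses them into a single
    n_units vector; RepeatVector copies that vector once per target time step;
    the decoder LSTM and a time-distributed softmax layer then predict one
    output word per step.
    """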
    model = Sequential()
    model.add(Embedding(input_vocab, n_units, input_length=input_time_steps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(targ_time_steps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(output_vocab, activation='softmax')))
    return model
    

### Defining The Model

In [11]:
model = create_model(fra_vocab_size, eng_vocab_size, fra_len, eng_len, 200)
model.compile(optimizer='adam', loss='categorical_crossentropy')

#summarize model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 10, 200)           1158200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               320800    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 5, 200)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 5, 200)            320800    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 5, 2912)           585312    
Total params: 2,385,112
Trainable params: 2,385,112
Non-trainable params: 0
_________________________________________________________________
None
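The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes reported earlier in this notebook; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, units):
    # One units-long vector per vocabulary entry.
    return vocab_size * units

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return (input_dim + 1) * output_dim

fra_vocab_size, eng_vocab_size, n_units = 5791, 2912, 200

counts = {
    'embedding_1': embedding_params(fra_vocab_size, n_units),
    'lstm_1': lstm_params(n_units, n_units),
    'lstm_2': lstm_params(n_units, n_units),
    'time_distributed_1': dense_params(n_units, eng_vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print('total', total)
```

`RepeatVector` has no weights, which is why it contributes 0 parameters in the summary.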


##### [(back to top)](#TOC)

---

### Training the Model

In [12]:
filename = "model.h5"
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(X_train, Y_train, epochs=30, batch_size=64, validation_data=(X_test, Y_test), callbacks=[checkpoint], verbose=2)

Train on 12000 samples, validate on 3000 samples
Epoch 1/30
 - 39s - loss: 4.3869 - val_loss: 4.5758

Epoch 00001: val_loss improved from inf to 4.57584, saving model to model.h5
Epoch 2/30
 - 36s - loss: 3.4302 - val_loss: 4.4892

Epoch 00002: val_loss improved from 4.57584 to 4.48918, saving model to model.h5
Epoch 3/30
 - 40s - loss: 3.2662 - val_loss: 4.4930

Epoch 00003: val_loss did not improve from 4.48918
Epoch 4/30
 - 39s - loss: 3.1116 - val_loss: 4.4478

Epoch 00004: val_loss improved from 4.48918 to 4.44775, saving model to model.h5
Epoch 5/30
 - 36s - loss: 2.9405 - val_loss: 4.2921

Epoch 00005: val_loss improved from 4.44775 to 4.29206, saving model to model.h5
Epoch 6/30
 - 36s - loss: 2.7860 - val_loss: 4.2787

Epoch 00006: val_loss improved from 4.29206 to 4.27866, saving model to model.h5
Epoch 7/30
 - 35s - loss: 2.6469 - val_loss: 4.1691

Epoch 00007: val_loss improved from 4.27866 to 4.16909, saving model to model.h5
Epoch 8/30
 - 36s - loss: 2.5137 - val_loss: 4.

<keras.callbacks.History at 0x1a622a04a8>
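The "improved" and "did not improve" messages in the log follow from the `save_best_only=True, mode='min'` setting. As a rough sketch of that rule (not the Keras implementation itself), the snippet below replays the seven complete `val_loss` values from the log and records which epochs would trigger a save; `epochs_that_save` is a hypothetical helper.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs whose val_loss beats every earlier epoch."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # mode='min': lower is better
            best = loss
            saved.append(epoch)
    return saved

# val_loss for epochs 1-7, copied from the training log above.
val_losses = [4.57584, 4.48918, 4.4930, 4.44775, 4.29206, 4.27866, 4.16909]
saved = epochs_that_save(val_losses)
print(saved)
```

Epoch 3 is the only one skipped, matching the single "did not improve" line in the log.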

## 🔹 Section D: Model Evaluation <a name="section-D"></a>

In [19]:
# Print prediction(s)
print(model.predict(X_test, verbose=0))

[[[5.3949616e-06 3.9550187e-06 8.8687672e-04 ... 1.9808064e-09
   1.9636830e-09 1.2631832e-09]
  [6.4501633e-05 1.2913855e-05 7.5000986e-05 ... 2.6966077e-08
   2.3724809e-08 1.7406707e-08]
  [5.5918777e-01 7.4381511e-08 1.9888962e-03 ... 2.1655049e-08
   1.5560998e-08 1.0444708e-08]
  [9.9992716e-01 1.3637837e-10 7.3377052e-08 ... 4.9240564e-12
   2.4985623e-12 1.8299058e-12]
  [9.9997509e-01 4.8153457e-11 7.1639668e-09 ... 1.3429669e-12
   6.2668073e-13 4.2473520e-13]]

 [[2.2723200e-06 5.7536447e-08 9.4584556e-07 ... 8.0793694e-10
   7.0415640e-10 6.2333538e-10]
  [9.1253500e-03 1.5286450e-06 1.3363203e-07 ... 1.1299251e-07
   8.9393730e-08 9.3803436e-08]
  [9.9759299e-01 4.2207129e-10 1.4012790e-09 ... 2.5091748e-10
   1.7342135e-10 1.5353793e-10]
  [9.9992812e-01 2.7022638e-11 7.7491513e-11 ... 4.7471250e-12
   2.8146179e-12 2.6336899e-12]
  [9.9995482e-01 1.7271412e-11 1.3468428e-10 ... 1.5378096e-12
   9.3945804e-13 8.3484169e-13]]

 [[3.7028822e-06 4.6479195e-06 1.6592587e-05 .
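Raw probability tensors are hard to judge by eye. One possible way to summarize them, shown here only as a hedged sketch and not as the notebook's own evaluation method, is token-level accuracy that ignores padding positions. The function `token_accuracy` and the toy id sequences are assumptions; in practice the ids would come from taking the argmax over the last axis of the predictions and of `Y_test`.

```python
def token_accuracy(predicted_ids, target_ids, pad_id=0):
    """Fraction of non-padding target tokens that were predicted correctly."""
    correct = 0
    total = 0
    for pred_seq, targ_seq in zip(predicted_ids, target_ids):
        for p, t in zip(pred_seq, targ_seq):
            if t == pad_id:
                continue  # padding positions do not count toward the score
            total += 1
            if p == t:
                correct += 1
    return correct / total if total else 0.0

# Toy example: two sequences of length 5, padded with 0.
predicted = [[4, 7, 0, 0, 0], [2, 9, 5, 0, 0]]
target = [[4, 7, 1, 0, 0], [2, 8, 5, 0, 0]]
acc = token_accuracy(predicted, target)
print(round(acc, 3))
```

Skipping padded positions matters here because the English phrases are at most 5 words long, so many target positions are padding and would otherwise inflate the score.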

##### [(back to top)](#TOC)

---

## 🔹 Appendix: Supplementary Custom Objects <a name="appendix"></a>

See the `../helper` folder for the custom Python functions used.


##### [(back to top)](#TOC)

---