# 🔷 PART 2: Predictive Data Modeling 🔷

In this Jupyter notebook, we analyze the processed data through a **predictive** lens: we train and test machine learning models (and potentially deep learning or other advanced algorithms) on the segmented datasets to obtain a well-performing predictor.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the predictive data modeling notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for predictive analytics.

#### 2. [Section B: Data Processing & Finalization](#section-B)

    Data curation and preparation for directed predictive modeling.

#### 3. [Section C: Introductory Machine Learning](#section-C)

    Use of classical and basic machine learning algorithms to run predictive modeling.

#### 4. [Section D: Deep Learning & Advanced Modeling](#section-D)

    Use of deep learning and advanced statistical algorithms to run predictive modeling.

#### 5. [Appendix: Supplementary Custom Objects](#appendix)

    Custom Python object architectures used throughout the data predictions.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [46]:
# Numerical computing, data handling, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import array, argmax

# Label and one-hot encoders from scikit-learn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Keras text preprocessing utilities
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Custom Algorithmic Structures for Processed Data Visualization.

In [7]:
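import sys
import os

# Make the project's custom structures importable from this notebook
sys.path.append("../source/structures")

# TODO: Place custom structures from `../source/structures` here.

# Put the helper directory first on the path so its modules take precedence
sys.path.insert(0, os.path.abspath('../helper'))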


##### [(back to top)](#TOC)

---

## 🔹 Section B: Data Processing & Finalization <a name="section-B"></a>

### Tokenizer Function for Bilingual Pairs

In [42]:
from pickle import load

from keras.preprocessing.text import Tokenizer

def load_clean(filename):
    """Load a pickled dataset of cleaned sentence pairs."""
    with open(filename, 'rb') as f:
        return load(f)

def tokenize_words(lines):
    """Fit a Keras Tokenizer on the given lines of text."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def max_len(lines):
    """Return the word count of the longest line."""
    return max(len(line.split()) for line in lines)

# Each array holds sentence pairs: column 0 is English, column 1 is French
data = load_clean('../datasets/processed/eng-fra-both.pickle')
train = load_clean('../datasets/processed/eng-fra-train.pickle')
test = load_clean('../datasets/processed/eng-fra-test.pickle')
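
To make the tokenizer helpers concrete, here is a minimal pure-Python sketch of the kind of word index that `tokenize_words` yields, run on three made-up phrases. It is a simplified stand-in rather than Keras's implementation: `build_word_index` is a hypothetical helper that lowercases, splits on whitespace, and numbers words from 1 by descending frequency (breaking ties alphabetically, which is an assumption of this sketch), leaving index 0 free for padding.

```python
from collections import Counter

def build_word_index(lines):
    """Hypothetical stand-in for a word index: most frequent word gets index 1."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # Sort by descending count; ties are broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def max_len(lines):
    """Length, in whitespace-separated words, of the longest phrase."""
    return max(len(line.split()) for line in lines)

toy_lines = ["go away", "go now", "come here now"]  # made-up phrases
word_index = build_word_index(toy_lines)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)
print(vocab_size, max_len(toy_lines))
```

The `+ 1` mirrors the vocabulary-size computation used in the tokenizer cells: the indices run from 1 to the number of distinct words, so an array that can also hold the padding id 0 needs one extra slot.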


### English Tokenizer

In [24]:
engl_tokens = tokenize_words(data[:, 0])
# +1 because Keras word indices start at 1; index 0 is reserved for padding
eng_vocab_size = len(engl_tokens.word_index) + 1
# Longest English phrase, measured in words
eng_len = max_len(data[:, 0])

print("English Vocabulary size: {}".format(eng_vocab_size))
print("Max Length of English Phrase (words): {}".format(eng_len))

English Vocabulary size: 2912
Max Length of English Phrase (words): 5


### French Tokenizer

In [22]:
fra_tokens = tokenize_words(data[:, 1])
# +1 because Keras word indices start at 1; index 0 is reserved for padding
fra_vocab_size = len(fra_tokens.word_index) + 1
# Longest French phrase, measured in words
fra_len = max_len(data[:, 1])

print("French Vocabulary size: {}".format(fra_vocab_size))
print("Max Length of French Phrase (words): {}".format(fra_len))

French Vocabulary size: 5791
Max Length of French Phrase (words): 10


### Encode Input and Output to Ints/Pad to Max Phrase Length

In [40]:
def encode_input(tokenizer, length, lines):
    """Integer-encode lines and zero-pad each one to `length` tokens."""
    # Map each word to its integer index
    X = tokenizer.texts_to_sequences(lines)
    # Pad with trailing zeros up to the maximum phrase length
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
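
As an illustration of what the padding step in `encode_input` does, the sketch below post-pads a few made-up integer sequences with NumPy. `pad_post` is a hypothetical stand-in for `pad_sequences(..., padding='post')`, not the Keras function itself; how it trims over-long sequences (keeping the last `maxlen` ids) is an assumption of this sketch.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Hypothetical stand-in for post-padding: zeros are appended on the right."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # assumption: over-long sequences keep their last ids
        padded[row, :len(kept)] = kept
    return padded

toy_sequences = [[4, 7], [1, 2, 3], [9]]  # made-up word ids
X_toy = pad_post(toy_sequences, maxlen=4)
print(X_toy)
```

Every row ends up with the same length, which is what lets the encoded phrases be stacked into a single 2-D array for a model.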

In [48]:
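def encode_output(sequences, vocab_size):
    """One-hot encode padded integer sequences into shape (n, length, vocab_size)."""
    y_list = []
    for s in sequences:
        # One-hot encode every word index in this sequence
        encoded = to_categorical(s, num_classes=vocab_size)
        y_list.append(encoded)

    y = array(y_list)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y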
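
The shape that `encode_output` produces can be checked on a toy example. The sketch below uses plain NumPy rather than `to_categorical`; `one_hot` is a hypothetical helper, and the ids and vocabulary size are made up for illustration.

```python
import numpy as np

def one_hot(sequences, vocab_size):
    """Hypothetical NumPy version of one-hot encoding each integer id."""
    # Row i of the identity matrix is the one-hot vector for id i
    return np.eye(vocab_size, dtype="float32")[sequences]

toy_ids = np.array([[1, 3, 0], [2, 0, 0]])  # 2 phrases, padded to length 3
y_toy = one_hot(toy_ids, vocab_size=4)
print(y_toy.shape)
```

Note that the padding id 0 is also one-hot encoded (as a 1 in position 0), so every position in every phrase carries exactly one active class.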

### Prepare Data for Training and Testing

In [51]:
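# French phrases are the model input; English phrases are the target
X_train = encode_input(fra_tokens, fra_len, train[:, 1])
Y_train = encode_input(engl_tokens, eng_len, train[:, 0])
# One-hot encode the English targets over the English vocabulary
Y_train = encode_output(Y_train, eng_vocab_size)

# Apply the same encoding to the held-out test set
X_test = encode_input(fra_tokens, fra_len, test[:, 1])
Y_test = encode_input(engl_tokens, eng_len, test[:, 0])
Y_test = encode_output(Y_test, eng_vocab_size)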
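
Because the targets are one-hot encoded, turning a target (or, later, a model's output) back into words means taking the arg-max at each position and looking the id up in a reverse vocabulary. The sketch below shows that round trip on made-up data; `decode_one_hot` and the `index_word` mapping are hypothetical and stand in for whatever inverse of `word_index` is eventually used.

```python
import numpy as np

def decode_one_hot(y, index_word):
    """Hypothetical helper: map one (length, vocab_size) array back to a phrase."""
    ids = np.argmax(y, axis=-1)  # most likely id at each position
    # Skip id 0, which is the padding slot and has no word
    return " ".join(index_word[int(i)] for i in ids if i != 0)

index_word = {1: "go", 2: "away", 3: "now"}  # made-up reverse vocabulary
y_phrase = np.eye(4)[[1, 2, 0]]               # one-hot for "go away" plus one pad
decoded = decode_one_hot(y_phrase, index_word)
print(decoded)
```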

##### [(back to top)](#TOC)

---

## 🔹 Section C: Introductory Machine Learning <a name="section-C"></a>

##### [(back to top)](#TOC)

---

## 🔹 Section D: Deep Learning & Advanced Modeling <a name="section-D"></a>

##### [(back to top)](#TOC)

---

## 🔹 Appendix: Supplementary Custom Objects <a name="appendix"></a>

##### [(back to top)](#TOC)

---