## Description

In the past notebook it was found that some input variables were pushing the predictions to negative values. Having identified them, one more attempt will be made to predict emissions by implementing some new techniques that have come to light by reading even more literature. 

1) Maybe using the **MinMaxScaler** wasn't the best option. Using the **Standard Scaler** yields data with mean of **0** and variance of **1**. Will this have an impact in the algorithm?

2) Having applied a standard scaler to the data, PCA (Principal Component Analysis) will be implemented to see if this algorithm can identify the most important input variables.

    1) It's important to take into account the results of the past notebook. The variables that pushed the predictions to be negative were:
    
        1) Fuel Code
        2) Drive System
        3) Peak_pos (Positive acceleration peaks)
        4) ETW (Estimated Test Weight)
        5) a_pos (Positive acceleration mean)
        6) HP (Horse Power)

The models that will be used during this notebook will be: 

* **MAPE + Leaky ReLU + 20% dropout + linear output**
* **MAPE + Leaky ReLU + 10% dropout + linear output**

This is in hopes of trying to keep everything as close as possible between notebooks and find the error in the prediction process. After this notebook has been implemented, another smaller hyper-parameter search can be done to see if predictions can be improved. 

For the sake of continuity, only **HC** will be predicted in this run. If something promising is found, other pollutants will be attempted.

## TO DO's

Data Scaling
* Import scaled data
* Import scalers
* Inverse scale data
* Create new scalers using **Standard Scaler**
* Scale the data 

Data Sets
* Create the data sets the same way as before (without shuffling to keep the same order)

PCA
* Apply different instances of PCA keeping a different amount of variables.
    * The PCA function receives one input: n_componentes
        * If **0 < n_components < 1**, this number represents the minimum amount of variance that needs to be explained by the selected components
        * If **1 $\leq$ n_components $\leq$ number of input parameters**, this number represents the number of components to be kept
* Explain the following amount of variance:
    * 99%
    * 95%
    * 90%
    * 80%
* Create a new training set according to the PCA instances

Machine Learning
* Create models for each training set
* Train
* Predict
* Calculate error
* Rank models
    


In [1]:
from keras.models import Sequential, load_model, Model
from keras.layers import Input, Dense, Dropout, advanced_activations, BatchNormalization, LeakyReLU
from keras import losses, optimizers, activations
import keras.backend as K

import h5py

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.externals import joblib
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import time
import datetime
import os

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
output_path = os.path.join('.','output')
minmax_scaler_path = os.path.join('.','Scalers','MinMax')
standard_scaler_path = os.path.join('.','Scalers','Standard')

## Load Original Scaled Data

In [3]:
complete_data_scaled_shuffled = pd.read_csv('Dataset_Scaled_Shuffled.csv')
print('Shuffled dataset loaded.')

Shuffled dataset loaded.


## Load MinMax Scalers

In [4]:
# Create an empty list to put all the scalers
minmax_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(minmax_scaler_path,'scaler{}.save'.format(i))
    minmax_scaler = joblib.load(scaler_filename)
    
    minmax_scalers.append(minmax_scaler)

## Inverse Scale Data

In [5]:
# First, inverse transform all original values from the test_set
original_data_inverse = complete_data_scaled_shuffled.copy()

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = original_data_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    original_data_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## Standard Scaler Operation

In [6]:
# Create an empty list to put all the scalers
standard_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    standard_scaler = StandardScaler()
    
    standard_scalers.append(standard_scaler)
    
# Copy the data set to avoid altering the original
new_data_scaled = original_data_inverse.copy()
    
# Loop over the standard_scalers and perform the scaling operation on each column
for i in range(len(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = new_data_scaled[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    new_data_scaled[col_name] = standard_scalers[i].fit_transform(values)
    
    print('Success with feature: {}'.format(col_name))
    
# Export scalers for later use
for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(standard_scaler_path,'scaler{}.save'.format(i))
    standard_scaler = standard_scalers[i]
    joblib.dump(standard_scaler, scaler_filename) 

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## Prepare Data

In [7]:
# Get number of data points
data_points = complete_data_scaled_shuffled.shape[0]

# Set sizes for train, dev, test sets
train_percent = 0.8
train_size = round(train_percent*data_points)

if (data_points-train_size)%2 == 0:
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    
else:
    train_size = train_size-1
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size 
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))

Train Size = 62511
Dev Size = 7814
Test Size = 7814
Remainder = 0


In [8]:
# Create a local copy of the entire NEW dataset
data_scaled_shuffled = new_data_scaled.copy()

print('Preparing Data-sets')
# Divide data into train, dev, and test sets
train_set = data_scaled_shuffled[ : train_size]
dev_set = data_scaled_shuffled[train_size : train_size+dev_size]
test_set = data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]

# Reset index for all sets
train_set = train_set.reset_index(drop=True)
dev_set = dev_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)

# Get values
train_set_values = train_set.values
dev_set_values = dev_set.values
test_set_values = test_set.values

# Number of emissions: HC, CO, CO2, NOX
n_out = 4

print('Splitting into inputs and outputs')
# SLICING: [start row:end row , start column:end column]
# Split into inputs and outputs
original_x_train = train_set_values[:,:-n_out]
original_x_dev = dev_set_values[:,:-n_out]
original_x_test = test_set_values[:,:-n_out]

# Get the outputs (only HC)
HC_train = train_set_values[:,-n_out]
HC_dev = dev_set_values[:,-n_out]
HC_test = test_set_values[:,-n_out]

print('Data-sets complete')
print('----------------------------------')

Preparing Data-sets
Splitting into inputs and outputs
Data-sets complete
----------------------------------


In [9]:
# Inverse transform the TEST DATA to be able to calculate the error further down
test_set_scaled = complete_data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]
test_set_inverse = test_set_scaled.copy()

for i in range(np.size(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = test_set_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    test_set_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## PCA

Create a function that creates a PCA instance and with that PCA instance creates a new x_train

In [10]:
def create_pca_set(variance_amount):
    
    print('Create PCA Instance')
    pca = PCA(variance_amount)
    
    print('Fit PCA Instance')
    pca.fit(original_x_train)
    print('Number of Components = {}'.format(pca.n_components_))
    
    print('Create New Input Training Set')
    new_x_train = pca.transform(original_x_train)
    
    print('Create New Input Dev Set')
    new_x_dev = pca.transform(original_x_dev)
    
    print('Create New Input Test Set')
    new_x_test = pca.transform(original_x_test)
    
    print('----------------------------------')
    
    return new_x_train, new_x_dev, new_x_test

In [11]:
pca_options = [0.99,0.95,0.90,0.80]

-----------------
## Models

#### Basics

In [12]:
# Mini-batch size, epochs
batch_size = 64
epochs = 300
dropouts = [0.1, 0.2]

#### Build Model

In [13]:
# Build model with MAPE, Leaky ReLU, Adam
def build_model(number, x_train, dd):
    
    # Create model
    model = Sequential(name='Model_{}'.format(number))

    model.add(Dense(256, input_dim=x_train.shape[1]))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(128))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(64))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(32))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(16))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(1))

    #Compile model
    model.compile(loss=losses.mean_absolute_percentage_error, optimizer=optimizers.Adam(), metrics = ['accuracy'])
    
    print('{} Created'.format(model.name))
    print('----------------------------------')
    
    return model

#### Train Model

In [14]:
def train_models(model, x_train, y_train, x_dev, y_dev):
    
    print('{} - Training'.format(model.name))
    print('- Started on {} at {}'.format(str(datetime.datetime.now())[5:-16], str(datetime.datetime.now())[11:-10]))
    # Start timer
    start_time = time.time()

    # fit network
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, 
                        validation_data=(x_dev, y_dev), verbose=0, shuffle=True)

    # End timer
    end_time = time.time() - start_time
    print('{} - Training Complete'.format(model.name))
    print('- Time: {:.3f} min'.format(end_time/60))
    print('- Loss = {:.5f}'.format(history.history['loss'][-1]))
    print('- Val Loss = {:.5f}'.format(history.history['val_loss'][-1]))
    print('----------------------------------')
        
    return history

#### Make Predictions and Calculate Error

In [15]:
# Function to define MSPE
def msp_error(true,pred):
    error = 100*np.sum(((true-pred)/true)**2)/np.size(true)
    return error

In [16]:
def predict_get_error(model, x_test):
    
    print('Predicting with {}'.format(model.name))
    scaled_predictions = model.predict(x_test)
    
    print('Inverse Scaling Operation') 
     
    # Inverse the scaling operation on the predictions
    predictions = standard_scalers[-4].inverse_transform(scaled_predictions)
    
    print('- Prediction Mean = {:.5f}'.format(np.mean(predictions)))
    print('- Prediction Min = {:.5f}'.format(np.min(predictions)))
    print('- Prediction Max = {:.5f}'.format(np.max(predictions)))

    print('Calculating HC Error')
    mspe = msp_error(test_set_inverse['HC'].values, predictions)
        
    print('- HC Error  = {:.2e}'.format(mspe))
    print('----------------------------------')
    
    return mspe

#### Process Models and Rank with MSPE

In [17]:
def process_models():
    
    count = 1
    model_list = []
    history_list = []
    HC_error_list = []

    for variance_percent in pca_options:
        
        for dd in dropouts:

            # Print model variables
            print('Model_{} Variables:'.format(count))
            print('- Loss: MAPE')
            print('- Activation: Leaky ReLU')
            print('- Optimizer: Adam')
            print('- Dropout: {}%'.format(dd*100))
            print('- Variance Explained: {}'.format(variance_percent))
            print('----------------------------------')

            # Get the INPUT datasets after PCA
            x_train, x_dev, x_test = create_pca_set(variance_percent)

            # Get OUTPUT datasets
            y_train = HC_train
            y_dev = HC_dev
            y_test = HC_test

            # Create model
            model = build_model(count, x_train, dd)

            # Train model
            history = train_models(model, x_train, y_train, x_dev, y_dev)
            history_list.append(history)

            # Make predictions and calculate error
            error = predict_get_error(model, x_test)

            # Add error to error list
            HC_error_list.append([model.name, variance_percent, error])

            # Announce one model process ended
            print('============== MODEL {} PROCESS END =============='.format(count))
            print(' ')

            # Increase counter by 1
            count = count+1

            # Add TRAINED model to list
            model_list.append(model)

    print('Creating DataFrame')                
    HC_error = pd.DataFrame(HC_error_list)

    print('Changing DataFrame column names')
    HC_error.columns = ['Model', 'Variance Explained', 'MSPE']

    print('Ranking Models')
    HC_error.sort_values(by=['MSPE'], inplace=True)

    count = 0
    
    return HC_error, model_list, history_list

In [18]:
HC_ranking, models, histories = process_models()

Model_1 Variables:
- Loss: MAPE
- Activation: Leaky ReLU
- Optimizer: Adam
- Dropout: 10.0%
- Variance Explained: 0.99
----------------------------------
Create PCA Instance
Fit PCA Instance
Number of Components = 12
Create New Input Training Set
Create New Input Dev Set
Create New Input Test Set
----------------------------------
Model_1 Created
----------------------------------
Model_1 - Training
- Started on 04-02 at 17:23
Model_1 - Training Complete
- Time: 41.832 min
- Loss = 83.95899
- Val Loss = 88.66776
----------------------------------
Predicting with Model_1
Inverse Scaling Operation
- Prediction Mean = 0.04011
- Prediction Min = -0.05192
- Prediction Max = 0.05140
Calculating HC Error
- HC Error  = 6.05e+10
----------------------------------
 
Model_2 Variables:
- Loss: MAPE
- Activation: Leaky ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Variance Explained: 0.99
----------------------------------
Create PCA Instance
Fit PCA Instance
Number of Components = 12
Create New Input

In [22]:
HC_ranking

Unnamed: 0,Model,Variance Explained,MSPE
2,Model_3,0.95,60479310000.0
0,Model_1,0.99,60489970000.0
1,Model_2,0.99,60914720000.0
5,Model_6,0.9,61631290000.0
4,Model_5,0.9,62065160000.0
7,Model_8,0.8,63839610000.0
6,Model_7,0.8,64769020000.0
3,Model_4,0.95,66338110000.0


In [21]:
epoch_vector=np.linspace(1,epochs,epochs)

for i in range(len(models)):
    model = models[i]
    history = histories[i]
    
    model.save(os.path.join(output_path,'{}'.format(model.name)))
    
    hist_data =[epoch_vector,history.history['loss'],history.history['val_loss']]
    hist_data =pd.DataFrame(hist_data).transpose()
    hist_data.columns=['Epochs','loss','val_loss']
    
    hist_data.to_csv(os.path.join(output_path,'Training_History_{}.csv'.format(model.name)),index=False)

## Next Steps

Applying PCA made the predictions mostly positive (with some negatives), but did NOT improve the error. 

Using 20% dropout generally made the difference between the training loss and the validation loss smaller. 

The **LAST** attempt to get something will be:
* Remove **all** the variables that made the predictions negative (found in the previous notebook)
* Apply MinMax and Standard Scalers to compare the performance
* Apply PCA to that data-set
* Train models for the PCA-based data-sets and **one** model that uses the original inputs
    * Make predictions
    * Calculate error 
    * Rank models
    * See if anything makes sense