## Description

In the past notebook, PCA was applied to see if there was a better combination of parameters that could improve the predictions. PCA was applied to the original inputs (18 of them), and PCA models that explained 99%, 95%, 90%, and 80% were tried out. The predictions had a positive **mean** but the minimum was still negative. Additionally, the errors were still through the roof. 

For that notebook MinMax Scaler was replaced by Standard Scaler. In this notebook, both will be used to compare their performance. 

Here **all** the variables that were found to be making the predictions negative in notebook **1_03** will be removed. Then PCA will be applied to explain 99%, 98%, 95%, and 90% of the output variance.

Additionally, two extra models will be trained using the original data without PCA (but with the problematic variables removed. 

The architecture will be: 

* **MAPE + Leaky ReLU + 20% dropout + linear output**


The variables that pushed the predictions to be negative were:

    1) Fuel Code
    2) Drive System
    3) Peak_pos (Positive acceleration peaks)
    4) ETW (Estimated Test Weight)
    5) a_pos (Positive acceleration mean)
    6) HP (Horse Power)

After this notebook has been implemented, another smaller hyper-parameter search can be done to see if predictions can be improved. **IF** something in this notebook improves predictions

For the sake of continuity, only **HC** will be predicted in this run. If something promising is found, other pollutants will be attempted.

-------

## TO DO's

Data Scaling
* Import scaled data
* Import scalers (MinMax)
* Inverse scale data
* Impor scalers (Standard)
* Scale the data 

Data Sets
* Create the data sets the same way as before (without shuffling to keep the same order)

PCA
* Apply different instances of PCA keeping a different amount of variables.
    * The PCA function receives one input: n_componentes
        * If **0 < n_components < 1**, this number represents the minimum amount of variance that needs to be explained by the selected components
        * If **1 $\leq$ n_components $\leq$ number of input parameters**, this number represents the number of components to be kept
* Explain the following amount of variance:
    * 99%
    * 98%
    * 95%
    * 90%
* Create a new training set according to the PCA instances

Machine Learning
* Create models for each training set
* Train
* Predict
* Calculate error
* Rank models
    


In [1]:
from keras.models import Sequential, load_model, Model
from keras.layers import Input, Dense, Dropout, advanced_activations, BatchNormalization, LeakyReLU
from keras import losses, optimizers, activations
import keras.backend as K

import h5py

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.externals import joblib
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import time
import datetime
import os

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
output_path = os.path.join('.','output')
minmax_scaler_path = os.path.join('.','Scalers','MinMax')
standard_scaler_path = os.path.join('.','Scalers','Standard')

## Load Original Scaled Data

In [3]:
complete_data_scaled_shuffled = pd.read_csv('Dataset_Scaled_Shuffled.csv')
print('Shuffled dataset loaded.')

Shuffled dataset loaded.


## Load MinMax Scalers

In [4]:
# Create an empty list to put all the scalers
minmax_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(minmax_scaler_path,'scaler{}.save'.format(i))
    minmax_scaler = joblib.load(scaler_filename)
    
    minmax_scalers.append(minmax_scaler)

## Load Standard Scalers

In [5]:
# Create an empty list to put all the loaded scalers
standard_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(standard_scaler_path,'scaler{}.save'.format(i))
    standard_scaler = joblib.load(scaler_filename)
    
    standard_scalers.append(standard_scaler)

## Inverse Scale Data

Using the original MinMax

In [6]:
# First, inverse transform all original values from the test_set
original_data_inverse = complete_data_scaled_shuffled.copy()

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = original_data_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    original_data_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## Scale Data

Function to scale data according to an input that decides if it is a MinMax or a Standard Scaler

In [7]:
def scale_data(scaler_type):
    
    # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        print('Using {} Scalers'.format(scaler_type))
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers
        print('Using {} Scalers'.format(scaler_type))
        
    # Scale the data
    
    # Copy the data set to avoid altering the original
    new_data_scaled = original_data_inverse.copy()
    
    # DROP THE VARIABLES WE DON'T WANT
    new_data_scaled.drop(columns=['Fuel_Code', 'Drive_System_Code', 'Peak_pos', 'Peak_neg', 'ETW', 'a_pos', 'a_neg', 'HP'], 
                     inplace=True)
    
    # Loop over the standard_scalers and perform the scaling operation on each column
    for i in range(len(new_data_scaled.columns)):

        col_name = new_data_scaled.columns[i]

        values = new_data_scaled[col_name].values
        values = values.astype('float64')
        values = values.reshape(values.shape[0],1)

        new_data_scaled[col_name] = scalers[i].fit_transform(values)

        print('Success with feature: {}'.format(col_name))
    

    return new_data_scaled

In [8]:
scaler_types = ['MinMax', 'Standard']

In [9]:
data_standard_scaled = scale_data('Standard')

Using Standard Scalers
Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


In [10]:
data_minmax_scaled = scale_data('MinMax')

Using MinMax Scalers
Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## Prepare Data

Function to prepare data based on the scaler that was used

In [11]:
# Get number of data points
data_points = complete_data_scaled_shuffled.shape[0]

# Set sizes for train, dev, test sets
train_percent = 0.8
train_size = round(train_percent*data_points)

if (data_points-train_size)%2 == 0:
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    
else:
    train_size = train_size-1
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size 
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))

Train Size = 62511
Dev Size = 7814
Test Size = 7814
Remainder = 0


In [12]:
def create_data_sets(scaler_type):
    
    if scaler_type == 'MinMax':
        
        new_data_scaled = data_minmax_scaled
        
    if scaler_type == 'Standard':
        
        new_data_scaled = data_standard_scaled
    
    # Create a local copy of the entire NEW dataset
    data_scaled_shuffled = new_data_scaled.copy()

    print('Preparing Data-sets')
    # Divide data into train, dev, and test sets
    train_set = data_scaled_shuffled[ : train_size]
    dev_set = data_scaled_shuffled[train_size : train_size+dev_size]
    test_set = data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]

    # Reset index for all sets
    train_set = train_set.reset_index(drop=True)
    dev_set = dev_set.reset_index(drop=True)
    test_set = test_set.reset_index(drop=True)

    # Get values
    train_set_values = train_set.values
    dev_set_values = dev_set.values
    test_set_values = test_set.values

    # Number of emissions: HC, CO, CO2, NOX
    n_out = 4

    print('Splitting into inputs and outputs')
    # SLICING: [start row:end row , start column:end column]
    # Split into inputs and outputs
    x_train = train_set_values[:,:-n_out]
    x_dev = dev_set_values[:,:-n_out]
    x_test = test_set_values[:,:-n_out]

    # Get the outputs (only HC)
    HC_train = train_set_values[:,-n_out]
    HC_dev = dev_set_values[:,-n_out]
    HC_test = test_set_values[:,-n_out]

    print('Data-sets complete')
    print('----------------------------------')
    
    return x_train, x_dev, x_test, HC_train, HC_dev, HC_test

In [13]:
# Inverse transform the TEST DATA to be able to calculate the error further down
test_set_scaled = complete_data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]
test_set_inverse = test_set_scaled.copy()

for i in range(np.size(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = test_set_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    test_set_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## PCA

Create a function that creates a PCA instance and with that PCA instance creates a new x_train

In [14]:
def create_pca_set(variance_amount, x_train, x_dev, x_test):
    
    if variance_amount < 1:
    
        print('Create PCA Instance')
        pca = PCA(variance_amount)

        print('Fit PCA Instance')
        pca.fit(x_train)
        print('Number of Components = {}'.format(pca.n_components_))

        print('Create New Input Training Set')
        new_x_train = pca.transform(x_train)

        print('Create New Input Dev Set')
        new_x_dev = pca.transform(x_dev)

        print('Create New Input Test Set')
        new_x_test = pca.transform(x_test)
        
    else:
        
        new_x_train = x_train
        new_x_dev = x_dev
        new_x_test = x_test

    print('----------------------------------')
    
    return new_x_train, new_x_dev, new_x_test

In [15]:
pca_options = [1, 0.99,0.98,0.95,0.90]

-----------------
## Models

#### Basics

In [16]:
# Mini-batch size, epochs
batch_size = 64
epochs = 300
dd = 0.2

#### Build Model

In [17]:
# Build model with MAPE, Leaky ReLU, Adam
def build_model(number, x_train):
    
    # Create model
    model = Sequential(name='Model_{}'.format(number))

    model.add(Dense(256, input_dim=x_train.shape[1]))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(128))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(64))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(32))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(16))
    model.add(advanced_activations.LeakyReLU())
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(1))

    #Compile model
    model.compile(loss=losses.mean_absolute_percentage_error, optimizer=optimizers.Adam(), metrics = ['accuracy'])
    
    print('{} Created'.format(model.name))
    print('----------------------------------')
    
    return model

#### Train Model

In [18]:
def train_models(model, x_train, y_train, x_dev, y_dev):
    
    print('{} - Training'.format(model.name))
    print('- Started on {} at {}'.format(str(datetime.datetime.now())[5:-16], str(datetime.datetime.now())[11:-10]))
    # Start timer
    start_time = time.time()

    # fit network
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, 
                        validation_data=(x_dev, y_dev), verbose=0, shuffle=True)

    # End timer
    end_time = time.time() - start_time
    print('{} - Training Complete'.format(model.name))
    print('- Time: {:.3f} min'.format(end_time/60))
    print('- Loss = {:.5f}'.format(history.history['loss'][-1]))
    print('- Val Loss = {:.5f}'.format(history.history['val_loss'][-1]))
    print('----------------------------------')
        
    return history

#### Make Predictions and Calculate Error

In [19]:
# Function to define MSPE
def msp_error(true,pred):
    error = 100*np.sum(((true-pred)/true)**2)/np.size(true)
    return error

In [20]:
def predict_get_error(model, x_test, scaler_type):
    
        # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers  
    
    #------------------
    print('Predicting with {}'.format(model.name))
    scaled_predictions = model.predict(x_test)
    
    print('Inverse Scaling Operation') 
     
    # Inverse the scaling operation on the predictions
    predictions = scalers[-4].inverse_transform(scaled_predictions)
    
    print('- Prediction Mean = {:.5f}'.format(np.mean(predictions)))
    print('- Prediction Min = {:.5f}'.format(np.min(predictions)))
    print('- Prediction Max = {:.5f}'.format(np.max(predictions)))

    print('Calculating HC Error')
    mspe = msp_error(test_set_inverse['HC'].values, predictions)
        
    print('- HC Error  = {:.2e}'.format(mspe))
    print('----------------------------------')
    
    return mspe

#### Process Models and Rank with MSPE

In [21]:
def process_models():
    
    count = 1
    model_list = []
    history_list = []
    HC_error_list = []

    for variance_percent in pca_options:
        
        for scaler_type in scaler_types:

            # Print model variables
            print('Model_{} Variables:'.format(count))
            print('- Loss: MAPE')
            print('- Activation: Leaky ReLU')
            print('- Optimizer: Adam')
            print('- Dropout: {}%'.format(dd*100))
            print('- Variance Explained: {}'.format(variance_percent))
            print('- Scaler: {}'.format(scaler_type))
            print('----------------------------------')
            
            # Create the inputs and outputs
            orginal_x_train, orginal_x_dev, orginal_x_test, y_train, y_dev, y_test = create_data_sets(scaler_type)

            # Get the INPUT datasets after PCA
            x_train, x_dev, x_test = create_pca_set(variance_percent, orginal_x_train, orginal_x_dev, orginal_x_test)

            # Create model
            model = build_model(count, x_train)

            # Train model
            history = train_models(model, x_train, y_train, x_dev, y_dev)
            history_list.append(history)

            # Make predictions and calculate error
            error = predict_get_error(model, x_test, scaler_type)

            # Add error to error list
            HC_error_list.append([model.name, variance_percent, scaler_type, error])

            # Announce one model process ended
            print('============== MODEL {} PROCESS END =============='.format(count))
            print(' ')

            # Increase counter by 1
            count = count+1

            # Add TRAINED model to list
            model_list.append(model)

    print('Creating DataFrame')                
    HC_error = pd.DataFrame(HC_error_list)

    print('Changing DataFrame column names')
    HC_error.columns = ['Model', 'Variance Explained', 'Scaler', 'MSPE']

    print('Ranking Models')
    HC_error.sort_values(by=['MSPE'], inplace=True)

    count = 0
    
    return HC_error, model_list, history_list

In [22]:
HC_ranking, models, histories = process_models()

Model_1 Variables:
- Loss: MAPE
- Activation: Leaky ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Variance Explained: 1
- Scaler: MinMax
----------------------------------
Preparing Data-sets
Splitting into inputs and outputs
Data-sets complete
----------------------------------
----------------------------------
Model_1 Created
----------------------------------
Model_1 - Training
- Started on 05-16 at 00:40
Model_1 - Training Complete
- Time: 41.422 min
- Loss = 719.23825
- Val Loss = 165.20448
----------------------------------
Predicting with Model_1
Inverse Scaling Operation
- Prediction Mean = -0.00209
- Prediction Min = -0.00496
- Prediction Max = 0.07331
Calculating HC Error
- HC Error  = 6.25e+08
----------------------------------
 
Model_2 Variables:
- Loss: MAPE
- Activation: Leaky ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Variance Explained: 1
- Scaler: Standard
----------------------------------
Preparing Data-sets
Splitting into inputs and outputs
Data-sets complete
---------

Model_10 Created
----------------------------------
Model_10 - Training
- Started on 05-16 at 07:00
Model_10 - Training Complete
- Time: 43.209 min
- Loss = 90.64018
- Val Loss = 88.91298
----------------------------------
Predicting with Model_10
Inverse Scaling Operation
- Prediction Mean = 0.04391
- Prediction Min = 0.01105
- Prediction Max = 0.05154
Calculating HC Error
- HC Error  = 6.75e+10
----------------------------------
 
Creating DataFrame
Changing DataFrame column names
Ranking Models


In [23]:
HC_ranking

Unnamed: 0,Model,Variance Explained,Scaler,MSPE
6,Model_7,0.95,MinMax,92536330.0
2,Model_3,0.99,MinMax,201640000.0
4,Model_5,0.98,MinMax,359282600.0
8,Model_9,0.9,MinMax,363323300.0
0,Model_1,1.0,MinMax,625371100.0
1,Model_2,1.0,Standard,60624620000.0
3,Model_4,0.99,Standard,60809190000.0
5,Model_6,0.98,Standard,62165470000.0
7,Model_8,0.95,Standard,66527850000.0
9,Model_10,0.9,Standard,67459210000.0


In [48]:
number = 5
print(models[number].name + ' ' + HC_ranking['Scaler'][number])

Model_6 Standard


In [24]:
epoch_vector=np.linspace(1,epochs,epochs)

for i in range(len(models)):
    model = models[i]
    history = histories[i]
    scaler_name = HC_ranking['Scaler'][i]
    var_exp = HC_ranking['Variance Explained'][i]
    
    model.save(os.path.join(output_path,'{}-Var_{}-{}.h5'.format(model.name, var_exp, scaler_name)))
    
    hist_data =[epoch_vector,history.history['loss'],history.history['val_loss']]
    hist_data =pd.DataFrame(hist_data).transpose()
    hist_data.columns=['Epochs','loss','val_loss']
    
    hist_data.to_csv(os.path.join(output_path,'Training_History_{}.csv'.format(model.name)),index=False)

## Next Steps

The models and the training histories will be moved to foldr **Gen 7**. 

Nothing in this notebook yielded an improvement in performance. 

Maybe it has to do with the setup of the problem, or the setup of the database, or simply there is no relationship between the chosen inputs and the chosen outputs. 

A possible source of error is the cheating done by automakers on these standarized EPA tests from which all the data is derived. It **could** be possible that there **is** a relationship between inputs and outputs, but because many of the outputs are fake, this relationship is destroyed and the neural networks can't learn how to represent it. 

The **FINAL** attempt to predict something will drastically reduce the data sets. Different data-sets will be created where there is only **ONE** car manufacturer (maybe the top 2 or 3 manufacturers with the most cars will be chosen), and the same procedure will be applied to see if in that limited application, neural networks are capable of finding something. 