## Description


### This Notebook:

This notebook will run a test to see if a model can accurately predict emissions for vehicles **AFTER YEAR 2009**. 

* Activation Functions:
    * Leaky ReLU + Linear Output
* Optimizers:
    * Adam
* Dropout:
    * 20%
* Loss Function:
    * MAPE (Mean Absolute Percentage Error)
* Scaler:
    * MinMax
    
There will be an analysis of the manufacturers in the data-base and with this, different data-sets to train the models will be created. 

    1) Create a data-set with 18 inputs for the manufacturer with the most cars
    2) Create 18 independent databases by dropping one input at a time
    3) Apply PCA for 99% and 90% to the first database
    
This totals 21 models to be trained and tested


-------

## TO DO's



    


In [1]:
from keras.models import Sequential, load_model, Model
from keras.layers import Input, Dense, Dropout, advanced_activations, BatchNormalization, LeakyReLU
from keras import losses, optimizers, activations
import keras.backend as K

import h5py

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.externals import joblib
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import time
import datetime
import os

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
output_path = os.path.join('.','output')
minmax_scaler_path = os.path.join('.','Scalers','MinMax')
standard_scaler_path = os.path.join('.','Scalers','Standard')

## Load Original Scaled Data

In [33]:
complete_data_scaled_shuffled = pd.read_csv('Dataset_Scaled_Shuffled.csv')
print('Shuffled dataset loaded.')

Shuffled dataset loaded.


## Load MinMax Scalers

In [34]:
# Create an empty list to put all the scalers
minmax_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(minmax_scaler_path,'scaler{}.save'.format(i))
    minmax_scaler = joblib.load(scaler_filename)
    
    minmax_scalers.append(minmax_scaler)

## Inverse Scale Data

Using the original MinMax

In [35]:
# First, inverse transform all original values from the test_set
original_data_inverse = complete_data_scaled_shuffled.copy()

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = original_data_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    original_data_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


### Drop info prior to year 2009

In [36]:
# Create the dataset
data_after_2009 = original_data_inverse[original_data_inverse.Year >= 2010]

## Scale Data

Function to scale data according to an input that decides if it is a MinMax or a Standard Scaler

In [37]:
def scale_data(data, scaler_type):
    
    # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        print('Using {} Scalers'.format(scaler_type))
        print('----------------------------------')
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers
        print('Using {} Scalers'.format(scaler_type))
        print('----------------------------------')
        
    # Scale the data
    
    # Copy the data set to avoid altering the original
    new_data_scaled = data.copy()
    
    # DROP THE VARIABLES WE DON'T WANT
    #new_data_scaled.drop(columns=['Fuel_Code', 'Drive_System_Code', 'Peak_pos', 'Peak_neg', 'ETW', 'a_pos', 'a_neg', 'HP'], 
                     #inplace=True)
    
    # Loop over the standard_scalers and perform the scaling operation on each column
    for i in range(len(new_data_scaled.columns)):

        col_name = new_data_scaled.columns[i]

        values = new_data_scaled[col_name].values
        values = values.astype('float64')
        values = values.reshape(values.shape[0],1)

        new_data_scaled[col_name] = scalers[i].fit_transform(values)

        print('Success with feature: {}'.format(col_name))
    
    print('----------------------------------')
    
    return new_data_scaled

In [38]:
scaler_types = ['MinMax']

In [39]:
data_minmax_scaled = scale_data(data_after_2009, 'MinMax')

Using MinMax Scalers
----------------------------------
Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox
----------------------------------


## Prepare Data

Function to prepare data based on the database input, which depends on the souce (Top or 500) and the scaler

In [40]:
# Get number of data points
data_points = data_after_2009.shape[0]

# Set sizes for train, dev, test sets
train_percent = 0.8
train_size = round(train_percent*data_points)

if (data_points-train_size)%2 == 0:
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    print('----------------------------------')

else:
    train_size = train_size-1
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size 
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    print('----------------------------------')

Train Size = 19263
Dev Size = 2408
Test Size = 2408
Remainder = 0
----------------------------------


## Prepare Datasets

Create functions to modify the input variables and thus create different input sets

In [41]:
# Save the names of the columns in a list that can be looped
input_names = complete_data_scaled_shuffled.columns[:-4]

In [42]:
for variable in input_names:
    print('{}'.format(variable))

Year
Vehicle_Code
Manufacturer_Code
Displacement
Fuel_System
Gears
Transmission_Code
ETW
HP
Drive_System_Code
Fuel_Code
V_avg
V_max
V_std
a_pos
a_neg
Peak_pos
Peak_neg


In [43]:
def prepare_data_sets(data, drop_variable=None):
    
    # Create a local copy of the entire dataset
    data_scaled_shuffled = data.copy()
    
    if drop_variable != None:
        # Drop the variable that will be ignored during the run
        data_scaled_shuffled.drop(columns=drop_variable, inplace=True)
        print('{} Column Dropped'.format(drop_variable))
    
    print('Preparing Data-sets')
    # Divide data into train, dev, and test sets
    train_set = data_scaled_shuffled[ : train_size]
    dev_set = data_scaled_shuffled[train_size : train_size+dev_size]
    test_set = data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]

    # Reset index for all sets
    train_set = train_set.reset_index(drop=True)
    dev_set = dev_set.reset_index(drop=True)
    test_set = test_set.reset_index(drop=True)

    # Get values
    train_set_values = train_set.values
    dev_set_values = dev_set.values
    test_set_values = test_set.values
    
    # Number of emissions: HC, CO, CO2, NOX
    n_out = 4
    
    print('Splitting into inputs and outputs')
    # SLICING: [start row:end row , start column:end column]
    # Split into inputs and outputs
    x_train = train_set_values[:,:-n_out]
    x_dev = dev_set_values[:,:-n_out]
    x_test = test_set_values[:,:-n_out]
    
    print('Inputs = {}'.format(x_train.shape[1]))
    
    # Get the outputs (only HC)
    HC_train = train_set_values[:,-n_out]
    HC_dev = dev_set_values[:,-n_out]
    HC_test = test_set_values[:,-n_out]
    
    print('Data-sets complete')
    print('----------------------------------')
    
    return x_train, x_dev, x_test, HC_train, HC_dev, HC_test

In [44]:
def get_test_inverse(scaler_type):
    
    # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers
        
    #----------------------------------

    # Inverse transform the TEST DATA to be able to calculate the error further down
    test_set_scaled = data_after_2009[train_size+dev_size : train_size+dev_size+test_size]
    test_set_inverse = test_set_scaled.copy()

    for i in range(np.size(data_after_2009.columns)):

        col_name = data_after_2009.columns[i]

        values = test_set_inverse[col_name].values
        values = values.astype('float64')
        values = values.reshape(values.shape[0],1)

        test_set_inverse[col_name] = scalers[i].inverse_transform(values)

    print('Success creating inverse test set')
    print('----------------------------------')
    
    return test_set_inverse

## PCA

Create a function that creates a PCA instance and with that PCA instance creates a new x_train

In [45]:
def create_pca_set(variance_amount, x_train, x_dev, x_test):
    
    if variance_amount < 1:
    
        print('Create PCA Instance')
        pca = PCA(variance_amount)

        print('Fit PCA Instance')
        pca.fit(x_train)
        print('Number of Components = {}'.format(pca.n_components_))

        print('Create New Input Training Set')
        new_x_train = pca.transform(x_train)

        print('Create New Input Dev Set')
        new_x_dev = pca.transform(x_dev)

        print('Create New Input Test Set')
        new_x_test = pca.transform(x_test)
        
    else:
        
        new_x_train = x_train
        new_x_dev = x_dev
        new_x_test = x_test

    print('----------------------------------')
    
    return new_x_train, new_x_dev, new_x_test

In [47]:
pca_options = [1, 0.99, 0.90]

-----------------
## Models

#### Basics

In [48]:
# Mini-batch size, epochs
batch_size = 64
epochs = 300

#### Hyper-Parameter Space

In [49]:
# Activation functions to try
names_activations = ['LReLU']
    # A function has to be called so that a new instance of the function can be created in each layer
def get_activation(name):
    
    if name == 'ReLU':
        function = advanced_activations.ReLU()
    if name == 'LReLU':
        function = advanced_activations.LeakyReLU()
        
    return function

#--------------------------------------------------------------------------------- 

output_activations = ['linear']

#--------------------------------------------------------------------------------- 

# Dropout rate to be tried
dropouts = [0.2]

#### Build Model

In [50]:
def build_model(number, x_train, activation_name, output, dd):
    
    # Create model
    model = Sequential(name='Model_{}'.format(number))

    model.add(Dense(256, input_dim=x_train.shape[1]))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(128))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(64))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(32))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(16))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(1))
    
    if output == 'ReLU':
        model.add(advanced_activations.ReLU())

    #Compile model
    model.compile(loss=losses.mean_absolute_percentage_error, optimizer=optimizers.Adam(), metrics = ['accuracy'])
    
    print('{} Created'.format(model.name))
    print('----------------------------------')
    
    return model

#### Train Model

In [51]:
def train_models(model, x_train, y_train, x_dev, y_dev):
    
    print('{} - Training'.format(model.name))
    print('- Started on {} at {}'.format(str(datetime.datetime.now())[5:-16], str(datetime.datetime.now())[11:-10]))
    # Start timer
    start_time = time.time()

    # fit network
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, 
                        validation_data=(x_dev, y_dev), verbose=0, shuffle=True)

    # End timer
    end_time = time.time() - start_time
    print('{} - Training Complete'.format(model.name))
    print('- Time: {:.3f} min'.format(end_time/60))
    print('- Loss = {:.5f}'.format(history.history['loss'][-1]))
    print('- Val Loss = {:.5f}'.format(history.history['val_loss'][-1]))
    print('----------------------------------')
        
    return history

#### Make Predictions and Calculate Error

In [52]:
# Function to define MSPE
def msp_error(true,pred):
    error = 100*np.sum(((true-pred)/true)**2)/np.size(true)
    return error

In [53]:
def predict_get_error(model, x_test, scaler_type, test_set_inverse):
    
        # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers  
    
    #------------------
    print('Predicting with {}'.format(model.name))
    scaled_predictions = model.predict(x_test)
    
    print('Inverse Scaling Operation') 
     
    # Inverse the scaling operation on the predictions
    predictions = scalers[-4].inverse_transform(scaled_predictions)
    
    print('- Prediction Mean = {:.5f}'.format(np.mean(predictions)))
    print('- Prediction Min = {:.5f}'.format(np.min(predictions)))
    print('- Prediction Max = {:.5f}'.format(np.max(predictions)))

    print('Calculating HC Error')
    mspe = msp_error(test_set_inverse['HC'].values, predictions)
        
    print('- HC Error  = {:.2e}'.format(mspe))
    print('----------------------------------')
    
    return mspe

#### Process Models and Rank with MSPE

In [54]:
def process_models():
    
    count = 1
    model_list = []
    history_list = []
    HC_error_list = []
        
    for scaler_type in scaler_types:

        for activation_name in names_activations:

            for output in output_activations:

                for dd in dropouts:
                    
                    for variable in input_names:

                        # Print model variables
                        print('Model_{} Variables:'.format(count))
                        print('- Loss: MAPE')
                        print('- Activation: {}'.format(activation_name))
                        print('- Output: {}'.format(output))
                        print('- Optimizer: Adam')
                        print('- Dropout: {}%'.format(dd*100))
                        print('- Scaler: {}'.format(scaler_type))
                        print('- Dropped: {}'.format(variable))
                        print('----------------------------------')
                        
                        data_scaled = data_minmax_scaled.copy()

                        # Prepare data-sets and create inputs and outputs
                        x_train, x_dev, x_test, y_train, y_dev, y_test = prepare_data_sets(data_scaled, variable)

                        # Create model
                        model = build_model(count, x_train, activation_name, output, dd)

                        # Train model
                        history = train_models(model, x_train, y_train, x_dev, y_dev)
                        history_list.append(history)

                        # Get test_set_inverse
                        test_set_inverse = get_test_inverse(scaler_type)

                        # Make predictions and calculate error
                        error = predict_get_error(model, x_test, scaler_type, test_set_inverse)

                        # Add error to error list
                        HC_error_list.append([model.name, scaler_type, activation_name, output, dd, variable, error])

                        # Announce one model process ended
                        print('============== MODEL {} PROCESS END =============='.format(count))
                        print(' ')

                        # Increase counter by 1
                        count = count+1

                        # Add TRAINED model to list
                        model_list.append(model)
                        
                        
    # Create THREE PCA database
    for variance_amount in pca_options:
        
        scaler_type = 'MinMax'
        activation_name = 'LReLU'
        output = 'linear'
        dd = 0.2
        
        # Print model variables
        print('Model_{} Variables:'.format(count))
        print('- Loss: MAPE')
        print('- Activation: {}'.format(activation_name))
        print('- Output: {}'.format(output))
        print('- Optimizer: Adam')
        print('- Dropout: {}%'.format(dd*100))
        print('- Scaler: {}'.format(scaler_type))
        print('- Variance: {}'.format(variance_amount))
        print('----------------------------------')
        
        # Prepare data-sets and create inputs and outputs
        old_x_train, old_x_dev, old_x_test, y_train, y_dev, y_test = prepare_data_sets(data_minmax_scaled)
        
        # Create PCA instance for the inputs
        x_train, x_dev, x_test = create_pca_set(variance_amount, old_x_train, old_x_dev, old_x_test)

        # Create model
        model = build_model(count, x_train, activation_name, output, dd)

        # Train model
        history = train_models(model, x_train, y_train, x_dev, y_dev)
        history_list.append(history)

        # Get test_set_inverse
        test_set_inverse = get_test_inverse(scaler_type)

        # Make predictions and calculate error
        error = predict_get_error(model, x_test, scaler_type, test_set_inverse)

        # Add error to error list
        HC_error_list.append([model.name, scaler_type, activation_name, output, dd, variance_amount, error])

        # Announce one model process ended
        print('============== MODEL {} PROCESS END =============='.format(count))
        print(' ')

        # Increase counter by 1
        count = count+1

        # Add TRAINED model to list
        model_list.append(model)
    
    #------------------------------------------------------------

    print('Creating DataFrame')                
    HC_error = pd.DataFrame(HC_error_list)

    print('Changing DataFrame column names')
    HC_error.columns = ['Model', 'Scaler', 'Activation', 'Output', 'Dropout', 'Variable/Variance', 'MSPE']

    print('Ranking Models')
    HC_error.sort_values(by=['MSPE'], inplace=True)

    count = 0
    
    return HC_error, model_list, history_list

In [55]:
HC_ranking, models, histories = process_models()

Model_1 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Year
----------------------------------
Year Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_1 Created
----------------------------------
Model_1 - Training
- Started on 05-08 at 12:23
Model_1 - Training Complete
- Time: 12.850 min
- Loss = 1081.39572
- Val Loss = 183.18381
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_1
Inverse Scaling Operation
- Prediction Mean = 0.00280
- Prediction Min = 0.00253
- Prediction Max = 0.01095
Calculating HC Error
- HC Error  = 1.03e+08
----------------------------------
 
Model_2 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Vehicle_Code
----------------------------------
Veh

Model_10 Created
----------------------------------
Model_10 - Training
- Started on 05-08 at 14:21
Model_10 - Training Complete
- Time: 13.434 min
- Loss = 1081.07312
- Val Loss = 181.31188
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_10
Inverse Scaling Operation
- Prediction Mean = 0.00048
- Prediction Min = -0.00107
- Prediction Max = 0.06206
Calculating HC Error
- HC Error  = 3.91e+08
----------------------------------
 
Model_11 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Fuel_Code
----------------------------------
Fuel_Code Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_11 Created
----------------------------------
Model_11 - Training
- Started on 05-08 at 14:34
Model_11 - Training Complete
- Time: 13.599 min
- Loss = 1255.64982

Inverse Scaling Operation
- Prediction Mean = 0.00450
- Prediction Min = 0.00155
- Prediction Max = 0.08381
Calculating HC Error
- HC Error  = 1.25e+09
----------------------------------
 
Model_20 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Variance: 0.99
----------------------------------
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 18
Data-sets complete
----------------------------------
Create PCA Instance
Fit PCA Instance
Number of Components = 11
Create New Input Training Set
Create New Input Dev Set
Create New Input Test Set
----------------------------------
Model_20 Created
----------------------------------
Model_20 - Training
- Started on 05-08 at 16:44
Model_20 - Training Complete
- Time: 15.745 min
- Loss = 1121.40081
- Val Loss = 254.55683
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_20
Inverse Scaling Op

In [56]:
HC_ranking

Unnamed: 0,Model,Scaler,Activation,Output,Dropout,Variable/Variance,MSPE
16,Model_17,MinMax,LReLU,linear,0.2,Peak_pos,5709966.0
5,Model_6,MinMax,LReLU,linear,0.2,Gears,6586152.0
7,Model_8,MinMax,LReLU,linear,0.2,ETW,28595480.0
10,Model_11,MinMax,LReLU,linear,0.2,Fuel_Code,39567650.0
6,Model_7,MinMax,LReLU,linear,0.2,Transmission_Code,43349640.0
14,Model_15,MinMax,LReLU,linear,0.2,a_pos,44250150.0
17,Model_18,MinMax,LReLU,linear,0.2,Peak_neg,50697520.0
1,Model_2,MinMax,LReLU,linear,0.2,Vehicle_Code,62493970.0
20,Model_21,MinMax,LReLU,linear,0.2,0.9,66741330.0
13,Model_14,MinMax,LReLU,linear,0.2,V_std,98407340.0


In [57]:
epoch_vector=np.linspace(1,epochs,epochs)

for i in range(len(models)):
    model = models[i]
    history = histories[i]
    activation = HC_ranking['Activation'][i]
    output = HC_ranking['Output'][i]
    variable = HC_ranking['Variable/Variance'][i]
    
    model.save(os.path.join(output_path,'{}_{}_{}_{}.h5'.format(model.name, activation, output, variable)))
    
    hist_data =[epoch_vector,history.history['loss'],history.history['val_loss']]
    hist_data =pd.DataFrame(hist_data).transpose()
    hist_data.columns=['Epochs','loss','val_loss']
    
    hist_data.to_csv(os.path.join(output_path,'Training_History_{}.csv'.format(model.name)),index=False)

## Next Steps

The models and the training histories will be moved to foldr **Gen 8**. 