## Description

This notebook represents the culmination of a long road of trial and error trying to predict vehicle emissions using neural networks. 

### What has been done until now:

* Construct database of inputs and outputs
* Remove zeros from data to prevent unnecessary noise
* Try Multi Task Learning to predict: $HC$, $CO$, $CO_2$, $NO_X$
    * This didn't work
* Run a short code trying to evaluate the impact of network size on the predictions
    * It was found the effect was minimal at that point, so ONE architecture was chosen: inputs-256-128-64-32-16-output
* Try Single Task Learning (normal supervised learning) to predict only HC
* Perform a **hyper-parameter** search to find the best possible combination that would reduce the error
    * For this a standard error was chosen to evaluate the predictions: MSPE
    * The top four were:
        * MAPE + LReLU + Adam + 20% dropout
        * MAPE + PReLU + Adam + 20% dropout
        * R2 + LReLU + Adam + 20% dropout
        * R2 + PReLU + Adam + 20% dropout
    * The models were predicting **negative** values which made no sense
* These top four models were tested for all four pollutants: $HC$, $CO$, $CO_2$, $NO_X$
    * To see if one of them could be predicted better than the others and this could provide a clue into the massive errors
    * Because in the past notebook the predictions were negative, a ReLU output replaced the Linear output. But this caused all of the predictions to be **zero** for all emissions
* The hypothesis was that probably one of the inputs was introducing a VERY negative correlation into the relationship which made the predictions zero. So 18 different data-sets were created, where each was missing **one** input. Models were trained (with a linear output) to see when they generated positive predictions and when they were negative. 
    * It was found that the following variables made the predictions negative: 
        * Fuel Code
        * Drive System
        * Peak_pos (Positive acceleration peaks)
        * ETW (Estimated Test Weight)
        * a_pos (Positive acceleration mean)
        * HP (Horse Power)
    * It was impossible to try combinations of two of those variables because there would be around 300 models to test
* PCA (Principal Component Analysis) was applied on the original data-set (with 18 inputs) to see if this algorithm could generate variables that would improve predictions. 
    * It didn't work
* PCA was applied on a new data-set that did not include the **six** variables from above. 
    * This didn't work either

### This Notebook:

This notebook will run a test to see if a model can accurately predict emissions for vehicles belonging to **ONE SINGLE MANUFACTURER**. A combination of hyper-parameters will be tried to see if something works: 

* Activation Functions:
    * Leaky ReLU + Linear Output
    * Leaky ReLU + ReLU Output
    * ReLU + Linear Output
    * ReLU + ReLU
* Optimizers:
    * Adam
* Dropout:
    * 20%
* Loss Function:
    * MAPE (Mean Absolute Percentage Error)
* Scaler:
    * MinMax
    
There will be an analysis of the manufacturers in the data-base and with this, different data-sets to train the models will be created. 

    1) Create a data-set with 18 inputs for the manufacturer with the most cars
    2) Create 18 independent databases by dropping one input at a time
    3) Apply PCA for 99% and 90% to the first database
    
This totals 78 models to be trained and tested


-------

## TO DO's

Load original shuffled and scaled data

Data Scaling
* Load MinMax Scalers
* Inverse the data
* Re-scale the data using the selected scaler (MinMax or Standard)
    * This has to be a function where the scaler is chosen and also the data-set

Data Processing
* Get a list of manufacturers
* Count how many entries each one has
* Rank them by the number of entries
* Create one database:
    * One for the TOP manufacturer
* Write code to create a new database for each dropped variable (**after scaling**)

    
Data Pre-Processing
* Function to create x_train, x_dev, x_test, and y_train, y_dev, y_test based on the database that is fed to it

Hyper-Parameters
* Create the hyper-parameter space to be explored

PCA
* Function to apply PCA by feeding the **x**'s
* This will be used later

Models
* Function to create models based on the hyper-parameter combinations



    


In [1]:
from keras.models import Sequential, load_model, Model
from keras.layers import Input, Dense, Dropout, advanced_activations, BatchNormalization, LeakyReLU
from keras import losses, optimizers, activations
import keras.backend as K

import h5py

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.externals import joblib
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import time
import datetime
import os

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
output_path = os.path.join('.','output')
minmax_scaler_path = os.path.join('.','Scalers','MinMax')
standard_scaler_path = os.path.join('.','Scalers','Standard')

## Load Original Scaled Data

In [3]:
complete_data_scaled_shuffled = pd.read_csv('Dataset_Scaled_Shuffled.csv')
print('Shuffled dataset loaded.')

Shuffled dataset loaded.


## Load MinMax Scalers

In [4]:
# Create an empty list to put all the scalers
minmax_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(minmax_scaler_path,'scaler{}.save'.format(i))
    minmax_scaler = joblib.load(scaler_filename)
    
    minmax_scalers.append(minmax_scaler)

## Load Standard Scalers

In [5]:
# Create an empty list to put all the loaded scalers
standard_scalers = []

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    scaler_filename = os.path.join(standard_scaler_path,'scaler{}.save'.format(i))
    standard_scaler = joblib.load(scaler_filename)
    
    standard_scalers.append(standard_scaler)

## Inverse Scale Data

Using the original MinMax

In [6]:
# First, inverse transform all original values from the test_set
original_data_inverse = complete_data_scaled_shuffled.copy()

for i in range(len(complete_data_scaled_shuffled.columns)):
    
    col_name = complete_data_scaled_shuffled.columns[i]
    
    values = original_data_inverse[col_name].values
    values = values.astype('float64')
    values = values.reshape(values.shape[0],1)
    
    original_data_inverse[col_name] = minmax_scalers[i].inverse_transform(values)
    
    print('Success with feature: {}'.format(col_name))

Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox


## Analyze Manufacturers

In [7]:
# Get a list of the available manufacturers (each one has a code)
manufacturers = original_data_inverse['Manufacturer_Code'].unique()
original_data_inverse['Manufacturer_Code'].unique()

array([493., 661., 716., 241., 733., 244., 498., 608., 171., 126., 853.,
       122., 662., 382., 838., 727., 278., 929., 964., 892., 578., 745.,
       229., 381., 996., 688., 416., 376., 554., 107., 571., 164., 867.,
       613., 413., 457., 815., 590., 680., 753., 137., 490., 983., 352.,
       315., 296., 642., 721., 335., 309., 354., 634., 566., 683., 597.,
       531., 115., 397., 920., 568., 172., 436., 947., 125., 849., 467.,
       186., 419., 487.])

In [8]:
# Create empty list
manuf_by_entries_list = []

# Loop over the manufacturer codes
for i in manufacturers:
    
    # Create a temporary DataFrame where there are only entries that match the current manufactuer code
    # Get the length of that temporary DataFrame
    num_cars = len(original_data_inverse[original_data_inverse.Manufacturer_Code == i])
    
    # Append the manufacturer code and the number of entries to the empty list
    manuf_by_entries_list.append([i, num_cars])

# Turn the list into a DataFrame
manuf_by_entries = pd.DataFrame(manuf_by_entries_list)

# Rename the columns
manuf_by_entries.columns = ['Manufacturer', 'Num_Cars']

# Sort by number of entries from largest to smallest
manuf_by_entries.sort_values(by=['Num_Cars'], inplace=True, ascending=False)

# Reset the index to make it cleaner
manuf_by_entries.reset_index(drop=True, inplace=True)

#----------------------------------------------
# Create a database using only the top manufacturer

# Save the top manufacturer into a variable
top_manufacturer = manuf_by_entries['Manufacturer'][0]
top_manufacturer_entries = manuf_by_entries['Num_Cars'][0]
print('Top Manufacturer: {:.0f}'.format(top_manufacturer))
print('Number of Entries = {}'.format(top_manufacturer_entries))
print('-------------------------')

# Create the dataset
top_manufacturer_data = original_data_inverse[original_data_inverse.Manufacturer_Code == top_manufacturer]
print('Created Data Set: Top Manufacturer Data')
print('-------------------------')

'''
#----------------------------------------------
# Create a database using only manufacturers with more than 500 entries

# Create variable with minimum number of entries
min_entries = 500

# Create a list to store the manufacturers that have less than 500 entries
small_manufacturers = manuf_by_entries[manuf_by_entries.Num_Cars <= min_entries]['Manufacturer'].values

# Loop over the list above and drop all the small manufacturers
manufacturers_500_data = original_data_inverse.copy()

for i in small_manufacturers:
    manufacturers_500_data = manufacturers_500_data[manufacturers_500_data.Manufacturer_Code != i]
    rows = manufacturers_500_data.shape[0]
    
    print('Dropped Manufacturer: {:.0f}'.format(i))
    print('Rows = {:.0f}'.format(rows))
    
print('-------------------------')    
print('Created Data Set: Manufacturers 500 Data ')
'''

Top Manufacturer: 498
Number of Entries = 9913
-------------------------
Created Data Set: Top Manufacturer Data
-------------------------


"\n#----------------------------------------------\n# Create a database using only manufacturers with more than 500 entries\n\n# Create variable with minimum number of entries\nmin_entries = 500\n\n# Create a list to store the manufacturers that have less than 500 entries\nsmall_manufacturers = manuf_by_entries[manuf_by_entries.Num_Cars <= min_entries]['Manufacturer'].values\n\n# Loop over the list above and drop all the small manufacturers\nmanufacturers_500_data = original_data_inverse.copy()\n\nfor i in small_manufacturers:\n    manufacturers_500_data = manufacturers_500_data[manufacturers_500_data.Manufacturer_Code != i]\n    rows = manufacturers_500_data.shape[0]\n    \n    print('Dropped Manufacturer: {:.0f}'.format(i))\n    print('Rows = {:.0f}'.format(rows))\n    \nprint('-------------------------')    \nprint('Created Data Set: Manufacturers 500 Data ')\n"

## Scale Data

Function to scale data according to an input that decides if it is a MinMax or a Standard Scaler

In [9]:
def scale_data(data, scaler_type):
    
    # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        print('Using {} Scalers'.format(scaler_type))
        print('----------------------------------')
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers
        print('Using {} Scalers'.format(scaler_type))
        print('----------------------------------')
        
    # Scale the data
    
    # Copy the data set to avoid altering the original
    new_data_scaled = data.copy()
    
    # DROP THE VARIABLES WE DON'T WANT
    #new_data_scaled.drop(columns=['Fuel_Code', 'Drive_System_Code', 'Peak_pos', 'Peak_neg', 'ETW', 'a_pos', 'a_neg', 'HP'], 
                     #inplace=True)
    
    # Loop over the standard_scalers and perform the scaling operation on each column
    for i in range(len(new_data_scaled.columns)):

        col_name = new_data_scaled.columns[i]

        values = new_data_scaled[col_name].values
        values = values.astype('float64')
        values = values.reshape(values.shape[0],1)

        new_data_scaled[col_name] = scalers[i].fit_transform(values)

        print('Success with feature: {}'.format(col_name))
    
    print('----------------------------------')
    
    return new_data_scaled

In [10]:
scaler_types = ['MinMax']

In [11]:
data_standard_scaled = scale_data(top_manufacturer_data, 'Standard')

Using Standard Scalers
----------------------------------
Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox
----------------------------------


In [12]:
data_minmax_scaled = scale_data(top_manufacturer_data, 'MinMax')

Using MinMax Scalers
----------------------------------
Success with feature: Year
Success with feature: Vehicle_Code
Success with feature: Manufacturer_Code
Success with feature: Displacement
Success with feature: Fuel_System
Success with feature: Gears
Success with feature: Transmission_Code
Success with feature: ETW
Success with feature: HP
Success with feature: Drive_System_Code
Success with feature: Fuel_Code
Success with feature: V_avg
Success with feature: V_max
Success with feature: V_std
Success with feature: a_pos
Success with feature: a_neg
Success with feature: Peak_pos
Success with feature: Peak_neg
Success with feature: HC
Success with feature: CO
Success with feature: CO2
Success with feature: Nox
----------------------------------


## Prepare Data

Function to prepare data based on the database input, which depends on the souce (Top or 500) and the scaler

In [13]:
# Get number of data points
data_points = top_manufacturer_data.shape[0]

# Set sizes for train, dev, test sets
train_percent = 0.8
train_size = round(train_percent*data_points)

if (data_points-train_size)%2 == 0:
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    print('----------------------------------')

else:
    train_size = train_size-1
    dev_size = int((data_points-train_size)/2)
    test_size = dev_size 
    print('Train Size = {}'.format(train_size))
    print('Dev Size = {}'.format(dev_size))
    print('Test Size = {}'.format(test_size))
    print('Remainder = {}'.format(train_size+dev_size+test_size-data_points))
    print('----------------------------------')

Train Size = 7929
Dev Size = 992
Test Size = 992
Remainder = 0
----------------------------------


## Prepare Datasets

Create functions to modify the input variables and thus create different input sets

In [14]:
# Save the names of the columns in a list that can be looped
input_names = complete_data_scaled_shuffled.columns[:-4]

In [15]:
for variable in input_names:
    print('{}'.format(variable))

Year
Vehicle_Code
Manufacturer_Code
Displacement
Fuel_System
Gears
Transmission_Code
ETW
HP
Drive_System_Code
Fuel_Code
V_avg
V_max
V_std
a_pos
a_neg
Peak_pos
Peak_neg


In [16]:
def prepare_data_sets(data, drop_variable=None):
    
    # Create a local copy of the entire dataset
    data_scaled_shuffled = data.copy()
    
    if drop_variable != None:
        # Drop the variable that will be ignored during the run
        data_scaled_shuffled.drop(columns=drop_variable, inplace=True)
        print('{} Column Dropped'.format(drop_variable))
    
    print('Preparing Data-sets')
    # Divide data into train, dev, and test sets
    train_set = data_scaled_shuffled[ : train_size]
    dev_set = data_scaled_shuffled[train_size : train_size+dev_size]
    test_set = data_scaled_shuffled[train_size+dev_size : train_size+dev_size+test_size]

    # Reset index for all sets
    train_set = train_set.reset_index(drop=True)
    dev_set = dev_set.reset_index(drop=True)
    test_set = test_set.reset_index(drop=True)

    # Get values
    train_set_values = train_set.values
    dev_set_values = dev_set.values
    test_set_values = test_set.values
    
    # Number of emissions: HC, CO, CO2, NOX
    n_out = 4
    
    print('Splitting into inputs and outputs')
    # SLICING: [start row:end row , start column:end column]
    # Split into inputs and outputs
    x_train = train_set_values[:,:-n_out]
    x_dev = dev_set_values[:,:-n_out]
    x_test = test_set_values[:,:-n_out]
    
    print('Inputs = {}'.format(x_train.shape[1]))
    
    # Get the outputs (only HC)
    HC_train = train_set_values[:,-n_out]
    HC_dev = dev_set_values[:,-n_out]
    HC_test = test_set_values[:,-n_out]
    
    print('Data-sets complete')
    print('----------------------------------')
    
    return x_train, x_dev, x_test, HC_train, HC_dev, HC_test

In [17]:
def get_test_inverse(scaler_type):
    
    # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers
        
    #----------------------------------

    # Inverse transform the TEST DATA to be able to calculate the error further down
    test_set_scaled = top_manufacturer_data[train_size+dev_size : train_size+dev_size+test_size]
    test_set_inverse = test_set_scaled.copy()

    for i in range(np.size(top_manufacturer_data.columns)):

        col_name = top_manufacturer_data.columns[i]

        values = test_set_inverse[col_name].values
        values = values.astype('float64')
        values = values.reshape(values.shape[0],1)

        test_set_inverse[col_name] = scalers[i].inverse_transform(values)

    print('Success creating inverse test set')
    print('----------------------------------')
    
    return test_set_inverse

## PCA

Create a function that creates a PCA instance and with that PCA instance creates a new x_train

In [18]:
def create_pca_set(variance_amount, x_train, x_dev, x_test):
    
    if variance_amount < 1:
    
        print('Create PCA Instance')
        pca = PCA(variance_amount)

        print('Fit PCA Instance')
        pca.fit(x_train)
        print('Number of Components = {}'.format(pca.n_components_))

        print('Create New Input Training Set')
        new_x_train = pca.transform(x_train)

        print('Create New Input Dev Set')
        new_x_dev = pca.transform(x_dev)

        print('Create New Input Test Set')
        new_x_test = pca.transform(x_test)
        
    else:
        
        new_x_train = x_train
        new_x_dev = x_dev
        new_x_test = x_test

    print('----------------------------------')
    
    return new_x_train, new_x_dev, new_x_test

In [19]:
pca_options = [1, 0.99, 0.90]

-----------------
## Models

#### Basics

In [20]:
# Mini-batch size, epochs
batch_size = 64
epochs = 300

#### Hyper-Parameter Space

In [21]:
# Activation functions to try
names_activations = ['LReLU']
    # A function has to be called so that a new instance of the function can be created in each layer
def get_activation(name):
    
    if name == 'ReLU':
        function = advanced_activations.ReLU()
    if name == 'LReLU':
        function = advanced_activations.LeakyReLU()
        
    return function

#--------------------------------------------------------------------------------- 

output_activations = ['linear']

#--------------------------------------------------------------------------------- 

# Dropout rate to be tried
dropouts = [0.2]

#### Build Model

In [22]:
def build_model(number, x_train, activation_name, output, dd):
    
    # Create model
    model = Sequential(name='Model_{}'.format(number))

    model.add(Dense(256, input_dim=x_train.shape[1]))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(128))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(64))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(32))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(16))
    model.add(get_activation(activation_name))
    model.add(Dropout(dd))
    model.add(BatchNormalization())

    model.add(Dense(1))
    
    if output == 'ReLU':
        model.add(advanced_activations.ReLU())

    #Compile model
    model.compile(loss=losses.mean_absolute_percentage_error, optimizer=optimizers.Adam(), metrics = ['accuracy'])
    
    print('{} Created'.format(model.name))
    print('----------------------------------')
    
    return model

#### Train Model

In [23]:
def train_models(model, x_train, y_train, x_dev, y_dev):
    
    print('{} - Training'.format(model.name))
    print('- Started on {} at {}'.format(str(datetime.datetime.now())[5:-16], str(datetime.datetime.now())[11:-10]))
    # Start timer
    start_time = time.time()

    # fit network
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, 
                        validation_data=(x_dev, y_dev), verbose=0, shuffle=True)

    # End timer
    end_time = time.time() - start_time
    print('{} - Training Complete'.format(model.name))
    print('- Time: {:.3f} min'.format(end_time/60))
    print('- Loss = {:.5f}'.format(history.history['loss'][-1]))
    print('- Val Loss = {:.5f}'.format(history.history['val_loss'][-1]))
    print('----------------------------------')
        
    return history

#### Make Predictions and Calculate Error

In [24]:
# Function to define MSPE
def msp_error(true,pred):
    error = 100*np.sum(((true-pred)/true)**2)/np.size(true)
    return error

In [25]:
def predict_get_error(model, x_test, scaler_type, test_set_inverse):
    
        # Select the scalers
    if scaler_type == 'MinMax':
        
        scalers = minmax_scalers
        
    if scaler_type == 'Standard':
        
        scalers = standard_scalers  
    
    #------------------
    print('Predicting with {}'.format(model.name))
    scaled_predictions = model.predict(x_test)
    
    print('Inverse Scaling Operation') 
     
    # Inverse the scaling operation on the predictions
    predictions = scalers[-4].inverse_transform(scaled_predictions)
    
    print('- Prediction Mean = {:.5f}'.format(np.mean(predictions)))
    print('- Prediction Min = {:.5f}'.format(np.min(predictions)))
    print('- Prediction Max = {:.5f}'.format(np.max(predictions)))

    print('Calculating HC Error')
    mspe = msp_error(test_set_inverse['HC'].values, predictions)
        
    print('- HC Error  = {:.2e}'.format(mspe))
    print('----------------------------------')
    
    return mspe

#### Process Models and Rank with MSPE

In [32]:
def process_models():
    
    count = 1
    model_list = []
    history_list = []
    HC_error_list = []
        
    for scaler_type in scaler_types:

        for activation_name in names_activations:

            for output in output_activations:

                for dd in dropouts:
                    
                    for variable in input_names:

                        # Print model variables
                        print('Model_{} Variables:'.format(count))
                        print('- Loss: MAPE')
                        print('- Activation: {}'.format(activation_name))
                        print('- Output: {}'.format(output))
                        print('- Optimizer: Adam')
                        print('- Dropout: {}%'.format(dd*100))
                        print('- Scaler: {}'.format(scaler_type))
                        print('- Dropped: {}'.format(variable))
                        print('----------------------------------')
                        
                        data_scaled = data_minmax_scaled.copy()

                        # Prepare data-sets and create inputs and outputs
                        x_train, x_dev, x_test, y_train, y_dev, y_test = prepare_data_sets(data_scaled, variable)

                        # Create model
                        model = build_model(count, x_train, activation_name, output, dd)

                        # Train model
                        history = train_models(model, x_train, y_train, x_dev, y_dev)
                        history_list.append(history)

                        # Get test_set_inverse
                        test_set_inverse = get_test_inverse(scaler_type)

                        # Make predictions and calculate error
                        error = predict_get_error(model, x_test, scaler_type, test_set_inverse)

                        # Add error to error list
                        HC_error_list.append([model.name, scaler_type, activation_name, output, dd, variable, error])

                        # Announce one model process ended
                        print('============== MODEL {} PROCESS END =============='.format(count))
                        print(' ')

                        # Increase counter by 1
                        count = count+1

                        # Add TRAINED model to list
                        model_list.append(model)
                        
                        
    # Create TWO PCA database
    for variance_amount in pca_options:
        
        scaler_type = 'MinMax'
        activation_name = 'LReLU'
        output = 'linear'
        dd = 0.2
        
        # Print model variables
        print('Model_{} Variables:'.format(count))
        print('- Loss: MAPE')
        print('- Activation: {}'.format(activation_name))
        print('- Output: {}'.format(output))
        print('- Optimizer: Adam')
        print('- Dropout: {}%'.format(dd*100))
        print('- Scaler: {}'.format(scaler_type))
        print('- Variance: {}'.format(variance_amount))
        print('----------------------------------')
        
        # Prepare data-sets and create inputs and outputs
        old_x_train, old_x_dev, old_x_test, y_train, y_dev, y_test = prepare_data_sets(data_minmax_scaled)
        
        # Create PCA instance for the inputs
        x_train, x_dev, x_test = create_pca_set(variance_amount, old_x_train, old_x_dev, old_x_test)

        # Create model
        model = build_model(count, x_train, activation_name, output, dd)

        # Train model
        history = train_models(model, x_train, y_train, x_dev, y_dev)
        history_list.append(history)

        # Get test_set_inverse
        test_set_inverse = get_test_inverse(scaler_type)

        # Make predictions and calculate error
        error = predict_get_error(model, x_test, scaler_type, test_set_inverse)

        # Add error to error list
        HC_error_list.append([model.name, scaler_type, activation_name, output, dd, variance_amount, error])

        # Announce one model process ended
        print('============== MODEL {} PROCESS END =============='.format(count))
        print(' ')

        # Increase counter by 1
        count = count+1

        # Add TRAINED model to list
        model_list.append(model)
    
    #------------------------------------------------------------

    print('Creating DataFrame')                
    HC_error = pd.DataFrame(HC_error_list)

    print('Changing DataFrame column names')
    HC_error.columns = ['Model', 'Scaler', 'Activation', 'Output', 'Dropout', 'Variable/Variance', 'MSPE']

    print('Ranking Models')
    HC_error.sort_values(by=['MSPE'], inplace=True)

    count = 0
    
    return HC_error, model_list, history_list

In [27]:
HC_ranking, models, histories = process_models()

Model_1 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Year
----------------------------------
Year Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_1 Created
----------------------------------
Model_1 - Training
- Started on 04-09 at 21:32
Model_1 - Training Complete
- Time: 8.315 min
- Loss = 482.43511
- Val Loss = 208.83346
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_1
Inverse Scaling Operation
- Prediction Mean = -0.01266
- Prediction Min = -0.12968
- Prediction Max = -0.01039
Calculating HC Error
- HC Error  = 3.49e+06
----------------------------------
 
Model_2 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Vehicle_Code
----------------------------------
Vehi

Model_10 Created
----------------------------------
Model_10 - Training
- Started on 04-09 at 22:48
Model_10 - Training Complete
- Time: 8.785 min
- Loss = 770.12101
- Val Loss = 152.61838
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_10
Inverse Scaling Operation
- Prediction Mean = -0.00837
- Prediction Min = -0.42251
- Prediction Max = 0.00212
Calculating HC Error
- HC Error  = 1.30e+07
----------------------------------
 
Model_11 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Fuel_Code
----------------------------------
Fuel_Code Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_11 Created
----------------------------------
Model_11 - Training
- Started on 04-09 at 22:57
Model_11 - Training Complete
- Time: 8.849 min
- Loss = 876.62836
- V

Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------------------------
 
Model_20 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Vehicle_Code
----------------------------------
Vehicle_Code Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_20 Created
----------------------------------
Model_20 - Training
- Started on 04-10 at 00:19
Model_20 - Training Complete
- Time: 9.403 min
- Loss = 99.98739
- Val Loss = 100.00000
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_20
Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------

Model_29 Created
----------------------------------
Model_29 - Training
- Started on 04-10 at 01:47
Model_29 - Training Complete
- Time: 10.051 min
- Loss = 99.98739
- Val Loss = 100.00000
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_29
Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------------------------
 
Model_30 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: V_avg
----------------------------------
V_avg Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_30 Created
----------------------------------
Model_30 - Training
- Started on 04-10 at 01:57
Model_30 - Training Complete
- Time: 10.526 min
- Loss = 99.98739
- Val Loss = 10

Model_39 Created
----------------------------------
Model_39 - Training
- Started on 04-10 at 03:45
Model_39 - Training Complete
- Time: 12.259 min
- Loss = 957.38395
- Val Loss = 139.74297
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_39
Inverse Scaling Operation
- Prediction Mean = -0.01050
- Prediction Min = -0.20710
- Prediction Max = -0.00173
Calculating HC Error
- HC Error  = 9.98e+06
----------------------------------
 
Model_40 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Displacement
----------------------------------
Displacement Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_40 Created
----------------------------------
Model_40 - Training
- Started on 04-10 at 03:58
Model_40 - Training Complete
- Time: 11.761 min
- Loss = 101

Inverse Scaling Operation
- Prediction Mean = 0.00302
- Prediction Min = -0.11719
- Prediction Max = 0.02405
Calculating HC Error
- HC Error  = 3.55e+06
----------------------------------
 
Model_49 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: V_max
----------------------------------
V_max Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_49 Created
----------------------------------
Model_49 - Training
- Started on 04-10 at 05:44
Model_49 - Training Complete
- Time: 12.386 min
- Loss = 1243.72659
- Val Loss = 141.09491
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_49
Inverse Scaling Operation
- Prediction Mean = 0.00458
- Prediction Min = -0.17095
- Prediction Max = 0.11896
Calculating HC Error
- HC Error  = 1.52e+07
----------------------

Model_58 Created
----------------------------------
Model_58 - Training
- Started on 04-10 at 07:40
Model_58 - Training Complete
- Time: 13.384 min
- Loss = 99.98739
- Val Loss = 100.00000
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_58
Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------------------------
 
Model_59 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: Fuel_System
----------------------------------
Fuel_System Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_59 Created
----------------------------------
Model_59 - Training
- Started on 04-10 at 07:53
Model_59 - Training Complete
- Time: 13.556 min
- Loss = 99.98739
- 

Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------------------------
 
Model_68 Variables:
- Loss: MAPE
- Activation: LReLU
- Output: ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
- Dropped: V_std
----------------------------------
V_std Column Dropped
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 17
Data-sets complete
----------------------------------
Model_68 Created
----------------------------------
Model_68 - Training
- Started on 04-10 at 10:06
Model_68 - Training Complete
- Time: 16.351 min
- Loss = 99.98739
- Val Loss = 100.00000
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_68
Inverse Scaling Operation
- Prediction Mean = 0.00006
- Prediction Min = 0.00006
- Prediction Max = 0.00006
Calculating HC Error
- HC Error  = 9.76e+04
----------------------------

In [28]:
HC_ranking

Unnamed: 0,Model,Scaler,Activation,Output,Dropout,Variable/Variance,MSPE
21,Model_22,MinMax,ReLU,ReLU,0.2,Displacement,9.760168e+04
20,Model_21,MinMax,ReLU,ReLU,0.2,Manufacturer_Code,9.760168e+04
58,Model_59,MinMax,LReLU,ReLU,0.2,Fuel_System,9.760168e+04
22,Model_23,MinMax,ReLU,ReLU,0.2,Fuel_System,9.760168e+04
23,Model_24,MinMax,ReLU,ReLU,0.2,Gears,9.760168e+04
24,Model_25,MinMax,ReLU,ReLU,0.2,Transmission_Code,9.760168e+04
25,Model_26,MinMax,ReLU,ReLU,0.2,ETW,9.760168e+04
26,Model_27,MinMax,ReLU,ReLU,0.2,HP,9.760168e+04
27,Model_28,MinMax,ReLU,ReLU,0.2,Drive_System_Code,9.760168e+04
28,Model_29,MinMax,ReLU,ReLU,0.2,Fuel_Code,9.760168e+04


In [None]:
number = 5
print(models[number].name + ' ' + HC_ranking['Scaler'][number])

In [29]:
epoch_vector=np.linspace(1,epochs,epochs)

for i in range(len(models)):
    model = models[i]
    history = histories[i]
    activation = HC_ranking['Activation'][i]
    output = HC_ranking['Output'][i]
    variable = HC_ranking['Variable/Variance'][i]
    
    model.save(os.path.join(output_path,'{}_{}_{}_{}.h5'.format(model.name, activation, output, variable)))
    
    hist_data =[epoch_vector,history.history['loss'],history.history['val_loss']]
    hist_data =pd.DataFrame(hist_data).transpose()
    hist_data.columns=['Epochs','loss','val_loss']
    
    hist_data.to_csv(os.path.join(output_path,'Training_History_{}.csv'.format(model.name)),index=False)

In [30]:
# I forgot to train the models using all 18 input variables

In [35]:
def process_models_2():
    
    count = 75
    model_list = []
    history_list = []
    HC_error_list = []
        
    for scaler_type in scaler_types:

        for activation_name in names_activations:

            for output in output_activations:

                for dd in dropouts:

                    # Print model variables
                    print('Model_{} Variables:'.format(count))
                    print('- Loss: MAPE')
                    print('- Activation: {}'.format(activation_name))
                    print('- Output: {}'.format(output))
                    print('- Optimizer: Adam')
                    print('- Dropout: {}%'.format(dd*100))
                    print('- Scaler: {}'.format(scaler_type))
                    print('----------------------------------')

                    data_scaled = data_minmax_scaled.copy()

                    # Prepare data-sets and create inputs and outputs
                    x_train, x_dev, x_test, y_train, y_dev, y_test = prepare_data_sets(data_scaled)

                    # Create model
                    model = build_model(count, x_train, activation_name, output, dd)

                    # Train model
                    history = train_models(model, x_train, y_train, x_dev, y_dev)
                    history_list.append(history)

                    # Get test_set_inverse
                    test_set_inverse = get_test_inverse(scaler_type)

                    # Make predictions and calculate error
                    error = predict_get_error(model, x_test, scaler_type, test_set_inverse)

                    # Add error to error list
                    HC_error_list.append([model.name, scaler_type, activation_name, output, dd, variable, error])

                    # Announce one model process ended
                    print('============== MODEL {} PROCESS END =============='.format(count))
                    print(' ')

                    # Increase counter by 1
                    count = count+1

                    # Add TRAINED model to list
                    model_list.append(model)
                        
    #------------------------------------------------------------

    print('Creating DataFrame')                
    HC_error = pd.DataFrame(HC_error_list)

    print('Changing DataFrame column names')
    HC_error.columns = ['Model', 'Scaler', 'Activation', 'Output', 'Dropout', 'Variable/Variance', 'MSPE']

    print('Ranking Models')
    HC_error.sort_values(by=['MSPE'], inplace=True)

    count = 0
    
    return HC_error, model_list, history_list

In [36]:
HC_ranking_2, models_2, histories_2 = process_models_2()

Model_75 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: linear
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
----------------------------------
Preparing Data-sets
Splitting into inputs and outputs
Inputs = 18
Data-sets complete
----------------------------------
Model_75 Created
----------------------------------
Model_75 - Training
- Started on 04-12 at 13:02
Model_75 - Training Complete
- Time: 18.813 min
- Loss = 928.70895
- Val Loss = 125.84016
----------------------------------
Success creating inverse test set
----------------------------------
Predicting with Model_75
Inverse Scaling Operation
- Prediction Mean = 0.00279
- Prediction Min = -1.02577
- Prediction Max = 0.00982
Calculating HC Error
- HC Error  = 5.86e+07
----------------------------------
 
Model_76 Variables:
- Loss: MAPE
- Activation: ReLU
- Output: ReLU
- Optimizer: Adam
- Dropout: 20.0%
- Scaler: MinMax
----------------------------------
Preparing Data-sets
Splitting into inputs and outputs
Inputs 

In [38]:
HC_ranking_2.drop(columns=['Variable/Variance'], inplace=True)

In [39]:
HC_ranking_2

Unnamed: 0,Model,Scaler,Activation,Output,Dropout,MSPE
1,Model_76,MinMax,ReLU,ReLU,0.2,97601.68
3,Model_78,MinMax,LReLU,ReLU,0.2,97601.68
2,Model_77,MinMax,LReLU,linear,0.2,1458004.0
0,Model_75,MinMax,ReLU,linear,0.2,58587670.0


In [41]:
for i in range(len(models_2)):
    model = models_2[i]
    history = histories_2[i]
    activation = HC_ranking_2['Activation'][i]
    output = HC_ranking_2['Output'][i]
    variable = 'ALL'
    
    model.save(os.path.join(output_path,'{}_{}_{}_{}.h5'.format(model.name, activation, output, variable)))
    
    hist_data =[epoch_vector,history.history['loss'],history.history['val_loss']]
    hist_data =pd.DataFrame(hist_data).transpose()
    hist_data.columns=['Epochs','loss','val_loss']
    
    hist_data.to_csv(os.path.join(output_path,'Training_History_{}.csv'.format(model.name)),index=False)

## Next Steps

The models and the training histories will be moved to foldr **Gen 9**. 

Nothing in this notebook yielded an improvement in performance. 

Nothing else will be done as time does not permit it. 

Next steps in the proyect is doing an investigation into how to correct vehicle emissions to account for the vehicle age (and maybe alitude), and write the final document.