# Model Selection Using Cross-validation on a Traffic Volume Dataset

In this activity, you will practice model selection using cross-validation one more time. We will use a simulated dataset in which the target variable is the volume of traffic, in cars/hour, across a city bridge, and the features are normalized traffic-related quantities such as the time of day and the traffic volume on the previous day. Our goal is to build a model that predicts the traffic volume across the city bridge from these features.

The dataset contains 10,000 records, each with 10 attributes/features. We will build a deep neural network that takes the 10 features as input and predicts the traffic volume across the bridge. Since the output is a number, this is a regression problem.
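
Because this is a regression problem, the models defined later in this activity are trained with the mean squared error (MSE) loss. As a minimal sketch of what that loss measures, the plain-Python function below computes the MSE for a few made-up traffic volumes. The function name `mean_squared_error` and the example numbers are assumptions chosen for illustration; they are not taken from the dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up traffic volumes in cars/hour, for illustration only
actual = [100, 250, 400]
predicted = [110, 240, 420]

mse = mean_squared_error(actual, predicted)
print(mse)  # (10**2 + 10**2 + 20**2) / 3 = 200.0
```

Note that the MSE is expressed in squared units (here, cars/hour squared), so its square root is often easier to compare against the output range.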

### 1. Import all the required packages.

In [6]:
```python
import pandas as pd
import numpy as np
from tensorflow import random
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

SEED = 1

# Seed NumPy and TensorFlow so that runs are reproducible
np.random.seed(SEED)
random.set_seed(SEED)
```

### 2. Load the dataset and print the input and output sizes to check the number of examples in the dataset and the number of features for each example. Also, print the range of the output (the output in this dataset represents the traffic volume across the bridge in cars/hour).

In [7]:
```python
# Load the dataset
X = pd.read_csv('../data/traffic_volume_feats.csv')
y = pd.read_csv('../data/traffic_volume_target.csv')

# Print the sizes of input data and output data
print(X.shape)
print(y.shape)
# Print the range for output
print(f"Output Range = ({y['Volume'].min()}, {y['Volume'].max()})")
```

```
(10000, 10)
(10000, 1)
Output Range = (0, 584)
```


### 3. Define three functions, each returning a different Keras model. The first Keras model will be a shallow neural network with one hidden layer of size 10 and a ReLU activation function. The second Keras model will be a deep neural network with two hidden layers of size 10 and a ReLU activation function in each layer. The third Keras model will be a deep neural network with three hidden layers of size 10 and a ReLU activation function in each layer. Use the following values as well: optimizer = 'adam', loss = 'mean_squared_error'.

In [8]:
```python
def build_model_1(activation='relu', optimizer='adam', loss='mean_squared_error'):
    # Shallow network: one hidden layer of size 10
    model = Sequential()
    model.add(Dense(10, input_dim=X.shape[1], activation=activation))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss=loss)
    return model

def build_model_2(activation='relu', optimizer='adam', loss='mean_squared_error'):
    # Deep network: two hidden layers of size 10
    model = Sequential()
    model.add(Dense(10, input_dim=X.shape[1], activation=activation))
    model.add(Dense(10, activation=activation))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss=loss)
    return model

def build_model_3(activation='relu', optimizer='adam', loss='mean_squared_error'):
    # Deep network: three hidden layers of size 10
    model = Sequential()
    model.add(Dense(10, input_dim=X.shape[1], activation=activation))
    model.add(Dense(10, activation=activation))
    model.add(Dense(10, activation=activation))
    model.add(Dense(1))
    model.compile(optimizer=optimizer, loss=loss)
    return model
```
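
To get a feel for how the three architectures differ in capacity, their trainable parameters can be counted by hand: a `Dense` layer with n inputs and k units has n*k weights plus k biases. The sketch below does this arithmetic in plain Python. The helper `count_dense_parameters` is a hypothetical name introduced here, and the calculation assumes 10 input features (as in this dataset) and the default bias term in every layer; the totals should agree with what `model.summary()` reports for each model.

```python
def count_dense_parameters(n_inputs, hidden_sizes, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = n_inputs
    for size in list(hidden_sizes) + [n_outputs]:
        # weights (previous * size) plus one bias per unit (size)
        total += previous * size + size
        previous = size
    return total


n_features = 10  # assumed to match X.shape[1]

param_counts = {
    "build_model_1": count_dense_parameters(n_features, [10]),
    "build_model_2": count_dense_parameters(n_features, [10, 10]),
    "build_model_3": count_dense_parameters(n_features, [10, 10, 10]),
}

for name, count in param_counts.items():
    print(f"{name}: {count} trainable parameters")
# build_model_1: 121 trainable parameters
# build_model_2: 231 trainable parameters
# build_model_3: 341 trainable parameters
```

Each additional hidden layer of size 10 adds 110 parameters, so all three models are small relative to the 10,000 available records.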

### 4. Write the code to loop over the three models and perform 5-fold cross-validation on each of them (use epochs=100, batch_size=5, and shuffle=False in this step). Store all the cross-validation scores in a list and print the results. Which model results in the lowest test error rate?
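
Before running this step, it helps to estimate how much training it involves. The sketch below works through the arithmetic for the settings named above, assuming that the 10,000 records are split into 5 equal folds and that every training fold is used in full for each fit. The variable names are illustrative only.

```python
import math

n_records = 10000
n_folds = 5
epochs = 100
batch_size = 5
n_models = 3

# Each fold holds out 1/5 of the data for testing and trains on the rest
test_size = n_records // n_folds
train_size = n_records - test_size

# Number of gradient updates in one epoch and in one full fit
batches_per_epoch = math.ceil(train_size / batch_size)
updates_per_fit = batches_per_epoch * epochs

# One fit per (model, fold) pair
n_fits = n_models * n_folds
total_updates = updates_per_fit * n_fits

print(f"train/test size per fold: {train_size}/{test_size}")
print(f"batches per epoch: {batches_per_epoch}")
print(f"fits: {n_fits}, total gradient updates: {total_updates}")
```

With these settings the loop performs 15 separate fits and 2,400,000 gradient updates in total, so expect this cell to take a while to finish.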

In [None]:
```python
n_folds = 5

params = {
    'epochs': 100,
    'batch_size': 5,
    'shuffle': False,
    'verbose': 1
}

results_1 = []

models = [build_model_1, build_model_2, build_model_3]

for build_fn in models:
    # build regressor
    regressor = KerasRegressor(build_fn=build_fn, **params)
    # build pipeline
    model = make_pipeline(StandardScaler(), regressor)
    # define cross-validator
    cv = KFold(n_folds, shuffle=True, random_state=SEED)
    # perform cross-validation
    result = cross_val_score(model, X, y, cv=cv, verbose=1)
    # append result to list
    results_1.append(result)

# KerasRegressor scores are the negated loss (MSE), so report the
# absolute value of the mean across folds as the test error rate
for m, result in enumerate(results_1, start=1):
    print(f'Model_{m} test error rate: {abs(result.mean())}')
```
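
To answer the question of which model has the lowest test error rate, the per-fold scores in `results_1` need to be reduced to one number per model. `cross_val_score` returns one score per fold, and the Keras scikit-learn wrapper reports the negated loss as its score, so the mean of each array is a negative MSE. The following is a minimal sketch of that reduction in plain Python. The helper `summarize_cv_scores` is a hypothetical name, and the scores passed to it are placeholder numbers made up for illustration; they are not results from this activity.

```python
def summarize_cv_scores(scores_per_model):
    """Return (mean test error per model, index of the lowest-error model)."""
    mean_errors = []
    for scores in scores_per_model:
        scores = list(scores)
        mean_score = sum(scores) / len(scores)
        # Scores are negated losses, so the error is the absolute value
        mean_errors.append(abs(mean_score))
    best_index = min(range(len(mean_errors)), key=mean_errors.__getitem__)
    return mean_errors, best_index


# Placeholder fold scores (negated MSE), NOT results from this activity
placeholder_scores = [
    [-30.0, -28.0, -29.0, -31.0, -27.0],
    [-26.0, -24.0, -25.0, -27.0, -23.0],
    [-28.0, -26.0, -27.0, -29.0, -25.0],
]

mean_errors, best_index = summarize_cv_scores(placeholder_scores)
for m, error in enumerate(mean_errors, start=1):
    print(f"Model_{m} mean test error: {error}")
print(f"Lowest test error: Model_{best_index + 1}")
```

In the notebook, the same helper could be called as `summarize_cv_scores(results_1)` once the cross-validation loop has finished.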