# The Naive Modeler
**Expectation**  
It is sometimes suggested that machine learning models are so "intelligent" that you only need to feed training data into a model and fit it to get a model about as good as can be achieved. Some tuning of the most basic model parameters might be needed, but nothing more.

**Reality**  
Such naive model building often does not yield very good models. There are many possible reasons, for example:
* interaction effects are present among the predictors
* appropriate features need to be calculated from the raw data
* the data needs cleaning, which requires in-depth knowledge of the data
* ...

**This Project**  
The purpose of this project is to investigate how well common predictive regression model types handle a significant interaction between predictors. The code uses randomly generated data and builds regression models of the most common types, sometimes trying several simple variations. It assumes that the modeler does not try to understand the data closely, does not calculate new features, and tries only the most basic variations of each model type.

**The Data**  
The data contains box dimensions (length, width and height) generated randomly, each greater than 0 and at most 1.0. These are the predictors. The outcome is the box volume. There is no error term, so perfect predictions are possible. The code sets aside 10% of the data as the test data, leaving 90% as training data.

**The Models**  
Various regression model types are built. A model predicting the volume according to: 
predicted_volume = length x width x height 
would achieve 100% accuracy. However, most model types cannot represent a relationship like this and so will be less than 100% accurate. Since the assumption is that the user is naive, the code does not calculate dimensions_product = length x width x height as a new feature, which is what a modeler who understands the data and the abilities of the various model types would do.
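
To illustrate what the naive modeler is leaving out, here is a minimal sketch that is not part of the project's code. It assumes an ordinary least-squares fit via `np.linalg.lstsq` as a stand-in for a linear model; the helper `lstsq_rmse`, the seed and the sample size are hypothetical choices made only for illustration. It compares an in-sample fit on the three raw dimensions with a fit on the single engineered feature dimensions_product.

```python
import numpy as np

# Illustrative data: 1000 boxes with dimensions in (0, 1] (seed chosen arbitrarily)
rng = np.random.default_rng(0)
dims = 1.0 - rng.random((1000, 3))
volume = dims.prod(axis=1)

def lstsq_rmse(features, target):
    """Fit target ~ features + intercept by least squares; return (RMSE, coefficients)."""
    design = np.column_stack([features, np.ones(len(features))])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = design @ coeffs - target
    return np.sqrt(np.mean(residuals ** 2)), coeffs

# Naive: the three raw dimensions as predictors
rmse_raw, _ = lstsq_rmse(dims, volume)

# Informed: one engineered feature, length x width x height
dimensions_product = dims.prod(axis=1).reshape(-1, 1)
rmse_product, coeffs_product = lstsq_rmse(dimensions_product, volume)

print(f'RMSE with raw dimensions:     {rmse_raw:.4f}')
print(f'RMSE with dimensions_product: {rmse_product:.2e}')
```

With the engineered feature the fitted coefficient should be close to 1 and the error close to zero, while the fit on the raw dimensions leaves a clearly non-zero error.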

**Library Requirements**  
numpy, scikit-learn (sklearn) and, for the last models, tensorflow

In [None]:
import numpy as np
#set the random seed so models and results can be reproduced
np.random.seed(seed=123456789)

In [None]:
# Constants - you can try out different values for these
NUM_BOXES = 10000        # recommended at least 1000 - if too low models might suffer from overfitting
TRAINING_PERCENT = 90    # 90 is recommended, keep between 1 and 99

In [None]:
# create an array with columns being the 3 dimensions of boxes (length, width, height)
# which are random values from (0,1]
# Note: random_sample generates values in range [0,1), so multiply by -1 and add 1 to get values in range (0,1]
box_dims = np.random.random_sample(size=(NUM_BOXES, 3))
box_dims = (box_dims * (-1)) + 1
# create an array with the corresponding volumes of the boxes
box_volumes = box_dims[:,0] * box_dims[:,1] * box_dims[:,2]

With box volumes as the outcome, there is a strong interaction effect because the contribution of each dimension to the outcome depends on the other two. For example, an additional 0.1 added to the length of a box increases the volume of a box with width 0.8 and height 0.8 by more than it increases the volume of a box with width 0.2 and height 0.2.
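
The arithmetic behind this example can be checked with a short sketch. The starting length of 0.5 and the helper names are hypothetical, chosen only for illustration; any starting length gives the same gains because the added volume is 0.1 x width x height.

```python
def box_volume(length, width, height):
    return length * width * height

start_length = 0.5   # hypothetical starting length
extra_length = 0.1

def volume_gain(width, height):
    """Volume added by lengthening the box by extra_length."""
    return (box_volume(start_length + extra_length, width, height)
            - box_volume(start_length, width, height))

gain_wide_box = volume_gain(0.8, 0.8)    # mathematically 0.1 * 0.8 * 0.8 = 0.064
gain_narrow_box = volume_gain(0.2, 0.2)  # mathematically 0.1 * 0.2 * 0.2 = 0.004

print(f'gain for the 0.8 x 0.8 box: {gain_wide_box:.3f}')
print(f'gain for the 0.2 x 0.2 box: {gain_narrow_box:.3f}')
```

The same change in length is worth 16 times more volume in the wider, taller box, which is exactly what a purely additive model cannot express.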

In [None]:
# Traditionally the predictors array is called X and the outcome array is y
X = box_dims
y = box_volumes

For comparison, if y is instead the length of tape around the box (2\*width + 2\*height) then there is no interaction effect. In this case many of the models below should make perfect (or almost perfect) predictions.
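
As a rough check of this claim, the sketch below fits a plain least-squares linear model to tape length. It is an illustration only: it uses `np.linalg.lstsq` rather than the scikit-learn models used later, and the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)              # arbitrary seed for illustration
dims = 1.0 - rng.random((500, 3))           # columns: length, width, height
tape = 2 * dims[:, 1] + 2 * dims[:, 2]      # tape length as defined above

# Least-squares fit of tape ~ length + width + height + intercept
design = np.column_stack([dims, np.ones(len(dims))])
coeffs, *_ = np.linalg.lstsq(design, tape, rcond=None)
max_abs_error = np.max(np.abs(design @ coeffs - tape))

print('coefficients (length, width, height, intercept):', np.round(coeffs, 6))
print(f'largest absolute error: {max_abs_error:.2e}')
```

Because the outcome is a sum of separate contributions from width and height, the fit should recover coefficients close to 0, 2, 2 with an intercept close to 0, and the error should be at the level of floating-point rounding.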

In [None]:
# Uncomment the following to try tape length as the outcome.
#box_tape = 2*box_dims[:,1] + 2*box_dims[:,2]
#y = box_tape
# Any reference below to "volume" should be understood to mean "tape length" if the last lines are uncommented

In [None]:
# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=TRAINING_PERCENT/100, random_state=987654321)

## Some data analysis

In [None]:
print(f'\n\
    shape X: {X.shape}\n\
    shape y: {y.shape}\n\
    shape X_train: {X_train.shape}\n\
    shape X_test: {X_test.shape}\n\
    shape y_train: {y_train.shape}\n\
    shape y_test: {y_test.shape}\n\
    mean X: {X.mean():.4f}\n\
    mean y: {y.mean():.4f}')

Out of curiosity, calculate correlations between the predictors, predictor products and the outcome.

In [None]:
# separate the columns and calculate products of dimension pairs
X_length = X[:,0]
X_width = X[:,1]
X_height = X[:,2]
X_length_by_width = X_length * X_width
X_length_by_height = X_length * X_height
X_width_by_height = X_width * X_height

In [None]:
def corr2(var_1, var_2):
    corrcoef_result = np.corrcoef(var_1, var_2)
    # the result is the 2x2 correlation coefficient matrix; element [1,0] is the correlation between var_1 and var_2
    return corrcoef_result[1,0]
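
For readers unfamiliar with `np.corrcoef`, the following small sketch shows the shape of what it returns for two 1-D inputs. The arrays are made-up illustration data, not the box data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_linear = 2.0 * x + 1.0      # perfectly positively related to x
y_reversed = -x               # perfectly negatively related to x

corr_matrix = np.corrcoef(x, y_linear)
print(corr_matrix.shape)      # (2, 2)

# The diagonal holds each variable's correlation with itself;
# the two off-diagonal elements are equal and hold the correlation of x with y.
r_linear = corr_matrix[1, 0]
r_reversed = np.corrcoef(x, y_reversed)[1, 0]
print(f'{r_linear:.4f}, {r_reversed:.4f}')
```

Since the matrix is symmetric, taking element [1,0] or element [0,1] gives the same value.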

In [None]:
print('Correlation between predictors and outcome and products of predictors and outcomes')
print(f'length vs volume: {corr2(X_length,y):.4f}')
print(f'width vs volume: {corr2(X_width,y):.4f}')
print(f'height vs volume: {corr2(X_height,y):.4f}')
print(f'(length x width) vs volume: {corr2(X_length_by_width,y):.4f}')
print(f'(length x height) vs volume: {corr2(X_length_by_height,y):.4f}')
print(f'(width x height) vs volume: {corr2(X_width_by_height,y):.4f}')

## Use a Normalized Root Mean Squared Error, as a percent, for measuring the accuracy of all models

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# The smaller nrmse the more accurate - a nrmse of 0 means perfect accuracy.
def pct_nrmse(y_actual, y_pred):
    # normalize the rmse by dividing the rmse by the average predicted value
    avg_pred = np.mean(y_pred)
    rmse = sqrt(mean_squared_error(y_actual, y_pred))
    return (rmse / avg_pred) * 100
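
To make the metric concrete, here is a worked example on tiny made-up arrays. The function `pct_nrmse_np` is a hypothetical NumPy-only restatement of `pct_nrmse` above, written so that the sketch runs on its own; it is assumed to be equivalent for 1-D inputs.

```python
import numpy as np

def pct_nrmse_np(y_actual, y_pred):
    """RMSE divided by the mean predicted value, as a percent."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
    return rmse / np.mean(y_pred) * 100

# Perfect predictions: rmse = 0, so the percent error is 0
perfect = pct_nrmse_np([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])

# Each prediction is off by 1: rmse = 1, mean prediction = 2, so the result is 50%
off_by_one = pct_nrmse_np([2.0, 2.0], [1.0, 3.0])

print(f'{perfect:.2f}%, {off_by_one:.2f}%')
```

Note that because the normalization uses the mean of the predictions, the value is not meaningful for a model whose predictions average to zero or below.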

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Fit the model using the training data
reg_model = LinearRegression().fit(X_train, y_train)

In [None]:
# Calculate the predicted values and measure accuracy
y_pred_reg_model = reg_model.predict(X_test)
pct_err = pct_nrmse(y_test, y_pred_reg_model)
print(f'Percent error: {pct_err:.2f}%')

In [None]:
# For interest, view the coefficients generated for the model
print(f'Coeffs: \n\
      {reg_model.coef_[0]:.4f}(length), \n\
      {reg_model.coef_[1]:.4f}(width), \n\
      {reg_model.coef_[2]:.4f}(height). \n\
Intercept: {reg_model.intercept_:.4f}')

## Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge_model = Ridge(alpha=1.0)

In [None]:
ridge_model.fit(X_train, y_train)

In [None]:
#Measure accuracy
y_pred_ridge_model = ridge_model.predict(X_test)
pct_err = pct_nrmse(y_test, y_pred_ridge_model)
print(f'Percent error: {pct_err:.2f}%')

## Epsilon-Support Vector Regression

In [None]:
from sklearn.svm import SVR

In [None]:
#We'll compare the linear, poly, rbf and sigmoid kernels
#Returns the percent error
def svr_with_kernel(X_train, X_test, y_train, y_test, kernel):
    svr_model = SVR(gamma='scale', kernel=kernel)
    svr_model.fit(X_train, y_train)
    y_pred_svr_model = svr_model.predict(X_test)
    return pct_nrmse(y_test, y_pred_svr_model)

In [None]:
kernels = ('linear', 'poly', 'rbf', 'sigmoid')

print(f'SVR percent error, kernel: \n')
for kernel in kernels:
    svr_acc = svr_with_kernel(X_train, X_test, y_train, y_test, kernel)
    print(f'{svr_acc:.2f}%, {kernel}')

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
import math

In [None]:
# Use depths from first_depth up to, but not including, the integer part of log base 2 of NUM_BOXES
first_depth = 1
depths = list(range(first_depth, int(math.log(NUM_BOXES, 2))))
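
As a quick illustration of which depths this produces, the sketch below repeats the calculation for the default of 10000 boxes and lists the largest number of leaves a binary tree of each depth can have (2 to the power of the depth). The names `num_boxes` and `max_leaves` are introduced here for illustration only.

```python
import math

num_boxes = 10000  # the notebook's default NUM_BOXES
depths = list(range(1, int(math.log(num_boxes, 2))))

# A binary tree of depth d has at most 2**d leaves
max_leaves = {depth: 2 ** depth for depth in depths}

print(depths[0], depths[-1])      # 1 12
print(max_leaves[depths[-1]])     # 4096
```

Since `range` excludes its upper bound, the deepest limited tree has depth 12 rather than 13; the unlimited-depth tree built separately covers anything deeper.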

In [None]:
#We'll use a variety of depths to compare
#Returns a tuple containing the percent error and the number of leaves
def dec_tree_with_depth(X_train, X_test, y_train, y_test, depth):
    dec_tree_model = DecisionTreeRegressor(max_depth=depth)
    dec_tree_model.fit(X_train, y_train)
    y_pred_dec_tree_model = dec_tree_model.predict(X_test)
    dec_tree_rmse = pct_nrmse(y_test, y_pred_dec_tree_model)
    num_leaves = dec_tree_model.get_n_leaves()
    return (dec_tree_rmse, num_leaves)

In [None]:
print(f'Decision tree: percent error, num leaves, depth:')
for depth in depths:
    dec_tree_depth = dec_tree_with_depth(X_train, X_test, y_train, y_test, depth)
    print(f'{dec_tree_depth[0]:.2f}%, {dec_tree_depth[1]} - depth={depth}')
    
dec_tree_depth_no_limit = dec_tree_with_depth(X_train, X_test, y_train, y_test, None)
print(f'{dec_tree_depth_no_limit[0]:.2f}%, {dec_tree_depth_no_limit[1]} - depth=no limit')

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
random_forest_regressor_model = RandomForestRegressor(random_state=911)

In [None]:
random_forest_regressor_model.fit(X_train,y_train)

In [None]:
y_pred_random_forest_regressor_model = random_forest_regressor_model.predict(X_test)
pct_err = pct_nrmse(y_test, y_pred_random_forest_regressor_model)
print(f'Percent error: {pct_err:.2f}%')

## Neural Network 1 (using mostly sklearn defaults)

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
#We'll try each of the activation functions MLPRegressor offers
#Prints the name of the activation function, the percent error and the number of layers
def mlp_with_act_func(X_train, X_test, y_train, y_test, act_func):
    mlp_model = MLPRegressor(activation=act_func, hidden_layer_sizes=(9,9,9))
    mlp_model.fit(X_train, y_train)
    y_pred_mlp_model = mlp_model.predict(X_test)
    mlp_model_pct_nrmse = pct_nrmse(y_test, y_pred_mlp_model)
    mlp_model_num_layers = mlp_model.n_layers_
    print(f'{act_func}, {mlp_model_pct_nrmse:.2f}%, {mlp_model_num_layers}')

In [None]:
act_functions = ('identity', 'logistic', 'tanh', 'relu')

print('activation, percent error, num layers')
for act_func in act_functions:
    mlp_with_act_func(X_train, X_test, y_train, y_test, act_func)

## Neural Network 2 - using Tensorflow, only input layer and output layer, variety of activation functions

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Activation
import math

In [None]:
activations = ('relu', 'sigmoid', 'tanh', 'linear', 'exponential', 'softmax')

In [None]:
#runs simple NN, no hidden layers, only input layer and output layer
#prints the activation and the percent error
def simple_NN(X_train, X_test, y_train, y_test, activation):
    nn_model = keras.Sequential()
    nn_model.add(Dense(1, input_dim=(3), activation=activation))
    #optimizer = tf.keras.optimizers.SGD(0.001)
    nn_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_squared_error'])
    nn_model.fit(X_train, y_train, batch_size=10, epochs=100, verbose=0)
    y_pred_nn_model = nn_model.predict(X_test)
    nn_model_pct_nrmse = pct_nrmse(y_test, y_pred_nn_model)
    print(f'{activation}, {nn_model_pct_nrmse:.2f}%')
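
With no hidden layer, the network above has a single output unit that computes activation(X·w + b). The NumPy sketch below illustrates that calculation outside of TensorFlow; the weights, bias and input rows are hypothetical values, and the `softmax` helper is a hand-written stand-in rather than the Keras implementation. It shows in particular that softmax, which normalizes across the units of a layer, always outputs 1 when the layer has only one unit.

```python
import numpy as np

def softmax(z):
    """Softmax across the last axis (the units of the layer)."""
    shifted = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

# Two example boxes (length, width, height) and hypothetical trained parameters
X_example = np.array([[0.2, 0.5, 0.9],
                      [1.0, 1.0, 1.0]])
weights = np.array([[0.3], [-0.1], [0.7]])   # shape (3, 1): one output unit
bias = np.array([0.05])

z = X_example @ weights + bias               # pre-activation, shape (2, 1)

linear_out = z                               # 'linear': same form as linear regression
relu_out = np.maximum(z, 0.0)                # 'relu': negative values clipped to 0
softmax_out = softmax(z)                     # 'softmax' over a single unit: all ones

print(linear_out.ravel())
print(relu_out.ravel())
print(softmax_out.ravel())
```

This suggests that, in this one-unit setting, the softmax variant can only ever predict a constant, and the linear variant can do no better than a linear regression on the raw dimensions.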

In [None]:
print('activation, percent error')

for act in activations:
    simple_NN(X_train, X_test, y_train, y_test, act)

## Neural Network 3 - 3 hidden layers, all layers using the same specified activation function, variety of activation functions

In [None]:
#runs a NN with 3 hidden layers, all layers using the same specified activation function
#prints the activation and the percent error
def layered_NN(X_train, X_test, y_train, y_test, activation):
    nn_model = keras.Sequential()
    nn_model.add(Dense(6, input_dim=(3), activation=activation))
    nn_model.add(Dense(6, activation=activation))
    nn_model.add(Dense(6, activation=activation))
    nn_model.add(Dense(1, activation=activation))
    optimizer = tf.keras.optimizers.SGD(0.001)
    nn_model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mean_squared_error'])
    nn_model.fit(X_train, y_train, batch_size=10, epochs=100, verbose=0)
    y_pred_nn_model = nn_model.predict(X_test)
    nn_model_pct_nrmse = pct_nrmse(y_test, y_pred_nn_model)
    print(f'{activation}, {nn_model_pct_nrmse:.2f}%')

In [None]:
print('activation, percent error')

for act in activations:
    layered_NN(X_train, X_test, y_train, y_test, act)

In [None]:
print('activation, percent error')

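# Note: with_power_activations_NN is not defined in any cell above; define it before running this cell
for act in activations:
    with_power_activations_NN(X_train, X_test, y_train, y_test, act)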