# Modeling with machine learning

In this section, we will cover:

- fitting different machine learning regression models with sklearn
- score analysis: MSE and variance explained ($R^2$)
- comparing the models: conclusions
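
As a quick reference for the two scores used throughout this section, the sketch below computes MSE and $R^2$ by hand with NumPy and then with `sklearn.metrics`. The arrays are made-up illustrative values, not data from this project.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)

# The same scores computed by scikit-learn
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

An $R^2$ close to 1 means the residual variance is small compared with the variance of the target itself.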


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/df_resample.csv')
df.head()

Unnamed: 0,symboling,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,2,mitsubishi,gas,std,two,hatchback,fwd,front,93.7,157.3,...,92,2bbl,2.97,3.23,9.4,68.0,5500.0,37,41,5389.0
1,1,dodge,gas,std,four,sedan,fwd,front,93.7,157.3,...,90,2bbl,2.97,3.23,9.4,68.0,5500.0,31,38,6692.0
2,0,jaguar,gas,std,two,sedan,rwd,front,102.0,191.7,...,326,mpfi,3.54,2.76,11.5,262.0,5000.0,13,17,36000.0
3,0,peugot,gas,std,four,sedan,rwd,front,107.9,186.7,...,120,mpfi,3.46,3.19,8.4,97.0,5000.0,19,24,11900.0
4,0,subaru,gas,turbo,four,sedan,fwd,front,97.0,172.0,...,108,mpfi,3.62,2.64,7.7,111.0,4800.0,24,29,11259.0


In [2]:
X = df.copy()
X.drop('price', axis=1, inplace=True)
y = np.log(df.price) # as discussed, we are going to use the log transform here

# Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=95276
)

In [3]:
# normalize and encode
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import pickle

with open('data/category_list', 'rb') as file:
    cat_cols = pickle.load(file)

# numeric columns
num_cols = [col for col in X_train.columns if col not in cat_cols]

# normalize numeric features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(X_train[num_cols])

# encode categories
encoder = OneHotEncoder(sparse=False)
cat_encoded = encoder.fit_transform(X_train[cat_cols])

# all together
X_train_proc = np.concatenate([cat_encoded, num_scaled], axis=1)

In [4]:
# apply transformations on test set
num_scaled = scaler.transform(X_test[num_cols])

# encode categories
cat_encoded = encoder.transform(X_test[cat_cols])

# all together
X_test_proc = np.concatenate([cat_encoded, num_scaled], axis=1)
X_test_proc.shape

(3000, 73)
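
The fit-on-train, transform-on-test pattern used above can also be written with `ColumnTransformer`, which keeps both steps in one object. This is only a sketch of an alternative, not the approach used in this notebook; the small data frames and their values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, for illustration only (not the project's dataset)
train = pd.DataFrame({
    'fuel_type': ['gas', 'diesel', 'gas', 'gas'],
    'horsepower': [68.0, 97.0, 111.0, 262.0],
})
test = pd.DataFrame({
    'fuel_type': ['diesel', 'gas'],
    'horsepower': [90.0, 120.0],
})

# One transformer per column group; sparse_threshold=0 forces a dense array
preprocessor = ColumnTransformer(
    [
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['fuel_type']),
        ('num', StandardScaler(), ['horsepower']),
    ],
    sparse_threshold=0,
)

# Categories and scaling statistics are learned from the training set only
train_proc = preprocessor.fit_transform(train)
# The test set reuses them, so nothing is learned from the test data
test_proc = preprocessor.transform(test)

print(train_proc.shape, test_proc.shape)
```

As in the cells above, the encoded categorical columns come first and the scaled numeric columns last.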

## Hyper-parameter tuning and Cross Validation

Every model below is tuned with scikit-learn's `GridSearchCV`: it iterates over a grid of hyper-parameter values and uses cross validation to find the best combination for each model.
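
The sketch below shows what a grid search with cross validation looks like when `GridSearchCV` is called directly. It uses synthetic data from `make_regression`, and the grid values, `cv=3`, and the scoring choice are assumptions made for illustration. The internals of `aux.make_regressor` are not shown in this section and may differ.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Every combination in the grid is scored with 3-fold cross validation
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [3, 5, 10], 'min_samples_leaf': [2, 5]},
    scoring='neg_mean_squared_error',
    cv=3,
)
grid.fit(X_train, y_train)

# The best combination is refit on the full training set (refit=True by default)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print(grid.best_params_)
print('r2:', r2_score(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
```

The test set is only used once, after the search, so the reported scores are not influenced by the hyper-parameter selection.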

## Decision Tree Regressor


Let's start with a simple sklearn decision tree regression model.




In [5]:
%load_ext autoreload
%autoreload 2

import aux_functions as aux

In [6]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=95276)

grid_params = {
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
    'max_depth': [20, 25, 30]
}
name = 'Decision tree'
data = (X_train_proc, y_train, X_test_proc, y_test)

dt_results = aux.make_regressor(name, model, grid_params, data)

Decision tree
Score r2: 0.9986 
Score MSE: 7.89e+04 
Time: 1.8e+01s
{'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 20, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 95276, 'splitter': 'best'}


## k-Nearest Neighbors


In [7]:
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor()

grid_params = {
    'n_neighbors': [5, 10],
    'p': [1, 2]
    }
name = 'knn'
knn_results = aux.make_regressor(name, model, grid_params, data)

knn
Score r2: 0.9984 
Score MSE: 1.006e+05 
Time: 2.2s
{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 10, 'p': 1, 'weights': 'uniform'}


## Random Forests



In [8]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

grid_params = {
    'max_features': [10, 15, 20],
    'max_depth': [20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [2, 5],
}
name = 'RF'
rf_results = aux.make_regressor(name, model, grid_params, data)

RF
Score r2: 0.9986 
Score MSE: 7.915e+04 
Time: 2.3e+01s
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 30, 'max_features': 20, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


## Gradient tree boosting


Gradient tree boosting is also an ensemble machine learning method, but this time of the boosting class: several weak models are combined sequentially to produce a powerful estimator with reduced bias.

This method can be made robust against overfitting through regularization, for example shrinkage (the learning rate) and subsampling of the training data.
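
To make the regularization point concrete, the sketch below fits two gradient boosting models on synthetic data: one with the default learning rate and no subsampling, and one with stronger shrinkage (`learning_rate`) and row subsampling (`subsample`). The data and parameter values are assumptions for illustration, and which variant scores better depends on the dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

settings = {
    'default': {'learning_rate': 0.1, 'subsample': 1.0},
    # Smaller steps and stochastic row subsampling act as regularization
    'regularized': {'learning_rate': 0.05, 'subsample': 0.8},
}

test_mse = {}
for label, params in settings.items():
    model = GradientBoostingRegressor(n_estimators=200, random_state=0, **params)
    model.fit(X_train, y_train)
    test_mse[label] = mean_squared_error(y_test, model.predict(X_test))

print(test_mse)
```

Both parameters could be added to `grid_params` so that the grid search selects them.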

In [9]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()

grid_params = {
    'min_samples_split': [5],
}
name = 'Gradient Boost'
gb_results = aux.make_regressor(name, model, grid_params, data)


Gradient Boost
Score r2: 0.9871 
Score MSE: 5.658e+05 
Time: 8.0s
{'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'ls', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'presort': 'deprecated', 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


## AdaBoost

AdaBoost is another ensemble machine learning method of the boosting class.

This time, however, we can start with the best model we have so far. Copies of that model are then fitted on the same dataset, with the weights of the training instances adjusted according to the error of the current prediction.

Let's use our previously trained decision tree regressor.
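
As a minimal sketch of the idea, the example below boosts a shallow decision tree on synthetic data and compares it with the single tree. The data, tree depth, and number of estimators are assumptions for illustration. The base tree is passed as the first positional argument because the keyword name for it has changed across scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single shallow tree, used as the reference
base_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
base_tree.fit(X_train, y_train)

# AdaBoost clones the base tree and refits it on reweighted training instances
boosted = AdaBoostRegressor(base_tree, n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

print('single tree r2:', r2_score(y_test, base_tree.predict(X_test)))
print('boosted r2:', r2_score(y_test, boosted.predict(X_test)))
# Each fitted copy gets a weight in the final prediction
print('first estimator weights:', boosted.estimator_weights_[:5])
```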

In [12]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(random_state=95276, base_estimator=dt_results[0])

grid_params = {
    'learning_rate': [.5, 1],
}
name = 'AdaBoost'
ada_results = aux.make_regressor(name, model, grid_params, data)

AdaBoost
Score r2: 0.9986 
Score MSE: 7.917e+04 
Time: 2.4s
{'base_estimator__ccp_alpha': 0.0, 'base_estimator__criterion': 'mse', 'base_estimator__max_depth': 20, 'base_estimator__max_features': None, 'base_estimator__max_leaf_nodes': None, 'base_estimator__min_impurity_decrease': 0.0, 'base_estimator__min_impurity_split': None, 'base_estimator__min_samples_leaf': 2, 'base_estimator__min_samples_split': 2, 'base_estimator__min_weight_fraction_leaf': 0.0, 'base_estimator__presort': 'deprecated', 'base_estimator__random_state': 95276, 'base_estimator__splitter': 'best', 'base_estimator': DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=20,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=2, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=95276, splitter='best'), 'lear

## Comparing models - MSE and $R^2$

In [13]:
df_scores = pd.DataFrame({
    'MSE': [
        dt_results[2]['mse'],
        knn_results[2]['mse'],
        rf_results[2]['mse'],
        gb_results[2]['mse'],
        ada_results[2]['mse']

        ],
    'r2': [
        dt_results[2]['r2'],
        knn_results[2]['r2'],
        rf_results[2]['r2'],
        gb_results[2]['r2'],
        ada_results[2]['r2']
        ],
    'model name': [
        dt_results[2]['model name'],
        knn_results[2]['model name'],
        rf_results[2]['model name'],
        gb_results[2]['model name'],
        ada_results[2]['model name']
        ],
    'time': [
        dt_results[2]['time'],
        knn_results[2]['time'],
        rf_results[2]['time'],
        gb_results[2]['time'],
        ada_results[2]['time'],
        ],
    },
#     index=['linear', 'ridge', 'lasso', 'hubber']
)

# load the previously saved linear model scores (OLS and sklearn)
df_scores_old = pd.read_csv('data/sk_scores.csv')
df_scores = pd.concat([df_scores, df_scores_old], axis=0)

# compute the RMSE from the MSE
df_scores['rmse'] = np.sqrt(df_scores['MSE'])

# measure the change in RMSE and in processing time between consecutive models
df_scores = df_scores.sort_values(by='rmse', ascending=False)
df_scores['dif_rmse'] = df_scores['rmse'].diff()
df_scores['dif_time'] = df_scores['time'].diff()
df_scores.to_csv('data/full_scores.csv', index=False)
df_scores

| | MSE | r2 | model name | time | rmse | dif_rmse | dif_time |
|---|---|---|---|---|---|---|---|
| 2 | 1618442.0 | 0.962031 | Lasso Regression | 0.474396 | 1272.179898 | | |
| 4 | 1211086.0 | 0.968962 | ols | 0.701995 | 1100.493742 | -171.686156 | 0.227598 |
| 1 | 1157179.0 | 0.971413 | Ridge Regression | 3.785377 | 1075.722459 | -24.771283 | 3.083382 |
| 3 | 1156863.0 | 0.971421 | HUbber Regression | 69.701672 | 1075.575433 | -0.147026 | 65.916295 |
| 0 | 1156840.0 | 0.971422 | Linear Regression | 13.188963 | 1075.565067 | -0.010367 | -56.512709 |
| 3 | 565778.1 | 0.987106 | Gradient Boost | 7.981411 | 752.1822 | -323.382866 | -5.207552 |
| 1 | 100597.0 | 0.998437 | knn | 2.22651 | 317.170259 | -435.011942 | -5.754901 |
| 4 | 79166.39 | 0.998595 | AdaBoost | 2.420043 | 281.365229 | -35.805029 | 0.193533 |
| 2 | 79153.6 | 0.998599 | RF | 23.239215 | 281.34249 | -0.022739 | 20.819172 |
| 0 | 78902.53 | 0.998604 | Decision tree | 17.802543 | 280.89595 | -0.44654 | -5.436672 |
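
To read the `dif_rmse` and `dif_time` columns: after sorting by RMSE in descending order, `diff()` subtracts the previous row from the current one, so each value is the change relative to the next-worse model. The sketch below reproduces this on three rows copied from the table above, with values rounded to two decimals.

```python
import pandas as pd

# Three rows taken from the table above, rounded to two decimals
scores = pd.DataFrame({
    'model name': ['knn', 'Gradient Boost', 'AdaBoost'],
    'rmse': [317.17, 752.18, 281.37],
    'time': [2.23, 7.98, 2.42],
})

# Worst model first, best model last
scores = scores.sort_values(by='rmse', ascending=False)

# diff() is the current row minus the previous row; the first row has no previous row
scores['dif_rmse'] = scores['rmse'].diff()
scores['dif_time'] = scores['time'].diff()

print(scores)
```

A negative `dif_rmse` paired with a small or negative `dif_time` marks a model that improves the error at little or no extra training cost.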


## Conclusions

Considering the computation time and the error measured on the test set, we can conclude that:

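- OLS, with proper feature selection, would be chosen over the sklearn linear models: it trains in about 0.7 s, and its RMSE (about 1100) is close to theirs (about 1076). However, it's important to consider the time it takes to manually choose those features.
- Given that the sklearn linear models, with the exception of Lasso, have almost the same scores, Ridge Regression would be chosen among them because it is the fastest of them to train (about 3.8 s) while automatically adjusting the weights of each feature, so we don't need to select them manually.
- Among the ML models, it's important to note that the KNN model trained about 8x faster than the decision tree (2.2 s versus 17.8 s) and about 10x faster than the random forest (23.2 s), while producing a model almost as good (RMSE of about 317 versus 281).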