<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

## 0. python imports & setup

for learning purposes, each library is imported in the section where it is first used...

## 1. data loading

In [1]:
#lista=[] #this is just to make sure that the track starts at zero

In [1]:
import pandas as pd

* diamonds: labeled data we can use for training and testing
* diamonds_predict: unlabeled diamonds whose price we predict and upload to Kaggle

In [2]:
diamonds = pd.read_csv('../data/diamonds_train.csv')
diamonds_predict = pd.read_csv('../data/diamonds_predict.csv')

In [3]:
diamonds.head().T

Unnamed: 0,0,1,2,3,4
carat,1.21,0.32,0.71,0.41,1.02
cut,Premium,Very Good,Fair,Good,Ideal
color,J,H,G,D,G
clarity,VS2,VS2,VS1,SI1,SI1
depth,62.4,63,65.5,63.8,60.5
table,58,57,55,56,59
price,4268,505,2686,738,4882
x,6.83,4.35,5.62,4.68,6.55
y,6.79,4.38,5.53,4.72,6.51
z,4.25,2.75,3.65,3,3.95


In [4]:
diamonds.shape

(40455, 10)

In [5]:
diamonds_predict.shape

(13485, 10)

as you can see, there are both categorical and numerical columns...
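one way to list both kinds of columns programmatically is pandas' `select_dtypes`. the sketch below runs on a hand-typed sample of the rows printed above; the names `sample`, `num_cols` and `cat_cols` are illustrative and not part of the notebook:

```python
import pandas as pd

# a few rows copied by hand from the diamonds.head() output above
sample = pd.DataFrame({
    'carat': [1.21, 0.32, 0.71],
    'cut': ['Premium', 'Very Good', 'Fair'],
    'color': ['J', 'H', 'G'],
    'clarity': ['VS2', 'VS2', 'VS1'],
    'depth': [62.4, 63.0, 65.5],
    'price': [4268, 505, 2686],
})

# numeric columns (ints and floats) versus everything else (here: strings)
num_cols = sample.select_dtypes(include='number').columns.tolist()
cat_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(num_cols)
print(cat_cols)
```

on the full dataset the numeric list would also contain the target (`price`), so it still has to be separated from the features by hand.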

## 2. eda

this section is up to you! this guided lesson is about a machine learning pipeline...

## 3. ml preprocessing

in this section I will show how to use scikit-learn's Pipeline and ColumnTransformer, one of the best practices for composing preprocessing and modeling in a single, elegant object... pay attention, as it can be hard to understand at first...

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

let's identify numerical and categorical features...

In [7]:
NUM_FEATS = ['carat', 'depth', 'table', 'x', 'y', 'z']
CAT_FEATS = ['cut', 'color', 'clarity']
FEATS = NUM_FEATS + CAT_FEATS
TARGET = 'price'

let's define a preprocessing transformer for numerical columns...

In [8]:
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), 
                ('scaler', StandardScaler())])
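to see what these two steps do, here is a minimal sketch on a made-up single column with one missing value (the `toy_*` names and the values are assumptions for illustration only):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# one numeric column with a missing value
toy_column = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_numeric = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                              ('scaler', StandardScaler())])

# step 1: the NaN is replaced by the median of the observed values (2.0)
# step 2: the column is shifted and scaled to zero mean and unit variance
toy_scaled = toy_numeric.fit_transform(toy_column)
print(toy_scaled.ravel())
```

the imputed row lands exactly on the column mean, so it becomes 0 after scaling.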

let's define a preprocessing transformer for categorical columns...

In [9]:
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])
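the `handle_unknown='ignore'` option matters at prediction time: a category that was not seen during `fit` is encoded as a row of zeros instead of raising an error. a small sketch, where the `'Unseen'` label and the `toy_*` names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Good']})
toy_new = pd.DataFrame({'cut': ['Ideal', 'Unseen']})  # 'Unseen' never appears in toy_train

toy_encoder = OneHotEncoder(handle_unknown='ignore')
toy_encoder.fit(toy_train)

# learned categories are sorted alphabetically: Good, Ideal, Premium
print(toy_encoder.categories_)

# the encoder returns a sparse matrix by default, so densify it for display
toy_encoded = toy_encoder.transform(toy_new).toarray()
print(toy_encoded)
```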

let's join these transformers using a `ColumnTransformer`:

In [10]:
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, NUM_FEATS),
                                               ('cat', categorical_transformer, CAT_FEATS)])

inspecting the full preprocessor:

In [11]:
preprocessor

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['carat', 'depth', 'table', 'x', 'y', 'z']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['cut', 'color', 'clarity'])])

what does this preprocessor look like?

In [12]:

type(preprocessor)

sklearn.compose._column_transformer.ColumnTransformer

the preprocessor is convenient but, at least in this case, it comes at the cost of interpretability: the transformed DataFrame loses its column names...

In [13]:
pd.DataFrame(data=preprocessor.fit_transform(diamonds)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.867006,0.452019,0.247981,0.978807,0.921985,1.022657,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-1.004557,0.871099,-0.199745,-1.226738,-1.179816,-1.129259,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,-0.184434,2.617265,-1.095198,-0.097286,-0.176882,0.161891,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.815298,1.429872,-0.647472,-0.933258,-0.883296,-0.770607,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.467458,-0.875068,0.695707,0.729794,0.677793,0.592274,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## 4. train a simple model

first, let's train a simple model using a holdout (train/test) split...

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
diamonds_train, diamonds_test = train_test_split(diamonds)

In [16]:
print(diamonds_train.shape)
print(diamonds_test.shape)

(30341, 10)
(10114, 10)
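by default `train_test_split` keeps 25% of the rows for testing, which matches the shapes above, and it shuffles differently on every run. passing `random_state` makes the split reproducible; a sketch on a made-up frame (the seed 42 and the `toy_*` names are arbitrary choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_frame = pd.DataFrame({'x': range(100), 'y': range(100)})

# same seed, same split: useful when comparing models across notebook runs
toy_train_a, toy_test_a = train_test_split(toy_frame, test_size=0.25, random_state=42)
toy_train_b, toy_test_b = train_test_split(toy_frame, test_size=0.25, random_state=42)

print(toy_train_a.shape, toy_test_a.shape)
print(toy_test_a.index.equals(toy_test_b.index))
```

without a fixed seed, the errors tracked for different models come from different splits, so small differences between them are partly noise.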


let's choose a model from scikit-learn cheatsheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## list of models

In [17]:

lista_df = pd.read_csv('Lista.csv')
lista=lista_df.values.tolist()
print(lista)



[[1039.1264129703216, 1017.9747653673261, 'LinearRegression'], [1041.9473566976574, 1020.6962839264064, 'Lasso'], [1050.7921083747556, 1029.5146134725842, 'LassoCV'], [1698.5883613010342, 1713.4517228411326, 'ElasticNet'], [1039.311195051357, 1018.144487308095, 'Ridge'], [1037.4517780255374, 1015.7964185336928, 'SGDRegressor'], [2961.9755209560008, 2991.6295653223133, 'svm.SVR'], [212.2094306786186, 545.1266476680867, 'RandomForestRegressor'], [11.852920625972777, 747.5122929024915, 'tree.DecisionTreeRegressor'], [906.5485731502216, 1146.479012223794, 'NearestCentroid'], [658.4798027731473, 821.4436457025267, 'KNeighborsRegressor'], [592.0271637750809, 614.939156779601, 'MLPRegressor(random_state=1, max_iter=500)'], [552.7309699221058, 582.9056441390277, 'MLPRegressor(random_state=1, max_iter=1000'], [536.2096559405162, 567.8278023481264, 'MLPRegressor(random_state=1, max_iter=2000'], [470.5668053427431, 607.1335438204703, 'RandomForestRegressor(max_depth=12, random_state=0)'], [326.62

In [18]:
lista=[[1039.1264129703216, 1017.9747653673261, 'LinearRegression'], [1041.9473566976574, 1020.6962839264065, 'Lasso'], [1050.7921083747556, 1029.5146134725842, 'LassoCV'], [1698.5883613010342, 1713.4517228411326, 'ElasticNet'], [1039.311195051357, 1018.144487308095, 'Ridge'], [1037.4517780255374, 1015.7964185336928, 'SGDRegressor'], [2961.9755209560008, 2991.629565322313, 'svm.SVR'], [212.2094306786186, 545.1266476680867, 'RandomForestRegressor'], [11.852920625972777, 747.5122929024915, 'tree.DecisionTreeRegressor'], [906.5485731502216, 1146.479012223794, 'NearestCentroid'], [658.4798027731473, 821.4436457025267, 'KNeighborsRegressor']]

In [19]:
# linear models
from sklearn.linear_model import (LinearRegression, Lasso, LassoCV,
                                  ElasticNet, Ridge, SGDRegressor)
# support vector machines and trees (used as svm.SVR, tree.DecisionTreeRegressor)
from sklearn import svm, tree
from sklearn.tree import DecisionTreeRegressor
# neighbors (note: NearestCentroid is a classifier, not a regressor)
from sklearn.neighbors import NearestCentroid, KNeighborsRegressor
# ensembles
from sklearn.ensemble import (RandomForestRegressor, BaggingRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor,
                              ExtraTreesRegressor, StackingRegressor)
# other regressors
from sklearn.neural_network import MLPRegressor
from sklearn.isotonic import IsotonicRegression

In [20]:
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', ExtraTreesRegressor(n_estimators=200, random_state=15))])

In [21]:
model.fit(diamonds_train[FEATS], diamonds_train[TARGET]);

## 5. check model performance on test and train data

In [22]:
from sklearn.metrics import mean_squared_error

In [23]:
y_test = model.predict(diamonds_test[FEATS]).clip(325,18500)
y_train = model.predict(diamonds_train[FEATS]).clip(325,18500)
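`.clip(325, 18500)` forces every prediction into a fixed interval, presumably close to the range of prices seen in training (that interpretation is an assumption, the notebook does not say where the bounds come from). a tiny sketch with made-up predictions:

```python
import numpy as np

# hypothetical raw predictions: one too low, one in range, one too high
raw_predictions = np.array([100.0, 5000.0, 25000.0])

# values below 325 become 325, values above 18500 become 18500
clipped_predictions = raw_predictions.clip(325, 18500)
print(clipped_predictions)
```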

In [24]:
print(f"test error: {mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)}")
print(f"train error: {mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)}")

test error: 551.6719078425618
train error: 12.220368468785724
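the metric printed above is the root mean squared error (RMSE), and the large gap between train and test error suggests the model overfits the training data. depending on the installed scikit-learn version, the `squared` argument may be deprecated or unavailable, so here is a version-independent sketch with numpy (the `rmse` helper and the example values are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# errors are -10, +10 and 0, so the mean squared error is 200 / 3
example_rmse = rmse([100, 200, 300], [110, 190, 300])
print(example_rmse)
```

because RMSE is expressed in the same units as the target, the test error above can be read directly as a typical price error.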


In [25]:
# recompute both errors so they can be stored in the tracking list
test_error = mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)
train_error = mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)
# track model: ask for a label and append (train_error, test_error, label)
mod = input('Algorithm used:')
lista.append((train_error, test_error, mod))
print(lista)
# persist the tracking list so it survives kernel restarts
lista_df = pd.DataFrame(lista, columns=["train_error", "test_error", "model"])
lista_df.to_csv("~/Ironhack/ih_datamadpt0420_project_m3/notebooks/Lista.csv", index=False)

Algorithm used: ExtraTreesRegressor(n_estimators=200, random_state=15)


[[1039.1264129703216, 1017.9747653673261, 'LinearRegression'], [1041.9473566976574, 1020.6962839264065, 'Lasso'], [1050.7921083747556, 1029.5146134725842, 'LassoCV'], [1698.5883613010342, 1713.4517228411326, 'ElasticNet'], [1039.311195051357, 1018.144487308095, 'Ridge'], [1037.4517780255374, 1015.7964185336928, 'SGDRegressor'], [2961.9755209560008, 2991.629565322313, 'svm.SVR'], [212.2094306786186, 545.1266476680867, 'RandomForestRegressor'], [11.852920625972777, 747.5122929024915, 'tree.DecisionTreeRegressor'], [906.5485731502216, 1146.479012223794, 'NearestCentroid'], [658.4798027731473, 821.4436457025267, 'KNeighborsRegressor'], (12.220368468785724, 551.6719078425618, 'ExtraTreesRegressor(n_estimators=200, random_state=15)')]


In [26]:
lista_df=pd.DataFrame(lista, columns=["train_error","test_error","model"])

In [27]:
display(lista_df)

Unnamed: 0,train_error,test_error,model
0,1039.126413,1017.974765,LinearRegression
1,1041.947357,1020.696284,Lasso
2,1050.792108,1029.514613,LassoCV
3,1698.588361,1713.451723,ElasticNet
4,1039.311195,1018.144487,Ridge
5,1037.451778,1015.796419,SGDRegressor
6,2961.975521,2991.629565,svm.SVR
7,212.209431,545.126648,RandomForestRegressor
8,11.852921,747.512293,tree.DecisionTreeRegressor
9,906.548573,1146.479012,NearestCentroid
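the tracking table is easier to read when it is ranked by test error and shows the train/test gap. a sketch using four of the results recorded above (the `results_*` names and the `gap` column are additions for illustration):

```python
import pandas as pd

# (train_error, test_error, model) tuples copied from the tracked list
results = [
    (1039.1264129703216, 1017.9747653673261, 'LinearRegression'),
    (212.2094306786186, 545.1266476680867, 'RandomForestRegressor'),
    (11.852920625972777, 747.5122929024915, 'tree.DecisionTreeRegressor'),
    (12.220368468785724, 551.6719078425618, 'ExtraTreesRegressor(n_estimators=200, random_state=15)'),
]
results_df = pd.DataFrame(results, columns=['train_error', 'test_error', 'model'])

# a large positive gap means the model fits training data far better than unseen data
results_df['gap'] = results_df['test_error'] - results_df['train_error']

# best models (lowest test error) first
results_ranked = results_df.sort_values('test_error').reset_index(drop=True)
print(results_ranked[['model', 'test_error', 'gap']])
```

the tree-based models have the lowest test errors here but also the largest gaps, while the linear model barely overfits and is simply less accurate.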


In [28]:
lista_df.to_csv("~/Ironhack/ih_datamadpt0420_project_m3/notebooks/Lista.csv", index=False)


## 6. check model performance using cross validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(model, 
                         diamonds[FEATS], 
                         diamonds[TARGET], 
                         scoring='neg_root_mean_squared_error', 
                         cv=5, n_jobs=-1)

In [None]:
import numpy as np
np.mean(-scores)
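scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns negative RMSE values and the sign has to be flipped, as in the cell above. a quick sketch on synthetic data, with a plain linear model standing in for the full pipeline (data, seed and `toy_*` names are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

toy_scores = cross_val_score(LinearRegression(), toy_X, toy_y,
                             scoring='neg_root_mean_squared_error', cv=5)

# one (negative) score per fold; negate to get the RMSE of each fold
toy_rmse_per_fold = -toy_scores
print(toy_rmse_per_fold)
print(np.mean(toy_rmse_per_fold), np.std(toy_rmse_per_fold))
```

reporting the standard deviation next to the mean gives a rough idea of how much the estimate depends on the particular split.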

## 7. optimize model using randomized search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'regressor__n_estimators': [16, 32, 64, 128, 256, 512],
    'regressor__max_depth': [2, 4, 8, 16],
}

grid_search = RandomizedSearchCV(model, 
                                 param_grid, 
                                 cv=5, 
                                 verbose=10, 
                                 scoring='neg_root_mean_squared_error', 
                                 n_jobs=-1,
                                 n_iter=32)

grid_search.fit(diamonds[FEATS], diamonds[TARGET])
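the keys of `param_grid` use double underscores to walk down the nested steps: pipeline step, then transformer name, then inner step, then parameter. valid names can be listed with `get_params()`, and `ParameterGrid` counts the combinations. the sketch rebuilds a reduced version of the model (the `toy_*` names are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                ('scaler', StandardScaler())]), ['carat', 'depth']),
    ])),
    ('regressor', ExtraTreesRegressor(n_estimators=200, random_state=15)),
])

# every tunable name, including the nested ones
toy_param_names = sorted(toy_model.get_params().keys())
print([name for name in toy_param_names if name.endswith('strategy')])

toy_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'regressor__n_estimators': [16, 32, 64, 128, 256, 512],
    'regressor__max_depth': [2, 4, 8, 16],
}
# 2 * 6 * 4 = 48 combinations in total
toy_n_combinations = len(ParameterGrid(toy_grid))
print(toy_n_combinations)
```

with `n_iter=32`, the randomized search above therefore samples 32 of the 48 possible combinations instead of trying all of them.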

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

## 8. prepare submission

In [None]:
y_pred = grid_search.predict(diamonds_predict[FEATS])

In [None]:
submission_df = pd.DataFrame({'id': diamonds_predict['id'], 'price': y_pred})

In [None]:
submission_df.head()

In [None]:
submission_df.describe()

In [None]:
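# assign the clipped column back instead of relying on inplace=True on a column accessor
submission_df['price'] = submission_df['price'].clip(0, 20000)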

In [None]:
submission_df.to_csv('diamonds_rf.csv', index=False)

## 9. let's try more models...

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>