<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

## 0. python imports & setup

for learning purposes, each library is imported in the section where it is first used...

## 1. data loading

In [236]:
import pandas as pd
import numpy as np

* diamonds: labeled data we can use for training and testing
* diamonds_predict: unlabeled diamonds whose prices we predict and upload to Kaggle

In [237]:
diamonds = pd.read_csv('../data/diamonds_train.csv')
diamonds_predict = pd.read_csv('../data/diamonds_predict.csv')

## 2. features modification

In [238]:
# Creating new variables for diamonds

In [239]:
def set_value(row_value, assigned_value):
    # look up the encoded value for a single cell of the column
    return assigned_value[row_value]

# Create the dictionary: colors D-F map to 1, colors G-J map to 0
color_dictionary = {'D': 1, 'E': 1, 'F': 1, 'G': 0, 'H': 0, 'I': 0, 'J': 0}

# Add a new column named 'color_simplifyed'
diamonds['color_simplifyed'] = diamonds['color'].apply(set_value, args=(color_dictionary,))
diamonds_predict['color_simplifyed'] = diamonds_predict['color'].apply(set_value, args=(color_dictionary,))
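
the same lookup can also be written with `Series.map`, which accepts a dictionary directly. the sketch below runs on a made-up `toy` frame (not the real data); keep in mind that, unlike the `set_value` helper, `map` returns `NaN` for values missing from the dictionary instead of raising a `KeyError`:

```python
import pandas as pd

# toy data standing in for the 'color' column (illustrative values only)
toy = pd.DataFrame({'color': ['D', 'G', 'J', 'E']})

# same grouping as color_dictionary: D-F -> 1, G-J -> 0
color_dictionary = {'D': 1, 'E': 1, 'F': 1, 'G': 0, 'H': 0, 'I': 0, 'J': 0}

# Series.map looks every value up in the dictionary
toy['color_simplifyed'] = toy['color'].map(color_dictionary)
print(toy)
```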

In [240]:
# set_value is reused from the previous cell

# Create the dictionary: better cuts get higher numbers
cut_dictionary = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}

# Add a new column named 'cut_numeric'
diamonds['cut_numeric'] = diamonds['cut'].apply(set_value, args=(cut_dictionary,))
diamonds_predict['cut_numeric'] = diamonds_predict['cut'].apply(set_value, args=(cut_dictionary,))

In [241]:
# Add a new column named 'cut_and_carat' 
diamonds['cut_and_carat'] = diamonds['carat']*diamonds['cut_numeric']
diamonds_predict['cut_and_carat'] = diamonds_predict['cut_numeric']*diamonds_predict['carat']

In [242]:
# Add a new column named 'x minus y squared' 
diamonds['x_minus_y_squared'] = (diamonds['x']-diamonds['y'])**2
diamonds_predict['x_minus_y_squared'] = (diamonds_predict['x']-diamonds_predict['y'])**2

In [243]:
# Create a new column called 'roundness': 1 when x and y are nearly equal, 0 otherwise
diamonds['roundness'] = np.where(diamonds.x_minus_y_squared <= 0.012, 1, 0)
diamonds_predict['roundness'] = np.where(diamonds_predict.x_minus_y_squared <= 0.012, 1, 0)
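
every step in this section is applied twice, once per DataFrame. one way to avoid the duplication is to wrap the steps in a function and call it on each frame. the sketch below is only an illustration: `add_features` and the `toy` frame are hypothetical names, and it uses `Series.map` instead of `apply(set_value, ...)`:

```python
import numpy as np
import pandas as pd

COLOR_GROUP = {'D': 1, 'E': 1, 'F': 1, 'G': 0, 'H': 0, 'I': 0, 'J': 0}
CUT_ORDER = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}

def add_features(df):
    """Return a copy of df with the derived columns of this section (hypothetical helper)."""
    out = df.copy()
    out['color_simplifyed'] = out['color'].map(COLOR_GROUP)
    out['cut_numeric'] = out['cut'].map(CUT_ORDER)
    out['cut_and_carat'] = out['carat'] * out['cut_numeric']
    out['x_minus_y_squared'] = (out['x'] - out['y']) ** 2
    # 1 when length and width are nearly equal (same 0.012 threshold as above)
    out['roundness'] = np.where(out['x_minus_y_squared'] <= 0.012, 1, 0)
    return out

# two made-up diamonds, just to exercise the function
toy = pd.DataFrame({'cut': ['Ideal', 'Fair'], 'color': ['D', 'J'],
                    'carat': [0.5, 1.0], 'x': [5.05, 6.0], 'y': [5.09, 6.5]})
toy_features = add_features(toy)
print(toy_features[['cut_and_carat', 'x_minus_y_squared', 'roundness']])
```

with such a helper, both frames would go through exactly the same code path, e.g. `add_features(diamonds)` and `add_features(diamonds_predict)`.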

In [244]:
diamonds['roundness'].value_counts()
# We analyse the distribution of the new "roundness" variable

1    39294
0     1161
Name: roundness, dtype: int64
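
the counts above show that the flag is heavily imbalanced. a quick arithmetic check, using only the two numbers from the output, gives the share of diamonds flagged as round; a column that is almost constant may add little information for the model, although that is an assumption worth testing rather than a conclusion:

```python
# counts copied from the value_counts() output above
round_count = 39294
not_round_count = 1161

total = round_count + not_round_count
round_share = round_count / total
print(f"total rows: {total}")
print(f"share flagged as round: {round_share:.4f}")
```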

In [245]:
diamonds_predict.head().T

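| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 0 | 1 | 2 | 3 | 4 |
| carat | 0.79 | 1.2 | 1.57 | 0.9 | 0.5 |
| cut | Very Good | Ideal | Premium | Very Good | Very Good |
| color | F | J | H | F | F |
| clarity | SI1 | VS1 | SI1 | SI1 | VS1 |
| depth | 62.7 | 61 | 62.2 | 63.8 | 62.9 |
| table | 60 | 57 | 61 | 54 | 58 |
| x | 5.82 | 6.81 | 7.38 | 6.09 | 5.05 |
| y | 5.89 | 6.89 | 7.32 | 6.13 | 5.09 |
| z | 3.67 | 4.18 | 4.57 | 3.9 | 3.19 |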


In [246]:
diamonds.shape

(40455, 15)

In [247]:
diamonds_predict.shape

(13485, 15)

as you can see, there are both categorical and numerical columns...

## 2. eda

this section is up to you! this guided lesson is about a machine learning pipeline...

## 3. ml preprocessing

in this section I will teach how to use scikit-learn's `Pipeline` and `ColumnTransformer`, one of the best practices for composing preprocessing and modeling in a single, elegant object... pay attention, as it can be hard to understand at first...

In [248]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

let's identify numerical and categorical features...

In [249]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   carat              40455 non-null  float64
 1   cut                40455 non-null  object 
 2   color              40455 non-null  object 
 3   clarity            40455 non-null  object 
 4   depth              40455 non-null  float64
 5   table              40455 non-null  float64
 6   price              40455 non-null  int64  
 7   x                  40455 non-null  float64
 8   y                  40455 non-null  float64
 9   z                  40455 non-null  float64
 10  color_simplifyed   40455 non-null  int64  
 11  cut_numeric        40455 non-null  int64  
 12  cut_and_carat      40455 non-null  float64
 13  x_minus_y_squared  40455 non-null  float64
 14  roundness          40455 non-null  int64  
dtypes: float64(8), int64(4), object(3)
memory usage: 4.6+ MB


In [250]:
NUM_FEATS = ['carat', 'depth', 'table', 'x', 'y', 'z',
             'color_simplifyed', 'cut_numeric', 'cut_and_carat',
             'x_minus_y_squared', 'roundness']
CAT_FEATS = ['cut', 'color', 'clarity']
FEATS = NUM_FEATS + CAT_FEATS
TARGET = 'price'

let's define a preprocessing transformer for numerical columns...

In [251]:
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), 
                ('scaler', StandardScaler())])
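
to see what these two steps do, here is a small sketch on a made-up column with one missing value: the imputer fills the gap with the median, and the scaler then centers the column at 0 with unit variance. the `toy_numeric` array is illustrative only:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# one column, three rows, the middle value is missing
toy_numeric = np.array([[1.0], [np.nan], [3.0]])

numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

# the NaN becomes the median (2.0), which is also the mean, so it scales to 0
scaled = numeric_transformer.fit_transform(toy_numeric)
print(scaled.ravel())
```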

let's define a preprocessing transformer for categorical columns...

In [252]:
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])
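
the `handle_unknown='ignore'` option matters at prediction time: a category that was not present during `fit` is encoded as a row of zeros instead of raising an error. a minimal sketch with made-up categories:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([['Ideal'], ['Premium']])  # only two categories are seen during fit

# a known category gets a single 1 in its column
seen = encoder.transform([['Ideal']]).toarray()
# 'Fair' was never seen, so every one-hot column stays at 0
unseen = encoder.transform([['Fair']]).toarray()

print(seen)
print(unseen)
```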

let's join these transformers using a `ColumnTransformer`:

In [253]:
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, NUM_FEATS),
                                               ('cat', categorical_transformer, CAT_FEATS)])

inspecting the full preprocessor:

In [254]:
preprocessor

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['carat', 'depth', 'table', 'x', 'y', 'z',
                                  'color_simplifyed', 'cut_numeric',
                                  'cut_and_carat', 'x_minus_y_squared',
                                  'roundness']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['cut',

what does this preprocessor look like?

In [255]:

type(preprocessor)

sklearn.compose._column_transformer.ColumnTransformer

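at least in this case, the convenience comes at the cost of interpretability: the transformed output no longer carries the original column names, so its columns are identified only by position...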

In [256]:
pd.DataFrame(data=preprocessor.fit_transform(diamonds)).head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.867006 | 0.452019 | 0.247981 | 0.978807 | 0.921985 | 1.022657 | -0.967096 | 0.085178 | 0.897154 | -0.005195 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | -1.004557 | 0.871099 | -0.199745 | -1.226738 | -1.179816 | -1.129259 | -0.967096 | -0.809387 | -1.03939 | -0.00525 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.184434 | 2.617265 | -1.095198 | -0.097286 | -0.176882 | 0.161891 | -0.967096 | -2.598516 | -1.164167 | -0.004689 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.815298 | 1.429872 | -0.647472 | -0.933258 | -0.883296 | -0.770607 | 1.034023 | -1.703951 | -1.109265 | -0.005195 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.467458 | -0.875068 | 0.695707 | 0.729794 | 0.677793 | 0.592274 | -0.967096 | 0.979742 | 1.026923 | -0.005195 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 4. train a simple model

first, let's train a simple model using a holdout (train/test) split...

In [257]:
from sklearn.model_selection import train_test_split

In [258]:
diamonds_train, diamonds_test = train_test_split(diamonds)

In [259]:
print(diamonds_train.shape)
print(diamonds_test.shape)

(30341, 15)
(10114, 15)
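
the shapes above reflect the default behaviour of `train_test_split`, which holds out 25% of the rows for testing. the split is random, so the errors reported later change from run to run unless `random_state` is fixed. a small sketch on a made-up frame (the seed 42 is an arbitrary choice):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 made-up rows, just to look at the split sizes
toy = pd.DataFrame({'carat': range(100), 'price': range(100)})

# test_size defaults to 0.25; a fixed random_state makes the split repeatable
train_a, test_a = train_test_split(toy, random_state=42)
train_b, test_b = train_test_split(toy, random_state=42)

print(train_a.shape, test_a.shape)
print(test_a.index.equals(test_b.index))
```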


let's choose a model from the scikit-learn cheatsheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## list of models

In [260]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import SGDRegressor
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import StackingRegressor




In [261]:
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor())])

In [262]:
model.fit(diamonds_train[FEATS], diamonds_train[TARGET]);

## 5. check model performance on test and train data

In [263]:
from sklearn.metrics import mean_squared_error

In [264]:
y_test = model.predict(diamonds_test[FEATS]).clip(325,18500)
y_train = model.predict(diamonds_train[FEATS]).clip(325,18500)

In [265]:
print(f"test error: {mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)}")
print(f"train error: {mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)}")

test error: 591.3840622474926
train error: 207.53117263031754
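
the errors above are root mean squared errors (RMSE), in the same units as the price. whether `squared=False` is accepted depends on the installed scikit-learn version (as far as I know it is deprecated in newer releases), so a version-independent alternative is to take the square root of the MSE yourself. a minimal sketch with made-up prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# two made-up true prices and predictions
y_true = np.array([300.0, 500.0])
y_hat = np.array([100.0, 500.0])

# MSE is the mean of the squared differences: (200**2 + 0**2) / 2
mse = mean_squared_error(y_true, y_hat)
# RMSE is its square root, back in price units
rmse = np.sqrt(mse)
print(mse, rmse)
```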


In [267]:
test_error = mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)
train_error = mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)
# track model: keep one (train_error, test_error, model) tuple per experiment
# create the list on the first run so the append below does not raise a NameError
if 'listb' not in globals():
    listb = []
mod = input('Algorithm used:')
listb.append((train_error, test_error, mod))
print(listb)
listb_df = pd.DataFrame(listb, columns=["train_error", "test_error", "model"])
listb_df.to_csv("~/Ironhack/ih_datamadpt0420_project_m3/notebooks/Listb.csv", index=False)

Algorithm used: aa

In [268]:
display(listb_df)

## 6. check model performance using cross validation

In [269]:
from sklearn.model_selection import cross_val_score

In [270]:
scores = cross_val_score(model, 
                         diamonds[FEATS], 
                         diamonds[TARGET], 
                         scoring='neg_root_mean_squared_error', 
                         cv=5, n_jobs=-1)

In [271]:
import numpy as np
np.mean(-scores)

556.5538190382399
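
scikit-learn scorers follow a "higher is better" convention, which is why the scoring name starts with `neg_` and the scores are negated before averaging. looking at the spread across folds, not only the mean, can also be informative. the sketch below uses synthetic data and a plain `LinearRegression` as stand-ins for the diamonds data and the pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression problem (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_root_mean_squared_error', cv=5)

# scores are negated RMSE values, one per fold; flip the sign to get errors
rmse_per_fold = -scores
print(f"rmse: {rmse_per_fold.mean():.3f} +/- {rmse_per_fold.std():.3f}")
```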

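## 7. optimize model using grid search

note that the code below uses `RandomizedSearchCV` rather than `GridSearchCV`; the grid has 2 x 3 x 2 = 12 combinations, fewer than `n_iter=15`, so every combination ends up being evaluated (see "12 candidates" in the log), which is equivalent to an exhaustive grid search...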

In [152]:
from sklearn.model_selection import RandomizedSearchCV

In [197]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'regressor__n_estimators': [256,512,1024],
    'regressor__max_depth': [16,32],
}

grid_search = RandomizedSearchCV(model, 
                                 param_grid, 
                                 cv=5, 
                                 verbose=10, 
                                 scoring='neg_root_mean_squared_error', 
                                 n_jobs=-1,
                                 n_iter=15)

grid_search.fit(diamonds[FEATS], diamonds[TARGET])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 14.9min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed: 20.0min
[Parallel(n_jobs=-1)]: Done  52 out of  60 | elapsed: 22.6min remaining:  3.5min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 30.1min finished


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('num',
                                                                               Pipeline(steps=[('imputer',
                                                                                                SimpleImputer(strategy='median')),
                                                                                               ('scaler',
                                                                                                StandardScaler())]),
                                                                               ['carat',
                                                                                'depth',
                                                                                'table',
                                                                                'x',
              

In [198]:
grid_search.best_params_

{'regressor__n_estimators': 1024,
 'regressor__max_depth': 16,
 'preprocessor__num__imputer__strategy': 'mean'}

In [199]:
grid_search.best_score_

-553.142998069181

## 8. prepare submission

In [200]:
y_pred = grid_search.predict(diamonds_predict[FEATS])

In [201]:
submission_df = pd.DataFrame({'id': diamonds_predict['id'], 'price': y_pred})

In [202]:
submission_df.head()

| | id | price |
|---|---|---|
| 0 | 0 | 3189.802255 |
| 1 | 1 | 5468.256945 |
| 2 | 2 | 9344.256461 |
| 3 | 3 | 4372.177056 |
| 4 | 4 | 1732.851949 |


In [203]:
submission_df.describe()

| | id | price |
|---|---|---|
| count | 13485.0 | 13485.0 |
| mean | 6742.0 | 4109.31387 |
| std | 3892.928525 | 4017.72134 |
| min | 0.0 | 374.120835 |
| 25% | 3371.0 | 959.044505 |
| 50% | 6742.0 | 2618.909428 |
| 75% | 10113.0 | 5624.489694 |
| max | 13484.0 | 17670.113049 |


In [206]:
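# assign the clipped values back; in-place clipping through attribute access is not reliable across pandas versions
submission_df['price'] = submission_df['price'].clip(300, 18000)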

In [207]:
submission_df.to_csv('diamonds VI', index=False)
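
before uploading, a few cheap checks can catch a malformed file. the sketch below is only a suggestion: `check_submission` is a hypothetical helper, the expected columns simply mirror `submission_df` above, and the bounds are the same clipping limits used in the previous cell:

```python
import pandas as pd

def check_submission(df, expected_rows, low, high):
    """Hypothetical sanity check, not part of the original notebook."""
    assert list(df.columns) == ['id', 'price'], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df['price'].notna().all(), "missing predictions"
    assert df['price'].between(low, high).all(), "price outside clipping bounds"
    return True

# three made-up predictions, two of them outside the bounds before clipping
toy_submission = pd.DataFrame({'id': [0, 1, 2], 'price': [250.0, 5468.3, 19000.0]})
toy_submission['price'] = toy_submission['price'].clip(300, 18000)

ok = check_submission(toy_submission, expected_rows=3, low=300, high=18000)
print(toy_submission)
```

for the real file, the call would look like `check_submission(submission_df, expected_rows=len(diamonds_predict), low=300, high=18000)`.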

## 9. let's try more models...
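
one possible way to compare several regressors is to loop over them, plug each one into the same pipeline and score it with cross validation. the sketch below is an illustration under simplifying assumptions: the data is synthetic, a plain `StandardScaler` stands in for the `ColumnTransformer` built earlier, and the three candidates are arbitrary picks:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# synthetic regression problem (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

toy_preprocessor = StandardScaler()  # stand-in for the ColumnTransformer
candidates = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'tree': DecisionTreeRegressor(max_depth=4, random_state=0),
}

results = {}
for name, regressor in candidates.items():
    # clone the preprocessor so each pipeline gets its own unfitted copy
    pipeline = Pipeline(steps=[('preprocessor', clone(toy_preprocessor)),
                               ('regressor', regressor)])
    scores = cross_val_score(pipeline, X, y,
                             scoring='neg_root_mean_squared_error', cv=5)
    results[name] = -scores.mean()  # mean RMSE across folds

# lowest error first
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.3f}")
```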

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>