<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Intro" data-toc-modified-id="Intro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Intro</a></span></li><li><span><a href="#Loading-data-and-libraries" data-toc-modified-id="Loading-data-and-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Loading data and libraries</a></span></li><li><span><a href="#Cleaning-the-data" data-toc-modified-id="Cleaning-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Cleaning the data</a></span></li><li><span><a href="#Test" data-toc-modified-id="Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Test</a></span></li><li><span><a href="#Submit" data-toc-modified-id="Submit-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Submit</a></span></li></ul></div>

## Intro

I'll test a different data-cleaning process to see whether it improves my predictions.  
According to this [website](https://www.gemsociety.org/article/what-determines-diamond-cost/), color and clarity are also ordered grades, so each can be encoded as a single ordinal column, just like cut.

## Loading data and libraries

In [1]:
%config Completer.use_jedi = False

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import KNNImputer

In [35]:
df_train = pd.read_csv("../data/train.csv", index_col = 0)
df_test = pd.read_csv("../data/test.csv", index_col = 0)

## Cleaning the data

In [36]:
df_train.head()

id,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.5,Premium,F,VS2,61.5,58.0,7.32,7.34,4.51,9.588
1,2.01,Very Good,E,SI2,60.6,59.0,8.11,8.25,4.96,9.748
2,0.5,Ideal,E,SI1,61.6,57.0,5.13,5.09,3.15,7.255
3,0.25,Very Good,F,VVS2,61.6,57.0,4.05,4.08,2.5,6.45
4,0.52,Ideal,G,VS2,62.0,55.0,5.16,5.19,3.21,7.721


Converting categorical values to ordinal

Using the table from the website above, I will map each category to a numeric score so that the three columns become ordinal.

In [37]:
for col in ["cut", "color", "clarity"]:
    print(df_train[col].unique())

['Premium' 'Very Good' 'Ideal' 'Good' 'Fair']
['F' 'E' 'G' 'D' 'J' 'I' 'H']
['VS2' 'SI2' 'SI1' 'VVS2' 'VS1' 'VVS1' 'IF' 'I1']


In [38]:
cut_code = {
    "Fair":2,
    "Good":4,
    "Very Good":6,
    "Premium":8,
    "Ideal":10
}

color_code = {
    "D":10,
    "E":9.1,
    "F":8.4,
    "G":7.7,
    "H":6.8,
    "I":5.8,
    "J":5.2
}

clarity_code = {
    "IF":10,
    "VVS1":8.9,
    "VVS2":8.1,
    "VS1":7.4,
    "VS2":6.6,
    "SI1":5.8,
    "SI2":4.9,
    "I1":3.7
}

In [39]:
df_train["cut"] = df_train["cut"].apply(lambda x: cut_code[x])
df_test["cut"] = df_test["cut"].apply(lambda x: cut_code[x])

In [40]:
df_train["color"] = df_train["color"].apply(lambda x: color_code[x])
df_test["color"] = df_test["color"].apply(lambda x: color_code[x])

In [41]:
df_train["clarity"] = df_train["clarity"].apply(lambda x: clarity_code[x])
df_test["clarity"] = df_test["clarity"].apply(lambda x: clarity_code[x])
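
The `apply` calls above raise a bare `KeyError` if a category is missing from its dictionary. A minimal sketch of the same encoding with `Series.map`, assuming a hypothetical helper `encode_ordinal` and a toy frame (neither is part of the original notebook), checks for unmapped categories explicitly:

```python
import pandas as pd

# Same score tables as in the notebook (only two columns shown here)
cut_code = {"Fair": 2, "Good": 4, "Very Good": 6, "Premium": 8, "Ideal": 10}
color_code = {"D": 10, "E": 9.1, "F": 8.4, "G": 7.7, "H": 6.8, "I": 5.8, "J": 5.2}


def encode_ordinal(df, codes):
    """Hypothetical helper: return a copy of df with each column in `codes` mapped to its score."""
    out = df.copy()
    for col, mapping in codes.items():
        # Fail loudly if the data contains a category the table does not cover
        unknown = set(out[col].unique()) - set(mapping)
        if unknown:
            raise ValueError(f"unmapped categories in {col!r}: {sorted(unknown)}")
        out[col] = out[col].map(mapping)
    return out


# Toy rows for illustration only
toy = pd.DataFrame({"cut": ["Premium", "Ideal"], "color": ["F", "E"]})
encoded = encode_ordinal(toy, {"cut": cut_code, "color": color_code})
```

The same call could be applied to both `df_train` and `df_test`, which keeps the two frames encoded with identical tables.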

I will try keeping the `x`, `y` and `z` columns.

In [43]:
df_train[df_train['x'] == 0]

id,carat,cut,color,clarity,depth,table,x,y,z,price
5075,2.25,8,6.8,4.9,62.8,59.0,0.0,0.0,0.0,9.8
7453,1.0,6,6.8,6.6,63.3,53.0,0.0,0.0,0.0,8.545
25591,1.56,10,7.7,6.6,62.2,54.0,0.0,0.0,0.0,9.457
27381,0.71,4,8.4,4.9,64.1,60.0,0.0,0.0,0.0,7.664
36954,1.07,10,8.4,4.9,61.6,56.0,0.0,6.62,0.0,8.508
38134,1.14,2,7.7,7.4,57.5,67.0,0.0,0.0,0.0,8.761
39032,1.2,8,10.0,8.9,62.1,59.0,0.0,0.0,0.0,9.661


Some diamonds have a size of zero in `x`, `y` or `z`, which is physically impossible and will probably hurt the model. We could drop those rows, but we can also treat the zeros as missing values and fill them with a KNN imputer.

In [44]:
for col in ['x', 'y', 'z']:
    # assign back: inplace=True on a column selection may not update df_train in newer pandas
    df_train[col] = df_train[col].replace(0, np.nan)
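
The filter in `In [43]` only looks at `x`, but a zero can also appear in `z` alone. A small sketch, using made-up rows rather than the competition data, counts the zeros in every dimension and converts them all to `NaN` in one step:

```python
import numpy as np
import pandas as pd

dims = ["x", "y", "z"]

# Made-up rows for illustration: one clean, one all-zero, one with only z at zero
toy = pd.DataFrame(
    {"x": [7.32, 0.0, 6.55], "y": [7.34, 0.0, 6.48], "z": [4.51, 0.0, 0.0]}
)

zeros_per_column = (toy[dims] == 0).sum()      # number of zeros in each dimension
rows_with_zero = (toy[dims] == 0).any(axis=1)  # True for rows with at least one zero

# Treat a zero dimension as a missing measurement
toy[dims] = toy[dims].replace(0, np.nan)
```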

In [45]:
df_train[df_train.isna().any(axis=1)]

id,carat,cut,color,clarity,depth,table,x,y,z,price
375,1.0,8,7.7,4.9,59.1,59.0,6.55,6.48,,8.053
5075,2.25,8,6.8,4.9,62.8,59.0,,,,9.8
7453,1.0,6,6.8,6.6,63.3,53.0,,,,8.545
13811,1.01,8,8.4,4.9,59.2,58.0,6.5,6.47,,8.252
18740,2.02,8,6.8,6.6,62.7,53.0,8.02,7.95,,9.81
19989,2.25,8,5.8,5.8,61.3,58.0,8.52,8.42,,9.642
22669,2.2,8,6.8,5.8,61.2,59.0,8.42,8.37,,9.756
24439,1.12,8,7.7,3.7,60.4,59.0,6.71,6.67,,7.776
25591,1.56,10,7.7,6.6,62.2,54.0,,,,9.457
27381,0.71,4,8.4,4.9,64.1,60.0,,,,7.664


In [46]:
knn_imput = KNNImputer(n_neighbors=5)

In [47]:
df_filled = pd.DataFrame(knn_imput.fit_transform(df_train), columns = df_train.columns)

In [50]:
df_test[df_test['z'] == 0]

id,carat,cut,color,clarity,depth,table,x,y,z
2567,2.8,4,7.7,4.9,63.8,58.0,8.9,8.85,0.0
6749,1.01,8,6.8,3.7,58.1,59.0,6.66,6.6,0.0
10628,0.71,4,8.4,4.9,64.1,60.0,0.0,0.0,0.0
11917,1.1,8,7.7,4.9,63.0,59.0,6.5,6.47,0.0


In [51]:
for col in ['x', 'y', 'z']:
    # assign back: inplace=True on a column selection may not update df_test in newer pandas
    df_test[col] = df_test[col].replace(0, np.nan)

In [52]:
df_test = pd.DataFrame(knn_imput.fit_transform(df_test), columns = df_test.columns)
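
The cells above fit the imputer twice: once on the training frame, which still contains `price`, and once on the test frame. A common alternative, sketched here on made-up data and not what this notebook does, is to fit a single `KNNImputer` on the training features only and reuse it on the test set. Passing `index=` keeps the `id` index, which `pd.DataFrame(array, columns=...)` alone would reset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up stand-ins for df_train / df_test (values are illustrative only)
train = pd.DataFrame(
    {"carat": [0.5, 0.52, 1.0, 1.5, 2.0],
     "x": [5.1, 5.2, 0.0, 7.3, 8.1],
     "price": [7.2, 7.7, 8.5, 9.6, 9.7]},
    index=pd.Index([10, 11, 12, 13, 14], name="id"),
)
test = pd.DataFrame(
    {"carat": [0.51, 1.9], "x": [0.0, 8.0]},
    index=pd.Index([20, 21], name="id"),
)

features = ["carat", "x"]  # the target is deliberately left out

train_X = train[features].copy()
test_X = test[features].copy()
train_X["x"] = train_X["x"].replace(0, np.nan)
test_X["x"] = test_X["x"].replace(0, np.nan)

# Fit on the training features, then apply the same fitted imputer to the test set
imputer = KNNImputer(n_neighbors=2)
train_filled = pd.DataFrame(
    imputer.fit_transform(train_X), columns=features, index=train_X.index
)
test_filled = pd.DataFrame(
    imputer.transform(test_X), columns=features, index=test_X.index
)
```

With this layout the test rows are filled from training neighbours, so the imputation does not depend on which other rows happen to be in the test file.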

## Test

Since the random forest gave the best results in the previous iteration, I will just use that model 

In [15]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

In [65]:
X = df_filled.drop("price", axis = 1)
y = df_filled.price

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=10)

In [18]:
# random forest regressor model

# create forest
# max_features=1.0 uses all features, like the old 'auto' option (removed in scikit-learn 1.3)
rf = RandomForestRegressor(n_estimators=100, max_depth=None, max_features=1.0)

# model training
rf.fit(X_train, y_train)

# predictions
y_pred_train = rf.predict(X_train)
y_pred_test = rf.predict(X_test)

# Metrics (y_true first, then y_pred)
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

mse_train, mse_test

(0.00116514859876537, 0.00879665196865035)
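
MSE is in squared target units, so the gap between the two numbers is easier to read as RMSE. The arithmetic, using the two values printed above:

```python
import math

# Values copied from the output of the random forest cell
mse_train, mse_test = 0.00116514859876537, 0.00879665196865035

rmse_train = math.sqrt(mse_train)  # about 0.034
rmse_test = math.sqrt(mse_test)    # about 0.094

# Ratio of held-out error to training error; a large ratio hints at overfitting
ratio = rmse_test / rmse_train
```

On the held-out split the forest is off by roughly 0.094 in the units of `price`, about 2.7 times its training error.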

In [19]:
# define model and params

rf_model = RandomForestRegressor() 

params = {
    "n_estimators":[50, 100, 200], 
    "max_depth":[5, 20, None], 
    "min_samples_split":[2, 4], 
    "min_samples_leaf":[1, 2]
}

In [20]:
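# define score

# MSE is a loss: greater_is_better=False makes GridSearchCV minimize it
# (with the default, the search would select the candidate with the highest MSE)
mse = make_scorer(mean_squared_error, greater_is_better=False)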

In [21]:
# run grid search

clf = GridSearchCV(estimator = rf_model, param_grid = params, scoring = mse, verbose=2)

In [22]:
# fit model

clf.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.3s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.3s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.2s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.2s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   1.2s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.3s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[CV] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.4s
[...]

[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=100; total time=   9.4s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=100; total time=   8.4s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=100; total time=   8.4s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=  16.8s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=  16.9s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=  15.8s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=  16.3s
[CV] END max_depth=20, min_samples_leaf=1, min_samples_split=4, n_estimators=200; total time=  17.6s
[CV] END max_depth=20, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   4.1s
[...]

[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=  14.1s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=  14.3s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=   3.5s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=   3.6s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=   3.5s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=   3.6s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=   3.6s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   7.2s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=4, n_estimators=100; total time=   7.2s
[...]

GridSearchCV(estimator=RandomForestRegressor(),
             param_grid={'max_depth': [5, 20, None], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [2, 4],
                         'n_estimators': [50, 100, 200]},
             scoring=make_scorer(mean_squared_error), verbose=2)

In [23]:
# check best parameters

clf.best_params_

{'max_depth': 5,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 50}

In [24]:
# define best model

best = clf.best_estimator_
best

RandomForestRegressor(max_depth=5, n_estimators=50)

In [25]:
# predict using best model

y_pred_train = best.predict(X_train)
y_pred_test = best.predict(X_test)
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mse_train, mse_test

(0.031692695669965125, 0.03344708564106242)
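
The tuned forest scores worse on the held-out split (0.033) than the untuned one (0.0088), and the selected parameters are the most restrictive in the grid. That is the pattern a search produces when it maximizes an error metric: `make_scorer` defaults to `greater_is_better=True`, so a scorer built from `mean_squared_error` without `greater_is_better=False` ranks the highest-MSE candidate first. A minimal sketch on synthetic data (the data and the decision-tree estimator are stand-ins, not the notebook's) shows the two scorers choosing opposite ends of a grid:

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=300)

grid = {"max_depth": [1, 5]}
tree = DecisionTreeRegressor(random_state=0)

# Default greater_is_better=True: the search keeps the candidate with the HIGHEST MSE
maximizing = GridSearchCV(tree, grid, scoring=make_scorer(mean_squared_error)).fit(X, y)

# Built-in negated scorer: the search keeps the candidate with the LOWEST MSE
minimizing = GridSearchCV(tree, grid, scoring="neg_mean_squared_error").fit(X, y)

picked_by_maximizing = maximizing.best_params_["max_depth"]
picked_by_minimizing = minimizing.best_params_["max_depth"]
```

In this sketch the first search keeps the depth-1 stump and the second keeps the deeper tree. Passing `scoring="neg_mean_squared_error"` avoids having to remember the sign convention.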

I'll try Gradient Boosting

In [57]:
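# start model
# 'squared_error' and max_features=None are the current names of the removed 'mse' and 'auto' options

gb = GradientBoostingRegressor(n_estimators=100, max_depth=None, criterion='squared_error', max_features=None)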

# fit model

gb.fit(X_train, y_train)

GradientBoostingRegressor(criterion='mse', max_depth=None, max_features='auto')

In [58]:
# predict
y_pred_train = gb.predict(X_train)
y_pred_test = gb.predict(X_test)

# Metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

mse_train, mse_test

(0.0011448318972469738, 0.01953153800969154)

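Gradient boosting does not beat the random forest: its test MSE (0.0195) is more than twice the forest's (0.0088). With `max_depth=None` each tree is grown without a depth limit, and the gap between train and test MSE suggests overfitting.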

## Submit

In [60]:
import sys

sys.path.append('../src')

from diamond_comp_functions import diamond_submission

In [66]:
# max_features=1.0 uses all features, like the old 'auto' option (removed in scikit-learn 1.3)
rf = RandomForestRegressor(n_estimators=100, max_depth=None, max_features=1.0)

diamond_submission(rf, X, y, df_test, "rf2_df")

'rf2_df.csv has been created in ../data/'

In [94]:
# max_features=1.0 uses all features, like the old 'auto' option (removed in scikit-learn 1.3)
rf2 = RandomForestRegressor(n_estimators=1000, max_depth=20, max_features=1.0)

diamond_submission(rf2, X, y, df_test, "rf3_df")

'rf3_df.csv has been created in ../data/'

In [95]:
df_filled.head()

,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.5,8.0,8.4,6.6,61.5,58.0,7.32,7.34,4.51,9.588
1,2.01,6.0,9.1,4.9,60.6,59.0,8.11,8.25,4.96,9.748
2,0.5,10.0,9.1,5.8,61.6,57.0,5.13,5.09,3.15,7.255
3,0.25,6.0,8.4,8.1,61.6,57.0,4.05,4.08,2.5,6.45
4,0.52,10.0,7.7,6.6,62.0,55.0,5.16,5.19,3.21,7.721


In [None]:
# max_features=1.0 uses all features, like the old 'auto' option (removed in scikit-learn 1.3)
rf2 = RandomForestRegressor(n_estimators=1000, max_depth=20, max_features=1.0)

# model training
rf2.fit(X_train, y_train)

# predictions
y_pred_train = rf2.predict(X_train)
y_pred_test = rf2.predict(X_test)

# Metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)

mse_train, mse_test