# Rolex Listing Price Prediction based on model and complications

## Introduction

The goal of this model is to predict the listing price of Rolex watches from features such as their condition, model number, complications, and dial colour. It should help determine a fair market value for a Rolex watch of interest.

Understanding which features play an important role in the price could benefit potential buyers and sellers, even though the listing price might not be the final transaction price.

In [1]:
import pandas as pd
import numpy as np
import glob
import janitor
import altair as alt
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

## Data Cleaning

In [2]:
files = glob.glob('data/result_df/*.csv')
dirty_df = pd.concat((pd.read_csv(file, index_col=0)
                for file in files)
              )

dirty_df = dirty_df.clean_names()
dirty_df.drop_duplicates(subset=['listing_code', 'reference_number'], inplace=True)
dirty_df.dropna(subset=['brand', 'model', 'listing_code', 'price', 'title', 'subtitle', 'case_diameter'], inplace=True)
dirty_df.reset_index(drop=True, inplace=True)


dirty_df.head()

Unnamed: 0,listing_code,brand,model,reference_number,movement,case_material,bracelet_material,year_of_production,condition,scope_of_delivery,...,thickness,lug_width,buckle_width,frequency,bracelet_thickness,submariner_kermit_ref_,day_date_ref_,datejust_reference_number,submariner_date_reference,reference
0,IJD7R3,Rolex,Datejust 41,126331 NEW UNWORN 2023 Wimbledon 41mm Jubilee,Automatic,Gold/Steel,Gold/Steel,2023,"New\n(Brand new, without any signs of wear)","Original box, original papers",...,,,,,,,,,,
1,HOAOQ5,Rolex,Datejust 31,278271,Automatic,Gold/Steel,Gold/Steel,2023,"New\n(Brand new, without any signs of wear)","Original box, original papers",...,,,,,,,,,,
2,IJ9RY8,Rolex,Datejust 36,126231,Automatic,Gold/Steel,Gold/Steel,2023,"New\n(Brand new, without any signs of wear)","Original box, original papers",...,,,,,,,,,,
3,FDHJM3,Rolex,GMT-Master II,126710BLNR,Automatic,Steel,Steel,2023,"New\n(Brand new, without any signs of wear)","Original box, original papers",...,,,,,,,,,,
4,FFF9D3,Rolex,Explorer,124270,Automatic,Steel,Steel,2021,Very good\n(Worn with little to no signs of wear),"Original box, original papers",...,,,,,,,,,,


In [3]:
# clean case_diameter
def is_convertible_to_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

convertible_mask = dirty_df['case_diameter'].str[:2].apply(is_convertible_to_int)

# take an explicit copy so the column assignment below does not act on a view
dirty_df = dirty_df[convertible_mask].copy()

dirty_df['case_diameter'] = dirty_df['case_diameter'].str[:2].astype('int')


In [4]:
# add column of whether the price is negotiable
dirty_df.insert(loc=13, column='is_negotiable', value=dirty_df['price'].str.contains('Negotiable', case=False).astype(int))

In [5]:
# keep only CA$ in the `price` column
# (raw string so that `\$` is read as a literal dollar sign in the regex)
dirty_df['price'] = dirty_df['price'].str.extract(r'C\$([0-9,]+)')[0].str.replace(',', '')
dirty_df['price'] = pd.to_numeric(dirty_df['price'], errors='coerce')
dirty_df['price'] = dirty_df['price'].fillna(0).astype(int)

# drop listings without a CA$ price
dirty_df = dirty_df.query('price != 0')
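
To make the price-cleaning step concrete, here is a minimal sketch on made-up listing strings. The exact raw format of the `price` column is an assumption; only the `C$` prefix and the word "Negotiable" are taken from the cells above.

```python
import pandas as pd

# Hypothetical raw strings; the real `price` column may be formatted differently.
raw = pd.Series(["C$23,421", "C$9,800 [Negotiable]", "Price on request"])

# Flag listings whose price text mentions "Negotiable" (case-insensitive).
is_negotiable = raw.str.contains("Negotiable", case=False).astype(int)

# Capture the digits and thousands separators after "C$", then strip the commas.
amount = raw.str.extract(r"C\$([0-9,]+)")[0].str.replace(",", "")
price = pd.to_numeric(amount, errors="coerce")

parsed = pd.DataFrame({"raw": raw, "price": price, "is_negotiable": is_negotiable})

# Rows without a C$ amount are NaN after coercion and are dropped here.
parsed = parsed.dropna(subset=["price"]).astype({"price": int})
print(parsed)
```

Dropping the unparsed rows directly is equivalent to filling them with 0 and then filtering on `price != 0`, as the notebook does.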


In [6]:
# add column of whether the year of production is approximated
dirty_df.insert(loc=8, column='year_is_approximated', value=dirty_df['year_of_production'].str.contains('Approximation', case=False).astype(int))

# Clean year of production
dirty_df['year_of_production'] = dirty_df['year_of_production'].apply(lambda x: x[:4] if x != 'Unknown' else x)

In [7]:
# simplify the location to country only
dirty_df['country'] = dirty_df['location'].str.split(',').str[0]
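
As a small illustration of the country extraction, the sketch below assumes that `location` follows a "Country, City" pattern. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical location strings, assumed to follow a "Country, City" pattern.
location = pd.Series(["United States of America, New York", "Japan, Tokyo", "Canada"])

# Keep the text before the first comma and trim stray whitespace.
country = location.str.split(",").str[0].str.strip()
print(country.tolist())
```

Entries without a comma are kept unchanged, so a location that only names a country still maps to that country.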

Save the cleaned data locally

In [8]:
rolex_df = dirty_df
rolex_df.to_csv('data/rolex_df.csv')

## EDA

In [9]:
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split

In [10]:
rolex_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 62495 entries, 0 to 66279
Data columns (total 51 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   listing_code                          62495 non-null  object 
 1   brand                                 62495 non-null  object 
 2   model                                 62495 non-null  object 
 3   reference_number                      61846 non-null  object 
 4   movement                              61443 non-null  object 
 5   case_material                         60862 non-null  object 
 6   bracelet_material                     56783 non-null  object 
 7   year_of_production                    62495 non-null  object 
 8   year_is_approximated                  62495 non-null  int32  
 9   condition                             61537 non-null  object 
 10  scope_of_delivery                     62495 non-null  object 
 11  gender              

In [11]:
df = rolex_df[['model', 'movement', 'case_material', 'bracelet_material',
               'year_of_production', 'year_is_approximated', 'condition', 'scope_of_delivery',
               'country', 'availability', 'case_diameter', 'bezel_material',
               'crystal', 'dial', 'bracelet_color', 'clasp', 'clasp_material',
               'rating', 'reviews', 'price', 'is_negotiable']]
df.head(1)

Unnamed: 0,model,movement,case_material,bracelet_material,year_of_production,year_is_approximated,condition,scope_of_delivery,country,availability,...,bezel_material,crystal,dial,bracelet_color,clasp,clasp_material,rating,reviews,price,is_negotiable
0,Datejust 41,Automatic,Gold/Steel,Gold/Steel,2023,0,"New\n(Brand new, without any signs of wear)","Original box, original papers",United States of America,Item is in stock,...,Rose gold,Sapphire crystal,Silver,Gold/Steel,Fold clasp,Gold/Steel,4.2,11,23421,1


In [12]:
df.shape

(62495, 21)

We use only the columns selected above, since they have fewer missing values and vary even within the same model. Features that are unrelated to the watch model, such as `condition` and `scope_of_delivery`, are especially interesting because they show how these factors feed into the listing price.
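
One way to check which columns have few missing values and enough variation is sketched below. The toy frame stands in for `rolex_df`, and its values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for `rolex_df`; the values are invented for illustration.
toy = pd.DataFrame({
    "condition": ["New", "Good", None, "Fair"],
    "bezel_material": [None, "Steel", None, None],
    "price": [23421, 9800, 15000, 7200],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Number of distinct non-missing values per column, a rough proxy for variation.
n_unique = toy.nunique()

print(pd.DataFrame({"missing_share": missing_share, "n_unique": n_unique}))
```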

In [13]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
print(train_df.shape)
print(test_df.shape)

(43746, 21)
(18749, 21)


In [61]:
X_train, y_train = train_df.drop(
    columns=["price"]), train_df["price"]
y_train = pd.DataFrame(y_train)
X_test, y_test = test_df.drop(
    columns=["price"]), test_df["price"]
y_test = pd.DataFrame(y_test)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(43746, 20)
(43746, 1)
(18749, 20)
(18749, 1)


In [15]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43746 entries, 63205 to 56465
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 43746 non-null  object 
 1   movement              43021 non-null  object 
 2   case_material         42626 non-null  object 
 3   bracelet_material     39833 non-null  object 
 4   year_of_production    43746 non-null  object 
 5   year_is_approximated  43746 non-null  int32  
 6   condition             43093 non-null  object 
 7   scope_of_delivery     43746 non-null  object 
 8   country               43746 non-null  object 
 9   availability          43746 non-null  object 
 10  case_diameter         43746 non-null  int32  
 11  bezel_material        32235 non-null  object 
 12  crystal               35703 non-null  object 
 13  dial                  40357 non-null  object 
 14  bracelet_color        33746 non-null  object 
 15  clasp               

In [16]:
plot_columns = X_train.columns.to_list()

for column in plot_columns:
    top_categories = X_train[column].value_counts().head(10).index
    filtered_X_train = X_train[X_train[column].isin(top_categories)]

    chart = alt.Chart(filtered_X_train).mark_bar().encode(
        y=alt.Y(f"{column}:N", sort='-x'),
        x=alt.X('count()', title='Count')
    ).properties(
        title=f"Top 10 Categories in {column}"
    )
    
    chart.display()

In [17]:
y_train.describe(percentiles=[.25, .5, .75, 0.975]).apply(lambda s: s.apply('{0:.0f}'.format))

Unnamed: 0,price
count,43746
mean,31721
std,43769
min,198
25%,13105
50%,20724
75%,33910
97.5%,119718
max,1506426


In [18]:
alt.Chart(y_train.query('price <= 120000'),
          title='Histogram of Rolex price').mark_bar().encode(
    alt.X('price:Q').bin(maxbins=40),
    y='count()'
)

The histogram above covers at least 97.5% of the listings (the 97.5th percentile of price is CA$119,718). The remaining high-priced outliers make the distribution hard to read, so they are excluded from this visualization only.
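
The percentile-based cutoff can be sketched as follows. The prices here are synthetic log-normal draws, not the real listings, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np

# Synthetic right-skewed prices (log-normal); these are not the real listing prices.
rng = np.random.default_rng(123)
prices = rng.lognormal(mean=10.0, sigma=0.8, size=10_000)

# Cut the plotting range at the 97.5th percentile, as in the histogram above.
cutoff = np.quantile(prices, 0.975)
kept = prices[prices <= cutoff]

print(f"cutoff: {cutoff:,.0f}")
print(f"share of listings kept: {kept.size / prices.size:.3f}")
```

The cutoff only affects what is plotted; the outliers stay in the training data.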

In [19]:
train_df.corr(numeric_only=True).round(
    decimals=3).style.background_gradient()

Unnamed: 0,year_is_approximated,case_diameter,rating,reviews,price,is_negotiable
year_is_approximated,1.0,-0.075,0.017,0.263,-0.026,0.07
case_diameter,-0.075,1.0,0.015,-0.086,0.222,0.045
rating,0.017,0.015,1.0,0.098,-0.005,0.041
reviews,0.263,-0.086,0.098,1.0,-0.058,-0.115
price,-0.026,0.222,-0.005,-0.058,1.0,0.02
is_negotiable,0.07,0.045,0.041,-0.115,0.02,1.0


Price is weakly positively correlated with case diameter (r = 0.222). One plausible explanation is that larger models tend to carry more complications, which drive up the price.
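
Because the price distribution has a long right tail, a rank-based correlation can be a useful cross-check on the Pearson values. The sketch below uses invented numbers and computes the Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Invented values; one extreme price mimics the long right tail in the real data.
toy = pd.DataFrame({
    "case_diameter": [31, 34, 36, 40, 41, 44],
    "price": [9_000, 11_000, 14_000, 21_000, 25_000, 900_000],
})

# Pearson correlation (the default) is sensitive to the single extreme price.
pearson = toy["case_diameter"].corr(toy["price"])

# Spearman correlation is the Pearson correlation of the ranks.
spearman = toy["case_diameter"].rank().corr(toy["price"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```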

## Models

### Preprocessing

In [20]:
# imports
import sys, os
import time

import matplotlib.pyplot as plt

%matplotlib inline
import numpy as np
import pandas as pd
import altair as alt
from IPython.display import HTML

sys.path.append(os.path.join(os.path.abspath("."), "code"))

from IPython.display import display

# Classifiers and regressors
from sklearn.dummy import DummyClassifier, DummyRegressor

# Preprocessing and pipeline
from sklearn.impute import SimpleImputer

# train test split and cross validation
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_selection import SelectFromModel
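# explicit imports instead of wildcard imports, so each name's origin is clear
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVR
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV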

In [21]:
# adapted from 571 lecture notes
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, n_jobs=-1, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    # Format each metric as "mean (+/- std)"
    out_col = [
        f"{mean:0.3f} (+/- {std:0.3f})"
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

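One-hot encoding is applied to the categorical features and scaling to the numerical features. The preprocessor with the scaler is used only for distance-based models, which are sensitive to the units of the features. For models that are robust to unscaled data, feature importance is easier to interpret without scaling. Note that `preprocessor`, as defined below, lists only the categorical columns, so the numerical columns are dropped under the column transformer's default `remainder='drop'`.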

In [22]:
numerical_feats = ['case_diameter', 'rating', 'reviews']
categorical_feats = [col for col in X_train.columns if col not in numerical_feats]

categorical_pipe = make_pipeline(OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
numerical_pipe = make_pipeline(StandardScaler(), SimpleImputer(strategy='median'))

preprocessor = make_column_transformer((categorical_pipe, categorical_feats))
preprocessor_with_scaler = make_column_transformer((categorical_pipe, categorical_feats),
                                                    (numerical_pipe, numerical_feats))
preprocessor

### Model Fitting

In [23]:
# create a dictionary for storing model scores
results_dict = {}

#### Baseline - Simple Linear Regression

In [67]:
linear_reg = make_pipeline(preprocessor,
                           LinearRegression())
results_dict["linear regression"] = mean_std_cross_val_scores(
    linear_reg, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.922 (+/- 0.014),0.062 (+/- 0.011),0.437 (+/- 0.041),0.446 (+/- 0.011)
lightgbm optimized,29.872 (+/- 1.887),0.707 (+/- 0.361),0.623 (+/- 0.062),0.946 (+/- 0.003)
lightgbm,1.265 (+/- 0.068),0.108 (+/- 0.012),0.584 (+/- 0.069),0.698 (+/- 0.016)


##### Feature Importance of Linear Model

In [77]:
linear_reg.fit(X_train, y_train)

In [96]:
transformed_cols = list(preprocessor.named_transformers_["pipeline"].named_steps["onehotencoder"].get_feature_names_out())

lr_coefs = pd.DataFrame(
    data=linear_reg.named_steps["linearregression"].coef_.T,
    index=transformed_cols,
    columns=["Coefficient"]
)
top_20 = lr_coefs.sort_values('Coefficient', ascending=False).head(20).round(0)
bottom_20 = lr_coefs.sort_values('Coefficient', ascending=True).head(20).round(0)
condition_importance = lr_coefs.filter(like='condition', axis=0).sort_values('Coefficient', ascending=False).round(0)
scope_of_delivery_importance = lr_coefs.filter(like='scope_of_delivery', axis=0).sort_values('Coefficient', ascending=False).round(0)

In [91]:
lr_coefs

Unnamed: 0,Coefficient
model_1908,20459.841072
model_Air King,4803.517748
model_Air King Date,4166.717149
model_Cellini,-81734.780940
model_Cellini Danaos,-71477.730551
...,...
clasp_material_Titanium,8884.846234
clasp_material_White Gold,4435.060141
clasp_material_Yellow gold,-1803.980679
clasp_material_nan,-6965.723652


In [89]:
top_20

Unnamed: 0,Coefficient
model_Padellone,152255.0
country_Venezuela,110914.0
bezel_material_Carbon,95716.0
year_of_production_1958,78735.0
bracelet_color_Pink,77024.0
year_of_production_1941,75403.0
dial_Transparent,65395.0
case_material_Platinum,57789.0
model_Daytona,50641.0
movement_Manual winding,44498.0


Most of the features that drive the price up are reasonable. The sought-after Padellone is one of the few Rolex models with a moonphase complication. A carbon bezel, pink bracelet, transparent dial, or platinum case is a rare feature on a Rolex watch. Many popular models, such as the Daytona and Day-Date, are on the list as well.

It is interesting that sellers in Venezuela ask for higher prices than sellers in other countries.

In [90]:
bottom_20

Unnamed: 0,Coefficient
model_Orchid,-88905.0
model_Precision,-83725.0
model_Cellini,-81735.0
model_Oysterdate Precision,-75676.0
model_Cellini Prince,-74098.0
model_Oyster Precision,-73705.0
model_Cellini Danaos,-71478.0
model_Oyster,-71057.0
year_of_production_1929,-69943.0
model_Prince,-66646.0


Older models, such as those produced in the early 20th century, are likely to be listed at lower prices.

In [97]:
condition_importance

Unnamed: 0,Coefficient
"condition_New\n(Brand new, without any signs of wear)",9442.0
"condition_Unworn\n(Mint condition, without signs of wear)",8613.0
"condition_Incomplete\n(Components missing, non-functional)",2388.0
condition_Very good\n(Worn with little to no signs of wear),1343.0
condition_nan,381.0
condition_Good\n(Light signs of wear or scratches),209.0
condition_Fair\n(Obvious signs of wear or scratches),-2461.0
condition_Poor\n(Heavy signs of wear or scratches),-19915.0


In [98]:
scope_of_delivery_importance

Unnamed: 0,Coefficient
"scope_of_delivery_Original box, original papers",2669.0
"scope_of_delivery_Original papers, no original box",1721.0
"scope_of_delivery_Original box, no original papers",-2087.0
"scope_of_delivery_No original box, no original papers",-2303.0


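In the linear model, a listing that includes both the original box and the papers is associated with a listing price about CA$2,669 higher, while a listing with neither is associated with a price about CA$2,303 lower, with all other features held fixed.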
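
Since every `scope_of_delivery` level gets its own coefficient, the levels are easiest to compare as differences from a common reference level. The sketch below uses the coefficients from the table above and picks "no box, no papers" as the reference; that choice is an assumption made for illustration.

```python
# Coefficients copied from the scope_of_delivery table above (CA$).
coefs = {
    "Original box, original papers": 2669.0,
    "Original papers, no original box": 1721.0,
    "Original box, no original papers": -2087.0,
    "No original box, no original papers": -2303.0,
}

# Use the "no box, no papers" level as the reference category.
baseline = coefs["No original box, no original papers"]
premium = {level: coef - baseline for level, coef in coefs.items()}

for level, value in sorted(premium.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{level}: {value:+,.0f}")
```

Read this way, the papers account for most of the difference: box and papers together sit about CA$4,972 above the reference, while the box alone sits about CA$216 above it.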

#### Regularized Linear Regression Models: Ridge, Lasso, and Elastic Net

In [38]:
ridge = make_pipeline(preprocessor,
                      Ridge())
results_dict["ridge"] = mean_std_cross_val_scores(
    ridge, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)
lightgbm,1.158 (+/- 0.031),0.106 (+/- 0.008),0.584 (+/- 0.069),0.698 (+/- 0.016)
knn,0.337 (+/- 0.016),93.221 (+/- 0.577),0.388 (+/- 0.058),0.619 (+/- 0.013)
SVR,7.193 (+/- 0.141),0.074 (+/- 0.012),0.130 (+/- 0.021),0.128 (+/- 0.007)
ridge,0.566 (+/- 0.046),0.062 (+/- 0.005),0.438 (+/- 0.043),0.446 (+/- 0.011)


In [39]:
lasso = make_pipeline(preprocessor,
                      Lasso())
results_dict["lasso"] = mean_std_cross_val_scores(
    lasso, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)
lightgbm,1.158 (+/- 0.031),0.106 (+/- 0.008),0.584 (+/- 0.069),0.698 (+/- 0.016)
knn,0.337 (+/- 0.016),93.221 (+/- 0.577),0.388 (+/- 0.058),0.619 (+/- 0.013)
SVR,7.193 (+/- 0.141),0.074 (+/- 0.012),0.130 (+/- 0.021),0.128 (+/- 0.007)
ridge,0.566 (+/- 0.046),0.062 (+/- 0.005),0.438 (+/- 0.043),0.446 (+/- 0.011)
lasso,14.934 (+/- 0.919),0.060 (+/- 0.007),0.438 (+/- 0.043),0.446 (+/- 0.011)


In [40]:
elasticnet = make_pipeline(preprocessor,
                           ElasticNet())
results_dict["elastic net"] = mean_std_cross_val_scores(
    elasticnet, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)
lightgbm,1.158 (+/- 0.031),0.106 (+/- 0.008),0.584 (+/- 0.069),0.698 (+/- 0.016)
knn,0.337 (+/- 0.016),93.221 (+/- 0.577),0.388 (+/- 0.058),0.619 (+/- 0.013)
SVR,7.193 (+/- 0.141),0.074 (+/- 0.012),0.130 (+/- 0.021),0.128 (+/- 0.007)
ridge,0.566 (+/- 0.046),0.062 (+/- 0.005),0.438 (+/- 0.043),0.446 (+/- 0.011)
lasso,14.934 (+/- 0.919),0.060 (+/- 0.007),0.438 (+/- 0.043),0.446 (+/- 0.011)
elastic net,0.432 (+/- 0.037),0.054 (+/- 0.010),0.218 (+/- 0.034),0.214 (+/- 0.008)


#### Tree-based Models

In [26]:
dt = make_pipeline(preprocessor, DecisionTreeRegressor())
results_dict["decision tree"] = mean_std_cross_val_scores(
    dt, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)


In [27]:
# rf = make_pipeline(preprocessor, RandomForestRegressor(random_state=123))
# results_dict["random forest"] = mean_std_cross_val_scores(
#     rf, X_train, y_train, return_train_score=True
# )
# pd.DataFrame(results_dict).T

In [29]:
lightgbm = make_pipeline(preprocessor, LGBMRegressor(random_state=123))
results_dict["lightgbm"] = mean_std_cross_val_scores(
    lightgbm, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,1.092 (+/- 0.061),0.110 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
lightgbm optimized,30.260 (+/- 1.535),0.710 (+/- 0.285),0.623 (+/- 0.062),0.946 (+/- 0.003)
lightgbm,1.265 (+/- 0.068),0.108 (+/- 0.012),0.584 (+/- 0.069),0.698 (+/- 0.016)


#### Distance-based Models

In [34]:
knn = make_pipeline(preprocessor_with_scaler, KNeighborsRegressor())
results_dict["knn"] = mean_std_cross_val_scores(
    knn, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)
lightgbm,1.158 (+/- 0.031),0.106 (+/- 0.008),0.584 (+/- 0.069),0.698 (+/- 0.016)
knn,0.337 (+/- 0.016),93.221 (+/- 0.577),0.388 (+/- 0.058),0.619 (+/- 0.013)


In [37]:
svr = make_pipeline(preprocessor_with_scaler, LinearSVR())
results_dict["SVR"] = mean_std_cross_val_scores(
    svr, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

CPU times: total: 125 ms
Wall time: 8.91 s


Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,0.847 (+/- 0.015),0.068 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
decision tree,5.077 (+/- 0.059),0.070 (+/- 0.007),0.210 (+/- 0.271),0.993 (+/- 0.001)
lightgbm,1.158 (+/- 0.031),0.106 (+/- 0.008),0.584 (+/- 0.069),0.698 (+/- 0.016)
knn,0.337 (+/- 0.016),93.221 (+/- 0.577),0.388 (+/- 0.058),0.619 (+/- 0.013)
SVR,7.193 (+/- 0.141),0.074 (+/- 0.012),0.130 (+/- 0.021),0.128 (+/- 0.007)


The gradient-boosted tree model (LightGBM) outperforms the other models, with the highest mean cross-validation score (0.584) and a fit time of about 1.2 seconds per fold.
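
Because the helper stores each metric as a "mean (+/- std)" string, the table cannot be sorted numerically as is. A minimal sketch for recovering the means and ranking the models, using the values from the table above:

```python
import pandas as pd

# Cross-validation summaries copied from the results table above.
results = pd.DataFrame(
    {
        "test_score": ["0.437 (+/- 0.041)", "0.210 (+/- 0.271)", "0.584 (+/- 0.069)",
                       "0.388 (+/- 0.058)", "0.130 (+/- 0.021)"],
        "fit_time": ["0.847 (+/- 0.015)", "5.077 (+/- 0.059)", "1.158 (+/- 0.031)",
                     "0.337 (+/- 0.016)", "7.193 (+/- 0.141)"],
    },
    index=["linear regression", "decision tree", "lightgbm", "knn", "SVR"],
)

# Keep the leading mean of each "mean (+/- std)" string as a float.
means = results.apply(lambda col: col.str.split(" ").str[0].astype(float))

print(means.sort_values("test_score", ascending=False))
```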

### Hyperparameter Optimization

In [86]:
param_grid = {
    "lgbmregressor__num_leaves": np.arange(100, 2001, 100),
    "lgbmregressor__learning_rate": np.arange(0.0001, 0.011, 0.001),
    "lgbmregressor__n_estimators": np.arange(100, 2001, 100)
}
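
For a sense of how much of this grid a randomized search explores, the sketch below counts the combinations in `param_grid` and compares that with the `n_iter=100` budget used in the search.

```python
import numpy as np

# Same value ranges as in `param_grid` above.
grid = {
    "num_leaves": np.arange(100, 2001, 100),
    "learning_rate": np.arange(0.0001, 0.011, 0.001),
    "n_estimators": np.arange(100, 2001, 100),
}

# Total number of distinct combinations in the grid.
n_combinations = int(np.prod([len(values) for values in grid.values()]))

# Sampling budget, matching the `n_iter` passed to RandomizedSearchCV.
n_iter = 100
print(f"{n_combinations} combinations, {n_iter / n_combinations:.1%} sampled")
```

With only a small share of the grid sampled, the best combination found is a reasonable candidate, not a guaranteed optimum.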

In [87]:
random_search = RandomizedSearchCV(
    lightgbm,
    param_distributions=param_grid,
    n_iter=100,
    n_jobs=-1,
    return_train_score=True,
    random_state=123
)
random_search.fit(X_train, y_train)

In [None]:
pd.DataFrame(random_search.cv_results_)[
    [
        "mean_test_score",
        "param_lgbmregressor__num_leaves",
        "param_lgbmregressor__learning_rate",
        "param_lgbmregressor__n_estimators",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().T

rank_test_score,1,2,3,4,5,6,7,8,9,10,...,71,72,73,74,75,76,76,78,79,80
mean_test_score,0.622546,0.622301,0.622066,0.62203,0.621959,0.619061,0.618904,0.614686,0.610038,0.606608,...,0.457314,0.453748,0.453719,0.44815,0.448057,0.447063,0.447063,0.447054,0.433596,0.433596
param_lgbmregressor__num_leaves,100.0,100.0,100.0,50.0,50.0,100.0,31.0,50.0,31.0,31.0,...,100.0,100.0,100.0,50.0,50.0,50.0,50.0,50.0,31.0,31.0
param_lgbmregressor__max_depth,-1.0,-1.0,20.0,-1.0,20.0,20.0,20.0,20.0,20.0,-1.0,...,10.0,10.0,10.0,10.0,20.0,-1.0,20.0,20.0,20.0,20.0
param_lgbmregressor__learning_rate,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
param_lgbmregressor__n_estimators,1000.0,500.0,1000.0,1000.0,1000.0,1000.0,1000.0,500.0,500.0,500.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
param_lgbmregressor__reg_alpha,0.0,1.0,0.1,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,0.1,0.1,0.1,1.0,1.0,0.0,0.0,0.1
param_lgbmregressor__reg_lambda,0.0,0.0,0.0,1.0,0.0,0.1,1.0,0.0,0.0,1.0,...,0.1,1.0,1.0,0.1,0.0,1.0,1.0,1.0,0.1,0.1
mean_fit_time,112.294243,76.651764,120.854339,60.619572,65.545324,98.027054,27.66398,17.188546,25.828404,17.513388,...,10.434995,17.123429,11.702029,6.806891,5.462149,5.574002,8.913583,7.700104,4.841807,6.64758


In [None]:
random_search.best_params_

{'lgbmregressor__reg_lambda': 0,
 'lgbmregressor__reg_alpha': 0,
 'lgbmregressor__num_leaves': 100,
 'lgbmregressor__n_estimators': 1000,
 'lgbmregressor__max_depth': -1,
 'lgbmregressor__learning_rate': 0.1}

In [27]:
opt = LGBMRegressor(num_leaves=100, n_estimators=1000, learning_rate=0.1, random_state=123)

In [30]:
lightgbm_opt = make_pipeline(preprocessor, opt) # random_search.best_estimator_
results_dict["lightgbm optimized"] = mean_std_cross_val_scores(
    lightgbm_opt, X_train, y_train, return_train_score=True
)
pd.DataFrame(results_dict).T

Unnamed: 0,fit_time,score_time,test_score,train_score
linear regression,1.092 (+/- 0.061),0.110 (+/- 0.006),0.437 (+/- 0.041),0.446 (+/- 0.011)
lightgbm optimized,29.872 (+/- 1.887),0.707 (+/- 0.361),0.623 (+/- 0.062),0.946 (+/- 0.003)
lightgbm,1.265 (+/- 0.068),0.108 (+/- 0.012),0.584 (+/- 0.069),0.698 (+/- 0.016)


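The mean cross-validation $R^2$ score of the optimized LightGBM model improved slightly, from 0.584 to 0.623. Its training score rose from 0.698 to 0.946 at the same time, so the tuned model fits the training data much more closely than it generalizes.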
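
A quick way to compare the two models is the gap between the training and validation scores. The sketch below uses the mean scores from the results table above.

```python
# Mean cross-validation R^2 scores copied from the results table above.
scores = {
    "lightgbm": {"train": 0.698, "validation": 0.584},
    "lightgbm optimized": {"train": 0.946, "validation": 0.623},
}

# Gap between training and validation score; a larger gap suggests more overfitting.
gaps = {name: round(s["train"] - s["validation"], 3) for name, s in scores.items()}

for name, gap in gaps.items():
    validation = scores[name]["validation"]
    print(f"{name}: validation R^2 = {validation:.3f}, train-validation gap = {gap:.3f}")
```

The tuned model gains about 0.04 in validation score while its gap roughly triples, which may justify trying stronger regularization or fewer leaves.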

In [32]:
lightgbm_opt.fit(X_train, y_train)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001553 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 538
[LightGBM] [Info] Number of data points in the train set: 43746, number of used features: 269
[LightGBM] [Info] Start training from score 31720.866868


### Predict on the test set

In [64]:
# score on the test set
print(f'{lightgbm_opt.score(X_test, y_test):.4f}')



0.6280


The test-set $R^2$ of 0.628 is about the same as the cross-validation score of 0.623, so the model appears to generalize to unseen listings about as well as cross-validation suggested.
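
For reference, the `score` method of a scikit-learn regressor pipeline returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on invented numbers; the helper name `r2_score_manual` is hypothetical.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Invented prices and predictions, for illustration only.
y_true = [10_000, 20_000, 30_000, 40_000]
y_pred = [12_000, 18_000, 33_000, 37_000]

print(round(r2_score_manual(y_true, y_pred), 3))
```

A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean price.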

In [65]:
y_test['prediction'] = lightgbm_opt.predict(X_test)
y_test



Unnamed: 0,price,prediction
26312,81660,67907.696927
46447,23558,20806.684403
18438,11972,10908.011649
33352,19316,17991.062601
45125,60300,104736.238830
...,...,...
10655,67487,65760.148765
61562,48189,42911.663052
21716,82408,89852.704118
62068,157501,141051.660274


In [66]:
prediction_plot = alt.Chart(y_test, title='Actual Listing Price and Prediction').mark_point().encode(
    alt.X('price').title('Actual Price'),
    alt.Y('prediction').title('Predicted Price')
)

min_price = min(y_test['price'].min(), y_test['prediction'].min())
max_price = max(y_test['price'].max(), y_test['prediction'].max())

# Create a DataFrame for the 45-degree line
line_data = pd.DataFrame({
    'price': [min_price, max_price],
    'prediction': [min_price, max_price]
})

# Create the 45-degree line chart
line_chart = alt.Chart(line_data).mark_line(color='red').encode(
    x='price',
    y='prediction'
)

prediction_plot + line_chart

The full-range plot is hard to read because a few extreme listings stretch both axes.

Let us zoom into the range where the listing price is at most CA$200k:

In [59]:
# keep listings priced at or below CA$200k
zoom_y_test = y_test.query('price <= 200000')

# Create a DataFrame for the 45-degree line over the zoomed range
line_data = pd.DataFrame({
    'price': [0, 200000],
    'prediction': [0, 200000]
})

# Create the 45-degree line chart
line_chart = alt.Chart(line_data).mark_line(color='red').encode(
    x='price',
    y='prediction'
)

prediction_plot_zoom = alt.Chart(zoom_y_test,
                                 title='Actual Listing Price under CA$200k and Prediction').mark_point().encode(
    alt.X('price').title('Actual Price'),
    alt.Y('prediction').title('Predicted Price')
)
prediction_plot_zoom + line_chart

A perfect prediction would lie on the red 45-degree line, and the points generally follow it.
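
The distance from the 45-degree line can also be summarized numerically. The sketch below uses only the nine rows displayed in the prediction table above, so the figures it prints describe that sample and not the full test set.

```python
import pandas as pd

# The nine rows displayed in the prediction table above (not the full test set).
sample = pd.DataFrame({
    "price": [81660, 23558, 11972, 19316, 60300, 67487, 48189, 82408, 157501],
    "prediction": [67907.696927, 20806.684403, 10908.011649, 17991.062601,
                   104736.238830, 65760.148765, 42911.663052, 89852.704118,
                   141051.660274],
})

# Absolute error in CA$ and as a share of the listed price.
sample["abs_error"] = (sample["prediction"] - sample["price"]).abs()
sample["pct_error"] = sample["abs_error"] / sample["price"]

print(f"mean absolute error: {sample['abs_error'].mean():,.0f}")
print(f"median absolute percentage error: {sample['pct_error'].median():.1%}")
```

Applying the same two lines to the full `y_test` frame would give error figures in dollars, which are often easier to communicate than $R^2$.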