Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

First thing I'll do is import any libraries I find necessary.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import xgboost as xgb
from catboost import CatBoostRegressor
import lightgbm as lgb

Next, I'll load the data and look to see what needs to be done to clean up the data and the best way to do so.

In [2]:
df = pd.read_csv('/datasets/car_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Immediately, I noticed the columns have capital letters and are all one word. I will make all the columns lowercase and give spacing to the names that need it so it is easier to read.

In [3]:
df = df.rename(columns={'DateCrawled':'date_crawled', 'Price':'price', 'VehicleType':'vehicle_type',
                       'RegistrationYear':'registration_year', 'Gearbox':'gearbox', 'Power':'power', 'Model':'model',
                       'Mileage':'mileage', 'RegistrationMonth':'registration_month', 'FuelType':'fuel_type',
                       'Brand':'brand', 'NotRepaired':'not_repaired', 'DateCreated':'date_created',
                       'NumberOfPictures':'number_of_pictures', 'PostalCode':'postal_code', 'LastSeen':'last_seen'})

df.sample(10)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
239202,21/03/2016 16:42,9900,wagon,2008,auto,0,c_klasse,150000,10,gasoline,mercedes_benz,no,21/03/2016 00:00,0,53332,30/03/2016 03:16
245885,07/03/2016 19:40,9950,sedan,2007,manual,177,5er,150000,1,gasoline,bmw,no,07/03/2016 00:00,0,52445,24/03/2016 17:15
133213,31/03/2016 16:57,3300,small,2007,manual,60,c2,60000,11,petrol,citroen,no,31/03/2016 00:00,0,24594,06/04/2016 10:46
224566,15/03/2016 12:51,450,,2017,manual,101,golf,150000,0,petrol,volkswagen,,15/03/2016 00:00,0,73733,17/03/2016 07:46
161862,21/03/2016 11:47,6990,small,2012,manual,69,punto,30000,5,petrol,fiat,no,21/03/2016 00:00,0,45527,06/04/2016 05:45
79788,27/03/2016 16:37,0,,2016,manual,45,polo,125000,4,,volkswagen,no,27/03/2016 00:00,0,39606,01/04/2016 16:45
8570,16/03/2016 15:46,10000,wagon,2013,manual,140,passat,40000,6,gasoline,volkswagen,no,16/03/2016 00:00,0,10115,16/03/2016 15:46
100625,10/03/2016 07:54,16500,convertible,2005,auto,300,mustang,90000,3,petrol,ford,no,10/03/2016 00:00,0,85391,26/03/2016 10:17
17259,30/03/2016 13:52,500,wagon,1997,manual,101,vectra,150000,6,petrol,opel,yes,30/03/2016 00:00,0,6686,05/04/2016 01:46
110163,27/03/2016 01:57,1890,small,1999,manual,110,ibiza,150000,8,gasoline,seat,no,27/03/2016 00:00,0,38524,07/04/2016 04:16


Since this is such a large dataset, I feel comfortable droping any rows with NaN values since I do not think it will heavily affect the results of the models.

In [4]:
df = df.dropna()

Next, I will use the Label Encoder to change the object datatypes into integers that represent unique categories. 

In [5]:
categorical_features = df[['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']]

label_encoder = LabelEncoder()

# This will iterate over each categorical column and encode the values
for column in categorical_features.columns:
    df[column] = label_encoder.fit_transform(df[column].astype(str))
    
print(df.head(10))

        date_crawled  price  vehicle_type  registration_year  gearbox  power  \
3   17/03/2016 16:54   1500             5               2001        1     75   
4   31/03/2016 17:25   3600             5               2008        1     69   
5   04/04/2016 17:36    650             4               1995        1    102   
6   01/04/2016 20:48   2200             1               2004        1    109   
7   21/03/2016 18:54      0             4               1980        1     50   
10  26/03/2016 19:54   2000             4               2004        1    105   
11  07/04/2016 10:06   2799             7               2005        1    140   
14  21/03/2016 12:57  17999             6               2011        1    190   
17  20/03/2016 10:25   1750             5               2004        0     75   
18  23/03/2016 15:48   7550             0               2007        1    136   

    model  mileage  registration_month  fuel_type  brand  not_repaired  \
3     116   150000                   6       

In [6]:
print(df.describe())
df.drop('number_of_pictures', axis=1, inplace=True)
df.info()

               price   vehicle_type  registration_year        gearbox  \
count  245814.000000  245814.000000      245814.000000  245814.000000   
mean     5125.346717       4.253525        2002.918699       0.792209   
std      4717.948673       2.131782           6.163765       0.405727   
min         0.000000       0.000000        1910.000000       0.000000   
25%      1499.000000       4.000000        1999.000000       1.000000   
50%      3500.000000       4.000000        2003.000000       1.000000   
75%      7500.000000       5.000000        2007.000000       1.000000   
max     20000.000000       7.000000        2018.000000       1.000000   

               power          model        mileage  registration_month  \
count  245814.000000  245814.000000  245814.000000       245814.000000   
mean      119.970884     108.049651  127296.716216            6.179701   
std       139.387116      70.990730   37078.820368            3.479519   
min         0.000000       0.000000    5000.00

Upon further EDA, I have discovered that the 'number_of_pictures' column is filled entirely of zeros. Since this is clearly an error, I have dropped the column completely.

We can now see that the dataframe columns that had object datatypes now contain integers that represent unique categories.

I will drop the required columns to run the models proficiently.

In [7]:
drop_columns = ['date_crawled', 'date_created', 'last_seen', 'price']
features = df.drop(drop_columns, axis=1)
target = df['price']

features_train_val, features_test, target_train_val, target_test = train_test_split(features,
                                                                                    target,
                                                                                    test_size=0.2,
                                                                                    random_state=12345)

features_train, features_valid, target_train, target_valid = train_test_split(features_train_val,
                                                                              target_train_val,
                                                                              test_size=0.25,
                                                                              random_state=12345)

Since the dataset has a potential for large numbers, I decided to scale the feature data. Scaling the feature data helps make the model training process more reliable.

In [8]:
scaler = StandardScaler()
scaler.fit(features_train)

features_train = scaler.transform(features_train)
features_valid = scaler.transform(features_valid)

## Model training

In [9]:
%%time

lr = LinearRegression()
lr.fit(features_train, target_train)
predictions_lr = lr.predict(features_valid)

result_lr = mean_squared_error(target_valid, predictions_lr, squared=False)
print("RMSE of the Linear Regression model on the validation set:", result_lr)

RMSE of the Linear Regression model on the validation set: 3343.2026497566317
CPU times: user 48.1 ms, sys: 12.1 ms, total: 60.2 ms
Wall time: 42.9 ms


In [10]:
class SGDLinearRegression:
    def __init__(self, step_size, epochs, batch_size):
        self.step_size = step_size
        self.epochs = epochs
        self.batch_size = batch_size
        self.w = None
        
    def fit(self, train_features, train_target):
        X = np.concatenate(
            (np.ones((train_features.shape[0], 1)), train_features), axis=1
        )
        y = train_target
        w = np.zeros(X.shape[1])
        
        for _ in range(self.epochs):
            batches_count = X.shape[0] // self.batch_size
            for i in range(batches_count):
                begin = i * self.batch_size
                end = (i + 1) * self.batch_size
                X_batch = X[begin:end, :]
                y_batch = y[begin:end]
                
                gradient = 2 * X_batch.T.dot(X_batch.dot(w) - y_batch) / X_batch.shape[0]
                
                w -= self.step_size * gradient
        
        self.w = w

In [11]:
%%time

sgd_lr = SGDLinearRegression(step_size=0.01, epochs=100, batch_size=32)
sgd_lr.fit(features_train, target_train)

predictions_sgd = np.dot(np.concatenate((np.ones((features_valid.shape[0], 1)), features_valid), axis=1), sgd_lr.w)
rmse_sgd = mean_squared_error(target_valid, predictions_sgd, squared=False)

print("RMSE of the SGD Linear Regression model on the validaiton set:", rmse_sgd)

RMSE of the SGD Linear Regression model on the validaiton set: 3655.542951267383
CPU times: user 2min 11s, sys: 1.7 s, total: 2min 13s
Wall time: 2min 13s


In [12]:
%%time

rfr = RandomForestRegressor(n_estimators=100, random_state=12345)
rfr.fit(features_train, target_train)
predictions_rfr = rfr.predict(features_valid)

result_rfr = mean_squared_error(target_valid, predictions_rfr, squared=False)
print("RMSE of the Random Forest Regression model on the validation set:", result_rfr)

RMSE of the Random Forest Regression model on the validation set: 1635.4025039604942
CPU times: user 1min 26s, sys: 467 ms, total: 1min 27s
Wall time: 1min 27s


In [13]:
%%time

dtr = DecisionTreeRegressor(random_state=12345)
dtr.fit(features_train, target_train)
predictions_dtr = dtr.predict(features_valid)

result_dtr = mean_squared_error(target_valid, predictions_dtr, squared=False)
print("RMSE of the Decision Tree Regression model on the validation set:", result_dtr)

RMSE of the Decision Tree Regression model on the validation set: 2237.179300431031
CPU times: user 1.35 s, sys: 0 ns, total: 1.35 s
Wall time: 1.34 s


In [14]:
%%time

dtrain = xgb.DMatrix(features_train, label=target_train)
dvalid = xgb.DMatrix(features_valid, label=target_valid)

params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}

num_rounds = 100
xgb_model = xgb.train(params, dtrain, num_rounds)

predictions_xgb = xgb_model.predict(dvalid)

rmse_xgb = mean_squared_error(target_valid, predictions_xgb, squared=False)
print('RMSE of the XGBoost Regressor model on the validation set:', rmse_xgb)

RMSE of the XGBoost Regressor model on the validation set: 1650.0206482819442
CPU times: user 27.3 s, sys: 109 ms, total: 27.4 s
Wall time: 27.6 s


In [15]:
%%time

lgbm_train = lgb.Dataset(features_train, label=target_train)
lgbm_valid = lgb.Dataset(features_valid, label=target_valid)

params = {
    'objective': 'regression',
    'metric': 'rmse'
}

num_rounds = 100
lgb_model = lgb.train(params, lgbm_train, num_rounds, valid_sets=[lgbm_valid], early_stopping_rounds=10)

predictions_lgb = lgb_model.predict(features_valid, num_iteration=lgb_model.best_iteration)
rmse_lgb = mean_squared_error(target_valid, predictions_lgb, squared=False)

print('RMSE of the LightGBM model on the validation set:', rmse_lgb)



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 919
[LightGBM] [Info] Number of data points in the train set: 147488, number of used features: 11
[LightGBM] [Info] Start training from score 5135.456844
[1]	valid_0's rmse: 4389.27
Training until validation scores don't improve for 10 rounds
[2]	valid_0's rmse: 4087.44
[3]	valid_0's rmse: 3823.47
[4]	valid_0's rmse: 3592.38
[5]	valid_0's rmse: 3390.92
[6]	valid_0's rmse: 3211.58
[7]	valid_0's rmse: 3057.56
[8]	valid_0's rmse: 2923.96
[9]	valid_0's rmse: 2805.59
[10]	valid_0's rmse: 2701.31
[11]	valid_0's rmse: 2612.4
[12]	valid_0's rmse: 2534.39
[13]	valid_0's rmse: 2464.26
[14]	valid_0's rmse: 2404.27
[15]	valid_0's rmse: 2349.22
[16]	valid_0's rmse: 2302.96
[17]	valid_0's rmse: 2259.83
[18]	valid_0's rmse: 2221.62
[19]	valid_0's rmse: 2186.32
[20]	valid_0's rmse: 2156.54
[21]	valid_0's rmse: 2128.66
[22]	valid_0's rmse: 2098.02
[23]	v

In [16]:
%%time

catboost_model = CatBoostRegressor(iterations=100,
                                  loss_function='RMSE',
                                  eval_metric='RMSE',
                                  random_seed=12345)

catboost_model.fit(features_train, target_train, eval_set=(features_valid, target_valid), early_stopping_rounds=10, verbose=False)

predictions_catboost = catboost_model.predict(features_valid)
rmse_catboost = mean_squared_error(target_valid, predictions_catboost, squared=False)

print("RMSE of the CatBoost model on the validation set:", rmse_catboost)

RMSE of the CatBoost model on the validation set: 1708.274099598683
CPU times: user 3.13 s, sys: 32 ms, total: 3.16 s
Wall time: 3.37 s


Now, I'm going to use the test set on the Random Forest model and the CatBoost model. I'm choosing these two because the Random Forest had the best result, and the CatBoost had the second best result and was significantly faster.

In [17]:
features_test_scaled = scaler.transform(features_test)

predictions_rfr_test = rfr.predict(features_test)
rmse_rfr_test = mean_squared_error(target_test, predictions_rfr_test, squared=False)
print("RMSE of the Random Forest Regression model on the test set:", rmse_rfr_test)

RMSE of the Random Forest Regression model on the test set: 5209.36487182959


In [18]:
predictions_catboost_test = catboost_model.predict(features_test_scaled)
rmse_catboost_test = mean_squared_error(target_test, predictions_catboost_test, squared=False)
print("RMSE of the CatBoost model on the test set:", rmse_catboost_test)

RMSE of the CatBoost model on the test set: 1708.2436239697508


Given that the CatBoost model performed significantly better than the Random Forest model on the test set, that would be the model that Rusty Bargain should consider using.

In [None]:
%%time

rf_cv_scores = cross_val_score(rfr, features, target, cv=5)
rf_final_score = sum(rf_cv_scores) / len(rf_cv_scores)

print("Average Random Forest Regression evaluation score:", rf_final_score)

In [19]:
%%time

catboost_cv_scores = cross_val_score(catboost_model, features, target, cv=5)
catboost_final_score = sum(catboost_cv_scores) / len(catboost_cv_scores)

print("Average CatBoost evaluation score:", catboost_final_score)

Learning rate set to 0.5
0:	learn: 3313.7494737	total: 46.8ms	remaining: 4.63s
1:	learn: 2670.9990775	total: 93ms	remaining: 4.56s
2:	learn: 2408.0628671	total: 133ms	remaining: 4.31s
3:	learn: 2282.4673045	total: 175ms	remaining: 4.19s
4:	learn: 2203.5555298	total: 219ms	remaining: 4.17s
5:	learn: 2152.3403951	total: 257ms	remaining: 4.03s
6:	learn: 2089.4259336	total: 307ms	remaining: 4.08s
7:	learn: 2063.4313712	total: 344ms	remaining: 3.96s
8:	learn: 2040.7232232	total: 389ms	remaining: 3.93s
9:	learn: 2005.5048544	total: 439ms	remaining: 3.95s
10:	learn: 1983.2334146	total: 482ms	remaining: 3.9s
11:	learn: 1969.3922876	total: 527ms	remaining: 3.87s
12:	learn: 1952.7958608	total: 565ms	remaining: 3.78s
13:	learn: 1940.3790506	total: 612ms	remaining: 3.76s
14:	learn: 1930.4793586	total: 647ms	remaining: 3.67s
15:	learn: 1921.6271705	total: 688ms	remaining: 3.61s
16:	learn: 1914.0717298	total: 728ms	remaining: 3.55s
17:	learn: 1898.8272488	total: 770ms	remaining: 3.51s
18:	learn: 189

54:	learn: 1726.2408997	total: 2.59s	remaining: 2.12s
55:	learn: 1724.2035329	total: 2.63s	remaining: 2.07s
56:	learn: 1722.2403569	total: 2.68s	remaining: 2.02s
57:	learn: 1721.2111642	total: 2.72s	remaining: 1.97s
58:	learn: 1719.6766613	total: 2.77s	remaining: 1.92s
59:	learn: 1717.9443171	total: 2.82s	remaining: 1.88s
60:	learn: 1715.6153705	total: 2.87s	remaining: 1.83s
61:	learn: 1713.0717171	total: 2.92s	remaining: 1.79s
62:	learn: 1710.4474513	total: 2.96s	remaining: 1.74s
63:	learn: 1708.4285590	total: 3s	remaining: 1.69s
64:	learn: 1706.9213585	total: 3.04s	remaining: 1.63s
65:	learn: 1703.7524619	total: 3.09s	remaining: 1.59s
66:	learn: 1701.0132406	total: 3.13s	remaining: 1.54s
67:	learn: 1699.2552696	total: 3.17s	remaining: 1.49s
68:	learn: 1697.8342834	total: 3.21s	remaining: 1.44s
69:	learn: 1696.5500622	total: 3.25s	remaining: 1.39s
70:	learn: 1695.0041002	total: 3.29s	remaining: 1.34s
71:	learn: 1694.1784699	total: 3.33s	remaining: 1.29s
72:	learn: 1693.2165602	total: 

7:	learn: 2055.7567883	total: 372ms	remaining: 4.28s
8:	learn: 2037.2212986	total: 411ms	remaining: 4.16s
9:	learn: 2001.0521110	total: 453ms	remaining: 4.08s
10:	learn: 1982.2596172	total: 498ms	remaining: 4.03s
11:	learn: 1959.2969269	total: 543ms	remaining: 3.98s
12:	learn: 1935.6620588	total: 587ms	remaining: 3.93s
13:	learn: 1918.5409291	total: 630ms	remaining: 3.87s
14:	learn: 1909.6140843	total: 678ms	remaining: 3.84s
15:	learn: 1900.9763270	total: 722ms	remaining: 3.79s
16:	learn: 1890.4412417	total: 768ms	remaining: 3.75s
17:	learn: 1882.0441548	total: 810ms	remaining: 3.69s
18:	learn: 1874.4013104	total: 855ms	remaining: 3.65s
19:	learn: 1871.3335313	total: 891ms	remaining: 3.56s
20:	learn: 1863.0299154	total: 935ms	remaining: 3.52s
21:	learn: 1851.6516707	total: 981ms	remaining: 3.48s
22:	learn: 1840.7880071	total: 1.03s	remaining: 3.45s
23:	learn: 1835.6814098	total: 1.08s	remaining: 3.41s
24:	learn: 1832.2310209	total: 1.12s	remaining: 3.35s
25:	learn: 1828.2366455	total: 

62:	learn: 1714.5779917	total: 2.79s	remaining: 1.64s
63:	learn: 1712.6385996	total: 2.83s	remaining: 1.59s
64:	learn: 1710.9956772	total: 2.87s	remaining: 1.54s
65:	learn: 1708.2363742	total: 2.91s	remaining: 1.5s
66:	learn: 1706.0492579	total: 2.96s	remaining: 1.46s
67:	learn: 1704.7815779	total: 3s	remaining: 1.41s
68:	learn: 1702.4127985	total: 3.04s	remaining: 1.36s
69:	learn: 1700.2465997	total: 3.08s	remaining: 1.32s
70:	learn: 1698.6237822	total: 3.13s	remaining: 1.28s
71:	learn: 1698.0590136	total: 3.17s	remaining: 1.23s
72:	learn: 1697.4674563	total: 3.21s	remaining: 1.19s
73:	learn: 1696.0330777	total: 3.25s	remaining: 1.14s
74:	learn: 1694.5414216	total: 3.3s	remaining: 1.1s
75:	learn: 1692.5260813	total: 3.33s	remaining: 1.05s
76:	learn: 1691.2079931	total: 3.38s	remaining: 1.01s
77:	learn: 1689.8189182	total: 3.42s	remaining: 966ms
78:	learn: 1688.3666368	total: 3.46s	remaining: 921ms
79:	learn: 1685.9757333	total: 3.51s	remaining: 877ms
80:	learn: 1684.7234568	total: 3.5

## Model analysis

Linear Regression: Compared to the other models, the RMSE value is relatively high, indicating a poorer performance.

SGD Linear Regression: A similar performance to Linear Regression, but significantly slower due to the stochastic gradient descent.

Random Forest Regression: Lowest RMSE compared to the other models, indicating a better performance, but still slow due to the ensemble nature of Random Forest.

Decision Tree Regression: Higher RMSE than Random Forest, but better than Linear Regression and SGD. The model is faster than the Random Forest and SGD models.

XGBoost Regressor: Similar performance to the Random Forest model with a faster training time.

LightGBM: Slightly higher RMSE compared to XGBoost but a bit faster with training.

CatBoost: Similar performance to XGBoost and LightGBM with a faster training time.

In summary, Random Forest Regression, XGBoost, LightGBM, and CatBoost perform relatively well in terms of both speed and quality compared to other models. Any one of those models would be a good choice for developing the app. Linear Regression and SGD Linear Regression have poorer performance in terms of quality, while Decision Tree Regression has a higher RMSE and slower training time compared to other tree-based models. With Cross-Validation on the better performing models, Random Forest and CatBoost, it appears that Random Forest came out on top purely looking at the evaluation score. However, based on time, CatBoost might prove to be a better model, since it is very close to the Random Forest evaluation score and much quicker.