A used-car dealership is developing an app to attract new customers. In the app, users can quickly find out the market value of their car. Historical data is available: technical specifications, trim versions, and prices. The task is to build a model that predicts a car's value.

The dealer is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats as st
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, make_scorer
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Fall back to a second location if the file is not found at the first one
try:
    df = pd.read_csv('.csv')
except FileNotFoundError:
    df = pd.read_csv('.csv')
     
display(df.head(5))

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


The first five rows give a view of the table's contents.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


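`df.info()` shows the data type and the number of non-null values in each column. Of the 354,369 rows, `VehicleType`, `Gearbox`, `Model`, `FuelType`, and `NotRepaired` contain missing values.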
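
The share of missing values per column can also be computed directly instead of being read off the non-null counts. The sketch below uses a small made-up frame (`toy` and its values are illustrative and not taken from the dataset) and relies on `isna().mean()`.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up
toy = pd.DataFrame({
    "VehicleType": ["coupe", None, "suv", "small"],
    "Gearbox": ["manual", "manual", None, None],
    "Price": [480, 18300, 9800, 1500],
})

# isna() marks missing cells; the column mean of that mask is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real table the same expression would be applied to `df`.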

The categorical columns are processed with one-hot encoding.

In [16]:
df_ohe_vehicleType = pd.get_dummies(df['VehicleType'])
df_ohe_1 = pd.concat([df, df_ohe_vehicleType], axis=1)

display(df_ohe_1.head(5))

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,...,PostalCode,LastSeen,bus,convertible,coupe,other,sedan,small,suv,wagon
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,...,70435,07/04/2016 03:16,0,0,0,0,0,0,0,0
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,...,66954,07/04/2016 01:46,0,0,1,0,0,0,0,0
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,...,90480,05/04/2016 12:47,0,0,0,0,0,0,1,0
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,...,91074,17/03/2016 17:40,0,0,0,0,0,1,0,0
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,...,60437,06/04/2016 10:17,0,0,0,0,0,1,0,0


In [17]:
df_ohe_Gearbox = pd.get_dummies(df_ohe_1['Gearbox'])
df_ohe_2 = pd.concat([df_ohe_1, df_ohe_Gearbox], axis=1)


In [18]:
df_ohe_FuelType = pd.get_dummies(df_ohe_2['FuelType'])
df_ohe_3 = pd.concat([df_ohe_2, df_ohe_FuelType], axis=1)


In [19]:
df_ohe_Brand = pd.get_dummies(df_ohe_3['Brand'])
df_ohe_4 = pd.concat([df_ohe_3, df_ohe_Brand], axis=1)


In [20]:
df_ohe_NotRepaired = pd.get_dummies(df_ohe_4['NotRepaired'])
df_ohe_5 = pd.concat([df_ohe_4, df_ohe_NotRepaired], axis=1)
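
The column-by-column encoding above could probably be written as a single `pd.get_dummies` call with the `columns` argument. This is a sketch on a made-up frame (`toy` is hypothetical). One possible benefit is that the source column name becomes a prefix, so a category name that occurs in more than one column (`VehicleType` has an `other` category, for example) does not produce duplicate column names.

```python
import pandas as pd

# Toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Gearbox": ["manual", "auto", None],
    "FuelType": ["petrol", "other", "gasoline"],
    "VehicleType": ["other", "coupe", "suv"],
    "Price": [480, 18300, 9800],
})

# Encode all listed columns at once; each new column is named <column>_<category>
# and the original text columns are dropped from the result
encoded = pd.get_dummies(toy, columns=["Gearbox", "FuelType", "VehicleType"])
print(encoded.columns.tolist())
```

A missing value (the `None` in `Gearbox`) simply gets zeros in all of that column's indicator columns, which matches the behaviour seen in the table above.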


The table and its data types after one-hot encoding:

In [21]:
display(df_ohe_5.head(5))

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,...,smart,sonstige_autos,subaru,suzuki,toyota,trabant,volkswagen,volvo,no,yes
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,...,0,0,0,0,0,0,1,0,0,0
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,...,0,0,0,0,0,0,0,0,0,1
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,...,0,0,0,0,0,0,0,0,0,0
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,...,0,0,0,0,0,0,1,0,1,0
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,...,0,0,0,0,0,0,0,0,1,0


In [22]:
df_ohe_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 75 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
bus                  354369 non-null uint8
convertible          354369 non-null uint8
coupe                354369 non-null uint8
other                354369 non-null uint8
sedan               

In [23]:
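# Split target and features.
# The features are taken from the original `df`, so after dropping the text and
# date columns only numeric columns remain; the one-hot columns built in
# `df_ohe_5` are not used here.
target = df['Price']
features = df.drop(['Price','DateCrawled','Model','DateCreated','LastSeen','VehicleType','Gearbox','FuelType','Brand','NotRepaired'], axis=1)

# Split into sets: 80% train; the remaining 20% is split again 80/20,
# which gives 16% validation and 4% test of the full data.
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.2, random_state=42)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.2, random_state=42)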
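
Because `test_size` in the second call is a fraction of the held-out part rather than of the full data, the final proportions are the product of the two fractions. A minimal sketch, assuming a 60/20/20 split is wanted (the fractions and the synthetic arrays are illustrative, not the ones used in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Hold out 40% first, then split the held-out part in half
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_valid), len(X_test))
```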

## Model training

A decision tree regressor is trained for depths 1 to 4 and scored by MSE on the validation set.

In [24]:
for max_depth in range(1, 5):
    regressor = DecisionTreeRegressor(max_depth=max_depth, random_state=54321)

    regressor.fit(features_train, target_train)
    predictions = regressor.predict(features_valid)

    mse = mean_squared_error(target_valid, predictions)

    print("Depth =", max_depth, ": mse =", mse)

Depth = 1 : mse = 14514702.315988125
Depth = 2 : mse = 11102661.63915938
Depth = 3 : mse = 9375218.31992351
Depth = 4 : mse = 7742559.939510886
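
MSE is in squared price units, which makes it hard to compare with the RMSE values reported for the other models. Taking the square root of the values printed above puts the error back into price units. A small worked conversion (the dictionary name is just for illustration):

```python
import math

# Validation MSE values printed above, keyed by tree depth
mse_by_depth = {
    1: 14514702.315988125,
    2: 11102661.63915938,
    3: 9375218.31992351,
    4: 7742559.939510886,
}

# RMSE is the square root of MSE
rmse_by_depth = {depth: math.sqrt(mse) for depth, mse in mse_by_depth.items()}

for depth, rmse in rmse_by_depth.items():
    print(f"Depth = {depth}: rmse = {rmse:.1f}")
```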


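# Same search, reported as RMSE so the error is in the same units as the price
for max_depth in range(1, 5):
    regressor = DecisionTreeRegressor(max_depth=max_depth, random_state=54321)

    regressor.fit(features_train, target_train)
    predictions = regressor.predict(features_valid)

    # Score against the validation targets, not the full `target` column
    rmse = mean_squared_error(target_valid, predictions) ** 0.5

    print("Depth =", max_depth, ": rmse =", rmse)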


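Using the linear regressor, with a constant mean prediction as the baseline.

In [25]:
regressor = LinearRegression()
regressor.fit(features_train, target_train)

# Baseline: predict the mean validation price for every car.
# This cell scores the baseline, not the fitted model; the model itself
# would be scored with regressor.predict(features_valid).
baseline = pd.Series(target_valid.mean(), index=target_valid.index)
baseline_mse = mean_squared_error(target_valid, baseline)

print("Baseline MSE =", baseline_mse)


Baseline MSE = 20676713.4465188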
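
To compare a fitted linear model with a constant-mean prediction, both can be scored on the same validation set. A minimal sketch on synthetic data (the data, the coefficients and all variable names are made up for illustration and are not results from this notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Score the fitted model on its own predictions
model = LinearRegression().fit(X_train, y_train)
model_mse = mean_squared_error(y_valid, model.predict(X_valid))

# Baseline: the training-set mean predicted for every validation row
baseline_mse = mean_squared_error(y_valid, np.full_like(y_valid, y_train.mean()))

print("model MSE:", model_mse, "baseline MSE:", baseline_mse)
```

A model is only useful if its error is clearly below the baseline's.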


Using the RandomForest regressor.

In [26]:
for estim in range(5, 16, 5):
    for depth in range(1, 15):
        model = RandomForestRegressor(n_estimators=estim, max_depth=depth, random_state=54321)
        model.fit(features_train, target_train)

        # Predictions on the training, validation and test sets
        predictions_train = model.predict(features_train)
        predictions_valid = model.predict(features_valid)
        predictions_test = model.predict(features_test)

        # Hyperparameters are compared on the validation RMSE only
        rmse = mean_squared_error(target_valid, predictions_valid) ** 0.5

        print("n_estimators =", estim, " :max_depth =", depth, " :rmse =", rmse)
        

n_estimators = 5  :max_depth = 1  :rmse = 3809.817366211501
n_estimators = 5  :max_depth = 2  :rmse = 3298.6432443072863
n_estimators = 5  :max_depth = 3  :rmse = 3010.1240400273887
n_estimators = 5  :max_depth = 4  :rmse = 2741.7407061712443
n_estimators = 5  :max_depth = 5  :rmse = 2554.038414549346
n_estimators = 5  :max_depth = 6  :rmse = 2448.242506196709
n_estimators = 5  :max_depth = 7  :rmse = 2348.6154818463706
n_estimators = 5  :max_depth = 8  :rmse = 2291.8255016811777
n_estimators = 5  :max_depth = 9  :rmse = 2253.1288313365403
n_estimators = 5  :max_depth = 10  :rmse = 2230.6478692053406
n_estimators = 5  :max_depth = 11  :rmse = 2215.9310939359357
n_estimators = 5  :max_depth = 12  :rmse = 2207.237124564558
n_estimators = 5  :max_depth = 13  :rmse = 2206.5540018782212
n_estimators = 5  :max_depth = 14  :rmse = 2208.700795898802
n_estimators = 10  :max_depth = 1  :rmse = 3809.7826878417336
n_estimators = 10  :max_depth = 2  :rmse = 3310.860335563865
n_estimators = 10  :max
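
Rather than reading the best configuration off the printed log, the loop can keep track of the lowest validation RMSE as it goes. The following is a sketch on synthetic data with a reduced grid. The data, the grid values and the `best` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=400)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Remember the configuration with the lowest validation RMSE seen so far
best = {"rmse": float("inf"), "n_estimators": None, "max_depth": None}
for estim in (5, 10):
    for depth in (2, 4, 6):
        model = RandomForestRegressor(n_estimators=estim, max_depth=depth, random_state=54321)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
        if rmse < best["rmse"]:
            best = {"rmse": rmse, "n_estimators": estim, "max_depth": depth}

print(best)
```

The test set is left out of the loop on purpose, so that it is used only once for the final check.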

Using LightGBM and CatBoost for gradient boosting.

In [28]:
# LightGBM's scikit-learn API takes `n_estimators` and `random_state`;
# `iterations` and `loss_function="Logloss"` are CatBoost-style arguments
# (Logloss is a classification loss) and are not LightGBM parameters.
model_lgbm = LGBMRegressor(n_estimators=50, random_state=12345)

model_lgbm.fit(features_train, target_train)

In [29]:
model_cat = CatBoostRegressor(iterations=50, random_seed=12345)

model_cat.fit(features_train, target_train, verbose=20)

0:	learn: 4422.5599769	total: 202ms	remaining: 9.92s
20:	learn: 3264.5968932	total: 4.08s	remaining: 5.63s
40:	learn: 2745.5028503	total: 7.87s	remaining: 1.73s
49:	learn: 2618.4053967	total: 9.56s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x7f709b99cb90>

In [30]:
import time

def exec_time(start, end):
    """Print the elapsed time as HH:MM:SS and return it in seconds."""
    diff_time = end - start
    m, s = divmod(diff_time, 60)
    h, m = divmod(m, 60)
    s, m, h = int(round(s, 0)), int(round(m, 0)), int(round(h, 0))
    print("Execution Time: " + "{0:02d}:{1:02d}:{2:02d}".format(h, m, s))
    return diff_time

In [31]:
start = time.time()
end = time.time()
elapsed = exec_time(start, end)  # quick check of the helper on a near-zero interval

Execution Time: 00:00:00


In [38]:
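#model = CatBoostRegressor(iterations=50, random_seed=12345)
# Explicit settings instead of the leftover loop variables `estim` and `depth`,
# which hold 15 and 14 after the random forest grid search above
model = RandomForestRegressor(n_estimators=15, max_depth=14, random_state=54321)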

# model training time
start = time.time()
model.fit(features_train, target_train)
end = time.time()

training_time = exec_time(start, end)

#######################################

# model predicting time:

start = time.time()
model.predict(features_valid)
end = time.time()

predicting_time = exec_time(start, end)

Execution Time: 00:00:10
Execution Time: 00:00:00


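Training the random forest took about 10 seconds, and predicting on the validation set took less than a second (it rounds to 0 seconds). These two figures are the training time and the prediction speed of the model.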
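
For sub-second measurements such as the prediction step, a higher-resolution timer that returns the elapsed seconds as a number may be more informative than an HH:MM:SS string. A minimal sketch, assuming a hypothetical helper named `timed` (not part of this notebook) built on `time.perf_counter`:

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; with a real model the calls would look like
# timed(model.fit, features_train, target_train) or timed(model.predict, features_valid)
total, seconds = timed(sum, range(1_000_000))
print(f"sum = {total}, took {seconds:.4f} s")
```

Keeping the elapsed time as a float makes it easy to collect the timings of several models in one table.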


## Model analysis

In [44]:
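# Assumption: these are the settings of the last forest fitted in the grid search
# (n_estimators=15, max_depth=14), which produced the predictions scored in the original run
final_model = RandomForestRegressor(n_estimators=15, random_state=54321, max_depth=14)
final_model.fit(features_train, target_train)

# Score the final model itself rather than predictions left over from an earlier cell
predictions_train = final_model.predict(features_train)
predictions_test = final_model.predict(features_test)

rmse = mean_squared_error(target_train, predictions_train)**0.5
test_rmse = mean_squared_error(target_test, predictions_test)**0.5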

print("The Quality")
print("Training set:", rmse)
print("Test set:", test_rmse)

The Quality
Training set: 1854.6836261235856
Test set: 2208.197025263618


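General Conclusion: The RMSE is higher on the test set (about 2208) than on the training set (about 1855), so the model overfits to some degree. The test RMSE is close to the validation RMSE seen during tuning (about 2207 at best for 5 trees) and well below the error of a constant mean prediction (an MSE of about 20.7 million, i.e. an RMSE of about 4547, on the validation set), so the quality of the model is acceptable.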
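
The size of the gap between training and test error can be expressed as a relative difference. A short worked calculation using the two RMSE values printed above (the variable names are illustrative):

```python
# RMSE values from the output above
train_rmse = 1854.6836261235856
test_rmse = 2208.197025263618

# Relative increase of the test error over the training error
relative_gap = (test_rmse - train_rmse) / train_rmse
print(f"Test RMSE is {relative_gap:.1%} higher than training RMSE")
```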

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed