Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

In [1]:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, make_scorer, mean_squared_error
from sklearn.preprocessing import StandardScaler

## Data Preparation

In [2]:
car_data = pd.read_csv('/datasets/car_data.csv')

In [3]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [4]:
car_data.sample(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
108704,29/03/2016 20:36,8499,wagon,2007,auto,158,forester,125000,3,petrol,subaru,no,29/03/2016 00:00,0,96146,06/04/2016 06:45
197063,17/03/2016 11:46,11111,,2017,manual,90,,20000,3,petrol,suzuki,no,17/03/2016 00:00,0,21465,24/03/2016 03:17
55913,11/03/2016 12:48,4000,bus,2007,manual,125,meriva,125000,12,petrol,opel,no,11/03/2016 00:00,0,98553,01/04/2016 10:18
243711,19/03/2016 11:25,11500,bus,2007,manual,170,touran,125000,7,gasoline,volkswagen,no,19/03/2016 00:00,0,63263,19/03/2016 18:42
223395,09/03/2016 01:00,3650,sedan,2003,auto,170,stilo,125000,6,petrol,fiat,no,09/03/2016 00:00,0,49377,07/04/2016 04:47
78735,04/04/2016 12:52,17000,,2014,manual,0,5_reihe,20000,8,petrol,mazda,no,04/04/2016 00:00,0,47441,06/04/2016 14:16
159498,09/03/2016 20:25,650,small,1996,manual,60,corsa,150000,9,petrol,opel,yes,09/03/2016 00:00,0,55129,10/03/2016 19:38
145400,14/03/2016 06:56,4500,wagon,2003,manual,190,a4,150000,3,petrol,audi,yes,14/03/2016 00:00,0,7333,15/03/2016 03:44
330884,02/04/2016 09:51,4800,convertible,1997,manual,101,golf,150000,4,petrol,volkswagen,no,02/04/2016 00:00,0,35435,02/04/2016 14:49
317867,14/03/2016 07:32,7600,sedan,2009,manual,80,golf,125000,9,petrol,volkswagen,no,14/03/2016 00:00,0,55411,06/04/2016 17:15


In [5]:
car_data['VehicleType'].unique()

array([nan, 'coupe', 'suv', 'small', 'sedan', 'convertible', 'bus',
       'wagon', 'other'], dtype=object)

In [6]:
car_data['Gearbox'].unique()

array(['manual', 'auto', nan], dtype=object)

In [7]:
car_data['FuelType'].unique()

array(['petrol', 'gasoline', nan, 'lpg', 'other', 'hybrid', 'cng',
       'electric'], dtype=object)

In [8]:
car_data['NotRepaired'].unique()

array([nan, 'yes', 'no'], dtype=object)

Here I look at the unique values in each categorical column to decide how to handle the missing values.

In [9]:
car_data['Model'].nunique()

250

In [10]:
car_data = car_data.fillna('unknown')

In [11]:
car_data['VehicleType'] = car_data['VehicleType'].replace('other','unknown')
car_data['FuelType'] = car_data['FuelType'].replace('other','unknown')
car_data['Model'] = car_data['Model'].replace('other','unknown')

All of the missing values were in categorical features, so I filled them with 'unknown' rather than dropping any rows. I also changed the 'other' values in VehicleType, FuelType, and Model to 'unknown', because these are essentially unknown values too.
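
The fill-and-merge step can be sketched on a toy frame. This is only an illustration with made-up values, and it restricts the fill to an explicit list of text columns (a precaution not taken in the cell above) so that numeric columns are never touched.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for car_data (values are made up for illustration).
toy = pd.DataFrame({
    "VehicleType": ["sedan", np.nan, "other"],
    "FuelType": ["petrol", "other", np.nan],
    "Power": [90, 0, 150],
})

# Only the categorical columns are filled; Power is left alone.
cat_cols = ["VehicleType", "FuelType"]
toy[cat_cols] = toy[cat_cols].fillna("unknown")

# Fold the catch-all 'other' label into the same placeholder.
toy[cat_cols] = toy[cat_cols].replace("other", "unknown")

print(toy)
```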

In [12]:
car_data.sample(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
220223,03/04/2016 20:49,0,unknown,2017,manual,55,corsa,125000,0,petrol,opel,no,03/04/2016 00:00,0,17192,05/04/2016 21:46
109989,09/03/2016 15:39,2300,wagon,2001,manual,125,forester,150000,12,lpg,subaru,no,09/03/2016 00:00,0,37081,15/03/2016 00:15
73176,19/03/2016 18:47,14299,sedan,2011,manual,120,a4,60000,5,petrol,audi,no,19/03/2016 00:00,0,94532,07/04/2016 06:16
328209,26/03/2016 21:50,0,small,1999,manual,54,corsa,125000,12,petrol,opel,no,26/03/2016 00:00,0,35683,28/03/2016 13:32
217270,29/03/2016 19:51,1500,sedan,1994,manual,122,c_klasse,150000,6,petrol,mercedes_benz,no,29/03/2016 00:00,0,69488,06/04/2016 06:15
108680,03/04/2016 20:41,1695,wagon,2000,auto,165,v70,150000,9,unknown,volvo,no,03/04/2016 00:00,0,29587,05/04/2016 21:17
257845,23/03/2016 09:52,350,wagon,1995,manual,74,astra,150000,4,petrol,opel,no,23/03/2016 00:00,0,84367,24/03/2016 17:44
334270,02/04/2016 13:55,6200,sedan,2007,manual,140,golf,150000,2,petrol,volkswagen,no,02/04/2016 00:00,0,23684,06/04/2016 12:16
122040,24/03/2016 08:51,5800,wagon,2008,manual,90,astra,80000,3,petrol,opel,no,24/03/2016 00:00,0,23730,01/04/2016 01:16
340524,28/03/2016 14:43,1999,wagon,2002,manual,84,astra,150000,10,petrol,opel,no,28/03/2016 00:00,0,32107,06/04/2016 18:44


In [13]:
#Using descriptive statistics to see if there are any outliers or oddities to be dealt with
car_data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [14]:
#dropping Number of Pictures column 
car_data = car_data.drop(columns=['NumberOfPictures'])

The NumberOfPictures column contains only the value 0. Either the column is flawed or it carries no useful information; either way, we drop it from the DataFrame.
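
Constant columns like this one can also be found programmatically instead of by reading `describe()`. A minimal sketch on made-up data, using `nunique` to flag any column with a single distinct value:

```python
import pandas as pd

# Toy frame; NumberOfPictures mimics the all-zero column described above.
toy = pd.DataFrame({
    "Price": [500, 1200, 9000],
    "NumberOfPictures": [0, 0, 0],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)

print(constant_cols)
```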

In [15]:
car_data['RegistrationMonth'].value_counts()

0     37352
3     34373
6     31508
4     29270
5     29153
7     27213
10    26099
12    24289
11    24186
9     23813
1     23219
8     22627
2     21267
Name: RegistrationMonth, dtype: int64

In [16]:
month0 = car_data[car_data['RegistrationMonth']==0]
month0.sample(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,PostalCode,LastSeen
224593,08/03/2016 19:39,2200,unknown,2017,manual,165,a6,150000,0,unknown,audi,unknown,08/03/2016 00:00,1458,05/04/2016 13:18
288384,19/03/2016 12:55,650,unknown,2016,manual,101,golf,150000,0,unknown,volkswagen,unknown,19/03/2016 00:00,89281,27/03/2016 00:45
162425,27/03/2016 11:36,799,unknown,1995,unknown,0,a4,150000,0,petrol,audi,yes,27/03/2016 00:00,34593,07/04/2016 07:45
51905,31/03/2016 17:45,1000,wagon,2000,manual,205,mondeo,150000,0,unknown,ford,unknown,31/03/2016 00:00,42651,06/04/2016 11:15
292363,24/03/2016 07:57,4850,unknown,1980,unknown,0,x_type,30000,0,unknown,jaguar,unknown,24/03/2016 00:00,31008,25/03/2016 15:55
174796,12/03/2016 15:48,0,unknown,2005,unknown,0,corsa,150000,0,unknown,opel,unknown,12/03/2016 00:00,66773,17/03/2016 17:47
154676,03/04/2016 12:39,6500,convertible,2004,manual,135,unknown,70000,0,petrol,peugeot,no,03/04/2016 00:00,25541,07/04/2016 14:57
44643,23/03/2016 09:47,3600,coupe,1999,manual,170,3er,150000,0,petrol,bmw,no,23/03/2016 00:00,46499,01/04/2016 12:44
71953,12/03/2016 17:43,2100,wagon,2004,manual,103,stilo,150000,0,petrol,fiat,unknown,12/03/2016 00:00,33647,03/04/2016 08:16
244784,15/03/2016 16:55,1500,wagon,2002,unknown,110,unknown,150000,0,petrol,peugeot,no,15/03/2016 00:00,52076,19/03/2016 11:17


I looked at random samples of the rows where the registration month is 0 to see whether month 0 maps to a real month or simply denotes an unknown value. Roughly 10% of the rows have a RegistrationMonth of 0, so I don't want to remove them. I see no pattern suggesting what month 0 is supposed to be, so I will treat it as unknown. I could replace the 0 values with 'unknown', but I want to keep the column numeric, so a value of 0 in RegistrationMonth simply denotes an unknown month.
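
As a rough illustration of this reasoning, the share of placeholder months can be computed directly. The sketch below uses made-up months. The extra indicator column is my own assumption and is not part of the pipeline in this notebook; it is one way to let a model tell "unknown" apart from a real month while the column stays numeric.

```python
import pandas as pd

# Made-up months; 0 plays the role of the 'unknown' code.
months = pd.Series([0, 3, 6, 0, 12, 5, 7, 9, 1, 4], name="RegistrationMonth")

# Fraction of rows carrying the placeholder value.
unknown_share = (months == 0).mean()

# Hypothetical explicit flag for the placeholder (not used elsewhere here).
month_unknown = (months == 0).astype(int)

print(f"{unknown_share:.1%} of rows have month 0")
```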

In [17]:
car_data = car_data[(car_data['RegistrationYear'] >= 1920) & (car_data['RegistrationYear'] <= 2024)].reset_index(drop=True)

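The RegistrationYear column contained impossible values (e.g. 1000 and 9999), so I kept only rows with a year between 1920 and 2024. Note that the listings were crawled in 2016, so this range still keeps some years (such as 2017) that postdate the listing.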

In [18]:
car_data = car_data[car_data['Power'] >= 50].reset_index(drop=True)

Keeping only rows with a Power of at least 50 to remove implausible low values (e.g. 0). No upper bound is applied.

In [19]:
car_data = car_data[car_data['Price'] > 0].reset_index(drop=True)

Keeping only rows with a Price above 0 to remove implausible values.
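
The three filters can also be combined into one boolean mask, which makes it easy to report how many rows survive. A minimal sketch on made-up rows, using the same thresholds as the filter cells above:

```python
import pandas as pd

# Toy rows: four of the five violate one of the rules.
toy = pd.DataFrame({
    "RegistrationYear": [1000, 2005, 2010, 9999, 1999],
    "Power": [75, 0, 120, 90, 60],
    "Price": [500, 3000, 0, 800, 1500],
})

# between() is inclusive on both ends by default.
mask = (
    toy["RegistrationYear"].between(1920, 2024)
    & (toy["Power"] >= 50)
    & (toy["Price"] > 0)
)
filtered = toy[mask].reset_index(drop=True)

print(f"kept {len(filtered)} of {len(toy)} rows")
```

On the real data, the `info()` outputs show 354369 rows before filtering and 301476 after, so the filters remove 52893 rows, or roughly 15%.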

In [20]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301476 entries, 0 to 301475
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        301476 non-null  object
 1   Price              301476 non-null  int64 
 2   VehicleType        301476 non-null  object
 3   RegistrationYear   301476 non-null  int64 
 4   Gearbox            301476 non-null  object
 5   Power              301476 non-null  int64 
 6   Model              301476 non-null  object
 7   Mileage            301476 non-null  int64 
 8   RegistrationMonth  301476 non-null  int64 
 9   FuelType           301476 non-null  object
 10  Brand              301476 non-null  object
 11  NotRepaired        301476 non-null  object
 12  DateCreated        301476 non-null  object
 13  PostalCode         301476 non-null  int64 
 14  LastSeen           301476 non-null  object
dtypes: int64(6), object(9)
memory usage: 34.5+ MB


In [21]:
#Dropping unnecessary columns
car_data = car_data.drop(columns=['DateCrawled', 'DateCreated', 'LastSeen', 'PostalCode'])

I noticed that many PostalCode values have only 4 digits. These entries are probably missing a 0 as the first or last digit. Regardless, I don't think the postal code is an important feature for determining a vehicle's value, so I dropped it along with the three date columns.
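
If the postal code were kept, the missing digit could be restored with zero-padding. This is only a sketch under an assumption the data does not confirm: that the codes are meant to have five digits and that only a leading zero was lost when the column was read as integers.

```python
import pandas as pd

# Three codes taken from the samples shown earlier in the notebook.
codes = pd.Series([96146, 7333, 1458], name="PostalCode")

# Assumption: five-digit codes whose leading zero was dropped on parsing.
padded = codes.astype(str).str.zfill(5)

print(padded.tolist())
```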

In [22]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301476 entries, 0 to 301475
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              301476 non-null  int64 
 1   VehicleType        301476 non-null  object
 2   RegistrationYear   301476 non-null  int64 
 3   Gearbox            301476 non-null  object
 4   Power              301476 non-null  int64 
 5   Model              301476 non-null  object
 6   Mileage            301476 non-null  int64 
 7   RegistrationMonth  301476 non-null  int64 
 8   FuelType           301476 non-null  object
 9   Brand              301476 non-null  object
 10  NotRepaired        301476 non-null  object
dtypes: int64(5), object(6)
memory usage: 25.3+ MB


In [23]:
car_data_encoded = pd.get_dummies(car_data, columns=['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'], drop_first=True)

Encoding the categorical features with one-hot encoding.
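
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows. For each encoded feature, pandas drops the first level in sorted order, which is why the real encoded frame ends with `NotRepaired_yes` and has no `NotRepaired_no` column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [500, 1200, 9000, 700],
    "Gearbox": ["manual", "auto", "unknown", "manual"],
    "NotRepaired": ["no", "yes", "no", "unknown"],
})

# One level per feature is dropped, which avoids perfectly collinear
# dummy columns in the linear model.
encoded = pd.get_dummies(toy, columns=["Gearbox", "NotRepaired"], drop_first=True)

print(encoded.columns.tolist())
```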

In [24]:
car_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301476 entries, 0 to 301475
Columns: 309 entries, Price to NotRepaired_yes
dtypes: int64(5), uint8(304)
memory usage: 98.9 MB


## Model Training

In [25]:
X = car_data_encoded.drop('Price', axis=1)
y = car_data_encoded['Price']
   
# Split data into training + validation set and test set
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=315)

# Further split the training set into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=315)
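
The two-stage split above yields a 60/20/20 partition: the first call holds out 20% for testing, and the second takes 25% of the remaining 80% (another 20% of the total) for validation. A quick sketch on synthetic arrays confirms the arithmetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows make the proportions easy to read off.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

X_full, X_test, y_full, y_test = train_test_split(X, y, test_size=0.2, random_state=315)
X_train, X_valid, y_train, y_valid = train_test_split(X_full, y_full, test_size=0.25, random_state=315)

print(len(X_train), len(X_valid), len(X_test))
```

This agrees with the 180885 training rows that LightGBM reports later, which is 60% of the 301476 rows left after cleaning.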
   

In [26]:
numerical_features = ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth']

scaler = StandardScaler()

# Fit the scaler on the training data
scaler.fit(X_train[numerical_features])

# Transform the training, validation, and test data
X_train[numerical_features] = scaler.transform(X_train[numerical_features])
X_valid[numerical_features] = scaler.transform(X_valid[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train[numerical_features] = scaler.transform(X_train[numerical_features])
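
The warning appears because pandas cannot tell whether the frames returned by `train_test_split` are views of the original frame or independent copies. One common way to avoid it, sketched here on made-up data rather than taken from this notebook, is to take an explicit `.copy()` of each split before assigning the scaled columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "Power": [60, 75, 90, 105, 120, 150, 170, 200],
    "Mileage": [150000, 125000, 150000, 90000, 60000, 30000, 20000, 5000],
    "Gearbox_manual": [1, 1, 0, 1, 0, 0, 1, 0],
})
numerical = ["Power", "Mileage"]

X_train, X_valid = train_test_split(X, test_size=0.25, random_state=315)

# Explicit copies make each split an independent frame, so the assignments
# below modify it directly instead of a possible view of X.
X_train, X_valid = X_train.copy(), X_valid.copy()

# Fit on the training rows only, then apply the same statistics everywhere.
scaler = StandardScaler().fit(X_train[numerical])
X_train[numerical] = scaler.transform(X_train[numerical])
X_valid[numerical] = scaler.transform(X_valid[numerical])

print(X_train[numerical].mean().round(6).tolist())
```

The original frame `X` is left unchanged, and the scaler never sees the validation rows during fitting.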

In [27]:
# Linear Regression
lin_reg = LinearRegression()
#measure train time
start_train = time.time()
lin_reg.fit(X_train, y_train)
end_train = time.time()
training_time = end_train - start_train
#measure prediction time
start_pred = time.time()
y_pred = lin_reg.predict(X_valid)
end_pred = time.time()
prediction_time = end_pred - start_pred
# Calculate RMSE
rmse_lin = mean_squared_error(y_valid, y_pred, squared=False)

print(f"Linear Regression RMSE: {rmse_lin:.2f}")
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")
print()

# Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=315)  
# Measure training time
start_train = time.time()
tree_reg.fit(X_train, y_train)
end_train = time.time()
training_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
y_pred = tree_reg.predict(X_valid)
end_pred = time.time()
prediction_time = end_pred - start_pred

# Calculate RMSE
rmse_tree = mean_squared_error(y_valid, y_pred, squared=False)

print(f"Decision Tree RMSE: {rmse_tree:.2f}")
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")
print()

# Random Forest Regressor
#forest_reg = RandomForestRegressor(random_state=315, n_estimators=50) 
# Measure training time
#start_train = time.time()
#forest_reg.fit(X_train, y_train)
#end_train = time.time()
#training_time = end_train - start_train

# Measure prediction time
#start_pred = time.time()
#y_pred = forest_reg.predict(X_valid)
#end_pred = time.time()
#prediction_time = end_pred - start_pred

# Calculate RMSE
#rmse_forest = mean_squared_error(y_valid, y_pred, squared=False)

#print(f"Random Forest RMSE: {rmse_forest:.2f}")
#print(f"Training time: {training_time:.4f} seconds")
#print(f"Prediction time: {prediction_time:.4f} seconds")

Linear Regression RMSE: 2785.57
Training time: 6.1357 seconds
Prediction time: 0.1048 seconds

Decision Tree RMSE: 2045.47
Training time: 3.2994 seconds
Prediction time: 0.0635 seconds



In the initial run of the models, both the Decision Tree and the Random Forest achieved a lower RMSE than Linear Regression. This is expected, since Linear Regression serves as a sanity check here. The Random Forest block is commented out in the cell above, so its output is not shown; in that initial run, its RMSE was noticeably lower than the Decision Tree's. Training and prediction time also need to be taken into account, though. The Random Forest took over a minute to train, whereas the other two models took a matter of seconds. Its prediction time of 1.05 seconds is not long, but it is still far longer than the fraction of a second the other two models needed.
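
Since the same fit, predict, and score pattern is repeated for every model, it could be wrapped in a small helper. The sketch below is one possible shape, not the code used in this notebook: `time_model` is a hypothetical name, the data is synthetic, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals. RMSE is computed as the square root of the MSE, which does not depend on the `squared` argument.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def time_model(model, X_train, y_train, X_eval, y_eval):
    """Fit and score one model; return RMSE plus wall-clock timings in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_eval)
    predict_time = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_eval, y_pred)))
    return {"rmse": rmse, "train_s": train_time, "predict_s": predict_time}


# Synthetic data stands in for the car features.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=315)
X_train, X_eval, y_train, y_eval = X[:400], X[400:], y[:400], y[400:]

results = {
    "Linear Regression": time_model(LinearRegression(), X_train, y_train, X_eval, y_eval),
    "Decision Tree": time_model(DecisionTreeRegressor(random_state=315), X_train, y_train, X_eval, y_eval),
}
for name, res in results.items():
    print(f"{name}: RMSE {res['rmse']:.2f}, train {res['train_s']:.4f} s, predict {res['predict_s']:.4f} s")
```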

### Decision Tree Regressor Model

In [28]:
#for depth in range(1, 20):
#    tree = DecisionTreeRegressor(random_state=315, max_depth=depth)
#    tree.fit(X_train, y_train)
#    y_pred = tree.predict(X_valid)
#    rmse_tree = mean_squared_error(y_valid, y_pred, squared=False)
#    print('max_depth =', depth, ': ', end='')
#    print(f'Decision Tree RMSE: {rmse_tree:.2f}')

After testing max_depth values from 1 to 19 (the loop above is commented out), I determined that max_depth=16 gives the best Decision Tree model on the validation set.
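
The depth search can be sketched so that it records every score and picks the minimum automatically instead of relying on reading the printed list. This runs on synthetic data, so the depth it selects says nothing about the car data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=315)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=315)

# Validation RMSE for each candidate depth.
rmse_by_depth = {}
for depth in range(1, 20):
    tree = DecisionTreeRegressor(random_state=315, max_depth=depth)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_valid)
    rmse_by_depth[depth] = float(np.sqrt(mean_squared_error(y_valid, pred)))

best_depth = min(rmse_by_depth, key=rmse_by_depth.get)
print(f"best max_depth = {best_depth}, RMSE = {rmse_by_depth[best_depth]:.2f}")
```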

In [29]:
tree_reg = DecisionTreeRegressor(random_state=315, max_depth=16)
tree_reg.fit(X_train, y_train)
y_pred = tree_reg.predict(X_valid)
rmse_tree = mean_squared_error(y_valid, y_pred, squared=False)
print(f'Decision Tree RMSE: {rmse_tree:.2f}')

Decision Tree RMSE: 1895.55


In addition to tuning max_depth, I tried tuning the splitter and criterion hyperparameters, but the default values for both gave the best version of this model.

### Random Forest Regressor Model 

In [30]:
#best_rmse = 1650
#best_est = 0
#for est in range(90, 101):
#    rando = RandomForestRegressor(random_state=315, n_estimators=est)
#    rando.fit(X_train, y_train)
#    y_pred = rando.predict(X_valid)
#    rmse_forest = mean_squared_error(y_valid, y_pred, squared=False)
#    if rmse_forest < best_rmse:
#        best_rmse = rmse_forest
#        best_est = est
#print('RMSE of the best model is (n_estimators = {}): {}'.format(best_est, best_rmse))

After running multiple models with different n_estimators values (the code above is commented out to save time), I settled on n_estimators=50 for the RandomForestRegressor model. Higher values gave slightly lower RMSE scores, but only by a small margin, so 50 keeps the model accurate without slowing it down too much.
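
Retraining a forest from scratch for every candidate size is what makes this search slow. One alternative, which is not what was done in this notebook, is scikit-learn's `warm_start` option: each `fit` call keeps the trees already grown and only adds the missing ones. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=315)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=315)

# warm_start=True reuses existing trees when n_estimators is increased.
forest = RandomForestRegressor(random_state=315, warm_start=True)

rmse_by_size = {}
for n_trees in (10, 25, 50, 100):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_train, y_train)
    pred = forest.predict(X_valid)
    rmse_by_size[n_trees] = float(np.sqrt(mean_squared_error(y_valid, pred)))

for n_trees, rmse in rmse_by_size.items():
    print(f"n_estimators = {n_trees}: RMSE {rmse:.2f}")
```

Plotting or printing the RMSE per forest size this way shows where the improvement flattens out, which is the trade-off described above.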

In [31]:
#for depth in range (1,11):
#    rando = RandomForestRegressor(random_state=315, max_depth=depth)
#    rando.fit(X_train, y_train)
#    y_pred = rando.predict(X_valid)
#    rmse_forest = mean_squared_error(y_valid, y_pred, squared=False)
#    print('max_depth =', depth, ': ', end='')
#    print(f'Random Forest RMSE: {rmse_forest:.2f}')

After running multiple models with different max_depth values (the code above is commented out to save time), I found that the default max_depth=None works best for the RandomForestRegressor model.

In [32]:
#forest_reg = RandomForestRegressor(random_state=315, n_estimators=50) 
#forest_reg.fit(X_train, y_train)
#y_pred = forest_reg.predict(X_valid)
#rmse_forest = mean_squared_error(y_valid, y_pred, squared=False)
#print(f'Random Forest RMSE: {rmse_forest:.2f}')

### LightGBM Model 

In [33]:
lgb_train = lgb.Dataset(X_train, y_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 57,
    'learning_rate': 0.5,
    'feature_fraction': 0.9
}
#measure training time 
start_train = time.time()
gbm = lgb.train(params, lgb_train, num_boost_round=100)
end_train = time.time()
training_time = end_train - start_train
#measure prediction time
start_pred = time.time()
y_pred_gbm = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
end_pred = time.time()
prediction_time = end_pred - start_pred
#calculate RMSE
rmse_gbm = mean_squared_error(y_valid, y_pred_gbm, squared=False)
print(f'LightGBM RMSE: {rmse_gbm:.2f}')
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 918
[LightGBM] [Info] Number of data points in the train set: 180885, number of used features: 286
[LightGBM] [Info] Start training from score 4870.838151
LightGBM RMSE: 1608.18
Training time: 2.7938 seconds
Prediction time: 0.4030 seconds


After experimenting with the LightGBM parameters, the configuration above is the one I would use moving forward. Raising num_leaves further gave even lower RMSE scores, but I did not want to risk overfitting the model.

In [34]:
# Linear Regression
lin_reg = LinearRegression()
#measure train time
start_train = time.time()
lin_reg.fit(X_train, y_train)
end_train = time.time()
training_time = end_train - start_train
#measure prediction time
start_pred = time.time()
y_pred = lin_reg.predict(X_test)
end_pred = time.time()
prediction_time = end_pred - start_pred
# Calculate RMSE
rmse_lin = mean_squared_error(y_test, y_pred, squared=False)

print(f"Linear Regression RMSE: {rmse_lin:.2f}")
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")
print()

# Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=315, max_depth=16)
# Measure training time
start_train = time.time()
tree_reg.fit(X_train, y_train)
end_train = time.time()
training_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
y_pred = tree_reg.predict(X_test)
end_pred = time.time()
prediction_time = end_pred - start_pred

# Calculate RMSE
rmse_tree = mean_squared_error(y_test, y_pred, squared=False)

print(f"Decision Tree RMSE: {rmse_tree:.2f}")
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")
print()

# Random Forest Regressor
forest_reg = RandomForestRegressor(random_state=315, n_estimators=50) 
# Measure training time
start_train = time.time()
forest_reg.fit(X_train, y_train)
end_train = time.time()
training_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
y_pred = forest_reg.predict(X_test)
end_pred = time.time()
prediction_time = end_pred - start_pred

# Calculate RMSE
rmse_forest = mean_squared_error(y_test, y_pred, squared=False)

print(f"Random Forest RMSE: {rmse_forest:.2f}")
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")
print()

# LightGBM 
lgb_train = lgb.Dataset(X_train, y_train)
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 57,
    'learning_rate': 0.5,
    'feature_fraction': 0.9
}
#measure training time 
start_train = time.time()
gbm = lgb.train(params, lgb_train, num_boost_round=100)
end_train = time.time()
training_time = end_train - start_train
#measure prediction time
start_pred = time.time()
y_pred_gbm = gbm.predict(X_test, num_iteration=gbm.best_iteration)
end_pred = time.time()
prediction_time = end_pred - start_pred
#calculate RMSE
rmse_gbm = mean_squared_error(y_test, y_pred_gbm, squared=False)
print(f'LightGBM RMSE: {rmse_gbm:.2f}')
print(f"Training time: {training_time:.4f} seconds")
print(f"Prediction time: {prediction_time:.4f} seconds")

Linear Regression RMSE: 2765.06
Training time: 6.3768 seconds
Prediction time: 0.1054 seconds

Decision Tree RMSE: 1904.65
Training time: 2.5349 seconds
Prediction time: 0.0511 seconds

Random Forest RMSE: 1619.84
Training time: 96.2357 seconds
Prediction time: 1.1498 seconds

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 918
[LightGBM] [Info] Number of data points in the train set: 180885, number of used features: 286
[LightGBM] [Info] Start training from score 4870.838151
LightGBM RMSE: 1617.93
Training time: 2.8288 seconds
Prediction time: 0.4050 seconds


The three other models all outperformed the Linear Regression model on the test set, which is what we were hoping for, since the Linear Regression model is meant to act as a sanity check.

## Conclusion
After building multiple models (Linear Regression, Decision Tree Regressor, Random Forest Regressor, and LightGBM) to determine the value of customers' vehicles, here are my findings. The three other models all outperformed Linear Regression, as anticipated; it was included as a sanity check. For context, the RMSE scores of the models on the test set were as follows:
- Linear Regression: 2765.06
- Decision Tree: 1904.65
- Random Forest: 1619.84
- LightGBM: 1617.93 

The two best models for determining the value of a customer's car are the Random Forest and LightGBM, which scored almost identically on RMSE (1619.84 vs. 1617.93). However, the Random Forest took far longer to train (96.24 seconds vs. 2.83 seconds) and to predict (1.15 seconds vs. 0.41 seconds). LightGBM is therefore the best choice for this task: it was the most accurate model, by a small margin, while also training and predicting quickly. The Decision Tree predicted fastest (0.05 seconds), but with a clearly higher RMSE.
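
For reference, the three criteria Rusty Bargain cares about can be put side by side in one table. The sketch below copies the test-set figures from the final evaluation output; the column names are my own.

```python
import pandas as pd

# Test-set RMSE and timings (seconds) copied from the final evaluation output.
results = pd.DataFrame(
    {
        "rmse": [2765.06, 1904.65, 1619.84, 1617.93],
        "train_s": [6.3768, 2.5349, 96.2357, 2.8288],
        "predict_s": [0.1054, 0.0511, 1.1498, 0.4050],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest", "LightGBM"],
)

ranked = results.sort_values("rmse")
speedup = results.loc["Random Forest", "train_s"] / results.loc["LightGBM", "train_s"]

print(ranked)
print(f"Random Forest takes about {speedup:.0f}x longer to train than LightGBM")
```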