# Introduction
The hypothetical Rusty Bargain used car sales service is developing an app to attract new customers. In that app, customers can quickly find out the market value of their car. We have access to historical data: technical specifications, trim versions, and prices. We will build a model to predict the car value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

### Initialize Libraries and First Looks at Dataframe

In [1]:
# Statistical libraries
import pandas as pd
import time
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OrdinalEncoder

# Machine Learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
import lightgbm as lgb

In [2]:
# Read Dataframe
df_raw = pd.read_csv('./datasets/car_data.csv')

# Quick look
df_raw.info()
display(df_raw.head(3))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47


### Clean the Dataframe

There are three datetime features which are currently in object (string) format -- 'DateCrawled', 'DateCreated', and 'LastSeen', referring respectively to the date the entry was pulled from the database, the date the entry was created, and the last date of user activity. We should convert these features to datetime.

There are missing values for 'VehicleType', 'Gearbox', 'Model', 'FuelType', and 'NotRepaired'. The first four of these features are categorical and thus difficult to fill in with representative data like a median. We don't want to drop that many entries, so we should create an "other" category for all NaN values. For the final feature, 'NotRepaired', we can instead create an "unknown" category, which better reflects why there is a missing value.

The only benefit to having a 'RegistrationMonth' is to combine it with 'RegistrationYear'. We shouldn't expect a vehicle registered in May of 1980 to be in the same meaningful category as a vehicle registered in May of 2021. We will combine these values into one column.

Finally, the 'PostalCode' column is tricky. It is stored as an integer, but realistically it must be treated as categorical, since a zip code of 90000 is not 9 times as influential on the price as 10000. It would be worrisome, however, to create up to 100,000 different categories. We can first reduce the number of categories by a factor of 100 by grouping zip codes by their post office number, which is represented by the first three digits. This creates a meaningful grouping of the postal codes. We should change this column to a three-digit string (or two-digit in the case of zip codes beginning with 0).

In [3]:
# Copy df_raw to new dataframe to preserve original data before editing
df_clean = df_raw.copy()

# Convert datetime
df_clean['DateCrawled'] = pd.to_datetime(df_clean['DateCrawled'], format='%d/%m/%Y %H:%M')
df_clean['DateCreated'] = pd.to_datetime(df_clean['DateCreated'], format='%d/%m/%Y %H:%M')
df_clean['LastSeen'] = pd.to_datetime(df_clean['LastSeen'], format='%d/%m/%Y %H:%M')

In [4]:
# Create 'other' / 'unknown' category filling in NaN values
df_clean[['VehicleType', 'Gearbox', 'Model', 'FuelType']] = df_clean[['VehicleType', 'Gearbox', 'Model', 'FuelType']].fillna('other')
df_clean['NotRepaired'] = df_clean['NotRepaired'].fillna('unknown')

In [5]:
# Collapse RegistrationYear and RegistrationMonth into a single number,
# where Jan 2020 = 2020.0 and Dec 2020 = 2020.916666
# (a RegistrationMonth of 0 lands one twelfth of a year before January of that year)
df_clean['Registration'] = df_clean['RegistrationYear'] + (df_clean['RegistrationMonth'] - 1) / 12
df_clean = df_clean.drop(['RegistrationYear', 'RegistrationMonth'], axis=1)
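As a quick check of the arithmetic, the sketch below applies the same year-plus-month formula to a few made-up rows (the values are illustrative, not a sample of the dataset). It also shows what happens when the month is 0: the result falls one twelfth of a year before January.

```python
import pandas as pd

# Hypothetical sample rows; a month of 0 is assumed here to be a placeholder value
sample = pd.DataFrame({
    'RegistrationYear': [2020, 2020, 1993],
    'RegistrationMonth': [1, 12, 0],
})

# Same formula as the cell above: year + (month - 1) / 12
sample['Registration'] = sample['RegistrationYear'] + (sample['RegistrationMonth'] - 1) / 12
print(sample)
```

A month of 0 explains fractional values such as 1992.916667 for a vehicle whose registration year is 1993.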

In [6]:
# Truncate the zip codes to their nearest post office identifier and cast as string
def trunc_zip(zipcode):
    return str(zipcode // 100)
df_clean['PostalCode'] = df_clean['PostalCode'].apply(trunc_zip).astype('string')
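The same truncation can be written without `apply`. The sketch below is an equivalent vectorized form, run on three postal codes from the preview above plus one made-up code (1067, standing in for a zip written as 01067). It is meant as an illustration rather than a replacement for the cell above.

```python
import pandas as pd

# Three codes from the preview plus a hypothetical one with a leading zero (stored as 1067)
codes = pd.Series([70435, 66954, 90480, 1067])

# Integer-divide by 100 to drop the last two digits, then cast to pandas' string dtype
truncated = (codes // 100).astype('string')
print(truncated.tolist())
```

Because the codes are stored as integers, a zip beginning with 0 truncates to a two-character string.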

In [7]:
df_clean.describe()

Unnamed: 0,DateCrawled,Price,Power,Mileage,DateCreated,NumberOfPictures,LastSeen,Registration
count,354369,354369.0,354369.0,354369.0,354369,354369.0,354369,354369.0
mean,2016-03-21 12:57:41.165057280,4416.656776,110.094337,128211.172535,2016-03-20 19:12:07.753274112,0.0,2016-03-29 23:50:30.593703680,2004.627335
min,2016-03-05 14:06:00,0.0,0.0,5000.0,2014-03-10 00:00:00,0.0,2016-03-05 14:15:00,999.916667
25%,2016-03-13 11:52:00,1050.0,69.0,125000.0,2016-03-13 00:00:00,0.0,2016-03-23 02:50:00,1999.25
50%,2016-03-21 17:50:00,2700.0,105.0,150000.0,2016-03-21 00:00:00,0.0,2016-04-03 15:15:00,2003.5
75%,2016-03-29 14:37:00,6400.0,143.0,150000.0,2016-03-29 00:00:00,0.0,2016-04-06 10:15:00,2008.0
max,2016-04-07 14:36:00,20000.0,20000.0,150000.0,2016-04-07 00:00:00,0.0,2016-04-07 14:58:00,9999.5
std,,4514.158514,189.850405,37905.34153,,0.0,,90.224884


Unfortunately, it looks like the 'NumberOfPictures' column doesn't contain meaningful values, as every entry has 0 pictures. We should delete the column.

'Registration' has some odd outliers. We should only keep entries that are within the range from the first year vehicle registrations were used, 1901, to the year of this analysis, 2024.

Similarly, 'Power' has some odd outliers. We should only keep entries with more than 0 horsepower and no more than 5000 horsepower, a deliberately generous upper bound for a passenger vehicle.

Finally, a 'Price' of 0 makes no sense. We should only keep entries that are at or above, say, $5.

In [8]:
# Drop NumberOfPictures column 
df_clean = df_clean.drop(['NumberOfPictures'], axis=1)

In [9]:
# Keep entries where Registration is a logical value
df_clean = df_clean.query('Registration >= 1901 and Registration <= 2024')

In [10]:
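# Keep entries where Power is a logical value
# Note: '>= 0' still keeps rows with Power == 0; 'Power > 0' would apply the stricter bound described above
df_clean = df_clean.query('Power >= 0 and Power <= 5000')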

In [11]:
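# Keep entries where Price is a logical value
# Note: '>= 0' still keeps rows with Price == 0; 'Price >= 5' would apply the threshold described above
df_clean = df_clean.query('Price >= 0')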
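The thresholds described in the text (power above 0 and at most 5000, price of at least 5) can be expressed as strict filters. The following is a sketch on a made-up frame, not the filter that produced the row counts reported in this notebook.

```python
import pandas as pd

# Made-up rows covering the edge cases discussed above
toy = pd.DataFrame({
    'Price': [0, 4, 5, 480, 18300],
    'Power': [75, 0, 60, 0, 190],
    'Registration': [2004.5, 1999.0, 2010.25, 1992.916667, 2011.333333],
})

# Strict versions of the three range checks
filtered = toy.query(
    'Registration >= 1901 and Registration <= 2024 '
    'and Power > 0 and Power <= 5000 '
    'and Price >= 5'
)
print(filtered)
```

Only the rows with a positive power and a price of at least 5 survive; applying these bounds to the real data would remove more rows than the cells above do.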

### Drop Duplicates and Create Finalized Dataframe

In [12]:
# Check for duplicates
display(df_clean[df_clean.duplicated()])

Unnamed: 0,DateCrawled,Price,VehicleType,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,DateCreated,PostalCode,LastSeen,Registration
14266,2016-03-21 19:06:00,5999,small,manual,80,polo,125000,petrol,volkswagen,no,2016-03-21,655,2016-04-05 20:47:00,2009.333333
27568,2016-03-23 10:38:00,12200,bus,manual,125,zafira,40000,gasoline,opel,no,2016-03-23,266,2016-04-05 07:44:00,2011.750000
31599,2016-04-03 20:41:00,4950,wagon,auto,170,e_klasse,150000,gasoline,mercedes_benz,no,2016-04-03,484,2016-04-05 21:17:00,2003.250000
33138,2016-03-07 20:45:00,10900,convertible,auto,163,clk,125000,petrol,mercedes_benz,no,2016-03-07,612,2016-03-21 03:45:00,2005.333333
43656,2016-03-13 20:48:00,4200,sedan,manual,105,golf,150000,gasoline,volkswagen,no,2016-03-13,144,2016-03-13 20:48:00,2003.750000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349709,2016-04-03 20:52:00,700,small,manual,60,ibiza,150000,petrol,seat,yes,2016-04-03,62,2016-04-05 21:47:00,1999.916667
351555,2016-03-26 16:54:00,3150,bus,manual,86,transit,150000,gasoline,ford,no,2016-03-26,961,2016-04-02 07:47:00,2003.833333
352384,2016-03-15 21:54:00,5900,wagon,manual,129,3er,150000,petrol,bmw,no,2016-03-15,925,2016-03-20 21:17:00,2006.916667
353057,2016-03-05 14:16:00,9500,small,manual,105,ibiza,40000,petrol,seat,no,2016-03-04,613,2016-04-05 19:18:00,2013.333333


In [13]:
# Drop duplicates
df_clean = df_clean.drop_duplicates()

In [14]:
# Final Dataframe
df = df_clean.copy()
del df_raw
del df_clean

# Display
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 353841 entries, 0 to 354368
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   DateCrawled   353841 non-null  datetime64[ns]
 1   Price         353841 non-null  int64         
 2   VehicleType   353841 non-null  object        
 3   Gearbox       353841 non-null  object        
 4   Power         353841 non-null  int64         
 5   Model         353841 non-null  object        
 6   Mileage       353841 non-null  int64         
 7   FuelType      353841 non-null  object        
 8   Brand         353841 non-null  object        
 9   NotRepaired   353841 non-null  object        
 10  DateCreated   353841 non-null  datetime64[ns]
 11  PostalCode    353841 non-null  string        
 12  LastSeen      353841 non-null  datetime64[ns]
 13  Registration  353841 non-null  float64       
dtypes: datetime64[ns](3), float64(1), int64(3), object(6), string(1)
memory u

### Vectorizing Categorical Variables: OHE, Label Encoding

The below models and algorithms rely on pre-encoding non-numeric features. For our baseline Linear Regression, we won't be tuning hyperparameters, and can afford resource-wise to OHE (One-Hot Encode) the smaller categorical features, while the two largest, 'Model' and 'PostalCode', are label encoded via sklearn's OrdinalEncoder. The Random Forest model reuses this feature set. For LightGBM, we will instead label encode every categorical feature with OrdinalEncoder. We should still look at the count of each feature's unique values just to understand how many categories we are encoding.

In [15]:
categorical_cols = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired', 'PostalCode']
ohe_cols = ['VehicleType', 'Gearbox', 'FuelType', 'Brand', 'NotRepaired']
labelencode_cols = ['Model', 'PostalCode']

# Display
print('UNIQUE VALUE COUNTS FOR EACH FEATURE')
for col in categorical_cols:
    print(col, len(df[col].value_counts()))

UNIQUE VALUE COUNTS FOR EACH FEATURE
VehicleType 8
Gearbox 3
Model 250
FuelType 7
Brand 40
NotRepaired 3
PostalCode 671


There are 671 unique 'PostalCode' zones (recall in an earlier step we resolved the PostalCode down to the post office, a meaningful division that will better group zip codes to their location). When we OHE below, we will label encode 'Model' and 'PostalCode', the two features with the most categories, instead of one-hot encoding them.
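To make the two encodings concrete, here is a small sketch on a made-up column. `pd.get_dummies` produces one indicator column per category (minus one with `drop_first=True`), while `OrdinalEncoder` maps each category to a single integer code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical column
toy = pd.DataFrame({'Gearbox': ['manual', 'auto', 'other', 'manual']})

# One-hot encoding: 3 categories -> 2 indicator columns after dropping the first
one_hot = pd.get_dummies(toy, columns=['Gearbox'], drop_first=True)

# Ordinal (label) encoding: categories are sorted, then numbered from 0
encoder = OrdinalEncoder(dtype=np.int64)
codes = encoder.fit_transform(toy[['Gearbox']]).ravel()

print(one_hot.columns.tolist())
print(codes.tolist())
```

One-hot encoding grows the feature count with the number of categories, which is why the 250 models and 671 postal zones are label encoded instead.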

## Model training

### Initializing, Eval Criteria

As you can see above, we have some datetime values in the dataframe to help us order, organize, and check for duplicates. However, none of the datetimes relate to the vehicle. We have the 'Registration' column to age the vehicle. Since the training of models should not depend on these datetime features, we will exclude them when we create our base feature set.

We will also create a helper function to tidy up repeated code that evaluates a model against its test dataset. The function will predict values based on that model and evaluate its performance, returning a DataFrame of information.

In [16]:
'''
IMPORTANT!

Grid searches take a very long time to compute over hyperparameters. 
The below boolean variable 'execute_grid' toggles the execution and evaluation of the various criteria

    Enable the below value to allow the grid searches to run. 
    Disable the below value to more efficiently debug and work with the below code.

''' 
execute_grid = True

In [17]:
# Initialize features and target, dropping unnecessary columns
cols_to_drop = ['Price', 'DateCrawled', 'DateCreated', 'LastSeen']
features = df.drop(cols_to_drop, axis=1)
target = df['Price']

In [18]:
def eval_model(model_name, model, X, y, Xtest, ytest, fit_params=None):
    '''
    This function takes as parameters the name of the model being used (str), an untrained model object, 
    training data (features X and target y), test data (Xtest and ytest), and optional keyword
    arguments (fit_params) that are passed through to model.fit().
    
    It returns a one-row dataframe that can be used later to concatenate with other evaluation dataframes.
    '''   
    # Avoid a mutable default argument
    fit_params = fit_params or {}
    
    # Train model and time execution
    start_train_time = time.time()
    model.fit(X, y, **fit_params)
    train_time = time.time() - start_train_time
    
    # Make model predictions and time execution
    start_predict_time = time.time()
    predictions = model.predict(Xtest)
    predict_time = time.time() - start_predict_time
    
    # Create return data
    return_dict = {
        'train_time' :    train_time,                                    # Execution time of training
        'predict_time' :  predict_time,                                  # Execution time of prediction
        'RMSE' :          mean_squared_error(ytest, predictions)**(1/2), # Root Mean Squared Error
        'R2' :            r2_score(ytest, predictions)                   # R2 (Coefficient of Determination)
    }
    
    return pd.DataFrame(return_dict, index=[model_name])
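For reference, the two metrics reported by `eval_model` can be computed by hand. The sketch below uses made-up numbers and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices
y_true = np.array([1000.0, 2500.0, 4000.0, 8000.0])
y_pred = np.array([1200.0, 2300.0, 4500.0, 7000.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, mean_squared_error(y_true, y_pred) ** 0.5)
print(r2_manual, r2_score(y_true, y_pred))
```

RMSE is expressed in the same units as the target (price), while R2 is unitless.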

### Baseline LR Model
We'll start with timing and evaluating a linear regression model for baseline execution. 

#### OHE All categorical features

In [19]:
# For this model, in the absence of tuning hyperparameters, we can OHE all categorical features besides our two largest
ohe_features = features.copy()
ohe_target = target.copy()
encoder = OrdinalEncoder(dtype=np.int64)

ohe_features = pd.get_dummies(
    ohe_features, columns=ohe_cols, drop_first=True
)
label_encoded = pd.DataFrame(encoder.fit_transform(ohe_features[labelencode_cols]), columns=labelencode_cols)
ohe_features = ohe_features.drop(labelencode_cols, axis=1)
ohe_features = ohe_features.join(label_encoded)

In [20]:
# Display
display(ohe_features.info())

<class 'pandas.core.frame.DataFrame'>
Index: 353841 entries, 0 to 354368
Data columns (total 61 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Power                    353841 non-null  int64  
 1   Mileage                  353841 non-null  int64  
 2   Registration             353841 non-null  float64
 3   VehicleType_convertible  353841 non-null  bool   
 4   VehicleType_coupe        353841 non-null  bool   
 5   VehicleType_other        353841 non-null  bool   
 6   VehicleType_sedan        353841 non-null  bool   
 7   VehicleType_small        353841 non-null  bool   
 8   VehicleType_suv          353841 non-null  bool   
 9   VehicleType_wagon        353841 non-null  bool   
 10  Gearbox_manual           353841 non-null  bool   
 11  Gearbox_other            353841 non-null  bool   
 12  FuelType_electric        353841 non-null  bool   
 13  FuelType_gasoline        353841 non-null  bool   
 14  FuelType_

None

In [21]:
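# Joining the encoded columns introduces 528 NaNs (353841 - 353313 rows) into 'Model' and 'PostalCode':
# label_encoded has a fresh 0..n-1 index, while ohe_features keeps its original index with gaps from
# the rows dropped during cleaning, so join() cannot match the highest index labels.
# We drop these indices from the features and from the target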
problem_indices = ohe_features[ohe_features['Model'].isna()].index
ohe_features = ohe_features.drop(problem_indices)
ohe_target = ohe_target.drop(problem_indices)

ohe_features[labelencode_cols] = ohe_features[labelencode_cols].astype('int64')
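The NaNs come from index alignment: `DataFrame.join` matches rows by index label, and a frame built from the encoder's array gets a fresh 0..n-1 index. The sketch below reproduces the effect on a made-up frame with a gap in its index, then shows that reusing the original index keeps the rows aligned. It illustrates the mechanism only and is not the code used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frame whose index has a gap (label 1 was dropped earlier)
toy = pd.DataFrame({'Power': [75, 190, 163], 'Model': ['golf', 'a4', 'grand']},
                   index=[0, 2, 3])
encoded = OrdinalEncoder(dtype=np.int64).fit_transform(toy[['Model']])

# Fresh RangeIndex (0, 1, 2): labels no longer match toy's index (0, 2, 3)
misaligned = toy.drop('Model', axis=1).join(pd.DataFrame(encoded, columns=['Model']))

# Reusing toy's index keeps every encoded value on its own row
aligned = toy.drop('Model', axis=1).join(
    pd.DataFrame(encoded, columns=['Model'], index=toy.index)
)

print(misaligned)
print(aligned)
```

In the misaligned result, the row labelled 2 also receives the code that belongs to a different row, so alignment matters beyond the NaNs.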

#### Split data on Train:Test at 4:1 ratio

In [22]:
# Our target ratio of train:test is 4:1
features_train, features_test, target_train, target_test = train_test_split(
    ohe_features, ohe_target, test_size=0.2, random_state=12345
)

#### Execute and evaluate Linear Regression

In [23]:
# Simple linear regression
lr_model = LinearRegression()
lr_model = lr_model.fit(features_train, target_train) # Fitted for reference only; eval_model below fits and times a fresh LinearRegression()

In [24]:
# Package performance data
lr_performance = eval_model('LinearRegression', LinearRegression(), features_train, target_train, features_test, target_test)
display(lr_performance)

Unnamed: 0,train_time,predict_time,RMSE,R2
LinearRegression,0.773429,0.037455,3035.034517,0.548652


### Random Forest Model

For this model we will use the same features and target as above, but with a Random Forest Regressor. Because we will be tuning hyperparameters, we will use GridSearchCV to iterate over hyperparameter options.

It is worth noting that this grid search algorithm utilizes a KFold cross-validation generator that we will specify within the call to GridSearchCV. We therefore won't split our data into training and validation outside the cross-validation. However, we will keep the test data separate from this training to evaluate performance after the best model is found.

#### Define Hyperparameter Ranges and Initialize

In [25]:
parameter_grid = {
    'n_estimators': [6],
    'max_depth': [4, 6, 8],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [2, 4, 6]
}
forest = RandomForestRegressor(
    random_state=12345, n_jobs = -1
)
cross_validator = KFold(
    n_splits=4, shuffle=True, random_state=12345
)
forest_GSCV = GridSearchCV(
    estimator=forest, param_grid=parameter_grid, scoring = 'neg_root_mean_squared_error', cv = cross_validator, verbose=3
)
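The size of the search follows from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small sketch using scikit-learn's `ParameterGrid` to count them:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the cell above
parameter_grid = {
    'n_estimators': [6],
    'max_depth': [4, 6, 8],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [2, 4, 6],
}
n_splits = 4  # folds used by the KFold cross-validator

n_candidates = len(ParameterGrid(parameter_grid))
print(n_candidates, n_candidates * n_splits)
```

Adding a second value to any list multiplies the number of fits, which is why the grid is kept small.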

#### Execute and Evaluate RF Regression

In [26]:
# Random Forest Grid Search
if execute_grid:
    forest_GSCV.fit(features_train, target_train)

Fitting 4 folds for each of 27 candidates, totalling 108 fits
[CV 1/4] END max_depth=4, min_samples_leaf=2, min_samples_split=2, n_estimators=6;, score=-2709.466 total time=   0.6s
[CV 2/4] END max_depth=4, min_samples_leaf=2, min_samples_split=2, n_estimators=6;, score=-2663.838 total time=   0.6s
[CV 3/4] END max_depth=4, min_samples_leaf=2, min_samples_split=2, n_estimators=6;, score=-2711.231 total time=   0.6s
[CV 4/4] END max_depth=4, min_samples_leaf=2, min_samples_split=2, n_estimators=6;, score=-2717.896 total time=   0.6s
[CV 1/4] END max_depth=4, min_samples_leaf=2, min_samples_split=4, n_estimators=6;, score=-2709.466 total time=   0.5s
[CV 2/4] END max_depth=4, min_samples_leaf=2, min_samples_split=4, n_estimators=6;, score=-2663.838 total time=   0.6s
[CV 3/4] END max_depth=4, min_samples_leaf=2, min_samples_split=4, n_estimators=6;, score=-2711.231 total time=   0.6s
[CV 4/4] END max_depth=4, min_samples_leaf=2, min_samples_split=4, n_estimators=6;, score=-2717.896 total

In [27]:
# Display
if execute_grid:
    print(forest_GSCV.best_params_)
    print(forest_GSCV.best_score_)

{'max_depth': 8, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 6}
-2194.8949066294317


In [28]:
# Package performance data
if execute_grid:
    best_params = forest_GSCV.best_params_
    rf_performance = eval_model('RandomForestRegression', 
                                RandomForestRegressor(random_state=12345, n_jobs=-1, **best_params), 
                                features_train, 
                                target_train, 
                                features_test, 
                                target_test)
    display(rf_performance)

Unnamed: 0,train_time,predict_time,RMSE,R2
RandomForestRegression,1.684583,0.031952,2195.229854,0.763874


#### Cleanup to Conserve Memory

In [29]:
del forest_GSCV, cross_validator, ohe_features, ohe_target

### LightGBM Model

For this model we will pass the original feature set to LightGBM but encode every categorical feature using label encoding. This is effective for LightGBM when we pass the 'categorical_feature' parameter at fit time and specify which features are actually categorical. We will again use GridSearchCV to iterate over hyperparameter options.

It is worth noting that this grid search algorithm utilizes a KFold cross-validation generator that we will specify within the call to GridSearchCV. We therefore won't split our data into training and validation outside the cross-validation.

#### Use OrdinalEncoder to Label Encode all Categorical Features

In [30]:
# For this model, due to the specifications of LightGBM, we should label encode all categorical features 
le_features = features.copy()
le_target = target.copy()
encoder = OrdinalEncoder(dtype=np.int64)

label_encoded = pd.DataFrame(encoder.fit_transform(le_features[categorical_cols]), columns=categorical_cols)
le_features = le_features.drop(categorical_cols, axis=1)
le_features = le_features.join(label_encoded)

In [31]:
# Display
display(le_features.head(5))
display(le_features.info())

Unnamed: 0,Power,Mileage,Registration,VehicleType,Gearbox,Model,FuelType,Brand,NotRepaired,PostalCode
0,0,150000,1992.916667,3.0,1.0,116.0,6.0,38.0,1.0,443.0
1,190,125000,2011.333333,2.0,1.0,166.0,2.0,1.0,2.0,417.0
2,163,125000,2004.583333,6.0,0.0,117.0,2.0,14.0,1.0,600.0
3,75,150000,2001.416667,5.0,1.0,116.0,6.0,38.0,0.0,605.0
4,69,90000,2008.5,5.0,1.0,101.0,2.0,31.0,0.0,371.0


<class 'pandas.core.frame.DataFrame'>
Index: 353841 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Power         353841 non-null  int64  
 1   Mileage       353841 non-null  int64  
 2   Registration  353841 non-null  float64
 3   VehicleType   353313 non-null  float64
 4   Gearbox       353313 non-null  float64
 5   Model         353313 non-null  float64
 6   FuelType      353313 non-null  float64
 7   Brand         353313 non-null  float64
 8   NotRepaired   353313 non-null  float64
 9   PostalCode    353313 non-null  float64
dtypes: float64(8), int64(2)
memory usage: 37.8 MB


None

In [32]:
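# Joining the encoded columns introduces 528 NaNs (353841 - 353313 rows, see the info() output above):
# label_encoded has a fresh 0..n-1 index, while le_features keeps its original index with gaps from
# the rows dropped during cleaning, so join() cannot match the highest index labels.
# We drop these indices from the features and from the target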
problem_indices = le_features[le_features['Model'].isna()].index
le_features = le_features.drop(problem_indices)
le_target = le_target.drop(problem_indices)

le_features[categorical_cols] = le_features[categorical_cols].astype('int64')

In [33]:
display(le_features.info())

<class 'pandas.core.frame.DataFrame'>
Index: 353313 entries, 0 to 353840
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Power         353313 non-null  int64  
 1   Mileage       353313 non-null  int64  
 2   Registration  353313 non-null  float64
 3   VehicleType   353313 non-null  int64  
 4   Gearbox       353313 non-null  int64  
 5   Model         353313 non-null  int64  
 6   FuelType      353313 non-null  int64  
 7   Brand         353313 non-null  int64  
 8   NotRepaired   353313 non-null  int64  
 9   PostalCode    353313 non-null  int64  
dtypes: float64(1), int64(9)
memory usage: 29.7 MB


None

#### Split data on Train:Test at 4:1 ratio

In [34]:
# Our target ratio of train:test is 4:1
features_train, features_test, target_train, target_test = train_test_split(
    le_features, le_target, test_size=0.2, random_state=12345
)

#### Define Hyperparameter Ranges and Initialize

In [35]:
parameter_grid = {
    'num_leaves': [30, 60, 90],
    'max_depth': [1, 5, 15],
    'subsample': [0.6, 0.8, 1.0],
    'learning_rate': [0.1, 0.01]
}
categorical_feature_indices = [features_train.columns.get_loc(col) for col in categorical_cols]
fit_params = {"categorical_feature": categorical_feature_indices}
lgb_model = lgb.LGBMRegressor(
    random_state=12345, n_jobs = -1
)
cross_validator = KFold(
    n_splits=4, shuffle=True, random_state=12345
)
lgbm_GSCV = GridSearchCV(
    lgb_model, parameter_grid, scoring = 'neg_root_mean_squared_error', cv = cross_validator, verbose=3
)
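The `categorical_feature_indices` list above holds column positions rather than names. A short sketch, assuming the column order shown in the `le_features` output, illustrates the mapping:

```python
import pandas as pd

# Column order as displayed for le_features above
columns = pd.Index(['Power', 'Mileage', 'Registration', 'VehicleType', 'Gearbox',
                    'Model', 'FuelType', 'Brand', 'NotRepaired', 'PostalCode'])
categorical_cols = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand',
                    'NotRepaired', 'PostalCode']

# Index.get_loc returns the integer position of each column label
categorical_feature_indices = [columns.get_loc(col) for col in categorical_cols]
print(categorical_feature_indices)
```

The three numeric features occupy positions 0 to 2, so the categorical features take the remaining positions.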

#### Execute and Evaluate LightGBM Regression

In [36]:
# LightGBM Grid Search
if execute_grid:
    lgbm_GSCV.fit(features_train, target_train, **fit_params)

Fitting 4 folds for each of 54 candidates, totalling 216 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005724 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1463
[LightGBM] [Info] Number of data points in the train set: 211987, number of used features: 10
[LightGBM] [Info] Start training from score 4406.527721
[CV 1/4] END learning_rate=0.1, max_depth=1, num_leaves=30, subsample=0.6;, score=-2682.268 total time=   0.3s
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004110 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1463
[LightGBM] [Info] Number of data points in the train set: 211987, number of used features: 10
[LightGBM] [Info] Start training from score 4420.314382
[CV 2/4

In [37]:
# Display
if execute_grid:
    print(lgbm_GSCV.best_params_)
    print(lgbm_GSCV.best_score_)

{'learning_rate': 0.1, 'max_depth': 15, 'num_leaves': 90, 'subsample': 0.6}
-2257.7429016732085


In [38]:
# Package performance data
if execute_grid:
    best_params = lgbm_GSCV.best_params_
    lgbm_performance = eval_model(
        'LightGBMRegression', 
        lgb.LGBMRegressor(random_state=12345, n_jobs = -1, **best_params), 
        features_train, 
        target_train,
        features_test, 
        target_test,
        fit_params=fit_params
    )
    display(lgbm_performance)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007624 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1463
[LightGBM] [Info] Number of data points in the train set: 282650, number of used features: 10
[LightGBM] [Info] Start training from score 4415.711024


Unnamed: 0,train_time,predict_time,RMSE,R2
LightGBMRegression,1.343525,0.098514,2256.171422,0.750582


## Model Analysis and Takeaways

In [39]:
if execute_grid:
    all_performance = pd.concat([lr_performance, rf_performance, lgbm_performance])
    display(all_performance)

Unnamed: 0,train_time,predict_time,RMSE,R2
LinearRegression,0.773429,0.037455,3035.034517,0.548652
RandomForestRegression,1.684583,0.031952,2195.229854,0.763874
LightGBMRegression,1.343525,0.098514,2256.171422,0.750582


The above table shows each model's training time (train_time) and the execution time of its .predict() call on the test features (predict_time), both in seconds. The predictions are compared to the true target values for the test data to give the RMSE and R2 values.

Predictably, the error is highest for the Linear Regression model. The regression line only captures about 55% of the variation in the data and is therefore only moderately accurate.

The other two models have lower error than the Linear Regression model and capture roughly 75% to 76% of the variation in the data. Because these scores are computed on the held-out test set, they indicate how well the models generalize rather than how closely they fit the training data.

The training time for RandomForestRegression in our environment was about 1.68 seconds, while the training time for LightGBMRegression was about 1.34 seconds.

The prediction times, however, tell a different story. Prediction on the test set is roughly three times faster for RandomForestRegression (about 0.03 seconds) than for LightGBMRegression (about 0.10 seconds).

The above task, to fit a model to user car data in order to best predict car value, is likely best performed by the RandomForestRegression, since the model has the lowest prediction time and the lowest error value. Its training time is the longest of the three, but at under two seconds the extra training time is worth the boost in performance.
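The comparisons above can be checked directly from the reported numbers. The sketch below rebuilds the results table from the values shown and computes each model's prediction time relative to the fastest one; the `relative_predict_time` column name is introduced here for illustration only.

```python
import pandas as pd

# Values copied from the results table above
all_performance = pd.DataFrame(
    {
        'train_time': [0.773429, 1.684583, 1.343525],
        'predict_time': [0.037455, 0.031952, 0.098514],
        'RMSE': [3035.034517, 2195.229854, 2256.171422],
        'R2': [0.548652, 0.763874, 0.750582],
    },
    index=['LinearRegression', 'RandomForestRegression', 'LightGBMRegression'],
)

# Prediction time as a multiple of the fastest model's prediction time
all_performance['relative_predict_time'] = (
    all_performance['predict_time'] / all_performance['predict_time'].min()
)
print(all_performance[['predict_time', 'relative_predict_time']].round(2))
print('Lowest RMSE:', all_performance['RMSE'].idxmin())
```

Timings like these depend on the machine and on a single run, so the ratios are best read as rough guides.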