# Introduction

Rusty Bargain, a used car company, is developing an app to help customers predict the market value of their own car. By using historical data on technical specifications and prices, they hope to implement a model that is accurate and time efficient.

## Data Description

Rusty Bargain has provided the following data:

**Features**

- `DateCrawled` - date profile was downloaded from the database
- `VehicleType` - vehicle body type
- `RegistrationYear` - vehicle registration year
- `Gearbox` - gearbox type
- `Power` - power (hp)
- `Model` - vehicle model
- `Mileage` - mileage (measured in km due to dataset's regional specifics)
- `RegistrationMonth` - vehicle registration month
- `FuelType` - fuel type
- `Brand` - vehicle brand
- `NotRepaired` - vehicle repaired or not
- `DateCreated` - date of profile creation
- `NumberOfPictures` - number of vehicle pictures
- `PostalCode` - postal code of profile owner (user)
- `LastSeen` - date of the last activity of the user

**Target**

- `Price` - price (Euro)

## Process

The process will include the following three steps:
1. Data Preparation
2. Model Training
3. Model Analysis

### Preparation

The data will first be prepared. This will include:
- Importing packages
- Reading the dataframe
- Inspecting the dataframe
- Converting datatypes
- Dropping unnecessary columns
- Handling missing data
- Encoding data

### Training

Training will be done on four different models:

- Linear regression (1 model)
- Random forest (1 model)
- Gradient boosting (2 models)

Each model requires its own data split and is trained with its own hyperparameters.

### Analysis

The models will be compared on accuracy and speed. Accuracy will be measured with the root mean squared error (RMSE); the lower the value, the better. Speed will be measured by how long each model takes to train and predict, along with how long it takes to search through the hyperparameters to find the best model.
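
As a minimal sketch of the metric, RMSE is the square root of the mean of the squared differences between true and predicted values. The helper name `rmse` and the toy prices below are illustrative assumptions, not part of the project data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# toy prices in euros (made up for illustration)
actual = [1000, 2000, 3000]
predicted = [1100, 1900, 3200]

print(round(rmse(actual, predicted), 2))  # 141.42
```

Because the errors are squared before averaging, a single large miss (the 200 euro error here) weighs more heavily than several small ones.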

# Data preparation

Import relevant packages and save dataframe.

## Import and Read

In [2]:
import pandas as pd 
import numpy as np
import re # to re-format column names
from datetime import date # to calculate age
from sklearn.model_selection import train_test_split # to split data
from sklearn.preprocessing import OneHotEncoder # for encoding
from sklearn.preprocessing import OrdinalEncoder # for encoding
import time # to calculate execution time
from sklearn.linear_model import LinearRegression # for linear regression modelling
from sklearn.metrics import mean_squared_error # to calculate MSE
from sklearn.ensemble import RandomForestRegressor # for random forest modelling
import lightgbm as lgb # for lightGBM modelling
from catboost import CatBoostRegressor # for catboost modelling

In [3]:

df = pd.read_csv('data/car_data.csv')

## Inspect Columns

Inspect column information to ensure column names are appropriate and that columns contain the correct datatype.

In [3]:
# check head of dataframe
df.head()

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |


### Column Names

Column names are currently in `CamelCase`, which is hard to read. They will be converted to `snake_case`.

In [4]:
# change columns from CamelCase to snake_case
df.columns = [re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower() for col in df.columns]
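
To see what the regular expression does, the same pattern can be tried on a few names in isolation. This is a small sketch; the helper name `camel_to_snake` is an assumption introduced for illustration.

```python
import re

def camel_to_snake(name):
    # insert an underscore before every capital letter except one at the
    # very start of the string, then lowercase the result
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

for name in ['DateCrawled', 'NumberOfPictures', 'Price']:
    print(name, '->', camel_to_snake(name))
# DateCrawled -> date_crawled
# NumberOfPictures -> number_of_pictures
# Price -> price
```

One caveat: the pattern splits before every capital, so a name containing an acronym such as `VIN` would become `v_i_n`. None of the column names in this dataset are affected.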

### Data Types

In [5]:
# check info to inspect data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes:

#### Date Columns

In [6]:
# save date columns to list
date_cols = ['date_crawled', 'date_created', 'last_seen']

# parse dates in correct format
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format = '%d/%m/%Y %H:%M')
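
The explicit format string matters because the timestamps are day-first (values such as `24/03/2016` can only be read that way), while others are ambiguous. A short standard-library sketch, using one value from the `LastSeen` preview, shows how the two readings differ:

```python
from datetime import datetime

raw = '07/04/2016 03:16'  # value taken from the LastSeen column preview

# day-first reading, matching the format used for the dataframe
day_first = datetime.strptime(raw, '%d/%m/%Y %H:%M')

# month-first reading, shown only to illustrate the ambiguity
month_first = datetime.strptime(raw, '%m/%d/%Y %H:%M')

print(day_first.date())    # 2016-04-07
print(month_first.date())  # 2016-07-04
```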

## Drop Duplicates

In [7]:
# find number of duplicate values 
df.duplicated().sum()

262

In [8]:
# drop duplicates
df = df.drop_duplicates()

## Inspect Values

For date, integer and float columns, inspect outliers using summary statistics. Compare minimum and maximum values with what is logically possible. For instance, all prices should be positive.

### Summary Statistics

In [9]:
# inspect summary statistics for non-categorical data
df.describe()

| | date_crawled | price | registration_year | power | mileage | registration_month | date_created | number_of_pictures | postal_code | last_seen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 354107 | 354107.0 | 354107.0 | 354107.0 | 354107.0 | 354107.0 | 354107 | 354107.0 | 354107.0 | 354107 |
| mean | 2016-03-21 12:56:48.735947008 | 4416.433287 | 2004.235355 | 110.089651 | 128211.811684 | 5.714182 | 2016-03-20 19:11:13.738728960 | 0.0 | 50507.14503 | 2016-03-29 23:51:12.374903808 |
| min | 2016-03-05 14:06:00 | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 2014-03-10 00:00:00 | 0.0 | 1067.0 | 2016-03-05 14:15:00 |
| 25% | 2016-03-13 11:52:00 | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 2016-03-13 00:00:00 | 0.0 | 30165.0 | 2016-03-23 02:50:00 |
| 50% | 2016-03-21 17:50:00 | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 2016-03-21 00:00:00 | 0.0 | 49406.0 | 2016-04-03 15:15:00 |
| 75% | 2016-03-29 14:36:00 | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 2016-03-29 00:00:00 | 0.0 | 71083.0 | 2016-04-06 10:06:00 |
| max | 2016-04-07 14:36:00 | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 2016-04-07 00:00:00 | 0.0 | 99998.0 | 2016-04-07 14:58:00 |
| std | NaN | 4514.338584 | 90.261168 | 189.914972 | 37906.590101 | 3.726682 | NaN | 0.0 | 25784.212094 | NaN |


#### Registration Year
The registration year column contains implausible values: some cars are listed as registered long before cars were invented (the minimum is 1000), others far in the future relative to when the data was collected (the maximum is 9999). These years will be replaced with null values.

In [10]:
# keep registration years strictly between 1900 and 2016; replace all others with NaN
df['registration_year'] = df['registration_year'].apply(lambda x: x if 1900 < x < 2016 else np.nan)

#### Power
A quick search suggests that the most powerful production cars in 2016 topped out at around 1,500 hp. A listed power of 0 is equally implausible. Values outside this range will be replaced with null values.

In [11]:
# keep horsepower values strictly between 1 and 1500; replace all others with NaN
df['power'] = df['power'].apply(lambda x: x if 1 < x < 1500 else np.nan)
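
The boundary behaviour of this filter is easy to check on a handful of values. The sketch below uses `Series.where` as a vectorised stand-in for the lambda (an alternative, not what the notebook runs); the toy values are assumptions chosen to sit on and around the thresholds.

```python
import pandas as pd

# toy horsepower values on and around the thresholds (illustrative only)
power = pd.Series([0, 1, 75, 1499, 1500, 20000])

# keep values strictly between 1 and 1500; everything else becomes NaN
cleaned = power.where((power > 1) & (power < 1500))

print(cleaned.tolist())  # [nan, nan, 75.0, 1499.0, nan, nan]
```

Note that both endpoints are excluded, so a listed power of exactly 1 or exactly 1500 is also treated as missing.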

#### Pictures
Every row in the dataset contains zero pictures. This column will be dropped.

In [12]:
# drop picture column
df = df.drop('number_of_pictures', axis = 1)

#### Registration Month

Thirteen values have been provided (0-12) when only 12 months exist. This column will be dropped as it is not possible to distinguish where the months start and end.

In [13]:
# drop registration month column
df = df.drop('registration_month', axis = 1)

#### Date Crawled

Now that dates have been confirmed, this column can be dropped as it has no impact on the modelling process.

In [14]:
# drop date_crawled column
df = df.drop('date_crawled', axis = 1)

#### Last Seen

Whilst the date on which a listing was last seen may indicate how much traffic the page has received, it is not useful for our analysis. We will drop this column.

In [15]:
# drop last_seen column
df = df.drop('last_seen', axis = 1)

#### Date Created

Datetime values cannot be used directly as a continuous feature. Instead, we will extract the age of each ad in days, measured back from the most recent creation date in the dataset. The `date_created` column will then be dropped.

In [16]:
# find time elapsed since each ad was created, relative to the newest ad
df['age'] = (df['date_created'].max() - df['date_created'])

# convert age to whole days
df['age'] = df['age'].dt.days

# drop date_created column
df = df.drop('date_created', axis = 1)

### Categorical columns

Categorical columns will be inspected for typos and inconsistencies by finding their unique values.

In [17]:
# find categorical columns
categorical = df.select_dtypes(include=['object']).columns

# print unique values of categorical columns
for col in categorical:
    unique_values = np.sort(df[col].unique().astype(str))
    print(col, '\n', unique_values,'\n')

vehicle_type 
 ['bus' 'convertible' 'coupe' 'nan' 'other' 'sedan' 'small' 'suv' 'wagon'] 

gearbox 
 ['auto' 'manual' 'nan'] 

model 
 ['100' '145' '147' '156' '159' '1_reihe' '1er' '200' '2_reihe' '300c'
 '3_reihe' '3er' '4_reihe' '500' '5_reihe' '5er' '601' '6_reihe' '6er'
 '7er' '80' '850' '90' '900' '9000' '911' 'a1' 'a2' 'a3' 'a4' 'a5' 'a6'
 'a8' 'a_klasse' 'accord' 'agila' 'alhambra' 'almera' 'altea' 'amarok'
 'antara' 'arosa' 'astra' 'auris' 'avensis' 'aveo' 'aygo' 'b_klasse'
 'b_max' 'beetle' 'berlingo' 'bora' 'boxster' 'bravo' 'c1' 'c2' 'c3' 'c4'
 'c5' 'c_klasse' 'c_max' 'c_reihe' 'caddy' 'calibra' 'captiva' 'carisma'
 'carnival' 'cayenne' 'cc' 'ceed' 'charade' 'cherokee' 'citigo' 'civic'
 'cl' 'clio' 'clk' 'clubman' 'colt' 'combo' 'cooper' 'cordoba' 'corolla'
 'corsa' 'cr_reihe' 'croma' 'crossfire' 'cuore' 'cx_reihe' 'defender'
 'delta' 'discovery' 'doblo' 'ducato' 'duster' 'e_klasse' 'elefantino'
 'eos' 'escort' 'espace' 'exeo' 'fabia' 'fiesta' 'focus' 'forester'
 'forfour' 

From inspection, no typos or errors were found in categorical columns.

## Missing Data

Check missing data as a percentage of given data. 

In [18]:
# find percentage of null values
round(df.isnull().sum().sort_values(ascending = False) * 100/len(df),2)

not_repaired         20.09
power                11.43
vehicle_type         10.59
fuel_type             9.29
registration_year     6.83
gearbox               5.60
model                 5.56
price                 0.00
mileage               0.00
brand                 0.00
postal_code           0.00
age                   0.00
dtype: float64

### Horse Power

Missing horsepower values can be filled with the median for the same brand and model of car. However, models listed as `other` can cover a range of different cars and will be excluded from the process.

Filling will be done by concatenating the brand and model columns. This matters because grouping by model alone would lump together every car classified as `other`, as well as models from different brands that happen to share a name. Models listed as `other` are excluded by replacing them with null values before grouping.

In [19]:
# combine brand and model columns
df['brand_model'] = df['brand'] + '_' + df['model'].replace('other', np.nan)

# groupby brand_model and find median power
median_power = df.groupby('brand_model')['power'].median()

# where power is null, replace with median power of brand_model
df['power'] = df['power'].fillna(df['brand_model'].map(median_power))

# drop brand_model column
df = df.drop('brand_model', axis = 1)
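
The effect of this fill can be seen on a toy frame. This is a sketch under assumed data (the five rows below are made up); it follows the same steps as the cell above but keeps the grouping key in a separate variable rather than a temporary column.

```python
import numpy as np
import pandas as pd

# toy data: one known model with a gap, and one 'other' model with a gap
cars = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'bmw', 'bmw'],
    'model': ['a4', 'a4', 'a4', 'other', 'other'],
    'power': [120.0, 150.0, np.nan, 200.0, np.nan],
})

# 'other' models get a NaN key, so groupby leaves them out of every group
key = cars['brand'] + '_' + cars['model'].replace('other', np.nan)
median_power = cars['power'].groupby(key).median()

# fill gaps with the group median; rows with a NaN key stay unfilled
cars['power'] = cars['power'].fillna(key.map(median_power))

print(cars['power'].tolist())  # [120.0, 150.0, 135.0, 200.0, nan]
```

The missing `audi_a4` value is filled with 135.0 (the median of 120 and 150), while the missing value for the `other` model stays null, which is why some null power values remain after this step.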

### Other Values

Some missing data columns include an option for `other` already. These include:

- model
- vehicle type
- fuel type

Null values in these columns will be filled with `other`.

Null values in `not_repaired` will also be filled with `other`, as they make up a large share of the data (about 20%) and should not be dropped.

In [20]:
# fill null values in columns with 'other' as an option
df['model'] = df['model'].fillna('other')
df['vehicle_type'] = df['vehicle_type'].fillna('other')
df['fuel_type'] = df['fuel_type'].fillna('other')

# fill null values in not_repaired column with 'other'
df['not_repaired'] = df['not_repaired'].fillna('other')

### Examine Remaining Null Values

In [21]:
# find percentage of null values
round(df.isnull().sum().sort_values(ascending = False) * 100/len(df),2)

registration_year    6.83
gearbox              5.60
power                2.69
price                0.00
vehicle_type         0.00
model                0.00
mileage              0.00
fuel_type            0.00
brand                0.00
not_repaired         0.00
postal_code          0.00
age                  0.00
dtype: float64

Drop remaining null values as only a small percentage remains.

In [22]:
# drop rows with null values
df = df.dropna()

# reset index
df = df.reset_index(drop = True)

## Pre-processing Data

### Encoding

Encoding ensures that models interpret categorical data appropriately, but different models handle it differently. In theory, everything could be one-hot encoded (OHE), but to limit the effects of high dimensionality (longer training time, overfitting), the data will be encoded separately for the linear regression, random forest and gradient boosting models.

One-hot encoding is appropriate for linear regression, whereas label encoding is more appropriate for the random forest regressor. The gradient boosting libraries handle categorical data natively and need no encoding, but they do require the categorical columns to be converted to a suitable datatype.
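
The three representations can be compared side by side on a tiny example. This is a sketch only: the frame `toy` and its four values are assumptions made for illustration, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# hypothetical single-column frame with three distinct categories
toy = pd.DataFrame({'fuel_type': ['petrol', 'gasoline', 'other', 'petrol']})

# one-hot: one 0/1 column per category (3 categories -> 3 columns)
one_hot = OneHotEncoder().fit_transform(toy[['fuel_type']]).toarray()
print(one_hot.shape)  # (4, 3)

# ordinal: one numeric code per category, assigned in sorted order
codes = OrdinalEncoder().fit_transform(toy[['fuel_type']])
print(codes.ravel().tolist())  # [2.0, 0.0, 1.0, 2.0]

# category dtype: values unchanged, stored against a fixed set of categories
as_category = toy['fuel_type'].astype('category')
print(as_category.cat.categories.tolist())  # ['gasoline', 'other', 'petrol']
```

One-hot encoding grows with the number of distinct values, which is why a column like `model` inflates the column count so sharply, whereas ordinal encoding and the category dtype keep a single column per feature.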

### One Hot Encoding for Linear Regression

OHE assigns a new column for each unique value in categorical columns. In this case this includes values from:

- `vehicle_type`
- `gearbox`
- `model`
- `fuel_type`
- `brand`
- `not_repaired`

Each will be 

In [23]:
# apply one hot encoding to categorical columns
one_hot = OneHotEncoder()

# fit one hot encoder to categorical columns
one_hot.fit(df[categorical])

# transform categorical columns
one_hot_arr = one_hot.transform(df[categorical]).toarray()

# create dataframe of one hot encoded columns
one_hot_df = pd.DataFrame(one_hot_arr, columns=one_hot.get_feature_names_out())

# drop categorical columns from original dataframe
df_ohe = df.drop(categorical, axis = 1)

# merge one hot encoded columns back onto the remaining columns by index
df_ohe = df_ohe.merge(one_hot_df, left_index=True, right_index=True)

# confirm row count matches original dataframe
df_ohe.shape

(308999, 315)

In [40]:
# create copy of df for ohe
df_ohe = df.copy()

# identify categorical columns
categorical = df_ohe.select_dtypes(include=['object']).columns.tolist()

# apply one hot encoding to categorical columns
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(df_ohe[categorical])

# convert the one hot encoded result into a DataFrame
one_hot_df = pd.DataFrame(one_hot_encoded.toarray(), columns=encoder.get_feature_names_out(categorical))

# drop categorical columns from original dataframe
df_ohe = df_ohe.drop(categorical, axis=1)

# concatenate one hot encoded columns with the remaining columns
df_ohe = pd.concat([df_ohe, one_hot_df], axis=1)

# confirm row count matches original dataframe
print(df_ohe.shape)

(308999, 315)


### Label Encoding for Random Forests

Apply label encoding to the categorical columns. This is done with `OrdinalEncoder`; note that the resulting numeric codes imply an order even though the values are nominal.

In [43]:
# create a copy of original dataframe
data_ordinal = df.copy()

# apply ordinal encoding to categorical columns
encoder = OrdinalEncoder()

# fit ordinal encoder to categorical columns
data_ordinal[categorical] = encoder.fit_transform(data_ordinal[categorical])

# confirm row count matches original dataframe
data_ordinal.shape

(308999, 12)

### Type Changing for Gradient Boosting Models

In [30]:
# create copy of dataframe for gradient boosting as df_gb
df_gb = df.copy()

# convert categorical features from object type to category type
for feature in categorical:
    df_gb[feature] = df_gb[feature].astype('category')
    
# show info of df_gb
df_gb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308999 entries, 0 to 308998
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   price              308999 non-null  int64   
 1   vehicle_type       308999 non-null  category
 2   registration_year  308999 non-null  float64 
 3   gearbox            308999 non-null  category
 4   power              308999 non-null  float64 
 5   model              308999 non-null  category
 6   mileage            308999 non-null  int64   
 7   fuel_type          308999 non-null  category
 8   brand              308999 non-null  category
 9   not_repaired       308999 non-null  category
 10  postal_code        308999 non-null  int64   
 11  age                308999 non-null  int64   
dtypes: category(6), float64(2), int64(4)
memory usage: 16.2 MB


# Model training

Now that the data has been pre-processed, the models will be trained. A function will be defined that:

1. splits the data into training, validation and test sets
2. splits features from targets

Each model will then be trained and tuned over its hyperparameters. Results will be reported as the root mean squared error (RMSE), which will be used to compare the four models. The linear regression model will be tested first and used as a baseline, as it has few hyperparameters to tune.

## Linear Regression



### Train, Validation and Test

As no separate test data has been provided, the data will be split into train, validation and test sets in a 60:20:20 ratio. To do so, first split the data 60:40 (train versus validation plus test), then split the 40% portion 50:50 into validation and test.

### Features and Targets

Within the same function, split each dataset into features and targets.

In [31]:
# create function that splits data into sets and by features/targets
def split_data(df):
    # split into train, validation, and test sets
    train, val_test = train_test_split(df, test_size = 0.4, random_state = 42)
    val, test = train_test_split(val_test, test_size = 0.5, random_state = 42)
    
    # split into features and targets
    features_train, target_train = train.drop('price', axis = 1), train['price']
    features_val, target_val = val.drop('price', axis = 1), val['price']
    features_test, target_test = test.drop('price', axis = 1), test['price']
    
    return features_train, target_train, features_val, target_val, features_test, target_test
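
A quick sanity check of the two-step split confirms the resulting proportions. The sketch below uses a plain list of 1,000 integers as a stand-in for dataframe rows (an assumption for illustration) and the same `test_size` and `random_state` values as the function above.

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for dataframe rows

# first split: 60% train, 40% held out
train, val_test = train_test_split(rows, test_size=0.4, random_state=42)

# second split: held-out 40% divided evenly into validation and test
val, test = train_test_split(val_test, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))  # 600 200 200
```

Since the second split only divides the held-out portion, no row can appear in more than one of the three sets.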

In [32]:
# split df_ohe for linear regression
features_train_ohe, target_train_ohe, features_val_ohe, \
target_val_ohe, features_test_ohe, target_test_ohe = split_data(df_ohe)

### Model Predictions and RMSE

In [44]:
# start time
start_time = time.time()

# initialise the model
model = LinearRegression()

# model training time start
model_training_start = time.time()

# fit the model
model.fit(features_train_ohe, target_train_ohe)

# model training time end
model_training_end = time.time()

# make predictions
predictions = model.predict(features_val_ohe)

# model prediction time end
model_prediction_end = time.time()

# calculate RMSE
rmse = mean_squared_error(target_val_ohe, predictions) ** 0.5

# print RMSE
print('RMSE:', round(rmse,2))

# calculate execution time
execution_time = model_prediction_end - start_time

# print execution time
print('Model training time:', round(model_training_end - model_training_start,2),'seconds')
print('Model prediction time:', round(model_prediction_end - model_training_end,2),'seconds')
print('Total execution time:', round(execution_time,2),'seconds')

RMSE: 2665.65
Model training time: 4.61 seconds
Model prediction time: 0.06 seconds
Total execution time: 4.67 seconds


## Random Forest Regressor

The random forest regressor will follow a similar process. The key differences are that it uses the label encoded dataframe and that a grid of hyperparameters (tree depth and number of estimators) is searched to find the lowest RMSE.


In [34]:
# split data_ordinal for random forest
features_train_ordinal, target_train_ordinal, features_val_ordinal, \
target_val_ordinal, features_test_ordinal, target_test_ordinal = split_data(data_ordinal)

In [36]:
# record start time
start_time = time.time()

# set hyperparameters
n_estimators = range(30, 71, 10)
max_depth = range(7, 13, 1)

# best rmse score (start at infinity so any model improves on it)
best_rmse = float('inf')

# loop through hyperparameters
for n in n_estimators:
    for d in max_depth:
        # start training time
        start_train_time = time.time()
        
        # initialise the model
        model = RandomForestRegressor(random_state=42, n_estimators=n, max_depth=d)
        
        # fit the model
        model.fit(features_train_ordinal, target_train_ordinal)
        
        # end train time
        end_train_time = time.time()

        # make predictions
        predictions = model.predict(features_val_ordinal)
        
        # end prediction time
        end_prediction_time = time.time()

        # calculate RMSE
        rmse = mean_squared_error(target_val_ordinal, predictions) ** 0.5

        # if rmse is lower than best_rmse, update best_rmse
        if rmse < best_rmse:
            best_rmse = rmse
            best_n = n
            best_d = d
            training_time = end_train_time - start_train_time
            predictions_time = end_prediction_time - end_train_time

# end time
end_time = time.time()

# total time spent searching all hyperparameter combinations
hyperparameter_time = end_time - start_time
            
# print best hyperparameters
print('RMSE:', round(best_rmse,2))
print('n_estimators:', best_n)
print('max_depth:', best_d)
print('Training time:', round(training_time,2), 'seconds')
print('Prediction time:', round(predictions_time,2), 'seconds')
print('Total execution time:', round(hyperparameter_time,2), 'seconds')

RMSE: 1781.69
n_estimators: 70
max_depth: 12
Training time: 21.86 seconds
Prediction time: 0.33 seconds
Total execution time: 395.01 seconds


## Gradient Boosting

Gradient boosting will use LightGBM and CatBoost models. Both will use the same data which will be split from the `df_gb` dataframe. 

In [47]:
# split data
features_train_gb, target_train_gb, features_val_gb, \
target_val_gb, features_test_gb, target_test_gb = split_data(df_gb)

### LightGBM

Hyperparameters that will be looped over include number of leaves, tree depth and learning rate.
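
To get a sense of the search cost, the grid can be enumerated up front. This sketch assumes the same ranges as the search cell in this section and uses `itertools.product`, which the notebook itself does not use (it nests three `for` loops instead).

```python
from itertools import product

# same ranges as the search cell in this section
n_leaves = range(50, 91, 10)
depth = range(7, 13, 1)
learning_rate = [0.1, 0.2, 0.3, 0.4, 0.5]

# every (num_leaves, max_depth, learning_rate) combination
grid = list(product(n_leaves, depth, learning_rate))

print(len(grid))  # 150
print(grid[0])    # (50, 7, 0.1)
```

Each of the 150 combinations trains a full model, so the total search time scales with the grid size multiplied by the per-model training time.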

In [55]:
# Set high RMSE to start
best_rmse = 10000

# Set best hyperparameters
best_params = {'num_leaves': None, 'max_depth': None, 'learning_rate': None}

# Set hyperparameters
n_leaves = range(50, 91, 10)
depth = range(7, 13, 1)
learning_rate = [0.1, 0.2, 0.3, 0.4, 0.5]

# Record start time
start_time = time.time()

# Loop through hyperparameters
for num_leaves in n_leaves:
    for max_depth in depth:
        for lr in learning_rate:
            # Define parameters
            params = {
                'task': 'train', 
                'boosting': 'gbdt',
                'objective': 'regression',
                'num_leaves': num_leaves,
                'max_depth': max_depth,
                'verbose': -1,
                'metric': 'rmse',
                'learning_rate': lr,
            }

            # start training time
            start_train_time = time.time()

            # load data into LightGBM dataset
            lgb_train = lgb.Dataset(features_train_gb, target_train_gb)
        
            # fit the model
            model = lgb.train(params, lgb_train, num_boost_round=1000)
            
            # end train time
            end_train_time = time.time()

            # make predictions
            predictions = model.predict(features_val_gb, num_iteration=model.best_iteration)
            
            # end prediction time
            end_prediction_time = time.time()

            # calculate RMSE
            rmse = mean_squared_error(target_val_gb, predictions) ** 0.5

            # If RMSE is lower than best_rmse, update best_rmse and best_params
            if rmse < best_rmse:
                best_rmse = rmse
                best_params['num_leaves'] = num_leaves
                best_params['max_depth'] = max_depth
                best_params['learning_rate'] = lr
                best_time = time.time() - start_time
                training_time = end_train_time - start_train_time
                predictions_time = end_prediction_time - end_train_time
                

# Print rmse and best hyperparameters
print('RMSE:', round(best_rmse, 2))
print('Best hyperparameters:', best_params)

# print execution times
print('Training time:', round(training_time, 2), 'seconds')
print('Prediction time:', round(predictions_time, 2), 'seconds')
print('Total execution time for all hyperparameters:', round(time.time() - start_time, 2), 'seconds')

RMSE: 1586.68
Best hyperparameters: {'num_leaves': 80, 'max_depth': 10, 'learning_rate': 0.1}
Training time: 8.03 seconds
Prediction time: 0.59 seconds
Total execution time for all hyperparameters: 15368.14 seconds


### CatBoost

In [49]:
# record start time
start_time = time.time() 

# initialize CatBoostRegressor with appropriate parameters
model = CatBoostRegressor(loss_function='RMSE', iterations=300, random_seed=42)

# create list of categorical features
cat_features = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']

# start training time
start_train_time = time.time()

# fit the model
model.fit(features_train_gb, target_train_gb, cat_features=cat_features, verbose=20)

# end training time
end_train_time = time.time()

# make predictions
predictions = model.predict(features_val_gb)

# end prediction time
end_prediction_time = time.time()

# calculate RMSE
rmse = mean_squared_error(target_val_gb, predictions) ** 0.5

# print RMSE
print('RMSE:', round(rmse, 2))

# record end time
end_time = time.time()

# calculate execution time
execution_time = end_time - start_time

# print execution time
print('Training time:', round(end_train_time - start_train_time, 2), 'seconds')
print('Prediction time:', round(end_prediction_time - end_train_time, 2), 'seconds')
print('Total execution time:', round(execution_time, 2), 'seconds')

Learning rate set to 0.24868
0:	learn: 3867.3992240	total: 97.9ms	remaining: 29.3s
20:	learn: 1899.0404320	total: 767ms	remaining: 10.2s
40:	learn: 1814.0047265	total: 1.44s	remaining: 9.11s
60:	learn: 1763.4937535	total: 2.13s	remaining: 8.33s
80:	learn: 1734.8414802	total: 2.76s	remaining: 7.45s
100:	learn: 1714.7022312	total: 3.36s	remaining: 6.62s
120:	learn: 1699.4839566	total: 4.05s	remaining: 5.99s
140:	learn: 1686.5864089	total: 4.74s	remaining: 5.35s
160:	learn: 1673.1592576	total: 5.51s	remaining: 4.76s
180:	learn: 1661.5575040	total: 6.33s	remaining: 4.16s
200:	learn: 1650.3172890	total: 6.94s	remaining: 3.42s
220:	learn: 1643.1304999	total: 7.57s	remaining: 2.71s
240:	learn: 1634.5835119	total: 8.17s	remaining: 2s
260:	learn: 1626.9940590	total: 8.79s	remaining: 1.31s
280:	learn: 1619.6881842	total: 9.44s	remaining: 638ms
299:	learn: 1612.7646550	total: 10s	remaining: 0us
RMSE: 1660.48
Training time: 10.31 seconds
Prediction time: 0.05 seconds
Total execution time: 10.37 seconds

# Model analysis

## Linear Regression
Linear regression produced a high RMSE of 2666, calculated in just under 5 seconds. This suggests the model is not very accurate, but it serves as a baseline for the other models.

## Random Forest
The random forest model was tested next and gave a much better RMSE of around 1782. However, iterating through the hyperparameters to find the best model took about six and a half minutes (395 seconds). The best model itself took just under 22 seconds to train. Once trained, predictions were made in 0.33 seconds.

## Gradient Boosting
LightGBM produced an even lower RMSE of 1587, and the best model was also quicker, training and predicting in around 9 seconds. However, the hyperparameter search was far more expensive: the recorded total was 15,368 seconds, a little over four hours.

The CatBoost model also performed better than the random forest model. However, its RMSE (1660) and total execution time (10.37 seconds) were just behind those of the LightGBM model.


## Analysis
Overall, LightGBM gave the lowest RMSE, and its best model trained faster than the random forest and CatBoost models.

## Final Predictions
Now that this model has been chosen, predictions will be made on the test set using the best hyperparameters found during validation.

In [54]:
# Define parameters
params = {
    'task': 'train', 
    'boosting': 'gbdt',
    'objective': 'regression',
    'num_leaves': 80,
    'max_depth': 10,
    'verbose': -1,
    'metric': 'rmse',
    'learning_rate': 0.1,
}

# start training time
start_train_time = time.time()

# load data into LightGBM dataset
lgb_train = lgb.Dataset(features_train_gb, target_train_gb)

# fit the model
model = lgb.train(params, lgb_train, num_boost_round=1000)

# end train time
end_train_time = time.time()

# make predictions
predictions = model.predict(features_test_gb, num_iteration=model.best_iteration)

# end prediction time
end_prediction_time = time.time()

# calculate RMSE
rmse = mean_squared_error(target_test_gb, predictions) ** 0.5

# training time
training_time = end_train_time - start_train_time

# prediction time
predictions_time = end_prediction_time - end_train_time

# print rmse
print('RMSE:', round(rmse, 2))

# print execution times
print('Training time:', round(training_time, 2), 'seconds')
print('Prediction time:', round(predictions_time, 2), 'seconds')

RMSE: 1623.92
Training time: 8.6 seconds
Prediction time: 0.72 seconds


### Analysis

The test RMSE (1623.92) is slightly higher than the validation RMSE (1586.68), yet still lower than the validation RMSE of every other model. Training and prediction times are also similar to those seen during validation.

# Conclusion

### Accuracy

A gradient boosting model (LightGBM) performed best, achieving a validation RMSE of 1586.68. This indicates that predicted car values typically deviate from the true values by roughly €1,587. This compares to RMSE values of 1660.48, 1781.69 and 2665.65 for the CatBoost, Random Forest and Linear Regression models respectively.

### Speed

Training the LightGBM model took only 8.03 seconds, which is faster than the CatBoost (10.31 seconds) and Random Forest (21.86 seconds) models. The Linear Regression model was faster (4.61 seconds) but much less accurate.

### Test Set

When the LightGBM model was applied to the test set, RMSE increased slightly to 1623.92, while training time remained similar at 8.6 seconds.