Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

In [3]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import catboost as cb
import xgboost as xgb
import time
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor	
from sklearn.tree import DecisionTreeRegressor

Features

    DateCrawled — date profile was downloaded from the database
    VehicleType — vehicle body type
    RegistrationYear — vehicle registration year
    Gearbox — gearbox type
    Power — power (hp)
    Model — vehicle model
    Mileage — mileage (measured in km due to dataset's regional specifics)
    RegistrationMonth — vehicle registration month
    FuelType — fuel type
    Brand — vehicle brand
    NotRepaired — vehicle repaired or not
    DateCreated — date of profile creation
    NumberOfPictures — number of vehicle pictures
    PostalCode — postal code of profile owner (user)
    LastSeen — date of the last activity of the user
Target

    Price — price (Euro)

## Data preparation

In [6]:
try:
    df = pd.read_csv('/datasets/car_data.csv')

except FileNotFoundError:
    df = pd.read_csv('/datasets/car_data.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [8]:
# Drop the original datetime columns 
df.drop(columns=['DateCrawled', 'DateCreated', 'LastSeen'], inplace=True)

In [9]:
df.head(5)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,NumberOfPictures,PostalCode
0,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,0,70435
1,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,0,66954
2,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,0,90480
3,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,0,91074
4,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,0,60437


In [12]:
df.duplicated().sum()

21333

In [13]:
df = df.drop_duplicates().reset_index(drop=True)

In [14]:
df.isnull().sum()

Price                    0
VehicleType          36140
RegistrationYear         0
Gearbox              19015
Power                    0
Model                19023
Mileage                  0
RegistrationMonth        0
FuelType             31902
Brand                    0
NotRepaired          68063
NumberOfPictures         0
PostalCode               0
dtype: int64

In [15]:
#All missing values are categorical, value was unspecified so fill with unknown
df['VehicleType'].fillna('unknown', inplace=True)
df['Gearbox'].fillna('unknown', inplace=True)
df['Model'].fillna('unknown', inplace=True)
df['FuelType'].fillna('unknown', inplace=True)
df['NotRepaired'].fillna('unknown', inplace=True)

In [17]:
df.isnull().sum()

Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
NumberOfPictures     0
PostalCode           0
dtype: int64

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333036 entries, 0 to 333035
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              333036 non-null  int64 
 1   VehicleType        333036 non-null  object
 2   RegistrationYear   333036 non-null  int64 
 3   Gearbox            333036 non-null  object
 4   Power              333036 non-null  int64 
 5   Model              333036 non-null  object
 6   Mileage            333036 non-null  int64 
 7   RegistrationMonth  333036 non-null  int64 
 8   FuelType           333036 non-null  object
 9   Brand              333036 non-null  object
 10  NotRepaired        333036 non-null  object
 11  NumberOfPictures   333036 non-null  int64 
 12  PostalCode         333036 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 33.0+ MB


## Model training

In [20]:
def lr(X_train, y_train, X_test, y_test):
    model = LinearRegression()
    
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    start_time = time.time()
    y_pred_test = model.predict(X_test)
    prediction_time = time.time() - start_time
    
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')
    
def dt(X_train, y_train, X_test, y_test):
    best_model = None
    best_result = 10000
    best_depth = 0
    for depth in range(1, 6): # choose hyperparameter range
        model = DecisionTreeRegressor(random_state=12345, max_depth= depth) # train model on training set
        model.fit(X_train, y_train)
        predictions_valid = model.predict(X_test) # get model predictions on validation set
        result = mean_squared_error(y_test, predictions_valid)**0.5
        if result < best_result:
            best_model = model
            best_result = result
            best_depth = depth
            
            
    model = best_model
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time

    start_time = time.time()
    predictions_valid = best_model.predict(X_test)
    prediction_time = time.time() - start_time

    rmse_test = np.sqrt(mean_squared_error(y_test, predictions_valid))
    
    print(f"RMSE of the best model on the validation set (max_depth = {best_depth}): {best_result}")
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')

def rf(X_train, y_train, X_test, y_test):
    best_model = None
    best_result = 10000
    best_est = 0
    best_depth = 0
    start_time = time.time()
    for est in range(1, 10, 1):
        for depth in range (1, 5):
            model = RandomForestRegressor(random_state=12345, n_estimators=est, max_depth=depth)
            model.fit(X_train, y_train)
            predictions_valid = model.predict(X_test)
            result = mean_squared_error(y_test, predictions_valid)**0.5
            if result < best_result:
                best_model = model
                best_result = result
                best_est = est
                best_depth = depth
    
    model = best_model
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time

    start_time = time.time()
    predictions_valid = best_model.predict(X_test)
    prediction_time = time.time() - start_time

    rmse_test = np.sqrt(mean_squared_error(y_test, predictions_valid))
    
    print("RMSE of the best model on the validation set:", best_result, "n_estimators:", best_est, "best_depth:", depth)
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')


def light_gbm(X_train, y_train, X_test, y_test):
    train_data = lgb.Dataset(X_train, label=y_train)

    # Set parameters
    params = {
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.1,
        'num_leaves': 100,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.5,
        'verbose': -1
    }

    # Train the model
    start_time = time.time()
    model = lgb.train(params, train_data, num_boost_round=100)
    training_time = time.time() - start_time

    # Make predictions
    start_time = time.time()
    y_pred = model.predict(X_test)
    prediction_time = time.time() - start_time
    
    # Evaluate the model
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')

def c_boost(X_train, y_train, X_test, y_test):
    model = cb.CatBoostRegressor(
        iterations=100,
        learning_rate=0.1,
        depth=10,  # Depth of the tree, similar to num_leaves in LightGBM
        loss_function='RMSE',
        verbose=0  # Set to 0 to disable output, use 1 for detailed output
    )

    # Train the model
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions
    start_time = time.time()
    y_pred = model.predict(X_test)
    prediction_time = time.time() - start_time

    # Evaluate the model
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')

def xgb_boost(X_train, y_train, X_test, y_test):

    # Convert the dataset into DMatrix format, which is optimized for XGBoost
    train_data = xgb.DMatrix(X_train, label=y_train)
    test_data = xgb.DMatrix(X_test, label=y_test)
    
    # Set parameters
    params = {
        'objective': 'reg:squarederror',  # Objective for regression tasks
        'eval_metric': 'rmse',  # Evaluation metric
        'learning_rate': 0.1,
        'max_depth': 6,  # Similar to 'depth' in CatBoost and 'num_leaves' in LightGBM
        'colsample_bytree': 0.9,  # Similar to 'feature_fraction' in LightGBM
        'subsample': 0.8,  # Similar to 'bagging_fraction' in LightGBM
        'verbosity': 1  # Set to 0 to silence output
    }
    
    # Train the model
    start_time = time.time()
    model = xgb.train(params, train_data, num_boost_round=100)
    training_time = time.time() - start_time
    
    # Make predictions
    start_time = time.time()
    y_pred = model.predict(test_data)
    prediction_time = time.time() - start_time
    
    # Evaluate the model
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Test RMSE: {rmse_test:.4f}')
    print(f'Training Time: {training_time:.4f} seconds')
    print(f'Prediction Time: {prediction_time:.4f} seconds')

In [21]:
categorical_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
# One-hot encoding the categorical features
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)
features = df_encoded.drop(columns=['Price'])  # Features
target = df_encoded['Price']                 # Target
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=12345)
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [24]:
df['Price'].describe()

count    333036.000000
mean       4378.277586
std        4502.534823
min           0.000000
25%        1000.000000
50%        2699.000000
75%        6299.250000
max       20000.000000
Name: Price, dtype: float64

In [25]:
lr(X_train, y_train, X_test, y_test)

Test RMSE: 3159.5750
Training Time: 3.3398 seconds
Prediction Time: 0.0540 seconds


In [26]:
dt(X_train, y_train, X_test, y_test)

RMSE of the best model on the validation set (max_depth = 5): 2583.9005957382897
Test RMSE: 2583.9006
Training Time: 2.6785 seconds
Prediction Time: 0.0600 seconds


In [27]:
rf(X_train, y_train, X_test, y_test)

RMSE of the best model on the validation set: 2744.2724731667913 n_estimators: 9 best_depth: 4
Test RMSE: 2744.2725
Training Time: 13.1276 seconds
Prediction Time: 0.0790 seconds


In [28]:
light_gbm(X_train, y_train, X_test, y_test)

Test RMSE: 1790.4000
Training Time: 1.9485 seconds
Prediction Time: 0.1190 seconds


In [29]:
c_boost(X_train, y_train, X_test, y_test)

Test RMSE: 1875.6100
Training Time: 2.8121 seconds
Prediction Time: 0.0164 seconds


In [30]:
xgb_boost(X_train, y_train, X_test, y_test)

Test RMSE: 1874.3630
Training Time: 1.7199 seconds
Prediction Time: 0.0180 seconds


## Model analysis

When comparing the model's training times, the gradient boosting methods demonstrate significantly faster performance compared to other models like Random Forest, Decision Tree, and Linear Regression. Among these, LightGBM not only offers the highest accuracy, but also has a great prediction speed. 

Although the prediction time for LightGBM is slightly slower compared to other models, the difference in time is marginal and does not significantly impact the overall performance. Therefore, given its accuracy and training speed, I recommend using the LightGBM model for this task.