# 1 Project Overview

Rusty Bargain, a used car sales service, is developing an app to help users estimate the market value of their car. The project involves building machine learning models using historical data that contains technical specifications, trim versions, and prices of vehicles. The primary focus of the project is to predict car prices, considering the following key factors:

- **Prediction Quality**: The accuracy of the model predictions is a priority.
- **Prediction Speed**: The model should return price predictions quickly.
- **Training Time**: Models should be optimized to reduce training time.

# 2 Initialization

## 2.1 Add imports

Imports in Jupyter notebooks allow users to access external libraries for extended functionality and facilitate code organization by declaring dependencies at the beginning of the notebook, ensuring clear and efficient development.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

import time
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

1. **Pandas**: a data manipulation and analysis library that provides data structures such as DataFrames and Series for handling structured data efficiently.
2. **Sklearn**: scikit-learn, a machine learning library that offers tools for classification, regression, clustering, preprocessing, and model evaluation.
3. **NumPy**: a fundamental library for numerical computing that supports large, multi-dimensional arrays and matrices, along with mathematical functions to operate on them.
4. **Time**: a module in the Python standard library for time-related tasks, such as retrieving the current time, pausing execution (sleep), and measuring time intervals.
5. **LightGBM**: a fast, efficient implementation of gradient boosting designed for large-scale machine learning tasks on structured/tabular data.
6. **XGBoost**: a scalable, efficient gradient boosting library, widely used in machine learning competitions on structured/tabular data.
7. **CatBoost**: a gradient boosting library with built-in handling of categorical features, aimed at fast training and high accuracy on tabular data.

## 2.2 Set up CSV DataFrames

In my Jupyter notebook, I use Pandas to load CSV files, enabling me to manipulate and analyze data seamlessly within the notebook environment.

In [2]:
paths = {
    'local': './datasets/car_data.csv',
    'server': '/datasets/car_data.csv',
    'online': '',
}

I define the `load_csv` function, which takes no arguments and reads the dataset from the first available location in the `paths` dictionary. It tries `paths['local']` first, falls back to `paths['server']` on a `FileNotFoundError`, and finally tries `paths['online']` if both fail.

In [3]:
def load_csv():
    try:
        return pd.read_csv(paths['local'])
    except FileNotFoundError:
        try:
            return pd.read_csv(paths['server'])
        except FileNotFoundError:
            return pd.read_csv(paths['online'])

The nested `try`/`except` blocks catch only `FileNotFoundError`, so any other problem (for example a malformed file) is raised immediately instead of triggering the next fallback. Note that `paths['online']` is an empty string in this notebook, so the last fallback cannot succeed until a URL is filled in.
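The same fallback idea can be written as a loop, which avoids nesting one `try` per source. This is only a sketch: `load_first_available` and its `candidates` argument are hypothetical names, not part of the notebook, and the temporary file stands in for the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_first_available(candidates):
    """Return a DataFrame from the first candidate path that can be read.

    `candidates` is a hypothetical list of paths or URLs, tried in order.
    """
    for source in candidates:
        if not source:  # skip empty placeholders such as an unset online path
            continue
        try:
            return pd.read_csv(source)
        except FileNotFoundError:
            continue  # move on to the next source
    raise FileNotFoundError(f"None of the sources could be read: {candidates}")


# Usage with a throwaway file standing in for the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    real_file = Path(tmp) / "car_data.csv"
    real_file.write_text("Price,Brand\n480,volkswagen\n")
    demo_df = load_first_available(
        [str(Path(tmp) / "missing.csv"), "", str(real_file)]
    )
```

With the notebook's dictionary, the call would be `load_first_available(list(paths.values()))`, since the dictionary keeps its insertion order.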

In [4]:
df = load_csv()

The `load_csv()` function was used to read the CSV file and return its contents as a DataFrame, assigning it to the variable `df`.

## 2.3 Display

Now we will quickly inspect the initial data contained in the DataFrame.

In [5]:
df

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,21/03/2016 09:50,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,21/03/2016 00:00,0,2694,21/03/2016 10:42
354365,14/03/2016 17:48,2200,,2005,,0,,20000,1,,sonstige_autos,,14/03/2016 00:00,0,39576,06/04/2016 00:46
354366,05/03/2016 19:56,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,05/03/2016 00:00,0,26135,11/03/2016 18:17
354367,19/03/2016 18:57,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,19/03/2016 00:00,0,87439,07/04/2016 07:15


**Data Columns:**
- **DateCrawled**: Date when the profile was downloaded from the database.
- **VehicleType**: Type of vehicle body.
- **RegistrationYear**: Year the vehicle was registered.
- **Gearbox**: Type of gearbox (automatic/manual).
- **Power**: Horsepower of the vehicle.
- **Model**: Vehicle model.
- **Mileage**: Vehicle mileage (in km).
- **RegistrationMonth**: Month of registration.
- **FuelType**: Type of fuel used by the vehicle.
- **Brand**: Brand of the vehicle.
- **NotRepaired**: Whether the vehicle was repaired.
- **DateCreated**: Date of profile creation.
- **NumberOfPictures**: Number of vehicle pictures uploaded.
- **PostalCode**: Postal code of the profile owner.
- **LastSeen**: Date of the last activity of the user.
- **Price** (Target): Price of the vehicle in Euros.


# 3 Preparing the Data

## 3.1 The Focus

The DataFrame contains many columns, but not all may be significant. Before assessing significance, we need to understand the types of data present.

In [6]:
df.dtypes

DateCrawled          object
Price                 int64
VehicleType          object
RegistrationYear      int64
Gearbox              object
Power                 int64
Model                object
Mileage               int64
RegistrationMonth     int64
FuelType             object
Brand                object
NotRepaired          object
DateCreated          object
NumberOfPictures      int64
PostalCode            int64
LastSeen             object
dtype: object

The data types are fairly balanced between strings and integers. This will be addressed later, after further processing of the DataFrame.

In [7]:
FOCUS = [
       'VehicleType', 'RegistrationYear', 'Gearbox', 
       'Power', 'Model', 'Mileage', 'FuelType', 'Brand', 
       'NotRepaired', 'Price'
]
df = df[FOCUS]
df

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Price
0,,1993,manual,0,golf,150000,petrol,volkswagen,,480
1,coupe,2011,manual,190,,125000,gasoline,audi,yes,18300
2,suv,2004,auto,163,grand,125000,gasoline,jeep,,9800
3,small,2001,manual,75,golf,150000,petrol,volkswagen,no,1500
4,small,2008,manual,69,fabia,90000,gasoline,skoda,no,3600
...,...,...,...,...,...,...,...,...,...,...
354364,,2005,manual,0,colt,150000,petrol,mitsubishi,yes,0
354365,,2005,,0,,20000,,sonstige_autos,,2200
354366,convertible,2000,auto,101,fortwo,125000,petrol,smart,no,1199
354367,bus,1996,manual,102,transporter,150000,gasoline,volkswagen,no,9200


The columns left out of `FOCUS` mostly describe the listing rather than the vehicle, such as the date an ad was created or last seen, which can indicate how long a car has been on the market and help assess its buyability. In contrast, the columns in `FOCUS` cover the key vehicle characteristics that platforms like Kelley Blue Book (KBB), Edmunds, and J.D. Power use to determine a vehicle's value: the car's type, specifications, and condition, along with `Price` as the target.

## 3.2 Error Handling

To ensure accuracy and repeatability, potential errors will be identified and addressed.

In [8]:
duplicates = df.duplicated().sum()
duplicates

45040

There are 45,040 duplicates that need to be resolved.

In [9]:
df = df.drop_duplicates()
duplicates = df.duplicated().sum()
duplicates


0

All duplicates have been resolved, and there are now 0 remaining.
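Because the check runs after the DataFrame was restricted to the `FOCUS` columns, two separate listings that share every remaining attribute are counted as duplicates. The toy example below (made-up rows, not taken from the dataset) sketches how `duplicated()` and `drop_duplicates()` treat repeated rows.

```python
import pandas as pd

# Rows 0 and 1 are identical; row 3 differs only in Price.
toy = pd.DataFrame({
    "Brand": ["audi", "audi", "bmw", "audi"],
    "Power": [190, 190, 170, 190],
    "Price": [18300, 18300, 9800, 17500],
})

# duplicated() marks every repeat of an earlier row; the first occurrence stays False.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default (keep="first").
deduplicated = toy.drop_duplicates()
```

Here only row 1 is flagged, so three rows remain; a row must match on every column to be dropped.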

In [10]:
missing_values = df.isnull().sum()
missing_values

VehicleType         34559
RegistrationYear        0
Gearbox             17207
Power                   0
Model               18361
Mileage                 0
FuelType            30764
Brand                   0
NotRepaired         64558
Price                   0
dtype: int64

It appears there are missing values in five columns, all of which contain string data types. This should make handling them straightforward.

In [11]:
df = df.fillna('unknown')
missing_values = df.isnull().sum()
missing_values

VehicleType         0
RegistrationYear    0
Gearbox             0
Power               0
Model               0
Mileage             0
FuelType            0
Brand               0
NotRepaired         0
Price               0
dtype: int64

This code fills any missing (NaN) values in the DataFrame `df` with the string `'unknown'` using the `fillna()` method. Then, it calculates the number of missing values in each column of `df` by using `isnull().sum()` and stores the result in `missing_values`.
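Calling `fillna('unknown')` on the whole DataFrame is safe here because only text columns have gaps. If a numeric column also contained missing values, the same call would write strings into it. A minimal sketch of a more defensive variant, using a made-up two-column frame, restricts the fill to the text columns:

```python
import numpy as np
import pandas as pd

# Toy data: one text column and one numeric column, each with a gap.
toy = pd.DataFrame({
    "Gearbox": pd.Series(["manual", None, "auto"], dtype=object),
    "Power": [75.0, np.nan, 163.0],
})

# Fill only the text (object) columns, leaving numeric gaps untouched.
text_columns = toy.select_dtypes(include="object").columns
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("unknown")
```

The numeric gap in `Power` stays `NaN`, so it can be handled separately, for example with a median or a dedicated flag.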


## 3.3 Enhancements

The goal is to build a model to predict car prices. While the DataFrame has been cleaned, it's not yet ready for use. Further adjustments are needed to complete the task.

In [12]:
no_price = df[(df['Price'] == 0) & (~df.eq('unknown').any(axis=1))]
no_price_unknown = df[(df['Price'] == 0) & (df.eq('unknown').any(axis=1))]
unknown_values = df[(df.eq('unknown').any(axis=1)) & (df['Price'] != 0)]
display(no_price, no_price_unknown, unknown_values)

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Price
7,sedan,1980,manual,50,other,40000,petrol,volkswagen,no,0
152,bus,2004,manual,101,meriva,150000,lpg,opel,yes,0
579,sedan,1996,manual,170,5er,150000,petrol,bmw,no,0
615,sedan,1998,manual,75,polo,150000,petrol,volkswagen,yes,0
859,wagon,2009,manual,170,a6,150000,gasoline,audi,yes,0
...,...,...,...,...,...,...,...,...,...,...
353863,sedan,1995,manual,90,mondeo,100000,petrol,ford,no,0
353882,small,1996,manual,45,polo,150000,petrol,volkswagen,yes,0
353995,sedan,1991,manual,133,100,150000,petrol,audi,no,0
354124,small,2004,manual,200,golf,150000,petrol,volkswagen,no,0


Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Price
40,unknown,1990,unknown,0,corsa,150000,petrol,opel,unknown,0
111,unknown,2017,manual,0,golf,5000,petrol,volkswagen,unknown,0
115,small,1999,unknown,0,unknown,5000,petrol,volkswagen,unknown,0
154,unknown,2006,unknown,0,other,5000,unknown,fiat,unknown,0
231,wagon,2001,manual,115,mondeo,150000,unknown,ford,unknown,0
...,...,...,...,...,...,...,...,...,...,...
354175,unknown,1995,manual,45,polo,150000,petrol,volkswagen,unknown,0
354205,unknown,2000,manual,65,corsa,150000,unknown,opel,yes,0
354238,small,2002,manual,60,fiesta,150000,petrol,ford,unknown,0
354248,small,1999,manual,53,swift,150000,petrol,suzuki,unknown,0


Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Price
0,unknown,1993,manual,0,golf,150000,petrol,volkswagen,unknown,480
1,coupe,2011,manual,190,unknown,125000,gasoline,audi,yes,18300
2,suv,2004,auto,163,grand,125000,gasoline,jeep,unknown,9800
8,bus,2014,manual,125,c_max,30000,petrol,ford,unknown,14500
9,small,1998,manual,101,golf,150000,unknown,volkswagen,unknown,999
...,...,...,...,...,...,...,...,...,...,...
354356,convertible,2000,manual,95,megane,150000,petrol,renault,unknown,999
354357,wagon,2004,manual,55,fabia,150000,petrol,skoda,unknown,1690
354361,unknown,2016,auto,150,159,150000,unknown,alfa_romeo,no,5250
354365,unknown,2005,unknown,0,unknown,20000,unknown,sonstige_autos,unknown,2200


The first potential issue is the presence of `0` values. To address this, I split the `0` values into two DataFrames: one containing unknown values and another without. By doing this, we can confirm that both scenarios are possible.

I also needed to confirm the possibility of rows with an `unknown` value still producing a non-zero price, which has been observed.
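The three filters are combinations of two boolean masks: "price is zero" and "at least one column equals `'unknown'`". The sketch below rebuilds them on a made-up four-row frame so the logic can be checked by eye; the intermediate names `zero_price` and `has_unknown` are illustrative and do not appear in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gearbox": ["manual", "unknown", "auto", "unknown"],
    "Brand": ["audi", "opel", "jeep", "fiat"],
    "Price": [0, 0, 9800, 2200],
})

zero_price = toy["Price"] == 0
# True for rows where at least one column holds the placeholder string.
has_unknown = toy.eq("unknown").any(axis=1)

no_price = toy[zero_price & ~has_unknown]          # zero price, fully described
no_price_unknown = toy[zero_price & has_unknown]   # zero price and a missing detail
unknown_values = toy[~zero_price & has_unknown]    # priced despite a missing detail
```

Naming the masks once keeps the three selections consistent and makes it clear that the fourth combination (priced and fully described) is the remainder.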

In [13]:
df = df[~df.index.isin(no_price_unknown.index)].reset_index(drop=True)
df

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Price
0,unknown,1993,manual,0,golf,150000,petrol,volkswagen,unknown,480
1,coupe,2011,manual,190,unknown,125000,gasoline,audi,yes,18300
2,suv,2004,auto,163,grand,125000,gasoline,jeep,unknown,9800
3,small,2001,manual,75,golf,150000,petrol,volkswagen,no,1500
4,small,2008,manual,69,fabia,90000,gasoline,skoda,no,3600
...,...,...,...,...,...,...,...,...,...,...
303140,sedan,2004,manual,225,leon,150000,petrol,seat,yes,3200
303141,unknown,2005,unknown,0,unknown,20000,unknown,sonstige_autos,unknown,2200
303142,convertible,2000,auto,101,fortwo,125000,petrol,smart,no,1199
303143,bus,1996,manual,102,transporter,150000,gasoline,volkswagen,no,9200


Considering all scenarios, it seems logical that not knowing minor details about your car wouldn't prevent you from getting an estimate. However, some key information is necessary to provide an accurate valuation; without it, an error may occur. Finally, owning a car and knowing its details doesn't always guarantee that it holds significant value.

I removed the `no_price_unknown` rows from the DataFrame. When tested later, the model showed errors without this step, confirming its importance.

In [14]:
categorical = df.select_dtypes(include='object').columns
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[categorical])
encoded_features = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(categorical))
encoded_df = pd.concat([df.drop(columns=categorical), encoded_features], axis=1)
encoded_df

Unnamed: 0,RegistrationYear,Power,Mileage,Price,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_unknown,NotRepaired_yes
0,1993,0,150000,480,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,2011,190,125000,18300,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2004,163,125000,9800,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,2001,75,150000,1500,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,2008,69,90000,3600,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
303140,2004,225,150000,3200,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
303141,2005,0,20000,2200,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
303142,2000,101,125000,1199,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
303143,1996,102,150000,9200,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


Since not all models accept categorical values, I transformed these into binary values and saved the result in a new DataFrame called `encoded_df`.

# 4 Set-Up

## 4.1 Splitting the Data

First, we need to define the features and the target variable for the model.

In [15]:
X = encoded_df.drop('Price', axis=1)
y = encoded_df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With the features and target variable established, I was able to generate the training and testing datasets for model evaluation.
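As a quick illustration of what `test_size=0.2` and `random_state=42` do, the sketch below splits a made-up ten-row array twice. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 20% of the rows go to the test set; the rest form the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
```

Fixing the seed means every model in the comparison is trained and tested on exactly the same rows.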

## 4.2 Functional Repeatability

To streamline the model evaluation process, functions will be created to reduce the amount of code needed.

In [16]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_scorer = make_scorer(mean_squared_error, squared=False)

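In this snippet, `kf` is a `KFold` object configured for 5-fold cross-validation, with shuffling enabled and a fixed random seed (42) for reproducibility. The `rmse_scorer` uses `make_scorer` to wrap `mean_squared_error`, specifying `squared=False` to calculate the root mean squared error (RMSE) for model evaluation. Note that `make_scorer` defaults to `greater_is_better=True`, so any search that ranks candidates with this scorer treats a higher RMSE as better; passing `greater_is_better=False` makes the scorer return the negated RMSE, so that the lowest error wins.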
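RMSE is the square root of the mean squared difference between actual and predicted prices. The sketch below computes it by hand on three made-up prices, then shows scikit-learn's built-in `"neg_root_mean_squared_error"` scoring string, which returns negated values so that "higher is better" still means "lower error". The toy regression data and the helper name `rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def rmse(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of -100, 100 and -200 give sqrt((10000 + 10000 + 40000) / 3).
example_rmse = rmse([1000, 2000, 3000], [1100, 1900, 3200])

# Synthetic regression data, used only to exercise the scorer.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
neg_rmse_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    scoring="neg_root_mean_squared_error", cv=kf,
)
fold_rmse = -neg_rmse_scores  # flip the sign back to read the values as RMSE
```

Passing the same scoring string to `GridSearchCV` would make it select the candidate with the lowest RMSE. Recent scikit-learn releases have also deprecated the `squared` argument of `mean_squared_error` in favour of `root_mean_squared_error`, so the string form is the more portable option.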

In [17]:
def train_and_evaluate_models(models, X_train, y_train, X_test, y_test):
    results = {}
    
    for model_name, config in models.items():
        model = config['model']
        params = config['params']
        
        start_time = time.time()
        grid_search = GridSearchCV(model, params, scoring=rmse_scorer, cv=kf, n_jobs=-1)
        grid_search.fit(X_train, y_train)
        end_time = time.time()
        training_time = end_time - start_time
        
        best_model = grid_search.best_estimator_
        best_fold_rmse = np.min(cross_val_score(best_model, X_train, y_train, scoring=rmse_scorer, cv=kf))
        
        start_time_test = time.time()
        y_pred = best_model.predict(X_test)
        end_time_test = time.time()
        test_rmse = (mean_squared_error(y_test, y_pred, squared=False))
        
        results[model_name] = {
            'Best Fold RMSE': best_fold_rmse,
            'Test RMSE': test_rmse,
            'Best Params': grid_search.best_params_, 
            'Training Time': training_time,
            'Test Time': end_time_test - start_time_test
        }
    
    return results

The `train_and_evaluate_models` function takes a dictionary of models and their configurations, along with training and testing datasets. It initializes an empty dictionary, `results`, to store the evaluation metrics for each model. For each model, it runs a grid search with `GridSearchCV` and `rmse_scorer`, and records the wall-clock time of the whole search (every candidate on every fold, plus the final refit) as the training time. It then reruns cross-validation on the best estimator and keeps the lowest fold RMSE as the best fold RMSE. Next, it times the prediction on the test set and computes the test RMSE. Finally, it stores these metrics together with the best parameters in the `results` dictionary, which is returned at the end.
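The start/stop timing pattern appears twice in the function, so it can be factored into a small helper. This is a sketch only: `timed` is a hypothetical name, and it uses `time.perf_counter()`, which is intended for measuring intervals, instead of `time.time()`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage: time any callable, here the built-in sum as a stand-in for predict().
total, seconds = timed(sum, range(1_000_000))
```

Inside the loop this would read `y_pred, test_time = timed(best_model.predict, X_test)`. If the time of a single fit is wanted instead of the whole search, a fitted `GridSearchCV` also exposes `refit_time_`, the time spent refitting the best candidate on the full training set.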

In [18]:
def print_results(results):
    for model_name, metrics in results.items():
        print(f"{model_name} - Best Fold RMSE: {metrics['Best Fold RMSE']:.2f}")
        print(f"{model_name} - Training Time: {metrics['Training Time']:.2f}s")
        print(f"{model_name} - Test RMSE: {metrics['Test RMSE']:.2f}")
        print(f"{model_name} - Test Time: {metrics['Test Time']:.2f}s")
        print(f"{model_name} - Best Params: {metrics['Best Params']}\n")

The `print_results` function prints the `results` returned by `train_and_evaluate_models`, which contain evaluation metrics for the various models. It iterates through each model, printing the best fold RMSE, training time, test RMSE, test time, and best hyperparameters in a formatted manner.

# 5 Training and Model Testing

## 5.1 Parameters

In [19]:
models = {
    'Decision Tree': {
        'model': DecisionTreeRegressor(),
        'params': {
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10]
        }
    },
    'Random Forest': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [50, 100, 250]
        }
    },
    'Linear Regression': {
        'model': LinearRegression(),
        'params': {}
    }
}

This dictionary specifies different regression models and their respective hyperparameters for tuning. Each key is a string naming a model: 'Decision Tree', 'Random Forest', or 'Linear Regression'. Each value is another dictionary containing the model instance and a `params` dictionary, as follows:

- For the **Decision Tree** model, it specifies hyperparameters for `max_depth` and `min_samples_split`.
- The **Random Forest** model includes the number of estimators (`n_estimators`) for tuning.
- The **Linear Regression** model is included without any hyperparameters.
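The size of each grid largely determines the training time, because `GridSearchCV` fits every parameter combination once per fold and then refits the best one. The sketch below counts those fits with scikit-learn's `ParameterGrid`; the grids are copied from the cell above, while `param_grids`, `n_folds`, and `fit_counts` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grids = {
    "Decision Tree": {
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
    },
    "Random Forest": {"n_estimators": [50, 100, 250]},
    "Linear Regression": {},  # an empty grid still yields one candidate
}
n_folds = 5  # matches KFold(n_splits=5)

fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(ParameterGrid(grid))
    # One fit per candidate per fold, plus one final refit on all training data.
    fit_counts[name] = n_candidates * n_folds + 1
```

This gives 61 fits for the Decision Tree, 16 for the Random Forest, and 6 for Linear Regression, before counting the extra cross-validation pass that the evaluation function runs on the best estimator.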

In [20]:
models_2 = {
    'LightGBM': {
        'model': LGBMRegressor(),
        'params': {
            'n_estimators': [50, 100, 250],
            'learning_rate': [0.01, 0.05, 0.1]
        }
    },
    'XGBoost': {
        'model': XGBRegressor(),
        'params': {
            'n_estimators': [50, 100, 250],
            'learning_rate': [0.01, 0.05, 0.1]
        }
    },
    'CatBoost': {
        'model': CatBoostRegressor(silent=True),
        'params': {
            'iterations': [100, 250],
            'learning_rate': [0.01, 0.05, 0.1]
        }
    }
}

This dictionary contains additional regression models along with their respective hyperparameters for tuning. Each key names a model: 'LightGBM', 'XGBoost', or 'CatBoost'. Each value is another dictionary that includes:

- **Model Instance**: Each model is instantiated (e.g., `LGBMRegressor()`, `XGBRegressor()`, `CatBoostRegressor(silent=True)`).
- **Hyperparameters**: The `params` dictionary specifies the hyperparameters for each model:
  - For **LightGBM** and **XGBoost**, `n_estimators` and `learning_rate` are included for tuning.
  - For **CatBoost**, `iterations` and `learning_rate` are specified, with the `silent=True` argument suppressing CatBoost's per-iteration training log.

## 5.2 Model Evaluation

Now each model will be evaluated.

In [21]:
simple_models = train_and_evaluate_models(models, X_train, y_train, X_test, y_test)
print_results(simple_models)

Decision Tree - Best Fold RMSE: 2174.51
Decision Tree - Training Time: 122.46s
Decision Tree - Test RMSE: 2187.51
Decision Tree - Test Time: 0.17s
Decision Tree - Best Params: {'max_depth': None, 'min_samples_split': 2}

Random Forest - Best Fold RMSE: 1804.39
Random Forest - Training Time: 2284.54s
Random Forest - Test RMSE: 1812.55
Random Forest - Test Time: 1.28s
Random Forest - Best Params: {'n_estimators': 50}

Linear Regression - Best Fold RMSE: 3187.50
Linear Regression - Training Time: 21.52s
Linear Regression - Test RMSE: 641631.98
Linear Regression - Test Time: 0.08s
Linear Regression - Best Params: {}



- The **Random Forest** model achieved the best performance, with the lowest RMSE on both the best fold and the test data, although its training time was by far the longest (2284.54s).
- The **Decision Tree** model performed reasonably well: its RMSE was higher than that of Random Forest, but its training and prediction times were much shorter.
- The **Linear Regression** model performed far worse on the test set (RMSE 641631.98) than in cross-validation (best fold RMSE 3187.50), suggesting it may not be suitable for this dataset.

In [22]:
advanced_models = train_and_evaluate_models(models_2, X_train, y_train, X_test, y_test)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004343 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 955
[LightGBM] [Info] Number of data points in the train set: 242516, number of used features: 298
[LightGBM] [Info] Start training from score 4577.668768
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004365 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 957
[LightGBM] [Info] Number of data points in the train set: 194012, number of used features: 298
[LightGBM] [Info] Start training from score 4582.152831
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003686 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] 

This line calls the `train_and_evaluate_models` function with `models_2`, which includes LightGBM, XGBoost, and CatBoost. It trains the models, evaluates their performance, and stores metrics in the `advanced_models` variable for later analysis.

In [23]:
print_results(advanced_models)

LightGBM - Best Fold RMSE: 3373.33
LightGBM - Training Time: 63.98s
LightGBM - Test RMSE: 3380.21
LightGBM - Test Time: 0.22s
LightGBM - Best Params: {'learning_rate': 0.01, 'n_estimators': 50}

XGBoost - Best Fold RMSE: 3337.44
XGBoost - Training Time: 111.28s
XGBoost - Test RMSE: 3343.88
XGBoost - Test Time: 0.10s
XGBoost - Best Params: {'learning_rate': 0.01, 'n_estimators': 50}

CatBoost - Best Fold RMSE: 2908.66
CatBoost - Training Time: 49.41s
CatBoost - Test RMSE: 2918.52
CatBoost - Test Time: 0.01s
CatBoost - Best Params: {'iterations': 100, 'learning_rate': 0.01}



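- The **CatBoost** model achieved the best performance of the three boosting models, with the lowest RMSE on both the best fold and the test data, along with the shortest training and prediction times.
- **XGBoost** came second, with an RMSE more than 400 higher than CatBoost's and the longest training time of the three (111.28s).
- **LightGBM** had the highest RMSE among the three models, slightly above XGBoost, despite its relatively short training time.
- All three models selected the smallest learning rate and the fewest iterations in their grids. This is consistent with `rmse_scorer` being built with the default `greater_is_better=True`, which makes the search favour higher RMSE, so these figures probably understate what the boosting models can achieve.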

# 6 Conclusion

In evaluating the performance of various regression models, **Random Forest** emerged as the most effective, achieving the lowest best fold RMSE (1804.39) and test RMSE (1812.55), at the cost of the longest training time (2284.54s). The **Decision Tree** came second (best fold RMSE 2174.51, test RMSE 2187.51) and trained far faster. **CatBoost** was the strongest of the gradient boosting models, with a best fold RMSE of 2908.66 and a test RMSE of 2918.52, and it had the fastest prediction time of any model (0.01s). Both **LightGBM** and **XGBoost** showed weaker results, with RMSE values above 3300. **Linear Regression** performed very poorly on the test set, with an extremely high test RMSE of 641631.98, indicating it may not be suitable for this dataset. Overall, Random Forest is recommended when prediction quality is the priority, while CatBoost is the option to consider when training and prediction speed matter more.
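To compare the models against the three project criteria at a glance, the reported metrics can be collected into one table. The sketch below copies the test RMSE, training time, and test time from the printed outputs in section 5.2; the names `reported` and `summary` are illustrative.

```python
import pandas as pd

# Values copied from the printed results in section 5.2.
reported = {
    "Decision Tree": {"Test RMSE": 2187.51, "Training Time (s)": 122.46, "Test Time (s)": 0.17},
    "Random Forest": {"Test RMSE": 1812.55, "Training Time (s)": 2284.54, "Test Time (s)": 1.28},
    "Linear Regression": {"Test RMSE": 641631.98, "Training Time (s)": 21.52, "Test Time (s)": 0.08},
    "LightGBM": {"Test RMSE": 3380.21, "Training Time (s)": 63.98, "Test Time (s)": 0.22},
    "XGBoost": {"Test RMSE": 3343.88, "Training Time (s)": 111.28, "Test Time (s)": 0.10},
    "CatBoost": {"Test RMSE": 2918.52, "Training Time (s)": 49.41, "Test Time (s)": 0.01},
}

# One row per model, ordered from lowest to highest test RMSE.
summary = pd.DataFrame.from_dict(reported, orient="index").sort_values("Test RMSE")
print(summary)
```

Sorting by a different column, for example `summary.sort_values("Test Time (s)")`, reorders the same table by prediction speed.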