Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that predicts the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import time
from time import perf_counter
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

In [2]:
random_state = 42
np.random.seed(random_state)

In [3]:
df = pd.read_csv('/datasets/car_data.csv')
print(df.shape)
df.head()

(354369, 16)


Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [6]:
(df.isna().mean().sort_values(ascending = False) * 100).round(1).to_frame('% missing values').head(20)

Unnamed: 0,% missing values
NotRepaired,20.1
VehicleType,10.6
FuelType,9.3
Gearbox,5.6
Model,5.6
DateCrawled,0.0
Price,0.0
RegistrationYear,0.0
Power,0.0
Mileage,0.0


Interpretation

NotRepaired has the highest missing rate (20.1%, about 1 in 5 records). Fill it with an explicit 'missing' category so that models can treat the absence of information as a signal.

VehicleType (10.6%) and FuelType (9.3%) are next; impute them the same way.

Gearbox and Model (5.6% each) have minor missingness; 'missing' or mode imputation works fine.

None of the numeric columns has missing values (all show 354369 non-null in df.info()), so median imputation for them is only a safeguard.

In [7]:
target = 'Price'

q_low = df[target].quantile(0.01)
q_high = df[target].quantile(0.99)
print(f"Price range: {q_low} to {q_high}")
print(f"Rows before filtering: {len(df)}")

df_filtered = df[(df[target] >= q_low) & (df[target] <= q_high)]
print(f"Rows after filtering: {len(df_filtered)}")

Price range: 0.0 to 18800.0
Rows before filtering: 354369
Rows after filtering: 350925
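
The printed lower bound is 0.0, which means the 1st percentile of `Price` is itself zero, so the `>= q_low` condition removes nothing at the low end and zero-priced listings stay in the data. A minimal sketch of the effect on toy prices (the values and the `min_price` floor are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy prices (hypothetical); three listings have a price of 0.
prices = pd.Series([0, 0, 0, 500, 1200, 2700, 6400, 9000, 15000, 20000], name="Price")

q_low = prices.quantile(0.01)
q_high = prices.quantile(0.99)

# Quantile-only filter: zeros survive because the 1st percentile is 0.
kept_quantile = prices[(prices >= q_low) & (prices <= q_high)]

# Assumed alternative: an explicit price floor (min_price is a hypothetical threshold).
min_price = 100
kept_floor = prices[(prices >= min_price) & (prices <= q_high)]

print("q_low:", q_low)
print("zero-priced rows kept (quantile filter):", int((kept_quantile == 0).sum()))
print("zero-priced rows kept (explicit floor):", int((kept_floor == 0).sum()))
```

Whether zero-priced listings should be dropped is a modeling decision; the sketch only shows that the quantile filter alone does not make it.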


In [8]:
numeric_feats = list(df_filtered.select_dtypes(include=['int64']).columns)

if target in numeric_feats:
    numeric_feats.remove(target)

categorical_feats = list(df_filtered.select_dtypes(include=['object']).columns)

print('Numeric Features:', len(numeric_feats))
print('Categorical Features:', len(categorical_feats))


Numeric Features: 6
Categorical Features: 9


Created the target and feature lists

In [9]:
df_train, df_test = train_test_split(df_filtered, test_size = 0.20, random_state=random_state)
train_df, valid_df = train_test_split(df_train, test_size = 0.20, random_state=random_state)

x_train = train_df.drop([target], axis = 1)
y_train = train_df[target]

x_valid = valid_df.drop([target], axis = 1)
y_valid = valid_df[target]

x_test = df_test.drop([target], axis =1)
y_test = df_test[target]


We split the data into **train**, **validation**, and **test** sets to evaluate models fairly. Two successive 80/20 splits give roughly 64% / 16% / 20% of the filtered rows:
- **Train set**: used to fit the models.
- **Validation set**: used to compare and tune models before testing.
- **Test set**: used only once at the end to measure final performance.

Keeping the test set out of model selection keeps the final score unbiased, while the validation set is what exposes overfitting to the training data.
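
The effect of chaining two 80/20 splits can be checked on a stand-in list of 1,000 rows. This is a small sketch with the same `test_size` and `random_state` values as the cell above; the `rows` list is purely illustrative.

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for 1,000 data rows

# First split: hold out 20% as the test set.
rest, test = train_test_split(rows, test_size=0.20, random_state=42)
# Second split: hold out 20% of the remainder as the validation set.
train, valid = train_test_split(rest, test_size=0.20, random_state=42)

shares = {
    "train": len(train) / len(rows),
    "valid": len(valid) / len(rows),
    "test": len(test) / len(rows),
}
print(shares)
```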

In [10]:
num_medians = x_train[numeric_feats].median()

x_train[numeric_feats] = x_train[numeric_feats].fillna(num_medians)
x_valid[numeric_feats] = x_valid[numeric_feats].fillna(num_medians)
x_test[numeric_feats] = x_test[numeric_feats].fillna(num_medians)

x_train[categorical_feats] = x_train[categorical_feats].fillna('missing')
x_valid[categorical_feats] = x_valid[categorical_feats].fillna('missing')
x_test[categorical_feats] = x_test[categorical_feats].fillna('missing')

Fill missing numeric values using only the train medians to prevent data leakage (the numeric columns have no missing values in this dataset, so this step is a safeguard).
Fill missing categorical values with 'missing'.

In [11]:
important_cats = ['VehicleType', 'Gearbox', 'FuelType', 'Brand']
numeric_cols = ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth', 'NumberOfPictures', 'PostalCode']

columns_to_keep = numeric_cols + important_cats
x_train_selected = x_train[columns_to_keep]
x_valid_selected = x_valid[columns_to_keep]
x_test_selected = x_test[columns_to_keep]

df_all = pd.concat([x_train_selected, x_valid_selected, x_test_selected])
df_all = pd.get_dummies(df_all, columns=important_cats, drop_first=False)

n_train = len(x_train_selected)
n_valid = len(x_valid_selected)

x_train_ohe = df_all.iloc[:n_train]
x_valid_ohe = df_all.iloc[n_train:n_train + n_valid]
x_test_ohe = df_all.iloc[n_train + n_valid:]

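One-hot encoding for Linear Regression, Decision Tree, Random Forest, and XGBoost (all four are trained on `x_train_ohe` below). Only four categorical columns are encoded; `Model`, `NotRepaired`, and the date columns are dropped. The three splits are concatenated before `pd.get_dummies` so that they end up with identical columns.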
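
Concatenating the splits before `pd.get_dummies` guarantees identical columns, but it also lets categories that occur only in validation or test shape the feature set. One possible alternative, sketched here on toy frames (the data and variable names are hypothetical), is to encode each split separately and align the others to the training columns with `reindex`:

```python
import pandas as pd

train = pd.DataFrame({"FuelType": ["petrol", "gasoline", "petrol"]})
valid = pd.DataFrame({"FuelType": ["petrol", "lpg"]})  # 'lpg' never appears in train

# Encoding each split on its own yields different column sets.
train_sep = pd.get_dummies(train, columns=["FuelType"])
valid_sep = pd.get_dummies(valid, columns=["FuelType"])

# Align validation to the training columns: categories unseen in train are
# dropped, and training categories absent from validation are filled with 0.
valid_aligned = valid_sep.reindex(columns=train_sep.columns, fill_value=0)

print(list(train_sep.columns))
print(list(valid_sep.columns))
print(list(valid_aligned.columns))
```

With this variant the feature set is defined by the training data alone, at the cost of silently ignoring categories the model has never seen.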

In [12]:
columns_to_keep = ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth', 
                   'NumberOfPictures', 'PostalCode', 'VehicleType', 'Gearbox', 
                   'FuelType', 'Brand']

x_train_lgb = x_train[columns_to_keep].copy()
x_valid_lgb = x_valid[columns_to_keep].copy()
x_test_lgb = x_test[columns_to_keep].copy()


categorical_cols_in_selected = ['VehicleType', 'Gearbox', 'FuelType', 'Brand']

for c in categorical_cols_in_selected:
    x_train_lgb[c] = x_train_lgb[c].astype('category')
    x_valid_lgb[c] = x_valid_lgb[c].astype('category')
    x_test_lgb[c] = x_test_lgb[c].astype('category')


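LightGBM and CatBoost setup: the same columns, with the categorical ones cast to the pandas `category` dtype (XGBoost is trained on the one-hot frames instead).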
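
Casting each split with `astype('category')` separately gives every split its own category list, so the same label can map to different integer codes. Some libraries re-map pandas categories at prediction time, in which case this is harmless; as a defensive sketch (toy data, hypothetical names), a shared `CategoricalDtype` built from the training split keeps the codes consistent:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.Series(["manual", "auto", "missing"])
valid = pd.Series(["manual", "missing"])

# Independent cast: validation builds its own category list.
independent_codes = valid.astype("category").cat.codes.tolist()

# Shared dtype derived from the training split only.
shared_dtype = CategoricalDtype(categories=sorted(train.unique()))
train_cat = train.astype(shared_dtype)
valid_cat = valid.astype(shared_dtype)
shared_codes = valid_cat.cat.codes.tolist()

print("independent:", independent_codes)
print("shared:", shared_codes)
```

Labels that never occur in the training split become NaN under the shared dtype, which makes unseen categories explicit.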

In [13]:
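def rmse(y_true, y_pred):
    # Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
    return np.sqrt(mean_squared_error(y_true, y_pred))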

class Timer:
    def __enter__(self):
        self.t0 = perf_counter()
        return self
    def __exit__(self, *exc):
        self.dt = perf_counter() - self.t0

results = []

best_cfg = {}
best_res = {}

def evaluate_model(name, model, x_tr, y_tr, x_va, y_va):
    row = {'model':name}
    with Timer() as t:
        model.fit(x_tr, y_tr)
    row['train_time_s'] = t.dt

    with Timer() as t:
        y_pred = model.predict(x_va)
    row['predict_time_s'] = t.dt
    row['rmse_valid'] = rmse(y_va, y_pred)

    results.append(row)
    print(row)
    return row

Helper Functions
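
For reference, RMSE is the square root of the mean squared error, so a single large miss weighs more than several small ones. A NumPy-only sketch on made-up prices (the function name `rmse_np` and the numbers are illustrative):

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [1000, 2000, 3000]
y_pred = [1000, 2000, 6000]  # two exact predictions, one miss of 3000

value = rmse_np(y_true, y_pred)
mae = float(np.mean(np.abs(np.array(y_true) - np.array(y_pred))))
print(round(value, 2), round(mae, 2))
```

Here the mean absolute error is 1000 while the RMSE is about 1732, which shows how the metric penalizes large errors.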

## Model training

In [14]:
dummy = DummyRegressor(strategy = 'median')
evaluate_model('Dummy', dummy, x_train_ohe, y_train, x_valid_ohe, y_valid)

lr = LinearRegression()
evaluate_model('Linear Regression', lr, x_train_ohe, y_train, x_valid_ohe, y_valid)

{'model': 'Dummy', 'train_time_s': 0.0023159270058386028, 'predict_time_s': 9.344398858956993e-05, 'rmse_valid': 4566.048012713643}
{'model': 'Linear Regression', 'train_time_s': 0.921471374007524, 'predict_time_s': 0.09579567798937205, 'rmse_valid': 3263.6568037602865}


{'model': 'Linear Regression',
 'train_time_s': 0.921471374007524,
 'predict_time_s': 0.09579567798937205,
 'rmse_valid': 3263.6568037602865}

### Baseline Models Summary

| Model | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|--------|----------------|------------------|-------------------|
| Dummy | 0.00 | 0.00 | 4566.05 |
| Linear Regression | 0.92 | 0.10 | **3263.66** |

#### Interpretation
- The **Dummy model** (predicts the median) gives a high RMSE (~4566), establishing a baseline reference.  
- **Linear Regression** improves RMSE by roughly **29%** (4566 → 3264), showing that the features carry real signal about price.
- Even so, a linear model on one-hot features cannot capture nonlinear effects or feature interactions (for example, between registration year and mileage).
- These results suggest that more flexible, tree-based models are needed for better prediction accuracy.
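
The relative improvement quoted above follows directly from the two validation RMSE values. A short sketch of the arithmetic (the helper name `relative_improvement` is illustrative):

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to a baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse

dummy_rmse = 4566.05   # validation RMSE of the median baseline
linear_rmse = 3263.66  # validation RMSE of Linear Regression

improvement = relative_improvement(dummy_rmse, linear_rmse)
print(f"{improvement:.1f}% lower RMSE than the baseline")
```

The same helper applies to any later pair of models, for example Decision Tree versus Random Forest.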

In [15]:
dt_configs = [
    {'max_depth':8, 'min_samples_leaf': 5},
    {'max_depth':12, 'min_samples_leaf':3},
    {'max_depth':16, 'min_samples_leaf':1}]

best_res['DecisionTree'] = {'rmse_valid':float('inf')}
for cfg in dt_configs:
    dt = DecisionTreeRegressor(random_state = random_state, **cfg)
    r = evaluate_model(f"DecisionTree{cfg}", dt, x_train_ohe, y_train, x_valid_ohe, y_valid)
    if r['rmse_valid'] < best_res['DecisionTree']['rmse_valid']:
        best_res['DecisionTree'] = r
        best_cfg['DecisionTree'] = cfg

{'model': "DecisionTree{'max_depth': 8, 'min_samples_leaf': 5}", 'train_time_s': 0.9546726949920412, 'predict_time_s': 0.015206447002128698, 'rmse_valid': 2147.329189449113}
{'model': "DecisionTree{'max_depth': 12, 'min_samples_leaf': 3}", 'train_time_s': 1.1813891289930325, 'predict_time_s': 0.014375790997291915, 'rmse_valid': 2008.624579618018}
{'model': "DecisionTree{'max_depth': 16, 'min_samples_leaf': 1}", 'train_time_s': 1.4411205069918651, 'predict_time_s': 0.018145797002944164, 'rmse_valid': 2072.2772544589666}


### Decision Tree Model Results Summary
| Max Depth | Min Samples Leaf | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|------------|------------------|----------------|------------------|-------------------|
| 8  | 5 | 0.95 | 0.015 | 2147.33 |
| 12 | 3 | 1.18 | 0.014 | **2008.62** |
| 16 | 1 | 1.44 | 0.018 | 2072.28 |

#### Interpretation
- The best validation RMSE (~2008.6) comes from **max_depth=12** and **min_samples_leaf=3**.
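- Going deeper (depth 16 with min_samples_leaf=1) worsens RMSE to ~2072, which suggests **overfitting**; because both parameters change together, the effect cannot be attributed to depth alone.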
- All models train and predict very quickly (under 2 seconds total).
- Overall, a Decision Tree gives solid baseline performance but will likely be outperformed by ensemble models like Random Forest, LightGBM, or CatBoost.
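
One way to check the overfitting reading is to compare train and validation RMSE at each depth: a training error that keeps falling while the validation error stalls or rises is the usual sign. The sketch below uses synthetic data (not the car dataset), and all names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(2000, 1))
y = 1000 * np.sin(X[:, 0]) + rng.normal(0, 300, size=2000)  # noisy nonlinear target

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

def rmse_np(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

scores = {}
for depth in (4, 8, 16):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (
        rmse_np(y_tr, tree.predict(X_tr)),  # error on data the tree has seen
        rmse_np(y_va, tree.predict(X_va)),  # error on held-out data
    )
    print(depth, [round(s, 1) for s in scores[depth]])
```

Logging the training RMSE alongside `rmse_valid` in `evaluate_model` would give the same diagnostic for the real data.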


In [16]:
rf_configs = [
    {'n_estimators':50, 'max_depth': 10, 'min_samples_leaf':1},
    {'n_estimators':100, 'max_depth':20, 'min_samples_leaf':2}]

best_res['RandomForest'] = {'rmse_valid':float('inf')}
for cfg in rf_configs:
    rf = RandomForestRegressor(random_state = random_state, **cfg)
    r = evaluate_model(f"RandomForest{cfg}", rf, x_train_ohe, y_train, x_valid_ohe, y_valid)
    if r['rmse_valid'] < best_res['RandomForest']['rmse_valid']:
        best_res['RandomForest'] = r
        best_cfg['RandomForest'] = cfg

{'model': "RandomForest{'n_estimators': 50, 'max_depth': 10, 'min_samples_leaf': 1}", 'train_time_s': 32.80775443300081, 'predict_time_s': 0.1782479319954291, 'rmse_valid': 1965.6956770703114}
{'model': "RandomForest{'n_estimators': 100, 'max_depth': 20, 'min_samples_leaf': 2}", 'train_time_s': 103.07044502699864, 'predict_time_s': 1.070022229992901, 'rmse_valid': 1738.862410774293}


### Random Forest Model Results Summary
| n_estimators | max_depth | min_samples_leaf | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|---------------|------------|------------------|----------------|------------------|-------------------|
| 50  | 10 | 1 | 32.81 | 0.18 | 1965.70 |
| 100 | 20 | 2 | 103.07 | 1.07 | **1738.86** |

#### Interpretation
- The best RMSE (~1738.9) occurs with **100 trees** and **max_depth=20**.
- Larger, deeper forests improve accuracy but take longer to train (≈33 s → ≈103 s), a tradeoff between quality and speed.
- Compared to the Decision Tree (RMSE ≈ 2008.6), Random Forest improves accuracy by ~13%.
- Prediction takes about 1 s for the whole validation set, which is fast enough for a car price app that needs a balance between quality and speed.

In [17]:
lgb_configs = [
    {'n_estimators':400, 'learning_rate': 0.05, 'num_leaves': 63, 'subsample':0.8, 'colsample_bytree':0.8},
    {'n_estimators':800, 'learning_rate': 0.05, 'num_leaves': 63, 'subsample':0.8, 'colsample_bytree':0.8}]

best_res['LightGBM'] = {'rmse_valid':float('inf')}
for cfg in lgb_configs:
    lgbm = LGBMRegressor(random_state = random_state, **cfg)
    r = evaluate_model(f"LightGBM{cfg}", lgbm, x_train_lgb, y_train, x_valid_lgb, y_valid)
    if r['rmse_valid'] < best_res['LightGBM']['rmse_valid']:
        best_res['LightGBM'] = r
        best_cfg['LightGBM'] = cfg

{'model': "LightGBM{'n_estimators': 400, 'learning_rate': 0.05, 'num_leaves': 63, 'subsample': 0.8, 'colsample_bytree': 0.8}", 'train_time_s': 7.62729413001216, 'predict_time_s': 0.9206472319929162, 'rmse_valid': 1728.5443172106156}
{'model': "LightGBM{'n_estimators': 800, 'learning_rate': 0.05, 'num_leaves': 63, 'subsample': 0.8, 'colsample_bytree': 0.8}", 'train_time_s': 12.800469480003812, 'predict_time_s': 2.3029470030014636, 'rmse_valid': 1700.0932390758599}


### LightGBM Model Results Summary
| n_estimators | learning_rate | num_leaves | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|---------------|----------------|-------------|----------------|------------------|-------------------|
| 400 | 0.05 | 63 | 7.63 | 0.92 | 1728.54 |
| 800 | 0.05 | 63 | 12.80 | 2.30 | **1700.09** |

#### Interpretation
- The best RMSE (~1700.1) comes from **800 estimators**.
- Training is fast (≈12.8 s) compared to Random Forest (≈103 s).
- Prediction takes ≈2.3 s on the validation set, the slowest of the models so far, though still fast in absolute terms.
- LightGBM currently offers the best balance between **accuracy** and **training time**; prediction speed is its weak point.


In [18]:
cb_configs = [
    {'iterations':500, 'learning_rate': 0.05, 'depth': 8, 'verbose':False},
    {'iterations':800, 'learning_rate': 0.05, 'depth': 8, 'verbose':False}]

best_res['CatBoost'] = {'rmse_valid':float('inf')}
for cfg in cb_configs:
    catb = CatBoostRegressor(random_state = random_state, cat_features=categorical_cols_in_selected, **cfg)
    r = evaluate_model(f"CatBoost{cfg}", catb, x_train_lgb, y_train, x_valid_lgb, y_valid)
    if r['rmse_valid'] < best_res['CatBoost']['rmse_valid']:
        best_res['CatBoost'] = r
        best_cfg['CatBoost'] = cfg

{'model': "CatBoost{'iterations': 500, 'learning_rate': 0.05, 'depth': 8, 'verbose': False}", 'train_time_s': 81.2829886059917, 'predict_time_s': 0.0845568020013161, 'rmse_valid': 1780.0245275773611}
{'model': "CatBoost{'iterations': 800, 'learning_rate': 0.05, 'depth': 8, 'verbose': False}", 'train_time_s': 130.12156948399206, 'predict_time_s': 0.09765255100501236, 'rmse_valid': 1756.2366665762947}


### CatBoost Model Results Summary
| iterations | learning_rate | depth | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|-------------|----------------|--------|----------------|------------------|-------------------|
| 500 | 0.05 | 8 | 81.28 | 0.08 | 1780.02 |
| 800 | 0.05 | 8 | 130.12 | 0.10 | **1756.24** |

#### Interpretation
- The best RMSE (~1756.2) comes from **800 iterations**.
- Training time rises notably (81 → 130 s) while predictions stay extremely fast (<0.1 s).
- CatBoost handles categorical features natively, simplifying preprocessing.
- RMSE is about 3% higher than LightGBM's (1756 vs 1700), and training takes roughly 10 times longer.

In [19]:
xgb_configs = [
    {'n_estimators':500, 'learning_rate': 0.05, 'max_depth': 8, 'subsample':0.8, 'colsample_bytree':0.8},
    {'n_estimators':800, 'learning_rate': 0.05, 'max_depth': 8, 'subsample':0.8, 'colsample_bytree':0.8}]

best_res['XGBoost'] = {'rmse_valid':float('inf')}
for cfg in xgb_configs:
    xgb = XGBRegressor(random_state = random_state, **cfg)
    r = evaluate_model(f"XGBoost{cfg}", xgb, x_train_ohe, y_train, x_valid_ohe, y_valid)
    if r['rmse_valid'] < best_res['XGBoost']['rmse_valid']:
        best_res['XGBoost'] = r
        best_cfg['XGBoost'] = cfg

{'model': "XGBoost{'n_estimators': 500, 'learning_rate': 0.05, 'max_depth': 8, 'subsample': 0.8, 'colsample_bytree': 0.8}", 'train_time_s': 302.44262059699395, 'predict_time_s': 0.4552787879947573, 'rmse_valid': 1702.2454559362875}
{'model': "XGBoost{'n_estimators': 800, 'learning_rate': 0.05, 'max_depth': 8, 'subsample': 0.8, 'colsample_bytree': 0.8}", 'train_time_s': 474.54050233100133, 'predict_time_s': 0.7502340719947824, 'rmse_valid': 1683.2189925780533}


### XGBoost Model Results Summary

| n_estimators | learning_rate | max_depth | Train Time (s) | Predict Time (s) | RMSE (Validation) |
|---------------|----------------|------------|----------------|------------------|-------------------|
| 500 | 0.05 | 8 | 302.44 | 0.46 | 1702.25 |
| 800 | 0.05 | 8 | 474.54 | 0.75 | **1683.22** |

#### Interpretation
- The **best RMSE** (~1683.2) occurs with **800 estimators**, a small improvement over 500 (~1702.2).
- Training time is long (≈5 minutes for 500 trees, ≈8 minutes for 800) while predictions remain fast (<1 s).
- Accuracy is within about 1% of LightGBM (1683 vs 1700), so both are the top-performing boosting models here.
- In this setup XGBoost takes roughly 37 times longer to train than LightGBM for that ~1% gain in RMSE.


## Model analysis

In [20]:
summary = pd.DataFrame(results).sort_values('rmse_valid').reset_index(drop=True)
summary

Unnamed: 0,model,train_time_s,predict_time_s,rmse_valid
0,"XGBoost{'n_estimators': 800, 'learning_rate': ...",474.540502,0.750234,1683.218993
1,"LightGBM{'n_estimators': 800, 'learning_rate':...",12.800469,2.302947,1700.093239
2,"XGBoost{'n_estimators': 500, 'learning_rate': ...",302.442621,0.455279,1702.245456
3,"LightGBM{'n_estimators': 400, 'learning_rate':...",7.627294,0.920647,1728.544317
4,"RandomForest{'n_estimators': 100, 'max_depth':...",103.070445,1.070022,1738.862411
5,"CatBoost{'iterations': 800, 'learning_rate': 0...",130.121569,0.097653,1756.236667
6,"CatBoost{'iterations': 500, 'learning_rate': 0...",81.282989,0.084557,1780.024528
7,"RandomForest{'n_estimators': 50, 'max_depth': ...",32.807754,0.178248,1965.695677
8,"DecisionTree{'max_depth': 12, 'min_samples_lea...",1.181389,0.014376,2008.62458
9,"DecisionTree{'max_depth': 16, 'min_samples_lea...",1.441121,0.018146,2072.277254
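
The `predict_time_s` values above cover the whole validation set, which holds about 56,000 rows (16% of the 350,925 filtered rows). Dividing by the row count gives a rough per-row cost. This is only a sketch: a single-row call in the app carries fixed overhead that batch timing hides, so these figures are optimistic.

```python
# Validation set size implied by the two 80/20 splits of the 350,925 filtered rows.
n_valid_rows = round(350_925 * 0.8 * 0.2)

# Batch prediction times in seconds, copied from the summary table above.
predict_time_s = {
    "XGBoost (800)": 0.750234,
    "LightGBM (800)": 2.302947,
    "RandomForest (100)": 1.070022,
    "CatBoost (800)": 0.097653,
}

# Average cost per row in microseconds.
per_row_us = {name: 1e6 * t / n_valid_rows for name, t in predict_time_s.items()}

for name, us in sorted(per_row_us.items(), key=lambda kv: kv[1]):
    print(f"{name}: {us:.1f} microseconds per row")
```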


In [21]:
def best_from_summary(family_keyword):
    s = summary[summary['model'].str.contains(family_keyword, regex=False)]
    if len(s):
        row = s.iloc[0]
        return float(row['rmse_valid']), str(row['model'])
    return float('inf'), None

family_best_scores = []
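# Baselines are looked up by the label passed to evaluate_model; note the space in 'Linear Regression'
for fam, keyword in [('Dummy', 'Dummy'), ('LinearRegression', 'Linear Regression')]:
    rmse_val, model_name = best_from_summary(keyword)
    if np.isfinite(rmse_val):
        family_best_scores.append((fam, rmse_val, model_name))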

for fam in ['DecisionTree','RandomForest','LightGBM','CatBoost','XGBoost']:
    if fam in best_res and np.isfinite(best_res[fam]['rmse_valid']):
        family_best_scores.append((fam, best_res[fam]['rmse_valid'], None))

                        
best_family, best_val_rmse, best_model_label = sorted(family_best_scores, key=lambda x: x[1])[0]
print('Best family by valid RMSE:', best_family)
print('Validation RMSE:', round(best_val_rmse, 2))
print('Best config (if tuned):', best_cfg.get(best_family, '(baseline/no tuning)'))

Best family by valid RMSE: XGBoost
Validation RMSE: 1683.22
Best config (if tuned): {'n_estimators': 800, 'learning_rate': 0.05, 'max_depth': 8, 'subsample': 0.8, 'colsample_bytree': 0.8}


In [22]:
# LightGBM and CatBoost use the category-dtype frames; all other families use the one-hot frames
if best_family in ['LightGBM','CatBoost']:
    x_fit = pd.concat([x_train_lgb, x_valid_lgb], axis=0)
    # concat can fall back to object dtype when the splits' category sets differ, so re-cast
    for c in categorical_cols_in_selected:
        x_fit[c] = x_fit[c].astype('category')
    x_test_use = x_test_lgb
else:
    x_fit = pd.concat([x_train_ohe, x_valid_ohe], axis=0)
    x_test_use = x_test_ohe
y_fit = pd.concat([y_train, y_valid], axis=0)


if best_family == 'LightGBM':
    champ = LGBMRegressor(random_state=random_state, **best_cfg['LightGBM'])
elif best_family == 'CatBoost':
    # cat_features must be passed again, as it was during validation
    champ = CatBoostRegressor(random_state=random_state, cat_features=categorical_cols_in_selected, **best_cfg['CatBoost'])
elif best_family == 'XGBoost':
    champ = XGBRegressor(random_state=random_state, **best_cfg['XGBoost'])
elif best_family == 'RandomForest':
    champ = RandomForestRegressor(random_state=random_state, **best_cfg['RandomForest'])
elif best_family == 'DecisionTree':
    champ = DecisionTreeRegressor(random_state=random_state, **best_cfg['DecisionTree'])
elif best_family == 'LinearRegression':
    champ = LinearRegression()
else:
    champ = DummyRegressor(strategy='median')

In [23]:
with Timer() as t_train:
    champ.fit(x_fit, y_fit)
with Timer() as t_pred:
    y_pred_test = champ.predict(x_test_use)

rmse_test = rmse(y_test, y_pred_test)

print('\n=== FINAL TEST RESULTS ===')
print('Champion family:', best_family)
print('Champion params:', best_cfg.get(best_family, '(baseline/no tuning)'))
print('Test RMSE:', round(rmse_test, 2))
print('Train time (s):', round(t_train.dt, 3))
print('Predict time (s):', round(t_pred.dt, 3))

CHAMP_FAMILY = best_family
CHAMP_MODEL = champ
X_TEST_USE = x_test_use


=== FINAL TEST RESULTS ===
Champion family: XGBoost
Champion params: {'n_estimators': 800, 'learning_rate': 0.05, 'max_depth': 8, 'subsample': 0.8, 'colsample_bytree': 0.8}
Test RMSE: 1716.25
Train time (s): 595.15
Predict time (s): 0.928


## Final Model Comparison Summary

| Model | Best Configuration (simplified) | RMSE (Validation) | Train Time (s) | Predict Time (s) | Notes |
|--------|----------------------------------|-------------------|----------------|------------------|-------|
| Dummy | n/a | 4566.05 | 0.00 | 0.00 | Baseline only |
| Linear Regression | n/a | 3263.66 | 0.92 | 0.10 | Sanity check: simple linear fit |
| Decision Tree | max_depth=12, min_samples_leaf=3 | 2008.62 | 1.18 | 0.01 | Fast; the deeper configuration overfits |
| Random Forest | n_estimators=100, max_depth=20 | 1738.86 | 103.07 | 1.07 | 13% better than DT, but slower |
| LightGBM | n_estimators=800, lr=0.05, num_leaves=63 | 1700.09 | 12.80 | 2.30 | Best balance of training speed and accuracy; slowest prediction |
| CatBoost | iterations=800, lr=0.05, depth=8 | 1756.24 | 130.12 | 0.10 | Native categorical handling; fastest prediction among boosted models |
| XGBoost | n_estimators=800, lr=0.05, depth=8 | **1683.22** | 474.54 | 0.75 | Lowest RMSE, but slowest training |


### Interpretation
- **LightGBM** and **XGBoost** are the top performers: both reach a validation RMSE of about **1700** (1700.09 and 1683.22), far ahead of the simpler models.
- **LightGBM** trains fastest among the ensembles (≈13 s) with near-best accuracy, but has the slowest prediction (≈2.3 s on the validation set).
- **XGBoost** achieves an RMSE about 1% lower at the cost of a **much longer training time** (≈475 s).
- **CatBoost** needs minimal preprocessing and has the fastest inference among the boosted models (≈0.1 s), with a slightly higher RMSE.
- The gap between the linear model and the tree-based ensembles suggests strong nonlinear relationships in the data.

**Final Choice for Deployment:** `LightGBM`, for its tradeoff between **accuracy** and **training time** in Rusty Bargain’s car price prediction app. Note that the test RMSE reported above (1716.25) was measured for XGBoost, the validation champion; LightGBM has not been scored on the test set in this notebook.
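
The choice can be written down as an explicit rule: shortlist every model whose validation RMSE is within a tolerance of the best one, then take the fastest to train. The 2% tolerance below is a hypothetical threshold, not something stated in the requirements; the numbers are the logged results for the best configuration of each ensemble.

```python
candidates = [
    {"model": "RandomForest", "rmse_valid": 1738.86, "train_time_s": 103.07, "predict_time_s": 1.07},
    {"model": "LightGBM", "rmse_valid": 1700.09, "train_time_s": 12.80, "predict_time_s": 2.30},
    {"model": "CatBoost", "rmse_valid": 1756.24, "train_time_s": 130.12, "predict_time_s": 0.10},
    {"model": "XGBoost", "rmse_valid": 1683.22, "train_time_s": 474.54, "predict_time_s": 0.75},
]

rmse_tolerance = 0.02  # hypothetical: accept models within 2% of the best RMSE

best_rmse = min(c["rmse_valid"] for c in candidates)
shortlist = [c for c in candidates if c["rmse_valid"] <= best_rmse * (1 + rmse_tolerance)]

# Among models that are accurate enough, prefer the shortest training time.
choice = min(shortlist, key=lambda c: c["train_time_s"])
print([c["model"] for c in shortlist], "->", choice["model"])
```

Swapping the key to `predict_time_s` would favor XGBoost within the same shortlist, so the outcome depends on which of the three criteria is weighted most.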
