Regression of Used Car Prices. Playground Series - Season 4, Episode 9. Private Score: 63313.29307. Public Score: 72411.72359.

This notebook builds three simple gradient-boosting models: XGBoost (xgb), LightGBM (lgbm) and CatBoost (cat). Hyperparameters are tuned for each of them, and the blending weights are optimized with Optuna. A CV score of 72584 is reached on the full train dataset. Main references include:

1. Used Car Price - Analytics & Predictions by Usman Bashir, https://www.kaggle.com/code/usmanbashir1/used-car-price-analytics-predictions .
2. How to Remove Outliers for Machine Learning, https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/ .
3. S4E9 Exotics Identification? No & Proof by Todd Gardiner, https://www.kaggle.com/code/toddgardiner/s4e9-exotics-identification-no-proof/notebook .
4. [AutoML GrandPrix] 3rd Place Solution Write Up by lash_fire, https://github.com/Dhanush-M555/AutoML_GrandPrix_2024_My_Solutions/blob/main/%5BAutoML%20GrandPrix%5D%203rd%20Place%20Solution.ipynb .
5. Residual Analysis in Regression, https://stattrek.com/regression/residual-analysis .
6. Stacking: XGB + LGBM + CATB + ANN by rıza temizel, https://www.kaggle.com/code/rzatemizel/stacking-xgb-lgbm-catb-ann#Voting-vs-Stacking .
7. PS4E5| Simple Blending + Detailed Explanation by Trupolog Helper, https://www.kaggle.com/code/trupologhelper/ps4e5-simple-blending-detailed-explanation .
8. Visualize - Price vs. Model_Year and Price vs. Milage by Chris Deotte, https://www.kaggle.com/competitions/playground-series-s4e9/discussion/532952 .

In [None]:
#%load_ext cudf.pandas
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_theme(style="ticks", palette="pastel")
import matplotlib.pyplot as plt

import re
import sklearn
from sklearn import preprocessing
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler, RobustScaler
from numpy import percentile
from sklearn.decomposition import PCA

from sklearn.linear_model import HuberRegressor
from xgboost import XGBRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool

from sklearn.model_selection import KFold
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
import gc
import warnings
warnings.filterwarnings("ignore")

import optuna
from optuna.visualization import plot_param_importances
from optuna.samplers import RandomSampler, TPESampler, CmaEsSampler
from optuna.pruners import HyperbandPruner
from functools import partial

# print('The scikit-learn version is {}.'.format(sklearn.__version__))

In [None]:
pip install cmaes

In [None]:
random_seed = 0
n_fold = 5

# 1. Import and Glance at the Data.

In [None]:
train = pd.read_csv("/kaggle/input/playground-series-s4e9/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s4e9/test.csv")
original = pd.read_csv("/kaggle/input/used-car-price-prediction-dataset/used_cars.csv")

The train dataset has 188533 entries and the test dataset has 125690 entries. Three columns, "fuel_type", "accident" and "clean_title", have null values in both the train and test datasets. The numerical features are "model_year" and "milage". All the other features are categorical: "brand", "model", "fuel_type", "engine", "transmission", "ext_col", "int_col", "accident" and "clean_title". The target variable is "price".

The original dataset has 4009 samples. Only the "model_year" column is numerical. All the others, including "milage" and "price", are stored as strings. The same three columns, "fuel_type", "accident" and "clean_title", have null values.

.info(), .describe(), .isnull().sum() and .unique() are frequently used throughout the notebook to check intermediate results.
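
As a rough illustration of those checks, the sketch below runs them on a tiny made-up frame. The frame `toy` and its values are hypothetical stand-ins, not rows from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train.csv.
toy = pd.DataFrame({
    "brand": ["Ford", "BMW", "Ford", "Audi"],
    "fuel_type": ["Gasoline", np.nan, "Hybrid", "Gasoline"],
    "accident": ["None reported", "None reported", np.nan, np.nan],
    "milage": [52000, 98000, 12000, 40500],
})

toy.info()                          # dtypes and non-null counts per column
null_counts = toy.isnull().sum()    # number of missing values per column
n_brands = toy["brand"].nunique()   # number of distinct categories

print(null_counts)
print(toy.describe())               # summary statistics of the numeric columns
print(toy["brand"].unique())
```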

In [None]:
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# 2. Feature Engineering.

Convert the milage and price columns of the original dataset from strings to numbers. New features: age, milage_per_year, horsepower, engine_displacement, no_of_cylinder, transmission_speed and transmission_type. Drop the engine, transmission and model_year columns.
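
To make the engine parsing concrete, here is a minimal sketch on single strings using the standard `re` module. The helper `parse_engine`, the sample strings and the slightly broader patterns are assumptions for illustration; the notebook itself uses pandas `.str.extract` with its own patterns.

```python
import re

# Hypothetical engine description in the style of the dataset.
engine = "172.0HP 1.6 Liter 4 Cylinder Engine Gasoline Fuel"

def parse_engine(text):
    """Return (horsepower, displacement in litres, cylinders); None where absent."""
    hp = re.search(r"(\d+(?:\.\d+)?)HP", text)
    # Accept "1.6L", "1.6 L", "1.6 Liter" and "1.6 Litre".
    disp = re.search(r"(\d+(?:\.\d+)?)\s*(?:L\b|Liter|Litre)", text)
    # Accept "4 Cylinder" as well as layouts such as "V6", "V-8", "I4", "H4", "W12".
    cyl = re.search(r"(\d+)\s+Cylinder|\b[VIHW]-?(\d+)\b", text)
    return (
        float(hp.group(1)) if hp else None,
        float(disp.group(1)) if disp else None,
        int(cyl.group(1) or cyl.group(2)) if cyl else None,
    )

print(parse_engine(engine))
print(parse_engine("3.0L V6 Cylinder Engine Flexible Fuel Capability"))
```

Fields that cannot be found come back as None, which corresponds to the missing values that are filled in later in the notebook.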

In [None]:
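# regex=False treats ' mi.' and '$' as literal text rather than as regular expressions.
original['milage'] = original['milage'].str.replace(',', '', regex=False).str.replace(' mi.', '', regex=False).astype(float)
original['price'] = original['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)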

In [None]:
def engine_feat_extract(data):
    df=data.copy()
    
    df['horsepower'] = df['engine'].str.extract(r'(\d+\.\d+)HP')
    
    df['engine'] = df['engine'].apply(lambda x: x.replace(' Litre', 'L'))
    df['engine'] = df['engine'].apply(lambda x: x.replace(' Liter', 'L'))
    df['engine'] = df['engine'].apply(lambda x: x.replace(' L', 'L'))
    df['engine_displacement'] = df['engine'].str.extract(r'(\d+\.\d+)L')
    
    df['engine'] = df['engine'].apply(lambda x: x.replace('V-', 'V'))
    df['no_of_cylinder'] = df['engine'].str.extract(r'( \d+ | V\d+ | I\d+ | W\d+ | H\d+ |I\d+ |V\d+ |V\d+|I\d+)')
    df['no_of_cylinder'] = df['no_of_cylinder'].str.strip()
    df['no_of_cylinder'] = df['no_of_cylinder'].str.replace('V','').str.replace('I','').str.replace('H','').str.replace('W','')
    
    df.drop('engine', axis=1, inplace=True)
    
    return df

In [None]:
def transmission_feat_extract(data):
    df=data.copy()
    
    df['transmission'] = df['transmission'].str.replace('Automatic','AT').str.replace('A/T','AT').str.replace('At','AT')
    df['transmission'] = df['transmission'].str.replace('M/T','MT').str.replace('Mt','MT').str.replace('Manual', 'MT')
    df['transmission'] = df['transmission'].str.replace('Auto', 'AT')
    df['transmission'] = df['transmission'].str.replace('-Spd', '-Speed').str.replace('-SPEED', '-Speed')
    df['transmission'] = df['transmission'].str.replace('Single-Speed', '1-Speed').str.replace(' Speed', '-Speed')
    
    df['transmission_speed'] = df['transmission'].str.extract(r'(\d+)-Speed')
    df['transmission_speed'] = df['transmission_speed'].str.strip()
    
    df['transmission_type'] = df['transmission']
    # Apply the most specific label last: 'AT/MT' also contains 'AT' and 'MT' and would otherwise be overwritten.
    df.loc[df.transmission.str.contains('AT', na=False, case=False), 'transmission_type'] = 'AT'
    df.loc[df.transmission.str.contains('MT', na=False, case=False), 'transmission_type'] = 'MT'
    df.loc[df.transmission.str.contains('AT/MT', na=False, case=False), 'transmission_type'] = 'AT/MT'
    df['transmission_type'] = df['transmission_type'].str.replace('1-Speed Fixed Gear', 'Fixed Gear')
    df.loc[df.transmission_type.str.contains('6-Speed', na=False, case=False), 'transmission_type'] = np.nan
    df.loc[df.transmission_type.str.contains('7-Speed', na=False, case=False), 'transmission_type'] = np.nan
    
    #df['transmission_type'] = df['transmission'].str.replace(r'([0-9]+-Speed)', '', regex=True)
    
    df.drop('transmission', axis=1, inplace=True)

    return df

In [None]:
def feature_engineering(data):
    df=data.copy()
    
    df = engine_feat_extract(df)
    df = transmission_feat_extract(df)
    
    df['age'] = 2025 - df['model_year']
    df.drop(['model_year'], axis=1, inplace=True)
    
    df['milage_per_year'] = df['milage'] / df['age']
    
    df['brand'] = df['brand'].str.lower()
    df['model'] = df['model'].str.lower()
    df['ext_col'] = df['ext_col'].str.lower()
    df['int_col'] = df['int_col'].str.lower()
    df['transmission_type'] = df['transmission_type'].str.lower()
    
    return df

In [None]:
original = feature_engineering(original)
train = feature_engineering(train)
test = feature_engineering(test)

Now all the datasets have 15 features. 

# 3. Check Outliers. 

The minimum and maximum car ages across all three datasets are 1 and 51. The data look normal, so there is no need to remove any samples.

In [None]:
fig, axes = plt.subplots(1, 3)
sns.boxplot(data=original[['age']], ax=axes[0]).set_title('Original')
sns.boxplot(data=train[['age']],ax=axes[1]).set_title('Train')
sns.boxplot(data=test[['age']],ax=axes[2]).set_title('Test')
fig.tight_layout()
plt.show()

The minimum and maximum milages across the three datasets are 100 and 405000. The data look normal, so there is no need to remove any samples.

In [None]:
fig, axes = plt.subplots(1, 3)
sns.boxplot(data=original[['milage']],ax=axes[0]).set_title('Original')
sns.boxplot(data=train[['milage']],ax=axes[1]).set_title('Train')
sns.boxplot(data=test[['milage']],ax=axes[2]).set_title('Test')
fig.tight_layout()
plt.show()

The three highest prices in both the original and train datasets are 2954083, 1950995 and 1599000. All the other prices are below 1,000,000. Because the test dataset may contain such outliers as well, we cannot remove them. Keep all the samples for now.
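
For reference, the interquartile-range rule from reference 2 can be sketched as follows. The `prices` array is hypothetical, and the rule is only used here to flag candidates, not to drop them.

```python
import numpy as np

# Hypothetical prices; the last value plays the role of an extreme listing.
prices = np.array(
    [8_500, 12_000, 15_500, 21_000, 27_500, 33_000, 41_000, 2_954_083],
    dtype=float,
)

q25, q75 = np.percentile(prices, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr   # Tukey fences

is_outlier = (prices < lower) | (prices > upper)
print(f"fences: [{lower:.0f}, {upper:.0f}]")
print("flagged:", prices[is_outlier])
```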

In [None]:
fig, axes = plt.subplots(1, 2)
sns.boxplot(data=original[['price']],ax=axes[0]).set_title('Original')
sns.boxplot(data=train[['price']],ax=axes[1]).set_title('Train')
fig.tight_layout()
plt.show()

In [None]:
#fig, axes = plt.subplots(1, 2)
#sns.histplot(data=original['price'], kde=True, ax=axes[0]).set_title('Original')
#sns.histplot(data=original['price'], kde=True, ax=axes[1]).set_title('Train')
#fig.tight_layout()
#plt.show()

#data = original['price']
#q25, q75 = percentile(data, 25), percentile(data, 75)
#iqr = q75 - q25
#cut_off = iqr * 1.5
#lower, upper = q25 - cut_off, q75 + cut_off
#original = original[(data >= lower) & (data <= upper)].reset_index()
#original.drop('index', axis=1, inplace=True)

#pca = PCA()
#pca.fit(X_original)
#pd.DataFrame(pca.transform(X_original))

# 4. Fill in Missing Values.

Both numerical ('horsepower', 'engine_displacement', 'no_of_cylinder' and 'transmission_speed') and categorical ('fuel_type', 'accident', 'clean_title' and 'transmission_type') missing values need to be handled.

In [None]:
#fig, axes = plt.subplots(1, 3)
#sns.boxplot(data=original[['horsepower']],ax=axes[0]).set_title('Original')
#sns.boxplot(data=train[['horsepower']],ax=axes[1]).set_title('Train')
#sns.boxplot(data=test[['horsepower']],ax=axes[2]).set_title('Test')
#fig.tight_layout()
#plt.show()

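Missing values in the numerical columns are filled with each column's mode. For the categorical columns that contain nulls, OrdinalEncoder() with encoded_missing_value set to -1 maps missing values to -1. Compared to SimpleImputer and KNNImputer, I think OrdinalEncoder() makes sense for this dataset. The remaining categorical variables are also turned into numbers with OrdinalEncoder().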
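
The -1 convention can be illustrated with pandas categorical codes, which also assign -1 to missing values. This is a swapped-in technique rather than the notebook's OrdinalEncoder(), and `encode_shared`, `train_toy` and `test_toy` are hypothetical names. The sketch builds one category list from all frames so that a label maps to the same integer in every dataset; fill_missing, by contrast, fits a separate encoder on each call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the datasets.
train_toy = pd.DataFrame({"fuel_type": ["Gasoline", "Hybrid", np.nan, "Diesel"]})
test_toy = pd.DataFrame({"fuel_type": ["Hybrid", "Gasoline", np.nan]})

def encode_shared(frames, col):
    """Encode `col` with one category list shared by all frames; NaN becomes -1."""
    categories = sorted(pd.concat([f[col] for f in frames]).dropna().unique())
    return [
        pd.Series(pd.Categorical(f[col], categories=categories).codes, index=f.index)
        for f in frames
    ]

train_codes, test_codes = encode_shared([train_toy, test_toy], "fuel_type")
print(train_codes.tolist())
print(test_codes.tolist())
```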

In [None]:
def fill_missing(data):
    df = data.copy()
    
    for col in ['horsepower', 'engine_displacement', 'no_of_cylinder', 'transmission_speed']:
        df[[col]] = df[[col]].fillna(df[col].mode()[0])
        df[[col]] = df[[col]].astype(float)
                
    for col in ['fuel_type','accident','clean_title','transmission_type']:
        enc = preprocessing.OrdinalEncoder(encoded_missing_value=-1)
        df[[col]] = enc.fit_transform(df[[col]])    
        
    #df['fuel_type'] = df['fuel_type'].fillna('none')
    #df['accident'] = df['accident'].fillna('empty')
    #df['clean_title'] = df['clean_title'].fillna('empty')    
        
    for col in ['brand','model','ext_col','int_col']:
        enc = preprocessing.OrdinalEncoder()
        df[[col]] = enc.fit_transform(df[[col]])            
        
    return df    

In [None]:
original = fill_missing(original)
train = fill_missing(train)
test = fill_missing(test)

# 5. Relationships among Features.

Now the datasets have no missing values. We are ready to explore the relationships among the features.

In [None]:
#original.corr()
#train.corr()
#original.hist()
#train.hist()
#test.hist()

# 6. Build Models.

In [None]:
X_train = train.drop(['price'], axis=1)
y_train = train['price']

X_original = original.drop(['price'], axis=1)
y_original = original['price']

X_test = test

In [None]:
scaler = StandardScaler()
scaler.fit(pd.concat([X_train, X_original], axis=0))

In [None]:
X_test = scaler.transform(X_test)

Three models are built. Hyperparameters are tuned by Optuna for each of them.

* 6.1 XGB

In [None]:
#fig = plot_param_importances(study)
#fig.show()

In [None]:
def objective(trial):
    params = {
        'eta': 0.02,
        'n_estimators': 10000,
        'alpha': trial.suggest_float('alpha', 0., 1.0),
        'lambda': trial.suggest_float('lambda', 1., 100.0),
        'subsample': trial.suggest_float('subsample', 0., 1.0), #row-wise sampling
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0., 1.0), #column-wise sampling
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_child_weight': trial.suggest_float("min_child_weight", 1., 50.),
        'gamma': trial.suggest_float('gamma', 0., 1.0),
        #'max_bin': trial.suggest_int('max_bin', 20, 400), # Gpu does not accept customized max_bin.
        'tree_method': 'gpu_hist',
        'eval_metric': 'rmse',
        'random_state': random_seed,
        'objective': 'reg:squarederror',
        'booster': 'gbtree',
        'grow_policy': 'lossguide',
        'verbosity': 1,
        'device': 'gpu',
    }
    
    cv = KFold(n_fold, shuffle=True, random_state=random_seed)
    cv_splits = cv.split(X_train, y_train)
    
    val_preds = np.zeros(len(X_train))

    model = XGBRegressor(**params)
    
    for train_idx, val_idx in cv_splits:
        X_train_fold = pd.concat([X_train.iloc[train_idx], X_original], axis=0)
        y_train_fold = pd.concat([y_train.iloc[train_idx], y_original], axis=0)
        X_val_fold, y_val_fold = X_train.iloc[val_idx], y_train.iloc[val_idx]
    
        X_train_fold = scaler.transform(X_train_fold)
        X_val_fold = scaler.transform(X_val_fold)
        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], early_stopping_rounds=100, verbose=False)
                
        val_preds[val_idx] = model.predict(X_val_fold)
        gc.collect()
        
    # Score the out-of-fold predictions once, after every fold has been filled in.
    rmse_full = mean_squared_error(y_train, val_preds, squared=False)
    
    return rmse_full

In [None]:
study_name = "xgb"
sampler = TPESampler(multivariate=True, group=True, seed=random_seed)

# Settings that are not tuned. study.best_params holds only the tuned values,
# so the fixed settings are merged back in to form the final parameter set.
xgb_fixed_params = {
    'eta': 0.02,
    'n_estimators': 10000,
    'tree_method': 'gpu_hist',
    'eval_metric': 'rmse',
    'random_state': random_seed,
    'objective': 'reg:squarederror',
    'booster': 'gbtree',
    'grow_policy': 'lossguide',
    'verbosity': 1,
    'device': 'gpu',
}

optimize = False
if optimize:
    study = optuna.create_study(study_name=study_name, sampler=sampler, direction="minimize", load_if_exists=True)
    study.optimize(objective, timeout=3600*4)
    
    print(f"best optimized rmse: {study.best_value:0.5f}") #72612
    print(f"best hyperparameters: {study.best_params}") 
    xgb_tuned_params = study.best_params
else:
    # Hard-coded tuned values, used when the search is skipped.
    xgb_tuned_params = {
        'alpha': 0.40415346051079754, 
        'lambda': 99.36388385514442, 
        'subsample': 0.9323744584461396, 
        'colsample_bytree': 0.41058416064357234, 
        'max_depth': 11, 
        'min_child_weight': 33.5644386196058, 
        'gamma': 0.2796042692041842,
    }

xgb_params = {**xgb_fixed_params, **xgb_tuned_params}

* 6.2 LGBM

In [None]:
#fig = plot_param_importances(study)
#fig.show()

In [None]:
def objective(trial):
    params = {
        'learning_rate': 0.01,
        'n_estimators': 2000,
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'min_child_samples': trial.suggest_int('min_child_samples', 20, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 13),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 1.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 1.0, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0), #column-wise sampling
        'subsample': trial.suggest_float('subsample', 0.5, 1.0), #row-wise sampling
        'min_gain_to_split': trial.suggest_float('min_gain_to_split', 0.1, 1.0),
        'bin_construct_sample_cnt': trial.suggest_int('bin_construct_sample_cnt', 20000, 300000),
        #'max_bin': trial.suggest_int('max_bin', 20, 400), # Gpu does not accept customized max_bin.
        'random_state': random_seed,
        'verbosity': -1,
        'objective': 'regression',
        'boosting_type': 'gbdt',
        'device': 'gpu',
        'eval_metric': 'l2',
    }
    
    cv = KFold(n_fold, shuffle=True, random_state=random_seed)
    cv_splits = cv.split(X_train, y_train)
    
    val_preds = np.zeros(len(X_train))
    
    model = LGBMRegressor(**params)
    
    for train_idx, val_idx in cv_splits:
        X_train_fold = pd.concat([X_train.iloc[train_idx], X_original], axis=0)
        y_train_fold = pd.concat([y_train.iloc[train_idx], y_original], axis=0)
        X_val_fold, y_val_fold = X_train.iloc[val_idx], y_train.iloc[val_idx]
    
        X_train_fold = scaler.transform(X_train_fold)
        X_val_fold = scaler.transform(X_val_fold)
        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], eval_metric='l2', callbacks=[lgb.early_stopping(stopping_rounds=100)])
                
        val_preds[val_idx] = model.predict(X_val_fold)
        gc.collect()
        
    # Score the out-of-fold predictions once, after every fold has been filled in.
    rmse_full = mean_squared_error(y_train, val_preds, squared=False)
    
    return rmse_full

In [None]:
#sqlite_db = "sqlite:///lgbm.db"
# storage=sqlite_db, 
study_name = "lgbm"
sampler = TPESampler(multivariate=True, group=True, seed=random_seed)

# Settings that are not tuned. study.best_params holds only the tuned values,
# so the fixed settings are merged back in to form the final parameter set.
lgbm_fixed_params = {
    'learning_rate': 0.01,
    'n_estimators': 2000,
    'random_state': random_seed,
    'verbosity': -1,
    'objective': 'regression',
    'boosting_type': 'gbdt',
    'device': 'gpu',
    'eval_metric': 'l2',
}

optimize = False
if optimize:
    study = optuna.create_study(study_name=study_name, sampler=sampler, direction="minimize", load_if_exists=True)
    study.optimize(objective, timeout=3600*4)
    
    print(f"best optimized rmse: {study.best_value:0.5f}") 
    print(f"best hyperparameters: {study.best_params}") 
    lgbm_tuned_params = study.best_params
else:
    # Hard-coded tuned values, used when the search is skipped.
    lgbm_tuned_params = {
        'num_leaves': 237, 
        'min_child_samples': 241, 
        'max_depth': 12, 
        'reg_alpha': 1.8869868007564816e-07, 
        'reg_lambda': 9.206863421105129e-06, 
        'colsample_bytree': 0.541469428945828, 
        'subsample': 0.6290239247477459, 
        'min_gain_to_split': 0.36460105927641795, 
        'bin_construct_sample_cnt': 119286,
    }

lgbm_params = {**lgbm_fixed_params, **lgbm_tuned_params}

* 6.3 Cat

In [None]:
#fig = plot_param_importances(study)
#fig.show()

In [None]:
def objective(trial):
    params = {
        'learning_rate': 0.05,
        'iterations': 2000,
        'depth': trial.suggest_int('depth', 4, 16),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1, 100),
        #"rsm": trial.suggest_float("rsm", 0.5, 1.0),
        'random_strength': trial.suggest_float("random_strength", 0., 10.),
        'border_count': trial.suggest_int('border_count', 20, 500),
        'bagging_temperature': trial.suggest_float("bagging_temperature", 0., 1.),
        'random_state': random_seed,
        'grow_policy': 'SymmetricTree',
        'loss_function': 'RMSE',
        'eval_metric': 'RMSE',
        'task_type': 'GPU',
        'logging_level': 'Silent',
    }
    
    cv = KFold(n_fold, shuffle=True, random_state=random_seed)
    cv_splits = cv.split(X_train, y_train)
    
    val_preds = np.zeros(len(X_train))
    
    model = CatBoostRegressor(**params)
    
    for train_idx, val_idx in cv_splits:
        X_train_fold = pd.concat([X_train.iloc[train_idx], X_original], axis=0)
        y_train_fold = pd.concat([y_train.iloc[train_idx], y_original], axis=0)
        X_val_fold, y_val_fold = X_train.iloc[val_idx], y_train.iloc[val_idx]
    
        X_train_fold = scaler.transform(X_train_fold)
        X_val_fold = scaler.transform(X_val_fold)
        
        #X_train_fold_pool = Pool(X_train_fold, y_train_fold, cat_features=X_train_fold.columns.values)
        #X_val_fold_pool = Pool(X_val_fold, y_val_fold, cat_features=X_val_fold.columns.values)

        model.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], early_stopping_rounds=100, verbose=False)
                
        val_preds[val_idx] = model.predict(X_val_fold)
        gc.collect()
        
    # Score the out-of-fold predictions once, after every fold has been filled in.
    rmse_full = mean_squared_error(y_train, val_preds, squared=False)
    
    return rmse_full

In [None]:
study_name = "cat"
sampler = TPESampler(multivariate=True, group=True, seed=random_seed)

# Settings that are not tuned. study.best_params holds only the tuned values,
# so the fixed settings are merged back in to form the final parameter set.
cat_fixed_params = {
    'learning_rate': 0.05,
    'iterations': 2000,
    'random_state': random_seed,
    'grow_policy': 'SymmetricTree',
    'loss_function': 'RMSE',
    'eval_metric': 'RMSE',
    'task_type': 'GPU',
    'logging_level': 'Silent',
}

optimize = False
if optimize:
    study = optuna.create_study(study_name=study_name, sampler=sampler, direction="minimize", load_if_exists=True)
    study.optimize(objective, timeout=3600*3)
    
    print(f"best optimized rmse: {study.best_value:0.5f}") 
    print(f"best hyperparameters: {study.best_params}") 
    cat_tuned_params = study.best_params
else:
    # Hard-coded tuned values, used when the search is skipped.
    cat_tuned_params = {
        'depth': 9, 
        'l2_leaf_reg': 12.85233671225663, 
        'random_strength': 7.379142037012981, 
        'border_count': 301, 
        'bagging_temperature': 0.23617322853025788,
    }

cat_params = {**cat_fixed_params, **cat_tuned_params}

# 7. Cross Validation and Prediction Analysis.

Apply the three tuned models to the train dataset to get the CV scores and to the test dataset to make predictions.
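
The out-of-fold scheme behind those scores can be sketched with NumPy alone. Everything in this block is an assumption for illustration: the data are synthetic, ordinary least squares stands in for the boosted models, and the folds are built by hand instead of with KFold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only).
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 3))

n_splits = 5
folds = np.array_split(rng.permutation(len(X)), n_splits)

oof_preds = np.zeros(len(X))       # one held-out prediction per training row
new_preds = np.zeros(len(X_new))   # test predictions averaged over the folds

for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
    # Ordinary least squares stands in for the boosted models.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    oof_preds[val_idx] = X[val_idx] @ coef
    new_preds += X_new @ coef / n_splits

# Score the full out-of-fold vector once, after every fold has been filled in.
rmse_full = np.sqrt(np.mean((y - oof_preds) ** 2))
print(f"out-of-fold RMSE: {rmse_full:.4f}")
```

Each training row is predicted by a model that never saw it, so the RMSE over the whole out-of-fold vector is a fair estimate, and the same vector can later be reused to fit blending weights.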

In [None]:
def validation(model):
    cv = KFold(n_fold, shuffle=True, random_state=random_seed)
    cv_splits = cv.split(X_train, y_train)
    
    #val_errors = list()
    test_preds = np.zeros(len(X_test))
    val_preds = np.zeros(len(X_train))
    
    for train_idx, val_idx in cv_splits:
        X_train_fold = pd.concat([X_train.iloc[train_idx], X_original], axis=0)
        y_train_fold = pd.concat([y_train.iloc[train_idx], y_original], axis=0)
        X_val_fold, y_val_fold = X_train.iloc[val_idx], y_train.iloc[val_idx]
        
        X_train_fold = scaler.transform(X_train_fold)
        X_val_fold = scaler.transform(X_val_fold)
        
        #X_train_fold_pool = Pool(X_train_fold, y_train_fold, cat_features=X_train_fold.columns.values)
        #X_val_fold_pool = Pool(X_val_fold, y_val_fold, cat_features=X_val_fold.columns.values)
        
        model_cloned = clone(model)
        
        if isinstance(model_cloned, XGBRegressor):
            model_cloned.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], early_stopping_rounds=100, verbose=False)
        elif isinstance(model_cloned, LGBMRegressor):
            model_cloned.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], eval_metric='l2', callbacks=[lgb.early_stopping(stopping_rounds=100)])
        elif isinstance(model_cloned, CatBoostRegressor):
            model_cloned.fit(X_train_fold, y_train_fold, eval_set=[(X_val_fold, y_val_fold)], early_stopping_rounds=100, verbose=False)
        
        val_preds[val_idx] = model_cloned.predict(X_val_fold)
        
        #y_val_pred = model_cloned.predict(X_val_fold)
        #val_error= mean_squared_error(y_val_fold, y_val_pred, squared=False)
        #val_errors.append(val_error)
        
        y_pred = model_cloned.predict(X_test)
        test_preds += y_pred / n_fold
        #gc.collect()
        
    #rmse_avg = np.mean(val_errors)
    # Score the out-of-fold predictions once, after every fold has been filled in.
    rmse_full = mean_squared_error(y_train, val_preds, squared=False)
        
    return rmse_full, val_preds, test_preds

In [None]:
rmse_full_xgb, val_preds_xgb, test_preds_xgb = validation(XGBRegressor(**xgb_params))

rmse_full_lgbm, val_preds_lgbm, test_preds_lgbm = validation(LGBMRegressor(**lgbm_params))

rmse_full_cat, val_preds_cat, test_preds_cat = validation(CatBoostRegressor(**cat_params))

In [None]:
fig, axes = plt.subplots(1, 4)
sns.boxplot(data=y_train,ax=axes[0]).set_title('y-train')
sns.boxplot(data=val_preds_xgb,ax=axes[1]).set_title('val_preds_xgb')
sns.boxplot(data=val_preds_lgbm,ax=axes[2]).set_title('val_preds_lgbm')
sns.boxplot(data=val_preds_cat,ax=axes[3]).set_title('val_preds_cat')
fig.tight_layout()
plt.show()

All three models underpredict vehicle prices for most training samples.
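
One way to quantify such a bias is to look at the residuals directly, in the spirit of reference 5. The arrays below are hypothetical numbers, not the notebook's predictions.

```python
import numpy as np

# Hypothetical actual prices and out-of-fold predictions.
actual = np.array([12_000, 18_500, 25_000, 40_000, 95_000, 300_000], dtype=float)
predicted = np.array([13_500, 17_000, 24_000, 36_000, 70_000, 120_000], dtype=float)

residuals = actual - predicted          # positive means the model predicted too low
share_under = np.mean(residuals > 0)    # fraction of samples that were underpredicted

print(f"mean residual:   {residuals.mean():.0f}")
print(f"median residual: {np.median(residuals):.0f}")
print(f"underpredicted:  {share_under:.0%}")
```

A mean residual far above the median points to a few expensive cars with very large errors, which is the pattern the price outliers in section 3 would produce.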

In [None]:
plt.scatter(y_train, val_preds_xgb, alpha=0.5)
plt.title('Actual vs Predicted Values (XGB)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

In [None]:
plt.scatter(y_train, val_preds_lgbm, alpha=0.5)
plt.title('Actual vs Predicted Values (LGBM)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

In [None]:
plt.scatter(y_train, val_preds_cat, alpha=0.5)
plt.title('Actual vs Predicted Values (Cat)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')

# 8. Optuna Weights and Submission.

Optuna is used to find the optimal weights for an ensemble of the three models, that is, the weights that minimize the RMSE of the blended out-of-fold predictions on the full train dataset.
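
The objective being minimized can be shown with a much simpler search. The sketch below swaps Optuna's sampler for a plain grid over weights that sum to 1, on synthetic predictions; the data, the 0.05 step and the helper names are assumptions for illustration. Unlike the class below, the grid keeps every weight between 0 and 1.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic target and three noisy "model" predictions (hypothetical data).
y_true = rng.normal(size=500)
preds = [y_true + rng.normal(scale=s, size=500) for s in (0.3, 0.4, 0.5)]

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def blend(weights):
    return sum(w * p for w, p in zip(weights, preds))

# Exhaustive search over weights in steps of 0.05 that sum to 1.
steps = np.linspace(0.0, 1.0, 21)
candidates = [
    (w0, w1, 1.0 - w0 - w1)
    for w0, w1 in itertools.product(steps, steps)
    if w0 + w1 <= 1.0 + 1e-9
]
best_weights = min(candidates, key=lambda w: rmse(y_true, blend(w)))
best_rmse = rmse(y_true, blend(best_weights))

print("weights:", [round(float(w), 2) for w in best_weights])
print(f"blended RMSE: {best_rmse:.4f}")
```

Because the grid contains the three single-model corners, the blended RMSE can never be worse than the best individual model on the data used for the search.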

In [None]:
class OptunaWeights:
    def __init__(self, random_state, n_trials=5000):
        self.study = None
        self.weights = None
        self.random_state = random_state
        self.n_trials = n_trials

    def _objective(self, trial, y_true, y_preds):
        # Define the weights for the predictions from each model
        weights = [trial.suggest_float(f"weight{n}", 0, 1) for n in range(len(y_preds) - 1)]
        weights.append(1 - sum(weights))  # Ensure the sum of weights is 1

        # Calculate the weighted prediction
        weighted_pred = np.average(np.array(y_preds), axis=0, weights=weights)

        rmse_full = mean_squared_error(y_true, weighted_pred, squared=False)
        return rmse_full  

    def fit(self, y_true, y_preds):
        optuna.logging.set_verbosity(optuna.logging.ERROR)
        sampler = CmaEsSampler(seed=self.random_state)
        pruner = HyperbandPruner()
        self.study = optuna.create_study(sampler=sampler, pruner=pruner, study_name="OptunaWeights",
                                         direction='minimize')
        objective_partial = partial(self._objective, y_true=y_true, y_preds=y_preds)
        self.study.optimize(objective_partial, n_trials=self.n_trials, show_progress_bar=True)
        weights = [self.study.best_params[f"weight{n}"] for n in range(len(y_preds) - 1)]
        weights.append(1 - sum(weights))  # Ensure the sum of weights is 1
        self.weights = weights

In [None]:
ow = OptunaWeights(random_seed)
ow.fit(y_train, y_preds=[val_preds_xgb, val_preds_lgbm, val_preds_cat])
weights = ow.weights
print(weights)

In [None]:
y_test = weights[0] * test_preds_xgb + weights[1] * test_preds_lgbm + weights[2] * test_preds_cat
sample_submission = pd.read_csv('/kaggle/input/playground-series-s4e9/sample_submission.csv')
sample_submission['price'] = y_test
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head()