This section explores alternative approaches. Some of the project's earlier choices, such as brand-model grouped imputation, might be hindering the model's performance.

In [110]:
# Locate project root dir, enable package imports from src/
import sys
from pathlib import Path

PROJ_ROOT = Path().resolve().parents[1]
sys.path.append(str(PROJ_ROOT / "src"))
print(PROJ_ROOT)

/Users/alexandre/Documents/GitHub/price_predictor


In [None]:
import json
import os
import sys
from math import ceil, floor

import numpy as np
import pandas as pd

# Add the parent directory to the import path before loading local helper modules
module_path = os.path.abspath(os.path.join(".."))
if module_path not in sys.path:
    sys.path.append(module_path)

# Load standardized paths, ML utilities and custom transformers
from project_utils.paths import (
    RAW_DATA_DIR,
    TEST_PREDICTIONS_DIR,
)
import project_utils.ml_utils as ml_utils
from open_ended_utils import feature_engineer, simpler_target_encoder, validity_cleaners

# Holdout method
from sklearn.model_selection import train_test_split

# Pipeline building blocks and estimator
from sklearn.base import clone
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler


In [None]:
df_train = pd.read_csv(os.path.join(RAW_DATA_DIR, "train.csv"), index_col="carID")
X_test = pd.read_csv(os.path.join(RAW_DATA_DIR, "test.csv"), index_col="carID")

X = df_train.drop("price", axis=1).copy()
y = np.log1p(df_train["price"].copy())

# Hold out 10% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# Experimenting with a new pipeline

This notebook focuses exclusively on HistGradientBoostingRegressor, the tree-based ensemble that achieved the strongest predictive performance of all models evaluated in this project. The goal of this section is a single end-to-end pipeline that consolidates data preprocessing, feature engineering, and modeling.

Earlier analyses tuned hyperparameters at the estimator level only. Here, a randomized hyperparameter search covers every stage of the pipeline: model-specific parameters as well as those governing data validation, cleaning, and feature transformation. Searching them jointly makes it possible to explore interactions between preprocessing decisions and model behavior.

The pipeline runs its steps in this order: data validity checking and cleaning, feature engineering, imputation, encoding, scaling, feature selection, and final model estimation. Keeping all of them in one object makes the workflow easier to reproduce and gives a systematic basis for evaluating how preprocessing and modeling choices jointly affect performance.

In [None]:
# --- Dictionary of valid categorical features --- #
cleaned_dict_dir = os.path.join(
    PROJ_ROOT, "src", "project_utils", "valid_categories.json"
)
with open(cleaned_dict_dir, "r") as f:
    valid_categories = json.load(f)

# --- Valid (min, max) range for each numerical feature; None leaves that side unbounded --- #
numeric_ranges = {
    "year": (0, 2020),
    "mileage": (0, None),
    "engineSize": (0, None),
    "mpg": (0, 300),
    "tax": (0, None),
    "previousOwners": (0, None),
}
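The internals of `FullValidityCleaner` are not shown in this notebook. As a rough sketch of what a "wipe" policy for numeric ranges could look like, the hypothetical helper below replaces out-of-range values with NaN. The function name, the inclusive bounds, and the demo values are assumptions made for illustration.

```python
import pandas as pd


def wipe_out_of_range(df, ranges):
    """Return a copy of df where values outside (low, high) become NaN.

    Bounds are treated as inclusive; a bound of None leaves that side open.
    """
    cleaned = df.copy()
    for column, (low, high) in ranges.items():
        if column not in cleaned.columns:
            continue
        values = pd.to_numeric(cleaned[column], errors="coerce")
        invalid = pd.Series(False, index=cleaned.index)
        if low is not None:
            invalid |= values < low
        if high is not None:
            invalid |= values > high
        # mask() sets the flagged entries to NaN and keeps the rest unchanged.
        cleaned[column] = values.mask(invalid)
    return cleaned


demo = pd.DataFrame({"year": [2015, 2024.12, 2019], "mpg": [55.4, -3.0, 470.8]})
demo_ranges = {"year": (0, 2020), "mpg": (0, 300)}
print(wipe_out_of_range(demo, demo_ranges))
```

Wiping to NaN rather than dropping rows keeps every observation available to the model, which matters here because HistGradientBoostingRegressor handles missing values natively.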

In [115]:
best_params = {
    "cleaner_cat_replace_with": np.nan,
    "cleaner_min_similarity": 0.85,
    "cleaner_numeric_policy": "wipe",
    "imputer": "passthrough",
    "model_early_stopping": True,
    "model_l2_regularization": 0.017619542016522542,
    "model_learning_rate": 0.03833834069078183,
    "model_max_depth": 30,
    "model_max_iter": 2495,
    "model_max_leaf_nodes": 4540,
    "model_min_samples_leaf": 44,
    "model_random_state": 99,
}
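The `cleaner_min_similarity` value of 0.85 suggests that categorical values are matched against the valid list with some string-similarity threshold. The similarity measure the cleaner actually uses is not shown here, so the sketch below assumes `difflib.SequenceMatcher` from the standard library; the helper name and the brand list are hypothetical.

```python
from difflib import SequenceMatcher


def closest_valid(value, valid_values, min_similarity=0.85):
    """Return the most similar valid value, or None if none is similar enough."""
    normalized = str(value).strip().lower()
    best_match, best_score = None, 0.0
    for candidate in valid_values:
        score = SequenceMatcher(None, normalized, candidate).ratio()
        if score > best_score:
            best_match, best_score = candidate, score
    return best_match if best_score >= min_similarity else None


valid_brands = ["toyota", "mercedes", "volkswagen"]
for raw in ["Toyot", "MERCEDES ", "vw"]:
    print(repr(raw), "->", closest_valid(raw, valid_brands))
```

With this measure a near miss such as "Toyot" is repaired, while an abbreviation such as "vw" falls below the threshold and would be replaced by the configured `cat_replace_with` value.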

## HGBR

In [116]:
pipeline = Pipeline(
    steps=[
        (
            "cleaner",
            validity_cleaners.FullValidityCleaner(
                valids=valid_categories,
                ranges=numeric_ranges,
            ),
        ),
        (
            "feature_engineer",
            # Note: reference_year is computed on the raw (not yet cleaned) training data
            feature_engineer.FeatureEngineer(reference_year=X_train["year"].max()),
        ),
        ("imputer", "passthrough"),
        (
            "encoder",
            ColumnTransformer(
                transformers=[
                    (
                        "num",
                        "passthrough",
                        make_column_selector(dtype_include=np.number),
                    ),
                    (
                        "cat",
                        simpler_target_encoder.MeanTargetEncoder(),
                        make_column_selector(dtype_exclude=np.number),
                    ),
                ],
                remainder="drop",
            ),
        ),
        ("scaler", RobustScaler()),
        ("feature_selector", "passthrough"),
        ("model", HistGradientBoostingRegressor()),
    ]
)


# --- Convert the tuned parameter names to pipeline parameter names --- #
# best_params stores names as "<step>_<param>", while Pipeline.set_params expects
# "<step>__<param>". Step-level entries such as "imputer" are kept unchanged.
pipeline_best_params = {
    (key.replace("_", "__", 1) if key.startswith(("cleaner_", "model_")) else key): value
    for key, value in best_params.items()
}

pipeline.set_params(**pipeline_best_params)
hgbr = pipeline.fit(X_train, y_train)
hgbr


Pipeline
    steps: [('cleaner', ...), ('feature_engineer', ...), ...]
    transform_input: None
    memory: None
    verbose: False

cleaner
    valids: {'categorical_valid_values': {'Brand': ['toyota', 'mercedes', ...], 'fuelTypes': ['petrol', 'diesel', ...], 'model': ['golf', 'passat', ...], 'transmission': ['manual', 'semi-auto', ...]}, 'numeric_ranges': {'engineSize': {'max': None, 'min': 0}, 'mileage': {'max': None, 'min': 0}, 'mpg': {'max': 300, 'min': 0}, 'previousOwners': {'max': None, 'min': 0}, ...}}
    ranges: {'engineSize': (0, ...), 'mileage': (0, ...), 'mpg': (0, ...), 'previousOwners': (0, ...), ...}
    numeric_policy: 'wipe'
    min_similarity: 0.85
    cat_replace_with: nan
    add_invalid_flag: True
    normalize_strings: True

feature_engineer
    reference_year: np.float64(2024.1217590521203)

encoder (ColumnTransformer)
    transformers: [('num', ...), ('cat', ...)]
    remainder: 'drop'
    sparse_threshold: 0.3
    n_jobs: None
    transformer_weights: None
    verbose: False
    verbose_feature_names_out: True
    force_int_remainder_cols: 'deprecated'

scaler (RobustScaler)
    with_centering: True
    with_scaling: True
    quantile_range: (25.0, ...)
    copy: True
    unit_variance: False

model (HistGradientBoostingRegressor)
    loss: 'squared_error'
    quantile: None
    learning_rate: 0.03833834069078183
    max_iter: 2495
    max_leaf_nodes: 4540
    max_depth: 30
    min_samples_leaf: 44
    l2_regularization: 0.017619542016522542
    max_features: 1.0
    max_bins: 255


In [117]:
# Compute evaluation metrics for the training and validation sets
# Back-transform targets and predictions from log1p(price) to the price scale;
# separate names keep y_train / y_val intact so the cell can be re-run safely
y_train_price = np.expm1(y_train)
y_val_price = np.expm1(y_val)
# Training set
train_predictions = np.expm1(pipeline.predict(X_train))
print("Training set metrics:")
metrics_train = ml_utils.linear_evaluation_metrics(
    y_train_price, train_predictions, verbose=True
)
# Validation set
validation_predictions = np.expm1(pipeline.predict(X_val))
print("Validation set metrics:")
metrics_val = ml_utils.linear_evaluation_metrics(
    y_val_price, validation_predictions, verbose=True
)

Training set metrics:
R²: 0.969, MAE: 935.48, MAPE: 5.59%, RMSE: 1711.57
-------------------------------------------------------
Validation set metrics:
R²: 0.944, MAE: 1337.64, MAPE: 7.88%, RMSE: 2391.40
-------------------------------------------------------
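The four reported metrics are computed on the original price scale, after undoing the `log1p` transform. `ml_utils.linear_evaluation_metrics` is a project helper whose code is not shown here, so the following is only a sketch of the standard definitions; the function name `price_scale_metrics` and the toy numbers are assumptions.

```python
import numpy as np


def price_scale_metrics(y_true, y_pred):
    """Return R2, MAE, MAPE (in percent) and RMSE for two arrays of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(errors)),
        "MAPE": 100.0 * np.mean(np.abs(errors / y_true)),
        "RMSE": np.sqrt(np.mean(errors**2)),
    }


# Toy example: the model predicts log1p(price), so back-transform first.
true_prices = np.array([10000.0, 20000.0, 30000.0])
log_predictions = np.log1p([10000.0, 21000.0, 29000.0])
demo_metrics = price_scale_metrics(true_prices, np.expm1(log_predictions))
print(demo_metrics)
```

Evaluating in price units rather than log units keeps MAE and RMSE interpretable as currency amounts.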


In [118]:
# -- Export X_test predictions -- #
# The model predicts log1p(price), so map predictions back to the price scale
pred_y_test = np.expm1(pipeline.predict(X_test))
# Convert to Series
test_predictions_series = pd.Series(data=pred_y_test, name="price", index=X_test.index)

# Export CSV file
test_predictions_series.to_csv(
    os.path.join(TEST_PREDICTIONS_DIR, "open_ended_test_predictions.csv")
)

This approach integrates data cleaning directly into the modeling pipeline and runs a randomized search over the full set of hyperparameters. The strategy proved effective, yielding substantial improvements in overall model performance: on the validation set the pipeline reaches R² = 0.944, MAE = 1337.64, MAPE = 7.88%, and RMSE = 2391.40, against R² = 0.969 and MAE = 935.48 on the training set. Moreover, retaining variables that Elastic Net had previously identified as non-informative, together with the newly engineered features, suggests that complementary methodological approaches can each contribute to performance gains.

Overall, this section shows that there are several viable avenues for improving model performance, and highlights the value of combining robust preprocessing, feature engineering, and flexible model selection strategies.