## **Hierarchical Price Prediction via Segment Classification and Specialized Regressors - Machine Learning Project**

Work done by Group 32:
Carlota Magalhães 20221870
Gabriela Matos 20221835
Mariana Carvalhais r20211389 

## **Index**<br>

[1. **Objective**](#1st-bullet)<br>

[2. **Importing Libraries and Data**](#2nd-bullet)<br>

[3. **Data Preprocessing**](#3rd-bullet)<br>
- [3.1 Indexes and Duplicates](#4th-bullet)<br>
- [3.2 Variable Correction (Categorical Cleaning)](#5th-bullet)<br>
    - [3.2.1 Brand Correction](#6th-bullet)<br>
    - [3.2.2 Fuel Type Correction](#7th-bullet)<br>
    - [3.2.3 Transmission Correction](#8th-bullet)<br>
    - [3.2.4 Model Correction](#9th-bullet)<br>
- [3.3 Incoherencies Checking](#10th-bullet)<br>
- [3.4 Data Splitting (Train / Validation)](#11th-bullet)<br>
- [3.5 Outliers Treatment](#12th-bullet)<br>
- [3.6 Feature Engineering](#13th-bullet)<br>
- [3.7 Missing Values Treatment (Categorical Only)](#14th-bullet)<br>
- [3.8 Scaling Numerical Features](#15th-bullet)<br>
- [3.9 Missing Values Treatment (Numerical Data)](#16th-bullet)<br>
- [3.10 Encoding Categorical Variables](#17th-bullet)<br>

[4. **Price Segment Definition**](#17th-bullet)<br>

[5. **Price Segment Classification**](#18th-bullet)<br>
- [5.1 Classification Model and Setup](#19th-bullet)<br>
- [5.2 Classification Performance Evaluation](#20th-bullet)<br>

[6. **Segment-Specific Regression Models**](#21st-bullet)<br>
- [6.1 Global Regression Baseline](#22nd-bullet)<br>
- [6.2 Segment-Specific Regression Models](#23rd-bullet)<br>

[7. **End-to-End Hierarchical Prediction Pipeline**](#24th-bullet)<br>

[8. **Performance Evaluation and Comparison**](#25th-bullet)<br>
- [8.1 Overall MAE Comparison](#26th-bullet)<br>
- [8.2 MAE by Price Segment](#27th-bullet)<br>

[9. **Final Model**](#28th-bullet)<br>

[10. **Business Implications and Recommendations**](#29th-bullet)<br>

[11. **Limitations and Future Work**](#30th-bullet)<br>

</div>

<a id="1st-bullet"></a>
# **1. Objective**

In this open-ended section, we test whether a single global regression model performs equally well across all car price ranges. We compare it with a hierarchical approach in which a classifier first predicts the price segment and a segment-specific regression model then estimates the final price. The goal is to see whether splitting the problem into more homogeneous price groups helps capture different pricing dynamics, especially for low- and high-priced cars. This analysis is exploratory and complementary to the main pipeline, not a replacement for it.
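To make the two-stage idea concrete, the following is a minimal sketch on synthetic data. It is not the notebook's actual pipeline: the data, the helper name `hierarchical_predict`, and the choice of `RandomForestClassifier` and `LinearRegression` are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 3 features and a price-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 15000 + 4000 * X[:, 0] + 1000 * X[:, 1] + rng.normal(scale=500, size=300)

# Stage 0: derive segment labels from quantiles of the training target.
q_low, q_high = np.quantile(y, [0.33, 0.66])
segments = np.where(y <= q_low, 'low', np.where(y <= q_high, 'mid', 'high'))

# Stage 1: a classifier learns to predict the segment from the features.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, segments)

# Stage 2: one regressor per segment, trained only on that segment's rows.
regressors = {
    seg: LinearRegression().fit(X[segments == seg], y[segments == seg])
    for seg in np.unique(segments)
}

def hierarchical_predict(X_new):
    """Route each row to the regressor of its predicted segment (hypothetical helper)."""
    predicted_segment = clf.predict(X_new)
    y_pred = np.empty(len(X_new))
    for seg, reg in regressors.items():
        mask = predicted_segment == seg
        if mask.any():
            y_pred[mask] = reg.predict(X_new[mask])
    return y_pred

preds = hierarchical_predict(X)
```

A misclassified segment sends a car to the wrong regressor, so the end-to-end error depends on both stages.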

<a id="2nd-bullet"></a>
# **2. Importing Libraries and Data**

In [1]:
# Core libraries
import math
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from scipy.stats import chi2_contingency, randint, uniform, loguniform

# Preprocessing
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, RobustScaler, LabelEncoder

# Imputation
from sklearn.impute import SimpleImputer, KNNImputer

# Feature selection
from sklearn.feature_selection import RFE, RFECV

# Model selection & validation
from sklearn.model_selection import train_test_split, RandomizedSearchCV, PredefinedSplit, KFold

# Metrics
from sklearn.metrics import mean_absolute_error, accuracy_score, balanced_accuracy_score, classification_report, confusion_matrix, r2_score, make_scorer

# Linear models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, LassoCV, ElasticNet

# Tree-based models
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, StackingRegressor

# Support Vector Machines
from sklearn.svm import SVR, LinearSVR

# Neighbors
from sklearn.neighbors import KNeighborsRegressor

# Neural Networks
from sklearn.neural_network import MLPRegressor

# Utilities
from sklearn.base import clone

# Warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Reproducibility
np.random.seed(42)

In [2]:
df_train = pd.read_csv('train.csv')

In [3]:
df_test = pd.read_csv('test.csv') 

In [4]:
train = df_train.copy()

In [5]:
test = df_test.copy()

In [6]:
#check if there are duplicates in ID so we can set index
train.duplicated(subset = 'carID').sum()

np.int64(0)

In [7]:
#separate metric and non-metric features
metric_features = ['year', 'mileage', 'tax', 'mpg', 'engineSize', 'previousOwners', 'price'] 
index = ['carID']
non_metric_features = train.columns.drop(metric_features + index).to_list()

In [8]:
metric_features = [f for f in metric_features if f != 'price']

<a id="3rd-bullet"></a>
# **3. Data Preprocessing**


<a id="4th-bullet"></a>
## **3.1 Indexes and Duplicates**


As in the main pipeline, we set 'carID' as the index of the dataset.
The check above confirms that the training dataset has no duplicated IDs.

In [9]:
#set carID as index
train.set_index('carID', inplace = True)

In [10]:
test.set_index('carID', inplace=True)

In [11]:
#drop duplicates 
#train.drop_duplicates(inplace = True)

<a id="5th-bullet"></a>
## **3.2 Variable Correction (Categorical Cleaning)**


In this section, we have to correct the same variables (Brand, Fuel Type, Transmission and Model) as in the main pipeline.

<a id="6th-bullet"></a>
### **3.2.1 Brand Correction**


In [12]:
#Correct values of brands on train
brand_corrections = {
    'ford': 'Ford', 'for': 'Ford', 'ord': 'Ford', 'FORD': 'Ford', 'fOrd': 'Ford', 'or': 'Ford',
    'mercedes': 'Mercedes', 'mercede': 'Mercedes', 'ercedes': 'Mercedes', 
    'MERCEDES': 'Mercedes', 'MERCEDE': 'Mercedes', 'ercede': 'Mercedes',
    'vw': 'VW', 'v': 'VW', 'W': 'VW', 'w': 'VW',
    'opel': 'Opel', 'OPE': 'Opel', 'pel': 'Opel', 'ope': 'Opel', 'OPEL': 'Opel', 'pe': 'Opel',
    'bmw': 'BMW', 'bm': 'BMW', 'MW': 'BMW', 'mw': 'BMW',
    'audi': 'Audi', 'ud': 'Audi', 'udi': 'Audi', 'AUDI': 'Audi', 'aud': 'Audi', 'AUD': 'Audi',
    'toyota': 'Toyota', 'toyot': 'Toyota', 'OYOTA': 'Toyota', 'oyota': 'Toyota', 'TOYOTA': 'Toyota',
    'skoda': 'Skoda', 'skod': 'Skoda', 'koda': 'Skoda', 'Skod': 'Skoda', 'SKODA': 'Skoda', 'ko': 'Skoda',
    'hyundai': 'Hyundai', 'hyunda': 'Hyundai', 'yunda': 'Hyundai', 'HYUNDAI': 'Hyundai'
}


def correct_brand(name):
    if pd.isna(name):  # check for missing before converting to str, so it doesn't create a string 'Nan'
        return np.nan
    name = str(name).lower().strip()
    for key in brand_corrections:
        if key in name:
            return brand_corrections[key]
    return name.title()


# Apply the correction to the Brand column in training dataset
train['Brand'] = train['Brand'].apply(correct_brand)
print(train['Brand'].value_counts())

Brand
VW          17636
Ford        16063
Mercedes    11674
Opel         9352
Audi         7325
Toyota       4622
Skoda        4303
Hyundai      3336
BMW           141
Name: count, dtype: int64


In [13]:
test['Brand'] = test['Brand'].apply(correct_brand)
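One caveat of substring matching is that short keys can match inside other brand names. In `brand_corrections`, the key `'w'` is listed before `'bmw'`, so a lowercased `'bmw'` is caught by `'w'` first and returned as `'VW'`, which could be one reason BMW shows only 141 rows in the output above. The sketch below reproduces this with a small subset of the keys and contrasts it with a whole-string lookup. The helper names `substring_match` and `exact_match` are hypothetical and not part of the notebook.

```python
# Subset of brand_corrections, kept in the same relative order as the original.
corrections = {'vw': 'VW', 'v': 'VW', 'w': 'VW', 'bmw': 'BMW', 'bm': 'BMW'}

def substring_match(name):
    """Same logic as correct_brand: the first key found inside the name wins."""
    name = str(name).lower().strip()
    for key, fixed in corrections.items():
        if key in name:
            return fixed
    return name.title()

def exact_match(name):
    """Hypothetical variant: only whole-string matches are corrected."""
    name = str(name).lower().strip()
    return corrections.get(name, name.title())

print(substring_match('BMW'))  # the 'w' key matches first
print(exact_match('BMW'))
```

Exact matching requires every misspelling to be listed explicitly, so it trades convenience for predictability.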

<a id="7th-bullet"></a>
### **3.2.2 Fuel Type Correction**


In [14]:
#Correct values of Fuel Type on train
fuel_type_corrections = {
    'dies': 'Diesel', 'diesl': 'Diesel', 'diesle': 'Diesel', 'iesel': 'Diesel', 'diese': 'Diesel', 'iese': 'Diesel',
    'petro': 'Petrol', 'etrol': 'Petrol', 'petro ': 'Petrol', 'etro': 'Petrol',
    'hybri': 'Hybrid', 'hybird': 'Hybrid', 'ybrid': 'Hybrid', 'ybri': 'Hybrid',
    'elctric': 'Electric', 'electri': 'Electric', 'elcetric': 'Electric',
    'other': 'Other', 'oth': 'Other', 'othe': 'Other', 'ther': 'Other'
}

def correct_fuel(name):
    if pd.isna(name):  
        return np.nan
    name = str(name).lower().strip()
    for key in fuel_type_corrections:
        if key in name:
            return fuel_type_corrections[key]
    return name.title() 

train['fuelType'] = train['fuelType'].apply(correct_fuel)
print(train['fuelType'].value_counts())

fuelType
Petrol      41181
Diesel      30885
Hybrid       2225
Other         167
Electric        4
Name: count, dtype: int64


In [15]:
test['fuelType'] = test['fuelType'].apply(correct_fuel)

<a id="8th-bullet"></a>
### **3.2.3 Transmission Correction**


In [16]:
#correct values of transmission
transmission_corrections = {
    'manua': 'Manual', 'maual': 'Manual', 'manul': 'Manual', 'anua': 'Manual',
    'semi-aut': 'Semi-Auto', 'semi-au': 'Semi-Auto', 'emi-auto': 'Semi-Auto', 'emi-aut': 'Semi-Auto',
    'automati': 'Automatic', 'automaic': 'Automatic', 'utomati': 'Automatic',
    'unknow': 'Unknown', 'unkown': 'Unknown', 'nknown': 'Unknown', 'nknow': 'Unknown'
}

def correct_transmission(name):
    if pd.isna(name):  
        return np.nan
    name = str(name).lower().strip()
    for key in transmission_corrections:
        if key in name:
            return transmission_corrections[key]
    return name.title()

train['transmission'] = train['transmission'].apply(correct_transmission)
print(train['transmission'].value_counts())

transmission
Manual       41627
Semi-Auto    16872
Automatic    15211
Unknown        736
Other            5
Name: count, dtype: int64


In [17]:
test['transmission'] = test['transmission'].apply(correct_transmission)

<a id="9th-bullet"></a>
### **3.2.4 Model Correction**


In [18]:
model_corrections = {
    'focu': 'Focus', 'focuss': 'Focus',
    'fies': 'Fiesta', 'fiest': 'Fiesta', 'fiesat': 'Fiesta',
    'gol': 'Golf', 'gof': 'Golf', 'glof': 'Golf', 'golff': 'Golf',
    'pasat': 'Passat', 'passa': 'Passat', 'pass': 'Passat',
    'mokk': 'Mokka', 'mok': 'Mokka',
    'insigni': 'Insignia', 'insign': 'Insignia',
    'astr': 'Astra',
    'pol': 'Polo',
    'yari': 'Yaris', 'yar': 'Yaris',
    'a clas': 'A Class', 'c clas': 'C Class', 'e clas': 'E Class',
    'gle clas': 'Gle Class', 'glc clas': 'Glc Class',
    'cla clas': 'Cla Class', 'sl clas': 'Sl Class', 'cls clas': 'Cls Class',
    'yet': 'Yeti', 'yeti outdoo': 'Yeti Outdoor', 'yeti out': 'Yeti Outdoor',
    'monde': 'Mondeo',
    'kar': 'Karoq', 'karo': 'Karoq',
    'rav': 'Rav4',
    'touare': 'Touareg',
    'tiguan allspac': 'Tiguan Allspace',
    'auri': 'Auris', 'aur': 'Auris',
    'coroll': 'Corolla',
    'kami': 'Kamiq', 'kam': 'Kamiq',
    't-cros': 'T-Cross',
    't-ro': 'T-Roc',
    ' Q2': 'Q2',
    ' 2 Series': '2 Series',
    ' A3': 'A3',
    ' Octavia': 'Octavia',
    ' C-Hr': 'C-Hr',
    ' Ecosport': 'Ecosport',
    ' Fabia': 'Fabia',
    ' Ka+': 'Ka+',
    ' 3 Series': '3 Series',
    ' C Class': 'C Class',
    ' I30': 'I30',
    ' Up': 'Up',
    ' Tt': 'TT',
    ' 5 Series': '5 Series',
    ' 4 Series': '4 Series',
    ' Slk': 'SLK',
    ' Cl Class': 'CLS Class',
    ' I20': 'I20',
    ' Rapid': 'Rapid',
    ' S-Max': 'S-Max',
    ' Crossland X': 'Crossland X',
    ' Grand Tourneo Connect': 'Grand Tourneo Connect',
    ' C-Ma': 'C-Max',
    ' Grand C-Ma': 'Grand C-Max',
    ' 6 Serie': '6 Series',
    ' 2 Serie': '2 Series',
    ' 8 Serie': '8 Series',
    ' 1 Serie': '1 Series',
    ' 4 Serie': '4 Series',
    ' S Clas': 'S Class',
    ' Gl Class': 'GL Class',
    ' V Clas': 'V Class',
    ' C-H': 'C-Hr',
    ' X-Clas': 'X-Class',
    'Shara': 'Sharan',
    '3 Serie': '3 Series',
    '5 Serie': '5 Series',
    '8 Serie': '8 Series',
    'Ioni': 'Ioniq',
    'Tigua': 'Tiguan',
    'Viv': 'Viva',
    'Kug': 'Kuga',
    'Rs': 'RS',
    'Zafir': 'Zafira',
    'Arteo': 'Arteon',
    'Kon': 'Kona',
    'Scirocc': 'Scirocco',
    'Fabi': 'Fabia',
    'Citig': 'Citigo',
    'S-Ma': 'S-Max',
    'B-Ma': 'B-Max',
    'Hilux': 'Hilux',
    'Hilu': 'Hilux',
    'Toura': 'Touran',
    'Tourneo Custo': 'Tourneo Custom',
    'Grandland ': 'Grandland X',
    'Urban Cruise': 'Urban Cruiser',
    'Sl': 'SLK',
    'I1': 'I10',
    'I2': 'I20',
    'I3': 'I30',
    'M Clas': 'M Class',
}


def correct_model(name):
    if pd.isna(name):  
        return np.nan
    name = str(name).lower().strip()
    for key in model_corrections:
        if key in name:
            return model_corrections[key]
    return name.title()  

train['model'] = train['model'].apply(correct_model)
print(train['model'].value_counts())

model
Focus       6915
C Class     5955
Fiesta      4470
Golf        3515
A Class     2354
            ... 
Veloste        1
A2             1
200            1
Accent         1
Terracan       1
Name: count, Length: 256, dtype: int64


In [19]:
test['model'] = test['model'].apply(correct_model)
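Because `correct_model` lowercases each name before looking for keys, any key in `model_corrections` that contains an uppercase letter (for example `' Q2'`, `'Shara'` or `'Rs'`) can never match. The sketch below shows one way to list such keys. The helper `unreachable_keys` is a hypothetical name, and the dictionary is only a small sample of the original.

```python
# Small sample of model_corrections (same keys and values as in the original).
sample_corrections = {
    'focu': 'Focus',
    ' Q2': 'Q2',
    'Shara': 'Sharan',
    'Rs': 'RS',
    'pol': 'Polo',
}

def unreachable_keys(corrections):
    """Return keys that cannot occur inside a fully lowercased name."""
    return [key for key in corrections if key != key.lower()]

print(unreachable_keys(sample_corrections))
```

Names that match no key still pass through `name.title()`, so these entries are silently skipped rather than raising an error.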

<a id="10th-bullet"></a>
## **3.3 Incoherencies Checking**


As in the main pipeline, we remove `hasDamage` because it is constant (0 or missing) and therefore adds no value for analysis or prediction. We convert the negative values in `mileage`, `mpg`, `engineSize`, `previousOwners` and `tax` into their absolute values. We also cap values of `year` above 2020 at 2020.
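The two rules can be illustrated on a toy frame. This is a minimal sketch with made-up values; it uses `Series.abs()` and `Series.clip(upper=...)` as vectorized equivalents of the rules, not the notebook's own `incoherencies` function.

```python
import pandas as pd

# Made-up rows: one negative mileage and one year beyond 2020.
toy = pd.DataFrame({
    'year': [2015.0, 2023.0, 2019.0],
    'mileage': [-12000.0, 30000.0, 5000.0],
})

# Negative values become positive; non-negative values are unchanged.
toy['mileage'] = toy['mileage'].abs()

# Years above the limit are capped at 2020; earlier years are unchanged.
toy['year'] = toy['year'].clip(upper=2020)

print(toy)
```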

In [20]:
train['hasDamage'].value_counts(dropna=False)

hasDamage
0.0    74425
NaN     1548
Name: count, dtype: int64

In [21]:
train.drop('hasDamage', axis = 1, inplace=True)

In [22]:
incoherencies_limits = {
    'year' : 2020,
    'mileage' : 0,
    'mpg' : 0,
    'engineSize' : 0,
    'previousOwners' : 0,
    'tax' : 0
}

In [23]:
def incoherencies(df, dictionary):
    for key, value in dictionary.items():
        if value == 0:
            n = (df[key] < 0).sum()
            if n > 0:
                print(f'{key} has {n} negative values made absolute.')
            df.loc[df[key] < 0, key] = df.loc[df[key] < 0, key].abs()
        else:
            n = (df[key] > value).sum()
            if n > 0:
                print(f'{key} has {n} values above {value}, capped at {value}.')
            df.loc[df[key] > value, key] = value
    return df

In [24]:
not_floats = ['year', 'previousOwners']

In [25]:
def floats_to_int(df, variables):
    for var in variables:
        sum_floats = df[var].apply(lambda x: isinstance(x, float)).sum() 
        df[var] = df[var].round().astype('Int64')
        print(f'{var} has {sum_floats} that were converted to integers')
    return df

Calling both created functions on train:

In [26]:
train = incoherencies(train, incoherencies_limits)

year has 358 values above 2020, capped at 2020.
mileage has 369 negative values made absolute.
mpg has 36 negative values made absolute.
engineSize has 84 negative values made absolute.
previousOwners has 371 negative values made absolute.
tax has 378 negative values made absolute.


In [27]:
train = floats_to_int(train, not_floats)

year has 75973 that were converted to integers
previousOwners has 75973 that were converted to integers


Calling both created functions on test:

In [28]:
test = incoherencies(test, incoherencies_limits)

year has 180 values above 2020, capped at 2020.
mileage has 170 negative values made absolute.
mpg has 17 negative values made absolute.
engineSize has 33 negative values made absolute.
previousOwners has 168 negative values made absolute.
tax has 161 negative values made absolute.


In [29]:
test = floats_to_int(test, not_floats)

year has 32567 that were converted to integers
previousOwners has 32567 that were converted to integers



<a id="11th-bullet"></a>
## **3.4 Data Splitting - Features/Target & Train/Validation**


We follow the same split strategy as in the main pipeline to ensure comparability. At this stage, the split is performed exclusively on features and target, without introducing any price segmentation logic. All segmentation-related steps are handled later in the pipeline to avoid conceptual mixing between data preparation and modeling decisions.

In [30]:
x = train.drop('price', axis = 1)
y = train['price']

In [31]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.3, 
                                                  random_state = 15,  
                                                  shuffle = True)

In [32]:
y_train_log = np.log1p(y_train)
y_val_log   = np.log1p(y_val)
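If a model is trained on the log-transformed target, its predictions need to be mapped back to the original price scale before computing an error such as MAE in currency units. `np.expm1` is the exact inverse of `np.log1p`. The prices below are made-up values for illustration.

```python
import numpy as np

# Made-up prices on the original scale.
prices = np.array([1500.0, 11495.0, 18250.0, 60000.0])

# Forward transform, as applied to y_train and y_val.
log_prices = np.log1p(prices)

# Inverse transform, to be applied to predictions made in log space.
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))
```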

In [33]:
print(f'Training data: {len(x_train)} rows \nValidation data: {len(x_val)} rows')

Training data: 53181 rows 
Validation data: 22792 rows


<a id="12th-bullet"></a>
## **3.5 Outliers Treatment**


No additional outlier treatment was applied in the open-ended section, as removing extreme observations would bias the evaluation of the hierarchical strategy. Since the goal is to assess whether segmentation improves performance precisely in heterogeneous and extreme price ranges, preserving all observations is essential for a fair comparison.

<a id="13th-bullet"></a>
## **3.6 Feature Engineering**


No additional feature engineering was introduced in this section.
We reused all engineered variables from the main pipeline to ensure that any observed differences in performance are attributable exclusively to the proposed segmentation strategy.

Variables created:

- Car Age - the age of the car. Since the dataset is from 2020, we subtract the car's year from 2020.
- Mileage per Year - the average number of miles driven per year of age.
- mpg per Engine Size - numerical variable that gives a sense of the car's efficiency, computed as miles per gallon divided by engine size.
- Car Age * Mileage - numerical variable that tries to capture the joint effect of age and mileage on the price of the car, allowing us to distinguish between old cars with low mileage and new cars with high mileage.
- log(mileage) - log-transformed mileage to reduce skewness and limit the impact of extreme values.
- is_luxury - binary variable indicating whether the car belongs to a luxury brand (BMW, Audi, Mercedes), capturing brand-related price effects.

In [34]:
def feature_engineering(df):
    #create carAge
    df['carAge'] = 2020 - df['year']
    
    #create mileagePerYear
    df['mileagePerYear'] = np.where(
    (df['carAge'].notna()) & (df['carAge'] != 0) ,
    df['mileage'] / df['carAge'],
    0)
    
    #create mpg_per_EngineSize
    df['mpg_per_engineSize'] = np.where(
    (df['engineSize'].notna()) & (df['engineSize'] != 0),
    df['mpg'] / df['engineSize'],
    0)
    
    #create carAge*Mileage
    df['carAge*Mileage'] = df['carAge'] * df['mileage']
    
    # log(mileage)
    df['log(mileage)'] = np.log1p(df["mileage"])
   
    # create a binary variable, 1 for luxury brands (Audi, Mercedes & BMW), 0 for non luxury
    df['is_luxury'] = df['Brand'].isin(['BMW', 'Audi', 'Mercedes']).astype(int)
    
    return df

In [35]:
#feature engineering for train
x_train = feature_engineering(x_train)

#feature engineering for validation
x_val = feature_engineering(x_val)

#feature engineering for test
test = feature_engineering(test)
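The `np.where` guard in `feature_engineering` has a side effect worth noting: a missing or zero `carAge` yields `mileagePerYear = 0` rather than a missing value, so only a missing `mileage` propagates as NaN. The sketch below reproduces the same expression on a toy frame with plain float columns (an assumption; the notebook uses a nullable integer type for `year`).

```python
import numpy as np
import pandas as pd

# Toy rows: normal car, brand-new car (age 0), missing age, missing mileage.
toy = pd.DataFrame({
    'carAge': [5.0, 0.0, np.nan, 4.0],
    'mileage': [50000.0, 1200.0, 30000.0, np.nan],
})

# Same guard as in feature_engineering: divide only when age is known and non-zero.
toy['mileagePerYear'] = np.where(
    toy['carAge'].notna() & (toy['carAge'] != 0),
    toy['mileage'] / toy['carAge'],
    0,
)

print(toy)
```

This behaviour is consistent with `mileagePerYear` showing fewer missing values than `mileage` in the counts of the next section.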

In [36]:
metric_features.append('carAge')
metric_features.append('mileagePerYear')
metric_features.append('mpg_per_engineSize')
metric_features.append('carAge*Mileage')
metric_features.append('log(mileage)')
metric_features.append('is_luxury')
metric_features

['year',
 'mileage',
 'tax',
 'mpg',
 'engineSize',
 'previousOwners',
 'carAge',
 'mileagePerYear',
 'mpg_per_engineSize',
 'carAge*Mileage',
 'log(mileage)',
 'is_luxury']

<a id="14th-bullet"></a>
## **3.7 Missing Values Treatment (Categorical only)**


We treat missing values following the strategy of the main pipeline. For the categorical features, we fill the missing values with the word 'Unknown'.

In [37]:
#Remembering the nr of missing values
x_train.isna().sum()

Brand                 1017
model                 1065
year                  1030
transmission          1074
mileage               1046
fuelType              1061
tax                   5483
mpg                   5510
engineSize            1074
paintQuality%         1089
previousOwners        1128
carAge                1030
mileagePerYear         982
mpg_per_engineSize    5393
carAge*Mileage        2061
log(mileage)          1046
is_luxury                0
dtype: int64

In [38]:
non_metric_features

['Brand', 'model', 'transmission', 'fuelType', 'paintQuality%', 'hasDamage']

In [39]:
non_metric_mv = [ 'model', 'transmission', 'fuelType', 'Brand']

In [40]:
def fill_categorical_unknown(df, lista):
    for col in lista:
        df[col] = df[col].fillna('Unknown')
    return df

In [41]:
#fill categorical mv for training data
x_train = fill_categorical_unknown(x_train, non_metric_mv)

In [42]:
#fill categorical mv for validation data
x_val = fill_categorical_unknown(x_val, non_metric_mv)

#fill categorical mv for test data
test = fill_categorical_unknown(test, non_metric_mv)

In [43]:
x_train['Brand'].value_counts(dropna=False)

Brand
VW          12368
Ford        11211
Mercedes     8137
Opel         6528
Audi         5140
Toyota       3263
Skoda        3071
Hyundai      2347
Unknown      1017
BMW            99
Name: count, dtype: int64

<a id="15th-bullet"></a>
## **3.8 Scaling Numerical Features**


In the main pipeline, we found Standard Scaling to provide the best performance on the validation data.
To maintain consistency, we apply the same scaling method here.
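Since scaling happens here before numerical imputation, it helps to recall how `StandardScaler` treats missing values: NaNs are ignored when the mean and standard deviation are fitted, and they are kept as NaN by `transform`. The following minimal sketch uses made-up values and fits on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training and validation columns, each with one missing value.
train_toy = pd.DataFrame({'mileage': [10000.0, 20000.0, np.nan, 30000.0]})
val_toy = pd.DataFrame({'mileage': [20000.0, np.nan]})

# Fit on training data only; the mean uses the three observed values.
scaler_toy = StandardScaler().fit(train_toy)

# Validation data is transformed with the training statistics; NaN stays NaN.
val_scaled_toy = scaler_toy.transform(val_toy)

print(scaler_toy.mean_)
print(val_scaled_toy)
```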

In [44]:
x_train[metric_features] = x_train[metric_features].replace(pd.NA, np.nan).astype(float)
x_val[metric_features] = x_val[metric_features].replace(pd.NA, np.nan).astype(float)
test[metric_features] = test[metric_features].replace(pd.NA, np.nan).astype(float)

In [45]:
def scale_training(df, features):
    #create scaler
    scaler = StandardScaler()
    
    #fit transform to training data
    df_scalled_array = scaler.fit_transform(df[features])
    
    #transform into dataframe
    df_train_scaled = pd.DataFrame(df_scalled_array, columns=features, index=df.index)
    
    return df_train_scaled, scaler

In [46]:
#scale training data and store scaler
x_train_scaled, scaler = scale_training(x_train, metric_features)

In [47]:
def scale_val_test(df, features, scaler):
    #transform  validation/test data
    scaled_array = scaler.transform(df[features])
    
    #transform into dataframe
    df_scaled = pd.DataFrame(scaled_array, columns=features, index=df.index)
    
    return df_scaled

In [48]:
#scale validation data
x_val_scaled = scale_val_test(x_val, metric_features, scaler)

#scale test data
test_scaled = scale_val_test(test, metric_features, scaler)

<a id="16th-bullet"></a>
## **3.9 Missing Values Treatment (Numerical data)**


As in our main notebook, missing values in metric features are handled as follows:
- We apply a KNN Imputer to fill missing values in 'mileage', 'tax', 'mpg', 'engineSize', 'mileagePerYear', 'mpg_per_engineSize' and 'log(mileage)'.
- We use the training median for missing values in 'year', 'carAge' and 'carAge*Mileage'.
- We set missing values in 'previousOwners' to 0. Because the data are already standardized at this point, this corresponds to the training mean.
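As an illustration of the KNN step, the sketch below fits a `KNNImputer` on a tiny made-up training matrix and uses it to fill a missing value in a new row. The data and `n_neighbors=2` are assumptions chosen to keep the example readable; the notebook uses `n_neighbors=7` on the scaled features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up training matrix; the last row has a missing second feature.
X_train_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

# Fit on training rows only, as done for x_train_scaled in the notebook.
imputer_toy = KNNImputer(n_neighbors=2).fit(X_train_toy)

# A new row with a missing second feature is filled from its nearest
# training neighbours that have that feature observed.
X_val_toy = np.array([[2.5, np.nan]])
filled = imputer_toy.transform(X_val_toy)

print(filled)
```

Distances are computed only on the features that are present in both rows, which is why scaling the features beforehand matters for this imputer.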

In [49]:
#Remembering the nr of missing values
x_train_scaled.isna().sum()

year                  1030
mileage               1046
tax                   5483
mpg                   5510
engineSize            1074
previousOwners        1128
carAge                1030
mileagePerYear         982
mpg_per_engineSize    5393
carAge*Mileage        2061
log(mileage)          1046
is_luxury                0
dtype: int64

In [50]:
#percentage of missing values
(x_train_scaled.isna().sum()/len(x_train))*100

year                   1.936782
mileage                1.966868
tax                   10.310073
mpg                   10.360843
engineSize             2.019518
previousOwners         2.121058
carAge                 1.936782
mileagePerYear         1.846524
mpg_per_engineSize    10.140840
carAge*Mileage         3.875444
log(mileage)           1.966868
is_luxury              0.000000
dtype: float64

In [51]:
variables_knn = ['mileage', 'tax', 'mpg', 'engineSize', 'mileagePerYear', 'mpg_per_engineSize', 'log(mileage)']

In [52]:
imputer = KNNImputer(n_neighbors=7)

In [53]:
imputer.fit(x_train_scaled[variables_knn])

KNNImputer(n_neighbors=7)


In [54]:
medians = {
    'year' : x_train_scaled['year'].median(),
    'carAge' : x_train_scaled['carAge'].median(),
    'carAge*Mileage' : x_train_scaled['carAge*Mileage'].median(),
}

In [55]:
def fill_MV(df, variables_knn, imputer, medians):
    
    df_scaled_mv = df.copy()
    
    #apply KNN imputer to variables: 'mileage', 'tax', 'mpg', 'engineSize', 'mileagePerYear', 'mpg_per_engineSize', 'log(mileage)'
    df_scaled_mv[variables_knn] = imputer.transform(df_scaled_mv[variables_knn])
    
    #fill year, carAge and carAge*Mileage with median
    for key, value in medians.items():
        df_scaled_mv[key] = df_scaled_mv[key].fillna(value)
    
    #fill previousOwners with 0
    df_scaled_mv['previousOwners'] = df_scaled_mv['previousOwners'].fillna(0)
    
    return df_scaled_mv

In [56]:
x_train_scaled_mv = fill_MV(x_train_scaled,variables_knn, imputer, medians)

In [57]:
x_val_scaled_mv = fill_MV(x_val_scaled,variables_knn, imputer, medians)

In [58]:
test_scaled_mv = fill_MV(test_scaled, variables_knn, imputer, medians)

In [59]:
x_train_scaled_mv.isna().sum()

year                  0
mileage               0
tax                   0
mpg                   0
engineSize            0
previousOwners        0
carAge                0
mileagePerYear        0
mpg_per_engineSize    0
carAge*Mileage        0
log(mileage)          0
is_luxury             0
dtype: int64

<a id="17th-bullet"></a>
## **3.10 Encoding Categorical Variables**


We will apply One Hot Encoding as in the main notebook for low-cardinality variables (Brand, Fuel Type, Transmission). For the variable **Model**, we will again use **Frequency Encoding**.
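Frequency encoding replaces each category with its relative frequency in the training data. A minimal sketch with made-up model names follows; categories that never appear in training are mapped to 0, mirroring the `fillna(0)` used for validation and test data.

```python
import pandas as pd

# Made-up model names for illustration.
train_models = pd.Series(['Focus', 'Focus', 'Golf', 'Fiesta'])
val_models = pd.Series(['Golf', 'Focus', 'Terracan'])

# Relative frequencies are learned from the training data only.
freq_map_toy = train_models.value_counts(normalize=True)

# Unseen categories ('Terracan' here) have no entry in the map and become 0.
val_freq = val_models.map(freq_map_toy).fillna(0)

print(val_freq.tolist())
```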

In [60]:
non_metric_features_encoder = [f for f in non_metric_features if f != 'hasDamage']
non_metric_features_encoder

['Brand', 'model', 'transmission', 'fuelType', 'paintQuality%']

In [61]:
ohe_features = ['Brand', 'transmission', 'fuelType']

In [62]:
def encoding_train(df):
    ohe_features = ['Brand', 'transmission', 'fuelType']
    model_feature = 'model'
    
    #apply ohe to 'Brand', 'transmission', 'fuelType'
    ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown= 'ignore')
    ohe_array = ohe.fit_transform(df[ohe_features])
    ohe_cols = ohe.get_feature_names_out(ohe_features)
    ohe_df = pd.DataFrame(ohe_array, columns=ohe_cols, index=df.index)
    
    #apply frequency encoding to model
    freq_map = df[model_feature].value_counts(normalize=True)
    freq_df = df[model_feature].map(freq_map).to_frame(name = 'model_freq')
    
    #concatenate both df
    encoded_df = pd.concat([ohe_df, freq_df], axis=1)
    
    return encoded_df, ohe, freq_map 

In [63]:
def encoding_test_val(df, ohe, freq_map):
    ohe_features = ['Brand', 'transmission', 'fuelType']
    model_feature = 'model'
    
    #apply ohe to 'Brand', 'transmission', 'fuelType'
    ohe_array = ohe.transform(df[ohe_features])
    ohe_cols = ohe.get_feature_names_out(ohe_features)
    ohe_df = pd.DataFrame(ohe_array, columns = ohe_cols, index = df.index)
    
    #apply frequency encoding to model
    freq_df = df[model_feature].map(freq_map).fillna(0).to_frame(name='model_freq')
    
    #concatenate both df
    encoded_df = pd.concat([ohe_df, freq_df], axis = 1)
    
    return encoded_df 

In [64]:
#apply encoder to train
encoded_train, ohe, freq_map = encoding_train(x_train)

In [65]:
#apply encoder to validation
encoded_val = encoding_test_val(x_val, ohe, freq_map)

#apply encoder to test
encoded_test = encoding_test_val(test, ohe, freq_map)



In [66]:
print("Train and Validation have same columns?:", encoded_train.columns.equals(encoded_val.columns))
print("Train and Test have same columns?:", encoded_train.columns.equals(encoded_test.columns))

Train and Validation have same columns?: True
Train and Test have same columns?: True


Now we have to concatenate the dataframe with the metric variables scaled and the dataframe with the categorical variables encoded.

In [67]:
def concatenate_df(scaled, encoded, df):
    
    #concat into final dataset
    df_final = pd.concat([scaled, encoded], axis = 1)
    
    return df_final

In [68]:
#concatenate all dfs
#train
x_train_final = concatenate_df(x_train_scaled_mv, encoded_train, x_train)

#validation
x_val_final = concatenate_df(x_val_scaled_mv, encoded_val, x_val)

#test
test_final = concatenate_df(test_scaled_mv, encoded_test, test)

In [69]:
x_train = x_train_final.copy()

In [70]:
x_train.columns 

Index(['year', 'mileage', 'tax', 'mpg', 'engineSize', 'previousOwners',
       'carAge', 'mileagePerYear', 'mpg_per_engineSize', 'carAge*Mileage',
       'log(mileage)', 'is_luxury', 'Brand_BMW', 'Brand_Ford', 'Brand_Hyundai',
       'Brand_Mercedes', 'Brand_Opel', 'Brand_Skoda', 'Brand_Toyota',
       'Brand_Unknown', 'Brand_VW', 'transmission_Manual',
       'transmission_Other', 'transmission_Semi-Auto', 'transmission_Unknown',
       'fuelType_Electric', 'fuelType_Hybrid', 'fuelType_Other',
       'fuelType_Petrol', 'fuelType_Unknown', 'model_freq'],
      dtype='object')

In [71]:
x_val = x_val_final.copy()

In [72]:
x_val.columns

Index(['year', 'mileage', 'tax', 'mpg', 'engineSize', 'previousOwners',
       'carAge', 'mileagePerYear', 'mpg_per_engineSize', 'carAge*Mileage',
       'log(mileage)', 'is_luxury', 'Brand_BMW', 'Brand_Ford', 'Brand_Hyundai',
       'Brand_Mercedes', 'Brand_Opel', 'Brand_Skoda', 'Brand_Toyota',
       'Brand_Unknown', 'Brand_VW', 'transmission_Manual',
       'transmission_Other', 'transmission_Semi-Auto', 'transmission_Unknown',
       'fuelType_Electric', 'fuelType_Hybrid', 'fuelType_Other',
       'fuelType_Petrol', 'fuelType_Unknown', 'model_freq'],
      dtype='object')

In [73]:
#make sure all columns are in the same order
x_val = x_val[x_train.columns]

In [74]:
x_val.columns

Index(['year', 'mileage', 'tax', 'mpg', 'engineSize', 'previousOwners',
       'carAge', 'mileagePerYear', 'mpg_per_engineSize', 'carAge*Mileage',
       'log(mileage)', 'is_luxury', 'Brand_BMW', 'Brand_Ford', 'Brand_Hyundai',
       'Brand_Mercedes', 'Brand_Opel', 'Brand_Skoda', 'Brand_Toyota',
       'Brand_Unknown', 'Brand_VW', 'transmission_Manual',
       'transmission_Other', 'transmission_Semi-Auto', 'transmission_Unknown',
       'fuelType_Electric', 'fuelType_Hybrid', 'fuelType_Other',
       'fuelType_Petrol', 'fuelType_Unknown', 'model_freq'],
      dtype='object')

In [75]:
test = test_final.copy()

In [76]:
test.columns 

Index(['year', 'mileage', 'tax', 'mpg', 'engineSize', 'previousOwners',
       'carAge', 'mileagePerYear', 'mpg_per_engineSize', 'carAge*Mileage',
       'log(mileage)', 'is_luxury', 'Brand_BMW', 'Brand_Ford', 'Brand_Hyundai',
       'Brand_Mercedes', 'Brand_Opel', 'Brand_Skoda', 'Brand_Toyota',
       'Brand_Unknown', 'Brand_VW', 'transmission_Manual',
       'transmission_Other', 'transmission_Semi-Auto', 'transmission_Unknown',
       'fuelType_Electric', 'fuelType_Hybrid', 'fuelType_Other',
       'fuelType_Petrol', 'fuelType_Unknown', 'model_freq'],
      dtype='object')

In [77]:
#make sure all columns are in the same order
test = test[x_train.columns]

<a id="18th-bullet"></a>
# **4. Feature Selection**

No additional feature selection is performed in this open-ended section, as we are not trying to further optimize the feature space. We simply restrict the data to the fixed feature list defined below.

In [78]:
metric_features_final = ['year', 'tax' ,'mpg', 'engineSize' ,'mileagePerYear', 'mpg_per_engineSize', 'log(mileage)']

non_metric_final = ['Brand_BMW', 'Brand_Ford', 'Brand_Hyundai',
       'Brand_Mercedes', 'Brand_Opel', 'Brand_Skoda', 'Brand_Toyota',
       'Brand_Unknown', 'Brand_VW', 'transmission_Manual',
       'transmission_Other', 'transmission_Semi-Auto', 'transmission_Unknown',
       'fuelType_Electric', 'fuelType_Hybrid', 'fuelType_Other',
       'fuelType_Petrol', 'fuelType_Unknown', 'model_freq', 'is_luxury']

features_final = metric_features_final + non_metric_final

In [79]:
features_final

['year',
 'tax',
 'mpg',
 'engineSize',
 'mileagePerYear',
 'mpg_per_engineSize',
 'log(mileage)',
 'Brand_BMW',
 'Brand_Ford',
 'Brand_Hyundai',
 'Brand_Mercedes',
 'Brand_Opel',
 'Brand_Skoda',
 'Brand_Toyota',
 'Brand_Unknown',
 'Brand_VW',
 'transmission_Manual',
 'transmission_Other',
 'transmission_Semi-Auto',
 'transmission_Unknown',
 'fuelType_Electric',
 'fuelType_Hybrid',
 'fuelType_Other',
 'fuelType_Petrol',
 'fuelType_Unknown',
 'model_freq',
 'is_luxury']

In [80]:
x_train = x_train[features_final].copy()

x_val = x_val[features_final].copy()

In [81]:
test = test[features_final].copy()

<a id="17th-bullet"></a>
# **5. Price Segment Definition**

We define three price segments (low, mid and high) by splitting the target variable at its 33% and 66% quantiles, computed on the training data only.

In [82]:
q_low, q_high = y_train.quantile([0.33, 0.66])

def price_segment(y, q_low, q_high):
    return np.where(
        y <= q_low, 'low',
        np.where(y <= q_high, 'mid', 'high')
    )

This function assigns each car to a low, mid, or high price segment based on the quantile cutoffs.

We apply the same segmentation rules to both training and validation targets, using cutoffs learned only from the training data.

In [83]:
y_train_segment = price_segment(y_train, q_low, q_high)
y_val_segment = price_segment(y_val, q_low, q_high)
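
As a small, self-contained illustration of the segmentation rule, the sketch below applies the same nested `np.where` logic to a toy price series. The prices and the names `toy_train`, `toy_val` and `toy_price_segment` are invented for illustration and are not part of the project. The `pd.cut` variant is shown only as a possible equivalent, not as what the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy prices, invented for illustration only
toy_train = pd.Series([5000, 8000, 11000, 14000, 17000, 20000, 30000, 45000, 60000])
toy_val = pd.Series([7000, 15000, 50000])

# Cutoffs are learned on the training prices only
q_lo, q_hi = toy_train.quantile([0.33, 0.66])

def toy_price_segment(y, q_lo, q_hi):
    # Same rule as price_segment: <= q_lo -> low, <= q_hi -> mid, else high
    return np.where(y <= q_lo, "low", np.where(y <= q_hi, "mid", "high"))

val_segments = toy_price_segment(toy_val, q_lo, q_hi)

# Possible equivalent with pd.cut (right-closed bins match the <= comparisons)
val_segments_cut = pd.cut(
    toy_val,
    bins=[-np.inf, q_lo, q_hi, np.inf],
    labels=["low", "mid", "high"],
).astype(str).to_numpy()

print(val_segments)
print(val_segments_cut)
```

Because the validation labels reuse `q_lo` and `q_hi` from the training series, no information from the validation prices enters the cutoffs.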

We check the actual price cutoffs and verify that the segment distributions are similar in the train and validation sets.

In [84]:
print(f"q_low (33%): {q_low:.2f}")
print(f"q_high (66%): {q_high:.2f}")

print("\nTrain segment distribution:")
print(pd.Series(y_train_segment).value_counts(normalize=True).round(3))

print("\nValidation segment distribution:")
print(pd.Series(y_val_segment).value_counts(normalize=True).round(3))

q_low (33%): 11495.00
q_high (66%): 18250.00

Train segment distribution:
high    0.340
low     0.331
mid     0.329
Name: proportion, dtype: float64

Validation segment distribution:
high    0.342
low     0.339
mid     0.319
Name: proportion, dtype: float64


We have to store segment labels as dataframes with original indexes so they can be used later for filtering by segment.

In [85]:
y_train_segment_df = pd.DataFrame({'segment': y_train_segment}, index=y_train.index)
y_val_segment_df   = pd.DataFrame({'segment': y_val_segment},   index=y_val.index)

<a id="18th-bullet"></a>
# **6. Price Segment Classification**

In this section, we are setting up a classification model to predict the price segment (low, mid or high) based on the training data.

First, we map the price segments to integers so the classifier can work with them, and then apply this mapping to the train and validation segment labels.

In [86]:
segment_mapping = {'low': 0, 'mid': 1, 'high': 2}

In [87]:
# Ensure 1D arrays and map strings -> ints
y_train_seg_enc = np.vectorize(segment_mapping.get)(np.ravel(y_train_segment_df)).astype(int)
y_val_seg_enc   = np.vectorize(segment_mapping.get)(np.ravel(y_val_segment_df)).astype(int)
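
The encoding step can also be written with `Series.map`, which keeps the mapping and its inverse next to each other. This is a minimal sketch on invented labels. The name `toy_segments` is hypothetical, and the notebook itself uses `np.vectorize` as shown above.

```python
import pandas as pd

segment_mapping = {"low": 0, "mid": 1, "high": 2}
# Inverse mapping, used later to turn predicted codes back into labels
inv_segment_mapping = {code: label for label, code in segment_mapping.items()}

# Toy labels with a non-trivial index, invented for illustration
toy_segments = pd.DataFrame(
    {"segment": ["low", "high", "mid", "high"]}, index=[10, 11, 12, 13]
)

# Series.map applies the dictionary element-wise; unknown labels would become NaN
encoded = toy_segments["segment"].map(segment_mapping).to_numpy()
decoded = pd.Series(encoded).map(inv_segment_mapping).to_numpy()

print(encoded, decoded)
```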

We choose Random Forest for consistency with the main pipeline and because it performs well.

In [88]:
RF_seg = RandomForestClassifier(
    random_state=42,
    n_estimators=400,
    class_weight="balanced",
    n_jobs=-1
)

In [89]:
RF_seg.fit(x_train, y_train_seg_enc)

RandomForestClassifier(class_weight='balanced', n_estimators=400, n_jobs=-1,
                       random_state=42)

We evaluate the classifier with accuracy and balanced accuracy. Balanced accuracy, the average of the per-class recalls, ensures that performance is not dominated by one price segment. Because the segments are defined by quantiles and are nearly equal in size, the two metrics should be close.
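
To make the difference between the two metrics concrete, the sketch below uses a deliberately imbalanced toy example. The labels are invented, and the real segments are far more balanced than this. It shows that balanced accuracy equals the unweighted mean of the per-class recalls.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: 0 = low, 1 = mid, 2 = high (invented for illustration)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 2])

acc = accuracy_score(y_true_toy, y_pred_toy)
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)

# Balanced accuracy is the unweighted mean of the per-class recalls
per_class_recall = recall_score(y_true_toy, y_pred_toy, average=None)

print(acc, bal_acc, per_class_recall)
```

Here the majority class inflates plain accuracy, while balanced accuracy is pulled down by the two smaller classes.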

In [90]:
# Train metrics
y_pred_train = RF_seg.predict(x_train)
print("Train Accuracy:", round(accuracy_score(y_train_seg_enc, y_pred_train), 4))
print("Train Balanced Accuracy:", round(balanced_accuracy_score(y_train_seg_enc, y_pred_train), 4))

Train Accuracy: 0.9994
Train Balanced Accuracy: 0.9994


In [91]:
# Validation metrics
y_pred_val = RF_seg.predict(x_val)
print("\nVal Accuracy:", round(accuracy_score(y_val_seg_enc, y_pred_val), 4))
print("Val Balanced Accuracy:", round(balanced_accuracy_score(y_val_seg_enc, y_pred_val), 4))


Val Accuracy: 0.8893
Val Balanced Accuracy: 0.8883


We merge train and validation data so we can tune the classifier using a fixed split.

In [92]:
x_full = np.concatenate([x_train, x_val])
y_full = np.concatenate([y_train_seg_enc, y_val_seg_enc])

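We manually define which rows belong to train and validation so that they are never mixed during tuning: in `test_fold`, `-1` marks rows that always stay in the training set and `0` marks rows that form the single validation fold.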

In [93]:
test_fold = np.concatenate([
    -1 * np.ones(len(y_train_seg_enc), dtype=int),
     0 * np.ones(len(y_val_seg_enc), dtype=int)
])
ps = PredefinedSplit(test_fold=test_fold)
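
The behaviour of `PredefinedSplit` can be checked on a tiny example. The sketch below uses toy sizes (5 training rows and 3 validation rows, chosen arbitrarily) and confirms that exactly one split is produced.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# 5 "training" rows (-1) followed by 3 "validation" rows (fold 0); toy sizes
toy_fold = np.concatenate([np.full(5, -1), np.zeros(3, dtype=int)])
toy_ps = PredefinedSplit(test_fold=toy_fold)

# Only one split exists: rows marked -1 are never used as a test fold
splits = list(toy_ps.split())
train_idx, val_idx = splits[0]

print(toy_ps.get_n_splits(), train_idx, val_idx)
```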

In [94]:
param_dist = {
    "n_estimators": randint(200, 900),
    "max_depth": [None] + list(range(5, 41, 5)),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 25),
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False]
}

In [95]:
rf = RandomForestClassifier(
    random_state=42,
    class_weight="balanced",
    n_jobs=-1
)

In [96]:
bal_acc_scorer = make_scorer(balanced_accuracy_score)

In [97]:
search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=30,                 
    scoring=bal_acc_scorer,    
    cv=ps,
    verbose=1,
    random_state=42,
    n_jobs=-1,
    refit=True
)

In [98]:
search.fit(x_full, y_full)

Fitting 1 folds for each of 30 candidates, totalling 30 fits



In [99]:
print("\nBest params:", search.best_params_)
print("Best CV (val) balanced accuracy:", round(search.best_score_, 4))


Best params: {'bootstrap': False, 'max_depth': 35, 'max_features': 'log2', 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 812}
Best CV (val) balanced accuracy: 0.8931


In [100]:
best_rf_seg = search.best_estimator_

y_pred_val_best = best_rf_seg.predict(x_val)

print("Val Accuracy (best):", round(accuracy_score(y_val_seg_enc, y_pred_val_best), 4))
print("Val Balanced Accuracy (best):", round(balanced_accuracy_score(y_val_seg_enc, y_pred_val_best), 4))

print("\nClassification report (val, best):\n", classification_report(y_val_seg_enc, y_pred_val_best, digits=4))



Val Accuracy (best): 0.958
Val Balanced Accuracy (best): 0.9576

Classification report (val, best):
               precision    recall  f1-score   support

           0     0.9683    0.9665    0.9674      7721
           1     0.9292    0.9413    0.9352      7277
           2     0.9751    0.9651    0.9701      7794

    accuracy                         0.9580     22792
   macro avg     0.9575    0.9576    0.9576     22792
weighted avg     0.9582    0.9580    0.9580     22792



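These metrics are computed on the validation set with `best_rf_seg`; the classification report gives precision, recall and F1-score for each price segment. Note that `refit=True` refits the best configuration on all of `x_full`, which includes the validation rows, so the 0.958 accuracy is optimistic. A fairer validation estimate for this configuration is the search score above (balanced accuracy of 0.8931), which comes from a model fitted on the training rows only.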
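
One possible way to obtain a validation score that is not affected by the refit is to clone the best estimator and fit it on the training rows only. The sketch below illustrates this on synthetic data from `make_classification`. The data, the small search space and all `toy_`/`honest_` names are assumptions made for illustration, not part of the project.

```python
import numpy as np
from scipy.stats import randint
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

# Synthetic 3-class data standing in for the real features (illustration only)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

# -1 = always train, 0 = single validation fold
fold = np.concatenate([np.full(len(y_tr), -1), np.zeros(len(y_va), dtype=int)])
toy_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(10, 30), "max_depth": [3, 5, None]},
    n_iter=3, scoring="balanced_accuracy", cv=PredefinedSplit(fold),
    random_state=42, refit=True,
)
toy_search.fit(X, y)

# best_estimator_ was refit on train + validation, so this score is optimistic
leaky = balanced_accuracy_score(y_va, toy_search.best_estimator_.predict(X_va))

# Clone keeps the tuned hyperparameters; fit on the training rows only
honest_model = clone(toy_search.best_estimator_).fit(X_tr, y_tr)
honest = balanced_accuracy_score(y_va, honest_model.predict(X_va))

print(round(leaky, 4), round(honest, 4), round(toy_search.best_score_, 4))
```

With a fixed `random_state`, the train-only clone reproduces the score stored in `best_score_`, which is why that value is the more trustworthy one to report.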

<a id="21st-bullet"></a>
# **7. Segment-Specific Regression Models**

We now train one regression model per price segment to evaluate whether segment-specific models perform better than a single global model.

In [101]:
x_train = x_train.copy()
x_val = x_val.copy()
x_val = x_val.loc[:, x_train.columns]

As in our main pipeline, this function searches a grid of weight combinations for blending three models and returns the combination with the lowest MAE.

In [102]:
def find_best_weights_3(pred1, pred2, pred3, y_true, step=0.05):
    """Grid-search blend weights (w1, w2, w3) that sum to 1 and minimise MAE."""
    best_mae = np.inf
    best_w = None
    for w1 in np.arange(0.10, 0.41, step):
        for w2 in np.arange(0.10, 0.61, step):
            w3 = 1 - w1 - w2
            # skip combinations that leave no weight for the third model
            if w3 <= 0:
                continue
            blend = w1*pred1 + w2*pred2 + w3*pred3
            mae = mean_absolute_error(y_true, blend)
            if mae < best_mae:
                best_mae = mae
                best_w = (w1, w2, w3)
    return best_w, best_mae
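
To see the weight search in action, the sketch below runs the same grid on synthetic predictions. The prices, the noise levels and the name `toy_find_best_weights_3` are invented for illustration; the function is a compact restatement of the grid above, not a replacement for it.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def toy_find_best_weights_3(pred1, pred2, pred3, y_true, step=0.05):
    # Same grid as find_best_weights_3: w1 in [0.10, 0.40], w2 in [0.10, 0.60]
    best_mae, best_w = np.inf, None
    for w1 in np.arange(0.10, 0.41, step):
        for w2 in np.arange(0.10, 0.61, step):
            w3 = 1 - w1 - w2
            if w3 <= 0:
                continue
            mae = mean_absolute_error(y_true, w1 * pred1 + w2 * pred2 + w3 * pred3)
            if mae < best_mae:
                best_mae, best_w = mae, (float(w1), float(w2), float(w3))
    return best_w, best_mae

rng = np.random.default_rng(0)
y_true = rng.uniform(5_000, 40_000, size=500)      # synthetic prices
pred_a = y_true + rng.normal(0, 800, size=500)     # three noisy "models"
pred_b = y_true + rng.normal(0, 1_200, size=500)
pred_c = y_true + rng.normal(0, 2_000, size=500)

weights, blend_mae = toy_find_best_weights_3(pred_a, pred_b, pred_c, y_true)
print(tuple(round(w, 2) for w in weights), round(blend_mae, 2))
```

Note that the grid caps the first weight at 0.40 and the second at 0.60, so the search cannot put all the weight on a single model even if that model were the best one.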

In [103]:
base_tree = DecisionTreeRegressor(random_state=42)

In [104]:
et2 = ExtraTreesRegressor(
    bootstrap=False, max_depth=22, max_features=None, min_impurity_decrease=0.0005,
    min_samples_leaf=1, min_samples_split=6, n_estimators=617, criterion="squared_error",
    random_state=42
)

In [105]:
ab2 = AdaBoostRegressor(
    estimator=base_tree, learning_rate=0.24598491974895575, n_estimators=191,
    random_state=42
)

In [106]:
gb2 = GradientBoostingRegressor(
    learning_rate=0.25959668994237617, loss="squared_error", max_depth=4, max_features=None,
    min_samples_leaf=17, min_samples_split=45, n_estimators=1293, subsample=0.7879778055963772,
    random_state=42
)

For each price segment, we fit the three regressors on that segment's training rows, tune the blend weights on its validation rows, and then refit the models on train + validation.

In [107]:
segment_blends = {}

for seg in ["low", "mid", "high"]:

    train_mask = (y_train_segment_df["segment"] == seg)   
    val_mask   = (y_val_segment_df["segment"] == seg)     

    x_train_seg = x_train.loc[train_mask]
    y_train_seg_price = y_train.loc[train_mask]

    x_val_seg = x_val.loc[val_mask]
    y_val_seg_price = y_val.loc[val_mask]

    # fit base models on segment train
    et = clone(et2).fit(x_train_seg, y_train_seg_price)
    gb = clone(gb2).fit(x_train_seg, y_train_seg_price)
    ab = clone(ab2).fit(x_train_seg, y_train_seg_price)

    # tune weights on segment validation
    p_et = et.predict(x_val_seg)
    p_gb = gb.predict(x_val_seg)
    p_ab = ab.predict(x_val_seg)

    w_best, mae_best = find_best_weights_3(p_et, p_gb, p_ab, y_val_seg_price, step=0.05)

    # refit on train + val of that segment 
    x_full_seg = pd.concat([x_train_seg, x_val_seg], axis=0)
    y_full_seg = pd.concat([y_train_seg_price, y_val_seg_price], axis=0)

    et_final = clone(et2).fit(x_full_seg, y_full_seg)
    gb_final = clone(gb2).fit(x_full_seg, y_full_seg)
    ab_final = clone(ab2).fit(x_full_seg, y_full_seg)

    # evaluate on that segment's validation (optimistic: the final models were refit on these rows)
    p1 = et_final.predict(x_val_seg)
    p2 = gb_final.predict(x_val_seg)
    p3 = ab_final.predict(x_val_seg)

    blend_val = w_best[0]*p1 + w_best[1]*p2 + w_best[2]*p3
    mae_val = mean_absolute_error(y_val_seg_price, blend_val)

    segment_blends[seg] = {
        "models": (et_final, gb_final, ab_final),
        "weights": w_best,
        "val_mae": mae_val
    }

    w_rounded = tuple(round(float(w), 2) for w in w_best)
    print(f"[{seg}] weights={w_rounded} | seg-val MAE={mae_val:.2f}")

[low] weights=(0.35, 0.5, 0.15) | seg-val MAE=304.59
[mid] weights=(0.35, 0.5, 0.15) | seg-val MAE=427.76
[high] weights=(0.2, 0.3, 0.5) | seg-val MAE=629.26


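This MAE shows how well each segment-specific blend performs when the true price segment is already known, so it reflects the best-case performance within each group. Because the final models were refit on train + validation before this evaluation, they have already seen these rows, so these values (and the validation MAEs in Section 8, which reuse the same models) are optimistic.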

<a id="25th-bullet"></a>
# **8. Performance Evaluation and Comparison**

<a id="26th-bullet"></a>
## 8.1 Overall MAE Comparison ##

To assess the overall performance of the hierarchical approach, we compute a global MAE on the validation set: the price segment is predicted first (here with `RF_seg`, the untuned classifier fitted on the training data only), and each observation is then priced by the blend associated with its predicted segment. The resulting errors are combined into a single MAE value that captures both classification and regression errors, which lets us compare this approach with a single global regression model.

In [108]:
# predict segments on validation set
y_val_seg_pred_enc = RF_seg.predict(x_val)

# map encoded segments back to labels
inv_segment_mapping = {0: "low", 1: "mid", 2: "high"}
y_val_seg_pred = pd.Series(y_val_seg_pred_enc).map(inv_segment_mapping).values

# hierarchical prediction
y_val_pred_hier = np.zeros(len(y_val))

for seg in ["low", "mid", "high"]:
    idx = (y_val_seg_pred == seg)
    if idx.sum() == 0:
        continue

    et_m, gb_m, ab_m = segment_blends[seg]["models"]
    w1, w2, w3 = segment_blends[seg]["weights"]

    p1 = et_m.predict(x_val.loc[idx])
    p2 = gb_m.predict(x_val.loc[idx])
    p3 = ab_m.predict(x_val.loc[idx])

    y_val_pred_hier[idx] = w1*p1 + w2*p2 + w3*p3

mae_hier = mean_absolute_error(y_val, y_val_pred_hier)
print("Hierarchical MAE (overall):", round(mae_hier, 2))

Hierarchical MAE (overall): 732.78
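
The routing loop is used for both validation and test, so it could be packaged into a helper. The sketch below is one hypothetical way to do that: `predict_hierarchical`, `const` and the constant toy "models" are assumptions for illustration. Unlike the notebook, it fills the output with NaN rather than zeros so that any unrouted row is easy to spot.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor

def predict_hierarchical(X, predicted_segments, blends):
    """Route each row to the blend of its predicted segment (hypothetical helper).

    blends maps a segment label to {"models": (m1, m2, m3), "weights": (w1, w2, w3)},
    mirroring the layout of segment_blends in the notebook.
    """
    predicted_segments = np.asarray(predicted_segments)
    out = np.full(len(X), np.nan)  # NaN makes any unrouted row easy to spot
    for seg, blend in blends.items():
        mask = predicted_segments == seg
        if not mask.any():
            continue
        preds = [m.predict(X.loc[mask]) for m in blend["models"]]
        out[mask] = sum(w * p for w, p in zip(blend["weights"], preds))
    return out

# Toy setup: constant "models" per segment, invented for illustration
X_toy = pd.DataFrame({"f": [1.0, 2.0, 3.0, 4.0]})

def const(value):
    # A stand-in regressor that always predicts the given value
    return DummyRegressor(strategy="constant", constant=value).fit(X_toy, np.zeros(len(X_toy)))

toy_blends = {
    "low": {"models": (const(8_000), const(9_000), const(10_000)), "weights": (0.35, 0.5, 0.15)},
    "high": {"models": (const(30_000), const(32_000), const(34_000)), "weights": (0.2, 0.3, 0.5)},
}
toy_pred = predict_hierarchical(X_toy, ["low", "high", "high", "low"], toy_blends)
print(toy_pred)
```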


<a id="27th-bullet"></a>
## 8.2 MAE by Price Segment ##

In [109]:
for seg in ["low", "mid", "high"]:
    idx = (y_val_segment_df["segment"] == seg)
    if idx.sum() == 0:
        continue

    mae_seg = mean_absolute_error(
        y_val.loc[idx],
        y_val_pred_hier[idx]
    )
    print(f"{seg.capitalize()} segment MAE:", round(mae_seg, 2))

Low segment MAE: 502.78
Mid segment MAE: 824.69
High segment MAE: 874.82


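Here the MAE is grouped by true price segment but computed from the hierarchical predictions, so it shows how the error grows when segments are predicted instead of known: from 304.59 to 502.78 for the low segment, from 427.76 to 824.69 for the mid segment, and from 629.26 to 874.82 for the high segment.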
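
A complementary view, sketched below on invented numbers, is to split the absolute errors by whether the classifier picked the right segment. All values and column names here are hypothetical and are not project results.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration (not project results)
toy = pd.DataFrame({
    "true_segment": ["low", "low", "mid", "mid", "high", "high"],
    "pred_segment": ["low", "mid", "mid", "low", "high", "high"],
    "y_true": [9_000, 11_000, 15_000, 12_000, 25_000, 40_000],
    "y_pred": [9_300, 12_500, 14_600, 10_900, 25_800, 39_000],
})
toy["abs_error"] = (toy["y_true"] - toy["y_pred"]).abs()
toy["routing"] = np.where(
    toy["true_segment"] == toy["pred_segment"], "correct", "misrouted"
)

# MAE per true segment (same grouping as the loop above)
mae_by_segment = toy.groupby("true_segment")["abs_error"].mean()

# MAE split by whether the classifier picked the right segment
mae_by_routing = toy.groupby("routing")["abs_error"].mean()

print(mae_by_segment, mae_by_routing, sep="\n\n")
```

Applied to the real validation predictions, this kind of split would indicate how much of the error comes from misrouted cars rather than from the regressors themselves.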

<a id="28th-bullet"></a>
# **9. Final Model**

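In this section, we apply the hierarchical pipeline to the test set. The tuned classifier is refit on train + validation and predicts each car's segment; the price then comes from that segment's blend, which combines the same model types as the best model of our main pipeline (`Blend between our optimized Gradient Boosting, our optimized Extra Trees Regressor and our optimized AdaBoost (with base tree)`).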

In [128]:
x_test = test.copy()
x_test = x_test.loc[:, x_train.columns]

In [129]:
x_seg_full = pd.concat([x_train, x_val], axis=0)
y_seg_full_enc = np.concatenate([y_train_seg_enc, y_val_seg_enc])

In [130]:
best_rf_seg.fit(x_seg_full, y_seg_full_enc)

RandomForestClassifier(bootstrap=False, class_weight='balanced', max_depth=35,
                       max_features='log2', min_samples_leaf=3, n_estimators=812,
                       n_jobs=-1, random_state=42)


In [131]:
test_seg_pred_enc = best_rf_seg.predict(x_test)
test_seg_pred = pd.Series(test_seg_pred_enc).map(inv_segment_mapping).values

In [132]:
final_pred = np.zeros(len(x_test), dtype=float)

for seg in ["low", "mid", "high"]:
    idx = (test_seg_pred == seg)
    if idx.sum() == 0:
        continue

    et_m, gb_m, ab_m = segment_blends[seg]["models"]
    w1, w2, w3 = segment_blends[seg]["weights"]

    p1 = et_m.predict(x_test.loc[idx])
    p2 = gb_m.predict(x_test.loc[idx])
    p3 = ab_m.predict(x_test.loc[idx])

    final_pred[idx] = w1*p1 + w2*p2 + w3*p3

In [133]:
predictions_open = pd.DataFrame({"carID": x_test.index.astype("int32"), "price": final_pred})

In [134]:
predictions_open.to_csv("delivery_open_section.csv", index=False)

<a id="29th-bullet"></a>
# **10. Business Implications and Recommendations**

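The hierarchical approach helped us understand how pricing behaves differently across the low, mid and high segments. However, its overall MAE was higher than the one obtained with the main pipeline, so the added segmentation step did not translate into better final predictions. This suggests that, for this dataset, the global model is still the most reliable option. A likely explanation is that the classifier is not perfectly accurate: a misclassified car is priced by a regressor trained only on another price range, and tree-based models essentially cannot predict outside the range of prices they were trained on, so each routing error tends to produce a large pricing error.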
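
One mechanism that may contribute to this result is that tree ensembles which average training targets cannot predict above the highest price they were trained on. The sketch below illustrates this with a synthetic one-feature dataset and an `ExtraTreesRegressor`. The data and all names are invented for illustration and are not the project's models.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
# Synthetic cars: price grows with a single made-up feature
feature = rng.uniform(0, 10, size=400)
price = 5_000 + 3_000 * feature + rng.normal(0, 500, size=400)
X = feature.reshape(-1, 1)

# "Low segment" regressor: trained only on the cheapest third of the cars
cutoff = np.quantile(price, 0.33)
low_mask = price <= cutoff
low_model = ExtraTreesRegressor(n_estimators=50, random_state=42)
low_model.fit(X[low_mask], price[low_mask])

# Expensive cars wrongly routed to the low-segment model
high_mask = price > np.quantile(price, 0.66)
pred_misrouted = low_model.predict(X[high_mask])

# Predictions are averages of training prices, so they stay at or below the cutoff
print(round(float(cutoff), 2), round(float(pred_misrouted.max()), 2))
```

Every misrouted expensive car is therefore priced at or below the low-segment cutoff, far from its true price.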

<a id="30th-bullet"></a>
# **11. Limitations and Future Work**

This analysis has a few limitations. First, the price segments were defined with quantiles, which may not reflect real-world market categories. Second, both the classifier and the segment regressors make errors, and these errors compound in the hierarchical pipeline.
In future work, this approach could be improved by exploring other segmentation methods and other classification models.
Additionally, with more time, we would develop a complete pipeline for each segment, possibly with different feature selection or different models per category, since the behaviour and characteristics of each segment differ. We believe this would make our conclusions even more useful for Cars4You.