<center>
  <h1 style="background-color: #80dfffff; color: #137a91ff; padding: 10px">
    <strong>Model Development</strong>
  </h1>
</center>

**Student IDs:**

Andreea Roica: 20250361

Beatriz Varela: 20250367

Barbara Franco: 20250388

Marisa Esteves: 20250348

#
<h2 style="background-color: #80dfffff; color: #137a91ff; padding: 5px; margin: 5px;">
<strong>Index</strong>
</h2>


[1. **Pre Processing**](#1st-bullet)<br>

[2. **Pipeline**](#2nd-bullet)<br>

[3. **Fine Tuning**](#3rd-bullet)<br>

[4. **Model Comparison**](#4th-bullet)<br>

[5. **Ablation Study**](#5th-bullet)<br>

[6. **Feature Importance**](#6th-bullet)<br>
    

#
<h2 style="background-color: #80dfffff; color: #137a91ff; padding: 5px; margin: 5px;">
<strong>Imports</strong>
</h2>

Importing the necessary libraries:

In [None]:
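from Classes import *

from functions import *

# Explicit imports for names used below, in case the star imports above do not expose them
import time

import numpy as np

import pandas as pd

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from sklearn.tree import DecisionTreeRegressor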

from sklearn.model_selection import KFold

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.neighbors import KNeighborsRegressor

from sklearn.linear_model import HuberRegressor, LinearRegression

from sklearn.neural_network import MLPRegressor

from sklearn.compose import TransformedTargetRegressor

from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import make_scorer

from sklearn.pipeline import Pipeline

import seaborn as sns

import shap

import math

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')


Loading the dataset:

In [None]:
df_train = pd.read_csv('train.csv')

Setting carID as index:

In [None]:
df_train.set_index('carID', inplace=True)

Setting a seed:

In [None]:
random_state = 42

#
<h2 id="1st-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  1. Pre Processing</strong>
</h2>


Drop the information provided by the mechanic:

In [None]:
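# Drop the paint quality column under either spelling ('paintQuality' or 'paintQuality%')
df_train.drop(columns=['paintQuality', 'paintQuality%'], errors='ignore', inplace=True)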

We start by setting the inconsistent values discussed in the EDA to NaN:

In [None]:
df_train.loc[df_train['year']>2020, 'year'] = np.nan

df_train.loc[df_train['mileage']<0, 'mileage'] = np.nan

df_train.loc[df_train['tax']<0, 'tax'] = np.nan

df_train.loc[df_train['mpg'] < 8, 'mpg'] = np.nan

df_train.loc[df_train['previousOwners']< 0, 'previousOwners'] = np.nan

df_train.loc[df_train['engineSize'] < 1, 'engineSize'] = np.nan
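
The masking pattern used above can be checked on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "year": [2015.0, 2023.0, 2019.0],
    "mileage": [42000.0, -10.0, 15000.0],
})

# Same pattern as above: the boolean mask selects the rows,
# the column label selects which cell of those rows is overwritten
toy.loc[toy["year"] > 2020, "year"] = np.nan
toy.loc[toy["mileage"] < 0, "mileage"] = np.nan

# Count how many values were flagged as missing in each column
n_missing = toy.isna().sum()
print(n_missing)
```

Only the offending cell is replaced; the rest of the row is kept, so the imputation step inside the pipeline can still use it.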

We proceed to round 'year' and 'previousOwners' down to whole numbers using the floor function. The other numerical features are rounded to 2 decimal places.

In [None]:
df_train['year'] = np.floor(df_train['year'])
df_train['previousOwners'] = np.floor(df_train['previousOwners'])

for feat in ['mileage', 'tax', 'mpg', 'engineSize']:  # paint quality was dropped above
    df_train[feat] = df_train[feat].round(2)

We also pre-process the categorical variables so that they have a uniform format for later treatment (inside k-fold CV). We remove leading and trailing spaces and uppercase all letters.

In [None]:
# Pre processing the categorical variables to be easier to find clusters in typos:
    # remove spaces (at the beginning and end) and uppercase all letters
    # does not replace NaN's
df_train['Brand'] = df_train['Brand'].where(df_train['Brand'].isna(), df_train['Brand'].astype(str).str.strip().str.upper())

df_train['model'] = df_train['model'].where(df_train['model'].isna(), df_train['model'].astype(str).str.strip().str.upper())

df_train['fuelType'] = df_train['fuelType'].where(df_train['fuelType'].isna(), df_train['fuelType'].astype(str).str.strip().str.upper())

df_train['transmission'] = df_train['transmission'].where(df_train['transmission'].isna(), df_train['transmission'].astype(str).str.strip().str.upper())
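
As a quick illustration of why the `where(... isna() ...)` guard is used, here is a minimal sketch on a made-up Series (`raw` is hypothetical). The guard keeps missing entries missing, whatever `astype(str)` does with them in the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical brand column with stray spaces, mixed case and a missing value
raw = pd.Series(["  bmw", "Audi ", np.nan, "vw"])

# where() keeps the original value wherever the condition is True (here: missing),
# and takes the normalised string everywhere else
clean = raw.where(raw.isna(), raw.astype(str).str.strip().str.upper())

print(clean.tolist())
```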


A 'W' in Brand could mean either 'BMW' or 'VW'. We already checked that all of them are 'VW', so let's correct them:

In [None]:
df_train.loc[df_train['Brand'] == 'W', 'Brand'] = 'VW'

Splitting the data into input variables (X) and the output variable (y):

In [None]:

y = df_train['price']
X = df_train.drop('price', axis=1)


#
<h2 id="2nd-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  2. Pipeline</strong>
</h2>


We are going to build a pipeline that is shared by all the models.

This function is defined here because it uses the classes, so it cannot be included in the functions.py file like the others.

In [None]:
def build_pipeline(regressor, use_log=False, target_scaler=None, scaler_instance=None):

    """
    Builds a complete machine learning pipeline for regression tasks. 
    The pipeline allows different regressors, scaling strategies and optional log-transformation or scaling of the target variable.

    Parameters
    ----------
    regressor: 
        Regression model used as the final estimator.

    use_log: boolean 
        Indicates whether to apply log-transformation to the target variable.

    target_scaler: scaler object
        Scaler applied to the target variable. Default is None (no target scaling).

    scaler_instance: scaler object 
        Scaler object passed to the scaling step.

    Returns
    -------
    sklearn.pipeline.Pipeline
        A machine learning pipeline with preprocessing, feature engineering, feature selection, and regression steps.
    """

    if use_log:
        final_reg = TransformedTargetRegressor(regressor=regressor,
                                               func=np.log,
                                               inverse_func=np.exp)

    elif target_scaler is not None:
        final_reg = TransformedTargetRegressor(
            regressor=regressor,
            transformer=target_scaler
        )
        
    else:
        final_reg = regressor

    return Pipeline([
        ('categorical_treatment', Categorical_Correction()),
        ('outlier_treatment', Outlier_Treatment()),
        ('missing_value_treatment', Missing_Value_Treatment()),
        ('typecasting', Typecasting()),
        ('feature_engineering', Feature_Engineering()),
        ('encoder', Encoder()),
        ('scaler', Scaler(scaler=scaler_instance)),
        ('feature_selection', Feature_Selection()),
        ('regressor', final_reg)
    ])
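
To make the `use_log` branch concrete, the sketch below shows what `TransformedTargetRegressor` does with `func=np.log` and `inverse_func=np.exp`. It uses synthetic data and a plain `LinearRegression`, both chosen only for illustration; the custom pipeline steps from Classes are left out.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 1))
# Synthetic, strictly positive target that is exactly log-linear in the feature
y_demo = np.exp(2.0 + 3.0 * X_demo[:, 0])

model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X_demo, y_demo)

# The fitted inner regressor works on log(y) ...
inner_pred = model.regressor_.predict(X_demo[:3])
# ... while predict() maps its output back to the original scale with exp
outer_pred = model.predict(X_demo[:3])
```

Because predictions come back on the original price scale, metrics such as MAE computed during cross-validation stay comparable between models trained with and without the log transformation.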

#
<h2 id="3rd-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  3. Fine Tuning</strong>
</h2>


We are going to create a model dictionary that defines the regressor instance and the hyperparameter search space.

In [None]:
models_testing = {
    "KNN": {
        "regressor": KNeighborsRegressor(),
        "use_log": True,
        "param_distributions": {
            'feature_selection__rfe_k': list(range(1,15)),
            'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
            'regressor__regressor__n_neighbors': np.arange(1, 100),
            'regressor__regressor__weights': ['uniform', 'distance'],
            'regressor__regressor__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
            'regressor__regressor__metric': ['minkowski', 'manhattan', 'euclidean', 'chebyshev'],
            'regressor__regressor__p': [1, 1.5, 2, 3, 4],
            'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },

    "GradientBoosting": {
        "regressor": GradientBoostingRegressor(),
        "use_log": True,
        "param_distributions": {
            'feature_selection__rfe_k': list(range(1,15)),
            'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
            'regressor__regressor__max_features': [0.4, 0.5, 0.6],
            'regressor__regressor__loss': ['absolute_error', 'huber'],
            'regressor__regressor__n_estimators': [700, 900, 1100],
            'regressor__regressor__max_depth': [6, 7, 8, 9], 
            'regressor__regressor__learning_rate': [0.02, 0.03, 0.04, 0.05],
            'regressor__regressor__subsample': [0.75, 0.8],
            'regressor__regressor__min_samples_split': [4, 5, 6],
            'regressor__regressor__min_samples_leaf': [3, 4],
            'regressor__regressor__criterion': ['squared_error'],
            'regressor__regressor__min_impurity_decrease': [0.0, 0.0001, 0.0005],
            'regressor__regressor__max_leaf_nodes': [None, 30, 40],
            'regressor__regressor__n_iter_no_change': [None, 10, 15],
            'regressor__regressor__validation_fraction': [0.15, 0.2],
            'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },

    "RandomForest": {
        "regressor": RandomForestRegressor(),
        "use_log": False,
        "param_distributions": {
            'feature_selection__rfe_k': list(range(1,15)),
            'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
            'regressor__n_estimators': [400,300],
            'regressor__min_samples_split': [24,25,26,30,35],  
            'regressor__min_samples_leaf': [2,3,4],   
            'regressor__max_features': [0.5, 'sqrt'],
            'regressor__max_depth': [18,19,20],         
            'regressor__min_impurity_decrease': [0.001, 0.0005],
            'regressor__ccp_alpha':[0.00005, 0.0001],
            'regressor__max_samples': [0.75, 0.80],
            'regressor__criterion': ['squared_error'], #default
            'regressor__bootstrap': [True], #default
            'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },

    "MLP_sgd": {
        "regressor" : MLPRegressor(),
        "use_log" : False,
        "target_scaler": StandardScaler(),
        "param_distributions" : { 
            'feature_selection__rfe_k': list(range(1,15)),
            'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
            'regressor__regressor__solver' : ['sgd'],
            'regressor__regressor__hidden_layer_sizes' : [(32,16),(100,50), (150,70), (200,100), (100,50,25), (200,100,50)],
            'regressor__regressor__max_iter' :  [300],
            'regressor__regressor__activation' : ['relu', 'tanh', 'logistic'],
            'regressor__regressor__learning_rate' :  ['constant','invscaling','adaptive'],
            'regressor__regressor__learning_rate_init' : [0.01, 0.001, 0.0001, 0.00001],
            'regressor__regressor__batch_size' : [100, 200, 300],
            'regressor__regressor__alpha': [1e-6, 1e-5, 1e-4, 1e-3],
            'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },

    "MLP_adam": {
            "regressor" : MLPRegressor(),
            "use_log" : False,
            "param_distributions" : { 
                'feature_selection__rfe_k': list(range(1,15)),
                'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
                'regressor__solver' : ['adam'],
                'regressor__hidden_layer_sizes' : [(32,16), (200, 100), (400, 200), (100,50,25)],
                'regressor__max_iter' :  [700],
                'regressor__activation' : ['relu', 'tanh', 'logistic'],
                'regressor__learning_rate_init' : [0.001, 0.01, 0.1],
                'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },


    "Huber": {
            "regressor" : HuberRegressor(),
            "use_log" : False,
            "param_distributions" : { 
                    'feature_selection__rfe_k': list(range(1,15)),
                    'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
                    'regressor__epsilon': [1.1, 1.2, 1.35, 1.5, 2.0, 2.5, 3.0],
                    'regressor__alpha': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0],
                    'regressor__fit_intercept': [True],
                    'regressor__max_iter': [500, 1000, 2000],
                    'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]
        }
    },

    "Decision Tree": {
            "regressor" : DecisionTreeRegressor(),
            "use_log" : False,
            "param_distributions" : { 
                    'feature_selection__rfe_k': list(range(1,15)),
                    'feature_selection__spearman_thr': [0.2, 0.25, 0.3],
                    'regressor__max_depth': [5, 10, 15, 20, None],
                    'regressor__min_samples_split': [5, 6, 7, 8, 9, 10],
                    'regressor__min_samples_leaf': [5, 6, 7],
                    'regressor__criterion': ['absolute_error'],
                    'regressor__min_impurity_decrease': [0.0001, 0.0005, 0.001],
                    'regressor__max_features': ['sqrt', 0.5, 0.2],  
                    'regressor__max_leaf_nodes': [None, 20, 50, 100],
                    'scaler__scaler': [StandardScaler(), MinMaxScaler(feature_range=(0,1)), MinMaxScaler(feature_range=(-1,1)), RobustScaler()]  

        }
     }

}
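
The dictionary above uses the prefix `regressor__regressor__` for models whose target is transformed and `regressor__` otherwise. The following sketch shows where that difference comes from, using a two-step pipeline that is a simplified stand-in for `build_pipeline`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-ins for build_pipeline with and without a target transformation
plain = Pipeline([("scaler", StandardScaler()), ("regressor", KNeighborsRegressor())])
wrapped = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", TransformedTargetRegressor(
        regressor=KNeighborsRegressor(), func=np.log, inverse_func=np.exp)),
])

plain_keys = plain.get_params().keys()
wrapped_keys = wrapped.get_params().keys()

# One level of nesting: pipeline step -> estimator parameter
has_plain = "regressor__n_neighbors" in plain_keys
# Two levels: pipeline step -> TransformedTargetRegressor.regressor -> estimator parameter
has_wrapped = "regressor__regressor__n_neighbors" in wrapped_keys
# The single-prefix name does not exist once the regressor is wrapped
wrong_depth = "regressor__n_neighbors" in wrapped_keys

print(has_plain, has_wrapped, wrong_depth)
```

Checking `get_params().keys()` on the built pipeline is a cheap way to validate a search space before launching a long search.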

### Randomized Search

This function is defined here because it uses build_pipeline, so it cannot be included in the functions.py file like the others.

In [None]:
def run_model_search(name, cfg, X, y, n_iter=2, cv=5, random_state=42):

    """
    Runs a RandomizedSearchCV for hyperparameter tuning of a regression model pipeline.

    Parameters
    ----------
    name : str
        The name of the model being evaluated.

    cfg : dict
        Configuration dictionary containing:    
        - 'regressor': The regression model to be used in the pipeline. 
        - 'use_log': Boolean indicating whether to apply log transformation to the target variable.
        - 'target_scaler': Scaler applied to the target variable. 
            If not specified, no scaling is applied.
        - 'param_distributions': Dictionary of hyperparameter distributions for RandomizedSearchCV.
    
    X : pandas.DataFrame
        Feature matrix used for training the model.

    y : pandas.Series
        Target variable (price).

    n_iter : int, default=2
        Number of parameter settings that are sampled in RandomizedSearchCV.

    cv : int, default=5
        Number of cross-validation folds.

    random_state : int, default=42
        Random seed for reproducibility.

    Returns
    -------
    dict
        A dictionary containing:
        - 'Model': The name of the model.
        - 'Use_Log': Boolean indicating if log transformation was used.
        - 'Target_Scaled': Boolean indicating if target variable was scaled.
        - 'Search_Object': The fitted RandomizedSearchCV object.
        - 'CV_Results': DataFrame with cross-validation results and additional metrics.

    """
    scoring = { 'R2': 'r2', 
    'MAE': 'neg_mean_absolute_error',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'RMSE': 'neg_root_mean_squared_error'}

    print(f"\n=== Running: {name} (use_log={cfg['use_log']}) ===")

    pipeline = build_pipeline(cfg['regressor'], use_log=cfg['use_log'], target_scaler=cfg.get('target_scaler', None))
    params = cfg['param_distributions']

    num_config = math.prod(len(v) for v in params.values())
    n_iter = min(n_iter, num_config)

    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=params,
        n_iter=n_iter,
        scoring=scoring,
        refit='MAE',
        cv=KFold(n_splits=cv, shuffle=True, random_state=random_state),
        verbose=2,
        return_train_score=True,
        random_state=random_state,
        n_jobs=-1
    )
    t0 = time.time()
    search.fit(X, y)
    t_elapsed = time.time() - t0

    cv_results_df = pd.DataFrame(search.cv_results_)
    
    cv_results_df['overfit_mae'] = (cv_results_df['mean_test_MAE'] - cv_results_df['mean_train_MAE']) / cv_results_df['mean_train_MAE'] * 100
    cv_results_df['overfit_R2'] = (cv_results_df['mean_test_R2'] - cv_results_df['mean_train_R2']) / cv_results_df['mean_train_R2'] * 100
    cv_results_df['Model'] = name
    cv_results_df['Use_Log'] = cfg['use_log']
    cv_results_df['Execution_Time_s'] = t_elapsed

    result = {
        'Model': name,
        'Use_Log': cfg['use_log'],
        'Target_Scaled': cfg.get('target_scaler', None) is not None,
        'Search_Object': search,
        'CV_Results': cv_results_df
    }
    return result
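
The `overfit_mae` and `overfit_R2` columns are relative gaps between test and train scores. A small worked example, with made-up scores and a hypothetical helper `relative_gap`, shows how the sign behaves when the MAE scorer returns negated values.

```python
def relative_gap(train_score, test_score):
    """Percentage gap between test and train scores, relative to the train score."""
    return (test_score - train_score) / train_score * 100


# neg_mean_absolute_error: both scores are negative (made-up values)
gap_mae = relative_gap(-1000.0, -1250.0)

# r2: both scores are positive (made-up values)
gap_r2 = relative_gap(0.95, 0.90)

print(gap_mae, round(gap_r2, 2))
```

With negated MAE, a test error larger than the train error gives a positive percentage, whereas for R2 a drop from train to test gives a negative one. The two columns therefore point in opposite directions when a model overfits.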

### Loop through all models

After collecting all CV results from every model, we are going to rank all hyperparameter combinations by lowest test MAE and display the top 50 best-performing configurations.

In [None]:
all_results = []
for model_name, cfg in models_testing.items():
    res = run_model_search(model_name, cfg, X, y, n_iter=2, cv=5, random_state=random_state)
    all_results.append(res)

all_cv_summary = pd.concat([r['CV_Results'] for r in all_results], ignore_index=True)

all_cv_summary['mean_train_MAE_pos'] = -all_cv_summary['mean_train_MAE']
all_cv_summary['mean_test_MAE_pos'] = -all_cv_summary['mean_test_MAE']

all_cv_summary.to_csv("all_cv_summary.csv", index=True)

top_50_cv_summary = all_cv_summary.sort_values('mean_test_MAE_pos', ascending=True).head(50)

top_50_cv_summary.to_csv("top_50_cv_summary.csv", index=True)

In [None]:

print("=== Top 50 hyperparameter combinations with lowest CV MAE ===")
display(top_50_cv_summary[[
    'Model', 'Use_Log', 'params',
    'mean_train_MAE_pos', 'mean_test_MAE_pos',
    'mean_train_R2', 'mean_test_R2',
    'mean_train_MAPE', 'mean_test_MAPE',
    'mean_train_MedAE', 'mean_test_MedAE',
    'mean_train_RMSE', 'mean_test_RMSE',
    'overfit_mae', 'overfit_R2'
]])


#
<h2 id="4th-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  4. Model Comparison</strong>
</h2>

Maybe build a table with the best results of each model? Analyze them in comparison with the best MAEs we have.

Later we can make a table with the best result for each model (right here in Markdown, no need to code it) with R2, MAE, overfit and so on, for train and validation.

### **Conclusions**

Justify the choice and then present the chosen ones.

The best parameters are:

In [None]:
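best_idx = all_cv_summary['mean_test_MAE_pos'].idxmin()  # assumes "best" = lowest mean CV MAE; adjust if another configuration is chosen
print(all_cv_summary.loc[best_idx, 'params'])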

#
<h2 id="5th-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  5. Ablation Study</strong>
</h2>


For the ablation study, we'll evaluate the performance of the chosen model with the best parameters removing or simplifying (when removing is not possible) one step of the pipeline at a time. We'll compare these results with the performance of the model using the full pipeline.

Defining the CV strategy and the scoring metrics:

In [None]:
CV = KFold(n_splits=2, shuffle=True, random_state=42)  
scoring = { 'R2': 'r2', 'MAE': 'neg_mean_absolute_error'}

Defining the full pipeline and saving the results:

In [None]:
pipeline = Pipeline([
    ('categorical treatment', Categorical_Correction()),  
    ('outlier treatment', Outlier_Treatment()),                   
    ('missing value treatment', Missing_Value_Treatment()),
    ('typecasting', Typecasting()), 
    ('feature engineering', Feature_Engineering()), 
    ('encoder', Encoder() ), 
    ('scaler', Scaler()), 
    ('feature selection', Feature_Selection()),
    ('regressor', LinearRegression())  # TODO: replace with the final regressor
])

In [None]:
# Full Pipeline with the Final Regressor
baseline_train_r2, baseline_test_r2, baseline_train_mae, baseline_test_mae, baseline_time = evaluate(pipeline, X, y, cv=CV, scoring=scoring)
baseline_pipeline_results = {
    'Step Tested': 'Full Pipeline',
    'Train R2': baseline_train_r2,
    'Test R2': baseline_test_r2,
    'Train MAE': baseline_train_mae,
    'Test MAE': baseline_test_mae,
    'Execution Time': baseline_time }

Testing each step individually:

In [None]:
steps_to_test = ['categorical treatment','outlier treatment','missing value treatment','typecasting', 'feature engineering','encoder','scaler','feature selection']

results = []

results.append(baseline_pipeline_results)

for step in steps_to_test:

    simplified_pipeline = Pipeline([
        ('categorical treatment', Categorical_Correction() if step != 'categorical treatment' else Simplified_Categorical_Correction()),
        ('outlier treatment', Outlier_Treatment() if step != 'outlier treatment' else Identity_Transformer()),                   
        ('missing value treatment', Missing_Value_Treatment() if step != 'missing value treatment' else Simplified_Missing_Value_Treatment()),
        ('typecasting', Typecasting() if step != 'typecasting' else Identity_Transformer()), 
        ('feature engineering', Feature_Engineering() if step != 'feature engineering' else Identity_Transformer()), 
        ('encoder', Encoder() if step != 'encoder' else Simplified_Encode()), 
        ('scaler', Scaler() if step != 'scaler' else Identity_Transformer()), 
        ('feature selection', Feature_Selection() if step != 'feature selection' else Identity_Transformer()),
        ('regressor', LinearRegression() )
    ])
    
    train_r2, test_r2, train_mae, test_mae, exec_time = evaluate(simplified_pipeline, X, y, cv=CV, scoring=scoring)
    
    results.append({
        'Step Tested': step,
        'Train R2': train_r2,
        'Test R2': test_r2,
        'Train MAE': train_mae,
        'Test MAE': test_mae,
        'Execution Time': exec_time
    })
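
The loop above swaps one step at a time for a pass-through or simplified version while keeping the step names fixed. `Identity_Transformer` lives in Classes and is not shown here, so the sketch below uses a hypothetical `PassThrough` class and a two-step pipeline purely to illustrate the idea.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class PassThrough(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for Identity_Transformer: returns the input unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


def make_pipeline(ablate_scaler):
    # Keep the step name fixed and only swap the object behind it
    scaler = PassThrough() if ablate_scaler else StandardScaler()
    return Pipeline([("scaler", scaler), ("regressor", LinearRegression())])


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

full = make_pipeline(False).fit(X_demo, y_demo)
ablated = make_pipeline(True).fit(X_demo, y_demo)

same_step_names = list(full.named_steps) == list(ablated.named_steps)
unchanged = np.array_equal(ablated.named_steps["scaler"].transform(X_demo), X_demo)
```

Keeping the names identical means that any parameters addressed by step name still resolve in the ablated pipeline.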

Finally, concatenate all results:

In [None]:
results = pd.DataFrame(results)

results['Overfit_Mae'] = results['Test MAE'] / results['Train MAE']

# Difference to the full pipeline, which is the first row
results['Delta'] = results['Test MAE'] - results['Test MAE'].iloc[0]

results.to_csv("ablation_study.csv", index=True)

results

### **Conclusions** 



#
<h2 id="6th-bullet" style="background-color: #f7d888ff; color: #da6919ff; padding: 5px; margin: 5px;">
  <strong>  6. Feature Importance</strong>
</h2>

We will now explore the importance of each feature for the final prediction. For this, we will use our best model and the same pipeline, while skipping the feature selection step. We will fit this pipeline on the whole training dataset and use the feature_importances_ attribute of GradientBoostingRegressor to obtain the feature importance values. The corresponding feature names will be obtained from the last transformer's feats_names_ attribute, which is defined in the Scaler class.

In [None]:
# best model parameters (with gradient boosting); TODO: update
params = {
    'regressor__validation_fraction': 0.15,
    'regressor__subsample': 0.8,
    'regressor__n_iter_no_change': None,
    'regressor__n_estimators': 900,
    'regressor__min_samples_split': 5,
    'regressor__min_samples_leaf': 4,
    'regressor__min_impurity_decrease': 0.0,
    'regressor__max_leaf_nodes': 40,
    'regressor__max_features': 0.6,
    'regressor__max_depth': 7,
    'regressor__loss': 'absolute_error',
    'regressor__learning_rate': 0.04,
    'regressor__criterion': 'squared_error',
}

# pipeline
importance_pipeline = Pipeline([
    ('categorical treatment', Categorical_Correction()),  
    ('outlier treatment', Outlier_Treatment()),                   
    ('missing value treatment', Missing_Value_Treatment()),
    ('typecasting', Typecasting()), 
    ('feature engineering', Feature_Engineering()), 
    ('encoder', Encoder() ), 
    ('scaler', Scaler(scaler=RobustScaler())),  # TODO: update scaler
    ('regressor', GradientBoostingRegressor())
])

importance_pipeline = importance_pipeline.set_params(**params)
importance_pipeline = importance_pipeline.fit(X,y)

feature_importance = importance_pipeline.named_steps['regressor'].feature_importances_ 
feature_names = importance_pipeline.named_steps['scaler'].feats_names_

# creating a dataframe of features and corresponding importance
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})
# sorting the dataframe by importance in descending order for clearer visualization
importance_df = importance_df.sort_values(by='Importance', ascending=False) 
# visualizing importance
plt.figure(figsize=(10, 6))
sns.barplot(x=importance_df['Importance'], y=importance_df['Feature'], color="skyblue")
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
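
As a sanity check on how to read `feature_importances_`, the sketch below fits a small `GradientBoostingRegressor` on synthetic data in which one feature clearly drives the target. The column names and coefficients are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
# The target depends strongly on "signal", weakly on "weak" and not at all on "noise"
y_demo = 5.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

gbr = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Same construction as above: pair names with importances and sort in descending order
importance_demo = (
    pd.DataFrame({"Feature": X_demo.columns, "Importance": gbr.feature_importances_})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_demo)
```

These impurity-based importances are normalised to sum to 1, so each bar reads as a share of the total and carries no information about the direction of the effect.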

Feature importance values given by GradientBoostingRegressor: to be completed (...)

We will now obtain SHAP values for each feature in order to see how it affects predictions (how much and in which direction).

In [None]:
#np.bool = bool

# getting the previous pipeline without the last step, i.e. keeping the transformers and excluding the regressor
preprocessor = Pipeline(importance_pipeline.steps[:-1])

# applying the new pipeline to train set to get pre-processed data
X_processed = preprocessor.transform(X)

# creating a shap TreeExplainer for the regressor
explainer = shap.TreeExplainer(importance_pipeline.named_steps['regressor'])

# getting shap values by passing processed data into the explainer
shap_values = explainer.shap_values(X_processed)

shap.summary_plot(shap_values, X_processed)

In [None]:
y_labels = ['low', 'medium-low', 'medium-high', 'high']
y_bins, bin_edges = pd.cut(y, bins=4, labels=y_labels, retbins=True)
bin_edges 
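
`pd.cut` with an integer `bins` argument produces equal-width intervals, not equal-sized groups. A minimal sketch with made-up prices shows what that implies for a skewed target.

```python
import pandas as pd

# Made-up, right-skewed prices
prices = pd.Series([1000, 5000, 9000, 20000, 41000])
labels = ["low", "medium-low", "medium-high", "high"]

bins_demo, edges = pd.cut(prices, bins=4, labels=labels, retbins=True)

# Number of observations that fall into each category
counts = {lab: int((bins_demo == lab).sum()) for lab in labels}
print(counts)
print(edges)
```

With a long right tail, most observations land in the lowest bin and the upper bins can be nearly empty. If balanced groups are preferred, `pd.qcut` is an alternative that splits by quantiles.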

In [None]:

# converting shap values with features to dataframe for convenience
# reuse the index of y so that the groupby below aligns each row with its price bin
shap_df = pd.DataFrame(shap_values, columns=feature_names, index=y.index)

# computing absolute mean shap value for each feature in each bin
mean_shap_per_bin = shap_df.abs().groupby(y_bins).mean()

# heatmap for visualization
plt.figure(figsize=(12,6))
sns.heatmap(mean_shap_per_bin.T, cmap="RdBu", annot=True, fmt=".2f", linewidths=0.5, cbar=True, center=0)
plt.xlabel("Price Categories")
plt.ylabel("Features")
plt.title("Mean Absolute SHAP Value per Feature and Price Category")
plt.show()
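
Grouping by a Series matches rows on index labels, so the SHAP frame and the price bins need to share the same index for the per-bin means to be meaningful. The sketch below illustrates this with a made-up 4x2 matrix and hypothetical car IDs.

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP matrix: 4 rows, 2 features
shap_demo = np.array([[1.0, -2.0], [-3.0, 4.0], [5.0, -6.0], [-7.0, 8.0]])
# Grouping labels indexed by made-up car IDs
groups = pd.Series(["low", "low", "high", "high"], index=[101, 102, 103, 104])

# Reusing the index of the grouping Series keeps rows and labels aligned
shap_demo_df = pd.DataFrame(shap_demo, columns=["f1", "f2"], index=groups.index)

# Mean absolute value per feature within each group
mean_abs = shap_demo_df.abs().groupby(groups).mean()
print(mean_abs)
```

Taking the absolute value first measures how strongly a feature moves predictions in each price category, whatever the direction.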


### **Conclusions**