# 2. End-to-End Pipeline Prototyping

**Objective:** This notebook serves as a workbench for building and testing the entire machine learning pipeline. Here, we prototype the feature extraction, model training, and prediction logic in an interactive environment before refactoring the final code into reusable `.py` scripts.

## 2.1. Setup, Imports, and Data Loading

We'll start by importing all necessary libraries and loading the development dataset. We will also perform the initial data cleaning steps identified previously.

In [47]:
import numpy as np
import pandas as pd
import random

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

from sklearn.linear_model import Ridge, LinearRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.svm import SVR

import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

import librosa
import librosa.display

In [48]:
DEVELOPMENT_PATH = Path("../data/development.csv")

if DEVELOPMENT_PATH.exists():
    dev_df = pd.read_csv(DEVELOPMENT_PATH, index_col=0)
else:
    print(f"Error: file not found at path {DEVELOPMENT_PATH}")
    # Initialize an empty DataFrame
    dev_df = pd.DataFrame()

In [49]:
class TempoCleaner(BaseEstimator, TransformerMixin):
    """
    A custom transformer to clean the 'tempo' column.
    It checks if the column is of object type, and if so, it strips
    the brackets and converts it to a numeric type.
    """
    def __init__(self, column_name='tempo', verbose=False):
        self.column_name = column_name
        self.verbose = verbose

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_ = X.copy()

        if self.column_name in X_.columns and X_[self.column_name].dtype == 'object':
            if self.verbose:
                print(f"Cleaning column: '{self.column_name}'")
            X_[self.column_name] = pd.to_numeric(
                X_[self.column_name].str.strip('[]'),
                errors='coerce'
            )
        
        return X_
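
The core operation inside `TempoCleaner.transform` can be tried in isolation. The following is a minimal sketch on made-up values (the series contents are hypothetical, not taken from the development set):

```python
import pandas as pd

# Hypothetical raw values: 'tempo' stored as bracketed strings.
raw_tempo = pd.Series(["[120.5]", "[98.0]", "not_a_number"])

# Strip the surrounding brackets, then coerce to numeric;
# entries that cannot be parsed become NaN instead of raising.
clean_tempo = pd.to_numeric(raw_tempo.str.strip("[]"), errors="coerce")

print(clean_tempo.tolist())
```

Because `errors='coerce'` turns unparseable entries into `NaN`, an imputation step may be needed downstream for estimators that reject missing values.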

## 2.2. Strategic Preprocessing

Here we apply the feature engineering and selection decisions from our EDA.

* **Feature Selection:** Drop redundant columns (e.g., `num_characters`).
* **Feature Engineering:** Group rare categories in the `ethnicity` feature into an "Other" class.
* **Data Splitting:** Separate the features (`X`) from the target variable (`y`).

In [50]:
class RedundantFeatureDropper(BaseEstimator, TransformerMixin):
    """
    A custom transformer to drop specified redundant columns from a DataFrame.
    """
    def __init__(self, columns_to_drop, verbose=False):
        if not isinstance(columns_to_drop, list):
            raise ValueError("'columns_to_drop' must be a list of column names.")
        self.columns_to_drop = columns_to_drop
        self.verbose = verbose

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        if self.verbose:
            print(f"Dropping redundant columns: {self.columns_to_drop}")
        X_ = X_.drop(columns=self.columns_to_drop, errors='ignore')
        
        return X_

In [51]:
class RareCategoryGrouper(BaseEstimator, TransformerMixin):
    """
    A custom transformer to group rare categories of categorical features into a single 'Other' category.
    It identifies the top N most frequent categories during fit and applies this
    grouping during transform.
    """
    def __init__(self, columns, n_top_categories=10, verbose=False):
        self.columns = columns
        self.n_top_categories = n_top_categories
        self.verbose = verbose

    def fit(self, X, y=None):
        # Learned state is created in fit (not __init__), following the scikit-learn
        # convention, so a refit never carries over categories from a previous fit.
        self.top_categories_ = {}
        for col in self.columns:
            top_cats = X[col].value_counts().nlargest(self.n_top_categories).index.tolist()
            self.top_categories_[col] = top_cats
        return self

    def transform(self, X, y=None):
        X_ = X.copy()
        
        if self.verbose:
            print(f"Grouping rare categories for columns: {self.columns}")
        
        for col in self.columns:
            # Replace categories outside the top N with 'Other'
            top_cats = self.top_categories_.get(col)
            if top_cats:
                X_[col] = np.where(X_[col].isin(top_cats), X_[col], 'Other')
        
        return X_
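
The fit/transform split used by `RareCategoryGrouper` matters because the set of frequent categories must be learned from training data only. A minimal sketch of the same logic on toy series (the values and the top-2 cutoff are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
train = pd.Series(["A", "A", "A", "B", "B", "C"])
new = pd.Series(["A", "C", "D"])

# "fit": learn the two most frequent categories from the training data.
top_cats = train.value_counts().nlargest(2).index.tolist()

# "transform": anything outside the learned set becomes 'Other',
# including categories never seen during fit.
grouped = np.where(new.isin(top_cats), new, "Other")

print(top_cats, grouped.tolist())
```

Here `C` is rare in the training series and `D` is unseen, so both are mapped to `'Other'`.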

## 2.3. Building the Preprocessing Pipeline with `ColumnTransformer`

This is the core of our preprocessing. We'll use `ColumnTransformer` to apply different transformations to different types of columns:
1.  **Numerical Features:** Apply `StandardScaler`.
2.  **Categorical Features:** Apply `OneHotEncoder`.

This ensures that each feature type receives the correct treatment.

In [52]:
REDUNDANT_COLS = ['num_characters']
CATEGORICAL_COLS_TO_GROUP = ['ethnicity']
NUMERICAL_FEATURES = [
    'mean_pitch', 'max_pitch', 'min_pitch', 'jitter', 'shimmer',
    'energy', 'zcr_mean', 'spectral_centroid_mean', 'tempo', 'hnr',
    'num_words', 'num_pauses', 'silence_duration'
]
CATEGORICAL_FEATURES = ['gender', 'ethnicity']

final_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), NUMERICAL_FEATURES),
        ('cat', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL_FEATURES)
    ],
    remainder='drop'
)

preprocessing_pipeline = Pipeline(steps=[
    ('tempo_cleaner', TempoCleaner(column_name='tempo')),
    ('feature_dropper', RedundantFeatureDropper(columns_to_drop=REDUNDANT_COLS)),
    ('rare_grouper', RareCategoryGrouper(columns=CATEGORICAL_COLS_TO_GROUP, n_top_categories=10)),
    ('final_preprocessor', final_preprocessor)
])

preprocessing_pipeline

[Output: scikit-learn estimator diagram for `preprocessing_pipeline`. Steps: `tempo_cleaner` (TempoCleaner, column_name='tempo') → `feature_dropper` (RedundantFeatureDropper, columns_to_drop=['num_characters']) → `rare_grouper` (RareCategoryGrouper, columns=['ethnicity'], n_top_categories=10) → `final_preprocessor` (ColumnTransformer, remainder='drop', with `num` = StandardScaler and `cat` = OneHotEncoder(handle_unknown='ignore')).]


## 2.4. Model Definition with Integrated Target Transformation

In this step, we define our final estimator. We use `TransformedTargetRegressor` to wrap our chosen regression model. This allows us to specify the target transformation (`func=np.log1p`) and its inverse (`inverse_func=np.expm1`) directly as part of the model definition, automating the entire process.
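
The effect of the wrapper can be checked on synthetic data. This is a sketch under stated assumptions: the data is generated so that `log1p(y)` is exactly linear in the features, and `LinearRegression` stands in for the regressor used in the notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: log1p(y) is an exact linear function of the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = np.expm1(1.0 + 0.5 * X_toy[:, 0])

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
ttr.fit(X_toy, y_toy)

# The inner regressor was trained on log1p(y); predict() applies expm1 for us.
pred_original_scale = ttr.predict(X_toy)
pred_log_scale = np.ravel(ttr.regressor_.predict(X_toy))

print(np.allclose(pred_original_scale, np.expm1(pred_log_scale)))
```

The fitted inner model is exposed as `regressor_`, which is useful when you want to inspect predictions on the transformed scale.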

In [53]:
base_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

final_model = TransformedTargetRegressor(
    regressor=base_regressor,
    func=np.log1p,           
    inverse_func=np.expm1    
)

final_model

[Output: scikit-learn estimator diagram for `final_model`: TransformedTargetRegressor (func=log1p, inverse_func=expm1, check_inverse=True) wrapping RandomForestRegressor (n_estimators=100, random_state=42).]

## 2.5. Building the Full End-to-End Pipeline

Now we assemble all the pieces. We combine the preprocessing pipeline from step 2.3 (the custom cleaning transformers followed by the `ColumnTransformer`) and `final_model` (the `TransformedTargetRegressor` from step 2.4) into a single scikit-learn `Pipeline`. This object encapsulates the entire workflow, from raw data to a final prediction in the original scale.
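
Nesting estimators this way determines how hyperparameters are addressed later: each level is joined with a double underscore. A small sketch, using `StandardScaler` as a stand-in for the real preprocessing pipeline and `Ridge` as an example regressor:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ("preprocessing", StandardScaler()),  # stand-in for the real preprocessing pipeline
    ("model", TransformedTargetRegressor(regressor=Ridge())),
])

# Path: pipeline step name -> wrapper parameter -> regressor hyperparameter.
pipe.set_params(model__regressor__alpha=10.0)
alpha = pipe.get_params()["model__regressor__alpha"]

print(alpha)
```

The same `model__regressor__...` naming is what a parameter grid has to use to reach the wrapped regressor.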

In [54]:
full_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing_pipeline),
    ('model', final_model)
])

full_pipeline

[Output: scikit-learn estimator diagram for `full_pipeline`: step `preprocessing` (the four-step preprocessing pipeline from 2.3: `tempo_cleaner`, `feature_dropper`, `rare_grouper`, `final_preprocessor`) followed by step `model` (TransformedTargetRegressor with func=log1p and inverse_func=expm1, wrapping RandomForestRegressor with n_estimators=100 and random_state=42).]


## 2.6. Validation Strategy: K-Fold Cross-Validation

To get a reliable estimate of our model's performance, we use 5-fold cross-validation: the development data is split into training and validation folds five times, and the pipeline is trained and evaluated on each split. This gives a more robust performance metric than a single train-test split. The folds are run inside `GridSearchCV`, which also compares several candidate regressors and hyperparameter settings, scored by RMSE.
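
Before looking at the full grid search, the scoring convention is worth seeing on its own. The sketch below uses synthetic data and a plain `Ridge` model, both assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    Ridge(alpha=1.0), X_toy, y_toy,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign to read it.
rmse_per_fold = -scores
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

This is why the report code negates `mean_test_score` before labelling it as RMSE.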

In [55]:
X = dev_df.drop(columns=['age'])
y = dev_df['age']

param_grid = [
    # Experiment 1: Linear Regression (no hyperparameters to tune)
    {
        'model__regressor': [LinearRegression()]
    },

    # Experiment 2: Ridge Regression
    {
        'model__regressor': [Ridge()],
        'model__regressor__alpha': [0.1, 1.0, 10.0]  # Try several regularization strengths
    },

    # Experiment 3: Support Vector Regressor (SVR)
    {
        'model__regressor': [SVR()],
        'model__regressor__kernel': ['linear', 'rbf'],
        'model__regressor__C': [1, 10]
    },

    # Experiment 4: Random Forest Regressor
    {
        'model__regressor': [RandomForestRegressor(random_state=42)],
        'model__regressor__n_estimators': [100, 200]
    },

    # Experiment 5: HistGradientBoosting Regressor
    {
        'model__regressor': [HistGradientBoostingRegressor(random_state=42)],
        'model__regressor__learning_rate': [0.05, 0.1],
        'model__regressor__max_leaf_nodes': [31, 50]
    }
]


cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    estimator=full_pipeline,
    param_grid=param_grid,
    cv=cv_strategy,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

grid_search.fit(X, y)

results = grid_search.cv_results_
results_df = pd.DataFrame(results)
report_df = results_df[[
    'rank_test_score',
    'param_model__regressor',
    'mean_test_score',
    'std_test_score'
]].sort_values(by='rank_test_score')

report_df = report_df.rename(columns={
    'rank_test_score': 'Rank',
    'param_model__regressor': 'Model',
    'mean_test_score': 'Mean RMSE',
    'std_test_score': 'Std Dev (RMSE)'
})
report_df['Mean RMSE'] = -report_df['Mean RMSE']
report_df['Model'] = report_df['Model'].astype(str).str.split('(').str[0]

print("--- Detailed Report of GridSearchCV Results ---")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print(report_df)

Fitting 5 folds for each of 14 candidates, totalling 70 fits


--- Detailed Report of GridSearchCV Results ---
    Rank                          Model  Mean RMSE  Std Dev (RMSE)
10     1  HistGradientBoostingRegressor  10.501515        0.558654
11     2  HistGradientBoostingRegressor  10.513985        0.605356
12     3  HistGradientBoostingRegressor  10.536389        0.634730
8      4          RandomForestRegressor  10.587516        0.549736
5      5                            SVR  10.592010        0.563219
9      6          RandomForestRegressor  10.596724        0.560225
13     7  HistGradientBoostingRegressor  10.619725        0.587582
3      8                          Ridge  10.915927        0.461787
2      9                          Ridge  10.921425        0.467503
1     10                          Ridge  10.922486        0.468361
0     11               LinearRegression  10.922612        0.468461
7     12                            SVR  11.105798        0.548895
4     13                            SVR  11.265678        0.554663
6 

## 2.7. Final Prototype Evaluation

After cross-validation, we will train the final pipeline on the entire development set. While we don't have the labels for the official `evaluation.csv`, this step simulates the final training process and makes our pipeline ready for generating the final predictions.
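
One possible way to get the final fitted model is to rely on `GridSearchCV`'s default `refit=True`, which retrains the best configuration on all the data passed to `fit`. The sketch below demonstrates this on synthetic data with a `Ridge` grid; it is an illustration, not the notebook's actual final training cell.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.2, size=80)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)  # refit=True by default
search.fit(X_toy, y_toy)

# best_estimator_ has already been refit on the full X_toy, y_toy.
final_estimator = search.best_estimator_
preds = final_estimator.predict(X_toy[:5])

print(search.best_params_, preds.shape)
```

Applied to this notebook, the analogous object would be `grid_search.best_estimator_`, assuming the grid search above completed successfully.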

## 2.8. Saving the Pipeline

Once we are satisfied with the prototype, we will save the trained pipeline object to a file using `joblib`. This allows us to easily load and use it later in our `predict.py` script without retraining.
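
A minimal sketch of the save/load round trip with `joblib`, using a small stand-in pipeline and a temporary directory. The filename `pipeline.joblib` is a hypothetical choice, not one fixed by this project.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline fitted on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = X_toy[:, 0] - X_toy[:, 1]

pipe = Pipeline(steps=[("scale", StandardScaler()), ("model", Ridge())])
pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.joblib"  # hypothetical filename
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Since `joblib` relies on pickling, custom transformers such as `TempoCleaner` need to be importable from the script that loads the file, which is one more reason to move them into a `.py` module.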