# Taking Things Further with scikit-learn Pipelines

The pipelines we presented in this step are extremely simple, as they apply the same transformations to all features. Consequently, the models were limited to the set of features to which these transformations could be applied.
To support a larger set of features in our models, we need to apply different transformations to different features.

In this notebook, we shall briefly explore ways of doing this.

## Differentiating between Categorical and Numerical Features

As a first step, let us add support for categorical features, which we shall encode using one-hot encoding, alongside numerical features, to which we shall apply standard scaling.

We use an indicator function `is_categorical`, which lets us distinguish between the two types of features.

In [1]:
from dataclasses import dataclass
from enum import Enum

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MaxAbsScaler

from songpop.data import COLS_MUSICAL_CATEGORIES


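def is_categorical(feature: str) -> bool:
    """Return True if the given feature (column name) is categorical."""
    return feature in COLS_MUSICAL_CATEGORIES


def create_random_forest_pipeline(features: list[str]) -> Pipeline:
    # Route each column to the transformer that matches its type
    categorical_features = [feature for feature in features if is_categorical(feature)]
    numerical_features = [feature for feature in features if not is_categorical(feature)]
    return Pipeline([
        ('preprocess', ColumnTransformer([
            ("cat", OneHotEncoder(), categorical_features),
            ("num", StandardScaler(), numerical_features)])),
        ('classifier', RandomForestClassifier())])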

## Adding Support for Different Scaling Transformations of Numerical Features

In practice, however, it is rarely reasonable to apply the same scaling transformation to all numerical features. How can we address this?

Frequently, the way in which a feature should be transformed follows from its semantics: once the nature of a feature has been analyzed, the choice of transformation becomes clear. What is needed, therefore, is an explicit representation of a feature that includes information on how to transform it. A very naive attempt could look like this:

In [2]:
class FeatureTransformationType(Enum):
    NONE = 0
    ONE_HOT_ENCODING = 1
    STANDARD_SCALER = 2
    MAX_ABS_SCALER = 3


@dataclass
class Feature:
    col_name: str
    feature_transformation_type: FeatureTransformationType


def create_random_forest_pipeline(features: list[Feature]) -> Pipeline:
    # Group the column names by the transformation declared for each feature
    features_none = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.NONE]
    features_one_hot = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.ONE_HOT_ENCODING]
    features_num_std = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.STANDARD_SCALER]
    features_num_abs = [f.col_name for f in features if f.feature_transformation_type == FeatureTransformationType.MAX_ABS_SCALER]
    return Pipeline([
        ('preprocess', ColumnTransformer([
            ("id", "passthrough", features_none),
            ("one_hot", OneHotEncoder(), features_one_hot),
            ("num_std", StandardScaler(), features_num_std),
            ("num_abs", MaxAbsScaler(), features_num_abs)])),
        ('classifier', RandomForestClassifier())])
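
As a usage illustration, the sketch below declares a handful of features with their transformation types and fits the resulting pipeline. It repeats the definitions from the cell above in compact form so that it runs on its own; the helper `columns_of_type`, the column names and the toy data are hypothetical and serve only to show how the feature declarations drive the preprocessing.

```python
from dataclasses import dataclass
from enum import Enum

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, StandardScaler


class FeatureTransformationType(Enum):
    NONE = 0
    ONE_HOT_ENCODING = 1
    STANDARD_SCALER = 2
    MAX_ABS_SCALER = 3


@dataclass
class Feature:
    col_name: str
    feature_transformation_type: FeatureTransformationType


def columns_of_type(features: list[Feature], t: FeatureTransformationType) -> list[str]:
    # Hypothetical helper: names of all columns declared with transformation type t
    return [f.col_name for f in features if f.feature_transformation_type == t]


def create_random_forest_pipeline(features: list[Feature]) -> Pipeline:
    T = FeatureTransformationType
    return Pipeline([
        ("preprocess", ColumnTransformer([
            ("id", "passthrough", columns_of_type(features, T.NONE)),
            ("one_hot", OneHotEncoder(), columns_of_type(features, T.ONE_HOT_ENCODING)),
            ("num_std", StandardScaler(), columns_of_type(features, T.STANDARD_SCALER)),
            ("num_abs", MaxAbsScaler(), columns_of_type(features, T.MAX_ABS_SCALER))])),
        ("classifier", RandomForestClassifier(n_estimators=10, random_state=0))])


# The transformation is now declared per feature (names are made up)
features = [
    Feature("year", FeatureTransformationType.NONE),
    Feature("genre", FeatureTransformationType.ONE_HOT_ENCODING),
    Feature("tempo", FeatureTransformationType.STANDARD_SCALER),
    Feature("loudness", FeatureTransformationType.MAX_ABS_SCALER),
]

# Made-up toy data, for illustration only
df = pd.DataFrame({
    "year": [1999, 2005, 2010, 2015, 2020, 2021],
    "genre": ["rock", "pop", "jazz", "pop", "rock", "jazz"],
    "tempo": [120.0, 100.0, 90.0, 128.0, 140.0, 85.0],
    "loudness": [-5.0, -7.5, -12.0, -6.0, -4.0, -14.0],
})
labels = [1, 1, 0, 1, 0, 0]

cols = [f.col_name for f in features]
pipeline = create_random_forest_pipeline(features)
pipeline.fit(df[cols], labels)

transformed = pipeline.named_steps["preprocess"].transform(df[cols])
predictions = pipeline.predict(df[cols])
```

Changing how a feature is scaled now only requires changing its declaration, but supporting a new kind of transformation still means extending both the enum and the pipeline factory, which is why this attempt is naive.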

A more sophisticated approach would represent each feature by an object that is itself a transformer. This adds flexibility and allows more fine-grained control over the transformation applied to each feature.
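
One possible reading of this idea is sketched below. It is a simplification: each feature merely carries its own transformer instance rather than being a transformer itself, and the names `TransformedFeature` and `create_preprocessor`, as well as the toy data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, StandardScaler


@dataclass
class TransformedFeature:
    """Hypothetical feature representation that carries its own transformer."""
    col_name: str
    # Either a scikit-learn transformer or the special string "passthrough"
    transformer: Union[BaseEstimator, str] = "passthrough"


def create_preprocessor(features: list[TransformedFeature]) -> ColumnTransformer:
    # One ColumnTransformer entry per feature, named after its column;
    # sparse_threshold=0 forces a dense output array
    return ColumnTransformer(
        [(f.col_name, f.transformer, [f.col_name]) for f in features],
        sparse_threshold=0)


features = [
    TransformedFeature("genre", OneHotEncoder()),
    TransformedFeature("tempo", StandardScaler()),
    TransformedFeature("loudness", MaxAbsScaler()),
    TransformedFeature("year"),  # left untransformed
]

# Made-up toy data, for illustration only
df = pd.DataFrame({
    "genre": ["rock", "pop", "jazz", "pop", "rock", "jazz"],
    "tempo": [120.0, 100.0, 90.0, 128.0, 140.0, 85.0],
    "loudness": [-5.0, -7.5, -12.0, -6.0, -4.0, -14.0],
    "year": [1999, 2005, 2010, 2015, 2020, 2021],
})

preprocessor = create_preprocessor(features)
transformed = preprocessor.fit_transform(df)
```

With this design, any transformer, including a parametrized or custom one, can be attached to a feature without touching the code that assembles the pipeline.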

In the following, however, we will use the concepts of the library sensAI instead. sensAI builds upon scikit-learn concepts, using a strictly object-oriented design and a higher level of abstraction (see the subsequent steps in the journey).