# Feature names out

Being able to identify columns after applying different sklearn transformers is important. By default, sklearn transformers return `numpy.array` outputs, which lack the column labels of a `pandas.DataFrame`.

Fortunately, sklearn has a solution to this problem: the `get_feature_names_out` method, available across sklearn transformers since `sklearn==1.1`, returns the names of the output features.

In [1]:
import numpy as np
import pandas as pd

from IPython.display import HTML
header_template = "<p style='font-size:17px'>{}</p>"

from sklearn.preprocessing import (
    FunctionTransformer,
    StandardScaler
)

# a small frame with a few numeric columns,
# reused in the examples below
input_frame = pd.DataFrame({
    f"feature{n}" : np.random.normal(0, 10, 5)
    for n in range(3)
})

## Defined names

Some transformers define their output column names themselves. For these you can simply call `get_feature_names_out` on the fitted object, and the names of the output features will be returned. Some examples of such transformers:

- `StandardScaler`;
- `OneHotEncoder`;
- `ColumnTransformer`;
- `PolynomialFeatures`;
- `CountVectorizer`;
- `TfidfVectorizer`.

The following cell shows this for `StandardScaler`, which simply keeps the names of the input columns:

In [2]:
display(HTML(header_template.format("Input dataframe")))
display(input_frame)

my_scaler = StandardScaler()
display(HTML(header_template.format("transform result")))
display(my_scaler.fit_transform(input_frame))

display(HTML(header_template.format(".get_feature_names_out result")))
display(my_scaler.get_feature_names_out())

|   | feature0 | feature1 | feature2 |
|---|---|---|---|
| 0 | -5.26324 | 14.943478 | -0.875921 |
| 1 | -1.549951 | -10.348717 | 9.48864 |
| 2 | -0.907326 | -6.827048 | 12.779967 |
| 3 | -6.299696 | -6.115877 | 3.503167 |
| 4 | -10.672533 | -16.60705 | 0.652093 |


array([[-0.09178503,  1.87419164, -1.14646023],
       [ 0.95790347, -0.5037146 ,  0.83876015],
       [ 1.13956346, -0.17261638,  1.46917855],
       [-0.38477476, -0.105754  , -0.30769289],
       [-1.62090714, -1.09210666, -0.85378558]])

array(['feature0', 'feature1', 'feature2'], dtype=object)

## FunctionTransformer

The `get_feature_names_out` method of this transformer behaves in a slightly tricky way.

The method is only available if you specify the `feature_names_out` argument. It can be `"one-to-one"` or a callable. Both options are shown in the examples below.

### `"one-to-one"`

`"one-to-one"` simply causes `.get_feature_names_out` to return the feature names as they were in the input. This option suits transformers that leave the columns unchanged, such as the `dummy` transformer in the following example, which simply returns its input.

In [3]:
display(HTML(header_template.format("Input dataframe")))
display(input_frame)

my_scaler = FunctionTransformer(
    lambda X : X,
    feature_names_out = "one-to-one"
)
display(HTML(header_template.format("transform result")))
display(my_scaler.fit_transform(input_frame))

display(HTML(header_template.format(".get_feature_names_out result")))
display(my_scaler.get_feature_names_out())

|   | feature0 | feature1 | feature2 |
|---|---|---|---|
| 0 | -5.26324 | 14.943478 | -0.875921 |
| 1 | -1.549951 | -10.348717 | 9.48864 |
| 2 | -0.907326 | -6.827048 | 12.779967 |
| 3 | -6.299696 | -6.115877 | 3.503167 |
| 4 | -10.672533 | -16.60705 | 0.652093 |


|   | feature0 | feature1 | feature2 |
|---|---|---|---|
| 0 | -5.26324 | 14.943478 | -0.875921 |
| 1 | -1.549951 | -10.348717 | 9.48864 |
| 2 | -0.907326 | -6.827048 | 12.779967 |
| 3 | -6.299696 | -6.115877 | 3.503167 |
| 4 | -10.672533 | -16.60705 | 0.652093 |


array(['feature0', 'feature1', 'feature2'], dtype=object)

**Note**: `.get_feature_names_out` may not look useful in this case, where the transformer already returns a dataframe. It becomes valuable in more complex pipelines, where it keeps track of the feature names in the usual scikit-learn style.

### Callable

If you pass a callable as the `feature_names_out` argument, it will be called by `.get_feature_names_out`. It receives two arguments: the `FunctionTransformer` instance that called the method, and an array with the names of the input features.

#### Expected inputs

Let's check on an example what is actually passed to the function specified in `feature_names_out`. To do that, we substitute a function that simply returns its own arguments.

In [4]:
my_transformer = FunctionTransformer(
    lambda X: np.array(X),
    # a function that simply returns its
    # inputs, so we can check what they
    # actually are
    feature_names_out=(
        lambda transformer, input_features: \
        (transformer, input_features)
    )
)
my_transformer.fit_transform(input_frame)
features_output = my_transformer.get_feature_names_out()

display(HTML(header_template.format("First argument - transformer itself")))
display(features_output[0])
print(
    "Check if first input of the feature_names_out "
    "is really the initial transformer - ",
    features_output[0] is my_transformer
)
display(HTML(header_template.format("Second argument - input feature names")))
features_output[1]

Check if first input of the feature_names_out is really the initial transformer -  True


array(['feature0', 'feature1', 'feature2'], dtype=object)

#### Real world example

Here is an example of how the tools we are considering might be used in the real world.

Suppose you need to build a transformer that returns the squares and cubes of each feature of the input array, and the resulting features should be named `<inputname> square` for squares and `<inputname> cubic` for cubes.

This can be achieved with the following code:

In [5]:
my_transformer = FunctionTransformer(
    # iterates over all input features
    # and returns square and cube of them
    lambda X : np.concatenate(
        [
            np.array(X[[col]]**power)
            for col in X
            for power in range(2,4)
        ], 
        axis = 1
    ),
    # iterates over input feature names
    # and for each creates a pair of square
    # and cube names
    feature_names_out = (
        lambda transformer, input_features: [
            f"{feature} {power_str}"
            for feature in input_features
            for power_str in ["square", "cubic"]
        ]
    )
)

display(HTML(header_template.format("Input dataframe")))
display(input_frame)

display(HTML(header_template.format("transform output")))
display(my_transformer.fit_transform(input_frame))

display(HTML(header_template.format(".get_feature_names_out")))
display(my_transformer.get_feature_names_out())

|   | feature0 | feature1 | feature2 |
|---|---|---|---|
| 0 | -5.26324 | 14.943478 | -0.875921 |
| 1 | -1.549951 | -10.348717 | 9.48864 |
| 2 | -0.907326 | -6.827048 | 12.779967 |
| 3 | -6.299696 | -6.115877 | 3.503167 |
| 4 | -10.672533 | -16.60705 | 0.652093 |


array([[ 2.77016967e+01, -1.45800682e+02,  2.23307526e+02,
         3.33699103e+03,  7.67237912e-01, -6.72039936e-01],
       [ 2.40234719e+00, -3.72351972e+00,  1.07095953e+02,
        -1.10830576e+03,  9.00342844e+01,  8.54302891e+02],
       [ 8.23239835e-01, -7.46946619e-01,  4.66085789e+01,
        -3.18198986e+02,  1.63327556e+02,  2.08732077e+03],
       [ 3.96861679e+01, -2.50010787e+02,  3.74039550e+01,
        -2.28757999e+02,  1.22721794e+01,  4.29914946e+01],
       [ 1.13902970e+02, -1.21563325e+03,  2.75794094e+02,
        -4.58012619e+03,  4.25224871e-01,  2.77286029e-01]])

array(['feature0 square', 'feature0 cubic', 'feature1 square',
       'feature1 cubic', 'feature2 square', 'feature2 cubic'],
      dtype=object)

#### Pandas output

Ideally, if the function passed to `FunctionTransformer` returns a `pandas.DataFrame`, the names returned by `get_feature_names_out` should be applied to the result automatically. But if you try to adapt the previous example this way, the result is an error, as the following cell shows.

In [14]:
my_transformer = FunctionTransformer(
    # iterates over all input features
    # and returns square and cube of them
    lambda X : pd.concat(
        [
            X[col]**power
            for col in X
            for power in range(2,4)
        ],
        axis=1
    ),
    # iterates over input feature names
    # and for each creates a pair of square
    # and cube names
    feature_names_out = (
        lambda transformer, input_features: [
            f"{feature} {power_str}"
            for feature in input_features
            for power_str in ["square", "cubic"]
        ]
    )
)
my_transformer.set_output(transform='pandas')

try:
    my_transformer.fit_transform(input_frame)
except Exception as e:
    print(e)

The output generated by `func` have different column names than the ones provided by `get_feature_names_out`. Got output with columns names: ['feature0', 'feature0', 'feature1', 'feature1', 'feature2', 'feature2'] and `get_feature_names_out` returned: ['feature0 square', 'feature0 cubic', 'feature1 square', 'feature1 cubic', 'feature2 square', 'feature2 cubic']. The column names can be overridden by setting `set_output(transform='pandas')` or `set_output(transform='polars')` such that the column names are set to the names provided by `get_feature_names_out`.


I think this is a `sklearn` bug, so I've opened a corresponding [issue](https://github.com/scikit-learn/scikit-learn/issues/28780). Check it out; there may already be a solution.