# Introduction

In this lab, we will learn how to use the **set_output** API in scikit-learn to configure transformers to output pandas DataFrames. This feature is useful when working with heterogeneous data and pipelines, because column names are kept through each transformation step.

# Load the Iris dataset

First, we will load the Iris dataset as a DataFrame to demonstrate the **set_output** API.

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Configure a transformer to output DataFrames

To configure a transformer such as **preprocessing.StandardScaler** to return DataFrames, call **set_output** with **transform='pandas'**.

In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform='pandas')

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

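|    | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---:|------------------:|-----------------:|------------------:|-----------------:|
| 39 | -0.894264 | 0.798301 | -1.271411 | -1.327605 |
| 12 | -1.244466 | -0.086944 | -1.327407 | -1.459074 |
| 48 | -0.660797 | 1.462234 | -1.271411 | -1.327605 |
| 23 | -0.894264 | 0.576989 | -1.159419 | -0.933197 |
| 81 | -0.427329 | -1.41481 | -0.039497 | -0.275851 |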


# Configure **transform** after **fit**

**set_output** can be called after **fit** to configure **transform** after the fact.

In [3]:
scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f'Default output type: {type(X_test_np).__name__}')

scaler2.set_output(transform='pandas')
X_test_df = scaler2.transform(X_test)
print(f'Configured pandas output type: {type(X_test_df).__name__}')

Default output type: ndarray
Configured pandas output type: DataFrame


# Configure a pipeline to output DataFrames

In a **pipeline.Pipeline**, **set_output** configures all steps to output DataFrames.

In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile

clf = make_pipeline(
    StandardScaler(),
    SelectPercentile(percentile=75),
    LogisticRegression()
)
clf.set_output(transform='pandas')
clf.fit(X_train, y_train)

# Load the Titanic dataset

Next, we will load the Titanic dataset to demonstrate **set_output** with **compose.ColumnTransformer** and heterogeneous data.

In [5]:
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    'titanic',
    version=1,
    as_frame=True,
    return_X_y=True,
    parser='pandas'
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Configure set_output globally

The **set_output** API can be configured globally by using **set_config** and setting **transform_output** to **"pandas"**.

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn import set_config, config_context

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)

# Configure set_output with config_context

When configuring the output type with **config_context**, what counts is the configuration in effect when **transform** or **fit_transform** is called, not the configuration in effect when the transformer was created or fitted.

In [8]:
from sklearn import config_context

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

with config_context(transform_output="pandas"):
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()

|      | age | fare |
|-----:|----:|-----:|
| 1139 | 0.543681 | -0.479685 |
| 263 | 0.611666 | 0.394139 |
| 811 | 1.223536 | 0.002318 |
| 841 | -0.884015 | -0.482643 |
| 1055 | NaN | -0.479685 |


# Summary

In this lab, we learned how to use the **set_output** API in scikit-learn to configure transformers to output pandas DataFrames. We demonstrated how to configure a single transformer, how to configure an entire pipeline, and how to set the output type globally with **set_config**. We also learned how to set the output type temporarily with **config_context**.