# Pandas output

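In this notebook, we review the pandas output API introduced in scikit-learn v1.2, which lets transformers return pandas DataFrames with feature names instead of NumPy arrays.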

<a href="https://colab.research.google.com/github/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/01-pandas-output.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

In [1]:
import sklearn
assert sklearn.__version__.startswith("1.2"), "Please install scikit-learn 1.2"

## Loading wine dataset

In [2]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

In [3]:
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

In [4]:
y

0      0
1      0
2      0
3      0
4      0
      ..
173    2
174    2
175    2
176    2
177    2
Name: target, Length: 178, dtype: int64

In [5]:
X

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

## Default Scaler

In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
scaler.fit_transform(X_train)

array([[ 0.60197404, -0.46117588,  0.97991346, ...,  0.65183482,
         0.38536664,  1.05318219],
       [-1.16265978, -0.8194594 ,  0.4424025 , ..., -0.116569  ,
         0.85117761, -1.14091833],
       [ 0.79941558, -0.94180012, -1.6359732 , ...,  1.23943774,
        -0.21145366, -0.38379914],
       ...,
       [ 0.17007066,  1.08556031, -0.81178973, ..., -1.60817641,
        -1.3468679 ,  0.23425735],
       [ 0.68835471,  0.22043668,  1.12324972, ..., -0.97537327,
        -1.17218879, -0.01296525],
       [-0.80479698, -0.97675461,  0.65740689, ...,  1.05863684,
        -0.44435915, -0.23546558]])

## Scaler with Pandas output

In [9]:
scaler.set_output(transform="pandas")
scaler.fit_transform(X_train)

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
34,0.601974,-0.461176,0.979913,-0.192449,0.797539,0.074112,0.490212,-0.641726,-0.113656,-0.357151,0.651835,0.385367,1.053182
114,-1.162660,-0.819459,0.442403,0.814123,-1.084979,0.419102,0.254981,0.529739,-0.954378,-0.915784,-0.116569,0.851178,-1.140918
62,0.799416,-0.941800,-1.635973,-0.480041,-0.360934,-0.336591,-0.235084,-0.390698,-1.475626,-0.529038,1.239438,-0.211454,-0.383799
139,-0.224812,0.552504,0.836577,1.245512,0.145898,0.024827,-1.401439,1.366499,-1.341111,-0.047754,-0.297370,-0.662708,-0.507410
122,-0.743096,1.837082,1.266586,1.964492,0.218302,-0.172310,0.098160,0.529739,0.172190,-1.268153,-0.161769,0.749281,-1.202724
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48,1.330040,-0.268926,0.084062,-0.249967,0.290707,0.731236,0.872463,-0.390698,1.298759,0.502285,0.516234,0.210688,0.945022
80,-1.261381,-1.230175,-1.349301,-0.192449,-0.940170,0.189108,0.225577,-0.558050,-0.298615,-1.087671,1.917441,0.749281,-1.471579
145,0.170071,1.085560,-0.811790,0.382735,0.218302,-1.322277,-1.450445,0.529739,-0.517203,-0.443094,-1.608176,-1.346868,0.234257
168,0.688355,0.220437,1.123250,1.389308,0.435516,-1.240136,-1.166207,0.195034,-0.113656,1.559392,-0.975373,-1.172189,-0.012965


## In an ML Pipeline

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import make_pipeline

In [12]:
log_reg = make_pipeline(
    StandardScaler(),
    SelectPercentile(percentile=50),
    LogisticRegression()
)

In [13]:
log_reg.set_output(transform="pandas")
log_reg.fit(X_train, y_train)

In [14]:
log_reg[-1]

In [15]:
log_reg[-1].feature_names_in_

array(['alcohol', 'flavanoids', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'], dtype=object)

## Exercise 1
**~ 4 minutes**

1. The Wisconsin cancer data set is loaded into `X` and `y`.
1. How many features are there in the dataset?
1. Which feature(s) of the dataset have missing values?
    - **Hint**: Use pandas' `isna().sum()`
1. Split the data set into a training and test set.
    - **Hint**: Remember to use `stratify=y` and `random_state=0`
1. Use a `SimpleImputer` with `add_indicator=True` and `set_output(transform="pandas")`
1. Run the imputer's `fit_transform` on the training set.
1. How many output features are there in the transformed data?
1. Are there any new features added to the transformed data?

In [16]:
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer

cancer = fetch_openml(data_id=15, as_frame=True, parser="pandas")
X, y = cancer.data, cancer.target

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/01-ex01-solutions.py). 

In [None]:
# %load solutions/01-ex01-solutions.py

## Exercise 2
**~ 5 minutes**

1. Build a pipeline named `pipe` with a `StandardScaler`, a `KNNImputer(add_indicator=True)`, and a `LogisticRegression`, and configure it for pandas output.
1. Train the pipeline on the Wisconsin cancer training set and evaluate the performance of the model on the test set.
1. Create a pandas Series whose values are the coefficients of `LogisticRegression` and whose index is `feature_names_in_`.
    - **Hint**: The logistic regression estimator is the final step of the pipeline (`pipe[-1]`).
    - **Hint**: The coefficients are stored as `coef_` in the logistic regression estimator. (Use `ravel` to flatten the `coef_` array.)
1. Which features have a negative coefficient, that is, lower the predicted odds of the positive class (`pipe[-1].classes_[1]`)?

In [None]:
from sklearn.impute import KNNImputer
import pandas as pd

**If you are running locally**, you can uncomment the following cell to load the solution into the cell. On **Google Colab**, [see solution here](https://github.com/thomasjpfan/ml-workshop-intermediate-v2/blob/main/notebooks/solutions/01-ex02-solutions.py). 

In [None]:
# %load solutions/01-ex02-solutions.py

## Global configuration

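Instead of calling `set_output` on each estimator, `sklearn.set_config(transform_output="pandas")` makes every transformer output pandas DataFrames by default.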

In [None]:
import sklearn
sklearn.set_config(transform_output="pandas")

In [None]:
cancer = fetch_openml(data_id=15, as_frame=True, parser="pandas")
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

In [None]:
imputer = SimpleImputer(add_indicator=True)

In [None]:
imputer.fit_transform(X_train)