# Chapter 2: Pre-Model Workflow and Data Preprocessing

This notebook provides "recipes" for using the scikit-learn Python library to preprocess data before modeling. Each recipe includes explanations, code examples, visualizations, best practices, and common pitfalls.

## Handling Missing Data

In this section, we will explore different strategies for handling missing data using scikit-learn's imputation tools. 

### Getting ready

To begin, we will create a toy dataset of 20 samples and ten quantitative features drawn uniformly at random, then set roughly 20% of the values in each column to missing. We store the dataset in a pandas `DataFrame` for better readability.

In [1]:
# Load libraries
import numpy as np
import pandas as pd

# Create a larger sample dataset with missing values
np.random.seed(2024)  # For reproducibility
n_samples = 20
n_features = 10

# Generate random data
data = {
    f"Feature{i+1}": np.random.uniform(0, 100, n_samples) for i in range(n_features)
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Randomly introduce missing values (approximately 20% of the data)
for column in df.columns:
    mask = np.random.random(n_samples) < 0.2
    df.loc[mask, column] = np.nan

# Display the DataFrame with missing values
display(df)

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,58.801452,,42.009814,34.680397,49.962259,,,89.954588,41.152421,
1,69.910875,9.554215,6.436369,31.287816,37.966499,,82.005649,,75.797616,91.382492
2,,96.090974,59.643269,84.710402,,84.109027,,70.288128,1.778343,4.11769
3,,25.176729,83.732372,88.02311,16.886931,97.205554,84.696823,,,80.077973
4,20.501895,,89.248639,67.655865,58.635861,78.225721,60.911562,,65.114243,99.119187
5,10.606287,76.825393,20.052744,5.367515,,19.703051,34.423301,93.399494,72.20668,12.640276
6,72.724014,79.79234,50.239523,55.921377,6.191019,,22.966899,21.049022,57.358544,14.302591
7,,,89.538184,69.451294,,47.885551,,33.620401,99.685711,6.683138
8,47.38457,,25.592093,82.41973,73.41454,61.6637,29.172571,65.946718,61.005155,34.052747
9,44.829582,38.165095,,31.142866,28.865545,,41.004459,41.460336,50.236236,


### How to do it...

The `SimpleImputer` class provides basic strategies for imputing missing values. It can replace missing values using a constant, the mean, median, or most frequent value of each column.

In [2]:
# Load libraries
from sklearn.impute import SimpleImputer

# Initialize the SimpleImputer; strategy can be "mean", "median", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="mean")

# Fit and transform the data
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
imputed_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,58.801452,53.864661,42.009814,34.680397,49.962259,57.822964,51.145011,89.954588,41.152421,48.400044
1,69.910875,9.554215,6.436369,31.287816,37.966499,57.822964,82.005649,50.715003,75.797616,91.382492
2,52.558605,96.090974,59.643269,84.710402,46.055883,84.109027,51.145011,70.288128,1.778343,4.11769
3,52.558605,25.176729,83.732372,88.02311,16.886931,97.205554,84.696823,50.715003,60.616723,80.077973
4,20.501895,53.864661,89.248639,67.655865,58.635861,78.225721,60.911562,50.715003,65.114243,99.119187
5,10.606287,76.825393,20.052744,5.367515,46.055883,19.703051,34.423301,93.399494,72.20668,12.640276
6,72.724014,79.79234,50.239523,55.921377,6.191019,57.822964,22.966899,21.049022,57.358544,14.302591
7,52.558605,53.864661,89.538184,69.451294,46.055883,47.885551,51.145011,33.620401,99.685711,6.683138
8,47.38457,53.864661,25.592093,82.41973,73.41454,61.6637,29.172571,65.946718,61.005155,34.052747
9,44.829582,38.165095,52.897507,31.142866,28.865545,57.822964,41.004459,41.460336,50.236236,48.400044
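
To make the fill values easy to verify by hand, here is a minimal sketch on a hypothetical 4x2 array (`X_toy` is not part of the recipe's data) using the `median` strategy. After fitting, the per-column fill values are available in the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array, used only for illustration
X_toy = np.array(
    [
        [1.0, 10.0],
        [2.0, np.nan],
        [np.nan, 30.0],
        [9.0, 50.0],
    ]
)

# Replace each missing value with the median of the observed values in its column
median_imputer = SimpleImputer(strategy="median")
X_filled = median_imputer.fit_transform(X_toy)

# Fill values learned during fit, one per column
print(median_imputer.statistics_)
print(X_filled)
```

The median is less sensitive than the mean to outliers such as the 9.0 in the first column, which is one reason to prefer it for skewed features.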


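The `KNNImputer` class uses the k-Nearest Neighbors approach to impute missing values. Each missing value is replaced by the average of that feature over the `n_neighbors` nearest samples that have a value for it, with distances computed on the features that both samples have observed.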

In [3]:
# Load libraries
from sklearn.impute import KNNImputer

# Initialize the KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data using the previously defined DataFrame
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=df.columns)
knn_imputed_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,58.801452,48.954043,42.009814,34.680397,49.962259,93.910386,52.549271,89.954588,41.152421,59.833678
1,69.910875,9.554215,6.436369,31.287816,37.966499,86.793468,82.005649,45.416191,75.797616,91.382492
2,67.752344,96.090974,59.643269,84.710402,73.338309,84.109027,59.012876,70.288128,1.778343,4.11769
3,47.880864,25.176729,83.732372,88.02311,16.886931,97.205554,84.696823,64.510281,39.726833,80.077973
4,20.501895,31.670912,89.248639,67.655865,58.635861,78.225721,60.911562,25.903612,65.114243,99.119187
5,10.606287,76.825393,20.052744,5.367515,59.716773,19.703051,34.423301,93.399494,72.20668,12.640276
6,72.724014,79.79234,50.239523,55.921377,6.191019,40.482911,22.966899,21.049022,57.358544,14.302591
7,47.629715,77.843853,89.538184,69.451294,32.753328,47.885551,55.683434,33.620401,99.685711,6.683138
8,47.38457,35.180639,25.592093,82.41973,73.41454,61.6637,29.172571,65.946718,61.005155,34.052747
9,44.829582,38.165095,29.188271,31.142866,28.865545,59.674651,41.004459,41.460336,50.236236,35.00082
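
The neighbor averaging can be checked on a small example. The sketch below uses a hypothetical 4x3 array (`X_small` is an illustration, not the recipe's DataFrame) with the same `n_neighbors=2` setting and default uniform weights.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array with two missing entries
X_small = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

knn_demo = KNNImputer(n_neighbors=2)
X_knn = knn_demo.fit_transform(X_small)

# Row 0, column 2: the two nearest rows are 1 and 2, so the fill is (3 + 5) / 2
# Row 2, column 0: the two nearest rows are 1 and 3, so the fill is (3 + 8) / 2
print(X_knn)
```

Because the result depends on distances between rows, features on very different scales can dominate the neighbor search, so scaling before KNN imputation is often worth considering.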


The `IterativeImputer` class models each feature with missing values as a function of other features, and iteratively estimates missing values.

In [4]:
# Load libraries
from sklearn.experimental import enable_iterative_imputer # Experimental feature requires loading
from sklearn.impute import IterativeImputer

# Initialize the IterativeImputer
iterative_imputer = IterativeImputer()

# Fit and transform the data using the previously defined DataFrame
iterative_imputed_data = iterative_imputer.fit_transform(df)
iterative_imputed_df = pd.DataFrame(iterative_imputed_data, columns=df.columns)
iterative_imputed_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,58.801452,54.365008,42.009814,34.680397,49.962259,58.680582,51.178102,89.954588,41.152421,48.292275
1,69.910875,9.554215,6.436369,31.287816,37.966499,60.070634,82.005649,50.710323,75.797616,91.382492
2,52.539676,96.090974,59.643269,84.710402,46.12129,84.109027,51.155371,70.288128,1.778343,4.11769
3,52.630591,25.176729,83.732372,88.02311,16.886931,97.205554,84.696823,50.680265,50.650877,80.077973
4,20.501895,53.198954,89.248639,67.655865,58.635861,78.225721,60.911562,50.691674,65.114243,99.119187
5,10.606287,76.825393,20.052744,5.367515,46.15631,19.703051,34.423301,93.399494,72.20668,12.640276
6,72.724014,79.79234,50.239523,55.921377,6.191019,55.923578,22.966899,21.049022,57.358544,14.302591
7,52.558131,53.942885,89.538184,69.451294,46.070477,47.885551,50.969808,33.620401,99.685711,6.683138
8,47.38457,54.204978,25.592093,82.41973,73.41454,61.6637,29.172571,65.946718,61.005155,34.052747
9,44.829582,38.165095,52.882614,31.142866,28.865545,55.9783,41.004459,41.460336,50.236236,48.429901
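
On the uncorrelated random data above, the iterative estimates stay close to the column means, because the other features carry little information about the missing one. The sketch below, on hypothetical synthetic data where the second column is roughly twice the first, illustrates the case where modeling features from each other pays off. The variable names and the `max_iter=10, random_state=0` settings are choices made for this illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical correlated data: second column is about 2 * first column
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
X_full = np.column_stack([x1, 2 * x1 + rng.normal(0, 0.1, size=50)])

# Hide the first five values of the second column
X_missing = X_full.copy()
X_missing[:5, 1] = np.nan

iter_demo = IterativeImputer(max_iter=10, random_state=0)
X_iter = iter_demo.fit_transform(X_missing)
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Compare the worst-case error of each approach on the hidden values
iter_error = np.abs(X_iter[:5, 1] - X_full[:5, 1]).max()
mean_error = np.abs(X_mean[:5, 1] - X_full[:5, 1]).max()
print(f"iterative: {iter_error:.3f}, mean: {mean_error:.3f}")
```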


## Scaling Techniques

Scaling and normalization are crucial steps in preprocessing data for machine learning models. They put features on comparable ranges so that no single feature dominates the distance calculations in algorithms like k-NN and SVM simply because of its units.

### Getting ready

We will reuse the previously defined `iterative_imputed_df` DataFrame for this recipe, so there is no need to redefine it.

### How to do it...

The `StandardScaler` standardizes features by removing the mean and scaling to unit variance.

In [5]:
# Load libraries
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the data using the iterative imputed DataFrame
scaled_data = scaler.fit_transform(iterative_imputed_df)
scaled_df = pd.DataFrame(scaled_data, columns=iterative_imputed_df.columns)
scaled_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,0.271931,0.019495,-0.373057,-0.894258,0.188761,0.031844,0.001905,1.6838,-0.780898,-0.003574
1,0.756048,-1.874939,-1.592196,-1.043864,-0.391744,0.085949,1.426871,-0.000534,0.652484,1.429056
2,-0.000939,1.783513,0.231259,1.311954,0.002887,1.021588,0.000854,0.83973,-2.409929,-1.472256
3,0.003022,-1.214477,1.056818,1.458037,-1.411839,1.53134,1.551267,-0.001824,-0.387917,1.053212
4,-1.397054,-0.029802,1.245866,0.559887,0.608499,0.792594,0.451822,-0.001334,0.210478,1.686279
5,-1.828276,0.969036,-1.125549,-2.186891,0.004582,-1.485267,-0.772566,1.831653,0.503915,-1.188904
6,0.878636,1.094467,-0.091017,0.042422,-1.929442,-0.075466,-1.302124,-1.273574,-0.110399,-1.133636
7,-0.000135,0.001649,1.255789,0.639061,0.000428,-0.388327,-0.007723,-0.73402,1.64081,-1.386962
8,-0.225584,0.012729,-0.93571,1.21094,1.323678,0.147955,-1.015275,0.6534,0.040472,-0.476999
9,-0.336923,-0.665377,-0.000435,-1.050256,-0.832163,-0.073336,-0.46836,-0.397536,-0.405072,0.001002
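
The transformation is z = (x - mean) / std, computed per column. The sketch below checks this against NumPy on a hypothetical 3x2 array; note that the standard deviation used is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy array with two features on very different scales
X_small = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler_demo = StandardScaler()
X_std = scaler_demo.fit_transform(X_small)

# Manual computation: subtract the column mean, divide by the population std
X_manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(scaler_demo.mean_)  # column means learned during fit
print(np.allclose(X_std, X_manual))
```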


The `MinMaxScaler` transforms features by scaling each feature to a given range, often between zero and one.

In [6]:
# Load libraries
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
minmax_scaler = MinMaxScaler()

# Fit and transform the data using the iterative imputed DataFrame
minmax_scaled_data = minmax_scaler.fit_transform(iterative_imputed_df)
minmax_scaled_df = pd.DataFrame(
    minmax_scaled_data, columns=iterative_imputed_df.columns
)
minmax_scaled_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,0.603506,0.517824,0.38054,0.354639,0.56902,0.555896,0.435006,0.962767,0.402156,0.464988
1,0.721357,0.0,0.000369,0.313594,0.413077,0.57192,0.833074,0.538604,0.756013,0.918562
2,0.53708,1.0,0.568987,0.959922,0.519088,0.849027,0.434713,0.750206,0.0,0.0
3,0.538045,0.18053,0.826426,1.0,0.139045,1.0,0.867825,0.538279,0.499171,0.799569
4,0.197218,0.504349,0.885377,0.753589,0.681776,0.781206,0.560692,0.538402,0.646896,1.0
5,0.092244,0.777371,0.145886,0.0,0.519543,0.106575,0.218656,1.0,0.719336,0.08971
6,0.751199,0.811657,0.46849,0.611621,0.0,0.524114,0.070723,0.218016,0.567681,0.107208
7,0.537276,0.512946,0.888472,0.775311,0.518428,0.431454,0.432317,0.353891,1.0,0.027004
8,0.482394,0.515975,0.205085,0.932208,0.873897,0.590284,0.150855,0.703283,0.604927,0.315101
9,0.45529,0.330621,0.496737,0.31184,0.294766,0.524745,0.303637,0.438627,0.494936,0.466437
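
The minimum and maximum are learned during `fit`, so new data outside that range maps outside [0, 1]. The following sketch, on hypothetical one-column arrays, shows this and the optional `clip=True` setting that caps transformed values to the target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data with min 0 and max 10
X_fit = np.array([[0.0], [5.0], [10.0]])
# New data that exceeds the training maximum
X_new = np.array([[5.0], [15.0]])

# Default behavior: (x - min) / (max - min), no clipping
plain_scaler = MinMaxScaler().fit(X_fit)
plain_out = plain_scaler.transform(X_new)

# With clip=True, values are capped to the feature range
clipped_scaler = MinMaxScaler(clip=True).fit(X_fit)
clipped_out = clipped_scaler.transform(X_new)

print(plain_out.ravel())
print(clipped_out.ravel())
```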


The `Normalizer` scales individual samples (rows, not columns) to have unit norm; the default is the L2 norm.

In [7]:
# Load libraries
from sklearn.preprocessing import Normalizer

# Initialize the Normalizer
normalizer = Normalizer()

# Fit and transform the data using the iterative imputed DataFrame
normalized_data = normalizer.fit_transform(iterative_imputed_df)
normalized_df = pd.DataFrame(normalized_data, columns=iterative_imputed_df.columns)
normalized_df

Unnamed: 0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10
0,0.339168,0.313579,0.242313,0.200037,0.288183,0.338471,0.295196,0.51886,0.237368,0.278551
1,0.376706,0.051482,0.034682,0.168591,0.204578,0.323683,0.441878,0.273247,0.408426,0.492404
2,0.264336,0.48345,0.300075,0.426192,0.232044,0.423166,0.257371,0.353631,0.008947,0.020717
3,0.243762,0.116607,0.387811,0.407684,0.078213,0.450213,0.392278,0.234729,0.234592,0.370886
4,0.095909,0.248868,0.417511,0.316499,0.274302,0.365945,0.284948,0.237139,0.304609,0.463686
5,0.068115,0.493382,0.128781,0.034471,0.296421,0.126535,0.221071,0.599823,0.46372,0.081177
6,0.460521,0.505281,0.318139,0.354119,0.039204,0.354133,0.145437,0.133292,0.36322,0.09057
7,0.274582,0.281816,0.467778,0.362837,0.240688,0.25017,0.266284,0.175644,0.520792,0.034915
8,0.265283,0.303467,0.143277,0.461427,0.411012,0.345225,0.163323,0.369203,0.341538,0.190645
9,0.321287,0.273524,0.379002,0.223196,0.206875,0.401188,0.293873,0.29714,0.360036,0.34709
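
Unlike the two scalers above, `Normalizer` works row by row. The sketch below uses a hypothetical 2x2 array to show the default L2 norm and the `norm="l1"` alternative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy rows
X_rows = np.array([[3.0, 4.0], [1.0, 0.0]])

# L2: divide each row by its Euclidean length (5 for the first row)
l2_out = Normalizer(norm="l2").fit_transform(X_rows)

# L1: divide each row by the sum of absolute values (7 for the first row)
l1_out = Normalizer(norm="l1").fit_transform(X_rows)

print(l2_out)
print(l1_out)
```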


## Encoding Categorical Variables

Encoding categorical variables is essential for converting non-numeric data into a format that can be used by machine learning algorithms.

### Getting ready

To begin, we will create a toy dataset of 20 records with three categorical features (`Department`, `Position`, and `Location`), each sampled at random from a small set of labels. We will store the dataset in a pandas `DataFrame` for better readability.

In [8]:
# Load libraries
import numpy as np

# Create sample categorical data with 20 records
np.random.seed(2024)  # for reproducibility
categories = ["A", "B", "C", "D"]
categorical_data = pd.DataFrame(
    {
        "Department": np.random.choice(categories, size=20),
        "Position": np.random.choice(["Junior", "Senior", "Manager"], size=20),
        "Location": np.random.choice(["NY", "SF", "LA", "CHI"], size=20),
    }
)

# Display the DataFrame with categorical values
display(categorical_data)

Unnamed: 0,Department,Position,Location
0,A,Manager,LA
1,C,Manager,LA
2,A,Junior,NY
3,A,Manager,LA
4,D,Manager,NY
5,A,Senior,LA
6,C,Manager,SF
7,D,Manager,LA
8,B,Junior,LA
9,D,Manager,CHI


### How to do it...

The `OneHotEncoder` converts categorical values into a one-hot numeric array.

In [9]:
# Load libraries
from sklearn.preprocessing import OneHotEncoder

# Initialize the OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
onehot_encoded_data = onehot_encoder.fit_transform(categorical_data)
onehot_encoded_df = pd.DataFrame(
    onehot_encoded_data, columns=onehot_encoder.get_feature_names_out()
)
onehot_encoded_df

Unnamed: 0,Department_A,Department_B,Department_C,Department_D,Position_Junior,Position_Manager,Position_Senior,Location_CHI,Location_LA,Location_NY,Location_SF
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
7,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
9,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


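The `LabelEncoder` encodes target labels with values between 0 and n_classes-1. It is designed for the target `y`; it is applied to feature columns below only for illustration, and the integer codes follow alphabetical order rather than any meaningful ranking.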

In [10]:
# Load libraries
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Create a new DataFrame to store label encoded values
label_encoded_df = pd.DataFrame()

# Fit and transform each categorical column
for column in categorical_data.columns:
    label_encoded_df[f"{column}_encoded"] = label_encoder.fit_transform(
        categorical_data[column]
    )
label_encoded_df

Unnamed: 0,Department_encoded,Position_encoded,Location_encoded
0,0,1,1
1,2,1,1
2,0,0,2
3,0,1,1
4,3,1,2
5,0,2,1
6,2,1,3
7,3,1,1
8,1,0,1
9,3,1,0
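
For feature columns, scikit-learn provides `OrdinalEncoder`, which accepts a 2D input and an explicit category order. The sketch below assumes, for illustration only, that `Position` has the ranking Junior < Senior < Manager.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column of positions
positions = pd.DataFrame({"Position": ["Senior", "Junior", "Manager", "Junior"]})

# Assumed ranking for this example: Junior < Senior < Manager
ordinal_encoder = OrdinalEncoder(categories=[["Junior", "Senior", "Manager"]])
position_codes = ordinal_encoder.fit_transform(positions)

print(position_codes.ravel())
```

With `LabelEncoder`, the same column would be coded alphabetically (Junior=0, Manager=1, Senior=2), which does not reflect the assumed ranking.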


The `ColumnTransformer` applies different transformers to different columns or column subsets of the input. The features generated by each transformer are then concatenated to form a single feature space.

In [11]:
# Load libraries
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Create sample mixed data with 20 records
np.random.seed(2024)  # for reproducibility
mixed_data = pd.DataFrame(
    {
        "Age": np.random.randint(25, 65, size=20),
        "Salary": np.round(np.random.normal(60000, 15000, size=20), 2),
        "Experience": np.random.randint(1, 20, size=20),
        "Department": np.random.choice(["IT", "HR", "Sales", "Finance"], size=20),
        "Position": np.random.choice(["Junior", "Senior", "Manager"], size=20),
    }
)

# Display the DataFrame with mixed data
display(mixed_data)

# Initialize the ColumnTransformer
column_transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Salary", "Experience"]),
        ("cat", OneHotEncoder(), ["Department", "Position"]),
    ],
    remainder="passthrough",
)

# Fit and transform the data
transformed_data = column_transformer.fit_transform(mixed_data)

# Get feature names for the transformed columns
numeric_cols = ["Age_scaled", "Salary_scaled", "Experience_scaled"]
categorical_cols = column_transformer.named_transformers_["cat"].get_feature_names_out(
    ["Department", "Position"]
)

# Create the transformed DataFrame
transformed_df = pd.DataFrame(
    transformed_data, columns=numeric_cols + list(categorical_cols)
)
transformed_df

Unnamed: 0,Age,Salary,Experience,Department,Position
0,33,59420.36,12,Finance,Manager
1,57,82895.92,17,Sales,Junior
2,25,38165.76,16,Finance,Manager
3,52,38242.36,7,IT,Junior
4,61,55088.65,8,Sales,Manager
5,26,78688.88,9,Sales,Senior
6,60,49585.62,14,Finance,Junior
7,35,47992.58,17,HR,Manager
8,27,54833.02,9,Sales,Senior
9,57,93535.56,14,Sales,Senior


Unnamed: 0,Age_scaled,Salary_scaled,Experience_scaled,Department_Finance,Department_HR,Department_IT,Department_Sales,Position_Junior,Position_Manager,Position_Senior
0,-1.045349,-0.305043,0.327303,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.727681,1.105091,1.262454,0.0,0.0,0.0,1.0,1.0,0.0,0.0
2,-1.636358,-1.581768,1.075424,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.3583,-1.577167,-0.607848,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.023186,-0.565241,-0.420818,0.0,0.0,0.0,1.0,0.0,1.0,0.0
5,-1.562482,0.852382,-0.233788,0.0,0.0,0.0,1.0,0.0,0.0,1.0
6,0.949309,-0.895798,0.701364,1.0,0.0,0.0,0.0,1.0,0.0,0.0
7,-0.897596,-0.991489,1.262454,0.0,1.0,0.0,0.0,0.0,1.0,0.0
8,-1.488606,-0.580596,-0.233788,0.0,0.0,0.0,1.0,0.0,0.0,1.0
9,0.727681,1.744195,0.701364,0.0,0.0,0.0,1.0,0.0,0.0,1.0


## Introduction to Pipelines

Pipelines are a simple way to streamline a machine learning workflow by chaining together transformers and estimators.

### Getting ready

The general syntax for defining a pipeline is as follows:

```
pipeline = Pipeline(
    [("name of step", transformer), ("name of step", transformer), ..., ("name of step", estimator)]
)
```

### How to do it...

A basic pipeline chains together a sequence of transformations and, optionally, a final estimator. The pipeline below contains only the two transformer steps.

In [12]:
# Load libraries
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# First, separate features and target (for illustration, the last column, Position_Senior, is treated as the target)
X = transformed_df.iloc[:, :-1]  # all columns except last
y = transformed_df.iloc[:, -1]  # last column

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2024
)

# Create a pipeline
pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),  # handle missing values
        ("scaler", StandardScaler()),  # scale the features
    ]
)

# Fit and transform the training data
X_train_transformed = pipeline.fit_transform(X_train)

# Transform the test data
X_test_transformed = pipeline.transform(X_test)

# Create DataFrames with the transformed data (to preserve column names)
X_train_transformed = pd.DataFrame(
    X_train_transformed, columns=X_train.columns, index=X_train.index
)

X_test_transformed = pd.DataFrame(
    X_test_transformed, columns=X_test.columns, index=X_test.index
)
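
The general syntax also allows an estimator as the last step. The following is a minimal sketch of that variant on hypothetical synthetic regression data (the names `X_syn`, `y_syn`, and `model_pipeline`, and the choice of `LinearRegression`, are assumptions for illustration). Calling `fit` on the pipeline fits each transformer on the training data only, then fits the estimator on the transformed result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: a linear target plus a little noise
rng = np.random.default_rng(2024)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Knock out about 10% of the feature values after computing the target
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2024
)

model_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),  # handle missing values
        ("scaler", StandardScaler()),  # scale the features
        ("regressor", LinearRegression()),  # final estimator
    ]
)

# fit() learns imputation and scaling statistics from the training split only
model_pipeline.fit(X_tr, y_tr)

# predict() and score() reuse those statistics on the test split
predictions = model_pipeline.predict(X_te)
r2 = model_pipeline.score(X_te, y_te)
print(f"Test R^2: {r2:.3f}")
```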

### Visualizing Pipelines

Visualizing a pipeline can help you understand the workflow and confirm that every step is configured as intended.

In [13]:
# Load libraries
from sklearn import set_config

# Set display configuration
set_config(display="diagram")

# Display the pipeline
pipeline

Pipeline
Parameter,Value
steps,"[('imputer', ...), ('scaler', ...)]"
transform_input,
memory,
verbose,False

imputer: SimpleImputer
Parameter,Value
missing_values,
strategy,'mean'
fill_value,
copy,True
add_indicator,False
keep_empty_features,False

scaler: StandardScaler
Parameter,Value
copy,True
with_mean,True
with_std,True


## Feature Engineering

Feature engineering involves creating new features or modifying existing ones to improve model performance.

### Getting ready

We will reuse the previously defined `X_train_transformed` DataFrame for this recipe, so there is no need to redefine it.

### How to do it...

The `PolynomialFeatures` transformer generates polynomial and interaction features.

In [14]:
# Load libraries
from sklearn.preprocessing import PolynomialFeatures

# Initialize the PolynomialFeatures
poly = PolynomialFeatures(degree=2)

# Fit and transform the X_train_transformed data
poly_features = poly.fit_transform(X_train_transformed)
poly_features_df = pd.DataFrame(
    poly_features, columns=poly.get_feature_names_out(X_train_transformed.columns)
)
poly_features_df

Unnamed: 0,1,Age_scaled,Salary_scaled,Experience_scaled,Department_Finance,Department_HR,Department_IT,Department_Sales,Position_Junior,Position_Manager,...,Department_IT^2,Department_IT Department_Sales,Department_IT Position_Junior,Department_IT Position_Manager,Department_Sales^2,Department_Sales Position_Junior,Department_Sales Position_Manager,Position_Junior^2,Position_Junior Position_Manager,Position_Manager^2
0,1.0,0.834298,-0.867451,-1.559779,-0.57735,-0.480384,2.081666,-0.774597,-0.57735,-1.0,...,4.333333,-1.612452,-1.20185,-2.081666,0.6,0.447214,0.774597,0.333333,0.57735,1.0
1,1.0,-0.689202,1.282818,-0.150946,-0.57735,2.081666,-0.480384,-0.774597,-0.57735,1.0,...,0.230769,0.372104,0.27735,-0.480384,0.6,0.447214,-0.774597,0.333333,-0.57735,1.0
2,1.0,0.544107,-1.173357,-1.761041,-0.57735,-0.480384,2.081666,-0.774597,-0.57735,1.0,...,4.333333,-1.612452,-1.20185,2.081666,0.6,0.447214,-0.774597,0.333333,-0.57735,1.0
3,1.0,-1.342131,0.821525,-0.150946,-0.57735,-0.480384,-0.480384,1.290994,-0.57735,-1.0,...,0.230769,-0.620174,0.27735,0.480384,1.666667,-0.745356,-1.290994,0.333333,0.57735,1.0
4,1.0,1.414679,0.611115,0.654101,-0.57735,-0.480384,-0.480384,1.290994,-0.57735,1.0,...,0.230769,-0.620174,0.27735,-0.480384,1.666667,-0.745356,1.290994,0.333333,-0.57735,1.0
5,1.0,-1.414679,-1.479801,1.257887,1.732051,-0.480384,-0.480384,-0.774597,-0.57735,1.0,...,0.230769,0.372104,0.27735,-0.480384,0.6,0.447214,-0.774597,0.333333,-0.57735,1.0
6,1.0,1.269583,0.576819,0.452839,1.732051,-0.480384,-0.480384,-0.774597,1.732051,-1.0,...,0.230769,0.372104,-0.83205,0.480384,0.6,-1.341641,0.774597,3.0,-1.732051,1.0
7,1.0,0.544107,-1.47545,-0.55347,-0.57735,-0.480384,2.081666,-0.774597,1.732051,-1.0,...,4.333333,-1.612452,3.605551,-2.081666,0.6,-1.341641,0.774597,3.0,-1.732051,1.0
8,1.0,0.906845,1.060444,1.459148,-0.57735,-0.480384,-0.480384,1.290994,1.732051,-1.0,...,0.230769,-0.620174,-0.83205,0.480384,1.666667,2.236068,-1.290994,3.0,-1.732051,1.0
9,1.0,0.906845,1.664674,0.855363,-0.57735,-0.480384,-0.480384,1.290994,-0.57735,-1.0,...,0.230769,-0.620174,0.27735,0.480384,1.666667,-0.745356,-1.290994,0.333333,0.57735,1.0


The `KBinsDiscretizer` discretizes continuous features into k bins.

In [15]:
# Load libraries
from sklearn.preprocessing import KBinsDiscretizer

# Initialize the KBinsDiscretizer
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

# Fit and transform the X_train_transformed data
binned_data = kbins.fit_transform(X_train_transformed)
binned_df = pd.DataFrame(binned_data, columns=X_train_transformed.columns)
binned_df

Unnamed: 0,Age_scaled,Salary_scaled,Experience_scaled,Department_Finance,Department_HR,Department_IT,Department_Sales,Position_Junior,Position_Manager
0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
1,0.0,2.0,1.0,0.0,2.0,0.0,0.0,0.0,2.0
2,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0
3,0.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0
4,2.0,1.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0
5,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,2.0
6,2.0,1.0,2.0,2.0,0.0,0.0,0.0,2.0,0.0
7,2.0,0.0,1.0,0.0,0.0,2.0,0.0,2.0,0.0
8,2.0,2.0,2.0,0.0,0.0,0.0,2.0,2.0,0.0
9,2.0,2.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0
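
With `strategy="uniform"`, each feature's range is split into bins of equal width, and the learned edges are stored in `bin_edges_`. A minimal sketch on a hypothetical single-column array:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical feature ranging from 0 to 6
X_cont = np.array([[0.0], [1.0], [2.5], [3.5], [5.0], [6.0]])

kbins_demo = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = kbins_demo.fit_transform(X_cont)

# Three equal-width bins over [0, 6] give the edges 0, 2, 4, 6
print(kbins_demo.bin_edges_[0])
print(X_binned.ravel())
```

Binning is mainly meaningful for continuous columns; a one-hot column only has two distinct values, so discretizing it adds no information.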


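`RFE()` (recursive feature elimination) fits the specified estimator, removes the least important feature according to its coefficients or feature importances, and repeats until `n_features_to_select` features remain. In `ranking_`, a value of 1 marks a selected feature and larger values mark features that were eliminated earlier.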

In [16]:
# Load libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Initialize the RFE
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)

# Fit the RFE to the X_train_transformed and y_train data
rfe.fit(X_train_transformed, y_train)

# Get the ranking of features
rfe.ranking_

array([7, 5, 4, 3, 6, 9, 8, 2, 1])
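
To relate `ranking_` to the selected columns, the sketch below runs `RFE` on hypothetical synthetic data from `make_regression` and reads the boolean mask in `support_`. The dataset sizes and `n_features_to_select=2` are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: 6 features, only 2 of which drive the target
X_reg, y_reg = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=1.0, random_state=0
)

rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_reg, y_reg)

# support_ is True for kept features; those same features have rank 1
print(rfe_demo.support_)
print(rfe_demo.ranking_)

# transform() keeps only the selected columns
X_selected = rfe_demo.transform(X_reg)
print(X_selected.shape)
```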

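`SelectFromModel()` allows users to select features based on their importance weights derived from a given model. For a linear model, the importance of a feature is the absolute value of its coefficient, and features whose importance meets the threshold are kept.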

In [17]:
# Load libraries
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Initialize SelectFromModel with LinearRegression
selector = SelectFromModel(
    estimator=LinearRegression(),
    prefit=False,
    threshold="mean",  # Use mean of feature importances as threshold
)

# Fit the selector
selector.fit(X_train_transformed, y_train)

# Get selected features
selected_features_mask = selector.get_support()

# Get feature names that were selected
selected_features = X_train_transformed.columns[selected_features_mask].tolist()

# Print feature importance scores and selection status
feature_importance = pd.DataFrame(
    {
        "Feature": X_train_transformed.columns,
        "Importance": selector.estimator_.coef_,
        "Selected": selected_features_mask,
    }
)
feature_importance.sort_values("Importance", key=abs, ascending=False)

Unnamed: 0,Feature,Importance,Selected
8,Position_Manager,-0.5,True
7,Position_Junior,-0.4330127,True
2,Experience_scaled,-1.103284e-15,False
1,Salary_scaled,1.054712e-15,False
4,Department_HR,-4.649059e-16,False
3,Department_Finance,4.614364e-16,False
6,Department_Sales,-2.810252e-16,False
0,Age_scaled,2.078464e-16,False
5,Department_IT,-2.046974e-16,False
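
The selection rule with `threshold="mean"` can be reproduced by hand. The sketch below, on hypothetical synthetic data from `make_regression`, compares `get_support()` with a manual check of the absolute coefficients against their mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Hypothetical data: 6 features, only 2 of which drive the target
X_reg, y_reg = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=1.0, random_state=0
)

selector_demo = SelectFromModel(estimator=LinearRegression(), threshold="mean")
selector_demo.fit(X_reg, y_reg)

# Importance of each feature is the absolute value of its coefficient
importances = np.abs(selector_demo.estimator_.coef_)
manual_mask = importances >= importances.mean()

print(selector_demo.threshold_)  # numeric threshold actually used
print(selector_demo.get_support())
print(manual_mask)
```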


## Practical Exercise on Data Preprocessing

In this section, we will combine all the recipes into a comprehensive pipeline and apply it to the California Housing dataset.

### Comprehensive Pipeline

We will create a pipeline that includes imputation, scaling, encoding, and modeling steps.

In [18]:
# Load libraries
YOUR CODE HERE

# Load the California Housing dataset
YOUR CODE HERE

# Split the data
YOUR CODE HERE

# Create a comprehensive pipeline
YOUR CODE HERE

# Fit the pipeline
YOUR CODE HERE

# Evaluate the pipeline
YOUR CODE HERE
