# Data Preprocessing with ColumnTransformer

Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. It can be challenging when you have a dataset with mixed types and you want to selectively apply data transforms to some, but not all, input features.

Thankfully, the scikit-learn Python machine learning library provides the `ColumnTransformer` class, which allows you to selectively apply data transforms to different columns in your dataset.

## Challenge of Transforming Different Data Types

It is important to prepare data prior to modeling.

This may involve replacing missing values, scaling numerical values, and one hot encoding categorical data.

Data transforms can be performed using the scikit-learn library; for example, the `SimpleImputer` class can be used to replace missing values, the `MinMaxScaler` class can be used to scale numerical values, and the `OneHotEncoder` can be used to encode categorical variables.
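
As a minimal sketch of the three classes named above, the snippet below applies each one to a tiny made-up column. The arrays `num` and `cat` are illustrative assumptions and are not taken from any dataset used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# made-up numerical column with one missing value (illustrative only)
num = np.array([[1.0], [np.nan], [3.0], [5.0]])
# replace the missing value with the column median
num_imputed = SimpleImputer(strategy='median').fit_transform(num)
# rescale the imputed values to the range [0, 1]
num_scaled = MinMaxScaler().fit_transform(num_imputed)
print(num_scaled.ravel())

# made-up categorical column (illustrative only)
cat = np.array([['M'], ['F'], ['I'], ['M']], dtype=object)
# one hot encode; the encoder returns a sparse matrix by default, so convert it
cat_encoded = OneHotEncoder().fit_transform(cat).toarray()
print(cat_encoded.shape)
```

Each transform here sees a single column type, which is the easy case; the difficulty described next comes from mixing them in one dataset.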

It is very common to want to perform different data preparation techniques on different columns in your input data.

For example, you may want to impute missing numerical values with the median and then scale them, while imputing missing categorical values with the most frequent value and then one hot encoding them.

Traditionally, this would require you to separate the numerical and categorical data and then manually apply the transforms on those groups of features before combining the columns back together in order to fit and evaluate a model.
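
The manual approach might look roughly like the sketch below. The DataFrame `df` and its column names (`sex`, `length`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# made-up mixed-type data with missing values (illustrative only)
df = pd.DataFrame({
    'sex': ['M', 'F', np.nan, 'M'],
    'length': [0.4, np.nan, 0.6, 0.8],
})

# 1. separate the columns by type
num_part = df[['length']]
cat_part = df[['sex']]

# 2. prepare each group on its own
num_filled = SimpleImputer(strategy='median').fit_transform(num_part)
num_ready = MinMaxScaler().fit_transform(num_filled)
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat_part)
cat_ready = OneHotEncoder().fit_transform(cat_filled).toarray()

# 3. stitch the prepared columns back together
X_manual = np.hstack([cat_ready, num_ready])
print(X_manual.shape)
```

Every step has to be repeated by hand for new data, and the fitted imputers, scaler, and encoder must be tracked separately, which is the bookkeeping that makes this approach error-prone.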

Now, you can use the `ColumnTransformer` to perform this operation for you.
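
A hedged sketch of the impute-then-scale and impute-then-encode example described earlier is shown below, using one small `Pipeline` per column type inside a `ColumnTransformer`. The DataFrame, the column names, and the step names are assumptions made for illustration; `sparse_threshold=0` is set only so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# made-up mixed-type data with missing values (illustrative only)
df = pd.DataFrame({
    'sex': ['M', 'F', np.nan, 'M'],
    'length': [0.4, np.nan, 0.6, 0.8],
})

# numerical columns: impute with the median, then scale to [0, 1]
num_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler()),
])
# categorical columns: impute with the most frequent value, then one hot encode
cat_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder()),
])

# route each group of columns to its own sequence of transforms
ct = ColumnTransformer(
    transformers=[('cat', cat_steps, ['sex']), ('num', num_steps, ['length'])],
    sparse_threshold=0,  # assumption: force dense output for easy inspection
)
X_auto = ct.fit_transform(df)
print(X_auto.shape)
```

The output columns follow the order of the `transformers` list, so the encoded categories come first and the scaled numerical column comes last.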

## Dataset Loading

The abalone dataset is a standard machine learning dataset that involves predicting the age of an abalone from its physical measurements.

The dataset has 4,177 examples and 8 input variables, and the target variable is an integer.

In [3]:
import pandas as pd

# load the dataset from a local CSV file that has no header row
url = 'abalone.csv'
dataframe = pd.read_csv(url, header=None)
dataframe.head()

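|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |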


In [4]:
# split into inputs and outputs
last_idx = len(dataframe.columns) - 1

X, y = dataframe.drop(last_idx, axis=1), dataframe[last_idx]
print(X.shape, y.shape)

(4177, 8) (4177,)


## Data Preprocessing

We need two lists of columns: the numerical columns, which Pandas marks as `float64` or `int64`, and the categorical columns, which Pandas marks as `object` or `bool`.

In [5]:
# determine categorical and numerical features
numerical_ix = X.select_dtypes(include=['int64', 'float64']).columns
categorical_ix = X.select_dtypes(include=['object', 'bool']).columns
print(numerical_ix)
print(categorical_ix)

Int64Index([1, 2, 3, 4, 5, 6, 7], dtype='int64')
Int64Index([0], dtype='int64')


We can then use these lists in the `ColumnTransformer` to one hot encode the categorical variables, which should just be the first column.

We can also use the list of numerical columns to normalize the remaining data.

In [8]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer

# define the data preparation for the columns
t = [('cat', OneHotEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
col_transform = ColumnTransformer(transformers=t)
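
To see what such a transformer produces, the sketch below applies the same configuration to three made-up rows laid out like the abalone data (column 0 categorical, the rest numerical). The rows and the explicit column lists are assumptions for illustration, and `sparse_threshold=0` is added only to force a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# made-up rows: column 0 is categorical, columns 1 and 2 are numerical
toy = pd.DataFrame([['M', 0.2, 10.0], ['F', 0.4, 20.0], ['I', 0.6, 30.0]])

toy_transform = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(), [0]), ('num', MinMaxScaler(), [1, 2])],
    sparse_threshold=0,  # assumption: always return a dense array
)
out = toy_transform.fit_transform(toy)
# 3 one hot columns (F, I, M in sorted order) followed by 2 scaled columns
print(out.shape)
print(out)
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`), so every column the model needs must appear in one of the lists.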

## Define and Train the Model

Next, we can define our `SVR` model and a `Pipeline` that first applies the `ColumnTransformer` and then fits the model on the prepared dataset.

In [11]:
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline

# define the model
model = SVR(kernel='rbf', gamma='scale', C=100)
# define the data preparation and modeling pipeline
pipeline = Pipeline(steps=[('prep', col_transform), ('m', model)])

Finally, we can evaluate the model using 10-fold cross-validation and calculate the mean absolute error, averaged across all 10 evaluations of the pipeline.

In [12]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# define the model cross-validation configuration
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# evaluate the pipeline using cross validation and calculate MAE
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# convert MAE scores to positive values
scores = np.absolute(scores)
# summarize the model performance
print('MAE: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

MAE: 1.465 (0.047)
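
The sketch below illustrates two points from the evaluation step on synthetic data: `neg_mean_absolute_error` scores are returned as non-positive numbers, which is why they are passed through `np.absolute`, and the same pipeline can be fit on all data to make predictions. The synthetic data, the column names, the 5-fold setting, and `handle_unknown='ignore'` are assumptions for illustration and are not part of the abalone experiment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVR

# synthetic mixed-type data (illustrative only)
rng = np.random.default_rng(1)
n = 60
toy_X = pd.DataFrame({
    'group': rng.choice(['a', 'b', 'c'], size=n),
    'length': rng.uniform(0.0, 1.0, size=n),
})
toy_y = 10.0 * toy_X['length'] + rng.normal(0.0, 0.5, size=n)

prep = ColumnTransformer(transformers=[
    # ignore categories unseen during fit instead of raising an error
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['group']),
    ('num', MinMaxScaler(), ['length']),
])
toy_pipeline = Pipeline(steps=[('prep', prep), ('m', SVR(kernel='rbf', gamma='scale', C=100))])

# the scorer negates MAE so that larger is better, hence values <= 0
cv = KFold(n_splits=5, shuffle=True, random_state=1)
raw = cross_val_score(toy_pipeline, toy_X, toy_y, scoring='neg_mean_absolute_error', cv=cv)
print(raw)
mae = np.absolute(raw)
print('MAE: %.3f (%.3f)' % (np.mean(mae), np.std(mae)))

# fit on all rows, then predict for a few of them
toy_pipeline.fit(toy_X, toy_y)
pred = toy_pipeline.predict(toy_X.iloc[:3])
print(pred.shape)
```

Because the data preparation lives inside the pipeline, it is refit on the training folds only during cross-validation, and the same fitted transforms are reused automatically at prediction time.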
