# Fit/Transform in Data Preparation

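In data preparation, transformers are fit on given data. When training and prediction are separated, the two steps must be aligned: whatever a transformer learned from the training data has to be reused unchanged at prediction time. This notebook looks at methods in sklearn and pandas for doing so.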

Examples for Fit/Transform
* Encoding categorical features
* Scaling
* Distribution Mappers
* Normalization
* Discretization (otherwise known as quantization or binning)
* Imputation of missing values
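
As a minimal sketch of the fit/transform split, scaling can serve as the example. The toy arrays `X_train` and `X_new` are made up for illustration. The scaler learns its parameters from the training data only and reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption, not part of the notebook's datasets)
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0]])

# Fit: learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Transform: apply the stored training parameters to unseen data
scaled_new = scaler.transform(X_new)
print(scaler.mean_, scaler.scale_, scaled_new)
```

The new value is scaled with the training mean and standard deviation, not with statistics computed from the new data itself.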


Saving Methods
* pickle: Pickle files are not secure. A pickle file received from an untrusted source, for example over the network, can contain malicious code that runs arbitrary Python when the file is unpickled.
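
The risk can be illustrated with a harmless sketch. The class `Payload` is hypothetical and only evaluates an arithmetic expression, but the same mechanism would let a crafted file call any function during loading.

```python
import pickle

class Payload:
    """Hypothetical object whose pickle triggers a function call on load."""

    def __reduce__(self):
        # Instead of object state, pickle stores "call eval('6 * 7')"
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it executes the stored call instead
result = pickle.loads(blob)
print(result)
```

For this reason, pickle files should only be loaded when they come from a trusted source.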

Author: Enrico Lauckner ([github.com/elauckne](github.com/elauckne))

In [4]:
import pandas as pd

## Load Data

In [190]:
df = pd.read_csv('data/abalone.csv')

In [191]:
print(df.shape)
df.head()

(4177, 9)


Unnamed: 0,Type,LongestShell,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [193]:
numerical_features = list(df.iloc[:,1:9].columns)
categorical_features = ['Type']

In [192]:
df[numerical_features] = df[numerical_features].astype('float')

## Encoding categorical features

* `drop_first` is problematic because an unknown category in the test set would be treated by the model as the dropped category. Every categorical column should therefore get an extra dummy level (e.g. 'Other') that collects unknown or rare categories.
* With the sklearn encoder, everything that `pandas.get_dummies()` does automatically has to be done manually for every column:
    * Drop first category
    * Create feature names
    * Append to original DataFrame
    * Drop original column
* `pandas.get_dummies()` seems to be the better choice
    * Save the category values of each column from the training data
    * In the prediction step, set the column to type category with the saved category values
    * The category values can be saved in a YAML file
    
---
Change the function for One Hot Encoding:
* Train Mode: Always add 'Other' as a category, save the category values, apply One Hot Encoding
* Predict Mode: Load and apply the category values, set the resulting NaN values to 'Other', apply One Hot Encoding
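
The two modes could be sketched as follows. The function name `one_hot_encode`, its signature, and the toy frames are assumptions made for illustration, not the author's implementation. The returned `categories` dict contains only plain lists, so it could be written to a YAML file.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

def one_hot_encode(df, columns, categories=None):
    """Hypothetical helper: train mode if `categories` is None, else predict mode."""
    df = df.copy()
    train_mode = categories is None
    if train_mode:
        categories = {}
    for col in columns:
        if train_mode:
            # Save the observed values and always append 'Other'
            values = [v for v in df[col].dropna().unique().tolist() if v != 'Other']
            categories[col] = values + ['Other']
        # Values outside the saved categories become NaN here ...
        df[col] = df[col].astype(CategoricalDtype(categories=categories[col]))
        # ... and are then mapped to 'Other'
        df[col] = df[col].fillna('Other')
    return pd.get_dummies(df, columns=columns), categories

# Train mode: category values are learned and returned
train = pd.DataFrame({'Type': ['M', 'F', 'I'], 'Rings': [15, 9, 7]})
train_encoded, saved = one_hot_encode(train, ['Type'])

# Predict mode: the unseen value 'R' ends up in the 'Type_Other' column
new = pd.DataFrame({'Type': ['R'], 'Rings': [10]})
new_encoded, _ = one_hot_encode(new, ['Type'], categories=saved)
print(new_encoded)
```

Because the categories are fixed by the training data, the prediction frame always has the same dummy columns as the training frame, regardless of which values it contains.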

**sklearn**

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [92]:
# Note: scikit-learn >= 1.2 renames the `sparse` parameter to `sparse_output`
onehotencoder = OneHotEncoder(categories='auto', sparse=False)
feature = onehotencoder.fit_transform(df[categorical_features])

In [93]:
feature.shape

(4177, 3)

In [105]:
'Type_' + onehotencoder.categories_[0][1:]

array(['Type_I', 'Type_M'], dtype=object)

In [96]:
feature[:,1:].shape

(4177, 2)
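
The manual steps needed with the sklearn encoder (drop first category, create feature names, append, drop original column) can be put together as in the sketch below. The small frame `toy` is made up for illustration, and `.toarray()` is used in place of the `sparse` argument so that the sketch does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption), mimicking the 'Type' column of the abalone data
toy = pd.DataFrame({'Type': ['M', 'F', 'I', 'M'], 'Rings': [15, 9, 7, 10]})

encoder = OneHotEncoder(categories='auto')
encoded = encoder.fit_transform(toy[['Type']]).toarray()

# Create feature names from the learned (sorted) categories
names = ['Type_' + str(c) for c in encoder.categories_[0]]

# Drop the first category and wrap the rest in a DataFrame
dummies = pd.DataFrame(encoded[:, 1:], columns=names[1:], index=toy.index)

# Drop the original column and append the dummy columns
result = pd.concat([toy.drop(columns='Type'), dummies], axis=1)
print(result)
```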

**pandas**

In [200]:
from pandas.api.types import CategoricalDtype

In [205]:
# Save the training categories and always append 'Other' for unknown values
unique_values = list(df['Type'].unique()) + ['Other']
df['Type'] = df['Type'].astype(CategoricalDtype(categories=unique_values, ordered=True))

In [202]:
pd.get_dummies(df, drop_first=True).head()

Unnamed: 0,LongestShell,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings,Type_F,Type_I,Type_Other
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15.0,0,0,0
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7.0,0,0,0
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0,1,0,0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0,0,0,0
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0,0,1,0


Test

In [203]:
df_test = pd.DataFrame(df.iloc[0]).T
df_test[numerical_features] = df_test[numerical_features].astype('float')
df_test['Type'] = df_test['Type'].astype(CategoricalDtype(categories=unique_values, ordered=True))

pd.get_dummies(df_test, drop_first=True)

Unnamed: 0,LongestShell,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings,Type_F,Type_I,Type_Other
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15.0,0,0,0


In [204]:
df_test = pd.DataFrame(df.iloc[0]).T
# Replace the category with a value not seen in training (avoids chained assignment)
df_test.loc[df_test.index[0], 'Type'] = 'R'
df_test[numerical_features] = df_test[numerical_features].astype('float')
df_test['Type'] = df_test['Type'].astype(CategoricalDtype(categories=unique_values, ordered=True))

pd.get_dummies(df_test)

Unnamed: 0,LongestShell,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings,Type_M,Type_F,Type_I,Type_Other
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15.0,0,0,0,0


## Imputing

Imputation is quite simple: only one value per column needs to be saved.

There are two options:
1. Save the fill value of each column (together with the category values in the YAML file)
2. Store Transformer Object as binary file
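
The first option could look like the sketch below. It uses JSON from the standard library as a stand-in for YAML, and the column names `a`, `b`, `c` and the file name `impute_values.json` are assumptions made for illustration.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training data (assumption) with one missing value
train = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]], columns=['a', 'b', 'c'])
imputer = SimpleImputer(strategy='mean').fit(train)

# statistics_ holds one fill value per column
fill_values = {col: float(v) for col, v in zip(train.columns, imputer.statistics_)}

# Save the values as plain text and load them again
path = os.path.join(tempfile.mkdtemp(), 'impute_values.json')
with open(path, 'w') as file:
    json.dump(fill_values, file)
with open(path) as file:
    loaded = json.load(file)

# Prediction step: fill missing values with the saved training values
new = pd.DataFrame([[np.nan, 2, 3], [4, np.nan, 6]], columns=['a', 'b', 'c'])
imputed = new.fillna(value=loaded)
print(imputed)
```

Compared with a binary file, the saved values stay human-readable and do not depend on the installed scikit-learn version.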

In [215]:
import numpy as np
import pickle
from sklearn.impute import SimpleImputer

In [236]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)

In [237]:
# Save as pickle
with open('output/impute_fit.p', 'wb') as file:
    pickle.dump(imp_mean, file)

# Delete Object from workspace
del imp_mean

# Load from pickle
with open('output/impute_fit.p', 'rb') as file:
    imp_mean = pickle.load(file)

In [238]:
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X)

array([[ 7. ,  2. ,  3. ],
       [ 4. ,  3.5,  6. ],
       [10. ,  3.5,  9. ]])

## Combine Fitting Objects

[Column Transformer (sklearn)](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py)

In [254]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

In [241]:
data = pd.read_csv('data/titanic3.csv')

In [255]:
print(data.shape)
data.head()

(1309, 14)


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [246]:
X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [242]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

In [243]:
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [244]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

In [249]:
preprocessor.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))]), ['age', 'fare']), ('cat', Pipeline(memory=None,
     steps=...4'>, handle_unknown='ignore',
       n_values=None, sparse=True))]), ['embarked', 'sex', 'pclass'])])

In [252]:
preprocessor.transform(X_train).shape

(1047, 11)

In [251]:
preprocessor.transform(X_test).shape

(262, 11)
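
A fitted `ColumnTransformer` can be stored and reused in the prediction step like any single transformer. The sketch below uses a small made-up frame instead of the Titanic data, a simplified categorical branch, and `sparse_threshold=0` to force a dense result. It also shows that `handle_unknown='ignore'` encodes an unseen category as all zeros.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (assumption)
train = pd.DataFrame({'age': [22.0, np.nan, 35.0, 58.0],
                      'sex': ['male', 'female', 'female', 'male']})

numeric = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('num', numeric, ['age']),
                  ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex'])],
    sparse_threshold=0)  # always return a dense array
preprocessor.fit(train)

# Training step: serialize the fitted object (bytes could be written to a file)
blob = pickle.dumps(preprocessor)

# Prediction step: restore it and transform unseen data
restored = pickle.loads(blob)
new = pd.DataFrame({'age': [np.nan], 'sex': ['unknown']})
out = restored.transform(new)
print(out)
```

The missing age is filled with the training median and scaled with the training statistics, while the unknown category produces zeros in both dummy columns.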