# Migrating SKLearn Pipeline to Keras

<img src="sklearn_to_tensorflow.png" alt="Drawing" style="width: 500px;"/>

*Image source: https://medium.com/towards-data-science/from-scikit-learn-to-tensorflow-part-1-9ee0b96d4c85*

In this notebook we look at ways to migrate an Sklearn training pipeline to Tensorflow Keras. There might be a few reasons to move from Sklearn to Tensorflow.

### Possible benefits:
* Flexibility to build almost any ML model architecture.
* Distributed training utilising GPUs.
* Flexibility when it comes to retraining.
* Tensorflow serving.
* etc.

### Objectives
* We will implement a model training pipeline with Sklearn.
* Implement the same pipeline using only Tensorflow Keras modules. Preprocessing layers will be used here.
* Use EasyFlow for pipelines equivalent to Sklearn’s Pipeline and ColumnTransformer modules.

Probably the most common use case is an ML pipeline where preprocessing is implemented in Sklearn and the model in Keras, which leaves us with a separate artifact for each part. With Keras and the `EasyFlow` module it is easy to implement a native Keras solution instead.

For our dataset we will use the popular heart dataset. It consists of a mix of feature types: numerical features, categorical features encoded as ints, and categorical features encoded as strings.

In [None]:
try:
    import easyflow
except ImportError:
    ! pip install easy-tensorflow

## Dataset

In [1]:
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)
labels = dataframe.pop("target")

NUMERICAL_FEATURES = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope']
CATEGORICAL_FEATURES = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca']
# thal is represented as a string
STRING_CATEGORICAL_FEATURES = ['thal']

## Sklearn Training Pipeline

In [2]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), NUMERICAL_FEATURES),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_FEATURES+STRING_CATEGORICAL_FEATURES)
    ]
)

sklearn_model = Pipeline(
    steps=[
        ('preprocessor', preprocess),
        ('classifier', LogisticRegression())
    ]
    
)

sklearn_model.fit(dataframe, labels)
sklearn_model.score(dataframe, labels)

0.8613861386138614

The above Pipeline is not very involved and is easy to implement. The idea here is to show how we can go about implementing an equivalent Pipeline in Tensorflow Keras.

# Migrate Pipeline to Tensorflow Keras

Let’s create our feature preprocessing pipeline natively in Keras. We will duplicate the pipeline implemented in the Sklearn section. Keras recently added preprocessing layers. The Keras equivalents of StandardScaler and OneHotEncoder are Normalization and IntegerLookup respectively. When our categorical features are of type string we first need to apply a StringLookup preprocessing layer followed by an IntegerLookup layer.
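To make the correspondence concrete, here is a minimal plain-Python sketch of the two computations involved: standardization, and vocabulary lookup followed by one-hot encoding. It is not the Keras implementation. The helper names `fit_standardizer`, `fit_vocabulary`, and `one_hot` are hypothetical, the toy values are made up, and reserving index 0 for unseen values is an assumption modelled on the out-of-vocabulary index that the Keras lookup layers use by default.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation, then return a scaler."""
    mean, std = fmean(values), pstdev(values)
    return lambda x: (x - mean) / std


def fit_vocabulary(values):
    """Map each distinct value to an index; index 0 is reserved for unseen values."""
    vocab = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return (lambda x: vocab.get(x, 0)), len(vocab) + 1


def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]


# Toy columns loosely modelled on the heart dataset (values are made up).
ages = [40.0, 50.0, 60.0]
thal = ["normal", "fixed", "reversible", "normal"]

scale_age = fit_standardizer(ages)
lookup_thal, depth = fit_vocabulary(thal)

scaled_ages = [scale_age(a) for a in ages]
encoded_known = one_hot(lookup_thal("normal"), depth)
encoded_unknown = one_hot(lookup_thal("never-seen"), depth)
```

One practical difference to keep in mind: `OneHotEncoder(handle_unknown="ignore")` encodes an unseen category as all zeros, whereas in this sketch an unseen category gets its own column, so the encoded width is one larger than the vocabulary.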

The last thing we need is a Keras implementation for Pipeline and ColumnTransformer. There is currently no implementation in Keras so we will use another package for this:

**EasyFlow**: https://pypi.org/project/easy-tensorflow/

Install it with `pip install easy-tensorflow`.

EasyFlow makes use of Keras preprocessing layers. All high-level pipelines in EasyFlow, such as FeatureUnion, subclass tf.keras.layers.Layer and thus behave like any other Keras layer. (FeatureUnion is equivalent to Sklearn’s ColumnTransformer.)
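The idea behind ColumnTransformer and FeatureUnion can be sketched in a few lines of plain Python: route each group of columns to its own preprocessor and concatenate the outputs into one feature vector. This is only a conceptual sketch and not EasyFlow's implementation. `make_feature_union`, `scale`, and `one_hot_binary` are hypothetical names, and the two stand-in preprocessors use hard-coded statistics instead of fitted ones.

```python
def make_feature_union(steps):
    """Build a callable that applies each transform to its columns and
    concatenates the results.

    `steps` is a list of (name, transform, columns) tuples, the same shape of
    argument that ColumnTransformer and FeatureUnion take in this notebook.
    """
    def apply(row):
        output = []
        for _name, transform, columns in steps:
            for column in columns:
                output.extend(transform(row[column]))
        return output
    return apply


# Hypothetical stand-ins for already fitted preprocessors.
def scale(x):
    # Pretend the learned mean is 50 and the learned standard deviation is 10.
    return [(x - 50.0) / 10.0]


def one_hot_binary(x):
    # One-hot encode a feature that only takes the values 0 and 1.
    return [1.0, 0.0] if x == 0 else [0.0, 1.0]


union = make_feature_union([
    ("num", scale, ["age"]),
    ("cat", one_hot_binary, ["sex", "fbs"]),
])

features = union({"age": 60.0, "sex": 1, "fbs": 0})
```

Here `features` is a flat list with one scaled value for `age` followed by two one-hot pairs, which is the kind of concatenated vector the downstream model consumes.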

In [3]:
import tensorflow as tf
from tensorflow.keras.layers import Normalization, IntegerLookup, StringLookup

from easyflow.data.mapper import TensorflowDataMapper
from easyflow.preprocessing.pipeline import FeatureUnion
from easyflow.preprocessing import FeatureInputLayer, StringToIntegerLookup

We will be making use of the Keras functional API, so we first need to create a feature input layer. Below we pass a dict that maps each feature to its data type as input to FeatureInputLayer. Next we use FeatureUnion from the EasyFlow module to implement a preprocessing pipeline similar to ColumnTransformer. Finally we update the state of the preprocessing layers by running the adapt method on our data. Keras preprocessing layers use *adapt* to update their state, which in this case plays the same role as *fit* in Sklearn.

In [4]:
feature_layer_inputs = FeatureInputLayer({
    "age": tf.float32,
    "sex": tf.float32,
    "cp": tf.float32,
    "trestbps": tf.float32,
    "chol": tf.float32,
    "fbs": tf.float32,
    "restecg": tf.float32,
    "thalach": tf.float32,
    "exang": tf.float32,
    "oldpeak": tf.float32,
    "slope": tf.float32,
    "ca": tf.float32,
    "thal": tf.string
})


preprocessor = FeatureUnion(
    feature_preprocessor_list = [
        ('num', Normalization(), NUMERICAL_FEATURES),
        ('cat', IntegerLookup(output_mode='binary'), CATEGORICAL_FEATURES),
        ('str_cat', StringToIntegerLookup(), STRING_CATEGORICAL_FEATURES)
    ]
)

# to update the states for preprocess layers:
preprocessor.adapt(dataframe)


As stated, FeatureUnion and all other pipelines in EasyFlow are Keras layers and are therefore callable. Below we set up our model.

In [9]:
preprocessed_inputs = preprocessor(feature_layer_inputs)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(preprocessed_inputs)
model = tf.keras.Model(inputs=feature_layer_inputs, outputs=outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.AUC(name='auc')
    ])

model.fit(dict(dataframe), labels, batch_size=32, epochs=100, verbose=0)
tf.keras.models.save_model(model=model, filepath='model')

loaded_model = tf.keras.models.load_model("model")
dict(zip(loaded_model.metrics_names, loaded_model.evaluate(dict(dataframe), labels)))

INFO:tensorflow:Assets written to: model/assets


{'loss': 0.33392223715782166,
 'accuracy': 0.8514851331710815,
 'auc': 0.9247261881828308}

That is it! We successfully ported an Sklearn training pipeline to Tensorflow Keras by utilising preprocessing layers and EasyFlow’s feature preprocessing pipelines. We also persisted the model and loaded it for inference to showcase that it is truly end to end. The results compare well with our Sklearn implementation. A huge advantage here is that preprocessing is part of the network and is persisted with it. Without Keras preprocessing layers and EasyFlow’s pipeline implementation we usually needed a separate Sklearn artifact for preprocessing; using Sklearn for preprocessing and then feeding the data to a Keras model used to be a common design pattern. Serving containers such as Tensorflow Serving can serve our models without a Python or Tensorflow installation, which makes migrating to a native Tensorflow Keras implementation very appealing. All you need is a saved model.
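Because preprocessing lives inside the saved model, a client can send raw feature values. As a rough sketch, this is what a request body for Tensorflow Serving's REST predict endpoint could look like. Several things here are assumptions and not taken from this notebook: the model name `heart`, the host and port (8501 is the REST port conventionally used with the Tensorflow Serving Docker image), input names that match the feature names, and inputs of shape `(1,)`. The record values are illustrative, and the request is only built, not sent.

```python
import json

# Hypothetical model name and host.
MODEL_NAME = "heart"
URL = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

# One raw, unpreprocessed record (values are illustrative).
record = {
    "age": 63.0, "sex": 1.0, "cp": 1.0, "trestbps": 145.0, "chol": 233.0,
    "fbs": 1.0, "restecg": 2.0, "thalach": 150.0, "exang": 0.0,
    "oldpeak": 2.3, "slope": 3.0, "ca": 0.0, "thal": "fixed",
}

# Assumption: each model input has shape (1,), so every value is wrapped in a list.
# The row format of the REST API takes a list of instances, one dict per example.
payload = json.dumps({"instances": [{k: [v] for k, v in record.items()}]})
```

The exact input names and shapes depend on the exported serving signature, so it is worth inspecting the saved model before relying on this layout.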

We can go one step further to improve our training speed. See next section.

# Quick note on improving Keras training speed

Next we use a common pattern to speed up training: we create a separate model that applies only the preprocessing step. When we start from raw data, as in our example, all preprocessing operations run on the CPU and the results are then fed to the GPU. Preprocessing is not something that we train, and it is independent of the forward pass. If it runs inline with training, throughput drops because the GPU sits idle while waiting for data. To speed things up we will prefetch batches of preprocessed data. This ensures that while we are processing one batch on the GPU, the CPU is getting the next batch of preprocessed data ready.

<img src="gpu_cpu_gaps.png" alt="Drawing" style="width: 500px;"/>

*Image taken from https://www.tensorflow.org*
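The overlap that prefetching buys can be illustrated with a small plain-Python producer and consumer. This is a conceptual sketch only and not how `tf.data` implements prefetching. The `prefetch` helper, its `buffer_size` parameter, and the toy batches are hypothetical.

```python
import queue
import threading


def prefetch(batches, preprocess, buffer_size=1):
    """Yield preprocessed batches while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        # Runs in the background: preprocess ahead of the consumer,
        # blocking whenever the buffer is full.
        for batch in batches:
            buffer.put(preprocess(batch))
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


raw_batches = [[1, 2], [3, 4], [5, 6]]
doubled = list(prefetch(raw_batches, lambda batch: [2 * x for x in batch]))
```

While the consumer works on one batch, the producer thread is already preparing the next, which is the same idea as keeping the CPU busy with preprocessing while the GPU runs a training step. A production version would also need to propagate errors raised inside the producer.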

In [6]:
# create our preprocessing model
preprocessing_model = tf.keras.Model(feature_layer_inputs, preprocessed_inputs)

# create training model that will be applied on the forward pass
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(preprocessed_inputs)
training_model = tf.keras.Model(preprocessed_inputs, outputs)
training_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.AUC(name='auc')
    ])

Next we will map the Pandas dataframe to a tf.data.Dataset. The preprocessing model above will then be mapped onto our feature data:

In [7]:
batch_size = 32
dataset_mapper = TensorflowDataMapper() 
dataset = dataset_mapper.map(dataframe, labels)
dataset = dataset.batch(batch_size)

preprocessed_ds = dataset.map(
    lambda x, y: (preprocessing_model(x), y),
    num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

# the dataset is already batched, so batch_size is not passed to fit
training_model.fit(preprocessed_ds, epochs=100, verbose=0)

# evaluate model
dict(zip(training_model.metrics_names, training_model.evaluate(preprocessed_ds)))



{'loss': 0.3193400204181671,
 'accuracy': 0.8712871074676514,
 'auc': 0.927546501159668}

We need one last step to create a model that can be used for inference. Since we split the model into a preprocessing step and a training step, we can't save training_model as is, because it expects preprocessed inputs. We need to plug the preprocessing back into the model. Let's create our inference model.

In [10]:
inference_model = tf.keras.Model(feature_layer_inputs, training_model(preprocessed_inputs))
# compile model to get supplied metrics at inference time
inference_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.AUC(name='auc')
    ])

tf.keras.models.save_model(model=inference_model, filepath='saved_inference_model')
saved_inference_model = tf.keras.models.load_model("saved_inference_model")

dict(zip(saved_inference_model.metrics_names, saved_inference_model.evaluate(dict(dataframe), labels)))

INFO:tensorflow:Assets written to: saved_inference_model/assets


{'loss': 0.3193400204181671,
 'accuracy': 0.8712871074676514,
 'auc': 0.927546501159668}

In conclusion, we showed how to migrate an Sklearn training pipeline to Tensorflow Keras. We started by building our training pipeline in Sklearn, consisting of a preprocessing step with LogisticRegression as our estimator. We then migrated the same pipeline to Keras using preprocessing layers and EasyFlow pipeline modules. Our model architecture was a simple linear model (logistic regression) with no hidden layers. Feature preprocessing was part of the network architecture, and we persisted and loaded the model to run inference. We ended with a section on improving training speed by splitting the preprocessing and model steps for training. Finally, we combined the preprocessing with the trained model again to create our inference model.