# Feature Engineering with a Custom Keras Feature Layer for a Complete Training and Inference Pipeline

This example is similar to the lambda_layer_engineering.ipynb notebook, but here we implement a single custom layer by subclassing the Keras Layer class.

The idea is the same as in that notebook, but let's set the scene again:

Structured data problems often end up using several libraries for preprocessing and feature engineering. A full ML training pipeline might use Pandas for reading data and feature engineering, and scikit-learn for encoding steps such as one-hot encoding and normalization. The estimator might be a scikit-learn classifier, XGBoost, or a Keras model. In the Keras case we end up with one set of artifacts for feature engineering and encoding and a separate set for the saved model. The pipeline is also disconnected: an extra step is needed to convert the encoded data from a dataframe to something like a tf.data.Dataset or a NumPy array before feeding it to the Keras model.

In this post we implement a training pipeline natively in Keras/TensorFlow, from loading data with tf.data to applying feature engineering in a single custom layer that subclasses the Layer class. The engineered features are stateless, meaning they are computed from each row alone and need nothing learned from the dataset. For stateful preprocessing we could use something like the Keras preprocessing layers. The result is a training pipeline in which feature engineering is part of the network architecture, so it is persisted with the model and the saved model can be loaded for inference as a standalone artifact.
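
To make the stateless/stateful distinction concrete, here is a minimal sketch of a stateful preprocessing layer. It assumes a TensorFlow version where `tf.keras.layers.Normalization` is available (older releases expose it under `tf.keras.layers.experimental.preprocessing`), and the cholesterol values are made up for illustration. The layer has to see data through `adapt` before it can be used, which is what makes it stateful.

```python
import numpy as np
import tensorflow as tf

# Made-up cholesterol readings, one feature column (illustration only)
chol = np.array([[200.0], [240.0], [180.0], [260.0]], dtype="float32")

# Stateful: the layer stores a mean and variance learned from the data
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(chol)

# After adapt, the layer standardizes inputs with the stored statistics
scaled = np.asarray(norm(chol))
print(scaled.ravel())
```

A ratio or a threshold flag needs no such fitting step, which is why it can live in a plain custom layer.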

Steps we will follow:
- Load data with tf.data
- Create Input layer
- Implement a custom Feature Layer
- Train model

# Example

For the example below we will use the heart disease dataset. Let's import TensorFlow and read in the data:

In [2]:
import tensorflow as tf
# Optional: only needed to draw the model graph
from tensorflow.keras.utils import plot_model

In [3]:
heart_dir = tf.keras.utils.get_file(
    "heart.csv", origin="http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
)

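# num_epochs=10 makes the dataset repeat the CSV 10 times, so a single pass
# over `dataset` covers every row 10 times
dataset = tf.data.experimental.make_csv_dataset(
    heart_dir,
    batch_size=64,
    label_name='target',
    num_epochs=10
)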
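
To see what `make_csv_dataset` yields, the sketch below runs it on a tiny temporary CSV instead of the downloaded file. The file name, the four rows, and the column subset are made up for illustration. It uses `num_epochs=1` and `shuffle=False` so that one pass over the dataset is exactly one pass over the file.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical four-row file with a subset of the heart columns
csv_text = (
    "age,oldpeak,thal,target\n"
    "63,2.3,fixed,0\n"
    "67,1.5,normal,1\n"
    "41,0.0,reversible,0\n"
    "56,0.8,normal,1\n"
)
path = os.path.join(tempfile.mkdtemp(), "tiny_heart.csv")
with open(path, "w") as f:
    f.write(csv_text)

tiny = tf.data.experimental.make_csv_dataset(
    path, batch_size=2, label_name="target", num_epochs=1, shuffle=False
)

# Each element is a (features, labels) pair; features is a dict of column tensors
features, labels = next(iter(tiny))
for name, tensor in features.items():
    print(name, tensor.dtype.name, tensor.shape)

# Four rows with batch_size=2 and a single epoch give two batches
n_batches = sum(1 for _ in tiny)
print(n_batches)
```

Column dtypes are inferred from the file contents, so integer-looking columns arrive as integers and `thal` arrives as a string tensor.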

In [4]:
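# Raw columns passed through to the model unchanged.
# 'age' is not listed: it only enters the model through an engineered feature.
binary_features = ['sex', 'fbs', 'exang']
numeric_features = ['trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'cp', 'restecg', 'ca']
categoric_features = ['thal']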

dtype_mapper = {
        'age': tf.float32,
        'sex': tf.float32,
        'cp': tf.float32,
        'trestbps': tf.float32,
        'chol': tf.float32,
        'fbs': tf.float32,
        'restecg': tf.float32,
        'thalach': tf.float32,
        'exang': tf.float32,
        'oldpeak': tf.float32,
        'slope': tf.float32,
        'ca': tf.float32,
        'thal': tf.string
}


## Create Input Layer

In [5]:
def create_inputs(data_type_mapper):
    """Create one Keras Input per raw feature.

    Args:
        data_type_mapper (dict): Maps feature name to dtype,
            for example {'age': tf.float32, ...}.

    Returns:
        dict: Keras Input tensors keyed by feature name.
    """
    return {
        feature: tf.keras.Input(shape=(1,), name=feature, dtype=dtype)
        for feature, dtype in data_type_mapper.items()
    }

feature_layer_inputs = create_inputs(dtype_mapper)
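
As a quick check of what the input dictionary looks like, the sketch below builds inputs for a made-up two-feature mapper (`small_mapper` is a hypothetical name, not part of the notebook) and prints the shape and dtype of each symbolic input.

```python
import tensorflow as tf


def create_inputs(data_type_mapper):
    # Same idea as the notebook helper: one Input of shape (1,) per feature
    return {
        feature: tf.keras.Input(shape=(1,), name=feature, dtype=dtype)
        for feature, dtype in data_type_mapper.items()
    }


# Hypothetical mapper with one numeric and one string feature
small_mapper = {"age": tf.float32, "thal": tf.string}
small_inputs = create_inputs(small_mapper)

for name, tensor in small_inputs.items():
    # The leading None is the batch dimension
    print(name, list(tensor.shape), tf.as_dtype(tensor.dtype).name)
```

Because the inputs are keyed by feature name, they line up with the feature dictionaries that the tf.data pipeline yields.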

## Create custom Feature Layer for Feature Engineering

In [6]:
class FeatureLayer(tf.keras.layers.Layer):
    """Custom layer that holds all stateless feature engineering steps.

    Note: call() reads the module-level lists `numeric_features` and
    `binary_features` defined above.
    """
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def call(self, inputs):
        # Interaction flag: 1.0 when age > 50 and sex == 1, else 0.0
        age_and_gender = tf.cast(
            tf.math.logical_and(inputs['age'] > 50, inputs['sex'] == 1), dtype=tf.float32
        )

        # Manual one-hot encoding of the string column 'thal'
        thal_fixed_category = tf.cast(inputs['thal'] == "fixed", dtype=tf.float32)
        thal_reversible_category = tf.cast(inputs['thal'] == "reversible", dtype=tf.float32)
        thal_normal_category = tf.cast(inputs['thal'] == "normal", dtype=tf.float32)

        # Ratio and product of two numeric columns
        trest_chol_ratio = inputs['trestbps'] / inputs['chol']
        trest_cross_thalach = inputs['trestbps'] * inputs['thalach']

        # Concatenate all newly created features into one tensor
        feature_list = [thal_fixed_category, thal_reversible_category, thal_normal_category,
                        age_and_gender, trest_chol_ratio, trest_cross_thalach]

        engineered_feature_layer = tf.keras.layers.concatenate(feature_list, name='engineered_feature_layer')
        numeric_feature_layer = tf.keras.layers.concatenate(
            [inputs[feature] for feature in numeric_features], name='numeric_feature_layer'
        )

        binary_feature_layer = tf.keras.layers.concatenate(
            [inputs[feature] for feature in binary_features], name='binary_feature_layer'
        )

        # Append the raw numeric and binary features to the engineered ones
        feature_layer = tf.keras.layers.concatenate(
            [engineered_feature_layer, numeric_feature_layer, binary_feature_layer], name='feature_layer'
        )
        return feature_layer
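
A layer like this can be checked eagerly on a hand-built batch before it is wired into a model. The sketch below uses a cut-down stand-in (`TinyFeatureLayer` is a hypothetical name) with three of the engineered features and `tf.concat` in place of the Keras concatenate helper. The two patient rows are made up.

```python
import numpy as np
import tensorflow as tf


class TinyFeatureLayer(tf.keras.layers.Layer):
    """Cut-down stand-in with three stateless features."""

    def call(self, inputs):
        # 1.0 when age > 50 and sex == 1, else 0.0
        age_and_gender = tf.cast(
            tf.logical_and(inputs["age"] > 50, tf.equal(inputs["sex"], 1.0)),
            tf.float32,
        )
        # 1.0 when thal is the string "fixed"
        thal_fixed = tf.cast(tf.equal(inputs["thal"], "fixed"), tf.float32)
        # Ratio of two numeric columns
        ratio = inputs["trestbps"] / inputs["chol"]
        return tf.concat([age_and_gender, thal_fixed, ratio], axis=-1)


# Two made-up rows, each feature shaped (batch, 1)
batch = {
    "age": tf.constant([[63.0], [41.0]]),
    "sex": tf.constant([[1.0], [1.0]]),
    "thal": tf.constant([["fixed"], ["normal"]]),
    "trestbps": tf.constant([[150.0], [120.0]]),
    "chol": tf.constant([[300.0], [240.0]]),
}

engineered = np.asarray(TinyFeatureLayer()(batch))
print(engineered)
# [[1.  1.  0.5]
#  [0.  0.  0.5]]
```

Checking a few rows by hand like this catches mistakes such as a misspelled category string, which would otherwise silently produce an all-zero column.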

# Train and Save the Model

Our last step is to create and fit our Keras model. For this example we will use a simple model architecture. We will persist the model and load it for inference.

In [7]:
# Set up the model: batch norm plus a single sigmoid unit, essentially logistic regression
feature_layer = FeatureLayer()(feature_layer_inputs)
x = tf.keras.layers.BatchNormalization(name='batch_norm')(feature_layer)
output = tf.keras.layers.Dense(1, activation='sigmoid', name='target')(x)
model = tf.keras.Model(inputs=feature_layer_inputs, outputs=output)
model.compile(
  loss=tf.keras.losses.BinaryCrossentropy(),
  optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
  metrics=[tf.keras.metrics.BinaryAccuracy(name='accuracy'), 
           tf.keras.metrics.AUC(name='auc')]
)

In [8]:
model.fit(dataset, epochs=10)

# save model
tf.keras.models.save_model(model, "saved_model")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
INFO:tensorflow:Assets written to: saved_model/assets


# Load model and predict on raw data

In [9]:
# load model for inference
loaded_model = tf.keras.models.load_model("saved_model")

In [10]:
# Note: `dataset` is the training data, so these are not held-out metrics
dict(zip(loaded_model.metrics_names, loaded_model.evaluate(dataset)))



{'loss': 0.29918530583381653,
 'accuracy': 0.8712871074676514,
 'auc': 0.9335706830024719}
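
Because feature engineering lives inside the model, a single raw record can be scored without any separate preprocessing step. The sketch below illustrates this with a small untrained stand-in model rather than the saved one (`RatioLayer`, `stand_in_model`, and the record values are all hypothetical). It assumes TensorFlow 2.x, and with the notebook's `loaded_model` the record would need all thirteen raw columns.

```python
import tensorflow as tf


class RatioLayer(tf.keras.layers.Layer):
    """Hypothetical two-feature engineering layer for illustration."""

    def call(self, inputs):
        is_fixed = tf.cast(tf.equal(inputs["thal"], "fixed"), tf.float32)
        ratio = inputs["trestbps"] / inputs["chol"]
        return tf.concat([is_fixed, ratio], axis=-1)


inputs = {
    "thal": tf.keras.Input(shape=(1,), name="thal", dtype=tf.string),
    "trestbps": tf.keras.Input(shape=(1,), name="trestbps", dtype=tf.float32),
    "chol": tf.keras.Input(shape=(1,), name="chol", dtype=tf.float32),
}
features = RatioLayer()(inputs)
output = tf.keras.layers.Dense(1, activation="sigmoid")(features)
stand_in_model = tf.keras.Model(inputs=inputs, outputs=output)

# One raw, unencoded record: strings stay strings, numbers stay numbers
raw_record = {
    "thal": tf.constant([["fixed"]]),
    "trestbps": tf.constant([[145.0]]),
    "chol": tf.constant([[233.0]]),
}

# The model is untrained, so the value itself is meaningless here;
# the point is that raw inputs go straight in
probability = stand_in_model(raw_record).numpy()
print(probability.shape)
```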