## Load Data - Mixed data types

The Titanic dataset contains information about the passengers on the Titanic.

The nominal task on this dataset is to predict who survived.

### Setup

In [1]:
import pandas as pd
import numpy as np

# Makes numpy values easier to read
np.set_printoptions(precision=3, suppress=True)

import tensorflow as tf
from tensorflow.keras import layers

The raw data can easily be loaded as a pandas DataFrame.

Note: the dataset is not immediately usable as input to a TensorFlow model.

In [2]:
titanic = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")

In [3]:
titanic.head()

Unnamed: 0,survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
2,1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
3,1,female,35.0,1,0,53.1,First,C,Southampton,n
4,0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y


Next, separate the label column, the value we want to predict, from the features.

In [4]:
titanic_features = titanic.copy()
titanic_labels = titanic_features.pop('survived')

Because of the different data types and ranges, we can't simply stack the features into a NumPy array and pass it to a model. Each column needs to be handled individually.

As one option, you can preprocess your data offline to convert categorical columns to numeric columns, then pass the processed output to your TensorFlow model. The disadvantage of that approach is that if you save and export your model, the preprocessing is not saved with it. Keras preprocessing layers avoid this problem because they're part of the model.

In this example, we will build a model that implements the preprocessing logic using the Keras functional API. Subclassing would also work.

The functional API operates on "symbolic" tensors. Normal "eager" tensors have a value. In contrast, these symbolic tensors do not. Instead, they keep track of which operations are run on them and build a representation of the calculation that you can run later.

An example of a symbolic tensor using the functional API:

In [5]:
# Create a symbolic input
input = tf.keras.Input(shape=(), dtype=tf.float32)

# Perform a calculation using the input
result = 2 * input + 1

# the result does not have a value
result

<KerasTensor: shape=(None,) dtype=float32 (created by layer 'tf.__operators__.add')>

In [7]:
calc = tf.keras.Model(inputs=input, outputs=result)

In [8]:
print(calc(1).numpy())
print(calc(2).numpy())

3.0
5.0
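
To make "keeping track of operations" more concrete, here is a toy sketch in plain Python. It is not how Keras implements symbolic tensors; the `Symbol` class and its `run` method are hypothetical names invented for this illustration. Each arithmetic operation returns a new object that stores a recipe instead of a number, and the recipe is only evaluated when a concrete input is supplied.

```python
class Symbol:
    """Toy stand-in for a symbolic tensor: it stores a recipe, not a value."""

    def __init__(self, fn, description):
        self.fn = fn                    # how to compute the value from an input
        self.description = description  # readable trace of the recorded operations

    def __add__(self, other):
        return Symbol(lambda x: self.fn(x) + other,
                      f"({self.description} + {other})")

    __radd__ = __add__

    def __mul__(self, other):
        return Symbol(lambda x: self.fn(x) * other,
                      f"({self.description} * {other})")

    __rmul__ = __mul__

    def run(self, value):
        # Only now is the recorded calculation actually executed.
        return self.fn(value)


toy_input = Symbol(lambda x: x, "input")
toy_result = 2 * toy_input + 1   # no arithmetic happens here, only recording

print(toy_result.description)
print(toy_result.run(1.0), toy_result.run(2.0))
```

As with the Keras model above, the same recorded calculation can be run on as many different inputs as needed.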


To build a preprocessing model, start by building a set of symbolic tf.keras.Input objects, matching the names and data types of the CSV columns.

In [10]:
inputs = {}

for name, column in titanic_features.items():
    dtype = column.dtype
    if dtype == object:
        dtype = tf.string
    else:
        dtype = tf.float32

    inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)

inputs

{'sex': <KerasTensor: shape=(None, 1) dtype=string (created by layer 'sex')>,
 'age': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'age')>,
 'n_siblings_spouses': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'n_siblings_spouses')>,
 'parch': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'parch')>,
 'fare': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'fare')>,
 'class': <KerasTensor: shape=(None, 1) dtype=string (created by layer 'class')>,
 'deck': <KerasTensor: shape=(None, 1) dtype=string (created by layer 'deck')>,
 'embark_town': <KerasTensor: shape=(None, 1) dtype=string (created by layer 'embark_town')>,
 'alone': <KerasTensor: shape=(None, 1) dtype=string (created by layer 'alone')>}

The first step in your preprocessing logic is to concatenate the numeric inputs together and run them through a normalization layer.

In [13]:
numeric_inputs = {name:input for name, input in inputs.items()
                  if input.dtype==tf.float32}

x = layers.Concatenate()(list(numeric_inputs.values()))
norm = layers.Normalization()
norm.adapt(np.array(titanic[numeric_inputs.keys()]))
all_numeric_inputs = norm(x)

all_numeric_inputs

<KerasTensor: shape=(None, 4) dtype=float32 (created by layer 'normalization_1')>
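
As a rough illustration of what the normalization layer learns during adapt and then applies, the plain-Python sketch below standardizes two of the numeric columns by hand. It uses only the five rows shown by titanic.head() above, so the statistics differ from those computed on the full dataset. The `standardize` helper is a hypothetical name, and the sketch omits details such as the guard against zero variance.

```python
import math

# Ages and fares from the five rows shown by titanic.head() above.
ages = [22.0, 38.0, 26.0, 35.0, 28.0]
fares = [7.25, 71.2833, 7.925, 53.1, 8.4583]


def standardize(values):
    """Shift to zero mean and scale to unit variance (population variance)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


normalized_ages = standardize(ages)
normalized_fares = standardize(fares)

print([round(v, 3) for v in normalized_ages])
print([round(v, 3) for v in normalized_fares])
```

After this step both columns are on a comparable scale, even though the raw fares span a much wider range than the raw ages.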

Collect all the symbolic preprocessing results so they can be concatenated later.

In [14]:
preprocessed_inputs = [all_numeric_inputs]

For the string inputs, use the StringLookup layer to map from strings to integer indices in a vocabulary. Next, use a CategoryEncoding layer to convert the indices into float32 data appropriate for the model.

With the default settings, the CategoryEncoding layer creates a one-hot vector for each input. (An embedding layer would also work.)

In [15]:
for name, input in inputs.items():
    if input.dtype == tf.float32:
        continue

    lookup = layers.StringLookup(vocabulary=np.unique(titanic_features[name]))
    one_hot = layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())

    x = lookup(input)
    x = one_hot(x)
    preprocessed_inputs.append(x)
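
The lookup-then-encode step can be sketched in plain Python to show what comes out of it. This is an illustration, not the Keras implementation: `index_of` and `encode` are hypothetical names, and the vocabulary is built only from the five rows shown by titanic.head() above. The sketch assumes the default StringLookup behaviour of reserving index 0 for strings that are not in the vocabulary, which is why the one-hot vector has one more slot than there are known towns.

```python
# Embarkation towns from the five rows shown by titanic.head() above.
towns = ["Southampton", "Cherbourg", "Southampton", "Southampton", "Queenstown"]

# Sorted unique values, similar to np.unique; index 0 is kept for unseen strings.
vocabulary = sorted(set(towns))
index_of = {town: i + 1 for i, town in enumerate(vocabulary)}
num_tokens = len(vocabulary) + 1


def encode(town):
    """Map a string to its vocabulary index, then to a one-hot float vector."""
    index = index_of.get(town, 0)
    return [1.0 if i == index else 0.0 for i in range(num_tokens)]


print(encode("Cherbourg"))
print(encode("Atlantis"))   # not in the vocabulary, so slot 0 is set
```

Each categorical column produces one such vector per row, and these vectors are what get appended to preprocessed_inputs.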

With the collection of inputs and preprocessed_inputs, you can concatenate all the preprocessed inputs together and build a model that handles the preprocessing.

In [16]:
preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)

titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)

In [18]:
tf.keras.utils.plot_model(model=titanic_preprocessing, rankdir="LR",
                          dpi=72, show_shapes=True)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
