## Structured Data Classification

This example demonstrates how to do structured data classification, starting from a raw CSV file.

The project environment was set up by following the instructions at https://developer.apple.com/metal/tensorflow-plugin/

Link
https://keras.io/examples/structured_data/structured_data_classification_from_scratch/

Setup

In [5]:
import tensorflow as tf
import pandas as pd
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


print(tf.__version__)

2.10.0


### Preparing the data

Download the data and load into a dataframe

In [6]:
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

In [7]:
print(dataframe.shape)

(303, 14)


In [8]:
dataframe.head()

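| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |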


Note: the last column, `target`, indicates whether the patient has heart disease (1) or not (0).

Split the data into a training and validation set

In [9]:
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation
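As a rough illustration of where the 242/61 figures come from, here is a minimal plain-Python sketch of a random hold-out split. The helper `split_indices` is hypothetical (it is not part of pandas), and it assumes the validation size is the rounded product of the row count and the fraction, which matches the counts printed above. The specific rows chosen will differ from what `dataframe.sample` picks.

```python
import random


def split_indices(n_rows, val_fraction=0.2, seed=1337):
    """Return (train_indices, val_indices) for a random hold-out split."""
    rng = random.Random(seed)
    # Assumption: the validation size is the rounded fraction of the rows
    n_val = round(n_rows * val_fraction)
    val = sorted(rng.sample(range(n_rows), n_val))
    val_set = set(val)
    # Everything not sampled for validation is used for training
    train = [i for i in range(n_rows) if i not in val_set]
    return train, val


train_idx, val_idx = split_indices(303)
print(len(train_idx), len(val_idx))  # 242 61
```

The two index lists are disjoint and together cover every row, which is the same guarantee that `dataframe.drop(val_dataframe.index)` provides.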


We need to generate a tf.data.Dataset object for each dataframe

In [10]:
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB



2022-12-09 07:06:59.972285: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-09 07:06:59.974624: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Each element of the dataset is a tuple (input, target):

- input is a dictionary of features
- target is the value 0 or 1

In [11]:
for x, y in train_ds.take(1):
    print("input:", x)
    print("target:", y)

input: {'age': <tf.Tensor: shape=(), dtype=int64, numpy=39>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=3>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=94>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=199>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=179>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=0.0>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'normal'>}
target: tf.Tensor(0, shape=(), dtype=int64)


We will need to batch the datasets

In [12]:
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
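To make the effect of batching concrete, the following plain-Python sketch groups 242 items (the training set size reported above) into chunks of 32. The `batch` helper is a hypothetical stand-in for illustration only. It assumes the final, smaller batch is kept rather than dropped, which is the default behaviour of `Dataset.batch`.

```python
def batch(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


samples = list(range(242))  # stand-in for the 242 training rows
batches = list(batch(samples, 32))

# Number of batches, size of the first batch, size of the last batch
print(len(batches), len(batches[0]), len(batches[-1]))  # 8 32 18
```

Eight batches per epoch is consistent with the `1/8` step counter that appears in the training output later in this notebook.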

## Feature preprocessing with Keras layers

#### Integer categorical features

The following features are categorical features encoded as integers:
- sex
- cp
- fbs
- restecg
- exang
- ca

We will have to encode these features using one-hot encoding. Two options are:
- CategoryEncoding(), which requires knowing the range of input values and will error on inputs outside that range
- IntegerLookup(), which builds a lookup table for inputs and reserves an output index for unknown values
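The idea of a lookup table with a reserved index for unknown values can be sketched in plain Python. This is only an illustration of the concept, not how the Keras layers are implemented: `build_vocabulary` and `encode` are hypothetical helpers, and the sketch orders the vocabulary by sorted value, whereas the Keras lookup layers choose their own ordering when adapted.

```python
def build_vocabulary(values):
    """Map each distinct training value to an index; index 0 is kept for unknowns."""
    return {value: i + 1 for i, value in enumerate(sorted(set(values)))}


def encode(value, vocabulary):
    """One-hot encode a value; unseen values activate the reserved index 0."""
    vector = [0] * (len(vocabulary) + 1)
    vector[vocabulary.get(value, 0)] = 1
    return vector


# The cp values from the five rows displayed earlier
cp_vocab = build_vocabulary([1, 4, 4, 3, 2])

print(encode(3, cp_vocab))  # [0, 0, 0, 1, 0]
print(encode(7, cp_vocab))  # [1, 0, 0, 0, 0]  (7 was never seen)
```

The same two helpers work unchanged for string values such as "fixed" or "normal", which is the role StringLookup plays for the "thal" feature.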


#### String categorical features

Categorical features encoded as strings, such as "thal". We will create an index of all
possible values and encode the output using the StringLookup() layer.

#### Continuous Numerical features

The continuous numerical features are:
- age
- trestbps
- chol
- thalach
- oldpeak
- slope

Each of these features will use a Normalization() layer to make sure the
mean of each feature is 0 and its standard deviation is 1.
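The arithmetic behind this kind of normalization is short enough to show directly. The sketch below uses only the standard library. `fit_normalizer` and `normalize` are hypothetical helpers, and the sketch assumes the population standard deviation is used. The Keras layer learns its statistics from the whole training set, not from five values.

```python
import statistics


def fit_normalizer(values):
    """Learn the mean and (population) standard deviation of a feature."""
    return statistics.fmean(values), statistics.pstdev(values)


def normalize(value, mean, std):
    """Shift and scale a value so the feature has mean 0 and std 1."""
    return (value - mean) / std


# The age values from the five rows displayed earlier
ages = [63, 67, 67, 37, 41]
mean, std = fit_normalizer(ages)
scaled = [normalize(a, mean, std) for a in ages]

print(mean)  # 55.0
print(scaled)
```

After scaling, the values have mean 0 and standard deviation 1 (up to floating-point error), so features measured on very different scales, such as age and chol, contribute on comparable terms.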

Note: the FeatureSpace utility cannot be used with tf 2.10

In [13]:
from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import StringLookup

def encode_numerical_feature(feature, name, dataset):
    # Create a normalization layer for our feature
    normalizer = Normalization()

    # Prepare a dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)

    return encoded_feature

def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = StringLookup if is_string else IntegerLookup
    # Create a lookup layer that maps each value to an index and
    # emits a binary (multi-hot) encoded vector
    lookup = lookup_class(output_mode="binary")

    # Prepare a dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible values and assign each a fixed integer index
    lookup.adapt(feature_ds)

    # Encode the input feature
    encoded_feature = lookup(feature)
    return encoded_feature

## Build a model

We will create an end-to-end model

In [21]:
# Categorical features encoded as integers
sex = keras.Input(shape=(1, ), name="sex", dtype="int64")
cp = keras.Input(shape=(1, ), name="cp", dtype="int64")
fbs = keras.Input(shape=(1, ), name="fbs", dtype="int64")
restecg = keras.Input(shape=(1, ), name="restecg", dtype="int64")
exang = keras.Input(shape=(1, ), name="exang", dtype="int64")
ca = keras.Input(shape=(1, ), name="ca", dtype="int64")

# Categorical feature encoded as string
thal = keras.Input(shape=(1, ), name="thal", dtype="string")

# Numerical features
age = keras.Input(shape=(1, ), name="age")
trestbps = keras.Input(shape=(1, ), name="trestbps")
chol = keras.Input(shape=(1, ), name="chol")
thalach = keras.Input(shape=(1, ), name="thalach")
oldpeak = keras.Input(shape=(1, ), name="oldpeak")
slope = keras.Input(shape=(1, ), name="slope")

all_inputs = [
    sex,
    cp,
    fbs,
    restecg,
    exang,
    ca,
    thal,
    age,
    trestbps,
    chol,
    thalach,
    oldpeak,
    slope,
]


# Encoding

# Integer categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, False)
cp_encoded = encode_categorical_feature(cp, "cp", train_ds, False)
fbs_encoded = encode_categorical_feature(fbs, "fbs", train_ds, False)
restecg_encoded = encode_categorical_feature(restecg, "restecg", train_ds, False)
exang_econded = encode_categorical_feature(exang, "exang", train_ds, False)
ca_encoded = encode_categorical_feature(ca, "ca", train_ds, False)

# string categorical features
thal_encoded = encode_categorical_feature(thal, "thal", train_ds, True)

# numerical feature
age_encoded = encode_numerical_feature(age, "age", train_ds)
trestbps_encoded = encode_numerical_feature(trestbps, "trestbps", train_ds)
chol_encoded = encode_numerical_feature(chol, "chol", train_ds)
thalach_encoded = encode_numerical_feature(thalach, "thalach", train_ds)
oldpeak_encoded = encode_numerical_feature(oldpeak, "oldpeak", train_ds)
slope_encoded = encode_numerical_feature(slope, "slope", train_ds)

all_features = layers.concatenate(
    [
        sex_encoded,
        cp_encoded,
        fbs_encoded,
        restecg_encoded,
        exang_econded,
        slope_encoded,
        ca_encoded,
        thal_encoded,
        age_encoded,
        trestbps_encoded,
        chol_encoded,
        thalach_encoded,
        oldpeak_encoded,
    ]
)

x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])



2022-12-09 07:19:04.754509: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.

Let's create a graph to visualize the model's connectivity.

Note: for plot_model to work, you must install
- pydot (`pip install pydot`)
- graphviz (see instructions at https://graphviz.gitlab.io/download/)

In [22]:
keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


## Train the model

In [23]:
model.fit(train_ds, epochs=50, validation_data=val_ds)

Epoch 1/50


2022-12-09 07:19:15.092756: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/50
1/8 [==>...........................] - ETA: 0s - loss: 0.8099 - accuracy: 0.4062

2022-12-09 07:19:17.555070: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 3/50
...
Epoch 50/50


<keras.callbacks.History at 0x11719ddc0>

## Inference on new data

To get a prediction for a new sample, we can call model.predict().

We will need to do the following:
1. Wrap scalars into a list so as to have a batch dimension (models only process batches of data, not single samples)
2. Call convert_to_tensor on each feature

In [24]:
sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)

print(
    "This particular patient had a %.1f percent probability "
    "of having a heart disease, as evaluated by our model."
    % (100 * predictions[0][0],)
)

2022-12-09 07:24:40.313375: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


This particular patient had a 33.0 percent probability of having a heart disease, as evaluated by our model.
