# Introduction

This notebook gives an example of how to work with tabular data in TensorFlow.

> This will be a very quick example. For more details, consult the TensorFlow documentation. The material below closely follows the tutorial https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers and uses many of the functions shown there.

**Main takeaways and motivation:**

* Last time we looked at an image classification example in TensorFlow. This notebook continues our exploration by showcasing a different kind of problem and data set.
* Get to know some important TensorFlow concepts and players, e.g., `tf.data` and preprocessing layers.
* See that it is possible to use models that are not neural networks in TensorFlow, and that these are compatible with the other components of TensorFlow.

# Setup

In [None]:
%matplotlib inline
import numpy as np, pandas as pd
from pathlib import Path

In [None]:
import tensorflow as tf

# Load the data

We'll use a version of the heart disease data set from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+Disease

## Load data as a Pandas DataFrame

In [None]:
# We use a version of the data prepared by TensorFlow
url = 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv'

In [None]:
df = pd.read_csv(url)

In [None]:
df.head()

In [None]:
df.info()

## Split train and test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(df)

In [None]:
train.info()

In [None]:
test.info()
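By default, `train_test_split` draws a different random split on every run and does not balance the classes between the two sets. As a hedged sketch (the toy data frame and the values chosen for `test_size` and `random_state` are illustrative assumptions, not part of the original notebook), a reproducible, stratified split could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the heart data: 20 rows, two balanced classes.
toy = pd.DataFrame({"x": range(20), "target": [0] * 10 + [1] * 10})

# Fix the seed for reproducibility and stratify on the label so that
# both splits keep the same class proportions.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["target"]
)

print(len(train_toy), len(test_toy))
print(test_toy["target"].value_counts().to_dict())
```

With only a few hundred rows in the real data set, stratifying on `target` may help keep the test accuracy from depending too much on a lucky or unlucky split.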

## Create a data loader

In [None]:
#tf.data.Dataset

```
`Dataset` usage follows a common pattern:

1. Create a source dataset from your input data.
2. Apply dataset transformations to preprocess the data.
3. Iterate over the dataset and process the elements.

Iteration happens in a streaming fashion, so the full dataset does not need to
fit into memory.
```
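The three steps in the excerpt above can be sketched on toy data. The values here are made up for illustration and have nothing to do with the heart disease data:

```python
import tensorflow as tf

# 1. Create a source dataset from in-memory data (toy values).
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])

# 2. Apply transformations: double each element, then group into batches of 4.
ds = ds.map(lambda x: x * 2).batch(4)

# 3. Iterate over the dataset; each element is now one batch.
batches = [batch.numpy().tolist() for batch in ds]
print(batches)
```

Note that the last batch is smaller than the others, since six elements do not divide evenly into batches of four.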

We create a train and a test dataset from the corresponding data frames. We want to shuffle the training data while keeping the test set as it is. 

In [None]:
# From 
# https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers#create_an_input_pipeline_using_tfdata

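def df_to_dataset(dataframe, shuffle=True, batch_size=8):
    df = dataframe.copy()
    labels = df.pop('target')
    # Build the feature dict from the copy, so that `target` is not among the
    # features. Each column becomes a NumPy array of shape (n, 1).
    df_dict = {key: value.to_numpy()[:, tf.newaxis] for key, value in df.items()}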
    
    # Create a tf Dataset
    ds = tf.data.Dataset.from_tensor_slices((df_dict, labels))
    
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
        
    # Create batches of data
    ds = ds.batch(batch_size)
    
    # Prefetch data (for efficiency: data can be prepared while 
    # current data is processing)
    ds = ds.prefetch(batch_size)
    
    return ds

In [None]:
batch_size = 8
train_ds = df_to_dataset(train, shuffle=True, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

In [None]:
train_ds

Here's a batch of data from the train dataset:

In [None]:
[(train_features, label_batch)] = train_ds.take(1)
print('Features:', list(train_features.keys()))
print('A batch of ages:', train_features['age'])
print('A batch of targets:', label_batch)
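To make the structure of such a batch explicit, here is a minimal sketch on a toy data frame. The column values are invented for illustration, and the construction is a simplified assumption about what `df_to_dataset` does, without shuffling or prefetching:

```python
import pandas as pd
import tensorflow as tf

# Toy data frame with one feature column and the label column.
toy = pd.DataFrame({"age": [63, 67, 41, 56], "target": [0, 1, 0, 1]})
labels = toy.pop("target")

# One array of shape (n, 1) per feature, keyed by column name.
features = {name: col.to_numpy()[:, tf.newaxis] for name, col in toy.items()}

# Each dataset element is a (feature dict, label) pair; batch in pairs of 2.
ds = tf.data.Dataset.from_tensor_slices((features, labels.to_numpy())).batch(2)

feature_batch, label_batch = next(iter(ds))
print(feature_batch["age"].shape, label_batch.shape)
```

Each feature arrives with shape `(batch_size, 1)`, which matches the `shape=(1,)` Keras inputs used later in the notebook.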


# Preprocess the data

We'll use preprocessing layers from Keras (https://www.tensorflow.org/guide/keras/preprocessing_layers) rather than preprocessing the data separately with, for example, Pandas or scikit-learn.

> _"With Keras preprocessing layers, you can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input; models that handle feature normalization or feature value indexing on their own."_ [source](https://www.tensorflow.org/guide/keras/preprocessing_layers)

In [None]:
from tensorflow.keras import layers

In [None]:
df.head()

We see that we have numerical, ordinal and categorical features. We want to normalize the numerical and ordinal features, and one-hot encode the categorical features. 

In [None]:
numerical = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'restecg']
categorical = ['cp', 'fbs', 'exang', 'slope', 'ca', 'thal']

## Set up normalization layers

In [None]:
?layers.Normalization

In [None]:
def get_normalization_layer(name, dataset):
    # Create a Normalization layer for the feature.
    normalizer = layers.Normalization(axis=None)

    # Prepare a Dataset that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)

    return normalizer

> Note how the above code resembles doing normalization in scikit-learn: 

```
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
std.fit(X_train)
```
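To make the analogy concrete, the following sketch checks on toy values (made up for illustration) that `adapt` learns the mean and variance of the data, and that the layer then standardizes its input much like `StandardScaler` would:

```python
import numpy as np
import tensorflow as tf

# Toy feature column of shape (n, 1).
values = np.array([[1.0], [2.0], [3.0], [4.0]], dtype="float32")

# axis=None: learn one scalar mean and variance over all values.
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(values)

# Compare the layer output with a manual standardization in NumPy.
layer_output = np.asarray(normalizer(values))
manual = (values - values.mean()) / values.std()
print(np.allclose(layer_output, manual, atol=1e-4))
```

Unlike a fitted scikit-learn scaler, the adapted layer can be placed inside the model, so the learned statistics travel with it.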

**Test the function:**

In [None]:
chol = train_features['chol']
chol

In [None]:
norm_layer = get_normalization_layer('chol', train_ds)
norm_layer(chol)

## Set up categorical encoding layers

We want to one-hot encode the categorical variables. We can use the various encoding layers from TensorFlow/Keras to achieve this.

We need to encode both the categorical features that are represented as numbers and the string feature `thal`.

In [None]:
#?layers.StringLookup

In [None]:
#?layers.IntegerLookup

In [None]:
#?layers.CategoryEncoding

In [None]:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a layer that turns strings into integer indices.
    if dtype == 'string':
        index = layers.StringLookup(max_tokens=max_tokens)
    # Otherwise, create a layer that turns integer values into integer indices.
    else:
        index = layers.IntegerLookup(max_tokens=max_tokens)

    # Prepare a `tf.data.Dataset` that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)

    # Encode the integer indices.
    encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

    # Apply multi-hot encoding to the indices. The lambda function captures the
    # layer, so you can use them, or include them in the Keras Functional model later.
    return lambda feature: encoder(index(feature))
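The two stages in this function, lookup followed by encoding, can be tried out on toy data. This is a sketch only: the color strings are invented for illustration and are not values from the data set.

```python
import numpy as np
import tensorflow as tf

# Toy string feature of shape (n, 1).
colors = tf.constant([["red"], ["green"], ["red"], ["blue"]])

# Stage 1: learn the vocabulary and map strings to integer indices.
# Index 0 is reserved for out-of-vocabulary values.
lookup = tf.keras.layers.StringLookup()
lookup.adapt(colors)
vocab = lookup.get_vocabulary()

# Stage 2: turn each index into a vector with a single 1.
encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size())
one_hot = np.asarray(encoder(lookup(colors)))

print(vocab)
print(one_hot)
```

Because each row holds exactly one category, the default multi-hot output of `CategoryEncoding` coincides with a one-hot encoding here.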


**Test**

In [None]:
test_type_col = train_features['thal']
test_type_layer = get_category_encoding_layer(name='thal',
                                              dataset=train_ds,
                                              dtype='string')
test_type_layer(test_type_col)

In [None]:
test_sex_col = train_features['sex']
test_sex_layer = get_category_encoding_layer(name='sex',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=2)
test_sex_layer(test_sex_col)

## Preprocess all the features

We normalize all the numerical features and one-hot encode the rest.

In [None]:
all_inputs = []
encoded_features = []

# Numerical features.
for header in numerical:
    numeric_col = tf.keras.Input(shape=(1,), name=header)
    normalization_layer = get_normalization_layer(header, train_ds)
    encoded_numeric_col = normalization_layer(numeric_col)
    all_inputs.append(numeric_col)
    encoded_features.append(encoded_numeric_col)


In [None]:
for header in categorical[:-1]: # All except `thal`, which is a string feature
    categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='int64')
    encoding_layer = get_category_encoding_layer(name=header,
                                               dataset=train_ds,
                                               dtype='int64',
                                               max_tokens=5)
    encoded_categorical_col = encoding_layer(categorical_col)
    all_inputs.append(categorical_col)
    encoded_features.append(encoded_categorical_col)

Encode `thal` separately (a string feature):

In [None]:
categorical_col = tf.keras.Input(shape=(1,), name='thal', dtype='string')
encoding_layer = get_category_encoding_layer(name='thal',
                                           dataset=train_ds,
                                           dtype='string',
                                           max_tokens=5)
encoded_categorical_col = encoding_layer(categorical_col)
all_inputs.append(categorical_col)
encoded_features.append(encoded_categorical_col)

Now we have 12 encoded features:

In [None]:
encoded_features

# Train a neural network

We'll build a simple neural network with a single hidden layer on top of the preprocessing layers defined above.

In [None]:
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)
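To see what `concatenate` does with the per-feature outputs, here is a toy sketch with two inputs. The input names `a` and `b` and their values are hypothetical and chosen only for illustration:

```python
import numpy as np
import tensorflow as tf

# Two symbolic inputs: a single number and a vector of width 3,
# standing in for a normalized feature and a one-hot encoded feature.
a = tf.keras.Input(shape=(1,), name="a")
b = tf.keras.Input(shape=(3,), name="b")

# Join them along the last axis into one feature vector of width 4.
merged = tf.keras.layers.concatenate([a, b])
toy_model = tf.keras.Model([a, b], merged)

a_values = np.array([[1.0], [2.0]], dtype="float32")
b_values = np.array([[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]], dtype="float32")
merged_values = np.asarray(toy_model([a_values, b_values]))
print(merged_values)
```

In the notebook's model, the same operation produces one wide feature vector per patient, which is then fed to the dense layer.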


In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])


Here's our model:

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")


In [None]:
model.fit(train_ds, epochs=10, validation_data=test_ds)


# Evaluate

In [None]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
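Since the final `Dense(1)` layer has no activation and the loss was configured with `from_logits=True`, the model outputs raw logits rather than probabilities. A minimal sketch of converting them, using made-up logit values in place of real model output:

```python
import tensorflow as tf

# Hypothetical logits, standing in for the output of model.predict(...).
logits = tf.constant([[-2.0], [0.5], [3.0]])

# The sigmoid maps logits to probabilities in (0, 1).
probabilities = tf.nn.sigmoid(logits).numpy()

# Threshold at 0.5 to get class predictions (equivalent to logit > 0).
predictions = (probabilities > 0.5).astype("int32")

print(probabilities.round(3).tolist())
print(predictions.tolist())
```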

# Export

Now we can export the model (which includes all the preprocessing steps) to the [SavedModel format](https://www.tensorflow.org/guide/saved_model). It can later be imported elsewhere, for example for model deployment using [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) or similar.

In [None]:
#?tf.keras.models.save_model

> We'll look more at this when we talk about TensorFlow Extended later in the module.

# Extra: Train a tree-based model

Alternatively, we could use the TensorFlow Decision Forests library, which provides a collection of state-of-the-art algorithms for training, serving and interpreting decision forest models (random forests, gradient boosted trees, etc.):

https://github.com/google/yggdrasil-decision-forests

In [None]:
import tensorflow_decision_forests as tfdf

In [None]:
train_ds_trees = tfdf.keras.pd_dataframe_to_tf_dataset(train, label="target")
test_ds_trees = tfdf.keras.pd_dataframe_to_tf_dataset(test, label="target")

In [None]:
model = tfdf.keras.RandomForestModel()

In [None]:
model.compile(
    metrics=["accuracy"])

In [None]:
model.fit(train_ds_trees)

In [None]:
model.summary()

In [None]:
model.evaluate(test_ds_trees)

In [None]:
import IPython

In [None]:
IPython.display.HTML(tfdf.model_plotter.plot_model(model, tree_idx=0, max_depth=3))

## Evaluate

In [None]:
model.make_inspector().variable_importances()

In [None]:
import matplotlib.pyplot as plt

logs = model.make_inspector().training_logs()

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")

plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")

plt.show()


## Export

This model can also be exported to a SavedModel, and then served using TensorFlow Serving or similar.

https://www.tensorflow.org/decision_forests/tensorflow_serving

In [None]:
#model.save("rf_model")