## Imports

In [1]:
import os
import matplotlib.pyplot as plt
import pathlib
import PIL

# Data Manipulation
import numpy as np
import pandas as pd

# Remove TF info messages like "Plugin optimizer for device_type GPU is enabled"
# Also removes warnings, but not errors. Set to 1 to keep warnings.
# Set this before importing TensorFlow so that it also applies to import-time messages.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Deep Learning
import tensorflow as tf
import tensorflow_datasets as tfds

# TensorFlow
from tensorflow.keras import Sequential, layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Sklearn
from sklearn import set_config
set_config(display="diagram")

from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder



# Data Pipelines with TensorFlow

🎯 **Objectives**
- How to make pipelines with Deep Learning
- How to load heavy data batch per batch

## **Part I: How to make pipelines with Deep Learning**

### Load Data (the Petfinder dataset)

👉 Let's load the **PetFinder** dataset. 
* Each row describes a pet
* Each column describes an attribute of a pet


🐶 You will use this information to ***predict whether a pet will be adopted or not***. 

| Column | Description| Feature Type | Data Type
|------------|--------------------|----------------------|-----------------
|Type | Type of animal (Dog, Cat) | Categorical | string
|Age |  Age of the pet | Numerical | integer
|Breed1 | Primary breed of the pet | Categorical | string
|Color1 | Color 1 of pet | Categorical | string
|Color2 | Color 2 of pet | Categorical | string
|MaturitySize | Size at maturity | Categorical | string
|FurLength | Fur length | Categorical | string
|Vaccinated | Pet has been vaccinated | Categorical | string
|Sterilized | Pet has been sterilized | Categorical | string
|Health | Health Condition | Categorical | string
|Fee | Adoption Fee | Numerical | integer
|Description | Profile write-up for this pet | Text | string
|PhotoAmt | Total uploaded photos for this pet | Numerical | integer
|AdoptionSpeed | Speed of adoption | Classification | integer

In [None]:
pets = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/petfinder.csv")
pets

In [None]:
pets.target.value_counts()

In [None]:
round(pets.target.value_counts(normalize = True), 2)

In [None]:
# Train-Test Split
train, test = train_test_split(pets, test_size=0.2)

# Train-Val Split
train, val = train_test_split(train, test_size=0.2)

print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
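
Splitting twice with `test_size=0.2` leaves roughly 64% of the rows for training, 16% for validation and 20% for testing. The sketch below checks this arithmetic on 100 dummy rows; the `rows` array and `random_state=0` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # 100 dummy rows standing in for `pets`

# First split: 80% train+val, 20% test
train_val_demo, test_demo = train_test_split(rows, test_size=0.2, random_state=0)

# Second split: 80% of the remaining 80 rows for train, 20% for validation
train_demo, val_demo = train_test_split(train_val_demo, test_size=0.2, random_state=0)

split_sizes = (len(train_demo), len(val_demo), len(test_demo))
print(split_sizes)
```

Passing a `random_state` is optional, but it makes the three sets reproducible from one run to the next.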

In [None]:
# Separating features and target in the Train, Val and Test Set

X_train = train.drop(columns='target')
y_train = train['target']

X_val = val.drop(columns='target')
y_val = val['target']

X_test = test.drop(columns='target')
y_test = test['target']

☝️ Our dataset has both ***numerical*** and ***categorical values***. 

As for any Machine Learning / Deep Learning model, we need to preprocess them before training the model.

👨🏻‍🏫 You have three options:

* **(A)** Use Scikit-Learn to preprocess the dataset before feeding a Neural Network (no Pipeline)  
* **(B)** Wrap your Neural Network into a Scikit-Learn estimator and use a Scikit-Learn Pipeline
* **(C)** Use full TensorFlow pipelines  

### (A) 0️⃣ No pipeline

1. Preprocess data with Scikit-Learn 
2. Feed your Neural Network with the preprocessed data

#### (A.1) Preprocessing


❓ Create `X_train_preproc`, `X_val_preproc`, `X_test_preproc` ❓

* Scaling numerical features
* Encoding categorical features 

In [None]:
# YOUR CODE HERE
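
One possible shape for this preprocessing step, shown on a toy DataFrame rather than on `pets`. This is only a sketch: the column names come from the table above, but the values, the `*_toy` names and the choice of `handle_unknown="ignore"` are assumptions, not the notebook's reference solution.

```python
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy stand-ins for X_train / X_test (hypothetical values)
X_train_toy = pd.DataFrame({
    "Age": [3, 12, 24, 6],
    "Fee": [0, 100, 50, 0],
    "Type": ["Dog", "Cat", "Dog", "Cat"],
})
X_test_toy = pd.DataFrame({
    "Age": [8],
    "Fee": [20],
    "Type": ["Bird"],  # category never seen during training
})

preproc_toy = make_column_transformer(
    # Scale the numerical columns
    (StandardScaler(), make_column_selector(dtype_include="number")),
    # One-hot encode everything else; unseen categories become all zeros
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude="number")),
)

# Fit on the training set only, then reuse the fitted transformer elsewhere
X_train_toy_preproc = preproc_toy.fit_transform(X_train_toy)
X_test_toy_preproc = preproc_toy.transform(X_test_toy)

print(X_train_toy_preproc.shape, X_test_toy_preproc.shape)
```

The transformer is fitted on the training set and only applied to the validation and test sets, so no information leaks from them into the scaling statistics or the list of known categories.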

#### (A.2) Neural Network

❓ **Questions** ❓
* Design (Architecture + Compile) a Neural Network
* Fit it on the training set
* Evaluate its performance on the test set 

In [None]:
# YOUR CODE HERE

### (B) 👻 Wrapping a Neural Net into a Sklearn estimator 

👉 _Pipeline_ and _ColumnTransformer_ are designed for Machine Learning Models.

👻 So, how about disguising a Neural Network as a Scikit-Learn estimator to use the aforementioned tools? It is possible! We can treat a TensorFlow Keras model as a Scikit-Learn estimator using 📚 [**Keras Wrappers**](https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn) 📚. Note that these wrappers were removed from TensorFlow in version 2.12, where the SciKeras package is the recommended replacement.

🔥 Now, we can **_CrossValidate_, _GridSearchCV_, _RandomizedSearchCV_ a Deep Learning model which is wrapped**. 🔥

#### (B.1) Introduction to Keras Wrappers

❓ Run the cells below and try to understand the syntax ❓

In [None]:
def create_model():

    ###########################
    #             1. Define architecture              #
    ###########################

    # Notice that we don't specify the input shape yet
    # as we don't know the shape post-preprocessing!
    # One consequence is that here, you cannot yet print
    # the model's summary. It will be known after fitting it
    # to X_train_preprocessed, y_train

    model = Sequential()
    model.add(layers.Dense(32, activation = 'relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(15, activation = 'relu'))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(1, activation = 'sigmoid'))

    ###########################
    #                 2. Compile model                 #
    ###########################

    model.compile(
        loss = 'binary_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy']
    )

    return model

In [None]:
# It is "Halloween" time
# Disguise your Deep Learning Model into a Scikit Learn model

disguised_deep_model = KerasClassifier(
    build_fn = create_model,
    epochs = 10,
    batch_size = 32,
    verbose = 0
)

#### (B.2) Cross-Validating a Deep Learning Model

❓ Evaluate/CrossValidate your estimator on your training set already preprocessed ❓

In [None]:
pass  # YOUR CODE HERE

#### (B.3) Pipelining a Wrapped Deep Learning model

❓ Wrap your model within a pipeline including a preprocessing step and evaluate it directly on the raw data this time (not on the preprocessed data) ❓

In [None]:
# YOUR CODE HERE

#### (B.4) GridSearchCV on a Wrapped Deep Learning Model

Now that our model is pipelined, we can even **GridSearchCV** its hyper-parameters 🔥

❓ Run the cells below to understand how the syntax works!

In [None]:
def create_model_grid(activation = 'relu', optimizer = 'rmsprop'):
    # Create model
    model = Sequential()
    model.add(layers.Dense(32, activation=activation))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(15, activation=activation))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1, activation='sigmoid'))

    # Compile model
    model.compile(
        loss='binary_crossentropy',
        optimizer=optimizer,
        metrics=['accuracy']
    )

    return model

In [None]:
model_grid = KerasClassifier(
    build_fn = create_model_grid,
    epochs = 10,
    batch_size = 32,
    verbose = 0,
)

In [None]:
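# `preproc` must already be defined: it is the preprocessing ColumnTransformer from your (B.3) solution
pipe_grid = make_pipeline(preproc, model_grid)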
#pipe_grid.get_params()

In [None]:
%%time
# Now, we can grid-search the hyper-params of everything:
# the preprocessing, the architecture, the compilation, and the fit!

param_grid = dict(
    columntransformer__standardscaler__with_mean = [True, False], # Preprocessing hyperparams
    kerasclassifier__activation = ['tanh', 'relu'],               # Architecture hyperparams
    kerasclassifier__optimizer = ["adam", "rmsprop"],             # Compiler hyperparams
    kerasclassifier__batch_size = [8, 64],                        # Fit hyperparams
)

grid = GridSearchCV(
    estimator = pipe_grid,
    param_grid = param_grid,
    cv = 2,
    verbose = 2,
    n_jobs = -1
)
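
The keys of `param_grid` follow Scikit-Learn's `<step>__<parameter>` convention: `make_pipeline` and `make_column_transformer` name each step after its lowercased class name, and nested steps are chained with double underscores. The sketch below lists such names for a small pipeline. The column names are hypothetical, and `LogisticRegression` is only a stand-in for the wrapped Keras model so that the example runs without TensorFlow.

```python
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; LogisticRegression stands in for the wrapped Keras model
preproc_demo = make_column_transformer(
    (StandardScaler(), ["Age", "Fee"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Type"]),
)
pipe_demo = make_pipeline(preproc_demo, LogisticRegression())

# Every tunable parameter of the pipeline, including those of nested steps
param_names = sorted(pipe_demo.get_params().keys())

# Keep only the names that reach into a nested step
nested_names = [name for name in param_names if "__" in name]
for name in nested_names:
    print(name)
```

Any name printed this way can be used as a key in a `GridSearchCV` parameter grid, which is why `columntransformer__standardscaler__with_mean` works in the cell above.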

In [None]:
grid.fit(X_train, y_train);

In [None]:
grid.best_params_

In [None]:
grid.best_score_

In [None]:
cross_val_score(grid.best_estimator_, X_train, y_train)

### (C) 🧨 Full pipeline in TensorFlow

🧨 This option is recommended for real projects, especially when you need:
1. Performance or
2. Production-Readiness



<details>
    <summary><i>Why?</i></summary>

Indeed, having all the preprocessing steps within one single TensorFlow Keras model allows you to generate one <a href="https://www.tensorflow.org/guide/intro_to_graphs">**`tf.Graph`**</a> representation of your model.

A **`tf.Graph`** is mandatory for:
* distributed computations
* and serving on many devices 

(using **`TensorFlow Lite`** for back-end free predictions, for instance).

</details>

The idea is to use **`Normalization layers`** and **`CategoryEncoding layers`** within your model architecture.

#### (C.1) 😌 If the preprocessing pipeline is sequential, everything is easy/straightforward

In [None]:
# EXAMPLE

sequential_pipe = make_pipeline(
        StandardScaler(),
        disguised_deep_model
)

sequential_pipe

For instance, if there was only numerical data in our dataset 👇:

In [None]:
# Imagine we focus exclusively on numerical data and scale them:
X_train_num = X_train.select_dtypes(exclude=['object']).values
X_val_num = X_val.select_dtypes(exclude=['object']).values
X_test_num = X_test.select_dtypes(exclude=['object']).values

In [None]:
# 0. The Normalization Layer

normalizer = Normalization()  # Instantiate a "normalizer" layer
normalizer.adapt(X_train_num) # "Fit" it on the train set

# 1. The Architecture
model = Sequential()
model.add(normalizer) # Using the Normalization layer to standardize the data points during the forward pass
model.add(layers.Dense(32, activation = 'relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(15, activation = 'relu'))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(1, activation = 'sigmoid'))

# 2. Compiling
model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy']
)

# 3. Training
model.fit(
    X_train_num,
    y_train,
    validation_data = (X_val_num, y_val),
    batch_size = 64,
    epochs = 20,
    verbose=0
)

# 4. Evaluating
model.evaluate(X_test_num, y_test)
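
To see what `adapt` actually learns, the sketch below applies a `Normalization` layer on its own, outside of any model. The synthetic `X_demo` array (two columns with very different scales) and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical numerical features with very different scales (e.g. an age and a fee)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(loc=12, scale=5, size=500),
    rng.normal(loc=100, scale=40, size=500),
]).astype("float32")

normalizer_demo = tf.keras.layers.Normalization()
normalizer_demo.adapt(X_demo)  # learns the mean and variance of each column

# Calling the layer subtracts the learned mean and divides by the learned standard deviation
X_scaled = normalizer_demo(X_demo).numpy()

print(X_scaled.mean(axis=0))  # close to 0 for each column
print(X_scaled.std(axis=0))   # close to 1 for each column
```

This is the same computation as `StandardScaler`, except that the statistics are stored inside the layer and therefore saved and served together with the model.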

#### (C.2) 🤯 If the preprocessing pipeline requires parallel column transformers, the TF Sequential API `model.add(...)` is not enough...

In [None]:
pipe_grid

☝️ You will need to use the TF **`Functional API`** to produce a **Non-Sequential Neural Network**

📚 [Google's Tutorial](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) shows you how to solve this exact **PetFinder** dataset with this method

📆 We will show you an example of Non-Sequential Neural Network during the last session of Deep Learning.

🧑🏻‍🏫 Non-Sequential Models look something like this 👇:

```python
# Numerical preprocessing model = function(X_numerical)
model_numerical = ...  

# Categorical preprocessing model = function(X_categorical)
model_categorical = ...

# Combined model
all_features = layers.concatenate([model_numerical, model_categorical])

# Then create the Dense network on the preprocessed features
x = tf.keras.layers.Dense(8, activation="relu")(all_features)
x = tf.keras.layers.Dense(2, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

# all_inputs: the Keras Input tensors feeding the two preprocessing branches
model = tf.keras.Model(all_inputs, output)

model.compile(...)
```

<img src='https://wagon-public-datasets.s3.amazonaws.com/data-science-images/DL/non_sequential_models.png' width=400>
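
The outline above can be filled in on a toy example. The sketch below is one possible version, not the notebook's or the tutorial's exact code: the data values, the `*_demo` names, the layer sizes and the choice of `StringLookup` followed by `CategoryEncoding` for the categorical branch are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy data (hypothetical values): one numerical and one categorical column
age = tf.constant([[3.0], [12.0], [24.0], [6.0]])
pet_type = tf.constant([["Dog"], ["Cat"], ["Dog"], ["Cat"]])
target = tf.constant([[1.0], [0.0], [1.0], [0.0]])

# Numerical branch: Input -> Normalization
age_input = tf.keras.Input(shape=(1,), name="Age")
normalizer_demo = layers.Normalization()
normalizer_demo.adapt(age)
model_numerical = normalizer_demo(age_input)

# Categorical branch: Input -> StringLookup (string to index) -> CategoryEncoding (index to one-hot)
type_input = tf.keras.Input(shape=(1,), dtype="string", name="Type")
lookup = layers.StringLookup()
lookup.adapt(pet_type)
encoder = layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(), output_mode="one_hot")
model_categorical = encoder(lookup(type_input))

# Merge both branches, then stack Dense layers on the preprocessed features
all_features = layers.concatenate([model_numerical, model_categorical])
x = layers.Dense(8, activation="relu")(all_features)
output = layers.Dense(1, activation="sigmoid")(x)

all_inputs = {"Age": age_input, "Type": type_input}
model_demo = tf.keras.Model(all_inputs, output)
model_demo.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Feed the raw columns as a dict whose keys match the Input names
ds_demo = tf.data.Dataset.from_tensor_slices(({"Age": age, "Type": pet_type}, target)).batch(2)
model_demo.fit(ds_demo, epochs=2, verbose=0)

predictions = model_demo.predict(ds_demo, verbose=0)
print(predictions.shape)
```

The model receives raw values (floats and strings) and does all of its preprocessing internally, which is what makes it possible to export the whole pipeline as a single model.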

---

## **Part II: How to deal with heavy datasets?**

⚠️ Most Deep Learning projects use datasets that are **too heavy to be loaded entirely into RAM**

Fortunately, we can train our network **batch per batch**!

✅ Tensorflow provides a powerful [**`tf.data.Dataset`**](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) class that helps to deal with:
* **data loading**
* **processing batch-per-batch**

✅ Keras provides [**`tf.keras.preprocessing`**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing) wrappers around this to avoid getting your hands too dirty:
- **`image_dataset_from_directory`**
- **`text_dataset_from_directory`**
- **`timeseries_dataset_from_array`**

Let's illustrate this with a heavy image dataset:

### (1) Save large files on a hard drive (local or cloud)

❓ Run the following cells (don't focus on the syntax here but on what is going on) ❓

In [None]:
# We download 229 MB of images

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"

data_dir = tf.keras.utils.get_file(
    origin = dataset_url,
    fname = 'flower_photos',
    untar = True
)

data_dir = pathlib.Path(data_dir)

In [None]:
# We just unzipped and saved all the files in the following folder
data_dir

In [None]:
# Notice how each photo is saved in a different folder depending on its category
! ls $data_dir

In [None]:
# In total we have 229 MB of files, compressed.
# Imagine if there were 50 GB of them: they would not fit in RAM
! du -h $data_dir

In [None]:
# We have 3670 jpg images in 5 classes
len(list(data_dir.glob('*/*.jpg')))

In [None]:
# Just look at one image
sunflowers = list(data_dir.glob('sunflowers/*'))
PIL.Image.open(str(sunflowers[0]))

### (2) Prepare to load images into RAM batch per batch

We will use 📚 <a href="https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image_dataset_from_directory">**`image_dataset_from_directory`**</a> 📚

In [None]:
ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir,
    batch_size = 32
)

☝️ Notice how it automatically labelled our image files with the 5 classes!
- The labels are inferred from the folder structure thanks to a default parameter of `image_dataset_from_directory`: **`labels='inferred'`**
- You can manually pass a list of labels instead

In [None]:
# `ds` is a `tf.data.Dataset` object of "tuples"
ds

In [None]:
# Datasets contain no real data until they are iterated over
import sys
sys.getsizeof(ds)

In [None]:
for X_batch, y_batch in ds:
    print(X_batch.shape)
    print(y_batch.shape)

    break # just show the first element

In [None]:
# check first image
plt.imshow(X_batch[0]/255);

📚 <a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset">**tf.data.Dataset**</a> 📚 is just an abstraction that represents a sequence of elements.



They enable us to:

- Load elements batch-per-batch in memory
- From different formats, storage places, etc...
- Apply preprocessing on the fly (ex: shuffle, resize, and many many more)

📚 [**tf/guide/data**](https://www.tensorflow.org/guide/data) 📚
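
These ideas can be tried on a tiny hand-made `Dataset`. In the sketch below, small random arrays stand in for image files; they are an assumption made to keep the example light, and because `from_tensor_slices` keeps its arrays in memory, only the batching and on-the-fly preprocessing are illustrated here, not the loading from disk.

```python
import numpy as np
import tensorflow as tf

# 10 fake "images" of shape (4, 4, 3) with integer labels (hypothetical data)
images = np.random.randint(0, 256, size=(10, 4, 4, 3)).astype("float32")
labels = np.arange(10) % 5

ds_demo = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=10, seed=0)   # shuffle on the fly
    .map(lambda x, y: (x / 255.0, y))  # rescale pixels to [0, 1] on the fly
    .batch(4)                          # group elements into batches of 4
)

# Nothing is computed until the dataset is iterated over
batch_shapes = [tuple(X_batch.shape) for X_batch, y_batch in ds_demo]
print(batch_shapes)
```

The last batch is smaller than the others because 10 is not a multiple of 4; `batch(4, drop_remainder=True)` would discard it instead.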

### (3) Train a model directly on a `Dataset`

❓ Try to fit a very simple dense Neural Network on `ds` 

- You can directly call `model.fit(ds, epochs=1)`
- Your first layer should use `layers.Flatten` to flatten each $(256, 256, 3)$ picture (256 by 256 is the default `image_size`) into a vector of size $256 \times 256 \times 3$ that Dense layers can accept
- You can use **`loss='sparse_categorical_crossentropy'`**: "*sparse_*" avoids *"one-hot-encoding"* the target with `to_categorical(y)`
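
To make the last point concrete, the sketch below compares the two losses on the same predictions: the sparse version takes integer labels directly, while the non-sparse version needs them one-hot encoded first. The labels and probabilities are made-up values for illustration.

```python
import numpy as np
import tensorflow as tf

y_int = np.array([0, 2, 1])                                      # integer class labels
y_onehot = tf.keras.utils.to_categorical(y_int, num_classes=3)   # same labels, one-hot encoded
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]], dtype="float32")            # predicted probabilities

# Sparse version: compares integer labels with the predicted probabilities
sparse_loss = float(tf.keras.losses.SparseCategoricalCrossentropy()(y_int, y_pred))

# Non-sparse version: needs the one-hot encoded labels
dense_loss = float(tf.keras.losses.CategoricalCrossentropy()(y_onehot, y_pred))

print(sparse_loss, dense_loss)  # the two values match
```

Since `image_dataset_from_directory` yields integer labels by default, the sparse loss can be used on `ds` without any extra encoding step.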

In [None]:
# YOUR CODE HERE

💡 The accuracy of this model is approximately 20-25%, which barely beats the 20% baseline (1 divided by 5 categories).

🤡 Why? **A Dense Neural Network architecture is NOT designed for classifying images!**

🚀 In the next session, we will use **`Convolutional Neural Networks (CNN)`** 🚀

### (Bonus) Proper solution to the Flower problem using CNN & Early Stopping

🧑🏻‍🏫 Come back to this section after studying **Deep Learning > 03. Convolutional Neural Networks**.

In [None]:
train_ds = image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(64, 64), # resize on the fly
    batch_size=32
)

val_ds = image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(64, 64), # resize on the fly
    batch_size=32
)

In [None]:
model = tf.keras.Sequential()

model.add(layers.Rescaling(1. / 255))  # the `experimental.preprocessing` path is deprecated since TF 2.6
model.add(layers.Conv2D(32, 3, activation='relu'))
model.add(layers.MaxPooling2D())
model.add(layers.Conv2D(32, 3, activation='relu'))
model.add(layers.MaxPooling2D())
model.add(layers.Conv2D(32, 3, activation='relu'))
model.add(layers.MaxPooling2D())
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(5, activation='softmax'))

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=[EarlyStopping(patience=0)]
)

🏁 Congratulations 

💾 Don't forget to `git add/commit/push` your notebook.

📆 What's next on the menu?

* **Deep Learning** **`> 01. Fundamentals`** and **`> 02. Optimizers, Fit, Loss`** helped you master the foundations of Deep Learning: neurons, layers, architectures, loss functions, optimizers, learning rates, ...
    * We have been working with inputs which are **row vectors** (each row has $p$ features)
    * All the neurons from a layer are **fully connected** to the next layer

* What if we want to classify pictures instead? Each picture has a certain amount of pixels that we could potentially consider as features... But it is more complex than that. See you in the next session **Deep Learning `> 03. Convolutional Neural Networks`**!