### Introduction to Keras for Engineers

In [1]:
# Setup
import numpy as np
import tensorflow as tf
from tensorflow import keras

### Introduction

Are you a machine learning engineer looking to use Keras to ship deep-learning-powered features in real products? This guide will
serve as your first introduction to core Keras API concepts.

In this guide, you will learn how to:

* Prepare your data before training a model (by turning it into either NumPy arrays or tf.data.Dataset objects).
* Do data preprocessing, for instance feature normalization or vocabulary indexing.
* Build a model that turns your data into useful predictions, using the Keras Functional API.
* Train your model with the built-in Keras fit() function, while being mindful of checkpointing, metrics monitoring, and fault tolerance.
* Evaluate your model on a test dataset and use it for inference on new data.
* Customize what fit() does, for instance to build a GAN.
* Speed up training by leveraging multiple GPUs.
* Refine your model through hyperparameter tuning.

At the end of this guide, you will get pointers to end-to-end examples to solidify these concepts:

* Image classification
* Text classification
* Credit card fraud detection

### Data loading & preprocessing

Neural networks don't process raw data, like text files, encoded JPEG image files, or CSV files. They process vectorized & standardized representations.

* Text files need to be read into string tensors, then split into words. Finally, the words need to be indexed and turned into integer tensors.
* Images need to be read and decoded into integer tensors, then converted to floating point and normalized to small values (usually between 0 and 1).
* CSV data needs to be parsed, with numerical features converted to floating point tensors and categorical features indexed and converted to integer tensors. Then each feature typically needs to be normalized to zero-mean and unit-variance.
* Etc.
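
As a rough illustration of the text case above, here is a minimal NumPy-only sketch that splits documents into words, builds a vocabulary, and turns each document into a row of integers. The helper names `build_vocabulary` and `index_documents` are hypothetical, and reserving index 0 for padding and index 1 for unknown words is an assumption made for this example, not something the guide prescribes.

```python
import numpy as np


def build_vocabulary(documents):
    """Map each distinct lowercase word to an integer index.

    Assumption for this sketch: 0 is reserved for padding and
    1 for out-of-vocabulary words.
    """
    vocab = {"": 0, "[UNK]": 1}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def index_documents(documents, vocab, max_len):
    """Turn documents into a (num_docs, max_len) integer array, padded with 0."""
    out = np.zeros((len(documents), max_len), dtype="int64")
    for i, doc in enumerate(documents):
        # Unknown words fall back to index 1; long documents are truncated.
        ids = [vocab.get(word, 1) for word in doc.lower().split()][:max_len]
        out[i, : len(ids)] = ids
    return out


docs = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(docs)
indexed = index_documents(docs, vocab, max_len=4)
print(indexed)
# [[2 3 4 0]
#  [2 5 4 6]]
```

The result is a fixed-shape integer array, which is the kind of vectorized representation a model can consume.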

Let's start with data loading.

### Data loading

Keras models accept three types of inputs:

* NumPy arrays, just like Scikit-learn and many other Python-based libraries. This is a good option if your data fits in memory.
* TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed file system.
* Python generators that yield batches of data (such as custom subclasses of the keras.utils.Sequence class).
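
To make the third option concrete, here is a minimal sketch of a plain Python generator that yields `(features, labels)` batches from in-memory NumPy arrays. The function name `batch_generator` and the random data are hypothetical. The sketch makes a single pass over the data, so a training loop that runs for several epochs would need to re-create the generator or loop it.

```python
import numpy as np


def batch_generator(features, labels, batch_size):
    """Yield (features, labels) batches in order; the last batch may be smaller."""
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        yield features[start:end], labels[start:end]


# Hypothetical data: 10 samples with 3 features each, and binary labels.
x = np.random.random((10, 3)).astype("float32")
y = np.arange(10) % 2

batches = list(batch_generator(x, y, batch_size=4))
for batch_x, batch_y in batches:
    print(batch_x.shape, batch_y.shape)
# (4, 3) (4,)
# (4, 3) (4,)
# (2, 3) (2,)
```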

Before you start training a model, you will need to make your data available in one of these formats. If you have a large dataset and you are training on GPU(s), consider using Dataset objects, since they will take care of performance-critical details, such as:

* Asynchronously preprocessing your data on CPU while your GPU is busy, and buffering it into a queue.
* Prefetching data into GPU memory so that it is immediately available when the GPU has finished processing the previous batch, letting you reach full GPU utilization.

Keras features a range of utilities to help you turn raw data on disk into a Dataset:

* tf.keras.preprocessing.image_dataset_from_directory turns image files sorted into class-specific folders into a labeled dataset of image tensors.
* tf.keras.preprocessing.text_dataset_from_directory does the same for text files.

In addition, TensorFlow's tf.data module includes other similar utilities, such as tf.data.experimental.make_csv_dataset to load structured data from CSV files.


In [None]:
"""
Example: obtaining a labeled dataset from image files on disk

Suppose you have image files sorted by class in different folders, like this:

main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg

Then you can do:
"""

dataset = keras.preprocessing.image_dataset_from_directory("PATH_TO_MAIN_DIRECTORY", batch_size=64, image_size=(200,200))

for data, labels in dataset:
    print(data.shape)   # (64, 200, 200, 3)
    print(data.dtype)   # float32
    print(labels.shape) # (64, )
    print(labels.dtype) # int32

The label of a sample is the rank of its folder in alphanumeric order. Naturally, this can also be configured explicitly
by passing, e.g. class_names = ['class_a', 'class_b'], in which case label 0 will be class_a and 1 will be class_b.
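
The default labeling rule can be illustrated in a few lines of plain Python. This is only a sketch of the rule as described here, not Keras's internal implementation, and `folder_names` is a hypothetical listing of the class folders.

```python
# Hypothetical folder names, listed in arbitrary order.
folder_names = ["class_b", "class_a"]

# Sorting gives the alphanumeric order; the position in that order is the label.
class_names = sorted(folder_names)
label_of = {name: index for index, name in enumerate(class_names)}

print(label_of)  # {'class_a': 0, 'class_b': 1}
```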

**Example: obtaining a labeled dataset from text files on disk**

Likewise for text: if you have .txt documents sorted by class in different folders, you can do:

In [None]:
dataset = keras.preprocessing.text_dataset_from_directory('path/to/main_directory', batch_size=64)

# For demonstration, iterate over the batches yielded by the dataset.
for data, labels in dataset:
    print(data.shape)   # (64, )
    print(data.dtype)   # string
    print(labels.shape) # (64, )
    print(labels.dtype) # int32

### Data preprocessing with Keras

Once your data is in the form of string/int/float NumPy arrays, or a Dataset object (or Python generator) that yields batches of
string/int/float tensors, it is time to **preprocess** the data. This can mean:

 * Tokenization of string data, followed by token indexing
 * Feature normalization
 * Rescaling the data to small values (in general, input values to a neural network should be close to zero --
  typically we expect either data with zero-mean and unit-variance, or data in the [0, 1] range.)
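
The last two items can be sketched directly in NumPy. The arrays below are hypothetical stand-ins for real data, and the small epsilon added to the standard deviation is an assumption made here to avoid division by zero for constant features. In practice, the mean and standard deviation would usually be computed on the training data only and then reused for validation and test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rescaling: hypothetical 8-bit image data in [0, 255] mapped to [0, 1].
images = rng.integers(0, 256, size=(8, 32, 32, 3)).astype("float32")
rescaled = images / 255.0

# Feature normalization: hypothetical tabular data with 4 numerical features.
features = rng.normal(loc=50.0, scale=10.0, size=(1000, 4)).astype("float32")
mean = features.mean(axis=0)  # per-feature mean
std = features.std(axis=0)    # per-feature standard deviation
normalized = (features - mean) / (std + 1e-7)

print(rescaled.min(), rescaled.max())
print(normalized.mean(axis=0), normalized.std(axis=0))
```

After these steps, `rescaled` lies in the [0, 1] range and each column of `normalized` has approximately zero mean and unit variance.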

#### The ideal machine learning model is end-to-end

In general, you should seek to do data preprocessing **as part of your model** as much as possible, not via an external data
preprocessing pipeline. That's because external data preprocessing makes your models less p