The brain’s architecture served as inspiration for how to build an intelligent machine. This is the idea that sparked artificial neural networks (ANNs).

- **ANN**: A Machine Learning model inspired by the networks of biological neurons found in our brains.

ANNs are at the very core of Deep Learning. 

They are used for Google's image recognition, Apple's Siri, recommendation systems, and learning to play games.

# From Biological to Artificial Neurons

- **Connectionism**: the study of neural networks.

- ANNs frequently outperform other ML techniques on very large and complex problems.

- Increases in computing power have made it possible to train large neural networks in a reasonable amount of time.
- In practice, ANNs rarely get stuck in local optima.

# Biological Neurons

- **Cell body**: contains the nucleus and most of the cell’s complex components.
- **Dendrites**: many branching extensions of the cell body.
- **Axon**: one very long extension of the cell body.
- **Telodendria**: the many branches that the axon splits into near its extremity.
More bio stuff, don't have to remember? 

Individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions.

# Logical Computations with Neurons

- **Artificial neuron**: has one or more binary (on/off) inputs and one binary output.
- It is essentially a simple if statement.

An artificial neuron activates its output when more than a certain number of its inputs are active.

Figure 10-3. Artificial neuron operations
![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1003.png)

- The first network on the left is the identity function: if neuron A is activated, then neuron C gets activated as well (since it receives two input signals from neuron A); but if neuron A is off, then neuron C is off as well.

- The second network performs a logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).

- The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).

- Finally, if we suppose that an input connection can inhibit the neuron’s activity (which is the case with biological neurons), then the fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.
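The four networks can be sketched in a few lines of Python. This is only an illustration under two assumptions that match the descriptions above: a neuron fires when at least two of its input signals are active, and an inhibitory connection is modeled as a signal of -1. The function names are made up for this example.

```python
def neuron(*signals, threshold=2):
    """Return 1 if the summed input signals reach the threshold, else 0."""
    return int(sum(signals) >= threshold)

def identity(a):
    # C receives two input signals from A
    return neuron(a, a)

def logical_and(a, b):
    # one connection from A and one from B: both must be on
    return neuron(a, b)

def logical_or(a, b):
    # two connections from each of A and B: either one is enough
    return neuron(a, a, b, b)

def a_and_not_b(a, b):
    # two excitatory connections from A, one inhibitory connection from B
    # (the -1 weight for inhibition is an assumption of this sketch)
    return neuron(a, a, -b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), a_and_not_b(a, b))
```

Holding `a` at 1 in `a_and_not_b` gives the logical NOT of `b`, as described for the fourth network.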

# The Perceptron

- **Perceptron**: one of the simplest ANN architectures, based on a slightly different artificial neuron whose inputs and output are numbers (instead of binary on/off values) and where each input connection is associated with a weight.

- **Threshold logic unit (TLU)**, sometimes called a **linear threshold unit (LTU)**: computes a weighted sum of its inputs $\left(z=w_{1} x_{1}+w_{2} x_{2}+\cdots+w_{n} x_{n}=\mathbf{x}^{\top} \mathbf{w}\right)$, then applies a step function to that sum and outputs the result: $h_{\mathbf{w}}(\mathbf{x})=\operatorname{step}(z), \text { where } z=\mathbf{x}^{\top} \mathbf{w}$.

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1004.png)

The most common step function used in Perceptrons is the Heaviside step function. Sometimes the sign function is used instead.

Equation 10-1. Common step functions used in Perceptrons (assuming threshold = 0)
$$
\text { heaviside }(z)=\left\{\begin{array}{ll}
0 & \text { if } z<0 \\
1 & \text { if } z \geq 0
\end{array} \quad \operatorname{sgn}(z)=\left\{\begin{array}{ll}
-1 & \text { if } z<0 \\
0 & \text { if } z=0 \\
+1 & \text { if } z>0
\end{array}\right.\right.
$$
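As a rough NumPy sketch of the two step functions and of a single TLU: the weights and inputs below are made-up values, and `x[0] = 1` is assumed to play the role of the bias feature so that `w[0]` acts as the bias weight.

```python
import numpy as np

def heaviside(z):
    # 0 if z < 0, 1 if z >= 0
    return np.where(z >= 0, 1, 0)

def sgn(z):
    # -1 if z < 0, 0 if z == 0, +1 if z > 0
    return np.sign(z)

def tlu(x, w, step=heaviside):
    # weighted sum z = x^T w, followed by the step function
    z = np.dot(x, w)
    return step(z)

x = np.array([1.0, 2.0, 0.5])    # x[0] = 1 is the assumed bias feature
w = np.array([-3.0, 1.0, 2.0])   # made-up weights, w[0] is the bias weight
print(tlu(x, w))                 # here z = -3 + 2 + 1 = 0
print(tlu(x, w, step=sgn))
```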

- A single TLU can be used for simple linear binary classification.

Example: use a single TLU to classify iris flowers based on petal length and width.

- Training a TLU in this case means finding the right values for $w_0$, $w_1$, and $w_2$.

- A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs.

- **Fully connected layer**, or **dense layer**: a layer in which every neuron is connected to every neuron in the previous layer.

- **Input neurons**: neurons that simply output whatever input they are fed.

- **Input layer**: the layer made up of all the input neurons.

- **Bias neuron**: represents an extra bias feature; it outputs 1 all the time.

A Perceptron with two inputs and three outputs is represented in Figure 10-5 below. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multilabel classifier.

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1005.png)

We can compute the output of a layer of neurons all at once.

Equation 10-2. Computing the outputs of a fully connected layer
$$
h_{\mathbf{W}, \mathbf{b}}(\mathbf{X})=\phi(\mathbf{X} \mathbf{W}+\mathbf{b})
$$

- X represents the matrix of input features. It has one row per instance and one column per feature.

- The weight matrix W contains all the connection weights except for the ones from the bias neuron. It has one row per input neuron and one column per artificial neuron in the layer.

- The bias vector b contains all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.

- The function ϕ is called the activation function: when the artificial neurons are TLUs, it is a step function.
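Equation 10-2 maps directly onto a matrix product. The sketch below assumes a layer of three TLUs fed by two inputs (the same shape as the Perceptron in Figure 10-5); the weight, bias, and input values are arbitrary.

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(int)

def dense_layer(X, W, b, activation=heaviside):
    # X: (n_instances, n_inputs), W: (n_inputs, n_neurons), b: (n_neurons,)
    return activation(X @ W + b)

X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 0.5]])           # 3 instances, 2 features
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 2.0, -1.0]])     # one row per input, one column per TLU
b = np.array([-1.0, 0.0, 0.0])       # one bias term per TLU

outputs = dense_layer(X, W, b)
print(outputs.shape)   # (3, 3): one row per instance, one column per neuron
print(outputs)
```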

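- **Hebb’s rule** (or Hebbian learning): the connection weight between two neurons tends to increase when they fire simultaneously.

Perceptrons are trained using a variant of this rule that takes into account the error the network makes when it predicts: the Perceptron learning rule reinforces connections that help reduce the error.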

More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is shown in Equation 10-3 below.

$$
w_{i, j}^{(\text {next step })}=w_{i, j}+\eta\left(y_{j}-\hat{y}_{j}\right) x_{i}
$$

- $w_{i, j}$ is the connection weight between the ith input neuron and the jth output neuron.

- $x_i$ is the ith input value of the current training instance.

- $\hat{y}_j$ is the output of the jth output neuron for the current training instance.

- $y_j$ is the target output of the jth output neuron for the current training instance.

- η is the learning rate.
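The update rule in Equation 10-3 can be sketched in NumPy as follows. This is a minimal illustration, not Scikit-Learn's implementation: the function name, the zero initialization, the fixed number of epochs, and the logical AND toy dataset are all choices made for this example.

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    n_inputs, n_outputs = X.shape[1], y.shape[1]
    W = np.zeros((n_inputs, n_outputs))
    b = np.zeros(n_outputs)
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):            # one training instance at a time
            y_hat = heaviside(x_i @ W + b)
            error = y_i - y_hat               # (y_j - y_hat_j) for each output neuron j
            W += eta * np.outer(x_i, error)   # w_ij += eta * (y_j - y_hat_j) * x_i
            b += eta * error                  # the bias neuron always outputs 1
    return W, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [0], [0], [1]])            # logical AND, linearly separable
W, b = train_perceptron(X, y)
predictions = heaviside(X @ W + b)
print(predictions.ravel())
```

Because the AND dataset is linearly separable, the weights stop changing after a few epochs and every instance is classified correctly.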


The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns.

- **Perceptron convergence theorem**: if the training instances are linearly separable, this algorithm converges to a solution.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # Iris setosa? (np.int was removed in NumPy 1.24)

per_clf = Perceptron()
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

- Perceptron learning algorithm strongly resembles Stochastic Gradient Descent.

Scikit-Learn’s Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization).

- Perceptrons do not output a class probability. This is one reason to prefer Logistic Regression over Perceptrons.

Some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons.

- **Multilayer Perceptron (MLP)**: the ANN obtained by stacking multiple Perceptrons.

# The Multilayer Perceptron and Backpropagation

- An MLP is composed of one input layer, one or more layers of TLUs called hidden layers, and one final layer of TLUs called the output layer.

- **Lower layers**: layers close to the input layer (layers close to the outputs are called upper layers).

Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

- **Deep neural network (DNN)**: an ANN that contains a deep stack of hidden layers.


MLPs are trained with the backpropagation algorithm:

- **Backpropagation training algorithm**: in short, it is Gradient Descent using an efficient technique for computing the gradients automatically. It determines how to tweak each connection weight and bias term to reduce the error.

- **NOTE**:  Automatically computing gradients is called automatic differentiation, or autodiff. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss). If you want to learn more about autodiff, check out Appendix D in the book.

Let’s run through this algorithm in a bit more detail:

- It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.

- Each mini-batch is passed to the network’s input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.

- Next, the algorithm measures the network’s output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).

- Then it computes how much each output connection contributed to the error. This is done analytically by applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise.

- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer. As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).

- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.

Summarizing: for each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).

- **WARNING** : It is important to initialize all the hidden layers’ connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
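The forward pass, reverse pass, and Gradient Descent step can be sketched by hand in NumPy for a tiny MLP. Everything specific here is an assumption made for illustration: one hidden layer of three sigmoid neurons, a mean squared error loss, the XOR toy dataset treated as a single mini-batch, and the function names. Real frameworks compute these gradients with autodiff rather than hand-written formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)       # hidden layer output, kept for the backward pass
    y_hat = sigmoid(a1 @ W2 + b2)   # output layer
    return a1, y_hat

def loss_fn(y_hat, y):
    return np.mean((y_hat - y) ** 2)   # mean squared error

def backward(X, y, params):
    W1, b1, W2, b2 = params
    a1, y_hat = forward(X, params)
    # chain rule from the loss back to the output layer's weighted sums
    d_z2 = 2 * (y_hat - y) / y.size * y_hat * (1 - y_hat)
    grad_W2, grad_b2 = a1.T @ d_z2, d_z2.sum(axis=0)
    # propagate the error gradient back to the hidden layer
    d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
    grad_W1, grad_b1 = X.T @ d_z1, d_z1.sum(axis=0)
    return [grad_W1, grad_b1, grad_W2, grad_b2]

def train(X, y, params, lr=0.1, n_epochs=200):
    losses = []
    for _ in range(n_epochs):
        losses.append(loss_fn(forward(X, params)[1], y))
        grads = backward(X, y, params)
        for p, g in zip(params, grads):
            p -= lr * g               # in-place Gradient Descent step
    losses.append(loss_fn(forward(X, params)[1], y))
    return losses

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets
params = [rng.normal(size=(2, 3)), np.zeros(3),   # random weights break the symmetry
          rng.normal(size=(3, 1)), np.zeros(1)]
losses = train(X, y, params)
print(losses[0], losses[-1])
```

A useful sanity check for hand-written gradients like these is to compare them against finite-difference estimates of the loss.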

- For this algorithm to work properly, the step function is replaced with the logistic (sigmoid) function $\sigma(z)=1 /(1+\exp (-z))$. The step function contains only flat segments, so there is no gradient to work with, whereas the logistic function has a well-defined nonzero derivative everywhere, which lets Gradient Descent make some progress at every step.


### Other Popular Activation Functions

The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1

Just like the logistic function, this activation function is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function). That range tends to make each layer’s output more or less centered around 0 at the beginning of training, which often helps speed up convergence.

The Rectified Linear Unit function: ReLU(z) = max(0, z)

The ReLU function is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. In practice, however, it works very well and has the advantage of being fast to compute, so it has become the default. Most importantly, the fact that it does not have a maximum output value helps reduce some issues during Gradient Descent.

- **Activation function**: the function applied to a neuron's weighted sum to produce its output. A step function is one example; the logistic, tanh, and ReLU functions are common choices when training with backpropagation.

Without a nonlinear activation function between layers, even a deep stack of layers is equivalent to a single layer: chaining several linear transformations only gives another linear transformation, so such a network cannot solve very complex problems.
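A quick numerical check of why nonlinearity is needed: with no activation function between them, two stacked layers compute exactly what a single layer with combined weights computes. The shapes and random values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                            # 5 instances, 4 features
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)   # first layer
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)   # second layer

# two stacked layers with no activation function in between
two_layers = (X @ W1 + b1) @ W2 + b2

# a single layer whose weights and biases combine the two
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))   # True
```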

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1008.png)
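The activation functions discussed above are short enough to write out in NumPy. The `derivative` helper is a finite-difference approximation added for this sketch only; it is handy for plotting or sanity checks, not for training.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return 2 * sigmoid(2 * z) - 1      # same values as np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=1e-6):
    # central finite difference
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(sigmoid(z))
print(np.allclose(tanh(z), np.tanh(z)))   # True
print(derivative(relu, z))                # approximately [0, 0, 1, 1]
```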

# Regression MLPs

- MLPs can be used for regression tasks.

- If you want to predict a single value, you just need a single output neuron.

- For multivariate regression, you need one output neuron per output dimension.

- In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values.

- To guarantee that the output will always be positive, you can use the ReLU activation function in the output layer.

- **Softplus activation function**: a smooth variant of ReLU, softplus(z) = log(1 + exp(z)). It is close to 0 when z is negative and close to z when z is positive.

- If you want to guarantee that the predictions will fall within a given range of values, you can use the logistic function or the hyperbolic tangent, and then scale the labels to the appropriate range: 0 to 1 for the logistic function and –1 to 1 for the hyperbolic tangent.

Typical loss functions:

- The mean squared error (MSE) is the usual choice.

- If you have a lot of outliers in the training set, you may prefer to use the mean absolute error (MAE) instead.

- Alternatively, you can use the Huber loss, which is a combination of both.

**TIP**: The Huber loss is quadratic when the error is smaller than a threshold δ (typically 1) but linear when the error is larger than δ. The linear part makes it less sensitive to outliers than the mean squared error, and the quadratic part allows it to converge faster and be more precise than the mean absolute error.
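A small NumPy sketch of the Huber loss next to MSE and MAE. It assumes the common definition (half the squared error below δ, and δ·(|error| − δ/2) above it); the data is made up so that the last prediction is an outlier.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2               # used where |error| <= delta
    linear = delta * (error - 0.5 * delta)     # used where |error| > delta
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0, 14.0])       # the last prediction is an outlier

print(huber_loss(y_true, y_pred))              # Huber
print(np.mean((y_true - y_pred) ** 2))         # MSE, dominated by the outlier
print(np.mean(np.abs(y_true - y_pred)))        # MAE
```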


### Typical architecture of a regression MLP

| Hyperparameter | Typical value |
|----------------|---------------|
| # input neurons | One per input feature (e.g., 28 x 28 = 784 for MNIST) |
| # hidden layers | Depends on the problem, but typically 1 to 5 |
| # neurons per hidden layer | Depends on the problem, but typically 10 to 100 |
| # output neurons | 1 per prediction dimension |
| Hidden activation | ReLU (or SELU, see Chapter 11) |
| Output activation | None, or ReLU/softplus (if positive outputs) or logistic/tanh (if bounded outputs) |
| Loss function | MSE or MAE/Huber (if outliers) |

# Classification MLPs

- For a binary classification problem, you just need a single output neuron using the logistic activation function.

- The output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class.

- The estimated probability of the negative class is equal to one minus that number.

Multilabel binary classification:

- Dedicate one output neuron to each positive class, each using the logistic activation function. The output probabilities do not necessarily add up to 1.

Multiclass classification:

- If each instance can belong only to a single class (e.g., a single digit from 0 through 9), you need one output neuron per class.

- Use a softmax activation function for the whole output layer.

- **Softmax function**: ensures that all the estimated probabilities are between 0 and 1 and that they add up to 1, which is required when the classes are exclusive.

Loss Function 

- Cross-entropy loss (also called the log loss) is generally a good choice. 
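Softmax and the cross-entropy loss can be sketched in NumPy as below. The logits are made-up numbers, and subtracting the row maximum is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    # subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(probas, class_ids):
    # mean negative log-probability assigned to the true class of each instance
    true_class_probas = probas[np.arange(len(class_ids)), class_ids]
    return -np.mean(np.log(true_class_probas))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probas = softmax(logits)
print(probas.sum(axis=1))                        # each row sums to 1
print(cross_entropy(probas, np.array([0, 2])))
```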


### Typical architecture of a classification MLP

| Hyperparameter | Binary classification | Multilabel binary classification | Multiclass classification |
|----------------|-----------------------|----------------------------------|---------------------------|
| Input and hidden layers | Same as regression | Same as regression | Same as regression |
| # output neurons | 1 | 1 per label | 1 per class |
| Output layer activation | Logistic | Logistic | Softmax |
| Loss function | Cross entropy | Cross entropy | Cross entropy |


You now have all the concepts you need to start implementing MLPs with Keras.



# Implementing MLPs with Keras


**Multibackend Keras**

- To perform the heavy computations required by neural networks, multibackend Keras relies on a computation backend.

- You can choose from three popular open source Deep Learning libraries: TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano. Other implementations have also been released, including ones for JavaScript or TypeScript (to run Keras code in a web browser) and PlaidML (which can run on all sorts of GPU devices, not just Nvidia), among many others.

- TensorFlow itself now comes bundled with its own Keras implementation, tf.keras. It supports only TensorFlow as the backend, but it lets us use TensorFlow-specific APIs such as the Data API.

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1010.png)

- The most popular Deep Learning library, after Keras and TensorFlow, is Facebook’s PyTorch library.

- Once you know Keras, it is not difficult to switch to PyTorch: the two APIs are quite similar, in part because both were inspired by Scikit-Learn and Chainer.



In [1]:
import tensorflow as tf
from tensorflow import keras
tf.__version__

'2.3.0'

In [2]:
keras.__version__
#the version of the Keras API implemented by tf.keras

'2.4.0'

# Building an Image Classifier Using the Sequential API

-  We will use Fashion MNIST

- 70,000 grayscale images of 28 × 28 pixels each, with 10 classes

- the images represent fashion items rather than handwritten digits

- Thus the problem turns out to be significantly more challenging than MNIST.

- A simple linear model reaches about 92% accuracy on MNIST, but only about 83% on Fashion MNIST.

## USING KERAS TO LOAD THE DATASET

In [3]:
# Keras provides some utility functions to fetch and load common datasets
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [4]:
# Every image is represented as a 28 × 28 array rather than a 1D array of size 784
X_train_full.shape

(60000, 28, 28)

In [5]:
X_train_full.dtype

dtype('uint8')

There is no validation set, so we’ll create one now. Since we are going to train the neural network using Gradient Descent, we must also scale the input features. For simplicity, we’ll scale the pixel intensities down to the 0–1 range by dividing them by 255.0 (this also converts them to floats).



In [6]:
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

In [7]:
# For Fashion MNIST we need the list of class names to know what we are dealing with:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# the first image in the training set represents a coat:
class_names[y_train[0]]

'Coat'

In [8]:
y_train[0]

4

Samples from the Fashion MNIST dataset

![](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/assets/mls2_1011.png)

## CREATING THE MODEL USING THE SEQUENTIAL API

Here is a classification MLP with two hidden layers:

In [9]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

Let’s go through this code line by line:

- The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that are just composed of a single stack of layers connected sequentially. This is called the Sequential API.

- Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28*28). This layer does not have any parameters; it is just there to do some simple preprocessing. Since it is the first layer in the model, you should specify the input_shape, which doesn’t include the batch size, only the shape of the instances. Alternatively, you could add a keras.layers.InputLayer as the first layer, setting input_shape=[28,28].

- Next we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per neuron). When it receives some input data, it computes Equation 10-2.

- Then we add a second Dense hidden layer with 100 neurons, also using the ReLU activation function.

- Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive).


- **TIP**: Specifying activation="relu" is equivalent to specifying activation=keras.activations.relu. Other activation functions are available in the keras.activations package; we will use many of them in this book. See https://keras.io/activations/ for the full list.

In [10]:
# Instead of adding the layers one by one as we just did, 
# you can pass a list of layers when creating the Sequential model:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

## USING CODE EXAMPLES FROM KERAS.IO

Code examples documented on keras.io will work fine with tf.keras, but you need to change the imports. For example, consider this keras.io code:



In [None]:
from keras.layers import Dense
# output_layer = Dense(10)

In [11]:
#You must change the imports like this:
from tensorflow.keras.layers import Dense
output_layer = Dense(10)

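Alternatively, you can skip the extra import and use full paths such as keras.layers.Dense(10). The full-path approach is more verbose, but the book uses it so you can easily see which packages to use, and to avoid confusion between standard classes and custom classes.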

- The model’s summary() method displays all the model’s layers.

The summary includes:

- each layer’s name
- its output shape
- its number of parameters

The summary ends with the total number of parameters, including trainable and non-trainable parameters.

In [12]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_4 (Dense)              (None, 100)               30100     
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________


- **Dense layers** often have a lot of parameters.

- For example, the first hidden layer has 784 × 300 connection weights, plus 300 bias terms, which adds up to 235,500 parameters.

- This gives the model quite a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting, especially when you do not have a lot of training data.
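The parameter counts in the summary can be reproduced with a few lines of plain Python. The helper name is made up; it simply applies "inputs × neurons + neurons" to each consecutive pair of layer sizes.

```python
def dense_param_counts(layer_sizes):
    """Number of parameters (weights + biases) of each Dense layer."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# flattened 28 x 28 input, then Dense(300), Dense(100), Dense(10)
counts = dense_param_counts([28 * 28, 300, 100, 10])
print(counts)        # [235500, 30100, 1010]
print(sum(counts))   # 266610
```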



In [13]:
# You can easily get a model’s list of layers, then fetch a layer by its index or by its name:
model.layers

[<tensorflow.python.keras.layers.core.Flatten at 0x1ef67c7cf60>,
 <tensorflow.python.keras.layers.core.Dense at 0x1ef70634ef0>,
 <tensorflow.python.keras.layers.core.Dense at 0x1ef7645aa90>,
 <tensorflow.python.keras.layers.core.Dense at 0x1ef76464630>]

In [14]:
hidden1 = model.layers[1]
hidden1.name

'dense_3'

In [16]:
model.get_layer('dense_3') is hidden1

True

In [17]:
# Access parameters with get_weights() and set_weights()
weights, biases = hidden1.get_weights()
weights

array([[-0.053266  ,  0.01534975, -0.01698087, ..., -0.05883894,
        -0.0068607 ,  0.06915954],
       [-0.03469538,  0.06661914, -0.00159548, ...,  0.03484789,
        -0.06824687,  0.00852676],
       [ 0.05458854,  0.05302885,  0.02688093, ...,  0.02450662,
        -0.00050273,  0.00813788],
       ...,
       [ 0.03552338, -0.06078802,  0.06078289, ...,  0.00366828,
         0.05655251, -0.03215222],
       [-0.05883799, -0.0716325 , -0.05388476, ...,  0.0407314 ,
        -0.03563473,  0.0642748 ],
       [ 0.06628798,  0.01696514, -0.05609043, ..., -0.00144854,
         0.03427551, -0.04280059]], dtype=float32)

In [18]:
weights.shape

(784, 300)

In [19]:
biases

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [20]:
biases.shape

(300,)

- The Dense layer initialized the connection weights randomly (which is needed to break symmetry), and the biases were initialized to zeros, which is fine.

- For a different initialization method, you can set kernel_initializer (kernel is another name for the matrix of connection weights) or bias_initializer when creating the layer.

- Initialization methods will be discussed further in Chapter 11.


- NOTE:  The shape of the weight matrix depends on the number of inputs. This is why it is recommended to specify the input_shape when creating the first layer in a Sequential model. However, if you do not specify the input shape, it’s OK: Keras will simply wait until it knows the input shape before it actually builds the model. This will happen either when you feed it actual data (e.g., during training), or when you call its build() method. Until the model is really built, the layers will not have any weights, and you will not be able to do certain things (such as print the model summary or save the model). So, if you know the input shape when creating the model, it is best to specify it.


## COMPILING THE MODEL

After a model is created, you must call its compile() method to specify the loss function and the optimizer to use. Optionally, you can specify a list of extra metrics to compute during training and evaluation:



In [21]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

- Note: Using loss="sparse_categorical_crossentropy" is equivalent to using loss=keras.losses.sparse_categorical_crossentropy. Similarly, specifying optimizer="sgd" is equivalent to specifying optimizer=keras.optimizers.SGD(), and metrics=["accuracy"] is equivalent to metrics=[keras.metrics.sparse_categorical_accuracy] (when using this loss). We will use many other losses, optimizers, and metrics in this book; the full lists are in the Keras documentation.


- We use the "sparse_categorical_crossentropy" loss because we have sparse labels (i.e., for each instance, there is just a target class index, from 0 to 9 in this case), and the classes are exclusive.


- If instead we had one target probability per class for each instance (such as one-hot vectors to represent class 3), then we would need to use the "categorical_crossentropy" loss instead.


- If we were doing binary classification or multilabel binary classification, then we would use the "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "binary_crossentropy" loss.


- **TIP**: If you want to convert sparse labels (i.e., class indices) to one-hot vector labels, use the keras.utils.to_categorical() function. To go the other way round, use the np.argmax() function with axis=1.
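The relationship between sparse labels and one-hot labels can be checked with NumPy alone. Here `to_one_hot` is a made-up stand-in for keras.utils.to_categorical(), and the predicted probabilities are random numbers; the point is only that both label formats lead to the same cross-entropy value.

```python
import numpy as np

def to_one_hot(class_ids, n_classes):
    # NumPy stand-in for keras.utils.to_categorical()
    return np.eye(n_classes)[class_ids]

sparse_labels = np.array([3, 0, 9])
one_hot = to_one_hot(sparse_labels, n_classes=10)
print(one_hot[0])                     # 1.0 at index 3, zeros elsewhere
print(np.argmax(one_hot, axis=1))     # back to [3 0 9]

# made-up predicted probabilities for the three instances
rng = np.random.default_rng(0)
probas = rng.random((3, 10))
probas /= probas.sum(axis=1, keepdims=True)

# cross entropy computed from class indices and from one-hot vectors
sparse_ce = -np.mean(np.log(probas[np.arange(3), sparse_labels]))
categorical_ce = -np.mean(np.sum(one_hot * np.log(probas), axis=1))
print(np.isclose(sparse_ce, categorical_ce))   # True
```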

Regarding the optimizer, "sgd" means that we will train the model using simple Stochastic Gradient Descent. In other words, Keras will perform the backpropagation algorithm described earlier (i.e., reverse-mode autodiff plus Gradient Descent). We will discuss more efficient optimizers in Chapter 11 (they improve the Gradient Descent part, not the autodiff).



- NOTE: When using the SGD optimizer, it is important to tune the learning rate. So, you will generally want to use optimizer=keras.optimizers.SGD(lr=???) to set the learning rate, rather than optimizer="sgd", which defaults to lr=0.01. Newer versions of Keras name this argument learning_rate; lr is the older alias.

## TRAINING AND EVALUATING THE MODEL
 
Now the model is ready to be trained. For this we simply need to call its fit() method:

In [22]:
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

Epoch 1/30
...
Epoch 30/30


In [30]:
model.save('./models')

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: ./models\assets
