### 2.1. A first look at a neural network

####  Loading the MNIST dataset in Keras


In [1]:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz


#### The network architecture
    The core building block of neural networks is the layer, a data-processing module that you can think of as a filter for data. Some data goes in, and it comes out in a more useful form. Specifically, layers extract representations out of the data fed into them—hopefully, representations that are more meaningful for the problem at hand. Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation. A deep-learning model is like a sieve for data processing, made of a succession of increasingly refined data filters—the layers.

In [2]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

#### The compilation step
    To make the network ready for training, we need to pick three more things as part of the compilation step:
##### A loss function
    How the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
##### An optimizer
    The mechanism through which the network will update itself based on the data it sees and its loss function.
##### Metrics 
    What to monitor during training and testing. Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).

In [3]:
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

#### Preparing the image data

In [4]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

#### Preparing the labels

In [5]:
from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
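
    To make the effect of to_categorical concrete, here is a minimal NumPy sketch of one-hot encoding. The helper name one_hot and the sample labels are assumptions made for illustration; this is not Keras's own implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 array with a 1 at each label's index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # For row i, set the column given by labels[i] to 1.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample_labels = [5, 0, 4]  # hypothetical digit labels
encoded = one_hot(sample_labels, num_classes=10)
print(encoded.shape)
print(encoded[0])
```

    Each row contains a single 1 at the position of the class, which is the target format that categorical_crossentropy expects.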

In [6]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa3724570f0>

In [7]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

test_acc: 0.9813


### 2.2. Data representations for neural networks

#### 2.2.1. Scalars (0D tensors)
#### 2.2.2. Vectors (1D tensors)
#### 2.2.3. Matrices (2D tensors)
#### 2.2.4. 3D tensors and higher-dimensional tensors
#### 2.2.5. Key attributes
    Number of axes (rank)
    Shape
    Data type

In [9]:
print(train_images.ndim)
print(train_images.shape)
print(train_images.dtype)

2
(60000, 784)
float32


#### 2.2.6. Manipulating tensors in Numpy
#### 2.2.7. The notion of data batches
#### 2.2.8. Real-world examples of data tensors
    Vector data: 2D tensors of shape (samples, features)
    Timeseries data or sequence data: 3D tensors of shape (samples, timesteps, features)
    Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
    Video: 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
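
    The shapes listed above can be illustrated with placeholder NumPy arrays. The sizes used here (100 samples, 16 features, 28 x 28 frames, and so on) are arbitrary assumptions chosen only to show the axis order.

```python
import numpy as np

vector_data = np.zeros((100, 16))          # (samples, features)
timeseries_data = np.zeros((100, 50, 16))  # (samples, timesteps, features)
images = np.zeros((100, 28, 28, 3))        # (samples, height, width, channels)
video = np.zeros((4, 10, 28, 28, 3))       # (samples, frames, height, width, channels)

# Moving the channels axis converts channels-last images to channels-first.
images_channels_first = np.transpose(images, (0, 3, 1, 2))

for name, tensor in [("vector", vector_data),
                     ("timeseries", timeseries_data),
                     ("images", images),
                     ("video", video)]:
    print(name, tensor.ndim, tensor.shape)
```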
    
#### 2.2.9. Vector data
#### 2.2.10. Timeseries data or sequence data
#### 2.2.11. Image data
#### 2.2.12. Video data

### 2.3. The gears of neural networks: tensor operations
#### 2.3.1. Element-wise operations
#### 2.3.2. Broadcasting
#### 2.3.3. Tensor dot
#### 2.3.4. Tensor reshaping
#### 2.3.5. Geometric interpretation of tensor operations
#### 2.3.6. A geometric interpretation of deep learning

### 2.4. The engine of neural networks: gradient-based optimization
    
    Draw a batch of training samples x and corresponding targets y.
    Run the network on x (a step called the forward pass) to obtain predictions y_pred.
    Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
    Update all weights of the network in a way that slightly reduces the loss on this batch.
    
#### 2.4.1. What’s a derivative?
    Consider a continuous, smooth function f(x) = y, mapping a real number x to a new real number y. Because the function is continuous, a small change in x can only result in a small change in y (that’s the intuition behind continuity). Let’s say you increase x by a small amount epsilon_x: this results in a small epsilon_y change to y:
    f(x + epsilon_x) = y + epsilon_y
    
    In addition, because the function is smooth (its curve doesn’t have any abrupt angles), when epsilon_x is small enough, around a certain point p, it’s possible to approximate f as a linear function of slope a, so that epsilon_y becomes a * epsilon_x:
    f(x + epsilon_x) = y + a * epsilon_x

    The slope a is called the derivative of f at p. If a is negative, a small increase of x around p will result in a decrease of f(x) (as shown in figure 2.10); if a is positive, a small increase of x will result in an increase of f(x). Further, the absolute value of a (the magnitude of the derivative) tells you how quickly this increase or decrease will happen.
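
    As a quick numerical check of this idea, the sketch below estimates the slope a with a finite difference. The function f(x) = x ** 2, the point p = -3 and the helper name numerical_derivative are assumptions chosen for illustration.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, p, epsilon_x=1e-6):
    """Estimate the slope of func at p as epsilon_y / epsilon_x."""
    epsilon_y = func(p + epsilon_x) - func(p)
    return epsilon_y / epsilon_x

p = -3.0
a = numerical_derivative(f, p)  # close to the exact derivative 2 * p = -6

# The slope is negative, so a small increase of x around p decreases f(x).
epsilon_x = 1e-3
linear_estimate = f(p) + a * epsilon_x  # y + a * epsilon_x
actual = f(p + epsilon_x)
print(a, linear_estimate, actual)
```

    The linear estimate and the actual value agree closely because epsilon_x is small; the gap grows as epsilon_x gets larger.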
    
#### 2.4.2. Derivative of a tensor operation: the gradient

    A gradient is the derivative of a tensor operation. It’s the generalization of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.
    
    y_pred = dot(W, x)
    loss_value = loss(y_pred, y)
    loss_value = f(W)
    
    Let’s say the current value of W is W0. Then the derivative of f at the point W0 is a tensor gradient(f)(W0) with the same shape as W, where each coefficient gradient(f)(W0)[i, j] indicates the direction and magnitude of the change in loss_value you observe when modifying W0[i, j]. That tensor gradient(f)(W0) is the gradient of the function f(W) = loss_value at W0.

    You saw earlier that the derivative of a function f(x) of a single coefficient can be interpreted as the slope of the curve of f. Likewise, gradient(f)(W0) can be interpreted as the tensor describing the slope of f(W) around W0 along each coefficient of W, that is, the direction in which f(W) increases fastest.

    For this reason, in much the same way that, for a function f(x), you can reduce the value of f(x) by moving x a little in the opposite direction from the derivative, with a function f(W) of a tensor, you can reduce f(W) by moving W in the opposite direction from the gradient: for example, W1 = W0 - step * gradient(f)(W0) (where step is a small scaling factor). That means moving against the direction of steepest ascent, which intuitively should put you lower on the loss surface. Note that the scaling factor step is needed because gradient(f)(W0) only approximates the behavior of f when you’re close to W0, so you don’t want to get too far from W0.
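
    A minimal NumPy sketch of the update W1 = W0 - step * gradient(f)(W0). The notes leave the loss unspecified, so a sum of squared errors is assumed here, and its gradient is derived by hand; the values of x, y, W0 and step are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # input vector (assumed values)
y = np.array([1.0, -1.0])      # target vector (assumed values)

def f(W):
    """loss_value = f(W): sum of squared errors of y_pred = dot(W, x)."""
    y_pred = np.dot(W, x)
    return np.sum((y_pred - y) ** 2)

def gradient_f(W):
    """Gradient of f with regard to W for the squared-error loss above."""
    y_pred = np.dot(W, x)
    return 2.0 * np.outer(y_pred - y, x)

W0 = np.zeros((2, 3))
grad = gradient_f(W0)  # same shape as W0

# Check one coefficient against a finite difference on W0[0, 1].
eps = 1e-6
W_shifted = W0.copy()
W_shifted[0, 1] += eps
numeric = (f(W_shifted) - f(W0)) / eps

step = 0.01
W1 = W0 - step * grad  # move against the gradient
print(f(W0), f(W1))
```

    With a small enough step, f(W1) is lower than f(W0); a step that is too large can overshoot and increase the loss instead.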
    
#### 2.4.3. Stochastic gradient descent

    
    Draw a batch of training samples x and corresponding targets y.
    Run the network on x to obtain predictions y_pred.
    Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
    Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
    Move the parameters a little in the opposite direction from the gradient (for example, W -= step * gradient), thus reducing the loss on the batch a bit.

    The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random).
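
    The five steps of mini-batch SGD can be sketched in NumPy on a toy problem. Everything specific here is an assumption made for illustration: the synthetic linear-regression data, the mean-squared-error loss, the hand-derived gradient, and the values of step and batch_size. A real network would obtain the gradient through a backward pass instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption made for this sketch).
true_w = np.array([2.0, -1.0, 0.5])
inputs = rng.normal(size=(1000, 3))
targets = inputs @ true_w

w = np.zeros(3)  # parameters to learn
step = 0.1
batch_size = 32

for _ in range(200):
    # 1. Draw a random batch of training samples x and targets y.
    batch_indices = rng.choice(len(inputs), size=batch_size, replace=False)
    x, y = inputs[batch_indices], targets[batch_indices]
    # 2. Forward pass: compute predictions.
    y_pred = x @ w
    # 3. Loss on the batch (mean squared error).
    loss_value = np.mean((y_pred - y) ** 2)
    # 4. Gradient of the loss with regard to w (derived by hand here).
    gradient = 2.0 * x.T @ (y_pred - y) / batch_size
    # 5. Move the parameters a little against the gradient.
    w -= step * gradient

print(w)  # should end up close to true_w
```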
    
    Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients. There is, for instance, SGD with momentum, as well as Adagrad, RMSProp, and several others. Such variants are known as optimization methods or optimizers. In particular, the concept of momentum, which is used in many of these variants, deserves your attention. Momentum addresses two issues with SGD: convergence speed and local minima.
    
    Consider the loss plotted as a function of a single network parameter. Around a certain parameter value, there is a local minimum: around that point, moving left would result in the loss increasing, but so would moving right. If the parameter under consideration were being optimized via SGD with a small learning rate, then the optimization process would get stuck at the local minimum instead of making its way to the global minimum.

    You can avoid such issues by using momentum, which draws inspiration from physics. A useful mental image here is to think of the optimization process as a small ball rolling down the loss curve. If it has enough momentum, the ball won’t get stuck in a ravine and will end up at the global minimum. Momentum is implemented by moving the ball at each step based not only on the current slope value (current acceleration) but also on the current velocity (resulting from past acceleration). In practice, this means updating the parameter w based not only on the current gradient value but also on the previous parameter update, such as in this naive implementation:
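
    The implementation referred to above is not reproduced in these notes. What follows is a minimal sketch of one common formulation of momentum, not necessarily the original listing; the toy loss (w - 3) ** 2 and the values of momentum and learning_rate are assumptions.

```python
def loss_gradient(w):
    """Gradient of the toy loss (w - 3) ** 2."""
    return 2.0 * (w - 3.0)

w = 0.0
past_velocity = 0.0
momentum = 0.9       # fraction of the previous update that is kept
learning_rate = 0.1

for _ in range(300):
    gradient = loss_gradient(w)
    # The new update combines the previous update with the current gradient.
    velocity = momentum * past_velocity - learning_rate * gradient
    w = w + velocity
    past_velocity = velocity

print(w)  # close to the minimum at w = 3
```

    Because part of the previous update carries over, the parameter keeps moving even where the current gradient is small, which is what lets the "ball" roll through shallow dips.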
    
#### 2.4.4. Chaining derivatives: the Backpropagation algorithm
### 2.5. Looking back at our first example