## 1. Introduction

This chapter covers
- A first example of a neural network
- Tensors and tensor operations
- How neural networks learn via backpropagation and gradient descent

**Understanding deep learning** requires familiarity with many simple mathematical
concepts: **tensors**, **tensor operations**, **differentiation**, **gradient descent**, and so on.
Our goal in this chapter will be to build your intuition about these notions without
getting overly technical. In particular, we’ll steer away from mathematical notation,
which can be off-putting for those without any mathematics background and isn’t
strictly necessary to explain things well.

## 2. A first look at a neural network

Let’s look at a concrete example of a neural network that uses the [Python library Keras](https://keras.io/) to learn to **classify handwritten digits**. Unless you already have experience with Keras or similar libraries, you won’t understand everything about this first example right away.

You probably haven’t installed Keras yet (you can install it on your machine even if it doesn’t have a GPU):

>```bash
conda install -c conda-forge tensorflow keras
```

In [1]:
import keras
print(keras.__version__)

Using TensorFlow backend.


2.1.5


The problem we’re trying to solve here is to **classify grayscale images of handwritten digits (28 × 28 pixels)** into their **10 categories (0 through 9)**. We’ll use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), a classic in the machine-learning community, which has been around almost as long as the field itself and has been intensively studied. 

It’s a **set of 60,000 training images**, plus **10,000 test images**, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s. You can think of **“solving” MNIST as the “Hello World” of deep learning** — it’s what you do to verify that your algorithms are working as expected. As you become a machine-learning practitioner, you’ll see MNIST come up over and over again, in scientific papers, blog posts, and so on.

<img width="600" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1lH1e-555mRrwWOMRJmu_1_Ubr0b9TQZK">


### 2.1 Loading the data

In [2]:
# Loading the MNIST dataset in Keras
from keras.datasets import mnist

# The images are encoded as Numpy arrays, and the labels are an array of digits, ranging
# from 0 to 9. The images and labels have a one-to-one correspondence.

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [3]:
print("Type of train_images {}\nType of train_labels {}".format(type(train_images),type(train_labels)))
print("\nShape of train data: images {}, labels {}".format(train_images.shape,train_labels.shape))
print("Shape of test data: images {}, labels {}".format(test_images.shape,test_labels.shape))

Type of train_images <class 'numpy.ndarray'>
Type of train_labels <class 'numpy.ndarray'>

Shape of train data: images (60000, 28, 28), labels (60000,)
Shape of test data: images (10000, 28, 28), labels (10000,)


### 2.2 The network architecture

In [4]:
from keras import models
from keras import layers

# The network architecture
network = models.Sequential()

# First (hidden) layer: 512 units, fed with flattened 28 * 28 images
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))

# Output layer
network.add(layers.Dense(10, activation='softmax'))

To make the network ready for training, we need to pick three more things, as part of the compilation step:
- **A loss function**: How the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
- **An optimizer**: The mechanism through which the network will update itself based on the data it sees and its loss function.
- **Metrics to monitor during training and testing**: Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).
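
To make the accuracy metric concrete, here is a minimal NumPy sketch of "the fraction of the images that were correctly classified". The predictions and labels are made-up values for illustration, and this is not how Keras computes the metric internally.

```python
import numpy as np

# Hypothetical class probabilities predicted for 4 images and 3 classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.6, 0.3, 0.1]])
# Hypothetical true class of each image.
true_labels = np.array([0, 1, 2, 1])

# The predicted class is the one with the highest probability.
predicted_labels = predictions.argmax(axis=1)

# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(predicted_labels == true_labels)
print(accuracy)  # 0.75
```

Three of the four made-up predictions match their label, hence 0.75.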

### 2.3 The compilation step

In [5]:
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

### 2.4 Preparing data

Before training, we’ll preprocess the data by **reshaping** it into the shape the network expects and scaling it so that **all values are in the [0, 1] interval**. Previously, our training images, for instance, were stored in an array of shape **(60000, 28, 28)** of type uint8 with values in the [0, 255] interval. We transform it into a float32 array of shape **(60000, 28 * 28)** with values between 0 and 1.

In [6]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
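
The same reshape-and-scale transformation can be checked on a tiny array without downloading MNIST. The sketch below uses random pixel values as a hypothetical stand-in for the real images; only the shapes, types, and value range matter here.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 3 "images" of 28 x 28 uint8 pixels.
fake_images = np.random.randint(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each 28 x 28 image into a vector of 784 values.
flat = fake_images.reshape((3, 28 * 28))

# Convert to float32 and rescale from [0, 255] to [0, 1].
scaled = flat.astype('float32') / 255

print(fake_images.shape, fake_images.dtype)
print(scaled.shape, scaled.dtype)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```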

### 2.5 Preparing the labels

In [7]:
from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
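
`to_categorical` turns each integer label into a one-hot vector: all zeros except a 1 at the index of the label. The NumPy sketch below illustrates that encoding on a few made-up labels; it shows the idea only and is not the Keras implementation.

```python
import numpy as np

# Hypothetical digit labels, as integers between 0 and 9.
labels = np.array([5, 0, 4, 1])
num_classes = 10

# Start from all zeros, then set a single 1 per row at the label's index.
one_hot = np.zeros((len(labels), num_classes), dtype='float32')
one_hot[np.arange(len(labels)), labels] = 1.0

print(one_hot.shape)  # (4, 10)
print(one_hot[0])     # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

This format matches the 10-way softmax output of the network, which is why the labels are encoded this way before training with `categorical_crossentropy`.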

### 2.6 Train the network

In [8]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x107b5e978>

Two quantities are displayed during training: the **loss of the network** over the training data, and the **accuracy of the network** over the training data. We quickly reach an accuracy of 0.9891 (98.9%) on the training data. Now let’s
check that the model performs well on the test set, too:

### 2.7 Evaluate the network

The test-set accuracy turns out to be **98.2%**, a bit lower than the training-set accuracy. This gap between training accuracy and test accuracy is an example of **overfitting**: the fact that machine-learning models tend to perform worse on new data than on their training data.

In [9]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

test_acc: 0.9817


### 2.8 Summary

This concludes our first example: you just saw how you can build and train a neural network to classify handwritten digits in less than 20 lines of Python code. The following steps were performed:

- Loading the data
- Creating the network architecture
- Compiling the network
- Preparing the data and the labels
- Training the network
- Evaluating the network

## 3. Data representations for neural networks

In the previous example, we started from data stored in **multidimensional Numpy arrays**, also called **tensors**. 

So what’s a tensor?
- At its core, a **tensor is a container for data** — almost always numerical data. So, it’s a
container for numbers.
- You may already be familiar with **matrices, which are 2D tensors**.
- Tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a **dimension is often called an axis**).

In [10]:
import numpy as np

# Scalars (0D tensors) - a tensor that contains only one number is called a scalar
scalar = np.array(12)
print("0D tensor (dimension): {}".format(scalar.ndim))

# Vectors (1D tensors) - an array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly one axis.
tensor1D = np.array([12, 3, 6, 14])
print("1D tensor (dimension): {}".format(tensor1D.ndim))

# Matrices (2D tensors) - an array of vectors is a matrix, or 2D tensor. A matrix has two axes (rows and columns).
tensor2D = np.array([[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]])
print("2D tensor (dimension): {}".format(tensor2D.ndim))

# 3D tensors and higher-dimensional tensors
# If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers.
tensor3D = np.array([[[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]],
                      [[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]],
                      [[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]]])
print("3D tensor (dimension): {}".format(tensor3D.ndim))

0D tensor (dimension): 0
1D tensor (dimension): 1
2D tensor (dimension): 2
3D tensor (dimension): 3


### 3.1 Real-world examples of data tensors

Let’s make data tensors more concrete with a few examples similar to what you’ll encounter later. The data you’ll manipulate will almost always fall into one of the following categories:

- **Vector data**: 2D tensors of shape **(samples, features)**
- **Timeseries data or sequence data**: 3D tensors of shape **(samples, timesteps, features)**
- **Images**: 4D tensors of shape **(samples, height, width, channels)** or **(samples, channels, height, width)**
- **Video**: 5D tensors of shape **(samples, frames, height, width, channels)** or **(samples, frames, channels, height, width)**
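
As a quick illustration of these categories, the sketch below builds one empty NumPy array per category and prints its number of axes and shape. All axis sizes are arbitrary values chosen for the example, and the image and video arrays use the channels-last layout.

```python
import numpy as np

# Axis sizes below are arbitrary, chosen only for illustration.
vector_data = np.zeros((100, 3))          # (samples, features)
timeseries = np.zeros((50, 20, 3))        # (samples, timesteps, features)
images = np.zeros((8, 32, 32, 3))         # (samples, height, width, channels)
video = np.zeros((4, 10, 32, 32, 3))      # (samples, frames, height, width, channels)

for name, tensor in [("vector data", vector_data),
                     ("timeseries", timeseries),
                     ("images", images),
                     ("video", video)]:
    print("{}: {}D tensor of shape {}".format(name, tensor.ndim, tensor.shape))
```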

## 4. The gears of neural networks: tensor operations

All transformations learned by deep neural networks can be reduced to a handful of tensor operations applied to
tensors of numeric data. For instance, it’s possible to **add tensors, multiply tensors**, **reshape tensors**, and so on.

>```python
# this layer takes as input a 2D tensor and returns another 2D tensor
keras.layers.Dense(512, activation='relu')
```

- Specifically, the function is as follows (where W is a 2D tensor and b is a vector, both attributes of the layer):

>```python
output = relu(dot(W, input) + b)
```

- We have three tensor operations here: 
    - a **dot product** (dot) between the input tensor and a tensor named W; 
        - Tensor product operation
    - an **addition (+)** between the resulting 2D tensor and a vector b (1D); 
        - Broadcast operation
    - a **relu** operation. relu(x) is max(x, 0).
        - Element-wise operation
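
Put together, the three operations give a naive forward pass for one layer. The sketch below follows the notation used above, `relu(dot(W, input) + b)`, for a single input vector; the helper name `dense_forward` and the small values of `W`, `b`, and `x` are hypothetical, and this is not the actual Keras implementation.

```python
import numpy as np

def relu(x):
    # Element-wise operation: max(x, 0) applied to every entry.
    return np.maximum(x, 0)

def dense_forward(W, b, inputs):
    # Tensor product, then addition of the bias, then relu.
    return relu(np.dot(W, inputs) + b)

# Hypothetical layer with 2 inputs and 3 output units.
W = np.array([[1.0, -1.0],
              [0.0, 2.0],
              [-1.0, -1.0]])
b = np.array([0.5, 0.0, 0.0])
x = np.array([1.0, 2.0])

output = dense_forward(W, b, x)
print(output)  # [0. 4. 0.]
```

Here `dot(W, x)` is `[-1, 4, -3]`, adding `b` gives `[-0.5, 4, -3]`, and relu clips the negative entries to 0.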

In [43]:
''' Tensor product (similar to a matrix multiplication)
(a, b, c, d) . (d,) -> (a, b, c)
(a, b, c, d) . (d, e) -> (a, b, c, e)
'''

tensor2D = np.array([[1,-1],[0,2]])
tensor1D = np.array([1, 0])

# Tensor product
print(np.dot(tensor2D,tensor1D))

# Element-wise operation
print(tensor2D * tensor1D)

[1 0]
[[1 0]
 [0 0]]
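
The two shape rules quoted in the cell above can be checked directly with `np.dot`. The sizes 2, 3, 4, 5, and 6 below are arbitrary stand-ins for a, b, c, d, and e.

```python
import numpy as np

# Arbitrary sizes standing in for a, b, c, d, e.
t = np.zeros((2, 3, 4, 5))
v = np.zeros((5,))
m = np.zeros((5, 6))

# (a, b, c, d) . (d,) -> (a, b, c)
print(np.dot(t, v).shape)  # (2, 3, 4)

# (a, b, c, d) . (d, e) -> (a, b, c, e)
print(np.dot(t, m).shape)  # (2, 3, 4, 6)
```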


In [33]:
''' Broadcast operation
What happens with addition when the shapes of the two tensors being added differ?
'''

tensor2D = np.array([[1,-1],[0,2]])
tensor1D = np.array([1, 0])

tensor2D + tensor1D


array([[ 2, -1],
       [ 1,  2]])

In [31]:
''' Element-wise operation 
tensor2D =  |1,-1|   ->   relu(tensor2D)   ->  |1,0|     
            |0, 2|                             |0,2|
'''

tensor2D = np.array([[1,-1],[0,2]])
# before relu()
print(tensor2D)

# after relu()
print(np.maximum(tensor2D,0))

[[ 1 -1]
 [ 0  2]]
[[1 0]
 [0 2]]


## 5. The engine of neural networks: gradient-based optimization

As you saw in the previous section, each neural layer from our first network example transforms its input data as follows:

>```python
output = relu(dot(W, input) + b)
```

In this expression, **W** and **b** are tensors that are attributes of the layer. They’re called the **weights or trainable parameters** of the layer (**the kernel and bias attributes**, respectively). 

- These weights contain the **information learned** by the network from **exposure to training data**.
- Initially, these **weight** matrices are filled with small random values (a step called **random initialization**). 
- The weights are then **gradually adjusted**, based on a feedback signal (a process called **training**). Training happens within a training loop:
    1. Draw a **batch** of training samples x and corresponding targets y.
    2. Run the network on x (a step called the **forward pass**) to **obtain predictions y_pred**.
    3. **Compute the loss** of the network on the batch, a measure of the mismatch between y_pred and y.
    4. **Update all weights** of the network in a way that slightly reduces the loss on this batch.
    
**Step 1** sounds easy enough—just I/O code. **Steps 2 and 3** are merely the application of a handful of tensor operations, so you could implement these steps purely from what you learned in the previous section. The difficult part is **step 4**: updating the network’s weights.

### Question!!!
Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?

**One naive solution** would be to freeze all weights in the network except the one scalar coefficient being considered, and try different values for this coefficient. 
- Let’s say the initial value of the **coefficient is 0.3**. 
- After the forward pass on a batch of data, **the loss** of the network on the batch is **0.5**. 
- If you change the **coefficient’s value to 0.35** and rerun the forward pass, the **loss increases to 0.6**. 
- But if you lower the **coefficient to 0.25**, the **loss falls to 0.4**.
- In this case, it seems that updating the coefficient by -0.05 would contribute to minimizing the loss. 
- This would have to be repeated for all coefficients in the network.

But such an approach would be horribly inefficient, because you’d need to compute two forward passes (which are expensive) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions). 
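
The naive procedure can be sketched on a toy problem to make its cost visible. In the code below, `loss_fn` is a hypothetical stand-in for "run a forward pass and compute the loss", and the three weights and the step of 0.05 are made-up values; the counter shows that every coefficient costs two forward passes.

```python
import numpy as np

def loss_fn(weights):
    # Hypothetical stand-in for "forward pass + loss" on one batch.
    target = np.array([0.1, -0.4, 0.7])
    return float(np.sum((weights - target) ** 2))

weights = np.array([0.3, 0.0, 0.5])
delta = 0.05
forward_passes = 0

initial_loss = loss_fn(weights)
current_loss = initial_loss

for i in range(len(weights)):
    # Freeze every weight except weights[i] and try two nearby values.
    for candidate in (weights[i] + delta, weights[i] - delta):
        trial = weights.copy()
        trial[i] = candidate
        trial_loss = loss_fn(trial)
        forward_passes += 1
        # Keep the candidate only if it lowers the loss.
        if trial_loss < current_loss:
            weights, current_loss = trial, trial_loss

print("forward passes:", forward_passes)
print("loss before: {:.4f}, after: {:.4f}".format(initial_loss, current_loss))
```

With 3 weights this takes 6 forward passes for one small update; with millions of weights, the same procedure would need millions of forward passes per update.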

**A much better approach** is to take advantage of the fact that **all operations** used in the network are **differentiable**, and compute the gradient of the loss with regard to the network’s coefficients. You can then **move the coefficients** in the **opposite direction from the gradient**, thus **decreasing the loss**.

### 5.1 What’s a derivative?

Consider a continuous, smooth function $f(x)$, mapping a real number x to a new real number y. 

$$f(x) = y$$

Because the function is continuous, a small change in $x$ can only result in a small change in $y$; that’s the intuition behind continuity. Let’s say you increase $x$ by a small factor $epsilon\_x$: this results in a small $epsilon\_y$ change to $y$:

$$f(x + epsilon\_x) = y + epsilon\_y$$

In addition, because the function is *smooth* (its curve doesn’t have any abrupt angles), **when $epsilon\_x$ is small enough**, around a certain point $p$, it’s possible to approximate $f$ as a linear function of slope $a$, so that $epsilon\_y$ becomes $a * epsilon\_x$:

$$f(x + epsilon\_x) \approx y + a * epsilon\_x$$

**The slope $a$** is called the **derivative** of $f$ at $p$. 

- If $a$ is negative, a small increase of $x$ around $p$ will result in a decrease of $f(x)$.
- If $a$ is positive, a small increase of $x$ around $p$ will result in an increase of $f(x)$.
- Further, the absolute value of $a$ (the magnitude of the derivative) tells you how quickly this increase or decrease
will happen.

If you’re trying to update $x$ by a factor $epsilon\_x$ in order to minimize $f(x)$, and you know the derivative of $f$, then your job is done: the derivative completely describes how $f(x)$ evolves as you change $x$. **If you want to reduce the value of $f(x)$, you just need to move $x$ a little in the opposite direction from the derivative.**
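
A small numerical experiment illustrates this. The sketch below assumes $f(x) = x^2$ as an example function and estimates the slope $a$ at $p = 3$ from a tiny $epsilon\_x$; the helper `numerical_derivative` and the step size of 0.1 are choices made for this illustration.

```python
def f(x):
    # Example smooth function, chosen for illustration.
    return x ** 2

def numerical_derivative(func, x, epsilon_x=1e-6):
    # Slope a such that func(x + epsilon_x) is close to func(x) + a * epsilon_x.
    return (func(x + epsilon_x) - func(x)) / epsilon_x

p = 3.0
a = numerical_derivative(f, p)
print(round(a, 3))  # 6.0

# Move x a little in the opposite direction from the derivative.
step = 0.1
new_x = p - step * a
print(f(new_x) < f(p))  # True
```

The slope is positive at $p = 3$, so decreasing $x$ lowers $f(x)$, which is what the update does.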

### 5.2 Derivative of a tensor operation: the gradient

A **gradient** is the **derivative of a tensor operation**. 

Consider:
- input vector $x$
- matrix $W$
- target $y$
- loss function. 

You can use $W$ to compute a target candidate $y\_pred$, and compute the $loss$, or mismatch, between the target candidate $y\_pred$ and the target $y$:

$$
y\_pred = dot(W, x)
$$

$$
loss\_value = loss(y\_pred, y)
$$

If the data inputs $x$ and $y$ are frozen, then this can be interpreted as a function mapping values of $W$ to loss values:

$$loss\_value = f(W)$$

Let’s say the current value of $W$ is $W0$. 

- The derivative of $f$ at the point $W0$ is a tensor $gradient(f)(W0)$ with the same shape as $W$.
- Each coefficient $gradient(f)(W0)[i, j]$ indicates the direction and magnitude of the change in $loss\_value$ observed when modifying $W0[i, j]$.


Just as, for a function $f(x)$, you can reduce the value of $f(x)$ by moving $x$ a little in the opposite direction from the derivative, with a function $f(W)$ of a tensor, you can reduce $f(W)$ by moving $W$ in the opposite direction from the gradient:

$$
W1 = W0 - step * gradient(f)(W0)
$$

where step is a small scaling factor. 
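
Here is a minimal NumPy sketch of one such update. It assumes a squared-error loss, which the text above does not specify, so that the gradient can be written by hand as `2 * outer(y_pred - y, x)`; the values of `x`, `y`, `W0`, and `step` are made up for the example.

```python
import numpy as np

# Hypothetical frozen data: one input vector and its target.
x = np.array([1.0, 2.0])
y = np.array([1.0, 0.0, -1.0])
W0 = np.zeros((3, 2))

def f(W):
    # loss_value = f(W), assuming a squared-error loss.
    y_pred = np.dot(W, x)
    return np.sum((y_pred - y) ** 2)

def gradient_f(W):
    # Hand-derived gradient of the squared error with regard to W.
    y_pred = np.dot(W, x)
    return 2 * np.outer(y_pred - y, x)

step = 0.05
W1 = W0 - step * gradient_f(W0)

print(gradient_f(W0).shape == W0.shape)  # True
print(f(W0), f(W1))
```

The gradient has the same shape as `W0`, and the loss drops from 2.0 at `W0` to about 0.5 at `W1` after a single step.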

### 5.3 Stochastic gradient descent

1. Draw a batch of training samples $x$ and corresponding targets $y$.
2. Run the network on $x$ to obtain predictions $y\_pred$.
3. Compute the loss of the network on the batch, a measure of the mismatch between $y\_pred$ and $y$.
4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
5. Move the parameters a little in the opposite direction from the gradient 
    - $W -= step * gradient$
    
What I just described is called mini-batch stochastic gradient descent.

Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients. There is, for instance, **SGD with momentum**, as well as **Adagrad**, **RMSProp**, and several others. **Such variants are known as optimization methods or optimizers.**
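
The five steps above can be sketched as a plain mini-batch SGD loop in NumPy. This is an illustration on a toy linear model with a mean-squared-error loss, not what Keras does internally; the data (targets following `y = 3 * x + 1`), the step size, and the batch size are all assumptions made for the example, and the batch axis comes first, so the forward pass is written `x @ W + b`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets follow y = 3 * x + 1 exactly.
x_all = rng.uniform(-1, 1, size=(256, 1))
y_all = 3 * x_all + 1

W = np.zeros((1, 1))
b = np.zeros(1)
step = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(x_all))
    for start in range(0, len(x_all), batch_size):
        idx = order[start:start + batch_size]
        x, y = x_all[idx], y_all[idx]          # 1. draw a batch
        y_pred = x @ W + b                     # 2. forward pass
        error = y_pred - y
        loss = np.mean(error ** 2)             # 3. loss on the batch
        grad_W = 2 * x.T @ error / len(x)      # 4. gradient of the loss
        grad_b = 2 * error.mean(axis=0)
        W -= step * grad_W                     # 5. move against the gradient
        b -= step * grad_b

print(W[0, 0], b[0])
```

After training, `W` and `b` end up close to 3 and 1, the values used to generate the data.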

## 6. Looking back at our first example

You’ve reached the end of this chapter, and you should now have a general understanding of what’s going on behind the scenes in a neural network. Let’s go back to the first example and review each piece of it in the light of what you’ve learned in the previous three sections.

This was the input data:

>```python
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
```

Now you understand that the input images are stored in Numpy tensors, which are here formatted as float32 tensors of shape **(60000, 784) (training data)** and **(10000, 784) (test data)**, respectively.

This was our network:

>```python
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
```

Now you understand that this network consists of a **chain of two Dense layers**, that each layer applies a few simple tensor operations to the input data, and that these operations involve weight tensors. Weight tensors, which are attributes of the layers, are where the knowledge of the network persists.

This was the network-compilation step:

>```python
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

Now you understand that **categorical_crossentropy** is the loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize. You also know that this **reduction of the loss happens via mini-batch stochastic gradient descent**. The exact rules governing a specific use of gradient descent are defined by the **rmsprop optimizer** passed as the first argument.
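
For intuition about what this loss measures, here is a NumPy sketch of softmax followed by categorical cross-entropy, using their standard definitions. The function names, the `eps` clipping constant, and the example scores are choices made for this illustration and do not reproduce the Keras implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability, then normalize.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); y_true is expected to be one-hot encoded.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=1)

# Hypothetical raw scores for 2 samples and 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.1, 0.2, 3.0]])
# Both samples truly belong to class 0.
y_true = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])

probs = softmax(logits)
losses = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))
print(losses)
```

The first sample gives its highest probability to the correct class and gets a small loss; the second puts most of its probability on the wrong class and gets a much larger one.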

Finally, this was the training loop:

>```python
network.fit(train_images, train_labels, epochs=5, batch_size=128)
```

Now you understand what happens when you call fit: the network will start to iterate on the training data in mini-batches of **128 samples, 5 times over** (each iteration over all the training data is called an epoch). At each iteration, the network will compute the gradients of the loss on the batch with regard to the weights, and update the weights accordingly. **After these 5 epochs, the network will have performed 2,345 gradient updates (469 per epoch)**, and the loss of the network will be sufficiently low that the network will be capable of classifying handwritten digits with high accuracy.
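
The update counts follow from simple arithmetic, sketched below: 60,000 samples split into batches of 128 give 468 full batches plus one smaller final batch, hence the rounding up.

```python
import math

num_samples = 60000
batch_size = 128
epochs = 5

# The last batch of an epoch is smaller than 128, but still counts as one update.
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 469 2345
```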

**At this point, you already know most of what there is to know about neural networks.**