<img src="./pics/DL.png" width=110 align="left" style="margin-right: 10px">

# Introduction to Deep Learning

## 04. Dense Networks

---

## Extending Backpropagation

There are several ways to extend the backpropagation algorithm. We can change the activation function or the cost function, introduce regularization, or use optimization methods other than plain gradient descent. 

In this notebook we'll look at the most commonly used modifications.

### Activation functions

Changing the activation function is one of the most direct ways to change the network, and it is critical to match the function to the data.

#### Sigmoid

<img src="./pics/functions/sigmoid.png" width=350 align="left">

$$\begin{align}
    \text{Range: } & \{0...1\} \\
    \sigma(x)      & = \frac{1}{1 + e^{-x}} \\
    \sigma'(x)     & = \sigma (x) \left( 1 - \sigma(x) \right)
\end{align}$$

<br><br><br><br>
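
As a quick numerical check of the derivative identity above, the sketch below compares $\sigma(x)\left(1 - \sigma(x)\right)$ with a central finite difference. The function names and the step size `h` are illustrative choices, not part of the original notebook.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative expressed through the function value itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The two printed columns should agree to several decimal places, and the derivative peaks at $x = 0$ with the value $0.25$.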


#### Hyperbolic tangent

<img src="./pics/functions/tanh.png" width=350 align="left">

$$\begin{align}
    \text{Range: } & \{-1...1\} \\
    \tanh(x)       & = \frac{1 - e^{-2x}}{1 + e^{-2x}} \\
    \tanh'(x)      & = 1 - \frac{\left( e^x - e^{-x} \right) ^2}{ \left( e^x + e^{-x} \right)^2} \\
                   & = 1 - \tanh^2(x)
\end{align}$$

<br>


#### Linear

<img src="./pics/functions/linear.png" width=350 align="left">

$$\begin{align}
    \text{Range: } & \{-\infty...\infty\} \\
    f(x)           & = x \\
    f'(x)          & = 1
\end{align}$$

<br><br><br><br><br><br><br><br>


#### ReLU - Rectified Linear Unit

<img src="./pics/functions/relu.png" width=350 align="left">

$$\begin{align}
    \text{Range: }    & \{0...\infty\} \\
    \mathrm{relu}(x)  & = \max \left(0, x \right) \\
    \mathrm{relu}'(x) & = { \begin{cases} 
                                0 & {\text{for }} x < 0\\
                                1 & {\text{for }} x \geq 0
                            \end{cases} }
\end{align}$$

<br><br><br><br><br><br><br>


#### Leaky ReLU - Leaky Rectified Linear Unit

<img src="./pics/functions/leaky_relu.png" width=350 align="left">

A parametrized version of ReLU. Instead of setting the output to zero for negative inputs, it multiplies the input by a very small number $\alpha$; a typical choice is $\alpha = 0.01$. 

$$\begin{align}
    \text{Range: }            & \{-\infty...\infty\} \\
    \mathrm{relu_{leaky}}(x)  & = \max \left(\alpha x, x \right) \\
    \mathrm{relu_{leaky}}'(x) & = { \begin{cases} 
                                        \alpha & {\text{for }} x < 0\\
                                        1      & {\text{for }} x \geq 0
                                    \end{cases} }
\end{align}$$

<br><br><br><br>
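
A minimal sketch of ReLU and leaky ReLU for scalar inputs. The derivative is not defined at $x = 0$; the code assumes the common convention of returning 1 there. The function names are illustrative.

```python
def relu(x):
    """max(0, x)"""
    return max(0.0, x)

def relu_prime(x):
    # Undefined at x = 0; returning 1 there is a convention, not a necessity.
    return 1.0 if x >= 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x); valid for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_prime(x, alpha=0.01):
    return 1.0 if x >= 0 else alpha

for x in (-3.0, 0.0, 2.0):
    print(x, relu(x), relu_prime(x), leaky_relu(x), leaky_relu_prime(x))
```

For negative inputs ReLU passes no gradient at all, while leaky ReLU still passes a small gradient of $\alpha$.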


#### ELU - Exponential Linear Unit

<img src="./pics/functions/elu.png" width=350 align="left">

$$\begin{align}
    \text{Range: }   & \{-\alpha...\infty\} \\
    \mathrm{elu}(x)  & = { \begin{cases} 
                               \alpha \left(e^{x} - 1\right) & {\text{for }} x < 0 \\
                               x                             & {\text{for }} x \geq 0
                           \end{cases} } \\
    \mathrm{elu}'(x) & = { \begin{cases} 
                                \mathrm{elu}(x) + \alpha & {\text{for }} x < 0 \\
                                1 & {\text{for }} x \geq 0
                            \end{cases} }
\end{align}$$

<br><br><br><br>
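
The ELU derivative for negative inputs can be written through the function value itself, because $\frac{d}{dx} \alpha \left(e^x - 1\right) = \alpha e^x = \mathrm{elu}(x) + \alpha$. A small sketch, assuming $\alpha = 1$ as the default and using illustrative function names:

```python
import math

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def elu_prime(x, alpha=1.0):
    # For x < 0 the derivative alpha * e^x equals elu(x) + alpha.
    return 1.0 if x >= 0 else elu(x, alpha) + alpha

for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_prime(x))
```

For large negative inputs the output saturates at $-\alpha$ instead of growing without bound.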


#### Softmax

The softmax activation function turns the activation values of the output neurons into probabilities. The outputs of the network always sum to 1 and each output is strictly positive. The calculation can only be done knowing all of the neuron outputs, so each output depends on every input.
Let's see how this is reflected in the derivatives.

$$\begin{align}
    \text{Range: } & \{0...1\} \\
    \sigma(x_i)    & = \frac{e^{x_i}}{\sum^n_{j=1} e^{x_j}} \\
    \frac{\partial \sigma(x_i)}{\partial x_j} & = \sigma(x_i)\left(\delta_{ij} - \sigma(x_j) \right) \\
    \text{where }  & \delta_{ij} = { \begin{cases} 
                                         0 & {\text{if }} i \neq j \\
                                         1 & {\text{if }} i = j
                                     \end{cases} }
\end{align}$$
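
A minimal sketch of softmax and its matrix of partial derivatives (the Jacobian) in plain Python. Subtracting the maximum before exponentiating is an added numerical-stability trick that does not change the result; the function names are illustrative.

```python
import math

def softmax(xs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = d softmax(x)_i / d x_j = s_i * (delta_ij - s_j)."""
    s = softmax(xs)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
for row in softmax_jacobian([1.0, 2.0, 3.0]):
    print(row)
```

Each row of the Jacobian sums to zero: since the outputs always sum to 1, increasing one probability must decrease the others.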

### Cost functions

**Cost** or **loss** functions are the **same as** the **error** functions we used previously during backpropagation. In fact, we can use the same quadratic error function as the cost function in a deep network as well.  
The overall goal does not change: we would like to find the set of weights that minimizes the error on the output. However, there are several better candidates for this purpose than the quadratic function.  
We'll add a small modification to our error computation on the neuron (which we refer to as *delta*): 

$$\begin{align}
    \delta^n_k 
    & = \frac{\partial C^n}{\partial a^n_k}
    = \frac{\partial C^n}{\partial y^n_k} f'\left( a^n_k \right)
\end{align}$$

Instead of fixing the error to one special case, we leave the door open for any cost function and compute the derivatives based on it.  
Let's go through the most common cost functions with their advantages and use cases.

#### Regression

##### Mean Squared Error Loss

Regression problems require a direct measurement of the error. Notice that we measure multiple outcomes and compute the average of the squared errors.

$$C = \frac{1}{m} \sum^m_{k=1} \left(t_k - y_k \right)^2 $$


##### Mean Squared Logarithmic Error Loss 

There are cases when the expected output of the regression has a large range. Large error values cause large changes in the weights, so it is useful to measure the error on a logarithmic scale. Unscaled data is expected as input.

$$C = \frac{1}{m} \sum^m_{k=1} \left( \log(t_k + 1) - \log(y_k + 1) \right)^2 $$


##### Mean Absolute Error Loss

MAE is more robust to outliers, so it is appropriate when the target variable contains a few unusually large values. Its gradient has the same magnitude for small and large errors, which can lead to convergence problems near the optimum.

$$C = \frac{1}{m} \sum^m_{k=1} \left| t_k - y_k \right|$$
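
To compare how the three regression losses react to an outlier, here is a small sketch in plain Python. The sample targets and predictions are made up for illustration.

```python
import math

def mse(targets, preds):
    return sum((t - y) ** 2 for t, y in zip(targets, preds)) / len(targets)

def msle(targets, preds):
    return sum((math.log(t + 1) - math.log(y + 1)) ** 2
               for t, y in zip(targets, preds)) / len(targets)

def mae(targets, preds):
    return sum(abs(t - y) for t, y in zip(targets, preds)) / len(targets)

targets = [1.0, 2.0, 3.0, 100.0]  # the last value acts as an outlier
preds = [1.5, 2.0, 2.0, 10.0]

print('MSE :', mse(targets, preds))
print('MSLE:', msle(targets, preds))
print('MAE :', mae(targets, preds))
```

The single badly predicted outlier dominates MSE, has a much smaller effect on MAE, and is dampened further on the logarithmic scale of MSLE.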

#### Binary classification

##### Binary Cross-Entropy Loss

Target value range: $\{0, 1\}$  
Also referred to as log loss.  
It measures the average difference between the target distribution and the predicted distribution, where the prediction $y_i$ is the probability of **class = 1**. It is the preferred loss function when using maximum likelihood optimization (more on that later). Note that it should be used with a sigmoid output activation function.

$$C = - \frac{1}{m} \sum^m_{i=1}{\left[t_i\log(y_i) + (1 - t_i)\log(1 - y_i)\right]}$$
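
A minimal sketch of binary cross-entropy. The clipping constant `eps` is an added safeguard against $\log(0)$ and the example predictions are made up.

```python
import math

def binary_cross_entropy(targets, preds, eps=1e-12):
    total = 0.0
    for t, y in zip(targets, preds):
        y = min(max(y, eps), 1.0 - eps)  # clip to avoid log(0)
        total += t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return -total / len(targets)

targets = [0, 1, 1, 0]
confident = [0.1, 0.9, 0.8, 0.2]  # mostly right, and sure about it
unsure = [0.5, 0.5, 0.5, 0.5]     # always answers "50%"

print(binary_cross_entropy(targets, confident))
print(binary_cross_entropy(targets, unsure))
```

A model that always predicts 0.5 gets a loss of $\log 2 \approx 0.693$, which is a handy baseline when reading training logs.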

##### Hinge Loss

Target value range: $\{-1, 1\}$  
Created for SVMs, it penalizes predictions whose sign differs from the target. It is used for binary classification problems. Results vary: in some cases it performs better than cross-entropy.

$$C = \sum^m_{i=1}{\max \left( 0, 1 - t_i \cdot y_i \right)}$$

##### Squared Hinge Loss

Target value range: $\{-1, 1\}$  
The square of the hinge loss. It smooths the loss curve and finds the solution that maximizes the margin around the decision plane between the classes. It doesn't provide probabilistic information about the decision. It should be used in tandem with tanh activation.

$$C = \sum^m_{i=1} \left( \max \left( 0, 1 - t_i \cdot y_i \right) \right)^2$$
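
Both hinge variants can be sketched in a few lines. The targets and raw scores below are made-up values chosen so that the arithmetic is easy to follow.

```python
def hinge(targets, scores):
    return sum(max(0.0, 1.0 - t * y) for t, y in zip(targets, scores))

def squared_hinge(targets, scores):
    return sum(max(0.0, 1.0 - t * y) ** 2 for t, y in zip(targets, scores))

targets = [1, -1, 1, -1]
scores = [2.0, -0.5, -1.0, 0.25]

# Per-sample hinge terms: 0 (correct, outside the margin), 0.5 (correct but
# inside the margin), 2.0 (wrong sign), 1.25 (wrong sign).
print(hinge(targets, scores))
print(squared_hinge(targets, scores))
```

Note that a correct prediction still contributes to the loss if it falls inside the margin ($t \cdot y < 1$).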

##### Cosine similarity

Target value range: $\{-1, 1\}$  
It is used to measure the similarity between two vectors. The value tells how the vectors are aligned: -1 = exactly opposite, 0 = orthogonal, 1 = same direction.

$$C = \frac{t \cdot y}{\lVert t \rVert \lVert y \rVert}$$
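
The three reference values can be reproduced with a short sketch. The helper name and the example vectors are illustrative.

```python
import math

def cosine_similarity(t, y):
    dot = sum(a * b for a, b in zip(t, y))
    norm_t = math.sqrt(sum(a * a for a in t))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_t * norm_y)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Only the direction matters: scaling either vector by a positive constant leaves the value unchanged.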


#### Multiclass classification

##### Multi-Class Cross-Entropy Loss

Target value range: $\{0, n\}$  
The average difference between the target distribution and the predicted distribution **over every class**. Specifically for ML. Perfect score: cross-entropy = 0. Requires $n$ output nodes (one for each class).

$$C = - \sum^n_{c=1}{t_c \log(y_c)}$$

where $t_c$ is the one-hot encoded target and $y_c$ is the predicted probability of class $c$.

##### Sparse Multiclass Cross-Entropy Loss

Target value range: $\{0, n\}$  
The same as Multi-Class Cross-Entropy Loss, but it doesn't require one-hot encoding of the target variable into $n$ distinct features: the target is the integer index of the class. It still requires $n$ output nodes and it is preferably used with the softmax activation function.

$$C = - \log(y_{c^*})$$

where $c^*$ is the index of the target class and $y_{c^*}$ is the predicted probability of that class.
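
The sketch below illustrates that the one-hot and the sparse variants compute the same number for a single sample. It assumes the common softmax form $-\sum_c t_c \log y_c$ of the multi-class cross-entropy; the function names and probabilities are illustrative.

```python
import math

def categorical_cross_entropy(one_hot, probs):
    # Only the classes with a nonzero target contribute to the sum.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_cross_entropy(label, probs):
    # The integer label simply selects the probability of the true class.
    return -math.log(probs[label])

probs = [0.7, 0.2, 0.1]  # e.g. a softmax output over 3 classes

print(categorical_cross_entropy([1, 0, 0], probs))
print(sparse_categorical_cross_entropy(0, probs))
```

The sparse form saves memory when the number of classes is large, because the targets stay a vector of integers instead of a one-hot matrix.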

##### Kullback Leibler Divergence Loss

Target value range: $\{0, n\}$  
Also known as relative entropy: it measures the difference from a baseline distribution, that is, how much information is lost if the prediction is used instead of the target. It is used for more sophisticated cases, e.g. approximating another function. With one-hot targets in the multiclass case it equals the Multi-Class Cross-Entropy Loss, so $n$ output nodes are required.

$$C = \sum^n_{c=1} {t_c \log \left({t_c}\Big/{y_c} \right)}$$
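
A short sketch of the KL divergence for discrete distributions. Terms with a zero target probability are skipped, following the usual convention $0 \log 0 = 0$; the example distributions are made up.

```python
import math

def kl_divergence(target, pred):
    # Terms with target probability 0 contribute nothing (0 * log 0 := 0).
    return sum(t * math.log(t / y) for t, y in zip(target, pred) if t > 0)

pred = [0.7, 0.2, 0.1]

print(kl_divergence(pred, pred))               # identical distributions
print(kl_divergence([1.0, 0.0, 0.0], pred))    # one-hot target
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # soft target
```

With a one-hot target the result is $-\log(0.7)$, the same value the cross-entropy $-\sum_c t_c \log y_c$ gives for that sample.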

#### Recommendations

- Regression: MSE
- Binary classification: Cross Entropy
- Multiclass classification: Cross Entropy

### Regularization

> *With four parameters I can fit an elephant,  
> and with five I can make him wiggle his trunk.* - [Von Neumann](https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/)

The more parameters a model has, the more susceptible it is to overfitting. There are several ways to prevent this: 
- We can split the data into train, validation and test sets, measure the error on the validation set during training, stop training once the validation error starts to increase, and finally evaluate the model on the test set.
- We can decrease the number of parameters and with it the size of the network, but larger networks have more expressive and predictive power.
- It is also possible to handle overfitting by adding a regularization penalty term to the cost function, which controls the weight values.
- Another way to reduce overfitting is to randomly select and temporarily drop neurons during each training batch.

Let's go through the effect of these methods one-by-one.

#### L1 

$L_1$ regularization penalizes the absolute value of the weights, which keeps the weights small unless a large weight considerably reduces the cost. $L_1$ regularization shrinks the weights by a constant amount towards 0.

It can be described as: 

$$C = C_0 + \frac{\lambda}{m} \sum_w{|w|}$$

Where $C_0$ is the original cost function. In case of cross entropy we get: 

$$
C = -\frac{1}{m} \sum_{xi} 
    \left[ t_i \log y^L_i 
           + (1 - t_i) \log(1 - y^L_i) 
    \right]
    + \frac{\lambda}{m} \sum_w{|w|}
$$

#### L2

$L_2$ regularization penalizes the squared weights, which keeps the weights small unless a large weight considerably reduces the cost. $L_2$ regularization shrinks the weights by an amount proportional to their value. It is also referred to as weight decay.

$$C = C_0 + \frac{\lambda}{2m} \sum_w w^2$$

Where $C_0$ is the original cost function. In case of cross entropy we get: 

$$
C = -\frac{1}{m} \sum_{xi} 
    \left[ t_i \log y^L_i 
           + (1 - t_i) \log(1 - y^L_i) 
    \right]
    + \frac{\lambda}{2m} \sum_w w^2
$$
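
The difference between the two penalties is easiest to see in their gradients, which are what the weight update actually uses. A minimal sketch, with illustrative function names and made-up values for the weights, $\lambda$ and $m$:

```python
def l1_penalty(weights, lam, m):
    return lam / m * sum(abs(w) for w in weights)

def l1_gradient(weights, lam, m):
    # d|w|/dw = sign(w): the same push towards 0 regardless of the size of w.
    return [lam / m * ((w > 0) - (w < 0)) for w in weights]

def l2_penalty(weights, lam, m):
    return lam / (2 * m) * sum(w * w for w in weights)

def l2_gradient(weights, lam, m):
    # d(w^2 / 2)/dw = w: the push is proportional to the weight itself.
    return [lam / m * w for w in weights]

weights = [0.5, -2.0, 0.0]
lam, m = 0.1, 10

print(l1_penalty(weights, lam, m), l1_gradient(weights, lam, m))
print(l2_penalty(weights, lam, m), l2_gradient(weights, lam, m))
```

The $L_1$ gradient has the same magnitude for every nonzero weight, which tends to drive small weights to exactly zero, while the $L_2$ gradient mostly affects the large weights.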

#### Dropout

<img src="https://cdn-images-1.medium.com/max/1600/1*iWQzxhVlvadk6VAJjsgXgg.png" width=500>By <a href="http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf">Srivastava, Nitish, et al. ”Dropout: a simple way to prevent neural networks from overfitting”, JMLR 2014</a>

Instead of transforming the cost function, dropout regularization (temporarily) modifies the network itself by removing some of the neurons from part of the training process (we'll discuss different training strategies in the next section).
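
A minimal sketch of one common variant, *inverted* dropout, applied to the outputs of a single layer. The rescaling of the surviving activations by $1 / (1 - \textrm{rate})$ is an assumption of this variant and is not spelled out in the text above; it keeps the expected activation unchanged so that nothing has to be rescaled at prediction time.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` (training only)."""
    if not training:
        return list(activations)  # dropout is switched off for prediction
    keep = 1.0 - rate
    # Surviving activations are scaled up so the expected value stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [1.0] * 10

print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, rate=0.5, rng=rng, training=False))
```

A new random mask is drawn on every call, so each mini-batch effectively trains a different thinned network.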

### Optimizers (update rules)

#### Stochastic Gradient Descent

The algorithm we implemented in the previous chapter was actually stochastic gradient descent. It uses one training example at a time to update the weights. Usually we shuffle the dataset before each pass instead of going through the samples in a fixed order.  
Let's revisit the weight update rules with a small change: let's use a variation called mini-batch stochastic gradient descent. Instead of handling one training sample at a time, we compute the results of $m$ samples at once and use the average error on those results to update the weights. Typical values of $m$ are powers of 2, starting from $m = 2^6 = 64$.

- **feedforward**: 
    - set input to $y^{x,1}$
    - for every $l = 2, 3, ... L$: $a^{x,l} = w^l y^{x, l-1}$ and $y^{x,l} = f\left( a^{x,l} \right)$
- **backpropagation**:
    - output error: $\delta^{x,L} = \nabla_y C_x \odot f'(a^{x,L})$
    - for every $l = L-1, L-2, ... 2$: $\delta^{x,l} = \left( \left( w^{l+1} \right)^T \delta^{x,l+1} \right) \odot f'\left( a^{x, l} \right)$
- **weight update**:
    - weights: $w^l = w^l - \frac{\alpha}{m} \sum_x {\delta^{x,l} \left(y^{x,l-1}\right)^T }$
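
The steps above can be sketched on the smallest possible network: a single linear neuron with one weight and a bias, trained with the quadratic cost. With a linear activation $f' = 1$, so the delta of a sample is simply $y - t$. The data, the bias term and all hyperparameter values are made up for illustration.

```python
import random

def minibatch_sgd(xs, ts, alpha=0.1, batch_size=4, epochs=500, seed=42):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # feedforward and delta for every sample of the mini-batch
            deltas = [(w * xs[i] + b) - ts[i] for i in batch]
            # update with the average gradient over the mini-batch
            w -= alpha / m * sum(d * xs[i] for d, i in zip(deltas, batch))
            b -= alpha / m * sum(deltas)
    return w, b

xs = [i / 10 for i in range(16)]     # 0.0, 0.1, ..., 1.5
ts = [2.0 * x + 1.0 for x in xs]     # noise-free targets: t = 2x + 1

w, b = minibatch_sgd(xs, ts)
print(w, b)
```

The learned parameters should end up close to the true values $w = 2$ and $b = 1$.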

#### Gradient Descent

Gradient descent is a special case of mini-batch stochastic gradient descent where $m$ equals $N$, the number of training samples. It is very rarely used, since it requires computing the result for every sample in every iteration. 
It is possible to extend gradient descent with regularization techniques.

#### Gradient Descent with momentum

Finding the optimum is not always a straightforward process. The gradients don't point directly to the optimum; they oscillate. To smooth this oscillation and get a better general direction, we give momentum to the direction of change by incrementally building up the speed of change. 
We are going to use mini-batch updates, and for each batch we compute $\delta$ and update the speed of delta, called $V_\delta$, using the following formula:
$$V_\delta = \beta V_\delta + \left(1 - \beta\right)\delta$$
where $\beta$ is the friction parameter, and use this speed for the updates:
$$w = w - \alpha V_\delta$$
This change smooths the gradients. A typical value for $\beta$ is 0.9, which is roughly the average of the last $\frac{1}{1 - \beta} = 10$ gradients.
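
A minimal sketch of the momentum update on a toy problem: minimizing $C(w) = w^2$, whose gradient is $2w$. The toy cost, the starting point and the number of steps are assumptions made for illustration.

```python
def momentum_step(w, v, grad, alpha=0.1, beta=0.9):
    v = beta * v + (1.0 - beta) * grad  # exponentially weighted average
    w = w - alpha * v                   # update with the smoothed gradient
    return w, v

w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=2.0 * w)

print(w)
```

Because $V_\delta$ starts from zero, the first few updates are smaller than with plain gradient descent; the speed builds up over roughly the first $\frac{1}{1 - \beta}$ steps.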

#### RMSprop - Root Mean Squared Propagation

The goal of the method is to further accelerate the learning process by modifying the update rule. Similarly to gradient descent with momentum, we keep an exponentially weighted average over the previous iterations. This time, however, we average the squares of the gradients from each mini-batch:
$$S_\delta = \beta_2 S_\delta + \left(1 - \beta_2\right)\delta^2$$
where $\beta_2$ is similarly a hyperparameter, and we use this $S_\delta$ value to update our weights:
$$w = w - \alpha \frac{\delta}{\sqrt{S_\delta} + \epsilon}$$
where $\epsilon$ is a very small value that prevents division by zero.
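
A minimal sketch of the RMSprop update on the toy cost $C(w) = w^2$ (gradient $2w$). The toy cost and the hyperparameter values are assumptions made for illustration.

```python
import math

def rmsprop_step(w, s, grad, alpha=0.01, beta2=0.9, eps=1e-8):
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # average of squared gradients
    w = w - alpha * grad / (math.sqrt(s) + eps)    # step scaled by the RMS
    return w, s

w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_step(w, s, grad=2.0 * w)

print(w)
```

Dividing by $\sqrt{S_\delta}$ makes the step size largely independent of the gradient's magnitude, so the step is roughly $\alpha$ wherever the gradient changes slowly. For the same reason the iterate ends up hovering within a small distance of the optimum instead of converging to it exactly.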

#### ADAM - Adaptive Moment Estimation

Adam is basically the combination of momentum and RMSprop. The algorithm works by computing $V_\delta$ and $S_\delta$ in each mini-batch iteration:

$$\begin{align}
    V_\delta & = \beta_1 V_\delta + \left(1 - \beta_1\right)\delta  \\
    S_\delta & = \beta_2 S_\delta + \left(1 - \beta_2\right)\delta^2 \\
\end{align}$$

then a corrected value is generated:

$$\begin{align}
    V^{\textrm{corrected}}_\delta & = \frac{V_\delta}{1 - \beta_1^t}  \\
    S^{\textrm{corrected}}_\delta & = \frac{S_\delta}{1 - \beta_2^t} \\
\end{align}$$

where $t$ is the number of the current iteration. Finally, using the values above, our weight update rule is:

$$w = w - \alpha \frac{V^{\textrm{corrected}}_\delta}{\sqrt{S^{\textrm{corrected}}_\delta} + \epsilon}$$

This update rule has several hyperparameters to tune. The setup recommended by the designers of the algorithm is to fine-tune $\alpha$ and use $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.  
The name comes from the two types of moments: $V_\delta$ estimates the first moment and $S_\delta$ the second moment of the gradients.
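
A minimal sketch of a single Adam step, again on the toy cost $C(w) = w^2$ (gradient $2w$). The toy cost, the starting point and $\alpha = 0.1$ are assumptions made for illustration; the other defaults are the recommended values quoted above.

```python
import math

def adam_step(w, v, s, grad, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1.0 - beta1) * grad           # first moment estimate
    s = beta2 * s + (1.0 - beta2) * grad ** 2      # second moment estimate
    v_hat = v / (1.0 - beta1 ** t)                 # bias correction
    s_hat = s / (1.0 - beta2 ** t)
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = 5.0, 0.0, 0.0
for t in range(1, 201):  # t starts at 1, otherwise the correction divides by 0
    w, v, s = adam_step(w, v, s, grad=2.0 * w, t=t)

print(w)
```

Thanks to the bias correction, the very first step has a size of almost exactly $\alpha$, no matter how large the gradient is.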


#### +1: Learning rate decay

As the model approaches the optimum, it is desirable to slow down the speed of the weight change to actually reach the optimum instead of wandering around it.
It is pretty straightforward, and there are many options:
$$\begin{align}
    \alpha & = \frac{1}{1 + \textrm{decay_rate} * \textrm{epoch_num}} \alpha_0 \\
    \alpha & = 0.95^\text{epoch_num} * \alpha_0 \\
    \alpha & = \frac{k}{\sqrt{\text{epoch_num}}} * \alpha_0 \\
    \alpha & = \frac{k}{\sqrt{\text{t}}} * \alpha_0
\end{align}$$
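
The first three schedules can be written as small functions of the epoch number. The function names and the default values of `decay_rate` and `k` are illustrative.

```python
import math

def inverse_time_decay(alpha0, epoch, decay_rate=1.0):
    return alpha0 / (1.0 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    return base ** epoch * alpha0

def sqrt_decay(alpha0, epoch, k=1.0):
    # epoch has to start from 1 here to avoid division by zero
    return k / math.sqrt(epoch) * alpha0

for epoch in (1, 2, 4, 8):
    print(epoch,
          inverse_time_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          sqrt_decay(0.1, epoch))
```

The fourth schedule has the same form as `sqrt_decay`, with the iteration counter $t$ in place of the epoch number.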

### Weight initialization

#### Vanishing and exploding gradients problem

If the initial weights are larger than 1, the signal and its gradient are multiplied by a large value at every layer, so they grow rapidly with the depth of the network. This results in huge weight updates, which leads to problems with the convergence.  
Initial weights smaller than 1 have the opposite effect: the gradients get smaller at every layer, which leads to really slow convergence.
The described problem is called the vanishing and exploding gradients problem.

There is a third option however: setting the weights to zero. This, in turn, leads to identical weight values, since all neurons of a layer receive the same update, basically degrading the network's performance.

These problems were among the main roadblocks in front of deep learning, and there are several ways to mitigate them. Let's see some of the ways to select good initial weights for the network.

#### Weight initialization method

The larger the number of inputs $n$ at layer $l$, the smaller the weights at that layer $w^{l}$ should be. It is good practice to generate nonzero random initial values with a variance of $\textrm{Var}(w^l) = \frac{1}{n}$. With the ReLU activation function, a variance of $\frac{2}{n}$ yields better results. We can generate such weights by applying:
$$w^{l} = w^{l}_{\textrm{init}} * \sqrt{\frac{2}{n^{l-1}}}$$
where $l$ is the layer, $n^{l-1}$ is the number of inputs of layer $l$ and $w^{l}_\textrm{init}$ is randomly generated from the standard Gaussian distribution (e.g. using the `np.random.randn()` function). Using **$2$** in the numerator with the ReLU activation function is the **He** initializer (named after the author of the paper), while using **$1$** in the numerator with the $\tanh$ activation function is called the **Xavier** initializer (again, after the author). There are many more variants available.
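
A minimal sketch of the scaling described above, using only the standard library so that the variance can be checked directly. The layer sizes, the seed and the helper names are made up for illustration.

```python
import math
import random

def init_weights(n_in, n_out, numerator, rng):
    """Return an n_out x n_in weight matrix with variance numerator / n_in."""
    std = math.sqrt(numerator / n_in)
    return [[rng.gauss(0.0, 1.0) * std for _ in range(n_in)]
            for _ in range(n_out)]

def variance(matrix):
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(42)
he_weights = init_weights(256, 128, 2.0, rng)      # Var = 2 / n
xavier_weights = init_weights(256, 128, 1.0, rng)  # Var = 1 / n

print(variance(he_weights), 2 / 256)
print(variance(xavier_weights), 1 / 256)
```

The measured variances should be close to the targets $\frac{2}{n}$ and $\frac{1}{n}$; they are not exact because the weights are random samples.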


---

## In practice: Keras 

Everything we have talked about so far can be implemented by hand from scratch, or we can use one of the several available frameworks. There are several well-established and widely used frameworks, namely torch, theano, and tensorflow. There is also a framework built on top of these low-level frameworks, called `Keras`, which we'll use to implement our networks.
Keras has two APIs:
- Sequential API: build the network by creating a list of layers.
- Functional API: define the network by chaining layer definitions using layers as input for the consecutive layers.

### Using Keras' `Sequential` API

In the Sequential API, we define layers and use those layers to build the final network. Many components can be a layer: a layer of neurons, or even an activation function.
Once we have defined the network layout, we compile the network by specifying the optimization method and the loss function, and finally train it on the training data.

#### [Sequential model](https://keras.io/guides/sequential_model/)

In order to create the network, we first have to create an empty model by initializing the Sequential model.

In [None]:
from keras.models import Sequential

In [None]:
model = Sequential()
print(model.layers)

#### [Dense layers](https://keras.io/api/layers/core_layers/dense/)

The fully connected layers we have used so far are called dense layers in Keras. We can define the network one layer at a time.  
For the initial layer we have to specify the input size and the number of neurons in the layer. For the following layers, the input size is automatically inferred from the previous layers.

In [None]:
from keras.layers import Dense

In [None]:
hidden_layer_1 = Dense(2, input_dim=2)

In [None]:
model.add(hidden_layer_1)
print(model.layers)

Once we added at least one layer, we can also use the `.summary()` function to get more details about the model:

In [None]:
model.summary()

#### [Activation layers](https://keras.io/api/layers/core_layers/activation/)

Notice that we have not defined the activation function for our dense layer previously, so it uses the linear activation function. To change that behaviour, we'll add an activation layer after the dense layer.
Using a sigmoid activation layer, we can recreate the network we built previously. 

In [None]:
from keras.layers import Activation

In [None]:
hidden_sigmoid_layer_1 = Activation('sigmoid')

In [None]:
model.add(hidden_sigmoid_layer_1)
model.summary()

We created the hidden layer with 2 neurons, let's add the final output layer with one (output) neuron.

In [None]:
output_layer = Dense(1)
output_sigmoid_layer = Activation('sigmoid')
model.add(output_layer)
model.add(output_sigmoid_layer)

In [None]:
model.summary()

#### [Optimizers](https://keras.io/api/optimizers/)

After defining the layout of the network, we have to select and initialize the optimizer.  
Let's use stochastic gradient descent without momentum, just like we did previously.

In [None]:
from keras.optimizers import SGD

In [None]:
sgd = SGD(learning_rate=0.1, momentum=0.0)
print(sgd)

#### [Metrics](https://keras.io/api/metrics/) vs [Loss](https://keras.io/api/losses/)

The last part of the network is the selection of the loss function and the evaluation metric. The *loss function* is what we'd like to minimize during training and the *metric* is used to evaluate the trained model. We can set these during the *model compilation*.

#### Compilation

The last step in the network definition is the model compilation.  
It requires three parameters: an optimizer, a loss function, and a list of metrics.

In [None]:
model.compile(optimizer=sgd,
              loss='binary_crossentropy',
              metrics=['accuracy'])

#### Network Visualization

In [None]:
from keras.utils import plot_model

In [None]:
plot_model(model, show_shapes=True, show_layer_names=True)

#### [Regularization](https://keras.io/api/layers/regularizers/)

There are two types of regularization:

- weight regularization (e.g. L2 regularization)
- dropout regularization

Weight regularization is available during layer definition, dropout regularization is available as a layer. Let's see an example for them.

In [None]:
from keras.regularizers import l1, l2
from keras.layers import Dropout

In [None]:
large_l2_regularizer = l2(5.0)
small_l1_regularizer = l1(0.001)

# hidden layer 
regularized_hidden_layer = Dense(2, input_dim=2, 
                                 kernel_regularizer=large_l2_regularizer, 
                                 bias_regularizer=small_l1_regularizer)

# dropout
hidden_dropout_layer = Dropout(rate=0.5)  # 50% of the neurons

# output layer
regularized_output_layer = Dense(1, kernel_regularizer=large_l2_regularizer)

In [None]:
regularized_model = Sequential()

regularized_model.add(regularized_hidden_layer)
regularized_model.add(Activation('sigmoid'))
regularized_model.add(hidden_dropout_layer)
regularized_model.add(regularized_output_layer)
regularized_model.add(Activation('sigmoid'))

regularized_model.compile(optimizer=SGD(),
                          loss='binary_crossentropy',
                          metrics=['accuracy'])

In [None]:
plot_model(regularized_model, show_shapes=True, show_layer_names=True)

---

## Example

### XOR revisited

For the last time, I promise. :)  
Let's train the compiled Keras model on this problem.

A | B | output |
--|---|--------|
0 | 0 |  0     |
0 | 1 |  1     |
1 | 0 |  1     |
1 | 1 |  0     |

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from helpers import plot_results_with_hyperplane, FILL_IN

np.random.seed(42)

In [None]:
inputs = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
labels = np.array([0, 1, 1, 0])

plt.scatter(x=inputs[:, 0], y=inputs[:, 1], c=labels)

#### Training

The final step is the training - it is really similar to training any sklearn model.

In [None]:
model.fit(inputs, labels, batch_size=1, epochs=2000)

In [None]:
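# predict_classes() was removed from recent Keras versions;
# threshold the sigmoid output at 0.5 instead
(model.predict(inputs) > 0.5).astype(int)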

In [None]:
plot_results_with_hyperplane(inputs, labels, model, 'Keras.NN [2, 2, 1]');

#### Keras Scikit-Learn API

We can build scikit-learn compatible Keras models as well, and we can use them in our pipeline just like any built-in model, including hyperparameter optimization. We can do this through wrapper classes: KerasClassifier for classification and KerasRegressor for regression.  
We have to define a build function, which sets up the network and compiles it into a model, then pass it to the wrapper object.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier

In [None]:
def build(learning_rate=0.1, hidden_size=2, activation_function='sigmoid'):
    model = Sequential([
        Dense(hidden_size, input_dim=2),
        Activation(activation_function),
        Dense(1),
        Activation(activation_function),
    ])
    optimizer = SGD(learning_rate=learning_rate)
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

In [None]:
plot_model(build(), show_shapes=True, show_layer_names=True)

In [None]:
np.random.seed(42)
sklearn_model = KerasClassifier(build, batch_size=1, epochs=1500, verbose=0)

In [None]:
sklearn_model.fit(inputs, labels)

In [None]:
sklearn_model.predict(inputs)

In [None]:
plot_results_with_hyperplane(inputs, labels, sklearn_model, 'Keras.NN [2, 2, 1]');

## Exercise

### Build a handwritten number detector - the MNIST dataset

<p><a href="https://commons.wikimedia.org/wiki/File:MnistExamples.png"><img src="pics/external/mnist.png"></a>
<br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Jost_swd15&amp;action=edit&amp;redlink=1" class="new" title="User:Jost swd15 (page does not exist)">Josef Steppan</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=64810040">Link</a></p>

Let's build a classifier on a small MNIST-like dataset: the handwritten digits dataset bundled with scikit-learn. It contains 8x8 grayscale pictures of handwritten digits. The goal is to predict which digit was written based on the pixel values.

The steps you have to complete are:
1. Load data and split into train-test set
2. Setup the network:
    - one hidden layer with 8 neurons
    - one output layer
3. Compile the network
4. Train data
5. Evaluate model

#### 1. Load data

We have already prepared the loading step for you. After loading the dataset with the built-in sklearn function, split it into a train and a test dataset. Use 1/4 of the data as the test set.

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

In [None]:
X, y = load_digits(return_X_y=True)

print(y[0])
plt.imshow(X[0].reshape(8, 8), cmap=plt.get_cmap('gray'))

In [None]:
# TODO: split the data into train and test dataset:
# - use 1/4 of the data as test set
# - set the seed to 42

X_train, X_test, y_train, y_test = FILL_IN.train_test_split
assert X_train.shape == (1347, 64), "Incorrect split!"

#### 2. Setup network

We need a network with 1 hidden layer with 8 neurons.

Answer the following questions:
- What is the dimensionality of the input?
- What should be the output activation function?

In [None]:
# TODO: 
# - Initialize model
# - Create the hidden layer with 8 neurons and the proper input dimensionality
# - Create a ReLU activation function
# - Create the output layer
# - Select an appropriate activation function








#### 3. Model compilation

Initialize the optimizer, select the appropriate loss function and compile the model.

Answer the following questions:
- Which is the appropriate loss function considering the output activation function and the learning problem?
- Which optimizer would be ideal for this problem?

In [None]:
# TODO:
# - compile the model by selecting loss function and the optimizer




#### 4. Fit the model

Fit the model using an appropriate batch size and number of epochs.

Answer the following questions:
- What would be an ideal batch size?
- How many iterations should we use?

In [None]:
# TODO:
# - set the batch size to a reasonable size
# - set the epoch count to a reasonable number
# - fit the model






#### 5. Evaluate model

Using the test set.

In [None]:
final_loss, final_accuracy = model.evaluate(X_test, y_test)

print(f'The model final accuracy on test set is {final_accuracy:.2%}')

### Good job!

In the next chapter we'll discover how we can deal with high-dimensional data (e.g. images) when trying to learn their representation.