## Presentation
Please find the slideshow presentation [here](https://docs.google.com/presentation/d/14uD1zIE6CEnc8c_PvxOxPLazJZEj5kHqmsB6oeqbqyY/edit?usp=sharing).

## Imports and utility functions

In [0]:
import numpy as np
import types
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use('ggplot')
plt.rcParams['image.cmap'] = 'RdBu'

import sklearn.datasets as datasets
from sklearn.preprocessing import PolynomialFeatures

%matplotlib inline


def plot_decision_boundary(model, X, y, degree=1):
    """
    Use this to plot the decision boundary of a trained model.
    """
    grid_lim = np.array([[X[:,0].min(), X[:,0].max()], [X[:,1].min(), X[:,1].max()]])
    xx, yy = np.mgrid[grid_lim[0,0]:grid_lim[0,1]:.01, 
                      grid_lim[1,0]:grid_lim[1,1]:.01]
    grid = np.c_[xx.ravel(), yy.ravel()]
    
    t = PolynomialFeatures(degree=degree, include_bias=False)
    _poly = t.fit_transform(grid)
    
    probs = model.predict_proba(_poly)[:, 1].reshape(xx.shape)
    
    f, ax = plt.subplots(figsize=(8, 6))
    contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                        vmin=0, vmax=1)
    ax_c = f.colorbar(contour)
    ax_c.set_label("$P(y = 1)$")
    ax_c.set_ticks([0, .25, .5, .75, 1])

    ax.scatter(X[:,0], X[:, 1], c=y, s=100,
             cmap="RdBu", vmin=-.2, vmax=1.2,
             edgecolor="white", linewidth=1)

    ax.set(aspect="equal",
           xlim=(grid_lim[0,0],grid_lim[0,1]), 
           ylim=(grid_lim[1,0],grid_lim[1,1]),
           xlabel="$X_1$", ylabel="$X_2$")
    plt.gcf().set_size_inches(21, 14)
    return f, ax
  
  

def plot_history(h):
  """
  Plot training and validation loss and accuracy from a Keras History object.
  """
  # Older Keras versions store accuracy under 'acc', newer ones under 'accuracy'
  acc_key = 'acc' if 'acc' in h.history else 'accuracy'
  
  fig, (ax1, ax2) = plt.subplots(1, 2)
  
  ax1.plot(h.history['loss'], label='Training Loss')
  ax1.plot(h.history['val_loss'], label='Test Loss')
  ax1.set_ylabel('Loss')
  ax1.set_xlabel('Epoch')
  ax1.legend(fontsize=24)
  ax1.set_ylim(0, 1)
  
  ax2.plot(h.history[acc_key], label='Training Accuracy')
  ax2.plot(h.history['val_' + acc_key], label='Test Accuracy')
  ax2.set_ylabel('Accuracy')
  ax2.set_xlabel('Epoch')
  ax2.legend(fontsize=24)
  ax2.set_ylim(0, 1)
  
  fig.suptitle('Evolution over epochs', fontsize=24)
  
  fig.set_size_inches(21, 14)


# Building a Neural Network Crash Course

### Important things to remember

1. Neural networks are not magic, just math
2. There are no recipes for building neural networks, only "best practices"
3. Neural networks can approximate a very wide range of functions, given enough data and capacity

# Regression

First things first, let's import **keras**. Keras is a Python library that makes it easier to use Google's TensorFlow, one of the most popular libraries for building neural networks.

In [0]:
import keras

Let's reuse the house prices dataset from Epoch 2 (columns such as `SalePrice` and `OverallQual` come from the Ames housing data).

In [0]:
from urllib.request import urlopen
file = urlopen('https://raw.githubusercontent.com/bucharestschoolofai/epoch_2/master/train.csv')
data = pd.read_csv(file, delimiter=',', usecols=['SalePrice', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'TotalBsmtSF', 'FullBath', 'GarageCars', 'Fireplaces'])

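# Select the 8 input features and the target by name, so the result
# does not depend on the column order inside the CSV file
feature_cols = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'TotalBsmtSF', 'FullBath', 'GarageCars', 'Fireplaces']
X = data[feature_cols].values
y = data['SalePrice'].values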

data.head(5)

Let's define our model.

In [0]:
model = keras.models.Sequential()

# First layer needs the input shape to be specified. We're dealing with 8D points, so input_shape=(8,)
model.add(keras.layers.Dense(units=4, input_shape=(8,), activation='relu', name='layer_1'))

# ReLU for the activation function
model.add(keras.layers.Dense(units=4, activation='relu',  name='layer_2'))

# The final layer has 1 neuron because we predict one value, the price
model.add(keras.layers.Dense(units=1, activation='relu',  name='layer_3'))

# Loss is Mean Absolute Error. 
model.compile(loss='mean_absolute_error', optimizer='adam')
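
As a rough illustration of what the mean absolute error loss measures, here is a minimal NumPy sketch. The prices and predictions are made-up values, not taken from the dataset.

```python
import numpy as np

# Hypothetical true prices and model predictions (illustrative values only)
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Mean absolute error: the average of |true - predicted| over all samples
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

Because the error is in the same unit as the target, the loss can be read directly as "how many dollars off we are on average".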

In [0]:
model.summary()

In [0]:
history = model.fit(X, y, batch_size=1, epochs=10, validation_split=0.2)

## Coding Challenge: Make the neural network more powerful
- increase the number of neurons
- increase the number of layers
- change activation functions

# Classification

Let's get back to the moons dataset that we tackled in Epoch 3 using classical machine learning.

In [0]:
X, y = datasets.make_moons(3000, noise=0.2, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y, s=100)
plt.gcf().set_size_inches(21, 14)

We'll use this dataset to learn about building neural networks. Later on we'll see how we can teach a neural model to "**read**" some numbers.

Let's build a neural network with 3 layers that can classify these points.

In [0]:
model = keras.models.Sequential()


# First layer needs the input shape to be specified. We're dealing with 2D points, so input_shape=(2,)
model.add(keras.layers.Dense(units=4, input_shape=(2,), activation='relu', name='layer_1'))

# ReLU for the activation function
model.add(keras.layers.Dense(units=4, activation='relu',  name='layer_2'))

# Softmax activation to output probabilities 
model.add(keras.layers.Dense(units=2, activation='softmax',  name='layer_3'))

# Loss is crossentropy, like in LogisticRegression! Optimizer is Stochastic Gradient Descent and we are interested in accuracy.
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
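
To make the softmax activation and the sparse categorical cross-entropy loss used above more concrete, here is a minimal NumPy sketch. It is an illustration of the math, not the Keras implementation; the `logits` and `labels` arrays are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the true class for each sample
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs))

# Hypothetical raw outputs of the last layer for 3 points and 2 classes
logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
labels = np.array([0, 1, 1])  # integer class labels, as in the moons dataset

probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs)
print(loss)
```

"Sparse" here only means that the labels are plain integers (0 or 1) rather than one-hot vectors.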

In [0]:
model.summary()

What's going on here? We've just made our model. But what does that mean? Creating the model allocates its parameters: one weight matrix and one bias vector per layer, so 3 weight matrices in total. Let's print them out:

In [0]:
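for layer in model.layers:
  weights, bias = layer.get_weights()
  
  # weights has shape (inputs, neurons): rows are the layer's inputs, columns are its neurons
  print("### Layer", layer.name, "| Inputs:", weights.shape[0], "| Neurons:", weights.shape[1])
  print("Weights: \n", weights)
  print("Bias: ", bias)
  print()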

The shape of each weight matrix is such that it connects the layer's inputs to its neurons (i.e. the first matrix is 2x4 because each input point has 2 values and the first layer has 4 neurons).
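
To see how these shapes fit together, here is a minimal NumPy sketch of a forward pass through layers with the same sizes as the model above. The weights are random stand-ins rather than the values Keras created, and the final softmax is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# (inputs, neurons) for each Dense layer of the model above
layer_shapes = [(2, 4), (4, 4), (4, 2)]
weights = [rng.normal(size=shape) for shape in layer_shapes]
biases = [np.zeros(shape[1]) for shape in layer_shapes]

x = rng.normal(size=(5, 2))  # 5 hypothetical 2D points

# Hidden layers: matrix multiply, add bias, apply ReLU
h = x
for W, b in zip(weights[:-1], biases[:-1]):
    h = np.maximum(h @ W + b, 0.0)

# Output layer: one raw score per class
scores = h @ weights[-1] + biases[-1]

# Total number of trainable parameters (weights plus biases)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(scores.shape, n_params)  # (5, 2) 42
```

The parameter count should match the total reported by `model.summary()` for this architecture.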

Let's get to training!

In [0]:
model.fit(X, y, batch_size=1, epochs=5, validation_split=0.2)

In [0]:
plot_decision_boundary(model, X, y)

## Coding Challenge: Make the neural network more powerful
- increase the number of neurons
- increase the number of layers
- change activation functions


# Initializers

We saw that the weight matrices were filled with random numbers. In Keras, we can specify the initialization function through the layer arguments. Let's initialize the weights with zeros and see what happens.

In [0]:
model = keras.models.Sequential()

model.add(keras.layers.Dense(units=4, input_shape=(2,), activation='relu', kernel_initializer='zeros', bias_initializer='zeros'))

model.add(keras.layers.Dense(units=4, activation='relu'))

model.add(keras.layers.Dense(units=2, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

history = model.fit(X, y, batch_size=1, epochs=5, validation_split=0.2)
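
One way to understand the result is to compute the output and the gradient of a zero-initialized ReLU layer by hand. The sketch below is a NumPy illustration, not Keras internals. It assumes the ReLU derivative at exactly 0 is taken as 0, and `upstream` is a random stand-in for whatever gradient the later layers send back.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Derivative of ReLU, taken as 0 at z == 0 (assumption stated above)
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))  # 5 hypothetical 2D points

# First layer initialized with zeros, as in the cell above
W1 = np.zeros((2, 4))
b1 = np.zeros(4)

z1 = x @ W1 + b1
a1 = relu(z1)  # all zeros, whatever the input is

# Backward pass through the first layer
upstream = rng.normal(size=(5, 4))  # stand-in gradient from later layers
delta1 = upstream * relu_grad(z1)
grad_W1 = x.T @ delta1
grad_b1 = delta1.sum(axis=0)

print(a1.max(), np.abs(grad_W1).max(), np.abs(grad_b1).max())
```

Under these assumptions the first layer outputs zeros and receives zero gradient, so gradient descent never moves its weights and the later layers never see any information about the input.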

# Learning Rate

The learning rate hyperparameter can be very important: a learning rate that is too low makes training slow and can leave the model stuck in a poor local minimum, while a learning rate that is too high can make the model overshoot the minimum.
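
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the hypothetical function $f(w) = w^2$ (not on our network) with three different learning rates.

```python
def gradient_descent(lr, steps=20, w=1.0):
    # Minimize f(w) = w**2, whose gradient is 2 * w
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

# Too low, reasonable, and too high learning rates
final = {lr: gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, w in final.items():
    print(f"lr={lr}: w after 20 steps = {w:.4g}")
```

With the lowest rate, `w` has barely moved towards the minimum at 0. With the middle rate it is practically there, and with the highest rate each step overshoots so much that `w` moves further away.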

A good rule of thumb is to start with a higher learning rate and then decrease it as training progresses.
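
A simple way to follow this rule of thumb is a time-based decay schedule. The sketch below is one possible schedule; the function name and the values of `initial_lr` and `decay` are illustrative assumptions.

```python
def time_based_decay(initial_lr, decay, epoch):
    # The learning rate shrinks as the epoch index grows
    return initial_lr / (1.0 + decay * epoch)

# Learning rate for the first 5 epochs
schedule = [time_based_decay(0.1, 0.5, epoch) for epoch in range(5)]
print([round(lr, 4) for lr in schedule])
```

Keras provides a `LearningRateScheduler` callback for plugging in a function like this. Check the documentation of your installed version for the exact signature it expects.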

In [0]:
model = keras.models.Sequential()

model.add(keras.layers.Dense(units=4, input_shape=(2,), activation='relu'))

model.add(keras.layers.Dense(units=4, activation='relu'))

model.add(keras.layers.Dense(units=2, activation='softmax'))

# TODO play with the learning rate hyperparameter
# Note: newer Keras versions name this argument `learning_rate` instead of `lr`
sgd = keras.optimizers.SGD(lr=0.01)

model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

history = model.fit(X, y, batch_size=1, epochs=5, validation_split=0.2)

# Reading Digits (Classification Extended)

Let's teach a neural network to **READ** some digits. We will be using the classic [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database).


In [0]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

In [0]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)

ax1.imshow(x_train[0], cmap='gray')
ax1.set_title('Label:' + str(y_train[0]))


ax2.imshow(x_train[1], cmap='gray')
ax2.set_title('Label:' + str(y_train[1]))

ax3.imshow(x_train[2], cmap='gray')
ax3.set_title('Label:' + str(y_train[2]))

fig.set_size_inches(21, 14)

To process these images with a fully connected neural network, we need to "flatten" them, because this type of layer can handle only vectors as input.
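
Flattening itself is just a reshape. Here is a minimal NumPy sketch on a hypothetical batch of three 28x28 images, with made-up pixel values:

```python
import numpy as np

# A hypothetical batch of 3 grayscale images, 28x28 pixels each
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# Keep the batch dimension, merge the two spatial dimensions into one vector
flat = images.reshape(len(images), -1)
print(flat.shape)  # (3, 784)
```

Each image becomes a vector of 784 values, with the rows of pixels laid out one after another.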

In [0]:
model = keras.models.Sequential()


model.add(keras.layers.Flatten(input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(keras.layers.Dense(10, activation='relu'))
model.add(keras.layers.Dense(10, activation='relu'))
model.add(keras.layers.Dense(10, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

![](https://www.superdatascience.com/wp-content/uploads/2018/08/CNN_Step3_Img1.png)

Flattening discards the spatial arrangement of the pixels, so using fully-connected layers on flattened input is not the best approach for analysing images.

In [0]:
model.summary()

In [0]:
keras.utils.plot_model(model, show_layer_names=True, show_shapes=True, to_file='model.png')

from IPython.display import Image

Image(filename='model.png', height=600)

In [0]:
h = model.fit(x_train / 255., y_train, # our data
              epochs=10, # number of passes
              batch_size=32, # number of images per training step
              validation_data=(x_test / 255., y_test)) # validation data
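
To get a feeling for what `epochs` and `batch_size` mean in terms of weight updates, here is a small back-of-the-envelope calculation. It uses the values from the cell above and the size of the MNIST training set (60000 images).

```python
import math

n_train = 60000   # number of images in the MNIST training set
batch_size = 32   # images per training step
epochs = 10       # passes over the training set

# One weight update per batch; the last batch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1875 18750
```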

In [0]:
plot_history(h)

## Coding Challenge: Make the neural network more powerful
- increase the number of neurons
- increase the number of layers
- change activation functions


# But how to decide on how many layers and neurons?

No clear answer here. You'll have to decide the optimal architecture for your problem and data. But you can also take this formula as a guide:

$\begin{equation}
N_h = \dfrac{N_s}{\alpha \cdot (N_i + N_o)}
\end{equation}$

$N_h$ = number of neurons in the hidden layers

$N_i$ = number of input neurons

$N_o$ = number of output neurons

$N_s$ = number of samples

$\alpha$ = arbitrary scaling factor, usually 2-10 
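
As a worked example, the formula can be applied to the MNIST setup from earlier. The helper name `hidden_neurons` is made up for this sketch, and the numbers are only a rough guide, as the formula itself is.

```python
def hidden_neurons(n_samples, n_inputs, n_outputs, alpha):
    # N_h = N_s / (alpha * (N_i + N_o))
    return n_samples / (alpha * (n_inputs + n_outputs))

# MNIST: 60000 training images, 28 * 28 = 784 inputs, 10 output classes
upper = hidden_neurons(60000, 784, 10, alpha=2)
lower = hidden_neurons(60000, 784, 10, alpha=10)
print(round(lower), round(upper))  # 8 38
```

So for this dataset the rule of thumb suggests somewhere between roughly 8 and 38 hidden neurons, depending on the choice of $\alpha$.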


**NOTE**
This only applies to **fully-connected** layers. The **UNIVERSAL APPROXIMATION THEOREM** states that a fully-connected neural network with only one hidden layer can approximate any continuous function to arbitrary accuracy, provided it has enough neurons.

# But can we do better? (Teaser)

Yes we can. In fact, fully-connected neural networks are not appropriate for working with images. There is another type of neural network that is purpose-built for this type of data. That is the **Convolutional Neural Network**, or CNN for short, which we will cover in detail **in the upcoming event**.

In [0]:
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu',input_shape=(28, 28, 1)))

model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))

model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dropout(0.5))

model.add(keras.layers.Dense(10, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(np.expand_dims(x_train, -1) / 255, y_train, batch_size=32, epochs=3, validation_data=(np.expand_dims(x_test, -1) / 255, y_test))
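
To follow how the image size changes through the network above, here is a small sketch of the shape arithmetic. It assumes the Keras defaults for `Conv2D` ('valid' padding and stride 1); the helper `conv_out` is made up for this illustration.

```python
def conv_out(size, kernel):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the image
    return size - kernel + 1

size = 28
size = conv_out(size, 3)   # first Conv2D, 3x3 kernel  -> 26
size = conv_out(size, 3)   # second Conv2D, 3x3 kernel -> 24
size = size // 2           # MaxPooling2D, 2x2 window  -> 12

# The second Conv2D has 64 filters, so Flatten sees 12 * 12 * 64 values
flat_features = size * size * 64
print(size, flat_features)  # 12 9216
```

These numbers can be checked against the output shapes printed by `model.summary()`.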
