# Neural networks


What to read and watch

* https://www.youtube.com/watch?v=aircAruvnKk
    
* http://neuralnetworksanddeeplearning.com/index.html



In [None]:
import ipywidgets as widgets

import matplotlib.pyplot as plt
import numpy as np

import tensorflow as tf


%matplotlib notebook

In [None]:
def dis_img(x, y, labeldict, title=None):
    """Show 16 randomly chosen images from x in a 4x4 grid, titled with their labels."""
    np.random.seed(213)  # fixed seed, so the same samples are drawn on every call
    indx = np.random.choice(x.shape[0], 16)
    fig, axes = plt.subplots(4, 4,
                             figsize=(5, 5),
                             subplot_kw={'xticks': [], 'yticks': []})
    for ind, ax in zip(indx, axes.ravel()):
        ax.imshow(x[ind].reshape(28, 28), cmap=plt.cm.gray)
        ax.set_title(labeldict[y[ind]])
    if title is not None:
        fig.suptitle(title)
    fig.tight_layout()  # one layout pass after all panels are drawn
    plt.show()



labeldict = {
    0: '0',
    1: '1',
    2: '2',
    3: '3',
    4: '4',
    5: '5',
    6: '6',
    7: '7',
    8: '8',
    9: '9'
}



In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Keep only the digits 1 and 7 to get a binary classification problem
train_mask = (y_train == 1) | (y_train == 7)
x_train, y_train = x_train[train_mask], y_train[train_mask]

test_mask = (y_test == 1) | (y_test == 7)
x_test, y_test = x_test[test_mask], y_test[test_mask]

# Flatten each 28x28 image into a 784-vector and scale the pixels to [0, 1]
x_train = x_train.reshape(-1, 784).astype('float32') / 255.
x_test = x_test.reshape(-1, 784).astype('float32') / 255.

dis_img(x_train, y_train, labeldict,'random_samples')



## Brain neurons and neural network

![Neuron](https://teenbraintalk.files.wordpress.com/2017/02/neuron.jpg?w=640)




![brain_neurons](https://cdn.geekwire.com/wp-content/uploads/2018/10/181031-neurons-630x599.png)

#### Logistic regression
#### Linear regression
#### Softmax regression

![softmax](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs00521-016-2401-x/MediaObjects/521_2016_2401_Fig3_HTML.gif?as=webp)

<br><br><br>


* Depending on the goal, we choose the appropriate **output function**

## What is a neural network?


![hidden](http://neuralnetworksanddeeplearning.com/images/tikz10.png)

- A network of connected neurons where information flows from one end to the other.
- So far we have learned three hypotheses (linear, logistic, and softmax regression); now the hypothesis $h_{\theta}(x)$ is a neural network (NN).
- Like the previous hypotheses, neural networks are **function approximation** methods. The goal is to find a function that maps the input $X$ to the output $Y$. 





### Activation

* Each neuron applies a non-linear transformation to the outputs of the neurons in the previous layer.
 
 ![perceptron](http://neuralnetworksanddeeplearning.com/images/tikz0.png)

* The activation of a sigmoid neuron is:
\begin{align*}
z &= X w + b\\
a &= \sigma(z)
\end{align*}
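
As a minimal sketch of the two formulas above, the activation of a single sigmoid neuron can be computed with NumPy. The inputs, weights, and bias below are made-up toy values, not values from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy values: 2 samples with 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
w = np.array([0.5, -0.25, 1.0])   # one weight per input feature
b = 0.0                           # bias of the neuron

z = X @ w + b      # weighted input, one value per sample
a = sigmoid(z)     # activation, one value per sample
print(z, a)
```

The second sample has $z = 0$, so its activation sits exactly at the midpoint of the sigmoid, $0.5$.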

### Hidden layer


![hidden](http://neuralnetworksanddeeplearning.com/images/tikz10.png)

### How to make $h_{w,b}(X)$ more complex?

* Adding a non-linear hidden layer.
* Non-linearities help Neural Networks perform more complex tasks.


### Bias:
   * How easy is it to activate a neuron?
   * A neuron with a large positive bias is hard to deactivate


### Universal Approximation Theorem

* A neural network with one non-linear hidden layer containing a sufficient but finite number of neurons can approximate any continuous function on a closed and bounded input region to any desired accuracy.

## Forward propagation
* vectorized

![hidden](http://neuralnetworksanddeeplearning.com/images/tikz10.png)

## Activation function vs output function

![hidden](http://neuralnetworksanddeeplearning.com/images/tikz11.png)


### Output function
* Defines the range of the output:
    * Regression problems --> Linear output function
    * Binary classification --> Sigmoid output function
    * Multi-class classification --> Softmax output function
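
A small sketch of the three output functions listed above, assuming plain NumPy and a made-up vector of output-layer inputs `z`. The function names are illustrative only.

```python
import numpy as np

def linear(z):
    """Regression: the output can be any real number."""
    return z

def sigmoid(z):
    """Binary classification: the output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multi-class classification: each row is non-negative and sums to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical inputs to the output layer for one sample
z = np.array([[2.0, 0.0, -1.0]])
print(linear(z))
print(sigmoid(z))
print(softmax(z))
```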
    
### Hidden layers activation function:

* Defines how active the nodes in a hidden layer are.
* We want a complex model --> the activation function should be non-linear. 
    * The composition of two linear functions is linear: if $f(\cdot)$ and $g(\cdot)$ are linear, then $g(f(x))$ is linear.
    * It doesn't matter how many hidden layers we have; if all their activations are linear, the network is linear. With a sigmoid or softmax output function, the model then reduces to plain logistic or softmax regression.
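
The collapse of stacked linear layers can be checked numerically. The sketch below uses random toy matrices (all shapes are arbitrary assumptions) and shows that two layers without a non-linearity give the same output as one layer with $W = W_1 W_2$ and $b = b_1 W_2 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                                # 5 samples, 4 features
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)       # "hidden" layer, linear activation
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)       # output layer, linear activation

# Two linear layers applied one after the other
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```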
    
    
### Famous activation functions:

* **Sigmoid** 
    \begin{align*}
f(x) = \frac{1}{1+e^{-x}}
\end{align*} 

    * Towards either end of the sigmoid the curve flattens: even big changes in $x$ produce only small changes in the output $y$, so the derivative is close to zero --> vanishing gradient (we talk more about this in the next lecture)
    
    * Vanishing gradient: the gradient vanishes or becomes very small --> the network learns slowly or stops learning 
    
    * Computationally expensive (exponential in the formula)
 <br><br><br>   

* **Tanh** \begin{align*}
f(x) &= \frac{2}{1+e^{-2x}} -1 \\
\\
& = 2 \sigma(2x) -1
\end{align*}

    * Has a bigger range than sigmoid ($(-1, 1)$ instead of $(0, 1)$) --> derivatives are steeper
    * Also suffers from vanishing gradient problem
    * Computationally expensive (exponential in the formula)
    
<br><br><br>
* **ReLU** (rectified linear unit)
 \begin{align*}
f(x) = \max(0, x)
\end{align*} 

    * Non-linear.
    * The derivative is one when $x > 0$ and zero when $x < 0$.
    * Computationally efficient. 
    * The dying ReLU problem: for negative input the gradient is zero --> a neuron whose input is always negative stops learning
    * In practice, however, enough activations usually have a positive value.
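
To make the gradient behaviour of the three functions concrete, the sketch below evaluates their derivatives at a few hand-picked points. The helper names are illustrative, and the ReLU derivative at exactly $x = 0$ is set to zero by convention (it is not defined there).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # at most 0.25, reached at x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2   # at most 1, reached at x = 0

def d_relu(x):
    return (x > 0).astype(float)   # 1 for positive input, 0 otherwise

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'", d_sigmoid(x))
print("tanh'   ", d_tanh(x))
print("relu'   ", d_relu(x))
```

At $x = \pm 10$ the sigmoid and tanh derivatives are nearly zero (vanishing gradient), while the ReLU derivative stays at one for every positive input.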
    
    
    
### [List of activation functions](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)

In [None]:
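fig_sigmoid, ax_sigmoid = plt.subplots()

# Input grid on which the activation functions are evaluated
p = np.linspace(-10, 10, 200)

def Sigmoid(x, t0=0, t1=1):
    """Sigmoid of the affine input t0 + t1 * x."""
    return 1 / (1 + np.exp(-(t0 + t1 * x)))

def tan_h(x, t0=0, t1=1):
    """Tanh of the affine input t0 + t1 * x."""
    return np.tanh(t0 + t1 * x)

def relu(x, t0=0, t1=1):
    """ReLU of the affine input t0 + t1 * x."""
    return np.maximum(t0 + t1 * x, 0)


# Sliders for the bias b and the weight w; swap tan_h for Sigmoid or relu to compare
@widgets.interact(b=(-7, 7, .01), w=(-5, 5, .01))
def update(b=0, w=1):
    # Remove the previous curve (iterate over a copy, since remove() changes ax.lines)
    for line in list(ax_sigmoid.lines):
        line.remove()
    ax_sigmoid.plot(p, tan_h(p, b, w), color='C0')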



## Forward and back propagation 

![hidden](http://neuralnetworksanddeeplearning.com/images/tikz10.png)


### Forward propagation
1. Randomly initialize $W$ and $b$ for all layers.
2. Pick a sample ($x_i, y_i$) and set $a^{[0]} = x_i$.
3. Find $z^{[1]}, a^{[1]}, z^{[2]}, a^{[2]}, \ldots, z^{[L]}, a^{[L]}$, where $L$ is the number of layers, $j$ indexes the nodes of layer $l-1$ and $h$ indexes the nodes of layer $l$:
\begin{align*}
z_h^{[l]} &= \sum_j a_j^{[l-1]} W_{jh}^{[l]} + b_h^{[l]}\\
a_h^{[l]} &= f^{[l]}(z_h^{[l]})
\end{align*}

4. Calculate the output $h_{\mathbf{W} , \mathbf{b}}(x_i) = a^{[L]}$ and the loss $\mathcal{L}(h_{\mathbf{W} , \mathbf{b}}(x_i), y_i)$ for the sample.

5. Calculate the total cost $J(\mathbf{W} , \mathbf{b})$ by averaging the losses over all samples.
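
The steps above can be sketched in vectorized form with NumPy. This is only an illustration under assumed settings: a tiny 4-3-1 network, tanh in the hidden layer, a sigmoid output, binary cross-entropy as the loss, and random toy data. The function `forward` and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activations):
    """Propagate x through all layers; return the output and every (z, a) pair."""
    a = x                                   # a^[0] is the input
    cache = []
    for W, b, f in zip(weights, biases, activations):
        z = a @ W + b                       # z^[l] = a^[l-1] W^[l] + b^[l]
        a = f(z)                            # a^[l] = f^[l](z^[l])
        cache.append((z, a))
    return a, cache

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 1]                     # assumed toy architecture

# Step 1: random initialization of W, zeros for b
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
activations = [np.tanh, sigmoid]

# Steps 2-4 for all samples at once (one sample per row)
X = rng.normal(size=(5, 4))
y = np.array([0, 1, 1, 0, 1])
h, cache = forward(X, weights, biases, activations)
h = h.ravel()
losses = -(y * np.log(h) + (1 - y) * np.log(1 - h))   # per-sample loss

# Step 5: total cost as the average of the per-sample losses
J = losses.mean()
print(h, J)
```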

### Gradient descent (back propagation)

![hidden](http://neuralnetworksanddeeplearning.com/images/tikz10.png)



### General loss function for any kind of problem

\begin{equation*}
J(\mathbf{W} , \mathbf{b}) = \frac{1}{n}\sum_i^n \mathcal{L}(h_{\mathbf{W} , \mathbf{b}}(x_i), y_i)
\end{equation*}


Repeat until convergence:

\begin{align*}
\pmb{\theta} & :=  \pmb{\theta} -  \alpha \nabla_{\pmb{\theta}} J(\pmb{\theta}) \\
\\
&\text{or}\\
\\
\mathbf{W} &:=  \mathbf{W} -  \alpha \nabla_{\mathbf{W}} J(\mathbf{W} , \mathbf{b} ) \\
\mathbf{b} &:=  \mathbf{b} -  \alpha \nabla_{\mathbf{b}} J(\mathbf{W} , \mathbf{b} ) 
\end{align*}
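
As a minimal sketch of this update rule, the loop below runs gradient descent on the simplest possible "network", a single sigmoid neuron (logistic regression), using made-up 2-D data. The learning rate, the number of steps, and the data are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Average binary cross-entropy J(w, b)."""
    h = sigmoid(X @ w + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradients(w, b, X, y):
    """Gradients of J with respect to w and b."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

# Hypothetical toy data: the label is 1 when x1 + x2 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
alpha = 0.1                        # learning rate
history = []
for step in range(200):            # fixed number of steps instead of a convergence test
    grad_w, grad_b = gradients(w, b, X, y)
    w = w - alpha * grad_w         # W := W - alpha * dJ/dW
    b = b - alpha * grad_b         # b := b - alpha * dJ/db
    history.append(cost(w, b, X, y))

print(history[0], history[-1])
```

The recorded cost goes down over the iterations, which is what "repeat until convergence" relies on.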




### What is [$\nabla J(\mathbf{W} , \mathbf{b} )$](http://neuralnetworksanddeeplearning.com/chap2.html):
* Also take a look at [3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4&ab_channel=3Blue1Brown)

* Backpropagation is just the **chain rule**, applied layer by layer from the output back to the input
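
To illustrate the chain-rule view, here is a hedged sketch of backpropagation for a network with one tanh hidden layer and a sigmoid output, trained with binary cross-entropy. The architecture, the toy data, and the function names are assumptions; one analytic gradient entry is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params, X, y):
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)
    h = sigmoid(a1 @ W2 + b2).ravel()
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def backprop(params, X, y):
    """Gradients of the cost with respect to W1, b1, W2, b2 via the chain rule."""
    W1, b1, W2, b2 = params
    n = len(y)
    # Forward pass, keeping the intermediate values
    a1 = np.tanh(X @ W1 + b1)
    h = sigmoid(a1 @ W2 + b2)              # shape (n, 1)
    # Backward pass
    dz2 = (h - y.reshape(-1, 1)) / n       # dJ/dz2 for sigmoid + cross-entropy
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T                       # push the error back through W2
    dz1 = da1 * (1 - a1 ** 2)              # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 1, 1, 0, 1, 0], dtype=float)
params = [rng.normal(scale=0.5, size=(4, 3)), np.zeros(3),
          rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)]

dW1, db1, dW2, db2 = backprop(params, X, y)

# Finite-difference check of a single entry of dW1
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (cost(plus, X, y) - cost(minus, X, y)) / (2 * eps)
print(dW1[0, 0], numeric)
```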



### Simple example with MNIST data

* Binary classification with a one-hidden-layer neural network.
* Use tanh as the hidden-layer activation function
* There are 256 hidden nodes in the single hidden layer

* TensorFlow provides:
    * [Sequential model](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential)
    * [Fully connected layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)

In [None]:
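# Modifying the labels to be binary: digit 7 becomes class 0, digit 1 stays class 1
# (so a title of '0' in the image grid below marks a 7)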
y_train[y_train == 7] = 0
y_test[y_test == 7] = 0

dis_img(x_train, y_train, labeldict,'random_samples')
print(x_train.shape, y_train.shape)
print(y_train)


(13007, 784) (13007,)
[1 1 1 ... 1 0 1]


In [None]:
classifier = tf.keras.Sequential(name='mnist_17')
classifier.add(tf.keras.Input(shape=(784,)))
classifier.add(tf.keras.layers.Dense(256, activation='tanh'))
classifier.add(tf.keras.layers.Dense(1, activation='sigmoid'))





In [None]:
# compile the fully connected model
classifier.compile(loss='binary_crossentropy',
                   optimizer= tf.keras.optimizers.SGD(learning_rate=0.001), 
                   metrics=['accuracy'])
classifier.summary()

Model: "mnist_17"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               200960    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 201,217
Trainable params: 201,217
Non-trainable params: 0
_________________________________________________________________
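
The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical name used only for this check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_param_count(784, 256)   # 784 pixels -> 256 hidden nodes
output = dense_param_count(256, 1)     # 256 hidden nodes -> 1 output node
print(hidden, output, hidden + output)
```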


In [None]:
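# fit the model on the dataset
# batch_size equals the number of samples, so every epoch is a single full-batch gradient descent step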

history = classifier.fit(x_train, y_train, epochs=500, batch_size=x_train.shape[0], verbose=0)


print('Loss:  ', history.history['loss'][-1],'Acc:   ', history.history['accuracy'][-1] )

Loss:   0.20989476144313812 Acc:    0.9737833738327026


In [None]:
# Training history
_, axes = plt.subplots(2, 1, sharex=True)
# ax1.ylabel('training error')
# ax1.xlabel('epoch')
for ax, k in zip(axes, history.history.keys()):
    ax.set_ylabel(k)
    ax.set_xlabel("Epoch")
    ax.plot(history.history[k], label = k)
# plt.legend(loc='best')
plt.tight_layout()


In [None]:
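from sklearn.metrics import confusion_matrix
import seaborn as sns

# Predicted probability of class 1 (digit 1) for every test image, shape (n, 1)
y_pred_on_test = classifier(x_test)
print(y_pred_on_test)
# Threshold at 0.5 to get hard class labels
y_pred = (y_pred_on_test.numpy().ravel() >= 0.5).astype(int).tolist()
print(y_pred)

# fmt='d' prints the counts as plain integers
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')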

tf.Tensor(
[[0.08267272]
 [0.8739736 ]
 [0.87494445]
 ...
 [0.84609365]
 [0.22116429]
 [0.91958094]], shape=(2163, 1), dtype=float32)
[0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1


<h1 align="center"> The end</h1> 