# Moving to Shallow Neural Networks

In this tutorial, you'll implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/. A French guy put it up, he's very famous in the DL comunity, he's called Yann Lecun and is now both head of the Facebook AI reseach program and head of something in the University of New York...


### First step

As a first step, I invite you to discover what is MNIST. You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull, but feel to browse the web.

Once you get the idea, you can download the dataset 

In [1]:
# Download the dataset in this directory (does that work on Windows OS ?)
#! wget http://deeplearning.net/data/mnist/mnist.pkl.gz
import os
os.getcwd()
os.chdir("D:\ECE\ING5\deep_learning" )
os.getcwd()

'D:\\ECE\\ING5\\deep_learning'

In [2]:
import pickle, gzip, numpy
import numpy as np

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f,encoding='latin1')
f.close()

def to_one_hot(y, n_classes=10): # You might want to use this as some point...
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]

---
# You can now implement a 2 layers NN

Now that you have the data, you can build the a shallow neural network (SNN). I expect your SNN to have two layers. 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    
You'll need to comment your work such that I understand that you understand what you are doing

### 1 - Define Parameters

In [30]:
# HELPER 
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    return np.exp(Z) / np.exp(Z).sum(axis=0)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def dsigmoid(Z):
    return np.dot(Z,(1-Z))

def selu(x):
        ALPHA = 1.6732632423543772848170429916717
        LAMBDA = 1.0507009873554804934193349852946
        if x <= 0.0:
            return LAMBDA * (ALPHA * np.exp(x) - ALPHA)
        else:
            return LAMBDA * x
    

# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 =np.random.normal(0,0.1, (784,20)), np.random.normal(0,0.1, 20)
W2, b2 =np.random.normal(1,0.1, (20,20)),np.random.normal(1,0.1, 20) 

### 2 - Define Model

In [160]:
def layer(W,X,b):
    return (np.dot(W.transpose(),X)+b)

def Pred(X,W1,W2,b1,b2):
    """Explanations ...
    Arguments:
        X: An input image (as a vector)(shape is <784,1>)
    Returns : a vector ???
    """
    A2 = np.ones((20,1))
    P = np.ones((10,1))
    Z = np.ones((10,1))
    A1 = np.ones((20,1))
    
    A1 = (layer(W1,X,b1))
    np.place(A1, np.isnan(A1), 1)
        
    A2 = (sigmoid(A1))
    #for x in range(0,len(a1)):
    #   A2[x]=selu(A1[x])
    A2 = A2.reshape(20)
    np.place(A2, np.isnan(A2), 1)
    Z = (layer(W2,A2,b2))
    np.place(Z, np.isnan(Z), 1)
    P = softmax(Z)
    return P,Z,A2,A1

def loss(P, Y,i):
    """Explanations : 
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
    Returns: a vector ???
    """
    #for i in range (0,len(Y)):

    return np.sum(np.dot(Y[i],np.log(P)))

### 3 - Define Derivatives

In [148]:
def dW1(L,A,W2,X):
    """Explanations ??
    Returns: A vector which is the derivative of the loss with respect to W1
    """
    return  np.dot(X.transpose()[:,None], db1(L,A,W2).transpose()[None,:])


def db1(L,A,W2):
    """Explanations ??
    Arguments:
        L is the loss af a sample (a scalar)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    """
    return np.dot(L,np.dot(W2.transpose(),dsigmoid(A)))


def dW2(L,A):
    return np.dot(A[:,None],L[None,:])



def dL(P,Y,i):
    return P-Y[i]

### 4 - Train you model

You may use Standard Gradient Descent (SGD) to train your model. (Experiment with many learning rates)

In [182]:
# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 =np.random.normal(0,0.1, (784,20)), np.random.normal(0,0.1, 20)
W2, b2 =np.random.normal(1,0.1, (20,10)),np.random.normal(1,0.1, 10)
alpha = 0.1

a2 = np.ones((20,1))
p = np.ones((10,1))
z = np.ones((10,1))
a1 = np.ones((20,1))
l = np.ones((10,1))
my = to_one_hot(y_train, n_classes=10)
#for i in range (1,50000):
j = 0
for j in range (1,100):
    for i in range (0,50000):
        p,z,a2,a1 = Pred(X_train[i,:],W1,W2,b1,b2)
        l = loss(p,my,i)
        dl = dL(p,my,i)
        b2 = b2 - alpha*dl
        W2 = W2 - alpha*dW2(dl,a2)
        b1 = b1 - alpha*db1(dl,a2,W2)
        W1 = W1 - alpha*dW1(dl,a2,W2,X_train[i,:])
    #print(l)
print("y")
print(my[i])
print("p")
print(p)
print("loss")
print((l))
print("dl")
print(dl)

  if __name__ == '__main__':


KeyboardInterrupt: 

In [181]:
p,z,a2,a1 = Pred(X_train[0,:],W1,W2,b1,b2)
print("y")
print(my[0])
print("p")
print(p)
print("loss")

y
[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
p
[  9.08038967e-02   4.02729873e-04   3.77741999e-03   4.16404277e-02
   1.24333258e-04   6.81290842e-02   3.02519634e-05   8.58326411e-04
   7.92567761e-01   1.66576853e-03]
loss


### 5 - Test the accuracy of your model on the Test set

In [178]:
p,z,a2,a1 = Pred(X_test[1],W1,W2,b1,b2)
print("y")
print(y_test[1])
print("p")
print(p)


y
2
p
[  6.91188563e-02   1.61672356e-02   4.33519062e-02   2.35961908e-02
   4.02317843e-03   5.84098594e-02   1.89182688e-02   3.17734985e-04
   7.65580692e-01   5.16077300e-04]


---
# You can now go Deeper

Build a deeper model trained with SGD (You don't need to use the biases here)
    - Layer 1 has 10 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a sigmoid activation
    - Layer 3 has 10 neurons with a sigmoid activation
    - Layer 4 has 10 neurons with a sigmoid activation
    - Layer 5 has 10 neurons with a sigmoid activation
    - Layer 6 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood

Is it converging ? Why ? What's wrong ?

In [104]:
toto =loss(p,my,0)
toto

array([-0.        , -0.        , -0.        , -0.        , -0.        ,
       -2.30258509, -0.        , -0.        , -0.        , -0.        ])

In [110]:
np.sum(l)

-2.3025850929940455