# Moving to Shallow Neural Networks

In this tutorial, you'll implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/. A French guy put it up, he's very famous in the DL comunity, he's called Yann Lecun and is now both head of the Facebook AI reseach program and head of something in the University of New York...


### First step

As a first step, I invite you to discover what is MNIST. You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull, but feel to browse the web.

Once you get the idea, you can download the dataset 

In [1]:
import pickle, gzip, numpy
import numpy as np

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f, encoding="latin1")
f.close()


def to_one_hot(y, n_classes=10): # You might want to use this as some point...
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

#We use the function to_one_hot on the y part of our dataset to get the numeric value of y_set as a vector which has
#a 1 at his index corresponding to his value to be able to work on it

X_train, y_train = train_set[0], to_one_hot(train_set[1])
X_valid, y_valid = valid_set[0], to_one_hot(valid_set[1])
X_test,  y_test  = test_set[0], to_one_hot(test_set[1])



---
# You can now implement a 2 layers NN

Now that you have the data, you can build the a shallow neural network (SNN). I expect your SNN to have two layers. 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    
You'll need to comment your work such that I understand that you understand what you are doing

### 1 - Define Parameters

In [5]:
# HELPER 
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    #classical softmax function based on the formula
    return np.exp(Z) / np.sum(np.exp(Z))


def sigmoid(z):

    """The sigmoid function."""
    #classical sigmoid function based on the formula
    return 1.0/(1.0+np.exp(-z))
    

# Define the variables here (initialize the weights with the np.random.normal module):

#We create the weight of the first hidden layer of 784 (input layer number) by 20 (the size of layer 1 neurons)
W1 = np.random.normal(0,0.1, (784,20))
#We define b1, with 20 row corresponding for the 20 neurons of the first hiden layer
#The bias is really important in Neural network, it is used to smooth the sum of neuron*weight to avoid
#every points to be at 1 or 0 (really common with sigmoid function)
b1 = np.zeros((20,))

#We create the weight of the second hidden layer of 20 (input neuron from the hidden layer 1) by 20 (the size of layer 1 neurons)
W2 = np.random.normal(0,0.1, (20,10))
#We define b2, with 10 row corresponding for the 10 neurons of the second hiden layer
b2 = np.zeros((10,))


### 2 - Define Model

In [6]:
def Pred(X, W1, b1, W2, b2 ):
    """Explanations ...
    Arguments:
        X: An input image (as a vector)(shape is <784,1>)
        W1: Weights for hidden layer 1
        b1: bias for hidden layer 1
        W2: Weights for hidden layer 2
        b2: bias for hidden layer 2
        
        Here we took all the the image <50000 rows, each image,784 column, each point of the image>
        Then we calculate the prediction of our algorithm base on the formula seen in class and the data given
        (sigmoid to get A2 and softmax to get A3)
        It give in P the probability to be each digit (0->10) for the image corresponding to the row
        That is way P is a <50000,10> matrix
    Returns : a vector P
    """

    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    A3 = np.dot(A2, W2) +b2
    P = softmax(A3)
    
    return P

def loss(P, Y):
    """Explanations : 
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
        
        Here we use the formula of the loss seen in class
    Returns: a vector ???
    Not a vector but the sum of the negative log likelihood
    It represent how far our model is from the real data, we try to reduce it
    """
    return -np.sum(np.multiply(Y, np.log(P)))

### 3 - Define Derivatives

In [10]:
def dW1(P,Y,W2,W1,b1,X):
    """Explanations ??
    It is the derivate of the weight for hidden layer 1
    It is use to update the weight regarding how far the result of the sum weight + bias is far from expectation
    We determinate it by derivating the loss function
    (This function didn't work but i put the mathematic expression that we tried to build in Python:
    dW1 = (P-Y)*W2.T*[A2*(1-A2)]*X)
    Returns: A vector which is the derivative of the loss with respect to W1
    """
    
    """
    This is were we stop to try to make the function correctly working:
    
    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    
    part_1=np.dot((1-A2).T,A2) ##20x20
    part_2=np.dot(W2.T,part_1) ##10x20
    part_3=np.dot((P-Y),part_2) #50000x20
    part_4=np.dot(X.T,part_3) #784x20
    return part_4
    
    When we derivated the loss according to a variable, we tried to get back with the same variable's dimension. 
    We managed to do it for W1,W2 and b2 but not for b1
    """

    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    
    part_1=np.dot((1-A2).T,A2) ##20x20
    part_2=np.dot(W2.T,part_1) ##10x20
    part_3=np.dot((P-Y),part_2) #50000x20
    part_4=np.dot(X.T,part_3) #784x20
    return part_4


def db1(P,Y,W2,W1,X,b1):
    """Explanations ??
    It is the derivate of the bias for hidden layer 1
    It is use to update the bias regarding how far the result of the sum weight + bias is far from expectation
    We determinate it by derivating the loss function
    (This function didn't work but i put the mathematic expression that we tried to build in Python:
    dW1 = (P-Y)*W2.T*[A2*(1-A2)]    same as before except dA1/db1 = 1)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    """
    
    """
    This is were we stop to try to make the function correctly working:
    
    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    part1=np.dot((1-A2).T,(P-y_train)) 
    part2=np.dot(part1,W2.T) 
    part3=np.dot(A2,part2) 
    return part3.T
    
    When we derivated the loss according to a variable, we tried to get back with the same variable's dimension. 
    We managed to do it for W1,W2 and b2 but not for b1
    """
    
    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    return (P-Y)*W2*A2(1-A2)


def dW2(P, Y, W1, X, b1):
    """
    Same as dW1 but for the second hidden layer
    """
    A1 = np.dot(X, W1) +b1
    A2 = sigmoid(A1)
    return np.dot(A2.T,(P-Y))


def db2(P,Y):
    """
    Same as db1 but for the second hidden layer
    """
    return np.sum(P-Y,axis=0)

### 4 - Train you model

You may use Standard Gradient Descent (SGD) to train your model. (Experiment with many learning rates)

In [11]:
#We set a learning rate

alpha = 0.0001


for i in range(50):

    #Forward
    P = Pred(X_train, W1, b1, W2, b2)
    
    #We check how for our prediction is from real output
    test = loss(P,y_train)/len(X_train)
    
    
    #Backward

    #We update all the parameters
    
    W1 = W1 - alpha*dW1(P,y_train,W2,W1,b1,X_train)   
    b1 = b1 - alpha*db1(P,y_train,W2,W1,X_train,b1)

    W2 = W2 - alpha*dW2(P,y_train,W1,X_train,b1)
    b2 = b2 - alpha*db2(P,y_train)
    
    #We update the loss with our new parameters to check if out model reduce his loss by iterations
    loss2 = loss(Pred(X_train, W1, b1, W2, b2), y_train)/len(X_train)
    print("Loss" , loss2)

    


  


ValueError: operands could not be broadcast together with shapes (50000,10) (20,10) 

### 5 - Test the accuracy of your model on the Test set

---
# You can now go Deeper

Build a deeper model trained with SGD (You don't need to use the biases here)
    - Layer 1 has 10 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a sigmoid activation
    - Layer 3 has 10 neurons with a sigmoid activation
    - Layer 4 has 10 neurons with a sigmoid activation
    - Layer 5 has 10 neurons with a sigmoid activation
    - Layer 6 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood

Is it converging ? Why ? What's wrong ?