# Moving to Shallow Neural Networks

In this tutorial, you'll implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/. A French guy put it up, he's very famous in the DL comunity, he's called Yann Lecun and is now both head of the Facebook AI reseach program and head of something in the University of New York...


### First step

As a first step, I invite you to discover what is MNIST. You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull, but feel to browse the web.

Once you get the idea, you can download the dataset 

In [4]:
# Download the dataset in this directory (does that work on Windows OS ?)
#! wget http://deeplearning.net/data/mnist/mnist.pkl.gz
import os
os.getcwd()
os.chdir("deep_learning" )
os.getcwd()

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\etudes\\ECE\\ING5\\deep_learning'

In [159]:
import pickle, gzip, numpy
import numpy as np

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f,encoding='latin1')
f.close()

def to_one_hot(y, n_classes=10): # You might want to use this as some point...
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]
print('Training set', X_train.shape, y_train.shape)
print('Validation set', X_valid.shape, y_valid.shape)
print('Test set', X_test.shape, y_test.shape)


Training set (50000, 784) (50000,)
Validation set (10000, 784) (10000,)
Test set (10000, 784) (10000,)


---
# You can now implement a 2 layers NN

Now that you have the data, you can build the a shallow neural network (SNN). I expect your SNN to have two layers. 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    
You'll need to comment your work such that I understand that you understand what you are doing

### 1 - Define Parameters

In [460]:
image_size = 28
num_labels = 10

# HELPER 
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    return np.exp(Z) / (np.exp(Z).sum(axis=0))

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))
def dsigmoid(Z):
    return np.dot(Z,(1-Z))

def accuracy(X, labels,W1,W2,b1,b2):
    tot = 0
    for i in range(len(X)):
        A1 = (layer(W1,X[i],b1))
        np.place(A1, np.isnan(A1),  np.random.normal(0.5,0.2))
        A2 = (sigmoid(A1))
        A2 = A2.reshape(20)
        np.place(A2, np.isnan(A2),  np.random.normal(0.5,0.2))
        Z = (layer(W2,A2,b2))
        np.place(Z, np.isnan(Z),  np.random.normal(0.5,0.2))
        prediction = softmax(Z)
    if (np.argmax(prediction) == np.argmax(labels)):
        tot = tot+1
    return ((100.0 * tot)/ len(X))
    

# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 =np.random.normal(0,0.1, (image_size*image_size,20)), np.random.normal(0,0.1, 20)
W2, b2 =np.random.normal(1,0.1, (20,num_labels)),np.random.normal(1,0.1, num_labels) 

### 2 - Define Model

In [433]:
def layer(W,X,b):
    return (np.dot(W.transpose(),X).transpose()+b)

def Pred(X,W1,W2,b1,b2):
    """Explanations 
    we predict what is the probability for X to be 1 of the 10 solutions y
    the SNN equation we use to predict the value of x:
    P = softmax(W2.T*sigmoid((W1.T*X+b1))+b2)
    The prediction process uses 2 layers:
    X -> A1 = W1.T*X+b1 -> A2 = sigmoid(A1) -> Z = W2.T*A2 + b2 -> p= softmax(z)
                LAYER 1     A1 btw 0&1            LAYER 2          Z btw 0&1
    
    Arguments:
        X: An input image (as a vector)(shape is <784,1>)
    Returns : a vector ???
    """
    #for it in range (len(X)):
    A1 = (layer(W1,X,b1))
    np.place(A1, np.isnan(A1),  np.random.normal(0.5,0.2))
        
    A2 = (sigmoid(A1))
    #for x in range(0,len(a1)):
    #   A2[x]=selu(A1[x])
    A2 = A2.reshape(20)
    np.place(A2, np.isnan(A2),  np.random.normal(0.5,0.2))
    Z = (layer(W2,A2,b2))
    np.place(Z, np.isnan(Z),  np.random.normal(0.5,0.2))
    train_prediction = softmax(Z)
    return train_prediction,Z,A2,A1

def loss(P, Y,i):
    """Explanations : 
    Compute the loss between prediction P and real value Y
    argmax(P(Y|X;W))=argmax TT(0...9|Xi;w) = argmin(sumi(sumj(Yij*ln(Pij)))) = L(W)
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
    Returns: a vector ???
    """
    #for i in range (0,len(Y)):
    #   for y in range (0,len())
    return np.sum(np.dot(Y[i],np.log(P)))


### 3 - Define Derivatives

In [434]:

def dW1(L,A,W2,X):
    """Explanations :
    dL/dW1  = dL/dZ * dZ/dA2 * dA2/dA1 * dA1/dW1
            = (P-Y) * W2.T   * A2(1-A2)* X
    Returns: A vector which is the derivative of the loss with respect to W1
    """
    return  np.dot(X.transpose()[:,None], db1(L,A,W2).transpose()[None,:])


def db1(L,A,W2):
    """Explanations :
    dL/db1  = dL/dZ * dZ/dA2 * dA2/dA1 * dA1/b1
            = dL/dZ * d(W2.T*A2 + b2)/dA2 * d(sigmoid(A1))/dA1 * d(W1.T*X+b1)/b1
            = (P-Y) * W2.T * A2(1-A2) * 1
    Arguments:
        L is the loss af a sample (a scalar)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    """
    return np.dot(L,np.dot(W2.transpose(),dsigmoid(A)))


def dW2(L,A):
    """Explanations :
    dL/dW2  = dL/dZ * dZ/dW2
            = dL/dZ * d(W2.T*A2 + b2)/d(W2)
            = (P-Y)A2
            = A2(P-Y).T
    """
    return np.dot(A[:,None],L[None,:])



def dL(P,Y,i):
    """
    Explanations :
    P = O(Z) = exp(Zj)/sumi(exp(Zi))     
    -> i=j:
    dPi/dZj = (exp(Zj)-(exp(Zj))^2)/sumi(exp(Zi))
    dPi/dZj = Pi(1-Pi)
    -> i=/=j :
    dPi/dZj = -exp(Zi)*exp(Zj)/(Sum(Pi)Sum(Pj))
    dPi/dZj = -Pi*Pj 
            = Pi(0-Pj)
    Hence,
    dOi(Z)/dZi = Pi(diracij - Pj) such that diracij = 1 if i=j;0 if i=/=j
    dL/dZi = Pi - Yi
    
    """
    return P-Y[i]

### 4 - Train you model

You may use Standard Gradient Descent (SGD) to train your model. (Experiment with many learning rates)

In [464]:
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f,encoding='latin1')
f.close()
X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]
print('Training set', X_train.shape, y_train.shape)
print('Validation set', X_valid.shape, y_valid.shape)
print('Test set', X_test.shape, y_test.shape)
# Define the variables here (initialize the weights with the np.random.normal module):
W1, b1 =np.random.normal(0,0.1, (image_size*image_size,20)), np.random.normal(0,0.1, 20)
W2, b2 =np.random.normal(1,0.1, (20,num_labels)),np.random.normal(1,0.1, num_labels) 
alpha = 0.7
num_steps = 50000
train_set = 10000
valid_set = 1000
#####################################################################################################################
y_train = to_one_hot(y_train, n_classes=10)
y_valid = to_one_hot(y_valid, n_classes=10)
y_test = to_one_hot(y_test, n_classes=10)
j = 0

for step in range (1,num_steps):
    train_num = np.random.randint(train_set)
    
    #for i in range (0,50000):
    #np.place(p, np.isnan(p), np.random.normal(0.5,0.2))
    p,z,a2,a1, = Pred(X_train[train_num,:],W1,W2,b1,b2)
    l = loss(p,y_train,train_num)
    dl = dL(p,y_train,train_num)
    b2 = b2 - alpha*dl
    W2 = W2 - alpha*dW2(dl,a2)
    b1 = b1 - alpha*db1(dl,a2,W2)
    W1 = W1 - alpha*dW1(dl,a2,W2,X_train[np.random.randint(train_set),:])
    
    
    if (step % 100 == 0):
          print('Loss at step %d: %f' % (step, l))
          print('Training accuracy: %.1f%%' % accuracy(
           X_train[:valid_set,:] , y_train[:valid_set,:],W1,W2,b1,b2))

          print('Validation accuracy: %.1f%%' % accuracy(
            X_valid[:valid_set,:], y_valid[:valid_set,:],W1,W2,b1,b2))
print('Test accuracy: %.1f%%' % accuracy(X_test[:valid_set,:], y_test[:valid_set,:],W1,W2,b1,b2))

Training set (50000, 784) (50000,)
Validation set (10000, 784) (10000,)
Test set (10000, 784) (10000,)
Loss at step 100: -14.922225
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 200: -15.308063
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 300: -14.729732
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 400: -15.153083
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 500: -14.409481
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 600: -19.851177
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 700: -15.443309
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 800: -14.915286
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 900: -14.769103
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 1000: -14.852642
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 1100: -15.307542
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 1200: -0.0000

Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10400: -14.932310
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10500: -14.472854
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10600: -14.563529
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10700: -17.292682
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10800: -14.675239
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 10900: -0.000003
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11000: -15.043221
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11100: -15.048026
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11200: -17.022033
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11300: -14.468674
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11400: -14.815259
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 11500: -14.693757
Training accuracy: 0.0%
Valid

Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 20700: -14.614326
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 20800: -14.587417
Training accuracy: 0.1%
Validation accuracy: 0.0%
Loss at step 20900: -14.906649
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21000: -16.716190
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21100: -16.805695
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21200: -14.968290
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21300: -12.495912
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21400: -0.000003
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21500: -0.000004
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21600: -14.499191
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21700: -14.773333
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 21800: -12.702586
Training accuracy: 0.0%
Valida

Validation accuracy: 0.0%
Loss at step 30900: -16.166081
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31000: -14.789435
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31100: -15.054224
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31200: -14.534999
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31300: -14.621429
Training accuracy: 0.0%
Validation accuracy: 0.1%
Loss at step 31400: -14.681745
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31500: -14.346619
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31600: -14.715447
Training accuracy: 0.0%
Validation accuracy: 0.1%
Loss at step 31700: -14.793961
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31800: -13.015280
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 31900: -15.051670
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 32000: -14.452559
Training accuracy: 0.0%
Validation accuracy: 0.0%
Lo

Validation accuracy: 0.0%
Loss at step 41100: -14.717812
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41200: -16.076073
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41300: -14.894004
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41400: -14.928602
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41500: -14.552056
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41600: -14.602409
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41700: -14.865644
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 41800: -14.559040
Training accuracy: 0.0%
Validation accuracy: 0.1%
Loss at step 41900: -13.481140
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 42000: -14.418869
Training accuracy: 0.0%
Validation accuracy: 0.1%
Loss at step 42100: -14.509887
Training accuracy: 0.0%
Validation accuracy: 0.0%
Loss at step 42200: -14.716739
Training accuracy: 0.0%
Validation accuracy: 0.0%
Lo



(10000,)

### 5 - Test the accuracy of your model on the Test set

In [465]:
print('Test accuracy: %.1f%%' % accuracy(X_test[:valid_set,:], y_test[:valid_set,:],W1,W2,b1,b2))

Test accuracy: 0.0%


---
# You can now go Deeper

Build a deeper model trained with SGD (You don't need to use the biases here)
    - Layer 1 has 10 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a sigmoid activation
    - Layer 3 has 10 neurons with a sigmoid activation
    - Layer 4 has 10 neurons with a sigmoid activation
    - Layer 5 has 10 neurons with a sigmoid activation
    - Layer 6 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood

Is it converging ? Why ? What's wrong ?

In [31]:
W2.shape

(20, 10)

In [3]:
# Python
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))


b'Hello, TensorFlow!'
