# Shallow Neural Networks

We will implement a shallow neural network to classify digits ranging from 0 to 9. The dataset you'll use is quite famous, it's called 'MNIST' http://yann.lecun.com/exdb/mnist/.


You might find [this notebook](https://nbviewer.jupyter.org/github/marc-moreaux/Deep-Learning-classes/blob/master/notebooks/dataset_MNIST.ipynb) to be usefull.

In [1]:
# Download the dataset in this directory 
#! wget http://deeplearning.net/data/mnist/mnist.pkl.gz

In [1]:
import pickle, gzip, numpy, math
import numpy as np

# Load the dataset
f = gzip.open('./Files/mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f,encoding="latin1")
f.close()

def to_one_hot(y, n_classes=10): 
    _y = np.zeros((len(y), n_classes))
    _y[np.arange(len(y)), y] = 1
    return _y

X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
X_test,  y_test  = test_set[0],  test_set[1]
X_train = np.transpose(X_train)
X_test = np.transpose(X_test)


---
# Implement a 2 layers NN

We will build a 2 layers SNN 
    - Layer 1 has 20 neurons with a sigmoid activation
    - Layer 2 has 10 neurons with a softmax activation
    - Loss is Negative Log Likelihood (wich is also the cross entropy)
    

### Define Parameters

In [2]:
        
def softmax(Z):
    """Z is a vector eg. [1,2,3]
    return: the vector softmax(Z) eg. [.09, .24, .67]
    """
    return np.exp(Z) / np.exp(Z).sum(axis=0)
    
def sigmoid(x):
    return (1/(1+np.exp(-x)))

np.random.seed(0)
inputSize=28
neuronSize=20
classifierSize=10
testSize = 50000
W1 = np.random.normal(0,0.1,(inputSize*inputSize,neuronSize))
W2 = np.random.normal(0,0.1,(neuronSize,classifierSize))
b1 = np.zeros((neuronSize,testSize))
b2 = np.zeros((classifierSize,testSize))
A1 = np.matmul( np.transpose(W1), X_train ) + b1
A2 = sigmoid(A1)
alpha=0.0005

y_train_b = to_one_hot(y_train, 10).T
y_test_b = to_one_hot(y_test, 10).T


### Define Model

In [3]:
def Pred(X, W1, W2, b1, b2):
    """Explanations ...
    Arguments:
        X: An input image (as a vector)(shape is <784,1>)
    Returns : a vector ???
    
    We can say that Pk = P(k|X;W) ; k belongs to [0;9]
    Then P = softmax(W * transposed(X))
    Because of the hidden layers, we have :
    A1 = W1 * transpose(X) + b1 (b1 being a bias)
    A2 = sigmoid(A1)
    z = W2 * transpose(A2) + b2
    and P = softmax(z)
    
    Hence : P = softmax( W2 * transpose[ sigmoid(W1*transpose[X]) ] )   -> the change between () & [] was done to ease reading
    """
    A1 = np.dot(X.T, W1).T + b1
    A2 = sigmoid(A1)
    Z = np.dot(A2.T, W2).T + b2
    P = softmax(Z.T).T
    return P   
  
def loss(P, Y):
    """Explanations : 
    Arguments:
        P: The prediction vector corresponding to an image (X^s)
        Y: The ground truth of an image
    Returns: a vector ???
    
    We want to calculate the prediction : argmax(W) P(Y|X;W) so Y knowing X and the weights.
    We develop that into argmax(W) P(0|X;W) , ... , argmax(W) P(9|X;W)
    Hence the prediction becomes argmax(W) Π(i=0;9) & Π(j=0;9) OF ( P[i,j] ^ Y[i,j] )
    We can apply the natural logarithm function to simplify this to :
    L(W) = argmax(W) Σ(i=0;9) & Σ(j=0;9) OF (-Y[i,j]*ln(P[i,j]))
    This is the loss function, we want to reduce it.
    """
    YP = -1* np.matmul(Y,np.log(P+0.000000000000001).T)
    loss = np.sum(YP)
    return loss


### Define Derivatives

In [6]:
def dW1(P, Y, W2, A2, X):
    """Explanations 
    W1 is a weight matrix applied for each pixel (784) to each neuron of the first layer (20)
    
    Returns: A vector which is the derivative of the loss with respect to W1
    
    We want to minimize the loss function, so let's calculate dL/dW1 and 
    decompose it by going from P to W1 on the schema done in class:
    
    dL/dW1 = dL/dP * dP/dz * dz/dA2 * dA2/dA1 * dA1/dW1 
    
    dL/dP = Σ(i=0;9) & Σ(j=0;9) OF (-Y[i,j]/P[i,j])
    
    dP/dz = dP[i]/dz[j] = d/dz[j] * ( exp(z[i])/Σ(k=0;9) OF exp(z[k]) )
        Σ(k=0;9) OF exp(z[k]) will be Σ
        If i = j
            d/dz[j] * P[i] = [exp(z[i]) * Σ * exp(z[j]) * exp(z[i])]/[Σ²]
                           = [exp(z[i])/Σ] - [exp(z[i])²/Σ²]
                           = softmax(z[i]) - softmax(z[i])² 
                           = P[i] (1 - P[j]) 
                (i=j, so we will have the second term according to j for the sake of future explanation)
        
        If i != j
            d/dz[j] * P[i] = [0 * Σ - exp(z[j]) * exp(z[i])]/[Σ²] 
                    We have 0 since we derive exp(z[i]) by z[j] and i is never equal to j
                           = -[exp(z[j])/Σ]*[exp(z[i])/Σ]
                           = -P[i]*P[j]
                           = P[i] (0 - P[j])
                           
        So dP/dZ = P[i](δ[i,j] - P[j]) ; δ = {1 if i=j ; 0 otherwise}
        So if i=j, dP/dz is positive (P - P², P among [0,1], P>P²) 
        If it i!=j, dP/dz is negative (-P[i]P[j], P>0)
        
    dP/dZ = P[i](δ[i,j] - P[j]) ; δ = {1 if i=j ; 0 otherwise}
        
    dZ/dA2 = W2
    
    dA2/dA1 = A2/(1 - A2)
    
    dA1/dW1 = X
    
    But we can simplify dL/dP * dP/dz :
    dL/dz[i] = dL/dP*dP/dz[i] = Σ(j=0;9) OF ( (-Y[i,j]/P[j])*P[i]*(δ[i,j] - P[j]) ) 
                                #basic aggregation of the previous calculations
                        
                              = P[i] * Σ(j=0;9) OF ( (-Y[i,j]/P[j])*(δ[i,j] - P[j]) ) 
                              #We extract P[i] because it does not depend on j
                              
                              = P[i] * Σ(j=0;9) OF ( ( (-Y[i,j]*δ[i,j]/P[j]) + Y[j]) 
                              #We distribute Y/P to (δ - P)
                              
                              = P[i] * [ Σ(j=0;9) OF (-Y[i,j]*δ[i,j]/P[j]) + Σ(j=0;9) OF (Y[j]) ] 
                              #we divide the sum in 2 parts
                              
                     We have  Σ(j=0;9) OF (-Y[j]*δ)/P[j]  which is equal to -Y[i]/P[i], i=j, because it is 0 if i!=j
                     And Σ(j=0;9) OF (Y[j]) is equal to 1 since Y is that type of vector : [0,1,0,0,0,0,0,0,0,0]
                              
                              = P[i] * [-Y[i]/P[i] + 1] = -Y[i] + P[i]
    dL/dz[i] = P[i] - Y[i]
                              
    So, we can finally deduce that
    
    dL/dW1 = (P[i] - Y[i]) *  W2  *    A2    *  (1-A2) *     X 
    SIZE :      [1 * 10]    [10*20]  [20 * 1]   [20 * 1]  [1 * 28²]
                [1 * 10]          [10 * 1]          [20 * 28²]
                        [1 * 1]              [20 * 28²]
      It is a vector of size  [20,28²].
      
N.B. : When calculating size, we overpass the transpose problem and simply verify the dimension.
        """
    
    PY = np.subtract(P,Y)

    A2_1_A2 = np.matmul(A2.T,(np.ones(A2.shape)-A2))

    
    PYW = np.matmul(PY.T,W2.T)
    
    PYW_A21A2 = np.matmul(PYW,A2_1_A2)
    
    dW1 = np.matmul(X,PYW_A21A2)

    
    return dW1

def db1(P, Y, W2, A2):
    """Explanations 
    The b1 are the biases applied to the first layer of neuron (20,1)
    
    Arguments:
        L is the loss af a sample (a scalar)
    Returns: A scalar which is the derivative of the Loss with respect to b1
    
    dL/db1 = dL/dP * dP/dz * dz/dA2 * dA2/dA1 * dA1/db1
        That is basically equal to dL/dW1, except that the last term was dA1/dW1 = X, now it is dA1/db1 = 1
    So dL/dB1 = (P[i] - Y[i]) *  W2  *     A2 *    (1-A2)  , we don't need L as an argument though
    SIZE :        [1 * 10]    [10*20]  [20 * 1]   [20 * 1]
                        [1 * 20]            [20 * 20]
      It is a matrix of size [1,20] despite the comment before.
    """
    
    PY = np.subtract(P,Y)
    
    A2_1_A2 = np.matmul(A2,(1-A2).T)
    PYW = np.matmul(PY.T,W2.T)
    
    db1 = np.matmul(PYW,A2_1_A2)
    
    return db1


def dW2(P, Y, A2):
    """Explanations 
    W2 is a weight matrix too, it is applied for each neuron from the first layer (20) to each neuron from the second layer (10)

    With the same logic, let's calculate dL/dW2 :
    
    dL/dW2 = dL/dP * dP/dz * dz/dW2
           = (P[i] - Y[i]) *   A2
    SIZE :    [10 * 1]      [1 * 20]
    dL/dW2 is a vector of size [10,20].
    """
    PY = np.subtract(P,Y)
    dW2 = np.matmul(PY,A2.T)
    return dW2

def db2(P, Y):
    """Explanations 
    The b2 are the biases applied to the first layer of neuron (10,1)
    
    Same logic,
    
    dL/db2 = dL/dP * dP/dz * dz/db2
           = (P[i]-Y[i]) * 1
           = P[i] - Y[i]
    So it is a vector of size [10,1]
    
    
    """
    db2 = np.subtract(P,Y)
    return db2


### Model training


In [7]:


for i in range(0,20): #instead of doing a for loop we should use the loss function and test if the difference between 2 iterations is less important than a little threshold
    #Forward propagation: we compute the probability of being of each of the 10 classes
    P = Pred(X_train, W1, W2, b1, b2)
    
    #Backward propagation: we update the parameters
    A1 = np.matmul( np.transpose(W1), X_train ) + b1
    A2 = sigmoid(A1)
    b2 = b2 - alpha*db2(P,y_train_b)
    #W1 = W1 - alpha*dW1(P,y_train_b,W2,A2,X_train)
    b1 = b1 - alpha*db1(P,y_train_b,W2,A2).T
    #W2 = W2 - alpha*dW2(P,y_train_b,A2).T

print(P)

#As we can see our method is not working here, we commented W1 and W2 to avoid the computation to be too much. 
#Our try to do ever computation using matricial computation did not fully suceeded (it create overflow here)


[[  1.57904561e-05   2.26239616e-05   2.24090588e-05 ...,   1.51675938e-05
    2.24053014e-05   1.51675914e-05]
 [  1.93878143e-05   2.00316682e-05   2.00298886e-05 ...,   1.93965132e-05
    2.00228602e-05   1.93965102e-05]
 [  2.16220775e-05   1.90733536e-05   1.90805181e-05 ...,   2.24790655e-05
    1.90798627e-05   2.24790635e-05]
 ..., 
 [  1.76610553e-05   2.08362890e-05   2.08347154e-05 ...,   1.81350793e-05
    2.08329256e-05   1.81350800e-05]
 [  2.80387417e-05   1.57911141e-05   1.57961955e-05 ...,   2.96888603e-05
    1.57967855e-05   2.96888585e-05]
 [  1.91624088e-05   2.02982871e-05   2.03042213e-05 ...,   1.94167813e-05
    2.02966211e-05   1.94167746e-05]]


### Test of the accuracy of the model on the Test set

In [7]:
#We designed our test function but could not use it since our training did not fully suceeded
y_pred=Pred(X_test, W1, W2, b1, b2)
cpt=0
sizeTest=10000
for i in range (0,sizeTest):
    if ((y_test_b[:,i].argmax(axis=0))==y_pred[:,i].argmax(axis=0)):
        cpt=cpt+1
accuracy=(cpt/sizeTest)*100
print("Accuracy % = ", accuracy)
        

ValueError: operands could not be broadcast together with shapes (20,10000) (20,50000) 