# Neural Networks

A neural network is an artificial system capable of learning to perform information processing tasks. The system is modeled on biological systems consisting of a (large) number of interconnected *neurons*.
Examples of applications of neural networks are:
 - Image recognition
 - Language processing (e.g. ChatGPT)
 - Data processing
 - Game playing
 - Handwriting recognition
 
 In the following we are going to use recognition of handwritten digits as an example.
 
 ![neuralHandigits](neuralHandigits.png)
 
 The aim is to create a network that recognizes handwritten digits like those in the example above.
 In the process we will learn the basics of neural networks, how to implement a network, and how to
 use Python libraries to create and use neural networks.
 
 The text and code below closely follow the book *Neural Networks and Deep Learning* by Michael Nielsen:
 
 [http://neuralnetworksanddeeplearning.com/index.html](http://neuralnetworksanddeeplearning.com/index.html)
 
 The code, training data and test data for the book are found on GitHub:
 
     git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
     
 The code was originally written in Python 2.7 but has been updated to Python 3 by Michal Daniel Dobrzanski.
 That version is downloaded by
 
     git clone https://github.com/MichalDanielDobrzanski/DeepLearningPython35
 
 
 ## Perceptrons
 A perceptron is a computational element capable of taking a number of input values
 
 ![neuralPerceptron](neuralPerceptron.png)
 
 and calculating an output value. The output value is either 0 or 1. The output is computed
 by the simple mathematical model
 
 \begin{eqnarray}
   output = 
   \left\{
  \begin{array}{cc}
    0 & \mbox{if}\,\, \sum_j^N x_j w_j \le threshold \\
    1 & \mbox{if}\,\, \sum_j^N x_j w_j  > threshold
   \end{array}
   \right. .
  \end{eqnarray}
  
  The weights $w_j$ reflect the importance of each input value relative to the output.
  $N$ is the total number of inputs. Each input is a single number.
  
  This simple device is capable of making simple decisions. For example, it can decide whether all inputs are equal to 1, given that the possible input values are 0 or 1. With three inputs, all weights set to 1 and the threshold set to 2, the weighted sum exceeds the threshold only when all three inputs are equal to 1, so the output is 1 in that case and 0 in all other cases.
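
The decision rule can be tried out directly in code. The following is a minimal sketch, not code from the book: the function name `perceptron` is made up for this illustration, and the threshold of 2 is a value chosen here so that, with the strict inequality in the model, only the weighted sum 3 gives the output 1.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


weights = [1, 1, 1]
threshold = 2  # assumed value: only the sum 3 is strictly larger than 2

# Try a few combinations of binary inputs
for inputs in [(1, 1, 1), (1, 1, 0), (0, 1, 0), (0, 0, 0)]:
    print(inputs, "->", perceptron(inputs, weights, threshold))
```

Only the input `(1, 1, 1)` produces the output 1; every other combination stays at or below the threshold.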
  
Some inputs can be considered more important than others. For example, imagine that the conditions for a passenger plane to take off are that there is enough fuel, all passengers have boarded and the weather is good.
If fuel is considered the most important condition, it can be weighted by 3 while the other inputs are weighted by 1. With the threshold set to 4, the plane leaves only when all three conditions are met, since the weighted sum must exceed the threshold.
If the threshold is lowered to 3, the plane might leave even if the weather is bad or the passengers have not boarded, but it will never leave without enough fuel, because the other two inputs together contribute only 2.
Obviously this is a rather crude decision-making device, but it can make some decisions. The idea is now that
if we connect a number of perceptrons together in a network, more complex (and presumably better) decisions might be made.

![neuralPerceptronNet](neuralPerceptronNet.png)

The network shown above consists of columns of perceptrons and connections between them. (Each perceptron still has only one output, but that output is connected to several other perceptrons.) Each column is called a layer. The outputs from one
layer are connected to the inputs of the next layer.
It is also customary to add an extra layer for the input values.

The notation used for each perceptron is usually made a bit simpler by using

\begin{eqnarray}
   output = 
   \left\{
  \begin{array}{cc}
    0 & \mbox{if}\,\,  w \cdot x +b \le 0\\
    1 & \mbox{if}\,\, w \cdot x +b >  0
   \end{array}
   \right. .
  \end{eqnarray}
  where we have used the vectors $w=(w_0,\cdots,w_{n-1})$ and $x=(x_0,\cdots,x_{n-1})$ for the weights and input values, and $w\cdot x=\sum_j w_j x_j$ is their dot product. $b$ is called the bias and is equal to $-threshold$, and $n$ is the number of inputs and weights.
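
That the bias form and the threshold form describe the same perceptron can be checked by running both over every combination of binary inputs. This is a small sketch using NumPy; the function names, the weights (3, 1, 1) and the threshold 4 are illustrative values chosen here.

```python
import itertools

import numpy as np


def perceptron_threshold(x, w, threshold):
    """Threshold form: output 1 if w . x > threshold."""
    return 1 if np.dot(w, x) > threshold else 0


def perceptron_bias(x, w, b):
    """Bias form: output 1 if w . x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0


w = np.array([3.0, 1.0, 1.0])  # illustrative weights
threshold = 4.0                # illustrative threshold
b = -threshold                 # the bias is minus the threshold

# All 8 combinations of three binary inputs
all_inputs = [np.array(bits) for bits in itertools.product([0, 1], repeat=3)]
same = all(
    perceptron_threshold(x, w, threshold) == perceptron_bias(x, w, b)
    for x in all_inputs
)
print(same)
```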
  
  
## Sigmoid neurons  
It can be shown that networks of perceptrons are capable of performing any kind of computation.
In that respect perceptron networks are the same as the computers we already have. In order to get beyond this model the network is modified to be capable of learning.

Experience shows that the perceptron lacks one property that is needed to introduce learning.
![neuralSmallchange](neuralSmallchange.png)

We need each node in the network to be sensitive to small changes, such that a small change in the input or the weights
creates a small change in the output. A perceptron is not capable of this, since a small change in the inputs or weights might flip the output from 0 to 1, or vice versa.

To accommodate this, the Sigmoid neuron is introduced in place of the perceptron.

![neuralSigmoid](neuralSigmoid.png)

The Sigmoid neuron is pictured in the same way as a perceptron, except that the inputs can now vary continuously between 0 and 1, the weights are real numbers, and the output is a real number between 0 and 1.
To this end we use the Sigmoid function

\begin{eqnarray}
  \sigma(z) = \frac{1}{1+\exp(-z)}.
\end{eqnarray}
  
For a neuron with $z = w\cdot x + b$ this becomes
\begin{eqnarray}
  \sigma(z) = \frac{1}{1+\exp(-w\cdot x -b)}.
\end{eqnarray}

The Sigmoid function approaches 1 when $w\cdot x +b$ is large and positive, and approaches zero when $w\cdot x +b$ is large and negative. Comparing the Sigmoid function with the step function, which is the output from a perceptron, we see that the Sigmoid function is a smoothed step function.
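
The comparison with the step function can also be made numerically. The sketch below evaluates both functions at a few values of $z$ picked for illustration; `step` is a helper defined here for the comparison only.

```python
import numpy as np


def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


def step(z):
    """The perceptron output: 1 where z > 0, otherwise 0."""
    return np.where(z > 0, 1.0, 0.0)


z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # illustrative values
print("step:   ", step(z))
print("sigmoid:", np.round(sigmoid(z), 4))
```

Far from zero the two functions nearly agree, while close to zero the Sigmoid function changes gradually instead of jumping.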

![neuralStep](neuralStep.png)

The Sigmoid function is called the *Activation* function for the neuron. Other activation functions are possible, but here we stay with the Sigmoid function. Because the activation function is smooth and differentiable we have

\begin{eqnarray}
  \Delta output \approx \sum_{j=0}^{n-1} \frac{\partial output}{\partial w_j}\Delta w_j + 
                                     \frac{\partial output}{\partial b}\Delta b
\end{eqnarray}

which means that a small change in the weights creates a small change in the output.
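
The linear approximation can be checked numerically for a single Sigmoid neuron. The sketch below uses the standard identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ for the partial derivatives; the inputs, weights, bias and the small changes are made-up values for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def output(w, b, x):
    """Output of one sigmoid neuron."""
    return sigmoid(np.dot(w, x) + b)


# Made-up inputs, weights and bias
x = np.array([0.2, 0.7, 0.5])
w = np.array([0.4, -0.3, 0.8])
b = 0.1

# Partial derivatives of the output, using sigma'(z) = sigma(z) * (1 - sigma(z))
s = output(w, b, x)
d_output_dw = s * (1.0 - s) * x
d_output_db = s * (1.0 - s)

# Small changes in the weights and the bias
dw = np.array([1e-3, 2e-3, -1e-3])
db = 1e-3

actual_change = output(w + dw, b + db, x) - output(w, b, x)
predicted_change = np.dot(d_output_dw, dw) + d_output_db * db
print(actual_change, predicted_change)
```

The two numbers agree to several digits, and the agreement improves as the changes are made smaller.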

# Neural network architecture

![neuralNet](neuralNet.png)

The leftmost layer in the network shown above is the input layer and represents the input data. The rightmost layer is called the output layer and contains the neuron creating the actual output. The layer in the middle is a 
so-called hidden layer.

![neuralNet2](neuralNet2.png)

It is possible to have several hidden layers, as in the network above.
A network works by accepting some kind of data at the input layer and delivering some kind of result
at the output layer.
An example is a network designed to recognize a handwritten digit. An image of the handwriting is used as input. Assume the image contains 64 by 64 pixels = 4096 pixels. This would be the data contained in the
input layer. The output is a single number, where a value less than 0.5 signals that the input
is not the digit 1, and a value larger than 0.5 signals that the input is the digit 1.

In the networks shown above the data flows from left to right.
Each hidden layer gets its inputs from the previous layer. No data flows backwards from a layer to a layer further to the left. Networks of this kind are called *feed-forward* networks.

# A network for recognizing handwritten digits

Our aim now is to construct a network which is capable of recognizing a single handwritten digit,

![neuralDigit](neuralDigit.png)

like the digit 5 shown above. To do this we will use the network shown below


![neuralDigitnet](neuralDigitnet.png)

The input layer consists of 784 pixels derived from an image of 28 by 28 pixels. The input data
is grayscale with values from 0 to 1.
The second layer is the hidden layer and consists of 15 neurons. We will not consider this a fixed number but try different values, so we assume the number of neurons in the hidden layer is a variable named $n$.
The output layer consists of 10 neurons, one for each digit from 0 to 9. To decide which digit the input is, we
select the neuron with the largest output value.
This is all there is to this network. The next task is to find out how we can teach the network to recognize handwritten digits.
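
To get a feeling for the size of this network, the number of adjustable weights and biases can be counted, and the rule for picking the digit can be written in one line. This is a sketch with 15 hidden neurons; the output activations are made-up numbers, not results from a trained network.

```python
import numpy as np

sizes = [784, 15, 10]  # input pixels, hidden neurons, output neurons

# One weight per connection between neighbouring layers,
# one bias per neuron outside the input layer
n_weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
n_biases = sum(sizes[1:])
print(n_weights, n_biases)

# Made-up output activations, one per digit 0..9
output_activations = np.array(
    [0.02, 0.01, 0.10, 0.05, 0.03, 0.91, 0.04, 0.02, 0.08, 0.01])
digit = int(np.argmax(output_activations))  # index of the largest output
print("Recognized digit:", digit)
```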

# Network learning

The main task is now to figure out how the network described above can learn to recognize handwritten
digits. The main idea is to collect a large number of handwritten digits and then adjust the weights and biases of each neuron so that the network recognizes each digit in the way we have described. This is in principle an optimization problem. If we denote the input as the vector $x$, whose components are the
input pixels, the output of the network as the vector $a$, whose components are the outputs of the output layer, and the desired output as the vector $y(x)$, we can
define a simple cost function

\begin{eqnarray}
  C(w,b) = \frac{1}{2}\sum_{x}\lVert y(x) - a \rVert^2
\end{eqnarray}

and minimize $C(w,b)$ with respect to the weights and biases.
The sum is over all training inputs.
In our case, with a small number of neurons, this could in principle be accomplished with a global search algorithm, e.g. simulated annealing.
However, if the number of neurons is very large, this will take too much time, and a simpler algorithm such as steepest descent or conjugate gradient search is a better and faster choice.
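
A cost function of this kind can be evaluated in a few lines. The sketch below uses the squared form with the factor 1/2, the same form that is used for a single input in the section on the gradient. The function name `quadratic_cost` and the two-component outputs are made up for this illustration.

```python
import numpy as np


def quadratic_cost(outputs, targets):
    """Half the sum of squared distances between network outputs and desired outputs."""
    return 0.5 * sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, targets))


# Desired outputs for two made-up training inputs
targets = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]

# Outputs from a network that is nearly right, and from one that is mostly wrong
good_outputs = [np.array([0.1, 0.9]), np.array([0.8, 0.2])]
bad_outputs = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]

print(quadratic_cost(good_outputs, targets))
print(quadratic_cost(bad_outputs, targets))
```

The cost is small when the outputs are close to the desired outputs and grows as they move apart, which is what makes it usable as the quantity to minimize.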

# Training dataset

Before looking at how to do the optimization we need a training data set. We will use the *MNIST* dataset, which contains a large number of scanned handwritten digits

![neuralMNIST](neuralMNIST.png)

The scanned dataset consists of 60000 images of handwritten digits from 250 different people. The images are grayscale and sampled with 28 by 28 pixels, where each pixel contains a real number between 0 and 1.
In addition to the training data there are also 10000 images to be used for testing the network.
In the following we will regard each image as a vector $x$ containing 784 components. This is the input
to the network. The desired output, $y(x)$, will be a vector with 10 components

\begin{eqnarray}
y = (0,0,0,0,0,0,1,0,0,0)
\end{eqnarray}

Here component no 7 is one. Since the first component stands for the digit 0, this marks the image as the digit 6.
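
A digit label can be turned into such a 10-component vector, and back again, with a few lines of NumPy. This is a minimal sketch in which index 0 stands for the digit 0, the same convention as `vectorized_result` in the loader code further down; the name `one_hot` is made up here.

```python
import numpy as np


def one_hot(digit, n_classes=10):
    """Return a vector with 1.0 at position `digit` (counted from 0) and 0.0 elsewhere."""
    y = np.zeros(n_classes)
    y[digit] = 1.0
    return y


y = one_hot(3)
print(y)

# The digit is recovered as the index of the largest component
recovered = int(np.argmax(y))
print(recovered)
```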


# Stochastic steepest descent

We will use the very simple steepest descent algorithm for optimizing the weights and biases.
The unknowns in our problem are the weights $w_k$ and the biases $b_l$. In component notation the steepest
descent algorithm is trying to find a minimum of the cost function $C$ by updating the unknowns as

\begin{eqnarray}
  w'_k = w_k -\frac{\eta}{m}\sum_{x_j}\frac{\partial C_{x_j}}{\partial w_k}  \nonumber\\
  b'_l = b_l -\frac{\eta}{m}\sum_{x_j}\frac{\partial C_{x_j}}{\partial b_l}  
\end{eqnarray}
$\eta$ is the step length, $C_{x_j}$ is the cost for the single training input $x_j$, and $m$ is the number of inputs in the sum.
Here we do not include all of the training inputs $x_j$ in the summation, only a limited
number selected randomly. We then average the gradient over this subset. The idea is to minimize 
the cost of computing the gradient. If we select a representative subset of the training data, the
gradient will hopefully be accurate enough for the minimization to work.
This approach is called stochastic steepest descent.
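
The update rule can be tried on a problem much smaller than the network. The sketch below fits a single weight $w$ in the model $wx$ to noise-free data generated with $w=2$, using the cost $C_x = \frac{1}{2}(wx-y)^2$ for each input. The data, the step length and the mini-batch size are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2 x, so the minimum of the cost is at w = 2
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x

w = 0.0    # starting guess
eta = 0.1  # step length (assumed value)
m = 10     # number of inputs in each randomly selected subset

for epoch in range(50):
    order = rng.permutation(len(x))  # random order, so the subsets are random
    for start in range(0, len(x), m):
        batch = order[start:start + m]
        # dC_x/dw = (w x - y) x, averaged over the subset
        gradient = np.mean((w * x[batch] - y[batch]) * x[batch])
        w = w - eta * gradient

print(w)
```

Each update only looks at 10 of the 200 inputs, yet the weight still ends up very close to 2.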

![neuralGradient](neuralGradient.png)


# Computation of the gradient

The most challenging part of this project is the actual computation of the gradient of
the cost function. We will use the backward method of automatic differentiation, or as it is usually called
in this context, the backpropagation method.

First a clear (but a bit cumbersome) notation for the weights, biases and activation functions must be defined.

By $w^l_{jk}$ we mean the weight from neuron no $k$ in layer no $l-1$ to neuron no $j$ in layer no $l$.


![neuralNotation](neuralNotation.png)

For the biases we use $b^l_j$ for neuron no $j$ in layer no $l$. The activations, i.e. the outputs of the neurons, are denoted
in the same way, $a^l_j$.
The activation in layer no $l$ now becomes 

\begin{eqnarray}
a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right)
\end{eqnarray}

We see that $w^l_{jk}$ can be regarded as the elements of a matrix with indices $jk$, and $a^{l-1}_k$ as
a vector with components denoted by the $k$ index. The sum over $k$ above is then the matrix product
of the matrix $w^l$ with the vector $a^{l-1}$, and $\sigma$ is applied to each component. The above equation can then be written

\begin{eqnarray}
a^l = \sigma\left(w^l a^{l-1} + b^l\right)
\end{eqnarray}

We will also use the notation 

\begin{eqnarray}
z^l = w^l a^{l-1} + b^l
\end{eqnarray}

So that we have

\begin{eqnarray}
a^l = \sigma(z^l)
\end{eqnarray}
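
The step from the component form to the matrix form can be verified on a small example. The sketch below computes the activations of a layer with 4 neurons fed by 3 neurons in both ways; the sizes and the random values are chosen here for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(1)
w = rng.standard_normal((4, 3))  # w[j, k]: weight from neuron k to neuron j
a_prev = rng.random((3, 1))      # activations in layer l-1
b = rng.standard_normal((4, 1))  # biases in layer l

# Component form: a_j = sigma(sum_k w_jk a_k + b_j)
a_components = np.zeros((4, 1))
for j in range(4):
    z_j = sum(w[j, k] * a_prev[k, 0] for k in range(3)) + b[j, 0]
    a_components[j, 0] = sigmoid(z_j)

# Matrix form: z = w a + b, a = sigma(z)
z = np.dot(w, a_prev) + b
a_matrix = sigmoid(z)

print(np.allclose(a_components, a_matrix))
```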

We will also need the *Hadamard* product, also known as the *Schur* product, of two vectors, which is the element-wise product

\begin{eqnarray}
(a\odot b)_i = a_i b_i
\end{eqnarray}
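
In NumPy the element-wise product of two arrays of the same shape is written with the `*` operator, which should not be confused with the dot product. A short sketch with made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

hadamard = a * b    # element-wise product, a vector
dot = np.dot(a, b)  # dot product, a single number

print(hadamard)
print(dot)
```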

Using this notation the cost function $C$ for a single input $x$ is now

\begin{eqnarray}
C = \frac{1}{2}\lVert y(x)-a^L(x)\rVert^2
\end{eqnarray}

Here $y(x)$ is the desired output and $a^L(x)$ is the activation output with input $x$, where
$L$ denotes the number of layers in the network. 

Finally we will need the derivatives of the cost function. At the moment we will consider only
derivatives with respect to $z^l$, because this is the quantity which is most conveniently related
to the backpropagation method. Considering derivatives in terms of biases and weights is also possible, but is 
more naturally connected to the forward method.

We then define the so-called sensitivity or error as

\begin{eqnarray}
\delta^l_j = \frac{\partial C}{\partial z^l_j}
\end{eqnarray}

The error in the output layer can now be written in this notation as

\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)
\end{eqnarray}
where the prime denotes differentiation.
With the quadratic cost function defined above, the first factor is

\begin{eqnarray}
\frac{\partial C}{\partial a^L_j} = \frac{1}{2}\frac{\partial (y_j - a^L_j)^2 }{\partial a^L_j} = (a^L_j-y_j)
\end{eqnarray}

Using our new notation we then have
\begin{eqnarray}
\delta^L = (a^L-y)\odot\sigma'(z^L)
\end{eqnarray}



This is the first step in the computation of the gradient. We will use the backward method of
automatic differentiation to compute all necessary derivatives using the chain rule.

The next step is to express the error in layer no $l$ in terms of the error in layer
no $l+1$.

The error in layer no $l$ is equal to

\begin{eqnarray}
\delta^l_j = \frac{\partial C}{\partial z^l_j}
\end{eqnarray}

Now we use the chain rule as we did in the lectures on automatic differentiation
and write

\begin{eqnarray}
\delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j},
\end{eqnarray}

since $C$ depends on $z^l_j$ only through the components $z^{l+1}_k$ of the next layer, each of which in principle depends on $z^l_j$.

By the definition of the error this is now
\begin{eqnarray}
\delta^l_j = \sum_k \delta^{l+1}_k\frac{\partial z^{l+1}_k}{\partial z^l_j}
\end{eqnarray}

To compute the z derivatives we use the definition of z

\begin{eqnarray}
z^{l+1}_k = \sum_j w^{l+1}_{kj}a^{l}_j+b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma^l_j(z^l_j) + b^{l+1}_k
\end{eqnarray}

So evaluating the derivative we have 

\begin{eqnarray}
  \frac{\partial z^{l+1}_k}{\partial z^l_j} =  w^{l+1}_{kj} {\sigma'}^l_j(z^l_j)
\end{eqnarray}

Inserting in the equation for the error above we have

\begin{eqnarray}
\delta^l_j = \sum_k\delta^{l+1}_k w^{l+1}_{kj} {\sigma'}^l_j(z^l_j)
\end{eqnarray}

In matrix notation this is the same as

\begin{eqnarray}
\delta^l = \left(w^{l+1}\right)^T \delta^{l+1} \odot {\sigma'}^l(z^l)
\end{eqnarray}

We will eventually also need the derivatives of the cost function with respect to the bias of each neuron.

Again we use the definition of z

\begin{eqnarray}
z^l_j = \sum_k w^l_{jk}a^{l-1}_k + b^l_j
\end{eqnarray}

which gives $\partial z^l_j/\partial b^l_j = 1$, so that by the chain rule
\begin{eqnarray}
\frac{\partial C}{\partial b^l_j} & = & \frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial b^l_j}\nonumber\\      
                                  & = & \delta^l_j 
\end{eqnarray}
                                  
We will also need the derivative of the cost function with respect to the weights. From the definition of z we have $\partial z^l_j/\partial w^l_{jk} = a^{l-1}_k$, so that

\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} & = & \frac{\partial C}{\partial z^l_j}
                                          \frac{\partial z^l_j}{\partial w^l_{jk}} \nonumber\\
                                     & = &  \delta^l_j a^{l-1}_k 
\end{eqnarray}
                           
                         
Summarizing we now have all necessary equations to compute the gradient:

 1. $\delta^L = (a^L-y)\odot\sigma'(z^L)$
 2. $\delta^l = \left(w^{l+1}\right)^T \delta^{l+1} \odot {\sigma'}^l(z^l)$
 3. $\frac{\partial C}{\partial b^l_j} = \delta^l_j$
 4. $\frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k$
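
A useful sanity check on the four equations is to compare the gradient they give with a finite-difference estimate on a tiny network. The sketch below is written independently of the `Network` class: the function names, the layer sizes 3, 4 and 2 and the random values are assumptions for this illustration, and the cost is the quadratic cost for a single input.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))


def cost(weights, biases, x, y):
    """Quadratic cost for one input: feed x forward and compare with y."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((y - a) ** 2)


def backprop(weights, biases, x, y):
    """Gradient of the cost using equations 1 to 4."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):            # forward pass
        zs.append(np.dot(w, activations[-1]) + b)
        activations.append(sigmoid(zs[-1]))
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])      # equation 1
    for i in reversed(range(len(weights))):
        grad_b[i] = delta                                      # equation 3
        grad_w[i] = np.dot(delta, activations[i].T)            # equation 4
        if i > 0:                                              # equation 2
            delta = np.dot(weights[i].T, delta) * sigmoid_prime(zs[i - 1])
    return grad_w, grad_b


rng = np.random.default_rng(2)
sizes = [3, 4, 2]  # tiny network, sizes chosen for illustration
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]
x = rng.random((3, 1))
y = np.array([[0.0], [1.0]])

grad_w, grad_b = backprop(weights, biases, x, y)

# Central finite difference for one weight in the first weight matrix
eps = 1e-6
layer, j, k = 0, 1, 2
shifted = [w.copy() for w in weights]
shifted[layer][j, k] += eps
cost_plus = cost(shifted, biases, x, y)
shifted[layer][j, k] -= 2 * eps
cost_minus = cost(shifted, biases, x, y)

numeric = (cost_plus - cost_minus) / (2 * eps)
analytic = grad_w[layer][j, k]
print(numeric, analytic)
```

The two printed numbers should agree to many digits. A check like this is a cheap way to catch index mistakes in a backpropagation routine.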
 
 
 # Python implementation of network
 
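     import numpy as np

     class Network(object):

         def __init__(self, sizes):
             # sizes[i] is the number of neurons in layer i
             self.num_layers = len(sizes)
             self.sizes = sizes
             # one bias column vector per layer, except the input layer
             self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
             # one weight matrix per pair of neighbouring layers
             self.weights = [np.random.randn(y, x)
                             for x, y in zip(sizes[:-1], sizes[1:])]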
                        
 `sizes` is a Python list which contains the number of neurons in each layer. To use the Network class we
 can for example write
        
        net = Network([3,15,10]) 
        
 which creates an object with 3 neurons in the first layer, 15 neurons in the second layer and 10 neurons 
 in the third layer. `biases` is a list of numpy arrays where each list item is a column vector. The input layer has no biases, so `biases[0]`
 contains a numpy vector holding all the biases for the second layer.
 `weights[0]` is a numpy matrix containing the weights connecting the first and second layer.
 The weights and biases are initialized with random numbers drawn from a standard normal distribution.
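
The shapes of these arrays can be inspected without the rest of the class. The sketch below repeats the two list comprehensions from `__init__` for the sizes `[3, 15, 10]`; the variables `bias_shapes` and `weight_shapes` are introduced here only for printing.

```python
import numpy as np

sizes = [3, 15, 10]

# Same construction as in Network.__init__
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(15, 1), (10, 1)]
print(weight_shapes)  # [(15, 3), (10, 15)]
```

Each weight matrix has one row per neuron in the receiving layer and one column per neuron in the sending layer, which is what makes `np.dot(w, a) + b` work in the feed-forward step.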
 
 The Sigmoid function is defined by a small function
 
     def sigmoid(z):
         """The sigmoid function."""
         return 1.0/(1.0+np.exp(-z))
    
 The action of the network is implemented as
 
    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
 
 The most complicated routine is the stochastic gradient search routine, which in addition
 to computing weights and biases using the training data also runs a test using the test data set.
 
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        
        training_data = list(training_data)
        n = len(training_data)

        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {} : {} / {}".format(j,self.evaluate(test_data),n_test))
            else:
                print("Epoch {} complete".format(j))
                
  `training_data` is a list of pairs of input and desired output data. `epochs` is the number of
  passes made over the training data, and `mini_batch_size` is the size of each subset of the data used in the gradient
  computation. `eta` is the step length, or as it is called here, the learning rate.
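
How the training data is cut into mini batches can be seen on a small list. The sketch below uses the same shuffle and slicing as `SGD`, with the integers 0 to 9 standing in for the training pairs and a mini-batch size of 4 chosen for illustration.

```python
import random

training_data = list(range(10))  # stand-in for the list of (x, y) pairs
mini_batch_size = 4
n = len(training_data)

random.seed(0)                   # fixed seed only to make the run repeatable
random.shuffle(training_data)
mini_batches = [
    training_data[k:k + mini_batch_size]
    for k in range(0, n, mini_batch_size)]

batch_sizes = [len(batch) for batch in mini_batches]
print(mini_batches)
print(batch_sizes)
```

Every item ends up in exactly one mini batch, and the last mini batch is smaller when the mini-batch size does not divide the number of items.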
  
  The `update_mini_batch` routine does the actual steepest descent step and updates
  the weights and biases. However, the actual gradient computation is hidden in the 
  `backprop` function.
  
        def update_mini_batch(self, mini_batch, eta):
            """Update the network's weights and biases by applying
            gradient descent using backpropagation to a single mini batch.
            The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
            is the learning rate."""
            nabla_b = [np.zeros(b.shape) for b in self.biases]
            nabla_w = [np.zeros(w.shape) for w in self.weights]
            for x, y in mini_batch:
                delta_nabla_b, delta_nabla_w = self.backprop(x, y)
                nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
                nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
            self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
            self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
                       
   The backpropagation routine is a straightforward implementation of the algorithm we 
   have described. Except for some differences in numbering the implementation below follows
   the computations we outlined for the gradient.
                       
    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

To start with, all z vectors and activations are computed for all layers. This is the forward
step. After that the backpropagation starts by computing the error in the output layer, using the
`cost_derivative` function for the derivative of the cost function with respect to the output activations.
Then the gradients for the weights and biases are computed layer by layer, starting with the last layer. The
expressions in the last loop follow the formulas developed for the gradients.
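
The routine relies on `sigmoid_prime`, the derivative of the Sigmoid function, which can be written as $\sigma'(z) = \sigma(z)(1-\sigma(z))$. The sketch below compares this expression with a central finite difference; the grid of $z$ values and the step `h` are chosen here for illustration.

```python
import numpy as np


def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))


z = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_error = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_error)
```

The derivative is largest at $z=0$, where it equals 0.25, and becomes very small for large positive or negative $z$.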
    
Below, the entire `Network` class, the MNIST loader and a script that runs the network are collected in one cell.
 
   

In [1]:
"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]


    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
    
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""

        training_data = list(training_data)
        n = len(training_data)

        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {} : {} / {}".format(j,self.evaluate(test_data),n_test))
            else:
                print("Epoch {} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))


"""
mnist_loader
~~~~~~~~~~~~
A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import pickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.
    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.
    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.
    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.
    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding="latin1")
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.
    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.
    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.
    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
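    # Reshape each flat 784-pixel image into a (784, 1) column vector.
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    # zip() returns a one-shot iterator in Python 3; list() makes the data
    # reusable and measurable with len(), as the docstring promises.
    training_data = list(zip(training_inputs, training_results))
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = list(zip(validation_inputs, va_d[1]))
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = list(zip(test_inputs, te_d[1]))
    return (training_data, validation_data, test_data)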

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

#Script to run the network

#Load the training and test data
training_data, validation_data, test_data = load_data_wrapper()

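#Create the network: 784 inputs, one internal (hidden) layer of 30 neurons, 10 outputs
net = Network([784, 30, 10])
#Train for 30 epochs with mini-batch size 10 and learning rate 1.0,
#testing on test_data after each epoch
net.SGD(training_data, 30, 10, 1.0, test_data=test_data)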







Epoch 0 : 8744 / 10000
Epoch 1 : 9027 / 10000
Epoch 2 : 9154 / 10000
Epoch 3 : 9194 / 10000
Epoch 4 : 9238 / 10000
Epoch 5 : 9293 / 10000
Epoch 6 : 9302 / 10000
Epoch 7 : 9339 / 10000
Epoch 8 : 9343 / 10000
Epoch 9 : 9358 / 10000
Epoch 10 : 9348 / 10000
Epoch 11 : 9378 / 10000
Epoch 12 : 9373 / 10000
Epoch 13 : 9376 / 10000
Epoch 14 : 9412 / 10000
Epoch 15 : 9419 / 10000
Epoch 16 : 9406 / 10000
Epoch 17 : 9410 / 10000
Epoch 18 : 9417 / 10000
Epoch 19 : 9411 / 10000
Epoch 20 : 9424 / 10000
Epoch 21 : 9427 / 10000
Epoch 22 : 9449 / 10000
Epoch 23 : 9438 / 10000
Epoch 24 : 9445 / 10000
Epoch 25 : 9446 / 10000
Epoch 26 : 9433 / 10000
Epoch 27 : 9425 / 10000
Epoch 28 : 9435 / 10000
Epoch 29 : 9459 / 10000
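
As a small illustration of the data format that `load_data_wrapper()` produces, the sketch below builds one input column vector and one desired output vector by hand. The all-zero image and the helper name `one_hot` are assumptions made for this example only; `one_hot` mirrors what `vectorized_result` does.

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Column vector with 1.0 at index `digit` and 0.0 elsewhere."""
    e = np.zeros((num_classes, 1))
    e[digit] = 1.0
    return e

# Stand-in for one flattened 28x28 MNIST image (all zeros here).
flat_image = np.zeros(784)
x = np.reshape(flat_image, (784, 1))  # column vector fed to the input layer
y = one_hot(3)                        # desired output for the digit 3

print(x.shape)            # (784, 1)
print(y.shape)            # (10, 1)
print(int(np.argmax(y)))  # 3
```

The position of the single 1.0 in `y` identifies the digit, so `np.argmax` recovers it.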


For each epoch, the code performs two tasks:

1. Train the network by adjusting the weights and biases.
2. Run the network on the test data with 10000 handwritten digits.

This gives the output:
    
    Epoch 0 : 9062 / 10000
    Epoch 1 : 9248 / 10000
    Epoch 2 : 9348 / 10000
    Epoch 3 : 9375 / 10000
    Epoch 4 : 9394 / 10000
    Epoch 5 : 9391 / 10000
    Epoch 6 : 9438 / 10000
    Epoch 7 : 9431 / 10000
    Epoch 8 : 9422 / 10000
    Epoch 9 : 9467 / 10000
    Epoch 10 : 9476 / 10000
    Epoch 11 : 9488 / 10000
    Epoch 12 : 9468 / 10000
    Epoch 13 : 9449 / 10000
    Epoch 14 : 9490 / 10000
    Epoch 15 : 9486 / 10000
    Epoch 16 : 9491 / 10000
    Epoch 17 : 9499 / 10000
    Epoch 18 : 9459 / 10000
    Epoch 19 : 9471 / 10000
    Epoch 20 : 9475 / 10000
    Epoch 21 : 9498 / 10000
    Epoch 22 : 9494 / 10000
    Epoch 23 : 9495 / 10000
    Epoch 24 : 9488 / 10000
    Epoch 25 : 9499 / 10000
    Epoch 26 : 9504 / 10000
    Epoch 27 : 9478 / 10000
    Epoch 28 : 9488 / 10000
    Epoch 29 : 9513 / 10000
    
The output above gives the epoch number and the number of correctly recognized digits.
Already after the first epoch the network recognizes 9062 of the 10000 digits.
After 30 epochs this number has increased to 9513. Not all digits are recognized, but the success rate
is about 95 percent.
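
The count of correctly recognized digits can be sketched as follows, assuming the predicted digit is the index of the largest of the 10 output activations. The three output vectors are made up for the example, and `count_correct` is a hypothetical helper name.

```python
import numpy as np

def count_correct(outputs, labels):
    """Count the outputs whose largest activation sits at the labelled digit."""
    return sum(int(np.argmax(out) == label) for out, label in zip(outputs, labels))

# Three made-up output vectors (10 activations each) and their true digits.
outputs = [np.zeros((10, 1)) for _ in range(3)]
outputs[0][7] = 0.9   # predicts 7
outputs[1][2] = 0.8   # predicts 2
outputs[2][5] = 0.6   # predicts 5
labels = [7, 2, 4]    # the third prediction is wrong

correct = count_correct(outputs, labels)
print(f"{correct} / {len(labels)}")   # 2 / 3

# Success rate for the last epoch of the run above.
print(f"{9513 / 10000:.2%}")          # 95.13%
```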

It would be interesting to vary the network parameters. In the example above we use 30 neurons in
the internal (hidden) layer and only a selection of 10000 digits to train the network. The learning rate (step length) is
set to 3.0.

If we use more neurons, will the result improve? Let us try 100 internal neurons
and keep the other parameters unchanged.

The output is:
    
    Epoch 0 : 6466 / 10000
    Epoch 1 : 7040 / 10000
    Epoch 2 : 7599 / 10000
    Epoch 3 : 7586 / 10000
    Epoch 4 : 7683 / 10000
    Epoch 5 : 7707 / 10000
    Epoch 6 : 7721 / 10000
    Epoch 7 : 7724 / 10000
    Epoch 8 : 7724 / 10000
    Epoch 9 : 7730 / 10000
    Epoch 10 : 7762 / 10000
    Epoch 11 : 7761 / 10000
    Epoch 12 : 7770 / 10000
    Epoch 13 : 7766 / 10000
    Epoch 14 : 7773 / 10000
    Epoch 15 : 7754 / 10000
    Epoch 16 : 7774 / 10000
    Epoch 17 : 7770 / 10000
    Epoch 18 : 7786 / 10000
    Epoch 19 : 7778 / 10000
    Epoch 20 : 7782 / 10000
    Epoch 21 : 7774 / 10000
    Epoch 22 : 7799 / 10000
    Epoch 23 : 7794 / 10000
    Epoch 24 : 7797 / 10000
    Epoch 25 : 7820 / 10000
    Epoch 26 : 7821 / 10000
    Epoch 27 : 7814 / 10000
    Epoch 28 : 7854 / 10000
    Epoch 29 : 7884 / 10000

Surprisingly, the network performs worse: after 30 epochs it recognizes only 7884 of the 10000 digits, a success rate of about 79 percent.
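
To get a feeling for how much the network grows with the size of the internal layer, the sketch below counts its adjustable parameters. It assumes fully connected layers with one bias per neuron outside the input layer; `count_parameters` is a hypothetical helper, not part of the book's code.

```python
def count_parameters(sizes):
    """Number of weights plus biases in a fully connected network.

    `sizes` lists the number of neurons per layer, e.g. [784, 30, 10].
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

for hidden in (10, 30, 100):
    print(hidden, count_parameters([784, hidden, 10]))
# 10 7960
# 30 23860
# 100 79510
```

The count does not by itself explain the drop in accuracy. It only shows that with 100 internal neurons more than three times as many parameters have to be trained with the same number of epochs and the same step length.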

If we use 10 neurons the result is surprisingly good, with a success rate of about 90 percent.

    Epoch 0 : 8762 / 10000
    Epoch 1 : 8911 / 10000
    Epoch 2 : 8936 / 10000
    Epoch 3 : 8983 / 10000
    Epoch 4 : 8969 / 10000
    Epoch 5 : 9002 / 10000
    Epoch 6 : 9043 / 10000
    Epoch 7 : 9028 / 10000
    Epoch 8 : 9024 / 10000
    Epoch 9 : 9053 / 10000
    Epoch 10 : 8999 / 10000
    Epoch 11 : 9044 / 10000
    Epoch 12 : 9097 / 10000
    Epoch 13 : 9025 / 10000
    Epoch 14 : 9103 / 10000
    Epoch 15 : 9060 / 10000
    Epoch 16 : 9120 / 10000
    Epoch 17 : 9097 / 10000
    Epoch 18 : 9100 / 10000
    Epoch 19 : 9111 / 10000
    Epoch 20 : 9090 / 10000
    Epoch 21 : 9068 / 10000
    Epoch 22 : 9083 / 10000
    Epoch 23 : 9095 / 10000
    Epoch 24 : 9036 / 10000
    Epoch 25 : 9090 / 10000
    Epoch 26 : 9119 / 10000
    Epoch 27 : 9130 / 10000
    Epoch 28 : 9065 / 10000
    Epoch 29 : 9056 / 10000
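
The three runs with different numbers of internal neurons can be put side by side with a few lines of Python. The counts are copied from the first and last epoch of the outputs above; the dictionary layout is just one convenient choice for this sketch.

```python
# (first epoch, last epoch) correct counts out of 10000, copied from the runs above.
runs = {
    30: (9062, 9513),
    100: (6466, 7884),
    10: (8762, 9056),
}
TEST_SIZE = 10000

for hidden, (first, last) in sorted(runs.items()):
    print(f"{hidden:>3} internal neurons: {first / TEST_SIZE:.1%} -> {last / TEST_SIZE:.1%}")
#  10 internal neurons: 87.6% -> 90.6%
#  30 internal neurons: 90.6% -> 95.1%
# 100 internal neurons: 64.7% -> 78.8%

# Number of internal neurons with the highest final count.
best = max(runs, key=lambda h: runs[h][1])
print(best)   # 30
```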
    
We could also investigate the step length. If we reduce the step length to 1.0,
convergence is slower, as expected: after the first epoch only 7400 digits are recognized, and it takes all 30 epochs to reach 9450.

    Epoch 0 : 7400 / 10000
    Epoch 1 : 8214 / 10000
    Epoch 2 : 8322 / 10000
    Epoch 3 : 9167 / 10000
    Epoch 4 : 9220 / 10000
    Epoch 5 : 9278 / 10000
    Epoch 6 : 9301 / 10000
    Epoch 7 : 9324 / 10000
    Epoch 8 : 9351 / 10000
    Epoch 9 : 9340 / 10000
    Epoch 10 : 9366 / 10000
    Epoch 11 : 9381 / 10000
    Epoch 12 : 9384 / 10000
    Epoch 13 : 9395 / 10000
    Epoch 14 : 9406 / 10000
    Epoch 15 : 9405 / 10000
    Epoch 16 : 9420 / 10000
    Epoch 17 : 9418 / 10000
    Epoch 18 : 9406 / 10000
    Epoch 19 : 9427 / 10000
    Epoch 20 : 9428 / 10000
    Epoch 21 : 9442 / 10000
    Epoch 22 : 9445 / 10000
    Epoch 23 : 9440 / 10000
    Epoch 24 : 9440 / 10000
    Epoch 25 : 9439 / 10000
    Epoch 26 : 9449 / 10000
    Epoch 27 : 9442 / 10000
    Epoch 28 : 9440 / 10000
    Epoch 29 : 9450 / 10000
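
The effect of the step length can be illustrated on a much simpler problem than the network. The sketch below runs plain gradient descent on the toy cost function $C(w) = w^2$, which is chosen purely for illustration. The step lengths 0.1 and 0.3 are arbitrary and not comparable to the values 1.0 and 3.0 used for the network.

```python
def gradient_descent(eta, steps, w0=1.0):
    """Minimise C(w) = w**2 starting from w0; the gradient is 2*w."""
    w = w0
    history = []
    for _ in range(steps):
        w = w - eta * 2.0 * w   # step against the gradient
        history.append(w * w)   # cost after the update
    return history

slow = gradient_descent(eta=0.1, steps=10)
fast = gradient_descent(eta=0.3, steps=10)
print(f"eta=0.1: cost after 10 steps = {slow[-1]:.2e}")
print(f"eta=0.3: cost after 10 steps = {fast[-1]:.2e}")
```

With the smaller step length the cost decreases more slowly. A larger step is not always better, though: in this toy problem a step length above 1.0 makes the iteration diverge.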