# Programming for Data Science and Artificial Intelligence

## 14 Deep Neural Network from Scratch

### Readings

- [WEIDMAN] Ch3
- [CHARU] Ch2-3



As a recap, last time we fed our data into a linear function, which gave a decent result.  To improve on it, we inserted a non-linear function followed by another linear function, which improved the result because it helps the model capture non-linearity.  We can summarize that a neural network relies on the following key ingredients:

1. **Activation functions (which we can generalize as Operations)**: these functions let the model capture non-linear relationships in the input data

2. **Chain rule / Backpropagation**: these are essential for improving the neural network

3. **Layers of neurons**: they perform a sequence of operations that produce the desired output.

Putting these together, the typical procedure for training a neural network is as follows:

1. Feed observations/samples/records (X) into the model.  This step is called the "**forward pass**"

2. Calculate the loss

3. Calculate the gradients, i.e., how each parameter (e.g., W, B) affects the loss, using the chain rule.  This step is called the "**backward pass**"

4. Update the parameters (e.g., W, B) so that the loss will hopefully be reduced in the next iteration.  This step is called "**training**"

5. Stop when the loss no longer decreases by more than some tolerance level (e.g., 0.00001) or when the number of iterations exceeds the specified maximum.  We sometimes call this "**early stopping**"
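
The five steps above can be sketched with plain NumPy on a linear model.  This is only an illustration: the synthetic data, the learning rate `lr`, the tolerance `tol`, and `max_iter` are all assumed values, not part of the lesson's dataset.

```python
import numpy as np

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_W = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_W + 0.3

W = np.zeros((3, 1))
B = np.zeros((1, 1))
lr, tol, max_iter = 0.1, 1e-5, 1000   # assumed hyperparameters
prev_loss = np.inf
losses = []

for i in range(max_iter):
    # 1. forward pass
    pred = X @ W + B

    # 2. calculate the loss (mean squared error)
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)

    # 5. stop when the loss no longer improves by more than tol
    if prev_loss - loss < tol:
        break
    prev_loss = loss

    # 3. backward pass: gradients of the loss via the chain rule
    dpred = 2.0 * (pred - y) / X.shape[0]
    dW = X.T @ dpred
    dB = dpred.sum(axis=0, keepdims=True)

    # 4. update the parameters
    W -= lr * dW
    B -= lr * dB

print(f"stopped after {i + 1} iterations, loss = {losses[-1]:.6f}")
```

The rest of this lesson repackages exactly these steps into reusable classes.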

In fact, you are now very close to understanding Deep Neural Networks.  In this lesson, we have several objectives:

- From our low-level understanding of neural networks, we shall code them up as Python classes so that they are reusable.  These classes will be essential for understanding deep neural networks, CNNs, and RNNs.  You may be surprised that all these fancy terms are simply layers upon layers

- When we code our work, we want these classes to resemble PyTorch as much as possible, so that you will understand PyTorch right away.

- Of course, we shall also understand what a "deep" neural network is.  For now, we shall simply say that a "deep" neural network is a neural network that has more than one hidden layer (we have not yet defined what a "hidden" layer is)

### 1. Operations

Let's first code up the first building block, the class <code>Operation</code>, which represents the operations/functions.

Each operation has a **forward** and a **backward** method: the forward method runs the function, and the backward method calculates its gradients.

Each of these operations receives an <code>ndarray</code> as input and outputs an <code>ndarray</code>.  Some operations, such as matrix multiplication, also receive an <code>ndarray</code> as <code>params</code>, so we should have another class that inherits from <code>Operation</code> and stores the params as another instance variable.

We also need to note that the shape of the output may vary.  For example, in matrix multiplication, the shape of the output generally differs from the shape of the input, whereas in sigmoid, the input and output share the same shape.  To make sure the shapes are consistent, we can rely on these facts:

1. Each Operation will send outputs forward on the forward pass and will receive an “output gradient” on the backward pass, which will represent the partial derivative of the loss with respect to every element of the Operation’s output.  Thus, **the shape of the output gradient ndarray must match the shape of the output.**

2. On the backward pass, each Operation will send an “input gradient” backward, representing the partial derivative of the loss with respect to each element of the input.  **The shape of the input gradient that the Operation sends backward during the backward pass must match the shape of the Operation’s input.**

![](figures/3-1.png)

![](figures/3-2.png)
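
As a quick sanity check of the two shape rules, here is a small sketch for a matrix multiplication.  The sizes `N`, `D`, `M` are arbitrary assumptions, and the all-ones `output_grad` merely stands in for whatever gradient would arrive from downstream.

```python
import numpy as np

N, D, M = 4, 3, 2                    # hypothetical sizes
X = np.random.randn(N, D)            # input, shape (N, D)
W = np.random.randn(D, M)            # parameter, shape (D, M)

output = X @ W                       # forward pass, shape (N, M)
output_grad = np.ones_like(output)   # placeholder for the downstream gradient

input_grad = output_grad @ W.T       # sent backward, shape (N, D)
param_grad = X.T @ output_grad       # gradient for W, shape (D, M)

# rule 1: the output gradient matches the output
assert output_grad.shape == output.shape
# rule 2: the input gradient matches the input
assert input_grad.shape == X.shape
# the same idea applies to parameters
assert param_grad.shape == W.shape
```

Note that the output shape `(N, M)` differs from the input shape `(N, D)`, yet each gradient still matches the array it belongs to.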

Based on this, we can write the class Operation like this:

In [1]:
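import numpy as np
from numpy import ndarray

#assumed helper (its definition is not shown in these notes):
#checks that an array and its gradient share the same shape
def assert_same_shape(array: ndarray, array_grad: ndarray):
    assert array.shape == array_grad.shape, \
        "Shapes must match; got {0} and {1}".format(
            array.shape, array_grad.shape)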

class Operation(object):
  
    #nothing to init
    def __init__(self):
        pass

    #forward receive ndarray as input
    def forward(self, input_: ndarray) -> ndarray:
        #put trailing _ to avoid naming conflict
        self.input_ = input_

        #self._output() uses self.input_ to calculate the output
        #the leading _ marks the method as internal use
        self.output = self._output()

        return self.output

    
    def backward(self, output_grad: ndarray) -> ndarray:
        
        #make sure output and output_grad has same shape
        assert_same_shape(self.output, output_grad)
        
        #perform input grad based on output_grad
        self.input_grad = self._input_grad(output_grad)
        
        #input grad must have same shape as input
        assert_same_shape(self.input_, self.input_grad)
        
        return self.input_grad

    def _output(self) -> ndarray:
        raise NotImplementedError()
        
    def _input_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError()

Let's also add another class that inherits from <code>Operation</code>, which we’ll use specifically for Operations that involve parameters.

In [2]:
class ParamOperation(Operation):
    def __init__(self, param: ndarray):
        super().__init__()  #inherit from parent if any
        self.param = param  #this will be used in _output

    def backward(self, output_grad: ndarray) -> ndarray:
        
        #make sure output and output_grad has same shape
        assert_same_shape(self.output, output_grad)

        #perform gradients for both input and param
        self.input_grad = self._input_grad(output_grad)
        self.param_grad = self._param_grad(output_grad)

        assert_same_shape(self.input_, self.input_grad)
        assert_same_shape(self.param, self.param_grad)

        return self.input_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        raise NotImplementedError()

Let's implement some of the functions that we implemented in the last class, including:
1. Matrix multiplication
2. Addition of bias term
3. Sigmoid activation function

Let's start with matrix multiplication.  Since it operates on the input X and a parameter W, we inherit from <code>ParamOperation</code>.

In [3]:
class WeightMultiply(ParamOperation):

    def __init__(self, W: ndarray):
        #initialize Operation with self.param = W
        super().__init__(W)

    def _output(self) -> ndarray:
        return self.input_ @ self.param

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad @ self.param.T  #same as last class

    def _param_grad(self, output_grad: ndarray)  -> ndarray:
        return self.input_.T @ output_grad  #same as last class
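
If you want to convince yourself that these two formulas are right, a finite-difference check is a handy tool.  The sketch below is independent of the classes above: `loss_fn` and `numeric_grad` are hypothetical helpers, and the loss is assumed to be the plain sum of all outputs, so the output gradient is all ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

def loss_fn(X, W):
    # hypothetical scalar loss: the sum of all outputs of X @ W
    return np.sum(X @ W)

# analytic gradients, same formulas as in WeightMultiply
output_grad = np.ones((4, 2))
analytic_input_grad = output_grad @ W.T
analytic_param_grad = X.T @ output_grad

def numeric_grad(f, arr, eps=1e-6):
    # central finite differences, perturbing one element of arr at a time
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        orig = arr[idx]
        arr[idx] = orig + eps
        plus = f()
        arr[idx] = orig - eps
        minus = f()
        arr[idx] = orig          # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

numeric_input_grad = numeric_grad(lambda: loss_fn(X, W), X)
numeric_param_grad = numeric_grad(lambda: loss_fn(X, W), W)

print(np.allclose(analytic_input_grad, numeric_input_grad, atol=1e-5))
print(np.allclose(analytic_param_grad, numeric_param_grad, atol=1e-5))
```

The same trick works for any Operation: perturb one element, rerun the forward pass, and compare against what the backward method returns.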

Next is the BiasAdd operation, where the local gradients are simply ones.  Since it is an operation between the input X and a parameter B, we inherit from <code>ParamOperation</code>.

In [4]:
class BiasAdd(ParamOperation):
    def __init__(self, B: ndarray):
        #initialize Operation with self.param = B
        assert B.shape[0] == 1  #B must be a row vector of shape (1, num_neurons)
        super().__init__(B)

    def _output(self) -> ndarray:
        return self.input_ + self.param

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return np.ones_like(self.input_) * output_grad

    def _param_grad(self, output_grad: ndarray) -> ndarray:
        param_grad = np.ones_like(self.param) * output_grad
        return np.sum(param_grad, axis=0).reshape(1, param_grad.shape[1])
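
The sum over `axis=0` in `_param_grad` may look odd at first.  The small sketch below (with made-up numbers) shows why it is needed: B is broadcast across every row of the input, so each bias element contributes to every observation, and its gradient is the sum of the output gradients over the rows.

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(3, 2)   # 3 observations, 2 neurons
B = np.array([[10.0, 20.0]])                  # shape (1, 2)

output = X + B                                # B is broadcast over the 3 rows
output_grad = np.ones_like(output)            # placeholder downstream gradient

# same formulas as in BiasAdd
input_grad = np.ones_like(X) * output_grad
param_grad = np.ones_like(B) * output_grad    # broadcasts to shape (3, 2)
param_grad = np.sum(param_grad, axis=0).reshape(1, param_grad.shape[1])

print(input_grad.shape)   # (3, 2), matches X
print(param_grad)         # [[3. 3.]], one entry per bias element
```

Without the sum, the gradient would have shape `(3, 2)` and would fail the shape check against B.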

Finally, let's do sigmoid.  Since sigmoid is simply an operation that maps each input element to another value and has no parameters, it inherits from <code>Operation</code>:

In [5]:
class Sigmoid(Operation):
    def __init__(self):
        super().__init__()

    def _output(self) -> ndarray:
        return 1.0/(1.0 + np.exp(-1.0 * self.input_))

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        sigmoid_backward = self.output * (1.0 - self.output)
        input_grad = sigmoid_backward * output_grad
        return input_grad
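
The backward formula relies on the identity that the derivative of sigmoid is `s * (1 - s)`, where `s` is the sigmoid output.  A minimal numerical check, using a standalone `sigmoid` function rather than the class above, might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])

# analytic derivative, as used in Sigmoid._input_grad
s = sigmoid(x)
analytic = s * (1.0 - s)

# central finite difference for comparison
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
print(analytic[1])   # 0.25, the maximum slope, reached at x = 0
```

This is also why the forward pass stores `self.output`: the backward pass can reuse it instead of recomputing the exponential.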

### 2. Layers

In terms of <code>Operations</code>, <code>layers</code> are a series of linear operations followed by a nonlinear operation. For example, our neural network from the last chapter could be said to have had five total operations: two linear operations (a weight multiplication and the addition of a bias term), followed by the sigmoid function, and then two more linear operations.

![](figures/layer.png)

Here, we call the input the **input layer**.  Layer 1 is typically called the **hidden layer** because it is the only layer whose values we don't typically see explicitly during the course of training.  Layer 2 is typically called the **output layer**, which outputs the desired value.

By abstraction, we can make the neural network look much simpler, as follows:

![](figures/layer2.png)

Each layer can be said to have a certain number of neurons equal to the dimensionality of the vector that represents each observation in the layer’s output. The neural network from the last class can thus be thought of as having 13 neurons in the input layer (i.e., 13 features), then 13 neurons (again) in the hidden layer, and one neuron in the output layer.
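
To make the neuron counts concrete, the sketch below traces the array shapes through the five operations of that 13-13-1 network.  The batch size of 5 and the random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 13))          # 5 observations, 13 features

# hidden layer parameters: 13 inputs -> 13 neurons
W1 = rng.normal(size=(13, 13))
B1 = rng.normal(size=(1, 13))
# output layer parameters: 13 inputs -> 1 neuron
W2 = rng.normal(size=(13, 1))
B2 = rng.normal(size=(1, 1))

M1 = X @ W1                           # op 1: weight multiply, shape (5, 13)
N1 = M1 + B1                          # op 2: bias add,        shape (5, 13)
O1 = 1.0 / (1.0 + np.exp(-N1))        # op 3: sigmoid,         shape (5, 13)
M2 = O1 @ W2                          # op 4: weight multiply, shape (5, 1)
P = M2 + B2                           # op 5: bias add,        shape (5, 1)

n_params = W1.size + B1.size + W2.size + B2.size
print(O1.shape, P.shape)              # (5, 13) (5, 1)
print(n_params)                       # 169 + 13 + 13 + 1 = 196
```

The second dimension of each layer's output is its number of neurons: 13 for the hidden layer and 1 for the output layer.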

Neurons in the brain have the property that they can receive inputs from many other neurons and will “fire” and send a signal forward only if the signals they receive cumulatively reach a certain “activation energy.” Neurons in the context of neural networks have a loosely analogous property: they do indeed send signals forward based on their inputs, but the inputs are transformed into outputs simply via a nonlinear function. Thus, this nonlinear function is called the activation function, and the values that come out of it are called the activations for that layer.

**Building on the context of layers, deep learning models are simply neural networks with more than one hidden layer.**

Now leaving all the theory behind, let's code the Layer together:

In [6]:
from typing import List  #needed for the List[...] annotations below

class Layer(object):
    def __init__(self, neurons: int):
        self.neurons = neurons
        self.first = True   #True until the layer is set up on its first forward pass
        self.params: List[ndarray] = []
        self.param_grads: List[ndarray] = []
        self.operations: List[Operation] = []

    def _setup_layer(self, input_: ndarray):
        #setup the series of operations, using input_ to infer the shapes
        raise NotImplementedError()

    def forward(self, input_: ndarray) -> ndarray:
        #setup self.operations if haven't
        if self.first:
            self._setup_layer(input_)
            self.first = False

        self.input_ = input_

        #run the series of operations
        for operation in self.operations:
            input_ = operation.forward(input_)

        self.output = input_

        return self.output

    def backward(self, output_grad: ndarray) -> ndarray:
        
        assert_same_shape(self.output, output_grad)

        for operation in reversed(self.operations):
            output_grad = operation.backward(output_grad)

        input_grad = output_grad
        
        self._param_grads()

        return input_grad

    #if the operation is an instance of ParamOperation,
    #append its param_grad to self.param_grads
    def _param_grads(self):
        self.param_grads = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.param_grads.append(operation.param_grad)

    #collect the params of every ParamOperation in the layer
    def _params(self):
        self.params = []
        for operation in self.operations:
            if isinstance(operation, ParamOperation):
                self.params.append(operation.param)

Now, let's create our layer.  Remember that we have three layers:

1. Input layer
2. Hidden layer
3. Output layer

We don't really need to implement the input layer since it's only the input.  

As for our hidden layer, it is composed of WeightMultiply, then BiasAdd, then Sigmoid.  What name should we give to this layer?  How about a LinearNonLinear layer?  In fact, the common name for it is "**Dense/Fully-Connected Layer**", which refers to a layer where each output neuron is a function of all of the input neurons.  Imagine thirteen circles, each connected to all circles in the next layer (that's why it's called fully-connected).

Our output layer is very similar to the hidden layer but without the nonlinear activation.  We still consider it a **Dense** layer because each output neuron is again connected to all input neurons.

Coding this is simple: we inherit from <code>Layer</code> and define the series of operations in the <code>_setup_layer</code> function.

In [7]:
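class Dense(Layer):
    def __init__(self, neurons: int,
                 activation: Operation = None):
        #define the desired non-linear function as activation
        super().__init__(neurons)
        #create a fresh Sigmoid per layer; a default argument of Sigmoid()
        #would be evaluated once and shared by every Dense layer
        self.activation = activation if activation is not None else Sigmoid()

    def _setup_layer(self, input_: ndarray):
        #in case you want reproducible results; the seed attribute is
        #optional and assumed to be attached to the layer from outside
        seed = getattr(self, "seed", None)
        if seed is not None:
            np.random.seed(seed)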

        self.params = []

        # randomize weights of shape (num_feature, num_neurons)
        self.params.append(np.random.randn(input_.shape[1], self.neurons))

        # randomize bias of shape (1, num_neurons)
        self.params.append(np.random.randn(1, self.neurons))

        self.operations = [WeightMultiply(self.params[0]),
                           BiasAdd(self.params[1]),
                           self.activation]
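
The key design idea here is lazy setup: the layer does not know how many input features it will receive until the first forward pass, so the weights are created at that point.  The sketch below isolates that idea in a stripped-down, forward-only class.  `TinyDense` is a hypothetical name used only for this illustration; it is not part of the classes above, and it takes the seed as a constructor argument for simplicity.

```python
import numpy as np

class TinyDense:
    """Hypothetical, forward-only stand-in for Dense (illustration only)."""

    def __init__(self, neurons, seed=None):
        self.neurons = neurons
        self.seed = seed
        self.first = True            # no parameters exist yet

    def forward(self, x):
        if self.first:
            # infer the number of input features from the first batch
            rng = np.random.default_rng(self.seed)
            self.W = rng.normal(size=(x.shape[1], self.neurons))
            self.B = rng.normal(size=(1, self.neurons))
            self.first = False
        # weight multiply, bias add, then sigmoid
        return 1.0 / (1.0 + np.exp(-(x @ self.W + self.B)))

layer = TinyDense(neurons=4, seed=0)
out = layer.forward(np.zeros((2, 13)))     # first call creates W and B
W_first = layer.W
out2 = layer.forward(np.ones((5, 13)))     # later calls reuse them

print(layer.W.shape, layer.B.shape)        # (13, 4) (1, 4)
print(out.shape, out2.shape)               # (2, 4) (5, 4)
```

The batch size can change between calls, but the number of features cannot, because the weight shape is fixed after the first forward pass.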

### 3. Loss Class

The next thing we have to code up is the loss function (forward) and its gradient (backward).  We are going to make a parent class called <code>Loss</code> and a child class called <code>MeanSquaredError</code>.  The code is quite straightforward and similar to Layers.

In [8]:
class Loss(object):
   
    def __init__(self):
        pass

    def forward(self, prediction: ndarray, target: ndarray) -> float:
        assert_same_shape(prediction, target)

        self.prediction = prediction
        self.target = target
        
        #self._output() computes the value of the loss function
        loss_value = self._output()

        return loss_value

    def backward(self) -> ndarray:

        self.input_grad = self._input_grad()

        assert_same_shape(self.prediction, self.input_grad)

        #input_grad will hold the gradient of the loss function
        return self.input_grad

    def _output(self) -> float:
        raise NotImplementedError()

    def _input_grad(self) -> ndarray:
        raise NotImplementedError()

Now that we have the generic Loss/Objective/Cost class, let's make a concrete loss function.  Here we will use <code>MeanSquaredError</code>.

In [9]:
class MeanSquaredError(Loss):

    def __init__(self):
        super().__init__()

    def _output(self) -> float:
        loss = (
            np.sum(np.power(self.prediction - self.target, 2)) / 
            self.prediction.shape[0]
        )

        return loss

    def _input_grad(self) -> ndarray:
        return 2.0 * (self.prediction - self.target) / self.prediction.shape[0]
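
A tiny worked example, with made-up predictions and targets, shows what these two formulas compute.  It uses plain NumPy rather than the class, so it can be run on its own.

```python
import numpy as np

prediction = np.array([[1.0], [2.0], [3.0]])
target = np.array([[1.0], [1.0], [1.0]])
n = prediction.shape[0]

# forward: mean of the squared errors, (0 + 1 + 4) / 3
loss = np.sum(np.power(prediction - target, 2)) / n

# backward: derivative of the loss with respect to each prediction
grad = 2.0 * (prediction - target) / n

print(loss)          # 5/3, about 1.6667
print(grad.ravel())  # 0, 2/3 and 4/3
```

The gradient has the same shape as the prediction, which is exactly what `assert_same_shape` checks in `Loss.backward`, and it is zero where the prediction already equals the target.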