**Traditional feedforward neural network from scratch**

This is my attempt at building a fully-connected multi-layer perceptron (MLP), i.e. a feed-forward network, from scratch.


**Philosophy of AI**

Part of why the realm of AI excites me is that it raises many questions that intertwine with philosophy. Thought-provoking questions arise, such as "Can machines think?" and "What is it to think?". An insightful answer to the former question, as put by Hubert Dreyfus: "if the nervous system obeys the laws of physics and chemistry, which we have every reason to suppose it does, then ... we ... ought to be able to reproduce the behavior of the nervous system with some physical device".




In [None]:
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
%matplotlib inline

**Neurons and Layers**
* The number of neurons in the input layer is set to the number of features/dimensions of the data.
* There is no general rule for the number of hidden layers and neurons, as it depends on the complexity of the problem. It is usually determined through trial and error.
* If the network is tasked with a regression problem, the output layer will typically have a single neuron, whereas for classification it will have either a single neuron (binary classification) or multiple neurons (multi-class/multi-label classification).
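
To make the link between layer sizes and model size concrete, here is a small sketch that counts the trainable parameters of a fully-connected network from its layer widths. The helper name `count_parameters` is hypothetical and not part of the class defined later in this notebook.

```python
def count_parameters(layer_sizes, use_bias=True):
    """Count weights (and optionally biases) of a fully-connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out      # one weight per connection
        if use_bias:
            total += fan_out           # one bias per receiving neuron
    return total

# 2-4-7-1 network: (2*4 + 4) + (4*7 + 7) + (7*1 + 1) = 12 + 35 + 8
print(count_parameters([2, 4, 7, 1]))  # 55
```

The count grows with the product of adjacent layer widths, which is one reason why the number of hidden neurons is usually tuned rather than simply maximized.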



**Weight Initialization and Activation Functions**
* Xavier initialization aims to mitigate the vanishing gradient problem that arises when using the sigmoid/tanh activation functions, so it is usually paired with them.
* Weights should not all be initialized to the same value, as every neuron in a layer would then receive the same gradient update (the symmetry problem).
* Weights that are too large lead to saturation, where certain neurons are activated significantly more than their counterparts. This is also why in practice we apply normalization, to get more neurons involved in the process of "estimating" the problem.
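
As a rough illustration of the points above, the sketch below draws weights with the commonly cited Xavier (Glorot) normal and He normal scales. The function names are hypothetical, and the formulas are the standard textbook ones rather than the exact constants used by the class in this notebook.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Xavier/Glorot normal: std = sqrt(2 / (fan_in + fan_out)), usually paired with sigmoid/tanh
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    # He normal: std = sqrt(2 / fan_in), usually paired with ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(256, 128, rng)
W_he = he_normal(256, 128, rng)

# The sample standard deviations should be close to the target scales
print(W_xavier.std(), np.sqrt(2.0 / (256 + 128)))
print(W_he.std(), np.sqrt(2.0 / 256))
```

Because every weight is drawn independently, no two neurons start out identical, which avoids the symmetry problem.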


**How does training a basic AI work?**

1. Data is fed into the neural network and the forward pass happens

2. The output is compared with the expected value to compute the loss

3. This loss is used in backpropagation to calculate the weight gradients and bias gradients

4. These gradients are used to update the weights and biases to fit the problem (moving towards a minimum of the loss). This relies on the universal approximation theorem, i.e. the assumption that a large enough network can approximate the target function.
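
The four steps above can be sketched on the smallest possible "network", a single linear neuron fitted with gradient descent. The toy data (`y = 2x + 1`), the learning rate and the number of epochs are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X + 1.0                            # toy target the neuron should learn

w, b = 0.0, 0.0                              # parameters of a single linear neuron
lr = 0.1
losses = []
for epoch in range(200):
    pred = w * X + b                         # 1. forward pass
    loss = np.mean((pred - y) ** 2)          # 2. compare with expected value (MSE)
    grad_w = np.mean(2 * (pred - y) * X)     # 3. gradients of the loss
    grad_b = np.mean(2 * (pred - y))
    w -= lr * grad_w                         # 4. update the parameters
    b -= lr * grad_b
    losses.append(loss)

print(f"w={w:.3f}, b={b:.3f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.2e}")
```

The learned `w` and `b` should end up close to 2 and 1, with the loss shrinking over the epochs.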


This is the basic idea behind training a deep neural network. In reality, there are more complex architectures, optimization algorithms, regularization techniques, etc. An example is the Recurrent Neural Network (RNN), a common choice for sequential data where each element depends on the others, such as sentiment analysis, where a word's meaning is affected by its context.




**Forward propagation**
Forward propagation (the forward pass) is the process in which an input is fed into the neural network at the input layer and an output is returned at the output layer.

Take a 2-4-7-1 neural network as an example: the 2 means the input layer has 2 nodes, taking two values (e.g. a coordinate x, y), and the 1 means the output layer has 1 node, returning a single value.
1. The input (x, y) is fed into the 2 input nodes, where it is multiplied by the weights connecting these 2 nodes to the 4 nodes of the next layer.
2. This gives one weighted sum (wsum) per node in the next layer, so 4 wsums.
3. The biases are added to the wsums element-wise, and the result is fed into the activation function to get the "activations". These activations can be seen as the values of the 4 nodes, and they serve as the input to the next layer.
4. The same process repeats up to the output layer, whose activations are the output values.
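
The steps above can be sketched as a vectorized forward pass for a 2-4-7-1 network. This is a minimal illustration under stated assumptions (sigmoid on every layer, zero biases, random weights), not the implementation used by the class below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the output and the per-layer activations for one input vector."""
    a = x
    activations = [a]
    for W, b in zip(weights, biases):
        z = a @ W + b          # weighted sum (wsum) plus bias
        a = sigmoid(z)         # activation of the receiving layer
        activations.append(a)
    return a, activations

rng = np.random.default_rng(0)
sizes = [2, 4, 7, 1]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

out, acts = forward(np.array([3.0, 5.0]), weights, biases)
print([a.shape for a in acts])  # [(2,), (4,), (7,), (1,)]
```

Each weight matrix has shape (nodes in current layer, nodes in next layer), so the activation vector changes size from 2 to 4 to 7 to 1 as it moves through the network.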

However, it is important to note that in practice there are variations, such as omitting the bias for the output layer or omitting biases altogether!

Some neurons might not be connected to all the neurons in the following layer. For example, in a 2-4-7-1 network, each of the 2 neurons in the input layer might be connected to only 3 of the 4 neurons in the next layer (a sparse layer).

These choices are made with many factors in mind, such as overfitting vs. underfitting, or spreading the work across all neurons (e.g. with "dropout" layers).




**Backward propagation**
Backward propagation is the process through which the "training" of the neural network takes place.

1. The output from forward propagation is compared with the expected value to get a loss value.
2. Using this loss value, the backpropagation algorithm calculates the weight gradients and bias gradients, which indicate how each parameter should change to make the model more accurate.
3. Technically, it is the optimizer (if there is one) that adjusts the parameters, but backpropagation is commonly associated with the parameter update.
4. Repeat this cycle. One full pass over the training data is called an epoch.


Fun fact: Backpropagation makes use of partial differentiation, applying the chain rule layer by layer.
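
To show how the chain rule produces the gradients, here is a minimal sketch for a tiny 2-3-1 network (sigmoid hidden layer, linear output, squared-error loss on one sample). All names and sizes are assumptions for illustration, and the analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, b1, W2, b2, x, y):
    h = sigmoid(x @ W1 + b1)
    out = h @ W2 + b2
    return np.sum((out - y) ** 2)

def backprop(W1, b1, W2, b2, x, y):
    # Forward pass, caching the intermediate values
    z1 = x @ W1 + b1
    h = sigmoid(z1)
    out = h @ W2 + b2
    # Backward pass: apply the chain rule from the output towards the input
    d_out = 2 * (out - y)            # dL/d_out, shape (1,)
    dW2 = np.outer(h, d_out)         # shape (3, 1)
    db2 = d_out
    d_h = W2 @ d_out                 # gradient flowing into the hidden layer, shape (3,)
    d_z1 = d_h * h * (1 - h)         # sigmoid'(z1) = h * (1 - h)
    dW1 = np.outer(x, d_z1)          # shape (2, 3)
    db1 = d_z1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
x, y = np.array([0.5, -1.0]), np.array([1.0])

dW1, db1, dW2, db2 = backprop(W1, b1, W2, b2, x, y)

# Finite-difference check on a single weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2, x, y) - loss_fn(W1_minus, b1, W2, b2, x, y)) / (2 * eps)
print(dW1[0, 0], numeric)            # the two values should agree closely
```

A gradient check like this is a cheap way to catch shape or sign mistakes before trusting a hand-written backward pass.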



In [None]:
class NeuralNetwork():

  def __init__(self):
    self.layers = [] # keep track of number of nodes per layer (easier for params init)
    self.activation_functions = {}
    self.weights = []
    self.biases = []
    self.neuron_values = []
    self.metric = None

  def model_summary(self):
    if not self.layers:
      raise ValueError('No layers have been added.')

    if not self.biases:
      self.bias_init()

    if not self.weights:
      self.weight_init()

    total_params = 0
    print(f'Input layer  : {self.layers[0]} nodes')
    if self.activation_functions.get(1):
        print(f'Activation Function: {self.activation_functions.get(1)}')

    for i in range(1,len(self.layers)-1):
      if self.biases:
        param = self.layers[i] * self.layers[i-1] + self.layers[i]
      else:
        param = self.layers[i] * self.layers[i-1]
      print(f'Hidden layer{i}: {self.layers[i]} nodes|Number of parameters: {param}')
      total_params += param
      if self.activation_functions.get(i):
        print(f'Activation Function: {self.activation_functions.get(i)}')

    if self.biases:
        param = self.layers[-1] * self.layers[-2] + self.layers[-1]
    else:
        param = self.layers[-1] * self.layers[-2]
    print(f'Output layer : {self.layers[-1]} nodes|Number of parameters: {param}')
    total_params += param
    print(f'Total trainable parameters: {total_params}')


  # Setup of network
  def add_layer(self,nodes=0,layer='Dense',activation=None):
    if activation:
      self.activation_functions[len(self.layers)] = activation
    else:
      self.layers.append(nodes)



  def weight_init(self,init='he',factor=None,gain=1,seed=0):
    if self.weights:
      self.weights = []
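    if len(self.layers) <= 2:
      raise ValueError(f'Number of layers = {len(self.layers)}; add more layers before initializing weights.')

    # The variance used below is factor/(fan_in + fan_out).
    # Note: 6 is the constant from the Xavier *uniform* bound; Xavier normal is
    # usually defined with 2/(fan_in + fan_out), and He normal with 2/fan_in.
    if init == 'xavier':
      factor = 6
    elif init == 'he':
      factor = 2
    elif init != 'custom':
      raise ValueError(f'Unknown init method: {init}')

    if factor is None:
      raise ValueError("init='custom' requires a numeric factor.")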

    np.random.seed(seed)
    for i in range(0,len(self.layers) - 1):
      input_neuron = self.layers[i] #number of input neurons
      output_neuron = self.layers[i+1] #number of the next layer neurons
      self.weights.append(np.random.normal(loc=0.0, scale = gain * np.sqrt(factor/(input_neuron + output_neuron)),
                                          size = (input_neuron,output_neuron)))

    return self.weights


  def bias_init(self,value=0.01,init=True,method='constant',mean=0,var=0.01):
    if self.biases:
      self.biases = []
    if method == 'constant':
      bias = 0.01 #0.01 is a widely used and tested value for bias initialization

    if method == 'gaussian':
      bias = np.random.normal(loc=0.0,scale=np.sqrt(0.01))

    self.biases = [np.random.normal(loc=0.0,scale=np.sqrt(0.01),size=(i,1)) for i in self.layers[1:]]

    # for i in range(0,len(self.layers) - 1):
    #   input_neuron = self.layers[i]
    #   output_neuron = self.layers[i+1]
    #   self.biases.append(np.full((output_neuron,),bias,dtype=float))

    return self.biases



                                             #   first bias is 5x1
  def forward(self,X,batch=True): #X is input    3-5-2-1   first weight is 5x3, input 1x3
    if not self.weights:
      self.weight_init()

    if not self.biases:
      self.bias_init()

    batch_wsum = []
    batch_activations = []
    outputs = []

    if not batch:
      X = np.expand_dims(X.copy(), axis=0)

    for x in X:
      values = np.transpose(x.copy())
      weighted_sum_list = []
      activation_list = []
      for i in range(len(self.layers)-1):
        if self.activation_functions.get(i):
          if self.activation_functions[i] == 'relu':
            func = lambda x: max(0,x)

          elif self.activation_functions[i] == 'sigmoid':
            func = lambda x: 1/(1+np.exp(-x))

          elif self.activation_functions[i] == 'tanh':
            func = lambda x:(np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))

          vectorizer = np.vectorize(func)

          weighted_sum = np.sum(np.dot(values,self.weights[i]) + self.biases[i],axis=0,keepdims=True)
          values = vectorizer(weighted_sum)

        else:
          weighted_sum = np.sum(np.dot(values,self.weights[i]) + self.biases[i],axis=0,keepdims=True)
          values = weighted_sum



        weighted_sum_list.append(weighted_sum)
        activation_list.append(values)

      outputs.append(values)
      batch_wsum.append(weighted_sum_list)
      batch_activations.append(activation_list)

      # Flatten outputs
      outputs = [array.flatten() for array in outputs]


    return outputs,batch_wsum,batch_activations


    # else:
    #   values = np.transpose(X.copy())
    #   for i in range(len(self.layers)-1):
    #     if self.activation_functions.get(i):
    #       if self.activation_functions[i] == 'relu':
    #         func = lambda x: max(0,x)

    #       if self.activation_functions[i] == 'sigmoid':
    #         func = lambda x: 1/(1+np.exp(-x))

    #       if self.activation_functions[i] == 'tanh':
    #         func = lambda x:(np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))
    #       vectorizer = np.vectorize(func)

    #       weighted_sum = np.dot(values,self.weights[i]) + self.biases[i]
    #       print(f'weighted_sum = {weighted_sum}')
    #       values = vectorizer(weighted_sum,signature='()->(n)')
    #       print(f'values = {values}')

    #     else:
    #       print('no')
    #       weighted_sum = np.dot(values,self.weights[i]) + self.biases[i]
    #       values = weighted_sum

    #     weighted_sum_list.append(weighted_sum)
    #     activation_list.append(values)

    # return values,weighted_sum_list,activation_list


  # Prediction
  def predict(self,x):
    outputs,_,_ = self.forward(x)
    return outputs


  # Loss function & accuracy metrics

  def mse_error(self,target,predicted):
    assert len(target) == len(predicted), "target shape is not equal to predicted shape."
    # if not isinstance(target,np.ndarray):
    #   target = np.array(target)
    # if not isinstance(predicted,np.ndarray):
    #   predicted = np.array(predicted)

    # if len(target.shape) == 1 or target.shape[0] != 1:
    #   x = target.shape[0]
    # else:
    #   x = target.shape[1]

    return np.sum((target-predicted)**2)/len(target)


  def mse_error_grad(self,target,predicted):
    return 2*(predicted-target)


  def loss(self,target,predicted,method='MSELoss',batch=False):
    # pred is output of self.forward(), target is actual labels values
    if batch:
      losses = []
      for tar,pre in zip(target,predicted):
        if method == 'MSELoss':
          losses.append([self.mse_error(tar,pre),self.mse_error_grad(tar,pre)])
      return losses


    else:
      if method == 'MSELoss':
        if type(target) == list:
          target = target[0]
        if type(predicted) == list:
          predicted = predicted[0]
        return self.mse_error(target,predicted),self.mse_error_grad(target,predicted)




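  # Derivatives of activation functions (applied element-wise via np.vectorize)
  @staticmethod
  def relu_grad(x):
    # d/dx max(0,x) is 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0

  @staticmethod
  def sigmoid_grad(x):
    s = 1/(1+np.exp(-x))
    return s * (1 - s)

  @staticmethod
  def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x)**2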


  # Updating parameters
  def backward(self,x,y,batch=1):
    if not self.weights:
      self.weight_init()

    if not self.biases:
      self.bias_init()

    # assert x.shape[0] == y.shape[0]
    # Keep track of gradients
    weight_gradients = [np.zeros(w.shape) for w in self.weights]
    bias_gradients = [np.zeros((b.shape[0],1)) for b in self.biases]

    # Batches
    left,right = 0,batch
    while left < x.shape[0]:
      if left >= x.shape[0]:
        break
      if right == -1:
        X_train_batch = x[left:]
        y_train_batch = y[left:]
      else:
        X_train_batch = x[left:right]
        y_train_batch = y[left:right]

      for datapoint in range(batch):
        if batch > 1 and left !=len(x)-1:
          output,wsum_list,activation_list = self.forward(X_train_batch,batch=True)
          loss,loss_grad = self.loss(y_train_batch,output,method='MSELoss',batch=True)
          dc_da = loss_grad[datapoint]
        else:
          output,wsum_list,activation_list = self.forward(X_train_batch[0],batch=False)
          loss,loss_grad = self.loss(y_train_batch,output,method='MSELoss')
          dc_da = loss_grad

        for i in range(len(self.layers)-1,-1,-1):
          if self.activation_functions.get(i):
            grad_name = f'{self.activation_functions[i]}_grad'
            activation_grad_func = getattr(NeuralNetwork,grad_name) # becomes self.relu_grad()
            activation_grad_func = np.vectorize(activation_grad_func)
          else:
            activation_grad_func = None

          # Output layer
          if i == (len(self.layers) - 1):
            weighted_sum = wsum_list[0][i-1]
            activation = activation_list[0][i-1]
            if activation_grad_func:
              local_gradient = dc_da * activation_grad_func(weighted_sum)
              print(f"Output Layer Local Gradient = {local_gradient}")
            else:
              local_gradient = np.full(weighted_sum.shape,dc_da)
              # local_gradient = dc_da
            # print(local_gradient.shape)
            # weight_loss_gradient = local_gradient * activation
            bias_loss_gradient = local_gradient
            bias_gradients[i-1] += bias_loss_gradient

          # Input layer
          elif i == 0:
            weighted_sum = wsum_list[0][0]
            activation = activation_list[0][0]
            weights = self.weights[0]

            if self.activation_functions.get(i):
              local_gradient = np.multiply(np.transpose(activation_grad_func(weighted_sum)),(np.dot(weights, local_gradient)))
            else:
              local_gradient = np.dot(weights, local_gradient)

            #2x1, 4x1
            weight_loss_gradient = local_gradient * activation
            weight_gradients[0] += weight_loss_gradient

          # Hidden layers
          else:
            #local gradient is cached previously
            weighted_sum = wsum_list[0][i-1] #1x7
            activation = activation_list[0][i-1] #1x7
            weights = self.weights[i] #7x1

            if self.activation_functions.get(i):
              local_gradient = np.multiply(np.transpose(activation_grad_func(weighted_sum)),(np.dot(weights, local_gradient))) #7x1
            else:
              local_gradient = np.dot(weights, local_gradient)

            weight_loss_gradient = local_gradient * np.transpose(activation)
            bias_loss_gradient = local_gradient


            # Edit gradients in place
            weight_gradients[i] += weight_loss_gradient
            bias_gradients[i-1] += bias_loss_gradient



        # Update parameters


      left += batch
      right += batch
      if right > x.shape[0]:
        right = x.shape[0]

    return weight_gradients, bias_gradients



  # Define how backprop is used
  def optimize(self,method='SGD'):
    pass


  def update_parameters(self,weight_grad,bias_grad,lr=0.005):
    for i in range(len(weight_grad)):
      print(weight_grad[i])
      print(bias_grad[i])
      self.weights[i] -= lr * weight_grad[i]
      # self.biases[i] -= lr * bias_grad[i]
    return


  # Function to train model (Compilation of main functions)
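  def fit(self,X,y,lr=0.005,batch=1):
    # The batched path in backward() is not working yet, so the batch argument
    # is ignored and gradients are computed one sample at a time.
    weight_gradients, bias_gradients = self.backward(X,y,batch=1)
    self.update_parameters(weight_gradients, bias_gradients, lr=lr)
    return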


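  def accuracy(self,X,y):
    predicted = self.predict(X)
    if self.metric == 'MSE':
      # Compare predictions against the labels y (not the inputs X)
      return ((np.array(predicted) - np.array(y))**2).mean()
    raise ValueError(f'Unsupported metric: {self.metric}')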








In [None]:
w_grad,b_grad,l_grad = model.backward(input,output)
w_grad

[array([[-7.12842285e-45,  1.79976900e-45, -3.74386883e-45,
         -2.90507155e-45],
        [ 1.74081550e-44, -4.39517386e-45,  9.14281466e-45,
          7.09440740e-45]]),
 array([[-7.56073940e-85, -7.56073940e-85, -7.56073940e-85,
         -7.56073940e-85, -7.56073940e-85, -7.56073940e-85,
         -7.56073940e-85],
        [ 4.49735492e-45,  4.49735492e-45,  4.49735492e-45,
          4.49735492e-45,  4.49735492e-45,  4.49735492e-45,
          4.49735492e-45],
        [ 5.48685762e-52,  5.48685762e-52,  5.48685762e-52,
          5.48685762e-52,  5.48685762e-52,  5.48685762e-52,
          5.48685762e-52],
        [-8.66316749e-50, -8.66316749e-50, -8.66316749e-50,
         -8.66316749e-50, -8.66316749e-50, -8.66316749e-50,
         -8.66316749e-50]]),
 array([[-1.62134948e-231],
        [-1.57742757e-078],
        [ 0.00000000e+000],
        [ 1.43486832e-108],
        [ 1.42487208e-144],
        [ 1.54878612e-231],
        [ 0.00000000e+000]])]

In [None]:
targetted = np.array([
                  [4],
                  [2],
                  [5],
])

predicted = np.array([
                  [2],
                  [6],
                  [4],
])

predicted_2 = np.array([
    [0.49495892],
    [0.50364126],
    [0.44763302]
  ])

model.mse_error(targetted,predicted_2)

loss =[[3.50504108]
 [1.49635874]
 [4.55236698]]
sum=35.248447571856275


11.749482523952091

In [None]:
first = np.array([1,3,5])
second = np.array([4,5,2])
first*second

array([ 4, 15, 10])

In [None]:
output,_,_ = model.forward(input)
output

  func = lambda x: 1/(1+np.exp(-x))


[array([[0.49207497]]), array([[0.50077704]]), array([[0.44480438]])]

In [None]:
model = NeuralNetwork()
model.add_layer(2)
model.add_layer(activation='sigmoid')
model.add_layer(4)
model.add_layer(activation='sigmoid')
model.add_layer(7)
model.add_layer(activation='sigmoid')
model.add_layer(1)
model.model_summary()

Input layer  : 2 nodes
Activation Function: sigmoid
Hidden layer1: 4 nodes|Number of parameters: 12
Activation Function: sigmoid
Hidden layer2: 7 nodes|Number of parameters: 35
Activation Function: sigmoid
Output layer : 1 nodes|Number of parameters: 8
Total trainable parameters: 55


In [None]:
model.weight_init(init='xavier',seed=0)

[array([[ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ],
        [ 1.86755799, -0.97727788,  0.95008842, -0.15135721]]),
 array([[-0.07623217,  0.30324709,  0.10638323,  1.07405217,  0.56206361,
          0.08986296,  0.32781472],
        [ 0.24643482,  1.10345052, -0.15151942,  0.23121582, -0.63079151,
         -1.88550794,  0.48272932],
        [ 0.63842844, -0.54812519,  1.67632488, -1.07412024,  0.0337949 ,
         -0.13824444,  1.13203247],
        [ 1.08519337,  0.11443626,  0.27929153, -0.65567323, -1.46291514,
         -0.25695015,  0.11547137]]),
 array([[ 1.06546298],
        [ 1.04129149],
        [-0.33543486],
        [-0.26180186],
        [-0.9080735 ],
        [-1.22977161],
        [-1.47767333]])]

In [None]:
# input = np.array([
#          [[-2,-5,-3],[-5,-7,-4]],
#          [[5,2,1],[5,8,2]],
#          [[12,53,72],[32,23,91]],
#          ])   #3 x 2x3

input = np.array([
                  [[3],[5]],
                  [[1],[9]],
                  [[-3],[-7]],  #3 x 2
])

output = np.array([
                  [4],
                  [2],
                  [5],
])


In [None]:
len(model.biases)


5

In [None]:
model.weights

[array([[ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ],
        [ 1.86755799, -0.97727788,  0.95008842, -0.15135721]]),
 array([[-0.07623217,  0.30324709,  0.10638323,  1.07405217,  0.56206361,
          0.08986296,  0.32781472],
        [ 0.24643482,  1.10345052, -0.15151942,  0.23121582, -0.63079151,
         -1.88550794,  0.48272932],
        [ 0.63842844, -0.54812519,  1.67632488, -1.07412024,  0.0337949 ,
         -0.13824444,  1.13203247],
        [ 1.08519337,  0.11443626,  0.27929153, -0.65567323, -1.46291514,
         -0.25695015,  0.11547137]]),
 array([[ 1.06546298],
        [ 1.04129149],
        [-0.33543486],
        [-0.26180186],
        [-0.9080735 ],
        [-1.22977161],
        [-1.47767333]])]

In [None]:
# Main training loop
learning_rate = 0.01
batch = 3
epochs = 100

model.fit(input,output,batch=batch,lr=learning_rate)
pred = model.predict(input)


# for i in tqdm(range(epochs)):
#   model.fit(input,output,batch=batch,lr=learning_rate)
#   pred = model.predict(input)
#   print(pred)
#   print(f"Epoch {i+1}: Loss = {model.mse_error(output,pred)}")



Output Layer Local Gradient = [[-0.92653331]]
Output Layer Local Gradient = [[-0.44653327]]
Output Layer Local Gradient = [[-1.99352511]]
[[-1.28044078e-12  3.25115225e-13 -6.72219865e-13 -5.19186134e-13]
 [ 2.91182473e-12 -7.36401425e-13  1.52886125e-12  1.18596488e-12]]
[[ 1.54372760e-24]
 [-1.52535171e-13]
 [ 8.07529347e-19]
 [-2.04392944e-15]]
[[ 5.21184677e-23  5.21184677e-23  5.21184677e-23  5.21184677e-23
   5.21184677e-23  5.21184677e-23  5.21184677e-23]
 [ 1.30222893e-12  1.30222893e-12  1.30222893e-12  1.30222893e-12
   1.30222893e-12  1.30222893e-12  1.30222893e-12]
 [ 1.43147466e-17  1.43147466e-17  1.43147466e-17  1.43147466e-17
   1.43147466e-17  1.43147466e-17  1.43147466e-17]
 [-2.66999423e-14 -2.66999423e-14 -2.66999423e-14 -2.66999423e-14
  -2.66999423e-14 -2.66999423e-14 -2.66999423e-14]]
[[-1.82975540e-15]
 [-5.59957172e-14]
 [ 2.66748022e-84]
 [ 2.25354219e-12]
 [ 2.13814877e-09]
 [ 3.78476238e-39]
 [ 9.82485746e-61]]
[[-1.82975540e-015]
 [-5.62907315e-027]
 [ 1.84

In [None]:
forwarded = model.forward(input)
output,wsum_list,_ = forwarded
output

[array([[0.2923489]]), array([[0.2923489]]), array([[0.62195888]])]

In [None]:
wsum_list[0]

[array([[ 58.49481331, -14.76864574,  30.72164951,  23.83859958]]),
 array([[176.84045068, -60.36547531, 315.37922365,  50.77434953,
          39.28444625, 107.39537082, 234.62057112]]),
 array([[-1.04115536]])]

In [None]:
w_grad,b_grad,loc = model.backward(input,output,batch=2)

[0, array([[0.40552918]])]
[0, array([[0.40552918]])]
0
0


  func = lambda x: 1/(1+np.exp(-x))
  return (1/(1+np.exp(-x))) * (1 - 1/(1+np.exp(-x)))


In [None]:
loc

array([[-7.71349165e-50],
       [-9.10586327e-51]])

In [None]:
w_grad

[array([[-1.70540250e-34,  4.30576386e-35, -8.95682452e-35,
         -6.95008753e-35],
        [ 4.16472362e-34, -1.05150054e-34,  2.18732520e-34,
          1.69726465e-34]]),
 array([[-7.68505140e-70, -7.68505140e-70, -7.68505140e-70,
         -7.68505140e-70, -7.68505140e-70, -7.68505140e-70,
         -7.68505140e-70],
        [ 1.07594632e-34,  1.07594632e-34,  1.07594632e-34,
          1.07594632e-34,  1.07594632e-34,  1.07594632e-34,
          1.07594632e-34],
        [ 1.31267475e-41,  1.31267475e-41,  1.31267475e-41,
          1.31267475e-41,  1.31267475e-41,  1.31267475e-41,
          1.31267475e-41],
        [-2.07257450e-39, -2.07257450e-39, -2.07257450e-39,
         -2.07257450e-39, -2.07257450e-39, -2.07257450e-39,
         -2.07257450e-39]]),
 array([[-4.04195148e-159],
        [-1.53693395e-055],
        [ 0.00000000e+000],
        [ 4.41745462e-076],
        [ 2.78912225e-100],
        [ 4.10437439e-159],
        [ 1.01675607e-246]])]

In [None]:
b_grad

[array([[ 1.04579543e-71],
        [-7.28534177e-36],
        [ 4.27280035e-43],
        [-8.69419574e-41]]),
 array([[-7.20362902e-081],
        [-2.52957773e-029],
        [ 8.39540515e-172],
        [ 1.18048105e-039],
        [ 1.74695165e-051],
        [ 7.79870197e-081],
        [ 1.34550158e-124]]),
 array([[-0.04864247]])]

In [None]:
model.weights

[array([[ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ],
        [ 1.86755799, -0.97727788,  0.95008842, -0.15135721]]),
 array([[-0.05160943,  0.20529925,  0.07202179,  0.72713675,  0.38051886,
          0.06083751,  0.22193162],
        [ 0.16683716,  0.74703954, -0.10257913,  0.15653385, -0.42704787,
         -1.27649491,  0.3268093 ],
        [ 0.4322181 , -0.37108251,  1.13487731, -0.72718284,  0.02287926,
         -0.09359193,  0.76638961],
        [ 0.73467938,  0.07747371,  0.18908126, -0.44389287, -0.99039823,
         -0.17395607,  0.07817448]]),
 array([[ 0.35151162],
        [ 0.3435371 ],
        [-0.1106648 ],
        [-0.08637221],
        [-0.29958656],
        [-0.40571941],
        [-0.48750577]])]

In [None]:
# Build network architecture
model = NeuralNetwork()


batch_size = X_train.shape[0]
epochs = 100
backprop_batch = 10
lr = 1e-17  #If batch-size is increased, learning rate is decreased


# Training(forward and backward pass)
for epoch in tqdm(range(epochs)):
  left,right = 0,backprop_batch
  while left < batch_size:
    X_train_batch = X_train[left:right]
    y_train_batch = y_train[left:right]

    # To be replaced with backprop function
    weight_gradients = [np.zeros(w.shape) for w in model.weights]
    bias_gradients = [np.zeros(b.shape) for b in model.biases]
    for j in range(len(X_train_batch)):
      output,_,_ = model.forward(X_train_batch[j],batch=False)
      loss,loss_grad = model.loss(y_train_batch[j],output,method='MSELoss')

      # Update parameter gradients

    left += backprop_batch
    right = min(right + backprop_batch, batch_size)


  # Report the loss of the last sample every 10 epochs
  if (epoch+1) % 10 == 0:
    print(f'Epoch: {epoch+1}, Loss: {loss}')











NameError: name 'X_train' is not defined

In [None]:
# Test with a 2-4-7-1 model
model = NeuralNetwork()
model.add_layer(2)
model.add_layer(4)
model.add_layer(7)
model.add_layer(1)
model.model_summary()

# model2 = NeuralNetwork()
# model2.add_layer(2)
# model2.add_layer(4)
# model2.add_layer(7)
# model2.add_layer(1)
# model2.bias_init()
# print(model2.biases)
# model2.model_summary()



# Concepts I learnt from this project

1) In mini-batch stochastic gradient descent, the gradients of the samples within a batch are accumulated and then used to update the parameters. Because they are summed rather than averaged here, the accumulated gradient grows with the batch size, so there is an inverse relationship between the batch size and the learning rate (i.e. the larger the batch, the lower the learning rate should be to compensate).
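
A small sketch of this effect, assuming a one-parameter model `pred = w * x` with squared-error loss and made-up data: summing the per-sample gradients gives a value exactly `batch_size` times larger than averaging them, which is why a summed gradient calls for a smaller learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 32
X = rng.normal(size=(batch_size, 1))
y = 3.0 * X                                   # toy targets
w = 0.0                                       # current parameter value

per_sample_grads = 2 * (w * X - y) * X        # d/dw of (w*x - y)^2 for each sample

summed = per_sample_grads.sum()               # what accumulating with += produces
averaged = per_sample_grads.mean()            # what dividing by the batch size produces
print(summed / averaged)                      # ratio equals the batch size, up to rounding
```

Dividing the accumulated gradient by the batch size is a common alternative that keeps the learning rate roughly independent of the batch size.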

**References**

Kumar, S. K. (2017). On weight initialization in deep neural networks. Retrieved from http://arxiv.org/abs/1704.08863

Philosophy of artificial intelligence. Retrieved from https://en.wikipedia.org/wiki/Philosophy_of_artificial_intelligence#:~:text=The%20philosophy%20of%20artificial%20intelligence,%2C%20epistemology%2C%20and%20free%20will.