# Introduction and Establishing the Basis

In the past, I have done many gradient calculations on paper. I have also manually implemented loss functions, gradients, and so on to calculate the weight updates, but that process is still not fully automated. Manual implementation becomes much harder as I add layers (because of the chain rule) or increase the number of nodes.

In the deep learning era, we rarely look at the details of the learning process (the calculations, the updates, and so on). Instead, we focus on improving model performance. Yet I still believe that understanding what is going on inside the model can be really beneficial, especially when the model is not learning well. The gradients can hint at the problem and help us solve it.

In this work, I tried my best to explain the concept in depth. We will start small and gradually increase the difficulty. My hope is that by the end of this work the reader will have grasped most of the concept.

Thanks...

## Manual Gradient Update Implementation

The problem:
- Hypothesis Function: $w \cdot x$ where $w \in \mathbb{R}$ and $x \in \mathbb{R}^2$
- No bias
- Loss Function: $\frac{1}{N} \sum (xw-y_{true})^2$
- True labels: $[1,2]$
- Initial weight (yes we have only one): $2$
- Inputs: $[1,2]$
- Stop criteria: $|w_{t+1} - w_t| \le 0.1$ (when the weight updates are sufficiently small)
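
For reference, the gradient that the code below implements follows from differentiating the loss with respect to $w$. This is a worked sketch using the symbols from the list above; $\eta$ (the learning rate, 0.1 in the code) and the sample index $i$ are notation introduced here.

```latex
\begin{aligned}
L(w) &= \frac{1}{N}\sum_{i=1}^{N}(x_i w - y_i)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{N}\sum_{i=1}^{N}(x_i w - y_i)\,x_i \\
w_{t+1} &= w_t - \eta\,\frac{\partial L}{\partial w}\Big|_{w = w_t}
\end{aligned}
```

With $x = [1,2]$, $y_{true} = [1,2]$ and $w_0 = 2$, the residuals are $[1,2]$, so the gradient is $\frac{2}{2}(1 \cdot 1 + 2 \cdot 2) = 5$ and the first update gives $w_1 = 2 - 0.1 \cdot 5 = 1.5$.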

In [1]:
#import libraries
import tensorflow as tf
import numpy as np

In [2]:
#define the parameters
y_true = np.array([1,2]).astype('float32')
w = 2.
x = np.array([1,2]).astype('float32')
error_margin = 0.1 #if the weight update becomes smaller than 0.1 iterations will stop

In [3]:
#define mean squared error function
def mean_squared_error(y_true,x,w):
  return np.mean(np.square(y_true-x*w))

#define gradient of mse wrt w
def gradient_mse(y_true,x,w):
  return 2 * np.mean((x*w - y_true)*x)

#gradient update for single iteration: Xt+1 = Xt - eta*del(Loss)
def gradient_update(y_true,x,w, learning_rate = 0.1):
  return w - learning_rate * gradient_mse(y_true,x,w)

#define a multiple gradient update loop
loss_list = []
weight_list = []
while True:
  weight_list.append(w)
  loss_list.append(mean_squared_error(y_true,x,w))
  if len(weight_list) > 1 and np.abs(weight_list[-1]-weight_list[-2]) <= error_margin:
    break
  w = gradient_update(y_true,x,w)

weight_list,loss_list

([2.0, 1.5, 1.25, 1.125, 1.0625],
 [2.5, 0.625, 0.15625, 0.0390625, 0.009765625])
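
A quick way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The snippet below is a minimal sketch of that check for the same one-weight problem; the helper names (`mse`, `analytic_gradient`, `numeric_gradient`) and the step size `eps` are illustrative choices, not part of the cells above.

```python
import numpy as np

x = np.array([1.0, 2.0])
y_true = np.array([1.0, 2.0])

def mse(w):
    """Mean squared error of the one-weight model y_pred = x * w."""
    return np.mean((x * w - y_true) ** 2)

def analytic_gradient(w):
    """Hand-derived gradient: 2 * mean((x*w - y_true) * x)."""
    return 2 * np.mean((x * w - y_true) * x)

def numeric_gradient(w, eps=1e-6):
    """Central-difference approximation of dL/dw."""
    return (mse(w + eps) - mse(w - eps)) / (2 * eps)

w0 = 2.0
# Both values should agree up to floating-point error.
print(analytic_gradient(w0), numeric_gradient(w0))
```

If the two numbers disagree noticeably, the derivative formula (or its implementation) is the first place to look.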

## Implementation with TensorFlow

Let's start with the basics and implement the same algorithm using TensorFlow. Luckily, many TensorFlow functions have close NumPy counterparts.

In [4]:
#import the libraries
import tensorflow as tf
import numpy as np

In [5]:
#define parameters
y_true = np.array([1,2]).astype('float32')
w = 2.
x = np.array([1,2]).astype('float32')
error_margin = 0.1

In [6]:
#define mean squared error function
def mean_squared_error(y_true,x,w):
  return tf.reduce_mean(tf.square(y_true-x*w))

#define gradient of mse wrt w
def gradient_mse(y_true,x,w):
  return 2 * tf.reduce_mean((x*w - y_true)*x)

#gradient update for single iteration: Xt+1 = Xt - eta*del(Loss)
def gradient_update(y_true,x,w, learning_rate = 0.1):
  return w - learning_rate * gradient_mse(y_true,x,w)

#define a multiple gradient update loop with tf
loss_list = []
weight_list = []
while True:
  weight_list.append(w)
  loss_list.append(mean_squared_error(y_true,x,w).numpy())
  if len(weight_list) > 1 and tf.abs(weight_list[-1]-weight_list[-2]) <= error_margin:
    break
  w = gradient_update(y_true,x,w).numpy()

weight_list,loss_list

([2.0, 1.5, 1.25, 1.125, 1.0625],
 [2.5, 0.625, 0.15625, 0.0390625, 0.009765625])

## Implementation with TensorFlow GradientTape

We can use `tf.GradientTape` for doing the same operations. The bad thing is that we need to learn a few more parameters; the good thing is we don't need to manually define the derivative of the loss function to compute it.

In [7]:
#import libraries
import tensorflow as tf
import numpy as np

In [8]:
#define parameters
y_true = np.array([1,2]).astype('float32')
w = 2.
x = np.array([1,2]).astype('float32')
learning_rate = 0.1
error_margin = 0.1

In [9]:
#same loss function
def mean_squared_error(y_true,x,w):
  return tf.reduce_mean(tf.square(y_true-x*w))

#multiple gradient update with tf.GradientTape()
weight_list = []
loss_list = []
w_tf = tf.Variable(w,trainable = True,dtype = tf.float32)

#initialize the training loop
while True:
  weight_list.append(w_tf.numpy())

  if len(weight_list) > 1 and tf.abs(weight_list[-1]-weight_list[-2]) <= error_margin:
    break

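  #record the forward pass; trainable tf.Variable objects are watched automatically
  with tf.GradientTape() as t:
    current_loss = mean_squared_error(y_true,x,w_tf)

  #compute the gradient outside the tape context and update the variable in place
  loss_list.append(current_loss.numpy())
  gradient = t.gradient(current_loss,[w_tf])
  w_tf.assign_sub(learning_rate * gradient[0])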

weight_list,loss_list

([2.0, 1.5, 1.25, 1.125, 1.0625], [2.5, 0.625, 0.15625, 0.0390625])

## Implementing GradientTape to a Deep Learning Model

Let's start by creating our model in two ways. Both contain a single node with one weight and no bias. The only difference is that one is implemented as a plain Python class and the other is implemented with `tf.keras.Model`.

In [10]:
#import libraries
import tensorflow as tf
import numpy as np

#define parameters
y_true = np.array([1,2]).astype('float32')
w = 2.
x = np.array([1,2]).astype('float32')
error_margin = 0.1

In [11]:
#define a model class having only one weight - Regular
class SimpleModel(object):
  def __init__(self,weights):
    self.weights = weights

  def __call__(self,input):
    return input * self.weights

#test the model
simple_model = SimpleModel(weights = 2.)
simple_model(3.)

6.0

In [12]:
#define a model class having only one weight - Tensorflow
class SimpleModelTensorflow(tf.keras.Model):
  def __init__(self,units = 1):
    super(SimpleModelTensorflow,self).__init__() #allows this subclass to use tf.keras.Model attributes
    self.units = units

  def build(self,input_shape):
    self.weight = tf.Variable(initial_value = 2,trainable = True,dtype = tf.float32)


  def call(self,x):
    return x * self.weight

#initialize the model
tensorflow_model = SimpleModelTensorflow()

#test the model
tensorflow_model(3.)

<tf.Tensor: shape=(), dtype=float32, numpy=6.0>

Let's define the functions for each.

In [13]:
#functions for simple model
def mean_squared_error(y_true,x,model):
  return np.mean(np.square(y_true-model(x))) #use model(x) instead of w*x

def gradient_mse(y_true,x,model):
  return 2 * np.mean((model(x) - y_true)*x)

def gradient_update(y_true,x,model, learning_rate = 0.1):
  return model.weights - learning_rate * gradient_mse(y_true,x,model)

def multiple_gradient_update(y_true, x, model, learning_rate=0.1):
    weight_list = [model.weights]
    loss_list = [mean_squared_error(y_true, x, model)]

    error_margin = 0.1
    while True:
        w_new = gradient_update(y_true, x, model, learning_rate) #forward the learning rate instead of silently using the default
        weight_list.append(w_new)
        model.weights = w_new
        # Check if the weight difference is below the error margin
        if np.abs(weight_list[-1] - weight_list[-2]) <= error_margin:
            break
        loss_list.append(mean_squared_error(y_true, x, model))



    return weight_list, loss_list

In [14]:
multiple_gradient_update(y_true,x,simple_model)

([2.0, 1.5, 1.25, 1.125, 1.0625], [2.5, 0.625, 0.15625, 0.0390625])

In [15]:
#TF functions
def mean_squared_error_tf(y_true,x,model):
  return tf.reduce_mean(tf.square(y_true-model(x)))

def multiple_gradient_update_with_tf(y_true,x,model, learning_rate = 0.1):

  weight_list,loss_list = [],[]

  while True:
    weight_list.append(model.get_weights()[0])
    if len(weight_list) > 1 and tf.abs(weight_list[-1] - weight_list[-2]) <= error_margin:
      break
    with tf.GradientTape() as t:
      t.watch(model.trainable_variables)
      current_loss = mean_squared_error_tf(y_true,x,model)
      loss_list.append(current_loss.numpy())
      gradient = t.gradient(current_loss,model.trainable_variables)
      new_weight = model.get_weights()[0] - learning_rate * gradient[0]
      model.set_weights([new_weight])

  return weight_list,loss_list

multiple_gradient_update_with_tf(y_true,x,tensorflow_model, learning_rate = 0.1)

([2.0, 1.5, 1.25, 1.125, 1.0625], [2.5, 0.625, 0.15625, 0.0390625])

## Working with Larger Data & Introduction of SGD

Stochastic Gradient Descent (SGD), in short, is a valuable method for deep learning optimization, especially when the data is normally distributed. Instead of computing the loss for every instance and averaging these values before the gradient update, we consider only the loss of one randomly selected instance.

Since we consider only one instance at a time, the gradients may become unstable, causing the model to diverge. I've also included a part that shows how this happens, so that these concepts don't remain abstract for the reader.
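
To make the difference concrete, the sketch below compares the full-batch gradient with single-sample gradients for a tiny version of the problem. It is an illustration under assumptions: the five-point dataset, the seed, and the helper name `per_sample_gradients` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative seed
x = np.arange(1, 6, dtype=np.float64)    # toy inputs 1..5
y_true = x.copy()                        # true weight is 1
w = 2.0

def per_sample_gradients(w):
    """Gradient of (x_i*w - y_i)^2 with respect to w, one value per sample."""
    return 2 * (x * w - y_true) * x

# Batch gradient descent averages the per-sample gradients.
full_batch_gradient = per_sample_gradients(w).mean()

# SGD uses the gradient of a single randomly chosen sample instead.
i = rng.integers(len(x))
sgd_gradient = per_sample_gradients(w)[i]

print("per-sample gradients:", per_sample_gradients(w))
print("full-batch gradient:", full_batch_gradient)
print("SGD gradient for sample", i, ":", sgd_gradient)
```

The per-sample gradients here range from 2 to 50 while their average is 22, which is why a single SGD step can be much smaller or much larger than the full-batch step.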

Problem Statement

- $x = [1, 2, 3, \dots, 2000)$
- $y = [1, 2, 3, \dots, 2000)$
- No bias
- $y_{pred} = model(x) = xw$
- Loss Function: $\frac{1}{N} \sum (y_{pred}-y_{true})^2$
- Stopping criteria: $|L_{t+1}-L_t| \le 0.1$

Let's see what exploding gradients look like:

In [16]:
#import the libraries
import tensorflow as tf
import numpy as np
import random

#seed the process
SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)


#define a model class having only one weight - Tensorflow
class SimpleModelTensorflow(tf.keras.Model):
  def __init__(self,units = 1):
    super(SimpleModelTensorflow,self).__init__() #allows this subclass to use tf.keras.Model attributes
    self.units = units

  def build(self,input_shape):
    self.weight = tf.Variable(initial_value = 2,trainable = True,dtype = tf.float32)#never forget to convert the dtype to float

  def call(self,x):
    return x * self.weight


In [17]:
#define error margin again
error_margin = 0.1
simple_loss_list = []
simple_weight_list = []

#define the model
simple_tfmodel = SimpleModelTensorflow()
simple_tfmodel(3.) #build the model

#define the parameters
learning_rate = 0.001
y_true = np.array([*range(1,2000)]).astype('float32')
x = np.array([*range(1,2000)]).astype('float32')

for i in range(30):

  with tf.GradientTape(persistent = True) as t: #allows us to use multiple times
    t.watch(simple_tfmodel.trainable_variables)

    #implement SGD
    random_index = tf.random.uniform(shape = [], minval = 0, maxval = len(y_true),dtype = tf.int32)
    random_x = x[random_index]
    random_y = y_true[random_index]

    current_loss = tf.keras.metrics.mean_squared_error(y_true=[random_y],y_pred = [simple_tfmodel(random_x)])
    gradient = t.gradient(current_loss,sources = simple_tfmodel.trainable_variables)
    simple_tfmodel.trainable_variables[0].assign_sub(learning_rate*gradient[0]) #instead of using the formula itself use assign_sub

    #record the process
    simple_weight_list.append(simple_tfmodel.trainable_variables[0].numpy())
    simple_loss_list.append(current_loss.numpy())

    #display the process
    print("CurrentLoss:", current_loss)
    print("Gradient:",gradient)
    print("Weights:",simple_tfmodel.trainable_variables[0].numpy())
    print()

    #tf.abs takes a single tensor, so subtract the two losses before taking the absolute value
    if len(simple_loss_list) > 1 and tf.abs(simple_loss_list[-1] - simple_loss_list[-2]) <= error_margin:
      break

#simple_weight_list,simple_loss_list



CurrentLoss: tf.Tensor(417316.0, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=834632.0>]
Weights: -832.632

CurrentLoss: tf.Tensor(2017845800000.0, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=-4841094700.0>]
Weights: 4840262.5

CurrentLoss: tf.Tensor(2.629222e+18, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=1086396700000.0>]
Weights: -1081556500.0

CurrentLoss: tf.Tensor(7.1533894e+23, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=-1322795400000000.0>]
Weights: 1321713900000.0

CurrentLoss: tf.Tensor(8.101569e+29, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=1.2259188e+18>]
Weights: -1224597000000000.0

CurrentLoss: tf.Tensor(1.4668273e+36, shape=(), dtype=float32)
Gradient: [<tf.Tensor: shape=(), dtype=float32, numpy=-2.3956082e+21>]
Weights: 2.3943837e+18

CurrentLoss: tf.Tensor(inf, shape=(), dtype=float32)
Gradien

Here is how it looks. The x and y values span a wide range. Combined with the squared penalty of MSE, this makes the gradients huge. Normalization can help here. For this, let's use a custom `MinMax Scaling` function.
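
The blow-up can also be reproduced by hand. For one sample with $y = x$, a single SGD step turns the weight error $w - 1$ into $(1 - 2\eta x^2)(w - 1)$, so the error grows whenever $|1 - 2\eta x^2| > 1$. The sketch below checks this; it assumes the first sampled value was $x = 646$ (inferred from the first printed loss, since $646^2 = 417316$), and the function names are illustrative.

```python
def sgd_step(w, x, learning_rate):
    """One single-sample SGD step for the loss (x*w - y)^2 with y = x."""
    gradient = 2 * (x * w - x) * x
    return w - learning_rate * gradient, gradient

def error_factor(x, learning_rate):
    """Factor by which the weight error (w - 1) is multiplied in one step."""
    return 1 - 2 * learning_rate * x ** 2

# Assumed sample: x = y = 646, inferred from the first printed loss above.
w_new, gradient = sgd_step(w=2.0, x=646.0, learning_rate=0.001)
print(gradient, w_new)

# Raw input with learning rate 0.001: magnitude far above 1, the error grows.
print(error_factor(646.0, 0.001))
# Input scaled to at most 1 with learning rate 0.1: magnitude below 1, the error shrinks.
print(error_factor(1.0, 0.1))
```

With a learning rate of 0.001, any raw input larger than about 31.6 (the square root of 1000) already pushes the factor past 1 in magnitude, while inputs scaled to $[0, 1]$ keep it between 0.8 and 1 at a learning rate of 0.1.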

In [18]:
def minmax(array):
  max_val,min_val = max(array),min(array)
  array_normalized = (array - min_val)/(max_val-min_val)
  return array_normalized


#define error margin again
error_margin = 0.0001
simple_loss_list = []
simple_weight_list = []

#define the model
simple_tfmodel = SimpleModelTensorflow()
simple_tfmodel(3.) #build the model
initial_weights = simple_tfmodel.get_weights()[0]

#define the parameters
learning_rate = 0.1
y_true = np.array([*range(1,2000)]).astype('float32')
x = np.array([*range(1,2000)]).astype('float32')
x_normalized = minmax(x)
y_normalized = minmax(y_true) #https://stats.stackexchange.com/questions/532384/why-do-we-normalize-only-the-x-data-and-not-y

while True:

  with tf.GradientTape() as t:
    t.watch(simple_tfmodel.trainable_variables)

    #implement SGD
    random_index = tf.random.uniform(shape = [], minval = 0, maxval = len(y_true),dtype = tf.int32)
    random_x = x_normalized[random_index]
    random_y = y_normalized[random_index]

    # print("Random x:",random_x)
    # print("Random y:",random_y)

    current_loss = tf.keras.metrics.mean_squared_error(y_true=[random_y],y_pred = [simple_tfmodel(random_x)])
    gradient = t.gradient(current_loss,sources = simple_tfmodel.trainable_variables)
    simple_tfmodel.trainable_variables[0].assign_sub(learning_rate*gradient[0]) #instead of using the formula itself use assign_sub

    #record the process
    simple_weight_list.append(simple_tfmodel.trainable_variables[0].numpy())
    simple_loss_list.append(current_loss.numpy())

    # print("CurrentLoss:", current_loss)
    # print("Gradient:",gradient)
    # print("Weights:",simple_tfmodel.trainable_variables[0].numpy())

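    #stop once the latest single-sample loss falls below error_margin
    #(tf.abs takes a single tensor; to compare consecutive losses, use tf.abs(a - b) instead)
    if len(simple_loss_list) > 1 and tf.abs(simple_loss_list[-1]) <= error_margin: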
      break

#display the results
print("Results")
print("--------------\n")
print("Initial Weight:", initial_weights)
print("Final Weight:", simple_weight_list[-1])
print()
print("Initial Loss:", simple_loss_list[0])
print("Final Loss:",simple_loss_list[-1])

Results
--------------

Initial Weight: 2.0
Final Weight: 1.1171395

Initial Loss: 0.51440054
Final Loss: 3.440736e-05


# Configure an Actual Layer

From now on, instead of using dummy layers, we will build our first `Dense` layer. Then we will optimize it with our custom SGD algorithm.

Problem Statement

- $x \in \mathbb{R}^3$: 100 samples, with each feature drawn from $N(0,1)$
- $y = 3x_1 + 2x_2 - 7x_3 + 5$, where $x_1, x_2, x_3$ are the three features
- $y_{pred} = model(x) = xw + b$
- Loss Function: $\frac{1}{N} \sum (y_{pred}-y_{true})^2$
- Stopping criteria: $|L_{t+1}-L_t| \le 0.01$

## A Dense Layer with SGD

In [19]:
#import the libraries
import tensorflow as tf
import numpy as np
import random
import pandas as pd

#set seed
SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

#create the model class
class CustomDense(tf.keras.layers.Layer):
  def __init__(self,units,activation): #get the number of units
    super(CustomDense,self).__init__() #allow the subclass to use tf.keras.layers.Layer attributes

    self.units = units
    self.activation = tf.keras.activations.get(activation) #this allows our model to get activations in string type -> 'relu'

  def build(self,input_shape):

    #define the initializers
    weight_initializer = tf.keras.initializers.GlorotUniform() #use the same initializer with the layers.Dense()
    bias_initializer = tf.keras.initializers.Zeros() #use the same initializer with the layers.Dense()

    #create the first weights and biases
    initial_weights = weight_initializer(shape = (input_shape[-1],self.units),dtype = tf.float32)
    initial_biases = bias_initializer(shape = (self.units,),dtype = tf.float32)#one bias per node

    #create the trainable variables
    self.wgt = tf.Variable(initial_value = initial_weights,trainable = True,name = 'weights')
    self.bs = tf.Variable(initial_value = initial_biases,trainable = True,name = 'biases')
    super().build(input_shape)

  def call(self,inputs):
    return self.activation(tf.matmul(inputs,self.wgt) + self.bs)


Now by using SGD, we will try to find the pattern: $y = 3x_1 + 2x_2 - 7x_3 + 5$

In [20]:
#create the dataset y = 3x1 + 2x2 -7x3 + 5
x1 = np.random.normal(size = 100)
x2 = np.random.normal(size = 100)
x3 = np.random.normal(size = 100)

#put on a pandas df (optional)
dataset = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'x3': x3
})
dataset['y'] = 3*dataset['x1'] + 2*dataset['x2'] - 7*dataset['x3'] + 5
dataset.head()

| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 0.496714 | -1.415371 | 0.357787 | 1.154889 |
| 1 | -0.138264 | -0.420645 | 0.560785 | -0.181575 |
| 2 | 0.647689 | -0.342715 | 1.083051 | -1.323722 |
| 3 | 1.52303 | -0.802277 | 1.053802 | 0.587921 |
| 4 | -0.234153 | -0.161286 | -1.377669 | 13.618654 |
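
Because this target is an exact linear function of the features, ordinary least squares gives a useful reference for what SGD should approach. The snippet below is a sketch of that reference solution; it regenerates its own data with `np.random.default_rng`, so the samples differ from the table above, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)          # illustrative seed, not the notebook's data
features = rng.normal(size=(100, 3))     # columns play the role of x1, x2, x3
targets = features @ np.array([3.0, 2.0, -7.0]) + 5.0

# Append a column of ones so the bias is learned as a fourth coefficient.
design = np.hstack([features, np.ones((100, 1))])
solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
weights, bias = solution[:3], solution[3]

print("weights:", weights)
print("bias:", bias)
```

On noise-free data the recovered coefficients match the generating ones up to floating-point error, so any gap between them and the SGD result reflects the optimizer and its stopping rule rather than the data.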


In [21]:
#build an actual layer
custom_dense = CustomDense(1, activation = 'linear')
custom_dense.build(input_shape = (3,))

#record the initial weights and bias for final output
initial_weights = custom_dense.get_weights()[0]
initial_bias = custom_dense.trainable_variables[1].numpy()

#define parameters
error_margin = 0.01
learning_rate = 0.1

#create the containers
weight_list = []
loss_list = []
bias_list = []

#initialize the loop
while True:

  with tf.GradientTape() as t: #record the forward pass so the gradient can be computed
    t.watch(custom_dense.trainable_variables)

    #implement SGD with pandas
    random_row = dataset.sample(1).values
    random_x = random_row[0][:3]
    random_y = random_row[0][-1]

    # print("Random x:",random_x)
    # print("Random y:",random_y)

    current_loss = tf.keras.metrics.mean_squared_error(y_true=[random_y],y_pred = [custom_dense([random_x])])
    gradient = t.gradient(current_loss,sources = custom_dense.trainable_variables)
    custom_dense.trainable_variables[0].assign_sub(learning_rate*gradient[0]) #update weights
    custom_dense.trainable_variables[1].assign_sub(learning_rate*gradient[1]) #update biases

    #record the process
    weight_list.append(custom_dense.trainable_variables[0].numpy())
    bias_list.append(custom_dense.trainable_variables[1].numpy())
    loss_list.append(current_loss.numpy())

    # print("CurrentLoss:", current_loss)
    # print("Gradient:",gradient)
    # print("Weights:",custom_dense.trainable_variables[0].numpy())
    # print('Biases:',custom_dense.trainable_variables[1].numpy())

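    #stop once the latest single-sample loss falls below error_margin
    #(tf.abs takes a single tensor; to compare consecutive losses, use tf.abs(a - b) instead)
    if len(loss_list) > 1 and tf.abs(loss_list[-1]) <= error_margin: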
      break

#display the results
print("Results")
print("--------------\n")
print("Initial Weights:", initial_weights)
print("Final Weights:", weight_list[-1])
print()
print("Initial Bias:", initial_bias)
print("Final Bias:",bias_list[-1])
print()
print("Initial Loss:", loss_list[0])
print()
print("Final Loss:",loss_list[-1])


Results
--------------

Initial Weights: [[ 0.8773805 ]
 [ 0.3983569 ]
 [-0.34608656]]
Final Weights: [[ 1.8544272]
 [ 2.2192671]
 [-7.0273986]]

Initial Bias: [0.]
Final Bias: [4.617867]

Initial Loss: [[114.21528]]

Final Loss: [[2.5682853e-05]]


## Repeating the Same Experiment with Mini Batch GD

In this section, we will repeat the same experiment using mini-batch gradient descent. Since this is a toy problem, I've selected a batch size of 3.
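
With more than one sample per step, array shapes start to matter. The NumPy sketch below computes the mini-batch loss and gradients for a linear model by hand and shows a common pitfall: subtracting a flat target vector from column-shaped predictions broadcasts to a square matrix. The data, seed, and variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
batch_x = rng.normal(size=(3, 3))              # 3 samples, 3 features
true_w = np.array([[3.0], [2.0], [-7.0]])
batch_y = batch_x @ true_w + 5.0               # targets as a (3, 1) column

w = np.zeros((3, 1))                           # current weights
b = np.zeros(1)                                # current bias

y_pred = batch_x @ w + b                       # shape (3, 1)
residual = y_pred - batch_y                    # shape (3, 1)
loss = np.mean(residual ** 2)

# Gradients of the mean squared error, averaged over the batch.
grad_w = 2 * batch_x.T @ residual / len(batch_x)   # shape (3, 1)
grad_b = 2 * residual.mean(axis=0)                 # shape (1,)

# Pitfall: a flat (3,) target broadcasts against (3, 1) predictions.
flat_y = batch_y.ravel()
broadcast_residual = y_pred - flat_y           # shape (3, 3), not (3, 1)

print(residual.shape, broadcast_residual.shape)
```

Keeping targets and predictions in the same shape avoids this silent broadcast, which would otherwise average over nine mismatched pairs instead of three matched ones.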

In [22]:
#import the libraries
import tensorflow as tf
import numpy as np
import random
import pandas as pd

#set seed
SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

#create the model class
class CustomDense(tf.keras.layers.Layer):
  def __init__(self,units,activation): #get the number of units
    super(CustomDense,self).__init__() #allow the subclass to use tf.keras.layers.Layer attributes

    self.units = units
    self.activation = tf.keras.activations.get(activation) #this allows our model to get activations in string type -> 'relu'

  def build(self,input_shape):

    #define the initializers
    weight_initializer = tf.keras.initializers.GlorotUniform() #use the same initializer with the layers.Dense()
    bias_initializer = tf.keras.initializers.Zeros() #use the same initializer with the layers.Dense()

    #create the first weights and biases
    initial_weights = weight_initializer(shape = (input_shape[-1],self.units),dtype = tf.float32)
    initial_biases = bias_initializer(shape = (self.units,),dtype = tf.float32)#one bias per node

    #create the trainable variables
    self.wgt = tf.Variable(initial_value = initial_weights,trainable = True,name = 'weights')
    self.bs = tf.Variable(initial_value = initial_biases,trainable = True,name = 'biases')
    super().build(input_shape)

  def call(self,inputs):
    return self.activation(tf.matmul(inputs,self.wgt) + self.bs)




Now by using Minibatch GD, we will try to find the pattern: $y = 3x_1 + 2x_2 - 7x_3 + 5$

In [23]:
#create the dataset y = 3x1 + 2x2 -7x3 + 5
x1 = np.random.normal(size = 100)
x2 = np.random.normal(size = 100)
x3 = np.random.normal(size = 100)

#put on a pandas df (optional)
dataset = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'x3': x3
})
dataset['y'] = 3*dataset['x1'] + 2*dataset['x2'] - 7*dataset['x3'] + 5
dataset.head()

| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 0.496714 | -1.415371 | 0.357787 | 1.154889 |
| 1 | -0.138264 | -0.420645 | 0.560785 | -0.181575 |
| 2 | 0.647689 | -0.342715 | 1.083051 | -1.323722 |
| 3 | 1.52303 | -0.802277 | 1.053802 | 0.587921 |
| 4 | -0.234153 | -0.161286 | -1.377669 | 13.618654 |


In [24]:
#build the actual Dense layer
custom_dense = CustomDense(1, activation = 'linear')
custom_dense.build(input_shape = (3,))

#record the initial weights and bias for final output
initial_weights = custom_dense.get_weights()[0]
initial_bias = custom_dense.trainable_variables[1].numpy()

#define parameters
error_margin = 0.1
learning_rate = 0.1
batch_size = 3

#create the containers
weight_list = []
loss_list = []
bias_list = []

#initialize the loop
while True:

  with tf.GradientTape() as t: #record the forward pass so the gradient can be computed
    t.watch(custom_dense.trainable_variables)

    #implement minibatch GD with pandas
    batch = dataset.sample(batch_size).values
    random_x = batch[:,:3]
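    random_y = batch[:,-1:] #keep the targets as a (batch_size,1) column so they align with the layer output instead of broadcasting to (batch_size,batch_size)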

    # print("Random x:",random_x)
    # print("Random y:",random_y)

    current_loss = tf.reduce_mean(tf.keras.metrics.mean_squared_error(y_true=random_y,y_pred = custom_dense(random_x))) #get the mean of the errors
    gradient = t.gradient(current_loss,sources = custom_dense.trainable_variables)
    custom_dense.trainable_variables[0].assign_sub(learning_rate*gradient[0]) #update weights
    custom_dense.trainable_variables[1].assign_sub(learning_rate*gradient[1]) #update biases

    #record the process
    weight_list.append(custom_dense.trainable_variables[0].numpy())
    bias_list.append(custom_dense.trainable_variables[1].numpy())
    loss_list.append(current_loss.numpy())

    #print("CurrentLoss:", current_loss)
    #print("Gradient:",gradient)
    #print("Weights:",custom_dense.trainable_variables[0].numpy())
    #print('Biases:',custom_dense.trainable_variables[1].numpy())

    if len(loss_list) > 1 and tf.abs(loss_list[-1]-loss_list[-2]) <= error_margin:
      break

print("Results")
print("--------------\n")
print("Initial Weights:", initial_weights)
print("Final Weights:", weight_list[-1])
print()
print("Initial Bias:", initial_bias)
print("Final Bias:",bias_list[-1])
print()
print("Initial Loss:", loss_list[0])
print()
print("Final Loss:",loss_list[-1])


Results
--------------

Initial Weights: [[ 0.8773805 ]
 [ 0.3983569 ]
 [-0.34608656]]
Final Weights: [[ 0.42121026]
 [ 0.45499933]
 [-2.5288522 ]]

Initial Bias: [0.]
Final Bias: [4.05879]

Initial Loss: 155.67267

Final Loss: 39.838947


# Introducing and Optimizing Multiple Layers

In the final part of this work, we will look at how learning happens in multilayer neural networks. Nothing fancy: we will stack two Dense layers, a hidden layer with 10 ReLU units followed by a single linear output unit. The gradient calculation is similar to the single-layer case, but the chain rule now has to pass through both layers.

This time we will also cap the training loop with a `max_iterations` count.
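
As a minimal sketch of that pattern, the loop below stops either when the weight change is small enough or when the iteration cap is reached. It reuses the one-weight problem from the first section; `train`, `step`, and `tolerance` are hypothetical names chosen for this example.

```python
def train(w, step, tolerance=0.1, max_iterations=1000):
    """Apply `step` until the weight change is small or the cap is reached."""
    history = [w]
    for _ in range(max_iterations):          # the cap guarantees termination
        w = step(w)
        history.append(w)
        if abs(history[-1] - history[-2]) <= tolerance:
            break                            # converged before hitting the cap
    return history

def step(w, learning_rate=0.1):
    """One gradient-descent step for the MSE of y_pred = x * w."""
    xs, ys = [1.0, 2.0], [1.0, 2.0]
    gradient = 2 * sum((x * w - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - learning_rate * gradient

print(train(2.0, step))                                   # stops on the tolerance
print(len(train(2.0, step, tolerance=0.0, max_iterations=3)))  # stops on the cap
```

Using a `for` loop over `range(max_iterations)` keeps the cap and the counter in one place, so there is no separate counter variable to forget to increment.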

Problem Statement

- $x \in \mathbb{R}^{10}$
- $y \in \mathbb{R}$
- $y_{pred} = model(x)$
- Loss Function: $\frac{1}{N} \sum (y_{pred}-y_{true})^2$
- Stopping criteria: $|L_{t+1}-L_t| \le 0.1$
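
To show what the tape has to compute for two layers, here is a hand-written backward pass in NumPy for a 10-unit ReLU layer followed by a linear output, checked against a finite-difference estimate. This is a sketch under assumptions: the random data, the seed, the target value, and all variable names are made up for the illustration and are not the notebook's layers.

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative seed
x = rng.normal(size=(1, 10))              # one sample with 10 features
y = np.array([[1.5]])                     # arbitrary target value
W1, b1 = rng.normal(size=(10, 10)) * 0.1, np.zeros(10)
W2, b2 = rng.normal(size=(10, 1)) * 0.1, np.zeros(1)

def forward(W1, b1, W2, b2):
    z = x @ W1 + b1                       # hidden pre-activation, shape (1, 10)
    h = np.maximum(z, 0.0)                # ReLU
    y_pred = h @ W2 + b2                  # linear output, shape (1, 1)
    loss = np.mean((y_pred - y) ** 2)
    return z, h, y_pred, loss

z, h, y_pred, loss = forward(W1, b1, W2, b2)

# Backward pass: apply the chain rule from the loss back to each parameter.
d_pred = 2 * (y_pred - y) / y.size        # dL/dy_pred
grad_W2 = h.T @ d_pred                    # shape (10, 1)
grad_b2 = d_pred.sum(axis=0)              # shape (1,)
d_h = d_pred @ W2.T                       # gradient flowing into the hidden layer
d_z = d_h * (z > 0)                       # ReLU passes gradient only where z > 0
grad_W1 = x.T @ d_z                       # shape (10, 10)
grad_b1 = d_z.sum(axis=0)                 # shape (10,)

def numeric_gradient(param, index, eps=1e-6):
    """Central-difference estimate of dL/d(param[index]); restores the entry afterwards."""
    original = param[index]
    param[index] = original + eps
    loss_plus = forward(W1, b1, W2, b2)[3]
    param[index] = original - eps
    loss_minus = forward(W1, b1, W2, b2)[3]
    param[index] = original
    return (loss_plus - loss_minus) / (2 * eps)

print(grad_W1[0, 0], numeric_gradient(W1, (0, 0)))
print(grad_W2[0, 0], numeric_gradient(W2, (0, 0)))
```

The first-layer gradient depends on `W2` and on the ReLU mask, which is the extra complexity compared with the single-layer case.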

In [25]:
#import the libraries
import tensorflow as tf
import numpy as np
import random
import pandas as pd

#set seed
SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

#create the model class
class CustomDense(tf.keras.layers.Layer):
  def __init__(self,units,activation): #get the number of units
    super(CustomDense,self).__init__() #allow the subclass to use tf.keras.layers.Layer attributes

    self.units = units
    self.activation = tf.keras.activations.get(activation) #this allows our model to get activations in string type -> 'relu'

  def build(self,input_shape):

    #define the initializers
    weight_initializer = tf.keras.initializers.GlorotUniform() #use the same initializer with the layers.Dense()
    bias_initializer = tf.keras.initializers.Zeros() #use the same initializer with the layers.Dense()

    #create the first weights and biases
    initial_weights = weight_initializer(shape = (input_shape[-1],self.units),dtype = tf.float32)
    initial_biases = bias_initializer(shape = (self.units,),dtype = tf.float32)#one bias per node

    #create the trainable variables
    self.wgt = tf.Variable(initial_value = initial_weights,trainable = True,name = 'weights')
    self.bs = tf.Variable(initial_value = initial_biases,trainable = True,name = 'biases')
    super().build(input_shape)

  def call(self,inputs):
    return self.activation(tf.matmul(inputs,self.wgt) + self.bs)



In [26]:
#create the dataset
from sklearn.datasets import make_regression
x,y = make_regression(n_samples = 10,n_features = 10,n_targets = 1,bias = 1,random_state = 42)
dataset_X = pd.DataFrame(x)
dataset_Y = pd.DataFrame(y)
dataset = pd.concat([dataset_X,dataset_Y],axis = 1)
dataset.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.299007 | -0.035826 | -1.987569 | 0.087047 | -2.619745 | 0.821903 | 1.564644 | 0.091761 | 1.538037 | 0.361396 | -324.129377 |
| 1 | 0.328751 | 1.477894 | 0.513267 | 0.915402 | -0.808494 | -0.501757 | -0.51827 | -0.52976 | 0.357113 | -0.219672 | 8.745854 |
| 2 | 0.261055 | -0.702053 | -0.234587 | 0.29612 | -0.392108 | -1.463515 | -0.327662 | 0.005113 | 0.968645 | 0.097078 | -66.080255 |
| 3 | -0.309212 | -0.676922 | 0.975545 | -0.839218 | 1.031 | 0.93128 | 0.611676 | 0.331263 | -0.385082 | 0.324084 | 187.396417 |
| 4 | 1.057122 | -0.115648 | -1.76304 | -0.460639 | -1.478522 | -0.719844 | -0.301104 | 0.343618 | 0.171368 | 0.738467 | -247.012074 |


In [27]:
#create dense layers
custom_dense1 = CustomDense(10, activation='relu')
custom_dense1.build(input_shape=(10,))
custom_dense2 = CustomDense(1, activation='linear')
custom_dense2.build(input_shape=(10,))  #output shape of the previous layer is input shape for this one

# Initialize the containers for both layers
weight_list1, weight_list2 = [], []
bias_list1, bias_list2 = [], []
loss_list = []

error_margin = 0.1
learning_rate = 0.0001
max_iterations = 1000
iterations = 0

#start training
while True:
    with tf.GradientTape(persistent=True) as t:
        random_row = dataset.sample(1).values
        random_x = random_row[:, :-1]
        random_y = random_row[:, -1]

        h = custom_dense1(random_x)
        y_pred = custom_dense2(h)

        current_loss = tf.reduce_mean(tf.keras.metrics.mean_squared_error(y_true=random_y, y_pred=y_pred))

    gradient2 = t.gradient(current_loss, custom_dense2.trainable_variables)
    gradient1 = t.gradient(current_loss, custom_dense1.trainable_variables)

    custom_dense2.trainable_variables[0].assign_sub(learning_rate * gradient2[0])
    custom_dense2.trainable_variables[1].assign_sub(learning_rate * gradient2[1])
    custom_dense1.trainable_variables[0].assign_sub(learning_rate * gradient1[0])
    custom_dense1.trainable_variables[1].assign_sub(learning_rate * gradient1[1])

    # Record the process
    weight_list1.append(custom_dense1.trainable_variables[0].numpy())
    bias_list1.append(custom_dense1.trainable_variables[1].numpy())
    weight_list2.append(custom_dense2.trainable_variables[0].numpy())
    bias_list2.append(custom_dense2.trainable_variables[1].numpy())
    loss_list.append(current_loss.numpy())

    # Check for convergence or max iterations
    if len(loss_list) > 1 and tf.abs(loss_list[-1] - loss_list[-2]) <= error_margin:
        break

    if iterations == max_iterations:
      break

    iterations +=1


In [28]:
print('Sample Results:')
print("--------------")

for _ in range(10):
  #get a sample
  sample = dataset.sample(1).values
  sample_x = sample[:,:-1]
  sample_y = sample[:,-1]

  #display the prediction
  prediction = custom_dense2(custom_dense1(sample_x))
  print("\nPrediction:",prediction.numpy())
  print("Real value:", sample_y)

Sample Results:
--------------

Prediction: [[-251.45706]]
Real value: [-247.01207405]

Prediction: [[-320.13092]]
Real value: [-324.12937683]

Prediction: [[-140.71696]]
Real value: [-136.84556017]

Prediction: [[187.34027]]
Real value: [187.39641651]

Prediction: [[-121.19164]]
Real value: [-121.36189259]

Prediction: [[-140.71696]]
Real value: [-136.84556017]

Prediction: [[187.34027]]
Real value: [187.39641651]

Prediction: [[203.538]]
Real value: [219.73358496]

Prediction: [[63.713146]]
Real value: [60.94917984]

Prediction: [[-140.71696]]
Real value: [-136.84556017]
