## Dense layer from scratch
**Design of a perceptron:** Take the inputs, compute their weighted sum (plus a bias), and pass the result through a non-linear activation function (for example sigmoid, tanh, ReLU, or Leaky ReLU).
A perceptron is the fundamental building block of a neural network.

**Activation Functions**: They introduce non-linearity into the model, which makes it possible to learn complex decision boundaries in the feature space.
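
As a rough illustration of the weighted sum and the activations mentioned above, here is a minimal dependency-free sketch in plain Python. The function names (`perceptron`, `leaky_relu`, and so on), the default `alpha=0.01`, and the example inputs are illustrative choices, not part of any library.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Passes positive values through, clips negatives to zero
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but keeps a small slope (alpha, assumed here) for negatives
    return z if z > 0 else alpha * z

def perceptron(inputs, weights, bias, activation=sigmoid):
    # Weighted sum of the inputs plus a bias, then a non-linear activation
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Weighted sum is 1.0*0.5 + 2.0*(-0.25) + 0.0 = 0.0, and sigmoid(0) = 0.5
example_output = perceptron([1.0, 2.0], [0.5, -0.25], bias=0.0)
print(example_output)
```

`math.tanh` can be passed as `activation` in the same way.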


In [None]:
import tensorflow as tf

class DenseLayer(tf.keras.layers.Layer):
  def __init__(self, input_dim, output_dim):
    super().__init__()

    # Initialize weights and bias; pass shape by keyword, since in some
    # Keras versions the first positional argument of add_weight is the name
    self.W = self.add_weight(shape=(input_dim, output_dim))
    self.b = self.add_weight(shape=(1, output_dim))

  def call(self, inputs):
    # Forward propagate the inputs: z = inputs @ W + b
    z = tf.matmul(inputs, self.W) + self.b

    # Feed through a non-linear activation
    output = tf.math.sigmoid(z)

    return output

### Calling the above function
Keras provides a built-in dense layer, `tf.keras.layers.Dense`. Stacking layers of perceptrons forms a neural network.

In [None]:
import tensorflow as tf
layer = tf.keras.layers.Dense(units=2)

### Quantifying Loss
Loss measures the cost incurred from incorrect predictions. 
* MSE (mean squared error) is used when predicting continuous values.
* Cross-entropy is used when the output is categorical.
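
To make the two losses concrete, here is a small worked sketch in plain Python. The helper names (`mse`, `softmax`, `cross_entropy`) and the sample values are assumptions chosen for illustration; the cross-entropy version takes a one-hot label and raw logits for a single example.

```python
import math

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, logits):
    # Negative log-probability assigned to the true class
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Squared errors are 0, 0 and 4, so the mean is 4/3
mse_value = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Equal logits give probability 0.5 to each class, so the loss is log(2)
ce_value = cross_entropy([0.0, 1.0], [0.0, 0.0])
print(mse_value, ce_value)
```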

In [None]:
# y holds one-hot labels, predicted holds raw (pre-softmax) logits
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predicted))

### Loss Optimization
Once the cost function is defined, learning becomes an optimization problem: find the weights that achieve the lowest loss.

### Gradient Descent Algorithm
1. Initialize weights randomly
2. Loop until convergence:
   * Compute the gradient: dJ(W)/dW
   * Update the weights: W <-- W - η · dJ(W)/dW, where η is the learning rate (step size)
3. Return weights
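
The steps above can be sketched without any framework. This is a minimal example on a hypothetical one-dimensional loss J(w) = (w - 3)^2, whose gradient is known in closed form; the names `gradient_descent`, `tol`, and `max_steps`, and the default values, are assumptions for illustration.

```python
def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Analytic derivative of the toy loss: dJ/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, tol=1e-8, max_steps=1000):
    w = w0
    for _ in range(max_steps):
        g = grad(w)
        # Treat a near-zero gradient as convergence
        if abs(g) < tol:
            break
        # Step against the gradient, scaled by the learning rate
        w = w - lr * g
    return w

w_final = gradient_descent(w0=0.0)
print(w_final, loss(w_final))
```

A fixed starting point is used here instead of a random one so the run is reproducible.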

In [None]:
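import tensorflow as tf

lr = 0.01  # learning rate / step size (example value)
# tf.random.normal requires a shape; (2, 1) is only an example
weights = tf.Variable(tf.random.normal(shape=(2, 1)))

while True:  # loop until convergence; add a stopping criterion in practice
  with tf.GradientTape() as g:
    # compute_loss is a placeholder for the model's loss function
    loss = compute_loss(weights)

  # Compute the gradient once the forward pass has been recorded
  gradient = g.gradient(loss, weights)

  # Update in place so weights stays a tf.Variable that the tape tracks
  weights.assign_sub(lr * gradient)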

### Gradient Descent Algorithms
Choosing the right step size (learning rate) matters and determines how well the model converges. Optimizers available in `tf.keras.optimizers` include:
1. SGD
2. Adam
3. Adadelta
4. Adagrad
5. RMSProp

### Putting it all together

In [None]:
import tensorflow as tf

model = tf.keras.Sequential([])  # add layers here

# Choose an optimizer
optimizer = tf.keras.optimizers.SGD()

while True:  # x, y and compute_loss are placeholders defined elsewhere
  with tf.GradientTape() as tape:
    # Forward pass, run inside the tape so it is recorded for differentiation
    prediction = model(x)

    # Compute loss
    loss = compute_loss(y, prediction)

  # Update the weights using gradients
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))


### Overfitting and underfitting
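Machine learning models, and deep learning models in particular, are prone to overfitting: the model fits the training data too closely, including its noise, and fails to generalize to unseen data. Underfitting is the opposite: the model is too simple to capture the general trend of the data.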

### Regularization
Regularization is a technique that constrains the optimization problem to discourage complex models.

Need for regularization: it improves the model's generalization to unseen data. Training minimizes the cost function (computed on the training set), but the ultimate goal is to perform well on the test set.

**Regularization I**: Dropout

During training, we randomly select some activations in the hidden layer and set them to zero.

* Typically drop 50% of the activations in a layer, choosing a different random subset at each iteration.
* Forces the network to not rely on any single node.
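
The idea can be sketched in plain Python as "inverted" dropout, where the surviving activations are scaled up by 1 / (1 - rate) so their expected sum is unchanged (the Keras `Dropout` layer applies the same rescaling during training). The function name `dropout`, its parameters, and the seeded generator are illustrative assumptions.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    # At inference time (or with rate 0) activations pass through unchanged
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # Keep each activation with probability keep_prob and rescale it,
    # otherwise set it to zero
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # seeded only to make the sketch reproducible
hidden = [1.0] * 8
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

Each call draws a fresh random mask, so a different subset of nodes is zeroed at every training iteration.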

**Regularization II**: Early Stopping

Stop training before we have a chance to overfit.

* Split the training data into two parts (training and validation). Plot both losses against the training iterations and stop when the validation loss starts to increase.
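
One common way to automate this is to stop once the held-out loss has not improved for a fixed number of epochs. The sketch below assumes the per-epoch losses on the held-out split are already available as a list; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    # Track the best held-out loss seen so far and how long since it improved
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            # Stop once the loss has failed to improve `patience` times in a row
            if waited >= patience:
                break
    return best_epoch, best_loss

# Hypothetical held-out losses: they fall, then rise as overfitting sets in
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)
```

In a real training loop the weights from `best_epoch` would be saved and restored after stopping.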


In [None]:
tf.keras.layers.Dropout(rate=0.5)