# Common Activation Functions:
- Sigmoid Function
formula: $f(x) = \frac{1}{1+e^{-x}}$
```python
tf.math.sigmoid(x)
```
- Tanh Function
formula: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
```python
tf.math.tanh(x)
```
- ReLU Function
formula: $f(x) = \max(0, x)$
```python
tf.nn.relu(x)
```
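
As a quick sanity check of the three formulas, the following framework-free sketch evaluates them with Python's `math` module. The helper names `sigmoid`, `tanh`, and `relu` are illustrative stand-ins defined here, not the TensorFlow functions listed above.

```python
import math

def sigmoid(x):
    # 1 / (1 + e^-x): squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x): squashes any real input into (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    # max(0, x): passes positive inputs through and zeroes out negatives
    return max(0.0, x)

# Evaluate each activation at a negative, zero, and positive input
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

At `x = 0` the sigmoid is exactly 0.5 and tanh is exactly 0, which is a convenient way to remember where each curve is centred.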

In [1]:
import tensorflow as tf

In [2]:
class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # Initialize weights and biases as trainable variables
        # (shape is passed by keyword because the positional order differs across Keras versions)
        self.W = self.add_weight(shape=(input_dim, output_dim), initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim), initializer="zeros")

    def call(self, inputs):
        # Forward propagate the inputs: z = inputs @ W + b
        z = tf.matmul(inputs, self.W) + self.b

        # Feed through a non-linear activation
        output = tf.math.sigmoid(z)

        return output
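
To make the forward pass concrete, here is a minimal plain-Python sketch of the same computation (a weighted sum plus bias, followed by a sigmoid) for a single input row. The function name `dense_forward` and all numeric values are hypothetical, chosen so that the result is easy to verify by hand.

```python
import math

def dense_forward(inputs, W, b):
    """Compute sigmoid(inputs @ W + b) for one input row, using plain lists."""
    outputs = []
    for j in range(len(b)):
        # z_j = sum_i inputs[i] * W[i][j] + b[j]
        z = sum(x_i * W[i][j] for i, x_i in enumerate(inputs)) + b[j]
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

# Hypothetical example values: 2 inputs feeding 2 output units
x = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 0.5]]
b = [-1.0, 0.0]
out = dense_forward(x, W, b)
print(out)  # both pre-activations are 0, so both outputs are 0.5
```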

In [None]:
layer = tf.keras.layers.Dense(units=2, activation='sigmoid')

In [5]:
# Stack layers: a hidden layer with n units feeding a 2-unit output layer
n = 4
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n),
    tf.keras.layers.Dense(2)
])

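# Mean Squared Error Loss:
```python
# Manual form: mean of the squared differences (a scalar)
loss = tf.reduce_mean(tf.square(tf.subtract(y, predicted)))
# Built-in alternative: averages over the last axis, giving one value per sample
loss = tf.keras.losses.MSE(y, predicted)
```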
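
A small worked example may help here. The sketch below computes the same quantity in plain Python; the function name `mean_squared_error` and the sample values are assumptions made for illustration.

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions: only the last prediction is off, by 2
y = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
loss = mean_squared_error(y, predicted)
print(loss)  # (0 + 0 + 4) / 3
```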


# Gradient Descent:
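```python
weights = tf.Variable(tf.random.normal(shape=[1]))  # random initialization
lr = 0.01  # learning rate (example value)

while True:  # loop until convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)  # compute loss (compute_loss is a user-defined placeholder)
    gradient = g.gradient(loss, weights)  # compute gradient
    weights.assign_sub(lr * gradient)  # update weights in place so they stay a tf.Variable
```
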
# Backpropagation:

Backpropagation uses the chain rule to compute the gradient of the loss function with respect to each weight of the network.
- $w_{ij}^{(l)}$ is the weight of the connection between the $i^{th}$ neuron in the $l^{th}$ layer and the $j^{th}$ neuron in the $(l+1)^{th}$ layer
- $z_{j}^{(l+1)}$ is the input to the activation function of the $j^{th}$ neuron in the $(l+1)^{th}$ layer

formula: $\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial z_{j}^{(l+1)}} \frac{\partial z_{j}^{(l+1)}}{\partial w_{ij}^{(l)}}$
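
The chain rule can be checked numerically on the smallest possible case. The sketch below assumes a single neuron with pre-activation $z = w \cdot a$, a sigmoid activation, and a squared-error loss (none of which the notes fix explicitly), so that $\partial z / \partial w$ is simply the incoming activation. The names `loss_fn`, `grad_chain_rule`, and the numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, a_prev, target):
    # One neuron: z = w * a_prev, activation = sigmoid(z), squared-error loss
    return (sigmoid(w * a_prev) - target) ** 2

def grad_chain_rule(w, a_prev, target):
    z = w * a_prev
    a = sigmoid(z)
    dL_da = 2.0 * (a - target)   # dL/da for the squared error
    da_dz = a * (1.0 - a)        # da/dz for the sigmoid
    dL_dz = dL_da * da_dz        # dL/dz
    dz_dw = a_prev               # dz/dw: the incoming activation
    return dL_dz * dz_dw         # dL/dw = dL/dz * dz/dw

# Hypothetical values for illustration
w, a_prev, target = 0.5, 2.0, 1.0
analytic = grad_chain_rule(w, a_prev, target)

# Central finite difference as an independent check of the analytic gradient
eps = 1e-6
numeric = (loss_fn(w + eps, a_prev, target) - loss_fn(w - eps, a_prev, target)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree to several decimal places; this kind of finite-difference comparison is a common way to test a hand-written gradient.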

# Adaptive Learning Rate:
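Gradient descent optimizers in Keras (Adam and AdaGrad adapt the step size per parameter; plain SGD uses a fixed learning rate unless it is given a schedule):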
- SGD: Stochastic Gradient Descent
```python
tf.keras.optimizers.SGD
```
- Adam: adaptive moment estimation
```python
tf.keras.optimizers.Adam
```
- AdaGrad: adaptive gradient algorithm
```python
tf.keras.optimizers.Adagrad
```
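
To illustrate what "adaptive" means here, the following is a simplified plain-Python sketch of the AdaGrad update rule, which divides each parameter's step by the square root of its accumulated squared gradients. It is not the Keras implementation; `adagrad_step`, `grad_fn`, the loss, and the hyperparameter values are all assumptions for the example.

```python
import math

def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """One AdaGrad update on lists of parameters (simplified sketch)."""
    new_w, new_accum = [], []
    for w_i, g_i, s_i in zip(w, grad, accum):
        s_i = s_i + g_i ** 2                      # running sum of squared gradients
        step = lr * g_i / (math.sqrt(s_i) + eps)  # per-parameter scaled step
        new_w.append(w_i - step)
        new_accum.append(s_i)
    return new_w, new_accum

# Hypothetical loss L(w) = w0^2 + 100 * w1^2: the two gradients differ by a factor of 100
def grad_fn(w):
    return [2.0 * w[0], 200.0 * w[1]]

w, accum = [1.0, 1.0], [0.0, 0.0]
w, accum = adagrad_step(w, grad_fn(w), accum)
print(w)
```

Although the second gradient is 100 times larger than the first, both parameters move by roughly the same amount on the first step, because each step is normalised by that parameter's own gradient history.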

# Putting it together:
```python
model = tf.keras.Sequential([...])
optimizer = tf.keras.optimizers.SGD()
while True:  # loop until convergence
    with tf.GradientTape() as tape:
        prediction = model(x)  # Forward pass (must run inside the tape to be recorded)
        loss = compute_loss(y, prediction)  # Compute loss
    grads = tape.gradient(loss, model.trainable_variables)  # Compute gradient
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # Update weights
```

# Stochastic Gradient Descent:
process:
- initialize the weights with random values
- loop until convergence:
    - select a random subset (mini-batch) of data points
    - compute the gradient of the loss with respect to the weights on that subset
    - update the weights by stepping against the gradient
- return the weights
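
The steps above can be sketched in plain Python for a one-dimensional linear model. This is a minimal illustration under stated assumptions: the function name `sgd_linear_fit`, the synthetic data, the hyperparameters, and the fixed step budget (standing in for a real convergence test) are all made up for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, steps=2000, seed=0):
    """Fit y = w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)  # random initialization
    indices = list(range(len(xs)))
    for _ in range(steps):  # fixed step budget stands in for "until convergence"
        batch = rng.sample(indices, batch_size)            # random subset of data points
        grad_w = grad_b = 0.0
        for i in batch:
            err = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * err * xs[i] / batch_size       # d(MSE)/dw on the batch
            grad_b += 2.0 * err / batch_size               # d(MSE)/db on the batch
        w -= lr * grad_w                                   # update the weights
        b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1
xs = [i / 10.0 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Because each update only touches `batch_size` points, the cost per step stays constant as the dataset grows, at the price of a noisier gradient estimate.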

# Regularization:
## Dropout:
- during training, randomly set a fraction of the activations to zero
- Dropping neurons removes the co-dependency between them. This forces the network to learn redundant representations, which makes it more robust and less likely to overfit the training data.
```python
tf.keras.layers.Dropout(rate)
# dropout rate: the fraction of the input units to drop in a layer
```
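
The mechanics can be sketched in a few lines of plain Python. This is a simplified illustration of "inverted" dropout, where the surviving activations are scaled by `1 / (1 - rate)` during training so that their expected sum is unchanged; the function name `dropout` and its `training` and `seed` parameters are assumptions for this example, not the Keras layer's signature.

```python
import random

def dropout(activations, rate, training=True, seed=None):
    """Inverted dropout on a list of activations (simplified sketch)."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass values through unchanged
    rng = random.Random(seed)
    keep_prob = 1.0 - rate
    # Keep each unit with probability keep_prob and rescale it; otherwise zero it
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

# Hypothetical activations
acts = [1.0] * 10
dropped = dropout(acts, rate=0.5, seed=0)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```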

## Early Stopping:
- when a model overfits, the validation loss starts to increase after a certain point while the training loss continues to decrease
- stop training when the validation loss stops improving
```python
tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=0)
```
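
The role of `patience` is easier to see on a concrete loss curve. The sketch below is a simplified stand-in for the patience logic, not the Keras callback itself: it stops once the validation loss has failed to improve on its best value for `patience` consecutive epochs (or at the first non-improving epoch when `patience=0`). The function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # new best: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch          # no improvement for long enough: stop
    return len(val_losses) - 1        # never triggered: all epochs ran

# Hypothetical validation losses: improving at first, then rising as overfitting sets in
history = [0.9, 0.7, 0.6, 0.65, 0.62, 0.7]
print(early_stopping_epoch(history, patience=0))
print(early_stopping_epoch(history, patience=2))
```

A larger `patience` tolerates short plateaus or noisy epochs before giving up, which is often useful when the validation loss fluctuates from epoch to epoch.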
