# Problem of Overfitting
- Overfitting: If we have too many features, the learned hypothesis may fit the training set very well (the training cost $J(\theta)$ is close to zero) but fail to generalize to new examples (e.g. it predicts prices poorly on examples it has not seen).
- Addressing overfitting:
    - Increase number of training examples
    - Reduce number of features
        - Manually select which features to keep
        - Model selection algorithm
    - Regularization
        - Eliminating features results in loss of information
        - Keep all features, but reduce magnitude/values of parameters $\theta_j$
        - Works well when we have a lot of features, each of which contributes a bit to predicting $y$.

# Addressing Overfitting
- Collect More Data
    - Fixes high variance
- Select Features
    - Features selection algorithm
- Reduce size of Parameters
    - Regularization

# Cost Function with Regularization:
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$
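
The cost above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `regularized_cost` and the tiny data set are made up for the example, and `X` is assumed to carry a leading column of ones so that `theta[0]` is the unpenalized intercept.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized squared-error cost J(theta) for linear regression.

    X has shape (m, n + 1) with a leading column of ones, so theta[0] is the
    intercept and is left out of the penalty (the sum over j starts at 1).
    """
    m = X.shape[0]
    residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i)
    penalty = lam * np.sum(theta[1:] ** 2)    # lambda * sum_{j>=1} theta_j^2
    return (np.sum(residuals ** 2) + penalty) / (2 * m)

# Illustrative data (assumption): y = 2x, fitted exactly by theta = [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])

cost_no_reg = regularized_cost(theta, X, y, lam=0.0)  # only the data term
cost_reg = regularized_cost(theta, X, y, lam=3.0)     # data term + penalty
print(cost_no_reg, cost_reg)
```

With a perfect fit the data term is zero, so any remaining cost comes from the penalty on $\theta_1$.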


# Regularized Linear Regression:
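- Gradient Descent:
- Repeat until convergence, updating all $\theta_j$ simultaneously. $\theta_0$ is not regularized, so its update has no $\lambda$ term:
    $$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$
- Derivation Steps (for $j = 1, \dots, n$):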
- $$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \left[ \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{k=1}^n \theta_k^2 \right]$$
- $$= \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j$$
- $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
- $$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$$
- $$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
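
The final update rule can be tried out with a short NumPy sketch. The helper names `gradient_step` and `fit`, the data, and the values of `alpha`, `lam` and `steps` are assumptions chosen for illustration; the update itself is the shrink-then-step form derived above, with $\theta_0$ left unregularized.

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for linear regression."""
    m = X.shape[0]
    error = X @ theta - y                      # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m                   # data-term gradient for every j
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0                            # theta_0 is not regularized
    return theta * shrink - alpha * grad

# Illustrative data (assumption): y = 1 + 2x, X has a leading column of ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

def fit(lam, steps=5000, alpha=0.1):
    theta = np.zeros(2)
    for _ in range(steps):
        theta = gradient_step(theta, X, y, alpha, lam)
    return theta

theta_plain = fit(lam=0.0)   # recovers intercept 1 and slope 2
theta_reg = fit(lam=5.0)     # slope is shrunk toward zero
print(theta_plain, theta_reg)
```

Comparing the two results shows the effect of the factor $\left(1 - \alpha \frac{\lambda}{m}\right)$: the slope shrinks while the intercept adjusts to compensate.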

# Advanced Learning Algorithms

# Neural Networks
- algorithms to mimic the brain
- applications: speech recognition, computer vision, robotics, etc.
- model computation as a network of neurons

# Hidden Layers:
- the number of hidden layers is a hyperparameter; one way to choose it is a grid search over candidate values

# Activation Value:
- $a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
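- $a_i^{(j)}$ is given by $a_i^{(j)} = g(z_i^{(j)}) = g\left(\theta_{i0}^{(j-1)} a_0^{(j-1)} + \theta_{i1}^{(j-1)} a_1^{(j-1)} + \dots + \theta_{in}^{(j-1)} a_n^{(j-1)}\right)$, where $\theta^{(j-1)}$ holds the weights mapping layer $j-1$ to layer $j$ and $a_0^{(j-1)} = 1$ is the bias unit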

# Forward Propagation:
- $a_1^{(2)} = g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3)$
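
The same computation for a whole layer can be written as one matrix-vector product. This is a minimal sketch: `dense_forward` is a hypothetical helper name, the weight values are arbitrary, and the weight matrix is assumed to store the bias weight $\theta_{i0}$ in column 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(a_prev, theta):
    """Activations of one layer from the previous layer's activations.

    a_prev: previous-layer activations without the bias unit, shape (n,)
    theta:  weights of shape (units, n + 1); column 0 multiplies the bias unit
    """
    a_with_bias = np.concatenate(([1.0], a_prev))  # prepend a_0 = 1
    z = theta @ a_with_bias                        # one z_i per unit
    return sigmoid(z)

# Arbitrary illustrative values (assumption): 3 inputs, 2 units in layer 2
x = np.array([1.0, 0.0, 1.0])                      # x_1, x_2, x_3
theta1 = np.array([[0.0, 1.0, 2.0, -1.0],
                   [0.5, -0.5, 0.0, 0.5]])
a2 = dense_forward(x, theta1)
print(a2)
```

Stacking calls to `dense_forward`, each fed with the previous output, gives forward propagation through the whole network.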

# Representation:
```python
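import numpy as np
from tensorflow.keras.layers import Dense

# Four input examples with two features each (float32, the dtype Dense computes in)
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)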
layer_1 = Dense(units=2, activation='sigmoid')
a1 = layer_1(x)
layer_2 = Dense(units=1, activation='sigmoid')
a2 = layer_2(a1)
```

# Building Neural Networks
- code:
```python
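import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Two-layer network: 2 hidden sigmoid units, 1 sigmoid output unit
layer_1 = Dense(units=2, activation='sigmoid')
layer_2 = Dense(units=1, activation='sigmoid')
model = Sequential([layer_1, layer_2])
# XOR inputs and targets
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)
# `learning_rate` replaces the deprecated `lr` argument
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.1))
model.fit(x, y, epochs=1000, verbose=0)
print(model.predict(x))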
```

# Train a Neural Network in TensorFlow
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

#defining the model
# here the model has 3 layers
model = Sequential(
    [
        Dense(units=25, activation="sigmoid"),
        Dense(units=15, activation="sigmoid"),
        Dense(units=1, activation="sigmoid"),
    ]
)

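from tensorflow.keras.losses import BinaryCrossentropy, MeanSquaredError

# Pick ONE loss: calling compile() again replaces the previous configuration
model.compile(loss=BinaryCrossentropy())  # binary classification (sigmoid output)
# model.compile(loss=MeanSquaredError())  # regression (with a linear output layer)

# x, y are the training inputs and labels, assumed to be defined elsewhere
model.fit(x, y, epochs=1000)  # runs the gradient-based training loop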
```

# Activation Functions:
- Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$ -> Binary Classification
- ReLU: $g(z) = \max(0, z)$ -> Hidden Layers
- Linear: $g(z) = z$ -> Regression
- Output ranges:
    - Sigmoid: $0 < g(z) < 1$
    - ReLU: $g(z) \geq 0$, and can be much greater than 1
    - Linear: $g(z)$ can be any real number, including values much less than -1

## Choosing Activation Function:
- Output Layer:
    - Linear: $h_\theta(x) = \theta^T x$ for regression where $y$ can be positive or negative
    - sigmoid: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ for binary classification, $y$ = 0 or 1
    - relu: $h_\theta(x) = \max(0, \theta^T x)$ for regression where $y$ is never negative ($y \geq 0$); multi-class classification uses a softmax output instead

- Hidden Layers:
    - relu: $g(z) = \max(0, z)$ is faster to compute than sigmoid and is the most common choice for hidden layers
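
The three activation functions and their output ranges can be checked with a few lines of NumPy. This is only an illustrative sketch; the sample values in `z` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output strictly between 0 and 1

def relu(z):
    return np.maximum(0.0, z)         # output is never negative

def linear(z):
    return z                          # output is unbounded in both directions

z = np.array([-3.0, 0.0, 3.0])        # arbitrary sample inputs
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
print("linear: ", linear(z))
```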

# Multi-class Classification:
- activation function for output layer: softmax
- softmax: $h_\theta(x)_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^k e^{\theta_j^T x}}$
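- each softmax output depends on the scores $\theta_j^T x$ of all units in the output layer (through the normalizing sum), unlike sigmoid, which depends only on its own unit
- logistic regression is the special case of softmax regression with $k = 2$: the two-class softmax gives the same result as the sigmoid function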
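
A small NumPy sketch of softmax, including a check of the $k = 2$ case. The scores in `z` are arbitrary, and subtracting the maximum before exponentiating is a common stability trick added here, not something stated in the notes; it does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of scores."""
    shifted = z - np.max(z)           # same result, avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])         # arbitrary scores for k = 3 classes
p = softmax(z)
print(p, p.sum())

# k = 2: softmax over the scores [z, 0] gives sigmoid(z) for the first class
z_bin = 0.7
p_two_class = softmax(np.array([z_bin, 0.0]))[0]
p_sigmoid = 1.0 / (1.0 + np.exp(-z_bin))
print(p_two_class, p_sigmoid)
```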

## Cost:
- logistic regression: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right]$, $h_\theta(x^{(i)}) = \frac{1}{1 + e^{-\theta^T x^{(i)}}}$
    - here replacing 1 - y with y' and 1 - h with h' we get: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + y'^{(i)} \log h'_\theta(x^{(i)}) \right]$
- similarly for softmax regression: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{j=1}^k \left[ y_j^{(i)} \log h_\theta(x^{(i)})_j \right]$
- in TensorFlow this loss is `SparseCategoricalCrossentropy`, used when the labels are integer class indices ($0$ to $k - 1$) rather than one-hot vectors
```python 
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())
```    
### Numerical Roundoff Error:
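- to reduce numerical round-off error, give the output layer a linear activation (so the model outputs logits) and specify from_logits=True in the loss; TensorFlow then computes softmax and cross-entropy together in a numerically more stable form
- at prediction time, apply softmax to the model outputs to turn the logits into probabilities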
```python
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
```
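
To see why working from logits helps, the NumPy sketch below compares a naive "softmax, then log" computation with a log-sum-exp formulation. It illustrates the general idea only and is not TensorFlow's actual implementation; both function names and the example logits are assumptions.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy from raw logits (m, k) and integer labels (m,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def sparse_ce_naive(logits, labels):
    """Same loss, but computing softmax probabilities first."""
    e = np.exp(logits)
    probs = e / e.sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

labels = np.array([0])
small = np.array([[2.0, 1.0, 0.1]])
large = np.array([[1000.0, 0.0, -1000.0]])   # exp(1000) overflows

stable_small = sparse_ce_from_logits(small, labels)
naive_small = sparse_ce_naive(small, labels)
stable_large = sparse_ce_from_logits(large, labels)
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = sparse_ce_naive(large, labels)  # inf / inf gives nan

print(stable_small, naive_small)
print(stable_large, naive_large)
```

For moderate logits the two agree; for extreme logits only the log-sum-exp version stays finite.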

# Multi-Label Classification:
- multi-label classification: $y \in \{0, 1\}^4$
- multi-class classification: $y \in \{0, 1, 2, 3\}$
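- multi-class differs from multi-label: in multi-class the classes are mutually exclusive (exactly one class per example), while in multi-label each label can be 0 or 1 independently of the others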

## Adam Algorithm:
- adaptive moment estimation
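- differs from plain gradient descent in that the learning rate is adapted automatically during training
- each parameter gets its own effective step size, derived from running averages of its gradient and squared gradient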
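
The per-parameter step size can be illustrated with a bare-bones NumPy version of the Adam update. This is a sketch under stated assumptions: `adam_minimize` is a hypothetical helper, the hyperparameter defaults follow commonly used values, and the quadratic objective is made up so that the minimum is known.

```python
import numpy as np

def adam_minimize(grad_fn, w0, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a function with Adam, given its gradient."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # running average of gradients (first moment)
    v = np.zeros_like(w)   # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w

# Illustrative objective (assumption): f(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2
def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

w_opt = adam_minimize(grad, [0.0, 0.0])
print(w_opt)   # close to the minimizer [3, -1]
```

Dividing by `np.sqrt(v_hat)` is what gives each parameter its own step size, even though the two coordinates have very different gradient scales.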

## Train Test Procedure:
- train test split: 80% train, 20% test
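- it is important to shuffle the data before splitting, so that any ordering in the data does not make the train and test sets differ systematically
- the held-out test set is what reveals overfitting: it estimates performance on examples the model has not seen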
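
A shuffled 80/20 split can be sketched in NumPy as below. The helper name `train_test_split` mirrors a common convention but is defined locally here, and the toy arrays are assumptions for the example.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the examples, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))             # shuffled row indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (assumption): 10 examples, row i is [2i, 2i + 1], label i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
```

Indexing `X` and `y` with the same shuffled indices keeps every example paired with its label.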


## Model Selection:
- train test split: 60% train, 20% validation, 20% test
- Cross validation is used when the data set is small; it gives a better estimate of model performance than a single validation split
- k-fold cross validation: hold out a test set (e.g. 20%), split the remaining data into k folds (e.g. k = 5), train the model k times, each time validating on a different fold and training on the other k - 1, and average the k validation scores; the selected model is then evaluated once on the test set
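
The fold bookkeeping for k-fold cross validation can be sketched as a small generator. `k_fold_indices` is a hypothetical helper; it only produces index arrays, and the model training and scoring inside each round are left out.

```python
import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)
    for i in range(k):
        val_idx = folds[i]                       # fold i validates
        train_idx = np.concatenate(              # the other k - 1 folds train
            [folds[j] for j in range(k) if j != i]
        )
        yield train_idx, val_idx

splits = list(k_fold_indices(n_examples=10, k=5))
for round_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(round_number, sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```

Every example is used for validation exactly once across the k rounds; the test set should be set aside before calling this.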

## Bias and Variance:
- bias: error due to wrong assumptions
- variance: error due to sensitivity to small fluctuations in the training set
- high bias shows up as a high error on the training set ($J_{train}$)
- high variance shows up as a much higher error on the cross-validation or test set than on the training set
- For Underfitting:
    - high bias
    - low variance
    - J_train is high
    - J_cv is high
- For Overfitting:
    - low bias
    - high variance
    - J_train is low
    - J_cv is high
- For Good Fit:
    - low bias
    - low variance        
    - J_train is low
    - J_cv is low

## Bias and Variance for Regularization:
- Large $\lambda$:
    - high bias
    - low variance
    - J_train is high
    - J_cv is high
- Small $\lambda$:
    - low bias
    - high variance
    - J_train is low
    - J_cv is high        
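
The effect of $\lambda$ on $J_{train}$ and $J_{cv}$ can be explored with a small experiment. Everything specific here is an assumption made for illustration: the synthetic sine data, the degree-8 polynomial features, the list of $\lambda$ values, and the use of the closed-form regularized solution (with $\theta_0$ unpenalized) instead of gradient descent.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form minimizer of the regularized cost (theta_0 unpenalized)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def squared_error_cost(theta, X, y):
    """Unregularized cost, used for both J_train and J_cv."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def poly_features(x, degree=8):
    return np.vander(x, N=degree + 1, increasing=True)  # 1, x, ..., x^degree

# Synthetic data (assumption): noisy samples of sin(3x)
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=30)
x_cv = rng.uniform(-1, 1, size=30)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=30)
y_cv = np.sin(3 * x_cv) + 0.1 * rng.normal(size=30)

lambdas = [1e-4, 1e-2, 1.0, 100.0]
j_train, j_cv = [], []
for lam in lambdas:
    theta = fit_ridge(poly_features(x_train), y_train, lam)
    j_train.append(squared_error_cost(theta, poly_features(x_train), y_train))
    j_cv.append(squared_error_cost(theta, poly_features(x_cv), y_cv))
    print(f"lambda={lam:g}  J_train={j_train[-1]:.4f}  J_cv={j_cv[-1]:.4f}")
```

$J_{train}$ rises as $\lambda$ grows, since the penalty pulls the fit away from the training data. How $J_{cv}$ behaves depends on the data, which is why $\lambda$ is usually chosen by comparing $J_{cv}$ across candidate values.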

## Learning Curves:
- high bias: $J_{train}(\theta)$ is high and $J_{cv}(\theta)$ is high
- high variance: $J_{train}(\theta)$ is low and $J_{cv}(\theta)$ is high
- if there is high bias, adding training examples will not help by itself: $J_{train}$ and $J_{cv}$ both flatten out at a high value
- if there is high variance, adding training examples is likely to help: $J_{cv}$ keeps decreasing toward $J_{train}$

## Debugging a Learning Algorithm:
- Get more training examples -> fixes high variance
- Try smaller sets of features -> fixes high variance
- Try getting additional features -> fixes high bias
- Try adding polynomial features -> fixes high bias
- Try decreasing $\lambda$ -> fixes high bias
- Try increasing $\lambda$ -> fixes high variance

## Neural Network and bias/variance:
- Does it do well on training set?
    - No: high bias, try bigger network, train longer, different NN architecture
    - Yes: Does it do well on Cross Validation set?
        - No: high variance, try more data, regularization, different NN architecture
        - Yes: Done
- More data is not always available and training on it needs more computation, so regularization is often the cheaper way to reduce variance
- Regularised MNIST model:
```python
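import tensorflow as tf

# the argument of l2(...) is the regularization strength (the role played by lambda)
layer_1 = tf.keras.layers.Dense(units=25, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001))
layer_2 = tf.keras.layers.Dense(units=10, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001))
# output layer: 10 units, one per MNIST digit class (softmax over a single unit always outputs 1)
layer_3 = tf.keras.layers.Dense(units=10, activation='softmax', kernel_regularizer=tf.keras.regularizers.l2(0.001))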
model = tf.keras.Sequential([layer_1, layer_2, layer_3])
```