<img src="../holberton_logo.png" alt="logo" width="500"/>

# Regularization

## What is regularization? What is its purpose?

**Regularization is a technique used in machine learning to prevent overfitting**, which occurs when a model is too complex and captures noise in the training data instead of the underlying patterns. 

**The goal of regularization is to add a penalty term to the model's objective function that discourages overly complex models**, forcing the model to focus on the most important features of the data.

<img src="figs/overfitting.png" alt="Overfitting illustration" width="500"/>



<img src="figs/biasvariance.png" alt="Bias-variance tradeoff" width="300"/>



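In practice, the most common form of regularization adds a term to the loss function that penalizes the model for large weights. Other techniques, such as dropout, early stopping, and data augmentation, regularize the training process itself without changing the loss.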
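As a minimal sketch of the idea, the NumPy snippet below adds a weight penalty to a mean squared error loss. The function name `penalized_loss`, the toy data, and the choice of a squared-weight penalty are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def penalized_loss(w, X, y, lam):
    """Mean squared error plus lam times the sum of squared weights."""
    residual = X @ w - y
    data_loss = np.mean(residual ** 2)   # how well the model fits the data
    penalty = np.sum(w ** 2)             # grows with the size of the weights
    return data_loss + lam * penalty

# Toy data (hypothetical): these weights fit it exactly, so the data loss is 0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
w = np.array([2.0, -1.0])

loss_plain = penalized_loss(w, X, y, lam=0.0)      # no regularization
loss_penalized = penalized_loss(w, X, y, lam=0.1)  # same fit, but large weights now cost something
```

With `lam=0.0` the loss depends only on the fit. With `lam > 0`, two models that fit equally well are ranked by the size of their weights, which is what pushes training toward simpler solutions.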

## What are L1 and L2 regularization? What is the difference between the two methods?

`L1` and `L2` regularization are techniques used in machine learning to prevent overfitting by adding a penalty term to the loss function. The penalty term is a function of the model weights, which encourages the weights to be small.

- **`L1` regularization**, also known as Lasso regularization, **adds a penalty term that is proportional to the sum of the absolute values of the model weights**. This has the *effect of shrinking some of the weights to exactly zero, effectively performing feature selection by removing some of the less important features*. L1 regularization can produce sparse models, where many of the weights are zero, making it useful for problems where only a small number of features are relevant.

$$
L1 = \lambda \sum_{i} |w_i|
$$

- where
    - $\lambda$ is the regularization parameter, which controls the strength of the penalty
    - $w_i$ is the $i$-th model weight
    - $\sum_{i} |w_i|$ is the sum of the absolute values of the weights.




- **`L2` regularization**, also known as Ridge regularization, **adds a penalty term that is proportional to the sum of the squared model weights**. This has the *effect of shrinking all of the weights towards zero, but rarely exactly to zero*. L2 regularization can be useful for problems where all of the features are potentially relevant, but some may be more important than others. It can also stabilize the solution when features are collinear, because correlated features tend to share the weight instead of receiving large, offsetting coefficients.

$$
L2 = \lambda \sum_{i} w_i^2
$$

- where
    - $\lambda$ is the regularization parameter, which controls the strength of the penalty
    - $w_i$ is the $i$-th model weight
    - $\sum_{i} w_i^2$ is the sum of the squares of the weights.
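The difference between the two penalties can be sketched in a few lines of NumPy. The helper names below are hypothetical. `l1_shrink` uses soft-thresholding, which is one standard way to apply an L1 penalty in an update step, and `l2_shrink` takes a single gradient step on the L2 penalty alone. Both ignore the data loss so that the effect of the penalty is visible in isolation.

```python
import numpy as np

def l1_penalty(w, lam):
    """lam * sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """lam * sum of squared weights."""
    return lam * np.sum(w ** 2)

def l1_shrink(w, step):
    """Soft-thresholding: move every weight toward zero by `step`, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - step, 0.0)

def l2_shrink(w, lam, lr):
    """One gradient step on the L2 penalty alone (its gradient is 2 * lam * w)."""
    return w - lr * 2.0 * lam * w

w = np.array([0.05, -0.5, 2.0])
lam = 0.1

w_l1 = l1_shrink(w, step=0.1)         # the small weight becomes exactly 0
w_l2 = l2_shrink(w, lam=lam, lr=0.5)  # every weight is scaled by 0.9, none reaches 0
```

The L1 update subtracts a fixed amount from every weight, so small weights hit zero and stay there. The L2 update shrinks each weight in proportion to its size, so large weights are pulled back hardest but small ones never quite vanish.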



## What is dropout?

Dropout is a regularization technique used in neural networks to prevent overfitting. It works by randomly dropping out (i.e., setting to zero) a certain proportion of the neurons in a neural network during each training iteration. This means that each neuron is only active with a certain probability during a given training iteration, and the network must learn to perform well even when some of its neurons are missing.

The effect of dropout is to prevent the network from relying too heavily on any one particular subset of neurons. This can help to prevent overfitting, as the network is forced to learn more robust representations that work well even when some neurons are missing.

<img src="figs/dropout.png" alt="Dropout illustration" width="500"/>
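The mechanism can be sketched in NumPy as follows. This is an illustrative version of "inverted" dropout, the variant that rescales the surviving activations during training so that nothing needs to change at inference time. The function name and its parameters are assumptions made for the example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                  # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    return activations * mask / keep_prob   # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 64))                     # stand-in for a layer's activations

dropped = dropout(a, rate=0.5, rng=rng)                      # values are either 0.0 or 2.0
unchanged = dropout(a, rate=0.5, rng=rng, training=False)    # returned as-is
```

A fresh mask is drawn on every call, so each training step sees a different thinned network, while the rescaling keeps the average activation close to its original value.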


There are different variants of dropout, such as standard dropout, where individual neurons are dropped with a fixed probability, and spatial dropout, where entire feature maps are dropped. Dropout has been shown to be effective in improving the performance of neural networks, especially when the number of parameters is large and the amount of labeled data is limited.



## What is early stopping?

Early stopping is a regularization technique used in machine learning to prevent overfitting of a model. It involves monitoring the performance of the model on a validation set during training and stopping once that performance stops improving or starts to deteriorate, for example when the validation loss begins to rise.

The idea behind early stopping is that if the model is trained for too long, it may start to memorize the training data instead of learning the underlying patterns that generalize to new data. By stopping the training process early, we can prevent the model from overfitting and improve its ability to generalize to new data.

<img src="figs/earlystop.png" alt="Early stopping illustration" width="300"/>


Early stopping is typically implemented by keeping track of the validation loss during training and stopping the training process when the validation loss stops improving for a certain number of epochs. The number of epochs to wait before stopping the training process is called the "patience" parameter, and it is a hyperparameter that needs to be tuned using cross-validation or other techniques.
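The patience logic described above can be sketched in plain Python. The function name and the list of validation losses are hypothetical. In a real training loop the losses would be computed one epoch at a time, not known in advance.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch    # patience exhausted, stop here
    return len(val_losses) - 1, best_epoch  # never triggered: trained to the end

# Made-up validation losses: improvement stops after epoch 2
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here training stops at epoch 4, two epochs after the best validation loss at epoch 2. Because the model at the stopping epoch is slightly worse than the best one, implementations often also keep a copy of the best weights.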


## Example: Regularization Techniques Using Keras 


### Define Model Architecture and Regularization Techniques

We define a **Sequential model** using Keras, which allows us to stack layers sequentially. We add `Dense` layers to the model to create a feedforward neural network. Each `Dense` layer specifies an activation function and can take a weight penalty (**L1** or **L2**) through its `kernel_regularizer` argument, while **Dropout** is added as a separate layer between them.

- **L1 Regularization**: Penalizes the absolute value of weights, promoting sparsity by pushing less important weights to zero, reducing model complexity and overfitting.

- **L2 Regularization**: Penalizes the square of weights, constraining them to smaller values, discouraging any single weight from growing large and improving generalization by reducing overfitting.


- **Dropout**: Randomly sets a fraction of neuron outputs to zero during training, preventing complex co-adaptations of neurons and encouraging robustness by effectively training multiple models within one.

#### Code Example

```python
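from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l1, l2

# Define a simple neural network model for 20 input features
model = Sequential()
model.add(Dense(64, input_dim=20, activation='relu', kernel_regularizer=l2(0.01)))  # L2 penalty on weights
model.add(Dropout(0.5))  # drop half of the previous layer's outputs during training
model.add(Dense(32, activation='relu', kernel_regularizer=l1(0.01)))  # L1 penalty on weights
model.add(Dense(1, activation='sigmoid'))  # single output for binary classification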
```

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

### Define Early Stopping and Train the Model with Early Stopping

We define an **EarlyStopping callback to prevent overfitting**. 

*Early stopping monitors the validation loss during training and stops training when the loss stops decreasing, preventing the model from overfitting to the training data.*

```python
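from tensorflow.keras.callbacks import EarlyStopping
# X_train and y_train are assumed to be NumPy arrays of shape (n_samples, 20) and (n_samples,)
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stopping])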
```


## What is data augmentation?

Data augmentation is a technique used in machine learning to increase the size and diversity of a dataset by generating new examples from the existing data. It involves applying a set of transformations or modifications to the original data to create new data points. These transformations can be simple, such as flipping or rotating an image, or more complex, such as adding noise or changing the color balance.

Data augmentation is particularly useful when the size of the training dataset is limited, as it can help to prevent overfitting and improve the generalization ability of the model. By generating new data points, data augmentation can also help to address class imbalance problems and improve the performance of the model on rare or underrepresented classes.

<img src="figs/dataaug.png" alt="Data augmentation illustration" width="400"/>
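A few of these transformations can be sketched in NumPy as shown below. The `augment` function, the noise level, and the random stand-in image are assumptions made for illustration. Which transformations preserve the label depends on the task: for example, a rotated digit may no longer be the same digit.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly flipped, rotated, and noised copy of an HxWxC image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                         # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))   # rotate by 0, 90, 180, or 270 degrees
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)                    # keep pixel values in range

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))                      # stand-in for a real training image

# Eight different training examples generated from a single original
batch = np.stack([augment(image, rng) for _ in range(8)])
```

Each call draws new random choices, so the model rarely sees exactly the same input twice even though the underlying dataset has not grown.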



## What are the pros and cons of the above regularization methods?

- **L1 regularization**:
    - Pros:
        - Can lead to sparse feature selection, which can improve model interpretability and reduce overfitting by removing irrelevant features.
        - Works well when there are only a few important features.
    - Cons:
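        - Can be unstable with correlated features: it tends to keep one feature from a correlated group and zero out the others, somewhat arbitrarily.
        - The penalty is not differentiable at zero, so it can be harder to optimize than L2 and may need specialized solvers.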
        
        
- **L2 regularization**:
    - Pros:
        - Encourages smaller weights, which can help prevent overfitting.
        - Computationally efficient to optimize.
    - Cons:
        - Does not lead to sparse feature selection.
        - May not perform well when there are only a few important features.
    
    
- **Dropout**:
    - Pros:
        - Simple to implement and computationally efficient.
        - Can help prevent overfitting by reducing the reliance on individual neurons in the network.
        - Can improve model robustness and generalization.
    - Cons:
        - Can increase the training time required for convergence.
        - Can reduce the representational capacity of the network if the dropout rate is too high.
        
        
- **Early stopping**:
    - Pros:
        - Simple to implement and computationally efficient.
        - Can prevent overfitting and improve model generalization by stopping the training process before the model starts to overfit the training data.
        - Can lead to faster training times.
    - Cons:
        - Requires the use of a validation set to determine when to stop training.
        - May not always result in the best performance, as the optimal stopping point can depend on the specific dataset and model architecture.





### Happy coding