
# Exercise week 47
**November 18-22, 2024**

Date: **Deadline is Friday November 22 at midnight**

# Overarching aims of the exercises this week

The exercise set this week is meant as a summary of many of the
central elements in various machine learning algorithms, with a slight
bias towards deep learning methods and their training. You don't need to answer all questions.

The last weekly exercise (week 48) is a general course survey.

## Exercise 1: Linear and logistic regression methods

1. What is the main difference between ordinary least squares and Ridge regression?

2. Which kind of data set would you use logistic regression for?

3. In linear regression you assume that your output is described by a continuous non-stochastic function $f(x)$. Which is the equivalent function in logistic regression?

4. Can you find an analytic solution to a logistic regression type of problem?

5. What kind of cost function would you use in logistic regression?

**Answers**
- The main difference lies in **regularization**. Ridge regression adds a penalty term to the loss function that discourages large coefficients:

  $$
  \text{Ridge Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
  $$

  where $\lambda$ controls the strength of regularization. Ordinary least squares (OLS) does not include this penalty term and only minimizes the residual sum of squares.

- Logistic regression is used for datasets with a **binary classification problem**, where the target variable $y$ has two classes (e.g., $y \in \{0, 1\}$).

- In logistic regression, the equivalent function is the **logistic (sigmoid) function**:

  $$
  f(x) = \sigma(z) = \frac{1}{1 + e^{-z}}
  $$

  where $z = \mathbf{w}^T \mathbf{x} + b$.

- No, there is no **closed-form analytic solution** for logistic regression: setting the gradient of the cost function to zero gives equations that are non-linear in the parameters $\mathbf{w}$ and $b$. Optimization is therefore performed with iterative methods like **gradient descent**.

- The cost function used in logistic regression is the **log-loss** (cross-entropy loss):

  $$
  J(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
  $$

  where $\hat{y}_i$ is the predicted probability.
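
The difference between OLS and Ridge described above can be made concrete with the closed-form solutions. The following is a minimal sketch, assuming a design matrix without intercept and synthetic data; the helper `fit_linear`, the seed and the value `lam=10.0` are illustrative choices, not part of the exercise.

```python
import numpy as np

def fit_linear(X, y, lam=0.0):
    """Closed-form least squares: lam=0 gives OLS, lam>0 gives Ridge."""
    p = X.shape[1]
    # Solve (X^T X + lam * I) beta = X^T y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (assumption): 50 samples, 3 features, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

beta_ols = fit_linear(X, y)               # no penalty term
beta_ridge = fit_linear(X, y, lam=10.0)   # penalty shrinks the coefficients

print("OLS norm:  ", np.linalg.norm(beta_ols))
print("Ridge norm:", np.linalg.norm(beta_ridge))
```

The Ridge coefficient vector has a smaller norm than the OLS one, which is the shrinkage effect controlled by $\lambda$.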

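Since logistic regression has no closed-form solution, the parameters are found iteratively. The sketch below uses plain gradient descent on the cross-entropy cost with the sigmoid as model function; the synthetic data, the learning rate `eta = 0.1` and the 200 iterations are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    # Clip probabilities to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary data (assumption): label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.1          # learning rate (illustrative value)
losses = []
for _ in range(200):
    p = sigmoid(X @ w + b)            # predicted probabilities
    losses.append(cross_entropy(y, p))
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cost w.r.t. w
    grad_b = np.mean(p - y)           # gradient of the cost w.r.t. b
    w -= eta * grad_w
    b -= eta * grad_b

print("first loss:", losses[0], "last loss:", losses[-1])
```

With all parameters initialized to zero, every predicted probability is 0.5, so the first recorded loss equals $\log 2$; the loss then decreases as the decision boundary is learned.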

## Exercise 2: Deep learning

1. What is an activation function, and what is it used for? Explain three different types of activation functions.

2. Describe the architecture of a typical feed forward Neural Network (NN).

3. You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test set is poor. What can you do to reduce overfitting?

4. How would you know if your model is suffering from the problem of exploding gradients?

5. Can you name and explain a few hyperparameters used for training a neural network?

6. Describe the architecture of a typical Convolutional Neural Network (CNN).

7. What is the vanishing gradient problem in Neural Networks and how can it be fixed?

8. When training an artificial neural network, what could be the reasons why the cost/loss fails to decrease over a few epochs?

9. How does L1/L2 regularization affect a neural network?

10. What are the advantages of deep learning over traditional methods like linear regression or logistic regression?

**Answers**

- An **activation function** introduces non-linearity into a neural network, enabling it to learn complex patterns in the data. Without activation functions, the model would act as a linear mapping, regardless of depth.

  Three common activation functions are:
  - **ReLU (Rectified Linear Unit)**:
  $$
  f(x) = \max(0, x)
  $$
  Pros: Computationally efficient and mitigates the vanishing gradient problem.  
  Cons: Suffers from the "dying ReLU" problem (neurons output zero permanently).

  - **Sigmoid**:
  $$
  f(x) = \frac{1}{1 + e^{-x}}
  $$
  Pros: Outputs values in the range (0, 1) that can be read as probabilities, suitable for binary classification.  
  Cons: Vanishing gradients for inputs of large magnitude, where the function saturates.

  - **Tanh**:
  $$
  f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$
  Pros: Outputs in the range (-1, 1) and zero-centered, unlike Sigmoid.  
  Cons: Also prone to vanishing gradients.

- A **feedforward neural network (NN)** has:
  - **Input Layer**: Receives input features.
  - **Hidden Layers**: Composed of neurons performing weighted sums followed by activation functions.
  - **Output Layer**: Produces predictions, e.g., Softmax for classification or linear for regression.

  Information flows forward from input to output; the weights are updated during training using gradients computed by backpropagation.

- To reduce overfitting in deep neural networks:
  - **Regularization**: Apply L1 or L2 penalties to constrain large weights.
  - **Dropout**: Randomly deactivate neurons during training.
  - **Early Stopping**: Halt training when validation loss stops improving.
  - **Data Augmentation**: Create additional training samples using transformations.
  - **Increase Dataset Size**: Collect more samples.
  - **Simplify Model**: Reduce the number of layers or neurons.

- A model suffers from **exploding gradients** if:
  - Gradients have excessively high magnitudes.
  - Loss fluctuates wildly or becomes `NaN`.
  - Weight values grow uncontrollably.

- Common hyperparameters in neural networks:
  - **Learning Rate**: Controls the step size for weight updates.
  - **Batch Size**: Determines the number of samples used per training step.
  - **Number of Epochs**: Total passes over the training dataset.
  - **Dropout Rate**: Fraction of neurons to deactivate during training.
  - **Optimizer**: Algorithm for weight updates, e.g., SGD or Adam.

- A **Convolutional Neural Network (CNN)** consists of:
  - **Convolutional Layers**: Extract features using filters.
  - **Activation Functions**: Typically ReLU.
  - **Pooling Layers**: Downsample dimensions (e.g., MaxPooling).
  - **Fully Connected Layers**: Map features to output space.
  - **Output Layer**: Produces predictions, such as probabilities (Softmax).

- The **vanishing gradient problem** occurs when gradients become too small during backpropagation, leading to negligible updates in earlier layers.

  Fixes include:
  - Using ReLU activation.
  - Proper weight initialization.
  - Employing architectures with skip connections.

- Reasons for the cost/loss not decreasing in a few epochs:
  - **Learning Rate Too High/Low**: Causes poor convergence.
  - **Improper Weight Initialization**: Leads to unstable gradients.
  - **Insufficient Model Capacity**: Model unable to fit data complexity.
  - **Data Issues**: Poor preprocessing or mislabeled data.

- **L1/L2 regularization** affects networks by:
  - **L1 Regularization**: Encourages sparsity by driving some weights to zero.
  - **L2 Regularization**: Reduces large weights to improve generalization.

- Advantages of **deep learning** over traditional methods:
  - **Feature Learning**: Automatically extracts hierarchical features.
  - **Flexibility**: Handles complex, non-linear relationships.
  - **Scalability**: Performs well with large-scale data.
  - **State-of-the-Art Results**: Excels in domains like computer vision and NLP.
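
The three activation functions and the vanishing gradient problem discussed above can be illustrated with a few lines of code. This is a deliberately simplified sketch: the helper `activation_gradient_factor` is hypothetical and only multiplies activation derivatives across layers, ignoring the weight matrices that also enter the backpropagated gradient.

```python
import math

# The three activation functions from the answer above.
def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

# Their derivatives, which appear as factors in backpropagation.
def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def activation_gradient_factor(derivative, x, depth):
    """Product of `depth` identical activation derivatives (weights ignored)."""
    return derivative(x) ** depth

# The sigmoid derivative is at most 0.25, so ten layers shrink the factor a lot.
print(activation_gradient_factor(d_sigmoid, 0.0, 10))
# For a positive pre-activation, the ReLU derivative is 1 at every layer.
print(activation_gradient_factor(d_relu, 1.0, 10))
```

Even at its maximum (at $x = 0$), the sigmoid derivative is only 0.25, so the factor decays geometrically with depth, whereas ReLU passes the gradient unchanged for active neurons.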

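Early stopping, listed among the remedies for overfitting, can be sketched independently of any deep learning library. The function name `early_stopping_epoch`, the `patience` and `min_delta` parameters and the validation losses below are hypothetical and chosen only to show the logic.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran through all epochs.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice one would keep a copy of the weights from `best_epoch` and restore them after stopping.
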
## Exercise 3: Decision trees and ensemble methods

1. Mention some pros and cons when using decision trees.

2. How do we grow a tree? And which are the main parameters?

3. Mention some of the benefits of using ensemble methods (like bagging, random forests and boosting methods).

4. Why would you prefer a random forest instead of using Bagging to grow a forest?

5. What is the basic philosophy behind boosting methods?

## Exercise 4: Optimization part

1. Which basic mathematical root-finding method lies behind essentially all gradient descent approaches (stochastic and non-stochastic)?

2. And why don't we use it? Or stated differently, why do we introduce the learning rate as a parameter?

3. What might happen if you set the momentum hyperparameter too close to 1 (e.g., 0.9999) when using a momentum-based optimizer?

4. Why should we use stochastic gradient descent instead of plain gradient descent?

5. Which parameters would you need to tune when using a stochastic gradient descent approach?

**Answers to Exercise 3**

- **Pros and cons of decision trees**:
  - **Pros**:
    - Easy to understand and interpret.
    - Handles both numerical and categorical data.
    - Non-parametric, requiring no assumptions about data distribution.
  - **Cons**:
    - Prone to overfitting, especially with deep trees.
    - Sensitive to small changes in the data (high variance).
    - Can struggle with imbalanced datasets.

- **How do we grow a tree, and what are the main parameters?**:
  - A tree is grown by recursively splitting the dataset based on the feature that maximizes a splitting criterion (e.g., **Gini impurity**, **entropy**, or **variance reduction**).
  - Main parameters include:
    - **Max Depth**: Limits the depth of the tree to control overfitting.
    - **Min Samples Split**: Minimum samples required to split a node.
    - **Min Samples Leaf**: Minimum samples required in a leaf node.
    - **Max Features**: Number of features considered for the best split.

- **Benefits of using ensemble methods**:
  - **Bagging** (e.g., Random Forests):
    - Reduces variance by averaging predictions from multiple models.
    - Improves stability and robustness.
  - **Boosting** (e.g., Gradient Boosting, AdaBoost):
    - Focuses on reducing bias by combining weak learners iteratively.
    - Can achieve high accuracy with complex datasets.
  - Both methods help mitigate overfitting compared to individual models.

- **Why prefer Random Forests over Bagging?**:
  - Random Forests introduce **feature randomness** by selecting a random subset of features for each split. This reduces correlation between individual trees, leading to improved generalization compared to plain Bagging, which considers all features at every split.

- **Basic philosophy behind boosting methods**:
  - Boosting focuses on **sequentially building models** where each subsequent model corrects the errors of its predecessors. It assigns higher weights to misclassified samples, ensuring the next model pays more attention to them. The final prediction is an aggregated result of all the models, often weighted based on their performance.
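
The tree-growing step described above, choosing the split that optimizes an impurity criterion, can be sketched for a single numeric feature. This is a minimal illustration using the Gini impurity; the helper names `gini` and `best_split` and the toy data are assumptions, and a real implementation would loop over all features and recurse on the resulting subsets.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Threshold on one feature that minimizes the weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if impurity < best_impurity:
            best_threshold, best_impurity = t, impurity
    return best_threshold, best_impurity

# Toy data (assumption): the two classes are separated around x = 4.5.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

Parameters such as the maximum depth or the minimum number of samples per leaf would enter as stopping conditions in the recursion that calls `best_split`.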

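The re-weighting of misclassified samples mentioned for boosting is specific to AdaBoost; gradient boosting instead fits each new model to the residual errors. Below is a sketch of a single AdaBoost round for labels in $\{-1, +1\}$. The function `adaboost_reweight` and the toy predictions are hypothetical, and the weak learner itself is not shown.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (new_weights, alpha).

    Assumes labels in {-1, +1}, weights summing to 1, and a weighted
    error strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Weight of this learner in the final vote.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase weights of misclassified samples, decrease the others.
    new = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new], alpha

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]   # only the last sample is misclassified
weights = [0.2] * 5             # uniform initial weights

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print(new_weights, alpha)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.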

## Exercise 5: Analysis of results

1. How do you assess overfitting and underfitting?

2. Why do we divide the data into training, test and possibly validation sets?

3. Why would you use resampling methods in the data analysis? Mention some widely used resampling methods.

**Answers**

- **How do you assess overfitting and underfitting?**:
  - **Overfitting**: The model performs well on the training set but poorly on the test/validation set. Indicators:
    - Low training error but high test error.
    - Significant difference between training and test performance metrics.
  - **Underfitting**: The model performs poorly on both the training and test sets. Indicators:
    - High error on the training set.
    - Inability to capture the complexity of the data.

- **Why do we divide the data into test, train, and/or validation sets?**:
  - **Training Set**: Used to train the model.
  - **Validation Set**: Used to tune hyperparameters and select the best model configuration without biasing the final evaluation.
  - **Test Set**: Used to evaluate the model's performance on unseen data, providing an unbiased estimate of its generalization capability.
  - Dividing data ensures that the model is not evaluated on the same data it was trained on, which helps detect overfitting and underfitting.

- **Why use resampling methods in data analysis? Mention some popular methods**:
  - Resampling methods are used to:
    - Estimate the stability and reliability of model performance.
    - Make better use of limited data by creating multiple training/testing splits.
    - Provide robust evaluation metrics.
  - Popular resampling methods:
    - **Cross-Validation**: Divides data into $k$ folds, using $k-1$ folds for training and the remaining fold for testing, rotating until each fold has served once as the test set.
    - **Bootstrapping**: Samples data with replacement to create multiple datasets for training/testing, useful for estimating uncertainty.
    - **Leave-One-Out Cross-Validation (LOOCV)**: Uses all data points except one for training and the remaining point for testing, repeated for all points.
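
The $k$-fold splitting described above can be sketched with index lists alone. The helper `k_fold_indices` is a hypothetical name; this version does not shuffle the data, which one would normally do first when the samples are ordered.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits

splits = k_fold_indices(10, 3)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out cross-validation is the special case k = n_samples.
loocv_splits = k_fold_indices(5, 5)
```

Each sample appears in exactly one test fold, so averaging the $k$ test scores uses every data point for evaluation once.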

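Bootstrapping as an uncertainty estimate can be illustrated on a simple statistic. The sketch below estimates the standard error of the sample mean by resampling with replacement; the data values, the seed and the number of resamples are made up for illustration, and `bootstrap_means` is a hypothetical helper.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of `n_resamples` bootstrap samples drawn with replacement."""
    rng = random.Random(seed)   # local generator for reproducibility
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n))
            for _ in range(n_resamples)]

# Made-up measurements.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2]

means = bootstrap_means(data)
std_error = statistics.stdev(means)   # spread of the bootstrap means

print("sample mean:", statistics.fmean(data))
print("bootstrap standard error:", std_error)
```

The same pattern applies to model evaluation: replace the mean by a fit-and-score step on each resampled dataset.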
