### 1. Which Linear Regression training algorithm can you use if you have a training set with millions of features?

For a training set with **millions of features**, use **Stochastic Gradient Descent (SGD)** or **Mini-batch Gradient Descent**. **Batch Gradient Descent** also works if the training set fits in memory, because Gradient Descent scales well with the number of features. The poor choice here is the **Normal Equation** (or the SVD-based approach): it has to invert or decompose a matrix whose size grows with the number of features, and that cost rises quickly (roughly O(n^2.4) to O(n^3) for the inversion, where n is the number of features).

### 2. Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?

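Models trained with **Gradient Descent** (such as Linear Regression, Logistic Regression, and Neural Networks) suffer when features have very different scales: the cost function takes the shape of an elongated bowl, so the algorithm **converges much more slowly**. The Normal Equation and the SVD approach are not affected. Regularized models can also end up with a suboptimal solution, because regularization penalizes large weights and features with small values (which need large weights) tend to be ignored.

Distance-based algorithms such as **SVMs** and **KNN** are affected too, because features with larger ranges dominate the distance computation.

**Solution**: scale the features before training, either by standardizing them (e.g., using StandardScaler in Scikit-learn) or by normalizing them to a fixed range.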
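As a minimal sketch of what standardization does, the snippet below rescales each column to zero mean and unit variance with plain NumPy, which is the same transformation StandardScaler applies by default. The feature values are made up for illustration.

```python
import numpy as np

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 20000.0],
              [2.0, 35000.0],
              [3.0, 50000.0],
              [4.0, 65000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```

In practice the mean and standard deviation should be computed on the training set only and then reused to transform the validation and test sets.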

### 3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

No, **Gradient Descent** cannot get stuck in a local minimum for **Logistic Regression**, as its cost function (the log-loss function) is convex. Convex functions have a single global minimum, so Gradient Descent will always converge to this global minimum (assuming the learning rate is properly set).

### 4. Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?

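Not exactly. If the optimization problem is convex (as for Linear Regression or Logistic Regression) and the learning rate is not too high, **all Gradient Descent algorithms** (Batch, Stochastic, Mini-batch) approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic and Mini-batch GD never truly converge: they keep jumping around the optimum. So even after a very long run, the resulting models will differ slightly.
- **Batch GD** takes smooth, stable steps, but each step uses the full training set and is computationally expensive.
- **Stochastic and Mini-batch GD** take cheap, noisy steps, and at a constant learning rate they keep bouncing around the minimum instead of settling on it.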

### 5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

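If the **validation error** goes up consistently at every epoch, there are two likely causes, and the training error tells them apart:
- If the **training error also goes up**, the learning rate is too high and the algorithm is diverging.
- If the **training error keeps going down**, the model is **overfitting** the training data.

**How to fix it**:
- Diverging: **reduce the learning rate**.
- Overfitting: use **early stopping** (stop training once the validation error starts rising), add **Ridge** (L2) or **Lasso** (L1) **regularization** to penalize large weights, or reduce the model complexity or the number of features.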

### 6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

No, because **Mini-batch Gradient Descent** uses small random batches, so the error fluctuates from one step to the next and is not guaranteed to improve at every iteration. The validation error might increase temporarily and then keep falling. **Early stopping** is still a good idea, but don't stop on the first increase in validation error: monitor the trend over several epochs. A practical option is to save the model at regular intervals and, once the validation error has not improved for a while, roll back to the best saved model.
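The idea of waiting for a consistent trend can be sketched as a patience counter. The helper below is hypothetical (its name and the `patience` parameter are not from any library), and the validation errors are made up for illustration.

```python
def best_epoch_with_patience(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation errors.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; best_epoch is the one to roll back to.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Made-up noisy validation curve: a temporary bump at epoch 2, real minimum at epoch 4
val_errors = [0.90, 0.70, 0.75, 0.60, 0.55, 0.58, 0.57, 0.59, 0.61]

stop_epoch, best_epoch = best_epoch_with_patience(val_errors, patience=3)
print(stop_epoch, best_epoch)  # 7 4

# Stopping on the very first increase would keep the worse model from epoch 1
print(best_epoch_with_patience(val_errors, patience=1))  # (2, 1)
```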

### 7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

- **Stochastic Gradient Descent (SGD)** reaches the vicinity of the optimal solution the fastest because it updates the weights after every single training instance (Mini-batch GD with a very small batch size behaves similarly).
- **Batch Gradient Descent** is the only one that will actually converge, given enough training time, because every update uses the whole training set and therefore follows the exact gradient.

To make SGD or Mini-batch GD converge as well:
- Gradually **reduce the learning rate** (a learning schedule), so the steps shrink and the algorithm settles at the minimum instead of bouncing around it.
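A learning schedule can be sketched as follows for SGD on a Linear Regression problem. The data is synthetic, and the schedule `t0 / (t + t1)` with its constants is just one simple choice assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic linear data: y = 4 + 3x + Gaussian noise (made up for illustration)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column

def learning_schedule(t, t0=5.0, t1=50.0):
    # Hypothetical schedule: the step size shrinks as the step count t grows
    return t0 / (t + t1)

theta = rng.normal(size=2)
n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                      # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradient = 2 * xi * (xi.dot(theta) - yi)   # MSE gradient on that instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradient

print(theta)  # should land close to the generating parameters [4, 3]
```

With a constant learning rate the parameters would keep jittering around the minimum; the shrinking step size is what lets them settle.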

### 8. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

A large gap, with the validation error well above the training error, indicates **overfitting**: the model performs well on the training set but does not generalize to the validation set.

Three ways to solve this:
1. **Decrease the model complexity**: Use a lower-degree polynomial.
2. **Regularization**: Apply **Ridge** or **Lasso** regression to penalize large coefficients.
3. **More training data**: Increasing the dataset size can help the model generalize better.

### 9. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

If both the **training error** and **validation error** are high and almost equal, the model is likely suffering from **high bias** (underfitting).

**Solution**: You should **reduce the regularization hyperparameter (α)**. A high α increases bias, so reducing it will allow the model to fit the data better.
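The effect of α on bias can be illustrated with a small sketch that fits Ridge Regression in closed form for several values of α. The data is synthetic and the helper names `ridge_fit` and `mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = 1 + 2*x1 - 3*x2 + noise
m = 100
X = rng.normal(size=(m, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias column

def ridge_fit(X_b, y, alpha):
    # Closed-form Ridge solution; the bias term (first column) is not regularized
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

def mse(X_b, y, theta):
    return np.mean((X_b @ theta - y) ** 2)

alphas = [0.0, 10.0, 100.0, 1000.0]
train_errors = [mse(X_b, y, ridge_fit(X_b, y, alpha)) for alpha in alphas]
for alpha, err in zip(alphas, train_errors):
    print(alpha, err)  # the training error grows as alpha grows
```

The training error rises with α because the weights are shrunk further and further from the values that fit the data, which is exactly the high-bias regime described above.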

### 10. Why would you want to use:
a. **Ridge Regression** instead of plain Linear Regression?
   - Ridge Regression helps avoid **overfitting** by penalizing large coefficients, making the model more robust to noise and preventing it from fitting irrelevant features too closely.

b. **Lasso instead of Ridge Regression**?
   - **Lasso** can **perform feature selection** by shrinking some coefficients to zero, which is useful when you want a simpler model that ignores irrelevant features.

c. **Elastic Net instead of Lasso**?
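   - **Elastic Net** combines the Ridge and Lasso penalties. It is generally preferred over Lasso when several features are strongly correlated, or when there are more features than training instances, because Lasso can behave erratically in those cases (for example, arbitrarily keeping one of a group of correlated features and dropping the rest). The trade-off is one extra hyperparameter to tune: the mix ratio between the two penalties.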
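For reference, the three cost functions can be written side by side. This is one common parameterization, not the only one; the symbol $r$ for the mix ratio is an assumption here, and the sums start at $i = 1$ so the bias term $\theta_0$ is not penalized.

```latex
\begin{aligned}
\text{Ridge:} \quad & J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \, \tfrac{1}{2} \sum_{i=1}^{n} \theta_i^2 \\
\text{Lasso:} \quad & J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} |\theta_i| \\
\text{Elastic Net:} \quad & J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + r \, \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2} \, \alpha \sum_{i=1}^{n} \theta_i^2
\end{aligned}
```

With $r = 0$ the Elastic Net cost reduces to Ridge, and with $r = 1$ it reduces to Lasso.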

### 11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

You should implement **two Logistic Regression classifiers** because you have **two separate binary classification tasks**:
1. Outdoor vs. Indoor
2. Daytime vs. Nighttime

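Softmax Regression is suited to **multiclass classification**, where each instance belongs to exactly one of several mutually exclusive classes. Here the two labels are independent (a picture can be both outdoor and daytime), so a single Softmax classifier over outdoor/indoor/daytime/nighttime would not fit the problem.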
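The difference shows up directly in the outputs. In the sketch below the two scores are made up, and `sigmoid` and `softmax` are small helper functions defined here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exps / exps.sum()

# Hypothetical scores for one picture (made up for illustration)
outdoor_score, daytime_score = 2.0, 1.5

# Two independent Logistic Regression outputs: each is its own probability
p_outdoor = sigmoid(outdoor_score)
p_daytime = sigmoid(daytime_score)
print(p_outdoor, p_daytime)  # both can be high at the same time

# A Softmax output over [outdoor, daytime] forces the two to compete
p_softmax = softmax(np.array([outdoor_score, daytime_score]))
print(p_softmax, p_softmax.sum())  # the probabilities sum to 1
```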

### 12. Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).

In [2]:
# Load iris dataset

from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

In [3]:
# Add bias term to X

X_with_bias = np.c_[np.ones([len(X), 1]), X]

In [4]:
# random seed

np.random.seed(2042)

In [5]:
# split the datasets

test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)
validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

In [6]:
# one hot encoding

def to_one_hot(y):
    n_classes = y.max() + 1
    m = len(y)
    Y_one_hot = np.zeros((m, n_classes))
    Y_one_hot[np.arange(m), y] = 1
    return Y_one_hot

Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

In [7]:
# softmax function

def softmax(logits):
    exps = np.exp(logits)
    exp_sums = np.sum(exps, axis=1, keepdims=True)
    return exps / exp_sums


In [8]:
# training

n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y_train))

eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

Theta = np.random.randn(n_inputs, n_outputs)

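for iteration in range(n_iterations):
    # forward pass: class scores, then class probabilities
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    # cross-entropy loss (epsilon avoids taking log(0))
    loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
    error = Y_proba - Y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, loss)
    # gradient of the cross-entropy loss, followed by one gradient descent step
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients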

0 5.446205811872683
500 0.8350062641405651
1000 0.6878801447192402
1500 0.6012379137693313
2000 0.5444496861981872
2500 0.5038530181431525
3000 0.4729228972192248
3500 0.44824244188957774
4000 0.4278651093928793
4500 0.4106007142918712
5000 0.3956780375390374


In [9]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)
print(accuracy_score)

0.9666666666666667


In [10]:
# add l2 regularization

eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

alpha = 0.1

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
    # l2 penalty on all weights except the bias row (Theta[0])
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    error = Y_proba - Y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, loss)
    # cross-entropy gradient plus the penalty gradient (a row of zeros for the bias)
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

0 6.629842469083912
500 0.5339667976629506
1000 0.503640075014894
1500 0.4946891059460321
2000 0.4912968418075477
2500 0.48989924700933296
3000 0.4892990598451198
3500 0.489035124439786
4000 0.4889173621830817
4500 0.4888643337449302
5000 0.4888403120738818


In [11]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)
print(accuracy_score)

1.0
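The training loops above run for a fixed number of iterations. One way to add the early stopping that the exercise asks for is sketched below: measure the validation loss after every step, keep the best parameters seen so far, and stop as soon as the validation loss goes up. To stay self-contained, this sketch uses synthetic three-class data in place of the iris split (an assumption), and the helper `regularized_loss` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2042)

# Synthetic 3-class data standing in for the iris split above (an assumption)
centers = np.array([[1.5, 0.3], [4.3, 1.3], [5.5, 2.0]])
X = np.vstack([c + 0.4 * rng.normal(size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)
X_b = np.c_[np.ones((len(X), 1)), X]  # add the bias column

idx = rng.permutation(len(X_b))
X_train, y_train = X_b[idx[:100]], y[idx[:100]]
X_valid, y_valid = X_b[idx[100:]], y[idx[100:]]

def to_one_hot(y, n_classes):
    return np.eye(n_classes)[y]

def softmax(logits):
    # subtract the row-wise max for numerical stability
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def regularized_loss(X, Y_one_hot, Theta, alpha, epsilon=1e-7):
    # cross-entropy plus an l2 penalty that skips the bias row
    Y_proba = softmax(X.dot(Theta))
    xentropy = -np.mean(np.sum(Y_one_hot * np.log(Y_proba + epsilon), axis=1))
    return xentropy + alpha * 0.5 * np.sum(np.square(Theta[1:]))

n_classes = 3
Y_train_one_hot = to_one_hot(y_train, n_classes)
Y_valid_one_hot = to_one_hot(y_valid, n_classes)

eta, alpha, n_iterations = 0.1, 0.1, 5001
m = len(X_train)
Theta = rng.normal(size=(X_train.shape[1], n_classes))
best_loss, best_Theta, best_iteration = np.inf, Theta.copy(), 0

for iteration in range(n_iterations):
    # one Batch Gradient Descent step on the regularized training loss
    Y_proba = softmax(X_train.dot(Theta))
    error = Y_proba - Y_train_one_hot
    gradients = 1 / m * X_train.T.dot(error)
    gradients[1:] += alpha * Theta[1:]  # no penalty on the bias row
    Theta = Theta - eta * gradients

    # early stopping: keep the best parameters, stop when validation loss rises
    valid_loss = regularized_loss(X_valid, Y_valid_one_hot, Theta, alpha)
    if valid_loss < best_loss:
        best_loss, best_Theta, best_iteration = valid_loss, Theta.copy(), iteration
    else:
        print(iteration, valid_loss, "early stopping!")
        break

y_predict = np.argmax(softmax(X_valid.dot(best_Theta)), axis=1)
accuracy = np.mean(y_predict == y_valid)
print(best_iteration, best_loss, accuracy)
```

Stopping on the first increase is reasonable here because Batch Gradient Descent follows the exact gradient, so its loss curve is smooth. With Mini-batch or Stochastic Gradient Descent, a patience window would be the safer choice.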
