# Exercises

## Exercise 1

Which Linear Regression algorithm can you use if you have a training set with millions of features?

---

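Any of the gradient descent algorithms (batch, stochastic, mini-batch) will be relatively fast, because their cost grows roughly linearly with the number of features. Stochastic and mini-batch gradient descent are the safest choices; batch gradient descent also works if the training set fits in memory. The Normal Equation and the SVD approach are not practical here, since their computational complexity grows at least quadratically with the number of features.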

## Exercise 2

Suppose the features in your training set have very different scales. Which algorithms suffer from this, and how? What can you do about this?

---

The gradient descent algorithms suffer from this. Essentially, the learning rate is the same in all directions, so if the features have very different scales then it can't be set appropriately for all of them. It will either quickly converge on one axis and then crawl along a valley, or will jump around wildly on one axis.

This can easily be resolved by feature scaling before training.
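
As a minimal sketch of what feature scaling means here, the snippet below standardises each feature to zero mean and unit variance with NumPy. The data values are made up for illustration. In practice the mean and standard deviation should be computed on the training set only and then reused for the validation and test sets; Scikit-Learn's `StandardScaler` packages the same idea.

```python
import numpy as np

# Toy data: two features on very different scales (values are illustrative only).
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0],
              [4.0, 4000.0]])

# Standardise each column: subtract its mean and divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

After scaling, a single learning rate is a reasonable step size along every axis.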

*Also, regularised models may converge to suboptimal weights if the features are not scaled: since regularisation penalises large weights, features with smaller values (which need larger weights to have the same effect) will tend to be ignored.*

## Exercise 3

Can Gradient Descent algorithms get stuck in a local minimum when training a Logistic Regression model?

---

No, the Logistic Regression cost function is convex.

## Exercise 4

Do all Gradient Descent algorithms lead to the same model, provided you let them run long enough?

---

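Approximately. If the optimisation problem is convex (as for Linear or Logistic Regression) and the learning rates are not too high, all of the algorithms will approach the global optimum and produce very similar models. However, unless the learning rate is gradually reduced, stochastic and mini-batch gradient descent never truly converge: they keep jumping around the optimum, so the final models will differ slightly.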

## Exercise 5

Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

---

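If the validation error is consistently going up, one possibility is that the learning rate is too high and the algorithm is diverging - with each step you are 'jumping over' the minimum instead of getting closer. If the training error is going up as well, this is clearly the problem, and you can fix it by decreasing the learning rate. If the training error is not going up, the model is overfitting the training set and you should stop training.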
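
The effect of a learning rate that is too high can be seen on a toy problem. The sketch below (an illustration, not part of the exercise) minimises $f(x) = x^2$ with plain gradient descent; the function `gradient_descent` and its parameters are hypothetical names chosen for this example.

```python
def gradient_descent(eta, x0=1.0, n_steps=20):
    """Minimise f(x) = x**2 with a fixed learning rate and return the visited points."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        grad = 2 * x          # derivative of x**2
        x = x - eta * grad    # standard gradient descent update
        xs.append(x)
    return xs

small = gradient_descent(eta=0.1)   # each step multiplies x by 0.8
large = gradient_descent(eta=1.1)   # each step multiplies x by -1.2

print(abs(small[-1]))  # shrinks towards 0
print(abs(large[-1]))  # grows: the iterates overshoot further on every step
```

With `eta=1.1` the iterate flips sign and moves further from the minimum each time, which is the 'jumping over' behaviour described above.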

## Exercise 6

Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

---

No. Mini-batch is naturally stochastic - it may go in the wrong direction for a step or two but later improve. A better option is to save the model at regular intervals and, once the validation error has not improved for a long time, revert to the best saved model.
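
One common way to avoid stopping on a temporary bump is to wait for a number of epochs without improvement before giving up. The sketch below assumes a hypothetical helper, `best_epoch_with_patience`, and a made-up list of validation errors; it only tracks the best epoch, whereas real training code would also save a snapshot of the model at that point.

```python
def best_epoch_with_patience(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping once `patience` epochs pass without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error   # in practice, save a model snapshot here
        elif epoch - best_epoch >= patience:
            break                                   # no improvement for `patience` epochs
    return best_epoch, best_error

# Made-up validation errors: a brief bump at epoch 2, then a real minimum at epoch 4.
errors = [0.90, 0.70, 0.75, 0.60, 0.55, 0.58, 0.61, 0.65, 0.70]

print(best_epoch_with_patience(errors, patience=3))  # (4, 0.55)
print(best_epoch_with_patience(errors, patience=1))  # (1, 0.7): stopped at the first bump
```

Stopping at the first increase (`patience=1`) settles for the model from epoch 1, while a little patience finds the better model at epoch 4.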

## Exercise 7

Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution fastest? Which will actually converge? How can you make the others converge as well?

---

Batch will reach the vicinity of the optimal solution in the fewest *steps*, but each step will take longer to compute because it uses the whole training set. After computation time is taken into account, I expect that mini-batch would be fastest. *(It would be stochastic gradient descent, or mini-batch with a very small mini-batch size.)*

Only batch gradient descent will actually converge, but you can make stochastic and mini-batch converge by setting a suitable learning schedule that decreases the learning rate.
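
A learning schedule can be as simple as a rate that decays with the step count. The following is a minimal sketch of stochastic gradient descent for Linear Regression with such a schedule. The synthetic data, the hyperparameters `t0` and `t1`, and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): y = 4 + 3x + small Gaussian noise.
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + 0.1 * rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]   # add the bias column x0 = 1

t0, t1 = 5, 50                    # schedule hyperparameters (illustrative values)

def learning_schedule(t):
    """Learning rate that decays roughly as 1/t, so the steps shrink over time."""
    return t0 / (t + t1)

theta = np.zeros((2, 1))
n_epochs = 50
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)                      # pick one random training instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # squared-error gradient on that instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta.ravel())  # should land close to the true parameters [4, 3]
```

Early on the large steps move quickly towards the optimum; later the small steps damp the random jitter so the parameters settle instead of bouncing around.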

## Exercise 8

Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

---

You are overfitting the training data. You could solve this by reducing the degree of the polynomial (simplifying the model), adding or increasing regularisation, or getting more training data.

## Exercise 9

Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularisation parameter $\alpha$ or reduce it?

---

If the training and validation errors are very similar and both fairly high, then you are underfitting the training data, so you have high *bias*. You should reduce the regularisation parameter $\alpha$.

## Exercise 10

Why would you want to use:
1. Ridge Regression instead of plain Linear Regression?
2. Lasso instead of Ridge Regression?
3. Elastic Net instead of Lasso?

---

1. Ridge Regression allows you to apply regularisation which reduces overfitting.
2. Lasso tends to set the coefficients of the least important features exactly to zero, so it performs feature selection as well as regularisation. This leads to a sparser model, which is often preferred when you suspect that only a few features matter.
3. Lasso can behave erratically when there are more features than training samples or when several features are strongly correlated. Also, Lasso is a special case of Elastic Net at a specific value of the mix hyperparameter, so a well-tuned Elastic Net should do at least as well as Lasso on the validation set used for tuning, at the cost of one more hyperparameter to search over.
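
The difference between Ridge and Lasso shrinkage is easiest to see in a special case. Assuming an orthonormal design matrix ($X^\top X = I$) and the objectives $\tfrac{1}{2}\lVert y - X\beta\rVert^2 + \tfrac{\alpha}{2}\lVert\beta\rVert_2^2$ for Ridge and $\tfrac{1}{2}\lVert y - X\beta\rVert^2 + \alpha\lVert\beta\rVert_1$ for Lasso, both solutions have closed forms in terms of the ordinary least squares coefficients. The sketch below uses made-up coefficients; the scaling of $\alpha$ is an assumption of this example and differs from library conventions.

```python
import numpy as np

def ridge_shrink(beta_ols, alpha):
    """Ridge solution under an orthonormal design: scale every coefficient down."""
    return beta_ols / (1.0 + alpha)

def lasso_shrink(beta_ols, alpha):
    """Lasso solution under an orthonormal design: soft-threshold every coefficient."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

beta_ols = np.array([3.0, -2.0, 0.4, -0.1])   # illustrative least squares coefficients
alpha = 0.5

ridge = ridge_shrink(beta_ols, alpha)
lasso = lasso_shrink(beta_ols, alpha)

print(ridge)                    # every coefficient is smaller, none is exactly zero
print(lasso)                    # the two small coefficients are exactly zero
print(np.count_nonzero(lasso))  # 2
```

Ridge shrinks all coefficients proportionally, while Lasso subtracts a fixed amount and clips at zero, which is why it removes weak features entirely.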

## Exercise 11

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

---

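Two logistic regression classifiers. Softmax regression is for multiclass classification (exactly one label out of several mutually exclusive classes), not multioutput classification, and here each picture needs two independent labels.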
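
To illustrate why a single softmax over the four labels does not fit, the sketch below compares it with two independent sigmoid outputs. The scores are hypothetical values for a picture taken outdoors at night.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

# One softmax over the labels [outdoor, indoor, daytime, nighttime] (hypothetical scores).
p_softmax = softmax(np.array([3.0, -3.0, -3.0, 3.0]))

# Two independent binary classifiers (hypothetical scores).
p_outdoor = sigmoid(3.0)             # first classifier outputs P(outdoor)
p_nighttime = 1.0 - sigmoid(-3.0)    # second classifier outputs P(daytime)

print(p_softmax)               # probabilities sum to 1, so no two labels can both exceed 0.5
print(p_outdoor, p_nighttime)  # both can be confidently high at the same time
```

Because softmax probabilities compete for a total of 1, it cannot say "outdoor" and "nighttime" with high confidence at once, whereas the two logistic classifiers can.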

## Exercise 12

Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).
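
---

One possible approach, sketched with NumPy only. This is a minimal illustration under several assumptions, not a reference solution: the data are synthetic Gaussian blobs rather than a real dataset, labels are assumed to be integers `0..K-1`, and the learning rate `eta`, the epoch limit and the `patience` value are arbitrary choices. All function names are made up for this sketch.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilised exponentials
    return exps / exps.sum(axis=1, keepdims=True)

def one_hot(y, n_classes):
    return np.eye(n_classes)[y]

def cross_entropy(Y_onehot, P, eps=1e-7):
    return -np.mean(np.sum(Y_onehot * np.log(P + eps), axis=1))

def add_bias(X):
    return np.c_[np.ones((len(X), 1)), X]

def train_softmax(X_train, y_train, X_val, y_val, eta=0.1, n_epochs=5000, patience=20):
    n_classes = int(y_train.max()) + 1            # assumes labels are 0..K-1
    Xb_train, Xb_val = add_bias(X_train), add_bias(X_val)
    Y_train, Y_val = one_hot(y_train, n_classes), one_hot(y_val, n_classes)
    theta = np.zeros((Xb_train.shape[1], n_classes))
    best_loss, best_theta, best_epoch = np.inf, theta.copy(), 0
    for epoch in range(n_epochs):
        P = softmax(Xb_train @ theta)
        gradients = Xb_train.T @ (P - Y_train) / len(Xb_train)  # gradient over the full batch
        theta = theta - eta * gradients
        val_loss = cross_entropy(Y_val, softmax(Xb_val @ theta))
        if val_loss < best_loss:
            best_loss, best_theta, best_epoch = val_loss, theta.copy(), epoch
        elif epoch - best_epoch >= patience:
            break                                  # early stopping: no recent improvement
    return best_theta, best_loss, best_epoch

def predict(theta, X):
    return np.argmax(add_bias(X) @ theta, axis=1)

# Synthetic three-class data (an assumption for illustration).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
y = rng.integers(0, 3, size=300)
X = centres[y] + rng.standard_normal((300, 2))
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

theta, val_loss, epoch = train_softmax(X_train, y_train, X_val, y_val)
accuracy = np.mean(predict(theta, X_val) == y_val)
print(epoch, val_loss, accuracy)
```

The function returns the parameters from the epoch with the lowest validation loss rather than the last ones computed, which is what makes this early stopping and not just a training cut-off. Measuring final performance on the same validation set is optimistic; a separate test set would be needed for an honest estimate.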