# Chapter 4: Training Models

1. When training linear regression on data with millions of features, use some form of gradient descent (stochastic or mini-batch, or batch if the training set fits in memory), since neither the normal equation nor SVD scales well with the number of features.

2. Gradient descent algorithms suffer when features have very different scales. In that case the MSE cost function being minimized has the shape of an elongated bowl with a flat(ish) valley, so gradient descent takes much longer to converge. To mitigate this, scale the features before training, e.g. with scikit-learn's StandardScaler.

3. Gradient descent cannot get stuck in a local minimum when training a logistic regression model because its cost function (log loss, not MSE) is convex, so every local minimum is also the global minimum.

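4. No. If the cost function is convex and the learning rate is not too high, every gradient descent variant approaches the global optimum, but stochastic and mini-batch GD never truly converge at a constant learning rate: their parameters keep jumping around the optimum. The resulting models will be very similar but not identical.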

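5. If you're using batch gradient descent (all training data is used at each step, as opposed to mini-batch, which is stochastic) and the validation error consistently goes up on the learning curves, check the training error. If the training error is also rising, the learning rate is likely too high and the algorithm is diverging, so reduce it. If the training error keeps falling while the validation error rises, the model is overfitting and you should stop training early.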

6. Because mini-batch gradient descent is stochastic, the cost function will not be monotonically decreasing on either train or validation data. Rather than stopping training the moment validation cost goes up, you should wait to see if the uptrend is persistent.

7. SGD will reach the vicinity of the optimal solution the fastest, but batch GD will actually converge given enough time. SGD & mini-batch GD only converge if the learning rate decreases over time.

8. When using polynomial regression, if there is a big gap between training error and validation error, then the model is likely overfitting the data. Three things you can do: lower the degree used, add regularization, or find more data.

9. If, when using ridge regression, the training error & validation error are almost equal and fairly high, then the model is underfitting, i.e. it suffers from high bias in the bias-variance tradeoff. You should reduce the regularization hyperparameter.

10. a. Ridge regression reduces overfitting compared to plain linear regression. b. Lasso regression is good at eliminating features which are not useful, because it tends to push their weights to exactly zero. c. Elastic net regression gives a balance of both ridge & lasso.

11. Two logistic regression classifiers (one for outdoor/indoor, one for daytime/nighttime) would be best, instead of softmax regression, because an instance can be both indoor and at night, i.e. the classes are not exclusive.
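
To make answer 2 a bit more concrete, here is a rough NumPy sketch (not from the book). It assumes two made-up features with standard deviations of 1 and 1000, and uses the condition number of `X.T @ X` as a proxy for how elongated the MSE bowl is. The helper names `standardize` and `hessian_condition_number` are hypothetical; `standardize` does the same per-column computation StandardScaler performs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales (std 1 vs std 1000)
X = np.column_stack([
    rng.normal(0.0, 1.0, size=200),
    rng.normal(0.0, 1000.0, size=200),
])


def standardize(X):
    """Give every column zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population std (ddof=0)
    return (X - mean) / std


def hessian_condition_number(X):
    """Condition number of X^T X, which is proportional to the MSE Hessian."""
    return np.linalg.cond(X.T @ X)


X_scaled = standardize(X)
cond_raw = hessian_condition_number(X)
cond_scaled = hessian_condition_number(X_scaled)
print(f"condition number before scaling: {cond_raw:.3g}")
print(f"condition number after scaling:  {cond_scaled:.3g}")
```

A large condition number means the bowl is much steeper in one direction than the other, which is what slows gradient descent down. After standardizing, the two directions have comparable curvature.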
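
The "wait and see" advice in answers 5 and 6 is often implemented with a patience counter. The sketch below is one possible version, not the book's code: `early_stopping_epoch`, its `patience` parameter and the `noisy` loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    If that never happens, stop_epoch is simply the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: in a real loop you would also save the parameters here
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical noisy validation curve: small bumps, then a persistent uptrend
noisy = [1.0, 0.8, 0.85, 0.7, 0.72, 0.75, 0.71, 0.74, 0.9]
best, stopped = early_stopping_epoch(noisy, patience=3)
print(best, stopped)
```

The bump at epoch 2 does not stop training because the loss improves again at epoch 3. Training only stops after three epochs in a row fail to beat that value, and the model saved at `best` is the one to keep.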
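
Answers 4 and 7 both hinge on the learning rate decreasing over time. As a minimal sketch under assumed settings (synthetic data from y = 4 + 3x plus noise, and a schedule of the form t0 / (t + t1) with arbitrarily chosen constants), SGD for linear regression with a decaying learning rate could look like this:

```python
import numpy as np


def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate that shrinks as the step count t grows (assumed constants)."""
    return t0 / (t + t1)


rng = np.random.default_rng(42)
m = 200
X = rng.uniform(0, 2, size=(m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=m)  # synthetic targets
X_b = np.column_stack([np.ones(m), X])            # add bias column

theta = np.zeros(2)
n_epochs = 50
step = 0
for epoch in range(n_epochs):
    for i in rng.permutation(m):                  # visit instances in random order
        xi = X_b[i]
        gradient = 2 * xi * (xi @ theta - y[i])   # MSE gradient for one instance
        theta -= learning_schedule(step) * gradient
        step += 1

print(theta)  # should land close to the true parameters [4, 3]
```

With a constant learning rate the parameters would keep bouncing around the optimum. Shrinking the step size lets the updates settle, at the cost of having to tune the schedule.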

## 12. Implementing batch gradient descent with early stopping for softmax regression using only numpy

In [1]:
import numpy as np