1. Which Linear Regression training algorithm can you use if you have a training
set with millions of features?

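Stochastic Gradient Descent or Mini-batch Gradient Descent. Batch Gradient Descent also works if the training set fits in memory. The Normal Equation and the SVD approach are impractical here because their computational cost grows quickly (more than quadratically) with the number of features.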

2. Suppose the features in your training set have very different scales. Which algorithms
might suffer from this, and how? What can you do about it?

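All Gradient Descent based algorithms suffer: the cost function takes the shape of an elongated bowl, so they take much longer to converge. Regularized models are also affected: because regularization penalizes large weights, features with smaller values tend to be ignored compared to features with larger values. The fix is to scale the features before training, for example with min-max scaling or standardization. The Normal Equation and SVD approaches work fine without scaling.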
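
As an illustration of the scaling step, here is a minimal sketch of standardization with NumPy. The feature matrix `X` is made up for the example; the mean and standard deviation are computed on the training set only and would be reused for validation and test data.

```python
import numpy as np

# Hypothetical training data: column 0 is in the thousands, column 1 is below 1.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])

# Learn the scaling parameters from the training set only.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standard scaling: every column ends up with mean 0 and standard deviation 1.
X_scaled = (X - mean) / std
```

After scaling, both columns span the same range, so neither dominates the gradient or the regularization penalty.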

3. Can Gradient Descent get stuck in a local minimum when training a Logistic
Regression model?

No. The log loss cost function is convex, so it has no local minima and Gradient Descent cannot get stuck in one.

4. Do all Gradient Descent algorithms lead to the same model, provided you let
them run long enough?

not sure

<span style="color:red;">If the optimization problem is convex (such as Linear Regression or Logistic
Regression), and assuming the learning rate is not too high, then all Gradient
Descent algorithms will approach the global optimum and end up producing
fairly similar models. However, unless you gradually reduce the learning rate,
Stochastic GD and Mini-batch GD will never truly converge; instead, they will
keep jumping back and forth around the global optimum. This means that even
if you let them run for a very long time, these Gradient Descent algorithms will
produce slightly different models.</span>

5. Suppose you use Batch Gradient Descent and you plot the validation error at
every epoch. If you notice that the validation error consistently goes up, what is
likely going on? How can you fix this?

Overfitting is one possibility: if the training error keeps going down while the validation error goes up, add regularization or stop training early.
<span style="color:red;">Another possibility
is that the learning rate is too high and the algorithm is diverging. If the training
error also goes up, then this is clearly the problem and you should reduce the
learning rate.</span>

6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation
error goes up?

Not sure

<span style="color:red;">Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch
Gradient Descent is guaranteed to make progress at every single training iteration.
So if you immediately stop training when the validation error goes up, you
may stop much too early, before the optimum is reached. A better option is to
save the model at regular intervals; then, when it has not improved for a long
time (meaning it will probably never beat the record), you can revert to the best
saved model.</span>
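
The "wait before giving up" idea can be sketched as early stopping with patience. This is a minimal illustration, not a full training loop: `best_epoch_with_patience`, its `patience` parameter, and the validation curve are all hypothetical.

```python
def best_epoch_with_patience(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping once `patience`
    consecutive epochs fail to beat the best error seen so far."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch  # in real training, also save a copy of the model here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error

# Hypothetical noisy validation curve: it goes up at epoch 2, then improves again.
val_errors = [1.0, 0.8, 0.85, 0.7, 0.72, 0.71, 0.73, 0.6]
best_epoch, best_error = best_epoch_with_patience(val_errors, patience=3)
```

With `patience=1` the same curve would stop at the first bump and keep the model from epoch 1, which is exactly the premature stop described above.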

7. Which Gradient Descent algorithm (among those we discussed) will reach the
vicinity of the optimal solution the fastest? Which will actually converge? How
can you make the others converge as well?

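Stochastic Gradient Descent reaches the vicinity of the optimum fastest (as does Mini-batch Gradient Descent with a very small batch size), because each training iteration is so cheap. Only Batch Gradient Descent actually converges, given enough training time. Stochastic GD and Mini-batch GD keep bouncing around the optimum unless you gradually reduce the learning rate with a learning schedule.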
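
A learning schedule can be as simple as a function of the step count. The sketch below assumes an inverse-decay form; `t0` and `t1` are hypothetical hyperparameters chosen only for illustration.

```python
def learning_schedule(t, t0=5.0, t1=50.0):
    """Learning rate at step t. It starts at t0 / t1 and decays toward 0."""
    return t0 / (t + t1)

# Learning rates at a few steps: large early steps, small late steps.
rates = [learning_schedule(t) for t in (0, 50, 450)]
```

Inside a Stochastic GD loop, `eta = learning_schedule(step)` would replace the fixed learning rate, so the updates shrink over time and the parameters settle instead of bouncing around.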

8. Suppose you are using Polynomial Regression. You plot the learning curves and
you notice that there is a large gap between the training error and the validation
error. What is happening? What are three ways to solve this?

The model is overfitting the training set. Three ways to fix this: reduce the polynomial degree, regularize the model (for example with a Ridge or Lasso penalty, or with early stopping), or increase the size of the training set.

9. Suppose you are using Ridge Regression and you notice that the training error
and the validation error are almost equal and fairly high. Would you say that the
model suffers from high bias or high variance? Should you increase the regularization
hyperparameter α or reduce it?

High bias (underfitting). The model is too constrained, so we should reduce the regularization hyperparameter α.

10. Why would you want to use:
a. Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?
b. Lasso instead of Ridge Regression?
c. Elastic Net instead of Lasso?

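a) To reduce overfitting: a model with some regularization generally performs better than one without any.
b) Lasso uses an ℓ1 penalty, which tends to push the weights of the least useful features all the way to zero. This gives a sparse model and acts as automatic feature selection, which helps when you suspect only a few features matter.
c) Lasso can behave erratically when several features are strongly correlated or when there are more features than training instances. Elastic Net mixes the Ridge and Lasso penalties by a ratio; if you want Lasso without the erratic behavior, you can use Elastic Net with an l1_ratio close to 1.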
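
To make point a) concrete, here is a small sketch of Ridge Regression in closed form, showing that the penalty shrinks the weights. The helper `ridge_closed_form` and the data are hypothetical, and the bias term is left out for brevity. Lasso and Elastic Net have no closed-form solution and are not shown.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) theta = X^T y. No bias term, for brevity."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical data: 20 instances, 3 features, generated with a fixed seed.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=20)

theta_plain = ridge_closed_form(X, y, alpha=0.0)    # plain Linear Regression
theta_ridge = ridge_closed_form(X, y, alpha=10.0)   # Ridge Regression
```

The Ridge weight vector has a smaller norm than the unregularized one; a larger `alpha` shrinks it further.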

11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime.
Should you implement two Logistic Regression classifiers or one Softmax Regression
classifier?

<span style="color:red;">If you want to classify pictures as outdoor/indoor and daytime/nighttime, since
these are not exclusive classes (i.e., all four combinations are possible) you should
train two Logistic Regression classifiers.</span>
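
The difference can be seen in the output functions themselves. In this sketch the scores are made up: two independent sigmoids can both report a high probability, while softmax forces its outputs to compete and sum to 1.

```python
import math

def sigmoid(z):
    """Logistic function: maps a score to an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Maps a list of scores to probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one picture taken outdoors during the day.
p_outdoor = sigmoid(2.0)   # first Logistic Regression classifier
p_daytime = sigmoid(1.5)   # second Logistic Regression classifier

# Softmax over the same two scores treats them as mutually exclusive classes.
probs = softmax([2.0, 1.5])
```

Here both `p_outdoor` and `p_daytime` exceed 0.5, which a single Softmax Regression classifier over two classes cannot express.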

12. Implement Batch Gradient Descent with early stopping for Softmax Regression
(without using Scikit-Learn).

In [3]:
import numpy as np

In [1]:
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())

['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

In [2]:
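# Petal length and petal width only. No bias column of ones is added here,
# so the model trained below has no intercept term.
x = iris["data"][:, (2, 3)]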
y = iris["target"]
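
The `Theta[1:]` slices in the regularized cells further down treat the first row of `Theta` as bias terms, which only holds if the inputs start with a column of ones. A minimal sketch of adding that column, using a hypothetical stand-in array `x_demo` rather than the iris data:

```python
import numpy as np

# Hypothetical stand-in for the two petal features selected above.
x_demo = np.array([[1.4, 0.2],
                   [4.7, 1.4],
                   [6.0, 2.5]])

# Prepend a column of ones so that the first row of Theta acts as the bias term.
x_with_bias = np.c_[np.ones([len(x_demo), 1]), x_demo]
```

With this change `n_inputs` would be 3 (bias plus two features), and the printed losses and accuracies in the cells below would differ from the ones recorded here.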

In [4]:
test_size = int(len(x) * 0.2)
validation_size = int(len(x) * 0.2)
train_size = len(x) - test_size - validation_size

rnd_indices = np.random.permutation(len(x))

X_train = x[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = x[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = x[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

In [5]:
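def to_one_hot(y):
    """Convert a vector of class indices into a one-hot matrix (one row per instance)."""
    n_classes = y.max() + 1
    m = len(y)
    Y_one_hot = np.zeros((m, n_classes))
    # Set the column matching each instance's class index to 1
    Y_one_hot[np.arange(m), y] = 1
    return Y_one_hot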

In [7]:
Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

In [8]:
def softmax(logits):
    """Row-wise softmax: turn each row of scores into class probabilities."""
    exps = np.exp(logits)
    exp_sums = np.sum(exps, axis=1, keepdims=True)
    return exps / exp_sums
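
The `softmax` above can overflow when the logits are large, because `np.exp` grows very quickly. A common remedy, sketched here under the hypothetical name `stable_softmax`, is to subtract each row's maximum before exponentiating; this leaves the result mathematically unchanged.

```python
import numpy as np

def stable_softmax(logits):
    """Row-wise softmax that stays finite for large logits."""
    # Shifting every row by its maximum does not change the softmax output,
    # but it keeps the largest exponent at exp(0) = 1.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

# Hypothetical logits: the first row would overflow a naive np.exp.
demo_logits = np.array([[1000.0, 1001.0, 1002.0],
                        [1.0, 2.0, 3.0]])
demo_probs = stable_softmax(demo_logits)
```

Both rows differ only by a constant shift, so they yield the same probabilities.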

In [9]:
n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y_train))

In [10]:
eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

Theta = np.random.randn(n_inputs, n_outputs)

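for iteration in range(n_iterations):
    # Forward pass: class scores, then class probabilities
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    # Cross-entropy loss; epsilon avoids log(0)
    loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
    error = Y_proba - Y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, loss)
    # Gradient of the cross-entropy loss, then one Batch Gradient Descent step
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients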

0 2.3125576993260766
500 1.2099289388260193
1000 1.1255691510501027
1500 1.0557826577267757
2000 0.998032542184211
2500 0.9499441556747465
3000 0.9094976774957052
3500 0.875073732884705
4000 0.8454165333779893
4500 0.8195672346410098
5000 0.7967964152246556


In [11]:
Theta

array([[ 0.74872347,  0.55372612, -0.13430308],
       [-2.28449691, -0.71463385,  1.27683547]])

In [12]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)
accuracy_score

0.4

In [13]:
eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1  # regularization hyperparameter

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
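    # Theta[1:] is meant to leave a bias row out of the penalty; since X_train has no
    # bias column here, it leaves out the first feature's weights instead.
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))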
    loss = xentropy_loss + alpha * l2_loss
    error = Y_proba - Y_train_one_hot
    if iteration % 500 == 0:
        print(iteration, loss)
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

0 2.4193588711085625
500 0.9678480766318557
1000 0.9678463924871283
1500 0.9678463924279299
2000 0.9678463924279271
2500 0.967846392427927
3000 0.9678463924279271
3500 0.9678463924279271
4000 0.9678463924279271
4500 0.9678463924279271
5000 0.9678463924279271


In [14]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)
accuracy_score

0.36666666666666664

In [15]:
eta = 0.1 
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1  # regularization hyperparameter
best_loss = np.inf  # np.infty was removed in NumPy 2.0

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

    logits = X_valid.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    if iteration % 500 == 0:
        print(iteration, loss)
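    # Stop at the first iteration where the validation loss fails to improve.
    # Theta has already been updated at that point, so the parameters kept are
    # from the slightly worse step, not the best one.
    if loss < best_loss:
        best_loss = loss
    else:
        print(iteration - 1, best_loss)
        print(iteration, loss, "early stopping!")
        break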

0 1.1216253957163416
128 0.9970641695726357
129 0.9970649212931689 early stopping!


In [16]:
logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)
accuracy_score

0.4

In [17]:
logits = X_test.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_test)
accuracy_score

0.5333333333333333