# Chapter 4

## Exercise 1
What Linear Regression training algorithm can you use if you have a training set with millions of features?

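I would use Gradient Descent (Stochastic or Mini-batch, or Batch Gradient Descent if the training set fits in memory) to minimize the cost function instead of using the Normal Equation. The Normal Equation requires inverting $X^\intercal X$, which is an $n \times n$ matrix where $n$ is the number of features, and the cost of that inversion grows much faster than linearly with $n$ (roughly between $O(n^{2.4})$ and $O(n^3)$), so it becomes impractical with millions of features.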
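To make the difference concrete, here is a minimal sketch on a tiny synthetic dataset (the data, the learning rate `eta` and the iteration count are assumptions chosen only for illustration). The Normal Equation has to build and solve an n x n system, while Batch Gradient Descent only needs matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 200, 5                      # samples, features (tiny, for illustration)
X = rng.normal(size=(m, n))
true_theta = rng.normal(size=n)
y = X @ true_theta                 # noise-free targets (hypothetical data)

# Normal Equation: builds and solves an n x n system.
gram = X.T @ X                     # shape (n, n): grows quadratically with n
theta_normal = np.linalg.solve(gram, X.T @ y)

# Batch Gradient Descent: only matrix-vector products, no n x n matrix.
theta_gd = np.zeros(n)
eta = 0.1                          # assumed learning rate
for _ in range(2000):
    gradient = 2 / m * X.T @ (X @ theta_gd - y)
    theta_gd -= eta * gradient
```

Both approaches should recover the same parameters here; the point is only that `gram` has n² entries, which is what makes the Normal Equation impractical when n is in the millions.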

## Exercise 2
Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?

Gradient Descent will suffer from this. For Linear Regression it can still reach the best solution (the cost function is convex), but when the features have very different scales the cost function looks like an elongated bowl, so it takes much longer to reach the minimum. Regularized models can suffer too: regularization penalizes large weights, so features with smaller values tend to be ignored compared to features with larger values. The usual fix is to scale the features before training, for example with one of Scikit-Learn's scalers (e.g. StandardScaler).
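The effect of scaling on convergence speed can be sketched with plain NumPy. The data, the helper names `standardize` and `gd_iterations`, and the stopping tolerance below are assumptions made for this example; `standardize` does by hand what StandardScaler does with its default settings (zero mean, unit variance per column):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
# Two features on very different scales (hypothetical data).
X = np.column_stack([rng.normal(0, 1, m), rng.normal(0, 100, m)])
y = 3 * X[:, 0] + 0.05 * X[:, 1]

def standardize(X):
    """Zero mean and unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def gd_iterations(X, y, tol=1e-6, max_iter=50_000):
    """Runs Batch GD on the MSE; returns the iterations used (max_iter if not converged)."""
    m = len(y)
    hessian = 2 / m * X.T @ X
    eta = 1 / np.linalg.eigvalsh(hessian).max()   # step size tied to the steepest direction
    theta = np.zeros(X.shape[1])
    for i in range(1, max_iter + 1):
        gradient = 2 / m * X.T @ (X @ theta - y)
        if np.linalg.norm(gradient) < tol:
            return i
        theta -= eta * gradient
    return max_iter

unscaled_iters = gd_iterations(X, y)
scaled_iters = gd_iterations(standardize(X), y)
print(unscaled_iters, scaled_iters)
```

On the unscaled data the step size is limited by the large-scale feature, so progress along the small-scale feature is very slow; after standardizing, both directions converge at a similar rate.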

## Exercise 3
Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

No, it can't. The Logistic Regression cost function is also convex, so Gradient Descent is guaranteed to approach the best solution (the global minimum), provided that the learning rate isn't too high and we wait for enough iterations.

## Exercise 4
Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?

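No, they don't. If the cost function is convex (as for Linear or Logistic Regression) and the learning rate is not too high, all Gradient Descent algorithms will end up near the global minimum and produce fairly similar models. However, the Stochastic and Mini-batch variants keep bouncing around the minimum unless the learning rate is gradually reduced, while Batch Gradient Descent settles at the minimum given enough time.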

## Exercise 5
Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

If the validation error consistently goes up after every epoch, one possibility is that the learning rate is too high and Gradient Descent is diverging. If the training error also goes up, this is clearly the problem, and the solution is to lower the learning rate. If the training error is not going up, the model is overfitting the training set, and training should be stopped.
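The divergence case is easy to reproduce on a toy problem. The sketch below minimizes f(θ) = θ² with plain Gradient Descent; the function `run_gd` and both learning rates are assumptions chosen for illustration:

```python
import numpy as np

def run_gd(eta, n_epochs=20):
    """Minimizes f(theta) = theta**2 with plain Gradient Descent."""
    theta = 1.0
    errors = []
    for _ in range(n_epochs):
        theta -= eta * 2 * theta      # the gradient of theta**2 is 2 * theta
        errors.append(theta ** 2)
    return errors

diverging = run_gd(eta=1.1)    # step too large: each update overshoots the minimum
converging = run_gd(eta=0.1)   # smaller step: the error shrinks at every epoch
```

With `eta=1.1` every step multiplies θ by -1.2, so the error keeps growing; with `eta=0.1` every step multiplies it by 0.8, so the error keeps shrinking.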

## Exercise 6
Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

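No, it is not. As said in exercise 4, Mini-batch Gradient Descent bounces around as it approaches the solution (because each step is computed from only part of the samples, not all of them as in Batch Gradient Descent), so the validation error may occasionally go up even though the model is still improving overall. A better option is to save the model at regular intervals and, once the validation error has not improved for a while, revert to the best saved model.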
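One way to tolerate those bounces is to stop only after the validation error has failed to improve for several epochs in a row. The helper below is a hypothetical sketch (the name `best_epoch_with_patience`, the `patience` parameter and the validation curve are all made up for this example):

```python
def best_epoch_with_patience(val_errors, patience=5):
    """
    Returns the index of the epoch with the lowest validation error,
    ending the scan once `patience` epochs have passed without improvement.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Hypothetical noisy validation curve: a bounce at epoch 2, the real minimum at epoch 4.
val_errors = [1.0, 0.8, 0.85, 0.7, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9, 1.0]

print(best_epoch_with_patience(val_errors, patience=1))  # stops at the first bounce
print(best_epoch_with_patience(val_errors, patience=3))  # rides out the bounce
```

With `patience=1` the scan stops at the first bounce and keeps epoch 1; with `patience=3` it continues past the bounce and finds the minimum at epoch 4.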

## Exercise 7
Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

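Stochastic Gradient Descent has the fastest training iteration, since it considers only one training instance at a time, so it usually reaches the vicinity of the optimal solution first, closely followed by Mini-batch Gradient Descent with a small batch size. However, only Batch Gradient Descent will actually converge, given enough training time. Stochastic and Mini-batch Gradient Descent keep bouncing around the optimum unless the learning rate is gradually reduced.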
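A gradually decreasing learning rate is the usual way to make Stochastic Gradient Descent settle down. The sketch below compares a constant learning rate with a schedule of the form t0 / (t + t1); the data, the values t0 = 5 and t1 = 50, and the helper names are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(0, 2, m)])   # bias column + one feature
y = X @ np.array([4.0, 3.0]) + rng.normal(0, 0.5, m)       # hypothetical noisy line

theta_best = np.linalg.lstsq(X, y, rcond=None)[0]          # exact least-squares optimum

def sgd_last_epoch(schedule, n_epochs=50, seed=2):
    """Runs SGD and returns every parameter vector visited during the last epoch."""
    theta = np.zeros(2)
    sample_rng = np.random.default_rng(seed)
    last_epoch = []
    for epoch in range(n_epochs):
        for i in range(m):
            idx = sample_rng.integers(m)                   # pick one random instance
            error = X[idx] @ theta - y[idx]
            theta = theta - schedule(epoch * m + i) * 2 * X[idx] * error
            if epoch == n_epochs - 1:
                last_epoch.append(theta)
    return np.array(last_epoch)

def mean_distance(thetas):
    """Average distance between the visited parameters and the optimum."""
    return np.linalg.norm(thetas - theta_best, axis=1).mean()

constant = mean_distance(sgd_last_epoch(lambda t: 0.1))
decaying = mean_distance(sgd_last_epoch(lambda t: 5 / (t + 50)))
print(constant, decaying)
```

With the constant rate the parameters keep wandering around the optimum, so their average distance to it stays comparatively large; with the decaying rate the steps shrink over time and the parameters stay much closer to it.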

## Exercise 8
Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

This means that we are overfitting the training data (the model performs much better on the training set than on the validation set). Three possible ways to solve this are:
* Getting more training data.
* Using a regularized Linear Model (e.g. Ridge Regression).
* Removing some of the features used (e.g. reducing the polynomial degree).

## Exercise 9
Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter or reduce it?

The model suffers from high bias (underfitting). We should reduce the regularization hyperparameter.
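The link between the regularization hyperparameter and bias can be checked with the closed-form Ridge solution. This is a minimal sketch on made-up data; `ridge_fit` and `mse` are hypothetical helpers, with no intercept and every weight penalized:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)   # hypothetical data

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of ||X theta - y||^2 + alpha * ||theta||^2."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

def mse(X, y, theta):
    return np.mean((X @ theta - y) ** 2)

alphas = (0.0, 10.0, 1000.0)
thetas = [ridge_fit(X, y, alpha) for alpha in alphas]
norms = [np.linalg.norm(theta) for theta in thetas]
train_errors = [mse(X, y, theta) for theta in thetas]

for alpha, norm, error in zip(alphas, norms, train_errors):
    print(alpha, norm, error)
```

As alpha grows, the weights shrink toward zero and the training error rises, which is the high-bias behavior described above; lowering alpha moves in the opposite direction.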

## Exercise 10
Why would you want to use:

### Ridge Regression instead of plain Linear Regression (i.e., without any regularization)

One should almost always prefer Ridge Regression to plain Linear Regression: a model with some regularization typically generalizes better than one without any, and the regularization hyperparameter can be tuned to fit your needs. It is almost always useful to apply regularization, even if it is just a little.

### Lasso instead of Ridge Regression?

Lasso tends to eliminate the weights of useless features (setting them to 0), so it should be preferred if you think that only some of your features will actually be useful to train the model. Lasso can also help you detect the useless features.

### Elastic net instead of Lasso?

Elastic Net is a mix of Ridge Regression and Lasso. It is generally preferred to Lasso, since Lasso may behave erratically when several features are strongly correlated or when there are more features than training instances. Elastic Net adds one more hyperparameter to tune (the mix ratio): at one extreme it behaves exactly like Lasso, and it can be tuned to behave more like Ridge Regression if needed.
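For reference, the three cost functions can be written side by side. This uses one common parameterization, with a regularization strength $\alpha$ and a mix ratio $r$ (these symbols and the exact scaling of each term are assumptions here, and libraries may scale the terms differently):

$$
\begin{aligned}
J_{\text{Ridge}}(\theta) &= \text{MSE}(\theta) + \frac{\alpha}{2} \sum_{i=1}^{n} \theta_i^2 \\
J_{\text{Lasso}}(\theta) &= \text{MSE}(\theta) + \alpha \sum_{i=1}^{n} |\theta_i| \\
J_{\text{Elastic Net}}(\theta) &= \text{MSE}(\theta) + r \alpha \sum_{i=1}^{n} |\theta_i| + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^2
\end{aligned}
$$

Setting $r = 1$ recovers the Lasso cost and $r = 0$ recovers the Ridge cost, which is why Elastic Net can imitate either of them.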

## Exercise 11
Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

You should implement two Logistic Regression classifiers, one to classify the picture as outdoor/indoor and another to classify it as daytime/nighttime. Softmax Regression should be used when you have to choose a single class among several mutually exclusive ones (more than 2), but it is not a multioutput classifier.
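The difference shows up directly in the output probabilities. In the sketch below (the scores are made up for one hypothetical picture that is both outdoor and daytime), two independent sigmoids can both be confident, whereas a single softmax over the four labels forces them to share one unit of probability:

```python
import numpy as np

def sigmoid(score):
    return 1 / (1 + np.exp(-score))

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))   # shift for numerical stability
    return exp_scores / exp_scores.sum()

# Two independent binary classifiers: each answers its own question.
p_outdoor = sigmoid(3.0)
p_daytime = sigmoid(2.5)

# One softmax over the labels [outdoor, indoor, daytime, nighttime]:
# the four probabilities must sum to 1, so two labels cannot both be near 1.
softmax_probs = softmax(np.array([3.0, -3.0, 2.5, -2.5]))

print(p_outdoor, p_daytime)
print(softmax_probs)
```

Here both sigmoid outputs are above 0.9, while the softmax has to split its probability mass between "outdoor" and "daytime".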

## Exercise 12
Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).

In [4]:
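import numpy as np

class CustomSoftmaxRegressor():

    def __init__(self, eta=0.1, max_iter=500, tol=1e-5):
        self.eta = eta            # learning rate
        self.max_iter = max_iter  # maximum number of epochs
        self.tol = tol            # minimum cost improvement to keep training
        self.weights = []
        self.Y = [[]]

    def fit(self, X, y, X_val=None, y_val=None):
        """
        Trains the model with Batch Gradient Descent. No bias term is
        added, so X should include a column of ones if one is needed.
        If a validation set is given, its cost drives the early stopping;
        otherwise the training cost is used.
        """
        m = np.shape(X)[0]  # number of samples
        n = np.shape(X)[1]  # number of features

        self._compute_classes_matrix(y)
        self.weights = np.zeros((self.num_classes, n))

        if X_val is None or y_val is None:
            X_val, Y_val = X, self.Y
        else:
            Y_val = self._one_hot(y_val)

        best_cost = np.inf
        best_weights = self.weights.copy()
        delta = np.zeros((self.num_classes, n))
        for _ in range(self.max_iter):
            # Compute every gradient before updating any weight
            for i in range(self.num_classes):
                delta[i] = np.dot(self._softmax_proba(X, i) - self.Y[:, i], X) / m
            self.weights -= self.eta * delta

            # Early stopping: keep the best weights and stop as soon as
            # the monitored cost no longer improves by at least tol
            cost = self._compute_cost(X_val, Y_val)
            if cost < best_cost - self.tol:
                best_cost = cost
                best_weights = self.weights.copy()
            else:
                break
        self.weights = best_weights
        return self

    def predict(self, X):
        prob_matrix = np.zeros((np.shape(X)[0], self.num_classes))
        for i in range(self.num_classes):
            prob_matrix[:, i] = self._softmax_proba(X, i)
        # Return the most probable class for each sample
        return self.classes[np.argmax(prob_matrix, axis=1)]

    def _softmax_proba(self, X, k):
        """
        Computes the probability that each sample in the
        data matrix X has of belonging to class k.
        """
        scores = np.zeros((np.shape(X)[0], self.num_classes))
        for i in range(self.num_classes):
            scores[:, i] = self._softmax_score(X, i)
        # Subtract the row-wise maximum to avoid overflow in np.exp
        exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
        return exp_scores[:, k] / np.sum(exp_scores, axis=1)

    def _softmax_score(self, X, k):
        assert 0 <= k < self.num_classes
        return np.dot(self.weights[k], np.transpose(X))

    def _compute_classes_matrix(self, y):
        self.classes = np.unique(y)
        self.num_classes = np.size(self.classes)
        self.Y = self._one_hot(y)

    def _one_hot(self, y):
        """ One row per sample, with a 1 in the column of its class. """
        return (np.asarray(y)[:, np.newaxis] == self.classes).astype(float)

    def _compute_cost(self, X, Y):
        """ Cross-entropy cost averaged over the samples. """
        m = np.shape(X)[0]
        cost = 0
        for i in range(self.num_classes):
            # The small constant avoids log(0)
            cost -= np.sum(Y[:, i] * np.log(self._softmax_proba(X, i) + 1e-12)) / m
        return cost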

In [5]:
import unittest

class TestSoftmaxRegression(unittest.TestCase):

    def setUp(self):
        self.clf = CustomSoftmaxRegressor()

    def test_softmax_score(self):
        pass

    def test_compute_cost(self):
        pass

    def test_softmax_proba(self):
        pass

# In a notebook, sys.argv holds the kernel's arguments, so pass a dummy
# argv and disable the SystemExit that unittest.main() raises by default.
unittest.main(argv=['first-arg-is-ignored'], exit=False)


In [91]:
a = np.random.rand(2, 3)
b = np.random.rand(2, 3)
np.max(a, axis=1)

array([0.95228967, 0.98618932])

In [92]:
a

array([[0.95228967, 0.65806659, 0.45680161],
       [0.9181056 , 0.98618932, 0.98167172]])