# Batch Gradient Descent

I'm going to implement Batch Gradient Descent with early stopping for Softmax Regression without using Scikit-Learn's estimators. Scikit-Learn is only used to load the Iris data set.

 - [Training and testing data](#Training-and-testing-data)
 - [Softmax model training](#Softmax-model-training)
   - [Regularization](#Regularization)
 - [Early stopping](#Early-stopping)

In [1]:
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
print(iris['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

## Training and testing data

We're only going to use petal length and petal width.

In [2]:
X = iris.data[:, (2, 3)]
y = iris.target

We are going to add the bias term to our data.

In [3]:
X_b = np.c_[np.ones((len(X), 1)), X]

We're also going to set the seed to make the results reproducible.

In [4]:
np.random.seed(42)

Normally with scikit-learn it would be easy to create training and testing sets, but here we are going to generate them from scratch.

In [5]:
test_ratio = 0.2
validation_ratio = 0.2

In [6]:
shuffled_index = np.random.permutation(len(X_b))
total_size = len(X_b)
test_size = int(len(X_b) * test_ratio)
valid_size = int(len(X_b) * validation_ratio)
train_size = total_size - test_size - valid_size

In [7]:
X_train = X_b[shuffled_index[:train_size]]
y_train = y[shuffled_index[:train_size]]
X_val = X_b[shuffled_index[train_size:-test_size]]
y_val = y[shuffled_index[train_size:-test_size]]
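# The test set is the last `test_size` shuffled indices, so it does not overlap the other two sets
X_test = X_b[shuffled_index[-test_size:]]
y_test = y[shuffled_index[-test_size:]]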
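A quick sanity check is useful after a manual split like this one: the three sets should be disjoint and together cover every sample exactly once. The sketch below is not part of the original notebook. The helper name `split_indices` is hypothetical, and it uses `np.random.default_rng` rather than the global seed set above, so the exact indices will differ from the ones in this notebook.

```python
import numpy as np

def split_indices(n_samples, test_ratio=0.2, validation_ratio=0.2, seed=42):
    """Return disjoint (train, validation, test) index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_size = int(n_samples * test_ratio)
    valid_size = int(n_samples * validation_ratio)
    train_size = n_samples - test_size - valid_size
    train_idx = shuffled[:train_size]
    val_idx = shuffled[train_size:train_size + valid_size]
    test_idx = shuffled[train_size + valid_size:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(150)

# The three index sets should not overlap and should cover every sample once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(train_idx), len(val_idx), len(test_idx))
print(len(np.unique(all_idx)) == 150)
```

If the second line printed is `False`, some sample appears in more than one set and the evaluation scores would be optimistic.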

Currently the output vector labels each iris class with one of the three numbers `(0, 1, 2)`. We want to change the output vector to a one-hot vector. We'll create a function to do exactly this.

In [8]:
def to_one_hot(y):
    m = len(y)
    n_output = y.max() + 1
    one_hot_arr = np.zeros((m, n_output))
    one_hot_arr[np.arange(m), y] = 1
    return one_hot_arr

In [9]:
to_one_hot(y_train[:5])

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.]])
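One thing to keep in mind: `to_one_hot` infers the number of columns from `y.max() + 1`, so a slice of labels that happens to miss the highest class would produce fewer columns than expected. As a minimal sketch, assuming we are willing to pass the class count explicitly, a variant could look like this. The name `to_one_hot_fixed` and the parameter `n_classes` are hypothetical, not part of the notebook.

```python
import numpy as np

def to_one_hot_fixed(y, n_classes):
    """One-hot encode integer labels with an explicit number of classes."""
    y = np.asarray(y)
    one_hot = np.zeros((len(y), n_classes))
    # Set a single 1 per row, in the column given by the label.
    one_hot[np.arange(len(y)), y] = 1
    return one_hot

labels = np.array([0, 1, 1, 0])  # class 2 happens to be absent here
encoded = to_one_hot_fixed(labels, n_classes=3)
print(encoded.shape)
```

With the full training, validation and test sets used here, all three classes should be present, so the original function works as is.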

Let's create the three one-hot vectors we'll be using.

In [10]:
y_train_one_hot = to_one_hot(y_train)
y_test_one_hot = to_one_hot(y_test)
y_val_one_hot = to_one_hot(y_val)

## Softmax model training

Let's implement the Softmax function now. Recall that it is defined by the following equation:

$\sigma\left(\mathbf{s}(\mathbf{x})\right)_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum\limits_{j=1}^{K}{\exp\left(s_j(\mathbf{x})\right)}}$

In [11]:
def prob_softmax(logits):
    exps = np.exp(logits)
    sum_exps = np.sum(exps, axis=1, keepdims=True)
    return exps/sum_exps
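The function above follows the equation directly, which is fine for the small logits we get from this data set. For large logits `np.exp` can overflow, so a common variant subtracts the row-wise maximum first. This leaves the result unchanged because the Softmax is invariant to adding a constant to every score in a row. The sketch below is an optional alternative, and the name `stable_softmax` is hypothetical.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax over axis 1, shifted by the row maximum to avoid overflow."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

# The second row would overflow with a naive np.exp(logits).
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = stable_softmax(logits)
print(probs.sum(axis=1))
```

Both rows differ only by a constant shift, so they should map to the same probabilities.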

We will need the cost function:

$J(\mathbf{\Theta}) =
- \dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)}$

And the equation for the gradients:

$\nabla_{\mathbf{\theta}^{(k)}} \, J(\mathbf{\Theta}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{ \left ( \hat{p}^{(i)}_k - y_k^{(i)} \right ) \mathbf{x}^{(i)}}$

Note that $\log\left(\hat{p}_k^{(i)}\right)$ is undefined if $\hat{p}_k^{(i)} = 0$ (NumPy returns `-inf`, which becomes `nan` when multiplied by a zero label). So we will add a tiny value $\epsilon$ to $\hat{p}_k^{(i)}$ inside the logarithm to avoid getting `nan` values.
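Before relying on the gradient equation, it can be reassuring to compare it against a finite-difference estimate of the cost. The sketch below does this on small random data rather than the Iris set. All names in it (`softmax`, `cross_entropy`, `analytic_gradient`, the step `h`) are assumptions made for illustration. Agreement is only approximate, since $\epsilon$ slightly perturbs the cost while the gradient equation ignores it.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(theta, X, Y, epsilon=1e-7):
    """Mean cross entropy, with epsilon added inside the log."""
    proba = softmax(X.dot(theta))
    return -np.mean(np.sum(Y * np.log(proba + epsilon), axis=1))

def analytic_gradient(theta, X, Y):
    """Gradient from the equation above: (1/m) * X^T (p_hat - y)."""
    proba = softmax(X.dot(theta))
    return X.T.dot(proba - Y) / len(X)

rng = np.random.default_rng(0)
X = np.c_[np.ones((6, 1)), rng.normal(size=(6, 2))]  # bias column + 2 features
Y = np.eye(3)[rng.integers(0, 3, size=6)]            # random one-hot targets
theta = 0.1 * rng.normal(size=(3, 3))

# Central finite differences, one parameter at a time.
numeric = np.zeros_like(theta)
h = 1e-5
for i in range(theta.shape[0]):
    for j in range(theta.shape[1]):
        step = np.zeros_like(theta)
        step[i, j] = h
        numeric[i, j] = (cross_entropy(theta + step, X, Y)
                         - cross_entropy(theta - step, X, Y)) / (2 * h)

max_diff = np.max(np.abs(numeric - analytic_gradient(theta, X, Y)))
print(max_diff)
```

A very small `max_diff` suggests the analytic gradient and the cost function are consistent with each other.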

In [12]:
n_inputs = X_train.shape[1]
n_outputs = len(np.unique(y))
theta = np.random.randn(n_inputs, n_outputs)

In [13]:
eta = 0.01
epsilon = 1e-7
n_iterations = 5001
m = len(y_train)

In [14]:
y_proba = prob_softmax(X_train.dot(theta))

In [15]:
for iteration in range(n_iterations):
    y_proba = prob_softmax(X_train.dot(theta))
    loss = -np.mean(np.sum(y_train_one_hot * np.log(y_proba + epsilon), axis=1, keepdims=True))
    error = y_proba - y_train_one_hot
    if iteration % 500 == 0:
        print(f'iteration, loss: {iteration}, {loss}')
    gradients = 1/m * X_train.T.dot(error)
    theta = theta - eta * gradients

iteration, loss: 0, 3.5356045081790177
iteration, loss: 500, 0.7698276617097016
iteration, loss: 1000, 0.6394784332731978
iteration, loss: 1500, 0.5618741363839648
iteration, loss: 2000, 0.5095831080853223
iteration, loss: 2500, 0.47127377559909306
iteration, loss: 3000, 0.44155863305230325
iteration, loss: 3500, 0.41755986648041216
iteration, loss: 4000, 0.3975941721521858
iteration, loss: 4500, 0.38060484552797946
iteration, loss: 5000, 0.3658905593000995


So there you have it, the model is trained! Let's use the model to make a prediction on the validation set.

In [16]:
y_val_proba = prob_softmax(X_val.dot(theta))
y_predict = np.argmax(y_val_proba, axis=1)
print(f'Accuracy score: {np.mean(y_predict == y_val)}')

Accuracy score: 0.9333333333333333


### Regularization

Let's add some $L_2$ regularization to the cost function. The code will be similar to the code above except for the regularization term itself. Let's also increase the learning rate $\eta$ and introduce $\alpha$, the regularization strength.
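For reference, here is how the regularized cost and its gradient could be written out, assuming the penalty used in the code of this section: half the sum of squared weights, with the bias terms excluded. The symbols $n$ (number of input features, with index $j = 0$ reserved for the bias) and $\tilde{\mathbf{\theta}}^{(k)}$ (a copy of $\mathbf{\theta}^{(k)}$ with its bias entry set to zero) are introduced here for illustration only.

$J(\mathbf{\Theta}) = - \dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)} + \dfrac{\alpha}{2}\sum\limits_{k=1}^{K}\sum\limits_{j=1}^{n}{\left(\theta_j^{(k)}\right)^2}$

$\nabla_{\mathbf{\theta}^{(k)}} \, J(\mathbf{\Theta}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{ \left ( \hat{p}^{(i)}_k - y_k^{(i)} \right ) \mathbf{x}^{(i)}} + \alpha \, \tilde{\mathbf{\theta}}^{(k)}$

The first term in each equation is the unregularized expression from before, so setting $\alpha = 0$ recovers the earlier model.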

In [29]:
eta = 0.1
alpha = 0.1
m = len(X_train)
theta = np.random.randn(n_inputs, n_outputs)

In [30]:
for iteration in range(n_iterations):
    y_proba = prob_softmax(X_train.dot(theta))
    xentropy = -np.mean(np.sum(y_train_one_hot * np.log(y_proba + epsilon), axis=1))
    l2_loss = 1/2 * np.sum(np.square(theta[1:]))
    loss = xentropy + alpha * l2_loss
    error = y_proba - y_train_one_hot
    if iteration % 500 == 0:
        print(f'iteration, loss: {iteration}, {loss}')
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros((1, n_outputs)), alpha * theta[1:]]
    theta = theta - eta * gradients

iteration, loss: 0, 1.4752023955379974
iteration, loss: 500, 0.513735945842348
iteration, loss: 1000, 0.4910012836945394
iteration, loss: 1500, 0.4842056391009044
iteration, loss: 2000, 0.4816990484730599
iteration, loss: 2500, 0.4806970194934077
iteration, loss: 3000, 0.48027949983270357
iteration, loss: 3500, 0.48010132036195646
iteration, loss: 4000, 0.4800241634344137
iteration, loss: 4500, 0.47999044355141046
iteration, loss: 5000, 0.47997561956546786


In [31]:
y_val_proba = prob_softmax(X_val.dot(theta))
y_predict = np.argmax(y_val_proba, axis=1)
print(f'Accuracy score: {np.mean(y_predict == y_val)}')

Accuracy score: 0.9333333333333333


## Early stopping

In [20]:
minimum_val_error = float('inf')
best_iteration = None
theta_arr = []