# Softmax Regression (without Scikit-Learn)
This notebook provides a solution to Chapter 4 Exercise 12:
>*Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn)*

For this task I will use the [IRIS](https://archive.ics.uci.edu/ml/datasets/iris) dataset, which is packaged with Scikit-Learn and accessible via the `datasets` module.

In [7]:
import numpy as np
from sklearn.datasets import load_iris

In [8]:
iris = load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [9]:
iris['data'].shape # 150 samples with 4 features

(150, 4)

In [10]:
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

To compare my results with those in the corresponding chapter notebook, I will use only the `petal length (cm)` and `petal width (cm)` features.

## Data Preparation

In [68]:
feats = ['petal length (cm)', 'petal width (cm)']
feats_idx = [iris['feature_names'].index(ft) for ft in feats]
feats_idx

[2, 3]

In [69]:
X = iris['data'][:,feats_idx]
y = iris['target']

# Need to add x0 = 1 for the bias term - note that the Logistic Regression
# models in Scikit-Learn will automatically add this by default
X_b = np.c_[np.ones((X.shape[0],1)),X]

In [71]:
X_b.shape, y.shape

((150, 3), (150,))
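
As a small check on why the column of ones is added, the sketch below compares scores computed with a stacked parameter matrix against scores computed with separate weights and bias. The names `X_demo`, `bias`, `weights` and `Theta` are hypothetical and the values are illustrative only; they are not taken from the notebook.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features (illustrative values only)
X_demo = np.array([[1.4, 0.2],
                   [4.7, 1.4],
                   [5.1, 1.9],
                   [1.3, 0.3]])

# Prepend a column of ones, mirroring how X_b is built above
X_demo_b = np.c_[np.ones((X_demo.shape[0], 1)), X_demo]

# Hypothetical parameters for 3 classes: one bias and two weights per class
bias = np.array([0.5, -0.2, 0.1])            # shape (3,)
weights = np.array([[0.3, -0.1, 0.2],
                    [0.7, 0.4, -0.5]])       # shape (2, 3)

# Stacking the bias as the first row folds it into a single matrix product
Theta = np.vstack([bias, weights])           # shape (3, 3)

scores_with_column = X_demo_b @ Theta
scores_explicit = X_demo @ weights + bias

print(np.allclose(scores_with_column, scores_explicit))
```

The two results should agree, which is why a single parameter matrix can be learned once `x0 = 1` is present.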

In [72]:
# m = number of samples
# n = number of features (use X rather than X_b so the bias column is not counted)
m, n = X.shape

### Train/Validation/Test Split

In [73]:
ratio_valid = 0.2
ratio_test = 0.2

valid_count = int(m * ratio_valid)
test_count = int(m * ratio_test)
train_count = m - valid_count - test_count

print(f'Number of training samples: {train_count}')
print(f'Number of validation samples: {valid_count}')
print(f'Number of test samples: {test_count}')

# Set the random seed to the same value as the exercise solution
np.random.seed(2042)
perms = np.random.permutation(m)

X_train = X_b[perms[:train_count]]
y_train = y[perms[:train_count]]

X_valid = X_b[perms[train_count:train_count + valid_count]]
y_valid = y[perms[train_count:train_count + valid_count]]

X_test = X_b[perms[train_count + valid_count:]]
y_test = y[perms[train_count + valid_count:]]

print(X_train.shape, y_train.shape,
      X_valid.shape, y_valid.shape,
      X_test.shape, y_test.shape,
      sep='\n')

Number of training samples: 90
Number of validation samples: 30
Number of test samples: 30
(90, 3)
(90,)
(30, 3)
(30,)
(30, 3)
(30,)
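
The split logic above can also be sketched as a reusable function with a sanity check that no sample appears in more than one subset. The helper name `split_indices` is hypothetical, and it uses a local `np.random.RandomState` instead of seeding the global generator, so treat it as one possible arrangement under those assumptions, not as the notebook's method.

```python
import numpy as np

def split_indices(m, ratio_valid=0.2, ratio_test=0.2, seed=2042):
    """Return shuffled (train, valid, test) index arrays for m samples."""
    valid_count = int(m * ratio_valid)
    test_count = int(m * ratio_test)
    train_count = m - valid_count - test_count

    # Local generator: the global NumPy random state is left untouched
    rng = np.random.RandomState(seed)
    perms = rng.permutation(m)

    train_idx = perms[:train_count]
    valid_idx = perms[train_count:train_count + valid_count]
    test_idx = perms[train_count + valid_count:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(150)

# Every sample should land in exactly one subset
all_idx = np.concatenate([train_idx, valid_idx, test_idx])
assert len(all_idx) == 150
assert len(np.unique(all_idx)) == 150

print(len(train_idx), len(valid_idx), len(test_idx))
```

Returning indices rather than sliced arrays keeps the helper independent of `X_b` and `y`, so the same indices can be applied to both.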


### Class Probabilities

In [74]:
# For multiclass classification we need to convert each integer target
# class into an array of 0/1 values indicating whether or not
# the sample belongs to each class.
# This is essentially one-hot encoding.

def class_probabilities(y):
    # Assumes labels are 0..K-1 and that the largest label appears in y
    n_classes = y.max() + 1
    m = len(y)
    Y_one_hot = np.zeros((m, n_classes))
    
    # indexes are determined pairwise i.e. [sample_1,target_1] = 1
    # therefore this has the effect of, for each sample, setting
    # a value of 1 in the one hot column where the index = target class y
    # e.g.
    # sample_1 = row 0, target_1 = y[0] = 2
    # Y_one_hot[0,2] = 1
    
    Y_one_hot[np.arange(m), y] = 1
    return Y_one_hot

class_probabilities(y_train[:10])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
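
Because `class_probabilities` infers the number of classes from `y.max() + 1`, a subset of labels that happens to lack the highest class would produce fewer columns. The sketch below is one possible variant, under the assumption that labels are integers starting at 0; the function name `to_one_hot` and its `n_classes` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def to_one_hot(y, n_classes=None):
    """One-hot encode integer labels; n_classes may be given explicitly."""
    y = np.asarray(y)
    if n_classes is None:
        n_classes = y.max() + 1  # same inference as class_probabilities
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(n_classes)[y]

labels = np.array([0, 1, 1, 0])        # class 2 happens to be absent

inferred = to_one_hot(labels)          # column count inferred from the labels
explicit = to_one_hot(labels, 3)       # column count fixed by the caller

print(inferred.shape)
print(explicit.shape)
```

Passing the class count explicitly keeps the training, validation and test targets the same width even if one split is missing a class.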