# Implementing basic Softmax Regression

Hello, everyone. It's me again. This time I would like to introduce Softmax Regression, also known as Multinomial Logistic Regression.

First things first. Softmax regression is NOT a regression model. Like Logistic Regression, it is a probabilistic classifier. For historical reasons the term regression stuck.

## Contents
  - One-vs-Rest Logistic Regression
  - Softmax Regression
  - Softmax regression intuition
  - The Softmax function
  - The Cross-Entropy loss function
  - Training
  - Validation
  - Debug

## One-vs-Rest Logistic Regression

One-vs-rest (also called one-vs-all) logistic regression for multiclass problems is simple. You train $K$ different binary classifiers for $K$ different classes, where for each classifier the intended class is labeled 1 and all other classes 0.

In pseudocode, the algorithm is as follows: <br>
Inputs: 
  - $L$, a learner
  - Samples $X$
  - Labels $y$ where $y \in \{1, ..., K \}$ <br>
Output:
  - a list of classifiers $f_k$ for $k \in \{1, ..., K \}$ <br>
Procedure:
  - For each $k$ in $\{1,...,K\}$
    - Construct a new label vector $z$ where $z_i=1$ if $y_i=k$ and $z_i = 0$ otherwise
    - Apply $L$ to $X$, $z$ to obtain $f_k$ <br>

Making decisions means applying all classifiers to an unseen sample $x$ and predicting the label $k$ for which the corresponding classifier reports the highest confidence score:
$$\widehat{y} = \underset{k \in \{1,...,K\}}{\operatorname{arg\,max}} \; f_k(x)$$
I copied the above pseudocode shamelessly from Wikipedia.
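
The pseudocode above can be sketched in a few lines of Python. This is only a minimal illustration, not how scikit-learn implements it internally: the helper names `fit_one_vs_rest` and `predict_one_vs_rest` and the toy data are made up for this example, the learner $L$ is assumed to be scikit-learn's `LogisticRegression`, and its `decision_function` is assumed to play the role of the confidence score $f_k(x)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_one_vs_rest(X, y, make_learner=LogisticRegression):
    """Train one binary classifier per class (hypothetical helper)."""
    classes = np.unique(y)
    classifiers = []
    for k in classes:
        # New label vector z: 1 where y_i == k, 0 otherwise.
        z = (y == k).astype(int)
        classifiers.append(make_learner().fit(X, z))
    return classes, classifiers


def predict_one_vs_rest(classes, classifiers, X):
    """Predict the class whose classifier reports the highest score."""
    # One column of confidence scores per class.
    scores = np.column_stack([f.decision_function(X) for f in classifiers])
    return classes[np.argmax(scores, axis=1)]


# Toy data (made up): three well-separated clusters of four points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_toy = np.vstack([c + offsets for c in centers])
y_toy = np.repeat([0, 1, 2], len(offsets))

classes, classifiers = fit_one_vs_rest(X_toy, y_toy)
y_toy_pred = predict_one_vs_rest(classes, classifiers, X_toy)
print(y_toy_pred)
```

Each entry of `classifiers` only ever sees a binary problem; the multiclass decision happens entirely in the final `argmax`.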

For a better understanding, let's run an OvR (One-vs-Rest) Logistic Regression and look at its results.

In [97]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# This time we're going to use a much better dataset than previous simple datapoints.
from sklearn import datasets
from sklearn.metrics import confusion_matrix, accuracy_score
from mlxtend.plotting import plot_decision_regions
import os
import struct
np.set_printoptions(suppress=True)

In [98]:
iris = datasets.load_iris()
X = iris.data[:, [0, 3]]
y = iris.target

First, let's inspect our data samples to see what we're dealing with.

In [99]:
print("Labels: ", np.unique(y))
print("Number of samples:", X.shape[0])
print("Number of features:", X.shape[1])

Labels:  [0 1 2]
Number of samples: 150
Number of features: 2


So, our data has 150 samples with 2 features, spread over 3 classes ($k \in \{0, 1, 2\}$). We keep only two of the four Iris features (sepal length and petal width, columns 0 and 3) because that makes the data easier to plot and visualize. <br>
Let's train an OvR classifier on the data and get some results.

In [100]:
# Base binary learner; OneVsRestClassifier fits one copy of it per class.
classifier = LogisticRegression(max_iter=100)
multi_classifier = OneVsRestClassifier(classifier)
multi_classifier.fit(X, y)
# Evaluate on the training data itself (no held-out split here).
y_pred = multi_classifier.predict(X)
print("The confusion matrix:")
print(confusion_matrix(y, y_pred))
print("Accuracy:")
print(accuracy_score(y, y_pred))

The confusion matrix:
[[50  0  0]
 [ 0 38 12]
 [ 0  2 48]]
Accuracy:
0.9066666666666666
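
The accuracy can be read straight off the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total, here $(50 + 38 + 48) / 150$. A small sketch, with the matrix copied by hand from the output above (the variable names are made up for this example):

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[50, 0, 0],
               [0, 38, 12],
               [0, 2, 48]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Fraction of each true class that was predicted correctly (per-class recall).
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

The per-class view shows where the errors come from: class 0 is always recovered, while 12 of the 50 class 1 samples are predicted as class 2.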


With OvR Logistic Regression we get an accuracy of 90.7% and the decision boundary looks good. All is well, right?
Let's see what the predicted class probabilities sum to for the first 10 samples.

In [101]:
y_proba = multi_classifier.predict_proba(X)
print(y_proba[:10].sum(axis = 1))

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
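
These sums come out as exactly 1 because scikit-learn documents that, in the single-label multiclass case, the rows returned by `OneVsRestClassifier.predict_proba` sum to 1. The underlying binary classifiers are trained independently, so their raw outputs carry no such guarantee. The sketch below refits the same model so that it runs on its own and compares the two; `raw` is a name made up for this example, and `estimators_` is the fitted attribute holding one binary classifier per class.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = datasets.load_iris()
X = iris.data[:, [0, 3]]
y = iris.target

multi_classifier = OneVsRestClassifier(LogisticRegression(max_iter=100)).fit(X, y)

# Probability of the positive class from each binary classifier,
# one column per class.
raw = np.column_stack(
    [est.predict_proba(X)[:, 1] for est in multi_classifier.estimators_]
)
y_proba = multi_classifier.predict_proba(X)

print(raw[:10].sum(axis=1))      # raw per-class outputs: not constrained to sum to 1
print(y_proba[:10].sum(axis=1))  # values returned by OneVsRestClassifier

# Check whether dividing each row of raw scores by its sum reproduces y_proba.
print(np.allclose(raw / raw.sum(axis=1, keepdims=True), y_proba))
```

So the tidy row sums are the result of a rescaling step applied afterwards, not a property of the three binary models themselves.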
