# <font color="blue"> Softmax Function </font>
In this lab, we will explore the softmax function. This function is used in both Softmax Regression and in Neural Networks when solving Multiclass Classification problems.  

In [1]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('./Materials_By_Deeplearning/deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
%matplotlib widget
from matplotlib.widgets import Slider

import sys
sys.path.append('./Materials_By_Deeplearning')
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

## Softmax Function
In both softmax regression and neural networks with softmax outputs, N outputs are generated and one output is selected as the predicted category. In both cases a vector $\mathbf{z}$ is generated by a linear function and then passed to a softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution as described below. After applying softmax, each output is between 0 and 1 and the outputs sum to 1, so they can be interpreted as probabilities. Larger inputs correspond to larger output probabilities.
The softmax function can be written:
$$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k} }} \tag{1}$$
The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:
\begin{align}
\mathbf{a}(\mathbf{x}) =
\begin{bmatrix}
P(y = 1 | \mathbf{x}; \mathbf{w},b) \\
\vdots \\
P(y = N | \mathbf{x}; \mathbf{w},b)
\end{bmatrix}
=
\frac{1}{ \sum_{k=1}^{N}{e^{z_k} }}
\begin{bmatrix}
e^{z_1} \\
\vdots \\
e^{z_{N}} \\
\end{bmatrix} \tag{2}
\end{align}


This shows that the output is a vector of probabilities. The first entry is the probability that the input belongs to the first category, given the input $\mathbf{x}$ and parameters $\mathbf{w}$ and $\mathbf{b}$.  
Let's create a NumPy implementation:
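
A minimal sketch of equation (1) in NumPy. The function name `my_softmax` and the example vector `z_example` are illustrative choices, not part of the lab materials.

```python
import numpy as np

def my_softmax(z):
    """Convert a 1-D vector of scores z into probabilities using equation (1)."""
    z = np.asarray(z, dtype=float)
    ez = np.exp(z)           # element-wise exponential e^{z_k}
    return ez / np.sum(ez)   # divide by the sum so the entries add to 1

z_example = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative scores (assumption)
a_example = my_softmax(z_example)
print(a_example)
print(a_example.sum())
```

The largest input (the last entry here) receives the largest probability, and the outputs sum to 1 up to floating-point precision.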

## Cost

The loss function associated with Softmax, the cross-entropy loss, is:
\begin{equation}
  L(\mathbf{a},y)=\begin{cases}
    -\log(a_1), & \text{if $y=1$}\\
        &\vdots\\
     -\log(a_N), & \text{if $y=N$}
  \end{cases} \tag{3}
\end{equation}

Here, $y$ is the target category for this example and $\mathbf{a}$ is the output of a softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to one.
>**Recall:** In this course, Loss is for one example while Cost covers all examples. 
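
The loss in (3) can be sketched for a single example as follows. The function name `cross_entropy_loss` and the probability vector are made up for illustration. The target `y` is 1-based here to match the equation, whereas label arrays in Python code are usually 0-based.

```python
import numpy as np

def cross_entropy_loss(a, y):
    """Loss (3) for one example: a is a probability vector, y is the 1-based target category."""
    return -np.log(a[y - 1])  # shift by one because NumPy arrays are 0-indexed

a_example = np.array([0.1, 0.7, 0.2])               # made-up softmax output (assumption)
loss_confident = cross_entropy_loss(a_example, 2)   # target has probability 0.7
loss_unsure = cross_entropy_loss(a_example, 1)      # target has probability 0.1
print(loss_confident, loss_unsure)
```

The loss is small when the target category has a probability close to 1 and grows as that probability approaches 0.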
 
 
Note that in (3) above, only the line that corresponds to the target contributes to the loss; the other lines contribute zero. To write the cost equation we need an 'indicator function' that is 1 when the index matches the target and 0 otherwise. 
    $$\mathbf{1}\{y == n\} = \begin{cases}
    1, & \text{if $y==n$}.\\
    0, & \text{otherwise}.
  \end{cases}$$
Now the cost is:
\begin{align}
J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N}  \mathbf{1}\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right] \tag{4}
\end{align}

Here, $m$ is the number of examples and $N$ is the number of outputs. The cost is the average of the losses over all examples.
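
The cost in (4) can be sketched in NumPy with an explicit indicator matrix. The name `softmax_cost` and the example arrays are assumptions for illustration. Targets are 0-based here, which matches how `y_train` is stored later in this lab.

```python
import numpy as np

def softmax_cost(z, y):
    """Cost (4): z is an (m, N) array of scores, y is an (m,) array of 0-based targets."""
    m, n_out = z.shape
    # log(e^{z_j} / sum_k e^{z_k}) = z_j - log(sum_k e^{z_k}), computed row by row
    log_a = z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))
    # indicator[i, j] is True exactly when y[i] == j
    indicator = (y[:, None] == np.arange(n_out))
    return -np.sum(indicator * log_a) / m

z_example = np.array([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])   # made-up scores (assumption)
y_example = np.array([0, 1])
cost = softmax_cost(z_example, y_example)

# Cross-check: average the per-example losses -log(a_y) directly
a = np.exp(z_example) / np.exp(z_example).sum(axis=1, keepdims=True)
cost_check = -np.mean(np.log(a[np.arange(len(y_example)), y_example]))
print(cost, cost_check)
```

The indicator matrix zeroes out every term except the one for the target category, so the double sum reduces to the average of the per-example losses.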


## TensorFlow
This lab discusses two ways of implementing softmax with the cross-entropy loss in TensorFlow: the 'obvious' method and the 'preferred' method. The former is the most straightforward, while the latter is more numerically stable.

Let's start by creating a dataset to train a multiclass classification model.

In [2]:
# make a dataset for the example: four 2-D clusters, one per category
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
x_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)
print("x_train",x_train.shape)
print("y_train",y_train.shape)

x_train (2000, 2)
y_train (2000,)


In [3]:
model = Sequential(
    [
        Dense(units = 25, activation = "relu", name= "Layer_1"),
        Dense(units = 15, activation = "relu", name="Layer_2"),
        Dense(units = 4,  activation = "softmax", name="Layer_3")
    ],name="SoftmaxModel"
)

In [4]:

model.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(0.001)
)



In [5]:
model.fit(x_train,y_train, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x11a6aab4410>

In [6]:
model.summary()

Model: "SoftmaxModel"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Layer_1 (Dense)             (None, 25)                75        
                                                                 
 Layer_2 (Dense)             (None, 15)                390       
                                                                 
 Layer_3 (Dense)             (None, 4)                 64        
                                                                 
Total params: 529 (2.07 KB)
Trainable params: 529 (2.07 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
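
The parameter counts in the summary can be checked by hand. A Dense layer has one weight per input-output pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check; the input size of 2 comes from the shape of `x_train`.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per unit (n_out) in a Dense layer."""
    return n_in * n_out + n_out

layer_sizes = [2, 25, 15, 4]  # 2 input features, then the units of Layer_1 to Layer_3
counts = [dense_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts, sum(counts))  # → [75, 390, 64] 529
```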


In [7]:
Model_Pridect = model.predict(x_train)
print(Model_Pridect)
print("largest value", np.max(Model_Pridect), "smallest value", np.min(Model_Pridect))

[[9.32e-03 2.94e-03 9.70e-01 1.75e-02]
 [9.91e-01 8.50e-03 9.11e-05 1.28e-06]
 [9.58e-01 4.12e-02 7.80e-04 3.44e-05]
 ...
 [3.36e-03 9.90e-01 2.76e-04 6.79e-03]
 [3.39e-05 3.10e-04 1.90e-04 9.99e-01]
 [4.80e-03 8.80e-04 9.92e-01 2.48e-03]]
largest value 0.99999654 smallest value 1.4348449e-10


In [8]:
for i in range(5):
    print( f"{Model_Pridect[i]}, category: {np.argmax(Model_Pridect[i])}")

[0.01 0.   0.97 0.02], category: 2
[9.91e-01 8.50e-03 9.11e-05 1.28e-06], category: 0
[9.58e-01 4.12e-02 7.80e-04 3.44e-05], category: 0
[6.43e-03 9.89e-01 2.89e-04 4.59e-03], category: 1
[2.91e-03 3.64e-05 9.97e-01 7.38e-05], category: 2


### <b> As we learned, the model above can suffer from round-off error because a computer has limited space to store a number in floating-point format. To reduce round-off error, we avoid storing the softmax output in an intermediate variable and instead pass the raw outputs of the final layer directly to the loss function, so that the softmax and the loss are computed together and the error is as small as possible. </b> 
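
The effect can be illustrated with a small NumPy sketch. It compares computing the loss through an explicit softmax with computing it directly from the raw scores using the log-sum-exp trick. The function names and example vectors are assumptions for illustration, and TensorFlow's internal implementation may differ in detail.

```python
import numpy as np

def loss_via_softmax(z, y):
    """'Obvious' route: compute softmax probabilities first, then take -log of the target entry."""
    a = np.exp(z) / np.sum(np.exp(z))
    return -np.log(a[y])

def loss_from_logits(z, y):
    """'Preferred' route: compute the same loss directly from the raw scores z."""
    z_shift = z - np.max(z)  # shifting by max(z) does not change the softmax
    return np.log(np.sum(np.exp(z_shift))) - z_shift[y]

z_small = np.array([1.0, 2.0, 3.0])
z_large = np.array([10.0, 20.0, 1000.0])  # exp(1000) overflows float64

# Silence the overflow warnings that the naive route triggers on z_large
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_small = loss_via_softmax(z_small, 0)
    naive_large = loss_via_softmax(z_large, 0)
stable_small = loss_from_logits(z_small, 0)
stable_large = loss_from_logits(z_large, 0)

print(naive_small, stable_small)
print(naive_large, stable_large)
```

For moderate scores both routes agree. For very large scores the explicit softmax overflows and the loss is no longer finite, while the version that works on the raw scores still returns a usable value.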


In [11]:
# Modify the model: the output layer is now linear, so the model returns raw scores (logits)
model_Reduce_Error = Sequential(
    [
        Dense(units = 25, activation="relu", name="Layer_1"),
        Dense(units = 15, activation="relu", name="Layer_2"),
        Dense(units = 4, activation="linear", name="Layer_3")
    ], name = "SOFTMAX_ROUNDOFF_ERROR_SOLUTION"
)

# from_logits=True tells the loss that the model outputs are logits, not probabilities
model_Reduce_Error.compile(
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer = tf.keras.optimizers.Adam(0.001)
)



In [12]:
model_Reduce_Error.fit(x_train,y_train,epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x11a6be05110>

In [13]:
model_Reduce_Error.summary()

Model: "SOFTMAX_ROUNDOFF_ERROR_SOLUTION"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Layer_1 (Dense)             (None, 25)                75        
                                                                 
 Layer_2 (Dense)             (None, 15)                390       
                                                                 
 Layer_3 (Dense)             (None, 4)                 64        
                                                                 
Total params: 529 (2.07 KB)
Trainable params: 529 (2.07 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [14]:
Model_Pri_less_err = model_Reduce_Error.predict(x_train)
# The final layer is linear, so these are logits (raw scores), not probabilities
print(f"output vectors (logits):\n {Model_Pri_less_err}")
print("largest value", np.max(Model_Pri_less_err), "smallest value", np.min(Model_Pri_less_err))

output vectors (logits):
 [[-2.03  0.34  5.17  0.06]
 [ 6.74  1.74 -1.17 -8.27]
 [ 5.    1.68 -0.95 -6.55]
 ...
 [-3.25  4.2  -0.42 -1.08]
 [-8.74 -0.17  0.14  8.35]
 [-0.44  0.49  5.86 -1.6 ]]
largest value 15.194508 smallest value -13.431956
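
Because the final layer is linear, the values printed above are raw scores rather than probabilities: they can be negative and do not sum to 1. To recover probabilities, the scores must be passed through a softmax. TensorFlow provides `tf.nn.softmax` for this. The sketch below uses NumPy instead so that it is self-contained, with a hypothetical helper `softmax_rows` and the first three output rows copied from the printout.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax; subtracting each row's max avoids overflow without changing the result."""
    z = z - np.max(z, axis=1, keepdims=True)
    ez = np.exp(z)
    return ez / np.sum(ez, axis=1, keepdims=True)

# First three rows of raw scores copied from the printed output above
logits = np.array([[-2.03, 0.34, 5.17, 0.06],
                   [6.74, 1.74, -1.17, -8.27],
                   [5.0, 1.68, -0.95, -6.55]])
probs = softmax_rows(logits)
categories = np.argmax(probs, axis=1)

print(probs.sum(axis=1))  # each row sums to 1 up to floating-point precision
print(categories)         # → [2 0 0]
```

If only the predicted category is needed, the softmax step can be skipped: softmax preserves the ordering of the scores, so `np.argmax` on the raw scores selects the same category.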
