# Softmax and Cross-Entropy
Softmax and cross-entropy are two of the most commonly used functions in neural network classifiers.


# 🧮 Softmax Function

The **Softmax** function converts a vector of real numbers into a **probability distribution**, where each value is between **0 and 1**, and all probabilities **sum to 1**.

![softmax](https://media.geeksforgeeks.org/wp-content/uploads/20240706012340/Softmax-Activation-Function.webp)

---

## 🔹 Definition

Given an input vector:

$$
z = [z_1, z_2, ..., z_n]
$$

The **Softmax** function is defined as:

$$
\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \quad \text{for } i = 1, 2, ..., n
$$

---

## What's the use?
- The **Softmax** function is mainly used in **multi-class classification problems**, i.e., when the model must predict one class out of several possible classes (e.g., recognizing digits 0–9 or categories like cat/dog/horse).

- Neural networks usually output arbitrary real numbers, called logits. Softmax transforms these logits into probabilities that lie between 0 and 1 and sum to 1. **The largest logit gets the highest probability.**

- Without **softmax**, the outputs are unbounded and can be any real number, so they are hard to interpret as the model's confidence in each label.


In [2]:
import torch
import torch.nn as nn
import numpy as np

# ---------- NumPy Implementation ----------

# Define the softmax function manually using NumPy
def softmax(x):
    # Exponentiate each value and divide by the sum of all exponentials
    # This converts raw scores (logits) into probabilities that sum to 1
    return np.exp(x) / np.sum(np.exp(x), axis=0)

# Input vector (logits): raw model outputs before normalization
x = np.array([2.0, 1.0, 0.1])

# Compute softmax probabilities
outputs = softmax(x)
print('softmax numpy:', outputs)

# ---------- PyTorch Implementation ----------

# Convert the same input into a PyTorch tensor
x = torch.tensor([2.0, 1.0, 0.1])

# Use PyTorch's built-in softmax function
# dim=0 means we apply softmax along the first (and only) dimension of the tensor
outputs = torch.softmax(x, dim=0)

print('softmax torch:', outputs)

# Both implementations should give the same result.
# The outputs represent the probabilities of each class,
# and they always sum up to 1.


softmax numpy: [0.65900114 0.24243297 0.09856589]
softmax torch: tensor([0.6590, 0.2424, 0.0986])
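
The NumPy version above exponentiates the logits directly, so very large logits can overflow to `inf` and produce `nan`. A common remedy is to subtract the maximum logit before exponentiating; the result is unchanged because softmax is invariant to adding the same constant to every input. The sketch below assumes a 1-D input, and `stable_softmax` is an illustrative name that does not appear in the original notebook.

```python
import numpy as np

def stable_softmax(x):
    # Shift by the max so the largest exponent is exp(0) = 1 (avoids overflow)
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

small = np.array([2.0, 1.0, 0.1])
large = np.array([1000.0, 1001.0, 1002.0])  # np.exp(1000.0) alone would overflow to inf

probs_small = stable_softmax(small)
probs_large = stable_softmax(large)
print('stable softmax (small logits):', probs_small)
print('stable softmax (large logits):', probs_large)
```

For the small logits this gives the same probabilities as the cell above; for the large logits it still returns finite values that sum to 1.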


# 🧮 Sigmoid Function

The **Sigmoid** (or **Logistic**) function maps any real-valued number into a value between **0 and 1**.  
It is widely used to model probabilities in **binary classification problems**.

---

## 🔹 Definition

Given an input $z \in \mathbb{R}$:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

The output $\sigma(z)$ is always between **0 and 1**.

| z value | Output (σ(z)) | Interpretation |
|----------|----------------|----------------|
| Large positive | → 1 | Strong positive class confidence |
| Large negative | → 0 | Strong negative class confidence |
| 0 | 0.5 | Uncertain (equal probability) |

---

## 🔹 Derivative

The derivative of the sigmoid function is:

$$
\sigma'(z) = \sigma(z) \times (1 - \sigma(z))
$$

The sigmoid is smooth and differentiable everywhere, and its derivative can be computed directly from its own output. This makes it convenient for gradient-based optimization during neural network training.
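
As a quick sanity check of the derivative formula, the sketch below compares the analytic expression with a central finite-difference estimate. The helper names and the step size `eps` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Analytic derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
eps = 1e-5  # step size for the finite-difference estimate (an arbitrary small value)

# Central difference: (f(z + eps) - f(z - eps)) / (2 * eps)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_grad(z)

print('analytic:', analytic)
print('numeric :', numeric)
```

The derivative peaks at 0.25 when z = 0 and shrinks toward 0 as |z| grows.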

---

## 🔹 What's the use?

- The **Sigmoid** function is primarily used in **binary classification** problems where the output represents the probability of belonging to the **positive class** (e.g., "yes/no", "spam/not spam").

- It's also used in:
  - The **output layer** of binary classifiers.  
  - **Gating mechanisms** in RNNs and LSTMs (to control information flow).  
  - Logistic regression to model probabilities.

---

## 🔹 Example

For a model output (logit):

$$
z = 2.0
$$

The sigmoid activation is:

$$
\sigma(2.0) = \frac{1}{1 + e^{-2.0}} \approx 0.88
$$

This means the model is **88% confident** in predicting the positive class.

---

## 🧠 Intuition

- When $z$ is large and positive → output ≈ **1**
- When $z$ is large and negative → output ≈ **0**
- It **squashes** unbounded input values into the bounded, smooth range $(0, 1)$, which makes it well suited to representing probabilities.

---

✅ **In summary:**
- Use **Sigmoid** for **binary classification**.  
- Use **Softmax** for **multi-class classification** (one of many classes).  
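
The two functions are closely related: sigmoid is what softmax reduces to when there are two classes and the second logit is fixed at 0. The check below is a small illustration with an arbitrary logit value; the helper functions are redefined here so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

z = 2.0  # arbitrary example logit
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]  # probability of the first of two classes

print('sigmoid(z):        ', p_sigmoid)
print('softmax([z, 0])[0]:', p_softmax)
```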


In [1]:
import torch
import torch.nn as nn
import numpy as np

# ---------- NumPy Implementation ----------
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([2.0, 1.0, 0.1])
outputs = sigmoid(x)
print('sigmoid numpy:', outputs)

# ---------- PyTorch Implementation ----------
x = torch.tensor([2.0, 1.0, 0.1])
outputs = torch.sigmoid(x)
print('sigmoid torch:', outputs)


sigmoid numpy: [0.88079708 0.73105858 0.52497919]
sigmoid torch: tensor([0.8808, 0.7311, 0.5250])


# 🧮 Cross-Entropy Loss

**Cross-Entropy Loss** (also called **Log Loss**) measures the difference between two probability distributions: the **true labels** (what the data actually is) and the **predicted probabilities** (what your model outputs).


It is the most common loss function for classification problems, especially when used in conjunction with a **softmax** output layer.

In a multi-class classification setting, we usually one-hot encode the true label.  

For example:  
$$
y_{\text{true}} = [1, 0, 0, 0]
$$

This indicates the correct class is the first one.  
Suppose the model's softmax output is:

$$
\hat{y} = [0.4, 0.1, 0.3, 0.2]
$$

This means the model predicts the first class is most likely, but only with **40% confidence**. Cross-entropy loss quantifies how "wrong" or "uncertain" that prediction is, and the training objective is to **minimize** this loss.

---

## 🔹 Formulae

### ✅ Multi-Class Cross-Entropy (Single Sample)

$$
L = - \sum_{j=1}^{C} y_{j} \log(\hat{y}_{j})
$$

where:
- $C$ = number of classes
- $y_{j}$ = true label for class *j* (one-hot: 1 for the correct class, 0 for the others)
- $\hat{y}_{j}$ = predicted probability that the sample belongs to class *j*
- $\log$ = natural logarithm

---

### ✅ Multi-Class Cross-Entropy (Batch)

$$
\text{Loss}_{\text{batch}} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log(\hat{y}_{i,j})
$$

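where $N$ is the number of samples, and $y_{i,j}$ and $\hat{y}_{i,j}$ are the true label and predicted probability of class *j* for sample *i*.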
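
The batch formula can be written out directly in NumPy. This sketch assumes one-hot labels and predicted probabilities that are already normalized; `cross_entropy_batch` and the two-sample batch are illustrative, with the first row taken from the example at the start of this section.

```python
import numpy as np

def cross_entropy_batch(y_true, y_hat):
    # y_true: (N, C) one-hot labels; y_hat: (N, C) predicted probabilities
    # Sum over classes for each sample, then average over the N samples
    return -np.mean(np.sum(y_true * np.log(y_hat), axis=1))

y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0]])
y_hat = np.array([[0.4, 0.1, 0.3, 0.2],
                  [0.1, 0.7, 0.1, 0.1]])

loss = cross_entropy_batch(y_true, y_hat)
print(f'batch cross-entropy: {loss:.4f}')
```

Because the labels are one-hot, the result is simply the average of -log(0.4) and -log(0.7), the negative log-probabilities assigned to the correct classes.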

---

### ✅ Binary Cross-Entropy

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \Big)
$$

where:  
- $y_{i} \in \{0,1\}$ is the true label for sample *i*  
- $p_{i}$ is the predicted probability of class "1" for sample *i*
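
A minimal NumPy sketch of the binary formula, assuming the predictions are probabilities strictly between 0 and 1 (a value of exactly 0 or 1 would make the logarithm blow up). The function name, labels and probabilities are made-up values for illustration.

```python
import numpy as np

def binary_cross_entropy(y, p):
    # y: true labels in {0, 1}; p: predicted probability of class 1
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])

loss = binary_cross_entropy(y, p)
print(f'binary cross-entropy: {loss:.4f}')
```

For each sample only one of the two terms is active: `log(p)` when the label is 1 and `log(1 - p)` when the label is 0.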

---

## 🔹 Worked Example

Using the earlier example:  

- True label:  
  $y_{\text{true}} = [1, 0, 0, 0]$  
- Predicted probabilities:  
  $\hat{y} = [0.4, 0.1, 0.3, 0.2]$

Then the loss is:

$$
\begin{aligned}
L &= - \Big(1 \times \log(0.4) + 0 \times \log(0.1) + 0 \times \log(0.3) + 0 \times \log(0.2)\Big) \\
  &= -\log(0.4) \\
  &\approx -(-0.916) \quad (\text{since } \log(0.4) \approx -0.916) \\
  &\approx 0.916
\end{aligned}
$$

So the **cross-entropy loss ≈ 0.916**.

That's fairly high: a perfectly confident, correct prediction would give a loss approaching 0, while a uniform guess over the four classes would give $-\log(0.25) \approx 1.386$. It indicates the model is uncertain in its prediction (40% confidence) and there's room for improvement.

If instead the model predicted:  

$$
\hat{y} = [0.9, 0.05, 0.03, 0.02]
$$

Then:

$$
L = -\log(0.9) \approx 0.105
$$

This is much lower: the model places more probability on the correct class, so the loss is smaller.
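
The two hand calculations above can be reproduced with a few lines of plain Python. The `cross_entropy` helper is an illustrative name and assumes a one-hot label vector.

```python
import math

def cross_entropy(y_true, y_hat):
    # Sum of -y_j * log(y_hat_j) over classes; zero-label terms contribute nothing
    return -sum(y * math.log(p) for y, p in zip(y_true, y_hat))

y_true = [1, 0, 0, 0]
loss_uncertain = cross_entropy(y_true, [0.4, 0.1, 0.3, 0.2])
loss_confident = cross_entropy(y_true, [0.9, 0.05, 0.03, 0.02])

print(f'loss for 40% on the correct class: {loss_uncertain:.3f}')
print(f'loss for 90% on the correct class: {loss_confident:.3f}')
```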

---

## 🧠 Intuition & Key Points

- The lower the loss, the **closer** the predicted distribution is to the true distribution (i.e., high probability on the true class).

- If the model gives very low probability to the correct class (e.g., 0.1, for a loss of $-\log(0.1) \approx 2.303$), the loss becomes large, so confident but wrong predictions are penalized heavily.

- This loss function **works smoothly** with gradient-based optimization because it's differentiable and provides strong gradients when predictions are poor.

- Typically used with **softmax** in multi-class classification, or with **sigmoid** in binary classification.
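
One reason the softmax and cross-entropy pairing trains well is that the gradient of the loss with respect to the logits has a simple closed form, `softmax(z) - y`. The sketch below checks this numerically with central finite differences; the function names, example logits and step size are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def cross_entropy_from_logits(z, y):
    # y is a one-hot label vector
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1, 0.5])   # example logits
y = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot label: class 0 is correct

analytic_grad = softmax(z) - y

# Central finite differences, one logit at a time
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric_grad[i] = (cross_entropy_from_logits(z + step, y)
                       - cross_entropy_from_logits(z - step, y)) / (2 * eps)

print('analytic:', analytic_grad)
print('numeric :', numeric_grad)
```

The gradient entry for the correct class has magnitude one minus its predicted probability, so it is largest exactly when the model assigns the correct class a low probability.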


In [3]:
import torch
import torch.nn as nn

# ---------- Example 1: Sigmoid output (Binary Classification) ----------

# Logits from the model (before sigmoid)
logits = torch.tensor([2.0, -1.0, 0.5]) # 3 samples

# True binary labels
y_true = torch.tensor([1.0, 0.0, 1.0]) # 3 samples

# Sigmoid probabilities
y_pred_sigmoid = torch.sigmoid(logits)
print("Sigmoid outputs:", y_pred_sigmoid)

# Binary Cross-Entropy Loss (using BCELoss)
bce_loss = nn.BCELoss()
loss_sigmoid = bce_loss(y_pred_sigmoid, y_true)
print("Binary Cross-Entropy Loss (Sigmoid):", loss_sigmoid.item())

# ---------- Example 2: Softmax output (Multi-Class Classification) ----------

# Logits for 3 samples and 4 classes (before softmax)
logits_multi = torch.tensor([[2.0, 1.0, 0.1, 0.5],
                             [0.3, 2.2, 0.5, 1.1],
                             [1.2, 0.7, 1.5, 0.3]])

# True class labels (indices of correct class)
y_true_multi = torch.tensor([0, 1, 2])  # class indices

# Softmax probabilities (optional, not needed for CrossEntropyLoss)
y_pred_softmax = torch.softmax(logits_multi, dim=1)
print("\nSoftmax outputs:", y_pred_softmax)

# Categorical Cross-Entropy Loss (using CrossEntropyLoss)
# Note: CrossEntropyLoss in PyTorch expects raw logits (not softmax)
ce_loss = nn.CrossEntropyLoss()
loss_softmax = ce_loss(logits_multi, y_true_multi)
print("Cross-Entropy Loss (Softmax):", loss_softmax.item())


Sigmoid outputs: tensor([0.8808, 0.2689, 0.6225])
Binary Cross-Entropy Loss (Sigmoid): 0.3047555685043335

Softmax outputs: tensor([[0.5745, 0.2114, 0.0859, 0.1282],
        [0.0898, 0.6006, 0.1097, 0.1999],
        [0.2974, 0.1804, 0.4014, 0.1209]])
Cross-Entropy Loss (Softmax): 0.6589792370796204
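
Just as `nn.CrossEntropyLoss` takes raw logits, PyTorch also provides `nn.BCEWithLogitsLoss`, which applies the sigmoid internally and is described in the PyTorch documentation as more numerically stable than a separate sigmoid followed by `nn.BCELoss`. The short check below reuses the binary logits and labels from the cell above and suggests that both routes give the same value on this example.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])
y_true = torch.tensor([1.0, 0.0, 1.0])

# Route 1: explicit sigmoid, then BCELoss on probabilities
loss_probs = nn.BCELoss()(torch.sigmoid(logits), y_true)

# Route 2: BCEWithLogitsLoss on raw logits (sigmoid is applied internally)
loss_logits = nn.BCEWithLogitsLoss()(logits, y_true)

print('BCELoss on probabilities:   ', loss_probs.item())
print('BCEWithLogitsLoss on logits:', loss_logits.item())
```

If a model is trained with `nn.BCEWithLogitsLoss`, its `forward` method should return the raw logit rather than the sigmoid output.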


# A Practical Example
Imagine we have a neural network that predicts which fruit is in an image.

**Classes**: 0 = Apple (class0), 1 = Banana (class1), 2 = Cherry (class2)

We want to classify 3 images (samples).




In [9]:
import torch
import torch.nn as nn
import numpy as np

# CrossEntropyLoss combines Softmax + Log Loss internally.
loss = nn.CrossEntropyLoss()

# 3 sample images that we pass through the neural network to classify.
#                    [Sample1, Sample2, Sample3]
#                    [Cherry, Apple, Banana]
Y_true = torch.tensor([2, 0, 1])

# n_samples x n_classes = 3 x 3
# Good predictions (logits)
# Each row holds one raw score (logit) per class; these are not one-hot vectors or probabilities
#                           [class0, class1, class2]
Y_pred_good = torch.tensor([[0.1, 1.0, 2.1],   # Sample 1: highest score for class 2 -> correct
                            [2.0, 1.0, 0.1],   # Sample 2: highest score for class 0 -> correct
                            [0.1, 3.0, 0.1]])  # Sample 3: highest score for class 1 -> correct

# Bad predictions (logits)
Y_pred_bad = torch.tensor([[2.1, 1.0, 0.1],   # Sample 1: highest score for class 0 -> wrong
                           [0.1, 1.0, 2.1],   # Sample 2: highest score for class 2 -> wrong
                           [0.1, 3.0, 0.1]])  # Sample 3: highest score for class 1 -> correct

# Compute the mean cross-entropy loss for each set of predictions
l1 = loss(Y_pred_good, Y_true)
l2 = loss(Y_pred_bad, Y_true)

print(f'Loss1 torch: {l1.item():.4f}') # small loss -> good predictions
print(f'Loss2 torch: {l2.item():.4f}') # larger loss -> bad predictions

# Get predicted classes
_, predictions1 = torch.max(Y_pred_good, 1)
_, predictions2 = torch.max(Y_pred_bad, 1)

print(predictions1) # tensor([2, 0, 1]) -> matches the true labels Y_true
print(predictions2) # tensor([0, 2, 1]) -> some wrong predictions


Loss1 torch: 0.3018
Loss2 torch: 1.6242
tensor([2, 0, 1])
tensor([0, 2, 1])
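
To see which samples drive the difference between the two losses, `nn.CrossEntropyLoss` accepts `reduction='none'`, which returns one loss per sample instead of the mean. The sketch below repeats the tensors from the cell above so that it runs on its own; `per_sample_loss` is an illustrative name.

```python
import torch
import torch.nn as nn

Y_true = torch.tensor([2, 0, 1])
Y_pred_good = torch.tensor([[0.1, 1.0, 2.1],
                            [2.0, 1.0, 0.1],
                            [0.1, 3.0, 0.1]])
Y_pred_bad = torch.tensor([[2.1, 1.0, 0.1],
                           [0.1, 1.0, 2.1],
                           [0.1, 3.0, 0.1]])

# reduction='none' keeps the per-sample losses instead of averaging them
per_sample_loss = nn.CrossEntropyLoss(reduction='none')

losses_good = per_sample_loss(Y_pred_good, Y_true)
losses_bad = per_sample_loss(Y_pred_bad, Y_true)

print('per-sample loss (good):', losses_good)
print('per-sample loss (bad): ', losses_bad)

# Softmax turns each row of logits into class probabilities
print('probabilities (bad):\n', torch.softmax(Y_pred_bad, dim=1))
```

The third sample has the same logits in both cases, so its loss is identical; the first two samples account for the whole increase in the mean loss.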


# A Neural Network for Binary Classification (using Cross Entropy Loss)


In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ----------------------------
# 1. Define NeuralNet1
# ----------------------------
class NeuralNet1(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NeuralNet1, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)  # Input layer
        self.relu = nn.ReLU()                              # Activation
        self.linear2 = nn.Linear(hidden_size, 1)          # Output layer

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        y_pred = torch.sigmoid(out)  # sigmoid for binary classification
        return y_pred

# ----------------------------
# 2. Prepare Binary Classification Data
# ----------------------------
X_numpy, y_numpy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, n_classes=2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_numpy = scaler.fit_transform(X_numpy)

# Convert to tensors
X = torch.tensor(X_numpy, dtype=torch.float32)
y = torch.tensor(y_numpy.reshape(-1, 1), dtype=torch.float32)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ----------------------------
# 3. Initialize Model, Loss, Optimizer
# ----------------------------
input_size = X.shape[1]
hidden_size = 5
model = NeuralNet1(input_size=input_size, hidden_size=hidden_size)

criterion = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ----------------------------
# 4. Training Loop
# ----------------------------
epochs = 100
for epoch in range(epochs):
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 10 == 0:
        print(f"Epoch {epoch+1:03d}, Loss: {loss.item():.4f}")

# ----------------------------
# 5. Evaluate on Test Set
# ----------------------------
with torch.no_grad():
    y_test_pred = model(X_test)
    y_test_pred_class = (y_test_pred > 0.5).float()
    accuracy = (y_test_pred_class == y_test).sum() / y_test.shape[0]
    print(f"NeuralNet1 Test Accuracy: {accuracy.item() * 100:.2f}%")


Epoch 001, Loss: 0.7370
Epoch 011, Loss: 0.6855
Epoch 021, Loss: 0.6445
Epoch 031, Loss: 0.6041
Epoch 041, Loss: 0.5618
Epoch 051, Loss: 0.5207
Epoch 061, Loss: 0.4832
Epoch 071, Loss: 0.4518
Epoch 081, Loss: 0.4273
Epoch 091, Loss: 0.4082
NeuralNet1 Test Accuracy: 82.50%


# A Neural Network for Multi-Class Classification (using Cross Entropy Loss)

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ----------------------------
# 1. Define NeuralNet2
# ----------------------------
class NeuralNet2(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet2, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)    # Input layer
        self.relu = nn.ReLU()                                # Activation
        self.linear2 = nn.Linear(hidden_size, num_classes)  # Output layer

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)  # raw logits
        return out

# ----------------------------
# 2. Prepare Multi-Class Data
# ----------------------------
X_numpy, y_numpy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, n_classes=3, random_state=42
)

# Scale features
scaler = StandardScaler()
X_numpy = scaler.fit_transform(X_numpy)

# Convert to tensors
X = torch.tensor(X_numpy, dtype=torch.float32)
y = torch.tensor(y_numpy, dtype=torch.long)  # integer labels required for CrossEntropyLoss

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ----------------------------
# 3. Initialize Model, Loss, Optimizer
# ----------------------------
input_size = X.shape[1]
hidden_size = 5
num_classes = 3

model = NeuralNet2(input_size=input_size, hidden_size=hidden_size, num_classes=num_classes)

criterion = nn.CrossEntropyLoss()  # Multi-class loss
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ----------------------------
# 4. Training Loop
# ----------------------------
epochs = 100
for epoch in range(epochs):
    logits = model(X_train)
    loss = criterion(logits, y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if epoch % 10 == 0:
        print(f"Epoch {epoch+1:03d}, Loss: {loss.item():.4f}")

# ----------------------------
# 5. Evaluate on Test Set
# ----------------------------
with torch.no_grad():
    logits_test = model(X_test)
    y_test_pred = torch.argmax(logits_test, dim=1)
    accuracy = (y_test_pred == y_test).sum() / y_test.shape[0]
    print(f"NeuralNet2 Test Accuracy: {accuracy.item() * 100:.2f}%")


Epoch 001, Loss: 1.1683
Epoch 011, Loss: 1.1031
Epoch 021, Loss: 1.0626
Epoch 031, Loss: 1.0289
Epoch 041, Loss: 0.9984
Epoch 051, Loss: 0.9686
Epoch 061, Loss: 0.9369
Epoch 071, Loss: 0.9056
Epoch 081, Loss: 0.8763
Epoch 091, Loss: 0.8489
NeuralNet2 Test Accuracy: 63.33%
