<a href="https://colab.research.google.com/github/your-username/pytorch-for-deeplearning/blob/main/notebooks/02_mathematical_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2: Mathematical Operations

This notebook explores advanced mathematical operations and activation functions in PyTorch.

## Learning Objectives
- Perform advanced mathematical operations on tensors
- Understand and implement activation functions
- Visualize activation function behaviors
- Compare different activation functions

## Setup and Installation

In [None]:
# Install and import necessary libraries
try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
except ImportError:
    !pip install torch torchvision torchaudio
    import torch
    print(f"PyTorch installed. Version: {torch.__version__}")

import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set device and random seed
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
torch.manual_seed(42)
np.random.seed(42)

## 1. Advanced Mathematical Operations

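Beyond basic arithmetic, PyTorch provides element-wise mathematical functions: trigonometric, exponential, logarithmic, and power operations. Each one is applied independently to every entry of the tensor.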

In [None]:
# Create sample tensor
x = torch.linspace(-2, 2, 10)
print(f"Input tensor: {x}")

print("\n=== Mathematical Functions ===")
# Trigonometric functions
print(f"sin(x): {torch.sin(x)}")
print(f"cos(x): {torch.cos(x)}")
print(f"tan(x): {torch.tan(x)}")

# Exponential and logarithmic functions
x_pos = torch.linspace(0.1, 2, 5)  # Positive values for log
print(f"\nPositive input: {x_pos}")
print(f"exp(x): {torch.exp(x_pos)}")
print(f"log(x): {torch.log(x_pos)}")
print(f"log10(x): {torch.log10(x_pos)}")
print(f"sqrt(x): {torch.sqrt(x_pos)}")

# Power functions
print(f"\nPower functions:")
print(f"x^2: {torch.pow(x, 2)}")
print(f"x^3: {torch.pow(x, 3)}")
print(f"2^x: {torch.pow(2, x)}")
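
The cell above restricts `log` and `sqrt` to positive inputs. That matters because these functions do not raise an error on out-of-domain values; they return `nan` or `-inf`, which then spread silently through later computations. A minimal sketch (the tensor name `edge` is only illustrative):

```python
import torch

# Illustrative input: one negative value, zero, and one positive value
edge = torch.tensor([-1.0, 0.0, 1.0])

log_edge = torch.log(edge)    # nan for negative input, -inf at zero
sqrt_edge = torch.sqrt(edge)  # nan for negative input

print(f"log:  {log_edge}")
print(f"sqrt: {sqrt_edge}")

# Count problem entries before they propagate further
num_bad = (~torch.isfinite(log_edge)).sum().item()
print(f"Non-finite log entries: {num_bad}")
```

Checking with `torch.isfinite` after a risky operation is one simple way to catch such values early.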

## 2. Linear Algebra Operations

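Most of PyTorch's linear algebra routines live in the `torch.linalg` module. This section covers matrix products, determinants, norms, eigendecomposition, and inverses.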

In [None]:
# Matrix operations
A = torch.randn(3, 3)
B = torch.randn(3, 3)

print(f"Matrix A:\n{A}")
print(f"\nMatrix B:\n{B}")

# Basic operations
print(f"\n=== Linear Algebra Operations ===")
print(f"Matrix multiplication (A @ B):\n{A @ B}")
print(f"\nElement-wise multiplication (A * B):\n{A * B}")

# Advanced operations
print(f"\nDeterminant of A: {torch.linalg.det(A)}")
print(f"Trace of A: {torch.trace(A)}")
print(f"Frobenius norm of A: {torch.linalg.norm(A, 'fro')}")

# Eigenvalues and eigenvectors (returned as complex tensors, even for a real matrix)
eigenvals, eigenvecs = torch.linalg.eig(A)
print(f"\nEigenvalues: {eigenvals}")
print(f"Eigenvectors shape: {eigenvecs.shape}")

# Matrix inverse (if invertible)
try:
    A_inv = torch.linalg.inv(A)
    print(f"\nMatrix inverse exists")
    print(f"A @ A_inv ≈ I: {torch.allclose(A @ A_inv, torch.eye(3), atol=1e-6)}")
except RuntimeError:
    print("\nMatrix is not invertible")
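
To make the eigendecomposition less of a black box, the sketch below checks the defining relation `M @ V = V @ diag(λ)` and the fact that the eigenvalues sum to the trace. The matrix `M`, the seed, and the variable names are assumptions chosen for this example only.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so this sketch is reproducible
M = torch.randn(3, 3)

# eig returns complex tensors, even when the input matrix is real
eigenvals, eigenvecs = torch.linalg.eig(M)
print(f"Eigenvalue dtype: {eigenvals.dtype}")

# Check the defining relation M @ V = V @ diag(eigenvalues)
M_complex = M.to(eigenvals.dtype)
residual = (M_complex @ eigenvecs - eigenvecs @ torch.diag(eigenvals)).abs().max().item()
print(f"Largest residual: {residual:.2e}")

# The eigenvalues of a matrix sum to its trace
trace_gap = abs(torch.trace(M).item() - eigenvals.sum().real.item())
print(f"|trace - sum of eigenvalues|: {trace_gap:.2e}")
```

Both quantities should be close to zero, up to float32 rounding error.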

## 3. Activation Functions

Activation functions introduce the nonlinearity that lets a neural network model more than a linear map. Let's explore the most common ones.

In [None]:
# Create input range for activation functions
x = torch.linspace(-5, 5, 100)

# Define activation functions as callables so they can be applied to any input
activation_fns = {
    'ReLU': torch.relu,
    'Sigmoid': torch.sigmoid,
    'Tanh': torch.tanh,
    'LeakyReLU': lambda t: F.leaky_relu(t, 0.1),
    'ELU': F.elu,
    'GELU': F.gelu,
    'SiLU/Swish': F.silu,
    'Softplus': F.softplus
}

# Evaluate every activation over the full input range
activations = {name: fn(x) for name, fn in activation_fns.items()}

# Print some example values
test_vals = torch.tensor([-2., -1., 0., 1., 2.])
print("=== Activation Function Values ===")
print(f"Input: {test_vals}")
for name in ['ReLU', 'Sigmoid', 'Tanh', 'GELU']:
    result = activation_fns[name](test_vals)
    print(f"{name:8}: {result}")
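
The built-in functions hide the underlying formulas. As a rough sketch, the textbook definitions can be written by hand and compared against PyTorch's versions. The dictionaries `manual` and `builtin` are illustrative names, and the hand-written expressions are not how PyTorch implements these functions internally.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# Textbook definitions written by hand
manual = {
    'ReLU': torch.clamp(x, min=0),                       # max(0, x)
    'Sigmoid': 1 / (1 + torch.exp(-x)),                  # 1 / (1 + e^-x)
    'Tanh': (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x)),
    'LeakyReLU': torch.where(x > 0, x, 0.1 * x),         # slope 0.1 for x <= 0
    'ELU': torch.where(x > 0, x, torch.exp(x) - 1),      # alpha = 1, F.elu's default
}

# PyTorch's built-in versions with matching parameters
builtin = {
    'ReLU': torch.relu(x),
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'LeakyReLU': F.leaky_relu(x, 0.1),
    'ELU': F.elu(x),
}

for name in manual:
    match = torch.allclose(manual[name], builtin[name], atol=1e-6)
    print(f"{name:10} matches built-in: {match}")
```

In practice the built-ins are preferable, since they are faster and handle extreme inputs more carefully than these direct formulas.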

## 4. Activation Function Visualization

Let's visualize different activation functions to understand their behavior.

In [None]:
# Plot activation functions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

x = torch.linspace(-5, 5, 100)

activations = [
    ('ReLU', torch.relu(x)),
    ('Sigmoid', torch.sigmoid(x)),
    ('Tanh', torch.tanh(x)),
    ('LeakyReLU', F.leaky_relu(x, 0.1)),
    ('ELU', F.elu(x)),
    ('GELU', F.gelu(x)),
    ('SiLU/Swish', F.silu(x)),
    ('Softplus', F.softplus(x))
]

for i, (name, y) in enumerate(activations):
    axes[i].plot(x.numpy(), y.numpy(), linewidth=2, color='blue')
    axes[i].set_title(f'{name}', fontsize=12, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    axes[i].axhline(y=0, color='k', linewidth=0.5)
    axes[i].axvline(x=0, color='k', linewidth=0.5)
    axes[i].set_xlabel('x')
    axes[i].set_ylabel(f'{name}(x)')

plt.tight_layout()
plt.show()

# Print characteristics
print("\n=== Activation Function Characteristics ===")
print("ReLU: Simple, fast, but can cause dead neurons (gradient = 0 for x < 0)")
print("Sigmoid: Smooth, bounded in (0, 1), but suffers from vanishing gradients")
print("Tanh: Zero-centered, bounded in (-1, 1), with a larger peak gradient than sigmoid (1 vs. 0.25)")
print("LeakyReLU: Prevents dead neurons with small negative slope")
print("ELU: Smooth, can produce negative outputs")
print("GELU: Smooth approximation to ReLU, used in transformers")
print("SiLU/Swish: Smooth, self-gated, x * sigmoid(x)")
print("Softplus: Smooth approximation to ReLU, log(1 + exp(x))")
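
The formulas quoted for SiLU and Softplus above can be checked directly, and the same works for GELU, whose exact form is `x * Φ(x)` with `Φ` the standard normal CDF. This is a sketch under the assumption that `F.gelu` and `F.softplus` are called with their default arguments; the `*_manual` names are illustrative.

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 11)

# SiLU/Swish: x * sigmoid(x)
silu_manual = x * torch.sigmoid(x)

# Softplus: log(1 + exp(x)), written with log1p for accuracy near zero
softplus_manual = torch.log1p(torch.exp(x))

# Exact GELU: x * Phi(x), where Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
gelu_manual = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

print(f"SiLU matches:     {torch.allclose(silu_manual, F.silu(x), atol=1e-5)}")
print(f"Softplus matches: {torch.allclose(softplus_manual, F.softplus(x), atol=1e-5)}")
print(f"GELU matches:     {torch.allclose(gelu_manual, F.gelu(x), atol=1e-5)}")

# Softplus always lies above ReLU; the gap is largest at x = 0, where it equals log(2)
gap = (F.softplus(x) - torch.relu(x)).max().item()
print(f"Largest Softplus - ReLU gap: {gap:.4f}")
```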

## 5. Derivatives and Gradients

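Backpropagation multiplies the local derivatives of each layer together, so the shape of an activation's derivative determines how well gradients flow through a network. Here we compute those derivatives with autograd.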

In [None]:
# Compute gradients of activation functions
x = torch.linspace(-3, 3, 100, requires_grad=True)

# Function to compute the element-wise derivative d func(x) / dx
def compute_derivative(func, x_vals):
    y = func(x_vals)
    # Passing ones as grad_outputs yields dy_i/dx_i for element-wise functions
    grad_outputs = torch.ones_like(y)
    # Each call builds a fresh graph, so create_graph/retain_graph are not needed
    grads = torch.autograd.grad(outputs=y, inputs=x_vals,
                               grad_outputs=grad_outputs)[0]
    return grads

# Plot functions and their derivatives
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

functions = [
    ('ReLU', torch.relu),
    ('Sigmoid', torch.sigmoid),
    ('Tanh', torch.tanh),
    ('GELU', F.gelu)
]

for i, (name, func) in enumerate(functions):
    row, col = i // 2, i % 2
    
    # torch.autograd.grad returns the gradient directly and does not write
    # to x.grad, so no gradient reset is needed between iterations
    
    # Function values
    y = func(x)
    
    # Derivatives
    try:
        dy_dx = compute_derivative(func, x)
        
        # Plot function and derivative
        axes[row, col].plot(x.detach().numpy(), y.detach().numpy(), 
                           label=f'{name}', linewidth=2)
        axes[row, col].plot(x.detach().numpy(), dy_dx.detach().numpy(), 
                           label=f"{name}'", linewidth=2, linestyle='--')
    except RuntimeError:
        # Autograd failed for this function, so plot only the function values
        axes[row, col].plot(x.detach().numpy(), y.detach().numpy(), 
                           label=f'{name}', linewidth=2)
    
    axes[row, col].set_title(f'{name} and its Derivative')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)
    axes[row, col].axhline(y=0, color='k', linewidth=0.5)
    axes[row, col].axvline(x=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()
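
The autograd results can be cross-checked against the closed-form derivatives: `σ'(x) = σ(x)(1 - σ(x))`, `tanh'(x) = 1 - tanh²(x)`, and for ReLU a step function. The sketch below assumes a small helper, `autograd_derivative`, that mirrors the approach used above; the helper and the `analytic` dictionary are illustrative names.

```python
import torch

x = torch.linspace(-3, 3, 7, requires_grad=True)

def autograd_derivative(func, inputs):
    """Element-wise derivative of func at inputs, computed by autograd."""
    outputs = func(inputs)
    (grad,) = torch.autograd.grad(outputs, inputs,
                                  grad_outputs=torch.ones_like(outputs))
    return grad

# Closed-form derivatives, evaluated without gradient tracking
with torch.no_grad():
    s = torch.sigmoid(x)
    t = torch.tanh(x)
    analytic = {
        'Sigmoid': s * (1 - s),
        'Tanh': 1 - t ** 2,
        'ReLU': (x > 0).float(),  # PyTorch uses 0 as the derivative at x = 0
    }

functions = {'Sigmoid': torch.sigmoid, 'Tanh': torch.tanh, 'ReLU': torch.relu}

for name, func in functions.items():
    numeric = autograd_derivative(func, x)
    match = torch.allclose(numeric, analytic[name], atol=1e-6)
    print(f"{name:8} autograd matches closed form: {match}")
```

The sigmoid derivative peaks at 0.25 (at `x = 0`), while the tanh derivative peaks at 1, which is one reason tanh tends to pass gradients along better than sigmoid.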

## 6. Softmax Function

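The softmax function converts a vector of logits into a probability distribution, which makes it the standard output transformation for multi-class classification.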

In [None]:
# Softmax examples
print("=== Softmax Function ===")

# Example 1: Basic softmax
logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)
print(f"Logits: {logits}")
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum()}")

# Example 2: Batch softmax
batch_logits = torch.tensor([[2.0, 1.0, 0.1],
                            [0.5, 2.5, 1.0],
                            [1.0, 1.0, 1.0]])
batch_probs = F.softmax(batch_logits, dim=1)
print(f"\nBatch logits:\n{batch_logits}")
print(f"\nBatch softmax:\n{batch_probs}")
print(f"Row sums: {batch_probs.sum(dim=1)}")

# Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it
print(f"\n=== Temperature Scaling ===")
temperatures = [0.5, 1.0, 2.0, 5.0]
for temp in temperatures:
    temp_probs = F.softmax(logits / temp, dim=0)
    print(f"Temperature {temp}: {temp_probs}")
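
To see what softmax and temperature actually do, here is a hand-written version, `softmax(z)_i = exp(z_i / T) / Σ_j exp(z_j / T)`, together with the entropy of the result at each temperature. The helpers `manual_softmax` and `entropy` are hypothetical names introduced for this sketch.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

def manual_softmax(z, temperature=1.0):
    """Softmax written out directly; fine for small logits like these."""
    exp_z = torch.exp(z / temperature)
    return exp_z / exp_z.sum()

def entropy(p):
    """Shannon entropy in nats; larger means closer to uniform."""
    return -(p * torch.log(p)).sum().item()

temperatures = [0.5, 1.0, 2.0, 5.0]
entropies = []
for temp in temperatures:
    p = manual_softmax(logits, temp)
    entropies.append(entropy(p))
    matches = torch.allclose(p, F.softmax(logits / temp, dim=0), atol=1e-6)
    print(f"T={temp}: probs={p}, entropy={entropies[-1]:.4f}, matches F.softmax: {matches}")
```

The entropy rises with temperature and approaches `log(3)`, the entropy of a uniform distribution over three classes.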

## 7. Numerical Stability

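Exponentials overflow quickly in floating point: in float32, `exp(x)` already evaluates to `inf` once `x` exceeds roughly 88. This section shows how softmax and related functions avoid that problem.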

In [None]:
print("=== Numerical Stability ===")

# Large logits can cause numerical instability
large_logits = torch.tensor([1000., 999., 998.])

# Exponentiating large logits directly overflows to inf in float32
print(f"Large logits: {large_logits}")
print(f"exp(logits): {torch.exp(large_logits)}")

# Stable softmax (PyTorch handles this automatically)
stable_probs = F.softmax(large_logits, dim=0)
print(f"Stable softmax: {stable_probs}")

# Log-softmax for numerical stability
log_probs = F.log_softmax(large_logits, dim=0)
print(f"Log-softmax: {log_probs}")

# Verify: exp(log_softmax) = softmax
reconstructed_probs = torch.exp(log_probs)
print(f"Reconstructed: {reconstructed_probs}")
print(f"Match: {torch.allclose(stable_probs, reconstructed_probs)}")

# LogSumExp trick
print(f"\n=== LogSumExp Trick ===")
def manual_logsumexp(x):
    max_x = torch.max(x)
    return max_x + torch.log(torch.sum(torch.exp(x - max_x)))

manual_lse = manual_logsumexp(large_logits)
pytorch_lse = torch.logsumexp(large_logits, dim=0)
print(f"Manual LogSumExp: {manual_lse}")
print(f"PyTorch LogSumExp: {pytorch_lse}")
print(f"Match: {torch.allclose(manual_lse, pytorch_lse)}")
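
The difference between a naive and a stable softmax can be made explicit. In the sketch below, `naive_softmax` and `shifted_softmax` are illustrative helpers, not PyTorch functions; the second one subtracts the maximum logit first, which leaves the result mathematically unchanged because the shift cancels between numerator and denominator.

```python
import torch
import torch.nn.functional as F

large_logits = torch.tensor([1000., 999., 998.])

def naive_softmax(z):
    exp_z = torch.exp(z)          # overflows to inf for these logits
    return exp_z / exp_z.sum()    # inf / inf evaluates to nan

def shifted_softmax(z):
    exp_z = torch.exp(z - z.max())  # largest exponent is now exp(0) = 1
    return exp_z / exp_z.sum()

naive = naive_softmax(large_logits)
stable = shifted_softmax(large_logits)

print(f"Naive softmax:   {naive}")
print(f"Shifted softmax: {stable}")
print(f"Matches F.softmax: {torch.allclose(stable, F.softmax(large_logits, dim=0))}")
```

This is the same idea as the LogSumExp trick above, applied to softmax itself.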

## 8. Practice Exercises

In [None]:
# Exercise 1: Implement a custom activation function
print("Exercise 1: Custom Swish activation function")

def custom_swish(x, beta=1.0):
    """Custom implementation of Swish: x * sigmoid(beta * x)"""
    return x * torch.sigmoid(beta * x)

# Test implementation
x = torch.linspace(-3, 3, 7)
custom_result = custom_swish(x)
pytorch_result = F.silu(x)  # SiLU is Swish with beta=1

print(f"Input: {x}")
print(f"Custom Swish: {custom_result}")
print(f"PyTorch SiLU: {pytorch_result}")
print(f"Match: {torch.allclose(custom_result, pytorch_result)}")

# Try different beta values
plt.figure(figsize=(10, 6))
x_plot = torch.linspace(-5, 5, 100)
for beta in [0.5, 1.0, 1.5, 2.0]:
    y = custom_swish(x_plot, beta)
    plt.plot(x_plot.numpy(), y.numpy(), label=f'β={beta}', linewidth=2)

plt.title('Swish Activation with Different β Values')
plt.xlabel('x')
plt.ylabel('Swish(x, β)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.show()

In [None]:
# Exercise 2: Activation function comparison for gradient flow
print("Exercise 2: Gradient flow comparison")

# Create input with gradient tracking
x = torch.linspace(-5, 5, 100, requires_grad=True)

# Compare gradient magnitudes for different activations
activations = {
    'ReLU': torch.relu,
    'Sigmoid': torch.sigmoid,
    'Tanh': torch.tanh,
    'GELU': F.gelu
}

gradient_stats = {}

for name, func in activations.items():
    # torch.autograd.grad returns gradients directly and does not accumulate
    # into x.grad, so no reset is needed between activations
    
    y = func(x)
    
    # Compute gradients
    grad_outputs = torch.ones_like(y)
    try:
        grads = torch.autograd.grad(outputs=y, inputs=x,
                                   grad_outputs=grad_outputs,
                                   retain_graph=True)[0]
        
        # Calculate statistics
        gradient_stats[name] = {
            'mean_abs_grad': torch.mean(torch.abs(grads)).item(),
            'max_grad': torch.max(torch.abs(grads)).item(),
            'zero_grads': torch.sum(grads == 0).item()
        }
    except RuntimeError:
        gradient_stats[name] = {'error': 'Could not compute gradients'}

# Display results
print("\nGradient Statistics:")
for name, stats in gradient_stats.items():
    if 'error' not in stats:
        print(f"{name:8} - Mean |grad|: {stats['mean_abs_grad']:.4f}, "
              f"Max |grad|: {stats['max_grad']:.4f}, "
              f"Zero grads: {stats['zero_grads']}")
    else:
        print(f"{name:8} - {stats['error']}")

print("\nInterpretation:")
print("- Higher mean absolute gradients mean more signal is passed back through the activation")
print("- Exactly-zero gradients (ReLU for x < 0) pass no learning signal; a unit stuck there is a dead neuron")
print("- Sigmoid/Tanh gradients shrink toward zero in their saturation regions (large |x|)")
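
The vanishing-gradient effect becomes clearer when activations are stacked. The toy sketch below applies the same activation repeatedly to a single scalar, with no weights in between, and measures the gradient of the final output with respect to the input. This setup is a deliberate simplification, and `gradient_through_stack` is a hypothetical helper introduced only for this illustration.

```python
import torch

def gradient_through_stack(func, depth, x0=0.5):
    """Gradient of func applied `depth` times, with respect to the scalar input x0."""
    x = torch.tensor(x0, requires_grad=True)
    h = x
    for _ in range(depth):
        h = func(h)
    (grad,) = torch.autograd.grad(h, x)
    return grad.item()

depths = [1, 5, 10]
sigmoid_grads = {d: gradient_through_stack(torch.sigmoid, d) for d in depths}
relu_grads = {d: gradient_through_stack(torch.relu, d) for d in depths}

for d in depths:
    print(f"depth {d:2d}: sigmoid grad = {sigmoid_grads[d]:.2e}, "
          f"ReLU grad = {relu_grads[d]:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, the gradient through `d` stacked sigmoids is at most `0.25 ** d`. ReLU, by contrast, passes a gradient of exactly 1 for this positive input.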

## Summary

In this notebook, we explored:

1. **Advanced Mathematical Operations**: Trigonometric, exponential, and power functions
2. **Linear Algebra**: Matrix operations, determinants, eigenvalues, and more
3. **Activation Functions**: Common activation functions used in neural networks
4. **Visualization**: Plotting activation functions and their derivatives
5. **Gradients**: Understanding gradient computation for different functions
6. **Softmax**: Multi-class classification probability distributions
7. **Numerical Stability**: Techniques for stable computation with large numbers
8. **Custom Functions**: Implementing your own activation functions

### Key Takeaways
- Different activation functions have different properties for gradient flow
- ReLU is fast but can cause dead neurons
- Sigmoid and Tanh can suffer from vanishing gradients
- Modern activations like GELU and Swish offer smooth alternatives
- Numerical stability is crucial for reliable computations

### Next Steps
- Experiment with different activation functions in neural networks
- Try implementing other activation functions (Mish, PReLU, etc.)
- Understand how activation choice affects training dynamics
- Move on to the next notebook: Automatic Differentiation