In [None]:
"""
Understanding Integrals and their Applications in ML

What Are Integrals?
- Mathematical operation that computes the signed area between a curve and the x-axis
- Represents accumulation of quantities over an interval
- The definite integral of f(x) from a to b is denoted as: ∫[a,b] f(x) dx

Applications in ML:
1. Probability Distributions:
   - Calculating probabilities from probability density functions
   - Normalization of distributions (ensuring total area = 1)
   - Computing expected values and moments

2. Cost Functions and Metrics:
   - Expected loss (risk) in regression, defined as an integral of the loss over the data distribution
   - Area under the ROC curve (AUC) for classification performance
   - Cumulative reward in reinforcement learning, which becomes an integral in continuous time

Python Implementation Examples:
"""

import sympy as sp

x = sp.Symbol('x')
f = x ** 2

# Definite integral of x^2 over [0, 2]
definite_integral = sp.integrate(f, (x, 0, 2))
# Antiderivative of x^2; SymPy omits the constant of integration
indefinite_integral = sp.integrate(f, x)

print("Definite integral ", definite_integral)
print("Indefinite integral ", indefinite_integral)

Definite integral  8/3
Indefinite integral  x**3/3
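The AUC application listed above is itself a numerical integral: the ROC curve is only known at a finite set of points, so its area is usually approximated with the trapezoidal rule. The block below is a minimal sketch of that idea, not a library implementation. The function name `roc_auc_trapezoid` and the four-sample toy data are assumptions made for illustration, and tied scores are not handled.

```python
import numpy as np

def roc_auc_trapezoid(y_true, scores):
    # Hypothetical helper: assumes binary labels (0/1) and no tied scores.
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(-scores)              # highest score first
    y_sorted = y_true[order]
    # Lowering the threshold one sample at a time traces out the ROC curve.
    tpr = np.concatenate(([0.0], np.cumsum(y_sorted) / y_sorted.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()))
    # Trapezoidal rule: sum of width * average height over each segment.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy data (made up for this example)
auc = roc_auc_trapezoid([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc_trapezoid([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print("AUC:", auc)
print("AUC of a perfect ranking:", auc_perfect)
```

A ranking that places every positive above every negative gives an area of 1, while a random ranking would be expected to give about 0.5.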


In [3]:
"""
# Optimization Concepts

- Local vs. Global Minima  
  - Local Minimum: A point where the function value is lower than all nearby points, but not necessarily the lowest overall
  - Global Minimum: The point where the function achieves its absolute lowest value across the entire domain

- Convex Functions  
  - Mathematical definition: f(λx₁ + (1-λ)x₂) ≤ λf(x₁) + (1-λ)f(x₂) for all λ ∈ [0, 1] and all x₁, x₂ in the domain
  - This property ensures that any local minimum is also a global minimum
  - A convex function is bowl-shaped with no separate "dips"; a strictly convex function has at most one minimizer

- Non-Convex Functions in ML  
  - Most neural network loss functions are non-convex in their parameters, because the network composes nonlinear layers
  - Non-convex functions can have multiple local minima, saddle points, and flat regions
  - This makes optimization more challenging but allows modeling complex relationships


Key Insights for Machine Learning:
1. Convex optimization problems are easier to solve as any local minimum is global
2. Most real-world ML problems are non-convex, requiring sophisticated optimization techniques
3. Understanding the loss landscape helps in selecting appropriate optimization algorithms
4. Techniques like momentum, learning rate schedules, and advanced optimizers (Adam, RMSprop)
   help navigate non-convex loss landscapes more effectively

Practical Implications:
- For convex problems: Gradient descent with a suitably small learning rate converges to the global optimum (when one exists)
- For non-convex problems: Optimization results depend on initialization and optimizer choice
- Regularization techniques can help make loss landscapes more favorable for optimization
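
The convexity inequality and the dependence on initialization can both be checked numerically. The following is a small sketch under stated assumptions: the two test functions, the helper names `violates_convexity` and `gradient_descent`, and the step size are all chosen for illustration. A grid check like this can only detect a violation of the inequality; it cannot prove that a function is convex.

```python
import numpy as np

def convex_f(x):
    return x ** 2

def nonconvex_f(x):
    # Illustrative non-convex function with two separate local minima
    return x ** 4 - 3 * x ** 2 + x

def nonconvex_grad(x):
    return 4 * x ** 3 - 6 * x + 1

def violates_convexity(f, x1, x2, n_lambdas=101):
    # Compare f(l*x1 + (1-l)*x2) with l*f(x1) + (1-l)*f(x2) on a grid of l in [0, 1]
    lambdas = np.linspace(0.0, 1.0, n_lambdas)
    lhs = f(lambdas * x1 + (1 - lambdas) * x2)
    rhs = lambdas * f(x1) + (1 - lambdas) * f(x2)
    return bool(np.any(lhs > rhs + 1e-12))

def gradient_descent(grad, x0, learning_rate=0.01, n_steps=500):
    x = float(x0)
    for _ in range(n_steps):
        x -= learning_rate * grad(x)
    return x

print("x^2 violates the inequality:", violates_convexity(convex_f, -2.0, 2.0))
print("quartic violates the inequality:", violates_convexity(nonconvex_f, -1.5, 1.5))

# Same algorithm, two starting points, two different minima
x_left = gradient_descent(nonconvex_grad, x0=-2.0)
x_right = gradient_descent(nonconvex_grad, x0=2.0)
print("start at -2:", x_left, nonconvex_f(x_left))
print("start at +2:", x_right, nonconvex_f(x_right))
```

Starting on the left reaches the lower of the two minima, while starting on the right settles in the shallower one, which is the initialization dependence described above.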
"""
"""
# Stochastic Gradient Descent (SGD) and Its Variants

## What is Stochastic Gradient Descent?
- Optimization algorithm that estimates the gradient from a randomly chosen sample (or a small mini-batch) of the data and updates the parameters after each estimate
- Unlike batch gradient descent, which uses the entire dataset for every update, pure SGD uses a single random sample; the mini-batch variant uses a small random subset
- Each update is therefore much cheaper, and the data can be streamed in pieces when it does not fit in memory

## Why Use SGD?
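- Faster progress per unit of computation on large datasets, since each update touches only a few samples
- Noise in the gradient estimate can help it escape shallow local minima and saddle points
- Requires less memory, as only one sample or batch is processed at a time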
- Well-suited for online learning where data arrives sequentially

## Variants of SGD

### Mini-Batch SGD
- Compromise between batch GD and pure SGD
- Uses small random subsets (mini-batches) of the data
- Balances computational efficiency with gradient stability

### Momentum
- Adds a velocity term to parameter updates
- Helps accelerate convergence in relevant directions
- Reduces oscillations in parameter updates
- Formula: v = γv + η∇J(θ); θ = θ - v, where γ is the momentum coefficient and η is the learning rate

### Adam Optimizer (Adaptive Moment Estimation)
- Combines ideas from both Momentum and RMSProp
- Maintains exponentially decaying averages of past gradients and squared gradients
- Computes adaptive learning rates for different parameters
- Particularly effective for problems with noisy or sparse gradients

Python Implementation Examples:
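
The momentum and Adam updates described above can be sketched on a simple quadratic objective. This is an illustrative sketch, not a reference implementation: the objective, the function names, and the step sizes are assumptions. The Adam update follows the commonly published form with bias correction, and beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are widely used default values assumed here.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the assumed objective J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2)
    return np.array([1.0, 10.0]) * theta

def momentum_descent(theta0, eta=0.01, gamma=0.9, n_steps=500):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = gamma * v + eta * grad_J(theta)   # v = gamma*v + eta*grad J(theta)
        theta = theta - v                     # theta = theta - v
    return theta

def adam_descent(theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # decaying average of gradients
    s = np.zeros_like(theta)   # decaying average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad_J(theta)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero start
        s_hat = s / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)  # per-parameter step size
    return theta

theta_momentum = momentum_descent([2.0, 2.0])
theta_adam = adam_descent([2.0, 2.0])
print("Momentum:", theta_momentum)
print("Adam:", theta_adam)
```

Both runs start from the same point and should move toward the minimizer at the origin. Note that Adam divides by the root of the squared-gradient average, so its step is largely insensitive to the different curvature of the two coordinates.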
"""

"""
Practical Considerations for SGD:

1. Learning Rate Scheduling:
   - Often beneficial to decrease learning rate over time
   - Common strategies: step decay, exponential decay, cosine annealing

2. Batch Size Selection:
   - Smaller batches provide more frequent updates but noisier gradients
   - Larger batches provide more stable gradients but fewer updates per epoch
   - Typical batch sizes: 32, 64, 128, 256

3. Initialization:
   - Proper initialization is crucial for convergence
   - He/Xavier initialization often works well with SGD variants

4. Regularization:
   - SGD has a regularizing effect due to its noisy updates
   - Often combined with explicit regularization like L2 weight decay
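
The three learning rate schedules named in item 1 can be written as small functions of the epoch index. This is a sketch with assumed parameter names and default values (`drop`, `epochs_per_drop`, `k`, `lr_min`); deep learning frameworks ship their own scheduler classes with different interfaces.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the rate by `drop` once every `epochs_per_drop` epochs
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, k=0.05):
    # Smooth decay: lr0 * exp(-k * epoch)
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr0, epoch, n_epochs, lr_min=0.0):
    # Half a cosine wave from lr0 (epoch 0) down to lr_min (epoch n_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / n_epochs))

for epoch in (0, 10, 25, 50):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, n_epochs=50))
```

Inside a training loop, one of these would be evaluated at the start of each epoch and the result passed to the update step as the current learning rate.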

Common Use Cases:
- Training deep neural networks
- Large-scale machine learning problems
- Online learning scenarios
- Non-convex optimization problems

In modern deep learning practice, Adam is often the default choice of optimizer due to its
generally good performance across a wide range of problems.
"""


In [4]:
################################### Exercise 1 ###################################
import sympy as sp

x = sp.Symbol('x')
f = sp.exp(-x)

# Improper integral over [0, infinity); sp.oo is SymPy's infinity
definite_integral = sp.integrate(f, (x, 0, sp.oo))
indefinite_integral = sp.integrate(f, x)
print("Definite integral ", definite_integral)
print("Indefinite integral ", indefinite_integral)

Definite integral  1
Indefinite integral  -exp(-x)
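
The result of 1 here is an instance of the normalization property from the first section: exp(-x) on [0, ∞) is the density of an exponential distribution with rate 1. As a hedged extension of the exercise, the sketch below repeats the check for a general rate and computes the mean and variance by integration. The symbol name `lam` and the variable names are assumptions introduced for this example.

```python
import sympy as sp

x = sp.Symbol('x')
lam = sp.Symbol('lam', positive=True)   # rate parameter, assumed positive

pdf = lam * sp.exp(-lam * x)            # exponential density on [0, oo)

# Normalization: the total area under the density
total_probability = sp.integrate(pdf, (x, 0, sp.oo))
# First and second moments
mean = sp.integrate(x * pdf, (x, 0, sp.oo))
second_moment = sp.integrate(x ** 2 * pdf, (x, 0, sp.oo))
variance = sp.simplify(second_moment - mean ** 2)

print("Total probability:", total_probability)
print("Mean:", mean)
print("Variance:", variance)
```

Declaring `lam` as positive matters: without that assumption SymPy cannot decide whether the improper integral converges and returns a conditional result instead.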


In [6]:

import numpy as np

# Generate some data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add bias term to X
X_b = np.c_[np.ones((100, 1)), X]

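# SGD implementation

def stochastic_gradient_descent(X, y, theta, learning_rate, n_epochs):
    """Fit linear-regression weights with single-sample SGD (updates theta in place)."""
    m = len(y)
    for epoch in range(n_epochs):
        for i in range(m):
            # Draw one sample at random (with replacement)
            random_index = np.random.randint(m)
            xi = X[random_index:random_index+1]
            yi = y[random_index:random_index+1]
            # Gradient of the squared error (xi @ theta - yi)^2 for this sample
            gradients = 2 * xi.T @ (xi @ theta - yi)
            theta -= learning_rate * gradients
    return theta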

theta = np.random.randn(2,1)
learning_rate = 0.01
n_epochs = 50


theta_opt = stochastic_gradient_descent(X_b, y, theta, learning_rate, n_epochs)
print(theta_opt)

[[4.16427183]
 [2.69207144]]
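
The estimate above is close to the true parameters (4 and 3) but noisy, because every update uses a single sample. The mini-batch variant described earlier averages the gradient over a small batch. The block below is a sketch of that variant on similarly generated data; the function name `minibatch_sgd`, the batch size, the learning rate, and the use of `np.random.default_rng` are assumptions, so its numbers will not match the cell above.

```python
import numpy as np

def minibatch_sgd(X, y, theta, learning_rate, n_epochs, batch_size, rng):
    theta = theta.copy()                     # leave the caller's array untouched
    m = len(y)
    for epoch in range(n_epochs):
        perm = rng.permutation(m)            # reshuffle the data every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            # Gradient of the mean squared error over the batch
            gradients = 2 / len(idx) * X_batch.T @ (X_batch @ theta - y_batch)
            theta -= learning_rate * gradients
    return theta

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]            # add bias column

theta_init = rng.standard_normal((2, 1))
theta_mb = minibatch_sgd(X_b, y, theta_init, learning_rate=0.05,
                         n_epochs=200, batch_size=16, rng=rng)

# Closed-form least-squares solution for comparison
theta_ls = np.linalg.lstsq(X_b, y, rcond=None)[0]
print("Mini-batch SGD:", theta_mb.ravel())
print("Least squares: ", theta_ls.ravel())
```

Comparing against the least-squares solution is a useful sanity check for a linear model, since that is the point any of these gradient methods should approach on this loss.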
