Optimization in machine learning is the process of adjusting a model's parameters (and, more broadly, its hyperparameters and architecture) to minimize an objective such as prediction error on the training data, with the goal of accurate predictions on unseen data.

### Core Principles
**Objective/Loss Function**: Machine learning models are trained to minimize (or maximize) an objective or cost function, such as Mean Squared Error (MSE) for regression or cross-entropy for classification, which quantifies the difference between predicted outputs and actual labels.
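
As a concrete illustration of the two losses named above, here is a minimal pure-Python sketch. The function names and the sample values are made up for this example, and the cross-entropy shown is the binary (two-class) form with probabilities clipped to avoid `log(0)`.

```python
import math

def mean_squared_error(y_true, y_pred):
    # Average of squared differences between targets and predictions.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true holds 0/1 labels, p_pred holds predicted probabilities of class 1.
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical regression targets/predictions and classification labels/probabilities.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
bce = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
print(f"MSE = {mse:.4f}")
print(f"Binary cross-entropy = {bce:.4f}")
```

Both functions return a single number that shrinks as predictions get closer to the labels, which is what makes them usable as objectives.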

**Parameter and Hyperparameter Tuning**: Optimization includes learning model parameters (like weights in neural networks) and tuning hyperparameters (like learning rate or regularization strength) to achieve the best generalization and performance.
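
The distinction can be sketched with two nested loops: an inner loop that learns a parameter and an outer loop that tunes a hyperparameter. The example below is an assumption-laden toy, not a recommended tuning procedure: the data, the helper names `fit_weight` and `mse`, and the candidate learning rates are all invented for illustration.

```python
def fit_weight(xs, ys, eta, num_steps=20):
    # Inner loop: learn the parameter w of the model y = w * x by gradient descent on MSE.
    w = 0.0
    n = len(xs)
    for _ in range(num_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= eta * grad
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation data drawn from y = 2x.
train_x, train_y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

# Outer loop: try several values of the hyperparameter eta and score each on validation data.
results = {}
for eta in [0.001, 0.01, 0.1]:
    w = fit_weight(train_x, train_y, eta)
    results[eta] = (w, mse(w, val_x, val_y))
    print(f"eta = {eta}: learned w = {w:.4f}, validation MSE = {results[eta][1]:.4f}")

best_eta = min(results, key=lambda e: results[e][1])
print(f"Best learning rate among the candidates: {best_eta}")
```

The weight `w` is set by the training procedure itself, while `eta` is chosen from outside by comparing validation loss across runs.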

**Iterative Improvement**: Optimization is typically an iterative process in which the model is updated repeatedly based on the evaluation of the objective function, guiding the model toward optimal solutions.

**Gradient Descent**: The most common optimization algorithm, which updates parameters in the direction opposite to the gradient of the objective function to reduce cost.

In [None]:
# Sources:
# [1](https://www.neuralconcept.com/post/machine-learning-based-optimization-methods-use-cases-for-design-engineers)
# [2](https://towardsdatascience.com/understanding-optimization-algorithms-in-machine-learning-edfdb4df766b/)
# [3](https://www.geeksforgeeks.org/machine-learning/optimization-algorithms-in-machine-learning/)

### Gradient descent

Gradient descent is an iterative optimization algorithm central to machine learning, where it updates model parameters to minimize a loss (error) function by moving in the direction of steepest descent.

The update rule is:

    θ := θ - η ∇J(θ)

where:

- θ is the parameter vector (e.g., weights),  
- J(θ) is the loss function,  
- η is the learning rate,  
- ∇J(θ) is the gradient of the loss function with respect to θ.  

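At each step, the algorithm subtracts the gradient scaled by the learning rate η from θ. Because the negative gradient points in the direction in which J(θ) decreases fastest locally, a sufficiently small step lowers the loss and moves θ toward a minimum.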
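
When θ is a vector, the same rule is applied to every component at once. The sketch below fits a line y = w·x + b by gradient descent on the mean squared error, with θ = [w, b]. The data, the step count, and the helper names `loss` and `gradient` are assumptions made for this illustration.

```python
# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def loss(theta):
    # J(θ): mean squared error of the line w*x + b on the data.
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(theta):
    # ∇J(θ): partial derivatives of the MSE with respect to w and b.
    w, b = theta
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [dw, db]

theta = [0.0, 0.0]  # initial [w, b]
eta = 0.1           # learning rate
for _ in range(500):
    grad = gradient(theta)
    # θ := θ - η ∇J(θ), applied component-wise
    theta = [t - eta * g for t, g in zip(theta, grad)]

print(f"w = {theta[0]:.4f}, b = {theta[1]:.4f}, loss = {loss(theta):.2e}")
```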


#### Local Minimum vs. Global Minimum
A **local minimum** is a point where the loss is lower than adjacent points but not necessarily the lowest overall.

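The **global minimum** is the point where the loss attains its lowest value over the entire parameter space.

For convex loss functions, every local minimum is also a global minimum, so gradient descent with a suitably small learning rate converges to the global minimum. For non-convex losses (such as those of neural networks), it can instead settle in a local minimum or stall near a saddle point, depending on where it starts.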
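
The dependence on the starting point can be seen on a small non-convex function. The sketch below uses f(x) = x⁴ - 3x² + x, a function chosen purely for illustration because it has two minima of different depth, and runs the same gradient descent loop from two hypothetical starting values.

```python
def f(x):
    # Non-convex example function with two minima of different depth.
    return x**4 - 3 * x**2 + x

def f_prime(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, eta=0.01, num_iterations=200):
    for _ in range(num_iterations):
        x = x - eta * f_prime(x)
    return x

x_right = descend(2.0)   # start to the right of both minima
x_left = descend(-2.0)   # start to the left of both minima

print(f"start  2.0 -> x = {x_right:.4f}, f(x) = {f(x_right):.4f}")
print(f"start -2.0 -> x = {x_left:.4f}, f(x) = {f(x_left):.4f}")
```

Both runs stop where the gradient is zero, but the run started at 2.0 ends in the shallower minimum near x ≈ 1.13, while the run started at -2.0 reaches the deeper one near x ≈ -1.30.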

#### Learning Rate (η)

The step size η determines how far θ moves in each iteration.
- If η is too large, updates may overshoot the minimum and potentially diverge.
- If η is too small, convergence will be very slow.
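
Both failure modes show up even on the simple objective J(θ) = (θ - 3)². The sketch below runs the same loop with three learning rates; the specific values 0.01, 0.1, and 1.1 and the helper name `run_gradient_descent` are assumptions picked to make the contrast visible.

```python
def run_gradient_descent(eta, num_iterations=20, theta=0.0):
    # Gradient descent on J(θ) = (θ - 3)^2, whose derivative is 2(θ - 3).
    for _ in range(num_iterations):
        grad = 2 * (theta - 3)
        theta = theta - eta * grad
    return theta

final_error = {}
for eta in [0.01, 0.1, 1.1]:
    theta = run_gradient_descent(eta)
    final_error[eta] = abs(theta - 3)  # distance from the minimum at θ = 3
    print(f"η = {eta}: θ after 20 steps = {theta:.4f}, "
          f"distance to minimum = {final_error[eta]:.4f}")
```

With these settings, η = 0.01 is still far from 3 after 20 steps, η = 0.1 is close, and η = 1.1 ends farther from the minimum than it started because every step overshoots by more than it corrects.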

In [None]:
# Minimal Gradient Descent Example for J(θ) = (θ - 3)^2

# Initialize parameters
theta = 0.0        # Starting value of θ
eta = 0.1        # Learning rate
num_iterations = 10

# Gradient Descent Loop
for i in range(num_iterations):
    grad = 2 * (theta - 3)      # Derivative of J(θ) = (θ - 3)^2
    theta = theta - eta * grad  # Update θ
    print(f"Iteration {i+1}: θ = {theta}")

# Each iteration computes the gradient at the current θ and updates θ along the
# negative gradient direction, so θ moves closer to the minimum of J(θ) at θ = 3.

Iteration 1: θ = 0.6000000000000001
Iteration 2: θ = 1.08
Iteration 3: θ = 1.464
Iteration 4: θ = 1.7711999999999999
Iteration 5: θ = 2.01696
Iteration 6: θ = 2.213568
Iteration 7: θ = 2.3708544
Iteration 8: θ = 2.49668352
Iteration 9: θ = 2.597346816
Iteration 10: θ = 2.6778774528
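
The printed values follow a simple pattern. Subtracting 3 from both sides of the update θ := θ - η · 2(θ - 3) gives (θ_new - 3) = (1 - 2η)(θ - 3), so with η = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, and after k steps θ = 3 - 3 · 0.8^k. The short check below is a sketch that reuses the settings of the example above and compares the loop against that formula.

```python
theta = 0.0
eta = 0.1
max_gap = 0.0

for k in range(1, 11):
    theta = theta - eta * 2 * (theta - 3)     # gradient descent step
    closed_form = 3 - 3 * (1 - 2 * eta) ** k  # predicted θ after k steps
    max_gap = max(max_gap, abs(theta - closed_form))

print(f"Largest difference between loop and formula over 10 steps: {max_gap:.2e}")
```

The same factor (1 - 2η) explains why convergence on this objective requires 0 < η < 1: outside that range its magnitude is at least 1 and the distance to the minimum no longer shrinks.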


In [3]:
# Sources:
# [1](https://www.geeksforgeeks.org/machine-learning/gradient-descent-algorithm-and-its-variants/)
# [2](https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent)
# [3](https://en.wikipedia.org/wiki/Gradient_descent)
# [4](https://opus.govst.edu/cgi/viewcontent.cgi?article=1001&context=theses_math)
# [5](https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/resources/lecture-22-gradient-descent-downhill-to-a-minimum/)
# [6](https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/optimizing-multivariable-functions/a/what-is-gradient-descent)
# [7](https://www.youtube.com/watch?v=jc2IthslyzM)