<div class="alert alert-block alert-success">

<h2> Supervised Learning </h2>
Supervised learning is a fundamental type of machine learning where an algorithm learns from data that is already labeled with the correct output. </br>
The "supervision" comes from these labeled datasets, which guide the algorithm in identifying patterns and relationships between input features and their associated outputs.   

</div>


<div class="alert alert-block alert-info">

<b>Key Steps:</b>

<b>Data Collection and Preparation:</b>
Gather and label relevant data, with each data point containing input features and a corresponding output label (e.g., emails labeled as "spam" or "not spam").
Split the data into training, validation, and test sets.

<b>Model Selection:</b>

Choose a supervised learning algorithm based on the problem type (classification or regression) and data characteristics.

<b>Training:</b>

Feed the labeled training data into the algorithm.
Adjust internal parameters iteratively to minimize the difference between predictions and actual labels (optimization process).

<b>Evaluation:</b>

Test model performance on a separate labeled dataset (test set) to assess generalization to unseen data.

<b>Prediction/Inference:</b>

Use the trained model to make predictions on new, unlabeled data.

</div>
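
The key steps above can be sketched end to end on a toy problem. This is a minimal illustration, not a recommended pipeline: the synthetic data (points near the line y = 2x + 1), the 70/15/15 split, the learning rate, and all variable names are assumptions made for the example, and plain Python is used so the sketch has no dependencies.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# 1. Data collection and preparation: synthetic labeled pairs (x, y) with y close to 2x + 1.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)) for x in xs]
random.shuffle(data)
train, val, test = data[:140], data[140:170], data[170:]


def mse(w, b, rows):
    """Mean squared error of the line y = w*x + b on a list of (x, y) rows."""
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


# 2. Model selection: a one-feature linear model (this is a regression problem).
w, b = 0.0, 0.0

# 3. Training: adjust w and b iteratively to reduce the training loss.
learning_rate = 0.1
for epoch in range(500):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# 4. Evaluation: the validation set guides tuning, the test set is the final check.
val_loss = mse(w, b, val)
test_loss = mse(w, b, test)

# 5. Prediction/inference on a new, unlabeled input.
new_x = 0.5
prediction = w * new_x + b
print(f"w={w:.3f}, b={b:.3f}, val MSE={val_loss:.4f}, test MSE={test_loss:.4f}")
```

In practice the validation loss would be used to compare models or hyperparameters, while the test loss is looked at only once at the end.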


<div class="alert alert-block alert-success">

## Machine Learning Architecture

Below is a simple ML architecture:

![Screenshot 2025-05-10 at 12.43.12 AM.png](attachment:94541bc0-7bf7-4d1b-9a6e-68ca2887d893.png)

We take inputs $x_1, x_2, x_3, \dots, x_n$ along with their corresponding weights $w_1, w_2, w_3, \dots, w_n$. The output, denoted $z$, is computed using the formula: </br>


$$z = \sum_{i=1}^{n} w_i x_i + b$$

This represents a linear combination of input features $x_i$ with their corresponding weights $w_i$, plus a bias term $b$. This is the fundamental equation used in many machine learning models, particularly for:

- The input to activation functions in neural networks
- Linear regression models
- The decision function in logistic regression and SVMs



</div>
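
As a small illustration of the formula for $z$, the weighted sum can be written directly in plain Python. The function name `linear_combination` and the numeric values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def linear_combination(x, w, b):
    """Return z = sum_i w_i * x_i + b for equal-length sequences x and w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Hypothetical values chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.25

z = linear_combination(x, w, b)  # 0.5*1 + (-1.0)*2 + 2.0*3 + 0.25
print(z)  # 4.75
```

A linear regression model would use $z$ directly as its prediction, while logistic regression or a neuron would pass $z$ through an activation function first.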

## Loss function ##

<div class="alert alert-block alert-info">
A loss function quantifies the difference between the model's predictions and the actual observed data, giving a numerical representation of "error" or "loss". </br>
The operational principle is simple: a loss function returns larger values for predictions that are far from the targets and smaller values for predictions that are close to them. This gives the training procedure a clear metric to optimize, so the model can 'learn' from its mistakes and move closer to accurate predictions with each iteration. <b> Minimizing this loss is the goal of the training process. </b>
</br>

![Screenshot 2025-05-09 at 11.40.55 PM.png](attachment:fcb6e457-3454-407d-99d1-ae6a54f4601a.png)

</div>

<div class="alert alert-block alert-info">

<b> Loss functions are categorized into two main types:</b> </br>
 _**Regression loss functions**_ for predicting continuous values, such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) </br>
 _**Classification loss functions**_ for categorizing inputs into discrete classes, including Binary Cross-Entropy and Categorical Cross-Entropy. </br>
There are also specialized loss functions, like Hinge Loss and Huber Loss, designed to address specific challenges such as maximizing decision margins or being robust to outliers.

</div>
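
The loss functions named above can be sketched in a few lines each. This is a minimal plain-Python version using the standard textbook definitions; the function names, the `eps` clipping constant, and the `delta` default are assumptions for the example, and real projects would normally use a library implementation.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Regression loss: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Regression loss: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Classification loss for labels in {0, 1} and predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def hinge_loss(y_true, scores):
    """Margin loss for labels in {-1, +1} and raw (unsquashed) scores."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)


def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small residuals, linear for large ones (less sensitive to outliers)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)


# Confident correct predictions give a lower loss than confident wrong ones.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
print(good < bad)  # True
```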

<div class="alert alert-block alert-success">
    
<b>Gradient Descent: An Intuitive Guide to an Essential Optimization Algorithm </b> </br>
Gradient descent is a fundamental optimization algorithm widely used in machine learning and other mathematical fields. Its primary goal is to find a minimum of a function. Imagine a person trying to walk down a hill in thick fog; they would feel the slope of the ground at their current position and take a step in the direction where the ground descends most steeply. Gradient descent operates on a similar principle. </br> </br>

<b>Core Concept: Iteratively Moving Towards the Minimum </b> </br>

At its heart, gradient descent is an iterative process. It starts at an arbitrary point on the function's surface (representing the initial guess for the parameters). Then, it calculates the gradient of the function at that point. The gradient is a vector that points in the direction of the steepest ascent of the function. To move towards the minimum, gradient descent takes a step in the opposite direction of the gradient.   

This process is repeated, with each step hopefully taking us closer to the lowest point of the function. The size of each step is determined by a parameter called the learning rate.   

![image.png](attachment:7aa5a30b-c6e5-4416-bdd9-96b24deab79d.png)


![image.png](attachment:00fab0c2-ef83-47ca-a246-8c224ddb7201.png)

<b>The Mathematics Behind It (Simplified)</b>

Let's say we have a function $L$ that we want to minimize, where $w$ represents the parameters of our model. The update rule for gradient descent can be expressed as:

$$w_{\text{output}} = w_{\text{input}} - \alpha \, \nabla L(w_{\text{input}})$$

Where:

- $w_{\text{output}}$ is the updated value of the parameter.
- $w_{\text{input}}$ is the current value of the parameter.
- $\alpha$ (alpha) is the learning rate.
- $\nabla L(w_{\text{input}})$ is the gradient of the function $L$ at $w_{\text{input}}$. For a single parameter this is the derivative $\frac{\partial L}{\partial w}$, so the rule reads $w = w - \alpha \frac{\partial L}{\partial w}$.

The gradient $\nabla L(w_{\text{input}})$ tells us the direction of the steepest increase. By subtracting this gradient (multiplied by the learning rate) from the current parameter values, we are effectively moving in the direction of the steepest decrease.
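
The update rule can be tried on a one-parameter example. The sketch below assumes a hypothetical loss $L(w) = (w - 3)^2$, whose minimum is at $w = 3$ and whose gradient is $2(w - 3)$; the function names and the starting point are illustrative choices.

```python
def gradient_descent(grad, w_init, learning_rate, num_steps):
    """Repeatedly apply w = w - learning_rate * grad(w) and record every iterate."""
    w = w_init
    history = [w]
    for _ in range(num_steps):
        w = w - learning_rate * grad(w)
        history.append(w)
    return w, history


def loss(w):
    return (w - 3.0) ** 2


def loss_grad(w):
    return 2.0 * (w - 3.0)


w_final, history = gradient_descent(loss_grad, w_init=0.0, learning_rate=0.1, num_steps=100)
print(f"final w = {w_final:.6f}, final loss = {loss(w_final):.2e}")
```

Each step shrinks the distance to the minimum by a constant factor here (0.8 with this learning rate), which is why the iterates approach $w = 3$ quickly.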

<b>The Learning Rate: A Crucial Parameter</b>

The learning rate (α) is a critical hyperparameter in gradient descent. It controls how big of a step we take in each iteration:   

- Too small a learning rate: Gradient descent will converge very slowly, requiring many iterations to reach the minimum.
- Too large a learning rate: The algorithm might overshoot the minimum and fail to converge, or even diverge (the function value might increase).

Finding an appropriate learning rate often involves experimentation and techniques like learning rate schedules, where the learning rate is adjusted during the training process.
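
The effect of the learning rate is easy to see on a toy loss. The sketch below assumes $L(w) = w^2$ (gradient $2w$, minimum at $w = 0$) and three hypothetical learning rates picked to show slow convergence, good convergence, and divergence.

```python
def run_descent(learning_rate, num_steps=50, w_init=1.0):
    """Run gradient descent on L(w) = w**2 and return the final w."""
    w = w_init
    for _ in range(num_steps):
        w -= learning_rate * 2.0 * w  # gradient of w**2 is 2*w
    return w


for lr in (0.001, 0.1, 1.1):
    w = run_descent(lr)
    print(f"learning rate {lr}: final w = {w:.4g}, final loss = {w * w:.4g}")
```

With the smallest rate the parameter has barely moved after 50 steps, with the middle rate it is very close to the minimum, and with the largest rate each step overshoots by more than it corrects, so the loss grows.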

<b>Types of Gradient Descent</b>

There are three main variations of gradient descent, differing in how much data is used to compute the gradient at each step:   

<b>Batch Gradient Descent:</b>

Calculates the gradient using the entire training dataset in each iteration.

- Pros: With a suitably small learning rate, it converges to the global minimum for convex functions and to a stationary point (typically a local minimum) for non-convex functions. The gradient calculation is stable.
- Cons: Can be very slow and computationally expensive for large datasets, as it requires processing all data points before making a single update. In its simplest form it also requires the entire dataset to fit in memory.

<b>Stochastic Gradient Descent (SGD):</b>

Calculates the gradient and updates the parameters for one training example at a time.

- Pros: Much more frequent updates, which often means faster progress, especially on large datasets. Its noisy updates can also help it escape shallow local minima.
- Cons: The updates can be very noisy, leading to a less stable convergence path (the loss may fluctuate significantly). With a fixed learning rate it tends to oscillate around the minimum rather than settle exactly on it.

<b>Mini-Batch Gradient Descent:</b>

A compromise between batch gradient descent and SGD. It calculates the gradient and updates the parameters using a small batch of training examples (e.g., 32, 64, 128 examples).

- Pros: Offers a balance between the stability of batch gradient descent and the efficiency of SGD. It's the most common type used in practice due to its efficiency and ability to leverage optimized matrix operations.
- Cons: Introduces an additional hyperparameter (batch size) that needs to be tuned.
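
The three variants differ only in how many examples feed each update, so one training loop with a `batch_size` argument can express all of them. The sketch below is an illustration under stated assumptions: a one-feature linear model, a hypothetical noise-free dataset on the line y = 3x - 1, and a function name `train_linear` invented for the example.

```python
import random


def train_linear(data, batch_size, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent (SGD)
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    rows = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the examples in a new order each epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data on the line y = 3x - 1, with x in [-1, 1].
data = [(i / 10.0, 3.0 * (i / 10.0) - 1.0) for i in range(-10, 11)]

for name, size in (("batch", len(data)), ("SGD", 1), ("mini-batch", 8)):
    w, b = train_linear(data, batch_size=size)
    print(f"{name:>10}: w = {w:.4f}, b = {b:.4f}")
```

Because this data has no noise, all three variants reach the same line. On noisy data, SGD and mini-batch runs with a fixed learning rate would instead fluctuate around the best-fit parameters.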

<b>Importance and Use Cases in Machine Learning</b>

Gradient descent is the backbone of many machine learning algorithms, particularly in training models. It's used to minimize a cost function (also known as a loss function). The cost function measures how well the model's predictions match the actual target values. By minimizing the cost function, gradient descent helps to find the optimal set of parameters for the model.   



</div>

<div class="alert alert-block alert-info">


## Computation Graph

A computation graph is a way to represent a mathematical expression or a series of computations as a directed graph. It's a fundamental concept, especially in machine learning and deep learning, for organizing and managing complex calculations.

<b>Core Idea:</b>

- It breaks a complex computation down into a series of simpler, elementary operations.
- Each operation becomes a node in the graph, and the edges carry the variables that flow from one operation to the next.

![CG.jpeg](attachment:9e719edf-fbda-4611-bdd6-33cf5a986828.jpeg)   

</div>
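
A computation graph can be traced by hand in code. The sketch below uses a hypothetical expression, $J = 3(a + bc)$, which is not necessarily the one in the figure: the forward pass evaluates the elementary operations in order, and the backward pass applies the chain rule from the output back to the inputs.

```python
def forward(a, b, c):
    """Evaluate J = 3 * (a + b * c) one elementary operation at a time."""
    u = b * c      # node 1: multiplication
    v = a + u      # node 2: addition
    J = 3 * v      # node 3: scaling
    return {"u": u, "v": v, "J": J}


def backward(a, b, c):
    """Chain rule applied from J back to the inputs a, b, c."""
    dJ_dv = 3.0            # J = 3 * v
    dJ_du = dJ_dv * 1.0    # v = a + u  ->  dv/du = 1
    dJ_da = dJ_dv * 1.0    # v = a + u  ->  dv/da = 1
    dJ_db = dJ_du * c      # u = b * c  ->  du/db = c
    dJ_dc = dJ_du * b      # u = b * c  ->  du/dc = b
    return {"a": dJ_da, "b": dJ_db, "c": dJ_dc}


values = forward(5.0, 3.0, 2.0)
grads = backward(5.0, 3.0, 2.0)
print(values["J"])  # 33.0
print(grads)        # {'a': 3.0, 'b': 6.0, 'c': 9.0}
```

Each backward line reuses a derivative computed on the line before it, which is the reason the graph form makes gradient computation efficient.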


<div class="alert alert-block alert-success">

## Computational Graph for a Simple Logistic Regression

![LG.jpeg](attachment:7d309982-a348-4a35-b402-172b6266606a.jpeg)


</div>
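
For logistic regression, the graph is usually drawn as $z = \sum_i w_i x_i + b$, then $a = \sigma(z)$, then the binary cross-entropy loss. The sketch below assumes that standard structure for a single training example (the figure may label the nodes differently) and computes one forward and one backward pass; the function names and input values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_forward_backward(x, y, w, b):
    """One forward and backward pass for a single example with label y in {0, 1}."""
    # Forward pass through the graph: z -> a -> loss
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    a = sigmoid(z)
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Backward pass (chain rule): dL/dz simplifies to a - y for sigmoid + cross-entropy
    dz = a - y
    dw = [x_i * dz for x_i in x]
    db = dz
    return loss, dw, db


# Hypothetical example: two features, positive label, parameters initialized to zero.
loss, dw, db = logistic_forward_backward(x=[1.0, 2.0], y=1, w=[0.0, 0.0], b=0.0)
print(f"loss = {loss:.4f}, dw = {dw}, db = {db}")
```

The gradients `dw` and `db` are exactly what a gradient descent step would subtract (scaled by the learning rate) from `w` and `b`.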

<div class="alert alert-block alert-success">

## Logistic regression results

![lgr.jpeg](attachment:365beab0-2c0d-458a-abb4-7d8f5a20a3fd.jpeg)

</div>

<div class="alert alert-block alert-info">

## Computational Graph for Neural Network 

![NN.jpeg](attachment:6619c24f-e9c0-446f-8d4c-a910d7b3129f.jpeg)

</div>
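
The same forward and backward idea extends to a neural network by chaining more nodes. The sketch below assumes a tiny architecture that may differ from the one in the figure: two inputs, one hidden layer with two tanh units, and a single sigmoid output trained with binary cross-entropy. All names and parameter values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Forward pass: input -> hidden layer (tanh) -> output unit (sigmoid)."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * x_i for w, x_i in zip(row, x)) + b_j for row, b_j in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2


def loss_fn(a2, y):
    """Binary cross-entropy for one example with label y in {0, 1}."""
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))


def backward(x, y, params):
    """Backward pass: apply the chain rule node by node, from the loss to the weights."""
    W1, b1, W2, b2 = params
    z1, a1, z2, a2 = forward(x, params)
    dz2 = a2 - y                                        # sigmoid + cross-entropy
    dW2 = [dz2 * a for a in a1]
    db2 = dz2
    da1 = [dz2 * w for w in W2]
    dz1 = [da * (1 - a * a) for da, a in zip(da1, a1)]  # tanh'(z) = 1 - tanh(z)**2
    dW1 = [[dz * x_i for x_i in x] for dz in dz1]
    db1 = dz1
    return dW1, db1, dW2, db2


# Hypothetical parameters: W1 has one row per hidden unit, W2 one weight per hidden unit.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.05)
x, y = [1.0, 2.0], 1

dW1, db1, dW2, db2 = backward(x, y, params)
print("loss:", loss_fn(forward(x, params)[3], y))
print("dW1:", dW1)
print("dW2:", dW2)
```

The output layer's backward step is the same as in logistic regression; the hidden layer adds one more application of the chain rule through the tanh activation.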