import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from scipy.stats import skewnorm

## Machine Learning

### **Supervised Learning**
A type of machine learning in which the model learns from labelled data. As explained in `lesson2a`, labelled data contains both the input (the training example) and the correct answer for it.
Formally, you give the model pairs (x, y), where:
- x = input features
- y = correct target

The model learns a function **f(x) → y** from these pairs.
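
As a rough illustration of what such (x, y) pairs can look like in code, here is a minimal sketch. The feature meanings and all the numbers are made up for this example and are not taken from the lesson data.

```python
import numpy as np

# Hypothetical labelled dataset: each row of X is one input example,
# and the entry of y at the same index is its correct answer.
X = np.array([
    [50.0, 1.0],    # e.g. [size in square metres, number of bedrooms]
    [80.0, 2.0],
    [120.0, 3.0],
])
y = np.array([100.0, 160.0, 240.0])  # made-up target for each row

# The (x, y) pairs the model would learn from.
pairs = list(zip(X, y))
for features, target in pairs:
    print(features, "->", target)
```

The only requirement shown here is that inputs and targets line up one-to-one: row `i` of `X` belongs with entry `i` of `y`.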

*It has two branches:*

#### 1. **Classification:**
- The model learns to recognize categories (classes) from patterns in the data, so the output is a category or class label, such as:
    - Spam vs Non-Spam
    - Disease Type A / B 
    - Dog vs Cat

- *Example demonstration:*
    Picture the data as points scattered across a 2D space: some points belong to Class A, some to Class B. The model studies where these points tend to cluster and then works out a boundary that separates them. That boundary could be a straight line, a curve, or something far more complex, depending on the model. Once it has learned this boundary, the model can look at a new point and ask:
    “Which side of the boundary is this on?” The answer becomes the predicted class/category.

- Do note that there’s no numerical meaning to the classes. Class A isn’t “higher” or “lower” than Class B - they’re just different categories. The model simply learns patterns that distinguish one category from another.
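
The boundary idea above can be sketched with a nearest-centroid rule, one of the simplest classifiers there is. It is picked here purely for illustration and is not necessarily the model used later in the course. The data is synthetic, and `predict_class` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D points: Class A clusters around (0, 0), Class B around (4, 4).
points_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
points_b = rng.normal(loc=4.0, scale=1.0, size=(50, 2))

# "Training": summarise each class by the mean of its points (its centroid).
centroid_a = points_a.mean(axis=0)
centroid_b = points_b.mean(axis=0)

def predict_class(point):
    """Return the label of the closer centroid.

    The implied boundary is the straight line halfway between the centroids.
    """
    dist_a = np.linalg.norm(point - centroid_a)
    dist_b = np.linalg.norm(point - centroid_b)
    return "A" if dist_a < dist_b else "B"

print(predict_class(np.array([0.5, -0.2])))
print(predict_class(np.array([3.8, 4.1])))
```

With two centroids the learned boundary is a straight line. More flexible models can learn curved boundaries, but the question asked of a new point stays the same.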


#### 2. **Regression:**
- The model learns the mathematical relationship between input features (x) and a continuous target value (y), so the output is a number. Where classification decides which class something belongs to, regression estimates how much or how many. Typical uses:
    - Predicting the price of a house 
    - Estimating a student's mark this semester
    - Predicting the temperature on a particular day

- *Example demonstration:*
    Imagine all your data points plotted on a graph where the x-axis represents the input (like square footage of a house) and the y-axis represents the target (like house price). These points won’t form a perfect straight line but they’ll show a general upward or downward pattern.

    The model tries to capture that pattern by drawing the “best-fit line” (or curve) through the cloud of data points.
    This line represents what the model has learned:
    “As x increases or decreases, this is how y usually behaves.”

    Now when the model gets a new input, it simply checks:
    “Where would this new x fall on the learned line?”
    The corresponding y-value becomes the predicted output.

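- Unlike classification, there is no concept of “boundaries” here: the model outputs a value on a continuous scale rather than deciding which side of a boundary a point falls on.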
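
The best-fit line described above can be sketched with NumPy's least-squares `np.polyfit`. The house sizes and prices are synthetic (generated from an assumed slope of 1.5 and intercept of 20, plus noise), and `predict_price` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: price rises with size, plus noise, so the points
# scatter around a line rather than sitting exactly on it.
size = rng.uniform(50, 200, size=100)                     # input x
price = 1.5 * size + 20 + rng.normal(0, 10, size=100)     # target y

# Fit the best-fit line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(size, price, deg=1)

def predict_price(new_size):
    """Read the predicted y off the learned line for a new x."""
    return slope * new_size + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted price for size 120: {predict_price(120):.1f}")
```

The fitted slope and intercept should land close to the values used to generate the data, though not exactly on them, because of the noise.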

---
### Supervised Learning Flow

#### **Phase 1: Training and Testing (Model Creation & Validation)**
This phase involves training the algorithm and validating the resulting Statistical Model ($f(x)$) using labeled historical data.

| Step | Component | Detailed Role / Percentage |
| :--- | :--- | :--- |
| 1 | **Historical Data** | The initial, labeled source dataset containing **input features** and their corresponding **known output labels**. |
| 2 | **Random Sampling** | The process that randomly divides the Historical Data to create independent sets, ensuring both are representative of the whole. |
| 3 | **Training Dataset** | **80%** of the data. This is the portion used by the Machine Learning algorithm to **learn** the underlying patterns and relationships. |
| 4 | **Test Dataset** | **20%** of the data. This **hold-out set** is used exclusively to evaluate the model's performance on unseen data before deployment. |
| 5 | **Machine Learning** | The computational process where the algorithm iteratively **fits** its parameters to the Training Dataset to minimize errors. |
| 6 | **Statistical Model $f(x)$** | The resulting **learned function** (the hypothesis) from the training process, which generalizes the input-output mapping ($y = f(x)$). |
| 7 | **Prediction and Testing** | The trained model ($f(x)$) makes predictions on the Test Dataset, which are then compared against the **known labels** to measure performance (e.g., accuracy). |
| 8 | **Model Validation Outcome** | The quantitative result (e.g., accuracy score, F1-score) that determines if the model is robust and **fit for use** (meets pre-defined performance metrics). |
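
The steps in the table can be walked through in a small end-to-end sketch. It assumes synthetic two-class data and again uses a nearest-centroid rule as a stand-in for "the algorithm". Any supervised learner could take its place, and the 80/20 split simply mirrors the table.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: labelled historical data (synthetic): 2 features, labels 0 or 1.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Step 2: random sampling, done by shuffling the row indices.
indices = rng.permutation(len(X))
split = int(0.8 * len(X))

# Steps 3-4: 80% training set, 20% hold-out test set.
train_idx, test_idx = indices[:split], indices[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Steps 5-6: "fit" the model; here f(x) is defined by one centroid per class.
centroids = np.array([X_train[y_train == label].mean(axis=0) for label in (0, 1)])

def f(samples):
    """Predict the label of the nearest class centroid for each row."""
    distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Steps 7-8: predict on the test set and compare with the known labels.
accuracy = (f(X_test) == y_test).mean()
print(f"test accuracy: {accuracy:.2f}")
```

The test rows play no part in computing the centroids, which is what makes the resulting accuracy an estimate of performance on unseen data.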

---

#### **Phase 2: Prediction (Deployment)**
This phase involves using the validated model ($f(x)$) to generate output for new, unlabeled production data.

| Component | Function in the Prediction Phase |
| :--- | :--- |
| **New Data** | Unseen, unlabeled production data that is fed into the deployed model to obtain a real-time output or forecast. |
| **Model ($f(x)$)** | The validated Statistical Model from Phase 1, which now serves as the **prediction engine** in the production environment. |
| **Prediction Outcome** | The final, calculated output (label or value) generated by the model for the New Data. |
| **Improvement Note** | Prediction accuracy can often be improved by adding **more training data**, increasing model **capacity** (complexity), or **redesigning the algorithm**. |
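
The prediction phase can be sketched as follows. The model here is a stand-in: a line whose slope and intercept are hypothetical values assumed to have come out of Phase 1, and the new inputs are made up.

```python
import numpy as np

# Stand-in for the validated model from Phase 1: a fitted line whose
# parameters are hypothetical values chosen for this illustration.
slope, intercept = 1.5, 20.0

def f(x):
    """Deployed prediction engine: maps input features to a predicted value."""
    return slope * np.asarray(x, dtype=float) + intercept

# New, unlabelled production data: only inputs, no known answers.
new_data = [60.0, 110.0, 175.0]

predictions = f(new_data)
for x, y_hat in zip(new_data, predictions):
    print(f"input={x:.0f} -> prediction={y_hat:.1f}")
```

Unlike in Phase 1, there are no known labels to compare against, so the predictions are used as they are.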