In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from scipy.stats import skewnorm

## Machine Learning

### **Supervised Learning**
Supervised learning is a type of machine learning in which the model learns from labelled data. As explained in `lesson2a`, labelled data contains both the input (the training example) and the correct answer for it.
Formally, you give the model pairs (x, y), where:
- x = input features
- y = correct target

The model learns a function **f(x) → y** from these pairs.
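
To make the (x, y) idea concrete, here is a minimal sketch of what labelled data can look like in pandas. The dataset and its column names (`hours_studied`, `classes_attended`, `passed`) are made up for illustration only.

```python
import pandas as pd

# Hypothetical labelled dataset: each row is one training example.
data = pd.DataFrame({
    "hours_studied": [2, 5, 1, 8, 4],
    "classes_attended": [10, 18, 6, 20, 15],
    "passed": [0, 1, 0, 1, 1],  # the label (correct answer) for each row
})

X = data[["hours_studied", "classes_attended"]]  # x = input features
y = data["passed"]                               # y = correct target

print(X.shape, y.shape)  # (5, 2) (5,)
```

Every row of `X` is paired with the entry of `y` at the same position, and a supervised model learns from exactly these pairs.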

*It has two branches:*
#### 1. **Classification:**
- The model learns to recognize categories (classes) from patterns in the data, so the output is a class label such as:
    - Spam vs Non-Spam
    - Disease Type A / B
    - Dog vs Cat

- *Eg. demonstration:*
    Picture the data as points scattered across a 2D space: some points belong to Class A, some to Class B. The model studies where these points tend to cluster and then works out a boundary that separates them. That boundary could be a straight line, a curve, or something far more complex, depending on the model. Once it has learned this boundary, the model can look at a new point and ask:
    “Which side of the boundary is this on?” The answer becomes the predicted class/category.

- Do note that there’s no numerical meaning to the classes. Class A isn’t “higher” or “lower” than Class B - they’re just different categories. The model simply learns patterns that distinguish one category from another.
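
The 2D picture above can be sketched in a few lines of NumPy. This is only one possible illustration: it uses a nearest-centroid rule (chosen here for simplicity, not something this lesson prescribes), and the two clusters are randomly generated toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two clusters of 2D points, one per class.
class_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))

X = np.vstack([class_a, class_b])
y = np.array(["A"] * 50 + ["B"] * 50)

# "Training": remember the centre of each class.
centroids = {label: X[y == label].mean(axis=0) for label in ("A", "B")}

def predict(point):
    """Return the label whose centroid is closest to the point."""
    point = np.asarray(point, dtype=float)
    return min(centroids, key=lambda label: np.linalg.norm(point - centroids[label]))

print(predict([0.2, -0.1]))  # A
print(predict([2.8, 3.1]))   # B
```

With this rule the boundary is the straight line of points equally far from both centres. Other models would draw a different, possibly curved, boundary through the same data.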


#### 2. **Regression:**
- The model learns the mathematical relationship between the input features (x) and a continuous target value (y), so the output is a number. Unlike classification, which decides which class something belongs to, regression estimates how much or how many. It is used in cases like:
    - Predicting the price of a house 
    - Estimating a student's mark this semester
    - Predicting the temperature on a particular day

- *Eg. demonstration:*
    Imagine all your data points plotted on a graph where the x-axis represents the input (like square footage of a house) and the y-axis represents the target (like house price). These points won’t form a perfect straight line but they’ll show a general upward or downward pattern.
    The model tries to capture that pattern by drawing the “best-fit line” (or curve) through the cloud of data points.

    This line represents what the model has learned:
    “As x increases or decreases, this is how y usually behaves.”
    Now when the model gets a new input, it simply checks:
    “Where would this new x fall on the learned line?” The corresponding y-value becomes the predicted output. 

- Unlike classification, there is no concept of “boundaries” here: the output is a value on a continuous scale, not a side of a dividing line.
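
The best-fit line idea can be sketched with `np.polyfit`, which fits a least-squares polynomial (degree 1 here, i.e. a straight line). The house sizes, prices, and the underlying relationship used to generate them are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: house size (square feet) vs. price.
# The relationship price = 150 * size + 20000 plus noise is made up.
size = rng.uniform(500, 3000, size=100)
price = 150 * size + 20_000 + rng.normal(0, 10_000, size=100)

# Fit a straight line through the cloud of points (least squares).
slope, intercept = np.polyfit(size, price, deg=1)

def predict_price(new_size):
    """Read the predicted price off the learned line."""
    return slope * new_size + intercept

print("learned slope:", round(slope, 1))
print("predicted price for 1800 sq ft:", round(predict_price(1800)))
```

The learned slope should land close to the 150 used to generate the data, and a new input is predicted by reading its y-value off the line.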


---
### Machine Learning Workflow
As discussed before, a machine learning project works in an iterative cycle.
It is iterative because we can't get everything right the first time. Improving the result takes many iterations, each of which teaches us something new about the data and the problem. Here's a rough outline of the cycle:

You look at data → clean it → build a model → evaluate it → realise something can be improved → go back → fix → try again.

Let's dive into each step of an iteration:

#### 1. **Exploratory Data Analysis (EDA):**
EDA is a method of analyzing datasets to summarize their main characteristics, often using visual methods.
Here we take the raw data and ask questions of it to uncover patterns. But first, we split the data into "seen" and "unseen" portions, so that our insights come only from the "seen" sample.
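
One simple way to make that split is to shuffle the row positions and hold out a fraction of them. The sketch below assumes an 80/20 split and a made-up DataFrame; both are illustrative choices rather than fixed rules.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical dataset standing in for the raw data.
df = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=100),
    "bedrooms": rng.integers(1, 6, size=100),
})
df["price"] = 150 * df["sqft"] + 10_000 * df["bedrooms"]

# Shuffle the row positions, then hold out the last 20% as "unseen".
shuffled = rng.permutation(len(df))
split = int(0.8 * len(df))
seen = df.iloc[shuffled[:split]]
unseen = df.iloc[shuffled[split:]]

print(seen.shape, unseen.shape)  # (80, 3) (20, 3)

# All exploration from here on uses `seen` only.
print(seen.describe())
```

Keeping `unseen` untouched until evaluation means the insights gathered during EDA cannot be influenced by the data the model will later be tested on.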

In summary, this is what we explore:

1. Structure of the dataset
    - How many records/rows do we have? (Too few → the model may not learn well.)
    - How many features/columns? As a rough rule of thumb:
        - Few features → simpler models like Linear/Logistic Regression
        - Many features → more powerful models like Random Forest, XGBoost, etc.
    - What types of values exist inside those features? (Numerical? Categorical? Text?)

2. Ranges and distributions
    - Are numerical features spread widely?
    - Are there extreme values (outliers)?
    - Do some features have very low variety (the same value repeated)? → Not useful.

3. Relationships
    - Which features influence the target?
    - How do features relate to each other? Highly correlated features = redundancy → some of them may not be useful.

4. Possible enhancements
    - Can we engineer new useful features from existing ones?
    - Do we have enough valuable data to train a model effectively?

5. Potential issues
    - Missing values
    - Outliers
    - Incorrect data types
    - Imbalanced classes (important in classification tasks)
    - Too few meaningful samples
    - Need for encoding categorical features
    - Need for scaling numerical features
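
Most of the checks listed above map onto one-line pandas calls. The sketch below runs them on a small, made-up "seen" sample. The column names and values are hypothetical, and the 1.5 × IQR outlier rule is just one common convention.

```python
import numpy as np
import pandas as pd

# Hypothetical "seen" sample with a few deliberate problems.
seen = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38, 29, 95],
    "income": [30_000, 42_000, 58_000, 39_000, 61_000, 45_000, 33_000, 40_000],
    "region": ["north"] * 8,  # same value in every row
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic", "basic", "basic"],
    "churned": [0, 0, 0, 0, 0, 0, 1, 0],  # target label
})

# 1. Structure: number of rows/columns and the type of each feature.
n_rows, n_cols = seen.shape
print(n_rows, n_cols)
print(seen.dtypes)

# 2. Ranges and distributions: summary statistics and low-variety columns.
print(seen.describe())
low_variety = [col for col in seen.columns if seen[col].nunique() <= 1]
print("low-variety columns:", low_variety)

# Outliers in `age` using the 1.5 * IQR convention.
q1, q3 = seen["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = seen[(seen["age"] < q1 - 1.5 * iqr) | (seen["age"] > q3 + 1.5 * iqr)]
print(outliers)

# 3. Relationships: pairwise correlation between numeric columns.
print(seen.select_dtypes(include="number").corr())

# 5. Potential issues: missing values and class imbalance.
missing = seen.isna().sum()
class_balance = seen["churned"].value_counts(normalize=True)
print(missing)
print(class_balance)
```

On this toy sample the checks flag one missing `age`, one extreme `age` value, a `region` column that never varies, and a target where one class makes up almost all of the rows.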

#### 2. **Pre-processing:**

