# Linear Models For Regression and Classification
1. Simple linear regression using Ordinary Least Squares
2. Gradient Descent Algorithm
3. Regularized Regression Methods - Ridge, Lasso, ElasticNet
4. Logistic Regression for Classification
5. Online Learning Methods - Stochastic Gradient Descent and Passive Aggressive
6. Robust Regression - Dealing with Outliers and Model Errors
7. Polynomial Regression
8. Bias-Variance Tradeoff

In [1]:

# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## Simple Linear Regression Using Ordinary Least Squares

### Linearly Separable Data
> Two classes of data are linearly separable if you can draw a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that 
> completely separates the classes — with all points of one class on one side, and all points of the other on the other side.

### Mathematical Aspect
- We want to learn a function (a mathematical rule) that connects inputs to outputs using examples. These examples are what we call the training set.

1. What is a training set?
    Suppose we want to figure out how someone's salary depends on their years of experience.

    I'd collect data like this:

    | Years of Experience $(x^t)$ | Salary $(r^t)$ |
    |-----------------------------|----------------|
    | 1 | 30,000 |
    | 2 | 35,000 |
    | 3 | 42,000 |


    We call each row a training example.

    $x^t$: the input (features or data point) at time or index t

    $r^t$: the output (label, target value) at that index t

    We call the full training set:

    $X=\{(x^t, r^t)\}_{t=1}^N$
 
    → This just means: a set of N input-output pairs.

2.  __Goal__: Find a Function
    We want to find a function $f(x)$ that maps any input $x$ to the correct output $r$, based on the examples in the training set.

    If there's no noise, that means the data is perfectly clean and consistent. So we have:

    $r^t = f(x^t)$
    This is called __interpolation__ — we find a function that goes exactly through all the points in our dataset.

3. __Polynomial Interpolation__
    If we have $N$ data points with distinct inputs, there is a unique polynomial of degree at most $N-1$ that passes through all of them.

    For example:

    - With 3 points → 2nd-degree polynomial (a parabola)
    - With 4 points → 3rd-degree polynomial, and so on.

    This is helpful for predicting values within the range of known inputs.

    But if we try to predict for an input outside the range of what we’ve seen before, it's called:

4. __Extrapolation__
    Predicting for values outside the range of training data (e.g., future values in a time series).
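
To make interpolation and extrapolation concrete, here is a minimal sketch in plain Python. It fits the degree-2 polynomial through the three salary examples from the table above using Lagrange interpolation. The helper name `lagrange` and the query points 2.5 and 10 are illustrative choices, not part of the original notes.

```python
# Illustrative sketch: Lagrange interpolation through the three example points.
xs = [1.0, 2.0, 3.0]              # years of experience
rs = [30000.0, 35000.0, 42000.0]  # salaries from the table above


def lagrange(x, xs, rs):
    """Evaluate the polynomial of degree <= N-1 passing through (xs, rs) at x."""
    total = 0.0
    for i, (xi, ri) in enumerate(zip(xs, rs)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += ri * basis
    return total


inside = lagrange(2.5, xs, rs)    # interpolation: 2.5 lies inside [1, 3]
outside = lagrange(10.0, xs, rs)  # extrapolation: 10 lies far outside [1, 3]
print(inside, outside)
```

The polynomial reproduces every training point exactly, and the value at 2.5 (about 38,250) sits between its neighbours. The value at 10 (about 147,000) is driven entirely by the curvature inferred from three points, which is why extrapolation deserves far less trust than interpolation.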

5. __Why is There Noise?__
    Because there are often hidden variables (things we don’t observe) that also affect the output. So the true function might actually be:

    $r^t = f^*(x^t, z^t)$ is the real function.

    $z^t$ are the hidden factors we don’t see — like mood, weather, or other unknowns.

    Since we can’t observe $z^t$ , we just try to find a good approximation using what we do know: $x$.
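
The effect of a hidden variable can be simulated in a few lines. This is only a sketch: the function `f_star`, its coefficients, and the choice of a standard normal $z$ are all made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible


def f_star(x, z):
    """Hypothetical 'true' function: salary depends on x and a hidden factor z."""
    return 25000 + 5000 * x + 2000 * z


# Observe the same input x = 2 many times; z changes each time but is never recorded.
observations = [f_star(2, random.gauss(0, 1)) for _ in range(10000)]

spread = max(observations) - min(observations)
mean_r = sum(observations) / len(observations)
print(spread, mean_r)
```

Because $z$ is not observed, identical inputs produce different outputs, and from the point of view of $x$ alone that variation looks like noise. The average of the observations still lands close to $25000 + 5000 \cdot 2 = 35000$, which is the part of the output that $x$ can explain.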

6. __The Model: g(x)__  
    We build a machine learning model $g(x)$ to approximate the true function $f(x)$. Our model is not perfect, but we want it to be as close as possible to the true outputs.

7. __Measuring How Good Our Model Is__
    We use the empirical error (with a squared loss, this is the mean squared error) to measure how well our model $g(x)$ does on the training set $X$.

    $E(g|X) = \frac{1}{N}\sum_{t=1}^N [r^t-g(x^t)]^2$

    - $r^t$ is the actual value.
    - $g(x^t)$ is the model's prediction.
    - We square the difference to make all errors positive.
    - Then we average it over all examples.
    - The smaller this value, the better the model fits the training data.
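
As a worked example, the sketch below computes the empirical error $E(g|X)$ on the three salary examples for two candidate models: a hand-picked line and the ordinary least squares line $g(x) = w_0 + w_1 x$. The closed-form expressions for $w_0$ and $w_1$ are the standard ones for a single feature. The function names and the hand-picked line are assumptions for illustration.

```python
xs = [1, 2, 3]              # years of experience
rs = [30000, 35000, 42000]  # salaries


def mse(g, xs, rs):
    """Empirical error E(g|X): mean of squared differences r^t - g(x^t)."""
    return sum((r - g(x)) ** 2 for x, r in zip(xs, rs)) / len(xs)


def fit_line(xs, rs):
    """Closed-form OLS for one feature: returns (w0, w1) minimizing the MSE."""
    n = len(xs)
    x_mean = sum(xs) / n
    r_mean = sum(rs) / n
    sxy = sum((x - x_mean) * (r - r_mean) for x, r in zip(xs, rs))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = r_mean - w1 * x_mean
    return w0, w1


def guess(x):
    """Hand-picked line through the first two points (illustrative only)."""
    return 25000 + 5000 * x


w0, w1 = fit_line(xs, rs)


def ols(x):
    return w0 + w1 * x


guess_error = mse(guess, xs, rs)
ols_error = mse(ols, xs, rs)
print(w0, w1, guess_error, ols_error)
```

On this data the OLS slope is 6000 per year of experience, and its error (about 222,222) is well below that of the hand-picked line (about 1,333,333), even though the hand-picked line passes exactly through two of the three points.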

### Components of a Linear Model

1. **Features (Independent Variables)**
    * Features are the input variables used to predict the target outcome. In **simple linear regression**, there's only one feature ($X$), but in **multiple linear regression**, there are several ($X_1, X_2, X_3,\dots$). Features can be:
        - **Numerical** (e.g., age, salary)
        - **Categorical** (e.g., gender, product type—often one-hot encoded)
        - **Derived or engineered** (e.g., interaction terms, polynomial features)

2. **Weights (Coefficients)**
    * Weights ($\beta$) determine how much influence each feature has on the prediction. The model learns these weights during training using **optimization techniques** like **Ordinary Least Squares (OLS)** or **Gradient Descent**.

    * For multiple linear regression:
    * $Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon$
    * Each $\beta$ represents how much the target $Y$ changes with a unit increase in its corresponding feature, assuming other features remain constant.

3. **Bias (Intercept)**
    * The intercept $\beta_0$ is the value of $Y$ when all features $X_i$ are zero. It's crucial in adjusting the model's baseline prediction.
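
A small NumPy sketch can illustrate how weights and the intercept are recovered and interpreted. The data is synthetic and noise-free, generated from $Y = 1 + 2X_1 + 3X_2$. Those coefficients and the feature values are invented for this example. The fit uses `np.linalg.lstsq`, one of several ways to solve the least squares problem.

```python
import numpy as np

# Synthetic, noise-free data (illustrative): Y = 1 + 2*X1 + 3*X2
X = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
    [2.0, 2.0],
    [3.0, 1.0],
    [4.0, 3.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Interpretation check: raise X1 by one unit while holding X2 fixed.
x_a = np.array([1.0, 1.0, 2.0])  # [intercept term, X1, X2]
x_b = np.array([1.0, 2.0, 2.0])  # same point with X1 increased by 1
change = beta @ x_b - beta @ x_a
print(beta, change)
```

With no noise, the fit recovers the intercept and both weights, and the prediction changes by exactly $\beta_1$ when $X_1$ increases by one unit with $X_2$ held constant.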

4. **Error Term ($\epsilon$)**
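    * This represents the part of $Y$ that **cannot** be explained by the features. The classical assumptions are that it has zero mean and constant variance and is independent across observations; normality is additionally assumed when exact confidence intervals and tests are needed.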

5. **Regularization (Preventing Overfitting)**
    * In complex models, regularization is used to control weights and avoid overfitting:
    * **Lasso Regression ($\ell_1$ penalty)** shrinks some weights to zero, effectively performing feature selection.
    * **Ridge Regression ($\ell_2$ penalty)** penalizes large weights to encourage simpler models.
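
The different behaviour of the two penalties shows up even in the simplest possible setting: one feature and no intercept. In that case both problems have a closed-form solution. The sketch below assumes the objectives $\tfrac{1}{2}\sum_t (y^t - w x^t)^2 + \tfrac{\lambda}{2} w^2$ for ridge and $\tfrac{1}{2}\sum_t (y^t - w x^t)^2 + \lambda |w|$ for lasso. The data, function names, and $\lambda$ values are illustrative, and other scalings of the penalty are also common.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the unpenalized weight is 2

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)


def ridge_weight(lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2."""
    return sxy / (sxx + lam)


def lasso_weight(lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| (soft-thresholding)."""
    magnitude = max(abs(sxy) - lam, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * magnitude / sxx


for lam in [0.0, 14.0, 28.0, 40.0]:
    print(lam, ridge_weight(lam), lasso_weight(lam))
```

As $\lambda$ grows, the ridge weight shrinks towards zero but never reaches it, while the lasso weight becomes exactly zero once $\lambda$ reaches $|\sum_t x^t y^t|$. With several features, the same mechanism is what lets lasso drop features entirely, although the general lasso problem has no closed form and is solved iteratively.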

