# Regression Tutorial

Regression is a fundamental supervised learning technique that models the relationship between one or more input variables and a continuous target variable, so that the target can be predicted from new inputs.

This tutorial covers two common forms of regression:  
1. **Linear Regression**: Assumes a linear relationship between the variables.  
2. **Polynomial Regression**: Captures non-linear relationships by introducing polynomial terms.

---

## 1. Linear Regression

Linear regression assumes a **straight-line relationship** between the independent variable(s) and the target variable.

---

### Types of Linear Regression  

#### **Simple Linear Regression**  
A single independent variable is used to predict the target variable.  

The model predicts the output $Y$ as a function of the independent variable $X$:  

$$
Y = a + bX
$$

- **$Y$**: Response Variable (Dependent Variable, the quantity being predicted)  
- **$X$**: Independent Variable (Predictor)  
- **$a$**: Intercept of the regression line (value of $Y$ when $X = 0$)  
- **$b$**: Slope of the regression line (change in $Y$ for a one-unit increase in $X$).  

- **Practical Example**:  
  A company uses **advertising budget** as the independent variable to predict **sales revenue**.  
  - **Independent Variable**: $X$ (Advertising Budget)  
  - **Dependent Variable**: $Y$ (Sales Revenue)  
  - **Goal**: Find how increasing the advertising budget impacts sales.
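
To make the roles of $a$ and $b$ concrete, here is a minimal sketch that fits a line with NumPy's `np.polyfit`. The budget and revenue figures are hypothetical and were constructed to lie exactly on $Y = 5 + 2X$, so the recovered coefficients can be checked by eye.

```python
import numpy as np

# Hypothetical data: advertising budget (X) and sales revenue (Y), in $1000s.
# The values are constructed so that Y = 5 + 2X holds exactly.
X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
Y = np.array([25.0, 45.0, 65.0, 85.0, 105.0])

# np.polyfit with deg=1 fits Y = a + bX by least squares.
# Coefficients come back from highest power to lowest: [b, a].
b, a = np.polyfit(X, Y, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Predict sales revenue for a new budget of 60 (in $1000s).
predicted = a + b * 60.0
print(f"predicted sales at X = 60: {predicted:.2f}")
```

With real data the points will not fall exactly on a line, and the fitted $a$ and $b$ describe the best straight-line approximation instead.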

<img src="https://drive.google.com/uc?id=1TyiL4aN66v-I3JkvJS2ZC7VFX_jy7CWy" alt="Regression Example" width="300"/>

---

#### **Multiple Linear Regression**  
Two or more independent variables are used to predict the target variable.  

The equation for Multiple Linear Regression is:  

$$
Y = a + b_1X_1 + b_2X_2 + b_3X_3 + \dots
$$

- **Intercept**: $a$  
- **Coefficients**: $b_1, b_2, b_3, \dots$  
- **Independent Variables**: $X_1, X_2, X_3, \dots$  

- **Practical Example**:  
  A real estate agency uses **property features** to predict the **house price**.  
  - **Independent Variables**: $(X_1, X_2, X_3)$ (Square footage, number of bedrooms, and location rating)  
  - **Dependent Variable**: $Y$ (House Price)  
  - **Goal**: Analyze how multiple factors affect housing prices.
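
The same idea can be sketched with several features at once. The example below uses `np.linalg.lstsq` on a small, hypothetical set of houses; the prices are generated from made-up coefficients (not real market data) so that the fit can be verified.

```python
import numpy as np

# Hypothetical features per house: square footage, bedrooms, location rating.
features = np.array([
    [1400.0, 3.0, 7.0],
    [1600.0, 3.0, 8.0],
    [1700.0, 4.0, 6.0],
    [1875.0, 4.0, 9.0],
    [1100.0, 2.0, 5.0],
    [2350.0, 5.0, 8.0],
])

# Prices generated from known (made-up) coefficients so the fit can be checked:
# price = 50000 + 100 * sqft + 10000 * bedrooms + 5000 * rating
true_a = 50_000.0
true_b = np.array([100.0, 10_000.0, 5_000.0])
prices = true_a + features @ true_b

# Prepend a column of ones so the intercept a is estimated with the slopes.
design = np.column_stack([np.ones(len(features)), features])

# Least-squares solution of design @ [a, b_1, b_2, b_3] = prices.
coef, *_ = np.linalg.lstsq(design, prices, rcond=None)
a, b = coef[0], coef[1:]

print("intercept a:", round(a, 2))
print("coefficients b_1..b_3:", np.round(b, 2))
```

Each coefficient $b_j$ is read as the change in predicted price for a one-unit increase in $X_j$ with the other features held fixed.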

---

## 2. Polynomial Regression

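Polynomial regression extends linear regression to model **non-linear relationships** between variables. It adds powers of the feature (e.g., $X^2, X^3$) as extra terms. The model is still linear in its coefficients, so it can be fitted with the same least-squares method as linear regression.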

The equation for Polynomial Regression is:  

$$
Y = a + b_1X + b_2X^2 + b_3X^3 + \dots + b_nX^n
$$

- **$Y$**: Dependent Variable  
- **$X, X^2, X^3, \dots$**: Polynomial terms  
- **Coefficients**: $a, b_1, b_2, b_3, \dots, b_n$  

- **Practical Example**:  
   Predicting the **growth of a plant** based on **time**.  
   Initially, the plant grows slowly, then faster, and later the growth levels off, forming a curve.  
   - **Independent Variable**: $X$ (Time in days)  
   - **Dependent Variable**: $Y$ (Plant Height in cm)  
   - **Goal**: Use a polynomial regression model to predict the non-linear growth pattern over time.
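
As a rough illustration, the sketch below fits a cubic to hypothetical plant-height data and compares it with a straight-line fit. The heights are generated from a made-up cubic, $Y = 1 + 0.5X + 0.3X^2 - 0.02X^3$, purely so the result can be checked.

```python
import numpy as np

# Hypothetical measurements: time in days (X) and plant height in cm (Y).
# Heights follow a made-up cubic: Y = 1 + 0.5 X + 0.3 X^2 - 0.02 X^3
days = np.arange(0, 11, dtype=float)
height = 1.0 + 0.5 * days + 0.3 * days**2 - 0.02 * days**3

# Fit a degree-3 polynomial. np.polyfit returns the highest power first,
# so reverse the result to get [a, b_1, b_2, b_3].
coeffs = np.polyfit(days, height, deg=3)[::-1]
print("a, b_1, b_2, b_3:", np.round(coeffs, 4))

def sse(fitted):
    """Sum of squared errors between observed heights and fitted values."""
    return float(np.sum((height - fitted) ** 2))

# Compare a straight-line fit against the cubic fit on the same data.
linear_fit = np.polyval(np.polyfit(days, height, deg=1), days)
cubic_fit = np.polyval(coeffs[::-1], days)
sse_linear, sse_cubic = sse(linear_fit), sse(cubic_fit)
print(f"SSE linear: {sse_linear:.3f}, SSE cubic: {sse_cubic:.3e}")
```

The straight line leaves a visibly larger error because it cannot follow the curve, while the cubic reproduces the generated data almost exactly.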

---

## Comparison of Linear and Polynomial Regression

| **Aspect**                 | **Linear Regression**                                | **Polynomial Regression**                     |
|----------------------------|-----------------------------------------------------|---------------------------------------------|
| **Nature of Relationship** | Assumes a straight-line relationship                | Models non-linear relationships (curved)     |
| **Complexity**             | Simple and interpretable                            | More complex due to polynomial terms         |
| **Use Cases**              | When data follows a linear trend                    | When data shows curves or non-linear trends  |

---

- **Linear Regression** is suitable for data with a linear relationship, like predicting sales based on advertising.  
- **Polynomial Regression** is a better choice when the relationship is curved, such as growth patterns over time. Keep the degree low, since high-degree polynomials tend to overfit.  

Understanding the type of relationship in the data helps choose the correct regression technique to make accurate predictions. 🚀


# **Derivation of the Simple Linear Regression Formula**

Simple Linear Regression (SLR) aims to model the relationship between one independent variable $X$ and one dependent variable $Y$. The relationship is represented as:

$$
Y = a + bX
$$

Where:  
- $Y$: Dependent variable (target)  
- $X$: Independent variable (predictor)  
- $a$: Intercept (value of $Y$ when $X = 0$)  
- $b$: Slope (rate of change of $Y$ with respect to $X$)  

The goal is to determine the values of the slope $b$ and intercept $a$ that minimize the sum of squared residuals (errors). Let’s derive the formula step by step.

---

## **1. Sum of Squared Residuals**

The residual (error) for each data point is the difference between the observed value $Y_i$ and the predicted value $\hat{Y}_i$:

$$
e_i = Y_i - \hat{Y}_i
$$

The predicted value $\hat{Y}_i$ is given as:

$$
\hat{Y}_i = a + bX_i
$$

Thus, the residual becomes:

$$
e_i = Y_i - (a + bX_i)
$$

The objective is to minimize the **Sum of Squared Errors (SSE)**:

$$
SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(Y_i - (a + bX_i)\right)^2
$$
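
The SSE can be computed directly from its definition. The sketch below uses a small hypothetical dataset and compares two candidate lines; the function name `sse` is just an illustrative choice.

```python
import numpy as np

def sse(a, b, X, Y):
    """Sum of squared errors for the line Y_hat = a + b * X."""
    residuals = Y - (a + b * X)
    return float(np.sum(residuals ** 2))

# Small hypothetical dataset.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Two candidate lines: Y = X and Y = 2.2 + 0.6 X.
sse_first = sse(0.0, 1.0, X, Y)
sse_second = sse(2.2, 0.6, X, Y)
print(f"SSE for Y = X:          {sse_first:.2f}")
print(f"SSE for Y = 2.2 + 0.6X: {sse_second:.2f}")
```

The second line has the smaller SSE (about 2.4 versus 9.0), so it fits this data better. Least squares looks for the pair $(a, b)$ that makes this quantity as small as possible.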

---

## **2. Deriving the Normal Equations**

To minimize SSE, we differentiate it with respect to $a$ and $b$ and set both partial derivatives to zero. Because SSE is a convex quadratic in $a$ and $b$, the resulting stationary point is the minimum.

### **Step 1: Partial Derivative with Respect to $a$**

Taking the derivative of SSE with respect to $a$:

$$
\frac{\partial SSE}{\partial a} = -2 \sum_{i=1}^n \left(Y_i - a - bX_i\right)
$$

Set this derivative to zero:

$$
\sum_{i=1}^n (Y_i - a - bX_i) = 0
$$

Rearranging, we get:

$$
\sum Y_i = na + b \sum X_i
$$

---

### **Step 2: Partial Derivative with Respect to $b$**

Taking the derivative of SSE with respect to $b$:

$$
\frac{\partial SSE}{\partial b} = -2 \sum_{i=1}^n X_i \left(Y_i - a - bX_i\right)
$$

Set this derivative to zero:

$$
\sum_{i=1}^n X_i (Y_i - a - bX_i) = 0
$$

Expanding and rearranging:

$$
\sum X_iY_i = a \sum X_i + b \sum X_i^2
$$

---

## **3. Solving for $a$ and $b$**

We now have two equations:

1. $ \sum Y_i = na + b \sum X_i $  
2. $ \sum X_iY_i = a \sum X_i + b \sum X_i^2 $  
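
The two equations above form a 2×2 linear system in $a$ and $b$, so they can also be solved numerically. The sketch below does this with `np.linalg.solve` on a small hypothetical dataset and then checks that both derivative conditions hold at the solution.

```python
import numpy as np

# Small hypothetical dataset.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n = len(X)

# The two equations in matrix form:
# [ n        sum(X)   ] [a]   [ sum(Y)   ]
# [ sum(X)   sum(X^2) ] [b] = [ sum(X*Y) ]
A = np.array([[n, X.sum()],
              [X.sum(), (X ** 2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.4f}, b = {b:.4f}")

# At the solution, both partial derivatives of SSE vanish:
# the residuals sum to zero, and so do the X-weighted residuals.
residuals = Y - (a + b * X)
print("sum of residuals:           ", round(residuals.sum(), 10))
print("sum of X-weighted residuals:", round((X * residuals).sum(), 10))
```

For this data the system gives $a = 2.2$ and $b = 0.6$, up to floating-point rounding.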

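### Solving for $b$ (Slope)

First, divide the first equation by $n$ to express $a$ in terms of $b$, writing $\bar{X} = \frac{1}{n}\sum X_i$ and $\bar{Y} = \frac{1}{n}\sum Y_i$:

$$
a = \bar{Y} - b\bar{X}
$$

Substitute this into the second equation, using $\sum X_i = n\bar{X}$:

$$
\sum X_iY_i = (\bar{Y} - b\bar{X})\, n\bar{X} + b \sum X_i^2
$$

Collecting the terms in $b$:

$$
b \left(\sum X_i^2 - n\bar{X}^2\right) = \sum X_iY_i - n\bar{X}\bar{Y}
$$

Since $\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$ and $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_iY_i - n\bar{X}\bar{Y}$, the slope $b$ is given as: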

$$
b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
$$

Where:  
- $\bar{X}$: Mean of $X$ values  
- $\bar{Y}$: Mean of $Y$ values  

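The numerator is proportional to the sample **covariance** between $X$ and $Y$, and the denominator is proportional to the sample **variance** of $X$. Both carry the same scaling factor ($n - 1$, or $n$ for the population versions), which cancels in the ratio.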

---

### Solving for $a$ (Intercept)

Once we have $b$, the intercept $a$ follows from the first equation divided by $n$:

$$
a = \bar{Y} - b\bar{X}
$$
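
The closed-form expressions for $b$ and $a$ translate directly into code. This is a minimal sketch on a small hypothetical dataset; the function name `fit_simple_linear_regression` is an illustrative choice, and the result is cross-checked against `np.polyfit` and against the covariance/variance ratio.

```python
import numpy as np

def fit_simple_linear_regression(X, Y):
    """Return (a, b) for the least-squares line Y = a + b * X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean, y_mean = X.mean(), Y.mean()
    # Slope: sum of cross-deviations over sum of squared X deviations.
    b = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
    # Intercept: the fitted line passes through (x_mean, y_mean).
    a = y_mean - b * x_mean
    return a, b

# Small hypothetical dataset.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

a, b = fit_simple_linear_regression(X, Y)
print(f"a = {a:.4f}, b = {b:.4f}")

# Cross-check 1: NumPy's own least-squares line fit.
b_np, a_np = np.polyfit(X, Y, deg=1)

# Cross-check 2: sample covariance divided by sample variance.
b_cov = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
print(f"polyfit: a = {a_np:.4f}, b = {b_np:.4f}; cov/var slope = {b_cov:.4f}")
```

All three routes should agree up to floating-point rounding, since they compute the same least-squares solution.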

---

## **4. Final SLR Formula**

The final Simple Linear Regression equation is:

$$
Y = a + bX
$$

Where:  
- $b$ (Slope):  

$$
b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
$$  

- $a$ (Intercept):  

$$
a = \bar{Y} - b\bar{X}
$$

---

## **Summary**

1. **Slope $b$** measures the rate of change of $Y$ with respect to $X$ (calculated as the covariance of $X$ and $Y$ divided by the variance of $X$).  
2. **Intercept $a$** is the predicted value of $Y$ when $X = 0$.  
3. Together, $a$ and $b$ define the best-fit line that minimizes the sum of squared residuals.

The derived formulas allow us to find the optimal regression line for any given set of data. 🚀




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import the dataset loader (pandas is already imported above as pd)
from sklearn.datasets import fetch_california_housing


In [None]:
# Load the California Housing Dataset
housing = fetch_california_housing()

# Convert the dataset to a DataFrame for easier exploration
data = pd.DataFrame(housing.data, columns=housing.feature_names)
# Target is the median house value per district, in units of $100,000
data['Target'] = housing.target

# Display the first few rows
print(data.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  Target  
0    -122.23   4.526  
1    -122.22   3.585  
2    -122.24   3.521  
3    -122.25   3.413  
4    -122.25   3.422  
