---
# **MULTIPLE LINEAR REGRESSION**
---

# #️⃣ **Why Do We Need Multiple Linear Regression?**

In real life, an outcome (sales, price, marks, etc.) usually depends on **more than one factor**.

Examples:

* Sales depend on **TV + Radio + Newspaper** ads.
* House price depends on **size + location + age + rooms**.
* Student marks depend on **study hours + sleep + tuition + motivation**.

If we use separate simple linear regressions for each factor, we get confusing results because:

1. **We cannot make one combined prediction**.
2. **Predictors may be correlated with each other**, causing misleading coefficients.

---

# #️⃣ **The Multiple Linear Regression Model**

Multiple Linear Regression allows us to predict a response ($Y$) based on multiple predictors ($X_1, X_2, ..., X_p$).

$$
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \varepsilon
$$

Instead of a line, we are now fitting a plane (in 3D) or a hyperplane (in higher dimensions).

Where:


| Symbol          | Meaning                                                           |
| --------------- | ----------------------------------------------------------------- |
| $Y$             | Response/outcome (e.g., sales)                                    |
| $X_j$           | $j$-th predictor (e.g., TV budget)                                |
| $\beta_0$       | Intercept                                                         |
| $\beta_j$       | Effect of predictor $X_j$, **holding all other predictors fixed** |
| $\varepsilon$   | Random error                                                      |

---

## ⭐ Interpretation of $\beta_j$ (Very Important!)

$$
\beta_j = \text{Change in } Y \text{ from a 1-unit increase in } X_j \text{ while keeping all other variables constant.}
$$

Example (Advertising):

* $\beta_{\text{radio}} = 0.189$ means:

> “If you increase radio advertising by \$1,000 **while keeping TV and newspaper fixed**, sales increase by about 189 units.”

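This interpretation is fundamentally different from simple regression, where the slope for a predictor ignores the other predictors and can absorb the effect of anything correlated with it.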

---

# #️⃣ **Estimating Coefficients (Least Squares)**

The prediction for any observation uses the estimated coefficients:

$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p
$$

Least squares chooses the estimates $\hat{\beta}_0, \dots, \hat{\beta}_p$ that minimize:

$$
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where:

* $y_i$ = actual value
* $\hat{y}_i$ = predicted value

In multiple regression, the formulas involve matrices, so we rely on software.
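
To make "the formulas involve matrices" concrete: in matrix form the least squares estimate solves the normal equations $(A^TA)\hat{\beta} = A^Ty$, where $A$ is the data matrix with an extra column of ones for the intercept. The sketch below is only an illustration, assuming NumPy is available. The toy data is made up and generated without noise from $y = 1 + 2x_1 + 3x_2$, so the fit should recover those three numbers.

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 predictors, no noise,
# generated from y = 1 + 2*x1 + 3*x2 so the right answer is known.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 5.0],
])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Design matrix: the leading column of ones carries the intercept beta_0.
A = np.column_stack([np.ones(len(y)), X])

# Route 1: library least squares solver (what software typically does).
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Route 2: solve the normal equations (A'A) beta = A'y directly.
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Residual sum of squares for the fitted model.
y_hat = A @ beta_hat
rss = float(np.sum((y - y_hat) ** 2))

print("beta_hat (lstsq):           ", np.round(beta_hat, 6))
print("beta_hat (normal equations):", np.round(beta_normal, 6))
print("RSS:", round(rss, 10))
```

Both routes give the same coefficients here. In practice the library solver is usually preferred because it is numerically more stable than forming $A^TA$ yourself.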

---



# #️⃣ **The "Shark Attack" Analogy**

Imagine a dataset showing a strong positive relationship between Ice Cream Sales and Shark Attacks.

* Does eating ice cream cause shark attacks? No.
* The Hidden Variable: Temperature.
* Hot days $\rightarrow$ More people buy ice cream.
* Hot days $\rightarrow$ More people swim $\rightarrow$ More shark attacks.
* If you run a multiple regression with both Ice Cream and Temperature, the coefficient for Ice Cream would drop to approximately zero, because Temperature already accounts for the relationship.

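
The analogy can be checked with a small simulation. This is a sketch assuming NumPy. Every number in it (temperature distribution, slopes, noise levels) is invented for illustration, and by construction ice cream has no direct effect on attacks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# ASSUMED data-generating process: temperature drives both variables,
# and ice cream sales have NO direct effect on shark attacks.
temperature = rng.normal(25.0, 5.0, size=n)
ice_cream = 10.0 * temperature + rng.normal(0.0, 20.0, size=n)
attacks = 0.5 * temperature + rng.normal(0.0, 1.0, size=n)

def ols(predictors, response):
    """Least squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(response))] + list(predictors))
    coef, *_ = np.linalg.lstsq(A, response, rcond=None)
    return coef

# Simple regression: attacks on ice cream only.
simple = ols([ice_cream], attacks)

# Multiple regression: attacks on ice cream AND temperature.
multiple = ols([ice_cream, temperature], attacks)

print("ice cream slope, simple regression:  ", round(simple[1], 4))
print("ice cream slope, multiple regression:", round(multiple[1], 4))
print("temperature slope, multiple regression:", round(multiple[2], 4))
```
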
---

# **Important Questions Multiple Regression Answers**

These are:

1. **Is at least one predictor useful?**
2. **Which predictors matter?**
3. **How well does the model fit the data?**
4. **How accurate are our predictions?**

---

# **Question 1: Is There Any Relationship Between the Predictors and the Response? (F-Test)**

We don't just look at individual p-values anymore. We test if at least one predictor is useful.

We test:

* Null hypothesis: $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$  (All coefficients are zero; the model is useless)

* Alternative: $H_a: \text{At least one } \beta_j \neq 0$

### The Test: F-Statistic

Logic: The F-statistic compares the explained variability (TSS - RSS) to the unexplained variability (RSS), each divided by its degrees of freedom.

$$
F =
\frac{(TSS - RSS)/p}
{RSS/(n - p - 1)}
$$

Where:

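| Symbol | Meaning                                                               |
| ------ | --------------------------------------------------------------------- |
| $TSS$  | Total variability in $Y$: $\sum_{i=1}^{n} (y_i - \bar{y})^2$          |
| $RSS$  | Unexplained variability: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$         |
| $n$    | Number of observations                                                |
| $p$    | Number of predictors                                                  |

* If **F ≈ 1**, the data are consistent with no relationship (the predictors look useless).
* If **F ≫ 1**, at least one predictor matters. How large F needs to be depends on $n$ and $p$, so in practice we read the p-value that software reports for the F-statistic.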

**Why not just check every individual p-value?** If you have 100 predictors, simple chance says about 5 of them will look "significant" (p < 0.05) even if they are random noise. The F-statistic tests the **whole team** at once, protecting you from these false positives.
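
The F-statistic formula above can be computed directly from TSS and RSS. The following is a minimal sketch assuming NumPy. The helper name `f_statistic` and both simulated datasets are made up for illustration: one response really depends on a predictor, the other is pure noise.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for H0: beta_1 = ... = beta_p = 0."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta_hat) ** 2)         # unexplained variability
    tss = np.sum((y - y.mean()) ** 2)             # total variability
    return float(((tss - rss) / p) / (rss / (n - p - 1)))

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Response that truly depends on the first predictor.
y_signal = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)

# Response that is unrelated to every predictor.
y_noise = rng.normal(size=n)

f_signal = f_statistic(X, y_signal)
f_noise = f_statistic(X, y_noise)

print("F with a real relationship:", round(f_signal, 2))
print("F with pure noise:         ", round(f_noise, 2))
```

The first value should be far above 1 and the second should be of the same order as 1, which matches the two bullet points above.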

---

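# #️⃣ **Testing a Subset of Predictors**

Sometimes we want to test only a subset of $q$ predictors.

Suppose we test whether the last $q$ predictors have zero coefficients:

$$
H_0: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_p = 0
$$

We fit a second model that uses all the predictors *except* those last $q$, and call its residual sum of squares $RSS_0$.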

Then:

$$
F =
\frac{(RSS_0 - RSS)/q}
{RSS/(n - p - 1)}
$$

This tells us whether removing those $q$ predictors significantly worsens the fit.
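
The subset test amounts to fitting two models and comparing their RSS values. Here is a hedged sketch assuming NumPy. The helper names `rss_of_fit` and `partial_f` are hypothetical, and the data is simulated so that only the first two columns matter.

```python
import numpy as np

def rss_of_fit(X, y):
    """RSS of a least squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta_hat) ** 2))

def partial_f(X, y, q):
    """F-statistic for H0: the last q columns of X have zero coefficients."""
    n, p = X.shape
    rss = rss_of_fit(X, y)                # full model with all p predictors
    rss0 = rss_of_fit(X[:, : p - q], y)   # reduced model without the last q
    return ((rss0 - rss) / q) / (rss / (n - p - 1))

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 4))

# ASSUMED truth: only columns 0 and 1 affect y; columns 2 and 3 are noise.
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Test the two noise columns (they are the last two columns of X).
f_irrelevant = partial_f(X, y, q=2)

# Reorder so the two real predictors come last, then test those instead.
X_reordered = X[:, [2, 3, 0, 1]]
f_relevant = partial_f(X_reordered, y, q=2)

print("F for dropping the noise columns:", round(f_irrelevant, 2))
print("F for dropping the real columns: ", round(f_relevant, 2))
```

Dropping the noise columns barely changes RSS, so that F stays small. Dropping the real predictors inflates $RSS_0$ and the F-statistic becomes very large.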

---

# #️⃣ **Question 2: Deciding Which Variables Matter (Variable Selection)**

We often want to trim the "dead weight" predictors. Checking every possible subset of variables quickly becomes computationally infeasible (with $p$ predictors there are $2^p$ candidate models; for 30 variables that is more than 1 billion!), so we use shortcuts.

Three classical methods:

---

## **A. Forward Selection**

Start with **no** predictors (intercept only), then:

1. Add the single predictor that reduces RSS the most.
2. Add the next best predictor, given the ones already in the model.
3. Stop when nothing else helps, based on a rule such as a p-value threshold or AIC.

Forward selection can be used even when there are more variables than samples ($p > n$).
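
The forward selection steps can be sketched in a few lines. This is an illustrative version assuming NumPy, not a production implementation. It uses AIC as the stopping rule, computed up to an additive constant as $n \log(RSS/n) + 2k$ (a common form for Gaussian linear models, with $k$ the number of fitted coefficients). The helper names and the simulated data are hypothetical.

```python
import numpy as np

def rss_of_fit(X, y, columns):
    """RSS of a least squares fit of y on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta_hat) ** 2))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k = number of coefficients."""
    return n * np.log(rss / n) + 2 * k

def forward_selection(X, y):
    """Greedy forward selection; returns column indices in the order added."""
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    current_aic = aic(rss_of_fit(X, y, selected), n, 1)  # intercept only
    while remaining:
        # Step 1/2: find the unused predictor that lowers RSS the most.
        best = min(remaining, key=lambda j: rss_of_fit(X, y, selected + [j]))
        candidate_aic = aic(rss_of_fit(X, y, selected + [best]), n, len(selected) + 2)
        # Step 3: stop when the best addition no longer improves AIC.
        if candidate_aic >= current_aic:
            break
        selected.append(best)
        remaining.remove(best)
        current_aic = candidate_aic
    return selected

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 6))

# ASSUMED truth: column 2 has the strongest effect, column 0 a weaker one,
# and the other four columns are noise.
y = 3.0 * X[:, 2] + 1.5 * X[:, 0] + rng.normal(size=n)

selected = forward_selection(X, y)
print("Predictors in the order they were added:", selected)
```

The strongest predictor enters first and the weaker one second. A noise column can occasionally sneak in afterwards, because AIC is a fairly lenient stopping rule.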

---

## **B. Backward Selection**

Start with **all** predictors, then:

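1. Remove the least significant predictor (the one with the highest p-value).
2. Refit and repeat until all remaining predictors are significant.

Backward selection cannot be used if **p > n**, because the full model cannot be fit by least squares (it only works when samples > variables).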

---

## **C. Stepwise (Mixed) Selection**

Combination:

* Add predictors like forward
* Remove predictors whose p-value becomes high

More flexible and commonly used.

---

# **Question 3: How Well Does the Model Fit?**
We use two main metrics:

1. **$R^2$ (Coefficient of Determination)**

$$R^2 = 1 - \frac{RSS}{TSS} = Cor(Y, \hat{Y})^2$$

* It measures the proportion of the variance in $Y$ that the model explains.
* **$R^2$ never decreases** on the training data when you add more predictors (even useless ones!). So a tiny increase in $R^2$ (like the one from adding Newspaper) suggests the variable isn't actually helpful.

2. **RSE (Residual Standard Error)**

$$RSE = \sqrt{\frac{RSS}{n - p - 1}}$$

* It estimates the standard deviation of the error $\varepsilon$ (roughly, how far off our predictions typically are).

* Unlike $R^2$, RSE can actually get worse (increase) when you add a useless variable: $p$ goes up, so the denominator $n - p - 1$ shrinks, and RSS may not drop enough to compensate.
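
Both metrics are easy to compute by hand from RSS and TSS. The sketch below assumes NumPy. The helper name `fit_metrics` and the simulated data (one real predictor plus one junk predictor) are hypothetical. It fits the model with and without the junk variable so the two metrics can be compared.

```python
import numpy as np

def fit_metrics(X, y):
    """Return (R^2, RSE) for a least squares fit of y on X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    rse = np.sqrt(rss / (n - p - 1))
    return float(r2), float(rse)

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                 # unrelated to y by construction
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

# Model with the real predictor only.
r2_small, rse_small = fit_metrics(np.column_stack([x1]), y)

# Model with the real predictor plus the junk one.
r2_big, rse_big = fit_metrics(np.column_stack([x1, junk]), y)

print("R^2 without / with junk:", round(r2_small, 4), round(r2_big, 4))
print("RSE without / with junk:", round(rse_small, 4), round(rse_big, 4))
```

$R^2$ for the larger model is guaranteed to be at least as high as for the smaller one. The RSE can move in either direction, depending on whether the small drop in RSS outweighs the lost degree of freedom.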


---

# **Question 4: How accurate are our predictions?**

Prediction formula:

$$
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1X_1 + \dots + \hat{\beta}_pX_p
$$

Three types of uncertainty:

---

## **1. Coefficient Uncertainty → Confidence Interval**

* Gives a range for the **average** response at a given set of $X$ values, reflecting that the coefficients are only estimates.

* Example: "If we spend \$100k on TV, the average sales across all such markets will be between 10,985 and 11,528."

This addresses:

> How close is our estimate of the **average** response?

---

## **2. Model Bias**

A linear model may be an oversimplification of the true relationship.
For now, we proceed as if the linear model were correct.

---

## **3. Random Error → Prediction Interval**

* Predicts the response for a single specific value (e.g., one specific city next month).

* Example: "For this specific market, sales will be between 7,930 and 14,580."

* Note: A prediction interval is always wider than the confidence interval at the same $X$ values, because it accounts for both the uncertainty in the model coefficients (reducible error) AND the random noise in that one specific data point (irreducible error).

Addresses:

> How much will a **new individual observation** vary?
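
The difference between the two intervals shows up clearly in code. This is a rough sketch assuming NumPy. It uses the standard formulas $SE_{\text{mean}} = RSE\sqrt{x_0^T(A^TA)^{-1}x_0}$ and $SE_{\text{new}} = RSE\sqrt{1 + x_0^T(A^TA)^{-1}x_0}$, where $A$ is the data matrix with an intercept column. As a loudly stated simplification, it uses a normal critical value in place of the exact $t$ quantile with $n - p - 1$ degrees of freedom, which is only reasonable for fairly large $n$. The data and the query point `x0` are made up.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

# Fit the model and estimate the noise level (RSE).
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
p = X.shape[1]
rse = np.sqrt(np.sum((y - A @ beta_hat) ** 2) / (n - p - 1))

# Hypothetical query point; the leading 1 multiplies the intercept.
x0 = np.array([1.0, 0.5, -0.5])
y0_hat = float(x0 @ beta_hat)

# x0' (A'A)^{-1} x0, computed without forming the inverse explicitly.
leverage = float(x0 @ np.linalg.solve(A.T @ A, x0))

se_mean = rse * np.sqrt(leverage)         # coefficient uncertainty only
se_new = rse * np.sqrt(1.0 + leverage)    # plus irreducible random error

# ASSUMPTION: normal approximation to the t critical value (large n).
z = NormalDist().inv_cdf(0.975)

conf_int = (y0_hat - z * se_mean, y0_hat + z * se_mean)
pred_int = (y0_hat - z * se_new, y0_hat + z * se_new)

print("prediction:         ", round(y0_hat, 3))
print("confidence interval:", np.round(conf_int, 3))
print("prediction interval:", np.round(pred_int, 3))
```

The extra `1.0 +` inside the square root is the irreducible error term, which is why the prediction interval is always the wider of the two.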


---

## 📌 **Example 1: House Price Prediction**

Predictors:

* Size
* Number of bedrooms
* Distance to city center

Model:

$$
\text{Price} = \beta_0 + \beta_1(\text{Size}) + \beta_2(\text{Bedrooms}) + \beta_3(\text{Distance}) + \varepsilon
$$

Interpretation:

* $\beta_3 < 0$: houses farther from the city center are cheaper, **holding size and bedrooms constant**
* $\beta_2$: effect of adding an extra bedroom **holding size and distance constant**

---

## 📌 **Example 2: Student Marks**

Predictors:

* Study hours
* Sleep hours
* Attendance

Multiple regression separates:

* True effect of study
* True effect of sleep
* True effect of attendance

Even if students who study more also attend more, regression estimates each effect with the other predictors held fixed.

---

## 📌 **Example 3: Salary Prediction**

Predictors:

* Experience
* Education
* Number of certifications

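Simple regression (salary vs. certifications) may be misleading because certifications and experience are correlated: the certifications slope would absorb part of the experience effect.

Multiple regression fixes this by estimating the effect of certifications with experience and education held fixed.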

---
