# Comparing Simple and Multiple Linear Regression Using the F-Test

This notebook demonstrates how to:
- Fit simple and multiple linear regression models
- Compare their explanatory power using R² and the F-statistic
- Decide whether a new predictor variable adds significant value to the model

An F-test compares the variance explained by a model (or by group membership) with the variance left unexplained. It is best known from analysis of variance (ANOVA), where it tests whether the means of several groups differ. Here we use the same idea as a partial F-test: it checks whether a larger regression model fits significantly better than a smaller model nested inside it.


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

np.random.seed(42)
n = 100
X1 = np.random.normal(10, 2, n)
X2 = 0.5 * X1 + np.random.normal(0, 1, n)
noise = np.random.normal(0, 2, n)

Y = 3 + 2 * X1 + 0.5 * X2 + noise

df = pd.DataFrame({'X1': X1, 'X2': X2, 'Y': Y})
df.head()


,X1,X2,Y
0,10.993428,4.081343,27.743103
1,9.723471,4.44109,25.789057
2,11.295377,5.304974,30.409344
3,13.04606,5.720753,34.0601
4,9.531693,4.604561,21.610328


## Goal

We want to check if adding `X2` significantly improves prediction of `Y` compared to using only `X1`.

We'll use the **F-test** to compare two nested models:
- Simple model: Y ~ X1
- Full model: Y ~ X1 + X2
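
Stated as hypotheses, comparing these nested models amounts to testing whether the coefficient on `X2` is zero. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon$ are notation assumed here for illustration; they do not appear elsewhere in the notebook.

$$
\begin{aligned}
\text{Simple (reduced): } & Y = \beta_0 + \beta_1 X_1 + \varepsilon \\
\text{Full: } & Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \\
H_0: & \ \beta_2 = 0 \qquad H_1: \ \beta_2 \neq 0
\end{aligned}
$$

A large F-statistic is evidence against $H_0$, that is, evidence that `X2` explains variation in `Y` beyond what `X1` already explains.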


In [4]:
X_simple = sm.add_constant(df['X1'])
model_simple = sm.OLS(df['Y'], X_simple).fit()
print("simple model summary: ")
display(model_simple.summary())

simple model summary: 


Dep. Variable:,Y,R-squared:,0.806
Model:,OLS,Adj. R-squared:,0.804
Method:,Least Squares,F-statistic:,407.4
Date:,"Wed, 30 Jul 2025",Prob (F-statistic):,1.1e-36
Time:,09:55:03,Log-Likelihood:,-219.12
No. Observations:,100,AIC:,442.2
Df Residuals:,98,BIC:,447.4
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.2605,1.205,1.046,0.298,-1.130,3.651
X1,2.4420,0.121,20.184,0.000,2.202,2.682

Omnibus:,2.243,Durbin-Watson:,2.302
Prob(Omnibus):,0.326,Jarque-Bera (JB):,1.739
Skew:,0.15,Prob(JB):,0.419
Kurtosis:,3.572,Cond. No.,55.4


### F-Test Formula

$$
F = \frac{(R^2_{full} - R^2_{reduced}) / (k_{full} - k_{reduced})}{(1 - R^2_{full}) / (n - k_{full})}
$$

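Where:
- $R^2_{full}$: R² of the full model
- $R^2_{reduced}$: R² of the reduced model
- $k_{full}$, $k_{reduced}$: number of estimated parameters in each model (including the intercept)
- $n$: number of data points

Under the null hypothesis that the extra predictors have zero coefficients, $F$ follows an F distribution with $(k_{full} - k_{reduced},\ n - k_{full})$ degrees of freedom.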
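
As a minimal sketch, the formula above can be wrapped in a small helper that also returns a p-value. The function name `partial_f_test` is hypothetical, and using `scipy.stats.f.sf` for the p-value is an assumption (SciPy is not imported elsewhere in this notebook).

```python
from scipy import stats


def partial_f_test(r2_full, r2_reduced, n, k_full, k_reduced):
    """Partial F-test for nested OLS models, computed from their R² values.

    k_full and k_reduced count all estimated parameters, intercept included.
    Hypothetical helper for illustration; not part of statsmodels.
    """
    df_num = k_full - k_reduced   # number of added predictors
    df_den = n - k_full           # residual degrees of freedom of the full model
    f_stat = ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
    # Upper-tail probability of the F distribution with (df_num, df_den) df
    p_value = stats.f.sf(f_stat, df_num, df_den)
    return f_stat, p_value


# Rounded R² values as reported by the two model summaries in this notebook
f_stat, p_value = partial_f_test(0.814, 0.806, n=100, k_full=3, k_reduced=2)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```

Because the inputs are R² values rounded to three decimals, the result is only approximate; passing the unrounded `rsquared` attributes of the fitted models would remove that rounding error.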


In [6]:
X_multiple = sm.add_constant(df[['X1', 'X2']])
model_multiple = sm.OLS(df['Y'], X_multiple).fit()
print("multiple model summary: ")
display(model_multiple.summary())

multiple model summary: 


Dep. Variable:,Y,R-squared:,0.814
Model:,OLS,Adj. R-squared:,0.811
Method:,Least Squares,F-statistic:,212.8
Date:,"Wed, 30 Jul 2025",Prob (F-statistic):,3.39e-36
Time:,09:57:07,Log-Likelihood:,-216.94
No. Observations:,100,AIC:,439.9
Df Residuals:,97,BIC:,447.7
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.9164,1.196,0.766,0.446,-1.458,3.291
X1,2.2384,0.154,14.521,0.000,1.932,2.544
X2,0.4754,0.229,2.078,0.040,0.021,0.929

Omnibus:,3.125,Durbin-Watson:,2.22
Prob(Omnibus):,0.21,Jarque-Bera (JB):,3.08
Skew:,0.108,Prob(JB):,0.214
Kurtosis:,3.832,Cond. No.,62.5


In [15]:
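# R² values copied from the regression summaries above.
# They are rounded to three decimals, so the resulting F is approximate.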
r2_full = 0.814
r2_reduced = 0.806
n = 100
k_full = 3     # X1, X2, intercept
k_reduced = 2  # X1, intercept

numerator = (r2_full - r2_reduced) / (k_full - k_reduced)
denominator = (1 - r2_full) / (n - k_full)

F_manual = numerator / denominator
print(f"Manual F-statistic: {F_manual:.4f}")


Manual F-statistic: 4.1720


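Adding X2 to a model that already includes X1 improves the fit only slightly: R² rises from 0.806 to 0.814. The improvement is statistically significant at the 5% level but not dramatic: F ≈ 4.17 on (1, 97) degrees of freedom exceeds the critical value of about 3.94, which agrees with the p-value of 0.040 for the X2 coefficient in the full model summary. Because the R² inputs are rounded, the manual F is slightly below the exact value, which for a single added predictor equals the squared t-statistic of X2 (2.078² ≈ 4.32).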

### When is it worth adding a predictor?

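- If the **adjusted R² increases** (plain R² never decreases when a predictor is added, so a rise in R² alone proves nothing), and  
- The **p-value is small (< 0.05)** for the new variable's t-test or for the F-test comparing the nested models,  
→ Then the added variable likely contributes meaningfully.

This helps guard against **overfitting** by keeping only predictors that earn their place. A statistically significant predictor can still have a small practical effect, as X2 does here.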
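
Plain R² cannot decrease when a predictor is added, so adjusted R², which penalizes each extra parameter, is a useful companion to the F-test. The sketch below recomputes it from the rounded R² values in the two summaries; the helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, n_predictors):
    """Adjusted R² for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Rounded R² values from the two summaries in this notebook
adj_simple = adjusted_r2(0.806, n=100, n_predictors=1)
adj_full = adjusted_r2(0.814, n=100, n_predictors=2)

print(f"Adjusted R² (Y ~ X1):      {adj_simple:.3f}")
print(f"Adjusted R² (Y ~ X1 + X2): {adj_full:.3f}")
print("Adjusted R² improved:", adj_full > adj_simple)
```

Since the inputs are rounded, the results can differ in the third decimal from the `Adj. R-squared` values that statsmodels reports; the fitted models expose the exact figures as `rsquared_adj`.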
