# Multiple Linear Regression

### California Test Score Data Set

The California Standardized Testing and Reporting data set contains data on test performance, school characteristics, and student demographic backgrounds. 

Test scores are the average of the reading and math scores. The student–teacher ratio used here is the number of students in the district divided by the number of full-time equivalent teachers.
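
As a minimal sketch of the two definitions above, the snippet below builds both variables on a toy table. The column names `read_scr`, `math_scr`, `enrl_tot`, and `teachers`, and all the values, are hypothetical and chosen only for illustration; they are not taken from the data file.

```python
import pandas as pd

# Toy district-level data; column names and values are hypothetical.
toy = pd.DataFrame({
    "read_scr": [650.0, 660.0, 640.0],
    "math_scr": [654.0, 662.0, 650.0],
    "enrl_tot": [1000, 2400, 540],
    "teachers": [50.0, 100.0, 30.0],
})

# Test score: simple average of the reading and math scores.
toy["testscr"] = (toy["read_scr"] + toy["math_scr"]) / 2

# Student-teacher ratio: students per full-time equivalent teacher.
toy["str"] = toy["enrl_tot"] / toy["teachers"]

print(toy[["testscr", "str"]])
```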

The demographic variables include the percentage of students who are in the public assistance program CalWorks, the percentage of students who qualify for a reduced-price lunch, and the percentage of students who have English as a second language.

The dataset is used in the reference textbook Introduction to Econometrics, 4th edition 
(Stock and Watson).

In [1]:
# Importing useful libraries and loading data set

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.compat import lzip
import statsmodels.stats.diagnostic as smd
import statsmodels.formula.api as smf

In [2]:
df = pd.read_excel("caschool.xlsx")

## Multiple Linear Regression

Let's run a multiple linear regression that includes the percentage of English learners in the district (<b>el_pct</b>) as an additional regressor.

The regression equation is

$$ TestScore_i = \beta_0 + \beta_1 STR_i + \beta_2 PctEL_i + u_i $$


In [3]:
formula_el = 'testscr ~ str + el_pct'  

model_el = smf.ols(formula_el, df).fit(cov_type = "HC0")
print(model_el.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.426
Model:                            OLS   Adj. R-squared:                  0.424
Method:                 Least Squares   F-statistic:                     225.4
Date:                Sun, 02 Mar 2025   Prob (F-statistic):           4.28e-67
Time:                        21:21:57   Log-Likelihood:                -1716.6
No. Observations:                 420   AIC:                             3439.
Df Residuals:                     417   BIC:                             3451.
Df Model:                           2                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    686.0322      8.697     78.882      0.0
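
The coefficients and robust standard errors can also be read directly from a fitted result instead of from the printed summary. The sketch below assumes a synthetic stand-in for the data: the frame `toy`, its values, and the coefficients used to generate it are made up for illustration. Only the formula and the `cov_type="HC0"` option mirror the cell above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the school data; all values are hypothetical.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "str": rng.normal(20.0, 2.0, n),
    "el_pct": rng.uniform(0.0, 80.0, n),
})
toy["testscr"] = (
    690.0 - 1.0 * toy["str"] - 0.6 * toy["el_pct"] + rng.normal(0.0, 10.0, n)
)

# Same formula and covariance option as in the notebook cell.
fit = smf.ols("testscr ~ str + el_pct", toy).fit(cov_type="HC0")

# params holds the coefficient estimates, bse the (robust) standard errors.
table = pd.DataFrame({"coef": fit.params, "robust_se": fit.bse})
print(table.round(3))
```

Applied to the notebook's own `model_el`, the same two attributes give the values reported in the estimated equation.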

The resulting estimated model is now

$$ \hat{TestScore_i} = 686.0 - 1.10 STR_i -0.65 PctEL_i $$
$$ \hspace{0.35in} (8.70) \hspace{0.15in} (0.43) \hspace{0.7in} (0.03)$$
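
The numbers in parentheses are the heteroskedasticity-robust standard errors requested with `cov_type = "HC0"`. As a rough sketch of what that option computes, the snippet below evaluates the HC0 "sandwich" formula $(X'X)^{-1} X' \mathrm{diag}(\hat{u}_i^2) X (X'X)^{-1}$ by hand with NumPy. The data are synthetic and every variable name is an assumption made for illustration; this is not the notebook's data set.

```python
import numpy as np

# Synthetic data with an error variance that grows with x2 (heteroskedasticity).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(20.0, 2.0, n)
x2 = rng.uniform(0.0, 80.0, n)
u = rng.normal(0.0, 1.0, n) * (1.0 + 0.1 * x2)
y = 700.0 - 1.0 * x1 - 0.5 * x2 + u

# OLS estimates: beta_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# HC0 sandwich estimator: (X'X)^{-1} X' diag(resid^2) X (X'X)^{-1}
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(cov_hc0))

# Classical (homoskedasticity-only) standard errors, for comparison.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

print("HC0 SE:      ", se_hc0)
print("classical SE:", se_classical)
```

HC0 applies no degrees-of-freedom correction, so in small samples it tends to be slightly smaller than variants such as HC1.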

## Control Variables

Next, we add the percentage of students who qualify for a reduced-price lunch (<b>meal_pct</b>) as a control variable.

In [4]:
formula_control = 'testscr ~ str + el_pct + meal_pct'  

model_control = smf.ols(formula_control, df).fit(cov_type = "HC0")
print(model_control.summary())

                            OLS Regression Results                            
Dep. Variable:                testscr   R-squared:                       0.775
Model:                            OLS   Adj. R-squared:                  0.773
Method:                 Least Squares   F-statistic:                     457.8
Date:                Sun, 02 Mar 2025   Prob (F-statistic):          2.28e-131
Time:                        21:22:01   Log-Likelihood:                -1520.5
No. Observations:                 420   AIC:                             3049.
Df Residuals:                     416   BIC:                             3065.
Df Model:                           3                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    700.1500      5.542    126.338      0.0

The resulting estimated model is now

$$ \hat{TestScore_i} = 700.2 - 1.00 STR_i -0.12 PctEL_i -0.55 LchPct_i$$
$$ \hspace{0.55in} (5.54) \hspace{0.15in} (0.27) \hspace{0.55in} (0.03) \hspace{0.55in} (0.02)$$
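
Adding the lunch variable moves the coefficient on $PctEL$ from $-0.65$ to $-0.12$. One possible reading is that the shorter regression attributed part of the effect of a correlated, omitted variable to $PctEL$. The sketch below reproduces that mechanism on synthetic data; the variables `x` and `control`, the helper `ols`, and all numeric values are hypothetical and are not estimates from the school data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# 'control' plays the role of the added control; 'x' is correlated with it.
control = rng.normal(0.0, 1.0, n)
x = 0.8 * control + rng.normal(0.0, 0.6, n)

# True model: both variables affect y.
y = 1.0 - 0.2 * x - 1.0 * control + rng.normal(0.0, 0.5, n)

def ols(y, *cols):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_short = ols(y, x)           # control omitted
beta_long = ols(y, x, control)   # control included

# In the population the short slope is -0.2 + (-1.0) * cov(x, control) / var(x)
# = -0.2 - 0.8 / 1.0 = -1.0, while the long slope is -0.2.
print("slope on x, control omitted: ", beta_short[1])
print("slope on x, control included:", beta_long[1])
```

In this setup the slope on `x` shrinks toward its true value once `control` enters the regression, which is the same qualitative pattern seen between the two estimated equations.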