---
title: Linear Regression
author: George Whittington
date: today
date-format: long
---

# Imports and Data

In [1]:
import pandas as pd
import numpy as np

import statsmodels.formula.api as smf
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("~/Data/Synthetic_Student_Performance.csv")

df.head()

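|   | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |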


## Data Cleaning

This is simulated data, so no real cleaning is needed, but a few quality-of-life improvements make it easier to work with.

In [3]:
# convert column names to snake_case
df.columns = df.columns.str.lower().str.replace(' ', '_')

# option 1: store the binary variable as a pandas categorical
df_cat = df.copy()
df_cat["extracurricular_activities"] = df_cat["extracurricular_activities"].astype("category")
# option 2: dummy-code it into 0/1 indicator columns instead
df_dum = df.copy()
df_dum = pd.get_dummies(df_dum, columns=["extracurricular_activities"], dtype=int)
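
One caveat with the dummy-coded version: if both indicator columns are used alongside an intercept, they sum to the intercept column and the design matrix becomes rank-deficient. The sketch below uses a small hypothetical frame (`toy`, not the real dataset) to show how `drop_first=True` keeps a single indicator; whether that is needed depends on how the dummies are used later.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "hours_studied": [7, 4, 8],
    "extracurricular_activities": ["Yes", "No", "Yes"],
})

# drop_first=True drops the first level ("No"), leaving one 0/1 indicator
toy_dum = pd.get_dummies(
    toy, columns=["extracurricular_activities"], drop_first=True, dtype=int
)

print(toy_dum.columns.tolist())
```

The remaining `extracurricular_activities_Yes` column is then read relative to the dropped "No" level.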


### Categorical Column

In [4]:
print(df_cat.shape)
df_cat.head()

(10000, 6)


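|   | hours_studied | previous_scores | extracurricular_activities | sleep_hours | sample_question_papers_practiced | performance_index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |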


### Dummy Columns

In [5]:
print(df_dum.shape)
df_dum.head()

(10000, 7)


|   | hours_studied | previous_scores | sleep_hours | sample_question_papers_practiced | performance_index | extracurricular_activities_No | extracurricular_activities_Yes |
|---|---|---|---|---|---|---|---|
| 0 | 7 | 99 | 9 | 1 | 91.0 | 0 | 1 |
| 1 | 4 | 82 | 4 | 2 | 65.0 | 1 | 0 |
| 2 | 8 | 51 | 7 | 2 | 45.0 | 0 | 1 |
| 3 | 5 | 52 | 5 | 2 | 36.0 | 0 | 1 |
| 4 | 7 | 75 | 8 | 5 | 66.0 | 1 | 0 |


# Simple Linear Regression

## Using Statsmodels

In [6]:
model = smf.ols("performance_index ~ previous_scores", data=df).fit()

In [7]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:      performance_index   R-squared:                       0.838
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                 5.156e+04
Date:                Thu, 25 Dec 2025   Prob (F-statistic):               0.00
Time:                        01:48:32   Log-Likelihood:                -34657.
No. Observations:               10000   AIC:                         6.932e+04
Df Residuals:                    9998   BIC:                         6.933e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         -15.1818      0.320    -

## Manual Calculation

In [21]:
# sample size
n = len(df)

# number of predictors
p = 1

# design matrix: a column of ones for the intercept, then the predictor
X = np.c_[np.ones(n), df["previous_scores"]]

# target vector
y = df["performance_index"]


### Coefficients Table

#### Estimates

$$
    \hat{\beta} = (X^{\top}X)^{-1} X^{\top}Y
$$
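
For reference, this estimator comes from minimizing the residual sum of squares. A short derivation, assuming $X^{\top}X$ is invertible (that is, the columns of $X$ are linearly independent); $S(\beta)$ is notation introduced here for the sum of squared residuals:

$$
\begin{aligned}
    S(\beta) &= (Y - X\beta)^{\top}(Y - X\beta) \\
    \frac{\partial S}{\partial \beta} &= -2X^{\top}(Y - X\beta) = 0 \\
    X^{\top}X\hat{\beta} &= X^{\top}Y \\
    \hat{\beta} &= (X^{\top}X)^{-1} X^{\top}Y
\end{aligned}
$$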

In [38]:
# X^T X: the matrix that gets inverted in the equation above
XTX = X.T @ X

# X^T y: the right-hand factor in the equation above
XTY = X.T @ y

# solve the linear system (X^T X) beta = X^T y directly rather than forming
# the inverse explicitly; this is cheaper and numerically more stable
beta_hat = np.linalg.solve(XTX, XTY)

# nicer formatting
pd.Series(beta_hat, index=["Intercept", "previous_scores"])

Intercept         -15.181799
previous_scores     1.013837
dtype: float64
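
As a sanity check on the normal-equations approach, the same coefficients can be obtained from NumPy's least-squares routine, which works on `X` directly without forming `X.T @ X`. The sketch below runs on synthetic data generated in place (the seed, sample size, and true coefficients are assumptions chosen for illustration, not values from the dataset) so that it runs without the CSV.

```python
import numpy as np

# Synthetic stand-in data: a single predictor with a linear signal plus noise
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(40, 100, size=n)
y = -15.0 + 1.0 * x + rng.normal(0, 8, size=n)

# Design matrix with an intercept column, built the same way as above
X = np.c_[np.ones(n), x]

# Normal equations, solved as a linear system
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The two estimates should agree up to floating-point error
print(np.allclose(beta_solve, beta_lstsq))
```

For a well-conditioned problem like this one, the two routes give the same answer; `lstsq` tends to be the safer choice when predictors are nearly collinear.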