# Linear Regression

> The importance of the sweep operation in statistical computing is not so much that it is an inversion technique, but rather that it is a conceptual tool for understanding the least squares process. Without this conceptual tool, it is **extremely difficult** to explain concepts such as absorption and what the R notation is testing in terms of the parameters of the model.  
> --James Goodnight (1979)

In [1]:
import sweepystats as sw
import numpy as np

Let's generate some random data. Here we simulate 10 samples, each with 3 covariates.

In [2]:
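# design matrix: 10 samples x 3 covariates, entries drawn from N(10, 3^2)
X = np.random.normal(10, 3, size=(10, 3))
# true effect sizes
beta = np.array([1., 2., 3.])
# np.random.normal(5) draws one scalar from N(5, 1), so the same offset is added to every sample
y = X @ beta + np.random.normal(5)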

We can form an instance of the `LinearRegression` class and fit it as follows:

In [3]:
ols = sw.LinearRegression(X, y)
ols.fit()

100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3306.94it/s]
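To make the sweep operation from the opening quote concrete, here is a minimal NumPy sketch. It assumes one common sign convention for the sweep operator and is not taken from the `sweepystats` source; the helper `sweep` and the `*_demo` variables are hypothetical names used only for illustration. Sweeping the Gram matrix of `[X, y]` on its first `p` diagonal entries leaves the least squares coefficients in the last column, the residual sum of squares in the bottom-right corner, and $-(X^TX)^{-1}$ in the top-left block.

```python
import numpy as np

def sweep(A, k):
    """Sweep a symmetric matrix on its k-th diagonal entry (returns a copy).

    Hypothetical helper using one common convention:
    A[k, k] -> -1 / A[k, k], row/column k are divided by A[k, k],
    and every other entry gets A[i, j] - A[i, k] * A[k, j] / A[k, k].
    """
    A = np.array(A, dtype=float)
    d = A[k, k]
    row = A[k, :].copy()
    col = A[:, k].copy()
    A -= np.outer(col, row) / d   # update entries outside row/column k
    A[k, :] = row / d             # overwrite row k
    A[:, k] = col / d             # overwrite column k
    A[k, k] = -1.0 / d            # overwrite the pivot
    return A

# demo data, generated separately from the notebook's X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(10, 3, size=(10, 3))
y_demo = X_demo @ np.array([1., 2., 3.]) + rng.normal(size=10)
p_demo = X_demo.shape[1]

# Gram matrix of the augmented data [X, y]
Z = np.column_stack([X_demo, y_demo])
G = Z.T @ Z

# sweep once per covariate
for k in range(p_demo):
    G = sweep(G, k)

beta_sweep = G[:p_demo, -1]          # least squares coefficients
rss_sweep = G[-1, -1]                # residual sum of squares
neg_inv_gram = G[:p_demo, :p_demo]   # -(X^T X)^{-1}
```

The progress bar above counts 3 steps for 3 covariates, which is consistent with one sweep per covariate, although the library's internals are not shown on this page.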


The estimated coefficients $\hat{\beta}$ (effect sizes) can be extracted as

In [4]:
beta = ols.coef()
beta

array([1.09983632, 2.02886888, 3.2716904 ])

The residual sum of squares is

In [5]:
resid = ols.resid()
resid

np.float64(2.3953241840840747)

Var($\hat{\beta}$):

In [6]:
cov = ols.cov()
cov

array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])

Standard deviation of $\hat{\beta}$ (square roots of the diagonal of Var($\hat{\beta}$)):

In [7]:
std = ols.coef_std()
std

array([0.05686169, 0.05871306, 0.05988785])

$R^2$ (coefficient of determination):

In [8]:
ols.R2()

np.float64(0.9969400431529486)
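As a rough cross-check of what such a number means, the sketch below computes $R^2$ by hand under the usual centered definition, $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. Whether `R2()` uses exactly this definition is an assumption, since this page does not say; the `*_demo` data are generated separately for illustration.

```python
import numpy as np

# demo data, generated separately from the notebook's X and y
rng = np.random.default_rng(1)
X_demo = rng.normal(10, 3, size=(10, 3))
y_demo = X_demo @ np.array([1., 2., 3.]) + rng.normal(size=10)

# least squares fit without an intercept column, as in the notebook
beta_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# residual sum of squares and centered total sum of squares
rss = np.sum((y_demo - X_demo @ beta_hat) ** 2)
tss = np.sum((y_demo - y_demo.mean()) ** 2)

# assumed definition: R^2 = 1 - RSS / TSS
r2_centered = 1.0 - rss / tss
```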

## Comparison with `numpy`

For comparison, let's check whether these results agree with the least squares solution implemented in the `numpy` package.

In [9]:
# least squares solution (np.linalg.lstsq uses an SVD-based LAPACK routine, not QR)
beta, resid, _, _ = np.linalg.lstsq(X, y)
beta

array([1.09983632, 2.02886888, 3.2716904 ])

In [10]:
resid  # reference residual sum of squares from numpy

array([2.39532418])
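The second value returned by `np.linalg.lstsq` is the sum of squared residuals, not the vector of individual residuals. A small sketch on separately generated demo data (the `*_demo` names are illustrative) confirms this by recomputing the quantity from the fitted coefficients:

```python
import numpy as np

# demo data, generated separately from the notebook's X and y
rng = np.random.default_rng(2)
X_demo = rng.normal(10, 3, size=(10, 3))
y_demo = X_demo @ np.array([1., 2., 3.]) + rng.normal(size=10)

# rcond=None selects the machine-precision-based cutoff
beta_hat, rss_lstsq, rank, sing_vals = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# recompute the residual sum of squares directly
residuals = y_demo - X_demo @ beta_hat
rss_manual = np.sum(residuals ** 2)

# for a 1-D y, rss_lstsq is a length-1 array when X has full column rank
# and more rows than columns
agree = np.allclose(rss_lstsq[0], rss_manual)
```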

`numpy` doesn't have built-in methods for Var($\hat{\beta}$) or the standard deviation of $\hat{\beta}$, but we can compute them manually:

In [11]:
# reference Var(beta_hat) = sigma2 * (X^T X)^{-1}, with sigma2 = RSS / (n - p)
n, p = X.shape  # 10 samples, 3 covariates
sigma2 = resid[0] / (n - p)
beta_cov = sigma2 * np.linalg.inv(X.T @ X)
beta_cov

array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])
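For reference, the manual computation above corresponds to the standard ordinary least squares formulas, written here with $n$ samples, $p$ covariates, and $\mathrm{RSS}$ for the residual sum of squares:

$$
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p},
\qquad
\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 \, (X^T X)^{-1},
\qquad
\mathrm{RSS} = \lVert y - X\hat{\beta} \rVert_2^2 .
$$

In this example $n = 10$ and $p = 3$, so the residual sum of squares is divided by $7$.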

In [12]:
# reference std of beta_hat: square roots of the diagonal of Var(beta_hat)
beta_std = np.sqrt(np.diag(beta_cov))
beta_std

array([0.05686169, 0.05871306, 0.05988785])