# Chapter 10 Simple Linear Regression and Correlation

```python
import polars as pl
from polars import col, lit
from scipy import stats
import numpy as np
import altair as alt

RNG = np.random.default_rng()
DATA = {}  # input data
ANS = {}   # calculation results
```

## 10.1 A Probabilistic Model for Simple Linear Regression

*Linear regression analysis* begins by fitting a straight line, $y = \beta_0 + \beta_1 x$, to a set of paired data $\{(x_i, y_i), i = 1, 2, \ldots , n\}$ on two numerical variables $x$ and $y$. The *least squares (LS) estimates* $\hat{\beta}_0$ and $\hat{\beta}_1$ minimize $Q = \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i) ]^2$ and are given by

$$
\begin{align*}
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x},\\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}}
\end{align*}
$$

where $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$. The *fitted values* are given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the *residuals* by $e_i = y_i - \hat{y}_i$.
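As a minimal sketch, the LS formulas above can be evaluated directly with NumPy. The data values are made up for illustration (they are not from the text), and the helper name `ls_fit` is hypothetical.

```python
import numpy as np


def ls_fit(x, y):
    """Return the least squares estimates (b0, b1) for y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = np.sum((x - x_bar) * (y - y_bar))  # S_xy
    s_xx = np.sum((x - x_bar) ** 2)           # S_xx
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Illustrative data (assumed, not taken from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = ls_fit(x, y)
y_hat = b0 + b1 * x   # fitted values
resid = y - y_hat     # residuals e_i
```

A quick sanity check on any such fit is that the residuals sum to zero and are orthogonal to $x$, i.e. $\sum e_i = 0$ and $\sum x_i e_i = 0$; both follow from setting the partial derivatives of $Q$ to zero.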

The total sum of squares (SST), regression sum of squares (SSR) and error sum of squares (SSE) are defined as $\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$, $\mathrm{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$, and $\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$. These sums of squares satisfy the identity $\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$. A measure of goodness of fit of the least squares line is the *coefficient of determination*,

$$
r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
$$

which represents the proportion of variation in $y$ that is accounted for by regression on $x$. The *correlation coefficient* $r$ equals $\pm\sqrt{r^2}$, where $\mathrm{sign}(r) = \mathrm{sign}(\hat{\beta}_1)$. In fact, $r = \hat{\beta}_1 (s_x / s_y)$, where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$, respectively.
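The sum of squares identity and the two expressions for $r$ can be checked numerically. The following is a sketch on made-up data (not from the text); it uses `np.polyfit` only as a convenient way to get the LS line.

```python
import numpy as np

# Illustrative data (assumed, not taken from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit returns coefficients from highest degree to lowest
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares

r2 = ssr / sst                         # coefficient of determination
# r = b1 * (s_x / s_y), using sample standard deviations (ddof=1)
r = b1 * x.std(ddof=1) / y.std(ddof=1)
```

Here `sst` should equal `ssr + sse` up to floating point error, and `r ** 2` should match `r2`.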

The *probabilistic model* for linear regression assumes that $y_i$ is the observed value of r.v. $Y_i \thicksim N(\mu_i, \sigma^2)$, where $\mu_i = \beta_0 + \beta_1 x_i$ and the $Y_i$ are independent. An unbiased estimate of $\sigma^2$ is provided by $s^2 = \mathrm{SSE}/(n - 2)$ with $n-2$ d.f. The estimated standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ equal

$$
\mathrm{SE}(\hat{\beta}_0) = s\sqrt{\frac{\sum x_i^2}{n\,S_{xx}}} \quad \text{and}\quad \mathrm{SE}(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}.
$$
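A short sketch of how $s$ and the two standard errors might be computed, again on made-up data that are not from the text. All variable names are illustrative.

```python
import numpy as np

# Illustrative data (assumed, not taken from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

s_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
b0 = y.mean() - b1 * x.mean()

sse = np.sum((y - (b0 + b1 * x)) ** 2)
s = np.sqrt(sse / (n - 2))  # estimate of sigma, n - 2 d.f.

se_b0 = s * np.sqrt(np.sum(x ** 2) / (n * s_xx))
se_b1 = s / np.sqrt(s_xx)
```

Since $\sum x_i^2 = S_{xx} + n\bar{x}^2$, the expression for $\mathrm{SE}(\hat{\beta}_0)$ is the same as $s\sqrt{1/n + \bar{x}^2/S_{xx}}$, which is a handy cross-check.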

### Ex 10.1

Tell whether the following mathematical models are theoretical and deterministic or empirical and probabilistic.

1. Maxwell's equations of electromagnetism.
2. An econometric model of the U.S. economy.
3. A credit scoring model for the probability of a credit applicant being a good risk as a function of selected variables, e.g., income, outstanding debts, etc.

### Ex 10.2

Tell whether the following mathematical models are theoretical and deterministic or empirical and probabilistic.

1. An item response model for the probability of a correct response to an item on a "true-false" test as a function of the item's intrinsic difficulty.
2. The Cobb-Douglas production function, which relates the output of a firm to its capital and labor inputs.
3. Kepler's laws of planetary motion.

### Ex 10.3

Give an example of an experimental study in which the explanatory variable is controlled at fixed values, while the response variable is random. Also, give an example of an observational study in which both variables are uncontrolled and random.

## 10.2 Fitting the Simple Linear Regression Model

## 10.3 Statistical Inference for Simple Linear Regression

The estimated standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ are used to construct confidence intervals and perform hypothesis tests on $\beta_0$ and $\beta_1$. For example, a $100(1 - \alpha)\%$ confidence interval on $\beta_1$ is given by

$$
\hat{\beta}_1 \pm t_{n-2,\alpha/2}\,\mathrm{SE}(\hat{\beta}_1).
$$

A common use of the fitted regression model is to predict $Y^*$ for specified $x = x^*$ or to estimate $\mu^* = E(Y^*)$. In both cases we have

$$
\hat{Y}^* = \hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*.
$$

However, a $100(1 - \alpha)\%$ prediction interval for $Y^*$ is wider than a $100(1 - \alpha)\%$ confidence interval for $\mu^*$ because $Y^*$ is an r.v., while $\mu^*$ is a fixed constant.

Residuals are key to checking the model assumptions such as normality of the $Y_i$, linearity of the regression model, constant variance $\sigma^2$, and independence of the $Y_i$. Residuals are also useful for detecting *outliers* and *influential observations*. Many of these diagnostic checks are done by plotting residuals in appropriate ways.

Correlation analysis assumes that the data $\{(x_i, y_i), i = 1, 2, \ldots , n\}$ form a random sample from a *bivariate normal distribution* with correlation coefficient $\rho$. An estimate of $\rho$ is the sample correlation coefficient $r$. An exact test of $H_0: \rho = 0$ is a $t$-test with $n - 2$ d.f. based on the test statistic

$$
t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}.
$$

This equals $t = \hat{\beta}_1/\mathrm{SE}(\hat{\beta}_1)$, which is used to test $H_0: \beta_1 = 0$ in the related regression model. In other cases only approximate large sample inferences are available. These inferences use the parameterization

$$
\psi = \frac{1}{2}\log_e\left(\frac{1 + \rho}{1 - \rho}\right).
$$

The sample estimate $\hat{\psi}$ of $\psi$, obtained by substituting $\hat{\rho} = r$ in the above expression, is approximately normally distributed with mean $= \frac{1}{2}\log_e\left(\frac{1 + \rho}{1 - \rho}\right)$ and variance $= \frac{1}{n - 3}$.
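The inference formulas summarized in this section can be sketched with `scipy.stats`. The data below are simulated with a fixed seed purely for illustration (the seed, sample size, and true coefficients are assumptions, not values from the text). The interval for the correlation coefficient is obtained by back-transforming a normal-theory interval for the log-transformed parameter; that back-transformation step is a standard use of the approximation rather than something stated above.

```python
import numpy as np
from scipy import stats

# Simulated data (seed and true coefficients are illustrative assumptions)
rng = np.random.default_rng(0)
n = 30
x = np.linspace(0.0, 10.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
alpha = 0.05

# Least squares fit and standard error of the slope
s_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s / np.sqrt(s_xx)

# 100(1 - alpha)% confidence interval for the slope
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Exact t-test of zero correlation; t_r should equal b1 / SE(b1)
r = np.corrcoef(x, y)[0, 1]
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
t_b1 = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)  # two-sided

# Large-sample interval via the transformation 0.5 * log((1 + r) / (1 - r)),
# which is arctanh(r); its approximate variance is 1 / (n - 3)
psi_hat = np.arctanh(r)
half_width = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
ci_rho = (np.tanh(psi_hat - half_width), np.tanh(psi_hat + half_width))
```

The two test statistics `t_r` and `t_b1` agree up to rounding, which illustrates that testing for zero correlation and testing for zero slope are the same test.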

## 10.4 Regression Diagnosis

## 10.5 *Correlation Analysis

## 10.6 Pitfalls of Regression and Correlation Analysis