## Chapter 4 - Training Models

### Linear Regression - Validity of the Coefficient Estimates

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split

In [2]:
# Ingest, preprocessing
df = pd.read_csv('Advertising.csv', index_col=0)

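# Univariate case: first column only (TV) as the predictor
X1 = df.iloc[:, :1]
df1 = df.iloc[:, [0, 1, 2, 3]]  # all four columns

# Multivariate case: first three columns as predictors
X2 = df.iloc[:, :3]
df2 = df.iloc[:, [0, 3]]  # first column and the response

# Response: fourth column (sales)
y = df.iloc[:, 3]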

In [4]:
# For testing
# display(df.head())
# display(X1.head())
# display(df1.head())
# display(X2.head())
# display(y.head())

In [4]:
# Obtaining the coefficients using statsmodels
X1 = sm.add_constant(X1)
reg11 = sm.OLS(y, X1)
results11 = reg11.fit()
print(results11.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     312.1
Date:                Mon, 11 May 2020   Prob (F-statistic):           1.47e-42
Time:                        18:45:18   Log-Likelihood:                -519.05
No. Observations:                 200   AIC:                             1042.
Df Residuals:                     198   BIC:                             1049.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.0326      0.458     15.360      0.0

In [15]:
Xarray2 = np.c_[np.ones((X2.shape[0], 1)), X2]  # Add x0=1 (one row per observation in X2)

In [16]:
# For multiple linear regression, obtain the coefficients using the normal equations
Theta_hat2 = np.dot(np.dot(np.linalg.inv(np.dot(Xarray2.T, Xarray2)), Xarray2.T),y)
print(Theta_hat2)

[ 2.93888937e+00  4.57646455e-02  1.88530017e-01 -1.03749304e-03]
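The explicit inverse above follows the normal equations literally. As a hedged aside, the same coefficients can usually be obtained without forming the inverse, which tends to be more stable numerically. The sketch below uses synthetic data (an assumption, not the Advertising data), and the names `X_design`, `y_demo` and `true_theta` are illustrative only.

```python
import numpy as np

# Synthetic data for illustration only (not the Advertising data set)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_theta = np.array([3.0, 0.05, 0.2, 0.0])      # hypothetical coefficients
X_design = np.c_[np.ones((n, 1)), X]              # add x0 = 1
y_demo = X_design @ true_theta + rng.normal(scale=0.5, size=n)

# 1) Normal equations with an explicit inverse
theta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y_demo

# 2) Same linear system, solved without forming the inverse
theta_solve = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# 3) Dedicated least squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

print(theta_inv)
print(theta_solve)
print(theta_lstsq)
```

All three approaches should agree up to floating point error on a well-conditioned problem like this one.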


In [11]:
# Obtaining the coefficients using statsmodels
X2 = sm.add_constant(X2)
reg21 = sm.OLS(y, X2)
results21 = reg21.fit()
print(results21.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Mon, 11 May 2020   Prob (F-statistic):           1.58e-96
Time:                        19:56:37   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9389      0.312      9.422      0.0

### Validity of the Coefficient Estimates

The model assumes that the true relationship between $x$ and $y$ is $y=f(x)+\epsilon$ where $\epsilon$ is a mean-zero random error term. For univariate linear regression,
$$y = f(x) + \epsilon =\beta_0 + \beta_1x_1 + \epsilon$$ 

and for multivariate linear regression with $p$ variables, 
$$y=f(x) + \epsilon = \beta_0 + \beta_1x_{1} + \beta_2x_{2}+ \cdots + \beta_px_{p} + \epsilon$$ 

Hence, for the univariate case, $p=1$. 

Here, $\beta_0$ is the intercept term - the expected value of $y$ when $x_j=0 \,\,\forall j \in \{1,\cdots,p\}$. $\beta_j$ is the average increase in $y$ associated with a one-unit increase in $x_j$, holding the other variables fixed. The error term is a catch-all for what the model misses: the true relationship may not be exactly linear, there may be other variables that cause variation in $y$, and there may be measurement error. This error term is assumed to be independent of the $x_j$.

This model is the population regression line, and the linear approximation model with parameters $\hat{\beta_0}, \hat{\beta_1}, \cdots$ is the least squares line. The true relationship is <u>generally not known</u> and is estimated from the observed data. Fundamentally, we are <u>using observations from an experiment to estimate characteristics of a large population</u>.


If we use the sample mean $\hat{\mu}$ to estimate the population mean $\mu$, we say the estimate is unbiased: on average (across many samples), we expect $\hat{\mu}$ to equal $\mu$. A single estimate may overestimate or underestimate $\mu$, but if we could compute $\hat{\mu}$ from a very large number of independent samples and average the estimates, that average would converge to $\mu$. Hence, an unbiased estimator does not systematically overestimate or underestimate the true parameter. The same holds for the least squares estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ in this model.
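A small simulation can make unbiasedness concrete. This is only a sketch: the population parameters `mu` and `sigma`, the sample size and the number of repetitions are all hypothetical choices, and the population is assumed to be Gaussian.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 10.0, 2.0          # hypothetical population mean and standard deviation
n, n_repeats = 30, 20_000      # sample size and number of repeated samples

# Each row is one sample of size n; each row mean is one estimate mu_hat
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
mu_hats = samples.mean(axis=1)

print("one single estimate:     ", mu_hats[0])
print("average of all estimates:", mu_hats.mean())
```

Individual estimates scatter above and below `mu`, while their average sits very close to it.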

If some estimates are above and some are below the true parameter $\mu$, how can we establish how far a single estimate $\hat\mu$ is likely to be from the true parameter? We use the standard error of $\hat\mu$, $\text{SE}(\hat\mu)$, to help us:

$$\text{Var}(\hat\mu) = \text{SE}(\hat\mu)^2 = \frac{\sigma^2}{n}$$

where $\sigma$ is the population standard deviation. Observe from the formula that the standard error decreases as $n$ increases. The more observations we have in a sample, the smaller the standard error of $\hat\mu$. For univariate linear regression, we want to compute the standard errors associated with $\hat{\beta_0}$ and $\hat{\beta_1}$ and they are:

$$\text{SE}(\hat{\beta_0})^2 = \sigma^2\left[\frac 1n + \frac{\bar x^2}{\sum_{i=1}^n(x^{(i)}-\bar x)^2}\right]$$

$$\text{SE}(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^n(x^{(i)}-\bar x)^2}$$

where $\sigma^2 = \text{Var}(\epsilon)$. The assumption is that the errors $\epsilon_i$ for the individual observations are uncorrelated and have the same variance $\sigma^2$. In general $\sigma$ is not known, but it can be estimated from the data. The estimate is known as the <u>residual standard error</u> (RSE) and is given by the formula
$$\text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}}$$
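The formulas above can be checked numerically. The following sketch uses synthetic data (the intercept, slope and noise level are hypothetical), plugs the RSE in for $\sigma$, and cross-checks the result against the matrix form $\sigma^2(X^TX)^{-1}$ of the coefficient covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 300, size=n)                          # hypothetical predictor
y_sim = 7.0 + 0.05 * x + rng.normal(scale=3.0, size=n)   # hypothetical true line + noise

# Least squares estimates for the univariate model
x_bar, y_bar = x.mean(), y_sim.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y_sim - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# RSE: estimate of sigma from the residuals
rss = np.sum((y_sim - (beta0_hat + beta1_hat * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors from the closed-form formulas, with sigma replaced by RSE
se_beta0 = rse * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_beta1 = rse / np.sqrt(sxx)

# Cross-check: diagonal of RSE^2 * (X^T X)^{-1}
X_design = np.c_[np.ones(n), x]
cov = rse ** 2 * np.linalg.inv(X_design.T @ X_design)

print("RSE:", rse)
print("SE(beta0_hat), SE(beta1_hat):", se_beta0, se_beta1)
print("from covariance matrix:      ", np.sqrt(np.diag(cov)))
```

The two ways of computing the standard errors should match, and these are the quantities a package such as statsmodels reports in the `std err` column.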

Assuming that the errors are Gaussian distributed, standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. The range is defined by lower and upper limits computed from the sample of data. For linear regression, the 95% C.I. for $\beta_1$ is approximately:

$$\hat{\beta_1} \pm 2\times \text{SE}(\hat{\beta_1})$$

and thus there is approximately a 95% chance that the interval

$$\left[\hat{\beta_1} - 2\times \text{SE}(\hat{\beta_1}),\; \hat{\beta_1} + 2\times \text{SE}(\hat{\beta_1})\right]$$

contains the true value of $\beta_1$.

Similarly, the confidence interval for $\beta_0$ is:
$$\left[\hat{\beta_0} - 2\times \text{SE}(\hat{\beta_0}),\; \hat{\beta_0} + 2\times \text{SE}(\hat{\beta_0})\right]$$

(Strictly speaking, the value $2$ in the above equations should be substituted with the 97.5% quantile of a $t$-distribution with $n-2$ degrees of freedom.)
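One way to read the "95%" is as a statement about repeated sampling: if the experiment were repeated many times, roughly 95% of the intervals computed this way would contain the true $\beta_1$. The sketch below illustrates this with simulated data. The true parameters are hypothetical, and the factor $2$ is used in place of the exact $t$ quantile.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_repeats = 100, 2000
beta0, beta1, sigma = 7.0, 0.05, 3.0       # hypothetical true parameters
x = rng.uniform(0, 300, size=n)            # predictor values, held fixed
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)

covered = 0
for _ in range(n_repeats):
    # Draw a fresh sample of responses from the population regression line
    y_sim = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)

    # Least squares fit and standard error of the slope
    b1 = np.sum((x - x_bar) * (y_sim - y_sim.mean())) / sxx
    b0 = y_sim.mean() - b1 * x_bar
    rse = np.sqrt(np.sum((y_sim - b0 - b1 * x) ** 2) / (n - 2))
    se_b1 = rse / np.sqrt(sxx)

    # Approximate 95% interval and check whether it contains the true slope
    lower, upper = b1 - 2 * se_b1, b1 + 2 * se_b1
    covered += int(lower <= beta1 <= upper)

coverage = covered / n_repeats
print(f"fraction of intervals containing beta1: {coverage:.3f}")
```

The printed fraction should land near 0.95, with some simulation noise.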

<b> Hypothesis Testing I - Univariate Regression</b>

Standard errors can also be used to perform hypothesis testing on the coefficients. The null and alternative hypotheses are:
$$H_0:\text{There is no relationship between }x\text{ and }y$$
$$H_1:\text{There is some relationship between }x\text{ and }y$$

Mathematically, 
$$H_0:\beta_1=0$$
$$H_1:\beta_1 \neq 0$$

If the null is true, then the model simply reduces to $y=\beta_0 + \epsilon$, and $x$ carries no information about $y$. How large must $\hat{\beta_1}$ be for us to reject the null? It depends on the size of $\hat{\beta_1}$ relative to $\text{SE}(\hat{\beta_1})$. If $\text{SE}(\hat{\beta_1})$ is small, even a fairly small $\hat{\beta_1}$ provides strong evidence that $\beta_1 \neq 0$; if $\text{SE}(\hat{\beta_1})$ is large, $\hat{\beta_1}$ must be large in absolute value before we can reject the null hypothesis. To quantify this, calculate the test statistic, in this case the $t$-statistic:

$$t=\frac{\hat{\beta_1}-0}{\text{SE}(\hat{\beta_1})}$$

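which measures how many standard errors $\hat{\beta_1}$ is away from $0$. If there is no relationship between $x$ and $y$, $t$ follows a $t$-distribution with $n-2$ degrees of freedom. Consequently, the $p$-value is the probability of observing a value larger than or equal to $|t|$ in absolute value, given that the null hypothesis is true.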

A small $p$-value indicates that it is <u>unlikely to observe such a strong association by chance</u> if there were no real relationship between $x$ and $y$. In that case we reject the null hypothesis and conclude that there is indeed a relationship between the predictor and the response.
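The test can be sketched in a few lines. The helper `t_test_slope` below is hypothetical (not part of any library), the data are synthetic, and the $p$-value uses a normal approximation to the $t$-distribution, which is an assumption that is reasonable only when $n$ is large. An exact $p$-value would use the $t$-distribution with $n-2$ degrees of freedom instead.

```python
import math
import numpy as np

def t_test_slope(x, y):
    """Return (beta1_hat, SE, t, approximate two-sided p) for H0: beta1 = 0."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
    se = rse / np.sqrt(sxx)
    t_stat = beta1_hat / se
    # Normal approximation: P(|Z| >= |t|) = erfc(|t| / sqrt(2))
    p_approx = math.erfc(abs(t_stat) / math.sqrt(2))
    return beta1_hat, se, t_stat, p_approx

rng = np.random.default_rng(3)
x = rng.uniform(0, 300, size=200)
noise = rng.normal(scale=3.0, size=200)
y_related = 7.0 + 0.05 * x + noise     # hypothetical data with a real slope
y_unrelated = 7.0 + noise              # hypothetical data with beta1 = 0

_, _, t_rel, p_rel = t_test_slope(x, y_related)
_, _, t_un, p_un = t_test_slope(x, y_unrelated)
print("related:   t =", t_rel, " p ~", p_rel)
print("unrelated: t =", t_un, " p ~", p_un)
```

With a real slope the $t$-statistic is far from zero and the approximate $p$-value is tiny. Without one, the $t$-statistic stays within a few standard errors of zero.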

In the following example:
<img src="s1.png" width="500" />
<img src="s2.png" width="275" />

Observe that the coefficients are large relative to their standard errors, so the $t$-statistics are large and the $p$-values are small. This means that the chance of observing estimates this far from zero if $\beta_0=0$ and $\beta_1=0$ were true is extremely small. Hence, we can conclude that $\beta_0\neq0$ and $\beta_1\neq0$, and that there is indeed a relationship between TV advertising and sales.

### Accuracy of the Model

Now that we have rejected the null hypothesis in favour of the alternative, the next step is to quantify the extent to which the model fits the data. This is measured using the residual standard error (RSE) and the $R^2$ statistic.

<b>Residual Standard Error (RSE)</b> - Recall that the model contains an error term $\epsilon$. Because of these error terms, even if we know the true regression line, we cannot perfectly predict $y$ from $x$. The RSE is an estimate of the standard deviation of $\epsilon$. Roughly, it is the average amount that the response will deviate from the true regression line, and is computed as:

$$\text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n \left( \hat{y}^{(i)} - y^{(i)}\right)^2}$$

In the example, the RSE is $3.26$. Since sales are recorded in thousands of units, this means actual sales deviate from the true regression line by approximately 3260 units on average. In other words, even if the model were correct, a prediction would still be off by about 3260 units on average. How significant this is depends on the overall level of sales. If mean sales are about 14000 units, then 3260 units corresponds to a percentage error of about $3260/14000 \approx 23\%$.

<b>$R^2$ statistic</b> - the RSE provides an absolute measure of the lack of fit of the model to the data, but it is measured in the units of $y$, so it is not always clear what counts as a good RSE. To overcome this, we use the $R^2$ statistic, which takes the form of a proportion - the proportion of variance explained - and is therefore independent of the scale of $y$. It is calculated as:

$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$$

where $\text{TSS}$ is the total sum of squares, $\text{TSS} = \sum_{i=1}^n (y^{(i)}-\bar y)^2$, and $\text{RSS} = \sum_{i=1}^n \left( \hat{y}^{(i)} - y^{(i)}\right)^2$ is the residual sum of squares. TSS measures the total variance in the response $y$ and can be thought of as the amount of variability in the response before the regression is performed. RSS measures the amount of variability left unexplained after the regression. Hence, $\text{TSS} - \text{RSS}$ is the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in $y$ that can be explained using $x$. An $R^2$ close to $1$ indicates that a large proportion of the variability is explained by the regression, while a value close to $0$ indicates that the regression did not explain much of the variability in the response.

In the example, since $R^2=0.612$, about 61% of variability in sales is explained by a linear regression on the TV variable.
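The definition translates directly into code. The helper `r_squared` below is a hypothetical name used for illustration, and the tiny arrays are made-up values chosen to show the two extreme cases.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance explained: 1 - RSS / TSS."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    tss = np.sum((y - y.mean()) ** 2)      # variability before the regression
    rss = np.sum((y - y_hat) ** 2)         # variability left after the regression
    return 1.0 - rss / tss

y_obs = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions leave no residual variability, so R^2 = 1
print(r_squared(y_obs, y_obs))

# Always predicting the mean explains nothing, so R^2 = 0
print(r_squared(y_obs, np.full(4, y_obs.mean())))
```

Any least squares fit with an intercept falls between these two extremes on its training data.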

The $R^2$ statistic is a measure of the linear relationship between $x$ and $y$. Recall that correlation is also a measure of the linear relationship between $x$ and $y$:

$$r = \text{Cor}(x,y) = \frac{\sum_{i=1}^n (x^{(i)}-\bar x)(y^{(i)}-\bar y)}{\sqrt{\sum_{i=1}^n (x^{(i)}-\bar x)^2}\sqrt{\sum_{i=1}^n(y^{(i)}-\bar y)^2}}$$

This suggests that we can also use $r$ to measure the fit of the linear model. In fact, in univariate linear regression, $r^2 = R^2$. However, this does not carry over directly to the multivariate case, because correlation quantifies the association between a single pair of variables rather than between several predictors and the response.
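The identity between the squared correlation and $R^2$ in the univariate case is easy to verify numerically. The sketch below uses synthetic data with hypothetical parameters. `np.corrcoef` returns the correlation matrix, so the off-diagonal entry is the correlation between `x` and `y_sim`.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 300, size=200)
y_sim = 7.0 + 0.05 * x + rng.normal(scale=3.0, size=200)   # hypothetical data

# Univariate least squares fit
x_bar, y_bar = x.mean(), y_sim.mean()
beta1_hat = np.sum((x - x_bar) * (y_sim - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
y_hat = beta0_hat + beta1_hat * x

# R^2 from the sums of squares
R2 = 1.0 - np.sum((y_sim - y_hat) ** 2) / np.sum((y_sim - y_bar) ** 2)

# Correlation between x and y
r = np.corrcoef(x, y_sim)[0, 1]

print("R^2:", R2)
print("r^2:", r ** 2)
```

The two printed values should agree up to floating point error.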

<b>Hypothesis Testing II - Multivariate Regression</b>

In the multivariate case, we set the null to be that all the coefficients of the regression model are $0$.

Mathematically, 
$$H_0:\beta_1 = \beta_2 = \cdots = \beta_p =0$$
$$H_1:\text{At least one }\beta_j \,\, , j \in \{1,\cdots,p\} \text{ is non-zero}$$

Now, this is tested using the $F$-statistic:

$$F = \frac{\frac{\text{TSS} - \text{RSS}}{p}}{\frac{\text{RSS}}{n-p-1}}$$

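The definitions of TSS and RSS are the same as in univariate regression. If there is no relationship between the response and the variables, then the $F$-statistic is expected to be close to $1$. Otherwise, we expect $F>1$. How much larger than $1$ the $F$-statistic must be to reject the null in favour of the alternative (there is indeed a relationship between $x$ and $y$) depends on $n$ and $p$: when $n$ is large, a value only slightly above $1$ can already be evidence against the null, whereas a larger value is needed when $n$ is small.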

A large $F$-statistic suggests that at least one of the variables is related to the target variable.

Similarly, we can use the $p$-value to determine whether to reject the null.
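To see how the $F$-statistic behaves, it can be computed by hand from TSS and RSS. The helper `f_statistic` below is a hypothetical name, and the data are synthetic: one response depends on the variables and one does not. This is a sketch of the formula only, and it does not compute the associated $p$-value.

```python
import numpy as np

def f_statistic(X, y):
    """F-statistic for H0: beta_1 = ... = beta_p = 0 (X excludes the intercept column)."""
    n, p = X.shape
    X_design = np.c_[np.ones(n), X]                       # add x0 = 1
    theta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    rss = np.sum((y - X_design @ theta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return ((tss - rss) / p) / (rss / (n - p - 1))

rng = np.random.default_rng(11)
n, p = 200, 3
X_sim = rng.normal(size=(n, p))
noise = rng.normal(size=n)

y_related = 3.0 + X_sim @ np.array([0.5, 1.0, 0.0]) + noise   # two non-zero coefficients
y_unrelated = 3.0 + noise                                     # all coefficients zero

F_related = f_statistic(X_sim, y_related)
F_unrelated = f_statistic(X_sim, y_unrelated)
print("F (related):  ", F_related)
print("F (unrelated):", F_unrelated)
```

When the response depends on the variables, the statistic is far above $1$. When it does not, the statistic stays of the order of $1$.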

In [8]:
# Obtaining the coefficients using statsmodels
X2 = sm.add_constant(X2)
reg21 = sm.OLS(y, X2)
results21 = reg21.fit()
print(results21.summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Mon, 11 May 2020   Prob (F-statistic):           1.58e-96
Time:                        18:59:39   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9389      0.312      9.422      0.0