# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Koop Chapter 05: Statistical Aspects of Regression__ <br>

Author:  Tyler J. Brough <br>
Updated: November 8, 2021 <br>

---

<br>

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

In [2]:
np.random.seed(7)

---

<br>

## __Introduction__

<br>


These notes are taken from chapter 5 of the book _Analysis of Economic Data 2nd Edition_ by Gary Koop.

<br>

In this chapter we build up to an understanding of the statistical aspects of the regression model. 

* Discuss what the statistical methods are and what they are designed to do

* Show how to carry out a regression analysis using these statistical methods

* Interpret the results correctly

* Provide some graphical intuition in order to gain insight into where statistical results come from and what they mean

<br>

<br>

We distinguish between $\alpha$ and $\beta$ in the regression and the OLS (ordinary least squares) estimates of these coefficients, $\hat{\alpha}$ and $\hat{\beta}$.

Remember the regression model:

<br>

$$
\Large{Y_{i} = \alpha + \beta X_{i} + \epsilon_{i}}
$$

<br>

* for $i = 1, \ldots, N$ observations

* $\alpha$ and $\beta$ measure the relationship between $Y$ and $X$

* We do not generally know what this relationship is without numerical values for $\alpha$ and $\beta$

* We derived OLS estimates which we labeled $\hat{\alpha}$ and $\hat{\beta}$

* We emphasized that $\alpha$ and $\beta$ are the true but unknown population coefficients

* $\hat{\alpha}$ and $\hat{\beta}$ are statistical estimates

* This leads us to ask how accurate these estimates are

* To answer this question we can bring to bear statistical theory

* At first we will focus on frequentist methods of understanding this process

* We will calculate _confidence intervals_ and conduct _hypothesis tests_ for the coefficients

* We say that OLS provides _point estimates_ for $\beta$ (e.g. $\hat{\beta} = 0.000842$ is the point estimate of $\beta$ in the regression of deforestation on population density)

* Think of the point estimate as our best statistical guess for what the right value of $\beta$ is

* Confidence intervals provide us with interval estimates allowing us to make statements that reflect the uncertainty we may have about the true value of $\beta$ (e.g. "we are confident that $\beta$ is greater than 0.0006 and less than 0.0010")

* We can obtain different confidence intervals corresponding to different levels of confidence

* The degree of confidence we have in a chosen interval (e.g. $95\%$) is referred to as a _confidence level_

* The other major activity of the empirical researcher is _hypothesis testing_ 

* An example: $H_{0}: \beta = 0$

* If this hypothesis is true, the explanatory variable has no explanatory power

* Hypothesis testing allows us to carry out such tests
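
<br>

To make the distinction between the population coefficients and their OLS estimates concrete, here is a minimal sketch on simulated data. The values `alpha_true = 0` and `beta_true = 1`, the sample size, the range of $X$, the error standard deviation, and the seed are all assumptions chosen for illustration; they are not taken from Koop's data sets.

```python
import numpy as np

rng = np.random.default_rng(7)  # hypothetical seed, for reproducibility only

# "True" population coefficients (known here only because we simulate the data)
alpha_true, beta_true = 0.0, 1.0
N = 100

X = rng.uniform(0.0, 6.0, size=N)       # assumed range for the explanatory variable
eps = rng.normal(0.0, 1.0, size=N)      # assumed error distribution
Y = alpha_true + beta_true * X + eps

# OLS point estimates from the usual closed-form formulas
x_dev = X - X.mean()
beta_hat = np.sum(x_dev * (Y - Y.mean())) / np.sum(x_dev ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()

print(f"true  alpha = {alpha_true:.3f}, true  beta = {beta_true:.3f}")
print(f"alpha_hat   = {alpha_hat:.3f}, beta_hat    = {beta_hat:.3f}")
```

The estimates will be close to, but not exactly equal to, the true values. That gap is the uncertainty that confidence intervals and hypothesis tests are meant to quantify.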




### __Which Factors Affect the Accuracy of the Estimate $\hat{\beta}$?__

<br>

* Simulated models with $\alpha = 0$ and $\beta = 1$

* Figures 5.1, 5.2, 5.3, and 5.4

* If we try to fit a regression line to these different data sets they will lead to very different levels of accuracy

* Q: how confident would you be in a model fitted to each of these artificial datasets?

* Figure 5.3 would yield the most accurate estimate: the linear pattern "leaps out" at you

<br>

These figures illustrate three main factors that affect the accuracy of OLS estimates and the uncertainty that surrounds our knowledge of what the true value of $\beta$ really is:

1. Having more data points improves the accuracy of estimation. This can be seen by comparing Figure 5.1 ($N = 5$) with Figure 5.3 ($N = 100$).

2. Having smaller errors improves accuracy of estimation. Equivalently, if the SSR is small or the variance of the errors is small, the accuracy of the estimation will be improved. This can be seen by comparing Figure 5.2 (large variance of errors) with Figure 5.3 (small variance of errors).

3. Having a larger spread of values (i.e. larger variance) of the explanatory variable ($X$) improves accuracy of estimation. This can be seen by comparing Figure 5.3 (values of the explanatory variable spread all the way from 0 to 6) to Figure 5.4 (values of the explanatory variable are clustered around 3).

<br>

### __Simulation Replication Exercise__
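
<br>

One possible way to explore the three factors above is a small Monte Carlo experiment. The sketch below is not a replication of Koop's actual figures: the four designs, the error standard deviations, the uniform distribution for $X$, the number of replications, and the helper name `mean_slope_se` are all assumptions made for illustration. Each design changes one factor relative to a baseline and reports the average standard error of $\hat{\beta}$.

```python
import numpy as np

def mean_slope_se(n, sigma, x_low, x_high, rng, reps=500):
    """Average OLS standard error of the slope over `reps` simulated samples.

    Data are generated from Y = 0 + 1 * X + error (alpha = 0, beta = 1).
    """
    ses = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(x_low, x_high, size=n)
        y = 0.0 + 1.0 * x + rng.normal(0.0, sigma, size=n)
        x_dev = x - x.mean()
        sxx = np.sum(x_dev ** 2)
        beta_hat = np.sum(x_dev * (y - y.mean())) / sxx
        alpha_hat = y.mean() - beta_hat * x.mean()
        ssr = np.sum((y - alpha_hat - beta_hat * x) ** 2)
        ses[r] = np.sqrt(ssr / ((n - 2) * sxx))
    return ses.mean()

rng = np.random.default_rng(7)  # hypothetical seed

# (N, error std. dev., lowest X, highest X); all values are illustrative assumptions
designs = {
    "baseline":      (100, 0.5, 0.0, 6.0),  # many points, small errors, wide X
    "fewer points":  (5,   0.5, 0.0, 6.0),  # only N changes
    "larger errors": (100, 2.0, 0.0, 6.0),  # only the error variance changes
    "clustered X":   (100, 0.5, 2.5, 3.5),  # only the spread of X changes
}

mean_se = {label: mean_slope_se(*args, rng=rng) for label, args in designs.items()}

for label, se in mean_se.items():
    print(f"{label:15s} average s_b = {se:.4f}")
```

Under these assumed designs the baseline should show the smallest average standard error, and each single change (fewer points, noisier errors, less spread in $X$) should make it larger.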

### __Calculating Confidence Intervals__

<br>

* The confidence interval reflects the uncertainty surrounding the accuracy of the estimate $\hat{\beta}$

* A smaller confidence interval indicates higher accuracy

* A larger confidence interval indicates greater uncertainty over the true value of $\beta$

* In most cases researchers present both point estimates and confidence intervals

<br>

The mathematical formula for the confidence interval for $\beta$ is:

<br>

$$
\Large{[\hat{\beta} - t_{b} s_{b}, \hat{\beta} + t_{b} s_{b}]}
$$

<br>

An equivalent way of presenting the confidence interval is:

<br>

$$
\Large{\hat{\beta} - t_{b} s_{b} \le \beta \le \hat{\beta} + t_{b} s_{b}}
$$

<br>

The above formulas require the following:

* $\hat{\beta}$ the OLS estimate of the slope coefficient $\beta$

* $s_{b}$ is the standard deviation of $\hat{\beta}$ (also called the _standard error_ because it is the standard deviation of the sampling distribution for $\hat{\beta}$)

* We typically use a Central Limit Theorem assumption to derive $s_{b}$. Bootstrapping and other resampling methods can also be used. 

* The formula for $s_{b}$ is as follows:

<br>

$$
\Large{s_{b} = \sqrt{\frac{SSR}{(N-2) \sum (X_{i} - \bar{X})^{2}}}}
$$

<br>

Recall the following:

<br>

$$
\Large{SSR = \sum\limits_{i=1}^{N} u_{i}^{2}}
$$

* $u_{i} = Y_{i} - \hat{\alpha} - \hat{\beta} X_{i}$ are the OLS residuals, for $i = 1, \ldots, N$

* CI = confidence interval

* The larger $s_{b}$ is, the wider the CI will be

* The width of the CI varies directly with $SSR$ (i.e. more variable errors or residuals imply less accurate estimation)

* The width of the CI varies inversely with $N$ (i.e. more data points imply more accurate estimation)

* The width of the CI varies inversely with $\sum(X_{i} - \bar{X})^{2}$ (i.e. more variability in $X$ implies more accurate estimation)
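
<br>

The formulas above can be traced step by step with a few lines of NumPy. The sketch below uses the five observations displayed for the Figure 5.1 data set later in these notes (the `fig51` DataFrame), typed in by hand so that no Excel file is needed. The critical value $t_{b}$ is taken from a Student-t distribution with $N - 2$ degrees of freedom, which is the usual choice for a simple regression but is an assumption relative to the text so far.

```python
import numpy as np
from scipy import stats

# The five (X, Y) rows displayed for the Figure 5.1 data set in these notes
X = np.array([0.414991, 1.73927, 4.355215, 2.509085, 1.46525])
Y = np.array([1.951798, -1.562661, 3.926579, 1.960235, -1.43353])
N = len(X)

# OLS point estimates
x_dev = X - X.mean()
sxx = np.sum(x_dev ** 2)                      # sum of (X_i - X_bar)^2
beta_hat = np.sum(x_dev * (Y - Y.mean())) / sxx
alpha_hat = Y.mean() - beta_hat * X.mean()

# Sum of squared residuals and the standard error of beta_hat
resid = Y - (alpha_hat + beta_hat * X)
ssr = np.sum(resid ** 2)
s_b = np.sqrt(ssr / ((N - 2) * sxx))

# 95% confidence interval (assumes a Student-t critical value with N - 2 df)
t_b = stats.t.ppf(0.975, df=N - 2)
ci = (beta_hat - t_b * s_b, beta_hat + t_b * s_b)

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"SSR = {ssr:.4f}, s_b = {s_b:.4f}, t_b = {t_b:.4f}")
print(f"95% CI for beta: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

With only five observations the interval is very wide and comfortably includes zero, which is consistent with the weak fit reported for this data set in the regression output further down.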

<br>

The third item in the CI formula is $t_{b}$

* Note that the more confident you wish to be in your CI the wider it must become

* For the same data, a $99\%$ CI will always be wider than a $95\%$ CI

* The value of $t_{b}$ controls the confidence level

* If the confidence level is high (e.g. $99\%$) $t_{b}$ will be large, while if it is low (e.g. $50\%$) $t_{b}$ will be small

* $t_{b}$ decreases with N (i.e. the more data points one has the smaller the CI will be)

* $t_{b}$ increases with the level of confidence you choose
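
<br>

The behavior of $t_{b}$ described in the list above can be checked numerically. The sketch below assumes that $t_{b}$ is the two-sided critical value of a Student-t distribution with $N - 2$ degrees of freedom (the standard choice for a regression with one explanatory variable); the helper name `t_crit` is hypothetical.

```python
from scipy import stats

def t_crit(conf_level, n_obs):
    """Two-sided Student-t critical value for a simple regression with n_obs observations."""
    tail = (1.0 - conf_level) / 2.0          # probability left in each tail
    return stats.t.ppf(1.0 - tail, df=n_obs - 2)

for n_obs in (5, 100):
    for level in (0.50, 0.95, 0.99):
        print(f"N = {n_obs:3d}, confidence level = {level:.0%}: t_b = {t_crit(level, n_obs):.3f}")
```

For a fixed $N$, the critical value grows with the confidence level; for a fixed confidence level, it shrinks as $N$ grows and approaches the corresponding normal critical value.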

<br>

### __Running Regressions in Python with the `Statsmodels` Module__

<br>

See here for an introductory example: https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html

<br>

In [3]:
import statsmodels.api as sm

In [4]:
fig51 = pd.read_excel("FIG51.XLS")

In [5]:
fig51

| | X | Y |
|---|---|---|
| 0 | 0.414991 | 1.951798 |
| 1 | 1.73927 | -1.562661 |
| 2 | 4.355215 | 3.926579 |
| 3 | 2.509085 | 1.960235 |
| 4 | 1.46525 | -1.43353 |


In [6]:
fig51.shape

(5, 2)

In [7]:
fig51.describe()

| | X | Y |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 2.096762 | 0.968484 |
| std | 1.468467 | 2.391504 |
| min | 0.414991 | -1.562661 |
| 25% | 1.46525 | -1.43353 |
| 50% | 1.73927 | 1.951798 |
| 75% | 2.509085 | 1.960235 |
| max | 4.355215 | 3.926579 |


In [9]:
y51 = fig51.Y
X51 = fig51.X
X51 = sm.add_constant(X51)

In [10]:
X51

| | const | X |
|---|---|---|
| 0 | 1.0 | 0.414991 |
| 1 | 1.0 | 1.73927 |
| 2 | 1.0 | 4.355215 |
| 3 | 1.0 | 2.509085 |
| 4 | 1.0 | 1.46525 |


In [11]:
model51 = sm.OLS(y51, X51)
results51 = model51.fit()
print(results51.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.313
Model:                            OLS   Adj. R-squared:                  0.084
Method:                 Least Squares   F-statistic:                     1.366
Date:                Mon, 08 Nov 2021   Prob (F-statistic):              0.327
Time:                        14:37:27   Log-Likelihood:                -9.9583
No. Observations:                   5   AIC:                             23.92
Df Residuals:                       3   BIC:                             23.14
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.9416      1.928     -0.488      0.6

  warn("omni_normtest is not valid with less than 8 observations; %i "


In [12]:
fig52 = pd.read_excel("FIG52.XLS")

In [13]:
fig52.head()

| | X | Y |
|---|---|---|
| 0 | 0.414991 | 0.571723 |
| 1 | 1.73927 | 1.383048 |
| 2 | 4.355215 | 6.150757 |
| 3 | 2.509085 | 2.692491 |
| 4 | 1.46525 | 0.263816 |


In [14]:
fig52.tail()

| | X | Y |
|---|---|---|
| 95 | 2.146799 | 3.88773 |
| 96 | 3.357127 | 3.821765 |
| 97 | 3.162842 | 3.169746 |
| 98 | 1.574346 | 4.480406 |
| 99 | 2.948085 | 3.236203 |


In [15]:
fig52.shape

(100, 2)

In [16]:
fig52.describe()

| | X | Y |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 3.031319 | 3.018473 |
| std | 1.13192 | 2.257812 |
| min | 0.005124 | -3.263926 |
| 25% | 2.140309 | 1.288116 |
| 50% | 3.113845 | 3.073942 |
| 75% | 3.837339 | 4.766682 |
| max | 5.597868 | 7.417962 |


In [17]:
y52 = fig52.Y
X52 = fig52.X
X52 = sm.add_constant(X52)

In [18]:
model52 = sm.OLS(y52, X52)
results52 = model52.fit()
print(results52.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.271
Model:                            OLS   Adj. R-squared:                  0.264
Method:                 Least Squares   F-statistic:                     36.52
Date:                Mon, 08 Nov 2021   Prob (F-statistic):           2.73e-08
Time:                        14:41:06   Log-Likelihood:                -206.99
No. Observations:                 100   AIC:                             418.0
Df Residuals:                      98   BIC:                             423.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1321      0.556     -0.238      0.8

In [19]:
fig53 = pd.read_excel("FIG53.XLS")

In [20]:
fig53.head()

| | X | Y |
|---|---|---|
| 0 | 0.414991 | 0.422828 |
| 1 | 1.73927 | 1.721458 |
| 2 | 4.355215 | 4.444992 |
| 3 | 2.509085 | 2.518255 |
| 4 | 1.46525 | 1.405178 |


In [21]:
fig53.tail()

| | X | Y |
|---|---|---|
| 95 | 2.146799 | 2.233845 |
| 96 | 3.357127 | 3.380359 |
| 97 | 3.162842 | 3.163188 |
| 98 | 1.574346 | 1.719649 |
| 99 | 4.325516 | 4.387482 |


In [22]:
fig53.shape

(100, 2)

In [23]:
fig53.describe()

| | X | Y |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 3.045094 | 3.044927 |
| std | 1.139254 | 1.146261 |
| min | 0.005124 | -0.077661 |
| 25% | 2.140309 | 2.185499 |
| 50% | 3.13918 | 3.097413 |
| 75% | 3.87408 | 3.882446 |
| max | 5.597868 | 5.65559 |


In [24]:
y53 = fig53.Y
X53 = fig53.X
X53 = sm.add_constant(X53)

In [25]:
model53 = sm.OLS(y53, X53)
results53 = model53.fit()
print(results53.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.993
Model:                            OLS   Adj. R-squared:                  0.993
Method:                 Least Squares   F-statistic:                 1.372e+04
Date:                Mon, 08 Nov 2021   Prob (F-statistic):          3.92e-107
Time:                        14:44:51   Log-Likelihood:                 92.401
No. Observations:                 100   AIC:                            -180.8
Df Residuals:                      98   BIC:                            -175.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0080      0.028     -0.288      0.7

In [None]:
fig54 = pd.read_excel("FIG54.XLS")

In [None]:
fig54.head()

In [None]:
fig54.tail()

In [None]:
fig54.shape

In [None]:
fig54.describe()

In [None]:
y54 = fig54.Y
X54 = fig54.X
X54 = sm.add_constant(X54)

In [None]:
model54 = sm.OLS(y54, X54)
results54 = model54.fit()
print(results54.summary())

### __The Deforestation Regression__

In [None]:
forest = pd.read_excel("FOREST.XLS")

In [None]:
forest.head()

In [None]:
forest.shape

In [None]:
forest.describe()

In [None]:
y = forest['Forest loss']
X = forest['Pop dens']
X = sm.add_constant(X)

In [None]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

In [26]:
whos

Variable    Type                        Data/Info
-------------------------------------------------
X51         DataFrame                      const         X\n0    <...>09085\n4    1.0  1.465250
X52         DataFrame                       const         X\n0   <...>n\n[100 rows x 2 columns]
X53         DataFrame                       const         X\n0   <...>n\n[100 rows x 2 columns]
fig51       DataFrame                             X         Y\n0 <...>35\n4  1.465250 -1.433530
fig52       DataFrame                              X         Y\n0<...>n\n[100 rows x 2 columns]
fig53       DataFrame                              X         Y\n0<...>n\n[100 rows x 2 columns]
model51     OLS                         <statsmodels.regression.l<...>object at 0x7fa7606bf8e0>
model52     OLS                         <statsmodels.regression.l<...>object at 0x7fa77a6321c0>
model53     OLS                         <statsmodels.regression.l<...>object at 0x7fa77a6395b0>
np          module                  

In [27]:
results53.conf_int()

| | 0 | 1 |
|---|---|---|
| const | -0.063201 | 0.047175 |
| X | 0.985592 | 1.019562 |
