
# <span style="color:brown">The simple linear regression model</span>



## <span style="color:brown">Contents</span> 

In this chapter we illustrate the procedures and properties associated with conducting a simple linear regression analysis from a sample of paired observations corresponding to two variables, which we believe may be linearly related. *Simple* refers to the fact that we will be studying the relationship between just two variables; for more than two variables we would need to use multiple linear regression models, to be introduced in a later chapter.

The use of the linear regression model requires that it is clearly specified, and in particular that the parts of the model affected by the uncertainty are identified and isolated. We will consider problems such as that of finding the best representation for the model given our data, or how to take into account the uncertainty to be able to provide meaningful measures for the variability in any use we might make of this model. We will also discuss how to measure the quality of the approximation provided by the linear regression model to the true relationship between the variables.

We will cover the following topics:

- Objectives of a linear regression analysis
- Model specification
- Least Squares Estimators (LSE): construction and properties
- Statistical inference on the linear regression model:
  - For the slope and the intercept of the linear regression model
  - For the variance of the errors
  - Prediction for a new observation (actual value or average value)
- ANOVA table for the simple linear regression model and its interpretation


## <span style="color:brown">Introduction</span>

---

### <span style="color:brown">Goals (i)</span>

The preceding lessons have considered the study of parameters of one or several populations with different goals: obtaining good approximations for these values (Lesson 1), checking if these values may be close to other reference values (Lesson 2), or comparing two populations through the values of some of their parameters (Lesson 3).

In this lesson we will compare two populations, as in Lesson 3, but now we will conduct this comparison from the point of view of the possible existence of a functional relationship between the values of the two random variables of interest. That is, we would like to know if it is reasonable to assume that the values of two variables might be related in a manner that is approximately linear.

To do this, we will formally define the linear model that represents this linear relationship, and we will discuss how to find the best possible choice for the parameters of the relationship. As uncertainty is important for us, we will introduce distributional information to be able to associate probability information to the answers for the questions of interest, such as determining if the linear relationship is significant, or using this relationship to approximate values for one of the variables, given some information from the other variable. Finally, we will discuss when a linear relationship identified under uncertainty may be a good model for the joint behavior of the variables of interest, and how to evaluate the quality of this model.


### <span style="color:brown">Goals (ii)</span>

Our goals for this lesson will be:

- to understand the main characteristics of the available data,
- to summarize these data in a way that is helpful to conduct the linear regression analysis,
- to estimate the parameters of the linear regression model,
- to obtain values useful to conduct inference on these parameters, and in particular to test the significance of the model,
- to estimate mean responses and forecasts using the model,
- to compute the ANOVA table associated to the model and to determine its explanatory power, and
- to conduct some limited diagnostics on the assumptions for this model.


## <span style="color:brown;">Relationships between variables under uncertainty</span>

---

In Lesson 3 we considered the case where we compare some properties of two populations, by studying the possible relationships between the values of some of their parameters, given the available data. We used hypothesis tests to answer these questions.

In many cases we wish to go beyond the relationship between parameters, and to study relationships between the individual values of two variables that we suspect are associated in some way. The joint study of their values has many advantages, as we can use the values of some of the variables to gain information about the other variables. This is particularly interesting if we do not observe these variables at the same time or with the same precision. Also, this joint study may allow us to identify relationships between the variables that may help us to understand their interaction, in order to make better decisions in the future.

These relationships can take many different forms, and in general they will be represented as a mathematical function of the variables, satisfying a certain condition or conditions. For example, for two random variables $X$ and $Y$ we may be interested in finding a relationship of the form

$$
f(X,Y,U) = 0
$$

for some function $f$, which might not be known in advance, where $U$ is a random variable that represents the uncertainty in the relationship between $X$ and $Y$. In many practical cases we select in advance a functional form for $f$, but we still need to determine values for the parameters of the function; most or all of these parameters will in general be unknown and will need to be approximated from the collected observations in our samples.

In addition to $X$ and $Y$, the description of $f$ includes a random variable $U$, to include the effect of the uncertainty in the relationship in practical situations. This uncertainty will be associated with the values of $X$ and $Y$, which in general will not fit exactly the model we have selected, but also with the limitations in our choice of a functional representation for this relationship, usually a simplification of the true relationship.

This variable $U$ is assumed to contain no relevant information about the relationship, in the sense that the parameters of its distribution cannot contribute any information to the model. For example, we usually impose conditions such as $E[U] = 0$, and we require that $U$ should be independent of our variables $X$ and $Y$.

Given observed samples for our two variables, our goal is to identify a specification of the function $f$ that offers the best possible compromise between two goals:

1. Obtaining the best possible fit for our data, that is, making the values of the function $f$ at our observations in the absence of noise, $\epsilon_i = f(x_i,y_i)$, as small as possible; and
2. using as few parameters as possible to specify the function $f$.

This second condition is very important, as the errors can be made arbitrarily small by introducing a sufficiently complex model for the specific sample we have collected. But if we did that, we would be selecting a model that approximates both the underlying relationship and the noise we have observed in our sample. While this underlying relationship should be useful to represent future data, in general the noise in our sample will not be related to the noise we might observe in the future. This problem is usually referred to as *overfitting*: it happens when our model goes beyond finding a reasonable relationship between $X$ and $Y$ and also tries to explain the noise, even though this noise provides no information about the population relationship and should not be used to define our model. Finally, note that in general we do not know how to separate the noise from the actual relationship.
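
The effect of adding parameters can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are simulated from a hypothetical linear relationship (intercept 1, slope 0.5, normal noise), and the names `fit.simple`, `fit.complex` and `rss` are introduced here for illustration. It compares a two-parameter linear fit with a degree-10 polynomial fit, both on the sample used for fitting and on a second sample with fresh noise.

```r
# Illustrative sketch with simulated data; all values and names are assumptions
set.seed(123)
n <- 30
x <- seq(0, 10, length.out = n)

# Same underlying relationship, two independent draws of the noise
y.train <- 1 + 0.5 * x + rnorm(n, sd = 1)   # sample used to fit the models
y.new   <- 1 + 0.5 * x + rnorm(n, sd = 1)   # sample with fresh noise

fit.simple  <- lm(y.train ~ x)              # 2 parameters
fit.complex <- lm(y.train ~ poly(x, 10))    # 11 parameters

# Sum of squared differences between a set of responses and the fitted values
rss <- function(fit, y) sum((y - fitted(fit))^2)

res <- data.frame(
  model     = c("linear", "degree 10"),
  rss.train = c(rss(fit.simple, y.train), rss(fit.complex, y.train)),
  rss.new   = c(rss(fit.simple, y.new),   rss(fit.complex, y.new))
)
print(res)
```

The in-sample sum of squares cannot increase when parameters are added to a nested model, so `rss.train` is never larger for the polynomial. The column `rss.new` shows how each fit behaves against noise it has not seen; it would typically not improve in the same way.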

In practice, we usually work with models that relate the two variables in our sample (or the many variables if that is the case) so that the values of one of the variables are specified as a function of the values of the other variable. Also, we include the noise in a simple, additive or multiplicative, form to simplify its treatment. Usual models are for example

$$
Y = f(X) + U , \quad\text{ or }\quad Y = U f(X)
$$

This second model might be useful when the variable $Y$ must take nonnegative values.
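
As a brief remark, and assuming that $Y$, $f(X)$ and $U$ are all strictly positive (an assumption added here for illustration), taking logarithms in the multiplicative model would turn it into an additive one:

$$
\log Y = \log f(X) + \log U
$$

Under that assumption, the transformed variables could be studied with the same tools used for the additive model, with $\log U$ playing the role of the noise.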

In this lesson we will consider models that are as simple as possible. This simplest case corresponds to the model with additive noise shown above, where $f$ is being represented as a linear function. It takes the form

$$
Y = \beta_0 + \beta_1 X + U 
$$

We call this model the *simple linear regression* model. We will devote this lesson to studying how to specify the model, that is, how to find values for $\beta_0$ and $\beta_1$, and how to use it to generalize this relationship to other observations beyond our samples. As this model is linear, normality is a relevant theoretical property we should take into consideration: it will help us to identify relevant properties of the model and its parameters. We will analyze it assuming that our data, and in particular our errors, follow normal distributions.


## <span style="color:brown;">The simple linear regression model</span>

---

As we have mentioned, a simple linear regression model is a model that approximates the value of a (random) variable $Y$, the <span style="color:brown">*dependent* or *response*</span> variable, from a linear combination of values of another variable (or variables in the case of a multiple linear regression model) $X$, the <span style="color:brown">*independent* or *explanatory*</span> variable.

This approximation will not provide in general an exact value for the dependent variable, as the relationship will include errors due to inaccuracies in the observed values of the variables, simplifications in the representation of the true relationships, etc. The errors in the relationship will be represented through an (unknown) random variable, $U$. These errors will be assumed to satisfy conditions that ensure they do not contain any relevant information for the model, but have some structure to simplify the theoretical analysis of this model. These conditions will be presented later as assumptions we will impose on our model.

The simple linear regression model has the mathematical representation introduced above, $Y = \beta_0 + \beta_1 X + U$, where $\beta_0$ (the <span style="color:brown">intercept</span> of the model) and $\beta_1$ (the <span style="color:brown">slope</span>) are the parameters of the model, to be estimated from the data, and $U$ denotes the (random) errors associated with this model.
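
To make this representation concrete, the following sketch generates observations from a simple linear regression model. It is an illustration only: the parameter values `beta.0`, `beta.1` and `sigma`, the sample size and the grid of $X$ values are hypothetical choices, and the errors are drawn from a normal distribution with zero mean.

```r
# Sketch: simulate data from Y = beta.0 + beta.1 * X + U
# All numerical values below are hypothetical, chosen for illustration
set.seed(1)
beta.0 <- 2      # intercept
beta.1 <- 0.5    # slope
sigma  <- 1      # standard deviation of the errors
n      <- 50

x <- seq(1, 10, length.out = n)        # values of X, treated as known
u <- rnorm(n, mean = 0, sd = sigma)    # errors with zero mean
y <- beta.0 + beta.1 * x + u           # response values

sim.df <- data.frame(X = x, Y = y, U = u)
print(head(sim.df))

# Simulated points together with the population line
plot(sim.df$X, sim.df$Y, xlab = "X", ylab = "Y",
     main = "Simulated simple linear regression data")
abline(a = beta.0, b = beta.1, col = "blue")
```

In practice only the columns `X` and `Y` would be observed; the errors `U` and the blue population line are available here only because the data are simulated.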



### <span style="color:brown">Observations and the simple linear regression model</span>

The treatment of the model starts with the information contained in a collection of values for the variables corresponding to a simple random sample of paired values $\{ (x_i , y_i )\}_{i=1}^n$. Based on this information, we usually proceed through the following steps:

1. Identify appropriate estimators for the parameters $\beta_j$, $j = 0,1$, and use these estimators to obtain estimates based on the available data (<span style="color:brown">*estimation*</span>).
2. Interpret the values of these estimates with respect to their population, based on their distributions, to obtain confidence intervals or to conduct significance tests, for example (<span style="color:brown">*inference*</span>).
3. Use the model and the estimated parameters to obtain information about some population values of interest, for example, mean values of the response variable outside of the sample (<span style="color:brown">*forecasting*</span>).

In our treatment of this model we will assume that the values of the variable $X$ are known, that is, they will not be considered as observations from a random variable, but rather as known values. This represents situations where we observe $X$ before we observe $Y$, or when we observe $X$ with very little variability, compared to our observations for $Y$. Formally, we will conduct all the relevant analysis and evaluations related to this model, conditional on the values of the variable $X$ given in our sample.


## <span style="color:blue;">Preparing R and the data</span>

---

To illustrate the concepts we have introduced, and to motivate possible choices of good estimators, we will consider specific examples, mostly based on real data, which we will process using <span style="color:blue;font-family:monospace;font-size:90%;">R</span>.

We start by preparing <span style="color:blue;font-family:monospace;font-size:90%;">R</span> to read and manipulate the data mentioned above. In the following <span style="color:blue;font-family:monospace;font-size:90%;">R</span> <span style="color:brown">code cell</span> we:

1. Load the <span style="color:blue;font-family:monospace;font-size:90%;">R</span> libraries we are going to need for our examples.
2. Define a function, <span style="color:blue;font-family:monospace;font-size:90%;">table_prnt</span>, specifying the format for the tables that will present the numerical results in this lesson.
3. Introduce information to work with the available data sets.

The <span style="color:brown;">available data sets</span> and their identifying codes are:

1. Hourly prices for the Iberian electricity market
2. Grades for a Statistics subject in UC3M
3. Share prices for a company (Iberdrola) from the IBEX index
4. Simulated data from a N(80,30) distribution (var 1), an Exp(lambda=1/30) distribution (var 2) and a Binom(20,0.4) distribution (var 3)
5. Data from the Sustainable Development Report 2021, with the scores by country for goals 1 and 2

In order to add another data set to this collection, you should include information for each of the following variables: the <span style="color:blue;font-family:monospace;font-size:90%;">.csv</span> file containing the data and a text with a short description for the data.

It is also important to ensure that the <span style="color:brown;">working directory</span> has been <span style="color:brown;">selected correctly,</span> as the directory that includes all the data sets that could be used in this lesson.

To execute the commands in the cell, select the cell by clicking on it, and then <span style="color:blue;">press the **RUN** button</span> in the menu bar, or press <span style="color:blue;">Shift-Enter.</span>


In [None]:
#options(jupyter.plot_mimetypes = c("text/plain","image/png"))

# Load libraries with R functions

suppressMessages(library(tidyverse))
suppressMessages(library(huxtable))
library(knitr)
suppressMessages(library(kableExtra))
library(IRdisplay)
suppressMessages(library(gridExtra))
suppressMessages(library(qqplotr))
suppressMessages(library(GGally))
suppressMessages(library(car))
library(grid)

# Define a function to format and print the results of interest

outp.type = 0   # = 1 for html output, = 0 for Jupyter Books

if (outp.type == 1) {
    table_prnt <- function(p.df,p.capt) {
    # A function to control the presentation of tables with numerical summaries
    p.df %>% kable("html",caption=paste0('<em>',p.capt,'</em>'),align='r') %>%
    kable_styling(full_width = F, position = "left") %>% as.character() %>% display_html()
    }
    } else {
    table_prnt <- function(p.df,p.capt) {
    # A function to control the presentation of tables with numerical summaries
    p.df %>% kable("simple",caption=p.capt,align='r')
    }
}
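
Before reading any data, it may help to confirm that the working directory actually contains the expected files. The following is a minimal sketch: `check_data_files` is a hypothetical helper introduced here (it is not part of the lesson code), and the file names are those listed for the available data sets.

```r
# Hypothetical helper (not part of the original lesson code):
# reports which of the given files exist in a directory
check_data_files <- function(files, dir = getwd()) {
  found <- file.exists(file.path(dir, files))
  names(found) <- files
  found
}

data.files <- c("Dat_PreciosOMIE.csv", "Dat_Calificaciones.csv",
                "Dat_PreciosIBE_MC.csv", "Dat_SimulatedData.csv",
                "Dat_SDR21.csv")

cat("Working directory:", getwd(), "\n")
print(check_data_files(data.files))
```

If some files are reported as `FALSE`, `setwd()` can be used to point R to the directory that holds the data before running the remaining cells.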


## <span style="color:blue;">Examples based on external data sources</span>

---




#### <span style="color:blue;">Selecting and displaying the data set and the variables of interest</span>

We select one of these data sets and two variables of interest in the following cell.

1. We assign the corresponding number to the variable <span style="color:blue;font-family:monospace;font-size:90%;">sel.data</span>, at the start of the following code cell.
2. We read the file and include the data in a <span style="color:brown;">data frame</span> with the name <span style="color:blue;font-family:monospace;font-size:90%;">Data.fr</span>.
3. We assign the numbers corresponding to the order of the two variables of interest for our linear model in the data set, to the variable <span style="color:blue;font-family:monospace;font-size:90%;">sel.col</span>. The last value of <span style="color:blue;font-family:monospace;font-size:90%;">sel.col</span> will be assumed to correspond to the <span style="color:brown;">dependent variable.</span>
4. We then assign the values of the variables to two <span style="color:blue;font-family:monospace;font-size:90%;">R</span> variables with the names <span style="color:blue;font-family:monospace;font-size:90%;">data.sel.x</span>, our <span style="color:brown;">independent variable,</span> and <span style="color:blue;font-family:monospace;font-size:90%;">data.sel.y</span>, our <span style="color:brown;">dependent variable.</span> The values of both these variables are assigned to a single data frame, <span style="color:blue;font-family:monospace;font-size:90%;">data.sel</span>.

Finally, we print the names of the selected data set and variables, to check that these values are the correct ones. Then we display a part of the values from the <span style="color:blue;font-family:monospace;font-size:90%;">.csv</span> file, keeping the same structure of the file.


In [None]:
# Define the data set of interest

## Datasets that are available for this lesson

v.pref = data.frame(file = c("Dat_PreciosOMIE.csv",     # Name of the .csv data file
                            "Dat_Calificaciones.csv",
                            "Dat_PreciosIBE_MC.csv",
                            "Dat_SimulatedData.csv",
                            "Dat_SDR21.csv"))
v.pref$title = c("Electricity prices",         # Short title for the data
                "Grades",
                "Share returns",
                "Simulated data",
                "SDG 2021 Scores")

## Indicate the data set and variable to select
## These values can be modified

sel.data = 2
sel.col = c(1,3)

## Read the data

s.pref = v.pref[sel.data,]
Data.fr = read.csv2(s.pref$file)

n.dat = nrow(Data.fr)
data.sel.x = Data.fr[,sel.col[1]]
data.sel.y = Data.fr[,sel.col[2]]
data.sel = data.frame(X=data.sel.x, Y=data.sel.y)
c.names = colnames(Data.fr)[c(sel.col[1],sel.col[2])]
vr.1 = c(rep("VarX",n.dat),rep("VarY",n.dat))
vr.2 = c(data.sel.x,data.sel.y)
val.melt = data.frame(variable = vr.1, value = vr.2)

## Summary of the selected data

descr.df = as.data.frame(c(s.pref$title,c.names))
colnames(descr.df) <- c("Selection")
rownames(descr.df) <- c("Data set","Variable X","Variable Y")

Data.hux.0 <-
  hux(descr.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
table_prnt(Data.hux.0[-1,],"Selected data and variables")

# Print a part of the data we have selected

max.row.show = 8       # Max number of individual values to show
max.col.show = 8       # Max number of variables to show

n.row.show = min(nrow(Data.fr),max.row.show)
n.col.show = min(ncol(Data.fr),max.col.show)

Data.hux.1 <-
  hux(Data.fr[1:n.row.show,1:n.col.show]) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
rownames(Data.hux.1) <- c(0:n.row.show)
table_prnt(Data.hux.1[-1,],s.pref$title)



#### <span style="color:blue;">Summaries for the sample data</span>

In the following cell we conduct some simple exploratory analysis of the data from the variables we have selected. We start by computing some of their most relevant numerical summaries, such as their means, standard deviations and medians.

We also draw a boxplot of the sample data corresponding to these two variables, as well as a scatterplot of the values of the variables.


In [None]:
# Print summaries from the selected data set

smp.mn.x = mean(data.sel.x)
smp.mn.y = mean(data.sel.y)
smp.sd.x = sd(data.sel.x)
smp.sd.y = sd(data.sel.y)
smp.med.x = median(data.sel.x)
smp.med.y = median(data.sel.y)
Sum.fr = as.data.frame(round(matrix(c(n.dat,smp.mn.x,smp.med.x,smp.sd.x,
                                      n.dat,smp.mn.y,smp.med.y,smp.sd.y),4,2),3))

Data.hux.2 <-
  hux(Sum.fr) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

rownames(Data.hux.2) <- c("","Sample size","Mean","Median","Standard deviation")
colnames(Data.hux.2) <- c("Values X","Values Y")
table_prnt(Data.hux.2[-1,],sprintf("%s summary",s.pref$title))

## Boxplot for the data

plt.bxpl = val.melt %>% ggplot(aes(x=variable,y=value)) + geom_boxplot() +
  ggtitle(sprintf("Boxplot %s",s.pref$title)) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5)) +
  scale_x_discrete(labels=c("VarX" = c.names[1], "VarY" = c.names[2]))
plot(plt.bxpl) 

## Scatterplot for the data

plt.scat = data.sel %>% ggplot(aes(x=X,y=Y)) + geom_point() +
  ggtitle(sprintf("Scatterplot %s",s.pref$title)) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plot(plt.scat)



## <span style="color:brown;">Parameter estimation: the least squares method</span>

---

In the procedure we have described for this lesson, the first step to conduct the analysis of a linear regression model is the identification of estimators for the parameters in the model, $\beta_0$ and $\beta_1$. We will obtain these estimators by applying a procedure known as the <span style="color:brown">least squares method</span> to the available sample data.

This procedure is not the only one that could be applied to find these estimators, but it is the most common one and it has very good theoretical properties. See [Appendix 1](#App4_1) for other possible estimation procedures.

This method aims to find values, the estimates $\hat \beta_0$ and $\hat \beta_1$, which will yield the smallest possible errors in the model (using these estimates) for our sample values. Formally, we define the error estimate associated to an observation as $e_i \equiv y_i - \hat \beta_0 - \hat \beta_1 x_i$; we refer to this error as the <span style="color:brown">residual</span> for observation $i$. Note that as we take the values of the independent variable as given, we are looking at the errors associated to the *values of the response variable*, that is, the errors in the values provided by the model for each observation $i$, $\hat \beta_0 + \hat \beta_1 x_i$, when compared with the corresponding observed values of $Y$, $y_i$.

Formally, we define the problem whose solutions are the parameter values yielding the smallest errors as

$$
\min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n e_i (\hat \beta_0,\hat \beta_1)^2 = \min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2
$$

If we use the first-order optimality conditions for this problem to obtain the optimal values for $\hat \beta_0$ and $\hat \beta_1$, we have that its unique optimal solution (the least squares estimators for the parameters) is given by

$$
\left. \begin{array}{rcl}
\hat \beta_1 & = & \displaystyle \frac{\text{cov}(x,y)}{s_x^2} \\
\hat \beta_0 & = & \bar y - \hat \beta_1 \bar x
\end{array} \right\}
$$

These formulas define the least squares estimators for the parameters of the simple linear regression model. See [Appendix 2](#App4_2) for their formal derivation.
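
The closed-form expressions can be checked numerically. The sketch below uses simulated data (the sample and the names `rss.grad`, `grad.at.lse` and `comparison` are assumptions made for illustration). It evaluates the gradient of the sum of squared residuals at the estimates given by the formulas, which should vanish by the first-order conditions, and it compares these estimates with the coefficients returned by R's `lm()` function.

```r
# Sketch: check the least squares formulas on simulated data
set.seed(42)
x <- runif(40, min = 0, max = 10)
y <- 3 - 1.5 * x + rnorm(40, sd = 2)

# Estimates from the closed-form expressions
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Gradient of sum_i (y_i - b[1] - b[2] * x_i)^2 with respect to (b[1], b[2])
rss.grad <- function(b) {
  e <- y - b[1] - b[2] * x
  c(-2 * sum(e), -2 * sum(e * x))
}
grad.at.lse <- rss.grad(c(b0, b1))

# Coefficients computed by lm() for the same data
fit <- lm(y ~ x)
comparison <- rbind(formulas = c(b0, b1), lm = unname(coef(fit)))
colnames(comparison) <- c("intercept", "slope")

print(round(grad.at.lse, 8))   # both components should be numerically zero
print(comparison)
```

Both rows of `comparison` should agree up to rounding error, since `lm()` solves the same least squares problem.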


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">For these least-squares estimates, should the resulting line leave the same number of observations above and below it?</span>
- <span style="color:green">Assume that there is no relationship between the variables. What would the least-squares line look like? Would it be a horizontal line? Why?</span>



#### <span style="color:blue">Least squares: a numerical example</span>

The following cell shows the values of the estimates obtained using the preceding formulas, as well as the plot of the regression line for the data we have selected.


In [None]:
# Plot the regression line computed using least squares

## Parameter values

hat.beta.1 = cov(data.sel.x,data.sel.y)/var(data.sel.x)
hat.beta.0 = mean(data.sel.y) - hat.beta.1*mean(data.sel.x)
lr.par.df = as.data.frame(round(matrix(c(hat.beta.1,hat.beta.0),2,1),3))

Data.hux.3 <-
  hux(lr.par.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

rownames(Data.hux.3) <- c("","Slope","Intercept")
colnames(Data.hux.3) <- c("Values")
table_prnt(Data.hux.3[-1,],"Regression params")

## Scatterplot for the data

data.sel.yh = hat.beta.0 + hat.beta.1*data.sel.x
data.sel$Yest = data.sel.yh

plt.scat.2 = data.sel %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  ggtitle(sprintf("Linear regr %s vs %s",c.names[2],c.names[1])) +
  labs(y = c.names[2], x = c.names[1]) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

plt.scat.3 = data.sel %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  geom_segment(aes(x=X,y=Yest,xend=X,yend=Y),color="red",linewidth=0.5) +
  ggtitle(sprintf("%s vs %s residuals",c.names[2],c.names[1])) +
  labs(y = c.names[2], x = c.names[1]) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(plt.scat.2,plt.scat.3,nrow = 1,
                              top=textGrob(sprintf("Linear regression fit %s",s.pref$title),
                                           gp=gpar(fontsize=15,col="blue"))))



### <span style="color:brown;">A procedure to compute the least squares estimates</span>

#### <span style="color:brown;">Basic sample information and relevant statistics</span>

In this section we describe a procedure to compute the values of the parameter estimates for the simple linear regression model, which we will be able to apply to the manual, step-by-step solution of practical exercises. Later on we will see how to conduct these computations using the tools available in R, similar to those present in many other data analysis software packages.

We note that, in order to obtain the values of these estimates, we do not need all the information in the sample; it is enough that we collect and use some summaries of these data. Some basic information that would allow us to conduct the computations in the formulas we have introduced would include the values of:

- the number of observations, $n$
- the sums of the values of the two variables, $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n y_i$
- the sums of the squares of these values, $\sum_{i=1}^n x_i^2$ and $\sum_{i=1}^n y_i^2$
- the sum of the crossproducts of the two variables, $\sum_{i=1}^n x_i y_i$

From these values, we can compute the values of the basic sample statistics: means, quasivariances and covariance, which we will use to estimate the values of the linear regression parameters. The formulas for these statistics are:

$$
\begin{array}{rclrcl}
\bar x & = & \displaystyle \frac{1}{n} \sum_{i=1}^n x_i, & \quad \displaystyle \bar y & = & \displaystyle \frac{1}{n} \sum_{i=1}^n y_i ,\\
s_x^2 & = & \displaystyle \frac{1}{n-1} \left( \sum_{i=1}^n x_i^2 - n\bar x^2 \right) , & \quad
s_y^2 & = & \displaystyle \frac{1}{n-1} \left( \sum_{i=1}^n y_i^2 - n\bar y^2 \right) , \\
\mbox{cov}(x,y) & = & \displaystyle \frac{1}{n-1} \left( \sum_{i=1}^n x_i y_i - n\bar x\bar y \right) &
\end{array}
$$

For the selected data we obtain the values shown below.


In [None]:
# Summary of the data in the data frame

n.obs = nrow(data.sel)
sum.x = sum(data.sel$X)
sum.y = sum(data.sel$Y)
sum.x2 = sum(data.sel$X^2)
sum.y2 = sum(data.sel$Y^2)
sum.xy = sum(data.sel$X*data.sel$Y)

rn.1 <- 'Number of obs'
rn.2 <- sprintf('Sum of %s',c.names[1])
rn.3 <- sprintf('Sum of %s',c.names[2])
rn.4 <- sprintf('Sum of sq %s',c.names[1])
rn.5 <- sprintf('Sum of sq %s',c.names[2])
rn.6 <- sprintf('Sum of %s * %s',c.names[1],c.names[2])

val.0 <- c(n.obs,sum.x,sum.y,sum.x2,sum.y2,sum.xy)
out.0 <- as.data.frame(matrix(val.0,length(val.0),1))
colnames(out.0) = c("Values")
rownames(out.0) = c(rn.1, rn.2, rn.3, rn.4, rn.5, rn.6)

table_prnt(out.0,"Summaries of our sample data")

# Values of means, variances and covariance

mn.x = sum.x/n.obs
mn.y = sum.y/n.obs
s2.x = (sum.x2 - n.obs*mn.x^2)/(n.obs-1)
s2.y = (sum.y2 - n.obs*mn.y^2)/(n.obs-1)
cov.xy = (sum.xy - n.obs*mn.x*mn.y)/(n.obs-1)

rn.2 <- sprintf('Mean of %s',c.names[1])
rn.3 <- sprintf('Mean of %s',c.names[2])
rn.4 <- sprintf('Quasivar of %s',c.names[1])
rn.5 <- sprintf('Quasivar of %s',c.names[2])
rn.6 <- sprintf('Covar of %s %s',c.names[1],c.names[2])

val.0 <- round(c(mn.x,mn.y,s2.x,s2.y,cov.xy),3)
out.0 <- as.data.frame(matrix(val.0,length(val.0),1))
colnames(out.0) = c("Values")
rownames(out.0) = c(rn.2, rn.3, rn.4, rn.5, rn.6)

table_prnt(out.0,"Basic statistics for our data")



#### <span style="color:brown;">Regression parameter estimates</span>

Based on the preceding information, we are now able to compute the values for the linear regression coefficients, given our data, using the *least squares method*. The formulas we should apply are

$$
\hat \beta_1 = \frac{\mbox{cov}(x,y)}{s_x^2} , \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x .
$$

In addition to the optimality properties derived from the procedure used to define the estimators, and as $\bar y = \hat \beta_0 + \hat \beta_1 \bar x$, it also holds that:

- The point $(\bar x,\bar y)$ always lies on the linear regression line.
- The sum of the residuals always satisfies

$$
\sum_{i=1}^n e_i = \sum_{i=1}^n \left( y_i - \hat \beta_0 - \hat \beta_1 x_i \right) = n \bar y - n \hat \beta_0 - \hat \beta_1 n \bar x = 0
$$

For our sample observations we obtain the estimates shown in the following cell.


In [None]:
# Parameters of the model

hat.beta.1 = cov(data.sel.x,data.sel.y)/var(data.sel.x)
hat.beta.0 = mean(data.sel.y) - hat.beta.1*mean(data.sel.x)
lr.par.df = as.data.frame(round(matrix(c(hat.beta.1,hat.beta.0),2,1),3))

Data.hux.3 <-
  hux(lr.par.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

rownames(Data.hux.3) <- c("","Slope","Intercept")
colnames(Data.hux.3) <- c("Estimates")
table_prnt(Data.hux.3[-1,],"Regression parameters")
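
The two properties stated above (the regression line passes through $(\bar x,\bar y)$ and the residuals sum to zero) can also be verified numerically. The following sketch uses a simulated sample so that it can be run on its own; the data and the names `sum.resid` and `gap.at.means` are assumptions introduced for illustration.

```r
# Sketch: numerical check of two properties of the least squares fit
set.seed(7)
x <- rnorm(25, mean = 10, sd = 3)
y <- 5 + 2 * x + rnorm(25, sd = 4)

b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

y.hat <- b0 + b1 * x    # values provided by the fitted line
e     <- y - y.hat      # residuals

sum.resid    <- sum(e)                           # sum of the residuals
gap.at.means <- mean(y) - (b0 + b1 * mean(x))    # vertical gap at the mean of x

cat("Sum of residuals:        ", signif(sum.resid, 3), "\n")
cat("Gap at the sample means: ", signif(gap.at.means, 3), "\n")
```

Both quantities should be zero up to floating-point rounding, whatever sample is used.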



#### <span style="color:red">Exercise</span>

*A fast-food company wants to evaluate the relationship between the number of ads in a social media app it purchases each week ($x$) and the profits in euros from its home-delivery sales during the week ($y$). From the observed data corresponding to a sample of 19 observations (weeks) we have the following information:*

$$
   \bar x = 636.5 , \quad \bar y = 1757.502 , \quad \sum_{i=1}^{19} x_i^2 = 8073424 , \quad \sum_{i=1}^{19} x_i y_i = 21629441
$$

*Estimate the regression line for the weekly profits in terms of the number of social media ads.*



##### <span style="color:red">Exercise. Solution</span>

We define our variables $X =$ "Number of ads distributed in a week" and $Y =$ "Weekly profit in euros".

To estimate the simple linear regression model of interest,

$$
    y_i = \beta_0 + \beta_1 x_i + u_i ,
$$

we will use the formulas for the least squares estimators. Before doing so, we compute the values that need to be substituted into these formulas:

$$
\begin{array}{rcl}
s_x^2 & = & \displaystyle \frac{\sum_i x_i^2 - n\bar x^2}{n-1} = \frac{8073424 - 19\times 636.5^2}{18} = 20883.96 \\
\text{cov}(x,y) & = & \displaystyle \frac{\sum_i x_i y_i - n\bar x\bar y}{n-1} = \frac{21629441 - 19\times 636.5\times 1757.502}{18} = 20838.36
\end{array}
$$

The estimates for the parameters are given by

$$
\begin{array}{rcl}
\hat \beta_1 & = & \displaystyle \frac{\text{cov}(x,y)}{s_x^2} = \frac{20838.36}{20883.96} = 0.9978 , \\
\hat \beta_0 & = & \bar y - \hat \beta_1 \bar x = 1757.502 - 0.9978\times 636.5 = 1122.391
\end{array}
$$

Thus, the estimated regression line is given by:

$$
   \hat y_i = 1122.391 + 0.9978 x_i
$$
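The arithmetic in this solution can be checked with a few lines of R. The following is a minimal sketch that uses only the summary statistics given in the exercise; all variable names (with the suffix `.ex`) are illustrative and do not refer to objects defined elsewhere in this notebook. The intercept is computed from the unrounded slope, which is why it agrees with the value 1122.391 reported above.

```r
# Summary statistics given in the exercise
n.ex      <- 19
x.bar.ex  <- 636.5
y.bar.ex  <- 1757.502
sum.x2.ex <- 8073424
sum.xy.ex <- 21629441

# Sample variance of x and sample covariance of (x, y)
s2.x.ex <- (sum.x2.ex - n.ex * x.bar.ex^2) / (n.ex - 1)
s.xy.ex <- (sum.xy.ex - n.ex * x.bar.ex * y.bar.ex) / (n.ex - 1)

# Least squares estimates: slope first, then intercept
b1.ex <- s.xy.ex / s2.x.ex
b0.ex <- y.bar.ex - b1.ex * x.bar.ex

cat(sprintf("Sample variance of x : %12.2f\n", s2.x.ex))
cat(sprintf("Sample covariance    : %12.2f\n", s.xy.ex))
cat(sprintf("Slope                : %12.4f\n", b1.ex))
cat(sprintf("Intercept            : %12.4f\n", b0.ex))
```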



## <span style="color:brown;">Inference for the simple linear regression model</span>

---

Once we have estimated the parameters of our linear regression model from the observations in our sample, we would like to conduct inference on these parameters, and in general on any results we may obtain from this model. For example, as our estimates are based on a random sample, they will take different (random) values for different samples. We would like to provide some information on the variability of these possible random values, on their distribution and on some properties of interest of their (population) parameters.

We start with an exploratory analysis based on the estimates obtained from random subsamples of our selected data. Regularities in the distribution of these estimates can be seen, for example, in their histograms. The following plot shows these histograms for the slope and the intercept, with the aim of characterizing the distributions of the parameter estimators we have defined.


In [None]:
# Define regression estimates based on random samples

smp.sz.prop = 0.4
smp.sz = floor(smp.sz.prop*n.dat)

n.reps = 100
n.bins = 18

## Extract subsamples from the data and compute estimates

param.sel = NULL
for (ix in 1:n.reps) {
    ix.sel = sample(n.dat,smp.sz)
    data.sel.ss = data.sel[ix.sel,]
    beta.1.ss = cov(data.sel.ss$X,data.sel.ss$Y)/var(data.sel.ss$X)
    beta.0.ss = mean(data.sel.ss$Y) - beta.1.ss*mean(data.sel.ss$X)
    param.sel = rbind(param.sel,c(beta.0.ss,beta.1.ss))
}
param.sel = data.frame(param.sel)
colnames(param.sel) = c("beta0","beta1")

## Histograms for the collected data

plt.hist.0 = param.sel %>% ggplot(aes(x=beta0)) + geom_histogram(bins=n.bins,alpha=0.7) +
  ggtitle("Histogram beta0") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plt.hist.1 = param.sel %>% ggplot(aes(x=beta1)) + geom_histogram(bins=n.bins,alpha=0.7) +
  ggtitle("Histogram beta1") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(plt.hist.1,plt.hist.0,nrow = 1,
                              top=textGrob(sprintf("%s Linear regression parameters",s.pref$title),gp=gpar(fontsize=15,col="blue"))))



### <span style="color:brown;">Assumptions on the simple linear regression model</span> 

To be able to conduct these inference procedures, we need to make use of distributional information related to our data. We start by introducing some assumptions on our population, and in particular on the distribution of the uncertainty in the errors $U$. 

Given a random sample $\{(X_i,Y_i)\}$, let $Y_i = \beta_0 + \beta_1 X_i + U_i$. Our basic assumptions for the linear regression model will be the following:

1. There exists a linear relationship between the variables $X$ and $Y$, as opposed to these variables being related through some nonlinear relationship.
2. The errors in the model follow a normal distribution $U \sim N(0,\sigma^2)$ where $\sigma^2$, the variance of the errors, is some unknown scalar value that does not depend on the sample values.
3. The errors in the model corresponding to different values of $X$ are independent, that is, $U | (X = x)$ is independent of $U | (X = x')$ for $x \not= x'$. In particular, $U_i = U | (X = x_i)$ is independent of $U_j = U | (X = x_j)$ for any $i$ and any $j \not= i$.

These assumptions, and in particular the ones related to the distribution of the errors, allow us to obtain information on the distribution of the different parameters of the model, the forecasts we may generate from this model, etc. And in general, to conduct statistical inference on these parameters.
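To make these assumptions concrete, the following sketch generates an artificial sample that satisfies all three of them and then recovers the parameters by least squares. The population values (`beta.0.sim`, `beta.1.sim`, `sigma.sim`), the sample size and the seed are arbitrary choices made for this illustration only; they are not taken from the data used in this notebook.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only

# Hypothetical population values (assumptions for this illustration)
beta.0.sim <- 2
beta.1.sim <- 0.5
sigma.sim  <- 1.5
n.sim      <- 200

# Given values of the independent variable
x.sim <- seq(0, 10, length.out = n.sim)

# Assumptions 2 and 3: independent N(0, sigma^2) errors,
# with the same variance for every value of x
u.sim <- rnorm(n.sim, mean = 0, sd = sigma.sim)

# Assumption 1: linear relationship plus error
y.sim <- beta.0.sim + beta.1.sim * x.sim + u.sim

# Least squares estimates computed from the simulated sample
b1.sim <- cov(x.sim, y.sim) / var(x.sim)
b0.sim <- mean(y.sim) - b1.sim * mean(x.sim)

cat(sprintf("True slope     %5.2f, estimate %7.3f\n", beta.1.sim, b1.sim))
cat(sprintf("True intercept %5.2f, estimate %7.3f\n", beta.0.sim, b0.sim))
```

Changing `sigma.sim` or `n.sim` and running the sketch again gives a feeling for how the variance of the errors and the sample size affect the precision of the estimates.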



#### <span style="color:blue">Checking the assumptions for our data</span>

In the following cell we check how well (some of) the preceding assumptions would apply to our (real) data. We present a scatterplot of the residuals against the fitted values, together with a histogram and a Q-Q plot for the residuals in our model.


In [None]:
# Residual plots

## This value can be modified

n.bins = 15

## Values of the residuals and graphical representations

data.sel$Res = data.sel$Y - data.sel$Yest

plt.res.s = data.sel %>% ggplot(aes(x=Yest,y=Res)) + geom_point() +
  ggtitle(sprintf("Scatterplot estimates vs residuals %s",s.pref$title)) +
  xlab("Estimates") + ylab("Residuals") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

plt.res.h = data.sel %>% ggplot(aes(x=Res)) + geom_histogram(bins=n.bins,alpha=0.7) +
  ggtitle(sprintf("Histogram residuals %s",s.pref$title)) +
  xlab("Residuals") + ylab("Frequencies") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

plt.res.q <- data.sel %>% ggplot(aes(sample=Res)) +
  stat_qq_band() + stat_qq_line(color="red") + stat_qq_point() +
  ggtitle(sprintf("QQplot residuals %s",s.pref$title)) +
  ylab("Residual quantiles") + xlab("Normal distribution quantiles") +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(plt.res.s,plt.res.h,plt.res.q,ncol = 2,
                              top=textGrob(sprintf("%s residuals normality check",s.pref$title),gp=gpar(fontsize=15,col="blue"))))



#### <span style="color:brown;">Regression parameter estimates (ii)</span>

As part of our regression model assumptions, we have introduced a third parameter for our model, the <span style="color:brown">variance of the errors,</span> $\sigma^2 = \text{Var}(U)$. This parameter is unknown and will need to be estimated from the sample data.

We define our estimator for this third parameter from the (sample) variance of the residuals $e_i$. These residuals represent our sample estimates for the errors in the model $U$. As the mean of the residuals is always equal to zero, $\bar e = 0$, this variance will be given by

$$
\hat \sigma_R^2 = \frac{1}{n} \sum_{i=1}^n (e_i - \bar e)^2 = \frac{1}{n} \sum_{i=1}^n e_i^2
$$

As in the case of the sample variance, this estimator of the variance of the errors is not unbiased. To obtain  an unbiased estimator we have to divide the sum of the squares by $n-2$, instead of $n$. An intuitive explanation is that in order to compute $e_i$ we have had to estimate two parameters, $\hat \beta_0$ and $\hat \beta_1$, from the sample data, and as a consequence we lose two degrees of freedom.

Our <span style="color:brown">unbiased estimator</span> for the variance of the errors, which we will refer to as the <span style="color:brown">*residual variance*,</span> $s_R^2$, is defined as

$$
s_R^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2 .
$$

A justification for the unbiasedness of $s_R^2$ can be found in [Appendix 3](#App4_3).
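The effect of the divisor can also be seen empirically. The sketch below is a small Monte Carlo experiment under assumed values (sample size, true variance, coefficients and seed are all hypothetical choices for this illustration): it simulates many samples from a model satisfying our assumptions and averages the two candidate estimators over the replications.

```r
set.seed(2024)                # arbitrary seed, for reproducibility only

# Hypothetical setup (assumptions for this illustration)
n.mc      <- 10               # a small n makes the bias easier to see
sigma2.mc <- 4                # true variance of the errors
n.rep.mc  <- 5000             # number of simulated samples
x.mc      <- 1:n.mc           # given values of the independent variable

# For each simulated sample, fit the line and keep the sum of squared residuals
sum.e2.mc <- replicate(n.rep.mc, {
  y  <- 1 + 2 * x.mc + rnorm(n.mc, sd = sqrt(sigma2.mc))
  b1 <- cov(x.mc, y) / var(x.mc)
  b0 <- mean(y) - b1 * mean(x.mc)
  sum((y - b0 - b1 * x.mc)^2)
})

mean.div.n  <- mean(sum.e2.mc / n.mc)         # divisor n
mean.div.n2 <- mean(sum.e2.mc / (n.mc - 2))   # divisor n - 2

cat(sprintf("True variance of the errors : %6.3f\n", sigma2.mc))
cat(sprintf("Average with divisor n      : %6.3f\n", mean.div.n))
cat(sprintf("Average with divisor n - 2  : %6.3f\n", mean.div.n2))
```

The average obtained with the divisor $n$ should fall clearly below the true variance, while the one obtained with $n-2$ should be close to it.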


In [None]:
# Parameters of the model: variance of the errors

## Sum of the squared residuals

data.sel$Res = data.sel$Y - data.sel$Yest
sum.e2 = sum(data.sel$Res^2)

n.obs = nrow(data.sel)
df.m = 1                 # Number of independent variables
df.r = n.obs - 1 - df.m
sR.2 = sum.e2/df.r

rn.3 <- "Sum of sq residuals"
rn.4 <- "Residual variance"
val.0 <- c(sum.e2,sR.2)
out.0 <- round(as.data.frame(matrix(val.0,length(val.0),1)),3)
colnames(out.0) = c("Values")
rownames(out.0) = c(rn.3,rn.4)

table_prnt(out.0,"Parameter estimates")



#### <span style="color:red">Exercise</span>

*A fast-food company wants to evaluate the relationship between the number of ads in a social media app it purchases each week ($x$) and the profits in euros from its home-delivery sales during the week ($y$). From the observed data corresponding to a sample of 19 observations (weeks) we have the following information:*

$$
\bar x = 636.5 , \quad \bar y = 1757.502 , \quad \sum_{i=1}^{19} x_i^2 = 8073424 , \quad \sum_{i=1}^{19} x_i y_i = 21629441
$$

- *Estimate the regression line for the weekly profits in terms of the number of social media ads.*

- *Compute an estimate for the variance of the errors in the regression model, knowing that the sum of squares of the residuals for this model is*

$$
\sum_{i=1}^{19} e_i^2 = 29821
$$

- *Conduct again this last computation assuming now that you are not given the value of $\sum_i e_i^2$, but you know that*

$$
\sum_{i=1}^{19} y_i^2 = 59091545
$$



##### <span style="color:red">Exercise. Solution</span>

We define our variables $X =$ "Number of ads distributed in a week" and $Y =$ "Weekly profit in euros"

We have already obtained the estimated regression line as:

$$
   \hat y_i = 1122.391 + 0.9978 x_i
$$

To estimate the variance of the errors we will use its unbiased estimator, the residual variance $s_R^2$,

$$
s_R^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2
$$

Replacing the value of $\sum_i e_i^2$, we have

$$
s_R^2 = \frac{29821}{17} = 1754.18
$$

If we did not have the value of $\sum_i e_i^2$, we could still obtain it, and from it the residual variance, by expanding the squares:

$$
\begin{array}{rcl}
\sum_i e_i^2 & = & \sum_i (y_i - \hat y_i)^2 = \sum_i (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \\
& = & \sum_i y_i^2 + n \hat \beta_0^2 + \hat \beta_1^2 \sum_i x_i^2 - 2 \hat \beta_0 \sum_i y_i - 2 \hat \beta_1 \sum_i x_i y_i + 2 \hat \beta_0 \hat \beta_1 \sum_i x_i \\
& = & 59091545 + 19\times 1122.391^2 + 0.9978^2 \times 8073424 - 2 \times 1122.391 \times 19 \times 1757.502 \\
& & \hbox{} - 2\times 0.9978 \times 21629441 + 2\times 1122.391 \times 0.9978 \times 19 \times 636.5 \\
& = & 29821
\end{array}
$$

the same value as before.
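The expansion above can be evaluated in R from the summary statistics alone. The sketch below uses illustrative variable names (suffix `.ex`) and unrounded coefficients. It also evaluates the shorter expression $\sum_i e_i^2 = S_{yy} - \hat \beta_1 S_{xy}$, with $S_{yy} = \sum_i y_i^2 - n \bar y^2$ and $S_{xy} = \sum_i x_i y_i - n \bar x \bar y$; this identity is not used in the solution above, but it follows from the same expansion and is included here as a cross-check.

```r
# Summary statistics given in the exercise
n.ex      <- 19
x.bar.ex  <- 636.5
y.bar.ex  <- 1757.502
sum.x2.ex <- 8073424
sum.xy.ex <- 21629441
sum.y2.ex <- 59091545

# Centered sums of squares and cross products
S.xx.ex <- sum.x2.ex - n.ex * x.bar.ex^2
S.xy.ex <- sum.xy.ex - n.ex * x.bar.ex * y.bar.ex
S.yy.ex <- sum.y2.ex - n.ex * y.bar.ex^2

# Least squares estimates (unrounded)
b1.ex <- S.xy.ex / S.xx.ex
b0.ex <- y.bar.ex - b1.ex * x.bar.ex

# Expansion used in the solution: sum(y) = n*y.bar and sum(x) = n*x.bar
sse.expand <- sum.y2.ex + n.ex * b0.ex^2 + b1.ex^2 * sum.x2.ex -
  2 * b0.ex * n.ex * y.bar.ex - 2 * b1.ex * sum.xy.ex +
  2 * b0.ex * b1.ex * n.ex * x.bar.ex

# Shorter equivalent expression (cross-check, not part of the solution above)
sse.short <- S.yy.ex - b1.ex * S.xy.ex

sR2.ex <- sse.expand / (n.ex - 2)

cat(sprintf("Sum of sq residuals (expansion) : %12.3f\n", sse.expand))
cat(sprintf("Sum of sq residuals (shortcut)  : %12.3f\n", sse.short))
cat(sprintf("Residual variance               : %12.3f\n", sR2.ex))
```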



## <span style="color:brown;">Distributions for the parameter estimators</span>

---

The distributions of the estimators for the parameters of the regression model are obtained from the assumptions we have introduced on the model, and by considering that the values of the independent variable, $X$, are given. That is, when we derive a distribution for $\hat \beta_1$ we will be obtaining the distribution of $\hat \beta_1 | (\underline{X} = \{x_1,\ldots , x_n\}) = \hat \beta_1 | \underline{X}$, where $\underline{X}$ denotes the sample values of this independent variable, $\underline{X} = \{ X_1 , \ldots , X_n \}$. To simplify the notation, in all that follows we will omit this conditional assumption from the formulas.

Knowing these distributions is a basic prerequisite for conducting inference on the parameters and on the model in general. The pivotal statistics for the three parameters, together with their distributions, are:

$$
\begin{array}{rcl}
T_{\beta_1} & = & \displaystyle \frac{\hat \beta_1 - \beta_1}{\mbox{se}(\hat \beta_1)} \sim t_{n-2} \\
T_{\beta_0} & = & \displaystyle \frac{\hat \beta_0 - \beta_0}{\mbox{se}(\hat \beta_0)} \sim t_{n-2} \\
T_{\sigma^2} & = & \displaystyle \frac{(n-2)s_R^2}{\sigma^2} \sim \chi^2_{n-2}
\end{array}
$$

The standard errors of the estimators, indicated in the preceding formulas as $\text{se} (\hat \beta_j)$, are given by

$$
\begin{array}{rcl}
\mbox{se}(\hat \beta_1) & = & \displaystyle \sqrt{\frac{s_R^2}{(n-1)s_x^2}} \\
\mbox{se}(\hat \beta_0) & = & \displaystyle \sqrt{s_R^2\left( \frac{1}{n} + \frac{\bar x^2}{(n-1)s_x^2} \right)}
\end{array}
$$

The derivation of these results can be found in [Appendix 3](#App4_3).


In [None]:
# Standard errors for the parameter estimators

se.beta.1 = sqrt(sR.2/((n.obs - 1)*s2.x))
se.beta.0 = sqrt(sR.2*(1/n.obs + mn.x^2/((n.obs - 1)*s2.x)))

val.0 <- round(c(hat.beta.0,hat.beta.1,sR.2,se.beta.0,se.beta.1,0),3)
out.0 <- as.data.frame(matrix(val.0,3,2))
out.0[3,2] = " "
rownames(out.0) = c("beta 0","beta 1","Variance of errors")
colnames(out.0) = c("Estimates","Std errors")

table_prnt(out.0,"Parameter estimates and std errors")


### <span style="color:brown;">Inference on the parameters of the linear regression model</span>

Once we have identified the pivotal statistics and their distributions, we can conduct inference on these parameters. In particular, we can compute confidence intervals for any of the parameters. For a confidence level $1 - \alpha$ these intervals are given by

$$
\begin{array}{rcl}
\text{CI}_{1-\alpha} (\beta_1) & = & \displaystyle \left[ \hat \beta_1 - t_{n-2;\alpha/2}\, \text{se}(\hat \beta_1) \; ; \; \hat \beta_1 + t_{n-2;\alpha/2}\, \text{se}(\hat \beta_1) \right] \\ 
\text{CI}_{1-\alpha} (\beta_0) & = & \displaystyle \left[ \hat \beta_0 - t_{n-2;\alpha/2}\, \text{se}(\hat \beta_0) \; ; \; \hat \beta_0 + t_{n-2;\alpha/2}\, \text{se}(\hat \beta_0) \right] \\ 
\text{CI}_{1-\alpha} (\sigma^2) & = & \displaystyle \left[ \frac{(n-2)s_R^2}{\chi^2_{n-2;\alpha/2}} \; ; \; \frac{(n-2)s_R^2}{\chi^2_{n-2;1-\alpha/2}} \right]
\end{array}
$$

We can also use this information to conduct hypothesis tests on the values of the population parameters and obtain p-values to measure the significance of these parameters, by applying the procedures described in Lesson 2.
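The interval for $\sigma^2$ is the only one of the three that is not symmetric around the point estimate. As an illustration, the sketch below computes it for the fast-food exercise ($n = 19$, $\sum_i e_i^2 = 29821$). The confidence level of 95% and the variable names are choices made for this example. It is assumed here that $\chi^2_{n-2;\alpha/2}$ denotes the value leaving probability $\alpha/2$ to its right, the same upper-tail convention used for $t_{n-2;\alpha/2}$.

```r
# Data from the exercise
n.ex      <- 19
sum.e2.ex <- 29821

# Confidence level (assumed for this illustration)
conf.lvl.ex <- 0.95
alpha.ex    <- 1 - conf.lvl.ex

df.ex  <- n.ex - 2
sR2.ex <- sum.e2.ex / df.ex

# Upper-tail chi-square quantiles: chi.hi leaves alpha/2 to its right,
# chi.lo leaves 1 - alpha/2 to its right
chi.hi <- qchisq(alpha.ex / 2, df.ex, lower.tail = FALSE)
chi.lo <- qchisq(1 - alpha.ex / 2, df.ex, lower.tail = FALSE)

# The larger quantile gives the lower end of the interval
ci.sigma2.ex <- c(df.ex * sR2.ex / chi.hi, df.ex * sR2.ex / chi.lo)

cat(sprintf("Residual variance       : %10.2f\n", sR2.ex))
cat(sprintf("CI for sigma^2 (%4.2f)   : [ %10.2f ; %10.2f ]\n",
            conf.lvl.ex, ci.sigma2.ex[1], ci.sigma2.ex[2]))
```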



### <span style="color:brown;">Significance test for the slope</span>

We now present an especially relevant test on $\beta_1$, the slope of the model. This parameter is the most informative one for the simple linear regression model, as it provides direct evidence on the relationship between the values of $Y$ and $X$. In particular, if there is no linear relationship between $Y$ and $X$ then it should hold that $\beta_1 = 0$.

We often wish to test the significance of the model, that is, the existence of a significant linear relationship between the dependent and the independent variables, based on the value of $\beta_1$. From the preceding comments, this significance test can be defined as

$$
\begin{array}{rcl}
H_0 & : & \beta_1 = 0 \\
H_1 & : & \beta_1 \not= 0
\end{array}
$$

Based on the distribution of the estimator $\hat \beta_1$, we can compute the p-value for this test from the expression

$$
\mbox{p-value} = 2 \Pr \left( T_{n-2} > \frac{| \hat \beta_1 |}{\text{se}(\hat \beta_1)} \right) = 2 \Pr \left( T_{n-2} > \frac{| \hat \beta_1 |}{\displaystyle \sqrt{\frac{s_R^2}{(n-1)s_x^2}}} \right) ,
$$

where $T_{n-2}$ denotes a random variable having a Student t distribution with $n-2$ degrees of freedom.


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">If the value of the test statistic for the slope of the regression line is smaller than its standard error, does it imply that there is no significant linear relationship between the variables? Why?</span>
- <span style="color:green">Assume that the values of $X$ and $Y$ are modified by introducing linear transformations $X' = a + b X$, $Y' = c + d Y$. Is it true that the significance of the linear relationship between $X'$ and $Y'$ is the same as that between $X$ and $Y$? If it were different, how would the values of the constants affect it?</span>


#### <span style="color:blue;">Conducting a significance test for the slope</span>

In the following cell we present the results of the significance test for the linear regression model fitted to the data we have selected.

In [None]:
# Building confidence intervals and hypothesis testing for model significance

## This value can be modified

conf.lvl.1 = 0.95

## Parameter computations

q.p = 0.5*(1 + conf.lvl.1)
q.beta.1 = qt(q.p,n.obs-2,lower.tail=T)
tst.stat = abs(hat.beta.1)/se.beta.1
p.beta.1 = 2*pt(tst.stat,n.obs-2,lower.tail=FALSE)   # two-sided p-value, as in the formula above

## Printouts for outputs

val.0 <- c(sprintf('%7.2f',conf.lvl.1))
out.0 <- as.data.frame(matrix(val.0,1,1))
rownames(out.0) = c("Confidence level:")
colnames(out.0) = c("Value")

table_prnt(out.0,"Conf/signif levels")

val.1 <- c(sprintf('[%8.4f;%8.4f ]',
            hat.beta.1-q.beta.1*se.beta.1,hat.beta.1+q.beta.1*se.beta.1))
out.1 <- as.data.frame(matrix(val.1,1,1))
rownames(out.1) = c("Confidence interval:")
colnames(out.1) = c("Values")

table_prnt(out.1,"Confidence interval beta1")

val.2 <- round(c(tst.stat,p.beta.1),4)
out.2 <- as.data.frame(matrix(c(val.2),2,1))
rownames(out.2) = c("Test statistic value:","p value significance test:")
colnames(out.2) = c("Values")

table_prnt(out.2,"Significance test beta1")



#### <span style="color:red">Exercise</span>

*A fast-food company wants to evaluate the relationship between the number of ads in a social media app it purchases each week ($x$) and the profits in euros from its home-delivery sales during the week ($y$). From the observed data corresponding to a sample of 19 observations (weeks) we have the following information:*

$$
\begin{array}{rcl}
\bar x & = & 636.5 , \quad \bar y = 1757.502 \\
\sum_{i=1}^{19} x_i^2 & = & 8073424 , \quad \sum_{i=1}^{19} y_i^2 = 59091545 , \quad \sum_{i=1}^{19} x_i y_i = 21629441 \\
\sum_{i=1}^{19} e_i^2 & = & 29821
\end{array}
$$

- *Estimate the regression line for the weekly profits in terms of the number of social media ads. Compute an estimate for the variance of the errors in the regression model.*

- *Compute a confidence interval at a 99% confidence level for the slope of the regression line.*

- *Test the hypothesis that the slope of the regression line is different from zero at a significance level of 5%.*



##### <span style="color:red">Exercise. Solution</span>

We define our variables $X =$ "Number of ads distributed in a week" and $Y =$ "Weekly profit in euros"

We have already obtained the estimated regression line as:

$$
   \hat y_i = 1122.391 + 0.9978 x_i
$$

and the residual variance as

$$
s_R^2 = \frac{29821}{17} = 1754.18
$$

To compute the confidence interval for the slope of the regression line for $1-\alpha = 0.99$, we need the value of the quantile $t_{n-2;\alpha/2} = t_{17;0.005} = 2.898$. We also have that

$$
\begin{array}{rcl}
(n-1) s_x^2 & = & \sum_i x_i^2 - n \bar x^2 = 8073424 - 19 \times 636.5^2 = 375911.2 \\
\mbox{se}(\hat \beta_1) & = & \displaystyle \sqrt{\frac{s_R^2}{(n-1)s_x^2}} = \sqrt{\frac{1754.18}{375911.2}} = 0.06831
\end{array}
$$

We can now apply the formula for the confidence interval, and we obtain

$$
\begin{array}{rcl}
\text{CI}_{1-\alpha} (\beta_1) & = & \displaystyle \left[ \hat \beta_1 - t_{n-2;\alpha/2}\, \text{se}(\hat \beta_1) \; ; \; \hat \beta_1 + t_{n-2;\alpha/2}\, \text{se}(\hat \beta_1) \right] \\
& = & \displaystyle \left[ 0.9978 - 2.898\times 0.06831 \; ; \; 0.9978 + 2.898\times 0.06831 \right] = \left[ 0.7998 \; ; \; 1.1958 \right]
\end{array}
$$

For the significance test, we define it as

$$
\begin{array}{rcl}
H_0 & : & \beta_1 = 0 \\
H_1 & : & \beta_1 \not= 0
\end{array}
$$

Its test statistic is given by

$$
T_{\beta_1} = \frac{\hat \beta_1 - \beta_1}{\mbox{se}(\hat \beta_1)} \sim t_{n-2} 
$$

and its value under the null hypothesis is

$$
t_0 = \frac{\hat \beta_1}{\mbox{se}(\hat \beta_1)} = \frac{0.9978}{0.06831} = 14.607
$$

This value is huge, much larger than the critical value, $t_{n-2;\alpha/2} = t_{17;0.025} = 2.110$. As $t_0$ lies in the critical region for the test, we conclude that the model is significant, that is, there exists a significant linear relationship between these two variables.


In [None]:
## Numerical calculations for the exercise

# Data for the exercise

n.obs = 19
x.bar = 636.5
y.bar = 1757.502
x.sum2 = 8073424
xy.sum = 21629441
e.sum2 = 29821

# Some estimates

s.x2 = (x.sum2 - n.obs*x.bar^2)/(n.obs-1)
s.xy = (xy.sum - n.obs*x.bar*y.bar)/(n.obs-1)
hat.beta1 = s.xy/s.x2
s.R2 = e.sum2/(n.obs-2)
se.beta1 = sqrt(s.R2/((n.obs-1)*s.x2))

# Confidence interval

conf.lvl = 0.99

quant.val = qt(0.5 + conf.lvl/2,n.obs-2)
ci.lv = hat.beta1 - quant.val*se.beta1
ci.uv = hat.beta1 + quant.val*se.beta1

cat(sprintf("\nConfidence level          : %8.4f\n",conf.lvl))
cat(sprintf("Confidence interval       : [ %8.4f ; %8.4f ]\n",ci.lv,ci.uv))

# Significance test

sig.lvl = 0.05
cat(sprintf("\nSignificance level                : %10.4f\n",sig.lvl))

t.0 = hat.beta1/se.beta1
cat(sprintf("Value of the statistic under H_0  : %10.4f\n",t.0))

crit.v = qt(sig.lvl/2,n.obs-2,lower.tail=FALSE)
cat(sprintf("Critical value for the test       : %10.4f\n",crit.v))

p.val = 2*pt(abs(t.0),n.obs-2,lower.tail=FALSE)   # abs() keeps the test two-sided for negative slopes
cat(sprintf("P-value for the test              : %10.4f\n",p.val))



### <span style="color:blue">Computing regression estimates using R functions</span>

We now describe how to conduct the preceding calculations using functions available in R.

Several functions are able to compute the values of the estimates for the parameters in the model, and provide other general results corresponding to the fitting of a regression model.

A widely used option is the function <span style="color:blue;font-family:monospace;font-size:90%;">lm</span> (where *lm* stands for Linear Model). For general linear regression models, this function computes and returns the estimates of the coefficients, the standard errors, the p-values of the significance tests for the parameters being equal to zero, etc.
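Before applying this function to our data, it may help to see on a small example that its output agrees with the formulas introduced in this chapter. The sketch below uses simulated data (the sample size, coefficients and seed are arbitrary, and the names with suffix `.chk` are illustrative). It extracts the coefficient table by name with `coef(summary(...))` and the confidence intervals with `confint`, and recomputes the slope, its standard error and its p-value by hand.

```r
set.seed(42)                  # arbitrary seed, for reproducibility only

# Simulated data (hypothetical; not the data used in this notebook)
x.chk  <- runif(30, 0, 10)
y.chk  <- 1 + 0.8 * x.chk + rnorm(30)
df.chk <- data.frame(X = x.chk, Y = y.chk)

# Fit with lm and extract the coefficient table
# (columns: Estimate, Std. Error, t value, Pr(>|t|))
fit.chk  <- lm(Y ~ X, data = df.chk)
coef.tab <- coef(summary(fit.chk))

# Same quantities from the formulas in this chapter
n.chk     <- nrow(df.chk)
b1.chk    <- cov(x.chk, y.chk) / var(x.chk)
sR2.chk   <- sum(residuals(fit.chk)^2) / (n.chk - 2)
se.b1.chk <- sqrt(sR2.chk / ((n.chk - 1) * var(x.chk)))
p.b1.chk  <- 2 * pt(abs(b1.chk / se.b1.chk), n.chk - 2, lower.tail = FALSE)

print(round(coef.tab, 5))
print(confint(fit.chk, level = 0.95))
cat(sprintf("Manual slope %.5f, std error %.5f, p-value %.3g\n",
            b1.chk, se.b1.chk, p.b1.chk))
```

Accessing the results by name, rather than by position in the summary object, keeps the code readable and avoids depending on the internal ordering of the components.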


In [None]:
# Using R to estimate linear regression models

## Results obtained using R functions

lr.xy = lm(Y ~ X, data = data.sel)
lr.sum = summary(lr.xy)
lr.sum.val = lapply(lr.sum[4],round,5)

table_prnt(lr.sum.val,"Parameter values")

lr.sum.2 = round(as.data.frame(lr.sum[8:9]),3)
colnames(lr.sum.2) = c("R squared","Adj R sq")

lr.sum.3 = round(as.data.frame(lr.sum[10]),3)
colnames(lr.sum.3) = c("F statistic")
rownames(lr.sum.3) = c("Value","df num","df den")

table_prnt(list(lr.sum.2,lr.sum.3),"Coeff of determination and global significance")

lr.sum.h = huxreg("Values" = lr.xy, number_format = "%10.4f")
lr.sum.h = lr.sum.h %>% set_caption("Parameter estimates using huxreg")
colnames(lr.sum.h) <- c("Parameters","Values")
rownames(lr.sum.h) <- c(0:9)
lr.sum.h <- lr.sum.h[c(-1,-10),]
cat("\n------\n")
print(lr.sum.h)



## <span style="color:brown;">Mean responses and forecasts</span>

---

In the preceding sections we have described how to estimate the parameters of our simple linear regression model, as well as how to conduct a test to determine if the linear relationship we are considering is significant. Once we have the information from these steps, we may proceed to use the model to obtain information for the dependent variable $Y$, assuming that we have some additional values of interest for the independent variable, not included in our original sample.

This problem is known as that of obtaining a <span style="color:brown;">forecast/mean response</span> estimate for the value of $Y$ corresponding to a given value of the independent variable, $x = x_0$.

There are two types of relevant information that we would like to obtain from the model:

- A <span style="color:brown;">forecast</span> estimate, that is, a value that would approximate as best as possible, given our information and the fitted model, the value we would observe for <span style="color:brown;">*one instance of the dependent variable*,</span> when the independent variable takes a given value.

- A <span style="color:brown;">mean response</span> estimate, that is, a value that would approximate as best as possible the <span style="color:brown;">*average value of all the occurrences of the dependent variable*,</span> when the independent variable takes the same given value for a very large number of observations of $Y$.


### <span style="color:brown;">Inference on mean responses and forecasts</span>

Following our comments in Lesson 1, to estimate the values of the (population) forecasts/mean responses for a given value of $X$ we will define point estimators with good properties for these values. Then, we will consider how to obtain confidence intervals for them.

#### <span style="color:brown">Point estimates</span> 

We define our point estimate by replacing in our definition of the quantities of interest: i) the independent variable $X$ with its known value $x_0$, ii) the unknown parameters in the linear regression model by their least squares estimates and iii) the error term $U$ by its expected value, equal to zero.

For both of the preceding cases we obtain the same point estimator, given by

$$
\hat Y_0 = \hat \beta_0 + \hat \beta_1 x_0 .
$$

This point estimator is unbiased in both cases, having expected value $E[\hat Y_0] = \beta_0 + \beta_1 x_0 \equiv y_0$.


#### <span style="color:brown">Confidence intervals</span>

To obtain confidence intervals for the mean response and the forecast corresponding to the given value $X = x_0$, we need to identify the estimators to use in these cases, as well as their distributions.
  
- For the <span style="color:brown">mean response</span> estimator we have

$$
T_{mr} = \frac{\hat Y_0 - y_0}{\text{se}_{mr} (y_0)} \sim t_{n-2}
$$

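- For the <span style="color:brown">forecast</span> estimator, where the target is the new observation $Y_0 = y_0 + U_0$ rather than its mean $y_0$, we have that

$$
T_f = \frac{\hat Y_0 - Y_0}{\text{se}_f (y_0)} \sim t_{n-2}
$$

Apart from the target, the only difference between these two cases lies in their standard errors, which take the values: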

$$
\begin{array}{rcl}
\mbox{se}_{mr} (y_0) & = & \displaystyle \sqrt{s_R^2\left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2} \right)} \\
\mbox{se}_{f} (y_0) & = & \displaystyle \sqrt{s_R^2\left( 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2} \right)}
\end{array}
$$

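The standard error for the forecast estimator is larger than the one for the mean response, and it does not go to zero when $n \rightarrow \infty$: the additional term $1$ inside the parentheses accounts for the variance of the error of the new observation, which does not shrink as the sample size grows.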

Using this information, we can compute confidence intervals or conduct hypothesis testing on the values of the forecasts/mean responses. For example, given a confidence level $1-\alpha$, the corresponding confidence intervals can be obtained using the formulas

$$
\begin{array}{rcl}
\text{CI}_{mr,1-\alpha} (y_0) & = & \left[ \hat y_0 - t_{n-2;\alpha/2}\, \mbox{se}_{mr} (y_0) \, ; \, \hat y_0 + t_{n-2;\alpha/2}\, \mbox{se}_{mr} (y_0) \right] \\
\text{CI}_{f,1-\alpha} (y_0) & = & \left[ \hat y_0 - t_{n-2;\alpha/2}\, \mbox{se}_f (y_0) \, ; \, \hat y_0 + t_{n-2;\alpha/2}\, \mbox{se}_f (y_0) \right]
\end{array}
$$

The derivation of the distributions for the mean response and forecast estimators can be found in [Appendix 4](#App4_4).


#### <span style="color:red">Exercise</span>

*A fast-food company wants to evaluate the relationship between the number of ads in a social media app it purchases each week ($x$) and the profits in euros from its home-delivery sales during the week ($y$). From the observed data corresponding to a sample of 19 observations (weeks) we have the following information:*

$$
\begin{array}{rcl}
\bar x & = & 636.5 , \quad \bar y = 1757.502 \\
\sum_{i=1}^{19} x_i^2 & = & 8073424 , \quad \sum_{i=1}^{19} y_i^2 = 59091545 , \quad \sum_{i=1}^{19} x_i y_i = 21629441 \\
\sum_{i=1}^{19} e_i^2 & = & 29821
\end{array}
$$

- *Estimate the regression line for the weekly profits in terms of the number of social media ads. Compute an estimate for the variance of the errors in the regression model.*

- *Estimate the value of the expected weekly profits in euros for those weeks when 550 ads were purchased. Compute a confidence interval at a confidence level of 95% for this estimation.*



##### <span style="color:red">Exercise. Solution</span>

We define our variables $X =$ "Number of ads distributed in a week" and $Y =$ "Weekly profit in euros"

We have already obtained the estimated regression line as:

$$
\hat y_i = 1122.391 + 0.9978 x_i
$$

and the residual variance as

$$
s_R^2 = \frac{29821}{17} = 1754.18
$$

From our preceding results, we also have that

$$
(n-1) s_x^2 = 375911.2
$$

The requested point estimate for $x_0 = 550$ is given by

$$
\hat y_0 = 1122.391 + 0.9978 \times 550 = 1671.18
$$

The confidence interval for the mean response (the average weekly profits) will be defined in terms of its standard error,

$$
\mbox{se}_{mr} (y_0) = \sqrt{s_R^2\left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2} \right)} = \sqrt{1754.18\left( \frac{1}{19} + \frac{(550 - 636.5)^2}{375911.2} \right)} = 11.2801
$$

and using $t_{n-2;\alpha/2} = t_{17;0.025} = 2.110$, we obtain

$$
\begin{array}{rcl}
\text{CI}_{mr,1-\alpha} (y_0) & = & \left[ \hat y_0 - t_{n-2;\alpha/2}\, \mbox{se}_{mr} (y_0) \, ; \, \hat y_0 + t_{n-2;\alpha/2}\, \mbox{se}_{mr} (y_0) \right] \\
& = & \displaystyle \left[ 1671.18 - 2.110\times 11.2801 \; ; \; 1671.18 + 2.110\times 11.2801 \right] \\
& = & \left[ 1647.38 \; ; \; 1694.98 \right]
\end{array}
$$
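The calculations in this solution can be reproduced in R from the summary statistics. The following sketch uses illustrative variable names (suffix `.ex`) and unrounded coefficients, so the last digits may differ slightly from the hand computation. It also reports the interval for a forecast at the same $x_0$, which the exercise does not ask for, to show how much wider it is than the interval for the mean response.

```r
# Data from the exercise
n.ex      <- 19
x.bar.ex  <- 636.5
y.bar.ex  <- 1757.502
sum.x2.ex <- 8073424
sum.xy.ex <- 21629441
sum.e2.ex <- 29821
x0.ex     <- 550
conf.ex   <- 0.95

# Estimates needed for the intervals
S.xx.ex <- sum.x2.ex - n.ex * x.bar.ex^2              # (n-1) * s_x^2
b1.ex   <- (sum.xy.ex - n.ex * x.bar.ex * y.bar.ex) / S.xx.ex
b0.ex   <- y.bar.ex - b1.ex * x.bar.ex
sR2.ex  <- sum.e2.ex / (n.ex - 2)

# Point estimate and standard errors at x0
y0.hat.ex <- b0.ex + b1.ex * x0.ex
se.mr.ex  <- sqrt(sR2.ex * (1 / n.ex + (x0.ex - x.bar.ex)^2 / S.xx.ex))
se.fc.ex  <- sqrt(sR2.ex * (1 + 1 / n.ex + (x0.ex - x.bar.ex)^2 / S.xx.ex))

# Confidence intervals
t.q.ex   <- qt(0.5 + conf.ex / 2, n.ex - 2)
ci.mr.ex <- y0.hat.ex + c(-1, 1) * t.q.ex * se.mr.ex
ci.fc.ex <- y0.hat.ex + c(-1, 1) * t.q.ex * se.fc.ex

cat(sprintf("Point estimate   : %10.2f\n", y0.hat.ex))
cat(sprintf("CI mean response : [ %10.2f ; %10.2f ]\n", ci.mr.ex[1], ci.mr.ex[2]))
cat(sprintf("CI forecast      : [ %10.2f ; %10.2f ]\n", ci.fc.ex[1], ci.fc.ex[2]))
```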



#### <span style="color:blue">Computing mean responses and forecasts using R</span>

In the following cell we obtain the values of point estimates and confidence intervals for particular cases, using the data from our fitted linear regression model. The value for $x_0$ used in the mean response/forecast calculation can be modified by assigning a different value to the variable <span style="color:blue;font-family:monospace;font-size:90%;">x.0</span> in the following cell.

We also show some plots to illustrate the impact of the value $x_0$ on the confidence intervals.
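As a cross-check on the manual formulas, R's `predict` method for `lm` objects returns both kinds of interval through its `interval` argument (`"confidence"` for the mean response, `"prediction"` for the forecast). The sketch below compares its output with the formulas of this section on simulated data; the data, the value of $x_0$ and the names with suffix `.pr` are assumptions made for this illustration and are independent of the objects used in the following cell.

```r
set.seed(7)                   # arbitrary seed, for reproducibility only

# Simulated data (hypothetical; not the data used in this notebook)
x.pr  <- runif(40, 0, 10)
y.pr  <- 3 + 1.2 * x.pr + rnorm(40, sd = 2)
df.pr <- data.frame(X = x.pr, Y = y.pr)
fit.pr <- lm(Y ~ X, data = df.pr)

x0.pr  <- 7.5
new.pr <- data.frame(X = x0.pr)

# Intervals from predict(): columns fit, lwr, upr
ci.mr.pr <- predict(fit.pr, newdata = new.pr, interval = "confidence", level = 0.95)
ci.fc.pr <- predict(fit.pr, newdata = new.pr, interval = "prediction", level = 0.95)

# Same intervals from the formulas in this section
n.pr     <- nrow(df.pr)
sR2.pr   <- sum(residuals(fit.pr)^2) / (n.pr - 2)
S.xx.pr  <- (n.pr - 1) * var(x.pr)
y0.pr    <- sum(coef(fit.pr) * c(1, x0.pr))
se.mr.pr <- sqrt(sR2.pr * (1 / n.pr + (x0.pr - mean(x.pr))^2 / S.xx.pr))
se.fc.pr <- sqrt(sR2.pr * (1 + 1 / n.pr + (x0.pr - mean(x.pr))^2 / S.xx.pr))
t.q.pr   <- qt(0.975, n.pr - 2)

print(rbind(mean.response = ci.mr.pr[1, ], forecast = ci.fc.pr[1, ]))
cat(sprintf("Manual CI mean response : [ %8.3f ; %8.3f ]\n",
            y0.pr - t.q.pr * se.mr.pr, y0.pr + t.q.pr * se.mr.pr))
cat(sprintf("Manual CI forecast      : [ %8.3f ; %8.3f ]\n",
            y0.pr - t.q.pr * se.fc.pr, y0.pr + t.q.pr * se.fc.pr))
```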


In [None]:
# Mean responses and forecasts

## These values can be modified

x.0 = 7.5
conf.lvl.2 = 0.95

## Computation of parameter values

q.p.2 = 0.5*(1 + conf.lvl.2)
q.hat.y.0 = qt(q.p.2,n.obs-2,lower.tail=T)
hat.y.0 = hat.beta.0 + hat.beta.1*x.0

## Note the parentheses: the squared deviation is divided by (n-1)*s_x^2
se.hat.y.0.mr = sqrt(sR.2*(1/n.obs + (mn.x - x.0)^2/((n.obs-1)*s2.x)))
se.hat.y.0.fc = sqrt(sR.2*(1 + 1/n.obs + (mn.x - x.0)^2/((n.obs-1)*s2.x)))

## Printing the estimates

val.0 <- round(c(x.0,hat.y.0),3)
out.0 <- as.data.frame(matrix(val.0,length(val.0),1))
rownames(out.0) = c("Value of x0:","Point estimate for the response:")
colnames(out.0) = c("Values")

table_prnt(out.0,"Point estimate")

val.2 = c(sprintf('%8.2f',conf.lvl.2),sprintf('%8.3f',se.hat.y.0.mr),sprintf('%8.3f',se.hat.y.0.fc),
          sprintf('[%8.3f;%8.3f ]',hat.y.0-q.hat.y.0*se.hat.y.0.mr,hat.y.0+q.hat.y.0*se.hat.y.0.mr),
          sprintf('[%8.3f;%8.3f ]',hat.y.0-q.hat.y.0*se.hat.y.0.fc,hat.y.0+q.hat.y.0*se.hat.y.0.fc))
out.2 = as.data.frame(matrix(val.2,5,1))
rownames(out.2) = c("Selected confidence level:","Standard error mean response:","Standard error forecast:",
                    "Confidence interval for mean response:","Confidence interval for forecast:")
colnames(out.2) = c("Values")

table_prnt(out.2,"CIs mean response and forecast")



#### <span style="color:blue">Plotting mean responses and forecasts using R (i)</span>

In the following cell we show the confidence intervals for a set of values of $x = x_0$, both in the case of mean responses, in red, and forecasts, in green.

Note the different sizes of the intervals and their dependence on the value of $x_0$.


In [None]:
# Plots for mean responses and forecasts

## These values can be modified

qt.x = c(0.5,0.75,1)
conf.lvl = 0.95

## Computation of parameter values

s.x = sqrt(s2.x)
x.0m = mn.x + qt.x*s.x
x.0f = x.0m + 0.05*s.x
q.p.2 = 0.5*(1 + conf.lvl)
q.hat.y0 = qt(q.p.2,n.obs-2,lower.tail=T)
hat.y0m = hat.beta.0 + hat.beta.1*x.0m
hat.y0f = hat.beta.0 + hat.beta.1*x.0f

## Note the parentheses: the squared deviation is divided by (n-1)*s_x^2
se.hat.y0m = sqrt(sR.2*(1/n.obs + (mn.x - x.0m)^2/((n.obs-1)*s2.x)))
se.hat.y0f = sqrt(sR.2*(1 + 1/n.obs + (mn.x - x.0f)^2/((n.obs-1)*s2.x)))

## Plotting results

plt.sct.0 = data.sel %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  ggtitle(sprintf("Forecast and Mean response CIs %s vs %s",c.names[1],c.names[2])) +
  labs(y = c.names[2], x = c.names[1]) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

y.li.m = hat.y0m - q.hat.y0*se.hat.y0m
y.ls.m = hat.y0m + q.hat.y0*se.hat.y0m
y.li.f = hat.y0f - q.hat.y0*se.hat.y0f
y.ls.f = hat.y0f + q.hat.y0*se.hat.y0f
ci.df = data.frame(xm = x.0m, xf = x.0f, hym = hat.y0m, hyf = hat.y0f, lim = y.li.m, lsm = y.ls.m, lif = y.li.f, lsf = y.ls.f)

plt.sct.1 = plt.sct.0 + geom_segment(data=ci.df,aes(x=xm,y=lim,xend=xm,yend=lsm),
                                     inherit.aes=FALSE,color="red",linewidth=0.75) +
      geom_segment(data=ci.df,aes(x=xf,y=lif,xend=xf,yend=lsf),inherit.aes=FALSE,color="green",linewidth=0.75) +
      geom_point(data=ci.df,aes(x=xm,y=hym),inherit.aes=FALSE,color="red") +
      geom_point(data=ci.df,aes(x=xf,y=hyf),inherit.aes=FALSE,color="green")

plot(plt.sct.1)



#### <span style="color:blue">Plotting mean responses and forecasts using R (ii)</span>

In the following cell we show the confidence intervals associated to all possible values of $x$ for mean responses and forecasts, as grey bands on the scatterplot for the data. We present these results for different sample sizes, to illustrate the impact of $n$ on the length of the intervals.


In [None]:
# Plots for mean responses and forecasts (ii)

## These values can be modified

conf.lvl = 0.95
n.smp.1 = 20
n.smp.2 = 50

## Computation of parameter values

n.obs = nrow(data.sel)
q.cl = 0.5*(1 + conf.lvl)
qt.y1 = qt(q.cl,n.smp.1-2,lower.tail=T)
qt.y2 = qt(q.cl,n.smp.2-2,lower.tail=T)

## Generation of the samples

ix.sel.1 = sample(n.obs,n.smp.1)
data.sel.1 = data.sel[ix.sel.1,]
ix.sel.2 = sample(n.obs,n.smp.2)
data.sel.2 = data.sel[ix.sel.2,]

x.mn = min(c(data.sel.1$X,data.sel.2$X))
x.mx = max(c(data.sel.1$X,data.sel.2$X))
x.val = seq(x.mn,x.mx,length.out = 100)

## Computation of sample statistics

x.mn.1 = mean(data.sel.1$X)
x.sd.1 = sd(data.sel.1$X)
y.mn.1 = mean(data.sel.1$Y)
y.sd.1 = sd(data.sel.1$Y)
x.mn.2 = mean(data.sel.2$X)
x.sd.2 = sd(data.sel.2$X)
y.mn.2 = mean(data.sel.2$Y)
y.sd.2 = sd(data.sel.2$Y)

## Estimation of the models

hat.b1.1 = cov(data.sel.1$X,data.sel.1$Y)/x.sd.1^2
hat.b0.1 = y.mn.1 - hat.b1.1*x.mn.1
hat.b1.2 = cov(data.sel.2$X,data.sel.2$Y)/x.sd.2^2
hat.b0.2 = y.mn.2 - hat.b1.2*x.mn.2

data.sel.1$Yest = hat.b0.1 + hat.b1.1*data.sel.1$X
data.sel.2$Yest = hat.b0.2 + hat.b1.2*data.sel.2$X

hat.y1 = hat.b0.1 + hat.b1.1*x.val
hat.y2 = hat.b0.2 + hat.b1.2*x.val

res.1 = data.sel.1$Y - data.sel.1$Yest
res.2 = data.sel.2$Y - data.sel.2$Yest

sR2.1 = sum(res.1^2)/(n.smp.1-2)
sR2.2 = sum(res.2^2)/(n.smp.2-2)

## Standard errors in forecasting

se.ym.1 = sqrt(sR2.1*(1/n.smp.1 + (x.mn.1 - x.val)^2/((n.smp.1-1)*x.sd.1^2)))
se.yf.1 = sqrt(sR2.1*(1 + 1/n.smp.1 + (x.mn.1 - x.val)^2/((n.smp.1-1)*x.sd.1^2)))
se.ym.2 = sqrt(sR2.2*(1/n.smp.2 + (x.mn.2 - x.val)^2/((n.smp.2-1)*x.sd.2^2)))
se.yf.2 = sqrt(sR2.2*(1 + 1/n.smp.2 + (x.mn.2 - x.val)^2/((n.smp.2-1)*x.sd.2^2)))

y.li.m = hat.y1 - qt.y1*se.ym.1
y.ls.m = hat.y1 + qt.y1*se.ym.1
y.li.f = hat.y1 - qt.y1*se.yf.1
y.ls.f = hat.y1 + qt.y1*se.yf.1
vl.u1 = max(y.ls.f)
vl.l1 = min(y.li.f)
ci.df.1 = data.frame(x = x.val, hy = hat.y1, lim = y.li.m, lsm = y.ls.m, lif = y.li.f, lsf = y.ls.f)
y.li.m = hat.y2 - qt.y2*se.ym.2
y.ls.m = hat.y2 + qt.y2*se.ym.2
y.li.f = hat.y2 - qt.y2*se.yf.2
y.ls.f = hat.y2 + qt.y2*se.yf.2
vl.u2 = max(y.ls.f)
vl.l2 = min(y.li.f)

vl.u = max(vl.u1,vl.u2)
vl.l = min(vl.l1,vl.l2)
vlim.u = vl.u + 0.05*(vl.u - vl.l)
vlim.l = vl.l - 0.05*(vl.u - vl.l)
ci.df.2 = data.frame(x = x.val, hy = hat.y2, lim = y.li.m, lsm = y.ls.m, lif = y.li.f, lsf = y.ls.f)

## Plotting results

plt.sct.1a = data.sel.1 %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  ggtitle(sprintf("Forecast and Mean response CIs sample size %3.0f",n.smp.1)) +
  xlim(x.mn,x.mx) + ylim(vlim.l,vlim.u) +
  labs(y = c.names[2], x = c.names[1]) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plt.sct.1b = plt.sct.1a + geom_ribbon(data=ci.df.1,aes(x=x,ymin=lim,ymax=lsm),inherit.aes=FALSE,alpha=0.5) +
                   geom_ribbon(data=ci.df.1,aes(x=x,ymin=lif,ymax=lsf),inherit.aes=FALSE,alpha=0.25)

plt.sct.2a = data.sel.2 %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  ggtitle(sprintf("Forecast and Mean response CIs sample size %3.0f",n.smp.2)) +
  xlim(x.mn,x.mx) + ylim(vlim.l,vlim.u) +
  labs(y = c.names[2], x = c.names[1]) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plt.sct.2b = plt.sct.2a + geom_ribbon(data=ci.df.2,aes(x=x,ymin=lim,ymax=lsm),inherit.aes=FALSE,alpha=0.5) +
                   geom_ribbon(data=ci.df.2,aes(x=x,ymin=lif,ymax=lsf),inherit.aes=FALSE,alpha=0.25)

suppressWarnings(grid.arrange(plt.sct.1b,plt.sct.2b,nrow = 2,
                              top=textGrob("Forecasting errors for different sample sizes",
                                           gp=gpar(fontsize=15,col="blue"))))


## <span style="color:brown;">Assessing the quality of the linear regression model</span>

---

In this final part of the lesson, we introduce some measures to evaluate the quality of the approximations provided by the fitted linear regression model, obtained from a given sample.

Our approach is that we wish to use this model to improve our knowledge of the values of the dependent variable $Y$, by efficiently incorporating the information provided by the values of the independent variable $X$. We will measure this improvement through the variability in the estimates for $Y$, with and without the information in $X$. We associate the quality of our model to the magnitude of the reduction in this variability of the estimates for our dependent variable, which we might achieve by using the linear model.


### <span style="color:brown;">The correlation coefficient</span>

From the value of the covariance between the two variables, which we need for our least squares parameter estimates, we can compute the value of the correlation coefficient for these variables. As discussed in Statistics I, this coefficient provides relevant information either to support or to doubt the existence of a linear relationship between these variables.

Its value is defined as

$$
\mbox{cor}(x,y) = \frac{\mbox{cov}(x,y)}{s_x s_y} \in [-1,1]
$$

If the value of this coefficient were close to $+1$ or $-1$, it would be an indication of a clear linear relationship between the variables, while values close to 0 would be an indication of the absence of any such relationship.

The following cell shows its value for the data we have been studying.


In [None]:
# Correlation coefficient

cor.xy = cov.xy/sqrt(s2.x*s2.y)

rn.1 <- sprintf('Correlation coef of %s %s',c.names[1],c.names[2])
val.0 <- round(c(cor.xy),3)
out.0 <- as.data.frame(matrix(val.0,length(val.0),1))
colnames(out.0) = c("Value")
rownames(out.0) = c(rn.1)

table_prnt(out.0,"Correlation coef")



#### <span style="color:blue;">Graphical interpretation of the correlation coefficient</span>

To illustrate further the relationship between the correlation coefficient and the characteristics of the linear regression model, the following cell presents a scatterplot obtained for simulated data, where the value of the correlation coefficient has been specified in advance.


In [None]:
# Plot for the correlation coefficient

## Modify this value and run the cell

xy.cor = -0.75

## Generate random data

n.obs = 100
x.var = 1
y.var = 0.6

v.obs.0 = matrix(rnorm(2*n.obs),n.obs,2)
v.obs.0 = scale(v.obs.0)
v.obs.0[,2] = v.obs.0[,2] - cov(v.obs.0)[1,2]*v.obs.0[,1]
xy.cov = xy.cor*sqrt(x.var*y.var)
xy.S = matrix(c(x.var,xy.cov,xy.cov,y.var),2,2)
xy.S.eig = eigen(xy.S)
xy.sqS = xy.S.eig$vectors %*% diag(sqrt(c(xy.S.eig$values))) %*% t(xy.S.eig$vectors) 
v.obs = v.obs.0 %*% xy.sqS
v.obs = scale(v.obs)
df.obs = data.frame(X=v.obs[,1],Y=v.obs[,2])

## Regression line

sim.S = cov(df.obs)
sim.mn = colMeans(df.obs)
beta.1.sim = sim.S[1,2]/sim.S[1,1]
beta.0.sim = sim.mn[2] - beta.1.sim*sim.mn[1]
df.obs$Yest = beta.0.sim + beta.1.sim*df.obs$X

## Show the resulting scatterplot

xy.cor.e = cor(df.obs$X,df.obs$Y)
plt.scat.a = df.obs %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  ggtitle(sprintf("Scatterplot normal data. Correlation coef %5.2f",xy.cor.e)) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plt.scat.b = plt.scat.a + geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75)

suppressWarnings(grid.arrange(plt.scat.a,plt.scat.b,nrow = 2,
                              top=textGrob("Scatterplots for simulated data",
                                           gp=gpar(fontsize=15,col="blue"))))



### <span style="color:brown;">Other quality measures</span>

We have illustrated that when the value of the correlation coefficient is close to $\pm 1$, we can expect a close linear relationship between the two variables, while values close to 0 indicate the absence of a linear relationship. But a difficulty with this measure is that it is not clear where to set a limit to separate good models from those that are not as useful to explain $Y$.

To conduct inference, we usually prefer measures that have a known distribution, which can be used to assign probabilities to specific situations and to conduct hypothesis tests. In that sense, the test statistic we introduced for the significance test of the linear regression model,

$$
T_{\beta_1} = \displaystyle \frac{\hat \beta_1}{\mbox{se}(\hat \beta_1)} \sim t_{n-2}
$$

provides an alternative measure of the quality of the model, based on a different scale, but with the advantage of following a known distribution. We have already seen how to define reasonable limits for acceptable models vs. models that do not provide information on the relationship of interest, based on this measure.
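
To make the use of this statistic concrete, the following sketch computes $T_{\beta_1}$ and its two-sided p-value by hand on simulated data, and compares them with the values reported by `summary()` for an `lm` fit. The data and all variable names (`x`, `y`, `t.b1`, ...) are hypothetical and used only for illustration.

```r
# Illustrative sketch on simulated data (all names are hypothetical)
set.seed(2)
n <- 30
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)

# Least squares estimates and residual variance
b1  <- cov(x, y) / var(x)
b0  <- mean(y) - b1 * mean(x)
e   <- y - b0 - b1 * x
sR2 <- sum(e^2) / (n - 2)

# Test statistic for H0: beta_1 = 0 and its two-sided p-value
se.b1 <- sqrt(sR2 / ((n - 1) * var(x)))
t.b1  <- b1 / se.b1
p.val <- 2 * pt(abs(t.b1), n - 2, lower.tail = FALSE)

# Same quantities as reported by lm()
fit.coef <- summary(lm(y ~ x))$coefficients
print(c(manual = t.b1, lm = fit.coef["x", "t value"]))
print(c(manual = p.val, lm = fit.coef["x", "Pr(>|t|)"]))
```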



### <span style="color:brown;">Measures based on variability reduction</span>

There is another requirement we would like to impose on our quality measures: the existence of some intuitive interpretation for the values of the selected measure. For example, this interpretation is reasonably clear for the correlation coefficient, although it is not so clear for the value of the test statistic for the significance test.

In what follows we introduce some additional measures used often in practice, based on the reduction in the variability in the observed values of $Y$ that we could achieve by using additional information from the value of the independent variable $X$. These measures also have intuitive interpretations.

We start with some observations about variability measures:

- When the information available for the variable $Y$ is based only on the values of a given sample, without taking into account any other information, a measure of its variability (the total variability) is given by the value of its quasivariance, $s_y^2$. 
- But if we have a paired sample of values from both the independent variable $X$ and from $Y$, and we use the values of $x_i$ to obtain an approximation $\hat y_i$ for the value of $y_i$ based on the linear regression model relating the two variables, the variability that is left unexplained in the values of $Y$ corresponds to the residuals of the model, and its estimated variance is given by the residual variance, $s_R^2$.

By the variability left unexplained we mean the part of the value of $Y$ that is different from the value predicted by the model. That is, we take the value $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$ as known given the value of $X$ and the linear regression model. The part of the value of $y_i$ that we are not able to predict is the difference between $y_i$ and $\hat y_i$, that is, $y_i - \hat y_i = e_i$.

In the following cell we present two plots comparing the total variability in $Y$ with the variability left unexplained after using the regression model, the residual variability.


In [None]:
# Plots for variance reduction

## Plotting results

vlim.l = min(df.obs$Y)
vlim.u = max(df.obs$Y)

n.obs = nrow(df.obs)
y.mn = mean(df.obs$Y)
y.sd = sd(df.obs$Y)
x.sd = sd(df.obs$X)
beta.1.sel = cov(df.obs$X,df.obs$Y)/var(df.obs$X)
beta.0.sel = y.mn - beta.1.sel*mean(df.obs$X)
df.obs$Yest = beta.0.sel + beta.1.sel*df.obs$X
df.obs$Res = df.obs$Y - df.obs$Yest
v.sR2 = sum(df.obs$Res^2)/(n.obs-2)
df.obs$Yhatup = df.obs$Yest + 2*sqrt(v.sR2)
df.obs$Yhatdown = df.obs$Yest - 2*sqrt(v.sR2)

lim.down = y.mn - 2*y.sd
lim.up = y.mn + 2*y.sd
vlim.l = min(vlim.l,lim.down)
vlim.u = max(vlim.u,lim.up)

plt.jit.y = df.obs %>% ggplot() + geom_jitter(aes(x=0,y=Y),height = 0) +
  geom_hline(aes(yintercept=y.mn),color="blue",linewidth=0.75) +
  geom_hline(aes(yintercept=lim.up),color="blue",linetype="dashed",linewidth=0.75) +
  geom_hline(aes(yintercept=lim.down),color="blue",linetype="dashed",linewidth=0.75) +
  ggtitle("Variability in the variable Y") +
  ylim(vlim.l,vlim.u) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

plt.var.y = df.obs %>% ggplot() + geom_point(aes(x=X,y=Y)) +
  geom_line(aes(x=X,y=Yest),color="blue",linewidth=0.75) +
  geom_line(aes(x=X,y=Yhatup),color="blue",linetype="dashed",linewidth=0.75) +
  geom_line(aes(x=X,y=Yhatdown),color="blue",linetype="dashed",linewidth=0.75) +
  ggtitle("Variability associated to the regression model") +
  ylim(vlim.l,vlim.u) +
  theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(plt.jit.y,plt.var.y,nrow = 2,
                              top=textGrob("Comparing uncertainty sizes without and with the model",
                                           gp=gpar(fontsize=15,col="blue"))))



#### <span style="color:brown;">The coefficient of determination, $R^2$</span>

We can generate additional measures for the quality of the regression model by comparing the variabilities in the dependent variable and in the residuals. A way to define these measures, which has the advantage of having an easy interpretation, is to look at the sum of squares of distances to some center associated with this model.

For example, the sample quasivariance of $Y$ is defined as a sum of squares divided by $n-1$. We will define this sum of squares as the <span style="color:brown;">total sum of squares</span> ($\text{SST}$) for our model:

$$
s_y^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2 = \frac{\text{SST}}{n-1} , \qquad \text{SST} = \sum_{i=1}^n (y_i - \bar y)^2 = (n-1) s_y^2
$$

This total sum of squares is a measure of the distance between each observation of the dependent variable and its mean, that is, a measure of the error we would be making if we use the mean of $Y$, $\bar Y$, to approximate the values $Y_i$. This is our best alternative if we do not have any other information available.

Let $\hat y_i \equiv \hat \beta_0 + \hat \beta_1 x_i$, the predicted value under the linear regression model. The sum of squares in $\text{SST}$ can be written as

$$
\text{SST} = \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i)^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 = \sum_{i=1}^n e_i^2 + \sum_{i=1}^n (\hat y_i - \bar y)^2 = \text{SSR} + \text{SSM}
$$

where

$$
\begin{array}{rcl}
\text{SSR} & = & \sum_{i=1}^n e_i^2 = (n-2) s_R^2 \\
\text{SSM} & = & \sum_{i=1}^n (\hat y_i - \bar y)^2 = (n-1) s_y^2 - (n-2) s_R^2
\end{array}
$$

A justification for this result can be found in [Appendix 5](#App4_5).
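
The decomposition can also be checked numerically. The following sketch, on simulated data with hypothetical variable names, computes the three sums of squares directly from their definitions and compares them with the expressions based on $s_y^2$ and $s_R^2$.

```r
# Illustrative sketch on simulated data (all names are hypothetical)
set.seed(3)
n <- 25
x <- runif(n, 0, 5)
y <- 3 - 1.2 * x + rnorm(n, sd = 0.8)

fit   <- lm(y ~ x)
y.hat <- fitted(fit)

# Sums of squares from their definitions
SST <- sum((y - mean(y))^2)       # total
SSR <- sum((y - y.hat)^2)         # residuals
SSM <- sum((y.hat - mean(y))^2)   # model
sR2 <- SSR / (n - 2)              # residual variance

# Each pair of values should coincide
print(c(SST = SST, SSR.plus.SSM = SSR + SSM))
print(c(SST = SST, from.var = (n - 1) * var(y)))
print(c(SSM = SSM, from.var = (n - 1) * var(y) - (n - 2) * sR2))
```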

We have introduced the following sums of squares:
- $\text{SSM}$ denotes the <span style="color:brown;">sum of squares of the model,</span> that is, the sum of the squares of the differences between the predicted values under the model and the mean of $Y$. These quantities indicate how much we gain by using the model to obtain better values for $Y$, based on the values of $X$.
- $\text{SSR}$ denotes the <span style="color:brown;">sum of squares of the residuals,</span> that represents the sum of the squares of the errors left unexplained by the regression model, that is, the errors after we use the model to predict a value for $Y_i$, given the corresponding value of the independent variable $x_i$. Its size is a measure of how much variability we are unable to explain after taking into account the values of $X$ and the regression model.

When divided by $\sigma^2$, these sums of squares follow $\chi^2$ distributions with different degrees of freedom (for $\text{SSM}$ and $\text{SST}$, when $\beta_1 = 0$); see [Appendix 5](#App4_5). This property will allow us to conduct inference on their values.

These values also have a simple interpretation regarding our model. The value of the sum of squares of the residuals should be small with respect to the total sum of squares whenever we have a linear regression model that provides a very good approximation for the variable $Y$, given the values of $X$. We will use this property to introduce another measure for the quality of the linear regression model.

We define the value of the <span style="color:brown;">coefficient of determination, $R^2$,</span> of a linear regression model by comparing the sums of squares of the errors in the model (residuals) with the sum of squares of the differences between the observed values of the dependent variable and its mean,

$$
R^2 = \frac{\mbox{SSM}}{\mbox{SST}} = 1 - \frac{\mbox{SSR}}{\mbox{SST}} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar y)^2} = 1 - \frac{(n-2) s_R^2}{(n-1) s_y^2}
$$

This value provides a measure of the explanatory power of the model, as it can be interpreted as the proportion of the total variability in the dependent variable that is explained by the regression model.

The coefficient of determination has the following properties:
- It takes values in $[0,1]$. If its value is close to 1, the regression model provides a nearly perfect fit, while if it is close to zero, the model provides very little additional information about the dependent variable.
- It represents the proportion of the total variability of the dependent variable that is explained by the regression model and the values of the independent variable. Thus, it has an immediate interpretation in terms of the explained variability.
- It is closely related to the correlation coefficient, as $R^2 = \text{cor}(x,y)^2$.

A justification for this last result can be found in [Appendix 5](#App4_5).
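
As a numerical illustration of these properties, the sketch below computes $R^2$ from the sums of squares on simulated data and compares it with the squared correlation coefficient and with the value reported by `summary()` for an `lm` fit. The data and variable names are hypothetical.

```r
# Illustrative sketch on simulated data (all names are hypothetical)
set.seed(4)
n <- 50
x <- rnorm(n)
y <- 0.5 + 2 * x + rnorm(n, sd = 1.5)

fit <- lm(y ~ x)
SSR <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
R2  <- 1 - SSR / SST

# The three values should coincide
print(c(R2 = R2, cor.squared = cor(x, y)^2, lm = summary(fit)$r.squared))
```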

In the next lesson we introduce a modified measure, the <span style="color:brown;">adjusted coefficient of determination,</span> and we provide a motivation for this definition. It is obtained as

$$
\mbox{adj. } R^2 = 1 - \frac{s_R^2}{s_y^2}
$$
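
This expression can be compared with the adjusted value that R reports for a simple linear regression fit. The following sketch does so on simulated data; the variable names are hypothetical.

```r
# Illustrative sketch on simulated data (all names are hypothetical)
set.seed(5)
n <- 30
x <- runif(n, 0, 10)
y <- 4 + 0.3 * x + rnorm(n)

fit    <- lm(y ~ x)
sR2    <- sum(resid(fit)^2) / (n - 2)   # residual variance
R2.adj <- 1 - sR2 / var(y)              # adjusted coefficient of determination

print(c(manual = R2.adj, lm = summary(fit)$adj.r.squared))
```

Since $s_R^2/s_y^2$ is larger than $\text{SSR}/\text{SST}$ by the factor $(n-1)/(n-2)$, the adjusted value is never larger than $R^2$.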

The following cell obtains the values of these measures for our sample data.


In [None]:
# Measuring the quality of the model
## Computing the sums of squares

SSR = (n.obs - 2)*v.sR2
SST = (n.obs - 1)*y.sd^2
R.2 = 1 - SSR/SST
R.2.adj = 1 - v.sR2/y.sd^2

out.0 <- round(as.data.frame(matrix(c(R.2,R.2.adj),1,2)),3)
colnames(out.0) = c("Value","Adj value")
rownames(out.0) = c("Coefficient of determination:")

table_prnt(out.0,"Coefficient of determination")



#### <span style="color:red">Exercise</span>

*The sales department of a clothing company is conducting a study on the company's online sales. Their goal is to determine if there is a meaningful relationship between the number of daily visits to its web page ($V$, measured in thousands) and the daily volume of Internet sales ($S$, measured in thousands of euros). The department has the following data on the values of these variables for the last 20 days:*

$$
\begin{array}{rcl}
\sum_{i=1}^{20} v_i & = & 599 , \quad \sum_{i=1}^{20} s_i = 2835 \\
\sum_{i=1}^{20} v_i^2 & = & 19195 , \quad \sum_{i=1}^{20} s_i^2 = 458657 , \quad \sum_{i=1}^{20} v_i s_i = 92000 \\
\sum_{i=1}^{20} e_i^2 & = & 16720.67
\end{array}
$$

*where $e_i$ denotes the residuals of the regression model explaining the variable $S$ as a function of $V$.*

- *Compute the value of the coefficient of determination and interpret it.*



##### <span style="color:red">Exercise. Solution</span>

We define our variables $V =$ "Number of visits in a given day (in thousands)" and $S =$ "Daily Internet sales in thousands of euros". $V$ will be our independent variable.

To obtain the estimated regression line we compute:

$$
\begin{array}{rcl}
\bar v & = & \displaystyle \frac{1}{20} \sum_{i=1}^{20} v_i = \frac{599}{20} = 29.95 , \quad \bar s = \frac{1}{20} \sum_{i=1}^{20} s_i = \frac{2835}{20} = 141.75 \\
s_v^2 & = & \displaystyle \frac{1}{19} \left( \sum_{i=1}^{20} v_i^2 - 20 \bar v^2 \right) = \frac{1}{19} ( 19195 - 20\times 29.95^2 ) = 66.05 \\
s_s^2 & = & \displaystyle \frac{1}{19} \left( \sum_{i=1}^{20} s_i^2 - 20 \bar s^2 \right) = \frac{1}{19} ( 458657 - 20\times 141.75^2 ) = 2989.25 \\
\text{cov} (v,s) & = & \displaystyle \frac{1}{19} \left( \sum_{i=1}^{20} v_i s_i - 20 \bar v \bar s \right) = \frac{1}{19} ( 92000 - 20\times 29.95 \times 141.75 ) = 373.25
\end{array}
$$

From these values we obtain the least squares estimates,

$$
\begin{array}{rcl}
\hat \beta_1 & = & \displaystyle \frac{\text{cov}(v,s)}{s_v^2} = \frac{373.25}{66.05} = 5.651 \\
\hat \beta_0 & = & \bar s - \hat \beta_1 \bar v = 141.75 - 5.651\times 29.95 = -27.498
\end{array}
$$

and the residual variance is

$$
s_R^2 = \frac{\sum_i e_i^2}{n-2} = \frac{16720.67}{18} = 928.926
$$

The coefficient of determination can be computed from the correlation coefficient,

$$
\text{cor} (v,s) = \frac{\text{cov}(v,s)}{s_v s_s} = \frac{373.25}{\sqrt{66.05\times 2989.25}} = 0.840
$$

and we obtain

$$
R^2 = \text{cor} (v,s)^2 = 0.840^2 = 0.7056
$$

Alternatively, we could have computed this value as

$$
R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSR}}{(n-1)s_s^2} = 1 - \frac{16720.67}{19\times 2989.25} = 0.7056
$$

The interpretation of this value is that the variable $V$, through the linear regression model, is able to explain 70% of the variability in the variable $S$.
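
The computations in this solution can be reproduced in R directly from the sums given in the statement. The following sketch uses hypothetical variable names (`sum.v`, `sum.s`, ...) for these quantities.

```r
# Summary values taken from the statement of the exercise
n      <- 20
sum.v  <- 599;   sum.s  <- 2835
sum.v2 <- 19195; sum.s2 <- 458657; sum.vs <- 92000
sum.e2 <- 16720.67

# Sample means, quasivariances and covariance
mn.v   <- sum.v / n
mn.s   <- sum.s / n
s2.v   <- (sum.v2 - n * mn.v^2) / (n - 1)
s2.s   <- (sum.s2 - n * mn.s^2) / (n - 1)
cov.vs <- (sum.vs - n * mn.v * mn.s) / (n - 1)

# Least squares estimates
b1 <- cov.vs / s2.v
b0 <- mn.s - b1 * mn.v

# Coefficient of determination, computed in two ways
R2.cor <- (cov.vs / sqrt(s2.v * s2.s))^2
R2.ss  <- 1 - sum.e2 / ((n - 1) * s2.s)

print(round(c(b1 = b1, b0 = b0, R2.cor = R2.cor, R2.ss = R2.ss), 4))
```

Both routes give a value of approximately 0.706; any small difference between them comes from the rounding in the value given for $\sum_i e_i^2$.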


<h3 style="color:brown;">ANOVA table</h3>

The sums of squares we have introduced above are very useful to analyze the quality of the linear regression model. In particular, they provide information regarding the explanatory power of the model, through the value of the coefficient of determination $R^2$. We will see that they also contain information with respect to the significance of the model.

From a practical point of view, it is useful to group these values into an ANOVA table. The name ANOVA corresponds to the initials of <span style="color:brown;">*ANalysis Of VAriance*,</span> which is what we have been doing in this last part of the lesson.

The ANOVA table organizes the preceding values in the following form:

$$
\small
\begin{array}{lcccc}
\text{Variability source} & \text{Sums of squares} & \text{Deg of freedom} & \text{Means of squares} & \text{F ratio} \\
\hline
\text{Model} & \text{SSM} = \sum_i (\hat y_i - \bar y)^2 & 1 & \text{SSM}/1 & \text{SSM}/s_R^2 \\
\text{Residuals} & \text{SSR} = \sum_i (y_i - \hat y_i)^2 = \sum_i e_i^2 & n-2 & \text{SSR}/(n-2) = s_R^2 & \\
\hline
\text{Total} & \text{SST} = \sum_i (y_i - \bar y)^2 = (n-1) s_y^2 & n-1 & & \\
\end{array}
$$

To complete the table we:
1. Start with the values of the sums of squares, obtained from the information of the model, and the degrees of freedom
2. Compute the means of squares by dividing the sums of squares by the degrees of freedom
3. Compute the <span style="color:brown;">F Ratio</span> as the quotient between the sum of squares of the model and the residual variance

The <span style="color:brown;">F ratio</span> provides important information about the model: it allows us to test for the significance of this model. When the model is significant we should have a large value for the sum of squares of the model and a small value for the sum of squares of the residuals, implying a large value for the F Ratio.

Under the null hypothesis $\beta_1 = 0$, it also has a known distribution: it follows a Fisher F distribution with 1 and $n-2$ degrees of freedom; for a justification see [Appendix 6](#App4_6).

In summary, the F ratio satisfies

$$
\text{F Ratio} = \frac{\text{SSM}}{s_R^2} \sim F_{1,n-2} ,
$$

and we can use its value and distribution to test for the significance of the simple linear regression model, $H_0 : \beta_1 = 0$, by computing the p-value associated with the F ratio test, as

$$
\text{p-value} = \Pr \left( F_{1,n-2} > \text{F Ratio} \right)
$$

A proof of this result can be found in [Appendix 6](#App4_6).

The <span style="color:brown;">F ratio</span> is also related to the value of the <span style="color:brown;">coefficient of determination,</span> through

$$
\text{F Ratio} = (n-2) \frac{\text{SSM}/\text{SST}}{\text{SSR}/\text{SST}} = (n-2) \frac{R^2}{1-R^2}
$$
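
These relationships can be checked numerically. The sketch below, on simulated data with hypothetical variable names, computes the F ratio from the sums of squares and from $R^2$, and compares it with the square of the t statistic for the slope and with the value reported by `anova()` for an `lm` fit.

```r
# Illustrative sketch on simulated data (all names are hypothetical)
set.seed(6)
n <- 35
x <- rnorm(n)
y <- 1 + 0.4 * x + rnorm(n)

fit <- lm(y ~ x)
SSR <- sum(resid(fit)^2)
SST <- sum((y - mean(y))^2)
SSM <- SST - SSR
sR2 <- SSR / (n - 2)
R2  <- SSM / SST

F.ss <- SSM / sR2                     # from the sums of squares
F.r2 <- (n - 2) * R2 / (1 - R2)       # from the coefficient of determination
t.b1 <- summary(fit)$coefficients["x", "t value"]
F.lm <- anova(fit)["x", "F value"]    # as reported by R

# p-value of the F ratio test
p.F <- pf(F.ss, 1, n - 2, lower.tail = FALSE)

print(c(F.ss = F.ss, F.r2 = F.r2, t.squared = t.b1^2, F.lm = F.lm))
print(c(p.F = p.F, p.lm = anova(fit)["x", "Pr(>F)"]))
```

In the simple linear regression model the F ratio equals the square of the t statistic for the slope, so both significance tests lead to the same p-value.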

In the following cell we compute the ANOVA table for the linear regression model and conduct the F ratio significance test for the model.


In [None]:
# ANOVA table

## Values for the table

df.m = 1
df.r = n.obs-2

F.anova.s = (df.r/df.m)*(SST-SSR)/SSR
xy.anova = matrix(c(SST-SSR,SSR,SST,df.m,df.r,n.obs-1,SST-SSR,SSR/df.r,NA,
                    F.anova.s,NA,NA),3,4)
xy.anova = as.data.frame(xy.anova)
colnames(xy.anova) = c("SS","DF","Mean","F ratio")
rownames(xy.anova) = c("Model","Residuals","Total")

## Print the ANOVA table

xy.anova.f = xy.anova
xy.anova.f[,1:4] = format(xy.anova[,1:4],digits = 3)
xy.anova.f[2,4] = NA
xy.anova.f[3,3:4] = NA

options(knitr.kable.NA = ' ')
options(align = 'r')
table_prnt(xy.anova.f,"ANOVA Table")

## Values for the significance test

sig.lvl = 0.05
## The slope must come from the same data (df.obs) as v.sR2, n.obs and x.sd
val.b1.t = beta.1.sel/sqrt(v.sR2/((n.obs-1)*x.sd^2))
crit.b1.t = qt(sig.lvl/2,df.r,lower.tail=FALSE)
F.crit = qf(sig.lvl,df.m,df.r,lower.tail=FALSE)
F.pval = pf(F.anova.s,df.m,df.r,lower.tail=FALSE)

n.val.1 = 5
val.t = c(sig.lvl,val.b1.t,crit.b1.t,F.anova.s,F.crit,F.pval)
out.t = as.data.frame(matrix(format(val.t[1:n.val.1],digits=4),n.val.1,1))
out.t = rbind(out.t,format(val.t[n.val.1+1],scientific=TRUE,digits=3))
rownames(out.t) = c("Significance level","Test beta 1","Critical value beta 1","F Ratio",
                    "Critical value F ratio","P value for the test")
colnames(out.t) = c("Values")

table_prnt(out.t,"ANOVA significance test")



#### <span style="color:red">Exercise</span>

*The sales department of a clothing company is conducting a study on the company's online sales. Their goal is to determine if there is a meaningful relationship between the number of daily visits to its web page ($V$, measured in thousands) and the daily volume of Internet sales ($S$, measured in thousands of euros). The department has the following data on the values of these variables for the last 20 days:*

$$
\begin{array}{rcl}
\sum_{i=1}^{20} v_i & = & 599 , \quad \sum_{i=1}^{20} s_i = 2835 \\
\sum_{i=1}^{20} v_i^2 & = & 19195 , \quad \sum_{i=1}^{20} s_i^2 = 458657 , \quad \sum_{i=1}^{20} v_i s_i = 92000 \\
\sum_{i=1}^{20} e_i^2 & = & 16720.67
\end{array}
$$

*where $e_i$ denotes the residuals of the regression model explaining the variable $S$ as a function of $V$.*

- *Compute the ANOVA table for $S$.*
- *From the information in the ANOVA table, conduct a test for the significance of the linear regression model.*



##### <span style="color:red">Exercise. Solution</span>

We define our variables $V =$ "Number of visits in a given day (in thousands)" and $S =$ "Daily Internet sales in thousands of euros". $V$ will be our independent variable.

We have already computed the estimates for the parameters of the model,

$$
\hat \beta_1 = 5.651 , \qquad \hat \beta_0 = -27.498 , \qquad s_R^2 = 928.926
$$

To obtain the ANOVA table we start with the sums of squares,

$$
\begin{array}{rcl}
\text{SSR} & = & \sum_i e_i^2 = 16720.67 , \quad \text{SST} = \sum_i (s_i - \bar s)^2 = (n-1) s_s^2 = 19\times 2989.25 = 56795.75 , \\
\text{SSM} & = & \text{SST} - \text{SSR} = 56795.75 - 16720.67 = 40075.08
\end{array}
$$

The remaining values will be given by

$$
s_R^2 = \frac{16720.67}{18} = 928.926 , \qquad \text{F ratio} = \frac{\text{SSM}}{s_R^2} = \frac{40075.08}{928.926} = 43.141
$$

And the ANOVA table will be

$$
\begin{array}{lcccc}
\text{Source} & \text{Sums of squares} & \text{Deg of freedom} & \text{Means of squares} & \text{F ratio} \\
\hline
\text{Model} & 40075.08 & 1 & 40075.08 & 43.141 \\
\text{Residuals} & 16720.67 & 18 & 928.926 & \\
\hline
\text{Total} & 56795.75 & 19 & & \\
\end{array}
$$

The significance test can be conducted from the value of the F ratio, as this value follows a Fisher F distribution with 1 and 18 degrees of freedom. The critical value in our case, for a significance level of 5%, is $F_{1,18;0.05} = 4.414$, and the value in the ANOVA table is clearly in the rejection region. We conclude that there is a significant linear relationship between the variables $S$ and $V$.
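
The ANOVA table and the test in this solution can be reproduced in R from the values in the statement. The following sketch uses hypothetical variable names; the p-value is an additional output that is not required by the exercise.

```r
# Values taken from the statement of the exercise
n   <- 20
SSR <- 16720.67                       # sum of squared residuals
SST <- 458657 - n * (2835 / n)^2      # sum of (s_i - mean(s))^2
SSM <- SST - SSR

sR2     <- SSR / (n - 2)              # residual variance
F.ratio <- SSM / sR2

# ANOVA table
anova.tab <- data.frame(SS = c(SSM, SSR, SST),
                        DF = c(1, n - 2, n - 1),
                        MS = c(SSM, sR2, NA),
                        F  = c(F.ratio, NA, NA),
                        row.names = c("Model", "Residuals", "Total"))
print(round(anova.tab, 3))

# Significance test at the 5% level
F.crit <- qf(0.05, 1, n - 2, lower.tail = FALSE)
p.val  <- pf(F.ratio, 1, n - 2, lower.tail = FALSE)
print(c(F.ratio = F.ratio, F.crit = F.crit, p.value = p.val))
```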


---
---

<a id='App4_1'></a>

## <span style="color:orange;">Appendix 1: Other parameter estimation methods</span>

---

The least squares method is not the only procedure that can be used to estimate the values of the parameters of the linear regression model. Other alternatives, with different properties for the estimators derived from them, are for example:

- Least-squares estimation formulas minimizing other definitions of the residuals. If we define the residuals as the length of the shortest segment from observation $(x_i,y_i)$ to the regression line $y = \hat \beta_0 + \hat \beta_1 x$, the estimators are given as the solutions of

$$
\min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n \frac{(y_i - \hat \beta_0 - \hat \beta_1 x_i)^2}{1 + \hat \beta_1^2}
$$

- Least absolute values regression. In order to reduce the influence of outlier observations, we may replace the square in the least squares problem with an absolute value, which gives less weight to any outlier observation. The estimators in this case are given by

$$
\min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n | y_i - \hat \beta_0 - \hat \beta_1 x_i |
$$

- Ridge regression. The least squares formulation can be regularized to ensure that the covariance matrix (in the multiple regression model) has reasonable numerical properties. This regularization can be done by adding penalization terms for the squared coefficients,

$$
\min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 + \rho \left( \hat \beta_0^2 + \hat \beta_1^2 \right)
$$

- Lasso regression. The regularization can also be carried out by adding penalization terms on the absolute values of the coefficients. This alternative has the advantage of allowing for the automatic selection of significant coefficients (coefficients different from zero) in the model.

$$
\min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 + \rho \left( | \hat \beta_0 | + | \hat \beta_1 | \right)
$$
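
To give an idea of how these alternatives behave, the following sketch minimizes each of the four objective functions above numerically on simulated data that include one outlier. It uses the general-purpose optimizer `optim()` started from the least squares estimates, which is an assumption made to keep the example short and is not the method normally used in practice for these problems. The penalty value `rho` and all variable names are hypothetical.

```r
# Illustrative sketch on simulated data with one outlier (all names are hypothetical)
set.seed(7)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1] <- y[1] + 15                     # artificial outlier

# Least squares estimates (intercept, slope), used as starting point
b1.ols <- cov(x, y) / var(x)
ols    <- c(mean(y) - b1.ols * mean(x), b1.ols)

rho <- 1                              # hypothetical penalty parameter

# Objective functions of the alternative methods, b = (beta0, beta1)
obj.orth  <- function(b) sum((y - b[1] - b[2] * x)^2) / (1 + b[2]^2)
obj.lad   <- function(b) sum(abs(y - b[1] - b[2] * x))
obj.ridge <- function(b) sum((y - b[1] - b[2] * x)^2) + rho * sum(b^2)
obj.lasso <- function(b) sum((y - b[1] - b[2] * x)^2) + rho * sum(abs(b))

# Numerical minimization with the Nelder-Mead method (default in optim)
fit.one <- function(obj) optim(ols, obj, method = "Nelder-Mead")$par

est <- rbind(OLS        = ols,
             Orthogonal = fit.one(obj.orth),
             LAD        = fit.one(obj.lad),
             Ridge      = fit.one(obj.ridge),
             Lasso      = fit.one(obj.lasso))
colnames(est) <- c("beta0", "beta1")
print(round(est, 3))
```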


<a id='App4_2'></a>

## <span style="color:orange">Appendix 2: Least squares estimates</span>

---

### <span style="color:orange">Deriving the formulas of the least squares estimates</span>

Given the least squares optimization problem,

$$
\min_{\hat \beta_0,\hat \beta_1} f(\hat \beta_0,\hat \beta_1) = \min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n e_i (\hat \beta_0,\hat \beta_1)^2 = \min_{\hat \beta_0,\hat \beta_1} \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2
$$

The first-order optimality conditions are

$$
\left. \begin{array}{rcl}
\displaystyle \frac{\partial f}{\partial \hat \beta_0} & = & \displaystyle \sum_{i=1}^n 2 (-1) (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0 \\
\displaystyle \frac{\partial f}{\partial \hat \beta_1} & = & \displaystyle \sum_{i=1}^n 2 (-x_i) (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\end{array} \right\}
\quad \Rightarrow \quad
\left. \begin{array}{rcl}
\displaystyle \sum_{i=1}^n y_i & = & \displaystyle n \hat \beta_0 + \hat \beta_1 \sum_{i=1}^n x_i \\
\displaystyle \sum_{i=1}^n x_i y_i & = & \hat \beta_0 \sum_{i=1}^n x_i + \hat \beta_1 \sum_{i=1}^n x_i^2
\end{array} \right\}
$$

If we reorder the terms in the preceding equations, these conditions are equivalent to

$$
\begin{array}{rcl}
& & \left. \begin{array}{rcl}
\displaystyle \frac{1}{n} \sum_{i=1}^n y_i & = & \displaystyle \hat \beta_0 + \hat \beta_1 \frac{1}{n} \sum_{i=1}^n x_i \\
\displaystyle \sum_{i=1}^n x_i y_i - n \bar x \bar y & = & \displaystyle\hat \beta_0 \sum_{i=1}^n x_i + \hat \beta_1 \sum_{i=1}^n x_i^2 - \hat \beta_1 n \bar x^2 - n \bar x \bar y + \hat \beta_1 n \bar x^2
\end{array} \right\} \\
& \Rightarrow \quad &
\left. \begin{array}{rcl}
\displaystyle \bar y & = & \displaystyle \hat \beta_0 + \hat \beta_1 \bar x \\
\displaystyle \frac{1}{n} \left( \sum_{i=1}^n x_i y_i - n \bar x \bar y \right) & = & \displaystyle \hat \beta_0 \frac{1}{n} \sum_{i=1}^n x_i + \hat \beta_1 \frac{1}{n} \left( \sum_{i=1}^n x_i^2 - n \bar x^2 \right) - \bar x \left( \bar y - \hat \beta_1 \bar x \right)
\end{array} \right\}
\end{array}
$$

Substituting the first equation into the second one, and using the equalities

$$
\frac{1}{n} \left( \sum_{i=1}^n x_i y_i - n \bar x \bar y \right) = \frac{n-1}{n} \text{cov} (x,y) , \quad \hat \beta_1 \frac{1}{n} \left( \sum_{i=1}^n x_i^2 - n \bar x^2 \right) = \hat \beta_1 \frac{n-1}{n} s_x^2 ,
$$

we obtain from the first equation $\hat \beta_0 = \bar y - \hat \beta_1 \bar x$, and from the second one

$$
\frac{n-1}{n} \text{cov} (x,y) = \displaystyle \hat \beta_0 \bar x + \hat \beta_1 \frac{n-1}{n} s_x^2 - \bar x \hat \beta_0 = \hat \beta_1 \frac{n-1}{n} s_x^2 \quad \Rightarrow \quad \hat \beta_1 = \frac{\text{cov} (x,y)}{s_x^2}
$$

These are the two definitions that were introduced as the least-squares estimators for the two parameters defining the regression line.
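
The closed-form expressions obtained above can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper name `least_squares` are illustrative assumptions, not part of the chapter's data. It computes both estimators and evaluates the two first-order conditions, which should vanish up to rounding error.

```python
def least_squares(x, y):
    """Return (b0, b1) with b1 = cov(x, y) / s_x^2 and b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample covariance and quasi-variance, both with divisor n - 1
    cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / s2_x
    b0 = ybar - b1 * xbar
    return b0, b1


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)

# Partial derivatives of the sum of squares at (b0, b1); both should be ~0
grad_b0 = sum(-2 * (yi - b0 - b1 * xi) for xi, yi in zip(x, y))
grad_b1 = sum(-2 * xi * (yi - b0 - b1 * xi) for xi, yi in zip(x, y))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"gradient = ({grad_b0:.2e}, {grad_b1:.2e})")
```

For these made-up values the slope is close to 2 and the intercept close to 0, and both gradient components are zero up to floating-point error.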

Finally, for the matrix of second derivatives,

$$
\nabla^2 f(\hat \beta_0,\hat \beta_1) = \left( \begin{array}{cc} 2n & 2 \sum_i x_i \\ 2 \sum_i x_i & 2 \sum_i x_i^2 \end{array} \right) ,
$$

and this matrix is positive definite whenever $n \sum_i x_i^2 - (\sum_i x_i )^2 > 0$, implying that the sufficient second-order optimality conditions are satisfied whenever $s_x^2 > 0$.
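
As a quick check of the second-order condition, the sketch below builds the $2 \times 2$ matrix of second derivatives for a hypothetical set of $x$ values (an assumption made only for illustration) and compares its determinant with $4 n (n-1) s_x^2$, which is the same quantity as $4 \left( n \sum_i x_i^2 - (\sum_i x_i)^2 \right)$.

```python
# Hypothetical values of the independent variable, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)

sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)

# Matrix of second derivatives of the least-squares objective
hessian = [[2 * n, 2 * sum_x],
           [2 * sum_x, 2 * sum_x2]]
det = hessian[0][0] * hessian[1][1] - hessian[0][1] * hessian[1][0]

# Quasi-variance of x (divisor n - 1)
xbar = sum_x / n
s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# A symmetric 2x2 matrix is positive definite iff its first entry
# and its determinant are both positive
is_positive_definite = hessian[0][0] > 0 and det > 0

print(det, 4 * n * (n - 1) * s2_x, is_positive_definite)
```

The determinant is positive exactly when the $x_i$ are not all equal, which is the condition $s_x^2 > 0$.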


<a id='App4_3'></a>

## <span style="color:orange">Appendix 3: Parameter estimator distributions</span>

---

In this cell we justify why the estimators we have introduced for the different model parameters have the indicated distributions. We will base these justifications on our assumptions for the model, and on considering the values of the independent variable as given. That is, our results are conditional on the values of $X$.

The values of the independent variable $X$ on which we will condition the estimators to obtain their distributions are denoted as $\{ x_1, \ldots , x_n \}$, with mean $\bar x$ and quasi-variance $s_x^2$.

#### <span style="color:orange">Distribution of the dependent observations, $Y_i$</span>

From the definition of the linear regression model, $Y_i = \beta_0 + \beta_1 X_i + U_i$ and the assumptions on the model, we have that

$$
Y_i | \left( \underline{X} = \{ x_1, \ldots , x_n \} \right) = \beta_0 + \beta_1 x_i + U_i , \quad U_i \sim N(0,\sigma^2) \  \Rightarrow \  Y_i | \underline{X} \sim N(\beta_0 + \beta_1 x_i , \sigma^2 )
$$

Also, from the independence of the errors $U_i$ it follows that the dependent variables $Y_i$ are also independent, when we condition on the values of $\underline{X}$.

In what follows, and to simplify the notation, we will skip the conditional notation, but we will continue assuming this condition applies.

#### <span style="color:orange">Distribution of the statistic for the slope of the regression line, $\hat \beta_1$</span>

We have

$$
\begin{array}{rcl}
\hat \beta_1 & = & \displaystyle \frac{\text{cov}(x,Y)}{s_x^2} = \frac{1}{s_x^2} \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x) (Y_i - \bar Y) = \sum_{i=1}^n w_i (Y_i - \bar Y) = \sum_{i=1}^n w_i Y_i - \bar Y \sum_{i=1}^n w_i \\
& = & \sum_{i=1}^n w_i Y_i
\end{array}
$$

where we have introduced the notation

$$
\begin{array}{rclcl}
\displaystyle v_i & \equiv & x_i - \bar x , & \text{ which satisfy } & \sum_{i=1}^n v_i = \sum_{i=1}^n w_i = 0 \\
\displaystyle w_i & \equiv & \displaystyle \frac{v_i}{(n-1) s_x^2} & & \sum_{i=1}^n v_i^2 = \sum_{i=1}^n v_i (x_i - \bar x) = (n-1) s_x^2 = \sum_{i=1}^n v_i x_i \\
& & & \text{ and } & \sum_{i=1}^n w_i (x_i - \bar x) = \sum_{i=1}^n w_i x_i = 1
\end{array}
$$

This result implies that $\hat \beta_1$ follows a normal distribution, as it is a linear combination of the independent, normally distributed $Y_i$.

The mean and variance of $\hat \beta_1$ are given by

$$
\begin{array}{rcl}
E[\hat \beta_1] & = & \sum_{i=1}^n w_i E[Y_i] = \sum_{i=1}^n w_i (\beta_0 + \beta_1 x_i ) = \beta_0 \sum_{i=1}^n w_i + \beta_1 \sum_{i=1}^n w_i x_i = \beta_1 \\
\text{Var} (\hat \beta_1) & = & \displaystyle \text{Var} \left( \sum_{i=1}^n w_i Y_i \right) = \sum_{i=1}^n w_i^2 \text{Var} (Y_i) = \sum_{i=1}^n w_i^2 \sigma^2 = \frac{\sigma^2}{(n-1)s_x^2}
\end{array}
$$

where we have made use of the independence of the $Y_i$ and $\text{Var} (Y_i) = \sigma^2$. One consequence of these results is that $\hat \beta_1$ is an unbiased estimator for $\beta_1$ under our assumptions.
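
The properties of the weights $w_i$ used in this derivation are purely algebraic, so they can be verified for any set of $x$ values. The sketch below uses hypothetical values (an assumption for illustration) and checks that $\sum_i w_i = 0$, $\sum_i w_i x_i = 1$ and $\sum_i w_i^2 = 1/((n-1) s_x^2)$, the identities behind the mean and the variance of $\hat \beta_1$.

```python
# Hypothetical values of the independent variable, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
n = len(x)
xbar = sum(x) / n

v = [xi - xbar for xi in x]        # v_i = x_i - xbar
ss_x = sum(vi ** 2 for vi in v)    # (n - 1) * s_x^2
w = [vi / ss_x for vi in v]        # w_i = v_i / ((n - 1) * s_x^2)

sum_w = sum(w)                                   # expected: 0
sum_wx = sum(wi * xi for wi, xi in zip(w, x))    # expected: 1
sum_w2 = sum(wi ** 2 for wi in w)                # expected: 1 / ss_x

print(sum_w, sum_wx, sum_w2, 1 / ss_x)
```

Multiplying `sum_w2` by $\sigma^2$ gives the variance of $\hat \beta_1$, so more dispersed $x$ values yield a more precise slope estimate.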

#### <span style="color:orange">Distribution of the statistic for the intercept of the regression line, $\hat \beta_0$</span>

It holds that

$$
\begin{array}{rcl}
\hat \beta_0 & = & \bar Y - \hat \beta_1 \bar x = \bar Y - \sum_{i=1}^n w_i (Y_i - \bar Y) \bar x = \frac{1}{n} \sum_{i=1}^n Y_i - \sum_{i=1}^n \bar x w_i Y_i + \bar Y \bar x \sum_{i=1}^n w_i \\
& = & \frac{1}{n} \sum_{i=1}^n \left( 1 - n \bar x w_i \right) Y_i
\end{array}
$$

implying that $\hat \beta_0$ also follows a normal distribution.

Its mean and variance are given by

$$
\begin{array}{rcl}
E[\hat \beta_0] & = & \displaystyle \frac{1}{n} \sum_{i=1}^n \left( 1 - n \bar x w_i \right) E[Y_i] = \frac{1}{n} \sum_{i=1}^n \left( 1 - n \bar x w_i \right) ( \beta_0 + \beta_1 x_i) \\
& = & \displaystyle \beta_0 + \beta_1 \frac{1}{n} \sum_{i=1}^n \left( 1 - n \bar x w_i \right) x_i = \beta_0 + \beta_1 \left( \bar x - \bar x \sum_{i=1}^n w_i x_i \right) = \beta_0 \\
\text{Var} (\hat \beta_0) & = & \displaystyle \text{Var} \left( \frac{1}{n} \sum_{i=1}^n \left( 1 - n \bar x w_i \right) Y_i \right) = \frac{1}{n^2} \sum_{i=1}^n \left( 1 - n \bar x w_i \right)^2 \text{Var} (Y_i) \\
& = & \displaystyle \frac{\sigma^2}{n^2} \sum_{i=1}^n \left( 1 - n \bar x w_i \right)^2 = \frac{\sigma^2}{n^2} n + \frac{\sigma^2}{n^2} n^2 \bar x^2 \sum_{i=1}^n w_i^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{(n-1) s_x^2} \right)
\end{array}
$$

and we also have that $\hat \beta_0$ is an unbiased estimator for $\beta_0$ under our assumptions.
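
The same kind of algebraic check applies to the intercept. Writing $\hat \beta_0 = \sum_i c_i Y_i$ with $c_i = (1 - n \bar x w_i)/n$, the sketch below (with hypothetical $x$ values, assumed only for illustration) verifies that $\sum_i c_i = 1$, $\sum_i c_i x_i = 0$ and $\sum_i c_i^2 = 1/n + \bar x^2 / ((n-1) s_x^2)$.

```python
# Hypothetical values of the independent variable, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
n = len(x)
xbar = sum(x) / n

ss_x = sum((xi - xbar) ** 2 for xi in x)    # (n - 1) * s_x^2
w = [(xi - xbar) / ss_x for xi in x]

# Coefficients of Y_i in the expression for the intercept estimator
c = [(1.0 - n * xbar * wi) / n for wi in w]

sum_c = sum(c)                                   # expected: 1
sum_cx = sum(ci * xi for ci, xi in zip(c, x))    # expected: 0
sum_c2 = sum(ci ** 2 for ci in c)                # variance factor
expected_factor = 1.0 / n + xbar ** 2 / ss_x

print(sum_c, sum_cx, sum_c2, expected_factor)
```

The first two identities give $E[\hat \beta_0] = \beta_0$, and the third one, multiplied by $\sigma^2$, gives the variance of $\hat \beta_0$.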

#### <span style="color:orange">Distribution of the estimator for the variance of the errors, $s_R^2$</span>

<span style="color:brown">The following derivation makes use of results from matrix algebra and multivariate statistics.</span>

From the definition of the residual variance,

$$
S_R^2 = \frac{1}{n-2} \sum_{i=1}^n E_i^2 = \frac{1}{n-2} \sum_{i=1}^n \left(Y_i - \hat \beta_0 - \hat \beta_1 x_i \right)^2
$$

And from the definition of the residuals

$$
E_i = Y_i - \hat \beta_0 - \hat \beta_1 x_i ,
$$

it holds that these residuals $E_i$ are normally distributed, as they are a linear combination of normal random variables, $Y_i$, $\hat \beta_0$ and $\hat \beta_1$.

Their expected values are

$$
E[E_i] = E[Y_i] - E[\hat \beta_0] - E[\hat \beta_1] x_i = \beta_0 + \beta_1 x_i - \beta_0 - \beta_1 x_i = 0
$$

To obtain the variance of the residuals it is easier to use matrix notation. In this notation we have

$$
\begin{array}{rcl}
E & = & \displaystyle Y - \hat \beta_0 e - \hat \beta_1 x = Y - e \left( \frac{1}{n} e - \bar x w \right)^T Y - x w^T Y \\
& = & \displaystyle Y - \frac{1}{e^T e} e e^T Y + \bar x e \frac{1}{v^Tv} v^T Y - (v + \bar x e) \frac{1}{v^T v} v^T Y \\
& = & \displaystyle \left( I - \frac{1}{e^T e} e e^T - \frac{1}{v^T v} v v^T \right) Y = M Y
\end{array}
$$

where we have used the vector $e$, having all components equal to one, and the matrix of coefficients for $Y$, $M$.

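The covariance matrix for $E$ follows from $\text{Var} (Y) = \sigma^2 I$ and from $M$ being symmetric and idempotent ($M^T = M$ and $M^2 = M$, as $e^T v = 0$):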

$$
\text{Var} (E) = M \text{Var} (Y) M^T = \sigma^2 M = \sigma^2 \left( I - \frac{1}{n} e e^T - \frac{1}{(n-1)s_x^2} v v^T \right)
$$

This result implies that the variances of the different residuals (the values in the diagonal of this matrix) are different,

$$
\text{Var} (E_i) = \sigma^2 \left( 1 - \frac{1}{n} - \frac{1}{(n-1)s_x^2} (x_i - \bar x)^2 \right)
$$

and the residuals are not independent, as the values outside the diagonal of $M$ are not zero.
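
The matrix $M$ can be built explicitly for a small example to see these properties. The following sketch uses plain Python lists and hypothetical $x$ values (both are choices made for illustration); it checks that $M$ is idempotent, that its rows sum to zero, and that its trace equals $n - 2$.

```python
# Hypothetical values of the independent variable, for illustration only
x = [1.0, 2.0, 4.0, 7.0]
n = len(x)
xbar = sum(x) / n
v = [xi - xbar for xi in x]
ss_x = sum(vi ** 2 for vi in v)    # v^T v = (n - 1) * s_x^2

# M = I - (1/n) e e^T - (1 / v^T v) v v^T
M = [[(1.0 if i == j else 0.0) - 1.0 / n - v[i] * v[j] / ss_x
      for j in range(n)] for i in range(n)]

# Idempotency: M M should equal M
MM = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
idempotency_gap = max(abs(MM[i][j] - M[i][j])
                      for i in range(n) for j in range(n))

# M e = 0: every row of M sums to zero
max_row_sum = max(abs(sum(row)) for row in M)

# The trace equals n - 2, the degrees of freedom of the residuals
trace_M = sum(M[i][i] for i in range(n))

print(idempotency_gap, max_row_sum, trace_M)
```

The diagonal entries of `M`, multiplied by $\sigma^2$, are the variances of the individual residuals; they are smaller for observations whose $x_i$ is far from $\bar x$.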

Define a matrix $Q \in \mathbb{R}^{n\times (n-2)}$ whose columns form an orthonormal basis for the subspace orthogonal to $\text{span} (e,v)$ (the subspace generated by the vectors $e$ and $v$), so that $Q^T e = Q^T v = 0$, and another matrix $\tilde Q \in \mathbb{R}^{n\times 2}$ whose columns form an orthonormal basis for $\text{span} (e,v)$. Also, let $\hat Q = \left( \begin{array}{cc} Q & \tilde Q \end{array} \right)$, which is an $n \times n$ orthogonal matrix.

Define a new set of $n-2$ variables $R_i$ as the components of $R \equiv Q^T E \in \mathbb{R}^{n-2}$; these variables are linear combinations of normal random variables, and as a consequence they follow a normal distribution. It holds that $\text{Var} (R) = \text{Var} (Q^T E) = Q^T \text{Var} (E) Q = \sigma^2 Q^T M Q = \sigma^2 I$, that is, $R_i \sim N(0,\sigma^2)$, and these variables are independent of each other, as they are jointly normal and their covariance matrix is diagonal.

It holds that

$$
e^T E = \sum_i E_i = \sum_i Y_i - n \hat \beta_0 - \hat \beta_1 \sum_i x_i = n(\bar Y - \hat \beta_0 - \hat \beta_1 \bar x) = 0
$$

and

$$
\begin{array}{rcl}
v^T E & = & \sum_i x_i E_i - \bar x \sum E_i = \sum_i x_i Y_i - \hat \beta_0 \sum_i x_i - \hat \beta_1 \sum_i x_i^2 \\
& = & \sum_i x_i Y_i - \bar Y \sum_i x_i + \hat \beta_1 \bar x \sum_i x_i - \hat \beta_1 \sum_i x_i^2 = \sum_i x_i ( Y_i - \bar Y ) - \hat \beta_1 \sum_i x_i (x_i - \bar x) \\
& = & \sum_i (x_i - \bar x) ( Y_i - \bar Y ) - \hat \beta_1 \sum_i (x_i - \bar x)^2 = 0
\end{array}
$$

implying that $d^T E = 0$ for any $d \in \text{span} (e,v)$.

As $\hat Q \hat Q^T = I$, we have that

$$
E^T E = E^T \hat Q \hat Q^T E = E^T Q Q^T E + E^T \tilde Q \tilde Q^T E = R^T R + E^T \tilde Q \tilde Q^T E = R^T R ,
$$

and we have the desired result,

$$
T_{\sigma^2} = \frac{(n-2) S_R^2}{\sigma^2} = \frac{(n-2) E^T E/(n-2)}{\sigma^2} = \sum_{i=1}^{n-2} \frac{R_i^2}{\sigma^2} \sim \chi^2_{n-2}
$$

A consequence of this result is that

$$
E[T_{\sigma^2}] = E[\chi^2_{n-2}] = n - 2 = E\left[ \frac{(n-2) S_R^2}{\sigma^2} \right] = \frac{n-2}{\sigma^2} E [ S_R^2 ] \ \Rightarrow \ E [ S_R^2 ] = \sigma^2 
$$
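
The unbiasedness of $S_R^2$ can also be illustrated by simulation. The sketch below is a small Monte Carlo experiment under assumed parameter values (`beta0`, `beta1`, `sigma`, the design points `x` and the number of replications are all hypothetical choices); it generates many samples from the model, computes $S_R^2$ for each one, and averages the results.

```python
import random


def residual_variance(x, y):
    """Fit the least-squares line and return S_R^2 = SSR / (n - 2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)


random.seed(0)                          # fixed seed, for reproducibility
beta0, beta1, sigma = 1.0, 2.0, 2.0     # hypothetical true parameters
x = [float(i) for i in range(1, 11)]    # fixed design with n = 10
reps = 5000

values = []
for _ in range(reps):
    # Simulate Y_i = beta0 + beta1 * x_i + U_i with U_i ~ N(0, sigma^2)
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    values.append(residual_variance(x, y))

mean_s2 = sum(values) / reps                       # should be close to sigma^2
mean_t = (len(x) - 2) * mean_s2 / sigma ** 2       # should be close to n - 2

print(mean_s2, sigma ** 2, mean_t, len(x) - 2)
```

The averages are only approximately equal to $\sigma^2$ and $n-2$, as they are themselves subject to simulation error.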


#### <span style="color:orange">Distribution of the test statistic for $\beta_1$</span>

We have seen that

$$
Z \equiv \frac{\hat \beta_1 - \beta_1}{\displaystyle \sqrt{\frac{\sigma^2}{(n-1)s_x^2}}} \sim N(0,1) , \qquad V \equiv \frac{(n-2)S_R^2}{\sigma^2} \sim \chi^2_{n-2}
$$

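Since $Z$ and $V$ are also independent ($\hat \beta_1$ depends on $Y$ only through $v^T Y$, which is uncorrelated with $E = M Y$ because $v^T M = 0$), the definition of a Student t distribution (see Lesson 1) implies that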

$$
\frac{Z}{\sqrt{V/(n-2)}} = \frac{\displaystyle(\hat \beta_1 - \beta_1)/\sqrt{\frac{\sigma^2}{(n-1)s_x^2}}}{\displaystyle \sqrt{\frac{(n-2)S_R^2/\sigma^2}{n-2}}} = \frac{\hat \beta_1 - \beta_1}{\displaystyle \sqrt{\frac{S_R^2}{(n-1)s_x^2}}} = T_{\beta_1} \sim t_{n-2}
$$

#### <span style="color:orange">Distribution of the test statistic for $\beta_0$</span>

Analogously to the preceding case, we have that

$$
Z' \equiv \frac{\hat \beta_0 - \beta_0}{\displaystyle \sqrt{\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{(n-1) s_x^2} \right)}} \sim N(0,1) , \qquad V \equiv \frac{(n-2)S_R^2}{\sigma^2} \sim \chi^2_{n-2}
$$

and we obtain

$$
\begin{array}{rcl}
\displaystyle \frac{Z'}{\sqrt{V/(n-2)}} & = & \frac{\displaystyle(\hat \beta_0 - \beta_0)/\sqrt{\sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{(n-1) s_x^2} \right)}}{\displaystyle \sqrt{\frac{(n-2)S_R^2/\sigma^2}{n-2}}} \\
& = & \displaystyle \frac{\hat \beta_0 - \beta_0}{\sqrt{S_R^2 \left( \frac{1}{n} + \frac{\bar x^2}{(n-1) s_x^2} \right)}} = T_{\beta_0} \sim t_{n-2}
\end{array}
$$

<a id='App4_4'></a>

## <span style="color:orange;">Appendix 4: Forecasting distributions</span>

---

#### <span style="color:orange;">A formal definition for mean responses and forecasts</span>

- <span style="color:brown;">Forecast</span> estimates

  Let $U_0 = U | (X = x_0)$. The random variable of interest, whose value we would like to predict in this case, is

$$
Y_0 = Y | (X = x_0) = ( \beta_0 + \beta_1 X + U ) | (X = x_0) = \beta_0 + \beta_1 x_0 + U_0
$$
  
- <span style="color:brown;">Mean response</span> estimates

  We will denote the quantity of interest as $y_0$. It is given by

$$
y_0 = E[Y | X = x_0 ] = E [ \beta_0 + \beta_1 X + U | X = x_0 ] = \beta_0 + \beta_1 x_0
$$
    

#### <span style="color:orange;">Derivation of the distributions</span>

We now provide justification for the results regarding the distributions of the estimators for the mean response and the forecast.

In both cases the estimator is a linear combination of normal random variables, so it follows a normal distribution. Its mean, also in both cases, is $y_0 = \beta_0 + \beta_1 x_0$. The variance of the estimator, however, differs between the mean response estimator and the forecast estimator.

In what follows we omit the indication that we are computing a conditional variance for $X = x_0$, as well as the fact that we are conditioning on the known values of $X$, to simplify the notation.

##### <span style="color:orange;">Variance and distribution of the mean response estimator</span>

To obtain the variance in this case, we use

$$
\hat Y = \hat \beta_0  + \hat \beta_1 x_0 = \bar Y - \hat \beta_1 \bar x + \hat \beta_1 x_0 = \bar Y + (x_0 - \bar x) \hat \beta_1 = \sum_{i=1}^n \left( \frac{1}{n} + (x_0 - \bar x) w_i \right) Y_i
$$

where $w_i$ are the values introduced in our previous justifications for the distributions of the linear regression parameter estimators.

We obtain

$$
\begin{array}{rcl}
\text{Var} (\hat Y) & = & \displaystyle \sum_{i=1}^n \left( \frac{1}{n} + (x_0 - \bar x) w_i \right)^2 \text{Var} (Y_i) = \sigma^2 \sum_{i=1}^n \left( \frac{1}{n^2} + (x_0 - \bar x)^2 w_i^2 \right) \\
& = & \displaystyle \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2} \right)
\end{array}
$$

and as a consequence,

$$
\frac{\hat Y - y_0}{\sigma \sqrt{\displaystyle \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2}}} \sim N(0,1) \quad \text{ and } \quad T_{mr} = \frac{\hat Y - y_0}{s_R \sqrt{\displaystyle \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2}}} \sim t_{n-2}
$$

##### <span style="color:orange;">Variance and distribution of the forecast estimator</span>

We now have

$$
Y_0 = \hat \beta_0  + \hat \beta_1 x_0 + U_0 = \sum_{i=1}^n \left( \frac{1}{n} + (x_0 - \bar x) w_i \right) Y_i + U_0
$$

As $x_0$ is different from the values in the sample $x_i$, from our assumptions $U_0 = U | X = x_0$ is independent of the errors $U_i = U | X = x_i$, and as a consequence it is also independent of the dependent variables $Y_i$. The variance will be given by

$$
\begin{array}{rcl}
\text{Var} (Y_0) & = & \displaystyle \sum_{i=1}^n \left( \frac{1}{n} + (x_0 - \bar x) w_i \right)^2 \text{Var} (Y_i) + \text{Var} (U_0) \\
& = & \displaystyle \sigma^2 \sum_{i=1}^n \left( \frac{1}{n^2} + (x_0 - \bar x)^2 w_i^2 \right) + \sigma^2 = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2} \right)
\end{array}
$$

We then have

$$
\frac{Y_0 - y_0}{\sigma \sqrt{\displaystyle 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2}}} \sim N(0,1) \quad \text{ and } \quad T_{f} = \frac{Y_0 - y_0}{s_R \sqrt{\displaystyle 1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{(n-1)s_x^2}}} \sim t_{n-2}
$$
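
To compare the two variances, it is enough to look at the factors that multiply $\sigma^2$. The sketch below computes both factors for a hypothetical sample of $x$ values and two values of $x_0$ (the function name `variance_factors` and all numbers are assumptions made for illustration).

```python
def variance_factors(x, x0):
    """Return the factors multiplying sigma^2 in the variance of the
    mean response estimator and of the forecast, at the point x0."""
    n = len(x)
    xbar = sum(x) / n
    ss_x = sum((xi - xbar) ** 2 for xi in x)    # (n - 1) * s_x^2
    mean_response = 1.0 / n + (x0 - xbar) ** 2 / ss_x
    forecast = 1.0 + mean_response              # extra term from Var(U_0)
    return mean_response, forecast


# Hypothetical values of the independent variable, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]

at_mean = variance_factors(x, 3.0)     # x0 equal to the sample mean
far_away = variance_factors(x, 8.0)    # x0 outside the observed range

print(at_mean)
print(far_away)
```

Both factors are smallest at $x_0 = \bar x$ and grow quadratically as $x_0$ moves away from it; the forecast factor always exceeds the mean response factor by exactly one, reflecting the variance of the new error $U_0$.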


<a id='App4_5'></a>

## <span style="color:orange;">Appendix 5: Sums of squares properties</span>

---

In this cell we provide justifications for several results on the behavior of different sums of squares related to the linear regression model.

#### <span style="color:orange;">General relationship between sums of squares</span>

The sums of squares of a linear regression model satisfy $\text{SST} = \text{SSM} + \text{SSR}$. This is not a trivial result, as we are comparing sums of squares and in general $a^2 + b^2 \not= (a+b)^2$. But in our case this result holds, as

$$
\begin{array}{rcl}
\text{SST} & = & \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (y_i - \hat y_i + \hat y_i - \bar y)^2 \\
& = & \sum_{i=1}^n (y_i - \hat y_i)^2 + 2 \sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) + \sum_{i=1}^n (\hat y_i - \bar y)^2 \\
& = & \text{SSR} + \text{SSM} + 2 \sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y)
\end{array}
$$

and

$$
\begin{array}{rcl}
\sum_{i=1}^n (y_i - \hat y_i)(\hat y_i - \bar y) & = & \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)(\hat \beta_0 + \hat \beta_1 x_i - \bar y) \\
& = & 2n \hat \beta_0 (\bar y - \hat \beta_1 \bar x) - n \bar y (\bar y - \hat \beta_1 \bar x) - n\hat \beta_0^2 + \hat \beta_1 \left( \sum_i x_i y_i - \hat \beta_1 \sum_i x_i^2 \right) \\
& = & n \hat \beta_0 (\hat \beta_0 - \bar y) + \hat \beta_1 n \bar x ( \bar y - \hat \beta_1 \bar x ) = - n \hat \beta_0 \hat \beta_1 \bar x + n \hat \beta_0 \hat \beta_1 \bar x = 0
\end{array}
$$
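
This decomposition can be verified on any data set. The sketch below uses hypothetical data (assumed only for illustration) to fit the line, compute the three sums of squares and the cross-product term, and confirm that the cross-product term vanishes.

```python
# Hypothetical data, for illustration only
x = [1.0, 2.0, 4.0, 7.0, 9.0]
y = [3.2, 4.1, 8.3, 12.9, 18.4]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
ss_x = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
b0 = ybar - b1 * xbar
y_hat = [b0 + b1 * xi for xi in x]    # fitted values

sst = sum((yi - ybar) ** 2 for yi in y)                      # total
ssm = sum((yh - ybar) ** 2 for yh in y_hat)                  # model
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))        # residual
cross = sum((yi - yh) * (yh - ybar) for yi, yh in zip(y, y_hat))

print(sst, ssm + ssr, cross)
```

The equality holds only when the fitted values come from the least-squares estimators; for any other line the cross-product term is in general different from zero.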

#### <span style="color:orange;">Distributions of the sums of squares</span>

Regarding the distributions of the sums of squares, we have already shown that

$$
\begin{array}{rcl}
\displaystyle \frac{(n-1)S_Y^2}{\sigma^2} & \sim & \displaystyle \chi^2_{n-1} \quad \Rightarrow \quad \frac{\text{SST}}{\sigma^2} \sim \chi^2_{n-1} \\
\displaystyle \frac{(n-2)S_R^2}{\sigma^2} & \sim & \displaystyle \chi^2_{n-2} \quad \Rightarrow \quad \frac{\text{SSR}}{\sigma^2} \sim \chi^2_{n-2}
\end{array}
$$

where the result for $\text{SST}$ holds when $\beta_1 = 0$, so that all the $Y_i$ share the same mean.

For the distribution of $\text{SSM}$, it holds that

$$
\begin{array}{rcl}
\text{SSM} & = & \sum_{i=1}^n \left( \hat Y_i - \bar Y \right)^2 = \sum_{i=1}^n \left( \hat \beta_0 + \hat \beta_1 x_i - \bar Y \right)^2 = \sum_{i=1}^n \left( \bar Y - \hat \beta_1 \bar x + \hat \beta_1 x_i - \bar Y \right)^2 \\
& = & \hat \beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 = (n-1) s_x^2 \hat \beta_1^2 
\end{array}
$$

We have seen that

$$
\hat \beta_1 \sim N \left( \beta_1 , \frac{\sigma^2}{(n-1)s_x^2} \right)
$$

and as a consequence, if we condition on $\beta_1 = 0$, that is, on not having a linear relationship between the variables,

$$
\frac{\text{SSM} | \beta_1 = 0}{\sigma^2} = \frac{\hat \beta_1^2}{\displaystyle \frac{\sigma^2}{(n-1)s_x^2}} \sim \chi^2_1
$$

that is, $\text{SSM} / \sigma^2$ conditional on $\beta_1 = 0$ follows a chi-squared distribution with one degree of freedom, as it can be written as the square of a standard normal random variable.

#### <span style="color:orange;">The coefficient of determination and the correlation coefficient</span>

The relationship between the coefficient of determination and the correlation coefficient follows from

$$
R^2 = \frac{\text{SSM}}{\text{SST}} = \hat \beta_1^2 \frac{(n-1) s_x^2}{(n-1) s_y^2} = \frac{\text{cov}(x,y)^2}{s_x^4} \frac{s_x^2}{s_y^2} = \frac{\text{cov}(x,y)^2}{s_x^2 s_y^2} = \text{cor}(x,y)^2
$$
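
The identity between $R^2$ and the squared correlation coefficient can be checked numerically as well. The sketch below, using hypothetical data (an assumption for illustration), computes $R^2$ from the sums of squares and the correlation coefficient from the covariance and the quasi-variances, and compares both values.

```python
import math

# Hypothetical data, for illustration only
x = [1.0, 2.0, 4.0, 7.0, 9.0]
y = [3.2, 4.1, 8.3, 12.9, 18.4]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Covariance and quasi-variances, all with divisor n - 1
cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
s2_y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# Least-squares fit and coefficient of determination
b1 = cov_xy / s2_x
b0 = ybar - b1 * xbar
ssm = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)
sst = (n - 1) * s2_y
r2 = ssm / sst

# Sample correlation coefficient
cor_xy = cov_xy / math.sqrt(s2_x * s2_y)

print(r2, cor_xy ** 2)
```

Note that this equality is specific to the simple linear regression model, where there is a single independent variable.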

<a id='App4_6'></a>

## <span style="color:orange;">Appendix 6: The F ratio</span>

---

Under $\beta_1 = 0$ it holds that

$$
\text{F Ratio} | (\beta_1 = 0) = \frac{\text{SSM} | (\beta_1 = 0)}{S_R^2} \sim F_{1,n-2}
$$

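This distribution follows because, under $\beta_1 = 0$, the F Ratio is the ratio of two independent chi-squared random variables, $\text{SSM}/\sigma^2$ and $(n-2)S_R^2/\sigma^2$, each divided by its degrees of freedom (1 and $n-2$, respectively); the common factor $\sigma^2$ cancels in the ratio.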

As we mentioned before, we can test $H_0 : \beta_1 = 0$ by finding the p-value associated with the distribution of the F ratio, as

$$
\text{p-value} = \Pr \left( F_{1,n-2} > \text{F Ratio} \right)
$$

This test is closely related to the significance test introduced before for $\beta_1$, as we have that

$$
\text{F Ratio} = \frac{\text{SSM}}{\text{SSR}/(n-2)} = \frac{(n-1)s_x^2 \hat \beta_1^2}{s_R^2} = \frac{\hat \beta_1^2}{\displaystyle \frac{s_R^2}{(n-1)s_x^2}} = T_{\beta_1}^2 | (\beta_1 = 0)
$$

that is, the value of the F ratio is the square of the value of the test statistic based on the distribution of $\hat \beta_1$.
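
A short numerical check of this relationship is given below. The sketch uses hypothetical data (assumed for illustration) and computes the F ratio from the sums of squares and the test statistic for the slope under $\beta_1 = 0$; it does not compute p-values, as the Python standard library does not provide the $F$ or Student t distributions.

```python
import math

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
ss_x = sum((xi - xbar) ** 2 for xi in x)    # (n - 1) * s_x^2
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
b0 = ybar - b1 * xbar

ssm = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2_R = ssr / (n - 2)                        # residual variance

f_ratio = ssm / s2_R
t_stat = b1 / math.sqrt(s2_R / ss_x)        # T statistic under beta_1 = 0

print(f_ratio, t_stat ** 2)
```

As both tests use the same information, they lead to the same p-value: the probability that an $F_{1,n-2}$ variable exceeds the F ratio equals the two-sided tail probability of a $t_{n-2}$ variable beyond the absolute value of the test statistic.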
