## Linear regression

A linear approach that models the relationship between a dependent variable (Y) and one or more independent variables (X's) as a linear combination of the variables and constant coefficients.

#### Questions
1. **Prediction** question: How accurately can I predict the price of a house, given the values of all the variables?
2. **Inferential** question: How accurately can we estimate the effect of each of these variables on the house price?

#### Simple linear regression

It is an approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y.

* Model equation:
$ Y = \beta_{0} + \beta_{1} X $
    
Where $\beta_{0}$ is the intercept (the predicted value of Y when X = 0) and $\beta_{1}$ is the slope (the average change in Y for a one-unit increase in X; it is related to, but not the same as, the correlation coefficient). Together they are known as the **model coefficients** or **parameters**. They are estimated with the **least squares** method, which minimizes the sum of squared differences between the predicted values and the actual observed ones.

For the dataset House Price:
* Y will represent the variable price
* X will represent the variable room_num

Substituting the variables into the equation, we get: $Y = \beta_{0} + \beta_{1} X$ or $\mathrm{price} = \beta_{0} + \beta_{1} \cdot \mathrm{room\_num}$

* The goal is to obtain coefficient estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ such that the linear model fits the data well
* The number of rows in our dataset is 506
* The data points are therefore the pairs $(x_{1},y_{1}), (x_{2},y_{2}),...,(x_{506},y_{506})$
* Let's calculate the predicted value $\hat{y}$ for each observation (the hat denotes a predicted value) as:
$$\hat{y}_{1} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{1}$$ 
$$\hat{y}_{2} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{2}$$ 
$$\vdots$$
$$\hat{y}_{506} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{506}$$ 



The difference between the $i$-th observed response value and the $i$-th response value predicted by our linear model is known as the **residual** (the vertical distance from the actual value to the model line)
$$e_{i} = y_{i} - \hat{y}_{i}$$

![image-2.png](attachment:image-2.png)

Residuals can be positive or negative (points falling above or below the regression line), so if we summed them directly they would cancel each other out. We therefore define a new quantity, the **residual sum of squares**.

**Residual sum of squares (RSS)**

$RSS = e_{1}^{2} + e_{2}^{2} + ... + e_{n}^{2}$ 

since $e_{i} = y_{i} - \hat{y}_{i}$

and $ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i} $

you can substitute $\hat{y}_{i}$ in the residual formula and square it:

$e_{i}^{2} = ( y_{i} - (\hat{\beta}_{0} + \hat{\beta}_{1}x_{i}) )^{2}$

and get:

$e_{i}^{2} = ( y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i} )^{2}$

So the residual sum of squares is the sum of the squared residuals over all points, each given by the formula above:

$$ RSS = ( y_{1} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{1} )^{2} + ( y_{2} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{2} )^{2} + ... + ( y_{n} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{n} )^{2} $$
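
As a minimal sketch of the residual and RSS definitions above, the following computes both for a small made-up dataset and a hypothetical pair of coefficient values. Neither the data nor the coefficients come from the House Price dataset; they are assumptions chosen only for illustration.

```python
# Toy data and hypothetical coefficients, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0_hat, b1_hat = 0.0, 2.0

# Predicted values: y_hat_i = b0_hat + b1_hat * x_i
y_hat = [b0_hat + b1_hat * xi for xi in x]

# Residuals: e_i = y_i - y_hat_i (each can be positive or negative)
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

# Residual sum of squares: RSS = e_1^2 + e_2^2 + ... + e_n^2
rss = sum(e ** 2 for e in residuals)

print("residuals:", [round(e, 2) for e in residuals])
print("sum of residuals:", round(sum(residuals), 4))
print("RSS:", round(rss, 4))
```

The plain sum of the residuals is small because positive and negative values offset each other, while the RSS accumulates every deviation regardless of sign.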


The **least squares** approach chooses the estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ that minimize the RSS.

Using calculus (setting the partial derivatives of the RSS with respect to $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ to zero), the minimizers are:

![image.png](attachment:image.png)
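
For reference, the standard closed-form solution for simple linear regression is $\hat{\beta}_{1} = \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i}(x_{i} - \bar{x})^{2}}$ and $\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means. A minimal sketch of these formulas follows; the function name `fit_simple_ols` and the toy data are assumptions for illustration, not part of the House Price notebook.

```python
def fit_simple_ols(x, y):
    """Return (b0_hat, b1_hat) minimizing RSS for the model y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1_hat = s_xy / s_xx
    b0_hat = y_bar - b1_hat * x_bar
    return b0_hat, b1_hat


# Toy data (hypothetical, not the House Price dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0_hat, b1_hat = fit_simple_ols(x, y)
print("intercept:", round(b0_hat, 4))
print("slope:", round(b1_hat, 4))
```

For this toy data the slope works out to 0.6 and the intercept to 2.2. On the real dataset the same function could be called with the `room_num` and `price` columns converted to lists.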

#### Assessing accuracy of predicted coefficients

We assume that the true relationship between X and Y variables takes the form of $Y = f(X) + \epsilon$ for some unknown function $f$, where $\epsilon$ is a mean-zero random error term 

If $f$ is to be approximated by a linear function, then we can write the relationship as: 
$$ Y = \beta_{0} + \beta_{1}X + \epsilon$$

Where $\beta_{0}$ is **intercept**; $\beta_{1}$ is **slope**; $\epsilon$ is an **error** term

If we could fit the model to every point in the population, we would get the **population** regression line;
if we fit it to a random sample of the data, we get a **sample** (least squares) regression line, which is an estimate of the population line.

**Standard error of coefficients**

The **standard error** of an estimate measures how far, on average, the estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are expected to be from the population coefficients $\beta_{0}$ and $\beta_{1}$.

The standard error is used to build a **confidence interval**, usually a 95% confidence interval: a range that, with 95% confidence, contains the true value of $\beta_{0}$ or $\beta_{1}$.

![image.png](attachment:image.png)
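
A minimal sketch of how the standard errors and a 95% confidence interval can be computed, assuming the textbook formulas for simple linear regression: $SE(\hat{\beta}_{1})^{2} = \frac{\sigma^{2}}{\sum_{i}(x_{i}-\bar{x})^{2}}$ and $SE(\hat{\beta}_{0})^{2} = \sigma^{2}\left[\frac{1}{n} + \frac{\bar{x}^{2}}{\sum_{i}(x_{i}-\bar{x})^{2}}\right]$, with $\sigma^{2}$ estimated as $RSS/(n-2)$. The toy data and the hard-coded critical value `T_CRIT` are assumptions for illustration; for large samples the multiplier is close to 2.

```python
import math

# Toy data (hypothetical, not the House Price dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least squares fit
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0_hat = y_bar - b1_hat * x_bar

# Estimate the error variance from the residuals (n - 2 degrees of freedom)
rss = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - 2)

# Standard errors of the two coefficient estimates
se_b1 = math.sqrt(sigma2_hat / s_xx)
se_b0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))

# Assumption: 97.5% quantile of the t-distribution with n - 2 = 3 degrees
# of freedom, taken from a standard t-table. It is only valid for this n.
T_CRIT = 3.182

ci_b1 = (b1_hat - T_CRIT * se_b1, b1_hat + T_CRIT * se_b1)
ci_b0 = (b0_hat - T_CRIT * se_b0, b0_hat + T_CRIT * se_b0)

print("SE(b1):", round(se_b1, 4), "95% CI:", [round(v, 3) for v in ci_b1])
print("SE(b0):", round(se_b0, 4), "95% CI:", [round(v, 3) for v in ci_b0])
```

With only five points the interval for the slope is wide (roughly -0.3 to 1.5) and includes 0, even though the point estimate is 0.6.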


Standard errors are also used in **hypothesis testing** to evaluate whether there is a relationship between two variables.

For example:

* If $\beta_{1}$ is 0 in the regression equation, it means there is no relationship between the variables, and we say that the **null hypothesis** (H0) is true; if there is some relationship between the variables we say that the **alternative hypothesis** (HA) is true:
    * H0: $\beta_{1} = 0$
    * HA: $\beta_{1} \not= 0$
    
If there is no relationship, we cannot use X to predict Y

You also do not want 0 to lie inside the confidence interval of a coefficient. **If the interval includes 0, that means the actual coefficient value *can be 0 as well*, so the predictor variable may have no relationship with the response variable, or its influence on the response variable is not statistically significant**

When the 95% confidence interval does not include 0, the coefficient is significant at the 5% level (p < 0.05) and the null hypothesis is rejected in favor of the alternative hypothesis. If the confidence interval crosses 0, we cannot conclude that there is a significant relationship, because the true effect may be 0.

#### Hypothesis testing

* To reject the null hypothesis (H0), we calculate a *t-statistic*, which is given as $$t = \frac{\hat{\beta}_{1} - 0}{SE(\hat{\beta}_{1})}$$


* From this, we compute the probability of observing any value equal to or larger than $|t|$ in absolute value, assuming $\beta_{1} = 0$. The t-value measures how many standard errors $\hat{\beta}_{1}$ is away from 0. You typically want $|t|$ to be large. It is called a t-value because, under the null hypothesis, it follows a t-distribution with $n - 2$ degrees of freedom (which is close to the normal distribution for large $n$).


* We call this probability the *p-value*


* A small p-value means there is significant association/relationship between the predictor variable and the response (typically less than 0.05 or 0.01)
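
The steps above can be sketched with the standard library only. The helper names `t_pdf` and `two_sided_p_value`, the numerical integration of the t-density with Simpson's rule, and the toy data are all assumptions made to keep the example self-contained; in practice a statistics library such as SciPy or statsmodels would normally supply the p-value.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), integrating the density on [-|t|, |t|] (steps must be even)."""
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = 2 * a / steps
    total = t_pdf(-a, df) + t_pdf(a, df)
    for k in range(1, steps):
        # Simpson's rule weights: 4 for odd interior points, 2 for even ones
        total += (4 if k % 2 else 2) * t_pdf(-a + k * h, df)
    return 1.0 - total * h / 3


# Toy data (hypothetical, not the House Price dataset)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least squares fit and standard error of the slope
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0_hat = y_bar - b1_hat * x_bar
rss = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(rss / (n - 2) / s_xx)

# t-statistic for H0: beta_1 = 0, and its two-sided p-value with n - 2 df
t_stat = (b1_hat - 0) / se_b1
p_value = two_sided_p_value(t_stat, df=n - 2)

print("t:", round(t_stat, 3), "p-value:", round(p_value, 3))
```

Here the t-value is about 2.12 and the p-value about 0.12, so with five points we would not reject H0 at the 0.05 level even though the estimated slope is 0.6.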

#### Residual standard error and R squared

The quality of a linear regression fit is typically assessed using two related quantities:

* **Residual standard error (RSE)**

RSE is the average amount that the response will deviate from the true regression line

RSE can also be considered as a measure of lack of fit of the model to the data

The RSE is computed with $n - 2$ degrees of freedom: $RSE = \sqrt{\frac{RSS}{n - 2}}$

The RSE tells you, on average, how far the model's predictions are from the actual observed values
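
A minimal sketch of the RSE calculation, using $RSE = \sqrt{RSS/(n-2)}$ where RSS is the sum of squared differences between observed and predicted values. The function name `residual_standard_error` and the toy values are assumptions for illustration.

```python
import math


def residual_standard_error(y, y_hat):
    """RSE = sqrt(RSS / (n - 2)) for a model with one predictor and an intercept."""
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return math.sqrt(rss / (len(y) - 2))


# Toy observed values and the predictions of the least squares line
# y_hat = 2.2 + 0.6 * x for x = 1..5 (hypothetical data).
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

rse = residual_standard_error(y, y_hat)
print("RSE:", round(rse, 4))
```

The result is in the same units as Y, so here the predictions are off by roughly 0.89 units of Y on average.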


* **$R^{2}$ statistic**

Since the RSE is measured in the units of the response variable (Y), it may sometimes be unclear what constitutes a good RSE.

$R^{2}$ is an alternative to evaluate the model

$R^{2}$ is the proportion of variance explained by our model

For a least squares fit with an intercept, $R^{2}$ always takes on a value between 0 and 1 since it is a proportion

$R^{2}$ is independent of the scale of Y

$R^{2}$ is given as $$ R^{2} = \frac{TSS - RSS} {TSS} = 1 - \frac{RSS} {TSS}$$

Where TSS is the total sum of squares = sum((actual value of Y minus mean Y)^2)

RSS is the residual sum of squares = sum((actual value of Y minus predicted value of Y)^2)

TSS measures the total amount of variability in the Y variable

RSS is the amount of variability that is left unexplained by the linear model after regression

Therefore, $R^{2}$ is the proportion of explained variance *vs.* the total variance of the response
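
A minimal sketch of the $R^{2}$ calculation, taking TSS as the sum of squared deviations of the observed Y from its mean and RSS as the sum of squared differences between observed and predicted Y. The function name `r_squared` and the toy values are assumptions for illustration.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variability in Y
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # unexplained variability
    return 1 - rss / tss


# Toy observed values and the predictions of the least squares line
# y_hat = 2.2 + 0.6 * x for x = 1..5 (hypothetical data).
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
print("R^2:", round(r2, 4))
```

For this toy data TSS is 6 and RSS is 2.4, so the model explains 60% of the variability in Y.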

$R^{2}$  close to 1 indicates that large proportion of variability in the response variable can be explained by the regression model

$R^{2}$ close to 0 indicates that the regression does not explain much of the variability in the response. This can happen because a linear model is not the right model for the relationship between X and Y, because the error variance is high, or both.

Adjusted $R^{2}$ is a value adjusted for the number of predictor variables: plain $R^{2}$ never decreases when you add predictors, even irrelevant ones, so adjusted $R^{2}$ applies a penalty for each additional predictor.
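
A minimal sketch of the usual adjusted $R^{2}$ formula, $1 - (1 - R^{2})\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. The function name `adjusted_r_squared` and the example numbers are assumptions for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same hypothetical R^2 and sample size, with one vs. two predictors
adj_one = adjusted_r_squared(0.6, n=5, p=1)
adj_two = adjusted_r_squared(0.6, n=5, p=2)

print("1 predictor:", round(adj_one, 4))
print("2 predictors:", round(adj_two, 4))
```

With the same $R^{2}$, the model that uses more predictors gets the lower adjusted value, so an extra predictor only pays off if it raises $R^{2}$ enough to offset the penalty.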

The acceptable value of $R^{2}$  will largely depend on the type of problem you have