# Intro to Linear Regression

When our data is linear, we found that we can describe the correlation between two variables using Pearson's r and a line drawn through our data. This line is called the *regression line* or the *line of best fit*, and it helps us describe our data and make predictions.

### Terminology

Once a line of best fit has been determined (by finding the slope and y-intercept), we can make estimates for unseen or future values. Our dataset, $\mathcal{D}$, consists of all the observed pairs $(x, y)$.

With the slope, $b$, and y-intercept, $a$, determined, we can make predictions $\hat{y}=bx+a$. For a known data point $(x, y)$, the difference $y-\hat{y}$ between the observed and predicted values is called the *residual*.
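As a small illustration of predictions and residuals, the sketch below uses made-up coefficients and data points; the values of `a`, `b`, and `data`, and the helper name `predict`, are assumptions chosen for this example and do not come from the course material.

```python
# Hypothetical coefficients and data, chosen only for illustration.
a = 1.0  # y-intercept
b = 2.0  # slope

data = [(0.0, 1.5), (1.0, 2.5), (2.0, 5.5)]  # observed (x, y) pairs


def predict(x, a, b):
    """Return the prediction y_hat = b*x + a."""
    return b * x + a


# Residual = observed y minus predicted y_hat, one per data point.
residuals = [y - predict(x, a, b) for x, y in data]
print(residuals)
```

A positive residual means the point lies above the line, and a negative residual means it lies below.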

When conducting regression, $a$ and $b$ are known as *regression coefficients*. It is important to note that software packages often present the coefficients in $\hat{y}=a+bx$ order.

### Computing the Line of Best Fit
To compute the line of best fit, we imagine an optimization problem that attempts to minimize the difference between our model's outputs (our guesses), $\hat{y}$, and the actual values from our data, $y$.

##### Minimize Sum of Residuals
One way we could think about computing the line of best fit is to minimize the sum of residuals, $$\underset{a,b}{\operatorname{argmin}}\sum\limits_{(x,y)\in\mathcal{D}}(y-\hat{y})$$

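However, we will find that we can have poor lines of fit in this case. Positive and negative residuals cancel, so every line for which they balance has a sum of zero, no matter how poorly it follows the data. In fact, any line that passes through the point $(\bar{x}, \bar{y})$ has this property, so this criterion cannot tell a good line from a bad one.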
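The cancellation can be checked numerically. The following sketch uses a made-up dataset that lies exactly on $y = 2x$ and evaluates three lines with very different slopes that all pass through $(\bar{x}, \bar{y})$; the data and the helper names `sum_of_residuals` and `intercept_through_mean` are assumptions for illustration.

```python
# Made-up data lying exactly on y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

x_bar = sum(x for x, _ in data) / len(data)
y_bar = sum(y for _, y in data) / len(data)


def sum_of_residuals(a, b):
    """Sum of signed residuals y - (a + b*x) over the dataset."""
    return sum(y - (a + b * x) for x, y in data)


def intercept_through_mean(b):
    """Intercept that makes a line with slope b pass through (x_bar, y_bar)."""
    return y_bar - b * x_bar


# Three very different slopes, each forced through the mean point.
sums = {b: sum_of_residuals(intercept_through_mean(b), b) for b in (2.0, 0.0, -5.0)}
print(sums)
```

Only the slope of 2 follows the data, yet all three lines receive the same score under this criterion.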

For this reason, we usually employ an alternative method that either sums the squares or absolute values of residuals.

##### Minimize Sum of Absolute Residuals
To ensure that each term in the sum is non-negative, we can consider summing the absolute values of the residuals, $$\underset{a,b}{\operatorname{argmin}}\sum\limits_{(x,y)\in\mathcal{D}}\lvert y-\hat{y}\rvert$$
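To see that absolute values remove the cancellation, the sketch below scores two lines on a small made-up dataset. The data and the function name `sum_abs_residuals` are assumptions for illustration; both lines pass through the mean point, so their signed residuals would sum to zero.

```python
# Made-up data lying exactly on y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def sum_abs_residuals(a, b):
    """Sum of |y - (a + b*x)| over the dataset."""
    return sum(abs(y - (a + b * x)) for x, y in data)


good = sum_abs_residuals(0.0, 2.0)  # the line y = 2x, which fits exactly
poor = sum_abs_residuals(4.0, 0.0)  # the flat line y = 4
print(good, poor)
```

The flat line is now penalized, because residuals of opposite sign can no longer offset each other.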

As we learned in calculus, we can take the derivative of an expression to identify its minimum. For the absolute value, the derivative is the sign function, which is undefined at $x=0$. $$\frac{d}{dx}\lvert x \rvert = \operatorname{sign}(x), \quad x \neq 0$$

In machine learning, a loss built from absolute residuals is known as the L1 loss, or least absolute deviations. The same absolute value term applied to the coefficients is known as L1, or Lasso, regularization, and it results in sparse solutions. While this is an option, we will look to the sum of squared residuals for this course.

##### Minimize Sum of Squared Residuals
Another way to overcome the cancellation of positive and negative values is to square the residuals. Thus, our objective is to minimize the sum of the squared residuals. $$\underset{a,b}{\operatorname{argmin}}\sum\limits_{(x,y)\in\mathcal{D}}(y-\hat{y})^2$$

Using calculus, we can solve for the slope to find

$$b = \underset{b}{\operatorname{argmin}}\sum\limits_{(x,y)\in\mathcal{D}}(y-\hat{y})^2 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$$
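The closed-form slope can be tried on a small example. This is a minimal sketch on made-up data; the intercept formula $a = \bar{y} - b\bar{x}$ is the standard least-squares result and is not derived in the text above, and the function name `ssr` is an assumption for illustration.

```python
# Made-up data, chosen only for illustration.
data = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Slope from the closed-form expression above.
b = (sum((x - x_bar) * (y - y_bar) for x, y in data)
     / sum((x - x_bar) ** 2 for x, _ in data))

# Intercept: standard least-squares result (assumption, not derived above).
a = y_bar - b * x_bar


def ssr(a, b):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in data)


# Nudging either coefficient away from the solution should increase the SSR.
print(ssr(a, b), ssr(a + 0.1, b), ssr(a, b + 0.1))
```

Comparing the three printed values gives a quick sanity check that the closed-form coefficients sit at a minimum, at least relative to these nearby lines.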

Now, we can look back to our definitions of covariance, standard deviation, $S$, and Pearson's r, $r$. $$S_x = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}}$$ $$r=\frac{\text{Cov}(x, y)}{S_x\cdot S_y}$$

We can now write $b$ in these terms. $$b=r\frac{S_y}{S_x}$$
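The substitution can be checked step by step. The derivation below assumes the sample definitions $\text{Cov}(x,y)=\frac{1}{n-1}\sum(x_i-\bar{x})(y_i-\bar{y})$ and $S_x^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2$; the same result follows if both use $n$ instead of $n-1$, since the factor cancels.

$$\begin{aligned}
b &= \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}
   = \frac{(n-1)\,\text{Cov}(x,y)}{(n-1)\,S_x^2}
   = \frac{\text{Cov}(x,y)}{S_x^2} \\
  &= \frac{r\,S_x S_y}{S_x^2}
   = r\,\frac{S_y}{S_x}
\end{aligned}$$

The second line uses $\text{Cov}(x,y) = r\,S_x S_y$, which is the definition of $r$ rearranged.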

Therefore, we can now represent our regression line as $$\hat{y} = a + bx = a + r\frac{S_y}{S_x}x$$
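As a numerical check, the sketch below computes the slope two ways on made-up data: directly from the sums of deviations, and from $r$ and the sample standard deviations. It assumes the $n-1$ (sample) convention used by Python's `statistics.stdev`, and the intercept $a = \bar{y} - b\bar{x}$ is the standard least-squares result rather than something derived above; all variable names are hypothetical.

```python
import math
from statistics import mean, stdev

# Made-up data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

# Sample covariance and sample standard deviations (n - 1 denominator).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s_x, s_y = stdev(xs), stdev(ys)

# Pearson's r from covariance and standard deviations.
r = cov_xy / (s_x * s_y)

# Slope computed directly from the sums of deviations...
b_from_sums = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
               / sum((x - x_bar) ** 2 for x in xs))
# ...and from r and the ratio of standard deviations.
b_from_r = r * s_y / s_x

# Intercept: standard least-squares result (assumption, not derived above).
a = y_bar - b_from_r * x_bar


def predict(x):
    """Prediction from the fitted line y_hat = a + b*x."""
    return a + b_from_r * x


print(math.isclose(b_from_sums, b_from_r))
```

With this choice of intercept, the fitted line passes through $(\bar{x}, \bar{y})$, so `predict(x_bar)` should agree with `y_bar` up to rounding.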