# Regression

### Machine Learning Introduction

**Machine Learning** is frequently split into **supervised** and **unsupervised** learning. Regression, which you will be learning about in this lesson (and its extensions in later lessons), is an example of supervised machine learning.

In supervised machine learning, you are interested in predicting a label for your data. Commonly, you might want to predict fraud, customers that will buy a product, or home values in an area.

In unsupervised machine learning, you are interested in finding structure in data that isn't already labeled, for example by clustering similar observations together.


### Scatter Plots

Scatter plots are a common visual for comparing two quantitative variables. A common summary statistic that relates to a scatter plot is the **correlation coefficient**, commonly denoted by **r**.

Though there are a [few different ways](http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/) to measure correlation between two variables, the most common way is with [Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). Pearson's correlation coefficient provides the:

1. Strength
1. Direction

of a **linear relationship**. [Spearman's Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) does not measure linear relationships specifically. It is computed on the ranks of the data and measures how well the relationship is described by a monotonic function, so it can be more appropriate when the association between two variables is not linear.
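As a rough illustration of the difference, the sketch below compares the two coefficients on made-up data where `y` grows with `x` but not along a line. The helper names `pearson_r` and `spearman_r` are hypothetical, and the rank computation assumes there are no tied values (a library routine such as `scipy.stats.spearmanr` handles ties).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation, read off NumPy's 2x2 correlation matrix."""
    return np.corrcoef(x, y)[0, 1]

def spearman_r(x, y):
    """Spearman correlation as Pearson on ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))  # rank of each x value, starting at 0
    rank_y = np.argsort(np.argsort(y))  # rank of each y value, starting at 0
    return pearson_r(rank_x, rank_y)

# Made-up data: y always increases with x, but the relationship is cubic.
x = np.arange(1, 11, dtype=float)
y = x ** 3

r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the ranks of `x` and `y` agree exactly here, Spearman's coefficient reaches 1, while Pearson's coefficient stays below 1 since the points do not fall on a straight line.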

##### Correlation Coefficients

Correlation coefficients provide a measure of the strength and direction of a linear relationship.

We can tell the direction based on whether the correlation is positive or negative.

A general rule of thumb for judging the strength:

$$\begin{align}
&\text{Strong}\quad &\text{Moderate} \quad&\text{Weak} \\
&0.7\leq\vert r\vert\leq 1.0 \quad&0.3\leq\vert r\vert\lt 0.7\quad&0.0\leq\vert r\vert\lt 0.3
\end{align}$$
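The rule of thumb above can be sketched as a small helper. The function names are hypothetical, and the cutoffs of 0.3 and 0.7 are just the conventions stated in this section, not universal thresholds.

```python
def correlation_strength(r):
    """Label |r| using the rule-of-thumb cutoffs of 0.3 and 0.7."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

def correlation_direction(r):
    """Direction comes from the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

for r in (-0.85, 0.45, 0.1):
    print(r, correlation_strength(r), correlation_direction(r))
```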

##### Calculation of the Correlation Coefficient

$$r=\frac{\text{Cov}(x, y)}{s_x\cdot s_y} = \frac{\sum\limits_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}}$$
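To make the formula concrete, here is a minimal sketch that computes the coefficient term by term with NumPy and checks it against the covariance form and against `np.corrcoef`. The data values and the function name `pearson_from_formula` are made up for illustration.

```python
import numpy as np

def pearson_from_formula(x, y):
    """Sum of cross-deviations divided by the root sums of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = pearson_from_formula(x, y)

# Covariance form: sample covariance over the product of sample standard deviations.
r_cov = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"{r:.4f} {r_cov:.4f} {r_numpy:.4f}")
```

The $\frac{1}{n-1}$ factors in the covariance and in the two standard deviations cancel, which is why both forms give the same number.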

### Simple Linear Regression

In simple linear regression, we compare two quantitative variables to one another.

The **response** (or dependent) variable is what you want to predict, while the **explanatory** (or independent) variable is the variable you use to predict the response. A common way to visualize the relationship between two variables in linear regression is using a scatter plot.

##### Defining a Line

A line is commonly identified by an **intercept** and a **slope**.

The **intercept** is defined as the **predicted value of the response when the x-variable is zero**.

The **slope** is identified as the **predicted change in the response variable for every one unit increase in the x-variable**.

We notate the line in linear regression as:
$$\hat{y} = b_0 + b_1x_1$$
where
* $\hat{y}$ is the predicted value of the response from the line, 
* $b_0$ is the intercept (determined from our sample),
* $b_1$ is the slope (determined from our sample),
* $x_1$ is the explanatory variable, and
* $y$ is the actual response for a data point in our data set (i.e., not the prediction from our line).

The true (population) parameters, which the sample values $b_0$ and $b_1$ estimate, are denoted
* $\beta_0$ for the intercept and
* $\beta_1$ for the slope.
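The interpretations of the intercept and slope can be checked directly with a small sketch. The values of `b0` and `b1` below are hypothetical, chosen only to illustrate the definitions.

```python
def predict(x, b0, b1):
    """Predicted response y_hat = b0 + b1 * x for a single explanatory value."""
    return b0 + b1 * x

# Hypothetical sample estimates, for illustration only.
b0, b1 = 2.0, 0.5

# Intercept: the prediction when the x-variable is zero.
print(predict(0, b0, b1))

# Slope: the change in the prediction for a one unit increase in x.
print(predict(4, b0, b1) - predict(3, b0, b1))
```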

##### Fitting a Regression Line

The main algorithm used to find the best fit line is called the **least-squares** algorithm, which finds the line that minimizes $\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2$.

There are many other ways to choose a "best" line, but this algorithm tends to do a good job in many scenarios.
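The least-squares criterion can be sketched as a function that scores any candidate line by its sum of squared errors. The data and the two candidate lines below are made up. Under this criterion, the line with the smaller score is the better fit.

```python
import numpy as np

def sum_squared_errors(x, y, b0, b1):
    """Sum over all points of (actual - predicted)^2 for the line b0 + b1 * x."""
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Two hypothetical candidate lines.
sse_a = sum_squared_errors(x, y, b0=2.2, b1=0.6)
sse_b = sum_squared_errors(x, y, b0=1.0, b1=1.0)

print(f"candidate A: {sse_a:.2f}")
print(f"candidate B: {sse_b:.2f}")
```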

In order to compute the slope and intercept we would need to compute the following.

$$\begin{align}
\bar{x}&=\frac{1}{n}\sum\limits_{i=1}^{n} x_i \\
\bar{y}&=\frac{1}{n}\sum\limits_{i=1}^{n} y_i \\
s_y&=\sqrt{\frac{1}{n-1}\sum\limits_{i=1}^{n}{(y_i-\bar{y})^2}} \\
s_x&=\sqrt{\frac{1}{n-1}\sum\limits_{i=1}^{n}{(x_i-\bar{x})^2}} \\
r&=\frac{\sum\limits_{i=1}^{n}{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum\limits_{i=1}^{n}{(x_i-\bar{x})^2}}\sqrt{\sum\limits_{i=1}^{n}{(y_i-\bar{y})^2}}}
\end{align}$$

Setting the partial derivatives of the sum of squared errors with respect to $b_0$ and $b_1$ equal to zero and solving, we find that
$$\begin{align}b_1&=r\frac{s_y}{s_x}\\ b_0&=\bar{y}-b_1\bar{x}\end{align}$$
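As a sketch of the closed-form solution, the code below computes the slope as $r\,s_y/s_x$ and the intercept as $\bar{y}-b_1\bar{x}$, then compares the result with NumPy's `np.polyfit`. The data and the function name `fit_simple_linear_regression` are made up for illustration.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line using the summary statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_x = x.std(ddof=1)  # sample standard deviation of x
    s_y = y.std(ddof=1)  # sample standard deviation of y
    r = np.corrcoef(x, y)[0, 1]
    b1 = r * s_y / s_x       # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

b0, b1 = fit_simple_linear_regression(x, y)

# np.polyfit returns coefficients from the highest degree down: (slope, intercept).
b1_check, b0_check = np.polyfit(x, y, deg=1)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Note that the fitted line always passes through the point $(\bar{x}, \bar{y})$, which follows from the intercept formula.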

```python
import statsmodels.api as sm
```