# Regression

## Machine learning

### 1. Supervised learning
Uses input data to predict a label for that data:
* Using credit card transaction data to predict fraudulent transactions;
* Using customer financial data to predict the chance of a default on a loan;
* Using neighborhood characteristics to predict home prices.

**Linear and logistic regression fall into this category.**

### 2. Unsupervised learning
Clusters data based on common characteristics (these data have no labels, unlike in supervised ML techniques):
* Grouping similar customer segments;
* Grouping documents that cover similar topics.

---

## Linear regression introduction

### a) Simple linear regression
The simplest form of regression: a linear comparison of only two quantitative variables.
* *Prices* x **sales**;
* *Temperature* x **humidity**;
* *Height* x **weight**;
* *Hours studying* x **test grade**.

A common way to visualize these relationships is with a scatter plot. The variable on the $Y$ axis is called the *response* or *dependent* variable, while the variable on the $X$ axis is called the *explanatory* or *independent* variable.
* *Response variable* ($Y$): the variable we're interested in predicting;
* *Explanatory variable* ($X$): the variable used to predict the response.

The **scatter plot** can be used to visualize both the strength and the direction of the relationship between two variables.
* Positive relationship: as one variable increases, the other tends to increase.
* Negative relationship: as one variable increases, the other tends to decrease.
* The more the points spread out from one another, the weaker the relationship.
* Strength is judged by how tightly the points cluster, not by the slope associated with the relationship.
* Generally, we describe strength as weak, moderate or strong, and direction as positive or negative.

The [Pearson correlation coefficient ($r$)](https://pt.wikipedia.org/wiki/Coeficiente_de_correla%C3%A7%C3%A3o_de_Pearson) measures the strength and direction of a linear relationship. It is always between -1 and 1; the closer it is to 1 or -1, the stronger the relationship. Negative values indicate a negative relationship and positive values a positive one.

**Ps.:** The [Spearman coefficient](https://pt.wikipedia.org/wiki/Coeficiente_de_correla%C3%A7%C3%A3o_de_postos_de_Spearman) is more appropriate in specific cases with two variables (see the link for more options).

---

## Correlation coefficients
There are some rules of thumb, but this is a highly field-dependent measure.
* **Strong relationship:** 0.7 <= |r| <= 1;
* **Moderate relationship:** 0.3 <= |r| < 0.7;
* **Weak relationship:** 0.0 <= |r| < 0.3.

**Ps.:** A negative correlation coefficient doesn't indicate a weak relationship; strength depends only on |r|.

**Ps2.:** In Excel the function is `CORREL(col1, col2)` ([example](https://docs.google.com/spreadsheets/d/1bZs0QjX0d_TKeLcbBZKwLq9mCE5Gpgq0XrliVeq_FZg/edit#gid=0))
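
The same calculation can be sketched in Python. This is a minimal illustration, assuming NumPy is available; the helper names `pearson_r` and `strength` and the sample data are made up for the example, and the thresholds are just the rules of thumb listed above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
    return float(np.corrcoef(x, y)[0, 1])

def strength(r):
    """Label |r| using the rule-of-thumb thresholds from this section."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical data: hours studying vs. test grade
hours = [1, 2, 3, 4, 5, 6]
grades = [52, 60, 61, 70, 78, 85]

r = pearson_r(hours, grades)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({strength(r)}, {direction})")
```

Note that `strength` looks only at the absolute value, so a coefficient of -0.8 is labelled strong just like 0.8.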

---

## What defines a line?
We define it by two values, an intercept and a slope.
* **Intercept** tells us the predicted value of the response when the explanatory variable is zero. The population parameter is written $\beta_0$ (pronounced "beta naught") and the sample statistic is written $b_0$.
* **Slope** tells us the predicted change in the response for each one-unit increase in the explanatory variable ($X$). The population parameter is written $\beta_1$ and the sample statistic is written $b_1$.

Once we've fit a line to these points, we describe it with this equation:

$\hat{y} = b_0 + b_1 x$

* $b_0$ is the value of $\hat{y}$ where $x$ is equal to zero;
* $b_1$ is the change in $\hat{y}$ along the y-axis for each one-unit increase in $x$;
* $\hat{y}$ denotes the values that we get from the fitted line (predicted result);
* $y$ denotes the actual data points (actual result).

Data points, which generally do not fall on the line, are written $(x_1, y_1)$, while the corresponding points on the line are written $(x_1, \hat{y}_1)$.
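
As a small sketch of the notation, the snippet below evaluates a line at a few points and compares the result with observed values. The function name `predict`, the coefficients and the data are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def predict(x, b0, b1):
    """Return y-hat = b0 + b1 * x for each x."""
    return b0 + b1 * np.asarray(x, dtype=float)

# Hypothetical fitted line and data, chosen only for illustration
b0, b1 = 2.0, 0.5
x = np.array([0.0, 2.0, 4.0])
y = np.array([2.5, 2.5, 4.5])    # actual data points

y_hat = predict(x, b0, b1)       # points on the line: 2.0, 3.0, 4.0
residuals = y - y_hat            # vertical gaps: 0.5, -0.5, 0.5

for xi, yi, yhi in zip(x, y, y_hat):
    print(f"x={xi}: actual y={yi}, predicted y-hat={yhi}")
```

At `x = 0` the prediction equals `b0`, and moving one unit along the x-axis changes the prediction by `b1`.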

---

## Fitting a regression line
In the bivariate case, we're interested in finding the line that best allows us to predict the response variable ($y$) using the explanatory variable ($x$).

The main algorithm used to find the best line is the **least squares regression algorithm**, which chooses the line by minimizing the sum of squared vertical distances between the fitted line and each of the points.

The difference between a point and the line is $y_1 - \hat{y}_1$. For each data point in the data set, take the distance between the actual and predicted values, square it, and then sum them all together:

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

If our line gives a smaller value than any other line, then this is the line we want to use.
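
To make the criterion concrete, here is a rough sketch that computes the sum of squared vertical distances for a few candidate lines and for the least squares line. The data and the candidate lines are made up, and `np.polyfit` is used only as a convenient way to obtain the least squares fit.

```python
import numpy as np

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    y_hat = b0 + b1 * x
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with deg=1 returns the least squares [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
best = sum_squared_errors(x, y, intercept, slope)
print(f"least squares line: b0={intercept:.3f}, b1={slope:.3f}, SSE={best:.4f}")

# Arbitrary alternative lines as (b0, b1) pairs
candidates = [(0.0, 2.0), (0.5, 1.8), (-0.5, 2.2), (1.0, 1.5)]
for c0, c1 in candidates:
    sse = sum_squared_errors(x, y, c0, c1)
    print(f"candidate line:     b0={c0:.3f}, b1={c1:.3f}, SSE={sse:.4f}")
```

Every candidate produces a larger sum than the least squares line, which is exactly the property used to choose the line.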

* [Another video introducing linear regression analysis](https://www.youtube.com/watch?v=zPG4NjIkCjc)

To calculate the slope and the intercept we need to know:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

$s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i-\bar{y})^2}$

$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i-\bar{x})^2}$

$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

$b_1 = r \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1\bar{x}$
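
These formulas translate almost line by line into NumPy. The sketch below is illustrative only: `fit_line` is a hypothetical helper, the data are made up, and the standard deviations use the sample form with an n - 1 denominator (`ddof=1`). The result is checked against `np.polyfit` as a sanity test.

```python
import numpy as np

def fit_line(x, y):
    """Return (b0, b1) computed from the means, standard deviations and r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_x = x.std(ddof=1)   # sample standard deviation (n - 1 denominator)
    s_y = y.std(ddof=1)
    r = np.sum((x - x_bar) * (y - y_bar)) / (
        np.sqrt(np.sum((x - x_bar) ** 2)) * np.sqrt(np.sum((y - y_bar) ** 2))
    )
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return float(b0), float(b1)

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
slope, intercept = np.polyfit(x, y, deg=1)   # reference least squares fit
print(f"formulas: b0={b0:.4f}, b1={b1:.4f}")
print(f"polyfit:  b0={intercept:.4f}, b1={slope:.4f}")
```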

## Fitting a regression line in Python

In [2]:
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('house_price_area_only.csv')
df.head()
```

| Unnamed: 0 | price | area |
|---|---|---|
| 0 | 598291 | 1188 |
| 1 | 1744259 | 3512 |
| 2 | 571669 | 1134 |
| 3 | 493675 | 1940 |
| 4 | 1101539 | 2208 |


In [3]:
```python
# Add a constant column for the intercept; sm.OLS does not add one automatically
df['intercept'] = 1
df.head()
```

| Unnamed: 0 | price | area | intercept |
|---|---|---|---|
| 0 | 598291 | 1188 | 1 |
| 1 | 1744259 | 3512 | 1 |
| 2 | 571669 | 1134 | 1 |
| 3 | 493675 | 1940 | 1 |
| 4 | 1101539 | 2208 | 1 |


* Only in rare cases is it appropriate to leave out the intercept ([discussion](https://stats.stackexchange.com/questions/7948/when-is-it-ok-to-remove-the-intercept-in-a-linear-regression-model))
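
The effect of the constant column can be seen with a tiny example. This sketch uses NumPy's `np.linalg.lstsq` rather than statsmodels, purely to show what the column of ones does; the data are hypothetical and lie exactly on the line y = 3 + 2x.

```python
import numpy as np

# Hypothetical data lying exactly on y = 3 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.0, 7.0, 9.0, 11.0])

# With a column of ones, least squares can estimate an intercept
X_with = np.column_stack([np.ones_like(x), x])
coef_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)

# Without it, the fitted line is forced through the origin
X_without = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)

print("with intercept column:   ", coef_with)      # [intercept, slope]
print("without intercept column:", coef_without)   # [slope] only
```

With the constant column the fit recovers the intercept and slope of the true line; without it the slope is distorted to compensate for the missing intercept.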

In [4]:
```python
# Pass the response (y) and the explanatory variables (X, including the intercept column) to OLS
lm = sm.OLS(df['price'], df[['intercept', 'area']])
# Fit the model
results = lm.fit()
results.summary()
```

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.678 |
| Model: | OLS | Adj. R-squared: | 0.678 |
| Method: | Least Squares | F-statistic: | 12690.0 |
| Date: | Mon, 14 Jan 2019 | Prob (F-statistic): | 0.0 |
| Time: | 20:34:42 | Log-Likelihood: | -84517.0 |
| No. Observations: | 6028 | AIC: | 169000.0 |
| Df Residuals: | 6026 | BIC: | 169100.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 9587.8878 | 7637.479 | 1.255 | 0.209 | -5384.303 | 2.46e+04 |
| area | 348.4664 | 3.093 | 112.662 | 0.000 | 342.403 | 354.530 |

| | | | |
|---|---|---|---|
| Omnibus: | 368.609 | Durbin-Watson: | 2.007 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 349.279 |
| Skew: | 0.534 | Prob(JB): | 1.43e-76 |
| Kurtosis: | 2.499 | Cond. No. | 4930.0 |
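
The `coef` column of the output above gives the fitted line. As a rough sketch, the two coefficients can be plugged into $\hat{y} = b_0 + b_1 x$ by hand; the values below are copied from the output, and `predict_price` is a hypothetical helper rather than part of statsmodels. With the fitted model in memory, `results.params` should hold the same two numbers.

```python
# Coefficients copied from the coef column of the summary output
b0 = 9587.8878   # intercept
b1 = 348.4664    # area: predicted change in price per additional unit of area

def predict_price(area):
    """Predicted price for a given area using y-hat = b0 + b1 * area."""
    return b0 + b1 * area

# First row of the data set: area 1188, actual price 598291
area, actual_price = 1188, 598291
predicted = predict_price(area)
residual = actual_price - predicted   # actual minus predicted

print(f"predicted price: {predicted:.2f}")
print(f"residual:        {residual:.2f}")
```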
