In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
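# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')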

# Variance, covariance and correlation

**Variance**
$$\sigma^2 = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i-\mu)^2$$
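
As a quick check of the formula above, here is a minimal sketch that computes the population variance by hand and compares it with NumPy. The sample values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()                        # mean of the sample (5.0)
variance = np.mean((x - mu) ** 2)    # average squared deviation, i.e. the 1/n formula

# np.var uses ddof=0 by default, which is the same 1/n (population) formula
assert np.isclose(variance, np.var(x))
print(variance)  # 4.0
```

Note that `pandas` defaults to the sample variance (`ddof=1`), so `pd.Series(x).var()` would give a slightly larger value.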

**Covariance** ($\sigma_{XY}$)  
1. Mean normalization: subtract the mean from each variable
2. Dot product of the centered variables, divided by $n$

$$\sigma_{XY} = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)$$


- Positive: **positively related**
- Negative: **inversely related**
- Equal or close to 0: **no linear relationship**
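
The two steps for covariance (mean normalization, then dot product) can be sketched as follows. The paired values are hypothetical; `np.cov` is called with `ddof=0` so that it matches the $1/n$ formula used in these notes.

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Step 1: mean normalization
x_centered = x - x.mean()
y_centered = y - y.mean()

# Step 2: dot product of the centered vectors, divided by n
cov_xy = np.dot(x_centered, y_centered) / len(x)

# Cross-check against NumPy's covariance matrix (off-diagonal entry)
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])
print(cov_xy)  # 1.2, positive, so x and y are positively related
```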

**Correlation** (standardizing covariance)

- Pearson's correlation coefficient $r$ (the linear correlation coefficient) always lies between $-1$ and $1$
$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$
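
A minimal sketch of the formula for $r$, again on hypothetical data, with `np.corrcoef` used as a cross-check:

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

dx = x - x.mean()
dy = y - y.mean()

# Numerator: sum of products of deviations
# Denominator: square root of the product of the sums of squared deviations
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

assert np.isclose(r, np.corrcoef(x, y)[0, 1])
assert -1.0 <= r <= 1.0
print(round(r, 3))  # about 0.775
```

Because the $1/n$ factors cancel, $r$ is the same whether the population or the sample convention is used for the variances.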

# Statistical learning theory

**Types of data**  
- Data that can be controlled directly (**independent variables** / **features**), e.g. time, age
- Data that cannot be controlled directly (**dependent variables**), e.g. weight as a function of age
- **Controlled variables**: variables held constant so that they do not influence the relationship being studied

A **model** defines the relationship between a dependent variable and one or more independent variables.

**Model parameters**: coefficients of the model equation used to estimate the output.

**Loss function**: measures how well the model represents the relationship between the variables; the better the fit, the lower the loss.
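
To make the idea of a loss function concrete, here is a small sketch using mean squared error. MSE is an assumption on my part, since these notes do not name a specific loss; the helper `mse` and the data are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical data that follows y = 2x exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

good_fit = mse(y, 2.0 * x)         # model parameters match the data
poor_fit = mse(y, 0.5 * x + 1.0)   # model parameters are far off

print(good_fit, poor_fit)  # 0.0 10.375
```

The well-chosen parameters give a loss of zero, while the poorly chosen ones give a much larger loss, which is the behaviour the definition above describes.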

# Linear regression

- Simple linear regression
- Multiple linear regression

Components
- dependent variable
- independent variable
- slope
- intercept

Calculating the regression line (**least squares method**) requires:
- The mean of the X values $(\bar{X})$
- The mean of the Y values $(\bar{Y})$
- The standard deviation of the X values $(S_X)$
- The standard deviation of the Y values $(S_Y)$
- The correlation between X and Y (Pearson's $r$, here denoted by the Greek letter rho, $\rho$)

**Slope** (m):
$$\hat m = \rho \frac{S_Y}{S_X}$$

**Y-intercept** (c):  
The fitted line $\hat y = \hat m x + \hat c$ passes through the point of means $(\bar{X}, \bar{Y})$, so
$$\bar{Y} = \hat c + \hat m \bar{X}$$
$$ \hat c = \bar{Y} - \hat m\bar{X}$$
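
The slope and intercept formulas can be checked numerically. The sketch below uses hypothetical data and compares the result with `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

rho = np.corrcoef(x, y)[0, 1]          # Pearson correlation
m_hat = rho * y.std() / x.std()        # slope: rho * S_Y / S_X
c_hat = y.mean() - m_hat * x.mean()    # intercept: Y-bar minus slope times X-bar

# Cross-check against NumPy's least squares fit of a straight line
m_ref, c_ref = np.polyfit(x, y, 1)
assert np.isclose(m_hat, m_ref) and np.isclose(c_hat, c_ref)

print(round(m_hat, 3), round(c_hat, 3))  # approximately 0.6 and 2.2
```

The ratio $S_Y / S_X$ is the same for population and sample standard deviations, as long as both use the same convention.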

## Coefficient of Determination (R-Squared $R^2$)

Compares the regression line with a baseline (worst-case) model that always predicts the mean of $y$.

An obtained R-squared value of, say, 0.85 can be stated as
> ***85% of the variation in the dependent variable $y$ is explained by the independent variable in our model.***
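
A minimal sketch of the calculation, assuming the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the squared error of the regression line and $SS_{tot}$ is the squared error of the mean-only baseline. The data are hypothetical.

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the regression line by least squares
m_hat, c_hat = np.polyfit(x, y, 1)
y_pred = m_hat * x + c_hat

ss_res = np.sum((y - y_pred) ** 2)     # squared error of the regression line
ss_tot = np.sum((y - y.mean()) ** 2)   # squared error of the baseline (always predict the mean)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # about 0.6
```

Here roughly 60% of the variation in `y` is explained by `x`. For simple linear regression this equals the square of Pearson's correlation coefficient.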

## Regression assumptions

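- Linearity: the relationship between the dependent and independent variable should be linear (check with scatter plots, and look for outliers)
- Normality: the **model residuals** should follow a normal distribution (check with histograms or Q-Q plots)
- Homoscedasticity (as opposed to heteroscedasticity): the variability of the dependent variable should be equal across values of the independent variable (check with a scatter plot)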
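
The checks above are normally done visually. As a rough numeric complement, the sketch below computes the points behind a Q-Q plot and compares the residual spread in the lower and upper halves of the data. The synthetic data, the correlation summary of the Q-Q points, and the half-split spread ratio are all illustrative assumptions, not a formal test.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): linear trend plus normal noise
x = np.linspace(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit a line and compute the model residuals
m_hat, c_hat = np.polyfit(x, y, 1)
residuals = y - (m_hat * x + c_hat)

# Normality: the points a Q-Q plot would show
n = residuals.size
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
sample = np.sort((residuals - residuals.mean()) / residuals.std())
qq_corr = np.corrcoef(theoretical, sample)[0, 1]   # near 1 when the Q-Q points fall on a line

# Homoscedasticity: x is sorted, so compare residual spread for low x vs high x
half = n // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()  # near 1 for constant variance

print(round(qq_corr, 3), round(spread_ratio, 3))
```

Plotting `theoretical` against `sample` gives the Q-Q plot itself, and plotting `x` against `residuals` gives the scatter plot used to judge homoscedasticity by eye.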

## OLS (Ordinary Least Squares) in statsmodels

In [19]:
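df = pd.read_csv('../Data/heightWeight.csv')
# R-style formula: dependent variable ~ independent variable
f = 'weight~height'
# Build the OLS model and fit it to the data
model = ols(formula=f, data=df).fit()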
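
Once a model is fitted, the estimated parameters and $R^2$ can be read from the results object. The sketch below uses a synthetic `demo_df` (hypothetical values, not the contents of `heightWeight.csv`) so that it runs on its own. The same attributes should be available on `model` above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(42)

# Synthetic stand-in for the height/weight data (values are an assumption)
height = rng.uniform(58, 75, size=50)
weight = -200 + 5.5 * height + rng.normal(scale=10, size=50)
demo_df = pd.DataFrame({'height': height, 'weight': weight})

demo_model = ols(formula='weight~height', data=demo_df).fit()

print(demo_model.params)     # Series indexed by 'Intercept' and 'height'
print(demo_model.rsquared)   # coefficient of determination
print(demo_model.summary())  # full regression table
```

`params['height']` is the estimated slope and `params['Intercept']` the estimated intercept, corresponding to $\hat m$ and $\hat c$ in the formulas earlier in these notes.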