# Regression: Simple Linear Regression

Introduction to Machine Learning, BCAM & UPV/EHU course, by Carlos Cernuda, Ekhine Irurozki and Aritz Perez.

## References 

* James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
* Data sets: http://www-bcf.usc.edu/~gareth/ISL/data.html
* scikit-learn library: http://scikit-learn.org
* Reference Jupyter notebooks:
    - R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)
    http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html
    - General Assembly's Data Science course in Washington, DC
    https://github.com/justmarkham/DAT4
    - An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) adapted to Python code
    https://github.com/JWarmenhoven/ISLR-python

## Python: check that the Python version is correct (Python 3)

In [None]:
# Check that the module search path is correct
import sys
from pprint import pprint as p
p(sys.path)

# Python 3 uses true division, so this prints 2.5 (Python 2 would print 2)
a = 5/2
print(a)
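
The division test above infers the interpreter version indirectly. A more explicit check, sketched here with only the standard library, reads `sys.version_info`; the names `is_python3` and `ratio` are illustrative and not part of the original notebook.

```python
import sys

# sys.version_info.major is 3 for any Python 3.x interpreter
is_python3 = sys.version_info.major == 3

# True division: in Python 3, 5/2 evaluates to 2.5 rather than 2
ratio = 5 / 2

print('Python', sys.version_info.major, '- 5/2 =', ratio)
```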

## Python Libraries

In [None]:
##########################################################
import numpy as np #scientific computing (n-dim arrays, etc)
import pandas as pd #data analysis library
##########################################################
# Plots:
import matplotlib.pyplot as plt 
import matplotlib.pylab as pylab
import seaborn as sns #visualization library based on matplotlib
%matplotlib inline
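# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style_name)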
params = {'legend.fontsize': 'xx-large',
              'figure.figsize': (15, 5),
              'axes.labelsize': 'xx-large',
              'axes.titlesize':'xx-large',
              'xtick.labelsize':'xx-large',
              'ytick.labelsize':'xx-large'}    
pylab.rcParams.update(params)  #fix the parameters for the plots

pd.set_option('display.notebook_repr_html', False)
##########################################################
# SKLEARN: scikit-learn machine learning tools
from sklearn.preprocessing import scale #Standardize a dataset 
import sklearn.linear_model as skl_lm #linear model regression
##########################################################
# STATSMODELS: provides classes and functions for the estimation of many different statistical models, 
# as well as for conducting statistical tests, and statistical data exploration.
import statsmodels.api as sm
import statsmodels.formula.api as smf
##########################################################
np.random.seed(0)

## Data set: Advertising

The Advertising data set records the sales of a product in 200 different markets, together with the advertising budget spent in each market on three media: TV, radio and newspaper.
- Response: sales, in thousands of units
- Predictors: TV, radio, newspaper, in thousands of dollars

In [None]:
advertising = pd.read_csv('dataset/Advertising.csv', usecols=[1,2,3,4])
#print(advertising.keys())
#advertising.info()
advertising.describe()

In [None]:
advertising.head(3) # show the first 3 rows

## Coefficient estimates from data: least squares fitting

Predict sales (the response variable) using the TV advertising budget as the predictor variable. The linear model is:

$$
sales \sim \beta_0 +\beta_1 TV
$$

We want to use our training data (predictor advertising.TV and response advertising.sales) to calculate the estimates for both regression parameters. 

The estimates are the values of the coefficients that minimize the RSS (residual sum of squares):

$$
\min_{\beta_0,\beta_1} \sum_{i=1}^n (y_i - \hat{y}_i )^2 
$$

where

$$
y_i - \hat{y}_i  = y_i - (\beta_0 +\beta_1 TV_i)
$$
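
For a single predictor, this minimization has a well-known closed-form solution: the slope is the sum of products of the centered predictor and centered response divided by the sum of squares of the centered predictor, and the intercept follows from the two means. The sketch below implements those formulas with NumPy. The helper names `least_squares_line` and `rss`, and the noise-free synthetic data, are assumptions made for illustration; they are not part of the Advertising data or of the original notebook.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least squares estimates (beta_0, beta_1) for y ~ beta_0 + beta_1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of centered cross-products over sum of centered squares
    beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    beta_0 = y_mean - beta_1 * x_mean
    return beta_0, beta_1

def rss(x, y, beta_0, beta_1):
    """Residual sum of squares of the line beta_0 + beta_1 * x."""
    residuals = np.asarray(y, dtype=float) - (beta_0 + beta_1 * np.asarray(x, dtype=float))
    return float(np.sum(residuals ** 2))

# Synthetic, noise-free data lying exactly on y = 7 + 0.05 x (illustrative values)
x_demo = np.array([10.0, 50.0, 120.0, 200.0, 290.0])
y_demo = 7.0 + 0.05 * x_demo

b0, b1 = least_squares_line(x_demo, y_demo)
print('beta_0', b0, 'beta_1', b1, 'RSS', rss(x_demo, y_demo, b0, b1))
```

Because the synthetic points lie exactly on a line, the estimates recover that line and the RSS is zero up to rounding. On real data such as `advertising.TV` and `advertising.sales`, the same function could be applied, and the RSS would be strictly positive.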

The sns.regplot function plots the least squares line directly. We are looking for the two coefficients for which the resulting line is as close as possible to the 200 data points.

In [None]:
plt.figure(figsize=(5, 5))
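# Recent seaborn versions require x and y as keyword arguments
sns.regplot(x=advertising.TV, y=advertising.sales, order=1, ci=None, scatter_kws={'color':'r'})
plt.xlim(-10,310)
plt.ylim(bottom=0);  # 'ymin' was removed from matplotlib; 'bottom' is the current name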

In the previous plot, the vertical distances between the red points and the line are the errors (residuals) of the linear model. As the TV budget increases, the points spread further from the line, so the fit becomes less accurate.

In order to calculate the model parameters (estimates of the coefficients) using least squares fitting we can do the following:

In [None]:
# Regression coefficients

# Data: predictor and response
# Optional centered predictor (changes the intercept only):
#x_TV = scale(advertising.TV, with_mean=True, with_std=False).reshape(-1,1)
# scikit-learn expects a 2-D predictor: one column, number of rows inferred
x_TV = advertising.TV.values.reshape(-1, 1)
y_sales = advertising.sales

# Create the model (ordinary least squares)
regr_model = skl_lm.LinearRegression() 

#Fit the model with data
regr_model.fit(x_TV,y_sales) 

# Coefficients estimates
beta_0 = regr_model.intercept_
beta_1 = regr_model.coef_[0]

print('beta_0', beta_0)
print('beta_1', beta_1)

#Predict "new" values with the model
x_TV_pred = np.asarray([50, 100, 200]).reshape(-1, 1)
y_sales_pred = regr_model.predict(x_TV_pred)

print('y_sales_pred', y_sales_pred)
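
As a cross-check on the fitting and prediction steps above, the same kind of line can be obtained with `np.polyfit`, and predicting then amounts to evaluating the fitted line at the new budgets. The sketch below uses small synthetic arrays (`x_demo`, `y_demo`) as a stand-in for the Advertising data; these values are made up for illustration. With the notebook's data, one would pass `advertising.TV` and `advertising.sales` instead.

```python
import numpy as np

# Synthetic stand-in for (TV budget, sales); the values are illustrative only
x_demo = np.array([20.0, 80.0, 150.0, 220.0, 280.0])
y_demo = np.array([8.1, 10.9, 14.6, 17.8, 21.2])

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

# Predicting is just evaluating the fitted line at the new budgets
x_new = np.array([50.0, 100.0, 200.0])
y_new = intercept + slope * x_new

print('intercept', intercept)
print('slope', slope)
print('predictions', y_new)
```

If the coefficients from `np.polyfit` and from `LinearRegression` disagree on the same data, the usual suspects are a predictor that was centered or scaled in one case but not in the other.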
