# Regression: Multiple Linear Regression

Introduction to Machine Learning, BCAM & UPV/EHU course, by Carlos Cernuda, Ekhine Irurozki and Aritz Perez.


## References 

* James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
* Data sets: http://www-bcf.usc.edu/~gareth/ISL/data.html
* scikit-learn library: http://scikit-learn.org
* Reference Jupyter notebooks:
    - R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)
    http://www.science.smith.edu/~jcrouser/SDS293/labs/lab10-py.html
    - General Assembly's Data Science course in Washington, DC
    https://github.com/justmarkham/DAT4
    - An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) adapted to Python code
    https://github.com/JWarmenhoven/ISLR-python

## Python Libraries

In [None]:
##########################################################
import numpy as np #scientific computing (n-dim arrays, etc)
import pandas as pd #data analysis library
##########################################################
# Plots:
import matplotlib.pyplot as plt 
import matplotlib.pylab as pylab
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns #visualization library based on matplotlib
%matplotlib inline
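# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use([style_name])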
params = {'legend.fontsize': 'xx-large',
              'figure.figsize': (15, 5),
              'axes.labelsize': 'xx-large',
              'axes.titlesize':'xx-large',
              'xtick.labelsize':'xx-large',
              'ytick.labelsize':'xx-large'}    
pylab.rcParams.update(params)  #fix the parameters for the plots

pd.set_option('display.notebook_repr_html', False)
##########################################################
# SKLEARN: scikit-learn machine learning tools
import sklearn.linear_model as skl_lm ###linear model regression
##########################################################
np.random.seed(0)

## Data set: Advertising

The Advertising data set records the sales of a product in 200 different markets, together with the advertising budgets for TV, radio and newspaper in each market.
- Sales: thousands of units
- TV, radio, newspaper: thousands of dollars

In [None]:
advertising = pd.read_csv('dataset/Advertising.csv', usecols=[1,2,3,4])

## Coefficient estimates from data: least squares fitting

Predict sales using the TV and radio advertising budgets as predictor variables. The linear model will be:

$$
sales \sim \beta_0 +\beta_1 radio +\beta_2 TV
$$

We are going to calculate the coefficient estimates as before and plot the resulting plane over the two predictor variables.

In [None]:
# Predictor and response data
# .as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
x_predictors = advertising[['radio', 'TV']].values
y_sales = advertising.sales

# Create the model and fit it by least squares
regr_model = skl_lm.LinearRegression()
regr_model.fit(x_predictors,y_sales)


beta_0 = regr_model.intercept_
beta_1 = regr_model.coef_[0]
beta_2 = regr_model.coef_[1]

print('beta_0', beta_0)
print('beta_1', beta_1)
print('beta_2', beta_2)
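
The least squares fit can also be written out by hand, which makes explicit what `LinearRegression` is estimating. The following is a minimal sketch, assuming synthetic data in place of `Advertising.csv`: the values in `true_beta`, the noise level and the variable names are made up for illustration and are not the coefficients of the real data set.

```python
import numpy as np

# Synthetic stand-in for the Advertising data (assumption: invented values,
# chosen only so that the example runs without the CSV file).
rng = np.random.default_rng(0)
n = 200
radio = rng.uniform(0, 50, size=n)
tv = rng.uniform(0, 300, size=n)
true_beta = np.array([3.0, 0.2, 0.05])  # hypothetical beta_0, beta_1, beta_2
noise = rng.normal(0, 0.5, size=n)
sales = true_beta[0] + true_beta[1] * radio + true_beta[2] * tv + noise

# Design matrix: a column of ones for the intercept, then radio and TV.
X = np.column_stack([np.ones(n), radio, tv])

# Least squares: choose beta to minimise ||X @ beta - sales||^2.
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)
print('beta_0, beta_1, beta_2 =', beta_hat)
```

With the real data, `X` would be built from the `radio` and `TV` columns in the same order, and `beta_hat` should then agree with `intercept_` and `coef_` up to numerical precision.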

In [None]:
# Prepare the values to be predicted with the model. 
# What are the min/max values of Radio & TV?
# Use these values to set up the grid for plotting.
advertising[['radio', 'TV']].describe()

In [None]:
# Create a coordinate grid
Radio = np.arange(0,50) #Return evenly spaced values within a given interval.
TV = np.arange(0,300)

X1_Radio, X2_TV = np.meshgrid(Radio, TV, indexing='xy')
Z_model = np.zeros((TV.size, Radio.size))

for (i,j),v in np.ndenumerate(Z_model):
        Z_model[i,j] =(beta_0 + X1_Radio[i,j]*beta_1 + X2_TV[i,j]*beta_2)
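
The element-by-element loop over the grid can be replaced by array arithmetic, since `np.meshgrid` already returns two arrays of the same shape. The sketch below compares both approaches; the coefficient values are hypothetical placeholders, not the fitted ones.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not fitted values).
beta_0, beta_1, beta_2 = 3.0, 0.2, 0.05

Radio = np.arange(0, 50)
TV = np.arange(0, 300)
# With indexing='xy', both outputs have shape (TV.size, Radio.size).
X1_Radio, X2_TV = np.meshgrid(Radio, TV, indexing='xy')

# Element-by-element loop, as in the cell above.
Z_loop = np.zeros((TV.size, Radio.size))
for (i, j), _ in np.ndenumerate(Z_loop):
    Z_loop[i, j] = beta_0 + X1_Radio[i, j] * beta_1 + X2_TV[i, j] * beta_2

# Same computation with array arithmetic, no explicit loop.
Z_vec = beta_0 + beta_1 * X1_Radio + beta_2 * X2_TV

print(Z_vec.shape, np.allclose(Z_loop, Z_vec))
```

Both versions produce the same surface; the vectorized form is shorter and avoids the Python-level loop over 15,000 grid points.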

In [None]:
# Create plot
fig = plt.figure(figsize=(10,6))
fig.suptitle('Sales ~ beta_0 + beta_1 Radio + beta_2 TV Advertising', fontsize=20)
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions
ax = fig.add_subplot(projection='3d')
# Model
ax.plot_surface(X1_Radio, X2_TV, Z_model, rstride=10, cstride=5, alpha=0.4)
# Data 
ax.scatter3D(advertising.radio, advertising.TV, advertising.sales, c='r')

ax.set_xlabel('Radio')
ax.set_xlim(0,50)
ax.set_ylabel('TV')
ax.set_ylim(bottom=0)  # the 'ymin' keyword was removed from matplotlib
ax.set_zlabel('Sales');

## Multiple Regression: removing the additive assumption

Predict sales using the TV and radio advertising budgets as predictor variables, now also including an interaction term between the two budgets.

The linear model will be: 

$$
sales \sim \beta_0 +\beta_1 radio +\beta_2 TV + \beta_3 (radio \times TV)
$$
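
One way to read the interaction term is to regroup the model around TV. This is a standard rearrangement of the equation above, shown here as a reading aid:

$$
sales \sim \beta_0 + \beta_1 radio + (\beta_2 + \beta_3 radio) \, TV
$$

$$
\frac{\partial \, sales}{\partial \, TV} = \beta_2 + \beta_3 \, radio
$$

The effect on sales of an extra unit of TV budget is therefore no longer the constant $\beta_2$: it depends on the radio budget through $\beta_3$. By symmetry, the effect of radio is $\beta_1 + \beta_3 TV$.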

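We are going to calculate the coefficient estimates as before and plot the resulting surface over the two predictor variables Radio and TV. The model is still linear in the coefficients, but because of the interaction term the surface is no longer a plane.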

In [None]:
# Predictor and response data. 

# 3 predictors
x_radio = advertising[['radio']].values.reshape(-1, 1)
x_TV = advertising[['TV']].values.reshape(-1, 1)
x_RadxTV = np.multiply(x_radio, x_TV)

# Include all in one matrix with 3 columns
x_1 = np.concatenate((x_radio, x_TV), axis=1)
x_predictors = np.concatenate((x_1, x_RadxTV), axis=1)

# Response
y_sales = advertising.sales

# Create the model and fit it by least squares
regr_model = skl_lm.LinearRegression()
regr_model.fit(x_predictors,y_sales)

# Coefficient estimates
beta_0 = regr_model.intercept_
print('beta_0', beta_0)

# coef_ is a vector with the 3 slope coefficients, in column order: radio, TV, radio*TV
beta_1 = regr_model.coef_[0]
beta_2 = regr_model.coef_[1]
beta_3 = regr_model.coef_[2]

print('beta_1', beta_1)
print('beta_2', beta_2)
print('beta_3', beta_3)

In [None]:
# Create a coordinate grid (same as before)
#Radio = np.arange(0,50)
#TV = np.arange(0,300)

#X1_Radio, X2_TV = np.meshgrid(Radio, TV, indexing='xy')
Z_model_mix = np.zeros((TV.size, Radio.size))

for (i,j),v in np.ndenumerate(Z_model_mix):
        Z_model_mix[i,j] =(beta_0 + X1_Radio[i,j]*beta_1 + X2_TV[i,j]*beta_2 + X1_Radio[i,j]*X2_TV[i,j]*beta_3)
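
A quick numerical check shows how the interaction term bends the surface. On a grid with unit spacing, the mixed difference `Z[i+1, j+1] - Z[i+1, j] - Z[i, j+1] + Z[i, j]` is zero for a plane and equals `beta_3` once the `radio * TV` term is present. The sketch below assumes hypothetical coefficient values (not the fitted ones), and `mixed_difference` is a helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not fitted values).
beta_0, beta_1, beta_2, beta_3 = 6.0, 0.03, 0.02, 0.001

Radio = np.arange(0, 50)
TV = np.arange(0, 300)
X1_Radio, X2_TV = np.meshgrid(Radio, TV, indexing='xy')

# Additive model (a plane) and model with interaction (a curved surface).
Z_plane = beta_0 + beta_1 * X1_Radio + beta_2 * X2_TV
Z_mix = Z_plane + beta_3 * X1_Radio * X2_TV

def mixed_difference(Z):
    """Z[i+1, j+1] - Z[i+1, j] - Z[i, j+1] + Z[i, j] for every grid cell."""
    return Z[1:, 1:] - Z[1:, :-1] - Z[:-1, 1:] + Z[:-1, :-1]

print(np.allclose(mixed_difference(Z_plane), 0.0))    # plane: no twist
print(np.allclose(mixed_difference(Z_mix), beta_3))   # interaction: twist of beta_3
```

The same array expression used for `Z_mix` could replace the loop in the cell above when the fitted coefficients are plugged in.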

In [None]:
# Create plot
fig = plt.figure(figsize=(10,6))
fig.suptitle('Sales ~ beta_0 + beta_1 Radio + beta_2 TV + beta_3 TV*Radio', fontsize=20)
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions
ax = fig.add_subplot(projection='3d')

# Model
ax.plot_surface(X1_Radio, X2_TV, Z_model_mix, rstride=10, cstride=5, alpha=0.4)
# Data 
ax.scatter3D(advertising.radio, advertising.TV, advertising.sales, c='r')

ax.set_xlabel('Radio')
ax.set_xlim(0,50)
ax.set_ylabel('TV')
ax.set_ylim(bottom=0)  # the 'ymin' keyword was removed from matplotlib
ax.set_zlabel('Sales');