# Multiple Regression

## Learning goals:

For a multivariable linear regression, students will be able to:

* compare and contrast with univariable linear regression
* write an example of the equation
* develop one with statsmodels 
* assess the model fit 
* validate the model


### Key terms
- Multivariable
- Train-test split
- MSE: Mean squared error
- RMSE: Root mean squared error
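
As a quick illustration of the two error metrics, here is a minimal sketch in plain Python. The function names and the litre values are made up for the example; in practice a library routine would normally be used instead.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Made-up litres of beer: actual vs. predicted
actual = [25.0, 29.0, 31.0, 30.0]
predicted = [26.0, 28.0, 29.0, 30.0]

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mse, rmse)
```

Because the RMSE is in the same units as the target (litres here), it is usually the easier of the two to interpret.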


## Scenario

The University of São Paulo in Brazil likes to party. We are a contracted beer supplier to the university, and we want to make sure we have enough supply on hand. We hope to build a model that predicts beer consumption from other variables.


![beer](pexels-photo-544988-small.jpeg)
More about the dataset can be found [here](https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo)


###  Prior Knowledge


Before looking at the dataset, what variables do we think might be in there? What might make a student drink more? 

#### Step 1:  Discussion 

- compare and contrast with univariable linear regression
- How is this different from the regression we've done before?
- Here, you'll explore how to perform linear regressions using multiple independent variables to better predict a target variable.
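
To make the comparison concrete, the sketch below fits the same made-up target once with one predictor and once with two, using plain NumPy least squares. The data, the "true" coefficients, and the helper `fit_r2` are all assumptions for illustration; none of it comes from the beer dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(15, 35, n)            # made-up daily max temperature
weekend = rng.integers(0, 2, n)          # 1 on weekends, 0 otherwise
# Made-up "true" relationship plus noise
litres = 5 + 0.7 * temp + 5 * weekend + rng.normal(0, 1, n)

def fit_r2(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2

beta_one, r2_one = fit_r2(temp, litres)                               # one predictor
beta_two, r2_two = fit_r2(np.column_stack([temp, weekend]), litres)   # two predictors
print(round(r2_one, 3), round(r2_two, 3))
```

On the data used for fitting, adding a predictor can never lower $R^2$, which is one reason the fit should also be checked on held-out data.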

#### Step 2:  Develop a multivariable regression model with statsmodels 

**Load Libraries and load in data**

In [26]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Consumo_cerveja.csv')

In [4]:
df.head()
#df.info()

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,273,239,325,0,0.0,25.461
1,2015-01-02,2702,245,335,0,0.0,28.972
2,2015-01-03,2482,224,299,0,1.0,30.814
3,2015-01-04,2398,215,286,12,1.0,29.799
4,2015-01-05,2382,21,283,0,0.0,28.9


In [7]:
df_comma  = pd.read_csv('Consumo_cerveja.csv', decimal =',', parse_dates=['Data'])

In [9]:
# The consumption column is still read as text (dtype object);
# pd.to_numeric shows that it converts cleanly to floats
pd.to_numeric(df_comma['Consumo de cerveja (litros)']).dtype

dtype('float64')

### Small Data Cleaning Tasks:
- Drop Date
- convert all the columns to numeric (replace ',' with '.')
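- rename columns to be `name = ['temp-median', 'temp-min', 'temp-max', 'rain', 'weekends', 'target']` ("Final de Semana" is the weekend flag; "Temperatura Media" is the daily mean temperature, although this notebook calls the column `temp-median`)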
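
The three tasks can be sketched on a tiny stand-in for the raw file. The two rows below are made up, and only two of the numeric columns are included to keep the example short.

```python
import io
import pandas as pd

# Two made-up rows in the same shape as the raw file (comma as decimal mark)
raw = io.StringIO(
    'Data,Temperatura Media (C),Precipitacao (mm)\n'
    '2015-01-01,"27,3","0"\n'
    '2015-01-02,"27,02","1,2"\n'
)
sample = pd.read_csv(raw)

# Replace the decimal comma with a point, then convert each text column to numbers
for col in ['Temperatura Media (C)', 'Precipitacao (mm)']:
    sample[col] = pd.to_numeric(sample[col].str.replace(',', '.', regex=False))

sample = sample.drop(columns='Data')          # drop the date
sample.columns = ['temp-median', 'rain']      # rename
print(sample.dtypes)
```

Passing `decimal=','` to `pd.read_csv` is an alternative to the manual replace for columns that use a comma as the decimal mark.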

In [30]:
# clean data here
name = ['date','temp-median', 'temp-min', 'temp-max', 'rain', 'weekends', 'target']

In [31]:
df = df_comma

In [32]:
df.columns = name

In [33]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 7 columns):
date           365 non-null datetime64[ns]
temp-median    365 non-null float64
temp-min       365 non-null float64
temp-max       365 non-null float64
rain           365 non-null float64
weekends       365 non-null float64
target         365 non-null object
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 22.8+ KB


Unnamed: 0,temp-median,temp-min,temp-max,rain,weekends
count,365.0,365.0,365.0,365.0,365.0
mean,21.226356,17.46137,26.611507,5.196712,0.284932
std,3.180108,2.826185,4.317366,12.417844,0.452001
min,12.9,10.6,14.5,0.0,0.0
25%,19.02,15.3,23.8,0.0,0.0
50%,21.38,17.9,26.9,0.0,0.0
75%,23.28,19.6,29.4,3.2,1.0
max,28.86,24.5,36.5,94.8,1.0


**Check** for NaNs

In [34]:
df.isna().sum()

date           0
temp-median    0
temp-min       0
temp-max       0
rain           0
weekends       0
target         0
dtype: int64

In [35]:
df.dropna(inplace=True)

In [36]:
df.shape

(365, 7)

In [37]:
df.tail()

Unnamed: 0,date,temp-median,temp-min,temp-max,rain,weekends,target
360,2015-12-27,24.0,21.1,28.2,13.6,1.0,32.307
361,2015-12-28,22.64,21.1,26.7,0.0,0.0,26.095
362,2015-12-29,21.68,20.3,24.1,10.3,0.0,22.309
363,2015-12-30,21.38,19.3,22.4,6.3,0.0,20.467
364,2015-12-31,24.76,20.2,29.0,0.0,0.0,22.446


### Everyone write an example of an equation for our multiple regression

The main idea here is pretty simple. In simple linear regression we took our dependent variable to be a function of a single independent variable; here we take it to be a function of multiple independent variables.

<img src="https://miro.medium.com/max/1400/1*d0icRnPHWjHSNXxuoYT5Vg.png" width=450 />

Our regression equation, then, instead of looking like $\hat{y} = mx + b$, will now look like:

$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 

Remember that the hats ( $\hat{}$ ) indicate quantities that are estimated: $\hat\beta_0$ is the intercept, and each $\hat\beta_i$ is the estimated change in $\hat y$ for a one-unit change in $x_i$ with the other predictors held fixed.

What would the formula be with real values?
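
One way to write it for this dataset, as a sketch, uses the renamed columns as the $x_i$ and leaves the coefficients symbolic, since they are only known once the model has been fitted:

$$ \widehat{\text{target}} = \hat\beta_0 + \hat\beta_1\,(\text{temp-median}) + \hat\beta_2\,(\text{temp-min}) + \hat\beta_3\,(\text{temp-max}) + \hat\beta_4\,(\text{rain}) + \hat\beta_5\,(\text{weekends}) $$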

**Send your equations to me via zoom or slack and I will paste them into the notebook**

Equations here

>

![statsmodels](https://www.statsmodels.org/stable/_static/statsmodels_hybi_banner.png)

Okay, now here's how you can use format and join to make the formula with **code**:

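In [38]:
# Leave out the date and the target; Q("...") quotes names containing '-',
# which a formula would otherwise read as subtraction
predictors = [c for c in df.columns if c not in ('date', 'target')]
formula = 'target ~ ' + ' + '.join('Q("{}")'.format(c) for c in predictors)
formula

'target ~ Q("temp-median") + Q("temp-min") + Q("temp-max") + Q("rain") + Q("weekends")'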

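In [39]:
# OLS needs purely numeric inputs: drop the date, cast everything to float,
# and add a constant column so the model estimates an intercept (beta_0)
X = sm.add_constant(df.drop(['date', 'target'], axis=1).astype(float))
y = df['target'].astype(float)
model = sm.OLS(y, X).fit()

In [None]:
model.summary()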

### What's the actual multivariable linear regression equation with the coefficients?

$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ 
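
Once coefficients are available, the equation can be written out programmatically. The helper below is a hypothetical sketch, and the coefficient values are placeholders chosen for illustration, not results from the fitted model.

```python
def format_equation(params, target='target'):
    """Build a readable regression equation from a {name: coefficient} mapping.

    The intercept is expected under the key 'const', which is the name
    statsmodels' add_constant gives it.
    """
    terms = [f"{params['const']:.3f}"]
    for name, coef in params.items():
        if name == 'const':
            continue
        sign = '+' if coef >= 0 else '-'
        terms.append(f"{sign} {abs(coef):.3f}*{name}")
    return f"{target}_hat = " + ' '.join(terms)

# Hypothetical coefficients for illustration only, NOT fitted values
example_params = {'const': 6.0, 'temp-max': 0.65, 'rain': -0.06, 'weekends': 5.2}
equation = format_equation(example_params)
print(equation)
```

With a fitted statsmodels result, the same helper could be called as `format_equation(model.params.to_dict())`, assuming the model was fitted on a DataFrame that includes a `const` column.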

#### Step 3: Assess the model fit
Demonstrate and Apply:

**Discussion:**

In groups of 2 or 3 write a synopsis of the following summary

* What can you say about the coefficients?

* What do the p-values tell us?

* What does $R^2$ represent?

* What other insights do you notice?
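
As a reference point for the discussion, $R^2$ and adjusted $R^2$ can be computed by hand. This is a minimal sketch with made-up numbers, and the function names are chosen for the example.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R^2 for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

actual = [20.0, 24.0, 28.0, 32.0]       # made-up values
predicted = [21.0, 23.0, 29.0, 31.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
print(r2, adj_r2)
```

Adjusted $R^2$ is the more useful of the two when comparing models with different numbers of predictors, because plain $R^2$ never decreases when a predictor is added.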





#### Step 4: Validate the model 
![scikit](https://cdn-images-1.medium.com/max/1200/1*-FHtcdQljtGKQGm77uDIyQ.png)
- Build LinReg Model with Scikit-Learn
- Check some of the linear regression assumptions
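
One check that is often relevant when several predictors measure similar things (such as three temperature columns) is multicollinearity. The sketch below uses made-up data and computes variance inflation factors (VIFs) with NumPy alone; the variable names and distributions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
temp_min = rng.normal(17, 3, n)                       # made-up temperatures
temp_max = temp_min + 9 + rng.normal(0, 1, n)         # closely tied to temp_min
rain = rng.exponential(5, n)                          # unrelated to temperature

X = np.column_stack([temp_min, temp_max, rain])
corr = np.corrcoef(X, rowvar=False)                   # pairwise correlations

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of inv(corr)
vif = np.diag(np.linalg.inv(corr))
for name, value in zip(['temp-min', 'temp-max', 'rain'], vif):
    print(f'{name}: VIF = {value:.1f}')
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Large values (common rules of thumb use 5 or 10) suggest that the individual coefficients will be hard to interpret.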


In [40]:
linreg = LinearRegression()

In [41]:
# scikit-learn needs numeric features, so leave out the date as well as the target
X = df.drop(["target", "date"], axis=1)
y = df.target.astype(float)

In [42]:
# hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [43]:
# use fit to form model
linreg.fit(X_train, y_train)

In [None]:
# R^2 of the model on the held-out test set
linreg.score(X_test, y_test)

`score` here returns the R^2. 

How does it differ from when you use the whole dataset?
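
To explore that question, the sketch below compares three scores on made-up data: $R^2$ on the whole dataset, on a held-out test set, and across cross-validation folds. It uses `cross_val_score`, which is imported at the top of this notebook. The `_demo` arrays and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(365, 3))                       # made-up predictors
y_demo = 25 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 365)

# R^2 when fitting and scoring on the same (whole) dataset
r2_full = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

# R^2 on a held-out 20% test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)
r2_test = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# R^2 on each of 5 cross-validation folds
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')
print(r2_full, r2_test, cv_scores.mean())
```

A single train-test split gives one estimate that depends on which rows land in the test set. Averaging over folds gives a steadier picture of how the model performs on data it has not seen.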

### Integration:

Repeat this process for the concrete compressive strength dataset. The documentation can be found [here](http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)
![test](building-construction-building-site-constructing-small.jpg)

In [44]:
df2 = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')

In [45]:
df2.head()
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
Cement (component 1)(kg in a m^3 mixture)                1030 non-null float64
Blast Furnace Slag (component 2)(kg in a m^3 mixture)    1030 non-null float64
Fly Ash (component 3)(kg in a m^3 mixture)               1030 non-null float64
Water  (component 4)(kg in a m^3 mixture)                1030 non-null float64
Superplasticizer (component 5)(kg in a m^3 mixture)      1030 non-null float64
Coarse Aggregate  (component 6)(kg in a m^3 mixture)     1030 non-null float64
Fine Aggregate (component 7)(kg in a m^3 mixture)        1030 non-null float64
Age (day)                                                1030 non-null int64
Concrete compressive strength(MPa, megapascals)          1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
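
The column names in this dataset are long, which makes later cells awkward to write. A possible first step, sketched here with a hypothetical helper `short_name`, is to shorten them before modelling.

```python
def short_name(column):
    """Keep the text before the first '(' and turn it into a snake_case name."""
    return column.split('(')[0].strip().lower().replace(' ', '_')

long_names = [
    'Cement (component 1)(kg in a m^3 mixture)',
    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
    'Water  (component 4)(kg in a m^3 mixture)',
    'Age (day)',
    'Concrete compressive strength(MPa, megapascals)',
]
short_names = [short_name(c) for c in long_names]
print(short_names)
```

Applied to the DataFrame, this would look like `df2.columns = [short_name(c) for c in df2.columns]`.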


### Assessment

### Reflection

### Resources

- https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86
- Full code implementation of linear regression: https://github.com/SSaishruthi/Linear_Regression_Detailed_Implementation
- Multiple regression explained: https://www.statisticssolutions.com/what-is-multiple-linear-regression/
