# Module 03

## Session 02 Supervised Learning Regression

## Multiple Linear Regression

### Import libraries

In [1]:
import statsmodels.api as sm
import seaborn as sns

### Load data

In [2]:
tips = sns.load_dataset('tips')
tips.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### Model

In [3]:
# define the variables
y = tips['tip']  # dependent variable
X = tips[['total_bill', 'size']]  # independent variables

# add a constant column so the model estimates an intercept
X = sm.add_constant(X)

# defining the model
model = sm.OLS(y, X)

# fitting model
result = model.fit()

# print summary
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                    tip   R-squared:                       0.468
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     105.9
Date:                Wed, 07 Jul 2021   Prob (F-statistic):           9.67e-34
Time:                        11:40:47   Log-Likelihood:                -347.99
No. Observations:                 244   AIC:                             702.0
Df Residuals:                     241   BIC:                             712.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6689      0.194      3.455      0.0

<b>Important output</b>:
1. <b>Adjusted R-squared</b>
2. <b>Prob (F-statistic)</b> (F-test)
3. <b>Coefficient of each variable</b> (β<sub>0</sub>, β<sub>1</sub>, and β<sub>2</sub>)
4. <b>P-value of each variable</b> (t-test for β<sub>0</sub>, β<sub>1</sub>, and β<sub>2</sub>)

### Interpretation Summary:
Model: <b>y = 0.6689 + 0.0927X<sub>1</sub> + 0.1926X<sub>2</sub></b>
> - X<sub>1</sub> = Total bill
> - X<sub>2</sub> = Size (number of people in the party)
1. <b>Adjusted R-squared</b>: the model explains 46.3% of the variation in tips
2. <b>Prob (F-statistic)</b>: 9.67e-34 < 0.05, so at least one of the two variables significantly affects the tip
3. <b>β<sub>0</sub>, β<sub>1</sub></b>, and <b>β<sub>2</sub></b>:
> - <b>β<sub>0</sub></b>: the value of y when both x = 0, i.e. the predicted tip when total bill = 0 and size = 0
> - <b>β<sub>1</sub></b>: for every 1 dollar increase in total bill, the tip increases by approximately 0.0927 dollars, holding size constant
> - <b>β<sub>2</sub></b>: for every increase in size by 1 person, the tip increases by approximately 0.1926 dollars, holding total bill constant
4. <b>T-test</b>:
> - β<sub>0</sub> p-value: 0.001 < 0.05, which confirms that the model needs an intercept
> - β<sub>1</sub> p-value: 0.000 < 0.05, so the relationship between total bill and tip is significant, and it is positive because the coefficient is positive
> - β<sub>2</sub> p-value: 0.025 < 0.05, so the relationship between size and tip is significant, and it is positive because the coefficient is positive
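
To make the coefficient interpretation above concrete, the fitted equation can be evaluated by hand. This is a minimal sketch that hard-codes the rounded coefficients from the model line; `predict_tip` is a hypothetical helper name, and the inputs (a 20 dollar bill for a party of 2) are arbitrary example values inside the observed data range.

```python
# Rounded coefficients copied from the fitted model equation above.
B0, B1, B2 = 0.6689, 0.0927, 0.1926


def predict_tip(total_bill, size):
    """Predicted tip in dollars: y = B0 + B1 * total_bill + B2 * size."""
    return B0 + B1 * total_bill + B2 * size


base = predict_tip(20.0, 2)
print(round(base, 4))                         # 2.9081

# Raising the bill by 1 dollar (size fixed) changes the prediction by B1.
print(round(predict_tip(21.0, 2) - base, 4))  # 0.0927

# Adding 1 person (bill fixed) changes the prediction by B2.
print(round(predict_tip(20.0, 3) - base, 4))  # 0.1926
```

With the notebook's fitted `result`, `result.predict(...)` gives the same kind of prediction using the unrounded coefficients, provided the input has the same columns as `X`, including the constant.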

<b>Note</b>: The interpretation of the coefficients (β<sub>1</sub> and β<sub>2</sub>) only applies within the range of the data we have for each variable. We need the minimum and maximum value of each variable to determine the range over which this model can be applied.

In [4]:
tips[['total_bill', 'size']].describe()

,total_bill,size
count,244.0,244.0
mean,19.785943,2.569672
std,8.902412,0.9511
min,3.07,1.0
25%,13.3475,2.0
50%,17.795,2.0
75%,24.1275,3.0
max,50.81,6.0


> - The model can be applied for total bills from 3.07 to 50.81 dollars and sizes from 1 to 6 people
-------------------
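
The range restriction stated above can be turned into a small guard that is checked before using the model for a prediction. This is only a sketch: the bounds are copied from the `describe()` output, and `in_model_range` is a hypothetical helper name rather than anything provided by statsmodels.

```python
# Observed minimum and maximum values, copied from the describe() output.
TOTAL_BILL_RANGE = (3.07, 50.81)
SIZE_RANGE = (1, 6)


def in_model_range(total_bill, size):
    """Return True if both inputs lie inside the ranges seen in the data."""
    bill_ok = TOTAL_BILL_RANGE[0] <= total_bill <= TOTAL_BILL_RANGE[1]
    size_ok = SIZE_RANGE[0] <= size <= SIZE_RANGE[1]
    return bill_ok and size_ok


print(in_model_range(25.0, 3))    # True: both values were observed
print(in_model_range(120.0, 3))   # False: bill above the observed maximum
print(in_model_range(25.0, 8))    # False: party larger than any observed
```

A prediction for inputs where this returns `False` is an extrapolation, so the coefficient interpretation may not hold there.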

<b>Note</b>: 
> - <b>F-test</b>: tests whether the independent variables, taken together, affect the dependent variable
>> - <b>Hypothesis</b>: The independent variables <b>do not</b> affect the dependent variable
>> - <b>H<sub>0</sub></b>: β<sub>1</sub> = β<sub>2</sub> = 0
>> - <b>H<sub>a</sub></b>: at least one of β<sub>1</sub>, β<sub>2</sub> != 0
>> - <b>Rejection criteria</b>: p-value <= α
> - <b>T-test</b>: tests whether each coefficient (β<sub>0</sub>, β<sub>1</sub>, and β<sub>2</sub>) is significant, and in particular whether β<sub>0</sub> (intercept/constant) is needed
>> - <b>Hypothesis</b> (shown for β<sub>0</sub>): The model <b>does not</b> need the intercept
>> - <b>H<sub>0</sub></b>: β<sub>0</sub> = 0
>> - <b>H<sub>a</sub></b>: β<sub>0</sub> != 0 (two-sided), or β<sub>0</sub> > 0 or β<sub>0</sub> < 0 (one-sided)
>> - <b>Rejection criteria</b> (the summary reports two-sided p-values): 
>>> - <b>Two-sided</b>: p-value <= α
>>> - <b>One-sided</b>: p-value/2 <= α, provided the sign of the estimate agrees with H<sub>a</sub>
----------------------
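
The rejection rules in the note can be sketched as plain arithmetic. The helper names `f_statistic_from_r2` and `reject_null` are hypothetical, and the code assumes the p-value passed in is two-sided, as in the `P>|t|` column of the summary. The inputs are the rounded values shown in the summary and interpretation above, so results only approximately match the printed output.

```python
def f_statistic_from_r2(r2, df_model, df_resid):
    """Overall F-statistic: (R^2 / df_model) / ((1 - R^2) / df_resid)."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


def reject_null(p_two_sided, alpha=0.05, one_sided=False, sign_matches=True):
    """Decide whether to reject H0 from a two-sided p-value.

    For a one-sided test the two-sided p-value is halved, but only when the
    sign of the estimate agrees with the alternative hypothesis.
    """
    if one_sided:
        p = p_two_sided / 2 if sign_matches else 1 - p_two_sided / 2
    else:
        p = p_two_sided
    return p <= alpha


# F-statistic rebuilt from R-squared = 0.468, Df Model = 2, Df Residuals = 241.
f_approx = f_statistic_from_r2(0.468, 2, 241)
print(round(f_approx, 1))  # 106.0

# F-test: Prob (F-statistic) from the summary.
print(reject_null(9.67e-34))                            # True

# T-test for the size coefficient (p-value 0.025).
print(reject_null(0.025))                               # True at alpha = 0.05
print(reject_null(0.025, alpha=0.01))                   # False at alpha = 0.01
print(reject_null(0.025, alpha=0.02, one_sided=True))   # True: 0.0125 <= 0.02
```

The rebuilt F-statistic is close to the reported 105.9; the small gap is consistent with R-squared having been rounded to three decimals before the calculation.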