# Section 10 
### Caveat
- Assumptions of a linear regression 
    - linearity
    - homoscedasticity
    - multivariate normality
    - independence of errors
    - lack of multicollinearity
    
### Dummy Variables
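- Do not include a dummy column for every category alongside the intercept: the dummy columns then sum to the constant column, which creates perfect multicollinearity (the dummy variable trap). Drop one category and treat it as the baseline.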
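
To see why keeping every dummy column is a problem, the sketch below builds a small design matrix and checks its rank with NumPy. The five rows are hypothetical data that mirror the two-state `State` column used later in this notebook; the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical data: 1 marks the state each row belongs to.
california = np.array([0, 1, 1, 0, 1])
new_york = 1 - california  # every row is in exactly one of the two states

intercept = np.ones(5)

# Intercept + both dummies: the two dummy columns sum to the intercept column.
both = np.column_stack([intercept, california, new_york])
# Intercept + one dummy: the dropped state becomes the baseline.
one = np.column_stack([intercept, new_york])

rank_both = np.linalg.matrix_rank(both)  # 2, although there are 3 columns
rank_one = np.linalg.matrix_rank(one)    # 2, matching its 2 columns
print(rank_both, rank_one)
```

A rank lower than the number of columns means the coefficients are not uniquely determined, which is why one dummy has to go.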


### Ways to Build a Model
- Garbage in, garbage out: if you feed the model irrelevant variables, its output will be unreliable
- You should be able to explain why each variable predicts the behavior of the dependent variable
- 5 Ways to Build a Model
    - **All-in**: 
        - Include every variable; appropriate when you already know that all of them are true predictors
    - **Backward Elimination**: 
        - First, choose the significance level a predictor must meet to stay in the model (for example 0.05)
        - Fit the model with all of the variables
        - Consider the predictor with the highest p-value
        - If that p-value is above the significance level, remove the predictor
        - Refit the model with the remaining variables (which will change the coefficients and p-values)
        - Repeat until every remaining p-value is below the significance level
    - **Forward Selection**:
        - Select a significance level as the threshold for entering the model
        - Fit a simple regression of the dependent variable on each independent variable separately and select the one with the lowest p-value
        - Keeping that predictor, fit every two-variable regression that adds one of the remaining predictors
        - Keep the added predictor with the lowest p-value
        - Repeat, adding one predictor at a time, until no candidate has a p-value below the threshold
    - **Bidirectional Elimination**:
        - It combines the two previous methods
        - Perform the next step of forward selection to enter a new predictor
        - Then perform all the steps of backward elimination to take out predictors that no longer add any value
        - Repeat until no predictor can enter or leave
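    - **Score Comparison**:
        - Select a criterion (like R-squared or, for comparing models of different sizes, adjusted R-squared)
        - Construct all possible regression models: $2^N - 1$ combinations in total for $N$ candidate variables
        - Select the one with the best value of the criterion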
        
- A drawback of relying on a statistical significance threshold is that the decision is black and white: a predictor just above the cutoff is dropped even if it carries some information

### Using Adjusted R-Squared for Robust Models
- The p-value alone is informative, but we should also check whether adjusted R-squared increases or decreases as variables are added or removed
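
Adjusted R-squared penalizes R-squared for the number of predictors: $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, with $n$ observations and $p$ predictors. The small sketch below uses illustrative numbers (the helper name `adjusted_r2` is hypothetical) to show that adding a predictor without raising R-squared lowers the adjusted value.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative numbers: same R-squared, but the second model carries
# one extra predictor that explains nothing new.
one_predictor = adjusted_r2(0.947, n_obs=50, n_predictors=1)
two_predictors = adjusted_r2(0.947, n_obs=50, n_predictors=2)

print(round(one_predictor, 4), round(two_predictors, 4))
```

If removing a variable makes adjusted R-squared go up, the variable was probably not pulling its weight; if it goes down, the removal may have been too aggressive.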

### Coefficients
- In this example, every additional 1 dollar of R&D spend is associated with an increase in profit of about 0.79 dollars (79 cents), holding the other variables constant


In [1]:
# Automatically reload edited modules (such as BackwardElimination) before running code
%load_ext autoreload
%autoreload 2

In [4]:
import pandas as pd
from pandas import DataFrame

#import statsmodels.formula.api as sm
from scipy import stats
from statsmodels.api import add_constant

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

import BackwardElimination

In [3]:
PATH = '/Users/alexguanga/Downloads/'
df = pd.read_csv(PATH+"50-Startups.csv")

df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | California | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | California | 166187.94 |


In [4]:
# Making Y-Variable
y = df['Profit']

# Including all but one column
Xs = df.loc[:, df.columns != 'Profit']

In [5]:
Xs['State'].unique()  # Only two categories; encode them as dummy variables

array(['New York', 'California'], dtype=object)

In [6]:
state_dummies = pd.get_dummies(df['State'])
state_dummies.head()

| | California | New York |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |


In [7]:
# Drop the original string column and one dummy column (keeping both dummies would create multicollinearity)

Xs = pd.concat([Xs, state_dummies], axis=1)
Xs.drop(['State', 'California'], axis=1, inplace=True) 

Xs.head()

| | R&D Spend | Administration | Marketing Spend | New York |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 1 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 0 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 1 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 0 |
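
As a possible shortcut for the concat-and-drop steps above, `pd.get_dummies` can drop the baseline category itself through its `drop_first` argument. The sketch below uses a small hypothetical frame rather than the startup data; note that the resulting column is named with the `State_` prefix, unlike the `New York` column produced above.

```python
import pandas as pd

# Hypothetical frame mirroring the State column used in this notebook.
demo = pd.DataFrame({"State": ["New York", "California", "California", "New York"]})

# drop_first=True drops the first category in sorted order ("California"),
# leaving a single "State_New York" indicator column.
encoded = pd.get_dummies(demo, columns=["State"], drop_first=True, dtype=int)

print(encoded)
```

Which category is dropped only changes the baseline that the remaining coefficients are measured against, not the fit of the model.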


### Using Backward Elimination
- **Quick look at what the script returns: the final fitted model and the adjusted R-squared values it recorded**

In [8]:
import statsmodels.formula.api as sm
from statsmodels.api import add_constant

In [9]:
from collections import defaultdict

# Names defined at module level are already global, so no `global` statement is needed
dict_adjus_R = defaultdict(list)

In [18]:
# Significance level to use as the p-value threshold
stats_signf = 0.05

final_model, dict_of_AdjusR = BackwardElimination.BackwardElimination(Xs, y, stats_signf, 'Linear')
final_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Sun, 14 Jan 2018 | Prob (F-statistic): | 3.50e-32 |
| Time: | 09:27:37 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.903e+04 | 2537.897 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| R&D Spend | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 165000.0 |
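
The fitted equation can be read off the coefficient table in the summary: profit is approximately `49030 + 0.8543 * R&D Spend`. As a rough worked example using those printed (rounded) coefficients, the sketch below defines a hypothetical helper `predict_profit` and evaluates it at an arbitrary spend of 100,000.

```python
# Coefficients copied from the summary output (rounded as printed).
INTERCEPT = 49030.0  # const, shown as 4.903e+04
RD_SLOPE = 0.8543    # R&D Spend


def predict_profit(rd_spend):
    """Predicted profit from the single-predictor model in the summary."""
    return INTERCEPT + RD_SLOPE * rd_spend


base = predict_profit(100_000)
bumped = predict_profit(100_001)

print(round(base, 2), round(bumped - base, 4))
```

Each additional unit of R&D spend moves the prediction by the slope, regardless of the starting level. This slope differs from the .79 quoted in the Coefficients note, which may come from a different fit.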


In [19]:
dict_of_AdjusR

defaultdict(list,
            {1: [[Index(['R&D Spend'], dtype='object'), 0.94542146849878173],
              [Index(['R&D Spend'], dtype='object'), 0.94542146849878173],
              [Index(['R&D Spend'], dtype='object'), 0.94542146849878173],
              [Index(['R&D Spend'], dtype='object'), 0.94542146849878173],
              [Index(['R&D Spend'], dtype='object'), 0.94542146849878173]]})