# **IEOR E4650  Business Analytics (Fall 2019)**

##**Lecture 5: Buidling a Good Model**

Linear regression is a commonly used technique when analyzing data. In this lecture, we discuss how to build a good model to address the questions. 

Learning objective:

(1) Understand the modeling focus for correlation and prediction

(2) Understand to importance of cross validation

(3) Able to perform cross validation

(4) Able to work on model selection algorithms

## Modeling focus 





Again, let's return to two focuses of modeling:

$$y=\beta_0+\beta_1 x_1 + \epsilon $$

* Correlation/causation

We would like to have a model that allows us to do a good inference on $\beta_1 $.

* Prediction

We would like to have a model that gives us good prediction 


Both correlation and prediction require us to "correctly" specify the model. 
> Is the data clean?

> Is the model correctly specified? 

>>Do we need to take variable transformations? 

>>Is linear model a good choice?

### Special Focus for Causal inference. 

 

 (1) Include factors that impact both $x_1$ and $y$, otherwise, $\beta_1$ will be biased.  

  >>For example, if we want to run the following model
  
  $$\text{Spending_on_clothing}=\beta_0+\beta_1 \text{Spending_on_grocery}+ \epsilon$$ 

  >>We might find $\beta_1$ positive and significant. However, does this mean spending more on grocery lead to more spending on clothing? Well, we are at least over-estimating the influence, because we are missing an impact factor, which is income. Maybe it is also a good idea to add the number of family members in the model. A better model to use is 

  $$\text{Spending_on_clothing}=\beta_0+\beta_1 \text{Spending_on_grocery}+  \beta_2 \text{Income} + \beta_3 \text{Family_members}  + \epsilon$$ 

>> Here, "Income" and "Family_members" are called control variables.

  >>Of course,we also want to check if variable transformations are needed. For example, it is very likely to see a right-skewed distribution for `Spending_on_clothing`, `Spending_on_grocery`, and `Income`. Check the histogram and make a decision. 

  >(2) If we have any factors that influence only the dependent variable, but not our independent variable, you should also include it in your model. It will help with the p-value. 

We will revisit the causal inference in the later lectures.

### Special Focus of Prediction 
  




A good model is the one that has good prediction power on a set of samples that we never touched on before.


####Bias-Variance Trade-off

As model goes from simple to complicated, we risk the danger of moving from underfitting to overfitting. 
* Underfitting: miss the relevant relations between features and target outputs. 
	-> Too much “bias”

"bias" means our prediction is not close enough to the actual value. 

* Overfitting: starts to model the random noise in the training data
	-> Too much “variance”

"variance" means our prediction has lots of variation.

<div>
<img src="https://i.stack.imgur.com/8RlJk.png
" width="250"/>
</div>

[Source: stackexchange.]

Thus, our task is to find a model that balances variance and bias.

<div>
<img src="https://www.researchgate.net/profile/Peter_Thesling2/publication/281821310/figure/fig2/AS:391521762332674@1470357470701/The-Bias-Variance-Tradeoff.png
" width="200"/>
</div>

[Source: ResearchGate]




#### Solution: Cross-Validation

General Idea:  Using the same data for both model training and model evaluating can lead to over-fitting. Thus, we will have separate dataset for model training and the evaluation of the model accuracy.


Split our data into three parts:

* Training
* Validation
* Test

Go through the following model training process

* Step 1: Loop for each candidate model:

> Use training set to train the model

> Calculate the accuracy on validation set

* Step 2: Choose the model with the highest accuracy on the validation set

* Step 3: Report the performance of the final model using test set


<div>
<img src="https://miro.medium.com/max/1484/1*OJVhBtg5YgeW7rKXoxKQxg.png
" width="350"/>
</div>

[Source: Medium]

#### Example of model selection:

Suppose we would like to select a good model for predicting the sales.

Which model is a good model?

 $$SalePrice=\beta_0+\beta_1 GrossSquareFeet + \epsilon $$

 $$SalePrice=\beta_0+\beta_1 GrossSquareFeet+\beta_2 GrossSquareFeet^2 + \epsilon $$

  $$log(SalePrice)=\beta_0+\beta_1 log(GrossSquareFeet) + \epsilon $$

Let's do the following three steps:

1. Split the data into Part 1 (Training), Part 2 (Validation), and Part 3 (Testing)
2. Run three models on Part 1 (Training)
3. Test the model performances on Part 2 (Validation)
4. Select the best model 
5. Report the model performance on Part 3 (Testing)

**In addition, you need to have a standard on what is considered a good model.**

Root Mean Square Error (RMSE):

$RMSE=\sqrt{\frac{\sum(y_i-\widehat{y_i})^2}{N}}$ ---> This one weights error for each point equally

Root Mean Square Percentage Error (RMSPE) [An example can be seen [here](https://www.kaggle.com/c/rossmann-store-sales/overview/evaluation)]:

$RMSPE=\sqrt{\frac{\sum(\frac{\widehat{y_i}-y_i}{y_i})^2}{N}}$ ---> This one gives a lower weights on the errors for observations with larger actual values 

RMSPE is less sensitive to outliers in the testing/training data. However, is not exactly aligning with the OLS objective function, which essentially is minimizing RMSE on the training set.


In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
link="https://drive.google.com/open?id=17Sa-DuRFCWfPzCW6uRbPwxAyo1mQARUn"
_,id=link.split("=")
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('myfile.csv')  
import pandas as pd
import numpy as np
Sales = pd.read_csv('myfile.csv')
from statsmodels.formula.api import ols


In [0]:

Sales=Sales.replace("-",np.nan)
Sales=Sales.dropna()
Sales.shape
Sales.rename(columns={})
Sales=Sales.astype({"GROSS_SQUARE_FEET":"float64","YEAR_BUILT":"float64","LAND_SQUARE_FEET":"float64"}) 

Sales=Sales.rename(columns={"GROSS_SQUARE_FEET":"GSF","LAND_SQUARE_FEET":"LSF","TOTAL_UNITS":"TU","COMMERCIAL_UNITS":"CU","RESIDENTIAL_UNITS":"RU","SALE_PRICE":"SP","YEAR_BUILT":"YB"})

In [0]:
from sklearn.utils import shuffle
Sales= shuffle(Sales)
#900 samples
Sales_training= Sales[:900]
#300 samples
Sales_validation= Sales[900:1200]
#rest
Sales_testing= Sales[1200:]


In [0]:
#Model training on the training set
model1=ols("SP~GSF",data=Sales_training).fit()
model2=ols("SP~GSF+I(GSF**2)",data=Sales_training).fit()
model3=ols("np.log(SP)~np.log(GSF)",data=Sales_training).fit()


#Selecting the best model based the validation set

##First, get the predicted value 
Prediction1=model1.predict(Sales_validation)
Prediction2=model2.predict(Sales_validation)
###Model 3 took used log transformation on the dependent variable. 
###Need to exponentiate the predicted value 
Prediction3=np.exp(model3.predict(Sales_validation))

##Second, get the RMSPE and RMSE
Actual=Sales_validation["SP"]
RMSPE1= (np.mean(((Prediction1-Actual)/Actual)**2))**0.5
RMSPE2= (np.mean(((Prediction2-Actual)/Actual)**2))**0.5
RMSPE3= (np.mean(((Prediction3-Actual)/Actual)**2))**0.5
RMSE1= (np.mean((Prediction1-Actual))**2)**0.5
RMSE2= (np.mean((Prediction2-Actual))**2)**0.5
RMSE3= (np.mean((Prediction3-Actual))**2)**0.5

print(RMSPE1,RMSPE2,RMSPE3)
print(RMSE1,RMSE2,RMSE3)

#Report the final performance of the chosen the model on the test set
Actual=Sales_testing["SP"]
Prediction=np.exp(model3.predict(Sales_testing))
RMSPE= (np.mean(((Prediction-Actual)/Actual)**2))**0.5
RMSE= (np.mean((Prediction-Actual))**2)**0.5
print(RMSPE, RMSE)


13.10348580060805 11.822571201084596 4.697885602766187
262094.16113019158 233953.27991566624 173730.86681698897
1.0369816533120197 977329.7339288273


## Model Selection Algorithm

Besides relying on the expertise. We can also use some model selection algorithms to helps us choose a better model.

Two commonly used model selection algorihm are:

**(1) Step-wise model section**
  This model selection method works on add or remove variables one at a time to until the accuracy on the hold-out samples starts drop. 

  For example, forward model section as follow: 

(1) Select a set of variables 

(2) Loop:
  >Try each variable still left in the set

  >Select the variable that leads to the highest accuracy on the hold-out samples

(3) Break the loop if adding one additional variable leads to decrease in accuracy on the hold-out sample


**(2) Regularization-based model section**

Ridge regression, Lasso regression,and Elastic-net are examples of commonly used algorithms. 

The idea is to add a penalty for complicated models.

**Optimization without penalty:**
$$\sum(y_i-\widehat{y_i})^2$$

**Ridge (L2 regularization):**

$$\sum(y_i-\widehat{y_i})^2+\alpha \sum(\widehat{\beta_k})^2$$

**Lasso (L1 regularization ):**

$$\sum(y_i-\widehat{y_i})^2+\alpha \sum|\widehat{\beta_k}|$$

Both ridge and lasso can help reduce overfit. In addition, Lasso regression can drive certain $\widehat{\beta}$s to 0, which works as a variable selection tool.

**Elastic-Net**
Elastic-net is a weighted combination of Ridge and Lasso:

$$\sum(y_i-\widehat{y_i})^2+\alpha( L \sum|\widehat{\beta_k}|+(1-L) \sum(\widehat{\beta_k})^2)$$

Especially, Elastic-Net reduces to Lasso when L=1, while reduces to Ridge when L=0.

To implement Elastic-net using statsmodel package,use  [fit_regularized()](http://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.fit_regularized.html) instead of fit(). Especially, fit_regularized() has two important arguments. 
* alpha: this corresponds to the $\alpha$ in the formula
* L1_wt: this corresponds to $L$ in the formula 

Here, $\alpha$ and $L$ values can be tuned using cross-validation.






In [0]:
Sales.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'RU', 'CU', 'TU', 'LSF', 'GSF', 'YB', 'SP'], dtype='object')

In [0]:
model=ols("np.log(SP)~np.log(GSF)+RU+C(BOROUGH)+C(NEIGHBORHOOD)",data=Sales).fit_regularized(alpha=0.0001,L1_wt=0) 
params=model.params
params[params!=0]

array([ 9.42120538e+00, -7.55088350e-01, -8.71618637e-02, -4.42400447e-01,
       -6.20399882e-01, -2.89612442e-01,  4.46752199e-02,  8.57288218e-02,
       -7.80583187e-01,  1.94898777e-01, -1.52796522e-03,  2.33595465e-01,
       -2.61461080e-01, -2.04333942e-01,  2.79726438e-01, -1.24370694e-01,
       -1.20049646e-01,  6.51313447e-02,  3.99412397e-01, -9.80305104e-02,
       -1.51230560e-01,  1.07571242e-01, -3.95512354e-01,  9.53365446e-01,
       -4.85170991e-02,  2.72127194e-01, -1.45146606e-01, -3.41567424e-01,
       -3.45500989e-01,  1.22797059e+00, -7.41515512e-01, -1.43812551e-03,
        7.67004326e-01, -2.72543644e-01, -5.55590074e-01, -6.77665106e-01,
        5.54258630e-01, -2.06109148e-01, -1.64169357e-01,  6.27277701e-01,
        3.25041545e-01,  4.78552296e-01,  3.93327105e-02,  5.15285592e-01,
        3.59425332e-01,  1.68935649e-01, -1.90157017e-01,  4.54284971e-02,
       -4.48874796e-01,  1.36287806e-01,  8.63651458e-02,  6.28831274e-01,
       -2.52419183e-02, -

In [0]:
#running OLS without penalty
model_ols=ols("np.log(SP)~np.log(GSF)*RU*C(BOROUGH)*YB",data=Sales_training).fit()
Prediction=np.exp(model_ols.predict(Sales_validation))

Actual=Sales_validation["SP"]
PE=(Prediction-Actual)/Actual
RMSPE= (np.mean(PE**2))**0.5
RMSPE

0.8066959886200192

In [0]:
#running ridge
model=ols("np.log(SP)~np.log(GSF)*RU*C(BOROUGH)*YB",data=Sales_training)\
  .fit_regularized(alpha=0.0000004, L1_wt=0)
Prediction=np.exp(model.predict(Sales_validation))

Actual=Sales_validation["SP"]
PE=(Prediction-Actual)/Actual
RMSPE= (np.mean(PE**2))**0.5
RMSPE

0.7833960806520405

In [0]:
#running lasso
model=ols("np.log(SP)~np.log(GSF)*RU*C(BOROUGH)*YB",data=Sales_training)\
  .fit_regularized(alpha=0, L1_wt=1,start_params=model_ols.params)
Prediction=np.exp(model.predict(Sales_validation))

Actual=Sales_validation["SP"]
PE=(Prediction-Actual)/Actual
RMSPE= (np.mean(PE**2))**0.5
RMSPE

0.8066959831314037

The following code shows why Q7 in HW2 is a valid method.

In [0]:
ols_predict=ols("np.log(SP)~np.log(GSF)",data=Sales_training).fit().predict()
ols_predict[0:5]

array([13.12921003, 13.21717984, 13.91089931, 13.64417272, 13.46904907])

In [0]:
ols1_predict=ols("I(np.log(SP)-np.mean(np.log(SP)))~I((np.log(GSF)-np.mean(np.log(GSF)))/np.std(np.log(GSF)))-1",data=Sales_training).fit().predict()
ols1_predict[0:5]+np.mean(np.log(Sales_training["SP"]))

array([13.12921003, 13.21717984, 13.91089931, 13.64417272, 13.46904907])