# **IEOR E4650  Business Analytics (Fall 2019)**

## **Lecture 4: Building a linear regression model**

In this lecture, we discuss several ways to transform variables in order to improve the predictive power of our model.

Learning objectives:

* Understand different ways to transform the variables.
* Understand how to apply these techniques in real-world settings.
* Understand how to use Python to estimate a model that includes transformed variables.





In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
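link = "https://drive.google.com/open?id=17Sa-DuRFCWfPzCW6uRbPwxAyo1mQARUn"
# Extract the Drive file ID (the part after "=") without shadowing the built-in id().
_, file_id = link.split("=")
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('myfile.csv')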
import pandas as pd
Sales = pd.read_csv('myfile.csv')



In [0]:
from statsmodels.formula.api import ols
import numpy as np
from sklearn.utils import shuffle
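# Shuffle the rows so that the observations are in random order.
Sales = shuffle(Sales)
# Entries recorded as "-" are treated as missing values.
Sales.replace("-", np.nan, inplace=True)
# Cast the numeric columns to float.
Sales = Sales.astype({"GROSS_SQUARE_FEET": "float64", "YEAR_BUILT": "float64", "LAND_SQUARE_FEET": "float64"})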


##  Linear regression model for prediction

Suppose we start from a basic prediction model such as the following:

$y=\beta_0+\beta_1 x_1+ \beta_2 x_2 +\beta_3 x_3 +\epsilon$

Are there ways to build a richer model?

Yes. Here are four of the most popular methods:

* polynomial terms
* log transformation
* interaction terms
* converting a categorical variable into dummy variables.

### Including Polynomial Terms

If we suspect that the impact of a variable on the dependent variable is not linear, we can add polynomial terms.

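For example, we might want to use the amount of time a sales agent spends talking to clients to predict the agent's sales. We might not expect the effect of time on sales to be linear; instead, we might expect the marginal return of time on sales to be diminishing. In this case, we can construct a model as follows:

$Sales=\beta_0+\beta_1 \cdot time+\beta_2 \cdot time^2+\epsilon$

Diminishing returns correspond to $\beta_1>0$ and $\beta_2<0$.

To include polynomial terms in Python, we wrap each term in `I()` so that `**` is interpreted as exponentiation. For example, a model with both a squared and a cubed term of `x1` is:

`ols(formula="y~x1+I(x1**2)+I(x1**3)",data=S1)`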
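As a rough illustration of the sales example, the sketch below fits the quadratic model on synthetic data. The data, the column names `time` and `sales`, and the coefficient values are all made up for this example and are not taken from the course data set.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic (hypothetical) data: sales rise with time, but with diminishing returns.
rng = np.random.default_rng(0)
time = rng.uniform(0, 10, size=200)
sales = 5 + 4 * time - 0.3 * time**2 + rng.normal(0, 1, size=200)
S1 = pd.DataFrame({"time": time, "sales": sales})

# I(...) tells the formula parser to evaluate time**2 as ordinary arithmetic.
poly_fit = ols(formula="sales ~ time + I(time**2)", data=S1).fit()
print(poly_fit.params)
```

A negative estimate on the squared term is what we would expect to see when the marginal return of time is diminishing.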

### Including Logarithm Terms

Usually, if the distribution of a variable exhibits a long right tail, it can be useful to take the log transformation. Otherwise, the extreme values (which we call outliers) might influence our prediction too much.

For example, if we want to perform a log transformation on $x_2$, we can use
$$y=\beta_0+\beta_1 x_1+\beta_2 \log(x_2)+\epsilon$$


In order to take the log transformation, the variable must be strictly positive. If the variable is non-negative but can equal zero, we can add a small value to it as a remedy, for example by using $\log(x+1)$.

To perform log-transformation in Python, we can simply use

`ols(formula="y~x1+np.log(x2)",data=S1)`
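The following sketch shows both cases on synthetic data: `np.log` for a strictly positive variable and `np.log1p` (which computes $\log(x+1)$) for a variable that can be zero. The variables `x1`, `x2`, `x3` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 300
S1 = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.lognormal(mean=0.0, sigma=1.0, size=n),  # strictly positive, right-skewed
    "x3": rng.poisson(lam=2.0, size=n),                # counts, can be zero
})
S1["y"] = (2 + 1.5 * S1["x1"] + 3 * np.log(S1["x2"])
           + 0.8 * np.log1p(S1["x3"]) + rng.normal(0, 1, size=n))

# np.log and np.log1p can be called directly inside the formula string.
log_fit = ols(formula="y ~ x1 + np.log(x2) + np.log1p(x3)", data=S1).fit()
print(log_fit.params)
```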



In [0]:
import matplotlib.pyplot as plt



One thing to notice: if we perform a log transformation on the dependent variable, the model predicts $\log(y)$ rather than $y$. Once we have made our prediction, we need to exponentiate the predicted value to get back to the original scale.
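A minimal sketch of this back-transformation, assuming a synthetic data set with a single hypothetical predictor `x1` and a positive outcome `y`:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 300
S1 = pd.DataFrame({"x1": rng.normal(size=n)})
# Hypothetical outcome: log(y) is linear in x1, so y itself is always positive.
S1["y"] = np.exp(1 + 0.5 * S1["x1"] + rng.normal(0, 0.2, size=n))

# The dependent variable is log-transformed inside the formula.
logy_fit = ols(formula="np.log(y) ~ x1", data=S1).fit()

# predict() returns values on the log scale; exponentiate to recover the scale of y.
new_data = pd.DataFrame({"x1": [0.0, 2.0]})
pred_log = logy_fit.predict(new_data)
pred_original = np.exp(pred_log)
print(pred_original)
```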

### Including Interaction Terms

When we expect that the impact of one variable on the dependent variable might be affected by the value of another variable, we can use an interaction term. For example, if we expect the influence of $x_1$ on $y$ to vary with the value of $x_2$, we might want to use

$$y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$$


For example, we might want to model the customer satisfaction level for our hot dogs. We might use the number of condiments we provide and the number of sausage options we provide as the predictors. If we expect that consumers will care less (or more) about the number of condiments when we give them more sausage options, then we have a good reason to include the interaction term.


To include an interaction term in Python, we can use `x1:x2`.

`ols(formula="y~x1+x2+x1:x2",data=S1)`

When `x1`, `x2`, and `x1:x2` are all in the model, we can equivalently write `x1*x2`.
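The equivalence of the two ways of writing the formula can be checked directly. The sketch below uses synthetic data with hypothetical variables `x1` and `x2`; the coefficient values are arbitrary.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
n = 200
S1 = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
S1["y"] = (1 + 2 * S1["x1"] - S1["x2"]
           + 0.5 * S1["x1"] * S1["x2"] + rng.normal(0, 1, size=n))

# Spelled out: both main effects plus the interaction term.
fit_long = ols(formula="y ~ x1 + x2 + x1:x2", data=S1).fit()
# Shorthand: x1*x2 expands to x1 + x2 + x1:x2.
fit_short = ols(formula="y ~ x1*x2", data=S1).fit()

print(fit_long.params)
print(np.allclose(fit_long.params, fit_short.params))
```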

### Converting Categorical Terms into Dummy Variables

A variable might be categorical, which means that the value of the variable indicates which category an observation belongs to.

* If a categorical variable can only take the values 0/1, indicating whether or not an observation belongs to the category, we also call it a dummy variable.

* If a categorical variable can take multiple values (say N categories), we need to transform it into N-1 dummy variables and include them in the regression model.

Suppose we have a categorical variable that can take three values, then we can run the following regression:

$$y=\beta_0+\beta_1 category_2 +\beta_2 category_3 +\epsilon$$

Notice that we can include only 2 (more generally, N-1) of the categories, since whether an observation belongs to the first category can be perfectly inferred from whether it belongs to the second or the third category. Including all N dummies together with the intercept would make the regressors perfectly collinear.



In our example, "Borough" can be considered a categorical variable. In this case, we can convert the categorical variable into several dummy variables. Each dummy variable indicates whether an observation belongs to a specific category.

Suppose `x2` is a categorical variable. We can use `C(x2)` in the formula to convert it into dummy variables.

`ols(formula="y~x1+C(x2)",data=S1)`
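To see what `C()` produces, the sketch below fits a model on synthetic data in which a hypothetical categorical variable `x2` takes the three values `"A"`, `"B"`, and `"C"`. The data and the category effects are invented for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(4)
n = 300
S1 = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.choice(["A", "B", "C"], size=n),  # categorical with N = 3 levels
})
# Hypothetical category effects relative to the baseline category "A".
effect = S1["x2"].map({"A": 0.0, "B": 2.0, "C": -1.0})
S1["y"] = 1 + 0.5 * S1["x1"] + effect + rng.normal(0, 1, size=n)

# C(x2) expands x2 into N-1 = 2 dummy variables; "A" is absorbed into the intercept.
cat_fit = ols(formula="y ~ x1 + C(x2)", data=S1).fit()
print(cat_fit.params)
```

The fitted model has an intercept, a coefficient on `x1`, and two dummy coefficients labelled `C(x2)[T.B]` and `C(x2)[T.C]`. Each dummy coefficient measures the difference from the baseline category `"A"`.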

# Activity

Based on the data we have (feel free to use all 3000 observations or discard some), construct a model that gives the lowest RMSE.
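As a reminder of how RMSE (root mean squared error) can be computed, here is a minimal sketch on synthetic data. It assumes the model is fitted on one part of the data and evaluated on a held-out part; the variable names, the 200/100 split, and the data are hypothetical and are not the course data set.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(0, 1, size=n)

# Hold out the last 100 rows for evaluation (the rows are already in random order).
train, test = df.iloc[:200], df.iloc[200:]

fit = ols(formula="y ~ x1", data=train).fit()
pred = fit.predict(test)

# RMSE: square root of the average squared prediction error.
rmse = float(np.sqrt(np.mean((test["y"] - pred) ** 2)))
print(rmse)
```

A smaller RMSE on held-out data indicates better predictions, so competing models can be compared by swapping in different formulas.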