# Homework 3

1. Python users, use Python's pandas library to read in the Stata-formatted dataset used in class called "SI Sales.dta".  R users, use R's foreign library to read in the Stata-formatted dataset called "SI Sales Old.dta".  Replicate all of the regression results for this dataset that I presented in class.  

2. There is an additional feature in this dataset called "sales_year", which captures the year the sale of a house in Staten Island occurred.  From this feature, generate a feature that is a linear time trend.  (A linear time trend is a feature that takes on value "1" in the initial year and increments by "1" each subsequent year.  For example, if 2003 were "1", 2004 would be "2", 2005 would be "3", and so forth.)  Run a linear regression model that relates the sales price to unit size, land size, age, the Todt Hill indicator, and the linear time trend.  How would you interpret the estimated coefficient associated with the linear time trend?  What is the 95% confidence interval of your interpretation?  Based on your regression diagnostics, have you improved the fit of the house price sales data by including the linear time trend as an additional explanatory feature?  

3. As noted in class, the unit size and land size features are measured in square meters.  Suppose I ask you to re-express these features using the Imperial system of square feet rather than square meters, but I express a concern that the interpretation of the estimated coefficients, such as age, would be changed.  Without actually doing any statistical learning, what would you say to me about my concern?  Rerun the linear regression in 2. using the dwelling size and land size measured in square feet (rather than square meters).  What, if anything, has changed in your estimated coefficients?

4. (Challenging question.  Feel free to work together to the extent that it assists you.)  Assume the following data generating process (DGP) governs a random sample of size 10,000: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$  for $\epsilon_i \sim N(0,1)$.  Further assume for this DGP that  $\beta_0=\beta_1=\beta_2=1$.  (a) Suppose the following process governs your features: $x_{1i} \sim N(0,1)$ and $x_{2i} \sim N(0,1)$ are independent.  Using R or Python, calculate the correlation between features $x_{1i}$ and $x_{2i}$.  Mistakenly, you decide to estimate a linear regression that includes only the feature $x_{1i}$.  Using R or Python, simulate this DGP and run the mistaken linear regression that includes only feature $x_{1i}$.  What value do you obtain for the coefficient associated with feature $x_{1i}$?  (b) Suppose instead that the following process governs your features: $x_{1i}=z_i+\eta_i$ and $x_{2i}=-z_i+\omega_i$, where $z_i \sim N(0,1)$, $\eta_i \sim N(0,1)$, and $\omega_i \sim N(0,1)$ are independent.  Using R or Python, calculate the correlation between features $x_{1i}$ and $x_{2i}$.  Again, you mistakenly decide to estimate a linear regression that includes only the feature $x_{1i}$.  Using R or Python, simulate this DGP and run the mistaken linear regression that includes only feature $x_{1i}$.  What value do you obtain for the coefficient associated with feature $x_{1i}$?  (c) Are there any conclusions you can draw from your results in (a) and (b)? 

5.  Spend time working on your proposal for the Foundations Project.

For questions 1 through 4 above, submit code and results.

In [1]:
%pylab inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

Populating the interactive namespace from numpy and matplotlib


In [2]:
sales = pd.read_stata('SI Sales Old.dta')

In [15]:
def run_regression(data, formula):
    mod = smf.ols(formula=formula, data=data).fit()
    print(mod.summary())

In [16]:
# Question 1
# Set up additional variables
sales['priceper1000'] = sales['price']/1000
sales['logprice'] = np.log(sales['price'])
sales['logunit'] = np.log(sales['unit_size'])
sales['logland'] = np.log(sales['land_size'])
sales['logage'] = np.log(sales['age']+1)

# Regressions to run
formulas = [
    'price ~ unit_size', 
    'price ~ unit_size + land_size', 
    'price ~ unit_size + land_size + age', 
    'price ~ unit_size + land_size + age + todt',
    'priceper1000 ~ unit_size + land_size + age + todt',
    'logprice ~ logunit + logland + logage + todt'
]

for formula in formulas:
    run_regression(sales, formula)

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.302
Model:                            OLS   Adj. R-squared:                  0.302
Method:                 Least Squares   F-statistic:                 1.372e+04
Date:                Wed, 30 Sep 2015   Prob (F-statistic):               0.00
Time:                        19:17:48   Log-Likelihood:            -4.3166e+05
No. Observations:               31680   AIC:                         8.633e+05
Df Residuals:                   31678   BIC:                         8.633e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    1.35e+05   2542.335     53.118      0.0

In [17]:
# Question 2
sales['linear_time_trend'] = sales['sales_year'] - sales['sales_year'].min() + 1
run_regression(sales, 'price ~ unit_size + land_size + age + todt + linear_time_trend')

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.412
Model:                            OLS   Adj. R-squared:                  0.412
Method:                 Least Squares   F-statistic:                     4431.
Date:                Wed, 30 Sep 2015   Prob (F-statistic):               0.00
Time:                        19:18:31   Log-Likelihood:            -4.2896e+05
No. Observations:               31680   AIC:                         8.579e+05
Df Residuals:                   31674   BIC:                         8.580e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept          1.435e+05   3295.43

The coefficient of the linear time trend is 6325, with a 95% confidence interval from 5586 to 7064.  This number can be interpreted as the average increase in sale price per additional year, holding unit size, land size, age, and the Todt Hill indicator fixed; it reflects overall price appreciation in the housing market over the period covered by this dataset.  The fit has been improved slightly by including the linear time trend, as seen in the 0.006 increase in the R-squared.  
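
As a hedged illustration of where numbers like these come from, the sketch below pulls the trend coefficient, its 95% confidence interval, and the adjusted R-squared directly from fitted statsmodels results instead of reading them off the printed summary. The data frame `demo` and all of its values are synthetic stand-ins invented for this example (they are not the Staten Island data), and the model is deliberately smaller than the one in Question 2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the sales data (hypothetical values, NOT the real dataset)
rng = np.random.RandomState(0)
n = 5000
demo = pd.DataFrame({
    'unit_size': rng.uniform(50, 300, n),
    'sales_year': rng.randint(2003, 2011, n),  # years 2003..2010
})
# Same construction as in Question 2: first year maps to 1, then +1 per year
demo['linear_time_trend'] = demo['sales_year'] - demo['sales_year'].min() + 1
demo['price'] = (100000 + 1500 * demo['unit_size']
                 + 6000 * demo['linear_time_trend']
                 + rng.normal(0, 50000, n))

# Fit the model with and without the trend to compare the fit
without_trend = smf.ols('price ~ unit_size', data=demo).fit()
with_trend = smf.ols('price ~ unit_size + linear_time_trend', data=demo).fit()

# Point estimate and 95% confidence interval for the trend coefficient
trend_coef = with_trend.params['linear_time_trend']
ci_low, ci_high = with_trend.conf_int(alpha=0.05).loc['linear_time_trend']

print('trend coefficient: %.1f' % trend_coef)
print('95%% CI: [%.1f, %.1f]' % (ci_low, ci_high))
print('adj. R-squared: %.4f -> %.4f'
      % (without_trend.rsquared_adj, with_trend.rsquared_adj))
```

Comparing the adjusted R-squared (rather than the plain R-squared) is one reasonable choice here, since the plain R-squared can never fall when a feature is added.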

Question 3:
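There should be no change in any of the coefficients other than those on the two size features, because square feet and square meters differ only by a constant multiplicative factor.  The coefficients on the size features should each be divided by that factor (about 10.764 square feet per square meter), so the implied price effect of a given physical area stays the same.  The t-statistics, R-squared, and the interpretation of the other coefficients, such as age, are unaffected.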

I might also say "C'mon Tim, stop your whining - this is _'Murica_ and we use feet here."
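
The rescaling argument can be checked on toy data before touching the real regression. The sketch below is only an illustration: the data frame `toy`, its column names, and the coefficients used to generate it are all made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

SQFT_PER_SQM = 10.764  # square feet in one square meter (rounded)

# Toy data with a size feature in square meters (hypothetical values)
rng = np.random.RandomState(1)
n = 2000
toy = pd.DataFrame({
    'size_m2': rng.uniform(50, 300, n),
    'age': rng.uniform(0, 100, n),
})
toy['price'] = (50000 + 2000 * toy['size_m2'] - 500 * toy['age']
                + rng.normal(0, 20000, n))
toy['size_ft2'] = toy['size_m2'] * SQFT_PER_SQM

fit_m2 = smf.ols('price ~ size_m2 + age', data=toy).fit()
fit_ft2 = smf.ols('price ~ size_ft2 + age', data=toy).fit()

# The size coefficient shrinks by the conversion factor ...
ratio = fit_m2.params['size_m2'] / fit_ft2.params['size_ft2']
print('size coefficient ratio (m2 / ft2):', ratio)
# ... while the age coefficient, the t-statistic, and R-squared do not move
print('age coefficients:', fit_m2.params['age'], fit_ft2.params['age'])
print('size t-values:', fit_m2.tvalues['size_m2'], fit_ft2.tvalues['size_ft2'])
print('R-squared:', fit_m2.rsquared, fit_ft2.rsquared)
```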

In [19]:
ftpm = 10.764  # square feet per square meter
sales['unit_size_ft'] = sales['unit_size']*ftpm
sales['land_size_ft'] = sales['land_size']*ftpm
run_regression(sales, 'price ~ unit_size_ft + land_size_ft + age + todt + linear_time_trend')

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.412
Model:                            OLS   Adj. R-squared:                  0.412
Method:                 Least Squares   F-statistic:                     4431.
Date:                Wed, 30 Sep 2015   Prob (F-statistic):               0.00
Time:                        19:28:53   Log-Likelihood:            -4.2896e+05
No. Observations:               31680   AIC:                         8.579e+05
Df Residuals:                   31674   BIC:                         8.580e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept          1.435e+05   3295.43

In [24]:
# Question 4
ss = 10000
beta = 1
data = pd.DataFrame({
    'e': np.random.randn(ss),
    'x1': np.random.randn(ss),
    'x2': np.random.randn(ss),
})
data['y'] = beta + beta*data['x1'] + beta*data['x2'] + data['e']
print(data.corr())

run_regression(data, 'y ~ x1')

           e        x1        x2         y
e   1.000000  0.000421  0.019491  0.585974
x1  0.000421  1.000000 -0.009133  0.570217
x2  0.019491 -0.009133  1.000000  0.582071
y   0.585974  0.570217  0.582071  1.000000
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.325
Model:                            OLS   Adj. R-squared:                  0.325
Method:                 Least Squares   F-statistic:                     4817.
Date:                Wed, 30 Sep 2015   Prob (F-statistic):               0.00
Time:                        19:48:27   Log-Likelihood:                -17779.
No. Observations:               10000   AIC:                         3.556e+04
Df Residuals:                    9998   BIC:                         3.558e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                     

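The coefficient is .9913, with a 95% confidence interval including the true value of 1.  Because x1 and x2 are independent (sample correlation of about -.009), leaving x2 out of the regression does not bias the estimate for x1.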

In [28]:
z = np.random.randn(ss)
data = pd.DataFrame({
    'e': np.random.randn(ss),
    'x1': z + np.random.randn(ss),
    'x2': -z + np.random.randn(ss),
})
data['y'] = beta + beta*data['x1'] + beta*data['x2'] + data['e']
print(data.corr())

run_regression(data, 'y ~ x1')

           e        x1        x2         y
e   1.000000  0.014828 -0.010294  0.577767
x1  0.014828  1.000000 -0.497248  0.417628
x2 -0.010294 -0.497248  1.000000  0.403428
y   0.577767  0.417628  0.403428  1.000000
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.174
Model:                            OLS   Adj. R-squared:                  0.174
Method:                 Least Squares   F-statistic:                     2112.
Date:                Wed, 30 Sep 2015   Prob (F-statistic):               0.00
Time:                        19:53:06   Log-Likelihood:                -18785.
No. Observations:               10000   AIC:                         3.757e+04
Df Residuals:                    9998   BIC:                         3.759e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                     

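This time the coefficient is only .5.  The difference is that x1 and x2 are now correlated (about -.5), so when x2 is left out of the regression, x1 picks up part of its effect and OLS can no longer isolate the influence of x1 alone on y.  This is omitted-variable bias: the estimate converges to $\beta_1 + \beta_2 \, Cov(x_1, x_2)/Var(x_1) = 1 + 1 \cdot (-1/2) = 0.5$ rather than to the true value of 1.  The conclusion from (a) and (b) is that omitting a relevant feature biases the remaining coefficient only when the omitted feature is correlated with the included one.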
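
The .5 can also be checked against the textbook omitted-variable-bias expression, in which the slope from the short regression tends to $\beta_1 + \beta_2 \, Cov(x_1, x_2)/Var(x_1)$. The sketch below is a minimal numpy-only check under the DGP from part (b); the seed and variable names are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.RandomState(42)  # arbitrary seed, for reproducibility only
n = 10000
z = rng.randn(n)
x1 = z + rng.randn(n)
x2 = -z + rng.randn(n)
y = 1 + x1 + x2 + rng.randn(n)   # beta_0 = beta_1 = beta_2 = 1

# Slope from the "mistaken" short regression of y on x1 alone
short_slope = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# Omitted-variable-bias prediction: beta_1 + beta_2 * Cov(x1, x2) / Var(x1)
delta = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)
predicted_slope = 1 + 1 * delta

print('short-regression slope:', short_slope)
print('OVB prediction:        ', predicted_slope)
print('corr(x1, x2):          ', np.corrcoef(x1, x2)[0, 1])
```

In the population, $Cov(x_1, x_2) = -Var(z) = -1$ and $Var(x_1) = 2$, so both printed slopes should sit near 0.5 and the correlation near -0.5, up to sampling noise.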