## Project 1: Housing Price Prediction

<div class="alert alert-block alert-danger">
<b>Due: 9:29am, Tuesday, 21 January 2020</b>
</div>

Price prediction is one of the key ingredients in market design and
market competition. An important feature of the price in a competitive
market is that it arises as an outcome of the market equilibrium where
supply is equal to demand. As a result, factors that may be good
predictors of the price, such as the volume of market sales, do not
necessarily have a causal relationship with it.

The distinction between factors that have good predictive power and
causal factors becomes particularly important when prediction is
needed in changing market settings. For instance, a change in
average disposable consumer income shifts market demand, which changes
both prices and the volume of sales. Accounting for
market conditions, substitution between products, and variation in
consumer demographics makes such predictions even more challenging in
complex markets, such as the market for real estate.
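
The point about predictors versus causes can be illustrated with a small simulation. The sketch below assumes a hypothetical linear demand and supply system (the coefficients `a`, `b`, `c`, `d` and the income shock are illustrative and are not taken from the Zillow data). Only income varies, yet price and sales volume end up almost perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear market (all coefficients are illustrative assumptions):
#   demand: Q = a + income - b * P
#   supply: Q = c + d * P
a, b, c, d = 100.0, 2.0, 10.0, 1.0
income = rng.normal(0.0, 5.0, size=1000)  # demand shifter

# Equilibrium: set demand equal to supply, solve for P, then recover Q.
price = (a + income - c) / (b + d)
volume = c + d * price

# Volume "predicts" price very well, but both are driven by the income shock.
corr = np.corrcoef(price, volume)[0, 1]
print("corr(price, volume):", round(corr, 3))
```

In this toy market, a regression of price on volume would fit well, but it would say nothing about what happens to price if volume were changed by some other mechanism, such as a shift in supply.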

In your first project you are tasked with constructing a model for predicting the median price of a detached 3-bedroom single-family home in the US. We provide some starting code to give you an example of how to analyze the Zillow data. Then, you need to perform your own analysis to answer the given questions. Your answers should be in the form of a clear argument that includes both well-written prose and code, together with its results (when the notebook is run).

<div class="alert alert-block alert-info">
You and your team members should work together on this
assignment. All team members should fully understand everything you
submit. If there are parts you understand quickly but that are new to your
partners, it is your responsibility to explain them to your partners
until everyone understands. If there are parts that your partners
understand quickly but that are new to you, it is your responsibility
to insist that your partners explain things to you until you
understand them well.
</div>

### Data

We will use data provided by Zillow: https://www.kaggle.com/c/zillow-prize-1/

<div class="alert alert-block alert-warning">
Download the data you need for this assignment from:
Collab/Resources/Datasets
</div>

The downloaded zip file contains two `.csv` (comma-separated values) files and one Excel data dictionary file. Unzip it to extract the CSV files into a directory of your choice.

### Libraries

You will find it useful to install several relevant libraries for this project (which will also be useful for later projects). 

We recommend using these libraries (but you are welcome to use any open source libraries you prefer):

- [pandas](https://pandas.pydata.org/) (Python Data Analysis Library):
````
conda install pandas
````

- [numpy](http://www.numpy.org/) (if you installed Anaconda, this should already be installed; if not, follow the directions there)

- [StatsModels](https://www.statsmodels.org/stable/index.html) 
````
conda install statsmodels
````

<div class="alert alert-block alert-warning">
All members of your team should set up the data and these libraries on their own machines, so you can each run things locally. You should also decide on a way to share the `project1.ipynb` file (it's up to you how to do this, but recommended options include Dropbox, Google Drive, or a shared private GitHub repository).
</div>

Import the libraries and give them abbreviated names:

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

### Example Analysis

We choose two consecutive months arbitrarily. We use the first month's data to fit our regression model, and then test the model's performance on data from the next month. Here, we use July 2017 and August 2017.

In [2]:
month1 = [7] # month1 = July
month2 = [8] # month2 = August

The total taxable value of a property is typically a good proxy for its market value. We use it as the outcome variable.

In [3]:
# load the data, use the directory where you saved the data
df_properties = pd.read_csv('properties_2017.csv') 
df_train = pd.read_csv('train_2017.csv', parse_dates=["transactiondate"])

df_train['transactionmonth'] = df_train['transactiondate'].dt.month # create a new column indicating the month
df_merged = pd.merge(df_train, df_properties, on='parcelid', how='left') # merge two loaded files
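
A left merge keeps every transaction, including any whose `parcelid` has no match in the properties file; those rows get missing values in the property columns. The sketch below uses two small made-up frames (`sales` and `props` are hypothetical stand-ins, not the real files) to show one way of checking this with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Made-up stand-ins for the transactions and properties tables.
sales = pd.DataFrame({"parcelid": [1, 2, 2, 9], "logerror": [0.1, -0.2, 0.05, 0.3]})
props = pd.DataFrame({"parcelid": [1, 2, 3], "bedroomcnt": [3, 2, 4]})

# validate raises an error if parcelid is duplicated on the properties side;
# indicator adds a "_merge" column recording where each row came from.
merged = pd.merge(sales, props, on="parcelid", how="left",
                  validate="many_to_one", indicator=True)

unmatched = (merged["_merge"] == "left_only").sum()
print(merged["_merge"].value_counts())
print("transactions without property data:", unmatched)
```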


Split the dataframe into `month1` and `month2`:

In [4]:
df_month1 = df_merged[df_merged['transactionmonth'].isin(month1)]  # save the data of July
df_month2 = df_merged[df_merged['transactionmonth'].isin(month2)]  # save the data of August

# Be careful with missing values. Here we replace them with 0 only because
# this example uses just the bedroom count as a predictor.
df_month1 = df_month1.fillna(0)
df_month2 = df_month2.fillna(0)
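
Replacing every missing value with 0 is a shortcut: a missing tax value, for example, would be treated as a price of 0. A minimal sketch of an alternative, using a made-up frame `toy` in place of `df_month1`, is to count the missing entries first and then drop only the rows that lack the columns the model actually uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_month1; the values are made up for illustration.
toy = pd.DataFrame({
    "bedroomcnt": [3, 2, np.nan, 4],
    "taxvaluedollarcnt": [300000.0, np.nan, 250000.0, 500000.0],
})

# Count missing entries per column before deciding how to handle them.
print(toy.isna().sum())

# Alternative to fillna(0): keep only rows where the model's columns are present.
cols = ["bedroomcnt", "taxvaluedollarcnt"]
toy_clean = toy.dropna(subset=cols)
print(len(toy), "->", len(toy_clean), "rows")
```

Whether dropping or imputing is more appropriate depends on the variable and on why it is missing, so this choice is worth discussing in your write-up.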

For each month, define the independent variable (`x`) used for prediction and the dependent variable (`y`).

In [5]:
y_month1 = df_month1['taxvaluedollarcnt']
x_month1 = df_month1['bedroomcnt']
y_month2 = df_month2['taxvaluedollarcnt']
x_month2 = df_month2['bedroomcnt']

Now fit the linear regression model on `month1`'s data:

In [6]:
model = sm.OLS(y_month1, x_month1).fit()  # fit the model (sm.OLS adds no intercept unless x contains a constant column)
print(model.summary())      # print a summary of results

                            OLS Regression Results                            
Dep. Variable:      taxvaluedollarcnt   R-squared:                       0.430
Model:                            OLS   Adj. R-squared:                  0.430
Method:                 Least Squares   F-statistic:                     7164.
Date:                Wed, 15 Jan 2020   Prob (F-statistic):               0.00
Time:                        09:43:18   Log-Likelihood:            -1.3960e+05
No. Observations:                9490   AIC:                         2.792e+05
Df Residuals:                    9489   BIC:                         2.792e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
bedroomcnt   1.57e+05   1855.121     84.638      0.0
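
Because `sm.OLS` does not add an intercept on its own, the fit above is a regression through the origin, and for such models statsmodels reports an uncentered R-squared (the baseline is the sum of squared `y`, not the variance around the mean). This is one likely reason why the 0.430 in the summary is not directly comparable with the R-squared computed on the August data further down. The sketch below reproduces the effect with NumPy on synthetic data (the numbers are made up); in statsmodels, an intercept can be included by passing `sm.add_constant(x)` as the regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(1, 6, size=500).astype(float)             # synthetic "bedroom counts"
y = 200000.0 + 50000.0 * x + rng.normal(0, 150000.0, 500)  # synthetic "values"

# Regression through the origin: a single slope, no intercept.
slope0 = (x @ y) / (x @ x)
resid0 = y - slope0 * x

# Regression with an intercept: add a column of ones to the design matrix.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid1 = y - X @ beta

ss_centered = np.sum((y - y.mean()) ** 2)
r2_uncentered = 1 - (resid0 @ resid0) / (y @ y)      # baseline: predict 0
r2_centered0 = 1 - (resid0 @ resid0) / ss_centered   # same fit, baseline: predict the mean
r2_centered1 = 1 - (resid1 @ resid1) / ss_centered   # intercept model

print("no intercept, uncentered R2:", round(r2_uncentered, 3))
print("no intercept, centered R2:  ", round(r2_centered0, 3))
print("with intercept, centered R2:", round(r2_centered1, 3))
```

The uncentered figure is always at least as large as the centered one for the same fit, so it can make a no-intercept model look better than it is.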

Predict the taxable value of homes sold in `month2` using the model fitted above:

In [7]:
yhat = model.predict(x_month2)

Calculate the mean squared error (MSE), the variance, and the R-squared:

In [8]:
# MSE
serror = np.square(y_month2 - yhat)
mse = np.mean(serror)
print("MSE: ", mse)

# Sample Variance
ybar = y_month2.mean()
variance = np.mean((y_month2 - ybar)**2)
print("Variance: ", variance)

# Share of variance explained by the model
# (this is the out-of-sample R-squared)
rsq = 1 - (mse / variance)
print("R-squared: ", rsq)

('MSE: ', 309344169217.8261)
('Variance: ', 329449869767.6554)
('R-squared: ', 0.061028102891674596)
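
Since the same three quantities are needed for every model you try, it may help to wrap them in a small function. The helper below is a sketch (the name `evaluate_predictions` is hypothetical) that uses the same definitions as the cell above; the inputs in the example calls are made-up numbers.

```python
import numpy as np

def evaluate_predictions(y_true, y_pred):
    """Return (mse, variance, r_squared) with the definitions used above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return mse, variance, 1.0 - mse / variance

# Made-up numbers, only to show the call.
y_obs = [1.0, 2.0, 3.0, 4.0]
mse, variance, rsq = evaluate_predictions(y_obs, [1.5, 2.0, 2.5, 4.0])
print(mse, variance, rsq)

# Predicting the sample mean for every observation gives an R-squared of 0.
_, _, rsq_mean = evaluate_predictions(y_obs, [2.5] * 4)
print(rsq_mean)
```

With this definition, an R-squared near 0 means the model does little better than always predicting the mean, and a negative value means it does worse.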


### Assignment

You should complete the assignment by inserting cells in the notebook with your answers to these questions, including both prose and code you used for your analysis.

<div class="alert alert-block alert-warning">
Construct and estimate a linear regression model to predict the taxable value of 3-bedroom homes.
</div>

Provide an argument for which variables can potentially be good predictors of the value of interest, and try to estimate the linear regression with all of those variables included. Discuss which models you considered estimating but decided to discard, and why. Present and discuss the outcome of that estimation.
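
As a starting point, the mechanics of restricting the sample to 3-bedroom homes and regressing on several predictors might look like the sketch below. It runs on a made-up frame `toy`, and the predictor columns `bathroomcnt` and `calculatedfinishedsquarefeet` are assumptions about the Zillow schema that you should check against the data dictionary. The fit uses NumPy least squares with an explicit intercept column; the choice of predictors here is purely illustrative and is not a recommended model.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one month of merged data; all values are made up.
toy = pd.DataFrame({
    "bedroomcnt": [3, 3, 2, 3, 4, 3, 3],
    "bathroomcnt": [2.0, 1.0, 1.0, 2.5, 3.0, 2.0, 1.5],
    "calculatedfinishedsquarefeet": [1500, 1200, 900, 1800, 2400, 1600, 1300],
    "taxvaluedollarcnt": [420000, 310000, 250000, 510000, 700000, 450000, 350000],
})

# Keep only 3-bedroom homes.
three_bed = toy[toy["bedroomcnt"] == 3]
predictors = ["bathroomcnt", "calculatedfinishedsquarefeet"]

# Design matrix with an intercept column followed by the predictors.
X = np.column_stack([np.ones(len(three_bed)),
                     three_bed[predictors].to_numpy(dtype=float)])
y = three_bed["taxvaluedollarcnt"].to_numpy(dtype=float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["const"] + predictors, coef.round(2))))
```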


   

<div class="alert alert-block alert-info">Replace with your answers here</div>

Compute the in-sample mean squared error of your prediction, i.e. the average of the squared deviations between the price predicted by your model and the price actually observed in the data. Compare the mean squared error of your model with the empirical variance of the price of 3-bedroom houses. How much of that price variance is explained by your model in percentage terms?


<div class="alert alert-block alert-info">Replace with your answers here</div>

How does your analysis relate to the idea of generalization performance discussed in the introduction to V. Vapnik's _The Nature of Statistical Learning Theory_?


<div class="alert alert-block alert-info">Replace with your answers here</div>