## Project 2: Kaggle Zillow prize challenge replication

<div class="alert alert-block alert-danger">
<b>Due: 9:29am, Tuesday, 4 February 2020</b>
</div>

In your second project, you are tasked with building a model to improve on the Zestimate residual error, using the 2017 data from the Kaggle Zillow Prize competition. The following description is adapted from Kaggle. For more details, see: https://www.kaggle.com/c/zillow-prize-1/

### Data

You can only use the following data.

<div class="alert alert-block alert-warning">
Download the data you need for this assignment from:
Collab/Resources/Datasets
</div>

The download is a zipped file containing two `.csv` (comma-separated values) files and one Excel data dictionary file. Unzip it to extract the files into a directory of your choice.

### Data description

(Train/Test split)

- The file `properties_2017.csv` provides the full list of real estate properties in three California counties (Los Angeles, Orange, and Ventura) for 2017.

- Not all the properties are sold in each time period. If a property was not sold in the time period, it will not have a row in `train_2017.csv` and so will not be used in predictions.

(File descriptions)

- `properties_2017.csv` - all the properties with their home features for 2017 (released on 10/2/2017)
- `train_2017.csv` - the training set with transactions from 1/1/2017 to 9/15/2017 (released on 10/2/2017)

(Data fields)

- Please refer to `zillow_data_dictionary.xlsx`

### Instruction Overview

Zillow Prize is challenging the data science community to help push the accuracy of the Zestimate even further. In the competition, Zillow is asking you to predict the log-error between their Zestimate and the actual sale price, given all the features of a home. 

The log error is defined as

    logerror=log(Zestimate)−log(SalePrice)

and it is recorded in the transactions file `train_2017.csv`. Using this training set and the features of each home, set up your model for log-error prediction. Then, for each property (unique `parcelid`) in the `properties_2017.csv` dataset, predict the log error for the next period. Your program should write its predictions to `output.csv` in the following format (example):

|   parcelid    |    logerror   |
| ------------- | ------------- |
|   10754147    |    0.1234     |
|   10759547    |   -0.3212     |
|       ...     |       ...     |

Your algorithm must also output the mean of all `logerror` values.
Your answers should take the form of a clear argument that combines well-written prose, code, and the numerical results (produced when the notebook is run).
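
As a rough illustration of the output requirements above, the following sketch writes a two-column file and returns the mean prediction. The helper name `write_predictions`, the four-decimal formatting (inferred from the example table), and the choice to keep the first row when a `parcelid` repeats are all assumptions, not part of the assignment specification.

```python
import pandas as pd


def write_predictions(parcel_ids, predictions, path="output.csv"):
    """Write parcelid/logerror rows to a CSV file and return the mean logerror.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    out = pd.DataFrame({"parcelid": parcel_ids, "logerror": predictions})
    # Keep one row per unique parcelid (the first one seen).
    out = out.drop_duplicates(subset="parcelid")
    # Four decimals mirrors the example table; this precision is an assumption.
    out.to_csv(path, index=False, float_format="%.4f")
    return out["logerror"].mean()


# Toy values taken from the example table, for illustration only.
mean_logerror = write_predictions([10754147, 10759547], [0.1234, -0.3212])
print(f"Mean predicted logerror: {mean_logerror:.4f}")
```

In a real submission the two lists would be replaced by the `parcelid` column of `properties_2017.csv` and your model's predictions for those rows.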


### Assignment

You should complete the assignment by inserting cells in the notebook with your answers to these questions, including both prose and code you used for your analysis.

<div class="alert alert-block alert-warning">
 Construct a model and predict the log-error for each property (unique parcelid) given all the features of a home.
</div>

Let's start with the data loading.

    # Import the libraries and give them abbreviated names:
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm

    # Load the data; use the directory where you saved the files:
    df_properties = pd.read_csv('properties_2017.csv')
    df_train = pd.read_csv('train_2017.csv', parse_dates=["transactiondate"])
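
Because only sold properties have a row in `train_2017.csv`, a natural next step is to attach the home features to each transaction through the shared `parcelid` key. The sketch below shows one way to do that on tiny made-up frames standing in for `df_properties` and `df_train`; the helper name `build_training_table` is hypothetical, and `bedroomcnt` is used on the assumption that it matches a column in the data dictionary.

```python
import pandas as pd


def build_training_table(properties, transactions):
    """Attach home features to each recorded sale via the parcelid key."""
    # A left join keeps every transaction row, including repeat sales of a parcel.
    return transactions.merge(properties, on="parcelid", how="left")


# Made-up stand-ins for df_properties and df_train (illustration only).
toy_properties = pd.DataFrame(
    {"parcelid": [1, 2, 3], "bedroomcnt": [3.0, None, 2.0]}
)
toy_train = pd.DataFrame({"parcelid": [1, 3], "logerror": [0.02, -0.01]})

toy_merged = build_training_table(toy_properties, toy_train)
print(toy_merged)
```

Parcel 2 has no transaction, so it does not appear in the merged training table, although it still needs a prediction in the final output.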

(1) Provide an argument for which variables in `properties_2017.csv` can potentially be good predictors for the value of interest (log-error), and explain why each will be useful. Explanations of why certain variables could be particularly poor predictors are also welcomed (but not required).
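
One possible starting point for this argument is a quick screen of how strongly each numeric feature moves with `logerror`. The sketch below ranks columns by absolute Pearson correlation on a made-up frame; the function name `rank_by_correlation` is hypothetical, the column names are assumed to match the data dictionary, and linear correlation is only a rough screen that misses non-linear and categorical effects.

```python
import pandas as pd


def rank_by_correlation(df, target="logerror"):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Made-up values for illustration; real input would be the merged training table.
toy = pd.DataFrame(
    {
        "calculatedfinishedsquarefeet": [900.0, 1500.0, 2100.0, 3000.0],
        "yearbuilt": [1950.0, 2001.0, 1978.0, 1965.0],
        "logerror": [0.01, 0.02, 0.03, 0.045],
    }
)
print(rank_by_correlation(toy))
```

A ranking like this supports the written argument but does not replace it: the question asks why a variable should matter, not only whether it correlates.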


<div class="alert alert-block alert-info">Replace with your answers here</div>

(2) In `properties_2017.csv`, some properties (identified by `parcelid`) have more detailed home information available, such as the number of bedrooms or bathrooms, while others do not. Explain how you can handle this missing information when you construct a model.
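
Before choosing a strategy, it can help to measure how much is actually missing. The sketch below reports the share of missing values per column and shows one common option, median imputation with an indicator flag. Both helper names are hypothetical, the toy frame is made up, and median filling is only one of several reasonable choices (dropping sparse columns or using a model that tolerates missing values are others).

```python
import pandas as pd


def missing_fraction(df):
    """Return the share of missing values in each column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def fill_with_median(df, column):
    """Fill one numeric column with its median and flag the imputed rows."""
    out = df.copy()
    # Keep a record of which rows were missing before filling.
    out[column + "_missing"] = out[column].isna()
    out[column] = out[column].fillna(out[column].median())
    return out


# Made-up stand-in for a slice of df_properties (illustration only).
toy = pd.DataFrame(
    {"parcelid": [1, 2, 3, 4], "bedroomcnt": [3.0, None, 2.0, None]}
)
print(missing_fraction(toy))
print(fill_with_median(toy, "bedroomcnt"))
```

Keeping the `_missing` flag lets a model learn whether the absence of a value is itself informative.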


<div class="alert alert-block alert-info">Replace with your answers here</div>

(3) Construct your model and report the predicted log-error for each property for the next period. This should be a Python script `project2.py` that reads from `properties_2017.csv` and writes to `output.csv`, listing the `parcelid` and the predicted log-error value for each parcel.

Evaluate your model's ability to predict log-error values by comparing its predicted log-error values for the next period with the actual log-error values in the test data.

Discuss why the model can or cannot predict the log-error values for the test data. 
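
One way to set up such a comparison, assuming no separate test file is available, is to hold out the latest transactions in `train_2017.csv` and score predictions against them. The sketch below splits by `transactiondate` and scores a constant baseline with mean absolute error. The cutoff date, the helper names, and the choice of metric are assumptions for illustration; the toy values are made up.

```python
import numpy as np
import pandas as pd


def split_by_date(transactions, cutoff):
    """Split transactions into rows before the cutoff and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    earlier = transactions[transactions["transactiondate"] < cutoff]
    later = transactions[transactions["transactiondate"] >= cutoff]
    return earlier, later


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up transactions standing in for df_train (illustration only).
toy_train = pd.DataFrame(
    {
        "transactiondate": pd.to_datetime(
            ["2017-01-15", "2017-03-02", "2017-08-20", "2017-09-10"]
        ),
        "logerror": [0.02, -0.04, 0.01, 0.03],
    }
)
toy_earlier, toy_later = split_by_date(toy_train, "2017-07-01")

# Baseline: predict the mean logerror of the earlier period for every later sale.
baseline_prediction = np.full(len(toy_later), toy_earlier["logerror"].mean())
baseline_mae = mean_absolute_error(toy_later["logerror"], baseline_prediction)
print(f"Baseline MAE on held-out sales: {baseline_mae:.4f}")
```

Comparing your model's error with a constant baseline like this gives the discussion a concrete reference point: a model that barely beats the baseline is not extracting much signal from the features.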


<div class="alert alert-block alert-info">Replace with your answers here</div>