## Project 2: Kaggle Zillow prize challenge replication

<div class="alert alert-block alert-danger">
<b>Due: 9:29am, Tuesday, 4 February 2020</b>
</div>

In your second project you are tasked with building a model to improve the Zestimate residual error using the data for 2017 in the Kaggle Zillow prize competition. The following description is adapted from the Kaggle. For more details, reference: https://www.kaggle.com/c/zillow-prize-1/

### Data

You can only use the following data.

<div class="alert alert-block alert-warning">
Download the data you need for this assignment from:
Collab/Resources/Datasets
</div>

This file contains two `.csv` (comma-separated values) files, and one Excel directory file. Unzip the file to extract the CSV files into a directory of your choice.

### Data description

(Train/Test split)

- You are provided with a full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) data in 2017 in the file `properties_2017.csv`.

- Not all the properties are sold in each time period. If a property was not sold in the time period, it will not have a row in `train_2017.csv` and so will not be used in predictions.

(File descriptions)

- properties_2017.csv - all the properties with their home features for 2017 (released on 10/2/2017)
- train_2017.csv - the training set with transactions from 1/1/2017 to 9/15/2017 (released on 10/2/2017)

(Data fields)

- Please refer to zillow_data_dictionary.xlsx

### Instruction Overview

Zillow Prize is challenging the data science community to help push the accuracy of the Zestimate even further. In the competition, Zillow is asking you to predict the log-error between their Zestimate and the actual sale price, given all the features of a home. 

The log error is defined as

    logerror=log(Zestimate)−log(SalePrice)

and it is recorded in the transactions file `train_2017.csv`. Using this training set and features of the home, set up your model for log error prediction. And then for each property (unique parcelid) in the `properties_2017.csv` dataset, you must predict a log error for the next period. Your program should write to output.csv in the following format.
(Example)

|   parcelid    |    logerror   |
| ------------- | ------------- |
|   10754147    |    0.1234     |
|   10759547    |   -0.3212     |
|       ...     |       ...     |

Your algorithm also needs to output the mean of all logerrors. 
Your answers should be in the form of a clear argument that includes both well-written prose, code and the numerical results (when the notebook is run). 


### Assignment

You should complete the assignment by inserting cells in the notebook with your answers to these questions, including both prose and code you used for your analysis.

<div class="alert alert-block alert-warning">
 Construct a model and predict the log-error for each property (unique parcelid) given all the features of a home.
</div>

Let's start with the data loading.

In [3]:
# Import the libraries and give them abbreviated names:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# load the data, use the directory where you saved the data:
df_properties = pd.read_csv('properties_2017.csv') 
df_train = pd.read_csv('train_2017.csv', parse_dates=["transactiondate"])

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
df_properties.head()

Unnamed: 0,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,10754147,,,,0.0,0.0,,,,,...,,,,9.0,2016.0,9.0,,,,
1,10759547,,,,0.0,0.0,,,,,...,,,,27516.0,2015.0,27516.0,,,,
2,10843547,,,,0.0,0.0,5.0,,,,...,1.0,,660680.0,1434941.0,2016.0,774261.0,20800.37,,,
3,10859147,,,,0.0,0.0,3.0,6.0,,,...,1.0,,580059.0,1174475.0,2016.0,594416.0,14557.57,,,
4,10879947,,,,0.0,0.0,4.0,,,,...,1.0,,196751.0,440101.0,2016.0,243350.0,5725.17,,,


In [5]:
df_train.head()

Unnamed: 0,parcelid,logerror,transactiondate
0,14297519,0.025595,2017-01-01
1,17052889,0.055619,2017-01-01
2,14186244,0.005383,2017-01-01
3,12177905,-0.10341,2017-01-01
4,10887214,0.00694,2017-01-01


### Join the two dataframes on parcel_id

In [6]:
prop_sorted = df_properties.sort_values(by='parcelid', axis=0)
train_sorted = df_train.sort_values(by='parcelid',axis=0)

In [7]:
merged_inner = pd.merge(left=df_train,right=df_properties, left_on='parcelid', right_on='parcelid')
merged_inner.head()


Unnamed: 0,parcelid,logerror,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,14297519,0.025595,2017-01-01,,,,3.5,4.0,,,...,,,485713.0,1023282.0,2016.0,537569.0,11013.72,,,60590630000000.0
1,17052889,0.055619,2017-01-01,,,,1.0,2.0,,,...,1.0,,88000.0,464000.0,2016.0,376000.0,5672.48,,,61110010000000.0
2,14186244,0.005383,2017-01-01,,,,2.0,3.0,,,...,1.0,,85289.0,564778.0,2016.0,479489.0,6488.3,,,60590220000000.0
3,12177905,-0.10341,2017-01-01,,,,3.0,4.0,,8.0,...,,,108918.0,145143.0,2016.0,36225.0,1777.51,,,60373000000000.0
4,10887214,0.00694,2017-01-01,1.0,,,3.0,3.0,,8.0,...,,,73681.0,119407.0,2016.0,45726.0,1533.89,,,60371240000000.0


In [17]:
train_df = merged_inner
train_df.dropna(subset=['parcelid'])
y_train=train_df[['logerror']]
X_train=train_df.drop(columns=['logerror','parcelid'], axis=0)

In [9]:
corr_matrix = merged_inner.corr()
corr_matrix["logerror"].sort_values(ascending=False)

logerror                        1.000000
basementsqft                    0.372067
buildingclasstypeid             0.315372
finishedsquarefeet6             0.072870
finishedsquarefeet12            0.045921
calculatedfinishedsquarefeet    0.040516
garagetotalsqft                 0.035015
bedroomcnt                      0.031638
calculatedbathnbr               0.029330
garagecarcnt                    0.029002
fullbathcnt                     0.027133
bathroomcnt                     0.025817
fireplacecnt                    0.023242
poolsizesum                     0.021174
longitude                       0.015876
threequarterbathnbr             0.015540
parcelid                        0.015407
roomcnt                         0.014567
lotsizesquarefeet               0.011012
airconditioningtypeid           0.009341
structuretaxvaluedollarcnt      0.008433
numberofstories                 0.008204
fips                            0.006413
rawcensustractandblock          0.006333
yearbuilt       

In [11]:
numerical_df = merged_inner[['logerror','finishedsquarefeet13','finishedsquarefeet13','finishedfloor1squarefeet','landtaxvaluedollarcnt','unitcnt','taxamount','taxvaluedollarcnt','numberofstories','lotsizesquarefeet','bathroomcnt','fullbathcnt','fireplacecnt','garagecarcnt','bedroomcnt','garagetotalsqft','basementsqft']]

In [15]:
merged_inner['finishedsquarefeet13_log'] = np.log10(merged_inner['finishedsquarefeet13'])
merged_inner['finishedfloor1squarefeet_log'] = np.log10(merged_inner['finishedfloor1squarefeet'])
merged_inner['landtaxvaluedollarcnt_log'] = np.log10(merged_inner['landtaxvaluedollarcnt'])
merged_inner['taxvaluedollarcnt_log'] = np.log10(merged_inner['taxvaluedollarcnt'])

# numerical_df = merged_inner[['logerror','finishedsquarefeet13','finishedfloor1squarefeet','landtaxvaluedollarcnt','unitcnt','taxamount','taxvaluedollarcnt','numberofstories','lotsizesquarefeet','bathroomcnt','fullbathcnt','fireplacecnt','garagecarcnt','bedroomcnt','garagetotalsqft','basementsqft']]
corr_matrix = merged_inner.corr()
corr_matrix["logerror"].sort_values(ascending=False)

logerror                        1.000000
basementsqft                    0.372067
buildingclasstypeid             0.315372
finishedsquarefeet6             0.072870
finishedsquarefeet12            0.045921
calculatedfinishedsquarefeet    0.040516
garagetotalsqft                 0.035015
bedroomcnt                      0.031638
calculatedbathnbr               0.029330
garagecarcnt                    0.029002
fullbathcnt                     0.027133
bathroomcnt                     0.025817
fireplacecnt                    0.023242
poolsizesum                     0.021174
longitude                       0.015876
threequarterbathnbr             0.015540
parcelid                        0.015407
roomcnt                         0.014567
lotsizesquarefeet               0.011012
airconditioningtypeid           0.009341
structuretaxvaluedollarcnt      0.008433
numberofstories                 0.008204
fips                            0.006413
rawcensustractandblock          0.006333
yearbuilt       

In [60]:
X_train.head()

Unnamed: 0,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,finishedsquarefeet13_log,finishedfloor1squarefeet_log,landtaxvaluedollarcnt_log,taxvaluedollarcnt_log
0,2017-01-01,,,,3.5,4.0,,,3.5,,...,2016.0,537569.0,11013.72,,,60590630000000.0,,,5.730434,6.009995
1,2017-01-01,,,,1.0,2.0,,,1.0,,...,2016.0,376000.0,5672.48,,,61110010000000.0,,3.165838,5.575188,5.666518
2,2017-01-01,,,,2.0,3.0,,,2.0,,...,2016.0,479489.0,6488.3,,,60590220000000.0,,,5.680779,5.751878
3,2017-01-01,,,,3.0,4.0,,8.0,3.0,,...,2016.0,36225.0,1777.51,,,60373000000000.0,,,4.559008,5.161796
4,2017-01-01,1.0,,,3.0,3.0,,8.0,3.0,,...,2016.0,45726.0,1533.89,,,60371240000000.0,,,4.660163,5.07703


In [61]:
X_train[['latitude']].head()

Unnamed: 0,latitude
0,33634931.0
1,34449266.0
2,33886168.0
3,34245180.0
4,34185120.0


In [49]:
from sklearn.model_selection import train_test_split

train_df = merged_inner
train_df.dropna(subset=['parcelid'])
y_train=train_df[['logerror']]
X_train=train_df.drop(columns=['logerror','parcelid'], axis=0)

variable = 'calculatedbathnbr'
variable1 = 'garagetotalsqft'
variable2 = 'finishedsquarefeet6'
X_train.loc[X_train[variable].isnull(), variable] =0
X_train.loc[X_train[variable1].isnull(), variable1] =X_train[variable1].mean()
X_train.loc[X_train[variable2].isnull(), variable2] = X_train[variable2].mean()
my_variable = X_train[[variable, variable1, variable2]]

X, X_test, y, y_test = train_test_split(my_variable, y_train, test_size=0.16, random_state=42)


y_test.head()

Unnamed: 0,logerror
59756,0.052298
69100,0.005054
54038,0.01144
75563,-0.069159
36562,0.028567


In [50]:
# basement_median = np.median(np.asarray(X_train['basementsqft']))
# my_variable = X_train['basementsqft'].fillna(basement_median)


# my_variable = np.asarray(X_train[variable]).reshape(-1,1)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_reg = LinearRegression()

lin_reg.fit(X, y[['logerror']])

predictions = lin_reg.predict(X_test)
print("SECOND")
print(predictions)

my_sum = np.sum(np.subtract(predictions,y_test[['logerror']])**2)

print("SSE: "+str(my_sum))
    
mse = mean_squared_error(y_test[['logerror']], predictions)
rmse = np.sqrt(mse)
print("MSE: " + str(mse))
print("RMSE: " + str(rmse))

from sklearn.model_selection import cross_val_score

scores = cross_val_score(lin_reg, X_test, y_test[['logerror']],
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Scores:", rmse_scores)
print(pd.Series(rmse_scores).describe())


SECOND
[[0.01557534]
 [0.02221417]
 [0.01125216]
 ...
 [0.02414781]
 [0.01557534]
 [0.01125216]]
SSE: logerror    411.19125
dtype: float64
MSE: 0.03310985184221622
RMSE: 0.181961127283319
Scores: [0.1968172  0.19632083 0.24978023 0.14404809 0.14006549 0.16915137
 0.15284596 0.16359686 0.21995397 0.15526237]
count    10.000000
mean      0.178784
std       0.035937
min       0.140065
25%       0.153450
50%       0.166374
75%       0.196693
max       0.249780
dtype: float64


In [51]:
from sklearn.metrics import r2_score
print("R^2 Score: ", r2_score(predictions, y_test[['logerror']]))
print(predictions)
y_test.head()

R^2 Score:  -905.0645541629696
[[0.01557534]
 [0.02221417]
 [0.01125216]
 ...
 [0.02414781]
 [0.01557534]
 [0.01125216]]


Unnamed: 0,logerror
59756,0.052298
69100,0.005054
54038,0.01144
75563,-0.069159
36562,0.028567


In [37]:
print(lin_reg.coef_)

[[3.63980829e-03 3.77331636e-06]]


In [32]:
merged_inner.head()
incomplete = merged_inner[merged_inner['parcelid'].isnull()]
incomplete.head()

Unnamed: 0,parcelid,logerror,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,...,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,finishedsquarefeet13_log,finishedfloor1squarefeet_log,landtaxvaluedollarcnt_log,taxvaluedollarcnt_log


(1) Provide an argument for which variables in `properties_2017.csv` can potentially be good predictors for the value of interest (log-error), and explain why each will be useful. Explanations of why certain variables could be particularly poor predictors are also welcomed (but not required).


<div class="alert alert-block alert-info">Replace with your answers here</div>

(2) You can see that some properties such as parcelid have more detailed home information available, such as number of bedrooms or bathrooms, and other areas do not in `properties_2017.csv`. Explain how you can handle this missing information when you construct a model.


<div class="alert alert-block alert-info">Replace with your answers here</div>

(3) Construct your model and report the predicted log-error for each property for the next period. This should be a python script `project2.py` that reads from `properties_2017.csv` and writes to `output.csv` that lists parcelID and log-error predicted value for each parcel. 

Evaluate your model's ability to predict log-error values by comparing the predicted log-error values for the next period and the actual log-error testing values. 

Discuss why the model can or cannot predict the log-error values for the test data. 


<div class="alert alert-block alert-info">Replace with your answers here</div>