# Zillow

## Exploratory Analysis of Zillow Data

In [4]:
import numpy as np
import pandas as pd

# Visualizing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# default pandas decimal number display format
pd.options.display.float_format = '{:.2f}'.format

# Split 
from sklearn.model_selection import train_test_split

# Scale
from sklearn.preprocessing import MinMaxScaler

# Stats
import scipy.stats as stats

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Local modules; the star import supplies wrangle_zillow(), split_data(), and min_max_scaler()
import env
from wrangle_zillow import *

In [5]:
df = wrangle_zillow()
df.head()

| Unnamed: 0 | parcelid | bathroomcnt | bedroomcnt | buildingqualitytypeid | calculatedfinishedsquarefeet | fips | latitude | longitude | lotsizesquarefeet | rawcensustractandblock | ... | roomcnt | yearbuilt | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | logerror | transactiondate | county |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14297519 | 3.5 | 4.0 | 6.0 | 3100.0 | 6059.0 | 33634931.0 | -117869207.0 | 4506.0 | 60590630.07 | ... | 0.0 | 1998.0 | 485713.0 | 1023282.0 | 2016.0 | 537569.0 | 11013.72 | 0.03 | 2017-01-01 | Orange |
| 1 | 17052889 | 1.0 | 2.0 | 6.0 | 1465.0 | 6111.0 | 34449266.0 | -119281531.0 | 12647.0 | 61110010.02 | ... | 5.0 | 1967.0 | 88000.0 | 464000.0 | 2016.0 | 376000.0 | 5672.48 | 0.06 | 2017-01-01 | Ventura |
| 2 | 14186244 | 2.0 | 3.0 | 6.0 | 1243.0 | 6059.0 | 33886168.0 | -117823170.0 | 8432.0 | 60590218.02 | ... | 6.0 | 1962.0 | 85289.0 | 564778.0 | 2016.0 | 479489.0 | 6488.3 | 0.01 | 2017-01-01 | Orange |
| 3 | 12177905 | 3.0 | 4.0 | 8.0 | 2376.0 | 6037.0 | 34245180.0 | -118240722.0 | 13038.0 | 60373001.0 | ... | 0.0 | 1970.0 | 108918.0 | 145143.0 | 2016.0 | 36225.0 | 1777.51 | -0.1 | 2017-01-01 | Los Angeles |
| 6 | 12095076 | 3.0 | 4.0 | 9.0 | 2962.0 | 6037.0 | 34145202.0 | -118179824.0 | 63000.0 | 60374608.0 | ... | 0.0 | 1950.0 | 276684.0 | 773303.0 | 2016.0 | 496619.0 | 9516.26 | -0.0 | 2017-01-01 | Los Angeles |


In [6]:
train, validate, test = split_data(df)

In [7]:
scaler, train_scaled, validate_scaled, test_scaled = min_max_scaler(train, validate, test)

In [8]:
train_scaled.head()

| Unnamed: 0 | parcelid | bathroomcnt | bedroomcnt | buildingqualitytypeid | calculatedfinishedsquarefeet | fips | latitude | longitude | lotsizesquarefeet | rawcensustractandblock | ... | roomcnt | yearbuilt | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | logerror | transactiondate | county |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61675 | 0.01 | 0.0 | 0.09 | 0.27 | 0.06 | 0.0 | 0.45 | 0.69 | 0.0 | 0.01 | ... | 0.0 | 0.45 | 0.01 | 0.02 | 0.0 | 0.01 | 0.03 | 0.58 | 2017-07-28 | Los Angeles |
| 7581 | 0.01 | 0.11 | 0.18 | 0.64 | 0.13 | 0.0 | 0.3 | 0.58 | 0.0 | 0.01 | ... | 0.0 | 0.58 | 0.04 | 0.11 | 0.0 | 0.09 | 0.11 | 0.63 | 2017-02-02 | Los Angeles |
| 48630 | 0.01 | 0.0 | 0.18 | 0.27 | 0.08 | 0.0 | 0.47 | 0.8 | 0.0 | 0.0 | ... | 0.0 | 0.54 | 0.03 | 0.07 | 0.0 | 0.06 | 0.07 | 0.58 | 2017-06-21 | Los Angeles |
| 66151 | 0.04 | 0.11 | 0.27 | 0.45 | 0.1 | 1.0 | 0.61 | 0.24 | 0.0 | 1.0 | ... | 0.5 | 0.6 | 0.01 | 0.01 | 0.0 | 0.0 | 0.01 | 0.58 | 2017-08-14 | Ventura |
| 38789 | 0.02 | 0.17 | 0.27 | 0.45 | 0.21 | 0.3 | 0.24 | 0.89 | 0.0 | 0.3 | ... | 0.0 | 0.91 | 0.11 | 0.16 | 0.0 | 0.09 | 0.19 | 0.58 | 2017-05-23 | Orange |
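
The helpers `split_data` and `min_max_scaler` are imported from `wrangle_zillow` and their bodies are not shown in this notebook. As a rough sketch of what such helpers might look like, the code below assumes a 56% / 24% / 20% split and a `MinMaxScaler` fit on the training set only. The `_sketch` function names, the split ratios, the seed, and the synthetic demo frame are all assumptions made for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_data_sketch(df, seed=123):
    """Hypothetical stand-in for split_data: about 56% / 24% / 20%."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test


def min_max_scaler_sketch(train, validate, test):
    """Hypothetical stand-in for min_max_scaler.

    The scaler is fit on train only, then applied to all three splits,
    so no information from validate or test leaks into the scaling.
    """
    num_cols = list(train.select_dtypes("number").columns)
    scaler = MinMaxScaler().fit(train[num_cols])
    scaled = []
    for part in (train, validate, test):
        part = part.copy()
        part[num_cols] = scaler.transform(part[num_cols])
        scaled.append(part)
    return (scaler, *scaled)


# Synthetic demo data (not the Zillow data set)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calculatedfinishedsquarefeet": rng.uniform(500, 5000, 200),
    "taxvaluedollarcnt": rng.uniform(5e4, 2e6, 200),
    "county": rng.choice(["Los Angeles", "Orange", "Ventura"], 200),
})

train, validate, test = split_data_sketch(demo)
scaler, train_scaled, validate_scaled, test_scaled = min_max_scaler_sketch(
    train, validate, test
)
print(train_scaled.describe())
```

Because the scaler only sees the training split, scaled values in validate and test can fall slightly outside the 0 to 1 range. Non-numeric columns such as `county` pass through unchanged, which matches the scaled output above.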


1. Ask at least five questions about the data, keeping in mind that the target variable is logerror. For example: is logerror significantly different for properties in Los Angeles County, Orange County, and Ventura County?

target = logerror

logerror = log(Zestimate) - log(SalePrice)

Q: Why did Zillow pick the log error instead of an absolute error metric such as RMSE?

Home sale prices have a right-skewed distribution and are also strongly heteroscedastic, so we need to use a relative error metric instead of an absolute metric to ensure valuation models are not biased towards expensive homes. A relative error metric like the percentage error or log ratio error avoids these problems. While we report Zestimate errors in terms of percentages on Zillow.com because we believe that to be a more intuitive metric for consumers, we do not advocate using percentage error to evaluate models in Zillow Prize, as it may lead to biased models. The log error is free of this bias problem, and when using the natural logarithm, the log error approximates the percentage error quite closely whenever the ratio of Zestimate to sale price is close to 1 (that is, when the log error is close to 0). See the paper linked below for more on relative errors and why log error should be used instead of percentage error.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2635088
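
To make the relationship between log error and percentage error concrete, here is a small worked sketch. The prices are made up for illustration, and the helper names `log_error` and `pct_error` are not part of the notebook.

```python
import math


def log_error(zestimate, sale_price):
    """Natural-log error: log(Zestimate) - log(SalePrice)."""
    return math.log(zestimate) - math.log(sale_price)


def pct_error(zestimate, sale_price):
    """Signed percentage error expressed as a fraction of the sale price."""
    return (zestimate - sale_price) / sale_price


sale_price = 500_000  # hypothetical sale price

# Near a ratio of 1 the two metrics almost coincide; further away they diverge.
for zestimate in (505_000, 550_000, 750_000, 1_000_000):
    le = log_error(zestimate, sale_price)
    pe = pct_error(zestimate, sale_price)
    print(f"Zestimate {zestimate:>9,}  logerror={le:+.4f}  pct_error={pe:+.4f}")

# Log error is symmetric in the ratio: over- and under-estimating by a
# factor of 2 give the same magnitude, while percentage error does not.
over = log_error(1_000_000, sale_price)
under = log_error(250_000, sale_price)
print(f"2x over: {over:+.4f}, 2x under: {under:+.4f}")
print(f"pct 2x over: {pct_error(1_000_000, sale_price):+.2f}, "
      f"pct 2x under: {pct_error(250_000, sale_price):+.2f}")
```

The symmetry in the last comparison is one way to see the bias argument: percentage error penalizes a twofold overestimate twice as heavily as a twofold underestimate, whereas log error treats them equally.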

2. Answer those questions through a mix of statistical tests and visualizations.
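
As one possible way to approach the county question from step 1, the sketch below compares logerror across counties with a Kruskal-Wallis test, a rank-based test that does not assume normally distributed errors. It runs on synthetic data; the group sizes, means, and the choice of test and significance level are assumptions for illustration. With the real data, `demo` would be replaced by the `train` split.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the train split (not real Zillow values)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "county": np.repeat(["Los Angeles", "Orange", "Ventura"], [600, 300, 100]),
    "logerror": np.concatenate([
        rng.normal(0.01, 0.15, 600),
        rng.normal(0.02, 0.15, 300),
        rng.normal(0.03, 0.15, 100),
    ]),
})

# H0: logerror has the same distribution in every county.
# Ha: at least one county's logerror distribution differs.
alpha = 0.05
groups = [grp["logerror"].to_numpy() for _, grp in demo.groupby("county")]
h_stat, p_value = stats.kruskal(*groups)

print(demo.groupby("county")["logerror"].agg(["count", "mean", "median"]))
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: logerror differs across counties.")
else:
    print("Fail to reject H0: no evidence that logerror differs across counties.")
```

A box plot of logerror by county (for example with `sns.boxplot`) would be a natural visual companion to this test.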

Bonus:

Compute mean(logerror) by zip code and the overall mean(logerror). Write a loop that runs a t-test between the overall mean and the mean for each zip code. The goal is to identify the zip codes where the error is significantly higher or lower than the expected error.
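
A minimal sketch of the bonus loop follows, using a one-sample t-test of each zip code's logerror against the overall mean. It runs on synthetic data with a shift planted in one zip code. The column name `regionidzip`, the minimum group size of 30, and the Bonferroni correction are assumptions added for illustration and are not specified by the exercise.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data: four zip codes, one with a planted positive shift
rng = np.random.default_rng(7)
zips = np.repeat([96001, 96002, 96003, 96004], 250)
logerror = rng.normal(0.0, 0.1, zips.size)
logerror[zips == 96004] += 0.08

# 'regionidzip' is an assumed column name for the zip code
demo = pd.DataFrame({"regionidzip": zips, "logerror": logerror})

overall_mean = demo["logerror"].mean()
alpha = 0.05
min_size = 30  # skip zip codes with too few sales for a stable test

rows = []
for zipcode, grp in demo.groupby("regionidzip"):
    if len(grp) < min_size:
        continue
    # H0: this zip code's mean logerror equals the overall mean
    t_stat, p_value = stats.ttest_1samp(grp["logerror"], overall_mean)
    rows.append({
        "zipcode": zipcode,
        "n": len(grp),
        "mean_logerror": grp["logerror"].mean(),
        "t_stat": t_stat,
        "p_value": p_value,
    })

results = pd.DataFrame(rows)
# Running many tests inflates false positives, so compare each p-value
# against a Bonferroni-adjusted threshold.
results["significant"] = results["p_value"] < alpha / len(results)
results["direction"] = np.where(results["t_stat"] > 0, "higher", "lower")
print(results.sort_values("p_value"))
```

With hundreds of real zip codes the multiple-comparison adjustment matters: at an unadjusted alpha of 0.05, roughly one in twenty zip codes would be flagged by chance alone.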