# Housing Data Preprocessing
_Calvin Whealton_

This notebook takes the raw Zillow housing data and converts it into a monthly percentage change in housing value. The Zillow data is available from https://www.zillow.com/research/data/, and this analysis specifically uses the Zillow Home Value Index (ZHVI). The result of this notebook will be incorporated into the feature matrix for each zip code-time interval and used in the predictions following floods.

In [2]:
import pandas as pd 
import numpy as np
import os

In [6]:
zillow_data = pd.read_csv('http://files.zillowstatic.com/research/public_v2/zhvi/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')

In [7]:
zillow_data.head()

Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1996-01-31,...,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30
0,61639,0,10025,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,233265.0,...,1248340.0,1234262.0,1229890.0,1226466.0,1208024.0,1182758.0,1150900.0,1134880.0,1120949.0,1112549.0
1,84654,1,60657,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,211748.0,...,494425.0,493485.0,492514.0,491726.0,491562.0,492618.0,494017.0,494766.0,494546.0,494435.0
2,61637,2,10023,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,245773.0,...,1161916.0,1153259.0,1156287.0,1175142.0,1193746.0,1205413.0,1203165.0,1209735.0,1211403.0,1212520.0
3,91982,3,77494,Zip,TX,TX,Katy,Houston-The Woodlands-Sugar Land,Harris County,200430.0,...,336121.0,336159.0,336142.0,336234.0,335959.0,336153.0,336611.0,337678.0,338602.0,339179.0
4,84616,4,60614,Zip,IL,IL,Chicago,Chicago-Naperville-Elgin,Cook County,286382.0,...,646296.0,645348.0,643973.0,642628.0,642209.0,642227.0,642454.0,641440.0,640355.0,639311.0


Extracting the column names that hold the monthly time index. These are all columns after the nine metadata columns (`RegionID` through `CountyName`).

In [8]:
cols_time = zillow_data.columns[9:zillow_data.shape[1]]
cols_time

Index(['1996-01-31', '1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31',
       '1996-06-30', '1996-07-31', '1996-08-31', '1996-09-30', '1996-10-31',
       ...
       '2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31', '2020-01-31',
       '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31', '2020-06-30'],
      dtype='object', length=294)
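
Selecting the time columns by position works as long as the metadata columns stay in the same place. As a hedged alternative, the sketch below picks out the date columns by trying to parse each column label as a date. The `toy` frame and its values are made up for illustration and only mimic the layout of `zillow_data`.

```python
import pandas as pd

# Toy frame standing in for zillow_data (hypothetical values, same layout idea).
toy = pd.DataFrame({
    'RegionID': [1, 2],
    'RegionName': [10025, 60657],
    'CountyName': ['New York County', 'Cook County'],
    '1996-01-31': [100.0, 200.0],
    '1996-02-29': [110.0, 190.0],
})

# Try to parse every column label as a YYYY-MM-DD date.
# Labels that are not dates become NaT because of errors='coerce'.
labels = pd.Series(toy.columns, index=toy.columns)
parsed = pd.to_datetime(labels, format='%Y-%m-%d', errors='coerce')

# Keep only the labels that parsed successfully.
date_cols = parsed.dropna().index
print(list(date_cols))
```

This approach does not depend on how many metadata columns precede the time series, at the cost of assuming that every time column label follows the `YYYY-MM-DD` pattern.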

Counting the null values in the time series columns to gauge how complete the data is.

In [9]:
# number of nulls
zillow_data[cols_time].isnull().sum(1).sum()

2106108

In [10]:
# number of possible values
zillow_data.shape[0]*len(cols_time)

8950242

In [11]:
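# number of non-null values (possible values minus nulls, computed from the data)
zillow_data.shape[0]*len(cols_time) - zillow_data[cols_time].isnull().sum(1).sum()

6844134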
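
The three counts above can also be derived together so that they always stay consistent with each other. The following is a minimal sketch on a made-up `toy` frame (not the Zillow data). It also reports the null fraction and the nulls per row, which are illustrative additions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy time-series block with some missing values (hypothetical numbers).
toy = pd.DataFrame({
    '1996-01-31': [100.0, np.nan, 300.0],
    '1996-02-29': [110.0, np.nan, np.nan],
})

n_total = toy.size                      # rows * columns
n_null = int(toy.isna().sum().sum())    # nulls summed over columns, then overall
n_valid = n_total - n_null              # non-null values
null_fraction = n_null / n_total        # share of missing entries

# Nulls per row, useful for spotting zip codes with short histories.
nulls_per_row = toy.isna().sum(axis=1)

print(n_total, n_null, n_valid, null_fraction)
```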

Calculating the monthly percentage increase in the Zillow Home Value Index (ZHVI). The formula used is:


<div align="center">$\text{Pct Increase}_i = 100 \times \dfrac{\text{zhvi}_i - \text{zhvi}_{i-1}}{\text{zhvi}_{i-1}}$</div>


Therefore, if the value is 100 in month _i-1_ and 110 in month _i_, the result will be 100x(110-100)/100 = 10%.
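
The formula can be checked on single values before applying it to the full table. The helper `pct_increase` below is a hypothetical name introduced only for this illustration. The sketch also shows that a missing previous value propagates as a missing result, and that a rise and the matching fall do not give the same magnitude.

```python
import math

def pct_increase(prev_value, curr_value):
    """Percentage change from prev_value to curr_value, in percent."""
    return 100 * (curr_value - prev_value) / prev_value

# The worked example from the text: 100 in month i-1, 110 in month i.
rise = pct_increase(100.0, 110.0)

# The reverse move is smaller in magnitude because the base is now 110.
fall = pct_increase(110.0, 100.0)

# A missing previous value yields a missing (NaN) result.
missing = pct_increase(math.nan, 110.0)

print(rise, fall, missing)
```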

In [12]:
zillow_mon_pct_val = pd.DataFrame()

In [13]:
zillow_mon_pct_val['GEOID10_str'] = zillow_data['RegionName'].apply(lambda x: '{0:0>5}'.format(x))
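
The format string `'{0:0>5}'` left-pads each zip code with zeros to five characters, which matters because zip codes read as integers lose their leading zeros. A small sketch on made-up zip codes, comparing it with the pandas string method `str.zfill` as an assumed equivalent for non-negative integers:

```python
import pandas as pd

# Hypothetical zip codes as they appear after being read as integers.
zips = pd.Series([501, 10025, 2134])

# The approach used in this notebook: format with zero fill, right-aligned, width 5.
padded_format = zips.apply(lambda x: '{0:0>5}'.format(x))

# An alternative using pandas string methods.
padded_zfill = zips.astype(str).str.zfill(5)

print(list(padded_format))
```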

In [15]:
# loop over the time columns
# the first iteration computes the second month relative to the first month
# i stops one short of the full range because the first month has no prior month to compare against
for i in range(len(cols_time)-1):
    zillow_mon_pct_val[cols_time[i+1]] = 100*(zillow_data[cols_time[i+1]]-zillow_data[cols_time[i]])/(zillow_data[cols_time[i]])

In [16]:
zillow_mon_pct_val.head()

Unnamed: 0,GEOID10_str,1996-02-29,1996-03-31,1996-04-30,1996-05-31,1996-06-30,1996-07-31,1996-08-31,1996-09-30,1996-10-31,...,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31,2020-06-30
0,10025,-0.622468,0.033648,-0.010781,0.22858,0.277544,-0.048918,0.350754,0.151876,0.694578,...,-1.321208,-1.127738,-0.35422,-0.278399,-1.50367,-2.091515,-2.693535,-1.391954,-1.227531,-0.749365
1,60657,-0.074145,-0.194243,-0.0644,-0.268664,0.087896,-0.101585,0.289858,0.424054,0.71714,...,-0.173436,-0.19012,-0.196764,-0.159995,-0.033352,0.214825,0.283993,0.151614,-0.044465,-0.022445
2,10023,0.020751,0.055324,0.356155,0.516132,0.420378,0.13887,0.041684,0.15745,0.297205,...,-1.215343,-0.745062,0.26256,1.630651,1.583128,0.977344,-0.186492,0.54606,0.137881,0.092207
3,77494,0.123734,-0.098167,-0.593074,-0.571021,-0.336102,0.386353,-0.157881,-0.329901,-0.400436,...,0.046731,0.011305,-0.005057,0.027369,-0.081788,0.057745,0.136247,0.316983,0.273633,0.170407
4,60614,-0.091486,-0.184188,-0.050421,-0.241024,0.031254,-0.175882,0.200458,0.384671,0.688425,...,-0.184404,-0.146682,-0.213063,-0.20886,-0.065201,0.002803,0.035346,-0.157832,-0.169151,-0.163035
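
Adding one column per loop iteration is easy to read but can be slow on a wide table. As a hedged alternative, the sketch below computes all monthly percentage changes in one array operation and checks that it matches the loop on a made-up `toy` frame. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy ZHVI block with missing values (hypothetical numbers).
toy = pd.DataFrame({
    '1996-01-31': [100.0, 200.0, np.nan],
    '1996-02-29': [110.0, 190.0, 50.0],
    '1996-03-31': [121.0, np.nan, 55.0],
})
cols = toy.columns

# Loop version, mirroring the cell above.
loop_result = pd.DataFrame()
for i in range(len(cols) - 1):
    loop_result[cols[i + 1]] = 100 * (toy[cols[i + 1]] - toy[cols[i]]) / toy[cols[i]]

# Vectorized version: align each month with the previous month as arrays.
prev = toy[cols[:-1]].to_numpy()   # months 1 .. n-1
curr = toy[cols[1:]].to_numpy()    # months 2 .. n
vec_result = pd.DataFrame(100 * (curr - prev) / prev,
                          index=toy.index, columns=cols[1:])

print(vec_result)
```

Both versions leave a missing result wherever either the current or the previous month is missing, so no values are filled in.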


In [18]:
# write to the processed-data folder without changing the working directory
out_dir = '/Users/calvinwhealton/Documents/GitHub/tdi_capstone/data/processed'
zillow_mon_pct_val.to_csv(os.path.join(out_dir, 'zillow_mon_pct_val.csv'))
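
One point to watch when this file is read back: the zero-padded `GEOID10_str` column is parsed as an integer by default, which drops the leading zeros again. The sketch below round-trips a made-up `toy` frame through an in-memory buffer (standing in for the file on disk) and shows that passing `dtype` to `pd.read_csv` keeps the strings intact.

```python
import io
import pandas as pd

# Toy output table with a zero-padded zip code column (hypothetical values).
toy = pd.DataFrame({
    'GEOID10_str': ['00501', '10025'],
    '1996-02-29': [1.5, -0.5],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Default parsing infers integers and loses the leading zeros.
naive = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Declaring the column as a string preserves the padding.
typed = pd.read_csv(io.StringIO(csv_text), index_col=0,
                    dtype={'GEOID10_str': str})

print(list(naive['GEOID10_str']), list(typed['GEOID10_str']))
```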