# General Figures for Zip Codes
_Calvin Whealton_

This notebook creates visualizations for each zip code showing the expected response of the housing market following a 100-year flood.

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import matplotlib.pyplot as plt
import geopandas as gpd

## Exploratory Analysis for Typical GDP and Housing

Finding ranges of typical GDP growth and housing price changes to use as scenario inputs. The number of lines plotted per zip code is the product of the number of GDP values and the number of housing values.

In [2]:
# housing data
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/processed_data')
housing = pd.read_csv('zillow_mon_pct_val.csv')

In [3]:
housing.quantile(q=[0.1,0.25,0.5,0.75,0.9], axis=0)

| quantile | Unnamed: 0 | GEOID10_str | 1996-02-29 | 1996-03-31 | 1996-04-30 | 1996-05-31 | 1996-06-30 | 1996-07-31 | 1996-08-31 | 1996-09-30 | ... | 2019-09-30 | 2019-10-31 | 2019-11-30 | 2019-12-31 | 2020-01-31 | 2020-02-29 | 2020-03-31 | 2020-04-30 | 2020-05-31 | 2020-06-30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 3044.2 | 12465.2 | -0.478668 | -0.366003 | -0.606188 | -0.523935 | -0.47185 | -0.446308 | -0.368729 | -0.343346 | ... | -0.128778 | -0.138746 | -0.203892 | -0.224561 | -0.226007 | -0.218879 | -0.265076 | -0.316519 | -0.406283 | -0.499222 |
| 0.25 | 7610.5 | 25984.5 | -0.217125 | -0.150177 | -0.251985 | -0.190601 | -0.152667 | -0.128857 | -0.073524 | -0.046001 | ... | 0.136759 | 0.120738 | 0.054458 | 0.035699 | 0.030113 | 0.046695 | 0.023209 | -0.0112 | -0.058884 | -0.104979 |
| 0.5 | 15221.0 | 48367.0 | 0.029793 | 0.053338 | 0.11548 | 0.13142 | 0.154107 | 0.158695 | 0.190117 | 0.194189 | ... | 0.348193 | 0.333664 | 0.275845 | 0.260891 | 0.261342 | 0.271538 | 0.273806 | 0.253334 | 0.226847 | 0.218458 |
| 0.75 | 22831.5 | 71675.5 | 0.257943 | 0.252718 | 0.471605 | 0.465017 | 0.455999 | 0.442437 | 0.461245 | 0.445511 | ... | 0.553938 | 0.537908 | 0.503971 | 0.502197 | 0.511456 | 0.529489 | 0.55124 | 0.548627 | 0.529546 | 0.548357 |
| 0.9 | 27397.8 | 87930.8 | 0.516132 | 0.480733 | 0.856023 | 0.830368 | 0.795122 | 0.781042 | 0.756128 | 0.733269 | ... | 0.803819 | 0.783801 | 0.75559 | 0.765873 | 0.784719 | 0.817336 | 0.844969 | 0.856616 | 0.864715 | 0.921391 |


Based on the above results, a "bad" market would be in the range of -0.2 to -0.5 % month-over-month. A "good" market would be about 0.75 to 0.85 % month-over-month. Typical values might be in the range of 0.2 to 0.3 % month-over-month.

For each zip code, constant pre-flood housing values of -0.4, 0.0, 0.3, and 0.8 % month-over-month will be used, representing poor, neutral, typical, and good markets.
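The quantile call above also summarizes the index-like columns and `GEOID10_str`, which are not price changes. A minimal sketch of one way to restrict the summary to the monthly columns is given below. The data frame `demo` and its values are made up for illustration, and collapsing each quantile to its median across months is an assumption, not the procedure used to pick the constants above.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the housing table: two index-like columns plus monthly columns
date_cols = ['1996-02-29', '1996-03-31', '1996-04-30']
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(0.2, 0.4, size=(50, 3)), columns=date_cols)
demo.insert(0, 'GEOID10_str', np.arange(10001, 10051))
demo.insert(0, 'Unnamed: 0', np.arange(50))

# keep only columns whose name looks like a YYYY-MM-DD date
monthly = demo.filter(regex=r'^\d{4}-\d{2}-\d{2}$')

# quantiles per month, then the median of each quantile across months
q_by_month = monthly.quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9], axis=0)
q_summary = q_by_month.median(axis=1)
print(q_summary)
```

The resulting five numbers give one value per quantile level, which may be easier to compare against the chosen constants than the full month-by-month table.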

In [4]:
# gdp data
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data')
gdp = pd.read_csv('A191RL1Q225SBEA.csv')

In [5]:
gdp.head()

|  | DATE | A191RL1Q225SBEA |
|---|---|---|
| 0 | 1947-04-01 | -1.0 |
| 1 | 1947-07-01 | -0.8 |
| 2 | 1947-10-01 | 6.4 |
| 3 | 1948-01-01 | 6.2 |
| 4 | 1948-04-01 | 6.8 |


In [6]:
# calculating some quantiles of the quarterly gdp
# rows 193 to 292: the 100 quarters (25 years) before row 293
gdp.iloc[(293-4*25):293].quantile(q=[0.1,0.25,0.5,0.75,0.9], axis=0)

| quantile | A191RL1Q225SBEA |
|---|---|
| 0.1 | -0.64 |
| 0.25 | 1.5 |
| 0.5 | 2.55 |
| 0.75 | 3.725 |
| 0.9 | 5.1 |


Based on the above results, the "bad" value of quarterly GDP change would be -0.6, a typical value would be 2, and a good value would be 5.
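The GDP quantiles above come from a positional slice, `gdp.iloc[(293-4*25):293]`. A sketch of an equivalent date-based selection is shown below. It assumes the file holds 293 consecutive quarterly rows starting at 1947-04-01, as the slice and the `head()` output suggest. The frame `gdp_demo` and its `growth` values are placeholders, not the real series.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the GDP file: 293 quarterly rows starting 1947-04-01
dates = pd.date_range('1947-04-01', periods=293, freq='QS')
gdp_demo = pd.DataFrame({'DATE': dates, 'growth': np.linspace(-1.0, 1.0, 293)})

# positional selection, as in the notebook: the last 4*25 = 100 rows
by_position = gdp_demo.iloc[(293 - 4 * 25):293]

# date-based selection: everything within 25 years of the last observation
cutoff = gdp_demo['DATE'].max() - pd.DateOffset(years=25)
by_date = gdp_demo.loc[gdp_demo['DATE'] > cutoff]

print(by_date['DATE'].min(), by_date['DATE'].max(), len(by_date))
```

The date-based form keeps selecting the most recent 25 years if rows are later appended to the file, whereas the positional form is tied to row 293.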

Taking all combinations of the above values (4 housing values and 3 GDP values) gives 12 lines in total per zip code.
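As a small illustration of the 4 x 3 grid, the combinations can be enumerated with `itertools.product`. The housing and GDP values are the ones chosen above. The helper `scenario_index` is hypothetical and only shows how a (housing, GDP) pair maps to a row when housing is the outer loop.

```python
from itertools import product

hou_typ = [-0.4, 0.0, 0.3, 0.8]  # housing values (% month-over-month)
gdp_typ = [-0.6, 2.0, 5.0]       # quarterly GDP change values

# housing varies slowest, GDP fastest
scenarios = list(product(hou_typ, gdp_typ))
print(len(scenarios))

def scenario_index(h_idx, g_idx, n_gdp=len(gdp_typ)):
    # hypothetical helper: row position of a (housing, GDP) pair in the grid
    return h_idx * n_gdp + g_idx

print(scenarios[scenario_index(2, 1)])
```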

## Loading Files with Demographic Data

Reading in the zip code files used to compute the population density and to look up the median income for each zip code.

In [7]:
#shapefile for zip code
#shapefile available from https://drive.google.com/file/d/1yTwgTfbYZirtNQOIfgQVDY4Tc-QKDVTb/view?usp=sharing
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/geo_data/tl_2019_us_zcta510_clipped48contig')
zips_shapefile = gpd.read_file('clipped48contig.shp')

# mapping between zip code and county subdivision
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data')
zcta_cousub = pd.read_csv('zcta_countysub_uscensus.txt')

# creating dataframe for zip code and area
zips_key_vals = pd.DataFrame({'zips':zips_shapefile['GEOID10'].astype(int).values,
                              'area':zips_shapefile['ALAND10'].values })


# creating dataframe for zip code and population
# one row per zip code / county subdivision pair, so ZPOP is averaged within each zip code
zpop_mean = zcta_cousub.groupby('ZCTA5')['ZPOP'].mean()
pop_df = pd.DataFrame({'zips':zpop_mean.index,
                      'zpop':zpop_mean.values})

# merging dataframes and calculating the population density
zips_key_vals2 = pd.merge(left=zips_key_vals, right = pop_df, left_on = 'zips', right_on = 'zips')
zips_key_vals2['pop_dens'] = zips_key_vals2['zpop']/zips_key_vals2['area']

# loading dataframe with median income
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/data/processed_data')
zip_medinc = pd.read_csv('zips_med_inc.csv')

# merging to have demographic characteristics in one dataframe
zips_key_vals3 = pd.merge(left=zips_key_vals2, right = zip_medinc, left_on = 'zips', right_on = 'zip')
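The merges above are inner joins, so zip codes missing from either table are dropped without notice. The sketch below shows one way to count them using the `indicator` option of `pd.merge`. The two small tables are made up, and the unit conversion assumes `ALAND10` is a land area in square meters, as in the Census TIGER/Line files.

```python
import pandas as pd

# hypothetical stand-ins for the area and population tables
areas = pd.DataFrame({'zips': [10001, 10002, 10003],
                      'area': [1.0e6, 2.0e6, 4.0e6]})   # square meters (assumed)
pops = pd.DataFrame({'zips': [10001, 10002, 10004],
                     'zpop': [5000.0, 1000.0, 800.0]})

# an outer join with an indicator column shows which side each zip code came from
check = pd.merge(areas, pops, on='zips', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)

# keep the matched rows and compute density in two units
matched = check.loc[check['_merge'] == 'both'].drop(columns='_merge').copy()
matched['pop_dens'] = matched['zpop'] / matched['area']   # people per square meter
matched['pop_dens_km2'] = matched['pop_dens'] * 1.0e6     # people per square kilometer
print(matched)
```

The per-square-kilometer column is only for readability. A model fitted on `pop_dens` in the original units would still need that column as input.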

## Loading Pickled Machine Learning Model

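The user-defined transformers and estimators must be defined before the pickled file is loaded, because pickle stores only references to these classes and looks them up by name when unpickling.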

In [8]:
from sklearn import base
class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    '''
    Transformer used in the practical machine learning mini project
    Selects the columns defined as col_names from the dataframe
    Returns the values for those columns
    Does not need to learn anything about the data
    '''
    
    def __init__(self, col_names):
        self.col_names = col_names
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        rets = np.zeros((X.shape[0], len(self.col_names)))
        for c in range(len(self.col_names)):
            rets[:,c] = X[self.col_names[c]]
        return rets
    
    
class LogTransformer(base.BaseEstimator, base.TransformerMixin):
    '''
    Transforms columns as the logarithm of the given values
    It does not have to learn anything about the data
    '''
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.log(X)
    
class TimeSeriesRescaler(base.BaseEstimator, base.TransformerMixin):
    '''
    Rescales the columns of a time series
    Divides by the standard deviation of the whole series (all columns pooled)
    Values are not shifted, so 0 stays at 0
    '''
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        self.std = np.std(X)
        return self
    
    def transform(self, X):
        return [row/self.std for row in X]
    
class MoveRefScale(base.BaseEstimator, base.TransformerMixin):
    '''
    Shifts values by a reference and divides by a scale
    ref = reference value to subtract (mean of the fitted data if None)
    scaler = 'std', 'min_max', or 'iqr', the spread used to scale
    Assumes X is a single input
    '''
    
    def __init__(self,ref=None,scaler='std'):
        self.scaler = scaler
        self.ref = ref
    
    def fit(self, X, y=None):
        if self.ref is None:
            self.ref_use = np.mean(X)
        else:
            self.ref_use = self.ref
        
        if self.scaler == 'std':
            self.scale_value = np.std(X)
        if self.scaler == 'min_max':
            self.scale_value = np.max(X) - np.min(X)
        if self.scaler == 'iqr':
            self.scale_value = np.quantile(X,0.75) - np.quantile(X,0.25)
        return self
    
    def transform(self, X):
        return (X-self.ref_use)/self.scale_value

In [9]:
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline

pipe_rp = Pipeline([
                ('cst_rp', ColumnSelectTransformer(col_names=['flood_rp'])),
                ('lt_rp', LogTransformer()),
                ('mrs_rp',  MoveRefScale(ref=np.log(100), scaler='iqr'))
])


pipe_gdp = Pipeline([
                ('cst_gdp', ColumnSelectTransformer(col_names=['GDP'])),
                ('mrs_gdp', MoveRefScale(ref=0.0, scaler='std'))
])

pipe_inc = Pipeline([
                ('cst_inc', ColumnSelectTransformer(col_names=['med_inc'])),
                ('lt_inc', LogTransformer()),
                ('mrs_inc', MoveRefScale(ref=None,scaler='std'))
])

pipe_popden = Pipeline([
                ('cst_pden',  ColumnSelectTransformer(col_names=['pop_dens'])),
                ('lt_pden', LogTransformer()),
                ('mrs_pden', MoveRefScale(ref=None,scaler='std'))
])

pipe_houTS = Pipeline([
                ('cst_gdp', ColumnSelectTransformer(col_names=['h-12', 'h-11','h-10','h-09', 'h-08','h-07','h-06','h-05','h-04','h-03','h-02','h-01'])),
                ('tsr_gdp', TimeSeriesRescaler())
])
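As a rough illustration of what `pipe_rp` does to the flood return period, the sketch below repeats the log transform and the `MoveRefScale(ref=np.log(100), scaler='iqr')` step in plain NumPy. The return periods are made-up values, not the training data.

```python
import numpy as np

# hypothetical return periods (years) standing in for the training data
flood_rp = np.array([10.0, 25.0, 50.0, 100.0, 500.0])

# LogTransformer step
log_rp = np.log(flood_rp)

# MoveRefScale step: shift by log(100), divide by the interquartile range
ref = np.log(100.0)
iqr = np.quantile(log_rp, 0.75) - np.quantile(log_rp, 0.25)
scaled = (log_rp - ref) / iqr

print(scaled)
```

With this reference, a 100-year flood maps to exactly 0, and shorter or longer return periods fall below or above it in units of the interquartile range.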

In [10]:
union = FeatureUnion([
        ('rp',pipe_rp),
        ('gdp',pipe_gdp),
        ('inc',pipe_inc),
        ('popden',pipe_popden),
        ('houTS', pipe_houTS)
    ])

In [11]:
class KNNMixedTSConsts(base.BaseEstimator, base.RegressorMixin):
    '''
    Custom estimator for the time series data (and non-time series) for problem
    neighbors = number of neighbors
    ts_inds = indices of the time series (assumed to be in correct order)
    weights = weights for different parts of distance (time series collapsed to single distance then weighted)
    '''

    def __init__(self,neighbors,ts_inds, weights):
        self.neighbors = neighbors
        self.ts_inds = ts_inds
        self.weights = weights
    
    def fit(self, X, y):
        self.X = X # store the values passed in
        self.y = y
        return self
    
    def predict(self, X):
        # prediction will be the mean of the k nearest neighbors
        # prediction also will return 80% interval
        # size will be number_of_prediction * length_of_time_series * number_of_metrics
        num_metrics = 3
        num_preds = X.shape[0]
        length_of_ts = self.y.shape[1]
        
        pred_arr = np.zeros((num_preds, length_of_ts, num_metrics))
        
        ts_vals = np.array(self.y)
        
        for p in range(num_preds):
            
            # calculate the distance
            dists = dist_calc(X[p,:], self.X, self.ts_inds, self.weights)
            
            # find neighbors by index
            neighbors_close = (np.argsort(dists))[0:self.neighbors]
            
            # take mean down the columns
            # length will be same as number of columns
            # also estimate the quantiles
            pred_arr[p,:,0] = np.mean(ts_vals[neighbors_close],axis=0)
            
            pred_arr[p,:,1] = np.quantile(ts_vals[neighbors_close], 0.1,axis=0)
            pred_arr[p,:,2] = np.quantile(ts_vals[neighbors_close], 0.9,axis=0)
        
        return pred_arr
    
    
def dist_calc(X_fitting, X_mat, ts_inds, weights):
    
    # dimensions and initializing an array to store results
    nrows = np.array(X_mat).shape[0]
    ncols = np.array(X_mat).shape[1] - len(ts_inds) + 1
    dist = np.zeros((nrows, ncols))
    
    # calculate the distances
    for i in np.arange(ncols):
        if i != (ncols - 1):
            dist[:,i] = weights[i]*((np.array(X_mat[:,i])-X_fitting[i])**2)
        else:
            dist_ts = np.zeros((nrows,len(ts_inds)))
            for j in range(len(ts_inds)):
                dist_ts[:,j] = ((np.array(X_mat[:,ts_inds[j]])-X_fitting[ts_inds[j]])**2)
            dist[:,i] = weights[i]*np.sum(dist_ts,axis=1)
    
    # return the mean across a row
    # will be length equal to number of rows
    return np.mean(dist,axis=1)
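The distance above is a weighted mean of squared differences, with the time series columns summed into a single term before weighting. A vectorized sketch of the same calculation is given below. The name `dist_calc_vec` and the toy inputs are hypothetical, and it assumes, as `dist_calc` does, that the non-time-series columns come first and that `weights` has one more entry than there are such columns.

```python
import numpy as np

def dist_calc_vec(x_new, X_mat, ts_inds, weights):
    # hypothetical vectorized variant of dist_calc
    X_mat = np.asarray(X_mat, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # number of non-time-series columns (assumed to be the leading columns)
    n_const = X_mat.shape[1] - len(ts_inds)

    # squared difference of every stored row from the new point
    sq = (X_mat - x_new) ** 2

    # one weighted term per constant column, one weighted term for the whole series
    const_part = (sq[:, :n_const] * weights[:n_const]).sum(axis=1)
    ts_part = weights[n_const] * sq[:, ts_inds].sum(axis=1)

    # mean over the n_const + 1 terms, one value per stored row
    return (const_part + ts_part) / (n_const + 1)

# toy example: two stored rows, two constant columns, a two-step time series
X_toy = np.array([[0.0, 0.0, 0.0, 0.0],
                  [1.0, 2.0, 3.0, 4.0]])
d = dist_calc_vec(np.zeros(4), X_toy, ts_inds=[2, 3], weights=[2.0, 0.5, 1.0])
nearest = np.argsort(d)[:1]
print(d, nearest)
```

For the second row the terms are 2.0*1, 0.5*4, and 1.0*(9+16), so the distance is 29/3.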

In [12]:
knn_pipe_opt = Pipeline([
                    ('union_feature', union),
                    ('mix_knn', KNNMixedTSConsts(neighbors=6,
                                                 ts_inds = np.arange(4,16),
                                                 weights = [1.0,
                                                            50.20425478187665,
                                                            0.05320161750432445,
                                                            0.4751064830429864,
                                                            0.05650927225287341]))
])

In [13]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/pickled_models')
# context manager closes the file after loading
with open('mix_knn_opt.sav', 'rb') as f:
    mix_knn_opt = pickle.load(f)

## Predicting Results for Each Zip Code

In [16]:
hou_typ = [-0.4, 0.0, 0.3, 0.8] # housing market typical values
gdp_typ = [-0.6, 2., 5.] # gdp typical values
flood = 100. # 100-year return period

size=15
params = {'legend.fontsize': 'large',
          'figure.figsize': (20,8),
          'axes.labelsize': size,
          'axes.titlesize': size,
          'xtick.labelsize': size*0.75,
          'ytick.labelsize': size*0.75,
          'axes.titlepad': 25}

plt.rcParams.update(params)

# loop over all zip codes (housing.shape[0])
for z in housing['GEOID10_str'].values:
    
    # checking if there is population density and median income data for zip code
    if z in zips_key_vals3['zips'].values:
        # take scalar values (first matching row) rather than one-element Series
        zip_row = zips_key_vals3.loc[zips_key_vals3['zips'] == z].iloc[0]
        med_inc = zip_row['med_hh_inc']
        pop_den = zip_row['pop_dens']
        
        # columns for the 12 months of housing values before the flood
        h_cols = ['h-12','h-11','h-10','h-09','h-08','h-07',
                  'h-06','h-05','h-04','h-03','h-02','h-01']
        
        # one row per combination of housing and GDP values (housing is the outer loop)
        # rows are collected in a list because DataFrame.append was removed in pandas 2.0
        rows = []
        for h in hou_typ:
            for g in gdp_typ:
                row = {'flood_rp': flood,
                       'GDP': g,
                       'med_inc': med_inc,
                       'pop_dens': pop_den}
                row.update({col: h for col in h_cols})
                rows.append(row)
        
        # dataframe that will be sent to the machine learning model
        typ_dat = pd.DataFrame(rows)
                
                
                
        # making predictions from the machine learning model
        # predictions will be an n x t x 3 array
        # n = number of typical cases (3*4)
        # t = number of time periods (13)
        # last axis: mean, 10th percentile, 90th percentile
        zip_preds = mix_knn_opt.predict(typ_dat)
        
        # changing directory to folder for images
        os.chdir('/Users/calvinwhealton/Documents/GitHub/floods_housing_zipcode/visualizations/zip_results/zip_pred_100yr')
        
        gdp_poly_cols = ['#d7191c40', '#ffffbf40', '#2c7bb640']
        gdp_line_cols = ['#d7191c', 'gold', '#2c7bb6']
        
        # open one figure per pre-flood housing market value
        ind = 0
        for ind2 in range(len(hou_typ)):
            plt.figure()
            # plot all the background polygons
            for ind1 in range(len(gdp_typ)):
                ind_take = ind
                plt.plot(np.arange(-12,1), [typ_dat['h-12'].values[ind_take]]*13, color='black')
                plt.fill_between(np.arange(0,13), zip_preds[ind_take,:,1], zip_preds[ind_take,:,2],color=gdp_poly_cols[ind1])
                #print(ind)
                ind += 1
                
            # plot mean predictions
            ind -= len(gdp_typ)
            for ind1 in range(len(gdp_typ)):
                ind_take = ind
                plt.plot(np.arange(0,13), zip_preds[ind_take,:,0], color=gdp_line_cols[ind1])
                ind += 1

            # add labels and titles
            plt.ylabel('ZHVI Change (month-over-month)')
            plt.xlabel('Month Relative to Flood')
            plt.title(str(int(z)).zfill(5) + [' Poor', ' Neutral',' Typical', ' Good'][ind2] + ' Pre-Flood Housing Market')
            
            # saving figure
            fig_name = str(int(z)).zfill(5) + ['_Poor', '_Neutral','_Typical', '_Good'][ind2] + '_market.png'
            plt.savefig(fig_name)
            plt.close()