# Zip Code Median Income
_Calvin Whealton_

This notebook combines data from the 2018 US Census American Community Survey. The data were downloaded from the US Census website as table S1901, with one file for each of the 48 contiguous states and Washington, DC.

Because the data are available by county subdivision but not by zip code, the county subdivision values must be spatially disaggregated and then recombined for each zip code. This is accomplished using a file that maps between zip codes and county subdivisions.
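As a minimal sketch of how such a mapping file can be used, the example below joins a few made-up rows to a made-up income table. The zip codes, GEOIDs, percentages, and incomes are invented for illustration; only the column names `ZCTA5`, `GEOID`, and `ZPOPPCT` come from the relationship file used later in this notebook.

```python
import pandas as pd

# Hypothetical rows of the relationship file: one row per
# (zip code, county subdivision) pair.
# GEOID is kept as a string so leading zeros (e.g. state code 01) survive.
zcta_map = pd.DataFrame({
    'ZCTA5': ['12345', '12345', '67890'],
    'GEOID': ['0100100023', '0100100045', '0100100045'],
    'ZPOPPCT': [40.0, 60.0, 100.0],  # percent of zip code population in the subdivision
})

# Hypothetical county subdivision median incomes.
cousub_inc = pd.DataFrame({
    'GEOID': ['0100100023', '0100100045'],
    'med_hh_inc': [10000.0, 20000.0],
})

# Attach the county subdivision income to every piece of every zip code.
pieces = zcta_map.merge(cousub_inc, on='GEOID', how='left')
print(pieces)
```

Each row of `pieces` is one part of a zip code together with the income of the county subdivision it falls in, which is the input needed for the weighting step.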

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import os

## Stitching State Data Together

This section merges the state files into a single table of median income. The files contain many other data fields that are not processed here. The median household income estimate for each county subdivision is in column S1901_C01_012E.
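The stitching step can be sketched on in-memory text instead of the downloaded files. The file names, area names, caption row, and the third GEO_ID below are hypothetical stand-ins, and reading everything as strings is an assumption made here so that non-numeric entries such as `-` or `250,000+` are preserved as-is.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two downloaded state files.
# The second line of each file holds long column captions, so it is skipped.
fake_files = {
    'state_a.csv': 'GEO_ID,NAME,S1901_C01_012E\n'
                   'id,Geographic Area Name,Median income (dollars)\n'
                   '0600000US5100191196,Subdivision A,48639\n'
                   '0600000US5100191346,Subdivision B,-\n',
    'state_b.csv': 'GEO_ID,NAME,S1901_C01_012E\n'
                   'id,Geographic Area Name,Median income (dollars)\n'
                   '0600000US0100190171,Subdivision C,"250,000+"\n',
}

frames = []
for name, text in fake_files.items():
    # skiprows=[1] drops the caption row but keeps the header row
    state_inc = pd.read_csv(io.StringIO(text), skiprows=[1], dtype=str)
    # keep only the identifier and the median income column
    frames.append(state_inc[['GEO_ID', 'S1901_C01_012E']])

# one table for all states
combined = pd.concat(frames, ignore_index=True)
combined.columns = ['GEOID_COUSUB', 'med_hh_inc']
print(combined)
```

Collecting the per-state frames in a list and concatenating once avoids growing a dataframe row by row inside the loop.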

In [5]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/tdi_capstone/data/us_census_s1901_income/csvs')

In [9]:
state_incs_list = os.listdir()

In [24]:
# dataframe for values of median income in each county subdivision
state_medinc_cousub = pd.DataFrame(columns=['GEOID_COUSUB','med_hh_inc'])

In [25]:
# loop over all files (states), collecting one dataframe per state
state_frames = []
for file in state_incs_list:
    
    # row 1 has explanation/long captions for column titles
    state_inc = pd.read_csv(file,skiprows=[1])
    
    # dataframe of geoid and median income values from the state
    state_med = pd.DataFrame({'GEOID_COUSUB':state_inc['GEO_ID'],
                              'med_hh_inc':state_inc['S1901_C01_012E']})
    state_frames.append(state_med)

# combining values into the large state dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
state_medinc_cousub = pd.concat(state_frames,ignore_index=True)

In [27]:
state_medinc_cousub.shape

(34413, 2)

In [31]:
state_medinc_cousub.isnull().sum(axis=0).sum()

10

In [None]:
# dropna must be called; without parentheses the method itself is assigned
state_medinc_cousub = state_medinc_cousub.dropna()

In [51]:
state_medinc_cousub.head()

Unnamed: 0,GEOID_COUSUB,med_hh_inc
0,0600000US5100191196,48639
1,0600000US5100191346,49821
2,0600000US5100191496,35845
3,0600000US5100191646,39019
4,0600000US5100191796,44063


In [53]:
state_medinc_cousub['GEOID_CS'] = [string.split('US')[1] for string in state_medinc_cousub['GEOID_COUSUB']]

In [54]:
state_medinc_cousub.head()

Unnamed: 0,GEOID_COUSUB,med_hh_inc,GEOID_CS
0,0600000US5100191196,48639,5100191196
1,0600000US5100191346,49821,5100191346
2,0600000US5100191496,35845,5100191496
3,0600000US5100191646,39019,5100191646
4,0600000US5100191796,44063,5100191796


In [32]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/tdi_capstone/data/processed')
state_medinc_cousub.to_csv('state_medinc_cousub.csv')

## Aggregating/Disaggregating to Zip Code

The median household income for a zip code is estimated as the population-weighted mean of the median incomes of the county subdivisions that overlap it. For example, if 40% of the population of zip code 12345 lives in county subdivision 23 (median income 10,000) and 60% lives in county subdivision 45 (median income 20,000), then the estimated median income for the zip code is 10,000 x 0.4 + 20,000 x 0.6 = 4,000 + 12,000 = 16,000. Where some county subdivisions have no data, the fractions of the remaining subdivisions are renormalized to sum to 1. This is an approximation, since a weighted mean of medians is not in general the true median of the combined population.
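The weighting and renormalization can be sketched as a small standalone function. The name `weighted_income` and its arguments are hypothetical and are not part of this notebook's code; the first call reproduces the worked example from the text, and the second uses invented numbers to show the renormalization.

```python
import math

def weighted_income(incomes, pop_pcts):
    """Population-weighted mean of county subdivision median incomes.

    Pairs whose income is missing (None or NaN) are dropped, and the
    remaining population percentages are renormalized to sum to 1.
    Returns NaN when no usable pair is left.
    """
    pairs = [(inc, pct) for inc, pct in zip(incomes, pop_pcts)
             if inc is not None and not math.isnan(inc)]
    total_pct = sum(pct for _, pct in pairs)
    if total_pct == 0:
        return math.nan
    return sum(inc * pct for inc, pct in pairs) / total_pct

# worked example from the text: 40% at 10,000 and 60% at 20,000
print(weighted_income([10000, 20000], [40, 60]))            # 16000.0

# one subdivision without data: its 25% share is dropped and the
# remaining 30% and 45% are rescaled to 40% and 60%
print(weighted_income([10000, 20000, None], [30, 45, 25]))  # 16000.0
```

Dividing by the sum of the remaining percentages is what performs the renormalization, so the weights never need to be rescaled explicitly.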

In [33]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/tdi_capstone/data/zcta_to_countysub')

zcta_cousub_map = pd.read_csv('zcta_countysub_uscensus.txt')

In [34]:
zcta_cousub_map.head()

Unnamed: 0,ZCTA5,STATE,COUNTY,COUSUB,CLASSFP,GEOID,POPPT,HUPT,AREAPT,AREALANDPT,...,CSAREA,CSAREALAND,ZPOPPCT,ZHUPCT,ZAREAPCT,ZAREALANDPCT,CSPOPPCT,CSHUPCT,CSAREAPCT,CSAREALANDPCT
0,601,72,1,401,Z1,7200100401,4406,1968,1942319,1942319,...,1942319,1942319,23.73,25.41,1.16,1.17,100.0,100.0,100.0,100.0
1,601,72,1,13645,Z1,7200113645,1038,425,9420707,9387179,...,9494851,9461323,5.59,5.49,5.63,5.63,98.48,97.93,99.22,99.22
2,601,72,1,30458,Z1,7200130458,1337,509,16497991,16271520,...,16497991,16271520,7.2,6.57,9.85,9.76,100.0,100.0,100.0,100.0
3,601,72,1,32049,Z1,7200132049,140,60,7312819,6974412,...,7312952,6974412,0.75,0.77,4.37,4.18,100.0,100.0,100.0,100.0
4,601,72,1,32608,Z1,7200132608,254,115,2763743,2763743,...,7695788,7443424,1.37,1.49,1.65,1.66,29.78,31.17,35.91,37.13


In [140]:
zip_med_inc = pd.DataFrame(columns=['zip','med_hh_inc'])

In [151]:
# unique values of zip code
zip_use = zcta_cousub_map.ZCTA5.unique()


# rows collected here and converted to a dataframe after the loop
# (DataFrame.append was removed in pandas 2.0)
zip_rows = []

# loop over every zip code
for zc in zip_use:
    
    # rows of the relationship file for this zip code
    cou_df = zcta_cousub_map.loc[zcta_cousub_map['ZCTA5'] == zc]
    
    # geoids for the county subdivisions, cast to string to match the other dataframe
    # zero-padding to 10 characters restores a leading zero lost when GEOID is read as an integer
    cou_geoids = cou_df['GEOID'].astype(str).str.zfill(10)
    
    # population fraction [0,100] of the zip code in each county subdivision
    pop_frac = pd.Series(cou_df['ZPOPPCT'].values, index=cou_geoids.values)
    
    # searching for geoids in the median income data
    # .copy() avoids the SettingWithCopyWarning on the assignments below
    temp_df = state_medinc_cousub.loc[state_medinc_cousub['GEOID_CS'].isin(cou_geoids)].copy()
    
    # empty values removed
    temp_df = temp_df[temp_df['med_hh_inc'] != '-']
    
    if temp_df.shape[0] == 0:
        zip_rows.append({'zip':zc, 'med_hh_inc': np.nan})
        
    else:
        
        # finding the fractions that match the county subdivisions
        temp_df['pop_frac'] = temp_df['GEOID_CS'].map(pop_frac)
        
        # top- and bottom-coded values replaced by their bounds
        temp_df['med_hh_inc'] = temp_df['med_hh_inc'].replace({'250,000+':'250000', '2,500-':'2500'})
        
        # estimate of the median income (population-weighted mean, weights normalized by their sum)
        est_med_inc = np.sum(temp_df['pop_frac'].values*temp_df['med_hh_inc'].values.astype(float))/np.sum(temp_df['pop_frac'].values)
        
        zip_rows.append({'zip':zc, 'med_hh_inc': est_med_inc})

zip_med_inc = pd.DataFrame(zip_rows, columns=['zip','med_hh_inc'])


In [152]:
zip_med_inc.head()

Unnamed: 0,zip,med_hh_inc
0,601.0,
1,602.0,
2,603.0,
3,606.0,
4,610.0,


In [153]:
zip_med_inc.reset_index()

Unnamed: 0,index,zip,med_hh_inc
0,0,601.0,
1,1,602.0,
2,2,603.0,
3,3,606.0,
4,4,610.0,
...,...,...,...
38405,38405,99923.0,
38406,38406,99925.0,
38407,38407,99926.0,
38408,38408,99927.0,


In [154]:
zip_med_inc.shape[0]

38410

In [155]:
zip_med_inc = zip_med_inc.dropna()

In [156]:
os.chdir('/Users/calvinwhealton/Documents/GitHub/tdi_capstone/data/processed')

zip_med_inc.to_csv('zips_med_inc.csv')