## Racial Neighborhood Change (2009-2018)

In this notebook, I return to ACS data and show the power of using Python rather than Excel to work with ACS estimates, their margins of error, and statistical testing.  I also provide an example of using a crosswalk to reconcile geographies that change over time.  The notebook also starts to introduce more sophisticated coding structures. These are still new to me too, but they show how programmers move from writing out every piece of code by hand to more streamlined, automated code, which can save time and reduce errors.


### Import packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:.3f}'.format

In [None]:
import warnings
warnings.filterwarnings("ignore") 

### Import and clean ACS data from 2009

Because of changes to the Census website, the 2009 5-year ACS has not yet been posted there for download, so I downloaded the data from Social Explorer instead.  The biggest difference is that Social Explorer provides the standard error (SE), not the margin of error (MOE), so I first have to convert the standard errors into MOEs before I can aggregate them.
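
As a minimal sketch of that conversion: published ACS margins of error use the 90 percent confidence level, so MOE = 1.645 × SE. The toy DataFrame and the helper name `se_to_moe` below are illustrative assumptions, not part of the Social Explorer file.

```python
import pandas as pd

Z_90 = 1.645  # z-score for the 90% confidence level used for ACS MOEs


def se_to_moe(df: pd.DataFrame, suffix: str = "_se") -> pd.DataFrame:
    """Return a copy with every '<name>_se' column converted to '<name>_moe'."""
    out = df.copy()
    se_cols = [c for c in out.columns if c.endswith(suffix)]
    out[se_cols] = out[se_cols] * Z_90
    return out.rename(columns={c: c[: -len(suffix)] + "_moe" for c in se_cols})


# Made-up values for a single tract
toy = pd.DataFrame({"total_2009": [4000], "total_2009_se": [200.0]})
toy_moe = se_to_moe(toy)
print(toy_moe)
```

Matching on the `_se` suffix, rather than on any name that contains "se", keeps count columns from being scaled by accident.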

In [None]:
df_2009 = pd.read_csv("ACS2009.csv", dtype={'Geo_FIPS': str})

# drop extra columns
df_2009.drop(columns=['Geo_GEOID', 'Geo_NAME', 'Geo_QName', 'Geo_STUSAB',
       'Geo_SUMLEV', 'Geo_GEOCOMP', 'Geo_FILEID', 'Geo_LOGRECNO', 'Geo_US',
       'Geo_REGION', 'Geo_DIVISION', 'Geo_STATECE', 'Geo_STATE', 'Geo_COUNTY',
       'Geo_COUSUB', 'Geo_PLACE', 'Geo_PLACESE', 'Geo_TRACT', 'Geo_BLKGRP',
       'Geo_CONCIT', 'Geo_AIANHH', 'Geo_AIANHHFP', 'Geo_AIHHTLI', 'Geo_AITSCE',
       'Geo_AITS', 'Geo_ANRC', 'Geo_CBSA', 'Geo_CSA', 'Geo_METDIV', 'Geo_MACC',
       'Geo_MEMI', 'Geo_NECTA', 'Geo_CNECTA', 'Geo_NECTADIV', 'Geo_UA',
       'Geo_UACP', 'Geo_CDCURR', 'Geo_SLDU', 'Geo_SLDL', 'Geo_VTD',
       'Geo_ZCTA3', 'Geo_ZCTA5', 'Geo_SUBMCD', 'Geo_SDELM', 'Geo_SDSEC',
       'Geo_SDUNI', 'Geo_UR', 'Geo_PCI', 'Geo_TAZ', 'Geo_UGA', 'Geo_PUMA5',
       'Geo_PUMA1'], inplace=True)

df_2009.columns

In [None]:
# rename columns
rename_2009 = {'ACS09_5yr_B03002001': "total_2009", 
    'ACS09_5yr_B03002002': "nh_total_2009",
    'ACS09_5yr_B03002003': "nhwhite_2009", 
    'ACS09_5yr_B03002004' : 'nhblack_2009',
    'ACS09_5yr_B03002005':'nhamindian_2009',
    'ACS09_5yr_B03002006':"nhasian_2009",
    'ACS09_5yr_B03002007':"nhhpi_2009",
    'ACS09_5yr_B03002008':'nhother_2009',
    'ACS09_5yr_B03002009':'nhtwoplus_2009',
    'ACS09_5yr_B03002010':'nhtwoplincother_2009', 
    'ACS09_5yr_B03002011':'nhtwoplexother_2009',
    'ACS09_5yr_B03002012':'hispanic_2009',
    'ACS09_5yr_B03002013':'hwhite_2009',
    'ACS09_5yr_B03002014':'hblack_2009',
    'ACS09_5yr_B03002015':'hamindian_2009',
    'ACS09_5yr_B03002016':'hasian_2009',
    'ACS09_5yr_B03002017': 'hhpi_2009',
    'ACS09_5yr_B03002018':'hother_2009',
    'ACS09_5yr_B03002019':'htwoplus_2009',
    'ACS09_5yr_B03002020':'htwoplincother_2009',
    'ACS09_5yr_B03002021':'htwoplexother_2009',
    'ACS09_5yr_B03002001s': "total_2009_se",
    'ACS09_5yr_B03002002s': "nh_total_2009_se",
    'ACS09_5yr_B03002003s': "nhwhite_2009_se",
    'ACS09_5yr_B03002004s': 'nhblack_2009_se',
    'ACS09_5yr_B03002005s':'nhamindian_2009_se',
    'ACS09_5yr_B03002006s':"nhasian_2009_se",
    'ACS09_5yr_B03002007s':"nhhpi_2009_se",
    'ACS09_5yr_B03002008s':'nhother_2009_se',
    'ACS09_5yr_B03002009s':'nhtwoplus_2009_se',
    'ACS09_5yr_B03002010s':'nhtwoplincother_2009_se',
    'ACS09_5yr_B03002011s':'nhtwoplexother_2009_se',
    'ACS09_5yr_B03002012s':'hispanic_2009_se',
    'ACS09_5yr_B03002013s':'hwhite_2009_se',
    'ACS09_5yr_B03002014s':'hblack_2009_se',
    'ACS09_5yr_B03002015s':'hamindian_2009_se',
    'ACS09_5yr_B03002016s':'hasian_2009_se',
    'ACS09_5yr_B03002017s': 'hhpi_2009_se',
    'ACS09_5yr_B03002018s':'hother_2009_se',
    'ACS09_5yr_B03002019s':'htwoplus_2009_se',
    'ACS09_5yr_B03002020s':'htwoplincother_2009_se',
    'ACS09_5yr_B03002021s':'htwoplexother_2009_se',
    'ACS09_5yr_B25003001':'hu_2009',
    'ACS09_5yr_B25003002':'owner_2009',
    'ACS09_5yr_B25003003': 'renter_2009',
    'ACS09_5yr_B25003001s': 'hu_2009_se',
    'ACS09_5yr_B25003002s':'owner_2009_se',
    'ACS09_5yr_B25003003s': 'renter_2009_se'}

df_2009.rename(columns=rename_2009, inplace=True)
df_2009

In [None]:
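# create 90% MOEs from the SEs provided in the raw data (MOE = 1.645 * SE)
moe_fields = list(rename_2009.values())
# match on the "_se" suffix so a count column whose name merely contains "se" is never scaled
moe_fields = [x for x in moe_fields if x.endswith("_se")]
for i in moe_fields:
    df_2009[i] = df_2009[i] * 1.645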

df_2009.rename(columns={"total_2009_se": "total_2009_moe",
"nh_total_2009_se":"nh_total_2009_moe",
"nhwhite_2009_se":"nhwhite_2009_moe",
'nhblack_2009_se': 'nhblack_2009_moe',
'nhamindian_2009_se':'nhamindian_2009_moe',
"nhasian_2009_se": "nhasian_2009_moe",
"nhhpi_2009_se": "nhhpi_2009_moe",
'nhother_2009_se' :'nhother_2009_moe',
'nhtwoplus_2009_se': 'nhtwoplus_2009_moe',
'nhtwoplincother_2009_se': 'nhtwoplincother_2009_moe',
'nhtwoplexother_2009_se': 'nhtwoplexother_2009_moe',
'hispanic_2009_se': 'hispanic_2009_moe',
'hwhite_2009_se': 'hwhite_2009_moe',
'hblack_2009_se': 'hblack_2009_moe',
'hamindian_2009_se' :'hamindian_2009_moe',
'hasian_2009_se':'hasian_2009_moe',
'hhpi_2009_se':'hhpi_2009_moe',
'hother_2009_se' :'hother_2009_moe',
'htwoplus_2009_se' :'htwoplus_2009_moe',
'htwoplincother_2009_se': 'htwoplincother_2009_moe',
'htwoplexother_2009_se': 'htwoplexother_2009_moe',
'hu_2009_se': 'hu_2009_moe',
'owner_2009_se' :'owner_2009_moe',
'renter_2009_se': 'renter_2009_moe'}, inplace=True)


In [None]:
#aggregate our race/ethnicity columns and then drop the columns we no longer need

df_2009['nhothers_2009']=(df_2009['nhamindian_2009'] + df_2009['nhhpi_2009'] + df_2009['nhother_2009'] + df_2009['nhtwoplus_2009'])

# combine the MOE columns (square root of the sum of squared MOEs)

df_2009['nhothers_2009_moe']=(np.sqrt(df_2009['nhamindian_2009_moe']**2 + df_2009['nhhpi_2009_moe']**2
                                         + df_2009['nhother_2009_moe']**2 + df_2009['nhtwoplus_2009_moe']**2))

#keep the variables of interest

acs_2009_df = df_2009[['Geo_FIPS', 'total_2009', 'total_2009_moe', 'nhwhite_2009', 'nhwhite_2009_moe',
                      'nhblack_2009', 'nhblack_2009_moe', 'nhasian_2009', 'nhasian_2009_moe',
                       'hispanic_2009', 'hispanic_2009_moe', 'nhothers_2009', 'nhothers_2009_moe', 
                       'hu_2009', 'hu_2009_moe',
                      'owner_2009', 'owner_2009_moe', 'renter_2009', 'renter_2009_moe']].copy()
acs_2009_df
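
The MOE aggregation in the cell above uses the Census Bureau's approximation for the MOE of a sum: the square root of the sum of the squared component MOEs. A minimal sketch with made-up numbers; the helper name `sum_with_moe` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def sum_with_moe(df, cols, moe_suffix="_moe"):
    """Sum estimate columns and combine their MOEs by root-sum-of-squares."""
    total = df[cols].sum(axis=1)
    moe_cols = [c + moe_suffix for c in cols]
    total_moe = np.sqrt((df[moe_cols] ** 2).sum(axis=1))
    return total, total_moe


# Made-up estimates and MOEs for one tract
toy = pd.DataFrame({
    "a": [30], "a_moe": [3.0],
    "b": [40], "b_moe": [4.0],
})
toy["ab"], toy["ab_moe"] = sum_with_moe(toy, ["a", "b"])
print(toy[["ab", "ab_moe"]])  # ab = 70, ab_moe = sqrt(3**2 + 4**2) = 5.0
```

Deriving the MOE column names from the estimate names makes it harder to mix an MOE column into the sum of counts.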

In [None]:
# Above, I used the "rename" dictionary to pass a list of variables to my operation,
# but I can also create a specific list of variables to use in future code
list_2009 = ['total_2009', 'total_2009_moe', 'nhwhite_2009', 'nhwhite_2009_moe',
             'nhblack_2009', 'nhblack_2009_moe', 'nhasian_2009', 'nhasian_2009_moe',
             'hispanic_2009', 'hispanic_2009_moe', 'nhothers_2009', 'nhothers_2009_moe',
             'hu_2009', 'hu_2009_moe',
             'owner_2009', 'owner_2009_moe', 'renter_2009', 'renter_2009_moe']

### Import and clean ACS data from 2018

In [None]:
df_2018 = pd.read_csv("ACS2018.csv", dtype={'Geo_FIPS': str})

# drop extra columns
df_2018.drop(columns=['Geo_GEOID', 'Geo_BTTR', 'Geo_BTBG','Geo_NAME', 'Geo_QName', 'Geo_STUSAB',
       'Geo_SUMLEV', 'Geo_GEOCOMP', 'Geo_FILEID', 'Geo_LOGRECNO', 'Geo_US',
       'Geo_REGION', 'Geo_DIVISION', 'Geo_STATECE', 'Geo_STATE', 'Geo_COUNTY',
       'Geo_COUSUB', 'Geo_PLACE', 'Geo_PLACESE', 'Geo_TRACT', 'Geo_BLKGRP',
       'Geo_CONCIT', 'Geo_AIANHH', 'Geo_AIANHHFP', 'Geo_AIHHTLI', 'Geo_AITSCE',
       'Geo_AITS', 'Geo_ANRC', 'Geo_CBSA', 'Geo_CSA', 'Geo_METDIV', 'Geo_MACC',
       'Geo_MEMI', 'Geo_NECTA', 'Geo_CNECTA', 'Geo_NECTADIV', 'Geo_UA',
       'Geo_UACP', 'Geo_CDCURR', 'Geo_SLDU', 'Geo_SLDL', 'Geo_VTD',
       'Geo_ZCTA3', 'Geo_ZCTA5', 'Geo_SUBMCD', 'Geo_SDELM', 'Geo_SDSEC',
       'Geo_SDUNI', 'Geo_UR', 'Geo_PCI', 'Geo_TAZ', 'Geo_UGA', 'Geo_PUMA5',
       'Geo_PUMA1'], inplace=True)

df_2018

In [None]:
# rename columns
rename_2018 = {'ACS18_5yr_B03002001': "total_2018", 
    'ACS18_5yr_B03002002': "nh_total_2018",
    'ACS18_5yr_B03002003': "nhwhite_2018", 
    'ACS18_5yr_B03002004' : 'nhblack_2018',
    'ACS18_5yr_B03002005':'nhamindian_2018',
    'ACS18_5yr_B03002006':"nhasian_2018",
    'ACS18_5yr_B03002007':"nhhpi_2018",
    'ACS18_5yr_B03002008':'nhother_2018',
    'ACS18_5yr_B03002009':'nhtwoplus_2018',
    'ACS18_5yr_B03002010':'nhtwoplincother_2018',
    'ACS18_5yr_B03002011':'nhtwoplexother_2018',
    'ACS18_5yr_B03002012':'hispanic_2018',
    'ACS18_5yr_B03002013':'hwhite_2018',
    'ACS18_5yr_B03002014':'hblack_2018',
    'ACS18_5yr_B03002015':'hamindian_2018',
    'ACS18_5yr_B03002016':'hasian_2018',
    'ACS18_5yr_B03002017': 'hhpi_2018',
    'ACS18_5yr_B03002018':'hother_2018',
    'ACS18_5yr_B03002019':'htwoplus_2018',
    'ACS18_5yr_B03002020':'htwoplincother_2018',
    'ACS18_5yr_B03002021':'htwoplexother_2018',
    'ACS18_5yr_B03002001s': "total_2018_se",
    'ACS18_5yr_B03002002s': "nh_total_2018_se",
    'ACS18_5yr_B03002003s': "nhwhite_2018_se",
    'ACS18_5yr_B03002004s': 'nhblack_2018_se',
    'ACS18_5yr_B03002005s':'nhamindian_2018_se',
    'ACS18_5yr_B03002006s':"nhasian_2018_se",
    'ACS18_5yr_B03002007s':"nhhpi_2018_se",
    'ACS18_5yr_B03002008s':'nhother_2018_se',
    'ACS18_5yr_B03002009s':'nhtwoplus_2018_se',
    'ACS18_5yr_B03002010s':'nhtwoplincother_2018_se',
    'ACS18_5yr_B03002011s':'nhtwoplexother_2018_se',
    'ACS18_5yr_B03002012s':'hispanic_2018_se',
    'ACS18_5yr_B03002013s':'hwhite_2018_se',
    'ACS18_5yr_B03002014s':'hblack_2018_se',
    'ACS18_5yr_B03002015s':'hamindian_2018_se',
    'ACS18_5yr_B03002016s':'hasian_2018_se',
    'ACS18_5yr_B03002017s': 'hhpi_2018_se',
    'ACS18_5yr_B03002018s':'hother_2018_se',
    'ACS18_5yr_B03002019s':'htwoplus_2018_se',
    'ACS18_5yr_B03002020s':'htwoplincother_2018_se',
    'ACS18_5yr_B03002021s':'htwoplexother_2018_se',
    'ACS18_5yr_B25003001':'hu_2018',
    'ACS18_5yr_B25003002':'owner_2018',
    'ACS18_5yr_B25003003': 'renter_2018',
    'ACS18_5yr_B25003001s': 'hu_2018_se',
    'ACS18_5yr_B25003002s':'owner_2018_se',
    'ACS18_5yr_B25003003s': 'renter_2018_se'}

df_2018.rename(columns=rename_2018, inplace=True)
df_2018.info()

In [None]:
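# create 90% MOEs from the SEs provided in the raw data (MOE = 1.645 * SE)
moe_fields = list(rename_2018.values())
# match on the "_se" suffix so a count column whose name merely contains "se" is never scaled
moe_fields = [x for x in moe_fields if x.endswith("_se")]
for i in moe_fields:
    df_2018[i] = df_2018[i] * 1.645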
    
df_2018.rename(columns={"total_2018_se": "total_2018_moe",
"nh_total_2018_se":"nh_total_2018_moe",
"nhwhite_2018_se":"nhwhite_2018_moe",
'nhblack_2018_se': 'nhblack_2018_moe',
'nhamindian_2018_se':'nhamindian_2018_moe',
"nhasian_2018_se": "nhasian_2018_moe",
"nhhpi_2018_se": "nhhpi_2018_moe",
'nhother_2018_se' :'nhother_2018_moe',
'nhtwoplus_2018_se': 'nhtwoplus_2018_moe',
'nhtwoplincother_2018_se': 'nhtwoplincother_2018_moe',
'nhtwoplexother_2018_se': 'nhtwoplexother_2018_moe',
'hispanic_2018_se': 'hispanic_2018_moe',
'hwhite_2018_se': 'hwhite_2018_moe',
'hblack_2018_se': 'hblack_2018_moe',
'hamindian_2018_se' :'hamindian_2018_moe',
'hasian_2018_se':'hasian_2018_moe',
'hhpi_2018_se':'hhpi_2018_moe',
'hother_2018_se' :'hother_2018_moe',
'htwoplus_2018_se' :'htwoplus_2018_moe',
'htwoplincother_2018_se': 'htwoplincother_2018_moe',
'htwoplexother_2018_se': 'htwoplexother_2018_moe',
'hu_2018_se': 'hu_2018_moe',
'owner_2018_se' :'owner_2018_moe',
'renter_2018_se': 'renter_2018_moe'}, inplace=True)

In [None]:
#aggregate our race/ethnicity columns and then drop the columns we no longer need

df_2018['nhothers_2018']=(df_2018['nhamindian_2018'] + df_2018['nhhpi_2018'] + df_2018['nhother_2018'] + df_2018['nhtwoplus_2018'])

# combine the MOE columns (square root of the sum of squared MOEs)

df_2018['nhothers_2018_moe']=(np.sqrt(df_2018['nhamindian_2018_moe']**2 + df_2018['nhhpi_2018_moe']**2
                                         + df_2018['nhother_2018_moe']**2 + df_2018['nhtwoplus_2018_moe']**2))

acs_2018_df = df_2018[['Geo_FIPS', 'total_2018', 'total_2018_moe', 'nhwhite_2018', 'nhwhite_2018_moe',
                      'nhblack_2018', 'nhblack_2018_moe', 'nhasian_2018', 'nhasian_2018_moe',
                       'hispanic_2018', 'hispanic_2018_moe', 'nhothers_2018', 'nhothers_2018_moe', 
                       'hu_2018', 'hu_2018_moe',
                      'owner_2018', 'owner_2018_moe', 'renter_2018', 'renter_2018_moe']].copy()
acs_2018_df

### Import Crosswalk

The 2009 ACS data is on 2000 census tracts, while the 2018 data is on 2010 census tracts.  Because we will soon have 2020 Census geographies, learning how to crosswalk is an important skill!

In [None]:
# I downloaded this crosswalk from the Longitudinal Tract Data Base (LTDB) at Brown University
crosswalk = pd.read_csv("crosswalk_2000_2010.csv", 
                        dtype={'trtid00': str, 'trtid10': str})
crosswalk = crosswalk[['trtid00', 'trtid10', 'weight']].copy()

### Join 2009 data to crosswalk

The 2009 5-yr ACS data is on 2000 tracts. We are going to crosswalk these to 2010 tracts so they can be joined to the 2018 5-yr ACS.
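
Before relying on the join, it can help to check how many tracts fail to match, since the default inner merge silently drops them. The sketch below uses toy stand-ins for the crosswalk and the 2009 table (all values made up) and pandas' `indicator=True` option.

```python
import pandas as pd

# Toy stand-ins for the crosswalk and the 2009 table (made-up values)
xwalk = pd.DataFrame({
    "trtid00": ["A", "A", "B"],
    "trtid10": ["A1", "A2", "B"],
    "weight": [0.4, 0.6, 1.0],
})
acs = pd.DataFrame({"Geo_FIPS": ["A", "C"], "total_2009": [1000, 500]})

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column
check = xwalk.merge(acs, left_on="trtid00", right_on="Geo_FIPS",
                    how="outer", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)
```

Rows flagged `left_only` are crosswalk tracts with no ACS record, and `right_only` rows are ACS tracts missing from the crosswalk.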

In [None]:
crosswalk_2009 = crosswalk.merge(acs_2009_df, left_on="trtid00", right_on="Geo_FIPS")

In [None]:
crosswalk_2009.head(100)

In [None]:
crosswalk_2009[crosswalk_2009['trtid00'] == '06001403500']

In [None]:
crosswalk_2009[(crosswalk_2009['trtid10'] == '06001425104') | (crosswalk_2009['trtid00'] == '06001401000') | (crosswalk_2009['trtid00'] == '06001425100')]

### Multiply each of the 2009 variables by the crosswalk weight

For example, the 2000 tract `01025958000` was split in 2010. Its FIPS code no longer exists, so we need to reallocate ~42% of its counts to `01025958001` and ~58% to `01025958002`.

| trtid00 | trtid10 | weight |
| ------- | ------- | ------ |
| 01025958000 | 01025958001 | 0.416454 |
| 01025958000 | 01025958002 | 0.583546 |

*NOTE:* Do not reweight the margins of error (MOEs). We are taking a conservative approach and keeping each MOE as is, to avoid overestimating the number of tracts with a statistically significant change in tenure or racial composition.
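
As a small worked example of the reweighting, the sketch below applies the two weights from the table to a made-up 2009 count of 1,000 people with a made-up MOE of 150. Only the weights and tract IDs come from the table; the counts are assumptions for illustration.

```python
import pandas as pd

# The split tract from the table above, with a made-up 2009 count and MOE
toy = pd.DataFrame({
    "trtid00": ["01025958000", "01025958000"],
    "trtid10": ["01025958001", "01025958002"],
    "weight": [0.416454, 0.583546],
    "total_2009": [1000.0, 1000.0],
    "total_2009_moe": [150.0, 150.0],
})

# Reweight the count only; the MOE column is deliberately left as is
toy["total_2009"] = toy["total_2009"] * toy["weight"]
print(toy[["trtid10", "total_2009", "total_2009_moe"]])
```

Because the two weights sum to 1, the reallocated counts still add up to the original 1,000.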

In [None]:
reweigh_fields = [x for x in list_2009 if "moe" not in x]
for i in reweigh_fields:
    crosswalk_2009[i] = crosswalk_2009[i] * crosswalk_2009['weight']

In [None]:
crosswalk_2009[crosswalk_2009['trtid00'] == '06001403500']

### Sum adjusted variables by 2010 census tract FIPS code

The dataset currently has multiple rows for some 2010 tracts, so we need to condense it by grouping on the 2010 FIPS codes.
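
One way to sketch this condensing step in a single pass is pandas' named aggregation: sum the counts and combine the MOEs by root-sum-of-squares within each 2010 tract. The data and the helper name `root_sum_squares` are made up for illustration; the cells below do the same thing in two steps.

```python
import numpy as np
import pandas as pd


def root_sum_squares(s: pd.Series) -> float:
    """Combine MOEs of summed estimates: sqrt of the sum of squares."""
    return float(np.sqrt((s ** 2).sum()))


# Made-up rows: two 2000 tracts that both feed into 2010 tract "X"
toy = pd.DataFrame({
    "trtid10": ["X", "X", "Y"],
    "total_2009": [300.0, 700.0, 500.0],
    "total_2009_moe": [30.0, 40.0, 60.0],
})

condensed = (
    toy.groupby("trtid10")
       .agg(total_2009=("total_2009", "sum"),
            total_2009_moe=("total_2009_moe", root_sum_squares))
       .reset_index()
)
print(condensed)
```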

In [None]:
crosswalk_2009.head(75)

In [None]:
# sum count columns
keep_fields = ['trtid10'] + reweigh_fields
crosswalk_2009_count = crosswalk_2009[keep_fields].groupby('trtid10').sum()
crosswalk_2009_count.reset_index(inplace=True)

In [None]:
# combine MOE columns within each 2010 tract: square root of the sum of squared MOEs
keep_fields = [x for x in list_2009 if "moe" in x]
keep_fields_dict = {}
for k in keep_fields:
    keep_fields_dict[k] = lambda x: np.sqrt(np.sum(x**2))

In [None]:
crosswalk_2009_moe = crosswalk_2009.groupby('trtid10').agg(keep_fields_dict)
crosswalk_2009_moe.reset_index(inplace=True)
crosswalk_2009_moe

In [None]:
# join back together
crosswalk_2009 = crosswalk_2009_count.merge(crosswalk_2009_moe, on="trtid10")
crosswalk_2009

### Join weighted 2009 tracts to 2018 tracts

In [None]:
df_join = acs_2018_df.merge(crosswalk_2009, left_on="Geo_FIPS", right_on="trtid10")
df_join.shape

In [None]:
df_join.describe()

### Calculate percents (and associated MOEs)

In [None]:
fields = [
    # [numerator, denominator]
    ['owner', 'hu'],
    ['renter', 'hu'],
    ['nhwhite', 'total'],
    ['nhblack', 'total'],
    ['nhasian', 'total'],
    ['nhothers', 'total'],
    ['hispanic', 'total']
]
years = ['2009', '2018']

for y in years:
    for f in fields:
        numer = f[0]
        denom = f[1]
        df_join["p_"+numer+"_"+y] = df_join[numer+"_"+y]/df_join[denom+"_"+y]
        df_join["p_"+numer+"_"+y+"_moe"]=  np.sqrt(df_join[numer+"_"+y+"_moe"]**2 - 
                                                  (df_join["p_"+numer+"_"+y]**2 * 
                                                   df_join[denom+"_"+y+"_moe"]**2)) / df_join[denom+"_"+y]
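
The formula in the loop above is the Census Bureau's approximation for the MOE of a proportion. When the term under the square root is negative, the result is NaN; the Bureau's guidance for that case is to switch to the ratio formula, which adds the two terms instead of subtracting. A minimal sketch of that fallback, with made-up numbers and a hypothetical helper name `proportion_moe`:

```python
import numpy as np
import pandas as pd


def proportion_moe(num, num_moe, den, den_moe):
    """Approximate MOE of num/den, falling back to the ratio formula
    when the term under the square root is negative."""
    p = num / den
    radicand = num_moe ** 2 - (p ** 2) * (den_moe ** 2)
    fallback = num_moe ** 2 + (p ** 2) * (den_moe ** 2)
    return np.sqrt(np.where(radicand >= 0, radicand, fallback)) / den


# Row 0 uses the proportion formula; row 1 triggers the fallback
toy = pd.DataFrame({
    "owner": [500.0, 900.0], "owner_moe": [50.0, 20.0],
    "hu": [1000.0, 1000.0], "hu_moe": [60.0, 100.0],
})
toy["p_owner_moe"] = proportion_moe(toy["owner"], toy["owner_moe"],
                                    toy["hu"], toy["hu_moe"])
print(toy)
```

Without the fallback, tracts like the second row would drop out of the significance counts, because comparisons against NaN are always false.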

In [None]:
df_join

### Test if change between 2009 and 2018 is statistically significant
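
The test in this section converts each 90% MOE back to a standard error (SE = MOE / 1.645), divides the difference between the two estimates by the square root of the sum of the squared SEs, and compares the resulting z-statistic with 1.645. A minimal sketch with made-up shares; the function name `z_score` is hypothetical.

```python
import numpy as np

Z_90 = 1.645  # critical value at the 90% confidence level


def z_score(est_new, moe_new, est_old, moe_old):
    """z-statistic for the difference between two estimates, from 90% MOEs."""
    se_new = moe_new / Z_90
    se_old = moe_old / Z_90
    return (est_new - est_old) / np.sqrt(se_new ** 2 + se_old ** 2)


# Made-up shares: 40% (MOE 3 points) in 2009 vs. 50% (MOE 4 points) in 2018
z = z_score(0.50, 0.04, 0.40, 0.03)
print(round(float(z), 3), abs(z) > Z_90)
```

A z-statistic above 1.645 counts as a statistically significant increase, and one below -1.645 as a statistically significant decrease.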

In [None]:
for field in fields:
    f = field[0]
    print(f)
    
    # calculate percentage-point change and its MOE (root sum of squares of the two MOEs)
    df_join['p_'+f+'_change'] = df_join['p_'+f+'_2018'] - df_join['p_'+f+'_2009']
    df_join['p_'+f+'_change_moe'] = np.sqrt(df_join['p_'+f+'_2018_moe']**2 + df_join['p_'+f+'_2009_moe']**2)
    
    # calculate z-statistic for the percentage-point change
    df_join['z_'+f] = (df_join['p_'+f+'_2018'] - 
                           df_join['p_'+f+'_2009']) / np.sqrt(((df_join['p_'+f+'_2018_moe']/1.645)**2) + 
                                                               ((df_join['p_'+f+'_2009_moe']/1.645)**2))
    
    # statistically significant increase
    print("Statistically significant increase:", len(df_join[df_join['z_'+f]>1.645]))
    df_join['s_incr_'+f] = np.where(df_join['z_'+f]>1.645, 1, 0)
    
    # statistically significant decrease
    print("Statistically significant decrease:", len(df_join[df_join['z_'+f]<-1.645]))
    df_join['s_decr_'+f] = np.where(df_join['z_'+f]<-1.645, 1, 0)
    
    # no statistically significant change
    print("No statistically significant change:", len(df_join[(df_join['z_'+f]<=1.645)&(df_join['z_'+f]>=-1.645)]))
    print("")

### Correlation 

Do we see any correlation between changes at the census tract level?

In [None]:
df_join[['s_incr_owner', 's_incr_nhwhite']].corr()

In [None]:
df_join[['s_incr_nhwhite', 's_decr_nhblack']].corr()
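
Because the flags are 0/1 indicators, the Pearson correlation returned by `.corr()` is the phi coefficient, and a cross-tabulation shows the counts behind it. A minimal sketch on made-up flags for six tracts:

```python
import pandas as pd

# Made-up flags for six tracts (1 = statistically significant change)
toy = pd.DataFrame({
    "s_incr_nhwhite": [1, 1, 0, 0, 1, 0],
    "s_decr_nhblack": [1, 0, 0, 0, 1, 0],
})

# Counts of tracts in each combination of the two flags
table = pd.crosstab(toy["s_incr_nhwhite"], toy["s_decr_nhblack"])
# Pearson correlation of two binary columns (the phi coefficient)
phi = toy["s_incr_nhwhite"].corr(toy["s_decr_nhblack"])
print(table)
print(round(phi, 3))
```

The table makes it easier to see whether a correlation is driven by many tracts changing together or by only a handful.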