### DataDive September 2021
### Housing Insecurity project

Notebook for exploring American Community Survey (ACS) data.

Analyze the variable codes and create a mapping between the 2014 and 2019 data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys, os
import seaborn as sns
import geopandas as gpd

In [2]:
data_dir = './data/housing_insecurity/ACS/'
print(os.listdir(data_dir))

['acs5_variable_dict_2014_2019.csv', 'acs5_variable_dict_2014_2019.xlsx', 'analysis', 'hillsborough_acs5-2014_census(1).xlsx', 'hillsborough_acs5-2014_census.csv', 'hillsborough_acs5-2014_census.xlsx', 'hillsborough_acs5-2019_census.csv', 'kept_2014_dict_labels.csv', 'label_mismatch.csv', 'miami_dade_acs5-2014_census.csv', 'miami_dade_acs5-2019_census.csv', 'new_2019_dict_labels.csv', 'orange_acs5-2014_census.csv', 'orange_acs5-2019_census.csv', 'variables.json']


In [3]:
output_dir = './output/'

#### Load data

In [4]:
data_file = data_dir + 'acs5_variable_dict_2014_2019.csv'
acs_dict = pd.read_csv(data_file)
acs_dict.shape

(8264, 8)

In [5]:
acs_dict.head()

Unnamed: 0,variable_code,label,concept,predicateType,group,limit,predicateOnly,acs_year
0,DP02_0019EA,Annotation of Estimate!!RELATIONSHIP!!Populati...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
1,DP02_0126E,Estimate!!ANCESTRY!!Total population!!Danish,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014
2,DP02_0072EA,Annotation of Estimate!!DISABILITY STATUS OF T...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
3,DP02_0069PMA,Annotation of Percent Margin of Error!!VETERAN...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,string,DP02,0,True,2014
4,DP02_0126M,Margin of Error!!ANCESTRY!!Total population!!D...,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,int,DP02,0,True,2014


##### Variable Codes

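Variable codes appear to contain three types of information (for example, `DP02_0019EA`):
- DPxx: concept (matches "group"), e.g. `DP02`
- variable description: a 4-digit number, e.g. `0019`
- measurement type (estimate, percent, error): a letter suffix, e.g. `EA`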

In [6]:
acs_dict['code_p1'] = acs_dict['variable_code'].str[5:9]
acs_dict['code_p2'] = acs_dict['variable_code'].str[9:]
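The fixed-position slices above assume every code has the form `DPxx_NNNN` followed by a letter suffix. A regex-based split makes that assumption explicit and returns `NaN` for codes that do not match. This is a minimal sketch on a small hypothetical frame (`sample`) built from codes shown in the table above; the column name `code_group` is illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample using variable codes shown in the head() output above
sample = pd.DataFrame({'variable_code': ['DP02_0019EA', 'DP02_0126E', 'DP02_0069PMA']})

# Named groups: table group (e.g. DP02), 4-digit variable number, letter suffix (E, EA, PMA, ...)
pattern = r'^(?P<code_group>[A-Z]+\d+)_(?P<code_p1>\d{4})(?P<code_p2>[A-Z]+)$'
parts = sample['variable_code'].str.extract(pattern)

# Attach the extracted pieces as new columns
sample = sample.join(parts)
print(sample)
```

Rows that fail the pattern can then be inspected with `sample[sample['code_p1'].isna()]` before any filtering on `code_p2`.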


##### Labels

Labels appear to contain the following:
- measurement type
- variable description (in CAPS)
- specific value or category

In [7]:
temp = acs_dict['label'].str.upper().str.split('!!')

acs_dict['measurement'] = temp.apply(lambda x: x[0])
acs_dict['label_p1'] = temp.apply(lambda x: x[1])
acs_dict['label_p2'] = temp.apply(lambda x: ', '.join(x[2:]))
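The `apply` calls above would raise an `IndexError` for any label that lacks a `!!` separator. As an alternative sketch, `str.split` with `n=2` and `expand=True` splits each label into at most three columns and leaves missing pieces empty instead of failing. The sample labels below are hypothetical, modeled on those in the table above. Unlike the original cell, a label with no further levels gets a missing value for `label_p2` rather than an empty string.

```python
import pandas as pd

labels = pd.Series([
    'Estimate!!ANCESTRY!!Total population!!Danish',
    'Margin of Error!!ANCESTRY!!Total population!!Danish',
    'Estimate',  # hypothetical label with no separator
])

# Split on the first two '!!' only; everything after stays in the third column
parts = labels.str.upper().str.split('!!', n=2, expand=True)
parts.columns = ['measurement', 'label_p1', 'label_p2']

# Match the notebook's formatting of the remaining levels
parts['label_p2'] = parts['label_p2'].str.replace('!!', ', ', regex=False)
print(parts)
```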

In [8]:
keep_vars = ['variable_code','concept','group','acs_year','code_p1','code_p2','measurement','label_p1','label_p2']
acs_dict = acs_dict[keep_vars]

In [9]:
acs_dict.head()

Unnamed: 0,variable_code,concept,group,acs_year,code_p1,code_p2,measurement,label_p1,label_p2
0,DP02_0019EA,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,DP02,2014,19,EA,ANNOTATION OF ESTIMATE,RELATIONSHIP,"POPULATION IN HOUSEHOLDS, SPOUSE"
1,DP02_0126E,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,DP02,2014,126,E,ESTIMATE,ANCESTRY,"TOTAL POPULATION, DANISH"
2,DP02_0072EA,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,DP02,2014,72,EA,ANNOTATION OF ESTIMATE,DISABILITY STATUS OF THE CIVILIAN NONINSTITUTI...,UNDER 18 YEARS
3,DP02_0069PMA,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,DP02,2014,69,PMA,ANNOTATION OF PERCENT MARGIN OF ERROR,VETERAN STATUS,"CIVILIAN POPULATION 18 YEARS AND OVER, CIVILIA..."
4,DP02_0126M,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...,DP02,2014,126,M,MARGIN OF ERROR,ANCESTRY,"TOTAL POPULATION, DANISH"


In [10]:
print("Number of unique Groups:    ", len(acs_dict['group'].unique()))
print("Number of unique Concepts:  ", len(acs_dict['concept'].unique()))
print("Number of unique Measurements:  ", len(acs_dict['measurement'].unique()))

print("Number of unique Label-1:  ", len(acs_dict['label_p1'].unique()))
print("Number of unique Label-2:  ", len(acs_dict['label_p2'].unique()))


Number of unique Groups:     4
Number of unique Concepts:   4
Number of unique Measurements:   8
Number of unique Label-1:   49
Number of unique Label-2:   657


In [11]:
acs_dict['measurement'].value_counts()

PERCENT                                  1033
ESTIMATE                                 1033
MARGIN OF ERROR                          1033
ANNOTATION OF MARGIN OF ERROR            1033
ANNOTATION OF ESTIMATE                   1033
PERCENT MARGIN OF ERROR                  1033
ANNOTATION OF PERCENT MARGIN OF ERROR    1033
ANNOTATION OF PERCENT                    1033
Name: measurement, dtype: int64

#### Comparison between 2014 and 2019

In [12]:
# .copy() creates independent frames, so adding columns below does not trigger SettingWithCopyWarning
vars_2014 = acs_dict[(acs_dict['acs_year']==2014) & (acs_dict['code_p2']=='E')].copy()
vars_2019 = acs_dict[(acs_dict['acs_year']==2019) & (acs_dict['code_p2']=='E')].copy()

In [13]:
vars_2014['variable_code_pref'] = vars_2014['variable_code'].str[0:9]
vars_2019['variable_code_pref'] = vars_2019['variable_code'].str[0:9]


In [14]:
print("2014 Number of unique Label-1:  ", len(vars_2014['label_p1'].unique()))
print("2019 Number of unique Label-1:  ", len(vars_2019['label_p1'].unique()))
print()
print("2014 Number of unique Label-2:  ", len(vars_2014['label_p2'].unique()))
print("2019 Number of unique Label-2:  ", len(vars_2019['label_p2'].unique()))


2014 Number of unique Label-1:   46
2019 Number of unique Label-1:   48

2014 Number of unique Label-2:   484
2019 Number of unique Label-2:   499


In [15]:
# New Label-1
set(vars_2019['label_p1'].unique()).difference(set(vars_2014['label_p1'].unique()))


{'CITIZEN, VOTING AGE POPULATION',
 'INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)',
 'RACE ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES'}

In [16]:
# Removed Label-1
set(vars_2014['label_p1'].unique()).difference(set(vars_2019['label_p1'].unique()))


{'INCOME AND BENEFITS (IN 2014 INFLATION-ADJUSTED DOLLARS)'}
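The only removed Label-1 differs from its 2019 counterpart in the dollar year alone. One way to keep such pairs from appearing as differences is to mask the year before comparing. This is a minimal sketch with a hypothetical helper, `normalize_year`, applied to small illustrative sets that reuse label strings from the outputs above.

```python
import re

def normalize_year(label):
    """Replace the year in '(IN 20xx INFLATION-ADJUSTED DOLLARS)' with a placeholder."""
    return re.sub(r'\(IN \d{4} INFLATION-ADJUSTED DOLLARS\)',
                  '(IN YYYY INFLATION-ADJUSTED DOLLARS)', label)

# Illustrative subsets of the Label-1 values, not the full sets
labels_2014 = {'INCOME AND BENEFITS (IN 2014 INFLATION-ADJUSTED DOLLARS)', 'ANCESTRY'}
labels_2019 = {'INCOME AND BENEFITS (IN 2019 INFLATION-ADJUSTED DOLLARS)', 'ANCESTRY',
               'CITIZEN, VOTING AGE POPULATION'}

norm_2014 = {normalize_year(label) for label in labels_2014}
norm_2019 = {normalize_year(label) for label in labels_2019}

print(norm_2019 - norm_2014)  # labels new in 2019 after masking the year
print(norm_2014 - norm_2019)  # labels removed since 2014 after masking the year
```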

In [19]:
# New Label-2
#set(vars_2019['label_p2'].unique()).difference(set(vars_2014['label_p2'].unique()))


In [20]:
# Removed Label-2
#set(vars_2014['label_p2'].unique()).difference(set(vars_2019['label_p2'].unique()))

In many cases, the label text has changed slightly but the underlying concept remains similar.
In other cases, the underlying variable has changed (different bin boundaries, new categories) and the user will have to determine whether the data are comparable.
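
To help separate slightly reworded labels from genuinely new ones, string similarity can shortlist candidates for manual review. This is a rough sketch using the standard library's `difflib`. The label strings and the 0.8 cutoff are illustrative assumptions, not values taken from the data, and a high similarity score does not by itself establish that two variables are comparable.

```python
from difflib import get_close_matches

# Hypothetical Label-2 values; real ones would come from vars_2014 / vars_2019
labels_2014 = ['POPULATION IN HOUSEHOLDS, SPOUSE', 'TOTAL POPULATION, DANISH']
labels_2019 = ['TOTAL POPULATION IN HOUSEHOLDS, SPOUSE', 'TOTAL POPULATION, DANISH',
               'TOTAL HOUSEHOLDS, WITH A COMPUTER']

# Labels that appear in only one of the two years
only_2019 = sorted(set(labels_2019) - set(labels_2014))
only_2014 = sorted(set(labels_2014) - set(labels_2019))

# For each 2019-only label, look for the closest 2014-only label above the cutoff
candidates = {
    label: get_close_matches(label, only_2014, n=1, cutoff=0.8)
    for label in only_2019
}

for label, match in candidates.items():
    status = f'possible rename of {match[0]!r}' if match else 'no close 2014 label'
    print(f'{label!r}: {status}')
```

Labels flagged as possible renames still need a manual check of bin boundaries and categories before the two years are treated as the same variable.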


#### Output

Export data for analysis and manual data mapping

In [45]:
output_vars = ['variable_code_pref','concept','group','acs_year','code_p1','label_p1','label_p2']

outfile = output_dir + 'acs_vars_2014.csv'
vars_2014[output_vars].to_csv(outfile)
print('Exported to: ', outfile)

outfile = output_dir + 'acs_vars_2019.csv'
vars_2019[output_vars].to_csv(outfile)
print('Exported to: ', outfile)


Exported to:  ./output/acs_vars_2014.csv
Exported to:  ./output/acs_vars_2019.csv


#### Analysis by County

In [46]:
orange_2014 = pd.read_csv(data_dir+'orange_acs5-2014_census.csv')
orange_2019 = pd.read_csv(data_dir+'orange_acs5-2019_census.csv')
print(orange_2014.shape, orange_2019.shape)

(207, 1555) (207, 1613)
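
The two files have the same number of rows but different column counts (1555 vs 1613), so some variables exist in only one year. This is a minimal sketch of how the column sets could be compared. The small frames and their column names are hypothetical stand-ins for `orange_2014` and `orange_2019`.

```python
import pandas as pd

# Hypothetical stand-ins; the real frames would be orange_2014 and orange_2019
df_2014 = pd.DataFrame(columns=['GEO_ID', 'DP02_0001E', 'DP02_0002E'])
df_2019 = pd.DataFrame(columns=['GEO_ID', 'DP02_0001E', 'DP02_0002E', 'DP02_0153E'])

cols_2014, cols_2019 = set(df_2014.columns), set(df_2019.columns)

shared = sorted(cols_2014 & cols_2019)     # columns present in both years
only_2014 = sorted(cols_2014 - cols_2019)  # columns dropped after 2014
only_2019 = sorted(cols_2019 - cols_2014)  # columns added by 2019

print(len(shared), only_2014, only_2019)
```

A shared column name does not guarantee a shared meaning, so the shared list is best cross-checked against the variable dictionary labels for each year.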
