# Selected Economic Characteristics: Income from the American Community Survey

**[Work in progress]**

This notebook downloads [selected economic characteristics (DP03)](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP03) from the American Community Survey 2018 5-Year Data.

Data source: [American Community Survey 5-Year Data 2018](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
import time

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


## Download selected variables

* [Selected economic characteristics for US](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP03)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP03.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP03/)

* [Description of variables](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2018_ACSSubjectDefinitions.pdf)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP03 group and assign property names

Names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations).
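
A quick check can catch a mistyped name before any CSV file is written. The sketch below assumes the convention amounts to lower camelCase made of ASCII letters and digits; the pattern and the helper `invalid_property_names` are illustrative and are not part of Neo4j or of this notebook.

```python
import re

# Assumption: a property name is lower camelCase, i.e. it starts with a
# lowercase letter and contains only ASCII letters and digits.
CAMEL_CASE = re.compile(r'[a-z][A-Za-z0-9]*')


def invalid_property_names(names):
    """Return the names that do not match the assumed camelCase pattern."""
    return [name for name in names if CAMEL_CASE.fullmatch(name) is None]


# A small sample in the style of the `variables` mapping
sample = {
    'DP03_0051E': 'totalHouseholds',
    'DP03_0052PE': 'householdIncomeLessThan10000USDPct',
    'DP03_0062E': 'median_household_income',  # deliberately wrong
}
print(invalid_property_names(sample.values()))
```

Only the snake_case entry is reported; an empty list would mean every name passes the assumed pattern.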

In [4]:
variables = {# INCOME AND BENEFITS
             'DP03_0051E': 'totalHouseholds',
             'DP03_0052E': 'householdIncomeLessThan10000USD',
             'DP03_0052PE': 'householdIncomeLessThan10000USDPct',
             'DP03_0053E': 'householdIncome10000To14999USD',
             'DP03_0053PE': 'householdIncome10000To14999USDPct',
             'DP03_0054E': 'householdIncome15000To24999USD',
             'DP03_0054PE': 'householdIncome15000To24999USDPct',
             'DP03_0055E': 'householdIncome25000To34999USD',
             'DP03_0055PE': 'householdIncome25000To34999USDPct',
             'DP03_0056E': 'householdIncome35000To49999USD',
             'DP03_0056PE': 'householdIncome35000To49999USDPct',
             'DP03_0057E': 'householdIncome50000To74999USD',
             'DP03_0057PE': 'householdIncome50000To74999USDPct',
             'DP03_0058E': 'householdIncome75000To99999USD',
             'DP03_0058PE': 'householdIncome75000To99999USDPct',
             'DP03_0059E': 'householdIncome100000To149999USD',
             'DP03_0059PE': 'householdIncome100000To149999USDPct',
             'DP03_0060E': 'householdIncome150000To199999USD',
             'DP03_0060PE': 'householdIncome150000To199999USDPct',
             'DP03_0061E': 'householdIncomeMoreThan200000USD',
             'DP03_0061PE': 'householdIncomeMoreThan200000USDPct',
             'DP03_0062E': 'medianHouseholdIncomeUSD',
             'DP03_0063E': 'meanHouseholdIncomeUSD',
            }

In [5]:
fields = ",".join(variables.keys())

In [6]:
# Print one Cypher assignment per property: percentages as floats, everything else as integers
for v in variables.values():
    if 'Pct' in v:
        print('i.' + v + ' = toFloat(row.' + v + '),')
    else:
        print('i.' + v + ' = toInteger(row.' + v + '),')

i.totalHouseholds = toInteger(row.totalHouseholds),
i.householdIncomeLessThan10000USD = toInteger(row.householdIncomeLessThan10000USD),
i.householdIncomeLessThan10000USDPct = toFloat(row.householdIncomeLessThan10000USDPct),
i.householdIncome10000To14999USD = toInteger(row.householdIncome10000To14999USD),
i.householdIncome10000To14999USDPct = toFloat(row.householdIncome10000To14999USDPct),
i.householdIncome15000To24999USD = toInteger(row.householdIncome15000To24999USD),
i.householdIncome15000To24999USDPct = toFloat(row.householdIncome15000To24999USDPct),
i.householdIncome25000To34999USD = toInteger(row.householdIncome25000To34999USD),
i.householdIncome25000To34999USDPct = toFloat(row.householdIncome25000To34999USDPct),
i.householdIncome35000To49999USD = toInteger(row.householdIncome35000To49999USD),
i.householdIncome35000To49999USDPct = toFloat(row.householdIncome35000To49999USDPct),
i.householdIncome50000To74999USD = toInteger(row.householdIncome50000To74999USD),
i.householdIncome50000
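
Every printed line ends with a comma, so the final comma presumably has to be removed by hand when the lines are pasted into a Cypher statement. A minimal sketch that joins the assignments instead; `cypher_assignments` is a hypothetical helper, shown here with a two-entry sample rather than the full mapping:

```python
def cypher_assignments(names, node='i', row='row'):
    """Build comma-separated Cypher assignments without a trailing comma."""
    lines = []
    for name in names:
        # Same rule as the loop above: 'Pct' properties are floats, others integers
        cast = 'toFloat' if 'Pct' in name else 'toInteger'
        lines.append(f'{node}.{name} = {cast}({row}.{name})')
    return ',\n'.join(lines)


print(cypher_assignments(['totalHouseholds', 'householdIncomeLessThan10000USDPct']))
```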

In [7]:
print(len(variables.keys()))

23


## Download county-level data using US Census API

In [8]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'

In [9]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
0,DP03_0051E,DP03_0052E,DP03_0052PE,DP03_0053E,DP03_0053PE,DP03_0054E,DP03_0054PE,DP03_0055E,DP03_0055PE,DP03_0056E,DP03_0056PE,DP03_0057E,DP03_0057PE,DP03_0058E,DP03_0058PE,DP03_0059E,DP03_0059PE,DP03_0060E,DP03_0060PE,DP03_0061E,DP03_0061PE,DP03_0062E,DP03_0063E,state,county
1,1170,55,4.7,23,2.0,120,10.3,122,10.4,100,8.5,240,20.5,150,12.8,174,14.9,96,8.2,90,7.7,66907,87391,02,195
2,1692,183,10.8,124,7.3,257,15.2,271,16.0,271,16.0,303,17.9,123,7.3,104,6.1,45,2.7,11,0.7,35539,46732,02,158
3,1520767,89567,5.9,57241,3.8,128278,8.4,139146,9.1,202904,13.3,281762,18.5,195265,12.8,230321,15.1,94440,6.2,101843,6.7,61606,84884,04,013
4,10782,1059,9.8,604,5.6,1073,10.0,974,9.0,1516,14.1,2213,20.5,1615,15.0,1326,12.3,288,2.7,114,1.1,51352,59370,04,009


##### Add column names

In [10]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
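
The column names are assigned by position, which relies on the API returning the columns in the order they were requested, followed by the geography columns. A minimal sketch of a check that could run before the rename; `check_header` is a hypothetical helper, and the sample header is a shortened version of the label row shown above:

```python
def check_header(header, requested, geo_columns):
    """Raise ValueError if the API header differs from the expected order."""
    expected = list(requested) + list(geo_columns)
    if list(header) != expected:
        raise ValueError(f'unexpected header: {list(header)}')


# Shortened sample: three requested variables plus the geography columns
requested = ['DP03_0051E', 'DP03_0052E', 'DP03_0052PE']
header = ['DP03_0051E', 'DP03_0052E', 'DP03_0052PE', 'state', 'county']
check_header(header, requested, ['state', 'county'])  # passes silently
```

In the notebook this would correspond to something like `check_header(df.iloc[0], variables.keys(), ['state', 'county'])`, called before the label row is dropped.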

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [11]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (required later to download tract data state by state).

In [12]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [13]:
df.head()

Unnamed: 0,totalHouseholds,householdIncomeLessThan10000USD,householdIncomeLessThan10000USDPct,householdIncome10000To14999USD,householdIncome10000To14999USDPct,householdIncome15000To24999USD,householdIncome15000To24999USDPct,householdIncome25000To34999USD,householdIncome25000To34999USDPct,householdIncome35000To49999USD,householdIncome35000To49999USDPct,householdIncome50000To74999USD,householdIncome50000To74999USDPct,householdIncome75000To99999USD,householdIncome75000To99999USDPct,householdIncome100000To149999USD,householdIncome100000To149999USDPct,householdIncome150000To199999USD,householdIncome150000To199999USDPct,householdIncomeMoreThan200000USD,householdIncomeMoreThan200000USDPct,medianHouseholdIncomeUSD,meanHouseholdIncomeUSD,stateFips,countyFips
1,1170,55,4.7,23,2.0,120,10.3,122,10.4,100,8.5,240,20.5,150,12.8,174,14.9,96,8.2,90,7.7,66907,87391,2,195
2,1692,183,10.8,124,7.3,257,15.2,271,16.0,271,16.0,303,17.9,123,7.3,104,6.1,45,2.7,11,0.7,35539,46732,2,158
3,1520767,89567,5.9,57241,3.8,128278,8.4,139146,9.1,202904,13.3,281762,18.5,195265,12.8,230321,15.1,94440,6.2,101843,6.7,61606,84884,4,13
4,10782,1059,9.8,604,5.6,1073,10.0,974,9.0,1516,14.1,2213,20.5,1615,15.0,1326,12.3,288,2.7,114,1.1,51352,59370,4,9
5,15430,1455,9.4,1032,6.7,2185,14.2,2236,14.5,2323,15.1,2519,16.3,1485,9.6,1314,8.5,507,3.3,374,2.4,40467,59164,4,23


In [14]:
# Example data for San Diego County (state 06, county 073)
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,totalHouseholds,householdIncomeLessThan10000USD,householdIncomeLessThan10000USDPct,householdIncome10000To14999USD,householdIncome10000To14999USDPct,householdIncome15000To24999USD,householdIncome15000To24999USDPct,householdIncome25000To34999USD,householdIncome25000To34999USDPct,householdIncome35000To49999USD,householdIncome35000To49999USDPct,householdIncome50000To74999USD,householdIncome50000To74999USDPct,householdIncome75000To99999USD,householdIncome75000To99999USDPct,householdIncome100000To149999USD,householdIncome100000To149999USDPct,householdIncome150000To199999USD,householdIncome150000To199999USDPct,householdIncomeMoreThan200000USD,householdIncomeMoreThan200000USDPct,medianHouseholdIncomeUSD,meanHouseholdIncomeUSD,stateFips,countyFips
897,1118980,51029,4.6,39441,3.5,78317,7.0,84271,7.5,122572,11.0,184805,16.5,144979,13.0,197632,17.7,98349,8.8,117585,10.5,74855,101302,6,73


In [15]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [16]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03IncomeAdmin2.csv", index=False)

## Download zip-level data using US Census API

In [17]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'
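
The geography name contains spaces, which is why `%20` appears in the URL. As an alternative to encoding by hand, the query string could be assembled with the standard library. This is a sketch only; `build_url` and `BASE_URL` are illustrative names that are not used elsewhere in the notebook:

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'


def build_url(fields, geography):
    """Build a query URL, percent-encoding spaces in the geography name."""
    # Keep ':', '*' and ',' unescaped so the result matches the hand-written URLs
    query = urlencode({'get': fields, 'for': geography},
                      safe=':*,', quote_via=quote)
    return f'{BASE_URL}?{query}'


print(build_url('DP03_0051E,DP03_0052E', 'zip code tabulation area:*'))
```

With `quote_via=quote`, spaces become `%20` rather than `+`, which keeps the result in the same form as `url_zip`.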

In [18]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
0,DP03_0051E,DP03_0052E,DP03_0052PE,DP03_0053E,DP03_0053PE,DP03_0054E,DP03_0054PE,DP03_0055E,DP03_0055PE,DP03_0056E,DP03_0056PE,DP03_0057E,DP03_0057PE,DP03_0058E,DP03_0058PE,DP03_0059E,DP03_0059PE,DP03_0060E,DP03_0060PE,DP03_0061E,DP03_0061PE,DP03_0062E,DP03_0063E,state,zip code tabulation area
1,852,112,13.1,33,3.9,141,16.5,68,8.0,88,10.3,170,20.0,96,11.3,59,6.9,41,4.8,44,5.2,46167,62595,51,23833
2,467,0,0.0,15,3.2,89,19.1,6,1.3,57,12.2,98,21.0,119,25.5,83,17.8,0,0.0,0,0.0,70852,66879,51,23850
3,5507,349,6.3,300,5.4,888,16.1,599,10.9,795,14.4,966,17.5,684,12.4,535,9.7,264,4.8,127,2.3,46049,60628,51,23851
4,1617,98,6.1,70,4.3,180,11.1,201,12.4,316,19.5,316,19.5,128,7.9,260,16.1,38,2.4,10,0.6,46265,59332,51,23890


##### Add column names

In [19]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('stateFips')
columns.append('postalCode')
df.columns = columns

In [20]:
df.head()

Unnamed: 0,totalHouseholds,householdIncomeLessThan10000USD,householdIncomeLessThan10000USDPct,householdIncome10000To14999USD,householdIncome10000To14999USDPct,householdIncome15000To24999USD,householdIncome15000To24999USDPct,householdIncome25000To34999USD,householdIncome25000To34999USDPct,householdIncome35000To49999USD,householdIncome35000To49999USDPct,householdIncome50000To74999USD,householdIncome50000To74999USDPct,householdIncome75000To99999USD,householdIncome75000To99999USDPct,householdIncome100000To149999USD,householdIncome100000To149999USDPct,householdIncome150000To199999USD,householdIncome150000To199999USDPct,householdIncomeMoreThan200000USD,householdIncomeMoreThan200000USDPct,medianHouseholdIncomeUSD,meanHouseholdIncomeUSD,stateFips,postalCode
1,852,112,13.1,33,3.9,141,16.5,68,8.0,88,10.3,170,20.0,96,11.3,59,6.9,41,4.8,44,5.2,46167,62595,51,23833
2,467,0,0.0,15,3.2,89,19.1,6,1.3,57,12.2,98,21.0,119,25.5,83,17.8,0,0.0,0,0.0,70852,66879,51,23850
3,5507,349,6.3,300,5.4,888,16.1,599,10.9,795,14.4,966,17.5,684,12.4,535,9.7,264,4.8,127,2.3,46049,60628,51,23851
4,1617,98,6.1,70,4.3,180,11.1,201,12.4,316,19.5,316,19.5,128,7.9,260,16.1,38,2.4,10,0.6,46265,59332,51,23890
5,90,17,18.9,0,0.0,0,0.0,22,24.4,0,0.0,0,0.0,13,14.4,10,11.1,28,31.1,0,0.0,-666666666,85980,51,23302
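
The median household income of `-666666666` in the last row is not a real dollar amount. It appears to be a Census placeholder for an estimate that could not be computed, which is an assumption worth confirming against the Census documentation on annotation values. A minimal sketch of blanking such values before import, using the hypothetical helper `blank_missing`:

```python
# Assumption: this value marks a missing estimate rather than a real income
MISSING_ESTIMATE = '-666666666'


def blank_missing(value, sentinel=MISSING_ESTIMATE):
    """Return an empty string for the sentinel, otherwise the value unchanged."""
    return '' if value == sentinel else value


# Median and mean household income from the last row shown above
row = {'medianHouseholdIncomeUSD': '-666666666', 'meanHouseholdIncomeUSD': '85980'}
cleaned = {name: blank_missing(value) for name, value in row.items()}
print(cleaned)
```

On the DataFrame itself, `df.replace('-666666666', '')` would have a similar effect, since every column is read as a string and missing values are already stored as empty strings.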


In [21]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,totalHouseholds,householdIncomeLessThan10000USD,householdIncomeLessThan10000USDPct,householdIncome10000To14999USD,householdIncome10000To14999USDPct,householdIncome15000To24999USD,householdIncome15000To24999USDPct,householdIncome25000To34999USD,householdIncome25000To34999USDPct,householdIncome35000To49999USD,householdIncome35000To49999USDPct,householdIncome50000To74999USD,householdIncome50000To74999USDPct,householdIncome75000To99999USD,householdIncome75000To99999USDPct,householdIncome100000To149999USD,householdIncome100000To149999USDPct,householdIncome150000To199999USD,householdIncome150000To199999USDPct,householdIncomeMoreThan200000USD,householdIncomeMoreThan200000USDPct,medianHouseholdIncomeUSD,meanHouseholdIncomeUSD,stateFips,postalCode
24526,8036,509,6.3,269,3.3,366,4.6,214,2.7,258,3.2,485,6.0,745,9.3,1287,16.0,729,9.1,3174,39.5,143542,285015,6,90210


In [22]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [23]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03IncomeZip.csv", index=False)

## Download tract-level data using US Census API
Tract-level data are only available by state, so we need to loop over all states.

In [24]:
def get_tract_data(state):
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    # wait one second between requests to go easy on the API
    time.sleep(1)
    # skip first row of labels
    df = df[1:].copy()
    # Add column names
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df

In [25]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)

In [26]:
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
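
The concatenation produces the 11-character tract identifier: 2 digits for the state, 3 for the county, and 6 for the tract. It depends on all three columns being strings that keep their leading zeros. The sketch below makes that explicit; `tract_geoid` is a hypothetical helper, and the zero-padding is an added safeguard that the notebook itself does not apply:

```python
def tract_geoid(state_fips, county_fips, tract):
    """Concatenate zero-padded FIPS parts into an 11-character tract identifier."""
    geoid = state_fips.zfill(2) + county_fips.zfill(3) + tract.zfill(6)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f'invalid tract identifier: {geoid}')
    return geoid


# San Diego County (state 06, county 073); the tract code is illustrative
print(tract_geoid('06', '073', '008339'))
```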

In [27]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [28]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

Unnamed: 0,totalHouseholds,householdIncomeLessThan10000USD,householdIncomeLessThan10000USDPct,householdIncome10000To14999USD,householdIncome10000To14999USDPct,householdIncome15000To24999USD,householdIncome15000To24999USDPct,householdIncome25000To34999USD,householdIncome25000To34999USDPct,householdIncome35000To49999USD,householdIncome35000To49999USDPct,householdIncome50000To74999USD,householdIncome50000To74999USDPct,householdIncome75000To99999USD,householdIncome75000To99999USDPct,householdIncome100000To149999USD,householdIncome100000To149999USDPct,householdIncome150000To199999USD,householdIncome150000To199999USDPct,householdIncomeMoreThan200000USD,householdIncomeMoreThan200000USDPct,medianHouseholdIncomeUSD,meanHouseholdIncomeUSD,stateFips,countyFips,tract,source,aggregationLevel
6,819,137,16.7,52,6.3,46,5.6,70,8.5,26,3.2,63,7.7,162,19.8,157,19.2,70,8.5,36,4.4,77039,80888,6,73,6073008339,American Community Survey 5 year,Tract
7,2180,49,2.2,0,0.0,102,4.7,161,7.4,103,4.7,267,12.2,290,13.3,508,23.3,352,16.1,348,16.0,112368,130806,6,73,6073008347,American Community Survey 5 year,Tract
8,3200,37,1.2,51,1.6,71,2.2,87,2.7,336,10.5,466,14.6,422,13.2,839,26.2,572,17.9,319,10.0,110455,121883,6,73,6073008354,American Community Survey 5 year,Tract
9,2116,177,8.4,89,4.2,103,4.9,138,6.5,255,12.1,381,18.0,252,11.9,428,20.2,165,7.8,128,6.0,63625,90850,6,73,6073008505,American Community Survey 5 year,Tract
10,3096,231,7.5,207,6.7,170,5.5,195,6.3,369,11.9,449,14.5,271,8.8,335,10.8,302,9.8,567,18.3,71875,142032,6,73,6073017604,American Community Survey 5 year,Tract


### Save data

In [29]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03IncomeTract.csv", index=False)

In [30]:
df.shape

(73056, 28)