# Demographic data from the American Community Survey

**[Work in progress]**

This notebook downloads [selected demographic data (DP05)](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP05) from the American Community Survey 2018 5-Year Data.

Data source: [American Community Survey 5-Year Data 2018](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
import time

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
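`os.getenv` returns `None` when a variable is not set, and `Path(None)` then fails with a `TypeError` that does not name the missing variable. A minimal sketch of a more explicit check follows; the helper name `get_import_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_import_dir(var_name='NEO4J_IMPORT'):
    """Return the directory named by an environment variable as a Path.

    Hypothetical helper: raises a descriptive error instead of the
    TypeError that Path(None) would produce when the variable is unset.
    """
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f'Environment variable {var_name} is not set')
    return Path(value)


# Usage (assumes the variable has been exported before starting Jupyter):
# NEO4J_IMPORT = get_import_dir()
```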


## Download selected variables

* [Selected demographic estimates for US](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP05)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP05.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP05/)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP05 group and assign property names

Names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations).

The counts within each subgroup below (sex, age, race, ethnicity) add up to the total population.

In [4]:
variables = {'DP05_0001E': 'totalPopulation',
             
             # sex
             'DP05_0002E': 'male',
             'DP05_0003E': 'female',
             
             # age
             'DP05_0005E': 'age0_4',
             'DP05_0006E': 'age5_9',
             'DP05_0007E': 'age10_14',
             'DP05_0008E': 'age15_19',
             'DP05_0009E': 'age20_24',
             'DP05_0010E': 'age25_34',
             'DP05_0011E': 'age35_44',
             'DP05_0012E': 'age45_54',
             'DP05_0013E': 'age55_59',
             'DP05_0014E': 'age60_64',
             'DP05_0015E': 'age65_74',
             'DP05_0016E': 'age75_84',
             'DP05_0017E': 'age85_',

             # race
             'DP05_0037E': 'white',
             'DP05_0038E': 'blackOrAfricanAmerican',
             'DP05_0039E': 'americanIndianAndAlaskaNative',
             'DP05_0044E': 'asian',
             'DP05_0052E': 'nativeHawaiianAndOtherPacificIslander',
             'DP05_0057E': 'otherRace',
             'DP05_0058E': 'twoOrMoreRaces',
             
             # ethnicity
             'DP05_0071E': 'hispanicOrLatino',
             'DP05_0076E': 'notHispanicOrLatino'
            }

In [5]:
fields = ",".join(variables.keys())

## Download county-level data using US Census API

In [6]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'
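The county, ZIP code, and tract requests in this notebook differ only in their `for` and `in` query parameters. As a sketch of how such URLs could be assembled in one place (the helper `build_url` is hypothetical; the notebook itself uses f-strings), `urllib.parse.urlencode` can handle the percent-encoding of spaces:

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'


def build_url(codes, geography, within=None):
    """Assemble a request URL for the ACS 5-year profile endpoint.

    codes:     variable codes, e.g. ['DP05_0001E', 'DP05_0002E']
    geography: value of the 'for' parameter, e.g. 'county:*'
    within:    optional value of the 'in' parameter, e.g. 'state:06'
    """
    params = {'get': ','.join(codes), 'for': geography}
    if within:
        params['in'] = within
    # Keep ',', ':' and '*' readable; encode spaces as %20 rather than '+'
    return BASE_URL + '?' + urlencode(params, safe=',:*', quote_via=quote)


codes = ['DP05_0001E', 'DP05_0002E']
print(build_url(codes, 'county:*'))
print(build_url(codes, 'zip code tabulation area:*'))
print(build_url(codes, 'tract:*', within='state:06'))
```

Only two variable codes are used here to keep the printed URLs short; the full list would come from `variables.keys()`.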

In [7]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26
0,DP05_0001E,DP05_0002E,DP05_0003E,DP05_0005E,DP05_0006E,DP05_0007E,DP05_0008E,DP05_0009E,DP05_0010E,DP05_0011E,DP05_0012E,DP05_0013E,DP05_0014E,DP05_0015E,DP05_0016E,DP05_0017E,DP05_0037E,DP05_0038E,DP05_0039E,DP05_0044E,DP05_0052E,DP05_0057E,DP05_0058E,DP05_0071E,DP05_0076E,state,county
1,3255,1649,1606,206,217,197,198,302,320,334,334,259,187,462,188,51,2346,23,237,302,17,6,324,337,2918,02,195
2,8198,4388,3810,1086,867,894,750,654,1262,724,788,448,269,299,138,19,319,60,7480,32,2,3,302,81,8117,02,158
3,4253913,2104145,2149768,279161,288352,298170,288250,288489,616000,557229,541839,251853,229293,358732,183578,72967,3301814,232690,82699,176740,9405,296569,153996,1311091,2942822,04,013
4,37879,20330,17549,2819,2781,2965,2957,2826,5722,4751,4161,1814,2071,2882,1562,568,29917,752,4808,246,51,996,1109,12453,25426,04,009


##### Add column names

In [8]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
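The cell above assigns column names by position, which relies on the API returning the columns in the requested order followed by the geography columns. A possible alternative, sketched here on an abbreviated response in the same array-of-arrays shape (two variables only, values taken from the first rows shown earlier), is to promote the first row to a header and rename by label. The `names` mapping is an illustrative stand-in for the full `variables` dictionary.

```python
import json

import pandas as pd

# Abbreviated response: the first row holds the column labels
payload = (
    '[["DP05_0001E","DP05_0002E","state","county"],'
    '["3255","1649","02","195"],'
    '["8198","4388","02","158"]]'
)
rows = json.loads(payload)

# Illustrative mapping; in the notebook this would be `variables`
# plus the two geography columns
names = {
    'DP05_0001E': 'totalPopulation',
    'DP05_0002E': 'male',
    'state': 'stateFips',
    'county': 'countyFips',
}

# Use the first row as the header, then rename by label instead of position
sample = pd.DataFrame(rows[1:], columns=rows[0]).rename(columns=names)
print(sample)
```

Because the values are parsed from JSON strings, the FIPS codes keep their leading zeros.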

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [9]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (required later to download tract data state by state).

In [10]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [11]:
df.head()

Unnamed: 0,totalPopulation,male,female,age0_4,age5_9,age10_14,age15_19,age20_24,age25_34,age35_44,age45_54,age55_59,age60_64,age65_74,age75_84,age85_,white,blackOrAfricanAmerican,americanIndianAndAlaskaNative,asian,nativeHawaiianAndOtherPacificIslander,otherRace,twoOrMoreRaces,hispanicOrLatino,notHispanicOrLatino,stateFips,countyFips
1,3255,1649,1606,206,217,197,198,302,320,334,334,259,187,462,188,51,2346,23,237,302,17,6,324,337,2918,2,195
2,8198,4388,3810,1086,867,894,750,654,1262,724,788,448,269,299,138,19,319,60,7480,32,2,3,302,81,8117,2,158
3,4253913,2104145,2149768,279161,288352,298170,288250,288489,616000,557229,541839,251853,229293,358732,183578,72967,3301814,232690,82699,176740,9405,296569,153996,1311091,2942822,4,13
4,37879,20330,17549,2819,2781,2965,2957,2826,5722,4751,4161,1814,2071,2882,1562,568,29917,752,4808,246,51,996,1109,12453,25426,4,9
5,46584,22285,24299,3342,3836,3350,3793,3035,4924,5204,5500,2784,2925,4644,2396,851,39656,248,362,430,0,5376,512,38899,7685,4,23


In [12]:
# Example data: San Diego County, California (FIPS 06073)
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,totalPopulation,male,female,age0_4,age5_9,age10_14,age15_19,age20_24,age25_34,age35_44,age45_54,age55_59,age60_64,age65_74,age75_84,age85_,white,blackOrAfricanAmerican,americanIndianAndAlaskaNative,asian,nativeHawaiianAndOtherPacificIslander,otherRace,twoOrMoreRaces,hispanicOrLatino,notHispanicOrLatino,stateFips,countyFips
897,3302833,1661931,1640902,211969,198148,197726,209496,262118,541385,436855,420221,201666,183654,251516,127904,60175,2335447,166412,20980,390418,13903,205307,170366,1106925,2195908,6,73
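The statement that each subgroup adds up to the total population can be checked against the San Diego County row above. The sketch below hard-codes that row so it runs on its own; the `subgroups` grouping and the `subgroup_totals` helper are introduced here for illustration and are not part of the original notebook.

```python
# Values copied from the San Diego County row (stateFips 06, countyFips 073)
row = {
    'totalPopulation': 3302833,
    'male': 1661931, 'female': 1640902,
    'age0_4': 211969, 'age5_9': 198148, 'age10_14': 197726,
    'age15_19': 209496, 'age20_24': 262118, 'age25_34': 541385,
    'age35_44': 436855, 'age45_54': 420221, 'age55_59': 201666,
    'age60_64': 183654, 'age65_74': 251516, 'age75_84': 127904,
    'age85_': 60175,
    'white': 2335447, 'blackOrAfricanAmerican': 166412,
    'americanIndianAndAlaskaNative': 20980, 'asian': 390418,
    'nativeHawaiianAndOtherPacificIslander': 13903,
    'otherRace': 205307, 'twoOrMoreRaces': 170366,
    'hispanicOrLatino': 1106925, 'notHispanicOrLatino': 2195908,
}

# Columns that make up each subgroup (illustrative grouping)
subgroups = {
    'sex': ['male', 'female'],
    'age': [name for name in row if name.startswith('age')],
    'race': ['white', 'blackOrAfricanAmerican',
             'americanIndianAndAlaskaNative', 'asian',
             'nativeHawaiianAndOtherPacificIslander',
             'otherRace', 'twoOrMoreRaces'],
    'ethnicity': ['hispanicOrLatino', 'notHispanicOrLatino'],
}


def subgroup_totals(row, subgroups):
    """Sum the columns of each subgroup for a single row."""
    return {name: sum(row[col] for col in cols)
            for name, cols in subgroups.items()}


totals = subgroup_totals(row, subgroups)
# Keep only the subgroups whose sum differs from the total population
mismatches = {name: total for name, total in totals.items()
              if total != row['totalPopulation']}
print(mismatches)  # {} for this row: every subgroup adds up
```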


In [13]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [14]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP05Admin2.csv", index=False)
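FIPS codes are identifiers with significant leading zeros, so anything that reads the saved CSV back with type inference may turn `06` into `6`. The following sketch shows the effect using pandas on a small in-memory CSV (not the file written above) and one way to avoid it by declaring the code columns as strings:

```python
import io

import pandas as pd

# Small in-memory stand-in for the saved file
csv_text = 'totalPopulation,stateFips,countyFips\n3302833,06,073\n'

# Default type inference parses the codes as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the code columns as strings preserves the leading zeros
as_text = pd.read_csv(io.StringIO(csv_text),
                      dtype={'stateFips': str, 'countyFips': str})

print(inferred.loc[0, 'stateFips'])  # 6
print(as_text.loc[0, 'stateFips'])   # 06
```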

##### Data Checks for US

The list of state FIPS codes should contain 51 entries: 50 states + District of Columbia.

In [15]:
print('Number of states:', len(stateFips)) 

Number of states: 51


In [16]:
df['totalPopulation'] = df['totalPopulation'].astype(int)
df['totalPopulation'].sum()

322903030

## Download zip-level data using US Census API

In [17]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'

In [18]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26
0,DP05_0001E,DP05_0002E,DP05_0003E,DP05_0005E,DP05_0006E,DP05_0007E,DP05_0008E,DP05_0009E,DP05_0010E,DP05_0011E,DP05_0012E,DP05_0013E,DP05_0014E,DP05_0015E,DP05_0016E,DP05_0017E,DP05_0037E,DP05_0038E,DP05_0039E,DP05_0044E,DP05_0052E,DP05_0057E,DP05_0058E,DP05_0071E,DP05_0076E,state,zip code tabulation area
1,2301,1148,1153,131,209,52,120,80,325,232,340,186,134,319,121,52,1931,350,0,4,0,0,16,72,2229,51,23833
2,1268,543,725,16,53,126,145,65,56,196,225,93,37,215,41,0,851,417,0,0,0,0,0,0,1268,51,23850
3,13262,6184,7078,771,999,854,720,727,1616,1336,1718,692,1275,1500,682,372,6348,6352,18,27,0,90,427,92,13170,51,23851
4,3708,1487,2221,242,158,203,187,264,323,233,600,258,290,551,194,205,1503,2158,0,0,0,17,30,108,3600,51,23890


##### Add column names

In [19]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('stateFips')
columns.append('postalCode')
df.columns = columns

In [20]:
df.head()

Unnamed: 0,totalPopulation,male,female,age0_4,age5_9,age10_14,age15_19,age20_24,age25_34,age35_44,age45_54,age55_59,age60_64,age65_74,age75_84,age85_,white,blackOrAfricanAmerican,americanIndianAndAlaskaNative,asian,nativeHawaiianAndOtherPacificIslander,otherRace,twoOrMoreRaces,hispanicOrLatino,notHispanicOrLatino,stateFips,postalCode
1,2301,1148,1153,131,209,52,120,80,325,232,340,186,134,319,121,52,1931,350,0,4,0,0,16,72,2229,51,23833
2,1268,543,725,16,53,126,145,65,56,196,225,93,37,215,41,0,851,417,0,0,0,0,0,0,1268,51,23850
3,13262,6184,7078,771,999,854,720,727,1616,1336,1718,692,1275,1500,682,372,6348,6352,18,27,0,90,427,92,13170,51,23851
4,3708,1487,2221,242,158,203,187,264,323,233,600,258,290,551,194,205,1503,2158,0,0,0,17,30,108,3600,51,23890
5,268,114,154,45,26,27,6,0,0,70,51,26,0,17,0,0,268,0,0,0,0,0,0,147,121,51,23302


In [21]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,totalPopulation,male,female,age0_4,age5_9,age10_14,age15_19,age20_24,age25_34,age35_44,age45_54,age55_59,age60_64,age65_74,age75_84,age85_,white,blackOrAfricanAmerican,americanIndianAndAlaskaNative,asian,nativeHawaiianAndOtherPacificIslander,otherRace,twoOrMoreRaces,hispanicOrLatino,notHispanicOrLatino,stateFips,postalCode
24526,19909,9599,10310,710,1014,1482,1066,519,1956,2292,3143,1282,1093,2562,1695,1095,16911,198,21,1782,0,313,684,829,19080,6,90210


In [22]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [23]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP05Zip.csv", index=False)

##### Check Data

Note: unlike the county-level data above, this total includes ZIP codes in Puerto Rico.

In [24]:
df['totalPopulation'] = df['totalPopulation'].astype(int)
df['totalPopulation'].sum()

326274356

## Download tract-level data using US Census API
Tract-level data are only available by state, so we need to loop over all states.

In [25]:
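def get_tract_data(state):
    """Download the selected DP05 variables for all tracts in one state (2-digit FIPS code)."""
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    time.sleep(1)  # pause between requests to avoid overloading the API
    # skip first row of labels
    df = df[1:].copy()
    # add column names: requested variables first, then the geography columns
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df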

In [26]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)

In [27]:
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
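The concatenation above produces the 11-digit tract identifier: 2 digits for the state, 3 for the county, and 6 for the tract. As a sketch of that layout, the two hypothetical helpers below (not part of the original notebook) compose an identifier and split it back into its parts:

```python
def make_tract_geoid(state_fips, county_fips, tract):
    """Compose an 11-digit tract identifier, left-padding each part with zeros."""
    return state_fips.zfill(2) + county_fips.zfill(3) + tract.zfill(6)


def split_tract_geoid(geoid):
    """Split an 11-digit tract identifier into (state, county, tract)."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f'Expected 11 digits, got {geoid!r}')
    return geoid[:2], geoid[2:5], geoid[5:]


geoid = make_tract_geoid('06', '073', '008339')
print(geoid)                     # 06073008339
print(split_tract_geoid(geoid))  # ('06', '073', '008339')
```

Padding with `zfill` is only a safeguard for inputs that have lost their leading zeros; the API already returns fixed-width strings.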

In [28]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [29]:
# Example data: first tracts in San Diego County, California (FIPS 06073)
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

Unnamed: 0,totalPopulation,male,female,age0_4,age5_9,age10_14,age15_19,age20_24,age25_34,age35_44,age45_54,age55_59,age60_64,age65_74,age75_84,age85_,white,blackOrAfricanAmerican,americanIndianAndAlaskaNative,asian,nativeHawaiianAndOtherPacificIslander,otherRace,twoOrMoreRaces,hispanicOrLatino,notHispanicOrLatino,stateFips,countyFips,tract,source,aggregationLevel
6,2053,1195,858,79,42,47,107,778,479,262,151,39,23,31,10,5,1154,56,0,614,3,134,92,324,1729,6,73,6073008339,American Community Survey 5 year,Tract
7,7037,3708,3329,461,510,360,317,206,1237,970,943,488,449,752,316,28,2388,216,12,3166,29,551,675,974,6063,6,73,6073008347,American Community Survey 5 year,Tract
8,9862,4507,5355,958,561,380,347,674,2658,1434,1336,366,411,423,200,114,2812,452,0,4791,121,535,1151,1489,8373,6,73,6073008354,American Community Survey 5 year,Tract
9,6576,3398,3178,387,329,192,194,675,1130,1005,955,518,447,326,180,238,3914,178,26,1655,0,562,241,1791,4785,6,73,6073008505,American Community Survey 5 year,Tract
10,7427,3468,3959,635,455,334,527,289,1075,842,1062,389,255,656,526,382,6085,255,73,269,0,398,347,1375,6052,6,73,6073017604,American Community Survey 5 year,Tract


### Save data

In [30]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP05Tract.csv", index=False)

In [31]:
df.shape

(73056, 30)