# Selected Economic Characteristics: Occupation from the American Community Survey

**[Work in progress]**

This notebook downloads [selected economic characteristics (DP03)](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP03) from the American Community Survey 2018 5-Year Data.

Data source: [American Community Survey 5-Year Data 2018](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
import time

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.environ['NEO4J_IMPORT'])  # raises KeyError if the variable is not set
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import


## Download selected variables

* [Selected economic characteristics for US](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP03)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP03.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP03/)

* [Description of variables](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2018_ACSSubjectDefinitions.pdf)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP03 group and assign property names

Names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations).

In [4]:
variables = {# OCCUPATION
             'DP03_0026E': 'civilianEmployedPopulation16YearsAndOver',
             'DP03_0027E': 'managementBusinessScienceAndArtsOccupations',
             'DP03_0027PE': 'managementBusinessScienceAndArtsOccupationsPct',
             'DP03_0028E': 'serviceOccupations',
             'DP03_0028PE': 'serviceOccupationsPct',
             'DP03_0029E': 'salesAndOfficeOccupations',
             'DP03_0029PE': 'salesAndOfficeOccupationsPct',
             'DP03_0030E': 'naturalResourcesConstructionAndMaintenanceOccupations',
             'DP03_0030PE': 'naturalResourcesConstructionAndMaintenanceOccupationsPct',
             'DP03_0031E': 'productionTransportationAndMaterialMovingOccupations',
             'DP03_0031PE': 'productionTransportationAndMaterialMovingOccupationsPct'
            }
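
The naming convention can be checked before any data is loaded. The following is a minimal sketch, assuming that a valid property name is lower camelCase made of ASCII letters and digits only. That pattern is an assumption for illustration (Neo4j itself accepts a wider range of names), and `invalid_property_names` and `sample` are hypothetical names that are not part of the notebook.

```python
import re

# Assumption: lower camelCase, ASCII letters and digits only.
CAMEL_CASE = re.compile(r'[a-z][A-Za-z0-9]*')

def invalid_property_names(names):
    """Return the names that do not match the assumed camelCase pattern."""
    return [name for name in names if not CAMEL_CASE.fullmatch(name)]

# Two entries copied from the variables dictionary above.
sample = {
    'DP03_0026E': 'civilianEmployedPopulation16YearsAndOver',
    'DP03_0028PE': 'serviceOccupationsPct',
}
print(invalid_property_names(sample.values()))          # []
print(invalid_property_names(['service occupations']))  # ['service occupations']
```

An empty list means every property name passed the check.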

In [5]:
fields = ",".join(variables.keys())

In [6]:
for v in variables.values():
    if 'Pct' in v:
        print('o.' + v + ' = toFloat(row.' + v + '),')
    else:
        print('o.' + v + ' = toInteger(row.' + v + '),')

o.civilianEmployedPopulation16YearsAndOver = toInteger(row.civilianEmployedPopulation16YearsAndOver),
o.managementBusinessScienceAndArtsOccupations = toInteger(row.managementBusinessScienceAndArtsOccupations),
o.managementBusinessScienceAndArtsOccupationsPct = toFloat(row.managementBusinessScienceAndArtsOccupationsPct),
o.serviceOccupations = toInteger(row.serviceOccupations),
o.serviceOccupationsPct = toFloat(row.serviceOccupationsPct),
o.salesAndOfficeOccupations = toInteger(row.salesAndOfficeOccupations),
o.salesAndOfficeOccupationsPct = toFloat(row.salesAndOfficeOccupationsPct),
o.naturalResourcesConstructionAndMaintenanceOccupations = toInteger(row.naturalResourcesConstructionAndMaintenanceOccupations),
o.naturalResourcesConstructionAndMaintenanceOccupationsPct = toFloat(row.naturalResourcesConstructionAndMaintenanceOccupationsPct),
o.productionTransportationAndMaterialMovingOccupations = toInteger(row.productionTransportationAndMaterialMovingOccupations),
o.productionTransportati
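
The loop above ends every printed line with a comma, including the last one, so the output presumably needs a manual edit before it can be pasted into a Cypher `SET` clause. A possible alternative is sketched below. The helper name `cypher_set_clause` is hypothetical, the aliases `o` and `row` are taken from the printed output, and the sketch uses `endswith('Pct')` where the notebook uses a substring test.

```python
def cypher_set_clause(property_names, node='o', row='row'):
    """Build a Cypher SET clause that casts CSV strings to numbers."""
    assignments = []
    for name in property_names:
        # Percentage columns carry decimals; counts are whole numbers.
        cast = 'toFloat' if name.endswith('Pct') else 'toInteger'
        assignments.append(f'{node}.{name} = {cast}({row}.{name})')
    # Joining avoids a dangling comma after the last assignment.
    return 'SET ' + ',\n    '.join(assignments)

print(cypher_set_clause(['serviceOccupations', 'serviceOccupationsPct']))
```

Passing `variables.values()` would produce one assignment per property defined in the notebook.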

In [7]:
print(len(variables.keys()))

11


## Download county-level data using US Census API

In [8]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'

In [9]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,DP03_0026E,DP03_0027E,DP03_0027PE,DP03_0028E,DP03_0028PE,DP03_0029E,DP03_0029PE,DP03_0030E,DP03_0030PE,DP03_0031E,DP03_0031PE,state,county
1,1577,418,26.5,236,15.0,253,16.0,393,24.9,277,17.6,02,195
2,2088,623,29.8,536,25.7,388,18.6,240,11.5,301,14.4,02,158
3,2004236,753860,37.6,360919,18.0,503710,25.1,168119,8.4,217628,10.9,04,013
4,12837,3563,27.8,2693,21.0,2934,22.9,1920,15.0,1727,13.5,04,009


##### Add column names

In [10]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
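
The cell above assigns names by position, which relies on the API returning the columns in the requested order. A sketch of a more defensive variant follows: it promotes the first row to a header and renames each column by its Census code. The function name `label_columns` is hypothetical, and the sample rows are a subset of the county output shown earlier.

```python
import pandas as pd

def label_columns(raw, variables):
    """Promote the first row to a header, then rename Census codes."""
    header = raw.iloc[0].tolist()
    data = raw.iloc[1:].reset_index(drop=True)
    data.columns = header
    # Geography columns are renamed alongside the DP03 codes.
    renames = dict(variables, state='stateFips', county='countyFips')
    return data.rename(columns=renames)

raw = pd.DataFrame([
    ['DP03_0026E', 'DP03_0028E', 'state', 'county'],
    ['1577', '236', '02', '195'],
    ['2088', '536', '02', '158'],
])
codes = {'DP03_0026E': 'civilianEmployedPopulation16YearsAndOver',
         'DP03_0028E': 'serviceOccupations'}
labeled = label_columns(raw, codes)
print(labeled.columns.tolist())
```

Because the renaming is keyed on the code, a change in column order would not mislabel the data.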

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [11]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (required later to download tract data state by state).

In [12]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [13]:
df.head()

Unnamed: 0,civilianEmployedPopulation16YearsAndOver,managementBusinessScienceAndArtsOccupations,managementBusinessScienceAndArtsOccupationsPct,serviceOccupations,serviceOccupationsPct,salesAndOfficeOccupations,salesAndOfficeOccupationsPct,naturalResourcesConstructionAndMaintenanceOccupations,naturalResourcesConstructionAndMaintenanceOccupationsPct,productionTransportationAndMaterialMovingOccupations,productionTransportationAndMaterialMovingOccupationsPct,stateFips,countyFips
1,1577,418,26.5,236,15.0,253,16.0,393,24.9,277,17.6,2,195
2,2088,623,29.8,536,25.7,388,18.6,240,11.5,301,14.4,2,158
3,2004236,753860,37.6,360919,18.0,503710,25.1,168119,8.4,217628,10.9,4,13
4,12837,3563,27.8,2693,21.0,2934,22.9,1920,15.0,1727,13.5,4,9
5,17233,4370,25.4,3667,21.3,4674,27.1,1975,11.5,2547,14.8,4,23


In [14]:
# Example data for San Diego County (state 06, county 073)
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,civilianEmployedPopulation16YearsAndOver,managementBusinessScienceAndArtsOccupations,managementBusinessScienceAndArtsOccupationsPct,serviceOccupations,serviceOccupationsPct,salesAndOfficeOccupations,salesAndOfficeOccupationsPct,naturalResourcesConstructionAndMaintenanceOccupations,naturalResourcesConstructionAndMaintenanceOccupationsPct,productionTransportationAndMaterialMovingOccupations,productionTransportationAndMaterialMovingOccupationsPct,stateFips,countyFips
897,1564930,652475,41.7,304726,19.5,340038,21.7,119478,7.6,148213,9.5,6,73


In [15]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [16]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03OccupationAdmin2.csv", index=False)
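
FIPS codes are stored as strings with leading zeros, and a CSV file does not record column types. The sketch below illustrates what happens when such a file is read back with pandas, using an in-memory buffer in place of the saved file. The variable names are hypothetical, and the two sample rows reuse codes that appear in this notebook.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'stateFips': ['02', '06'], 'countyFips': ['195', '073']})
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

buffer.seek(0)
as_numbers = pd.read_csv(buffer)             # type inference drops leading zeros
buffer.seek(0)
as_strings = pd.read_csv(buffer, dtype=str)  # codes stay intact

print(as_numbers['stateFips'].tolist())  # [2, 6]
print(as_strings['stateFips'].tolist())  # ['02', '06']
```

Reading with `dtype=str` keeps the codes comparable with the string filters used elsewhere in this notebook.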

## Download zip-level data using US Census API

In [17]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'
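
The county, zip, and tract requests differ only in their `for` and `in` parameters, and the space in the geography name has to be percent-encoded by hand. One way to centralize this is sketched below with the standard library. `BASE_URL` and `build_url` are hypothetical names, and the set of characters left unencoded is chosen to reproduce the URLs used in this notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'

def build_url(fields, geography, within=None):
    """Assemble a query URL; `within` is an optional parent geography."""
    params = {'get': fields, 'for': geography}
    if within is not None:
        params['in'] = within
    # Keep ':', '*' and ',' literal and encode spaces as %20.
    query = urlencode(params, safe=':*,', quote_via=quote)
    return f'{BASE_URL}?{query}'

print(build_url('DP03_0026E', 'zip code tabulation area:*'))
```

A tract request for one state could then be written as `build_url(fields, 'tract:*', within='state:06')`.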

In [18]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,DP03_0026E,DP03_0027E,DP03_0027PE,DP03_0028E,DP03_0028PE,DP03_0029E,DP03_0029PE,DP03_0030E,DP03_0030PE,DP03_0031E,DP03_0031PE,state,zip code tabulation area
1,934,276,29.6,122,13.1,176,18.8,154,16.5,206,22.1,51,23833
2,563,204,36.2,104,18.5,70,12.4,111,19.7,74,13.1,51,23850
3,5684,1309,23.0,1187,20.9,1351,23.8,867,15.3,970,17.1,51,23851
4,1531,524,34.2,264,17.2,250,16.3,132,8.6,361,23.6,51,23890


##### Add column names

In [19]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('stateFips')
columns.append('postalCode')
df.columns = columns

In [20]:
df.head()

Unnamed: 0,civilianEmployedPopulation16YearsAndOver,managementBusinessScienceAndArtsOccupations,managementBusinessScienceAndArtsOccupationsPct,serviceOccupations,serviceOccupationsPct,salesAndOfficeOccupations,salesAndOfficeOccupationsPct,naturalResourcesConstructionAndMaintenanceOccupations,naturalResourcesConstructionAndMaintenanceOccupationsPct,productionTransportationAndMaterialMovingOccupations,productionTransportationAndMaterialMovingOccupationsPct,stateFips,postalCode
1,934,276,29.6,122,13.1,176,18.8,154,16.5,206,22.1,51,23833
2,563,204,36.2,104,18.5,70,12.4,111,19.7,74,13.1,51,23850
3,5684,1309,23.0,1187,20.9,1351,23.8,867,15.3,970,17.1,51,23851
4,1531,524,34.2,264,17.2,250,16.3,132,8.6,361,23.6,51,23890
5,120,98,81.7,22,18.3,0,0.0,0,0.0,0,0.0,51,23302


In [21]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,civilianEmployedPopulation16YearsAndOver,managementBusinessScienceAndArtsOccupations,managementBusinessScienceAndArtsOccupationsPct,serviceOccupations,serviceOccupationsPct,salesAndOfficeOccupations,salesAndOfficeOccupationsPct,naturalResourcesConstructionAndMaintenanceOccupations,naturalResourcesConstructionAndMaintenanceOccupationsPct,productionTransportationAndMaterialMovingOccupations,productionTransportationAndMaterialMovingOccupationsPct,stateFips,postalCode
24526,8687,5773,66.5,676,7.8,1943,22.4,59,0.7,236,2.7,6,90210


In [22]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [23]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03OccupationZip.csv", index=False)

## Download tract-level data using US Census API
Tract-level data are only available by state, so we need to loop over all states.

In [24]:
def get_tract_data(state):
    """Download the selected DP03 variables for all census tracts in one state."""
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    time.sleep(1)  # pause between consecutive requests
    # skip first row of labels
    df = df[1:].copy()
    # Add column names
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df
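
The loop over states issues one request per state, so a single failed request would abort the whole download. A retry wrapper is one possible safeguard. The sketch below is not part of the notebook: `fetch_with_retry` is a hypothetical helper, and `flaky_fetch` is a stand-in for `get_tract_data` that fails once so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, state, attempts=3, delay=1.0):
    """Call fetch(state), retrying with a growing pause on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(state)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)

# Stand-in for get_tract_data: the first call fails, later calls succeed.
calls = []

def flaky_fetch(state):
    calls.append(state)
    if len(calls) == 1:
        raise IOError('simulated network error')
    return f'data for state {state}'

print(fetch_with_retry(flaky_fetch, '06', delay=0.01))
```

In the notebook, the generator passed to `pd.concat` could call `fetch_with_retry(get_tract_data, state)` instead of `get_tract_data(state)`.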

In [25]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)

In [26]:
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
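
The concatenation above builds the full tract identifier from a 2-digit state code, a 3-digit county code, and a 6-digit tract code, giving 11 digits in total. The sketch below shows the same composition for a single record with a length check. `tract_geoid` is a hypothetical helper, and the sample tract code `008339` is inferred from the San Diego County example in this notebook.

```python
def tract_geoid(state_fips, county_fips, tract_code):
    """Join zero-padded parts into an 11-digit tract identifier."""
    geoid = state_fips.zfill(2) + county_fips.zfill(3) + tract_code.zfill(6)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f'unexpected tract identifier: {geoid!r}')
    return geoid

print(tract_geoid('06', '073', '008339'))  # 06073008339
```

The zero padding is only a safeguard: the API returns the codes as padded strings, which is why the plain string concatenation in the notebook works.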

In [27]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [28]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

Unnamed: 0,civilianEmployedPopulation16YearsAndOver,managementBusinessScienceAndArtsOccupations,managementBusinessScienceAndArtsOccupationsPct,serviceOccupations,serviceOccupationsPct,salesAndOfficeOccupations,salesAndOfficeOccupationsPct,naturalResourcesConstructionAndMaintenanceOccupations,naturalResourcesConstructionAndMaintenanceOccupationsPct,productionTransportationAndMaterialMovingOccupations,productionTransportationAndMaterialMovingOccupationsPct,stateFips,countyFips,tract,source,aggregationLevel
6,1282,718,56.0,197,15.4,260,20.3,23,1.8,84,6.6,6,73,6073008339,American Community Survey 5 year,Tract
7,3720,2017,54.2,678,18.2,563,15.1,189,5.1,273,7.3,6,73,6073008347,American Community Survey 5 year,Tract
8,5535,3160,57.1,761,13.7,1078,19.5,147,2.7,389,7.0,6,73,6073008354,American Community Survey 5 year,Tract
9,3349,1374,41.0,802,23.9,617,18.4,236,7.0,320,9.6,6,73,6073008505,American Community Survey 5 year,Tract
10,3501,1972,56.3,508,14.5,802,22.9,126,3.6,93,2.7,6,73,6073017604,American Community Survey 5 year,Tract


### Save data

In [29]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP03OccupationTract.csv", index=False)

In [30]:
df.shape

(73056, 16)