# Selected Housing Characteristics from the American Community Survey

**[Work in progress]**

This notebook downloads [selected housing characteristics (DP04)](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP04) from the American Community Survey 2018 5-Year Data.

Data source: [American Community Survey 5-Year Data 2018](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
import time

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/com.Neo4j.Relate/data/dbmss/dbms-8bf637fc-0d20-4d9f-9c6f-f7e72e92a4da/import
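`Path(os.getenv('NEO4J_IMPORT'))` raises a `TypeError` when the environment variable is not set, because `os.getenv` then returns `None`. The following is a minimal sketch of a more explicit check; the helper name `get_import_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_import_dir(env_var='NEO4J_IMPORT'):
    """Return the directory named by env_var, or raise a readable error.

    Hypothetical helper: the notebook itself reads the variable directly.
    """
    value = os.getenv(env_var)
    if not value:
        raise EnvironmentError(f'Environment variable {env_var} is not set')
    return Path(value)
```

With this helper, cell 3 could read `NEO4J_IMPORT = get_import_dir()` and fail early with a clear message instead of a type error.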


## Download selected variables

* [Selected housing characteristics for US](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP04)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP04.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP04/)

* [Description of variables](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2018_ACSSubjectDefinitions.pdf)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP04 group and assign property names

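Property names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations). Names that contain special characters, such as the periods in the `occupantsPerRoom...` names below, must be enclosed in backticks when they are used in Cypher.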

In [4]:
variables = {# ROOMS
             'DP04_0037E': 'medianRoomsInHousingUnit',
             
             # HOUSING TENURE
             'DP04_0046E': 'ownerOccupiedHousingUnits',
             'DP04_0046PE': 'ownerOccupiedHousingUnitsPct',
             'DP04_0047E': 'renterOccupiedHousingUnits',
             'DP04_0047PE': 'renterOccupiedHousingUnitsPct',
             'DP04_0048E': 'averageHouseholdSizeOfOwnerOccupiedUnit',
             'DP04_0049E': 'averageHouseholdSizeOfRenterOccupiedUnit',
    
             # VEHICLES AVAILABLE
             'DP04_0057E': 'occupiedHousingUnitsWithVehicles',
             'DP04_0058E': 'occupiedHousingUnitsNoVehicles',
             'DP04_0058PE': 'occupiedHousingUnitsNoVehiclesPct',
    
             # OCCUPANTS PER ROOM (names contain '.', so they must be backtick-quoted in Cypher)
             'DP04_0077E': 'occupantsPerRoom1.00orLess',
             'DP04_0077PE': 'occupantsPerRoom1.00orLessPct',
             'DP04_0078E': 'occupantsPerRoom1.01to1.50',
             'DP04_0078PE': 'occupantsPerRoom1.01to1.50Pct',
             'DP04_0079E': 'occupantsPerRoom1.51orMore',
             'DP04_0079PE': 'occupantsPerRoom1.51orMorePct'
            }

In [5]:
fields = ",".join(variables.keys())

In [6]:
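# Print Cypher SET clauses for the import query.
# Caveat: only names containing 'Pct' are cast with toFloat, so decimal
# medians and averages (e.g. medianRoomsInHousingUnit) are cast with toInteger.
for v in variables.values():
    if 'Pct' in v:
        print('h.' + v + ' = toFloat(row.' + v + '),')
    else:
        print('h.' + v + ' = toInteger(row.' + v + '),')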

h.medianRoomsInHousingUnit = toInteger(row.medianRoomsInHousingUnit),
h.ownerOccupiedHousingUnits = toInteger(row.ownerOccupiedHousingUnits),
h.ownerOccupiedHousingUnitsPct = toFloat(row.ownerOccupiedHousingUnitsPct),
h.renterOccupiedHousingUnits = toInteger(row.renterOccupiedHousingUnits),
h.renterOccupiedHousingUnitsPct = toFloat(row.renterOccupiedHousingUnitsPct),
h.averageHouseholdSizeOfOwnerOccupiedUnit = toInteger(row.averageHouseholdSizeOfOwnerOccupiedUnit),
h.averageHouseholdSizeOfRenterOccupiedUnit = toInteger(row.averageHouseholdSizeOfRenterOccupiedUnit),
h.occupiedHousingUnitsWithVehicles = toInteger(row.occupiedHousingUnitsWithVehicles),
h.occupiedHousingUnitsNoVehicles = toInteger(row.occupiedHousingUnitsNoVehicles),
h.occupiedHousingUnitsNoVehiclesPct = toFloat(row.occupiedHousingUnitsNoVehiclesPct),
h.occupantsPerRoom1.00orLess = toInteger(row.occupantsPerRoom1.00orLess),
h.occupantsPerRoom1.00orLessPct = toFloat(row.occupantsPerRoom1.00orLessPct),
h.occupantsPerRoom1.01
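The generated clauses above leave names with periods unquoted and cast medians and averages to integers. As a hedged alternative, the sketch below quotes any name that is not a plain identifier and uses `toFloat` for percentages, medians, and averages. The helper names (`quote_name`, `cypher_type`, `set_clause`) and the prefix-based type rule are assumptions for illustration, not the notebook's method.

```python
import re

# Assumption: names made only of letters, digits and underscores (not starting
# with a digit) can be used as-is; anything else is wrapped in backticks.
_PLAIN_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')


def quote_name(name):
    """Backtick-quote a property name if it contains special characters."""
    return name if _PLAIN_NAME.match(name) else f'`{name}`'


def cypher_type(name):
    """Pick a Cypher cast from the property name (illustrative rule only)."""
    if name.endswith('Pct') or name.startswith(('median', 'average')):
        return 'toFloat'
    return 'toInteger'


def set_clause(name, node='h', row='row'):
    """Build one 'node.prop = cast(row.prop)' assignment."""
    quoted = quote_name(name)
    return f'{node}.{quoted} = {cypher_type(name)}({row}.{quoted})'


example_names = ['medianRoomsInHousingUnit',
                 'ownerOccupiedHousingUnits',
                 'occupantsPerRoom1.00orLessPct']
print(',\n'.join(set_clause(name) for name in example_names))
```

Joining the clauses with `',\n'` also avoids the trailing comma that the loop above prints after the last assignment.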

## Download county-level data using US Census API

In [7]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'

In [8]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,DP04_0037E,DP04_0046E,DP04_0046PE,DP04_0047E,DP04_0047PE,DP04_0048E,DP04_0049E,DP04_0057E,DP04_0058E,DP04_0058PE,DP04_0077E,DP04_0077PE,DP04_0078E,DP04_0078PE,DP04_0079E,DP04_0079PE,state,county
1,4.6,799,68.3,371,31.7,2.90,2.31,1170,185,15.8,1135,97.0,21,1.8,14,1.2,02,195
2,3.9,1257,74.3,435,25.7,4.88,4.12,1692,1485,87.8,857,50.7,342,20.2,493,29.1,02,158
3,5.4,933112,61.4,587655,38.6,2.78,2.74,1520767,93549,6.2,1451435,95.4,47415,3.1,21917,1.4,04,013
4,5.6,7467,69.3,3315,30.7,3.11,3.15,10782,630,5.8,10123,93.9,507,4.7,152,1.4,04,009


##### Add column names

In [9]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
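Assigning `df.columns` by position relies on the API returning the columns in exactly the requested order. A sketch of a header-based alternative follows: it treats the first row of the response as the header and renames columns through a mapping. The function name `rows_to_frame` and the `geo_names` argument are hypothetical, and the sample rows are a subset of the county output shown above.

```python
import pandas as pd


def rows_to_frame(rows, variables, geo_names):
    """Convert a Census API response (list of rows, header first) to a DataFrame.

    Columns are renamed by their header label rather than by position.
    """
    header, *data = rows
    df = pd.DataFrame(data, columns=header)
    return df.rename(columns={**variables, **geo_names})


sample_rows = [['DP04_0037E', 'DP04_0046E', 'state', 'county'],
               ['4.6', '799', '02', '195']]
sample_variables = {'DP04_0037E': 'medianRoomsInHousingUnit',
                    'DP04_0046E': 'ownerOccupiedHousingUnits'}
sample = rows_to_frame(sample_rows, sample_variables,
                       {'state': 'stateFips', 'county': 'countyFips'})
print(list(sample.columns))
```

Because the values arrive as JSON strings, the FIPS codes keep their leading zeros without any extra type handling.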

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [10]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (needed later to download tract data state by state).

In [11]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [12]:
df.head()

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,countyFips
1,4.6,799,68.3,371,31.7,2.9,2.31,1170,185,15.8,1135,97.0,21,1.8,14,1.2,02,195
2,3.9,1257,74.3,435,25.7,4.88,4.12,1692,1485,87.8,857,50.7,342,20.2,493,29.1,02,158
3,5.4,933112,61.4,587655,38.6,2.78,2.74,1520767,93549,6.2,1451435,95.4,47415,3.1,21917,1.4,04,013
4,5.6,7467,69.3,3315,30.7,3.11,3.15,10782,630,5.8,10123,93.9,507,4.7,152,1.4,04,009
5,5.3,10212,66.2,5218,33.8,2.98,3.02,15430,805,5.2,14658,95.0,580,3.8,192,1.2,04,023


In [13]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,countyFips
897,5.0,593890,53.1,525090,46.9,2.9,2.83,1118980,61486,5.5,1043965,93.3,50615,4.5,24400,2.2,06,073


In [14]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [15]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Admin2.csv", index=False)
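FIPS codes are stored as text with leading zeros (for example `'06'` and `'073'`). If the saved CSV is later read back with pandas, the default type inference turns these codes into integers and drops the zeros. The sketch below illustrates the difference with an in-memory buffer; the sample values are taken from the San Diego County row above, and the variable names are hypothetical.

```python
import io

import pandas as pd

# One-row sample standing in for the saved county file
sample = pd.DataFrame({'stateFips': ['06'],
                       'countyFips': ['073'],
                       'ownerOccupiedHousingUnits': ['593890']})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default type inference: codes become integers and lose leading zeros
buffer.seek(0)
as_numbers = pd.read_csv(buffer)

# dtype=str keeps every column as text, preserving the codes
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)

print(as_numbers.loc[0, 'stateFips'], as_text.loc[0, 'stateFips'])
```

Reading with `dtype=str` and converting only the numeric columns afterwards is one way to keep the codes intact.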

## Download zip-level data using US Census API

In [16]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'
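The geography name in this URL contains spaces, which are written by hand as `%20`. As a sketch under the assumption that all three downloads share the same base URL, a small helper can do the percent-encoding with the standard library. The names `BASE_URL`, `build_url`, and `within` are hypothetical.

```python
from urllib.parse import quote

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'


def build_url(fields, geography, within=None):
    """Build a Census API query URL, percent-encoding spaces in geography names.

    fields:    comma-separated variable names, e.g. 'DP04_0037E,DP04_0046E'
    geography: value of the 'for' parameter, e.g. 'county:*'
    within:    optional value of the 'in' parameter, e.g. 'state:06'
    """
    url = (f"{BASE_URL}?get={quote(fields, safe=',')}"
           f"&for={quote(geography, safe=':*')}")
    if within:
        url += f"&in={quote(within, safe=':*')}"
    return url


print(build_url('DP04_0037E,DP04_0046E', 'zip code tabulation area:*'))
```

The same helper covers the county query (`'county:*'`) and the per-state tract query (`'tract:*'` with `within='state:06'`).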

In [17]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,DP04_0037E,DP04_0046E,DP04_0046PE,DP04_0047E,DP04_0047PE,DP04_0048E,DP04_0049E,DP04_0057E,DP04_0058E,DP04_0058PE,DP04_0077E,DP04_0077PE,DP04_0078E,DP04_0078PE,DP04_0079E,DP04_0079PE,state,zip code tabulation area
1,6.1,691,81.1,161,18.9,2.84,2.04,852,56,6.6,841,98.7,11,1.3,0,0.0,51,23833
2,5.4,441,94.4,26,5.6,2.73,2.54,467,0,0.0,467,100.0,0,0.0,0,0.0,51,23850
3,5.6,3209,58.3,2298,41.7,2.40,2.35,5507,502,9.1,5482,99.5,25,0.5,0,0.0,51,23851
4,5.9,1075,66.5,542,33.5,2.13,2.29,1617,134,8.3,1563,96.7,54,3.3,0,0.0,51,23890


##### Add column names

In [18]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('stateFips')
columns.append('postalCode')
df.columns = columns

In [19]:
df.head()

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,postalCode
1,6.1,691,81.1,161,18.9,2.84,2.04,852,56,6.6,841,98.7,11,1.3,0,0.0,51,23833
2,5.4,441,94.4,26,5.6,2.73,2.54,467,0,0.0,467,100.0,0,0.0,0,0.0,51,23850
3,5.6,3209,58.3,2298,41.7,2.4,2.35,5507,502,9.1,5482,99.5,25,0.5,0,0.0,51,23851
4,5.9,1075,66.5,542,33.5,2.13,2.29,1617,134,8.3,1563,96.7,54,3.3,0,0.0,51,23890
5,5.2,41,45.6,49,54.4,1.88,3.9,90,0,0.0,90,100.0,0,0.0,0,0.0,51,23302


In [20]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,postalCode
24526,6.8,5747,71.5,2289,28.5,2.57,2.25,8036,437,5.4,7888,98.2,99,1.2,49,0.6,06,90210


In [21]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [22]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Zip.csv", index=False)

## Download tract-level data using US Census API

Tract-level data are only available by state, so we need to loop over all states.

In [23]:
def get_tract_data(state):
    """Download the selected DP04 variables for all tracts in one state.

    state: two-digit state FIPS code as a string, e.g. '06'
    """
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    time.sleep(1)  # pause briefly between requests
    # skip first row of labels
    df = df[1:].copy()
    # Add column names
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df

In [24]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)
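The loop issues one request per state, so a single failed request aborts the whole download. A hedged sketch of a retry wrapper follows; it is not part of the original notebook, and the name `fetch_with_retry` and its parameters are hypothetical. It accepts any function that takes a state code, so `get_tract_data` could be passed in unchanged.

```python
import time


def fetch_with_retry(fetch, state, retries=3, delay=1.0):
    """Call fetch(state), retrying on failure with a growing pause.

    Catches every Exception for simplicity; a real download would likely
    narrow this to network and HTTP errors.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(state)
        except Exception as error:
            last_error = error
            time.sleep(delay * (attempt + 1))  # wait longer after each failure
    raise RuntimeError(f'Download failed for state {state}') from last_error
```

Usage would look like `pd.concat(fetch_with_retry(get_tract_data, state) for state in stateFips)`.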

In [25]:
# Build the full 11-digit tract identifier: state (2) + county (3) + tract (6)
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
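The concatenation above yields an 11-digit identifier only when each part keeps its leading zeros: 2 digits for the state, 3 for the county, and 6 for the tract. The sketch below pads and validates the parts explicitly; the function name `make_tract_geoid` is hypothetical.

```python
def make_tract_geoid(state_fips, county_fips, tract):
    """Combine FIPS parts into an 11-digit tract identifier.

    Each part is left-padded with zeros to its fixed width (2, 3 and 6 digits).
    """
    geoid = state_fips.zfill(2) + county_fips.zfill(3) + tract.zfill(6)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f'Invalid tract identifier: {geoid!r}')
    return geoid


print(make_tract_geoid('06', '073', '008339'))
```

The check catches parts that are too long or non-numeric, which would otherwise produce identifiers that silently fail to match downstream.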

In [26]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [27]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,countyFips,tract,source,aggregationLevel
6,3.9,164,20.0,655,80.0,2.2,2.58,819,35,4.3,772,94.3,36,4.4,11,1.3,06,073,06073008339,American Community Survey 5 year,Tract
7,5.8,1758,80.6,422,19.4,3.11,3.62,2180,52,2.4,2095,96.1,72,3.3,13,0.6,06,073,06073008347,American Community Survey 5 year,Tract
8,4.4,1242,38.8,1958,61.2,3.58,2.76,3200,93,2.9,2972,92.9,212,6.6,16,0.5,06,073,06073008354,American Community Survey 5 year,Tract
9,5.1,1328,62.8,788,37.2,3.17,3.0,2116,103,4.9,2008,94.9,77,3.6,31,1.5,06,073,06073008505,American Community Survey 5 year,Tract
10,4.3,1450,46.8,1646,53.2,2.55,2.23,3096,372,12.0,2918,94.3,125,4.0,53,1.7,06,073,06073017604,American Community Survey 5 year,Tract


### Save data

In [28]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Tract.csv", index=False)

In [29]:
df.shape

(73056, 21)