# Selected Housing Characteristics from the American Community Survey

**[Work in progress]**

This notebook downloads [selected housing characteristics (DP04)](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP04) from the American Community Survey 2018 5-Year Data.

Data source: [American Community Survey 5-Year Data 2018](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path
import time

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-19636412-9e74-4bac-8a4c-c6c8b49bb9d3/installation-4.1.0/import
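
`Path(os.getenv('NEO4J_IMPORT'))` raises a `TypeError` when the environment variable is not set, because `os.getenv` then returns `None`. A minimal sketch of a more explicit check follows; the helper name `get_import_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def get_import_dir(env_var='NEO4J_IMPORT'):
    """Return the import directory named by env_var, or fail with a clear message.

    Hypothetical helper: the notebook itself reads the variable directly.
    """
    value = os.getenv(env_var)
    if not value:
        raise RuntimeError(f'Environment variable {env_var} is not set')
    return Path(value)
```

In this notebook the call would be `NEO4J_IMPORT = get_import_dir()`.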


## Download selected variables

* [Selected housing characteristics for US](https://data.census.gov/cedsci/table?tid=ACSDP5Y2018.DP04)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP04.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP04/)

* [Description of variables](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2018_ACSSubjectDefinitions.pdf)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP04 group and assign property names

Names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations). Names that contain special characters, such as the `.` in `occupantsPerRoom1.00orLess`, must be quoted with backticks in Cypher.

In [4]:
variables = {# ROOMS
             'DP04_0037E': 'medianRoomsInHousingUnit',
             
             # HOUSING TENURE
             'DP04_0046E': 'ownerOccupiedHousingUnits',
             'DP04_0046PE': 'ownerOccupiedHousingUnitsPct',
             'DP04_0047E': 'renterOccupiedHousingUnits',
             'DP04_0047PE': 'renterOccupiedHousingUnitsPct',
             'DP04_0048E': 'averageHouseholdSizeOfOwnerOccupiedUnit',
             'DP04_0049E': 'averageHouseholdSizeOfRenterOccupiedUnit',
    
             # VEHICLES AVAILABLE
             'DP04_0057E': 'occupiedHousingUnitsWithVehicles',
             'DP04_0058E': 'occupiedHousingUnitsNoVehicles',
             'DP04_0058PE': 'occupiedHousingUnitsNoVehiclesPct',
    
             # OCCUPANTS PER ROOM (names contain '.', so they need backtick quoting in Cypher)
             'DP04_0077E': 'occupantsPerRoom1.00orLess',
             'DP04_0077PE': 'occupantsPerRoom1.00orLessPct',
             'DP04_0078E': 'occupantsPerRoom1.01to1.50',
             'DP04_0078PE': 'occupantsPerRoom1.01to1.50Pct',
             'DP04_0079E': 'occupantsPerRoom1.51orMore',
             'DP04_0079PE': 'occupantsPerRoom1.51orMorePct'
            }

In [5]:
fields = ",".join(variables.keys())

In [6]:
for v in variables.values():
    if 'Pct' in v:
        print('h.' + v + ' = toFloat(row.' + v + '),')
    else:
        print('h.' + v + ' = toInteger(row.' + v + '),')

h.medianRoomsInHousingUnit = toInteger(row.medianRoomsInHousingUnit),
h.ownerOccupiedHousingUnits = toInteger(row.ownerOccupiedHousingUnits),
h.ownerOccupiedHousingUnitsPct = toFloat(row.ownerOccupiedHousingUnitsPct),
h.renterOccupiedHousingUnits = toInteger(row.renterOccupiedHousingUnits),
h.renterOccupiedHousingUnitsPct = toFloat(row.renterOccupiedHousingUnitsPct),
h.averageHouseholdSizeOfOwnerOccupiedUnit = toInteger(row.averageHouseholdSizeOfOwnerOccupiedUnit),
h.averageHouseholdSizeOfRenterOccupiedUnit = toInteger(row.averageHouseholdSizeOfRenterOccupiedUnit),
h.occupiedHousingUnitsWithVehicles = toInteger(row.occupiedHousingUnitsWithVehicles),
h.occupiedHousingUnitsNoVehicles = toInteger(row.occupiedHousingUnitsNoVehicles),
h.occupiedHousingUnitsNoVehiclesPct = toFloat(row.occupiedHousingUnitsNoVehiclesPct),
h.occupantsPerRoom1.00orLess = toInteger(row.occupantsPerRoom1.00orLess),
h.occupantsPerRoom1.00orLessPct = toFloat(row.occupantsPerRoom1.00orLessPct),
h.occupantsPerRoom1.01
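
The generated lines above leave names such as `occupantsPerRoom1.00orLess` unquoted. They also apply `toInteger` to decimal-valued fields such as `medianRoomsInHousingUnit` (5.3 in the sample output further down). The following sketch shows one way to emit backtick-quoted names and pick the cast explicitly. The helpers `quote_name` and `set_clause`, the regular expression, and the `is_float` flags are assumptions made for illustration; which fields hold decimals should be checked against the data.

```python
import re

# Names made only of letters, digits, and underscores (not starting with a
# digit) are treated as safe to use unquoted; everything else gets backticks.
PLAIN_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def quote_name(name):
    """Wrap a property name in backticks unless it is a plain identifier."""
    return name if PLAIN_NAME.fullmatch(name) else '`' + name + '`'


def set_clause(name, is_float):
    """Build one Cypher SET assignment for the node alias h and the CSV row."""
    cast = 'toFloat' if is_float else 'toInteger'
    quoted = quote_name(name)
    return f'h.{quoted} = {cast}(row.{quoted})'


# Illustrative subset of the property names; the flags are assumptions.
example_names = {
    'medianRoomsInHousingUnit': True,
    'ownerOccupiedHousingUnits': False,
    'ownerOccupiedHousingUnitsPct': True,
    'occupantsPerRoom1.00orLess': False,
}

print(',\n'.join(set_clause(n, f) for n, f in example_names.items()))
```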

## Download county-level data using US Census API

In [7]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'

In [8]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,DP04_0037E,DP04_0046E,DP04_0046PE,DP04_0047E,DP04_0047PE,DP04_0048E,DP04_0049E,DP04_0057E,DP04_0058E,DP04_0058PE,DP04_0077E,DP04_0077PE,DP04_0078E,DP04_0078PE,DP04_0079E,DP04_0079PE,state,county
1,5.3,9888,54.0,8411,46.0,2.38,2.73,18299,2429,13.3,17742,97.0,468,2.6,89,0.5,28,151
2,5.3,3804,83.4,759,16.6,2.58,2.77,4563,275,6.0,4447,97.5,106,2.3,10,0.2,28,111
3,5.5,2417,76.4,747,23.6,2.61,2.48,3164,243,7.7,3075,97.2,89,2.8,0,0.0,28,019
4,5.7,6774,77.8,1932,22.2,2.63,2.38,8706,337,3.9,8577,98.5,100,1.1,29,0.3,28,057


##### Add column names

In [9]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
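
Assigning `df.columns` by position relies on the response columns arriving in the same order as the keys of `variables`. A label-based rename avoids that assumption. The sketch below uses a small payload shaped like the API response (a JSON array whose first row holds the labels), with values copied from the output above; `payload`, `rename_map`, and `df_example` are names introduced here for illustration.

```python
import json

import pandas as pd

# Two data rows in the array-of-arrays layout returned by the API,
# reduced to four columns to keep the example short.
payload = json.loads(
    '[["DP04_0046E","DP04_0046PE","state","county"],'
    '["9888","54.0","28","151"],'
    '["2417","76.4","28","019"]]'
)

# The first row carries the column labels; the rest are data rows.
header, *rows = payload
df_example = pd.DataFrame(rows, columns=header)

# Rename by label instead of by position.
rename_map = {
    'DP04_0046E': 'ownerOccupiedHousingUnits',
    'DP04_0046PE': 'ownerOccupiedHousingUnitsPct',
    'state': 'stateFips',
    'county': 'countyFips',
}
df_example = df_example.rename(columns=rename_map)
print(df_example)
```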

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: Handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [10]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (required later to download tract data state by state).

In [11]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [12]:
df.head()

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,countyFips
1,5.3,9888,54.0,8411,46.0,2.38,2.73,18299,2429,13.3,17742,97.0,468,2.6,89,0.5,28,151
2,5.3,3804,83.4,759,16.6,2.58,2.77,4563,275,6.0,4447,97.5,106,2.3,10,0.2,28,111
3,5.5,2417,76.4,747,23.6,2.61,2.48,3164,243,7.7,3075,97.2,89,2.8,0,0.0,28,019
4,5.7,6774,77.8,1932,22.2,2.63,2.38,8706,337,3.9,8577,98.5,100,1.1,29,0.3,28,057
5,5.7,2955,80.8,703,19.2,2.87,2.11,3658,173,4.7,3524,96.3,87,2.4,47,1.3,28,015


In [13]:
# Example data
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,stateFips,countyFips
1869,5.0,593890,53.1,525090,46.9,2.9,2.83,1118980,61486,5.5,1043965,93.3,50615,4.5,24400,2.2,06,073


In [14]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [15]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Admin2.csv", index=False)
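
FIPS codes are stored as text with leading zeros (for example `06` and `073`). If the saved CSV is read back with pandas, the default type inference turns them into integers and drops the zeros. The sketch below illustrates the difference with an in-memory buffer instead of the real file; `df_example` and the buffer are stand-ins introduced for this example.

```python
import io

import pandas as pd

# One row with the San Diego County codes used in the example query above.
df_example = pd.DataFrame({
    'stateFips': ['06'],
    'countyFips': ['073'],
    'ownerOccupiedHousingUnits': ['593890'],
})

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
df_example.to_csv(buffer, index=False)

# Default inference parses the codes as integers.
buffer.seek(0)
as_default = pd.read_csv(buffer)

# dtype=str keeps every column as text, preserving leading zeros.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)

print(as_default.dtypes)
print(as_text)
```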

## Download zip-level data using US Census API

In [16]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'
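
The geography name in this URL contains spaces, which are percent-encoded by hand as `%20`. As a sketch, the same query strings can be assembled with `urllib.parse.quote`, which leaves `:` and `*` untouched when they are listed as safe characters. `BASE_URL` and `build_url` are hypothetical names, not part of the notebook.

```python
from urllib.parse import quote

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'


def build_url(fields, geography, within=None):
    """Assemble a query URL; spaces in geography names become %20."""
    url = f"{BASE_URL}?get={fields}&for={quote(geography, safe=':*')}"
    if within:
        # Optional parent geography, e.g. 'state:06' for tract queries.
        url += f"&in={quote(within, safe=':*')}"
    return url


print(build_url('DP04_0037E', 'zip code tabulation area:*'))
print(build_url('DP04_0037E', 'tract:*', within='state:06'))
```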

In [17]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,DP04_0037E,DP04_0046E,DP04_0046PE,DP04_0047E,DP04_0047PE,DP04_0048E,DP04_0049E,DP04_0057E,DP04_0058E,DP04_0058PE,DP04_0077E,DP04_0077PE,DP04_0078E,DP04_0078PE,DP04_0079E,DP04_0079PE,zip code tabulation area
1,5.8,2685,70.5,1126,29.5,2.35,2.05,3811,287,7.5,3802,99.8,3,0.1,6,0.2,43964
2,5.5,10634,53.4,9277,46.6,2.57,2.47,19911,1902,9.6,19634,98.6,132,0.7,145,0.7,28216
3,6.8,18754,70.3,7941,29.7,2.83,2.35,26695,465,1.7,26390,98.9,201,0.8,104,0.4,28277
4,6.9,6678,72.8,2498,27.2,2.92,3.12,9176,161,1.8,9016,98.3,124,1.4,36,0.4,28278


##### Add column names

In [18]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('postalCode')
df.columns = columns

In [19]:
df.head()

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,postalCode
1,5.8,2685,70.5,1126,29.5,2.35,2.05,3811,287,7.5,3802,99.8,3,0.1,6,0.2,43964
2,5.5,10634,53.4,9277,46.6,2.57,2.47,19911,1902,9.6,19634,98.6,132,0.7,145,0.7,28216
3,6.8,18754,70.3,7941,29.7,2.83,2.35,26695,465,1.7,26390,98.9,201,0.8,104,0.4,28277
4,6.9,6678,72.8,2498,27.2,2.92,3.12,9176,161,1.8,9016,98.3,124,1.4,36,0.4,28278
5,5.2,5903,44.7,7304,55.3,2.37,2.08,13207,1195,9.0,12950,98.1,233,1.8,24,0.2,28303


In [20]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,medianRoomsInHousingUnit,ownerOccupiedHousingUnits,ownerOccupiedHousingUnitsPct,renterOccupiedHousingUnits,renterOccupiedHousingUnitsPct,averageHouseholdSizeOfOwnerOccupiedUnit,averageHouseholdSizeOfRenterOccupiedUnit,occupiedHousingUnitsWithVehicles,occupiedHousingUnitsNoVehicles,occupiedHousingUnitsNoVehiclesPct,occupantsPerRoom1.00orLess,occupantsPerRoom1.00orLessPct,occupantsPerRoom1.01to1.50,occupantsPerRoom1.01to1.50Pct,occupantsPerRoom1.51orMore,occupantsPerRoom1.51orMorePct,postalCode
30897,6.8,5747,71.5,2289,28.5,2.57,2.25,8036,437,5.4,7888,98.2,99,1.2,49,0.6,90210


In [21]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [22]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Zip.csv", index=False)

## Download tract-level data using US Census API
Tract-level data are only available by state, so we need to loop over all states.

In [23]:
def get_tract_data(state):
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    time.sleep(1)
    # skip first row of labels
    df = df[1:].copy()
    # Add column names
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df

In [None]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)
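
The loop above issues one request per state, so a single failed request aborts the whole download. One possible way to make it more tolerant is a small retry wrapper, sketched here. `fetch_with_retry`, `attempts`, and `delay` are hypothetical names; catching every `Exception` is a simplification for the sketch, and a real version would likely narrow it to network and parsing errors.

```python
import time


def fetch_with_retry(fetch, state, attempts=3, delay=1.0):
    """Call fetch(state), retrying with a growing pause after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(state)
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            # Wait a little longer after every failed attempt.
            time.sleep(delay * (attempt + 1))
    raise RuntimeError(f'Giving up on state {state} after {attempts} attempts') from last_error
```

With the function defined in the notebook, the concatenation could then read `pd.concat(fetch_with_retry(get_tract_data, state) for state in stateFips)`.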

In [None]:
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
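
Concatenating the 2-digit state code, the 3-digit county code, and the 6-digit tract code yields an 11-character tract identifier. The sketch below demonstrates this on a single illustrative row (the tract value `000100` is a sample chosen for the example) and adds a length check that could catch codes that lost their leading zeros.

```python
import pandas as pd

# Illustrative row: state 06, county 073, and a sample 6-digit tract code.
df_example = pd.DataFrame({
    'stateFips': ['06'],
    'countyFips': ['073'],
    'tract': ['000100'],
})

# Same string concatenation as in the cell above.
df_example['tract'] = df_example['stateFips'] + df_example['countyFips'] + df_example['tract']

# Every combined identifier should be exactly 11 characters long.
all_valid = df_example['tract'].str.len().eq(11).all()
print(df_example['tract'].tolist(), all_valid)
```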

In [None]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [None]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

### Save data

In [None]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP04Tract.csv", index=False)

In [None]:
df.shape