# Selected Social Characteristics: Educational Attainment from the American Community Survey

**[Work in progress]**

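This notebook downloads the educational attainment variables from the [selected social characteristics (DP02)](https://data.census.gov/cedsci/table?q=DP02&tid=ACSDP5Y2018.DP02) table of the American Community Survey 5-Year Data (2009-2018) at the county, ZIP code tabulation area, and census tract levels.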

Data source: [American Community Survey 5-Year Data (2009-2018)](https://www.census.gov/data/developers/data-sets/acs-5year.html)

Authors: Peter Rose (pwrose@ucsd.edu), Ilya Zaslavsky (zaslavsk@sdsc.edu)

In [1]:
import os
import pandas as pd
from pathlib import Path

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
NEO4J_IMPORT = Path(os.getenv('NEO4J_IMPORT'))
print(NEO4J_IMPORT)

/Users/peter/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-19636412-9e74-4bac-8a4c-c6c8b49bb9d3/installation-4.1.0/import
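
If `NEO4J_IMPORT` is not set, `os.getenv` returns `None` and `Path(None)` fails with a `TypeError` that does not name the missing variable. A minimal sketch of a more descriptive check follows; the helper name `get_import_dir` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def get_import_dir(env_var='NEO4J_IMPORT'):
    """Return the directory named by env_var, or raise a descriptive error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(env_var)
    if not value:
        raise EnvironmentError(f'Environment variable {env_var} is not set')
    return Path(value)
```

Calling `get_import_dir()` in place of `Path(os.getenv('NEO4J_IMPORT'))` would leave the rest of the notebook unchanged.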


## Download selected variables

* [Selected social characteristics for US](https://data.census.gov/cedsci/table?q=DP02&tid=ACSDP5Y2018.DP02)

* [List of variables as HTML](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP02.html) or [JSON](https://api.census.gov/data/2018/acs/acs5/profile/groups/DP02/)

* [Description of variables](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2018_ACSSubjectDefinitions.pdf)

* [Example URLs for API](https://api.census.gov/data/2018/acs/acs5/profile/examples.html)

### Specify variables from DP02 group and assign property names

Names must follow the [Neo4j property naming conventions](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-naming-rules-and-recommendations).
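
The linked recommendations suggest lower camel case for property names. As a rough sketch, a check like the one below could flag names that start with an uppercase letter or contain other characters. The regular expression is an assumption about what "lower camel case" means here, and the helper name `non_camel_case` is hypothetical.

```python
import re

# Assumed pattern: a lowercase first letter followed by letters and digits only.
CAMEL_CASE = re.compile(r'^[a-z][A-Za-z0-9]*$')


def non_camel_case(names):
    """Return the names that do not match the assumed lower camel case pattern."""
    return [name for name in names if not CAMEL_CASE.match(name)]


# Three of the property names used in this notebook
sample = ['population25YearsAndOver', 'LessThan9thGrade', 'highSchoolGraduatePct']
print(non_camel_case(sample))  # ['LessThan9thGrade']
```

Applied to `variables.values()`, the same check would list every property name that deviates from the pattern.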

In [4]:
variables = {# EDUCATIONAL ATTAINMENT
             'DP02_0058E': 'population25YearsAndOver',
             'DP02_0059E': 'LessThan9thGrade',
             'DP02_0059PE': 'LessThan9thGradePct',
             'DP02_0060E': 'grade9thTo12thNoDiploma',
             'DP02_0060PE': 'grade9thTo12thNoDiplomaPct',
             'DP02_0061E': 'highSchoolGraduate',
             'DP02_0061PE': 'highSchoolGraduatePct',
             'DP02_0062E': 'someCollegeNoDegree',
             'DP02_0062PE': 'someCollegeNoDegreePct',
             'DP02_0063E': 'associatesDegree',
             'DP02_0063PE': 'associatesDegreePct',
             'DP02_0064E': 'bachelorsDegree',
             'DP02_0064PE': 'bachelorsDegreePct',
             'DP02_0065E': 'graduateOrProfessionalDegree',
             'DP02_0065PE': 'graduateOrProfessionalDegreePct',
             'DP02_0066E': 'highSchoolGraduateOrHigher',
             'DP02_0066PE': 'highSchoolGraduateOrHigherPct',
             'DP02_0067E': 'bachelorsDegreeOrHigher',
             'DP02_0067PE': 'bachelorsDegreeOrHigherPct',
            }

In [5]:
fields = ",".join(variables.keys())

In [6]:
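# Print Cypher-style property assignments for each variable:
# percentages are cast with toFloat, counts with toInteger
for v in variables.values():
    if 'Pct' in v:
        print('e.' + v + ' = toFloat(row.' + v + '),')
    else:
        print('e.' + v + ' = toInteger(row.' + v + '),')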

e.population25YearsAndOver = toInteger(row.population25YearsAndOver),
e.LessThan9thGrade = toInteger(row.LessThan9thGrade),
e.LessThan9thGradePct = toFloat(row.LessThan9thGradePct),
e.grade9thTo12thNoDiploma = toInteger(row.grade9thTo12thNoDiploma),
e.grade9thTo12thNoDiplomaPct = toFloat(row.grade9thTo12thNoDiplomaPct),
e.highSchoolGraduate = toInteger(row.highSchoolGraduate),
e.highSchoolGraduatePct = toFloat(row.highSchoolGraduatePct),
e.someCollegeNoDegree = toInteger(row.someCollegeNoDegree),
e.someCollegeNoDegreePct = toFloat(row.someCollegeNoDegreePct),
e.associatesDegree = toInteger(row.associatesDegree),
e.associatesDegreePct = toFloat(row.associatesDegreePct),
e.bachelorsDegree = toInteger(row.bachelorsDegree),
e.bachelorsDegreePct = toFloat(row.bachelorsDegreePct),
e.graduateOrProfessionalDegree = toInteger(row.graduateOrProfessionalDegree),
e.graduateOrProfessionalDegreePct = toFloat(row.graduateOrProfessionalDegreePct),
e.highSchoolGraduateOrHigher = toInteger(row.highSchoo
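
The printed assignments each end with a comma, so the last one needs manual cleanup before it is pasted into a query. A minimal sketch that joins the clauses instead is shown below. The function name `cypher_set_clauses` and its `node` and `row` parameters are hypothetical, and it uses `endswith('Pct')` rather than the substring test in the cell above.

```python
def cypher_set_clauses(property_names, node='e', row='row'):
    """Build comma-separated property assignments with no trailing comma."""
    clauses = []
    for name in property_names:
        # Percent columns become floats, all other columns integers
        cast = 'toFloat' if name.endswith('Pct') else 'toInteger'
        clauses.append(f'{node}.{name} = {cast}({row}.{name})')
    return ',\n'.join(clauses)


print(cypher_set_clauses(['highSchoolGraduate', 'highSchoolGraduatePct']))
```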

## Download county-level data using US Census API

In [7]:
url_county = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=county:*'

In [8]:
df = pd.read_json(url_county, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,DP02_0058E,DP02_0059E,DP02_0059PE,DP02_0060E,DP02_0060PE,DP02_0061E,DP02_0061PE,DP02_0062E,DP02_0062PE,DP02_0063E,DP02_0063PE,DP02_0064E,DP02_0064PE,DP02_0065E,DP02_0065PE,DP02_0066E,DP02_0066PE,DP02_0067E,DP02_0067PE,state,county
1,30308,2551,8.4,3972,13.1,9254,30.5,6622,21.8,2213,7.3,3476,11.5,2220,7.3,23785,78.5,5696,18.8,28,151
2,8198,492,6.0,1044,12.7,3290,40.1,1666,20.3,808,9.9,694,8.5,204,2.5,6662,81.3,898,11.0,28,111
3,5799,496,8.6,638,11.0,2201,38.0,954,16.5,486,8.4,625,10.8,399,6.9,4665,80.4,1024,17.7,28,019
4,15684,1003,6.4,2372,15.1,4743,30.2,4045,25.8,1412,9.0,1337,8.5,772,4.9,12309,78.5,2109,13.4,28,057


##### Add column names

In [9]:
df = df[1:].copy() # skip first row of labels
columns = list(variables.values())
columns.append('stateFips')
columns.append('countyFips')
df.columns = columns
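
Assigning `df.columns` by position relies on the API returning the columns in the requested order. As an alternative sketch, the header row can be used as the column index and renamed by key. The rows below are a shortened, hand-copied excerpt of the response shown above, and `names` and `df_sample` are illustrative names only.

```python
import pandas as pd

# Shortened excerpt of the API response: header row followed by data rows
rows = [
    ['DP02_0058E', 'DP02_0059E', 'state', 'county'],
    ['30308', '2551', '28', '151'],
    ['8198', '492', '28', '111'],
]

# Map API column names to property names (subset for illustration)
names = {
    'DP02_0058E': 'population25YearsAndOver',
    'DP02_0059E': 'LessThan9thGrade',
    'state': 'stateFips',
    'county': 'countyFips',
}

# Use the first row as the header, then rename by key rather than by position
df_sample = pd.DataFrame(rows[1:], columns=rows[0]).rename(columns=names)
print(list(df_sample.columns))
```

Renaming by key keeps the mapping correct even if the column order of the response changes.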

Remove Puerto Rico (stateFips = 72) to limit the data to the 50 US states and the District of Columbia.

TODO: handle data for Puerto Rico (GeoNames represents Puerto Rico as a country).

In [10]:
df.query("stateFips != '72'", inplace=True)

Save the list of state FIPS codes (required later to download tract data state by state).

In [11]:
stateFips = list(df['stateFips'].unique())
stateFips.sort()
print(stateFips)

['01', '02', '04', '05', '06', '08', '09', '10', '11', '12', '13', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47', '48', '49', '50', '51', '53', '54', '55', '56']


In [12]:
df.head()

Unnamed: 0,population25YearsAndOver,LessThan9thGrade,LessThan9thGradePct,grade9thTo12thNoDiploma,grade9thTo12thNoDiplomaPct,highSchoolGraduate,highSchoolGraduatePct,someCollegeNoDegree,someCollegeNoDegreePct,associatesDegree,associatesDegreePct,bachelorsDegree,bachelorsDegreePct,graduateOrProfessionalDegree,graduateOrProfessionalDegreePct,highSchoolGraduateOrHigher,highSchoolGraduateOrHigherPct,bachelorsDegreeOrHigher,bachelorsDegreeOrHigherPct,stateFips,countyFips
1,30308,2551,8.4,3972,13.1,9254,30.5,6622,21.8,2213,7.3,3476,11.5,2220,7.3,23785,78.5,5696,18.8,28,151
2,8198,492,6.0,1044,12.7,3290,40.1,1666,20.3,808,9.9,694,8.5,204,2.5,6662,81.3,898,11.0,28,111
3,5799,496,8.6,638,11.0,2201,38.0,954,16.5,486,8.4,625,10.8,399,6.9,4665,80.4,1024,17.7,28,019
4,15684,1003,6.4,2372,15.1,4743,30.2,4045,25.8,1412,9.0,1337,8.5,772,4.9,12309,78.5,2109,13.4,28,057
5,7248,402,5.5,922,12.7,2422,33.4,1820,25.1,631,8.7,692,9.5,359,5.0,5924,81.7,1051,14.5,28,015


In [13]:
# Example data
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')]

Unnamed: 0,population25YearsAndOver,LessThan9thGrade,LessThan9thGradePct,grade9thTo12thNoDiploma,grade9thTo12thNoDiplomaPct,highSchoolGraduate,highSchoolGraduatePct,someCollegeNoDegree,someCollegeNoDegreePct,associatesDegree,associatesDegreePct,bachelorsDegree,bachelorsDegreePct,graduateOrProfessionalDegree,graduateOrProfessionalDegreePct,highSchoolGraduateOrHigher,highSchoolGraduateOrHigherPct,bachelorsDegreeOrHigher,bachelorsDegreeOrHigherPct,stateFips,countyFips
1869,2223376,149313,6.7,136695,6.1,409272,18.4,498566,22.4,181508,8.2,521525,23.5,326497,14.7,1937368,87.1,848022,38.1,06,073


In [14]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Admin2'

### Save data

In [15]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP02EducationAdmin2.csv", index=False)
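
The FIPS codes are stored as text with leading zeros. When the saved CSV is read back with pandas, type inference turns them into integers unless the columns are read as strings. The sketch below illustrates this with a small in-memory CSV; the variable names are illustrative and the two rows stand in for the saved file.

```python
import io

import pandas as pd

# Two rows standing in for the saved county file
csv_text = 'stateFips,countyFips\n06,073\n28,019\n'

inferred = pd.read_csv(io.StringIO(csv_text))            # types inferred per column
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)  # every column kept as text

print(inferred.loc[0].tolist())  # leading zeros are dropped
print(as_text.loc[0].tolist())   # leading zeros are preserved
```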

## Download zip-level data using US Census API

In [16]:
url_zip = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=zip%20code%20tabulation%20area:*'
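
The request URLs in this notebook differ only in the geography clause. A small helper could build them from one template, as in the following sketch. The function `build_acs_url` and the constant `BASE_URL` are hypothetical, and the optional `state` argument assumes the `in=state:` filter used for geographies that are requested state by state.

```python
from urllib.parse import quote

BASE_URL = 'https://api.census.gov/data/2018/acs/acs5/profile'


def build_acs_url(fields, geography, state=None):
    """Build a request URL for all areas of one geography (hypothetical helper)."""
    # quote() percent-encodes the spaces in names like 'zip code tabulation area'
    url = f"{BASE_URL}?get={','.join(fields)}&for={quote(geography)}:*"
    if state is not None:
        url += f'&in=state:{state}'
    return url


print(build_acs_url(['DP02_0058E', 'DP02_0059E'], 'zip code tabulation area'))
```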

In [17]:
df = pd.read_json(url_zip, dtype='str')
df.fillna('', inplace=True)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,DP02_0058E,DP02_0059E,DP02_0059PE,DP02_0060E,DP02_0060PE,DP02_0061E,DP02_0061PE,DP02_0062E,DP02_0062PE,DP02_0063E,DP02_0063PE,DP02_0064E,DP02_0064PE,DP02_0065E,DP02_0065PE,DP02_0066E,DP02_0066PE,DP02_0067E,DP02_0067PE,zip code tabulation area
1,6316,157,2.5,511,8.1,2748,43.5,1239,19.6,824,13.0,513,8.1,324,5.1,5648,89.4,837,13.3,43964
2,33273,1310,3.9,2422,7.3,7413,22.3,8381,25.2,2938,8.8,7271,21.9,3538,10.6,29541,88.8,10809,32.5,28216
3,48340,601,1.2,802,1.7,4865,10.1,7319,15.1,2941,6.1,19822,41.0,11990,24.8,46937,97.1,31812,65.8,28277
4,18009,345,1.9,779,4.3,3272,18.2,4223,23.4,1277,7.1,5647,31.4,2466,13.7,16885,93.8,8113,45.0,28278


##### Add column names

In [18]:
df = df[1:].copy() # skip first row
columns = list(variables.values())
columns.append('postalCode')
df.columns = columns

In [19]:
df.head()

Unnamed: 0,population25YearsAndOver,LessThan9thGrade,LessThan9thGradePct,grade9thTo12thNoDiploma,grade9thTo12thNoDiplomaPct,highSchoolGraduate,highSchoolGraduatePct,someCollegeNoDegree,someCollegeNoDegreePct,associatesDegree,associatesDegreePct,bachelorsDegree,bachelorsDegreePct,graduateOrProfessionalDegree,graduateOrProfessionalDegreePct,highSchoolGraduateOrHigher,highSchoolGraduateOrHigherPct,bachelorsDegreeOrHigher,bachelorsDegreeOrHigherPct,postalCode
1,6316,157,2.5,511,8.1,2748,43.5,1239,19.6,824,13.0,513,8.1,324,5.1,5648,89.4,837,13.3,43964
2,33273,1310,3.9,2422,7.3,7413,22.3,8381,25.2,2938,8.8,7271,21.9,3538,10.6,29541,88.8,10809,32.5,28216
3,48340,601,1.2,802,1.7,4865,10.1,7319,15.1,2941,6.1,19822,41.0,11990,24.8,46937,97.1,31812,65.8,28277
4,18009,345,1.9,779,4.3,3272,18.2,4223,23.4,1277,7.1,5647,31.4,2466,13.7,16885,93.8,8113,45.0,28278
5,20122,685,3.4,1081,5.4,4601,22.9,5702,28.3,2219,11.0,3357,16.7,2477,12.3,18356,91.2,5834,29.0,28303


In [20]:
# Example data
df.query("postalCode == '90210'")

Unnamed: 0,population25YearsAndOver,LessThan9thGrade,LessThan9thGradePct,grade9thTo12thNoDiploma,grade9thTo12thNoDiplomaPct,highSchoolGraduate,highSchoolGraduatePct,someCollegeNoDegree,someCollegeNoDegreePct,associatesDegree,associatesDegreePct,bachelorsDegree,bachelorsDegreePct,graduateOrProfessionalDegree,graduateOrProfessionalDegreePct,highSchoolGraduateOrHigher,highSchoolGraduateOrHigherPct,bachelorsDegreeOrHigher,bachelorsDegreeOrHigherPct,postalCode
30897,15118,390,2.6,414,2.7,1720,11.4,2019,13.4,791,5.2,4935,32.6,4849,32.1,14314,94.7,9784,64.7,90210


In [21]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'PostalCode'

### Save data

In [22]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP02EducationZip.csv", index=False)

## Download tract-level data using US Census API
Tract-level data are only available by state, so we need to loop over all states.

In [23]:
def get_tract_data(state):
    url_tract = f'https://api.census.gov/data/2018/acs/acs5/profile?get={fields}&for=tract:*&in=state:{state}'
    df = pd.read_json(url_tract, dtype='str')
    # skip first row of labels
    df = df[1:].copy()
    # Add column names
    columns = list(variables.values())
    columns.append('stateFips')
    columns.append('countyFips')
    columns.append('tract')
    df.columns = columns
    return df

In [24]:
df = pd.concat((get_tract_data(state) for state in stateFips))
df.fillna('', inplace=True)

In [25]:
df['tract'] = df['stateFips'] + df['countyFips'] + df['tract']
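
The concatenation above produces an 11-digit tract identifier: 2 digits for the state, 3 for the county, and 6 for the tract. A quick sanity check along these lines could catch codes that lost their leading zeros. The sample frame `df_sample` and the column name `geoid` are illustrative, with component codes taken from two San Diego County tracts.

```python
import pandas as pd

# Illustrative component codes for two tracts in San Diego County
df_sample = pd.DataFrame({
    'stateFips': ['06', '06'],
    'countyFips': ['073', '073'],
    'tract': ['008324', '008339'],
})

# Same concatenation as in the cell above, written to a separate column
df_sample['geoid'] = df_sample['stateFips'] + df_sample['countyFips'] + df_sample['tract']

# Every identifier should consist of exactly 11 digits
valid = df_sample['geoid'].str.fullmatch(r'\d{11}')
print(df_sample['geoid'].tolist(), bool(valid.all()))
```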

In [26]:
df['source'] = 'American Community Survey 5 year'
df['aggregationLevel'] = 'Tract'

In [27]:
# Example data for San Diego County
df[(df['stateFips'] == '06') & (df['countyFips'] == '073')].head()

Unnamed: 0,population25YearsAndOver,LessThan9thGrade,LessThan9thGradePct,grade9thTo12thNoDiploma,grade9thTo12thNoDiplomaPct,highSchoolGraduate,highSchoolGraduatePct,someCollegeNoDegree,someCollegeNoDegreePct,associatesDegree,associatesDegreePct,bachelorsDegree,bachelorsDegreePct,graduateOrProfessionalDegree,graduateOrProfessionalDegreePct,highSchoolGraduateOrHigher,highSchoolGraduateOrHigherPct,bachelorsDegreeOrHigher,bachelorsDegreeOrHigherPct,stateFips,countyFips,tract,source,aggregationLevel
56,5498,57,1.0,0,0.0,375,6.8,824,15.0,183,3.3,1712,31.1,2347,42.7,5441,99.0,4059,73.8,06,073,06073008324,American Community Survey 5 year,Tract
57,1000,0,0.0,30,3.0,74,7.4,114,11.4,31,3.1,418,41.8,333,33.3,970,97.0,751,75.1,06,073,06073008339,American Community Survey 5 year,Tract
58,5183,340,6.6,156,3.0,901,17.4,1016,19.6,443,8.5,1305,25.2,1022,19.7,4687,90.4,2327,44.9,06,073,06073008347,American Community Survey 5 year,Tract
59,6942,248,3.6,269,3.9,368,5.3,1199,17.3,730,10.5,2759,39.7,1369,19.7,6425,92.6,4128,59.5,06,073,06073008354,American Community Survey 5 year,Tract
60,4799,347,7.2,270,5.6,1038,21.6,1065,22.2,328,6.8,1080,22.5,671,14.0,4182,87.1,1751,36.5,06,073,06073008505,American Community Survey 5 year,Tract


### Save data

In [28]:
df.to_csv(NEO4J_IMPORT / "03a-USCensusDP02EducationTract.csv", index=False)

In [29]:
df.shape

(73056, 24)