# U.S. Census Beta API Review: Overview and Concepts

Darren Erik Vengroff, Ph.D.

July 2024

## Introduction

I am reviewing the beta API from two perspectives:

1. As a researcher who wants to access U.S. Census data in my normal analysis workflow. I will use examples in Python,
   but others might choose R. The emphasis here is on whether the beta API makes it easy for me to express the concepts
   I need to express to get the data I want into my environment so that I can continue my analysis.
2. As someone who maintains a client library to make it easier for researchers to do their work.

I began my journey with Census data as a researcher in category 1 and evolved into someone in category 2 as
I developed [censusdis](https://github.com/censusdis/censusdis).

Whenever I comment on the new API below, whether the comments are positive or negative, I will try to make
them from the perspective of a member of one or more of these groups.

I deliberately structured this response as a notebook so that you can see my thought process as I worked
through the documentation and examples.

## Overviews

### Introduction 

[link to docs](https://api.census.gov/docs/user-guide/overview/introduction.html)

I think the target audience matches up well to the two personas I mentioned above.

### Making Queries

[link to docs](https://api.census.gov/docs/user-guide/overview/making-queries.html)

Note I went a bit out of order here.

I don't think dropping direct browser support is a big problem. I assume this means that
you expect POST rather than GET requests. It would help to say so explicitly right
up front. I'll jump ahead and test it out here.

In [1]:
# I store my API key in `~/.censusdis/api_key.txt`. This simple API grabs it for me.
from censusdis.geography import EnvironmentApiKey

api_key = EnvironmentApiKey.api_key()

In [2]:
import requests

Note: To figure out the right base URL, and exactly how to put my
API key into the header, I had to look at one of your sample queries at 
https://api.census.gov/docs/user-guide/endpoints/facets.html#example and then use
my browser's debugger to watch the network traffic when I hit the button, so that
I could see exactly how the POST was constructed and replicate it here.

Ideally, the documentation should state the base URL and the fact that `Key` is
the header field for the API key.

In [3]:
BETA_API_BASE_URL = 'https://api.census.gov'

DATASETS_URL = f'{BETA_API_BASE_URL}/search/facets/datasets'

In [4]:
headers={"Key": api_key}

In [5]:
response_post = requests.post(DATASETS_URL, headers=headers)

In [6]:
response_post.status_code

200

In [7]:
response_get = requests.get(DATASETS_URL, headers=headers)

In [8]:
response_get.status_code

405

Confirmed that POST works and GET is not accepted.
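The pattern confirmed above could be wrapped in a small helper so the header only gets spelled out once. This is a minimal sketch: `beta_post_kwargs` is a hypothetical name, and the `Key` header field is what I observed in network traffic, not something I found documented.

```python
from typing import Any, Dict, Optional


def beta_post_kwargs(
    api_key: str, payload: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build keyword arguments for `requests.post` against the beta API.

    Hypothetical helper. The `Key` header name is an observation from
    this review, not a documented guarantee.
    """
    kwargs: Dict[str, Any] = {"headers": {"Key": api_key}}
    if payload is not None:
        # Passing `json=` lets `requests` serialize the body and set
        # the Content-Type header itself.
        kwargs["json"] = payload
    return kwargs


# Usage (needs network access and a real key):
#   import requests
#   response = requests.post(DATASETS_URL, **beta_post_kwargs(api_key))
```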

### Available Datasets

[link to docs](https://api.census.gov/docs/user-guide/overview/available-datasets.html)

Are the popular data sets listed here the only ones currently supported? That was
not totally clear.

Again, I will jump ahead a bit and find out by making the same query again.

In [9]:
response = requests.post(DATASETS_URL, headers=headers)

In [10]:
json = response.json()

In [11]:
facets = json['content']['facets']

In [12]:
[facet['name'] for facet in facets]

['American Community Survey',
 'Current Population Survey',
 'Community Resilience Estimates',
 'Decennial Census',
 'Decennial Census of Island Areas',
 'Economic Census',
 'Economic Census of Island Areas',
 'Economic Surveys',
 'Household Pulse Survey',
 'International Database',
 'Population Estimates',
 'Post-Secondary Employment Outcomes (PSEO)',
 'Public Sector',
 'Survey of Income and Program Participation',
 'Survey of Market Absorption']

OK, it looks like the set of supported data sets centers on the most popular ones.
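To check for one particular data set instead of eyeballing the list, something like the following might do. It is a sketch that assumes the `content` / `facets` / `name` shape seen in the response above; `facet_names` and `is_dataset_listed` are hypothetical helpers, and the sample dictionary is hand-made, not real API output.

```python
from typing import Any, Dict, List


def facet_names(body: Dict[str, Any]) -> List[str]:
    """Return the facet names from a decoded datasets response, sorted."""
    facets = body.get("content", {}).get("facets", [])
    return sorted(facet["name"] for facet in facets if "name" in facet)


def is_dataset_listed(body: Dict[str, Any], name: str) -> bool:
    """Case-insensitive check for a data set name among the facets."""
    return name.lower() in {n.lower() for n in facet_names(body)}


# Hand-made sample in the shape observed above.
sample = {
    "content": {
        "facets": [
            {"name": "Decennial Census"},
            {"name": "American Community Survey"},
        ]
    }
}

print(facet_names(sample))
print(is_dataset_listed(sample, "decennial census"))
```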

### Authentication

[link to docs](https://api.census.gov/docs/user-guide/overview/authentication.html)

I already have an API key. It was not clear from the docs if I could use the one I have
or if I needed a new one for the beta API. In the examples above I used the one I already
have and it worked.

### Global Request Parameters

[link to docs](https://api.census.gov/docs/user-guide/overview/global-request-parameters.html)

In the examples you give, some use what look like acceptable values, like `"2012"` for
a vintage. But for GEOIDs, you put `"g1"` and `"g2"` in quotes as if those were acceptable strings
to pass. In reality, as discussed later in the docs, these strings look like `"0500000US04015"`.
So maybe you should use real ones here instead of `"g1"` and `"g2"`.

Here's an example showing the 400 response we get with a bad one.

In [13]:
TABLE_DATA_URL = BETA_API_BASE_URL + '/table/data'

In [14]:
TABLE_DATA_URL

'https://api.census.gov/table/data'

In [15]:
headers={
    "Key": api_key, 
    "Content-Type": "application/json",
}

In [16]:
bad_response = requests.post(
    TABLE_DATA_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                "id": "ACSDT5YSPT2015.B01001",
                "geoIds": ["g1","g2"],
            }
        ]
    }
)

In [17]:
bad_response.status_code

400

In [18]:
good_response = requests.post(
    TABLE_DATA_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                "id": "ACSDT5YSPT2015.B01001",
                "geoIds": ["0500000US04015"],
            }
        ]
    }
)

In [19]:
good_response.status_code

200
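A client could catch placeholder values like `"g1"` before spending a round trip on them. The sketch below only checks the overall shape of a simple geoId (seven leading characters, the literal `US`, then a code). The pattern is my assumption based on the real example above, it does not cover wildcard forms, and `malformed_geo_ids` is a hypothetical name.

```python
import re
from typing import Iterable, List

# Assumed shape: 3-digit summary level, 4 more characters, "US", then a code.
_GEO_ID_SHAPE = re.compile(r"\d{3}[0-9A-Z]{4}US[0-9A-Z]*")


def malformed_geo_ids(geo_ids: Iterable[str]) -> List[str]:
    """Return the geoIds that do not match the assumed simple shape."""
    return [g for g in geo_ids if not _GEO_ID_SHAPE.fullmatch(g)]


print(malformed_geo_ids(["g1", "g2", "0500000US04015"]))
```

Passing the shape check says nothing about whether the API has data for that geography. It only filters out strings that could never be valid.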

### Common Response Structure

[link to docs](https://api.census.gov/docs/user-guide/overview/common-response-structure.html)

Maybe the current error messages are preliminary, but I would like to see something more helpful than
this. In `censusdis`, we got a lot of feedback that geographies are hard to understand, so we made the
error messages as detailed as possible in cases like this. It would be great to explain which geographies
are supported instead of just saying that the requested ones are not.

In [20]:
bad_response.json()['service-messages']

{'error': ['Table ACSDT5YSPT2015.B01001 has no data for the requested geographies']}
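On the client side, messages like this can at least be surfaced as exceptions instead of being left in the response body. This is a minimal sketch, assuming that errors always arrive as a list under `service-messages` / `error` as in the output above; `BetaApiError`, `service_errors`, and `raise_for_service_errors` are hypothetical names.

```python
from typing import Any, Dict, List


class BetaApiError(RuntimeError):
    """Hypothetical exception type for errors reported by the beta API."""


def service_errors(body: Dict[str, Any]) -> List[str]:
    """Return error strings from a decoded response, or an empty list."""
    messages = body.get("service-messages") or {}
    return list(messages.get("error") or [])


def raise_for_service_errors(body: Dict[str, Any]) -> None:
    """Raise BetaApiError if the response body reports any errors."""
    errors = service_errors(body)
    if errors:
        raise BetaApiError("; ".join(errors))


# Body copied from the output above.
body = {
    "service-messages": {
        "error": [
            "Table ACSDT5YSPT2015.B01001 has no data for the requested geographies"
        ]
    }
}

try:
    raise_for_service_errors(body)
except BetaApiError as e:
    print(e)
```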

Here is an example of a similar scenario in `censusdis`:

In [21]:
import censusdis.data as ced
from censusdis.datasets import ACS5
from censusdis import CensusApiException

In [22]:
try:
    df = ced.download(
        ACS5,
        2015,
        group='B01001',
        geo='g1',
    )
except CensusApiException as e:
    print(e)

Unable to match the geography specification {'geo': 'g1'}.
Supported geographies for dataset='acs/acs5' in year=2015 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'county', 'county_subdivision', 'subminor_civil_division']
['state', 'county', 'county_subdivision', 'place_remainder_or_part']
['state', 'county', 'tract']
['state', 'county', 'tract', 'block_group']
['state', 'place', 'county_or_part']
['state', 'place']
['state', 'consolidated_city']
['state', 'consolidated_city', 'place_or_part']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['american_indian_area_alaska_native_area_hawaiian_home_land', 'american_indian_tribal_subdivision']
['american_indian_area_alaska_native_area_hawaiian_home_land', 'american_indian_area_alaska_native_area_reservation_or_statistical_entity_only']
['american_indian_area_alaska_native_area_hawaiian_home_land', 'american

Now I can read the error message, and realize what I want is state and county.

In [23]:
import censusdis.states as states
from censusdis.counties.arizona import MOHAVE

df = ced.download(
    ACS5,
    2015,
    ['NAME'],
    group='B01001',
    
    state=states.AZ,
    county=MOHAVE
)

In [24]:
df

Unnamed: 0,STATE,COUNTY,NAME,B01001_001E,B01001_002E,B01001_003E,B01001_004E,B01001_005E,B01001_006E,B01001_007E,...,B01001_041E,B01001_042E,B01001_043E,B01001_044E,B01001_045E,B01001_046E,B01001_047E,B01001_048E,B01001_049E,GEO_ID
0,4,15,"Mohave County, Arizona",203362,102371,4882,5447,5912,3561,2037,...,8507,3439,4727,3373,4967,7239,4768,3332,2901,0500000US04015


### Attributes

[link to docs](https://api.census.gov/docs/user-guide/overview/attributes.html)

I'm not sure what `took` represents. Is it the time the request took on the server side,
or some measure of resources consumed? It is 0 in both of the responses below.

In [25]:
bad_response.json()['general-info']

{'took': 0}

In [26]:
good_response.json()['general-info']

{'took': 0}

## Concepts

### Geography

[link to docs](https://api.census.gov/docs/user-guide/concepts/geography.html)

This is the section where I think I am going to be most critical. The way that `geoIds` have
to be constructed in the beta API is, in my opinion, a step backwards both for researchers
who want to query the data and for developers of client-side APIs like `censusdis`.
There are two primary reasons for this:

1. Now users have to know the codes for summary levels. This is just one more set of
   codes to remember or, more likely, to forget and to have to look up every time you
   want to use them.
2. This new approach is an example of what I call *coding by string construction*. The
   idea is that instead of writing your code in a proper language, like Python or R,
   using standard idioms and taking advantage of tools like IDEs, syntax checkers, and
   so on to guide you and catch problems before you even run a single line of code,
   you are forced to construct a string that has detailed semantic meaning and is hard
   to get right.

I know that summary level codes have always existed, and `censusdis` looks them up and
uses them internally, but we went out of our way to never expose them to users.

As far as coding by string construction, in this case I wonder if these are really facet
ids used by whatever underlying faceted search library you are using on the server side
and you are needlessly exposing them to users because that way you don't have to do 
any translation to or from an internal format that may make sense to machines, but does
not make sense to people.
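To make concrete how much meaning is packed into one of these strings, here is a minimal sketch that splits a simple geoId back into its parts. The field layout (3-character summary level, 2-character geographic variant, 2-character geographic component, the literal `US`, then the code) is my understanding of the usual Census GEOID convention, not something taken from the beta docs. `GeoIdParts` and `parse_geo_id` are hypothetical names.

```python
from typing import NamedTuple


class GeoIdParts(NamedTuple):
    summary_level: str
    variant: str
    component: str
    code: str


def parse_geo_id(geo_id: str) -> GeoIdParts:
    """Split a simple (non-wildcard) geoId into its assumed fields."""
    prefix, sep, code = geo_id.partition("US")
    if sep != "US" or len(prefix) != 7:
        raise ValueError(f"Not a recognizable geoId: {geo_id!r}")
    return GeoIdParts(prefix[:3], prefix[3:5], prefix[5:7], code)


print(parse_geo_id("0500000US04015"))
```

Even after parsing, the `code` field still has to be split into state and county by position, and that depends on knowing what summary level `050` means.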

I am going to demonstrate below some of the kinds of experiences that this new approach
is likely to put users through and contrast them with what I think, and my users have
told me, is a friendly, user-centric approach to geography. Some additional examples 
and related materials are available in [Lesson 2](https://github.com/censusdis/censusdis-tutorial-2024/blob/main/Lessons/Lesson%202%20Maps.ipynb) of a tutorial I gave on the subject.


#### Examples of GeoId Construction

Let's begin with Mohave County, Arizona, the first example from [the documentation](https://api.census.gov/docs/user-guide/concepts/geography.html).

Here's how I would do it if I had to construct the ID manually:

In [27]:
# A good reference to summary levels so you can find the one you want: https://mcdc.missouri.edu/geography/sumlevs/

# County is summary level 050. 
SUMMARY_LEVEL_COUNTY = '050'

# I already imported symbols for the FIPS codes I need from `censusdis`
# so I will use them in constructing my geoId.

geo_id_mojave_county_az = f'{SUMMARY_LEVEL_COUNTY}0000US{states.AZ}{MOHAVE}'

geo_id_mojave_county_az

'0500000US04015'

Now I have something I can pass on to the POST I did above. If I were writing a library,
I would probably make a convenience method for each summary level. So this one would look
like this:

In [28]:
def geoid_state_county(state_fips: str, county_fips: str) -> str:
    """Construct a geoId for a state and a county."""
    return f'{SUMMARY_LEVEL_COUNTY}0000US{state_fips}{county_fips}'

Now the user would do this instead of direct string manipulation:

In [29]:
geo_id_mojave_county_az = geoid_state_county(states.AZ, MOHAVE)

geo_id_mojave_county_az

'0500000US04015'

Contrast this with what you saw in the query above, where I never explicitly
constructed or dealt with a representation of geography as a string or an
object. I just used `state=states.AZ, county=MOHAVE` directly in my
query to get the data I wanted, and all the other details happened behind
the scenes for me, including generation of the nice error message when I
got it wrong.

While I can certainly keep the user-facing `censusdis` interface I have and
implement a utility function behind the scenes, something like `geoid_from_geo(**kwargs)`,
that constructs geoIds from arbitrary combinations of keyword args to pass
to the beta API, it seems like a waste of effort on both our parts. I, as the
client developer or end user, have to marshal my geography request into a string,
and you, on the server side, most likely have to parse it back into components.
Maybe you have data indexed in a way where these are direct keys, but in that
case you are exposing internal implementation details that most users don't
want to know about.

In summary, I think just about everyone would rather use an API (whether it is 
the REST API or an API provided by a client library) that takes arguments like
```
{
    'state': '04',
    'county': '015'
}
```
than
```
{
    'geoId': '0500000US04015'
}
```
The former is written in terms of concepts humans think about; the latter
is a string with a domain-specific encoding.
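For illustration, the behind-the-scenes translation mentioned above might look roughly like this. It is a sketch, not how `censusdis` works today: `geoid_from_geo` covers only a handful of geographies, the summary level codes and field widths (2-digit state, 3-digit county, 6-digit tract, 1-digit block group, 5-digit place) are the standard ones as I understand them, and the geographic variant and component are assumed to always be `0000`.

```python
from typing import Dict, FrozenSet, Tuple

Fields = Tuple[Tuple[str, int], ...]

# Keyword combination -> (summary level, ordered (name, width) fields).
_LEVELS: Dict[FrozenSet[str], Tuple[str, Fields]] = {
    frozenset({"state"}): ("040", (("state", 2),)),
    frozenset({"state", "county"}): ("050", (("state", 2), ("county", 3))),
    frozenset({"state", "county", "tract"}): (
        "140",
        (("state", 2), ("county", 3), ("tract", 6)),
    ),
    frozenset({"state", "county", "tract", "block_group"}): (
        "150",
        (("state", 2), ("county", 3), ("tract", 6), ("block_group", 1)),
    ),
    frozenset({"state", "place"}): ("160", (("state", 2), ("place", 5))),
}


def geoid_from_geo(**kwargs: str) -> str:
    """Build a geoId from keyword arguments such as state= and county=."""
    try:
        summary_level, fields = _LEVELS[frozenset(kwargs)]
    except KeyError:
        supported = sorted(sorted(keys) for keys in _LEVELS)
        raise ValueError(
            f"Unsupported geography {sorted(kwargs)}. Supported: {supported}"
        ) from None
    parts = []
    for name, width in fields:
        value = kwargs[name]
        # Catch wrong-width codes before they are baked into the string.
        if len(value) != width or not value.isdigit():
            raise ValueError(f"{name}={value!r} must be exactly {width} digits")
        parts.append(value)
    return f"{summary_level}0000US{''.join(parts)}"


print(geoid_from_geo(state="04", county="015"))
print(geoid_from_geo(state="15", place="07675"))
```

The width checks are the part a hand-built f-string does not give you, since one extra or missing digit still produces a plausible-looking string.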

For completeness, here are additional queries that contrast the simple, user-friendly
API approach with the coding by string construction approach.

In [30]:
# Houston-Pasadena-The Woodlands, Texas Metro Area

from censusdis.msa_msa import HOUSTON_THE_WOODLANDS_SUGAR_LAND_TX_METRO_AREA

df_cbsa = ced.download(
    ACS5,
    2022,
    ['NAME', 'B01001_001E'],
    metropolitan_statistical_area_micropolitan_statistical_area=HOUSTON_THE_WOODLANDS_SUGAR_LAND_TX_METRO_AREA
)

df_cbsa

Unnamed: 0,METROPOLITAN_STATISTICAL_AREA_MICROPOLITAN_STATISTICAL_AREA,NAME,B01001_001E
0,26420,"Houston-The Woodlands-Sugar Land, TX Metro Area",7142603


In [31]:
# Oops. CBSA, aka metropolitan/micropolitan, is not in my go-to list at https://mcdc.missouri.edu/geography/sumlevs/
# There is a more extensive list at https://mcdc.missouri.edu/geography/sumlevs/sumlev-master-list.csv
# Let's check that. According to that one, 310 is the summary level we want.

SUMMARY_LEVEL_CBSA = "310"

# Note that if I didn't have the symbol, I'd need to explore to find the right 
# number for Houston.
geo_id_cbsa = f"{SUMMARY_LEVEL_CBSA}0000US{HOUSTON_THE_WOODLANDS_SUGAR_LAND_TX_METRO_AREA}"

geo_id_cbsa

'3100000US26420'

In [32]:
TABLE_URL = BETA_API_BASE_URL + '/table'

In [33]:
cbsa_response = requests.post(
    TABLE_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                # Note that this ID is again an example of coding
                # by string construction.
                "id": "ACSDT5Y2015.B01001",
                "geoIds": [geo_id_cbsa],
                "columns": ["B01001_001E"]
            }
        ]
    }
)

In [34]:
cbsa_response.status_code

400

In [35]:
cbsa_response.json()['service-messages']['error']

['Table ACSDT5Y2015.B01001 has no data for the requested geographies']

In [36]:
# It looks like data may not be loaded for this geography in the beta API. 
# Let's just double check we can get it for the county-level geography.

county_response = requests.post(
    TABLE_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                "id": "ACSDT5Y2015.B01001",
                "geoIds": [geo_id_mojave_county_az],
                "columns": ["B01001_001E"]
            }
        ]
    }
)

In [37]:
county_response.status_code

200

In [38]:
county_response.json()['content']['tables'][0]['data']

{'columns': [{'label': 'Estimate!!Total', 'id': 'B01001_001E'}],
 'rows': [{'B01001_001E': 203362}]}
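The `data` payload keeps column labels separate from row values. A minimal sketch of joining them back together, assuming every response follows the `columns` / `rows` shape shown above (`rows_with_labels` is a hypothetical helper name):

```python
from typing import Any, Dict, List


def rows_with_labels(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Re-key each row by its column label, falling back to the column id."""
    labels = {col["id"]: col["label"] for col in data["columns"]}
    return [
        {labels.get(key, key): value for key, value in row.items()}
        for row in data["rows"]
    ]


# Payload copied from the output above.
data = {
    "columns": [{"label": "Estimate!!Total", "id": "B01001_001E"}],
    "rows": [{"B01001_001E": 203362}],
}

print(rows_with_labels(data))
```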

In [39]:
# Fern Forest CDP, Hawaii

from censusdis.places.hawaii import FERN_FOREST_CDP

# From https://mcdc.missouri.edu/geography/sumlevs/
SUMMARY_LEVEL_PLACE = "160"

geo_id_place = f"{SUMMARY_LEVEL_PLACE}0000US{states.HI}{FERN_FOREST_CDP}"

geo_id_place 

'1600000US1507675'

In [40]:
place_response = requests.post(
    TABLE_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                # Note that this ID is again an example of coding
                # by string construction.
                "id": "ACSDT5Y2015.B01001",
                "geoIds": [geo_id_place],
                "columns": ["B01001_001E"]
            }
        ]
    }
)

In [41]:
place_response.status_code

200

In [42]:
place_response.json()['content']['tables'][0]['data']

{'columns': [{'label': 'Estimate!!Total', 'id': 'B01001_001E'}],
 'rows': [{'B01001_001E': 669}]}

In [43]:
# Alternate API

df_place = ced.download(
    ACS5,
    2015,
    ["NAME", "B01001_001E"],
    state=states.HI,
    place=FERN_FOREST_CDP
)

In [44]:
df_place

Unnamed: 0,STATE,PLACE,NAME,B01001_001E
0,15,7675,"Fern Forest CDP, Hawaii",669


In [45]:
# We will skip a few and go to a hairier one, like
# Block Group 2, Census Tract 104.01, Jersey County, Illinois

from censusdis.counties.illinois import JERSEY

# From https://mcdc.missouri.edu/geography/sumlevs/
SUMMARY_LEVEL_BLOCK_GROUP = "150"

TRACT = "010401"
BLOCK_GROUP = "2"

geo_id_block_group = f"{SUMMARY_LEVEL_BLOCK_GROUP}0000US{states.IL}{JERSEY}{TRACT}0{BLOCK_GROUP}"

geo_id_block_group

'1500000US1708301040102'

In [46]:
block_group_response = requests.post(
    TABLE_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                # Note that this ID is again an example of coding
                # by string construction.
                "id": "ACSDT5Y2015.B01001",
                "geoIds": [geo_id_block_group],
                "columns": ["B01001_001E"]
            }
        ]
    }
)

In [47]:
block_group_response.status_code

400

In [48]:
block_group_response.json()['service-messages']['error']

['Table ACSDT5Y2015.B01001 has no data for the requested geographies']

In [49]:
df_block_group = ced.download(
    ACS5,
    2022,
    ["NAME", "B01001_001E"],
    state=states.IL,
    county=JERSEY,
    tract=TRACT,
    block_group=BLOCK_GROUP
)


In [50]:
df_block_group

Unnamed: 0,STATE,COUNTY,TRACT,BLOCK_GROUP,NAME,B01001_001E
0,17,83,10401,2,Block Group 2; Census Tract 104.01; Jersey Cou...,1817


#### GeoId Construction with Wildcards

In [51]:
# All counties in Arkansas

In [52]:
# From https://mcdc.missouri.edu/geography/sumlevs/
SUMMARY_LEVEL_STATE = "040"

geo_id_county_all = f"{SUMMARY_LEVEL_STATE}0000US{states.AR}${SUMMARY_LEVEL_COUNTY}0000"

geo_id_county_all

'0400000US05$0500000'
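If a client library hid this syntax, the wildcard form might be wrapped like this. It is a sketch only: `geoid_all_within` is a hypothetical name, and the `parent$child` layout is inferred from the single example above, not from a specification.

```python
def geoid_all_within(
    parent_summary_level: str, parent_code: str, child_summary_level: str
) -> str:
    """Build a wildcard geoId for all child geographies within a parent."""
    for level in (parent_summary_level, child_summary_level):
        if len(level) != 3 or not level.isdigit():
            raise ValueError(f"Summary level {level!r} must be exactly 3 digits")
    return f"{parent_summary_level}0000US{parent_code}${child_summary_level}0000"


# All counties (050) within the state (040) with FIPS code 05, Arkansas.
print(geoid_all_within("040", "05", "050"))
```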

In [53]:
all_county_response = requests.post(
    TABLE_URL, 
    headers=headers, 
    json={
        "tables": [
            {
                # Note that this ID is again an example of coding
                # by string construction.
                "id": "ACSDT5Y2015.B01001",
                "geoIds": [geo_id_county_all],
                # Note to explore. I was not able to get
                # STATE or COUNTY for each row by inserting
                # them in this list. Maybe they go somewhere
                # else?
                "columns": ["NAME", "B01001_001E"]
            }
        ]
    }
)

In [54]:
all_county_response.status_code

200

In [55]:
all_county_data = all_county_response.json()['content']['tables'][0]['data']

In [56]:
import pandas as pd

df_all_county_beta = pd.DataFrame(all_county_data['rows'])
df_all_county_beta

Unnamed: 0,NAME,B01001_001E
0,"Arkansas County, Arkansas",18731
1,"Ashley County, Arkansas",21229
2,"Baxter County, Arkansas",41040
3,"Benton County, Arkansas",238198
4,"Boone County, Arkansas",37227
...,...,...
70,"Van Buren County, Arkansas",17002
71,"Washington County, Arkansas",216432
72,"White County, Arkansas",78660
73,"Woodruff County, Arkansas",6983


In [57]:
df_all_county_censusdis = ced.download(
    ACS5,
    2015,
    ["NAME", "B01001_001E"],
    state=states.AR,
    county='*'
)

df_all_county_censusdis

Unnamed: 0,STATE,COUNTY,NAME,B01001_001E
0,05,099,"Nevada County, Arkansas",8793
1,05,037,"Cross County, Arkansas",17467
2,05,039,"Dallas County, Arkansas",7868
3,05,027,"Columbia County, Arkansas",24327
4,05,125,"Saline County, Arkansas",113833
...,...,...,...,...
70,05,095,"Monroe County, Arkansas",7713
71,05,103,"Ouachita County, Arkansas",25044
72,05,001,"Arkansas County, Arkansas",18731
73,05,015,"Carroll County, Arkansas",27635
