# 201A Lecture: Navigating ACS Data in Python
September 8, 2025
# Learning Objectives
* Retrieve ACS data from the Census Data API
* Rename variables to make them easier to analyze


## Many ways to accomplish a task

There are almost always multiple ways to accomplish a given task in a full-featured programming language like Python. Different approaches have different merits, which can include:

* **readability**: how easy is it for someone else (or future you) to read and understand what the code is doing?
* **reusability**: how well suited is the code for reuse elsewhere, or for extension to multiple inputs or different datasets?
* **speed**: how quickly does the code run? Different approaches may take orders of magnitude more or less time to perform the same work.
* **conciseness**: is the code nice and brief, as short as it can be but not shorter?

In general, you should *prioritize readability over all others*, because the time it takes you to figure out what the heck you were doing when you wrote that code is measured in seconds at best and days at worst, while the time your code takes to run is usually measured in nanoseconds at best and minutes at worst. However, speed can really start to matter as your datasets get larger and larger, and your code runs more and more frequently. 

We will suggest approaches that prioritize readability/simplicity, especially because many of you are just getting started with Python/pandas. We don't have time to note all the other ways of doing things, but just remember that they often do exist, so feel free to play around!

# Retrieve ACS data from the Census Data API

Today, we are going to work on getting the data we need from the ACS to answer some of the questions that are in Assignment 1.  On Wednesday, we'll start some basic analysis!

We are going to start easy: **what is the racial and ethnic profile of the county your neighborhood is in?**  Even though we could use 1-year estimates, we're going to use the 5-year because we will eventually want to compare our neighborhood to the county data.

## Census Data API
We are going to use the Census Data API to download data. The Census Data API lets us pull data directly into our Python analyses. 

First, you will need to request an API key.  Do that now! https://api.census.gov/data/key_signup.html
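
Once you have a key, one option is to keep it out of the notebook itself so it doesn't get shared by accident. The sketch below reads the key from an environment variable. The variable name `CENSUS_API_KEY` and the helper `load_census_key` are assumptions made for this example, not something the Census Bureau requires.

```python
import os


def load_census_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable.

    The default variable name is an assumption for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"No API key found. Set the {env_var} environment variable first."
        )
    return key


# Hypothetical usage, once the variable is set in your shell or notebook:
# api_key = load_census_key()
```

Pasting the key directly into the notebook works too; this just makes it harder to publish the key along with your code.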

We will need the `census` package, which makes it pretty easy to query the Census API. Let's try that now with a simple example query.

Note that requests for larger data pulls (many variables and/or many rows) may take a few seconds to complete. Still faster than going to data.census.gov and downloading a CSV ;-) You can tell if a cell is "still working" if there's a `*` in the brackets next to it. 

In [None]:
# If you're working in Miniconda or Anaconda, you may need to install census first
!pip install census

In [None]:
# Import the Census object 
from census import Census
import pandas as pd

In [None]:
#set up your API key
api_key = 'put your API key here'
c = Census(key=api_key)

# Example query: total population for all counties in California
c.acs5.get(
    ('NAME', 'GEO_ID', 'B01001_001E'),
    {'for': 'county:*', 'in': 'state:06'},
    year=2023
)

In [None]:
#what if you wanted to get all the data for census tracts in Alabama?

Let's break down that last command, the one that actually retrieves the ACS data.
* `c.acs5.get` means we want to get data from the ACS 5-year estimates
* The first argument is a *tuple* (a tuple is like a list but is [immutable](https://realpython.com/python-mutable-vs-immutable-types/)) of variables we want:
    * `NAME`: the human-readable name of each geography
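    * `GEO_ID`: the Census Bureau's geographic identifier for each geography, which ends in the geography's Federal Information Processing Standard (FIPS) code. For example, a county `GEO_ID` looks like `0500000US06001`, where `06001` is the state and county FIPS code. Every geographic area, from the entire state of California to each tiny census block, has a [unique FIPS code](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt)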
    * `B01001_001E`: estimate #1 from table B01001, i.e. total population
    * We could specify many other variables here too
* The second argument is a *dict* (set of key:value pairs) specifying the geographies we want:
    * `'for': 'county:*'`: we want results for all counties... (we could instead specify one or more specific counties: `'for': 'county:001,075'`)
    * `'in': 'state:06'`: ...in California (CA's state FIPS code is 06)
    * If you want to specify multiple geographies for `in`, separate them with spaces. For example, "all census tracts in Alameda County, CA" would use `{'for': 'tract:*', 'in': 'state:06 county:001'}` as its geography dict
* The third argument, `year`, specifies which year's data we want. If you don't specify a year, `get` falls back to a default year that depends on the installed version of the `census` package (not necessarily the latest available data), so it's good practice to explicitly specify `year`. That way your code will pull the same vintage of data no matter when or where you run it
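
To see how these three arguments fit together, it can help to look at the kind of web address the Census Data API expects. The sketch below builds such a URL by hand without sending any request. `build_acs5_url` is a hypothetical helper written for this example, and it only approximates what the `census` package does for us (the real request also carries your API key as a `key` parameter).

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def build_acs5_url(variables, geo, year):
    """Assemble an ACS 5-year query URL (illustration only, no request is sent)."""
    # The API expects the variable names as one comma-separated string
    params = {"get": ",".join(variables)}
    # The geography dict contributes the 'for' and 'in' parameters
    params.update(geo)
    # Keep ':', '*' and ',' unescaped so the URL stays readable
    query = urlencode(params, safe=":*,")
    return f"{BASE_URL}/{year}/acs/acs5?{query}"


url = build_acs5_url(
    ("NAME", "GEO_ID", "B01001_001E"),
    {"for": "county:*", "in": "state:06"},
    year=2023,
)
print(url)
```

Notice that the year ends up in the path of the URL, which is why each year of data is effectively a separate dataset.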

The query returns a *list* of *dicts*, each containing the three variables we requested as well as the two variables we used to specify our geographies of interest (`state` and `county`). It just so happens that this format is perfectly suited for conversion to a pandas `DataFrame`. We can do this by passing the result of the `c.acs5.get()` call to the `pd.DataFrame` *constructor*. This results in a `DataFrame` object containing our ACS data:

In [None]:
# Store the ACS data in a pandas DataFrame
df = pd.DataFrame(
    c.acs5.get(
        ('NAME', 'GEO_ID', 'B01001_001E'),
        {'for': 'county:*', 'in': 'state:06'},
        year=2023
    )
)

# Display the resulting DataFrame
df
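
The `GEO_ID` values in this table are longer than a bare FIPS code. As far as we know, they start with a summary-level prefix and the letters `US`, followed by the FIPS digits (for example, `0500000US06001` for Alameda County). Here is a minimal sketch of pulling out the FIPS portion, where `geo_id_to_fips` is a hypothetical helper name:

```python
def geo_id_to_fips(geo_id):
    """Return the part of a GEO_ID that follows the 'US' marker."""
    prefix, marker, fips = geo_id.partition("US")
    if not marker or not fips:
        raise ValueError(f"Unexpected GEO_ID format: {geo_id!r}")
    return fips


county_fips = geo_id_to_fips("0500000US06001")
state_part = county_fips[:2]    # first two digits identify the state
county_part = county_fips[2:]   # remaining digits identify the county
print(county_fips, state_part, county_part)
```

With a `DataFrame`, something like `df['GEO_ID'].map(geo_id_to_fips)` would apply the same idea to every row.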

Great! We have Census data in a pandas `DataFrame`! But... how do you know what data to ask for?

## Finding the data you want

The Census Data API User Guide ([website](https://www.census.gov/data/developers/guidance/api-user-guide.html), [PDF](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-user-guide.pdf)) is an invaluable resource. It, in turn, points us toward the [API Discovery Tool](https://api.census.gov/data.html), which begins with a downright intimidating and frankly not very useful list of ALL datasets available via the API. Whew! The easiest way to work with this table is by searching it... Search this page for "2023/acs/acs5" - we're looking for the "American Community Survey: 5-Year Estimates: Detailed Tables 5-Year" row, which is near the bottom. 

In the "American Community Survey: 5-Year Estimates: Detailed Tables 5-Year" row, you'll find some useful links. The "[variables](https://api.census.gov/data/2023/acs/acs5/variables.html)" link takes you to a table of all variables available in the ACS dataset. Maybe even more useful is the "[groups](https://api.census.gov/data/2023/acs/acs5/groups.html)" link, whose table lists all "groups" of ACS variables, i.e. tables. The "groups" table is very helpful: you can search or scan through it for tables that may be relevant to the question at hand. 

Today we want to analyze race and ethnicity, so Table B03002, Hispanic or Latino Origin by Race, is what we're looking for. (Table names may look random, but they aren't. More info on the format of ACS table names [here](https://www.census.gov/programs-surveys/acs/data/data-tables/table-ids-explained.html).) Click the "[selected variables](https://api.census.gov/data/2023/acs/acs5/groups/B03002.html)" link to the right of the B03002 row in the [groups](https://api.census.gov/data/2023/acs/acs5/groups.html) page to see all the variables in this table.

When I say "all the variables," I mean ALL the variables. There are four kinds of variables in all these tables:
1. Variables ending in `E` are *estimates*, i.e. the value itself.
2. Variables ending in `EA` are *annotations* of the estimates. Occasionally there is extra information about the estimate; in those cases, that information appears here. Most estimates don't have annotations so these variables are usually blank.
3. Variables ending in `M` are *margins of error*, i.e. how precise the estimate is. The Census Bureau provides margins of error for a 90% confidence level; that is, we can be 90% confident that the true value is somewhere in the interval $[\text{estimate} - \text{MOE},\ \text{estimate} + \text{MOE}]$.
4. Variables ending in `MA` are annotations of the margins of error. For example, the Census Bureau has validated some estimates (using the decennial census and other measures), and it believes the MOE for these estimates is negligible. Such information will appear here; again, though, these variables are usually blank.
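
As a quick illustration of how an estimate and its margin of error combine, here is a minimal sketch. The function name `moe_interval` and the numbers are made up for this example; they are not real ACS values.

```python
def moe_interval(estimate, moe):
    """Return the (low, high) bounds of the 90% interval around an estimate."""
    if moe < 0:
        raise ValueError("A margin of error should not be negative")
    return (estimate - moe, estimate + moe)


# Made-up numbers: an estimate of 1,200 people with an MOE of 150
low, high = moe_interval(1200, 150)
print(f"We can be 90% confident the true value is between {low} and {high}")
```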

More information about annotations [here](https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html). With any luck, you'll almost never run into these edge cases, except the `*****` annotation, which indicates that the estimate is controlled (validated against other sources), so its MOE is treated as zero.

You could use the shorthand `group(B03002)` to request all variables in Table B03002:

In [None]:
df = pd.DataFrame(
    c.acs5.get(
        ('NAME', 'GEO_ID', 'group(B03002)'),
        {'for': 'county:*', 'in': 'state:06'},
        year=2023
    )
)
df

In [None]:
df.shape

88 columns is a lot! And we really don't need or want most of those columns. Better, instead, to study the table, think about what we actually want, and request those variables specifically.
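
If you do end up with a whole group's worth of columns, one way to trim them is to use the suffix rules described above. The sketch below works on a plain list of column names. `variable_kind` and `keep_estimates_and_moes` are hypothetical helpers, and the pattern assumes detailed-table names shaped like `B03002_001E`.

```python
import re

# Table ID, underscore, three-digit row number, then one of the four suffixes
VAR_PATTERN = re.compile(r"^[A-Z]\d{5}[A-Z]*_\d{3}(EA|MA|E|M)$")

SUFFIX_KINDS = {
    "E": "estimate",
    "EA": "estimate annotation",
    "M": "margin of error",
    "MA": "margin of error annotation",
}


def variable_kind(name):
    """Classify a variable name by its suffix, or return None if it doesn't fit."""
    match = VAR_PATTERN.match(name)
    if match is None:
        return None
    return SUFFIX_KINDS[match.group(1)]


def keep_estimates_and_moes(columns):
    """Keep only the estimate and margin-of-error columns, dropping annotations."""
    wanted = {"estimate", "margin of error"}
    return [col for col in columns if variable_kind(col) in wanted]


example_columns = [
    "NAME", "GEO_ID",
    "B03002_001E", "B03002_001EA", "B03002_001M", "B03002_001MA",
    "state", "county",
]
print(keep_estimates_and_moes(example_columns))
```

This only removes the annotation columns. You still need to decide which rows of the table you actually want.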

This is where data.census.gov and the metadata file can really help. If you search data.census.gov for the table name (e.g. B03002), you can see an example of the table and learn more about it. The data.census.gov view helps you understand what *universe* this table reflects (click the "Notes" button), and how the groups and subgroups are laid out. data.census.gov and the Census Data API are best friends, not rivals.

Take a minute to look at [this table](https://data.census.gov/table/ACSDT5Y2023.B03002?&g=050XX00US06001$1400000) and think through how you're going to organize and answer the question: what are the racial and ethnic characteristics of your community?  What racial/ethnic groups are you going to focus on?  Are you going to use the conventional categories discussed in lecture, or something different?

Here's where you also have to be careful: within the table, some columns are subsets of other columns. For example, B03002_009E and B03002_009M are the estimate and margin of error for "Not Hispanic or Latino: Two or more races," whereas B03002_010 and B03002_011 are subcategories of B03002_009. If you were to add all of those columns together, you'd count some people twice and get a larger population than actually exists.
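
A small arithmetic check makes the double-counting problem concrete. The counts below are invented for a single hypothetical geography (they are not real ACS values). The only thing that matters is how the rows relate to one another.

```python
# Invented counts for one hypothetical geography (not real ACS data)
row = {
    "B03002_001E": 1000,  # total population
    "B03002_003E": 400,   # not Hispanic: White alone
    "B03002_004E": 100,   # not Hispanic: Black alone
    "B03002_005E": 10,    # not Hispanic: American Indian/Alaska Native alone
    "B03002_006E": 150,   # not Hispanic: Asian alone
    "B03002_007E": 5,     # not Hispanic: Pacific Islander alone
    "B03002_008E": 5,     # not Hispanic: some other race alone
    "B03002_009E": 30,    # not Hispanic: two or more races
    "B03002_010E": 10,    # subcategory of row 9
    "B03002_011E": 20,    # subcategory of row 9
    "B03002_012E": 300,   # Hispanic or Latino (any race)
}

# Rows 3-9 and 12 cover everyone exactly once
non_overlapping = [f"B03002_{n:03d}E" for n in (3, 4, 5, 6, 7, 8, 9, 12)]
clean_sum = sum(row[var] for var in non_overlapping)

# Adding rows 10 and 11 counts the "two or more races" group twice
inflated_sum = clean_sum + row["B03002_010E"] + row["B03002_011E"]

print(clean_sum, inflated_sum, row["B03002_001E"])
```

Comparing the sum of your chosen categories against the total is a handy sanity check on any real pull.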

## Selecting and renaming variables

We’re going to follow the standard convention and treat "Hispanic" as its own category. We are looking for a set of variables that cover the entire population without double-counting anyone. Looking at the table above, we can do so by selecting rows 3-9 and 12. Let's also grab row 1 to use as the denominator for some proportions we'll calculate later. As we can see in the [group's page](https://api.census.gov/data/2023/acs/acs5/groups/B03002.html), or on the overall [variables table](https://api.census.gov/data/2023/acs/acs5/variables.html), row 1 in Table B03002 corresponds to variables B03002_001E (estimate) and B03002_001M (margin of error). Similarly, row 3 corresponds to variables B03002_003E and B03002_003M, and so forth.

We now have what we need to construct a useful query to the Census Data API:
* A list of variables we want to pull, and
* A way to specify the geographies for which we want to pull data

We can put these ingredients together as shown below:

In [None]:
# I'm only going to specify a few of the variables we want, because
# we have something even better in store
df = pd.DataFrame(
    c.acs5.get(
        ['NAME', 'GEO_ID', 'B03002_001E', 'B03002_001M', 'B03002_003E', 'B03002_003M', 'B03002_004E', 'B03002_004M'],
        {'for': 'county:*', 'in': 'state:06'},
        year=2023
    )
)

df

Now we have a `DataFrame` with a bunch of variables we care about. Great! But we're humans, not machines; are we really meant to just remember which variable corresponds to "Asian" and which corresponds to "Hispanic"? (Of course not.)

One thing that can be helpful is to rename your variables - not necessary, but easier to work with. It's advisable to get in the habit of naming variables with lowercase letters, underscores rather than spaces, and no "special" characters. 
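
If you'd rather derive names from the Census labels than type them by hand, a small helper can enforce that naming habit. This is just one possible sketch, and `to_snake_case` is a hypothetical function written for this example.

```python
import re


def to_snake_case(label):
    """Lowercase a label and replace runs of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", label.strip().lower())
    # Drop any underscores left dangling at the start or end
    return cleaned.strip("_")


print(to_snake_case("Not Hispanic or Latino: White alone"))
print(to_snake_case("Median Gross Rent ($)"))
```

The results can get long, so in practice you may still prefer short hand-picked names like `nh_white`.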

We can create a list of variables, and then pass this list to `c.acs5.get()` to specify which variables we want to pull! This suggests a very neat setup in which we define a master dict of variables of interest and use it both to govern which variables we request *and* rename the columns of the `DataFrame` resulting from our data pull:

In [None]:
# Define the dict of variables to pull and rename
variables_of_interest = {
    'NAME': 'NAME',  # no need to rename this variable
    'GEO_ID': 'GEO_ID',  # this one either
    'B03002_001E': 'total',
    'B03002_001M': 'total_moe',
    'B03002_003E': 'nh_white',
    'B03002_003M': 'nh_white_moe',
    'B03002_004E': 'nh_black',
    'B03002_004M': 'nh_black_moe',
    'B03002_005E': 'nh_native',
    'B03002_005M': 'nh_native_moe',
    'B03002_006E': 'nh_asian',
    'B03002_006M': 'nh_asian_moe',
    'B03002_007E': 'nh_pi',
    'B03002_007M': 'nh_pi_moe',
    'B03002_008E': 'nh_1other',
    'B03002_008M': 'nh_1other_moe',
    'B03002_009E': 'nh_multi',
    'B03002_009M': 'nh_multi_moe',
    'B03002_012E': 'hispanic',
    'B03002_012M': 'hispanic_moe',
}

# Pull the data and store in a DataFrame
# We can name this DataFrame "df_county"
# to distinguish it from other scales
df_county = pd.DataFrame(
    c.acs5.get(
        list(variables_of_interest.keys()),
        {'for': 'county:*', 'in': 'state:06'},
        year=2023
    )
)

# Rename the DataFrame columns using the dict
df_county = df_county.rename(columns=variables_of_interest)

df_county

The `variables_of_interest` dict contains just one set of variables and human-readable names. Feel free to choose variables and names that make sense to you, as long as you cover the whole population one way or another.
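
Because every estimate has a matching margin of error, the dict can also be generated rather than typed out line by line. Here is a sketch of one way to do that. `build_variable_map` and `race_rows` are hypothetical names, and the short labels simply mirror the ones used above.

```python
def build_variable_map(table_rows, include_moe=True):
    """Expand {'B03002_003': 'nh_white', ...} into estimate and MOE entries."""
    variables = {"NAME": "NAME", "GEO_ID": "GEO_ID"}
    for code, name in table_rows.items():
        variables[f"{code}E"] = name  # estimate
        if include_moe:
            variables[f"{code}M"] = f"{name}_moe"  # margin of error
    return variables


race_rows = {
    "B03002_001": "total",
    "B03002_003": "nh_white",
    "B03002_004": "nh_black",
    "B03002_005": "nh_native",
    "B03002_006": "nh_asian",
    "B03002_007": "nh_pi",
    "B03002_008": "nh_1other",
    "B03002_009": "nh_multi",
    "B03002_012": "hispanic",
}

generated_variables = build_variable_map(race_rows)
print(len(generated_variables))
```

The resulting dict can be used the same way as `variables_of_interest`: pass its keys to the query, then pass the dict itself to `rename`.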

In [None]:
# Try to rewrite the code to pull all cities in California


In [None]:
# Now, re-write the code to pull data on median rent for all counties in CA, rename the variable, and include the MOE column
