# Interactive Session: Human Population Data - US Census

![ntl](./assets/humans.jpg)

Humans create [complex systems](https://en.wikipedia.org/wiki/Complex_system) and, as such, humans are incredibly difficult to study. Even counting how many people there are on the planet is difficult. Many large countries, which we think have rapidly growing populations, have not had a [reliable census conducted in decades](https://www.pnas.org/doi/abs/10.1073/pnas.1715305115). Even the United States, which has a well-regarded and well-funded census, has trouble [estimating population](https://doi.org/10.1016/j.apgeog.2013.11.002).

Human population data is very powerful. In the US, **\$2.8 Trillion** is distributed based on Census data. That's a lot of money. Human data can also be used in nefarious ways. Governments can rig elections by stuffing districts, and governments can oppress groups if they can count them. This is all to say that, while human data is really important for sustainability, there are real ethical considerations when developing and using human datasets.

In this session, we will become familiar with a few human population datasets, specifically the US Census. In doing so, we will learn about [application programming interfaces (APIs)](https://en.wikipedia.org/wiki/API).

<p style="height:1pt"> </p>

<div class="boxhead2">
    Session Topics
</div>

<div class="boxtext2">
<ul class="a">
    <li> 📌 Introduction to <span class="codeb">US Census Data</span> </li>
    <ul class="b">
        <li> Census API </li>
        <li> Merging with shapefiles </li>
        <li> Plotting Data </li>
        <li> Area Aggregation </li>
    </ul>
</ul>
</div>

<hr style="border-top: 0.2px solid gray; margin-top: 12pt; margin-bottom: 0pt"></hr>

### Instructions
We will work through this notebook together. To run a cell, click on the cell and press "Shift" + "Enter" or click the "Run" button in the toolbar at the top. 

<p style="color:#408000; font-weight: bold"> 🐍 &nbsp; &nbsp; This symbol designates an important note about Python structure, syntax, or another quirk.  </p>

<p style="color:#008C96; font-weight: bold"> ▶️ &nbsp; &nbsp; This symbol designates a cell with code to be run.  </p>

<p style="color:#008C96; font-weight: bold"> ✏️ &nbsp; &nbsp; This symbol designates a partially coded cell with an example.  </p>

<hr style="border-top: 1px solid gray; margin-top: 24px; margin-bottom: 1px"></hr>

# US Census Data

<img src="./assets/income-censustract.jpg">

The US Census Bureau's "mission is to serve as the nation's leading provider of quality data about its people and economy." It collects, curates, and disseminates a wide range of demographic and economic data. Its mission is enshrined in the US Constitution.

The [Decennial Population and Housing Census](https://www.census.gov/programs-surveys/decennial-census.html#:~:text=The%20U.S.%20census%20counts%20each,of%20Representatives%20among%20the%20states.) is designed to be a complete count of the people residing in United States territory, whereas the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) is conducted annually and uses a spatially informed sampling strategy to estimate demographic data for the entire country. Both surveys are quite useful.

With the Census, it's important to remember that how we group people changes over time. For example, how we count Hispanic-identifying residents of the US has [changed through time](https://en.wikipedia.org/wiki/Race_and_ethnicity_in_the_United_States_census), as has how we count various other racial and ethnic groups. This makes tracking specific demographics through time quite difficult. Similarly, census boundaries can change through time, again making it difficult to measure fine-grained demographic change over time. But the Census makes a strong effort to document changes and inform users how demographic data is collected and aggregated over space and time.

Take a moment to check out the [graphic below](https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf).

<img src="./assets/cenus-spatial.png">

It's useful to familiarize yourself with the various spatial domains available from the Census. While it is quite easy to download US Census data, let's use the [Python Census API](https://pygis.io/docs/d_access_census.html) to check out US Census data right in our notebook. Note that this tutorial borrows from a great online resource: [PyGIS - Open Source Spatial Programming & Remote Sensing](https://pygis.io/docs/a_intro.html). Check it out!

### Importing Data from the Census API
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Some APIs are fully public, some require free credentials (e.g. a way to authenticate users), and some require a fee to use them. <br>

Please obtain a census API Key here: https://api.census.gov/data/key_signup.html <br> 

Note: I had trouble getting a key and had to try several times with different email addresses before one finally worked. We are going to try to use my Census API Key (see below).

The Census API allows us to read census data directly into memory. The Census, like many organizations, uses a common, yet complex, naming convention for variables. Some variables make sense (e.g. `NAME`) and others are alphanumeric codes. You can check out the [US Census API User Guide](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf) for details. For geographic information, the Census uses [FIPS](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code) codes, which are a standardized format for identifying geographic administrative areas in the US.

Here we are going to use the [ACS5](https://www.census.gov/data/developers/data-sets/acs-5year.html). From this we'll pull some demographic and socioeconomic data for the great state of Montana.
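
The `census` package wraps an HTTP API, so it can help to see roughly what a tract-level request looks like as a plain URL. The sketch below only builds the URL and does not send it. It assumes the query layout described in the Census API User Guide (`get`, `for`, `in`, and `key` parameters). The function name `build_acs5_tract_url` and the placeholder key are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# Base endpoint for the ACS 5-year estimates; the year is part of the path.
ACS5_ENDPOINT = "https://api.census.gov/data/{year}/acs/acs5"


def build_acs5_tract_url(fields, state_fips, year, api_key):
    """Build a request URL for every census tract in one state (illustrative helper)."""
    params = [
        ("get", ",".join(fields)),       # variables to return
        ("for", "tract:*"),              # all tracts ...
        ("in", f"state:{state_fips}"),   # ... within this state
        ("in", "county:*"),              # ... across all of its counties
        ("key", api_key),
    ]
    # Keep ':', '*', and ',' unescaped so the URL stays readable.
    query = urlencode(params, safe=":*,")
    return ACS5_ENDPOINT.format(year=year) + "?" + query


url = build_acs5_tract_url(
    fields=("NAME", "B01003_001E"),
    state_fips="30",            # Montana's state FIPS code
    year=2021,
    api_key="YOUR_KEY_HERE",    # placeholder, not a real key
)
print(url)
```

Pasting a URL like this into a browser (with a valid key) is a quick way to check which variables and geographies a request will return before pulling them into a DataFrame.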

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Import modules
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from census import Census
from us import states
import os
import rasterio
import sys

In [None]:
# Set API key 
c = Census('c2b7b1b0ee04a89666fd161e16d3e1dcec53d1b9') # My key ... please switch to your key.
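
Rather than pasting a key into the notebook, one option is to read it from an environment variable so it never ends up in a shared file. This is a minimal sketch: the variable name `CENSUS_API_KEY` and the helper `load_census_key` are assumptions chosen for illustration, not something the `census` package requires.

```python
import os


def load_census_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Census API key."
        )
    return key


# Usage, assuming the `Census` class imported earlier in this notebook:
# c = Census(load_census_key())
```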

Here are the variables we're going to pull from the API:
1. C17002_001E - Ratio of Income to Poverty Level in the Past 12 Months: Total (i.e. everyone for whom poverty status is determined)
1. C17002_002E - Ratio of Income to Poverty Level in the Past 12 Months: Under 0.50 (i.e. income below 50% of the poverty line)
1. C17002_003E - Ratio of Income to Poverty Level in the Past 12 Months: 0.50 to 0.99 (i.e. income between 50% and 99% of the poverty line)
1. B01003_001E - Total Population

You can check out the variables yourself [here](https://api.census.gov/data/2022/acs/acs5/variables.html). 

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Make the API request
mt_census = c.acs5.state_county_tract(fields = ('NAME', 'C17002_001E', 'C17002_002E', 'C17002_003E', 'B01003_001E'),
                                      state_fips = states.MT.fips, # You can change the state here
                                      county_fips = "*",
                                      tract = "*",
                                      year = 2021)  # You can change the year here

In [None]:
# Create a dataframe from the census data
mt_df = pd.DataFrame(mt_census)

# Show the dataframe
mt_df.head(2)

In [None]:
# Check out the shape

In [None]:
# Check out the data type

### Adding geography
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

Notice that the census data does not have any geographic information ascribed to it. We'll need to read in data directly from the Census's database of boundaries. `GeoPandas` can do this directly from the internet if the data is set up correctly. <br>

Note that I had to dive into the [Census Tiger Product Guide](https://www.census.gov/programs-surveys/geography/guidance/tiger-data-products-guide.html) to make sense of the Census shapefiles that are online, again because they use an alphanumeric coding, not common place names.

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Access shapefile of Montana census tracts
mt_tract = gpd.read_file("https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_30_tract.zip")
mt_tract.head()

In [None]:
# What crs is the tract in?

In [None]:
# What size is the tract file?

In [None]:
# What are the data types of each column?

In [None]:
# Reproject shapefile to UTM Zone 12N, which covers central Montana
# https://epsg.io/32612
mt_tract = mt_tract.to_crs(epsg = 32612)
print(mt_tract.crs)
mt_tract.head()
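
When picking a UTM projection, the zone can be worked out from a longitude: zones are 6 degrees wide and numbered eastward from 180°W. The sketch below applies that rule and maps the zone to a WGS 84 / UTM EPSG code. The helper name `utm_epsg_from_lonlat` is hypothetical, the coordinates are approximate values assumed for illustration, and the special-case zones around Norway and Svalbard are ignored.

```python
import math


def utm_epsg_from_lonlat(lon, lat):
    """Return the EPSG code of the WGS 84 / UTM zone containing a point."""
    # Zones are 6 degrees wide and numbered 1-60 starting at 180 degrees west.
    zone = int(math.floor((lon + 180) / 6)) + 1
    zone = min(max(zone, 1), 60)
    # EPSG 326xx are northern-hemisphere zones, 327xx are southern.
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Approximate center of Montana (assumed coordinates for illustration)
print(utm_epsg_from_lonlat(-110.0, 47.0))
```

GeoPandas also provides `estimate_utm_crs()` on a GeoDataFrame, which is usually the more convenient route when the data is already loaded.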

#### GEOID
Notice that `mt_tract` has a `GEOID` column but `mt_df` does not. So we need to combine the FIPS columns into a single GEOID that we can use to merge onto the shapefile. This is pretty easy with simple string concatenation.
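
A tract GEOID is just the state (2 digits), county (3 digits), and tract (6 digits) codes glued together, which is why the parts must stay as zero-padded strings. The sketch below shows the idea on single values. The helper `make_tract_geoid` and the example codes are illustrative only.

```python
def make_tract_geoid(state, county, tract):
    """Concatenate FIPS parts into an 11-character tract GEOID (illustrative helper)."""
    # Zero-pad each part: 2 digits for state, 3 for county, 6 for tract.
    return str(state).zfill(2) + str(county).zfill(3) + str(tract).zfill(6)


# Example FIPS parts (illustrative values)
print(make_tract_geoid("30", "001", "000100"))  # 30001000100

# If the codes were ever read as integers, the leading zeros are lost
# and must be restored before concatenating.
print(make_tract_geoid(30, 1, 100))             # 30001000100
```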

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Combine state, county, and tract columns together to create a new string and assign to new column
mt_df["GEOID"] = mt_df["state"] + mt_df["county"] + mt_df["tract"]

In [None]:
# Print head of dataframe
mt_df.head(2)

In [None]:
# Remove columns we won't need later
mt_df = mt_df.drop(columns = ["state", "county", "tract"])

# Show updated dataframe
mt_df.head(2)

#### Check the data types
It's always good to check the data types before you merge two DataFrames to make sure that they will merge correctly.

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Check column data types for census data
print("Column data types for census data:\n{}".format(mt_df.dtypes))

# Check column data types for census shapefile
print("\nColumn data types for census shapefile:\n{}".format(mt_tract.dtypes))

Now merge the two DataFrames.

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Join the attributes of the dataframes together
mt_merge = mt_tract.merge(mt_df, on = "GEOID")
mt_merge.head()

In [None]:
# What data types are the columns?

#### Poverty Rates
Now let's select a few columns to make our DataFrame easier to manage. Notice that we are using `.copy()`. Because `pd.DataFrame.copy()` produces a **deep copy by default**, we get a new object in memory, and pandas will not warn about modifying a copy of a slice when we add columns to it later.

To estimate poverty rates, we add up the number of people in the two categories below the poverty line (`C17002_002E` and `C17002_003E`) and divide by the total number of people in each census tract.
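
As a worked example of that calculation on a single tract, here is a minimal sketch using made-up counts. The function name `poverty_rate` is hypothetical, and returning `nan` for an unpopulated tract is a design choice assumed here, not something taken from the notebook.

```python
def poverty_rate(below_50, from_50_to_99, total_population):
    """Percent of residents whose income is below the poverty line."""
    if total_population == 0:
        # Unpopulated tracts have no meaningful rate.
        return float("nan")
    return (below_50 + from_50_to_99) / total_population * 100


# Made-up counts for a single tract: (100 + 150) / 1000 * 100
print(poverty_rate(below_50=100, from_50_to_99=150, total_population=1000))  # 25.0
```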

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Create new dataframe from select columns
mt_poverty_tract = mt_merge[["STATEFP", "COUNTYFP", "TRACTCE", 
                             "GEOID", "geometry", "C17002_001E", 
                             "C17002_002E", "C17002_003E", "B01003_001E"]].copy()
mt_poverty_tract.head()

In [None]:
# Get poverty rate and store values in new column
mt_poverty_tract["Poverty_Rate"] = (mt_poverty_tract["C17002_002E"] 
                                     + mt_poverty_tract["C17002_003E"]) / mt_poverty_tract["B01003_001E"] * 100

#### Now plot the data:

In [None]:
# Stylize plots (set the style before creating the figure so it takes effect)
plt.style.use('bmh')

# Create subplots
fig, ax = plt.subplots(1, 1, figsize = (20, 10))
mt_poverty_tract.plot(column = 'Poverty_Rate', ax = ax, cmap = 'RdPu', legend=True)

# Set title
ax.set_title('Poverty Rates (%) in Montana by Census Tract', fontdict = {'fontsize': '25', 'fontweight' : '3'})

# Hide grid lines
ax.grid(False)

# Hide axes ticks
ax.set_xticks([])
ax.set_yticks([])

# Set background color
ax.set_facecolor('white')

# show the plot
plt.show()

### Spatial Aggregation
<hr style="border-top: 0.2px solid gray; margin-top: 12px; margin-bottom: 1px"></hr>

We may want to aggregate data to the county level. This is very easy to do with `GeoPandas`, which provides a spatial [`dissolve`](https://geopandas.org/en/stable/docs/user_guide/aggregation_with_dissolve.html) method just like ArcGIS or QGIS. In short, `dissolve` uses categorical data to dissolve the boundaries between adjoining polygons with the same categorical value. In this case, we will use the `COUNTYFP` code to dissolve the census tract level data. We will need to pass an aggregation method (see [here](https://geopandas.org/en/stable/docs/user_guide/aggregation_with_dissolve.html)). In this case, we'll use `sum`.
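
The attribute side of a dissolve is essentially a group-by-and-sum. The sketch below mimics only that part in plain Python with made-up tract counts and leaves out the geometry merging that `dissolve` also performs. It also shows why the county poverty rate is recomputed from the summed counts rather than by adding up tract-level rates.

```python
from collections import defaultdict

# Made-up tract records: (county FIPS, people below poverty line, total population)
tracts = [
    ("001", 100, 1000),
    ("001", 400, 1000),
    ("003", 50, 400),
]

# Sum the counts of every tract that shares a county code
county_totals = defaultdict(lambda: [0, 0])
for county, below_poverty, population in tracts:
    county_totals[county][0] += below_poverty
    county_totals[county][1] += population

# Compute rates from the summed counts, not by adding tract-level rates
county_rates = {
    county: below / population * 100
    for county, (below, population) in county_totals.items()
}
print(county_rates)  # {'001': 25.0, '003': 12.5}
```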

<div class="run">
    ▶️ <b> Run the cells below. </b>
</div>

In [None]:
# Dissolve data
mt_poverty_tract_d = mt_merge[["STATEFP", "COUNTYFP", "TRACTCE", 
                             "GEOID", "geometry", "C17002_001E", 
                             "C17002_002E", "C17002_003E", "B01003_001E"]].copy()
mt_poverty_county = mt_poverty_tract_d.dissolve(by = 'COUNTYFP', aggfunc = 'sum')
mt_poverty_county.head()

In [None]:
# Use the dataframe's shape to see how many counties there are in Montana

In [None]:
# Get poverty rate and store values in new column
mt_poverty_county["Poverty_Rate"] = (mt_poverty_county["C17002_002E"] + 
                                     mt_poverty_county["C17002_003E"]) / mt_poverty_county["B01003_001E"] * 100

# Show dataframe
mt_poverty_county.head(2)

In [None]:
# Stylize plots (set the style before creating the figure so it takes effect)
plt.style.use('bmh')

# Create subplots
fig, ax = plt.subplots(1, 1, figsize = (20, 10))

# Plot data
mt_poverty_county.plot(column = "Poverty_Rate",
                       ax = ax,
                       cmap = "RdPu",
                       legend = True)

# Set title
ax.set_title('Poverty Rates (%) in Montana by County', fontdict = {'fontsize': '25', 'fontweight' : '3'})

# Hide grid lines
ax.grid(False)

# Hide axes ticks
ax.set_xticks([])
ax.set_yticks([])

# Set background color
ax.set_facecolor('white')

# show the plot
plt.show()