# Downloading Census Data Using `tidycensus`

`tidycensus` is an outstanding package that makes it much simpler to pull U.S. Census Bureau data into R. This lab introduces you to the core functions of the package. The goal of this lab is to generate measures of race and ethnicity by state and county. We'll use the American Community Survey (ACS) 5-year estimates for state-level data, and we'll use the 2020 decennial census for county-level data. Because the ACS is survey-based, its estimates come with margins of error, which can be large for small counties; the decennial census is a full count.

Let's load `tidyverse` and `tidycensus` using `pacman`.

In [None]:
# load necessary packages - install only if needed

#install.packages('pacman')

pacman::p_load(tidyverse, tidycensus)

## `load_variables()`

The `load_variables()` function provides access to lists of variables from Census products. *It is not easy to identify which variables contain the information you need.* I recommend using this function to examine potential data sources. You will need to set the arguments in the function to define which dataset you want to pull variables from. Let's examine how to do this.

In [None]:
# write the output to a new dataframe:

statevars <- load_variables(year = 2023, # set the year of the data
                            dataset = "acs5" # set the file - these are the ACS 5-year estimates (2019-2023)
                            )

# find variables you want
View(statevars)

countyvars <- load_variables(year = 2020, # for our decennial census data
                             dataset = "dp" # "Demographic Profile"
                             )

View(countyvars)
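
Scrolling through `View()` works, but these variable tables run to thousands of rows, so filtering on the text columns is often faster. The sketch below uses a small hand-made tibble (`toyvars`, a hypothetical stand-in with illustrative rows) that has the same `name`, `label`, and `concept` columns that `load_variables()` returns, so it runs without an API call. The same `filter()` call should work on `statevars`.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the output of load_variables(); rows are illustrative only
toyvars <- tibble(
  name    = c("B03002_001", "B03002_003", "B03002_012", "B19013_001"),
  label   = c("Estimate!!Total:",
              "Estimate!!Total:!!Not Hispanic or Latino:!!White alone",
              "Estimate!!Total:!!Hispanic or Latino:",
              "Estimate!!Median household income"),
  concept = c(rep("Hispanic or Latino Origin by Race", 3),
              "Median Household Income")
)

# keep rows whose concept mentions "hispanic", ignoring case
hispvars <- toyvars |>
  filter(str_detect(concept, regex("hispanic", ignore_case = TRUE)))

hispvars
```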

Finding the right variables is difficult! Be sure to figure out whether you are identifying *counts* or *statistics*. Count data is the norm, and represents the number of people in the population with some characteristic. Some Census products will provide calculated statistics (percentages, medians, averages, etc.). *If you are using count data, you will need to calculate your statistics. We almost never want only counts.*

The Census Bureau nests variables. For example, if you were looking for educational attainment, the Bureau will first provide the total number of people 25 and older. That's the relevant population. Then, it might provide the number of people with each level of educational attainment. And finally, it might provide those numbers broken down by some other variable, like race or sex. You need to understand the nesting structure of the data in order to grab the right measures of your concepts.
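
In the ACS variable tables, this nesting shows up in the `label` column, where the levels are separated by `!!`. As a rough sketch (the labels below are typed by hand for illustration, not downloaded), splitting on `!!` makes the hierarchy visible:

```r
# Hand-typed labels in the style of ACS table B03002 (illustrative only)
labels <- c("Estimate!!Total:",
            "Estimate!!Total:!!Not Hispanic or Latino:",
            "Estimate!!Total:!!Not Hispanic or Latino:!!White alone")

# split each label into its nesting levels
levels_list <- strsplit(labels, "!!", fixed = TRUE)

# number of levels below "Estimate":
# 1 = total, 2 = subgroup of the total, 3 = subgroup of that subgroup
depth <- lengths(levels_list) - 1
depth
```

A variable at depth 3 is a count within the depth-2 group above it, which is itself a count within the depth-1 total.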

The Census Bureau measures race and ethnicity as distinct concepts. Hispanic or Latino is considered an ethnicity, not a race. The way political scientists typically measure race and ethnicity is to code everyone who says their ethnicity is Hispanic or Latino as Latino, and then to code non-Latinos based on their race. Of course, this is an overly simplistic way of capturing complex identities.
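
To make that coding rule concrete, here is a minimal sketch on made-up individual-level data. The tibble `people` and its column names are hypothetical; the Census tables we download below are already aggregated, so this only illustrates the logic.

```r
library(dplyr)

# Made-up respondents; column names are hypothetical
people <- tibble(
  hispanic = c(TRUE, FALSE, FALSE, TRUE),
  race     = c("White", "White", "Black", "Asian")
)

# anyone reporting Hispanic/Latino ethnicity is coded Latino;
# everyone else is coded by their reported race
people <- people |>
  mutate(raceeth = if_else(hispanic, "Latino", race))

people
```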

Below, I've pulled out the variables for white, black, Latino, and Asian populations. The white, black, and Asian variables come from the "Not Hispanic or Latino" sub-population. In the ACS, these are measured as counts, so we also need to record the total population from that variable group. Notice that the demographic profile from the 2020 Census calculates percentages for us and stores them in variables ending with a "P".

In [None]:
# since it is difficult to remember these variable names,
#  assign the names you want to a named vector that we can
#  use later.
vars.st <- c(totalpop = "B03002_001",
            white = "B03002_003",
            black = "B03002_004",
            latino = "B03002_012",
            asian = "B03002_006")

vars.cy <- c(latino = "DP1_0096P",
            white = "DP1_0105P",
            black = "DP1_0106P",
            asian = "DP1_0108P")

## `get_acs()` and `get_decennial()` functions

These functions allow us to pull in Census data using the Census Bureau's API. The functions are similar but not exactly the same. You will need to identify the *level of analysis* you want; place that information in the `geography` parameter. Both functions require you to provide information on the year and data source. The source is specified in the `survey` parameter in `get_acs()` and the `sumfile` parameter in `get_decennial()`. `sumfile` is a shortened version of "summary file", which is how the Census Bureau has historically released its decennial data. The `geometry` parameter will let you download geographic information for mapping. You won't be doing any mapping in this course, so you can set this argument to `FALSE`.

In [None]:
states <- get_acs(geography = "state",
                  year = 2023, 
                  survey = "acs5",
                  variables = vars.st,      # calls up our stored list
                  geometry = FALSE )        # you typically don't need the geometry data

counties <- get_decennial(geography = "county",
                  year = 2020, 
                  sumfile = "dp",
                  variables = vars.cy,      
                  geometry = FALSE)

Let's take a look.

In [None]:
head(states)
head(counties)

Notice that in both cases, `tidycensus` provides different variables as *rows* instead of *columns*. As we've been doing throughout the semester, we can reshape these rows into new columns using `pivot_wider`.

In [None]:
states.wide <- states |> 
                    select(-moe) |>  # drops unneeded margin of error column from the ACS survey
                    pivot_wider(names_from = "variable",
                               values_from = "estimate")

counties.wide <- counties |> 
                    pivot_wider(names_from = "variable",
                               values_from = "value")
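
To see why the `moe` column is dropped first, consider this sketch on a made-up long table shaped like the output of `get_acs()` (the numbers are invented). If `moe` stays in, `pivot_wider` treats it as an identifying column, and the rows for one state do not collapse into a single row.

```r
library(dplyr)
library(tidyr)

# Made-up long data with the same columns get_acs() returns
toy <- tibble(
  GEOID    = c("01", "01", "02", "02"),
  NAME     = c("Alabama", "Alabama", "Alaska", "Alaska"),
  variable = c("totalpop", "latino", "totalpop", "latino"),
  estimate = c(1000, 50, 400, 30),
  moe      = c(10, 5, 8, 4)
)

# dropping moe first: one row per state
toy.wide <- toy |>
  select(-moe) |>
  pivot_wider(names_from = "variable", values_from = "estimate")

# keeping moe: still four rows, padded with NA
toy.kept <- toy |>
  pivot_wider(names_from = "variable", values_from = "estimate")

nrow(toy.wide)
nrow(toy.kept)
```

If you want to skip the reshaping step, `get_acs()` also accepts `output = "wide"`, which returns one row per geography with `E` and `M` suffixes on the estimate and margin-of-error columns.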


## Cleaning and processing data using loops and string functions

In order to use our state-level estimates, we need to create a percentage. Again, don't use count data to measure demographic characteristics. We nearly always want percentages.

In [None]:
states.wide <- states.wide |> 
                mutate(latino = 100 * latino / totalpop, 
                       white = 100 * white / totalpop,
                       black = 100 * black / totalpop,
                       asian = 100 * asian / totalpop)
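
The same calculation can be written more compactly with `across()`, which applies one formula to several columns. This is only a sketch of an alternative on made-up numbers (`toy.wide` is a hypothetical stand-in for `states.wide`); the explicit version above does the same thing.

```r
library(dplyr)

# Made-up counts for two fictional states
toy.wide <- tibble(
  NAME     = c("State A", "State B"),
  totalpop = c(1000, 400),
  latino   = c(50, 100),
  white    = c(700, 200)
)

# divide each selected count column by totalpop and scale to a percentage;
# totalpop itself is not in the selection, so it stays a count
toy.pct <- toy.wide |>
  mutate(across(c(latino, white), ~ 100 * .x / totalpop))

toy.pct
```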

In [None]:
ggplot() +
    geom_density(data = counties.wide, aes(x = latino, y = after_stat(density)),
                fill = "navy", alpha = .3) +
    geom_density(data = states.wide, aes(x = latino, y = after_stat(density)),
                fill = "#282727", alpha = .5) +
    labs(x = "Percent Latino/a",
         y = "Density",
         title = "Distribution of Percent Latino/a at County and State Levels") +
    theme_minimal() +
    # label colors match the fill of the corresponding density
    annotate("text", x=10, y=.10, label = "States", color = "#282727") +
    annotate("text", x=25, y=.025, label = "Counties", color = "navy")
    

Now let's clean this up further and merge our county- and state-level data together into one dataframe. Before we merge, we need to deal with the fact that variables have the same names in both dataframes. Let's rename the ones in the `states.wide` tibble using a simple loop.

Loops repeat some action over multiple objects. The code below cycles through our race/ethnicity variables, stores a new variable name in object `q`, and then renames our existing variables.
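
Before applying a loop to the data, it can help to watch what happens on each pass. This small sketch only builds the new names and collects them in a vector (`newnames` is a hypothetical helper, not part of the lab's pipeline):

```r
newnames <- character(0)   # start with an empty character vector

for (i in c("latino", "white")) {
  # on the first pass i is "latino", on the second pass i is "white"
  q <- paste(i, "st", sep = ".")
  newnames <- c(newnames, q)
}

newnames
```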

In [None]:
# for loops work like this: 
# every time R sees an i, it will insert the current
# item from the vector. It will repeat until every
# item has been used. 

for(i in c("latino","white", "black", "asian")){ 
    q <- paste(i, "st", sep = ".")
    states.wide <- states.wide |> rename(!!q := .data[[i]])
}

A full explanation of what is happening inside `rename` above is more advanced than we need, but this is what R requires in order to recognize our placeholders as variable names. On the left-hand side of the expression, we put two exclamation marks (`!!`) in front of `q`. On the right-hand side, we get the existing column using `.data[[i]]`. Remember, `i` will be replaced with our variable name. One last wrinkle is that we need to use `:=` instead of `=` in order for this process to work.
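
If the `!!` and `:=` syntax feels opaque, one alternative that avoids the loop is `rename_with()`, which applies a function to the names of selected columns. This is a sketch on a made-up one-row tibble (`toy.wide` is hypothetical), not a replacement for the loop above:

```r
library(dplyr)

# Made-up stand-in for states.wide
toy.wide <- tibble(GEOID = "01", latino = 5, white = 70)

# append ".st" to the names of the selected columns only
toy.renamed <- toy.wide |>
  rename_with(~ paste(.x, "st", sep = "."), .cols = c(latino, white))

names(toy.renamed)
```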

One final step is needed before we can merge. Currently, there is no variable for state in the county data. The state information is buried in the `GEOID` and `NAME` variables. The easiest way to solve this issue is to pull out the state identifying codes, called "FIPS" codes, from the `GEOID` variable. We can use the `substr()` function to manipulate the character variable and grab the first two characters of the code.

In [None]:
# substr(x,start,stop) will use some character string x
# and then go from the starting character to the stopping character 
substr("1a2b3c", 2, 4)

# now let's use real data
counties.wide <- counties.wide |> 
                    mutate(stfips = substr(GEOID,1,2))
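
This works because county `GEOID` values are five-character strings: a two-digit state FIPS code followed by a three-digit county code. The sketch below shows why it matters that the code is stored as text rather than as a number (the object names are hypothetical):

```r
# Los Angeles County, California: state FIPS "06" + county FIPS "037"
geoid_chr <- "06037"
st_part <- substr(geoid_chr, 1, 2)   # state part
cy_part <- substr(geoid_chr, 3, 5)   # county part

# if the same code were stored as a number, the leading zero would be lost
geoid_num <- 6037
st_wrong <- substr(as.character(geoid_num), 1, 2)

st_part
st_wrong
```

Here `st_part` is `"06"`, while `st_wrong` is `"60"`, which is not California's code.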

Now we can merge using `left_join`.

In [None]:
comb <- left_join(counties.wide,
                  states.wide,
                  by = join_by(stfips == GEOID))
head(comb)
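
After any merge, it is worth checking that no rows were lost or left unmatched. The sketch below runs that check on made-up county and state tibbles (all names and numbers are invented); the same two checks could be run on `counties.wide` and `states.wide`.

```r
library(dplyr)

# Made-up county and state data
toy.counties <- tibble(
  GEOID  = c("01001", "01003", "02013"),
  latino = c(3.0, 5.5, 12.0),
  stfips = substr(GEOID, 1, 2)
)

toy.states <- tibble(
  GEOID     = c("01", "02"),
  latino.st = c(4.8, 7.5)
)

toy.comb <- left_join(toy.counties, toy.states,
                      by = join_by(stfips == GEOID))

# check 1: a left join should keep every county exactly once
nrow(toy.comb) == nrow(toy.counties)

# check 2: counties with no matching state (should be zero rows)
unmatched <- anti_join(toy.counties, toy.states,
                       by = join_by(stfips == GEOID))
nrow(unmatched)
```

In the real data, both tibbles also carry a `NAME` column, so the merged result contains `NAME.x` (the county name) and `NAME.y` (the state name).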

Now, let's create a new variable that equals the difference between the county Percent Latino and the state's Percent Latino, and then graph the distribution.

In [None]:
comb <- comb |> mutate(diff = latino - latino.st)

ggplot(comb, aes(x = diff)) +
    geom_density(fill = "navy", alpha = .3) +
    labs(x = "Difference",
         y = "Density",
         title = "Difference between County and State Percent Latino/a") +
    theme_minimal() 