# Working with InfoUSA Race Data

Information on the head of household's race is stored in the variable `Ethnicity_Code_1`. Unfortunately, this variable does not come in a usable format and requires some cleaning: it contains a two-letter code for a specific ethnicity, and we have developed a standard mapping from these ethnicity codes to broader racial categories. The broader racial categories we use are:

| Code | Race | 
|:---: | :---|
| A | Asian| 
| B | Black or African American |
| H | Hispanic or Latino |
| M | Two or More Races |
| N | American Indian and Alaska Native |
| P | Native Hawaiian or Other Pacific Islander |
| W | White Alone, Not Hispanic or Latino |
| Z | Unknown/Other |

The lookup table for converting `Ethnicity_Code_1` to race is available in the file `IU_ethnicity_codes.csv`, and the conversion itself is a simple left join of the InfoUSA data onto that table.

In [2]:
# NOTE: This tutorial was written using R 3.6.3.
# For best results, use R >= 3.6.3.

# loading data cleaning packages
library(data.table)
library(dplyr)

# load the ethnicity-race-code index
race_codes <- fread("IU_ethnicity_codes.csv")

# take a peek at the ethnicity codes
head(race_codes)

description,subcode,code,race
Bangladesh,BD,A,A
Bhutanese,BT,A,A
Chinese,CN,F,A
Indonesian,ID,F,A
Indian,IN,A,A
Japanese,JP,F,A


In [3]:
# we only need the subcode and race
race_codes <- select(race_codes, subcode, race)

# load InfoUSA sample
iu <- fread("small_durm.csv",
            keepLeadingZeros = TRUE)

# there's an issue in the InfoUSA data where the Namibia code 
# was originally set to "NA", which ends up being read as a missing 
# value, so we need to convert NAs to the actual text "NA"
iu$Ethnicity_Code_1 <- ifelse(is.na(iu$Ethnicity_Code_1), "NA", iu$Ethnicity_Code_1)

# merging data and index
iu <- left_join(iu, race_codes, by = c("Ethnicity_Code_1" = "subcode"))

# here's the distribution of races in our sample
print("Race Distribution:")
table(iu$race)

[1] "Race Distribution:"



 A  B  H  W  Z 
 3 13  5 72  7 
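
The `"NA"` problem patched above can also be avoided at read time. The following is a minimal sketch using base R's `read.csv` on made-up rows (the toy values are assumptions, not InfoUSA records): setting `na.strings = ""` means only empty fields are treated as missing, so the literal text `NA` survives as a string. `fread` has an `na.strings` argument as well, but we have not verified that it behaves identically on the real files, so treat this as an illustration rather than a drop-in replacement.

```r
# Toy CSV with a literal "NA" ethnicity code and a surname "Na" (made-up rows)
csv_text <- "Ethnicity_Code_1,last_name_1
NA,Na
CN,Li
00,Smith"

# Read everything as character and treat only empty fields as missing
toy <- read.csv(text = csv_text,
                na.strings = "",
                colClasses = "character")

# The code "NA" is kept as text, and "00" keeps its leading zero
toy$Ethnicity_Code_1
```

With this approach the later `ifelse(is.na(...), "NA", ...)` fix-ups become unnecessary for codes that were genuinely the text `NA`, while truly empty fields are still flagged as missing.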

## Missing Data

The big problem with the `Ethnicity_Code_1` variable in the InfoUSA data is that a large number of individuals are marked with the missing/unknown ethnicity code `00`. In the earlier years, this leaves almost half of the observations without a race classification. The table below is based on a one percent sample of the block groups in the data.

| Year |  A   |  B   |   H   |  M   |  N   |  P   |   W   |   Z   |
|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|:-----:|:-----:|
| 2006 | 1.8% | 3.4% | 4.8%  | 0.0% | 0.1% | 0.0% | 44.0% | 45.9% |
| 2007 | 1.9% | 3.5% | 5.2%  | 0.0% | 0.1% | 0.0% | 45.4% | 43.9% |
| 2008 | 2.0% | 3.8% | 5.7%  | 0.0% | 0.1% | 0.0% | 48.5% | 39.9% |
| 2009 | 2.2% | 4.2% | 6.2%  | 0.0% | 0.1% | 0.0% | 52.0% | 35.3% |
| 2010 | 2.5% | 4.7% | 6.8%  | 0.0% | 0.1% | 0.0% | 56.9% | 29.0% |
| 2011 | 2.8% | 4.7% | 7.4%  | 0.0% | 0.1% | 0.0% | 59.7% | 25.2% |
| 2012 | 3.1% | 5.4% | 8.5%  | 0.0% | 0.1% | 0.0% | 65.0% | 17.9% |
| 2013 | 3.7% | 6.5% | 10.3% | 0.0% | 0.1% | 0.0% | 72.3% | 7.0%  |
| 2014 | 3.8% | 6.5% | 10.4% | 0.0% | 0.1% | 0.0% | 72.3% | 6.9%  |
| 2015 | 3.8% | 6.3% | 10.4% | 0.0% | 0.1% | 0.0% | 72.3% | 6.9%  |
| 2016 | 3.6% | 6.6% | 10.3% | 0.0% | 0.1% | 0.0% | 72.4% | 7.0%  |
| 2017 | 3.8% | 6.5% | 10.8% | 0.0% | 0.1% | 0.0% | 71.9% | 6.9%  |
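
A table of this shape can be produced with a row-wise proportion table. The sketch below uses a tiny made-up data frame with hypothetical columns `year` and `race`; it illustrates the calculation only and is not the code used to build the table above.

```r
# Toy data: hypothetical columns `year` and `race` (made-up values)
toy <- data.frame(
  year = c(2006, 2006, 2006, 2006, 2017, 2017, 2017, 2017),
  race = c("W", "Z", "Z", "B", "W", "W", "H", "Z")
)

# Cross-tabulate, then convert each row (year) to percentages
counts <- table(toy$year, toy$race)
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```

Using `margin = 1` normalizes within each year, so every row sums to 100 and the `Z` column reads directly as the share of unclassified households in that year.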

We have chosen to work around this missing data problem by using the package [`wru`](https://github.com/kosukeimai/wru), which implements the methods described in [Imai and Khanna (2016)](https://imai.fas.harvard.edu/research/files/race.pdf) to predict an individual's race from their last name and geographic location. 

## Using the `wru` Package 

To use `wru`, you will first need to acquire a Census API key from the [Census Bureau's key signup page](https://api.census.gov/data/key_signup.html). Once you have a key, you can download the required Census geographic data using the following parallelized script.

```
library(wru)
library(foreach)
library(doParallel)

key <- "INSERT YOUR KEY"
states <- c(state.abb, "DC")
codes <- data.frame(code = c(1:5, NA), race = c("W", "B", "H", "A", "Z", "Z"))

# Setting up parallels
numcores <- detectCores()
cl <- makeCluster(numcores)
registerDoParallel(cl)

# download census geographic data
census <- foreach(i = 1:51, .packages = c("data.table", "dplyr", "wru")) %dopar% {
  get_census_data(key, states[i], retry = 5)
}
stopCluster(cl)

tmp <- list()
for(i in 1:51){
  for (state in states) {
    if(!is.null(census[[i]][[state]]))
      tmp[[state]] <- census[[i]][[state]]
  }
}
census <- tmp
rm(tmp)
```

Alternatively, you can use the preloaded data in this repository. 
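
Because the download is slow, it can help to cache its result on disk and reuse it on later runs. The sketch below shows one way to do that; `load_or_build` and `fake_download` are hypothetical names introduced for illustration, and the stand-in builder simply returns a placeholder list instead of calling the Census API.

```r
# Hypothetical helper: read a cached object if it exists, otherwise build and save it
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

# Stand-in for the real download step; counts how often it is invoked
calls <- 0
fake_download <- function() {
  calls <<- calls + 1
  list(NC = "placeholder for NC census data")
}

path <- tempfile(fileext = ".rds")
first  <- load_or_build(path, fake_download)  # builds and writes the cache
second <- load_or_build(path, fake_download)  # reads the cache, no rebuild
```

In real use, `build` would be a function wrapping the download script, and `path` would point at a file such as `census.rds`.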

In [4]:
# loading census data
census <- readRDS("census.rds")

We will now use the `predict_race` function from `wru` to impute the race for each of the individuals in our data. The first argument of `predict_race` is `voter.file`, which is just a `data.frame` containing all the observations we want to predict race for. `voter.file` needs certain columns to follow specific naming conventions: 
- A column called `surname` containing the last names of each of the observations.
- A column called `state` containing the two-letter abbreviations of the states for each observation.
- A column called `CD` containing the two-digit FIPS code of the states for each observation.
- A column called `county` containing the three-digit FIPS code of the counties for each observation.
- A column called `tract` containing the six-digit FIPS code of the Census tracts for each observation.

`predict_race` also has an argument `census.geo`, which specifies the geographic level at which the Census data are merged onto the observations in `voter.file`. The values `"county"`, `"tract"`, `"block"`, and `"place"` are supported. In practice we use `"tract"`, which is geographically specific without relying on very small areas. To use `"block"`, `voter.file` would also need a column called `block` containing the four-digit Census block number.
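
Since a missing column or a FIPS code that has lost its leading zeros will only surface as an error or a silent mismatch later, it can be worth checking `voter.file` before calling `predict_race`. The helper below is a minimal sketch based solely on the column conventions listed above; `check_voter_file` is a hypothetical function, not part of `wru`, and it only covers the `"tract"` and `"block"` levels.

```r
# Hypothetical pre-flight check for the column conventions described above
check_voter_file <- function(voter.file, census.geo = "tract") {
  # Expected character widths of the FIPS columns
  widths <- c(CD = 2, county = 3, tract = 6)
  if (census.geo == "block") {
    widths <- c(widths, block = 4)
  } else if (census.geo != "tract") {
    stop("this sketch only covers census.geo = 'tract' or 'block'")
  }
  required <- c("surname", "state", names(widths))

  problems <- character(0)
  missing_cols <- setdiff(required, names(voter.file))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  # Flag codes with the wrong width (e.g. "63" instead of "063")
  for (col in intersect(names(widths), names(voter.file))) {
    bad <- nchar(as.character(voter.file[[col]])) != widths[[col]]
    if (any(bad, na.rm = TRUE)) {
      problems <- c(problems, paste0(col, ": ", sum(bad, na.rm = TRUE),
                                     " value(s) not ", widths[[col]],
                                     " characters wide"))
    }
  }
  problems
}

# Toy input (made-up rows): the second county code lost its leading zero
toy <- data.frame(surname = c("Smith", "Na"),
                  state   = c("NC", "NC"),
                  CD      = c("37", "37"),
                  county  = c("063", "63"),
                  tract   = c("001502", "001502"),
                  stringsAsFactors = FALSE)

problems_tract <- check_voter_file(toy, "tract")
problems_block <- check_voter_file(toy, "block")
```

An empty result means the checks passed. Width problems like the one in the toy data are the reason the InfoUSA sample is read with `keepLeadingZeros = TRUE`.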

In [5]:
# another issue in the InfoUSA data is that the last name "Na" has been 
# converted to a missing NA value so we quickly correct this 
iu$last_name_1 <- ifelse(is.na(iu$last_name_1), "NA", iu$last_name_1)

# renaming according to wru conventions
iu <- iu %>% 
    rename("surname" = "last_name_1", 
           "CD" = "GE_CENSUS_STATE_2010",
           "state" = "STATE",
           "county" = "GE_ALS_COUNTY_CODE_2010", 
           "tract" = "GE_ALS_CENSUS_TRACT_2010")

The output of `predict_race` is a new `data.frame` object with the same columns as the input `voter.file`, plus five new columns, `pred.whi`, `pred.bla`, `pred.his`, `pred.asi`, and `pred.oth`, containing the posterior probabilities for "W", "B", "H", "A", and "Z" respectively.

In [6]:
# load wru
library(wru)

# predict race, note we are not using age or gender
iu <- predict_race(voter.file = iu, 
             census.geo = "tract", 
             census.data = census, 
             age = FALSE, 
             sex = FALSE)

iu %>% select(contains("pred.")) %>% head()

"package 'wru' was built under R version 3.6.3"

[1] "Proceeding with Census geographic data at tract level..."
[1] "Using Census geographic data from provided census.data object..."


"Probabilities were imputed for 10 surnames that could not be matched to Census list."

[1] "State 1 of 11: NC"
[1] "State 2 of 11: MD"
[1] "State 3 of 11: NJ"
[1] "State 4 of 11: AZ"
[1] "State 5 of 11: CO"
[1] "State 6 of 11: MA"
[1] "State 7 of 11: TX"
[1] "State 8 of 11: NY"
[1] "State 9 of 11: MO"
[1] "State 10 of 11: IN"
[1] "State 11 of 11: OH"


pred.whi,pred.bla,pred.his,pred.asi,pred.oth
0.983106,0.0005493683,0.008515082,0.001383999,0.006445555
0.7418055,0.1667299455,0.058974333,0.005135683,0.027354495
0.4298657,0.4646243415,0.060784471,0.005513481,0.039211969
0.7161379,0.1557208313,0.102118771,0.00818062,0.017841928
0.9865986,0.003194178,0.006840289,0.00152443,0.001842551
0.5910814,0.3613312878,0.02010573,0.004186642,0.023294915


To make a prediction on race, we simply assign each observation the race with the maximal posterior probability (note that this method results in some loss of fidelity, as "M" and "N" get grouped into "Z" and we normally collapse "P" into "A"). 

In [7]:
codes <- data.frame(race_wru = c("W", "B", "H", "A", "Z", "Z"),
                    code = c(1:5, NA))

# predict race and classify by maximal column
iu <- iu %>% 
    mutate(code = as.numeric(apply(select(., pred.whi:pred.oth), 1, which.max))) %>% 
    left_join(codes, by = "code")
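
The same maximal-posterior rule can be written without a row-wise `apply` by using base R's `max.col`. The sketch below runs on a small made-up probability matrix (the values are illustrative, not model output) and also keeps the winning probability, which is handy for flagging low-confidence predictions.

```r
# Toy posterior probabilities (made-up values), one row per person
post <- matrix(c(0.98, 0.01, 0.005, 0.003, 0.002,
                 0.43, 0.46, 0.06,  0.01,  0.04,
                 0.10, 0.10, 0.70,  0.05,  0.05),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("pred.whi", "pred.bla", "pred.his",
                                       "pred.asi", "pred.oth")))

# Labels in the same order as the pred.* columns
labels <- c("W", "B", "H", "A", "Z")

# Index of the largest probability in each row; ties go to the first column
best <- max.col(post, ties.method = "first")
race_wru <- labels[best]

# Probability attached to the chosen category
top_prob <- post[cbind(seq_len(nrow(post)), best)]
```

Here the second row is assigned "B" with a probability of only 0.46, a reminder that the hard classification discards how close the call was.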

Lastly, we can report the new distribution of races as well as a confusion matrix of the `wru` races and the original InfoUSA races.

In [8]:
print("wru Race Distribution:")
table(iu$race_wru)

# collapse "N", "M", and "P", not strictly necessary for the given sample
iu$race <- ifelse(iu$race == "P", "A", iu$race)
iu$race <- ifelse(iu$race %in% c("M", "N"), "Z", iu$race)

print("Confusion Matrix:")
table(select(iu, race, race_wru))

[1] "wru Race Distribution:"



 A  B  H  W  Z 
 6 14  6 74  0 

[1] "Confusion Matrix:"


    race_wru
race  A  B  H  W  Z
   A  1  0  0  2  0
   B  0 10  0  3  0
   H  0  0  4  1  0
   W  3  4  0 65  0
   Z  2  0  2  3  0

It's a small sample, but looks pretty good!
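
To put a rough number on that impression, the agreement rate can be read off the confusion matrix. The sketch below re-enters the counts printed above by hand and computes the share of observations on the diagonal, both overall and restricted to rows where InfoUSA reported a known race (everything except "Z").

```r
# Confusion matrix copied from the output above (rows: InfoUSA, columns: wru)
lv <- c("A", "B", "H", "W", "Z")
cm <- matrix(c(1,  0, 0,  2, 0,
               0, 10, 0,  3, 0,
               0,  0, 4,  1, 0,
               3,  4, 0, 65, 0,
               2,  0, 2,  3, 0),
             nrow = 5, byrow = TRUE,
             dimnames = list(race = lv, race_wru = lv))

# Share of all observations where the two sources agree
overall <- sum(diag(cm)) / sum(cm)

# Same share, ignoring observations InfoUSA marked as unknown ("Z")
known <- rownames(cm) != "Z"
agree_known <- sum(diag(cm)[known]) / sum(cm[known, ])

# Per-category agreement, relative to the InfoUSA label
per_class <- diag(cm) / rowSums(cm)
```

On this sample the two sources agree on 80 of 100 observations overall and on 80 of the 93 observations with a known InfoUSA race. Agreement is not the same as accuracy, since the InfoUSA labels are themselves imperfect.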