# Data 4 Black Lives - COVID-19 Case/Death Disparities

Objective: Extract COVID-19 cases and deaths for each geographic location, both overall and for Black/African-Americans only.

Data sources for three locations (San Diego, California; Florida; and New York City) are provided as tables embedded in PDFs. Tools such as the `tabulizer` R package can extract tables from PDFs. Pinpointing where a table sits in the document can be tricky, but it is feasible.

## Install packages/software

Source: https://github.com/hannarud/r-best-practices/wiki/Installing-RJava-(Ubuntu)

In [0]:
system('sudo apt-get install r-cran-rjava', intern=T)

In [0]:
system('sudo apt-get install libgdal-dev libpro', intern=T)

“running command 'sudo apt-get install libgdal-dev libpro' had status 100”


In [0]:
library(rJava)

In [0]:
if (!require("remotes")) {
    install.packages("remotes")
}

# on 64-bit Windows
#remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")

# elsewhere
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

Loading required package: remotes

Downloading GitHub repo ropensci/tabulizerjars@master

✔  checking for file ‘/tmp/Rtmpwn46c8/remotes796b88f8d8/ropensci-tabulizerjars-d1924e0/DESCRIPTION’
─  preparing ‘tabulizerjars’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘tabulizerjars_1.0.1.tar.gz’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Downloading GitHub repo ropensci/tabulizer@master



png (NA -> 0.1-7) [CRAN]


Installing 1 packages: png

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

✔  checking for file ‘/tmp/Rtmpwn46c8/remotes7950f01c9/ropensci-tabulizer-fa4dff5/DESCRIPTION’
─  preparing ‘tabulizer’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Removed empty directory ‘tabulizer/docs’
─  building ‘tabulizer_0.2.2.tar.gz’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [0]:
## Helper function to run code on command line
run = function(txt) system(txt, intern=T)

In [0]:
# Source:  https://askubuntu.com/questions/221962/how-can-i-extract-a-page-range-a-part-of-a-pdf

In [0]:
## Install software that can extract subset of PDF pages (needed for Florida)
run('sudo apt-get install qpdf')

## Load packages

In [0]:
library(tabulizer)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.3
✔ tibble  3.0.0     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Create directories

In [0]:
home_dir = '/content'

In [0]:
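## Create the data directories under home_dir so the later setwd() calls always find them
for (d in c('san_diego', 'florida', 'nyc')) {
    dir.create(file.path(home_dir, 'data', d), recursive = TRUE, showWarnings = FALSE)
}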

## California (San Diego)

In [0]:
## Set the working directory for San Diego
san_diego_dir = paste0(home_dir, '/data/san_diego')
setwd(san_diego_dir)
getwd()

In [0]:
## Download the entire San Diego cases and deaths
download.file('https://www.sandiegocounty.gov/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Race%20and%20Ethnicity%20Summary.pdf', 'sd_cases.pdf')
download.file('https://www.sandiegocounty.gov/content/dam/sdc/hhsa/programs/phs/Epidemiology/COVID-19%20Deaths%20by%20Demographics.pdf', 'sd_deaths.pdf')

In [0]:
## Show downloaded files
list.files()

In [0]:
## Extracting cases from the PDF
sd_cases_raw <- extract_tables('sd_cases.pdf', encoding="UTF-8")[[1]]
sd_cases_raw

0,1,2,3
COVID-19 Case Summary,,San Diego County Residents,
Total Positives,,4020,
,,% of Total with Known,
Race and Ethnicity,Count,"Race/Ethnicity (N=3,221)","Rate per 100,000*"
Hispanic or Latino,1765,54.8%,153.4
White,948,29.4%,62.1
Black or African American,147,4.6%,99.5
Asian,283,8.8%,77.7
Pacific Islander,34,1.1%,231.1
American Indian,8,0.2%,


In [0]:
## Get race/ethnicity breakdown from cases data
id_start = which(sd_cases_raw[,1] == 'Race and Ethnicity')
sd_cases = sd_cases_raw[id_start:nrow(sd_cases_raw), 1:2]
colnames(sd_cases) = sd_cases[1,]
sd_cases = sd_cases[-1,] %>% data.frame(stringsAsFactors = F)
sd_cases$Count = gsub(',', '', sd_cases$Count) %>% as.numeric
sd_cases

Race.and.Ethnicity,Count
<chr>,<dbl>
Hispanic or Latino,1765
White,948
Black or African American,147
Asian,283
Pacific Islander,34
American Indian,8
Multiple Race,36
Race/Ethnicity Unknown,799


In [0]:
## Extracting deaths from the PDF
sd_deaths_raw <- extract_tables('sd_deaths.pdf', encoding="UTF-8")[[1]]
sd_deaths_raw

0,1
,San Diego County Residents
Total Deaths,144
% of Deaths with Known Selected Characteristics,Count
,Demographics
Age Groups,
0-9 years,0 0.0%
10-19 years,0 0.0%
20-29 years,2 1.4%
30-39 years,2 1.4%
40-49 years,4 2.8%


In [0]:
## Get race/ethnicity breakdown from deaths data
id_start = which(sd_deaths_raw[,1] == 'Hispanic or Latino'); id_start
sd_deaths = sd_deaths_raw[id_start:nrow(sd_deaths_raw), ]
colnames(sd_deaths) = c('Race_Ethnicity', 'Values')
sd_deaths = sd_deaths %>% data.frame(stringsAsFactors = F)
sd_deaths = sd_deaths %>% 
    separate(Values, c('Count', 'Percent'), sep=' ') %>%
    mutate(Count = as.numeric(Count))
sd_deaths

“Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [8].”


Race_Ethnicity,Count,Percent
<chr>,<dbl>,<chr>
Hispanic or Latino,51,37.8%
White,67,49.6%
Black or African American,4,3.0%
Asian,10,7.4%
Pacific Islander,1,0.7%
American Indian,1,0.7%
Multiple Race,1,0.7%
Race/Ethnicity Unknown,9,


In [0]:
## Total and AA percentage of cases
sd_total_cases = sd_cases$Count %>% sum

sd_aa_cases = sd_cases %>%
    filter(Race.and.Ethnicity == 'Black or African American') %>%
    select(Count) %>%
    .[1,1]

sd_aa_cases_pct = round(100 * sd_aa_cases / sd_total_cases, 2) #; sd_aa_cases_pct

In [0]:
## Total and AA percentage of deaths
sd_total_deaths = sd_deaths$Count %>% sum #; sd_total_deaths

sd_aa_deaths = sd_deaths %>%
    filter(Race_Ethnicity == 'Black or African American') %>%
    select(Count) %>%
    .[1,1] #; sd_aa_deaths

sd_aa_deaths_pct = round(100 * sd_aa_deaths / sd_total_deaths, 2) #; sd_aa_deaths_pct

In [0]:
## Summary table for San Diego
sd_date = Sys.Date() %>% format('%-m/%-d/%Y')

output_sd = tibble(
    `Location` = 'California - San Diego',
    `Date Published` = sd_date, 
    `Total Cases` = sd_total_cases,
    `Total Deaths` = sd_total_deaths,
    `Pct Cases Black/AA` = sd_aa_cases_pct,
    `Pct Deaths Black/AA` = sd_aa_deaths_pct)

output_sd

Location,Date Published,Total Cases,Total Deaths,Pct Cases Black/AA,Pct Deaths Black/AA
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
California - San Diego,5/5/2020,4020,144,3.66,2.78


## Florida

In [0]:
## Set the working directory for Florida
florida_dir = paste0(home_dir, '/data/florida')
setwd(florida_dir)
getwd()

In [0]:
## Download the entire Florida document
## TO DO: Crawl archive page to automatically detect the latest report
download.file("https://floridadisaster.org/globalassets/covid19/dailies/covid-19-data---daily-report-2020-05-04-0943.pdf", 'florida.pdf')

In [0]:
## Can use this alternative downloading mechanism if needed
#system("wget -O florida.pdf 'https://floridadisaster.org/globalassets/covid19/dailies/covid-19-data---daily-report-2020-05-03-1007.pdf'", intern=T)

In [0]:
## Get the 3rd page of the Florida summary document, as it contains the Race/Ethnicity breakdown
run('qpdf --empty --pages florida.pdf 3 -- florida_page_3.pdf')
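If the race/ethnicity table moves to a different page in a later report, the page number in the command above has to change. The sketch below builds the same `qpdf` command from its parts. `qpdf_pages_cmd` is a hypothetical helper, not part of `qpdf` or of this notebook.

```r
# Hypothetical helper: build the qpdf command that copies a page range
# from `input` into a new file `output`. File names are shell-quoted.
qpdf_pages_cmd <- function(input, pages, output) {
  sprintf(
    "qpdf --empty --pages %s %s -- %s",
    shQuote(input, type = "sh"), pages, shQuote(output, type = "sh")
  )
}

cmd <- qpdf_pages_cmd("florida.pdf", "3", "florida_page_3.pdf")
cmd
```

The resulting string could then be passed to the `run()` helper defined earlier, e.g. `run(cmd)`.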

In [0]:
## Show the original PDF along with the document that only contains page 3
list.files()

In [0]:
## Use function in tabulizer package to detect the page dimensions. 
## This is required in order to select the area of the page that contains the Race/Ethnicity table (second half of the page)
## Otherwise, the function will fail to find the specific table we want.
page_dims = get_page_dims('florida_page_3.pdf')
page_width = page_dims[[1]][1]
page_height = page_dims[[1]][2]
c(page_width, page_height)

In [0]:
# Area must be specified as c(top,left,bottom,right) 

# Source: https://stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates

## TO DO: Check whether the race/ethnicity table is always located in this same region every day. 
##        Can check archives and continue to monitor new reports.

area1 = c(page_height, 1, 0.55*page_height, page_width)

## Extract the data from page 3 of the Florida report
florida_raw <- extract_tables('florida_page_3.pdf', encoding="UTF-8", area=list(area1))[[1]]
florida_raw

0,1,2,3,4,5,6,7,8
Age group,Cases,,Hospitalizations,Deaths,,Gender,Cases,
0-4 years,195,1%,12 0%,0,0%,Male,17969,50%
5-14 years,420,1%,10 0%,0,0%,Female,17963,50%
15-24 years,2723,8%,80 1%,0,0%,Unknown,37,0%
25-34 years,5314,15%,287 5%,12,1%,Total,35969,
35-44 years,5443,15%,521 9%,27,2%,,,
45-54 years,6476,18%,800 13%,56,4%,,,
55-64 years,6138,17%,"1,060 17%",143,10%,,,
65-74 years,4484,12%,"1,343 22%",321,23%,,,
75-84 years,2926,8%,"1,211 20%",423,30%,,,


In [0]:
## Get Race/Ethnicity cases and deaths only

id_start = which(florida_raw[,1] == 'Race, ethnicity')
florida = florida_raw[id_start:nrow(florida_raw), ]
colnames(florida) = c('Race_Ethnicity', 'Cases_Count', 'Cases_Percent', 'Hospitalizations', 
    'Deaths_Count', 'Deaths_Percent', 'x1', 'x2', 'x3')
florida = florida[-1,] %>% data.frame(stringsAsFactors = F)
florida = florida %>% 
    mutate(
        Cases_Count = gsub(',', '', Cases_Count) %>% as.numeric,
        Deaths_Count = gsub(',', '', Deaths_Count) %>% as.numeric
    ) %>%
    select(Race_Ethnicity, Cases_Count, Deaths_Count) %>%
    filter(Race_Ethnicity %in% c('White', 'Black', 'Other', 'Unknown race'))

florida

Race_Ethnicity,Cases_Count,Deaths_Count
<chr>,<dbl>,<dbl>
White,18347,936
Black,6697,304
Other,3032,87
Unknown race,7893,72


In [0]:
## Calculations needed for final output

fl_total_cases = florida$Cases_Count %>% sum; fl_total_cases
fl_total_deaths = florida$Deaths_Count %>% sum; fl_total_deaths

fl_aa_cases_pct = round(100 * florida$Cases_Count[florida$Race_Ethnicity == 'Black'] / fl_total_cases, 2); fl_aa_cases_pct
fl_aa_deaths_pct = round(100 * florida$Deaths_Count[florida$Race_Ethnicity == 'Black'] / fl_total_deaths, 2); fl_aa_deaths_pct

In [0]:
## Summary of Florida results
fl_date = Sys.Date() %>% format('%-m/%-d/%Y')

output_fl = tibble(
    `Location` = 'Florida',
    `Date Published` = fl_date, 
    `Total Cases` = fl_total_cases,
    `Total Deaths` = fl_total_deaths,
    `Pct Cases Black/AA` = fl_aa_cases_pct,
    `Pct Deaths Black/AA` = fl_aa_deaths_pct)

output_fl

Location,Date Published,Total Cases,Total Deaths,Pct Cases Black/AA,Pct Deaths Black/AA
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Florida,5/5/2020,35969,1399,18.62,21.73
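Because `fl_date` is taken from the run date, it can differ from the date of the report that was actually downloaded. A minimal sketch of reading the date from the file name instead is shown below. It assumes the `YYYY-MM-DD` stamp in the URL reflects the publication date, which has not been verified here, and `fl_date_from_name` is a hypothetical variable name.

```r
report_url <- "https://floridadisaster.org/globalassets/covid19/dailies/covid-19-data---daily-report-2020-05-04-0943.pdf"

# Pull the first YYYY-MM-DD pattern out of the file name.
date_txt <- regmatches(report_url, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", report_url))
report_date <- as.Date(date_txt, format = "%Y-%m-%d")

# Build m/d/yyyy without relying on the platform-specific '%-m' flag.
fl_date_from_name <- paste(
  as.integer(format(report_date, "%m")),
  as.integer(format(report_date, "%d")),
  format(report_date, "%Y"),
  sep = "/"
)
fl_date_from_name
```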


## New York City

In [0]:
## Set the working directory for NYC
nyc_dir = paste0(home_dir, '/data/nyc')
setwd(nyc_dir)
getwd()

In [0]:
## Download NYC deaths
## Note: No cases data available for NYC
download.file('https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-deaths-race-ethnicity-04082020-1.pdf', 'nyc_april_6.pdf')
download.file('https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-deaths-race-ethnicity-04162020-1.pdf', 'nyc_april_16.pdf')

In [0]:
## Show the downloaded file(s)
list.files()

In [0]:
## Extract death data from NYC PDF
## TO DO: Update analysis with April 16 PDF results
nyc_deaths_raw = extract_tables('nyc_april_6.pdf', encoding="UTF-8")[[1]]
nyc_deaths_raw

0,1,2,3,4,5,6,7
"Death Rate Race/Ethnicity Population Number of (2018 Deaths Age-Adjusted % of Total with Per 100,000 Race/Ethnicity Known",estimate),,,"Crude Death Rate Per 100,000",,,
,,,,Population,,,Population
All Hispanic521 33.5,2449450,,,21.3,,,22.8
"Non-Hispanic/Latino: Black, African 428 27.5",1849077,,,23.1,,,19.8
American,,,,,,,
Non-Hispanic/Latino: White 424 27.3,2694258,,,15.7,,,10.2
Non-Hispanic/Latino: Asian 112 7.2,1231790,,,9.1,,,8.4
Non-Hispanic/Latino: Other 70 4.5,174173,,,40.2,,,59.5
"Total with Known Race/Ethnicity 1,555 *",*,,,*,,,*
Total Unknown Race/Ethnicity 917 *,*,,,*,,,*


In [0]:
## First column contains all relevant data. 
## Extract as a vector of strings for later parsing
nyc_deaths_txt = nyc_deaths_raw[,1]; nyc_deaths_txt

In [0]:
## NYC total deaths
nyc_total_deaths = nyc_deaths_txt[grep('^Total \\d', nyc_deaths_txt)] %>% gsub('Total| |\\*|,', '', .) %>% as.numeric; nyc_total_deaths

In [0]:
## NYC AA percentage of deaths
nyc_aa_deaths = nyc_deaths_txt[grep('^Non-Hispanic/Latino: Black, African', nyc_deaths_txt)] %>% 
    gsub('Non-Hispanic/Latino: Black, African|\\*|,', '', .) %>% trimws() %>% strsplit(., ' ')
## Keep the first token (the death count); the second token is the percentage
nyc_aa_deaths = nyc_aa_deaths[[1]][1] %>% as.numeric
nyc_aa_deaths_pct = round(100 * nyc_aa_deaths / nyc_total_deaths, 2)
nyc_aa_deaths_pct
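The NYC table is extracted with several columns fused into one string, so the counts have to be recovered by pattern matching. The sketch below shows one way to generalize the two cells above into a single helper. `count_after_label` is a hypothetical name, and the input lines are copied from the extracted output shown earlier.

```r
# Hypothetical helper: find the single line starting with `label` and return
# the first number that follows it, with thousands separators removed.
count_after_label <- function(lines, label) {
  hit <- lines[startsWith(lines, label)]
  if (length(hit) != 1) {
    stop(sprintf("expected one line starting with '%s', found %d", label, length(hit)))
  }
  rest <- substring(hit, nchar(label) + 1)
  num <- regmatches(rest, regexpr("[0-9][0-9,]*", rest))
  as.numeric(gsub(",", "", num, fixed = TRUE))
}

# Lines copied from the first column of the extracted NYC table.
demo_txt <- c(
  "All Hispanic521 33.5",
  "Non-Hispanic/Latino: Black, African 428 27.5",
  "Total with Known Race/Ethnicity 1,555 *",
  "Total Unknown Race/Ethnicity 917 *"
)

aa <- count_after_label(demo_txt, "Non-Hispanic/Latino: Black, African")
known <- count_after_label(demo_txt, "Total with Known Race/Ethnicity")
unknown <- count_after_label(demo_txt, "Total Unknown Race/Ethnicity")
c(aa = aa, known = known, unknown = unknown)
```

Matching on the label rather than on a row position keeps the parsing working when rows shift, and it also handles fused text such as `All Hispanic521`.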

In [0]:
## Summary of NYC results
nyc_date = '4/6/2020' ## Sys.Date() %>% format('%-m/%-d/%Y')

output_nyc = tibble(
    `Location` = 'New York City',
    `Date Published` = nyc_date, 
    `Total Cases` = NA_integer_, # nyc_total_cases,
    `Total Deaths` = nyc_total_deaths,
    `Pct Cases Black/AA` = NA_integer_, # nyc_aa_cases_pct,
    `Pct Deaths Black/AA` = nyc_aa_deaths_pct)

output_nyc

Location,Date Published,Total Cases,Total Deaths,Pct Cases Black/AA,Pct Deaths Black/AA
<chr>,<chr>,<int>,<dbl>,<int>,<dbl>
New York City,4/6/2020,,2472,,17.31


## Combining the results into a single table

In [0]:
output_case2 = list(output_fl, output_sd, output_nyc) %>% bind_rows
output_case2

Location,Date Published,Total Cases,Total Deaths,Pct Cases Black/AA,Pct Deaths Black/AA
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Florida,5/5/2020,35969.0,1399,18.62,21.73
California - San Diego,5/5/2020,4020.0,144,3.66,2.78
New York City,4/6/2020,,2472,,17.31


In [0]:
output_file = paste0('output_case2_pdf_', Sys.Date(), '.csv'); output_file
output_case2 %>% write_csv(output_file)

Notes:

* Run date (i.e., the date on which this code is run) is used as a proxy for `Date Published`. For now, visual inspection of these dates is required to ensure that they are accurate. A later version of this script could parse the header text (or the file name) to determine the publish date.

* For now, the percentages are calculated out of the total that includes unknown race/ethnicity. Another option is to exclude unknown race/ethnicity from the denominator. We should decide which option is best for these reports.

* I couldn't find any NYC racial/ethnic breakdowns beyond April 6 and April 16 reports. We will have to continue to dig for more recent data.
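To make the denominator question in the notes concrete, the sketch below computes the Black/African American share of San Diego cases both ways, using the counts from the San Diego cases table above. `pct_share` is a hypothetical helper, not part of the notebook.

```r
# Hypothetical helper: percentage share of one group, optionally dropping
# the unknown category from the denominator.
pct_share <- function(counts, target, unknown = NULL, digits = 2) {
  denom <- if (is.null(unknown)) counts else counts[setdiff(names(counts), unknown)]
  round(100 * counts[[target]] / sum(denom), digits)
}

# Counts copied from the San Diego cases table.
sd_counts <- c(
  "Hispanic or Latino" = 1765, "White" = 948,
  "Black or African American" = 147, "Asian" = 283,
  "Pacific Islander" = 34, "American Indian" = 8,
  "Multiple Race" = 36, "Race/Ethnicity Unknown" = 799
)

pct_with_unknown <- pct_share(sd_counts, "Black or African American")
pct_known_only <- pct_share(sd_counts, "Black or African American",
                            unknown = "Race/Ethnicity Unknown")
c(with_unknown = pct_with_unknown, known_only = pct_known_only)
```

With the unknown category included the share is 3.66, matching the summary table. Excluding it gives 4.56, in line with the 4.6% of known race/ethnicity reported in the source PDF.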