# 5. Data Transformations and Summaries

In this chapter, we will introduce the `dyplr` package, which is part of the `tidyverse` group of packages, to expand our tools in exploring and transforming our data. We will learn how to do some basic manipulations of data (e.g. adding or removing columns, filtering data, arranging by one or multiple columns) as well as how to summarize data (e.g. grouping by values, calculating summary statistics). We will also practice combining these operations using the pipe operator `%>%`. We will use the same sample of the National Health and Nutrition Examination Survey ([NHANES](https://www.cdc.gov/nchs/nhanes/index.htm)) as in Chapter 4.

In [1]:
library(RforHDSdata)
suppressPackageStartupMessages(library(tidyverse))
data(NHANESsample)

## Tibbles and Data Frames

Take a look at the class of `NHANESsample`. As we might expect, the data is stored as a data frame.  

In [2]:
class(NHANESsample)

However, the tidyverse also works with another data structure called a **tibble**. A **tibble** has all the properties of data frames that we have learned so far but they are a more modern version of a data frame. To convert our data to this data structure we use the `as_tibble()` function. In practice, there are only very slight difference between the two data structures and you do not need to convert to a tibble. Below we convert our data from a data frame and print the head of the data before converting back to a data frame and repeating. You can see the two structures have a slightly different print statement.

In [3]:
nhanes_df <- as_tibble(NHANESsample)
print(head(nhanes_df))

[90m# A tibble: 6 × 21[39m
     ID   AGE SEX    RACE             EDUCATION INCOME SMOKE  YEAR  LEAD BMI_CAT
  [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<fct>[39m[23m  [3m[90m<fct>[39m[23m            [3m[90m<fct>[39m[23m      [3m[90m<dbl>[39m[23m [3m[90m<fct>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<dbl>[39m[23m [3m[90m<fct>[39m[23m  
[90m1[39m     2    77 Male   Non-Hispanic Wh… MoreThan…   5    Neve…  [4m1[24m999   5   BMI<=25
[90m2[39m     5    49 Male   Non-Hispanic Wh… MoreThan…   5    Quit…  [4m1[24m999   1.6 25<BMI…
[90m3[39m    12    37 Male   Non-Hispanic Wh… MoreThan…   4.93 Neve…  [4m1[24m999   2.4 BMI>=30
[90m4[39m    13    70 Male   Mexican American LessThan…   1.07 Quit…  [4m1[24m999   1.6 25<BMI…
[90m5[39m    14    81 Male   Non-Hispanic Wh… LessThan…   2.67 Stil…  [4m1[24m999   5.5 25<BMI…
[90m6[39m    15    38 Female Non-Hispanic Wh… MoreThan…   4.52 Stil…  [4m1[24m999   1.5 25<BMI…
[90m# ℹ 11 more va

In [4]:
nhanes_df <- as.data.frame(nhanes_df)
print(head(nhanes_df))

  ID AGE    SEX               RACE  EDUCATION INCOME      SMOKE YEAR LEAD
1  2  77   Male Non-Hispanic White MoreThanHS   5.00 NeverSmoke 1999  5.0
2  5  49   Male Non-Hispanic White MoreThanHS   5.00  QuitSmoke 1999  1.6
3 12  37   Male Non-Hispanic White MoreThanHS   4.93 NeverSmoke 1999  2.4
4 13  70   Male   Mexican American LessThanHS   1.07  QuitSmoke 1999  1.6
5 14  81   Male Non-Hispanic White LessThanHS   2.67 StillSmoke 1999  5.5
6 15  38 Female Non-Hispanic White MoreThanHS   4.52 StillSmoke 1999  1.5
    BMI_CAT LEAD_QUANTILE HYP ALC DBP1 DBP2 DBP3 DBP4 SBP1 SBP2 SBP3 SBP4
1   BMI<=25            Q4   0 Yes   58   56   56   NA  106   98   98   NA
2 25<BMI<30            Q3   1 Yes   82   84   82   NA  122  122  122   NA
3   BMI>=30            Q4   1 Yes  108   98  100   NA  182  172  176   NA
4 25<BMI<30            Q3   1 Yes   78   62   70   NA  140  130  130   NA
5 25<BMI<30            Q4   1 Yes   56   NA   58   64  142   NA  134  138
6 25<BMI<30            Q3   0 Yes   68

We mention tibbles since some functions in the tidyverse package convert the data to a tibble. In particular, when we summarize over groups below we can expect a tibble to be returned. It is useful to be aware that our data may change data structure and to know that we can always convert back if needed. 

## Subsetting Data using `select`, `filter`, and `slice`

In earlier chapters, we have seen how to select and filter data by using row and column indices as well as the `subset()` function. The `dplyr` package has its own functions that are useful to subset the data. The `select()` function allows us to select a subset of columns. The function takes in the data frame (or tibble) and the names or indices of the columns we want to select. For example, if we only wanted to select the variables for race and blood lead level we could specify these two columns. To display the result of this selection we use the pipe operator `%>%`. Recall that this takes the result on the left hand side and passes it as the first argument to the function on the right hand side. This shows that there are only two columns in the filtered data.  

In [5]:
select(nhanes_df, c(RACE, LEAD)) %>% head()

Unnamed: 0_level_0,RACE,LEAD
Unnamed: 0_level_1,<fct>,<dbl>
1,Non-Hispanic White,5.0
2,Non-Hispanic White,1.6
3,Non-Hispanic White,2.4
4,Mexican American,1.6
5,Non-Hispanic White,5.5
6,Non-Hispanic White,1.5


The `select()` function can also be used to *remove* columns by using a vector of column names with a negative sign. For example, below we keep all columns except `ID` and `LEAD_QUANTILE`. Note that in this case we have saved the selected data back to our data frame `nhanes_df`. Additionally, this time I used a pipe operator to pipe the data to the select function itself.

In [6]:
nhanes_df <- nhanes_df %>% select(-c(ID, LEAD_QUANTILE))
names(nhanes_df)

While `select()` allows us to choose a subset of columns, the `filter()` function allows us to choose a subset of rows.  The `filter()` function takes a data frame as the first argument and a vector of booleans as the second argument. This vector of booleans can be generated using conditional statements as we used in Chapter 4. Below, we choose the data after 2008.

In [7]:
nhanes_df_recent <- nhanes_df %>% filter(YEAR >= 2008)

We can combine conditions by using multiple `filter` calls, creating a more complicated conditional statement by using the `&` (and), `|` (or), and `%in%` (in) operators, or by separating the conditions with commas within filter. Below, we show three ways to find all the data for males between 2008 and 2012. Note that the `between()` function allows us to capture the logic `YEAR >= 2008 & YEAR <= 2012`. The parentheses here are important to capture the logic we want. In all these examples, we broke our code up into multiple lines. This makes it easier to read. A good rule of thumb is to not go more than 80 characters in a line and R Studio conveinently has a vertical gray line at this limit. To create a new line you can hit enter after an operator (e.g. `%>%`, `+`, `|`) or within a set of unfinished brackets or parentheses. Either of these breaks lets R know that your code is not finished yet. 

In [8]:
# Example 1: multiple filter calls
nhanes_df_males1 <- nhanes_df %>%
  filter(YEAR <= 2012) %>%
  filter(YEAR >= 2008) %>%
  filter(SEX == "Male")

# Example 2: combine with & operator
nhanes_df_males2 <- nhanes_df %>%
  filter((YEAR <= 2012) & (YEAR >= 2008) & (SEX == "Male"))

# Example 3: combine into one filter call with commas
nhanes_df_males3 <- nhanes_df %>%
  filter(between(YEAR, 2008, 2012), SEX == "Male")

Last, we can subset the data using the `slice()` function to select a slice of rows by their index. The function takes the data set and a vector of indices. Below, we find the first and last rows of the data. 

In [9]:
slice(nhanes_df, c(1, nrow(nhanes_df)))

AGE,SEX,RACE,EDUCATION,INCOME,SMOKE,YEAR,LEAD,BMI_CAT,HYP,ALC,DBP1,DBP2,DBP3,DBP4,SBP1,SBP2,SBP3,SBP4
<dbl>,<fct>,<fct>,<fct>,<dbl>,<fct>,<dbl>,<dbl>,<fct>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
77,Male,Non-Hispanic White,MoreThanHS,5.0,NeverSmoke,1999,5.0,BMI<=25,0,Yes,58,56,56,,106,98,98,
38,Male,Non-Hispanic White,MoreThanHS,1.56,StillSmoke,2017,0.9,BMI>=30,1,Yes,98,92,98,,150,146,148,


A few other useful slice functions are `slice_sample()`, `slice_max()`, and `slice_min()`. The first takes in an argument `n` which specifies the number of *random* rows to sample from the data. For example, we could randomly sample 100 rows from our data. The latter two allow us to specify a column through the argument `order_by` and returns the `n` rows with either the highest or lowest values. Below we find the three male observations from 2007 with the highest and lowest blood lead levels and select a subset of columns to display.

In [10]:
# three male observations with highest blood lead level in 2007
nhanes_df %>%
  filter(YEAR == 2007, SEX == "Male") %>%
  select(c(RACE, EDUCATION, SMOKE, LEAD, SBP1, DBP1)) %>%
  slice_max(order_by = LEAD, n=3)

# three male observations with lowest blood lead level in 2007
nhanes_df %>%
  filter(YEAR == 2007, SEX == "Male") %>%
  select(c(RACE, EDUCATION, SMOKE, LEAD, SBP1, DBP1)) %>%
  slice_min(order_by = LEAD, n=3)

RACE,EDUCATION,SMOKE,LEAD,SBP1,DBP1
<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>
Non-Hispanic Black,LessThanHS,NeverSmoke,33.1,106,66
Other Hispanic,LessThanHS,StillSmoke,26.8,106,72
Other Hispanic,LessThanHS,StillSmoke,25.7,112,60


RACE,EDUCATION,SMOKE,LEAD,SBP1,DBP1
<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>
Non-Hispanic White,LessThanHS,NeverSmoke,0.1767767,114,80
Other Hispanic,LessThanHS,QuitSmoke,0.28,122,62
Mexican American,MoreThanHS,QuitSmoke,0.32,112,66


### Practice Question

## Updating Rows and Columns using `rename`, `mutate`, and `arrange`

The next few functions we will look at will allow us to update the rows and columns in our data. For example, the `rename()` function allows us to change the names of columns. Below, we change the name of `INCOME` to `PIR` since this variable is the poverty income ratio and also update the name of `SMOKE` to be `SMOKE_STATUS`. In specifying these names, the new name is on the left and the old name on the right.

In [11]:
nhanes_df <- nhanes_df %>% rename(PIR = INCOME, SMOKE_STATUS = SMOKE)
names(nhanes_df)

In the last chapter, we created a new variable called `EVER_SMOKE` based on the smoking status variable. We use the `ifelse()` function to create this variable. Recall that this function allows us to specify a condition and then two alternative values based on whether we meet or do not meet this condition. We see that there are about 15,000 observations that never smoked. 

In [12]:
ifelse(nhanes_df$SMOKE_STATUS == "NeverSmoke", "No", "Yes") %>% table()

.
   No   Yes 
15087 16178 

Another useful function is the `case_when()` function from the tidyverse. This function is an extension of the `ifelse()` function but allows to specify more than two cases. We demonstrate this function below to show how we could relabel these entries. For each condition, we use the right side of the `~` to specify the value associated with a TRUE for that condition. 

In [13]:
case_when(nhanes_df$SMOKE_STATUS == "NeverSmoke" ~ "Never Smoked",
          nhanes_df$SMOKE_STATUS == "QuitSmoke" ~ "Quit Smoking",
          nhanes_df$SMOKE_STATUS == "StillSmoke" ~ "Current Smoker") %>% table()

.
Current Smoker   Never Smoked   Quit Smoking 
          7317          15087           8861 

Above we did not store the columns we created. To do so, we could use the `$` operator or the  `cbind()` function. The tidyverse also includes an alternative function to add columns called `mutate()`. This function takes in a data frame and a set of columns with associated names to add to the data or update. In the example below, we create the column `EVER_SMOKE` and update the column `SMOKE_STATUS`. Within the `mutate()` function we do not have to use the `$` operator to reference the column `SMOKE_STATUS`. Instead, we can specify just the column name and it will interpret it as that column.

In [14]:
nhanes_df <- nhanes_df %>% 
  mutate(EVER_SMOKE = ifelse(SMOKE_STATUS == "NeverSmoke", "No", "Yes"), 
         SMOKE_STATUS = case_when(SMOKE_STATUS == "NeverSmoke" ~ "Never Smoked",
                                  SMOKE_STATUS == "QuitSmoke" ~ "Quit Smoking",
                                  SMOKE_STATUS == "StillSmoke" ~ "Current Smoker")) 

The last function we will demonstrate in this section is the `arrange()` function. This function takes in a data frame and a vector of columns used to sort the data (data is sorted by the first column with ties being sorted by the second column, etc.). By default, the `arrange()` function sorts in increasing order. We can use the `desc()` function to instead use a descending order. The code below filters to male smokers before sorting by decreasing systolic and diastolic blood pressure. 

In [15]:
nhanes_df %>% 
  select(c(YEAR, SEX, SMOKE_STATUS, SBP1, DBP1, LEAD)) %>%
  filter(SEX == "Male", SMOKE_STATUS == "Current Smoker") %>%
  arrange(desc(SBP1), desc(DBP1)) %>%
  head(10)

Unnamed: 0_level_0,YEAR,SEX,SMOKE_STATUS,SBP1,DBP1,LEAD
Unnamed: 0_level_1,<dbl>,<fct>,<chr>,<dbl>,<dbl>,<dbl>
1,2011,Male,Current Smoker,230,120,5.84
2,2015,Male,Current Smoker,230,98,1.56
3,2009,Male,Current Smoker,220,80,4.84
4,2001,Male,Current Smoker,218,118,3.7
5,2017,Male,Current Smoker,212,122,2.2
6,2003,Male,Current Smoker,212,54,4.0
7,2011,Male,Current Smoker,210,92,5.37
8,2007,Male,Current Smoker,210,80,2.18
9,2015,Male,Current Smoker,206,108,1.44
10,2003,Male,Current Smoker,206,68,1.8


### Practice Question

TODO


## Summarizing using `summarize` and `group_by`

If we wanted to understand how many observations there are for each given race category, we could use the `table()` function. Another similar function is the `count()` function. This function takes in a data frame and one or more variables and counts the number of rows for each combination of unique value. If no columns are specified, it counts the total number of rows. Below, we find the total number of rows (31,265) and the number of observations by race and year. We can see that the number in each group fluctuates quite a bit!

In [16]:
count(nhanes_df)
count(nhanes_df, RACE, YEAR)

n
<int>
31265


RACE,YEAR,n
<fct>,<dbl>,<int>
Mexican American,1999,713
Mexican American,2001,674
Mexican American,2003,627
Mexican American,2005,634
Mexican American,2007,639
Mexican American,2009,672
Mexican American,2011,322
Mexican American,2013,234
Mexican American,2015,287
Mexican American,2017,475


Finding the count is a form of a summary statistic for our data. The `summarize()` function is used to compute summary statistics of the data and allows us to compute multiple statistics. The `summarize()` function takes in a data frame and one or more summary functions based on the given column variables. In the example below, we find the total number of observations and the mean and median systolic blood pressure for Non-Hispanic Blacks. Note that the `n()` function is the function within `summarize()` that finds the number of observations. In the `mean()` and `median()` functions we set `na.rm=TRUE` to remove NAs before computing the value (otherwise we could get NA as our output).

In [17]:
nhanes_df %>%
  filter(RACE == "Non-Hispanic Black") %>%
  summarize(TOT = n(), MEAN_SBP = mean(SBP1, na.rm=TRUE), MEAN_DBP = mean(DBP1, na.rm=TRUE))

TOT,MEAN_SBP,MEAN_DBP
<int>,<dbl>,<dbl>
6041,128.7584,72.59694


If we wanted to repeat this for the other race groups, we would have to change our filter each time. To avoid doing so, the `group_by()` function which takes a data frame and one or more columns with which to group data with the same values by. Below we group using the `RACE` variable. When we look at printed output it looks like before but we see that it is a grouped data frame as the class printed at the top. In fact, a grouped data frame (or grouped tibble) acts like a set of data frames: one for each group. If we use the `slice()` function with index 1 it will return the first row for each group.

In [18]:
nhanes_df %>% 
  group_by(RACE) %>%
  slice(1)

AGE,SEX,RACE,EDUCATION,PIR,SMOKE_STATUS,YEAR,LEAD,BMI_CAT,HYP,ALC,DBP1,DBP2,DBP3,DBP4,SBP1,SBP2,SBP3,SBP4,EVER_SMOKE
<dbl>,<fct>,<fct>,<fct>,<dbl>,<chr>,<dbl>,<dbl>,<fct>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
70,Male,Mexican American,LessThanHS,1.07,Quit Smoking,1999,1.6,25<BMI<30,1,Yes,78,62,70,,140,130,130,,Yes
61,Female,Other Hispanic,MoreThanHS,3.33,Current Smoker,1999,2.2,BMI<=25,0,Yes,70,60,74,,106,110,116,,Yes
77,Male,Non-Hispanic White,MoreThanHS,5.0,Never Smoked,1999,5.0,BMI<=25,0,Yes,58,56,56,,106,98,98,,No
38,Female,Non-Hispanic Black,HS,0.92,Current Smoker,1999,1.8,25<BMI<30,0,Yes,76,80,74,,116,116,114,,Yes
63,Female,Other Race,MoreThanHS,5.0,Never Smoked,1999,1.2,BMI<=25,1,Yes,66,78,82,,120,118,118,,No


Grouping data is very helpful in combination with the summarize function. Like with the `slice()` function, `summarize()` will calculate the summary values for each group. We can now find the total number of observations as well as the mean systolic and diastolic blood pressure values for each group. Note that the returned summarized data is in a tibble. 

In [19]:
nhanes_df %>% 
  group_by(RACE) %>%
  summarize(TOT = n(), MEAN_SBP = mean(SBP1, na.rm=TRUE), MEAN_DBP = mean(DBP1, na.rm=TRUE))

RACE,TOT,MEAN_SBP,MEAN_DBP
<fct>,<int>,<dbl>,<dbl>
Mexican American,5277,124.1754,70.40122
Other Hispanic,2279,123.2342,70.1127
Non-Hispanic White,15473,124.6984,70.3583
Non-Hispanic Black,6041,128.7584,72.59694
Other Race,2195,122.0406,72.60445


After summarizing, the data is no longer grouped by race. If we ever want to remove the group structure from our data, we can use the `ungroup()` function. This restores it to a single data frame. After ungrouping by race below, we get a single observation returned by the `slice()` function.

In [20]:
nhanes_df %>% 
  select(SEX, RACE, SBP1, DBP1) %>%
  group_by(RACE) %>%
  ungroup() %>%
  arrange(desc(SBP1)) %>%
  slice(1)

SEX,RACE,SBP1,DBP1
<fct>,<fct>,<dbl>,<dbl>
Female,Non-Hispanic White,270,124


## Review Video

Show creating the blood pressure variables! Then find mean by race and sex

Resummarize by year and show how blood pressure has changed using lag/lead

Grouped data and ungroup (slice by group!)

Documentation .by .preserve .drop or .add

distinct from cheat sheet

## Exercises

1. Suppose we are interested in the distribution of blood pressure by age among females. First, create a new column in `nhanes_df` called `AGE_CAT` containing the age categories "20-29" through "80-89". Then, use the `filter()` and `summarize()` functions to create a data frame summarizing the average and standard deviation of the first systolic blood pressure reading among females between the ages of 20 and 29. 

In [1]:
df_1 <- nhanes_df %>%
          mutate(AGE_CAT = ifelse(AGE < 29, "20-29", 
                                  ifelse(AGE < 39, "30-39",
                                  ifelse(AGE < 49, "40-49",
                                  ifelse(AGE < 59, "50-59",
                                  ifelse(AGE < 69, "60-69",
                                  ifelse(AGE < 79, "70-79",
                                  ifelse(AGE < 89, "80-89"))))))))
df_1 %>%
  filter(AGE_CAT == "20-29" & SEX == "Female") %>%
  summarize(NUM_OBS = n(), AVG_SBP = mean(SBP1, na.rm=TRUE), SD_SBP = sd(SBP1, na.rm=TRUE))

ERROR: Error in nhanes_df %>% mutate(AGE_CAT = ifelse(AGE < 29, "20-29", ifelse(AGE < : could not find function "%>%"


2. Now, create two data frames - one for each sex - with the average and standard deviation summarized separately for each age group rather than for just one selected age group as in question 1. 

In [22]:
df_1 %>%
  group_by(AGE_CAT) %>%
  filter(SEX == "Female") %>% 
  summarize(NUM_OBS = n(), AVG_SBP = mean(SBP1, na.rm=TRUE), SD_SBP = sd(SBP1, na.rm=TRUE))

df_1 %>%
  group_by(AGE_CAT) %>%
  filter(SEX == "Male") %>% 
  summarize(NUM_OBS = n(), AVG_SBP = mean(SBP1, na.rm=TRUE), SD_SBP = sd(SBP1, na.rm=TRUE))

AGE_CAT,NUM_OBS,AVE_SBP,SD_SBP
<chr>,<int>,<dbl>,<dbl>
20-29,2475,109.5777,10.07101
30-39,2701,111.7855,12.1576
40-49,2651,118.3425,15.87204
50-59,2304,125.5809,18.96571
60-69,2325,133.8368,20.43897
70-79,1445,139.1925,22.15939
80-89,896,147.3547,23.36094


AGE_CAT,NUM_OBS,AVE_SBP,SD_SBP
<chr>,<int>,<dbl>,<dbl>
20-29,2336,118.6879,10.90715
30-39,2726,120.568,12.5913
40-49,2831,122.9359,14.35871
50-59,2633,127.0484,16.63128
60-69,2863,133.1562,19.43038
70-79,1964,136.1728,20.27739
80-89,1115,139.9807,22.21069


3. Try to combine the creation of the two summary tables in exercises 2 into just one line of code. This is possible because `group_by` permits us to group by more than one variable, yielding one big summary table.

In [23]:
df_1 %>%
  group_by(AGE_CAT, SEX) %>%
  summarize(NUM_OBS = n(), AVG_SBP = mean(SBP1, na.rm=TRUE), SD_SBP = sd(SBP1, na.rm=TRUE))

[1m[22m`summarise()` has grouped output by 'AGE_CAT'. You can override using the
`.groups` argument.


AGE_CAT,SEX,NUM_OBS,AVE_SBP,SD_SBP
<chr>,<fct>,<int>,<dbl>,<dbl>
20-29,Male,2336,118.6879,10.90715
20-29,Female,2475,109.5777,10.07101
30-39,Male,2726,120.568,12.5913
30-39,Female,2701,111.7855,12.1576
40-49,Male,2831,122.9359,14.35871
40-49,Female,2651,118.3425,15.87204
50-59,Male,2633,127.0484,16.63128
50-59,Female,2304,125.5809,18.96571
60-69,Male,2863,133.1562,19.43038
60-69,Female,2325,133.8368,20.43897


4. For males between the ages of 40 and 49, compare systolic blood pressure across race as reported in the `RACE` variable. Order the resulting table from lowest to highest average systolic blood pressure.

In [24]:
df_1 %>%
  filter(SEX == "Male", AGE_CAT == "40-49") %>%
  group_by(RACE) %>%
  summarize(NUM_OBS = n(), AVE_SBP = mean(SBP1, na.rm=TRUE), SD_SBP = sd(SBP1, na.rm=TRUE)) %>%
  arrange(AVE_SBP)

RACE,NUM_OBS,AVE_SBP,SD_SBP
<fct>,<int>,<dbl>,<dbl>
Other Race,227,119.242,13.84683
Non-Hispanic White,1296,121.6624,13.18456
Mexican American,579,122.5474,13.61925
Other Hispanic,188,124.3068,15.22168
Non-Hispanic Black,541,127.6223,16.5722


5. We will use `covidcases` dataset for exercises. Construct a new data frame `covid_sub` that calculates the total number of cases confirmed in each county of state and filter the top 10 counties with their both cases and deaths. 

In [25]:
data(covidcases)
covid_sub <- covidcases %>%
  group_by(county) %>%
  summarize(totalcases  = sum(weekly_cases),
            totaldeaths = sum(weekly_deaths)) %>%
  arrange(desc(totaldeaths)) %>%
  slice(1:10)
covid_sub

county,totalcases,totaldeaths
<chr>,<int>,<int>
New York City,201698,20226
Providence,139025,10276
Bergen,19220,7799
Middlesex,67845,7200
Nassau,39435,6766
Los Angeles,204839,4720
Orange,87576,4284
Rockland,12258,4275
Cook,110237,4141
Essex,52243,4010
