# MPA 5830 - Module 02
In this module our goal is to understand how some very powerful packages work their magic. In particular, although they are of somewhat recent vintage, `{dplyr}`, `{tidyr}` and `{lubridate}` have quickly gained a fan following because they allow you to clean and organize your data and then calculate quantities of interest with surprising ease. In this module we will start to see some of these packages' core functionalities and wrap up this particular leg of our learning journey in __`Module 03`__. 

## `{dplyr}` 
There is a common quote tossed about a good bit in the hallways of data science (and I am paraphrasing here) that a data scientist spends about 80% of their time gathering, cleaning, and organizing their data and only about 20% of their time on the analysis per se. This may or may not be true, especially in the initial stages of a new project, but yes, we do spend an awfully large amount of our time getting the data ready for visualizations and other analyses. Do this work long enough and you come to realize that anything you can do to speed up the cleaning phase is time and money saved. And yet, data cleaning skills are in short supply. On the plus side of the ledger, packages like `{dplyr}` and `{data.table}` have simplified what were once nightmarish tasks. 

`Nightmare` is not a word to be tossed around lightly, so let us set up a seemingly large data-set with 100+ columns and tons of information hidden in it. Once we set up this scenario and spell out a few questions we would like to answer, we might better appreciate how `{dplyr}` comes to our aid. In particular, we might come to understand how `{dplyr}` uses seven core verbs:  

| **What you want to do ...** | **`{dplyr}` function** |
| :-----  | :----- |
| you need to select columns to work with? | `select()`|
| you need to use a subset of the data based on some criterion? | `filter()`|
| you need to arrange the data in ascending/descending order of variable(s)? | `arrange()`|
| you want the results of your calculations to be a standalone data frame? | `summarise()`|
| you want to add your calculated value(s) to the existing data frame? | `mutate()`|
| you want to add your calculated value(s) to the existing data frame but also drop all variables/columns not being used in the calculation? | `transmute()`|
| you need to calculate averages, frequencies, etc by groups? | `group_by()`|

Here is the dataset -- all flights arriving at or departing from Columbus (Ohio), January through September of 2017. 

Let us load the `{tidyverse}` package and then the data. 

In [1]:
library(tidyverse) 

load("data/cmhflights_01092017.RData")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
names(cmhflights) # will show you the column names

The output is rather onerous with 110 columns displayed, the last being `X110`, an empty column that can be dropped. 

In [3]:
cmhflights$X110 <- NULL #This command will delete the column named X110 

Did it work? Let us check.

In [4]:
names(cmhflights)

Now we are down to 109 columns and ready to get to work. 
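
As an aside, the same column could have been dropped with `{dplyr}` itself rather than with `$... <- NULL`. The sketch below is only an illustration on a tiny made-up data frame (`toy` and its three columns are assumptions standing in for `cmhflights`); it shows that a minus sign in front of a column name inside `select()` removes that column.

```r
library(dplyr)

# Hypothetical stand-in for cmhflights, with an empty trailing column
toy <- tibble(Year = 2017, Month = 1:3, X110 = NA)

# A minus sign in front of a column name drops that column
toy %>%
  select(-X110) -> toy_clean

names(toy_clean)
```

Either route gets you to the same place; the `select(-...)` form is handy because it can sit inside a longer chain of commands.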

## select() 
As in the present case of 100+ columns, often our data frame will have more columns than we plan to work with. In such instances it is a good idea to drop all unwanted columns; this will make the data-set more manageable and tax our computing resources less. For example, say I only want the first five columns (which happen to be `Year`, `Quarter`, `Month`, `DayofMonth`, and `DayOfWeek`). I could use `select` to create a data frame with only these columns:

In [5]:
cmhflights %>% 
  select(Year:DayOfWeek) -> my.df

The `:` is the bridge between __consecutive columns__ starting with `Year` and stopping with `DayOfWeek`.

In addition, note that the subset of columns selected and all rows of data are being written to a new data frame called `my.df`.

Quick check to see if we have only the columns we wanted. 

In [6]:
names(my.df)

Wonderful! It worked.

Now, the same result could have been obtained by taking the longer route of __listing each column by name__, as in the following: 

In [7]:
cmhflights %>%
  select(Year, Quarter, Month, DayofMonth, DayOfWeek) -> my.df 

names(my.df)

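What if the columns were not sequentially located? In that case we would need to list each column we want. Say I want `Year`, `FlightDate`, `UniqueCarrier`, and `TailNum`. Since `FlightDate` and `UniqueCarrier` happen to sit next to each other, `FlightDate:UniqueCarrier` picks up both. 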

In [8]:
cmhflights %>% 
  select(Year, FlightDate:UniqueCarrier, TailNum) -> my.df 

names(my.df)

Could we have used __column numbers instead of column names__? Absolutely. 

In [9]:
cmhflights %>% 
  select(c(1, 3, 5, 7)) -> my.df 

names(my.df)

You can also use __consecutive column numbers__, for example, columns 1 through 5 as follows: 

In [10]:
cmhflights %>% 
  select(c(1:5)) -> my.df 

names(my.df)

You can also use column numbers to select a mix of columns, some sequential and others not.

In [11]:
cmhflights %>% 
  select(c(1, 6:9, 12)) -> my.df

names(my.df)

### select() in other ways
We can also select columns in other ways, for instance by specifying that the column name `contain` some string. The code below shows how this is done when I am looking for column names with the phrase "Carrier"; the cells that follow do the same for names that start with "De" and names that end with "Num".  

In [12]:
cmhflights %>% 
  select(contains("Carrier")) -> my.df

names(my.df)

You can also specify that the columns to be selected __start with__ some alphanumeric string, for example, 

In [13]:
cmhflights %>% 
  select(starts_with("De")) -> my.df

names(my.df)

The other option would be to choose columns that __end with__ a particular alphanumeric string, for example, 

In [14]:
cmhflights %>% 
  select(ends_with("Num")) -> my.df

names(my.df)

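There are two other options -- `matches()` and `num_range()`. `matches()` selects columns whose names match a regular expression, while `num_range()` selects columns whose names combine a common prefix with a range of numbers. Let us look at each in turn. 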

In [15]:
cmhflights %>%
    select(matches("el")) -> my.df

names(my.df)
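
Because `matches()` takes a regular expression, it can do more than find a fixed piece of text. The following is a minimal sketch on a made-up data frame (`toy` and its column names are assumptions, chosen to resemble the flight data); it uses the anchors `^` (start of name) and `$` (end of name).

```r
library(dplyr)

# Hypothetical columns resembling those in cmhflights
toy <- tibble(DepDelay = 1, ArrDelay = 2, DepTime = 3, Cancelled = 0)

# "Delay$" keeps names that END in "Delay"
toy %>%
  select(matches("Delay$")) %>%
  names() -> ends_in_delay

# "^(Dep|Arr)" keeps names that START with "Dep" or with "Arr"
toy %>%
  select(matches("^(Dep|Arr)")) %>%
  names() -> starts_dep_or_arr

ends_in_delay
starts_dep_or_arr
```

Note that `matches()` ignores upper/lower case by default, which is why a pattern such as "el" can pick up both `DepDelay` and `Cancelled`.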

Here is another dataset, `billboard` (it ships with `{tidyr}`), where some of the column names contain a number.

In [16]:
head(billboard)

artist,track,date.entered,wk1,wk2,wk3,wk4,wk5,wk6,wk7,⋯,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
<chr>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
2 Pac,Baby Don't Cry (Keep...,2000-02-26,87,82,72,77.0,87.0,94.0,99.0,⋯,,,,,,,,,,
2Ge+her,The Hardest Part Of ...,2000-09-02,91,87,92,,,,,⋯,,,,,,,,,,
3 Doors Down,Kryptonite,2000-04-08,81,70,68,67.0,66.0,57.0,54.0,⋯,,,,,,,,,,
3 Doors Down,Loser,2000-10-21,76,76,72,69.0,67.0,65.0,55.0,⋯,,,,,,,,,,
504 Boyz,Wobble Wobble,2000-04-15,57,34,25,17.0,17.0,31.0,36.0,⋯,,,,,,,,,,
98^0,Give Me Just One Nig...,2000-08-19,51,39,34,26.0,26.0,19.0,2.0,⋯,,,,,,,,,,


In [17]:
billboard %>%
    select(num_range("wk", 1:5)) -> my.df

names(my.df)

------------------

## filter()
Do you really want all the rows in the data-set, or only the rows that meet some criteria? Say we only want flights in certain months, or only flights on Saturdays and Sundays. For example, say we only want flights in January, i.e., `Month == 1`. 

In [18]:
cmhflights %>% 
  filter(Month == 1) -> my.df

table(my.df$Month) # Show me a frequency table for Month in my.df


   1 
3757 

What about only American Airlines flights in January? Note that the UniqueCarrier code for this airline is AA.

In [19]:
cmhflights %>% 
  filter(Month == 1, UniqueCarrier == "AA") -> my.df 

table(my.df$Month, my.df$UniqueCarrier) # a simple frequency table

   
     AA
  1 387

Note that `,` inside `filter()` means `&`.
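
To see that the comma and `&` really are interchangeable, here is a small sketch on made-up data (the `toy` data frame is an assumption for illustration only). It also shows that an "or" condition has to be written out explicitly with `|`.

```r
library(dplyr)

# Hypothetical mini version of the flights data
toy <- tibble(
  Month = c(1, 1, 2, 2),
  UniqueCarrier = c("AA", "UA", "AA", "UA")
)

# Two ways of asking for January AND American Airlines
toy %>% filter(Month == 1, UniqueCarrier == "AA") -> with_comma
toy %>% filter(Month == 1 & UniqueCarrier == "AA") -> with_and

all.equal(with_comma, with_and)

# January OR American Airlines needs the | operator
toy %>% filter(Month == 1 | UniqueCarrier == "AA") -> with_or
with_or
```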

What about United Airlines flights in January to CMH (the airport code for Columbus, OH), regardless of where the flight originated? 

In [20]:
cmhflights %>% 
  filter(Month == 1, UniqueCarrier == "UA", Dest == "CMH") -> my.df

head(my.df)

Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,UniqueCarrier,AirlineID,Carrier,TailNum,⋯,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
<int>,<int>,<int>,<int>,<int>,<date>,<chr>,<int>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
2017,1,1,26,4,2017-01-26,UA,19977,UA,N889UA,⋯,,,,,,,,,,
2017,1,1,26,4,2017-01-26,UA,19977,UA,N39423,⋯,,,,,,,,,,
2017,1,1,25,3,2017-01-25,UA,19977,UA,N834UA,⋯,,,,,,,,,,
2017,1,1,25,3,2017-01-25,UA,19977,UA,N37462,⋯,,,,,,,,,,
2017,1,1,24,2,2017-01-24,UA,19977,UA,N855UA,⋯,,,,,,,,,,
2017,1,1,24,2,2017-01-24,UA,19977,UA,N75428,⋯,,,,,,,,,,


What if I wanted a more complicated filter, say, United Airlines flights in January or February, either to CMH or to ORD (the airport code for O'Hare in Chicago)?

In [21]:
cmhflights %>% 
  filter(
    Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD")
    ) -> my.df 

table(my.df$Month) # frequency table of Month


  1   2 
106 145 

In [22]:
table(my.df$UniqueCarrier) # frequency table of UniqueCarrier


 UA 
251 

In [23]:
table(my.df$Dest) # frequency table of Dest


CMH ORD 
132 119 

We can also flip a filter around. The `!` operator negates a condition, so the next two cells both keep every month except January and February: the first negates `%in%`, the second spells out each exclusion with `!=`.

In [24]:
cmhflights %>% 
  filter(
    !Month %in% c(1, 2)
    ) -> my.df 

table(my.df$Month)


   3    4    5    6    7    8    9 
4101 4123 4098 4138 4295 4279 3789 

In [25]:
cmhflights %>%
filter(
    Month != 1, Month != 2
    ) -> my.df

table(my.df$Month)


   3    4    5    6    7    8    9 
4101 4123 4098 4138 4295 4279 3789 

Beautiful, just beautiful. 

At this point it may not be readily apparent to you but using `%in% c(...)` makes applying complex filters easier than if you go some other route. 
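
One way to appreciate `%in%` is to compare it with the alternatives. The sketch below uses a made-up four-row data frame (`toy` is an assumption); it shows that `%in%` matches the longer `==` with `|` version, and that the tempting shortcut `Dest == c("CMH", "ORD")` can quietly return the wrong rows because R recycles the two values down the column.

```r
library(dplyr)

# Hypothetical destinations; the order matters for the last example
toy <- tibble(Dest = c("ORD", "CMH", "ATL", "DFW"))

# The long way: one comparison per value, joined by |
toy %>% filter(Dest == "CMH" | Dest == "ORD") -> long_way

# The short way
toy %>% filter(Dest %in% c("CMH", "ORD")) -> short_way

# A common slip: == compares row 1 to "CMH", row 2 to "ORD", row 3 to "CMH", ...
toy %>% filter(Dest == c("CMH", "ORD")) -> recycled

nrow(long_way)
nrow(short_way)
nrow(recycled)
```

With these four rows the recycled comparison matches nothing, even though both CMH and ORD are present in the data.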

Before we move on, note the operators that work with `filter()`:

| **Operator** | **Meaning**  | **Operator** | **Meaning** |
| :---- | :---- | :---- |  :---- |
| `<`    | less than | `>`    | greater than |
| `==`   | equal to  | `<=` | less than or equal to |
| `>=` | greater than or equal to | `!=` | not equal to |
| `%in%`   | is a member of |  `is.na()` | is NA |
| `!is.na()` | is not NA  | `&`, `!`, etc. | Boolean operators |
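
A short sketch of a few of these operators in use, on made-up data (`toy` and its values are assumptions). The point worth noticing is that a comparison against a missing value is itself missing, and `filter()` drops such rows without saying so; `is.na()` is how you go looking for them.

```r
library(dplyr)

# Hypothetical delays, one of them missing
toy <- tibble(FlightNum = 1:4, DepDelay = c(-3, 20, NA, 45))

# Flights delayed by 15 minutes or more; the NA row is silently dropped
toy %>% filter(DepDelay >= 15) -> late15

# Flights with no recorded delay
toy %>% filter(is.na(DepDelay)) -> missing_delay

late15
missing_delay
```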

----------------

## arrange()

Now let us say I wanted to arrange the resulting data frame in `ascending order` of departure delays. How might I do that? Via `arrange()` 

Before we see this new command in action I am going to whittle the data-frame to only a few columns, and only flights to CMH or ORD. That will make it easier to see the result of executed commands.

In [26]:
cmhflights %>% 
    select('Year', 'Month', 'DayofMonth', 'FlightDate', 'Carrier', 'TailNum', 'FlightNum', 
           'Origin', 'OriginCityName', 'Dest', 'DestCityName', 'CRSDepTime', 'DepTime', 
           'DepDelay', 'DepDelayMinutes', 'CRSArrTime', 'ArrTime', 'ArrDelay', 'ArrDelayMinutes') %>%
    filter(Dest %in% c("CMH", "ORD")) -> my.df

my.df %>% 
  arrange(DepDelayMinutes) -> my.df2

my.df2

Year,Month,DayofMonth,FlightDate,Carrier,TailNum,FlightNum,Origin,OriginCityName,Dest,DestCityName,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes
<int>,<int>,<int>,<date>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
2017,1,1,2017-01-01,AA,N802AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0936,-9,0,1707,1653,-14,0
2017,1,3,2017-01-03,AA,N760AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0939,-6,0,1707,1630,-37,0
2017,1,4,2017-01-04,AA,N753AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0940,-5,0,1707,1642,-25,0
2017,1,5,2017-01-05,AA,N744AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0938,-7,0,1707,1645,-22,0
2017,1,7,2017-01-07,AA,N762AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0941,-4,0,1707,1700,-7,0
2017,1,8,2017-01-08,AA,N808AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0945,0941,-4,0,1707,1655,-12,0
2017,1,11,2017-01-11,AA,N756AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0940,0934,-6,0,1712,1641,-31,0
2017,1,12,2017-01-12,AA,N805AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0940,0935,-5,0,1712,1652,-20,0
2017,1,13,2017-01-13,AA,N766AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0940,0937,-3,0,1712,1644,-28,0
2017,1,18,2017-01-18,AA,N825AA,508,LAX,"Los Angeles, CA",CMH,"Columbus, OH",0940,0940,0,0,1712,1658,-14,0


And now in `descending order` of delays by adding the minus symbol `-` to the  column name. 

In [27]:
my.df %>% 
  arrange(-DepDelayMinutes) -> my.df2

my.df2

Year,Month,DayofMonth,FlightDate,Carrier,TailNum,FlightNum,Origin,OriginCityName,Dest,DestCityName,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes
<int>,<int>,<int>,<date>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
2017,6,18,2017-06-18,EV,N18556,4080,CMH,"Columbus, OH",ORD,"Chicago, IL",1340,0644,1024,1024,1405,0705,1020,1020
2017,6,29,2017-06-29,DL,N956DL,1897,ATL,"Atlanta, GA",CMH,"Columbus, OH",2109,0555,526,526,2249,0713,504,504
2017,8,9,2017-08-09,F9,N223FR,1686,LAS,"Las Vegas, NV",CMH,"Columbus, OH",1615,0007,472,472,2314,0650,456,456
2017,1,29,2017-01-29,DL,N838DN,1897,ATL,"Atlanta, GA",CMH,"Columbus, OH",2025,0330,425,425,2157,0503,426,426
2017,4,8,2017-04-08,DL,N984DL,1276,ATL,"Atlanta, GA",CMH,"Columbus, OH",0925,1616,411,411,1058,1755,417,417
2017,1,23,2017-01-23,WN,N731SA,1886,DCA,"Washington, DC",CMH,"Columbus, OH",1125,1807,402,402,1250,1950,420,420
2017,2,19,2017-02-19,UA,N812UA,515,CMH,"Columbus, OH",ORD,"Chicago, IL",1750,0024,394,394,1821,0031,370,370
2017,1,16,2017-01-16,EV,N607LR,5440,LGA,"New York, NY",CMH,"Columbus, OH",1559,2216,377,377,1804,2347,343,343
2017,2,24,2017-02-24,UA,N849UA,688,ORD,"Chicago, IL",CMH,"Columbus, OH",1456,2112,376,376,1715,2349,394,394
2017,8,17,2017-08-17,WN,N955WN,1031,STL,"St. Louis, MO",CMH,"Columbus, OH",0820,1429,369,369,1045,1720,395,395


We could tweak this further, perhaps saying sort by departure delays to CMH, and then to ORD. 

In [28]:
my.df %>% 
  arrange(Dest, -DepDelayMinutes) -> my.df2

my.df2

Year,Month,DayofMonth,FlightDate,Carrier,TailNum,FlightNum,Origin,OriginCityName,Dest,DestCityName,CRSDepTime,DepTime,DepDelay,DepDelayMinutes,CRSArrTime,ArrTime,ArrDelay,ArrDelayMinutes
<int>,<int>,<int>,<date>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
2017,6,29,2017-06-29,DL,N956DL,1897,ATL,"Atlanta, GA",CMH,"Columbus, OH",2109,0555,526,526,2249,0713,504,504
2017,8,9,2017-08-09,F9,N223FR,1686,LAS,"Las Vegas, NV",CMH,"Columbus, OH",1615,0007,472,472,2314,0650,456,456
2017,1,29,2017-01-29,DL,N838DN,1897,ATL,"Atlanta, GA",CMH,"Columbus, OH",2025,0330,425,425,2157,0503,426,426
2017,4,8,2017-04-08,DL,N984DL,1276,ATL,"Atlanta, GA",CMH,"Columbus, OH",0925,1616,411,411,1058,1755,417,417
2017,1,23,2017-01-23,WN,N731SA,1886,DCA,"Washington, DC",CMH,"Columbus, OH",1125,1807,402,402,1250,1950,420,420
2017,1,16,2017-01-16,EV,N607LR,5440,LGA,"New York, NY",CMH,"Columbus, OH",1559,2216,377,377,1804,2347,343,343
2017,2,24,2017-02-24,UA,N849UA,688,ORD,"Chicago, IL",CMH,"Columbus, OH",1456,2112,376,376,1715,2349,394,394
2017,8,17,2017-08-17,WN,N955WN,1031,STL,"St. Louis, MO",CMH,"Columbus, OH",0820,1429,369,369,1045,1720,395,395
2017,4,30,2017-04-30,WN,N937WN,5531,MSY,"New Orleans, LA",CMH,"Columbus, OH",1540,2143,363,363,1850,0028,338,338
2017,4,3,2017-04-03,DL,N930DL,1193,ATL,"Atlanta, GA",CMH,"Columbus, OH",1747,2347,360,360,1919,0117,358,358


So far, we have seen each function in isolation. Now we streamline things a bit so that we only end up with the columns and rows we want to work with, arranged as we want the resulting data-set to be. 

In [29]:
cmhflights %>% 
  select(Month, UniqueCarrier, Dest, DepDelayMinutes) %>% 
  filter(
    Month %in% c(1, 2), UniqueCarrier == "UA", Dest %in% c("CMH", "ORD")
    ) %>% 
  arrange(Month, Dest, -DepDelayMinutes) -> my.df3

my.df3

Month,UniqueCarrier,Dest,DepDelayMinutes
<int>,<chr>,<chr>,<dbl>
1,UA,CMH,178
1,UA,CMH,61
1,UA,CMH,44
1,UA,CMH,39
1,UA,CMH,39
1,UA,CMH,30
1,UA,CMH,27
1,UA,CMH,18
1,UA,CMH,12
1,UA,CMH,10


Here, the end result is a data frame arranged by Month, then within Month by Destination, and then finally by descending order of flight delays. This is the beauty of `dplyr`, allowing us to chain together various functions to get what we want. How is this helpful? Well, now you have a data frame that you can analyze. What do I want to calculate? Well, let us say we want to create a frequency table, something that we would have done with Excel via a Pivot Table. 

------------

## summarise() or summarize()
What if we need to calculate frequencies? For example, how many flights per month are there? What if we want the __mean__ DepDelay or __median__ ArrDelay? These can be easily calculated as shown below. 

Let us start with frequencies.

In [30]:
cmhflights %>%
  count(Month) # Most flights were in July (n = 4295)

Month,n
<int>,<int>
1,3757
2,3413
3,4101
4,4123
5,4098
6,4138
7,4295
8,4279
9,3789


What about by days of the week AND by month? 

In [31]:
cmhflights %>%
  count(Month, DayOfWeek) # Output is sorted by Month and then DayOfWeek

Month,DayOfWeek,n
<int>,<int>,<int>
1,1,660
1,2,635
1,3,516
1,4,522
1,5,523
1,6,361
1,7,540
2,1,527
2,2,506
2,3,516


I want to know the average departure delay and the average arrival delay for all flights, with the averages calculated in two ways -- as the `mean` and as the `median`. Maybe I also want the variance and the standard deviation of both delays. 

In [32]:
cmhflights %>%
  summarise(
      mean_arr_delay = mean(ArrDelay, na.rm = TRUE),
      mean_dep_delay = mean(DepDelay, na.rm = TRUE),
      median_arr_delay = median(ArrDelay, na.rm = TRUE),
      median_dep_delay = median(DepDelay, na.rm = TRUE),
      variance_arr_delay = var(ArrDelay, na.rm = TRUE),
      variance_dep_delay = var(DepDelay, na.rm = TRUE),  
      sd_arr_delay = sd(ArrDelay, na.rm = TRUE),
      sd_dep_delay = sd(DepDelay, na.rm = TRUE)
  )

mean_arr_delay,mean_dep_delay,median_arr_delay,median_dep_delay,variance_arr_delay,variance_dep_delay,sd_arr_delay,sd_dep_delay
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
3.258327,9.202072,-6,-2,1746.284,1581.747,41.78856,39.77119


Here, the `na.rm = TRUE` argument is useful because R will return `NA` for any mean, median, variance, etc. if the column includes some rows with missing data. 

You can see this below, where I have a small data-set called `df` with 4 values of x, but one of the four is missing and recorded as `NA`. See what happens when I try to calculate the mean with/without `na.rm = TRUE`.

In [33]:
df = data.frame(x = c(2, 4, 9, NA))

df

x
<dbl>
2.0
4.0
9.0
""


In [34]:
df %>%
  summarise(mean.x = mean(x)) # You get no meaningful values 

mean.x
<dbl>
""


In [35]:
df %>%
  summarise(mean.x = mean(x, na.rm = TRUE)) # Now you do get something meaningful

mean.x
<dbl>
5


What if I want the total DepDelay by airline? 

In [36]:
cmhflights %>%
    group_by(Carrier) %>%
    summarize(sum(DepDelay, na.rm = TRUE))

Carrier,"sum(DepDelay, na.rm = TRUE)"
<chr>,<dbl>
AA,29853
DL,38297
EV,44146
F9,16834
OO,13006
UA,9739
WN,175001


--------------

## group_by()
These summaries are fine if you want to calculate aggregate quantities of interest (i.e., means, medians, frequencies, variances, etc. for all rows of data) but what if you wanted to calculate the __number of flights per month, by airline? Average delays by airline?__ 

Now things get interesting because `group_by()` will open up this new world for us! The first thing I will calculate is the number of flights by airline per month. 

In [37]:
cmhflights %>%
  group_by(Month, Carrier) %>%
  tally()

Month,Carrier,n
<int>,<chr>,<int>
1,AA,387
1,DL,549
1,EV,436
1,F9,152
1,OO,123
1,UA,106
1,WN,2004
2,AA,360
2,DL,511
2,EV,362


You can get the same result if you use `summarize()`

In [38]:
cmhflights %>%
  group_by(Month, Carrier) %>%
  summarize(
    frequency = n()
    )

[1m[22m`summarise()` has grouped output by 'Month'. You can override using the
`.groups` argument.


Month,Carrier,frequency
<int>,<chr>,<int>
1,AA,387
1,DL,549
1,EV,436
1,F9,152
1,OO,123
1,UA,106
1,WN,2004
2,AA,360
2,DL,511
2,EV,362


To recap, there are two ways to do this, first with `tally()` and second with `summarize(frequency = n())`, both yielding the same result. Remember `tally()` because it is shorter code. 
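
There is in fact a third route, `count()`, which we met earlier, and the message about `.groups` printed above can be silenced by setting that argument explicitly. The sketch below, on made-up data (`toy` is an assumption), puts the three side by side.

```r
library(dplyr)

# Hypothetical flights: month and carrier only
toy <- tibble(
  Month = c(1, 1, 1, 2, 2),
  Carrier = c("AA", "AA", "UA", "AA", "UA")
)

toy %>% group_by(Month, Carrier) %>% tally() -> via_tally

# .groups = "drop" returns an ungrouped result and suppresses the message
toy %>%
  group_by(Month, Carrier) %>%
  summarize(n = n(), .groups = "drop") -> via_summarize

# count() does the grouping and the counting in one step
toy %>% count(Month, Carrier) -> via_count

via_count
```

All three produce the same frequencies; they differ only in how much typing is involved and in whether the result is still grouped afterwards.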

Now I want a table that gives us the number of flights per month per airline per destination. 

In [39]:
cmhflights %>%
  group_by(Month, Carrier, Dest) %>%
  tally()

Month,Carrier,Dest,n
<int>,<chr>,<chr>,<int>
1,AA,CMH,193
1,AA,DFW,112
1,AA,LAX,24
1,AA,PHX,58
1,DL,ATL,224
1,DL,CMH,275
1,DL,DTW,2
1,DL,LAX,27
1,DL,MSP,21
1,EV,CMH,218


We could keep enriching the grouping structure. For example, let us add the day of the week to the mix ... 

In [40]:
cmhflights %>%
  group_by(Month, Carrier, Dest, DayOfWeek) %>%
  tally()

Month,Carrier,Dest,DayOfWeek,n
<int>,<chr>,<chr>,<int>,<int>
1,AA,CMH,1,35
1,AA,CMH,2,23
1,AA,CMH,3,28
1,AA,CMH,4,28
1,AA,CMH,5,28
1,AA,CMH,6,21
1,AA,CMH,7,30
1,AA,DFW,1,20
1,AA,DFW,2,16
1,AA,DFW,3,16


Now say I am really curious about mean departure delays for the preceding grouping structure. That is, what does mean departure delay look like for flights by day of the week, by month, by carrier, and by destination?

In [41]:
cmhflights %>%
  group_by(Month, Carrier, Dest, DayOfWeek) %>%
  summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE))

[1m[22m`summarise()` has grouped output by 'Month', 'Carrier', 'Dest'. You can
override using the `.groups` argument.


Month,Carrier,Dest,DayOfWeek,mean_dep_delay
<int>,<chr>,<chr>,<int>,<dbl>
1,AA,CMH,1,16.2857143
1,AA,CMH,2,0.3043478
1,AA,CMH,3,1.1428571
1,AA,CMH,4,3.1071429
1,AA,CMH,5,4.1071429
1,AA,CMH,6,24.8571429
1,AA,CMH,7,21.8333333
1,AA,DFW,1,25.4000000
1,AA,DFW,2,-2.6250000
1,AA,DFW,3,17.6000000


But this is a complicated summary table. What if all I really want to know is which airline has the __highest mean departure delays__, regardless of month or destination or day of the week? This could be done by using `arrange()` to display the result in descending order of _mean_dep_delay_.

In [42]:
cmhflights %>%
  group_by(Carrier) %>%
  summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>%
  arrange(-mean_dep_delay) # ordered in descending order of delays

Carrier,mean_dep_delay
<chr>,<dbl>
EV,15.207027
F9,10.839665
WN,9.581221
AA,7.733938
DL,7.095979
OO,6.438614
UA,6.39042


EV is Express Jet; F9 is Frontier Airlines; WN is Southwest Airlines; OO is SkyWest Airlines; AA is American Airlines; DL is Delta Airlines; UA is United Airlines. So clearly United Airlines had the lowest average departure delays. Would this still be true if we repeated the calculation by Month? 

In [43]:
cmhflights %>%
  group_by(Carrier, Month) %>%
  summarise(mean_dep_delay = mean(DepDelay, na.rm = TRUE)) %>%
  arrange(mean_dep_delay) # ordered in ascending order of delays

[1m[22m`summarise()` has grouped output by 'Carrier'. You can override using the
`.groups` argument.


Carrier,Month,mean_dep_delay
<chr>,<int>,<dbl>
OO,8,-0.8727273
OO,2,-0.5700000
OO,9,-0.1792453
UA,9,0.6989796
DL,8,1.8528529
DL,2,2.2367906
AA,2,2.3027778
F9,9,2.5086705
AA,9,2.6250000
DL,3,2.9815838


All righty then! Looks like three of the lowest mean departure delays were for SkyWest. Do not let the negative numbers throw you for a loop; a negative value implies the flight left earlier than scheduled. 

So far so good. But now I am curious about what percent of flights operated by AA, DL, UA, and WN were delayed. How could I calculate this? 

(1) I need to use `filter()` to restrict the data-set to just these four airlines.  
(2) Then I need to generate a new column that identifies whether a flight was delayed or not (`late`).  
(3) Now we can calculate the total number of flights (`nflights`) and the total number of flights that were delayed (`nlate`).  
(4) If I then calculate $\left( \dfrac{nlate}{nflights} \right)\times 100$ we will end up with the percent of flights that were delayed.  

In [44]:
cmhflights %>%
  select(c(Carrier, DepDelay)) %>%
  filter(
      Carrier %in% c("AA", "DL", "UA", "WN")
  ) %>%
  mutate(
      late = case_when(
          DepDelay > 0 ~ "Yes",
          DepDelay <= 0 ~ "No"
      )
  ) %>% 
  group_by(Carrier) %>%
  mutate(
      nflights = n()
  ) %>%
  group_by(Carrier, late) %>%
  mutate(
    nlate = n(),
    pct_late = (nlate / nflights) * 100
  ) -> df1

df1

Carrier,DepDelay,late,nflights,nlate,pct_late
<chr>,<dbl>,<chr>,<int>,<int>,<dbl>
AA,-9,No,3891,2698,69.33950
AA,24,Yes,3891,1162,29.86379
AA,-6,No,3891,2698,69.33950
AA,-5,No,3891,2698,69.33950
AA,-7,No,3891,2698,69.33950
AA,22,Yes,3891,1162,29.86379
AA,-4,No,3891,2698,69.33950
AA,-4,No,3891,2698,69.33950
AA,42,Yes,3891,1162,29.86379
AA,-6,No,3891,2698,69.33950


There is a whole lot going on here so let us break it down. 

`filter(Carrier %in% c("AA", "DL", "UA", "WN"))` is keeping specified airlines' data while dropping the rest

`mutate(late = case_when(...))` is creating a new column called `late` and storing a value of "Yes" if `DepDelay > 0` (i.e., the flight was delayed by 1 or more minutes) and "No" if `DepDelay <= 0` (i.e., if the flight departed on time or earlier than the scheduled departure time) 

`group_by(Carrier)` is grouping by Carrier and then counting with `mutate(nflights = n())` how many flights there were per Carrier and storing this sum in a new column called `nflights`

`group_by(Carrier, late)` then regroups the data, this time by Carrier and if the flight was late or not


`mutate(nlate = n(), pct_late = (nlate / nflights) * 100)` is then creating two new columns, `nlate` -- the number of flights per late values of "Yes" and "No", respectively, and then `pct_late` -- the percent of flights per carrier that were late.  

Now, we only want the flights that were late, so let us use `filter()` to keep only rows corresponding to `late == "Yes"` and then apply `select()` to keep just a few columns. Note the `ungroup()` in between: without it, `select()` would hold on to the grouping columns. 

This will still leave us with duplicate rows but we can drop these duplicate rows via a new command, `distinct()`  -- which if left empty inside the parentheses `()` looks for all unique rows of data (each row ends up with a unique combination of all columns' values). If you want unique values of specific columns then those column names can be inserted inside the parentheses `()`

In [45]:
df1 %>%
  filter(late == "Yes") %>%
  ungroup() %>%
  select(Carrier, pct_late) %>%
  distinct() %>%
  arrange(pct_late)

Carrier,pct_late
<chr>,<dbl>
UA,23.96857
DL,27.72677
AA,29.86379
WN,41.06911


So! 24% of UA flights were late, the lowest in this group. 

What if we wanted to do this for all airlines, and we want the calculations to be done by Month as well?

In [46]:
cmhflights %>%
  select(c(Carrier, Month, DepDelay)) %>%
#  filter(Carrier %in% c("AA", "DL", "UA", "WN")) %>%
  mutate(late = case_when(
    DepDelay > 0 ~ "Yes",
    DepDelay <= 0 ~ "No"
      )
    ) %>% 
  group_by(Carrier, Month) %>%
  mutate(nflights = n()) %>%
  group_by(Carrier, Month, late) %>%
  mutate(
    nlate = n(),
    pct_late = (nlate / nflights) * 100) %>%
  filter(late == "Yes") %>%
  ungroup() %>%  
  select(Carrier, Month, pct_late) %>%
  distinct() %>%
  arrange(pct_late)

Carrier,Month,pct_late
<chr>,<int>,<dbl>
OO,8,11.60714
OO,9,14.15094
OO,2,16.00000
UA,9,17.85714
OO,3,18.10345
UA,2,18.58974
DL,9,20.19704
UA,5,20.22472
UA,4,20.42254
OO,4,21.18056


Before we move on, I want to point out something about `case_when()`. Specifically, we used it to create a new column called `late` from numeric values found in `DepDelay`. But what if we wanted to create a new column from a column that had categorical variables in it, like `Dest` or `Carrier`? Easy.

In [47]:
cmhflights %>%
    mutate(
      carrier_name = case_when(
          Carrier == "AA" ~ "American Airlines",
          Carrier == "DL" ~ "Delta Airlines",
          Carrier == "UA" ~ "United Airlines",
          Carrier == "EV" ~ "Express Jet",
          Carrier == "F9" ~ "Frontier Airlines",
          Carrier == "WN" ~ "Southwest Airlines",
          Carrier == "OO" ~ "SkyWest Airlines"
      )
  ) %>%
    select(Carrier, carrier_name) %>%
    group_by(Carrier, carrier_name) %>%
    tally()

Carrier,carrier_name,n
<chr>,<chr>,<int>
AA,American Airlines,3891
DL,Delta Airlines,5446
EV,Express Jet,3056
F9,Frontier Airlines,1568
OO,SkyWest Airlines,2041
UA,United Airlines,1527
WN,Southwest Airlines,18464


Second, `case_when()` includes an option that cuts down on our work. In particular, say I want to create a new column, `weekend`, that is "Yes" if the flight was on a Saturday or Sunday and "No" on any other day. In doing this, it would serve us well to remember that in these data the week begins on Monday, so `DayOfWeek == 1` is Monday, and Saturday and Sunday are 6 and 7, respectively. (You can verify this in the `filter()` output earlier: 2017-01-26, a Thursday, has `DayOfWeek` equal to 4.) 

In [48]:
cmhflights %>%
  mutate(
      weekend = case_when(
          DayOfWeek %in% c(6, 7) ~ "Yes",
          TRUE ~ "No"
      )
    ) %>%
  select(DayOfWeek, weekend) %>%
  distinct()

DayOfWeek,weekend
<int>,<chr>
7,Yes
1,No
2,No
3,No
4,No
5,No
6,Yes


Notice how `TRUE` swept up all other values of `DayOfWeek` and coded them as "No."  
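
One caution about the catch-all: `TRUE` sweeps up missing values too, because a comparison involving `NA` never counts as a match. The sketch below, on made-up data (`toy` and the two new column names are assumptions), contrasts a loose version with one that handles `NA` first.

```r
library(dplyr)

# Hypothetical delays, the last one missing
toy <- tibble(DepDelay = c(12, -3, NA))

toy %>%
  mutate(
    # The missing delay falls through to the catch-all and becomes "No"
    late_loose = case_when(
      DepDelay > 0 ~ "Yes",
      TRUE ~ "No"
    ),
    # Conditions are checked in order, so deal with NA before anything else
    late_strict = case_when(
      is.na(DepDelay) ~ NA_character_,
      DepDelay > 0 ~ "Yes",
      TRUE ~ "No"
    )
  ) -> toy_flagged

toy_flagged
```

Whether a missing value should count as "No" depends on the question being asked, so it is worth deciding deliberately rather than letting the catch-all decide for you.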

One final showcasing of `case_when()`. In **Module 01** we looked at the `hsb2` data and created some `factors` for columns such as female, ses, schtyp, and so on. Well, let us see how the same thing could be done with `case_when()`.

In [49]:
read.table(
  'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
  header = TRUE,
  sep = ","
  ) -> hsb2

In [50]:
hsb2 %>%
  mutate(
    female.f = case_when(
      female == 0 ~ "Male",
      female == 1 ~ "Female"),
    race.f = case_when(
      race == 1 ~ "Hispanic",
      race == 2 ~ "Asian",
      race == 3 ~ "African-American",
      TRUE ~ "White"),
    ses.f = case_when(
      ses == 1 ~ "Low",
      ses == 2 ~ "Medium",
      TRUE ~ "High"),
    schtyp.f = case_when(
      schtyp == 1 ~ "Public",
      TRUE ~ "Private"),
    prog.f = case_when(
      prog == 1 ~ "General",
      prog == 2 ~ "Academic",
      TRUE ~ "Vocational")
    ) -> hsb2

hsb2

id,female,race,ses,schtyp,prog,read,write,math,science,socst,female.f,race.f,ses.f,schtyp.f,prog.f
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
70,0,4,1,1,1,57,52,41,47,57,Male,White,Low,Public,General
121,1,4,2,1,3,68,59,53,63,61,Female,White,Medium,Public,Vocational
86,0,4,3,1,1,44,33,54,58,31,Male,White,High,Public,General
141,0,4,3,1,3,63,44,47,53,56,Male,White,High,Public,Vocational
172,0,4,2,1,2,47,52,57,53,61,Male,White,Medium,Public,Academic
113,0,4,2,1,2,44,52,51,63,61,Male,White,Medium,Public,Academic
50,0,3,2,1,1,50,59,42,53,61,Male,African-American,Medium,Public,General
11,0,1,2,1,2,34,46,45,39,36,Male,Hispanic,Medium,Public,Academic
84,0,4,2,1,1,63,57,54,58,51,Male,White,Medium,Public,General
48,0,3,2,1,2,57,55,52,50,51,Male,African-American,Medium,Public,Academic


-----------

## Some other `{dplyr}` commands
We have seen `count()` in action but let us see it again, in a slightly different light. In particular, say I want to know which destinations are connected by air from Columbus, and how many flights go to each. 

### count()

In [51]:
cmhflights %>%
  filter(Origin == "CMH") %>%
  count(Dest, sort = TRUE)

Dest,n
<chr>,<int>
ATL,2884
MDW,1511
MCO,1148
DFW,1122
DEN,971
BWI,948
LAS,815
PHX,815
ORD,803
EWR,736


Note: There is no need for `group_by()` here. And `sort = TRUE` arranges the result in descending order of the frequency (`n`). Here is another code example, this time adding Carrier to the mix. 

In [52]:
cmhflights %>%
  filter(Origin == "CMH") %>%
  count(Carrier, Dest, sort = TRUE)

Carrier,Dest,n
<chr>,<chr>,<int>
DL,ATL,2141
WN,MDW,1511
AA,DFW,1122
WN,BWI,948
WN,MCO,929
WN,ATL,743
EV,EWR,736
WN,TPA,595
UA,ORD,577
WN,LAS,542


How does this help us? Well, now we know that if we were flying to Atlanta, Delta would have the most flights, but if we were flying to the Chicago area then Southwest should be our pick.  
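
To see why `count()` needs no `group_by()`, here is a minimal sketch on a made-up tibble called `toy` (the values are hypothetical, not taken from `cmhflights`). It compares `count()` with the longer `group_by()` plus `summarise()` plus `arrange()` route.

```r
library(dplyr)

# Toy flights (hypothetical values, not taken from cmhflights)
tibble(
  Origin = c("CMH", "CMH", "CMH", "CMH", "ATL"),
  Dest   = c("ATL", "ATL", "MDW", "ATL", "CMH")
    ) -> toy

# Short form: count() groups, tallies, and sorts in one step
toy %>%
  filter(Origin == "CMH") %>%
  count(Dest, sort = TRUE) -> short_way

# Long form: the same steps written out one verb at a time
toy %>%
  filter(Origin == "CMH") %>%
  group_by(Dest) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) -> long_way

short_way; long_way
```

Both routes should give the same two rows, with ATL first, so `count()` is simply the shorter way of writing the same thing.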

### n_distinct() 
Another useful command is `n_distinct()`, which calculates the number of distinct values in any column. For example, say I want to know how many unique aircraft (not airlines) there are in this data-set. 

In [53]:
cmhflights %>%
  summarise(n_distinct(TailNum))

n_distinct(TailNum)
<int>
2248
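
A small sketch may help show how `n_distinct()` treats missing values. The tibble `toy` below is made up for illustration. By default a missing value is counted as one more distinct value, so a count like the one above may include a missing tail number unless `na.rm = TRUE` is used.

```r
library(dplyr)

# Toy tail numbers (hypothetical), with one repeat and one missing value
tibble(TailNum = c("N1", "N2", "N1", NA, "N3")) -> toy

toy %>%
  summarise(
    with_na    = n_distinct(TailNum),               # NA counts as its own value
    without_na = n_distinct(TailNum, na.rm = TRUE), # NA is ignored
    base_r     = length(unique(TailNum))            # base R equivalent of with_na
  ) -> toy_counts

toy_counts
```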


### top_n()
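If you want to see the top 'n' observations, for example the 4 airlines with the most aircraft, you can lean on `top_n()`, as shown below. Note that although the new column is named `num.flights`, it holds the number of distinct tail numbers (aircraft) per carrier, not the number of flights.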

In [54]:
cmhflights %>%
  group_by(Carrier) %>%
  summarise(num.flights = n_distinct(TailNum)) %>%
  arrange(-num.flights) %>% 
  top_n(4)

[1m[22mSelecting by num.flights


Carrier,num.flights
<chr>,<int>
WN,751
DL,539
UA,289
OO,222


I am also curious about which aircraft has flown the most, and then maybe the 3 other aircraft that follow in descending order.

In [55]:
cmhflights %>%
  filter(!is.na(TailNum)) %>% # Removing some missing cases 
  group_by(TailNum) %>%
  tally() %>% 
  arrange(-n) %>%
  top_n(4)

[1m[22mSelecting by n


TailNum,n
<chr>,<int>
N396SW,74
N601WN,66
N646SW,64
N635SW,62
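
In recent versions of `{dplyr}` (1.0.0 and later) the documentation describes `top_n()` as superseded by `slice_max()` and `slice_min()`. A minimal sketch on a made-up tibble called `toy`, with a hypothetical column `flights`, shows the newer form. It names the ordering column explicitly, so there is no "Selecting by" message, and it returns the rows already sorted.

```r
library(dplyr)

# Toy tally (hypothetical values), deliberately out of order
tibble(
  TailNum = c("N5", "N2", "N1", "N4", "N3"),
  flights = c(10, 66, 74, 62, 64)
    ) -> toy

# Keep the 4 rows with the largest values of flights
toy %>%
  slice_max(flights, n = 4) -> top4

top4
```

With ties in the ordering column, `slice_max()` keeps all tied rows by default, so you may get more than `n` rows back.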


### join()
You will, from time to time, need to merge multiple data-sets together. For example, say I have the following data-sets I have created for demonstration purposes. 

In [56]:
tibble(
  Name = c("Tim", "Tammy", "Bubbles", "Panda"),
  Score = c(5, 8, 9, 10)
    ) -> df1

In [57]:
tibble(
  Name = c("Tim", "Tammy", "Bubbles"),
  Age = c(25, 78, 19)
    ) -> df2

In [58]:
tibble(
  Name = c("Tim", "Tammy", "Panda"),
  Education = c("BA", "PhD", "JD")
    ) -> df3

In [59]:
df1; df2; df3

Name,Score
<chr>,<dbl>
Tim,5
Tammy,8
Bubbles,9
Panda,10


Name,Age
<chr>,<dbl>
Tim,25
Tammy,78
Bubbles,19


Name,Education
<chr>,<chr>
Tim,BA
Tammy,PhD
Panda,JD


Notice that Panda is absent from `df2` and Bubbles is absent from `df3`. So if we wanted to build ONE data-set with all data for Tim, Tammy, Bubbles, and Panda, some of the information would be missing for some of these folks. But how could we construct ONE data-set? Via one of the `*_join()` commands. 

#### full_join() 
Let us start with a simple `full_join()`, where we link up every individual in df1 or df2 or df3 **regardless of whether they are seen in all of the data-sets**. 

In [60]:
df1 %>%
  full_join(df2, by = "Name") %>%
  full_join(df3, by = "Name") 

Name,Score,Age,Education
<chr>,<dbl>,<dbl>,<chr>
Tim,5,25.0,BA
Tammy,8,78.0,PhD
Bubbles,9,19.0,
Panda,10,,JD


Pay attention to two things: (i) Name connects the records in each data-set, and so it must be spelled exactly the same for a link to be made, and (ii) `full_join()` links up all individuals regardless of whether they are missing any information in any of the data-sets. This is usually how most folks will link up multiple files unless they only want records found in a master file. For example, say I want to link up df2 and df3 such that the final result keeps every record found in df2 (the master data-set) and adds information from df3 only where a matching Name exists. Here `left_join()` does the job:

In [61]:
df2 %>%
  left_join(df3, by = "Name")  

Name,Age,Education
<chr>,<dbl>,<chr>
Tim,25,BA
Tammy,78,PhD
Bubbles,19,


Notice that Panda is dropped because it is not found in df2. 

Maybe you want df3 to be the master file, in which case you would see a different result (with Bubbles not seen in the result since Bubbles is found in df2 but not in df3): 

In [62]:
df3 %>%
  left_join(df2, by = "Name")  

Name,Education,Age
<chr>,<chr>,<dbl>
Tim,BA,25.0
Tammy,PhD,78.0
Panda,JD,


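Rarely, but definitely not "never," you may want to see the records in one data-set that have no match in the other. Here, `anti_join()` comes in handy: it keeps the rows of the first data-set whose Name is not found in the second, thus: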

In [63]:
df2 %>%
  anti_join(df3, by = "Name")

Name,Age
<chr>,<dbl>
Bubbles,19


In [64]:
df3 %>%
  anti_join(df2, by = "Name")

Name,Education
<chr>,<chr>
Panda,JD
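
For completeness, `{dplyr}` also has an `inner_join()`, which keeps only the records found in both data-sets. The sketch below rebuilds `df2` and `df3` exactly as defined earlier so it can run on its own, and stores the result in `both`, a name chosen here just for illustration.

```r
library(dplyr)

# Same demonstration data-sets as above
tibble(
  Name = c("Tim", "Tammy", "Bubbles"),
  Age = c(25, 78, 19)
    ) -> df2

tibble(
  Name = c("Tim", "Tammy", "Panda"),
  Education = c("BA", "PhD", "JD")
    ) -> df3

# Keep only the people who appear in BOTH df2 and df3
df2 %>%
  inner_join(df3, by = "Name") -> both

both
```

Only Tim and Tammy should survive, since Bubbles is missing from `df3` and Panda is missing from `df2`. These are exactly the two records the `anti_join()` calls above picked out.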


--------------------

## Two other useful commands 
### {santoku}
Every now and then you may want or need to create a grouped version of some numeric variable. For example, we have `DepDelay` for all flights but want to group it into `quartiles`: the bottom 25%, the next 25%, the next 25%, and finally the highest 25%. There are many ways to do this but the easiest might be to use a dedicated library, `{santoku}`. 

In [70]:
install.packages("santoku")


In [71]:
library(santoku)

cmhflights %>%
  mutate(
    depdelay_groups = chop_equally(DepDelay, groups = 4)
      ) %>%
  group_by(depdelay_groups) %>%
  tally()


Attaching package: ‘santoku’


The following object is masked from ‘package:tidyr’:

    chop




depdelay_groups,n
<fct>,<int>
"[-27, -5)",6887
"[-5, -2)",9267
"[-2, 7)",10143
"[7, 1323]",9225
,471
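
To see what the quartile grouping is doing, and why there is an unlabeled row in the output, here is a minimal sketch in base R on a made-up vector `x`. It uses `quantile()` and `cut()` rather than `{santoku}`, so the interval labels differ from those of `chop_equally()`, but the idea is the same: cut at the 25th, 50th, and 75th percentiles, and leave missing values ungrouped.

```r
# Toy delays (hypothetical), including one missing value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, NA)

# Cut-points at the minimum, the three quartiles, and the maximum
cuts <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Assign each value to its quartile; NA stays NA
groups <- cut(x, breaks = cuts, include.lowest = TRUE)

# Two values per quartile, plus one NA
table(groups, useNA = "ifany")
```

The unlabeled row in the `cmhflights` output plays the same role as the `NA` row here: those are flights with a missing `DepDelay`.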


What if we wanted to slice up DepDelay in specific intervals, first at 0, then at 15, then at 30, and then at 45? 

In [72]:
cmhflights %>%
  mutate(
    depdelay_groups = chop(DepDelay, breaks = c(0, 15, 30, 45))
      ) %>%
  group_by(depdelay_groups) %>%
  tally()

depdelay_groups,n
<fct>,<int>
"[-27, 0)",21108
"[0, 15)",7883
"[15, 30)",2567
"[30, 45)",1240
"[45, 1323]",2724
,471


We could also create quintiles (5 groups) or deciles (10 groups) as shown below: 

In [73]:
cmhflights %>%
  filter(!is.na(DepDelay)) %>%
  mutate(
    depdelay_groups = chop_quantiles(
      DepDelay, c(0.2, 0.4, 0.6, 0.8)
    )
  ) %>%
  group_by(depdelay_groups) %>%
  tally()

depdelay_groups,n
<fct>,<int>
"[0%, 20%)",6887
"[20%, 40%)",6342
"[40%, 60%)",7879
"[60%, 80%)",7035
"[80%, 100%]",7379


In [74]:
cmhflights %>%
  filter(!is.na(DepDelay)) %>%
  mutate(
    depdelay_groups = chop_quantiles(
      DepDelay, seq(0.1, 0.9, by = 0.1)
    )
  ) %>%
  group_by(depdelay_groups) %>%
  tally()

depdelay_groups,n
<fct>,<int>
"[0%, 10%)",3006
"[10%, 20%)",3881
"[20%, 30%)",3344
"[30%, 40%)",2998
"[40%, 50%)",2925
"[50%, 60%)",4954
"[60%, 70%)",3262
"[70%, 80%)",3773
"[80%, 90%)",3805
"[90%, 100%]",3574


You could also ask for groups that all have the same width. Below we create 4 groups. Note that the width of each group is exactly 337.5, which is the range of `DepDelay` divided by 4: (1323 - (-27)) / 4 = 337.5. You can check this against the cut-points: 1323 - 985.5 = 337.5; 985.5 - 648 = 337.5; and so on. In this particular example this chopping isn't very useful since we end up with almost all of the data in the very first group. However, there are occasions where equal widths are useful. 

In [75]:
cmhflights %>%
  filter(!is.na(DepDelay)) %>%
  mutate(
    depdelay_groups = chop_evenly(
        DepDelay,
        4
    )
  ) %>%
  group_by(depdelay_groups) %>%
  tally()

depdelay_groups,n
<fct>,<int>
"[-27, 310.5)",35439
"[310.5, 648)",68
"[648, 985.5)",5
"[985.5, 1323]",10

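The equal-width cut-points can be reproduced by hand. The sketch below uses only base R and the minimum (-27) and maximum (1323) of `DepDelay` seen in the output above. The names `lo`, `hi`, `width`, and `breaks` are chosen here for illustration.

```r
# Minimum and maximum DepDelay, taken from the output above
lo <- -27
hi <- 1323

# Width of each of the 4 equal-width groups
width <- (hi - lo) / 4
width

# The 5 cut-points that bound the 4 groups
breaks <- seq(lo, hi, length.out = 5)
breaks
```

The cut-points should come out as -27, 310.5, 648, 985.5, and 1323, matching the interval labels in the output.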

### ordered()
More often than we would like, we end up with categorical variables that should follow a certain order but do not. For example, say you have survey data where people were asked to respond whether they Agree, are Neutral, or Disagree with some statement. Let us also assume that the frequencies are as follows:

In [76]:
tibble(
  response = c(
      rep("Agree", 25), 
      rep("Neutral", 30), 
      rep("Disagree", 45)
      )
    ) -> mydf

In [77]:
mydf %>%
  group_by(response) %>%
  tally()

response,n
<chr>,<int>
Agree,25
Disagree,45
Neutral,30


Notice how the responses are out of order, with Agree followed by Disagree, then Neutral, since R defaults to alphabetic ordering for anything that is a categorical variable. One way to ensure the correct ordering of categorical variables is via `ordered()`, where you list the levels in the order you want, as shown below. 

In [78]:
mydf %>%
  mutate(
      ordered.response = ordered(
          response,
          levels = c("Agree", "Neutral", "Disagree")
      )
    ) %>%
  group_by(ordered.response) %>%
  tally()

ordered.response,n
<ord>,<int>
Agree,25
Neutral,30
Disagree,45
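
An ordered factor does more than fix the display order. The base R sketch below, on a made-up vector `response`, compares a plain `factor()` with `ordered()`. The names `plain` and `ranked` are chosen here for illustration.

```r
# Toy responses (hypothetical)
response <- c("Agree", "Neutral", "Disagree", "Agree")

# Plain factor: levels default to alphabetical order
plain <- factor(response)
levels(plain)

# Ordered factor: levels follow the order we specify
ranked <- ordered(response, levels = c("Agree", "Neutral", "Disagree"))
levels(ranked)

# Comparisons and max() respect the specified order
ranked[1] < ranked[3]
max(ranked)
```

Because the levels are ranked, "Agree" is treated as lower than "Disagree", which would not be a meaningful comparison with a plain factor.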


## Concluding thoughts 
We have covered a lot of ground here but every inch has been critical space. Google any question we have tackled and you will see how many R-users ask the same questions ... how do I calculate mean for groups in R? What you have seen is the heart of the `{dplyr}` package. We saw grouped operations, and we saw the use of `summarise()`, `mutate()`, `case_when()`, `distinct()`, `filter()`, `arrange()`, `select()`, `count()`, and `tally()`. I will let you in on a secret; while these are core functions, there are others you could experiment with. Look up the cheat-sheet [here](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). 

Practice what we have done, with the `{nycflights13}` data-set perhaps to get something familiar yet sufficiently different to test your fundamentals. Maybe pick a non-travel data-set altogether, perhaps one of the `tidytuesday` data-sets. What is that you ask? Discover it for yourself [here](https://github.com/rfordatascience/tidytuesday). Bon voyage! Don't go too far because we will be working with two new packages next week -- `{tidyr}` and `{lubridate}`. 

-----

# Exercises for Practice 
## Exercise 01
Why are our best and most experienced employees leaving prematurely? 

The data [available here](https://aniruhil.github.io/avsr/teaching/dataviz/HR_comma_sep.csv) includes information on several current and former employees of an anonymous organization. Fields in the data-set include: 

| Variable | Description |
| :---- | :---- |
| satisfaction_level | = Level of satisfaction (numeric; 0-1) |
| last_evaluation | = Evaluation score of the employee (numeric; 0-1) |
| number_project | = Number of projects completed while at work (numeric) |
| average_monthly_hours | = Average monthly hours spent at the workplace (numeric)  |
| time_spend_company | = Number of years spent in the company (numeric) |
| Work_accident | = Whether the employee had a workplace accident (categorical; 1 = yes or 0 = no) |
| left | = Whether the employee left the workplace or not (categorical; 1 = left or 0 = stayed)  |
| promotion_last_5years | = Whether the employee was promoted in the last five years (categorical; 1 = yes or 0 = no) |
| sales | = Department in which they work (categorical) |
| salary | = Relative level of salary (categorical; low, med, and high) |

(a) Read in the csv-format data-set, naming it `hrdata` and save it in RData format as `hrdata.RData` 

(b) Create new variables that add labels to Work_accident, left, promotion_last_5years, and add these three to `hrdata`.  

(c) Now retain only employees who left the company, and had not been promoted in the last five years. Save this result as `hr01`

(d) In this `hr01` data-set, how many employees do you have per sales department? Which sales department has the most employees? 

(e) By sales department, calculate mean and standard deviation of (i) satisfaction_level, and (ii) last_evaluation. 

(f) What department has the lowest mean satisfaction? Which department has the highest variation in satisfaction?  

(g) Create a new variable that groups `average_montly_hours` into 4 groups. You can let the group cut-points be chosen automatically with `chop_evenly()`. Then show the frequencies of each group.

## Exercise 02
Thanks to the frenetic work of many individuals, the global spread of the Novel Coronavirus (COVID-19) has been tracked and the data made available for analysis. [Yanchang Zhao](https://rdatamining.wordpress.com/2020/03/10/coronavirus-data-analysis-with-r-tidyverse-and-ggplot2/) is one such individual and for this exercise we will use a specific version of his data that I have named `cvdata.RData`, available in the `data` folder. Read it in via the `load()` command. We can then answer a few questions. Note the contents: 

+ `country =` name of the country 
+ `date =` date of incidents as recorded 
+ `confirmed =` cumulative count of the number of people who tested positive  
+ `deaths =` cumulative count of the number of people lost to Covid-19 
+ `recovered =` cumulative count of the number of people recovered  

(a) Filter the data-set so that we have only one row per country, the data from March 10, 2020 and call it `cv0310`. 

(b) How many countries have lost `at least one` person to this tragedy? "Others" should not show up as one of the countries.  

(c) What 10 countries have had the most number of confirmed cases? "Others" should not show up as one of the countries. Also ensure the result is organized in descending order of the number of confirmed cases. 

(d) Calculate the `fatality_rate`, defined for our purposes as deaths as a percent of confirmed cases. Excluding "Others", and keeping only countries that have had `at least 10` confirmed cases, arrange the result to show the top-10 countries in descending order of `fatality_rate`.  

(e) Say we only want to focus on the Baltic countries (Estonia, Latvia, and Lithuania) as a unified group and compare this group to the ASEAN nations (Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, and Vietnam). Use `cv0310` to complete the following tasks: 

(i) Create a new variable called `region` that only takes on two values -- "Baltic" if the country is a Baltic country and "Asean" if the country is an ASEAN country. 

(ii) Use this variable to calculate the total number of confirmed cases in each region. 