# MPA 5830 - Module 04

## {lubridate} 
Working with dates and times is not as simple as it looks, and the reasons are as many as they are diverse. Let us spell out a few of the major ones. 

For one, dates come in all shapes and sizes, from text that looks like "Monday, Jan10, 2010" to strings like "2020-28-01 12:49". No matter what software you use, you have to be able to convert all dates into a standard format.  

Second, you often need to calculate how much time has passed between events: how many days go by before a patient returns to the Emergency Room (ER), how many days pass between Covid-19 deaths in an infected population, how many hours it takes to fly from Columbus to San Francisco, and so on. For this you need to be able to move between months, days, hours, etc. with ease, AND calculate the length of time in a way that automatically adjusts for leap years. 
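As a small illustration of the leap-year point, the sketch below counts the days from February 1 to March 1 in 2020 (a leap year) and in 2019 (not one). The dates are made up for illustration; `ymd()` and `leap_year()` come from `lubridate`.

```r
library(lubridate)

# Days from February 1 to March 1 in a leap year and in a common year
days_2020 <- as.numeric(ymd("2020-03-01") - ymd("2020-02-01"))
days_2019 <- as.numeric(ymd("2019-03-01") - ymd("2019-02-01"))

days_2020        # 29, because 2020 has a February 29
days_2019        # 28
leap_year(2020)  # TRUE
leap_year(2019)  # FALSE
```

Because the subtraction is done on proper dates rather than on text, the extra day is accounted for without any special handling.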

There are many more reasons I could advance but we might as well start working with dates. First up, some mangled date entries and we'll see how to parse them into correct date formats! We will rely on the `lubridate` package to do this.

In [None]:
"20171217" -> today1 
"2017-12-17" -> today2 
"2017 December 17" -> today3 
"20171217143241" -> today4 
"2017 December 17 14:32:41" -> today5 
"December 17 2017 14:32:41" -> today6 
"17-Dec, 2017 14:32:41" -> today7 

Now we fix them up!

In [None]:
library(tidyverse)
library(lubridate)

In [None]:
ymd(today1) -> date1
ymd(today2) -> date2 
ymd(today3) -> date3 

In [None]:
date1; date2; date3

`today1`, `today2`, and `today3` all had the same structure of year-month-day and so `ymd()` works to get the format right. 

`today4` has year-month-day-hours-minutes-seconds so we'll have to do this one slightly differently. The same thing works for `today5` as well.

In [None]:
ymd_hms(today4) -> date4 
ymd_hms(today5) -> date5 

In [None]:
date4; date5

`today6` has a slightly different format, `month-day-year-hours-minutes-seconds` that is read in thus:

In [None]:
mdy_hms(today6) -> date6 

In [None]:
date6

`today7` has a slightly different format, `day-month-year-hours-minutes-seconds` that is read in thus:

In [None]:
dmy_hms(today7) -> date7 

In [None]:
date7

Notice how, regardless of the format of the raw data, merely invoking the correct sequence of `ymd`, `mdy`, `dmy`, etc. turns the raw text into a properly formatted date. 
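Choosing the right sequence matters because the same text can mean different dates. The sketch below uses a made-up, ambiguous string to show how `mdy()` and `dmy()` read it differently, and also checks what kind of object the parsers return.

```r
library(lubridate)

ambiguous <- "01-02-2017"   # made-up string; could be Jan 2 or Feb 1

as_mdy <- mdy(ambiguous)    # read as month-day-year
as_dmy <- dmy(ambiguous)    # read as day-month-year

as_mdy  # "2017-01-02"
as_dmy  # "2017-02-01"

# Date-only parsers return Date objects; the *_hms variants return date-times
class(ymd("2017-12-17"))
class(ymd_hms("2017-12-17 14:32:41"))
```

Neither reading is wrong as far as R is concerned, so it is up to you to know how the raw data were recorded.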

## Working with flight dates
Now we should be able to start working with some date variables, and the ideal candidate would be the flight date column in our `cmhflights` data. So the first thing we will do is load that data-set so that we can work with it. 

In [None]:
load("data/cmhflights_01092017.RData")

In [None]:
names(cmhflights)

I dislike the uppercase-lowercase mixture they have in their column names and so will get rid of it as shown below, making everything nice and lowercase. This is done with the `janitor` package's `clean_names()` command. 

I am also going to use `select()` to keep only a handful of columns since keeping 100+ is of no value. 

In [None]:
library(janitor)

In [None]:
cmhflights %>%
  clean_names() %>%
  select(
    year, month, dayof_month, day_of_week, flight_date, carrier,
    tail_num, flight_num, origin_city_name, dest_city_name,
    dep_time, dep_delay, arr_time, arr_delay, cancelled, diverted
    ) -> cmh.df

The first thing I want to do now is to label the days of the week, the months, and then also create a flag for the `weekend` versus `weekdays`. Here goes:

In [None]:
cmh.df %>%
  mutate(
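    # label the weekday from the date itself: lubridate numbers days from
    # Sunday = 1 by default, while flight files commonly code 1 = Monday
    dayofweek = wday(
      as_date(flight_date),
      abbr = FALSE,
      label = TRUE
      ),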
    monthname = month(
      month,
      abbr = FALSE,
      label = TRUE
      ),
    weekend = case_when(
      dayofweek %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
      )
    ) -> cmh.df 
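To see what these labels look like, here is a minimal sketch on three made-up dates (a Friday, a Saturday, and a Sunday). It assumes an English locale, since the day and month names returned by `wday()` and `month()` depend on the locale.

```r
library(dplyr)
library(lubridate)

# Made-up dates for illustration: a Friday, a Saturday, a Sunday
toy <- tibble(flight_date = ymd(c("2017-12-15", "2017-12-16", "2017-12-17")))

toy %>%
  mutate(
    dayofweek = wday(flight_date, label = TRUE, abbr = FALSE),
    monthname = month(flight_date, label = TRUE, abbr = FALSE),
    weekend = case_when(
      dayofweek %in% c("Saturday", "Sunday") ~ "Weekend",
      TRUE ~ "Weekday"
    )
  ) -> toy

toy

# Without labels, wday() returns a number; by default Sunday is day 1
wday(ymd("2017-12-17"))
```

The last line is worth remembering whenever a data-set stores the day of the week as a number, because the source may not start its week on the same day that `lubridate` does.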

#### Now let us ask some questions:

(a) What month had the most flights?  
(b) What day of the week had the most flights?  
(c) What about weekends; did weekends have more flights than weekdays?  
(d) With respect to (c), does whatever pattern we see vary by month, or does month not matter? 

In [None]:
cmh.df %>%
  count(monthname, sort = TRUE) # (a)

In [None]:
cmh.df %>%
  count(dayofweek, sort = TRUE) # (b) 

In [None]:
cmh.df %>%
  count(weekend, sort = TRUE) # (c) 

In [None]:
cmh.df %>%
  count(monthname, weekend, sort = TRUE) # (d) 

So most flights are on weekdays, but weekend flights lead in July while weekday flights lead in August. 

But wait a minute, if I can calculate these frequencies, why not do it by the hour? That may allow us to answer such questions as: What hour of the day has the most flights, the most delays? What about by airline? What if we push this to the minute of the hour? 

Well, first we will have to create a new variable that marks just the hour of the day in the 24-hour cycle. But to do this we will first need to create a single `flight_date_time` column that can be read in the `ymd_hm` format. How? With `unite()`.  

In [None]:
cmh.df %>%
  unite(
    col = "flight_date_time",
    c(flight_date, dep_time),
    sep = ":",
    remove = FALSE
  ) -> cmh.df

Okay, now we create `flt_date_time`. Note that the seconds are automatically set to `00` because the raw departure time does not include seconds, only hours and minutes. A proper date-time value must include seconds for accurate calculations, hence the `00` here.

In [None]:
cmh.df %>%
  mutate(
    flt_date_time = ymd_hm(flight_date_time)
      ) -> cmh.df
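The two steps above can be sketched on a tiny made-up table. The values, the `HH:MM` layout of `dep_time`, and the space separator are assumptions chosen for clarity rather than the layout of the real file; the second row has no departure time, which is one plausible way for a row to fail to parse.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Made-up rows; the second one has no departure time
toy <- tibble(
  flight_date = c("2017-01-01", "2017-01-02"),
  dep_time    = c("07:25", NA)
)

toy %>%
  unite(
    col = "flight_date_time",
    c(flight_date, dep_time),
    sep = " ",
    remove = FALSE
  ) %>%
  mutate(
    # quiet = TRUE hides the "failed to parse" warning
    flt_date_time = ymd_hm(flight_date_time, quiet = TRUE)
  ) -> toy

toy$flt_date_time          # first value parses, second is NA
second(toy$flt_date_time)  # seconds are 0 where parsing succeeded
```

Note that `unite()` pastes a missing value in as the text "NA", so the combined string is no longer a valid date-time and the parser returns `NA` for that row.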

In [None]:
head(cmh.df$flt_date_time)

The warning indicates there are 471 flight dates that could not be parsed correctly.

Now we extract the hour and the minute of each flight's departure, with `hour()` and `minute()`, respectively. 

In [None]:
cmh.df %>%
  mutate(
    flt_hour = hour(flt_date_time),
    flt_minute = minute(flt_date_time)
    ) -> cmh.df
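As a quick check of what these accessor functions return, the sketch below applies them to a few made-up departure times and then tallies the departures by hour with base R's `table()`.

```r
library(lubridate)

# Made-up departure times for illustration
deps <- ymd_hm(c("2017-12-17 06:05", "2017-12-17 06:50", "2017-12-17 17:15"))

dep_hours   <- hour(deps)    # 6 6 17
dep_minutes <- minute(deps)  # 5 50 15

table(dep_hours)             # two departures in hour 6, one in hour 17
```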

All righty then, now we start digging in. What hour has the most flights, and does this vary by the day of the week? By the month? 

In [None]:
cmh.df %>%
  count(flt_hour, sort = TRUE)

In [None]:
cmh.df %>%
  count(monthname, flt_hour, sort = TRUE)

Looks like 10:00 and then 17:00. These would be your best bets if you were looking to catch a flight and wanted as many options as possible. On the flip side, these might also be the times when flights get delayed more often because there are so many flights scheduled at these hours! 

Now I want to ask the question about delays: Are median delays higher at certain hours? Notice the results are being arranged first in descending and then in ascending order.

In [None]:
cmh.df %>%
  group_by(flt_hour) %>%
  summarise(md.delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(-md.delay)

In [None]:
cmh.df %>%
  group_by(flt_hour) %>%
  summarise(md.delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(md.delay)

The expected result -- shortest median delay is at 5 AM, and delays increase by the hour. 
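One detail in the summaries above deserves a brief look: `na.rm = TRUE`. The sketch below uses made-up delays, one of them missing, to show the grouped median and what happens when missing values are not removed.

```r
library(dplyr)

# Made-up delays (in minutes) for two departure hours; one value is missing
toy <- tibble(
  flt_hour  = c(5, 5, 5, 17, 17, 17),
  dep_delay = c(-3, -1, NA, 10, 25, 40)
)

toy %>%
  group_by(flt_hour) %>%
  summarise(md.delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(md.delay)) -> by_hour

by_hour                # hour 17 (median 25) is listed before hour 5 (median -2)

median(c(-3, -1, NA))  # NA, which is why na.rm = TRUE is needed
```

Here `arrange(desc(md.delay))` does the same job as `arrange(-md.delay)`; either form sorts from the largest median delay to the smallest.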

__Bottom-line:__ Fly as early as you can. Might this vary by destination?

In [None]:
cmh.df %>%
  group_by(dest_city_name, flt_hour) %>%
  summarise(md.delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(-md.delay)

Avoid flying to Newark, NJ, even at 6 or 7 AM. 

Might these vary by airline?

In [None]:
cmh.df %>%
  group_by(carrier, dest_city_name, flt_hour) %>%
  summarise(md.delay = median(dep_delay, na.rm = TRUE)) %>%
  arrange(-md.delay)

Worst early-morning delays are for EV, to Newark and to Chicago. 

## Passage of Time
Let us assume we are interested in seeing how much time lapses between successive flights of **each aircraft** seen in the data. We know we can identify each unique aircraft by its `tail_num`. So let us first see how many times each aircraft is seen and create a new column called `n_flew`. 

Some rows of data are missing `flt_date_time` and `tail_num` so I will filter these out as well. 

In [None]:
cmh.df %>%
  filter( # drops every row where either of these columns is missing 
    !is.na(tail_num),
    !is.na(flt_date_time) 
    ) %>% 
  group_by(tail_num) %>% 
  arrange(flt_date_time) %>% # each aircraft is now stacked by when it flew
  mutate(n_flew = row_number()) %>% # each time an aircraft is seen it gets a number: 1, 2, 3, and so on ... 
  select(tail_num, flt_date_time, n_flew) %>%
  arrange(-n_flew) -> cmh.df2 # N396SW is seen the most often in this data-set 

In [None]:
cmh.df2 %>%
  head()

So far so good; [N396SW is the winner and has well-earned its retirement](https://www.planespotters.net/airframe/boeing-737-n396sw-aerothrust-holdings/38l2ge). 

Now we need to see how much time lapsed between flights, and this is just the difference between each `flt_date_time` and the one recorded just **before** it for the same aircraft. As we do this, note that by default the time span (`tspan`) is calculated in seconds.  

In [None]:
cmh.df2 %>%
  group_by(tail_num) %>%
  arrange(flt_date_time) %>%
  mutate(
    tspan = interval(
      lag(flt_date_time, order_by = flt_date_time), flt_date_time
      ), # calculate the time span between successive flights of the same aircraft and save as new variable tspan
    tspan.minutes = as.duration(tspan)/dminutes(1), # convert tspan into minutes and save as tspan.minutes
    tspan.hours = as.duration(tspan)/dhours(1), # convert tspan into hours and save as tspan.hours
    tspan.days = as.duration(tspan)/ddays(1), # convert tspan into days and save as tspan.days 
    tspan.weeks = as.duration(tspan)/dweeks(1) # convert tspan into weeks and save as tspan.weeks 
    ) -> cmh.df2 

Here, `tspan` (measured in seconds) is being converted into, say, minutes by dividing it by 60, into hours by dividing it by 60 x 60 = 3600, and so on. 

Note that `dminutes(1)` is a duration of exactly one minute, so dividing by it expresses the time span in one-minute units. The same holds for hours, days, and weeks. Thus if you divided by `dhours(2)` you would get the time span in 2-hour units.  
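The arithmetic can be checked on a single made-up pair of departure times that are three hours apart. This is only a sketch; the object names are hypothetical.

```r
library(lubridate)

# Made-up departure times for two successive flights of one aircraft
first_flight  <- ymd_hm("2017-01-01 08:00")
second_flight <- ymd_hm("2017-01-01 11:00")

tspan <- interval(first_flight, second_flight)

as.duration(tspan)  # a duration of 10800 seconds, i.e. 3 hours

in_minutes  <- as.duration(tspan) / dminutes(1)  # 180
in_hours    <- as.duration(tspan) / dhours(1)    # 3
in_2h_units <- as.duration(tspan) / dhours(2)    # 1.5
```

Dividing one duration by another returns a plain number, which is why the new columns can be sorted and summarised like any other numeric column.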

There is a lot more we could do with time but the few things we have covered so far would be the more common tasks we usually encounter.   

Before we move on, let us see the flight sequence of **N396SW** ...

In [None]:
cmh.df2 %>%
    filter(tail_num == "N396SW") %>%
    head(., 10)

In [None]:
cmh.df2 %>%
    filter(tail_num == "N396SW") %>%
    tail(., 10)
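To see what `lag()` does within each aircraft, here is a minimal sketch on made-up flights for two hypothetical tail numbers (`N1` and `N2`). The first flight of each aircraft has no earlier flight to compare against, so its time span is missing.

```r
library(dplyr)
library(lubridate)

# Made-up flights for two hypothetical tail numbers
toy <- tibble(
  tail_num = c("N1", "N1", "N1", "N2", "N2"),
  flt_date_time = ymd_hm(c(
    "2017-01-01 08:00", "2017-01-01 11:00", "2017-01-02 08:00",
    "2017-01-01 09:00", "2017-01-01 21:00"
  ))
)

toy %>%
  group_by(tail_num) %>%
  arrange(flt_date_time, .by_group = TRUE) %>%
  mutate(
    previous = lag(flt_date_time), # previous departure of the same aircraft
    tspan.hours = as.duration(interval(previous, flt_date_time)) / dhours(1)
  ) %>%
  ungroup() -> toy

toy$tspan.hours  # NA 3 21 NA 12
```

Because the data are grouped by `tail_num`, the lag never reaches across aircraft, which is what keeps the first row of `N2` from being compared with the last row of `N1`.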

--------------

# Exercises for Practice

## Exercise 01 

The data below come from [tidytuesday](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-10) and provide information on accidents at theme parks. You can see more of these [data available here](https://ridesdatabase.org/saferparks/data/). The data give you some details of where and when the accident occurred, and something about the injured party as well. 

In [None]:
library(readr)

read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/saferparks.csv"
    ) -> safer_parks

|variable             |class     |description |
|:--------------------|:---------|:-----------|
|acc_id               |double    | Unique ID |
|acc_date             |character | Accident Date |
|acc_state            |character | Accident State |
|acc_city             |character | Accident City |
|fix_port             |character |.           |
|source               |character | Source of injury report |
|bus_type             |character | Business type |
|industry_sector      |character | Industry sector |
|device_category      |character | Device category |
|device_type          |character | Device type |
|tradename_or_generic |character | Common name of the device |
|manufacturer         |character | Manufacturer of device |
|num_injured          |double    | Num injured |
|age_youngest         |double    | Youngest individual injured |
|gender               |character | Gender of individual injured |
|acc_desc             |character | Description of accident |
|injury_desc          |character | Injury description |
|report               |character | Report URL |
|category             |character | Category of accident |
|mechanical           |double    | Mechanical failure (binary NA/1) |
|op_error             |double    | Operator error (binary NA/1)|
|employee             |double    | Employee error (binary NA/1)|
|notes                |character | Additional notes| 

Working with the `safer_parks` data, complete the following tasks. 

### Problem (a)
Using `acc_date`, create a new date variable called `idate` that is a proper date column generated via ``{lubridate}``. 

### Problem (b)
Now create new columns for (i) the month of the accident, and (ii) the day of the week. These should not be abbreviated (i.e., we should see the values as "Monday" instead of "Mon", "July" instead of "Jul"). 

What month had the highest number of accidents? 

What day of the week had the highest number of accidents? 

### Problem (c)
What if you look at days of the week by month? Does the same day of the week show up with the most accidents regardless of month or do we see some variation? 

### Problem (d)
What were the five dates with the most accidents? 

### Problem (e)
Using the Texas injury data, answer the following question: What ride was the safest? [Hint: For each ride (`ride_name`) you will need to calculate the number of days between accidents. The ride with the highest number of days is the safest.] 

In [None]:
read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/tx_injuries.csv"
  ) -> tx_injuries


|variable          |class     |description |
|:-----------------|:---------|:-----------|
|injury_report_rec |double    | Unique Record ID |
|name_of_operation |character | Company name |
|city              |character | City |
|st                |character | State (all TX) |
|injury_date       |character | Injury date - note there are some different formats |
|ride_name         |character | Ride Name |
|serial_no         |character | Serial number of ride |
|gender            |character | Gender of the injured individual |
|age               |character | Age of the injured individual |
|body_part         |character | Body part injured |
|alleged_injury    |character | Alleged injury - type of injury |
|cause_of_injury   |character | Approximate cause of the injury (free text) |
|other             |character | Anecdotal information in addition to cause of injury |

You should note that this assumes each ride was in operation for the same amount of time. If this is not true then our estimates will be unreliable. 

## Exercise 02
These data (see below) come from this story: [The next generation: The space race is dominated by new contenders](https://www.economist.com/graphic-detail/2018/10/18/the-space-race-is-dominated-by-new-contenders). You have data on space missions over time, with dates of the launch, the launching agency/country, type of launch vehicle, and so on. 


| variable    | definition                               |
| ----------- | ---------------------------------------- |
| tag         | Harvard or [COSPAR][cospar] id of launch |
| JD          | [Julian Date][jd] of launch              |
| launch_date | date of launch                           |
| launch_year | year of launch                           |
| type        | type of launch vehicle                   |
| variant     | variant of launch vehicle                |
| mission     | space mission                            |
| agency      | launching agency                         |
| state_code  | launching agency's state                 |
| category    | success (O) or failure (F)               |
| agency_type | type of agency                           |


In [None]:
read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/launches.csv"
  ) -> launches

### Problem (a) 
Create a new column called `date` that stores `launch_date` as a proper date field in ymd format from `{lubridate}`. 

### Problem (b) 
Creating columns as needed, calculate and show the number of launches first by year, then by month, and then by day of the week. The result should be arranged in descending order of the number of launches. 

### Problem (c) 
How many launches were successful `(O)` versus failed `(F)` by country and year? 

Note: The countries of interest will be state_code values of "CN", "F", "J", "RU", "SU", "US". In addition, you do not need to arrange your results in any order. 