# MPA 5830 - Solutions to Task 04

# Chicago Bird Collisions

[Winger et al., 2019](https://royalsocietypublishing.org/doi/10.1098/rspb.2019.0364#d3e550) examined nocturnal flight-calling behavior and vulnerability to artificial light in migratory birds.

> "Understanding interactions between biota and the built environment is increasingly important as human modification of the landscape expands in extent and intensity. For migratory birds, collisions with lighted structures are a major cause of mortality, but the mechanisms behind these collisions are poorly understood. Using 40 years of collision records of passerine birds, we investigated the importance of species' behavioral ecologies in predicting rates of building collisions during nocturnal migration through Chicago, IL and Cleveland, OH, USA. "

> "One of the few means to examine species-specific dynamics of social biology during nocturnal bird migration is through the study of short vocalizations made in flight by migrating birds. Many species of birds, especially passerines (order Passeriformes), produce such vocal signals during their nocturnal migrations. These calls (hereafter, ‘flight calls’) are hypothesized to function as important social cues for migrating birds that may aid in orientation, navigation and other decision-making behaviors. [...] not all nocturnally migratory species make flight calls, raising the possibility that different lineages of migratory birds vary in the degree to which social cues and collective decisions are important for accomplishing migration."

I have only uploaded the raw and tamed Chicago data-set as it is the most complete, but you can access the full raw data [here](https://datadryad.org/resource/doi:10.5061/dryad.8rr0498). 

Each row in the `bird_collisions.csv` data-set accounts for a single observation of a bird collision. You can aggregate by species/genus, time, or other factors.

h/t to [Data is Plural 2019/04/10](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0)

### Important Notes but Spoilers

An important point, and something of a spoiler, from the authors:
> From 2000 to 2018, D.E.W. and M.H. recorded data on the status of night-time lighting at McCormick Place during pre-dawn walks to collect collisions by recording the proportion of the 17 window bays that were illuminated... We used this index to test whether building lighting influenced the number of collisions and whether the influence of light levels on collisions counts varied across the sets of species with different flight-calling behavior or habitat preferences.

There is a factor data column (`bird_collisions$locality`) that indicates if the data was collected at McCormick Place (MP) or elsewhere in Chicago (CHI). If you `dplyr::filter` to only use `MP` you can `dplyr::left_join` the light data and the bird collision data to look at the effects of light on bird collisions from 2000 on.

## Get the data!

In [1]:
library(tidyverse)

readr::read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-30/bird_collisions.csv"
    ) -> bird_collisions 

readr::read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-30/mp_light.csv"
    ) -> mp_light 

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.0.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Rows: 69695 Columns: 8

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): genus, species, locality, family, flight_call, habitat, stratum
date (1): date


ℹ Use `spec()` to retrieve the full column specification for this data.

### Citations

When using this data, please cite the original publication:

> Winger BM, Weeks BC, Farnsworth A, Jones AW, Hennen M, Willard DE (2019) Nocturnal flight-calling behavior predicts vulnerability to artificial light in migratory birds. Proceedings of the Royal Society B 286(1900): 20190364. https://doi.org/10.1098/rspb.2019.0364

If using the data alone, please cite the [Dryad data package](https://cran.r-project.org/web/packages/rdryad/rdryad.pdf):

> Winger BM, Weeks BC, Farnsworth A, Jones AW, Hennen M, Willard DE (2019) Data from: Nocturnal flight-calling behavior predicts vulnerability to artificial light in migratory birds. Dryad Digital Repository. https://doi.org/10.5061/dryad.8rr0498


### Data Dictionary

#### `bird_collisions.csv` 
|variable    |class     |description |
|:-----------|:---------|:-----------|
|genus       | factor | Bird Genus          |
|species     | factor | Bird species           |
|date        | date    | Date of collision death (ymd)           |
|locality    | factor | MP or CHI - recording at either McCormick Place or greater Chicago area           |
|family      | factor | Bird Family          |
|flight_call | factor | Does the bird use a flight call - yes or no           |
|habitat     | factor | Open, Forest, Edge - their habitat affinity          |
|stratum     | factor  | Typical occupied stratum - ground/low or canopy/upper           |

#### `mp_light.csv` 
|variable    |class  |description |
|:-----------|:------|:-----------|
|date        | date | Date of light recording  (ymd)        |
|light_score | integer | Number of windows lit at the McCormick Place, Chicago - higher = more light          |

## Now the questions ...

### Question (a) 
Does the number of bird collisions vary by month? By day of the month? By day of the week? By year? Show your calculations and then state, in words after your code chunk, which year, month, day of the month, and day of the week had the most bird collisions.

In [2]:
library(tidylog)
library(lubridate)

bird_collisions %>%
  mutate(
    year = year(date),
    month = month(date, abbr = FALSE, label = TRUE),
    day = day(date),
    dow = wday(date, abbr = FALSE, label = TRUE)
  ) -> birds

birds %>%
  count(year, sort = TRUE)


Attaching package: ‘tidylog’


The following objects are masked from ‘package:dplyr’:

    add_count, add_tally, anti_join, count, distinct, distinct_all,
    distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
    full_join, group_by, group_by_all, group_by_at, group_by_if,
    inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
    relocate, rename, rename_all, rename_at, rename_if, rename_with,
    right_join, sample_frac, sample_n, select, select_all, select_at,
    select_if, semi_join, slice, slice_head, slice_max, slice_min,
    slice_sample, slice_tail, summarise, summarise_all, summarise_at,
    summarise_if, summarize, summarize_all, summarize_at, summarize_if,
    tally, top_frac, top_n, transmute, transmute_all, transmute_at,
    transmute_if, ungroup


The following objects are masked from ‘package:tidyr’:

    drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
    spread, uncount


The following object is masked from ‘package:stats’:

    filter

year,n
<dbl>,<int>
2013,3848
2014,3761
2011,3510
2012,3502
2010,3475
2015,3319
2008,3235
2009,3219
2007,2952
2006,2653


In [3]:
birds %>%
  count(month, sort = TRUE)

count: now 7 rows and 2 columns, ungrouped



month,n
<ord>,<int>
October,22840
September,16066
May,12832
April,7697
March,5093
November,3522
August,1645


In [4]:
birds %>%
  count(day, sort = TRUE)

count: now 31 rows and 2 columns, ungrouped



day,n
<int>,<int>
11,2933
8,2914
9,2884
14,2823
7,2797
10,2717
12,2568
15,2564
20,2513
5,2382


In [5]:
birds %>%
  count(dow, sort = TRUE)

count: now 7 rows and 2 columns, ungrouped



dow,n
<ord>,<int>
Wednesday,10568
Sunday,10464
Thursday,10060
Saturday,10054
Monday,9813
Tuesday,9446
Friday,9290

In words: 2013 had the most collisions of any year (3,848), October of any month (22,840), the 11th of any day of the month (2,933), and Wednesday of any day of the week (10,568).


### Question (b)
What locality has had the most hits per year -- McCormick Place or the greater Chicago area?

In [6]:
birds %>%
  count(locality, year, sort = TRUE)

count: now 77 rows and 3 columns, ungrouped



locality,year,n
<chr>,<dbl>,<int>
CHI,2014,3218
CHI,2013,3186
CHI,2011,3146
CHI,2010,3037
CHI,2015,2913
CHI,2012,2832
CHI,2009,2822
CHI,2008,2601
MP,1996,2306
CHI,2007,2055
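
The sorted counts above list individual locality-year totals. One way to compare the two localities more directly is to average the yearly totals within each locality. The following is a minimal sketch on made-up data; the `toy` tibble and its values are hypothetical and are not drawn from `bird_collisions`.

```r
library(dplyr)

# Hypothetical stand-in for `birds`: one row per recorded collision
toy <- tibble(
  locality = c("MP", "MP", "MP", "CHI", "CHI", "CHI", "CHI", "CHI"),
  year     = c(2000, 2000, 2001, 2000, 2000, 2000, 2001, 2001)
)

per_year <- toy %>%
  count(locality, year, name = "hits") %>%   # collisions per locality-year
  group_by(locality) %>%
  summarise(
    years         = n(),                     # number of years observed
    mean_per_year = mean(hits),              # average collisions per year
    .groups       = "drop"
  ) %>%
  arrange(desc(mean_per_year))               # locality with the highest average first

per_year
```

On the real data the same pipeline could start from `birds` instead of `toy`. Note that the two localities may not cover the same span of years, so an average per year and a raw total can rank them differently.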


### Question (c)
Now filter the bird collision data to keep only records from McCormick Place. Then join the resulting data frame to the `mp_light` data-set so that bird collision records and light-score records are matched up correctly by date. Eliminate any rows where `light_score` is missing. Save the resulting data-set as `birds.df`.

In [7]:
birds %>%
  filter(locality == "MP") %>%
  full_join(mp_light, by = "date") %>%
  filter(!is.na(light_score)) -> birds.df

filter: removed 33,456 rows (48%), 36,239 rows remaining

full_join: added one column (light_score)

           > rows only in x   26,622

           > rows only in y    1,426

           > matched rows      9,626    (includes duplicates)


           > rows total       37,674

filter: removed 26,622 rows (71%), 11,052 rows remaining
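
The log above reports 1,426 rows that exist only in `mp_light`. Because their `light_score` is not missing, those rows survive the final filter even though they carry no collision record. If only actual collisions should be kept, `inner_join()` is one alternative. The sketch below illustrates the difference on hypothetical data; `collisions` and `light` are made-up stand-ins, not the course data.

```r
library(dplyr)

# Hypothetical collision records (one row per bird)
collisions <- tibble(
  date  = as.Date(c("2000-05-01", "2000-05-01", "2000-05-02", "1999-05-01")),
  genus = c("Melospiza", "Zonotrichia", "Melospiza", "Catharus")
)

# Hypothetical light readings (one row per date)
light <- tibble(
  date        = as.Date(c("2000-05-01", "2000-05-02", "2000-05-03")),
  light_score = c(3L, 17L, 8L)
)

# Approach used above: keep everything, then drop rows without a light score
via_full <- collisions %>%
  full_join(light, by = "date") %>%
  filter(!is.na(light_score))

# Alternative: keep only dates present in both tables
via_inner <- collisions %>%
  inner_join(light, by = "date")

nrow(via_full)               # includes the light-only date 2000-05-03
nrow(via_inner)              # collisions with a matching light reading
sum(is.na(via_full$genus))   # light-only rows that are not collisions
```

Which version is appropriate depends on the question: the `full_join()` result keeps dates with a light reading and no recorded collision, while the `inner_join()` result counts birds only.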



### Question (d) 
Now we want to know if the distribution of bird collisions differs by how brightly lit (a high `light_score`) or dimly lit (a low `light_score`) the windows of McCormick Place were. To answer this question, we want to do two things. First, convert `light_score` into a grouped, ordinal variable with 4 groups (essentially creating quartiles). Second, count the number of fatalities that fall within each of these four groups of light scores.

In [8]:
library(santoku)

birds.df %>%
  mutate(
    litg = chop_equally(light_score, groups = 4)
  ) %>%
  count(litg)


Attaching package: ‘santoku’


The following object is masked from ‘package:tidyr’:

    chop


mutate: new variable 'litg' (factor) with 4 unique values and 0% NA

count: now 4 rows and 2 columns, ungrouped



litg,n
<fct>,<int>
"[0%, 25%)",2609
"[25%, 50%)",1734
"[50%, 75%)",2840
"[75%, 100%]",3869

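The four groups above are not equal in size, which is what one would expect if `chop_equally()` places its breaks at quantiles of `light_score` and many rows share the same score. As a point of comparison, `dplyr::ntile()` buckets by rank and forces near-equal group sizes, at the cost of splitting tied values across groups. A small sketch with hypothetical scores:

```r
library(dplyr)

# Hypothetical light scores with heavy ties at the low end
scores <- tibble(light_score = c(1, 1, 1, 1, 2, 5, 9, 17))

binned <- scores %>%
  mutate(litg_rank = ntile(light_score, 4))   # rank-based quartile groups

# Group sizes are forced to be equal here
bin_sizes <- binned %>%
  count(litg_rank)

# The tied score 1 is spread over more than one group
groups_for_ones <- binned %>%
  filter(light_score == 1) %>%
  distinct(litg_rank)

bin_sizes
groups_for_ones
```

Neither approach is the correct one in general. Quantile breaks keep identical scores together, while rank buckets keep group sizes balanced.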

# Philadelphia Parking Violations

These data come from Philly's open data portal and were used in one `tidytuesday` iteration. These particular data are for 2017. Use these data to answer the questions that follow.

In [9]:
readr::read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-03/tickets.csv"
  ) -> tickets


Rows: 1260891 Columns: 7

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): violation_desc, issuing_agency
dbl  (4): fine, lat, lon, zip_code
dttm (1): issue_datetime


ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



## Problem (a) 
Make a proper date-time column called `tixdt`, and then from this extract the month, day of the month, day of the week, hour, and minute. Month and day of the week should be fully labelled. Save the resulting data-set as `tix.df`, and make sure you also save it in the RData format as `tix.df.RData` to the `data` folder. 

In [10]:
tickets %>%
  mutate(
    tixdt = ymd_hms(issue_datetime),
    month = month(tixdt, label = TRUE, abbr = FALSE),
    dow = wday(tixdt, label = TRUE, abbr = FALSE),
    day = day(tixdt),
    hour = hour(tixdt),
    minute = minute(tixdt)
  ) -> tix.df

mutate: new variable 'tixdt' (double) with 332,269 unique values and 0% NA

        new variable 'month' (ordered factor) with 12 unique values and 0% NA

        new variable 'dow' (ordered factor) with 7 unique values and 0% NA

        new variable 'day' (integer) with 31 unique values and 0% NA

        new variable 'hour' (integer) with 24 unique values and 0% NA

        new variable 'minute' (integer) with 60 unique values and 0% NA



In [11]:
save(tix.df, file = "data/tix.df.RData")
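
`save()` does not create missing directories, so the call above assumes a `data` folder already exists in the working directory. The sketch below shows the full round trip under that caveat. It uses a temporary directory and a hypothetical `toy.df` object so that nothing on disk is overwritten.

```r
# Hypothetical small object standing in for tix.df
toy.df <- data.frame(fine = c(26, 36), hour = c(11L, 12L))

# Create the target folder if it is missing (a temporary location is used here)
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Save in RData format
out_file <- file.path(out_dir, "toy.df.RData")
save(toy.df, file = out_file)

# load() restores the object under its original name and
# invisibly returns the names of the objects it restored
restored_env <- new.env()
loaded_names <- load(out_file, envir = restored_env)

loaded_names
identical(restored_env$toy.df, toy.df)
```

In the actual task the equivalent would be `dir.create("data", showWarnings = FALSE)` before the `save()` call, and `load("data/tix.df.RData")` in a later session.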

## Problem (b)

Which month, day of the month, day of the week, hour, and minute were most likely to see a ticket issued? You will need one calculation for each of these.

In [12]:
tix.df %>%
  count(month, sort = TRUE)

count: now 12 rows and 2 columns, ungrouped



month,n
<ord>,<int>
August,115529
October,115063
March,110087
May,109667
September,109468
November,109330
June,106288
April,105846
January,96803
February,96044


In [13]:
tix.df %>%
  count(day, sort = TRUE)

count: now 31 rows and 2 columns, ungrouped



day,n
<int>,<int>
21,46818
28,46036
18,45966
6,45431
27,44850
20,44837
11,44473
17,44117
13,43395
8,43005


In [14]:
tix.df %>%
  count(dow, sort = TRUE)

count: now 7 rows and 2 columns, ungrouped



dow,n
<ord>,<int>
Thursday,228348
Wednesday,223194
Friday,220961
Tuesday,210567
Monday,175854
Saturday,150769
Sunday,51198


In [15]:
tix.df %>%
  count(hour, sort = TRUE)

count: now 24 rows and 2 columns, ungrouped



hour,n
<int>,<int>
12,137364
11,134920
13,116560
14,108124
10,98501
15,95779
16,85558
9,72974
17,64324
21,57951


In [16]:
tix.df %>%
  count(minute, sort = TRUE)

count: now 60 rows and 2 columns, ungrouped



minute,n
<int>,<int>
0,26982
5,25285
30,24727
10,24029
15,23872
35,23792
40,23611
20,23526
50,23473
45,23404


## Problem (c)

What combination of hour and minute was most likely to see a ticket issued? What hour and minute were least likely to see a ticket issued?

In [17]:
tix.df %>%
  count(hour, minute, sort = TRUE)

count: now 1,440 rows and 3 columns, ungrouped



hour,minute,n
<int>,<int>,<int>
16,5,3039
12,0,2633
11,30,2558
15,35,2510
11,40,2501
11,50,2495
11,35,2482
12,10,2474
11,55,2470
12,15,2463
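
The descending sort above answers the first half of the question. The log reports 1,440 rows, which equals 24 × 60, so every hour-minute combination occurs at least once and no zero-count combination is hidden. For the least likely combination, one option is to look at the other end of the same table. The sketch below uses a hypothetical `toy` tibble; the column name `tickets` is also an illustrative choice.

```r
library(dplyr)

# Hypothetical ticket times: one row per ticket
toy <- tibble(
  hour   = c(12, 12, 12, 3, 16, 16, 4),
  minute = c(0, 0, 0, 17, 5, 5, 42)
)

# Tickets per hour-minute combination
combos <- toy %>%
  count(hour, minute, name = "tickets")

# Most frequent combination(s); ties are kept
most <- combos %>%
  slice_max(tickets, n = 1, with_ties = TRUE)

# Least frequent combination(s); ties are kept
least <- combos %>%
  slice_min(tickets, n = 1, with_ties = TRUE)

most
least
```

Keeping ties matters for the low end in particular, since several rarely used minutes may share the same minimum count. On the real data, `tix.df` would take the place of `toy`.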
