# Tidyverse: Modern R

In [1]:
#install.packages("tidyverse")

In [2]:
library("tidyverse")

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.2
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


### Read this ([online](http://r4ds.had.co.nz/))
![](images/rfordatascience.jpg)

If you expect to be a heavy R user, then you might find this useful:
![](images/rinaction.jpg)

As always, cheat sheets are an absolutely fantastic resource: https://rstudio.com/resources/cheatsheets/

### Why Tidyverse instead of base R?

The Tidyverse provides a faster, cleaner, and more up-to-date set of features for the modern data scientist. Further, its packages are designed to work _together_, so the whole is greater than the sum of its parts.

As a new data scientist, you will find the Tidyverse API to be more uniform and easier to use. As you become more experienced, you will appreciate the speed improvements (although, for absolute speed, you will need more specialized tools).

### Read data

```R
read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)
```

In [11]:
d <- read_csv('../../datasets/deaths-in-gameofthrones/game-of-thrones-deaths-data.csv')

Parsed with column specification:
cols(
  order = [32mcol_double()[39m,
  season = [32mcol_double()[39m,
  episode = [32mcol_double()[39m,
  character_killed = [31mcol_character()[39m,
  killer = [31mcol_character()[39m,
  method = [31mcol_character()[39m,
  method_cat = [31mcol_character()[39m,
  reason = [31mcol_character()[39m,
  location = [31mcol_character()[39m,
  allegiance = [31mcol_character()[39m,
  importance = [32mcol_double()[39m
)


Notice that `read_csv` tells us the assumptions it made about column types.

In [13]:
d

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2
2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2
3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2
4,1,1,Stag,Direwolf,Direwolf teeth,Animal,Unknown,Winterfell,,1
5,1,1,Direwolf,Stag,Antler,Animal,Unknown,Winterfell,,1
6,1,1,Jon Arryn,Lysa Arryn,Poison,Poison,Petyr Baelish persuaded Lysa to do so for reasons unknown,King’s Landing,House Arryn,2
7,1,1,Dothraki man,Dothraki man,Arakh,Blade,A Dothraki wedding without at least three deaths is a dull affair,Pentos,Dothraki,1
8,1,2,Catspaw assassin,Summer,Direwolf teeth,Animal,Attempting to kill Bran Stark,Winterfell,,1
9,1,2,Mycah,Sandor “the Hound” Clegane,Unknown (likely a sword),Unknown,Joffrey has him killed after Arya attacks Joffrey,Kingsroad,Smallfolk,3
10,1,2,Lady,Ned Stark,Knife,Blade,"Robert Baratheon orders that Lady be killed to appease Cersei, who wants revenge after Nymeria attacked Joffrey",Kingsroad,House Stark,3


If we want to change a data type (say the first three columns should be integers instead of doubles), we can copy and paste the column specification printed above and modify it as needed:

In [14]:
col_types <- cols(
  order = col_integer(),
  season = col_integer(),
  episode = col_integer(),
  character_killed = col_character(),
  killer = col_character(),
  method = col_character(),
  method_cat = col_character(),
  reason = col_character(),
  location = col_character(),
  allegiance = col_character(),
  importance = col_double()
)
read_csv('../../datasets/deaths-in-gameofthrones/game-of-thrones-deaths-data.csv', col_types = col_types)

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<int>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2
2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2
3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2
4,1,1,Stag,Direwolf,Direwolf teeth,Animal,Unknown,Winterfell,,1
5,1,1,Direwolf,Stag,Antler,Animal,Unknown,Winterfell,,1
6,1,1,Jon Arryn,Lysa Arryn,Poison,Poison,Petyr Baelish persuaded Lysa to do so for reasons unknown,King’s Landing,House Arryn,2
7,1,1,Dothraki man,Dothraki man,Arakh,Blade,A Dothraki wedding without at least three deaths is a dull affair,Pentos,Dothraki,1
8,1,2,Catspaw assassin,Summer,Direwolf teeth,Animal,Attempting to kill Bran Stark,Winterfell,,1
9,1,2,Mycah,Sandor “the Hound” Clegane,Unknown (likely a sword),Unknown,Joffrey has him killed after Arya attacks Joffrey,Kingsroad,Smallfolk,3
10,1,2,Lady,Ned Stark,Knife,Blade,"Robert Baratheon orders that Lady be killed to appease Cersei, who wants revenge after Nymeria attacked Joffrey",Kingsroad,House Stark,3
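
The specification does not have to list every column. A minimal sketch, assuming a small made-up file written to a temporary path (it is not the deaths dataset): columns left out of `cols()` are still guessed by `read_csv`.

```r
library(readr)

# Made-up three-column CSV written to a temporary file for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c("order,season,killer", "1,1,A", "2,1,B"), path)

# Only the listed columns are overridden; `killer` is still guessed.
small <- read_csv(path, col_types = cols(order = col_integer(),
                                         season = col_integer()))
small
```

This keeps the override short when only one or two guesses need correcting.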


Recall from `first_programs` that we always used `.strip()` to remove surrounding whitespace. Notice that `read_csv` already does this for us by default, through the `trim_ws` argument!

There are several related functions for parsing data with any delimiter (semicolon, tab, etc.) and for reading fixed-width files. The Tidyverse also provides packages to read and write XML, web APIs, JSON, SPSS, Stata, SAS, and other formats.
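
As a rough illustration of the delimiter-based readers, here is a sketch that uses `read_delim` on a made-up semicolon-separated file (the file contents and the name `scores` are assumptions for this example). `read_tsv` and `read_fwf` are the readr counterparts for tab-separated and fixed-width files.

```r
library(readr)

# Made-up semicolon-separated data written to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c("name;score", "homer;9", "marge;3"), path)

# read_delim() takes the delimiter as an explicit argument.
scores <- read_delim(path, delim = ";")
scores
```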

## Tibble (not data.frame)

The Tidyverse's CSV reading function returns a `tibble`, an alternative to R's built-in data frame. Tibbles are better at displaying their data (particularly large amounts of data), and their API is a bit more consistent.

#### Creating a tibble

We have already seen the use of `read_csv` to create a tibble from a file. You can also create tibbles directly:

In [20]:
tibble(age=sample(10), rank=seq(1,10))

age,rank
<int>,<int>
1,1
9,2
10,3
7,4
8,5
3,6
5,7
6,8
4,9
2,10
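
Two other common ways to build a tibble, sketched with illustrative names (`by_column`, `by_row`, `from_df`): `tribble` lays the data out row by row, and `as_tibble` converts an existing base R data frame.

```r
library(tibble)

# Column-wise construction: each argument is one column.
by_column <- tibble(name = c("homer", "marge"), age = c(38, 36))

# Row-wise construction: column headers are written as formulas (~name).
by_row <- tribble(
  ~name,   ~age,
  "homer", 38,
  "marge", 36
)

# Conversion from a base R data frame.
from_df <- as_tibble(data.frame(name = c("homer", "marge"), age = c(38, 36),
                                stringsAsFactors = FALSE))
```

The row-wise form is mostly useful for small, hand-typed tables where reading across a row is easier than reading down a column.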


## Tidyr

The library `tidyr` provides functions to change the layout of your data. Here is a common scenario. Your data looks like this:

In [28]:
(bad_data <- tibble(feature=c("age", "anger", "hunger"), homer=c(38, 9, 9), marge=c(36, 3, 5) ))

feature,homer,marge
<chr>,<dbl>,<dbl>
age,38,36
anger,9,3
hunger,9,5


The tidyverse standard practice is to re-organize your data so each column represents a feature and each row represents an observation. In other words, our data _should_ look like this:

In [29]:
(good_data1 <- gather(bad_data, 'homer', 'marge', key="person", value="value"))

feature,person,value
<chr>,<chr>,<dbl>
age,homer,38
anger,homer,9
hunger,homer,9
age,marge,36
anger,marge,3
hunger,marge,5


In [31]:
(good_data2 <- spread(good_data1, feature, value))

person,age,anger,hunger
<chr>,<dbl>,<dbl>,<dbl>
homer,38,9,9
marge,36,3,5
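
tidyr 1.0.0 (the version attached above) introduced `pivot_longer` and `pivot_wider`, which cover the same reshaping as `gather` and `spread`. A sketch of the same two steps, with `tidy_long` and `tidy_wide` as illustrative names:

```r
library(tibble)
library(tidyr)

bad_data <- tibble(feature = c("age", "anger", "hunger"),
                   homer = c(38, 9, 9), marge = c(36, 3, 5))

# Wide -> long: the person columns become (person, value) pairs.
tidy_long <- pivot_longer(bad_data, cols = c(homer, marge),
                          names_to = "person", values_to = "value")

# Long -> wide: one column per feature, one row per person.
tidy_wide <- pivot_wider(tidy_long, names_from = feature, values_from = value)
tidy_wide
```

Note that `pivot_longer` orders the long rows differently from `gather` (it walks each original row in turn), but the final wide table has the same contents.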


**Exercise** Convert the first table into the second table, using `gather`:

In [35]:
(first_table <- tibble(country=c('A', 'B', 'C'), '1999'=c(.7, 37, 212), '2000'=c(2, 80,213)))

country,1999,2000
<chr>,<dbl>,<dbl>
A,0.7,2
B,37.0,80
C,212.0,213


In [37]:
(second_table <- tibble(country=c('A', 'B', 'C', 'A', 'B', 'C'), year=c(1999, 1999, 1999, 2000, 2000, 2000), cases=c(.7, 37, 212, 2, 80,213)))

country,year,cases
<chr>,<dbl>,<dbl>
A,1999,0.7
B,1999,37.0
C,1999,212.0
A,2000,2.0
B,2000,80.0
C,2000,213.0


See cheatsheet at https://rstudio.com/resources/cheatsheets/ for the answer.

## Dplyr

The `dplyr` library provides a large number of very useful functions. We will try them on the deaths data we loaded earlier:

In [38]:
d

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2
2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2
3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2
4,1,1,Stag,Direwolf,Direwolf teeth,Animal,Unknown,Winterfell,,1
5,1,1,Direwolf,Stag,Antler,Animal,Unknown,Winterfell,,1
6,1,1,Jon Arryn,Lysa Arryn,Poison,Poison,Petyr Baelish persuaded Lysa to do so for reasons unknown,King’s Landing,House Arryn,2
7,1,1,Dothraki man,Dothraki man,Arakh,Blade,A Dothraki wedding without at least three deaths is a dull affair,Pentos,Dothraki,1
8,1,2,Catspaw assassin,Summer,Direwolf teeth,Animal,Attempting to kill Bran Stark,Winterfell,,1
9,1,2,Mycah,Sandor “the Hound” Clegane,Unknown (likely a sword),Unknown,Joffrey has him killed after Arya attacks Joffrey,Kingsroad,Smallfolk,3
10,1,2,Lady,Ned Stark,Knife,Blade,"Robert Baratheon orders that Lady be killed to appease Cersei, who wants revenge after Nymeria attacked Joffrey",Kingsroad,House Stark,3


`summarise`: aggregate the whole table

In [43]:
summarise(d, max=max(episode))

max
<dbl>
10
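
`summarise` can compute several aggregates in one call. A small sketch on made-up rows that only mirror a few column names of `d` (the tibble `toy` and the result name `overall` are assumptions for this example):

```r
library(tibble)
library(dplyr)

# Made-up rows; these are not taken from the deaths dataset.
toy <- tibble(season = c(1, 1, 2, 2, 2),
              episode = c(1, 3, 2, 5, 9),
              importance = c(2, 1, 3, 1, 3))

# Each named argument becomes one column of the one-row result.
overall <- summarise(toy,
                     deaths = n(),
                     last_episode = max(episode),
                     mean_importance = mean(importance))
overall
```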


`count`: count values in a group

In [52]:
head(count(d, killer))

killer,n
<chr>,<int>
Accident,1
Allister Thorne,3
Amory Lorch,1
Arryn soldier,12
Arthur Dayne,3
Arya Stark,1278
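
By default `count` returns the groups in sorted order of the grouping column. Passing `sort = TRUE` puts the largest counts first instead. A sketch on made-up data (`toy` and `by_killer` are illustrative names):

```r
library(tibble)
library(dplyr)

# Made-up killer labels; not taken from the deaths dataset.
toy <- tibble(killer = c("A", "B", "B", "C", "B", "C"))

# sort = TRUE orders the result by descending count.
by_killer <- count(toy, killer, sort = TRUE)
by_killer
```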


`filter`: select rows according to a criterion

In [46]:
filter(d, killer == 'Arya Stark')

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
46,1,8,Stableboy,Arya Stark,Sword (Needle),Blade,Tried to grab Arya,King’s Landing,Smallfolk,1
55,1,9,Pigeon,Arya Stark,Hands,Hands,Killed for food,King’s Landing,,1
273,3,10,Frey soldier,Arya Stark,Knife,Blade,Killed in revenge for the Red Wedding,Riverlands,House Frey,1
279,4,1,Baratheon of King’s Landing soldier,Arya Stark,Sword,Blade,Killed in a tavern fight,Riverlands,House Baratheon of King’s Landing,1
281,4,1,Polliver,Arya Stark,Sword (Needle),Blade,Revenge for killing Lommy,Riverlands,House Baratheon of King’s Landing,3
328,4,7,Rorge,Arya Stark,Sword (Needle),Blade,Tried to collect the bounty on Sandor “the Hound” Clegane’s head; threatened Arya Stark in the past,Riverlands,,2
464,5,2,Pigeon,Arya Stark,Sword (Needle),Blade,Killed for food,Braavos,,1
521,5,6,Ghita,Arya Stark,Poisoned water,Poison,Sought death from the House of Black and White after suffering from an illness,Braavos,,2
527,5,8,Oyster,Arya Stark,Shucking,Blade,Killed for food,Braavos,,1
628,5,9,Clam,Arya Stark,Shucking,Blade,Killed for food,Braavos,,1
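
`filter` also accepts several conditions at once. A sketch on made-up rows (`toy`, `late_a`, and `b_or_c` are illustrative names): comma-separated conditions must all hold, and `%in%` tests membership in a set of values.

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(killer = c("A", "B", "A", "C"),
              season = c(1, 2, 3, 3))

# Comma-separated conditions are combined with AND.
late_a <- filter(toy, killer == "A", season >= 2)

# %in% keeps rows whose value matches any element of the vector.
b_or_c <- filter(toy, killer %in% c("B", "C"))
```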


`sample_n`: randomly sample `n` rows

`sample_frac`: randomly sample `frac` fraction of rows

In [47]:
sample_n(d, 5)

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
4657,8,3,Wight,Tormund,Axe,Axe,Killed during the Battle of Winterfell,Winterfell,,1
6501,8,5,Golden Company soldier,Drogon,Dragonfire,Animal,Killed when Daenerys Targaryen attacked King’s Landing,King’s Landing,"Golden Company, House Lannister",1
3971,8,3,Wight,Rhaegal,Dragonfire,Animal,Killed during the Battle of Winterfell,Winterfell,,1
6107,8,5,Golden Company soldier,Drogon,Dragonfire,Animal,Killed when Daenerys Targaryen attacked King’s Landing,King’s Landing,"Golden Company, House Lannister",1
9,1,2,Mycah,Sandor “the Hound” Clegane,Unknown (likely a sword),Unknown,Joffrey has him killed after Arya attacks Joffrey,Kingsroad,Smallfolk,3


In [51]:
sample_frac(d, .001)

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
6067,8,5,Golden Company soldier,Drogon,Dragonfire,Animal,Killed when Daenerys Targaryen attacked King’s Landing,King’s Landing,"Golden Company, House Lannister",1
1618,7,6,Wight,,Drowning,Other,"Killed during a wight hunt led by Jon Snow, who wanted to capture a wight to prove the existence of White Walkers.",Beyond the Wall,,1
6844,8,5,Peasant,Drogon,Rubble,Crushing,Killed when Daenerys Targaryen attacked King’s Landing,King’s Landing,Smallfolk,1
728,6,3,Othell Yarwyck,Jon Snow,Hanging,Other,"Betrayed and stabbed Jon Snow, their lord commander",Castle Black,Night’s Watch,2
3469,8,3,Dothraki rider,Wight,Unknown,Unknown,Killed during the Battle of Winterfell,Winterfell,House Targaryen,1
3338,8,3,Horse,Wight,Unknown,Unknown,Killed during the Battle of Winterfell,Winterfell,,1
1310,7,2,Greyjoy (Euron-aligned) soldier,Yara Greyjoy,Sword,Blade,"Killed during Euron’s attack on Yara’s fleet, which was partially Euron’s attempt to bring a gift back to Cersei and win her heart.",Narrow Sea,House Greyjoy (Euron-aligned),1
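
Because the sampling is random, rerunning these cells gives different rows each time. One way to make a sample repeatable, sketched here on a made-up tibble, is to call base R's `set.seed` first (the seed value 1 is arbitrary):

```r
library(tibble)
library(dplyr)

toy <- tibble(id = 1:10)  # made-up data

# The same seed gives the same sample.
set.seed(1)
first <- sample_n(toy, 3)
set.seed(1)
second <- sample_n(toy, 3)

# sample_frac() takes a fraction of the rows: 0.5 of 10 rows is 5 rows.
half <- sample_frac(toy, 0.5)
```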


`slice`: Get rows by position

In [50]:
slice(d, 1:3)

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2
2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2
3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2


`pull`: output is a vector

In [55]:
head(pull(d, killer))

`select`: output is a tibble

In [56]:
head(select(d, killer))

killer
<chr>
White Walker
White Walker
Ned Stark
Direwolf
Stag
Lysa Arryn
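
The difference between the two shows up in the type of the result. A sketch on made-up rows (all names here are illustrative):

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(killer = c("A", "B"), season = c(1, 2), episode = c(1, 4))

killers_vec <- pull(toy, killer)         # plain character vector
killers_tbl <- select(toy, killer)       # one-column tibble
two_cols <- select(toy, killer, season)  # select() can keep several columns
```

Use `pull` when the next step needs a vector (for example `mean` or `unique`), and `select` when the result should stay a table.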


`arrange`: sort tibble

In [57]:
head(arrange(d, character_killed))

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>
812,6,6,Aerys II Targaryen,Jaime Lannister,Sword,Blade,Tried to blow up all of King’s Landing using wildfire when Tywin Lannister sacked the city,King’s Landing,House Targaryen,2
732,6,4,Aggo,Daario Naharis,Knife,Blade,Killed when Jorah and Daario sneak into Vaes Dothrak to try and rescue Daenerys,Vaes Dothrak,Dothraki,2
729,6,3,Alliser Thorne,Jon Snow,Hanging,Other,"Betrayed and stabbed Jon Snow, their lord commander",Castle Black,"Night’s Watch, House Thorne",3
95,2,7,Alton Lannister,Jaime Lannister,Hands,Hands,Killed in Jaime’s attempt to escape imprisonment,Robb Stark’s camp,House Lannister,2
4616,8,3,Alys Karstark,Wight,Unknown,Unknown,Killed during the Battle of Winterfell,Winterfell,"House Karstark, House Stark",2
92,2,6,Amory Lorch,Jaqen H’ghar,Poison dart,Poison,Named by Arya Stark after he finds her with one of Tywin’s letters,Harrenhal,House Lannister,2
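
`arrange` sorts in ascending order by default. Wrapping a column in `desc()` reverses it, and additional columns break ties. A sketch on made-up rows (`toy` and `sorted` are illustrative names):

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(season = c(1, 2, 2, 1), episode = c(3, 1, 7, 9))

# Latest season first; within a season, earliest episode first.
sorted <- arrange(toy, desc(season), episode)
sorted
```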


`mutate`: compute new column

In [58]:
head(mutate(d, season_plus_episode = episode + season))

order,season,episode,character_killed,killer,method,method_cat,reason,location,allegiance,importance,season_plus_episode
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,1,1,Waymar Royce,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,"House Royce, Night’s Watch",2,2
2,1,1,Gared,White Walker,Ice sword,Blade,Unknown,Beyond the Wall,Night’s Watch,2,2
3,1,1,Will,Ned Stark,Sword (Ice),Blade,Deserting the Night’s Watch,Winterfell,Night’s Watch,2,2
4,1,1,Stag,Direwolf,Direwolf teeth,Animal,Unknown,Winterfell,,1,2
5,1,1,Direwolf,Stag,Antler,Animal,Unknown,Winterfell,,1,2
6,1,1,Jon Arryn,Lysa Arryn,Poison,Poison,Petyr Baelish persuaded Lysa to do so for reasons unknown,King’s Landing,House Arryn,2,2
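
A single `mutate` call can add several columns, and the new columns need not be numeric. A sketch on made-up rows (the threshold of 3 and the column names `is_major` and `label` are assumptions for this example):

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(season = c(1, 1, 2), episode = c(1, 2, 1),
              importance = c(2, 3, 1))

labelled <- mutate(toy,
                   is_major = importance >= 3,                  # logical column
                   label = paste0("S", season, "E", episode))   # character column
labelled
```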


**`group_by`**
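This function groups the rows of a tibble by the distinct values of one or more columns. The data itself is unchanged, but functions such as `summarise` then operate on each group separately. It is an extremely powerful function.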

Recall:

In [63]:
summarise(d, max=max(episode))

max
<dbl>
10


Now see what `group_by` does:

In [64]:
summarise(group_by(d, season), max=max(episode))

season,max
<dbl>,<dbl>
1,10
2,10
3,10
4,10
5,10
6,10
7,7
8,6
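
To see what `group_by` itself returns, here is a sketch on made-up rows (`toy`, `grouped`, and `per_season` are illustrative names). The grouped tibble holds the same rows as the original and only records the grouping, which `summarise` then uses.

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(season = c(1, 1, 2, 2, 2), episode = c(1, 3, 2, 5, 9))

# Same rows as `toy`, plus grouping metadata.
grouped <- group_by(toy, season)
group_vars(grouped)

# One output row per group, with several aggregates at once.
per_season <- summarise(grouped, deaths = n(), last_episode = max(episode))
per_season
```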


`%>%` The _pipe_ operator, very important

You can now connect functions in a pipeline: the output of one function is automatically passed as the first argument of the next function.

In [65]:
d %>% group_by(season) %>% summarize(max=max(episode))

season,max
<dbl>,<dbl>
1,10
2,10
3,10
4,10
5,10
6,10
7,7
8,6
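
The benefit grows with the number of steps. A sketch on made-up rows comparing a nested call with the equivalent pipeline (`toy`, `nested`, and `piped` are illustrative names, and the importance threshold of 2 is arbitrary):

```r
library(tibble)
library(dplyr)

# Made-up rows; not taken from the deaths dataset.
toy <- tibble(season = c(1, 1, 2, 2, 2),
              killer = c("A", "B", "A", "A", "C"),
              importance = c(1, 2, 3, 1, 2))

# Nested calls read inside-out.
nested <- summarise(group_by(filter(toy, importance >= 2), season), deaths = n())

# The pipeline reads top to bottom, in the order the steps run.
piped <- toy %>%
  filter(importance >= 2) %>%
  group_by(season) %>%
  summarise(deaths = n())
piped
```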


### Reference

The book cover images are from Amazon:
- R for Data Science: https://www.amazon.com/Data-Science-Transform-Visualize-Model/dp/1491910399
- R in Action