In [1]:
library(tidyverse)

"package 'tidyverse' was built under R version 3.6.3"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2     v purrr   0.3.3
v tibble  3.0.0     v dplyr   1.0.1
v tidyr   1.1.1     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0
"package 'forcats' was built under R version 3.6.3"
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Why ensure that your data is tidy? There are two main advantages:

• There’s a general advantage to picking one consistent way of
storing data. If you have a consistent data structure, it’s easier
to learn the tools that work with it because they have an
underlying uniformity.

• There’s a specific advantage to placing variables in columns
because it allows R’s vectorized nature to shine. As you learned
in “Useful Creation Functions” and “Useful Summary Functions”,
most built-in R functions work with vectors of values. That makes
transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are
designed to work with tidy data.
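
Placing variables in columns is what lets vectorized code work directly on the data. As a minimal sketch, assuming the `table1` dataset that ships with tidyr, a rate per 10,000 people can be computed with a single `mutate()` call. The name `rates` and the scaling factor are only illustrative.

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Each variable is a column, so a new variable is one vectorized expression
# evaluated over whole columns at once.
rates <- table1 %>%
  mutate(rate = cases / population * 10000)

rates
```

The same calculation on `table2` or `table4a` would first require reshaping, which is the subject of the cells that follow.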

Most data that you will encounter will be untidy. There are two
main reasons:

• Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

• Data is often organized to facilitate some use other than analysis. For example, data is often organized to make entry as easy as possible.

In [4]:
table1

country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


In [5]:
table2

country,year,type,count
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272


In [6]:
table3

country,year,rate
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583


In [7]:
table4a

country,1999,2000
Afghanistan,745,2666
Brazil,37737,80488
China,212258,213766


In [8]:
table4b

country,1999,2000
Afghanistan,19987071,20595360
Brazil,172006362,174504898
China,1272915272,1280428583


<h2>Spreading and Gathering</h2>

In [9]:
table4a

country,1999,2000
Afghanistan,745,2666
Brazil,37737,80488
China,212258,213766


In [11]:
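# The column names 1999 and 2000 are values of a variable (year), not variables.
# Backticks are needed because the names do not start with a letter.
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")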

country,year,cases
Afghanistan,1999,745
Brazil,1999,37737
China,1999,212258
Afghanistan,2000,2666
Brazil,2000,80488
China,2000,213766


In [12]:
table4b

country,1999,2000
Afghanistan,19987071,20595360
Brazil,172006362,174504898
China,1272915272,1280428583


In [18]:
# Same reshaping for the population counts.
table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

country,year,population
Afghanistan,1999,19987071
Brazil,1999,172006362
China,1999,1272915272
Afghanistan,2000,20595360
Brazil,2000,174504898
China,2000,1280428583
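
The two gathered tables share the `country` and `year` columns, so they can be combined into a single tidy table. A minimal sketch, assuming dplyr's `left_join()` is an acceptable way to do this (the names `tidy4a`, `tidy4b`, and `combined` are illustrative):

```r
library(dplyr)
library(tidyr)

tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

# Match rows on the identifying columns that both tables share.
combined <- left_join(tidy4a, tidy4b, by = c("country", "year"))

combined
```

Note that `year` is a character column here, because `gather()` takes its values from the old column names.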


In [16]:
table2

country,year,type,count
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272


In [17]:
# Each observation is spread across two rows; move the values of `type`
# into their own columns.
spread(table2, key = type, value = count)

country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583
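
tidyr 1.0.0 and later (the session above loads 1.1.1) also provide `pivot_longer()` and `pivot_wider()`, which the tidyr documentation recommends over `gather()` and `spread()` for new code. The following is a sketch of what the equivalent calls might look like. The object names `long` and `wide` are illustrative.

```r
library(tidyr)

# Counterpart of gather(): column names become values of `year`.
long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

# Counterpart of spread(): values of `type` become column names.
wide <- pivot_wider(table2, names_from = type, values_from = count)

long
wide
```

One visible difference is row order: `pivot_longer()` keeps the rows for each country together, whereas the `gather()` output above lists all 1999 rows before the 2000 rows.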


<h2>Separating and Uniting</h2>

In [19]:
table3

country,year,rate
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583


In [20]:
# By default, separate() splits wherever it finds a non-alphanumeric character.
table3 %>%
  separate(rate, into = c("cases", "population"))

country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


In [21]:
# The separator can also be given explicitly.
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")

country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


In [22]:
# convert = TRUE asks separate() to guess better types for the new columns.
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)

country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583
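
The three outputs above look identical because the column types are not shown. A short check, using illustrative names `as_text` and `as_number`, makes the effect of `convert = TRUE` visible:

```r
library(tidyr)

as_text <- separate(table3, rate, into = c("cases", "population"))
as_number <- separate(table3, rate, into = c("cases", "population"),
                      convert = TRUE)

# Without convert, the new columns keep the type of the original column.
class(as_text$cases)

# With convert = TRUE, the strings are parsed into a numeric type.
class(as_number$cases)
```

Leaving the columns as character would make arithmetic such as `cases / population` fail, so converting is usually what you want here.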


In [24]:
# An integer sep splits by position: here, after the second character.
table3 %>%
  separate(year, into = c("century", "year"), sep = 2)

country,century,year,rate
Afghanistan,19,99,745/19987071
Afghanistan,20,00,2666/20595360
Brazil,19,99,37737/172006362
Brazil,20,00,80488/174504898
China,19,99,212258/1272915272
China,20,00,213766/1280428583


In [25]:
table5 <- table3 %>%
  separate(year, into = c("century", "year"), sep = 2)

In [26]:
# unite() is the inverse of separate(); by default it joins values with "_".
table5 %>%
  unite(new, century, year)

country,new,rate
Afghanistan,19_99,745/19987071
Afghanistan,20_00,2666/20595360
Brazil,19_99,37737/172006362
Brazil,20_00,80488/174504898
China,19_99,212258/1272915272
China,20_00,213766/1280428583


In [27]:
# An empty separator restores the original four-digit year.
table5 %>%
  unite(new, century, year, sep = "")

country,new,rate
Afghanistan,1999,745/19987071
Afghanistan,2000,2666/20595360
Brazil,1999,37737/172006362
Brazil,2000,80488/174504898
China,1999,212258/1272915272
China,2000,213766/1280428583


In [28]:
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

In [29]:
stocks

year,qtr,return
2015,1,1.88
2015,2,0.59
2015,3,0.35
2015,4,NA
2016,2,0.92
2016,3,0.17
2016,4,2.66


In [30]:
stocks %>%
  spread(year, return)

qtr,2015,2016
1,1.88,NA
2,0.59,0.92
3,0.35,0.17
4,NA,2.66


In [31]:
# na.rm = TRUE drops the rows whose value is missing after gathering.
stocks %>%
  spread(year, return) %>%
  gather(year, return, `2015`, `2016`, na.rm = TRUE)

qtr,year,return
1,2015,1.88
2,2015,0.59
3,2015,0.35
2,2016,0.92
3,2016,0.17
4,2016,2.66


In [32]:
stocks %>%
  complete(year, qtr)

year,qtr,return
2015,1,1.88
2015,2,0.59
2015,3,0.35
2015,4,NA
2016,1,NA
2016,2,0.92
2016,3,0.17
2016,4,2.66
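
The `stocks` data has one value that is missing explicitly (an `NA` for 2015 Q4) and one that is missing implicitly (there is no row at all for 2016 Q1). A small sketch, rebuilding the same tibble so the block runs on its own, counts the missing values before and after `complete()`. The name `completed` is illustrative.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# complete() adds a row for every combination of year and qtr,
# filling the new rows with NA.
completed <- complete(stocks, year, qtr)

sum(is.na(stocks$return))     # explicit NAs only
sum(is.na(completed$return))  # explicit NAs plus the formerly implicit one
nrow(completed)
```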


In [33]:
# fill() replaces each NA with the most recent non-missing value above it.
stocks %>%
  fill(return)

year,qtr,return
2015,1,1.88
2015,2,0.59
2015,3,0.35
2015,4,0.35
2016,2,0.92
2016,3,0.17
2016,4,2.66
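
`fill()` carries values downward by default, which is why 2015 Q4 received 0.35 above. Its `.direction` argument changes that. The sketch below rebuilds `stocks` so it is self-contained and compares the two directions. The names `filled_down` and `filled_up` are illustrative.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Default: last observation carried forward.
filled_down <- fill(stocks, return)

# Alternative: next observation carried backward.
filled_up <- fill(stocks, return, .direction = "up")

filled_down$return[4]
filled_up$return[4]
```

Whether either direction is appropriate depends on what the missing value means in the data; for financial returns, copying a neighbouring quarter is rarely a sound imputation and is shown here only to illustrate the mechanics.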
