<b><a href = 'https://tidyr.tidyverse.org/articles/tidy-data.html'>Comprehensive Documentation</a></b>

In [1]:
library(tidyverse)

"package 'tidyverse' was built under R version 3.6.3"
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.4     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.0
"package 'forcats' was built under R version 3.6.3"
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


# Tidy data

In **Tidy data**:
* Each variable forms a column.

* Each observation forms a row.

* Each type of observational unit forms a table.

# Tidying messy datasets

This section describes the five most common problems with messy datasets, along with their remedies:

*  Column headers are values, not variable names.

*  Multiple variables are stored in one column.

*  Variables are stored in both rows and columns.

*  Multiple types of observational units are stored in the same table.

*  A single observational unit is stored in multiple tables.

### Column headers are values, not variables

In [2]:
relig_income %>% head()

religion,<$10k,$10-20k,$20-30k,$30-40k,$40-50k,$50-75k,$75-100k,$100-150k,>150k,Don't know/refused
Agnostic,27,34,60,81,76,137,122,109,84,96
Atheist,12,27,37,52,35,70,73,59,74,76
Buddhist,27,21,30,34,33,58,62,39,53,54
Catholic,418,617,732,670,638,1116,949,792,633,1489
Don’t know/refused,15,14,15,11,10,35,21,17,18,116
Evangelical Prot,575,869,1064,982,881,1486,949,723,414,1529


This dataset has three variables, `religion`, `income` and `frequency`. To tidy it, we need to pivot the non-variable columns into a two-column key-value pair. This action is often described as making a wide dataset longer (or taller).

In [3]:
relig_income %>% pivot_longer(!religion, names_to = 'income', values_to = 'frequency') %>% head()

religion,income,frequency
Agnostic,<$10k,27
Agnostic,$10-20k,34
Agnostic,$20-30k,60
Agnostic,$30-40k,81
Agnostic,$40-50k,76
Agnostic,$50-75k,137

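A quick sanity check on a pivot like this is that every cell of the wide table becomes exactly one row of the long table, and that the counts themselves are unchanged. The following is a minimal sketch, assuming `relig_income` is the dataset shipped with tidyr; the names `long` and `n_value_cols` and the checks themselves are illustrative and not part of the original notebook.

```r
library(tidyr)
library(dplyr)

# Same pivot as above, kept in a variable so it can be inspected
long <- relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "frequency")

# Every column except `religion` contributes one row per religion
n_value_cols <- ncol(relig_income) - 1
nrow(long) == nrow(relig_income) * n_value_cols

# The total number of respondents is unchanged by the reshaping
sum(long$frequency) == sum(as.matrix(relig_income[-1]))
```

Both comparisons should evaluate to `TRUE`; if either does not, the column selection passed to `pivot_longer()` is the first thing to inspect.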

<hr>

In [4]:
billboard %>% head()

artist,track,date.entered,wk1,wk2,wk3,wk4,wk5,wk6,wk7,...,wk67,wk68,wk69,wk70,wk71,wk72,wk73,wk74,wk75,wk76
2 Pac,Baby Don't Cry (Keep...,2000-02-26,87,82,72,77.0,87.0,94.0,99.0,...,,,,,,,,,,
2Ge+her,The Hardest Part Of ...,2000-09-02,91,87,92,,,,,...,,,,,,,,,,
3 Doors Down,Kryptonite,2000-04-08,81,70,68,67.0,66.0,57.0,54.0,...,,,,,,,,,,
3 Doors Down,Loser,2000-10-21,76,76,72,69.0,67.0,65.0,55.0,...,,,,,,,,,,
504 Boyz,Wobble Wobble,2000-04-15,57,34,25,17.0,17.0,31.0,36.0,...,,,,,,,,,,
98^0,Give Me Just One Nig...,2000-08-19,51,39,34,26.0,26.0,19.0,2.0,...,,,,,,,,,,


To tidy this dataset, we first use `pivot_longer()` to make it longer. We transform the columns `wk1` to `wk76` into two new columns: `week`, which holds their names, and `rank`, which holds their values:

In [11]:
# values_drop_na = TRUE drops the rows whose rank is missing.
# In this data, a missing value means the song was not in the charts that week, so those rows can be safely dropped.
billboard1 <- billboard %>% pivot_longer(starts_with('wk'),
                          names_to = 'week',
                          values_to = 'rank',
                          names_prefix = 'wk',                       # remove prefix 'wk' from the values in column `week`
                          names_transform = list(week = as.integer), # convert character to integer, e.g. '1' to 1
                          values_drop_na = TRUE)                     # drop rows having NA in column `rank`

billboard1

artist,track,date.entered,week,rank
2 Pac,Baby Don't Cry (Keep...,2000-02-26,1,87
2 Pac,Baby Don't Cry (Keep...,2000-02-26,2,82
2 Pac,Baby Don't Cry (Keep...,2000-02-26,3,72
2 Pac,Baby Don't Cry (Keep...,2000-02-26,4,77
2 Pac,Baby Don't Cry (Keep...,2000-02-26,5,87
2 Pac,Baby Don't Cry (Keep...,2000-02-26,6,94
2 Pac,Baby Don't Cry (Keep...,2000-02-26,7,99
2Ge+her,The Hardest Part Of ...,2000-09-02,1,91
2Ge+her,The Hardest Part Of ...,2000-09-02,2,87
2Ge+her,The Hardest Part Of ...,2000-09-02,3,92

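Because `week` is now an integer, it can be used in arithmetic. As a hedged sketch, assuming `date.entered` is the date of the song's first week in the charts and that weeks are exactly seven days apart, the calendar date of each ranking can be derived as follows. The names `billboard_long` and `billboard_dated` and the `date` column are illustrative additions, not part of the original notebook.

```r
library(tidyr)
library(dplyr)

# Same pivot as above, rebuilt here so the block runs on its own
billboard_long <- billboard %>%
  pivot_longer(starts_with("wk"),
               names_to = "week",
               values_to = "rank",
               names_prefix = "wk",
               names_transform = list(week = as.integer),
               values_drop_na = TRUE)

# Assumption: week 1 falls on date.entered and each later week is 7 days on
billboard_dated <- billboard_long %>%
  mutate(date = as.Date(date.entered) + 7 * (week - 1))

head(billboard_dated)
```

By construction, rows with `week == 1` have `date` equal to `date.entered`.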

### Multiple variables stored in one column

In [6]:
who %>% head()

country,iso2,iso3,year,new_sp_m014,new_sp_m1524,new_sp_m2534,new_sp_m3544,new_sp_m4554,new_sp_m5564,...,newrel_m4554,newrel_m5564,newrel_m65,newrel_f014,newrel_f1524,newrel_f2534,newrel_f3544,newrel_f4554,newrel_f5564,newrel_f65
Afghanistan,AF,AFG,1980,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1981,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1982,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1983,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1984,,,,,,,...,,,,,,,,,,
Afghanistan,AF,AFG,1985,,,,,,,...,,,,,,,,,,


In [7]:
who %>% pivot_longer(
    starts_with('new'),
    names_pattern = 'new_?(.*)_(.)(.*)',
    names_to = c('diagnosis', 'sex', 'age'),
    values_to = 'count'
) %>% head()

country,iso2,iso3,year,diagnosis,sex,age,count
Afghanistan,AF,AFG,1980,sp,m,014,
Afghanistan,AF,AFG,1980,sp,m,1524,
Afghanistan,AF,AFG,1980,sp,m,2534,
Afghanistan,AF,AFG,1980,sp,m,3544,
Afghanistan,AF,AFG,1980,sp,m,4554,
Afghanistan,AF,AFG,1980,sp,m,5564,

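The `names_pattern` argument is a regular expression with three capture groups, one for each name in `names_to`. To see how it splits a column name, the sketch below applies the same pattern to a few of the `who` column names with `stringr::str_match()`. This is only an illustration: it assumes the stringr package is available, the labels assigned with `colnames()` are hypothetical, and tidyr's internal matching is not necessarily implemented this way.

```r
library(stringr)

# A few column names taken from the `who` header shown above
cols <- c("new_sp_m014", "new_sp_m1524", "newrel_f65")

# Same pattern as in the pivot_longer() call:
#   new_?  the literal prefix, with an optional underscore
#   (.*)   the diagnosis, up to the last underscore
#   (.)    a single character for sex
#   (.*)   the rest, the age group
pattern <- "new_?(.*)_(.)(.*)"

# str_match() returns the full match in column 1 and the capture groups after it
parts <- str_match(cols, pattern)
colnames(parts) <- c("column", "diagnosis", "sex", "age")
parts
```

The optional underscore in `new_?` is what lets the pattern handle both `new_sp_...` and `newrel_...` names.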

### Variables are stored in both rows and columns

In [8]:
weather <- data.frame(
    year = c(2010, 2010, 2010, 2010,  2011, 2011, 2011, 2011),
    month = c(1, 1, 2, 2, 1, 1, 2, 2),
    element = rep(c('tmin', 'tmax'), 4),
    d1 = 1:8,
    d2 = 1:8,
    d3 = 1:8
)

weather

year,month,element,d1,d2,d3
2010,1,tmin,1,1,1
2010,1,tmax,2,2,2
2010,2,tmin,3,3,3
2010,2,tmax,4,4,4
2011,1,tmin,5,5,5
2011,1,tmax,6,6,6
2011,2,tmin,7,7,7
2011,2,tmax,8,8,8


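This data frame has variables stored in individual columns (`year`, `month`), spread across columns (`day`, held in `d1` to `d3`) and spread across rows (`tmin` and `tmax`, the minimum and maximum temperature). To tidy it, we first pivot the day columns into rows, then pivot the values of `element` into their own columns.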

In [9]:
weather1 <- weather %>% pivot_longer(
    starts_with('d'),
    names_to = 'day',
    names_prefix = 'd',
    names_transform = list(day = as.integer),
    values_to = 'temperature'
)

weather1

year,month,element,day,temperature
2010,1,tmin,1,1
2010,1,tmin,2,1
2010,1,tmin,3,1
2010,1,tmax,1,2
2010,1,tmax,2,2
2010,1,tmax,3,2
2010,2,tmin,1,3
2010,2,tmin,2,3
2010,2,tmin,3,3
2010,2,tmax,1,4


In [10]:
weather1 %>% pivot_wider(
    names_from = element,
    values_from = temperature
)

year,month,day,tmin,tmax
2010,1,1,1,2
2010,1,2,1,2
2010,1,3,1,2
2010,2,1,3,4
2010,2,2,3,4
2010,2,3,3,4
2011,1,1,5,6
2011,1,2,5,6
2011,1,3,5,6
2011,2,1,7,8


This form is tidy: there’s one variable in each column, and each row represents one day.

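One practical benefit of the tidy form is that calculations across variables become simple column operations. The sketch below rebuilds the toy `weather` data from above and computes a daily temperature range; the names `weather_tidy` and `weather_range` and the `range` column are illustrative additions, not part of the original notebook.

```r
library(tidyr)
library(dplyr)

# Recreate the toy weather data from above
weather <- data.frame(
  year = rep(c(2010, 2011), each = 4),
  month = rep(c(1, 1, 2, 2), 2),
  element = rep(c("tmin", "tmax"), 4),
  d1 = 1:8,
  d2 = 1:8,
  d3 = 1:8
)

# Longer first (days into rows), then wider (tmin/tmax into columns)
weather_tidy <- weather %>%
  pivot_longer(starts_with("d"),
               names_to = "day",
               names_prefix = "d",
               names_transform = list(day = as.integer),
               values_to = "temperature") %>%
  pivot_wider(names_from = element, values_from = temperature)

# With tmin and tmax in their own columns, the range is a plain mutate()
weather_range <- weather_tidy %>%
  mutate(range = tmax - tmin)

weather_range
```

In the original layout the same calculation would require matching each `tmin` row with its `tmax` row by hand.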
### Multiple types in one table

In [12]:
billboard1

artist,track,date.entered,week,rank
2 Pac,Baby Don't Cry (Keep...,2000-02-26,1,87
2 Pac,Baby Don't Cry (Keep...,2000-02-26,2,82
2 Pac,Baby Don't Cry (Keep...,2000-02-26,3,72
2 Pac,Baby Don't Cry (Keep...,2000-02-26,4,77
2 Pac,Baby Don't Cry (Keep...,2000-02-26,5,87
2 Pac,Baby Don't Cry (Keep...,2000-02-26,6,94
2 Pac,Baby Don't Cry (Keep...,2000-02-26,7,99
2Ge+her,The Hardest Part Of ...,2000-09-02,1,91
2Ge+her,The Hardest Part Of ...,2000-09-02,2,87
2Ge+her,The Hardest Part Of ...,2000-09-02,3,92


This data frame wastes memory because `artist`, `track` and `date.entered` are repeated on every row for the same song. We can split it into two data frames:
* One data frame stores each unique `artist` and `track`
* One data frame stores the rank of each song in each week
* The two data frames are linked by an `id` column

In [26]:
# Create a unique id for each song
billboard2 <- billboard1 %>%
  group_by(artist, track) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()

billboard2


artist,track,date.entered,week,rank,id
2 Pac,Baby Don't Cry (Keep...,2000-02-26,1,87,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,2,82,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,3,72,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,4,77,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,5,87,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,6,94,1
2 Pac,Baby Don't Cry (Keep...,2000-02-26,7,99,1
2Ge+her,The Hardest Part Of ...,2000-09-02,1,91,2
2Ge+her,The Hardest Part Of ...,2000-09-02,2,87,2
2Ge+her,The Hardest Part Of ...,2000-09-02,3,92,2


In [27]:
song <- billboard2 %>% distinct(artist, track, id)
song

artist,track,id
2 Pac,Baby Don't Cry (Keep...,1
2Ge+her,The Hardest Part Of ...,2
3 Doors Down,Kryptonite,3
3 Doors Down,Loser,4
504 Boyz,Wobble Wobble,5
98^0,Give Me Just One Nig...,6
A*Teens,Dancing Queen,7
Aaliyah,I Don't Wanna,8
Aaliyah,Try Again,9
"Adams, Yolanda",Open My Heart,10


In [31]:
song_ranking <- billboard2 %>% select(song_id = id, week, rank)

song_ranking

song_id,week,rank
1,1,87
1,2,82
1,3,72
1,4,77
1,5,87
1,6,94
1,7,99
2,1,91
2,2,87
2,3,92

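To confirm that the split keeps the link between songs and rankings, the two tables can be joined back together on the id. The following is a minimal sketch that rebuilds the objects above so it runs on its own; the name `rejoined` and the checks are illustrative additions, not part of the original notebook.

```r
library(tidyr)
library(dplyr)

# Rebuild the long table and the per-song id exactly as above
billboard1 <- billboard %>%
  pivot_longer(starts_with("wk"),
               names_to = "week",
               values_to = "rank",
               names_prefix = "wk",
               names_transform = list(week = as.integer),
               values_drop_na = TRUE)

billboard2 <- billboard1 %>%
  group_by(artist, track) %>%
  mutate(id = cur_group_id()) %>%
  ungroup()

song <- billboard2 %>% distinct(artist, track, id)
song_ranking <- billboard2 %>% select(song_id = id, week, rank)

# Look up artist and track for every ranking row via the shared id
rejoined <- song_ranking %>%
  left_join(song, by = c("song_id" = "id"))

# The join should neither add nor lose rows
nrow(rejoined) == nrow(billboard1)
```

Note that `date.entered` appears in neither `song` nor `song_ranking` as defined above, so the join cannot recover it. Adding it to the `distinct()` call would be one way to keep it, assuming each song has a single entry date.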

### One type in multiple tables