# Preamble

Most of this workshop is adapted from Hadley Wickham and Garrett Grolemund's [R for Data Science](https://r4ds.had.co.nz/) book. You can find more examples, explanations, and exercises there if you want.

# Package prerequisites

The only package that is required in this workshop is **tidyverse**, which includes the packages **ggplot2**, **dplyr**, **purrr**, and others designed for the process of tidying, transforming, and visualizing data with the tidyverse paradigm.

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
“package ‘tibble’ was built under R version 3.5.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


If you get an error message “there is no package called ‘tidyverse’” then you need to install tidyverse first. (This is something you should have done before coming to the workshop, but it's ok if you didn't and it won't take long.)

In [2]:
#install.packages('tidyverse')

# Importing Data

## tibble

The core unit of the tidyverse is the tibble, provided by the **tibble** package. Tibbles are data frames with a few different behaviors:
 * Printing a tibble shows only the first 10 rows instead of all of them, and only the columns that fit on screen
 * Each column reports its type
 * The printout also reports the dimensions of the data frame
 * Subsetting does not allow partial matching of column names

Note that the printing differences show up in the R console and RStudio. In Jupyter notebooks, tibbles are displayed the same way as data frames by design.
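To make the partial matching point concrete, here is a minimal sketch. The data frame `df` and its column names are made up for illustration; the exact wording of the warning may differ between tibble versions.

```r
library(tibble)

# Hypothetical toy data, not part of the workshop files
df <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tb <- as_tibble(df)

# A data frame partially matches column names with `$`,
# so `df$a` silently returns the `abc` column
df$a

# A tibble does not: `tb$a` returns NULL and warns about an unknown column
tb$a

# Asking for the full column name works in both cases
tb$abc
```

This stricter behavior is usually what you want, since a typo in a column name is reported instead of silently matching something else.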

In [6]:
mtcars

mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [17]:
#as_tibble(mtcars)
as_tibble(head(mtcars))

mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


## readr

The package **readr** provides functions for importing data that is already laid out as a table, such as CSVs:
 * `read_csv()` is for comma-delimited files
 * `read_csv2()` is for semicolon-delimited files
 * `read_tsv()` is for tab-delimited files
 * `read_delim()` is for any other delimiters
 * `read_fwf()` is for fixed width files
 * `read_table()` is for fixed width files with columns separated by white space

If you are used to `read.csv()`, `read.table()`, or other functions from **utils**, these work almost exactly the same way, except that they import data as tibbles instead of data frames. They are also typically much faster (around 10x), show progress bars when loading larger files, and report the type guessed for each column.
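As a small sketch of the functions listed above, the following writes a pipe-delimited file to a temporary path and reads it back with `read_delim()`. The file contents and the names `path` and `scores` are assumptions made up for this example.

```r
library(readr)

# Write a hypothetical pipe-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("id|name|score",
             "1|ann|3.5",
             "2|bob|4.0"), path)

# read_delim() handles any single-character delimiter via `delim`
scores <- read_delim(path, delim = "|")

# The result is a tibble, and each column has a guessed type
class(scores)
scores
```

For comma-, semicolon-, or tab-delimited files you would swap in `read_csv()`, `read_csv2()`, or `read_tsv()` and drop the `delim` argument.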

If your file is massive and you *really* need something as fast as possible, look into the **data.table** package (or consider writing your core functions in C++ with the **Rcpp** package). Data tables don't fit neatly into the tidyverse, but the extra learning curve is small.

In [18]:
mtcars <- read_csv(readr_example("mtcars.csv"))

Parsed with column specification:
cols(
  mpg = col_double(),
  cyl = col_double(),
  disp = col_double(),
  hp = col_double(),
  drat = col_double(),
  wt = col_double(),
  qsec = col_double(),
  vs = col_double(),
  am = col_double(),
  gear = col_double(),
  carb = col_double()
)


These functions have other optional arguments that can be helpful as well.

 * `delim`: in `read_delim()` denotes what delimiter to use
 * `col_types`: if you know what column types you want your tibble to have, specify them here. This can be useful because readr guesses each column's type from the first rows of the file, so a column may get parsed as `character` or `logical` when you want it to be `numeric` or a date.
 * `skip`: how many lines to skip. Default is 0. Useful for when the file you're reading in has comments at the beginning, such as in SAM files.
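Here is a hedged sketch of `skip` and `col_types` working together. The file, its two comment lines, and the column names are hypothetical and only serve to illustrate the arguments.

```r
library(readr)

# A hypothetical file with two comment lines before the header
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# exported by a made-up instrument",
  "# second comment line",
  "sample,count,date",
  "a,10,2019-01-15",
  "b,12,2019-02-01"
), path)

samples <- read_csv(
  path,
  skip = 2,                      # ignore the two comment lines
  col_types = cols(
    sample = col_character(),
    count  = col_integer(),      # would be guessed as double otherwise
    date   = col_date()          # ISO dates (YYYY-MM-DD) by default
  )
)
samples
```

Spelling out `col_types` also silences the "Parsed with column specification" message, since nothing has to be guessed.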

We won't go through the other arguments, but you can view them in the help for the functions.

In [20]:
?read_csv

## Exercises

1. Read in the file `exercises/import1.csv` as a tibble with 3 columns with the appropriate function.
2. Suppose we are trying to read the following text into a tibble with two columns, `x` and `y`, where the single row holds the values `1` and `a,b`.
```
"x,y\n1,'a,b'"
```
What arguments do you need to specify for `read_csv()` to do so properly?
3. What is wrong with each of the following inline CSV files?
```{r}
read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
```

## Solutions

In [22]:
challenge <- read_csv(readr_example("challenge.csv"))
problems(challenge)

Parsed with column specification:
cols(
  x = col_double(),
  y = col_logical()
)
“1000 parsing failures.
 row col           expected     actual                                                                                         file
1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
.... ... .................. .......... ...............................................................

row,col,expected,actual,file
1001,y,1/0/T/F/TRUE/FALSE,2015-01-16,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1002,y,1/0/T/F/TRUE/FALSE,2018-05-18,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1003,y,1/0/T/F/TRUE/FALSE,2015-09-05,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1004,y,1/0/T/F/TRUE/FALSE,2012-11-28,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1005,y,1/0/T/F/TRUE/FALSE,2020-01-13,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1006,y,1/0/T/F/TRUE/FALSE,2016-04-17,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1007,y,1/0/T/F/TRUE/FALSE,2011-05-14,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1008,y,1/0/T/F/TRUE/FALSE,2020-07-18,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1009,y,1/0/T/F/TRUE/FALSE,2011-04-30,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
1010,y,1/0/T/F/TRUE/FALSE,2010-05-11,'/Library/Frameworks/R.framework/Versions/3.5/Resources/library/readr/extdata/challenge.csv'
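The failures above suggest that `y` was guessed to be logical from the first rows, while later rows hold dates. One possible fix, assuming `y` really is a date column as the `actual` values indicate, is to state the column types explicitly. The name `challenge_fixed` is just a label for this sketch.

```r
library(readr)

# Tell readr the column types instead of letting it guess
challenge_fixed <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

# With explicit types there should be no parsing problems left
problems(challenge_fixed)
tail(challenge_fixed)
```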


# Tidying Data

Consider the following tables:

In [24]:
table1 <- data.frame(treatmenta=c(NA,16,3),treatmentb=c(2,11,1))
rownames(table1) <- c("John Smith", "Jane Doe", "Mary Johnson")
table1

,treatmenta,treatmentb
John Smith,,2
Jane Doe,16.0,11
Mary Johnson,3.0,1


In [25]:
table2 <- data.frame(patient=c("John Smith","John Smith","Jane Doe","Jane Doe","Mary Johnson","Mary Johnson"),
                     treatment=c("a","b","a","b","a","b"),
                     result=c(NA,2,16,11,3,1))
table2

patient,treatment,result
John Smith,a,
John Smith,b,2.0
Jane Doe,a,16.0
Jane Doe,b,11.0
Mary Johnson,a,3.0
Mary Johnson,b,1.0
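The two tables hold the same measurements in different shapes: one row per patient versus one row per patient and treatment. As a rough sketch of how to get from the first shape to the second, the following uses `rownames_to_column()` from **tibble** and `gather()` from **tidyr**. The intermediate names `wide` and `long` are made up here, and the treatment labels come out as `treatmenta`/`treatmentb` rather than `a`/`b`.

```r
library(tibble)
library(tidyr)

# Rebuild the wide table so this block runs on its own
table1 <- data.frame(treatmenta = c(NA, 16, 3), treatmentb = c(2, 11, 1))
rownames(table1) <- c("John Smith", "Jane Doe", "Mary Johnson")

# Move the row names into a proper column
wide <- rownames_to_column(table1, var = "patient")

# Stack the two treatment columns into key/value pairs
long <- gather(wide, key = "treatment", value = "result",
               treatmenta, treatmentb)
long
```

Each patient now appears once per treatment, with the missing result for John Smith kept as `NA`.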


In [None]:
table3 <- data.frame(patient=c("John Smith",))

## Gathering
## Spreading
## Separating
## Uniting

# Transforming Data

## Filter rows with `filter()`

## Arrange rows with `arrange()`

## Select columns with `select()`

## Add new variables with `mutate()`

## Grouped summaries with `summarise()`

# Visualizing Data

The ultimate tool of 

# Publication Quality Graphs

# Some other useful visualization packages

We don't have time in this workshop to go into depth on other packages, but here are some more useful visualization packages that may be helpful for your research.

## ggtree for phylogenetics

Resources and associated packages:
 * [Data Integration, Manipulation and Visualization of Phylogenetic Trees](https://yulab-smu.github.io/treedata-book/index.html)
 * [treeio](https://bioconductor.org/packages/release/bioc/html/treeio.html)
 * [tidytree](https://cran.r-project.org/web/packages/tidytree/index.html)

In [13]:
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.8     purrr_0.2.5    
[5] readr_1.3.1     tidyr_0.8.2     tibble_2.0.1    ggplot2_3.1.0  
[9] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       cellranger_1.1.0 plyr_1.8.4       pillar_1.3.1    
 [5] compiler_3.5.1   bindr_0.1.1      base64enc_0.1-3  tools_3.5.1     
 [9] digest_0.6.18    uuid_0.1-2       lubridate_1.7.4  jsonlite_1.6    
[13] evaluate_0.12    nlme_3.1-137     gtable_0.2.0     lattice_0.20-38 
[17] pkgconf