<table width='100%'><tr>
    <td style='background-color:red; text-align:center; color: white;'><!--Foundation<!--hr size='5' style='border-color:red; background-color:red;'--></td>
    <td style='background-color:yellow; text-align:center;'><!--Level 1<!--hr size='5' style='border-color:yellow; background-color:yellow;'--></td>
    <td style='background-color:orange; text-align:center;'><!--Level 2<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
    <td style='background-color:green; text-align:center; color: white;'><!--Level 3<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
    <td style='background-color:blue; text-align:center; color: white;'><!--Level 4<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
    <td style='background-color:purple; text-align:center; color: white;'><!--Level 5<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
    <td style='background-color:brown; text-align:center; color: white;'><!--Level 6<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
    <td style='background-color:black; text-align:center; color: white;'><!--Level 7<!--hr size='5' style='border-color:orange; background-color:orange;'--></td>
</tr></table>

<table style='border-left:10px solid orange;'><tr>
    <td style='padding-left:20px;'>
        <h2><i>Swansea University Medical School</i><br/><b>MSc Health Data Science</b></h2>
        <h3>PMIM-102 Introduction to Scientific Computing in Healthcare</h3>
        <h1><b>Introduction to Programming in R</b></h1>
        <h2><b>3. The Tidyverse</b></h2>
        <h2><i>Part 2: Tidying the data.</i></h2>
        <h3><i>September 2020</i></h3>
        <h3><b>To-do</b></h3>
        <ul>
            <li>Find a challenge dataset for the tidyr functions.</li>
        </ul>
    </td>
    <td><img height='300' width='500' src='images/cover.jpg'/></td>
</tr></table>

## __Aim__: Use the tools available in R to manipulate tables of data.

The aim of this session is to concentrate on the core activities in working with large datasets: moving, cleaning and transforming table data to facilitate analyses. Whilst this is possible using base-R, the facilities provided by the libraries in the __Tidyverse__ make it considerably __easier__ and the resulting code __more readable__.

### __A map of where we're going__

1. <b>Introduction</b> - What is the process, the problems with standard R and the structure of 'tidy' data.

1. <b>Acquiring data</b> - Getting data into R from files (<b>readr</b>).

1. <div style="background-color:yellow;"><b>Tidying the data</b> - Handling missing data and reshaping the tables (<b>tidyr</b>).</div>

1. <b>Transforming the data</b> - Selecting and converting the data ready to analyse (<b>dplyr</b>).

1. <b>Working with specific data types in tidyverse</b>: strings (<b>stringr</b>), dates (<b>lubridate</b>), factors (<b>forcats</b>).

1. <b>Plotting &amp; Data visualisation</b> - beyond the simple R plot etc. (<b>ggplot2</b>).

1. <b>Extras</b> - Things worth knowing about so that you can use them if you ever need them.
 * Applying functions and working with lists (purrr).
 * Tidy evaluation (rlang).
 * Communicating your results with a dynamic, R-based website (shiny).

## __Load the Tidyverse__

The first thing to do is make sure the library is loaded. If you have not already installed it, do so now using the <code>install.packages()</code> function.

In [1]:
## install.packages('tidyverse')
#library(tidyverse)

## __Making data tidy (tidyr)__

[See the reverse of the readr cheatsheet.](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)

### _Missing data_
Missing data is not necessarily missing information. Child maltreatment, for example, often leads to a lack of visits to GPs, so the absence of records is itself informative and we might need a way to record zero codes explicitly. More generally, we will be using data collected imperfectly by people who are under pressure and whose primary job is not to collect data for us, so even well-curated datasets will contain missing data (<code>NA</code> in R).

There are some computational techniques, which we may cover (time permitting) in the Health Data Modelling module (PMIM202), that apply in certain circumstances and use Multiple Imputation to reconstruct missing data from the values that are present. To do this we need to know that it is a valid, unbiased thing to do, and that is not straightforward. Here we'll consider only cleaning up the dataset rather than attempting to reconstruct it.

We may just keep these cases and omit them in specific analyses, or we may need to do something with a specific column. There are three obvious things we might do with NAs in a column (and you can probably imagine a number of others):

* delete the cases with missing data
* replace the missing values in the column with a specified value
* work down the column filling any missing value with the last preceding value
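
Before choosing between these options, it can help to see how much is missing and where. The following is a minimal sketch using base R only; the data frame `monsters` is a made-up illustration, not course data.

```r
# Hypothetical example data: NA marks a missing value
monsters <- data.frame(
  creature  = c('werewolf', 'zombie', 'vampyr', 'troll', 'jabberwocky'),
  sightings = c(6, 120, NA, 1, 0),
  victims   = c(32, 27, NA, NA, 1012),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(monsters))
print(na_per_column)

# Logical vector: TRUE for rows with no missing values at all
complete_rows <- complete.cases(monsters)
print(sum(complete_rows))
```

Columns with only a few NAs may be candidates for dropping rows, whereas a column that is mostly NA probably needs a different decision.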

### _Dropping and filling data_
<code>drop_na()</code> will, as expected, drop either all rows containing an NA or all rows with an NA in the specified columns. <code>replace_na()</code> will replace NAs in columns with specified values and <code>fill()</code> will fill missing values with the last preceding value (useful for sequences or grouped data, perhaps).

In [2]:
## library(tidyr)
#df <- data.frame(creature=c('werewolf', 'zombie', 'vampyr', 'troll', 'jabberwocky'), sightings=c(6, 120, NA, 1, 0), victims=c(32, 27, NA, NA, 1012))
#df <- df[order(df$sightings), ]
#print(df)

#print(drop_na(df, sightings))
#print(drop_na(df))

#print(replace_na(df, list(sightings=1, victims=0)))

#print(fill(df, victims))
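
For grouped data, <code>fill()</code> can be combined with <code>group_by()</code> from dplyr so that values are not carried across group boundaries. This is a sketch under assumptions: the `visits` data frame and its column names are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the ward is not recorded on every row
visits <- data.frame(
  patient = c('A', 'A', 'A', 'B', 'B'),
  day     = c(1, 2, 3, 1, 2),
  ward    = c('cardiology', NA, NA, NA, 'renal'),
  stringsAsFactors = FALSE
)

# Without grouping, patient B's first row would inherit 'cardiology'
# from patient A. Grouping first keeps the fill inside each patient.
filled <- visits %>%
  group_by(patient) %>%
  fill(ward, .direction = 'down') %>%
  ungroup()

print(filled)
```

Patient B's first row stays NA because there is no earlier value within that group to copy down.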

### _Expanding data to include missing cases_
We can expand the cases to include combinations that are missing.

In [3]:
#df <- data.frame(weapon=c('sword', 'axe', 'spear'), material=c('diamond', 'iron', 'bronze'), size=c('two-hand', 'one-hand', 'assassin'), count=c(1, 2, 3))
#print(df)
#complete(df, weapon, material, fill=list(size='assassin'))
#expand(df, weapon, material, size)
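
The difference between the two functions can be seen in a smaller sketch. The `stock` data frame below is hypothetical: <code>complete()</code> keeps the existing data and adds rows for the missing combinations, while <code>expand()</code> returns only the combinations.

```r
library(tidyr)

# Hypothetical stock list: only three of the nine weapon/material pairs are present
stock <- data.frame(
  weapon   = c('sword', 'axe', 'spear'),
  material = c('diamond', 'iron', 'bronze'),
  count    = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# complete() adds a row for every missing combination;
# fill= supplies the value to use instead of NA in the new rows
all_pairs <- complete(stock, weapon, material, fill = list(count = 0))
print(all_pairs)

# expand() returns just the combinations, without the other columns
pairs_only <- expand(stock, weapon, material)
print(pairs_only)
```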

### _Extracting multiple values from one column into one or multiple columns_
Occasionally we will get data where two values have been combined in one column (usually for convenience or brevity). For example, we might get a region and year as 'Asia/2017' or just a series of values 'enrolled/submitted/passed/graduated'. We might then want to split these values as separate rows or as separate columns. These can be split into separate columns using <code>separate()</code> or into rows in the existing column using <code>separate_rows()</code>.



In [4]:
#df <- data.frame(episode=c(1, 1, 1), attack=c(1, 2, 3), kills=c('1/5', '3/7', '1/3'))
#print(df)
#print(separate(df, kills, into=c('red-shirts', 'total-crew-at-risk')))
#print(separate_rows(df, kills))
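
By default <code>separate()</code> splits on any run of non-alphanumeric characters, which is why no separator is needed above. The sketch below uses the 'region/year' pattern from the text with an explicit <code>sep</code>, and <code>convert=TRUE</code> to turn the year into a number. The `cases` data frame is invented for illustration.

```r
library(tidyr)

# Hypothetical data following the 'Asia/2017' pattern
cases <- data.frame(
  region_year = c('Asia/2017', 'Europe/2018'),
  outcomes    = c('enrolled/submitted', 'enrolled/submitted/passed'),
  stringsAsFactors = FALSE
)

# Split one column into two; convert = TRUE turns the year from text into an integer
by_column <- separate(cases, region_year, into = c('region', 'year'),
                      sep = '/', convert = TRUE)
print(by_column)

# Split the other column into one row per value
by_row <- separate_rows(cases, outcomes, sep = '/')
print(by_row)
```

Note that <code>separate()</code> needs to know the number of new columns in advance, whereas <code>separate_rows()</code> copes with a different number of values in each cell.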

### _Compressing data from multiple cells into one_
We can do the reverse and combine the cells into a single cell with a separator character using <code>unite()</code>.

In [5]:
#print(unite(df, episode, attack, col="event", sep="#"))
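
A self-contained sketch of the same idea, using a made-up `attacks` data frame. Setting <code>remove=FALSE</code> keeps the original columns alongside the new one, which can be useful for checking the result.

```r
library(tidyr)

# Hypothetical data: episode and attack numbers held in separate columns
attacks <- data.frame(
  episode = c(1, 1, 2),
  attack  = c(1, 2, 1),
  kills   = c(1, 3, 5)
)

# Combine the two columns into a single 'event' column,
# keeping the originals for comparison
events <- unite(attacks, col = 'event', episode, attack, sep = '#', remove = FALSE)
print(events)
```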

### _Reshaping data - expanding columns with categories into multiple columns for each category_
Often you will receive data with a column that contains multiple rows for the same case listing categories that apply to that case. To make this tidy, we often want to spread these categories across multiple columns so that we end up (or work towards ending up) with one row for each case and one column for each category.
The tidyverse lets us do this with <code>spread()</code>. You need to specify the data frame, the column that contains the categories (the categories will become the new column names) and the data that needs to be included in each category.

In [6]:
#df <- data.frame(event=c('1#1', '1#1', '1#1', '1#2', '1#2', '1#2', '2#1', '2#1', '2#1'), rank=rep(c('redshirts', 'yellowshirts', 'blueshirts')),
#                 kills=c(1, 1, 0, 3, 1, 0, 5, 1, 0))
#print(df)
#print(spread(df, rank, kills))
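
In tidyr 1.0.0 and later, <code>spread()</code> still works but has been superseded by <code>pivot_wider()</code>. Assuming a recent enough version of tidyr is installed, the same reshaping might be sketched as follows. The `long` data frame repeats the example above.

```r
library(tidyr)

# One row per event and rank
long <- data.frame(
  event = rep(c('1#1', '1#2', '2#1'), each = 3),
  rank  = rep(c('redshirts', 'yellowshirts', 'blueshirts'), times = 3),
  kills = c(1, 1, 0, 3, 1, 0, 5, 1, 0),
  stringsAsFactors = FALSE
)

# names_from: the column whose values become the new column names
# values_from: the column whose values fill the new columns
wide <- pivot_wider(long, names_from = rank, values_from = kills)
print(wide)
```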

### _Reshaping data - bringing columns together into one_
You can gather data that has been organised in multiple columns of similar data into multiple rows with two columns, the first of which contains the name of the column and the second the value from that column for that row.

In [7]:
#df <- data.frame(event=c('1#1', '1#2', '2#1'), redshirts=c(1, 3, 5), yellowshirts=c(1, 1, 1), blueshirts=c(0, 0, 0))
#print(df)
#print(gather(df, redshirts, yellowshirts, blueshirts, key='rank', value='kills'))
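
Likewise, <code>gather()</code> has been superseded by <code>pivot_longer()</code> in tidyr 1.0.0 and later. A sketch of the equivalent call, assuming that version is available and reusing the data from the example above:

```r
library(tidyr)

# One row per event, one column per rank
wide <- data.frame(
  event        = c('1#1', '1#2', '2#1'),
  redshirts    = c(1, 3, 5),
  yellowshirts = c(1, 1, 1),
  blueshirts   = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# cols: the columns to gather
# names_to: new column holding the old column names
# values_to: new column holding the values
long <- pivot_longer(wide, cols = c(redshirts, yellowshirts, blueshirts),
                     names_to = 'rank', values_to = 'kills')
print(long)
```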

## __Exercise__: Practice separating/uniting and gathering/spreading.

1. Use `gather` to bring all the children's dates of birth into a single 'Children' column, with the date of birth in a column, 'DOBs'.
1. Similarly bring the medications into a pair of columns, 'Medication' and 'Code'.
1. Unite the adult relationships and ages with a '/' separator.
1. Bring those together into a pair of columns, 'Adults' and 'Rel-Age'.
1. Create a spread from BIRTH_PLAN (to create columns 1, 2, 3, 4 etc.) with values of EDUCATION.

In [8]:
# p <- gather(pregnancy, DOBCHILD1, DOBCHILD2, DOBCHILD3, DOBCHILD4, DOBCHILD5, DOBCHILD6, key='Children', value='DOBs')
# p <- gather(p, MEDICATION1, MEDICATION2, MEDICATION3, MEDICATION4, MEDICATION5, MEDICATION6, key='Medication', value='Code')
# p <- unite(p, ADULT1_RELATIONSHIP, ADULT1_AGE, col='Adult1', sep='/')
# p <- unite(p, ADULT2_RELATIONSHIP, ADULT2_AGE, col='Adult2', sep='/')
# p <- unite(p, ADULT3_RELATIONSHIP, ADULT3_AGE, col='Adult3', sep='/')
# p <- unite(p, ADULT4_RELATIONSHIP, ADULT4_AGE, col='Adult4', sep='/')
# p <- gather(p, Adult1, Adult2, Adult3, Adult4, key='Adults', value='Rel-Age')
# head(p[order(p$PARENT_ID),], 12)

In [9]:
# q <- spread(pregnancy, BIRTH_PLAN, EDUCATION)
# head(q)
# table(pregnancy$BIRTH_PLAN, pregnancy$EDUCATION)

<table style="text-align:center;"><tr><td width="100" height="20" style="background-color:greenyellow"></td><td width="100" height="20" style="background-color:hotpink"></td></tr></table>
