# R for Neuroscience (r4n)
Made with 💖 by Andrew Li

![](https://media.giphy.com/media/SSirUu2TrV65ymCi4J/giphy.gif)
*Source: https://giphy.com/gifs/back-to-school-SSirUu2TrV65ymCi4J*

## Welcome NURC 2021!
I created this worksheet for the 2021 [Neuroscience Undergraduate Research Conference (NURC)](https://ubcneuroscienceclub.wixsite.com/uncweb/nurc-2021). It is meant for beginners with little to no R background. I will introduce you to modern tools such as the tidyverse and show you how to conduct a clean, modern data analysis. 

## Why should you learn R?
During my undergrad, my PhD mentor told me to learn R if I wanted to attend graduate school. 

1. R is free (unlike Matlab, SPSS, or Excel)! Many textbooks and resources (like this one) are open source and widely available, and there is a huge online R community. 

2. R is the future of statistical analysis (and arguably the present). During the summer and fall, I interviewed with many potential supervisors. Most assumed that students already know R or are willing to learn it in grad school. As such, it is a very useful skill to pick up.

3. R is very versatile. I have used R in many different ways, from statistical tests to web development and modeling. 

4. Analyses conducted in R are reproducible, reusable, and shareable. When I was an undergraduate research assistant, my PhD mentor showed me the data analysis and plots he had created in Excel. He asked me to repeat the analysis on another data set, but by the time I got around to it I had completely forgotten the steps. With R, the code itself is a record of every step.

## About the instructors
[Andrew Li](http://andrewcli.me) <br>
[Tiger Wu](https://tigerthepro.github.io/TigerWu/)

## Contents
1. Introduction to Jupyter Notebooks
    * Text cells
        * Markdown cheatsheet
    * Code cells
    * Equations
    * Comments
    * Check your answers
2. Welcome to the tidyverse
3. Data collection
    * `readr`
4. Data wrangling
    * `select()`
    * `filter()`
    * `rename()`
    * `mutate()`
5. Data visualization
    * Themes
    * Grammar of Graphics
6. Putting it all together
    * Data collection
    * Data wrangling
    * Data visualization
7. Future direction

Use the cell below to load (install if you haven't already) the packages we will need:

In [None]:
# uncomment and run this cell if you need to install tidyverse.
# install.packages('tidyverse')

In [None]:
# Run this cell before continuing.
source("tests_nurc_2021.R")
suppressPackageStartupMessages(library(tidyverse))

## 1. Introduction to Jupyter Notebooks
If you go on to take statistics or computer science courses, they will often use Jupyter Notebooks, so it is worthwhile to familiarize yourself with them. This section will show you what Jupyter notebooks are and what they can do!

### 1.1 Text cells
In a notebook, each rectangle is called a cell. This one is a text cell because it contains text. You can edit a text cell by double-clicking it. When you are done, press `control + enter` (Mac and PC) or click Run. Text cells are written in Markdown, a very simple markup language for formatting text. **Note:** JupyterLab does not have spell check, so be careful if you want to submit a future assignment/project!

#### 1.1.1 Markdown cheatsheet

Double click this cell to take a look at some common markdown tools:

**This is bold**

*This is italics*
## This is a header
Here is an ordered list:
1. Thing 1
2. Thing 2

Here is an unordered list: 
* Order doesn't matter
* Still doesn't matter

### 1.2 Code cells
Code cells allow you to input R or Python code. Pressing `control + enter` or clicking Run will make Jupyter run the whole cell. You can run the entire notebook from top to bottom by clicking `Run All` in the `Cell` menu. 

Try to print "Hello World!" 

In [None]:
print("Hello World!")

### 1.3 Equations
In Markdown, you can enter equations as well. Double-click this cell to see how you can create inline equations such as $Y = \beta_0 + \beta_1 X$ or displayed equations centered on their own line like this:

$$
\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}
$$

As you can see, Jupyter notebooks have become very popular because of how much you can do with them. 

### 1.4 Comments
In most programming languages, you can comment out chunks of code or write notes to yourself. Comments help you and others understand your thoughts and workflow, and they are really useful for debugging and trying things out. In R, you write comments with `#`; everything after `#` on that line is ignored. Other languages use different syntax. In Jupyter notebooks, you can comment out large chunks of code by selecting the lines you want to comment out and using the hot keys `command or control + /`. In RStudio, the hot keys are `control + shift + c`. 
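
As a small illustration of the comment styles described above, here is a sketch using a made-up vector of reaction times (the variable names are hypothetical and not part of the worksheet):

```r
# A full-line comment: R ignores this entire line.
reaction_times <- c(312, 287, 301)  # an inline comment after code

# Commenting out a line is a quick way to switch it off while debugging:
# reaction_times <- reaction_times / 1000

mean_rt <- mean(reaction_times)
print(mean_rt)
```

Because the division line is commented out, the values stay in their original units when the mean is computed.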

### 1.5 Check your answer

I have written tests for your answers and the autograder. You can check the correctness of your code by running the cell directly underneath each question. However, you must load the tests first (just as you need to load packages) before you can use the test functions. In every notebook, you will be prompted to load the required packages and my tests. 

In [None]:
# you will find this cell near the top of every notebook

# Run this cell before continuing. 
source("tests_nurc_2021.R")

# example of a test from this notebook
# test_2.3()

#### Question 1.1 

Multiple choice: 

Welcome to your first question! When you are working through this worksheet make sure you *read* and *follow* the instructions. 

To answer this multiple choice question, assign your answer to `answer1.1` and make sure your response(s) are in upper case and in quotation marks ("A"). 

    answer1.1 <- c(FILL_THIS_IN, FILL_THIS_IN)

A. Correct

B. Correct

C. Incorrect

D. Incorrect 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

#### Question 1.2

True or false: 

R is similar to SPSS.

To answer this true or false question, assign your response to `answer1.2`. Make sure your submission is in all lower case and to surround your answer in quotation marks ("true"/"false"). 

    answer1.2 <- FILL_THIS_IN

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

#### Question 1.3 

True or False: 

This problem set is for marks. 

To answer this true or false question, assign your response to `answer1.3`. Make sure your submission is in all lower case and to surround your answer in quotation marks ("true"/"false"). 

    answer1.3 <- FILL_THIS_IN

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.3()

## 2. Welcome to the tidyverse
The tidyverse is a *collection* of packages designed for data science (often called a "meta-package"). When you install and load tidyverse version 1.2.0, you are actually loading several core packages, including `dplyr`, `forcats`, `ggplot2`, `purrr`, `readr`, `stringr`, `tibble`, and `tidyr`. These packages all share a high-level design philosophy and low-level grammar and data structures, so learning one package makes it easier to learn the others. 

#### Installation
* In order to use an R package, you need to install the package. Note that you only need to install packages once.

        install.packages("tidyverse")
        
* After you have installed the package, you need to load the packages you want to use in your current R session.

        library(tidyverse)
        
#### Core tidyverse packages 

**Bold** denotes packages used today

| Package name | Description | Cheatsheet |
|--------------|-------------|------------|
| **ggplot2** | ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.<br>You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical <br>primitives to use, and it takes care of the details. | Click [here](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) for cheatsheet! |
| **dplyr** | dplyr provides a grammar of data manipulation, providing a consistent set of verbs that <br>solve the most common data manipulation challenges. | Click [here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf) for cheatsheet! |
| tidyr | tidyr provides a set of functions that help you get to tidy data. Tidy data is data with <br>a consistent form: in brief, every variable goes in a column, and every column is a variable. | Click [here](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for cheatsheet! |
| **readr** | readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf).<br>It is designed to flexibly parse many types of data found in the wild, while still cleanly <br>failing when data unexpectedly changes. | Click [here](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf) for cheatsheet! |
| purrr | purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent <br>set of tools for working with functions and vectors. Once you master the basic concepts, <br>purrr allows you to replace many for loops with code that is easier to write and more expressive. | Click [here](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) for cheatsheet! |
| tibble | tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, <br>and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less <br>and complain more, forcing you to confront problems earlier, typically leading to cleaner, more <br>expressive code. | Click [here](https://miro.medium.com/max/700/1*fEGdnyXLzgeftfCLwvBZ5A.jpeg) for cheatsheet! |
| stringr | stringr provides a cohesive set of functions designed to make working with strings as easy as <br>possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct <br>implementations of common string manipulations. | Click [here](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) for cheatsheet! |
| forcats | forcats provides a suite of useful tools that solve common problems with factors. R uses factors <br>to handle categorical variables, variables that have a fixed and known set of possible values. | Click [here](http://www.flutterbys.com.au/stats/downloads/slides/figure/factors.pdf) for cheatsheet! |

#### Question 2.1
Use the `library` function to fire up tidyverse! {Points:1}

    library(FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

#### Question 2.2
Tidyverse crossword!

You can complete this crossword about `tidyverse` functions, specifically, `dplyr`, `ggplot2`, and `readr`. You might not know about some of these functions so googling is highly encouraged. You do not need parentheses but some answers have underscores. 

To answer this question, assign each across word to a variable named with a lowercase "a" followed by the clue number, and each down word to a variable named with a lowercase "d" followed by the clue number. For the autograder, answer all of the across clues in ascending order first before moving on to the down clues. Follow the guide below {Points: 1}

    a3 <- FILL_THIS_IN
    a5 <- FILL_THIS_IN
    ...
    ...
    d12 <- FILL_THIS_IN

You can solve this puzzle [here](https://crosswordlabs.com/embed/2020-12-08-749) as well!

<img src="img/crossword.png" width="800" height="800">

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

## 3. Data collection
![](https://media.giphy.com/media/Vn9JVHDAzYw1O/giphy.gif)
*Source: https://giphy.com/gifs/brain-anatomy-Vn9JVHDAzYw1O*

#### Background
For this workshop, we will be working through an MRI and Alzheimer's data set made publicly available on Kaggle. You can find the original data set [here](https://www.kaggle.com/jboysen/mri-and-alzheimers). The original data was made available by the [Open Access Series of Imaging Studies (OASIS)](http://www.oasis-brains.org/) project, which aims to make MRI data sets freely available to the scientific community. OASIS was made available by the [Neuroinformatics Research Group (NRG)](http://nrg.wustl.edu/) at Washington University and the [Howard Hughes Medical Institute (HHMI)](http://www.hhmi.org/) at Harvard University.

#### Alzheimer's Disease
Alzheimer's disease is an irreversible, progressive brain disorder that slowly destroys memory, cognitive functions, and eventually, the ability to carry out the simplest tasks. In late-onset Alzheimer's, symptoms typically first appear in a person's mid-60s, while early-onset symptoms typically appear between a person's 30s and mid-60s. In 1906, Dr. Alois Alzheimer noticed significant changes in the brain tissue of a woman who had died of an unusual mental illness. She had experienced memory loss, language problems, and unpredictable behavior. An autopsy showed that her brain had many abnormal clumps (amyloid plaques) and tangled bundles of fibers (neurofibrillary tangles). The damage initially takes place in the hippocampus but spreads to other brain regions. 

Experts estimate that 5.5 million Americans age 65 and older may have Alzheimer's. Many people under 65 are affected as well. 

#### Cross sectional data
Kaggle offers two data sets but we will be working with the cross sectional data set `oasis_cross-sectional.csv`. 

*Cross-sectional MRI Data in Young, Middle Aged, Nondemented and Demented Older Adults:* This set consists of a cross-sectional collection of 416 subjects aged 18 to 96. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 100 of the included subjects over the age of 60 have been clinically diagnosed with very mild to moderate Alzheimer's disease (AD). Additionally, a reliability data set is included containing 20 nondemented subjects imaged on a subsequent visit within 90 days of their initial session.

| Variable | Description | Type |
|----------|-------------|------|
| ID | Unique ID of the patient | String |
| M/F | Gender | Boolean |
| Hand | Handedness | Boolean |
| Age | Age in years | Integer |
| Educ | Education level in years | Integer |
| SES | Socioeconomic status as assessed by the Hollingshead Index of Social Position and <br>classified into categories from 1 (highest status) to 5 (lowest status) | Integer |
| MMSE | Mini Mental State Examination, a test of cognitive function<br>(scores range from 0 = worst to 30 = best) | Integer |
| CDR | Clinical Dementia Rating (0 = no dementia, 0.5 = very mild AD, 1 = mild AD, <br> 2 = moderate AD, 3 = severe AD) | Float |
| eTIV | Estimated Total Intracranial Volume | Integer |
| nWBV | Normalized Whole Brain Volume expressed as a percent of all voxels in the atlas-masked image <br>that are labeled as gray or white matter by the automated tissue segmentation process | Float |
| ASF | Atlas scaling factor (unitless). Computed scaling factor that transforms native-space brain <br>and skull to the atlas target (i.e., the determinant of the transform matrix) | Float |
| Delay | Delay time (contrast) | Integer |

#### Reading in the data set
Using the `read_csv()` function from the `readr` package (which is part of the tidyverse), we can load this data set into R. 
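
To see `read_csv()` in action without touching the workshop data, here is a minimal sketch that writes a tiny, made-up CSV to a temporary file and reads it back. The file contents and the names `toy_path` and `toy_data` are assumptions for illustration only. Depending on your `readr` version, `read_csv()` may also print a short message describing the column types it guessed.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,score",
             "s01,72,29",
             "s02,65,27",
             "s03,81,22"),
           toy_path)

# read_csv() returns a tibble; assigning it to a name lets us reuse it later.
toy_data <- read_csv(toy_path)

head(toy_data, n = 2)  # first two rows
tail(toy_data, n = 1)  # last row
```

The same pattern applies to any CSV file: pass the path (or URL) as a string and assign the result to a name.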

#### Question 3.1

Multiple choice: 

What is the `oasis_cross-sectional.csv` data set about? 

To answer this multiple choice question, assign your answer to `answer3.1` and make sure your response(s) are in upper case and in quotation marks ("A"). 

    answer3.1 <- FILL_THIS_IN

A. Study conducted looking at predictors for drug addiction

B. A data set that uses electroencephalography (EEG) to show brain activity in people with Alzheimer's disease

C. A longitudinal MRI data set that looks at nondemented and demented older adults

D. A cross-sectional MRI data set with young, middle-aged, nondemented, and demented older adults

E. A data set that uses positron emission tomography (PET)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1()

#### Question 3.2 
True or False: 

The Clinical Dementia Rating scale is classified from 1 (no Dementia) to 30 (severe Dementia). 

To answer this true or false question, assign your response to `answer3.2`. Make sure your submission is in all lower case and to surround your answer in quotation marks ("true"/"false"). 

    answer3.2 <- FILL_THIS_IN

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2()

#### Question 3.3
If we just load the data into R, it will be printed on the screen but we cannot do anything with it. To work with it, we need to give it a name so that we can refer to it and manipulate it moving forward. Create a new variable called `mri_data` and assign the `oasis_cross-sectional.csv` data set to it. 

    FILL_THIS_IN <- read_csv("data/FILL_THIS_IN")

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(mri_data) # prints the first 6 lines 

In [None]:
test_3.3()

#### Question 3.4

Now that we have read in the data, let's take a look at it. In question 3.3, we used the `head()` function to look at the first 6 entries. Now, use the `tail()` function to take a look at the last 6 entries and find out the last patient's (`OAS1_0395_MR2`) age.  

To answer this question, assign your answer to `answer3.4`.

    answer3.4 <- FILL_THIS_IN

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.4()

## 4. Data Wrangling

Before we create any graphs or models, we need to tweak the data to make it easier to work with. Some common manipulations include renaming variables or creating new variables. To do this, we will use the `dplyr` package. We will look at three of the most common functions in this package:

    1. select()
    2. filter()
    3. mutate()
    
<img src="img/dplyr.jpg" width="500" height="300">

*Source: Allison Horst*
    
### 4.1 `select()`
The `select()` function lets you keep only the variables (columns) that are relevant to your analysis. 

<img src="img/select.png" width="500" height="300">

*Source: https://datacarpentry.org/r-intro-geospatial/06-dplyr/*
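
Here is a rough sketch of the most common `select()` patterns on a small made-up tibble (the `toy` data and its column names are hypothetical, not the workshop data). It also shows that backticks are one way to refer to a column whose name contains a slash.

```r
library(dplyr)

# A small made-up tibble; note the column name `M/F` contains a slash.
toy <- tibble(
  id    = c("s01", "s02", "s03"),
  `M/F` = c("F", "M", "F"),
  age   = c(72, 65, 81),
  score = c(29, 27, 22)
)

select(toy, age)         # keep a single column
select(toy, id:age)      # keep a run of adjacent columns with `:`
select(toy, -score)      # keep everything except `score`
select(toy, id, `M/F`)   # backticks let you refer to unusual column names
```

In every case `select()` returns a new tibble and leaves `toy` itself unchanged, so you need to assign the result to a name if you want to keep it.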

#### Question 4.1 
Say we are only interested in the age of the participants. Use the `select()` function on `mri_data` so that we only keep the age of the participants. 

To answer this question, call the new data frame `answer4.1`

    answer4.1 <- select(mri_data, FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.1)

In [None]:
test_4.1()

#### Question 4.2
Now, we are only interested in the demographics information. Select ID, M/F, Hand, Age, Educ, and SES. 

We can use this code:

    select(mri_data, ID, "M/F", Hand, Age, Educ, SES)
    
But there is a much faster way to select multiple columns using ":". Use the select function and : to create a new variable called `answer4.2` that contains the demographic data. 

    answer4.2 <- select(FILL_THIS_IN, FILL_THIS_IN:FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.2)

In [None]:
test_4.2()

#### Question 4.3

Now, select *every* column except Delay. We will not need this variable moving forward. 

We can manually type out all of the columns we want:

    select(mri_data, ID, "M/F", Hand, Age, Educ, SES, ..., ASF)

Or, we can simply tell R which columns we *don't* want via the minus sign. Create a new variable called `answer4.3` with every column except Delay.

    answer4.3 <- select(FILL_THIS_IN, -FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.3)

In [None]:
test_4.3()

#### Question 4.4
If you want to look at the gender distribution, you will run into trouble because of how the column is named: `M/F` contains a slash, so it has to be wrapped in quotation marks every time we refer to it. Let's rename this variable to something that is easier to work with. Rename M/F to `gender` and create a new object called `answer4.4`. Remember to use our new data frame `answer4.3`.

Hint: there is a `rename()` function!

    answer4.4 <- FILL_THIS_IN(answer4.3, gender = FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.4)

In [None]:
test_4.4()

### 4.2 `filter()`

We can use the `filter()` function to keep observations that meet our criteria. 

<img src="img/filter.jpg" width="600" height="300">

*Source: Allison Horst*
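
The sketch below shows a few typical `filter()` conditions on a small made-up tibble (the `toy` data is hypothetical). Each call keeps only the rows for which the condition is `TRUE`.

```r
library(dplyr)

toy <- tibble(
  id     = c("s01", "s02", "s03", "s04"),
  gender = c("F", "M", "F", "M"),
  age    = c(72, 58, 81, 66)
)

filter(toy, age > 60)                  # rows where age is above 60
filter(toy, gender == "F")             # rows where gender equals "F"
filter(toy, age > 60 & gender == "M")  # both conditions must hold
filter(toy, gender != "M")             # `!=` means "not equal to"
```

Note the double equals sign: `==` tests for equality, whereas a single `=` would be read as naming an argument.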


#### Question 4.5

As you know, Alzheimer's mainly affects the older population. As such, we want to filter our data so that we only have participants who are over 60 years old. Using `answer4.4`, create a new object called `answer4.5` that contains only these older adults.

    answer4.5 <- filter(FILL_THIS_IN, FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.5)

In [None]:
test_4.5()

#### Question 4.6 

Now, to practice, take only the data from females in `answer4.5`. Create a new object called `answer4.6`.

    answer4.6 <- FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN == FILL_THIS_IN)

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.6)

In [None]:
test_4.6()

### 4.3 `mutate()`

We can use the `mutate()` function to create new variables while preserving existing ones. Mutate will create a new column at the end of your data set. 

<img src="img/mutate.jpg" width="400" height="400">

*Source: Allison Horst*
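
As a minimal sketch of `mutate()`, the example below adds a new column to a small made-up tibble (the `toy` data and the column names `trial_1`, `trial_2`, and `total` are hypothetical).

```r
library(dplyr)

toy <- tibble(
  id      = c("s01", "s02", "s03"),
  trial_1 = c(10, 12, 9),
  trial_2 = c(11, 15, 8)
)

# Add a `total` column; the existing columns are kept unchanged.
toy_total <- mutate(toy, total = trial_1 + trial_2)
toy_total
```

The calculation is applied row by row, and the new column appears after the existing ones.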


#### Question 4.7

We can calculate a "risk score" for developing Alzheimer's by adding up a participant's socioeconomic status (SES), Mini Mental State Examination score (MMSE), and Clinical Dementia Rating (CDR). Create a new column called `risk_score` in the `answer4.6` data frame, and call the resulting data frame `answer4.7`. The new column should be the sum of the three scores. 

    answer4.7 <- FILL_THIS_IN(answer4.6, 
        FILL_THIS_IN = FILL_THIS_IN + FILL_THIS_IN + FILL_THIS_IN)
        
**NOTE:** Risk Score was completely made up for teaching purposes. 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(answer4.7)

In [None]:
test_4.7()

## 5. Data Visualization
In the final section of this workshop, we will cover how to make plots via `ggplot2`!
<img src="img/ggplot2.jpg" width="400" height="400">

*Source: Allison Horst*

### 5.1 Grammar of graphics
ggplot2 follows the grammar of graphics proposed by Leland Wilkinson. Before this, there was a separate function for each type of graph (a line graph, a bar graph, and so on). Wilkinson found that this approach became complex and unmanageable very quickly (imagine creating a new function for every plot). He therefore set out to find what all graphs have in common: the grammar of graphics. He proposed that the 8 constituents of graphics shown below can become the bedrock for *every* plot.  

<img src="img/ggplot-2.png" width="400" height="400">

*Source: https://www.science-craft.com/2014/07/08/introducing-the-grammar-of-graphics-plotting-concept/*
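
To make the layered idea more concrete, here is a small sketch that builds a plot one component at a time. It assumes the `mpg` data set that ships with ggplot2 rather than the workshop data, and the labels are made up for illustration.

```r
library(ggplot2)

# Layer 1: the data (mpg ships with ggplot2) and the aesthetic mapping.
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  # Layer 2: the geometry that draws each observation as a point.
  geom_point(alpha = 0.8) +
  # Layer 3: human-readable labels.
  labs(x = "Engine displacement (L)",
       y = "Highway miles per gallon",
       colour = "Vehicle class",
       title = "Fuel economy by engine size")

p  # printing the object draws the plot
```

Swapping `geom_point()` for another geometry changes the type of plot while the data and mappings stay the same, which is the main appeal of this approach.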


### 5.2 Themes
I like to teach beginners how to use themes because the default theme ggplot2 gives you is really ugly (in my opinion). You can easily change the theme by adding the theme function to your plot code. I created the graph below using the plot you will create in question 5.2. It is the same graph each time; I just applied the different built-in themes available in ggplot2.

        plot + theme_bw() #example

<img src="img/ggthemes.png" width="500" height="500">

*Source: [Andrew Li](http://andrewcli.me)*
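
As a quick sketch of how themes are applied, the example below stores a plot in a variable and then adds different built-in themes to it. It assumes the `mpg` data set that ships with ggplot2, and `base_plot` is a hypothetical name.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

base_plot                    # default theme (theme_grey)
base_plot + theme_bw()       # white background with a border
base_plot + theme_minimal()  # no background annotations
base_plot + theme_classic()  # axis lines only, no gridlines
```

Storing the plot first makes it easy to compare themes without retyping the rest of the code.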

#### Question 5.1 

Let's bring it all together! Using our original data set `mri_data`, create a new data frame called `mri_tidy` that has the following:

1. Only contains participants who are 60 years old or older
2. Create a new `risk_score` column which is the sum of SES, MMSE, and CDR
3. Rename the M/F column name to `gender`

        mri_tidy <- mri_data %>%
            FILL_THIS_IN(Age FILL_THIS_IN) %>%
            FILL_THIS_IN(FILL_THIS_IN = SES + MMSE + CDR) %>%
            FILL_THIS_IN(FILL_THIS_IN = "M/F") 
           
**NOTE:** `%>%` is called the pipe operator. It comes in handy when you want to apply several functions in a row, as in question 5.1. The pipe operator forwards the result of one expression into the next expression as its first argument.

    filter(mri_data, SES == 3)
    
Is the same as

    mri_data %>% filter(SES == 3)
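
To show why the pipe is convenient once several steps are involved, here is a small sketch on a made-up tibble (the `toy` data is hypothetical). It compares a nested call with the equivalent piped version.

```r
library(dplyr)

toy <- tibble(
  id    = c("s01", "s02", "s03", "s04"),
  age   = c(72, 58, 81, 66),
  score = c(29, 30, 22, 27)
)

# Nested calls are read from the inside out...
nested <- select(filter(toy, age > 60), id, score)

# ...while the piped version reads top to bottom, one step per line.
piped <- toy %>%
  filter(age > 60) %>%
  select(id, score)

identical(nested, piped)  # both approaches give the same result
```

The piped version tends to be easier to read and edit, because each step sits on its own line in the order it is carried out.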

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(mri_tidy)

In [None]:
test_5.1()

#### Question 5.2

Let's take a look at the relationship between the **Clinical Dementia Score** (CDR) and the **Mini Mental State Examination** (MMSE). We also want to see the difference between **genders**. Using the following parameters, create a boxplot to look at this relationship:

* CDR should be on the x-axis and MMSE should be the y-axis
* Color by gender
* Set the transparency (alpha) to 0.8
* Rename the x-axis, y-axis, and legend title so that they are descriptive and human readable
    * x-axis: `Clinical Dementia Score`
    * y-axis: `Mini Mental State Examination`
    * legend title: `Gender` (simply capitalize it)
* Create a title `MRI and Alzheimer's`
* Change the theme (optional)

Store the result in a variable called `answer5.2`

    answer5.2 <- mri_tidy %>%
        ggplot(aes(x = as.factor(FILL_THIS_IN), y = FILL_THIS_IN, fill = FILL_THIS_IN)) +
        geom_boxplot(FILL_THIS_IN) +
        xlab(FILL_THIS_IN) +
        ylab(FILL_THIS_IN) +
        labs(fill = "Gender") +
        ggtitle(FILL_THIS_IN) + 
        theme_classic() # you can play around and try different themes
        
Or you can use this:

    answer5.2 <- mri_tidy %>%
        ggplot(aes(x = as.factor(FILL_THIS_IN), y = FILL_THIS_IN, fill = FILL_THIS_IN)) +
        geom_boxplot(FILL_THIS_IN) +
        labs(x = FILL_THIS_IN, y = FILL_THIS_IN, title = FILL_THIS_IN, fill = FILL_THIS_IN) +
        theme_classic() # you can play around and try different themes

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
answer5.2

In [None]:
test_5.2()

## 6. Putting it all together

In this last section, we will use everything we have learned so far to create the graph below. __Note__ that this section might be tricky, but its goal is to show you how customizable ggplot2 is. If there is something you haven't learned yet, I will provide the code.

<img src="img/tbi_deaths.png" width="600" height="600">

*Source: [Andrew Li](http://andrewcli.me)*

### 6.1 Data collection - Traumatic Brain Injury (TBI)

We will be taking the data from TidyTuesday, a weekly social data project in R where users explore a new data set each week. This data set looks at [traumatic brain injury](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-24/readme.md) and how common it is. The original data comes from the [CDC](https://www.cdc.gov/traumaticbraininjury/pdf/TBI-Surveillance-Report-FINAL_508.pdf) and [Veterans Brain Injury Center](https://dvbic.dcoe.mil/dod-worldwide-numbers-tbi).

> Brain Injury Awareness Month, observed each March, was established 3 decades ago to educate the public about the incidence of brain injury and the needs of persons with brain injuries and their families (1). Caused by a bump, blow, or jolt to the head, or penetrating head injury, a traumatic brain injury (TBI) can lead to short- or long-term changes affecting thinking, sensation, language, or emotion.

<img src="img/tbi_summary.png" width="800" height="200">

*Source: [CDC](https://www.cdc.gov/mmwr/volumes/68/wr/mm6810a1.htm)*

#### 6.1.1 Variables 

| Variable         | Description                      | Type    |
|------------------|----------------------------------|---------|
| age_group        | Age group                        | String  |
| type             | Type of measure                  | String  |
| injury_mechanism | Injury mechanism                 | String  |
| number_est       | Estimated observed cases in 2014 | Integer |
| rate_est         | Rate/100,000 in 2014             | Float   |

#### 6.1.2 The question we want to answer

What are the leading causes of traumatic brain injury related deaths by age group?

#### Question 6.1 

Use the `read_csv()` function to load the data into this session. Name the variable `tbi_age`. 

    FILL_THIS_IN <- FILL_THIS_IN('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-24/tbi_age.csv')

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(tbi_age) # prints the first 6 lines

In [None]:
test_6.1()

#### Question 6.2 

Now that we have loaded the data, we need to clean it up a bit. Using `tbi_age`, carry out the following manipulations:

1. Using the `filter()` function we will:
    * Notice that in the `age_group` column, **0-17** overlaps with the finer age groups and is not useful here. Let's get rid of it. Use the `!=` operator.
    * We do not need the **Total** in `age_group`, so let's get rid of it as well.
    * **Other or no mechanism specified** in the `injury_mechanism` column isn't that informative. Let's get rid of it.
    * We are only interested in **Deaths**. Use the `==` operator to keep only those rows of `type`.



2. Using the `mutate()` function we will:
    * Create a new column called `pct` by dividing `number_est` by the sum (`sum()`) of itself.
    * Convert all string columns to factors with the `factor()` function. Keep the original names of the columns.
    * Reorder the levels of `age_group` into ascending order. (See what happens to our plot if we skip this step!)



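3. We need to get rid of missing values. We can do this via `na.omit()`. In the skeleton below, this step comes right after `filter()` so that `sum()` in the `pct` calculation does not return `NA`.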



4. Assign this new tidy data set to an object called `tbi_age_tidy`.
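
The steps above can be sketched on a small toy data set first. Everything in this example (the `toy` tibble, its column names, and its numbers) is hypothetical and only meant to show how `filter()`, `na.omit()`, and `mutate()` chain together. It assumes the `dplyr` package is installed.

```r
library(dplyr)

# Hypothetical toy data, not the real TBI numbers
toy <- tibble(
  group = c("a", "b", "Total", "a", "b"),
  type  = c("Deaths", "Deaths", "Deaths", "Visits", "Deaths"),
  n     = c(30, 10, 40, 5, NA)
)

toy_tidy <- toy %>%
  # keep rows where BOTH conditions hold
  filter(group != "Total" & type == "Deaths") %>%
  # drop rows with missing values before summing
  na.omit() %>%
  # each remaining row's share of the remaining total,
  # and convert the string column to a factor
  mutate(pct = n / sum(n),
         group = factor(group))

toy_tidy
```

Try moving `na.omit()` after `mutate()` to see why the order matters: `sum()` returns `NA` as soon as one value is missing.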

Use the skeleton code below to answer this question:

    FILL_THIS_IN <- FILL_THIS_IN %>% 
      FILL_THIS_IN(FILL_THIS_IN != "0-17" & age_group != "FILL_THIS_IN" 
                   & FILL_THIS_IN != "Other or no mechanism specified"
                   & FILL_THIS_IN == "Deaths") %>% 
      na.omit() %>%
      FILL_THIS_IN(pct = number_est/sum(number_est)) %>%
      mutate(pct = pct * 10) %>%
      FILL_THIS_IN(age_group = factor(FILL_THIS_IN),
             injury_mechanism = factor(FILL_THIS_IN),
             type = factor(FILL_THIS_IN)) %>%
      mutate(age_group = fct_relevel(age_group, c("0-4", "5-14", "15-24",
                                                  "25-34", "35-44", "45-54",
                                                  "55-64", "65-74", "75+")))
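
To see why the last `fct_relevel()` step matters, here is a small sketch using three of the age groups. The vector `ages` is a hypothetical example, and the code assumes the `forcats` package (part of the tidyverse) is installed. By default, `factor()` sorts its levels as text, not as numbers.

```r
library(forcats)

# Hypothetical example vector with three of the age groups
ages <- factor(c("5-14", "0-4", "15-24"))

# Default levels are sorted as text, so "15-24" lands before "5-14"
levels(ages)

# fct_relevel() lets us spell out the order we actually want
ages_ordered <- fct_relevel(ages, "0-4", "5-14", "15-24")
levels(ages_ordered)
```

ggplot2 draws a factor along the x-axis in level order, so without this step the age groups would appear out of sequence.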

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
head(tbi_age_tidy) # prints the first 6 lines

In [None]:
test_6.2()

#### Question 6.3

Now, using our new data, we will go on to answer our question, __What are the leading causes of traumatic brain injury related deaths by age group?__ As you saw in our fMRI question, you can create a graph in R very easily. In this question, we will try to make the graph as aesthetically pleasing as possible. Here, I will show you some cool and easy ways to make your plot look really nice.

1. What data will we use?
    * We will use `tbi_age_tidy` for this graph.


2. What are our aesthetic mappings? 
    * We will be mapping __age_group__ on the x-axis, __pct__ on the y-axis, and we will differentiate __injury_mechanism__ by color. 


3. What geometric object will we use? 
    * We will need a scatter plot and a line plot.
    
*Nice! We have a plot now. Everything beyond step three is optional but encouraged.*


4. We can create this graph without scales, but if we want to make it nicer, we can change the scales.
    * We can change the y-axis to percent via the `scales` package.
    * We can change the default colors ggplot2 gives us.


5. Let's change the x-axis and y-axis labels to something more human readable.
    * In this case, I would get rid of the x-axis and the y-axis names.
    * We can add a descriptive title as well.
        * Play around and add a subtitle or caption!
    * We can get rid of the legend title. The contents are self-explanatory and it seems redundant to add a title. Let's get rid of it. 


6. I am not a big fan of the default theme that ggplot2 gives us. Play around with the built-in themes!

*The graph we have now is publication ready. However, we can make it look even nicer.*

7. Finally, we can add the finishing touches to our graph.
    * I am not a fan of the legend placement. Let's change the position of the legend. Try moving it to the top or the bottom.
    * Let's make our title font size a bit bigger and make it bold.
    * Likewise, let's make the axis text a bit easier to read. Make it **bold**.
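
As a small illustration of step 4, the sketch below shows what the `scales` package does with proportions such as our `pct` column. The input values are made up, `pct_labels` is a hypothetical name, and the code assumes the `scales` package is installed.

```r
library(scales)

# percent() turns proportions into percentage labels
pct_labels <- percent(c(0.05, 0.5), accuracy = 1)
pct_labels
```

Inside a plot, we pass the function itself (`labels = scales::percent`) to `scale_y_continuous()`, and ggplot2 applies it to the axis breaks for us.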


Use the following skeleton code to create this graph:

*__Note__ that this question will not be graded. Play around with the colors, naming, themes, and more! Make this graph unique and aesthetically pleasing to look at!* 

In [None]:
# Step 1: Pipe the data set you will be using 
tbi_age_tidy %>%
# Step 2: Add your aesthetic mappings
    ggplot(aes(x = age_group, y = pct, color = injury_mechanism)) +
# Step 3: Choose what geometric object you will use
    geom_point() +
    geom_line(aes(group = injury_mechanism)) +
# Step 4: Change the scales
    scale_y_continuous(labels = scales::percent) +
    scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3", #change to whatever color you want!
                                  "#e7298a", "#66a61e", "#e6ab02")) +
# Step 5: Add human readable and descriptive titles 
    labs(x = "", 
         y = "", 
         color = "",
         title = "Leading causes of TBI related deaths", 
         subtitle = "By age in 2014", 
         caption = "") +
# Step 6: Change the default theme to something more aesthetically pleasing
    theme_minimal() +
# Step 7: Add the final touches!
    theme(
        legend.position = "bottom",
        plot.title = element_text(face = "bold", size = 16),
        axis.text = element_text(face = "bold")
        ) 
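
You may wonder why `geom_line()` in step 3 needs `aes(group = injury_mechanism)`. By default, ggplot2 groups the data by every discrete variable in the plot, including the factor on the x-axis, so each point ends up in its own group and there is nothing to connect. Here is a minimal sketch on hypothetical toy data (the `toy` data frame and its numbers are made up) that assumes the `ggplot2` package is installed.

```r
library(ggplot2)

# Hypothetical toy data: two mechanisms across three ordered age groups
toy <- data.frame(
  age_group = factor(rep(c("0-4", "5-14", "15-24"), times = 2),
                     levels = c("0-4", "5-14", "15-24")),
  injury_mechanism = rep(c("Falls", "Assault"), each = 3),
  pct = c(0.10, 0.05, 0.20, 0.02, 0.03, 0.15)
)

p <- ggplot(toy, aes(x = age_group, y = pct, color = injury_mechanism)) +
  geom_point() +
  # group = injury_mechanism makes one line per mechanism;
  # without it, every (age_group, mechanism) pair is its own group
  geom_line(aes(group = injury_mechanism))

p
```

Try removing `aes(group = injury_mechanism)` and rerunning the cell to see what happens to the lines.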

### 7. Future directions

If you enjoyed this workshop, please let me know by filling out our (very short) survey [here](https://ubc.ca1.qualtrics.com/jfe/form/SV_9BN4hBVzr9jRkZD)! I would love to create and host more workshops in the future. We covered a lot of topics today but we didn't really go over the basics or go deep into a specific topic. My goal was to show you the cool side of R to get you interested. If you want more workshops that go into more detail, shoot me an email!