# DSCI 100 - Introduction to Data Science


## Lecture 3 - Wrangling to get tidy data


### 2019-09-19

## Reminder  

Where are we? Where are we going?

![](https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png)

*image source: [R for Data Science](https://r4ds.had.co.nz/) by Grolemund & Wickham*

## Shameless borrowing of slides from Jenny Bryan

https://www.slideshare.net/Plotly/plotcon-nyc-behind-every-great-plot-theres-a-great-deal-of-wrangling

<img src="img/whisperer.png" width="700"/>

<img src="img/on_balcony.png" width="700"/>

<img src="img/tame.png" width="700"/>

## How should you wrangle your data? 


## We make it "tidy"

### What is tidy data?

A tidy data set is one that satisfies these three criteria:

- each row is a single observation,
- each variable is a single column, and
- each value is a single cell (i.e., its row, column position in the data frame is not shared with another value)

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width="550" />

*Source: [R for Data Science](https://r4ds.had.co.nz/) by Garrett Grolemund & Hadley Wickham*

## A tale of 4 data tables...

...here is the same data represented in 4 different ways. Let's vote on which are tidy.

*Example source: https://garrettgman.github.io/tidying/*

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

| country | year | rate |
|---------|------|------|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |

| runner | region | country | speed |
|--------|--------|---------|-------|
| Olusoji Fasuba | Africa | Nigeria | 9.85/100 |
| Murielle Ahouré | Africa | Ivory Coast | 10.78/100 |
| Femi Ogunode | Asia | Qatar | 9.91/100 |
| Su Bingtian | Asia | China | 9.91/100 |
| Li Xuemei | Asia | China | 10.79 |




| country | cases (year=1999) | cases (year=2000) |
|---------|-------|-------|
| Afghanistan | 745 | 2666 |
| Brazil | 37737 | 80488 |
| China | 212258 | 213766 |

| country | population (year=1999) | population (year=2000)|
|---------|-------|-------|
| Afghanistan |  19987071 |  20595360 |
| Brazil | 172006362 | 174504898 |
| China | 1272915272 | 1280428583 |

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

| country | year | cases | population |
|---------|------|-------|------------|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

| country | year | key | value |
|---------|------|-----|-------|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |

## Tools for getting it there:

- `tidyverse` package functions from: 
    - `dplyr` package (`select`, `filter`, `mutate`, `group_by`, `summarize`)
    
    - `tidyr` package (`gather`)
    
    - `purrr` package (`*map*`)
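
As a rough sketch of what `gather` does, the wide TB cases table from earlier can be reshaped so that each row is one country-year observation. The object names `cases_wide` and `cases_tidy` are made up for this illustration, and the year columns are shortened to `1999` and `2000` (an assumption, not the headers used in the table above).

```r
library(tidyr)
library(tibble)

# Wide layout: one column of case counts per year (values from the table above)
cases_wide <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

# Stack the two year columns into a `year` column (the old column names)
# and a `cases` column (the old cell values)
cases_tidy <- gather(cases_wide, key = "year", value = "cases", `1999`, `2000`)
cases_tidy
```

Note that `year` comes out as text here, because it was built from column names.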

## Another big concept this week: iteration

- iteration is when you need to do something repeatedly (e.g., ringing in and bagging groceries at the till)

![](https://www.ecomcrew.com/wp-content/uploads/2015/07/bar-code-scanning-grocery-store.jpg)

## Tidyverse tools for iteration

1. `group_by` + `summarize`
2. `*map*` 

## `group_by` + `summarize`

- useful when you want to do something repeatedly to a group of rows
- for example, we want to calculate the average life expectancy (`lifeExp`) for each continent from the `gapminder` data set


In [1]:
library(tidyverse)
gapminder <- read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/master/data/gapminder_data.csv")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.0     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Parsed with column specification:
cols(
  country = col_character(),
  year = col_double(),
  pop = col_double(),
  continent = col_character(),
  lifeExp = col_double(),
  gdpPercap = col_double()
)


In [2]:
#preview data frame

## First, let's filter for only 1 year, 2007
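
One possible way to do this step is sketched below. To keep the example runnable without a download, it uses a tiny stand-in table called `gapminder_toy` whose values are invented for illustration (they are not real gapminder figures); the same `filter` call should apply to the `gapminder` data frame loaded above.

```r
library(dplyr)
library(tibble)

# Toy stand-in for gapminder: invented values, NOT real gapminder figures
gapminder_toy <- tibble(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(2002, 2007, 2002, 2007, 2002, 2007),
  lifeExp   = c(60, 62, 70, 72, 78, 80)
)

# Keep only the rows where year is 2007
gapminder_2007 <- filter(gapminder_toy, year == 2007)
gapminder_2007
```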



## Now let's use `group_by` + `summarize` to iterate

Goal: calculate average life expectancy in 2007 for each continent
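
A minimal sketch of this goal follows, again on an invented stand-in table (`gapminder_toy`, with made-up values rather than real gapminder figures). `group_by` marks the groups and `summarize` then runs `mean` once per group, which is the iteration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for gapminder: invented values, NOT real gapminder figures
gapminder_toy <- tibble(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(2002, 2007, 2002, 2007, 2002, 2007),
  lifeExp   = c(60, 62, 70, 72, 78, 80)
)

# Filter to 2007, then compute one mean per continent
life_by_continent <- gapminder_toy %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))
life_by_continent
```

The result has one row per continent instead of one row per country.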

## `*map*` 

- useful when you want to do something repeatedly to almost anything (we'll give the example of columns in a data frame)
- for example, we want to calculate the average value of each column in the `USArrests` data to get the average across all US states

In [3]:
head(USArrests)

| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<dbl>` |
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |


## use `*map*` to iterate

Calculate the average/mean of each column:
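
One way this might look, using `map` from `purrr` on the built-in `USArrests` data (the name `col_means` is just for illustration):

```r
library(purrr)

# A data frame is a list of columns, so map() applies mean() to each column in turn
col_means <- map(USArrests, mean)
col_means

# map() always returns a list, with one element per column
class(col_means)
```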

But why isn't our output a data frame? 

## The output of the `*map*` functions depends on which function you use...

| `map` function | Output |
|----------|--------|
| `map()` | list |
| `map_lgl()` | logical vector |
| `map_int()` | integer vector |
| `map_dbl()` | double vector |
| `map_chr()` | character vector |
| `map_df()` | data frame |
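
As a small illustration of one row of this table, `map_dbl` can be applied to the built-in `USArrests` data; the name `col_means_dbl` is made up for this sketch.

```r
library(purrr)

# map_dbl() requires each result to be a single number and
# returns them as a named double vector rather than a list
col_means_dbl <- map_dbl(USArrests, mean)
col_means_dbl
```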

## use `map_df` instead:
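
A minimal sketch on the built-in `USArrests` data, assuming `dplyr` is installed (`map_df` relies on it to bind the results); `col_means_df` is an illustrative name.

```r
library(purrr)

# Same iteration over columns, but the results are collected
# into a one-row data frame with one column per original column
col_means_df <- map_df(USArrests, mean)
col_means_df
```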

## Go forth and wrangle!

We'll be here to help if you need it!

<img align="left" src="https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy.gif">



*image source: https://media.giphy.com/media/Qgm6tIYrSQqC4/giphy-downsized-large.gif*

## Class activity 1

Calculate the mean petal length for each of the Iris (flower) species in the `iris` dataset. Post your answer on Piazza when you are done.

## Class activity 2

Use `map_df` to calculate the mean of each of the numerical columns in the `iris` dataset. Post your answer on Piazza when you are done.

## What did we learn?

- 
- 
- 



