# Introduction to tidy data

## Principles of tidy data

| name   | age | eye_color | height |
|--------|-----|-----------|--------|
| Jake   | 34  | Other     | 6'1"   |
| Alice  | 55  | Blue      | 5'9"   |
| Tim    | 76  | Brown     | 5'7"   |
| Denise | 19  | Other     | 5'1"   |

* Observations as rows
* Variables as columns
* One type of observational unit per table

## A dirty data diagnosis

| name   | age | brown | blue | other | height |
|--------|-----|-------|------|-------|--------|
| Jake   | 34  | 0     | 0    | 1     | 6'1"   | 
| Alice  | 55  | 0     | 1    | 0     | 5'9"   |
| Tim    | 76  | 1     | 0    | 0     | 5'7"   |
| Denise | 19  | 0     | 0    | 1     | 5'1"   |

* Column headers are values, not variables


## Wide vs. long datasets

In some cases, data is easier to read in wide form. For manipulation and analysis, however, long form is usually much easier to work with.
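As a rough illustration of the difference, the sketch below builds the same six values in both forms by hand. The `scores_wide` and `scores_long` data frames and their values are hypothetical and not part of the course data.

```r
# Hypothetical scores for two students across three years (wide form):
# one row per student, one column per year
scores_wide <- data.frame(
  student = c("A", "B"),
  Y2019 = c(70, 85),
  Y2020 = c(75, 80),
  Y2021 = c(90, 95)
)

# The same six values in long form: one row per student-year pair
scores_long <- data.frame(
  student = rep(c("A", "B"), times = 3),
  year    = rep(c(2019, 2020, 2021), each = 2),
  score   = c(70, 85, 75, 80, 90, 95)
)

# In long form, a per-year summary is a single grouped call
yearly_mean <- aggregate(score ~ year, data = scores_long, FUN = mean)
yearly_mean
```

The wide table is compact to read, while the long table lets `year` be used like any other variable in grouping, filtering, or plotting.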


# Introduction to tidyr

- R package by Hadley Wickham
- Apply the principles of tidy data
- Small set of simple functions

## Gather columns into key-value pairs

**gather()**

In [1]:
library(tidyr)

# Apply gather() to bmi and save the result as bmi_long
bmi_long <- gather(bmi, year, bmi_val, -Country)

# View the first 20 rows of the result
head(bmi_long, 20)

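Because the `bmi` dataset is not included in these notes, here is a minimal sketch of the same `gather()` call on a tiny stand-in. The `bmi_toy` data frame and its values are invented for illustration only.

```r
library(tidyr)

# Hypothetical stand-in for bmi: one row per country, one column per year
bmi_toy <- data.frame(
  Country = c("Afghanistan", "Albania"),
  Y1980 = c(20, 25),
  Y1981 = c(21, 26),
  stringsAsFactors = FALSE
)

# Gather every column except Country into year/bmi_val pairs
bmi_toy_long <- gather(bmi_toy, year, bmi_val, -Country)
bmi_toy_long
```

Two year columns across two countries become four rows, with the old column names stored in `year` and the cell values in `bmi_val`.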


## Spreading key-value pairs into columns

**spread()**

In [None]:
# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)

# View the head of bmi_wide
head(bmi_wide)
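The reverse direction can be sketched the same way. The `bmi_toy_long` data frame below is a hypothetical long table with made-up values, standing in for `bmi_long`.

```r
library(tidyr)

# Hypothetical long table: one row per country-year pair
bmi_toy_long <- data.frame(
  Country = c("Afghanistan", "Albania", "Afghanistan", "Albania"),
  year    = c("Y1980", "Y1980", "Y1981", "Y1981"),
  bmi_val = c(20, 25, 21, 26),
  stringsAsFactors = FALSE
)

# Spread the year key into columns, filling cells from bmi_val
bmi_toy_wide <- spread(bmi_toy_long, year, bmi_val)
bmi_toy_wide
```

Each distinct value of `year` becomes its own column, so the result has one row per country again.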

# Introduction to tidyr (part 2)

## Separate columns

**separate()**

## Unite columns

**unite()**

## Summary of key tidyr functions

| Function   | What it does                         |
|------------|--------------------------------------|
| gather()   |  Gather columns into key-value pairs | 
| spread()   |  Spread key-value pairs into columns |
| separate() |  Separate one column into multiple   |
| unite()    |  Unite multiple columns into one     |


## Separating columns

In [None]:
# Apply separate() to bmi_cc
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

# Print the head of the result
head(bmi_cc_clean)
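A small self-contained version of this step, assuming `Country_ISO` holds values shaped like `"Country/ISO"`. The `bmi_cc_toy` data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical stand-in for bmi_cc: country name and ISO code share a column
bmi_cc_toy <- data.frame(
  Country_ISO = c("Afghanistan/AF", "Albania/AL"),
  bmi_val = c(20, 25),
  stringsAsFactors = FALSE
)

# Split Country_ISO at the slash into two new columns
bmi_cc_toy_clean <- separate(bmi_cc_toy, col = Country_ISO,
                             into = c("Country", "ISO"), sep = "/")
bmi_cc_toy_clean
```

By default `separate()` drops the original column and puts the new ones in its place.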

## Uniting columns

In [None]:
# Apply unite() to bmi_cc_clean
bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")

# View the head of the result
head(bmi_cc)
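The same call on a toy table shows what `unite()` produces. The `bmi_cc_toy_clean` data frame is a hypothetical stand-in with invented values.

```r
library(tidyr)

# Hypothetical stand-in for bmi_cc_clean: separate Country and ISO columns
bmi_cc_toy_clean <- data.frame(
  Country = c("Afghanistan", "Albania"),
  ISO = c("AF", "AL"),
  bmi_val = c(20, 25),
  stringsAsFactors = FALSE
)

# Paste Country and ISO together into a single Country_ISO column
bmi_cc_toy <- unite(bmi_cc_toy_clean, Country_ISO, Country, ISO, sep = "-")
bmi_cc_toy
```

Note that the separator here is `"-"`, so the result is not an exact inverse of a `separate()` call that split on `"/"`.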

# Addressing common symptoms of messy data

## Column headers are values, not variable names

**Messy data**

| name   | age | brown | blue | other | height |
|--------|-----|-------|------|-------|--------|
| Jake   | 34  | 0     | 0    | 1     | 6'1"   | 
| Alice  | 55  | 0     | 1    | 0     | 5'9"   |
| Tim    | 76  | 1     | 0    | 0     | 5'7"   |
| Denise | 19  | 0     | 0    | 1     | 5'1"   |

**Tidy data**

| name   | age | eye_color | height |
|--------|-----|-----------|--------|
| Jake   | 34  | Other     | 6'1"   |
| Alice  | 55  | Blue      | 5'9"   |
| Tim    | 76  | Brown     | 5'7"   |
| Denise | 19  | Other     | 5'1"   |
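One possible way to get from the messy table to the tidy one is to gather the three indicator columns and keep only the rows flagged with 1. This is a sketch rather than the only approach; the names `people_messy`, `people_tidy`, and `flag` are introduced here for illustration.

```r
library(tidyr)
library(dplyr)

# The messy table from above: eye colors appear as column headers
people_messy <- data.frame(
  name   = c("Jake", "Alice", "Tim", "Denise"),
  age    = c(34, 55, 76, 19),
  brown  = c(0, 0, 1, 0),
  blue   = c(0, 1, 0, 0),
  other  = c(1, 0, 0, 1),
  height = c("6'1\"", "5'9\"", "5'7\"", "5'1\""),
  stringsAsFactors = FALSE
)

# Gather the indicator columns, keep the flagged rows, drop the flag
people_tidy <- people_messy %>%
  gather(eye_color, flag, brown, blue, other) %>%
  filter(flag == 1) %>%
  select(name, age, eye_color, height)
people_tidy
```

The eye colors come out in lowercase because they are taken from the column names, and the rows may appear in a different order than in the tidy table above.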


## Variables are stored in both rows and columns

**Messy data**

| name  | measurement | value |
|-------|-------------|-------|
| Jake  | n_dogs      | 1     |
| Jake  | n_cats      | 0     |
| Jake  | n_birds     | 1     |
| Alice | n_dogs      | 1     |
| Alice | n_cats      | 2     |
| Alice | n_birds     | 0     |

**Tidy data**

| name  | n_dogs | n_cats | n_birds |
|-------|--------|--------|---------|
| Jake  | 1      | 0      | 1       |
| Alice | 1      | 2      | 0       |
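This symptom is the one `spread()` addresses. A minimal sketch using the table above, with `pets_long` and `pets_wide` as illustrative names:

```r
library(tidyr)

# The messy table from above: variable names are stored in the measurement column
pets_long <- data.frame(
  name        = rep(c("Jake", "Alice"), each = 3),
  measurement = rep(c("n_dogs", "n_cats", "n_birds"), times = 2),
  value       = c(1, 0, 1, 1, 2, 0),
  stringsAsFactors = FALSE
)

# Turn each distinct measurement into its own column
pets_wide <- spread(pets_long, measurement, value)
pets_wide
```

The result holds the same values as the tidy table, though rows and columns may come out in a different order than shown there.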

## Multiple variables are stored in one column

**Messy data**

| name   | sex_age | eye_color | height |
|--------|---------|-----------|--------|
| Jake   | M.34    | Other     | 6'1"   |
| Alice  | F.55    | Blue      | 5'9"   |
| Tim    | M.76    | Brown     | 5'7"   |
| Denise | F.19    | Other     | 5'1"   |

**Tidy data**

| name   | sex | age | eye_color | height |
|--------|-----|-----|-----------|--------|
| Jake   | M   | 34  | Other     | 6'1"   |
| Alice  | F   | 55  | Blue      | 5'9"   |
| Tim    | M   | 76  | Brown     | 5'7"   |
| Denise | F   | 19  | Other     | 5'1"   |
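A sketch of this fix with `separate()`, using the `name` and `sex_age` columns from the table above (the other columns are omitted for brevity, and `people_messy` is an illustrative name). Because `sep` is interpreted as a regular expression, the dot has to be escaped.

```r
library(tidyr)

# Two columns of the messy table above: sex and age share one column
people_messy <- data.frame(
  name    = c("Jake", "Alice", "Tim", "Denise"),
  sex_age = c("M.34", "F.55", "M.76", "F.19"),
  stringsAsFactors = FALSE
)

# Split on a literal dot; an unescaped "." would match any character
people_tidy <- separate(people_messy, sex_age, c("sex", "age"), sep = "\\.")
people_tidy
```

Without further options, the new `age` column stays a character vector, so it would still need converting before any arithmetic.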

## Other common symptoms

* A single observational unit is stored in multiple tables
* Multiple types of observational units are stored in the same table

| name   | age | height | pet_name | pet_type | pet_height |
|--------|-----|--------|----------|----------|------------|
| Jake   | 34  | 6'1"   | Larry    | Dog      | 25"        |
| Jake   | 34  | 6'1"   | Chirp    | Bird     | 3"         |
| Alice  | 55  | 5'9"   | Wally    | Dog      | 30"        |
| Alice  | 55  | 5'9"   | Sugar    | Cat      | 10"        |
| Alice  | 55  | 5'9"   | Spice    | Cat      | 12"        |

Jake's name, age, and height appear twice and Alice's three times, once for each pet.

The first three columns describe people and the last three describe pets, so the table should be separated into two: one with people's information and one with pets' information.
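One way to carry out that split is sketched below with dplyr. The table names `people` and `pets`, and the `owner` column used to link them, are choices made for this example rather than anything prescribed by the notes.

```r
library(dplyr)

# The combined table from above: people and pets mixed in one table
people_pets <- data.frame(
  name       = c("Jake", "Jake", "Alice", "Alice", "Alice"),
  age        = c(34, 34, 55, 55, 55),
  height     = c("6'1\"", "6'1\"", "5'9\"", "5'9\"", "5'9\""),
  pet_name   = c("Larry", "Chirp", "Wally", "Sugar", "Spice"),
  pet_type   = c("Dog", "Bird", "Dog", "Cat", "Cat"),
  pet_height = c("25\"", "3\"", "30\"", "10\"", "12\""),
  stringsAsFactors = FALSE
)

# People table: one row per person, duplicates removed
people <- people_pets %>%
  select(name, age, height) %>%
  distinct()

# Pets table: one row per pet, keeping the owner's name as a link
pets <- people_pets %>%
  select(owner = name, pet_name, pet_type, pet_height)

people
pets
```

Keeping `owner` in the pets table means the two tables can still be joined back together when needed.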


## Column headers are values, not variable names

In [None]:
library(tidyr)
library(dplyr)

## tidyr and dplyr are loaded above

# View the head of census
head(census)

# Gather the month columns
census2 <- gather(census, month, amount, -YEAR)

# Arrange rows by YEAR using dplyr's arrange
census2 <- arrange(census2, YEAR)

# View first 20 rows of census2
head(census2, 20)


## Variables are stored in both rows and columns

In [None]:
## tidyr is already loaded for you

# View first 50 rows of census_long
head(census_long, 50)

# Spread the type column
census_long2 <- spread(census_long, type, amount)

# View first 20 rows of census_long2
head(census_long2, 20)

## Multiple variables are stored in one column

In [None]:
## tidyr is already loaded for you

# View the head of census_long3
head(census_long3)

# Separate the yr_month column into two
census_long4 <- separate(census_long3, yr_month, c("year", "month"), sep = "_")

# View the first 6 rows of the result
head(census_long4)
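By default `separate()` leaves the new columns as character vectors. If a numeric `year` is wanted, its `convert = TRUE` argument can do the conversion in the same step. The sketch below assumes `yr_month` values shaped like `"1992_JAN"`; the `census_toy` data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical stand-in for census_long3 with a combined year/month column
census_toy <- data.frame(
  yr_month = c("1992_JAN", "1992_FEB", "1993_JAN"),
  amount   = c(100, 110, 120),
  stringsAsFactors = FALSE
)

# Split at the underscore and let tidyr convert column types where it can
census_toy_sep <- separate(census_toy, yr_month, c("year", "month"),
                           sep = "_", convert = TRUE)
str(census_toy_sep)
```

Here `year` becomes an integer column while `month` remains character.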