# Assembling data

## Binds

* rbind()
* cbind()

* bind_rows()
* bind_cols()

### Benefits of bind_rows() and bind_cols()

* Faster than rbind() and cbind()
* Return a tibble
* Can handle lists of data frames
* bind_rows() accepts .id: the name of a new column recording which input each row came from
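
As a minimal sketch of the last two points, the example below binds a named list of data frames and records each list name in a new column. The list `albums` and its contents are hypothetical toy data, not the course datasets.

```r
library(dplyr)

# Hypothetical toy data: one small data frame per album
albums <- list(
  first  = tibble(song = c("a", "b"), length = c(180, 200)),
  second = tibble(song = "c", length = 240)
)

# bind_rows() accepts the list directly;
# .id stores each list element's name in a new column called album
combined <- bind_rows(albums, .id = "album")
combined
```

The new album column comes first, followed by the columns of the bound data frames.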

### Which bind?

side_one and side_two contain tracks from sides one and two, respectively, of Pink Floyd's famous album The Dark Side of the Moon.

Bind these datasets into a single table using a dplyr function. Which type of bind should you use?

* Examine side_one and side_two by printing them to the console.
* Use a bind to combine side_one and side_two into a single dataset.

```
# Examine side_one and side_two
side_one
side_two

# Bind side_one and side_two into a single dataset
side_one %>% 
  bind_rows(side_two)
```

### Bind rows

discography and jimi contain all of the information you need to create an anthology dataset for the band The Jimi Hendrix Experience.

discography contains a data frame of each album by The Jimi Hendrix Experience and the year of the album.

jimi contains a list of data frames of album tracks, one for each album released by The Jimi Hendrix Experience. As Garrett explained in the video, you can pass bind_rows() a list of data frames like jimi, and it will bind them into a single data frame.

* Examine discography and jimi.
* Bind jimi into a single data frame. As you do, save the data frame names as a column named album by specifying the .id argument to bind_rows().
* Left join discography to the results to make a complete data frame.


```
# Examine discography and jimi
discography
jimi

jimi %>% 
  # Bind jimi into a single data frame
  bind_rows(.id = "album") %>% 
  # Make a complete data frame
  left_join(discography, by = c("album"))
```

### Bind columns
Let's make a compilation of Hank Williams' 67 singles. To do this, you can use hank_years and hank_charts:

* hank_years contains the name and release year of each of Hank Williams' 67 singles.
* hank_charts contains the name of each of Hank Williams' 67 singles as well as the highest position it earned on the Billboard sales charts.

Each dataset contains the same songs, but hank_years is arranged chronologically by year, while hank_charts is arranged alphabetically by song title.

* Examine hank_years and hank_charts. How should you bind the two datasets?
* Use arrange() to reorder hank_years alphabetically by song title.
* Select just the year column of the result.
* Bind the year column to hank_charts.
* arrange() the resulting dataset chronologically by year, then alphabetically by song title within each year.

```
# Examine hank_years and hank_charts
hank_years
hank_charts

hank_years %>% 
  # Reorder hank_years alphabetically by song title
  arrange(song) %>% 
  # Select just the year column
  select(year) %>% 
  # Bind the year column
  bind_cols(hank_charts) %>% 
  # Arrange the finished dataset
  arrange(year, song)
```
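
bind_cols() pairs rows purely by position, which is why the reordering step matters. The sketch below illustrates this on hypothetical toy tables (`years` and `charts` are made up for illustration and only mimic the shape of hank_years and hank_charts).

```r
library(dplyr)

# Hypothetical toy data: the same songs in a different row order
years  <- tibble(song = c("b", "a", "c"), year = c(1947, 1949, 1950))
charts <- tibble(song = c("a", "b", "c"), peak = c(1, 4, 2))

# Sort years by song so its rows line up with charts,
# keep only the year column, then bind by position
aligned <- years %>%
  arrange(song) %>%
  select(year) %>%
  bind_cols(charts)

aligned
```

Without the arrange() step, song "a" would be paired with the year of song "b". A join on song, such as left_join(charts, years, by = "song"), matches on the key instead of on row position.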

## Build a better data frame

* data.frame()
* as.data.frame()

* data_frame()
* as_data_frame()

### data.frame() defaults
* Changes strings to factors (the default before R 4.0.0)
* Adds row names
* Changes unusual column names
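
A small base R sketch of these three behaviors follows. The column names are hypothetical, and stringsAsFactors is set explicitly because its default differs between R versions.

```r
# Base R only; no packages needed
df <- data.frame(
  `my song` = c("a", "b"),   # deliberately unusual column name
  peak = c(1, 2),
  stringsAsFactors = TRUE    # set explicitly: the default changed in R 4.0.0
)

names(df)          # the space in "my song" is replaced, giving "my.song"
class(df$my.song)  # the strings were converted to a factor
rownames(df)       # automatic row names "1" and "2"
```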

### data_frame()

* will not
    * Change the data type of vectors (e.g. strings to factors)
    * Add row names
    * Change column names
    * Recycle vectors of length greater than one
* will
    * Evaluate arguments lazily, in order
    * Return a tibble
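
The sketch below checks several of these points on toy data. It uses tibble(), which is the current name for data_frame() (data_frame() still exists but is deprecated in recent versions of the tibble package). All column names here are hypothetical.

```r
library(dplyr)

tb <- tibble(
  `my song` = c("a", "b"),  # unusual name is kept as-is
  peak = c(1, 2),
  double_peak = peak * 2    # arguments are evaluated in order, so peak is already available
)

names(tb)
class(tb$`my song`)         # still character, not factor

# Only length-one vectors are recycled; mixing lengths 4 and 2 is an error
recycled <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "error"
)
recycled
```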

### Make a data frame
Let's make a Greatest Hits compilation for Hank Williams. hank_year, hank_song, and hank_peak contain the columns of the data frame you made in the last exercise.

* Use data_frame() to combine hank_year, hank_song, and hank_peak into a data frame that has the column names year, song, and peak; in that order.
* Use filter() to extract just the songs where peak equals 1 (i.e. Hank's number one hits).

```
# Make combined data frame using data_frame()
data_frame(year = hank_year,
           song = hank_song,
           peak = hank_peak) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)
```

### Lists of columns
As a data scientist, you should always be prepared to handle raw data that comes in many different formats.

hank saves Hank Williams' singles in a different way, as a list of vectors. Can you turn hank into the same dataset that you made in the last exercise?

* Examine the contents of hank.
* Use as_data_frame() to convert the hank list into a data frame.
* Use filter() to extract the number one hits.

```
# Examine the contents of hank
hank

# Convert the hank list into a data frame
as_data_frame(hank) %>% 
  # Extract songs where peak equals 1
  filter(peak == 1)
```
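
For readers without the course data, here is a self-contained sketch of the same idea. The list `singles` is hypothetical toy data, and as_tibble() is used as the current name for as_data_frame().

```r
library(dplyr)

# Hypothetical list of equal-length vectors, one per column
singles <- list(
  year = c(1947, 1949, 1950),
  song = c("a", "b", "c"),
  peak = c(4, 1, 1)
)

# Each list element becomes a column; then keep the number one hits
hits <- as_tibble(singles) %>%
  filter(peak == 1)

hits
```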

### Lists of rows (data frames)
michael contains a list of data frames, one for each album released by Michael Jackson. The code in the editor attempts to bind the data frames into a single data frame and then extract a data frame of the top tracks on each album.

However, the code runs into a problem. The commented line fails because as_data_frame() combines a list of column vectors into a data frame. However, michael is a list of data frames.

Can you fix the code? After all, you have seen something like this before.

* Examine the contents of michael.
* Replace the commented code in the editor with a call to a dplyr function, which should bind the datasets in the list into a single data frame, adding an album column as it does.

```
# Examine the contents of michael
michael

# as_data_frame(michael) %>% 
michael %>%
  bind_rows(.id = "album") %>%
  group_by(album) %>% 
  mutate(rank = min_rank(peak)) %>% 
  filter(rank == 1) %>% 
  select(-rank, -peak)
```

## Working with data types

### Atomic data types

* logical
* character
* double
* integer
* complex
* raw

### dplyr's coercion rules

* Character + Integer/Double/Logical -> Character
* Double + Integer/Logical -> Double
* Integer + Logical -> Integer
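
The same hierarchy can be observed in base R when c() combines atomic vectors of different types. This is a minimal sketch using made-up values.

```r
# Character wins over every other type
mixed_chr <- c("a", 1L, 2.5, TRUE)
typeof(mixed_chr)   # "character"

# Double wins over integer and logical
mixed_dbl <- c(1.5, 2L, TRUE)
typeof(mixed_dbl)   # "double"

# Integer wins over logical
mixed_int <- c(1L, TRUE)
typeof(mixed_int)   # "integer"
```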

### Factors

* A factor can be converted to character (its labels) or to numeric (its underlying integer codes, not its labels)
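
A short base R sketch of the two conversions, using a hypothetical factor of years:

```r
year <- factor(c("1975", "1971", "1975"))

labels <- as.character(year)             # the labels: "1975" "1971" "1975"
codes  <- as.numeric(year)               # the integer codes: 2 1 2
values <- as.numeric(as.character(year)) # the labels as numbers: 1975 1971 1975
```

The codes are positions in the sorted levels ("1971", "1975"), which is why converting a factor directly to numeric rarely gives the values you expect.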

### dplyr's coercion behavior

* dplyr functions will not automatically coerce data types
    * They return an error instead
    * You are expected to coerce the data manually
* Exception: factors
    * dplyr converts factors whose levels do not match to strings
    * It gives a warning message when it does so
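
The first point can be sketched with two hypothetical tables whose id columns have different types. The factor case is left out here because how it is handled has varied across dplyr versions.

```r
library(dplyr)

# Hypothetical toy data: id is double in one table and character in the other
nums <- tibble(id = c(1, 2))
chrs <- tibble(id = c("3", "4"))

# bind_rows() refuses to guess a common type and raises an error
attempt <- tryCatch(
  bind_rows(nums, chrs),
  error = function(e) "error"
)
attempt

# Coercing manually before binding works
fixed <- bind_rows(nums, mutate(chrs, id = as.numeric(id)))
fixed
```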

### Results

sixties contains the top selling albums in the US in the 1960s. It stores year as a numeric (double). When you combine it with seventies, which stores year as a factor, bind_rows() returns an error.

You can fix this by coercing seventies$year to a numeric. But if you do it like this, something surprising happens.

```
seventies %>% 
  mutate(year = as.numeric(year))
```

Can you fix things?

* Coerce seventies$year into a useful numeric.
* Bind the updated version of seventies to sixties and examine the results. Make sure they are sensible.

```
seventies %>% 
  # Coerce seventies$year into a useful numeric
  mutate(year = as.numeric(as.character(year))) %>% 
  # Bind the updated version of seventies to sixties
  bind_rows(sixties) %>% 
  arrange(year)
```