# Mutating joins

## Welcome to the course!


| Var_1 | Var_2 | Var_3 | Var_4 |
|-------|-------|-------|-------|
| obs_1 | 33    | 3     | 54    |
| obs_2 | 20    | 90    | 22    |
| obs_3 | 58    | 12    | 15    |
| obs_4 | 83    | 81    | 5     |

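```r
# Build the example table above (values copied from it)
df <- data.frame(Var_1 = paste0("obs_", 1:4), Var_2 = c(33, 20, 58, 83),
                 Var_3 = c(3, 90, 12, 81), Var_4 = c(54, 22, 15, 5))

# Summarise a column
mean(df[["Var_2"]])

# Create a column
df[["Var_5"]] <- df[["Var_2"]] + df[["Var_4"]]
```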

## Course outline

* Chapter 1 - Mutating joins
    - Join multiple tables
* Chapter 2 - Filtering joins and set operations
    - Search, extract rows, ...
* Chapter 3 - Assembling data
    - Compose a data frame
* Chapter 4 - Advanced joining
    - What can go wrong? How to fix errors
* Chapter 5 - Case study

## Benefits of dplyr join functions

* Always preserve row order
* Intuitive syntax
* Can be applied to databases, spark, etc.

## The advantages of dplyr

* The syntax of dplyr joins is intuitive.
* dplyr joins preserve the row order of your data.
* dplyr joins also work on the SQL databases that dplyr connects to.

## Keys

* A key is a column, or combination of columns, that occurs in each of the tables you want to join.
* dplyr completes the join by matching rows that have the same key values.
* Primary key: a key in the primary table that uniquely identifies each row.
* Secondary key (foreign key): the matching key in the second table; its values may repeat.
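One quick way to see whether a column can act as a primary key is to count how often each of its values occurs. The sketch below assumes two hypothetical toy tables (`players` and `memberships` are illustrative names, not part of the course data).

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables, used only for illustration
players <- tibble(
  name       = c("Ann", "Bob", "Cy"),
  instrument = c("Guitar", "Drums", "Bass")
)
memberships <- tibble(
  name = c("Ann", "Ann", "Bob"),
  band = c("Band A", "Band B", "Band A")
)

# A primary key has no repeated values, so this returns zero rows
player_dupes <- players %>% count(name) %>% filter(n > 1)

# A secondary key may repeat: "Ann" appears twice in memberships
membership_dupes <- memberships %>% count(name) %>% filter(n > 1)

player_dupes
membership_dupes
```

If the first check returns any rows, the column does not uniquely identify rows and cannot serve as the primary key on its own.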

### Primary Key

* Which column can serve as the primary key of the **artists** dataset?
    - **name**
    - instrument

```r
library(dplyr)
library(tibble)

artists_df <- data.frame(
    name = c("Jimmy Buffett", "George Harrison", "Mick Jagger", "Tom Jones",
            "Davy Jones", "John Lennon", "Paul McCartney", "Jimmy Page",
            "Joe Perry", "Elvis Presley", "Keith Richards", "Paul Simon",
            "Ringo Starr", "Joe Walsh", "Brian Wilson", "Nancy Wilson"),
    instrument = c("Guitar", "Guitar", "Vocals", "Vocals",
                   "Vocals", "Guitar", "Bass", "Guitar",
                   "Guitar", "Vocals", "Guitar", "Guitar",
                   "Drums", "Guitar", "Vocals", "Vocals"))
artists <- as_tibble(artists_df)
artists
```

| name            | instrument |
|-----------------|------------|
| Jimmy Buffett   | Guitar     |
| George Harrison | Guitar     |
| Mick Jagger     | Vocals     |
| Tom Jones       | Vocals     |
| Davy Jones      | Vocals     |
| John Lennon     | Guitar     |
| Paul McCartney  | Bass       |
| Jimmy Page      | Guitar     |
| Joe Perry       | Guitar     |
| Elvis Presley   | Vocals     |


### Secondary keys

* albums, bands, and songs provide information that you may be able to join to artists. To do so, you'll need to pair the primary key in artists (i.e. name) with secondary keys in albums, bands, and/or songs.

* Examine albums, bands, and songs. Which datasets have a secondary key that matches artists$name? What are the secondary keys?

```r
# albums
albums_df <- data.frame(
    album = c("A Hard Day's Night", "Magical Mystery Tour", "Beggar's Banquet",
            "Abbey Road", "Led Zeppelin IV", "The Dark Side of the Moon",
            "Aerosmith", "Rumours", "Hotel California"),
    band = c("The Beatles", "The Beatles", "The Rolling Stones",
            "The Beatles", "Led Zeppelin", "Pink Floyd",
            "Aerosmith", "Fleetwood Mac", "Eagles"),
    year = c(1964, 1967, 1968, 1969, 1971, 1973, 1973, 1977, 1982))

albums <- as_tibble(albums_df)

# bands
bands_df <- data.frame(
    name = c("John Bonham", "John Paul Jones", "Jimmy Page", "Robert Plant",
            "George Harrison", "John Lennon", "Paul McCartney", "Ringo Starr",
            "Jimmy Buffett", "Mick Jagger", "Keith Richards",
            "Charlie Watts", "Ronnie Wood"),
    band = c("Led Zeppelin", "Led Zeppelin", "Led Zeppelin", "Led Zeppelin",
            "The Beatles", "The Beatles", "The Beatles", "The Beatles",
            "The Coral Reefers", "The Rolling Stones", "The Rolling Stones",
            "The Rolling Stones", "The Rolling Stones"))

bands <- as_tibble(bands_df)

# songs
songs_df <- data.frame(
    song = c("Come Together", "Dream On", "Hello, Goodbye", "It's not Unusual"),
    album = c("Abbey Road", "Aerosmith", "Magical Mystery Tour", "Along Came Jones"),
    writer = c("John Lennon", "Steven Tyler", "Paul McCartney", "Tom Jones"))

songs <- as_tibble(songs_df)

# Check the elements
artists$name
bands$name
songs$writer
```

### Multi-variable keys

As Garrett mentioned in the video, sometimes no single variable acts as a primary key in a dataset. Instead, it takes a combination of variables to uniquely identify each row.

For example, here is a modified version of the artists dataset that does not contain a name variable. What would be the primary key in this dataset?

**The combination of first and last**

```r
artists1 <- tidyr::separate(artists, col = name, into = c("first", "last"), sep = " ")

nrow(artists1)
length(unique(artists1$first))
length(unique(artists1$last))
length(unique(artists1$instrument))
```
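The counts above show that no single column has as many distinct values as there are rows. A combination of columns can be tested the same way by counting distinct combinations. This is a minimal sketch on a hypothetical toy table (`people` is an illustrative name, not the course data).

```r
library(dplyr)
library(tibble)

# Hypothetical toy table: neither first nor last is unique on its own
people <- tibble(
  first      = c("Tom", "Davy", "Tom"),
  last       = c("Jones", "Jones", "Petty"),
  instrument = c("Vocals", "Vocals", "Guitar")
)

n_first <- n_distinct(people$first)  # fewer than nrow(people)
n_last  <- n_distinct(people$last)   # fewer than nrow(people)

# The pair (first, last) is a key if every combination occurs exactly once
key_is_unique <- nrow(distinct(people, first, last)) == nrow(people)
key_is_unique
```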

## Joins

### left_join()

* left_join(table to augment, table to augment with, by = key column name as a character string)
* left_join treats the first dataset as the primary dataset.

### right_join()

* right_join(table to augment with, table to augment, by = key column name as a character string)
* right_join treats the second dataset as the primary dataset.
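To make the difference concrete, here is a small sketch on two hypothetical toy tables (`members` and `players` are illustrative names). It shows which rows each function keeps and where missing values appear.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables for illustration only
members <- tibble(name = c("Ann", "Bob"), band = c("Band A", "Band B"))
players <- tibble(name = c("Bob", "Cy"),  instrument = c("Drums", "Bass"))

# Keeps every row of members; Ann has no match, so her instrument is NA
left_result <- left_join(members, players, by = "name")

# Keeps every row of players; Cy has no match, so his band is NA
right_result <- right_join(members, players, by = "name")

left_result
right_result
```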

### "tables"

* data frames
* tibbles (tbl_df)
* tbl references

### tibble vs. data frame

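* A tibble is a data frame with a few extra behaviors, so the join functions work on both.
* Printing a tibble shows only the first rows and the columns that fit on screen, which makes it easier to inspect a small portion of a large dataset.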
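A short sketch of the relationship, using a hypothetical 50-row data frame (`df` and `tbl` are illustrative names):

```r
library(tibble)

# Hypothetical data frame with more rows than a tibble prints by default
df  <- data.frame(id = 1:50, value = (1:50)^2)
tbl <- as_tibble(df)

# A tibble is still a data frame, with extra classes on top
class(tbl)
is.data.frame(tbl)

# Printing tbl shows only the first rows plus the column types,
# whereas printing df would list all 50 rows
tbl
```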

### A basic join

* Complete the code to join artists to bands. bands2 should contain all of the information in bands supplemented with information in artists.
* Print bands2 to the console to see the result.

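```r
artists1 <- tidyr::separate(artists, col = name,
                            into = c("first", "last"), sep = " ")
bands1   <- tidyr::separate(bands, col = name,
                            into = c("first", "last"), sep = " ")

bands2   <- left_join(bands1, artists1, by = c("first", "last"))
bands2
```

Warning message: "Too many values at 1 locations: 2"

The warning comes from row 2 of bands: "John Paul Jones" has three words, so only "John" and "Paul" are kept as first and last.

| first  | last      | band               | instrument |
|--------|-----------|--------------------|------------|
| John   | Bonham    | Led Zeppelin       | NA         |
| John   | Paul      | Led Zeppelin       | NA         |
| Jimmy  | Page      | Led Zeppelin       | Guitar     |
| Robert | Plant     | Led Zeppelin       | NA         |
| George | Harrison  | The Beatles        | Guitar     |
| John   | Lennon    | The Beatles        | Guitar     |
| Paul   | McCartney | The Beatles        | Bass       |
| Ringo  | Starr     | The Beatles        | Drums      |
| Jimmy  | Buffett   | The Coral Reefers  | Guitar     |
| Mick   | Jagger    | The Rolling Stones | Vocals     |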


### A second join

The result from the previous exercise, bands2, is loaded in your workspace.

* Examine the output from the code provided in the editor. How is it different from bands2?
* Fix the code so that the result is identical to bands2.

```r
# Fix the code to recreate bands2
# left_join(bands1, artists1, by = "first")
left_join(bands1, artists1, by = c("first", "last"))
```

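| first  | last      | band               | instrument |
|--------|-----------|--------------------|------------|
| John   | Bonham    | Led Zeppelin       | NA         |
| John   | Paul      | Led Zeppelin       | NA         |
| Jimmy  | Page      | Led Zeppelin       | Guitar     |
| Robert | Plant     | Led Zeppelin       | NA         |
| George | Harrison  | The Beatles        | Guitar     |
| John   | Lennon    | The Beatles        | Guitar     |
| Paul   | McCartney | The Beatles        | Bass       |
| Ringo  | Starr     | The Beatles        | Drums      |
| Jimmy  | Buffett   | The Coral Reefers  | Guitar     |
| Mick   | Jagger    | The Rolling Stones | Vocals     |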
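The commented-out call above joins on first alone. The sketch below suggests why that gives a different result. It assumes two hypothetical toy tables (`members` and `players` are illustrative names, not the course data).

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables that share both first and last
members <- tibble(first = c("Tom", "Tom"), last = c("Jones", "Petty"),
                  band  = c("Band A", "Band B"))
players <- tibble(first = "Tom", last = "Jones", instrument = "Vocals")

# Joining on first alone matches both Toms to the single player row
# and keeps the two last columns as last.x and last.y
partial <- left_join(members, players, by = "first")

# Joining on the full key matches only Tom Jones and keeps one last column
full_key <- left_join(members, players, by = c("first", "last"))

names(partial)
partial
full_key
```

With the partial key, Tom Petty wrongly inherits the instrument of Tom Jones. With the full key, his instrument is NA.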


### A right join

* Use right_join() to create bands3, a new dataset that contains the same information as bands2.
* Use setequal() to check that the datasets are the same.

```r
# Finish the code below to recreate bands3 with a right join
bands2 <- left_join(bands1, artists1, by = c("first", "last"))
bands3 <- right_join(artists1, bands1, by = c("first", "last"))

# Check that bands3 is equal to bands2
setequal(bands2, bands3)
```

TRUE

## Variations on joins

### Mutating joins

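* Named by analogy with mutate(): like mutate(), a mutating join returns a copy of a dataset with new columns added.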
* left_join()
* right_join()
* inner_join(): Returns only the rows from the first dataset that have a match in the second dataset. It's the most exclusive join because every row in the result must appear in both datasets.
* full_join(): Returns every row in either dataset. It's the most inclusive join because it returns all of the information in the original tables.

### Syntax

```
left_join(x, y, by = )
right_join(x, y, by = )
inner_join(x, y, by = )
full_join(x, y, by = )
```
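The four functions differ only in which rows they keep. A minimal sketch on two hypothetical toy tables (`x`, `y`, and their columns are illustrative) compares the row counts:

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables: keys 2 and 3 appear in both
x <- tibble(key = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(key = c(2, 3, 4), y_val = c("y2", "y3", "y4"))

left_rows  <- nrow(left_join(x, y, by = "key"))   # keys 1, 2, 3
right_rows <- nrow(right_join(x, y, by = "key"))  # keys 2, 3, 4
inner_rows <- nrow(inner_join(x, y, by = "key"))  # keys 2, 3
full_rows  <- nrow(full_join(x, y, by = "key"))   # keys 1, 2, 3, 4

c(left = left_rows, right = right_rows, inner = inner_rows, full = full_rows)
```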

* Join functions are natural candidates for pipelines written with the pipe operator (%>%).

### Pipe operator

It takes the result on its left-hand side and passes it into the function on its right-hand side.

```
x <- 1:10

x %>% sum()
```
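As a small check of that description, the sketch below compares a piped call with the equivalent direct call. The variable names are illustrative.

```r
library(dplyr)  # makes %>% available

x <- 1:10

piped  <- x %>% sum()
direct <- sum(x)

# Any extra arguments follow the piped value
first_three <- x %>% head(3)  # same as head(x, 3)

identical(piped, direct)
first_three
```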

### dplyr and pipes

```
names %>%
    full_join(plays, by = "name") %>%
    mutate(missing_info = is.na(band) | is.na(plays)) %>%
    filter(missing_info == TRUE) %>%
    select(name, band, plays)
```

### Summary

* left_join(): prioritizes the first dataset
* right_join(): prioritizes the second dataset
* inner_join(): only retains rows that appear in both datasets
* full_join(): retains every row that appears in any dataset

### Inner joins and full joins

* Join albums to songs in a way that returns only rows that contain information about both songs and albums.
* Join bands to artists to create a single table that contains all of the available data.

```r
songs1 <- tidyr::separate(songs, col = writer,
                          into = c("first", "last"), sep = " ")
```

```r
# Join albums to songs using inner_join()
inner_join(songs1, albums, by = "album")

# Join bands to artists using full_join()
full_join(artists1, bands1, by = c("first", "last"))
```

Warning message: "Column `album` joining factors with different levels, coercing to character vector"

| song           | album                | first  | last      | band        | year |
|----------------|----------------------|--------|-----------|-------------|------|
| Come Together  | Abbey Road           | John   | Lennon    | The Beatles | 1969 |
| Dream On       | Aerosmith            | Steven | Tyler     | Aerosmith   | 1973 |
| Hello, Goodbye | Magical Mystery Tour | Paul   | McCartney | The Beatles | 1967 |

| first  | last      | instrument | band               |
|--------|-----------|------------|--------------------|
| Jimmy  | Buffett   | Guitar     | The Coral Reefers  |
| George | Harrison  | Guitar     | The Beatles        |
| Mick   | Jagger    | Vocals     | The Rolling Stones |
| Tom    | Jones     | Vocals     | NA                 |
| Davy   | Jones     | Vocals     | NA                 |
| John   | Lennon    | Guitar     | The Beatles        |
| Paul   | McCartney | Bass       | The Beatles        |
| Jimmy  | Page      | Guitar     | Led Zeppelin       |
| Joe    | Perry     | Guitar     | NA                 |
| Elvis  | Presley   | Vocals     | NA                 |


### Pipes

The code in the editor finds all of the known guitarists in the bands dataset. Rewrite the code to use %>%s instead of multiple function calls. The pipe %>% should be used three times and temp zero times.

```r
# Find guitarists in bands dataset (don't change)
temp <- left_join(bands1, artists1, by = c("first", "last"))
temp <- filter(temp, instrument == "Guitar")
select(temp, first, last, band)
```

| first  | last     | band               |
|--------|----------|--------------------|
| Jimmy  | Page     | Led Zeppelin       |
| George | Harrison | The Beatles        |
| John   | Lennon   | The Beatles        |
| Jimmy  | Buffett  | The Coral Reefers  |
| Keith  | Richards | The Rolling Stones |


```r
# Reproduce code above using pipes
bands1 %>%
    left_join(artists1, by = c("first", "last")) %>%
    filter(instrument == "Guitar") %>%
    select(first, last, band)
```

| first  | last     | band               |
|--------|----------|--------------------|
| Jimmy  | Page     | Led Zeppelin       |
| George | Harrison | The Beatles        |
| John   | Lennon   | The Beatles        |
| Jimmy  | Buffett  | The Coral Reefers  |
| Keith  | Richards | The Rolling Stones |


### Practice with pipes and joins

* Examine the goal dataset by printing it to the console.
* Write a pipe that uses a full join and an inner join to combine artists, bands, and songs into goal2, a dataset identical to goal.
* Use setequal() to check that goal is identical to goal2.

```r
goal <- as_tibble(data.frame(
    first = c("Tom", "John", "Paul"),
    last = c("Jones", "Lennon", "McCartney"),
    instrument = c("Vocals", "Guitar", "Bass"),
    band = c(NA, "The Beatles", "The Beatles"),
    song = c("It's Not Unusual", "Come Together", "Hello, Goodbye"),
    album = c("Along Came Jones", "Abbey Road", "Magical Mystery Tour")))

goal
```

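| first | last      | instrument | band        | song             | album                |
|-------|-----------|------------|-------------|------------------|----------------------|
| Tom   | Jones     | Vocals     | NA          | It's Not Unusual | Along Came Jones     |
| John  | Lennon    | Guitar     | The Beatles | Come Together    | Abbey Road           |
| Paul  | McCartney | Bass       | The Beatles | Hello, Goodbye   | Magical Mystery Tour |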


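```r
goal2 <- artists1 %>%
    full_join(bands1, by = c("first", "last")) %>%
    inner_join(songs1, by = c("first", "last"))

# Check against goal (note: songs has "It's not Unusual", goal has "It's Not Unusual")
setequal(goal, goal2)
goal2
```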

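| first | last      | instrument | band        | song             | album                |
|-------|-----------|------------|-------------|------------------|----------------------|
| Tom   | Jones     | Vocals     | NA          | It's not Unusual | Along Came Jones     |
| John  | Lennon    | Guitar     | The Beatles | Come Together    | Abbey Road           |
| Paul  | McCartney | Bass       | The Beatles | Hello, Goodbye   | Magical Mystery Tour |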


### Choose your joins

Write a pipe that combines artists, bands, songs, and albums (in that order) into a single table, such that it contains all of the information in the datasets.

```r
names(artists1)
names(bands1)
names(songs1)
names(albums)
```

```r
# Create one table that combines all information
artists1 %>%
    full_join(bands1, by = c("first", "last")) %>%
    full_join(songs1, by = c("first", "last")) %>%
    full_join(albums, by = c("album", "band"))
```

Warning message: "Column `band` joining factors with different levels, coercing to character vector"

| first  | last      | instrument | band               | song             | album                | year |
|--------|-----------|------------|--------------------|------------------|----------------------|------|
| Jimmy  | Buffett   | Guitar     | The Coral Reefers  | NA               | NA                   | NA   |
| George | Harrison  | Guitar     | The Beatles        | NA               | NA                   | NA   |
| Mick   | Jagger    | Vocals     | The Rolling Stones | NA               | NA                   | NA   |
| Tom    | Jones     | Vocals     | NA                 | It's not Unusual | Along Came Jones     | NA   |
| Davy   | Jones     | Vocals     | NA                 | NA               | NA                   | NA   |
| John   | Lennon    | Guitar     | The Beatles        | Come Together    | Abbey Road           | 1969 |
| Paul   | McCartney | Bass       | The Beatles        | Hello, Goodbye   | Magical Mystery Tour | 1967 |
| Jimmy  | Page      | Guitar     | Led Zeppelin       | NA               | NA                   | NA   |
| Joe    | Perry     | Guitar     | NA                 | NA               | NA                   | NA   |
| Elvis  | Presley   | Vocals     | NA                 | NA               | NA                   | NA   |
