# Filtering joins and set operations

## Semi-joins

* Keeps the rows of one data set that have a match in the other data set, without adding any columns from the other data set

In [15]:
library(dplyr)
library(tibble)

artists_df <- data.frame(
    name = c("Jimmy Buffett", "George Harrison", "Mick Jagger", "Tom Jones",
            "Davy Jones", "John Lennon", "Paul McCartney", "Jimmy Page",
            "Joe Perry", "Elvis Presley", "Keith Richards", "Paul Simon",
            "Ringo Starr", "Joe Walsh", "Brian Wilson", "Nancy Wilson"),
    instrument = c("Guitar", "Guitar", "Vocals", "Vocals",
                   "Vocals", "Guitar", "Bass", "Guitar",
                   "Guitar", "Vocals", "Guitar", "Guitar",
                   "Drums", "Guitar", "Vocals", "Vocals"))
artists <- as_tibble(artists_df)

# albums
albums_df <- data.frame(
    album = c("A Hard Day's Night", "Magical Mystery Tour", "Beggar's Banquet",
            "Abbey Road", "Led Zeppelin IV", "The Dark Side of the Moon",
            "Aerosmith", "Rumours", "Hotel California"),
    band = c("The Beatles", "The Beatles", "The Rolling Stones",
            "The Beatles", "Led Zeppelin", "Pink Floyd",
            "Aerosmith", "Fleetwood Mac", "Eagles"),
    year = c(1964, 1967, 1968, 1969, 1971, 1973, 1973, 1977, 1982))

albums <- as_tibble(albums_df)

# bands
bands_df <- data.frame(
    name = c("John Bonham", "John Paul Jones", "Jimmy Page", "Robert Plant",
            "George Harrison", "John Lennon", "Paul McCartney", "Ringo Starr",
            "Jimmy Buffett", "Mick Jagger", "Keith Richards",
            "Charlie Watts", "Ronnie Wood"),
    band = c("Led Zeppelin", "Led Zeppelin", "Led Zeppelin", "Led Zeppelin",
            "The Beatles", "The Beatles", "The Beatles", "The Beatles",
            "The Coral Reefers", "The Rolling Stones", "The Rolling Stones",
            "The Rolling Stones", "The Rolling Stones"))

bands <- as_tibble(bands_df)

# songs
songs_df <- data.frame(
    song = c("Come Together", "Dream On", "Hello, Goodbye", "It's not Unusual"),
    album = c("Abbey Road", "Aerosmith", "Magical Mystery Tour", "Along Came Jones"),
    writer = c("John Lennon", "Steven Tyler", "Paul McCartney", "Tom Jones"))

songs <- as_tibble(songs_df)

In [16]:
artists <- tidyr::separate(artists, col = name,
                           into = c("first", "last"), sep = " ")
bands   <- tidyr::separate(bands, col = name,
                           into = c("first", "last"), sep = " ")
songs   <- tidyr::separate(songs, col = writer,
                           into = c("first", "last"), sep = " ")

"Too many values at 1 locations: 2"
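
The warning above appears because "John Paul Jones" splits into three pieces while only two columns are requested, so the last piece is dropped. A minimal sketch of one way to keep it, assuming tidyr's extra = "merge" option suits the data; the members tibble is a hypothetical two-row stand-in, not the notebook's bands:

```r
library(tibble)
library(tidyr)

# Hypothetical two-row example; "John Paul Jones" has three space-separated parts
members <- tibble(name = c("John Bonham", "John Paul Jones"))

# Default behaviour (extra = "warn"): the third piece is dropped with a warning
dropped <- suppressWarnings(
  separate(members, col = name, into = c("first", "last"), sep = " ")
)

# extra = "merge": split only as often as needed, so the remainder stays in `last`
merged <- separate(members, col = name, into = c("first", "last"),
                   sep = " ", extra = "merge")

dropped$last  # "Bonham" "Paul"
merged$last   # "Bonham" "Paul Jones"
```

Whether "Paul Jones" is the right last name is a data question; the sketch only shows that nothing is silently discarded.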

### Apply a semi-join

In [19]:
# View the output of semi_join()
artists %>% 
  semi_join(songs, by = c("first", "last"))

# Create the same result
artists %>% 
  right_join(songs, by = c("first", "last")) %>% 
  filter(!is.na(instrument)) %>% 
  select(first, last, instrument)

first,last,instrument
Tom,Jones,Vocals
John,Lennon,Guitar
Paul,McCartney,Bass


first,last,instrument
John,Lennon,Guitar
Paul,McCartney,Bass
Tom,Jones,Vocals


### Exploring with semi-joins

In [20]:
albums

album,band,year
A Hard Day's Night,The Beatles,1964
Magical Mystery Tour,The Beatles,1967
Beggar's Banquet,The Rolling Stones,1968
Abbey Road,The Beatles,1969
Led Zeppelin IV,Led Zeppelin,1971
The Dark Side of the Moon,Pink Floyd,1973
Aerosmith,Aerosmith,1973
Rumours,Fleetwood Mac,1977
Hotel California,Eagles,1982


In [21]:
bands

first,last,band
John,Bonham,Led Zeppelin
John,Paul,Led Zeppelin
Jimmy,Page,Led Zeppelin
Robert,Plant,Led Zeppelin
George,Harrison,The Beatles
John,Lennon,The Beatles
Paul,McCartney,The Beatles
Ringo,Starr,The Beatles
Jimmy,Buffett,The Coral Reefers
Mick,Jagger,The Rolling Stones


In [34]:
albums %>% 
  # Collect the albums made by a band
  semi_join(bands, by = c("band")) %>%
  # Count the albums made by a band
  nrow()

"Column `band` joining factors with different levels, coercing to character vector"

### A more precise way to filter?

We've attempted to rewrite this semi-join as a filter. Will it return the same results?

In [35]:
tracks %>% semi_join(
  matches,
  by = c("band", "year", "first")
)

ERROR: Error in eval(lhs, parent, parent): object 'tracks' not found
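
Since tracks and matches are not defined in this notebook, the comparison can be sketched on small made-up data. The tibbles below are hypothetical stand-ins, and the filter() shown is only one plausible way the rewrite might have been written:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for `tracks` and `matches` (not the course data)
tracks <- tibble(
  band  = c("A", "A", "B"),
  year  = c(1970, 1980, 1970),
  first = c("Ann", "Ann", "Bob")
)
matches <- tibble(
  band  = c("A", "B"),
  year  = c(1980, 1970),
  first = c("Ann", "Bob")
)

# Semi-join: a row is kept only if the whole key combination occurs in matches
by_join <- tracks %>%
  semi_join(matches, by = c("band", "year", "first"))

# Column-wise filter: each column is checked on its own
by_filter <- tracks %>%
  filter(band %in% matches$band,
         year %in% matches$year,
         first %in% matches$first)

nrow(by_join)    # 2
nrow(by_filter)  # 3
```

In this sketch the filter keeps the row ("A", 1970, "Ann") because each value occurs somewhere in matches, even though that combination does not. The semi-join is the more precise of the two.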


## Anti-join

* Does the opposite of semi_join(): it keeps the rows that semi_join() would drop.
* Shows which rows in the primary data frame do not have matches in the secondary data frame
* This can be used to check for misspellings in the key values.
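
A minimal sketch of the points above, using hypothetical tibbles named primary and secondary: the two filtering joins split the primary data frame into matched and unmatched rows, and a misspelled key shows up in the unmatched part.

```r
library(dplyr)
library(tibble)

# Hypothetical data: one key in `primary` is misspelled
primary   <- tibble(key = c("abbey road", "rumours", "rumors"), n = 1:3)
secondary <- tibble(key = c("abbey road", "rumours"))

matched   <- semi_join(primary, secondary, by = "key")
unmatched <- anti_join(primary, secondary, by = "key")

unmatched$key                                      # "rumors"
nrow(matched) + nrow(unmatched) == nrow(primary)   # TRUE
```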

## Join functions provided in dplyr

* left_join()
* right_join()
* inner_join()
* full_join()
* semi_join()
* anti_join()
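
The first four functions in this list add columns (mutating joins), while the last two only keep or drop rows (filtering joins). A small sketch of the difference, using the hypothetical tibbles groups and records:

```r
library(dplyr)
library(tibble)

# Hypothetical data: "The Beatles" matches two rows of `records`
groups  <- tibble(band = c("The Beatles", "Eagles"))
records <- tibble(band  = c("The Beatles", "The Beatles"),
                  album = c("Abbey Road", "Magical Mystery Tour"))

mutating  <- inner_join(groups, records, by = "band")
filtering <- semi_join(groups, records, by = "band")

dim(mutating)   # 2 2: one row per match, plus the `album` column
dim(filtering)  # 1 1: the original row, unchanged and not duplicated
```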

### Apply an anti-join

Use anti_join() to return the rows of artists for which you don't have any bands info. Note: Don't forget to specify the by argument.

In [36]:
artists %>%
    anti_join(bands, by = c("first", "last"))

first,last,instrument
Tom,Jones,Vocals
Davy,Jones,Vocals
Joe,Perry,Guitar
Elvis,Presley,Vocals
Paul,Simon,Guitar
Joe,Walsh,Guitar
Brian,Wilson,Vocals
Nancy,Wilson,Vocals


### Apply another anti-join

labels describes the record labels of the albums in albums. Compare the spellings of album names in labels with the names in albums. Are any of the album names in labels misspelled? Use anti_join() to check. Note: Don't forget to specify the by argument.

In [38]:
labels_df <- data.frame(album = c("Abbey Road", "A Hard Days Night", "Magical Mystery Tour",
                                 "Led Zeppelin IV", "The Dark Side of the Moon", "Hotel California",
                                 "Rumours", "Aerosmith", "Beggar's Banquet"),
                       label = c("Apple", "Parlophone", "Parlophone", "Atlantic", "Harvest",
                                "Asylum", "Warner Brothers", "Columbia", "Decca"))
labels <- as_tibble(labels_df)

In [41]:
# Check whether album names in labels are mis-entered
labels %>%
    anti_join(albums, by = c("album"))
albums

"Column `album` joining factors with different levels, coercing to character vector"

album,label
A Hard Days Night,Parlophone


album,band,year
A Hard Day's Night,The Beatles,1964
Magical Mystery Tour,The Beatles,1967
Beggar's Banquet,The Rolling Stones,1968
Abbey Road,The Beatles,1969
Led Zeppelin IV,Led Zeppelin,1971
The Dark Side of the Moon,Pink Floyd,1973
Aerosmith,Aerosmith,1973
Rumours,Fleetwood Mac,1977
Hotel California,Eagles,1982


### Which filtering join?

Which filtering join would you use to determine how many rows in songs match a label in labels?

* Determine which key joins labels and songs.
* Use a filtering join to find the rows of songs that match a row in labels.
* Use nrow() to determine how many matches exist between labels and songs.

In [44]:
# Determine which key joins labels and songs
labels
songs

# Check your understanding
songs %>% 
  # Find the rows of songs that match a row in labels
  semi_join(labels, by = c("album")) %>%
  # Number of matches between labels and songs
  nrow()

album,label
Abbey Road,Apple
A Hard Days Night,Parlophone
Magical Mystery Tour,Parlophone
Led Zeppelin IV,Atlantic
The Dark Side of the Moon,Harvest
Hotel California,Asylum
Rumours,Warner Brothers
Aerosmith,Columbia
Beggar's Banquet,Decca


song,album,first,last
Come Together,Abbey Road,John,Lennon
Dream On,Aerosmith,Steven,Tyler
"Hello, Goodbye",Magical Mystery Tour,Paul,McCartney
It's not Unusual,Along Came Jones,Tom,Jones


"Column `album` joining factors with different levels, coercing to character vector"

## Set operations

* When two data sets contain the exact same variables, it can be helpful to combine them with set operations.

    * union()
    * intersect()
    * setdiff()
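
A minimal sketch of the three operations on two hypothetical single-column tibbles, a and b:

```r
library(dplyr)
library(tibble)

# Hypothetical single-column data sets
a <- tibble(song = c("Dream On", "Mama Kin", "Make It"))
b <- tibble(song = c("Dream On", "Mama Kin", "Last Child"))

union(a, b)      # rows in a, in b, or in both, without duplicates (4 rows)
intersect(a, b)  # rows in both a and b (2 rows)
setdiff(a, b)    # rows in a but not in b (1 row: "Make It")
```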

### How many songs are there?

We have loaded two datasets in your workspace, aerosmith and greatest_hits, each of which represents an album from the band Aerosmith. Each row in either of the datasets is a song on that album.

How many unique songs do these two albums contain in total?

* Use a set operation to create a dataset with every song contained on aerosmith and/or greatest_hits.
* Use nrow() to count the total number of songs.

In [52]:
aerosmith_df <- data.frame(song = c("Make It", "Somebody", "Dream On", "One Way Street",
                                   "Mama Kin", "Write me a Letter", "Moving Out", "Walking the Dog"),
                          length = c(13260, 13500, 16080, 25200, 15900, 15060, 18180, 11520))
aerosmith <- as_tibble(aerosmith_df)

greatest_hits_df <- data.frame(song = c("Dream On", "Mama Kin", "Same Old Song and Dance", 
                                       "Seasons of Winter", "Sweet Emotion", "Walk this Way",
                                       "Big Ten Inch Record", "Last Child", "Back in the Saddle",
                                       "Draw the Line", "Kings and Queens", "Come Together",
                                        "Remeber (Walking in the Sand)", "Lightning Strikes",
                                        "Chip Away the Stone", "Sweet Emotion (remix)", "One Way Street"),
                              length = c(16080, 16020, 11040, 17820, 11700, 12780, 8100, 12480, 16860,
                                        12240, 13680, 13620, 14700, 16080, 14460, 16560, 24000))
greatest_hits <- as_tibble(greatest_hits_df)

In [54]:
aerosmith %>%
    # Create the new dataset using a set operation
    union(greatest_hits) %>%
    # Count the total number of songs
    nrow()

"Column `song` joining factors with different levels, coercing to character vector"

### Greatest hits

Which songs from Aerosmith made it onto Greatest Hits?

In [56]:
# Create the new dataset using a set operation
aerosmith %>% 
  intersect(greatest_hits)

"Column `song` joining factors with different levels, coercing to character vector"

song,length
Dream On,16080

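Only Dream On is returned above, although Mama Kin and One Way Street are also entered in both data frames. Set operations compare entire rows, and the length values for those two songs differ between aerosmith and greatest_hits. A sketch of the effect and two possible workarounds, using the hypothetical miniature tibbles album_a and album_b:

```r
library(dplyr)
library(tibble)

# Miniature, hypothetical versions of the two albums; "Mama Kin" has a
# different length in each, mirroring the data above
album_a <- tibble(song = c("Dream On", "Mama Kin"), length = c(16080, 15900))
album_b <- tibble(song = c("Dream On", "Mama Kin"), length = c(16080, 16020))

# Whole rows are compared, so the differing length hides the shared song
whole_rows <- intersect(album_a, album_b)
whole_rows$song   # "Dream On"

# Comparing on the song column alone finds both
songs_only <- intersect(select(album_a, song), select(album_b, song))
songs_only$song   # "Dream On" "Mama Kin"

# A semi-join on the key also finds both and keeps album_a's other columns
keyed <- semi_join(album_a, album_b, by = "song")
```

Which variant is appropriate depends on whether a differing length should count as a different song.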

### Live! Bootleg songs

In [57]:
live_df <- data.frame(song = c("Back in the Saddle", "Sweet Emotion", "Lord of the Things",
                               "Toys in the Attic", "Last Child", "Come Together", "Walk this Way",
                               "Sick as a Dog", "Dream On", "Chip Away the Stone", "Sight for Sore Eyes",
                               "Mama Kin", "S.O.S (Too Bad)", "I Ain't Got You", "Mother Popconr/Draw the Line",
                              "Train Kept A-Rollin'/Strangers in the Night"),
                      length = c(15900, 16920, 26280, 13500, 12240, 17460, 13560, 16920, 16260, 15120,
                                 11880, 13380, 9960, 14220, 41700, 17460))

live <- as_tibble(live_df)

In [58]:
# Select the song names from live
live_songs <- live %>% select(song)

# Select the song names from greatest_hits
greatest_songs <- greatest_hits %>% select(song)

# Create the new dataset using a set operation
live_songs %>% 
  setdiff(greatest_songs)

"Column `song` joining factors with different levels, coercing to character vector"

song
Lord of the Things
Toys in the Attic
Sick as a Dog
Sight for Sore Eyes
S.O.S (Too Bad)
I Ain't Got You
Mother Popconr/Draw the Line
Train Kept A-Rollin'/Strangers in the Night


### Multiple operations

Can you think of a combination that would answer the question, "Which songs appear on one of Live! Bootleg or Greatest Hits, but not both?"

* Select the songs from the live and greatest_hits datasets and call them live_songs and greatest_songs, respectively. Use the select() function to do this.
* Combine setdiff(), union(), and intersect() to return all of the songs that are in one of live_songs or greatest_songs, but not both. You will need to use all three functions and save some results along the way (i.e. you won't be able to do this with a single pipe.)

In [60]:
# Select songs from live and greatest_hits
live_songs <- live %>% select(song)
greatest_songs <- greatest_hits %>% select(song)

# Return the songs that only exist in one dataset
union_songs <- live_songs %>% union(greatest_songs)
intersect_songs <- live_songs %>% intersect(greatest_songs)
union_songs %>% setdiff(intersect_songs)

"Column `song` joining factors with different levels, coercing to character vector"

song
One Way Street
Kings and Queens
Seasons of Winter
Mother Popconr/Draw the Line
Lord of the Things
Lightning Strikes
Sick as a Dog
Toys in the Attic
Remeber (Walking in the Sand)
S.O.S (Too Bad)


## Comparing datasets

* setequal(set1, set2) - TRUE when both data sets contain the same rows; the order does not matter
* identical(set1, set2) - TRUE only when the rows also appear in the same order
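
A minimal sketch of the difference, using two hypothetical tibbles that hold the same rows in a different order:

```r
library(dplyr)
library(tibble)

# Hypothetical data: same rows, different order
set1 <- tibble(song = c("Black Dog", "Rock and Roll"))
set2 <- tibble(song = c("Rock and Roll", "Black Dog"))

identical(set1, set2)  # FALSE: row order differs
setequal(set1, set2)   # TRUE: same rows, order ignored
```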

### Mutating joins
* left_join()
* right_join()
* inner_join()
* full_join()

### Filtering joins
* semi_join()
* anti_join()

### Set operations
* union()
* intersect()
* setdiff()

### Comparisons
* setequal()

In [None]:
# Check if same order: definitive and complete
identical(definitive, complete)

# Check if any order: definitive and complete
setequal(definitive, complete)

# Songs in definitive but not complete
setdiff(definitive, complete)


# Songs in complete but not definitive
setdiff(complete, definitive)

### Apply setequal again

A few exercises ago, you saw that an intersect() is analogous to a semi_join() when two datasets contain the same variables and each variable is used in the key.

Under these conditions, setdiff() is also analogous to one of the filtering joins.

* Write a filtering join that returns songs in definitive that are not in complete. Are there any?
* Write a filtering join that returns songs in complete that are not in definitive. Are there any?

In [None]:
# Return songs in definitive that are not in complete
definitive %>% 
  anti_join(complete)

# Return songs in complete that are not in definitive
complete %>% 
  anti_join(definitive)
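
Because definitive and complete are not defined in this notebook, the analogy can be checked on made-up data. The sketch below uses the hypothetical tibbles x and y, assumes neither contains duplicate rows, and uses every column as the key:

```r
library(dplyr)
library(tibble)

# Hypothetical data sets with identical columns
x <- tibble(song = c("a", "b", "c"), year = c(1970, 1971, 1972))
y <- tibble(song = c("a", "b"),      year = c(1970, 1971))

by_setdiff <- setdiff(x, y)
by_anti    <- anti_join(x, y, by = c("song", "year"))

setequal(by_setdiff, by_anti)  # TRUE
```

Naming the key columns in by avoids the message that anti_join() prints when it has to guess them.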

### Comparing albums

It appears that The Definitive Collection contains songs from the soundtrack of The Song Remains the Same, a movie filmed during a live Led Zeppelin concert. Is this the only difference between The Definitive Collection and The Complete Studio Recordings?

The songs from The Song Remains the Same are contained in soundtrack.

* Use identical() to check if definitive and the union of complete and soundtrack contain the same songs in the same order.
* Use setequal() to check if definitive and the union of complete and soundtrack contain the same songs in any order.

In [None]:
# Check if same order: definitive and union of complete and soundtrack
identical(definitive, union(complete, soundtrack))



# Check if any order: definitive and union of complete and soundtrack
setequal(definitive, union(complete, soundtrack))