
# CMM202 Topic 4: Joins & Aggregations

Let's create two small datasets to use as examples

In [None]:
import pandas as pd

In [None]:
books = pd.DataFrame({'Author' : ['J. R. R. Tolkien',
                                  'George R. R. Martin',
                                  'J. K. Rowling', 
                                  'Suzanne Collins']},
                     index = ['The Lord of the Rings',
                              'Game of Thrones',
                              'Harry Potter',
                              'The Hunger Games'])
books

,Author
The Lord of the Rings,J. R. R. Tolkien
Game of Thrones,George R. R. Martin
Harry Potter,J. K. Rowling
The Hunger Games,Suzanne Collins


In [None]:
films = pd.DataFrame({'Year of First Film' : [1999, 2001, 2001, 2012],
                      'Number of Films' : [3, 2, 8, 4]},
                     index = ['The Matrix',
                              'The Lord of the Rings',
                              'Harry Potter',
                              'The Hunger Games'])


films

,Year of First Film,Number of Films
The Matrix,1999,3
The Lord of the Rings,2001,2
Harry Potter,2001,8
The Hunger Games,2012,4


The `.join` method combines the columns of two data sets, matching rows on their index

In [None]:
books.join(films)

,Author,Year of First Film,Number of Films
The Lord of the Rings,J. R. R. Tolkien,2001.0,2.0
Game of Thrones,George R. R. Martin,,
Harry Potter,J. K. Rowling,2001.0,8.0
The Hunger Games,Suzanne Collins,2012.0,4.0


## Types of Join

In the previous example, the `books` data set is the *'left'* dataset and `films` is the *'right'* dataset

A left join keeps all of the data from the *'left'* data set and adds in the applicable data from the *'right'* data set where the keys match up

If there is no film, the cells are populated with `NaN` (e.g. Game of Thrones)

The above operation is equivalent to

    books.join(films, how='left')
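
As a quick check, the sketch below rebuilds the two example frames (so that it runs on its own) and compares the default join with an explicit left join. It also prints the column types, which suggests why the years appear as `2001.0` in the joined table: a column that has to hold `NaN` is stored as floats.

```python
import pandas as pd

# Rebuild the two example frames so the snippet runs on its own
books = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']},
    index=['The Lord of the Rings', 'Game of Thrones',
           'Harry Potter', 'The Hunger Games'])
films = pd.DataFrame(
    {'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]},
    index=['The Matrix', 'The Lord of the Rings',
           'Harry Potter', 'The Hunger Games'])

default_join = books.join(films)
left_join = books.join(films, how='left')

# The default join type is 'left', so both calls give the same table
print(default_join.equals(left_join))

# Integer columns become floats once a NaN has to be stored in them
print(films.dtypes)
print(left_join.dtypes)
```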

We can also do a right join. This keeps all of the films, adding the book data where applicable. If there is no book, the cells are populated with `NaN`.

In [None]:
books.join(films, how='right')

,Author,Year of First Film,Number of Films
The Matrix,,1999,3
The Lord of the Rings,J. R. R. Tolkien,2001,2
Harry Potter,J. K. Rowling,2001,8
The Hunger Games,Suzanne Collins,2012,4


This is almost equivalent to `films.join(books)` (or `films.join(books, how='left')`), with the exception of the order in which the columns appear
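
A small sketch of that comparison, with the example frames rebuilt so the snippet is self-contained. Once the columns are put in the same order, the two results should hold the same rows and values.

```python
import pandas as pd

# Rebuild the two example frames so the snippet runs on its own
books = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']},
    index=['The Lord of the Rings', 'Game of Thrones',
           'Harry Potter', 'The Hunger Games'])
films = pd.DataFrame(
    {'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]},
    index=['The Matrix', 'The Lord of the Rings',
           'Harry Potter', 'The Hunger Games'])

right = books.join(films, how='right')   # keep every film
left_swapped = films.join(books)         # same idea, films on the left

# Only the column order differs
print(list(right.columns))
print(list(left_swapped.columns))

# Reorder the columns of one result and compare the two tables
print(right[left_swapped.columns].equals(left_swapped))
```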

If we want to keep all of the data (book **OR** film), we can use an outer join

In [None]:
films.join(books, how='outer')

,Year of First Film,Number of Films,Author
Game of Thrones,,,George R. R. Martin
Harry Potter,2001.0,8.0,J. K. Rowling
The Hunger Games,2012.0,4.0,Suzanne Collins
The Lord of the Rings,2001.0,2.0,J. R. R. Tolkien
The Matrix,1999.0,3.0,


And if we only want to keep the rows that have both a book **AND** a film, we can use an inner join

In [None]:
films.join(books, how='inner')

,Year of First Film,Number of Films,Author
The Lord of the Rings,2001,2,J. R. R. Tolkien
Harry Potter,2001,8,J. K. Rowling
The Hunger Games,2012,4,Suzanne Collins


You should choose the join type based on what your resulting data table is intended to describe

For instance, the inner join gave us a table of films based on books

In contrast, the left join gave us a table of books with additional information on the films (if any)

Once again, remember that `.join` does not modify the dataframe in place, so if you want to keep the result, assign it to either the same variable or a new one with an appropriate name. Each line below is an independent alternative, not a sequence to run in order

    films = films.join(books, how='left')

    books = films.join(books, how='right')
    
    films_based_on_books = films.join(books, how='inner')
    
    favourite_series = films.join(books, how='outer')
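
The following sketch illustrates the point with the example frames (rebuilt here so it runs on its own): after calling `.join`, `books` still has only its original column, and the joined table exists only in the variable it was assigned to. The name `joined` is just an illustrative choice.

```python
import pandas as pd

# Rebuild the two example frames so the snippet runs on its own
books = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']},
    index=['The Lord of the Rings', 'Game of Thrones',
           'Harry Potter', 'The Hunger Games'])
films = pd.DataFrame(
    {'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]},
    index=['The Matrix', 'The Lord of the Rings',
           'Harry Potter', 'The Hunger Games'])

joined = books.join(films)   # the result is a new dataframe

print(list(books.columns))   # ['Author']
print(list(joined.columns))  # ['Author', 'Year of First Film', 'Number of Films']
```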

This table may be used as a reminder of the difference between the joins


| Type of Join   | Keeps Rows of Left Data | Keeps Rows of Right Data |
| :------------- | ----------------------: | -----------------------: |
| left (default) | yes                     | only if matching left    |
| right          | only if matching right  | yes                      |
| outer          | yes                     | yes                      |
| inner          | only if matching right  | only if matching left    |
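
To see the table in action, the sketch below runs all four join types on the example frames (rebuilt so the snippet is self-contained) and records how many rows each one keeps. The `row_counts` dictionary is only a helper for this illustration.

```python
import pandas as pd

# Rebuild the two example frames so the snippet runs on its own
books = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']},
    index=['The Lord of the Rings', 'Game of Thrones',
           'Harry Potter', 'The Hunger Games'])
films = pd.DataFrame(
    {'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]},
    index=['The Matrix', 'The Lord of the Rings',
           'Harry Potter', 'The Hunger Games'])

row_counts = {}
for how in ['left', 'right', 'outer', 'inner']:
    result = books.join(films, how=how)
    row_counts[how] = len(result)
    # List the titles that survive each type of join
    print(f'{how:>5}: {sorted(result.index)}')

print(row_counts)  # {'left': 4, 'right': 4, 'outer': 5, 'inner': 3}
```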

## Joining Different Columns

`.join` joins by comparing indexes of each dataframe

Sometimes the key columns are not the index (particularly if you are using the default integer index)

If you need to join based on columns other than the index, you should use `merge`

For example, it is possible that we would encounter data with default indexes as follows:

In [None]:
books = books.reset_index()
books = books.rename(columns={'index' : 'Book Series Title'})
books

,Book Series Title,Author
0,The Lord of the Rings,J. R. R. Tolkien
1,Game of Thrones,George R. R. Martin
2,Harry Potter,J. K. Rowling
3,The Hunger Games,Suzanne Collins


In [None]:
films = films.reset_index()
films = films.rename(columns={'index' : 'Film Series Title'})
films

,Film Series Title,Year of First Film,Number of Films
0,The Matrix,1999,3
1,The Lord of the Rings,2001,2
2,Harry Potter,2001,8
3,The Hunger Games,2012,4


If we join on the index, rows are paired by their position (0 to 3) rather than by title, so the result is nonsense!

In [None]:
books.join(films) # WRONG

,Book Series Title,Author,Film Series Title,Year of First Film,Number of Films
0,The Lord of the Rings,J. R. R. Tolkien,The Matrix,1999,3
1,Game of Thrones,George R. R. Martin,The Lord of the Rings,2001,2
2,Harry Potter,J. K. Rowling,Harry Potter,2001,8
3,The Hunger Games,Suzanne Collins,The Hunger Games,2012,4


We could change the `Book Series Title` and `Film Series Title` to indexes and join with `.join`
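
A minimal sketch of that first option, assuming frames with default indexes like the ones above (rebuilt here so the snippet runs on its own). The names `books_by_title` and `films_by_title` are illustrative only.

```python
import pandas as pd

# Frames with default integer indexes, as in the cells above
books = pd.DataFrame(
    {'Book Series Title': ['The Lord of the Rings', 'Game of Thrones',
                           'Harry Potter', 'The Hunger Games'],
     'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']})
films = pd.DataFrame(
    {'Film Series Title': ['The Matrix', 'The Lord of the Rings',
                           'Harry Potter', 'The Hunger Games'],
     'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]})

# Move the title columns into the index, then join on the index as before
books_by_title = books.set_index('Book Series Title')
films_by_title = films.set_index('Film Series Title')
joined = books_by_title.join(films_by_title, how='inner')
print(joined)
```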

Or we can use `.merge`, in which left, right, outer, and inner joins work the same way

However, instead of the index, we compare the columns specified with `left_on=` and `right_on=`

In [None]:
books.merge(films,
            how='inner',
            left_on='Book Series Title',
            right_on='Film Series Title')

,Book Series Title,Author,Film Series Title,Year of First Film,Number of Films
0,The Lord of the Rings,J. R. R. Tolkien,The Lord of the Rings,2001,2
1,Harry Potter,J. K. Rowling,Harry Potter,2001,8
2,The Hunger Games,Suzanne Collins,The Hunger Games,2012,4


You can use `on` instead of `left_on` & `right_on` as long as both columns have the same name!
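
As a sketch of the `on` variant, the snippet below renames both title columns to a shared name and merges on it. The name `Series Title` is an assumption made for this illustration and does not appear in the original data. Note that the key then appears only once in the result.

```python
import pandas as pd

# Frames with default integer indexes, as in the cells above
books = pd.DataFrame(
    {'Book Series Title': ['The Lord of the Rings', 'Game of Thrones',
                           'Harry Potter', 'The Hunger Games'],
     'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']})
films = pd.DataFrame(
    {'Film Series Title': ['The Matrix', 'The Lord of the Rings',
                           'Harry Potter', 'The Hunger Games'],
     'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]})

# Give the key column the same (hypothetical) name in both frames
books_same = books.rename(columns={'Book Series Title': 'Series Title'})
films_same = films.rename(columns={'Film Series Title': 'Series Title'})

# With a shared column name, `on` replaces left_on/right_on
merged = books_same.merge(films_same, how='inner', on='Series Title')
print(merged)
```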

## Data Aggregation

Before continuing, we will set the series title back as the index of `books`

In [None]:
books = books.set_index('Book Series Title', drop=True)
books

Book Series Title,Author
The Lord of the Rings,J. R. R. Tolkien
Game of Thrones,George R. R. Martin
Harry Potter,J. K. Rowling
The Hunger Games,Suzanne Collins


Let's import a dataset listing the individual books in each series

In [None]:
volumes = pd.read_csv('https://www.dropbox.com/s/9flqjjvetgbex97/volumes.csv?raw=1')
volumes

,Series,Title,Rating,Year
0,Harry Potter,Harry Potter and the Philosopher's Stone,4.47,1997
1,Harry Potter,Harry Potter and the Chamber of Secrets,4.42,1998
2,Harry Potter,Harry Potter and the Prisoner of Azkaban,4.56,1999
3,Harry Potter,Harry Potter and the Goblet of Fire,4.55,2000
4,Harry Potter,Harry Potter and the Order of the Phoenix,4.49,2003
5,Harry Potter,Harry Potter and the Half-Blood Prince,4.57,2005
6,Harry Potter,Harry Potter and the Deathly Hallows,4.61,2007
7,The Lord of the Rings,The Fellowship of the Ring,4.36,1954
8,The Lord of the Rings,The Two Towers,4.44,1954
9,The Lord of the Rings,The Return of the King,4.53,1955


We want to summarise by series. The `.groupby` method takes the name of the column that defines the groups

The result is a `DataFrameGroupBy` object which we will use for the next step

In [None]:
groups = volumes.groupby('Series')
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A12EEC6208>
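
The group object does not display its contents, but it can be inspected. The sketch below uses a small hand-typed subset of the rows shown above (not the full CSV) and loops over the groups, each of which is an ordinary dataframe holding the rows of one series.

```python
import pandas as pd

# A hand-typed subset of the rows shown above, not the full dataset
volumes = pd.DataFrame({
    'Series': ['Harry Potter', 'Harry Potter',
               'The Lord of the Rings', 'The Lord of the Rings',
               'The Lord of the Rings'],
    'Title': ["Harry Potter and the Philosopher's Stone",
              'Harry Potter and the Chamber of Secrets',
              'The Fellowship of the Ring',
              'The Two Towers',
              'The Return of the King'],
    'Rating': [4.47, 4.42, 4.36, 4.44, 4.53],
    'Year': [1997, 1998, 1954, 1954, 1955]})

groups = volumes.groupby('Series')

# Iterating yields (group name, sub-dataframe) pairs
for series_name, group in groups:
    print(series_name, '->', list(group['Title']))

# Number of rows in each group
sizes = groups.size()
print(sizes)
```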

We will use `count` and `mean` to work out the number of books and the average rating, and use `min` to work out the first publication year.

Other operations available include `sum` and `max` (you can check others [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html))

Applying these operations to the groups gives us a summary per series

In [None]:
# Aggregate only the numeric columns; a list keeps the column order fixed
summary = groups[['Rating', 'Year']].agg(['min', 'count', 'mean'])
summary

,Rating,Rating,Rating,Year,Year,Year
Series,min,count,mean,min,count,mean
Game of Thrones,4.13,5,4.372,1996,5,2002.0
Harry Potter,4.42,7,4.524286,1997,7,2001.285714
The Hunger Games,4.03,3,4.216667,2008,3,2009.0
The Lord of the Rings,4.36,3,4.443333,1954,3,1954.333333


## Joining Aggregation Data

We now have a dataframe with two sub-frames (one for `Rating` and one for `Year`), which we can easily separate

In [None]:
rating_summary = summary['Rating']
rating_summary

Series,min,count,mean
Game of Thrones,4.13,5,4.372
Harry Potter,4.42,7,4.524286
The Hunger Games,4.03,3,4.216667
The Lord of the Rings,4.36,3,4.443333


In [None]:
year_summary = summary['Year']
year_summary

Series,min,count,mean
Game of Thrones,1996,5,2002.0
Harry Potter,1997,7,2001.285714
The Hunger Games,2008,3,2009.0
The Lord of the Rings,1954,3,1954.333333


Let's rename the columns for the `Rating` aggregation, and remove the unneeded `min` column

In [None]:
rating_summary = summary['Rating'].rename(
    columns={'mean' : 'Average Rating',
             'count' : 'Number of Books'})
rating_summary = rating_summary.drop(columns=['min'])
rating_summary

Series,Number of Books,Average Rating
Game of Thrones,5,4.372
Harry Potter,7,4.524286
The Hunger Games,3,4.216667
The Lord of the Rings,3,4.443333


We may also want to round off the average ratings

In [None]:
rating_summary['Average Rating'] = rating_summary['Average Rating'].round(2)
rating_summary

Series,Number of Books,Average Rating
Game of Thrones,5,4.37
Harry Potter,7,4.52
The Hunger Games,3,4.22
The Lord of the Rings,3,4.44


Let's also rename the column from the `Year` aggregation and drop the other columns

In [None]:
year_summary = year_summary.rename(columns={'min' : 'First Published'})
year_summary = year_summary.drop(columns=['mean', 'count'])
year_summary

Series,First Published
Game of Thrones,1996
Harry Potter,1997
The Hunger Games,2008
The Lord of the Rings,1954


Now we can join the `books`, `rating_summary` and `year_summary` dataframes

Since all three share the same keys, the join type makes no difference here. `.join` defaults to a left join, so every row of `books` would be kept even if it had no match in the summaries

In [None]:
books.join(rating_summary).join(year_summary)
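
As an alternative to splitting, renaming and dropping columns, pandas also supports *named aggregation*, where each output column is given as a `(column, function)` pair. This is a sketch rather than the approach used above. It runs on only the ten rows displayed earlier (typed in by hand), so it covers two of the four series.

```python
import pandas as pd

# The ten rows displayed earlier, typed in by hand (not the full CSV)
volumes = pd.DataFrame({
    'Series': ['Harry Potter'] * 7 + ['The Lord of the Rings'] * 3,
    'Title': ["Harry Potter and the Philosopher's Stone",
              'Harry Potter and the Chamber of Secrets',
              'Harry Potter and the Prisoner of Azkaban',
              'Harry Potter and the Goblet of Fire',
              'Harry Potter and the Order of the Phoenix',
              'Harry Potter and the Half-Blood Prince',
              'Harry Potter and the Deathly Hallows',
              'The Fellowship of the Ring',
              'The Two Towers',
              'The Return of the King'],
    'Rating': [4.47, 4.42, 4.56, 4.55, 4.49, 4.57, 4.61, 4.36, 4.44, 4.53],
    'Year': [1997, 1998, 1999, 2000, 2003, 2005, 2007, 1954, 1954, 1955]})

# Output names contain spaces, so they are passed as an unpacked dictionary
flat_summary = volumes.groupby('Series').agg(
    **{'Number of Books': ('Title', 'count'),
       'Average Rating': ('Rating', 'mean'),
       'First Published': ('Year', 'min')})
flat_summary['Average Rating'] = flat_summary['Average Rating'].round(2)
print(flat_summary)
```

The result has a single level of column names, so it could be joined to `books` directly.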

## Concatenate

A related function, `pd.concat`, is mostly used to *paste* tables together, either stacking rows or placing columns side by side (it also has inner/outer options via `join=`)

You can find more info [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
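
A brief sketch of both uses, with the example frames rebuilt so the snippet runs on its own. The extra `more_books` frame is a made-up addition for illustration and is not part of the original data.

```python
import pandas as pd

# Rebuild the two example frames so the snippet runs on its own
books = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'George R. R. Martin',
                'J. K. Rowling', 'Suzanne Collins']},
    index=['The Lord of the Rings', 'Game of Thrones',
           'Harry Potter', 'The Hunger Games'])
films = pd.DataFrame(
    {'Year of First Film': [1999, 2001, 2001, 2012],
     'Number of Films': [3, 2, 8, 4]},
    index=['The Matrix', 'The Lord of the Rings',
           'Harry Potter', 'The Hunger Games'])

# Hypothetical extra row, used only to demonstrate stacking
more_books = pd.DataFrame({'Author': ['Frank Herbert']}, index=['Dune'])

# Default (axis=0): paste the rows of one table under the other
all_books = pd.concat([books, more_books])
print(all_books)

# axis=1 pastes columns side by side; join='inner' keeps shared index values
side_by_side = pd.concat([books, films], axis=1, join='inner')
print(side_by_side)
```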

# Lab