# Lecture 12 – Grouping and Pivoting

## Data 6, Fall 2024

In [1]:
from datascience import *
import numpy as np
%matplotlib inline
Table.interactive_plots() 

## Grouping with `.group`

In data science, "grouping" usually refers to aggregating data. When we group a table, we summarize our data at a higher unit of analysis (e.g., at the city or state level, as opposed to the individual level).

Returning to the `top_10` dataset from last week (the songs in the Spotify Top 10), we can use `tbl.group()` to count how many Top 10 songs each artist has.

In [2]:
top_10 = Table.read_table('data/regional-global-daily-latest.csv').take(np.arange(10))
top_10

rank,artist_names,track_name,source,peak_rank,previous_rank,days_on_chart,streams
1,Kate Bush,Running Up That Hill,Rhino,1,1,41,7208654
2,Harry Styles,As It Was,Columbia,1,2,99,6543793
3,Joji,Glimpse of Us,88rising Music/Warner Records,1,3,28,5492997
4,"Bad Bunny, Chencho Corleone",Me Porto Bonito,Rimas Entertainment LLC,2,4,63,5416421
5,"Bizarrap, Quevedo","Quevedo: Bzrp Music Sessions, Vol. 52",DALE PLAY Records,5,-1,1,4676471
6,Bad Bunny,Tití Me Preguntó,Rimas Entertainment LLC,4,5,63,4549682
7,"Bad Bunny, Bomba Estéreo",Ojitos Lindos,Rimas Entertainment LLC,2,6,63,4144625
8,Bad Bunny,Efecto,Rimas Entertainment LLC,7,8,63,3722317
9,"Charlie Puth, BTS, Jung Kook",Left and Right (Feat. Jung Kook of BTS),Atlantic Records,3,7,14,3715689
10,Bad Bunny,Moscow Mule,Rimas Entertainment LLC,1,9,63,3465003


By default, `tbl.group(column)` counts the number of occurrences of each unique value in `column`.

In [3]:
... # Group `top_10` by artist name and then sort by count

Ellipsis

Notice that Bad Bunny's name appears 5 times in the Top 10, but the "Bad Bunny" row in the grouped table only has a count of 3. This is because `.group` looks for an **exact match** on the full value in the column. The songs where Bad Bunny appears along with other artists are counted separately.
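To make the exact-match behavior concrete, here is a minimal plain-Python sketch. It uses `collections.Counter` as a stand-in for counting by group (an assumption made for illustration, not how the `datascience` library is implemented), with the `artist_names` values copied from the ten rows shown above.

```python
from collections import Counter

# artist_names values copied from the ten rows of `top_10`
artist_names = [
    "Kate Bush",
    "Harry Styles",
    "Joji",
    "Bad Bunny, Chencho Corleone",
    "Bizarrap, Quevedo",
    "Bad Bunny",
    "Bad Bunny, Bomba Estéreo",
    "Bad Bunny",
    "Charlie Puth, BTS, Jung Kook",
    "Bad Bunny",
]

# Counting whole strings mirrors exact-match grouping:
# "Bad Bunny, Bomba Estéreo" is a different value from "Bad Bunny"
exact_counts = Counter(artist_names)
print(exact_counts["Bad Bunny"])  # 3

# Counting rows whose string merely contains "Bad Bunny"
appearances = sum("Bad Bunny" in name for name in artist_names)
print(appearances)  # 5
```

The gap between 3 and 5 comes entirely from the collaboration rows, which are separate values as far as an exact match is concerned.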

### Quick Check (not mandatory for participation credit today)

In [4]:
streams = Table.read_table('data/regional-global-daily-latest.csv')
streams

rank,artist_names,track_name,source,peak_rank,previous_rank,days_on_chart,streams
1,Kate Bush,Running Up That Hill,Rhino,1,1,41,7208654
2,Harry Styles,As It Was,Columbia,1,2,99,6543793
3,Joji,Glimpse of Us,88rising Music/Warner Records,1,3,28,5492997
4,"Bad Bunny, Chencho Corleone",Me Porto Bonito,Rimas Entertainment LLC,2,4,63,5416421
5,"Bizarrap, Quevedo","Quevedo: Bzrp Music Sessions, Vol. 52",DALE PLAY Records,5,-1,1,4676471
6,Bad Bunny,Tití Me Preguntó,Rimas Entertainment LLC,4,5,63,4549682
7,"Bad Bunny, Bomba Estéreo",Ojitos Lindos,Rimas Entertainment LLC,2,6,63,4144625
8,Bad Bunny,Efecto,Rimas Entertainment LLC,7,8,63,3722317
9,"Charlie Puth, BTS, Jung Kook",Left and Right (Feat. Jung Kook of BTS),Atlantic Records,3,7,14,3715689
10,Bad Bunny,Moscow Mule,Rimas Entertainment LLC,1,9,63,3465003


Using the `streams` table, fill in the blanks to create the “Top 10 Artists” bar chart, which shows the 10 artists with the most songs in the Spotify Daily Top 200.

In [5]:
top_10_artists = streams.group(...).sort(..., descending=...).take(np.arange(...))
top_10_artists.barh(...)

TypeError: object of type 'ellipsis' has no len()

## Advanced Grouping

For the rest of today's lecture, we will use the `cars` table, which contains specifications for a variety of car models.

In [None]:
cars = Table.read_table('data/models-2021.csv')
cars

A few notes:
* `Manufacturer` is the company that owns the `Brand`.
    * For example, General Motors (GM) owns Buick, Cadillac, Chevrolet, and GMC.
* `Displacement` is the engine size in liters.
* `MPG` is miles per gallon.


Here we'll take a subset of the rows and columns for illustration.

In [None]:
# Keep only GM models, pick four columns, then select and reorder a few rows
gm = (
    cars.where('Manufacturer', 'General Motors')
    .select('Brand', 'Model', 'Cylinders', 'MPG')
    .take([0, 1, 9, 16, 20, 30, 31, 35, -1])
    .take([1, 2, 4, 8, 5, 6, 3, 7, 0])
)
gm

### Default Behavior

We have already seen how we can group on a single variable/column.

In [None]:
... # Group `gm` by Brand

In [None]:
... # Group `gm` by number of Cylinders

In [None]:
cars

In [None]:
cars.group('Brand')

In [None]:
# shuffles the rows in the table; returns a new table
cars.shuffle()

In [None]:
# `shuffle` does not modify `cars`, so group the shuffled copy directly
cars.shuffle().group('Brand')

Note that the original order of the rows doesn't matter. The resulting table is always sorted by the values in the grouping column (alphabetically, for text).
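The same idea can be checked in plain Python. This is only a sketch: the brand list is hypothetical, and `grouped_counts` is a helper name invented here to imitate "count per group, sorted by group".

```python
import random
from collections import Counter

# Hypothetical Brand values, invented for illustration
brands = ["GMC", "Buick", "Chevrolet", "Buick", "Cadillac", "Chevrolet"]

# Make a shuffled copy; the original list is left untouched
shuffled = brands[:]
random.shuffle(shuffled)

def grouped_counts(values):
    """Count each unique value and return (value, count) pairs in sorted order."""
    return sorted(Counter(values).items())

print(grouped_counts(brands))
print(grouped_counts(brands) == grouped_counts(shuffled))  # True
```

Because the counts are sorted by group label at the end, any ordering of the input produces the same result.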

### Specifying a `collect` function

We can also use `.group` to compute other aggregate statistics about categories. We do this by specifying a second argument: `collect`. The `collect` argument must be a function (e.g. `len`, `min`, `np.mean`, etc.) that takes the values in a group and returns a single result.
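As a rough mental model, grouping with a `collect` function can be thought of as "split the values into buckets by group, then call the function once per bucket". The sketch below shows that idea in plain Python. The rows are hypothetical, and `group_collect` is a helper written for this illustration, not part of the `datascience` library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (brand, mpg) rows, invented for illustration only
rows = [("A", 20), ("B", 30), ("A", 24), ("B", 26), ("C", 18)]

def group_collect(pairs, collect):
    """Bucket values by key, then apply `collect` to each bucket."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    # Visit the keys in sorted order so the result is alphabetical
    return {key: collect(buckets[key]) for key in sorted(buckets)}

means = group_collect(rows, mean)
counts = group_collect(rows, len)
print(means)   # {'A': 22, 'B': 28, 'C': 18}
print(counts)  # {'A': 2, 'B': 2, 'C': 1}
```

Swapping `mean` for `len`, `min`, or `max` changes only the final step; the bucketing stays the same.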

In [None]:
... # Group `gm` by Brand and use np.mean as the collect function

How does this work under the hood?

In [None]:
gm.where('Brand', 'Buick')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Buick').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Buick').column('MPG').mean())

In [None]:
gm.where('Brand', 'Cadillac')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Cadillac').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Cadillac').column('MPG').mean())

In [None]:
gm.where('Brand', 'Chevrolet')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Chevrolet').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Chevrolet').column('MPG').mean())

In [None]:
gm.where('Brand', 'GMC')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'GMC').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'GMC').column('MPG').mean())

If you want a more concise way of doing the above:

In [None]:
# Just run this cell; you'll learn how to write for loops next week
for brand in np.unique(gm.column('Brand')):
    brand_only = gm.where('Brand', brand)
    print(brand)
    print('mean of Cylinders: ', brand_only.column('Cylinders').mean())
    print('mean of MPG: ', brand_only.column('MPG').mean())
    print('\n')

What if we use other `collect` functions?

In [None]:
gm

In [None]:
gm.group('Brand', sum)

In [None]:
gm.group('Brand', list)

In [None]:
gm.group('Brand', len)

In [None]:
gm.group('Brand', max)

### Grouping by Multiple Columns

We can also group by unique combinations of multiple variables. Passing a list or array of column names as the first argument to `.group` creates one row for each unique combination of values in those columns that appears in the original table.
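One way to picture this: each row contributes a tuple of values, and the tuples are counted just as single values were before. The plain-Python sketch below assumes a few hypothetical rows (the brand ownership comes from the notes above, but the number of rows per brand is made up).

```python
from collections import Counter

# Hypothetical (Manufacturer, Brand) pairs; the row counts are invented
rows = [
    ("General Motors", "Buick"),
    ("General Motors", "Chevrolet"),
    ("General Motors", "Chevrolet"),
    ("General Motors", "GMC"),
    ("General Motors", "Buick"),
]

# Each distinct tuple is one group
combo_counts = Counter(rows)
for combo in sorted(combo_counts):
    print(combo, combo_counts[combo])
```

Only combinations that actually occur in the data get a group, so there is no ("General Motors", "Cadillac") entry in this sketch.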

In [None]:
cars

In [None]:
... # Group `cars` by Manufacturer and Brand

In [None]:
... # Group `cars` by Manufacturer and use np.mean as the collect function

In [None]:
... # Group `cars` by Manufacturer, Brand, and Displacement

## `.pivot`

Another useful table method is `tbl.pivot()`, which can help us determine statistics for different combinations of values for two variables.

For example, what if we wanted to view the mean MPG for each combination of car brand and number of cylinders? `.pivot` allows us to do just that!

In [None]:
... # Create a pivot table showing average MPG for each combination of Brand and Cylinders

`.pivot` takes up to four arguments. The last two are optional, but they must be used together; if both are omitted, each cell holds the number of rows with that combination.
1. `columns`: The column in `tbl` whose unique values become the columns of the pivot table
2. `rows`: The column in `tbl` whose unique values become the rows of the pivot table
3. `values`: The column in `tbl` to aggregate using the `collect` function
4. `collect`: A function with which to aggregate the values in the `values` column
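To see how these four pieces fit together, here is a small plain-Python sketch of the pivoting idea. Everything in it is an assumption for illustration: the rows are hypothetical, `pivot` is a helper function defined here rather than the `datascience` method, and filling missing combinations with 0 is a choice made for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (brand, cylinders, mpg) rows, invented for illustration
rows = [
    ("A", 4, 30), ("A", 4, 34), ("A", 6, 24),
    ("B", 6, 22), ("B", 8, 16),
]

def pivot(records, collect):
    """Aggregate the third field for each (row key, column key) pair."""
    cells = defaultdict(list)
    for row_key, col_key, value in records:
        cells[(row_key, col_key)].append(value)
    row_keys = sorted({r for r, _ in cells})
    col_keys = sorted({c for _, c in cells})
    # Combinations that never occur are filled with 0 in this sketch
    return {
        r: {c: (collect(cells[(r, c)]) if (r, c) in cells else 0) for c in col_keys}
        for r in row_keys
    }

table = pivot(rows, mean)
for brand, cells_by_cylinders in table.items():
    print(brand, cells_by_cylinders)
# A {4: 32, 6: 24, 8: 0}
# B {4: 0, 6: 22, 8: 16}
```

Here the brand plays the role of `rows`, the cylinder count plays the role of `columns`, MPG is the `values` column, and `mean` is the `collect` function.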

### Quick Check 2

<img src="pivot-table.png" width="70%"/>

Fill in the blanks to create the table above, which shows, for each manufacturer and each drivetrain (`'Wheel'`), the largest number of cylinders among that manufacturer's models.

In [None]:
cars.pivot(___, ___, ___, ___) # Replace the blanks with your answers

## Demo: US R1 Universities

For our demo, we will be using a dataset of [R1 universities](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States) in the US.

In [None]:
unis = Table.read_table("data/r1_with_students.csv")
unis

If we wanted to visualize information from this table, we could try to plot all 96 universities on one bar chart, but that isn't ideal...

In [None]:
unis.sort('Number_students', descending=True).barh('University', 'Number_students')

Instead, let's group by state and find the average enrollment in each state.

In [None]:
unis.group('State', np.mean).sort('Number_students mean', descending=True).barh('State', 'Number_students mean')

We can also use a pivot table to help us generate a useful visualization:

In [None]:
unis_pivot = ... # Create a pivot table for each combination of State and type of school
unis_pivot

In [None]:
unis_pivot.barh('State')

Ta-da!