# Lecture 12 – Grouping and Pivoting

## Data 6, Fall 2024

In [1]:
from datascience import *
import numpy as np
%matplotlib inline
Table.interactive_plots() 

## Grouping with `.group`

The term "group" in data science is most commonly associated with data aggregation and disaggregation. When we "group" a table in Python, we are able to gain insights about our data at a higher unit of analysis (e.g. at a city or state level, as opposed to the individual level).

Returning to the `top_10` dataset from last week (showing the songs on the Spotify Top 10), we can utilize `tbl.group()` to count how many Top 10 songs each artist has.

In [2]:
top_10 = Table.read_table('data/regional-global-daily-latest.csv').take(np.arange(10))
top_10

rank,artist_names,track_name,source,peak_rank,previous_rank,days_on_chart,streams
1,Kate Bush,Running Up That Hill,Rhino,1,1,41,7208654
2,Harry Styles,As It Was,Columbia,1,2,99,6543793
3,Joji,Glimpse of Us,88rising Music/Warner Records,1,3,28,5492997
4,"Bad Bunny, Chencho Corleone",Me Porto Bonito,Rimas Entertainment LLC,2,4,63,5416421
5,"Bizarrap, Quevedo","Quevedo: Bzrp Music Sessions, Vol. 52",DALE PLAY Records,5,-1,1,4676471
6,Bad Bunny,Tití Me Preguntó,Rimas Entertainment LLC,4,5,63,4549682
7,"Bad Bunny, Bomba Estéreo",Ojitos Lindos,Rimas Entertainment LLC,2,6,63,4144625
8,Bad Bunny,Efecto,Rimas Entertainment LLC,7,8,63,3722317
9,"Charlie Puth, BTS, Jung Kook",Left and Right (Feat. Jung Kook of BTS),Atlantic Records,3,7,14,3715689
10,Bad Bunny,Moscow Mule,Rimas Entertainment LLC,1,9,63,3465003


By default, `tbl.group(column)` counts the number of occurrences of each unique value in `column`.

In [5]:
top_10.group("artist_names").sort("count", descending = True) # Group `top_10` by artist name and then sort by count

artist_names,count
Bad Bunny,3
"Bad Bunny, Bomba Estéreo",1
"Bad Bunny, Chencho Corleone",1
"Bizarrap, Quevedo",1
"Charlie Puth, BTS, Jung Kook",1
Harry Styles,1
Joji,1
Kate Bush,1


Notice that Bad Bunny's name appears 5 times in the Top 10, but the "Bad Bunny" row in the grouped table has a count of only 3. This is because `.group` looks for an **exact match** on the full value in `artist_names`. The songs where Bad Bunny appears along with other artists have different values in that column, so they are counted separately.
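
To make the exact-match behavior concrete, here is a minimal sketch in plain Python (it does not use the `datascience` library). The `artist_names` list is copied from the `top_10` table above; treating `", "` as the separator between collaborating artists is an assumption about this dataset.

```python
from collections import Counter

# artist_names column of `top_10`, copied from the table above
artist_names = [
    "Kate Bush",
    "Harry Styles",
    "Joji",
    "Bad Bunny, Chencho Corleone",
    "Bizarrap, Quevedo",
    "Bad Bunny",
    "Bad Bunny, Bomba Estéreo",
    "Bad Bunny",
    "Charlie Puth, BTS, Jung Kook",
    "Bad Bunny",
]

# Exact-match counting: every distinct string is its own group
exact_counts = Counter(artist_names)

# Per-artist counting: split each entry on ", " (assumed separator) first
individual_counts = Counter(
    name for entry in artist_names for name in entry.split(", ")
)

print(exact_counts["Bad Bunny"])       # 3
print(individual_counts["Bad Bunny"])  # 5
```

The first count matches the grouped table; the second matches the number of times the name appears anywhere in the column.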

### Quick Check (not mandatory for participation credit today)

In [6]:
streams = Table.read_table('data/regional-global-daily-latest.csv')
streams

rank,artist_names,track_name,source,peak_rank,previous_rank,days_on_chart,streams
1,Kate Bush,Running Up That Hill,Rhino,1,1,41,7208654
2,Harry Styles,As It Was,Columbia,1,2,99,6543793
3,Joji,Glimpse of Us,88rising Music/Warner Records,1,3,28,5492997
4,"Bad Bunny, Chencho Corleone",Me Porto Bonito,Rimas Entertainment LLC,2,4,63,5416421
5,"Bizarrap, Quevedo","Quevedo: Bzrp Music Sessions, Vol. 52",DALE PLAY Records,5,-1,1,4676471
6,Bad Bunny,Tití Me Preguntó,Rimas Entertainment LLC,4,5,63,4549682
7,"Bad Bunny, Bomba Estéreo",Ojitos Lindos,Rimas Entertainment LLC,2,6,63,4144625
8,Bad Bunny,Efecto,Rimas Entertainment LLC,7,8,63,3722317
9,"Charlie Puth, BTS, Jung Kook",Left and Right (Feat. Jung Kook of BTS),Atlantic Records,3,7,14,3715689
10,Bad Bunny,Moscow Mule,Rimas Entertainment LLC,1,9,63,3465003


Using the `streams` table, fill in the blanks to create the “Top 10 Artists” table: the 10 artists with the most songs in the Spotify Daily Top 200.

In [11]:
top_10_artists = streams.group("artist_names").sort("count", descending=True).take(np.arange(10))
top_10_artists

artist_names,count
Bad Bunny,16
Harry Styles,8
Olivia Rodrigo,5
The Weeknd,5
Ed Sheeran,4
Arctic Monkeys,3
BTS,3
Doja Cat,3
Eminem,3
Imagine Dragons,3


## Advanced Grouping

For the rest of today's lecture, we will use the `cars` table, which contains specifications for a variety of car models.

In [12]:
cars = Table.read_table('data/models-2021.csv')
cars

Manufacturer,Brand,Model,Displacement,Cylinders,MPG,Wheel
BMW,BMW,228i Gran Coupe,2,4,28,"2-Wheel Drive, Front"
BMW,BMW,228i xDrive Gran Coupe,2,4,27,All Wheel Drive
BMW,BMW,230i Convertible,2,4,27,"2-Wheel Drive, Rear"
BMW,BMW,230i Coupe,2,4,28,"2-Wheel Drive, Rear"
BMW,BMW,230i xDrive Convertible,2,4,24,All Wheel Drive
BMW,BMW,230i xDrive Coupe,2,4,24,All Wheel Drive
BMW,BMW,330i,2,4,30,"2-Wheel Drive, Rear"
BMW,BMW,330i xDrive,2,4,28,All Wheel Drive
BMW,BMW,430i Coupe,2,4,29,"2-Wheel Drive, Rear"
BMW,BMW,430i xDrive Coupe,2,4,27,All Wheel Drive


A few notes:
* `Manufacturer` is the company that owns the `Brand`.
    * For example, General Motors (GM) owns Buick, Cadillac, Chevrolet, and GMC.
* `Displacement` is the engine size in liters.
* `MPG` is miles per gallon.


Here we'll take a subset of the rows and columns for illustration.

In [14]:
gm = cars.where('Manufacturer', 'General Motors').select('Brand', 'Model', 'Cylinders', 'MPG').take([0, 1, 9, 16, 20, 30, 31, 35, -1]).take([1, 2, 4, 8, 5, 6, 3, 7, 0])
gm

Brand,Model,Cylinders,MPG
Buick,ENCLAVE FWD,6,21
Cadillac,CT4 AWD,4,26
Cadillac,XT5 AWD,4,23
GMC,YUKON XL 4WD,6,22
Chevrolet,CAMARO,4,25
Chevrolet,COLORADO 2WD,4,22
Cadillac,ESCALADE 2WD,6,23
Chevrolet,EQUINOX AWD,4,27
Buick,ENCLAVE AWD,6,20


### Default Behavior

We have already seen how we can group on a single variable/column.

In [15]:
gm.group("Brand") # Group `gm` by Brand

Brand,count
Buick,2
Cadillac,3
Chevrolet,3
GMC,1


In [16]:
gm.group("Cylinders") # Group `gm` by number of Cylinders

Cylinders,count
4,5
6,4


In [35]:
cars.group(["Manufacturer", "Brand"], np.mean)

Manufacturer,Brand,Model mean,Displacement mean,Cylinders mean,MPG mean,Wheel mean
BMW,BMW,,3.10789,6.0,22.3421,
BMW,Mini,,1.85294,3.70588,28.3529,
BMW,TOYOTA,,2.5,5.0,26.5,
FCA US LLC,ALFA ROMEO,,2.0,4.0,25.5,
FCA US LLC,Chrysler,,3.6,6.0,21.6,
FCA US LLC,Dodge,,4.93333,7.0,18.75,
FCA US LLC,FIAT,,1.3,4.0,26.0,
FCA US LLC,Jeep,,2.865,5.2,22.55,
FCA US LLC,RAM,,3.54286,6.0,21.5714,
Ferrari,"Ferrari North America, Inc.",,4.64286,9.14286,16.2857,


In [46]:
thri = Table().with_columns("Flavor", make_array("Blue", "Red", "Green", "Blue", "Blue"),
                            "Prices", make_array("1", "2", "3", "4", "5"),
                            "Orders", make_array(5, 10, 15, 20, 25))

thri.group("Flavor", len)
#thri

Flavor,Prices len,Orders len
Blue,3,3
Green,1,1
Red,1,1


In [None]:
# shuffles the rows in the table; returns a new table
cars.shuffle()

In [None]:
cars.group('Brand')

Note that the original order of the rows doesn't matter. The resulting table is always sorted by the values in the grouping column (alphabetically, for a text column such as `Brand`).
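
As a rough illustration of why row order has no effect, the sketch below counts brands in plain Python before and after shuffling. It uses the `Brand` column of the smaller `gm` table (copied from above) rather than `cars`, and `group_counts` is a hypothetical helper, not part of the `datascience` library.

```python
import random
from collections import Counter

# Brand column of `gm`, copied from the table above
brands = ["Buick", "Cadillac", "Cadillac", "GMC", "Chevrolet",
          "Chevrolet", "Cadillac", "Chevrolet", "Buick"]

def group_counts(values):
    """Count each distinct value; return (value, count) pairs sorted by value."""
    return sorted(Counter(values).items())

shuffled = brands.copy()
random.shuffle(shuffled)  # put the rows in a random order

print(group_counts(brands))
print(group_counts(shuffled) == group_counts(brands))  # True
```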

### Specifying a `collect` function

We can also use `.group` to compute other aggregate statistics for each category. We do this by specifying a second argument: `collect`. The `collect` argument must be a function (e.g. `len`, `min`, `np.mean`, etc.), and it is applied to the values of every other column within each group.

In [None]:
... # Group `gm` by Brand and use np.mean as the collect function

How does this work under the hood?

In [None]:
gm.where('Brand', 'Buick')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Buick').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Buick').column('MPG').mean())

In [None]:
gm.where('Brand', 'Cadillac')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Cadillac').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Cadillac').column('MPG').mean())

In [None]:
gm.where('Brand', 'Chevrolet')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'Chevrolet').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'Chevrolet').column('MPG').mean())

In [None]:
gm.where('Brand', 'GMC')

In [None]:
print('mean of Cylinders: ', gm.where('Brand', 'GMC').column('Cylinders').mean())
print('mean of MPG: ', gm.where('Brand', 'GMC').column('MPG').mean())

If you want a more concise way of doing the above:

In [None]:
# Just run this cell — you'll learn how to write for loops next week
for brand in np.unique(gm.column('Brand')):
    brand_only = gm.where('Brand', brand)
    print(brand)
    print('mean of Cylinders: ', brand_only.column('Cylinders').mean())
    print('mean of MPG: ', brand_only.column('MPG').mean())
    print('\n')
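
The same idea can be written as one reusable function. The following is a minimal sketch in plain Python, not the library's actual implementation: `group_collect` and `mean` are hypothetical helpers, and the rows are copied from the `gm` table above. As with the loop above, you only need to read it, not write it.

```python
# Rows of `gm` as (Brand, Cylinders, MPG), copied from the table above
gm_rows = [
    ("Buick", 6, 21),
    ("Cadillac", 4, 26),
    ("Cadillac", 4, 23),
    ("GMC", 6, 22),
    ("Chevrolet", 4, 25),
    ("Chevrolet", 4, 22),
    ("Cadillac", 6, 23),
    ("Chevrolet", 4, 27),
    ("Buick", 6, 20),
]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def group_collect(rows, collect):
    """Bucket the rows by brand, then apply `collect` to each column's values."""
    buckets = {}
    for brand, cylinders, mpg in rows:
        bucket = buckets.setdefault(brand, {"Cylinders": [], "MPG": []})
        bucket["Cylinders"].append(cylinders)
        bucket["MPG"].append(mpg)
    # One entry per brand, in sorted order, with one aggregated value per column
    return {
        brand: {column: collect(values) for column, values in columns.items()}
        for brand, columns in sorted(buckets.items())
    }

for brand, stats in group_collect(gm_rows, mean).items():
    print(brand, stats)
```

Passing a different function as `collect` changes the statistic without changing how the rows are bucketed.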

What if we use other `collect` functions?

In [None]:
gm

In [None]:
gm.group('Brand', sum)

In [None]:
gm.group('Brand', list)

In [None]:
gm.group('Brand', len)

In [None]:
gm.group('Brand', max)
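
To preview what each of these functions does to a single group, the sketch below applies them directly to the Cadillac rows of `gm` (values copied from the table above) in plain Python. It also shows why some functions cannot aggregate a text column, which is likely the reason the `Model mean` and `Wheel mean` columns are blank in the earlier `np.mean` output; that explanation is an assumption about the library's behavior, not something this notebook confirms.

```python
# Cadillac rows of `gm`, copied from the table above
cadillac_models = ["CT4 AWD", "XT5 AWD", "ESCALADE 2WD"]
cadillac_mpg = [26, 23, 23]

# Apply each collect function to the numeric MPG values of the group
for collect in (sum, max, len, list):
    print(collect.__name__, "of MPG:", collect(cadillac_mpg))

# max compares strings alphabetically, so it also works on text
print("max of Model:", max(cadillac_models))

# sum cannot add strings, so it raises a TypeError on a text column
try:
    sum(cadillac_models)
except TypeError as error:
    print("sum of Model failed:", error)
```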

### Grouping by Multiple Columns

We can also group by unique combinations of multiple variables. Passing a list (or array) of column names as the first argument to `.group` creates one row for each unique combination of values in those columns that appears in the original table.

In [18]:
cars

Manufacturer,Brand,Model,Displacement,Cylinders,MPG,Wheel
BMW,BMW,228i Gran Coupe,2,4,28,"2-Wheel Drive, Front"
BMW,BMW,228i xDrive Gran Coupe,2,4,27,All Wheel Drive
BMW,BMW,230i Convertible,2,4,27,"2-Wheel Drive, Rear"
BMW,BMW,230i Coupe,2,4,28,"2-Wheel Drive, Rear"
BMW,BMW,230i xDrive Convertible,2,4,24,All Wheel Drive
BMW,BMW,230i xDrive Coupe,2,4,24,All Wheel Drive
BMW,BMW,330i,2,4,30,"2-Wheel Drive, Rear"
BMW,BMW,330i xDrive,2,4,28,All Wheel Drive
BMW,BMW,430i Coupe,2,4,29,"2-Wheel Drive, Rear"
BMW,BMW,430i xDrive Coupe,2,4,27,All Wheel Drive


In [21]:
cars.group(['Brand', 'Cylinders']) # Group `cars` by Brand and Cylinders

Brand,Cylinders,count
ALFA ROMEO,4,4
Acura,4,9
Acura,6,1
Aston Martin Lagonda Ltd,8,4
Aston Martin Lagonda Ltd,12,2
Audi,4,16
Audi,5,1
Audi,6,14
Audi,8,6
Audi,10,4


In [None]:
... # Group `cars` by Manufacturer and use np.mean as the collect function

In [None]:
... # Group `cars` by Manufacturer, Brand, and Displacement

## `.pivot`

Another useful table method is `tbl.pivot()`, which can help us determine statistics for different combinations of values for two variables.

For example, what if we wanted to view the mean MPG for each combination of car brand and number of cylinders? `.pivot` allows us to do just that!

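`.pivot` can take up to four arguments. The last two are optional (but must be used together); without them, each cell of the pivot table simply counts the rows with that combination of values. The arguments are:
1. `columns`: The column in `tbl` whose unique values become the column labels of the pivot table
2. `rows`: The column in `tbl` whose unique values become the rows of the pivot table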
3. `values`: The column in `tbl` to aggregate using the `collect` function
4. `collect`: A function with which to aggregate the values in the `values` column
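
The four arguments can be seen at work in a small plain-Python sketch. It is not the library's implementation: `pivot` and `mean` here are hypothetical helpers, the rows are copied from the `gm` table above, and filling empty combinations with 0 is a choice made for this sketch.

```python
# Rows of `gm` as (Brand, Cylinders, MPG), copied from the table above
gm_rows = [
    ("Buick", 6, 21),
    ("Cadillac", 4, 26),
    ("Cadillac", 4, 23),
    ("GMC", 6, 22),
    ("Chevrolet", 4, 25),
    ("Chevrolet", 4, 22),
    ("Cadillac", 6, 23),
    ("Chevrolet", 4, 27),
    ("Buick", 6, 20),
]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def pivot(rows, collect):
    """One output row per brand and one output column per cylinder count."""
    # Gather the MPG values for every (brand, cylinders) combination
    cells = {}
    for brand, cylinders, mpg in rows:
        cells.setdefault((brand, cylinders), []).append(mpg)
    brands = sorted({brand for brand, _, _ in rows})
    cylinder_values = sorted({cylinders for _, cylinders, _ in rows})
    return {
        brand: {
            # Combinations with no rows are filled with 0 in this sketch
            cyl: collect(cells[(brand, cyl)]) if (brand, cyl) in cells else 0
            for cyl in cylinder_values
        }
        for brand in brands
    }

mpg_pivot = pivot(gm_rows, mean)
for brand, row in mpg_pivot.items():
    print(brand, row)
```

Here `Cylinders` plays the role of `columns`, `Brand` the role of `rows`, `MPG` the role of `values`, and `mean` the role of `collect`. Following the argument order listed above, the corresponding table call would be along the lines of `gm.pivot('Cylinders', 'Brand', 'MPG', np.mean)`.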

### Quick Check 2

<img src="pivot-table.png" width="70%"/>

Fill in the blanks to create the table above, which describes the largest number of cylinders each manufacturer makes for every possible drivetrain (`'Wheel'`).

In [None]:
cars.pivot(___, ___, ___, ___) # Replace the blanks with your answers

## Demo: US R1 Universities

For our demo, we will be using a dataset of [R1 universities](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States) in the US.

In [None]:
unis = Table.read_table("data/r1_with_students.csv")
unis

If we wanted to visualize information from this table, we could try to plot all 96 universities on one bar chart, but that isn't ideal...

In [None]:
unis.sort('Number_students', descending=True).barh('University', 'Number_students')

Instead, let's group by state and find the average enrollment in each state.

In [None]:
unis.group('State', np.mean).sort('Number_students mean', descending=True).barh('State', 'Number_students mean')

We can also use a pivot table to help us generate a useful visualization:

In [None]:
unis_pivot = ... # Create a pivot table for each combination of State and type of school
unis_pivot

In [None]:
unis_pivot.barh('State')

Ta-da!