# Grouped data

pplyr verbs are particularly powerful when you apply them to grouped data frames (DataFrameGroupBy objects). This vignette shows you:

* How to group, inspect, and ungroup with group_by() and friends.
* How individual pplyr verbs change their behaviour when applied to a grouped data frame.
* How to access data about the “current” group from within a verb.  (NOT IMPLEMENTED in pplyr)

## Imports

In [1]:
import sys
if ".." not in sys.path:
    sys.path.append("..")

import pplyr

In [2]:
import numpy as np
import pandas as pd

## Data: starwars

We'll use the same starwars data set we used in the introduction notebook.

In [3]:
starwars = pd.read_csv("../data/starwars.csv.gz")
starwars

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,feminine,,Human
83,Poe Dameron,,,brown,light,brown,,male,masculine,,Human
84,BB8,,,none,none,black,,none,masculine,,Droid
85,Captain Phasma,,,unknown,unknown,unknown,,,,,


## group_by()

The most important grouping verb is ```group_by()``` (also aliased as ```groupby()```): it takes a data frame and one or more variables to group by:

In [4]:
by_species = starwars.pipe(pplyr.group_by, "species")
by_sex_gender = starwars.pipe(pplyr.group_by, ["sex", "gender"])

Unlike dplyr, a grouped DataFrame in pandas does not print out any data:

In [5]:
by_species

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027C3C46EDD8>

But you can access the grouping information through the ```groups``` property.  This is a dictionary whose keys are the group keys and whose values are the index labels of the rows in each group.

In [6]:
by_species.groups.keys()

dict_keys(['Aleena', 'Besalisk', 'Cerean', 'Chagrian', 'Clawdite', 'Droid', 'Dug', 'Ewok', 'Geonosian', 'Gungan', 'Human', 'Hutt', 'Iktotchi', 'Kaleesh', 'Kaminoan', 'Kel Dor', 'Mirialan', 'Mon Calamari', 'Muun', 'Nautolan', 'Neimodian', "Pau'an", 'Quermian', 'Rodian', 'Skakoan', 'Sullustan', 'Tholothian', 'Togruta', 'Toong', 'Toydarian', 'Trandoshan', "Twi'lek", 'Vulptereen', 'Wookiee', 'Xexto', "Yoda's species", 'Zabrak'])
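To make the shape of that dictionary concrete, here is a minimal sketch in plain pandas. It assumes that group_by() hands back an ordinary pandas DataFrameGroupBy, as the printed object above suggests; the `toy` frame is a hypothetical cut-down copy of the first five starwars rows.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows, three columns only.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

grouped = toy.groupby("species")

# Each group key maps to the index labels of the rows in that group.
group_rows = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_rows)

# get_group() uses those labels to pull out one group's rows.
droids = grouped.get_group("Droid")
print(droids)
```

The values are index labels rather than row positions, so they follow whatever index the original DataFrame carries.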

You can also use ```ungroup()``` to remove the grouping from the DataFrame.  This just returns the internal ```obj``` property that has the original, ungrouped DataFrame.

In [7]:
by_species.pipe(pplyr.ungroup).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19.0,female,feminine,Alderaan,Human
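The same idea can be sketched directly in pandas. This assumes only what the text above states, namely that the grouped object keeps the original frame in its ```obj``` attribute; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a grouping column.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2"],
    "species": ["Human", "Droid", "Droid"],
})

grouped = toy.groupby("species")

# The grouped object keeps a reference to the frame it was built from.
ungrouped = grouped.obj
print(type(ungrouped).__name__)
print(ungrouped)
```

Because the original frame is returned as it was, rows come back in their original order, not in group order.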


Or use ```tally()``` to count the number of rows in each group. The ```sort``` argument is useful if you want to see the largest groups up front.

In [8]:
by_species.pipe(pplyr.tally).head()

Unnamed: 0,species,n
0,Aleena,1
1,Besalisk,1
2,Cerean,1
3,Chagrian,1
4,Clawdite,1


In [9]:
by_sex_gender.pipe(pplyr.tally, sort = True).head()

Unnamed: 0,sex,gender,n
2,male,masculine,60
0,female,feminine,16
4,none,masculine,5
5,,,4
1,hermaphroditic,masculine,1
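For readers who know pandas better than dplyr, a tally can be approximated with ```size()```. This is a sketch of an equivalent and not necessarily how pplyr implements tally(); the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: species of the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
})

# Count rows per group and expose the count as a column named "n".
counts = toy.groupby("species").size().reset_index(name="n")

# Sorting afterwards puts the largest groups first and keeps the old index,
# which is why the sorted output above shows index labels out of order.
sorted_counts = counts.sort_values("n", ascending=False)
print(sorted_counts)
```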


While our ```group_by()``` function doesn't allow new variables to be created within it, you can define new variables with ```mutate()``` first and then group by these.  The example below calculates a BMI and divides it into different ranges before tallying the results.

In [10]:
bmi_bins = [0, 18.5, 25, 30, np.inf]

starwars.pipe(pplyr.pipeline()
              .mutate(
                  bmi = lambda x: x.mass / (x.height / 100).pow(2),
                  bmi_cat = lambda x: pd.cut(x.bmi, bins=bmi_bins)
              ).group_by("bmi_cat")
              .tally()
)

Unnamed: 0,bmi_cat,n
0,"(0.0, 18.5]",10
1,"(18.5, 25.0]",24
2,"(25.0, 30.0]",13
3,"(30.0, inf]",12
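The interval labels in that output come from ```pd.cut()```, whose bins are closed on the right by default. The sketch below reproduces the two mutate steps with pandas' own ```assign()``` on a hypothetical `toy` frame built from the first five starwars rows, and checks where a boundary value lands.

```python
import numpy as np
import pandas as pd

bmi_bins = [0, 18.5, 25, 30, np.inf]

# Hypothetical toy frame: height (cm) and mass (kg) of the first five rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
    "mass": [77.0, 75.0, 32.0, 136.0, 49.0],
})

# Later keyword arguments can refer to columns created by earlier ones.
toy = toy.assign(
    bmi=lambda x: x.mass / (x.height / 100).pow(2),
    bmi_cat=lambda x: pd.cut(x.bmi, bins=bmi_bins),
)
print(toy[["name", "bmi", "bmi_cat"]])

# A BMI of exactly 25 belongs to the bin that ends at 25, not the next one.
edge_case = pd.cut(pd.Series([25.0]), bins=bmi_bins).iloc[0]
print(edge_case)
```

Rows with a missing height or mass get a missing BMI and therefore no category.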


## Group Metadata

TODO

## Verbs

The following sections describe how grouping affects the main pplyr verbs.

### summarise()

```summarise()``` computes a summary for each group. This means that it starts from the group keys (DataFrameGroupBy.groups.keys()), adding summary variables to the right-hand side:

In [11]:
by_species.pipe(pplyr.summarise,
    n = lambda x: len(x),
    height = lambda x: x.height.mean()
  ).head()

Unnamed: 0_level_0,n,height
species,Unnamed: 1_level_1,Unnamed: 2_level_1
Aleena,1,79.0
Besalisk,1,198.0
Cerean,1,198.0
Chagrian,1,196.0
Clawdite,1,168.0


NOTE: Difference from dplyr functionality!  dplyr has a convention of removing the last grouping key from a grouped DataFrame when summarise() returns a result.  We do not follow that convention.  Instead, we always return an ungrouped DataFrame.
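For comparison, a similar per-group summary can be written in plain pandas with named aggregation. This is a sketch of an equivalent, not a description of pplyr's internals; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

# One output row per group: "size" counts rows, "mean" averages height.
summary = (
    toy.groupby("species")
    .agg(n=("height", "size"), height=("height", "mean"))
    .reset_index()  # turn the group key back into an ordinary column
)
print(summary)
```

The result is a regular DataFrame, so any further grouping has to be requested explicitly.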

### ```select()```, ```rename()```, and ```relocate()```

These verbs operate in the same way as they do on ungrouped DataFrames.  Be careful not to drop the columns that are part of the grouping, or an error will be raised.

NOTE: relocate() is not implemented by pplyr

### ```arrange()```

A grouped arrange() sorts the rows within each group and then combines the results.

NOTE: Difference from dplyr functionality!  By default, dplyr applies arrange() to the entire DataFrame and ignores the grouping.  Our implementation behaves like dplyr's arrange() with the parameter ```.by_group = TRUE```.  (If you sort the entire DataFrame first and then group it, the rows within each group are sorted as well.)

In [12]:
by_species.pipe(pplyr.arrange, "mass", ascending=False) \
          .pipe(pplyr.slice, 0) \
          .head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Ratts Tyerell,79.0,15.0,none,"grey, blue",unknown,,male,masculine,Aleen Minor,Aleena
1,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,masculine,Ojom,Besalisk
2,Ki-Adi-Mundi,198.0,82.0,white,pale,yellow,92.0,male,masculine,Cerea,Cerean
3,Mas Amedda,196.0,,none,blue,blue,,male,masculine,Champala,Chagrian
4,Zam Wesell,168.0,55.0,blonde,"fair, green, yellow",yellow,,female,feminine,Zolan,Clawdite
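A within-group sort followed by taking the first row can be sketched in plain pandas by sorting on the group key first and the value second. This is an assumed equivalent for illustration, using a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "mass": [77.0, 75.0, 32.0, 136.0, 49.0],
})

# Sort by group key, then by mass (heaviest first) inside each group.
arranged = toy.sort_values(["species", "mass"], ascending=[True, False])

# The first row of each group is now its heaviest member.
heaviest = arranged.groupby("species").head(1)
print(heaviest)
```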


### ```mutate()``` and ```transmute()```

In simple cases with vectorised functions, grouped and ungrouped mutate() give the same results. They differ when used with summary functions:

In [13]:
# Subtract off global mean
starwars.pipe(pplyr.pipeline()
        .select(["name","homeworld","mass"])
        .mutate(
            standard_mass = lambda x: x.mass - x.mass.mean()
        )
).head()

Unnamed: 0,name,homeworld,mass,standard_mass
0,Luke Skywalker,Tatooine,77.0,-20.311864
1,C-3PO,Tatooine,75.0,-22.311864
2,R2-D2,Naboo,32.0,-65.311864
3,Darth Vader,Tatooine,136.0,38.688136
4,Leia Organa,Alderaan,49.0,-48.311864


In [14]:
# Subtract off homeworld mean
starwars.pipe(pplyr.pipeline() 
        .select(["name", "homeworld", "mass"])
        .group_by("homeworld", sort=False)
        .mutate(
            standard_mass = lambda x: x.mass - x.mass.mean()
        )
        .ungroup()
).head()

Unnamed: 0,name,homeworld,mass,standard_mass
0,Luke Skywalker,Tatooine,77.0,-8.375
1,C-3PO,Tatooine,75.0,-10.375
2,Darth Vader,Tatooine,136.0,50.625
3,Owen Lars,Tatooine,120.0,34.625
4,Beru Whitesun lars,Tatooine,75.0,-10.375
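The difference between the two cells comes down to which mean is subtracted. A minimal pandas sketch, using ```transform()``` as an assumed stand-in for the grouped mutate() and a hypothetical `toy` frame, makes this explicit.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "homeworld": ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan"],
    "mass": [77.0, 75.0, 32.0, 136.0, 49.0],
})

# Ungrouped: one mean for the whole column.
global_std = toy["mass"] - toy["mass"].mean()

# Grouped: transform() broadcasts each homeworld's mean back to its rows.
world_mean = toy.groupby("homeworld")["mass"].transform("mean")
by_world = toy["mass"] - world_mean
print(pd.DataFrame({"name": toy["name"], "global": global_std, "by_world": by_world}))
```

Unlike the grouped output above, ```transform()``` keeps the rows in their original order.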


Or with window functions like min_rank():

TODO: min_rank() is not implemented in pplyr

### ```filter()```

A grouped filter() effectively does a mutate() to generate a logical variable, and then keeps only the rows where that variable is True. This means that grouped filters can be used with summary functions. For example, we can find the tallest character of each species:

In [15]:
by_species.pipe(pplyr.pipeline()
  .select(["name", "species", "height"])
  .filter(lambda x: x.height == x.height.max())
  .ungroup()
).head()

Unnamed: 0,name,species,height
0,Ratts Tyerell,Aleena,79.0
1,Dexter Jettster,Besalisk,198.0
2,Ki-Adi-Mundi,Cerean,198.0
3,Mas Amedda,Chagrian,196.0
4,Zam Wesell,Clawdite,168.0
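The "mutate, then keep the True rows" description maps onto a boolean mask in plain pandas. The sketch below is an assumed equivalent on a hypothetical `toy` frame, not pplyr's own code.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

# Step 1 (the implicit mutate): compare each height with its group's maximum.
is_tallest = toy["height"] == toy.groupby("species")["height"].transform("max")

# Step 2 (the filter): keep only the rows where the mask is True.
tallest = toy[is_tallest]
print(tallest)
```

If two members of a group tie for the maximum, both rows are kept.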


You can also use filter() to remove entire groups. For example, the following code eliminates all groups that only have a single member:

In [16]:
by_species.pipe(pplyr.pipeline()
  .filter(lambda x: len(x) != 1)
  .tally()
)

Unnamed: 0,species,n
0,Droid,6
1,Gungan,3
2,Human,35
3,Kaminoan,2
4,Mirialan,2
5,Twi'lek,2
6,Wookiee,2
7,Zabrak,2
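Dropping whole groups has a close counterpart in pandas' own ```DataFrameGroupBy.filter()```, where the function returns a single True or False per group. The sketch uses a hypothetical `toy` frame grouped by homeworld, since both species in it have more than one member.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "homeworld": ["Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan"],
})

# Keep only the groups with more than one member; the lambda sees one
# sub-DataFrame per group and must return a single boolean.
multi_member = toy.groupby("homeworld").filter(lambda g: len(g) != 1)
print(multi_member)
```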


### ```slice()``` and friends

```slice()``` and friends (```slice_head()```, ```slice_tail()```, ```slice_sample()```, ```slice_min()``` and ```slice_max()```) select rows within a group. For example, we can select the first observation within each species:

In [17]:
by_species.pipe(pplyr.slice, 0).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Ratts Tyerell,79.0,15.0,none,"grey, blue",unknown,,male,masculine,Aleen Minor,Aleena
1,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,masculine,Ojom,Besalisk
2,Ki-Adi-Mundi,198.0,82.0,white,pale,yellow,92.0,male,masculine,Cerea,Cerean
3,Mas Amedda,196.0,,none,blue,blue,,male,masculine,Champala,Chagrian
4,Zam Wesell,168.0,55.0,blonde,"fair, green, yellow",yellow,,female,feminine,Zolan,Clawdite
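In plain pandas, picking the first row of each group can be sketched with ```head(1)``` on the grouped object. This is an assumed analogue for illustration, on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
})

# head(1) keeps the first row encountered for each species.
first_rows = toy.groupby("species").head(1)
print(first_rows)
```

Note that ```head()``` returns rows in their original order, whereas the output above is ordered by group.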


Similarly, we can use ```slice_min()``` to select the smallest n values of a variable:

In [19]:
by_species.pipe(pplyr.slice_min, "height", n=2).head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
0,Ratts Tyerell,79.0,15.0,none,"grey, blue",unknown,,male,masculine,Aleen Minor,Aleena
1,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,masculine,Ojom,Besalisk
2,Ki-Adi-Mundi,198.0,82.0,white,pale,yellow,92.0,male,masculine,Cerea,Cerean
3,Mas Amedda,196.0,,none,blue,blue,,male,masculine,Champala,Chagrian
4,Zam Wesell,168.0,55.0,blonde,"fair, green, yellow",yellow,,female,feminine,Zolan,Clawdite
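One way to picture what a grouped "smallest n" selection does is to sort by the variable and then take the first n rows per group. The pandas sketch below follows that recipe on a hypothetical `toy` frame; whether pplyr works this way internally is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
    "height": [172.0, 167.0, 96.0, 202.0, 150.0],
})

# Sort ascending by height, then keep up to two rows per species.
shortest_two = toy.sort_values("height").groupby("species").head(2)
print(shortest_two)
```

Groups with fewer than n members simply contribute all of their rows, which is why single-member species appear once in the output above.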


## Computing on grouping information

We do not currently implement cur_group() or cur_group_id().

TODO?
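Until such helpers exist, a group identifier can be obtained from pandas itself with ```ngroup()```. This is a workaround sketch in plain pandas on a hypothetical `toy` frame, not a pplyr feature.

```python
import pandas as pd

# Hypothetical toy frame: the first five starwars rows.
toy = pd.DataFrame({
    "name": ["Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa"],
    "species": ["Human", "Droid", "Droid", "Human", "Human"],
})

# ngroup() numbers the groups from 0 in sorted key order and returns
# one identifier per row, aligned with the original index.
with_id = toy.assign(group_id=toy.groupby("species").ngroup())
print(with_id)
```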