# Welcome to the Dark Art of Coding:
## Introduction to Python
Grouping/aggregating data

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Know how to group data by:
    * Dictionary
    * List
    * Columns
    * Index
* Be able to aggregate data effectively
* Make their own functions to use in aggregation

In [2]:
import pandas as pd
from pandas import DataFrame         #, Series
import numpy as np

# Groupby series
---

Let's start by defining a DataFrame with some sample data: the messages (emails and tweets) received by diana and clark on each of two days.

In [3]:
df = DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                'day1': [10, 11, 23, 23, 15],
                'day2': [14, 15, 16, 17, 21]})

df

Unnamed: 0,day1,day2,msgs,name
0,10,14,email,diana
1,11,15,tweet,diana
2,23,16,email,clark
3,23,17,tweet,clark
4,15,21,email,diana


Once we have our data, we can use the `.groupby()` method to split a column into groups based on matching values in another column. Here we group the `day1` values by `name`.

In [4]:
groups = df['day1'].groupby(df['name'])

In this case we have a `SeriesGroupBy` object

In [5]:
groups

<pandas.core.groupby.SeriesGroupBy object at 0x116be65f8>

This object stores what it needs to let us view our data group by group.

In [6]:
for group in groups:
    for item in group:
        print(item)

clark
2    23
3    23
Name: day1, dtype: int64
diana
0    10
1    11
4    15
Name: day1, dtype: int64


Once we have our groups, we can do vector math on each group individually, e.g. get every group's average or sum.

GroupBy objects already have a few methods attached to them that will do this for us.

In [10]:
print(groups.mean())


name
clark    23
diana    12
Name: day1, dtype: int64


In [12]:
print(groups.sum())

name
clark    46
diana    36
Name: day1, dtype: int64
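
To make the idea concrete, here is a minimal sketch of what `.groupby()` followed by `.sum()` does, written out by hand as split, apply, and combine steps. The variable names `manual` and `grouped` are ours, for illustration only, and the hand-written version is a teaching aid rather than how pandas implements it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'day1': [10, 11, 23, 23, 15]})

# Split: pick out the day1 values that belong to each distinct name.
# Apply: sum each piece.
# Combine: collect the per-name results into a single Series.
manual = pd.Series({name: df.loc[df['name'] == name, 'day1'].sum()
                    for name in sorted(df['name'].unique())})

# The same computation using groupby.
grouped = df['day1'].groupby(df['name']).sum()

print(manual.to_dict())
print(grouped.to_dict())
```

Both print statements should show the same totals per name.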


If we want a multilevel GroupBy object, we can group by *multiple* columns at once by giving the `.groupby()` method a list of columns.

In [13]:
means = df['day1'].groupby( [df['name'], df['msgs']] ).mean()
means

name   msgs 
clark  email    23.0
       tweet    23.0
diana  email    12.5
       tweet    11.0
Name: day1, dtype: float64

Since we have a multilevel series-like object we can use the unstack method on it to show it like a DataFrame

In [14]:
means.unstack()

msgs,email,tweet
name,Unnamed: 1_level_1,Unnamed: 2_level_1
clark,23.0,23.0
diana,12.5,11.0
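
As a small sketch of how `.unstack()` behaves: by default it moves the innermost index level (`msgs` here) into the columns, and passing `level=` lets us pick a different level. The names `wide` and `by_name` are ours, for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                   'day1': [10, 11, 23, 23, 15]})

means = df['day1'].groupby([df['name'], df['msgs']]).mean()

# Default: the innermost level ('msgs') becomes the columns.
wide = means.unstack()
print(wide.loc['diana', 'email'])

# Choosing the other level puts the names in the columns instead.
by_name = means.unstack(level='name')
print(list(by_name.columns))
```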


If the column you want to group by lives in the same DataFrame as the data, you don't need to index into the DataFrame inside `.groupby()`. You can just pass the column name as a string.

In [15]:
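# numeric_only=True skips the non-numeric 'msgs' column, which newer
# versions of pandas will not average silently.
df.groupby('name').mean(numeric_only=True)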

Unnamed: 0_level_0,day1,day2
name,Unnamed: 1_level_1,Unnamed: 2_level_1
clark,23.0,16.5
diana,12.0,16.666667


In [19]:
df.groupby(['name', 'msgs']).max()

Unnamed: 0_level_0,Unnamed: 1_level_0,day1,day2
name,msgs,Unnamed: 2_level_1,Unnamed: 3_level_1
clark,email,23,16
clark,tweet,23,17
diana,email,15,21
diana,tweet,11,15


One thing to realize is that you don't actually need to use columns from the same DataFrame. You can use any array, Series, or column that is the same length as your DataFrame or Series.

Here is an example that uses numpy arrays to group some of this data together.

In [21]:
cities = np.array(['new york', 'baltimore', 'baltimore', 'new york', 'new york'])
day = np.array(['mon', 'mon', 'tues', 'mon', 'tues'])

In [22]:
df['day1'].groupby([cities, day]).mean()

baltimore  mon     11.0
           tues    23.0
new york   mon     16.5
           tues    15.0
Name: day1, dtype: float64

Another helpful groupby method is `.size()`. This method shows us how many objects fell into a given group. In this case it shows us that the clark group had 2 objects inside and the diana group had 3

In [23]:
df.groupby('name').size()

name
clark    2
diana    3
dtype: int64

Here we do the same thing but we tell it to group by multiple columns. This time it shows us how many objects were in each subgroup

In [24]:
df.groupby(['name', 'msgs']).size()

name   msgs 
clark  email    1
       tweet    1
diana  email    2
       tweet    1
dtype: int64
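
One point worth a quick sketch: `.size()` counts every row in a group, while `.count()` only counts non-missing values. The missing value below is our own addition to the sample data, purely to show the difference.

```python
import numpy as np
import pandas as pd

# Same names as before, but one of diana's day1 values is missing
# (a hypothetical change made only for this illustration).
df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'day1': [10, np.nan, 23, 23, 15]})

sizes = df.groupby('name').size()             # rows per group
counts = df.groupby('name')['day1'].count()   # non-missing day1 values per group

print(sizes.to_dict())
print(counts.to_dict())
```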

If we iterate over a GroupBy object, each pass through the loop yields a two-item tuple: first the name of the group, then the content of the group itself. Since it is a tuple, we can use tuple unpacking to get the name and the group content into separate variables.

In [26]:
for name, group in df.groupby('msgs'):
    if name == 'email':
        print(name)
        print('='*30)
        print(group)
        print()
    else:
        pass

email
==============================
   day1  day2   msgs   name
0    10    14  email  diana
2    23    16  email  clark
4    15    21  email  diana



When the GroupBy object is multilevel, the "name" slot of each tuple is itself a tuple, with one part for each level of the grouping.

In [28]:
for (k1, k2), group in df.groupby(['name', 'msgs']):
    if k2 != 'tweet':
        print(k1, k2)
        print('='*30)
        print(group)
        print()

clark email
==============================
   day1  day2   msgs   name
2    23    16  email  clark

diana email
==============================
   day1  day2   msgs   name
0    10    14  email  diana
4    15    21  email  diana
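
If we only want one group, looping and filtering is not the only option. The sketch below pulls a single group out with `.get_group()` and, alternatively, collects all groups into a dictionary. The names `emails` and `pieces` are ours, for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                   'day1': [10, 11, 23, 23, 15],
                   'day2': [14, 15, 16, 17, 21]})

grouped = df.groupby('msgs')

# Fetch a single group by its name.
emails = grouped.get_group('email')
print(emails)

# Or build a dictionary that maps each group name to its rows.
pieces = {name: group for name, group in grouped}
print(sorted(pieces))
```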



Sometimes (especially with larger datasets) aggregating every column of the DataFrame is not what you want. You can select specific columns after grouping, so that only those columns are aggregated.

In [33]:
df.groupby(['name', 'msgs'])[['day2', 'day1']].mean()

# The following line does the same thing, but displays the columns in the
# opposite order.
# df.groupby(['name', 'msgs'])[['day1', 'day2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,day2,day1
name,msgs,Unnamed: 2_level_1,Unnamed: 3_level_1
clark,email,16.0,23.0
clark,tweet,17.0,23.0
diana,email,17.5,12.5
diana,tweet,15.0,11.0
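
As a quick check, the sketch below compares the two spellings we have seen so far for grouping a single column: indexing the column first and passing the key as a Series, versus grouping the DataFrame by name and then selecting the column. The names `a` and `b` are ours.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                   'day1': [10, 11, 23, 23, 15],
                   'day2': [14, 15, 16, 17, 21]})

# Spelling 1: select the column, then group by another column.
a = df['day1'].groupby(df['name']).mean()

# Spelling 2: group the DataFrame, then select the column.
b = df.groupby('name')['day1'].mean()

print(a.equals(b))
```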


In [None]:
# There are a number of other ways to implement grouping. One way is using
# dictionaries to map columns (or rows) to values. In this case, we tie each
# of the years to a grouping (pre vs post apocalyptic event and a third
# category for future events).

In [34]:
heroes = DataFrame([[512, 613, 714, 815, 916],
                    [413, 412, 411, 420, 415],
                    [501, 525, 535, 545, 555],
                    [501, 602, 545, 600, 599],
                    [413, 603, 412, 599, 419]],
                    columns=[2011, 2012, 2013, 2014, 2015],
                    index=['clark', 'bruce', 'diana', 'kara', 'selina'])
heroes

Unnamed: 0,2011,2012,2013,2014,2015
clark,512,613,714,815,916
bruce,413,412,411,420,415
diana,501,525,535,545,555
kara,501,602,545,600,599
selina,413,603,412,599,419


In [35]:
mapping = {2011: 'pre',
           2012: 'pre',
           2013: 'post',
           2014: 'post',
           2015: 'post',
           2016: 'future'}  

In [36]:
mapping

{2011: 'pre',
 2012: 'pre',
 2013: 'post',
 2014: 'post',
 2015: 'post',
 2016: 'future'}

In [None]:
# So rather than grouping by a column, we simply drop in the dictionary:
# in this case, we explicitly identify the grouping axis to be the columns
# by using an axis=1 argument (the default is axis=0 for grouping by rows.)

In [37]:
by_column = heroes.groupby(mapping, axis=1)
by_column.sum()

Unnamed: 0,post,pre
clark,2445,1125
bruce,1246,825
diana,1635,1026
kara,1744,1103
selina,1430,1016
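
Recent pandas releases have, as far as we know, deprecated the `axis=1` argument to `.groupby()`. A minimal sketch of one alternative, assuming the same `heroes` data and `mapping` as above: transpose so the years become the index, group by the dictionary, then transpose back. The name `by_period` is ours.

```python
import pandas as pd

heroes = pd.DataFrame([[512, 613, 714, 815, 916],
                       [413, 412, 411, 420, 415],
                       [501, 525, 535, 545, 555],
                       [501, 602, 545, 600, 599],
                       [413, 603, 412, 599, 419]],
                      columns=[2011, 2012, 2013, 2014, 2015],
                      index=['clark', 'bruce', 'diana', 'kara', 'selina'])

mapping = {2011: 'pre', 2012: 'pre',
           2013: 'post', 2014: 'post', 2015: 'post',
           2016: 'future'}   # 2016 has no matching column and is ignored

# Transpose: years become row labels, so a default row-wise groupby applies.
by_period = heroes.T.groupby(mapping).sum().T
print(by_period)
```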


In [None]:
# NOTE: the group labels are sorted lexicographically, which puts all the
# 'post' years in front of the 'pre' years. That seems weird, so we tell the
# function to skip the alphabetical sorting with sort=False, which leaves
# the 'pre' group (years 2011, 2012) in front of the 'post' group (years
# 2013, 2014, 2015).

In [40]:
by_column = heroes.groupby(mapping, axis=1, sort=False)
by_column.describe()
# by_column.mean()

Unnamed: 0,Unnamed: 1,count,mean,std,min,25%,50%,75%,max
pre,2011,5.0,468.0,50.408333,413.0,413.0,501.0,501.0,512.0
pre,2012,5.0,551.0,85.360998,412.0,525.0,602.0,603.0,613.0
post,2013,5.0,523.4,124.472085,411.0,412.0,535.0,545.0,714.0
post,2014,5.0,595.8,142.796008,420.0,545.0,599.0,600.0,815.0
post,2015,5.0,580.8,204.343339,415.0,419.0,555.0,599.0,916.0


In [None]:
# You can group by the outputs of a function. It does not matter where the
# function comes from, as long as it returns a value for each index label.
# For example, this sample function counts the number of vowels that show
# up in a superhero's name.

In [41]:
def count_vowels(name):
    count = 0
    for letter in name:
        if letter in ['a', 'e', 'i', 'o', 'u']:
            count += 1
    return count

In [None]:
# The following will group the heroes based on the number of vowels in their
# name (1, 2 or 3) and then will calculate summary statistics for each group.

In [45]:
heroes.groupby(count_vowels).describe()

Unnamed: 0_level_0,2011,2011,2011,2011,2011,2011,2011,2011,2012,2012,...,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
1,1.0,512.0,,512.0,512.0,512.0,512.0,512.0,1.0,613.0,...,815.0,815.0,1.0,916.0,,916.0,916.0,916.0,916.0,916.0
2,2.0,457.0,62.225397,413.0,435.0,457.0,479.0,501.0,2.0,507.0,...,555.0,600.0,2.0,507.0,130.107648,415.0,461.0,507.0,553.0,599.0
3,2.0,457.0,62.225397,413.0,435.0,457.0,479.0,501.0,2.0,564.0,...,585.5,599.0,2.0,487.0,96.166522,419.0,453.0,487.0,521.0,555.0
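
To see what happens when a function is used as the grouping key, the sketch below calls the function on every index label ourselves and groups by the resulting list. Both routes should give the same groups. For brevity it uses only the 2011 and 2012 columns of `heroes`; the names `keys`, `by_func`, and `by_keys` are ours.

```python
import pandas as pd

heroes = pd.DataFrame([[512, 613],
                       [413, 412],
                       [501, 525],
                       [501, 602],
                       [413, 603]],
                      columns=[2011, 2012],
                      index=['clark', 'bruce', 'diana', 'kara', 'selina'])

def count_vowels(name):
    count = 0
    for letter in name:
        if letter in ['a', 'e', 'i', 'o', 'u']:
            count += 1
    return count

# What groupby does with a function: call it once per index label.
keys = [count_vowels(name) for name in heroes.index]
print(keys)

by_func = heroes.groupby(count_vowels).sum()
by_keys = heroes.groupby(keys).sum()
print(by_func)
```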


In [None]:
# What other aggregations can you do on data?
# by_column.mean()
# by_column.sum()
# by_column.min()
# by_column.max()
# by_column.median()
# by_column.first()
# by_column.last()
by_column.describe()

# by_column.<tab>        # pressing the <tab> key in ipython after typing
                         # "by_column." will display all the methods/attributes

In [None]:
# It is possible to apply your own aggregation function to your data. For
# example, if we want to know how far each group's max is above that
# group's mean, we can calculate that:

In [46]:
def max_mean_diff(arr):
    return arr.max() - arr.mean()

groups.agg(max_mean_diff)

name
clark    0
diana    3
Name: day1, dtype: int64
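
As a sanity check on the custom function, here is a small sketch comparing `.agg(max_mean_diff)` with the same quantity computed from the built-in `.max()` and `.mean()` methods. The names `via_agg` and `via_builtins` are ours.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'day1': [10, 11, 23, 23, 15]})

def max_mean_diff(arr):
    return arr.max() - arr.mean()

groups = df['day1'].groupby(df['name'])

# The custom function is called once per group.
via_agg = groups.agg(max_mean_diff)

# The same result, built from two built-in aggregations.
via_builtins = groups.max() - groups.mean()

print(via_agg.to_dict())
print(via_builtins.to_dict())
```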

In [None]:
# For the next bit, let's look at a slightly more sophisticated dataset.
# This is data that I generated randomly.
# We start by reading in the data.
# Then we calculate the change in each student's gpa as a fraction of
# that student's gpa.

In [48]:
gpas = pd.read_csv('gpa_short.csv')
gpas

Unnamed: 0,gpa,gpa_change,gender,athlete,day,period,duration
0,75,0.939395,Female,no,Tue,evening,2
1,55,0.106715,Male,yes,Fri,morning,1
2,73,1.309814,Male,yes,Sat,morning,3
3,82,1.702488,Male,yes,Fri,afternoon,3
4,56,2.576343,Male,yes,Tue,morning,2
5,76,3.610408,Male,yes,Thu,afternoon,3
6,90,3.91934,Male,yes,Mon,afternoon,2
7,62,0.728715,Female,yes,Tue,evening,1
8,78,0.832815,Male,no,Mon,morning,2
9,57,0.654,Female,yes,Mon,evening,3


In [49]:
gpas['gpa_pct'] = gpas['gpa_change'] / gpas['gpa']
gpas

Unnamed: 0,gpa,gpa_change,gender,athlete,day,period,duration,gpa_pct
0,75,0.939395,Female,no,Tue,evening,2,0.012525
1,55,0.106715,Male,yes,Fri,morning,1,0.00194
2,73,1.309814,Male,yes,Sat,morning,3,0.017943
3,82,1.702488,Male,yes,Fri,afternoon,3,0.020762
4,56,2.576343,Male,yes,Tue,morning,2,0.046006
5,76,3.610408,Male,yes,Thu,afternoon,3,0.047505
6,90,3.91934,Male,yes,Mon,afternoon,2,0.043548
7,62,0.728715,Female,yes,Tue,evening,1,0.011753
8,78,0.832815,Male,no,Mon,morning,2,0.010677
9,57,0.654,Female,yes,Mon,evening,3,0.011474


In [None]:
# From there, we group by the gender and the athlete status.

In [51]:
groups = gpas.groupby(['gender', 'athlete'])
for grp in groups:
    print(grp)

(('Female', 'no'),    gpa  gpa_change  gender athlete  day   period  duration   gpa_pct
0   75    0.939395  Female      no  Tue  evening         2  0.012525)
(('Female', 'yes'),    gpa  gpa_change  gender athlete  day   period  duration   gpa_pct
7   62    0.728715  Female     yes  Tue  evening         1  0.011753
9   57    0.654000  Female     yes  Mon  evening         3  0.011474)
(('Male', 'no'),    gpa  gpa_change gender athlete  day   period  duration   gpa_pct
8   78    0.832815   Male      no  Mon  morning         2  0.010677)
(('Male', 'yes'),    gpa  gpa_change gender athlete  day     period  duration   gpa_pct
1   55    0.106715   Male     yes  Fri    morning         1  0.001940
2   73    1.309814   Male     yes  Sat    morning         3  0.017943
3   82    1.702488   Male     yes  Fri  afternoon         3  0.020762
4   56    2.576343   Male     yes  Tue    morning         2  0.046006
5   76    3.610408   Male     yes  Thu  afternoon         3  0.047505
6   90    3.919340   M

In [None]:
# then we pull out just the gpa_pct column in reference to the groupings

In [52]:
groups_pct = groups['gpa_pct']
groups_pct

<pandas.core.groupby.SeriesGroupBy object at 0x116ecc860>

In [None]:
# At this point we can apply some functions to this data. Notice the
# use/non-use of quotes. Built-in aggregation methods are referred to by
# name, so they go in quotes. Functions that you have defined in the
# current namespace are passed as objects, without quotes:

In [53]:
groups_pct.agg(['max', 'mean', max_mean_diff])

Unnamed: 0_level_0,Unnamed: 1_level_0,max,mean,max_mean_diff
gender,athlete,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,no,0.012525,0.012525,0.0
Female,yes,0.011753,0.011614,0.00014
Male,no,0.010677,0.010677,0.0
Male,yes,0.047505,0.029617,0.017888
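
The column headers in a result like this come from the strings themselves and, for function objects, from the function's name. A minimal sketch with a small made-up frame (the `scores` data, its `team` and `pct` columns, and the name `result` are hypothetical, not the gpa data):

```python
import pandas as pd

# Hypothetical data, chosen so the arithmetic is easy to follow.
scores = pd.DataFrame({'team': ['a', 'a', 'b', 'b', 'b'],
                       'pct': [1.0, 3.0, 2.0, 4.0, 6.0]})

def max_mean_diff(arr):
    return arr.max() - arr.mean()

# Strings name built-in aggregations; the function is passed as an object.
result = scores.groupby('team')['pct'].agg(['max', 'mean', max_mean_diff])

print(list(result.columns))
print(result)
```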


In [None]:
# there are multiple ways to slice the data into columns that you want and
# then aggregate the data in those columns. here is another method, where
# we choose several columns and then apply four functions against each 
# of the columns:

In [54]:
functions = ['median', 'min', 'max', 'count']
# Selecting several columns takes a list, hence the double brackets.
result = groups[['gpa_pct', 'gpa']].agg(functions)
result

Unnamed: 0_level_0,Unnamed: 1_level_0,gpa_pct,gpa_pct,gpa_pct,gpa_pct,gpa,gpa,gpa,gpa
Unnamed: 0_level_1,Unnamed: 1_level_1,median,min,max,count,median,min,max,count
gender,athlete,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Female,no,0.012525,0.012525,0.012525,1,75.0,75,75,1
Female,yes,0.011614,0.011474,0.011753,2,59.5,57,62,2
Male,no,0.010677,0.010677,0.010677,1,78.0,78,78,1
Male,yes,0.032155,0.00194,0.047505,6,74.5,55,90,6


In [None]:
# If we want to focus on the statistics for just one of the columns, we can
# select that column via dictionary-like indexing.

In [55]:
result['gpa_pct']

Unnamed: 0_level_0,Unnamed: 1_level_0,median,min,max,count
gender,athlete,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,no,0.012525,0.012525,0.012525,1
Female,yes,0.011614,0.011474,0.011753,2
Male,no,0.010677,0.010677,0.010677,1
Male,yes,0.032155,0.00194,0.047505,6


In [None]:
# There is a way to map certain functions to only certain columns by using a dictionary to perform the mapping:

In [56]:
mapping = {'gpa_change': 'min', 'duration': 'mean'}
groups.agg(mapping)

Unnamed: 0_level_0,Unnamed: 1_level_0,gpa_change,duration
gender,athlete,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,no,0.939395,2.0
Female,yes,0.654,2.0
Male,no,0.832815,2.0
Male,yes,0.106715,2.333333


In [None]:
# This can get pretty sophisticated:

In [57]:
mapping2 = {'gpa_change': ['min', 'max', 'median'], 'duration': 'mean'}
groups.agg(mapping2)

Unnamed: 0_level_0,Unnamed: 1_level_0,gpa_change,gpa_change,gpa_change,duration
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,median,mean
gender,athlete,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Female,no,0.939395,0.939395,0.939395,2.0
Female,yes,0.654,0.728715,0.691357,2.0
Male,no,0.832815,0.832815,0.832815,2.0
Male,yes,0.106715,3.91934,2.139416,2.333333
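
A dictionary of lists produces two-level column headers. If flat, self-chosen column names are preferable, pandas also offers named aggregation, where each keyword is an output column and each value is a `(column, function)` pair. A minimal sketch with a small made-up frame (the `students` data and the output names `min_change` and `mean_duration` are hypothetical):

```python
import pandas as pd

# Hypothetical data with the same column names as the gpa dataset.
students = pd.DataFrame({'athlete': ['no', 'yes', 'yes', 'no', 'yes'],
                         'gpa_change': [0.9, 0.1, 1.3, 0.8, 2.5],
                         'duration': [2, 1, 3, 2, 2]})

# keyword = output column name, value = (input column, aggregation)
flat = students.groupby('athlete').agg(
    min_change=('gpa_change', 'min'),
    mean_duration=('duration', 'mean'),
)

print(flat)
```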


In [None]:
# Sometimes, we want to perform aggregation, but we want to retain the
# overall shape of the data, much like we did using the as_index=False
# argument. This is where the transform method comes in. Let's compare a
# plain aggregation (Method 1) with a transform (Method 2) on our heroes
# DataFrame. The transform result has one row per hero, and every hero in
# a group receives that group's value.

In [60]:
k = ['cat_a', 'cat_b', 'cat_a', 'cat_b', 'cat_a']

In [61]:
# Method 1

heroes.groupby(k).mean()

Unnamed: 0,2011,2012,2013,2014,2015
cat_a,475.333333,580.333333,553.666667,653.0,630.0
cat_b,457.0,507.0,478.0,510.0,507.0


In [63]:
# Method 2

heroes.groupby(k).transform(max_mean_diff)

# NOTE: as before, any function that can be applied to the group can be fed
# into the transform function. Here we used our own max_mean_diff; the name
# of a built-in aggregation works too, for example:
# heroes.groupby(k).transform('mean')

Unnamed: 0,2011,2012,2013,2014,2015
clark,36.666667,32.666667,160.333333,162.0,286.0
bruce,44.0,95.0,67.0,90.0,92.0
diana,36.666667,32.666667,160.333333,162.0,286.0
kara,44.0,95.0,67.0,90.0,92.0
selina,36.666667,32.666667,160.333333,162.0,286.0
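
A common use of `transform` is to subtract each group's mean from the members of that group, which only works because the transformed result lines up row for row with the original. A minimal sketch, using just the 2011 and 2012 columns of `heroes` and wrapping the key list in a numpy array so it is unambiguously treated as grouping values (the names `group_means` and `centered` are ours):

```python
import numpy as np
import pandas as pd

heroes = pd.DataFrame([[512, 613],
                       [413, 412],
                       [501, 525],
                       [501, 602],
                       [413, 603]],
                      columns=[2011, 2012],
                      index=['clark', 'bruce', 'diana', 'kara', 'selina'])

k = np.array(['cat_a', 'cat_b', 'cat_a', 'cat_b', 'cat_a'])

# Same shape as heroes: each row holds the mean of that hero's group.
group_means = heroes.groupby(k).transform('mean')

# Distance of each hero from their own group's mean.
centered = heroes - group_means

print(group_means)
print(centered)
```

After centering, the mean of each group is zero (up to floating point rounding).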


In [None]:
# there is another fundamental way to apply functions to the data in a Pandas
# object: using apply()
# let's say we want to find the best performers in terms of grade changes across
# each group

In [64]:
def best(df, n=10, column='gpa_pct'):
    return df.sort_values(by=column, ascending=False)[:n]

In [None]:
# if we simply apply this to the gpas DataFrame, as a whole, we will see the
# most improved students and their characteristics, here we use n=7 to get the
# top seven.

In [65]:
df.sort_values?

In [66]:
best(gpas, n=7)

Unnamed: 0,gpa,gpa_change,gender,athlete,day,period,duration,gpa_pct
5,76,3.610408,Male,yes,Thu,afternoon,3,0.047505
4,56,2.576343,Male,yes,Tue,morning,2,0.046006
6,90,3.91934,Male,yes,Mon,afternoon,2,0.043548
3,82,1.702488,Male,yes,Fri,afternoon,3,0.020762
2,73,1.309814,Male,yes,Sat,morning,3,0.017943
0,75,0.939395,Female,no,Tue,evening,2,0.012525
7,62,0.728715,Female,yes,Tue,evening,1,0.011753


In [None]:
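# If we use a GroupBy and then apply the function to the GroupBy object,
# the function is called once per group and the pieces are glued back
# together, so we get the best performers within each group.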

In [68]:
# gpas.groupby('athlete').apply(best)
gpas.groupby('athlete').apply(best, n=2, column='gpa_change')

Unnamed: 0_level_0,Unnamed: 1_level_0,gpa,gpa_change,gender,athlete,day,period,duration,gpa_pct
athlete,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
no,0,75,0.939395,Female,no,Tue,evening,2,0.012525
no,8,78,0.832815,Male,no,Mon,morning,2,0.010677
yes,6,90,3.91934,Male,yes,Mon,afternoon,2,0.043548
yes,5,76,3.610408,Male,yes,Thu,afternoon,3,0.047505


In [69]:
gpas.groupby(['athlete', 'gender']).apply(best, n=3)[['duration',
                                                      'gpa_pct',
                                                      'gpa_change']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,duration,gpa_pct,gpa_change
athlete,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
no,Female,0,2,0.012525,0.939395
no,Male,8,2,0.010677,0.832815
yes,Female,7,1,0.011753,0.728715
yes,Female,9,3,0.011474,0.654
yes,Male,5,3,0.047505,3.610408
yes,Male,4,2,0.046006,2.576343
yes,Male,6,2,0.043548,3.91934
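
For this particular "top n per group" task there is another route that avoids `apply`: sort the whole DataFrame first, then take the first rows of each group with `.head()`. This is a sketch of an alternative, not a replacement for the approach above. It uses only the `athlete` and `gpa_change` columns from the table shown earlier, and the name `top2` is ours.

```python
import pandas as pd

gpas = pd.DataFrame({
    'athlete': ['no', 'yes', 'yes', 'yes', 'yes',
                'yes', 'yes', 'yes', 'no', 'yes'],
    'gpa_change': [0.939395, 0.106715, 1.309814, 1.702488, 2.576343,
                   3.610408, 3.919340, 0.728715, 0.832815, 0.654000],
})

# Sort once, then keep the first two rows encountered in each group.
top2 = (gpas.sort_values('gpa_change', ascending=False)
            .groupby('athlete')
            .head(2))

print(top2)
```

Unlike the `apply` version, the result keeps the original flat index rather than adding the group key as an extra index level.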


In [None]:
dfc = gpas[['gpa', 'gpa_change', 'gpa_pct']]
gpabins = pd.cut(dfc.gpa_change, 10)
gpabins

In [None]:
def stat_summary(grp):
    return {'min': grp.min(),
            'max': grp.max(),
            'median': grp.median(),
            'std': grp.std()}

groups = dfc.gpa_pct.groupby(gpabins)
groups.apply(stat_summary)
# groups.apply(stat_summary).unstack()
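
The cells above bin one column with `pd.cut` and then group a second column by those bins. Here is a minimal, self-contained sketch of the same pattern on made-up numbers. The `values` data, the explicit bin edges, and the name `summary` are hypothetical (the cell above asks for 10 equal-width bins instead), and `observed=True` is passed so that only bins that actually contain data are reported.

```python
import pandas as pd

# Hypothetical values to bin.
values = pd.Series([1, 2, 3, 11, 12, 25])

# Assign each value to an interval: (0, 10], (10, 20] or (20, 30].
bins = pd.cut(values, bins=[0, 10, 20, 30])

# Group the values by their bin and summarize each bin.
summary = values.groupby(bins, observed=True).agg(['min', 'max', 'count'])

print(summary)
```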

In [None]:
# heroes2 = DataFrame([[1, 2, 3, 4, 5],
#                     [2, 4, 6, 8, 10],
#                     [1, 25, 50, 75, 100],
#                     [1, 2, 3, 4, 100],
#                     [100, 90, 80, 70, 60]], columns=[2011, 2012, 2013, 2014, 2015],
#                     index=['clark', 'tony', 'diana', 'thor', 'jessica'])
# 
# mapping = {2011: 'pre',
#            2012: 'pre',
#            2013: 'post',
#            2014: 'post',
#            2015: 'post',
#            2016: 'future'}
# 
# by_column2 = heroes.groupby(mapping, axis=1, sort=False)