# Welcome to the Dark Art of Coding:
## Introduction to Python
Grouping/aggregating data

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Know how to group data by:
    * Dictionary
    * List
    * Columns
    * Index
* Be able to aggregate data effectively
* Make their own functions to use in aggregation

In [None]:
import pandas as pd
from pandas import DataFrame         #, Series
import numpy as np

# Groupby series
---

Let's start by defining a DataFrame of sample data: the messages (emails and tweets) that diana and clark received on each of two days.

In [None]:
df = DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                'day1': [10, 11, 23, 23, 15],
                'day2': [14, 15, 16, 17, 21]})

df

Once we have our data, we can use the `.groupby()` method to split a column into groups of rows that share the same key value. Here we group the `day1` values by the `name` column.

In [None]:
groups = df['day1'].groupby(df['name'])

In this case we have a `SeriesGroupBy` object

In [None]:
groups        # in IPython, typing "groups." and pressing <tab> lists its methods

This object stores what it needs to let us view our data group by group

In [None]:
for group in groups:
    for item in group:
        print(item)

Once we have our groups we can do vector math on groups individually (e.g. get every group's average or sum).

Groupby objects already have a few methods attached to them that will do this for us.

In [None]:
groups.mean()

In [None]:
groups.sum()
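
As a quick check on what these methods compute, here is a minimal sketch that rebuilds only the `name` and `day1` columns from the sample data and works through the arithmetic by hand in the comments.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'day1': [10, 11, 23, 23, 15]})

groups = df['day1'].groupby(df['name'])

# diana's day1 values are 10, 11 and 15; clark's are 23 and 23
means = groups.mean()   # diana: (10 + 11 + 15) / 3, clark: (23 + 23) / 2
sums = groups.sum()     # diana: 10 + 11 + 15, clark: 23 + 23

print(means)
print(sums)
```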

If we want a multilevel GroupBy object, we can group by *multiple* columns at once by giving the `.groupby()` method a list of columns

In [None]:
means = df['day1'].groupby( [df['name'], df['msgs']] ).mean()
means

Since we have a multilevel Series-like object we can use the `.unstack()` method on it to display the object as a DataFrame

In [None]:
means.unstack()
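
The sketch below repeats the step on a trimmed copy of the sample data, to show what `.unstack()` does to the index: the inner level (`msgs`) moves into the columns while the outer level (`name`) stays as the rows.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                   'day1': [10, 11, 23, 23, 15]})

# Series with a two-level index: (name, msgs)
means = df['day1'].groupby([df['name'], df['msgs']]).mean()

# Rows are names, columns are message types
table = means.unstack()

print(means.index.nlevels)
print(table)
```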

If the grouping column lives in the same DataFrame as the data, you don't need to index into the DataFrame inside `.groupby()`. You can pass the column name as a string instead. Because `msgs` holds strings, we ask `.mean()` to use only the numeric columns.

In [None]:
df.groupby('name').mean(numeric_only=True)

In [None]:
df.groupby(['name', 'msgs']).max()

One thing to realize is that the grouping keys don't have to come from the same DataFrame. You can use any array, Series, or column that has the same length as your DataFrame or Series.

Here is an example that uses NumPy arrays as the grouping keys

In [None]:
cities = np.array(['new york', 'baltimore', 'baltimore', 'new york', 'new york'])
day = np.array(['mon', 'mon', 'tues', 'mon', 'tues'])

In [None]:
df['day1'].groupby([cities, day]).mean()

Another helpful groupby method is `.size()`. This method shows us how many objects fell into a given group. In this case it shows us that the clark group had 2 objects inside and the diana group had 3

In [None]:
df.groupby('name').size()

Here we do the same thing but we tell it to group by multiple columns. This time it shows us how many objects were in each subgroup

In [None]:
df.groupby(['name', 'msgs']).size()
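
One detail worth knowing: `.size()` counts rows, while the related `.count()` method counts only non-missing values. The sketch below uses a hypothetical variant of the sample data (one `day1` value replaced by `NaN`) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data: one day1 value is missing
demo = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                     'day1': [10, np.nan, 23, 23, 15]})

sizes = demo.groupby('name').size()             # rows in each group
counts = demo.groupby('name')['day1'].count()   # non-missing day1 values in each group

print(sizes)
print(counts)
```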

If we iterate over a GroupBy object, each pass through the loop returns a two-item tuple: the name of the group, followed by the content of the group. Because it is a tuple, we can use tuple unpacking to capture the name and the group content in separate variables

In [None]:
for name, group in df.groupby('msgs'):
    if name == 'email':
        print(name)
        print('='*30)
        print(group)
        print()
    else:
        pass

Since this GroupBy object is multilevel each group combination has the "name" slot of that first tuple broken out into a new tuple with multiple parts for each level of the group

In [None]:
for (k1, k2), group in df.groupby(['name', 'msgs']):
    if k2 != 'tweet':
        
        print(k1, k2)
        print('='*30)
        print(group)
        print()
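
Because iteration yields `(name, group)` tuples, the pieces can also be collected into a dictionary, or a single group can be fetched by name with `.get_group()`. This is a minimal sketch on a trimmed copy of the sample data; the variable names `grouped` and `pieces` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'msgs': ['email', 'tweet', 'email', 'tweet', 'email'],
                   'day1': [10, 11, 23, 23, 15]})

grouped = df.groupby('msgs')

# Each item is a (name, sub-DataFrame) tuple, so dict() can collect them
pieces = dict(list(grouped))
print(pieces['email'])

# Fetch one group directly by its name
tweets = grouped.get_group('tweet')
print(tweets)
```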

Sometimes (especially with larger datasets) aggregating every column of the DataFrame is not what you want. In that case, you can select specific columns after grouping

In [None]:
df.groupby(['name', 'msgs'])[['day2', 'day1']].mean()

# The following line computes the same means, with the two columns
# displayed in the opposite order.
# df.groupby(['name', 'msgs'])[['day1', 'day2']].mean()

In [None]:
# There are a number of other ways to implement grouping. One way is using
# dictionaries to map columns (or rows) to values. In this case, we tie each
# of the years to a grouping (pre vs post apocalyptic event and a third
# category for future events).

In [None]:
heroes = DataFrame([[512, 613, 714, 815, 916],
                    [413, 412, 411, 420, 415],
                    [501, 525, 535, 545, 555],
                    [501, 602, 545, 600, 599],
                    [413, 603, 412, 599, 419]],
                    columns=[2011, 2012, 2013, 2014, 2015],
                    index=['clark', 'bruce', 'diana', 'kara', 'selina'])
heroes

In [None]:
mapping = {2011: 'pre',
           2012: 'pre',
           2013: 'post',
           2014: 'post',
           2015: 'post',
           2016: 'future'}  

In [None]:
mapping

In [None]:
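# So rather than grouping by a column, we simply drop in the dictionary.
# In this case, we explicitly identify the grouping axis to be the columns
# by using an axis=1 argument (the default is axis=0 for grouping by rows).
# NOTE: the axis=1 argument is deprecated as of pandas 2.1; on newer
# versions, transpose the DataFrame and group its rows instead.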

In [None]:
by_column = heroes.groupby(mapping, axis=1)
by_column.sum()
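
Since `axis=1` in `.groupby()` is deprecated in recent pandas releases, an equivalent approach is to transpose, group the rows, and transpose back. The sketch below is one way to do that, shown on a two-row subset of the `heroes` data; the name `by_period` is only illustrative.

```python
import pandas as pd

heroes = pd.DataFrame([[512, 613, 714, 815, 916],
                       [413, 412, 411, 420, 415]],
                      columns=[2011, 2012, 2013, 2014, 2015],
                      index=['clark', 'bruce'])

mapping = {2011: 'pre', 2012: 'pre',
           2013: 'post', 2014: 'post', 2015: 'post'}

# Transpose so the years become the row index, group those rows with the
# dictionary, aggregate, then transpose back so heroes are rows again.
# sort=False keeps 'pre' ahead of 'post' (order of first appearance).
by_period = heroes.T.groupby(mapping, sort=False).sum().T

print(by_period)
```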

In [None]:
# NOTE: group keys are sorted lexicographically by default, which puts all
# the 'post' years in front of the 'pre' years. That seems weird, so we
# tell the function to skip the sorting step, which leaves the 'pre' group
# (years 2011, 2012) in front of the 'post' group (years 2013, 2014, 2015).

In [None]:
by_column = heroes.groupby(mapping, axis=1, sort=False)
by_column.describe()
# by_column.mean()

In [None]:
# You can group by the outputs of a function. It does not matter where the
# function comes from, as long as it returns a value for each index label.
# For example, this sample function counts the number of vowels that show
# up in a superhero's name.

In [None]:
def count_vowels(name):
    count = 0
    for letter in name:
        if letter in ['a', 'e', 'i', 'o', 'u']:
            count += 1
    return count

In [None]:
# The following will group the heroes based on the number of vowels in their
# name (1, 2 or 3) and then will calculate summary statistics for each group.

In [None]:
heroes.groupby(count_vowels).describe()
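
To see which heroes land in which group, it can help to check the group sizes. When a function is passed to `.groupby()`, pandas calls it on each index label. The sketch below uses a single-column version of `heroes` and a more compact (but equivalent) vowel counter.

```python
import pandas as pd

def count_vowels(name):
    # Count the lowercase vowels in a name
    return sum(letter in 'aeiou' for letter in name)

heroes = pd.DataFrame({2011: [512, 413, 501, 501, 413]},
                      index=['clark', 'bruce', 'diana', 'kara', 'selina'])

# clark -> 1; bruce, kara -> 2; diana, selina -> 3
sizes = heroes.groupby(count_vowels).size()
totals = heroes.groupby(count_vowels).sum()

print(sizes)
print(totals)
```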

In [None]:
# What other aggregations can you do on data?
# by_column.mean()
# by_column.sum()
# by_column.min()
# by_column.max()
# by_column.median()
# by_column.first()
# by_column.last()
by_column.describe()

# by_column.<tab>        # pressing the <tab> key in ipython after typing
                         # "by_column." will display all the methods/attributes

In [None]:
# It is possible to apply any aggregation function to your data. For example,
# if we want to know how far any group's mean is from that group's max, we
# can calculate that:

In [None]:
def max_mean_diff(arr):
    return arr.max() - arr.mean()

groups.agg(max_mean_diff)
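
Here is the same custom aggregation as a self-contained sketch, assuming `groups` is still the `day1`-by-`name` grouping from the start of the session. The arithmetic is spelled out in the comments.

```python
import pandas as pd

def max_mean_diff(arr):
    return arr.max() - arr.mean()

df = pd.DataFrame({'name': ['diana', 'diana', 'clark', 'clark', 'diana'],
                   'day1': [10, 11, 23, 23, 15]})

groups = df['day1'].groupby(df['name'])

# diana: max 15, mean 12 -> 3.0
# clark: max 23, mean 23 -> 0.0
diffs = groups.agg(max_mean_diff)
print(diffs)
```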

In [None]:
# For the next bit, let's look at a slightly more sophisticated dataset.
# This is data that I generated randomly.
# We start by reading in the data.
# Then we calculate each student's change in gpa as a fraction of the
# student's gpa.

In [None]:
gpas = pd.read_csv('../universal_datasets/gpa_short.csv')
gpas

In [None]:
gpas['gpa_pct'] = gpas['gpa_change'] / gpas['gpa']
gpas

In [None]:
# From there, we group by the gender and the athlete status.

In [None]:
groups = gpas.groupby(['gender', 'athlete'])
for grp in groups:
    print(grp)

In [None]:
# then we pull out just the gpa_pct column in reference to the groupings

In [None]:
groups_pct = groups['gpa_pct']
groups_pct

In [None]:
# At this point we can apply some functions to this data. Notice the
# use/non-use of quotes. Built-in aggregations are referred to by name, as
# quoted strings. Functions that you have defined in the current namespace
# are passed directly, without the quotes:

In [None]:
groups_pct.agg(['max', 'mean', max_mean_diff])
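
The sketch below shows the shape of the result when a list of aggregations is passed: one output column per entry, named after the string or the function. The `demo` DataFrame is a hypothetical stand-in for the gpa data, not the real file.

```python
import pandas as pd

def max_mean_diff(arr):
    return arr.max() - arr.mean()

# Hypothetical stand-in for the gpa data
demo = pd.DataFrame({'athlete': ['yes', 'yes', 'no', 'no'],
                     'gpa_pct': [0.10, 0.30, 0.20, 0.40]})

# Strings name built-in aggregations; the function object is passed as-is
summary = demo.groupby('athlete')['gpa_pct'].agg(['max', 'mean', max_mean_diff])
print(summary)
```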

In [None]:
# there are multiple ways to slice the data into columns that you want and
# then aggregate the data in those columns. here is another method, where
# we choose several columns and then apply four functions against each 
# of the columns:

In [None]:
functions = ['median', 'min', 'max', 'count']
result = groups[['gpa_pct', 'gpa']].agg(functions)   # select multiple columns with a list
result

In [None]:
# If we want to focus on the results for just a single column at a time, we
# can select that column via dictionary-like indexing.

In [None]:
result['gpa_pct']

In [None]:
# There is a way to map certain functions to only certain columns by using a dictionary to perform the mapping:

In [None]:
mapping = {'gpa_change': 'min', 'duration': 'mean'}
groups.agg(mapping)

In [None]:
# This can get pretty sophisticated:

In [None]:
mapping2 = {'gpa_change': ['min', 'max', 'median'], 'duration': 'mean'}
groups.agg(mapping2)
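
When at least one dictionary value is a list, the result gets two-level column labels of the form `(column, function)`. The sketch below illustrates this on a hypothetical stand-in for the gpa data.

```python
import pandas as pd

# Hypothetical stand-in for the gpa data
demo = pd.DataFrame({'athlete': ['yes', 'yes', 'no', 'no'],
                     'gpa_change': [0.5, -0.2, 0.1, 0.3],
                     'duration': [10, 20, 30, 50]})

mapping2 = {'gpa_change': ['min', 'max'], 'duration': 'mean'}
result = demo.groupby('athlete').agg(mapping2)

print(result)

# Individual results are selected with a (column, function) tuple
print(result[('gpa_change', 'min')])
```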

In [None]:
# Sometimes, we want to perform aggregation, but we want to retain the
# overall shape of the data, much like we did using the as_index=False argument.
# This is where the transform method comes in. Let's look at two ways
# to examine the mean for our heroes DataFrame

In [None]:
k = ['cat_a', 'cat_b', 'cat_a', 'cat_b', 'cat_a']

In [None]:
# Method 1

heroes.groupby(k).mean()

In [None]:
# Method 2

heroes.groupby(k).transform('mean')

# NOTE: as before, any function that can be applied to the group can be fed
# into the transform function, for example:
# heroes.groupby(k).transform(max_mean_diff)
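
To make the shape difference concrete, here is a minimal sketch on a two-column subset of `heroes`: aggregating returns one row per group, while `.transform()` returns one row per original row, with each value replaced by its group's result.

```python
import pandas as pd

heroes = pd.DataFrame({2011: [512, 413, 501, 501, 413],
                       2012: [613, 412, 525, 602, 603]},
                      index=['clark', 'bruce', 'diana', 'kara', 'selina'])

k = ['cat_a', 'cat_b', 'cat_a', 'cat_b', 'cat_a']

aggregated = heroes.groupby(k).mean()              # one row per category
same_shape = heroes.groupby(k).transform('mean')   # one row per hero

print(aggregated.shape, same_shape.shape)
print(same_shape)
```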

In [None]:
# there is another fundamental way to apply functions to the data in a Pandas
# object: using apply()
# let's say we want to find the best performers in terms of grade changes across
# each group

In [None]:
def best(df, n=10, column='gpa_pct'):
    return df.sort_values(by=column, ascending=False)[:n]

In [None]:
# if we simply apply this to the gpas DataFrame, as a whole, we will see the
# most improved students and their characteristics, here we use n=7 to get the
# top seven.

In [None]:
df.sort_values?

In [None]:
best(gpas, n=7)

In [None]:
# If we group first and then apply the function to the GroupBy object, we
# get the top performers within each group instead:

In [None]:
# gpas.groupby('athlete').apply(best)
gpas.groupby('athlete').apply(best, n=2, column='gpa_change')

In [None]:
gpas.groupby(['athlete', 'gender']).apply(best, n=3)[['duration',
                                                      'gpa_pct',
                                                      'gpa_change']]
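
The following sketch shows what the group-wise `apply()` returns, using a hypothetical stand-in for the gpa data. The columns are selected before calling `apply()` so that the function only sees the data columns, which is an assumption made here to keep the behavior the same across pandas versions.

```python
import pandas as pd

def best(df, n=10, column='gpa_pct'):
    return df.sort_values(by=column, ascending=False)[:n]

# Hypothetical stand-in for the gpa data
demo = pd.DataFrame({'athlete': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
                     'gpa_pct': [0.1, 0.3, 0.2, 0.5, 0.4, 0.6]})

# best() runs once per group; the pieces are stacked under the group keys
top = demo.groupby('athlete')[['gpa_pct']].apply(best, n=2)
print(top)
```

The result is indexed by the group key first and the original row label second, so `top.loc['yes']` returns just the athletes' top rows.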

In [None]:
dfc = gpas[['gpa', 'gpa_change', 'gpa_pct']]
gpabins = pd.cut(dfc.gpa_change, 10)
gpabins

In [None]:
def stat_summary(grp):
    return {'min': grp.min(),
            'max': grp.max(),
            'median': grp.median(),
            'std': grp.std()}

groups = dfc.gpa_pct.groupby(gpabins)
# groups.apply(stat_summary)
groups.apply(stat_summary).unstack()
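
A similar table can be produced with `.agg()` and a list of function names, which avoids the dictionary-returning helper. This is an alternative sketch, not the approach used above; it runs on a small hypothetical dataset and passes `observed=True` so that only bins containing data are reported.

```python
import pandas as pd

# Hypothetical stand-in for the gpa data
demo = pd.DataFrame({'gpa_change': [0.0, 0.1, 0.4, 0.6, 0.9, 1.0],
                     'gpa_pct': [0.00, 0.05, 0.10, 0.20, 0.30, 0.25]})

# Split gpa_change into two equal-width bins
bins = pd.cut(demo['gpa_change'], 2)

# One row per bin, one column per statistic
summary = demo['gpa_pct'].groupby(bins, observed=True).agg(
    ['min', 'max', 'median', 'std'])
print(summary)
```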