In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import pandas as pd
from pandas import Series, DataFrame

#Intro, overview

##Why the complexity of Pandas's groupby and MultiIndex, when SQL is simpler?

Per the intro, the added complexity of pandas syntax and MultiIndex is justified because 'query languages like SQL are rather limited in the kinds of group operations that can be performed'. Pandas supports 'much more complex grouped operations by utilizing any function that accepts a pandas object or NumPy array.'

For example, rather than just specifying column names to define the grouping, pandas lets you define your groups (the keys) in terms of DataFrame column names, arrays, or functions (i.e., in code). Once you have a set of groups, you can compute things like count, mean, etc. using pre-defined functions, or define your own functions.
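The three kinds of key mentioned above, plus a user-defined aggregation, might look like the following minimal sketch. The DataFrame, its column names, the `labels` array, and the `spread` function are all made up for illustration and are not from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the book.
df = pd.DataFrame({'city': ['Oslo', 'Rome', 'Oslo', 'Rome'],
                   'sales': [10.0, 20.0, 30.0, 60.0]},
                  index=['ann', 'bob', 'al', 'bea'])

# 1. Key given as a column name.
by_column = df.groupby('city')['sales'].sum()

# 2. Key given as an array with one label per row.
labels = np.array(['x', 'x', 'y', 'y'])
by_array = df['sales'].groupby(labels).sum()

# 3. Key given as a function; pandas calls it on each index label.
by_function = df['sales'].groupby(lambda name: name[0]).sum()

# A user-defined aggregation: the spread (max minus min) within each group.
def spread(values):
    return values.max() - values.min()

by_custom = df.groupby('city')['sales'].agg(spread)
```

In each case the key ends up as one label per row; only the way that label is obtained differs.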

##Mechanics - split, apply, combine

All group by work can be thought of as having three parts:
1. Split - Data (in a Series, DataFrame, etc.) is split into groups based on one or more keys. SQL splits data into groups by row - i.e., the groups are made up of one or more rows. In Pandas this is splitting by axis = 0. Pandas can also group by/split columns, with axis = 1.
2. Apply - A function is applied to each group, producing a new value. For example, 'count' can be applied to count the number of rows (or columns, if splitting by axis = 1), sum can be applied to sum values associated with each row in the group, etc.
3. Combine - Finally, the results of the apply step are combined into a result object. The book says that the 'resulting object will usually depend on what's being done to the data' - I'm not sure what this means; hopefully it'll become clear after working through the chapter.
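To make the three steps concrete, here is a rough sketch that performs them by hand with plain Python and then compares the result with groupby. The values and keys are invented for illustration, and this is only a mental model, not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical values and keys, one key per value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
keys = ['a', 'a', 'b', 'b', 'a']

# Split: collect the values belonging to each key.
groups = {}
for key, value in zip(keys, values):
    groups.setdefault(key, []).append(value)

# Apply: reduce each group to a single number (here, the mean).
applied = {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied).sort_index()

# The same computation with groupby.
builtin = values.groupby(keys).mean()
```

Both `manual` and `builtin` hold one mean per key, indexed by the key values.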

##Split - defining keys

Ultimately you want a set of one or more grouping keys. You can explicitly provide a list or array of values with the same length as the axis being grouped (for example, if you're grouping rows, the list has one value per row). Or you can provide other things, as shown below, each of which is a shortcut for defining a list of values with the same length as the axis being grouped.
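As a small sketch of that idea, the snippet below groups the same data twice: once with an explicit list of keys and once with a column name, which stands in for that column's values. The data values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; key1 mirrors the grouping pattern used in this notebook.
df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Explicit keys: one label per row, same length as the row axis.
explicit = df['data1'].groupby(['a', 'a', 'b', 'b', 'a']).sum()

# Shortcut: a column name stands in for that column's values.
shortcut = df.groupby('key1')['data1'].sum()
```

The two results contain the same sums because the column 'key1' holds exactly the labels that were spelled out in the explicit list.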

In [4]:
df = DataFrame({'key1': list('aabba'),
                'key2': ['one','two','one','two','one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

      data1     data2 key1 key2
0  0.095599  1.632777    a  one
1  1.262306  0.444808    a  two
2 -0.201121  0.229857    b  one
3  0.882426 -0.766906    b  two
4  1.059917 -0.795773    a  one


In this first example we select the 'data1' column, which gives a Series, and group it by the values in the 'key1' column. Two Series are involved: one holds the data we want to work with (we call groupby on it), and the other supplies the key values that groupby uses to assign each data value to a group.

In [5]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x107354d68>

I think this SeriesGroupBy object represents only the split step: it holds the groups but has not applied anything to them yet. The apply step happens when we call a function on the groupby object.
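One way to check that the groupby object only holds the split is to iterate over it, which yields one (key, sub-Series) pair per group. This is a minimal sketch with made-up data1 values; `pieces` and `sizes` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical data with the same key1 pattern as above.
df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})
grouped = df['data1'].groupby(df['key1'])

# Iterating yields (key, sub-Series) pairs: this is the "split" step.
pieces = {key: list(group) for key, group in grouped}

# Applying a function (here, size) produces one result per group.
sizes = grouped.size()
```

Here `pieces` maps each key to the raw values in its group, while `sizes` is the result of actually applying something to those groups.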

In [6]:
grouped.mean()

key1
a    0.805941
b    0.340653
Name: data1, dtype: float64

Applying a function like mean turns the Series into a new Series with one row per unique key value. Here the key is the key1 column alone, which has two unique values, so the result has two rows, and those unique values become the index. The value in each row is the result of the function applied to that group, 'mean' in this case.
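The sketch below rebuilds the example from the rounded data1 values printed in the table earlier (so results match only to about six decimal places) and inspects the index of the result. `index_labels` and `index_name` are names introduced here for illustration.

```python
import pandas as pd

# data1 values copied from the rounded output shown earlier.
df = pd.DataFrame({'key1': list('aabba'),
                   'data1': [0.095599, 1.262306, -0.201121, 0.882426, 1.059917]})

result = df['data1'].groupby(df['key1']).mean()

index_labels = list(result.index)   # the unique values of key1
index_name = result.index.name      # taken from the name of the key Series
```

The result keeps the name of the data Series ('data1'), while its index takes both its labels and its name from the key Series.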

Another example involves passing multiple arrays as a list.

In [8]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.577758
      two     1.262306
b     one    -0.201121
      two     0.882426
Name: data1, dtype: float64
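With two keys, the result is indexed by a two-level MultiIndex, one level per key. The following sketch, again using the rounded values printed earlier, looks up one group by its pair of keys and uses unstack to pivot the inner level into columns. The name `wide` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the rounded output shown earlier.
df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [0.095599, 1.262306, -0.201121, 0.882426, 1.059917]})

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# Each row is identified by a (key1, key2) pair.
a_one = means.loc[('a', 'one')]

# Pivot the inner level (key2) into columns, giving a DataFrame.
wide = means.unstack()
```

After unstack, the rows are the key1 values and the columns are the key2 values.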