In [24]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [25]:
import pandas as pd
from pandas import Series, DataFrame

#Intro, overview

##Why the complexity of Pandas's groupby and MultiIndex, when SQL is simpler?

Per the intro, the complexity of pandas syntax and MultiIndex use is ok because 'query languages like SQL are rather limited in the kinds of group operations that can be performed'. Pandas supports 'much more complex grouped operations by utilizing any function that accepts a pandas object or NumPy array.'

For example, rather than just specifying the names of columns to define how to group, pandas lets you define your groups (the keys) in terms of DataFrame column names, arrays, or functions (i.e., in code). Once you have a set of groups, you can compute things - like count, mean, etc. - using pre-defined functions, or define your own functions.

##Mechanics - split, apply, combine

All group by work can be thought of as having three parts:
1. Split - Data (in a Series, DataFrame, etc.) is split into groups based on one or more keys. SQL splits data into groups by row - i.e., the groups are made up of one or more rows. In Pandas this is splitting by axis = 0. Pandas can also group by/split columns, with axis = 1.
2. Apply - A function is applied to each group, producing a new value. For example, 'size' can be applied to count the number of rows (or columns, if splitting by axis = 1) in each group ('count' is similar, but counts only non-NA values), sum can be applied to sum the values in each group, etc.
3. Combine - Finally, the results of the apply step are combined into a result object. The book says that the 'resulting object will usually depend on what's being done to the data' - I'm not sure what this means; hopefully it'll become clear after working through the chapter.
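
To make the three steps concrete, here is a minimal sketch that does split, apply, and combine by hand and then compares the result with groupby. The frame `toy` and its column names are made up for illustration (they are not the notebook's data), and the manual version is only meant to mirror the idea, not how pandas implements it internally.

```python
import pandas as pd

# Small hypothetical frame so the results are easy to check by hand.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: collect the values belonging to each key.
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number.
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series indexed by key.
manual = pd.Series(applied).sort_index()

# The same computation in one groupby call.
auto = toy.groupby('key')['value'].mean()

print(manual)
print(auto)
```

Both Series should have one entry per unique key, which is the 'combine' step producing a result shaped by the keys rather than by the original rows.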

#Split - defining keys

Ultimately you want a set of one or more grouping keys. You can explicitly provide a list or array of values that is the same length as the axis being grouped (for example, if you're grouping rows then you provide a list of values whose length equals the number of rows). Or you can provide other things, as shown below, each of which is a shortcut for defining a list of values that's the same length as the axis being grouped.

In [44]:
df = DataFrame({'key1': list('aabba'),
                'key2': 'one two one two one'.split(),
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df

Unnamed: 0,data1,data2,key1,key2
0,-1.142105,-0.427882,a,one
1,0.143537,0.052476,a,two
2,-0.295989,0.101438,b,one
3,-1.844569,0.606259,b,two
4,1.526356,-1.16785,a,one


As one example - which one, I'll need to think through or learn - here we first get a Series object for the 'data1' column. Then we group that Series object using values from the 'key1' column. I think this means we're working with two Series instances - one that has the data we care about working with, on which we call the groupby method, and one that provides the values that groupby needs to actually group the original values.

In [27]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x1073c4550>

Once we have this SeriesGroupBy object, which I think represents the groups but not the apply part of the steps above - i.e., it's just doing the split part, then we can actually do the 'apply' by applying a function to the groupby.

In [28]:
grouped.mean()

key1
a   -0.366095
b    0.423272
Name: data1, dtype: float64

When we apply a function like mean we turn a Series into another Series, where the produced/second Series has a row for each unique value specified by the key. Here the key is only made up of the key1 column, and the key1 column has two unique values, so the resulting Series has two rows - further, the index values for these two rows are the unique values. The actual data associated with each row is the result of the function we applied to the SeriesGroupBy object, which is 'mean' in this case.

Another example involves passing multiple arrays as a list.

In [29]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -0.179601
      two    -0.739081
b     one     1.237260
      two    -0.390717
Name: data1, dtype: float64

The above is the same as the first example, except that we group by two Series. We get a MultiIndex where the first level is the unique values of the first key, and the second level is the unique values of the second key. 

In [30]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-0.179601,-0.739081
b,1.23726,-0.390717


The next example was a bit odd to me, because it shows the use of arrays with any content, as long as they're the 'right' length, where 'right' means the same length as the axis we're grouping. Based on checking the values of the actual calculated means, it looks like the _location_ of the key values is what matters... the key values themselves only decide which locations end up together, and they're then used as the labels for the index of the result. That is, in the following example, 'Ohio' and 2005 are at location 0, so the row in the source data at location 0 is made a part of the group with keys 'Ohio' and 2005. Location 3 also has the same key values, so it's made part of the same group. The other locations have different key values.

Upon a bit of reflection, this makes sense - all groupby really cares about is a set of values, one for each entity (row, or column) being grouped - it creates groups of entities where each group is just made of entities with the same key value. The key value can come from anywhere, including as an array that's defined completely separately from anything we're grouping. (Although that's not common, I think.)

In [31]:
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005,2005,2006,2005,2006])

df['data1'].groupby([states, years]).mean()

California  2005   -0.739081
            2006    1.237260
Ohio        2005   -0.500794
            2006    0.251667
Name: data1, dtype: float64

In all of the above examples we used individual Series instances on which to call groupby and to provide the key(s). The only relationship needed between the two is that they line up: one key value per entity being grouped.
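
One caveat I think is worth checking: as far as I understand, a key passed as a Series is lined up with the data by index label, while a plain array is lined up by position. The sketch below uses made-up names (`values`, `keys`) and a deliberately reversed index so the two behaviours give different answers.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])

# Same kind of labels, but attached to a reversed index.
keys = pd.Series(['x', 'x', 'y'], index=[2, 1, 0])

# Series key: aligned on index labels, so row 0 gets 'y' and rows 1 and 2 get 'x'.
by_label = values.groupby(keys).sum()

# Array key: matched by position, so rows 0 and 1 get 'x' and row 2 gets 'y'.
by_position = values.groupby(keys.to_numpy()).sum()

print(by_label)
print(by_position)
```

In the notebook's examples the key Series come from the same DataFrame as the data, so the index labels match and the distinction doesn't show up.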

Alternatively, if the grouping information is in the same DataFrame as the data being grouped, all you need to do is specify the name of the column(s).

In [32]:
df.groupby('key1').mean()

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-0.366095,-0.167894
b,0.423272,1.046002


Note that in the previous example, there's no 'key2' column in the results, even though we're not grouping by it and it's in the source data frame. This is because by default any columns that have non-numeric data are considered 'nuisance' columns (at least when you do the apply w/ a function - like mean - that requires numeric data?) and so left out of the result.
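
The silent dropping of 'nuisance' columns depends on the pandas version: to my knowledge, pandas 2.0 and later raise an error for mean on a string column instead of leaving it out. A sketch that should behave the same either way asks for numeric columns explicitly via `numeric_only=True`. The frame `toy` below is a hypothetical stand-in with fixed values, not the notebook's random data.

```python
import pandas as pd

toy = pd.DataFrame({'key1': list('aabba'),
                    'key2': 'one two one two one'.split(),
                    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Ask explicitly for numeric columns only; key2 (strings) is left out of the result.
means = toy.groupby('key1').mean(numeric_only=True)
print(means)
```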

In [33]:
df.groupby(['key1','key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,-0.179601,-0.91242
a,two,-0.739081,1.321158
b,one,1.23726,0.894614
b,two,-0.390717,1.19739


In [34]:
# don't forget that to just get a count of rows/items in the group
# use 'size' (NOT 'count')
df.groupby(['key1','key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

##Iterating over groups

When you have a groupby object - you haven't applied the 'apply' function yet - you can still do interesting stuff with it, including iterating through the things (rows, or columns) that were grouped.

In [35]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.610870 -1.384503    a  one
1 -0.739081  1.321158    a  two
4  0.251667 -0.440337    a  one
b
      data1     data2 key1 key2
2  1.237260  0.894614    b  one
3 -0.390717  1.197390    b  two


In [36]:
for (k1, k2), group in df.groupby(['key1','key2']):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0 -0.610870 -1.384503    a  one
4  0.251667 -0.440337    a  one
a two
      data1     data2 key1 key2
1 -0.739081  1.321158    a  two
b one
     data1     data2 key1 key2
2  1.23726  0.894614    b  one
b two
      data1    data2 key1 key2
3 -0.390717  1.19739    b  two


A useful recipe is 'computing a dict of the data pieces' (i.e., the 'group' part of the tuples returned by groupby) 'as a one-liner':

In [45]:
pieces = dict(list(df.groupby('key1')))
len(pieces) # 2, because we had two groups, with keys 'a' and 'b'

2

In [38]:
pieces

{'a':       data1     data2 key1 key2
 0 -0.610870 -1.384503    a  one
 1 -0.739081  1.321158    a  two
 4  0.251667 -0.440337    a  one, 'b':       data1     data2 key1 key2
 2  1.237260  0.894614    b  one
 3 -0.390717  1.197390    b  two}

In [39]:
pieces['b']

Unnamed: 0,data1,data2,key1,key2
2,1.23726,0.894614,b,one
3,-0.390717,1.19739,b,two


##Example of grouping columns instead of rows

In [40]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [41]:
grouped_cols = df.groupby(df.dtypes, axis=1)
grouped_cols

<pandas.core.groupby.DataFrameGroupBy object at 0x1073ccb70>

In [42]:
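# Oops: this uses `grouped` (the SeriesGroupBy from In [27]), not grouped_cols, so the output shows row groups of data1
dict(list(grouped))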

{'a': 0   -0.610870
 1   -0.739081
 4    0.251667
 Name: data1, dtype: float64, 'b': 2    1.237260
 3   -0.390717
 Name: data1, dtype: float64}
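
What the dtype grouping was meant to show is one piece per dtype, each holding the columns of that dtype. Since the `axis=1` argument is deprecated in recent pandas releases (as far as I know), the sketch below builds the same kind of dict with plain column selection instead. The frame `toy` is a small made-up example, and this is an alternative technique, not what `groupby(df.dtypes, axis=1)` does internally.

```python
import pandas as pd

toy = pd.DataFrame({'key1': list('aab'),
                    'data1': [1.0, 2.0, 3.0],
                    'data2': [4.0, 5.0, 6.0]})

# One sub-frame per dtype, keyed by the dtype's name.
by_dtype = {str(dtype): toy.loc[:, toy.dtypes == dtype]
            for dtype in toy.dtypes.unique()}

for name, piece in by_dtype.items():
    print(name, list(piece.columns))
```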

##Selecting a column or subset of columns

With no further code, the aggregation function is applied to every (numeric) non-key column. We can also select just the columns we care about.

In [47]:
df.groupby('key1')['data1'].sum()

key1
a    0.527789
b   -2.140557
Name: data1, dtype: float64

The above is syntactic sugar for what's really happening:

In [48]:
df['data1'].groupby(df['key1']).sum()

key1
a    0.527789
b   -2.140557
Name: data1, dtype: float64

Also - and I think this is cool and that it's about time I finally figured it out - if you pass a list to the DataFrame indexing/bracket operator, you get back a DataFrame instead of a Series (which, of course, is what you get when you specify a single string). This looks odd when you pass only a single column name, but doesn't look so odd when you pass multiple column names - the single column name is doing the same thing, so it's consistent, even if it does look odd. 

In [49]:
type(df['data1']) # Series, as we know

pandas.core.series.Series

In [54]:
type(df[['data1','data2']]) # DataFrame, since we have two columns

pandas.core.frame.DataFrame

In [55]:
type(df[['data1']]) # Still a DataFrame, since we passed a list - only one column though

pandas.core.frame.DataFrame

So, back to grouping, and another example of syntactic sugar. Both of the following are the same.

In [56]:
df.groupby('key1')[['data2']].sum()

Unnamed: 0_level_0,data2
key1,Unnamed: 1_level_1
a,-1.543256
b,0.707696


In [58]:
df[['data2']].groupby(df['key1']).sum()

Unnamed: 0_level_0,data2
key1,Unnamed: 1_level_1
a,-1.543256
b,0.707696


Another example - filtering to only a few columns 'may be especially useful for large datasets'.

In [59]:
df.groupby(['key1','key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,-0.797866
a,two,0.052476
b,one,0.101438
b,two,0.606259


In [60]:
# above is same as
df[['data2']].groupby([df['key1'], df['key2']]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,-0.797866
a,two,0.052476
b,one,0.101438
b,two,0.606259


The nature of the object that comes back from indexing the groupby (before the operation is applied) depends on whether you pass a single column name - you get a 'grouped Series' - or a list or array of names - you get a 'grouped DataFrame'.

In [62]:
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped

<pandas.core.groupby.SeriesGroupBy object at 0x1073b9160>

In [64]:
df_grouped = df.groupby(['key1','key2'])[['data2']]
df_grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x1073ccdd8>

In [67]:
df_grouped = df.groupby('key1 key2'.split())[['data1','data2']]
df_grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x107507a58>

##Grouping with dicts and Series

You don't need to have your grouping information in a list or array.

In [69]:
people = DataFrame(np.random.randn(5,5),
                   columns=list('abcde'),
                   index='Joe Steve Wes Jim Travis'.split())
people.iloc[2:3, [1, 2]] = np.nan  # row 2 ('Wes'), columns 'b' and 'c'; .ix no longer exists in current pandas
people

Unnamed: 0,a,b,c,d,e
Joe,0.180718,0.602803,-2.505254,-0.585075,2.018985
Steve,0.313131,0.59518,-0.223885,-0.853578,1.731554
Wes,-0.660761,,,2.328535,0.600022
Jim,-0.357893,0.164575,0.987413,0.715731,0.341701
Travis,0.479686,-0.866455,-1.169262,-1.457515,-0.909903


In [70]:
# consider this a 'group correspondence' - each column above maps
# to a color, defined by this dict.
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
mapping

{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

Here, groupby works by column - I think it takes the label on each column - a, b, etc. - and uses the dict to map the label to the actual grouping key, so 'a' is mapped to 'red', and 'red' is used as the grouping key.

In [73]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

Unnamed: 0,blue,red
Joe,-3.090329,2.802506
Steve,-1.077463,2.639865
Wes,2.328535,-0.060738
Jim,1.703144,0.148382
Travis,-2.626778,-1.296672
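
Because `axis=1` on groupby is deprecated in recent pandas releases (to my knowledge), here is a sketch of the same column grouping done by transposing, grouping the rows, and transposing back. The frame `toy` uses fixed made-up values so the sums can be checked by hand; it works here because every column is numeric.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                   columns=list('abcde'), index=['Joe', 'Steve'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}  # 'f' matches no column; it is simply unused

# Transpose so the column labels become row labels, group them by colour, transpose back.
by_color = toy.T.groupby(mapping).sum().T
print(by_color)
```

For Joe, red is a + b + e = 0 + 1 + 4 and blue is c + d = 2 + 3, which confirms that the dict maps each column label to its grouping key.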


In [74]:
map_series = Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

You can also map using a Series, 'which can be viewed as a fixed-size mapping'.

In [75]:
people.groupby(map_series, axis=1).count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wes,1,2
Jim,2,3
Travis,2,3


##Grouping with functions

Or, you can pass a function, which will be called once per entity being grouped, and whose return value will be used as the grouping key.

For example, the following calls len for each row - each index value is what's actually passed, which here is the name - and the return value of the function is used to group. For example, 'Joe' has three letters, as does 'Wes' and 'Jim', so those three rows are grouped together into a single group. Applying sum adds the values for those three rows, and the returned DataFrame has a row with an index value of 3.

In [77]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-0.837936,0.767377,-1.517841,2.45919,2.960708
5,0.313131,0.59518,-0.223885,-0.853578,1.731554
6,0.479686,-0.866455,-1.169262,-1.457515,-0.909903
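
To check the reading that the function is just a shortcut for building an array of keys, this sketch computes the keys explicitly and groups by that array. The frame `toy` (all ones) and the name `explicit_keys` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.ones((5, 2)), columns=['a', 'b'],
                   index='Joe Steve Wes Jim Travis'.split())

# Let groupby call len on each index label...
by_func = toy.groupby(len).sum()

# ...or build the same keys by hand: one key per row, in row order.
explicit_keys = np.array([len(name) for name in toy.index])
by_keys = toy.groupby(explicit_keys).sum()

print(by_func)
print(by_keys)
```

Both results have index values 3, 5, and 6, and the row for 3 sums the three rows whose names have three letters.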


You can mix functions, dicts, and Series - everything gets converted to arrays internally. I think that means that ultimately there's an array of key values, regardless of where it comes from.

I think this means that the next example is grouping by two key values - first by the number returned by len, and then by the set of key values in the passed list (i.e., here, by grouping the first three rows and then the second two rows).

In [79]:
key_list = 'one one one two two'.split()
people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.660761,0.602803,-2.505254,-0.585075,0.600022
3,two,-0.357893,0.164575,0.987413,0.715731,0.341701
5,one,0.313131,0.59518,-0.223885,-0.853578,1.731554
6,two,0.479686,-0.866455,-1.169262,-1.457515,-0.909903


##Grouping with index levels

If you have a hierarchical index, you can group using one or more of the levels of the index, by passing the level number(s) or name(s) using the level keyword.

In [80]:
columns = pd.MultiIndex.from_arrays(['US US US JP JP'.split(),
                                    [1,3,5,1,3]],
                                    names=['cty','tenor'])
columns

MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],
           labels=[[1, 1, 1, 0, 0], [0, 1, 2, 0, 1]],
           names=['cty', 'tenor'])

In [82]:
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,-1.052541,0.5983,0.715088,-0.309342,-0.794973
1,-0.709837,0.118723,0.336335,0.114131,0.713223
2,0.631868,0.363225,1.32401,0.4405,-0.222555
3,0.865379,-2.176058,0.860176,-0.494471,-0.327906


In [83]:
hier_df.groupby(level='cty', axis=1).count()

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
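
On grouping by more than one level: the `level` keyword also accepts a list. The sketch below uses a hypothetical Series `s` with a two-level row index (rather than the column index above, to avoid `axis=1`) and fixed values so the sums are easy to verify.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(['US US US JP JP'.split(), [1, 3, 1, 1, 3]],
                                  names=['cty', 'tenor'])
s = pd.Series([10, 20, 30, 40, 50], index=index)

# Group by a single level...
one_level = s.groupby(level='cty').sum()

# ...or by several levels at once.
two_levels = s.groupby(level=['cty', 'tenor']).sum()

print(one_level)
print(two_levels)
```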


#Data aggregation

Here 'aggregation' means any transformation of data that produces scalar values from arrays - mean, count, min, sum, etc.

You can use these, or call other methods that are defined on the groupby object, or write your own.

In [84]:
df

Unnamed: 0,data1,data2,key1,key2
0,-1.142105,-0.427882,a,one
1,0.143537,0.052476,a,two
2,-0.295989,0.101438,b,one
3,-1.844569,0.606259,b,two
4,1.526356,-1.16785,a,one


In [88]:
grouped = df.groupby('key1')
grouped.size()

key1
a    3
b    2
dtype: int64

Difference between count and size? size returns the 'size of the group': the number of entities (rows, columns) grouped under each key value, NAs included. Accordingly, you get a Series with one row per unique key value, and that's what we see above.

The count method instead is applied to each non-key column separately and counts only the non-NA values in that column, so it returns a DataFrame. As below.

Since df has no NAs, the two report the same numbers here; they differ only when a group contains missing values.

In [89]:
grouped.count()

Unnamed: 0_level_0,data1,data2,key2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,3,3,3
b,2,2,2
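
A small sketch of how size and count differ once missing values are involved. The frame `toy` is made up for illustration and has one NaN in group 'a'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'x': [1.0, np.nan, 3.0]})

sizes = toy.groupby('key').size()         # rows per group, NaN rows included
counts = toy.groupby('key')['x'].count()  # non-NA values of x per group

print(sizes)
print(counts)
```

Group 'a' has two rows but only one non-NA value of x, so size and count disagree for it.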


If we group by multiple keys and then call size and count, what do we get?

In [90]:
df.groupby(['key1','key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

So we still get a Series from size, but now it has a hierarchical index, with the individual size of each group identified by each unique combination of key values.

In [91]:
df.groupby(['key1','key2']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,2,2
a,two,1,1
b,one,1,1
b,two,1,1


We don't have to only call GroupBy methods - if we have a SeriesGroupBy object, we can call any method defined for a Series, I think.

In [92]:
grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x10751fba8>

In [93]:
grouped['data1'] # this is a SeriesGroupBy

<pandas.core.groupby.SeriesGroupBy object at 0x10752ac50>

The quantile method is defined on Series, so we'll try it here, applied to each group. "Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) for each piece" - where a 'piece' is a group with a common key value, I think - "and then assembles those results together into the result object."

In [94]:
grouped['data1'].quantile(0.9)

key1
a    1.249792
b   -0.450847
Name: data1, dtype: float64

In [96]:
Series.quantile?

In [100]:
Series([-0.295989, -1.844569]).quantile(0.9)

-0.450847
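
To see where -0.450847 comes from, here is a sketch of quantile by linear interpolation between the two nearest sorted values, which I believe is the default behaviour of Series.quantile. The helper `linear_quantile` is my own illustration, not a pandas function.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * q                # fractional position in the sorted data
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)  # guard for q == 1
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])

group_b = [-0.295989, -1.844569]
manual = linear_quantile(group_b, 0.9)
via_pandas = pd.Series(group_b).quantile(0.9)

print(round(manual, 6), round(via_pandas, 6))
```

With two values the 0.9 quantile sits 90% of the way from the smaller to the larger: -1.844569 + 0.9 * 1.548580.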

You can use your own aggregation functions by passing any function that aggregates an array to the aggregate or agg method.

In [101]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2.668461,1.220325
b,1.54858,0.504821


The book says that 'describe' isn't an aggregation, 'strictly speaking', but still works. It's called on each group.

In [103]:
grouped.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,count,3.0,3.0
a,mean,0.17593,-0.514419
a,std,1.334525,0.614748
a,min,-1.142105,-1.16785
a,25%,-0.499284,-0.797866
a,50%,0.143537,-0.427882
a,75%,0.834947,-0.187703
a,max,1.526356,0.052476
b,count,2.0,2.0
b,mean,-1.070279,0.353848


##More advanced, using restaurant tipping dataset

In [104]:
tips = pd.read_csv('../pydata-book/ch08/tips.csv')
len(tips)

244

In [105]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [106]:
# add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [107]:
tips.describe()

Unnamed: 0,total_bill,tip,size,tip_pct
count,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,0.160803
std,8.902412,1.383638,0.9511,0.061072
min,3.07,1.0,1.0,0.035638
25%,13.3475,2.0,2.0,0.129127
50%,17.795,2.9,2.0,0.15477
75%,24.1275,3.5625,3.0,0.191475
max,50.81,10.0,6.0,0.710345


In [108]:
tips[tips['tip_pct'] > 0.5]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


Let's look at tips by sex and smoking status.

In [109]:
grouped = tips.groupby(['sex', 'smoker'])

In [110]:
grouped_tip_pct = grouped['tip_pct']
grouped_tip_pct.agg('mean')

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [111]:
grouped_tip_pct.mean()

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

In [112]:
# apply multiple functions
grouped_tip_pct.agg(['mean','std',peak_to_peak])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,No,0.156921,0.036421,0.195876
Female,Yes,0.18215,0.071595,0.360233
Male,No,0.160669,0.041849,0.220186
Male,Yes,0.152771,0.090588,0.674707


In [114]:
# provide a name instead of using the function name (helpful for lambdas)
grouped_tip_pct.agg([('foo','mean'), ('bar', np.std)])

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,bar
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,0.156921,0.036421
Female,Yes,0.18215,0.071595
Male,No,0.160669,0.041849
Male,Yes,0.152771,0.090588


DataFrames provide more options - you can provide a list of functions to apply to all of the columns, or different functions to different columns.

In [115]:
functions = ['count','mean','max']
result = grouped[['tip_pct','total_bill']].agg(functions)  # select multiple columns with a list
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Female,No,54,0.156921,0.252672,54,18.105185,35.83
Female,Yes,33,0.18215,0.416667,33,17.977879,44.3
Male,No,97,0.160669,0.29199,97,19.791237,48.33
Male,Yes,60,0.152771,0.710345,60,22.2845,50.81


In [116]:
result['tip_pct']

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,max
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Female,No,54,0.156921,0.252672
Female,Yes,33,0.18215,0.416667
Male,No,97,0.160669,0.29199
Male,Yes,60,0.152771,0.710345


There are a few more examples in the book, on p263 and the top of p264.
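
As a sketch of the 'different functions to different columns' option, agg accepts a dict mapping column names to function names. The frame `toy` below is a tiny made-up stand-in for the tips data, so the numbers are not from tips.csv.

```python
import pandas as pd

toy = pd.DataFrame({'smoker': ['No', 'No', 'Yes', 'Yes'],
                    'tip': [1.0, 3.0, 2.0, 5.0],
                    'size': [2, 4, 1, 3]})

# A different aggregation per column: largest tip, total party size.
per_column = toy.groupby('smoker').agg({'tip': 'max', 'size': 'sum'})
print(per_column)
```

The result has one column per dict entry, rather than the hierarchical columns you get when a list of functions is applied to every column.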

##Return aggregated data in 'unindexed' form

Up to this point the aggregated data always comes back with an index made up of the unique group key combinations that exist in the data. If you group by multiple keys, you get a hierarchical index.

You can disable this, and get back a DataFrame, using as_index=False.

In [117]:
tips.groupby(['sex','smoker'], as_index=False).mean()

Unnamed: 0,sex,smoker,total_bill,tip,size,tip_pct
0,Female,No,18.105185,2.773519,2.592593,0.156921
1,Female,Yes,17.977879,2.931515,2.242424,0.18215
2,Male,No,19.791237,3.113402,2.71134,0.160669
3,Male,Yes,22.2845,3.051167,2.5,0.152771


In [118]:
# compare to
tips.groupby(['sex','smoker']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,18.105185,2.773519,2.592593,0.156921
Female,Yes,17.977879,2.931515,2.242424,0.18215
Male,No,19.791237,3.113402,2.71134,0.160669
Male,Yes,22.2845,3.051167,2.5,0.152771


Or, you can take the returned DataFrame with the hierarchical index and convert it to the as_index=False format, using reset_index(). (DataFrame.reset_index? says 'For DataFrame with multi-level index, return new DataFrame with labeling information in the columns under the index names'.)

In [121]:
tips.groupby(['sex','smoker']).mean().reset_index()

Unnamed: 0,sex,smoker,total_bill,tip,size,tip_pct
0,Female,No,18.105185,2.773519,2.592593,0.156921
1,Female,Yes,17.977879,2.931515,2.242424,0.18215
2,Male,No,19.791237,3.113402,2.71134,0.160669
3,Male,Yes,22.2845,3.051167,2.5,0.152771


#Group-wise operations and transformations