## Groupby docs

This example demonstrates two things:

1.  Grouping by column (axis = 1) instead of by row (axis = 0); grouping by row is the default.
2.  Using a function to compute the grouping labels.

Consider first what happens when a categorical column (such as gender) is used to group rows.
A categorical column can be thought of as a function from rows to labels.
So

df.groupby(col_name)

applies that function to every row, and all rows with the same label go in the same group.
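
To make the "column as a function" reading concrete, here is a minimal sketch on a small hypothetical frame. The `people` data and the `label_of_row` helper are illustrative assumptions and are not part of the example that follows. Grouping by the column name and grouping by an explicit function of the row label should produce the same groups.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
people = pd.DataFrame(
    {
        "gender": ["f", "m", "f", "m"],
        "age": [31, 45, 27, 52],
    }
)

# Grouping by the column name...
by_name = people.groupby("gender")


# ...matches grouping by a function that maps each row label to its gender.
def label_of_row(row_label):
    return people.loc[row_label, "gender"]


by_function = people.groupby(label_of_row)

# Collect the row labels that fall under each group label.
groups_by_name = {k: list(v) for k, v in by_name.groups.items()}
groups_by_function = {k: list(v) for k, v in by_function.groups.items()}
print(groups_by_name)
print(groups_by_function)
```

When a function is passed to `groupby`, it is called on the index labels of the axis being grouped, which is why `label_of_row` has to look the value up in the column.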

In this example, we use an arbitrary function to label columns, and then split
a DataFrame with 4 columns into 2 DataFrames of 3 columns and 1 column respectively.
Note that the function is a function on strings, because groupby applies
the function to the column names rather than to the columns themselves.
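
One way to check that the function sees column names rather than column values is to inspect the resulting groups. The sketch below transposes the frame, so that column names become row labels, and then groups by the function. The transpose is an assumption made here so the snippet does not rely on the `axis` argument, and the small `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the example.
demo = pd.DataFrame(
    {
        "A": ["foo", "bar"],
        "B": ["one", "two"],
        "C": [0.1, 0.2],
        "D": [0.3, 0.4],
    }
)


def get_letter_type(letter):
    # Called with a column name (a string), not with the column's values.
    return "vowel" if letter.lower() in "aeiou" else "consonant"


# After transposing, the column names are the index, so groupby maps the
# function over them; .groups lists the names that received each label.
column_groups = {
    label: list(names)
    for label, names in demo.T.groupby(get_letter_type).groups.items()
}
print(column_groups)
```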

In [19]:
import pandas as pd 
import numpy as np

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'
    
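# Group the columns (axis=1); get_letter_type is called with each column name.
# Note: axis=1 in DataFrame.groupby is deprecated as of pandas 2.1.
grouped = df.groupby(get_letter_type, axis=1)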

In [20]:
grouped.get_group('vowel')

|   | A   |
|---|-----|
| 0 | foo |
| 1 | bar |
| 2 | foo |
| 3 | bar |
| 4 | foo |
| 5 | bar |
| 6 | foo |
| 7 | foo |


In [21]:
grouped.get_group('consonant')

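|   | B     | C         | D         |
|---|-------|-----------|-----------|
| 0 | one   | -0.414043 | 1.01371   |
| 1 | one   | -0.191214 | 0.797889  |
| 2 | two   | 0.588714  | -0.009055 |
| 3 | three | 0.085814  | -0.36701  |
| 4 | two   | 1.063588  | 2.041712  |
| 5 | two   | 1.319786  | 1.395074  |
| 6 | one   | -0.421134 | 0.208063  |
| 7 | three | -0.20631  | -0.100299 |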


Alternatively, suppose we want to group using a function of the columns themselves (their values and dtype) instead of the column names. Here's the cleanest way to do that:

In [82]:
def get_data_type(col):
    """
    Note that the labels returned should be strings,
    or at least instances that support `<`, i.e., that have an ordering defined,
    since in some contexts the labels will be sorted.
    """
    if col.dtype == np.float64:
        return 'fl'
    else:
        return 'o'

# With axis=0, apply passes each column (a vector running down the rows)
# to the function, so we get one label per column: 4 labels here.
# This mirrors np.ones((3, 4)).sum(axis=0), which returns [3., 3., 3., 3.].
label_series = df.apply(get_data_type, axis=0)
# The labels are indexed by column name, so they run along the column axis.
grouped = df.groupby(label_series, axis=1)

In [78]:
label_series

A     o
B     o
C    fl
D    fl
dtype: object

In [80]:
grouped.get_group('fl')

|   | C         | D         |
|---|-----------|-----------|
| 0 | -0.414043 | 1.01371   |
| 1 | -0.191214 | 0.797889  |
| 2 | 0.588714  | -0.009055 |
| 3 | 0.085814  | -0.36701  |
| 4 | 1.063588  | 2.041712  |
| 5 | 1.319786  | 1.395074  |
| 6 | -0.421134 | 0.208063  |
| 7 | -0.20631  | -0.100299 |


In [81]:
grouped.get_group('o')

|   | A   | B     |
|---|-----|-------|
| 0 | foo | one   |
| 1 | bar | one   |
| 2 | foo | two   |
| 3 | bar | three |
| 4 | foo | two   |
| 5 | bar | two   |
| 6 | foo | one   |
| 7 | foo | three |
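
The same split by dtype can be sketched without the `axis` argument by grouping the label Series by its own values and using each group's index to select columns. This is an alternative assumed here for illustration, not the approach used in the cells above; the `demo` frame and the `pieces` dictionary are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with two text columns and two float columns.
demo = pd.DataFrame(
    {
        "A": ["foo", "bar"],
        "B": ["one", "two"],
        "C": np.array([0.1, 0.2]),
        "D": np.array([0.3, 0.4]),
    }
)


def get_data_type(col):
    # col is a whole column (a Series), so its dtype is available.
    return "fl" if col.dtype == np.float64 else "o"


# One label per column, indexed by column name.
label_series = demo.apply(get_data_type, axis=0)

# Group the labels by their own values; each group's index holds the
# names of the columns that share that label.
pieces = {
    label: demo[list(names.index)]
    for label, names in label_series.groupby(label_series)
}
print({label: list(piece.columns) for label, piece in pieces.items()})
```

By default `groupby` sorts the group keys, which is why the labels need a defined ordering, as the docstring of `get_data_type` above points out.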


Finally, you can simply pass in a list of labels with the same
length as the axis you are grouping along:

In [26]:
grouped = df.groupby(['In','Out','In','Out'], axis=1)

In [27]:
grouped.first()

|   | In  | Out   |
|---|-----|-------|
| 0 | foo | one   |
| 1 | bar | one   |
| 2 | foo | two   |
| 3 | bar | three |
| 4 | foo | two   |
| 5 | bar | two   |
| 6 | foo | one   |
| 7 | foo | three |


In [28]:
grouped.get_group('In')

|   | A   | C         |
|---|-----|-----------|
| 0 | foo | -0.414043 |
| 1 | bar | -0.191214 |
| 2 | foo | 0.588714  |
| 3 | bar | 0.085814  |
| 4 | foo | 1.063588  |
| 5 | bar | 1.319786  |
| 6 | foo | -0.421134 |
| 7 | foo | -0.20631  |
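
The list of labels is matched to the columns by position. A minimal sketch of that pairing, written in plain Python on a hypothetical `demo` frame (the name `columns_by_label` is also an assumption made for illustration):

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the example.
demo = pd.DataFrame(
    {
        "A": ["foo", "bar"],
        "B": ["one", "two"],
        "C": [0.1, 0.2],
        "D": [0.3, 0.4],
    }
)
labels = ["In", "Out", "In", "Out"]

# The labels are paired with columns by position, so the lengths must agree.
assert len(labels) == len(demo.columns)

# Collect the column names that received each label.
columns_by_label = {}
for name, label in zip(demo.columns, labels):
    columns_by_label.setdefault(label, []).append(name)

in_frame = demo[columns_by_label["In"]]    # columns A and C
out_frame = demo[columns_by_label["Out"]]  # columns B and D
print(columns_by_label)
```

This pairing also explains the `grouped.first()` output above: for each row it takes the first non-null value within each group, which here comes from A for 'In' and from B for 'Out'.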
