This notebook contains experiments to better understand Pandas aggregation and grouping.

It's based on the [aggregation and grouping section of the Python Data Science Handbook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb#scrollTo=54U87NFis0yC).

## `DataFrame` for experiments

In [1]:
import numpy as np
import pandas as pd

The code uses this `DataFrame` for experiments.

In [2]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


## Filtering

> A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value.

The code below logs each call to the filtering function, so we can follow what Pandas passes in and what the function decides for each group.

In [3]:
iteration = 0  # counts how many times Pandas calls filter_func
def filter_func(x):
    global iteration
    iteration += 1
    # Keep the group only if the standard deviation of data2 exceeds 4
    s = x['data2'].std()
    s_gt_4 = s > 4

    print('\nPass #{:-<20}'.format(iteration))
    print('Input:')
    print('type={}'.format(type(x)))
    print(x)

    print('\nOutput:')
    print('std()={:.2f}'.format(s))
    print('std() is {} 4 - {}'
          .format('>' if s_gt_4 else '<=',
                  'keep' if s_gt_4 else 'discard'))

    # True keeps every row of the group, False drops them all
    return s_gt_4


print('The full DataFrame:')
print(df)

print('\nFiltering...')
f = df.groupby('key').filter(filter_func)
print('\nFiltered DataFrame:')
print(f)


The full DataFrame:
  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9

Filtering...

Pass #1-------------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
  key  data1  data2
0   A      0      5
3   A      3      3

Output:
std()=1.41
std() is <= 4 - discard

Pass #2-------------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
  key  data1  data2
1   B      1      0
4   B      4      7

Output:
std()=4.95
std() is > 4 - keep

Pass #3-------------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
  key  data1  data2
2   C      2      3
5   C      5      9

Output:
std()=4.24
std() is > 4 - keep

Filtered DataFrame:
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
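
Once the logging has served its purpose, the same filter can be written more compactly. The sketch below is not from the textbook. It rebuilds the `DataFrame` with `data2` written out literally (the values shown in the output above) and compares two ways of keeping groups whose `data2` standard deviation exceeds 4: `filter` with a lambda, and a boolean mask built with `transform('std')`. The names `filtered`, `group_std` and `masked` are only illustrative.

```python
import pandas as pd

# Same data as the DataFrame above; data2 is written out literally
# so the example does not depend on the random number generator.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Variant 1: filter() with a lambda, called once per group.
filtered = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Variant 2: broadcast each group's std back onto its rows,
# then select rows with an ordinary boolean mask.
group_std = df.groupby('key')['data2'].transform('std')
masked = df[group_std > 4]

print(filtered)
print(masked)
```

Both variants should keep the rows of groups `B` and `C` and drop group `A`, matching the logged run above. The mask variant avoids calling a Python function per group, which may matter on larger data.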


## Transformation

> While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.
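
To make the shape difference concrete, here is a minimal sketch (not from the textbook) that contrasts an aggregation with a transformation on the same data. It assumes the `data2` values shown earlier in this notebook and uses the built-in `'mean'` name rather than a custom function. The names `cols`, `aggregated`, `group_means` and `centered` are only illustrative.

```python
import pandas as pd

# Same data as the experiment DataFrame, with data2 written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

cols = ['data1', 'data2']
grouped = df.groupby('key')[cols]

# Aggregation: one row per group (A, B, C).
aggregated = grouped.mean()

# Transformation: one row per input row, aligned with df's index.
group_means = grouped.transform('mean')

# Center each value on its own group's mean.
centered = df[cols] - group_means

print(aggregated.shape)  # rows = number of groups
print(centered.shape)    # rows = number of rows in df
print(centered)
```

Because the transformed result keeps the original index, it can be subtracted from, or assigned back into, the original `DataFrame` directly.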

The original code in the textbook uses a lambda function:
    
```
df.groupby('key').transform(lambda x: x - x.mean())
```

To see what is happening, we replace the lambda with a named function that logs its input and its result on every call.

In [4]:
iteration = 0
def fmean(x):
    global iteration
    iteration += 1

    print('\nPass #{} ------------'.format(iteration))
    print('Input:')
    print('type={}'.format(type(x)))
    print(x)
    print('x.mean()={}'.format(x.mean()))

    print('\nTransformed:')
    transformed = x - x.mean()
    print(transformed)

    return transformed

print('\nTransforming...')
t = df.groupby('key').transform(fmean)
print('\nTransformed DataFrame:')
print(t)


Transforming...

Pass #1 ------------
Input:
type=<class 'pandas.core.series.Series'>
0    0
3    3
Name: data1, dtype: int64
x.mean()=1.5

Transformed:
0   -1.5
3    1.5
Name: data1, dtype: float64

Pass #2 ------------
Input:
type=<class 'pandas.core.series.Series'>
0    5
3    3
Name: data2, dtype: int64
x.mean()=4.0

Transformed:
0    1.0
3   -1.0
Name: data2, dtype: float64

Pass #3 ------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
   data1  data2
0      0      5
3      3      3
x.mean()=data1    1.5
data2    4.0
dtype: float64

Transformed:
   data1  data2
0   -1.5    1.0
3    1.5   -1.0

Pass #4 ------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
   data1  data2
1      1      0
4      4      7
x.mean()=data1    2.5
data2    3.5
dtype: float64

Transformed:
   data1  data2
1   -1.5   -3.5
4    1.5    3.5

Pass #5 ------------
Input:
type=<class 'pandas.core.frame.DataFrame'>
   data1  data2
2      2      3
5      5      9
x.mean()=data1    3.5
data2    6.