# Aggregation and Grouping
________________

In [1]:
import numpy as np
import pandas as pd

An essential part of analyzing large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset. This notebook covers two kinds of aggregation:
* simple operations
* ``groupby``-based operations

## 1. Simple aggregation
___________________

#### 1.1. Aggregation for ``Series``  
------------------

* aggregates return a single value:

In [2]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [3]:
ser.sum()

2.811925491708157

In [4]:
ser.mean()

0.5623850983416314

#### 1.2. Aggregation for ``DataFrame``
__________________

* by default the aggregates return results within **each** column:

In [5]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [6]:
df.sum()

A    2.389442
B    2.217101
dtype: float64

In [7]:
df.mean()

A    0.477888
B    0.443420
dtype: float64

* to aggregate within each row instead, specify the ``axis`` argument:

In [8]:
df.sum(axis='columns')

0    0.176579
1    1.027993
2    1.698619
3    0.813454
4    0.889898
dtype: float64

In [9]:
df.mean(axis='columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

#### 1.3. ``DataFrame`` and ``Series`` aggregation methods:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0) |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |
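
As a rough check of the table above, the sketch below runs several of these aggregates on a small hand-made ``Series`` (the values are an assumption chosen so the results are easy to verify by hand). Since ``mad()`` is not available in pandas 2.0 and later, the mean absolute deviation is computed directly from its definition.

```python
import pandas as pd

# Illustrative data, not taken from the notebook above.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

summary = {
    'count': s.count(),
    'mean': s.mean(),
    'median': s.median(),
    'min': s.min(),
    'max': s.max(),
    'var': s.var(),    # sample variance (ddof=1 by default)
    'prod': s.prod(),
    'sum': s.sum(),
}

# Mean absolute deviation from its definition; this does not rely on mad().
mean_abs_dev = (s - s.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```

Note that ``std()`` and ``var()`` use the sample definition (``ddof=1``) by default, which differs from NumPy's default of ``ddof=0``.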

* `describe()` -- a `Series` and `DataFrame` method that computes several **common** aggregates for each column and returns them together

In [10]:
df.describe()

Unnamed: 0,A,B
count,5.0,5.0
mean,0.477888,0.44342
std,0.353125,0.426952
min,0.058084,0.020584
25%,0.155995,0.181825
50%,0.601115,0.212339
75%,0.708073,0.832443
max,0.866176,0.96991


In [11]:
dir(ser)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_TO_AXIS_NUMBER',
 ...]

## 2. ``DataFrame.groupby()`` 
_________________________________________

In [12]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


#### 2.1. Aggregating **conditionally** on some label or index
_______________

* Using the name of the desired key column:

In [13]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B4988CA940>

#### 2.2. ``DataFrameGroupBy`` :
____________________
* **"lazy evaluation"** -- a special **view** of the ``DataFrame`` that does no actual computation until an aggregation is applied
* once an aggregate (any valid ``DataFrame`` operation) is called, it performs the appropriate **apply** and **combine** steps and returns the result

In [15]:
df.groupby('key').count()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,2
B,2
C,2


In [16]:
df.groupby('key')['data'].sum()

key
A    3
B    5
C    7
Name: data, dtype: int64
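
The lazy behaviour described above can be sketched as follows. This is a minimal illustration that rebuilds the same small ``key``/``data`` frame; the variable names ``grouped`` and ``totals`` are introduced here only for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Creating the GroupBy object computes no aggregate yet.
grouped = df.groupby('key')
print(type(grouped).__name__)   # DataFrameGroupBy

# Calling an aggregate triggers the apply and combine steps.
totals = grouped.sum()
print(totals)
```

The same ``grouped`` object can be reused for several aggregates without regrouping the data by hand.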

#### 2.3. The ``split``, ``apply``, ``combine`` pattern
__________________

In [14]:
df.groupby('key').sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


![03.08-split-apply-combine.png](attachment:03.08-split-apply-combine.png)
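
To make the three steps concrete, the sketch below performs split, apply, and combine by hand and compares the result with ``groupby``. It is only an illustration of the idea under the same toy data, not how pandas implements it internally; the names ``pieces``, ``sums``, and ``manual`` are introduced here for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# split: one sub-frame per distinct key value
pieces = {k: df[df['key'] == k] for k in sorted(df['key'].unique())}

# apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# combine: assemble the per-group results into one object
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

# The built-in groupby does all three steps in one call.
built_in = df.groupby('key')['data'].sum()

print(manual)
print(built_in)
```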

In [17]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### 2.4. ``GroupBy`` by ``aggregate()`` -- applying **different** aggregates and computing them all at once
_________________________________________________________________
* using a string, a function, or a list thereof:

In [18]:
df.groupby('key').aggregate(['min', np.median, max])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


* a dictionary mapping column names to operations to be applied on that column:

In [19]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9
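
The two calling styles above return differently shaped results, which the sketch below illustrates. It rebuilds the frame with the ``data2`` values shown earlier hard-coded (so it does not depend on the random generator) and passes all aggregates as strings; the variable names are introduced here for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# A list of aggregates yields two-level columns:
# (original column, aggregate name).
multi = df.groupby('key').aggregate(['min', 'median', 'max'])
data2_max = multi[('data2', 'max')]      # select one column by its tuple

# A dictionary yields one flat column per entry.
per_column = df.groupby('key').aggregate({'data1': 'min',
                                          'data2': 'max'})

print(multi.columns.nlevels)
print(data2_max)
print(per_column)
```

Selecting with a tuple such as ``('data2', 'max')`` is one way to pull a single aggregate out of the two-level result.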


#### 2.5. ``GroupBy`` by ``transform()`` --  transformation of the **full** data
________________________________
* can return some transformed version of the full data to recombine
* the output is the same shape as the input

In [21]:
#  center the data by subtracting the group-wise mean
df.groupby('key').transform(lambda x: x - x.mean())

Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
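
As a quick sanity check of the two points above, the sketch below confirms that the transformed result keeps the input's shape and that each group's mean is zero after centering. The frame is rebuilt with the ``data2`` values hard-coded, and the numeric columns are selected explicitly; the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

numeric = df[['data1', 'data2']]

# Center each column by subtracting its group-wise mean.
centered = df.groupby('key')[['data1', 'data2']].transform(
    lambda x: x - x.mean())

# Same shape and same index as the input columns.
print(centered.shape == numeric.shape)

# After centering, every group mean is (numerically) zero.
group_means = centered.groupby(df['key']).mean()
print(group_means)
```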


#### 2.6. ``GroupBy`` by ``apply()`` -- applying a function to the group
________________________________
* applying an **arbitrary** function to the group results
* function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar 
* the combine operation will be tailored to the type of output returned

In [1]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [22]:
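def norm_by_data2(x):
    # x is a DataFrame of group values; work on a copy so the
    # group passed in by pandas is not modified in place
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x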

display('df', "df.groupby('key').apply(norm_by_data2)")

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.142857,0
2,C,0.166667,3
3,A,0.375,3
4,B,0.571429,7
5,C,0.416667,9
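
The point about the combine step depending on the return type can be sketched as follows. The helper functions ``data2_range`` and ``normalize`` are hypothetical names introduced for this example, and the numeric columns are selected explicitly so the functions only see ``data1`` and ``data2``. The exact index of the ``DataFrame`` result can differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

grouped = df.groupby('key')[['data1', 'data2']]

def data2_range(group):
    # Returns a scalar, so the results combine into a Series
    # with one entry per key.
    return group['data2'].max() - group['data2'].min()

def normalize(group):
    # Returns a DataFrame, so the results combine into a DataFrame.
    group = group.copy()
    group['data1'] = group['data1'] / group['data2'].sum()
    return group

spread = grouped.apply(data2_range)
normed = grouped.apply(normalize)

print(type(spread).__name__)
print(type(normed).__name__)
```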


#### 3. Column indexing
_________________

* ``GroupBy``  supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object
* no computation is done until we call some aggregate on the object

In [23]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B49AC71880>

In [24]:
df.groupby('key')['data1']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B49AC71E80>

In [25]:
df.groupby('key')['data1'].sum()

key
A    3
B    5
C    7
Name: data1, dtype: int64

#### 4. Iteration over groups
__________________

*  the ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``
*  this **can be** useful for doing certain things manually, though it is often much faster to use the built-in ``apply`` functionality

In [2]:
for (key, group) in df.groupby('key'):
    print("{0}: shape={1}".format(key, group.shape))

A: shape=(2, 3)
B: shape=(2, 3)
C: shape=(2, 3)

#### 5. Dispatch methods
_________________

* any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.

For example, the ``describe()`` method of a ``DataFrame`` can be used to perform a set of aggregations that describe each group in the data:

In [None]:
df.groupby('key')['data1'].describe()
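
A self-contained version of this call is sketched below, together with a manual equivalent that calls ``describe()`` on each group in turn. The frame is rebuilt with the ``data2`` values hard-coded, and ``manual`` is a name introduced only for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# describe() called on the GroupBy object: one row of statistics per group.
described = df.groupby('key')['data1'].describe()

# The same statistics, obtained by describing each group one at a time.
manual = pd.DataFrame({key: group.describe()
                       for key, group in df.groupby('key')['data1']}).T

print(described)
print(manual)
```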