# Aggregation and Grouping
________________

In [None]:
import numpy as np
import pandas as pd

An essential piece of analysis of large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset. This notebook covers:
* simple aggregation operations
* ``groupby``-based operations

## 1. Simple aggregation
___________________

#### 1.1. Aggregation for ``Series``  
__________________

* aggregates return a single value:

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

In [None]:
ser.sum()

In [None]:
ser.mean()

#### 1.2. Aggregation for ``DataFrame``
__________________

* by default, aggregates are computed within **each** column and returned as a ``Series`` indexed by column name:

In [None]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

In [None]:
df.sum()

In [None]:
type(df.sum())

In [None]:
df.mean()

* to aggregate within each row instead, specify the ``axis`` argument:

In [None]:
df.sum(axis='columns')

In [None]:
df.mean(axis='columns')
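
Because the frames above hold random values, the effect of ``axis`` is easier to check on hand-written numbers. The following is a minimal sketch; the frame ``toy`` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Small hand-written frame so the results can be verified by eye.
toy = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                    'B': [10.0, 20.0, 30.0]})

col_sums = toy.sum()                 # default: one value per column
row_sums = toy.sum(axis='columns')   # one value per row

print(col_sums)
print(row_sums)
```

The default result is indexed by the column labels, while the ``axis='columns'`` result is indexed by the row labels.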

#### 1.3. ``DataFrame`` and ``Series`` aggregation methods:

| Aggregation              | Description                                     |
|--------------------------|-------------------------------------------------|
| ``count()``              | Number of non-null items                        |
| ``mean()``, ``median()`` | Mean and median                                 |
| ``min()``, ``max()``     | Minimum and maximum                             |
| ``std()``, ``var()``     | Standard deviation and variance                 |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items                            |
| ``sum()``                | Sum of all items                                |
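
On pandas versions without ``mad()``, the same quantity can be assembled from the other aggregates in the table. This is a sketch under the assumption that the deviation is taken around the mean; the series ``values`` is a made-up example.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean absolute deviation around the mean, built from basic aggregates:
# subtract the mean, take absolute values, then average them.
mean_abs_dev = (values - values.mean()).abs().mean()
print(mean_abs_dev)
```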

* `describe()` -- a `Series` and `DataFrame` method that computes several **common** aggregates for each column and returns them together in one result

In [None]:
df.describe()

In [None]:
dir(ser)

## 2. ``DataFrame.groupby()`` 
_________________________________________

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

#### 2.1. Aggregating **conditionally** on some label or index
_______________

* Using the name of the desired key column:

In [None]:
df.groupby('key')

#### 2.2. ``DataFrameGroupBy`` :
____________________
* **"lazy evaluation"** -- a special **view** of the ``DataFrame`` that is ready to be split into groups but does no actual computation until an aggregation is applied
* once an aggregate (or any other valid ``DataFrame`` operation) is applied, the appropriate **apply & combine** steps are performed

In [None]:
df.groupby('key').count()

In [None]:
df.groupby('key')['data'].sum()
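
The lazy behaviour can be made visible by separating the grouping step from the aggregation step. A minimal sketch, using a stand-alone frame named ``toy`` (a hypothetical name chosen so the notebook's ``df`` is left untouched):

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

grouped = toy.groupby('key')        # only records how to split; no sums yet
print(type(grouped).__name__)

totals = grouped['data'].sum()      # the aggregation is computed here
print(totals)
```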

#### 2.3. The ``split`` & ``apply`` & ``combine`` pattern
__________________

In [None]:
df.groupby('key').sum()

![split_apply_combine.png](split_apply_combine.png)
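
The three steps in the figure can also be written out by hand, which shows what ``groupby`` does in a single call. This is an illustrative sketch only, not how pandas implements it internally; ``toy``, ``pieces``, ``sums``, and ``manual`` are names introduced here.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

# Split: one sub-frame per distinct key, selected with a boolean mask.
pieces = {k: toy[toy['key'] == k] for k in sorted(toy['key'].unique())}

# Apply: reduce each piece to a single number.
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='data')
manual.index.name = 'key'

print(manual)
print(toy.groupby('key')['data'].sum())
```

Both printed results contain the same per-key totals; the ``groupby`` version avoids the explicit loop and the intermediate sub-frames.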

In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

#### 2.4. ``GroupBy`` with ``aggregate()`` -- computing several **different** aggregates at once
_________________________________________________________________
* using a string, a function, or a list thereof:

In [None]:
# string names avoid the FutureWarning recent pandas raises for np.median and the built-in max
df.groupby('key').aggregate(['min', 'median', 'max'])

* a dictionary mapping column names to operations to be applied on that column:

In [None]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
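
A related form is pandas' named aggregation, where each keyword names an output column and maps to a ``(column, aggregate)`` pair. The sketch below uses hand-written ``data2`` values, and the output names ``data1_min`` and ``data2_max`` are arbitrary choices made for this example.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6),
                    'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword becomes an output column: name=(input column, aggregate).
summary = toy.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)
print(summary)
```

Compared with the dictionary form, this gives explicit control over the names of the resulting columns.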

#### 2.5. ``GroupBy`` with ``transform()`` -- transformation of the **full** data
________________________________
* can return some transformed version of the full data to recombine
* the output is the same shape as the input

In [None]:
#  center the data by subtracting the group-wise mean
df.groupby('key').transform(lambda x: x - x.mean())
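
The shape-preserving behaviour is perhaps easiest to see by broadcasting a group statistic back onto the original rows. A minimal sketch with a hypothetical frame ``toy``; passing the string ``'mean'`` to ``transform`` is used here in place of the lambda above.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6)})

# One group mean per original row, aligned with toy's index.
group_mean = toy.groupby('key')['data1'].transform('mean')

# Centering then becomes an ordinary element-wise subtraction.
centered = toy['data1'] - group_mean

print(group_mean.tolist())
print(centered.tolist())
```

Unlike an aggregation, the result has six rows, one per input row, so it can be assigned straight back as a new column.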

#### 2.6. ``GroupBy`` with ``apply()`` -- applying a function to each group
________________________________
* applies an **arbitrary** function to each group
* the function should take a ``DataFrame`` and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar
* the combine operation is tailored to the type of output returned

In [None]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [None]:
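def norm_by_data2(x):
    # x is a DataFrame of group values; copy it so the input group is not mutated
    x = x.copy()
    # normalize data1 by the group's total of data2
    x['data1'] = x['data1'] / x['data2'].sum()
    return x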

display('df', "df.groupby('key').apply(norm_by_data2)")
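
When the applied function returns a scalar instead of a ``DataFrame``, the combine step produces one value per group. The sketch below assumes hand-written ``data2`` values and a hypothetical helper ``data1_share``; the columns are selected explicitly so the function only sees the data columns.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6),
                    'data2': [5, 0, 3, 3, 7, 9]})

def data1_share(group):
    # group is a DataFrame holding the selected columns for one key
    return group['data1'].sum() / group['data2'].sum()

# A scalar per group is combined into a Series indexed by key.
ratios = toy.groupby('key')[['data1', 'data2']].apply(data1_share)
print(ratios)
```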

#### 3. Column indexing
_________________

* ``GroupBy``  supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object
* no computation is done until we call some aggregate on the object

In [None]:
df.groupby('key')

In [None]:
dir(df.groupby('key'))

In [None]:
df.groupby('key')['data1']

In [None]:
df.groupby('key')['data1'].sum()
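
How the column is selected determines the kind of ``GroupBy`` object returned, and therefore the type of the final result. A small sketch on a made-up frame ``toy``:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6),
                    'data2': [5, 0, 3, 3, 7, 9]})

by_label = toy.groupby('key')['data1']     # single label
by_list = toy.groupby('key')[['data1']]    # list of labels

print(type(by_label).__name__, type(by_list).__name__)

# The aggregate is only computed when it is called on the selection.
series_result = by_label.sum()
frame_result = by_list.sum()
```

Selecting with a single label leads to a ``Series`` after aggregation, while selecting with a list keeps the result a ``DataFrame``.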

#### 4. Iteration over groups
__________________

*  the ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``
*  this **can be** useful for doing **certain** things manually, though it is often much faster to use the built-in ``apply``

In [None]:
for (key, group) in df.groupby('key'):
    print(f"{key}: shape={group.shape}")
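
Iteration is handy when each group needs to be collected into an ordinary Python structure. A minimal sketch, assuming a stand-alone frame ``toy``; the dictionary ``members`` is a name introduced for this example.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6)})

# Each iteration yields the group label and the matching sub-frame.
members = {key: group['data1'].tolist()
           for key, group in toy.groupby('key')}
print(members)
```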

#### 5. Dispatch methods
____________________

* any method not explicitly implemented by the ``GroupBy`` object is passed through and called on each group, whether the groups are ``DataFrame`` or ``Series`` objects

In [None]:
df.groupby('key')['data1'].describe()
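
The per-group results of such a call are combined into a single object. For ``describe()`` on a selected column, that object has one row per group and one column per statistic, which the following sketch checks on a made-up frame ``toy``:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data1': range(6)})

summary = toy.groupby('key')['data1'].describe()

print(summary.shape)
print(summary.columns.tolist())
print(summary.loc['A', 'mean'])
```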