## Groupby: split / apply / combine

![](https://i.imgur.com/hg5DYmU.png)

- **split**/apply/combine
- Groupby type → methods (get_group ; mean ; size) / iteration
- **apply** → aggregation vs transform vs apply
- pivot_table (like pivot, but it aggregates, which lets you leave out some index columns)
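
To make the last bullet concrete, here is a minimal sketch of `pivot_table` on made-up data (the `sales` frame and its values are hypothetical, not taken from `kpi.csv`). Because `brand` is left out of the index, several rows fall into the same cell and get aggregated; a plain `pivot` would raise an error on these duplicates.

```python
import pandas as pd

# Hypothetical data: France has two 'Revenue' rows (one per brand).
sales = pd.DataFrame({
    'country': ['France', 'France', 'France', 'Spain', 'Spain'],
    'brand': ['A', 'B', 'A', 'A', 'B'],
    'metric': ['Revenue', 'Revenue', 'Units', 'Revenue', 'Revenue'],
    'value': [10.0, 20.0, 5.0, 30.0, 40.0],
})

# 'brand' is not part of the index, so values are summed across brands.
wide = sales.pivot_table(index='country', columns='metric',
                         values='value', aggfunc='sum')
print(wide)

# The same numbers via groupby + unstack.
same = sales.groupby(['country', 'metric'])['value'].sum().unstack()
print(same)
```

Combinations with no rows (here Spain / Units) show up as NaN in both results.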

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': list('ABCABCABC'),
    'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]
})

In [2]:
df.groupby('key').sum().reset_index()

,key,data
0,A,15
1,B,30
2,C,45


In [3]:
# Without .reset_index, the columns used in .groupby remain in the index:
df.groupby('key').sum()

key,data
A,15
B,30
C,45
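
The object returned by `.groupby` can also be inspected before any aggregation. The sketch below reuses the same small frame to illustrate the methods listed at the top of this section (`size`, `get_group`, `mean`) and iteration over groups; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': list('ABCABCABC'),
    'data': [0, 5, 10, 5, 10, 15, 10, 15, 20],
})

grouped = df.groupby('key')          # split step only, nothing is aggregated yet

sizes = grouped.size()               # number of rows in each group
group_a = grouped.get_group('A')     # sub-dataframe where key == 'A'
means = grouped['data'].mean()       # one aggregated value per group

# Iterating yields (key, sub-dataframe) pairs.
for key, sub in grouped:
    print(key, sub['data'].tolist())
```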


In [4]:
kpi = pd.read_csv('kpi.csv')

In [5]:
kpi

,date,countries,type,metric,value
0,1/1/18,France,Brand A,Revenue,161.0
1,1/1/18,France,Brand A,Units,85.0
2,1/1/18,France,Brand B,Revenue,184.0
3,1/1/18,France,Brand B,Units,65.0
4,1/1/18,France,Brand C,Revenue,85.0
...,...,...,...,...,...
475,12/1/18,Spain,Brand B,Units,196.0
476,12/1/18,Spain,Brand C,Revenue,243.0
477,12/1/18,Spain,Brand C,Units,285.0
478,12/1/18,Spain,Brand D,Revenue,165.0


In [6]:
kpi['date'] = pd.to_datetime(kpi.date)

In [7]:
# Select 'value' explicitly: recent pandas versions no longer drop non-numeric columns silently
kpi.groupby(['countries', 'metric'])[['value']].sum()

countries,metric,value
France,Revenue,8105.0
France,Units,8442.0
Germany,Revenue,7915.0
Germany,Units,8235.0
Italy,Revenue,6500.0
Italy,Units,7560.0
Spain,Revenue,8755.0
Spain,Units,7973.0
UK,Revenue,8705.0
UK,Units,8992.0


You can also group by the index (if the index is hierarchical, you can choose which levels to group on):

In [8]:
# Keep only 'value' so that the date and string columns are not summed
kpi.set_index(['countries', 'metric']).groupby(level=[0, 1])[['value']].sum()

countries,metric,value
France,Revenue,8105.0
France,Units,8442.0
Germany,Revenue,7915.0
Germany,Units,8235.0
Italy,Revenue,6500.0
Italy,Units,7560.0
Spain,Revenue,8755.0
Spain,Units,7973.0
UK,Revenue,8705.0
UK,Units,8992.0
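
Index levels can be referred to by name as well as by position, and you can group on a subset of them. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with a two-level index.
toy = pd.DataFrame({
    'countries': ['France', 'France', 'Spain', 'Spain'],
    'metric': ['Revenue', 'Units', 'Revenue', 'Units'],
    'value': [100.0, 10.0, 200.0, 20.0],
}).set_index(['countries', 'metric'])

# Group on a single level, referenced by its name.
by_metric = toy.groupby(level='metric').sum()
print(by_metric)
```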


You can do a custom groupby with a mapping that associates index values with other values:

In [9]:
mapping = {
    'France': 'South',
    'Spain': 'South',
    'Italy': 'South',
    'UK': 'North',
    'Germany': 'North',
}

In [10]:
# Keep only 'value': the mean of a string column is not defined
kpi.set_index('countries').groupby([mapping, 'metric'])[['value']].mean()

,metric,value
North,Revenue,174.947368
North,Units,181.336842
South,Revenue,163.356643
South,Units,166.493056
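
If you would rather not move the column into the index, one possible variant is to build the grouping key with `Series.map` and pass that Series to `groupby`. This is a sketch on made-up values; the name `region` is an assumption introduced for the example.

```python
import pandas as pd

# Hypothetical data, one row per country.
toy = pd.DataFrame({
    'countries': ['France', 'Spain', 'UK', 'Germany'],
    'value': [10.0, 20.0, 30.0, 50.0],
})
mapping = {
    'France': 'South',
    'Spain': 'South',
    'UK': 'North',
    'Germany': 'North',
}

# A Series aligned with the rows can be used directly as a grouping key.
region = toy['countries'].map(mapping).rename('region')
by_region = toy.groupby(region)['value'].mean()
print(by_region)
```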


You can use `.agg` to compute several aggregations at once:

In [11]:
# Aggregate 'value' only; the other columns are dates or strings
kpi.groupby('metric')[['value']].agg(['sum', 'mean', 'median'])

metric,value (sum),value (mean),value (median)
Revenue,39980.0,167.983193,170.0
Units,41202.0,172.393305,182.0
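
When the two-level column header is inconvenient, named aggregation gives flat column names of your choosing. A minimal sketch on made-up data (the output names `total`, `average` and `middle` are arbitrary):

```python
import pandas as pd

# Hypothetical data.
toy = pd.DataFrame({
    'metric': ['Revenue', 'Revenue', 'Units', 'Units'],
    'value': [100.0, 200.0, 10.0, 30.0],
})

# Each keyword is an output column: (input column, aggregation).
summary = toy.groupby('metric').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    middle=('value', 'median'),
)
print(summary)
```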


You can also pass a custom aggregation function to `.agg`:

In [12]:
def peak_to_peak(g):
    return g.max() - g.min()

# Apply the function to 'value' only: max() - min() is not defined for strings
kpi.groupby(['date', 'metric'])[['value']].agg(peak_to_peak)

date,metric,value
2018-01-01,Revenue,234.0
2018-01-01,Units,241.0
2018-02-01,Revenue,234.0
2018-02-01,Units,231.0
2018-03-01,Revenue,219.0
2018-03-01,Units,242.0
2018-04-01,Revenue,212.0
2018-04-01,Units,251.0
2018-05-01,Revenue,223.0
2018-05-01,Units,255.0
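
A custom function can be mixed with built-in aggregations in the same `.agg` call, which makes it easy to check its result. In this sketch on made-up data, the function receives the `value` Series of one group at a time:

```python
import pandas as pd

# Hypothetical data.
toy = pd.DataFrame({
    'metric': ['Revenue', 'Revenue', 'Revenue', 'Units', 'Units'],
    'value': [100.0, 250.0, 180.0, 10.0, 45.0],
})

def peak_to_peak(g):
    # g is the Series of 'value' entries for a single group
    return g.max() - g.min()

# The custom function's column is named after the function itself.
spread = toy.groupby('metric')['value'].agg(['min', 'max', peak_to_peak])
print(spread)
```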


## .TRANSFORM

`.transform` is similar to `.agg`, except that it returns as many rows as the original dataframe (the group-level result is broadcast back to every row of the group).

In [13]:
# To add a 'moyenne_pays' (per-country mean) column to every row:
kpi['moyenne_pays'] = kpi.groupby(['countries', 'metric'])['value'].transform('mean')

kpi.head()

,date,countries,type,metric,value,moyenne_pays
0,2018-01-01,France,Brand A,Revenue,161.0,172.446809
1,2018-01-01,France,Brand A,Units,85.0,175.875
2,2018-01-01,France,Brand B,Revenue,184.0,172.446809
3,2018-01-01,France,Brand B,Units,65.0,175.875
4,2018-01-01,France,Brand C,Revenue,85.0,172.446809
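
Because the result of `.transform` is aligned with the original rows, it can be combined with the original column directly. A small sketch on made-up data, computing each row's deviation from its group mean (the column name `deviation` is an assumption for the example):

```python
import pandas as pd

# Hypothetical data.
toy = pd.DataFrame({
    'countries': ['France', 'France', 'Spain', 'Spain'],
    'value': [100.0, 200.0, 10.0, 30.0],
})

# One value per original row, not one per group.
group_mean = toy.groupby('countries')['value'].transform('mean')
toy['deviation'] = toy['value'] - group_mean
print(toy)
```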


In [14]:
del kpi['moyenne_pays']

NB: I could have obtained the same result by computing the mean and then merging it back in:

In [15]:
moyenne_pays = kpi.groupby(['countries', 'metric'])['value']\
                  .agg('mean')\
                  .reset_index()\
                  .rename(columns={'value': 'moyenne_pays'})

kpi.merge(moyenne_pays, on=['countries', 'metric']).head()

,date,countries,type,metric,value,moyenne_pays
0,2018-01-01,France,Brand A,Revenue,161.0,172.446809
1,2018-01-01,France,Brand B,Revenue,184.0,172.446809
2,2018-01-01,France,Brand C,Revenue,85.0,172.446809
3,2018-01-01,France,Brand D,Revenue,144.0,172.446809
4,2018-02-01,France,Brand A,Revenue,103.0,172.446809


In [16]:
kpi_revenue = kpi[kpi.metric == 'Revenue'].copy()  # copy, so the later `del` does not act on a view of kpi

In [17]:
del kpi_revenue['metric']

In [18]:
# Keep only 'value': the mean of the string column 'type' is not defined
kpi_revenue.groupby(['countries'])[['value']].mean()

countries,value
France,172.446809
Germany,164.895833
Italy,135.416667
Spain,182.395833
UK,185.212766


In [19]:
# Within each country, what were the 5 highest revenue figures?

def top5(g):
    return g.dropna().sort_values('value')[-5:]

kpi_revenue.groupby(['countries']).apply(top5)

countries,,date,countries,type,value
France,84,2018-03-01,France,Brand C,285.0
France,42,2018-02-01,France,Brand B,287.0
France,326,2018-09-01,France,Brand D,288.0
France,86,2018-03-01,France,Brand D,292.0
France,200,2018-06-01,France,Brand A,292.0
Germany,222,2018-06-01,Germany,Brand D,242.0
Germany,380,2018-10-01,Germany,Brand C,284.0
Germany,460,2018-12-01,Germany,Brand C,285.0
Germany,302,2018-08-01,Germany,Brand D,290.0
Germany,58,2018-02-01,Germany,Brand B,299.0
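
A "top N per group" query can also be written without `.apply`, by sorting once and taking the first rows of each group. This is only a sketch of that alternative on made-up data (here N = 2); unlike the `top5` version above, the result keeps the original flat index.

```python
import pandas as pd

# Hypothetical data.
toy = pd.DataFrame({
    'countries': ['France'] * 4 + ['Spain'] * 3,
    'value': [5.0, 1.0, 9.0, 7.0, 2.0, 8.0, 4.0],
})

# Sort from highest to lowest, then keep the first 2 rows of every group.
top2 = (toy.sort_values('value', ascending=False)
           .groupby('countries')
           .head(2))
print(top2)
```

`sort_values` places missing values last by default, so they are only picked up when a group has fewer than N valid rows.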


In [20]:
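# In January 2018, what percentage of revenue did each brand account for?

def pct_totalgroup(g):
    g = g.copy()  # work on a copy rather than mutating the group in place
    g['pct'] = g['value'] / g['value'].sum() * 100
    return g

# Sum 'value' only, so the string column 'countries' is left out.
# group_keys=False keeps 'date' as a plain column (pandas >= 2.0 would
# otherwise also add it to the index, which makes the query ambiguous).
kpi_revenue.groupby(['date', 'type'])['value'].sum().reset_index()\
           .groupby(['date'], group_keys=False).apply(pct_totalgroup)\
           .query('date == "2018-01-01"')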

,date,type,value,pct
0,2018-01-01,Brand A,783.0,24.825618
1,2018-01-01,Brand B,876.0,27.774255
2,2018-01-01,Brand C,749.0,23.747622
3,2018-01-01,Brand D,746.0,23.652505
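
The same share-of-total computation can be expressed with `.transform`, which tends to be less sensitive to how `.apply` handles group keys across pandas versions. A sketch on made-up data (values chosen so the percentages are easy to check by hand):

```python
import pandas as pd

# Hypothetical monthly totals per brand.
toy = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-02-01', '2018-02-01'],
    'type': ['Brand A', 'Brand B', 'Brand A', 'Brand B'],
    'value': [10.0, 30.0, 20.0, 20.0],
})

# Total per date, broadcast back to every row of that date.
total_per_date = toy.groupby('date')['value'].transform('sum')
toy['pct'] = toy['value'] / total_per_date * 100
print(toy)
```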


In [22]:
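# I want a cumulative sum (cumsum) column for each brand/metric/country combination:

def mycumsum(g):
    g = g.copy()  # avoid mutating the group in place
    g['cumsum'] = g['value'].cumsum()
    return g

# group_keys=False keeps the original row labels as the index of the result
df = kpi.groupby(['countries', 'type', 'metric'], group_keys=False).apply(mycumsum)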

df[(df.countries == 'France') & (df.metric == 'Revenue') & (df.type == 'Brand C')]

,date,countries,type,metric,value,cumsum
4,2018-01-01,France,Brand C,Revenue,85.0,85.0
44,2018-02-01,France,Brand C,Revenue,70.0,155.0
84,2018-03-01,France,Brand C,Revenue,285.0,440.0
124,2018-04-01,France,Brand C,Revenue,260.0,700.0
164,2018-05-01,France,Brand C,Revenue,141.0,841.0
204,2018-06-01,France,Brand C,Revenue,53.0,894.0
244,2018-07-01,France,Brand C,Revenue,170.0,1064.0
284,2018-08-01,France,Brand C,Revenue,47.0,1111.0
324,2018-09-01,France,Brand C,Revenue,198.0,1309.0
364,2018-10-01,France,Brand C,Revenue,199.0,1508.0
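
For a per-group cumulative sum specifically, `cumsum` is available directly on the grouped column, so the helper function is optional. A minimal sketch on made-up data, assuming the rows are already in chronological order:

```python
import pandas as pd

# Hypothetical data, already sorted by time within each brand.
toy = pd.DataFrame({
    'type': ['Brand A', 'Brand B', 'Brand A', 'Brand B', 'Brand A'],
    'value': [1.0, 10.0, 2.0, 20.0, 3.0],
})

# The running total restarts for each brand; the result is aligned with the original rows.
toy['cumsum'] = toy.groupby('type')['value'].cumsum()
print(toy)
```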
