Any groupby operation involves one or more of the following operations on the original object:

* **Splitting** the Object

* **Applying** a function

* **Combining** the results

In many situations, we split the data into sets and apply a function to each subset. In the apply step, we can perform the following operations:

* **Aggregation**: computing a summary statistic for each group

* **Transformation**: performing some group-specific operation that returns a result shaped like the input

* **Filtration**: discarding the groups that fail some condition
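
The three kinds of apply step can be sketched on a small, made-up DataFrame. The column names `team` and `score` are hypothetical and exist only for this illustration; they are not part of the pokemon dataset used below.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b', 'c'],
    'score': [1, 3, 10, 20, 7],
})
grouped = df.groupby('team')  # split

# Aggregation: one summary value per group
totals = grouped['score'].sum()

# Transformation: one value per original row, computed within its group
centered = grouped['score'].transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups that pass a condition
big_teams = grouped.filter(lambda g: len(g) >= 2)

print(totals)
print(centered)
print(big_teams)
```

Aggregation shrinks the data to one row per group, transformation keeps the original row count, and filtration returns a subset of the original rows.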

In [2]:
import pandas as pd
import numpy as np 
import os
import matplotlib.pyplot as plt
os.chdir(r'C:\Users\dell\PycharmProjects\MachineLearning\Pandas\datasets')

In [3]:
pokemon = pd.read_csv('./pokemon.csv')
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


# Split data into Groups

```python
DataFrame.groupby(
    by=None,
    axis=0,
    level=None,
    as_index=True,
    sort=True,
    group_keys=True,
    squeeze=False,
    observed=False,
    **kwargs,
)
```

<b><code>by</code></b> : mapping, function, label, or list of labels

In [17]:
pd.set_option('display.precision', 2, 'display.max_rows', 10)

In [28]:
#grouped by type
grouped_types = pokemon.groupby(['Type 1', 'Type 2'])
grouped_types

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019489927978>

In [29]:
#mean stats for each type
stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
grouped_types[stats].mean().sort_values(by = 'Attack', ascending = False)

Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
Ground,Fire,100.00,180.00,160.00,150.00,90.00,90.00
Psychic,Fighting,80.67,160.00,86.67,94.67,110.00,106.67
Psychic,Dark,80.00,160.00,60.00,170.00,130.00,80.00
Bug,Fighting,80.00,155.00,95.00,40.00,100.00,80.00
Dragon,Electric,100.00,150.00,120.00,120.00,100.00,90.00
...,...,...,...,...,...,...,...
Fairy,Flying,70.00,45.00,90.00,100.00,110.00,60.00
Ghost,Fire,56.67,41.67,68.33,101.67,68.33,51.67
Ice,Psychic,55.00,40.00,25.00,100.00,80.00,80.00
Water,Fairy,85.00,35.00,65.00,40.00,65.00,45.00


In [68]:
#number of pokemon for each type combination
pokemon.groupby(['Type 1', 'Type 2']).size()

Type 1  Type 2  
Bug     Electric     2
        Fighting     2
        Fire         2
        Flying      14
        Ghost        1
                    ..
Water   Ice          3
        Poison       3
        Psychic      5
        Rock         4
        Steel        1
Length: 136, dtype: int64
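
`size()` counts rows, which is not the same as `count()`: the latter counts non-missing values per column. A minimal sketch on made-up data (the column names `kind` and `value` are assumptions for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    'kind': ['x', 'x', 'y'],
    'value': [1.0, np.nan, 3.0],
})
grouped = df.groupby('kind')

sizes = grouped.size()              # rows per group, missing values included
counts = grouped['value'].count()   # non-missing values per group

print(sizes)
print(counts)
```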

## as_index

With `as_index=False`, the group keys are returned as regular columns instead of becoming the index of the result.

In [71]:
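#numeric_only skips text columns such as Name, which newer pandas versions refuse to average
pokemon.groupby(['Type 1', 'Type 2'], as_index = False).mean(numeric_only = True)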

Unnamed: 0,Type 1,Type 2,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bug,Electric,595.50,395.50,60.00,62.00,55.00,77.00,55.00,86.50,5.00,0.0
1,Bug,Fighting,214.00,550.00,80.00,155.00,95.00,40.00,100.00,80.00,2.00,0.0
2,Bug,Fire,636.50,455.00,70.00,72.50,60.00,92.50,80.00,80.00,5.00,0.0
3,Bug,Flying,286.29,419.50,63.00,70.14,61.57,72.86,69.07,82.86,2.86,0.0
4,Bug,Ghost,292.00,236.00,1.00,90.00,45.00,30.00,30.00,40.00,3.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
131,Water,Ice,103.00,511.67,90.00,83.33,113.33,80.00,78.33,66.67,1.00,0.0
132,Water,Poison,118.67,426.67,61.67,68.33,58.33,61.67,91.67,85.00,1.33,0.0
133,Water,Psychic,111.80,481.00,87.00,73.00,104.00,94.00,79.00,44.00,1.20,0.0
134,Water,Rock,430.00,428.75,70.75,82.75,112.75,61.50,65.00,36.00,3.75,0.0
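
The effect of `as_index` is easier to see on a tiny frame. This is a sketch on made-up data; `kind` and `value` are hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'kind': ['x', 'x', 'y'],
    'value': [1, 2, 10],
})

indexed = df.groupby('kind').sum()                # 'kind' becomes the index
flat = df.groupby('kind', as_index=False).sum()   # 'kind' stays a column

print(indexed.index.tolist())
print(flat.columns.tolist())
```

The flat layout is usually more convenient when the result is going to be merged with another table or written to a file; calling `reset_index()` on the default result gives a comparable layout.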


## mapping

Pass a dict (or Series) that maps index or column labels to group names; labels that map to the same name end up in the same group.

In [73]:
np.random.seed(101)

ratings = pd.DataFrame(np.random.randint(0, 5, (2,4)), index = ['Apple', 'Google'], columns = ['Math', 'Physic', 'History', 'Literature'])
ratings

Unnamed: 0,Math,Physic,History,Literature
Apple,3,1,3,1
Google,0,4,0,4


In [75]:
#Mapping each subject to its corresponding Category
mapping = {'Math' : 'Science', 'Physic' : 'Science', 'History' : 'Social', 'Literature' : 'Social'}

In [76]:
#calculate the total score in each category for each company
ratings.groupby(mapping, axis = 1).sum()

Unnamed: 0,Science,Social
Apple,4,4
Google,4,4
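
Recent pandas releases (2.1 and later, to my knowledge) emit a deprecation warning for `groupby(..., axis=1)`. One possible replacement, sketched here, is to transpose, group the rows, and transpose back. The ratings values are written out explicitly so the block does not depend on the random seed.

```python
import pandas as pd

# Same values as the ratings table above, written out explicitly
ratings = pd.DataFrame(
    [[3, 1, 3, 1], [0, 4, 0, 4]],
    index=['Apple', 'Google'],
    columns=['Math', 'Physic', 'History', 'Literature'],
)
mapping = {'Math': 'Science', 'Physic': 'Science',
           'History': 'Social', 'Literature': 'Social'}

# Transpose so subjects become row labels, group the rows, transpose back
by_category = ratings.T.groupby(mapping).sum().T
print(by_category)
```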


<hr>

In [77]:
networth = pd.Series([107, 203, 88, 45, 32, 87], index = ['Google', 'Nestle', 'Amazon', 'Apple', 'Kitkat', 'Alibaba'])
networth

Google     107
Nestle     203
Amazon      88
Apple       45
Kitkat      32
Alibaba     87
dtype: int64

In [83]:
industry_mapping = {
'Apple' : 'Tech',
'Google' : 'Tech',
'Nestle' : 'Food',
'Kitkat' : 'Food',
'Alibaba' : 'E-commerce',
'Amazon' : 'E-commerce'
}

industry_networth = networth.groupby(industry_mapping).sum()
industry_networth

E-commerce    175
Food          235
Tech          152
dtype: int64

## function

Pass a function: it is called once per index label, and its return value is used as the group name.

In [91]:
#group by first characters
networth = pd.Series([107, 203, 88, 45, 32, 87], index = ['Google', 'Nestle', 'Amazon', 'Apple', 'Kitkat', 'Alibaba'])
networth

Google     107
Nestle     203
Amazon      88
Apple       45
Kitkat      32
Alibaba     87
dtype: int64

In [92]:
networth.groupby(lambda name: name[0]).sum()

A    220
G    107
K     32
N    203
dtype: int64
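
Any callable that accepts a single index label can serve as the key, not only a lambda. The sketch below reuses the `networth` values from above; `initial_kind` is a hypothetical helper written only for this illustration.

```python
import pandas as pd

networth = pd.Series(
    [107, 203, 88, 45, 32, 87],
    index=['Google', 'Nestle', 'Amazon', 'Apple', 'Kitkat', 'Alibaba'],
)

# A built-in works too: group companies by the length of their name
by_length = networth.groupby(len).sum()

def initial_kind(name):
    """Hypothetical helper: classify a label by its first letter."""
    return 'vowel' if name[0].lower() in 'aeiou' else 'consonant'

by_initial = networth.groupby(initial_kind).sum()

print(by_length)
print(by_initial)
```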

## level

When the object has a MultiIndex, `level` selects which index level (by position or by name) to group on.

In [93]:
np.random.seed(99)
index = pd.MultiIndex.from_product([['Trung', 'Kien'], ['Math', 'Physic']])
scores = pd.DataFrame(np.random.randint(5, 10, (4, 2)), index = index, columns = ['Semester 1', 'Semester 2'])
scores

Unnamed: 0,Unnamed: 1,Semester 1,Semester 2
Trung,Math,6,8
Trung,Physic,6,5
Kien,Math,6,5
Kien,Physic,7,9


In [94]:
#average score for each subject
scores.groupby(level = 1).mean()

Unnamed: 0,Semester 1,Semester 2
Math,6.0,6.5
Physic,6.5,7.0
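
Levels can also be referred to by name, which tends to read better than a position. In this sketch the level names `student` and `subject` are assumptions added for the example (the index above is unnamed), and the scores are written out explicitly instead of being drawn at random.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Trung', 'Kien'], ['Math', 'Physic']],
    names=['student', 'subject'],  # assumed names, not in the original index
)
# Same values as the scores table above, written out explicitly
scores = pd.DataFrame(
    [[6, 8], [6, 5], [6, 5], [7, 9]],
    index=index,
    columns=['Semester 1', 'Semester 2'],
)

by_subject = scores.groupby(level='subject').mean()
by_student = scores.groupby(level='student').mean()

print(by_subject)
print(by_student)
```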


# View Groups

<b><code>DataFrameGroupBy.groups</code></b>

In [30]:
grouped_types.groups

{('Grass',
  'Poison'): Int64Index([0, 1, 2, 3, 48, 49, 50, 75, 76, 77, 344, 451, 452, 651, 652], dtype='int64'),
 ('Fire',
  nan): Int64Index([  4,   5,  42,  43,  63,  64,  83,  84, 135, 147, 169, 170, 171,
             236, 259, 263, 276, 355, 435, 518, 557, 572, 573, 614, 615, 692,
             721, 722],
            dtype='int64'),
 ('Fire', 'Flying'): Int64Index([6, 8, 158, 270, 730, 731], dtype='int64'),
 ('Fire', 'Dragon'): Int64Index([7], dtype='int64'),
 ('Water',
  nan): Int64Index([  9,  10,  11,  12,  59,  60,  65,  66,  93,  97, 106, 107, 125,
             126, 127, 128, 129, 139, 145, 172, 173, 174, 201, 241, 242, 264,
             280, 350, 351, 373, 381, 382, 401, 402, 403, 405, 421, 422, 438,
             439, 465, 466, 469, 506, 507, 547, 548, 560, 561, 562, 574, 575,
             595, 610, 655, 724, 725, 762, 763],
            dtype='int64'),
 ('Bug',
  nan): Int64Index([ 13,  14, 136, 219, 288, 289, 291, 342, 343, 446, 447, 457, 649,
             677, 678, 732, 733

# Iterating over groups

In [31]:
for name, group in grouped_types:
    print(name)
    print(group)
    print('_' * 50)

('Bug', 'Electric')
       #        Name Type 1    Type 2  Total  HP  Attack  Defense  Sp. Atk  \
656  595      Joltik    Bug  Electric    319  50      47       50       57   
657  596  Galvantula    Bug  Electric    472  70      77       60       97   

     Sp. Def  Speed  Generation  Legendary  
656       50     65           5      False  
657       60    108           5      False  
__________________________________________________
('Bug', 'Fighting')
       #                     Name Type 1    Type 2  Total  HP  Attack  \
231  214                Heracross    Bug  Fighting    500  80     125   
232  214  HeracrossMega Heracross    Bug  Fighting    600  80     185   

     Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
231       75       40       95     85           2      False  
232      115       40      105     75           2      False  
__________________________________________________
('Bug', 'Fire')
       #       Name Type 1 Type 2  Total  HP  Attack  Defense  

424      150       90     90           3       True  
__________________________________________________
('Ground', 'Flying')
       #                     Name  Type 1  Type 2  Total  HP  Attack  Defense  \
222  207                   Gligar  Ground  Flying    430  65      75      105   
523  472                  Gliscor  Ground  Flying    510  75      95      125   
708  645  LandorusIncarnate Forme  Ground  Flying    600  89     125       90   
709  645    LandorusTherian Forme  Ground  Flying    600  89     145       90   

     Sp. Atk  Sp. Def  Speed  Generation  Legendary  
222       35       65     85           2      False  
523       45       75     95           4      False  
708      115       80    101           5       True  
709      105       80     91           5       True  
__________________________________________________
('Ground', 'Ghost')
       #    Name  Type 1 Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  \
683  622  Golett  Ground  Ghost    303  59    

# Select a group

<b><code>DataFrameGroupBy.get_group(name, obj = None)</code></b>

In [34]:
#with several grouping keys, pass the group name as a tuple
grouped_types.get_group(('Bug', 'Flying'))

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
15,12,Butterfree,Bug,Flying,395,60,45,50,90,80,70,1,False
132,123,Scyther,Bug,Flying,500,70,110,80,55,80,105,1,False
137,127,PinsirMega Pinsir,Bug,Flying,600,65,155,120,65,90,105,1,False
179,165,Ledyba,Bug,Flying,265,40,20,30,40,80,55,2,False
180,166,Ledian,Bug,Flying,390,55,35,50,55,110,85,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
461,414,Mothim,Bug,Flying,424,70,94,50,94,50,66,4,False
462,415,Combee,Bug,Flying,244,30,30,42,30,42,70,4,False
463,416,Vespiquen,Bug,Flying,474,70,80,102,80,102,40,4,False
520,469,Yanmega,Bug,Flying,515,86,76,86,116,56,95,4,False
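
`get_group` returns the same rows that a boolean mask on the key columns would select. A minimal sketch on made-up data (the lowercase column names are hypothetical stand-ins for the pokemon columns):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'type1': ['Bug', 'Bug', 'Fire', 'Bug'],
    'type2': ['Flying', 'Poison', 'Flying', 'Flying'],
    'hp': [60, 40, 78, 70],
})
grouped = df.groupby(['type1', 'type2'])

# The tuple lists the key values in the same order as the grouping columns
bug_flying = grouped.get_group(('Bug', 'Flying'))

# The same rows selected with a boolean mask
mask = (df['type1'] == 'Bug') & (df['type2'] == 'Flying')
masked = df[mask]

print(bug_flying)
```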


# Aggregate

In [40]:
#calculate min, max, mean of stats for legendary vs other pokemon
grouped_legendary = pokemon.groupby('Legendary')
grouped_legendary[stats].agg(['min', 'max', 'mean'])

Legendary,HP min,HP max,HP mean,Attack min,Attack max,Attack mean,Defense min,Defense max,Defense mean,Sp. Atk min,Sp. Atk max,Sp. Atk mean,Sp. Def min,Sp. Def max,Sp. Def mean,Speed min,Speed max,Speed mean
False,1,255,67.18,5,185,75.67,5,230,71.56,10,175,68.45,20,230,68.89,5,160,65.46
True,50,150,92.74,50,190,116.68,20,200,99.66,50,194,122.18,20,200,105.94,50,180,100.18
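
Passing a list of functions to `agg` produces two-level column labels. When flat, self-describing column names are preferred, named aggregation is one option. The sketch below uses made-up data; the output names `hp_min`, `hp_max`, and `attack_mean` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'legendary': [False, False, True, True],
    'hp': [45, 60, 90, 106],
    'attack': [49, 62, 85, 110],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby('legendary').agg(
    hp_min=('hp', 'min'),
    hp_max=('hp', 'max'),
    attack_mean=('attack', 'mean'),
)
print(summary)
```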


In [38]:
#get group of legendary pokemons
grouped_legendary.get_group(True)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
156,144,Articuno,Ice,Flying,580,90,85,100,95,125,85,1,True
157,145,Zapdos,Electric,Flying,580,90,90,85,125,90,100,1,True
158,146,Moltres,Fire,Flying,580,90,100,90,125,85,90,1,True
162,150,Mewtwo,Psychic,,680,106,110,90,154,90,130,1,True
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,780,106,190,100,154,100,130,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


# Transformation

In [56]:
#normalize stats of each generation
grouped_gen = pokemon.groupby('Generation')
grouped_gen

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001948AE16860>

In [57]:
grouped_gen.groups

{1: Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
             156, 157, 158, 159, 160, 161, 162, 163, 164, 165],
            dtype='int64', length=166),
 2: Int64Index([166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
             ...
             262, 263, 264, 265, 266, 267, 268, 269, 270, 271],
            dtype='int64', length=106),
 3: Int64Index([272, 273, 274, 275, 276, 277, 278, 279, 280, 281,
             ...
             422, 423, 424, 425, 426, 427, 428, 429, 430, 431],
            dtype='int64', length=160),
 4: Int64Index([432, 433, 434, 435, 436, 437, 438, 439, 440, 441,
             ...
             543, 544, 545, 546, 547, 548, 549, 550, 551, 552],
            dtype='int64', length=121),
 5: Int64Index([553, 554, 555, 556, 557, 558, 559, 560, 561, 562,
             ...
             708, 709, 710, 711, 712, 713, 714, 715, 716, 717],
            dtype='int64', length=165),
 6: Int64Index([718, 719, 720, 721, 722, 723, 724, 725, 726, 727,

In [88]:
#normalize each group (z-score within its generation)
norm = grouped_gen[stats].transform(lambda group: (group - group.mean()) / group.std())
norm #the result keeps the index and row order of the original DataFrame

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,-0.74,-0.90,-0.76,-0.20,-0.16,-0.93
1,-0.21,-0.48,-0.27,0.24,0.43,-0.42
2,0.50,0.17,0.42,0.82,1.21,0.25
3,0.50,0.76,1.82,1.46,2.00,0.25
4,-0.95,-0.80,-0.97,-0.34,-0.75,-0.26
...,...,...,...,...,...,...
795,-0.87,0.83,2.34,0.81,2.50,-0.64
796,-0.87,2.89,1.06,2.70,1.17,1.70
797,0.56,1.17,-0.53,2.38,1.83,0.14
798,0.56,2.89,-0.53,3.01,1.83,0.53
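
A quick way to check a per-group z-score is to regroup the result: every group should then have mean 0 and standard deviation 1. This sketch uses made-up data with hypothetical columns `gen` and `hp`; note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'gen': [1, 1, 1, 2, 2, 2],
    'hp': [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# z-score within each generation
z = df.groupby('gen')['hp'].transform(lambda s: (s - s.mean()) / s.std())

# Regroup the transformed values to verify the normalization
check = z.groupby(df['gen']).agg(['mean', 'std'])

print(z)
print(check)
```

Although the two generations differ by a factor of ten in raw values, their normalized values are identical, which is the point of normalizing per group.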


# Filtration

In [89]:
#select pokemon whose type combination occurs only once
rare = pokemon.groupby(['Type 1', 'Type 2']).filter(lambda group: len(group) == 1)
rare #rows of the groups that passed, in their original order

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
196,181,AmpharosMega Ampharos,Electric,Dragon,610,90,95,105,165,110,45,2,False
237,219,Magcargo,Fire,Rock,410,50,50,120,80,80,30,2,False
245,227,Skarmory,Steel,Flying,465,65,80,140,40,70,70,2,False
271,251,Celebi,Psychic,Grass,600,100,100,100,100,100,100,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
771,701,Hawlucha,Fighting,Flying,500,78,92,75,74,63,118,6,False
772,702,Dedenne,Electric,Fairy,431,67,58,57,81,67,101,6,False
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
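
One thing to keep in mind when filtering on `['Type 1', 'Type 2']`: in pandas 1.1 and later, rows whose key contains NaN are left out of every group by default, so single-type pokemon never appear in the result. The sketch below illustrates this on made-up data (lowercase column names are hypothetical); `dropna=False` is the `groupby` parameter that keeps NaN as a key value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'd' has no second type
df = pd.DataFrame({
    'type1': ['Fire', 'Fire', 'Water', 'Grass'],
    'type2': ['Flying', 'Flying', 'Ice', np.nan],
    'name': ['a', 'b', 'c', 'd'],
})

# Keep rows whose (type1, type2) combination occurs exactly once
singles = df.groupby(['type1', 'type2']).filter(lambda g: len(g) == 1)

# Group sizes with and without NaN keys
sizes_default = df.groupby(['type1', 'type2']).size()
sizes_with_nan = df.groupby(['type1', 'type2'], dropna=False).size()

print(singles)
print(len(sizes_default), len(sizes_with_nan))
```

Row `d` is unique in this toy frame but is not returned by the default filter, because its NaN key excludes it from grouping altogether.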
