# Data Grouping in Pandas

## Importing Modules

In [1]:
import pandas as pd
import numpy as np

## `groupby()` function

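Syntax (pandas 2.x; the `squeeze` parameter was removed in pandas 2.0 and `axis` is deprecated):
```
DataFrame.groupby(by=None, axis=<no_default>, level=None, as_index=True, sort=True, group_keys=True, observed=<no_default>, dropna=True)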
```
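
To make the main parameters concrete, here is a minimal sketch on a small made-up frame. The `toy` data and its values are hypothetical and are not taken from `nba.csv`; only `by`, `sort` and `as_index` are exercised.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['B', 'A', 'B', 'A', 'A'],
    'Salary': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Split the rows by 'Team', then sum each group's 'Salary'.
# With the default sort=True the group keys come back sorted.
totals = toy.groupby('Team')['Salary'].sum()
print(totals)

# sort=False keeps the groups in order of first appearance.
unsorted_totals = toy.groupby('Team', sort=False)['Salary'].sum()
print(unsorted_totals)

# as_index=False returns the group key as a regular column
# instead of using it as the index of the result.
flat_totals = toy.groupby('Team', as_index=False)['Salary'].sum()
print(flat_totals)
```

Whether to keep the key in the index or as a column is mostly a matter of what the next step needs: index lookups favor the default, while merging or plotting often favors `as_index=False`.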

In [2]:
df = pd.read_csv('Data/nba.csv')
df

,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


## Grouping by a Single Column

In [3]:
grouped_team = df.groupby('Team')
grouped_team

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1341187d0>

In [4]:
grouped_team.groups

{'Atlanta Hawks': [309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323], 'Boston Celtics': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'Brooklyn Nets': [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'Charlotte Hornets': [324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338], 'Chicago Bulls': [151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165], 'Cleveland Cavaliers': [166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180], 'Dallas Mavericks': [227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241], 'Denver Nuggets': [383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397], 'Detroit Pistons': [181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195], 'Golden State Warriors': [76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90], 'Houston Rockets': [242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,

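Compute the first non-null entry of each column within each group using `first()`. Because missing values are skipped by default (`skipna=True`), the values in one output row can come from different input rows.

Syntax: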
```
DataFrameGroupBy.first(numeric_only=False, min_count=-1, skipna=True)
```
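
The effect of `skipna=True` is easiest to see on a frame with gaps. The sketch below uses hypothetical data (the `toy` frame is invented for illustration) and contrasts `first()` with `head(1)`, which returns the first row of each group unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'College': [np.nan, 'Kansas', 'Texas', np.nan],
    'Salary': [1.0, 2.0, np.nan, 4.0],
})
grouped = toy.groupby('Team')

# first(): first non-null value per column, indexed by group key.
# Team A gets College from its second row and Salary from its first row.
first_non_null = grouped.first()
print(first_non_null)

# head(1): the first row of each group as-is, NaN included,
# keeping the original row labels.
first_rows = grouped.head(1)
print(first_rows)
```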

In [5]:
grouped_team.first()

Team,Name,Number,Position,Age,Height,Weight,College,Salary
Atlanta Hawks,Kent Bazemore,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
Boston Celtics,Avery Bradley,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Brooklyn Nets,Bojan Bogdanovic,44.0,SG,27.0,6-8,216.0,Oklahoma State,3425510.0
Charlotte Hornets,Nicolas Batum,5.0,SG,27.0,6-8,200.0,Virginia Commonwealth,13125306.0
Chicago Bulls,Cameron Bairstow,41.0,PF,25.0,6-9,250.0,New Mexico,845059.0
Cleveland Cavaliers,Matthew Dellavedova,8.0,PG,25.0,6-4,198.0,Saint Mary's,1147276.0
Dallas Mavericks,Justin Anderson,1.0,SG,22.0,6-6,228.0,Virginia,1449000.0
Denver Nuggets,Darrell Arthur,0.0,PF,28.0,6-9,235.0,Kansas,2814000.0
Detroit Pistons,Joel Anthony,50.0,C,33.0,6-9,245.0,UNLV,2500000.0
Golden State Warriors,Leandro Barbosa,19.0,SG,33.0,6-3,194.0,North Carolina,2500000.0


In [6]:
for name, group in grouped_team:
    print(f'Team: {name}', group, sep='\n')

Team: Atlanta Hawks
                 Name           Team  Number Position   Age Height  Weight  \
309     Kent Bazemore  Atlanta Hawks    24.0       SF  26.0    6-5   201.0   
310  Tim Hardaway Jr.  Atlanta Hawks    10.0       SG  24.0    6-6   205.0   
311      Kirk Hinrich  Atlanta Hawks    12.0       SG  35.0    6-4   190.0   
312        Al Horford  Atlanta Hawks    15.0        C  30.0   6-10   245.0   
313    Kris Humphries  Atlanta Hawks    43.0       PF  31.0    6-9   235.0   
314       Kyle Korver  Atlanta Hawks    26.0       SG  35.0    6-7   212.0   
315      Paul Millsap  Atlanta Hawks     4.0       PF  31.0    6-8   246.0   
316      Mike Muscala  Atlanta Hawks    31.0       PF  24.0   6-11   240.0   
317   Lamar Patterson  Atlanta Hawks    13.0       SG  24.0    6-5   225.0   
318   Dennis Schroder  Atlanta Hawks    17.0       PG  22.0    6-1   172.0   
319        Mike Scott  Atlanta Hawks    32.0       PF  27.0    6-8   237.0   
320   Thabo Sefolosha  Atlanta Hawks    25.0
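
Iterating over a grouped object yields `(name, group)` pairs, where `group` is a DataFrame holding that group's rows. A minimal sketch on made-up data (the `toy` frame is hypothetical, not from `nba.csv`) that collects the number of rows per group:

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'B', 'A'],
    'Name': ['x', 'y', 'z'],
})

# Each iteration gives the group key and the sub-DataFrame for that key.
rows_per_team = {name: len(group) for name, group in toy.groupby('Team')}
print(rows_per_team)
```

For a plain row count, `toy.groupby('Team').size()` gives the same numbers without an explicit loop; iteration is mainly useful when each group needs custom handling.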

**Selecting a group using `DataFrameGroupBy.get_group(name)`**

In [7]:
grouped_team.get_group('Atlanta Hawks')

,Name,Team,Number,Position,Age,Height,Weight,College,Salary
309,Kent Bazemore,Atlanta Hawks,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
311,Kirk Hinrich,Atlanta Hawks,12.0,SG,35.0,6-4,190.0,Kansas,2854940.0
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0
313,Kris Humphries,Atlanta Hawks,43.0,PF,31.0,6-9,235.0,Minnesota,1000000.0
314,Kyle Korver,Atlanta Hawks,26.0,SG,35.0,6-7,212.0,Creighton,5746479.0
315,Paul Millsap,Atlanta Hawks,4.0,PF,31.0,6-8,246.0,Louisiana Tech,18671659.0
316,Mike Muscala,Atlanta Hawks,31.0,PF,24.0,6-11,240.0,Bucknell,947276.0
317,Lamar Patterson,Atlanta Hawks,13.0,SG,24.0,6-5,225.0,Pittsburgh,525093.0
318,Dennis Schroder,Atlanta Hawks,17.0,PG,22.0,6-1,172.0,,1763400.0
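
`get_group()` selects the same rows as a boolean filter on the grouping column. The sketch below shows this on hypothetical data (the `toy` frame is invented for illustration) and also shows that asking for a key that does not exist raises `KeyError`.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'B', 'A'],
    'Name': ['x', 'y', 'z'],
})
grouped = toy.groupby('Team')

# Rows of group 'A', with their original row labels
via_get_group = grouped.get_group('A')

# The same rows selected with a boolean mask
via_mask = toy[toy['Team'] == 'A']

print(list(via_get_group.index), list(via_mask.index))

# A key with no matching group raises KeyError
try:
    grouped.get_group('C')
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
print(missing_key_raised)
```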


**Data Aggregation**

Aggregation functions reduce each group to a summary value, such as a mean or a maximum.

In [8]:
grouped_team.get_group('Atlanta Hawks')['Salary'].mean()

4860196.666666667
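
Aggregations such as `mean()` skip missing values by default, which matters for this dataset because some players (John Holland, for example) have no salary recorded. A minimal sketch on hypothetical data (the `toy` frame is invented for illustration) that also contrasts `count()` with `size()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Salary': [100.0, np.nan, 10.0, 30.0],
})
salary = toy.groupby('Team')['Salary']

# mean() ignores NaN: team A's mean is computed over one value only
means = salary.mean()

# count() counts non-null values, size() counts all rows
counts = salary.count()
sizes = salary.size()

print(pd.DataFrame({'mean': means, 'count': counts, 'size': sizes}))
```

Comparing `count()` with `size()` is a quick way to see how many values an aggregate was actually computed from.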

In [9]:
# Calculating the mean salary per team
mean_salary_team = grouped_team['Salary'].mean()
mean_salary_team

Team
Atlanta Hawks             4.860197e+06
Boston Celtics            4.181505e+06
Brooklyn Nets             3.501898e+06
Charlotte Hornets         5.222728e+06
Chicago Bulls             5.785559e+06
Cleveland Cavaliers       7.642049e+06
Dallas Mavericks          4.746582e+06
Denver Nuggets            4.294424e+06
Detroit Pistons           4.477884e+06
Golden State Warriors     5.924600e+06
Houston Rockets           5.018868e+06
Indiana Pacers            4.450122e+06
Los Angeles Clippers      6.323643e+06
Los Angeles Lakers        4.784695e+06
Memphis Grizzlies         5.467920e+06
Miami Heat                6.347359e+06
Milwaukee Bucks           4.350220e+06
Minnesota Timberwolves    4.593054e+06
New Orleans Pelicans      4.355304e+06
New York Knicks           4.581494e+06
Oklahoma City Thunder     6.251020e+06
Orlando Magic             4.297248e+06
Philadelphia 76ers        2.213778e+06
Phoenix Suns              4.229676e+06
Portland Trail Blazers    3.220121e+06
Sacramento Kings    

In [10]:
grouped_team.get_group('Atlanta Hawks')['Salary'].max()

18671659.0

In [11]:
# Calculating the max salary per team
max_salary_team = grouped_team['Salary'].max()
max_salary_team

Team
Atlanta Hawks             18671659.0
Boston Celtics            12000000.0
Brooklyn Nets             19689000.0
Charlotte Hornets         13500000.0
Chicago Bulls             20093064.0
Cleveland Cavaliers       22970500.0
Dallas Mavericks          16407500.0
Denver Nuggets            14000000.0
Detroit Pistons           16000000.0
Golden State Warriors     15501000.0
Houston Rockets           22359364.0
Indiana Pacers            17120106.0
Los Angeles Clippers      21468695.0
Los Angeles Lakers        25000000.0
Memphis Grizzlies         19688000.0
Miami Heat                22192730.0
Milwaukee Bucks           16407500.0
Minnesota Timberwolves    12700000.0
New Orleans Pelicans      15514031.0
New York Knicks           22875000.0
Oklahoma City Thunder     20158622.0
Orlando Magic             11250000.0
Philadelphia 76ers         6500000.0
Phoenix Suns              13500000.0
Portland Trail Blazers     8042895.0
Sacramento Kings          15851950.0
San Antonio Spurs         1968900

**Applying multiple aggregation functions**

In [12]:
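# String names select the pandas implementations and avoid the FutureWarning raised for np.sum, np.mean and np.std
grouped_team['Salary'].aggregate(['sum', 'mean', 'std'])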


Team,sum,mean,std
Atlanta Hawks,72902950.0,4860197.0,5194508.0
Boston Celtics,58541068.0,4181505.0,3146033.0
Brooklyn Nets,52528475.0,3501898.0,5317817.0
Charlotte Hornets,78340920.0,5222728.0,4538601.0
Chicago Bulls,86783378.0,5785559.0,6251088.0
Cleveland Cavaliers,106988689.0,7642049.0,7730329.0
Dallas Mavericks,71198732.0,4746582.0,5030279.0
Denver Nuggets,60121930.0,4294424.0,4320214.0
Detroit Pistons,67168263.0,4477884.0,4668478.0
Golden State Warriors,88868997.0,5924600.0,5664282.0
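
When the default column names (`sum`, `mean`, `std`) are not descriptive enough, named aggregation lets each output column be labelled explicitly. The sketch below uses hypothetical data and hypothetical column names (`total`, `average`, `spread`); note that the pandas `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'B'],
    'Salary': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Named aggregation: keyword = output column name, value = aggregation
summary = toy.groupby('Team')['Salary'].agg(
    total='sum',
    average='mean',
    spread='std',  # sample standard deviation (ddof=1)
)
print(summary)
```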


In [13]:
# String names avoid the FutureWarning raised for np.mean and np.std
grouped_team['Age'].agg(['mean', 'std'])


Team,mean,std
Atlanta Hawks,28.2,4.229151
Boston Celtics,24.733333,2.840188
Brooklyn Nets,25.6,3.018988
Charlotte Hornets,26.133333,3.159265
Chicago Bulls,27.4,4.188419
Cleveland Cavaliers,29.533333,4.120795
Dallas Mavericks,29.733333,3.712271
Denver Nuggets,25.733333,4.742915
Detroit Pistons,26.2,4.443294
Golden State Warriors,27.666667,3.848314


## Grouping by Multiple Columns

In [14]:
grouped_team_position = df.groupby(['Team', 'Position'])
grouped_team_position

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x137f53250>

In [15]:
grouped_team_position.first()

Team,Position,Name,Number,Age,Height,Weight,College,Salary
Atlanta Hawks,C,Al Horford,15.0,30.0,6-10,245.0,Florida,12000000.0
Atlanta Hawks,PF,Kris Humphries,43.0,31.0,6-9,235.0,Minnesota,1000000.0
Atlanta Hawks,PG,Dennis Schroder,17.0,22.0,6-1,172.0,Wake Forest,1763400.0
Atlanta Hawks,SF,Kent Bazemore,24.0,26.0,6-5,201.0,Old Dominion,2000000.0
Atlanta Hawks,SG,Tim Hardaway Jr.,10.0,24.0,6-6,205.0,Michigan,1304520.0
...,...,...,...,...,...,...,...,...
Washington Wizards,C,Marcin Gortat,13.0,32.0,6-11,240.0,North Carolina State,11217391.0
Washington Wizards,PF,Drew Gooden,90.0,34.0,6-10,250.0,Kansas,3300000.0
Washington Wizards,PG,Ramon Sessions,7.0,30.0,6-3,190.0,Nevada,2170465.0
Washington Wizards,SF,Jared Dudley,1.0,30.0,6-7,225.0,Boston College,4375000.0
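
Grouping by several columns produces a result indexed by a `MultiIndex`, with one level per grouping column, and only the key combinations that actually occur in the data appear. A minimal sketch on hypothetical data (the `toy` frame is invented for illustration) showing a tuple lookup and how to turn the index levels back into ordinary columns:

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'A', 'A', 'B'],
    'Position': ['C', 'C', 'PG', 'C'],
    'Age': [30, 24, 22, 28],
})

# Series indexed by (Team, Position)
mean_age = toy.groupby(['Team', 'Position'])['Age'].mean()
print(mean_age)

# Look up one combination with a tuple key
mean_age_a_c = mean_age.loc[('A', 'C')]
print(mean_age_a_c)

# reset_index() moves both index levels back into regular columns
flat = mean_age.reset_index()
print(flat)
```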


In [16]:
grouped_team_position.groups

{('Atlanta Hawks', 'C'): [312, 321, 322], ('Atlanta Hawks', 'PF'): [313, 315, 316, 319], ('Atlanta Hawks', 'PG'): [318, 323], ('Atlanta Hawks', 'SF'): [309, 320], ('Atlanta Hawks', 'SG'): [310, 311, 314, 317], ('Boston Celtics', 'C'): [7, 10, 14], ('Boston Celtics', 'PF'): [4, 5, 6], ('Boston Celtics', 'PG'): [0, 8, 9, 11], ('Boston Celtics', 'SF'): [1], ('Boston Celtics', 'SG'): [2, 3, 12, 13], ('Brooklyn Nets', 'C'): [23, 27], ('Brooklyn Nets', 'PF'): [24, 25, 26, 29], ('Brooklyn Nets', 'PG'): [19, 22, 28], ('Brooklyn Nets', 'SG'): [15, 16, 17, 18, 20, 21], ('Charlotte Hornets', 'C'): [330, 331, 338], ('Charlotte Hornets', 'PF'): [327, 329, 337], ('Charlotte Hornets', 'PG'): [326, 335, 336], ('Charlotte Hornets', 'SF'): [332], ('Charlotte Hornets', 'SG'): [324, 325, 328, 333, 334], ('Chicago Bulls', 'C'): [156, 162], ('Chicago Bulls', 'PF'): [151, 155, 157, 160, 163], ('Chicago Bulls', 'PG'): [152, 164], ('Chicago Bulls', 'SF'): [159, 165], ('Chicago Bulls', 'SG'): [153, 154, 158, 16

In [17]:
for name, group in grouped_team_position:
    print(name, group)

('Atlanta Hawks', 'C')                Name           Team  Number Position   Age Height  Weight  \
312      Al Horford  Atlanta Hawks    15.0        C  30.0   6-10   245.0   
321  Tiago Splitter  Atlanta Hawks    11.0        C  31.0   6-11   245.0   
322  Walter Tavares  Atlanta Hawks    22.0        C  24.0    7-3   260.0   

     College      Salary  
312  Florida  12000000.0  
321      NaN   9756250.0  
322      NaN   1000000.0  
('Atlanta Hawks', 'PF')                Name           Team  Number Position   Age Height  Weight  \
313  Kris Humphries  Atlanta Hawks    43.0       PF  31.0    6-9   235.0   
315    Paul Millsap  Atlanta Hawks     4.0       PF  31.0    6-8   246.0   
316    Mike Muscala  Atlanta Hawks    31.0       PF  24.0   6-11   240.0   
319      Mike Scott  Atlanta Hawks    32.0       PF  27.0    6-8   237.0   

            College      Salary  
313       Minnesota   1000000.0  
315  Louisiana Tech  18671659.0  
316        Bucknell    947276.0  
319        Virginia   3

**Applying multiple aggregation functions**

In [18]:
functions = {'Age': ['mean', 'min', 'max'], 'Salary': ['mean', 'sum']}

grouped_team_position.agg(functions)

,,Age,Age,Age,Salary,Salary
Team,Position,mean,min,max,mean,sum
Atlanta Hawks,C,28.333333,24.0,31.0,7.585417e+06,22756250.0
Atlanta Hawks,PF,28.250000,24.0,31.0,5.988067e+06,23952268.0
Atlanta Hawks,PG,24.500000,22.0,27.0,4.881700e+06,9763400.0
Atlanta Hawks,SF,29.000000,26.0,32.0,3.000000e+06,6000000.0
Atlanta Hawks,SG,29.500000,24.0,35.0,2.607758e+06,10431032.0
...,...,...,...,...,...,...
Washington Wizards,C,30.666667,27.0,33.0,8.163476e+06,24490429.0
Washington Wizards,PF,30.000000,26.0,34.0,5.650000e+06,11300000.0
Washington Wizards,PG,27.500000,25.0,30.0,9.011208e+06,18022415.0
Washington Wizards,SF,25.500000,20.0,30.0,2.789700e+06,11158800.0
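
Passing a dictionary of lists to `agg()` yields two-level column labels such as `('Age', 'mean')`. If single-level names are easier to work with, the levels can be joined. The sketch below uses hypothetical data, and the `_` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv
toy = pd.DataFrame({
    'Team': ['A', 'A', 'B'],
    'Age': [30, 24, 28],
    'Salary': [1.0, 2.0, 3.0],
})

stats = toy.groupby('Team').agg({'Age': ['mean', 'max'], 'Salary': ['sum']})
print(stats.columns.tolist())  # two-level labels as tuples

# Join the two levels of each label into a single string
flat = stats.copy()
flat.columns = ['_'.join(label) for label in stats.columns]
print(flat)
```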


**Selecting a group**

In [19]:
grouped_team_position.get_group(('Atlanta Hawks', 'C'))

,Name,Team,Number,Position,Age,Height,Weight,College,Salary
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
