---
# Grouping for Aggregation, Filtration, and Transformation
---

In [1]:
import numpy as np
import pandas as pd

## Defining an aggregation
In this section, we examine the flights dataset and perform the simplest kind of aggregation: one grouping column, one aggregating column, and one aggregating function.
We will find the average arrival delay for each airline.

In [3]:
flights = pd.read_csv('./flights.csv')
flights.head()

,MONTH,DAY,WEEKDAY,AIRLINE,ORG_AIR,DEST_AIR,SCHED_DEP,DEP_DELAY,AIR_TIME,DIST,SCHED_ARR,ARR_DELAY,DIVERTED,CANCELLED
0,1,1,4,WN,LAX,SLC,1625,58.0,94.0,590,1905,65.0,0,0
1,1,1,4,UA,DEN,IAD,823,7.0,154.0,1452,1333,-13.0,0,0
2,1,1,4,MQ,DFW,VPS,1305,36.0,85.0,641,1453,35.0,0,0
3,1,1,4,AA,DFW,DCA,1555,7.0,126.0,1192,1935,-7.0,0,0
4,1,1,4,WN,LAX,MCI,1720,48.0,166.0,1363,2225,39.0,0,0


Define the grouping column (`AIRLINE`), the aggregating column (`ARR_DELAY`), and the aggregating function (`mean`). Pass the grouping column to the `.groupby` method, then call the `.agg` method with a dictionary that maps the aggregating column to its aggregating function. When you pass a dictionary, `.agg` returns a DataFrame.

In [4]:
(
    flights.groupby(['AIRLINE'])
    .agg({'ARR_DELAY': 'mean'})
)

AIRLINE,ARR_DELAY
AA,5.542661
AS,-0.833333
B6,8.692593
DL,0.339691
EV,7.03458
F9,13.630651
HA,4.972973
MQ,6.860591
NK,18.43607
OO,7.593463

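As a self-contained check of the dictionary form, the sketch below applies the same pattern to a tiny hand-made DataFrame. The `toy` frame and its values are invented for illustration and are not taken from `flights.csv`.

```python
import pandas as pd

# Hypothetical toy data; the delays are made up for illustration
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN', 'WN', 'UA'],
    'ARR_DELAY': [10.0, -4.0, 65.0, 39.0, -13.0],
})

# Dictionary form: {aggregating column: aggregating function}
by_airline = toy.groupby('AIRLINE').agg({'ARR_DELAY': 'mean'})

print(type(by_airline).__name__)  # the dictionary form yields a DataFrame
print(by_airline)
```

The grouping column becomes the index of the result, and the aggregating column keeps its original name.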

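Alternatively, we can select the aggregating column with the index operator and then pass the aggregating function to `.agg`. Passing the function name inside a list, as below, returns a DataFrame with one column per function; passing the bare string `'mean'` would return a Series instead: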

In [5]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].agg(['mean'])
)

AIRLINE,mean
AA,5.542661
AS,-0.833333
B6,8.692593
DL,0.339691
EV,7.03458
F9,13.630651
HA,4.972973
MQ,6.860591
NK,18.43607
OO,7.593463
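The return type depends on whether `.agg` receives a single function or a list of functions. A minimal sketch of the difference, using an invented `toy` frame rather than the flights data:

```python
import pandas as pd

# Hypothetical toy data; the delays are made up for illustration
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN', 'WN', 'UA'],
    'ARR_DELAY': [10.0, -4.0, 65.0, 39.0, -13.0],
})

grouped = toy.groupby('AIRLINE')['ARR_DELAY']

as_series = grouped.agg('mean')   # bare string: returns a Series
as_frame = grouped.agg(['mean'])  # list: returns a DataFrame with a 'mean' column

print(type(as_series).__name__, type(as_frame).__name__)
```

The list form is convenient when you expect to add more functions later, since each function simply becomes another column.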


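You can also pass the NumPy function `np.mean` to `.agg` instead of a string. Recent pandas releases may emit a FutureWarning for this and recommend the string `'mean'` instead: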

In [6]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].agg(np.mean)
    .head()
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64

In this case, you can skip the `.agg` method altogether and call the `.mean` method directly.

In [7]:
(
    flights.groupby('AIRLINE')
    ['ARR_DELAY'].mean()
    .head()
)

AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
Name: ARR_DELAY, dtype: float64
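The two spellings should produce the same Series. A minimal sketch that checks this on made-up data (the `toy` frame is hypothetical, not the flights dataset):

```python
import pandas as pd

# Hypothetical toy data; the delays are made up for illustration
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN', 'WN', 'UA'],
    'ARR_DELAY': [10.0, -4.0, 65.0, 39.0, -13.0],
})

grouped = toy.groupby('AIRLINE')['ARR_DELAY']

via_agg = grouped.agg('mean')  # aggregation named by string
via_method = grouped.mean()    # same aggregation called as a method

# Series.equals compares index and values
print(via_agg.equals(via_method))
```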

In [10]:
flights.groupby('AIRLINE')['ARR_DELAY'].agg([np.std]).head()

AIRLINE,std
AA,43.32316
AS,31.168354
B6,40.221718
DL,32.299471
EV,36.682336
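When reading these values, keep in mind that the pandas groupby standard deviation is the sample standard deviation (`ddof=1`), whereas a plain call to `np.std` on an array defaults to `ddof=0`. The sketch below uses the string `'std'` on invented data and recomputes the result by hand; the `toy` frame and the `by_hand` helper dictionary are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the delays are made up for illustration
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'WN', 'WN'],
    'ARR_DELAY': [10.0, -4.0, 65.0, 39.0],
})

# pandas' own 'std' aggregation (sample standard deviation, ddof=1)
sample_std = toy.groupby('AIRLINE')['ARR_DELAY'].agg('std')

# Recompute each group by hand with NumPy, setting ddof=1 explicitly
by_hand = {
    name: np.std(group.to_numpy(), ddof=1)
    for name, group in toy.groupby('AIRLINE')['ARR_DELAY']
}

print(sample_std)
print(by_hand)
```

Using the string `'std'` keeps the choice of `ddof` explicit on the pandas side, which avoids depending on how a given pandas version treats NumPy callables passed to `.agg`.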


## Grouping and aggregating with multiple columns and functions
