# Solutions

1. Alternate GroupBy Syntax
2. Custom Aggregation
3. Transform and Filter with GroupBy

# 1. Alternate GroupBy Syntax

# 2. Custom Aggregation

In [1]:
import pandas as pd
pd.options.display.max_columns = 40
flights = pd.read_csv('../../data/flights.csv')
flights.head()

Unnamed: 0,year,month,day,day_of_week,airline,flight_number,tail_number,origin_airport,destination_airport,scheduled_departure,departure_time,departure_delay,taxi_out,wheels_off,scheduled_time,elapsed_time,air_time,distance,wheels_on,taxi_in,scheduled_arrival,arrival_time,arrival_delay,diverted,cancelled,cancellation_reason,air_system_delay,security_delay,airline_delay,late_aircraft_delay,weather_delay
0,2015,1,1,4,WN,1908,N8324A,LAX,SLC,1625,1723.0,58.0,10.0,1733.0,100.0,107.0,94.0,590,2007.0,3.0,1905,2010.0,65.0,0,0,,31.0,0.0,0.0,34.0,0.0
1,2015,1,1,4,UA,581,N448UA,DEN,IAD,823,830.0,7.0,11.0,841.0,190.0,170.0,154.0,1452,1315.0,5.0,1333,1320.0,-13.0,0,0,,,,,,
2,2015,1,1,4,MQ,2851,N645MQ,DFW,VPS,1305,1341.0,36.0,18.0,1359.0,108.0,107.0,85.0,641,1524.0,4.0,1453,1528.0,35.0,0,0,,0.0,0.0,35.0,0.0,0.0
3,2015,1,1,4,AA,383,N3EUAA,DFW,DCA,1555,1602.0,7.0,13.0,1615.0,160.0,146.0,126.0,1192,1921.0,7.0,1935,1928.0,-7.0,0,0,,,,,,
4,2015,1,1,4,WN,3047,N560WN,LAX,MCI,1720,1808.0,48.0,6.0,1814.0,185.0,176.0,166.0,1363,2300.0,4.0,2225,2304.0,39.0,0,0,,0.0,0.0,17.0,22.0,0.0


## Problem 1
<span  style="color:green; font-size:16px">What are the 3 least common airlines?</span>

In [2]:
flights['airline'].value_counts().tail(3)

AS    768
B6    543
HA    112
Name: airline, dtype: int64

## Problem 2
<span  style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on the 4th day of the week. Use a custom aggregation function.</span>

In [3]:
def day_pct(s):
    return (s == 4).mean()

flights.groupby('airline')['day_of_week'].agg(day_pct)

airline
AA    0.149775
AS    0.143229
B6    0.154696
DL    0.145835
EV    0.139638
F9    0.141230
HA    0.133929
MQ    0.161913
NK    0.149736
OO    0.144809
UA    0.149641
US    0.156656
VX    0.141994
WN    0.150154
Name: day_of_week, dtype: float64
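
The custom function works because comparing a Series to 4 produces booleans, and the mean of a boolean Series is the share of `True` values. Here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and is not taken from `flights.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for the flights DataFrame
toy = pd.DataFrame({
    'airline': ['AA', 'AA', 'AA', 'AA', 'WN', 'WN', 'WN', 'WN'],
    'day_of_week': [4, 1, 4, 2, 4, 7, 3, 5],
})

def day_pct(s):
    # (s == 4) is a boolean Series; its mean is the fraction of True values
    return (s == 4).mean()

result = toy.groupby('airline')['day_of_week'].agg(day_pct)
print(result)
```

The values are proportions between 0 and 1; multiply by 100 if a percentage is wanted.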

## Problem 3
<span  style="color:green; font-size:16px">Redo problem 2 without using a custom aggregation function. What is the performance difference?</span>

In [4]:
flights['is_4th'] = flights['day_of_week'] == 4
flights.groupby('airline')['is_4th'].mean()

airline
AA    0.149775
AS    0.143229
B6    0.154696
DL    0.145835
EV    0.139638
F9    0.141230
HA    0.133929
MQ    0.161913
NK    0.149736
OO    0.144809
UA    0.149641
US    0.156656
VX    0.141994
WN    0.150154
Name: is_4th, dtype: float64

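The built-in `mean` takes roughly half the time of the custom aggregation (about 3.7 ms versus 8.0 ms in the timings below).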

In [5]:
%timeit -n 5 flights.groupby('airline')['day_of_week'].agg(day_pct)

7.98 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [6]:
%%timeit -n 5
flights['is_4th'] = flights['day_of_week'] == 4
flights.groupby('airline')['is_4th'].mean()

3.69 ms ± 274 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
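
The timed cell above includes the cost of creating the `is_4th` column, and it leaves that column in the DataFrame. One possible variation, sketched on hypothetical toy data, builds the boolean Series on the fly and groups it by the airline column directly:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights DataFrame
toy = pd.DataFrame({
    'airline': ['AA', 'AA', 'AA', 'AA', 'WN', 'WN', 'WN', 'WN'],
    'day_of_week': [4, 1, 4, 2, 4, 7, 3, 5],
})

# Group a temporary boolean Series by the airline column;
# no helper column is added to the DataFrame
no_helper = toy['day_of_week'].eq(4).groupby(toy['airline']).mean()

# Custom aggregation from problem 2, for comparison
custom = toy.groupby('airline')['day_of_week'].agg(lambda s: (s == 4).mean())

print(no_helper)
print((no_helper == custom).all())
```

Both approaches return the same proportions; only the built-in `mean` path avoids calling a Python function once per group.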


## Problem 4
<span  style="color:green; font-size:16px">The range of undergrad populations per state was calculated using the `min_max` custom function from the top of this notebook. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

In [7]:
def min_max(s):
    return s.max() - s.min()

In [8]:
flights.groupby('airline')['distance'].agg(min_max)

airline
AA    3609
AS    2425
B6    2473
DL    4396
EV    1256
F9    1845
HA     579
MQ    1161
NK    2145
OO    1668
UA    4135
US    2753
VX    2468
WN    2132
Name: distance, dtype: int64

In [9]:
dist_min_max = flights.groupby('airline').agg({'distance': ['min', 'max']}).reset_index()
dist_min_max.columns = ['airline', 'min dist', 'max dist']
dist_min_max['dist range'] = dist_min_max['max dist'] - dist_min_max['min dist']
dist_min_max

Unnamed: 0,airline,min dist,max dist,dist range
0,AA,175,3784,3609
1,AS,421,2846,2425
2,B6,231,2704,2473
3,DL,106,4502,4396
4,EV,74,1330,1256
5,F9,373,2218,1845
6,HA,2338,2917,579
7,MQ,89,1250,1161
8,NK,236,2381,2145
9,OO,67,1735,1668
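
A shorter route to the same range, shown here as a sketch rather than the notebook's own solution, is to subtract the grouped minimum from the grouped maximum. The rows in `toy` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights DataFrame
toy = pd.DataFrame({
    'airline': ['AA', 'AA', 'AA', 'HA', 'HA'],
    'distance': [175, 3784, 1192, 2338, 2917],
})

grouped = toy.groupby('airline')['distance']

# Built-in max and min align on the airline index, so they subtract directly
dist_range = grouped.max() - grouped.min()

# Custom function from this problem, for comparison
def min_max(s):
    return s.max() - s.min()

custom_range = grouped.agg(min_max)

print(dist_range)
```

This avoids renaming columns and resetting the index when only the range itself is needed.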


## Problem 5
<span  style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use one of the direct [GroupBy methods][1].</span>

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

In [10]:
flights.groupby('airline').nth([0, -1]).head(10)

airline,air_system_delay,air_time,airline_delay,arrival_delay,arrival_time,cancellation_reason,cancelled,day,day_of_week,departure_delay,departure_time,destination_airport,distance,diverted,elapsed_time,flight_number,is_4th,late_aircraft_delay,month,origin_airport,scheduled_arrival,scheduled_departure,scheduled_time,security_delay,tail_number,taxi_in,taxi_out,weather_delay,wheels_off,wheels_on,year
AA,,126.0,,-7.0,1928.0,,0,1,4,7.0,1602.0,DCA,1192,0,146.0,383,True,,1,DFW,1935,1555,160.0,,N3EUAA,7.0,13.0,,1615.0,1921.0,2015
AA,,166.0,,-19.0,1026.0,,0,31,4,5.0,520.0,DFW,1464,0,186.0,1454,True,,12,SFO,1045,515,210.0,,N852AA,10.0,10.0,,530.0,1016.0,2015
AS,,127.0,,-25.0,1644.0,,0,31,4,-8.0,1412.0,SEA,954,0,152.0,323,True,,12,LAX,1709,1420,169.0,,N323AS,6.0,19.0,,1431.0,1638.0,2015
AS,,155.0,,-3.0,1659.0,,0,1,4,-2.0,1503.0,SEA,1107,0,176.0,633,True,,1,PHX,1702,1505,177.0,,N320AS,4.0,17.0,,1520.0,1655.0,2015
B6,,231.0,,-45.0,430.0,,0,31,4,-12.0,2224.0,BOS,2300,0,246.0,602,True,,12,PHX,515,2236,279.0,,N625JB,3.0,12.0,,2236.0,427.0,2015
B6,,246.0,,-27.0,1959.0,,0,1,4,0.0,1230.0,BOS,2381,0,269.0,178,True,,1,LAS,2026,1230,296.0,,N625JB,4.0,19.0,,1249.0,1955.0,2015
DL,,156.0,,-18.0,1202.0,,0,1,4,-5.0,708.0,MSP,1299,0,174.0,1550,True,,1,LAS,1220,713,187.0,,N3739P,6.0,12.0,,720.0,1156.0,2015
DL,,64.0,,-8.0,2330.0,,0,31,4,2.0,2208.0,CMH,447,0,82.0,1640,True,,12,ATL,2338,2206,92.0,,N841DN,4.0,14.0,,2222.0,2326.0,2015
EV,,52.0,,14.0,1026.0,,0,31,4,21.0,911.0,LFT,351,0,75.0,2758,True,,12,DFW,1012,850,82.0,,N633AE,4.0,19.0,,930.0,1022.0,2015
EV,,113.0,,5.0,1408.0,,0,1,4,6.0,1201.0,JAN,677,0,127.0,4589,True,,1,ORD,1403,1155,128.0,,N13992,4.0,10.0,,1211.0,1404.0,2015
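
How `nth` labels its result depends on the pandas version: from pandas 2.0 onward it behaves like a filter and keeps the original row labels instead of indexing by the group key. A version-independent sketch, using made-up rows in `toy`, combines `head(1)` and `tail(1)`:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights DataFrame
toy = pd.DataFrame({
    'airline': ['WN', 'UA', 'WN', 'UA', 'WN'],
    'flight_number': [1908, 581, 3047, 1197, 22],
})

grouped = toy.groupby('airline')

# head(1) and tail(1) keep the original row labels and all columns;
# a stable sort keeps each airline's first row ahead of its last row
first_last = (
    pd.concat([grouped.head(1), grouped.tail(1)])
    .sort_values('airline', kind='stable')
)

print(first_last)
```

A group with a single row would appear twice in this result, so drop duplicate index labels if that matters. Note also that the `first` and `last` GroupBy methods return the first and last non-missing value of each column, which is not always the same thing as the first and last row.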


# 3. Transform and Filter with GroupBy

## Problem 1
<span  style="color:green; font-size:16px">Filter the college DataFrame for states that have more than 500,000 total undergraduate students. Can you verify your results?</span>

In [11]:
pd.options.display.max_columns = 100
college = pd.read_csv('../../data/college.csv')
college.head(3)

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [12]:
college_large = college.groupby('stabbr').filter(lambda sub_df: sub_df['ugds'].sum() > 500000)
college_large.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
43,Prince Institute-Southeast,Elmhurst,IL,0.0,0.0,0.0,0,,,0.0,84.0,0.7976,0.131,0.0714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.7857,0.9375,0.6569,PrivacySuppressed,20992
68,Everest College-Phoenix,Phoenix,AZ,0.0,0.0,0.0,1,,,0.0,4102.0,0.3162,0.4405,0.0763,0.0017,0.0207,0.0046,0.0373,0.0,0.1026,0.4749,0,0.8291,0.7151,0.67,28600,9500
69,Collins College,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,83.0,0.3253,0.0843,0.1566,0.0,0.0241,0.0,0.0241,0.0,0.3855,0.3373,0,0.7205,0.8228,0.4764,25700,47000
70,Empire Beauty School-Paradise Valley,Phoenix,AZ,0.0,0.0,0.0,1,,,0.0,25.0,0.76,0.04,0.12,0.0,0.0,0.04,0.04,0.0,0.0,0.16,0,0.6349,0.5873,0.4651,17800,9588
71,Empire Beauty School-Tucson,Tucson,AZ,0.0,0.0,0.0,0,,,0.0,126.0,0.2143,0.0873,0.5794,0.0159,0.0873,0.0079,0.0,0.0,0.0079,0.2222,1,0.7962,0.6615,0.4229,18200,9833


In [13]:
college_large.groupby('stabbr').agg({'ugds':'sum'}).sort_values('ugds', ascending=False).round(-3)

stabbr,ugds
CA,2304000.0
TX,1277000.0
NY,994000.0
FL,960000.0
PA,605000.0
IL,600000.0
OH,538000.0
AZ,520000.0
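
Another way to cross-check the filter, sketched on hypothetical data, is to broadcast each state's total back to its rows with `transform` and use the result as a boolean mask. Both routes should keep exactly the same rows:

```python
import pandas as pd

# Hypothetical toy data standing in for the college DataFrame
toy = pd.DataFrame({
    'stabbr': ['CA', 'CA', 'VT', 'TX', 'TX'],
    'ugds': [400000.0, 200000.0, 30000.0, 300000.0, 250000.0],
})

# Approach used in this notebook: filter whole groups with a function
via_filter = toy.groupby('stabbr').filter(
    lambda sub_df: sub_df['ugds'].sum() > 500000
)

# Cross-check: per-row state total, then ordinary boolean selection
state_total = toy.groupby('stabbr')['ugds'].transform('sum')
via_mask = toy[state_total > 500000]

print(via_filter.equals(via_mask))
```

The mask version tends to be faster on large frames because it avoids calling a Python function for every group, though that is not measured here.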


## Problem 2
<span  style="color:green; font-size:16px">Read in the employee dataset. Filter it so that only position titles with an average salary above $100,000 remain. Can you verify your results?</span>

In [14]:
emp = pd.read_csv('../../data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,2012-03-26
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,2013-11-04


In [15]:
high_sal = emp.groupby('title').filter(lambda sub_df: sub_df['salary'].mean() > 100000)
high_sal.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date
5,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,Public Works & Engineering-PWE,107962.0,White,Male,1993-11-15
8,"CHIEF PHYSICIAN,MD",Health & Human Services,180416.0,Black,Male,1987-05-22
37,ASSOCIATE EMS PHYSICIAN DIRECTOR,Houston Fire Department (HFD),165216.0,Hispanic,Male,2013-08-31
59,"PUBLIC HEALTH DENTIST,DDS",Health & Human Services,100791.0,White,Female,2015-12-28
147,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),Houston Airport System (HAS),120916.0,White,Male,2004-06-07


In [16]:
high_sal.groupby('title').agg({'salary':'mean'}).astype('int').sort_values('salary', ascending=False)

title,salary
"ASSOCIATE EMS PHYSICIAN DIRECTOR,MD",210588
DEPUTY DIRECTOR-FINANCE & ADMINISTRATION,199596
DEPUTY DIRECTOR-AVIATION (EX LVL),186192
"CHIEF PHYSICIAN,MD",180416
DEPUTY DIRECTOR-PUBLIC WORKS (EXECUTIVE,178331
ASSOCIATE EMS PHYSICIAN DIRECTOR,165216
DEPUTY DIRECTOR (EXECUTIVE LEVEL),156822
ASSISTANT DIRECTOR-PUBLIC WORKS (EXECUTI,144044
EXECUTIVE ASSISTANT FIRE CHIEF,130585
ASSISTANT DIRECTOR (EXECUTIVE LEVEL),123440


## Problem 3
<span  style="color:green; font-size:16px">Filter the employee dataset so that only position titles with at least 5 employees and an average salary above $80,000 remain. Can you verify the results?</span>

In [17]:
def sal_count(sub_df):
    return sub_df['salary'].mean() > 80000 and len(sub_df) >= 5

In [18]:
high_sal_count = emp.groupby('title').filter(sal_count)
high_sal_count.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date
11,POLICE SERGEANT,Houston Police Department-HPD,77076.0,Black,Male,2001-06-04
14,POLICE SERGEANT,Houston Police Department-HPD,81239.0,White,Male,1995-06-19
24,POLICE SERGEANT,Houston Police Department-HPD,81239.0,White,Male,1978-03-13
45,POLICE SERGEANT,Houston Police Department-HPD,81239.0,Hispanic,Male,1983-09-12
47,DIVISION MANAGER,Houston Airport System (HAS),86297.0,Hispanic,Male,1992-12-14


In [19]:
high_sal_count.groupby('title').agg({'salary': ['mean', 'count']}).astype('int')

title,salary (mean),salary (count)
ADMINISTRATION MANAGER,80859,7
DISTRICT CHIEF,89172,11
DIVISION MANAGER,91765,7
MANAGING ENGINEER,106794,5
POLICE CAPTAIN,104455,5
POLICE LIEUTENANT,90185,14
POLICE SERGEANT,80132,77
SUPERVISING ENGINEER,96815,8
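
The check above confirms that the kept titles meet both conditions, but not that every qualifying title was kept. A sketch of a two-way check, on made-up data, compares the titles that survive the filter with the titles selected from a per-title summary:

```python
import pandas as pd

# Hypothetical toy data standing in for the employee DataFrame
toy = pd.DataFrame({
    'title': ['SERGEANT'] * 5 + ['CHIEF'] * 2 + ['CLERK'] * 5,
    'salary': [81000.0, 82000.0, 80500.0, 79000.0, 83000.0,
               150000.0, 160000.0,
               40000.0, 42000.0, 41000.0, 39000.0, 43000.0],
})

def sal_count(sub_df):
    return sub_df['salary'].mean() > 80000 and len(sub_df) >= 5

kept = toy.groupby('title').filter(sal_count)

# Independent summary of every title, not only the kept ones
stats = toy.groupby('title')['salary'].agg(['mean', 'count'])
expected_titles = stats[(stats['mean'] > 80000) & (stats['count'] >= 5)].index

print(set(kept['title']) == set(expected_titles))
```

One caveat: `count` ignores missing salaries while `len(sub_df)` counts every row, so the two can disagree for titles with missing salary values.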


## Problem 4
<span  style="color:green; font-size:16px">Add a new column, **pct_max_dept_gender**, to the employee DataFrame that holds each employee's salary as a proportion of the maximum salary for their department and gender. For instance, if a male HPD employee makes 80,000 and the maximum male HPD salary is 120,000, then the value for this employee would be 80,000/120,000, or about 0.667. Verify this value for the first employee.</span>

In [20]:
def pct_max(sub_series):
    return sub_series / sub_series.max()

In [21]:
emp['pct_max_dept_gender'] = emp.groupby(['dept', 'gender'])['salary'].transform(pct_max)
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,pct_max_dept_gender
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,0.226853
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,0.299951
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26,0.333744
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,2012-03-26,0.504974
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,2013-11-04,0.227668


In [22]:
filt = (emp['dept'] == 'Houston Police Department-HPD') & (emp['gender'] == 'Male')
max_sal = emp.loc[filt, 'salary'].max()
max_sal

199596.0

In [23]:
emp.loc[0, 'salary'] / max_sal

0.22685324355197498
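
The manual check covers only the first employee. As a sketch on hypothetical data, every row can be verified at once by broadcasting each group's maximum with `transform('max')` and dividing:

```python
import pandas as pd

# Hypothetical toy data standing in for the employee DataFrame
toy = pd.DataFrame({
    'dept': ['HPD', 'HPD', 'HPD', 'HFD', 'HFD'],
    'gender': ['Male', 'Male', 'Female', 'Male', 'Male'],
    'salary': [80000.0, 120000.0, 90000.0, 60000.0, 75000.0],
})

def pct_max(sub_series):
    return sub_series / sub_series.max()

# Select the salary column before transforming so only it is processed
toy['pct_max_dept_gender'] = (
    toy.groupby(['dept', 'gender'])['salary'].transform(pct_max)
)

# Independent check: per-row group maximum, then plain division
group_max = toy.groupby(['dept', 'gender'])['salary'].transform('max')
check = toy['salary'] / group_max

print((toy['pct_max_dept_gender'] - check).abs().max())
```

The first toy row reproduces the 80,000/120,000 example from the problem statement.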