# Solutions

1. [Groupby Aggregation Basics](#1.-Groupby-Aggregation-Basics)
1. [Grouping and Aggregating with Multiple Columns](#2.-Grouping-and-Aggregating-with-Multiple-Columns)
1. [Grouping with Pivot Tables](#3.-Grouping-with-Pivot-Tables)
1. [Counting with Crosstabs](#4.-Counting-with-Crosstabs)

# 1. Groupby Aggregation Basics

In [1]:
import pandas as pd
nyc = pd.read_csv('../data/nyc_deaths.csv')
nyc.head()

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71
3,2007,Accidents,F,White,162
4,2007,Accidents,M,Asian,53


### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

In [2]:
year_deaths = nyc.groupby('year').agg({'deaths':'sum'})
year_deaths.idxmax()

deaths    2008
dtype: int64

In [3]:
# one line
nyc.groupby('year').agg({'deaths':'sum'}).idxmax()

deaths    2008
dtype: int64
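
The result above is a one-element Series because `agg` with a dictionary returns a DataFrame. As a minimal sketch, selecting the column before aggregating returns the year as a plain scalar instead. The `toy` table and its values are invented for illustration and are not taken from `nyc_deaths.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the nyc DataFrame
toy = pd.DataFrame({
    'year':   [2007, 2007, 2008, 2008, 2009],
    'deaths': [10, 20, 25, 15, 5],
})

# DataFrame result: idxmax returns a one-element Series
by_year_df = toy.groupby('year').agg({'deaths': 'sum'})
top_year_series = by_year_df.idxmax()

# Series result: idxmax returns the index label itself
top_year = toy.groupby('year')['deaths'].sum().idxmax()

print(top_year_series)
print(top_year)
```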

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [4]:
nyc.groupby('race').agg({'deaths':'sum'}).sort_values('deaths', ascending=False)

Unnamed: 0_level_0,deaths
race,Unnamed: 1_level_1
White,206487
Black,111116
Hispanic,74802
Asian,26355
Unknown,6238


### Use the employee dataset for the remaining problems

In [5]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,2012-03-26
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,2013-11-04


### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

In [6]:
emp.groupby('gender').agg({'salary':'max'})

Unnamed: 0_level_0,salary
gender,Unnamed: 1_level_1
Female,178331.0
Male,210588.0


### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [7]:
emp.groupby('dept').agg({'salary':'median'}).head()

Unnamed: 0_level_0,salary
dept,Unnamed: 1_level_1
Health & Human Services,46717.0
Houston Airport System (HAS),41953.5
Houston Fire Department (HFD),61921.0
Houston Police Department-HPD,61643.0
Parks & Recreation,33634.0


### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [8]:
emp.groupby('race').agg({'salary':'mean'}).reset_index()

Unnamed: 0,race,salary
0,Asian,60143.218391
1,Black,50366.588803
2,Hispanic,52533.456693
3,Native American,64562.142857
4,White,63834.575646
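
An alternative to chaining `reset_index` is the `as_index=False` parameter of `groupby`, which keeps the grouping column as a regular column from the start. The sketch below uses a small made-up table (`toy_emp` is hypothetical, not the employee dataset) to show that both spellings give the same DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for the emp DataFrame
toy_emp = pd.DataFrame({
    'race':   ['Asian', 'Black', 'Asian', 'White'],
    'salary': [60000.0, 50000.0, 70000.0, 65000.0],
})

# Group, aggregate, then move the index into a column
with_reset = toy_emp.groupby('race').agg({'salary': 'mean'}).reset_index()

# Ask groupby not to put the grouping column in the index at all
with_flag = toy_emp.groupby('race', as_index=False).agg({'salary': 'mean'})

print(with_flag)
```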


# 2. Grouping and Aggregating with Multiple Columns

In [10]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp['experience'] = 2016 - emp['hire_date'].dt.year
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,experience
0,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,1
1,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,34
2,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black,Male,1984-11-26,32
3,ENGINEER,Public Works & Engineering-PWE,71680.0,Asian,Male,2012-03-26,4
4,CARPENTER,Houston Airport System (HAS),42390.0,White,Male,2013-11-04,3


### Problem 1
<span  style="color:green; font-size:16px">For each department and gender find the number of unique position titles, the total number of employees and the average salary. Make sure there is no multi-index for the index or columns.</span>

In [11]:
data = emp.groupby(['dept', 'gender']).agg({'title':['nunique','size'],
                                            'salary':'mean'}).reset_index()
data.columns = ['dept', 'gender', 'num unique positions', 'size', 'mean salary']
data.head(10)

Unnamed: 0,dept,gender,num unique positions,size,mean salary
0,Health & Human Services,Female,47,82,48661.961538
1,Health & Human Services,Male,22,26,59240.0
2,Houston Airport System (HAS),Female,24,36,53174.194444
3,Houston Airport System (HAS),Male,37,70,54358.171429
4,Houston Fire Department (HFD),Female,13,21,52853.047619
5,Houston Fire Department (HFD),Male,26,363,59930.56447
6,Houston Police Department-HPD,Female,38,155,52219.92517
7,Houston Police Department-HPD,Male,27,483,63032.841121
8,Parks & Recreation,Female,15,23,40361.055556
9,Parks & Recreation,Male,20,51,38396.243243
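
Instead of overwriting `data.columns` by hand, named aggregation (available in pandas 0.25 and later) lets you name each output column inside `agg`, which also avoids the column MultiIndex. This is a sketch on a made-up table. `toy_emp` and the output column names are chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the emp DataFrame
toy_emp = pd.DataFrame({
    'dept':   ['HPD', 'HPD', 'HPD', 'HFD', 'HFD'],
    'gender': ['Male', 'Male', 'Female', 'Male', 'Male'],
    'title':  ['OFFICER', 'SERGEANT', 'OFFICER', 'ENGINEER', 'ENGINEER'],
    'salary': [50000.0, 70000.0, 55000.0, 60000.0, 62000.0],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = (
    toy_emp.groupby(['dept', 'gender'])
    .agg(num_unique_positions=('title', 'nunique'),
         num_employees=('title', 'size'),
         mean_salary=('salary', 'mean'))
    .reset_index()
)

print(summary)
```

The column names must be valid Python identifiers in this form, so names with spaces such as `'num unique positions'` would still need a rename afterwards.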


### Problem 2
<span  style="color:green; font-size:16px">For each department, race and gender find the maximum years of experience and salary.</span>

In [12]:
emp.groupby(['dept','race','gender']).agg({'experience': 'max',
                                           'salary': 'max'}).reset_index().head(10)

Unnamed: 0,dept,race,gender,experience,salary
0,Health & Human Services,Asian,Female,23,94149.0
1,Health & Human Services,Asian,Male,25,70864.0
2,Health & Human Services,Black,Female,34,103270.0
3,Health & Human Services,Black,Male,29,180416.0
4,Health & Human Services,Hispanic,Female,25,65589.0
5,Health & Human Services,Hispanic,Male,14,58406.0
6,Health & Human Services,Native American,Female,17,58855.0
7,Health & Human Services,White,Female,33,100791.0
8,Health & Human Services,White,Male,8,120799.0
9,Houston Airport System (HAS),Asian,Female,23,32157.0


## Use the college dataset for the rest of the problems

In [13]:
college = pd.read_csv('../data/college.csv')
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Problem 3
<span  style="color:green; font-size:16px">Which city name appears the most frequently? Do this in two different ways: once with and once without the `groupby` method.</span>

In [14]:
size = college.groupby('city').agg({'stabbr': 'size'})
size.head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
ARTESIA,1
Aberdeen,3
Abilene,5
Abingdon,2
Abington,1


In [15]:
size.sort_values('stabbr', ascending=False).head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
New York,87
Chicago,78
Houston,72
Los Angeles,56
Miami,51


You can also call `size` directly on the groupby object and sort the resulting Series.

In [16]:
college.groupby('city').size().sort_values(ascending=False).head()

city
New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
dtype: int64

### Without groupby
Just use **`value_counts`**, which is much easier.

In [17]:
college['city'].value_counts().head()

New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
Name: city, dtype: int64
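
To see that the two approaches count the same thing, here is a small sketch on an invented `city` column (`toy_college` is hypothetical, not the college dataset). Both results hold the same counts. `value_counts` simply sorts them in descending order for you.

```python
import pandas as pd

# Hypothetical toy data standing in for the college DataFrame
toy_college = pd.DataFrame({
    'city': ['Houston', 'Austin', 'Houston', 'Dallas', 'Houston', 'Austin'],
})

# With groupby: count rows per city, then sort manually
by_groupby = toy_college.groupby('city').size().sort_values(ascending=False)

# Without groupby: value_counts counts and sorts in one step
by_value_counts = toy_college['city'].value_counts()

print(by_groupby)
print(by_value_counts)
```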

### Problem 4
<span  style="color:green; font-size:16px">Does the city **`Houston`** only appear in the state of **`Texas`**?</span>

NO! It also appears in Missouri.

In [18]:
filt = college['city'] == 'Houston'
college.loc[filt, 'stabbr'].unique()

array(['TX', 'MO'], dtype=object)

You can see the exact counts with `value_counts`:

In [19]:
college.loc[filt, 'stabbr'].value_counts()

TX    71
MO     1
Name: stabbr, dtype: int64

You can use a groupby and find the number of unique states for each city. This is not very efficient.

In [20]:
city_unique_state = college.groupby('city').agg({'stabbr': 'nunique'})
city_unique_state.head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
ARTESIA,1
Aberdeen,2
Abilene,1
Abingdon,1
Abington,1


In [21]:
city_unique_state.loc['Houston']

stabbr    2
Name: Houston, dtype: int64

### Problem 5
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [22]:
college.groupby('stabbr').agg({'ugds': 'max'}).head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0
CA,44744.0
CO,25873.0
CT,18016.0
DC,10433.0
DE,18222.0


### Problem 6
<span  style="color:green; font-size:16px">Among colleges that have the largest undergrad population for each state, what is the difference between the most and least populous college?</span>

In [23]:
# reuses the groupby from Problem 5
largest_per_state = college.groupby('stabbr').agg({'ugds': 'max'})
largest_per_state.max() - largest_per_state.min()

ugds    150956.0
dtype: float64

### Problem 7: Advanced
<span  style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

The following returns the index of the maximum value of population for each state. 

In [24]:
max_indexes = college.groupby('stabbr').agg({'ugds': 'idxmax'})
max_indexes.head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,60
AL,5
AR,137
AS,4138
AZ,7116
CA,1299
CO,574
CT,641
DC,701
DE,691


For instance, the row with index label 60 has the maximum population for Alaska. Let's verify this by selecting the institution name and population of this specific row.

In [25]:
cols = ['instnm', 'ugds']
college.loc[60, cols]

instnm    University of Alaska Anchorage
ugds                               12865
Name: 60, dtype: object

Verify by selecting only Alaska colleges and getting the max value:

In [26]:
filt = college['stabbr'] == 'AK'
college.loc[filt, 'ugds'].max()

12865.0

We need to get the index locations as a Series or a NumPy array to use with **`.loc`**. Currently **`max_indexes`** is a DataFrame.

In [27]:
locs = max_indexes['ugds']
locs.head()

stabbr
AK      60
AL       5
AR     137
AS    4138
AZ    7116
Name: ugds, dtype: int64

We can pass this Series to **`.loc`** which will select just those indexes, along with the columns we want.

In [28]:
cols = ['stabbr', 'instnm', 'ugds']
college.loc[locs, cols].head()

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
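
The `idxmax` then `.loc` pattern can be sketched end to end on a tiny made-up table. The names `toy_college`, `College A`, and so on are hypothetical and only illustrate the steps above.

```python
import pandas as pd

# Hypothetical toy data standing in for the college DataFrame
toy_college = pd.DataFrame({
    'stabbr': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'instnm': ['College A', 'College B', 'College C', 'College D', 'College E'],
    'ugds':   [100.0, 300.0, 500.0, 200.0, 50.0],
})

# Selecting the column first makes idxmax return a Series of row labels
locs = toy_college.groupby('stabbr')['ugds'].idxmax()

# Use those row labels to pull the full rows for the largest colleges
largest = toy_college.loc[locs, ['stabbr', 'instnm', 'ugds']]

print(largest)
```

Selecting `['ugds']` before calling `idxmax` skips the intermediate DataFrame, so there is no need to extract the column afterwards.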


### Alternative method if the index is `instnm`

In [29]:
college_instm = college.set_index('instnm')
cols = ['stabbr', 'ugds']
college_instm = college_instm[cols]
college_instm.head()

Unnamed: 0_level_0,stabbr,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,AL,4206.0
University of Alabama at Birmingham,AL,11383.0
Amridge University,AL,291.0
University of Alabama in Huntsville,AL,5451.0
Alabama State University,AL,4811.0


In [30]:
# group by state and use idxmax
max_colleges = college_instm.groupby('stabbr').agg({'ugds': 'idxmax'})
max_colleges.head()
max_indexes = max_colleges['ugds']
college_instm.loc[max_indexes].head()

Unnamed: 0_level_0,stabbr,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
University of Alaska Anchorage,AK,12865.0
The University of Alabama,AL,29851.0
University of Arkansas,AR,21405.0
American Samoa Community College,AS,1276.0
University of Phoenix-Arizona,AZ,151558.0


## Yet another way
Use the **`first`** groupby method to return the first row of each group after sorting.

In [31]:
cols = ['stabbr', 'instnm', 'ugds']
college_trim = college[cols]

# sort by state then by population descending
college_trim_sort = college_trim.sort_values(['stabbr', 'ugds'], ascending=[True, False])


# group by state and take the first in the group
college_trim_sort.groupby('stabbr').first().head()

Unnamed: 0_level_0,instnm,ugds
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,University of Alaska Anchorage,12865.0
AL,The University of Alabama,29851.0
AR,University of Arkansas,21405.0
AS,American Samoa Community College,1276.0
AZ,University of Phoenix-Arizona,151558.0


## Use `sort_values` with `drop_duplicates`
We've done this in previous notebooks. No grouping.

In [32]:
college_trim.sort_values(['stabbr', 'ugds'], ascending=[True, False]) \
            .drop_duplicates(subset='stabbr').head(10)

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
1299,CA,Ashford University,44744.0
574,CO,University of Colorado Boulder,25873.0
641,CT,University of Connecticut,18016.0
701,DC,George Washington University,10433.0
691,DE,University of Delaware,18222.0
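
As a quick check that the last two approaches agree, the sketch below runs both on a small invented table (`toy_college` and its values are hypothetical). After sorting by state and descending population, the first row of each state is its largest college, so `groupby(...).first()` and `drop_duplicates` pick the same institutions.

```python
import pandas as pd

# Hypothetical toy data standing in for college_trim
toy_college = pd.DataFrame({
    'stabbr': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'instnm': ['College A', 'College B', 'College C', 'College D', 'College E'],
    'ugds':   [100.0, 300.0, 500.0, 200.0, 50.0],
})

# Sort by state, then by population descending within each state
sorted_toy = toy_college.sort_values(['stabbr', 'ugds'], ascending=[True, False])

# Option 1: take the first row of each group (state becomes the index)
via_first = sorted_toy.groupby('stabbr').first()

# Option 2: keep only the first occurrence of each state (original index kept)
via_dedup = sorted_toy.drop_duplicates(subset='stabbr')

print(via_first)
print(via_dedup)
```

One practical difference is the index: `first` returns the state as the index, while `drop_duplicates` keeps the original row labels.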


### Problem 8
<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

In [33]:
# They have more
college.groupby('distanceonly').agg({'ugds': 'mean'})

Unnamed: 0_level_0,ugds
distanceonly,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Problem 9
<span  style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [34]:
# Less
college.groupby('distanceonly').agg({'relaffil': 'mean'})

Unnamed: 0_level_0,relaffil
distanceonly,Unnamed: 1_level_1
0.0,0.149635
1.0,0.05


### Problem 10
<span  style="color:green; font-size:16px">What state has the lowest percentage of currently operating schools of those that have religious affiliation?</span>

In [35]:
filt = college['relaffil'] == 1
cr = college[filt]
rel_oper_mean = cr.groupby('stabbr').agg({'curroper': 'mean'})
rel_oper_mean.head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
AK,1.0
AL,0.916667
AR,0.944444
AZ,0.444444
CA,0.585366


In [36]:
# Utah. Answer makes sense.
rel_oper_mean.sort_values('curroper').head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
UT,0.4
AZ,0.444444
NV,0.5
CA,0.585366
CT,0.647059


### Problem 11
<span  style="color:green; font-size:16px">Trim the **`college`** DataFrame to only the 'race' columns - those beginning with **`ugds_`**. Create a new column called **`ugds_other`** that is the sum of any race column that averages under 4% for the entire dataset.</span>

In [37]:
pd.options.display.max_columns = 100

In [38]:
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [39]:
# trim dataframe; copy it so a new column can be added later without a SettingWithCopyWarning
df_race = college.loc[:, 'ugds_white':'ugds_unkn'].copy()

race_average = df_race.mean()

race_average

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

In [40]:
# keep only those less than 4%
other_race = race_average[race_average < .04]

other_race

ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
dtype: float64

In [41]:
# get the column names
race_columns = other_race.index

race_columns

Index(['ugds_asian', 'ugds_aian', 'ugds_nhpi', 'ugds_2mor', 'ugds_nra'], dtype='object')

In [42]:
# grab the columns and sum across the rows
df_race['ugds_other'] = df_race[race_columns].sum(axis='columns')

# can drop the low percentage columns
df_race.drop(race_columns, axis=1).head(10)

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_unkn,ugds_other
0,0.0333,0.9353,0.0055,0.0138,0.0121
1,0.5922,0.26,0.0283,0.01,0.1094
2,0.299,0.4192,0.0069,0.2715,0.0034
3,0.6988,0.1255,0.0382,0.035,0.1025
4,0.0158,0.9208,0.0121,0.0137,0.0376
5,0.7825,0.1119,0.0348,0.0026,0.0682
6,0.7255,0.2613,0.0044,0.0019,0.0069
7,0.7823,0.12,0.0191,0.0334,0.0451
8,0.5328,0.3376,0.0074,0.0246,0.0975
9,0.8507,0.0704,0.0248,0.014,0.0401
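
The whole sequence for this problem can be condensed into a few lines. The sketch below uses a made-up table with four race columns (`toy_race` and its values are hypothetical) and the same 4% threshold.

```python
import pandas as pd

# Hypothetical toy data standing in for df_race; each row sums to 1
toy_race = pd.DataFrame({
    'ugds_white': [0.60, 0.50, 0.70],
    'ugds_black': [0.34, 0.46, 0.27],
    'ugds_asian': [0.05, 0.02, 0.03],
    'ugds_nhpi':  [0.01, 0.02, 0.00],
})

# Column means, then the names of the columns averaging under 4%
means = toy_race.mean()
small_cols = means[means < 0.04].index

# Sum the small columns row by row and replace them with a single column
trimmed = toy_race.drop(columns=small_cols).assign(
    ugds_other=toy_race[small_cols].sum(axis='columns')
)

print(trimmed)
```

Using `assign` returns a new DataFrame rather than modifying `toy_race` in place.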


### Problem 12
<span  style="color:green; font-size:16px">Which 5 historically black colleges have the highest white percentage?</span>

In [43]:
filt = college['hbcu'] == 1
cols = ['instnm', 'ugds_white']
college.loc[filt, cols].sort_values('ugds_white', ascending=False).head()

Unnamed: 0,instnm,ugds_white
4021,Bluefield State College,0.8437
17,Gadsden State Community College,0.6921
4050,West Virginia State University,0.5816
48,Shelton State Community College,0.5613
55,H Councill Trenholm State Community College,0.3951


# 3. Grouping with Pivot Tables

In [44]:
import pandas as pd
pd.options.display.max_columns = 100
flights = pd.read_csv('../data/flights.csv')
flights.head()

Unnamed: 0,year,month,day,day_of_week,airline,flight_number,tail_number,origin_airport,destination_airport,scheduled_departure,departure_time,departure_delay,taxi_out,wheels_off,scheduled_time,elapsed_time,air_time,distance,wheels_on,taxi_in,scheduled_arrival,arrival_time,arrival_delay,diverted,cancelled,cancellation_reason,air_system_delay,security_delay,airline_delay,late_aircraft_delay,weather_delay
0,2015,1,1,4,WN,1908,N8324A,LAX,SLC,1625,1723.0,58.0,10.0,1733.0,100.0,107.0,94.0,590,2007.0,3.0,1905,2010.0,65.0,0,0,,31.0,0.0,0.0,34.0,0.0
1,2015,1,1,4,UA,581,N448UA,DEN,IAD,823,830.0,7.0,11.0,841.0,190.0,170.0,154.0,1452,1315.0,5.0,1333,1320.0,-13.0,0,0,,,,,,
2,2015,1,1,4,MQ,2851,N645MQ,DFW,VPS,1305,1341.0,36.0,18.0,1359.0,108.0,107.0,85.0,641,1524.0,4.0,1453,1528.0,35.0,0,0,,0.0,0.0,35.0,0.0,0.0
3,2015,1,1,4,AA,383,N3EUAA,DFW,DCA,1555,1602.0,7.0,13.0,1615.0,160.0,146.0,126.0,1192,1921.0,7.0,1935,1928.0,-7.0,0,0,,,,,,
4,2015,1,1,4,WN,3047,N560WN,LAX,MCI,1720,1808.0,48.0,6.0,1814.0,185.0,176.0,166.0,1363,2300.0,4.0,2225,2304.0,39.0,0,0,,0.0,0.0,17.0,22.0,0.0


### Problem 1
<span  style="color:green; font-size:16px">What is the average departure delay for each day of the week for each airline? Highlight the worst day of the week for each airline.</span>

In [45]:
avg_delay = flights.pivot_table(index='airline', columns='day_of_week', 
                                values='departure_delay').round(1)
avg_delay.style.highlight_max(axis='columns')

day_of_week,1,2,3,4,5,6,7
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AA,14.4,9.1,9.3,11.8,11.3,12.9,10.5
AS,1.7,-0.9,-2.4,7.5,2.2,1.8,2.8
B6,13.7,14.7,11.0,20.7,11.2,10.8,16.5
DL,7.7,7.8,7.9,8.8,6.9,5.9,5.3
EV,12.5,7.9,9.3,9.5,7.8,6.5,9.7
F9,13.2,15.3,16.3,10.7,16.3,12.5,15.6
HA,7.3,15.2,0.8,-1.7,-1.7,-5.0,-2.9
MQ,11.5,10.9,12.0,10.6,12.4,7.1,12.3
NK,16.6,15.8,14.8,20.2,23.9,19.5,25.3
OO,11.1,9.1,9.4,7.8,9.5,8.8,11.9


You can highlight min and max by chaining style methods.

In [46]:
avg_delay.style.highlight_max(axis='columns') \
         .highlight_min(axis='columns', color='lightblue')

day_of_week,1,2,3,4,5,6,7
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AA,14.4,9.1,9.3,11.8,11.3,12.9,10.5
AS,1.7,-0.9,-2.4,7.5,2.2,1.8,2.8
B6,13.7,14.7,11.0,20.7,11.2,10.8,16.5
DL,7.7,7.8,7.9,8.8,6.9,5.9,5.3
EV,12.5,7.9,9.3,9.5,7.8,6.5,9.7
F9,13.2,15.3,16.3,10.7,16.3,12.5,15.6
HA,7.3,15.2,0.8,-1.7,-1.7,-5.0,-2.9
MQ,11.5,10.9,12.0,10.6,12.4,7.1,12.3
NK,16.6,15.8,14.8,20.2,23.9,19.5,25.3
OO,11.1,9.1,9.4,7.8,9.5,8.8,11.9
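
The highlighting is only visible in the notebook. To get the worst day per airline as data, `idxmax` along the columns works on the same table. The sketch below uses invented flights (`toy_flights` is hypothetical) and also shows that `pivot_table`, whose default aggregation is the mean, matches a `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights DataFrame
toy_flights = pd.DataFrame({
    'airline':         ['AA', 'AA', 'AA', 'DL', 'DL', 'DL'],
    'day_of_week':     [1, 1, 2, 1, 2, 2],
    'departure_delay': [10.0, 20.0, 5.0, 0.0, 4.0, 8.0],
})

# pivot_table averages the values by default
via_pivot = toy_flights.pivot_table(index='airline', columns='day_of_week',
                                    values='departure_delay')

# The same table built with groupby and unstack
via_groupby = (toy_flights.groupby(['airline', 'day_of_week'])['departure_delay']
               .mean().unstack())

# Column label (day of week) holding the largest average delay per airline
worst_day = via_pivot.idxmax(axis='columns')

print(via_pivot)
print(worst_day)
```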


### Problem 2
<span  style="color:green; font-size:16px">Find the airline and origin airport that has the most canceled flights. Highlight the maximum value of the table. Read the docs to see how it's done.</span>

In [47]:
airline_cancel = flights.pivot_table(index='airline', columns='origin_airport', 
                                     values='cancelled', aggfunc='sum', fill_value=0)
airline_cancel.style.highlight_max(axis=None)

origin_airport,ATL,DEN,DFW,IAH,LAS,LAX,MSP,ORD,PHX,SFO
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
AA,3,4,86,3,3,11,3,35,4,2
AS,0,0,0,0,0,0,0,0,0,0
B6,0,0,0,0,0,0,0,0,0,1
DL,28,1,0,0,1,1,4,0,1,2
EV,18,6,27,36,0,0,6,53,0,0
F9,0,2,1,0,1,1,1,4,0,0
HA,0,0,0,0,0,0,0,0,0,0
MQ,5,0,62,0,0,0,0,85,0,0
NK,1,1,6,0,1,1,3,10,2,0
OO,3,25,2,10,0,15,4,41,9,33
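
Highlighting shows the maximum visually. To retrieve the airline and airport as values, one option is to reshape the table into a Series and call `idxmax`. The sketch below rebuilds a three-airline, two-airport subset of the table above by hand. The name `toy_cancel` is hypothetical.

```python
import pandas as pd

# Subset of the cancellation counts shown above, typed in by hand
toy_cancel = pd.DataFrame(
    {'DFW': [86, 0, 62], 'ORD': [35, 0, 85]},
    index=pd.Index(['AA', 'AS', 'MQ'], name='airline'),
)
toy_cancel.columns.name = 'origin_airport'

# unstack() turns the table into a Series keyed by (origin_airport, airline)
origin, airline = toy_cancel.unstack().idxmax()

print(airline, origin, toy_cancel.loc[airline, origin])
```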


### Problem 3
<span  style="color:green; font-size:16px">Find the total distance flown for each airline for each month. Can you use the style `format` method to put commas in the numbers so that they are easier to read?</span>

In [48]:
total_dist = flights.pivot_table(index='airline', columns='month', 
                                 values='distance', aggfunc='sum')

In [49]:
# we also highlight the max month for each airline
total_dist.style.format('{:,.0f}').highlight_max(axis='columns')

month,1,2,3,4,5,6,7,8,9,11,12
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
AA,728748,678434,773487,775426,728602,840028,1175729.0,1124517.0,1027703.0,1001409.0,1063613.0
AS,68255,63920,66004,84302,88380,87994,86077.0,62637.0,69188.0,67073.0,74769.0
B6,78150,79298,59978,81902,98000,105232,85799.0,90605.0,91253.0,86255.0,105660.0
DL,733393,654574,905960,853458,878038,877052,904442.0,987795.0,800168.0,775772.0,814568.0
EV,261396,211944,302376,277527,258884,260477,265773.0,249078.0,213148.0,203508.0,191960.0
F9,95506,87892,100298,99547,142196,140486,117118.0,125389.0,113856.0,122676.0,131990.0
HA,44342,24487,36409,16761,32344,26096,33304.0,26052.0,10788.0,21158.0,21159.0
MQ,153442,130251,143586,125056,121728,140145,133366.0,119640.0,113163.0,110294.0,112408.0
NK,129984,109817,118574,106371,140416,160658,150540.0,168150.0,169017.0,169174.0,165200.0
OO,275594,276571,283835,318742,331793,312358,325984.0,329097.0,314292.0,305618.0,294161.0


### Problem 4
<span  style="color:green; font-size:16px">Use the employee dataset for this problem. You can create pivot tables with multiple columns in the index or the columns by using a list. Create a pivot table with the department as the index and the race and gender as the columns. Calculate the median salary for these cross sections.</span>

In [50]:
emp = pd.read_csv('../data/employee.csv')
emp.pivot_table(index='dept', columns=['race', 'gender'], 
                values='salary', aggfunc='median').round(-3).style.format('{:,.0f}')

race,Asian,Asian,Black,Black,Hispanic,Hispanic,Native American,Native American,White,White
gender,Female,Male,Female,Male,Female,Male,Female,Male,Female,Male
dept,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Health & Human Services,50000.0,53000,47000,55000,33000,34000,54000.0,,51000.0,56000
Houston Airport System (HAS),29000.0,50000,37000,42000,31000,40000,68000.0,,70000.0,56000
Houston Fire Department (HFD),,44000,52000,58000,47000,55000,,78000.0,51000.0,63000
Houston Police Department-HPD,53000.0,55000,48000,67000,48000,62000,,60000.0,67000.0,67000
Parks & Recreation,,49000,31000,34000,43000,30000,,,,37000
Public Works & Engineering-PWE,86000.0,56000,37000,39000,44000,46000,,,67000.0,57000


# 4. Counting with Crosstabs

In [51]:
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200
mh = pd.read_csv('../data/mental_health.csv')
mh.head()

Unnamed: 0,timestamp,age,gender,country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


In [52]:
pd.read_csv('../data/mental_health_dd.csv')

Unnamed: 0,Column Name,Description
0,timestamp,Time the survey was submitted
1,age,Respondent age
2,gender,Respondent gender
3,country,Respondent country
4,state,"If you live in the United States, which state or territory do you live in?"
5,self_employed,Are you self-employed?
6,family_history,Do you have a family history of mental illness?
7,treatment,Have you sought treatment for a mental health condition?
8,work_interfere,"If you have a mental health condition, do you feel that it interferes with your work?"
9,no_employees,How many employees does your company or organization have?


### Problem 1
<span  style="color:green; font-size:16px">Do people with a family history of mental illness seek treatment more often than those who do not?</span>

In [53]:
pd.crosstab(index=mh['family_history'], columns=mh['treatment'])

treatment,No,Yes
family_history,Unnamed: 1_level_1,Unnamed: 2_level_1
No,427,255
Yes,116,343


In [54]:
pd.crosstab(index=mh['family_history'], columns=mh['treatment'], normalize='index').round(2)

treatment,No,Yes
family_history,Unnamed: 1_level_1,Unnamed: 2_level_1
No,0.63,0.37
Yes,0.25,0.75


Yes, there is a large difference. 75% of people with a family history seek treatment versus 37% of those without one.
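
For reference, `normalize='index'` divides each count by its row total. The sketch below checks this on a small invented survey (`toy_mh` is hypothetical, not the mental health dataset) by comparing the normalized crosstab with a manual division.

```python
import pandas as pd

# Hypothetical toy data standing in for the mh DataFrame
toy_mh = pd.DataFrame({
    'family_history': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'treatment':      ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

# Raw counts for each combination
counts = pd.crosstab(index=toy_mh['family_history'], columns=toy_mh['treatment'])

# Row-wise proportions computed by crosstab
shares = pd.crosstab(index=toy_mh['family_history'], columns=toy_mh['treatment'],
                     normalize='index')

# The same proportions computed by hand: divide each row by its total
manual = counts.div(counts.sum(axis='columns'), axis='index')

print(counts)
print(shares)
```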

### Problem 2
<span  style="color:green; font-size:16px">Find the total number and ratio of employees that seek treatment for companies that provide health benefits vs those that do not.</span>

In [55]:
pd.crosstab(index=mh['benefits'], columns=mh['treatment'])

treatment,No,Yes
benefits,Unnamed: 1_level_1,Unnamed: 2_level_1
Don't know,231,141
No,146,158
Yes,166,299


In [56]:
pd.crosstab(index=mh['benefits'], columns=mh['treatment'], normalize='index').round(2)

treatment,No,Yes
benefits,Unnamed: 1_level_1,Unnamed: 2_level_1
Don't know,0.62,0.38
No,0.48,0.52
Yes,0.36,0.64


### Problem 3
<span  style="color:green; font-size:16px">You can provide a list of multiple columns to both the `index` and `columns` parameters of the `crosstab` function. Put country and number of employees in the index and benefits and treatment in the columns. It's probably easier to make separate list variables first.</span>

In [57]:
index = [mh['country'], mh['no_employees']]
columns = [mh['benefits'], mh['treatment']]
pd.crosstab(index=index, columns=columns)

Unnamed: 0_level_0,benefits,Don't know,Don't know,No,No,Yes,Yes
Unnamed: 0_level_1,treatment,No,Yes,No,Yes,No,Yes
country,no_employees,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Australia,1-5,1,0,1,1,0,0
Australia,100-500,1,0,1,2,0,2
Australia,26-100,0,0,1,3,0,0
Australia,500-1000,1,0,0,0,0,0
Australia,6-25,0,1,0,3,0,0
Australia,More than 1000,1,0,0,0,1,1
Canada,1-5,1,0,5,5,0,0
Canada,100-500,2,3,0,0,2,4
Canada,26-100,4,4,3,1,3,3
Canada,500-1000,0,0,0,0,0,1
