In this chapter, you'll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You'll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.

## Grouping by multiple columns
In this exercise, you will return to working with the Titanic dataset from Chapter 1 and use .groupby() to analyze the distribution of passengers who boarded the Titanic.

The 'pclass' column identifies which class of ticket was purchased by the passenger and the 'embarked' column indicates at which of the three ports the passenger boarded the Titanic. 'S' stands for Southampton, England, 'C' for Cherbourg, France and 'Q' for Queenstown, Ireland.

Your job is to first group by the 'pclass' column and count the number of rows in each class using the 'survived' column. You will then group by the 'embarked' and 'pclass' columns and count the number of passengers.

The DataFrame has been pre-loaded as titanic.
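
As a minimal sketch of the two steps described above, the following uses a small made-up DataFrame (the name toy and its values are assumptions for illustration, not the real Titanic data). Keep in mind that .count() counts non-null entries, and that .groupby() by default drops rows whose grouping key is missing, so group counts need not add up to the number of rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the three Titanic columns used here
toy = pd.DataFrame({
    'pclass':   [1, 1, 2, 3, 3, 3],
    'embarked': ['S', 'C', 'S', 'S', 'Q', 'S'],
    'survived': [1, 0, 1, 0, 0, 1],
})

# One grouping key: number of non-null 'survived' entries per class
count_by_class_toy = toy.groupby('pclass')['survived'].count()

# Two grouping keys: the result is indexed by (embarked, pclass) pairs
count_mult_toy = toy.groupby(['embarked', 'pclass'])['survived'].count()

print(count_by_class_toy)
print(count_mult_toy)
```

Passing a list of column names is what produces the two-level row index in the second result.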

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
titanic = pd.read_csv('titanic.csv')
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
# Group by the 'pclass' column and save the result as by_class
by_class = titanic.groupby('pclass')

In [5]:
# Aggregate the 'survived' column of by_class using .count(). Save the result as count_by_class.
count_by_class = by_class['survived'].count()
count_by_class

pclass
1    323
2    277
3    709
Name: survived, dtype: int64

In [6]:
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [7]:
titanic['pclass'].unique()

array([1, 2, 3], dtype=int64)

In [8]:
# Group titanic by the 'embarked' and 'pclass' columns. Save the result as mult
mult = titanic.groupby(['embarked', 'pclass'])

In [10]:
# Aggregate the 'survived' column of mult using .count(). Save the result as count_mult.
count_mult = mult['survived'].count()
count_mult

embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         495
Name: survived, dtype: int64

## Computing multiple aggregates of multiple columns
The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.

In this exercise, you're going to group passengers on the Titanic by 'pclass' and aggregate the 'age' and 'fare' columns by the functions 'max' and 'median'. You'll then use multi-level selection to find the oldest passenger per class and the median fare price per class.

The DataFrame has been pre-loaded as titanic.
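
To make the multi-level column index concrete, here is a small sketch on made-up data (toy and its values are assumptions for illustration). Each column of the aggregated result is addressed by a (column, function) tuple.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the exercise
toy = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'age':    [40.0, 60.0, 30.0, 20.0, 10.0, 50.0],
    'fare':   [100.0, 50.0, 20.0, 10.0, 8.0, 6.0],
})

# Two aggregations on two columns give a two-level column index
stats = toy.groupby('pclass')[['age', 'fare']].agg(['max', 'median'])

# Select a single (column, function) pair with a tuple
max_age = stats.loc[:, ('age', 'max')]
median_fare = stats[('fare', 'median')]

print(stats)
```

Storing the aggregated DataFrame once and selecting from it avoids repeating the groupby for every lookup.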



In [24]:
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [28]:
titanic.groupby('pclass')[['age', 'fare']].agg(['max', 'median'])

,age,age,fare,fare
pclass,max,median,max,median
1,80.0,39.0,512.3292,60.0
2,70.0,29.0,73.5,15.0458
3,74.0,24.0,69.55,8.05


In [30]:
# Print the maximum age in each class
titanic.groupby('pclass')[['age', 'fare']].agg(['max', 'median']).loc[:, ('age', 'max')]

pclass
1    80.0
2    70.0
3    74.0
Name: (age, max), dtype: float64

In [31]:
# Print the median fare in each class
titanic.groupby('pclass')[['age', 'fare']].agg(['max', 'median']).loc[:, ('fare', 'median')]

pclass
1    60.0000
2    15.0458
3     8.0500
Name: (fare, median), dtype: float64

## Aggregating on index levels/fields
If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.

In this exercise you'll use the full Gapminder dataset which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013.

Your job is to create a DataFrame with a multi-level index built from the columns 'Year', 'region' and 'Country'. Next you'll group the DataFrame by the 'Year' and 'region' levels. Finally, you'll apply a dictionary aggregation to compute the total population, the spread of per capita GDP values and the average child mortality rate.
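
The following sketch shows the same idea on a tiny made-up table (toy, its values and the region labels 'A' and 'B' are assumptions for illustration): group by index levels with the level parameter, then pass a dictionary that maps each column to its own aggregation.

```python
import pandas as pd

# Hypothetical toy data, indexed by three levels like the Gapminder table
toy = pd.DataFrame({
    'Year':       [2000, 2000, 2000, 2001, 2001, 2001],
    'region':     ['A', 'A', 'B', 'A', 'A', 'B'],
    'Country':    ['a1', 'a2', 'b1', 'a1', 'a2', 'b1'],
    'population': [10, 20, 30, 11, 21, 31],
    'gdp':        [100.0, 300.0, 200.0, 110.0, 350.0, 210.0],
}).set_index(['Year', 'region', 'Country']).sort_index()

def spread(series):
    """Return the range (max minus min) of a Series."""
    return series.max() - series.min()

# Group by two of the three index levels, one aggregation per column
agg_toy = toy.groupby(level=['Year', 'region']).agg(
    {'population': 'sum', 'gdp': spread}
)

print(agg_toy)
```

Dictionary values can be strings naming built-in aggregations or any callable that reduces a Series to a single value.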

In [32]:
# Read 'gapminder_tidy.csv' into a DataFrame with index_col=['Year','region','Country']. Sort the index.
gap_df = pd.read_csv('gapminder_tidy.csv', index_col = ['Year', 'region', 'Country']).sort_index()
gap_df

Year,region,Country,fertility,life,population,child_mortality,gdp
1964,America,Antigua and Barbuda,4.250,63.775,58653.0,72.78,5008.0
1964,America,Argentina,3.068,65.388,21966478.0,57.43,8227.0
1964,America,Aruba,4.059,67.113,57031.0,,5505.0
1964,America,Bahamas,4.220,64.189,133709.0,48.56,18160.0
1964,America,Barbados,4.094,62.819,234455.0,64.70,5681.0
1964,America,Belize,6.420,62.241,103555.0,112.27,2114.0
1964,America,Bolivia,6.607,43.913,3668568.0,265.40,2971.0
1964,America,Brazil,5.953,56.521,82021855.0,154.20,4707.0
1964,America,Canada,3.513,71.690,19309343.0,28.20,16464.0
1964,America,Chile,5.185,58.756,8457066.0,126.80,6537.0


In [34]:
# Group gap_df by the index levels ['Year','region'] using the level parameter. Save the result as by_year_region.
by_year_region = gap_df.groupby(level = ['Year', 'region'])

In [35]:
def spread(series):
    return series.max() - series.min()


In [36]:
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

In [37]:
# Use the aggregator dictionary to aggregate by_year_region. Save the result as aggregated
aggregated = by_year_region.agg(aggregator)
aggregated

Year,region,population,child_mortality,gdp
1964,America,4.621957e+08,113.950667,18314.0
1964,East Asia & Pacific,1.110668e+09,129.109130,66821.0
1964,Europe & Central Asia,6.988545e+08,61.585319,28734.0
1964,Middle East & North Africa,1.180955e+08,179.605263,38474.0
1964,South Asia,6.250739e+08,256.922500,812.0
1964,Sub-Saharan Africa,2.541011e+08,243.872766,8613.0
1965,America,4.715780e+08,110.795333,19358.0
1965,East Asia & Pacific,1.134218e+09,124.430435,67881.0
1965,Europe & Central Asia,7.062355e+08,59.160213,29367.0
1965,Middle East & North Africa,1.213494e+08,172.106316,48796.0


In [38]:
# Print the last 6 entries of aggregated
aggregated.tail(6)

Year,region,population,child_mortality,gdp
2013,America,962908700.0,17.745833,49634.0
2013,East Asia & Pacific,2244209000.0,22.285714,134744.0
2013,Europe & Central Asia,896878800.0,9.831875,86418.0
2013,Middle East & North Africa,403050400.0,20.2215,128676.0
2013,South Asia,1701241000.0,46.2875,11469.0
2013,Sub-Saharan Africa,920599600.0,76.94449,32035.0


## Grouping on a function of the index
Groupby operations can also be performed on transformations of the index values. In the case of a DatetimeIndex, we can extract portions of the datetime over which to group.

In this exercise you'll read in a set of sample sales data from February 2015 and assign the 'Date' column as the index. Your job is to group the sales data by the day of the week and aggregate the sum of the 'Units' column.

Is there a day of the week that is more popular for customers? To find out, you're going to use .strftime('%a') to transform the index datetime values to abbreviated days of the week.
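
Here is a minimal sketch of grouping on a transformed index, using a few made-up rows (toy_sales and its values are assumptions for illustration). The labels produced by '%a' follow the active locale; in an English locale they are 'Mon', 'Tue' and so on.

```python
import pandas as pd

# Hypothetical toy sales data with a DatetimeIndex
toy_sales = pd.DataFrame(
    {'Units': [3, 9, 13, 5]},
    index=pd.to_datetime([
        '2015-02-02 08:30',  # a Monday
        '2015-02-03 14:00',  # a Tuesday
        '2015-02-09 09:00',  # a Monday
        '2015-02-10 10:00',  # a Tuesday
    ]),
)

# Turn each timestamp into an abbreviated weekday name
day_names = toy_sales.index.strftime('%a')

# Group by the derived labels rather than by a column
units_by_day = toy_sales.groupby(day_names)['Units'].sum()

print(units_by_day)
```

The grouping key only needs to have one entry per row, so any array derived from the index works here.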


In [40]:
sales = pd.read_csv('sales/sales-feb-2015.csv', parse_dates = True, index_col = 'Date')
sales

Date,Company,Product,Units
2015-02-02 08:30:00,Hooli,Software,3
2015-02-02 21:00:00,Mediacore,Hardware,9
2015-02-03 14:00:00,Initech,Software,13
2015-02-04 15:30:00,Streeplex,Software,13
2015-02-04 22:00:00,Acme Coporation,Hardware,14
2015-02-05 02:00:00,Acme Coporation,Software,19
2015-02-05 22:00:00,Hooli,Service,10
2015-02-07 23:00:00,Acme Coporation,Hardware,1
2015-02-09 09:00:00,Streeplex,Service,19
2015-02-09 13:00:00,Mediacore,Software,7


In [41]:
# Create a groupby object with sales.index.strftime('%a') as input and assign it to by_day.
by_day = sales.groupby(sales.index.strftime('%a'))

In [42]:
# Aggregate the 'Units' column of by_day with the .sum() method. Save the result as units_sum.
units_sum = by_day['Units'].sum()
units_sum

Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64

## Detecting outliers with Z-Scores
You can call .transform() on a groupby object to apply a function, such as zscore from scipy.stats, to each group independently. The z-score is also useful for finding outliers: a value whose z-score is above 3 or below -3 is generally considered an outlier.

In this example, you're going to normalize the Gapminder data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, you will filter out countries that have high fertility rates and low life expectancy for their region.
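
The sketch below illustrates per-group standardization on made-up data (toy, its values and the helper zscore_pop are assumptions for illustration). It uses a hand-written z-score with ddof=0, which matches the default of scipy.stats.zscore, so it runs with pandas alone. The toy groups are too small to reach a z-score of 3, so a looser threshold of -1 is used to show the boolean filtering step.

```python
import pandas as pd

def zscore_pop(series):
    """Hypothetical helper: z-score using the population std (ddof=0)."""
    return (series - series.mean()) / series.std(ddof=0)

# Hypothetical toy data: two regions with three countries each
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'life':   [70.0, 72.0, 74.0, 50.0, 60.0, 70.0],
}, index=['a1', 'a2', 'a3', 'b1', 'b2', 'b3'])

# .transform() returns a result aligned with the original rows
z = toy.groupby('region')[['life']].transform(zscore_pop)

# Boolean indexing: rows that are low relative to their own region
low_life = toy.loc[z['life'] < -1]

print(z)
print(low_life)
```

Because the result of .transform() keeps the original index, the boolean Series built from it can be used directly to index the original DataFrame.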

In [53]:
gapminder = pd.read_csv('gapminder_tidy.csv', index_col = 'Country')
gapminder.head()

Country,Year,fertility,life,population,child_mortality,gdp,region
Afghanistan,1964,7.671,33.639,10474903.0,339.7,1182.0,South Asia
Afghanistan,1965,7.671,34.152,10697983.0,334.1,1182.0,South Asia
Afghanistan,1966,7.671,34.662,10927724.0,328.7,1168.0,South Asia
Afghanistan,1967,7.671,35.17,11163656.0,323.3,1173.0,South Asia
Afghanistan,1968,7.671,35.674,11411022.0,318.1,1187.0,South Asia


In [54]:
gapminder_2010 = gapminder[gapminder['Year'] == 2010].drop('Year', axis = 'columns')

In [55]:
gapminder_2010

Country,fertility,life,population,child_mortality,gdp,region
Afghanistan,5.659,59.612,31411743.0,105.00,1637.0,South Asia
Albania,1.741,76.780,3204284.0,16.60,9374.0,Europe & Central Asia
Algeria,2.817,70.615,35468208.0,27.40,12494.0,Middle East & North Africa
Angola,6.218,50.689,19081912.0,182.50,7047.0,Sub-Saharan Africa
Antigua and Barbuda,2.130,75.437,88710.0,9.90,20567.0,America
Argentina,2.215,75.772,40412376.0,14.60,15765.0,America
Armenia,1.550,74.291,3092072.0,18.00,6508.0,Europe & Central Asia
Aruba,1.701,75.059,107488.0,17.84,33288.0,America
Australia,1.886,82.091,22268384.0,4.80,41330.0,East Asia & Pacific
Austria,1.438,80.595,8393644.0,4.40,42861.0,Europe & Central Asia


In [52]:
# Import zscore from scipy.stats.
from scipy.stats import zscore

In [56]:
# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)

# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
gm_outliers

Country,fertility,life,population,child_mortality,gdp,region
Guatemala,3.974,71.1,14388929.0,34.5,6849.0,America
Haiti,3.35,45.0,9993247.0,208.8,1518.0,America
Tajikistan,3.78,66.83,6878637.0,52.6,2110.0,Europe & Central Asia
Timor-Leste,6.237,65.952,1124355.0,63.8,1777.0,East Asia & Pacific


## Filling missing data (imputation) by group
Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). In Chapter 1, you practiced using the .dropna() method to drop missing values. Now, you will practice imputing missing values. You can use .groupby() and .transform() to fill missing data appropriately for each group.

Your job is to fill in missing 'age' values for passengers on the Titanic with the median age of passengers who share their 'sex' and 'pclass'. To do this, you'll group by the 'sex' and 'pclass' columns and transform each group with a custom function to call .fillna() and impute the median value.

The DataFrame has been pre-loaded as titanic. Explore it in the IPython Shell by printing the output of titanic.tail(10). Notice in particular the NaNs in the 'age' column.
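
As a small sketch of group-wise imputation, the following uses made-up data in which each group has one missing age (toy, its values and the helper fill_with_group_median are assumptions for illustration). The result is written to a new column so the original values stay visible for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age in each (sex, pclass) group
toy = pd.DataFrame({
    'sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'pclass': [3, 3, 3, 1, 1, 1],
    'age':    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

def fill_with_group_median(series):
    """Hypothetical helper: replace NaN with the median of the same group."""
    return series.fillna(series.median())

# Each group is filled with its own median (25.0 for male/3, 45.0 for female/1)
toy['age_filled'] = (
    toy.groupby(['sex', 'pclass'])['age'].transform(fill_with_group_median)
)

print(toy)
```

Values that were already present pass through unchanged; only the NaN entries receive the group median.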

In [58]:
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


In [62]:
# Group titanic by 'sex' and 'pclass'.
titanic.groupby(['sex', 'pclass'])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x11EB2F90>

In [63]:
# Write a function called impute_median() that fills missing values with the median of a series.
def impute_median(series):
    return series.fillna(series.median())

In [66]:
# Call .transform() with impute_median on the 'age' column 
titanic['age'] = titanic.groupby(['sex', 'pclass'])['age'].transform(impute_median)
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,25.0,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,25.0,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,22.0,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


## Other transformations with .apply
The .apply() method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way.

In this exercise, you're going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you'll define a function that computes the spread of per capita GDP in each region and each country's z-score relative to its region's per capita GDP. You'll then select three countries - United States, United Kingdom and China - to see the regional GDP spread and each country's z-score against the regional mean.

The 2010 Gapminder DataFrame is provided for you as gapminder_2010.
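
The sketch below applies a function that returns a DataFrame per group, using made-up data (toy, its values and the helper disparity_toy are assumptions for illustration). It passes group_keys=False and selects the 'gdp' column before calling .apply(), a choice made here so that the result stays indexed by the original row labels across pandas versions.

```python
import pandas as pd

# Hypothetical toy data: per capita GDP for countries in two regions
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B'],
    'gdp':    [10.0, 20.0, 30.0, 100.0, 300.0],
}, index=['a1', 'a2', 'a3', 'b1', 'b2'])

def disparity_toy(gr):
    """Return each row's z-score plus the spread of its group."""
    s = gr['gdp'].max() - gr['gdp'].min()               # one number per group
    z = (gr['gdp'] - gr['gdp'].mean()) / gr['gdp'].std()  # one number per row
    return pd.DataFrame({'z(gdp)': z, 'regional spread(gdp)': s})

# group_keys=False keeps the original labels as the only index level
result = toy.groupby('region', group_keys=False)[['gdp']].apply(disparity_toy)

print(result.loc[['a1', 'b2']])
```

The scalar spread is broadcast to every row of its group, which is why each row carries both its own z-score and the regional value.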

In [67]:
gapminder_2010

Country,fertility,life,population,child_mortality,gdp,region
Afghanistan,5.659,59.612,31411743.0,105.00,1637.0,South Asia
Albania,1.741,76.780,3204284.0,16.60,9374.0,Europe & Central Asia
Algeria,2.817,70.615,35468208.0,27.40,12494.0,Middle East & North Africa
Angola,6.218,50.689,19081912.0,182.50,7047.0,Sub-Saharan Africa
Antigua and Barbuda,2.130,75.437,88710.0,9.90,20567.0,America
Argentina,2.215,75.772,40412376.0,14.60,15765.0,America
Armenia,1.550,74.291,3092072.0,18.00,6508.0,Europe & Central Asia
Aruba,1.701,75.059,107488.0,17.84,33288.0,America
Australia,1.886,82.091,22268384.0,4.80,41330.0,East Asia & Pacific
Austria,1.438,80.595,8393644.0,4.40,42861.0,Europe & Central Asia


In [68]:
# The following function has been defined for your use:

def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})


In [70]:
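# Group gapminder_2010 by 'region'. Save the result as regional.
# group_keys=False keeps 'Country' as the only index level of the .apply() result in pandas 2.0 and later.
regional = gapminder_2010.groupby('region', group_keys=False)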

In [72]:
# Apply the provided disparity function on regional, and save the result as reg_disp
reg_disp = regional.apply(disparity)

In [73]:
# Use .loc[] to select ['United States','United Kingdom','China'] from reg_disp and print the results.
reg_disp.loc[['United States', 'United Kingdom', 'China']]

Country,z(gdp),regional spread(gdp)
United States,3.013374,47855.0
United Kingdom,0.572873,89037.0
China,-0.432756,96993.0


## Grouping and filtering with .apply()
By using .apply(), you can write functions that filter rows within groups. The .apply() method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.

In this exercise you'll take the Titanic data set and analyze survival rates from the 'C' deck, which contained the most passengers. To do this you'll group the dataset by 'sex' and then use the .apply() method with a provided user-defined function that calculates the mean survival rate on the 'C' deck:

In [74]:
def c_deck_survival(gr):

    c_passengers = gr['cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'survived'].mean()

In [75]:
# Group titanic by 'sex'. Save the result as by_sex
by_sex = titanic.groupby('sex')

In [76]:
# Apply the provided c_deck_survival function on the by_sex DataFrame. Save the result as c_surv_by_sex
c_surv_by_sex = by_sex.apply(c_deck_survival)

# Print the survival rates
c_surv_by_sex

sex
female    0.913043
male      0.312500
dtype: float64
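
The same pattern can be checked by hand on a few made-up rows (toy, its values and the helper c_deck_rate are assumptions for illustration). This sketch passes na=False to .str.startswith(), an alternative to calling .fillna(False) afterwards, so that missing cabins count as not being on the 'C' deck.

```python
import pandas as pd

# Hypothetical toy data with some missing cabin entries
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'female', 'male', 'male', 'male'],
    'cabin':    ['C22', 'B5', None, 'C85', 'C123', None],
    'survived': [1, 0, 1, 0, 1, 0],
})

def c_deck_rate(gr):
    """Mean of 'survived' over the rows of a group whose cabin starts with 'C'."""
    on_c = gr['cabin'].str.startswith('C', na=False)
    return gr.loc[on_c, 'survived'].mean()

# Filter rows inside each group, then reduce each group to one number
rates = toy.groupby('sex')[['cabin', 'survived']].apply(c_deck_rate)

print(rates)
```

Because the function returns a scalar for each group, .apply() combines the results into a Series indexed by the group labels.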

## Grouping and filtering with .filter()
You can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition.

In this exercise, you'll take the February sales data and remove entries from companies that purchased less than or equal to 35 Units in the whole month.

First, you'll identify how many units each company bought for verification. Next you'll use the .filter() method after grouping by 'Company' to remove all rows belonging to companies whose sum over the 'Units' column was less than or equal to 35. Finally, verify that the three companies whose total Units purchased were less than or equal to 35 have been filtered out from the DataFrame.
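
Here is a minimal sketch of the check-then-filter workflow on made-up data (toy_sales, the company labels and the values are assumptions for illustration). The function passed to .filter() receives each group as a DataFrame and must return a single True or False.

```python
import pandas as pd

# Hypothetical toy sales data for three companies
toy_sales = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C', 'C'],
    'Units':   [20, 20, 30, 10, 5],
})

# Totals per company, for verification: A=40, B=30, C=15
totals = toy_sales.groupby('Company')['Units'].sum()

# Keep every row of the companies whose total exceeds 35
kept = toy_sales.groupby('Company').filter(lambda g: g['Units'].sum() > 35)

print(totals)
print(kept)
```

Unlike an aggregation, .filter() returns the original rows of the surviving groups rather than one row per group.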

In [78]:
sales

Date,Company,Product,Units
2015-02-02 08:30:00,Hooli,Software,3
2015-02-02 21:00:00,Mediacore,Hardware,9
2015-02-03 14:00:00,Initech,Software,13
2015-02-04 15:30:00,Streeplex,Software,13
2015-02-04 22:00:00,Acme Coporation,Hardware,14
2015-02-05 02:00:00,Acme Coporation,Software,19
2015-02-05 22:00:00,Hooli,Service,10
2015-02-07 23:00:00,Acme Coporation,Hardware,1
2015-02-09 09:00:00,Streeplex,Service,19
2015-02-09 13:00:00,Mediacore,Software,7


In [80]:
# Group sales by 'Company', then compute and print the sum of the 'Units' column
sales.groupby('Company')['Units'].sum()

Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64

In [82]:
# Call .filter() with lambda g:g['Units'].sum() > 35 as input and print the result.
sales.groupby('Company').filter(lambda g:g['Units'].sum() > 35)

Date,Company,Product,Units
2015-02-02 21:00:00,Mediacore,Hardware,9
2015-02-04 15:30:00,Streeplex,Software,13
2015-02-09 09:00:00,Streeplex,Service,19
2015-02-09 13:00:00,Mediacore,Software,7
2015-02-19 11:00:00,Mediacore,Hardware,16
2015-02-19 16:00:00,Mediacore,Service,10
2015-02-21 05:00:00,Mediacore,Software,3
2015-02-26 09:00:00,Streeplex,Service,4


## Filtering and grouping with .map()
You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function or transformation of a column. In that case you can pass a Series as the grouping key; the key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.

In this exercise your job is to investigate survival rates of passengers on the Titanic by 'age' and 'pclass'. In particular, the goal is to find out what fraction of children under 10 survived in each 'pclass'. You'll do this by first creating a boolean Series where True marks passengers under 10 years old and False marks passengers aged 10 or older. You'll use .map() to change these values to strings.

Finally, you'll group by the under 10 series and the 'pclass' column and aggregate the 'survived' column. The 'survived' column has the value 1 if the passenger survived and 0 otherwise. The mean of the 'survived' column is the fraction of passengers who lived.
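
The sketch below walks through these steps on made-up data (toy, its values and the label '10 or over' are assumptions for illustration). Note that a comparison such as NaN < 10 evaluates to False, so any passenger with a missing age would land in the second bucket.

```python
import pandas as pd

# Hypothetical toy data with the three columns used in this exercise
toy = pd.DataFrame({
    'age':      [5.0, 8.0, 30.0, 40.0, 9.0, 50.0],
    'pclass':   [1, 3, 1, 3, 3, 3],
    'survived': [1, 0, 1, 0, 1, 0],
})

# Boolean mask mapped to readable labels; same index as toy
under10_toy = (toy['age'] < 10).map({True: 'under 10', False: '10 or over'})

# Group by the derived Series alone
rate_by_age = toy.groupby(under10_toy)['survived'].mean()

# Mix the derived Series with a column name
rate_by_both = toy.groupby([under10_toy, 'pclass'])['survived'].mean()

print(rate_by_age)
print(rate_by_both)
```

Since 'survived' holds only 0 and 1, its mean within each group is the fraction of that group that survived.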

In [85]:
# Create the under10 Series: a Boolean mask mapped to string labels
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})

In [86]:
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
survived_mean_1

age
over 10     0.366748
under 10    0.609756
Name: survived, dtype: float64

In [87]:
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10, 'pclass'])['survived'].mean()
survived_mean_2

age       pclass
over 10   1         0.617555
          2         0.380392
          3         0.238897
under 10  1         0.750000
          2         1.000000
          3         0.446429
Name: survived, dtype: float64