# Split-Apply-Combine

In his 2011 paper, Hadley Wickham describes the split-apply-combine technique as breaking up a big problem into manageable pieces, operating on each piece independently, and then putting all the pieces back together.
 
In pandas, the split-apply-combine technique is implemented with the groupby method. Calling df.groupby('column'), or equivalently df.groupby(df['column']), returns a GroupBy object. No result is computed until a function, such as an aggregation, is applied to that object. By default, groupby sorts the group keys and drops rows whose key is NaN.
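
The defaults described above can be checked on a small example. This is a minimal sketch on made-up data (the `toy` frame and its column names are hypothetical, not part of the WDI files), assuming a pandas version that supports the `dropna` argument of `groupby` (1.1 or later).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); one row has a missing group key.
toy = pd.DataFrame({
    "region": ["south", "north", "south", np.nan, "north"],
    "value": [4.0, 1.0, 6.0, 10.0, 3.0],
})

# Split: creating the GroupBy object does not compute a result yet.
grouped = toy.groupby("region")

# Apply + combine: the mean is computed per group and the results are joined.
default_means = grouped["value"].mean()

# sort=False keeps keys in first-seen order; dropna=False keeps the NaN key.
all_keys_means = toy.groupby("region", sort=False, dropna=False)["value"].mean()

print(default_means)
print(all_keys_means)
```

With the defaults, the row with the missing key is left out and the keys come back sorted; the second call keeps that row as its own group.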


In [174]:
import pandas as pd
import numpy as np
import altair as alt
import time

In [225]:
# Read WDI data into a pandas dataframe
wdi_df = pd.read_csv('WDIData.csv')

In [226]:
# There are region names included in the country column and I only want to look at countries
# Using the WDICountry.csv to get a list of countries only
wdi_country_df = pd.read_csv('WDICountry.csv')

# getting list of only countries and not regions
country_only = wdi_country_df.dropna(subset=['Region']) #the regions don't have anything in the 'Region' column
country_only = country_only['Table Name']

# filtering out regions using my new list of countries
wdi_df = wdi_df.loc[wdi_df['Country Name'].isin(country_only)]
#print(wdi_df.head())

In [114]:
#Pandas groupby returns a groupby object
#no actual splitting happens until a function is applied to the groupby object
new_df= wdi_df.groupby('Indicator Name')
new_df

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023902AB51E0>
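
Although the GroupBy object prints as an opaque reference, it can be inspected before any aggregation. The sketch below uses a small made-up frame (the `toy` data is hypothetical and only borrows the WDI column names) to show a few ways of looking at the groups.

```python
import pandas as pd

# Toy stand-in for the WDI frame (hypothetical values).
toy = pd.DataFrame({
    "Indicator Name": ["CO2", "GDP", "CO2", "GDP", "CO2"],
    "Country Name": ["A", "A", "B", "B", "C"],
    "2019": [10.0, 100.0, 20.0, 200.0, 30.0],
})
grouped = toy.groupby("Indicator Name")

# Number of groups and number of rows in each group.
n_groups = grouped.ngroups
group_sizes = grouped.size()

# Iterating yields (key, sub-DataFrame) pairs: this is the "split" step made visible.
pieces = {key: piece for key, piece in grouped}

print(n_groups)
print(group_sizes)
print(pieces["CO2"])
```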

Example 1. Aggregation

In [227]:
#Grouping by indicator could be useful if you want to see global trends
#here I used the aggregation functions mean and count on the groupby object
#the groups are sorted by default
new_df= wdi_df[['Indicator Name','2017', '2018', '2019']].groupby('Indicator Name').agg(['mean','count'])
new_df.head(5)

Indicator Name,2017 mean,2017 count,2018 mean,2018 count,2019 mean,2019 count
ARI treatment (% of children under 5 taken to a health provider),61.158333,12,66.627273,22,66.65,4
Access to clean fuels and technologies for cooking (% of population),66.673387,186,67.056183,186,67.432258,186
"Access to clean fuels and technologies for cooking, rural (% of rural population)",58.276882,186,58.693548,186,59.11586,186
"Access to clean fuels and technologies for cooking, urban (% of urban population)",75.413172,186,75.673118,186,75.91828,186
Access to electricity (% of population),85.183127,212,85.856512,211,86.361743,212


In [234]:
#taking only the mean gives a single column level, and reset_index turns
#'Indicator Name' back into a regular column
new_df = wdi_df[['Indicator Name', '2017', '2018', '2019']].groupby('Indicator Name').mean().reset_index()
new_df.head()

,Indicator Name,2017,2018,2019
0,ARI treatment (% of children under 5 taken to ...,61.158333,66.627273,66.65
1,Access to clean fuels and technologies for coo...,66.673387,67.056183,67.432258
2,Access to clean fuels and technologies for coo...,58.276882,58.693548,59.11586
3,Access to clean fuels and technologies for coo...,75.413172,75.673118,75.91828
4,Access to electricity (% of population),85.183127,85.856512,86.361743
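
When several statistics are needed, named aggregation is one way to get flat, readable column names without a column MultiIndex. This is a sketch on made-up data (the `toy` frame and the output names `mean_2019` and `count_2019` are hypothetical), not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WDI frame (hypothetical values, one missing observation).
toy = pd.DataFrame({
    "Indicator Name": ["CO2", "CO2", "CO2", "GDP", "GDP"],
    "2019": [10.0, 30.0, np.nan, 100.0, 300.0],
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = toy.groupby("Indicator Name").agg(
    mean_2019=("2019", "mean"),
    count_2019=("2019", "count"),  # count ignores NaN values
).reset_index()

print(summary)
```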


Example 2. Filter

In [235]:
#here I am keeping only the indicators that have more than 50 non-null values in 2019

new_df = wdi_df[['Country Name','Indicator Name', '2019']].groupby('Indicator Name').filter(lambda x : x['2019'].count() > 50)
new_df.sample(5)

,Country Name,Indicator Name,2019
100517,Benin,"Population ages 20-24, male (% of male populat...",9.293757
94101,Barbados,"Employment to population ratio, 15+, male (%) ...",59.591999
175554,Georgia,Poverty headcount ratio at national poverty li...,19.5
209785,Italy,"Life expectancy at birth, female (years)",85.7
178677,Ghana,Taxes on goods and services (% of revenue),32.131491
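
The same kind of group-level filter can also be written with `transform`, which broadcasts a per-group statistic back onto every row so it can be used as a boolean mask. The sketch below compares the two on made-up data (the `toy` frame and the `min_count` threshold are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WDI frame (hypothetical values).
toy = pd.DataFrame({
    "Indicator Name": ["CO2", "CO2", "CO2", "GDP", "GDP", "GDP"],
    "2019": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
})
min_count = 2  # hypothetical threshold, standing in for the 50 used above

# Version 1: filter() keeps every row of the groups for which the lambda is True.
by_filter = toy.groupby("Indicator Name").filter(
    lambda g: g["2019"].count() > min_count
)

# Version 2: transform("count") gives each row its group's non-null count.
counts = toy.groupby("Indicator Name")["2019"].transform("count")
by_mask = toy[counts > min_count]

print(by_filter)
print(by_mask)
```

Both versions keep the original row order and index; on larger frames the `transform` version may be faster because it avoids calling a Python lambda once per group.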


I tried a few more things with groupby below. 

In [233]:
#Here I am grouping by indicator then selecting a specific indicator using get_group()
#This is an example of wide form data where multiple observations are in the same row

new_df= wdi_df[['Country Name','Indicator Name','2017', '2018', '2019']].groupby(['Indicator Name']).get_group('CO2 emissions (kt)')
new_df.head()

,Country Name,Indicator Name,2017,2018,2019
70850,Afghanistan,CO2 emissions (kt),4780.00021,6070.000172,6079.999924
72292,Albania,CO2 emissions (kt),5139.999866,5110.000134,4829.999924
73734,Algeria,CO2 emissions (kt),158339.996338,165539.993286,171250.0
75176,American Samoa,CO2 emissions (kt),,,
76618,Andorra,CO2 emissions (kt),469.999999,490.00001,500.0


In [146]:
#Here I am using the pandas melt function to go from wide form to long form
#In long form data each row represents an observation

new_df = pd.melt(new_df, id_vars=['Country Name','Indicator Name'], value_vars=['2017', '2018', '2019'], var_name='year', value_name='value')
new_df.head()

,Country Name,Indicator Name,year,value
0,Afghanistan,CO2 emissions (kt),2017,4780.00021
1,Albania,CO2 emissions (kt),2017,5139.999866
2,Algeria,CO2 emissions (kt),2017,158339.996338
3,American Samoa,CO2 emissions (kt),2017,
4,Andorra,CO2 emissions (kt),2017,469.999999
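
One reason to prefer long form is that the former column headers become an ordinary column that can itself be a group key. A minimal sketch on made-up data (the `wide` frame is hypothetical and only mimics the WDI layout):

```python
import pandas as pd

# Toy wide-form frame (hypothetical values): one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Indicator Name": ["CO2", "CO2"],
    "2018": [1.0, 3.0],
    "2019": [2.0, 5.0],
})

# Wide to long: each (country, year) pair becomes its own row.
long = wide.melt(
    id_vars=["Country Name", "Indicator Name"],
    var_name="year",
    value_name="value",
)

# The year is now a regular column, so it can be used as a group key.
per_year = long.groupby("year")["value"].mean()

print(long)
print(per_year)
```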


In [91]:
# Here I am getting each country's mean over 2017-2019 after grouping by indicator and selecting a specific indicator using get_group
indi = 'CO2 emissions (kt)'
new_df= wdi_df[['Country Name','Indicator Name','2017', '2018', '2019']].groupby(
    ['Indicator Name']).get_group(indi).set_index('Country Name').mean(axis=1,numeric_only=True)
new_df.head()

Country Name
Afghanistan         5643.333435
Albania             5026.666641
Algeria           165043.329875
American Samoa              NaN
Andorra              486.666669
dtype: float64

In [150]:
#new_df.describe()

In [153]:
# Here I am getting the ranking for each country by specific indicator value 
indi = 'CO2 emissions (kt)'
new_df= wdi_df[['Country Name','Indicator Name', '2004', '2016', '2017', '2018','2019']].groupby(
    ['Indicator Name']).get_group(indi).set_index('Country Name').rank(ascending=False, axis=0, numeric_only=True)
print(new_df.sort_values('2019').head())
#new_df

                    2004  2016  2017  2018  2019
Country Name                                    
China                2.0   1.0   1.0   1.0   1.0
United States        1.0   2.0   2.0   2.0   2.0
India                5.0   3.0   3.0   3.0   3.0
Russian Federation   3.0   4.0   4.0   4.0   4.0
Japan                4.0   5.0   5.0   5.0   5.0
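
The ranking above handles one indicator at a time through get_group. As a possible alternative, `rank` can be called on the GroupBy object to rank countries within every indicator in one pass. This is a sketch on made-up data (the `toy` frame and the `rank_2019` column are hypothetical).

```python
import pandas as pd

# Toy stand-in for the WDI frame (hypothetical values).
toy = pd.DataFrame({
    "Indicator Name": ["CO2", "CO2", "CO2", "GDP", "GDP"],
    "Country Name": ["A", "B", "C", "A", "B"],
    "2019": [10.0, 30.0, 20.0, 5.0, 1.0],
})

# Rank within each indicator; rank 1 is the largest value in its group.
toy["rank_2019"] = toy.groupby("Indicator Name")["2019"].rank(ascending=False)

print(toy.sort_values(["Indicator Name", "rank_2019"]))
```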


In [166]:
# Here I am getting the 10 highest values for a specific indicator and year

indi = 'CO2 emissions (kt)'
new_df= wdi_df[['Country Name','Indicator Name','2019']].groupby(
    ['Indicator Name']).get_group(indi).set_index('Country Name').mean(axis=1,numeric_only=True).nlargest(10)
new_df

Country Name
China                 1.070722e+07
United States         4.817720e+06
India                 2.456300e+06
Russian Federation    1.703590e+06
Japan                 1.081570e+06
Germany               6.574000e+05
Iran, Islamic Rep.    6.300100e+05
Indonesia             6.198400e+05
Korea, Rep.           6.107900e+05
Canada                5.802100e+05
dtype: float64

In [240]:
%%timeit
# Here I am timing this to see whether groupby is faster than another method

new_df= wdi_df[['Country Name','Indicator Name','2018','2019']].groupby(
    ['Indicator Name']).get_group('CO2 emissions (kt)').set_index('Country Name').mean(axis=1,numeric_only=True).nlargest(10)
new_df

50.4 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [241]:
%%timeit
# Here I am timing a similar operation without groupby

source = wdi_df[['Country Name', 'Indicator Name', '2018', '2019']] #keeping only the country, the indicator, and the years 2018 and 2019
source = source[source['Indicator Name'] =='CO2 emissions (kt)'] #selecting one indicator
source = source.set_index('Country Name')
source = source.mean(axis=1, numeric_only=True)
source = source.nlargest(10) #getting the 10 highest 2018-2019 means
source

17 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
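
The comparison can be reproduced outside a notebook with the standard `timeit` module. The sketch below uses synthetic data (the `synthetic` frame, its sizes, and the `ind_7` target are made up), so the absolute numbers will differ from the WDI timings above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the WDI frame (hypothetical sizes and values).
rng = np.random.default_rng(0)
n_indicators, n_countries = 200, 100
synthetic = pd.DataFrame({
    "Indicator Name": np.repeat([f"ind_{i}" for i in range(n_indicators)], n_countries),
    "Country Name": np.tile([f"country_{j}" for j in range(n_countries)], n_indicators),
    "2018": rng.random(n_indicators * n_countries),
    "2019": rng.random(n_indicators * n_countries),
})
target = "ind_7"  # hypothetical indicator name


def with_groupby():
    # Groups every indicator, then keeps only one of the groups.
    picked = synthetic.groupby("Indicator Name").get_group(target)
    return picked.set_index("Country Name")[["2018", "2019"]].mean(axis=1).nlargest(10)


def with_mask():
    # Selects the rows for one indicator directly with a boolean mask.
    picked = synthetic[synthetic["Indicator Name"] == target]
    return picked.set_index("Country Name")[["2018", "2019"]].mean(axis=1).nlargest(10)


# Both approaches should return the same top-10 Series.
same_result = with_groupby().equals(with_mask())

groupby_seconds = timeit.timeit(with_groupby, number=20)
mask_seconds = timeit.timeit(with_mask, number=20)
print(f"same result: {same_result}")
print(f"groupby: {groupby_seconds:.4f} s, mask: {mask_seconds:.4f} s for 20 runs each")
```

When only one group is needed, the groupby version still has to work out the grouping for every indicator, which is one plausible reason the boolean mask was faster in the timings above.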


In [95]:
#Dependencies
%load_ext watermark
%watermark
%watermark --iversions

Last updated: 2022-09-26T16:30:04.948772-04:00

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.4.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
CPU cores   : 8
Architecture: 64bit

sys   : 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)]
altair: 4.2.0
numpy : 1.22.3
pandas: 1.4.2

