# GroupBy

### Introduction:

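GroupBy can be summarized as split-apply-combine: split the data into groups, apply a function to each group, and combine the results into a single object.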
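As a rough illustration of the three steps, the sketch below performs them by hand on a tiny made-up frame and compares the result with the equivalent `groupby` call. The frame `toy` and its values are assumptions for illustration only; they are not taken from the dataset used below.

```python
import pandas as pd

# Tiny made-up frame; values are illustrative, not taken from drinks.csv.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AF"],
    "beer_servings": [100, 300, 10, 30, 50],
})

# Split: one sub-frame per continent.
pieces = {key: group for key, group in toy.groupby("continent")}

# Apply: reduce each piece to a single number.
means = {key: group["beer_servings"].mean() for key, group in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(means, name="beer_servings").rename_axis("continent")

# The single call that performs all three steps.
grouped = toy.groupby("continent")["beer_servings"].mean()

print(manual)
print(grouped)
```

Both printed Series should list the same per-continent means; the one-line form is what the remaining steps use.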

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)

### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)
# pd.set_option("display.precision", 2)
pd.set_option("display.float_format", '{:.2f}'.format)
# pd.options.display.float_format = '{:.2f}'.format

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

In [2]:
path = "../../data/"
file_name = "drinks.csv"
df = pd.read_csv(path+file_name)
df.shape, df

((193, 6),
          country  beer_servings  spirit_servings  wine_servings  \
 0    Afghanistan              0                0              0   
 1        Albania             89              132             54   
 2        Algeria             25                0             14   
 3        Andorra            245              138            312   
 4         Angola            217               57             45   
 ..           ...            ...              ...            ...   
 188    Venezuela            333              100              3   
 189      Vietnam            111                2              1   
 190        Yemen              6                0              0   
 191       Zambia             32               19              4   
 192     Zimbabwe             64               18              4   
 
      total_litres_of_pure_alcohol continent  
 0                            0.00        AS  
 1                            4.90        EU  
 2                            
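One thing that may be worth checking at load time: the per-continent counts in Step 5 add up to 170, not 193, which suggests some rows have a missing `continent`. A plausible cause (an assumption about the file, not something shown above) is that a continent code such as `NA` is read as a missing value, because `pd.read_csv` treats the string `NA` as NaN by default. The sketch below uses two made-up rows to show the effect and how `keep_default_na=False` avoids it.

```python
import io
import pandas as pd

# Two made-up rows in a layout similar to drinks.csv; values are illustrative only.
sample_csv = io.StringIO(
    "country,beer_servings,continent\n"
    "Examplia,10,NA\n"
    "Samplestan,20,EU\n"
)

# Default behaviour: the literal string "NA" is parsed as a missing value.
default_read = pd.read_csv(sample_csv)
print(default_read["continent"].isna().sum())

# Rewind the buffer and read again, keeping "NA" as an ordinary string.
sample_csv.seek(0)
literal_read = pd.read_csv(sample_csv, keep_default_na=False)
print(literal_read["continent"].tolist())
```

Note that `keep_default_na=False` turns off all default missing-value markers, so empty cells in text columns would then come through as empty strings rather than NaN.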

### Step 3. Assign it to a variable called drinks.

In [3]:
drinks = df.copy()
drinks.head()

| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |


### Step 4. Which continent drinks more beer on average?

In [4]:
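# Exploratory: attach each continent's row count to every country.
# The default inner merge keeps only rows whose continent appears in value_counts().
drinks.merge(drinks.continent.value_counts(), on='continent')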

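| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent | count |
|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.00 | AS | 44 |
| 1 | Albania | 89 | 132 | 54 | 4.90 | EU | 45 |
| 2 | Algeria | 25 | 0 | 14 | 0.70 | AF | 53 |
| 3 | Andorra | 245 | 138 | 312 | 12.40 | EU | 45 |
| 4 | Angola | 217 | 57 | 45 | 5.90 | AF | 53 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 165 | Venezuela | 333 | 100 | 3 | 7.70 | SA | 12 |
| 166 | Vietnam | 111 | 2 | 1 | 2.00 | AS | 44 |
| 167 | Yemen | 6 | 0 | 0 | 0.10 | AS | 44 |
| 168 | Zambia | 32 | 19 | 4 | 2.50 | AF | 53 |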


In [5]:
# Mean per continent computed by hand as sum / count.
# Note: this averages total_litres_of_pure_alcohol, not beer_servings.
cont_dsum = drinks.groupby(by=['continent'])['total_litres_of_pure_alcohol'].agg(['sum'])
cont_dcount = drinks.groupby(by=['continent'])['country'].agg(['count'])
cont_data = cont_dsum.merge(cont_dcount, on='continent')
(cont_data['sum']/cont_data['count']).sort_values(ascending=False)

continent
EU   8.62
SA   6.31
OC   3.38
AF   3.01
AS   2.17
dtype: float64
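The cell above averages `total_litres_of_pure_alcohol`, while the question asks about beer. A minimal sketch of the more direct route, taking the mean of `beer_servings` per continent and reading off the largest, is shown below on a made-up frame (the countries and values are placeholders, not rows from the dataset). On the real data, the Step 6 table lists EU with the highest mean `beer_servings` (193.78).

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["EU", "EU", "AS", "AS", "AF"],
    "beer_servings": [100, 300, 10, 30, 50],
})

# Mean beer servings per continent, largest first.
beer_mean = (
    toy.groupby("continent")["beer_servings"]
       .mean()
       .sort_values(ascending=False)
)

# Label of the continent with the highest mean.
top_continent = beer_mean.idxmax()

print(beer_mean)
print(top_continent)
```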

In [6]:
(drinks.groupby(by=['country'])[['total_litres_of_pure_alcohol']]
       .mean()
       .sort_values(by='total_litres_of_pure_alcohol', ascending=False)
       .iloc[[0], :])

| country | total_litres_of_pure_alcohol |
|---|---|
| Belarus | 14.4 |
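The sort-then-`iloc` chain above selects the single row with the largest value. If each country appears in exactly one row (an assumption about the data), the grouping step is not needed, and `nlargest` or `idxmax` express the same idea more briefly. The sketch below uses made-up country names and values.

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "country": ["Aland", "Borduria", "Cortana"],
    "total_litres_of_pure_alcohol": [4.9, 14.4, 0.7],
})

# Row with the single largest value, returned as a one-row DataFrame.
top = toy.nlargest(1, "total_litres_of_pure_alcohol")

# Just the label of that row, using the country as the index.
top_country = toy.set_index("country")["total_litres_of_pure_alcohol"].idxmax()

print(top)
print(top_country)
```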


### Step 5. For each continent print the statistics for wine consumption.

In [7]:
drinks.groupby('continent')[['wine_servings']].describe().T
# drinks

| | continent | AF | AS | EU | OC | SA |
|---|---|---|---|---|---|---|
| wine_servings | count | 53.0 | 44.0 | 45.0 | 16.0 | 12.0 |
| wine_servings | mean | 16.26 | 9.07 | 142.22 | 35.62 | 62.42 |
| wine_servings | std | 38.85 | 21.67 | 97.42 | 64.56 | 88.62 |
| wine_servings | min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| wine_servings | 25% | 1.0 | 0.0 | 59.0 | 1.0 | 3.0 |
| wine_servings | 50% | 2.0 | 1.0 | 128.0 | 8.5 | 12.0 |
| wine_servings | 75% | 13.0 | 8.0 | 195.0 | 23.25 | 98.5 |
| wine_servings | max | 233.0 | 123.0 | 370.0 | 212.0 | 221.0 |
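`describe()` returns a fixed set of statistics per group. When only a few of them are needed, `agg` with a list of function names gives a narrower table. The sketch below compares the two on a made-up frame; the values are placeholders.

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "EU", "AS", "AS", "AS"],
    "wine_servings": [50, 150, 250, 0, 2, 10],
})

wine = toy.groupby("continent")["wine_servings"]

# Full summary: count, mean, std, min, quartiles, max (one row per continent).
full = wine.describe()

# Hand-picked subset of statistics.
picked = wine.agg(["count", "mean", "median", "max"])

print(full)
print(picked)
```

The `50%` column of `describe()` and the `median` column of `agg` report the same quantity.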


### Step 6. Print the mean alcohol consumption per continent for every column

In [8]:
drinks.set_index('continent').select_dtypes(include=np.number).groupby('continent').mean()

| continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| AF | 61.47 | 16.34 | 16.26 | 3.01 |
| AS | 37.05 | 60.84 | 9.07 | 2.17 |
| EU | 193.78 | 132.56 | 142.22 | 8.62 |
| OC | 89.69 | 58.44 | 35.62 | 3.38 |
| SA | 175.08 | 114.75 | 62.42 | 6.31 |


In [9]:
drinks.groupby('continent')[drinks.select_dtypes(include=np.number).columns].mean()

| continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| AF | 61.47 | 16.34 | 16.26 | 3.01 |
| AS | 37.05 | 60.84 | 9.07 | 2.17 |
| EU | 193.78 | 132.56 | 142.22 | 8.62 |
| OC | 89.69 | 58.44 | 35.62 | 3.38 |
| SA | 175.08 | 114.75 | 62.42 | 6.31 |
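Selecting the numeric columns by hand works, but the groupby `mean` method also accepts `numeric_only=True`, which skips non-numeric columns such as `country`. A minimal sketch on a made-up frame (values are placeholders):

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 300, 10, 30],
    "total_litres_of_pure_alcohol": [5.0, 9.0, 0.5, 1.5],
})

# Average every numeric column per continent; the text column is left out.
means = toy.groupby("continent").mean(numeric_only=True)

print(means)
```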


### Step 7. Print the median alcohol consumption per continent for every column

In [10]:
drinks.groupby('continent')[drinks.select_dtypes(include=np.number).columns].median()

| continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| AF | 32.0 | 3.0 | 2.0 | 2.3 |
| AS | 17.5 | 16.0 | 1.0 | 1.2 |
| EU | 219.0 | 122.0 | 128.0 | 10.0 |
| OC | 52.5 | 37.0 | 8.5 | 1.75 |
| SA | 162.5 | 108.5 | 12.0 | 6.85 |
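The medians differ noticeably from the means in Step 6 (for example, OC wine: mean 35.62 versus median 8.5), which usually indicates a few large values pulling the mean up. One way to see both side by side is to pass both function names to `agg`. The sketch below does this on made-up values chosen to show one skewed group and one symmetric group.

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "continent": ["OC"] * 4 + ["EU"] * 4,
    "wine_servings": [0, 1, 9, 210, 100, 120, 140, 160],
})

# Mean and median per continent in one table.
summary = toy.groupby("continent")["wine_servings"].agg(["mean", "median"])

# A large positive gap hints at a right-skewed group.
summary["gap"] = summary["mean"] - summary["median"]

print(summary)
```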


### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [11]:
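# Only the last expression in a cell is displayed, so this mean-only selection is computed but not shown.
drinks.groupby('continent')[['spirit_servings']].agg(['mean', 'min', 'max']).loc[:,[('spirit_servings', 'mean')]]
# Double brackets keep the result a DataFrame with (column, statistic) MultiIndex columns.
drinks.groupby('continent')[['spirit_servings']].agg(['mean', 'min', 'max'])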

| continent | spirit_servings (mean) | spirit_servings (min) | spirit_servings (max) |
|---|---|---|---|
| AF | 16.34 | 0 | 152 |
| AS | 60.84 | 0 | 326 |
| EU | 132.56 | 0 | 373 |
| OC | 58.44 | 0 | 254 |
| SA | 114.75 | 25 | 302 |
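The result above has two-level column labels. If flat column names are preferred, pandas' named aggregation is one alternative: each keyword passed to `agg` becomes an output column, mapped to a `(column, function)` pair. The output names `spirit_mean`, `spirit_min`, and `spirit_max` below are arbitrary choices for this sketch, and the frame is made up.

```python
import pandas as pd

# Made-up rows for illustration; not taken from drinks.csv.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "spirit_servings": [100, 200, 0, 50],
})

# Named aggregation: keyword = (source column, aggregation function).
spirit = toy.groupby("continent").agg(
    spirit_mean=("spirit_servings", "mean"),
    spirit_min=("spirit_servings", "min"),
    spirit_max=("spirit_servings", "max"),
)

print(spirit)
```

The result is still a DataFrame indexed by continent, but its columns are a plain single-level index.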
