# Ex - GroupBy

### Introduction:

GroupBy can be summarized as Split-Apply-Combine.

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
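
The three stages can also be written out by hand to see what `groupby` does in one call. This is a minimal sketch on hypothetical toy data (the `group` and `value` columns are not part of the drinks dataset):

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the drinks dataset)
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 8],
})

# Split: one sub-DataFrame per distinct key
pieces = {key: part for key, part in df.groupby("group")}

# Apply: reduce each piece to a single number
means = {key: part["value"].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name="value").rename_axis("group")

# The usual one-liner performs all three stages at once
builtin = df.groupby("group")["value"].mean()

print(manual)
print(builtin)
```

Both printed Series should hold the same per-group means.
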
### Step 1. Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [4]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drinks = pd.read_csv(url)

### Step 4. Which continent drinks more beer on average?

In [5]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


- dtypes look good. The only nulls are in continent. Could these be islands?
- column names are Pythonic


In [7]:
drinks.head()

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [11]:
# what continents are listed?
drinks.continent.value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64
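
`value_counts()` skips missing values by default, which is why the counts above add up to 170 rather than 193. A minimal sketch on a hypothetical six-value Series, showing how `dropna=False` makes the gap visible:

```python
import pandas as pd

# Small stand-in for the continent column (hypothetical sample)
continent = pd.Series(["EU", "AF", None, "EU", None, "AS"], name="continent")

# Default: missing values are left out of the tally
counts_default = continent.value_counts()
print(counts_default)

# dropna=False adds a row that counts the missing values
counts_with_nulls = continent.value_counts(dropna=False)
print(counts_with_nulls)
```
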

In [12]:
# Assuming these abbreviations:
# AF = Africa, EU = Europe, AS = Asia, OC = Oceania, SA = South America
# Australia may be missing unless it is counted under Oceania, and
# North America does not appear at all

In [17]:
# Let's check for Australia
drinks[drinks.continent == 'OC']

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
8,Australia,261,72,212,10.4,OC
40,Cook Islands,0,254,74,5.9,OC
59,Fiji,77,35,1,2.0,OC
89,Kiribati,21,34,1,1.0,OC
106,Marshall Islands,0,0,0,0.0,OC
110,Micronesia,62,50,18,2.3,OC
118,Nauru,49,0,8,1.0,OC
121,New Zealand,203,79,175,9.3,OC
125,Niue,188,200,7,7.0,OC
129,Palau,306,63,23,6.9,OC


In [15]:
# so, yes, Australia is assigned to OC

In [16]:
# so let's get a look at those nulls
drinks[drinks.continent.isna()]

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
5,Antigua & Barbuda,102,128,45,4.9,
11,Bahamas,122,176,51,6.3,
14,Barbados,143,173,36,6.3,
17,Belize,263,114,8,6.8,
32,Canada,240,122,100,8.2,
41,Costa Rica,149,87,11,4.4,
43,Cuba,93,137,5,4.2,
50,Dominica,52,286,26,6.6,
51,Dominican Republic,193,147,9,6.2,
54,El Salvador,52,69,2,2.2,


In [18]:
# The nulls are all North American and Caribbean countries, so replace them with 'NA'.
# Only the continent column has nulls, so fill that column rather than the whole frame.
drinks['continent'] = drinks['continent'].fillna('NA')
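
The nulls most likely come from parsing rather than from the data itself: `pd.read_csv` treats the literal string `NA` as missing by default. This sketch assumes the raw file uses `NA` as the code for North America. The inline CSV is a hypothetical three-row sample (values copied from the outputs above), and `keep_default_na=False` is one way to keep the code as text:

```python
import io

import pandas as pd

# Hypothetical inline sample in the same layout as drinks.csv
csv_text = """country,beer_servings,continent
Albania,89,EU
Canada,240,NA
Cuba,93,NA
"""

# Default parsing: the literal string "NA" becomes a missing value
default = pd.read_csv(io.StringIO(csv_text))
print(default["continent"].isna().sum())

# keep_default_na=False turns off the built-in missing-value markers,
# so "NA" survives as an ordinary string
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept["continent"].tolist())
```

Note that this option disables every default marker, so genuinely empty cells in text columns would then load as empty strings instead of NaN.
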

In [23]:
# numeric_only=True keeps the text column (country) out of the sum
drinks.groupby('continent').sum(numeric_only=True).sort_values('beer_servings', ascending=False)

continent,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
EU,8720,5965,6400,387.8
NA,3345,3812,564,137.9
AF,3258,866,862,159.4
SA,2101,1377,749,75.7
AS,1630,2677,399,95.5
OC,1435,935,570,54.1


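- Europe has by far the highest total beer servings (8,720). This is a sum across countries, not an average; the per-continent means printed in Step 6 also put Europe first, at about 194 servings.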
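
Because the question asks about the average, a mean-based check is a closer match than a sum. A minimal sketch, using only the five rows shown by `drinks.head()` above rather than the full dataset (`sample` and `beer_mean` are illustrative names):

```python
import pandas as pd

# The five rows shown by drinks.head() above
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [0, 89, 25, 245, 217],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

# Mean beer servings per country within each continent, highest first
beer_mean = (
    sample.groupby("continent")["beer_servings"]
    .mean()
    .sort_values(ascending=False)
)
print(beer_mean)

# idxmax returns the label of the largest value
print(beer_mean.idxmax())
```

On the full data the same two calls would be applied to `drinks` instead of `sample`.
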

### Step 5. For each continent print the statistics for wine consumption.

In [38]:
print(drinks[['continent', 'wine_servings']].groupby('continent').sum().sort_values('wine_servings', ascending=False))

           wine_servings
continent               
EU                  6400
AF                   862
SA                   749
OC                   570
NA                   564
AS                   399


### Step 6. Print the mean alcohol consumption per continent for every column

In [33]:
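# numeric_only=True skips the text column (country); pandas 2.x raises a TypeError without it
print(f'Mean alcohol consumption by continent and drink type:\n\n {drinks.groupby("continent").mean(numeric_only=True)}')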

Mean alcohol consumption by continent and drink type:

            beer_servings  spirit_servings  wine_servings  \
continent                                                  
AF             61.471698        16.339623      16.264151   
AS             37.045455        60.840909       9.068182   
EU            193.777778       132.555556     142.222222   
NA            145.434783       165.739130      24.521739   
OC             89.687500        58.437500      35.625000   
SA            175.083333       114.750000      62.416667   

           total_litres_of_pure_alcohol  
continent                                
AF                             3.007547  
AS                             2.170455  
EU                             8.617778  
NA                             5.995652  
OC                             3.381250  
SA                             6.308333  


### Step 7. Print the median alcohol consumption per continent for every column

In [37]:
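# numeric_only=True skips the text column (country); pandas 2.x raises a TypeError without it
print(f'Median alcohol consumption by continent and drink type:\n\n {drinks.groupby("continent").median(numeric_only=True)}')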

Median alcohol consumption by continent and drink type:

            beer_servings  spirit_servings  wine_servings  \
continent                                                  
AF                  32.0              3.0            2.0   
AS                  17.5             16.0            1.0   
EU                 219.0            122.0          128.0   
NA                 143.0            137.0           11.0   
OC                  52.5             37.0            8.5   
SA                 162.5            108.5           12.0   

           total_litres_of_pure_alcohol  
continent                                
AF                                 2.30  
AS                                 1.20  
EU                                10.00  
NA                                 6.30  
OC                                 1.75  
SA                                 6.85  


### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [50]:
max_spirit = drinks['spirit_servings'].max()
mean_spirit = round(drinks['spirit_servings'].mean())
min_spirit = drinks['spirit_servings'].min()

print(f'Mean, minimum, and maximum values for spirit consumption are {mean_spirit}, {min_spirit}, and {max_spirit}, respectively.')


Mean, minimum, and maximum values for spirit consumption are 81, 0, and 438, respectively.
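
The step asks for a DataFrame, while the cell above prints scalars. One possible way to get a DataFrame is `agg` with a list of function names. A minimal sketch on the five rows shown by `drinks.head()` above; grouping by continent is an assumption about what the exercise intends:

```python
import pandas as pd

# Spirit servings for the five countries shown by drinks.head() above
sample = pd.DataFrame({
    "continent": ["AS", "EU", "AF", "EU", "AF"],
    "spirit_servings": [0, 132, 0, 138, 57],
})

# Whole-column summary, converted to a one-column DataFrame
overall = sample["spirit_servings"].agg(["mean", "min", "max"]).to_frame()
print(overall)

# Per-continent summary: one row per continent, one column per statistic
by_continent = sample.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])
print(by_continent)
```
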


In [63]:
# let's see which countries these are:
print(f"Mean, minimum, and maximum values for spirit consumption are\n\n{mean_spirit}, \n\n{drinks[drinks.spirit_servings == min_spirit][['country', 'spirit_servings']]}, \n\nand {drinks[drinks.spirit_servings == max_spirit][['country', 'spirit_servings']]}, \n\nrespectively.")


Mean, minimum, and maximum values for spirit consumption are

81, 

               country  spirit_servings
0          Afghanistan                0
2              Algeria                0
13          Bangladesh                0
19              Bhutan                0
27             Burundi                0
46         North Korea                0
55   Equatorial Guinea                0
56             Eritrea                0
63              Gambia                0
70              Guinea                0
79                Iran                0
90              Kuwait                0
92                Laos                0
97               Libya                0
103           Maldives                0
106   Marshall Islands                0
107         Mauritania                0
111             Monaco                0
118              Nauru                0
128           Pakistan                0
147         San Marino                0
158            Somalia                0
190         