### Grouping and Aggregation

Let us consider some of the questions you posed:

3. Which government form has the highest life expectancy?
6. Compare the surface area of the regions and sort in ascending order.
10. Compare government type and life expectancy.
11. What regions have a high life expectancy?
16. What is the most common form of government?
20. Which region has the highest total GNP?
21. What is the most common form of government in Asia?
22. Which continent has the largest surface area?
12. List the total population of each continent and order them from smallest to largest.
13. In which decade did the most countries achieve independence?
14. Show the 10 government forms with the largest population.
15. List the total population of each region in Africa from lowest to highest.


All of these questions have something in common.  They ask you to summarize (aggregate) some data from a group of countries.

In [1]:
import warnings
warnings.filterwarnings('ignore')
from reframe import Relation

In [2]:
country = Relation('/home/faculty/millbr02/pub/country.csv')

How many countries are in each region?

In [3]:
country.project(['region','name']).sort(['region']).head(15)

Unnamed: 0,region,name
236,Antarctica,Heard Island and McDonald Islands
235,Antarctica,South Georgia and the South Sandwich Islands
233,Antarctica,Bouvet Island
232,Antarctica,Antarctica
237,Antarctica,French Southern territories
14,Australia and New Zealand,Australia
100,Australia and New Zealand,Cocos (Keeling) Islands
84,Australia and New Zealand,Christmas Island
218,Australia and New Zealand,New Zealand
147,Australia and New Zealand,Norfolk Island


What we really want to do is squash all of the rows down to one for each of the regions, counting how many rows we had to squash.  We can do that.
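
To make the "squash and count" idea concrete, here is a minimal plain-Python sketch that does not use `reframe` at all. It works only on the ten (region, name) rows displayed above, and the variable names are illustrative rather than part of any library.

```python
from collections import Counter

# The ten (region, name) rows displayed in the output above.
rows = [
    ("Antarctica", "Heard Island and McDonald Islands"),
    ("Antarctica", "South Georgia and the South Sandwich Islands"),
    ("Antarctica", "Bouvet Island"),
    ("Antarctica", "Antarctica"),
    ("Antarctica", "French Southern territories"),
    ("Australia and New Zealand", "Australia"),
    ("Australia and New Zealand", "Cocos (Keeling) Islands"),
    ("Australia and New Zealand", "Christmas Island"),
    ("Australia and New Zealand", "New Zealand"),
    ("Australia and New Zealand", "Norfolk Island"),
]

# Squash the rows down to one entry per region,
# counting how many rows were squashed into each entry.
countries_per_region = Counter(region for region, _name in rows)

for region, n in sorted(countries_per_region.items()):
    print(region, n)
```

Each distinct region ends up as a single entry whose value is the number of rows that shared that region.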

### Introducing the groupby operator

* First we choose one or more columns to group on.
* Then we choose an aggregation operator:

  * count
  * min
  * max
  * sum
  * mean
  * median
  
**Very Powerful**
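
The two steps (pick the grouping column, then pick an aggregation operator) can be sketched in plain Python. This is only an illustration of the idea, not how `reframe` is implemented: the helper `aggregate_by` and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy rows of (group, value); the numbers are made up.
rows = [
    ("A", 50.0),
    ("A", 60.0),
    ("B", 70.0),
    ("B", 80.0),
    ("B", 75.0),
]

def aggregate_by(rows, aggregate):
    """Collect the values for each group, then reduce each group to one value."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: aggregate(values) for key, values in groups.items()}

# The same grouping works with any of the aggregation operators listed above.
for name, aggregate in [("count", len), ("min", min), ("max", max),
                        ("sum", sum), ("mean", mean), ("median", median)]:
    print(name, aggregate_by(rows, aggregate))
```

The grouping step is identical every time; only the function applied to each group's values changes.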

How many countries are in each region?


In [5]:
country.groupby(['region']).count('name').sort(['count_name'],ascending=False)

Unnamed: 0,region,count_name
4,Caribbean,24
7,Eastern Africa,20
13,Middle East,18
23,Western Africa,17
21,Southern Europe,15
22,Southern and Central Asia,14
18,South America,14
19,Southeast Asia,11
17,Polynesia,10
9,Eastern Europe,10


In [5]:
country.groupby(['continent','region']).count('name')

Unnamed: 0,continent,region,count_name
0,Africa,Central Africa,9
1,Africa,Eastern Africa,20
2,Africa,Northern Africa,7
3,Africa,Southern Africa,5
4,Africa,Western Africa,17
5,Antarctica,Antarctica,5
6,Asia,Eastern Asia,8
7,Asia,Middle East,18
8,Asia,Southeast Asia,11
9,Asia,Southern and Central Asia,14


What is the average life expectancy for each continent?

In [7]:
country.groupby(['continent']).mean('lifeexpectancy').sort(['mean_lifeexpectancy'])

Unnamed: 0,continent,mean_lifeexpectancy
0,Africa,52.57193
2,Asia,67.441176
5,Oceania,69.715
6,South America,70.946154
4,North America,72.991892
3,Europe,75.147727
1,Antarctica,


OK, Jeopardy style... What question does the following query answer?

In [11]:
country.groupby(['region']).max('gnp')
#country.project(['region','name','gnp']).sort(['region','name','gnp'])

Unnamed: 0,region,max_gnp
0,Antarctica,0
1,Australia and New Zealand,351182
2,Baltic Countries,10692
3,British Islands,1378330
4,Caribbean,34100
5,Central Africa,9174
6,Central America,414972
7,Eastern Africa,9217
8,Eastern Asia,3787042
9,Eastern Europe,276608


Notice that the name of the aggregated column has changed to the form `aggregate_column` (for example, `max_gnp`). We can change that, if we want to, using the rename operator.


In [8]:
country.groupby(['region']).max('gnp').rename('max_gnp','gnp').sort(['gnp'])

Unnamed: 0,region,gnp
0,Antarctica,0
12,Micronesia/Caribbean,0
17,Polynesia,818
11,Micronesia,1197
10,Melanesia,4988
5,Central Africa,9174
7,Eastern Africa,9217
2,Baltic Countries,10692
4,Caribbean,34100
23,Western Africa,65707


In [9]:
country.groupby(['continent']).sum('surfacearea').sort('sum_surfacearea')

Unnamed: 0,continent,sum_surfacearea
5,Oceania,8564294.0
1,Antarctica,13132101.0
6,South America,17864922.0
3,Europe,23049133.9
4,North America,24214469.0
0,Africa,30250377.0
2,Asia,31881008.0


Of course, combining groupby with query and project is an important part of problem solving.
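
The order of the steps matters: filtering before grouping limits which rows are counted, while filtering after grouping limits which groups are kept. Here is a plain-Python sketch of that pipeline, independent of `reframe`. The rows are hypothetical toy data, not taken from `country.csv`.

```python
from collections import Counter

# Hypothetical toy rows of (continent, governmentform).
rows = [
    ("Asia", "Republic"),
    ("Asia", "Republic"),
    ("Asia", "Monarchy"),
    ("Europe", "Republic"),
    ("Europe", "Constitutional Monarchy"),
]

# Step 1 (filter the rows): keep only the continent of interest.
asia_forms = [form for continent, form in rows if continent == "Asia"]

# Step 2 (group and count): one entry per government form.
counts = Counter(asia_forms)

# Step 3 (sort): most common form first.
ranked = counts.most_common()

# Step 4 (filter the groups): keep forms that occur more than once.
frequent = [(form, n) for form, n in ranked if n > 1]

print(ranked)
print(frequent)
```

Note that the European "Republic" row never reaches the count, because it was removed in step 1.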

What is the most popular government form in Asia?

In [13]:
country.query('continent == "Asia"').groupby(['governmentform']).count('name').sort(['count_name'],ascending=False)


Unnamed: 0,governmentform,count_name
13,Republic,26
2,Constitutional Monarchy,5
9,Monarchy,3
6,Federal Republic,2
11,Monarchy (Sultanate),2
14,Socialistic Republic,2
15,Special Administrative Region of China,2
0,Administrated by the UN,1
1,Autonomous Area,1
3,Constitutional Monarchy (Emirate),1


In [12]:
country.query('continent == "Asia"').groupby('governmentform').count('name').query('count_name > 1')

Unnamed: 0,governmentform,count_name
2,Constitutional Monarchy,5
6,Federal Republic,2
9,Monarchy,3
11,Monarchy (Sultanate),2
13,Republic,26
14,Socialistic Republic,2
15,Special Administrative Region of China,2


In [13]:
import pandas as pd