### Import Pandas and the Dataset

In [1]:
import pandas as pd

df = pd.read_csv("datasets/world_population.csv")

# Drop the historical population columns; only the 2022 figures are kept
df = df.drop(columns=["2020 Population", "2015 Population", "2010 Population", "2000 Population",
                      "1990 Population", "1980 Population", "1970 Population"])
df = df.rename(columns={"Country/Territory": "Country", 
                        "2022 Population": "Population"})

df.head()

Unnamed: 0,Rank,CCA3,Country,Capital,Continent,Population,Area (km²),Density (per km²),Growth Rate,World Population Percentage
0,36,AFG,Afghanistan,Kabul,Asia,41128771,652230,63.0587,1.0257,0.52
1,138,ALB,Albania,Tirana,Europe,2842321,28748,98.8702,0.9957,0.04
2,34,DZA,Algeria,Algiers,Africa,44903225,2381741,18.8531,1.0164,0.56
3,213,ASM,American Samoa,Pago Pago,Oceania,44273,199,222.4774,0.9831,0.0
4,203,AND,Andorra,Andorra la Vella,Europe,79824,468,170.5641,1.01,0.0


## DATAFRAME AGGREGATION - GROUP BY 

**DataFrame Aggregation** with **Group By** groups the rows by one or more key columns and then computes summary statistics for selected columns within each group.

```df.groupby("Continent").agg({"Country": "count"})```

- **Continent** is the grouping key: the column whose values define the groups
- **Country** is the column to be summarized
- **count** is the statistic computed for the "Country" column

Commonly used statistics:
- **count    =** number of non-null values in the column
- **nunique  =** number of distinct values in the column (duplicates counted once)
- **mean     =** arithmetic mean of the column
- **median   =** middle value (median) of the column
- **min      =** smallest value in the column
- **max      =** largest value in the column
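
The difference between **count** and **nunique** is easy to overlook, so here is a minimal sketch on a hypothetical toy table (the `toy` frame and its values are made up for illustration and are not part of `world_population.csv`). It assumes one missing value and one duplicate in the summarized column.

```python
import pandas as pd

# Hypothetical toy data: one duplicate ("Tokyo") and one missing capital
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "Capital": ["Tokyo", "Tokyo", None, "Paris", "Rome"],
})

# count ignores missing values, nunique counts each distinct value once
summary = toy.groupby("Continent").agg({"Capital": ["count", "nunique"]})

# size() counts every row in the group, including rows with missing values
rows_per_group = toy.groupby("Continent").size()

print(summary)
print(rows_per_group)
```

For the "Asia" group this gives a count of 2, a nunique of 1, and a size of 3, which is why `size()` may be the safer choice when the goal is the number of rows rather than the number of filled-in values.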

### Group By with one key, one column, and one statistic

In [2]:
# Count the countries on each continent

df.groupby("Continent").agg({"Country": "count"})

Unnamed: 0_level_0,Country
Continent,Unnamed: 1_level_1
Africa,57
Asia,50
Europe,50
North America,40
Oceania,23
South America,14


In [3]:
# Median population per continent

df.groupby("Continent").agg({"Population": "median"})

Unnamed: 0_level_0,Population
Continent,Unnamed: 1_level_1
Africa,13352864
Asia,18082920
Europe,5228714
North America,236399
Oceania,114164
South America,15112555


In [4]:
# Mean Growth Rate per continent

df.groupby("Continent").agg({"Growth Rate": "mean"})

Unnamed: 0_level_0,Growth Rate
Continent,Unnamed: 1_level_1
Africa,1.021244
Asia,1.009384
Europe,1.002256
North America,1.004175
Oceania,1.007383
South America,1.007957


In [5]:
# Best rank on each continent (rank 1 is the highest, so use min)

df.groupby("Continent").agg({"Rank": "min"})

Unnamed: 0_level_0,Rank
Continent,Unnamed: 1_level_1
Africa,6
Asia,1
Europe,9
North America,3
Oceania,55
South America,7


### Group By with one key, one column, and multiple statistics

In [6]:
# Min, median, and max population per continent

df.groupby("Continent").agg({"Population": ["min", "median", "max"]})

Unnamed: 0_level_0,Population,Population,Population
Unnamed: 0_level_1,min,median,max
Continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Africa,107118,13352864,218541212
Asia,449002,18082920,1425887337
Europe,510,5228714,144713314
North America,4390,236399,338289857
Oceania,1871,114164,26177413
South America,3780,15112555,215313498


In [7]:
# Min, median, and max Growth Rate per continent

df.groupby("Continent").agg({"Growth Rate": ["min", "median", "max"]})

Unnamed: 0_level_0,Growth Rate,Growth Rate,Growth Rate
Unnamed: 0_level_1,min,median,max
Continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Africa,1.0004,1.0231,1.0378
Asia,0.9816,1.0081,1.0376
Europe,0.912,1.0015,1.0691
North America,0.9937,1.00415,1.015
Oceania,0.9831,1.0079,1.0238
South America,0.999,1.0063,1.0239


### Group By with one key, multiple columns, and multiple statistics

In [8]:
df.groupby("Continent").agg({"Country": "count",
                             "Rank": ["min", "max"],
                             "Population": ["min", "max"]})

Unnamed: 0_level_0,Country,Rank,Rank,Population,Population
Unnamed: 0_level_1,count,min,max,min,max
Continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Africa,57,6,196,107118,218541212
Asia,50,1,175,449002,1425887337
Europe,50,9,234,510,144713314
North America,40,3,230,4390,338289857
Oceania,23,55,233,1871,26177413
South America,14,7,231,3780,215313498


### Group By with multiple keys, multiple columns, and multiple statistics

In [9]:
# Example of grouping by multiple keys
# Note: this dataset is not well suited to this example

df.groupby(["Continent", "World Population Percentage"]).agg({"Country": "count",
                                                             "Rank": ["min", "median", "max"],
                                                             "Population": ["min", "median", "max"]})

Unnamed: 0_level_0,Unnamed: 1_level_0,Country,Rank,Rank,Rank,Population,Population,Population
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,median,max,min,median,max
Continent,World Population Percentage,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Africa,0.00,3,182,187.0,196,107118,227380.0,326101
Africa,0.01,5,160,163.0,172,575986,836774.0,1120849
Africa,0.02,3,152,157.0,159,1201670,1299469.0,1674908
Africa,0.03,6,142,145.5,149,2105566,2478002.0,2705992
Africa,0.05,1,132,132.0,132,3684032,3684032.0,3684032
...,...,...,...,...,...,...,...,...
South America,0.35,1,51,51.0,51,28301696,28301696.0,28301696
South America,0.43,1,44,44.0,44,34049588,34049588.0,34049588
South America,0.57,1,33,33.0,33,45510318,45510318.0,45510318
South America,0.65,1,28,28.0,28,51874024,51874024.0,51874024
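
Because "World Population Percentage" is nearly continuous, most of the groups above contain a single country. Grouping by several keys is easier to read when every key is categorical. The sketch below uses a hypothetical toy table (the `Size` column and all values are invented for illustration) to show the same call shape.

```python
import pandas as pd

# Hypothetical toy data with two categorical keys
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "Size": ["large", "small", "large", "small", "small", "large"],
    "Population": [1400, 5, 120, 3, 9, 80],
})

# Passing a list of keys produces one group per (Continent, Size) pair
by_two_keys = toy.groupby(["Continent", "Size"]).agg({"Population": ["count", "max"]})

print(by_two_keys)
```

The result has a two-level row index (`Continent`, `Size`), so a single group can be read with a tuple, for example `by_two_keys.loc[("Asia", "large")]`.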


### Renaming columns

We have already seen how to rename specific columns with ```df.rename()```\
This time we will rename all columns at once by assigning a list to the ```df.columns``` attribute (it is an attribute, not a method, so it is not called with parentheses)

In [10]:
df_continent = df.groupby("Continent").agg({"Country": "count",
                             "Rank": ["min", "max"],
                             "Population": ["min", "max"]})

df_continent

Unnamed: 0_level_0,Country,Rank,Rank,Population,Population
Unnamed: 0_level_1,count,min,max,min,max
Continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Africa,57,6,196,107118,218541212
Asia,50,1,175,449002,1425887337
Europe,50,9,234,510,144713314
North America,40,3,230,4390,338289857
Oceania,23,55,233,1871,26177413
South America,14,7,231,3780,215313498


In [11]:
df_continent = df_continent.reset_index()

df_continent

Unnamed: 0_level_0,Continent,Country,Rank,Rank,Population,Population
Unnamed: 0_level_1,Unnamed: 1_level_1,count,min,max,min,max
0,Africa,57,6,196,107118,218541212
1,Asia,50,1,175,449002,1425887337
2,Europe,50,9,234,510,144713314
3,North America,40,3,230,4390,338289857
4,Oceania,23,55,233,1871,26177413
5,South America,14,7,231,3780,215313498


In [12]:
df_continent.columns = ["Continent", "Sum_Country", "Rank_Min", "Rank_Max", "Population_Min", "Population_Max"]

df_continent

Unnamed: 0,Continent,Sum_Country,Rank_Min,Rank_Max,Population_Min,Population_Max
0,Africa,57,6,196,107118,218541212
1,Asia,50,1,175,449002,1425887337
2,Europe,50,9,234,510,144713314
3,North America,40,3,230,4390,338289857
4,Oceania,23,55,233,1871,26177413
5,South America,14,7,231,3780,215313498
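
As an alternative to assigning the names by hand, pandas also supports named aggregation, where each output column is named directly in the `agg()` call. A minimal sketch on a hypothetical toy table follows (the data are invented; the output names mirror the ones used above).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe"],
    "Country": ["A", "B", "C"],
    "Rank": [1, 36, 9],
    "Population": [1000, 40, 140],
})

# Named aggregation: output_name=(source_column, statistic)
named = (
    toy.groupby("Continent")
       .agg(Sum_Country=("Country", "count"),
            Rank_Min=("Rank", "min"),
            Rank_Max=("Rank", "max"),
            Population_Min=("Population", "min"),
            Population_Max=("Population", "max"))
       .reset_index()
)

print(named)
```

Since every name is written next to the statistic it describes, this form avoids the two-level column header and makes it harder to attach a label to the wrong column.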


### Pivot Table for Aggregation

A pivot table on a pandas DataFrame works much like grouping.\
For a pivot table we specify:
1. The data the table is built from
2. **Index**: the column(s) that become the rows
3. **Aggfunc/Columns**: the statistics or column(s) that become the columns
4. **Values**: the column(s) whose values fill the table
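
To illustrate the **Columns** role from the list above, here is a minimal sketch on a hypothetical toy table (the `Size` column and all values are invented). It assumes a second categorical key that is spread across the columns instead of being added to the row index.

```python
import pandas as pd

# Hypothetical toy data with two categorical keys
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "Size": ["large", "small", "large", "small", "small", "large"],
    "Population": [1400, 5, 120, 3, 9, 80],
})

# index -> rows, columns -> column headers, values -> cells, aggfunc -> statistic
wide = pd.pivot_table(toy,
                      index="Continent",
                      columns="Size",
                      values="Population",
                      aggfunc="max")

# The same numbers via groupby, reshaped with unstack()
unstacked = toy.groupby(["Continent", "Size"])["Population"].max().unstack("Size")

print(wide)
```

Both `wide` and `unstacked` hold the largest population per continent and size, which shows that a pivot table is essentially a grouped aggregation laid out in wide form.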

In [13]:
df_pivot = pd.pivot_table(df,
                          index= ["Continent"],
                          aggfunc= {"Country": ["count"],
                                    "Rank": ["min", "max"],
                                    "Population": ["min", "max"]},
                          values= ["Country", "Rank", "Population"])

df_pivot

Unnamed: 0_level_0,Country,Population,Population,Rank,Rank
Unnamed: 0_level_1,count,max,min,max,min
Continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Africa,57,218541212,107118,196,6
Asia,50,1425887337,449002,175,1
Europe,50,144713314,510,234,9
North America,40,338289857,4390,230,3
Oceania,23,26177413,1871,233,55
South America,14,215313498,3780,231,7


In [14]:
df_pivot = df_pivot.reset_index()
df_pivot

Unnamed: 0_level_0,Continent,Country,Population,Population,Rank,Rank
Unnamed: 0_level_1,Unnamed: 1_level_1,count,max,min,max,min
0,Africa,57,218541212,107118,196,6
1,Asia,50,1425887337,449002,175,1
2,Europe,50,144713314,510,234,9
3,North America,40,338289857,4390,230,3
4,Oceania,23,26177413,1871,233,55
5,South America,14,215313498,3780,231,7


In [15]:
# In the pivot output above, max comes before min, so the new names must follow that order
df_pivot.columns = ["Continent", "Sum_Country", "Population_Max", "Population_Min", "Rank_Max", "Rank_Min"]

df_pivot

Unnamed: 0,Continent,Sum_Country,Population_Max,Population_Min,Rank_Max,Rank_Min
0,Africa,57,218541212,107118,196,6
1,Asia,50,1425887337,449002,175,1
2,Europe,50,144713314,510,234,9
3,North America,40,338289857,4390,230,3
4,Oceania,23,26177413,1871,233,55
5,South America,14,215313498,3780,231,7
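
Typing the new names by hand depends on knowing the exact column order of the pivot table. One way to reduce that risk is to build the names from the two-level header itself. The sketch below uses a hypothetical toy table (invented values) and a naming scheme of `<column>_<Statistic>`, which is an assumption chosen to match the names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Europe"],
    "Rank": [1, 36, 9],
    "Population": [1000, 40, 140],
})

pivot = pd.pivot_table(toy,
                       index="Continent",
                       values=["Population", "Rank"],
                       aggfunc={"Population": ["min", "max"],
                                "Rank": ["min", "max"]})

# Each column label is a (column, statistic) tuple; join the two parts into one name
pivot.columns = [f"{col}_{stat.capitalize()}" for col, stat in pivot.columns]
flat = pivot.reset_index()

print(flat)
```

Because each name is derived from the label it replaces, the "Max" and "Min" columns cannot be swapped by accident, whatever order `pivot_table` returns them in.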
