## Activity 04: Use Pandas to compute the Mean, Median, and Variance

In this activity, you will consolidate the skills you acquired in the last exercise and use Pandas to perform some basic mathematical calculations on our `world_population.csv` dataset.   
Pandas has a consistent API, so it should be easy to transfer your knowledge of the `mean` method to `median` and `var` (variance).   
Your existing knowledge of NumPy will also help.
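
As a minimal sketch of that consistency, the snippet below calls `mean`, `median`, and `var` with identical arguments on a small made-up DataFrame. The country names and values are illustrative assumptions and are not taken from `world_population.csv`.

```python
import pandas as pd

# Toy data for illustration only; not taken from world_population.csv
toy = pd.DataFrame(
    {"2014": [1.0, 10.0], "2015": [2.0, 20.0], "2016": [6.0, 30.0]},
    index=pd.Index(["A-land", "B-land"], name="Country Name"),
)

# The three reductions accept the same arguments, so only the method name changes
row_mean = toy.mean(axis=1)
row_median = toy.median(axis=1)
row_var = toy.var(axis=1)  # sample variance (ddof=1 by default)

print(row_mean)
print(row_median)
print(row_var)
```

Each call returns a Series indexed by country name, which is the same shape of result the cells in this activity produce.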

#### Loading the dataset

In [1]:
# importing the necessary dependencies
import pandas as pd

In [2]:
# loading the Dataset
dataset = pd.read_csv('./data/world_population.csv', index_col=0)

In [3]:
# looking at the first five rows of the dataset (the default for head())
dataset.head()

Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Aruba,ABW,Population density (people per sq. km of land ...,EN.POP.DNST,,307.972222,312.366667,314.983333,316.827778,318.666667,320.622222,...,562.322222,563.011111,563.422222,564.427778,566.311111,568.85,571.783333,574.672222,577.161111,
Andorra,AND,Population density (people per sq. km of land ...,EN.POP.DNST,,30.587234,32.714894,34.914894,37.170213,39.470213,41.8,...,180.591489,182.161702,181.859574,179.614894,175.161702,168.757447,161.493617,154.86383,149.942553,
Afghanistan,AFG,Population density (people per sq. km of land ...,EN.POP.DNST,,14.038148,14.312061,14.599692,14.901579,15.218206,15.545203,...,39.637202,40.634655,41.674005,42.830327,44.127634,45.533197,46.997059,48.444546,49.821649,
Angola,AGO,Population density (people per sq. km of land ...,EN.POP.DNST,,4.305195,4.384299,4.464433,4.544558,4.624228,4.703271,...,15.387749,15.915819,16.459536,17.020898,17.600302,18.196544,18.808215,19.433323,20.070565,
Albania,ALB,Population density (people per sq. km of land ...,EN.POP.DNST,,60.576642,62.456898,64.329234,66.209307,68.058066,69.874927,...,108.394781,107.566204,106.843759,106.314635,106.013869,105.848431,105.717226,105.60781,105.444051,


---

#### Mean

In [4]:
# calculate the mean of the third row, skipping the three leading text columns
dataset.iloc[2, 3:].mean()

25.373378700152898

In [5]:
# alternative: keep only the numeric columns, then take the mean of the third row
numerical = dataset.select_dtypes(include=['number'])
numerical.iloc[2].mean()

25.37337870015289

In [6]:
# alternative: select the third row as a one-row DataFrame and average across its numeric columns
dataset.iloc[[2]].mean(numeric_only=True, axis=1)

Country Name
Afghanistan    25.373379
dtype: float64

In [7]:
# calculate the mean of the last row
dataset.iloc[-1, 3:].mean()

24.520531613145806

In [8]:
dataset.iloc[[-1]].mean(numeric_only=True, axis=1)

Country Name
Zimbabwe    24.520532
dtype: float64

In [9]:
# calculate the mean of the country Germany
dataset.loc[["Germany"]].mean(numeric_only=True, axis=1)

Country Name
Germany    227.773688
dtype: float64

**Note:**   
`.iloc` and `.loc` are two important indexers in Pandas. They are used with square brackets rather than called like methods, and they allow precise selections of data based on either the integer position (`iloc`) or the index label (`loc`), which in our case is the country name.
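
The difference can be sketched on a small made-up DataFrame (the values below are illustrative assumptions, not real population densities). It also shows why the cells above use double brackets: a list of positions or labels keeps the result as a one-row DataFrame, so `mean(axis=1)` returns a Series labelled with the country name.

```python
import pandas as pd

# Made-up values; only the indexing pattern matters here
toy = pd.DataFrame(
    {"2014": [10.0, 20.0, 30.0], "2015": [11.0, 21.0, 31.0]},
    index=pd.Index(["Aruba", "Andorra", "Germany"], name="Country Name"),
)

by_position = toy.iloc[2]      # third row, selected by integer position
by_label = toy.loc["Germany"]  # same row, selected by index label

# Single brackets return a Series; double brackets keep a one-row DataFrame
as_frame = toy.loc[["Germany"]]

print(by_position.equals(by_label))
print(type(by_label).__name__, type(as_frame).__name__)
```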

---

#### Median

In [10]:
# calculate the median of the last row
dataset[-1:].median(numeric_only=True, axis=1)

Country Name
Zimbabwe    25.505431
dtype: float64

In [11]:
dataset.iloc[-1:].median(numeric_only=True, axis=1)

Country Name
Zimbabwe    25.505431
dtype: float64

In [12]:
# calculate the median of the last 3 rows
dataset[-3:].median(numeric_only=True, axis=1)

Country Name
Congo, Dem. Rep.    14.419050
Zambia              10.352668
Zimbabwe            25.505431
dtype: float64

**Note:**   
Slicing can be done in the same way as with NumPy.   
`dataset[1:3]` will return the second and third row of our dataset.
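
A minimal sketch of that equivalence, using a made-up 4x3 array rather than the population dataset (the row and column labels are hypothetical):

```python
import numpy as np
import pandas as pd

values = np.arange(12.0).reshape(4, 3)  # made-up numbers
toy = pd.DataFrame(values, index=["a", "b", "c", "d"], columns=["x", "y", "z"])

# A plain slice on a DataFrame selects rows by position, like a NumPy array
rows_df = toy[1:3]
rows_np = values[1:3]

print(list(rows_df.index))
print(np.array_equal(rows_df.to_numpy(), rows_np))
print(rows_df.equals(toy.iloc[1:3]))
```

Using `toy.iloc[1:3]` makes the positional intent explicit, which is why the cells in this section show both spellings.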

In [55]:
# calculate the median of the first 10 countries
dataset[:10].median(numeric_only=True, axis=1)

Country Name
Aruba                   348.022222
Andorra                 107.300000
Afghanistan              19.998926
Angola                    8.458253
Albania                 106.001058
Arab World               15.307283
United Arab Emirates     19.305072
Argentina                11.618238
Armenia                 105.898033
American Samoa          220.245000
dtype: float64

In [65]:
dataset.iloc[:10].median(numeric_only=True, axis=1)

Country Name
Aruba                   348.022222
Andorra                 107.300000
Afghanistan              19.998926
Angola                    8.458253
Albania                 106.001058
Arab World               15.307283
United Arab Emirates     19.305072
Argentina                11.618238
Armenia                 105.898033
American Samoa          220.245000
dtype: float64

**Note:**   
When handling larger datasets, the order in which methods are executed matters.   
Think for a moment about what `dataset[:10]` (or the equivalent `.head(10)`) does: it returns only the first 10 rows of your dataset, drastically cutting down the input to the `.median()` method.   
This has a noticeable impact on more memory-intensive calculations, so keep an eye on the order.
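
The effect of the ordering can be sketched with randomly generated data (the array size and the seed are arbitrary assumptions). Both orderings give the same values, but the second one computes the median for every row before throwing most of the results away.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the made-up data is reproducible
toy = pd.DataFrame(rng.random((1000, 50)))

# Slice first: the median is computed for 10 rows only
slice_then_reduce = toy[:10].median(axis=1)

# Reduce first: the median is computed for all 1000 rows, then 990 results are discarded
reduce_then_slice = toy.median(axis=1).head(10)

print(np.allclose(slice_then_reduce, reduce_then_slice))
```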

---

#### Variance

In [13]:
# calculate the variance of the last 5 columns
dataset.iloc[:, -5:].var()

2012    3.063475e+06
2013    3.094597e+06
2014    3.157111e+06
2015    3.220634e+06
2016             NaN
dtype: float64
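
One detail worth knowing when carrying NumPy habits over to `var`: Pandas computes the sample variance (`ddof=1`) by default, whereas NumPy's `np.var` defaults to the population variance (`ddof=0`). The sketch below uses a made-up Series to show the difference.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up values

sample_var = s.var()                     # pandas default: ddof=1 (divide by n - 1)
population_var = np.var(s.to_numpy())    # NumPy default: ddof=0 (divide by n)
matched = np.var(s.to_numpy(), ddof=1)   # NumPy with ddof=1 matches pandas

print(sample_var, population_var, matched)
```

With `ddof=1`, a column that has fewer than two non-missing values yields `NaN`, which would be consistent with the `2016` result above.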

---

As mentioned in the introduction to Pandas, it is interoperable with several of NumPy's features.   
Here's an example of how to use NumPy's `mean` function with a column of a Pandas DataFrame.

In [14]:
# NumPy Pandas interoperability
import numpy as np

print("Pandas", dataset["2015"].mean())
print("NumPy", np.mean(dataset["2015"]))

Pandas 368.70660104001837
NumPy 368.70660104001837
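
The two results agree because `np.mean` hands the call over to the Series' own `mean` method, which skips missing values by default. The sketch below, which uses a made-up Series with one `NaN`, shows that this no longer holds once the data is converted to a plain NumPy array.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # made-up values with one missing entry

pandas_mean = s.mean()               # skips NaN by default (skipna=True)
via_numpy = np.mean(s)               # NumPy delegates to Series.mean
raw_mean = np.mean(s.to_numpy())     # plain ndarray: NaN propagates
nan_mean = np.nanmean(s.to_numpy())  # NaN-aware NumPy variant

print(pandas_mean, via_numpy, raw_mean, nan_mean)
```

When working on the underlying array directly, `np.nanmean` is the function that mirrors the Pandas default.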
