## Exercise 02: Loading a sample dataset and calculating the mean.

In this exercise, we will load the `world_population.csv` dataset and calculate the mean of some rows and columns.   
Our dataset holds the yearly population density for each country, so we can use Pandas to get some quick and easy insights.

#### Loading our dataset

In [1]:
# importing the necessary dependencies
import pandas as pd

In [2]:
# loading the Dataset
dataset = pd.read_csv('./data/world_population.csv', index_col=0)

**Note:**   
`index_col` enables you to use any column as an index instead of the incrementing int column that gets added by default. In our case, we want column 0, which is the country names, as indices.
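
To make the effect of `index_col` concrete, here is a minimal sketch that reads the same small CSV twice. The CSV text, the country names, and the values are made up for illustration and are not taken from `world_population.csv`.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file (illustrative values only)
csv_text = """Country Name,Country Code,1961,1962
Aland,ALD,1.0,2.0
Borduria,BRD,3.0,4.0
"""

# Default: pandas adds an incrementing integer index
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (the country names) becomes the row index
named_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())         # [0, 1]
print(named_index.index.tolist())           # ['Aland', 'Borduria']
print(named_index.loc['Borduria', '1962'])  # 4.0
```

With the country names as the index, rows can be selected by name through `.loc` instead of by position.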

In [3]:
# looking at the dataset
# (a notebook cell renders only its last expression, so only tail(2) is shown below)
dataset
dataset.head(6)
dataset.tail(2)

| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zambia | ZMB | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 4.227724 | 4.359305 | 4.496824 | 4.639914 | 4.788452 | 4.942343 | ... | 17.135926 | 17.641587 | 18.170609 | 18.721585 | 19.294752 | 19.890745 | 20.508866 | 21.148177 | 21.80789 | NaN |
| Zimbabwe | ZWE | Population density (people per sq. km of land ... | EN.POP.DNST | NaN | 10.021037 | 10.356112 | 10.703901 | 11.062585 | 11.431128 | 11.809022 | ... | 34.374559 | 34.885516 | 35.46852 | 36.122262 | 36.850438 | 37.651498 | 38.511289 | 39.410249 | 40.332819 | NaN |
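
As a small sketch of how `head` and `tail` select rows, the following uses a hypothetical four-row frame (the name `toy` and its values are assumptions for illustration). Printing each result explicitly is one way to see all of them, since a notebook cell only renders the value of its last expression.

```python
import pandas as pd

# Hypothetical frame with four rows
toy = pd.DataFrame({'1961': [1.0, 2.0, 3.0, 4.0]}, index=['A', 'B', 'C', 'D'])

first_two = toy.head(2)   # first n rows
last_one = toy.tail(1)    # last n rows

print(first_two.index.tolist())  # ['A', 'B']
print(last_one.index.tolist())   # ['D']
print(toy.head().shape)          # (4, 1): head() defaults to 5 rows, but only 4 exist
```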


---

#### After loading our dataset

To get a quick overview of our dataset, we print its "shape".   
This gives us an output of the form (rows, columns).

In [4]:
# printing the shape of our dataset
dataset.shape

(264, 60)
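
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small hypothetical frame (`toy` is an assumed name) and also shows that the index is not counted as a column.

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 data columns, plus a named index
toy = pd.DataFrame({'1961': [1.0, 2.0, 3.0], '1962': [4.0, 5.0, 6.0]},
                   index=['A', 'B', 'C'])

n_rows, n_cols = toy.shape  # tuple of the form (rows, columns)
print(n_rows, n_cols)       # 3 2  (the index is not counted as a column)
```

For our dataset, this means the 60 columns do not include the country names used as the index.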

In [5]:
# calculating the mean for 1961 column
dataset['1961'].mean()

176.91514132840555

In [6]:
# calculating the mean for 2015 column
dataset['2015'].mean()

368.70660104001837

**Note:**   
Just by comparing the overall means of the two years, 1961 and 2015, we can already see that the mean population density **more than doubled** over this period.
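
The "more than doubled" claim can be checked with a quick division. The two values below are copied from the outputs of cells `In [5]` and `In [6]`; note that they are unweighted means over all rows of the dataset.

```python
# Means copied from the outputs of cells In [5] and In [6]
mean_1961 = 176.91514132840555
mean_2015 = 368.70660104001837

# Growth factor between the two years
ratio = mean_2015 / mean_1961
print(round(ratio, 2))  # 2.08
```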

In [7]:
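# mean for each country (row)
# numeric_only=True skips the text columns (Country Code, Indicator Name, Indicator Code);
# recent pandas versions may raise a TypeError without it
dataset.mean(axis=1, numeric_only=True)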

Country Name
Aruba               413.944949
Andorra             106.838839
Afghanistan          25.373379
Angola                9.649583
Albania              99.159197
                       ...    
Yemen, Rep.          24.702231
South Africa         28.599504
Congo, Dem. Rep.     16.661282
Zambia               11.055234
Zimbabwe             24.520532
Length: 264, dtype: float64

In [8]:
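# mean for each feature (col)
# numeric_only=True skips the text columns; recent pandas versions may raise a TypeError without it
dataset.mean(axis=0, numeric_only=True)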

1960           NaN
1961    176.915141
1962    180.703231
1963    184.572413
1964    188.461797
1965    192.412363
1966    196.145042
1967    200.118063
1968    203.879464
1969    207.336102
1970    210.607871
1971    213.489694
1972    215.998475
1973    218.438708
1974    220.621210
1975    223.046375
1976    224.960258
1977    227.006734
1978    229.187306
1979    232.510772
1980    236.185357
1981    240.789508
1982    246.175178
1983    251.342389
1984    256.647822
1985    261.680751
1986    266.647038
1987    271.768300
1988    276.813259
1989    281.850054
1990    286.062387
1991    288.292566
1992    293.305416
1993    297.759160
1994    302.275463
1995    304.537276
1996    309.714948
1997    313.896935
1998    320.405981
1999    324.004669
2000    327.270760
2001    312.259570
2002    313.269043
2003    315.847613
2004    317.746559
2005    322.669534
2006    326.907971
2007    331.995474
2008    338.688417
2009    343.649206
2010    347.967029
2011    351.942027
2012    357.

**Note:**   
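The `axis` parameter again controls the direction of the aggregation: `axis=0` aggregates down the rows and returns one mean per column, while `axis=1` aggregates across the columns and returns one mean per row.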
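
The two directions are easier to see on a frame small enough to check by hand. This is a minimal sketch with made-up values; `toy` and its contents are assumptions, and `numeric_only=True` is passed so that the text column is skipped.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
toy = pd.DataFrame(
    {'Country Code': ['AAA', 'BBB'],
     '1961': [10.0, 30.0],
     '1962': [20.0, 40.0]},
    index=['A', 'B'])

col_means = toy.mean(axis=0, numeric_only=True)  # one value per column
row_means = toy.mean(axis=1, numeric_only=True)  # one value per row

print(col_means.to_dict())  # {'1961': 20.0, '1962': 30.0}
print(row_means.to_dict())  # {'A': 15.0, 'B': 35.0}
```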

In [9]:
# calculating the mean without specifying an axis
dataset.mean(numeric_only=True)

1960           NaN
1961    176.915141
1962    180.703231
1963    184.572413
1964    188.461797
1965    192.412363
1966    196.145042
1967    200.118063
1968    203.879464
1969    207.336102
1970    210.607871
1971    213.489694
1972    215.998475
1973    218.438708
1974    220.621210
1975    223.046375
1976    224.960258
1977    227.006734
1978    229.187306
1979    232.510772
1980    236.185357
1981    240.789508
1982    246.175178
1983    251.342389
1984    256.647822
1985    261.680751
1986    266.647038
1987    271.768300
1988    276.813259
1989    281.850054
1990    286.062387
1991    288.292566
1992    293.305416
1993    297.759160
1994    302.275463
1995    304.537276
1996    309.714948
1997    313.896935
1998    320.405981
1999    324.004669
2000    327.270760
2001    312.259570
2002    313.269043
2003    315.847613
2004    317.746559
2005    322.669534
2006    326.907971
2007    331.995474
2008    338.688417
2009    343.649206
2010    347.967029
2011    351.942027
2012    357.

**Note:**   
If you compare the result of this last cell with the `# mean for each feature (col)` cell, you can see that they are identical: the default is `axis=0`, so calling `mean()` without an axis returns one mean per column rather than a single value for the whole matrix.
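
The following sketch confirms the default on a small hypothetical frame and shows one possible way to get a single mean over all values. Using `np.nanmean` on the underlying array is an assumption about what a "whole matrix" mean should do here: it ignores missing values, just as the per-column means do.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one missing value
toy = pd.DataFrame({'1961': [10.0, 30.0], '1962': [20.0, np.nan]},
                   index=['A', 'B'])

# mean() without arguments behaves like mean(axis=0)
same = toy.mean().equals(toy.mean(axis=0))
print(same)  # True

# A single mean over every non-missing value in the frame
overall = np.nanmean(toy.to_numpy())
print(overall)  # 20.0
```

Note that the overall mean is not the same as the mean of the column means whenever columns have different numbers of missing values.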

---

Using a real dataset with Pandas can already give us some quick and easy insights into our data.  
In this case, we can see that the mean population density rose in almost every year; in the output shown above, the only drop is between 2000 and 2001.
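
A trend like this can be checked programmatically instead of by eye. The sketch below copies five yearly means (1999 to 2003) from the `In [8]` output into a Series; the variable names are assumptions, and on the real data the same steps would be applied to the result of `dataset.mean(numeric_only=True)`.

```python
import pandas as pd

# Yearly means copied from the In [8] output (1999-2003 only)
yearly_mean = pd.Series({
    '1999': 324.004669,
    '2000': 327.270760,
    '2001': 312.259570,
    '2002': 313.269043,
    '2003': 315.847613,
})

# Year-over-year change; the first entry is NaN because it has no predecessor
change = yearly_mean.diff()

# Years in which the mean fell compared with the previous year
drops = change[change < 0]

print(yearly_mean.is_monotonic_increasing)  # False
print(drops.index.tolist())                 # ['2001']
```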