# What is Pandas?

Think of the capabilities of [R](https://www.r-project.org/about.html), but using the Python programming language.

It is open source, fast, flexible, easy to learn and easy to use, and it has become popular in the field of data analysis.


# 1. Numpy

Pandas is a high level library built on top of [Numpy](http://www.numpy.org), a fundamental package for scientific computing with Python that enables powerful and fast array operations, amongst many other functions.

## Arrays
Python already has a native list type, so why do we still need numpy? Numpy is feature rich and much, much faster.

In [19]:
import numpy as np
from numpy.random import random

In [22]:
# get the same data in two data types
a = random(1000)
b = list(a)
print('a:', type(a), 'b:', type(b))

('a:', <type 'numpy.ndarray'>, 'b:', <type 'list'>)


In [23]:
%%timeit 
a.sum()

The slowest run took 17.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.99 µs per loop


In [24]:
%%timeit
b = list(a)
b_sum = 0
while b:
    b_sum += b.pop()
b_sum

1000 loops, best of 3: 247 µs per loop
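The loop above also pays for `list.pop`, so it overstates the gap somewhat. A fairer baseline is Python's built-in `sum`. The sketch below uses the standard `timeit` module so it also runs outside a notebook; the names `arr` and `lst` are illustrative, and the timings depend on the machine, so no expected numbers are given.

```python
import timeit

import numpy as np

# same data in two containers (seeded so the run is repeatable)
rng = np.random.default_rng(0)
arr = rng.random(1000)
lst = arr.tolist()

# time 10,000 calls of each summation
t_numpy = timeit.timeit(arr.sum, number=10_000)
t_builtin = timeit.timeit(lambda: sum(lst), number=10_000)

# both approaches should agree on the result
total_numpy = float(arr.sum())
total_builtin = sum(lst)

print(f"ndarray.sum:  {t_numpy:.4f} s")
print(f"built-in sum: {t_builtin:.4f} s")
```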


For example, a common task is to convert a raster image to a numpy array so that it can be manipulated outside a GIS, which gives a faster and more flexible way to produce statistics.

Imagine the following hypothetical array of values representing a 100 by 10 image (each pixel at 30m x 30m resolution) of percentage tree cover, i.e. a pixel value of 0.43 means that the pixel has 43% forest cover. The questions are:

- if a forest pixel is defined as having more than 25% tree cover (FAO definition), what is the total area of forest in the image?
- what if the definition threshold is reduced to 10%?

In [50]:
# hypothetical image
a = random(1000).reshape(100,10)

In [51]:
a

array([[  8.83769679e-01,   9.26215574e-01,   4.50550809e-01,
          7.03745422e-01,   4.46250466e-01,   8.59446037e-01,
          9.12523341e-01,   1.41261740e-01,   5.89192572e-01,
          1.99691297e-01],
       [  6.41161587e-01,   2.23608837e-01,   1.73501249e-01,
          9.38980751e-01,   1.69910262e-01,   3.31894044e-02,
          2.74533340e-01,   5.60526341e-02,   4.92440788e-01,
          5.91992546e-01],
       [  1.79202182e-01,   9.86353706e-01,   3.32188560e-01,
          5.97397890e-01,   7.37840515e-02,   5.99908115e-01,
          4.65156838e-01,   1.56560321e-01,   6.58225559e-01,
          5.17452156e-01],
       [  5.90050326e-01,   4.74516988e-01,   5.19325797e-02,
          7.65620390e-01,   8.52004787e-01,   2.42699347e-01,
          9.11498986e-01,   1.01560752e-01,   8.94633656e-01,
          9.61269560e-01],
       [  9.03097569e-01,   2.36948280e-01,   3.94699727e-01,
          4.57211381e-01,   1.57437253e-01,   6.61111654e-01,
          8.75730189e-01

In [66]:
a.size

1000

You could use a loop, but it is even simpler to use a boolean index.

In [128]:
# boolean array values are treated as: True -> 1, False -> 0
np.array([True, False, True]).sum()

2
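To make the comparison with a loop concrete, here is a minimal sketch that counts pixels above the threshold both ways and checks that they agree. The array `cover` is a seeded stand-in for the hypothetical image, not the array shown above.

```python
import numpy as np

# seeded stand-in for the hypothetical 100 x 10 tree cover image
rng = np.random.default_rng(42)
cover = rng.random((100, 10))

# explicit loop: visit every pixel and count those above the threshold
loop_count = 0
for row in cover:
    for value in row:
        if value > 0.25:
            loop_count += 1

# vectorised: build a boolean array, then sum it (True counts as 1)
vector_count = int((cover > 0.25).sum())

print(loop_count == vector_count)
```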

In [56]:
a>0.25

array([[ True,  True,  True,  True,  True,  True,  True, False,  True,
        False],
       [ True, False, False,  True, False, False,  True, False,  True,
         True],
       [False,  True,  True,  True, False,  True,  True, False,  True,
         True],
       [ True,  True, False,  True,  True, False,  True, False,  True,
         True],
       [ True, False,  True,  True, False,  True,  True, False, False,
         True],
       [False,  True,  True,  True,  True,  True,  True,  True,  True,
        False],
       [ True,  True,  True,  True,  True, False,  True,  True,  True,
         True],
       [ True, False,  True,  True,  True,  True,  True,  True, False,
        False],
       [False,  True,  True, False,  True,  True,  True,  True,  True,
        False],
       [ True,  True,  True, False,  True, False,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True, False,  True, False, 

In [129]:
# thus count all forest pixels without a loop
(a>0.25).sum()

742

In [55]:
(a>0.25).sum() * 900 / 1000000.0

0.66779999999999995

Multiplying the number of forest cells by the cell size (30m x 30m = 900 m²) gives the total forest area, and dividing by 1,000,000 converts it to km².
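The same calculation can be wrapped in a small function so that the second question (a 10% threshold) only needs a different argument. This is a sketch: `forest_area_km2` is a hypothetical helper name, and `cover` is a seeded stand-in for the image, so the printed areas will differ from the figure above.

```python
import numpy as np


def forest_area_km2(cover, threshold, pixel_size_m=30):
    """Total area in km² of the pixels whose tree cover exceeds `threshold`."""
    n_forest = int((cover > threshold).sum())
    return n_forest * pixel_size_m ** 2 / 1e6


# seeded stand-in for the hypothetical 100 x 10 image
rng = np.random.default_rng(0)
cover = rng.random((100, 10))

area_25 = forest_area_km2(cover, 0.25)
area_10 = forest_area_km2(cover, 0.10)
print("forest area at 25% threshold (km2):", area_25)
print("forest area at 10% threshold (km2):", area_10)
```

Lowering the threshold can only add pixels, so the area at 10% is never smaller than the area at 25%.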

## Random module

Let's consider another example to illustrate the powerful `numpy.random` module: a simple random walk starting from 0 with steps of 1 and -1 occurring with equal probability. Below we simulate 5000 walks of 1000 steps each, i.e., each row in the matrix represents one walk of 1000 steps.

![axis](./axis.jpg)

In [77]:
# shape of the array
nwalks = 5000
nsteps = 1000
# get random integers, either 0 or 1 (the upper bound 2 is exclusive)
draws = np.random.randint(0, 2, size=(nwalks, nsteps))
# reclass values: 0 -> -1, 1 -> 1
steps = np.where(draws > 0, 1, -1)

In [83]:
steps

array([[ 1,  1,  1, ...,  1,  1,  1],
       [-1, -1,  1, ...,  1, -1,  1],
       [ 1, -1,  1, ...,  1,  1,  1],
       ..., 
       [-1,  1, -1, ...,  1,  1, -1],
       [-1,  1, -1, ..., -1, -1,  1],
       [-1,  1, -1, ...,  1,  1,  1]])

In [107]:
# cumulative sum along axis 1 of the two-dimensional array (left to right within each row)
walks = steps.cumsum(1)

In [108]:
walks

array([[  1,   2,   3, ..., -64, -63, -62],
       [ -1,  -2,  -1, ..., -10, -11, -10],
       [  1,   0,   1, ..., -12, -11, -10],
       ..., 
       [ -1,   0,  -1, ...,  34,  35,  34],
       [ -1,   0,  -1, ..., -72, -73, -72],
       [ -1,   0,  -1, ...,   0,   1,   2]])

The `np.ndarray.any` method returns *True* if any of its elements evaluates to *True*. This can be used to find the walks that were, at any point, more than 30 steps away from the starting point.

In [113]:
walks.any?

In [118]:
(np.abs(walks)>30).any(1).sum()

3221

Out of 5000 walks, 3221 moved more than 30 steps away from the starting point at some stage.
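The role of the axis argument is easier to see on a tiny array. The sketch below repeats the same three operations (`cumsum`, comparison, `any`) on two hand-written walks of three steps, with a distance of 2 standing in for 30.

```python
import numpy as np

# two walks (rows) of three steps (columns)
steps = np.array([[1, 1, -1],
                  [-1, -1, -1]])

# position after each step, accumulated left to right within each row
walks = steps.cumsum(axis=1)       # [[ 1,  2,  1], [-1, -2, -3]]

# True wherever the walker is more than 2 steps from the start
far = np.abs(walks) > 2            # [[F, F, F], [F, F, T]]

# one value per walk: did it ever get that far?
reached = far.any(axis=1)          # [False, True]

print(reached.sum())               # number of walks that got that far: 1
```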

# 2. Pandas

Unlike numpy, Pandas contains higher-level data structures and manipulation tools designed to make data analysis fast and easy in Python, including: data I/O, cleaning, manipulation, visualisation and much more.

In [139]:
#!pip install pandas

In [140]:
%matplotlib inline
import pandas as pd

## Load data and export result

Pandas includes handy functions to handle common data formats. In this section, we will be using data on protected areas and biogeographical classifications in a CSV table. For each ecoregion, the table gives its higher classes of biome and realm, as well as the size of the ecoregion and how much of it is covered by protected areas.


In [142]:
a = pd.read_csv('ecoregion.csv')

A Pandas dataframe is very similar to a dataframe in R.

In [146]:
a.head(5)

| | id | ecoregion_name | biome | realm | ecoregion_area | area_protected | percent_protected | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60141 | Monte Alegre varzeá | Tropical & Subtropical Moist Broadleaf Forests | Neotropics | 66947.38501 | 15338.98079 | 22.9% | | All 4 Antarctic ecoregions were excluded from ... | | | | | | |
| 1 | 70104 | Eastern Micronesia tropical moist forests | Tropical & Subtropical Moist Broadleaf Forests | Oceania | 533.660211 | 40.425363 | 7.6% | | | | | | | | |
| 2 | 30126 | Northwestern Congolian lowland forests | Tropical & Subtropical Moist Broadleaf Forests | Afrotropics | 435086.2309 | 91600.35292 | 21.1% | | | | | | | | |
| 3 | 30908 | Zambezian halophytics | Flooded Grasslands & Savannas | Afrotropics | 30438.54771 | 6356.705778 | 20.9% | | | | | | | | |
| 4 | 51307 | Meseta Central matorral | Deserts & Xeric Shrublands | Nearctic | 125544.3385 | 2546.763524 | 2.0% | | | | | | | | |


In [152]:
# unclean data
a.columns[7:]

Index([u'Unnamed: 7', u'Unnamed: 8', u'Unnamed: 9', u'Unnamed: 10',
       u'Unnamed: 11', u'Unnamed: 12', u'Unnamed: 13', u'Unnamed: 14'],
      dtype='object')

In [171]:
# get rid of the columns that hold no data or stray artefacts
b = a.drop(a.columns[7:], axis=1)
b.head(5)

| | id | ecoregion_name | biome | realm | ecoregion_area | area_protected | percent_protected |
|---|---|---|---|---|---|---|---|
| 0 | 60141 | Monte Alegre varzeá | Tropical & Subtropical Moist Broadleaf Forests | Neotropics | 66947.38501 | 15338.98079 | 22.9% |
| 1 | 70104 | Eastern Micronesia tropical moist forests | Tropical & Subtropical Moist Broadleaf Forests | Oceania | 533.660211 | 40.425363 | 7.6% |
| 2 | 30126 | Northwestern Congolian lowland forests | Tropical & Subtropical Moist Broadleaf Forests | Afrotropics | 435086.2309 | 91600.35292 | 21.1% |
| 3 | 30908 | Zambezian halophytics | Flooded Grasslands & Savannas | Afrotropics | 30438.54771 | 6356.705778 | 20.9% |
| 4 | 51307 | Meseta Central matorral | Deserts & Xeric Shrublands | Nearctic | 125544.3385 | 2546.763524 | 2.0% |
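Dropping by position (`a.columns[7:]`) works here but breaks if the column order changes. An alternative is to select columns by name. The sketch below is an assumption-laden illustration: it builds a small made-up CSV in memory whose header has two empty fields, which `read_csv` labels `Unnamed: 3` and `Unnamed: 4`, and then keeps only the columns whose name does not start with `Unnamed`.

```python
import io

import pandas as pd

# made-up CSV with two header-less trailing columns, one holding a stray note
csv_text = (
    "id,name,area,,\n"
    "1,A,10.5,,a stray note\n"
    "2,B,3.2,,\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# keep only the columns that have a real name
keep = ~raw.columns.str.startswith("Unnamed")
clean = raw.loc[:, keep]

print(list(raw.columns))
print(list(clean.columns))
```

Note that `dropna(axis=1, how='all')` would not be enough in a case like this, because one of the unnamed columns contains a note in its first row.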


When you finish, you can export your dataframe to a CSV file on disk.

In [168]:
# don't save index as an additional column
b.to_csv('ecoregion_clean.csv', index=False)

In [170]:
c = pd.read_csv('ecoregion_clean.csv')
c.head(5)

| | id | ecoregion_name | biome | realm | ecoregion_area | area_protected | percent_protected |
|---|---|---|---|---|---|---|---|
| 0 | 60141 | Monte Alegre varzeá | Tropical & Subtropical Moist Broadleaf Forests | Neotropics | 66947.38501 | 15338.98079 | 22.9% |
| 1 | 70104 | Eastern Micronesia tropical moist forests | Tropical & Subtropical Moist Broadleaf Forests | Oceania | 533.660211 | 40.425363 | 7.6% |
| 2 | 30126 | Northwestern Congolian lowland forests | Tropical & Subtropical Moist Broadleaf Forests | Afrotropics | 435086.2309 | 91600.35292 | 21.1% |
| 3 | 30908 | Zambezian halophytics | Flooded Grasslands & Savannas | Afrotropics | 30438.54771 | 6356.705778 | 20.9% |
| 4 | 51307 | Meseta Central matorral | Deserts & Xeric Shrublands | Nearctic | 125544.3385 | 2546.763524 | 2.0% |
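The effect of `index=False` can be checked without touching the disk: when no path is given, `to_csv` returns the CSV text as a string. A minimal sketch on a made-up two-row frame (the column names are illustrative only):

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})

# with no path, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty field
print(without_index.splitlines()[0])  # header holds only the real columns

# reading the first version back turns the saved index into an extra column
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```

Saving the index and reading the file back is a common source of stray `Unnamed: 0` columns.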


## Explore data

Is the field `percent_protected` really `area_protected` divided by `ecoregion_area`? It is good practice not to assume that the data you receive is of good quality, even if it appears so. Run a few integrity checks: for example, does your data contain duplicate rows? Any missing values? Any obvious errors, such as a proportion that exceeds 1?

In [182]:
# check data types
c.dtypes

id                     int64
ecoregion_name        object
biome                 object
realm                 object
ecoregion_area       float64
area_protected       float64
percent_protected     object
dtype: object

The field `percent_protected` is a string object, so it needs to be converted to a numeric value before any validation checks can be run.

In [187]:
c['percent_protected_numeric'] = c.percent_protected.replace('%', '', regex=True).astype(float)/100
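If some entries might not be valid percentages, `astype(float)` raises an error on the first bad value. One alternative, sketched here on a made-up series rather than the ecoregion table, is to strip the `%` sign with the `.str` accessor and let `pd.to_numeric` turn anything unparseable into `NaN`, so the bad rows can be inspected afterwards.

```python
import pandas as pd

# made-up percentage strings, including one that cannot be parsed
s = pd.Series(["22.9%", "7.6%", "0.0%", "n/a"])

# strip the trailing % sign, then convert; invalid entries become NaN
numeric = pd.to_numeric(s.str.rstrip("%"), errors="coerce") / 100

print(numeric)
print("unparseable entries:", int(numeric.isna().sum()))
```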

In [189]:
# calculate the percentage from raw areas
c['percent_protected_calculate'] = c.area_protected / c.ecoregion_area

In [192]:
# maximum value of percentage
c.percent_protected_calculate.max()

1.0

In [194]:
# maximum difference between the percentage in the original data and the one just calculated
np.abs(c.percent_protected_numeric - c.percent_protected_calculate).max()

0.00049998846934889984
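A maximum difference just under 0.0005 is what one would expect if the original `percent_protected` values were simply rounded to one decimal place of a percent (0.001 as a proportion), since rounding moves a value by at most half of the last digit. A sketch with made-up numbers, not values from the ecoregion table:

```python
import numpy as np

# made-up proportions and their versions rounded to three decimals
calculated = np.array([0.229123, 0.075751, 0.210532])
reported = np.round(calculated, 3)

# rounding to three decimals can shift a value by at most 0.0005
max_diff = np.abs(reported - calculated).max()
print(max_diff <= 0.0005)

# the same check as a single call with an absolute tolerance
print(np.allclose(reported, calculated, atol=0.0005, rtol=0))
```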

In [199]:
# check that every ecoregion name is unique
c.ecoregion_name.unique().size == c.index.size

True
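The other integrity checks mentioned earlier (duplicate rows, missing values, impossible values) are each a one-liner. The sketch below runs them on a small made-up frame that reuses the column names of the ecoregion table; the values are invented so that each check finds something.

```python
import numpy as np
import pandas as pd

# made-up rows: one exact duplicate, one missing area, protected > total area
df = pd.DataFrame({
    "ecoregion_name": ["A", "B", "B", "C"],
    "ecoregion_area": [100.0, 50.0, 50.0, np.nan],
    "area_protected": [10.0, 60.0, 60.0, 5.0],
})

# rows that repeat an earlier row exactly
n_duplicates = int(df.duplicated().sum())

# number of missing values in each column
missing = df.isna().sum()

# rows where the protected area exceeds the ecoregion area
over = df[df.area_protected > df.ecoregion_area]

print("duplicate rows:", n_duplicates)
print(missing)
print(over)
```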

In [200]:
# Summary statistics
c.describe()

| | id | ecoregion_area | area_protected | percent_protected_numeric | percent_protected_calculate |
|---|---|---|---|---|---|
| count | 821.0 | 821.0 | 821.0 | 821.0 | 821.0 |
| mean | 52465.563946 | 161028.5 | 23043.785198 | 0.207116 | 0.207114 |
| std | 21894.001795 | 328723.0 | 44414.061076 | 0.217698 | 0.217688 |
| min | 10101.0 | 6.082918 | 0.0 | 0.0 | 0.0 |
| 25% | 40109.0 | 16155.46 | 1473.016376 | 0.053 | 0.052608 |
| 50% | 51303.0 | 62516.34 | 7409.636151 | 0.128 | 0.128252 |
| 75% | 70116.0 | 184592.8 | 23922.09296 | 0.284 | 0.28393 |
| max | 81333.0 | 4650158.0 | 342909.9318 | 1.0 | 1.0 |


Often it is of interest to select and focus on a subset of the dataframe, group rows by common values in a field (as in a pivot table), find lows, highs and quantiles, and quickly make a graph or map to explore their relationships visually.

In [202]:
# how many ecoregions have less than 5% protection
(c.percent_protected_calculate < 0.05).sum()

194

In [226]:
# which ecoregions have the lowest protection
c.sort_values?

In [229]:
print(c.sort_values(by='percent_protected_calculate', ascending=False).tail(5))

        id                                    ecoregion_name  \
804  31304                           Eritrean coastal desert   
803  80420               Eastern Anatolian deciduous forests   
802  31321            Southwestern Arabian montane woodlands   
801  31319                    Somali montane xeric woodlands   
820  80515  Northern Anatolian conifer and deciduous forests   

                                   biome        realm  ecoregion_area  \
804           Deserts & Xeric Shrublands  Afrotropics     4604.696113   
803  Temperate Broadleaf & Mixed Forests   Palearctic    81747.971930   
802           Deserts & Xeric Shrublands  Afrotropics    87101.536660   
801           Deserts & Xeric Shrublands  Afrotropics    62774.470050   
820            Temperate Conifer Forests   Palearctic   101519.257900   

     area_protected percent_protected  percent_protected_numeric  \
804             0.0              0.0%                        0.0   
803             0.0              0.0%   
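Sorting in descending order and taking the tail works, but the intent is clearer with a boolean filter for a subset or `nsmallest` for the lowest rows. A minimal sketch on a made-up frame (the names `name` and `pct` are illustrative, not columns of the ecoregion table):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "pct": [0.30, 0.00, 0.12, 0.05, 0.90],
})

# subset: rows with less than 5% protection
below_5 = df[df["pct"] < 0.05]

# the two rows with the lowest values, smallest first
lowest = df.nsmallest(2, "pct")

print(below_5)
print(lowest)
```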

In [230]:
# how does the result look for biomes
c.groupby('biome').ecoregion_area.sum()

biome
Boreal Forests/Taiga                                        1.507795e+07
Deserts & Xeric Shrublands                                  2.798464e+07
Flooded Grasslands & Savannas                               1.096130e+06
Mangroves                                                   3.485189e+05
Mediterranean Forests, Woodlands & Scrub                    3.227266e+06
Montane Grasslands & Shrublands                             5.203411e+06
Temperate Broadleaf & Mixed Forests                         1.283569e+07
Temperate Conifer Forests                                   4.087094e+06
Temperate Grasslands, Savannas & Shrublands                 1.010408e+07
Tropical & Subtropical Coniferous Forests                   7.126166e+05
Tropical & Subtropical Dry Broadleaf Forests                3.025997e+06
Tropical & Subtropical Grasslands, Savannas & Shrublands    2.029542e+07
Tropical & Subtropical Moist Broadleaf Forests              1.989415e+07
Tundra                                       

In [244]:
# sum both area columns per biome (use ? to look up what aggregate does)
d = c.groupby('biome').aggregate({'ecoregion_area': 'sum', 'area_protected': 'sum'})
d

| biome | ecoregion_area | area_protected |
|---|---|---|
| Boreal Forests/Taiga | 15077950.0 | 1431270.0 |
| Deserts & Xeric Shrublands | 27984640.0 | 2708582.0 |
| Flooded Grasslands & Savannas | 1096130.0 | 315841.0 |
| Mangroves | 348518.9 | 102152.8 |
| Mediterranean Forests, Woodlands & Scrub | 3227266.0 | 511611.8 |
| Montane Grasslands & Shrublands | 5203411.0 | 1414551.0 |
| Temperate Broadleaf & Mixed Forests | 12835690.0 | 1571072.0 |
| Temperate Conifer Forests | 4087094.0 | 694419.4 |
| Temperate Grasslands, Savannas & Shrublands | 10104080.0 | 465254.6 |
| Tropical & Subtropical Coniferous Forests | 712616.6 | 87089.82 |


In [250]:
# create a new field in the dataframe to hold the percentage protected value
d['percent'] = d.area_protected/d.ecoregion_area

In [249]:
d.sort_values('percent', ascending=True)

| biome | ecoregion_area | area_protected | percent |
|---|---|---|---|
| Temperate Grasslands, Savannas & Shrublands | 10104080.0 | 465254.6 | 0.046046 |
| Boreal Forests/Taiga | 15077950.0 | 1431270.0 | 0.094925 |
| Deserts & Xeric Shrublands | 27984640.0 | 2708582.0 | 0.096788 |
| Tropical & Subtropical Dry Broadleaf Forests | 3025997.0 | 303687.9 | 0.10036 |
| Tropical & Subtropical Coniferous Forests | 712616.6 | 87089.82 | 0.122211 |
| Temperate Broadleaf & Mixed Forests | 12835690.0 | 1571072.0 | 0.122399 |
| Tropical & Subtropical Grasslands, Savannas & Shrublands | 20295420.0 | 3079894.0 | 0.151753 |
| Mediterranean Forests, Woodlands & Scrub | 3227266.0 | 511611.8 | 0.158528 |
| Temperate Conifer Forests | 4087094.0 | 694419.4 | 0.169905 |
| Tundra | 8311402.0 | 1586965.0 | 0.190938 |
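Summing the areas first and dividing afterwards gives an area-weighted percentage per group. Averaging the per-ecoregion percentages would answer a different question, because small and large ecoregions would count equally. The sketch below shows the difference on a made-up frame with two groups; the group labels `X` and `Y` and all values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "biome": ["X", "X", "Y"],
    "ecoregion_area": [100.0, 900.0, 50.0],
    "area_protected": [50.0, 90.0, 5.0],
})

# area-weighted: sum the areas per group, then divide
totals = df.groupby("biome")[["ecoregion_area", "area_protected"]].sum()
totals["percent"] = totals.area_protected / totals.ecoregion_area

# unweighted: divide per row, then average per group
df["percent"] = df.area_protected / df.ecoregion_area
mean_of_percent = df.groupby("biome").percent.mean()

print(totals["percent"])
print(mean_of_percent)
```

For group `X` the weighted value is 140 / 1000 = 0.14, while the unweighted mean of 0.5 and 0.1 is 0.3.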


# Exercise

## Random walks using a different distribution

Instead of steps of either 1 or -1, draw the steps from a normal distribution. HINT:
```python
steps = np.random.normal(mean, sd, size)
```

In [135]:
## np.random.normal(0, 1, (5000, 1000))

## Biogeographic realms protected

Instead of using ecoregion and biome, group the data by realm: which realm has the least protection in terms of protected area coverage?