## Follow along in this notebook to see how we got the diversity index numbers

* https://en.wikipedia.org/wiki/Diversity_index
* https://en.wikipedia.org/wiki/Gini_coefficient

### We will first import any packages needed

In [49]:
import pandas as pd 
import numpy as np

### We can now load in and preview the data

In [50]:
cfh = 'F:\\Research\\Funded\\UIREEJ\\Data\\Catch.pkl'
catch = pd.read_pickle(cfh)
catch[['Fish', 'Site', '_Site', '_Species']].head(15)

,Fish,Site,_Site,_Species
0,1.0,2.0,South Up-Stream,Smallmouth bass
1,2.0,2.0,South Up-Stream,Smallmouth bass
2,3.0,2.0,South Up-Stream,Smallmouth bass
3,4.0,2.0,South Up-Stream,Rock bass
4,5.0,2.0,South Up-Stream,Rock bass
5,6.0,2.0,South Up-Stream,Bluegill
6,7.0,2.0,South Up-Stream,Bluegill
7,8.0,2.0,South Up-Stream,Bluegill
8,9.0,2.0,South Up-Stream,Green sunfish
9,10.0,2.0,South Up-Stream,Green sunfish


### Remove Unknown Species and Pool Sites into Strata

In [51]:
# Drop fish whose species could not be identified
unknown = catch['_Species'] == 'Unknown'
catch = catch[~unknown]

# Pool the four sites into two strata on a copy, leaving `catch` unchanged.
# Assigning the result back avoids chained in-place replacement, which
# newer pandas versions may silently ignore.
catch2 = catch.copy()
catch2['_Site'] = catch2['_Site'].replace({
    'North Down-Stream': 'Down-Stream',
    'South Down-Stream': 'Down-Stream',
    'South Up-Stream': 'Up-Stream',
    'North Up-Stream': 'Up-Stream',
})

### Site Diversity

#### Shannon index

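* $ \displaystyle H' = - \sum_{s=1}^{S} p_{s} \log_{10}{(p_{s})} $, where $ p_{s} $ is the proportion of individuals belonging to species $ s $ and $ S $ is the number of species.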

* The Shannon entropy quantifies the uncertainty (entropy, or degree of surprise) in predicting the species of an individual drawn at random from the catch.
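
As a quick sanity check on the formula, here is a minimal sketch on toy data. The helper name `shannon_base10` and the species labels are hypothetical and not part of the catch dataset. An evenly split sample should give $ \log_{10}(S) $, a single-species sample should give 0, and a sample dominated by one species should fall in between.

```python
import numpy as np
import pandas as pd

def shannon_base10(species):
    """Base-10 Shannon index of a Series of species labels (illustrative helper)."""
    counts = species.dropna().value_counts()
    p = counts / counts.sum()
    return float(-(p * np.log10(p)).sum())

# Toy samples with made-up labels
even = pd.Series(["A", "B", "C", "D"])            # four species, equally common
skewed = pd.Series(["A"] * 97 + ["B", "C", "D"])  # one species dominates
single = pd.Series(["A"] * 5)                     # only one species present

h_even = shannon_base10(even)
h_skewed = shannon_base10(skewed)
h_single = shannon_base10(single)

print(h_even, h_skewed, h_single)
```

Because the index uses base-10 logarithms, its maximum for $ S $ species is $ \log_{10}(S) $, so values are only comparable between sites when read against the number of species present.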

In [52]:
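def shannon(x):
    # Count individuals per species, ignoring missing labels
    counts = x.dropna().value_counts()
    # Relative abundance p_s of each species
    p = counts / counts.sum()
    # H' = -sum(p_s * log10(p_s))
    return -(p * np.log10(p)).sum()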

shannon_all = pd.DataFrame(
    catch.groupby(by='_Site')['_Species'].apply(
    lambda x: shannon(x)
    ))
shannon_all.columns = ["Shannon Index/Entropy"]
shannon_all

_Site,Shannon Index/Entropy
North Down-Stream,0.547209
North Up-Stream,0.706463
South Down-Stream,0.864131
South Up-Stream,0.78247


In [53]:
shannon_strata = pd.DataFrame(
    catch2.groupby(by='_Site')['_Species'].apply(
    lambda x: shannon(x)
    ))
shannon_strata.columns = ["Shannon Index/Entropy"]
shannon_strata

_Site,Shannon Index/Entropy
Down-Stream,0.753399
Up-Stream,0.771797


#### Gini–Simpson index

* $ \displaystyle D = 1 - \frac {\sum_{s=1}^{S} n_{s}(n_{s}-1)} {N(N-1)} $ 

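* $ D $ represents the probability that two individuals taken at random from the dataset of interest (without replacement) belong to different species. Here $ n_{s} $ is the number of individuals of species $ s $ and $ N $ is the total number of individuals.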
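
One way to check the formula is to enumerate every unordered pair of distinct individuals in a small sample and count the fraction of pairs that differ in species. The sketch below does this on a made-up six-fish sample (the sample itself is an assumption for illustration, not real catch data) and compares the result with the closed-form expression.

```python
from itertools import combinations

import pandas as pd

# Hypothetical six-fish sample: 3 + 2 + 1 individuals
sample = pd.Series(["Bluegill"] * 3 + ["Rock bass"] * 2 + ["Green sunfish"])

# Closed form: D = 1 - sum(n_s * (n_s - 1)) / (N * (N - 1))
counts = sample.value_counts()
N = counts.sum()
d_formula = 1 - (counts * (counts - 1)).sum() / (N * (N - 1))

# Brute force: fraction of unordered pairs whose species differ
pairs = list(combinations(sample, 2))
d_pairs = sum(a != b for a, b in pairs) / len(pairs)

print(d_formula, d_pairs)
```

Of the 15 possible pairs, 11 contain two different species, so both approaches give 11/15.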

In [54]:
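def simpson(x):
    # Count individuals per species (value_counts drops missing labels by default)
    counts = x.value_counts()
    total = counts.sum()
    # D = 1 - sum(n_s * (n_s - 1)) / (N * (N - 1))
    numer = (counts * (counts - 1)).sum()
    denom = total * (total - 1)
    return 1 - numer / denom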

simpson_all = pd.DataFrame(
    catch.groupby(by='_Site')['_Species'].apply(
    lambda x: simpson(x)
    ))
simpson_all.columns = ["Gini-Simpson Index"]
simpson_all

_Site,Gini-Simpson Index
North Down-Stream,0.481665
North Up-Stream,0.71594
South Down-Stream,0.768645
South Up-Stream,0.764182


In [55]:
simpson_strata = pd.DataFrame(
    catch2.groupby(by='_Site')['_Species'].apply(
    lambda x: simpson(x)
    ))
simpson_strata.columns = ["Gini-Simpson Index"]
simpson_strata

_Site,Gini-Simpson Index
Down-Stream,0.670217
Up-Stream,0.75084


#### Gini Coefficient

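$ \displaystyle G = \frac{ \sum_{i=1}^{n} \sum_{j=1}^{n} | x_{i} - x_{j} | }{ 2n \sum_{j=1}^{n} x_{j} } = \frac{ \sum_{i=1}^{n} \lVert \mathbf{X} - x_{i} \rVert_{1} }{ 2n \sum_{j=1}^{n} x_{j} } $

Note: $ x_{i} $ is the number of individuals of species $ i $, $ n $ is the number of species, and $ \mathbf{X} = (x_{1}, \dots, x_{n}) $ is the vector of species counts, so that $ \lVert \mathbf{X} - x_{i} \rVert_{1} = \sum_{j=1}^{n} | x_{j} - x_{i} | $. A value of 0 means all species are equally abundant; values closer to 1 mean the catch is dominated by a few species.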
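
The double sum in the formula above can also be computed without an explicit loop by using NumPy broadcasting. This is a minimal sketch on made-up count vectors; the function name `gini_vectorized` is hypothetical and is meant only as a cross-check on the definition.

```python
import numpy as np

def gini_vectorized(counts):
    """Gini coefficient of a vector of per-species counts (illustrative helper)."""
    x = np.asarray(counts, dtype=float)
    # n x n matrix of |x_i - x_j| for every ordered pair of species
    diffs = np.abs(x[:, None] - x[None, :])
    return diffs.sum() / (2 * x.size * x.sum())

g_equal = gini_vectorized([5, 5, 5, 5])       # all species equally abundant
g_toy = gini_vectorized([1, 2, 3])            # mild inequality
g_dominated = gini_vectorized([97, 1, 1, 1])  # one species dominates

print(g_equal, g_toy, g_dominated)
```

For the counts `[1, 2, 3]` the ordered pairwise differences sum to 8 and the denominator is 2 · 3 · 6 = 36, so the coefficient is 8/36.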

In [56]:
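def gini(x):
    # Count individuals per species, ignoring missing labels
    counts = x.dropna().value_counts()
    n = len(counts)
    # Sum of |x_j - x_i| over all ordered pairs of species
    numer = 0
    for s in counts.index:
        numer += np.sum(np.abs(counts - counts[s]))
    return numer / (2 * n * counts.sum())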

gini_all = pd.DataFrame(
    catch.groupby(by='_Site')['_Species'].apply(
    lambda x: gini(x)
    ))
gini_all.columns = ['Gini Coefficient']
gini_all

_Site,Gini Coefficient
North Down-Stream,0.853904
North Up-Stream,0.726061
South Down-Stream,0.784058
South Up-Stream,0.687587


In [57]:
gini_strata = pd.DataFrame(
    catch2.groupby(by='_Site')['_Species'].apply(
    lambda x: gini(x)
    ))
gini_strata.columns = ['Gini Coefficient']
gini_strata

_Site,Gini Coefficient
Down-Stream,0.837343
Up-Stream,0.754788
