# Biodiversity for food and agriculture - Dec 2017

The notebook assesses the contribution of biodiversity for food and agriculture through an overlay analysis of IUCN's Red List of Threatened Species and the World Database on Protected Areas (WDPA).

## Discussions
- all species for food and agriculture to be included, regardless of their comprehensiveness

## Data, pre-processing and scripts

### Red List
- Presence: 1,2; Origin: 1,2; Seasonality: 1,2,3;??
- create index on `id_no`
- decisions to be deferred: RedList categories (threatened etc)

### Hexagon Grid
- an equal area global hex grid using the binary in [Discrete Global Grids for R](https://github.com/r-barnes/dggridR/)
- convert string ids to int ids
- create index on `id`
- additionally the grid has the following characteristics
>dggs_type ISEA3H 
>
>verbosity 1 
>
>dggs_res_spec 12
>
>densification 5
>
>max_cells_per_output_file 1000000
>
>cell_output_type SHAPEFILE
>
>cell_output_file_name ../output/hex_globe10k
>
>point_output_type SHAPEFILE
>
>point_output_file_name ../output/hex_globe_centre10k


### WDPA
- Nov 2017 release public version
- point data in the WDPA are **not** considered, for three reasons: a) the points do not represent the extent of PAs, leading to unquantifiable error; b) excluding them changes the result by less than 2% globally, with the areas potentially affected in Russia, Ukraine, Belarus, China and India; c) significant time is required to process them
- apart from the points, the pre-processing of the WDPA follows the same methodology used for calculating global statistics, which is described on [protectedplanet](https://www.protectedplanet.net/c/calculating-protected-area-coverage).

### Script
The multiprocessing template is adapted to utilise all computing cores. A similar script can be found on [GitHub](https://github.com/Yichuans/geoprocessing/blob/master/species_richness/Species_multiprocessing_template_20171010_ca_improve_performance.py).

## Proposed methodology

In short, the methodology is to:
- use a 10km by 10km global grid (hex or regular)
- use the above grid to bin each and every species
- use the above grid to bin each and every protected area
- use the above grid to bin each and every key biodiversity area

The result will enable further interrogation of species and protected areas at 100 sqkm resolution, without undertaking any additional spatial analysis.

The above result will produce the following metrics:
- Species range (at 100 sqkm resolution)
- Protected range of each species: per species, the range area (number of cells x cell size) and how much of that *is protected* (number of cells shared by the species and protected areas / total number of cells of the species)
- Combining the two, we will be able to produce the *Rodrigus 2014* graph, i.e. species with different targets
- Heatmap of number of species protected
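
The per-species metrics listed above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the table layouts, the column names (`id_no`, `hexid`, `wdpaid`), the function `protected_range` and the nominal cell area of 100 sqkm are all hypothetical, and the real mapping tables may differ.

```python
import pandas as pd

# Hypothetical toy mappings of species and protected areas to hex cells
species_hex = pd.DataFrame({
    'id_no': [1, 1, 1, 1, 2, 2, 2, 2],
    'hexid': [10, 11, 12, 13, 12, 20, 21, 22],
})
pa_hex = pd.DataFrame({
    'wdpaid': [100, 100, 200],
    'hexid': [11, 12, 12],
})

CELL_AREA_KM2 = 100  # assumed nominal cell size of the 10km grid


def protected_range(species_hex, pa_hex, cell_area_km2=CELL_AREA_KM2):
    """Per-species cell counts, areas and protected fraction."""
    # count each (species, cell) pair once
    cells = species_hex.drop_duplicates(['id_no', 'hexid'])
    # a cell is treated as protected if any protected area maps to it
    cells = cells.assign(protected=cells['hexid'].isin(pa_hex['hexid'].unique()))
    out = cells.groupby('id_no')['protected'].agg(n_cells='size', n_protected='sum')
    out['range_km2'] = out['n_cells'] * cell_area_km2
    out['protected_km2'] = out['n_protected'] * cell_area_km2
    out['fraction_protected'] = out['n_protected'] / out['n_cells']
    return out


summary = protected_range(species_hex, pa_hex)
print(summary)
```

Because a cell counts as protected as soon as any protected area touches it, this sketch would tend to overestimate protection for small protected areas, which is the concern raised in the notes that follow.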


**THINKING:**

**Can we justify the methodology (grid approach 10km by 10km) by saying a species has a *protected status* if it overlaps with a protected area or is within 10km of a protected area? This would also take into account the inaccuracy of the RedList EOO data and of the global protected areas network.**

**Similarly, can we say a KBA is protected if it is within 10km of a protected area?**


## Understanding the data

In [2]:
import numpy as np
import pandas as pd

In [99]:
%matplotlib inline

In [3]:
sis = pd.read_csv('sis_2017.csv')

In [4]:
bfa = pd.read_csv('Food_FAO_2017_2.csv', delimiter=';')

In [5]:
sis.head()

Unnamed: 0,objectid,id_no,binomial,presence,origin,seasonal,compiler,year,citation,source,...,tax_comm,kingdom,phylum,class,order_,family,genus,code,shape_length,shape_area
0,10143,190868.0,Nihonogomphus thomassoni,2,1,1,Kate Saunders,2011,"Asia freshwater biodiversity assessments, IUCN",Red List assessment,...,,ANIMALIA,ARTHROPODA,INSECTA,ODONATA,GOMPHIDAE,Nihonogomphus,LC,99.74407,27.275351
1,10144,190868.0,Nihonogomphus thomassoni,1,1,1,FBU,2011,IUCN (International Union for Conservation of ...,,...,,ANIMALIA,ARTHROPODA,INSECTA,ODONATA,GOMPHIDAE,Nihonogomphus,LC,2.133768,0.093543
2,10145,190868.0,Nihonogomphus thomassoni,1,1,1,DO Manh Cuong,2010,"Asia freshwater biodiversity assessments, IUCN",unknown,...,,ANIMALIA,ARTHROPODA,INSECTA,ODONATA,GOMPHIDAE,Nihonogomphus,LC,4.357552,0.30194
3,10146,190868.0,Nihonogomphus thomassoni,1,1,1,DO Manh Cuong,2010,"Asia freshwater biodiversity assessments, IUCN",Red List assessment,...,,ANIMALIA,ARTHROPODA,INSECTA,ODONATA,GOMPHIDAE,Nihonogomphus,LC,6.100119,0.527099
4,10147,63940.0,Tantilla coronadoi,1,1,1,NatureServe and IUCN,2007,NatureServe and IUCN (International Union for ...,,...,,ANIMALIA,CHORDATA,REPTILIA,SQUAMATA,COLUBRIDAE,Tantilla,LC,0.78953,0.049604


In [6]:
print('The number of database rows of species id 137: {}'.format(sis[sis.id_no==137].id_no.count()))

The number of database rows of species id 137: 13


It is important to understand that the `sis` table, i.e., the attribute table of the spatial data, potentially contains both multi-part and single-part polygons (probably due to different sources and to range polygons with different presence, origin and seasonality).

It is also possible that these rows overlap spatially.

Set the indices for both datasets and make sure the ids are integers. Also clean the tables to reduce their size.

In [7]:
sis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78042 entries, 0 to 78041
Data columns (total 25 columns):
objectid        78042 non-null int64
id_no           78042 non-null float64
binomial        78042 non-null object
presence        78042 non-null int64
origin          78042 non-null int64
seasonal        78042 non-null int64
compiler        76165 non-null object
year            78042 non-null int64
citation        78038 non-null object
source          24156 non-null object
dist_comm       947 non-null object
island          7320 non-null object
subspecies      1197 non-null object
subpop          186 non-null object
legend          78042 non-null object
tax_comm        426 non-null object
kingdom         78042 non-null object
phylum          78042 non-null object
class           78042 non-null object
order_          78042 non-null object
family          78042 non-null object
genus           78042 non-null object
code            78042 non-null object
shape_length    78042 non-nul

In [8]:
sis.id_no.max()

117582065.0

Choose 4 bytes for the ID field. The largest `id_no` is less than the largest value of `uint32`
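
The choice of a 4-byte unsigned type can also be checked programmatically. The sketch below is an illustration rather than part of the original workflow: `smallest_uint_dtype` is a hypothetical helper and the three ids are sample values taken from the outputs above. It also guards against missing or fractional ids, since `id_no` is stored as `float64`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sis.id_no, which is read from the CSV as float64
id_no = pd.Series([190868.0, 63940.0, 117582065.0])


def smallest_uint_dtype(values):
    """Return the smallest unsigned integer dtype that can hold `values`."""
    values = pd.Series(values)
    if values.isna().any():
        raise ValueError('missing ids cannot be cast to an integer dtype')
    if (values < 0).any() or (values % 1 != 0).any():
        raise ValueError('ids must be non-negative whole numbers')
    # np.min_scalar_type picks the smallest dtype able to hold the scalar
    return np.min_scalar_type(int(values.max()))


dtype = smallest_uint_dtype(id_no)
print(dtype)
ids = id_no.astype(dtype)
```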

In [224]:
int_types = ["uint8", "int8", "int16", "uint16", "uint32", "int32", 'int64']
for it in int_types:
    print(np.iinfo(it))

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------

Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -

ValueError: Invalid integer data type.

In [76]:
sis_columns=['id_no', 'kingdom', 'phylum', 'class', 'order_', 'family', 'genus', 'code', 'shape_area']
sis_clean = pd.DataFrame(sis[sis_columns])
sis_clean = sis_clean.astype(dtype={"id_no": "uint32",
                                    'kingdom': "category",
                                    'phylum': "category",
                                     'class': "category",
                                     'order_': "category",
                                     'family': "category",
                                     'genus': "category",
                                     'code': "category"})
# sis_clean = sis_clean.set_index('id_no')

In [77]:
sis_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78042 entries, 0 to 78041
Data columns (total 9 columns):
id_no         78042 non-null uint32
kingdom       78042 non-null category
phylum        78042 non-null category
class         78042 non-null category
order_        78042 non-null category
family        78042 non-null category
genus         78042 non-null category
code          78042 non-null category
shape_area    78042 non-null float64
dtypes: category(7), float64(1), uint32(1)
memory usage: 2.1 MB


After the data cleaning process, the size of the `sis` table has been reduced from 14.9+ MB to 2.1 MB. All essential information is retained and the number of rows stays the same.
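
The effect of the dtype changes can be measured directly. The sketch below uses a synthetic table, not the real `sis` data, and the column contents are made up for illustration. It passes `deep=True` so that the string contents of `object` columns are counted, which the `+` in the `info()` figure indicates were not fully included.

```python
import numpy as np
import pandas as pd

# Synthetic table with repetitive string columns, standing in for `sis`
n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id_no': np.arange(n, dtype='float64'),
    'kingdom': ['ANIMALIA'] * n,
    'code': rng.choice(['LC', 'VU', 'EN', 'CR'], size=n),
})

before = df.memory_usage(deep=True).sum()
# same kind of conversion as above: compact integer ids, categorical strings
slim = df.astype({'id_no': 'uint32', 'kingdom': 'category', 'code': 'category'})
after = slim.memory_usage(deep=True).sum()
print('before: {} bytes, after: {} bytes'.format(before, after))
```

Categorical columns pay off here because each string is stored once and rows only hold small integer codes; for columns with mostly unique values the saving would be much smaller.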

In [78]:
sis.index.size == sis_clean.index.size

True

In [79]:
sis.id_no.max()

117582065.0

In [14]:
bfa.head()

Unnamed: 0,id,kingdom_name,phylum_name,class_name,order_name,family_name,genus_name,species_name,scientific_name,authority,infra_rank,infra_name,infra_authority,category,criteria,publicationyear,main_common_name,value,REF,DESCRIPTION
0,9,ANIMALIA,CHORDATA,ACTINOPTERYGII,CYPRINIFORMES,CYPRINIDAE,Aaptosyax,grypus,Aaptosyax grypus,"Rainboth, 1991",,,,CR,A2acd,2011,Mekong Giant Salmon Carp,1,1,Food - human
1,137,ANIMALIA,CHORDATA,MAMMALIA,CHIROPTERA,PTEROPODIDAE,Acerodon,celebensis,Acerodon celebensis,"Peters, 1867",,,,VU,A2d,2016,Sulawesi Fruit Bat,1,1,Food - human
2,138,ANIMALIA,CHORDATA,MAMMALIA,CHIROPTERA,PTEROPODIDAE,Acerodon,humilis,Acerodon humilis,"K. Andersen, 1909",,,,EN,"B1ab(iii,v)",2016,Talaud Fruit Bat,1,1,Food - human
3,139,ANIMALIA,CHORDATA,MAMMALIA,CHIROPTERA,PTEROPODIDAE,Acerodon,jubatus,Acerodon jubatus,"(Eschscholtz, 1831)",,,,EN,A2cd,2016,Golden-capped Fruit Bat,1,1,Food - human
4,140,ANIMALIA,CHORDATA,MAMMALIA,CHIROPTERA,PTEROPODIDAE,Acerodon,leucotis,Acerodon leucotis,"(Sanborn, 1950)",,,,VU,A4cd,2008,Palawan Fruit Bat,1,1,Food - human


In [15]:
bfa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11396 entries, 0 to 11395
Data columns (total 20 columns):
id                  11396 non-null int64
kingdom_name        11396 non-null object
phylum_name         11396 non-null object
class_name          11396 non-null object
order_name          11396 non-null object
family_name         11396 non-null object
genus_name          11396 non-null object
species_name        11396 non-null object
scientific_name     11396 non-null object
authority           11380 non-null object
infra_rank          126 non-null object
infra_name          126 non-null object
infra_authority     0 non-null float64
category            11396 non-null object
criteria            1899 non-null object
publicationyear     11396 non-null int64
main_common_name    8649 non-null object
value               11396 non-null int64
REF                 11396 non-null int64
DESCRIPTION         11396 non-null object
dtypes: float64(1), int64(4), object(15)
memory usage: 1.7+ MB


In [16]:
bfa.id.max()

117582065

In [41]:
columns=['id', 'kingdom_name', 'phylum_name', 'class_name', 'order_name', 'family_name', 'genus_name', 'category']
bfa_clean = pd.DataFrame(bfa[columns])
bfa_clean = bfa_clean.astype(dtype={"id": "uint32",
                                    'kingdom_name': "category",
                                    'phylum_name': "category",
                                     'class_name': "category",
                                     'order_name': "category",
                                     'family_name': "category",
                                     'genus_name': "category",
                                     'category': "category"})
# bfa_clean = bfa_clean.set_index('id')

In [42]:
bfa_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11396 entries, 0 to 11395
Data columns (total 8 columns):
id              11396 non-null uint32
kingdom_name    11396 non-null category
phylum_name     11396 non-null category
class_name      11396 non-null category
order_name      11396 non-null category
family_name     11396 non-null category
genus_name      11396 non-null category
category        11396 non-null category
dtypes: category(7), uint32(1)
memory usage: 294.3 KB


Since both the `sis` and `bfa` tables contain multiple rows per species, species must not be counted more than once. The function below counts distinct ids.

In [43]:
def count_unique(array_like):
    # number of distinct values; np.count_nonzero on the unique values
    # would silently drop an id equal to 0
    return np.unique(array_like).size

In [46]:
np.count_nonzero(bfa_clean.id.unique())

9627

In [47]:
count_unique(bfa_clean.id)

9627

###  Spatial data for species for food and agriculture

This is to find out how many species in the non-spatial database have spatial data.


Count of species by class and red list categories from the `bfa` table

In [60]:
bfa_clean.pivot_table(index='class_name',  columns='category', values='id', aggfunc=count_unique, fill_value=0, margins=True)

category,CR,DD,EN,EW,EX,LC,LR/lc,LR/nt,NT,VU,All
class_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ACTINOPTERYGII,133,799,148,1,9,3129,1,1,157,233,4611
AGARICOMYCETES,1,0,2,0,0,0,0,0,1,4,8
AMPHIBIA,9,12,20,0,0,141,0,0,22,34,238
ANTHOZOA,0,1,0,0,0,1,0,0,0,0,2
AVES,63,5,113,2,46,1021,0,0,191,205,1646
BIVALVIA,6,23,8,0,0,50,0,0,1,7,95
CEPHALASPIDOMORPHI,0,0,0,0,1,3,0,0,1,0,5
CEPHALOPODA,0,62,0,0,0,32,0,0,1,1,96
CHONDRICHTHYES,6,83,28,0,0,62,0,0,51,57,287
CLITELLATA,0,2,0,0,1,0,0,0,0,0,3


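Of these, the species that have been mapped, i.e. that have spatial data

In [62]:
# sis_clean keeps its default RangeIndex (set_index is commented out above),
# so match on the id_no column rather than on the index
bfa_clean[bfa_clean.id.isin(sis_clean.id_no)].pivot_table(
    index='class_name',  columns='category', values='id', aggfunc=count_unique, fill_value=0, margins=True)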

category,CR,DD,EN,EW,EX,LC,LR/lc,LR/nt,NT,VU,All
class_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ACTINOPTERYGII,60,563,114,0,0,2719,0,0,125,201,3782.0
AGARICOMYCETES,1,0,2,0,0,0,0,0,1,4,8.0
AMPHIBIA,9,12,20,0,0,141,0,0,22,34,238.0
ANTHOZOA,0,0,0,0,0,0,0,0,0,0,
AVES,55,5,113,0,0,1019,0,0,191,205,1588.0
BIVALVIA,5,16,8,0,0,35,0,0,1,2,67.0
CEPHALASPIDOMORPHI,0,0,0,0,0,3,0,0,1,0,4.0
CEPHALOPODA,0,0,0,0,0,0,0,0,0,0,
CHONDRICHTHYES,6,82,27,0,0,62,0,0,51,57,285.0
CLITELLATA,0,2,0,0,0,0,0,0,0,0,2.0


In [89]:
np.intersect1d(sis_clean.id_no.unique(), bfa.id.unique()).size

7867

7,867 species in the `bfa` table have spatial data

### Restricted range species

Restricted range species affect the resolution and accuracy of the analysis. We propose to use a 10km resolution global grid to map the indicative presence of species. This works for wide-ranging species but may run into difficulties if the distribution of a species is well below this threshold.

As a rough rule of thumb for converting square degrees to square kilometres, 1 degree is taken to equal 111km.

In [104]:
degree2_to_km2 = 111*111

Join the two tables to migrate attribute `shape_area`

In [91]:
bfa_sis = pd.merge(bfa_clean, sis_clean, how='inner', left_on='id', right_on='id_no')[sis_columns]

Double check the species that have spatial data

In [93]:
bfa_sis.id_no.unique().size

7867

Save this table for filtering the spatial data

In [148]:
pd.DataFrame(bfa_sis.id_no.unique(),columns=['bfa']).to_csv('bfa_sis.csv')

In [95]:
bfa_sis_area_dist = bfa_sis.pivot_table(values='shape_area', index='id_no', aggfunc=np.sum)

Calculate the number of species whose range is smaller than 100 sqkm

In [109]:
bfa_sis_area_dist[bfa_sis_area_dist['shape_area']*degree2_to_km2 < 100].size

64

The above calculation indicates that at least **64** species have a range smaller than 100 sqkm (and are thus affected by the 10km x 10km grid). This assumes they all occur near the equator, where a square degree covers the largest area. In reality, more species could be affected.
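
To see why the equator assumption gives a lower bound, note that the east-west extent of a degree shrinks roughly with the cosine of the latitude. The sketch below is a back-of-the-envelope illustration: `KM_PER_DEGREE` and `degree2_to_km2_at` are hypothetical names, and applying the correction would need a representative latitude per range, which the attribute table shown here does not provide.

```python
import math

KM_PER_DEGREE = 111.0  # same rule of thumb as above


def degree2_to_km2_at(lat_deg):
    """Approximate area in sqkm of one square degree centred at `lat_deg`."""
    # north-south extent stays ~111km; east-west extent scales with cos(latitude)
    return KM_PER_DEGREE ** 2 * math.cos(math.radians(lat_deg))


for lat in (0, 30, 60):
    print(lat, degree2_to_km2_at(lat))
```

At 60 degrees latitude a square degree covers about half the area it covers at the equator, so a range that just passes the 100 sqkm threshold under the equatorial factor could fall below it once latitude is taken into account.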

Below is a table of what these 64 species are

In [113]:
bfa_clean[bfa_clean.id.isin(bfa_sis_area_dist[bfa_sis_area_dist['shape_area']*degree2_to_km2 < 100].index)].pivot_table(
    index='class_name',  columns='category', values='id', aggfunc=count_unique, fill_value=0, margins=True)

category,CR,DD,EN,EW,EX,LC,LR/lc,LR/nt,NT,VU,All
class_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ACTINOPTERYGII,4,1,0,0,0,3,0,0,0,5,13.0
AGARICOMYCETES,0,0,0,0,0,0,0,0,0,0,
AMPHIBIA,4,1,1,0,0,0,0,0,0,0,6.0
ANTHOZOA,0,0,0,0,0,0,0,0,0,0,
AVES,10,0,10,0,0,0,0,0,0,5,25.0
BIVALVIA,0,0,0,0,0,0,0,0,0,0,
CEPHALASPIDOMORPHI,0,0,0,0,0,0,0,0,0,0,
CEPHALOPODA,0,0,0,0,0,0,0,0,0,0,
CHONDRICHTHYES,0,0,0,0,0,0,0,0,0,0,
CLITELLATA,0,0,0,0,0,0,0,0,0,0,


While the 64 species represent a small proportion (out of 7,867), they account for about 12% of critically endangered species (29 out of 239).


## Analysis

The binning exercise for species (7,867, or 13,000+ rows) and hex (5 million) took about 22 hours to finish on a machine with 10 cores running in parallel. The same technique has also been applied to the WDPA (200k+ records). The end result is a **common** mapping of species and protected areas to a hex grid.

Notably, two rows failed to process (row_id: 8840 and 8841, i.e. Guttera pucherani, an LC bird, and Hipposideros grandis, an LC mammal).

The main benefit of this approach is that any later filtering, for example by presence/origin/seasonality for species or by attributes in the WDPA, can be done without undertaking any additional, costly spatial analysis. This requires the spatial analysis to be done at a granularity that allows such separation. In the case of the WDPA, `STATUS` and `DESIG_TYPE` are defined per `WDPAID`, so no further processing is needed. In the case of RL species, the same species may have multiple polygons with different `presence` codes, so the mapping cannot be done at the `ID_NO` level; it must be done at row level, which requires an additional `row_id`.
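
A minimal sketch of such row-level filtering is given below, assuming a lookup table keyed by `row_id` and a `row_id` to `hexid` mapping. The toy values, the function `hexes_for` and the default code lists are illustrative only; the codes to keep are still marked as undecided in the Red List notes above.

```python
import pandas as pd

# Hypothetical lookup (one row per range polygon) and row-to-hex mapping
lookup = pd.DataFrame({
    'row_id': [1, 2, 3],
    'id_no': [137, 137, 138],
    'presence': [1, 4, 1],
    'origin': [1, 1, 1],
    'seasonal': [1, 1, 2],
})
row_hex = pd.DataFrame({
    'row_id': [1, 1, 2, 2, 3],
    'hexid': [10, 11, 11, 12, 30],
})


def hexes_for(lookup, row_hex, presence=(1, 2), origin=(1, 2), seasonal=(1, 2, 3)):
    """Hex cells per species, keeping only rows with the given codes."""
    keep = lookup[lookup['presence'].isin(presence)
                  & lookup['origin'].isin(origin)
                  & lookup['seasonal'].isin(seasonal)]
    # attribute filter and table join only, no geometry involved
    merged = row_hex.merge(keep[['row_id', 'id_no']], on='row_id', how='inner')
    return merged[['id_no', 'hexid']].drop_duplicates()


species_hex = hexes_for(lookup, row_hex)
counts = species_hex.groupby('id_no')['hexid'].nunique()
print(counts)
```

Changing the filter only means re-running the join with different code lists; the expensive binning step does not need to be repeated.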

### The big data challenge
The end result of the analyses is a set of two-column tables that map two datasets to each other by ID. The column names `ID_NO` and `WDPAID` are kept for historical reasons and do not describe the contents; the table below shows what gets mapped to what. This will be fixed in the next version of the script.

| column name in file | content |
|----------------------|----------------------|
| `ID_NO` | wdpaid (WDPA) or row_id (RL) |
| `WDPAID` | hexid |

The resulting tables are huge. With suitably compact data types, they can still be read into memory.

**!the below is outdated!**

(For example, the csv for hex and RL would be 1.2TB uncompressed. This means the data cannot be read into memory in one go; it requires a piecemeal approach, albeit a slower one. The idea is to process the data chunk by chunk, distributing the chunks to different processes/cores using the `ipyparallel` library. This requires additional packages to be installed and configured elsewhere.)
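
The chunk-by-chunk idea in the outdated note can be sketched with the `chunksize` option of `pd.read_csv`. This is a single-process illustration on a tiny in-memory file; the distribution of chunks over cores with `ipyparallel` is deliberately left out, `count_cells_in_chunks` is a hypothetical helper, and the sketch assumes each (id, hex) pair appears only once in the file.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the large mapping file
pairs = [(7, 1), (7, 2), (8, 2), (8, 3), (8, 4), (9, 5)]
csv_text = 'ID_NO,WDPAID\n' + '\n'.join('{},{}'.format(a, b) for a, b in pairs)


def count_cells_in_chunks(source, chunksize=2):
    """Count rows per ID_NO without loading the whole file at once."""
    totals = None
    reader = pd.read_csv(source, chunksize=chunksize,
                         dtype={'ID_NO': 'uint32', 'WDPAID': 'uint32'})
    for chunk in reader:
        counts = chunk.groupby('ID_NO').size()
        # merge partial counts; ids missing on one side count as 0
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return totals.astype('int64')


cell_counts = count_cells_in_chunks(io.StringIO(csv_text))
print(cell_counts)
```

Only one chunk is held in memory at a time, at the cost of a second pass whenever a different aggregate is needed.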

BFA tables

In [158]:
bfa_hex = pd.read_csv("hex_bfa.csv.gz",
                        skipinitialspace=True,
                        dtype={'ID_NO': np.uint32, 'WDPAID': np.uint32})

In [229]:
bfa_hex_lu = pd.read_csv('bfa_species_lookup.csv')

print(bfa_hex_lu.info())

bfa_hex_lu = bfa_hex_lu[['row_id', 'id_no', 'binomial', 'presence', 'origin', 'seasonal', 'kingdom', 'phylum', 'class', 'order_', 'family', 'genus', 'code']]

print(bfa_hex_lu.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Data columns (total 28 columns):
OID             14723 non-null int64
id_no           14723 non-null float64
binomial        14723 non-null object
presence        14723 non-null int64
origin          14723 non-null int64
seasonal        14723 non-null int64
compiler        14651 non-null object
year            14723 non-null int64
citation        14723 non-null object
source          4735 non-null object
dist_comm       255 non-null object
island          2435 non-null object
subspecies      611 non-null object
subpop          73 non-null object
legend          14723 non-null object
tax_comm        29 non-null object
kingdom         14723 non-null object
phylum          14723 non-null object
class           14723 non-null object
order_          14723 non-null object
family          14723 non-null object
genus           14723 non-null object
code            14723 non-null object
OBJECTID_1      14723 non-null in

WDPA related tables

In [171]:
wdpa_hex = pd.read_csv('hex_wdpa_poly.csv.gz',
                        skipinitialspace=True,
                        dtype={'ID_NO': np.uint32, 'WDPAID': np.uint32})

In [225]:
wdpa_hex_lu = pd.read_csv('wdpa_lookup2.csv')

print(wdpa_hex_lu.info())

# change dtypes to make it smaller in memory
wdpa_hex_lu['WDPAID'] = wdpa_hex_lu['WDPAID'].astype('uint32')
wdpa_hex_lu['STATUS_YR'] = wdpa_hex_lu['STATUS_YR'].astype('uint16')
wdpa_hex_lu['MARINE'] = wdpa_hex_lu['MARINE'].astype('uint8')
wdpa_hex_lu[['REP_AREA', 'GIS_AREA']] = wdpa_hex_lu[['REP_AREA', 'GIS_AREA']].astype('float16')

wdpa_hex_lu = wdpa_hex_lu.drop('OID', axis=1)

wdpa_hex_lu['DESIG_ENG'] = wdpa_hex_lu['DESIG_ENG'].astype('category')
wdpa_hex_lu['DESIG_TYPE'] = wdpa_hex_lu['DESIG_TYPE'].astype('category')
wdpa_hex_lu['IUCN_CAT'] = wdpa_hex_lu['IUCN_CAT'].astype('category')
wdpa_hex_lu['INT_CRIT'] = wdpa_hex_lu['INT_CRIT'].astype('category')
wdpa_hex_lu['STATUS'] = wdpa_hex_lu['STATUS'].astype('category')
wdpa_hex_lu['PARENT_ISO'] = wdpa_hex_lu['PARENT_ISO'].astype('category')

print(wdpa_hex_lu.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217544 entries, 0 to 217543
Data columns (total 12 columns):
OID           217544 non-null int64
WDPAID        217544 non-null float64
DESIG_ENG     217544 non-null object
DESIG_TYPE    217544 non-null object
IUCN_CAT      217544 non-null object
INT_CRIT      217544 non-null object
MARINE        217544 non-null int64
REP_AREA      217544 non-null float64
GIS_AREA      217544 non-null float64
STATUS        217544 non-null object
STATUS_YR     217544 non-null int64
PARENT_ISO    217544 non-null object
dtypes: float64(3), int64(3), object(6)
memory usage: 19.9+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217544 entries, 0 to 217543
Data columns (total 11 columns):
WDPAID        217544 non-null uint32
DESIG_ENG     217544 non-null category
DESIG_TYPE    217544 non-null category
IUCN_CAT      217544 non-null category
INT_CRIT      217544 non-null category
MARINE        217544 non-null uint8
REP_AREA      217544 non-null float16

In [226]:
wdpa_hex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1499915 entries, 0 to 1499914
Data columns (total 2 columns):
ID_NO     1499915 non-null uint32
WDPAID    1499915 non-null uint32
dtypes: uint32(2)
memory usage: 11.4 MB


In [160]:
bfa_hex.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487378599 entries, 0 to 487378598
Data columns (total 2 columns):
ID_NO     uint32
WDPAID    uint32
dtypes: uint32(2)
memory usage: 3.6 GB
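
The reported memory figure is consistent with simple arithmetic on the row count and the 4-byte columns. The sketch below reproduces it and, as a comparison, shows what pandas' default `int64` would need; `table_size_gib` is a hypothetical helper and index overhead is ignored.

```python
import numpy as np

n_rows = 487378599  # row count reported by bfa_hex.info()
n_cols = 2


def table_size_gib(n_rows, n_cols, dtype):
    """Size of a dense table of fixed-width columns, in units of 2**30 bytes."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2 ** 30


uint32_gib = table_size_gib(n_rows, n_cols, 'uint32')
int64_gib = table_size_gib(n_rows, n_cols, 'int64')
print('uint32: {:.1f}, int64: {:.1f}'.format(uint32_gib, int64_gib))
```

Specifying `uint32` in `read_csv` therefore roughly halves the footprint compared with the default integer type.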


In [173]:
bfa_hex_lu = pd.read_csv('bfa_species_lookup.csv')

In [164]:
bfa_hex.columns = ['row_id', 'hexid']

In [166]:
a = bfa_hex.pivot_table(index='row_id', values='hexid', aggfunc=count_unique)

In [168]:
a

Unnamed: 0_level_0,hexid
row_id,Unnamed: 1_level_1
2,745
3,83881
4,51409
5,4664
6,3634549
7,225
8,19657
9,48336
10,25149
11,237252


In [167]:
a.info()

<class 'pandas.core.frame.DataFrame'>
UInt64Index: 14693 entries, 2 to 14723
Data columns (total 1 columns):
hexid    14693 non-null uint32
dtypes: uint32(1)
memory usage: 172.2 KB


In [159]:
bhc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 2 columns):
ID_NO     40000 non-null uint32
WDPAID    40000 non-null uint32
dtypes: uint32(2)
memory usage: 312.6 KB


In [156]:
bhc.head()

Unnamed: 0,ID_NO,WDPAID
0,7,1596246
1,7,1601353
2,7,1602073
3,7,1602074
4,7,1602075


In [None]:
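def count_hex_species(df):
    # number of distinct hex cells per species row;
    # assumes df has the columns 'row_id' and 'hexid'
    return df.groupby('row_id')['hexid'].nunique()

def count_protected_hex_species(df, protected_hexids):
    # number of distinct hex cells per species row that are also protected;
    # protected_hexids is an assumed extra argument holding the hex ids covered by the WDPA
    protected = df[df['hexid'].isin(protected_hexids)]
    return protected.groupby('row_id')['hexid'].nunique()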

## TEST

In [119]:
import datetime, time

In [137]:
datetime.datetime.now().strftime('%c')


'Fri Nov 24 09:28:55 2017'

In [135]:
a.strftime('%x-%X')

'11/24/17-09:11:37'

In [136]:
a.strftime('%c')

'Fri Nov 24 09:11:37 2017'

In [125]:
time.time()

1511514626.670678

In [127]:
datetime.datetime.fromtimestamp(time.time())

datetime.datetime(2017, 11, 24, 9, 11, 5, 140678)