### Species Widget

### Convert into 3 tables:
[1) Data provided:](#1.-Data-provided)

* ##### Schema
location: string  
species_scientific_names {string, n columns based on species list}: boolean   {represents presence/absence}

[2) Table 1: Species properties](#2.-Table-1:-Species-properties)

* ##### Schema
species_id: number  
scientific_name: str  
common_name: str  
iucn_url: url_str  
red_list_cat: enum {ex, ew, re, cr, en, vu, lr, nt, lc, dd} - reference Red List categories

[3) Table 2: Species locations](#3.-Table-2:-Species-locations)

* ##### Schema
location_id: string  
species: array[species_id]

In [3]:
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import fiona
import requests
import re
import json
from pandera.typing import Series
from hypothesis import given
import pandera as pa

### 1. Data provided

In [20]:
sp = pd.read_csv('../../../data/Species_Binary_20220223.csv', encoding = 'latin-1').drop(columns='Unnamed: 0')
sp.columns = sp.columns.str.replace(".", " ", regex=False)  # literal dot, not the regex wildcard
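The pattern `"."` used to clean the column names is also the regular-expression wildcard, so the result depends on whether it is read as a regex or as a literal string. The sketch below uses only the standard library (`re.sub` versus `str.replace`) to show the difference; it does not call pandas, and the sample name is just one column written with a dot.

```python
import re

name = "Acanthus.ebracteatus"

# Read as a regex, "." matches any character, so every character is replaced.
as_regex = re.sub(".", " ", name)

# Read literally, only the dot itself is replaced.
as_literal = name.replace(".", " ")

print(repr(as_regex))
print(repr(as_literal))
```

Being explicit about the literal reading avoids depending on a library default.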


In [21]:
sp.head()

Unnamed: 0,Country,Acanthus ebracteatus,Acanthus ilicifolius,Acrostichum aureum,Acrostichum danaeifolium,Acrostichum speciosum,Aegialitis annulata,Aegialitis rotundifolia,Aegiceras corniculatum,Aegiceras floridum,...,Scyphiphora hydrophylacea,Sonneratia alba,Sonneratia apetala,Sonneratia caseolaris,Sonneratia griffithii,Sonneratia lanceolata,Sonneratia ovata,Tabebuia palustris,Xylocarpus granatum,Xylocarpus moluccensis
0,American Samoa,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Angola,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Anguilla,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Antigua & Barbuda,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Australia,1,1,1,0,1,1,0,1,0,...,1,1,0,1,0,1,1,0,1,1


### Model to validate data provided
(assuming they will continue to provide the data in this format)  
Right now it only validates that the location column (`Country`) holds strings and that the species columns hold integers in the range 0-1.

In [22]:
class Species_matrix(pa.SchemaModel):
    """
    This class is used to validate the data schema of a species matrix.
    """
    location: Series[str] = pa.Field(nullable=False,alias= 'Country')
    specie: Series[int] = pa.Field(alias ='^(?!Country$).*', nullable=False,regex=True, in_range={"min_value": 0, "max_value": 1})

In [23]:
validated = Species_matrix.validate(sp)
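The `alias` of the `specie` field in `Species_matrix` is a regular expression with a negative lookahead, which is what lets one field cover every column except `Country`. A minimal sketch of which names that pattern selects, using the standard `re` module and a made-up header row (the third column name is hypothetical and only there to show that a mere `Country` prefix is not excluded):

```python
import re

# Same pattern as the `alias` of the `specie` field.
species_cols = re.compile(r"^(?!Country$).*")

# Hypothetical header row: the location column plus two other columns.
columns = ["Country", "Acanthus ebracteatus", "Countryside sp"]

selected = [c for c in columns if species_cols.match(c)]
print(selected)
```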

### 2. Table 1: Species properties

* First we need to retrieve the data from the IUCN API.  
We can run a search on the IUCN main page (https://www.iucnredlist.org/), for example for the first species, Acanthus ebracteatus.  
That takes us to another page: https://www.iucnredlist.org/search?query=Acanthus%20ebracteatus&searchType=species  
If we open the browser inspector, the search queries are listed under the 'Network' tab.  
We can click on one of the 'search?size...' entries and look at its headers and payload. Not all queries return the same information; we need to find the one that returns the common name, species id, IUCN status, etc.  
Once we find it, we can right-click --> Copy --> Copy as cURL.  
The curl command is shown below. We can convert it to Python using https://curlconverter.com/#python  
We can simplify the request by removing unnecessary parameters, and make it generic by passing the species name as the query term.

* Once we have all the data we can add it to a dataframe.

* What to do with hybrid species? They don't have an IUCN species identifier.  
For now they get a placeholder identifier (the code below uses `999900 + n`, where `n` is the position in the species list), but we should think about how to identify them.
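Making the request generic means the species name has to be inserted in two places: the search page URL and the JSON body. The sketch below builds both with the standard library. `build_search` is a hypothetical helper, and the trimmed field lists are an assumption: whether the endpoint accepts this reduced body has not been verified here.

```python
import json
from urllib.parse import quote

def build_search(species):
    """Return (search page URL, JSON body) for one scientific name."""
    # Percent-encode the name for the human-readable search page URL.
    url = ("https://www.iucnredlist.org/search"
           f"?query={quote(species)}&searchType=species")
    # Trimmed body: only the name changes between requests. Whether the
    # endpoint accepts this reduced field list is an assumption.
    body = {
        "stored_fields": ["commonName", "scientificName", "sisTaxonId",
                          "redListCategory.scaleCode"],
        "query": {"bool": {"must": [{"multi_match": {
            "query": species.lower(),
            "type": "phrase_prefix",
            "fields": ["scientificName^8", "synonyms^2"],
        }}]}},
    }
    return url, json.dumps(body)

url, body = build_search("Acanthus ebracteatus")
print(url)
```

Using `quote` rather than a manual `replace(" ", "%20")` also covers names containing other reserved characters.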

curl 'https://www.iucnredlist.org/dosearch/assessments/_search?size=60&_source=false&from=0&track_total_hits=true' \
  -H 'authority: www.iucnredlist.org' \
  -H 'pragma: no-cache' \
  -H 'cache-control: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
  -H 'dnt: 1' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'content-type: application/json' \
  -H 'accept: */*' \
  -H 'origin: https://www.iucnredlist.org' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://www.iucnredlist.org/search?query=canis%20lupus&searchType=species' \
  -H 'accept-language: en,es;q=0.9,gl;q=0.8' \
  -H 'cookie: _ga=GA1.2.1721324640.1640163956; _gid=GA1.2.729186747.1640163956; _application_devise_session=U2NyVC9NaS8xOXdEeEdFOFprY0VWeVZTWE0vWWZZb2NQNEJyRlNlejVXK2Y1N1lOcUQwRDlkbUd0OXk0MmEvZS84RFVPQUtGNEJUenZQdXVQZzVqblkrT050RC9kMmJCTGtESTh5c0JzVWF1MzRKRkV0cXJPWkFhUlZIRXlBZWI1YVFwUDIrRXA0MHR2RVNnNEFIb1JGMUNteS9HUFhid05FZjI2WVpzSVorQjRlVmNIQjd1eWc5RlArR09lRjFGbGQ1ZUJFSlJoN201N1JQT0U2T3hLN2lERkdjeE00OTBIaERjN09ocG9KdTNSMFpnbTE2a3RVWElQR3Q5cjlaby0tUEMvSHlIMnBJckludlhEZjRvY0RRUT09--f221903eb125a5b45ea7b6dc7d9148fa0643c650' \
  --data-raw '{"stored_fields":["hasImage","hasPoints","hasRanges","image.id","image.url","image.urlThumb","image.credit","scopes.id","scopes.code","scopes.jsonDescription","kingdomName","className","commonName","scientificName","sisTaxonId","redListCategory.scaleCode","redListCategory.order","redListCategory.code","redListCategory.jsonDescription","populationTrend.id","populationTrend.code","populationTrend.jsonDescription","hasGreen","greenListCategory.scaleCode","greenListCategory.name"],"query":{"bool":{"must":[{"multi_match":{"query":"canis lupus","type":"phrase_prefix","fields":["commonName^12","commonNames^10","scientificName^8","keywords^4","synonyms^2","assessors","sisTaxonId","id"],"lenient":true,"max_expansions":100}}],"filter":{"bool":{"filter":[{"terms":{"scopes.code":["1"]}},{"terms":{"taxonLevel":["Species"]}}],"should":[],"minimum_should_match":0}},"should":[{"term":{"hasImage":{"value":true,"boost":6}}}]}},"sort":[{"_score":{"order":"desc"}}]}' \
  --compressed

In [24]:
def getIUCNdata(species):
    '''
    Get the IUCN Red List search data for a given species.

    Parameters
    ----------
    species: str
        the scientific name of the species

    Returns
    -------
    dict
        the parsed JSON response of the search endpoint
    '''
    sp_str = species.lower()

    headers = {
        'authority': 'www.iucnredlist.org',
        'content-type': 'application/json'}

    data = json.dumps({"stored_fields":["hasImage","hasPoints","hasRanges","image.id","image.url",\
                "image.urlThumb","image.credit","scopes.id","scopes.code","scopes.jsonDescription",\
                "kingdomName","className","commonName","scientificName","sisTaxonId",\
                "redListCategory.scaleCode","redListCategory.order","redListCategory.code",\
                "redListCategory.jsonDescription","populationTrend.id","populationTrend.code",\
                "populationTrend.jsonDescription","hasGreen","greenListCategory.scaleCode",\
                "greenListCategory.name"],\
                "query":{"bool":{"must":[{"multi_match":\
                {"query":f"{sp_str}","type":"phrase_prefix",\
                "fields":["commonName^12","commonNames^10",\
                "scientificName^8","keywords^4","synonyms^2",\
                "assessors","sisTaxonId","id"],"lenient":True,\
                "max_expansions":100}}],\
                "filter":{"bool":{"filter":[{"terms":{"scopes.code":["1"]}},\
                {"terms":{"taxonLevel":["Species"]}}],"should":[],"minimum_should_match":0}},\
                "should":[{"term":{"hasImage":{"value":True,"boost":6}}}]}},"sort":[{"_score":{"order":"desc"}}]})

    response = requests.get('https://www.iucnredlist.org/dosearch/assessments/_search', headers=headers, data=data)
    return response.json()

In [25]:
sp_list = validated.columns.tolist()

records ={'species_id':[],
         'scientific_name':[],
         'common_name':[],
         'iucn_url':[],
         'red_list_cat':[]}

for n, species in enumerate(sp_list[1:]):
    out = getIUCNdata(species)
    records['scientific_name'].append(species)
    records['iucn_url'].append(f'https://www.iucnredlist.org/search?query={species.replace(" ","%20")}&searchType=species')
    
    ### Species id
    try:
        records['species_id'].append(out['hits']['hits'][0]['_id'])
    except (KeyError, IndexError):
        # No hit (e.g. hybrid species): use a placeholder id
        records['species_id'].append(999900+n)

    ### Common name
    try:
        if 'commonName' in out['hits']['hits'][0]['fields']:
            records['common_name'].append(out['hits']['hits'][0]['fields']['commonName'][0])
        else:
            # Fall back to the second hit when the first has no common name
            records['common_name'].append(out['hits']['hits'][1]['fields']['commonName'][0])
    except (KeyError, IndexError):
        records['common_name'].append(np.nan)

    ### IUCN category
    try:
        records['red_list_cat'].append(out['hits']['hits'][0]['fields']['redListCategory.scaleCode'][0])
    except (KeyError, IndexError):
        records['red_list_cat'].append(np.nan)
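The lookups in the loop above all follow the same pattern: read the first value of a field from the top hit, or fall back to a default. A minimal sketch of that pattern as one helper is shown here. `first_field` is a hypothetical name, and the sample response is made up; it only mirrors the shape implied by the accesses in the loop (`hits` → `hits` → `fields`).

```python
def first_field(response, name, default=None):
    """Return the first value of a stored field from the top hit, or `default`."""
    try:
        return response["hits"]["hits"][0]["fields"][name][0]
    except (KeyError, IndexError):
        return default

# Illustrative response, shaped like the accesses in the loop above;
# the values are made up.
fake = {"hits": {"hits": [
    {"_id": "123", "fields": {"redListCategory.scaleCode": ["lc"]}}
]}}

print(first_field(fake, "redListCategory.scaleCode"))
print(first_field(fake, "commonName"))
print(first_field({"hits": {"hits": []}}, "commonName", default="n/a"))
```

Catching only `KeyError` and `IndexError` keeps unrelated failures, such as a network error, visible instead of silently turning them into missing values.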
        

In [26]:
### Write table 1 using IUCN API data
table1 = pd.DataFrame(data=records)
table1.species_id = table1.species_id.astype(int)

In [27]:
table1

Unnamed: 0,species_id,scientific_name,common_name,iucn_url,red_list_cat
0,7621003,Acanthus ebracteatus,Holy Mangrove,https://www.iucnredlist.org/search?query=Acant...,lc
1,6536949,Acanthus ilicifolius,Holy Mangrove,https://www.iucnredlist.org/search?query=Acant...,lc
2,7366131,Acrostichum aureum,,https://www.iucnredlist.org/search?query=Acros...,lc
3,7613396,Acrostichum danaeifolium,,https://www.iucnredlist.org/search?query=Acros...,lc
4,7614751,Acrostichum speciosum,,https://www.iucnredlist.org/search?query=Acros...,lc
...,...,...,...,...,...
59,7619241,Sonneratia lanceolata,,https://www.iucnredlist.org/search?query=Sonne...,lc
60,7615033,Sonneratia ovata,,https://www.iucnredlist.org/search?query=Sonne...,nt
61,7610513,Tabebuia palustris,,https://www.iucnredlist.org/search?query=Tabeb...,vu
62,7624881,Xylocarpus granatum,,https://www.iucnredlist.org/search?query=Xyloc...,lc


### Model to validate Species properties

In [29]:
class Species_properties(pa.SchemaModel):
    #species_id: Series[int] = pa.Field(nullable=False, gt=0,allow_duplicates=False)
    #scientific_name: Series[str] = pa.Field(str_matches= "[A-Za-z]*", allow_duplicates=True, nullable=True)
    #common_name: Series[str] = pa.Field(str_matches= "[A-Za-z]*",nullable=True)
    #iucn_url: Series[str] = pa.Field(str_matches= "^(https?:\/\/)?(www\.)?([A-Za-z0-9@:%._\+?~#=/&])*", allow_duplicates=True,nullable=True)
    #red_list_cat: Series[str] = pa.Field(str_matches= "^(dd|lc|nt|vu|en|cr|ew|ex)$",allow_duplicates=True, nullable=True) 
    species_id: Series[int] = pa.Field(nullable=False, gt=0)
    scientific_name: Series[str] = pa.Field(str_matches=r"[A-Za-z]*", nullable=True)
    common_name: Series[str] = pa.Field(str_matches=r"[A-Za-z]*", nullable=True)
    iucn_url: Series[str] = pa.Field(str_matches=r"^(https?:\/\/)?(www\.)?([A-Za-z0-9@:%._\+?~#=/&])*", nullable=True)
    red_list_cat: Series[str] = pa.Field(str_matches=r"^(dd|lc|nt|vu|en|cr|ew|ex)$", nullable=True)
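It can help to test the patterns in `Species_properties` on their own before relying on them. The sketch below uses the standard `re` module and assumes that `str_matches` behaves like a start-anchored `re.match`. Under that assumption, `[A-Za-z]*` accepts any string, and the category pattern rejects two of the codes listed in the Table 1 schema.

```python
import re

# Patterns copied from the Species_properties model.
name_pattern = re.compile(r"[A-Za-z]*")
cat_pattern = re.compile(r"^(dd|lc|nt|vu|en|cr|ew|ex)$")

# `*` allows zero characters, so a start-anchored match succeeds on any string.
print(bool(name_pattern.match("1234")))

# Categories listed in the Table 1 schema at the top of the notebook.
schema_cats = ["ex", "ew", "re", "cr", "en", "vu", "lr", "nt", "lc", "dd"]
rejected = [c for c in schema_cats if not cat_pattern.match(c)]
print(rejected)
```

If `re` and `lr` can appear in the data, either the pattern or the schema list would need to change; which one is correct is left open here.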

In [30]:
## Validate any pandas dataframe using the schema
validated = Species_properties.validate(table1)
validated.head()

Unnamed: 0,species_id,scientific_name,common_name,iucn_url,red_list_cat
0,7621003,Acanthus ebracteatus,Holy Mangrove,https://www.iucnredlist.org/search?query=Acant...,lc
1,6536949,Acanthus ilicifolius,Holy Mangrove,https://www.iucnredlist.org/search?query=Acant...,lc
2,7366131,Acrostichum aureum,,https://www.iucnredlist.org/search?query=Acros...,lc
3,7613396,Acrostichum danaeifolium,,https://www.iucnredlist.org/search?query=Acros...,lc
4,7614751,Acrostichum speciosum,,https://www.iucnredlist.org/search?query=Acros...,lc


### Save species table

In [37]:
table1[['scientific_name','common_name','iucn_url','red_list_cat']].to_csv('../../../data/Species_ID_list.csv', index = False)

### 3. Table 2: Species locations

### a) Convert species names to identifiers and add them as an array to the locations table

In [187]:
#sp.stack().reset_index().set_axis('From To Distance'.split(), axis=1)

In [89]:
# NOTE: this cell reads the location names from a column called 'Species';
# in the file loaded in section 1 that column is called 'Country'.
loc_col = 'Species'
table2 = pd.DataFrame(data={'country_name': sp[loc_col], 'species': np.nan})
for country in sp[loc_col]:
    # First row for this country
    row = sp[sp[loc_col] == country].iloc[0]
    # Species columns flagged as present (non-zero) for this country
    sp_names = [col for col in sp.columns if col != loc_col and row[col] != 0]
    ids = table1[table1.scientific_name.isin(sp_names)].species_id
    table2.loc[table2['country_name'] == country, 'species'] = str(list(ids))
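The commented-out `stack()` line above hints at a vectorised alternative to the per-country loop. One possible version is sketched below on toy data: reshape the wide 0/1 matrix to long form, keep the rows flagged as present, map names to ids and collect them per location. The frame `sp_toy`, the mapping `ids` and all values are hypothetical; with the real data the mapping would come from `table1`.

```python
import pandas as pd

# Toy presence/absence matrix and a made-up name -> id mapping.
sp_toy = pd.DataFrame({
    "Country": ["Aland", "Bland"],
    "Species a": [1, 0],
    "Species b": [1, 1],
})
ids = {"Species a": 101, "Species b": 102}

# Wide -> long: one row per (country, species) pair.
long = sp_toy.melt(id_vars="Country", var_name="scientific_name",
                   value_name="present")

# Keep only the species that are present, then attach their ids.
present = long[long["present"] == 1].assign(
    species_id=lambda d: d["scientific_name"].map(ids))

# Collect the ids of each country into a list.
table2_toy = (present.groupby("Country", sort=False)["species_id"]
              .apply(list)
              .reset_index(name="species"))
print(table2_toy)
```

This keeps the species as real lists rather than strings; whether the downstream table needs lists or their string form is a separate decision.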


In [90]:
table2.head()

Unnamed: 0,country_name,species
0,Comoros,"[7619457, 7610926, 7617531, 7627492, 7625290, ..."
1,Kenya,"[7619457, 7610926, 7617531, 7627492, 7625290, ..."
2,Madagascar,"[7619457, 7610926, 7617531, 7627492, 7625290, ..."
3,Mauritius,"[7366131, 7610926, 7622565, 7618520]"
4,Mayotte,"[7619457, 7610926, 7617531, 7627492, 7625290, ..."


### b) Parse location names to match API locations_id

In [94]:
### Read location ids file
dataLocation = requests.get('https://mangrove-atlas-api.herokuapp.com/api/v2/locations').json()['data']
loc = pd.DataFrame(dataLocation)
locapi = loc.name.values

In [95]:
!pip install fuzzywuzzy
!pip install python-Levenshtein

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp38-cp38-linux_x86_64.whl size=185027 sha256=ca3de73f9ffc4711abd27d71f9aafc4a1fb13bddb920042d7f63529cdaffc9d4
  Stored in directory: /home/jovyan/.cache/pip/wheels/d7/0c/76/042b46eb0df65c3ccd0338f791210c55ab79d209bcc269e2c7
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2


In [193]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [224]:
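def match_countries(table, country_col, API_country_list):
    """Add an 'API_country_name' column to `table` by fuzzy-matching
    `table[country_col]` against the names in `API_country_list`.
    When several API names pass a threshold, the last one tested is kept."""
    ### 1. Match on token-sorted similarity (score > 75)
    table['API_country_name'] = np.nan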
    for country in table[country_col].values: 
        for element in API_country_list:
            z = fuzz.token_sort_ratio(country,element)
            if z > 75:
                #print(f'{country} matched API {element}')
                table.loc[table[country_col]==country,'API_country_name']= element
    

    ### 2. Remove already matched countries from list and match partially
    unmatched_API_list = list(set (API_country_list)- set(table['API_country_name'].values))
    unmatched_table_index = table[table.API_country_name.isnull()].index.tolist()
    unmatched_table_list= table.loc[unmatched_table_index,country_col].dropna().values
    
    if len(unmatched_table_list) > 0 :
        for country in unmatched_table_list: 
            for element in unmatched_API_list:
                z = fuzz.partial_ratio(country,element)
                if z > 80:
                    #print(f'partial match of {country} to API {element}')
                    table.loc[table[country_col]==country,'API_country_name']= element


    ### 3. Remove already matched countries again and match on the plain similarity ratio with a lower threshold (score > 60)
    unmatched_API_list = list(set(API_country_list)- set(table['API_country_name'].values))
    unmatched_table_index = table[table.API_country_name.isnull()].index.tolist()
    unmatched_table_list= table.loc[unmatched_table_index,country_col].dropna().values
    
    if len(unmatched_table_list) > 0 :
        for country in unmatched_table_list: 
            for element in unmatched_API_list:
                z = fuzz.ratio(country,element)
                if z > 60:
                    #print(f'partial match of {country} to API {element}')
                    table.loc[table[country_col]==country,'API_country_name']= element
    return table
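In `match_countries`, every candidate above the threshold overwrites the previous one, so the last passing name wins rather than the closest one. A minimal sketch of a best-score variant follows. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the `fuzz` scorers, so the scores are on a 0-1 scale and will not equal the fuzzywuzzy values; `best_match`, the threshold and the candidate list are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.75):
    """Return the highest-scoring candidate, or None if below `threshold`."""
    # Client names use dots instead of spaces; normalise before scoring.
    cleaned = name.replace(".", " ").lower()
    scored = [(SequenceMatcher(None, cleaned, c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

# Illustrative candidate names, not the real API list.
api_names = ["Saint-Martin", "Sint Maarten", "Kenya"]
print(best_match("Sint.Maarten", api_names))
print(best_match("Saint.Martin", api_names))
print(best_match("Kuwait", api_names))
```

Picking the maximum keeps two similar client names from collapsing onto the same candidate when a closer one exists in the list.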

In [225]:
### Run function 
table2 = match_countries(table2, 'country_name', locapi)

In [226]:
### Client locations not matched in the API
missing_index = table2[table2.API_country_name.isnull()].index.tolist()
unmatched = table2.loc[missing_index,'country_name'].values
unmatched

array(['Kuwait', 'British.Indian.Ocean.Territory', 'Maldives',
       'Christmas.Island', 'Guam', 'Kiribati', 'Marshall.Islands',
       'Nauru', 'Northern.Mariana.Islands', 'American.Samoa', 'Niue',
       'Tuvalu', 'Wallis.and.Futuna', 'Bermuda', 'Anguilla', 'Aruba',
       'Barbados', 'Curacao', 'Dominica', 'Montserrat',
       'Saint.Barthélemy', 'The.Democratic.Republic.of.the.Congo',
       'Mauritania', 'Sao.Tome.and.Principe', 'Togo'], dtype=object)

In [228]:
len(unmatched)

25

In [229]:
## Country locations in API not matched in client file
list(set(loc.loc[loc['location_type']=='country','name'].values)- set(table2.loc[~table2.API_country_name.isnull(),'API_country_name'].values))

['Saint Martin', 'Brunei', 'Protected zone Australia/Papua New Guinea']

In [233]:
table2.loc[table2.API_country_name == 'Saint-Martin']

Unnamed: 0,country_name,species,API_country_name
87,Saint.Martin,"[7366131, 7613866, 7617944, 7612125, 7609219, ...",Saint-Martin
89,Sint.Maarten,"[7366131, 7613866, 7617944, 7612125, 7609219, ...",Saint-Martin
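The output above shows two client names mapped to the same API name. A quick way to list every such collision after matching is sketched here in plain Python. The first two pairs come from that output; the third is an illustrative extra row.

```python
from collections import Counter

# (client name, matched API name) pairs; the first two come from the
# output above, the third is an illustrative extra row.
matches = [
    ("Saint.Martin", "Saint-Martin"),
    ("Sint.Maarten", "Saint-Martin"),
    ("Kenya", "Kenya"),
]

# Count how many client names landed on each API name.
counts = Counter(api for _, api in matches)
collisions = {api: [c for c, a in matches if a == api]
              for api, n in counts.items() if n > 1}
print(collisions)
```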


In [None]:
#### Test with another list of countries

In [237]:
countries = pd.read_csv('../../datasets/Mangrove_Protection_Calculations_20210430.csv')

In [239]:
countries.columns

Index(['Country', 'Total Mangrove Composite',
       'Total Protected Mangrove Composite', 'Total Mangrove 1996',
       'Total Protected Mangrove 1996', 'Total Mangrove 2007',
       'Total Protected Mangrove 2007', 'Total Mangrove 2010',
       'Total Protected Mangrove 2010', 'Total Mangrove 2016',
       'Total Protected Mangrove 2016', 'Net Change in Total Mangrove Extent',
       'Net Change in Protected Mangrove Extent', 'Unnamed: 13',
       '% in protected areas in 1996', '% in protected areas in 2007',
       '% in protected areas in 2010', '% protected in 2016'],
      dtype='object')

In [240]:
countries = match_countries(countries, 'Country', locapi)

In [241]:
### Client locations not matched in the API
missing_index = countries[countries.API_country_name.isnull()].index.tolist()
unmatched = countries.loc[missing_index,'Country'].values

In [242]:
unmatched

array(['American Samoa', 'Anguilla', 'Aruba', 'Barbados', 'Curaçao',
       'Democratic Republic of the Congo', 'Dominica', 'Hong Kong',
       'Sao Tome and Principe', nan], dtype=object)

In [243]:
len(unmatched)

10

In [248]:
## Country locations in API not matched in client file
list(set(loc.loc[loc['location_type']=='country','name'].values)- set(countries.loc[~countries.API_country_name.isnull(),'API_country_name'].values))

['Peru',
 'Protected zone Australia/Papua New Guinea',
 'Saint Martin',
 'Singapore',
 'Congo, DRC']

In [285]:
table2= table2.merge(loc[['name','location_id']],left_on = 'API_country_name',right_on='name',how='left').drop(columns= ['name'])

### Model to validate Species locations

In [286]:
class Species_locations(pa.SchemaModel):
    country_name: Series[str] = pa.Field(str_matches=r"[A-Za-z\.]*", unique=False, nullable=True)
    species: Series[str] = pa.Field(allow_duplicates=True, nullable=True)
    API_country_name: Series[str] = pa.Field(str_matches=r"[A-Za-z]*", nullable=True, unique=True, ignore_na=True)
    location_id: Series[str] = pa.Field(str_matches=r"[0-9\_]*", nullable=True, unique=True, ignore_na=True)
### Can't set unique=True: it considers NaN a duplicate
### Known issue: https://issueexplorer.com/issue/pandera-dev/pandera/644

In [287]:
table2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 126 entries, 0 to 125
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   country_name      126 non-null    object
 1   species           126 non-null    object
 2   API_country_name  101 non-null    object
 3   location_id       101 non-null    object
dtypes: object(4)
memory usage: 4.9+ KB


In [288]:
validated = Species_locations.validate(table2)

SchemaError: series 'API_country_name' contains duplicate values:
23              NaN
25              NaN
41              NaN
45              NaN
46              NaN
47              NaN
48              NaN
49              NaN
53              NaN
54              NaN
57              NaN
58              NaN
63              NaN
73              NaN
75              NaN
76              NaN
78              NaN
79              NaN
83              NaN
84              NaN
89     Saint-Martin
111             NaN
120             NaN
122             NaN
125             NaN
Name: API_country_name, dtype: object
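The error above mixes the unmatched (NaN) rows with the one genuine repeat, `Saint-Martin`. One possible workaround, sketched here with plain pandas rather than a pandera option, is to check uniqueness on the non-missing values only. The toy series is made up to mirror that situation.

```python
import pandas as pd

# Toy column shaped like API_country_name: missing values plus one real repeat.
names = pd.Series(["Kenya", None, "Saint-Martin", None, "Saint-Martin"])

# Drop missing values first so only real names are compared.
present = names.dropna()
real_duplicates = present[present.duplicated(keep=False)]
print(real_duplicates)
```

With `keep=False` every occurrence of a repeated name is reported, which makes it easier to see which client rows collided.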