# QA for WDPA 2019

Stijn den Haan

Supervisor: Yichuan Shi

Bioinformatics internship • WCMC • 10 June to 9 August 2019

---

#### A brief note on terminology:

- **Offending fields** are fields (columns) that contain values that do not adhere to the rules set in the WDPA manual.
- Offending fields are subdivided into **three types**:
    - *Duplicate*: specifically, the WDPA_PID should be unique throughout the WDPA
    - *Inconsistent*: multiple records (rows) about the same protected area contain conflicting attribute information
        - Example: records with the same WDPAID have different values present in the NAME field, e.g. 'De Veluwe' vs 'De VeLUwe'.
    - *Invalid*: a record has an incorrect value for a particular field where only a particular set of values is allowed.
        - Example: DESIG_TYPE = 'Individual' while only 'National', 'International', and 'Regional' are allowed values for this field.
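
The three types can be illustrated on a small table. This is a minimal sketch on hypothetical toy records (not real WDPA data); the variable names `toy_df`, `duplicate_pids`, `inconsistent_wdpaids` and `invalid_pids` are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy records (not real WDPA data)
toy_df = pd.DataFrame({
    'WDPAID':     [1, 1, 2, 3],
    'WDPA_PID':   ['1_A', '1_B', '2', '2'],
    'NAME':       ['De Veluwe', 'De VeLUwe', 'Wadden Sea', 'Biesbosch'],
    'DESIG_TYPE': ['National', 'National', 'Individual', 'Regional'],
})

# Duplicate: the same WDPA_PID occurs in more than one record
duplicate_pids = toy_df.loc[toy_df['WDPA_PID'].duplicated(), 'WDPA_PID'].unique()

# Inconsistent: records sharing a WDPAID disagree on NAME
names_per_wdpaid = toy_df.groupby('WDPAID')['NAME'].nunique()
inconsistent_wdpaids = names_per_wdpaid[names_per_wdpaid > 1].index.tolist()

# Invalid: DESIG_TYPE holds a value outside the allowed set
allowed = ['National', 'International', 'Regional']
invalid_pids = toy_df.loc[~toy_df['DESIG_TYPE'].isin(allowed), 'WDPA_PID'].tolist()

print(list(duplicate_pids), inconsistent_wdpaids, invalid_pids)
```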
    
### **AP: add an explanation of the difference in attribute vs field.**
---

###### Load packages

In [7]:
import numpy as np
import pandas as pd
import arcpy
import datetime
import os

###### Load data

In [2]:
# set working directory
os.chdir("C:/Users/paintern/Desktop/Stijn/5. Scripts")

# load dummy data
wdpa_df = pd.read_csv('dummyData.csv', low_memory=False)

# load fields present in the WDPA
input_fields = ['WDPAID', 'WDPA_PID', 'PA_DEF', 'NAME', 'ORIG_NAME', 'DESIG', 
      'DESIG_ENG', 'DESIG_TYPE', 'IUCN_CAT', 'INT_CRIT', 'MARINE', 'REP_M_AREA', 
      'GIS_M_AREA', 'REP_AREA', 'GIS_AREA', 'NO_TAKE', 'NO_TK_AREA', 'STATUS', 'STATUS_YR', 
      'GOV_TYPE', 'OWN_TYPE', 'MANG_AUTH', 'MANG_PLAN', 'VERIF', 'METADATAID', 'SUB_LOC', 'PARENT_ISO3', 'ISO3', ]

---
### 1. Convert ArcGIS table to Pandas DataFrame
---

In [None]:
# Source: https://gist.github.com/d-wasserman/e9c98be1d0caebc2935afecf0ba239a0
def arcgis_table_to_df(in_fc, input_fields, query=""):
    """Function will convert an arcgis table into a pandas dataframe with an object ID index, and the selected
    input fields using an arcpy.da.SearchCursor."""

    OIDFieldName = arcpy.Describe(in_fc).OIDFieldName
    final_fields = [OIDFieldName] + input_fields
    data = [row for row in arcpy.da.SearchCursor(in_fc,final_fields,where_clause=query)]
    fc_dataframe = pd.DataFrame(data,columns=final_fields)
    fc_dataframe = fc_dataframe.set_index(OIDFieldName,drop=True)
    
    return fc_dataframe

# wdpa_df = arcgis_table_to_df(wdpa, input_fields)

---
### 2. Verify that the imported data is as expected
---

### **Action Point**: Add some error-trapping code to verify that the dimensions of the dataset are as expected: number of rows & columns OK? Otherwise, garbage in, garbage out.

If the dataset does not contain every field listed in `input_fields`, an error should be raised.

In [None]:
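def invalid_data_import(wdpa_df):
    '''
    Return True if the data imported does not contain all expected fields.
    '''
    
    # Compare the expected fields (input_fields, defined under 'Load data')
    # with the columns actually present in the DataFrame
    fields_expected = set(input_fields)
    fields_present = set(wdpa_df.columns)
    
    return not fields_expected.issubset(fields_present)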
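
The error trapping described in the action point could look roughly like the sketch below. It is one possible approach, not the project's implementation: `verify_import` is a hypothetical helper, and `min_rows` is an assumed threshold that would need to be set to whatever row count is expected for the real dataset.

```python
import pandas as pd

def verify_import(df, expected_fields, min_rows=1):
    """Raise ValueError if expected fields are missing or the table has too few rows.

    Hypothetical helper; min_rows is an assumed threshold.
    """
    missing = [f for f in expected_fields if f not in df.columns]
    if missing:
        raise ValueError("Missing fields: " + ", ".join(missing))
    if len(df) < min_rows:
        raise ValueError("Expected at least {} rows, found {}".format(min_rows, len(df)))
    # Return the dimensions so they can be logged or compared by the caller
    return df.shape

toy_df = pd.DataFrame({'WDPAID': [1], 'WDPA_PID': ['1']})
print(verify_import(toy_df, ['WDPAID', 'WDPA_PID']))
```

Raising an error, rather than returning a boolean, stops the checks from running on an incomplete table.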

---
### 3. Utility to pull rows from the WDPA, based on WDPA_PID input
---

In [None]:
def find_wdpa_rows(wdpa_df, wdpa_pid):
    '''
    Return a subset of dataframe based on wdpa_pid list

    Arguments:
    wdpa_df -- wdpa dataframe
    wdpa_pid -- a list of WDPA_PID
    '''
    
    return wdpa_df[wdpa_df['WDPA_PID'].isin(wdpa_pid)]
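
A short usage sketch on hypothetical toy data; the function body is repeated only so that the example runs on its own.

```python
import pandas as pd

def find_wdpa_rows(wdpa_df, wdpa_pid):
    # Same logic as the function above, repeated so the example is self-contained
    return wdpa_df[wdpa_df['WDPA_PID'].isin(wdpa_pid)]

# Hypothetical toy records
toy_df = pd.DataFrame({'WDPA_PID': ['1_A', '1_B', '2'],
                       'NAME': ['a', 'b', 'c']})

# Pull the rows for two WDPA_PIDs, e.g. those returned by one of the checks
subset = find_wdpa_rows(toy_df, ['1_B', '2'])
print(subset)
```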

---
### 4. Checks
---

Each check returns either a boolean (are there errors: True or False) or the offending WDPA_PIDs. The boolean check should be implemented as efficiently as possible, to avoid having to pull out all 'offending' rows when return_pid is set to **False**.

---
#### **i. Find duplicate WDPA_PIDs**
---

In [None]:
def duplicate_wdpa_pid(wdpa_df, return_pid=False):
    '''
    Return True if WDPA_PID is duplicate in the dataframe. 
    
    Return list of WDPA_PID, if duplicates are present 
    and return_pid is set True.
    '''

    if return_pid:
        ids = wdpa_df['WDPA_PID'] # make a variable of the field to find
        return ids[ids.duplicated()].unique() # return duplicate WDPA_PIDs

    return wdpa_df['WDPA_PID'].nunique() != wdpa_df.index.size # this returns True if there are WDPA_PID duplicates
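
One detail worth checking is how missing WDPA_PIDs are handled: `nunique()` skips missing values, whereas `duplicated()` does not flag a value that occurs only once. The sketch below uses a hypothetical series with one missing ID and no duplicates to show that the two approaches can disagree. Whether missing IDs occur in the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs: one missing value, no actual duplicates
ids = pd.Series(['1', '2', np.nan], name='WDPA_PID')

# nunique() ignores the missing value, so the counts differ
flag_nunique = ids.nunique() != ids.size

# duplicated() only flags values that occur more than once
flag_duplicated = bool(ids.duplicated().any())

print(flag_nunique, flag_duplicated)
```

If both return paths should agree, `duplicated().any()` is one option for the boolean check.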

---
#### **ii. Find inconsistent fields for the same WDPAID**
---

##### Parent function

In [None]:
def inconsistent_attributes_same_wdpaid(wdpa_df, 
                                        check_attributes, 
                                        return_pid=False):
    '''
    Factory of functions: this generic function is to be linked to
    the family of 'inconsistent' functions stated below. These latter 
    functions are to give information on which fields to check and pull 
    from the DataFrame. This function is the foundation of the others.
    
    Return True if inconsistent attributes are found for rows 
    sharing the same WDPAID

    Return list of WDPA_PID where inconsistencies occur, if 
    return_pid is set True

    Arguments:
    check_attributes -- the attribute(s) to check for inconsistency: 
                        a single field name or a list of field names
    '''

    # Accept a single field name as well as a list of field names
    if isinstance(check_attributes, str):
        check_attributes = [check_attributes]

    # Group by WDPAID and count the number of unique values 
    # per group for the field(s) in question
    n_unique = wdpa_df.groupby('WDPAID')[check_attributes].nunique()

    # Select the WDPAIDs with >1 unique value for any specified field
    inconsistent_wdpaid = n_unique[(n_unique > 1).any(axis=1)].index

    if return_pid:
        # Use the selected WDPAIDs to return the WDPA_PIDs
        return wdpa_df[wdpa_df['WDPAID'].isin(inconsistent_wdpaid)]['WDPA_PID'].values

    return len(inconsistent_wdpaid) >= 1

##### Child functions 

###### Inconsistent NAME

### **Change tuple to list**

In [None]:
def inconsistent_name_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'NAME' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''

    check_attributes = ('NAME')
    
    # Call the parent function with the field(s) to check
    # and the return_pid flag, in the order the parent expects:
    # (wdpa_df, check_attributes, return_pid)
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent ORIG_NAME

In [None]:
def inconsistent_orig_name_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'ORIG_NAME' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''

    check_attributes = ('ORIG_NAME')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent DESIG

In [None]:
def inconsistent_desig_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'DESIG' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    
    check_attributes = ('DESIG')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent DESIG_ENG

In [None]:
def inconsistent_desig_eng_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'DESIG_ENG' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    
    check_attributes = ('DESIG_ENG')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent DESIG_TYPE

In [None]:
def inconsistent_desig_type_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'DESIG_TYPE' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    
    check_attributes = ('DESIG_TYPE')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent IUCN_CAT

In [None]:
def inconsistent_iucn_cat_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'IUCN_CAT' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''

    check_attributes = ('IUCN_CAT')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent INT_CRIT

In [None]:
def inconsistent_int_crit_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'INT_CRIT' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''

    check_attributes = ('INT_CRIT')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent NO_TAKE

In [None]:
def inconsistent_no_take_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'NO_TAKE' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('NO_TAKE')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent STATUS

In [None]:
def inconsistent_status_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'STATUS' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('STATUS')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent STATUS_YR

In [None]:
def inconsistent_status_yr_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'STATUS_YR' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('STATUS_YR')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent GOV_TYPE

In [None]:
def inconsistent_gov_type_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'GOV_TYPE' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('GOV_TYPE')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent OWN_TYPE

In [None]:
def inconsistent_own_type_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'OWN_TYPE' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('OWN_TYPE')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent MANG_AUTH

In [None]:
def inconsistent_mang_auth_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'MANG_AUTH' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    
    check_attributes = ('MANG_AUTH')
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent MANG_PLAN

In [None]:
def inconsistent_mang_plan_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'MANG_PLAN' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('MANG_PLAN')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent VERIF

In [None]:
def inconsistent_verif_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'VERIF' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('VERIF')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent METADATAID

In [None]:
def inconsistent_metadataid_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'METADATAID' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('METADATAID')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent SUB_LOC

In [None]:
def inconsistent_sub_loc_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'SUB_LOC' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('SUB_LOC')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent PARENT_ISO3

In [None]:
def inconsistent_parent_iso3_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'PARENT_ISO3' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('PARENT_ISO3')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

###### Inconsistent ISO3

In [None]:
def inconsistent_iso3_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'ISO3' attribute
    for records with the same WDPAID
    
    Input: WDPA in pandas dataframe 
    Output: list with WDPAIDs containing attribute inconsistencies
    '''
    check_attributes = ('ISO3')

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)
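
Because the child functions differ only in the field name, they could also be generated rather than written out one by one. The sketch below is one possible alternative using `functools.partial`, not the notebook's approach. `inconsistent_fields_same_wdpaid` is a compact, hypothetical stand-in for the parent function, and `fields_to_check` and `inconsistency_checks` are made-up names.

```python
from functools import partial
import pandas as pd

def inconsistent_fields_same_wdpaid(wdpa_df, check_attributes, return_pid=False):
    # Compact stand-in for the parent function so this example runs on its own
    n_unique = wdpa_df.groupby('WDPAID')[check_attributes].nunique()
    bad_ids = n_unique[(n_unique > 1).any(axis=1)].index
    if return_pid:
        return wdpa_df[wdpa_df['WDPAID'].isin(bad_ids)]['WDPA_PID'].values
    return len(bad_ids) >= 1

# Hypothetical registry: one check per field, built in a loop
fields_to_check = ['NAME', 'ORIG_NAME', 'DESIG']
inconsistency_checks = {
    field: partial(inconsistent_fields_same_wdpaid, check_attributes=[field])
    for field in fields_to_check
}

# Hypothetical toy records: only NAME differs within WDPAID 1
toy_df = pd.DataFrame({
    'WDPAID': [1, 1, 2],
    'WDPA_PID': ['1_A', '1_B', '2'],
    'NAME': ['De Veluwe', 'De VeLUwe', 'Biesbosch'],
    'ORIG_NAME': ['x', 'x', 'y'],
    'DESIG': ['Park', 'Park', 'Park'],
})

results = {field: check(toy_df) for field, check in inconsistency_checks.items()}
print(results)
```

Binding `check_attributes` by keyword also rules out mixing up the argument order when the parent function is called.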

---
#### **iii. Find invalid values in fields**
--- 

##### Parent function

In [None]:
def invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid=False):
    '''
    This function checks the WDPA for invalid values and returns a list of WDPA_PIDs 
    that have invalid values for the specified field(s).
    
    This function is to be linked to the family of 'invalid field'-checking functions. 
    These latter functions give specific information on the fields to be checked, and how.
        
    Return True if invalid values are found in specified fields
    Return list of WDPA_PIDs with invalid fields, if return_pid is set True

    ## Arguments ##
    
    field -- the field to be checked for invalid values, in a list
    
    field_allowed_values -- a list of expected values in each field, case sensitive
    
    condition_field -- a constraint of another field for evaluating 
                       invalid values, in list; leave "" if no condition specified
    
    condition_crit --  a list of values for which the condition_field 
                       needs to be evaluated; leave "" if no condition specified

    ## Example of function usage ##
    invalid_value_in_field(
        wdpa_df,
        field=["DESIG_ENG"],
        field_allowed_values=["Ramsar Site, Wetland of International Importance", 
                              "UNESCO-MAB Biosphere Reserve", 
                              "World Heritage Site (natural or mixed)"],
        condition_field=["DESIG_TYPE"],
        condition_crit=["International"],
        return_pid=True)
    '''

    if field and field_allowed_values and condition_field and condition_crit:
        invalid_wdpa_pid = wdpa_df[~wdpa_df[field[0]].isin(field_allowed_values) & 
                           wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values

    # If no condition_field and condition_crit are specified
    else:
        if field and field_allowed_values:
            invalid_wdpa_pid = wdpa_df[~wdpa_df[field[0]].isin(field_allowed_values)]['WDPA_PID'].values
        else: 
            raise Exception("ERROR: field(s) and/or condition(s) to test are not specified")
            
    if return_pid:
        # return list with invalid WDPA_PIDs
        return invalid_wdpa_pid
    
    return len(invalid_wdpa_pid) >= 1
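
The conditional branch combines two boolean masks: one for the rows the rule applies to, and one for the rows whose value is not allowed. A minimal sketch on hypothetical toy records, with the masks named separately (`in_scope` and `not_allowed` are made-up names):

```python
import pandas as pd

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPA_PID': ['1', '2', '3'],
    'DESIG_TYPE': ['International', 'International', 'National'],
    'DESIG_ENG': ['UNESCO-MAB Biosphere Reserve', 'National Park', 'National Park'],
})
allowed = ['UNESCO-MAB Biosphere Reserve',
           'World Heritage Site (natural or mixed)']

in_scope = toy_df['DESIG_TYPE'].isin(['International'])  # rows the rule applies to
not_allowed = ~toy_df['DESIG_ENG'].isin(allowed)         # rows with a disallowed value

# Only rows that are in scope AND hold a disallowed value are invalid
invalid_pids = toy_df.loc[in_scope & not_allowed, 'WDPA_PID'].tolist()
print(invalid_pids)
```

Record '3' also has a DESIG_ENG outside the list, but it is not flagged because its DESIG_TYPE is 'National' and the rule does not apply to it.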

##### Child functions

###### Invalid PA_DEF

In [None]:
def invalid_pa_def(wdpa_df, return_pid=False):
    '''
    Return True if PA_DEF not 1
    Return list of WDPA_PIDs where PA_DEF is not 1, if return_pid is set True
    '''

    field = ['PA_DEF']
    field_allowed_values = ['1'] # based on WDPA data type
    condition_field = []
    condition_crit = []

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid DESIG_ENG International

In [None]:
def invalid_desig_eng_international(wdpa_df, return_pid=False):
    '''
    Return True if DESIG_ENG is invalid while DESIG_TYPE is 'International'
    Return list of WDPA_PIDs where DESIG_ENG is invalid, if return_pid is set True
    '''
    
    field = ['DESIG_ENG']
    field_allowed_values = ['Ramsar Site, Wetland of International Importance', 
                            'UNESCO-MAB Biosphere Reserve', 
                            'World Heritage Site (natural or mixed)']
    condition_field = ['DESIG_TYPE']
    condition_crit = ['International']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid DESIG_ENG Regional

In [1]:
def invalid_desig_eng_regional(wdpa_df, return_pid=False):
    '''
    Return True if DESIG_ENG is invalid while DESIG_TYPE is 'Regional'
    Return list of WDPA_PIDs where DESIG_ENG is invalid, if return_pid is set True
    '''
    
    field = ['DESIG_ENG']
    field_allowed_values = ['Baltic Sea Protected Area (HELCOM)', 
                            'Specially Protected Area (Cartagena Convention)', 
                            'Marine Protected Area (CCAMLR)', 
                            'Marine Protected Area (OSPAR)', 
                            'Site of Community Importance (Habitats Directive)', 
                            'Special Protection Area (Birds Directive)', 
                            'Specially Protected Areas of Mediterranean Importance (Barcelona Convention)']
    condition_field = ['DESIG_TYPE']
    condition_crit = ['Regional']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid DESIG_ENG & IUCN_CAT - UNESCO-MAB & World Heritage Sites

In [None]:
def invalid_desig_eng_iucn_cat(wdpa_df, return_pid=False):
    '''
    Return True if DESIG_ENG is unequal to 'UNESCO-MAB (...)' or 'World Heritage (...)' 
    and IUCN_CAT is 'Not Applicable'
    Return list of WDPA_PIDs where DESIG_ENG is invalid, if return_pid is set True
    '''
    
    field = ['DESIG_ENG']
    field_allowed_values = ['UNESCO-MAB Biosphere Reserve', 
                            'World Heritage Site (natural or mixed)']
    condition_field = ['IUCN_CAT']
    condition_crit = ['Not Applicable']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid DESIG_ENG & IUCN_CAT - UNESCO-MAB & World Heritage Sites - inverse

In [None]:
def invalid_desig_eng_iucn_cat_inverse(wdpa_df, return_pid=False):
    '''
    Return True if IUCN_CAT is unequal to 'Not Applicable' 
    and DESIG_ENG is 'UNESCO-MAB (...)' or 'World Heritage (...)'
    Return list of WDPA_PIDs where IUCN_CAT is invalid, if return_pid is set True
    '''
    
    field = ['IUCN_CAT']
    field_allowed_values = ['Not Applicable']
    condition_field = ['DESIG_ENG']
    condition_crit = ['UNESCO-MAB Biosphere Reserve', 
                      'World Heritage Site (natural or mixed)']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid INT_CRIT & DESIG_ENG  - Ramsar Site & World Heritage Sites

In [None]:
def invalid_int_crit_desig_eng_ramsar_whs(wdpa_df, return_pid=False):
    '''
    Return True if INT_CRIT is unequal to the allowed values (>1000 possible values) 
    and DESIG_ENG equals 'Ramsar Site (...)' or 'World Heritage Sites (...)'
    Return list of WDPA_PIDs where INT_CRIT is invalid, if return_pid is set True
    '''
    
    # Function to create the possible INT_CRIT combinations
    def generate_combinations():
        import itertools
        collection = []
        INT_CRIT_ELEMENTS = ['(i)','(ii)','(iii)','(iv)',
                             '(v)','(vi)','(vii)','(viii)',
                             '(ix)','(x)']
        for length_combi in range(1, len(INT_CRIT_ELEMENTS)+1): # for 1 - 10 elements
            for combi in itertools.combinations(INT_CRIT_ELEMENTS, length_combi): # generate combinations
                collection.append(''.join(combi)) # append to list, remove the '' in each combination
        return collection
   
    # Arguments
    field = ['INT_CRIT']
    field_allowed_values_extra = ['Not Reported']
    field_allowed_values =  generate_combinations() + field_allowed_values_extra
    condition_field = ['DESIG_ENG']
    condition_crit = ['Ramsar Site, Wetland of International Importance', 
                      'World Heritage Site (natural or mixed)']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)
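
The size of the allowed INT_CRIT set can be checked with a stand-alone version of the combination generator. This sketch mirrors `generate_combinations` above. It assumes criteria are concatenated without separators and in ascending order, which should be verified against the actual data.

```python
from itertools import combinations

elements = ['(i)', '(ii)', '(iii)', '(iv)', '(v)',
            '(vi)', '(vii)', '(viii)', '(ix)', '(x)']

# All non-empty subsets of the ten criteria, joined in their original order
allowed = [''.join(combo)
           for n in range(1, len(elements) + 1)
           for combo in combinations(elements, n)]

print(len(allowed))            # 2**10 - 1 non-empty subsets
print('(i)(iii)' in allowed)   # ordered combination
print('(iii)(i)' in allowed)   # same criteria, different order
```

Because `combinations` preserves the input order, a value such as '(iii)(i)' is not in the list and would be reported as invalid.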

###### Invalid DESIG_TYPE

In [None]:
def invalid_desig_type(wdpa_df, return_pid=False):
    '''
    Return True if DESIG_TYPE is not "National", "Regional", "International" or "Not Applicable"
    Return list of WDPA_PIDs where DESIG_TYPE is invalid, if return_pid is set True
    '''

    field = ['DESIG_TYPE']
    field_allowed_values = ['National', 'Regional', 'International', 'Not Applicable']
    condition_field = []
    condition_crit = []

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid IUCN_CAT

In [None]:
def invalid_iucn_cat(wdpa_df, return_pid=False):
    '''
    Return True if IUCN_CAT is not equal to allowed values
    Return list of WDPA_PIDs where IUCN_CAT is invalid, if return_pid is set True
    '''
    
    field = ['IUCN_CAT']
    field_allowed_values = ["Ia", "Ib", "II", "III", 
                            "IV", "V", "VI", 
                            "Not Reported", 
                            "Not Applicable", 
                            "Not Assigned"]
    condition_field = []
    condition_crit = []
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid IUCN_CAT - UNESCO-MAB and World Heritage Sites

In [2]:
def invalid_iucn_cat_unesco_whs(wdpa_df, return_pid=False):
    '''
    Return True if IUCN_CAT is not "Not Applicable" while DESIG_ENG is UNESCO-MAB or World Heritage Site
    Return list of WDPA_PIDs where IUCN_CAT is invalid, if return_pid is set True
    '''
    
    field = ['IUCN_CAT']
    field_allowed_values = ['Not Applicable']
    condition_field = ['DESIG_ENG']
    condition_crit = ['UNESCO-MAB Biosphere Reserve', 
                      'World Heritage Site (natural or mixed)']
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid MARINE

In [None]:
def invalid_marine(wdpa_df, return_pid=False):
    '''
    Return True if MARINE is not in [0,1,2]
    Return list of WDPA_PIDs where MARINE is invalid, if return_pid is set True
    '''

    field = ['MARINE']
    field_allowed_values = ['0','1','2']
    condition_field = []
    condition_crit = []

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)
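
The allowed values here are strings, so the check depends on how the MARINE column was read in. pandas treats integers and strings as distinct in `isin`, so an integer column compared against `['0','1','2']` would mark every record as invalid. The sketch below uses a hypothetical integer column. Whether the real column is stored as text or as numbers is an assumption to verify.

```python
import pandas as pd

marine_as_int = pd.Series([0, 1, 2])  # hypothetical: column parsed as integers
allowed = ['0', '1', '2']

# Integers never match the string values
matches_raw = marine_as_int.isin(allowed).tolist()

# Casting to string first makes the comparison independent of the dtype
matches_cast = marine_as_int.astype(str).isin(allowed).tolist()

print(matches_raw, matches_cast)
```

The same consideration applies to other checks that compare against string versions of numbers, such as STATUS_YR.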

###### Invalid NO_TAKE & MARINE = 0

In [None]:
def invalid_no_take_marine0(wdpa_df, return_pid=False):
    '''
    Return True if NO_TAKE is not equal to 'Not Applicable' and MARINE = 0
    I.e. test whether terrestrial PAs (MARINE = 0) have a NO_TAKE other than 'Not Applicable'
    Return list of WDPA_PIDs where NO_TAKE is invalid, if return_pid is set True
    '''

    field = ['NO_TAKE']
    field_allowed_values = ['Not Applicable']
    condition_field = ['MARINE']
    condition_crit = ['0']

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid NO_TAKE & MARINE = [1,2]

In [None]:
def invalid_no_take_marine12(wdpa_df, return_pid=False):
    '''
    Return True if NO_TAKE is not in ['All', 'Part', 'None', 'Not Reported'] while MARINE = [1, 2]
    I.e. check whether coastal and marine sites (MARINE = [1, 2]) have an invalid NO_TAKE value.
    Return list of WDPA_PIDs where NO_TAKE is invalid, if return_pid is set True
    '''

    field = ['NO_TAKE']
    field_allowed_values = ['All', 'Part', 'None', 'Not Reported']
    condition_field = ['MARINE']
    condition_crit = ['1', '2']

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid NO_TK_AREA & MARINE

In [None]:
def invalid_no_tk_area_marine(wdpa_df, return_pid=False):
    '''
    Return True if NO_TK_AREA is not in [0] while MARINE = [0]
    I.e. check whether NO_TK_AREA is unequal to 0 for terrestrial PAs.
    Return list of WDPA_PIDs where NO_TK_AREA is invalid, if return_pid is set True
    '''

    field = ['NO_TK_AREA']
    field_allowed_values = [0]
    condition_field = ['MARINE']
    condition_crit = ['0']

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid NO_TK_AREA & NO_TAKE

In [1]:
def invalid_no_tk_area_no_take(wdpa_df, return_pid=False):
    '''
    Return True if NO_TK_AREA is not in [0] while NO_TAKE = 'Not Applicable'
    Return list of WDPA_PIDs where NO_TK_AREA is invalid, if return_pid is set True
    '''

    field = ['NO_TK_AREA']
    field_allowed_values = [0]
    condition_field = ['NO_TAKE']
    condition_crit = ['Not Applicable']

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid STATUS

In [None]:
def invalid_status(wdpa_df, return_pid=False):
    '''
    Return True if STATUS is not in ["Proposed", "Inscribed", "Adopted", "Designated", "Established"]
    Return list of WDPA_PIDs where STATUS is invalid, if return_pid is set True
    '''

    field = ['STATUS']
    field_allowed_values = ["Proposed", "Inscribed", "Adopted", "Designated", "Established"]
    condition_field = []
    condition_crit = []

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid STATUS_YR

In [None]:
def invalid_status_yr(wdpa_df, return_pid=False):
    '''
    Return True if STATUS_YR is not 0 and not a year between 1819 and the current year
    Return list of WDPA_PIDs where STATUS_YR is invalid, if return_pid is set True
    '''
    
    field = ['STATUS_YR']
    year = datetime.date.today().year # obtain current year
    yearArray = [0] + np.arange(1819, year + 1, 1).tolist() # 0 plus all years from 1819 up to the current year
    field_allowed_values = [str(x) for x in yearArray] # change all integers to strings
    condition_field = []
    condition_crit = []
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid GOV_TYPE

In [6]:
def invalid_gov_type(wdpa_df, return_pid=False):
    '''
    Return True if GOV_TYPE is invalid
    Return list of WDPA_PIDs where GOV_TYPE is invalid, if return_pid is set True
    '''
    
    field = ['GOV_TYPE']
    field_allowed_values = ['Federal or national ministry or agency', 
                            'Sub-national ministry or agency', 
                            'Government-delegated management', 
                            'Transboundary governance', 
                            'Collaborative governance', 
                            'Joint governance', 
                            'Individual landowners', 
                            'Non-profit organisations', 
                            'For-profit organisations', 
                            'Indigenous peoples', 
                            'Local communities', 
                            'Not Reported']
    
    condition_field = []
    condition_crit = []
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid OWN_TYPE

In [None]:
def invalid_own_type(wdpa_df, return_pid=False):
    '''
    Return True if OWN_TYPE is invalid
    Return list of WDPA_PIDs where OWN_TYPE is invalid, if return_pid is set True
    '''
    
    field = ['OWN_TYPE']
    field_allowed_values = ['State', 
                            'Communal', 
                            'Individual landowners', 
                            'For-profit organisations', 
                            'Non-profit organisations', 
                            'Joint ownership', 
                            'Multiple ownership', 
                            'Contested', 
                            'Not Reported']
    condition_field = []
    condition_crit = []
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid VERIF

In [None]:
def invalid_verif(wdpa_df, return_pid=False):
    '''
    Return True if VERIF is invalid
    Return list of WDPA_PIDs where VERIF is invalid, if return_pid is set True
    '''
    
    field = ['VERIF']
    field_allowed_values = ['State Verified', 
                            'Expert Verified', 
                            'Not Reported']
    condition_field = []
    condition_crit = []
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, condition_field, condition_crit, return_pid)

###### Invalid METADATAID
#### **AP: make this function -- use duplicate checker such as WDPA_PID?**

In [None]:
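def invalid_metadataid(wdpa_df, return_pid=False):
    # Placeholder: this check has not been implemented yet. Planned steps:
    #   1. pull out the data table
    #   2. read the metadata IDs
    #   3. check the unique values
    raise NotImplementedError("invalid_metadataid is not implemented yet")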

---
#### **iv. Marine and GIS tabular data checks**
---

##### Parent function: GIS or Reported area is invalid.

### AP: rm condition_field & condition_crit

In [None]:
def area_invalid_size(wdpa_df, field_small_area, field_large_area, 
                      condition_field, condition_crit, return_pid=False):
    '''
    Generic parent function for the family of 'area' functions below.
    Those functions specify which fields to pull from the DataFrame 
    and compare; this function performs the comparison.
    
    Return True if the size of the small_area is invalid compared to large_area

    Return list of WDPA_PIDs where small_area is invalid compared to large_area,
    if return_pid is set True

    ## Arguments ##
    field_small_area  -- list with the attribute to check for size - supposedly smaller
    field_large_area  -- list with the attribute to check for size - supposedly larger
    condition_field   -- a constraint of another field for evaluating 
                         invalid values, in list; leave [] if no condition specified
    condition_crit    -- a list of values for which the condition_field 
                         needs to be evaluated; leave [] if no condition specified
    '''
    
    size_threshold = 1.0001 # due to the rounding of numbers, you can have many false positives without a threshold.

    if field_small_area and field_large_area and condition_field and condition_crit:
        invalid_values = wdpa_df[(wdpa_df[field_small_area[0]] > size_threshold*wdpa_df[field_large_area[0]]) &
                         wdpa_df[condition_field[0]].isin(condition_crit)]["WDPA_PID"].values

    # If no condition_field and condition_crit are specified
    else:
        if field_small_area and field_large_area:
            invalid_values = wdpa_df[wdpa_df[field_small_area[0]] > size_threshold*wdpa_df[field_large_area[0]]]['WDPA_PID'].values
        else: 
            raise Exception("ERROR: field(s) and/or condition(s) to test are not specified")
            
    if return_pid:
        return invalid_values
    
    return len(invalid_values) >= 1
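
To illustrate the effect of `size_threshold`, here is a minimal sketch on a toy DataFrame. The records and the names `toy_df`, `strict` and `tolerant` are invented for illustration and are not taken from the WDPA.

```python
import pandas as pd

# Toy records; all values are invented for illustration only
toy_df = pd.DataFrame({
    'WDPA_PID':   ['1', '2', '3'],
    'NO_TK_AREA': [10.0, 10.0005, 12.0],
    'REP_M_AREA': [10.0, 10.0, 10.0],
})

size_threshold = 1.0001  # same tolerance as in area_invalid_size

# A strict comparison also flags PID 2, which differs only by rounding
strict = toy_df[toy_df['NO_TK_AREA'] > toy_df['REP_M_AREA']]['WDPA_PID'].values

# The comparison with tolerance flags only PID 3
tolerant = toy_df[toy_df['NO_TK_AREA'] >
                  size_threshold * toy_df['REP_M_AREA']]['WDPA_PID'].values

print(list(strict))    # ['2', '3']
print(list(tolerant))  # ['3']
```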

###### Area invalid: NO_TK_AREA and REP_M_AREA

In [None]:
def area_invalid_no_tk_area_rep_m_area(wdpa_df, return_pid=False):
    '''
    Return True if NO_TK_AREA is larger than REP_M_AREA
    Return list of WDPA_PIDs where NO_TK_AREA is larger than REP_M_AREA if return_pid=True
    '''
    
    field_small_area = ['NO_TK_AREA']
    field_large_area = ['REP_M_AREA']
    condition_field = []
    condition_crit = []
    
    return area_invalid_size(wdpa_df, field_small_area, field_large_area, condition_field, condition_crit, return_pid)

###### Area invalid: GIS_M_AREA and GIS_AREA

In [None]:
def area_invalid_gis_m_area_gis_area(wdpa_df, return_pid=False):
    '''
    Return True if GIS_M_AREA is larger than GIS_AREA
    Return list of WDPA_PIDs where GIS_M_AREA is larger than GIS_AREA, if return_pid=True
    '''
    
    field_small_area = ['GIS_M_AREA']
    field_large_area = ['GIS_AREA']
    condition_field = []
    condition_crit = []
    
    return area_invalid_size(wdpa_df, field_small_area, field_large_area, condition_field, condition_crit, return_pid)

###### Area invalid: REP_M_AREA and REP_AREA

In [None]:
def area_invalid_rep_m_area_rep_area(wdpa_df, return_pid=False):
    '''
    Return True if REP_M_AREA is larger than REP_AREA
    Return list of WDPA_PIDs where REP_M_AREA is larger than REP_AREA, if return_pid=True
    '''
    
    field_small_area = ['REP_M_AREA']
    field_large_area = ['REP_AREA']
    condition_field = []
    condition_crit = []
    
    return area_invalid_size(wdpa_df, field_small_area, field_large_area, condition_field, condition_crit, return_pid)

---
#### **v. Hardcoded**
---

###### Invalid MARINE based on GIS area

In [None]:
def area_invalid_marine(wdpa_df, return_pid=False):
    '''
    Assign a marine_value based on GIS calculations, return True if marine_value is unequal to MARINE
    Return list of WDPA_PIDs where MARINE is invalid, if return_pid is set True
    '''
    
    # set min and max for 'coastal' designation (MARINE = 1)
    coast_min = 0.1
    coast_max = 0.9
    
    # create new column with proportion marine vs total GIS area 
    wdpa_df['marine_proportion'] = wdpa_df['GIS_M_AREA'] / wdpa_df['GIS_AREA']
    
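    def assign_marine_value(row):
        # values exactly equal to coast_min or coast_max are assumed to be coastal;
        # rows with a NaN proportion (e.g. 0 / 0) get None
        if row['marine_proportion'] < coast_min:
            return '0'
        elif coast_min <= row['marine_proportion'] <= coast_max:
            return '1'
        elif row['marine_proportion'] > coast_max:
            return '2'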
    
    # calculate the marine_value
    wdpa_df['marine_value'] = wdpa_df.apply(assign_marine_value, axis=1)
    
    # store the rows for which marine_value != MARINE 
    if return_pid:
        return wdpa_df[wdpa_df['marine_value'] != wdpa_df['MARINE']]["WDPA_PID"].values
    
    return len(wdpa_df[wdpa_df['marine_value'] != wdpa_df['MARINE']]) >= 1
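
The row-wise `apply` can also be written with vectorised comparisons, which leaves the input DataFrame unmodified. This is a minimal sketch using `np.select` as an alternative technique, not the notebook's method. It assumes that proportions exactly on a boundary count as coastal, and all records in `toy_df` are invented for illustration.

```python
import numpy as np
import pandas as pd

coast_min, coast_max = 0.1, 0.9

# Toy records; all values are invented for illustration only
toy_df = pd.DataFrame({
    'WDPA_PID':   ['1', '2', '3', '4'],
    'GIS_M_AREA': [0.0, 5.0, 10.0, 5.0],
    'GIS_AREA':   [10.0, 10.0, 10.0, 10.0],
    'MARINE':     ['0', '1', '2', '2'],
})

proportion = toy_df['GIS_M_AREA'] / toy_df['GIS_AREA']

# Conditions are evaluated in order; NaN proportions match none of them
# and fall through to the empty-string default
marine_value = pd.Series(
    np.select([proportion < coast_min, proportion > coast_max, proportion.notna()],
              ['0', '2', '1'],
              default=''),
    index=toy_df.index)

# WDPA_PIDs where the calculated value disagrees with MARINE
mismatch = toy_df[marine_value != toy_df['MARINE']]['WDPA_PID'].values

print(list(marine_value))  # ['0', '1', '2', '1']
print(list(mismatch))      # ['4']
```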

###### Invalid: REP_AREA >= 2* GIS_AREA

### AP: normalize & 2xstdev from mean of distribution

In [None]:
def area_invalid_rep_area_gis_area(wdpa_df, return_pid=False):
    '''
    Return True if REP_AREA is at least 2 times as large as GIS_AREA
    Return list of WDPA_PIDs where REP_AREA is at least 2 times as large as GIS_AREA, if return_pid=True
    '''
    
    size_threshold = 2
    field_small_area = ['REP_AREA']
    field_large_area = ['GIS_AREA']
    
    invalid_wdpa_pid = wdpa_df[wdpa_df[field_small_area[0]] >= size_threshold*wdpa_df[field_large_area[0]]]['WDPA_PID'].values
    
    if return_pid:
        return invalid_wdpa_pid
    
    return len(invalid_wdpa_pid) >= 1
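
The AP note above suggests replacing the fixed factor of 2 with a rule based on the distribution. One possible reading, sketched here under assumptions: standardise the REP_AREA / GIS_AREA ratio and flag records that lie more than 2 standard deviations from the mean. The helper `area_outliers_rep_area_gis_area`, its `n_std` parameter and the toy records are hypothetical; records with GIS_AREA = 0 are not handled.

```python
import pandas as pd

# Toy records; all values are invented for illustration only
toy_df = pd.DataFrame({
    'WDPA_PID': [str(i) for i in range(1, 11)],
    'REP_AREA': [10.0] * 9 + [1000.0],
    'GIS_AREA': [10.0] * 10,
})

def area_outliers_rep_area_gis_area(wdpa_df, n_std=2):
    '''
    Hypothetical helper, not part of the notebook.
    Return WDPA_PIDs where the REP_AREA / GIS_AREA ratio lies more than
    n_std standard deviations from the mean ratio
    '''
    ratio = wdpa_df['REP_AREA'] / wdpa_df['GIS_AREA']

    # z-score: distance from the mean in units of (sample) standard deviation
    z_score = (ratio - ratio.mean()) / ratio.std()

    return wdpa_df[z_score.abs() > n_std]['WDPA_PID'].values

outliers = area_outliers_rep_area_gis_area(toy_df)
print(list(outliers))  # ['10']
```

Because the mean and standard deviation are themselves pulled towards extreme values, a real implementation might need a transformation or a more robust statistic; that choice is left open here.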

###### Invalid: REP_M_AREA >= 2* GIS_M_AREA
### As above

In [1]:
def area_invalid_rep_m_area_gis_m_area(wdpa_df, return_pid=False):
    '''
    Return True if REP_M_AREA is at least 2 times as large as GIS_M_AREA
    Return list of WDPA_PIDs where REP_M_AREA is at least 2 times as large as GIS_M_AREA, if return_pid=True
    '''
    
    size_threshold = 2
    field_small_area = ['REP_M_AREA']
    field_large_area = ['GIS_M_AREA']
    
    invalid_wdpa_pid = wdpa_df[wdpa_df[field_small_area[0]] >= size_threshold*wdpa_df[field_large_area[0]]]['WDPA_PID'].values
    
    if return_pid:
        return invalid_wdpa_pid
    
    return len(invalid_wdpa_pid) >= 1

###### Invalid: GIS_AREA <= 0.0001 km² (100 m²)

In [1]:
def area_invalid_gis_area(wdpa_df, return_pid=False):
    '''
    Return True if GIS_AREA is smaller than or equal to 0.0001 km²
    Return list of WDPA_PIDs where GIS_AREA is smaller than or equal to 0.0001 km², if return_pid=True
    '''
    
    size_threshold = 0.0001
    field_gis_area = ['GIS_AREA']
    
    if return_pid:
        return wdpa_df[wdpa_df[field_gis_area[0]] <= size_threshold]['WDPA_PID'].values
    
    return len(wdpa_df[wdpa_df[field_gis_area[0]] <= size_threshold]['WDPA_PID'].values) >= 1

###### Invalid: REP_M_AREA <= 0 when MARINE = 1 or 2

In [None]:
def area_invalid_rep_m_area_marine12(wdpa_df, return_pid=False):
    '''
    Return True if REP_M_AREA is smaller than or equal to 0 while MARINE = 1 or 2
    Return list of WDPA_PIDs where REP_M_AREA is invalid, if return_pid=True
    '''
    
    field = ['REP_M_AREA']
    field_allowed_values = [0]
    condition_field = ['MARINE']
    condition_crit = ['1','2']
    
    if return_pid:
        return wdpa_df[(wdpa_df[field[0]] <= field_allowed_values[0]) & 
               wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values
    
    return len(wdpa_df[(wdpa_df[field[0]] <= field_allowed_values[0]) & 
                    wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values) >= 1

###### Invalid: GIS_M_AREA <= 0 when MARINE = 1 or 2

In [1]:
def area_invalid_gis_m_area_marine12(wdpa_df, return_pid=False):
    '''
    Return True if GIS_M_AREA is smaller than or equal to 0 while MARINE = 1 or 2
    Return list of WDPA_PIDs where GIS_M_AREA is invalid, if return_pid=True
    '''
    
    field = ['GIS_M_AREA']
    field_allowed_values = [0]
    condition_field = ['MARINE']
    condition_crit = ['1','2']
     
    if return_pid:
        return wdpa_df[(wdpa_df[field[0]] <= field_allowed_values[0]) & 
                    wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values
        
    return len(wdpa_df[(wdpa_df[field[0]] <= field_allowed_values[0]) & 
                    wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values) >= 1

###### Invalid NO_TAKE, NO_TK_AREA and REP_M_AREA

In [None]:
def invalid_no_take_no_tk_area_rep_m_area(wdpa_df, return_pid=False):
    '''
    Return True if NO_TAKE = 'All' while the REP_M_AREA is unequal to NO_TK_AREA
    Return list of WDPA_PIDs where NO_TAKE is invalid, if return_pid=True
    '''

    # Select rows with NO_TAKE = 'All'
    no_take_all = wdpa_df[wdpa_df['NO_TAKE']=='All']
    
    # Select rows where the REP_M_AREA is unequal to NO_TK_AREA
    invalid_wdpa_pid = no_take_all[no_take_all['REP_M_AREA'] != no_take_all['NO_TK_AREA']]['WDPA_PID'].values
    
    if return_pid:
        return invalid_wdpa_pid
    
    return len(invalid_wdpa_pid) >= 1
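
An exact `!=` on floating-point areas may flag records that differ only by rounding, the same concern that motivates `size_threshold` in `area_invalid_size`. Below is a minimal sketch of a tolerant comparison with `np.isclose`. The relative tolerance of 0.0001 is an assumption chosen to mirror that threshold, and the toy records are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy records; all values are invented for illustration only
toy_df = pd.DataFrame({
    'WDPA_PID':   ['1', '2', '3'],
    'NO_TAKE':    ['All', 'All', 'Part'],
    'REP_M_AREA': [10.0, 10.0, 10.0],
    'NO_TK_AREA': [10.0005, 8.0, 4.0],
})

rel_tolerance = 0.0001  # assumption: mirrors size_threshold = 1.0001

# Select rows with NO_TAKE = 'All'
no_take_all = toy_df[toy_df['NO_TAKE'] == 'All']

# True where the two areas agree within the relative tolerance
areas_match = np.isclose(no_take_all['REP_M_AREA'].values,
                         no_take_all['NO_TK_AREA'].values,
                         rtol=rel_tolerance, atol=0)

invalid = no_take_all[~areas_match]['WDPA_PID'].values
print(list(invalid))  # ['2']
```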

###### Invalid INT_CRIT & DESIG_ENG  - other sites

In [None]:
def invalid_int_crit_desig_eng_other(wdpa_df, return_pid=False):
    '''
    Return True if DESIG_ENG is anything other than 'Ramsar Site (...)' or 'World Heritage Site (...)'
    while INT_CRIT is unequal to 'Not Applicable'. Sites other than Ramsar / WHS should not contain 
    anything other than 'Not Applicable'.
    Return list of WDPA_PIDs where INT_CRIT is invalid, if return_pid is set True
    '''
    
    field = ['DESIG_ENG']
    field_allowed_values = ['Ramsar Site, Wetland of International Importance', 
                            'World Heritage Site (natural or mixed)']
    condition_field = ['INT_CRIT']
    condition_crit = ['Not Applicable']
    
    if return_pid:
        return wdpa_df[~wdpa_df[field[0]].isin(field_allowed_values) &
                       ~wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values
    
    return len(wdpa_df[~wdpa_df[field[0]].isin(field_allowed_values) &
                       ~wdpa_df[condition_field[0]].isin(condition_crit)]['WDPA_PID'].values) >= 1