# QA for WDPA 2019

Stijn den Haan

Supervisor: Yichuan Shi

Bioinformatics internship • WCMC • 10 June to 9 August 2019

---

#### A brief note on terminology:

- **Offending fields** are fields (columns) that contain values that do not adhere to the rules set in the WDPA manual.
- Offending fields are subdivided into **three types**:
    - *Duplicate*: a value that should be unique occurs more than once; specifically, the WDPA_PID should be unique throughout the WDPA.
    - *Inconsistent*: multiple records (rows) about the same protected area contain conflicting attribute information.
        - Example: records with the same WDPAID have different values in the NAME field, e.g. 'De Veluwe' vs 'De VeLUwe'.
    - *Invalid*: a record has an incorrect value for a field where only a particular set of values is allowed.
        - Example: DESIG_TYPE = 'Individual' while only 'National', 'International', and 'Regional' are allowed values for this field.
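
The three types can be illustrated on a handful of made-up records. This is a minimal sketch: `toy_df` and all of its values are invented for illustration and are not taken from the WDPA.

```python
import pandas as pd

# Hypothetical toy records, invented purely for illustration
toy_df = pd.DataFrame({
    'WDPAID':     [1, 1, 2, 3],
    'WDPA_PID':   ['1_A', '1_B', '2', '2'],
    'NAME':       ['De Veluwe', 'De VeLUwe', 'Park X', 'Park Y'],
    'DESIG_TYPE': ['National', 'National', 'Individual', 'Regional'],
})

# Duplicate: the WDPA_PID '2' occurs in two records
duplicate_rows = toy_df[toy_df['WDPA_PID'].duplicated(keep=False)]

# Inconsistent: WDPAID 1 carries two different NAME values
names_per_wdpaid = toy_df.groupby('WDPAID')['NAME'].nunique()
inconsistent_wdpaids = names_per_wdpaid[names_per_wdpaid > 1].index.tolist()

# Invalid: DESIG_TYPE holds a value outside the allowed set
allowed_desig_types = ['National', 'International', 'Regional']
invalid_rows = toy_df[~toy_df['DESIG_TYPE'].isin(allowed_desig_types)]
```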
    

Load packages

In [3]:
import numpy as np
import pandas as pd
import arcpy
import os

Load data

In [5]:
# set working directory
os.chdir("C:/Users/paintern/Desktop/Stijn/5. Scripts")

# load dummy data
wdpa_df = pd.read_csv('dummyData.csv', low_memory=False)

# load fields present in the WDPA
fields = ['WDPAID', 'WDPA_PID', 'PA_DEF', 'NAME', 'ORIG_NAME', 'DESIG', 
          'DESIG_ENG', 'DESIG_TYPE', 'IUCN_CAT', 'INT_CRIT', 'MARINE', 'REP_M_AREA', 
          'GIS_M_AREA', 'REP_AREA', 'GIS_AREA', 'NO_TAKE', 'NO_TK_AREA', 'STATUS', 'STATUS_YR', 
          'GOV_TYPE', 'OWN_TYPE', 'MANG_AUTH', 'MANG_PLAN', 'VERIF', 'METADATAID', 'SUB_LOC', 'PARENT_ISO3', 'ISO3', ]

**Convert** ArcGIS table to Pandas DataFrame

In [None]:
# Source: https://gist.github.com/d-wasserman/e9c98be1d0caebc2935afecf0ba239a0
def arcgis_table_to_df(in_fc, input_fields, query=""):
    """Function will convert an arcgis table into a pandas dataframe with an object ID index, and the selected
    input fields using an arcpy.da.SearchCursor."""

    OIDFieldName = arcpy.Describe(in_fc).OIDFieldName
    final_fields = [OIDFieldName] + input_fields
    data = [row for row in arcpy.da.SearchCursor(in_fc,final_fields,where_clause=query)]
    fc_dataframe = pd.DataFrame(data,columns=final_fields)
    fc_dataframe = fc_dataframe.set_index(OIDFieldName,drop=True)
    
    return fc_dataframe

# wdpa_df = arcgis_table_to_df(wdpa, fields)

**Utility** - Yichuan: "focus on later"

In [None]:
# find rows of the WDPA based on the WDPA_PID
def find_wdpa_rows(wdpa_df, wdpa_pid):
    '''
    Return a subset of dataframe based on wdpa_pid list

    Arguments:
    wdpa_df -- wdpa dataframe
    wdpa_pid -- a list of WDPA_PID
    '''
    return wdpa_df[wdpa_df['WDPA_PID'].isin(wdpa_pid)]
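
A short usage sketch of the utility on invented data (`toy_df` is hypothetical). Note that `wdpa_pid` has to be list-like: pandas' `isin` raises a `TypeError` when it is given a bare string.

```python
import pandas as pd

def find_wdpa_rows(wdpa_df, wdpa_pid):
    # Same logic as the utility above, repeated so the sketch runs on its own
    return wdpa_df[wdpa_df['WDPA_PID'].isin(wdpa_pid)]

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPA_PID': ['10', '11_A', '11_B', '12'],
    'NAME':     ['Park W', 'Park X', 'Park X', 'Park Y'],
})

# Pull the records for two WDPA_PIDs; pass a list even for a single PID
subset = find_wdpa_rows(toy_df, ['11_A', '12'])
single = find_wdpa_rows(toy_df, ['10'])
```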

**Checks**

>Checking validity (True vs False, i.e. errors or not) should be implemented as efficiently as possible, so that the 'offending' rows are not pulled out when return_pid is set to **False**.
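
One possible way to follow this guideline is to compute a boolean mask once and only subset the DataFrame when the WDPA_PIDs are actually requested. The helper name `check_offending` and the toy data are assumptions made for this sketch, not part of the notebook.

```python
import pandas as pd

def check_offending(wdpa_df, mask, return_pid=False):
    """Hypothetical helper: `mask` is a boolean Series marking offending rows."""
    if return_pid:
        # Only here are the offending rows selected
        return wdpa_df.loc[mask, 'WDPA_PID'].tolist()
    # Cheap path: a single reduction over the mask, no rows are selected
    return bool(mask.any())

# Hypothetical toy records
toy_df = pd.DataFrame({'WDPA_PID': ['1', '2', '3'], 'PA_DEF': [1, 0, 1]})
mask = toy_df['PA_DEF'] != 1

has_errors = check_offending(toy_df, mask)
offending_pids = check_offending(toy_df, mask, return_pid=True)
```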

**i. Duplicate WDPA_PIDs**

In [None]:
def duplicate_wdpa_pid(wdpa_df, return_pid=False):
    '''
    Return True if any WDPA_PID occurs more than once in the dataframe. 
    
    Return list of duplicated WDPA_PID, if return_pid is set True.
    '''

    # Flag every record whose WDPA_PID occurs more than once
    is_duplicate = wdpa_df['WDPA_PID'].duplicated(keep=False)

    if return_pid:
        # Output each duplicated WDPA_PID once
        return wdpa_df.loc[is_duplicate, 'WDPA_PID'].unique().tolist()

    return is_duplicate.any() # this returns True if there are WDPA_PID duplicates
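
A subtlety worth keeping in mind when a duplicate check is written as a count comparison: `nunique()` skips missing values by default, so comparing it against the number of rows also flags a missing WDPA_PID, even if no value repeats. The sketch below contrasts this with `duplicated()` on an invented Series.

```python
import numpy as np
import pandas as pd

# Hypothetical identifiers: no value repeats, but one is missing
pids = pd.Series(['1', '2', np.nan], name='WDPA_PID')

# Count-based check: nunique() ignores the NaN, so 2 unique values vs 3 rows
count_based = bool(pids.nunique() != pids.size)

# Flag-based check: looks for repeated values only
flag_based = bool(pids.duplicated().any())
```

Which behaviour is preferable depends on whether a missing WDPA_PID should count as an error in this check or be caught by a separate one.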

**ii. Inconsistent values for the same WDPAID**

Parent function - **AP: sort the tuple on WDPAIDs**

In [None]:
def inconsistent_attributes_same_wdpaid(wdpa_df, 
                                        check_attribute, 
                                        return_pid=False):
    '''
    Factory of functions: this generic function is to be linked to
    the family of 'inconsistent' functions stated below. These latter 
    functions are to give information on which fields to check and pull 
    from the DataFrame. This function is the foundation of the others.
    
    Return True if inconsistent attributes are found for rows 
    sharing the same WDPAID

    Return list of WDPAIDs where inconsistency occurs, if 
    return_pid is set True

    Arguments:
    check_attribute -- name of the field to check for inconsistency
    '''
   
    # this function can be repurposed

    # Group by WDPAID and count the number of unique values 
    # of the field in question within each group
    unique_counts = wdpa_df.groupby('WDPAID')[check_attribute].nunique()

    if return_pid:
        # Select all WDPAIDs with >1 unique value
        # for the specified field ('check_attribute') 
        inconsistent = unique_counts[unique_counts > 1]
        
        # Action Point: sort by WDPAID. groupby sorts its keys 
        # by default; sort_index() makes this explicit
        return inconsistent.sort_index().index.tolist()
        
    # True if at least one WDPAID has more than 1 value for the field
    return (unique_counts > 1).any()
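
The grouping step yields WDPAIDs, while the utility `find_wdpa_rows` expects WDPA_PIDs. If the records themselves are needed, the inconsistent WDPAIDs can be mapped back to their WDPA_PIDs. A minimal sketch on invented data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPAID':   [7, 7, 3, 3, 5],
    'WDPA_PID': ['7_A', '7_B', '3_A', '3_B', '5'],
    'NAME':     ['De Veluwe', 'De VeLUwe', 'Park X', 'Park X', 'Park Y'],
})

# Number of distinct NAME values per WDPAID (index is sorted by WDPAID)
unique_counts = toy_df.groupby('WDPAID')['NAME'].nunique()

# WDPAIDs with conflicting names
inconsistent_wdpaids = unique_counts[unique_counts > 1].index

# All WDPA_PIDs that belong to those WDPAIDs
inconsistent_pids = toy_df.loc[
    toy_df['WDPAID'].isin(inconsistent_wdpaids), 'WDPA_PID'].tolist()
```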

Child functions - **AP: check `return` function**

In [None]:
def inconsistent_desig_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'DESIG' field
    for records with the same WDPAID
    
    Input: pandas dataframe
    Output: True/False, or the WDPAIDs holding inconsistencies 
            if return_pid is set True
    '''
    
    check_attributes = 'DESIG'
    
    # arguments are passed in the order of the parent function's signature
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

def inconsistent_desig_eng_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'DESIG_ENG' field
    for records with the same WDPAID
    
    Input: pandas dataframe
    Output: True/False, or the WDPAIDs holding inconsistencies 
            if return_pid is set True
    '''
    
    check_attributes = 'DESIG_ENG'
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

def inconsistent_name_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'NAME' field
    for records with the same WDPAID
    
    Input: pandas dataframe
    Output: True/False, or the WDPAIDs holding inconsistencies 
            if return_pid is set True
    '''

    check_attributes = 'NAME'
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

def inconsistent_mang_auth_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'MANG_AUTH' field
    for records with the same WDPAID
    
    Input: pandas dataframe
    Output: True/False, or the WDPAIDs holding inconsistencies 
            if return_pid is set True
    '''
    
    check_attributes = 'MANG_AUTH'
    
    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)

def inconsistent_mang_plan_same_wdpaid(wdpa_df, return_pid=False):
    '''
    This function is to capture inconsistencies in the 'MANG_PLAN' field
    for records with the same WDPAID
    
    Input: pandas dataframe
    Output: True/False, or the WDPAIDs holding inconsistencies 
            if return_pid is set True
    '''
    check_attributes = 'MANG_PLAN'

    return inconsistent_attributes_same_wdpaid(wdpa_df, check_attributes, return_pid)
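
Since the five child functions differ only in the field name, they could also be generated rather than written out. The sketch below shows one way to do that with a closure; `make_inconsistency_check`, `inconsistent_checks` and the toy data are hypothetical names introduced here, and the parent function is reduced to a stand-in so the example runs on its own.

```python
import pandas as pd

def inconsistent_attributes_same_wdpaid(wdpa_df, check_attribute, return_pid=False):
    """Stand-in for the parent function, reduced to its core logic."""
    counts = wdpa_df.groupby('WDPAID')[check_attribute].nunique()
    if return_pid:
        return counts[counts > 1].index.tolist()
    return bool((counts > 1).any())

def make_inconsistency_check(check_attribute):
    """Hypothetical factory: build a check that is bound to one field."""
    def check(wdpa_df, return_pid=False):
        return inconsistent_attributes_same_wdpaid(wdpa_df, check_attribute, return_pid)
    check.__name__ = 'inconsistent_{}_same_wdpaid'.format(check_attribute.lower())
    return check

# One generated check per field
inconsistent_checks = {
    field: make_inconsistency_check(field)
    for field in ['DESIG', 'DESIG_ENG', 'NAME', 'MANG_AUTH', 'MANG_PLAN']
}

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPAID': [1, 1, 2],
    'NAME':   ['De Veluwe', 'De VeLUwe', 'Park X'],
    'DESIG':  ['Nationaal Park', 'Nationaal Park', 'Reserve'],
})

name_flag = inconsistent_checks['NAME'](toy_df)
name_ids = inconsistent_checks['NAME'](toy_df, return_pid=True)
desig_flag = inconsistent_checks['DESIG'](toy_df)
```

Keeping the checks in a dictionary would also make it straightforward to loop over all of them when running the full QA.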

---

**iii. Invalid values present in a field** - **AP: make function factory**

In [None]:
def invalid_value_in_field(wdpa_df, field, field_allowed_values, 
                           condition_field, condition_crit, return_pid=False):
    '''
    Factory of functions: this generic function is to be linked to
    the family of 'invalid' functions. These latter functions 
    are to give information on which fields to check and pull from 
    the DataFrame.
    
    Return True if invalid values are found in the specified field

    Return list of WDPA_PID where invalid values occur, if return_pid is set True

    Arguments:
    field -- field in which invalid values are checked
    field_allowed_values -- a list or tuple of expected values, case sensitive
    condition_field -- a constraint of another field for evaluating invalid value, leave "" if no condition specified
    condition_crit -- a list or tuple of values for which the condition_field needs to be evaluated, leave "" if no condition specified

    Example:
    invalid_value_in_field(
        wdpa_df, 
        field,
        field_allowed_values=("Ramsar Site, Wetland of International Importance", 
                              "UNESCO-MAB Biosphere Reserve", 
                              "World Heritage Site (natural or mixed)"),
        condition_field="DESIG_TYPE",
        condition_crit=("International",),
        return_pid=True)
    '''
    # This generic function can be repurposed to specific functions
    
    # Flag the records whose value is not allowed
    invalid = ~wdpa_df[field].isin(field_allowed_values)

    # If a condition is specified (i.e. not empty), 
    # only flag the records that also meet the condition
    if condition_field and condition_crit:
        invalid = invalid & wdpa_df[condition_field].isin(condition_crit)

    if return_pid:
        # Output the WDPA_PIDs of the records with invalid values
        return wdpa_df.loc[invalid, 'WDPA_PID'].tolist()

    # True if at least one record holds an invalid value
    return invalid.any()

In [1]:
def invalid_iucn_cat(wdpa_df, return_pid=False):
    '''
    Return True if IUCN_CAT is not equal to allowed values
    
    Return list of WDPA_PID where IUCN_CAT is invalid, if return_pid is set True
    '''
    
    field = 'IUCN_CAT'
    field_allowed_values = ("Ia", "Ib", "II", "III", 
                            "IV", "V", "VI", 
                            "Not Reported", 
                            "Not Applicable", 
                            "Not Assigned")
    condition_field = ''
    condition_crit = ''
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, 
                                  condition_field, condition_crit, return_pid)

#### CHECK THIS ####

def invalid_iucn_cat_unesco_whs(wdpa_df, return_pid=False):
    '''
    Return True if IUCN_CAT is "Not Applicable" and DESIG_ENG is UNESCO-MAB or World Heritage Site
    
    Return list of WDPA_PID where IUCN_CAT is invalid, if return_pid is set True
    '''
    
    field = 'IUCN_CAT'
    field_allowed_values = ("Ia", "Ib", "II", "III", 
                            "IV", "V", "VI", 
                            "Not Reported", 
                            "Not Assigned") # "Not Applicable" is not allowed
    condition_field = 'DESIG_ENG'
    condition_crit = ('UNESCO-MAB Biosphere Reserve', 
                      'World Heritage Site (natural or mixed)')
    
    return invalid_value_in_field(wdpa_df, field, field_allowed_values, 
                                  condition_field, condition_crit, return_pid)

#########################################################

def invalid_pa_def(wdpa_df, return_pid=False):
    '''
    Return True if PA_DEF not 1

    Return list of WDPA_PID where PA_DEF is not 1, if return_pid is set True
    '''

    field = 'PA_DEF'
    field_allowed_values = (1,) # trailing comma: a one-element tuple
    condition_field = ''
    condition_crit = ''

    return invalid_value_in_field(wdpa_df, field, field_allowed_values, 
                                  condition_field, condition_crit, return_pid)

def invalid_desig_type(wdpa_df, return_pid=False):
    return

def invalid_desig_eng_regional(wdpa_df, return_pid=False):
    return

def invalid_desig_eng_international(wdpa_df, return_pid=False):
    return

def invalid_desig_eng_int_crit(wdpa_df, return_pid=False):
    return

def invalid_marine(wdpa_df, return_pid=False):
    return

def invalid_status(wdpa_df, return_pid=False):
    return

def invalid_status_yr(wdpa_df, return_pid=False):
    return

def invalid_gov_type(wdpa_df, return_pid=False):
    return

def invalid_own_type(wdpa_df, return_pid=False):
    return

def invalid_verif(wdpa_df, return_pid=False):
    return

def invalid_metadataid(wdpa_df, return_pid=False):
    return

def invalid_gis_area(wdpa_df, return_pid=False):
    '''
    Return list of WDPA_PID where small GIS_AREA values are present 
    '''
    return 

def invalid_int_crit(wdpa_df, return_pid=False):
    '''
    Return list of WDPA_PID where invalid characters (space, comma) are present 
    '''   
    return
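
As an illustration of how one of the empty stubs might be filled in, the sketch below checks DESIG_TYPE against the three values named in the terminology note ('National', 'International', 'Regional'). The function name `invalid_desig_type_sketch` and the toy data are hypothetical, and the allowed values should be verified against the WDPA manual before use.

```python
import pandas as pd

# Allowed values as listed in the terminology note; verify against the WDPA manual
ALLOWED_DESIG_TYPES = ('National', 'International', 'Regional')

def invalid_desig_type_sketch(wdpa_df, return_pid=False):
    """Hypothetical sketch of a DESIG_TYPE check."""
    # Flag records whose DESIG_TYPE is not one of the allowed values
    invalid = ~wdpa_df['DESIG_TYPE'].isin(ALLOWED_DESIG_TYPES)
    if return_pid:
        return wdpa_df.loc[invalid, 'WDPA_PID'].tolist()
    return bool(invalid.any())

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPA_PID':   ['1', '2', '3'],
    'DESIG_TYPE': ['National', 'Individual', 'Regional'],
})

has_invalid = invalid_desig_type_sketch(toy_df)
invalid_pids = invalid_desig_type_sketch(toy_df, return_pid=True)
```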


**iv. Marine data checks**

In [None]:
# hard code the rules for marine fields
def invalid_marine_areas(wdpa_df, return_pid=False):
    # all areal inconsistencies are to be picked up and specified here
    return

def invalid_no_take(wdpa_df, return_pid=False):
    # no take and no take area
    return
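
The marine rules themselves are not spelled out in this notebook. As a sketch of what a hard-coded rule could look like, the example below assumes one plausible rule: the reported marine area (REP_M_AREA) should not exceed the reported total area (REP_AREA). Both the rule and the function name `marine_area_exceeds_total_sketch` are assumptions made for illustration and need to be checked against the WDPA manual.

```python
import pandas as pd

def marine_area_exceeds_total_sketch(wdpa_df, return_pid=False):
    """Hypothetical check, assuming REP_M_AREA should not exceed REP_AREA."""
    # Comparisons involving NaN evaluate to False, so records with
    # missing areas are not flagged by this rule
    invalid = wdpa_df['REP_M_AREA'] > wdpa_df['REP_AREA']
    if return_pid:
        return wdpa_df.loc[invalid, 'WDPA_PID'].tolist()
    return bool(invalid.any())

# Hypothetical toy records
toy_df = pd.DataFrame({
    'WDPA_PID':   ['1', '2', '3'],
    'REP_AREA':   [10.0, 5.0, 8.0],
    'REP_M_AREA': [2.0, 7.5, 8.0],
})

has_invalid = marine_area_exceeds_total_sketch(toy_df)
invalid_pids = marine_area_exceeds_total_sketch(toy_df, return_pid=True)
```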