# Introduction

### This notebook performs:
+ Data cleaning on the facility name column which includes removal of punctuation and removal of facility type information contained in the facility name column.
+ The removed type information in the facility name by using a facility type dictionary.
+ Complite health facility types.
+ Check if points located in the right admin boundary based on provided admin boundary.
+ Check if points located on a settlement.
+ Check if points are overlap (identical lat/long).
+ Match Master Facility List (MFL).

### Input:
#### Spatial layers:
+ Health facilities locations
+ Settlement extent layer.BUA, SSA and Hamlets needs to be merge into a layer
+ Admin 1 boundary 
+ Admin 2 boundary
#### Tables
+ Spelling dictionary to fix spellings differences in the facility types
+ Type dictionary to standardize facility types
+ Master Facility List (MFL). Facility names and types need to be seperated into two columns

#### User Inputs:
 User needs to specify these inputs before running the script
+ root_dir: Output workspace
+ input_data: name of the input layer
+ today_date: Date. it will be used in output directory and gdb names
+ admin_name: It will be used as name convention in the output layers
+ point_poi_type: It will be used as name convention in the output layers
+ points_source: It will be used as name convention in the output layers
+ path_to_type_dict: Path to facility type dictionary
+ path_to_spelling_dict: Path to spelling dictionary
+ ADMIN1_BNDRY : Admin 1 column in the admin 1 boundary layer
+ ADMIN2_BNDRY : Admin 2 column in the admin 2 boundary layer
+ COUNTRY: Country name. It is used for getting spelling dictionary and type dictionary for each country
+ ADMIN1=Admin 1 name in the health facility layer
+ ADMIN2=Admin 2 name in the health facility layer
+ ADMIN3=Admin 3 name in the health facility layer
+ FACILITY_NAME = Facility name
+ CLEAN_NAME =clean facility name after pre-cleaning.
+ CORRECT_NAME = corrected facility name after misspelling  correction.
+ CLEAN_NAME_FINAL = final clean name after removing type information.
+ EXTRACT_TYPE = type information extracted, i.e. the difference between CORRECTED_NAME and clean_name_final.
+ SUB_TYPE = facility type defined in the type dictonary, obtained by mapping EXTRACT_TYPE to type dictionary
+ SCORE = match score between EXTRACT_TYPE and SUB_TYPE (scale 0-100), can be used to filter perfect-match results only.
+ dist_to_sett_threshold=Search distance to check points against settlement extents 
+ dist_to_check_overlap=Search distance to check points for overlaps
+ MFL_ADMIN1=Admin 1 column in the MFL
+ MFL_ADMIN2=Admin 2 column in the MFL
+ MFL_ADMIN3=Admin 3 column in the MFL
+ MFL_FACE_NAME=Facility name column in the MFL
+ MFL_FACE_TYPE=Facility type column in the MFL
+ MFL_ID=Facility unique id column in the MFL


### Output:
 A point layer with these columns :
- org_name: Original facility name. It is combination of hf, hf1 and hf2 columns
- duration_bins: Cassification of interview time in minutes
- clean_name: Clean facility name after pre-cleaning.
- corrected_name: Corrected facility name after misspelling  correction.
- clean_name_final: Final clean name after removing type information.
- extract_type: Type information extracted, i.e. the difference between CORRECTED_NAME and clean_name_final.
- type: Tacility type defined in the type dictonary, obtained by mapping EXTRACT_TYPE to type dictionary
- score: Match score between EXTRACT_TYPE and SUB_TYPE (scale 0-100), can be used to filter perfect-match results only.
- admin1_bdry_match: Indicates if points lacated in the right admin boundary (e.g province_bdry_match)
- admin2_bdry_match: Indicates if points lacated in the right admin boundary (e.g h_zone1_bdry_match)
- sett_type: Settlement type that a point are located (bua, ssa, hamlets).Points that far from a settlement more than 250 meters classified as "out of a settelement"
- is_overlaps: Indicates if points have identical lat/long
- admin1_2: Updated admin 1 names after matching to  MFL (e.g province_2)
- admin1_2_is_match: Indicates if admin 1 names match to MFL (e.g province_2_is_match)
- admin2_2: Updated admin 2 names after matching to  MFL (e.g h_zone2)
- admin2_2_is_match: Indicates if admin 2 names match to MFL (e.g h_zone2_is_match)
- admin3_2: Updated admin 3 names after matching to  MFL (e.g h_area2)
- admin3_2_is_match: Indicates if admin 3 names match to MFL (e.g h_area2_is_match)
- face_name2: Updated  facility names after matching to  MFL
- face_name2_is_match: Indicates if admin 3 names match to MFL (e.g h_area2_is_match)
- mfl_admin2= Admin 2 name from MFL. Only if facility names match to MFL (e.g mfl_health_zone)
- mfl_admin3= Admin 3 name from MFL. Only if facility names match to MFL (e.g mfl_health_area)
- mfl_unique_id= Unique id for each facility  from MFL. Only if facility names match to MFL (e.g fosa_uid). Some facilities with the same name but different types may cross match. These cases are flagged as "duplicate_match"
- mfl_unique_id= Facility type from MFL. Only if facility names match to MFL (e.g mfl_type)

# Import Libraries

In [4]:
import numpy as np
import pandas as pd
import re
#import geopandas as gpd
#import fiona
import os
import unidecode
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from ordered_set import OrderedSet
from sklearn.cluster import DBSCAN

import arcpy
from arcpy import env
from arcgis.features import GeoAccessor, GeoSeriesAccessor, SpatialDataFrame
pd.set_option('mode.chained_assignment', None)
pd.options.display.max_rows = None
pd.options.display.max_columns = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')



# Define functions

### Cluster points


In [5]:
def create_cluster(sdf,distance, min_point,cluster_variable):
    '''checks overlappes and distance between points based on specified distance
    ----------------------------------------------------------------------------
    inputs:
        PoI: point of intereset layer
        distance= distance (in meters) threshold to check clusters
        min_point= mininun point count to check for a cluster
        crf= a projected coordinate system code as a string
            (102022 >> Africa equal area projection. Change if needed)
    output:
        A new column will be added to the PoI layer
        cluster_id: Points that are in the same clusters will have the same id value.
        if point that has cluster_id more than 99999 means that the point is not in a cluster.  
    '''
    ##===========================================================================##
    #output_name=os.path.join(output_gdb,PoI)
    cluster_variable="cluster_id_"+str(distance)+"m"
  
     ##----------------------------------------------------------##
   
    if  cluster_variable in sdf.columns:
        sdf.drop(cluster_variable, axis=1, inplace=True)
    points_coord=sdf[['x_coord_m','y_coord_m']]
    
    db     = DBSCAN(eps=distance, min_samples=min_point).fit(points_coord)
    labels = db.labels_ #labels of the found clusters, points that out of a cluster coded as -1
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #number of clusters
    labels_list=list(labels)
    # attached cluster id to df
    points_coord[cluster_variable]=np.nan
    for i in range(len(labels_list)):
        points_coord.loc[i,cluster_variable]=int(labels_list[i])
    points_coord_=points_coord[[cluster_variable]]    
    sdf_=sdf.merge(points_coord_, left_index=True, right_index=True)
    # delete temp. columns
    sdf_.drop(['x_coord_m','y_coord_m'], axis=1, inplace=True)
    
    # points that out of selected distance cluster are coded -1 by DBSCAN function\
    # section below creates unique "cluster_id" for the points that coded as -1\
    # if cluster_id is more than 9999, it means that these cluster are a single point cluster
    i=999999
    for idx, row in sdf_.iterrows():
              
        if row[cluster_variable]==-1.0:
        
            sdf_.loc[idx,cluster_variable]=i
            i=i+1
    #arcpy.DeleteFeatures_management(output_name)
    #sdf_.spatial.to_featureclass(output_name)
    return sdf_

### `clean_name`

Pre-cleaning on facility name:

- remove punctuations, change '&' to 'and'
- correct spelling of common words for consistency
- replace double whitespaces with one and strip extra whitespaces
- remove accent marks

Note: NA values in facility name column is replaced with empty string '' first and then converted back to NA.

In [6]:
def preclean(df, facility_name, clean_name, country_col):
    
    # replace NAs with empty string ''
    df[facility_name] = df[facility_name].fillna('')
     # remove accent marks
    df[clean_name] = [unidecode.unidecode(n) for n in df[facility_name]]
    df[clean_name]=df[clean_name]+" "
    df[clean_name] = df[clean_name].str.replace(" III "," 3 ")\
        .str.replace(" II "," 2 ")\
        .str.replace(" I "," 1 ")\
        .str.replace(" Iii "," 3 ")\
        .str.replace(" Ii "," 2 ")\
        .str.replace(" IV "," 4 ")\
        .str.replace(" Iv "," 4 ")
    df[clean_name]  =df[clean_name].apply(lambda x: " ".join(re.split('(\d+)', x)))
    df[clean_name] = df[clean_name].str.strip()\
            .str.title()\
            .str.replace("  ", " ")\
            .str.replace('.', '')\
            .str.replace(':', '')\
            .str.replace("'", '')\
            .str.replace('"', ' ')\
            .str.replace('[', '')\
            .str.replace(']', '')\
            .str.replace('+', '')\
            .str.replace('*', '')\
            .str.replace('[-_,/\(\);]', '')\
            .str.replace('&', ' and ')\
            .str.replace("  ", " ")\
            .str.strip()\
            .str.replace('center ', 'centre ', case=False)\
            .str.replace('^st ', ' saint ', case=False)\
            .str.replace(' st ', ' saint ', case=False)\
            .str.replace('cl ', 'clinique ', case=False)\
            .str.replace('Geral ', 'General ', case=False)\
            .str.replace('Hospitals ', 'Hopital ', case=False)\
            .str.replace('Hospital ', 'Hopital ', case=False)\
            .str.replace("Urban ", "Urbain ", case=False)\
            .str.replace("Distrital ", "District ", case=False)\
            .str.replace('^hosp | hosp | hosp$|^hosp$', ' Hopital ', case=False)\
            .str.replace("  ", " ")\
            .str.strip()\
            .str.replace(" De ", " de ")\

            #.str.replace('Clinique', 'Clinic', case=False)\
            #.str.replace('Polyclinique', 'Polyclinic', case=False)\
            #.str.replace('Dispensaire', 'Dispensary', case=False)\
            # .str.replace('Hôpital', 'Hospital', case=False)\
            #.str.replace('Hopital', 'Hospital', case=False)\
    
    # replace NAs in clean_name with empty string ''
    df[clean_name] = df[clean_name].fillna('')
    
    # change emptry string in facility_name back to NA
    df[facility_name] = df[facility_name].replace('', np.nan)

    
    
    return df

### Clean Admin names


In [7]:
def preclean_admin(df,admin_var, out_admin_var):
    
    # replace NAs with empty string ''
    df[out_admin_var] = df[admin_var].fillna('')
     # remove accent marks
    df[out_admin_var] = [unidecode.unidecode(n) for n in df[out_admin_var]]
    df[out_admin_var]=df[out_admin_var]+" "
    df[out_admin_var] = df[out_admin_var].str.replace(" III "," 3 ")\
        .str.replace(" II "," 2 ")\
        .str.replace(" I "," 1 ")\
        .str.replace(" Iii "," 3 ")\
        .str.replace(" Ii "," 2 ")\
        .str.replace(" IV "," 4 ")\
        .str.replace(" Iv "," 4 ")
    df[out_admin_var]  =df[out_admin_var].apply(lambda x: " ".join(re.split('(\d+)', x)))
    df[out_admin_var] = df[out_admin_var].str.title()\
            .str.replace('[-_,/\(\);.]', ' ')\
            .str.replace("AS ","", case=False)\
            .str.replace("ZS ","", case=False)\
            .str.replace('  ', ' ')\
            .str.strip()\

    # replace NAs in admin_var with empty string ''
    df[out_admin_var] = df[out_admin_var].fillna('')
    
    # change emptry string in facility_name back to NA
    df[out_admin_var] = df[out_admin_var].replace('', np.nan)

    return df

### `corrected_name`

Makes correction to possible misspellings using the spelling dictionary.

In [8]:
def correct_spelling(df, spelling_dict, country_col, clean_name, output_col):
    corrected_results = pd.DataFrame()

    for country_name in df[country_col].unique():
        # obtain dataset for the country
        df_ctr = df[df[country_col].str.upper()==country_name.upper()]
    
        spelling_dict_ctr = spelling_dict[spelling_dict['Country'].str.upper()==country_name.upper()]
        words_to_correct = spelling_dict_ctr['Word'].unique()

        df_ctr[output_col] = df_ctr[clean_name]
        for word in words_to_correct:
            misspellings = list(spelling_dict_ctr[(spelling_dict_ctr['Word']==word)]['Misspelling'])
            for misspelling in misspellings:
                df_ctr[output_col] = df_ctr[output_col]\
                .str.replace('|'.join(['^'+misspelling+' ', ' '+misspelling+' ',
                                           ' '+misspelling+'$', '^'+misspelling+'$']), ' '+word+' ', case=False)\
                .str.strip().replace(" De "," de ")
    
        # merge country results to all results
        corrected_results = pd.concat([corrected_results, df_ctr])
    # reset and drop index
    corrected_results.reset_index(inplace=True, drop=True)
                                                              
    return corrected_results

### `clean_name_final`

Use facility type and abbreviations in the type dictionary as keywords and remove type information from `corrected_name` to create the `clean_name_final` column.

In [9]:
def remove_type_info(df, type_dict, clean_name, clean_name_final, country):
    # remove whitespace between abbreviations of length 2 or 3
    # e.g. change C S to CS
    
    # obtain abbreviations of length 2 or 3
    tmp = type_dict[type_dict['Abbreviation'].str.len()<=3]['Abbreviation'].unique()
    # sort by decreasing length
    tmp = sorted(tmp, key=len, reverse=True)
    # change it to the pattern '^c s ' or ' c s$'
    tmp_dict = {}
    for t in tmp:
        tmp_dict[t] = ['^'+' '.join(list(t))+' ', ' '+' '.join(list(t))+'$']
    # replace the pattern with 'cs'
    for t in tmp:
        pats = tmp_dict[t]
        df[clean_name] = df[clean_name].str.replace(pats[0], t+' ',case=False)\
        .str.replace(pats[1], ' '+t, case=False)
    # remove type information
    df_grouped = df.groupby(country)
    res = pd.DataFrame()

    for group_name, df_group in df_grouped:
        # obtain the type dictionary for that country
        tmp = type_dict[type_dict['Country'].str.upper()==group_name.upper()]

        # facility types for that country
        types = list(tmp['Type'])
        type_keywords = set()
        for t in types:
            # add the full facility type 
            t = t.title()
            type_keywords.add(t)                 

            # add individual words as well
            t = t.replace('/', ' ')
            words = t.split(' ')
            # skip words that have punctuation / numbers and have length <= 3 (e.g. de, (major))
            words = [w for w in words if w.isalpha() and len(w)>3]
            for w in words:
                type_keywords.add(w)

        # obtain the list of type keywords and sort in descending length
        type_keywords = list(type_keywords)
        type_keywords = sorted(type_keywords, key=lambda s: -len(s))

        # abbreviations for that country
        abbrevs = set(tmp['Abbreviation'])

        abb_keywords = []
        for abbrev in abbrevs:
            # e.g. for CS, 4 patterns are considered: '^CS ', ' CS ', ' CS$', '^CS$'
            abbrev = abbrev.title()
            abb_keywords.extend(['^'+abbrev+'\s', '\s'+abbrev+'\s', '\s'+abbrev+'$',
                                '^'+abbrev+'$'])

        # obtain the list of abbreviation keywords and sort in descending length
        abb_keywords = sorted(abb_keywords, key=lambda s: -len(s))  


        # handle situations when type is 'Hospital District' in the type dictionary 
        # but name column has 'District Hospital' in ISS data
        type_len_2 = [t for t in type_keywords if len(t.split())==2]
        for t in type_len_2:
            df_group[clean_name] = df_group[clean_name].str.title()\
            .str.replace(' '.join(t.split()[::-1]), t, case=False)

        # remove type information using keywords generated above
        # remove meaningless connecting words like de, do, da, du
        df_group[clean_name_final] = df_group[clean_name].str.title()\
            .str.replace('|'.join(type_keywords), '')\
            .str.replace('|'.join(abb_keywords), ' ')\
            .str.strip()\
            .str.replace('^de | de | de$|^de$|^do | do | do$|^do$|^da | da | da$|^da$|^du | du | du$|^du$', 
                         ' ', case=False)\
            .str.replace("  ", " ")\
            .str.strip()\
            .str.title()
        res = pd.concat([res, df_group])
    return res

### `extract_type`

Extract facility type information by removing `clean_name_final` from `corrected_name`.

Note: empty string '' in `clean_name_final` and `corrected_name` are converted back to NA.

In [10]:
def extract_type(df, clean_name, clean_name_final, extract_type):
    extract_types = []

    for idx, row in df.iterrows():
        name = row[clean_name].upper()
        name_final = row[clean_name_final].upper()

        # if clean_name_final is exactly the same as clean_name,
        # this indicates no type information can be extracted, thus append NA
        if name.upper() == name_final.upper():
            extract_types.append(np.nan)

        else:
            name = OrderedSet(name.split())
            name_final = OrderedSet(name_final.split())
            # find the difference between two names
            diff = ' '.join(list(name.difference(name_final)))
            extract_types.append(diff.strip())

    # remove de, do, da, du at start or end of extract_type
    # replace empty string with NA
    df[extract_type] = extract_types
    df[extract_type] = df[extract_type].str.strip()\
        .str.replace("  ", " ")\
        .str.replace('^de |^do |^da |^du | du$| de$| do$| da$|^de$|^do$|^da$|^du$', '', case=False)\
        .str.replace('^de |^do |^da |^du | du$| de$| do$| da$|^de$|^do$|^da$|^du$', '', case=False)\
        .str.strip()\
        .str.title()\
        .replace('',np.nan)
    # replace empty string with NA
    df[clean_name].replace('', np.nan, inplace=True)
    df[clean_name_final].replace('', np.nan, inplace=True)

### `sub_type`

Use `extract_type` to map the type information extracted from the name column to one of the types in the type dictionary.

In [11]:
def map_type(df, country, extract_type, sub_type, score, type_dict):
    df_grouped = df.groupby(country)
    res = pd.DataFrame()
    for country_name in df[country].unique():
        df_group = df[df[country]==country_name]
        # obtain facility types and abbreviations for that country
        tmp = type_dict[type_dict['Country'].str.upper()==country_name.upper()]
        types, abbrevs = tmp['Type'], tmp['Abbreviation']
        sub_types = []
        scores = []

        for idx, row in df_group.iterrows():
            # if extract_type is NA, just append NA
            if not isinstance(row[extract_type],str):
                sub_types.append(np.nan)
                scores.append(np.nan)

            # find best match
            else:
                match, match_score = process.extractOne(row[extract_type], list(types)+list(abbrevs), scorer = fuzz.token_sort_ratio)
            
                scores.append(match_score)
                # if best match is abbreviation, map it to the corresponding type
                if match in list(abbrevs):
                    match_type = tmp[tmp['Abbreviation']==match]['Type'].iloc[0]
                    sub_types.append(match_type)
                else:
                    sub_types.append(match) 
        df_group[sub_type] = sub_types
        df_group[score] = scores
        res = pd.concat([res, df_group])
    return res

### Complite types

In [12]:
def complite_types(df):
    df_group=df.groupby([ADMIN2, ADMIN3,CLEAN_NAME_FINAL,"cluster_id_1000m" ])
    for i, _group in df_group:
       
        _group[SUB_TYPE]=_group[SUB_TYPE].fillna("xxx")
        get_types=list(_group[SUB_TYPE].unique())
        
        if len(get_types)==2 and "xxx" in get_types :
            get_types.remove("xxx")  
            df.loc[(df[ADMIN2]==i[0])& (df[ADMIN3]==i[1])\
            &(df[CLEAN_NAME_FINAL]==i[2])&(df.cluster_id_1000m==i[3]), SUB_TYPE]=get_types*_group.shape[0]
            
    return df

### Check by admin boundary

In [13]:
def check_by_admin(PoI,admin_var_from_poi, admin_boundary,admin_var_from_admin):
    ##================================================================================##
    # admin column from admin boundary in PoI layer
    new_admin_col=admin_var_from_admin+"_bdry"
    # result of match
    check_var=admin_var_from_poi+"_bdryMatch"
    #---------------------------------------------------------------------------------##
    #get admin id that point located
    arcpy.Near_analysis(PoI, admin_boundary)
    # convert point layer to sdf
    sdf_poi=pd.DataFrame.spatial.from_featureclass(PoI)
    # clean admin layer to be matched
    sdf_poi=preclean_admin(sdf_poi,admin_var_from_poi, "poi_admin")
    # conver admin boundry to sdf
    sdf_admin=pd.DataFrame.spatial.from_featureclass( admin_boundary)[["OBJECTID", admin_var_from_admin]]
    # clean admin name from boundary
    sdf_admin=preclean_admin(sdf_admin,admin_var_from_admin, "bdry_admin")
    sdf_admin.rename({ admin_var_from_admin:new_admin_col,"OBJECTID":"Match_code"}, axis=1, inplace=True)
    # merge point layer with boundary layer attribute
    sdf_merge=sdf_poi.merge(sdf_admin, how="left", left_on="NEAR_FID", right_on="Match_code")
    sdf_merge[check_var]=""
  
    # match admin names between point layer and boundary layer
    for index,  row in sdf_merge.iterrows():
        score=fuzz.ratio(row["poi_admin"], row["bdry_admin"])
        if score >=85:
            sdf_merge.loc[index,check_var]="YES"
        else:
            sdf_merge.loc[index,check_var]="NO"
    sdf_merge.drop(["Match_code","NEAR_FID","NEAR_DIST","poi_admin","bdry_admin",new_admin_col],axis=1, inplace=True)
    arcpy.DeleteFeatures_management(PoI)
    sdf_merge.spatial.to_featureclass(PoI)                    
    

### Check by settlement types

In [14]:
def check_by_settlement(PoI, settExtent, dist_to_sett_threshold):
    arcpy.Near_analysis(PoI, settExtent, dist_to_sett_threshold)
    sdf_poi=pd.DataFrame.spatial.from_featureclass(PoI)   
    sdf_sett=pd.DataFrame.spatial.from_featureclass( settExtent)[["OBJECTID","type"]]
    sdf_sett=sdf_sett.rename({ "type":"sett_type","OBJECTID":"Match_code"}, axis=1)
    sdf_merge=sdf_poi.merge(sdf_sett, how="left", left_on="NEAR_FID", right_on="Match_code")
    sdf_merge["sett_type"].fillna("Out of a settlement", inplace=True)
    sdf_merge.drop(["Match_code","NEAR_FID","NEAR_DIST"],axis=1, inplace=True)
    arcpy.DeleteFeatures_management(PoI)
    sdf_merge.spatial.to_featureclass(PoI)


### Check overlaps

In [15]:
def check_overlaps(fc,check_dist_m):
    out_table= r"C:\Users\hengin\Documents\overlap.csv"
    arcpy.management.FindIdentical(fc, out_table, "Shape", dist_to_check_overlap, 0, "ONLY_DUPLICATES")
    sdf=pd.DataFrame.spatial.from_featureclass(fc)
    identical_df=pd.read_csv(out_table)[["IN_FID"]]
    identical_df['is_overlap']="YES" 
    sdf_merge= sdf.merge(identical_df, how="left", left_on="OBJECTID", right_on="IN_FID") 
    sdf_merge['is_overlap'].fillna("NO",inplace=True) 
    sdf_merge.drop("IN_FID", axis=1, inplace=True)
    arcpy.DeleteFeatures_management(fc)
    sdf_merge.spatial.to_featureclass(fc)                       

# Part 1

## Intialize Inputs

In [36]:
# Country iso code
country_code="DRC" #  options country, province, districts, health zone, health area, depent on input data


# main directory that have input layers.Output will be saved in this directory
root_dir=r"D:\Grid3\ISS\processing"
#input layer
#input_data=r"D:\Grid3\ISS\processing\DRC_WHO_health_facilities_08102021\DRC_preprocess.gdb\DRC_WHO_hf_merged"
input_data=r"D:\Grid3\ISS\processing\{x}_WHO_health_facilities_08102021\{x}_preprocess.gdb\{x}_WHO_hf_merged".format(x=country_code)
# today date
today_date="09012021"    # format of the date "MonthDAYYEAR" exp:"01012021" >> this goes to into name of the  output layers

# point type 
point_poi_type="hf" # options >> health_facility, settlement, schools or other point of interests

# source of dataset
points_source="who"


# import type dictionary as TYPE_DICT
path_to_type_dict =  r"D:\Grid3\ISS\inputs\spelling_type_dict\{x}_type_dict_augmented_1130.csv".format(x=country_code)
TYPE_DICT = pd.read_csv(path_to_type_dict )

# import spelling dictionary as SPELLING_DICT
path_to_spelling_dict = r"D:\Grid3\ISS\inputs\spelling_type_dict\{x}_spelling_dict_052021.csv".format(x=country_code)
SPELLING_DICT = pd.read_csv(path_to_spelling_dict )

# master facility list
MFL_path=r"D:\Grid3\ISS\processing\MFL_by_country\mfl_by_country.gdb\{x}_mfl".format(x=country_code)

#path to settlement extent
#bua, ssa and hamlets should be merged into a layer
# sett_extent=r"D:\Grid3\ISS\inputs\CMR\CMR.gdb\GRID3_Cameroon_Settlement_Extents_Version_1"
# dist_to_sett_threshold="250 Meters" 
# dist_to_check_overlap="0.1 Meters"

# #Admin boundaries
# Admin1_bndry=r"D:\Grid3\ISS\inputs\CMR\CMR.gdb\cmr_admbnda_adm1_inc_20180104"
# ADMIN1_BNDRY="ADM1_FR"

# Admin2_bndry=r"D:\Grid3\ISS\inputs\CMR\CMR.gdb\cmr_admbnda_adm3_inc_20180104"
# ADMIN2_BNDRY="ADM3_FR"


#path to settlement extent
#bua, ssa and hamlets should be merged into a layer
sett_extent=r"D:\Grid3\\DRC\DRC_Health_Facilities\Data\Spatial_Data\GRID3_DRC_settlement_extents_20200403_V02.gdb\\bua_ssa_hamlet"
dist_to_sett_threshold="250 Meters" 
dist_to_check_overlap="0.1 Meters"

#Admin boundaries
Admin1_bndry=r"D:\Grid3\DRC\DRC_Health_Facilities\Data\Spatial_Data\ISS_hf_2020\ISS_2017_2021.gdb\DRC_admin1"
ADMIN1_BNDRY="ADM1_REF"


Admin2_bndry=r"D:\Grid3\DRC\DRC_Health_Facilities\Data\Spatial_Data\ISS_hf_2020\ISS_2017_2021.gdb\DRC_admin2"
ADMIN2_BNDRY="Nom"

# country column in english
COUNTRY = 'countries'

# output columns
ADMIN1="admin1"
ADMIN2="admin2"
ADMIN3="admin3"
FACILITY_NAME = 'org_name'
CLEAN_NAME = 'clean_name' # clean name after some pre-cleaning
CORRECT_NAME = 'corrected_name' # clean name after spelling correction
CLEAN_NAME_FINAL = 'clean_name_final' # final clean name after removing type information
EXTRACT_TYPE = 'type_extract' # type information extracted
SUB_TYPE = 'type' # type mapped to the type dictonary
SCORE = 'score' # match score between 'type_extract' and 'sub_type'



## Prepaire Workspace

In [37]:
# create output directory
if not  os.path.exists (os.path.join(root_dir,country_code+"_WHO_health_facilities_"+today_date)):
    os.mkdir(os.path.join(root_dir,country_code+"_WHO_health_facilities_"+today_date))
output_loc=os.path.join(root_dir,country_code+"_WHO_health_facilities_"+today_date)

# create final output gdb
if  not arcpy.Exists(os.path.join(output_loc,country_code+"_preprocess.gdb")):
    arcpy.CreateFileGDB_management(output_loc,country_code+"_preprocess.gdb")
output_gdb=os.path.join(output_loc,country_code+"_preprocess.gdb") 

# output
OUT_FILE = country_code + "_WHO_hf_preprocess"
SAVE_FILE = os.path.join(output_gdb, OUT_FILE)
OUTPUT =os.path.join(output_gdb, OUT_FILE)
OUTPUT_csv=  os.path.join(output_loc, OUT_FILE+".csv")                      

## Preprocessing

In [38]:

##calculate xy coordinates in meter. it necessary for clusterig
## africa equal area projection is used. Change if it is necessary
crf="102022"
arcpy.AddField_management(input_data,field_name='x_coord_m', field_type="DOUBLE")
arcpy.AddField_management(input_data,field_name='y_coord_m', field_type="DOUBLE")
arcpy.CalculateGeometryAttributes_management(input_data, [['x_coord_m', 'POINT_X'], 
                                                    ['y_coord_m', 'POINT_Y']],coordinate_system = crf)
# read input dataset as spatial dataframe
sdf=pd.DataFrame.spatial.from_featureclass(input_data)
# exclude records that do not have lat/long
sdf=sdf[sdf['x_coord_m'].notnull()]
##===================================================================##
print (f">> There are {arcpy.GetCount_management(input_data).getOutput(0)} rows in the input layer" )
print (f">> There are {sdf.shape[0]} rows remained after preprocessing" )

>> There are 122322 rows in the input layer
>> There are 122292 rows remained after preprocessing


## Formatting and Standardizig

In [39]:
#Cluster points
df=create_cluster(sdf,1000, 2,"cluster_id")

# pre-cleaning
pre_cleaned_res = preclean(df, facility_name = FACILITY_NAME, clean_name = CLEAN_NAME,
                          country_col = COUNTRY)
# make spelling correction
corrected_results = correct_spelling(pre_cleaned_res, spelling_dict=SPELLING_DICT, country_col=COUNTRY, 
                                     clean_name = CLEAN_NAME, output_col=CORRECT_NAME)
# remove type information
res = remove_type_info(corrected_results, type_dict=TYPE_DICT, clean_name=CORRECT_NAME, 
                       clean_name_final=CLEAN_NAME_FINAL, country=COUNTRY)
# obtain facility type extracted
extract_type(df=res, clean_name=CORRECT_NAME, 
             clean_name_final=CLEAN_NAME_FINAL, extract_type=EXTRACT_TYPE)
res = map_type(df=res, country = COUNTRY, extract_type=EXTRACT_TYPE, 
               sub_type=SUB_TYPE, score=SCORE, type_dict=TYPE_DICT)
# complite missing types
complite_type=complite_types(res)
complite_type.drop('cluster_id_1000m', inplace=True, axis=1)
# export the result
complite_type.spatial.to_featureclass(OUTPUT)
#check by if points in the right admin1
check_by_admin(OUTPUT,ADMIN1, Admin1_bndry,ADMIN1_BNDRY)

# check by if points in the right admin2
check_by_admin(OUTPUT,ADMIN2, Admin2_bndry,ADMIN2_BNDRY)
# check by settlement extent
check_by_settlement(OUTPUT, sett_extent, dist_to_sett_threshold)

# check by overlaps
check_overlaps(OUTPUT,dist_to_check_overlap)

##===================Result=========================##
result=pd.DataFrame.spatial.from_featureclass(OUTPUT)
print(">>> Summary by type of facilities:")
print(result[SUB_TYPE].value_counts())
print()
print(">>> Poinst that are located in the right admin1 unit:")
print(result[ADMIN1+"_bdry_match"].value_counts())
print()
print(">>> Poinst that are located in the right admin2 unit:")
print(result[ADMIN2+"_bdry_match"].value_counts())
print()
print(">>> Points summary based on settlement type based:")
print(result["sett_type"].value_counts())
print()
print()
print(f">>> Points summary based on overlaps  {dist_to_check_overlap}:")
print(result["is_overlap"].value_counts())
print()

>>> Summary by type of facilities:
Centre de Sante                              56547
Centre Medical                                6333
Hopital General de Reference                  4805
Poste de Sante                                4094
Centre Hospitalier                            3281
Centre de Sante de Reference                  2825
Clinique                                      1851
Hopital                                        763
Polyclinique                                   642
Centre de Sante Municipal                      569
Dispensaire                                    508
Centre Medico-Chirurgical                      263
Hopital Secondaire                             196
Maternite                                      162
Centre de Sante Clinique                        88
Centre Pediatrique                              80
Centre de Sante Maternite                       65
Centre De Sante Et Maternite                    56
Centre Medical Et Maternite                    

# Part 2

## Match  with master facility list (pyramid)

### Prepaire MFL for matching

In [41]:
#read input data
MFL=pd.DataFrame.spatial.from_featureclass("DRC_mfl")
# drop duplicated recor in order to avoid duplicate match 
MFL.drop_duplicates(subset=["mfl_uuid"], inplace=True)

## input variables from MFL
MFL_ADMIN1='adm1_name'
MFL_ADMIN2='adm2_name'
MFL_ADMIN3='adm3_name'
MFL_FACE_NAME='facility_short'
MFL_FACE_TYPE='type_clean'
MFL_ID='mfl_uuid'

# standardized admin names
MFL_ADMIN1_2=MFL_ADMIN1+"_2"
MFL=preclean_admin(MFL,MFL_ADMIN1, MFL_ADMIN1_2)
MFL_ADMIN2_2=MFL_ADMIN2+"_2"
MFL=preclean_admin(MFL,MFL_ADMIN2, MFL_ADMIN2_2)

## get list of ADMIN1, ADMIN2and ADMIN3 from MFL
MFL_ADMIN1_list=MFL[MFL_ADMIN1_2].unique().tolist()
MFL_ADMIN2_list=MFL[MFL_ADMIN2_2].unique().tolist()
#MFL_ADMIN3_list=MFL[MFL_ADMIN3].unique().tolist()

# remove accents in MFL
def remove_accents(a):
    if isinstance(a, str):
        return unidecode.unidecode(a)

MFL[MFL_ADMIN2]= MFL[MFL_ADMIN2].apply(remove_accents)
#MFL[MFL_ADMIN3]= MFL[MFL_ADMIN3].apply(remove_accents)
MFL[MFL_FACE_NAME]= MFL[MFL_FACE_NAME].apply(remove_accents)
MFL[MFL_FACE_TYPE]= MFL[MFL_FACE_TYPE].apply(remove_accents)



## Flag health facilities with the same name but different types 
## it will help to prevent duplicate match because type will not be used for the match process
groupby_admin1_2=MFL.groupby( [MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_FACE_NAME]).size().reset_index(name="count1")
MFL=MFL.merge(groupby_admin1_2, left_on=[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_FACE_NAME], 
                      right_on=[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_FACE_NAME])

## Flag health facilities that have the same name and type by health zone
## if the facility name duplicated, the duplicated records will be mathced based on 
## health zone, health area and facility name
groupby_admin1=MFL.groupby( [MFL_ADMIN1_2,MFL_FACE_NAME]).size().reset_index(name="count2")
MFL=MFL.merge(groupby_admin1, left_on=[MFL_ADMIN1_2,MFL_FACE_NAME], 
                      right_on=[MFL_ADMIN1_2,MFL_FACE_NAME])

In [42]:
groupby_admin1_2[groupby_admin1_2["count1"]>=2].shape

(708, 4)

In [43]:
groupby_admin1[groupby_admin1["count2"]>=2].shape

(1281, 3)

### Matching admin1 (province/states) names

In [44]:
result=pd.DataFrame.spatial.from_featureclass(OUTPUT)
# check by province_name
# format province names from imput data
#  new caloums to save match result and correct names if it matched
ADMIN1_2="clean_"+ADMIN1
ADMIN1_matched=ADMIN1_2+"_isMatch"
match_admin1=preclean_admin(result,ADMIN1, ADMIN1_2)

admin1_group=match_admin1.groupby(ADMIN1_2)
#match admin1 between MFL and the input data
for i, group in admin1_group:
    match_name, score = process.extractOne(i, MFL_ADMIN1_list)
    # match score above 80 will be true match and admin1 from the input data 
    # will be changed with admin1 from MFL
    if score >=80:
        match_admin1.loc[match_admin1[ADMIN1_2]==i, ADMIN1_matched] ="YES"
        match_admin1.loc[match_admin1[ADMIN1_2]==i, ADMIN1_2] = match_name
        
    # match score less than 80 will be false match and 
    # admin1 name from the input data will be kept.
    # the rows that thier admin1 did not match will not go 
    # next step matching process. Manually check admin1  if it is needed
    else:
        match_admin1.loc[match_admin1[ADMIN1_2]==i, ADMIN1_matched] ="NO"

##==========================================================================##
# admin1 match result
match_count=len(match_admin1[match_admin1[ADMIN1_matched] =="YES"][ADMIN1_2].unique())
notMatch_count=len(match_admin1[match_admin1[ADMIN1_matched] =="NO"][ADMIN1_2].unique())
print("##############=== admin1 matcth result ===#####################")
print ( f" {match_count} admin 2 matched between input data and MFL")
print ( f" {notMatch_count} admin 2 did not matched between input data and MFL")


##############=== admin1 matcth result ===#####################
 26 admin 2 matched between input data and MFL
 0 admin 2 did not matched between input data and MFL


### Matching admin2  ( district/health zone) names 

In [45]:

# format admin1 names from input data
#  new caloums to save match result and correct names if it matched
ADMIN2_2="clean_"+ADMIN2
ADMIN2_matched=ADMIN2_2+"_isMatch"
match_admin2=preclean_admin(match_admin1,ADMIN2,ADMIN2_2)

# create unique id column with combination of admin1, admin2
# the unique id will be used for cheking if admin2 match to MFL admin2
match_admin1[ADMIN2_matched]=""
match_admin2["uniqueid"]=match_admin2[ADMIN1_2]+"_"+match_admin2[ADMIN2_2]
# match admin2 name by each  admin1
# the match process will be limited to admin1
match_admin2[ADMIN2_matched]=""
for admin1_ in MFL_ADMIN1_list:
        # get match candidates from MFL
    match_candiates=MFL[MFL[MFL_ADMIN1_2]==admin1_][MFL_ADMIN2_2].unique().tolist() 
    # get admin2 names to be match to MFL admin2
    match_df=match_admin2[match_admin2[ADMIN1_2]==admin1_]
    if match_df.shape[0]>=1:
    # group by admin2 in order to make matching process shorther
    # since admin2 is repeated many times. So each admin2 will be match only once and
    # then all rows with the same admin2 will be changed 
        admin2_group=match_df.groupby([ADMIN1_2, ADMIN2_2])
        for i, group in admin2_group:
            get_admin2_name=i[1]
            # create unique id from each groups to update match result 
            # in the input data
            get_index=i[0]+"_"+i[1]
            # matching
            match_name, score = process.extractOne(get_admin2_name, match_candiates, scorer=fuzz.token_sort_ratio)
            # match score above 80 will be true match and admin2 from the input data 
            # will be changed with admin2 from MFL
            if score >=80:
                match_admin2.loc[match_admin2["uniqueid"]==get_index, ADMIN2_matched] ="YES"
                match_admin2.loc[match_admin2["uniqueid"]==get_index, ADMIN2_2] = match_name
                
                # match score less than 80 will be false match and 
                # admin2 name from the input data will be kept.
                # the rows that thier admin2 did not match will not go 
                # next step matching process. Manually check admin1  if it is needed
            else:
                match_admin2.loc[match_admin2["uniqueid"]==get_index, ADMIN2_matched] ="NO"
# drop unique id column                        
match_admin2.drop("uniqueid", axis=1, inplace=True)
##==========================================================================##
# admin 2 match result
match_count=len(match_admin2[match_admin2[ADMIN2_matched] =="YES"][ADMIN2_2].unique())
notMatch_count=len(match_admin2[match_admin2[ADMIN2_matched] =="NO"][ADMIN2_2].unique())
print("##=== admin2 matcth result ===###")
print ( f" {match_count} admin 2 matched between input data and MFL")
print ( f" {notMatch_count} admin 2 did not matched between input data and MFL")

##=== admin2 matcth result ===###
 477 admin 2 matched between input data and MFL
 25 admin 2 did not matched between input data and MFL


### Matching  admin3 (wards/health area) names

In [47]:

# format admin3 names from input data
# new caloums to save match result and correct names if it matched
ADMIN3_2="clean_"+ADMIN3
ADMIN3_matched=ADMIN3_2+"_isMatch"
match_admin3=preclean_admin(match_admin2,ADMIN3, ADMIN3_2)


# # create unique id column with combination of admin1, admin2, and admin3
# # the unique id will be used for cheking if admin3 match to MFL admin3
# match_admin3[ADMIN3_matched]=""
# match_admin3["uniqueid"]=match_admin3[ADMIN1_2]+"_"+match_admin3[ADMIN2_2]+"_"+match_admin3[ADMIN3_2]

# # matching process limited to admin1 and admin2.
# # iterate by admin1
# for admin1_ in MFL_ADMIN1_list:
#     selected_admin2=MFL[MFL[MFL_ADMIN1]==admin1_][MFL_ADMIN2].unique().tolist()
  
#     # itarate by each admin2
#     for admin2_ in selected_admin2:
#         # get match candidates from MFL
#         match_candiates=MFL[(MFL[MFL_ADMIN1]==admin1_) &(MFL[MFL_ADMIN2]==admin2_)]\
#         [MFL_ADMIN3].unique().tolist() 
#         # get admin3 names to be match to MFL admin3
#         match_df=match_admin3[(match_admin3[ADMIN1_2]==admin1_) & (match_admin3[ADMIN2_2]==admin2_)]
#         if match_df.shape[0]>=1:
#             # group by admin3 in order to make matching process shorther
#             # since admin3 is repeated many times. So each admin3 will be match only once and
#             # then all rows with the same admin3 will be changed 
#             admin3_group=match_df.groupby([ADMIN1_2, ADMIN2_2,ADMIN3_2])
#             for i, group in admin3_group:

#                 get_admin2_name=i[2]
#                 # create unique id from each groups to update match result 
#                 # in the input data
#                 get_index=i[0]+"_"+i[1]+"_"+i[2]
#                 # matching
#                 match_name, score = process.extractOne(get_admin2_name, match_candiates)
#                 # short names gives low matching score even if a letter differnce
#                 # different score threshold is used based on matching name lenght
#                 # true match and admin2 from the input data 
#                 # will be changed with admin3 from MFL
#                 if (len(get_admin2_name) <=4) & (score >=70):
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_matched]="YES"
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_2]=match_name
                    
#                 if (len(get_admin2_name)==5) & (score >=80):
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_matched]="YES"
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_2]=match_name
                    
#                 if score >=85:
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_matched]="YES"
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_2]=match_name
                    
#                     # match score less than 80 will be false match and 
#                     # admin2 name from the input data will be kept.
#                     # the rows that thier admin2 did not match will not go 
#                     # next step matching process. Manually check admin1  if it is needed
#                 else:
#                     match_admin3.loc[match_admin3["uniqueid"]==get_index,ADMIN3_matched]="NO"
                 
                    
# # drop unique id column                        
# match_admin3.drop("uniqueid", axis=1, inplace=True)
# ##==========================================================================##
# # admin3 match result
# match_count=len(match_admin3[match_admin3[ADMIN3_matched] =="YES"][ADMIN3_2].unique())
# notMatch_count=len(match_admin3[match_admin3[ADMIN3_matched] =="NO"][ADMIN3_2].unique())
# print("##=== admin3  matcth result ===###")
# print ( f" {match_count} admin 3 matched between input data and MFL")
# print ( f" {notMatch_count} admin 3 did not matched between input data and MFL") 

### Matching facility names based on admin1, admin2 and facility name

In [48]:
def return_match(df,index_,match_name, match_uuid, match_result, match_score, match_type):    
    df.loc[df["uniqueid"]==index_,"face_name2"]=match_name
    df.loc[df["uniqueid"]==index_,"mfl_match_name"]=match_name
    df.loc[df["uniqueid"]==index_,"mfl_match_uuid"]=mfl_uuid1
    df.loc[df["uniqueid"]==index_,"mfl_match_result"]=match_result
    df.loc[df["uniqueid"]==index_,"mfl_match_score"]=match_score
    df.loc[df["uniqueid"]==index_,"mfl_match_type"]=match_type

## facility name will be match in two different scenario
# 1. If health area match, the  match will be based on health zone, health area and facility name
# 2. If health areas did not match, he match will be based on health zone and facility name


# add new fields to save match
match_admin3["face_name2"]=match_admin3["clean_name_final"]
match_admin3["mfl_match_name"]=""
match_admin3["mfl_match_uuid"]=""
match_admin3["mfl_match_result"]=""
match_admin3["mfl_match_score"]=0
match_admin3["mfl_match_type"]=""
#match for cases if admin3 match between Mfl and input data
fname_match1=match_admin3[match_admin3[ADMIN2_matched] =="YES"]

# 1. Match will be based on health zone, health area and facility name
# get clean facility name
# create unique id column with combination of admin1, admin2, admin3 and facility name
# the unique id will be used for cheking if facility name match to MFL facility name
fname_match1["uniqueid"]=fname_match1[ADMIN1_2]+"_"+fname_match1[ADMIN2_2]+"_"+fname_match1["face_name2"]

for admin1_ in MFL_ADMIN1_list:
    selected_admin2=MFL[MFL[MFL_ADMIN1_2]==admin1_][MFL_ADMIN2_2].unique().tolist()  
    for admin2_ in selected_admin2:   
        match_candiates=MFL[(MFL[MFL_ADMIN1_2]==admin1_) &\
                         (MFL[MFL_ADMIN2_2]==admin2_)][MFL_FACE_NAME].unique().tolist()
        match_df=fname_match1[(fname_match1[ADMIN1_2]==admin1_) & (fname_match1[ADMIN2_2]==admin2_)]
        if match_df.shape[0]>=1:
            # group by face name in order to make matching process shorther
            # since face name is repeated many times. So each face name will be match only once and
                # then all rows with the same face name will be changed 
            fname_match1_group=match_df.groupby([ADMIN1_2, ADMIN2_2, "face_name2"])
            for i, group in fname_match1_group:
                get_fname_name=i[2]
                get_index=i[0]+"_"+i[1]+"_"+i[2]
                if len(match_candiates)>=1:
                    match_name, score = process.extractOne(get_fname_name, match_candiates, scorer=fuzz.token_sort_ratio)
                    match_name2, score2 = process.extractOne(get_fname_name, match_candiates, scorer=fuzz.partial_ratio)
                    
                    mfl_uuid1="/".join(MFL[(MFL[MFL_ADMIN1_2]==admin1_)&(MFL[MFL_ADMIN2_2]==admin2_)&(MFL[MFL_FACE_NAME]==match_name)]["mfl_uuid"].tolist())
                    mfl_uuid2="/".join(MFL[(MFL[MFL_ADMIN1_2]==admin1_)&(MFL[MFL_ADMIN2_2]==admin2_)&(MFL[MFL_FACE_NAME]==match_name2)]["mfl_uuid"].tolist())
            
               
                    if len(get_fname_name) <=4 and score >=75:
                        return_match(fname_match1,get_index, match_name, mfl_uuid1,"YES", score, "Simple")  
                        if score>=90:
                            match_candiates.remove(match_name)
                        continue
                    elif len(get_fname_name) ==5 and score >=80: 
                        return_match(fname_match1,get_index, match_name,mfl_uuid1,"YES", score, "Simple") 
                        if score>=90:
                            match_candiates.remove(match_name)
                        continue
                    elif len(get_fname_name) >5 and score >=83:
                        return_match(fname_match1,get_index, match_name, mfl_uuid1,"YES", score, "Simple") 
                        if score>=90:
                            match_candiates.remove(match_name)
                        continue
                    elif len(get_fname_name) >=5 and score <83 and score2==100:
                        return_match(fname_match1,get_index, match_name2, mfl_uuid2,"YES", score2, "Partial") 
                        if score==100:
                            match_candiates.remove(match_name2)
                        continue

                    else:
                        fname_match1.loc[fname_match1["uniqueid"]==get_index,"mfl_match_result"]="NO"
                        fname_match1.loc[fname_match1["uniqueid"]==get_index,"mfl_match_score"]=score
                        fname_match1.loc[fname_match1["uniqueid"]==get_index,"mfl_match_name"]=match_name
                        fname_match1.loc[fname_match1["uniqueid"]==get_index,"mfl_match_uuid"]=mfl_uuid1
                        
fname_match1.drop("uniqueid", axis=1, inplace=True)
fname_match1_yes=fname_match1[fname_match1["mfl_match_result"] =="YES"]
##==========================================================================##
# ## match by health zone health area and facility name
# ## exclude if there are facilities with the same but different type
# fname_match1_MFL=fname_match1_.merge(MFL[[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                                                "hf_code", 'duplicate_uuid', 'count1']],
#                                      left_on=[ADMIN2_2, "face_name2"],
#                                      right_on=[MFL_ADMIN2_2,MFL_FACE_NAME],how="left")


# # chnage type of the facility to MFL if facility name is unique in admin1
# for index, row in fname_match1_MFL.iterrows():
#     # if type of facility unique for each facility name ( 'pyrmd_count1' !=1)
#     # replace type to be identical to MFL
#     if fname_match1_MFL.at[index, 'count1']==1: 
#         fname_match1_MFL.at[index,SUB_TYPE]=fname_match1_MFL.at[index,MFL_FACE_TYPE]
# ## rematch facilities if facility name is duplicated in the same admin1 
# ## count column in MFL indicates duplication of facility name: count2==2
# ## seperate facilities that may duplicate match
# ## non duplicated
# non_duplicate=fname_match1_MFL[(fname_match1_MFL['count1']<2)| (fname_match1_MFL['count1'].isnull())]  
# ## may duplicated
# get_duplicate_match=fname_match1_MFL[fname_match1_MFL['count1']>=2]
# ## drop MFL columns from previous match 
# get_duplicate_match.drop([MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                             "hf_code", 'duplicate_uuid', 'count1'], axis=1, inplace=True)
# ## drop duplicate cases from in the first match by using uuid column 
# get_duplicate_match.drop_duplicates(["uuid"], inplace=True)
# ## rematch with MFL by adding facility type in the match list
# get_duplicate_match=get_duplicate_match.merge(MFL[[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                                                "hf_code", 'duplicate_uuid', 'count1']],
#                                      left_on=[ADMIN2_2, "face_name2",SUB_TYPE],
#                                      right_on=[MFL_ADMIN2_2,MFL_FACE_NAME,MFL_FACE_TYPE],how="left")
# ## remerge 
# fname_match1_MFL2=pd.concat([non_duplicate,get_duplicate_match]) 

##==========================================================================##

match_count1=len(fname_match1[fname_match1["mfl_match_result"] =="YES"]["face_name2"].unique())
match_count2=len(fname_match1[fname_match1["mfl_match_result"] =="NO"]["face_name2"].unique())
print("##=== facility name matcth result (based on admin1, admin2) ===##")
print ( f" >>> {match_count1} facility name matched between input data and MFL")
print  ( f" >>> {match_count2} facility name did not matched between input data and MFL") 


##=== facility name matcth result (based on admin1, admin2) ===##
 >>> 3955 facility name matched between input data and MFL
 >>> 15094 facility name did not matched between input data and MFL


### Matching facility names based on admin1, admin2 and facility name

In [49]:
# 2. If health areas did not match, he match will be based on health zone and facility name

fname_match1_no=fname_match1[(fname_match1["mfl_match_result"] =="NO")|(fname_match1["mfl_match_result"] =="")]
fname_notMatch2_no=match_admin3[match_admin3[ADMIN2_matched] =="NO"]

fname_notMatch2=pd.concat([fname_match1_no,fname_notMatch2_no])                              
                              
# 1. Match will be based on health zone, health area and facility name
fname_notMatch2["uniqueid"]=fname_notMatch2[ADMIN1_2]+"_"+fname_notMatch2["face_name2"]

for admin1_ in MFL_ADMIN1_list:  
    match_candiates=MFL[MFL[MFL_ADMIN1_2]==admin1_][MFL_FACE_NAME].unique().tolist()   
    match_df=fname_notMatch2[fname_notMatch2[ADMIN1_2]==admin1_]
    if match_df.shape[0]>=1:
        # group by face name in order to make matching process shorther
        # since face name is repeated many times. So each face name will be match only once and
        # then all rows with the same face name will be changed 
        fname_notMatch2_group=match_df.groupby([ADMIN1_2, "face_name2"])
        for i, group in fname_notMatch2_group:
            get_fname_name=i[1]
            get_index=i[0]+"_"+i[1]
            if len(match_candiates)>=1:
                match_name, score = process.extractOne(get_fname_name, match_candiates, scorer=fuzz.token_sort_ratio)
                match_name2, score2 = process.extractOne(get_fname_name, match_candiates, scorer=fuzz.partial_ratio)
                
                mfl_uuid1="/".join(MFL[(MFL[MFL_ADMIN1_2]==admin1_)&(MFL[MFL_FACE_NAME]==match_name)]["mfl_uuid"].tolist())
                mfl_uuid2="/".join(MFL[(MFL[MFL_ADMIN1_2]==admin1_)&(MFL[MFL_FACE_NAME]==match_name2)]["mfl_uuid"].tolist())
            
              
                if len(get_fname_name) <=4 and score >=75:
                    return_match(fname_notMatch2,get_index, match_name, mfl_uuid1,"YES", score, "Simple")  
                    if score>=90:
                        match_candiates.remove(match_name)
                    continue
                elif len(get_fname_name) ==5 and score >=80: 
                    return_match(fname_notMatch2,get_index, match_name,mfl_uuid1,"YES", score, "Simple") 
                    if score>=90:
                        match_candiates.remove(match_name)
                    continue
                elif len(get_fname_name) >5 and score >=83:
                    return_match(fname_notMatch2,get_index, match_name, mfl_uuid1,"YES", score, "Simple") 
                    if score>=90:
                        match_candiates.remove(match_name)
                    continue
                elif len(get_fname_name) >=5 and score <83 and score2==100:
                    return_match(fname_notMatch2,get_index, match_name2, mfl_uuid2,"YES", score2, "Partial") 
                    if score==100:
                        match_candiates.remove(match_name2)
                    continue

                else:
                    fname_notMatch2.loc[fname_notMatch2["uniqueid"]==get_index,"mfl_match_result"]="NO"
                    fname_notMatch2.loc[fname_notMatch2["uniqueid"]==get_index,"mfl_match_score"]=score
                    fname_notMatch2.loc[fname_notMatch2["uniqueid"]==get_index,"mfl_match_name"]=match_name
                    fname_notMatch2.loc[fname_notMatch2["uniqueid"]==get_index,"mfl_match_uuid"]=mfl_uuid1
                    
fname_notMatch2.drop("uniqueid", axis=1, inplace=True)

# ##==========================================================================##
# ## match by health zone health area and facility name
# ## exclude if there are facilities with the same but different type
# fname_notMatch2_MFL=fname_notMatch2.merge(MFL[[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                                                "hf_code", 'duplicate_uuid', 'count2']],
#                                      left_on=[ADMIN1_2, "face_name2"],
#                                      right_on=[MFL_ADMIN1_2,MFL_FACE_NAME],how="left")


# # chnage type of the facility to MFL if facility name is unique in admin1
# for index, row in fname_notMatch2_MFL.iterrows():
#     # if type of facility unique for each facility name ( 'pyrmd_count1' !=1)
#     # replace type to be identical to MFL
#     if fname_notMatch2_MFL.at[index, 'count2']==1: 
#         fname_notMatch2_MFL.at[index,SUB_TYPE]=fname_notMatch2_MFL.at[index,MFL_FACE_TYPE]
# ## rematch facilities if facility name is duplicated in the same admin1 
# ## count column in MFL indicates duplication of facility name: count2==2
# ## seperate facilities that may duplicate match
# ## non duplicated
# non_duplicate=fname_notMatch2_MFL[(fname_notMatch2_MFL['count2']<2)| (fname_notMatch2_MFL['count2'].isnull())]  
# ## may duplicated
# get_duplicate_match=fname_notMatch2_MFL[fname_notMatch2_MFL['count2']>=2]
# ## drop MFL columns from previous match 
# get_duplicate_match.drop([MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                             "hf_code", 'duplicate_uuid', 'count2'], axis=1, inplace=True)
# ## drop duplicate cases from in the first match by using uuid column 
# get_duplicate_match.drop_duplicates(["uuid"], inplace=True)
# ## rematch with MFL by adding facility type in the match list
# get_duplicate_match=get_duplicate_match.merge(MFL[[MFL_ADMIN1_2,MFL_ADMIN2_2,MFL_ID,MFL_FACE_NAME, MFL_FACE_TYPE,
#                                                "hf_code", 'duplicate_uuid', 'count2']],
#                                      left_on=[ADMIN2_2, "face_name2",SUB_TYPE],
#                                      right_on=[MFL_ADMIN1_2,MFL_FACE_NAME,MFL_FACE_TYPE],how="left")
# ## remerge 
# fname_notMatch2_MFL2=pd.concat([non_duplicate,get_duplicate_match])            

##==========================================================================##
match_count=len(fname_notMatch2[fname_notMatch2["mfl_match_result"] =="YES"]["face_name2"].unique())
notMatch_count=len(fname_notMatch2[fname_notMatch2["mfl_match_result"] =="NO"]["face_name2"].unique())
print("##=== facility name matcth result (based on admin2)===##")
print ( f">>> {match_count} admin 3 matched between input data and MFL")
print ( f">>> {notMatch_count} admin 3 did not matched between input data and MFL") 

##=== facility name matcth result (based on admin2)===##
>>> 2991 admin 3 matched between input data and MFL
>>> 11714 admin 3 did not matched between input data and MFL


In [50]:


merge_all=pd.concat([fname_match1_yes,fname_notMatch2])


# drop duplicated match by uuid (uuid is uique for each row)
merge_all.drop_duplicates(subset=['uuid'], inplace=True)

# fill empty rows with "NA" for admin1, admin2, admin3, facility name, and type
merge_all['face_name2'].fillna("NA",inplace=True)
merge_all['clean_name_final'].fillna("NA",inplace=True)
merge_all[SUB_TYPE].fillna("NA",inplace=True)
merge_all[ADMIN1_2].fillna("NA",inplace=True)
merge_all[ADMIN2_2].fillna("NA",inplace=True)
merge_all[ADMIN3_2].fillna("NA",inplace=True)

merge_all.spatial.to_featureclass(OUTPUT)     
merge_all.to_csv(OUTPUT_csv )


In [421]:
### get count of word frequency
word_count=merge_all[CLEAN_NAME_FINAL].str.split(expand=True).stack().value_counts()
word_count_df=pd.DataFrame(word_count).reset_index().rename({0:"frequancy", "index":"abrv"}, axis=1)
word_count_df=word_count_df[(word_count_df["abrv"].str.len()<=4)&(word_count_df["frequancy"]>=10)]
word_count_df.to_csv(output_loc+"\\abrv_frequancy.csv")


In [422]:
## get count of facility type
type_count=merge_all['type_extract'].value_counts()
type_count_df=pd.DataFrame(type_count).reset_index().rename({0:"frequancy", "index":"type"}, axis=1)
type_count_df.to_csv(output_loc+"\\type_list.csv")

In [61]:
word_count=result[CLEAN_NAME_FINAL].str.split(expand=True).stack().value_counts()
word_count_df=pd.DataFrame(word_count).reset_index().rename({0:"frequancy", "index":"abrv"}, axis=1)
word_count_df=word_count_df[(word_count_df["abrv"].str.len()<=4)&(word_count_df["frequancy"]>=10)]

In [40]:
result=pd.DataFrame.spatial.from_featureclass("DRC_mfl")

In [41]:
alist=result["facility_short"].unique().tolist()
