<h1>Preprocessing Site Names<span class="tocSkip"></span></h1>
<div class="toc">
    <ul class="toc-item">
        <li>
        <span><a href="#Packages-and-functions" data-toc-modified-id="Packages-and-functions-1">
        <span class="toc-item-num">1&nbsp;&nbsp;</span>Packages and functions</a></span>
        </li>
        <li>
            <span><a href="#Prepare-and-match-admin-boundary-data" data-toc-modified-id="Prepare-and-match-admin-boundary-data-2">
            <span class="toc-item-num">2&nbsp;&nbsp;</span>Prepare and match admin boundary data</a></span>
            <ul class="toc-item">
                <li>
                <span><a href="#Match-health-list-admin-names-to-shapefile-admin-names" data-toc-modified-id="Match-health-list-admin-names-to-shapefile-admin-names-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Match health list admin names to shapefile admin names</a></span>
                <ul class="toc-item">
                <li>
                <span><a href="#Match-orgunitlevel4-to-Geob-Adm-3" data-toc-modified-id="Match-orgunitlevel4-to-Geob-Adm-3-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Match orgunitlevel4 to Geob Adm 3</a></span>
                </li>
                <li>
                <span><a href="#Match-orgunitlevel3-to-Geob-Adm-2" data-toc-modified-id="Match-orgunitlevel3-to-Geob-Adm-2-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Match orgunitlevel3 to Geob Adm 2</a></span>
                </li>
                <li>
                <span><a href="#Match-orgunitlevel2-to-Geob-Adm-1" data-toc-modified-id="Match-orgunitlevel2-to-Geob-Adm-1-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Match orgunitlevel2 to Geob Adm 1</a></span>
                </li>
                </ul>
            </li>
        </ul>
    </li>
    </ul>
</div>

Version: April 11, 2023

This notebook takes health facilities from the HMIS database and attempts to geolocate them by joining them to administrative shapefiles and to other sources of geolocated health facilities, and by running queries against geocoding APIs.

**Data Sources**
- Health facilities (hierarchy list)
- Geoboundaries (Adm1 through 4)
- FEWS Admin-2 boundaries (updated post 2017)

# Packages and functions

In [1]:
from os.path import join
import pandas as pd


# local imports
import preprocessing_utils as ppu
import search_utils as ssu

In [2]:
from ctypes.util import find_library
find_library('c')

'/usr/lib/libc.dylib'

# Prepare and match admin boundary data

In [23]:
iso3 = "ETH"
country = "Ethiopia"
input_dir = "/Users/dianaholcomb/Documents/GWU/6501_Capstone/workspace/data"
input_filename = "ethiopia2023-04-17.csv"
output_dir = join(input_dir, "output", iso3)
num_admin_levels = 3
geoboundary_words_to_remove = [" City Council", " District Council", " Municipal Council", " District"]
org_level_words_to_remove = [" District", " PHCU", " Regional Health Bureau", " WorHO", " ZHD"]
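
The two word lists above are passed to `ppu.remove_words` later in the notebook. That helper's implementation is not shown here, so the following is only a minimal sketch of the idea, assuming it amounts to plain substring removal. The function `strip_words` and the raw input names are hypothetical and exist only for illustration.

```python
def strip_words(name, words_to_remove):
    """Return `name` with every listed substring removed and whitespace trimmed."""
    for word in words_to_remove:
        name = name.replace(word, "")
    return name.strip()


# Same list as in the configuration cell above
org_level_words_to_remove = [" District", " PHCU", " Regional Health Bureau", " WorHO", " ZHD"]

# Made-up raw names, not taken from the HMIS data
raw_names = ["Harari Regional Health Bureau", "Kaffa ZHD", "Chena WorHO"]

cleaned = [strip_words(n, org_level_words_to_remove) for n in raw_names]
print(cleaned)  # ['Harari', 'Kaffa', 'Chena']
```

With this approach the order of the list matters: a longer phrase such as " District Council" has to come before its prefix " District", otherwise a stray " Council" is left behind.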

In [24]:
geob_arr = ppu.get_geoboundares(num_admin_levels, iso3)

In [25]:
master_table = ppu.process_masterDF(input_dir, input_filename)
master_table.to_csv(f"{input_dir}/temp_{iso3}_clean.csv")

Len of original data: 94
Len of clean data: 14

Unique Level 2: 2
Unique Level 3: 2
Unique Level 4: 3
Unique Level 5: 14


In [26]:
# Collect a sorted list of shape names for each admin level
geobList_arr = []
for admIdx, geob in enumerate(geob_arr, start=1):
    print(f"Unique Geoboundaries Adm {admIdx}: {len(geob)}")
    geobList_arr.append(sorted(geob.shapeName))

Unique Geoboundaries Adm 1: 11
Unique Geoboundaries Adm 2: 74
Unique Geoboundaries Adm 3: 690


## Match health list admin names to shapefile admin names
### Match highest Geob Adm to orgunitlevel
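Try fuzzy matching the cleaned orgunitlevel names against the geoBoundaries `shapeName` values, starting from the most detailed admin level.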
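
As a rough illustration of fuzzy name matching, the sketch below scores each candidate with `difflib.SequenceMatcher` from the standard library and keeps the best one above a cutoff. This is not the implementation of `ssu.find_matches`, which is not shown in this notebook. The function `best_match`, the 0.75 cutoff, and the candidate names are assumptions made for the example.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.75):
    """Return (candidate, score) for the closest candidate, or None if below threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        # ratio() is a similarity score in [0, 1]; compare case-insensitively
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= threshold else None


# Hypothetical name lists, for illustration only
org_names = ["Sofi", "Chena", "Kaffa"]
geob_names = ["Cheha", "Chena", "Keffa", "Harari"]

matches = {name: best_match(name, geob_names) for name in org_names}
for name, result in matches.items():
    print(name, "->", result)
```

In this toy example "Chena" matches exactly, "Kaffa" is paired with the spelling variant "Keffa", and "Sofi" has no candidate close enough, so it stays unmatched. A lower cutoff finds more pairs but also more false positives, so the matched names still need a manual check.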

In [27]:
# Pick the orgunit level that pairs with the most detailed geoBoundaries level
if num_admin_levels == 2:
    org_unit_level = 3
elif num_admin_levels > 2:
    org_unit_level = 4
else:
    org_unit_level = 1

print(f"Org Unit Level: {org_unit_level}")

Org Unit Level: 4
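
To make the pairing concrete: the matching loop in this notebook steps the orgunit level and the geoBoundaries level down together, so the starting level chosen here fixes every later pair. The sketch below lists those pairs. The function `level_pairs` is a hypothetical helper written for this illustration and mirrors the branching in the cell above.

```python
def level_pairs(num_admin_levels):
    """List (orgunit level, geoBoundaries level) pairs, most detailed level first."""
    # Same starting rule as the notebook cell above
    if num_admin_levels == 2:
        org_unit_level = 3
    elif num_admin_levels > 2:
        org_unit_level = 4
    else:
        org_unit_level = 1

    pairs = []
    for geob_level in range(num_admin_levels, 0, -1):
        pairs.append((org_unit_level, geob_level))
        org_unit_level -= 1  # both levels are decremented in step
    return pairs


print(level_pairs(3))  # [(4, 3), (3, 2), (2, 1)]
print(level_pairs(2))  # [(3, 2), (2, 1)]
```

For `num_admin_levels = 3` this yields orgunitlevel4 with Adm 3, orgunitlevel3 with Adm 2, and orgunitlevel2 with Adm 1.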


### Do Matching

In [28]:
curr_geob_lvl = num_admin_levels
curr_org_lvl = org_unit_level
master_table_copy = master_table.copy()

for geobIdx in range(num_admin_levels-1, -1, -1): # reverse loop
    print(f"-----Master list level: {curr_org_lvl}, Geoboundaries level: {curr_geob_lvl}-----")

    master_table_copy.loc[:, f"orgunitlevel{curr_org_lvl}_edit"] = master_table_copy[f"orgunitlevel{curr_org_lvl}"]

    geob_list = ppu.remove_words(geob_arr[geobIdx], "shapeName", geoboundary_words_to_remove)

    org_lvl_list = ppu.remove_words(master_table_copy, f"orgunitlevel{curr_org_lvl}_edit", org_level_words_to_remove)

    # Print names to inspect
    print(org_lvl_list)
    #ppu.inspect_level_names(curr_org_lvl, org_lvl_list, curr_geob_lvl, geobList_arr[geobIdx])

    table_adm_matches = ssu.find_matches(org_lvl_list.tolist(), geob_list.tolist(), 30, curr_org_lvl, curr_geob_lvl)
    matches_pct = (len(table_adm_matches) / len(org_lvl_list))
    print('Matches for Org level {}, Geob level {}: {:.2f}%'.format(curr_org_lvl, curr_geob_lvl, matches_pct*100))

    # Map each cleaned org-unit name to its matched geoBoundaries name and store it in a
    # new adm<level> column. The first match wins; unmatched names stay NaN, and the
    # column exists even when nothing matched, so the count below cannot raise a KeyError.
    org_col = f"orgunitlevel{curr_org_lvl}_edit"
    lookup = (
        table_adm_matches
        .drop_duplicates(subset=f"name_level{curr_org_lvl}")
        .set_index(f"name_level{curr_org_lvl}")[f"name_geob{curr_geob_lvl}"]
    )
    master_table_copy[f"adm{curr_geob_lvl}"] = master_table_copy[org_col].map(lookup)

    # Number of rows still without a match at this level
    print(master_table_copy[f"adm{curr_geob_lvl}"].isna().sum())

    # Step down to the next (coarser) admin level
    curr_geob_lvl -= 1
    curr_org_lvl -= 1
    print("------------------------------------------")


-----Master list level: 4, Geoboundaries level: 3-----
['Sofi' 'Chena']
Matches for Org level 4, Geob level 3: 50.00%
11
------------------------------------------
-----Master list level: 3, Geoboundaries level: 2-----
['Sofi' 'Kaffa']
Matches for Org level 3, Geob level 2: 50.00%
10
------------------------------------------
-----Master list level: 2, Geoboundaries level: 1-----
['Harari' 'Southwest Ethiopia']
Matches for Org level 2, Geob level 1: 50.00%
4
------------------------------------------
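
The output above reports two different figures per level. The percentage divides the number of matches by the length of the cleaned name list, which appears from the printed arrays to hold distinct names, while the integer counts rows of the master table with no `adm` value. The sketch below shows how the two can diverge. It rests on that reading of the output, and `match_summary` and its inputs are hypothetical.

```python
def match_summary(row_names, lookup):
    """Return (share of distinct names matched, number of rows left unmatched)."""
    distinct = {n for n in row_names if n is not None}
    matched_distinct = sum(1 for n in distinct if n in lookup)
    # Rows with a missing name count as unmatched too
    unmatched_rows = sum(1 for n in row_names if n not in lookup)
    return matched_distinct / len(distinct), unmatched_rows


# Made-up rows: one name per facility, None where the level is missing
row_names = ["Sofi", "Sofi", "Chena", "Chena", "Chena", None]
lookup = {"Chena": "Chena"}  # only one of the two distinct names matched

share, unmatched = match_summary(row_names, lookup)
print(f"{share:.2%} of distinct names matched, {unmatched} rows unmatched")
```

Here half of the distinct names match, yet three of six rows stay unmatched, because rows are weighted by how many facilities share a name and rows with a missing name can never match.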


In [9]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
master_table_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   index                        14 non-null     int64  
 1   orgunitlevel1                14 non-null     object 
 2   orgunitlevel2                14 non-null     object 
 3   orgunitlevel3                14 non-null     object 
 4   orgunitlevel4                12 non-null     object 
 5   orgunitlevel5                10 non-null     object 
 6   orgunitlevel6                1 non-null      object 
 7   organisationunitid           14 non-null     object 
 8   organisationunitname         14 non-null     object 
 9   organisationunitcode         14 non-null     object 
 10  organisationunitdescription  0 non-null      float64
 11  orgunitlevel4_edit           12 non-null     object 
dtypes: float64(1), int64(1), object(10)
memory usage: 1.4+ KB


In [29]:
master_table_copy.to_csv(f"{input_dir}/preprocess_{iso3}_matches.csv")