This notebook performs general cleaning on the facility name column and also removes facility type information from the facility name. The extracted type information is then mapped to one of the facility types in the type dictionary.

The output columns include:
- CLEAN_NAME: clean facility name after some pre-cleaning
- CLEAN_NAME_FINAL: final clean name after removing type information
- EXTRACT_TYPE: type information extracted from the name, i.e. the words present in CLEAN_NAME but absent from CLEAN_NAME_FINAL
- SUB_TYPE: facility type defined in the type dictionary, obtained by mapping EXTRACT_TYPE to the type dictionary
- SCORE: match score between EXTRACT_TYPE and SUB_TYPE (scale 0-100); a score of 100 is a perfect match, so the column can be used to keep perfect-match results only

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import fiona
import os
import unidecode
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from ordered_set import OrderedSet
pd.set_option('mode.chained_assignment', None)

In [2]:
# import dataset as df
dataDir = r"C:\Users\DUANYUEYUN\Documents\ArcGIS\Projects\GRID3\Healthsites"
priority_countries = ['South Sudan', 'Mozambique', 'Namibia', 'Nigeria', 'Zambia',
                      'Sierra Leone', 'Ghana',  'Burkina Faso', 'Ethiopia', 'Somalia',
                      'Rwanda', 'Kenya', 'Zimbabwe', 'Democratic Republic of the Congo']
dfs = []
for country in priority_countries:
    filename = country + '-node.shp'
    path = os.path.join(dataDir, country, filename)
    df = gpd.read_file(path)
    df['country'] = country
    dfs.append(df)
df = pd.concat(dfs, axis=0)
df.reset_index(drop=True, inplace=True)

# get the index, for mapping processed data to original dataset
df.reset_index(inplace=True)

# import type dictionary as TYPE_DICT
dataDir = r"C:\Users\DUANYUEYUN\Documents\GRID3\Health facilities\Data\Africa"
TYPE_DICT = pd.read_csv(os.path.join(dataDir, "type_dict_augmented_1130.csv"))

In [3]:
# INPUT
# facility name column
FACILITY_NAME = 'name'
# country column
COUNTRY = 'country'

# OUTPUT
# output columns
CLEAN_NAME = 'clean_name' # clean name after some pre-cleaning
CLEAN_NAME_FINAL = 'clean_name_final' # final clean name after removing type information
EXTRACT_TYPE = 'type_extract' # type information extracted
SUB_TYPE = 'sub_type' # type mapped to the type dictionary
SCORE = 'score' # match score between 'type_extract' and 'sub_type'

# path to save cleaned results
SAVE_PATH = r"C:\Users\DUANYUEYUN\Documents\GRID3\Health facilities\Data\Africa\healthsites_cleaned_1202.csv"

Note: the country names in the dataset's country column must match those in the type dictionary, ignoring case.

In [4]:
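# list countries in the dataset that do not exist in the type dictionary
# (compared case-insensitively, and each country is checked only once)
dict_countries = set(TYPE_DICT['Country'].str.upper())
for c in df[COUNTRY].unique():
    if c.upper() not in dict_countries:
        print(c)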

In [5]:
print("Country names in the type dictionary:")
print(TYPE_DICT['Country'].unique())

Country names in the type dictionary:
['Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi' 'Cameroon'
 'Cape Verde' 'Central African Republic' 'Chad' 'Comoros' 'Congo'
 "Cote d'Ivoire" 'Democratic Republic of the Congo' 'Djibouti'
 'Equatorial Guinea' 'Eritrea' 'Ethiopia' 'Gabon' 'Gambia' 'Ghana'
 'Guinea' 'Guinea Bissau' 'Kenya' 'Lesotho' 'Liberia' 'Madagascar'
 'Malawi' 'Mali' 'Mauritania' 'Mauritius' 'Mozambique' 'Namibia' 'Niger'
 'Nigeria' 'Rwanda' 'Sao Tome and Principe' 'Senegal' 'Seychelles'
 'Sierra Leone' 'Somalia' 'South Africa' 'South Sudan' 'Sudan' 'Tanzania'
 'Togo' 'Uganda' 'Zambia' 'Zanzibar' 'Zimbabwe' 'eSwatini']


# Define functions

## `clean_name`

Pre-cleaning on facility name:

- replace punctuation with spaces and change '&' to 'and'
- correct the spelling of common words (e.g. 'center' to 'centre', 'Hopital' to 'Hospital')
- collapse double whitespace into a single space and strip leading and trailing whitespace
- remove accent marks

Note: NA values in the facility name column are first replaced with the empty string '' and converted back to NA after cleaning.
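
The steps above can be sketched on a single string with the standard library alone. This is an illustration rather than the notebook's implementation: `preclean_one` and `SPELLING_FIXES` are hypothetical names, the spelling table is only a subset of the substitutions used in `preclean`, and `unicodedata` stands in for `unidecode`, so results may differ for characters that have no accent-free decomposition.

```python
import re
import unicodedata

# Hypothetical, illustrative subset of the spelling substitutions.
SPELLING_FIXES = {
    r"center": "centre",
    r"h[oô]pital": "Hospital",
    r"clinique": "Clinic",
}

def preclean_one(name):
    """Return a cleaned copy of `name`, or '' when the name is missing."""
    if not isinstance(name, str):
        return ""
    # '&' becomes 'and'; the remaining punctuation becomes a space
    text = name.replace("&", " and ")
    text = re.sub(r"[.:'\"\-_,/()]", " ", text)
    # normalise the spelling of common words, ignoring case
    for pattern, replacement in SPELLING_FIXES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # strip accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # collapse any run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

print(preclean_one("Hôpital St. Jean & Paul  Health Center"))
```

As in `preclean`, the replacement text is inserted as written (lower-case 'centre'); the later steps title-case the name anyway.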

In [6]:
def preclean(df, facility_name = FACILITY_NAME, clean_name = CLEAN_NAME):
    # replace NAs with empty string ''
    df[facility_name] = df[facility_name].fillna('')
    
    # regex is passed explicitly: pandas 2.0 changed the default of
    # str.replace from regex=True to regex=False, which would silently
    # turn the patterns below into literal strings
    df[clean_name] = df[facility_name].str.strip()\
            .str.replace("  ", " ", regex=False)\
            .str.replace('.', ' ', regex=False)\
            .str.replace(':', ' ', regex=False)\
            .str.replace("'", ' ', regex=False)\
            .str.replace('"', ' ', regex=False)\
            .str.replace(r'[-_,/()]', ' ', regex=True)\
            .str.replace('&', ' and ', regex=False)\
            .str.strip()\
            .str.replace('center', 'centre', case=False, regex=True)\
            .str.replace('Clinique', 'Clinic', case=False, regex=True)\
            .str.replace('Polyclinique', 'Polyclinic', case=False, regex=True)\
            .str.replace('Geral', 'General', case=False, regex=True)\
            .str.replace('Dispensaire', 'Dispensary', case=False, regex=True)\
            .str.replace('Hôpital', 'Hospital', case=False, regex=True)\
            .str.replace('Hopital', 'Hospital', case=False, regex=True)\
            .str.replace('Hospitais', 'Hospital', case=False, regex=True)\
            .str.replace(' Hosp | hosp$', ' Hospital ', case=False, regex=True)\
            .str.replace("Urbain", "Urban", case=False, regex=True)\
            .str.replace("Distrital", "District", case=False, regex=True)\
            .str.replace(r'\s+', ' ', regex=True)\
            .str.strip()
    
    # replace NAs in clean_name with empty string ''
    df[clean_name] = df[clean_name].fillna('')
    
    # change empty string in facility_name back to NA
    df[facility_name] = df[facility_name].replace('', np.nan)

    # remove accent marks
    df[clean_name] = [unidecode.unidecode(n) for n in df[clean_name]]

## `clean_name_final`

Use facility type and abbreviations in the type dictionary as keywords and remove type information from `clean_name` to create the `clean_name_final` column.
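
The keyword-removal idea can be sketched as follows. The helper name `strip_type_keywords` and the sample keywords are assumptions made for illustration, and the sketch deliberately differs from `remove_type_info` in two ways: keywords are passed through `re.escape`, and abbreviations are matched with `\b` word boundaries instead of the four explicit start, middle, end and whole-string patterns.

```python
import re

def strip_type_keywords(name, type_keywords, abbreviations):
    """Remove type words and standalone abbreviations from `name`."""
    text = name
    # longest keywords first, so 'Health Centre' wins over 'Centre'
    keywords = sorted(type_keywords, key=len, reverse=True)
    if keywords:  # an empty pattern would match between every character
        # re.escape keeps characters such as '(' from acting as regex syntax
        pattern = "|".join(re.escape(k) for k in keywords)
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # abbreviations only count as whole words, hence the \b boundaries
    for abbrev in sorted(abbreviations, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(abbrev)}\b", " ", text, flags=re.IGNORECASE)
    # drop connecting words (de, do, da, du) left behind
    text = re.sub(r"\b(de|do|da|du)\b", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

print(strip_type_keywords("Centro De Saude De Furancungo",
                          ["Centro de Saude", "Centro", "Saude"], ["CS"]))
print(strip_type_keywords("Ps Hewa Bora",
                          ["Poste de Sante", "Poste", "Sante"], ["PS"]))
```

Escaping matters when a dictionary entry contains punctuation such as parentheses, which would otherwise be interpreted as a regex group.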

In [7]:
def remove_type_info(df, type_dict, clean_name, clean_name_final, country):
    # remove whitespace between abbreviations of length 2 or 3
    # e.g. change C S to CS
    
    # obtain abbreviations of length 2 or 3
    tmp = type_dict[type_dict['Abbreviation'].str.len().between(2, 3)]['Abbreviation'].unique()
    # sort by decreasing length
    tmp = sorted(tmp, key=len, reverse=True)
    # change it to the pattern '^c s ' or ' c s$'
    tmp_dict = {}
    for t in tmp:
        tmp_dict[t] = ['^'+' '.join(list(t))+' ', ' '+' '.join(list(t))+'$']
    # replace the pattern with 'cs'
    for t in tmp:
        pats = tmp_dict[t]
        df[clean_name] = df[clean_name].str.replace(pats[0], t+' ', case=False, regex=True)\
        .str.replace(pats[1], ' '+t, case=False, regex=True)
        
    # remove type information
    df_grouped = df.groupby(country)
    res = pd.DataFrame()

    for group_name, df_group in df_grouped:
        # obtain the type dictionary for that country
        tmp = type_dict[type_dict['Country'].str.upper()==group_name.upper()]

        # facility types for that country
        types = list(tmp['Type'])
        type_keywords = set()
        for t in types:
            # add the full facility type 
            t = t.title()
            type_keywords.add(t)                 

            # add individual words as well
            t = t.replace('/', ' ')
            words = t.split(' ')
            # skip words that contain punctuation / numbers or have length <= 3 (e.g. de, (major))
            words = [w for w in words if w.isalpha() and len(w)>3]
            for w in words:
                type_keywords.add(w)

        # obtain the list of type keywords and sort in descending length
        type_keywords = list(type_keywords)
        type_keywords = sorted(type_keywords, key=lambda s: -len(s))

        # abbreviations for that country
        abbrevs = set(tmp['Abbreviation'])

        abb_keywords = []
        for abbrev in abbrevs:
            # e.g. for CS, 4 patterns are considered: '^CS ', ' CS ', ' CS$', '^CS$'
            abbrev = abbrev.title()
            abb_keywords.extend(['^'+abbrev+r'\s', r'\s'+abbrev+r'\s', r'\s'+abbrev+'$',
                                '^'+abbrev+'$'])

        # obtain the list of abbreviation keywords and sort in descending length
        abb_keywords = sorted(abb_keywords, key=lambda s: -len(s))  

        # some country-specific adjustments
        # patterns below rely on '$', so regex=True is passed explicitly
        if group_name.upper() == 'UGANDA':
            df_group[clean_name] = df_group[clean_name].str.replace("HC II$", "HCII", case=False, regex=True)\
            .str.replace("HC III$", "HCIII", case=False, regex=True)\
            .str.replace("HC IV$", "HCIV", case=False, regex=True)

        if group_name.upper() == 'MALAWI':
            df_group[clean_name] = df_group[clean_name].str.replace(" DHO$", " DH", case=False, regex=True)

        if group_name.upper() == "ERITREA":
            df_group[clean_name] = df_group[clean_name].str.replace(" HO$", " HOSP", case=False, regex=True)

        if group_name.upper() == 'MADAGASCAR':
            df_group[clean_name] = df_group[clean_name].str.replace("csb 1", " csb1", case=False, regex=True)
            df_group[clean_name] = df_group[clean_name].str.replace("csb 2", " csb2", case=False, regex=True)

        # handle situations when type is 'Hospital District' in the type dictionary 
        # but name column has 'District Hospital' in ISS data
        type_len_2 = [t for t in type_keywords if len(t.split())==2]
        for t in type_len_2:
            df_group[clean_name] = df_group[clean_name].str.title()\
            .str.replace(' '.join(t.split()[::-1]), t, case=False, regex=True)

        # remove type information using keywords generated above
        # remove meaningless connecting words like de, do, da, du
        df_group[clean_name_final] = df_group[clean_name].str.title()\
            .str.replace('|'.join(type_keywords), '', regex=True)\
            .str.replace('|'.join(abb_keywords), ' ', regex=True)\
            .str.strip()\
            .str.replace('^de | de | de$|^de$|^do | do | do$|^do$|^da | da | da$|^da$|^du | du | du$|^du$',
                         ' ', case=False, regex=True)\
            .str.replace("  ", " ", regex=False)\
            .str.strip()\
            .str.title()
        res = pd.concat([res, df_group])
    return res

## `extract_type`

Extract facility type information by removing `clean_name_final` from `clean_name`.

Note: empty strings '' in `clean_name` and `clean_name_final` are converted back to NA.
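
The ordered difference can be illustrated without `OrderedSet`. In this minimal sketch `ordered_difference` is a hypothetical helper: a plain list and set reproduce the "keep first occurrence, preserve order" behaviour, and the sketch leaves out the extra step in which `extract_type` strips de, do, da and du from the start and end of the result.

```python
def ordered_difference(clean_name, clean_name_final):
    """Tokens of clean_name that are absent from clean_name_final, in order."""
    if clean_name.upper() == clean_name_final.upper():
        return None  # nothing was removed, so there is no type information
    kept = set(clean_name_final.upper().split())
    seen = set()
    diff = []
    for token in clean_name.upper().split():
        # skip tokens that survive in the final name, and repeated tokens
        if token not in kept and token not in seen:
            seen.add(token)
            diff.append(token)
    return " ".join(diff).title() or None

print(ordered_difference("Centro De Saude De Furancungo", "Furancungo"))
print(ordered_difference("Neem Pharmacy", "Neem Pharmacy"))
```

Because the comparison works on whole tokens, a word that appears both in the type and in the remaining name is not reported as type information.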

In [21]:
def extract_type(df, clean_name, clean_name_final, extract_type):
    extract_types = []

    for idx, row in df.iterrows():
        name = row[clean_name].upper()
        name_final = row[clean_name_final].upper()

        # if clean_name_final is exactly the same as clean_name,
        # this indicates no type information can be extracted, thus append NA
        if name == name_final:
            extract_types.append(np.nan)

        else:
            name = OrderedSet(name.split())
            name_final = OrderedSet(name_final.split())
            # find the difference between two names
            diff = ' '.join(list(name.difference(name_final)))
            extract_types.append(diff.strip())

    # remove de, do, da, du at start or end of extract_type
    # (applied twice: the second pass catches a connecting word exposed by the first)
    # replace empty string with NA
    connectors = '^de |^do |^da |^du | du$| de$| do$| da$|^de$|^do$|^da$|^du$'
    df[extract_type] = extract_types
    df[extract_type] = df[extract_type].str.strip()\
        .str.replace("  ", " ", regex=False)\
        .str.replace(connectors, '', case=False, regex=True)\
        .str.replace(connectors, '', case=False, regex=True)\
        .str.strip()\
        .str.title()\
        .replace('',np.nan)
    # replace empty string with NA
    # (assigned back, since inplace replace on a selected column may not update df)
    df[clean_name] = df[clean_name].replace('', np.nan)
    df[clean_name_final] = df[clean_name_final].replace('', np.nan)

## `sub_type`

Use `extract_type` to map the type information extracted from the name column to one of the types in the type dictionary.
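
The mapping step can be sketched with the standard library. This is an approximation under stated assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio`, so scores may differ slightly from those reported by fuzzywuzzy, and `best_type_match` and the sample `drc_types` mapping are hypothetical names and data chosen for illustration.

```python
from difflib import SequenceMatcher

def best_type_match(extracted, type_by_abbrev):
    """Map `extracted` to a dictionary type; returns (type, score 0-100).

    `type_by_abbrev` is a hypothetical {abbreviation: type} mapping
    holding the type dictionary entries of one country.
    """
    # candidates are the full type names plus their abbreviations,
    # each pointing at the type it should be reported as
    candidates = {t: t for t in type_by_abbrev.values()}
    candidates.update(type_by_abbrev)
    best, best_score = None, -1
    for candidate, mapped_type in candidates.items():
        ratio = SequenceMatcher(None, extracted.lower(), candidate.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best, best_score = mapped_type, score
    return best, best_score

# Illustrative entries only, not taken from the real type dictionary file.
drc_types = {
    "PS": "Poste de Sante",
    "CS": "Centre de Sante",
    "HGR": "Hopital General de Reference",
}
print(best_type_match("Ps", drc_types))
print(best_type_match("Post Sante", drc_types))
```

A match on an abbreviation is reported as the corresponding full type, which mirrors what `map_type` does when the best candidate is an abbreviation.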

In [9]:
def map_type(df, country, extract_type, sub_type, score, type_dict):
    res = pd.DataFrame()
    for country_name in df[country].unique():
        # copy so that the new columns are not written to a slice of df
        df_group = df[df[country]==country_name].copy()
        # obtain facility types and abbreviations for that country
        tmp = type_dict[type_dict['Country'].str.upper()==country_name.upper()]
        types, abbrevs = tmp['Type'], tmp['Abbreviation']
        sub_types = []
        scores = []

        for idx, row in df_group.iterrows():
            # if extract_type is NA, just append NA
            if not isinstance(row[extract_type],str):
                sub_types.append(np.nan)
                scores.append(np.nan)

            # find best match
            else:
                match, match_score = process.extractOne(row[extract_type], list(types)+list(abbrevs), 
                                               scorer = fuzz.ratio)
                scores.append(match_score)
                # if best match is abbreviation, map it to the corresponding type
                if match in list(abbrevs):
                    match_type = tmp[tmp['Abbreviation']==match]['Type'].iloc[0]
                    sub_types.append(match_type)
                else:
                    sub_types.append(match) 
        df_group[sub_type] = sub_types
        df_group[score] = scores
        res = pd.concat([res, df_group])
    return res

In [10]:
def export_results(df, save_path):
    # export results
    # index_original could be used to map results to original dataset
    df.rename(columns={'index':'index_original'}, inplace=True)
    df.to_csv(save_path, index=False)

# Apply cleaning functions

In [11]:
# pre-cleaning
preclean(df, facility_name = FACILITY_NAME, clean_name = CLEAN_NAME)

In [12]:
# remove type information
res = remove_type_info(df, type_dict=TYPE_DICT, clean_name=CLEAN_NAME, 
                       clean_name_final=CLEAN_NAME_FINAL, country=COUNTRY)

In [13]:
# obtain facility type extracted
extract_type(df=res, clean_name=CLEAN_NAME, 
             clean_name_final=CLEAN_NAME_FINAL, extract_type=EXTRACT_TYPE)

In [14]:
print("Percentage of NA in extract type column:",
     round(res[EXTRACT_TYPE].isna().sum()/res.shape[0]*100,1))
print("Number of NA values in extract type column:", res[pd.isna(res[EXTRACT_TYPE])].shape[0])

Percentage of NA in extract type column: 41.5
Number of NA values in extract type column: 3071


In [15]:
# map facility type extracted to type in type dictionary
res = map_type(df=res, country = COUNTRY, extract_type=EXTRACT_TYPE, 
               sub_type=SUB_TYPE, score=SCORE, type_dict=TYPE_DICT)

In [16]:
print("Summary statistics of match score:")
res[SCORE].describe()

Summary statistics of match score:


count    4329.000000
mean       95.684685
std        11.394158
min        32.000000
25%       100.000000
50%       100.000000
75%       100.000000
max       100.000000
Name: score, dtype: float64

In [17]:
# randomly sample rows to examine results
# where type information is extracted
cols = [COUNTRY, FACILITY_NAME, CLEAN_NAME, CLEAN_NAME_FINAL, EXTRACT_TYPE,
       SUB_TYPE, SCORE]
res[~pd.isna(res[EXTRACT_TYPE])][cols].sample(5)

| | country | name | clean_name | clean_name_final | type_extract | sub_type | score |
|---|---|---|---|---|---|---|---|
| 692 | Mozambique | Centro de Saude de Furancungo | Centro De Saude De Furancungo | Furancungo | Centro De Saude | Centro de Saude | 100.0 |
| 290 | Mozambique | Centro de Saude de Chinhambuzi | Centro De Saude De Chinhambuzi | Chinhambuzi | Centro De Saude | Centro de Saude | 100.0 |
| 4743 | Kenya | Kutulo Health Center | Kutulo Health Centre | Kutulo | Health Centre | Health Centre | 100.0 |
| 3901 | Ethiopia | Aba Health Center | Aba Health Centre | Aba | Health Centre | Health Centre | 100.0 |
| 6903 | Democratic Republic of the Congo | PS HEWA BORA | Ps Hewa Bora | Hewa Bora | Ps | Poste de Sante | 100.0 |


In [18]:
# randomly sample rows to examine results
# where no type information is extracted
res[pd.isna(res[EXTRACT_TYPE])][cols].sample(5)

| | country | name | clean_name | clean_name_final | type_extract | sub_type | score |
|---|---|---|---|---|---|---|---|
| 4924 | Kenya | Neem Pharmacy | Neem Pharmacy | Neem Pharmacy | NaN | NaN | NaN |
| 3002 | Ghana | Haskay Pharmacy | Haskay Pharmacy | Haskay Pharmacy | NaN | NaN | NaN |
| 1877 | Nigeria | Rauda Street | Rauda Street | Rauda Street | NaN | NaN | NaN |
| 4173 | Ethiopia | Kebron Pharmacy | Kebron Pharmacy | Kebron Pharmacy | NaN | NaN | NaN |
| 6084 | Democratic Republic of the Congo | La Grace | La Grace | La Grace | NaN | NaN | NaN |


In [19]:
# export results
export_results(res, save_path=SAVE_PATH)