# Philippines Geo Data Cleaning Workbook

## Setup

In [1]:
import pandas as pd

In [2]:
# pulling in our single source of truth data file for geographic information 
ssot_df = pd.read_csv(filepath_or_buffer='raw_data/new_locations_only_icm.csv')

# taking a peek at our data
ssot_df.head()

Unnamed: 0,region,province,city,barangay
0,LUZON,PALAWAN,ABORLAN,APO-APORAWAN
1,LUZON,PALAWAN,ABORLAN,APOC-APOC
2,LUZON,PALAWAN,ABORLAN,APORAWAN
3,LUZON,PALAWAN,ABORLAN,BARAKE
4,LUZON,PALAWAN,ABORLAN,CABIGAAN


In [3]:
# pulling in our unclean data that needs to be matched to the geo data in the ssot_df
unclean_base_df = pd.read_csv(filepath_or_buffer='raw_data/original_locations.csv')

# taking a peek at our data
unclean_base_df.head()

Unnamed: 0,region_id,region,region_alias,province_id,province,city_id,city,barangay_id,barangay,population
0,1,Ilocos Region,Region I,1,Ilocos Norte,1,Adams,1,Adams (Pob.),1785
1,1,Ilocos Region,Region I,1,Ilocos Norte,1,Adams,44235,D,0
2,1,Ilocos Region,Region I,1,Ilocos Norte,2,Bacarra,2,Bani,948
3,1,Ilocos Region,Region I,1,Ilocos Norte,2,Bacarra,3,Buyon,1524
4,1,Ilocos Region,Region I,1,Ilocos Norte,2,Bacarra,4,Cabaruan,1437


## Cleaning "Region" Field

In [4]:
# investigating unique values of region in the ssot data
unique_clean_regionnames_list = ssot_df['region'].unique().tolist()
print(unique_clean_regionnames_list)

['LUZON', 'VISAYAS', 'MINDANAO']


In [5]:
# investigating unique values of region in the unclean data
unique_unclean_regionnames_arr = unclean_base_df['region'].unique()
print(unique_unclean_regionnames_arr)

['Ilocos Region' 'Cagayan Valley' 'Central Luzon' 'Southern Tagalog'
 'Southwestern Tagalog' 'Bicol Region' 'Western Visayas' 'Central Visayas'
 'Eastern Visayas' 'Zamboanga Peninsula' 'Northern Mindanao' 'Davao'
 'SOCCSKSARGEN' 'CARAGA' 'Bangsamoro' 'Cordillera Administrative Region'
 'National Capital Region' 'Negros Island Region']


In [6]:
# make table for region name mapping possibilities

# make first column the unique set of all region names from the unclean data
region_mapping_df = pd.DataFrame(unique_unclean_regionnames_arr, columns=['unclean_regionnames']) 

# make second column the cast-to-upper version of the first column
region_mapping_df['upper_unclean_regionnames'] = region_mapping_df['unclean_regionnames'].str.upper() 

# create a function that checks one row's upper-cased unclean region name...
# if the unclean region name contains one of the clean names...
# return that clean name as the value for our new column...
# if no clean name is found, return the placeholder "no_match_found"
def set_clean_region_name(row):
    # loop over all unique clean region names
    for clean_region_name in unique_clean_regionnames_list:
        # a match means the clean region name occurs as a substring
        # of the upper-cased unclean region name
        if clean_region_name in row['upper_unclean_regionnames']:
            return clean_region_name

    # if we struck out and didn't find a match, return a placeholder
    return "no_match_found"
    
# updating the df to include a new column for clean region name post matching
region_mapping_df = (
    region_mapping_df
    .assign(matched_regionnames = region_mapping_df.apply(set_clean_region_name, axis=1))
)

# inspecting the result
region_mapping_df.head(100)

Unnamed: 0,unclean_regionnames,upper_unclean_regionnames,matched_regionnames
0,Ilocos Region,ILOCOS REGION,no_match_found
1,Cagayan Valley,CAGAYAN VALLEY,no_match_found
2,Central Luzon,CENTRAL LUZON,LUZON
3,Southern Tagalog,SOUTHERN TAGALOG,no_match_found
4,Southwestern Tagalog,SOUTHWESTERN TAGALOG,no_match_found
5,Bicol Region,BICOL REGION,no_match_found
6,Western Visayas,WESTERN VISAYAS,VISAYAS
7,Central Visayas,CENTRAL VISAYAS,VISAYAS
8,Eastern Visayas,EASTERN VISAYAS,VISAYAS
9,Zamboanga Peninsula,ZAMBOANGA PENINSULA,no_match_found
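
The same substring match can also be written without a row-wise `apply`. The following is a minimal sketch, assuming a small hand-typed sample of region names in place of the real CSV files; it uses `Series.str.extract` with a regex alternation built from the clean names. One difference to be aware of: the regex returns the leftmost clean name found in the string, whereas the loop above returns the first match in list order, so the two could disagree if an unclean name ever contained two clean names.

```python
import re

import pandas as pd

# sample values copied from the outputs above (assumption: stand-ins for the real data)
clean_region_names = ['LUZON', 'VISAYAS', 'MINDANAO']
unclean_region_names = ['Ilocos Region', 'Central Luzon', 'Western Visayas', 'Northern Mindanao']

region_mapping_df = pd.DataFrame({'unclean_regionnames': unclean_region_names})

# build one capture group that matches any of the clean names,
# escaping each name in case it contains regex special characters
pattern = '(' + '|'.join(re.escape(name) for name in clean_region_names) + ')'

# upper-case the unclean names, pull out the clean name if present,
# and fall back to the same placeholder used by set_clean_region_name
region_mapping_df['matched_regionnames'] = (
    region_mapping_df['unclean_regionnames']
    .str.upper()
    .str.extract(pattern, expand=False)
    .fillna('no_match_found')
)

print(region_mapping_df)
```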


### Questions / Roadblocks:

1. Some "regions" in the unclean data cannot be mapped to the "regions" in the clean data from text alone, because the two names share no substring that an algorithm could pick up without outside context. For example, "Central Luzon" (region name in unclean data) is easy to tag as "LUZON" (region name in SSOT), but how would "Ilocos Region" (region name in unclean data) be tagged to any one of the three regions in the clean data? Some up-front manual effort may be needed to go through these by hand so I can apply a rules-based mapping (given that we know our unclean data won't be changing in the near future).
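
One way the rules-based mapping could look is a hand-built lookup dictionary applied with `Series.map`. This is only a sketch: the dictionary name `region_to_island_group` is hypothetical, the region list is typed in from the output of cell 5 rather than read from the CSV, and the island-group assignments reflect the conventional Luzon / Visayas / Mindanao grouping of Philippine administrative regions rather than anything derived from the SSOT file, so they should be checked against the SSOT's own province assignments before use.

```python
import pandas as pd

# unique region names as printed from the unclean data in cell 5
unique_unclean_regionnames = [
    'Ilocos Region', 'Cagayan Valley', 'Central Luzon', 'Southern Tagalog',
    'Southwestern Tagalog', 'Bicol Region', 'Western Visayas', 'Central Visayas',
    'Eastern Visayas', 'Zamboanga Peninsula', 'Northern Mindanao', 'Davao',
    'SOCCSKSARGEN', 'CARAGA', 'Bangsamoro', 'Cordillera Administrative Region',
    'National Capital Region', 'Negros Island Region',
]

# hypothetical manual lookup (assumption: conventional island-group membership,
# not verified against the SSOT file)
region_to_island_group = {
    'Ilocos Region': 'LUZON',
    'Cagayan Valley': 'LUZON',
    'Central Luzon': 'LUZON',
    'Southern Tagalog': 'LUZON',
    'Southwestern Tagalog': 'LUZON',
    'Bicol Region': 'LUZON',
    'Cordillera Administrative Region': 'LUZON',
    'National Capital Region': 'LUZON',
    'Western Visayas': 'VISAYAS',
    'Central Visayas': 'VISAYAS',
    'Eastern Visayas': 'VISAYAS',
    'Negros Island Region': 'VISAYAS',
    'Zamboanga Peninsula': 'MINDANAO',
    'Northern Mindanao': 'MINDANAO',
    'Davao': 'MINDANAO',
    'SOCCSKSARGEN': 'MINDANAO',
    'CARAGA': 'MINDANAO',
    'Bangsamoro': 'MINDANAO',
}

region_mapping_df = pd.DataFrame({'unclean_regionnames': unique_unclean_regionnames})

# names missing from the dictionary come back as NaN instead of raising
region_mapping_df['matched_regionnames'] = (
    region_mapping_df['unclean_regionnames'].map(region_to_island_group)
)

# list any region names the lookup does not cover, so gaps are visible
unmapped = region_mapping_df.loc[
    region_mapping_df['matched_regionnames'].isna(), 'unclean_regionnames'
].tolist()
print('unmapped region names:', unmapped)
```

Because the unclean data is not expected to change soon, a fixed dictionary like this stays valid; the `unmapped` check is there to flag any new region name if the file is ever refreshed.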