# Philippines Geo Data Cleaning Workbook

## Setup

In [1]:
import pandas as pd
import janitor
import math
%load_ext blackcellmagic

## Import and manage file that screens out non-ICM regions -- `raw_data/non_icm_loc.csv`

Breaking this out as a separate step so we can keep track of any changes made to the original file.

- On 04/14/2020, Batangas and Bulacan were changed from `remove == True` to `remove == Nan`
    - Details: prompted by Joy K., removed by Paul J.
- On 04/14/2020, First, Second, Third, and Fourth (districts of Manila) were added to the csv and set to `remove == TRUE`
    - Details: discovery by Paul J. in first round of EDA, confirmed by Joy K., removed by Paul J. 

In [2]:
# pulling in the non icm file to be used as a negative filer...
# and dropping any NaNs / NAs because I want this to be a pure negative filter...
# i.e. I will want this DF to be only regions I want to drop from my main df
non_icm_base_df = (
    pd.read_csv(filepath_or_buffer="raw_data/non_icm_loc.csv").dropna().clean_names()
)

# taking a peek at our data
non_icm_base_df.head()

Unnamed: 0,province,remove
0,Abra,True
6,Apayao,True
7,Aurora,True
9,Bataan,True
10,Batanes,True


## Create the SSOT file

In order to create our SSOT file, we need to take the `new_locations.csv` and remove all provinces labeled TRUE in the `non_icm_loc.csv` file, which will allow us to create an up to date file with all locations (ICM only). 

In [3]:
# pulling in the SSOT file for all new locations
new_locations_base_df = pd.read_csv(
    filepath_or_buffer="raw_data/new_locations.csv"
).clean_names()

# taking a peek at our data
new_locations_base_df.head()

Unnamed: 0,id,province,city,barangay,latitude,logitude
0,207538,Abra,Bangued,Agtangao,17.5627,120.637016
1,207472,Abra,Bangued,Angad,17.57671,120.621513
2,207374,Abra,Bangued,Bañacao,17.60569,120.595734
3,207385,Abra,Bangued,Bangbangar,17.606991,120.609749
4,207435,Abra,Bangued,Cabuloan,17.596331,120.612694


In [4]:
# removing all rows from `new_locations_base_df` that appear in `non_icm_base_df`
ssot_df = new_locations_base_df[
    (new_locations_base_df["province"].isin(non_icm_base_df["province"]) != True)
]

# capitalizing all needed fields to facilitate later matching
ssot_df = ssot_df.apply(
    lambda x: x.astype(str).str.upper() if (x.dtype == "object") else x
)

# getting rid of all whitespace just to be sure
ssot_df = ssot_df.apply(
    lambda x: x.astype(str).str.strip() if (x.dtype == "object") else x
)

# saving out the ssot_df for logging / future use purposes
ssot_df.to_csv(path_or_buf="processed_data/ssot_df.csv", index=False)

# inspecting the filtered data
ssot_df

Unnamed: 0,id,province,city,barangay,latitude,logitude
303,249388,AGUSAN DEL NORTE,BUENAVISTA,ABILAN,8.96710,125.451851
304,249466,AGUSAN DEL NORTE,BUENAVISTA,AGONG-ONG,8.96052,125.399017
305,249469,AGUSAN DEL NORTE,BUENAVISTA,ALUBIJID,8.92323,125.453773
306,249613,AGUSAN DEL NORTE,BUENAVISTA,GUINABSAN,8.84064,125.241096
307,249751,AGUSAN DEL NORTE,BUENAVISTA,LOWER OLAVE,8.79082,125.491669
...,...,...,...,...,...,...
41943,255227,ZAMBOANGA SIBUGAY,TUNGAWAN,TIGBANUANG,7.58116,122.430321
41944,255636,ZAMBOANGA SIBUGAY,TUNGAWAN,TIGBUCAY,7.54406,122.395416
41945,255647,ZAMBOANGA SIBUGAY,TUNGAWAN,TIGPALAY,7.47038,122.348328
41946,255194,ZAMBOANGA SIBUGAY,TUNGAWAN,TIMBABAUAN,7.61146,122.393120


## Clean up the *unclean* file -- `raw_data/original_locations.csv` -- and add new correct geo-mapping fields

Just as we did with the SSOT file, first we need to take the *unclean* file and remove all provinces labeled TRUE in the `non_icm_loc.csv` file, which will allow us to create an up to date file with all locations (ICM only). 

In [5]:
# pulling in our unclean data that needs to be matched to the geo data in the ssot_df...
# sorting by province just as the ssot df is, for convenience
unclean_base_df = (
    pd.read_csv(filepath_or_buffer="raw_data/original_locations.csv")
    .clean_names()
    .sort_values("province")
)

# taking a peek at our data
unclean_base_df.head()

Unnamed: 0,region_id,region,region_alias,province_id,province,city_id,city,barangay_id,barangay,population
41093,16,Cordillera Administrative Region,CAR,76,Abra,1510,Villaviciosa,38236,Tuquib,840
40878,16,Cordillera Administrative Region,CAR,76,Abra,1491,La Paz,38034,Toon,928
40879,16,Cordillera Administrative Region,CAR,76,Abra,1491,La Paz,38035,Udangan,490
40880,16,Cordillera Administrative Region,CAR,76,Abra,1492,Lacub,38036,Bacag,233
40881,16,Cordillera Administrative Region,CAR,76,Abra,1492,Lacub,38037,Buneg,827


In [6]:
# removing all rows from `unclean_base_df` that appear in `non_icm_base_df`
unclean_base_df = unclean_base_df[
    (unclean_base_df["province"].isin(non_icm_base_df["province"]) != True)
]

# taking a peek at our data
unclean_base_df.head()

Unnamed: 0,region_id,region,region_alias,province_id,province,city_id,city,barangay_id,barangay,population
37052,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34279,Rojales,2083
37047,14,CARAGA,Region XIII,66,Agusan del Norte,1295,City of Cabadbaran,34274,Mahaba,1250
37048,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34275,Cahayagan,2380
37049,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34276,Gosoon,1772
37050,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34277,Manoligao,1513


### Create `under_construction_df` -- a new go-forward DF which will be a copy of the *unclean* file with the new cleaned columns added

In [7]:
under_construction_df = unclean_base_df.copy()

### Create a new column -- `province_cleaned` -- to be appended to the `under_construction_df`, with the *correct* name for the province associated with each row:

First, create `province_mapping_df`, which will eventually serve as a mapping dictionnary of unclean to clean names, but will start by simply storing all unique province names in the *unclean* file.

In [8]:
# investigating unique values of province in the ssot data
unique_clean_provincename_list = [
    x.upper() for x in ssot_df["province"].unique().tolist()
]
print(unique_clean_provincename_list)

['AGUSAN DEL NORTE', 'AGUSAN DEL SUR', 'AKLAN', 'ALBAY', 'ANTIQUE', 'BASILAN', 'BATANGAS', 'BILIRAN', 'BOHOL', 'BUKIDNON', 'BULACAN', 'CAMIGUIN', 'CAPIZ', 'CATANDUANES', 'CEBU', 'COMPOSTELA VALLEY', 'DAVAO DEL NORTE', 'DAVAO DEL SUR', 'DAVAO ORIENTAL', 'DINAGAT ISLANDS', 'EASTERN SAMAR', 'GUIMARAS', 'ILOILO', 'LEYTE', 'MAGUINDANAO', 'MARINDUQUE', 'MASBATE', 'MISAMIS OCCIDENTAL', 'MISAMIS ORIENTAL', 'NEGROS OCCIDENTAL', 'NEGROS ORIENTAL', 'NORTH COTABATO', 'NORTHERN SAMAR', 'OCCIDENTAL MINDORO', 'ORIENTAL MINDORO', 'PALAWAN', 'ROMBLON', 'SAMAR', 'SARANGANI', 'SIQUIJOR', 'SORSOGON', 'SOUTH COTABATO', 'SOUTHERN LEYTE', 'SULTAN KUDARAT', 'SULU', 'SURIGAO DEL NORTE', 'SURIGAO DEL SUR', 'TAWI-TAWI', 'ZAMBOANGA DEL NORTE', 'ZAMBOANGA DEL SUR', 'ZAMBOANGA SIBUGAY']


In [9]:
# investigating unique values of region in the unclean data
unique_unclean_provincename_list = unclean_base_df["province"].unique().tolist()
print(unique_unclean_provincename_list)

['Agusan del Norte', 'Agusan del Sur', 'Aklan', 'Albay', 'Antique', 'Basilan', 'Batangas', 'Biliran', 'Bohol', 'Bukidnon', 'Bulacan', 'Camiguin', 'Capiz', 'Catanduanes', 'Cebu', 'City of Isabela (Capital)', 'Compostela Valley', 'Cotabato', 'Davao Del Sur', 'Davao Occidental', 'Davao Oriental', 'Davao del Norte', 'Dinagat Islands', 'Eastern Samar', 'Guimaras', 'Iloilo', 'Leyte', 'Maguindanao', 'Marinduque', 'Masbate', 'Misamis Occidental', 'Misamis Oriental', 'Negros Occidental', 'Negros Oriental', 'North Cotabato', 'Northern Samar', 'Occidental Mindoro', 'Oriental Mindoro', 'Palawan', 'Romblon', 'Samar(Western Samar)', 'Sarangani', 'Siquijor', 'Sorsogon', 'South Cotabato', 'Southern Leyte', 'Sultan Kudarat', 'Sulu', 'Surigao del Norte', 'Surigao del Sur', 'Tawi-Tawi', 'Zamboanga Sibugay', 'Zamboanga del Norte', 'Zamboanga del Sur']


In [10]:
# make table for province name mapping possibiliies:

# make first column the unique set of all province names from the unclean data
province_mapping_df = pd.DataFrame(
    unique_unclean_provincename_list, columns=["unclean_provincename"]
)

# make second column the cast-to-upper version of the first collumn
province_mapping_df["upper_unclean_provincename"] = province_mapping_df[
    "unclean_provincename"
].str.upper()

# inspecting df we have so far
province_mapping_df.head()

Unnamed: 0,unclean_provincename,upper_unclean_provincename
0,Agusan del Norte,AGUSAN DEL NORTE
1,Agusan del Sur,AGUSAN DEL SUR
2,Aklan,AKLAN
3,Albay,ALBAY
4,Antique,ANTIQUE


Iterate over each unique province name in the *unclean* file, and check if the value in the `province` column matches a value contained in the `province` column of the SSOT file (accounting for capialization differences)

In [11]:
# create a function to loop through each unclean upper province name...
# if it finds that the unclean province name contains one of the clean names...
# set the value of our new column to be the clean name value...
# if it can't identify it, set the new value to "not yet matched"
def set_clean_province_name(row):
    # for each evaluation of each row passed, start off assuming no match
    province_match_found = False
    # loop over all unique clean province names
    for clean_province_name in unique_clean_provincename_list:
        # if we find a match in the clean province names...
        # meaning if we find that one of the clean province names...
        # occurs in the string of the unclean province name...
        # return the clean province name that was found
        if clean_province_name in row["upper_unclean_provincename"]:
            province_match_found = True
            return clean_province_name

    # if we struck out and didn't find a match, return a placeholder
    if not province_match_found:
        return "no_match_found"

# updating the df to include a new column for clean province name post matching
province_mapping_df = province_mapping_df.assign(
    matched_provincename=province_mapping_df.apply(set_clean_province_name, axis=1)
)

# inspecting the result
province_mapping_df.head()

Unnamed: 0,unclean_provincename,upper_unclean_provincename,matched_provincename
0,Agusan del Norte,AGUSAN DEL NORTE,AGUSAN DEL NORTE
1,Agusan del Sur,AGUSAN DEL SUR,AGUSAN DEL SUR
2,Aklan,AKLAN,AKLAN
3,Albay,ALBAY,ALBAY
4,Antique,ANTIQUE,ANTIQUE


In [12]:
# just inspecting the province names we couldn't match
province_names_failedtomatch_df = province_mapping_df.loc[
    province_mapping_df["matched_provincename"] == "no_match_found", :
]

province_names_failedtomatch_df

Unnamed: 0,unclean_provincename,upper_unclean_provincename,matched_provincename
15,City of Isabela (Capital),CITY OF ISABELA (CAPITAL),no_match_found
17,Cotabato,COTABATO,no_match_found
19,Davao Occidental,DAVAO OCCIDENTAL,no_match_found


In [13]:
# create a function for returning details on instances of "no_match_found" in a column

def get_no_match_found_details(df, col_name_with_matches):
    """
    Function to return match rate details for a column,
    where that column (when not matched) contains the string "no_match_found"
    """
    # first get number of total rows in df (which is # of province names)
    count_unique_province_names = len(df.index)

    # then get number of province names for which we found a match in our clean SSOT
    count_unmatched_province_names = (
        df[col_name_with_matches].str.count("no_match_found").sum()
    )

    # divide the number of matched province names by the total number of province names...
    # to get the percnt of province names we successfully matched
    province_names_match_rate = (
        count_unique_province_names - count_unmatched_province_names
    ) / count_unique_province_names

    # create string of details to output
    output_string = f"Out of {count_unique_province_names} unique province names in our data, we failed to match {count_unmatched_province_names}, resulting in a match rate of {'{:.2%}'.format(province_names_match_rate)}."

    return output_string

After having done the automated matches possible, perform the manual matching necessary based on additional research:

- For City of Isabela (Capital), the `matched_provincename` should be set to "BASILAN"
    - (Per Joy K.) City of Isabela (Capital) can be found under Basilan in NEW; we can hardcode the match. This differs from PSA SSOT (City of Isabela is officially under Region IX - PSA but NEW it is under AARM/Basilan).
- For Cotabato, the row in the `province_mapping_df` will be deleted, as the matching is more complicated and requires checking the city as well--i.e.--as there are multiple Cotabatos
    - (Per Joy K.) In this case we will hardcode the changes based on the corresponding Cities (Cotabato (City of Cotabato) >> Maguindanao (City of Cotabato) for all others change Cotabato >> North Cotabato)
        - PSA states there are officially 3 Cotabatos: North Cotabato, South Cotabato, and the City of Cotabato. In NEW we only have North Cotabato & South Cotabato. And in the OLD/ORG we have Cotabato, North Cotabato, and South Cotabato. 
        - In OLD/ORG Cotabato has municipalites called: City of Cotabato and the rest are all North Cotabatoian cities (confirmed via google). In NEW the City of Cotabato is under Maguindanao province. 
        - For future discrepencies I suggest reviewing the Municipality/Barangay levels to verify what to match the name(s) to. Let's document the change and what it should be according to the official PSA SSOT (e.g., City of Cotabato is under AARM/Maguindanao in NEW but in PSA it's SOCCSKSARGEN/City of Cotabato). 
- For Davao Occidental, the `matched_provincename` should be set to "DAVAO DEL SUR" 
    - (Per Joy K.) In this case it looks like in 2013 they created Davao Occidental from Davao del Sur which can be found in the NEW dataset. In this case we can again just manually match Davao Occidental >> Davao del Sur. 

In [14]:
# execute manual matching/fixes based on research above:

# manually matching City of Isabela (Capital)
province_mapping_df.loc[
    province_mapping_df["unclean_provincename"] == "City of Isabela (Capital)",
    "matched_provincename",
] = "BASILAN"

# removing Cotabato as its match depends on the city associated (i.e. logic more complicated)
province_mapping_df = province_mapping_df[
    province_mapping_df["unclean_provincename"] != "Cotabato"
]

# manually matching Davao Occidental
province_mapping_df.loc[
    province_mapping_df["unclean_provincename"] == "Davao Occidental",
    "matched_provincename",
] = "DAVAO DEL SUR"

In [15]:
# ensuring that our manual changes above cleared all the instance of "no_match_found"
get_no_match_found_details(
    df=province_mapping_df, col_name_with_matches="matched_provincename"
)

'Out of 53 unique province names in our data, we failed to match 0, resulting in a match rate of 100.00%.'

Use the `province_mapping_df` (and any other custom logic needed) to create the new `province_cleaned` column

- For any row where the `unclean_provincename` == "Cotabato", if the "city" listed is "Cotabato City", then set `province_cleaned` to "MAGUINDANAO"; otherwise, set `province_cleaned` to  "NORTH COTABATO"
- For all other rows (i.e. all rows where the `unclean_provincename` != "Cotabato"), look up the "province" name in our dictionnary table -- province_mapping_df -- and return the corresponding `matched_provincename`

In [16]:
# create a function to loop through all rows in our unclean base data...
# for each row where the unclean province name is NOT "Cotabato"...
# we'll rely on the `province_mapping_df`; then the custom logic for Cotabato
# is provided at the end
def create_province_cleaned_col(row):

    if "COTABATO" in row["province"].upper():
        if "Cotabato City" in row["city"]:
            return "MAGUINDANAO"
        else:
            return "NORTH COTABATO"
    else:
        return (
            province_mapping_df.loc[
                province_mapping_df["upper_unclean_provincename"].str.contains(
                    row["province"].upper()
                ),
                "matched_provincename",
            ]
            .to_string(index=False)
            .strip()
        )


# updating the df to include a new column for clean province name post matching
under_construction_df = under_construction_df.assign(
    province_cleaned=under_construction_df.apply(create_province_cleaned_col, axis=1)
)

# inspecting the result
under_construction_df.head()

  return func(self, *args, **kwargs)


Unnamed: 0,region_id,region,region_alias,province_id,province,city_id,city,barangay_id,barangay,population,province_cleaned
37052,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34279,Rojales,2083,AGUSAN DEL NORTE
37047,14,CARAGA,Region XIII,66,Agusan del Norte,1295,City of Cabadbaran,34274,Mahaba,1250,AGUSAN DEL NORTE
37048,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34275,Cahayagan,2380,AGUSAN DEL NORTE
37049,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34276,Gosoon,1772,AGUSAN DEL NORTE
37050,14,CARAGA,Region XIII,66,Agusan del Norte,1296,Carmen,34277,Manoligao,1513,AGUSAN DEL NORTE


### Write out `processed_data/under_construction_df.csv` to log the work done so far

In [17]:
# saving out the under construction df for logging
under_construction_df.to_csv(
    path_or_buf="processed_data/under_construction_df.csv", index=False
)

### Create a new column -- `city_cleaned` -- to be appended to the `under_construction_df`, with the *correct* name for the province associated with each row:

First I'll create a df that has only the fields we'll need to match `city` names in the `under_construct_df` to `city` names in our SSOT -- region, province_cleaned, city, and barangay.

In [18]:
# creating df with only fields need to match cities
just_geo_names_df = under_construction_df.loc[
    :, ("region", "province_cleaned", "city", "barangay")
]

# capitalizing all needed fields to facilitate later matching
just_geo_names_df = just_geo_names_df.apply(
    lambda x: x.astype(str).str.upper() if (x.dtype == "object") else x
)

# getting rid of all whitespace just to be sure
just_geo_names_df = just_geo_names_df.apply(
    lambda x: x.astype(str).str.strip() if (x.dtype == "object") else x
)

# taking a look at the df
just_geo_names_df.head()

Unnamed: 0,region,province_cleaned,city,barangay
37052,CARAGA,AGUSAN DEL NORTE,CARMEN,ROJALES
37047,CARAGA,AGUSAN DEL NORTE,CITY OF CABADBARAN,MAHABA
37048,CARAGA,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN
37049,CARAGA,AGUSAN DEL NORTE,CARMEN,GOSOON
37050,CARAGA,AGUSAN DEL NORTE,CARMEN,MANOLIGAO


First we go for all low-hanging fruit -- cities that we can match to the `ssot_df` because we can find an exact pairing between sets of province, city, and barangay between the `just_geo_names_df` df and the `ssot_df`. We'll perform this matching via a left join of the `ssot_df` onto the `just_geo_names_df`. We'll then flag all the rows that were matched successfully with this simple method. 

In [19]:
# performing the left merge
just_geo_names_df = pd.merge(
    just_geo_names_df,
    ssot_df,
    how="left",
    left_on=["province_cleaned", "city", "barangay"],
    right_on=["province", "city", "barangay"],
)

# if the match of all geo names after the merge was successful, flagging it with a 1...
# we can use the exisitance of a value in the 'id' column as a proxy for match, as...
# if there isn't a match, it'll be NaN
def flag_successful_full_matches(row):

    if math.isnan(row["id"]):
        return 0
    else:
        return 1


# adding the variable for successful flagging
just_geo_names_df = just_geo_names_df.assign(
    full_match_successful=just_geo_names_df.apply(flag_successful_full_matches, axis=1)
)

# removing unneeded columns
just_geo_names_df = just_geo_names_df.loc[
    :, ["region", "province_cleaned", "city", "barangay", "full_match_successful"]
]

# inspecting the df
just_geo_names_df.head()

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
0,CARAGA,AGUSAN DEL NORTE,CARMEN,ROJALES,1
1,CARAGA,AGUSAN DEL NORTE,CITY OF CABADBARAN,MAHABA,0
2,CARAGA,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN,1
3,CARAGA,AGUSAN DEL NORTE,CARMEN,GOSOON,1
4,CARAGA,AGUSAN DEL NORTE,CARMEN,MANOLIGAO,1


Create a df with just the geo names we couldn't match to the `ssot_df` across all 3 geos so we can count the records still left to match. We'll do this multiple times from here on out until we arrive at 0 records we can't match.

In [20]:
# checking number of yet-to-be-matched geo names
problematic_geo_names = just_geo_names_df.loc[
    just_geo_names_df["full_match_successful"] == 0, :
]
f"There are {len(problematic_geo_names)} rows where we couldn't find a full match."

"There are 9266 rows where we couldn't find a full match."

(Round 1 of ad hoc research) Now let's make any fixes we noticed through ad hoc exploration and see how that affects our match rateNow let's make any fixes we noticed through ad hoc exploration and see how that affects our match rate.

In [21]:
# in Agusan Del Norte, in just_geo_names there is a city called "CITY OF CABADBARAN" that sould be "CABADBARAN CITY", per the ssot_df
just_geo_names_df.loc[
    (just_geo_names_df["province_cleaned"] == "AGUSAN DEL NORTE")
    & (just_geo_names_df["city"] == "CITY OF CABADBARAN"),
    ["city", "full_match_successful"],
] = ("CABADBARAN CITY", 1)

# in Agusan Del Sur, in just_geo_names there is a city called "CITY OF BAYUGA" that should be "BAYUGAN CITY", per the ssot_df
just_geo_names_df.loc[
    (just_geo_names_df["province_cleaned"] == "AGUSAN DEL SUR")
    & (just_geo_names_df["city"] == "CITY OF BAYUGAN"),
    ["city", "full_match_successful"],
] = ("BAYUGAN CITY", 1)

just_geo_names_df.head()

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
0,CARAGA,AGUSAN DEL NORTE,CARMEN,ROJALES,1
1,CARAGA,AGUSAN DEL NORTE,CABADBARAN CITY,MAHABA,1
2,CARAGA,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN,1
3,CARAGA,AGUSAN DEL NORTE,CARMEN,GOSOON,1
4,CARAGA,AGUSAN DEL NORTE,CARMEN,MANOLIGAO,1


In [22]:
# checking number of yet-to-be-matched geo names
problematic_geo_names = just_geo_names_df.loc[
    just_geo_names_df["full_match_successful"] == 0, :
]
f"There are {len(problematic_geo_names)} rows where we couldn't find a full match."

"There are 9192 rows where we couldn't find a full match."

(Round 1 of ad hoc research)  It appears we've spotted one trend that can be corrected algorithmically -- we should look for instances of the city names that use the formulation "CITY OF xxxxxxx" and replace them with the formulation "xxxxxxx CITY".

In [23]:
# if we spot "CITY OF" in the particular record's 'city'...
# we'll set 'city' to the correct formulation -- "xxxxxxx CITY"...
# otherwise, we'll keep it as is
def invert_city_of_formulations(row):

    if "CITY OF" in row["city"]:
        return row["city"].partition("CITY OF ")[2] + " CITY"
    else:
        return row["city"]


# updating the df's values for city per the function above
just_geo_names_df = just_geo_names_df.assign(
    city=just_geo_names_df.apply(invert_city_of_formulations, axis=1)
)

# inspecting df
just_geo_names_df

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
0,CARAGA,AGUSAN DEL NORTE,CARMEN,ROJALES,1
1,CARAGA,AGUSAN DEL NORTE,CABADBARAN CITY,MAHABA,1
2,CARAGA,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN,1
3,CARAGA,AGUSAN DEL NORTE,CARMEN,GOSOON,1
4,CARAGA,AGUSAN DEL NORTE,CARMEN,MANOLIGAO,1
...,...,...,...,...,...
28220,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,TABINA,BAGANIAN,1
28221,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,TABINA,BAYA-BAYA,1
28222,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,TABINA,CAPISAN,1
28223,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,PITOGO,MATIN-AO,1


In [24]:
# using another merge back on the ssot_df to see how much that fix helped our match rate
just_geo_names_df = pd.merge(
    just_geo_names_df,
    ssot_df,
    how="left",
    left_on=["province_cleaned", "city", "barangay"],
    right_on=["province", "city", "barangay"],
)

# adding the variable for successful flagging
just_geo_names_df = just_geo_names_df.assign(
    full_match_successful=just_geo_names_df.apply(flag_successful_full_matches, axis=1)
)

# removing unneeded columns
just_geo_names_df = just_geo_names_df.loc[
    :, ["region", "province_cleaned", "city", "barangay", "full_match_successful"]
]

# inspecting the df
just_geo_names_df.head()

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
0,CARAGA,AGUSAN DEL NORTE,CARMEN,ROJALES,1
1,CARAGA,AGUSAN DEL NORTE,CABADBARAN CITY,MAHABA,1
2,CARAGA,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN,1
3,CARAGA,AGUSAN DEL NORTE,CARMEN,GOSOON,1
4,CARAGA,AGUSAN DEL NORTE,CARMEN,MANOLIGAO,1


In [25]:
# checking number of yet-to-be-matched geo names
problematic_geo_names = just_geo_names_df.loc[
    just_geo_names_df["full_match_successful"] == 0, :
]
f"There are {len(problematic_geo_names)} rows where we couldn't find a full match."

"There are 8282 rows where we couldn't find a full match."

(Round 1 of ad hoc research) Looks like the formulation change from "CITY OF xxxxxxx" to "xxxxxxx CITY" fixed 910 -- (9192-8282) -- records!

(Round 2 of ad hoc research) Now let's make any fixes we noticed through ad hoc exploration and see how that affects our match rate

In [26]:
problematic_geo_names

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
5,CARAGA,AGUSAN DEL NORTE,CARMEN,POBLACION (CARMEN),0
22,CARAGA,AGUSAN DEL NORTE,CARMEN,NUEVA FUERZA,0
30,CARAGA,AGUSAN DEL NORTE,JABONGA,A. BELTRAN (CAMALIG),0
33,CARAGA,AGUSAN DEL NORTE,CARMEN,LA PAZ,0
34,CARAGA,AGUSAN DEL NORTE,CARMEN,GUADALUPE,0
...,...,...,...,...,...
28103,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,GUIPOS,POBLACION (GUIPOS),0
28130,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,LAPUYAN,EMPTY,0
28178,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,TABINA,DONA JOSEFINA,0
28203,ZAMBOANGA PENINSULA,ZAMBOANGA DEL SUR,MARGOSATUBIG,EMPTY,0


In [27]:
problematic_geo_names.loc[
    (problematic_geo_names["province_cleaned"] == "AGUSAN DEL NORTE")
    & (problematic_geo_names["city"] == "CARMEN"),
    :,
]

Unnamed: 0,region,province_cleaned,city,barangay,full_match_successful
5,CARAGA,AGUSAN DEL NORTE,CARMEN,POBLACION (CARMEN),0
22,CARAGA,AGUSAN DEL NORTE,CARMEN,NUEVA FUERZA,0
33,CARAGA,AGUSAN DEL NORTE,CARMEN,LA PAZ,0
34,CARAGA,AGUSAN DEL NORTE,CARMEN,GUADALUPE,0
35,CARAGA,AGUSAN DEL NORTE,CARMEN,KATIPUNAN,0
36,CARAGA,AGUSAN DEL NORTE,CARMEN,POBLACION NORTE,0
37,CARAGA,AGUSAN DEL NORTE,CARMEN,MONTESUERTE,0
39,CARAGA,AGUSAN DEL NORTE,CARMEN,GAUDALUPE,0
40,CARAGA,AGUSAN DEL NORTE,CARMEN,COGON WEST,0
41,CARAGA,AGUSAN DEL NORTE,CARMEN,CALATRAVA,0


In [28]:
ssot_df.loc[
    (ssot_df["province"] == "AGUSAN DEL NORTE") & (ssot_df["city"] == "CARMEN"), :
]

Unnamed: 0,id,province,city,barangay,latitude,logitude
445,249276,AGUSAN DEL NORTE,CARMEN,CAHAYAGAN,9.02488,125.249847
446,249237,AGUSAN DEL NORTE,CARMEN,GOSOON,9.05545,125.226997
447,249392,AGUSAN DEL NORTE,CARMEN,MANOLIGAO,8.90158,125.23558
448,249360,AGUSAN DEL NORTE,CARMEN,POBLACION,8.99149,125.306396
449,249356,AGUSAN DEL NORTE,CARMEN,ROJALES,8.93398,125.262558
450,249265,AGUSAN DEL NORTE,CARMEN,SAN AGUSTIN,9.03296,125.219704
451,249300,AGUSAN DEL NORTE,CARMEN,TAGCATONG,8.98486,125.227913
452,249198,AGUSAN DEL NORTE,CARMEN,VINAPOR,9.06431,125.210312


### Next Steps:

- Start on round 2 of ad hoc research -- looking for more trends that we can exploit to successfully match the remaining 8282 pairings of province, city, and barangay.

### Questions / Roadblocks:
- In the above two cells, do you have any insight on what's going on with the 'barangay' values in the `problematic_geo_names` table? I assume records like "POBLACION NORTE" could maybe be linked to "POBLACION" in the `ssot_df`? Or maybe this is even more ambiguous? Seems weird to me to have a geography literally named "population"? 
- There are a few random directories / files scattered throughout the repo (like `icon` and `.idea`) that I wonder if are just random and can be deleted? Didn't want to delete them without knowing it was for sure no longer needed. 
- Is there a way you could grant me admin access over the repo? I'd like to look into setting up a few things like Milestones, Projects, and GitHub actions. 