# Cholera Outbreaks in Sub-Saharan Africa 2010-2019

Let's look at outbreak data from the Infectious Disease Dynamics Group at Johns
Hopkins University as given in their
[GitHub repository of Cholera Outbreaks in Sub-Saharan Africa 2010-2019](https://github.com/HopkinsIDD/cholera_outbreaks_ssa).
Specifically, we'll look at the outbreak reference data in
`reference_data/outbreak_data.csv` within that repository, which we've
downloaded to `data/outbreak_data.csv` within this repository.


In [None]:
import json
import os
import pandas as pd
import requests
from typing import Any


We'll start by looking at a few rows of the outbreak data:


In [None]:
outbreaks_df = pd.read_csv("data/outbreak_data.csv")
outbreaks_df


We can see that we have 999 rows with some missing values, but we'll address
our numerical data later. Let's first see all of the columns we have:


In [None]:
for column in sorted(outbreaks_df.columns):
    print(column)


We appear to have a few columns related to outbreak locations, so let's see if
any of them provide any useful geocoding information:


In [None]:
outbreaks_df[["area", "country", "location", "who_region"]]


The `area` column appears to be the size of the area, which doesn't help us with
geocoding, and `who_region` doesn't really help us either, but the 3-letter ISO
code for `country` and the parts within `location` should be all we need to
obtain relevant GeoJSON data.

Let's first look at the unique `country` values:


In [None]:
outbreaks_df["country"].unique()


It appears we have at least 1 row where the country is `TZA_zanzibar` instead of
simply `TZA`, but before we bother cleaning up the `country` values, according
to the [data description](https://github.com/HopkinsIDD/cholera_outbreaks_ssa),
the `location` column contains:

> the name of the location where outbreak cases were reported and the name is
> made up of WHO region, country, and administrative units seperated by "::".

Therefore, the values in the `country` column appear to be duplicated in the
`location` column. Since we want to split out the constituent parts of
`location` anyway, let's do that, and perhaps just use the country component of
`location` instead of the existing `country` column, if it does not suffer from
the same problem.

The WHO region within `location` not only already exists in the `who_region`
column, but also doesn't help our geocoding effort, so we'll just drop that
component. That leaves us with the standard administrative regions from
`location`, starting with the country, which is administrative level 0, followed
by successively smaller administrative units (1, 2, etc.):


In [None]:
admins_df = (
    outbreaks_df["location"]
    .str.split("::", expand=True)  # Split `location` parts into columns
    .drop([0], axis=1)  # Drop WHO region (column 0)
    .rename(columns=lambda i: "country" if i == 1 else f"adm{i - 1}_raw")
    .apply(lambda column: column.str.upper() if column.name == "country" else column)
)

admins_df
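
To make the transformation concrete, here is a minimal plain-Python sketch of the same split applied to a single value. The `location` string below is made up for illustration (it is not taken from the dataset), and `split_location` is a hypothetical helper, not part of this notebook's pipeline.

```python
def split_location(location: str) -> dict[str, str]:
    """Split a `location` value into a country code and raw admin unit names."""
    # The first part is the WHO region, which we discard.
    _who_region, country, *admin_units = location.split("::")

    parts = {"country": country.upper()}
    # Number the remaining parts from administrative level 1 upward.
    for level, name in enumerate(admin_units, start=1):
        parts[f"adm{level}_raw"] = name

    return parts


# Hypothetical value in the "WHO region::country::adm1::adm2" format.
example = split_location("AFR::cod::nordkivu::goma")
print(example)
```

A value with fewer parts simply yields fewer keys, which corresponds to the missing values that `str.split("::", expand=True)` leaves in the shorter rows.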


Let's see if our new `country` column suffers from the same problem as the
existing `country` column:


In [None]:
admins_df["country"].unique()


Fortunately, we can see that our new `country` column appears to be clean, so we
don't need to clean it up.

However, it appears that many rows do not have a value for `adm2_raw` or
`adm3_raw`, so we likely need to group the data by `adm1_raw`. We'll deal with
such grouping later, but let's first confirm that every row has a value for
`adm1_raw`:


In [None]:
# `len` would count missing values too, so check explicitly for non-null values.
admins_df["adm1_raw"].notna().all()


Fortunately, every row does have a value for `adm1_raw`. Therefore, assuming
they're all valid values (we'll clean up dirty data later, if necessary), we
should simply need to fetch GeoJSON data for each of the distinct countries at
administrative level 1, using the country ISO codes above (converted to upper
case).

We'll use the [geoBoundaries API](https://www.geoboundaries.org/api.html) to
fetch the GeoJSON files. Since these files can be quite large, we'll write them
to disk to avoid refetching them each time we run this notebook:


In [None]:
def geoboundary(iso3: str, level: int = 1) -> dict[str, Any]:
    """
    Return GeoJSON for a country at an administrative level from GeoBoundaries API.
    """
    iso = iso3.upper()
    path = f".geoboundaries-cache/{iso}-ADM{level}.geojson"

    # If we've already downloaded this file, read it and return the contents.
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)

    # Fetch metadata from geoboundaries to obtain GeoJSON URL
    url = f"https://www.geoboundaries.org/api/current/gbOpen/{iso}/ADM{level}"
    response = requests.get(url)
    response.raise_for_status()

    # Extract value of `"gjDownloadURL"` from `metadata`, which is the GeoJSON URL,
    # and download the GeoJSON file.
    metadata = response.json()
    response = requests.get(metadata["gjDownloadURL"])
    response.raise_for_status()

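    # Write the downloaded GeoJSON file to disk in case we need it later,
    # creating the cache directory first if it doesn't exist yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(response.content)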

    return response.json()


Now we can fetch the administrative level 1 GeoJSON for every country in the
dataset and look at the `country` and `adm1` values:


In [None]:
official_adm1s_df = pd.DataFrame(
    columns=["country", "adm1"],
    data=(
        [country, feature["properties"]["shapeName"]]
        for country in admins_df["country"].unique()
        for feature in geoboundary(country)["features"]
    ),
)

official_adm1s_df


There's another issue to address. If we look at the `adm1_raw` values in
`admins_df`, we'll see that they don't quite line up with the `adm1` values in
`official_adm1s_df`. Specifically, the values in `admins_df` are all
lowercase, contain no whitespace (multi-word names are smashed together), and
have their diacritics removed. This means that we can't look up a region by
exact name, so we need to do a bit of work to support such lookups.
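
As a rough illustration of the mismatch, the sketch below applies the same three transformations to a few official-style names using only the standard library. The `normalize` helper and the sample names are assumptions for illustration; the dataset's actual cleaning rules may differ in details such as punctuation handling.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase a name, strip diacritics, and remove all whitespace."""
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(without_marks.lower().split())


# Illustrative names only; they are not read from the GeoJSON files.
for official in ["Nord Kivu", "Équateur", "Kasaï Oriental"]:
    print(f"{official!r} -> {normalize(official)!r}")
```

Exact matching on normalized names would still miss alternate spellings and hyphenation, which is why a more tolerant comparison is useful here.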

We'll use the `thefuzz` library to do some "fuzzy" matching for us:


In [None]:
from thefuzz import process


def append_adm1_match(row):
    country, adm1_raw = row.loc[["country", "adm1_raw"]]
    choices = official_adm1s_df.query(f"country == '{country}'")["adm1"]
    adm1_match, *_ = process.extractOne(adm1_raw, choices, score_cutoff=70) or (None,)

    return pd.concat([row, pd.Series({"adm1_match": adm1_match})])


matched_adm1s_df = (
    admins_df[["country", "adm1_raw"]]
    .drop_duplicates()
    .sort_values(["country", "adm1_raw"])
    .apply(append_adm1_match, axis=1)
)

matched_adm1s_df
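
The effect of a score cutoff can be sketched without any third-party dependency. The example below swaps in the standard library's `difflib` for `thefuzz`, so the scores are on a 0 to 1 scale rather than 0 to 100 and the scoring algorithm differs. Both `best_match` and the candidate names are hypothetical.

```python
from difflib import get_close_matches
from typing import Optional


def best_match(raw: str, choices: list[str], cutoff: float = 0.7) -> Optional[str]:
    """Return the choice most similar to `raw`, or None if none reaches `cutoff`."""
    # Compare case-insensitively, but return the choice in its original form.
    by_lowercase = {choice.lower(): choice for choice in choices}
    matches = get_close_matches(raw.lower(), list(by_lowercase), n=1, cutoff=cutoff)
    return by_lowercase[matches[0]] if matches else None


# A raw value that closely resembles one of the candidates.
print(best_match("nordkivu", ["Nord-Kivu", "Sud-Kivu"]))

# A raw value that resembles none of the candidates.
print(best_match("kampala", ["Central", "Eastern", "Northern", "Western"]))
```

Raising the cutoff reduces the risk of false matches at the cost of leaving more values unmatched.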


There are still a number of unmatched `adm1_raw` values, so let's see what they
are and how many there are:

In [None]:
unmatched_adm1s_df = matched_adm1s_df[matched_adm1s_df["adm1_match"].isna()][
    ["country", "adm1_raw"]
].drop_duplicates()
print(f"Number of unique unmatched ADM1 values: {len(unmatched_adm1s_df)}")
unmatched_adm1s_df


It appears that these `adm1_raw` values are actually names of administrative
level 2 regions. For example, Kampala, the capital city of Uganda, is an
administrative level 2 region rather than a province.

The next step is to obtain ADM2 GeoJSON files from the geoBoundaries API to see
if we can match these remaining `adm1_raw` values to ADM2 names:

In [None]:
official_adm2s_df = pd.DataFrame(
    columns=["country", "adm2"],
    data=(
        [country, feature["properties"]["shapeName"]]
        for country in unmatched_adm1s_df["country"].unique()
        for feature in geoboundary(country, 2)["features"]
    ),
)

official_adm2s_df


In [None]:
def append_adm2_match(row):
    country, adm1_raw = row.loc[["country", "adm1_raw"]]
    choices = official_adm2s_df.query(f"country == '{country}'")["adm2"]
    adm2_match, *_ = process.extractOne(adm1_raw, choices, score_cutoff=75) or (None,)

    return pd.concat([row, pd.Series({"adm2_match": adm2_match})])


matched_adm2s_df = (
    unmatched_adm1s_df[["country", "adm1_raw"]]
    .drop_duplicates()
    .sort_values(["country", "adm1_raw"])
    .apply(append_adm2_match, axis=1)
)

matched_adm2s_df


In [None]:
matched_adm2s_df[matched_adm2s_df["adm2_match"].isna()]

At this point, we have 2 remaining issues to address:

1. We have 8 ADM1 values left that do not match ADM2 values.
2. For all of the ADM1 values that are actually ADM2 values (22 - 8 = 14), we
   have to find their parent ADM1 values. Unfortunately, the GeoJSON files
   returned by the geoBoundaries API do not identify the parent ADM1 region of
   each ADM2 feature.
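
One possible way around the second issue, offered only as a sketch and not necessarily the approach this notebook goes on to take, is a spatial lookup: take a point inside each ADM2 shape and find the ADM1 polygon that contains it. The example below uses toy squares and a hand-written ray-casting test so that it runs with the standard library alone. The region names and coordinates are made up, only simple outer rings are handled, and real boundaries (with `MultiPolygon` geometries and holes) would be better served by a geometry library.

```python
from typing import Optional

Point = tuple[float, float]
Ring = list[Point]


def ring_center(ring: Ring) -> Point:
    """Average the vertices of a closed ring (the last vertex repeats the first).

    Assumption: the shape is convex enough that this point lies inside it.
    """
    xs, ys = zip(*ring[:-1])
    return sum(xs) / len(xs), sum(ys) / len(ys)


def contains(ring: Ring, point: Point) -> bool:
    """Ray-casting test: is `point` inside the polygon outlined by `ring`?"""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges straddling the horizontal line through `point` can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def parent_adm1(adm2_ring: Ring, adm1_rings: dict[str, Ring]) -> Optional[str]:
    """Return the name of the ADM1 ring containing the center of `adm2_ring`."""
    center = ring_center(adm2_ring)
    for name, ring in adm1_rings.items():
        if contains(ring, center):
            return name
    return None


# Toy data: two adjacent ADM1 squares, each containing one smaller ADM2 square.
adm1_rings = {
    "West": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "East": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
}
adm2_rings = {
    "Alpha": [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5), (0.5, 0.5)],
    "Beta": [(2.5, 0.5), (3.5, 0.5), (3.5, 1.5), (2.5, 1.5), (2.5, 0.5)],
}

parents = {name: parent_adm1(ring, adm1_rings) for name, ring in adm2_rings.items()}
print(parents)
```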