## The Cholera dataset
In this notebook we'll explore cholera outbreak data (2010-2019) for sub-Saharan Africa available [here](https://github.com/HopkinsIDD/cholera_outbreaks_ssa/blob/main/reference_data/outbreak_data.csv). Further metadata about this dataset can be found in the repo's [README.md](https://github.com/HopkinsIDD/cholera_outbreaks_ssa) file. This dataset is sourced from [Zheng et al. (2022)](https://www.sciencedirect.com/science/article/pii/S1201971222003034), but for the purposes of this work, we'll use this dataset purely as a source of outbreak data. 

In [None]:
import geopandas as gpd
import holoviews as hv
import hvplot.pandas
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px

hv.extension("bokeh")  # pyright: ignore

%output widgets='live' holomap='scrubber'

In [None]:
cholera_data = pd.read_csv("data/outbreak_data.csv")
cholera_data

## A quick examination of the dataset

Below we look at a summary of the dataset: its columns, datatypes, and missing values.

In [None]:
cholera_data.columns

In [None]:
cholera_data.describe()

In [None]:
cholera_data.dtypes

In [None]:
cholera_data.isnull().values.any()

In [None]:
cholera_data.isnull().sum()

Above we can see that most columns have no missing data, with the exception of `total_deaths`, `cfr` (the case fatality rate of an outbreak) and `total_confirmed_cases`. This makes sense, as not all outbreaks will have confirmed deaths, and some suspected outbreaks may not have any confirmed cases. Below, we'll look at these missing values a bit closer.

In [None]:
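# Show every row with at least one missing value (`axis` must be passed by keyword in pandas 2.x)
cholera_data[cholera_data.isnull().any(axis=1)]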
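
To put the missing-value counts in proportion, it can help to express them as a share of all rows. The snippet below is a minimal sketch on a made-up toy table (the values are illustrative, not taken from the outbreak data); the same expression should work on `cholera_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the outbreak table (values are made up for illustration)
toy = pd.DataFrame(
    {
        "total_suspected_cases": [120, 45, 300, 18],
        "total_confirmed_cases": [10, np.nan, 25, np.nan],
        "total_deaths": [3, np.nan, 7, 0],
    }
)

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

`isnull().mean()` works because booleans average to the fraction of `True` values in each column.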

## Data cleaning and preparation
Here we pre-process the cholera data to make it easier to combine with other datasets.

In [None]:
# country column is not clean

# cholera_data = cholera_data.rename(columns={"country": "ISO3"})

In [None]:
cholera_data_admin2 = cholera_data.query("spatial_scale == 'admin2'")
cholera_data_admin2 = pd.concat(
    [
        cholera_data_admin2["location"]
        # Split `location` parts into columns
        .str.split("::", expand=True)
        .drop([0], axis=1)  # Drop WHO region column
        .rename({1: "ISO3", 2: "admin1", 3: "admin2"}, axis=1)
        .apply(lambda column: column.str.upper().str.removesuffix("HEALTHDISTRICT")),
        cholera_data_admin2.drop(["who_region", "country", "location"], axis=1),
    ],
    axis=1,
).sort_values(by=["ISO3", "admin1", "admin2"])

cholera_data_admin2.head(10)
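
To make the `location` parsing concrete, here is a small sketch on two hypothetical `location` strings. It assumes the `<WHO region>::<ISO3>::<admin1>::<admin2>` pattern implied by the cell above; the example values themselves are invented.

```python
import pandas as pd

# Hypothetical `location` values following the assumed
# "<WHO region>::<ISO3>::<admin1>::<admin2>" pattern
locations = pd.Series(
    ["AFR::COD::Nord-Kivu::GomaHealthDistrict", "AFR::NGA::Borno::Maiduguri"]
)

parts = (
    locations.str.split("::", expand=True)  # one column per "::"-separated part
    .drop(columns=[0])  # drop the WHO region part
    .rename(columns={1: "ISO3", 2: "admin1", 3: "admin2"})
    # Upper-case every part and strip a trailing "HEALTHDISTRICT" if present
    .apply(lambda column: column.str.upper().str.removesuffix("HEALTHDISTRICT"))
)
print(parts)
```

Note that `str.removesuffix` requires pandas 1.4 or newer.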

In [None]:
# To group by year and month, parse the dates and extract the month and year from 'start_date'
cholera_data_admin2["s_Date"] = pd.to_datetime(
    cholera_data_admin2["start_date"], format="%m/%d/%Y"
)
cholera_data_admin2["e_Date"] = pd.to_datetime(
    cholera_data_admin2["end_date"], format="%m/%d/%Y"
)
cholera_data_admin2["s_month"] = cholera_data_admin2["s_Date"].dt.month
cholera_data_admin2["s_year"] = cholera_data_admin2["s_Date"].dt.year
cholera_data_admin2

In [None]:
cholera_data_admin2["months_in_duration"] = cholera_data_admin2["e_Date"].dt.to_period(
    "M"
).astype(int) - cholera_data_admin2["s_Date"].dt.to_period("M").astype(int)
cholera_data_admin2
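
One subtlety of `months_in_duration` is that it counts calendar-month boundaries crossed, not elapsed time. The sketch below uses made-up dates and an equivalent year/month arithmetic (an alternative formulation, not the cell above) to show the difference from a day count.

```python
import pandas as pd

# Made-up start and end dates for three hypothetical outbreaks
dates = pd.DataFrame(
    {
        "s_Date": pd.to_datetime(["2015-01-31", "2016-11-15", "2017-03-01"]),
        "e_Date": pd.to_datetime(["2015-02-01", "2017-02-10", "2017-03-30"]),
    }
)

# Calendar-month boundaries crossed between start and end
months_crossed = (dates["e_Date"].dt.year - dates["s_Date"].dt.year) * 12 + (
    dates["e_Date"].dt.month - dates["s_Date"].dt.month
)

# Elapsed days, for comparison
days_elapsed = (dates["e_Date"] - dates["s_Date"]).dt.days

print(pd.DataFrame({"months_crossed": months_crossed, "days_elapsed": days_elapsed}))
```

The first outbreak lasts a single day but still counts as one month, while the third lasts 29 days and counts as zero, so the column is best read as "number of additional calendar months touched".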

In [None]:
cholera_data_admin2.columns

In [None]:
cholera_data_admin2["outbreak_number"].unique()

## Exploratory data analysis 

In [None]:
# Select `outbreak_number` before aggregating so other columns are not reduced needlessly
repeated_outbreaks = (
    cholera_data_admin2.groupby(["s_year", "ISO3"])[["outbreak_number"]].max()
)
repeated_outbreaks

In [None]:
repeat_outbreak_bar = repeated_outbreaks.hvplot.bar(
    x="ISO3",
    y="outbreak_number",
    by="s_year",
    cmap="Category20",
    stacked=True,
    legend="right",
    width=800,
    rot=90,
)

repeat_outbreak_bar

In [None]:
# Select the numeric columns before summing; summing text or date columns fails in pandas 2.x
overall_trends = cholera_data_admin2.groupby("s_year")[
    ["total_suspected_cases", "total_confirmed_cases", "total_deaths"]
].sum()
overall_trends

In [None]:
overall_trends.hvplot.line(
    x="s_year",
    y=["total_suspected_cases", "total_confirmed_cases", "total_deaths"],
    rot=90,
)

Interestingly, if we zoom into the chart we find that `total_deaths` sometimes exceeds `total_confirmed_cases`. Keep in mind that this line chart aggregates everything by year (not by country), so the gap may reflect differences in the quality of surveillance records across areas. Missing `total_confirmed_cases` values are also skipped when summing, which lowers that total.

Either way, we observe two distinct peaks in `total_suspected_cases`, one in 2012 and the other in 2016-2017, the latter supported by an increase in `total_confirmed_cases`.
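
One mechanical reason deaths can exceed confirmed cases in a yearly total is how pandas treats missing values: `sum()` skips `NaN`, so an outbreak with unknown confirmed cases contributes zero. The sketch below shows this on a made-up toy table; `min_count=1` is one way to keep an all-missing year as `NaN` instead of `0`.

```python
import numpy as np
import pandas as pd

# Toy data: confirmed cases are missing for two of the three outbreaks
toy = pd.DataFrame(
    {
        "s_year": [2012, 2012, 2013],
        "total_confirmed_cases": [np.nan, 2, np.nan],
        "total_deaths": [5, 1, 4],
    }
)
cols = ["total_confirmed_cases", "total_deaths"]

# Default behaviour: NaN is skipped, so an all-missing group sums to 0
default_sum = toy.groupby("s_year")[cols].sum()

# With min_count=1, a group with no valid values stays NaN
strict_sum = toy.groupby("s_year")[cols].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

In the default result, 2012 shows more deaths than confirmed cases purely because one outbreak had no confirmed-case figure.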

Grouping annual `total_suspected_cases` by country and `admin2` unit.

In [None]:
# Select the column before summing; summing text or date columns fails in pandas 2.x
yearly_cases = (
    cholera_data_admin2.groupby(["s_year", "ISO3", "admin2"])["total_suspected_cases"]
    .sum()
    .reset_index()
)
yearly_cases

Digging deeper into our first bar chart (see above), we will look at how `total_suspected_cases` is distributed over time and where recurrent outbreaks are occurring in consecutive years.

In [None]:
# using hvplot to create a holoviews plot, but could also use holoviews itself

bar_chart = yearly_cases.hvplot.bar(
    x="ISO3",
    y="total_suspected_cases",
    by="s_year",
    cmap="Category20",
    stacked=True,
    legend="right",
    width=800,
    rot=90,
)

bar_chart

In the stacked bar chart above, we observe that some countries see repeated outbreaks more often than others. Understanding this geographic distribution, and whether these countries are located near each other, will be helpful in understanding outbreak dynamics.

Which countries are most affected? 

In [None]:
# Select the numeric columns before summing; summing text or date columns fails in pandas 2.x
country_agg = cholera_data_admin2.groupby("ISO3")[
    ["total_suspected_cases", "total_deaths"]
].sum()
country_agg = country_agg.sort_values("total_suspected_cases", ascending=False).head(10)

country_agg.loc[:, ["total_deaths", "total_suspected_cases"]].iloc[::-1].hvplot.barh(
    colormap="coolwarm_r", stacked=False, legend="bottom_right", height=600
)

## Giving geographic context

Now we'll want to merge the yearly cases against the `admin0` boundaries so that we can map the distribution over time. Because both datasets share the three-letter `ISO3` country code, we can merge them together on that column.

We'll need to add administrative boundaries to provide geospatial context. We'll use the ICPAC Administrative boundaries available [here](https://geoportal.icpac.net/layers/geonode:afr_g2014_2013_0/metadata_detail) and read them in with `geopandas`. This may take a few seconds.

In [None]:
# admin0_gdf = gpd.read_file(
#     "https://geoportal.icpac.net/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typename=geonode%3Aafr_g2014_2013_0&outputFormat=json&srs=EPSG%3A4326&srsName=EPSG%3A4326"
# )
# admin0_gdf

In [None]:
# admin1_gdf = gpd.read_file("https://geoportal.icpac.net/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typename=geonode%3Aafr_g2014_2013_1&outputFormat=json&srs=EPSG%3A4326&srsName=EPSG%3A4326")
# admin1_gdf

In [None]:
# admin2_gdf = gpd.read_file("https://geoportal.icpac.net/geoserver/ows?service=WFS&version=1.0.0&request=GetFeature&typename=geonode%3Aafr_g2014_2013_2&outputFormat=json&srs=EPSG%3A4326&srsName=EPSG%3A4326")
# admin2_gdf

In [None]:
import asyncio
from zipfile import ZipFile
from io import BytesIO

import aiohttp
from geojson_pydantic import FeatureCollection

In [None]:
%autoawait asyncio


async def fetch_bytes(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as response:
        return await response.read()


async def fetch_gadm_geojson(
    session: aiohttp.ClientSession, iso3: str, adm: int
) -> str:
    geojson_filename = f"gadm41_{iso3}_{adm}.json"
    zip_url = f"https://geodata.ucdavis.edu/gadm/gadm4.1/json/{geojson_filename}.zip"
    zip_file = ZipFile(BytesIO(await fetch_bytes(session, zip_url)))

    return zip_file.read(geojson_filename).decode("utf-8")


async def fetch_gadm_geojsons(iso3s: list[str]) -> list[FeatureCollection]:
    async with aiohttp.ClientSession() as session:
        requests = [fetch_gadm_geojson(session, iso3, 2) for iso3 in iso3s]

        # Collect raised exceptions in the result list instead of failing fast
        results = await asyncio.gather(*requests, return_exceptions=True)

    # Skip failed downloads (exceptions cannot be parsed) and parse the rest
    return [
        FeatureCollection.parse_raw(geojson)
        for geojson in results
        if not isinstance(geojson, BaseException)
    ]
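
With `return_exceptions=True`, `asyncio.gather` returns exceptions as ordinary items in the result list, so they need to be separated from successful results before any parsing. The following is a minimal standard-library sketch of that pattern; `fake_fetch` and `fetch_all` are hypothetical stand-ins, not part of this notebook, and no network access is involved.

```python
import asyncio


async def fake_fetch(iso3: str) -> str:
    # Stand-in for a download; "XXX" simulates a failed request
    await asyncio.sleep(0)
    if iso3 == "XXX":
        raise ValueError(f"no file for {iso3}")
    return f"payload-{iso3}"


async def fetch_all(
    iso3s: list[str],
) -> tuple[dict[str, str], dict[str, BaseException]]:
    # Results come back in the same order as the inputs
    results = await asyncio.gather(
        *(fake_fetch(iso3) for iso3 in iso3s), return_exceptions=True
    )
    ok: dict[str, str] = {}
    failed: dict[str, BaseException] = {}
    for iso3, result in zip(iso3s, results):
        if isinstance(result, BaseException):
            failed[iso3] = result  # keep the error so it can be reported
        else:
            ok[iso3] = result
    return ok, failed


ok, failed = asyncio.run(fetch_all(["COD", "XXX", "NGA"]))
print(ok)
print(failed)
```

Inside a notebook with a running event loop, `await fetch_all([...])` would replace the `asyncio.run(...)` call. Keeping the failures in a dictionary makes it easy to see which country codes had no boundary file.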

In [None]:
iso3s = cholera_data_admin2["ISO3"].unique().tolist()
geojsons = await fetch_gadm_geojsons(iso3s)
# Read the `features` field explicitly rather than relying on iterating the model itself
features = [feature for geojson in geojsons for feature in geojson.features]
feature_collection = FeatureCollection(type="FeatureCollection", features=features)
admin2_gdf = gpd.GeoDataFrame.from_features(feature_collection).rename(
    {"GID_0": "ISO3"}, axis=1
)
admin2_gdf

In [None]:
from thefuzz import process


def append_admin2_match(score_cutoff: int):
    def go(row):
        iso3, admin2 = row.loc[["ISO3", "admin2"]]
        choices = admin2_gdf.query(f"ISO3 == '{iso3}'")["NAME_2"]
        triple = process.extractOne(admin2, choices, score_cutoff=score_cutoff)
        name_2, score, *_ = triple or ("", 0)

        return pd.concat([row, pd.Series({"NAME_2": name_2, "score": score})])

    return go


score_cutoff = 91

matched_admin2s_df = (
    cholera_data_admin2[["ISO3", "admin2"]]
    .drop_duplicates()
    .sort_values(["ISO3", "admin2"])
    .apply(append_admin2_match(score_cutoff), axis=1)
    .sort_values(["score"])
    .query("NAME_2 != ''")
)

matched_admin2s_df.to_csv(f"matched_admin2s_{score_cutoff}.csv", index=False)
matched_admin2s_df
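
To illustrate how a similarity cutoff filters candidate names, here is a small sketch using the standard library's `difflib` in place of `thefuzz`. This is a swapped-in technique: `difflib` scores on a 0 to 1 scale and will not reproduce `thefuzz` scores exactly, and `best_match` and the district names are hypothetical.

```python
from difflib import get_close_matches
from typing import Optional


def best_match(name: str, choices: list[str], cutoff: float = 0.9) -> Optional[str]:
    """Return the choice most similar to `name`, or None if none reaches `cutoff`."""
    # Compare in upper case, but return the choice in its original spelling
    by_upper = {choice.upper(): choice for choice in choices}
    hits = get_close_matches(name.upper(), list(by_upper), n=1, cutoff=cutoff)
    return by_upper[hits[0]] if hits else None


# Hypothetical boundary names for one country
choices = ["Goma", "Nyiragongo", "Karisimbi"]

print(best_match("GOMA", choices))  # exact match after case normalisation
print(best_match("NYIRAGONGO ", choices))  # near match (stray trailing space)
print(best_match("UNKNOWN", choices))  # nothing reaches the cutoff
```

A high cutoff trades recall for precision: fewer districts are matched, but the matches that remain are less likely to be wrong.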

In [None]:
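import geopy.geocoders
import pycountry

geolocator = geopy.geocoders.Nominatim(user_agent="cholera-dashboard")
# `geocode` requires a query string; this country name is only a placeholder query
geolocator.geocode("Democratic Republic of the Congo")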

In [None]:
yearly_cases_std = yearly_cases.merge(
    matched_admin2s_df, how="left", on=["ISO3", "admin2"]
)
yearly_cases_std = yearly_cases_std[~yearly_cases_std["NAME_2"].isna()]
yearly_cases_std

Checking the coordinate reference system of the `admin0` administrative boundary dataset. 

In [None]:
# admin0_gdf.crs

In [None]:
# merged_df = pd.merge(admin0_gdf, yearly_cases, how="inner", on="ISO3")
# merged_df

As we did an `inner` merge, this will keep only those countries within the `admin0` dataset that _also_ have records in the cholera outbreak dataset. 
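
The effect of an `inner` merge, and a way to audit what it drops, can be sketched on two made-up tables. The country codes and values below are illustrative only, and `ZZZ` is a deliberately invalid code.

```python
import pandas as pd

# Made-up boundary and case tables sharing an `ISO3` key
boundaries = pd.DataFrame(
    {"ISO3": ["COD", "NGA", "KEN"], "name": ["DR Congo", "Nigeria", "Kenya"]}
)
cases = pd.DataFrame(
    {"ISO3": ["COD", "NGA", "ZZZ"], "total_suspected_cases": [100, 50, 7]}
)

# Inner merge: only keys present in both tables survive
inner = boundaries.merge(cases, how="inner", on="ISO3")

# Outer merge with an indicator column shows which side each key came from
audit = boundaries.merge(cases, how="outer", on="ISO3", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(inner)
print(unmatched[["ISO3", "_merge"]])
```

Checking `unmatched` before switching to `inner` helps catch case records that silently disappear because their key does not appear in the boundary data.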

## Mapping Cholera outbreaks from 2010-2019 at the national level

Below, we use `plotly` to map cholera outbreaks over time. This is to get a better sense, geographically, of where repeated outbreaks are occurring and to visualize any spatial autocorrelation between them (i.e., are nations that repeatedly experience outbreaks in close proximity to each other?).

In [None]:
yearly_cases_std.columns

In [None]:
yearly_snapshot = px.choropleth(
    yearly_cases_std,
    locations="NAME_2",
    color="total_suspected_cases",
    hover_name="NAME_2",
    color_continuous_scale=px.colors.sequential.Plasma,
    animation_frame="s_year",
    animation_group="NAME_2",
    range_color=[0, 100000],
    geojson=feature_collection,
    featureidkey="properties.NAME_2",
)

yearly_snapshot.update_geos(scope="africa")

yearly_snapshot.show()

We see a repeated central focal point around the Democratic Republic of the Congo (ISO3: `COD`). This is supported by the earlier bar chart, which showed that COD was the country with the greatest number of `total_suspected_cases`.

## Important considerations
Something to keep in mind for your future analyses: This dataset is of outbreaks in sub-Saharan Africa and is not explicitly an `endemic` Cholera dataset. 

* `Epidemic cholera` is generally sporadic and located further inland
* `Endemic cholera` has a recurring incidence over consecutive years, often in coastal locations

These two are not mutually exclusive, and both can take place in the same area. For the purposes of this PoC, however, we'll treat this as an endemic cholera study (knowing that there will be other dynamics at play inland).

Additionally, as the data above is aggregated at the national level, we are not able to make any assumptions about the sub-national geographic distribution of the outbreak (the outbreak could be near a coastline, or further inland). 

This is why we will want to explore at the subnational levels as well.

## Data preparation for machine learning purposes

In order to explore the relationship between environmental covariates and cholera risk, we will convert cholera outbreaks into a `binary` data format based on the month of the outbreak's `start_date`. A value of `1` indicates an outbreak in a particular month (_TBD: in x subnational administrative unit_) and a value of `0` indicates no outbreak.

In retrospect, this may not even be required here - as we can more easily implement it as a pre-processing step on the predictor variables using `get_dummies()` just before the `train_test_split()` function. 

In [None]:
cholera_data_admin2_std = matched_admin2s_df.merge(
    cholera_data_admin2, how="left", on=["ISO3", "admin2"]
)
cholera_data_admin2_std

In [None]:
one_hot = pd.get_dummies(cholera_data_admin2_std["s_month"], prefix="month")
one_hot

In [None]:
# Join the encoded df
cholera_data_admin2_one_hot = cholera_data_admin2_std.join(one_hot)
cholera_data_admin2_one_hot

On one-hot-encoding: 
* _Do we want to treat the binary variable as a reflection of when the outbreak started?_
* _Or do we want to one-hot-encode all months during the duration of the outbreak? In which case we need to revise the code above._
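
If the second option is chosen, one possible approach is to flag every calendar month between the start and end dates. The following is a minimal sketch under that assumption, using made-up dates; the `month_N` column names mirror the `get_dummies` prefix used above, but the loop itself is an illustration, not the notebook's method.

```python
import pandas as pd

# Made-up outbreaks: the first spans a year boundary, the second stays in one month
outbreaks = pd.DataFrame(
    {
        "s_Date": pd.to_datetime(["2015-11-20", "2016-03-05"]),
        "e_Date": pd.to_datetime(["2016-01-10", "2016-03-25"]),
    }
)

# Start with all twelve month flags set to 0
month_columns = [f"month_{month}" for month in range(1, 13)]
flags = pd.DataFrame(0, index=outbreaks.index, columns=month_columns)

# Set a flag for every calendar month the outbreak was active in
for idx, start, end in zip(outbreaks.index, outbreaks["s_Date"], outbreaks["e_Date"]):
    for period in pd.period_range(start, end, freq="M"):
        flags.loc[idx, f"month_{period.month}"] = 1

outbreaks_multi_hot = outbreaks.join(flags)
print(outbreaks_multi_hot)
```

Unlike one-hot encoding of the start month, a row can carry several `1` values here, and the year is discarded, so outbreaks lasting longer than twelve months would flag every column.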

With regard to the focus on endemic cholera, do we want to subset the dataset to only those areas within a specified distance (e.g., 100 km) of the shore? This would reduce the confounding factors introduced by epidemic cholera further inland. **However**, it would also reduce the data available for training and testing.
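
A rough way to prototype such a distance filter is to compare a representative point for each area (for example a centroid) against sampled coastline points using the haversine formula. The sketch below is only an approximation under stated assumptions: the coordinates are invented, `within_distance_of_coast` is a hypothetical helper, and a real analysis would more likely measure distance to actual coastline geometries in a projected coordinate system.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance_of_coast(
    point: tuple[float, float],
    coast_points: list[tuple[float, float]],
    max_km: float = 100.0,
) -> bool:
    """True if `point` lies within `max_km` of at least one sampled coastline point."""
    return any(haversine_km(*point, *coast) <= max_km for coast in coast_points)


# Hypothetical (lat, lon) coordinates, for illustration only
coast_points = [(0.0, 0.0), (0.0, 1.0)]
near_area = (0.5, 0.5)
far_area = (5.0, 5.0)

print(within_distance_of_coast(near_area, coast_points))
print(within_distance_of_coast(far_area, coast_points))
```

Because only sampled points are checked, sparse sampling can overstate the distance to the coast, so the coastline would need to be sampled much more finely than the chosen threshold.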