# Exploratory Data Analysis of the Outbreak Dataset

In this notebook we'll explore cholera outbreak data (2010-2019) for sub-Saharan Africa available [here](https://github.com/HopkinsIDD/cholera_outbreaks_ssa/blob/main/reference_data/outbreak_data.csv). Further metadata about this dataset can be found in the repo's [README.md file](https://github.com/HopkinsIDD/cholera_outbreaks_ssa). This dataset is sourced from [Zheng et al. (2022)](https://www.sciencedirect.com/science/article/pii/S1201971222003034), but for the purposes of this work, we'll use this dataset purely as a source of outbreak data. 

Please refer to our `geolocate.ipynb` within this same repo to see the methodology behind the assembly and pre-processing of the boundary files for all district (administrative level 2) outbreak data (see: `geolocations.geojson`). We will join the district boundary data with the outbreak information in the notebook below. 

## Preprocessing of the outbreak dataset

In [None]:
import geopandas as gpd
import pandas as pd
import plotly.express as px
import numpy as np

In [None]:
outbreaks_df = pd.read_csv(
    "../data/outbreak_data.csv", parse_dates=["start_date", "end_date"]
).assign(
    start_month=lambda df: df.start_date.dt.month,
    start_year=lambda df: df.start_date.dt.year,
    # Do we need duration_in_months?  There is already a duration column that
    # represents the duration in weeks.
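    duration_in_months=lambda df: np.ceil(
        # Recent pandas versions reject the ambiguous "M" unit in
        # np.timedelta64(1, "M"), so divide by the average Gregorian month
        # length (365.2425 / 12 days), the value NumPy used for that unit.
        (df.end_date - df.start_date) / pd.Timedelta(days=365.2425 / 12)
    ).astype(int),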
)

outbreaks_df

In [None]:
outbreaks_df.dtypes.sort_index()

In [None]:
# Expand `location` parts into separate columns
outbreaks_expanded_location_df = (
    outbreaks_df["location"]
    .str.split("::", expand=True)
    .rename(columns={0: "who_region", 1: "ISO3", 2: "admin1", 3: "admin2", 4: "admin3"})
    .drop(columns=["who_region"])
    .apply(lambda column: column.str.upper().str.removesuffix("HEALTHDISTRICT"))
)

outbreaks_expanded_location_df
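
To make the splitting step concrete, here is a minimal sketch on two made-up `location` strings. The values are hypothetical and not taken from the dataset; the sketch only assumes the `region::ISO3::admin1::admin2` layout implied by the rename above.

```python
import pandas as pd

# Hypothetical location strings, for illustration only
sample_locations = pd.Series(
    ["AFR::KEN::Nairobi::Westlands", "AFR::CMR::Nord::GarouaHealthDistrict"]
)

sample_parts = (
    sample_locations.str.split("::", expand=True)
    .rename(columns={0: "who_region", 1: "ISO3", 2: "admin1", 3: "admin2"})
    .drop(columns=["who_region"])
    # Upper-case every part, then strip a trailing "HEALTHDISTRICT" if present
    .apply(lambda column: column.str.upper().str.removesuffix("HEALTHDISTRICT"))
)

print(sample_parts)
```

Upper-casing before removing the suffix means that differently cased spellings of the suffix are all stripped by the same rule.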

In [None]:
outbreaks_admin2_df = (
    pd.concat(
        [
            outbreaks_expanded_location_df.drop("admin3", axis=1),
            outbreaks_df.drop(["who_region", "country", "location"], axis=1),
        ],
        axis=1,
    )
    .query("spatial_scale == 'admin2'")
    .sort_values(by=["ISO3", "admin1", "admin2"])
)

outbreaks_admin2_df

We've cleaned up the outbreak dataset above. Now we'll import the administrative boundaries for all districts to give each of the district outbreaks a geographic context. 

In [None]:
district_geometries = gpd.read_file("geolocations.geojson")
district_geometries

In [None]:
# sourcery skip: use-fstring-for-concatenation
yearly_cases_gdf = gpd.GeoDataFrame(
    (
        outbreaks_admin2_df.groupby(
            [
                "start_year",
                "ISO3",
                "admin1",
                "admin2",
                "location_period_id",
            ]
        )["total_suspected_cases"]
        .sum()
        .reset_index()
    )
    .merge(district_geometries, on="location_period_id")
    .assign(location=lambda gdf: gdf.admin2 + ", " + gdf.admin1 + ", " + gdf.ISO3)
    .set_index("location_period_id")
)

yearly_cases_gdf
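
Because `merge` defaults to an inner join, any outbreak row whose `location_period_id` has no matching boundary is dropped silently. A minimal sketch of how one might check for such rows, using toy frames (all values hypothetical) and pandas' `indicator=True` option:

```python
import pandas as pd

# Toy stand-ins for the aggregated cases and the boundary table
toy_cases = pd.DataFrame(
    {"location_period_id": [1, 2, 3], "total_suspected_cases": [10, 25, 7]}
)
toy_boundaries = pd.DataFrame(
    {"location_period_id": [1, 2], "geometry": ["POLYGON A", "POLYGON B"]}
)

# A left join with indicator=True adds a `_merge` column that records
# whether each row found a match in the right-hand table
checked = toy_cases.merge(
    toy_boundaries, on="location_period_id", how="left", indicator=True
)

unmatched = checked.loc[
    checked["_merge"] == "left_only", "location_period_id"
].tolist()

print(unmatched)
```

An empty `unmatched` list would indicate that every aggregated outbreak row has a boundary to be drawn with.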

We will add a new column `outbreak` to represent presence of an outbreak in the district (outbreak = 1). This will be used to provide sums or counts of outbreaks and will later be used when we develop a machine learning model and need outbreak absence data points (outbreak = 0).

In [None]:
yearly_cases_gdf["outbreak"] = 1

In [None]:
yearly_cases_gdf
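
One possible way to construct the absence rows (outbreak = 0) mentioned above is to build every district-year combination and mark the combinations with no recorded outbreak. The sketch below uses toy data and is an assumption about how this could be done, not necessarily the approach taken later in this project.

```python
import pandas as pd

# Toy presence-only data: one row per district-year with an outbreak
toy_presence = pd.DataFrame(
    {
        "location_period_id": [101, 101, 202],
        "start_year": [2010, 2012, 2011],
        "outbreak": [1, 1, 1],
    }
)

# Every combination of district and year in the (toy) study period
full_index = pd.MultiIndex.from_product(
    [toy_presence["location_period_id"].unique(), range(2010, 2013)],
    names=["location_period_id", "start_year"],
)

# Reindexing onto the full grid fills the missing combinations with 0
presence_absence = (
    toy_presence.set_index(["location_period_id", "start_year"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)

print(presence_absence)
```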

Total number of georeferenced outbreaks within the pre-processed dataset:

In [None]:
len(yearly_cases_gdf)

## Exploratory data analysis

First, we will explore those countries that repeatedly see cholera outbreaks. We'll group the district data by both start year and country code.

In [None]:
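# Count district-level outbreak records per country and start year
repeated_outbreaks = yearly_cases_gdf.groupby(
    ["start_year", "ISO3"], as_index=False
)["outbreak"].sum()

In [None]:
repeat_outbreak_bar = px.bar(
    repeated_outbreaks,
    x="ISO3",
    y="outbreak",
    color="start_year",
)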

repeat_outbreak_bar
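
To identify recurrence more directly than stacked bars allow, one could count the distinct years in which each country has at least one district-level outbreak. A minimal sketch on toy data (the rows are hypothetical):

```python
import pandas as pd

# Toy district-level outbreak records
toy_outbreaks = pd.DataFrame(
    {
        "ISO3": ["COD", "COD", "COD", "KEN", "KEN", "NGA"],
        "start_year": [2010, 2011, 2011, 2015, 2017, 2018],
        "outbreak": [1, 1, 1, 1, 1, 1],
    }
)

# Number of distinct years with at least one outbreak, per country
years_with_outbreaks = (
    toy_outbreaks.groupby("ISO3")["start_year"]
    .nunique()
    .sort_values(ascending=False)
)

print(years_with_outbreaks)
```

Counting distinct years rather than rows keeps a single year with many affected districts from looking like a repeated outbreak.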

## Mapping cholera outbreaks from 2010-2019 at the district level

Below, we use `plotly` to map cholera outbreaks over time. The goal is to get a better geographic sense of where repeated outbreaks are occurring and to visualize any spatial autocorrelation between them (i.e., are areas that repeatedly experience outbreaks closer to each other?).

In [None]:
yearly_snapshot = px.choropleth(
    yearly_cases_gdf,
    locations=yearly_cases_gdf.index,
    geojson=yearly_cases_gdf.geometry,
    color="total_suspected_cases",
    hover_name="location",
    color_continuous_scale=px.colors.sequential.Plasma,
    animation_frame="start_year",
    animation_group="location",
    range_color=[0, 100000],
)

yearly_snapshot.update_geos(scope="africa")
yearly_snapshot.show()
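
The map gives only a visual impression of clustering. One common way to quantify spatial autocorrelation is global Moran's I. The sketch below computes it with NumPy for four hypothetical districts arranged in a line, using a made-up binary adjacency matrix. The case counts and the neighbour structure are assumptions for illustration and are not derived from `geolocations.geojson`.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for `values` given a spatial weights matrix."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    deviations = values - values.mean()
    n = len(values)
    # I = (n / sum of weights) * (z' W z) / (z' z)
    numerator = n * (deviations @ weights @ deviations)
    denominator = weights.sum() * (deviations @ deviations)
    return numerator / denominator


# Hypothetical adjacency: district i borders districts i - 1 and i + 1
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)

# High counts next to high counts versus high counts next to low counts
clustered = morans_i([100, 90, 10, 5], adjacency)
alternating = morans_i([100, 5, 90, 10], adjacency)

print(clustered, alternating)
```

Positive values indicate that neighbouring districts tend to have similar case counts, while negative values indicate that neighbours tend to differ.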