# Exploring the Outbreak Geometries


In [None]:
import typing
import warnings

import geopandas as gpd
import numpy as np
import pandas as pd

from shapely.errors import ShapelyDeprecationWarning

warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)

First, we'll read our outbreak data and get outbreak counts per spatial scale:


In [None]:
outbreaks_df = pd.read_csv(
    "../data/outbreak_data.csv", parse_dates=["start_date", "end_date"]
)
outbreaks_df.value_counts("spatial_scale")

We don't want to mix outbreaks from different spatial scales, because their
geometries can overlap, which would complicate computing zonal statistics.
Instead, we'll select only the outbreaks in admin2 regions, since they make up
the majority of the outbreaks.

Further, we'll add `start_year`, `start_month`, and `duration_in_months` columns
as an example of how we could, elsewhere (not in this notebook), determine the
full span of months for which we'll need to obtain environmental parameter
data.


In [None]:
admin2_outbreaks_df = outbreaks_df.query("spatial_scale == 'admin2'").assign(
    start_year=lambda df: df.start_date.dt.year,
    start_month=lambda df: df.start_date.dt.month,
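    # np.timedelta64(1, "M") has no fixed length and recent pandas versions
    # reject it, so divide by the mean Gregorian month length instead.
    duration_in_months=lambda df: np.ceil(
        (df.end_date - df.start_date) / pd.Timedelta(days=365.2425 / 12)
    ).astype(int),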
)

admin2_outbreaks_df
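
As a rough sketch of how the span of months could be derived elsewhere, the
following expands each outbreak into the calendar months it touches. The toy
data and the `months_spanned` helper are hypothetical, not part of our dataset
or pipeline. Note that it counts calendar months inclusively, so its result can
differ from `duration_in_months`.

```python
import pandas as pd

# Toy stand-in for admin2_outbreaks_df (values are made up for illustration).
toy_outbreaks_df = pd.DataFrame(
    {
        "location_period_id": [1, 2],
        "start_date": pd.to_datetime(["2015-01-20", "2016-11-03"]),
        "end_date": pd.to_datetime(["2015-03-02", "2017-01-15"]),
    }
)


def months_spanned(start: pd.Timestamp, end: pd.Timestamp) -> list:
    """Return every calendar month (as 'YYYY-MM') touched by [start, end]."""
    return [str(period) for period in pd.period_range(start, end, freq="M")]


# One list of months per outbreak.
toy_months_df = toy_outbreaks_df.assign(
    months=lambda df: [
        months_spanned(start, end)
        for start, end in zip(df.start_date, df.end_date)
    ]
)

# The distinct months across all outbreaks, i.e. the months for which
# environmental data would be needed.
all_months = sorted({m for months in toy_months_df.months for m in months})
```

Here `all_months` holds each month once, even when several outbreaks share it.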

Now we can merge the outbreak data with our geometries shapefile on distinct
`location_period_id` to obtain the geometries for only our distinct admin2
outbreak regions.  In addition, we'll drop all rows with duplicate geometries.
This is because we have found duplicate geometries (identical coordinates) for
different `location_period_id` values.  We must use `keep=False` to throw out
_all_ duplicates because we don't know which row (if any) is valid.


In [None]:
geometries_file = "../data/AfricaShapefiles/total_shp_0427.shp"
geolocations_gdf = typing.cast(
    # We have to cast again because the pandas type hints are not properly
    # specified for subclassing DataFrame.  Thus, after invoking various methods
    # on our GeoDataFrame, its type reverts to DataFrame, so we must again tell
    # the type checker that we actually have a GeoDataFrame.
    gpd.GeoDataFrame,
    typing.cast(gpd.GeoDataFrame, gpd.read_file(geometries_file))
    .rename(columns={"lctn_pr": "location_period_id"})
    .merge(
        admin2_outbreaks_df[["location_period_id"]].drop_duplicates(),
        how="inner",
        on="location_period_id",
    )
    .drop_duplicates("geometry", keep=False)
    .assign(location_period_id=lambda df: df["location_period_id"].astype(int)),
)

geolocations_gdf
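
To illustrate the effect of `keep=False`, here is a minimal sketch on toy data.
The WKT strings are a stand-in for real geometries, and the values are made up.

```python
import pandas as pd

# Toy data: ids 2 and 3 share the same (stand-in) geometry.
toy_df = pd.DataFrame(
    {
        "location_period_id": [1, 2, 3],
        "geometry": ["POINT (0 0)", "POINT (1 1)", "POINT (1 1)"],
    }
)

# The default (keep="first") silently retains id 2, which may be the wrong row.
kept_first = toy_df.drop_duplicates("geometry")

# keep=False discards every row that has a duplicate, so ids 2 and 3 both go.
kept_none = toy_df.drop_duplicates("geometry", keep=False)
```

With the default, we would be guessing that the first row is the valid one;
with `keep=False`, only locations with an unambiguous geometry remain.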

We can see this gives us 473 distinct regions.  Let's visualize them:

In [None]:
display(geolocations_gdf.crs)
geolocations_gdf.boundary.plot(linewidth=0.2)

Now we'll merge our distinct geometries with our admin2 outbreaks to see how
many outbreaks we can work with:

In [None]:
admin2_outbreaks_df.merge(geolocations_gdf, how="inner", on="location_period_id")

We can see that we are now left with 661 outbreaks at the admin2 level after
eliminating all locations with duplicate geometries.
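
If we also wanted to see which outbreaks were excluded, one option is a left
merge with `indicator=True`. This is only a sketch on toy data; the
`outbreak_id` column and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for admin2_outbreaks_df and geolocations_gdf.
toy_outbreaks = pd.DataFrame(
    {"outbreak_id": [10, 11, 12], "location_period_id": [1, 2, 2]}
)
toy_locations = pd.DataFrame({"location_period_id": [1]})

# A left merge keeps every outbreak; the _merge column records whether a
# matching location was found ("both") or not ("left_only").
merged = toy_outbreaks.merge(
    toy_locations, how="left", on="location_period_id", indicator=True
)

n_kept = int((merged["_merge"] == "both").sum())
dropped = merged[merged["_merge"] == "left_only"]
```

Here `dropped` lists the outbreaks whose location has no usable geometry.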

Finally, we'll save our distinct `location_period_id`s and geometries for use
elsewhere:

In [None]:
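# Name the driver explicitly, since older geopandas versions default to
# ESRI Shapefile rather than inferring the format from the file extension.
geolocations_gdf.to_file("geolocations.geojson", driver="GeoJSON")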