# Turn water observations into waterbody polygons

* **Products used:** 
[wofs_ls_summary_alltime](https://explorer.digitalearth.africa/products/wofs_ls_summary_alltime)
* **Special requirements:** 
This notebook requires the [python_geohash](https://pypi.org/project/python-geohash/) library. You can install it locally by using `python -m pip install python-geohash`.
* **Prerequisites:** 
    * A coastline polygon to filter out polygons generated from ocean pixels.
        * Variable name: `land_sea_mask_fp`
        * Here we have used the [Marine Regions Global Oceans and Seas v01 dataset](https://www.marineregions.org/sources.php#goas).
* **Optional prerequisites:**
    * River line dataset for filtering out polygons comprised of river segments.
        * Variable name: `major_rivers_mask_fp`
        * This dataset is only needed if `filter_out_major_rivers_polygons = True`.
        * This filter is turned off during the production of the water bodies shapefile.
    * Urban high rise polygon dataset
        * Variable name: `urban_mask_fp`. This dataset is optional and can be skipped by setting `filter_out_urban_polygons = False`.
        * WOfS has a known limitation, where deep shadows thrown by tall CBD buildings are misclassified as water. This results in 'waterbodies' around these misclassified shadows in capital cities. If you are not using WOfS for your analysis, you may choose to set `filter_out_urban_polygons = False`.
        * Here we have generated a polygon dataset to act as our `urban_mask` by thresholding the High Resolution Population Density Maps dataset.

## Background

Water is one of the most precious natural resources and is essential for the survival of life on Earth. For many countries in Africa, the scarcity of water is both an economic and a social issue. Water is required not only for consumption but also for industries and ecosystems to function and flourish.

With the demand for water increasing, there is a need to better understand our water availability to ensure we are managing our water resources effectively and efficiently.  

Digital Earth Africa (DE Africa)'s [Water Observations from Space (WOfS) dataset](https://docs.digitalearthafrica.org/en/latest/data_specs/Landsat_WOfS_specs.html) provides a water-classified image of Africa approximately every 16 days. These individual water observations have been combined into a [WOfS All-Time Summary](https://explorer.digitalearth.africa/products/wofs_ls_summary_alltime) product, which calculates the frequency of wet observations (compared against all clear observations of that pixel) over the full 30-plus-year satellite archive.

The WOfS All-Time Summary product provides valuable insights into the persistence of water across the African landscape on a pixel by pixel basis. While knowing the wet history of a single pixel within a waterbody is useful, it is more useful to be able to map the whole waterbody as a single object. 

This notebook demonstrates a workflow for mapping waterbodies across Africa as polygon objects. This workflow has been used to produce **DE Africa Waterbodies**. 

## Description
This notebook follows this workflow:

* Load the required python packages
* Load the required functions
* Set your chosen analysis parameters:
    * set up some file names for the inputs and outputs
    * create a datacube query object
    * set the analysis region
    * wetness threshold/s
    * min/max waterbody size
    * minimum number of valid observations
    * read in a land/sea mask
    * optional flag to filter out waterbodies that intersect with major rivers
        * if you set this flag you will need to provide a dataset to do the filtering
    * read in an urban mask
* Generate the first temporary polygon set:
  * For each tile:
    * Load the WOfS All Time Summary Dataset
    * Keep only pixels observed at least x times
    * Keep only pixels identified as wet at least x% of the time
        * Here the code can take in two wetness thresholds, to produce two initial temporary polygon files.
    * Convert the raster data into polygons
    * Append the polygon set to a temporary shapefile
* Remove artificial polygon borders created at tile boundaries by merging polygons that intersect across tile boundaries
* Filter the combined polygon dataset (note that this step happens after the merging of tile boundary polygons to ensure that artifacts are not created by part of a polygon being filtered out, while the remainder of the polygon that sits on a separate tile is treated differently).
    * Filter the polygons based on area / size
    * Remove polygons that intersect with Africa's coastline
    * Remove erroneous 'water' polygons within high-rise CBD areas
    * Combine the two generated wetness thresholds (optional)
    * Optional filtering for proximity to major rivers  
* Save out the final polygon set to a shapefile

## Load python packages

In [None]:
# Install the local waterbodies functions. Comment out the line below if they are already installed.
!python -m pip install ../.

In [8]:
import os
import math
import shutil
import pyproj
import fiona
import shapely
import datetime
import rioxarray
import numpy as np
import pandas as pd
import geopandas as gpd
from pathlib import Path
import geohash as gh
import logging

import datacube
from datacube.utils import geometry

from deafrica_tools.areaofinterest import define_area
from deafrica_tools.spatial import xr_vectorize, xr_rasterize

from deafrica_waterbodies.cli.logs import logging_setup
from deafrica_waterbodies.cli.io import write_waterbodies_to_file
# from deafrica_waterbodies.cli.group_options import MutuallyExclusiveOption

from deafrica_waterbodies.waterbodies.polygons.attributes import add_timeseries_attribute, add_area_and_perimeter_attributes, assign_unique_ids
from deafrica_waterbodies.waterbodies.polygons.make_polygons import get_product_tiles, get_waterbodies, get_polygons_using_thresholds, merge_polygons_at_tile_boundary, filter_waterbodies, check_wetness_thresholds

In [9]:
# Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [10]:
logging_setup(True)
_log = logging.getLogger(__name__)

## Define Analysis Parameters

The following section walks you through the analysis parameters you will need to set for this workflow. Each section describes the parameter, how it is used, and what value was used for the DE Africa Waterbodies product.

### Set up some file names for the inputs and outputs

In [15]:
# Create the folder to store the outputs.
output_dir = "/home/jovyan/Data/Waterbodies/OutputDatasets"
output_dir_fp = Path(output_dir)
os.makedirs(output_dir_fp, exist_ok=True)

# Set up some filenames to use.
base_filename = "ContinentalWaterbodies"
base_filename_fp = output_dir_fp / base_filename

os.makedirs(base_filename_fp, exist_ok=True)

# Putting this here to allow for testing of both methods
# to handle large polygons. 
#handle_large_polygons = 'erode-dilate-v1'
handle_large_polygons = 'erode-dilate-v2'
#handle_large_polygons = 'nothing'

# The folder for the first temporary polygon datasets.
waterbodies_shapefile_temp = base_filename_fp / "temp"
# The filepath for the polygons merged at tile boundaries.
waterbodies_shapefile_merged = base_filename_fp / "merged"
# The file name of the outputs following the filtering steps.
waterbodies_shapefile_filtered = f"filtered_{handle_large_polygons.replace('-','_')}"
# The file name of the final, completed waterbodies dataset.
waterbodies_shapefile_final = f"final_{handle_large_polygons.replace('-','_')}"

# File extension to use for outputs, either .shp (ESRI Shapefile) or .geojson (GeoJSON).
file_extension = ".shp"

### Define parameters to use when loading data from the datacube

In [16]:
dask_chunks = {'x': 3500, 'y': 3500, 'time': 1}
# Resolution of the WOfS datasets.
resolution = (-30, 30)
# CRS to work with for all files.
crs = "EPSG:6933"

# Create a datacube query.
query = dict(dask_chunks=dask_chunks, resolution=resolution, output_crs=crs)

### Set the analysis region
If you would like to perform the analysis for all of Africa, using the published WOfS All-time Summary, set `all_of_africa = True`. If you set the flag `all_of_africa` to `False`, you will need to provide either a latitude and longitude range covering the area of interest, a path to the shapefile / GeoJSON defining the area of interest, or a bounding box.

In [17]:
all_of_africa = True

if not all_of_africa:
    #"""
    # Load a shapefile or GeoJSON file for the area of interest:
    basin = define_area(
        vector_path=
        "/home/jovyan/Data/Waterbodies/InputDatasets/SenegalBasin.geojson")
    geopolygon = geometry.Geometry(basin.features[0]["geometry"],
                                   crs="epsg:4326")
    #"""
    """
    # Use the section below if you would like to define the area of interest using a bounding box.
    # Bounding box for a section of Bukama in the Democratic Republic of Congo
    # that covers the lakes Kabwe, Kabele and Mulenda.
    bbox = (25.752395, -9.267306, 26.189124, -8.610346)
    left, bottom, right, top = bbox
    geopolygon = geometry.box(left, bottom, right, top, crs="EPSG:4326")
    """
else:
    geopolygon = None
# Generate the tiles to be used in this workflow.
tiles = get_product_tiles(product="wofs_ls_summary_alltime", aoi_gdf=geopolygon)

# Uncomment the section below to output the tiles.
#tiles_output_fp = base_filename_fp/f"tiles{file_extension}"
#tiles.to_file(tiles_output_fp)

[2023-10-03 02:10:23,403] {make_polygons.py:81} INFO - Getting all wofs_ls_summary_alltime regions...
[2023-10-03 02:10:23,404] {make_polygons.py:93} INFO - 4456 wofs_ls_summary_alltime tiles found.


<a id='wetnessthreshold'></a>
### How frequently wet does a pixel need to be to be included?
The value/s set here will be the minimum frequency (as a decimal between 0 and 1) that you want water to be detected across all analysis years before it is included. 

E.g. If this was set to 0.10, any pixels that are wet *at least* 10% of the time across all valid observations will be included. If you don't want to use this filter, set this value to 0.

Following the exploration of an appropriate wetness threshold for DE Africa Waterbodies [see here]( Add-link-to-notebook-showing-threshold-sensisitivity-analysis), we choose to set two thresholds here. The code is set up to loop through both wetness thresholds, and to write out two temporary shapefiles. These two shapefiles with two separate thresholds are then used together to combine polygons from both thresholds later on in the workflow.

Polygons identified by the secondary threshold that intersect with the polygons generated by the primary threshold will be extracted, and included in the final polygon dataset. This means that the **location** of polygons is set by the primary threshold, but the **shape** of these polygons is set by the secondary threshold.

Threshold values need to be provided as a list of either one or two floating point numbers. If one number is provided, then this will be used to generate the initial polygon dataset. If two thresholds are entered, the **first number becomes the secondary threshold, and the second number becomes the primary threshold**. If more than two numbers are entered, the code will generate an error below. 
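
The hybrid threshold can be illustrated on a toy raster. The sketch below is a pixel-based analogue only: the workflow itself combines vector polygons inside `filter_waterbodies`, whose internals are not shown in this notebook. The helper names `wet_regions` and `hybrid_regions` and the frequency grid are hypothetical. A region found at the secondary threshold is kept only if it contains at least one pixel that also passes the primary threshold.

```python
from collections import deque


def wet_regions(freq, threshold):
    """Return 4-connected regions (sets of (row, col)) with frequency >= threshold."""
    rows, cols = len(freq), len(freq[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or freq[r][c] < threshold:
                continue
            region, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and freq[ny][nx] >= threshold:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def hybrid_regions(freq, primary, secondary):
    """Keep secondary-threshold regions that contain a primary-threshold pixel."""
    primary_pixels = set().union(*wet_regions(freq, primary))
    return [reg for reg in wet_regions(freq, secondary) if reg & primary_pixels]


# Hypothetical wet-frequency grid (fraction of clear observations that were wet).
freq = [
    [0.00, 0.06, 0.07, 0.00, 0.00, 0.00],
    [0.06, 0.20, 0.30, 0.00, 0.06, 0.07],
    [0.00, 0.06, 0.00, 0.00, 0.00, 0.06],
]

result = hybrid_regions(freq, primary=0.1, secondary=0.05)
print(len(result), len(result[0]))  # prints "1 6"
```

The left-hand cluster has only two pixels above 0.1, but the returned region has six pixels because its shape comes from the 0.05 threshold. The right-hand cluster never reaches 0.1, so it is dropped.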

In [18]:
# Ensure the smaller threshold is first in the list if using 2 thresholds.
secondary_threshold = 0.05
primary_threshold = 0.1
minimum_wet_thresholds = [secondary_threshold, primary_threshold]

check_wetness_thresholds(minimum_wet_thresholds)

'We will be running a hybrid wetness threshold. \n**You have set 0.1 as the primary threshold, which will define the location of the waterbody polygons \n with 0.05 set as the supplementary threshold, which will define the extent/shape of the waterbody polygons.**'

<a id='size'></a>

### How big/small should the polygons be?
This filtering step can remove very small and/or very large waterbody polygons. The size listed here is in m<sup>2</sup>. A single pixel in Landsat data is 30 m x 30 m = 900 m<sup>2</sup>. 

**MinSize**

E.g. A minimum size of 9000 m<sup>2</sup> means that polygons need to be at least 10 pixels to be included. If you don't want to use this filter, set this value to 0.

**MaxSize**

E.g. A maximum size of 1 000 000 m<sup>2</sup> means that you only want to consider polygons less than 1 km<sup>2</sup>. If you don't want to use this filter, set this number to `math.inf`. 

*NOTE: if you are doing this analysis for all of Africa, very large polygons will be generated offshore in the steps before filtering by the specified land/sea mask. Setting `max_polygon_size` to the area of Lake Victoria (the largest lake in Africa) removes these huge ocean polygons while keeping the large inland waterbodies that we want to map. The cell below sets `max_polygon_size = math.inf` (no upper limit) and keeps the Lake Victoria value in a comment.*
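
The pixel arithmetic above can be checked with a short sketch. The helper names `size_in_pixels` and `within_size_limits` are hypothetical, and treating both limits as inclusive is an assumption; the comparison used inside `filter_waterbodies` is not shown in this notebook.

```python
import math

PIXEL_AREA_M2 = 30 * 30  # one Landsat pixel: 30 m x 30 m = 900 m^2


def size_in_pixels(area_m2):
    """Convert an area in square metres to a number of Landsat pixels."""
    return area_m2 / PIXEL_AREA_M2


def within_size_limits(area_m2, min_size=0, max_size=math.inf):
    """Assumed inclusive size check; the defaults disable both filters."""
    return min_size <= area_m2 <= max_size


# Hypothetical polygon areas in square metres.
areas = [900, 4500, 9000, 2_000_000]
kept = [a for a in areas if within_size_limits(a, min_size=9000, max_size=1_000_000)]

print(size_in_pixels(9000))  # prints 10.0
print(kept)                  # prints [9000]
```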

In [19]:
min_polygon_size = 4500  # 5 pixels
max_polygon_size = math.inf  # Alternative: 59947000000, the approximate area of Lake Victoria (59947 sq. km)

### Filter results based on number of valid observations

The total number of valid WOfS observations for each pixel varies depending on the frequency of clouds and cloud shadow, the proximity to high slope and terrain shadow, and the seasonal change in solar angle. 

The `count_clear` parameter within the [`wofs_ls_summary_alltime`](https://explorer.digitalearth.africa/products/wofs_ls_summary_alltime) data provides a count of the number of valid observations each pixel recorded over the analysis period. We can use this parameter to mask out pixels that were infrequently observed. 
If this mask is not applied, pixels that were observed only once could be included if that observation was wet (i.e. a single wet observation means the calculation of the frequency statistic would be (1 wet observation) / (1 total observation) = 100% frequency of wet observations).

Here we set the minimum number of observations to be 128 (roughly 4 per year over our 32 year analysis). Note that this parameter does not specify the timing of these observations, but rather just the **total number of valid observations** (observed at any time of the year, in any year).
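
The effect of this mask can be seen with a few hypothetical pixels. In the sketch below, `count_wet` is an assumed name for the number of wet observations, and `wet_frequency` and `is_candidate` are illustrative helpers rather than functions from the `deafrica_waterbodies` package.

```python
def wet_frequency(count_wet, count_clear):
    """Fraction of clear observations that were wet."""
    return count_wet / count_clear if count_clear else 0.0


def is_candidate(count_wet, count_clear, min_valid_observations=128, min_wet_threshold=0.1):
    """A pixel is kept only if it was observed often enough AND is wet often enough."""
    if count_clear < min_valid_observations:
        return False
    return wet_frequency(count_wet, count_clear) >= min_wet_threshold


# 4 valid observations per year over 32 years gives the minimum used here.
print(4 * 32)                 # prints 128

# One wet observation out of one clear observation: 100% wet, but masked out.
print(wet_frequency(1, 1))    # prints 1.0
print(is_candidate(1, 1))     # prints False

# Observed 128 times, wet 20 times: 15.6% wet, kept.
print(is_candidate(20, 128))  # prints True
```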

In [20]:
this_year = datetime.datetime.now().year
start_year_wofs_dataset = 1984
min_valid_observations_yearly = 4

no_of_years = this_year - start_year_wofs_dataset

#min_valid_observations = min_valid_observations_yearly * no_of_years

min_valid_observations = 128

print(min_valid_observations)

128


<a id='coastline'></a>
### Read in a land/sea mask

You can choose which land/sea mask you would like to use to mask out ocean polygons, depending on how much coastal water you would like in the final product. 

We use the [Marine Regions Global Oceans and Seas v01 dataset](https://www.marineregions.org/sources.php#goas). Any polygons that intersect with this mask are filtered out, i.e. if a polygon identified within our workflow overlaps with this coastal mask by even a single pixel, it will be discarded. 

In [21]:
filter_out_ocean_polygons = True

if filter_out_ocean_polygons:
    land_sea_mask_fp = "/home/jovyan/Data/Waterbodies/InputDatasets/goas_v01.gpkg"
    #land_sea_mask = gpd.read_file(land_sea_mask_fp).to_crs(crs)
else: 
    land_sea_mask_fp = None

<a id='rivers'></a>
### Do you want to filter out polygons that intersect with major rivers?

This filtering step is done to remove river segments from the polygon dataset. 
Set the filepath to the dataset you wish to use in the `major_rivers_mask_fp` variable. The dataset needs to be a vector dataset, and [able to be read in by the fiona python library](https://fiona.readthedocs.io/en/latest/fiona.html#fiona.open).

Note that we reproject this dataset to the CRS specified in the variable `crs` to match the coordinate reference system of the WOfS data we use. A list of EPSG codes [can be found here](https://spatialreference.org/ref/epsg/).

If you don't want to filter out polygons that intersect with rivers, set `filter_out_major_rivers_polygons` to `False`.

**Note that for the DE Africa Water Body Polygon dataset, we set this filter to False (`filter_out_major_rivers_polygons = False`)**

In [22]:
filter_out_major_rivers_polygons = False

if filter_out_major_rivers_polygons:
    # Insert path to the dataset location.
    major_rivers_mask_fp = ""
    #major_rivers = gpd.GeoDataFrame.from_file(major_rivers_fp)
    #major_rivers = major_rivers.to_crs(crs)
else:
    major_rivers_mask_fp = None

<a id='Urban'></a>

### Read in a mask for high-rise CBDs

WOfS has a known limitation, where deep shadows thrown by tall CBD buildings are misclassified as water. This results in 'waterbodies' around these misclassified shadows in capital cities. 

To address this problem, we use the High Resolution Population Density Maps dataset to define a spatial footprint for Africa's CBD areas. The theory of using this dataset is that high-rises have a high population density (population count per area). Therefore pixels in the HRPDM dataset with a higher general population density than the specified threshold are vectorized and used as our CBD filter.

If you are not using WOfS for your analysis, you may choose to set `filter_out_urban_polygons = False`.

In [23]:
filter_out_urban_polygons = False

if filter_out_urban_polygons:
    urban_mask_fp = ""
else:
    urban_mask_fp = None

## Generate the first temporary polygon dataset


In [None]:
_log.info("Generating the first temporary set of waterbody polygons.")
temp_primary, temp_secondary = get_polygons_using_thresholds(
    input_gdf=tiles,
    dask_chunks=dask_chunks,
    resolution=resolution,
    output_crs=crs,
    min_valid_observations=min_valid_observations,
    primary_threshold=primary_threshold,
    secondary_threshold=secondary_threshold,
    temp_dir=waterbodies_shapefile_temp,
)

### Check for completed tiles and rerun if necessary

In [None]:
# ## Check for which tiles have already been completed

# import re

# # Regular expression pattern
# pattern = r'x\d{3}y\d{3}'

# temp_folder = waterbodies_shapefile_temp / "0-0-1" / "shapefile"

# temp_primary = temp_folder.glob(f'temp_{primary_threshold}*.geojson')
# temp_secondary = temp_folder.glob(f'temp_{secondary_threshold}*.geojson')

# temp_primary_tiles = [re.search(pattern, temp_primary_file.name).group(0) for temp_primary_file in temp_primary]
# temp_secondary_tiles = [re.search(pattern, temp_secondary_file.name).group(0) for temp_secondary_file in temp_secondary]

# print(len(temp_primary_tiles))
# print(len(temp_secondary_tiles))

# completed_tiles = [tile_number for tile_number in temp_secondary_tiles if tile_number in temp_primary_tiles]

# tile_completion_status = tiles["label"].isin(completed_tiles)

# remaining_tiles = tiles[~tile_completion_status]

# ## Can then rerun remaining tiles 

In [None]:
## Load some tiles and merge them if not already in memory

# test_tiles = [f"x{xtile}y0{ytile}" for xtile in range(200, 210) for ytile in range(70, 80)]

# test_primary_files = [waterbodies_shapefile_temp / "0-0-1" / "shapefile" / f"temp_{primary_threshold}_{tile}.geojson" for tile in test_tiles]
# test_secondary_files = [waterbodies_shapefile_temp / "0-0-1" / "shapefile" / f"temp_{secondary_threshold}_{tile}.geojson" for tile in test_tiles]

# primary_data_list = []
# with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE=2000):
#     for dataset in list(test_primary_files):
#         df = gpd.read_file(dataset)
#         primary_data_list.append(df)
    
# temp_primary = pd.concat(primary_data_list)

# secondary_data_list = []
# with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE=2000):
#     for dataset in list(test_secondary_files):
#         df = gpd.read_file(dataset)
#         secondary_data_list.append(df)
        
# temp_secondary = pd.concat(secondary_data_list)

## Merge polygons that have an edge at a tile boundary

Now that we have all of the polygons across our whole region of interest, we need to check for artifacts in the data caused by tile boundaries.

We have created a GeoDataFrame `buffered_30m_tiles`, that consists of the tile boundaries, plus a 1 pixel (30 m) buffer. This GeoDataFrame will help us to find any polygons that have a boundary at the edge of a tile. We can then find where polygons touch across this boundary, and join them up.
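
The idea of joining polygons that touch across a tile boundary can be sketched with a toy, pixel-based example. This is not how `merge_polygons_at_tile_boundary` is implemented (it works on vector polygons and its internals are not shown here). Each 'polygon' below is a hypothetical set of `(row, col)` pixels, and `merge_touching` is an illustrative helper that groups polygons sharing a pixel edge.

```python
def merge_touching(polygons):
    """Merge pixel-set 'polygons' that are 4-neighbours of one another."""
    parent = list(range(len(polygons)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def touches(a, b):
        # True if any pixel in a has an edge-adjacent pixel in b.
        return any(
            (r + dr, c + dc) in b
            for r, c in a
            for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
        )

    for i in range(len(polygons)):
        for j in range(i + 1, len(polygons)):
            if touches(polygons[i], polygons[j]):
                parent[find(j)] = find(i)

    merged = {}
    for i, cells in enumerate(polygons):
        merged.setdefault(find(i), set()).update(cells)
    return list(merged.values())


# Hypothetical tile boundary between columns 2 and 3.
left_tile_polygon = {(0, 1), (0, 2), (1, 2)}   # ends at the boundary
right_tile_polygon = {(0, 3), (0, 4)}          # starts at the boundary
isolated_polygon = {(4, 5)}                    # nowhere near the others

merged = merge_touching([left_tile_polygon, right_tile_polygon, isolated_polygon])
print(sorted(len(p) for p in merged))  # prints [1, 5]
```

The two polygons that were split only because of the tile edge become one five-pixel polygon, while the isolated polygon is left unchanged.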

In [None]:
_log.info("Merging polygons at tile boundaries...")
merged_temp_primary = merge_polygons_at_tile_boundary(input_polygons=temp_primary, tiles=tiles.to_crs(crs))

merged_temp_primary.to_file("/home/jovyan/Data/Waterbodies/OutputDatasets/ContinentalWaterbodies/temp/0-0-1/shapefile/temp_merged_primary.gpkg")

In [None]:
_log.info("Merging polygons at tile boundaries...")
merged_temp_secondary = merge_polygons_at_tile_boundary(input_polygons=temp_secondary, tiles=tiles.to_crs(crs))

merged_temp_secondary.to_file("/home/jovyan/Data/Waterbodies/OutputDatasets/ContinentalWaterbodies/temp/0-0-1/shapefile/temp_merged_secondary.gpkg")

<a id='Filtering'></a>

## Filter the merged polygons by:
- **Area:**
Based on the `min_polygon_size` and `max_polygon_size` parameters set [here](#size).
- **Coastline:**
Using the `land_sea_mask` dataset loaded [here](#coastline).
- **CBD location (optional):**
Using the `urban_mask` dataset loaded [here](#Urban).
- **Wetness thresholds:**
Here we apply the hybrid threshold described [here](#wetnessthreshold)
- **Intersection with rivers (optional):**
Using the `major_rivers` dataset loaded [here](#rivers)


### Dividing up very large polygons

The size of polygons is determined by the contiguity of waterbody pixels through the landscape. This can result in very large polygons, e.g. where rivers are wide and unobscured by trees, or where waterbodies are connected to rivers or neighbouring waterbodies. 

We can break overly large polygons into smaller, more useful polygons using the [Polsby-Popper test (1991)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2936284). The Polsby-Popper test is an assessment of the 'compactness' of a polygon. This method was originally developed to test the shape of congressional and state legislative districts, to prevent gerrymandering.

The Polsby-Popper test examines the ratio between the area of a polygon and the area of a circle whose circumference equals the perimeter of that polygon. The result falls between 0 and 1, with values closer to 1 being assessed as more compact.

\begin{align*}
\mathrm{PPtest} = \frac{4\pi \times \mathrm{polygon\ area}}{\mathrm{polygon\ perimeter}^2}
\end{align*}
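
As a worked example of the formula, the sketch below computes the score for three hypothetical shapes. The function name `polsby_popper` is illustrative; how `filter_waterbodies` applies `pp_test_threshold` is not shown in this notebook, so the threshold mentioned in the comment is for comparison only.

```python
import math


def polsby_popper(area, perimeter):
    """Polsby-Popper compactness score: 4 * pi * area / perimeter^2."""
    return 4 * math.pi * area / perimeter ** 2


# A circle of radius 100 m is the most compact shape possible.
circle = polsby_popper(math.pi * 100 ** 2, 2 * math.pi * 100)

# A 100 m x 100 m square scores pi / 4.
square = polsby_popper(100 * 100, 4 * 100)

# A long, thin channel: one pixel (30 m) wide and 30 km long.
channel = polsby_popper(30 * 30_000, 2 * (30 + 30_000))

print(round(circle, 4))   # prints 1.0
print(round(square, 4))   # prints 0.7854
print(round(channel, 4))  # prints 0.0031, below a threshold of 0.005
```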

In [None]:
pp_test_threshold = 0.005

_log.info("Filtering waterbodies...")
filtered_polygons = filter_waterbodies(
    primary_threshold_polygons=merged_temp_primary,
    secondary_threshold_polygons=merged_temp_secondary,
    min_polygon_size=min_polygon_size,
    max_polygon_size=max_polygon_size,
    filter_out_ocean_polygons=filter_out_ocean_polygons,
    land_sea_mask_fp=land_sea_mask_fp,
    filter_out_major_rivers_polygons=filter_out_major_rivers_polygons,
    major_rivers_mask_fp=major_rivers_mask_fp,
    filter_out_urban_polygons=filter_out_urban_polygons,
    urban_mask_fp=urban_mask_fp,
    handle_large_polygons=handle_large_polygons,
    pp_test_threshold=pp_test_threshold,
)

In [None]:
filtered_polygons

### Generate a unique ID for each polygon

A unique identifier is required for every polygon to allow it to be referenced. The naming convention for generating unique IDs here is the [geohash](http://geohash.org).

A geohash is a geocoding system used to generate short unique identifiers based on latitude/longitude coordinates. It is a short combination of letters and numbers, with the length of the string a function of the precision of the location. The methods for generating a geohash are outlined [here - yes, the official documentation is a Wikipedia article](https://en.wikipedia.org/wiki/Geohash).

Here we use the python package `python-geohash` to generate a geohash unique identifier for each polygon. We use `precision = 9` geohash characters, which represents an on the ground accuracy of <20 metres. This ensures that the precision is high enough to differentiate between waterbodies located next to each other.
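
To show how a geohash is built, here is a minimal pure-Python sketch of the standard encoding: alternately halve the longitude and latitude ranges, record one bit per halving, and map each group of five bits to a base-32 character. It is independent of the `python-geohash` package and of `assign_unique_ids`, whose choice of reference point for each polygon is not shown in this notebook. The function name `geohash_encode` and the sample coordinates are illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# Example from the Wikipedia article linked above.
print(geohash_encode(42.6, -5.6, precision=5))  # prints ezs42

# A hypothetical point with the 9-character precision used in this workflow.
example_id = geohash_encode(-1.0, 33.0, precision=9)
print(len(example_id))  # prints 9
```

A longer geohash always starts with the shorter geohash of the same point, so increasing the precision narrows the location without changing the leading characters.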

In [None]:
filtered_polygons_with_unique_ids = assign_unique_ids(filtered_polygons)

In [None]:
product_version = "0.0.1"
output_bucket_name = "deafrica-waterbodies-dev"
output_file_name = f"filtered_{handle_large_polygons.replace('-','_')}"
output_file_type = "GeoJSON"

write_waterbodies_to_file(
    filtered_polygons_with_unique_ids,
    product_version = product_version,
    storage_location="local", 
    output_bucket_name=output_bucket_name,
    output_local_folder=base_filename_fp,
    output_file_name=output_file_name,
    output_file_type=output_file_type,
)


### Final checks and recalculation of attributes

In [None]:
filtered_polygons = gpd.read_file("/home/jovyan/Data/Waterbodies/OutputDatasets/ContinentalWaterbodies/0-0-1/shapefile/filtered_erode_dilate_v2.geojson")

In [None]:
# product_version = "0.0.1"
# output_bucket_name = "deafrica-waterbodies-dev"

waterbodies_gdf = add_area_and_perimeter_attributes(filtered_polygons)
waterbodies_gdf = add_timeseries_attribute(waterbodies_gdf,
                                           product_version,
                                           output_bucket_name)

In [None]:
# Write out final results to file.
final_output_fp = f"final_{handle_large_polygons.replace('-','_')}"

# Extra step to ensure final output is in EPSG:4326
filtered_polygons_with_unique_ids_sorted = waterbodies_gdf.to_crs("EPSG:4326")

write_waterbodies_to_file(
    filtered_polygons_with_unique_ids_sorted,
    product_version = product_version,
    storage_location="local", 
    output_bucket_name=output_bucket_name,
    output_local_folder=base_filename_fp,
    output_file_name=final_output_fp,
    output_file_type=output_file_type,
)