# Point-based and Parallel Processing of the Water Observations from Space (WOfS) Product in Africa  <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

* **Products used:**
[ga_ls8c_wofs_2](https://explorer.digitalearth.africa/ga_ls8c_wofs_2)

## Description 
[Water Observations from Space (WOfS)](https://www.ga.gov.au/scientific-topics/community-safety/flood/wofs/about-wofs) is a product derived from Landsat 8 satellite observations, generated from the provisional Landsat 8 Collection 2 surface reflectance data, and shows surface water detected in Africa.
Individual water classified images are called Water Observation Feature Layers (WOFLs), and are created in a 1-to-1 relationship with the input satellite data. 
Hence there is one WOFL for each satellite dataset processed for the occurrence of water.

The data in a WOFL is stored as a bit field. This is a binary number in which each digit is independently set or not, based on the presence (1) or absence (0) of a particular attribute (water, cloud, cloud shadow, etc.). In this way, the single decimal value associated with each pixel can provide information on a variety of features of that pixel. 
For more information on the structure of WOFLs and how to interact with them, see the [Water Observations from Space](../Datasets/Water_Observations_from_Space.ipynb) and [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebooks.
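
As a minimal sketch of how such a bit field can be read, the example below decodes single pixel values with plain Python bitwise operations. The flag names mirror those used later in this notebook, but the bit positions assigned here are assumptions made for illustration only; the linked notebooks document the actual WOfS flag definitions.

```python
# Illustrative flag-to-bit mapping. The bit positions are assumptions
# made for this sketch, not a definition of the WOfS product.
FLAGS = {
    "nodata": 1 << 0,          # decimal 1
    "cloud_shadow": 1 << 5,    # decimal 32
    "cloud": 1 << 6,           # decimal 64
    "water_observed": 1 << 7,  # decimal 128
}


def decode(value):
    """Return a dict mapping each flag name to True/False for one pixel value."""
    return {name: bool(value & bit) for name, bit in FLAGS.items()}


def is_clear(value):
    """A pixel is 'clear' when it is not nodata, cloud or cloud shadow."""
    flags = decode(value)
    return not (flags["nodata"] or flags["cloud"] or flags["cloud_shadow"])


# Show the binary form and decoded flags of a few example pixel values
for pixel in (0, 128, 192):
    print(pixel, format(pixel, "08b"), decode(pixel), "clear:", is_clear(pixel))
```

Under these assumed bit positions, a value of 192 sets both the cloud and water bits, so the pixel would not count as a clear water observation even though water is flagged.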

This notebook explains how to query the WOfS product at each collected validation point in Africa using a point-based sampling approach. 

The notebook demonstrates how to:

1. Load the validation points for each partner institution following the cleaning stage described in 
2. Query WOFL data at the validation points and capture the available WOfS-defined class using point-based sampling and multiprocessing
3. Extract a lookup table (LUT) for each point that contains the validation point information, the WOfS class, and the number of clear observations in each month 
***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages
Import Python packages that are used for the analysis.

In [1]:
%matplotlib inline

import datacube
from datacube.utils import masking, geometry 
import sys
import os
import rasterio
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import matplotlib.pyplot as plt
import multiprocessing as mp
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

sys.path.append("../Scripts")
from geopandas import GeoSeries, GeoDataFrame
from shapely.geometry import Point
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn.metrics import plot_confusion_matrix, f1_score  
from deafrica_plotting import map_shapefile,display_map, rgb
from deafrica_spatialtools import xr_rasterize
from deafrica_datahandling import wofs_fuser, mostcommon_crs,load_ard,deepcopy
from deafrica_dask import create_local_dask_cluster
from tqdm import tqdm

### Analysis parameters

To analyse the validation points collected by each partner institution, we need to obtain WOfS surface water observation data that corresponds with the labelled input data locations. 
- `path2csv`: the path to the CSV file of CEO validation points labelled by each partner institution in Africa 
- `ValPoints`: the CEO validation points labelled by each partner institution in Africa, saved in ESRI shapefile format 
- `path`: the direct path to the ESRI shapefile, if the shapefile is available
- `input_data`: the geopandas dataframe of CEO validation points labelled by each partner institution in Africa

*** Note: Run the following three cells if you do not have an ESRI shapefile for the validation points. 

In [2]:
path2csv = '../Data/Processed/AGRYHMET/AGRYHMET_ValidationPoints.csv'
df = pd.read_csv(path2csv,delimiter=",")

In [3]:
geometries = [Point(xy) for xy in zip(df.LON, df.LAT)]
crs = 'EPSG:4326' #the {'init': ...} dictionary form is deprecated in recent pyproj/geopandas versions
ValPoints = GeoDataFrame(df, crs=crs, geometry=geometries)

In [4]:
ValPoints.to_file(filename='../Data/Processed/AGRYHMET/AGRYHMET_ValidationPoints.shp') 

*** Note: If you already have an ESRI shapefile for the validation points, please continue from this point onward. 

In [5]:
path = '../Data/Processed/AGRYHMET/AGRYHMET_ValidationPoints.shp'

In [6]:
#reading the table and converting CRS to metric
input_data = gpd.read_file(path).to_crs('epsg:6933')  
input_data.columns

Index(['Unnamed_ 0', 'PLOT_ID', 'LON', 'LAT', 'FLAGGED', 'ANALYSES',
       'SENTINEL2Y', 'STARTDATE', 'ENDDATE', 'WATER', 'NO_WATER', 'BAD_IMAGE',
       'NOT_SURE', 'CLASS', 'COMMENT', 'MONTH', 'WATERFLAG', 'geometry'],
      dtype='object')

In [7]:
input_data= input_data.drop(['Unnamed_ 0'], axis=1)

In [8]:
#Checking the size of the input data 
input_data.shape

(8724, 17)

### Sample WOfS at the ground truth coordinates
To load WOFL data, we first create a reusable query that defines two items. The first, `group_by`, is set to `solar_day` so that data from scenes acquired on the same day is combined correctly. The second, `resampling`, is set to `nearest`. This query is later updated inside the sampling function with the time period we are interested in, as well as the other parameters that are needed to load the data correctly. 

We can convert the WOFL bit field into a binary array containing True and False values. This allows us to use the WOFL data as a mask that can be applied to other datasets. The `make_mask` function allows us to create a mask using the flag labels (e.g. "wet" or "dry") rather than the underlying binary numbers. For more details on how to apply masking to WOfS, see the [Applying WOfS bitmasking](../Frequently_used_code/Applying_WOfS_bitmasking.ipynb) notebook in the Africa sandbox.

In [9]:
#generate query object 
query ={'group_by':'solar_day',
        'resampling':'nearest'}

Define a function that queries the WOfS database over a window running from five days before to five days after the first day of each calendar month. 

In [10]:
def get_wofs_for_point(index, row, input_data, query, results_wet, results_clear):
    dc = datacube.Datacube(app='WOfS_accuracy')
    #get the month and plot ID for this validation point
    month = input_data.loc[index]['MONTH'] 
    plot_id = input_data.loc[index]['PLOT_ID']
    #build the time window: five days before to five days after the first day of the month
    timeYM = '2018-'+f'{month:02d}'
    start_date = np.datetime64(timeYM) - np.timedelta64(5,'D')
    end_date = np.datetime64(timeYM) + np.timedelta64(5,'D')
    time = (str(start_date),str(end_date))
    
    #copy the shared query so the original is left unchanged, then add the time window
    dc_query = deepcopy(query) 
    dc_query.update({"time":time})
    #load the Landsat 8 WOfS product at the point location (x and y); the commented lines give a window-based alternative
    #the spatial extent is given through x and y only, because datacube does not accept a geopolygon together with x/y
    wofls = dc.load(product ="ga_ls8c_wofs_2",
                    y = (input_data.geometry.y[index], input_data.geometry.y[index]),
                    x =(input_data.geometry.x[index], input_data.geometry.x[index]),
                    #y = (input_data.geometry.y[index] - 30.5, input_data.geometry.y[index] + 30.5), # setting x and y coordinates based on 3*3 pixel window-based query 
                    #x =(input_data.geometry.x[index] - 30.5, input_data.geometry.x[index] + 30.5),
                    crs = 'EPSG:6933',
                    output_crs = 'EPSG:6933',
                    resolution=(-30,30),
                    **dc_query)
    #skip points for which the query returned no WOfS data (the function then returns None)
    if 'water' not in wofls:
        return None
    #Define a mask for wet and clear pixels 
    wet_nocloud = {"water_observed":True, "cloud_shadow":False, "cloud":False,"nodata":False}
    #Define a mask for dry and clear pixels 
    dry_nocloud = {"water_observed":False, "cloud_shadow":False, "cloud":False, "nodata":False}
    wofl_wetnocloud = masking.make_mask(wofls, **wet_nocloud).astype(int) 
    wofl_drynocloud = masking.make_mask(wofls, **dry_nocloud).astype(int)
    #a time step is clear when every pixel in it is either clear-wet or clear-dry
    clear = (wofl_wetnocloud | wofl_drynocloud).water.all(dim=['x','y']).values
    #record the total number of clear observations for each point in each month and use it to filter out months with no valid data
    n_clear = clear.sum()  
    #identify whether WOfS saw water in this month at this location, using clear observations only 
    if n_clear > 0:
        wet = wofl_wetnocloud.isel(time=clear).water.max().values  
    else:
        wet = 0 
    #store the wet flag and the clear observation count under the key "<plot id>_<month>" 
    key = str(int(plot_id))+"_"+str(month)
    results_wet.update({key : int(wet)})
    results_clear.update({key : int(n_clear)})
    
    return time
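
The decision made at the end of this function can be summarised without any datacube dependency. The sketch below assumes a hypothetical list of per-date observations for one point, each already reduced to a pair of booleans (water flagged, observation clear), and returns the same two quantities that the function stores: whether water was seen in any clear observation, and the number of clear observations. The function name and input layout are illustrative, not part of the notebook.

```python
def summarise_point(observations):
    """Summarise one point for one month.

    observations: list of (is_wet, is_clear) tuples, one per acquisition
    date. This input layout is a hypothetical stand-in for the masks
    produced with make_mask.
    Returns (wet, n_clear) where wet is 1 if any clear observation is wet.
    """
    # Keep the water flag of clear observations only
    clear_wet_flags = [is_wet for is_wet, is_clear in observations if is_clear]
    n_clear = len(clear_wet_flags)
    # With no clear observation the point is reported as not wet
    wet = int(any(clear_wet_flags)) if n_clear > 0 else 0
    return wet, n_clear


# A cloudy wet observation is ignored; only the clear dry one is counted
print(summarise_point([(True, False), (False, True)]))
# Two clear observations, one of them wet
print(summarise_point([(True, True), (False, True)]))
```

A point with `n_clear` equal to zero has no valid observation in the window, which is why the clear count is kept alongside the wet flag.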


Define a function for parallel processing 

In [11]:
def _parallel_fun(input_data, query, ncpus):
    
    manager = mp.Manager()
    results_wet = manager.dict()
    results_clear = manager.dict()
   
    # progress bar
    pbar = tqdm(total=len(input_data))
        
    def update(*a):
        pbar.update()

    with mp.Pool(ncpus) as pool:
        for index, row in input_data.iterrows():
            pool.apply_async(get_wofs_for_point,
                                 [index,
                                 row,
                                 input_data,
                                 query,
                                 results_wet,
                                 results_clear], callback=update)
        pool.close()
        pool.join()
        pbar.close()
        
    return results_wet, results_clear
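
One property of `apply_async` worth keeping in mind is that `callback` runs only when a task succeeds. An exception raised inside a worker does not advance the progress bar and is not reported unless an `error_callback` is supplied or the result's `.get()` method is called. The sketch below illustrates this with a hypothetical `work` function and `multiprocessing.pool.ThreadPool`, which exposes the same `apply_async` interface as `mp.Pool` but uses threads so that the example stays self-contained. It is an illustration of the pattern, not a replacement for the function above.

```python
from multiprocessing.pool import ThreadPool


def work(i, results):
    """Hypothetical task: store i squared, but fail for one item."""
    if i == 3:
        raise ValueError(f"no data for item {i}")
    results[i] = i * i


def run_all(n_items, n_workers=4):
    results = {}   # filled in by the tasks
    errors = []    # exceptions collected through error_callback
    done = []      # one entry per successfully completed task

    with ThreadPool(n_workers) as pool:
        for i in range(n_items):
            pool.apply_async(work, (i, results),
                             callback=lambda _: done.append(1),
                             error_callback=errors.append)
        pool.close()
        pool.join()
    return results, errors, len(done)


results, errors, n_done = run_all(6)
print(sorted(results.items()))
print("completed:", n_done, "failed:", len(errors))
```

Collecting the exceptions in a list makes it possible to check after the run whether any points were skipped because of an error rather than because no data was available.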

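Test the function in a serial for loop on the first 14 validation points. Points for which no WOfS data is returned print `None`. 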

In [12]:
results_wet_test = dict()
results_clear_test = dict()

for index, row in input_data[0:14].iterrows():
    time = get_wofs_for_point(index, row, input_data, query, results_wet_test, results_clear_test)
    print(time)

None
('2018-09-26', '2018-10-06')
('2017-12-27', '2018-01-06')
('2018-01-27', '2018-02-06')
('2018-02-24', '2018-03-06')
None
None
('2018-11-26', '2018-12-06')
('2018-03-27', '2018-04-06')
None
('2018-08-27', '2018-09-06')
('2018-11-26', '2018-12-06')
('2017-12-27', '2018-01-06')
('2018-01-27', '2018-02-06')
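
The time windows printed above can be reproduced with the Python standard library alone. The sketch below uses a hypothetical helper, `month_window`, that assumes the same fixed year (2018) and five-day margin as `get_wofs_for_point`.

```python
from datetime import date, timedelta


def month_window(month, year=2018, days=5):
    """Return (start, end) ISO date strings around the first day of a month."""
    first_day = date(year, month, 1)
    margin = timedelta(days=days)
    return (str(first_day - margin), str(first_day + margin))


print(month_window(10))  # ('2018-09-26', '2018-10-06')
print(month_window(1))   # ('2017-12-27', '2018-01-06')
```

Note that the window for January starts in the previous year, so the query for that month also covers the last days of December 2017.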


Point-based query and parallel processing on WOfS 

In [15]:
wet, clear = _parallel_fun(input_data, query, ncpus=15)


  0%|          | 0/10 [00:00<?, ?it/s]
 20%|██        | 2/10 [00:00<00:01,  7.04it/s]
 30%|███       | 3/10 [00:00<00:01,  6.35it/s]
 50%|█████     | 5/10 [00:00<00:00,  7.19it/s]
 70%|███████   | 7/10 [00:00<00:00,  8.83it/s]
100%|██████████| 10/10 [00:01<00:00,  9.44it/s]


In [16]:
#extracting the final table with both CEO labels and WOfS class Wet and clear observations 
wetdf = pd.DataFrame.from_dict(wet, orient = 'index')
cleardf = pd.DataFrame.from_dict(clear,orient='index')
df2 = wetdf.merge(cleardf, left_index=True, right_index=True)
df2 = df2.rename(columns={'0_x':'CLASS_WET','0_y':'CLEAR_OBS'})
#split the index (which is plot id + "_" + month) into separate columns; '.0' is appended so PLOT_ID matches the string form of the float IDs in input_data
for index, row in df2.iterrows():
    df2.at[index,'PLOT_ID'] = index.split('_')[0] +'.0'
    df2.at[index,'MONTH'] = index.split('_')[1]
#reset the index
df2 = df2.reset_index(drop=True)
#convert plot id and month to str to help with matching
input_data['PLOT_ID'] = input_data.PLOT_ID.astype(str)
input_data['MONTH']= input_data.MONTH.astype(str)
# merge both dataframe at locations where plotid and month match
final_df = pd.merge(input_data, df2, on=['PLOT_ID','MONTH'], how='outer')
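
The join in this cell relies on the `<plot id>_<month>` keys written by `get_wofs_for_point`. As a dependency-free sketch of the same idea, the example below uses small hypothetical dictionaries in place of the real results and attaches the WOfS values to the matching validation records. It keeps every validation record (a left join) and leaves `None` where no WOfS result exists, which corresponds to the `NaN` values in the pandas merge.

```python
def split_key(key):
    """Split a 'plotid_month' key into integer plot ID and month."""
    plot_id, month = key.split("_")
    return int(plot_id), int(month)


def join_results(points, wet, clear):
    """Attach CLASS_WET and CLEAR_OBS to each validation record.

    points: list of dicts with PLOT_ID and MONTH (hypothetical sample data)
    wet, clear: dicts keyed by 'plotid_month'
    """
    lookup = {split_key(key): (wet[key], clear.get(key)) for key in wet}
    joined = []
    for point in points:
        class_wet, clear_obs = lookup.get(
            (point["PLOT_ID"], point["MONTH"]), (None, None))
        joined.append({**point, "CLASS_WET": class_wet, "CLEAR_OBS": clear_obs})
    return joined


# Hypothetical sample: only the first point has a WOfS result
sample_points = [{"PLOT_ID": 101, "MONTH": 1}, {"PLOT_ID": 102, "MONTH": 2}]
sample_wet = {"101_1": 1}
sample_clear = {"101_1": 3}
joined = join_results(sample_points, sample_wet, sample_clear)
for record in joined:
    print(record)
```

Comparing typed keys (integers here) avoids the string formatting workaround needed when one side stores the plot ID as a float.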

In [17]:
#Checking the shape of the final table 
final_df.shape

(8724, 19)

In [18]:
#Counting the number of rows in the final table with NaN values in CLASS_WET and CLEAR_OBS (optional)
#This check tests whether the parallel processing function returns identical results each time it runs 
countA = final_df["CLASS_WET"].isna().sum()
countB = final_df["CLEAR_OBS"].isna().sum()
countA, countB


(8717, 8717)

In [17]:
final_df.to_csv(('../../Results/WOfS_Assessment/Point_Based/Institutions/AGRYHMET_PointBased_5D.csv'))

In [19]:
print(datacube.__version__)

1.8.3


***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** September 2020

**Compatible datacube version:** 1.8.3

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)