# Inspecting Validation Dataset <img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">

## Description

The [Water Observations from Space (WOfS)](https://www.ga.gov.au/scientific-topics/community-safety/flood/wofs/about-wofs) product is derived from Landsat 8 satellite observations, as part of the provisional Landsat 8 Collection 2 surface reflectance, and shows surface water detected in Africa. Before conducting an accuracy assessment of WOfS in Africa, it is useful to inspect the validation points that each partner institution in Africa explored and classified as `water`, `no water`, `not sure` or `bad image`.
The tables extracted from [Collect Earth Online](https://collect.earth/home) (CEO) contain information about each validation point as originally set in the CEO tool. For a monthly query of the WOfS product in Africa and the subsequent accuracy assessment, we need to add twelve rows to each validation point, one per calendar month, and then filter out the months that have duplicated labels, either because there was no clear Sentinel-2 (S2) observation or because the point was in shadow at the time of the Landsat observation.

This notebook explains how to compile the tables exported from the Collect Earth Online tool by each partner institution and make them analysis-ready for WOfS analysis and accuracy assessment.

The notebook demonstrates how to:

1. Load the collected validation points as a list of observations, each with a location and a month
2. Wrangle the data, including cleaning the table and mapping each point to twelve monthly observations

***

## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages
Import Python packages that are used for the analysis.

In [1]:
%matplotlib inline

import datacube
from datacube.utils import masking, geometry 
import sys
import os
import dask 
import rasterio, rasterio.features
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import subprocess as sp
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

from deafrica_tools.plotting import display_map, rgb
from deafrica_tools.spatial import xr_rasterize
from deafrica_tools.datahandling import wofs_fuser, mostcommon_crs,load_ard
from rasterio.mask import mask



### Connect to the datacube

In [2]:
dc = datacube.Datacube()

### Analysis parameters

To analyse the validation points collected by each partner institution, we read the table extracted from CEO into a dataframe and then apply a few data wrangling steps. The main objects are:
- `CEO`: path to the raw table extracted from the CEO tool, which contains information on each validation point, including the class that each analyst assigned to the point
- `ground_truth`: pandas dataframe holding the CEO table with a few columns dropped and renamed; this is the main table for data wrangling
- `result`: final table in which each validation point is expanded into one row per month, based on the string values in the water, no water, bad image and not sure columns. The month is stored in a column called `MONTH` and the label in a column called `WATERFLAG`

### Load the Dataset 

Raw validation points extracted from CEO tool

In [3]:
#Path to the raw validation data points csv file
CEO = '.../Data/CEO/RCMRD/CEO_1_RCMRD_2020-07-30.csv'

In [4]:
#Read in the validation data csv
df = pd.read_csv(CEO, delimiter=",")
ground_truth = df.drop(['SAMPLE_ID','USER_ID','IMAGERY_TITLE','COLLECTION_TIME','ANALYSIS_DURATION','PL_PLOTID'], axis=1)

In [5]:
#Identify the columns in the validation table before renaming them. If the rename in cell 7 does not work as expected, replace the strings there with the exact column names listed here
ground_truth.columns

In [6]:
#Checking the shape of the validation table
ground_truth.shape

In [7]:
ground_truth = ground_truth.rename(columns={'WHAT IS THE FEATURE?':'CLASS','ENTER MONTHS[1-12] IN 2018, WATER WAS OBSERVED?':'WATER',
                                            'ENTER MONTHS[1-12] IN 2018, WATER WAS NOT OBSERVED?':'NO_WATER','ENTER MONTHS[1-12] IN 2018, IMAGE WAS BAD?':'BAD_IMAGE',
                                             'ENTER MONTHS[1-12] IN 2018, THAT YOU ARE UNSURE IF YOU OBSERVE WATER OR NOT? ':'NOT_SURE'})
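
Because `rename` silently ignores keys that do not match any column, a small difference in the question text leaves a column un-renamed without raising an error. The following is a minimal sketch of a check that could be run before renaming. The helper `missing_columns` and the toy table are hypothetical and stand in for the real CEO export.

```python
import pandas as pd

def missing_columns(df, mapping):
    """Return the keys of `mapping` that are not column names of `df` (hypothetical helper)."""
    return [old for old in mapping if old not in df.columns]

# Toy table standing in for the CEO export; names and values are illustrative only
toy = pd.DataFrame({'WHAT IS THE FEATURE?': ['river'], 'LON': [36.8], 'LAT': [-1.3]})
toy_mapping = {'WHAT IS THE FEATURE?': 'CLASS', 'A COLUMN THAT IS ABSENT': 'WATER'}

not_found = missing_columns(toy, toy_mapping)
renamed = toy.rename(columns=toy_mapping)  # the absent key is skipped without an error
print(not_found)
print(list(renamed.columns))
```

An empty list means every key in the mapping matched a column; any name that is returned should be compared against the output of `ground_truth.columns`.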

In [8]:
#Getting a picture of dataframe  
ground_truth

In [9]:
#Converting column type to string if not already
ground_truth['NOT_SURE'] = ground_truth.NOT_SURE.astype(str)

In [10]:
#Removing brackets, ampersands and whitespace from each label class column (the columns must already hold strings)
cols = ['WATER','NO_WATER','BAD_IMAGE','NOT_SURE']
for col in cols:
    #regex=False treats '[' and ']' as literal characters regardless of the pandas version
    ground_truth[col] = ground_truth[col].str.replace('[','', regex=False)
    ground_truth[col] = ground_truth[col].str.replace(']','', regex=False)
    ground_truth[col] = ground_truth[col].str.replace('&','', regex=False)
    ground_truth[col] = [''.join(c.split()) for c in ground_truth[col]]
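
The effect of this cleaning step can be checked on a toy column. The sketch below assumes entries shaped like `[1 - 3, 7]`; the actual strings in a CEO export may differ. It passes `regex=False` so that the bracket characters are treated literally.

```python
import pandas as pd

# Toy values; the exact format of the CEO answers is an assumption
toy = pd.Series(['[1 - 3, 7]', '[5] ', '0'])

cleaned = toy
for ch in ['[', ']', '&']:
    cleaned = cleaned.str.replace(ch, '', regex=False)
# Remove all whitespace, including the spaces inside ranges
cleaned = pd.Series([''.join(c.split()) for c in cleaned])
print(cleaned.tolist())
```

After cleaning, each entry is either a single month, a range joined by `-`, or a comma-separated mix of both, which is the format the splitting function relies on.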

In [11]:
#replacing the name of months with their numerical values
month_numbers = {r'Jan':'1', r'Feb':'2',r'Mar':'3',r'Apr':'4',r'May':'5',r'Jun':'6',r'Jul':'7',r'Aug':'8',r'Sep':'9',r'Oct':'10',r'Nov':'11',r'Dec':'12'}
#the same mapping is applied to each of these columns
replacements = {col: month_numbers for col in ['WATER','NO_WATER','BAD_IMAGE']}

ground_truth.replace(replacements, regex=True, inplace=True)
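
A small example may help to see what the nested dictionary does. This is a sketch on a toy table with made-up values: the outer keys name the columns to search, and with `regex=True` each month name is replaced wherever it occurs inside a string.

```python
import pandas as pd

month_numbers = {'Jan': '1', 'Feb': '2', 'Mar': '3', 'Apr': '4', 'May': '5', 'Jun': '6',
                 'Jul': '7', 'Aug': '8', 'Sep': '9', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

# Toy table; the values are illustrative and not taken from the CEO export
toy = pd.DataFrame({'WATER': ['Jan-Mar,Oct', '5'], 'NO_WATER': ['Apr-Sep', '0']})

# {column: {pattern: replacement}} restricts the replacement to the named columns
converted = toy.replace({col: month_numbers for col in ['WATER', 'NO_WATER']}, regex=True)
print(converted)
```

Entries that are already numeric, such as `5` or `0`, pass through unchanged.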

In [12]:
#making sure that the observation time is set to 2018 in case there is a mistake in the table
ground_truth['SENTINEL2MOSAICYEARMONTH'] = ground_truth['SENTINEL2MOSAICYEARMONTH'].str.replace('2019-2019','2018-2018')

The following function splits the month strings in each class column. The classes `no water`, `water`, `bad image` and `not sure` are assigned the values 0, 1, 2 and 3 respectively, stored in the `WATERFLAG` column.

In [13]:
def split_str(row, newtable):
    """Expand one validation point into one row per labelled month.

    WATERFLAG values: 0 = no water, 1 = water, 2 = bad image, 3 = not sure.
    """
    keep = ['PLOT_ID','LON','LAT','FLAGGED','ANALYSES','WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS','COMMENT']
    flags = {'NO_WATER': '0', 'WATER': '1', 'BAD_IMAGE': '2', 'NOT_SURE': '3'}
    newrows = []
    #check each class column of the row in turn and set the waterflag accordingly
    for col, flag in flags.items():
        monthstr = row[col]
        #'0' and 'nan' mean that no month was entered for this class
        if monthstr != '0' and monthstr != 'nan':
            #each comma-separated entry is a single month ('3') or a range ('5-8')
            monthlist = [[int(i) for i in s.split('-')] for s in monthstr.split(',')]
            for l in monthlist:
                if len(l) == 1:
                    l = [l[0], l[0]]
                for i in range(l[0], l[1] + 1):
                    newrow = row[keep].copy()
                    newrow['MONTH'] = f'{i:02d}'
                    newrow['WATERFLAG'] = flag
                    newrow['SENTINEL2YEAR'] = '2018'
                    newrows.append(newrow)
    if newrows:
        #DataFrame.append was removed in pandas 2.0, so the new rows are concatenated instead
        newtable = pd.concat([newtable, pd.DataFrame(newrows)])

    return newtable
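
The month strings handled by this function combine single months and ranges. As a standalone illustration of that parsing step only, the sketch below uses a hypothetical helper, `expand_months`, which is not part of the notebook.

```python
def expand_months(monthstr):
    """Expand a string such as '1-3,7' into a list of month numbers (hypothetical helper)."""
    months = []
    for part in monthstr.split(','):
        bounds = [int(i) for i in part.split('-')]
        # A single month has one bound, so it acts as both start and end
        start, end = bounds[0], bounds[-1]
        months.extend(range(start, end + 1))
    return months

mixed = expand_months('1-3,7')
padded = [f'{m:02d}' for m in expand_months('11-12')]
print(mixed)
print(padded)
```

The zero-padded form matches how the `MONTH` column is written, so that months sort correctly as strings.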

In [14]:
#Making an empty dataframe
result = pd.DataFrame()

In [15]:
#Expanding every validation point into its monthly rows
for irow in range(len(ground_truth)):
    result=split_str(ground_truth.iloc[irow], result)

In [16]:
print(result.shape)
result.loc[13] #this shows part of the table: the rows generated from the point with index label 13

In [17]:
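#Rows that share the same location and month, i.e. points that carry more than one label for that month
indexNames = result[result.duplicated(['LAT', 'LON','MONTH'], keep=False)]
indexNames.shape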
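
The cell above only counts the conflicting rows. The following sketch on a toy table shows how `duplicated` with `keep=False` flags every member of a conflicting group, together with one possible way to filter them out. Dropping all conflicting rows is an assumption made for illustration and is not necessarily the rule used in the accuracy assessment.

```python
import pandas as pd

# Toy table: the first two rows label the same point and month differently
toy = pd.DataFrame({
    'LAT': [-1.3, -1.3, -1.3, 0.5],
    'LON': [36.8, 36.8, 36.8, 32.6],
    'MONTH': ['01', '01', '02', '01'],
    'WATERFLAG': ['1', '3', '1', '0'],
})

key = ['LAT', 'LON', 'MONTH']
# keep=False marks all rows of a duplicated group, not only the later ones
conflicts = toy[toy.duplicated(key, keep=False)]
# One possible filter (an assumption): drop every row that belongs to a conflicting group
unique_only = toy.drop_duplicates(key, keep=False)
print(len(conflicts), len(unique_only))
```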

In [18]:
#Reordering the columns of the final table
result = result[['PLOT_ID', 'LON', 'LAT','FLAGGED','ANALYSES','SENTINEL2YEAR', 'WATER','NO_WATER','BAD_IMAGE','NOT_SURE','CLASS', 'COMMENT', 'MONTH','WATERFLAG']]

In [19]:
#final table that has Waterflags identified based on the labels from Analysts
result

In [20]:
#save the dataframe as csv file 
result.to_csv('../Data/Processed/RCMRD/CEO_1_RCMRD_2020-07-30.csv')

Run the following cell after all subprojects of a partner institution have been inspected, in order to join their tables into a single table for that institution.

In [21]:
#Joining the processed tables and exporting one csv for each partner institution
DF = glob.glob('../Data/Processed/RCMRD/*_RCMRD_*.csv')
frame = []
for d in DF: 
    f = pd.read_csv(d,delimiter=",")
    frame.append(f)
out = pd.concat(frame)
out.to_csv('../Data/Processed/RCMRD/RCMRD_ValidationPoints.csv')
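
When tables are written with `to_csv` and read back, two details are worth checking: the row labels return as an unnamed first column, and zero-padded months such as `01` are parsed as integers unless a string dtype is requested. The sketch below illustrates both on a toy table, using an in-memory buffer instead of files; the values are made up.

```python
import io
import pandas as pd

part = pd.DataFrame({'PLOT_ID': [1, 2], 'MONTH': ['01', '02']}, index=[13, 13])

# Written with the default index=True, the row labels come back as an unnamed column
buffer = io.StringIO()
part.to_csv(buffer)
buffer.seek(0)
reread = pd.read_csv(buffer)
print(list(reread.columns))

# index_col=0 restores the labels and dtype keeps the zero-padded months as strings
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, dtype={'MONTH': str})
# ignore_index=True renumbers the rows of the joined table from zero
joined = pd.concat([restored, restored], ignore_index=True)
print(joined.index.tolist())
print(joined['MONTH'].tolist())
```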

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)