# Accuracy Assessment of Water Observations from Space (WOfS) Product in Africa<img align="right" src="../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">


## Description
Now that we have run the WOfS classification for each AEZ in Africa, it's time to conduct an accuracy assessment. The data used for assessing the accuracy was collected previously and set aside. It is stored in the Results folder: `Results/WOfS_Assessment/Point_Based/Intermediate_Per_AEZ`.

Accuracy assessment for the WOfS product in Africa involves generating a confusion (error) matrix for the binary WOFL classification.
The inputs for estimating the accuracy of the WOfS-derived product are a binary WOFL classification layer showing water/non-water and a shapefile containing validation points collected with the [Collect Earth Online](https://collect.earth/) tool. The validation points are the ground truth (actual) data, while the value extracted from the WOFL at each location is the predicted value.

This notebook shows how to assess the accuracy of WOfS against the collected ground truth dataset. It outputs a confusion matrix containing the overall, producer's and user's accuracy, along with the F1 score for each class.
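
As a minimal sketch of what the confusion matrix summarises, the snippet below cross-tabulates six hypothetical validation points (the labels are invented for illustration and are not taken from the validation dataset). Rows hold the reference label and columns hold the value extracted from the WOFL, matching the `ACTUAL` and `PREDICTION` columns used later in this notebook.

```python
import pandas as pd

# Hypothetical labels for six validation points (1 = water, 0 = non-water).
actual = pd.Series([1, 1, 0, 0, 1, 0], name="ACTUAL")
predicted = pd.Series([1, 0, 0, 0, 1, 1], name="PREDICTION")

# Rows are the reference labels, columns are the WOFL values;
# margins=True adds the "All" row and column totals.
toy_matrix = pd.crosstab(actual, predicted, margins=True)
print(toy_matrix)
```

The diagonal cells count the points where the map and the reference agree; the off-diagonal cells count the two kinds of error.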


## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

### Load packages
Import Python packages that are used for the analysis.

In [1]:
%matplotlib inline

import os
import rasterio
import xarray
import glob
import numpy as np
import pandas as pd
import seaborn as sn
import geopandas as gpd
import matplotlib.pyplot as plt
import scipy, scipy.ndimage
import warnings
warnings.filterwarnings("ignore") #this will suppress the warnings for multiple UTM zones in your AOI 

from geopandas import GeoSeries, GeoDataFrame
from shapely.geometry import Point
from sklearn.metrics import confusion_matrix, accuracy_score 
from sklearn.metrics import f1_score  # plot_confusion_matrix is unused here and was removed in scikit-learn 1.2

### Analysis Parameters 

- CEO: ground truth points containing the WOfS classes, the WOfS clear observations and the label assigned by the analyst for each calendar month
- input_data: dataframe used for further analysis and accuracy assessment

### Optional: Stitch together valid points generated from `03_Accuracy_Assessment-AEZ` into one CSV file for continental Africa
Ground truth points for non-provisional Landsat C2 WOfS exist in the folder `Results/WOfS_Assessment/wofs_ls`.

In [2]:
!pwd

/home/jovyan/dev/wofs-validation/Notebooks


In [5]:
file_path = ("../Results/WOfS_Assessment/wofs_ls")
all_filenames = glob.glob(os.path.join(file_path, '*.csv'))

In [6]:
all_filenames

['../Results/WOfS_Assessment/wofs_ls/Central_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Sahel_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Indian_ocean_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Southern_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Northern_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Eastern_wofs_ls_valid.csv',
 '../Results/WOfS_Assessment/wofs_ls/Western_wofs_ls_valid.csv']

In [7]:
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
combined_csv

Unnamed: 0.1,Unnamed: 0,PLOT_ID,LON,LAT,CLASS,MONTH,ACTUAL,CLASS_WET,CLEAR_OBS,index_right,ID,CODE,COUNTRY,PREDICTION
0,138,137711642.0,23.061932,5.240099,Forest/woodlands,10,0,0.0,1.0,0,2.0,ANG,Central,0
1,1491,137711755.0,16.208833,-2.713546,Open water - freshwater,6,1,1.0,1.0,0,2.0,ANG,Central,1
2,1918,137711790.0,22.334056,-5.938465,Cultivated (Cropland/ Plantation),6,1,1.0,2.0,0,2.0,ANG,Central,1
3,2097,137711805.0,13.126633,-7.844516,Open water - marine,5,1,1.0,1.0,0,2.0,ANG,Central,1
4,2098,137711805.0,13.126633,-7.844516,Open water - marine,9,1,1.0,1.0,0,2.0,ANG,Central,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1041,8012,137483514.0,-7.561075,4.388714,Open water - freshwater,2,1,1.0,1.0,0,20.0,BEN,Western,1
1042,8013,137483514.0,-7.561075,4.388714,Open water - freshwater,3,1,1.0,1.0,0,20.0,BEN,Western,1
1043,8014,137483514.0,-7.561075,4.388714,Open water - freshwater,5,1,1.0,1.0,0,20.0,BEN,Western,1
1044,8035,137483515.0,6.352198,4.376923,Open water - freshwater,3,1,1.0,1.0,0,20.0,BEN,Western,1


In [8]:
#export to csv
combined_csv.to_csv('../Results/WOfS_Assessment/wofs_ls/Africa_all_ValidationPoints_ls_wofs.csv', index=False)

`index=False` omits the DataFrame index (the row numbers carried over from each CSV) from the exported file.
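
One thing to watch when re-running the stitching step: the combined file is written into the same folder that `*.csv` is globbed from, so a second run would read it back in alongside the per-AEZ files. The sketch below is one possible way to guard against that, assuming the per-AEZ files all end in `_valid.csv` as in the listing above. The file names and contents here are hypothetical and are created in a temporary folder.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Two hypothetical per-AEZ files standing in for the real "*_valid.csv" outputs.
    pd.DataFrame({"PLOT_ID": [1, 2], "ACTUAL": [0, 1]}).to_csv(
        os.path.join(folder, "A_wofs_ls_valid.csv"), index=False)
    pd.DataFrame({"PLOT_ID": [3], "ACTUAL": [1]}).to_csv(
        os.path.join(folder, "B_wofs_ls_valid.csv"), index=False)
    # A previously written combined file that should not be read back in.
    pd.DataFrame({"PLOT_ID": [1, 2, 3], "ACTUAL": [0, 1, 1]}).to_csv(
        os.path.join(folder, "Africa_all.csv"), index=False)

    # Match only the per-AEZ files; sorting makes the row order reproducible.
    per_aez_files = sorted(glob.glob(os.path.join(folder, "*_valid.csv")))

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat((pd.read_csv(f) for f in per_aez_files), ignore_index=True)

print(len(per_aez_files), len(combined), list(combined.index))
```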

### Clean the combined dataframe

In [10]:
df_file_location = '../Results/WOfS_Assessment/wofs_ls/Africa_all_ValidationPoints_ls_wofs.csv'
df = pd.read_csv(df_file_location,delimiter=",")

In [12]:
keep_columns = ['PLOT_ID', 'LON', 'LAT', 'CLASS', 'MONTH', 'ACTUAL',
       'CLASS_WET', 'CLEAR_OBS', 'index_right', 'ID', 'CODE', 'COUNTRY',
       'PREDICTION']

In [15]:
input_data = df.loc[:, keep_columns]

In [26]:
#Counting the number of validation points in all Africa 
countpoints = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
len(countpoints)

2329

In [18]:
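#Derive PREDICTION from CLASS_WET: '1' (water) if CLASS_WET is 1 or more, otherwise '0' (non-water). The values are strings here and are cast to integers before the F1 score is calculated.
input_data['PREDICTION'] = input_data['CLASS_WET'].apply(lambda x: '1' if x >=1 else '0')  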

In [22]:
#Remove rows where the same location and month appear more than once, i.e. points labelled more than once (as 0, 1, 2 or 3) for the same month. keep=False drops every copy.
duplicate = input_data.duplicated(['LAT', 'LON','MONTH'], keep=False)
input_data = input_data[~duplicate]

#Counting the number of validation points (without duplication) in all Africa  
countduplicate = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
len(countduplicate)

2329
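
The behaviour of `keep=False` is easy to misread, so here is a small illustration on a hypothetical frame (the values are invented). Every row that shares its `LAT`, `LON` and `MONTH` with another row is flagged, not just the second occurrence, so both conflicting rows are removed.

```python
import pandas as pd

# Hypothetical rows: plot 1 has two entries for the same location and month.
toy = pd.DataFrame({
    "PLOT_ID": [1, 1, 2, 3],
    "LAT": [5.0, 5.0, 6.0, 7.0],
    "LON": [23.0, 23.0, 16.0, 13.0],
    "MONTH": [6, 6, 6, 9],
    "ACTUAL": [0, 1, 1, 1],
})

# keep=False marks all members of a duplicated group as True.
is_duplicate = toy.duplicated(["LAT", "LON", "MONTH"], keep=False)
deduplicated = toy[~is_duplicate]
print(deduplicated)
```

A plot disappears from the per-plot count only when all of its rows are dropped, which would explain why the count can stay unchanged after this step.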

In [23]:
#Drop rows whose label is greater than 1 (i.e. not 0 or 1) or that have no clear WOfS/SCL observations
indexnames = input_data[(input_data['ACTUAL'] > 1) | (input_data['CLEAR_OBS']==0.0) | (input_data['CLEAR_OBS'].isna())].index
input_data.drop(indexnames, inplace=True)
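
The same filter can also be written as a boolean mask that returns a new frame instead of dropping rows in place, which avoids modifying a slice of another DataFrame. This is a sketch on hypothetical values, not a change to the notebook's logic.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case the filter handles.
toy = pd.DataFrame({
    "ACTUAL": [0, 1, 2, 1, 1],
    "CLEAR_OBS": [1.0, 2.0, 1.0, 0.0, np.nan],
})

# Same conditions as the cell above: label above 1, zero clear observations, or missing.
drop = (toy["ACTUAL"] > 1) | (toy["CLEAR_OBS"] == 0) | toy["CLEAR_OBS"].isna()

# Select the complement and copy so later edits do not touch the original frame.
filtered = toy[~drop].copy()
print(filtered)
```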

In [25]:
countfinal = input_data.groupby('PLOT_ID',as_index=False,sort=False).last()
countfinal

Unnamed: 0,PLOT_ID,LON,LAT,CLASS,MONTH,ACTUAL,CLASS_WET,CLEAR_OBS,index_right,ID,CODE,COUNTRY,PREDICTION
0,137711642.0,23.061932,5.240099,Forest/woodlands,10,0,0.0,1.0,0,2.0,ANG,Central,0
1,137711755.0,16.208833,-2.713546,Open water - freshwater,6,1,1.0,1.0,0,2.0,ANG,Central,1
2,137711790.0,22.334056,-5.938465,Cultivated (Cropland/ Plantation),6,1,1.0,2.0,0,2.0,ANG,Central,1
3,137711805.0,13.126633,-7.844516,Open water - marine,9,1,1.0,1.0,0,2.0,ANG,Central,1
4,137711816.0,14.000022,-9.545944,Cultivated (Cropland/ Plantation),8,0,1.0,1.0,0,2.0,ANG,Central,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2324,137483511.0,6.725308,4.459231,Open water - freshwater,1,1,1.0,1.0,0,20.0,BEN,Western,1
2325,137483512.0,6.714737,4.406402,Open water - freshwater,10,1,1.0,1.0,0,20.0,BEN,Western,1
2326,137483513.0,6.854653,4.397676,Open water - freshwater,1,1,1.0,1.0,0,20.0,BEN,Western,1
2327,137483514.0,-7.561075,4.388714,Open water - freshwater,5,1,1.0,1.0,0,20.0,BEN,Western,1


### Create a Confusion Matrix 

In [27]:
confusion_matrix = pd.crosstab(input_data['ACTUAL'],input_data['PREDICTION'],rownames=['ACTUAL'],colnames=['PREDICTION'],margins=True)
confusion_matrix

| ACTUAL \ PREDICTION | 0 | 1 | All |
|---|---|---|---|
| 0 | 3520 | 484 | 4004 |
| 1 | 1700 | 6339 | 8039 |
| All | 5220 | 6823 | 12043 |


### Calculate Producer's and User's Accuracy 

`Producer's Accuracy` is the map-maker's accuracy: the probability that a given class on the ground is correctly classified on the map. It is the complement of the omission error (producer's accuracy = 100% - omission error).

In [28]:
confusion_matrix["Producer's"] = [confusion_matrix.loc[0][0] / confusion_matrix.loc[0]['All'] * 100, confusion_matrix.loc[1][1] / confusion_matrix.loc[1]['All'] *100, np.nan]
confusion_matrix

| ACTUAL \ PREDICTION | 0 | 1 | All | Producer's |
|---|---|---|---|---|
| 0 | 3520 | 484 | 4004 | 87.912088 |
| 1 | 1700 | 6339 | 8039 | 78.853091 |
| All | 5220 | 6823 | 12043 | |
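
As a cross-check that does not depend on how the DataFrame is labelled, the producer's accuracy can be recomputed directly from the four counts reported above. This is a sketch using NumPy; the array layout (rows = `ACTUAL`, columns = `PREDICTION`) is copied from the table.

```python
import numpy as np

# Counts from the confusion matrix above: rows are ACTUAL, columns are PREDICTION.
counts = np.array([[3520, 484],
                   [1700, 6339]])

# Correctly classified sites of each class divided by the reference total of that class.
producers = np.diag(counts) / counts.sum(axis=1) * 100

# The omission error is the complement of the producer's accuracy.
omission = 100 - producers
print(producers, omission)
```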


`User's Accuracy` is the map-user's accuracy: how often a class shown on the map is actually present on the ground, i.e. the reliability of the map. It is calculated as the number of correctly classified sites for a class divided by the total number of sites classified as that class.

In [29]:
#If this cell raises a KeyError, the column labels are integers rather than strings: replace '0' and '1' with 0 and 1 (remove the quotation marks)

users_accuracy = pd.Series([confusion_matrix['0'][0] / confusion_matrix['0']['All'] * 100,
                                confusion_matrix['1'][1] / confusion_matrix['1']['All'] * 100]).rename("User's")

#DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
confusion_matrix = pd.concat([confusion_matrix, users_accuracy.to_frame().T])
confusion_matrix 

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3520.0 | 484.0 | 4004.0 | 87.912088 | | |
| 1 | 1700.0 | 6339.0 | 8039.0 | 78.853091 | | |
| All | 5220.0 | 6823.0 | 12043.0 | | | |
| User's | | | | | 67.43295 | 92.906346 |
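
User's accuracy is the same quantity that scikit-learn calls precision. The sketch below rebuilds label arrays from the four reported counts and passes them to `precision_score`; the reconstruction with `np.full` is only an illustrative device for reproducing the table, not how the notebook derives its values.

```python
import numpy as np
from sklearn.metrics import precision_score

# (actual, predicted, count) for each cell of the confusion matrix above.
cells = [(0, 0, 3520), (0, 1, 484), (1, 0, 1700), (1, 1, 6339)]

# Expand the counts back into one label per validation observation.
y_true = np.concatenate([np.full(n, a) for a, p, n in cells])
y_pred = np.concatenate([np.full(n, p) for a, p, n in cells])

# average=None returns one precision value per class, ordered by label (0, then 1).
users = precision_score(y_true, y_pred, average=None) * 100

# The commission error is the complement of the user's accuracy.
commission = 100 - users
print(users, commission)
```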


`Overall Accuracy` shows the proportion of reference (actual) sites that were mapped correctly. The cell below stores it in the "User's" row of the "Producer's" column.

In [30]:
confusion_matrix.loc["User's", "Producer's"] = (confusion_matrix['0'][0] + confusion_matrix['1'][1]) / confusion_matrix['All']['All'] * 100
confusion_matrix

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3520.0 | 484.0 | 4004.0 | 87.912088 | | |
| 1 | 1700.0 | 6339.0 | 8039.0 | 78.853091 | | |
| All | 5220.0 | 6823.0 | 12043.0 | | | |
| User's | | | | 81.864984 | 67.43295 | 92.906346 |
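
The overall accuracy is the sum of the diagonal divided by the total number of observations. A short NumPy check against the counts reported above:

```python
import numpy as np

# Counts from the confusion matrix above: rows are ACTUAL, columns are PREDICTION.
counts = np.array([[3520, 484],
                   [1700, 6339]])

# Sites where the map and reference agree, over all sites.
overall = np.trace(counts) / counts.sum() * 100
print(round(overall, 6))
```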


In [31]:
#Cast PREDICTION from string to integer so it can be compared with ACTUAL in f1_score
input_data['PREDICTION'] = input_data['PREDICTION'].astype(str).astype(int)

The `F1 score` is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and is calculated as:

In [32]:
fscore = pd.Series([(2*(confusion_matrix.loc["User's"][0]*confusion_matrix.loc[0]["Producer's"]) / (confusion_matrix.loc["User's"][0] + confusion_matrix.loc[0]["Producer's"])) / 100,
                   f1_score(input_data['ACTUAL'],input_data['PREDICTION'])]).rename("F-score")
#DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
confusion_matrix = pd.concat([confusion_matrix, fscore.to_frame().T])

In [33]:
confusion_matrix

| ACTUAL | 0 | 1 | All | Producer's | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 3520.0 | 484.0 | 4004.0 | 87.912088 | | |
| 1 | 1700.0 | 6339.0 | 8039.0 | 78.853091 | | |
| All | 5220.0 | 6823.0 | 12043.0 | | | |
| User's | | | | 81.864984 | 67.43295 | 92.906346 |
| F-score | | | | | 0.763226 | 0.853048 |
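
The cell above computes the two F-scores in different ways (from the user's and producer's accuracy for class 0, and with `f1_score` for class 1). Both can be reproduced from the counts with the equivalent form F1 = 2·TP / (2·TP + FP + FN), as in this sketch:

```python
import numpy as np

# Counts from the confusion matrix above: rows are ACTUAL, columns are PREDICTION.
counts = np.array([[3520, 484],
                   [1700, 6339]])

tp = np.diag(counts)              # sites correctly assigned to each class
fn = counts.sum(axis=1) - tp      # reference sites of the class that the map missed
fp = counts.sum(axis=0) - tp      # sites mapped as the class that belong to the other

# Harmonic mean of precision (user's) and recall (producer's), per class.
f1 = 2 * tp / (2 * tp + fp + fn)
print(f1.round(6))
```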


### Tidy Confusion Matrix 

- Limit decimal places
- Add readable class names 
- Remove nonsensical values

In [34]:
confusion_matrix = confusion_matrix.round(decimals=2)

In [35]:
confusion_matrix = confusion_matrix.rename(columns={'0':'NoWater','1':'Water', 0:'NoWater',1:'Water','All':'Total'},index={'0':'NoWater','1':'Water',0:'NoWater',1:'Water','All':'Total'})

In [36]:
confusion_matrix

| ACTUAL | NoWater | Water | Total | Producer's | NoWater | Water |
|---|---|---|---|---|---|---|
| NoWater | 3520.0 | 484.0 | 4004.0 | 87.91 | | |
| Water | 1700.0 | 6339.0 | 8039.0 | 78.85 | | |
| Total | 5220.0 | 6823.0 | 12043.0 | | | |
| User's | | | | 81.86 | 67.43 | 92.91 |
| F-score | | | | | 0.76 | 0.85 |
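
The tidied matrix still contains empty cells and repeated column names, because the user's accuracy and F-score rows were appended under separate columns. One possible alternative, sketched here under the assumption that a separate per-class summary is acceptable, is to keep the counts and the metrics in two tables. The name `summary` and the column headings are illustrative.

```python
import numpy as np
import pandas as pd

# Counts from the confusion matrix above: rows are ACTUAL, columns are PREDICTION.
counts = np.array([[3520, 484],
                   [1700, 6339]])
tp = np.diag(counts)
row_totals = counts.sum(axis=1)   # reference totals per class
col_totals = counts.sum(axis=0)   # mapped totals per class

# One row per class, one column per metric, so no cell is left empty.
summary = pd.DataFrame({
    "Producer's (%)": tp / row_totals * 100,
    "User's (%)": tp / col_totals * 100,
    "F-score": 2 * tp / (row_totals + col_totals),
}, index=["NoWater", "Water"]).round(2)
print(summary)
```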


In [37]:
#saving out the confusion matrix 
confusion_matrix.to_csv('../Results/WOfS_Assessment/wofs_ls/ConfusionMatrix/Africa_all_wofs_ls_confusion_matrix.csv')

**Reading the confusion matrix**

* Vertical labels (labelling each row) are Actual 
* Horizontal labels (labelling each column) are Predicted
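* Producer's accuracy for each class is in the "Producer's" column
* User's accuracy for each class is in the "User's" row
* Overall accuracy is in the "User's" row of the "Producer's" column
* F-score for each class is in the "F-score" row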

**Stats according to the confusion matrix**
* Overall accuracy = 81.86%
* Producer's accuracy (water classed as water) = 78.85%
* User's accuracy (water classed as water was actually water) = 92.91%

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** January 2020

**Compatible datacube version:** 

## Tags
Browse all available tags on the DE Africa User Guide's [Tags Index](https://) (placeholder as this does not exist yet)