# Accuracy assessment of the Western Africa Cropland Mask<img align="right" src="../figs/DE_Africa_Logo_Stacked_RGB_small.jpg">



## Description

Now that we have run classifications for the Western Africa agro-ecological zone (AEZ), it's time to conduct an accuracy assessment. The data used for assessing the accuracy was collected previously and set aside. It's stored in the `data/` folder: `data/validation_samples.shp`

This notebook will output a `confusion error matrix` containing Overall, Producer's, and User's accuracy, along with the F1 score for each class.

***
## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load Packages

In [1]:
import os
import sys
import glob
import rasterio
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import geopandas as gpd
from sklearn.metrics import f1_score
from odc.io.cgroups import get_cpu_quota

# sys.path.append('../../Scripts')
from deafrica_tools.spatial import zonal_stats_parallel



## Analysis Parameters

* `pred_tif` : a binary classification of crop/no-crop output by the ML script.
* `grd_truth` : a shapefile containing crop/no-crop points to serve as the "ground-truth" dataset
* `aez_region` : (optional, commented out below) a vector file used to limit the ground truth points to the region where the model has classified crop/non-crop


In [2]:
pred_tif = "results/classifications/20210525/Western_gm_mads_two_seasons_20210525_mosaic_clipped.tif"
# grd_truth = '../pre-post_processing/data/training_validation/GFSAD2015/cropland_prelim_validation_GFSAD.shp'
grd_truth = 'data/validation_samples.shp'
# aez_region = 'data/Western.geojson'

### Load the datasets

`Ground truth points`

In [3]:
#ground truth shapefile
ground_truth = gpd.read_file(grd_truth).to_crs('EPSG:6933')

In [4]:
# rename the class column to 'actual'
ground_truth = ground_truth.rename(columns={'Class':'Actual'})
ground_truth.head()

Unnamed: 0,lon,lat,smpl_sampl,smpl_gfsad,Actual,geometry
0,7.466886,12.261465,1533,0,non-crop,POINT (720452.101 1552635.612)
1,12.864863,11.922979,458,0,non-crop,POINT (1241282.778 1510387.311)
2,7.82208,6.917926,189,0,non-crop,POINT (754723.435 880456.813)
3,9.899884,8.0304,2048,0,non-crop,POINT (955202.942 1021202.641)
4,-5.725412,9.702883,156,0,crop,POINT (-552423.751 1232076.864)


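Optionally clip the points to the AEZ (this cell is commented out), then reclassify the 'Actual' column to match the raster values (0 = non-crop, 1 = crop)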

In [5]:
# # open shapefile
# aez=gpd.read_file(aez_region).to_crs('EPSG:6933')
# # clip points to region
# ground_truth = gpd.overlay(ground_truth,aez, how='intersection')

In [6]:
ground_truth['Actual'] = np.where(ground_truth['Actual']=='non-crop', 0, ground_truth['Actual'])
ground_truth['Actual'] = np.where(ground_truth['Actual']=='crop', 1, ground_truth['Actual'])

### Run this cell if point sampling

In [7]:
#Point sampling of raster for validation purpose
prediction = rasterio.open(pred_tif)
coords = [(x,y) for x, y in zip(ground_truth.geometry.x, ground_truth.geometry.y)]
# Sample the raster at every point location and store values in DataFrame
ground_truth['Prediction'] = [int(x[0]) for x in prediction.sample(coords)]

### Run the next two cells if polygon sampling
#### Convert points into polygons

When the validation data was collected, 40x40m polygons were evaluated as either crop/non-crop rather than points, so we want to sample the raster using the same small polygons. We'll find the majority or 'mode' statistic within the polygon and use that to compare with the validation dataset.
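
As a rough illustration of the idea, the sketch below uses only the Python standard library: it builds the 40 m x 40 m box around a point and picks the most common value from a list of pixel values. The helper names `square_bounds` and `majority` and the pixel values are hypothetical. Ties are resolved here by first occurrence, which may differ from how `zonal_stats_parallel` handles them.

```python
from collections import Counter


def square_bounds(x, y, radius=20):
    """Return (minx, miny, maxx, maxy) of the box enclosing a circle of `radius` metres."""
    return (x - radius, y - radius, x + radius, y + radius)


def majority(values):
    """Return the most common value in a sequence of pixel values."""
    return Counter(values).most_common(1)[0][0]


# First validation point from the table above (EPSG:6933 coordinates, in metres)
minx, miny, maxx, maxy = square_bounds(720452.101, 1552635.612)
width = maxx - minx
height = maxy - miny

# Hypothetical pixel values falling inside one validation polygon (1 = crop, 0 = non-crop)
pixels = [1, 1, 0, 1, 0, 1, 1, 1, 0]
label = majority(pixels)

print(f"polygon size: {width:.0f} m x {height:.0f} m, majority class: {label}")
```

The majority value plays the same role as the single sampled pixel in the point-sampling cell: it becomes the `Prediction` for that validation sample.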


In [None]:
#set radius (in metres) around points
radius = 20

#ground_truth is already in EPSG:6933 (equal-area, units in metres),
#so the buffer radius below is in metres and no reprojection is needed

#create circle buffer around points, then find envelope
ground_truth['geometry'] = ground_truth['geometry'].buffer(radius).envelope

#export to file for use in zonal-stats
ground_truth.to_file(grd_truth[:-4]+"_poly.shp")

### Calculate zonal statistics

We want to know what the majority pixel value is inside each validation polygon.

In [None]:
zonal_stats_parallel(shp=grd_truth[:-4]+"_poly.shp",
                    raster=pred_tif,
                    statistics=['majority'],
                    out_shp=grd_truth[:-4]+"_poly.shp",
                    ncpus=round(get_cpu_quota()),
                    nodata=-999)

#read in the results
x=gpd.read_file(grd_truth[:-4]+"_poly.shp")

#add result to original ground truth array
ground_truth['Prediction'] = x['majority'].astype(np.int16)

#Remove the temporary shapefile we made
for f in glob.glob(grd_truth[:-4]+"_poly"+'*'):
    os.remove(f)

***

## Create a confusion matrix

In [8]:
confusion_matrix = pd.crosstab(ground_truth['Actual'],
                               ground_truth['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

Actual (rows) / Prediction (columns),0,1,All
0,180,20,200
1,26,74,100
All,206,94,300
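
To make the next steps concrete, here is a minimal sketch in plain Python that derives the three accuracy measures from the counts shown above. It assumes the usual definitions: Producer's accuracy divides the correctly classified samples by the row (actual) total, User's accuracy divides them by the column (predicted) total, and Overall accuracy divides the diagonal sum by the grand total. The dictionary layout and variable names are illustrative only.

```python
# Counts copied from the confusion matrix above: matrix[actual][predicted]
matrix = {
    "non-crop": {"non-crop": 180, "crop": 20},
    "crop": {"non-crop": 26, "crop": 74},
}
classes = list(matrix)
total = sum(sum(row.values()) for row in matrix.values())

# Producer's accuracy: correct / row total (all samples that are actually class c)
producers = {c: 100 * matrix[c][c] / sum(matrix[c].values()) for c in classes}

# User's accuracy: correct / column total (all samples predicted as class c)
users = {c: 100 * matrix[c][c] / sum(matrix[r][c] for r in classes) for c in classes}

# Overall accuracy: all correct samples / all samples
overall = 100 * sum(matrix[c][c] for c in classes) / total

for c in classes:
    print(f"{c}: producer's = {producers[c]:.2f}%, user's = {users[c]:.2f}%")
print(f"overall = {overall:.2f}%")
```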


### Calculate User's and Producer's Accuracy

`Producer's Accuracy`

In [9]:
confusion_matrix["Producer's"] = [confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All'] * 100,
                              confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All'] * 100,
                              np.nan]

`User's Accuracy`

In [10]:
users_accuracy = pd.Series([confusion_matrix[0][0] / confusion_matrix[0]['All'] * 100,
                                confusion_matrix[1][1] / confusion_matrix[1]['All'] * 100]
                         ).rename("User's")

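# add the "User's" row (DataFrame.append was removed in pandas 2.0);
# the Series is aligned to the columns 0 and 1, other columns become NaN
confusion_matrix.loc["User's"] = users_accuracy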

`Overall Accuracy`

In [11]:
confusion_matrix.loc["User's","Producer's"] = (confusion_matrix.loc[0, 0] + 
                                                confusion_matrix.loc[1, 1]) / confusion_matrix.loc['All', 'All'] * 100

`F1 Score`

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall), and is calculated as:

$$
\begin{aligned}
\text{Fscore} = 2 \times \frac{\text{UA} \times \text{PA}}{\text{UA} + \text{PA}}.
\end{aligned}
$$

Where UA = User's Accuracy (equivalent to precision), and PA = Producer's Accuracy (equivalent to recall).
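
As a quick worked example of the formula, the sketch below plugs in the User's and Producer's accuracies reported in the final table of this notebook. The function name `f_score` is illustrative, and the inputs are the rounded percentages from that table, so the results are approximate.

```python
def f_score(ua, pa):
    """Harmonic mean of User's accuracy (ua) and Producer's accuracy (pa)."""
    return 2 * (ua * pa) / (ua + pa)


# Accuracies in percent, taken from the tidy confusion matrix in this notebook
non_crop = f_score(ua=87.38, pa=90.0) / 100  # divide by 100 to rescale to 0-1
crop = f_score(ua=78.72, pa=74.0) / 100

print(f"non-crop F-score: {non_crop:.2f}")
print(f"crop F-score: {crop:.2f}")
```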

In [12]:
fscore = pd.Series([(2*(confusion_matrix.loc["User's", 0]*confusion_matrix.loc[0, "Producer's"]) / (confusion_matrix.loc["User's", 0]+confusion_matrix.loc[0, "Producer's"])) / 100,
                    f1_score(ground_truth['Actual'].astype(np.int8), ground_truth['Prediction'].astype(np.int8), average='binary')]
                         ).rename("F-score")

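# add the "F-score" row (DataFrame.append was removed in pandas 2.0);
# the Series is aligned to the columns 0 and 1, other columns become NaN
confusion_matrix.loc["F-score"] = fscore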

### Tidy Confusion Matrix

* Limit decimal places
* Add readable class names
* Remove nonsensical values

In [13]:
# round numbers
confusion_matrix = confusion_matrix.round(decimals=2)

In [14]:
# rename the 0/1 class codes to class names
confusion_matrix = confusion_matrix.rename(columns={0:'Non-crop', 1:'Crop', 'All':'Total'},
                                            index={0:'Non-crop', 1:'Crop', 'All':'Total'})

In [15]:
#remove the nonsensical values in the table
confusion_matrix.loc["User's", 'Total'] = '--'
confusion_matrix.loc['Total', "Producer's"] = '--'
confusion_matrix.loc["F-score", 'Total'] = '--'
confusion_matrix.loc["F-score", "Producer's"] = '--'

In [16]:
confusion_matrix

Actual (rows) / Prediction (columns),Non-crop,Crop,Total,Producer's
Non-crop,180.0,20.0,200,90
Crop,26.0,74.0,100,74
Total,206.0,94.0,300,--
User's,87.38,78.72,--,84.67
F-score,0.89,0.76,--,--


### Export csv

In [None]:
confusion_matrix.to_csv('results/Western_confusion_matrix.csv')

## Next steps

This is the last notebook in the `Western Africa Cropland Mask` workflow! To revisit any of the other notebooks, use the links below.

1. [Extract_training_data](1_Extract_training_data.ipynb) 
2. [Inspect_training_data](2_Inspect_training_data.ipynb)
3. [Train_fit_evaluate_classifier](3_Train_fit_evaluate_classifier.ipynb)
4. [Predict](4_Predict.ipynb)
5. [Object-based_filtering](5_Object-based_filtering.ipynb)
6. **Accuracy_assessment (this notebook)**

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** Dec 2020
