# Validating the 10m Northern Africa Cropland Mask


## Description
Previously, in the `6_Accuracy_assessment_20m.ipynb` notebook, we ran preliminary validations on 20m resolution test crop-masks, which were stored on disk as GeoTIFFs. The final cropland extent mask, produced at 10m resolution, is stored in the datacube and requires a different validation method.

> NOTE: A very large sandbox (256GiB RAM) is required to run this notebook.

This notebook will output a `confusion error matrix` containing Overall, Producer's, and User's accuracy, along with the F1 score for each class.

***
## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load Packages

In [1]:
import os
import sys
import glob
import rasterio
import datacube
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import geopandas as gpd
from sklearn.metrics import f1_score
from rasterstats import zonal_stats

## Analysis Parameters

* `product`: name of the crop-mask product we're validating
* `band`: the band of the crop-mask we want to load and validate. Can be either `'mask'` or `'filtered'`
* `grd_truth`: a shapefile containing crop/no-crop points to serve as the "ground-truth" dataset


In [2]:
product = "crop_mask_sahel"
band = 'mask'
grd_truth = 'data/validation_samples.shp'




### Load the datasets

`The cropland extent mask`

In [3]:
#connect to the datacube
dc = datacube.Datacube(app='feature_layers')
    
#load 10m cropmask
ds = dc.load(product=product, measurements=[band], resolution=(-10,10)).squeeze()
print(ds)

<xarray.Dataset>
Dimensions:      (y: 364800, x: 672000)
Coordinates:
    time         datetime64[ns] 2019-07-02T11:59:59.999999
  * y            (y) float64 3.36e+06 3.36e+06 3.36e+06 ... -2.88e+05 -2.88e+05
  * x            (x) float64 -1.728e+06 -1.728e+06 ... 4.992e+06 4.992e+06
    spatial_ref  int32 6933
Data variables:
    mask         (y, x) uint8 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
Attributes:
    crs:           EPSG:6933
    grid_mapping:  spatial_ref
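
As a rough check on the memory note at the top of this notebook, the in-memory size of the mask can be estimated from the dimensions printed above. This is a back-of-the-envelope sketch, assuming one byte per `uint8` pixel and ignoring any working copies made during processing; the variable names are illustrative.

```python
# Dimensions copied from the printed dataset above
n_rows, n_cols = 364_800, 672_000
pixel_size = 10  # metres

# Cross-check the dimensions against the (rounded) coordinate extents
rows_from_extent = round((3.36e6 - (-2.88e5)) / pixel_size)
cols_from_extent = round((4.992e6 - (-1.728e6)) / pixel_size)

# uint8 stores one byte per pixel
n_bytes = n_rows * n_cols
gib = n_bytes / 2**30

print(rows_from_extent, cols_from_extent)
print(f"{n_bytes:,} bytes = {gib:.1f} GiB")  # 245,145,600,000 bytes = 228.3 GiB
```

Under these assumptions the single band alone takes roughly 228 GiB, which is consistent with the sandbox size quoted above.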


`Ground truth points`

In [4]:
#ground truth shapefile
ground_truth = gpd.read_file(grd_truth).to_crs('EPSG:6933')

# rename the 'Class' column to 'Actual'
ground_truth = ground_truth.rename(columns={'Class':'Actual'})

# reclassify the text labels into integers (non-crop = 0, crop = 1)
ground_truth['Actual'] = np.where(ground_truth['Actual']=='non-crop', 0, ground_truth['Actual'])
ground_truth['Actual'] = np.where(ground_truth['Actual']=='crop', 1, ground_truth['Actual'])
ground_truth.head()

Unnamed: 0,lon,lat,smpl_sampl,smpl_gfsad,smpl_class,Actual,geometry
0,35.313223,13.014432,696,0,2,1,POINT (3407241.529 1646426.340)
1,0.882056,14.495036,518,0,2,1,POINT (85106.281 1830028.934)
2,32.045871,15.146135,1045,0,2,0,POINT (3091986.854 1910398.360)
3,17.154678,17.243342,1670,0,1,0,POINT (1655191.052 2167583.165)
4,16.441595,9.483514,574,0,2,1,POINT (1586388.357 1204472.400)
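
The label-to-integer step can also be written as a single dictionary lookup. The sketch below is an alternative, not the method used in this notebook, and it runs on a small hypothetical `labels` series rather than the real shapefile. `Series.map` returns a numeric column directly, whereas the `np.where` route can leave a mixed-type column that has to be cast later.

```python
import pandas as pd

# Hypothetical labels standing in for the 'Actual' column of the shapefile
labels = pd.Series(['crop', 'non-crop', 'crop', 'non-crop'], name='Actual')

# Map each text label to its integer class in one step
class_codes = {'non-crop': 0, 'crop': 1}
actual = labels.map(class_codes)

# Labels missing from the dictionary become NaN, so fail early if any appear
if actual.isna().any():
    raise ValueError("unexpected class label found")

actual = actual.astype('int8')
print(actual.tolist())
```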



## Convert points into polygons

When the validation data was collected, 40x40m polygons were evaluated as either crop/non-crop rather than points, so we want to sample the raster using the same small polygons. We'll find the majority or 'mode' statistic within the polygon and use that to compare with the validation dataset.


In [5]:
#set radius (in metres) around points
radius = 20

#create circle buffer around points, then find envelope
ground_truth['geometry'] = ground_truth['geometry'].buffer(radius).envelope
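
To see what the buffer-then-envelope step produces, the sketch below applies it to a single point with `shapely` (the geometry library that `geopandas` builds on). It uses the first ground-truth point from the table above, and it should report a footprint of about 40 m by 40 m.

```python
from shapely.geometry import Point

radius = 20  # metres, as in the cell above

# First ground-truth point from the table above (EPSG:6933 coordinates)
point = Point(3407241.529, 1646426.340)

# Circular buffer around the point, then the bounding box of that circle
square = point.buffer(radius).envelope

min_x, min_y, max_x, max_y = square.bounds
width, height = max_x - min_x, max_y - min_y
print(round(width, 3), round(height, 3), round(square.area, 1))
```

The envelope of a circle of radius 20 m is a square with 40 m sides, matching the 40x40m polygons used when the validation data was collected.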

### Calculate zonal statistics

We want to know what the majority pixel value is inside each validation polygon.

In [6]:
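def custom_majority(x):
    # x is the masked array of mask pixels that fall inside one polygon
    valid = np.ma.count(x)  # number of non-nodata pixels
    if valid == 0:
        return None  # no valid pixels, so no prediction for this polygon
    # crop (1) only if more than half of the valid pixels are crop; ties go to non-crop (0)
    return int(np.sum(x) / valid > 0.5)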
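
The majority rule can be checked on a small hand-made window before running it over the full raster. The sketch below restates the rule in a stand-alone function (`majority_crop` is an illustrative name) and applies it to hypothetical pixel values, with 255 treated as nodata as in the `zonal_stats` call that follows.

```python
import numpy as np

def majority_crop(pixels):
    """Return 1 if more than half of the valid pixels are crop, else 0."""
    valid = np.ma.count(pixels)  # masked (nodata) pixels are not counted
    if valid == 0:
        return None  # nothing to vote on
    return int(pixels.sum() / valid > 0.5)

# Hypothetical 4x4 window: 1 = crop, 0 = non-crop, 255 = nodata
window = np.array([[1, 1, 0, 255],
                   [1, 1, 0, 255],
                   [1, 0, 0, 255],
                   [1, 1, 1, 255]], dtype=np.uint8)
masked = np.ma.masked_equal(window, 255)

# 8 of the 12 valid pixels are crop, so the polygon is labelled crop
result = majority_crop(masked)

# An exact 50/50 split falls back to non-crop
tie = majority_crop(np.ma.masked_equal(np.array([[1, 0], [1, 0]], dtype=np.uint8), 255))

# A window with only nodata pixels gives no prediction
empty = majority_crop(np.ma.masked_all((2, 2), dtype=np.uint8))

print(result, tie, empty)
```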

In [7]:
#calculate stats
stats = zonal_stats(ground_truth.geometry,
                    ds[band].values,
                    affine=ds.geobox.affine,
                    add_stats={'majority':custom_majority},
                    nodata=255)

#append stats to grd truth df
ground_truth['Prediction']=[i['majority'] for i in stats]

ground_truth.head()

Unnamed: 0,lon,lat,smpl_sampl,smpl_gfsad,smpl_class,Actual,geometry,Prediction
0,35.313223,13.014432,696,0,2,1,"POLYGON ((3407221.529 1646406.340, 3407261.529...",0
1,0.882056,14.495036,518,0,2,1,"POLYGON ((85086.281 1830008.934, 85126.281 183...",0
2,32.045871,15.146135,1045,0,2,0,"POLYGON ((3091966.854 1910378.360, 3092006.854...",0
3,17.154678,17.243342,1670,0,1,0,"POLYGON ((1655171.052 2167563.165, 1655211.052...",0
4,16.441595,9.483514,574,0,2,1,"POLYGON ((1586368.357 1204452.400, 1586408.357...",0


***

## Create a confusion matrix

In [8]:
confusion_matrix = pd.crosstab(ground_truth['Actual'],
                               ground_truth['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

Actual / Prediction,0,1,All
0,191,9,200
1,26,62,88
All,217,71,288
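
To make the layout of the table concrete, the sketch below tallies (actual, prediction) pairs by hand for six hypothetical polygons. Rows are the ground-truth class and columns are the predicted class, the same arrangement `pd.crosstab` produces above.

```python
from collections import Counter

# Hypothetical labels for six validation polygons (1 = crop, 0 = non-crop)
actual     = [0, 0, 0, 1, 1, 1]
prediction = [0, 0, 1, 1, 1, 0]

# Count how often each (actual, prediction) pair occurs
counts = Counter(zip(actual, prediction))

# Print one row per actual class: counts per predicted class, then the row total
for a in (0, 1):
    row = [counts[(a, p)] for p in (0, 1)]
    print(a, row, sum(row))
```

The diagonal cells, `(0, 0)` and `(1, 1)`, hold the correctly classified polygons; the off-diagonal cells hold the two kinds of error.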


### Calculate User's and Producer's Accuracy

`Producer's Accuracy`

In [9]:
confusion_matrix["Producer's"] = [confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All'] * 100,
                              confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All'] * 100,
                              np.nan]

`User's Accuracy`

In [10]:
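# User's accuracy: correct predictions divided by the column (prediction) totals
users_accuracy = pd.Series([confusion_matrix.loc[0, 0] / confusion_matrix.loc['All', 0] * 100,
                            confusion_matrix.loc[1, 1] / confusion_matrix.loc['All', 1] * 100]
                           ).rename("User's")

# DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
confusion_matrix = pd.concat([confusion_matrix, users_accuracy.to_frame().T])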

`Overall Accuracy`

In [11]:
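# overall accuracy: correct predictions (the diagonal) divided by all samples,
# stored in the otherwise empty User's / Producer's corner cell
confusion_matrix.loc["User's","Producer's"] = (confusion_matrix.loc[0, 0] + 
                                                confusion_matrix.loc[1, 1]) / confusion_matrix.loc['All', 'All'] * 100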
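
The three accuracy measures can be reproduced by hand from the counts in the confusion matrix above. The sketch below uses plain Python, with variable names chosen for illustration and crop treated as the positive class.

```python
# Counts copied from the confusion matrix above
tn, fp = 191, 9    # actual non-crop: predicted non-crop, predicted crop
fn, tp = 26, 62    # actual crop:     predicted non-crop, predicted crop
total = tn + fp + fn + tp

# Producer's accuracy: correct predictions divided by the row (actual) totals
producers = {"Non-crop": tn / (tn + fp) * 100, "Crop": tp / (tp + fn) * 100}

# User's accuracy: correct predictions divided by the column (prediction) totals
users = {"Non-crop": tn / (tn + fn) * 100, "Crop": tp / (tp + fp) * 100}

# Overall accuracy: all correct predictions divided by all samples
overall = (tn + tp) / total * 100

print({k: round(v, 2) for k, v in producers.items()})  # {'Non-crop': 95.5, 'Crop': 70.45}
print({k: round(v, 2) for k, v in users.items()})      # {'Non-crop': 88.02, 'Crop': 87.32}
print(round(overall, 2))                               # 87.85
```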

`F1 Score`

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall), and is calculated as:

$$
\begin{aligned}
\text{Fscore} = 2 \times \frac{\text{UA} \times \text{PA}}{\text{UA} + \text{PA}}.
\end{aligned}
$$

where UA = User's Accuracy (precision) and PA = Producer's Accuracy (recall).

In [12]:
# non-crop: formula above (inputs are percentages, hence / 100); crop: scikit-learn, with crop (1) as the positive class
ua, pa = confusion_matrix.loc["User's", 0], confusion_matrix.loc[0, "Producer's"]
fscore = pd.Series([2 * (ua * pa) / (ua + pa) / 100,
                    f1_score(ground_truth['Actual'].astype(np.int8), ground_truth['Prediction'].astype(np.int8), average='binary')]
                   ).rename("F-score")

# DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
confusion_matrix = pd.concat([confusion_matrix, fscore.to_frame().T])
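
As a worked check of the formula, both F-scores can be computed directly from the confusion-matrix counts. This is a sketch in plain Python with illustrative names; it also shows the equivalent count-based form, 2TP / (2TP + FP + FN), for the crop class.

```python
# Counts copied from the confusion matrix above (crop is the positive class)
tn, fp, fn, tp = 191, 9, 26, 62

def f_score(ua, pa):
    """Harmonic mean of User's accuracy (ua) and Producer's accuracy (pa)."""
    return 2 * ua * pa / (ua + pa)

# Crop class: UA = TP / predicted crop, PA = TP / actual crop
f_crop = f_score(tp / (tp + fp), tp / (tp + fn))

# Non-crop class: the same formula with the roles of the two classes swapped
f_noncrop = f_score(tn / (tn + fn), tn / (tn + fp))

# Equivalent count-based form for the crop class
f_crop_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f_noncrop, 2), round(f_crop, 2))  # 0.92 0.78
```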

### Tidy Confusion Matrix

* Limit decimal places
* Add readable class names
* Remove nonsensical values

In [13]:
# round numbers
confusion_matrix = confusion_matrix.round(decimals=2)

In [14]:
# rename the integer class labels to class names
confusion_matrix = confusion_matrix.rename(columns={0:'Non-crop', 1:'Crop', 'All':'Total'},
                                            index={0:'Non-crop', 1:'Crop', 'All':'Total'})

In [15]:
#remove the nonsensical values in the table
confusion_matrix.loc["User's", 'Total'] = '--'
confusion_matrix.loc['Total', "Producer's"] = '--'
confusion_matrix.loc["F-score", 'Total'] = '--'
confusion_matrix.loc["F-score", "Producer's"] = '--'

In [16]:
confusion_matrix

Actual,Non-crop,Crop,Total,Producer's
Non-crop,191.0,9.0,200.0,95.5
Crop,26.0,62.0,88.0,70.45
Total,217.0,71.0,288.0,--
User's,88.02,87.32,--,87.85
F-score,0.92,0.78,--,--


### Export csv

In [17]:
confusion_matrix.to_csv('results/Sahel_10m_accuracy_assessment_confusion_matrix.csv')

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** Dec 2020
