# Accuracy assessment of the Eastern Africa Cropland Mask<img align="right" src="../../Supplementary_data/DE_Africa_Logo_Stacked_RGB_small.jpg">



## Description

Now that we have run classifications for the Eastern Africa AEZ, it's time to conduct an accuracy assessment. The data used for assessing the accuracy was collected previously and set aside. It's stored in the data/ folder: `data/Validation_samples.shp`

This notebook will output a confusion matrix (also called an error matrix) containing the overall, producer's, and user's accuracy, along with the F1 score for each class.

***
## Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load Packages

In [1]:
import rasterio
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
import geopandas as gpd
from sklearn.metrics import f1_score

## Analysis Parameters

* `pred_tif` : a binary crop/non-crop classification output by the ML script.
* `grd_truth` : a shapefile containing crop/non-crop points to serve as the "ground-truth" dataset.
* `aez_region` : a shapefile used to limit the ground truth points to the region where the model has classified crop/non-crop.


In [6]:
pred_tif = 'results/classifications/predicted/Eastern_tile_F-9_prediction_pixel_gm_mads_two_seasons_20201123.tif'
grd_truth = '../data/training_validation/GFSAD2015/cropland_prelim_validation_GFSAD.shp'
aez_region = 'data/Eastern.shp'

### Load the datasets

`Ground truth points`

In [7]:
#ground truth shapefile
ground_truth = gpd.read_file(grd_truth).to_crs('EPSG:6933')

In [8]:
# rename the 'class' column to 'Actual'
ground_truth = ground_truth.rename(columns={'class':'Actual'})
ground_truth.head()

Unnamed: 0,Actual,geometry
0,0,POINT (2348960.082 2523843.580)
1,0,POINT (553136.870 2547155.762)
2,0,POINT (2278070.581 2571794.107)
3,0,POINT (608920.945 2660250.463)
4,0,POINT (2354922.751 2669991.398)


Clip ground_truth data points to the simplified AEZ

In [9]:
#open shapefile
aez=gpd.read_file(aez_region).to_crs('EPSG:6933')
# clip points to region
ground_truth = gpd.overlay(ground_truth,aez, how='intersection')

`Raster of predicted classes`

In [11]:
prediction = rasterio.open(pred_tif)

### Extract a list of coordinate values

In [12]:
coords = [(x,y) for x, y in zip(ground_truth.geometry.x, ground_truth.geometry.y)]

### Sample the prediction raster at the ground truth coordinates

In [13]:
# Sample the raster at every point location and store values in DataFrame
ground_truth['Prediction'] = [int(x[0]) for x in prediction.sample(coords)]
ground_truth.head()

Unnamed: 0,Actual,ID,CODE,COUNTRY,geometry,Prediction
0,0,,,Eastern,POINT (3331475.328 -1320107.828),0
1,0,,,Eastern,POINT (3483589.623 -1416514.327),0
2,0,,,Eastern,POINT (3663397.201 -1399129.780),0
3,0,,,Eastern,POINT (3874873.173 -1330109.132),0
4,0,,,Eastern,POINT (3523208.242 -1315359.680),0
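
Conceptually, sampling a raster at a point means converting the point's map coordinates to a row and column, then reading the value of that pixel. The sketch below illustrates the idea for a north-up grid with square pixels. The origin, pixel size, grid values, and the `coord_to_pixel` helper are all made up for illustration, and this is not a description of how `rasterio` works internally. It assumes the point and the grid share the same coordinate reference system.

```python
import math

# Hypothetical north-up raster: top-left corner and pixel size in map units
x_origin, y_origin = 3_300_000.0, -1_300_000.0
pixel_size = 20.0

def coord_to_pixel(x, y):
    """Return the (row, col) of the pixel that contains the point (x, y)."""
    col = math.floor((x - x_origin) / pixel_size)
    # rows count downwards from the top edge of the raster
    row = math.floor((y_origin - y) / pixel_size)
    return row, col

# Hypothetical 3 x 3 grid of predicted classes (1 = crop, 0 = non-crop)
grid = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 0]]

# A point 50 units east and 30 units south of the top-left corner
row, col = coord_to_pixel(3_300_050.0, -1_300_030.0)
sampled_class = grid[row][col]
print(row, col, sampled_class)
```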


---

## Create a confusion matrix

In [15]:
confusion_matrix = pd.crosstab(ground_truth['Actual'],
                               ground_truth['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

Actual \ Prediction,0,1,All
0,217,0,217
1,67,7,74
All,284,7,291
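
To see how `pd.crosstab` arranges the counts, here is a minimal sketch on made-up labels (the six points below are hypothetical, not taken from the validation set). Rows hold the reference class, columns hold the predicted class, and `margins=True` adds the `All` totals.

```python
import pandas as pd

# Hypothetical labels for six points (1 = crop, 0 = non-crop)
toy = pd.DataFrame({'Actual':     [0, 0, 0, 1, 1, 1],
                    'Prediction': [0, 0, 1, 0, 1, 1]})

toy_matrix = pd.crosstab(toy['Actual'],
                         toy['Prediction'],
                         rownames=['Actual'],
                         colnames=['Prediction'],
                         margins=True)

print(toy_matrix)

# Off-diagonal cells are the errors: here, crop points predicted as non-crop
missed_crop = toy_matrix.loc[1, 0]
# The bottom-right cell is the total number of points
n_points = toy_matrix.loc['All', 'All']
print(missed_crop, n_points)
```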


### Calculate User's and Producer's Accuracy

`Producer's Accuracy`

In [17]:
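# Producer's accuracy: correctly classified points / total reference (actual) points of that class.
# The 'All' row has no producer's accuracy, so it is filled with NaN.
confusion_matrix["Producer's"] = [confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All'] * 100,
                                  confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All'] * 100,
                                  np.nan]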

`User's Accuracy`

In [18]:
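# User's accuracy: correctly classified points / total points predicted as that class
users_accuracy = pd.Series([confusion_matrix.loc[0, 0] / confusion_matrix.loc['All', 0] * 100,
                            confusion_matrix.loc[1, 1] / confusion_matrix.loc['All', 1] * 100]
                           ).rename("User's")

# DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
confusion_matrix = pd.concat([confusion_matrix, users_accuracy.to_frame().T])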

`Overall Accuracy`

In [19]:
confusion_matrix.loc["User's","Producer's"] = (confusion_matrix.loc[0, 0] + 
                                                confusion_matrix.loc[1, 1]) / confusion_matrix.loc['All', 'All'] * 100
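
As a cross-check on the three accuracy measures, the sketch below recomputes them in plain Python from the four counts in the confusion matrix above. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Counts from the confusion matrix above
# (first index = actual class, second = predicted class; 0 = non-crop, 1 = crop)
n00, n01 = 217, 0
n10, n11 = 67, 7

total = n00 + n01 + n10 + n11

# Producer's accuracy: correct points / all reference points of that class (row total)
pa_noncrop = n00 / (n00 + n01) * 100
pa_crop = n11 / (n10 + n11) * 100

# User's accuracy: correct points / all points predicted as that class (column total)
ua_noncrop = n00 / (n00 + n10) * 100
ua_crop = n11 / (n01 + n11) * 100

# Overall accuracy: all correct points / all points
overall = (n00 + n11) / total * 100

print(round(pa_noncrop, 2), round(pa_crop, 2))   # 100.0 9.46
print(round(ua_noncrop, 2), round(ua_crop, 2))   # 76.41 100.0
print(round(overall, 2))                         # 76.98
```

For this tile, the low producer's accuracy for crop (9.46 %) shows that most reference crop points were classified as non-crop, even though every point predicted as crop was correct.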

`F1 Score`

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall), and is calculated as:

$$
\begin{aligned}
\text{Fscore} = 2 \times \frac{\text{UA} \times \text{PA}}{\text{UA} + \text{PA}}.
\end{aligned}
$$

where UA is the user's accuracy (precision) and PA is the producer's accuracy (recall).

In [20]:
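# Non-crop F-score from UA and PA (percentages, hence the division by 100); crop F-score from scikit-learn
fscore = pd.Series([(2*(confusion_matrix.loc["User's", 0]*confusion_matrix.loc[0, "Producer's"]) / (confusion_matrix.loc["User's", 0]+confusion_matrix.loc[0, "Producer's"])) / 100,
                    f1_score(ground_truth['Actual'].astype(np.int8), ground_truth['Prediction'].astype(np.int8), average='binary')]
                         ).rename("F-score")

# DataFrame.append was removed in pandas 2.0, so add the new row with pd.concat
confusion_matrix = pd.concat([confusion_matrix, fscore.to_frame().T])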
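
The F-score formula can also be checked by hand. The sketch below applies it to the counts from the confusion matrix above, with UA and PA expressed as fractions rather than percentages. `f_score` is a hypothetical helper written for this illustration, not a function used elsewhere in the notebook.

```python
def f_score(ua, pa):
    """Harmonic mean of user's and producer's accuracy, both given as fractions in [0, 1]."""
    if ua + pa == 0:
        return 0.0  # avoid dividing by zero when a class has no correct points
    return 2 * ua * pa / (ua + pa)

# Crop: UA = 7 correct / 7 predicted, PA = 7 correct / 74 reference points
f_crop = f_score(7 / 7, 7 / 74)

# Non-crop: UA = 217 correct / 284 predicted, PA = 217 correct / 217 reference points
f_noncrop = f_score(217 / 284, 217 / 217)

print(round(f_noncrop, 2), round(f_crop, 2))   # 0.87 0.17
```

Because the harmonic mean is pulled towards the smaller of the two values, the crop F-score stays low despite a user's accuracy of 100 %.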

### Tidy Confusion Matrix

* Limit decimal places
* Add readable class names
* Remove nonsensical values

In [21]:
# round numbers
confusion_matrix = confusion_matrix.round(decimals=2)

In [22]:
# rename the 0/1 class labels to readable class names
confusion_matrix = confusion_matrix.rename(columns={0:'Non-crop', 1:'Crop', 'All':'Total'},
                                            index={0:'Non-crop', 1:'Crop', 'All':'Total'})

In [23]:
#remove the nonsensical values in the table
confusion_matrix.loc["User's", 'Total'] = '--'
confusion_matrix.loc['Total', "Producer's"] = '--'
confusion_matrix.loc["F-score", 'Total'] = '--'
confusion_matrix.loc["F-score", "Producer's"] = '--'

In [24]:
confusion_matrix

Actual \ Prediction,Non-crop,Crop,Total,Producer's
Non-crop,217.0,0.0,217,100
Crop,67.0,7.0,74,9.46
Total,284.0,7.0,291,--
User's,76.41,100.0,--,76.98
F-score,0.87,0.17,--,--


### Export csv

In [None]:
confusion_matrix.to_csv('results/Eastern_confusion_matrix.csv')

## Next steps

This is the last notebook in the `Eastern Africa Cropland Mask` workflow! To revisit any of the other notebooks, use the links below.

1. [Extracting_training_data](1_Extracting_training_data.ipynb) 
2. [Inspect_training_data](2_Inspect_training_data.ipynb)
3. [Train_fit_evaluate_classifier](3_Train_fit_evaluate_classifier.ipynb)
4. [Predict](4_Predict.ipynb)
5. [Object-based_filtering](5_Object-based_filtering.ipynb)
6. **Accuracy_assessment (this notebook)**

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last modified:** Dec 2020
