# Assessing training data acquisition methods

Collect Earth Online (CEO) is being trialled as a tool for collecting cropland training data. This script compares the test labels (GFSAD's validation data) against the user-collected labels.

Inputs will be:

1. `ceo-data....csv` : The results from collecting training data in the CEO tool


Output will be:

1. A confusion matrix (error matrix) containing Overall, Producer's, and User's accuracy, along with the F1 score.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

## Analysis Parameters

In [2]:
csv = 'data/training_validation/collect_earth/test_collection/ceo-cropland-training-data-acquisition-test---edward-sample-data-2020-08-05.csv'
# name = 

### Load the dataset

In [3]:
# load the CEO results csv
df = pd.read_csv(csv)

### Clean up dataframe

In [4]:
#only grab useful columns
df = df[['LON', 'LAT', 'PL_SAMPLEID', 'SMPL_CLASS', 'IS THE SAMPLE AREA ENTIRELY: CROP, NON-CROP, MIXED, OR UNSURE?']]

#rename columns
df = df.rename(columns={'IS THE SAMPLE AREA ENTIRELY: CROP, NON-CROP, MIXED, OR UNSURE?':'Prediction',
                  'SMPL_CLASS':'Actual'})

#reclassify so prediction and actual columns match
df['Prediction'] = np.where(df['Prediction']=='crop', 1, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']=='non-crop', 0, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']=='mixed', 2, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']=='unsure', 3, df['Prediction'])

#remove nan rows
df = df.dropna()

df.head()

,LON,LAT,PL_SAMPLEID,Actual,Prediction
0,34.527969,-10.403404,0,0,0
1,36.104507,-11.172339,1,0,0
2,37.968063,-11.033538,2,0,0
3,40.159836,-10.483086,3,0,0
4,36.515121,-10.365581,4,0,0
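
The four chained `np.where` calls in the cell above could also be written as a single dictionary lookup. The following is a minimal sketch on toy data, not the notebook's own method: `LABEL_CODES` and `answers` are hypothetical names introduced for illustration. One behavioural difference to be aware of is that `Series.map` turns any answer missing from the dictionary into `NaN`, whereas `np.where` leaves unrecognised strings in place.

```python
import pandas as pd

# Hypothetical lookup table mirroring the class codes used in this notebook
LABEL_CODES = {'crop': 1, 'non-crop': 0, 'mixed': 2, 'unsure': 3}

# Toy answers standing in for the CEO question column (not real data)
answers = pd.Series(['crop', 'non-crop', 'mixed', 'unsure', None])

# Answers that are missing or not in the dictionary become NaN
codes = answers.map(LABEL_CODES)

# Drop the NaN rows, then store the codes as integers
codes = codes.dropna().astype(int)

print(codes.tolist())  # [1, 0, 2, 3]
```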


### Generate a confusion matrix with all classes

In [5]:
confusion_matrix = pd.crosstab(df['Actual'],
                               df['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

Actual \ Prediction,0,1,2,3,All
0,198,5,10,4,217
1,6,60,7,1,74
All,204,65,17,5,291


### Reclassify into a binary assessment

In [6]:
# count each class directly so that a class with zero samples prints 0
# instead of raising an IndexError
print("Total number of samples: " + str(len(df)))
print("Number of 'mixed' samples: " + str((df['Prediction'] == 2).sum()))
print("Number of 'unsure' samples: " + str((df['Prediction'] == 3).sum()))

print("Reclassifying 'mixed' and 'unsure' sample to 'non-crop' ")

df['Prediction'] = np.where(df['Prediction']==2, 0, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']==3, 0, df['Prediction'])

Total number of samples: 291
Number of 'mixed' samples: 17
Number of 'unsure' samples: 5
Reclassifying 'mixed' and 'unsure' sample to 'non-crop' 


---

### Recreate confusion matrix

In [7]:
confusion_matrix = pd.crosstab(df['Actual'],
                               df['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

Actual \ Prediction,0,1,All
0,212,5,217
1,14,60,74
All,226,65,291


### Calculate User's and Producer's Accuracy

`User's Accuracy`

In [8]:
confusion_matrix["User's"] = [confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All'] * 100,
                              confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All'] * 100,
                              np.nan]

`Producer's Accuracy`

In [9]:
# DataFrame.append was removed in pandas 2.0, so add the new row with .loc instead
# each value is the diagonal count divided by its column total
for cls in [0, 1]:
    confusion_matrix.loc["Producer's", cls] = (confusion_matrix.loc[cls, cls] /
                                               confusion_matrix.loc['All', cls] * 100)

`Overall Accuracy`

In [10]:
confusion_matrix.loc["Producer's", "User's"] = (confusion_matrix.loc[0, 0] + 
                                                confusion_matrix.loc[1, 1]) / confusion_matrix.loc['All', 'All'] * 100
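
As a cross-check on the three cells above, the same figures can be reproduced from the four counts of the binary confusion matrix. This is a sketch that follows the notebook's own layout, in which the `User's` column holds the diagonal divided by the row totals and the `Producer's` row holds the diagonal divided by the column totals. The names `cm`, `row_ratio`, `col_ratio`, and `overall` are introduced here for illustration only.

```python
import numpy as np

# Counts copied from the binary confusion matrix above
# (rows = Actual, columns = Prediction; order is non-crop, crop)
cm = np.array([[212, 5],
               [14, 60]])

correct = np.diag(cm)                        # correctly labelled samples per class

row_ratio = correct / cm.sum(axis=1) * 100   # what the notebook stores in the "User's" column
col_ratio = correct / cm.sum(axis=0) * 100   # what the notebook stores in the "Producer's" row
overall = correct.sum() / cm.sum() * 100     # overall accuracy in percent

print(np.round(row_ratio, 2))
print(np.round(col_ratio, 2))
print(round(float(overall), 2))
```

Rounded to two decimals these come to 97.7 and 81.08 for the rows, 93.81 and 92.31 for the columns, and 93.47 overall, which matches the tidied table at the end of the notebook.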

`F1 Score`

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall), and is calculated as:

$$
\begin{aligned}
\text{Fscore} = 2 \times \frac{\text{UA} \times \text{PA}}{\text{UA} + \text{PA}}.
\end{aligned}
$$

Where UA = User's Accuracy and PA = Producer's Accuracy.
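
To make the formula concrete, here is a small worked example using the counts from the binary confusion matrix in this notebook. The helper `f_score` is a hypothetical name used only for this sketch, and it takes the two accuracies as fractions between 0 and 1 rather than as percentages.

```python
def f_score(ua, pa):
    """Harmonic mean of two accuracies expressed as fractions in [0, 1]."""
    return 2 * ua * pa / (ua + pa)

# Non-crop: 212 correct samples, row total 217, column total 226
f_noncrop = f_score(212 / 217, 212 / 226)

# Crop: 60 correct samples, row total 74, column total 65
f_crop = f_score(60 / 74, 60 / 65)

print(round(f_noncrop, 2))  # 0.96
print(round(f_crop, 2))     # 0.86
```

Because the harmonic mean is symmetric, swapping which ratio is passed as `ua` and which as `pa` gives the same score.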

In [11]:
# DataFrame.append was removed in pandas 2.0, so add the new row with .loc instead
# non-crop: harmonic mean of the two accuracies, rescaled from percent to 0-1
ua, pa = confusion_matrix.loc[0, "User's"], confusion_matrix.loc["Producer's", 0]
confusion_matrix.loc["F-score", 0] = 2 * (ua * pa) / (ua + pa) / 100
# crop: scikit-learn's binary F1 treats class 1 as the positive label
confusion_matrix.loc["F-score", 1] = f1_score(df['Actual'].astype(np.int8),
                                              df['Prediction'].astype(np.int8),
                                              average='binary')

### Tidy Confusion Matrix

* Limit decimal places
* Add readable class names
* Remove nonsensical values

In [12]:
# round numbers
confusion_matrix = confusion_matrix.round(decimals=2)

In [13]:
# rename integer class codes to class names
confusion_matrix = confusion_matrix.rename(columns={0:'Non-crop', 1:'Crop', 'All':'Total'},
                                            index={0:'Non-crop', 1:'Crop', 'All':'Total'})

In [14]:
#remove the nonsensical values in the table
confusion_matrix.loc['Total', "User's"] = '--'
confusion_matrix.loc["Producer's", 'Total'] = '--'
confusion_matrix.loc["F-score", 'Total'] = '--'
confusion_matrix.loc["F-score", "User's"] = '--'

In [15]:
confusion_matrix

Actual \ Prediction,Non-crop,Crop,Total,User's
Non-crop,212.0,5.0,217,97.7
Crop,14.0,60.0,74,81.08
Total,226.0,65.0,291,--
Producer's,93.81,92.31,--,93.47
F-score,0.96,0.86,--,--


### Export csv

In [None]:
# confusion_matrix.to_csv('results/confusion_matrix.csv')