# Assessing analysts' accuracy at labelling reference data

Collect Earth Online (CEO) is being used to collect cropland reference data. The sample data contain 'known' labels seeded among the other samples. This script compares those known test labels (GFSAD's validation data) against the labels collected by the analyst.

Input:

1. `ceo-data....csv` : The results from collecting training data in the CEO tool

Output:

1. A confusion (error) matrix containing the Overall, Producer's, and User's accuracy, along with the F1 score.

***

In [1]:
import pandas as pd
import numpy as np
import seaborn as sn
import geopandas as gpd
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

## Analysis Parameters

In [2]:
folder = 'data/training_validation/collect_earth/indian_ocean/'
csv = 'data/training_validation/collect_earth/indian_ocean/ceo-Cropland-Reference-Data-Acquisition---Indian-Ocean-Region---Ken-sample-data-2020-12-09.csv'

### Load the dataset

In [3]:
# CEO sample data (csv) holding both the analyst's labels and the GFSAD labels
df = pd.read_csv(csv)

### Clean up dataframe


In [4]:
# this line if testing sample:
# df = df[['lon', 'lat', 'smpl_class','Is the sample area entirely: crop, non-crop, mixed, or unsure?']]

#This line if entire dataset:
df = df[['lon', 'lat', 'smpl_sampleid', 'smpl_gfsad_samp','smpl_class','Is the sample area entirely: crop, non-crop, mixed, or unsure?']]

#rename columns
df = df.rename(columns={'Is the sample area entirely: crop, non-crop, mixed, or unsure?':'Prediction',
                        'smpl_class':'Actual'})

#remove nan rows
df = df.dropna()
df.head()

| | lon | lat | smpl_sampleid | smpl_gfsad_samp | Actual | Prediction |
|---|---|---|---|---|---|---|
| 0 | 45.70709 | -17.645966 | 0 | 0 | 1 | non-crop |
| 1 | 48.139548 | -21.818282 | 1 | 0 | 2 | non-crop |
| 2 | 47.300707 | -20.514019 | 2 | 1 | 1 | non-crop |
| 3 | 47.400594 | -18.578418 | 3 | 0 | 1 | non-crop |
| 4 | 45.169179 | -17.825989 | 4 | 0 | 1 | non-crop |


***
If this is the `test sample` (the first 50-100 samples used for training analysts), skip the following cell.

If this is the reference data sample (2100 points), run the cell below to extract the GFSAD validation samples before running the rest of the code.


In [5]:
# df was already subset and renamed in the previous cell, so selecting the
# original column names again would raise a KeyError; only filter here.

# keep only the GFSAD validation samples seeded among the reference points
df = df[df['smpl_gfsad_samp']==True]
print(len(df))

60


***

### Reclassify prediction & actual columns

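After reclassification: 1 = crop, 0 = non-crop.

The GFSAD labels in `Actual` arrive coded as 1 = non-crop and 2 = crop, and the analyst labels in `Prediction` arrive as text; the cells below recode both to the same 0/1 scheme.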

In [6]:
df.head()

| | lon | lat | smpl_sampleid | smpl_gfsad_samp | Actual | Prediction |
|---|---|---|---|---|---|---|
| 2 | 47.300707 | -20.514019 | 2 | 1 | 1 | non-crop |
| 55 | 47.619325 | -20.054245 | 55 | 1 | 1 | non-crop |
| 64 | 48.361984 | -15.752325 | 64 | 1 | 1 | non-crop |
| 96 | 47.480451 | -17.750978 | 96 | 1 | 2 | crop |
| 102 | 47.635846 | -21.844441 | 102 | 1 | 2 | unsure |


In [7]:
df['Prediction'] = np.where(df['Prediction']=='non-crop', 0, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']=='crop', 1, df['Prediction'])

df['Actual'] = np.where(df['Actual']==1, 0, df['Actual'])
df['Actual'] = np.where(df['Actual']==2, 1, df['Actual'])
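
The recoding above can also be written with explicit lookup tables, which keeps the meaning of each code visible in one place. The following is a minimal sketch on toy rows, not the notebook's method: the names `GFSAD_CODES` and `ANALYST_LABELS` are hypothetical, and the code values are inferred from the cell above. Unlike the `np.where` version, `Series.map` turns labels that are missing from the lookup ('mixed', 'unsure') into `NaN`, so this variant only suits the binary assessment, not the all-classes confusion matrix.

```python
import pandas as pd

# Hypothetical lookup tables, inferred from the reclassification cell above
GFSAD_CODES = {1: 0, 2: 1}                    # GFSAD: 1 = non-crop, 2 = crop
ANALYST_LABELS = {'non-crop': 0, 'crop': 1}   # analyst's text labels

# Toy rows for illustration only (not real samples)
toy = pd.DataFrame({'Actual': [1, 2, 2, 1],
                    'Prediction': ['non-crop', 'crop', 'unsure', 'mixed']})

toy['Actual'] = toy['Actual'].map(GFSAD_CODES)
# Labels absent from the lookup ('unsure', 'mixed') become NaN
toy['Prediction'] = toy['Prediction'].map(ANALYST_LABELS)

# Keep only rows with a binary analyst label and restore an integer dtype
binary = toy.dropna(subset=['Prediction']).astype({'Prediction': int})
print(binary)
```

Dropping the `NaN` rows here plays the same role as dropping the 'mixed' and 'unsure' samples later in the notebook.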

### Generate a confusion matrix with all classes

In [9]:
confusion_matrix = pd.crosstab(df['Actual'],
                               df['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

| Actual \ Prediction | 0 | 1 | unsure | All |
|---|---|---|---|---|
| 0 | 28 | 2 | 0 | 30 |
| 1 | 7 | 15 | 8 | 30 |
| All | 35 | 17 | 8 | 60 |


### Reclassify into a binary assessment

In [11]:
# count each label directly so the cell also works when a label is absent
n_mixed = (df['Prediction'] == 'mixed').sum()
n_unsure = (df['Prediction'] == 'unsure').sum()

print("Total number of samples: " + str(len(df)))
print("Number of 'mixed' samples: " + str(n_mixed))
print("Number of 'unsure' samples: " + str(n_unsure))

print("Dropping 'mixed' and 'unsure' samples")

df = df.drop(df[df['Prediction']=='mixed'].index)
df = df.drop(df[df['Prediction']=='unsure'].index)

Total number of samples: 60
Number of 'mixed' samples: 0
Number of 'unsure' samples: 8
Dropping 'mixed' and 'unsure' samples


---

### Recreate confusion matrix

In [12]:
confusion_matrix = pd.crosstab(df['Actual'],
                               df['Prediction'],
                               rownames=['Actual'],
                               colnames=['Prediction'],
                               margins=True)

confusion_matrix

| Actual \ Prediction | 0 | 1 | All |
|---|---|---|---|
| 0 | 28 | 2 | 30 |
| 1 | 7 | 15 | 22 |
| All | 35 | 17 | 52 |


### Calculate User's and Producer's Accuracy

`Producer's Accuracy`

In [13]:
confusion_matrix["Producer's"] = [confusion_matrix.loc[0, 0] / confusion_matrix.loc[0, 'All'] * 100,
                              confusion_matrix.loc[1, 1] / confusion_matrix.loc[1, 'All'] * 100,
                              np.nan]

`User's Accuracy`

In [14]:
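# correct predictions per class divided by the column (predicted) totals
users_accuracy = pd.Series({0: confusion_matrix.loc[0, 0] / confusion_matrix.loc['All', 0] * 100,
                            1: confusion_matrix.loc[1, 1] / confusion_matrix.loc['All', 1] * 100},
                           name="User's")

# DataFrame.append was removed in pandas 2.0; add the row by label instead
confusion_matrix.loc["User's"] = users_accuracy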

`Overall Accuracy`

In [15]:
confusion_matrix.loc["User's","Producer's"] = (confusion_matrix.loc[0, 0] + 
                                                confusion_matrix.loc[1, 1]) / confusion_matrix.loc['All', 'All'] * 100
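
As a cross-check on the three cells above, the same quantities can be computed directly from the four cell counts of the binary confusion matrix. This is a minimal sketch in plain Python; the function name `accuracy_metrics` and its argument names are hypothetical, and crop (class 1) is treated as the positive class. The counts passed in are the ones shown in the recreated confusion matrix.

```python
def accuracy_metrics(tn, fp, fn, tp):
    """Accuracy percentages for a 2x2 matrix (rows = actual, columns = predicted).

    Class 0 = non-crop, class 1 = crop (treated as the positive class).
    """
    total = tn + fp + fn + tp
    return {
        # Producer's accuracy: correct / row (actual) total
        "producers": {0: tn / (tn + fp) * 100, 1: tp / (fn + tp) * 100},
        # User's accuracy: correct / column (predicted) total
        "users": {0: tn / (tn + fn) * 100, 1: tp / (fp + tp) * 100},
        # Overall accuracy: all correct / all samples
        "overall": (tn + tp) / total * 100,
    }

# Counts from the binary confusion matrix above
metrics = accuracy_metrics(tn=28, fp=2, fn=7, tp=15)
print({k: round(v, 2) for k, v in metrics["producers"].items()})  # {0: 93.33, 1: 68.18}
print({k: round(v, 2) for k, v in metrics["users"].items()})      # {0: 80.0, 1: 88.24}
print(round(metrics["overall"], 2))                               # 82.69
```

These values match the Producer's column, the User's row, and the overall accuracy in the final table.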

`F1 Score`

The F1 score is the harmonic mean of precision and recall, and reaches its best value at 1 (perfect precision and recall). In accuracy-assessment terms, precision corresponds to the User's Accuracy and recall to the Producer's Accuracy, so for each class it is calculated as:

$$
\text{Fscore} = 2 \times \frac{\text{UA} \times \text{PA}}{\text{UA} + \text{PA}}
$$

where UA = User's Accuracy and PA = Producer's Accuracy.

In [16]:
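# non-crop (class 0): computed from the accuracies, which are percentages (hence / 100)
ua_0 = confusion_matrix.loc["User's", 0]
pa_0 = confusion_matrix.loc[0, "Producer's"]
fscore = pd.Series({0: 2 * (ua_0 * pa_0) / (ua_0 + pa_0) / 100,
                    # crop (class 1, the positive label): computed by scikit-learn
                    1: f1_score(df['Actual'].astype(np.int8), df['Prediction'].astype(np.int8), average='binary')},
                   name="F-score")

# DataFrame.append was removed in pandas 2.0; add the row by label instead
confusion_matrix.loc["F-score"] = fscore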
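
To make the formula concrete, the sketch below works it through for both classes using the counts from the binary confusion matrix. The helper name `f_score` is hypothetical. It also checks the result for the crop class against the equivalent count-based form, 2TP / (2TP + FP + FN), which is a standard identity for F1.

```python
def f_score(users, producers):
    """Harmonic mean of User's and Producer's accuracy (both as fractions)."""
    return 2 * users * producers / (users + producers)

# Counts from the binary confusion matrix (rows = actual, columns = predicted)
tn, fp, fn, tp = 28, 2, 7, 15

# Crop (class 1): UA = 15/17, PA = 15/22
f1_crop = f_score(users=tp / (tp + fp), producers=tp / (tp + fn))
# Non-crop (class 0): UA = 28/35, PA = 28/30
f1_noncrop = f_score(users=tn / (tn + fn), producers=tn / (tn + fp))

# Same value from the count-based identity 2TP / (2TP + FP + FN)
f1_crop_counts = 2 * tp / (2 * tp + fp + fn)

print(round(f1_noncrop, 2), round(f1_crop, 2))  # 0.86 0.77
```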

### Tidy Confusion Matrix

* Limit decimal places
* Add readable class names
* Remove nonsensical values

In [17]:
# round numbers
confusion_matrix = confusion_matrix.round(decimals=2)

In [18]:
# rename the 0/1 class codes to readable class names
confusion_matrix = confusion_matrix.rename(columns={0:'Non-crop', 1:'Crop', 'All':'Total'},
                                            index={0:'Non-crop', 1:'Crop', 'All':'Total'})

In [19]:
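# remove the nonsensical values in the table
# (cast to object first so the '--' placeholder strings can sit in numeric columns)
confusion_matrix = confusion_matrix.astype(object)
confusion_matrix.loc["User's", 'Total'] = '--'
confusion_matrix.loc['Total', "Producer's"] = '--'
confusion_matrix.loc["F-score", 'Total'] = '--'
confusion_matrix.loc["F-score", "Producer's"] = '--'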

In [20]:
confusion_matrix

| Actual \ Prediction | Non-crop | Crop | Total | Producer's |
|---|---|---|---|---|
| Non-crop | 28.0 | 2.0 | 30 | 93.33 |
| Crop | 7.0 | 15.0 | 22 | 68.18 |
| Total | 35.0 | 17.0 | 52 | -- |
| User's | 80.0 | 88.24 | -- | 82.69 |
| F-score | 0.86 | 0.77 | -- | -- |


### Export csv

In [21]:
confusion_matrix.to_csv(folder+ 'reference_data_accuracy_results_Ken.csv')

***

## Finding differences between GFSAD and analysts' labels

Reclassify both sets of labels so the classes match, find where the analyst's label differs from the GFSAD label, keep only the crop/non-crop disagreements, and export a shapefile that can be loaded into CEO to re-train the analyst on the incorrectly labelled samples.

In [None]:
folder = 'data/training_validation/collect_earth/western/'
csv = 'data/training_validation/collect_earth/western/ceo-Cropland-Reference-Data-Testing-Sample---Western-Region---Yadjemi-sample-data-2020-12-09.csv'

In [26]:
#open
df = pd.read_csv(csv)

#--These lines if entire dataset:------------
df = df[['lon', 'lat', 'smpl_sampleid', 'smpl_gfsad_samp','smpl_class','Is the sample area entirely: crop, non-crop, mixed, or unsure?']]
#rename columns
df = df.rename(columns={'Is the sample area entirely: crop, non-crop, mixed, or unsure?':'Prediction',
                        'smpl_class':'Actual'})
df = df[df['smpl_gfsad_samp']==True]
print(len(df))
#--------------------------------------------

# #only the columns we care about
# df = df[['lon', 'lat', 'smpl_class','Is the sample area entirely: crop, non-crop, mixed, or unsure?']]
# #rename
# df = df.rename(columns={'Is the sample area entirely: crop, non-crop, mixed, or unsure?':'Prediction',
#                         'smpl_class':'Actual'})

#reclassify so classes match
df['Prediction'] = np.where(df['Prediction']=='non-crop', 0, df['Prediction'])
df['Prediction'] = np.where(df['Prediction']=='crop', 1, df['Prediction'])
df['Actual'] = np.where(df['Actual']==1, 0, df['Actual'])
df['Actual'] = np.where(df['Actual']==2, 1, df['Actual'])

#drop mixed and unsure labels
df = df.drop(df[df['Prediction']=='mixed'].index)
df = df.drop(df[df['Prediction']=='unsure'].index)

# index out the rows that differ
df_dif = df[df['Actual'] != df['Prediction']]
df_dif=df_dif.reset_index(drop=True)

#add ids to satisfy Collect earth
df_dif['PLOTID'] = range(0,len(df_dif))
df_dif['SAMPLEID'] = range(0,len(df_dif))

#create geodataframe
gdf_dif = gpd.GeoDataFrame(
        df_dif,
        crs='epsg:4326',
        geometry=gpd.points_from_xy(df_dif['lon'],df_dif['lat']))

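# convert points to square polygons:
# reproject to EPSG:6933 (units are metres), buffer by `radius` metres,
# take the bounding box of each buffer, then reproject back to lat/lon
radius = 20
gdf_dif = gdf_dif.to_crs('EPSG:6933')
gdf_dif['geometry'] = gdf_dif['geometry'].buffer(radius).envelope
gdf_dif = gdf_dif.to_crs('EPSG:4326')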

gdf_dif

60


| | lon | lat | smpl_sampleid | smpl_gfsad_samp | Actual | Prediction | PLOTID | SAMPLEID | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 46.463591 | -21.866442 | 113 | 1 | 0 | 1 | 0 | 0 | POLYGON ((46.46338 -21.86661, 46.46380 -21.866... |
| 1 | 47.045159 | -16.740457 | 193 | 1 | 1 | 0 | 1 | 1 | POLYGON ((47.04495 -16.74062, 47.04537 -16.740... |
| 2 | 48.465225 | -15.243888 | 232 | 1 | 1 | 0 | 2 | 2 | POLYGON ((48.46502 -15.24405, 48.46543 -15.244... |
| 3 | 47.815146 | -21.5603 | 303 | 1 | 1 | 0 | 3 | 3 | POLYGON ((47.81494 -21.56047, 47.81535 -21.560... |
| 4 | 47.513442 | -18.934237 | 305 | 1 | 1 | 0 | 4 | 4 | POLYGON ((47.51323 -18.93440, 47.51365 -18.934... |
| 5 | 47.675446 | -22.451748 | 374 | 1 | 1 | 0 | 5 | 5 | POLYGON ((47.67524 -22.45192, 47.67565 -22.451... |
| 6 | 47.492826 | -22.167014 | 888 | 1 | 1 | 0 | 6 | 6 | POLYGON ((47.49262 -22.16718, 47.49303 -22.167... |
| 7 | 46.415106 | -19.391996 | 932 | 1 | 1 | 0 | 7 | 7 | POLYGON ((46.41490 -19.39216, 46.41531 -19.392... |
| 8 | 45.96825 | -19.012023 | 994 | 1 | 0 | 1 | 8 | 8 | POLYGON ((45.96804 -19.01219, 45.96846 -19.012... |


In [28]:
gdf_dif.to_file(folder+'indian_ocean_reference_sample_divergence_Ken.shp')
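
The disagreement filter in the cell above could be wrapped in a small function so it is easy to reuse for each analyst. The following is a minimal sketch under stated assumptions, not part of the original workflow: the function name `find_disagreements` is hypothetical, it expects `Actual` and `Prediction` to be recoded to 0/1 already, and the toy coordinates and labels are made up for illustration.

```python
import pandas as pd

def find_disagreements(df):
    """Return rows where a binary analyst label differs from the GFSAD label.

    Hypothetical helper: expects 'Actual' and 'Prediction' recoded to 0/1;
    'mixed' or 'unsure' may still be present in 'Prediction' and are ignored.
    """
    # keep only rows with a binary (crop / non-crop) analyst label
    binary = df[df['Prediction'].isin([0, 1])]
    diff = binary[binary['Actual'] != binary['Prediction']].reset_index(drop=True)
    # add plot and sample identifiers, as the cell above does for Collect Earth
    diff['PLOTID'] = range(len(diff))
    diff['SAMPLEID'] = range(len(diff))
    return diff

# Toy rows for illustration only (not real samples)
toy = pd.DataFrame({'lon': [46.0, 47.0, 48.0, 49.0],
                    'lat': [-21.0, -16.0, -15.0, -19.0],
                    'Actual': [0, 1, 1, 0],
                    'Prediction': [1, 1, 'unsure', 0]})

diff = find_disagreements(toy)
print(diff)
```

Only the first toy row is returned: it is the only one where a binary analyst label disagrees with the GFSAD label.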