# Clustering Zooniverse Marks to count Iguanas
The goal is to find the best method to cluster the data and find the best number of clusters.
The benchmark is a gold standard dataset obtained by experts.

In [1]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.append("./")
sys.path.append("./zooniverse")

## Intro
### Retrieve a Classification report from Zooniverse
Request the classification export from your Zooniverse project:
https://www.zooniverse.org/lab/11905/data-exports

The export is a CSV file. Rename it to `iguanas-from-above-classifications.csv` and place it in the `input_path` directory.
The analysis does not use any Zooniverse tooling; it is a custom implementation.

An alternative would be to use the [code provided by Zooniverse](https://github.com/zooniverse/Data-digging/tree/master/notebooks_ProcessExports), for example the
[Bird Count Example](https://github.com/zooniverse/Data-digging/blob/master/scripts_ProjectExamples/seabirdwatch/bird_count.py).

This notebook assumes the data is flat and prepared. An alternative format would be the [Caesar aggregation format](https://github.com/zooniverse/aggregation-for-caesar).
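
What "flat" means here can be illustrated with a minimal sketch. It assumes that the `annotations` column of the classification export holds a JSON list of tasks, and that drawing tasks carry a list of marks with `x`, `y` and `tool_label` keys. The sample row, the task names and the helper `flatten_marks` are made up for illustration and are not this project's `data_prep`.

```python
import json

# Hypothetical, heavily shortened row of a classification export.
raw_row = {
    "subject_ids": 78961972,
    "user_name": "volunteer_a",
    "annotations": json.dumps([
        {"task": "T0", "value": "Yes"},
        {"task": "T2", "value": [
            {"x": 301.4, "y": 51.1, "tool_label": "Adult Male not in a lek"},
            {"x": 35.9, "y": 468.1, "tool_label": "Others (females, young males, juveniles)"},
        ]},
    ]),
}


def flatten_marks(row):
    """Return one dict per point mark found in a classification row."""
    marks = []
    for task in json.loads(row["annotations"]):
        value = task.get("value")
        # Question tasks carry a plain answer; only drawing tasks carry a list of marks.
        if not isinstance(value, list):
            continue
        for mark in value:
            if isinstance(mark, dict) and "x" in mark and "y" in mark:
                marks.append({
                    "subject_id": row["subject_ids"],
                    "user_name": row["user_name"],
                    "x": mark["x"],
                    "y": mark["y"],
                    "tool_label": mark.get("tool_label"),
                })
    return marks


flat_rows = flatten_marks(raw_row)
print(len(flat_rows))  # one row per mark
```

Each resulting row corresponds to one mark, which is the shape the clustering cells further down work with.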

The methods used are:

### DBSCAN 
DBSCAN does not require the number of clusters to be specified, which is why it is used here, but it has `min_samples` and `eps` as hyperparameters that need to be found. [Link](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
A simple **grid search** is used to find `eps` and `min_samples`.
Additionally, DBSCAN does not assume a specific cluster shape (K-means favours compact, roughly spherical clusters), even though we would expect the marks around an iguana to be roughly Gaussian distributed.
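
The effect of the two hyperparameters can be seen in a small sketch. It uses scikit-learn's `DBSCAN` on made-up marks and assumes the coordinates are scaled to the range 0 to 1; whether the project's own grid search scales the marks this way is not shown here, and `dbscan_grid` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy marks in normalised image coordinates: three tight groups of nine marks
# (three "iguanas") plus one stray mark in the middle of the image.
offsets = [(dx, dy) for dx in (-0.01, 0.0, 0.01) for dy in (-0.01, 0.0, 0.01)]
centres = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.3)]
marks = np.array(
    [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets] + [(0.5, 0.5)]
)


def dbscan_grid(points, eps_values, min_samples_values):
    """Run DBSCAN for every (eps, min_samples) pair and count clusters and noise."""
    results = {}
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
            n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise
            n_noise = int(np.sum(labels == -1))
            results[(eps, min_samples)] = (n_clusters, n_noise)
    return results


grid = dbscan_grid(marks, eps_values=[0.05, 0.5], min_samples_values=[3, 10])
for params, (n_clusters, n_noise) in grid.items():
    print(params, "clusters:", n_clusters, "noise:", n_noise)
```

On this toy data a small `eps` with `min_samples=3` recovers the three groups and flags the stray mark as noise. With the same `eps` and `min_samples=10` every mark becomes noise, because no group has ten marks. A large `eps` merges neighbouring groups.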

### HDBSCAN
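HDBSCAN is an extension of DBSCAN that performs DBSCAN over varying epsilon values, so no single `eps` has to be chosen. This makes it more robust to hyperparameter settings, although `min_cluster_size` (and optionally `min_samples`) still have to be set. [Link](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html)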

### Processing workflow

The data is flattened and filtered.
![Image](images/dataprocessing-DataFiltering.png)

After filtering for images with at least 4 true marks by users, each phase contains the following number of images:
1. First phase: 105 images
2. Second phase: 160 images
3. Third phase: 86 images


## Load the data

In [2]:
from zooniverse.utils.data_format import data_prep
from pathlib import Path

import pandas as pd
from zooniverse.analysis import get_annotation_count_stats
from zooniverse.utils.filters import filter_df_user_threshold
from zooniverse.config import get_config

# Phase Selection
# phase_tag = "Iguanas 1st launch"
# phase_tag = "Iguanas 2nd launch"
phase_tag = "Iguanas 3rd launch"

# Choose which gold standard subset serves as the reference
use_gold_standard_subset = "expert_goldstandard" # Use the X-T2-GS-results-5th-0s as the basis
# use_gold_standard_subset = "expert" # Use the expert-GS-Xphase as the basis

## Input Path of all the data
input_path = Path("/Users/christian/data/zooniverse")

# Location for the analysis Results
output_path = Path(input_path.joinpath(f"2024_03_22_{use_gold_standard_subset}_analysis").joinpath(phase_tag))

output_path.mkdir(exist_ok=True, parents=True)

reprocess = False # if True, the raw classification data is reprocessed. If False, the data is loaded from disk




debug = False # debugging with a smaller dataset
plot_diagrams = False # plot the diagrams to disk for the clustering methods
show_plots = False # show the plots in the notebook

user_threshold = 3 # None or a number, filter records which have less than these user interactions.






# Location for plots
output_plot_path = output_path.joinpath("plots")
output_plot_path.mkdir(parents=True, exist_ok=True)


## Look into the config
This config points to all files necessary for the analysis as well as the result files.

In [3]:
config = get_config(phase_tag=phase_tag, input_path=input_path, output_path=output_path)
config

{'annotations_source': PosixPath('/Users/christian/data/zooniverse/IguanasFromAbove/2023-10-15/iguanas-from-above-classifications.csv'),
 'goldstandard_data': PosixPath('/Users/christian/data/zooniverse/Images/Zooniverse_Goldstandard_images/expert-GS-3rdphase_renamed.csv'),
 'gold_standard_image_subset': PosixPath('/Users/christian/data/zooniverse/Images/Zooniverse_Goldstandard_images/3-T2-GS-results-5th-0s.csv'),
 'image_source': None,
 'yes_no_dataset': PosixPath('/Users/christian/data/zooniverse/2024_03_22_expert_goldstandard_analysis/Iguanas 3rd launch/yes_no_dataset_Iguanas 3rd launch.csv'),
 'flat_dataset': PosixPath('/Users/christian/data/zooniverse/2024_03_22_expert_goldstandard_analysis/Iguanas 3rd launch/flat_dataset_Iguanas 3rd launch.csv'),
 'merged_dataset': PosixPath('/Users/christian/data/zooniverse/2024_03_22_expert_goldstandard_analysis/Iguanas 3rd launch/flat_dataset_filtered_Iguanas 3rd launch.csv'),
 'gold_standard_and_expert_count': PosixPath('/Users/christian/data

In [4]:
from zooniverse.utils.anonymize import UserAnonymizer

if reprocess:
    ds_stats = data_prep(phase_tag=phase_tag, 
                         output_path=output_path, 
                         input_path=input_path,
                         filter_combination=use_gold_standard_subset, 
                         config=config)
    # Anonymise the data to prevent usernames and user_ids from becoming public
    anonymizer = UserAnonymizer(config["flat_dataset"])
    anonymizer.anonymize_data()
    anonymizer.save_anonymized_data(config["flat_dataset"])
    
    anonymizer = UserAnonymizer(config["merged_dataset"])
    anonymizer.anonymize_data()
    anonymizer.save_anonymized_data(config["merged_dataset"])
    
    print(ds_stats)
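
`UserAnonymizer` is part of this project and its internals are not shown. As a rough idea of how such pseudonymisation could work, the following sketch replaces each user name with a keyed hash from the standard library. The key, the helper name `pseudonymize` and the choice of `blake2b` are assumptions for illustration only.

```python
import hashlib

# Hypothetical secret; with a key, the hashes cannot be reproduced by
# simply hashing a list of known user names.
SECRET_KEY = b"replace-with-a-private-value"


def pseudonymize(value, key=SECRET_KEY):
    """Map a user name or id to a stable 32-character hex pseudonym."""
    if value is None:
        return None  # missing ids (users who were not logged in) stay empty
    digest = hashlib.blake2b(str(value).encode("utf-8"), key=key, digest_size=16)
    return digest.hexdigest()


print(pseudonymize("volunteer_a"))
print(pseudonymize(None))
```

The same input always maps to the same pseudonym, so marks can still be grouped per user after the original names are gone.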

In [5]:
if not plot_diagrams:
    output_plot_path = None

# the flattened, filtered marks from zooniverse.
df_merged_dataset = pd.read_csv(config["merged_dataset"])

# data for reference
df_goldstandard_expert_count = pd.read_csv(config["gold_standard_image_subset"], sep=";")
df_expert_count = pd.read_csv(config["goldstandard_data"], sep=";")

### Optional Debugging

In [6]:


## Debugging helpers
if phase_tag == "Iguanas 1st launch":    
    if debug:

        df_merged_dataset = df_merged_dataset[df_merged_dataset.image_name.isin(["SFM01-2-2-2_333.jpg", "SFM01-2-2-2_334.jpg", "SFM01-2-2-3_201.jpg"])]

elif phase_tag == "Iguanas 2nd launch":
    if debug:
        df_merged_dataset = df_merged_dataset[
           df_merged_dataset.image_name.isin(["FMO03-1_65.jpg", "FMO03-1_72.jpg", "MBN04-2_182.jpg", "EGI08-2_78.jpg"])]
           # df_merged_dataset.image_name.isin(["FMO03-1_72.jpg"])]

    
elif phase_tag == "Iguanas 3rd launch":

    # this user is a spammer
    df_merged_dataset = df_merged_dataset[df_merged_dataset.user_id != 2581179]
    
    if debug:
        df_merged_dataset = df_merged_dataset[
           df_merged_dataset.image_name.isin(["FMO03-2_70.jpg", "MBN04-2_182.jpg", "EGI08-2_78.jpg"])]
            
    


## Look at the data


In [7]:
## Look at the data
df_merged_dataset.drop("user_name", axis=1)


Unnamed: 0.1,Unnamed: 0,flight_site_code,image_name,subject_id,x,y,tool_label,phase_tag,user_id
0,57,CaboIbebetsonS,PCIS01-5_67.jpg,78961972,301.422882,51.112278,"Others (females, young males, juveniles)",Iguanas 3rd launch,bbe0564f6fa09817cda58d4c1027735e
1,58,CaboIbebetsonS,PCIS01-5_67.jpg,78961972,35.903500,468.096375,"Others (females, young males, juveniles)",Iguanas 3rd launch,bbe0564f6fa09817cda58d4c1027735e
2,70,WestCoastB,GWB01-3_152.jpg,78925551,728.559448,181.467453,"Others (females, young males, juveniles)",Iguanas 3rd launch,
3,71,WestCoastB,GWB01-3_152.jpg,78925551,601.206055,277.385895,"Others (females, young males, juveniles)",Iguanas 3rd launch,
4,125,SouthCoastH,ESCH02-1_323.jpg,78965007,247.383331,56.599998,Adult Male not in a lek,Iguanas 3rd launch,59ab2166efbc163b3edd511475247309
...,...,...,...,...,...,...,...,...,...
7396,104634,SouthCoastH,ESCH02-1_174.jpg,78964907,688.849609,135.619003,Adult Male with a lek,Iguanas 3rd launch,c17e059e2c3b375b261e417e5a65091d
7397,113862,GEB02,GEB02-3_197.jpg,78922625,496.962006,433.519104,Adult Male not in a lek,Iguanas 3rd launch,
7398,113863,GEB02,GEB02-3_197.jpg,78922625,496.542480,416.344574,Adult Male not in a lek,Iguanas 3rd launch,
7399,113866,GEB02,GEB02-3_197.jpg,78922625,502.057251,430.122284,"Others (females, young males, juveniles)",Iguanas 3rd launch,68a0c0b65edf786c52bd62d8fbdf8c12


### Filter User if necessary and Marks


In [8]:
print(f"Before filtering: {df_merged_dataset.subject_id.nunique()}")
# There are images in which some people said there are iguanas but then didn't mark them. Clustering with fewer than 3 marks doesn't make sense.
if user_threshold is not None:
    print(f"filtering records which have less than {user_threshold} interactions.")
    df_merged_dataset = filter_df_user_threshold(df_merged_dataset, user_threshold=user_threshold)
    
    
from zooniverse.utils.filters import filter_remove_marks
# Check if partials are still in the data. There shouldn't be any
df_merged_dataset = filter_remove_marks(df_merged_dataset)




Before filtering: 87
filtering records which have less than 3 interactions.
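
`filter_df_user_threshold` is a project helper whose body is not shown. One plausible reading, sketched below with pandas on made-up marks, is to keep only images that were marked by at least `user_threshold` distinct users. The column choice and the helper name `keep_images_with_enough_users` are assumptions.

```python
import pandas as pd

# Toy marks: image "a.jpg" was marked by three users, "b.jpg" by two.
df_marks = pd.DataFrame({
    "image_name": ["a.jpg", "a.jpg", "a.jpg", "a.jpg", "b.jpg", "b.jpg"],
    "user_name": ["u1", "u1", "u2", "u3", "u1", "u2"],
    "x": [10.0, 50.0, 11.0, 9.5, 30.0, 31.0],
    "y": [20.0, 60.0, 21.0, 19.0, 40.0, 41.0],
})


def keep_images_with_enough_users(df, user_threshold):
    """Keep only marks on images that at least `user_threshold` distinct users marked."""
    users_per_image = df.groupby("image_name")["user_name"].transform("nunique")
    return df[users_per_image >= user_threshold]


df_kept = keep_images_with_enough_users(df_marks, user_threshold=3)
print(df_kept["image_name"].unique())
```

Counting distinct users rather than marks prevents a single user with many marks from keeping an image in the dataset.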


### Are there anonymous users in the data?
There should be

In [9]:
df_merged_dataset[df_merged_dataset.user_id.isnull().values]

Unnamed: 0.1,Unnamed: 0,flight_site_code,image_name,subject_id,x,y,tool_label,phase_tag,user_id,user_name
1804,20416,SouthCoastH,ESCH01-1_13.jpg,78964714,842.693787,272.662140,Adult Male with a lek,Iguanas 3rd launch,,adc99ea22219d2baff677763af1cd90f
1805,20417,SouthCoastH,ESCH01-1_13.jpg,78964714,890.732117,242.415756,Adult Male with a lek,Iguanas 3rd launch,,adc99ea22219d2baff677763af1cd90f
1806,20418,SouthCoastH,ESCH01-1_13.jpg,78964714,792.876221,124.988678,Adult Male with a lek,Iguanas 3rd launch,,adc99ea22219d2baff677763af1cd90f
1807,20419,SouthCoastH,ESCH01-1_13.jpg,78964714,796.434631,290.454102,"Others (females, young males, juveniles)",Iguanas 3rd launch,,adc99ea22219d2baff677763af1cd90f
1808,20420,SouthCoastH,ESCH01-1_13.jpg,78964714,816.005798,235.298965,"Others (females, young males, juveniles)",Iguanas 3rd launch,,adc99ea22219d2baff677763af1cd90f
...,...,...,...,...,...,...,...,...,...,...
1077,11744,WestCoast,PWC03-2-1_42.jpg,78963297,468.903534,196.045349,"Others (females, young males, juveniles)",Iguanas 3rd launch,,3952ec0af313d58d488dc97beb19c4cb
1479,15812,WestCoast,PWC03-2-1_42.jpg,78963297,431.552338,42.428822,Adult Male with a lek,Iguanas 3rd launch,,d51caf2b0d8a68548a70de2a954d48dd
1480,15813,WestCoast,PWC03-2-1_42.jpg,78963297,465.189301,190.431534,"Others (females, young males, juveniles)",Iguanas 3rd launch,,d51caf2b0d8a68548a70de2a954d48dd
1481,15814,WestCoast,PWC03-2-1_42.jpg,78963297,365.623840,152.758118,"Others (females, young males, juveniles)",Iguanas 3rd launch,,d51caf2b0d8a68548a70de2a954d48dd


In [10]:
# Number of images
df_merged_dataset["subject_id"].nunique()

86

In [11]:
# The dataset after filtering
df_merged_dataset

Unnamed: 0.1,Unnamed: 0,flight_site_code,image_name,subject_id,x,y,tool_label,phase_tag,user_id,user_name
408,4658,SouthCoastH,ESCH01-1_13.jpg,78964714,764.540100,361.271637,Adult Male with a lek,Iguanas 3rd launch,823e0c8bf213199beae68750e41a837a,4253d1b3d5ae39006bb949e6cf2f144e
409,4659,SouthCoastH,ESCH01-1_13.jpg,78964714,767.268921,334.892792,"Others (females, young males, juveniles)",Iguanas 3rd launch,823e0c8bf213199beae68750e41a837a,4253d1b3d5ae39006bb949e6cf2f144e
410,4660,SouthCoastH,ESCH01-1_13.jpg,78964714,769.088196,310.787964,"Others (females, young males, juveniles)",Iguanas 3rd launch,823e0c8bf213199beae68750e41a837a,4253d1b3d5ae39006bb949e6cf2f144e
411,4661,SouthCoastH,ESCH01-1_13.jpg,78964714,754.079529,277.132202,"Others (females, young males, juveniles)",Iguanas 3rd launch,823e0c8bf213199beae68750e41a837a,4253d1b3d5ae39006bb949e6cf2f144e
412,4662,SouthCoastH,ESCH01-1_13.jpg,78964714,808.201660,272.584106,"Others (females, young males, juveniles)",Iguanas 3rd launch,823e0c8bf213199beae68750e41a837a,4253d1b3d5ae39006bb949e6cf2f144e
...,...,...,...,...,...,...,...,...,...,...
5019,64823,WestCoast,PWC03-2-1_42.jpg,78963297,475.094360,192.540146,"Others (females, young males, juveniles)",Iguanas 3rd launch,3fc49e9436026f5e3c1361265a845d75,a1b06094359d2185d0bcee38da83928a
5724,73986,WestCoast,PWC03-2-1_42.jpg,78963297,466.367310,195.423187,"Others (females, young males, juveniles)",Iguanas 3rd launch,c11a32c827347926881e5e1db75cb701,691500ccebe2131f83809524df652f87
5725,73987,WestCoast,PWC03-2-1_42.jpg,78963297,356.656067,149.294586,"Others (females, young males, juveniles)",Iguanas 3rd launch,c11a32c827347926881e5e1db75cb701,691500ccebe2131f83809524df652f87
5726,73988,WestCoast,PWC03-2-1_42.jpg,78963297,421.485443,44.570198,"Others (females, young males, juveniles)",Iguanas 3rd launch,c11a32c827347926881e5e1db75cb701,691500ccebe2131f83809524df652f87


In [12]:
# how many marks per user
df_merged_dataset[["user_id", "x"]].groupby("user_id").count().head()

Unnamed: 0_level_0,x
user_id,Unnamed: 1_level_1
013cb4b55188fc660b8fd7f8dbb9bb8f,9
0156eebf62383fedf03616142d065d39,131
01671783e7cd4124074ed5fc29647828,13
01f32084acb4138c447e1404d765d7b4,2
023383cf6fd2b03c328c2e8054d2ccea,1


### Gold standard data
For reference

In [13]:
df_goldstandard_expert_count["subject_id"].nunique()

87

In [14]:
# Non-null values per column of the gold standard subset
df_goldstandard_expert_count.count()

subject_id    87
Median0s      87
Mean0s        87
Max0s         87
Std0s         86
Median0s.r    87
Mean0s.r      87
Mode0s        87
dtype: int64

In [15]:
# How many images are left in the zooniverse dataset?
len(list(df_merged_dataset.image_name.unique()))

86

In [16]:
# Is there an image in the gold standard which is not in the classifications?
print(f"images in df_goldstandard_expert_count but not in df_merged_dataset: {len(set(df_goldstandard_expert_count.subject_id) - set(df_merged_dataset.subject_id.unique()))}")


images in df_goldstandard_expert_count but not in df_merged_dataset: 1


In [17]:
df_goldstandard_expert_count.subject_id

0     78922029
1     78922093
2     78922433
3     78922625
4     78924089
        ...   
82    78965032
83    78965058
84    78965066
85    78965103
86    78965135
Name: subject_id, Length: 87, dtype: int64

In [18]:
df_expert_count.subject_id


0       78925728
1       78925730
2       78925747
3       78925781
4       78925808
          ...   
1151    78925600
1152    78925604
1153    78925605
1154    78925608
1155    78925614
Name: subject_id, Length: 1156, dtype: int64

In [19]:
df_expert_count.count_total.sum()

388

In [20]:
# How many images are in the filtered flat zooniverse dataset
df_merged_dataset["subject_id"].nunique()

86

In [21]:
## plot some of the marks
from zooniverse.utils.plotting import plot_zooniverse_user_marks_v2

if phase_tag in ["Iguanas 1st launch", "Iguanas 2nd launch"] and (plot_diagrams or show_plots):
    for image_name, df_image_name in df_merged_dataset.groupby("image_name"):
        
        ## plot the marks
        markers_plot_path = plot_zooniverse_user_marks_v2(df_image_name,
                                                          image_path=df_image_name.iloc[0]["image_path"],
                                                          image_name=image_name,
                                                          output_path=output_plot_path, show=show_plots, title=f"Markers for {image_name}", fig_size=(5,5))
        

## Clustering

### Basic statistics like mean, median, mode

In [22]:
from sklearn.metrics import mean_squared_error
from zooniverse.analysis import get_mark_overview

basic_stats = []
kmeans_knee_stats = []
kmeans_silouettes = []
mse_errors = {}


for image_name, df_image_name in df_merged_dataset.groupby("image_name"):
    annotations_count = get_mark_overview(df_image_name)

    annotations_count_stats = get_annotation_count_stats(annotations_count=annotations_count,
                                                         image_name=df_image_name.iloc[0]["image_name"])

    ### basic statistics like mean, median
    basic_stats.append(annotations_count_stats)
    

df_basic_stats = pd.DataFrame(basic_stats)    

## join the gold standard data to the basic stats
if use_gold_standard_subset is not None:
    df_comparison = df_expert_count.merge(df_basic_stats, on='image_name', how='left')
else:
    df_comparison = df_basic_stats


In [23]:
df_basic_stats

Unnamed: 0,image_name,median_count,mean_count,mode_count,users,sum_annotations_count,annotations_count
0,ESCH01-1_13.jpg,18.0,16.46,19,24,395,"[3, 8, 11, 15, 15, 15, 15, 16, 17, 17, 18, 18,..."
1,ESCH01-1_19.jpg,6.5,6.39,7,18,115,"[3, 3, 3, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, ..."
2,ESCH01-1_21.jpg,6.5,5.96,2,28,167,"[2, 2, 2, 2, 2, 2, 3, 3, 5, 5, 5, 6, 6, 6, 7, ..."
3,ESCH01-1_22.jpg,4.0,3.89,5,27,105,"[1, 1, 1, 1, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, ..."
4,ESCH01-1_23.jpg,2.0,2.11,2,27,57,"[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ..."
...,...,...,...,...,...,...,...
81,PWC01-1_61.jpg,14.0,13.25,14,24,318,"[2, 7, 9, 9, 11, 12, 13, 13, 13, 14, 14, 14, 1..."
82,PWC01-1_62.jpg,12.0,11.94,13,16,191,"[10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 13, 1..."
83,PWC01-4_231.jpg,3.0,2.88,1,8,23,"[1, 1, 1, 3, 3, 4, 5, 5]"
84,PWC01-5_34.jpg,1.0,1.46,1,13,19,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3]"
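
The per-image statistics can be reproduced by hand from the list of marks per user. The sketch below uses the standard library on the `annotations_count` values of `PWC01-4_231.jpg` from the table above. That `get_annotation_count_stats` computes them exactly this way is an assumption.

```python
import statistics

# Marks per user for PWC01-4_231.jpg, copied from the table above.
annotations_count = [1, 1, 1, 3, 3, 4, 5, 5]

stats = {
    "users": len(annotations_count),
    "sum_annotations_count": sum(annotations_count),
    "median_count": statistics.median(annotations_count),
    "mean_count": statistics.mean(annotations_count),
    "mode_count": statistics.mode(annotations_count),
}
print(stats)
```

The median and mode are less sensitive than the mean to single users who mark far more or far fewer animals than everyone else.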


In [24]:
df_basic_stats["mode_count"].sum()

302

In [25]:
# There might be records with too few annotations
df_comparison[(df_comparison.count_total > 0) & (df_comparison.sum_annotations_count < 5)].sort_values(by="users", ascending=False)

Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,count_others,...,count_total,quality,condition,comment,median_count,mean_count,mode_count,users,sum_annotations_count,annotations_count


In [26]:
# images with an expert count of more than 0 and less than 5 different users
df_comparison[(df_comparison.count_total > 0) & (df_comparison.users < 5)].sort_values(by="users", ascending=False)


Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,count_others,...,count_total,quality,condition,comment,median_count,mean_count,mode_count,users,sum_annotations_count,annotations_count


In [27]:
df_comparison["count_total"].sum()

388

### Fill NaN values with 0 because the errors can't be calculated otherwise

In [28]:

## Fill NaN values with 0 because the errors can't be calculated otherwise
df_comparison.fillna(0, inplace=True)
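
Why the NaN values appear, and what filling them does, can be shown on toy data. A left merge keeps every expert row, and images without volunteer marks get NaN in the statistics columns. The frames and column values below are made up; restricting `fillna` to the count columns is a suggested variation, not what the cell above does.

```python
import pandas as pd

# Toy stand-ins for the expert table and the per-image statistics.
df_expert = pd.DataFrame({
    "image_name": ["a.jpg", "b.jpg", "c.jpg"],
    "count_total": [4, 0, 1],
    "quality": ["Good", None, "Good"],
})
df_stats = pd.DataFrame({"image_name": ["a.jpg"], "median_count": [3.5]})

# A left merge keeps every expert row; images without volunteer marks get NaN.
df_merged = df_expert.merge(df_stats, on="image_name", how="left")

# Fill only the numeric count columns so text columns keep their missing values.
count_columns = ["median_count"]
df_merged[count_columns] = df_merged[count_columns].fillna(0)
print(df_merged)
```

Filling the whole frame instead also writes 0 into text columns such as `quality`, which is visible in the `df_comparison` output further down.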


In [29]:

mse_errors["median_count_rmse"] = mean_squared_error(df_comparison.count_total, df_comparison.median_count,
                                                     squared=False)
mse_errors["mean_count_rmse"] = mean_squared_error(df_comparison.count_total, df_comparison.mean_count, squared=False)
mse_errors["mode_count_rmse"] = mean_squared_error(df_comparison.count_total, df_comparison.mode_count, squared=False)

pd.Series(mse_errors)

median_count_rmse    0.479466
mean_count_rmse      1.454819
mode_count_rmse      0.657667
dtype: float64
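
The RMSE is the square root of the mean squared difference between expert count and predicted count over all images. A plain-Python sketch with made-up counts is given below; `rmse` is a hypothetical helper, not the scikit-learn function used above. As a side note, recent scikit-learn releases provide `root_mean_squared_error` in place of `mean_squared_error(..., squared=False)`.

```python
import math


def rmse(expert_counts, predicted_counts):
    """Root mean squared error between two equally long sequences of counts."""
    if len(expert_counts) != len(predicted_counts):
        raise ValueError("both sequences need the same length")
    squared_errors = [(e - p) ** 2 for e, p in zip(expert_counts, predicted_counts)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


expert = [2, 0, 4]
predicted = [1, 0, 6]
print(rmse(expert, predicted))  # sqrt((1 + 0 + 4) / 3)

# Appending images where both counts are 0 lowers the RMSE without changing any error.
padded_expert = expert + [0] * 97
padded_predicted = predicted + [0] * 97
print(rmse(padded_expert, padded_predicted))  # sqrt(5 / 100)
```

Since `df_comparison` keeps all rows of the expert table, images where both counts are 0 enter the mean as well, which should be kept in mind when reading the absolute values.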

In [30]:
df_comparison

Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,count_others,...,count_total,quality,condition,comment,median_count,mean_count,mode_count,users,sum_annotations_count,annotations_count
0,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_72.jpg,78925728,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
1,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_74.jpg,78925730,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
2,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_95.jpg,78925747,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
3,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_06.jpg,78925781,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
4,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_38.jpg,78925808,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_66.jpg,78925600,Y,0,0,1,...,1,Good,Hard,0,0.0,0.0,0.0,0.0,0.0,0
1152,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_70.jpg,78925604,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
1153,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_71.jpg,78925605,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0
1154,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_82.jpg,78925608,N,0,0,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0



In [33]:
df_comparison[["median_count", "mean_count", "mode_count", "count_total"]].sum()

median_count    314.50
mean_count      380.51
mode_count      302.00
count_total     388.00
dtype: float64

### DBSCAN clustering, keeping the variant with the best silhouette score for each image


In [34]:
## The old variant
from zooniverse.analysis import compare_dbscan_hyp_v2

eps_variants = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
min_samples_variants = [3, 5, 8, 10]
if debug:
    eps_variants = [0.3]
    min_samples_variants = [3]
params = [(eps, min_samples) for eps in eps_variants for min_samples in min_samples_variants]

db_scan_results = {}
db_scan_best_results = []
db_scan_best_bic_results = []
for image_name, df_image_name in df_merged_dataset.groupby("image_name"):

    dbscan_localization = compare_dbscan_hyp_v2(
        # phase_tag=phase_tag,
        params=params,
        df_flat=df_image_name,
        # output_path=output_path,
        output_plot_path=output_plot_path,
        plot=show_plots,
    )

    db_scan_results[image_name] = pd.DataFrame(dbscan_localization)
    db_scan_best_results.append(pd.DataFrame(dbscan_localization).sort_values("dbscan_silouette_score", ascending=False).iloc[0])

df_dbscan_localization = pd.concat([*db_scan_results.values()])
df_scan_best_results = pd.DataFrame(db_scan_best_results)





In [35]:
df_scan_best_results

Unnamed: 0,dbscan_count,dbscan_noise,dbscan_silouette_score,image_name,eps,min_samples,dbscan_BIC_score
10,18,28,0.616352,ESCH01-1_13.jpg,0.10,8,
4,9,12,0.716935,ESCH01-1_19.jpg,0.05,3,
20,8,1,0.861200,ESCH01-1_21.jpg,0.40,3,
27,4,1,0.931680,ESCH01-1_22.jpg,0.50,10,
16,2,1,0.874491,ESCH01-1_23.jpg,0.30,3,
...,...,...,...,...,...,...,...
27,4,3,0.664809,PWC01-1_61.jpg,0.50,10,
12,7,1,0.806874,PWC01-1_62.jpg,0.20,3,
24,4,3,0.752832,PWC01-4_231.jpg,0.50,3,
4,2,8,0.158432,PWC01-5_34.jpg,0.05,3,


In [36]:
# ## fixes the problem with the silouette score sorting
# from zooniverse.analysis import compare_dbscan_hyp_v2
# 
# eps_variants = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
# min_samples_variants = [3, 5, 8, 10]
# if debug:
#     eps_variants = [0.3]
#     min_samples_variants = [3]
# params = [(eps, min_samples) for eps in eps_variants for min_samples in min_samples_variants]
# 
# db_scan_results = {}
# db_scan_best_results = []
# db_scan_best_bic_results = []
# for image_name, df_image_name in df_merged_dataset.groupby("image_name"):
# 
#     dbscan_localization = compare_dbscan_hyp_v2(
#         # phase_tag=phase_tag,
#         params=params,
#         df_flat=df_image_name,
#         # output_path=output_path,
#         output_plot_path=output_plot_path,
#         plot=show_plots,
#         
#     )
# 
#     db_scan_results[image_name] = pd.DataFrame(dbscan_localization)
#     
#     # DBSCAN tends to classify all points as noise if min_samples is too high. Often only a single user marked an iguana.
#     # The silhouette score needs a minimum of 2 clusters.
#     # If there are points within a decent radius, they will belong to a cluster.
#     if pd.DataFrame(dbscan_localization).dbscan_count.max() == 1:
#         db_scan_best_results.append(pd.DataFrame(dbscan_localization).sort_values("dbscan_count", ascending=False).iloc[0])
#         db_scan_best_bic_results.append(pd.DataFrame(dbscan_localization).sort_values("dbscan_count", ascending=False).iloc[0])
#         # If two or more clusters seem to exist, take the one with the best silhouette score
#     else:
#         # take the best result by silhouette score if there is more than 1 cluster
#         db_scan_best_results.append(pd.DataFrame(dbscan_localization).sort_values(["dbscan_silouette_score", "dbscan_count"], ascending=[False, False]).iloc[0])
#     
# df_dbscan_localization = pd.concat([*db_scan_results.values()])
# df_scan_best_results = pd.DataFrame(db_scan_best_results)



In [37]:
df_scan_best_results

Unnamed: 0,dbscan_count,dbscan_noise,dbscan_silouette_score,image_name,eps,min_samples,dbscan_BIC_score
10,18,28,0.616352,ESCH01-1_13.jpg,0.10,8,
4,9,12,0.716935,ESCH01-1_19.jpg,0.05,3,
20,8,1,0.861200,ESCH01-1_21.jpg,0.40,3,
27,4,1,0.931680,ESCH01-1_22.jpg,0.50,10,
16,2,1,0.874491,ESCH01-1_23.jpg,0.30,3,
...,...,...,...,...,...,...,...
27,4,3,0.664809,PWC01-1_61.jpg,0.50,10,
12,7,1,0.806874,PWC01-1_62.jpg,0.20,3,
24,4,3,0.752832,PWC01-4_231.jpg,0.50,3,
4,2,8,0.158432,PWC01-5_34.jpg,0.05,3,


Here it can be seen why the silhouette score is difficult to use: it is often undefined.
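
The silhouette score is only defined when there are at least two clusters and fewer clusters than points. The sketch below wraps scikit-learn's `silhouette_score` so that it returns NaN instead of raising in the undefined cases. Dropping noise points before scoring and the helper name `safe_silhouette` are assumptions, not necessarily what `compare_dbscan_hyp_v2` does.

```python
import math

import numpy as np
from sklearn.metrics import silhouette_score


def safe_silhouette(points, labels):
    """Silhouette score over non-noise points, or NaN where it is undefined."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    mask = labels != -1  # drop DBSCAN noise
    n_points = int(mask.sum())
    n_clusters = len(set(labels[mask].tolist()))
    # silhouette_score needs 2 <= n_clusters <= n_points - 1
    if n_clusters < 2 or n_clusters > n_points - 1:
        return math.nan
    return float(silhouette_score(points[mask], labels[mask]))


points = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.11), (0.90, 0.90), (0.91, 0.90), (0.90, 0.91)]
two_clusters = safe_silhouette(points, [0, 0, 0, 1, 1, 1])
one_cluster = safe_silhouette(points, [0, 0, 0, 0, 0, 0])
only_noise = safe_silhouette(points, [-1, -1, -1, -1, -1, -1])
print(two_clusters, one_cluster, only_noise)
```

Images with a single iguana, or images where every mark ends up as noise, fall into the undefined cases, so a selection based on this score needs a fallback for them.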

In [38]:
# Save the full hyperparameter grid; df_scan_best_results holds the combinations which maximized the silhouette score.

df_dbscan_localization.to_csv(config["dbscan_hyperparam_grid"])
df_scan_best_results

Unnamed: 0,dbscan_count,dbscan_noise,dbscan_silouette_score,image_name,eps,min_samples,dbscan_BIC_score
10,18,28,0.616352,ESCH01-1_13.jpg,0.10,8,
4,9,12,0.716935,ESCH01-1_19.jpg,0.05,3,
20,8,1,0.861200,ESCH01-1_21.jpg,0.40,3,
27,4,1,0.931680,ESCH01-1_22.jpg,0.50,10,
16,2,1,0.874491,ESCH01-1_23.jpg,0.30,3,
...,...,...,...,...,...,...,...
27,4,3,0.664809,PWC01-1_61.jpg,0.50,10,
12,7,1,0.806874,PWC01-1_62.jpg,0.20,3,
24,4,3,0.752832,PWC01-4_231.jpg,0.50,3,
4,2,8,0.158432,PWC01-5_34.jpg,0.05,3,


In [39]:
df_scan_best_results.rename(columns={"dbscan_count": "dbscan_count_sil" }, inplace=True)

df_comparison = df_comparison.merge(df_scan_best_results, on='image_name', how='left')

In [40]:
df_comparison.fillna(0, inplace=True)

mse_errors["dbscan_count_sil_rmse"] = mean_squared_error(df_comparison.count_total, df_comparison.dbscan_count_sil, squared=False)

pd.Series(mse_errors)

median_count_rmse        0.479466
mean_count_rmse          1.454819
mode_count_rmse          0.657667
dbscan_count_sil_rmse    0.658981
dtype: float64

In [41]:

df_comparison = df_comparison.drop(["dbscan_noise", "dbscan_silouette_score", "eps", "min_samples", "dbscan_BIC_score", "with_noise", "bic_avg"], axis=1, errors="ignore")
df_comparison

Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,count_others,...,quality,condition,comment,median_count,mean_count,mode_count,users,sum_annotations_count,annotations_count,dbscan_count_sil
0,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_72.jpg,78925728,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
1,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_74.jpg,78925730,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
2,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_95.jpg,78925747,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
3,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_06.jpg,78925781,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
4,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_38.jpg,78925808,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_66.jpg,78925600,Y,0,0,1,...,Good,Hard,0,0.0,0.0,0.0,0.0,0.0,0,0.0
1152,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_70.jpg,78925604,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
1153,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_71.jpg,78925605,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0
1154,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_82.jpg,78925608,N,0,0,0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0,0.0


### HDBSCAN clustering for each image

The [scikit-learn HDBSCAN documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN) describes the `cluster_selection_epsilon` parameter as: "A distance threshold. Clusters below this value will be merged."



In [42]:
from zooniverse.analysis import HDBSCAN_Wrapper

hdbscan_values = []

eps_variants = [0.0] # 0 is the default
min_cluster_sizes = [5] # 5 is the default


for image_name, df_image_name in df_merged_dataset.groupby("image_name"):
    annotations_count = get_mark_overview(df_image_name)
    annotations_count_stats = get_annotation_count_stats(annotations_count=annotations_count,
                                                         image_name=df_image_name.iloc[0]["image_name"])
    
    # HDBSCAN cannot form a cluster from fewer points than min_cluster_size, so such images are skipped
    if df_image_name.shape[0] >= min(min_cluster_sizes):
        params = [(eps, min_cluster_size, max_cluster_size) 
                    for eps in eps_variants
                    for min_cluster_size in min_cluster_sizes
                    for max_cluster_size in [None]
              ]

        df_hdbscan = HDBSCAN_Wrapper(df_marks=df_image_name[["x", "y"]],
                                     output_path=output_plot_path,
                                     plot=show_plots,
                                     show=show_plots,
                                     image_name=image_name,
                                     params=params)
        hdbscan_values.append(df_hdbscan)


df_hdbscan = pd.concat(hdbscan_values)



In [43]:
df_hdbscan.drop(["with_noise"], axis=1, inplace=True)
df_hdbscan

Unnamed: 0,image_name,HDBSCAN_count,eps,min_cluster_size,max_cluster_size,noise_points
0,ESCH01-1_13.jpg,19,0.0,5,,26
0,ESCH01-1_19.jpg,9,0.0,5,,8
0,ESCH01-1_21.jpg,10,0.0,5,,5
0,ESCH01-1_22.jpg,5,0.0,5,,6
0,ESCH01-1_23.jpg,2,0.0,5,,0
...,...,...,...,...,...,...
0,PWC01-1_61.jpg,14,0.0,5,,36
0,PWC01-1_62.jpg,11,0.0,5,,5
0,PWC01-4_231.jpg,3,0.0,5,,0
0,PWC01-5_34.jpg,1,0.0,5,,12


In [44]:
df_comparison = df_comparison.merge(df_hdbscan, on='image_name', how='left')
df_comparison.fillna(0, inplace=True)
df_comparison

Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,count_others,...,mode_count,users,sum_annotations_count,annotations_count,dbscan_count_sil,HDBSCAN_count,eps,min_cluster_size,max_cluster_size,noise_points
0,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_72.jpg,78925728,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_74.jpg,78925730,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
2,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_95.jpg,78925747,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
3,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_06.jpg,78925781,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
4,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_38.jpg,78925808,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_66.jpg,78925600,Y,0,0,1,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1152,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_70.jpg,78925604,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1153,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_71.jpg,78925605,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1154,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_82.jpg,78925608,N,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0


In [45]:
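# assign the result back instead of calling fillna(inplace=True) on a column accessor,
# which may operate on a temporary copy and leave df_comparison unchanged
df_comparison["count_total"] = df_comparison["count_total"].fillna(0)
df_comparison["HDBSCAN_count"] = df_comparison["HDBSCAN_count"].fillna(0)

df_comparison.to_csv(config["comparison_dataset"])
print(f"saved {config['comparison_dataset']}")

# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
mse_errors["hdbscan_count_rmse"] = mean_squared_error(df_comparison.count_total, df_comparison.HDBSCAN_count) ** 0.5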


saved /Users/christian/data/zooniverse/2024_03_22_expert_goldstandard_analysis/Iguanas 3rd launch/Iguanas 3rd launch_method_comparison.csv


# A look into the results
Root Mean Squared Error (RMSE) of each method, measured against the expert counts.

In [46]:
df_rmse = pd.DataFrame(pd.Series(mse_errors).sort_values())
df_rmse.to_csv(config["rmse_errors"])
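
For reference, the RMSE is the square root of the mean squared per-image difference between the expert count and a method's count. The pure-Python sketch below uses made-up counts (`expert`, `method_a` and `method_b` are illustrative, not taken from the dataset) to show why it is reported alongside the sums: over- and undercounts cancel in a sum but not in the RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences of counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("sequences must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative counts for four images (not real data).
expert = [2, 0, 3, 1]
method_a = [2, 0, 2, 1]  # misses one iguana in a single image
method_b = [4, 0, 1, 1]  # overcounts one image, undercounts another

for name, counts in [("method_a", method_a), ("method_b", method_b)]:
    print(name, "sum:", sum(counts), "rmse:", round(rmse(expert, counts), 3))
```

Here `method_b` reproduces the expert total of 6 exactly, yet its RMSE is larger than that of `method_a`, whose total is off by one.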


## Total count per method
What total count does each method yield across all images?

In [47]:
df_comparison.subject_id.nunique()

1156

In [48]:

df_comparison_sum = df_comparison[["count_total", "median_count", "mean_count", "mode_count", "dbscan_count_sil", "HDBSCAN_count"]].sum().sort_values()
df_comparison_sum.to_csv(config["method_sums"])


In [49]:
print(f"phase_tag: {phase_tag}, user_threshold: {user_threshold}")

phase_tag: Iguanas 3rd launch, user_threshold: 3


## Compare the numbers
The counts are only for images which were in the dataset after filtering.

### Sum of all the Methods

In [50]:
print(f"{config['method_sums'].name}")
pd.read_csv(config["method_sums"])

Iguanas 3rd launch_method_sums.csv


Unnamed: 0.1,Unnamed: 0,0
0,mode_count,302.0
1,median_count,314.5
2,dbscan_count_sil,318.0
3,HDBSCAN_count,358.0
4,mean_count,380.51
5,count_total,388.0


### Root Mean Squared Error

In [51]:
print(f"{config['rmse_errors'].name}")
pd.read_csv(config["rmse_errors"])

Iguanas 3rd launch_rmse_errors.csv


Unnamed: 0.1,Unnamed: 0,0
0,hdbscan_count_rmse,0.390191
1,median_count_rmse,0.479466
2,mode_count_rmse,0.657667
3,dbscan_count_sil_rmse,0.658981
4,mean_count_rmse,1.454819


### Comparison per Image Level

In [52]:
print(f"load {config['comparison_dataset']}")
pd.read_csv(config["comparison_dataset"])

load /Users/christian/data/zooniverse/2024_03_22_expert_goldstandard_analysis/Iguanas 3rd launch/Iguanas 3rd launch_method_comparison.csv


Unnamed: 0.1,Unnamed: 0,subspecies,island,site_name,subject_group,image_name,subject_id,presence_absence,count_male-lek,count_male-no-lek,...,mode_count,users,sum_annotations_count,annotations_count,dbscan_count_sil,HDBSCAN_count,eps,min_cluster_size,max_cluster_size,noise_points
0,0,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_72.jpg,78925728,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1,1,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_74.jpg,78925730,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
2,2,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN02_95.jpg,78925747,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
3,3,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_06.jpg,78925781,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
4,4,A. c. hayampi,Marchena,BahiaNegra,MBN1,MBN03-2_38.jpg,78925808,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1151,1151,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_66.jpg,78925600,Y,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1152,1152,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_70.jpg,78925604,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1153,1153,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_71.jpg,78925605,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0
1154,1154,A. c. nanus,Genovesa,WestCoastB,GWB,GWB01-3_82.jpg,78925608,N,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0


## Discussion:
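For this run (3rd launch, user threshold 3), HDBSCAN clustering works: with an RMSE of 0.39 against the expert counts it beats the volunteers' median (0.48), mode (0.66) and mean (1.45) annotation counts, most likely because it takes the spatial location of the marker dots into consideration. The `dbscan_count_sil` variant (0.66) is only on par with the mode. In the totals, the mean count (380.51) happens to be closest to the expert total (388), but its high RMSE suggests that this comes from per-image errors cancelling out.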
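
To illustrate why the spatial location matters, the sketch below compares a per-user median with a spatial grouping on hypothetical marks. It uses a deliberately simplified distance-threshold grouping (connected components with a minimum size) as a stand-in for DBSCAN/HDBSCAN; it is not the notebook's `HDBSCAN_Wrapper`, and `count_spatial_groups`, `eps`, `min_size` and the marks are assumptions made for this example.

```python
import math
from statistics import median


def count_spatial_groups(points, eps, min_size):
    """Count groups of points chained by distances <= eps with at least min_size members."""
    unvisited = set(range(len(points)))
    n_groups = 0
    while unvisited:
        # Grow one connected component from an arbitrary unvisited point.
        stack = [unvisited.pop()]
        size = 0
        while stack:
            i = stack.pop()
            size += 1
            neighbours = [j for j in unvisited
                          if math.dist(points[i], points[j]) <= eps]
            for j in neighbours:
                unvisited.remove(j)
            stack.extend(neighbours)
        if size >= min_size:
            n_groups += 1  # smaller groups are treated as noise
    return n_groups


# Hypothetical marks (user, x, y): three iguanas near (10, 10), (50, 50) and (90, 20).
# Every volunteer misses one of them, and user "u1" adds one stray mark.
marks = [
    ("u1", 10, 11), ("u1", 51, 50), ("u1", 200, 200),
    ("u2", 49, 51), ("u2", 90, 21),
    ("u3", 11, 10), ("u3", 91, 20),
]

per_user = {}
for user, _, _ in marks:
    per_user[user] = per_user.get(user, 0) + 1

median_count = median(per_user.values())
cluster_count = count_spatial_groups([(x, y) for _, x, y in marks], eps=5, min_size=2)
print("median of per-user counts:", median_count)
print("spatial groups:", cluster_count)
```

In this toy case the median of the per-user counts is 2, while the spatial grouping recovers all 3 marked locations and discards the stray mark as noise.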



Assert the numbers haven't changed

In [53]:
df_comparison_sum

mode_count          302.00
median_count        314.50
dbscan_count_sil    318.00
HDBSCAN_count       358.00
mean_count          380.51
count_total         388.00
dtype: float64

In [54]:
df_comparison.subject_id.nunique()

1156

In [55]:
# these are the numbers before the sorting was repaired
if phase_tag == "Iguanas 1st launch" and not debug:
   
    if user_threshold == 3:
        assert df_comparison_sum["median_count"] == 228.5
        assert df_comparison_sum["HDBSCAN_count"] == 244
        assert df_comparison_sum["count_total"] == 422
        
        assert df_comparison.subject_id.nunique() == 2733

if phase_tag == "Iguanas 2nd launch" and not debug:

    if user_threshold == 3:
        assert df_comparison_sum["median_count"] == 475
        assert df_comparison_sum["HDBSCAN_count"] == 541.0
        assert df_comparison_sum["count_total"] == 600
        
        assert df_comparison.subject_id.nunique() == 456

if phase_tag == "Iguanas 3rd launch" and not debug:
        
    if user_threshold == 3:
        assert df_comparison_sum["median_count"] == 314.5
        assert df_comparison_sum["HDBSCAN_count"] == 358
        assert df_comparison_sum["count_total"] == 388
        
        assert df_comparison.subject_id.nunique() == 1156
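
The phase-specific checks above could also be kept in a lookup table, so that the expected values live in one place. The sketch below is one possible arrangement, not the notebook's code: `EXPECTED` and `check_sums` are hypothetical names, the values are copied from the asserts above, and `math.isclose` replaces exact float equality.

```python
import math

# Expected totals per (phase_tag, user_threshold), copied from the asserts above.
EXPECTED = {
    ("Iguanas 1st launch", 3): {"median_count": 228.5, "HDBSCAN_count": 244,
                                "count_total": 422, "n_subjects": 2733},
    ("Iguanas 2nd launch", 3): {"median_count": 475, "HDBSCAN_count": 541,
                                "count_total": 600, "n_subjects": 456},
    ("Iguanas 3rd launch", 3): {"median_count": 314.5, "HDBSCAN_count": 358,
                                "count_total": 388, "n_subjects": 1156},
}


def check_sums(phase_tag, user_threshold, sums, n_subjects):
    """Assert that the totals match the recorded values.

    Returns False if nothing is recorded for this configuration, True otherwise.
    """
    expected = EXPECTED.get((phase_tag, user_threshold))
    if expected is None:
        return False
    for key, value in expected.items():
        actual = n_subjects if key == "n_subjects" else sums[key]
        assert math.isclose(actual, value), f"{key}: expected {value}, got {actual}"
    return True


# A plain dict stands in for df_comparison_sum in this example.
sums = {"median_count": 314.5, "HDBSCAN_count": 358.0, "count_total": 388.0}
ok = check_sums("Iguanas 3rd launch", 3, sums, n_subjects=1156)
print("checked:", ok)
```

In the notebook, the call would presumably take `df_comparison_sum` and `df_comparison.subject_id.nunique()` as arguments and be skipped when `debug` is set.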
