# Recognizability bias in citizen science photographs
This notebook executes all steps required to train several species recognition models on GBIF data, and analyzes the relationship between species-specific performance, citizen science data availability, and species characteristics.

## Overview
- Configure the setup:
    - Tensorflow settings, directory paths
    - Numbers of datasets, dataset sizes, splits, steps 
- Explore the data:
    - Load a GBIF Darwin Core Archive (DwCA)
    - Given the settings, summarize what a division per taxonomic rank would provide in terms of training data
- Prepare the data:
    - Based on a chosen number of species per group and taxon rank, make the groups based on the DwCA
- Train the models:
    - Find each job and train a model for it
- Evaluate the models:
    - Use the test data put aside at the data preparation step to evaluate performances
    - Save all performance indicators to a central file
- Gather metrics based on annotation, biological traits, and citizen science data:
    - Save a number of plots based on these metrics
    - Train and evaluate LASSO models
    - Gather other metrics for manuscript

## Import the required dependencies

In [None]:
import os
from dotenv import load_dotenv
load_dotenv(verbose=True)

## General settings
- Make sure you have a .env file in the root directory. See the readme for details.
- The DwCA file is assumed to contain only observations with images
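
Before running the remaining cells, it can help to confirm that the variables from the .env file are actually set. The sketch below is not part of the original pipeline: `missing_env_vars` is a hypothetical helper, and the list of names is only what the cells in this notebook read (the readme may list others).

```python
import os

# Variable names read by the cells in this notebook (assumption: the readme
# may require additional ones).
REQUIRED_VARS = ("STORAGE_DIR", "TAXONOMY_DIR", "ANNOTATION_API", "ANNOTATION_TOKEN")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` reports a missing path or token immediately, rather than hours into training.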

In [None]:
# Paths are built from the environment variables loaded from the ".env" file
dwca_file = os.path.join(os.getenv('STORAGE_DIR'), 'GBIF.zip')  # The path to the DwCA file with all data.

train_val_test_threshold = 220  # The minimum number of observations needed per species for train + validation + test
train_val_threshold = 200  # The minimum number of observations needed per species for train + validation (for the first, largest test set)
reduced_factor = .5  # Every time the number of observations is reduced, it is reduced to this proportion of the previous amount
validation_proportion = .1  # The proportion of observations of the train_val set reserved for validation
groups = 12  # The number of taxon groups to generate and compare between
observations_minimum = train_val_threshold  # The smallest subset of observations to train models on. Equal to the largest set so that only the max size is trained
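
To make the interplay of these settings concrete, the sketch below lists the subset sizes implied by `reduced_factor` and `observations_minimum`, and the train/validation split implied by `validation_proportion`. It follows the comments above only; how `create_jobs` actually derives its subsets is not shown in this notebook, so `subset_sizes` and `split_train_val` are hypothetical helpers.

```python
def subset_sizes(largest, factor, minimum):
    """List subset sizes from `largest` downwards, shrinking by `factor`
    until the size would drop below `minimum`."""
    if not 0 < factor < 1 or minimum < 1:
        raise ValueError("factor must be in (0, 1) and minimum at least 1")
    sizes = []
    size = largest
    while size >= minimum:
        sizes.append(size)
        size = int(size * factor)
    return sizes


def split_train_val(size, validation_proportion):
    """Split a train+validation subset into (train, validation) counts."""
    n_val = round(size * validation_proportion)
    return size - n_val, n_val


# With the settings above, only the largest subset remains:
print(subset_sizes(200, 0.5, 200))   # [200]
print(split_train_val(200, 0.1))     # (180, 20)
```

Lowering `observations_minimum` to 25, for example, would give the sizes 200, 100, 50, and 25 under this reading.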


This script reports the results of grouping on any taxonomic level:

- How many groups can be formed given the threshold, out of the total number of groups at that level, and how many species the smallest groups would contain.
- It also creates CSV files with counts per species, and species per taxonomic group.

This step provides no direct input for any following steps, and is meant to help you choose the settings in the next step. It does return the created species csv file for subsequent use.

In [None]:
from Scripts.grouping import propse_groups
species_csv = propse_groups(dwca_file, train_val_test_threshold, groups)
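
The kind of summary this step reports can be sketched as follows. This is an illustration under assumptions, not the code in `Scripts.grouping`: `summarize_rank` is a hypothetical helper, and the records are invented toy values rather than GBIF data.

```python
from collections import Counter


def summarize_rank(records, threshold, n_groups):
    """Summarize one taxonomic rank.

    `records` is an iterable of (taxon_group, species, n_observations).
    A species qualifies if it has at least `threshold` observations.
    """
    records = list(records)
    all_groups = {group for group, _species, _n in records}
    qualifying = Counter(group for group, _species, n in records if n >= threshold)
    top = qualifying.most_common(n_groups)
    # Species count of the smallest of the `n_groups` largest groups,
    # or 0 if fewer than `n_groups` groups have any qualifying species.
    smallest = top[-1][1] if len(top) == n_groups else 0
    return {
        "total_groups": len(all_groups),
        "usable_groups": len(qualifying),
        "smallest_selected_group": smallest,
    }


# Invented toy records: (group, species, observation count)
toy = [("A", "a1", 300), ("A", "a2", 250), ("A", "a3", 100),
       ("B", "b1", 220), ("C", "c1", 10)]
print(summarize_rank(toy, threshold=220, n_groups=2))
```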

For the dataset we used, the rank of order gives the best division, with at least 18 species per order, so this is the level we will use. Retrieve the relevant taxon groups, and create all training jobs for all taxa.

In [None]:
# For our dataset, we choose the level of order, giving a minimum of 18 species
from Scripts.create_jobs import create_jobs
from Scripts.grouping import get_groups

number_of_species = 18
taxonlevel = 'order'

grouping_csv = get_groups(dwca_file, train_val_test_threshold, number_of_species, taxonlevel)
create_jobs(0, train_val_threshold, reduced_factor, validation_proportion, grouping_csv, species_csv, dwca_file, observations_minimum)


## Model training

Train the models defined in the job creation step. Models that already exist are skipped. Training can take a long time, depending on your hardware, the number of models, and the dataset sizes.

In [None]:
from Scripts.train import train_models
train_models()


## Model evaluation

Evaluation requires a "Species (total).csv", containing a "Taxon" and a "Number of species" column (species per taxon group for the whole area of interest). The code below generates this file for a Norwegian context if it does not exist already, based on files retrieved from http://www2.artsdatabanken.no/artsnavn/Contentpages/Eksport.aspx. This tool is as generic as possible, expecting a folder of csv files. Some tweaking will be required for other contexts.

All models generated in the previous step are now evaluated using the test sets separated in the data preparation step. This will take a while. When that is done, all performance and bias metrics needed will be collected and stored in .csv files.


In [None]:
from Scripts.evaluate import evaluate
from Tools.count_species_per_taxon import count_species_per_taxon

count_species_per_taxon(
    encoding='cp1252',
    sep=';',
    output_file='Species (total).csv',
    grouping_csv=grouping_csv,
    language='no'
)
    
evaluate()
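
The later cells rely on an "F1 (mean)" column per species. How `evaluate` computes and averages it is not shown in this notebook; as a minimal sketch of the underlying per-class F1 score, with `f1_per_class` a hypothetical helper:

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, not real predictions
scores = f1_per_class(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores)
```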

Retrieve species characteristics for the bird species (orders Anseriformes, Charadriiformes, and Passeriformes) and save them as a .csv file

In [None]:
from Scripts.retrieve_species_characteristics import retrieve
import pandas as pd

# Keep only the species in the three bird orders
df = pd.read_csv(os.path.join(os.getenv("STORAGE_DIR"), "GBIF - observations per species.csv"))
df = df[df["order"].isin(["Anseriformes", "Charadriiformes", "Passeriformes"])]

bird_species = df["scientificName"].to_list()
retrieve(bird_species)

Get annotations from the Label Studio API, the number of images per observation with images, and gather all data in a central csv file.

In [None]:
from Tools.get_annotation_stats import get_annotation_stats
from Scripts.regression import collect_data, img_per_obs

get_annotation_stats(folder=os.environ["STORAGE_DIR"], api=os.environ["ANNOTATION_API"], token=os.environ["ANNOTATION_TOKEN"], pixels=299)
img_per_obs(os.path.join(os.environ["STORAGE_DIR"], "GBIF.zip"), os.environ["STORAGE_DIR"])
collect_data(folder=os.environ["STORAGE_DIR"])

Create plots for the manuscript

In [None]:
from Scripts.plots import collected_stats_scatter, collected_stats_slopes

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "Images (log)", "F1 (mean)",
                        taxa=["Anseriformes", "Charadriiformes", "Passeriformes"], highlight=["Anser serrirostris", "Aix galericulata", "Larus cachinnans", "Charadrius morinellus", "Linaria flavirostris", "Perisoreus infaustus"])


collected_stats_slopes(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "Images (log)", "F1 (mean)",
                            taxa=["Coleoptera", "Diptera", "Lepidoptera", "Odonata", "Lecanorales", "Agaricales", "Polyporales", "Asterales", "Asparagales"])

Gather metrics (using the scatterplots with regressions)

In [None]:
collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "png", "Images (log)", "F1 (mean)",
                        taxa=["Agaricales", "Asparagales", "Asterales"])

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "png", "Images (log)", "F1 (mean)",
                        taxa=["Coleoptera","Diptera","Lecanorales"])

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "png", "Images (log)", "F1 (mean)",
                        taxa=["Lepidoptera", "Odonata", "Polyporales"])

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "img per obs", "F1 (mean)",
                        taxa=["Anseriformes", "Charadriiformes", "Passeriformes"])
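
The regressions in these scatterplots relate a log-transformed image count to the mean F1 score. How `collected_stats_scatter` fits them is not shown in this notebook; the sketch below assumes an ordinary least-squares line and a base-10 logarithm (both assumptions), uses a hypothetical helper `fit_line`, and runs on invented toy values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented toy values: image counts and F1 scores for three species
images = [10, 100, 1000]
f1 = [0.2, 0.4, 0.6]
slope, intercept = fit_line([math.log10(v) for v in images], f1)
print(f"F1 changes by {slope:.2f} per tenfold increase in images")
```

On a log scale the slope reads as the change in F1 per tenfold increase in images, which is what makes slopes comparable between taxa.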

Train LASSO models and gather metrics

In [None]:
from Scripts.regression import lasso

lasso(os.environ["STORAGE_DIR"], y="F1 (mean)", x=['Images (log)'])

lasso(os.environ["STORAGE_DIR"], y="F1 (mean)", x=['Habitat', 'HWI', 'Body mass (log)', 'Images (log)'])
lasso(os.environ["STORAGE_DIR"], y="F1 (mean)", x=['info pixels (log)', 'Images (log)'])
lasso(os.environ["STORAGE_DIR"], y="F1 (mean)", x=['Proportion AO vs TOVe', 'Proportion AO+img vs AO', 'img per obs', 'Images (log)'])
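
For readers unfamiliar with LASSO: it is linear regression with an L1 penalty, which shrinks coefficients and sets weak ones exactly to zero. The internals of `Scripts.regression.lasso` are not shown here and may well use a library. The from-scratch sketch below uses coordinate descent with soft-thresholding on the objective (1/(2n))·||y − Xb − b0||² + alpha·||b||₁. The function names and toy data are illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Fit LASSO by cyclic coordinate descent; returns (coefficients, intercept)."""
    n, p = len(X), len(X[0])
    # Center the data so the intercept can be recovered afterwards
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(Xc[i][k] * coef[k] for k in range(p) if k != j)
                rho += Xc[i][j] * (yc[i] - pred_without_j)
                z += Xc[i][j] ** 2
            coef[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_means))
    return coef, intercept


# Toy data: y = 2 * x; a larger alpha shrinks the coefficient towards zero
X_toy, y_toy = [[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0]
for alpha in (0.0, 2 / 3, 2.0):
    print(alpha, lasso_coordinate_descent(X_toy, y_toy, alpha))
```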


Gather bird metrics (using the scatterplots with regressions)

In [None]:
collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "img per obs", "Proportion AO+img vs AO",
                        taxa=["Anseriformes", "Charadriiformes", "Passeriformes"])

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "Proportion AO+img vs AO", "F1 (mean)",
                        taxa=["Anseriformes", "Charadriiformes", "Passeriformes"])

collected_stats_scatter(os.environ["STORAGE_DIR"], "out.csv", os.path.join(os.environ["STORAGE_DIR"], "Graphs"), "svg", "Habitat", "info pixels (log)",
                        taxa=["Anseriformes", "Charadriiformes", "Passeriformes"])

Compare numbers of species in the official taxonomy and the training data

In [None]:
from Scripts.retrieve_species_characteristics import count_species

dwca_file = os.path.join(os.getenv('STORAGE_DIR'), 'GBIF.zip')  # The path to the DwCA file with all data.

count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Animalia.csv'), dwca=dwca_file, order=["Anseriformes", "Charadriiformes", "Passeriformes"])
count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Animalia.csv'), dwca=dwca_file, order=["Diptera"])

count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Animalia.csv'), dwca=dwca_file, order=["Coleoptera"])
count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Animalia.csv'), dwca=dwca_file, order=["Lepidoptera"])
count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Animalia.csv'), dwca=dwca_file, order=["Odonata"])
count_species(os.path.join(os.getenv("TAXONOMY_DIR"), 'Fungi.csv'), dwca=dwca_file, order=["Lecanorales"])