# Maximizing citizen scientists' contribution to automated species recognition
This notebook runs all steps required to train a series of species recognition models on GBIF data, and to analyze the results in terms of the Value of Information of additional data.

## Overview
- Configure the setup:
    - TensorFlow settings, directory paths
    - Numbers of datasets, dataset sizes, splits, steps
- Explore the data:
    - Load a GBIF Darwin Core Archive (DwCA)
    - Given the settings, summarize what a division per taxonomic rank would provide in terms of training data
- Prepare the data:
    - Based on a chosen number of species per group and taxon rank, make the groups based on the DwCA
    - For each group, create subsets of species and observations for models to be trained on
- Train the models:
    - Find each job and train a model for it
- Evaluate the models:
    - Use the test data set aside in the data preparation step to evaluate performance
    - Save all performance indicators to a central file
    - Save a number of plots based on these metrics

## Import the required dependencies

In [None]:
# Load the dependencies

import os
from dotenv import load_dotenv
from Tools.count_species_per_taxon import count_species_per_taxon

from Scripts.voi_grouping import propse_groups, get_groups
from Scripts.voi_create_jobs import create_jobs
from Scripts.voi_train import train_models
from Scripts.voi_evaluate import run_all_test_sets, collect_metrics, create_bias_plots, plot_mean_curve_per_taxon_together, plot_means_derivatives_together


## General settings
- Make sure you have a .env file in the root directory. See the readme for details.
- The DwCA file is assumed to contain only observations with images.
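
Before running the cells below, it can help to check that the variables this notebook reads are actually set. The following is a minimal sketch, not part of the project code: `STORAGE_DIR` and `JOBS_DIR` are the two variables referenced in this notebook, the helper name `missing_env_vars` is hypothetical, and the readme may list further variables.

```python
import os

# The two variables this notebook reads via os.getenv / os.environ.get.
REQUIRED_VARS = ("STORAGE_DIR", "JOBS_DIR")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Report anything that still needs to be added to the .env file.
missing = missing_env_vars()
if missing:
    print("Missing in .env or environment:", ", ".join(missing))
```

Calling this after `load_dotenv` gives a clearer message than the `TypeError` that `os.path.join` raises when it receives `None`.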

In [None]:
# Load the env variables from the file ".env"
load_dotenv(verbose=True)

# Settings for the generation of jobs
dwca_file = os.path.join(os.getenv('STORAGE_DIR'), 'GBIF.zip')  # The path to the DwCA file with all data.
train_val_test_threshold = 220  # The minimum number of observations needed per species for train + validation + test
train_val_threshold = 200  # The minimum number of observations needed per species for train + validation (for the first, largest test set)
validation_proportion = .1  # The proportion of observations of the train_val set reserved for validation
reduced_factor = .75  # Every time the number of observations is reduced, it is reduced to this proportion of the previous amount
groups = 12  # The number of taxon groups to generate and compare between
observations_minimum = 10  # The smallest subset of observations to train models on
duplicates = 5  # The number of independent runs per taxon group, each based on a single species subset and containing all dataset sizes

## Data exploration
This step reports the results of grouping at any taxonomic rank:
- How many groups can be made given the threshold provided, out of the total number of groups at that rank, and how many species the smallest of those groups would contain.
- It also creates csv files with counts per species, and species per taxonomic group.

The report itself is not an input for any of the following steps; it is meant to help you choose the
settings in the next step. The step does return the created species csv file for subsequent use.
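
To illustrate the kind of summary described above, here is a minimal sketch using only the standard library. It is not the implementation in `Scripts.voi_grouping`: the function names and the toy `(group, species)` input format are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def species_per_group(observations, threshold):
    """Count, per taxon group, the species with at least `threshold` observations.

    `observations` is an iterable of (group, species) pairs, one per observation.
    """
    per_species = Counter(observations)
    groups = defaultdict(int)
    for (group, _species), n_observations in per_species.items():
        if n_observations >= threshold:
            groups[group] += 1
    return dict(groups)


def summarize(observations, threshold, n_groups):
    """Return the `n_groups` groups with the most qualifying species,
    and the number of species in the smallest of those groups."""
    counts = species_per_group(observations, threshold)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n_groups]
    smallest = ranked[-1][1] if ranked else 0
    return ranked, smallest


# Toy data: group "A" has two species with enough observations, group "B" has one.
toy = [("A", "a1")] * 3 + [("A", "a2")] * 2 + [("B", "b1")] * 3 + [("B", "b2")]
ranked, smallest = summarize(toy, threshold=2, n_groups=2)
print(ranked, smallest)
```

The smallest group among the selected ones bounds the number of species that can be sampled per group, which is why it matters for the settings in the next step.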

In [None]:
species_csv = propse_groups(dwca_file, train_val_test_threshold, groups)

## Data preparation
For the dataset we used, the rank of order gives the best division, with at least 17 species per order, so this is the level we will use. Retrieve the relevant taxon groups, and create all training jobs for all duplicates for all taxa.
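
The settings `train_val_threshold`, `reduced_factor`, `observations_minimum` and `validation_proportion` together imply a series of shrinking dataset sizes. The sketch below shows one way such a series could be derived; it is an assumption about the general idea, not the logic of `create_jobs`, and in particular the rounding (truncation for sizes, `round` for the validation split) is a guess.

```python
def dataset_sizes(start, factor, minimum):
    """Return observation counts per species, shrinking by `factor`
    each step and stopping before the size drops below `minimum`."""
    sizes = []
    size = float(start)
    while size >= minimum:
        sizes.append(int(size))  # truncation is an assumption of this sketch
        size *= factor
    return sizes


def split_train_val(n_observations, validation_proportion):
    """Split a number of observations into (train, validation) counts."""
    n_val = round(n_observations * validation_proportion)
    return n_observations - n_val, n_val


# Values taken from the general settings cell of this notebook.
sizes = dataset_sizes(start=200, factor=0.75, minimum=10)
for n in sizes:
    print(n, split_train_val(n, 0.1))
```

Each entry corresponds to one dataset size within a run, so the number of models per taxon group grows with the length of this series times `duplicates`.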

In [None]:
number_of_species = 17
taxonlevel = 'order'

# Create a csv file in the folder of the DwCA file with groups of the chosen level and the specified minimum number
# of species, and store the path to that csv.
grouping_csv = get_groups(dwca_file, train_val_test_threshold, number_of_species, taxonlevel)


# Create jobs for each of the chosen groups. This step can be repeated to generate more jobs (with different random
# subsets of species and observations).
for i in range(duplicates):
    create_jobs(number_of_species, train_val_threshold, reduced_factor, validation_proportion, grouping_csv,
                species_csv, dwca_file, observations_minimum)

## Model training
Train the models defined in the job creation step. Jobs for which a model already exists are skipped. Training can take a long time, depending on your hardware, the number of models and the dataset sizes.
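
Skipping existing models makes the training step safe to rerun after an interruption. As a rough sketch of that pattern, assuming one directory per job and a model file named `model.h5` inside it (both are hypothetical; the real layout used by `train_models` may differ):

```python
from pathlib import Path


def jobs_to_train(jobs_dir, model_filename="model.h5"):
    """Return job directories under `jobs_dir` that do not yet contain a model file.

    The one-directory-per-job layout and `model_filename` are assumptions.
    """
    job_dirs = sorted(p for p in Path(jobs_dir).iterdir() if p.is_dir())
    return [p for p in job_dirs if not (p / model_filename).exists()]
```

A training loop would then iterate over `jobs_to_train(...)` only, leaving finished jobs untouched.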


In [None]:
train_models()

## Model evaluation
Evaluation requires a "Species (total).csv", containing a "Taxon" and a "Number of species" column (species per
taxon group for the whole area of interest). The code below generates this file for a Norwegian context if it does
not exist already, based on files retrieved from http://www2.artsdatabanken.no/artsnavn/Contentpages/Eksport.aspx.
This tool is as generic as possible, expecting a folder of csv files. Some tweaking will be required for other
contexts.

All models generated in the previous step are now evaluated using the test sets separated in the data preparation step. This will take a while. When that is done, all performance and bias metrics needed will be collected and stored in .csv files.
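
For contexts other than Norway, the "Species (total).csv" file can also be prepared by hand. The sketch below writes a file with the two required columns using the standard library; the taxon names and counts are placeholders, not real data, and the comma delimiter is an assumption (check what the evaluation scripts expect).

```python
import csv
import io

# Column names as required by the evaluation step.
FIELDNAMES = ["Taxon", "Number of species"]


def write_species_totals(rows, handle):
    """Write (taxon, number_of_species) pairs as csv to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for taxon, count in rows:
        writer.writerow({"Taxon": taxon, "Number of species": count})


# Placeholder rows; replace with the taxon groups and totals for your area.
buffer = io.StringIO()
write_species_totals([("Taxon A", 120), ("Taxon B", 45)], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a handle from `open("Species (total).csv", "w", newline="")`.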

In [None]:
count_species_per_taxon(
    source_folder='Artsnavnebase',
    encoding='cp1252',
    sep=';',
    output_file='Species (total).csv',
    grouping_csv=grouping_csv,
    language='no'
)

run_all_test_sets(os.environ.get("JOBS_DIR"))
collect_metrics(os.environ.get("JOBS_DIR"))


## Graph creation
A number of selected graphs will be generated for use in the manuscript. Note that some of these graphs have been post-processed manually before publication.
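
Some of the graphs below are based on the derivative of a performance curve, that is, how much a metric such as F1 changes per additional image. As a simple illustration of that quantity only, the sketch below computes finite-difference slopes between consecutive points of a curve. The function name and the example values are hypothetical, and `Scripts.voi_evaluate` may well derive its derivatives differently (for example from a fitted curve).

```python
def curve_slopes(sizes, scores):
    """Return finite-difference slopes between consecutive curve points.

    `sizes` are dataset sizes (assumed distinct), `scores` the metric at each size.
    """
    if len(sizes) != len(scores):
        raise ValueError("sizes and scores must have the same length")
    pairs = sorted(zip(sizes, scores))  # order the points by dataset size
    return [
        (s1 - s0) / (n1 - n0)
        for (n0, s0), (n1, s1) in zip(pairs, pairs[1:])
    ]


# Hypothetical curve: the gain per extra image shrinks as the dataset grows.
print(curve_slopes([10, 20, 40], [0.2, 0.4, 0.5]))
```

A slope that decreases with dataset size reflects diminishing returns from additional observations, which is the intuition behind comparing taxa by the value of additional data.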

In [None]:
# Directory with all jobs and collected metrics, and the folder the graphs are written to
path = os.environ.get("JOBS_DIR")
output_path = os.path.join(path, "GRAPHS")
extension = "pdf"

os.makedirs(output_path, exist_ok=True)

create_bias_plots(rank="class", plot_output_dir=output_path,
                    csv_output_dir=path, extension=extension, plottype=["all_bias", "cs_img_bias"])

plot_mean_curve_per_taxon_together(path, output_path, extension, metrics=["F1"])

plot_means_derivatives_together(path, output_path, extension, metrics=["F1"],
                                per_species=True)

create_bias_plots(rank="order", plot_output_dir=output_path,
                    csv_output_dir=path, extension=extension, plottype="cs_img_bias",
                    sortby="Curve derivative of F1 at average img per species")