# 1. What has been done

## 1. DownloadingDatasets-Idigbio
  1. The CSV files with metadata were processed using Pandas.
  1. 241 of the records in iDigBio haven't been identified; we can use these as part of the test dataset.
  1. There are 6498 records that have been identified.
  1. Typos were found in some of the records and were fixed.
  1. As of 2021, there are 134 identified species in the genus Solanum for Mexico.
  1. The dataset has 202 unique species names; the surplus over the 134 known species comes from names that are invalid, outdated, superseded or outright misidentified.
  1. A mapping was created to replace the misidentified or incorrect species names.
  1. We got the value counts per species and merged them with the multimedia dataframe. Remember that the occurrence and multimedia mappings are downloaded from 2 different sources.
  1. There are multiple pictures per coreid, and there are no duplicated URLs.
  1. We found that we are missing some Mexican species:
    * atitlanum
    * aviculare
    * bicorne
    * caripense
    * davisense
    * edmundoi
    * guerreroense
    * knoblochii
    * nitidibaccatum
    * setigeroides
    * triunfense
    
    This is because they are very rare species.
  1. The Solanum section to which each species belongs is not present in the original iDigBio dataframe; it needs to be added based on Gera's feedback.
  1. There are 6476 records in this database.
  1. With this, we are able to add the section information to each of the picture rows.
    * There are some underrepresented sections, but this is normal as they are quite rare.
    * There are no examples of the **archaesolanum** section in this dataset.
  1. The final result was downloaded to `idigbio_images_by_sections.csv`
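
The steps above can be sketched with Pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`coreid`, `species`, `url`), the sample rows and both mappings are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample rows; the real column names in the iDigBio export may differ.
occurrences = pd.DataFrame({
    "coreid": ["a1", "a2", "a3", "a4"],
    "species": ["solanum nigrum", "solanum americanun", "solanum rostratum", None],
})
multimedia = pd.DataFrame({
    "coreid": ["a1", "a1", "a2", "a3"],
    "url": [
        "http://example.org/1.jpg",
        "http://example.org/2.jpg",
        "http://example.org/3.jpg",
        "http://example.org/4.jpg",
    ],
})

# Illustrative correction and section mappings (placeholders, not the real ones).
species_fixes = {"solanum americanun": "solanum americanum"}
species_to_section = {
    "solanum nigrum": "solanum",
    "solanum americanum": "solanum",
    "solanum rostratum": "androceras",
}

# Split identified records from unidentified ones (candidates for a test set).
identified = occurrences.dropna(subset=["species"]).copy()
unidentified = occurrences[occurrences["species"].isna()]

# Replace misspelled or outdated names, then attach the section of each species.
identified["species"] = identified["species"].replace(species_fixes)
identified["section"] = identified["species"].map(species_to_section)

# One row per picture: join the occurrences with their media URLs.
images = identified.merge(multimedia, on="coreid", how="inner")
print(images["species"].value_counts())
```

A species that is absent from the section mapping ends up with a null `section`, which makes unmapped names easy to spot before saving the CSV.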

## DownloadingDatasets-Gbif

1. The same treatment was applied to GBIF as to iDigBio, but here 10 species are missing:
  * Solanum atitlanum
  * Solanum aviculare
  * Solanum bicorne
  * Solanum davisense
  * Solanum deflexum
  * Solanum edmundoi
  * Solanum nitidibaccatum
  * Solanum rostratum
  * Solanum rudepannum
  * Solanum setigeroides
1. Some records were corrected so that their species names are updated or fixed.
1. There are 6483 records and 6726 after removing null values with no pictures
1. While searching for duplicated URLs, it was found that there are some pictures that contain more than one specimen on the same sheet, but given that both species belong to the same section, we should be OK to keep them.
1. We are missing some Mexican species:
  1. solanum atitlanum
  1. solanum aviculare
  1. solanum bicorne
  1. solanum davisense
  1. solanum deflexum
  1. solanum edmundoi
  1. solanum nitidibaccatum
  1. solanum rostratum
  1. solanum rudepannum
  1. solanum setigeroides

1. We add the section to the row of each picture and find that we don't have specimens of the `archaesolanum` section, which is a really rare section.
1. Results are saved to `gbif_images_by_sections.csv`

## DownloadingAllDatasets
1. This processes the 2 CSV files from GBIF and iDigBio, merges them, removes the duplicated URLs and checks how many valid records there actually are. It doesn't download anything; it only mines the files for information.
1. According to these experiments, there are 9292 unique URLs to download between iDigBio and GBIF.
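
A minimal sketch of this merge-and-count step, assuming both files share a `url` column (the column names and sample rows are hypothetical; in practice the frames would be loaded with `pd.read_csv` from `idigbio_images_by_sections.csv` and `gbif_images_by_sections.csv`):

```python
import pandas as pd

# Stand-ins for the two CSV files; the columns are assumed, not taken from the real data.
idigbio = pd.DataFrame({
    "url": ["http://example.org/1.jpg", "http://example.org/2.jpg", None],
    "section": ["solanum", "solanum", "androceras"],
})
gbif = pd.DataFrame({
    "url": ["http://example.org/2.jpg", "http://example.org/3.jpg"],
    "section": ["solanum", "androceras"],
})

# Stack both sources and remember where each row came from.
combined = pd.concat(
    [idigbio.assign(source="idigbio"), gbif.assign(source="gbif")],
    ignore_index=True,
)

# Rows without a URL cannot be downloaded, so they are not valid records.
valid = combined.dropna(subset=["url"])

# Keep the first occurrence of each URL across both sources.
unique_images = valid.drop_duplicates(subset="url", keep="first")

print(f"{len(combined)} rows, {len(valid)} with a URL, {len(unique_images)} unique URLs")
```

Keeping the `source` column makes it possible to see later which database contributed each unique image.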

## Dataset downloader
It's a Python program designed to download and categorize images parsed from the CSV files above.

It:
1. Filters out duplicated URLs so that each one is tried only once
1. Creates a parent download folder
1. Downloads the images to a solanum section folder

The code is hosted in a Git repository on GitHub: `charlieitesm/tesis-dataset-downloader`.

It uses BeautifulSoup (BS4), NumPy, Pandas and Requests to download everything. A report of the URLs that could not be downloaded is written to `failed_images.csv`.
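
The overall flow could look something like the sketch below. It is not the code from the repository: the function names, the file naming scheme and the `(url, section)` input shape are assumptions for illustration. The fetcher is passed in as a parameter so the flow can be tried without network access.

```python
import csv
import tempfile
from pathlib import Path


def fetch_with_requests(url):
    """Default fetcher; Requests is imported lazily so the sketch loads without it."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def download_images(rows, root, fetch=fetch_with_requests):
    """Download (url, section) pairs into root/<section>/ and report failures."""
    root.mkdir(parents=True, exist_ok=True)  # parent download folder
    seen = set()
    failed = []
    for url, section in rows:
        if url in seen:  # try each URL only once
            continue
        seen.add(url)
        section_dir = root / section
        section_dir.mkdir(exist_ok=True)
        # Hypothetical naming scheme: running counter plus the last URL segment.
        filename = f"{len(seen):06d}_{url.rsplit('/', 1)[-1]}"
        try:
            (section_dir / filename).write_bytes(fetch(url))
        except Exception as error:  # record any failure and keep going
            failed.append((url, str(error)))
    with open(root / "failed_images.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "error"])
        writer.writerows(failed)
    return failed


def fake_fetch(url):
    """Offline stand-in for the real fetcher, used only for this demo."""
    if "broken" in url:
        raise IOError("simulated failure")
    return b"image-bytes"


demo_root = Path(tempfile.mkdtemp()) / "solanum_dataset"
demo_rows = [
    ("http://example.org/a.jpg", "solanum"),
    ("http://example.org/a.jpg", "solanum"),  # duplicate, skipped
    ("http://example.org/broken.jpg", "androceras"),
]
failures = download_images(demo_rows, demo_root, fetch=fake_fetch)
print(failures)
```

Calling `download_images(rows, root)` without the `fetch` argument would use Requests for real downloads.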


## Solanum Dataset Metrics

1. We want to obtain the metrics for the dataset:
  1. Section to which it belongs
  1. Species
  1. Source of file
  1. Size (MB)
  1. Resolution
  1. Type of file
  1. Hash (fingerprint)
  
  For this I used `ImageHash` to calculate the fingerprint of the images in order to identify near-duplicates.
1. Originally, 8937 files were processed from the filesystem. The difference between this and the 9292 URLs we had on record corresponds to the images that we were not able to download.
1. There are 190 near-duplicate files that correspond to 92 unique images. We consider two files duplicates when they have the same:
  * Hash, section, species, filesize_mb
1. Using a flag on the dataset of duplicates, we can remove the records we don't want, especially those that have smaller sizes.
1. With the help of a biologist, I pruned those files that have the same fingerprint but are classified in different sections.
1. Results were saved to `Downloaded_dedup_images_report.csv`
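
The duplicate flag and the cross-section check can be sketched with Pandas as below. The report columns and sample rows are assumptions, and the `hash` column is taken as already computed (for instance with one of ImageHash's hashing functions such as `imagehash.phash`, although the notes do not say which one was used).

```python
import pandas as pd

# Hypothetical metrics report; the hash strings are placeholders, not real fingerprints.
report = pd.DataFrame({
    "file": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "hash": ["ff00", "ff00", "0f0f", "abcd", "abcd"],
    "section": ["solanum", "solanum", "solanum", "androceras", "solanum"],
    "species": [
        "solanum nigrum",
        "solanum nigrum",
        "solanum americanum",
        "solanum rostratum",
        "solanum nigrum",
    ],
    "filesize_mb": [1.2, 1.2, 0.8, 2.0, 2.1],
})

# Two files count as duplicates when all four of these values match.
keys = ["hash", "section", "species", "filesize_mb"]

# Flag every member of a duplicate group, then keep one file per group.
report["is_duplicate"] = report.duplicated(subset=keys, keep=False)
deduped = report.drop_duplicates(subset=keys, keep="first")

# Same fingerprint in more than one section: candidates for manual review.
sections_per_hash = report.groupby("hash")["section"].nunique()
conflicting_hashes = sections_per_hash[sections_per_hash > 1].index
needs_review = report[report["hash"].isin(conflicting_hashes)]

print(deduped[["file", "section"]])
print(needs_review[["file", "hash", "section"]])
```

The `needs_review` rows are the ones that would go to the biologist, since the data alone cannot tell which section label is correct.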


# Important files

1. idigbio_images_by_sections.csv
  * Contains all of the single images with the species, the section and the URL of the media
1. gbif_images_by_sections.csv
  * Contains the single images with the species, the section and the URL of the media.
1. Downloaded_dedup_images_report.csv
  * Contains the dataframe with the images that will be kept before preprocessing


# Plan of attack

## January
1. ~Measure how many images have a resolution below 512 and which section they belong to. Check whether it is feasible to remove them~
1. ~Remove all sections that have fewer than 100 specimens.~
1. ~Reduce the resolution of the images to 512x512~
1. ~Compress the whole dataset~
1. Upload it to the DGX-1
1. Model training
    * For all models we must:
        * Save the model in binary form
        * Record the metrics
        * Use CV=5
        * Use data augmentation with reflection, scaling, shear
    1. Use a hand-implemented VGG8 or VGG16
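
The first two filtering steps of this plan (counting low-resolution images per section and dropping small sections) could be sketched as follows. The column names `width`, `height` and `section` are assumptions, and "resolution below 512" is read here as either side being shorter than 512 pixels.

```python
import pandas as pd


def small_images_per_section(report, min_side=512):
    """Count, per section, the images whose width or height is below min_side."""
    is_small = (report["width"] < min_side) | (report["height"] < min_side)
    return report[is_small].groupby("section").size()


def keep_large_sections(report, min_samples=100):
    """Keep only the rows of sections that have at least min_samples images."""
    counts = report["section"].value_counts()
    large_sections = counts[counts >= min_samples].index
    return report[report["section"].isin(large_sections)]


# Tiny made-up report; the real one would come from the metrics CSV.
demo = pd.DataFrame({
    "section": ["solanum", "solanum", "solanum", "androceras"],
    "width": [400, 1024, 2048, 300],
    "height": [600, 768, 1536, 300],
})

small = small_images_per_section(demo)
kept = keep_large_sections(demo, min_samples=2)  # threshold lowered for the demo
print(small)
print(kept)
```

With the real dataset, `min_samples` would stay at its default of 100.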

## February
1. Continue model training
    1. Use a larger VGG from a library
    1. Repeat for ResNet50
    1. Choose a larger, more complex architecture (state of the art)
1. Decide whether Transfer Learning with ImageNet will be used

## March - April
1. Write the thesis

# Interesting questions and issues to pursue

1. Measure how outdated the records are in both iDigBio and GBIF, in particular how many records can be updated with new values for the species.
1. Measure which database has the most unique files and how many repeated records there are in each database.
  * Which database seems to be the most reliable, with more unique records and fewer errors?

# Notes

## Questions
1.

---

## Advice
1. Focus on sections that have at least 100 samples
1. About the size, there's no standard; check the state of the art and consider 512x512
    1. Check which images are smaller than 512 and see whether we can remove them from the dataset
1. About the CNN architecture
    1. First see what happens without Transfer Learning
    1. Use a VGG-8/16 to see how well it behaves and what its metrics are
        1. Build it by hand first before using a better one
    1. Move on to a larger VGG and ResNet50
    1. Compare with a larger, more complex one
    1. Compare with Transfer Learning
1. GradCam
    1. Once an image is classified, it highlights what the network is using to make the classification; we can leave this until the end.

1. Start with the easiest option
    1. Library VGG16 without TL; focus on comparing different architectures
    1. From there, move to a hand-built VGG8
    1. From there, to something more complex
    1. Then see whether there is time left for TL
    1. Use 512x512 images and data augmentation: reflection, scaling