# LeafSnap data exploration

The original dataset was downloaded from [kaggle.com](https://www.kaggle.com/xhlulu/leafsnap-dataset), as [leafsnap.com](leafsnap.com/dataset) is no longer available. A copy is stored on [SURF drive](https://surfdrive.surf.nl/files/index.php/s/MoCVal7gxS4aX51?path=%2Fdata%2FLeafSnap). There are 30 866 (~31k) color images of different sizes. The dataset covers all 185 tree species from the Northeastern United States. The leaf images come from two different sources:

- "Lab" images: high-quality images of pressed leaves from the Smithsonian collection.
- "Field" images: "typical" images taken with mobile devices (mostly iPhones) in outdoor environments.

The purpose of this notebook is to select a subset containing the 30 most populous species, counting both lab and field images. A [dataset of 30 classes](https://github.com/NLeSC/XAI/blob/master/Software/LeafSnapDemo/Data_preparation_30subset.ipynb) has been selected before, in which the lab images were cropped semi-manually using IrfanView to remove the rulers and color calibration parts of the image. However, 2/3 of that dataset was selected randomly, not according to the number of images per class.

This notebook explores the original dataset to find the 30 most populous classes and to see which of them are not yet included in the previous 30-class dataset.

### Imports

In [52]:
import warnings
warnings.simplefilter('ignore')
import os
import PIL
import imageio
import pandas as pd
import numpy as np


### Read data frame with information about pictures

In the dataset, there is a data frame containing information about the pictures. Relevant for us are the columns:

- `image_path`: path to the individual picture
- `species`: Latin name of the plant
- `source`: whether the picture was taken in the lab or in the field



In [53]:
# original dataset
data_path = "/home/elena/eStep/XAI/Data/LeafSnap/"
dataset_data_path = os.path.join(data_path, "leafsnap-dataset")
dataset_info_file = os.path.join(dataset_data_path, "leafsnap-dataset-images.txt")

img_info = pd.read_csv(dataset_info_file, sep="\t")
img_info.head()

| | file_id | image_path | segmented_path | species | source |
|---|---|---|---|---|---|
| 0 | 55497 | dataset/images/lab/abies_concolor/ny1157-01-1.jpg | dataset/segmented/lab/abies_concolor/ny1157-01... | Abies concolor | lab |
| 1 | 55498 | dataset/images/lab/abies_concolor/ny1157-01-2.jpg | dataset/segmented/lab/abies_concolor/ny1157-01... | Abies concolor | lab |
| 2 | 55499 | dataset/images/lab/abies_concolor/ny1157-01-3.jpg | dataset/segmented/lab/abies_concolor/ny1157-01... | Abies concolor | lab |
| 3 | 55500 | dataset/images/lab/abies_concolor/ny1157-01-4.jpg | dataset/segmented/lab/abies_concolor/ny1157-01... | Abies concolor | lab |
| 4 | 55501 | dataset/images/lab/abies_concolor/ny1157-02-1.jpg | dataset/segmented/lab/abies_concolor/ny1157-02... | Abies concolor | lab |
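
As a quick orientation, the `species` and `source` columns can be cross-tabulated to see how many lab and field images each species has. The sketch below uses a tiny made-up frame (`toy_info`, with invented rows) in place of `img_info`; only the column names are taken from the file above.

```python
import pandas as pd

# Hypothetical stand-in for img_info: same column names, invented rows.
toy_info = pd.DataFrame(
    {
        "image_path": ["a/1.jpg", "a/2.jpg", "a/3.jpg", "b/1.jpg", "b/2.jpg"],
        "species": ["Acer rubrum", "Acer rubrum", "Acer rubrum",
                    "Ulmus rubra", "Ulmus rubra"],
        "source": ["lab", "lab", "field", "lab", "field"],
    }
)

# Rows are species, columns are sources, cells are image counts.
per_source = pd.crosstab(toy_info["species"], toy_info["source"])
print(per_source)
```

Replacing `toy_info` with `img_info` should give the same kind of table for the full dataset.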


In [54]:
old_subset30_data_path = os.path.join(data_path, "leafsnap-dataset-30subset")
old_subset30_info_file = os.path.join(old_subset30_data_path, "leafsnap-dataset-30subset-images.txt")

img_info30_old = pd.read_csv(old_subset30_info_file, sep="\t")
img_info30_old.head()

| | file_id | image_path | species | source |
|---|---|---|---|---|
| 0 | 55821 | dataset/images/lab/Auto_cropped/acer_campestre... | Acer campestre | lab |
| 1 | 55822 | dataset/images/lab/Auto_cropped/acer_campestre... | Acer campestre | lab |
| 2 | 55823 | dataset/images/lab/Auto_cropped/acer_campestre... | Acer campestre | lab |
| 3 | 55824 | dataset/images/lab/Auto_cropped/acer_campestre... | Acer campestre | lab |
| 4 | 55825 | dataset/images/lab/Auto_cropped/acer_campestre... | Acer campestre | lab |


### Get the top 30 most populous species from the original dataset

In [55]:
species = img_info["species"]
species.describe()

count                30866
unique                 185
top       Maclura pomifera
freq                   448
Name: species, dtype: object

In [56]:
top30_species_counts = species.value_counts().head(30).to_frame()
top30_species_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, Maclura pomifera to Gleditsia triacanthos
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   species  30 non-null     int64
dtypes: int64(1)
memory usage: 480.0+ bytes


In [57]:
top30_species_counts

| | species |
|---|---|
| Maclura pomifera | 448 |
| Ulmus rubra | 317 |
| Prunus virginiana | 303 |
| Acer rubrum | 297 |
| Broussonettia papyrifera | 294 |
| Prunus sargentii | 288 |
| Ptelea trifoliata | 270 |
| Ulmus pumila | 265 |
| Abies concolor | 251 |
| Asimina triloba | 249 |
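
Note that `value_counts().head(30)` cuts the ranking at exactly 30 rows, so any species tied with the 30th count would be dropped silently. Whether such a tie exists in this dataset is not checked here. The following is only a sketch of how one might detect it, using `Series.nlargest` with `keep="all"` on invented counts (`toy_counts` and the cutoff `n` are hypothetical).

```python
import pandas as pd

# Hypothetical per-species image counts; "C" and "D" tie at the cutoff.
toy_counts = pd.Series({"A": 5, "B": 4, "C": 3, "D": 3, "E": 1})
n = 3  # hypothetical cutoff; the notebook uses 30

# Plain top-n: exactly n rows, ties at the cutoff are cut arbitrarily.
top_n = toy_counts.sort_values(ascending=False).head(n)

# keep="all" retains every entry tied with the n-th largest value,
# so the result can be longer than n.
top_n_with_ties = toy_counts.nlargest(n, keep="all")

has_tie_at_cutoff = len(top_n_with_ties) > n
print(has_tie_at_cutoff)
```

If `has_tie_at_cutoff` were true for the real counts, the choice among the tied species would have to be made explicitly.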


### The old 30 subset

In [58]:
species30_old = img_info30_old["species"]
species30_old.describe()

count                 6136
unique                  30
top       Maclura pomifera
freq                   448
Name: species, dtype: object

In [59]:
species30_old_counts = species30_old.value_counts().to_frame()
species30_old_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, Maclura pomifera to Quercus rubra
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   species  30 non-null     int64
dtypes: int64(1)
memory usage: 480.0+ bytes


In [60]:
species30_old_counts

| | species |
|---|---|
| Maclura pomifera | 448 |
| Ulmus rubra | 317 |
| Acer rubrum | 297 |
| Broussonettia papyrifera | 294 |
| Prunus sargentii | 288 |
| Ptelea trifoliata | 270 |
| Ulmus pumila | 265 |
| Abies concolor | 251 |
| Asimina triloba | 249 |
| Diospyros virginiana | 248 |


### Find which species should be removed from the old 30 dataset as they are not in the top30 species

In [61]:
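# DataFrame.isin(DataFrame) compares values at matching index and column labels.
# Matching cells are masked to NaN and dropped, which leaves the species that are
# absent from the top 30 (or whose counts differ).
remove_species_count = species30_old_counts[~species30_old_counts.isin(top30_species_counts)].dropna()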
remove_species_count

| | species |
|---|---|
| Salix nigra | 197.0 |
| Platanus occidentalis | 188.0 |
| Zelkova serrata | 183.0 |
| Quercus alba | 175.0 |
| Tilia americana | 159.0 |
| Magnolia acuminata | 148.0 |
| Quercus bicolor | 145.0 |
| Acer campestre | 144.0 |
| Acer platanoides | 140.0 |
| Platanus acerifolia | 140.0 |


In [62]:
len(remove_species_count)

17
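
The selection above goes through a `NaN` mask, which is why the counts are displayed as floats, and it depends on the counts being identical in both frames. An alternative that compares species names only could look like the sketch below. It uses invented toy frames: `old_counts` and `top_counts` are hypothetical stand-ins for `species30_old_counts` and `top30_species_counts`.

```python
import pandas as pd

# Hypothetical count frames indexed by species label, mirroring the
# one-column layout produced by value_counts().to_frame().
old_counts = pd.DataFrame({"species": [10, 8, 6]}, index=["A", "B", "C"])
top_counts = pd.DataFrame({"species": [10, 9, 8]}, index=["A", "D", "B"])

# Keep the rows of old_counts whose index label is absent from top_counts.
# No NaN masking is involved, so the integer dtype is preserved.
to_remove = old_counts[~old_counts.index.isin(top_counts.index)]
print(to_remove)
```

Swapping the two frames gives the species to add instead of the species to remove.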

### Find which species should be added to the old 30 dataset as they are in the top30 species

In [63]:
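# Same masking approach as for the removal list, with the two frames swapped:
# what remains are the top-30 species that are missing from the old subset.
add_species_count = top30_species_counts[~top30_species_counts.isin(species30_old_counts)].dropna()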
add_species_count

| | species |
|---|---|
| Prunus virginiana | 303.0 |
| Liriodendron tulipifera | 235.0 |
| Acer negundo | 229.0 |
| Styrax japonica | 228.0 |
| Quercus muehlenbergii | 226.0 |
| Aesculus pavi | 225.0 |
| Chionanthus virginicus | 217.0 |
| Juglans cinerea | 217.0 |
| Catalpa bignonioides | 217.0 |
| Cercis canadensis | 216.0 |


In [64]:
len(add_species_count)

17
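
Both lists have 17 entries, which is consistent: two sets of equal size always differ from each other by the same number of elements, so 30 - 17 = 13 species are shared between the old subset and the top 30. A small sketch of this consistency check with plain Python sets, using invented species labels:

```python
# Hypothetical species sets of equal size (the notebook has 30 in each).
old = {"A", "B", "C", "D"}
top = {"A", "B", "E", "F"}

to_remove = old - top   # in the old subset but not in the top list
to_add = top - old      # in the top list but missing from the old subset
shared = old & top      # kept as they are

# For sets of equal size, both differences must have the same length.
assert len(to_remove) == len(to_add)
print(len(shared), sorted(to_remove), sorted(to_add))
```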

### Total number of images in the final dataset (top 30 species)

In [74]:
species.value_counts().head(30).sum()

7395
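
To build the final subset, the rows of `img_info` that belong to the selected species could be filtered with `Series.isin`; the number of rows kept should then equal the sum above. A minimal sketch on an invented frame (`toy_info` and the cutoff `n` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for img_info with invented rows.
toy_info = pd.DataFrame(
    {
        "species": ["A"] * 4 + ["B"] * 3 + ["C"] * 1,
        "source": ["lab", "lab", "field", "field", "lab", "lab", "field", "lab"],
    }
)
n = 2  # hypothetical cutoff; the notebook uses 30

# Counts of the n most populous species, indexed by species label.
top_species = toy_info["species"].value_counts().head(n)

# Keep only the rows whose species is among the selected ones.
subset = toy_info[toy_info["species"].isin(top_species.index)]

# The subset size must match the summed counts of the selected species.
assert len(subset) == top_species.sum()
print(len(subset))
```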