# LeafSnap data exploration

The original dataset was downloaded from [kaggle.com](https://www.kaggle.com/xhlulu/leafsnap-dataset) because [leafsnap.com](leafsnap.com/dataset) is no longer available. It is stored on [SURF drive](https://surfdrive.surf.nl/files/index.php/s/MoCVal7gxS4aX51?path=%2Fdata%2FLeafSnap). It contains 30 866 (~31k) color images of various sizes and covers all 185 tree species from the Northeastern United States. The leaf images come from two different sources:

    "Lab" images, consisting of high-quality images taken of pressed leaves, from the Smithsonian collection.
    "Field" images, consisting of "typical" images taken by mobile devices (iPhones mostly) in outdoor environments.

The purpose of this demo is to select a subset of the 30 most populous species, using both lab and field images. A [dataset of 30 classes](https://github.com/NLeSC/XAI/blob/master/Software/LeafSnapDemo/Data_preparation_30subset.ipynb) had already been selected earlier; its lab images were cropped semi-manually with IrfanView to remove the rulers and color calibration patches. However, 2/3 of that dataset was selected randomly rather than according to the number of images per class.

This notebook explores the original dataset to find the 30 most populous classes and to see which of them are not yet included in the 30-class dataset.

### Imports

In [4]:
import warnings
warnings.simplefilter('ignore')
import os
import PIL
import imageio
import pandas as pd
import numpy as np


### Read data frame with information about pictures

The dataset includes a tab-separated file with information about the pictures, which is read into a data frame. The columns relevant for us are:

    image_path: path to the individual picture
    species: Latin name of the species
    source: whether the picture was taken in the lab or in the field



In [5]:
# original dataset
data_path = "/home/elena/eStep/XAI/Data/LeafSnap/"
dataset_data_path = os.path.join(data_path, "leafsnap-dataset")
dataset_info_file = os.path.join(dataset_data_path, "leafsnap-dataset-images.txt")

img_info = pd.read_csv(dataset_info_file, sep="\t")
img_info.head()

Unnamed: 0,file_id,image_path,segmented_path,species,source
0,55497,dataset/images/lab/abies_concolor/ny1157-01-1.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
1,55498,dataset/images/lab/abies_concolor/ny1157-01-2.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
2,55499,dataset/images/lab/abies_concolor/ny1157-01-3.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
3,55500,dataset/images/lab/abies_concolor/ny1157-01-4.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
4,55501,dataset/images/lab/abies_concolor/ny1157-02-1.jpg,dataset/segmented/lab/abies_concolor/ny1157-02...,Abies concolor,lab
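Because each row carries both a `species` and a `source`, the lab/field split per species can be tabulated directly. The following is a minimal sketch on a made-up frame (`toy_info` and its rows are invented for illustration; in the notebook the real `img_info` would take its place):

```python
import pandas as pd

# Toy stand-in for img_info; these rows are invented for illustration only.
toy_info = pd.DataFrame(
    {
        "species": ["Acer rubrum", "Acer rubrum", "Acer rubrum",
                    "Ulmus rubra", "Ulmus rubra"],
        "source": ["lab", "lab", "field", "lab", "field"],
    }
)

# Rows: species, columns: source, cells: number of images.
per_source = pd.crosstab(toy_info["species"], toy_info["source"])

# Add the total per species and sort so the most populous species come first.
per_source["total"] = per_source.sum(axis=1)
per_source = per_source.sort_values("total", ascending=False)
print(per_source)
```

Such a table would show whether a species is populous overall or only in one of the two sources.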


In [6]:
subset30_data_path = os.path.join(data_path, "leafsnap-dataset-30subset")
subset30_info_file = os.path.join(subset30_data_path, "leafsnap-dataset-30subset-images.txt")

img_info30 = pd.read_csv(subset30_info_file, sep="\t")
img_info30.head()

Unnamed: 0,file_id,image_path,species,source
0,55821,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
1,55822,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
2,55823,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
3,55824,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
4,55825,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab


### Get the top 30 most populous species from the original dataset

In [7]:
# Count images per species; value_counts() sorts in descending order
species = img_info["species"]
species_counts = species.value_counts()
# Keep the 30 species with the most images
top30_species_counts = species_counts.head(30)
print(top30_species_counts)

Maclura pomifera            448
Ulmus rubra                 317
Prunus virginiana           303
Acer rubrum                 297
Broussonettia papyrifera    294
Prunus sargentii            288
Ptelea trifoliata           270
Ulmus pumila                265
Abies concolor              251
Asimina triloba             249
Diospyros virginiana        248
Quercus montana             247
Ilex opaca                  244
Liriodendron tulipifera     235
Acer negundo                229
Styrax japonica             228
Quercus muehlenbergii       226
Aesculus pavi               225
Catalpa bignonioides        217
Juglans cinerea             217
Chionanthus virginicus      217
Cercis canadensis           216
Ulmus americana             215
Staphylea trifolia          213
Cryptomeria japonica        212
Acer palmatum               212
Ostrya virginiana           209
Fraxinus nigra              205
Carya cordiformis           200
Gleditsia triacanthos       198
Name: species, dtype: int64
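Several species in this list have identical counts (for example three species with 217 images), so it may be worth checking how a top-N cut treats ties. A small sketch on invented labels (`toy_species` is hypothetical): `head(n)` returns exactly `n` rows, while `nlargest(n, keep="all")` also keeps every entry tied with the last one.

```python
import pandas as pd

# Invented labels: "a" occurs 4 times, "b" and "c" 3 times each, "d" once.
toy_species = pd.Series(["a"] * 4 + ["b"] * 3 + ["c"] * 3 + ["d"], name="species")

counts = toy_species.value_counts()

# Exactly two rows: "a" plus only one of the tied labels "b" / "c".
top2 = counts.head(2)

# Keeps all entries tied at the cut-off, so the result can exceed two rows.
top2_ties = counts.nlargest(2, keep="all")

print(top2)
print(top2_ties)
```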


In [8]:
species = img_info["species"]
species.describe()

count                30866
unique                 185
top       Maclura pomifera
freq                   448
Name: species, dtype: object

In [9]:
top30_species_counts = species.value_counts().head(30).to_frame()
top30_species_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, Maclura pomifera to Gleditsia triacanthos
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   species  30 non-null     int64
dtypes: int64(1)
memory usage: 480.0+ bytes


In [10]:
top30_species_counts

Unnamed: 0,species
Maclura pomifera,448
Ulmus rubra,317
Prunus virginiana,303
Acer rubrum,297
Broussonettia papyrifera,294
Prunus sargentii,288
Ptelea trifoliata,270
Ulmus pumila,265
Abies concolor,251
Asimina triloba,249


### Find the intersection of the 30 most populous species and the 30-species subset

In [11]:
species30 = img_info30["species"]
species30.describe()

count                 6136
unique                  30
top       Maclura pomifera
freq                   448
Name: species, dtype: object

In [12]:
species30_counts = species30.value_counts().to_frame()
species30_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, Maclura pomifera to Quercus rubra
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   species  30 non-null     int64
dtypes: int64(1)
memory usage: 480.0+ bytes


In [13]:
species30_counts

Unnamed: 0,species
Maclura pomifera,448
Ulmus rubra,317
Acer rubrum,297
Broussonettia papyrifera,294
Prunus sargentii,288
Ptelea trifoliata,270
Ulmus pumila,265
Abies concolor,251
Asimina triloba,249
Diospyros virginiana,248


In [14]:
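# Note: without `on=`, pd.merge joins on the shared column "species", which holds
# the image counts; the species names live in the index and are discarded.
common_species = pd.merge(top30_species_counts,species30_counts)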
common_species.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13 entries, 0 to 12
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   species  13 non-null     int64
dtypes: int64(1)
memory usage: 208.0 bytes


In [15]:
print(common_species)

    species
0       448
1       317
2       297
3       294
4       288
5       270
6       265
7       251
8       249
9       248
10      247
11      244
12      215
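The merged frame above contains only counts, because the join key is the count column and the species names in the index are dropped. Two different species that happen to have the same number of images would therefore be treated as a match. A possible alternative, sketched here on invented counts (`top_counts` and `subset_counts` are hypothetical stand-ins), is to intersect the species names in the index:

```python
import pandas as pd

# Invented counts, indexed by species name like the output of value_counts().
top_counts = pd.Series({"A": 10, "B": 8, "C": 8, "D": 5})
subset_counts = pd.Series({"A": 10, "E": 8, "F": 5})

# Joining on the count column pairs B and C with E, and D with F,
# although the names differ.
by_count = pd.merge(
    top_counts.rename("n").to_frame(),
    subset_counts.rename("n").to_frame(),
)

# Intersecting the index compares the species names themselves.
common_names = top_counts.index.intersection(subset_counts.index)

print(len(by_count))
print(list(common_names))
```

In this toy case the count-based join reports four matches, whereas only one name is actually shared.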


### Remove the common species from the most populous ones and list the remaining species

In [16]:
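# Note: drop_duplicates compares the count values only (not the species names in
# the index), so species whose counts are tied with any other row are dropped too.
top30_unused_species_count = pd.concat([top30_species_counts, common_species]).drop_duplicates(keep=False).head(30)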
print(top30_unused_species_count)

                         species
Prunus virginiana            303
Liriodendron tulipifera      235
Acer negundo                 229
Styrax japonica              228
Quercus muehlenbergii        226
Aesculus pavi                225
Cercis canadensis            216
Staphylea trifolia           213
Ostrya virginiana            209
Fraxinus nigra               205
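Since `drop_duplicates(keep=False)` works on count values, species with tied counts can disappear from this list even if they are not part of the 30-class subset. A name-based filter avoids that; the sketch below uses invented data (`top_counts` and `subset_names` are hypothetical) and compares both approaches:

```python
import pandas as pd

# Invented counts indexed by species name; B and C are tied.
top_counts = pd.Series({"A": 10, "B": 8, "C": 8, "D": 5}, name="n_images")

# Invented names of species already present in the subset.
subset_names = pd.Index(["A", "E", "F"])

# Name-based filter: keep species whose name is not in the subset.
unused = top_counts[~top_counts.index.isin(subset_names)]

# Value-based de-duplication for comparison: the tied B and C both vanish.
deduped = top_counts.to_frame().drop_duplicates(keep=False)

print(unused)
print(deduped)
```

Here the name-based filter keeps B, C and D, while the value-based variant keeps A and D.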
