In [1]:
# Check if python is 3.9.5
import sys
print(sys.version)
%load_ext autoreload
%autoreload 2

3.9.5 (default, May 18 2021, 12:31:01) 
[Clang 10.0.0 ]


In [2]:
import geolifeclef

# GeoLifeClef Dataset

The [GeoLifeClef](https://www.imageclef.org/GeoLifeCLEF2021) dataset contains a large number of [iNaturalist](https://www.inaturalist.org/) and [Pl@ntNet](https://identify.plantnet.org/) observations from the USA and France. What does the dataset look like?

The dataset also includes many geographic raster layers (land use, climate parameters, ...). Which of them could be interesting?

The covered regions are the USA and France (without the Gironde department).

# Structure of dataset

What is the overall structure of the dataset? The [guidance paper](https://arxiv.org/abs/2004.04192) can be found in our [drive](https://drive.google.com/file/d/1SUYlboG5JobEAA80mdEIZFxq6bYkpFuE/view?usp=sharing). This is already the second time the authors have run the challenge, as it was too difficult in 2020. The data directory can be downloaded from Kaggle.

It contains:

In [22]:
geolifeclef.print_base_hierarchy()

- test_observation_ids_mapping.csv
- data
|- patches_sample
|- pre-extracted
|- observations
|- metadata
|- rasters
- sample_submission.csv
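
To check that a local download matches the hierarchy listed above, the five sub-directories of `data` can be tested for existence. This is only a minimal sketch with the standard library: the function name `missing_data_dirs` is hypothetical, it is not how `geolifeclef.print_base_hierarchy` works internally, and the demo runs on a throwaway directory instead of the real download.

```python
import tempfile
from pathlib import Path

# Sub-directories of `data` as shown in the hierarchy listing above.
EXPECTED_DIRS = ("patches_sample", "pre-extracted", "observations", "metadata", "rasters")


def missing_data_dirs(root):
    """Return the expected sub-directories of `<root>/data` that are absent."""
    data_dir = Path(root) / "data"
    return [name for name in EXPECTED_DIRS if not (data_dir / name).is_dir()]


# Demo on a temporary directory; replace `root` with the real download location.
with tempfile.TemporaryDirectory() as root:
    for name in EXPECTED_DIRS[:3]:
        (Path(root) / "data" / name).mkdir(parents=True)
    print(missing_data_dirs(root))
```

An empty list means that all five directories are in place.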


The `.csv` files are meant as a sample submission, so the 5 directories within the folder `data` are of main interest.
Best of all, the authors already provide an in-depth exploration of the whole dataset, which is hosted on [GitHub](https://github.com/maximiliense/GLC.git). This [notebook](https://github.com/maximiliense/GLC/blob/20a401ca309fafa4fa34abe29401efa90a4dcb16/notebooks/Getting%20started%20-%20Data%20loading%20and%20visualization.ipynb) walks through all the data structures of the set. Thus, it is not really necessary to go into much detail here, but let's briefly show some nice features we could use.

In [32]:
df = geolifeclef.print_metadata()

## Climate data

The dataset includes rasters of all [19 bioclimatic variables](https://www.worldclim.org/data/bioclim.html) from WorldClim 1.4 at 30 arcsec resolution (~1 km pixels).

In [38]:
df[df.name.str.contains("bio")]

Unnamed: 0,name,description,resolution
0,bio_1,Annual Mean Temperature,30 arcsec
1,bio_2,Mean Diurnal Range (Mean of monthly (max temp ...,30 arcsec
2,bio_3,Isothermality (bio_2/bio_7) (* 100),30 arcsec
3,bio_4,Temperature Seasonality (standard deviation * ...,30 arcsec
4,bio_5,Max Temperature of Warmest Month,30 arcsec
5,bio_6,Min Temperature of Coldest Month,30 arcsec
6,bio_7,Temperature Annual Range (bio_5-bio_6),30 arcsec
7,bio_8,Mean Temperature of Wettest Quarter,30 arcsec
8,bio_9,Mean Temperature of Driest Quarter,30 arcsec
9,bio_10,Mean Temperature of Warmest Quarter,30 arcsec
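
The "~1 km" figure follows from the 30 arcsec resolution in the table. The sketch below works through the arithmetic, assuming a spherical Earth with a mean radius of 6371 km; the function name `pixel_size_km` is made up for illustration. North-south, a pixel spans about 0.93 km everywhere, while its east-west extent shrinks with the cosine of the latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def pixel_size_km(latitude_deg, arcsec=30.0):
    """Return the (east-west, north-south) pixel extent in km at a latitude."""
    angle_rad = math.radians(arcsec / 3600.0)
    north_south = EARTH_RADIUS_KM * angle_rad
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


for lat in (0.0, 45.0):
    ew, ns = pixel_size_km(lat)
    print(f"lat {lat:4.1f}: {ew:.2f} km x {ns:.2f} km")
```

At the latitudes of France and most of the USA, the pixels are therefore noticeably narrower than 1 km in the east-west direction.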


## Pedological data a.k.a *rock facts*

These variables have a higher resolution (250 m) and come from the [SoilGrids250 database](https://soilgrids.org/). Still, this resolution could lead to difficulties when combined with our soil microbiome data (~100 m resolution), as transitions in (soil) environments can occur on a very small scale. Nevertheless, they are very useful for our analysis, as soil organisms depend on soil properties.

In [40]:
df[~df.name.str.contains("bio")]

Unnamed: 0,name,description,resolution
19,orcdrc,Soil Organic Carbon Content (g/kg at 15cm depth),250 m
20,phihox,Ph x 10 in H20 (at 15cm depth),250 m
21,cecsol,Cation Exchange Capacity of Soil in cmolc/kg 1...,250 m
22,bdticm,Absolute Depth to Bedrock in cm,250 m
23,clyppt,Clay (0-2 micro meter) Mass Fraction at 15cm d...,250 m
24,sltppt,Silt Mass Fraction at 15cm depth,250 m
25,sndppt,Sand Mass Fraction at 15cm depth,250 m
26,bldfie,Bulk Density in kg/m3 at 15cm depth,250 m


<div class="alert alert-block alert-danger">
<b>Attention!</b> <br> When comparing the soil pH from SoilGrids with other references, e.g. local maps like the one from <a href="https://www.umweltbundesamt.at/fileadmin/site/themen/boden/boris/ph_wert_referenzwerteband.pdf">Austria</a>, one can see deviations between the two, especially in alpine regions. Alpine soils usually depend strongly on their parent rock, and they even shape the resident plant communities. <br>
One way to overcome the pH problem may be to include the "uncertainty layer", which is available in the online database.
</div>
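
One way the uncertainty layer could be used is to mask pH pixels whose uncertainty exceeds a threshold before any further analysis. The sketch below is an assumption-heavy illustration: the arrays are made up, the unit and scale of the SoilGrids uncertainty layer are not checked here, and both the threshold and the helper `mask_uncertain` are hypothetical. Only the division by 10 comes from the `phihox` description above.

```python
import numpy as np


def mask_uncertain(ph_x10, uncertainty, max_uncertainty):
    """Return pH as a masked array, hiding pixels above the uncertainty threshold."""
    ph = np.asarray(ph_x10, dtype=float) / 10.0  # phihox stores pH x 10
    return np.ma.masked_where(np.asarray(uncertainty) > max_uncertainty, ph)


# Made-up 2x3 rasters for illustration only.
ph_x10 = np.array([[55, 62, 71], [48, 80, 66]])
uncertainty = np.array([[0.2, 0.9, 0.3], [1.5, 0.1, 0.4]])

masked = mask_uncertain(ph_x10, uncertainty, max_uncertainty=0.8)
print(masked)
print("trusted pixels:", masked.count())
print("mean pH over trusted pixels:", masked.mean())
```

A masked array keeps the raster shape intact, so the result still aligns with the other layers.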

## Fine-resolved patches

The landcover, altitude, RGB and infrared patches in particular are well resolved (≤ 250 m pixels). Landcover from [NLCD](https://www.usgs.gov/centers/eros/science/national-land-cover-database?qt-science_center_objects=0#qt-science_center_objects) (USA, 2011 version, 30 m pixels) and [CESBIO](http://osr-cesbio.ups-tlse.fr/~oso/posts/2017-03-30-carte-s2-2016/) (France, 2016, 10 m pixels) may be highly relevant for us. The original assignments are:

In [48]:
geolifeclef.print_metadata(landcover=True).sort_values(by="original_landcover_code")

Unnamed: 0,landcover_code,original_landcover_code,landcover_label
0,0,0,Missing Data
1,1,11,Annual Summer Crops
18,18,11,Open Water
2,2,12,Annual Winter Crops
19,19,12,Perennial Ice/Snow
20,20,21,"Developed, Open Space"
21,21,22,"Developed, Low Intensity"
22,22,23,"Developed, Medium Intensity"
23,23,24,"Developed, High Intensity"
24,24,31,Barren Land (Rock/Sand/Clay)


<div class="alert alert-block alert-warning">
<b>Note:</b><br> There is also an alignment between the two (French and US) assignments. If we only use US data, the original assignments should be preferred.
</div>
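
The table above also shows why the original codes cannot be used on their own when both countries are mixed: the same `original_landcover_code` appears with two different labels (for example, 11 is both "Annual Summer Crops" and "Open Water"), presumably one from each source classification. The sketch below rebuilds the ten rows shown above as a pandas DataFrame and lists the ambiguous codes; the full table has more rows.

```python
import pandas as pd

# The ten rows shown in the table above (the full table is longer).
landcover = pd.DataFrame(
    {
        "landcover_code": [0, 1, 18, 2, 19, 20, 21, 22, 23, 24],
        "original_landcover_code": [0, 11, 11, 12, 12, 21, 22, 23, 24, 31],
        "landcover_label": [
            "Missing Data",
            "Annual Summer Crops",
            "Open Water",
            "Annual Winter Crops",
            "Perennial Ice/Snow",
            "Developed, Open Space",
            "Developed, Low Intensity",
            "Developed, Medium Intensity",
            "Developed, High Intensity",
            "Barren Land (Rock/Sand/Clay)",
        ],
    }
)

# Original codes that map to more than one label.
ambiguous = landcover[
    landcover.duplicated(subset="original_landcover_code", keep=False)
]
print(ambiguous)

# The unified `landcover_code` column is unique in these rows.
print(landcover["landcover_code"].is_unique)
```

When data from both countries are combined, `landcover_code` is therefore the safer key.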

Other than that, we also have RGB, infrared and altitude patches.

## Observations

We would also have the iNaturalist and Pl@ntNet observations, but including them would make the task even more difficult than it already is. At least, that is how it seems to me.