# GeoPlant Dataset

GeoPlant is a large-scale, multimodal dataset for spatial plant species prediction across Europe, combining expert-verified species observations with rich environmental context.

The dataset is aimed mostly at ecological researchers, biologists, and botanists, as well as machine learning and remote sensing researchers. Because of the broad scope of its environmental variables, it can be used by a wide range of professionals.

As its target variable, GeoPlant contains either Presence-Absence (PA) or Presence-Only (PO) survey polygons.

## Presence-Absence (PA)

 - Survey inventory of all plant species in a given plot (10-400 m²).

 - Source: 29 datasets hosted in the European Vegetation Archive (EVA).

 - 93,703 surveys across Europe with 5,016 plant species.

## Environmental Variables | Features 

Each surveyId is associated with a set of environmental variables. These variables come from 5 major datasets:
- Climatic Variables
- Soil variables
- Human Footprint
- Elevation
- Land Cover

The GeoPlant dataset provides the environmental variables as tabular data in which the features are already extracted for each surveyId. Instead of dealing with huge raster datasets, end users get a lightweight tabular version.

### Climatic Variables
 19 climatic variables averaged over 1981-2010.

 Examples include mean annual air temperature, annual precipitation amount, and precipitation amount of the driest month. The full list is given in the two tables below.

#### Climatic - Temperature related variables 
| Code  | Description                                  | Unit   | Notes                                                                                     |
|-------|----------------------------------------------|--------|-------------------------------------------------------------------------------------------|
| bio1  | Mean annual air temperature                   | °C     | Mean annual daily mean air temperatures averaged over 1 year                             |
| bio2  | Mean diurnal air temperature range            | °C     | Mean diurnal range of temperatures averaged over 1 year                                  |
| bio3  | Isothermality                                 | ratio  | Ratio of diurnal variation to annual variation in temperatures                           |
| bio4  | Temperature seasonality                       | °C/100 | Standard deviation of the monthly mean temperatures                                      |
| bio5  | Mean daily maximum air temperature of the warmest month | °C     | The highest temperature of any monthly daily mean maximum temperature                    |
| bio6  | Mean daily minimum air temperature of the coldest month | °C     | The lowest temperature of any monthly daily mean minimum temperature                     |
| bio7  | Annual range of air temperature               | °C     | The difference between the maximum temperature of warmest month and minimum temperature of coldest month |
| bio8  | Mean daily mean air temperatures of the wettest quarter | °C     | The wettest quarter of the year is determined (to the nearest month)                     |
| bio9  | Mean daily mean air temperatures of the driest quarter | °C     | The driest quarter of the year is determined (to the nearest month)                      |
| bio10 | Mean daily mean air temperatures of the warmest quarter | °C     | The warmest quarter of the year is determined (to the nearest month)                     |
| bio11 | Mean daily mean air temperatures of the coldest quarter | °C     | The coldest quarter of the year is determined (to the nearest month)                     |
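
Several of these variables are derived from others, which allows a quick consistency check on the extracted table. The sketch below uses made-up values. The `average_BioN` column names follow the naming used for the feature table later in this notebook, and it is an assumption that the values are stored in the units listed above rather than as rescaled integers.

```python
import pandas as pd

# Toy values in °C (made up for illustration, not taken from GeoPlant).
toy = pd.DataFrame(
    {
        "average_Bio2": [8.0, 10.0],   # mean diurnal temperature range
        "average_Bio5": [25.0, 30.0],  # daily maximum of the warmest month
        "average_Bio6": [-5.0, 5.0],   # daily minimum of the coldest month
    },
    index=pd.Index([1, 2], name="surveyId"),
)

# bio7: annual range = warmest-month maximum minus coldest-month minimum.
toy["bio7_check"] = toy["average_Bio5"] - toy["average_Bio6"]

# bio3: isothermality = diurnal range relative to the annual range.
toy["bio3_check"] = toy["average_Bio2"] / toy["bio7_check"]

print(toy[["bio7_check", "bio3_check"]])
```

If the stored `average_Bio7` differs noticeably from `bio7_check`, the values are probably rescaled or rounded and the units should be verified before modelling.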

#### Climatic - Precipitation Related Variables 


| Code   | Description                              | Unit    | Time Unit | Notes                                                                                     |
|--------|------------------------------------------|---------|-----------|-------------------------------------------------------------------------------------------|
| bio12  | Annual precipitation amount              | kg m⁻²  | year⁻¹    | Accumulated precipitation amount over 1 year                                            |
| bio13  | Precipitation amount of the wettest month | kg m⁻²  | month⁻¹   | The precipitation of the wettest month                                                  |
| bio14  | Precipitation amount of the driest month | kg m⁻²  | month⁻¹   | The precipitation of the driest month                                                   |
| bio15  | Precipitation seasonality                 | %       |           | Coefficient of variation: the standard deviation of the monthly precipitation estimates expressed as a percentage of their mean (annual mean) |
| bio16  | Mean monthly precipitation amount of the wettest quarter | kg m⁻²  | month⁻¹   | The wettest quarter of the year is determined (to the nearest month)                    |
| bio17  | Mean monthly precipitation amount of the driest quarter | kg m⁻²  | month⁻¹   | The driest quarter of the year is determined (to the nearest month)                     |
| bio18  | Mean monthly precipitation amount of the warmest quarter | kg m⁻²  | month⁻¹   | The warmest quarter of the year is determined (to the nearest month)                    |
| bio19  | Mean monthly precipitation amount of the coldest quarter | kg m⁻²  | month⁻¹   | The coldest quarter of the year is determined (to the nearest month)                    |
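
As a worked example of how the monthly-based variables relate to each other, the sketch below computes bio12 to bio15 from twelve made-up monthly totals. Using the population standard deviation for the coefficient of variation is an assumption; the data provider may use a slightly different convention.

```python
import numpy as np

# Twelve made-up monthly precipitation totals (kg m⁻²), for illustration only.
monthly_precip = np.array(
    [80, 70, 60, 50, 40, 30, 20, 30, 50, 70, 90, 100], dtype=float
)

annual_total = monthly_precip.sum()   # bio12: accumulated over the year
wettest_month = monthly_precip.max()  # bio13
driest_month = monthly_precip.min()   # bio14

# bio15: standard deviation of the monthly values as a percentage of their mean.
# np.std defaults to the population standard deviation (ddof=0).
seasonality = 100.0 * monthly_precip.std() / monthly_precip.mean()

print(annual_total, wettest_month, driest_month, round(seasonality, 1))
```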

### Land Cover 

Land Cover is retrieved from MODIS (Terra+Aqua) at a 500 m resolution. It is a categorical variable whose classes are coded from 1 to 17.

| Land Cover Type                 | Code | Description                                                                                       |
|-------------------------------|------|-------------------------------------------------------------------------------------------------|
| Evergreen Needleleaf Forests   | 1    | Dominated by evergreen conifer trees (canopy >2m). Tree cover >60%.                              |
| Evergreen Broadleaf Forests    | 2    | Dominated by evergreen broadleaf and palmate trees (canopy >2m). Tree cover >60%.                |
| Deciduous Needleleaf Forests   | 3    | Dominated by deciduous needleleaf (larch) trees (canopy >2m). Tree cover >60%.                   |
| Deciduous Broadleaf Forests    | 4    | Dominated by deciduous broadleaf trees (canopy >2m). Tree cover >60%.                            |
| Mixed Forests                  | 5    | Dominated by neither deciduous nor evergreen (40-60% of each) tree type (canopy >2m). Tree cover >60%. |
| Closed Shrublands              | 6    | Dominated by woody perennials (1-2m height) >60% cover.                                         |
| Open Shrublands                | 7    | Dominated by woody perennials (1-2m height) 10-60% cover.                                       |
| Woody Savannas                | 8    | Tree cover 30-60% (canopy >2m).                                                                 |
| Savannas                      | 9    | Tree cover 10-30% (canopy >2m).                                                                 |
| Grasslands                    | 10   | Dominated by herbaceous annuals (<2m).                                                          |
| Permanent Wetlands            | 11   | Permanently inundated lands with 30-60% water cover and >10% vegetated cover.                    |
| Croplands                    | 12   | At least 60% of area is cultivated cropland.                                                    |
| Urban and Built-up Lands      | 13   | At least 30% impervious surface area including building materials, asphalt, and vehicles.       |
| Cropland/Natural Vegetation Mosaics | 14 | Mosaics of small-scale cultivation 40-60% with natural tree, shrub, or herbaceous vegetation.    |
| Permanent Snow and Ice         | 15   | At least 60% of area is covered by snow and ice for at least 10 months of the year.             |
| Barren                        | 16   | At least 60% of area is non-vegetated barren (sand, rock, soil) areas with less than 10% vegetation. |
| Water Bodies                  | 17   | At least 60% of area is covered by permanent water bodies.                                      |
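
Because the land cover codes are labels rather than quantities, it helps to map them to their names before plotting or reporting. A minimal sketch: the lookup is copied from the table above, while the column name `LandCover` and the sample codes are hypothetical.

```python
import pandas as pd

# Code-to-name lookup copied from the land cover table.
LAND_COVER_NAMES = {
    1: "Evergreen Needleleaf Forests",
    2: "Evergreen Broadleaf Forests",
    3: "Deciduous Needleleaf Forests",
    4: "Deciduous Broadleaf Forests",
    5: "Mixed Forests",
    6: "Closed Shrublands",
    7: "Open Shrublands",
    8: "Woody Savannas",
    9: "Savannas",
    10: "Grasslands",
    11: "Permanent Wetlands",
    12: "Croplands",
    13: "Urban and Built-up Lands",
    14: "Cropland/Natural Vegetation Mosaics",
    15: "Permanent Snow and Ice",
    16: "Barren",
    17: "Water Bodies",
}

# Hypothetical column name and made-up codes, indexed by surveyId.
toy = pd.DataFrame(
    {"LandCover": [12, 4, 12, 17]},
    index=pd.Index([10, 11, 12, 13], name="surveyId"),
)

# Translate each numeric code into its class name.
toy["LandCoverName"] = toy["LandCover"].map(LAND_COVER_NAMES)
print(toy)
```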


### Human Footprint 

This dataset comes from the study by Venter et al. (2016), "Global terrestrial Human Footprint maps for 1993 and 2009". It has a spatial resolution of 1 km and covers eight variables measuring the direct and indirect human pressures on the environment globally in 1993 and 2009: 1) built environments, 2) population density, 3) electric infrastructure, 4) croplands, 5) pasture lands, 6) roads, 7) railways, and 8) navigable waterways.

| Variable                            | Meaning                         | Units                   | Interpretation                      |
| ----------------------------------- | ------------------------------- | ----------------------- | ----------------------------------- |
| **Built1994 / Built2009**           | Built-up area intensity         | 0–100                   | Higher = more urban development.    |
| **Croplands1992 / Croplands2005**   | Cropland extent                 | 0–100                   | Higher = more agriculture.          |
| **Lights1994 / Lights2009**         | Night-time light intensity      | 0–100                   | Proxy for human activity.           |
| **Pasture1993 / Pasture2009**       | Pasture area                    | 0–100                   | Livestock pressure.                 |
| **Popdensity1990 / Popdensity2010** | Population density              | people/km²              | High = more urban areas.            |
| **Railways / Roads**                | Linear infrastructure density   | arbitrary (often 0–255) | More = greater fragmentation.       |
| **HFP1993 / HFP2009**               | Composite Human Footprint Index | 0–50                    | Higher = more human pressure.       |
| **NavWater1994 / NavWater2009**     | Navigable water accessibility   | arbitrary               | Higher = closer to water transport. |

### Elevation
Retrieved from the ASTER Global Digital Elevation Model V3 at a 30 m resolution.

### Soil Data 
Provided by SoilGrids at a spatial resolution of 1 km (aggregated from 250 m). The features in this dataset are soil properties at 5-15 cm depth, such as pH, clay, organic carbon, and nitrogen.


| Variable     | Meaning                  | Units    | Interpretation                                                                          |
| ------------ | ------------------------ | -------- | --------------------------------------------------------------------------------------- |
| **bdod**     | Bulk density of dry soil | kg/m³    | High values = compact soil, low porosity (less root growth). Low = loose, aerated soil. |
| **cec**      | Cation exchange capacity | cmol/kg  | High = fertile soil (more nutrient retention). Low = poor nutrient retention.           |
| **cfvo**     | Coarse fragment volume   | %        | High = many stones/rocks (less plant rooting area).                                     |
| **clay**     | Clay fraction            | %        | High = fine-textured soil, retains water/nutrients but drains poorly.                   |
| **nitrogen** | Total nitrogen content   | g/kg     | High = fertile soil, supports vegetation growth.                                        |
| **phh2o**    | Soil pH in water         | pH units | 6–7 = optimal. <5 = acidic (bad), >8 = alkaline (bad).                                  |
| **sand**     | Sand fraction            | %        | High = coarse, drains well but low nutrient retention.                                  |
| **silt**     | Silt fraction            | %        | Medium texture — balanced soil (ideal around 40%).                                      |
| **soc**      | Soil organic carbon      | g/kg     | High = healthy, fertile soil, better structure. Low = degraded soil.                    |


In [None]:
## Import libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cartopy.crs as ccrs
import cartopy.feature as cfeature

### Import the created function
from src.utils import merge_metadata_data

#### Loading the datasets.

There are 4 datasets within this folder. They should be:
- Train
- Test
- Train Metadata
- Test Metadata 

In [None]:
train_path = ""
train_metadata_path = ""
test_path = ""
test_metadata_path = ""


In [4]:
path_metadata = "/mnt/d/desktop/COPERNICUS/Classes/3-semester/ml/species/output/train_metadata.parquet"
path_data = "/mnt/d/desktop/COPERNICUS/Classes/3-semester/ml/species/output/train_join.parquet"

## Build the full DataFrame from the metadata and feature files
df = merge_metadata_data(path_metadata, path_data)

## Drop metadata columns that are not needed
df = df.drop(columns=['year', 'geoUncertaintyInM', 'areaInM2', 'region', 'country'])
df = df.set_index("surveyId")

## Drop the non-feature columns from X only. df is kept intact so that rows of X
## can be mapped back to it: both share the same surveyId index in the same order.
columns_to_drop = ['lon', 'lat', 'speciesId', 'predict']

X = df.drop(columns=columns_to_drop)

### Describe Dataset

In [11]:
X.info(verbose=True, max_cols=49)

<class 'pandas.core.frame.DataFrame'>
Index: 13501 entries, 333 to 3918720
Data columns (total 46 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   average_Bio1                             13501 non-null  int64  
 1   average_Bio2                             13501 non-null  int64  
 2   average_Bio3                             13501 non-null  int64  
 3   average_Bio4                             13501 non-null  int64  
 4   average_Bio5                             13501 non-null  int64  
 5   average_Bio6                             13501 non-null  int64  
 6   average_Bio7                             13501 non-null  int64  
 7   average_Bio8                             13501 non-null  int64  
 8   average_Bio9                             13501 non-null  int64  
 9   average_Bio10                            13501 non-null  int64  
 10  average_Bio11                            13501 

### Features 

The dataset consists only of numerical features. However, not all of them represent quantitative variables. LandCover, for instance, is a categorical (qualitative) variable stored as numeric codes.

### Output

The output is binary, either 1 or 0. It indicates whether the surveyId contains the target species or not.

The full dataset is simply X concatenated with df["predict"], aligned on the surveyId.
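
A minimal sketch of this alignment with made-up values. `DataFrame.join` is used here as one way to combine the two objects on their shared surveyId index; it is not necessarily what the notebook itself does.

```python
import pandas as pd

# Made-up feature table indexed by surveyId.
features = pd.DataFrame(
    {"average_Bio1": [10, 12, 8]},
    index=pd.Index([333, 500, 742], name="surveyId"),
)

# Made-up target, deliberately stored in a different row order.
target = pd.Series(
    [1, 0, 1],
    index=pd.Index([742, 333, 500], name="surveyId"),
    name="predict",
)

# join matches rows on the index, so the row order of the target does not matter.
dataset = features.join(target)
print(dataset)
```

Matching on the index rather than on position guards against silent misalignment if either object is later sorted or filtered.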

# Exploratory Data Analysis

In [None]:
### Plot the heatmap of the whole dataset here


## Comments on the heatmap
We expect the strongest correlations to appear within each major dataset.

### Climatic Variables

Here we look at the correlation between the climatic variables, first across the whole climatic subset and then split into two groups: temperature and precipitation.
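
The split can be sketched as follows on synthetic data. The `average_BioN` naming follows the feature table above; that bio12 to bio19 use the same pattern is an assumption, as are all the generated values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the climatic columns (values are random, not GeoPlant data).
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(10, 5, n)
toy = pd.DataFrame(
    {
        "average_Bio1": temperature,
        "average_Bio10": temperature + rng.normal(0, 1, n),  # built to track Bio1
        "average_Bio12": rng.normal(800, 200, n),
        "average_Bio13": rng.normal(100, 30, n),
    }
)

# bio1-bio11 are temperature variables, bio12-bio19 precipitation variables.
temperature_cols = [c for c in (f"average_Bio{i}" for i in range(1, 12)) if c in toy.columns]
precipitation_cols = [c for c in (f"average_Bio{i}" for i in range(12, 20)) if c in toy.columns]

corr_all = toy[temperature_cols + precipitation_cols].corr()  # whole climatic subset
corr_temperature = toy[temperature_cols].corr()               # temperature only
corr_precipitation = toy[precipitation_cols].corr()           # precipitation only

print(corr_temperature.round(2))
```

Each of the three matrices can then be passed to `sns.heatmap` for plotting.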

### Soil Variables

- correlation 
- distribution 



### LandUse

This section should show the distribution of each LandUse class to reveal how unbalanced the classes are, followed by a commentary on that imbalance.
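
One way to quantify the imbalance is the share of surveys per class, sketched here with made-up codes in a hypothetical `LandCover` column.

```python
import pandas as pd

# Made-up land cover codes for ten surveys.
land_cover = pd.Series([12, 12, 12, 12, 4, 4, 10, 13, 12, 12], name="LandCover")

# Fraction of surveys in each class, most frequent first.
class_share = land_cover.value_counts(normalize=True)

# Ratio between the most and the least frequent class.
imbalance_ratio = class_share.iloc[0] / class_share.iloc[-1]

print(class_share)
print(f"Most frequent class is {imbalance_ratio:.1f}x the least frequent one")
```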

### Elevation

### Human Footprint

## Pre-Processing

- StandardScaler
- OneHotEncoder

All numerical features are standardized column by column: for each column we subtract the mean and divide by the standard deviation. The result has zero mean and unit variance in every column; the values are not confined to the range between zero and one.

One caveat: the StandardScaler is not applied to LandUse, since scaling a categorical code makes no sense. For this column we apply one-hot encoding instead.
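
A minimal sketch of this preprocessing with scikit-learn's `ColumnTransformer`, on made-up data. The column names `Elevation` and `LandCover` are hypothetical, and the real feature table has many more numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up feature table: two numeric columns and one categorical code.
toy = pd.DataFrame(
    {
        "average_Bio1": [5.0, 10.0, 15.0, 10.0],
        "Elevation": [100.0, 300.0, 900.0, 500.0],
        "LandCover": [12, 4, 12, 10],
    }
)

numeric_cols = ["average_Bio1", "Elevation"]
categorical_cols = ["LandCover"]

preprocessor = ColumnTransformer(
    transformers=[
        # Subtract the column mean and divide by the column standard deviation.
        ("num", StandardScaler(), numeric_cols),
        # One binary column per land cover class seen during fitting.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocessor.fit_transform(toy)
print(X_prepared.shape)
print(np.round(X_prepared, 2))
```

In a train/test setting the transformer should be fitted on the training data only and then applied to the test data with `transform`, so that test statistics do not leak into the scaling.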



### Outliers

Given the nature of the climatic variables and the variability of the landscape, we consider that the climatic variables contain no outliers, for two main reasons. First, they are averages over 30 years of data. Second, an extreme value at a single surveyId is not an outlier; it reflects how the climate varies across the landscape. The same applies to the soil data: no values in that dataset are treated as outliers.

For elevation, we treat all values above 1500 meters as outliers, for two main reasons. First, from about 1500 m the oxygen level starts to decrease significantly and affects tree growth. Second, since our target species is the most frequent species in France, we assume that it cannot be distributed only above 1500 m, so in our case study such values are defined as outliers. All altitudes above this threshold are replaced with the cap of 1500 m.
We also replace all values above 3 times the standard deviation with that limit itself, which means that altitudes above 1200 m are replaced by 1200 meters.
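
Both capping rules can be sketched with `Series.clip` on made-up elevations. The column name `Elevation` is hypothetical, and reading the second rule as "mean plus three standard deviations" is an assumption about what the text intends.

```python
import pandas as pd

# Made-up elevations in meters.
elevation = pd.Series([120.0, 450.0, 980.0, 1650.0, 2400.0], name="Elevation")

# Rule 1: fixed cap. Values above 1500 m are replaced by 1500 m.
capped_fixed = elevation.clip(upper=1500.0)

# Rule 2 (assumed reading): cap at the mean plus three standard deviations.
upper_limit = elevation.mean() + 3 * elevation.std()
capped_sigma = elevation.clip(upper=upper_limit)

print(capped_fixed.tolist())
print(round(upper_limit, 1))
```

Capping keeps every survey in the dataset, unlike dropping rows, but it creates a spike of identical values at the cap that is worth keeping in mind when reading the elevation histogram.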

LandUse is a qualitative variable and there is no variable however 