# Environmental Variable Statistics and Experimental Design

This notebook performs statistical analysis of environmental variables and designs experiments for species distribution modeling. It describes variable relationships and distributions, and helps optimize the model configuration.

## Key Analyses:

### 1. **Variable Statistics**:
- Descriptive statistics for all environmental variables
- Distribution analysis (mean, median, standard deviation, skewness)
- Correlation analysis between variables
- Outlier detection and data quality assessment

### 2. **Experimental Design**:
- Variable selection optimization
- Multicollinearity assessment
- Principal Component Analysis (PCA)
- Variable importance ranking

### 3. **Data Exploration**:
- Spatial patterns in environmental variables
- Temporal trends and seasonality
- Cross-correlation matrices
- Distribution plots and histograms

## Applications:
- **Model Optimization**: Select optimal variable combinations
- **Quality Control**: Identify problematic variables or outliers
- **Ecological Interpretation**: Understand environmental gradients
- **Experimental Planning**: Design robust modeling experiments

## Statistical Methods:
- Correlation analysis (Pearson, Spearman)
- Principal Component Analysis (PCA)
- Variance Inflation Factor (VIF) for multicollinearity
- Distribution fitting and normality tests
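
As a minimal sketch of the VIF check listed above, the VIF of each variable can be read from the diagonal of the inverse Pearson correlation matrix. The helper name `vif_from_corr` and the synthetic columns are assumptions for illustration only; they are not part of the analysis cells in this notebook.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df):
    """Return the VIF of each column as the diagonal of the inverse correlation matrix."""
    corr = df.corr(method='pearson').to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name='VIF')

# Synthetic example (hypothetical data): 'b' is nearly a copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})
vif = vif_from_corr(demo)
print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; the exact cut-off is a modeling choice.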

# Maximum Entropy Principle in Species Distribution Modeling

The Maximum Entropy (MaxEnt) principle is fundamental to species distribution modeling. It states that among all possible probability distributions that satisfy given constraints, the one with maximum entropy is the most unbiased and should be preferred.

**Key Concepts:**
- **Entropy**: Measures uncertainty or information content in a probability distribution
- **Constraints**: Environmental conditions and species occurrence data
- **Unbiased**: No assumptions beyond the available data
- **Optimal**: Provides the most conservative estimate given the constraints

**Reference**: [The Principle of Maximum Entropy](https://medium.com/intuition/the-principle-of-maximum-entropy-ec5fa2f84a0c)

This principle guides our variable selection and model design to ensure robust, unbiased predictions of species distributions.

In [None]:
############### EXPERIMENTAL CONFIGURATION - MODIFY AS NEEDED ###############

# =============================================================================
# SPECIES AND MODELING CONFIGURATION
# =============================================================================
# Define the target species and modeling parameters for statistical analysis
# Uncomment and modify the lines below to change the analysis configuration

# specie = 'leptocybe-invasa'  # Target species: 'leptocybe-invasa' or 'thaumastocoris-peregrinus'
# pseudoabsence = 'random'  # Background point strategy: 'random', 'biased', 'biased-land-cover'
# training = 'east-asia'  # Training region for statistical analysis
# interest = 'south-east-asia'  # Region of interest for comparison

# =============================================================================
# ENVIRONMENTAL VARIABLE SELECTION FOR STATISTICAL ANALYSIS
# =============================================================================
# Choose which bioclimatic variables to include in the analysis
# Each number corresponds to a specific bioclimatic variable (1-19)

# bioclim = [i for i in range(1,20)]  # All 19 bioclimatic variables (comprehensive analysis)

# =============================================================================
# CURRENT CONFIGURATION (USING PREVIOUSLY DEFINED VARIABLES)
# =============================================================================
bioclim = bioclim  # Use current bioclimatic variable set from previous analysis
topo = topo1  # Topographic variables (elevation, slope, aspect)
savefig = True  # Save generated plots and statistics to output directory
bio = bio1  # Bioclimatic variable identifier for file naming
ndvi = ndvi1  # NDVI (Normalized Difference Vegetation Index) variable inclusion

###########################################################

In [None]:
# =============================================================================
# IMPORT REQUIRED LIBRARIES
# =============================================================================

# Standard library imports
import os  # File system operations and path manipulation

# Core scientific computing libraries
import numpy as np  # Numerical computing and array operations
import xarray as xr  # Multi-dimensional labeled arrays for raster data handling
import rioxarray as rioxr  # Raster I/O operations for xarray (geospatial data)

# Data manipulation and analysis libraries
import pandas as pd  # Data manipulation and analysis for tabular data
import geopandas as gpd  # Geospatial data handling and spatial operations
import elapid as ela  # Species distribution modeling library for SDM analysis
import scipy.stats as stats  # Statistical functions and tests for probability distributions

# Visualization library
import matplotlib.pyplot as plt  # Plotting and visualization for statistical plots

# =============================================================================
# MATPLOTLIB CONFIGURATION FOR PUBLICATION-QUALITY PLOTS
# =============================================================================
# Configure matplotlib parameters for consistent, publication-ready figure formatting
params = {'legend.fontsize': 'x-large',      # Large legend text for readability
         'axes.labelsize': 'x-large',        # Large axis labels
         'axes.titlesize':'x-large',         # Large plot titles
         'xtick.labelsize':'x-large',        # Large x-axis tick labels
         'ytick.labelsize':'x-large'}        # Large y-axis tick labels
plt.rcParams.update(params)  # Apply the formatting parameters globally

In [None]:
def subplot_layout(nplots):
    """
    Calculate optimal subplot layout for given number of plots.
    
    This function determines the best arrangement of subplots to create a balanced
    grid layout that minimizes empty space while maintaining readability.
    
    Parameters:
    -----------
    nplots : int
        Number of plots to arrange in the subplot grid.
        Must be a positive integer.
    
    Returns:
    --------
    ncols, nrows : tuple of int
        Number of columns and rows for the subplot layout.
        - ncols: Number of columns (maximum 4 for readability)
        - nrows: Number of rows needed to accommodate all plots
    
    Algorithm:
    ----------
    1. Calculate the square root of the number of plots for a balanced layout
    2. Round up to ensure all plots fit
    3. Limit maximum columns to 4 for optimal readability
    4. Calculate required rows based on columns and total plots
    
    Examples:
    ---------
    >>> subplot_layout(6)
    (3, 2)  # 3 columns, 2 rows
    >>> subplot_layout(12)
    (4, 3)  # 4 columns, 3 rows (max columns reached)
    >>> subplot_layout(1)
    (1, 1)  # Single plot
    """
    
    # Calculate square root and round up for balanced layout
    # This ensures a roughly square arrangement when possible
    ncols = min(int(np.ceil(np.sqrt(nplots))), 4)  # Max 4 columns for readability
    
    # Calculate rows needed to accommodate all plots
    # Ceiling division ensures we have enough rows for all plots
    nrows = int(np.ceil(nplots / ncols))
    
    return ncols, nrows

In [None]:
# =============================================================================
# SET UP FILE PATHS AND DIRECTORY STRUCTURE
# =============================================================================

out_path = os.path.join(os.path.dirname(os.getcwd()), 'out', specie)
docs_path = os.path.join(os.path.dirname(os.getcwd()), 'docs')
figs_path = os.path.join(os.path.dirname(os.getcwd()), 'figs')
input_path = os.path.join(out_path, 'input')
output_path = os.path.join(out_path, 'output')

In [None]:
# =============================================================================
# ENVIRONMENTAL VARIABLE CONFIGURATION AND DATA LOADING
# =============================================================================
# Build lists of raster files and labels for statistical analysis
# This section dynamically constructs file paths based on configuration settings

# =============================================================================
# INITIALIZE RASTER AND LABEL LISTS
# =============================================================================
# Start with empty lists that will be populated based on enabled variables

# Initialize with topographic variables if enabled
# SRTM (Shuttle Radar Topography Mission) provides elevation data
rasters, labels = (
    (['srtm_%s.tif' % training], ['srtm']) if topo else ([], [])
)

# Add NDVI (Normalized Difference Vegetation Index) variables if enabled
# NDVI provides vegetation health and density information
rasters, labels = (
    rasters + (['ndvi_%s.tif' % training] if ndvi else []),
    labels  + (['ndvi'] if ndvi else [])
)

# =============================================================================
# ADD BIOCLIMATIC VARIABLES (HISTORICAL OR FUTURE SCENARIOS)
# =============================================================================
# Add bioclimatic variables based on whether we're analyzing historical or future data
# Each bioclimatic variable represents a specific climate characteristic

if Future:
    # Future climate projections using climate model data
    # File naming convention: {model_prefix}_bio_{variable_number}_{training_region}_future.tif
    for no in bioclim:
        rasters.append('%s_bio_%s_%s_future.tif' %(model_prefix, no, training))
        labels.append('bioclim_%02d' %no)  # Format with zero-padding (e.g., bioclim_01, bioclim_12)
else:
    # Historical climate data from WorldClim or similar sources
    # File naming convention: {model_prefix}_bio_{variable_number}_{training_region}.tif
    for no in bioclim:
        rasters.append('%s_bio_%s_%s.tif' %(model_prefix, no, training))
        labels.append('bioclim_%02d' %no)  # Format with zero-padding

# =============================================================================
# CONSTRUCT FULL FILE PATHS AND LOAD RASTER DATA
# =============================================================================
# Create complete file paths by joining with the input directory
raster_paths = [os.path.join(input_path, raster) for raster in rasters]

# Initialize xarray Dataset to store all environmental variables
# xarray provides labeled multi-dimensional arrays ideal for geospatial data
training_data = xr.Dataset()

# Load each raster file and add to the dataset
for raster, label in zip(raster_paths, labels):
    # Open raster with rioxarray (handles geospatial metadata)
    # masked=True ensures missing values are properly handled
    da = rioxr.open_rasterio(raster, masked=True)
    training_data[label] = da  # Add to dataset with the specified label

# =============================================================================
# OPTIONAL: VISUALIZATION AND DATA EXPORT (COMMENTED OUT)
# =============================================================================
# Uncomment the following sections if you want to:
# 1. Print raster and label information for debugging
# 2. Create a visualization of all loaded rasters
# 3. Export the dataset to NetCDF format

# # Debug: Print loaded rasters and labels
# # print(rasters)
# # print(labels)

# Create a comprehensive plot of all raster data
# This creates a grid showing all environmental variables
# num_plots = len(labels)
# fig, axes = plt.subplots(4, 5, figsize=(20, 16))  # 4 rows, 5 columns
# axes = axes.flatten()

# for ax, label in zip(axes, labels):
#     training_data[label].plot(ax=ax, cmap='viridis')
#     ax.set_title(label)
#     ax.axis('off')

# # Hide empty subplots if there are fewer variables than subplot spaces
# for ax in axes[len(labels):]:
#     ax.axis('off')

# plt.tight_layout()
# plt.show()

# # Print dataset information
# print(training_data)

# # Export dataset to NetCDF format for later use
# training_data.to_netcdf('../data/training_dataxx.nc')


In [None]:
# =============================================================================
# LOAD AND ANNOTATE OCCURRENCE DATA
# =============================================================================
# Load presence and background points, then extract environmental values
# This section prepares the training data for statistical analysis by combining
# species occurrence data with environmental variables

# =============================================================================
# LOAD PRESENCE POINT DATA
# =============================================================================
# Load species presence records from CSV file
# File naming convention: {specie}_presence_{training_region}_{iteration}.csv
presence_file_name = '%s_presence_%s_%s.csv' %(specie, training, iteration)
presence_csv = pd.read_csv(os.path.join(input_path, 'train', presence_file_name))

# Convert longitude and latitude coordinates to GeoPandas geometry
# This creates point geometries from coordinate pairs
geometry = gpd.points_from_xy(presence_csv['lon'], presence_csv['lat'])
presence_gdf = gpd.GeoDataFrame(geometry=geometry, crs='EPSG:4326')  # WGS84 coordinate system

# =============================================================================
# LOAD BACKGROUND POINT DATA
# =============================================================================
# Load background/pseudo-absence points from CSV file
# File naming convention: {specie}_background_{pseudoabsence_strategy}_{training_region}.csv
background_file_name = '%s_background_%s_%s.csv' %(specie, pseudoabsence, training)
background_csv = pd.read_csv(os.path.join(input_path, 'train', background_file_name))

# Convert longitude and latitude coordinates to GeoPandas geometry
geometry = gpd.points_from_xy(background_csv['lon'], background_csv['lat'])
background_gdf = gpd.GeoDataFrame(geometry=geometry, crs='EPSG:4326')  # WGS84 coordinate system

# =============================================================================
# EXTRACT ENVIRONMENTAL VALUES AT OCCURRENCE POINTS
# =============================================================================
# Use elapid's annotate function to extract environmental variable values
# at each occurrence point location

# Extract environmental values at presence points
# This creates a DataFrame with environmental variables for each presence point
presence_train = ela.annotate(
    presence_gdf.geometry,        # Point geometries for presence locations
    raster_paths=raster_paths,    # List of environmental raster file paths
    labels=labels,                # Variable names corresponding to each raster
    drop_na=True,                 # Remove points with missing environmental data
    quiet=True                    # Suppress progress messages
)

# Extract environmental values at background points
# This creates a DataFrame with environmental variables for each background point
background_train = ela.annotate(
    background_gdf,               # Point geometries for background locations
    raster_paths=raster_paths,    # List of environmental raster file paths
    labels=labels,                # Variable names corresponding to each raster
    drop_na=True,                 # Remove points with missing environmental data
    quiet=True                    # Suppress progress messages
)

# Additional data cleaning step to ensure no missing values remain
background_train = background_train.dropna()

## 1. Entropy
Entropy is an old concept in physics that measures the disorder of a system: higher entropy means more disorder. The mathematician Claude Shannon introduced entropy to information theory in 1948, where it is defined as the expected amount of information (in bits, when the logarithm is base 2) carried by the outcome of a random event.[1](https://medium.com/intro-to-artificial-intelligence/maximum-entropy-reinforcement-learning-ee7ad77289c0)

$$ f(X) = - \sum_{i=1}^n P(x_i) \log P(x_i) \qquad (\mathrm{entropy}) $$ 

$$ g(X) = \sum_{i=1}^n P(x_i) = 1 \qquad (\mathrm{constraint})$$

where $X=\{x_1, x_2, ..., x_n\}$ are the environmental variables and $p_j = P(x_j)$.

Maximising the entropy subject to the constraint, with Lagrange multiplier $\lambda$:

$$ \frac{\partial f}{\partial p_j} - \lambda\frac{\partial g}{\partial p_j} = 0 $$

where $j = 1, 2, ..., n$

$$ -\log p_j - 1 - \lambda \cdot 1 = 0 $$
[solution](https://www.youtube.com/watch?v=ol8-kZFTLfg)
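
Solving the stationarity condition gives $p_j = e^{-1-\lambda}$, the same value for every $j$, so the normalisation constraint forces the uniform distribution. The sketch below checks this numerically; the helper `shannon_entropy` and the example distributions are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy -sum(p * log p) in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

n = 4
uniform = np.full(n, 1.0 / n)
skewed = np.array([0.7, 0.1, 0.1, 0.1])
certain = np.array([1.0, 0.0, 0.0, 0.0])

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
h_certain = shannon_entropy(certain)
print(h_uniform, h_skewed, h_certain)

# Randomly drawn distributions on n outcomes never exceed the uniform entropy log(n)
rng = np.random.default_rng(1)
random_h = [shannon_entropy(rng.dirichlet(np.ones(n))) for _ in range(1000)]
max_random_h = max(random_h)
print(max_random_h <= h_uniform)
```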


Gibbs probability density function

$$ p_1({\bf x}) = p({\bf x})e^{-{\bf x}} $$
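
To illustrate how an exponential (Gibbs) form arises, the sketch below adds a mean constraint to the normalisation constraint for a six-sided die. The maximum-entropy solution is then $p_i \propto e^{\lambda x_i}$, with $\lambda$ chosen so that the mean matches. The target mean of 4.5 and all names are assumptions for illustration and are not part of the species analysis.

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
target_mean = 4.5  # hypothetical constraint on the expected value

def gibbs(lam):
    """Normalised exponential-family distribution over the die faces."""
    w = np.exp(lam * faces)
    return w / w.sum()

def mean_gap(lam):
    """Difference between the distribution mean and the target mean."""
    return float(gibbs(lam) @ faces) - target_mean

# The mean increases monotonically with lam, so a bracketing root finder is enough
lam = brentq(mean_gap, -5.0, 5.0)
p = gibbs(lam)
print(lam, p.round(4))
```

Because the target mean is above the fair-die mean of 3.5, the fitted multiplier is positive and the probabilities increase with the face value.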

## 1.1 Probability density plot
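
The Maxwell distribution is one parametric choice among many, so it is worth checking how well it describes a given variable. The sketch below fits a Maxwell distribution to synthetic data and computes a Kolmogorov-Smirnov statistic, with a normal fit for comparison. The synthetic sample and its parameters are assumptions; in practice the background values of a variable would take their place.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample drawn from a known Maxwell distribution (hypothetical parameters)
rng = np.random.default_rng(42)
sample = stats.maxwell.rvs(loc=5.0, scale=2.0, size=2000, random_state=rng)

# Fit location and scale by maximum likelihood, as in the analysis cell
loc, scale = stats.maxwell.fit(sample)
ks_maxwell = stats.kstest(sample, 'maxwell', args=(loc, scale))

# For comparison, fit a normal distribution to the same sample
norm_params = stats.norm.fit(sample)
ks_norm = stats.kstest(sample, 'norm', args=norm_params)

print(f"Maxwell: loc={loc:.2f}, scale={scale:.2f}, KS={ks_maxwell.statistic:.3f}")
print(f"Normal:  KS={ks_norm.statistic:.3f}")
```

A smaller KS statistic indicates a closer match between the fitted and empirical distributions. The p-values are optimistic when the parameters are estimated from the same sample, so they are best read as a relative guide.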

In [None]:
# =============================================================================
# PROBABILITY DENSITY FUNCTION CALCULATION
# =============================================================================
# Calculate probability density functions for environmental variables
# This analysis fits Maxwell distributions to background data to understand
# the environmental space available to the species

# Initialize dictionary to store probability density data for each variable
var_data = {}

# Map bioclimatic variable numbers to standardized labels with zero-padding
# This ensures consistent naming (e.g., bioclim_01, bioclim_12)
aa = [f'bioclim_{str(num).zfill(2)}' for num in bioclim]

# =============================================================================
# PROBABILITY DENSITY FUNCTION FITTING
# =============================================================================
# Fit Maxwell distributions to background data for each environmental variable
# The Maxwell distribution is used here as a simple parametric form;
# compare each fit with the histogram before relying on it

nbins = 100  # Number of bins for probability density calculation

for name in aa:
    # Extract background data for the current variable
    var = background_train[name]
    
    # Calculate bin width and create extended bin range
    # Extended range helps capture the full distribution
    dx = (var.max() - var.min()) / nbins
    bins = np.linspace(var.min() - 10*dx, var.max() + 10*dx, nbins)
    
    # Fit Maxwell distribution to the background data
    # (a continuous distribution with location and scale parameters)
    fit = stats.maxwell.fit(var)
    
    # Calculate probability density function values for the fitted distribution
    pdf = stats.maxwell.pdf(bins, *fit)
    
    # Store results in the var_data dictionary
    var_data[name] = {}
    var_data[name]['bins'] = bins          # Evaluation points for the fitted PDF
    var_data[name]['pdf'] = pdf            # Probability density values
    # Descriptive name from raster metadata; fall back to the label if it is missing
    var_data[name]['long_name'] = training_data[name].attrs.get('long_name', name)


In [None]:
# =============================================================================
# CREATE HISTOGRAM PLOTS WITH FITTED PROBABILITY DENSITY FUNCTIONS
# =============================================================================
# Generate subplot layout and create histograms showing both empirical data
# and fitted Maxwell probability density functions for each environmental variable

# Calculate optimal subplot layout using the custom function
ncols, nrows = subplot_layout(len(aa))

# Create figure with subplots
# Figure size scales with number of subplots for optimal readability
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols*6, nrows*6))

# Handle single subplot case (when only one variable is selected)
if (nrows, ncols) == (1, 1):
    ax = [axs]  # Convert single axis to list for consistent indexing
else:
    ax = axs.ravel()  # Flatten 2D array of axes to 1D for easy iteration

# =============================================================================
# PLOT HISTOGRAMS AND FITTED DISTRIBUTIONS
# =============================================================================
# Create histogram plots for each environmental variable
# Each plot shows both the empirical distribution and fitted Maxwell PDF

for i in range(len(aa)):
    # Plot histogram of background data
    # density=True normalizes the histogram to show probability density
    background_train[aa[i]].plot.hist(ax=ax[i], bins=20, density=True, facecolor='lightgray', edgecolor='darkgray')
    
    # Overlay fitted Maxwell probability density function
    ax[i].plot(var_data[aa[i]]['bins'], var_data[aa[i]]['pdf'], lw=3, label='maxwell pdf')
    
    # Set axis labels using descriptive names from raster metadata
    ax[i].set_xlabel(var_data[aa[i]]['long_name'])
    ax[i].set_ylabel('Probability Density')

# Hide every unused subplot when the grid has more slots than variables
for unused_ax in ax[len(aa):]:
    unused_ax.set_axis_off()

In [None]:
if savefig:
    # Include the model prefix in the filename only when `models` is set
    name_parts = [training, bio] + ([model_prefix] if models else []) + [iteration]
    # Future scenarios get a '_future' suffix
    suffix = '_future' if Future else ''
    file_name = '04_pdf_bioclim_%s%s.png' % ('_'.join(str(part) for part in name_parts), suffix)
    file_path = os.path.join(figs_path, file_name)
    fig.savefig(file_path, transparent=True)

In [None]:
# fig, ax = plt.subplots(ncols=1, figsize=(8,6))

# background_train['bioclim_01'].plot.hist(ax=ax, bins=20, density=True, facecolor='lightgray', edgecolor='darkgray')
# plt.plot(var_data['bioclim_01']['bins'], var_data['bioclim_01']['pdf'], lw=3, label='maxwell pdf')
# ax.set_xlabel(var_data['bioclim_01']['long_name'])
# ax.set_ylabel('Probability Density')
# fig.savefig(os.path.join(docs_path, '04_pdf_bioclim-1_%s.png' %training), transparent=True, dpi=600)

# background_train['bioclim_12'].plot.hist(ax=ax, bins=20, density=True, facecolor='lightgray', edgecolor='darkgray')
# ax.plot(var_data['bioclim_12']['bins'], var_data['bioclim_12']['pdf'], lw=3, label='maxwell pdf')
# ax.set_xlabel(var_data['bioclim_12']['long_name'])
# ax.set_ylabel('Probability Density')
# fig.savefig(os.path.join(docs_path, '04_pdf_bioclim-12.png'), transparent=True, dpi=600)


## 1.2 Probability density presence and background
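
A visual comparison of presence and background histograms can be complemented by a two-sample Kolmogorov-Smirnov test, which measures the largest gap between the two empirical cumulative distributions. The sketch below uses synthetic values; the arrays `presence` and `background` are hypothetical stand-ins for one column of the annotated point data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(7)
background = rng.normal(loc=20.0, scale=5.0, size=5000)  # hypothetical background values
presence = rng.normal(loc=24.0, scale=2.0, size=200)     # hypothetical presence values

# Two-sample KS test: a large statistic means the distributions differ strongly
result = stats.ks_2samp(presence, background)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```

Variables with a large statistic are those where the species occupies a narrower or shifted part of the available environmental range.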

In [None]:
# =============================================================================
# COMPARATIVE HISTOGRAM PLOTS: PRESENCE VS BACKGROUND
# =============================================================================
# Create side-by-side histograms comparing environmental variable distributions
# between species presence points and background points for all variables

# Define color scheme for presence and background data
pair_colors = ['tab:blue', 'tab:red']  # Blue for presence, red for background

# Calculate optimal subplot layout for all environmental variables
ncols, nrows = subplot_layout(len(labels))

# Create figure with subplots
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols*6, nrows*6))

# Handle single subplot case
if (nrows, ncols) == (1, 1):
    ax = [axs]
else:
    ax = axs.ravel()

# Get list of variable names for accessing metadata
xlabels = list(training_data.data_vars.keys())

# =============================================================================
# CREATE COMPARATIVE HISTOGRAMS FOR EACH VARIABLE
# =============================================================================
# Plot grouped histograms (bars placed side by side within each bin) showing
# distribution differences between presence and background points for each variable

for iax, label in enumerate(labels):
    # Extract presence and background data for current variable
    pvar = presence_train[label]  # Environmental values at presence points
    bvar = background_train[label]  # Environmental values at background points
    
    # Create grouped histograms (presence and background bars share the same bins)
    ax[iax].hist(
        [pvar, bvar],                    # Data arrays for presence and background
        density=True,                    # Normalize to show probability density
        alpha=0.7,                       # Semi-transparent for overlap visibility
        label=['presence', 'background'], # Legend labels
        color=pair_colors,               # Color scheme defined above
    )
    
    # Set subplot title to variable name
    ax[iax].set_title(label)
    
    # Set x-axis label using descriptive name from raster metadata
    try:
        ax[iax].set_xlabel(training_data[xlabels[iax]].long_name)
    except AttributeError:
        # Fallback if long_name attribute is not available
        ax[iax].set_xlabel('No variable long_name')

# =============================================================================
# ADD LEGEND AND FORMATTING
# =============================================================================
# Create a single legend for the entire figure
handles, lbls = ax[iax].get_legend_handles_labels()
fig.legend(handles, lbls, loc='upper right', bbox_to_anchor=(0.16, 0.965))

# Adjust layout to prevent overlapping elements
plt.tight_layout()

# Hide empty subplots if there are more subplot spaces than variables
for axi in ax:
    if not axi.title.get_text():
        axi.set_visible(False)


In [None]:
# if savefig:
#     if Future:
#         fig.savefig(os.path.join(figs_path, '04_pdf_env-variables_%s_%s_future.png' %(training, bio)), transparent=True)
#     else:    
#         fig.savefig(os.path.join(figs_path, '04_pdf_env-variables_%s_%s.png' %(training, bio)), transparent=True)

if savefig:
    # Include the model prefix in the filename only when `models` is set
    name_parts = [training, bio] + ([model_prefix] if models else []) + [iteration]
    # Future scenarios get a '_future' suffix
    suffix = '_future' if Future else ''
    file_name = '04_pdf_env-variables_%s%s.png' % ('_'.join(str(part) for part in name_parts), suffix)
    file_path = os.path.join(figs_path, file_name)
    fig.savefig(file_path, transparent=True)

## 2. Variable correlation matrix

In [None]:
# =============================================================================
# PREPARE DATA FOR CORRELATION ANALYSIS
# =============================================================================
# Convert xarray dataset to pandas DataFrame for correlation analysis
# This step extracts the environmental variable values from the raster data

# Alternative approach (commented out): Merge multiple datasets
# ds = xr.merge([bioclim, srtm_region])

# Convert xarray dataset to pandas DataFrame
# isel(band=0): Select first band (most raster data has single band)
# reset_coords: Remove coordinate variables that aren't needed for correlation
# to_dataframe(): Convert to pandas DataFrame for statistical analysis
df = training_data.isel(band=0).reset_coords(['band', 'spatial_ref'], drop=True).to_dataframe()

# Calculate Spearman correlation matrix
# Spearman correlation is rank-based and more robust to outliers than Pearson
# It measures monotonic relationships between variables
correlation_matrix = df.corr(method='spearman')
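
Spearman's coefficient is the Pearson coefficient computed on ranks, which is why it captures monotonic but non-linear relationships. A minimal sketch on synthetic data follows; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=300)
demo = pd.DataFrame({'x': x, 'y': np.exp(x)})  # monotonic, strongly non-linear

pearson = demo.corr(method='pearson').loc['x', 'y']
spearman = demo.corr(method='spearman').loc['x', 'y']
# Ranking the data first and then applying Pearson reproduces Spearman
pearson_on_ranks = demo.rank().corr(method='pearson').loc['x', 'y']

print(f"Pearson = {pearson:.3f}, Spearman = {spearman:.3f}")
```

The Spearman value reaches 1 for any strictly increasing relationship, while the Pearson value stays below 1 because the relationship is not linear.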

In [None]:
# =============================================================================
# CREATE CORRELATION MATRIX HEATMAP VISUALIZATION
# =============================================================================
# Generate a heatmap showing correlations between all environmental variables
# This helps identify multicollinearity and variable relationships

# Create figure with constrained layout for better spacing
fig, ax = plt.subplots(figsize=(12, 12), constrained_layout=True)

# Create heatmap using imshow with coolwarm colormap
# coolwarm: blue for negative correlations, red for positive correlations
im = ax.imshow(correlation_matrix, cmap='coolwarm', interpolation='nearest')

# =============================================================================
# ADD CORRELATION VALUES AS TEXT OVERLAY
# =============================================================================
# Display numerical correlation values on each cell of the heatmap
# This provides precise correlation coefficients for interpretation

for i in range(correlation_matrix.shape[0]):
    for j in range(correlation_matrix.shape[1]):
        # Draw on the heatmap axes explicitly rather than on the current pyplot axes
        ax.text(j, i, f"{correlation_matrix.iloc[i, j]:.2f}",
                ha='center', va='center', color='white', weight='bold')

# =============================================================================
# CONFIGURE AXIS LABELS AND TITLES
# =============================================================================
# Set up axis labels and formatting for the correlation matrix

# Get column names for axis labels
columns = df.columns.tolist()

# Set x-axis ticks and labels (rotated 90 degrees for readability)
ax.set_xticks(range(len(columns)), columns, rotation=90)

# Set y-axis ticks and labels
ax.set_yticks(range(len(columns)), columns)

# Add title to the plot
ax.set_title("Variable correlation matrix")

# Add colorbar with label
fig.colorbar(im, label="Spearman Correlation")

In [None]:
# if savefig:
#     if Future:
#         fig.savefig(os.path.join(figs_path, '04_var-corr-matrix_%s_%s_future.png' %(training, bio)), transparent=True, bbox_inches='tight')
#     else:
#         fig.savefig(os.path.join(figs_path, '04_var-corr-matrix_%s_%s.png' %(training, bio)), transparent=True, bbox_inches='tight')

if savefig:
    # Include the model prefix in the filename only when `models` is set
    name_parts = [training, bio] + ([model_prefix] if models else []) + [iteration]
    # Future scenarios get a '_future' suffix
    suffix = '_future' if Future else ''
    file_name = '04_var-corr-matrix_%s%s.png' % ('_'.join(str(part) for part in name_parts), suffix)
    file_path = os.path.join(figs_path, file_name)
    fig.savefig(file_path, transparent=True, bbox_inches='tight')

In [None]:
# =============================================================================
# HIERARCHICAL CLUSTERING AND DENDROGRAM VISUALIZATION
# =============================================================================
# Perform hierarchical clustering on environmental variables based on correlation
# This helps identify groups of similar variables and potential redundancy

# Import required functions for hierarchical clustering
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

# =============================================================================
# PREPARE DISSIMILARITY MATRIX FOR CLUSTERING
# =============================================================================
# Convert correlation matrix to dissimilarity matrix
# Correlation values range from -1 to 1, dissimilarity should be 0 to 1
# Using absolute correlation ensures both positive and negative correlations
# are treated as similarity (both indicate strong relationships)

dissimilarity = 1 - abs(correlation_matrix)

# =============================================================================
# PERFORM HIERARCHICAL CLUSTERING
# =============================================================================
# Convert dissimilarity matrix to condensed distance matrix for linkage function
# 'complete' linkage uses maximum distance between clusters
# This tends to create compact, well-separated clusters

Z = linkage(squareform(dissimilarity), 'complete', optimal_ordering=True)

# =============================================================================
# CREATE DENDROGRAM VISUALIZATION
# =============================================================================
# Generate dendrogram showing the hierarchical clustering structure
# This helps visualize which variables cluster together

xx = plt.figure(figsize=(12, 10))

# Create dendrogram with variable labels
dendrogram(Z, 
           labels=df.columns,      # Variable names as leaf labels
           orientation='top',      # Root at top, leaves at bottom
           leaf_rotation=90)       # Rotate leaf labels for readability
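
The dendrogram can also be cut at a fixed height to obtain explicit variable groups. The sketch below uses `fcluster` (imported above but otherwise unused) on a small hypothetical correlation matrix. The 0.3 cut-off, equivalent to an absolute correlation of 0.7, and the rule of keeping the first variable in each group are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical correlation matrix: v1 and v2 are strongly related, v3 is not
names = ['v1', 'v2', 'v3']
corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=names, columns=names,
)
dissim = 1 - corr.abs()
Z_demo = linkage(squareform(dissim.to_numpy(), checks=False), 'complete')

# Cut the tree at dissimilarity 0.3, i.e. keep |correlation| >= 0.7 within a cluster
cluster_ids = fcluster(Z_demo, t=0.3, criterion='distance')
groups = pd.Series(cluster_ids, index=names)

# Keep the first variable of each cluster as its representative
representatives = groups.drop_duplicates().index.tolist()
print(groups.to_dict(), representatives)
```

With complete linkage, every pair of variables inside a cluster has a dissimilarity at or below the cut-off, so each retained representative stands in for a group of mutually correlated variables.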

In [None]:
# if savefig:
#     if Future:
#         xx.savefig(os.path.join(figs_path, '04_Dendogram_%s_%s_future.png' %(training, bio)), transparent=True)
#     else:
#         xx.savefig(os.path.join(figs_path, '04_Dendogram_%s_%s.png' %(training, bio)), transparent=True)


if savefig:
    # Include the model prefix in the filename only when `models` is set
    name_parts = [training, bio] + ([model_prefix] if models else []) + [iteration]
    # Future scenarios get a '_future' suffix
    suffix = '_future' if Future else ''
    file_name = '04_Dendogram_%s%s.png' % ('_'.join(str(part) for part in name_parts), suffix)
    file_path = os.path.join(figs_path, file_name)
    xx.savefig(file_path, transparent=True)

In [None]:
# # Ben's k medoids experiment
# import kmedoids
# from sklearn.metrics.pairwise import (
#     pairwise_distances)

# print(dissimilarity)
# dissimilarity.to_csv("/scratch/gito_aciar/sdm-toolbox/figs/dissmililarity.csv")
# kmin = 1
# kmax = 6
# dm = kmedoids.dynmsc(dissimilarity, kmax, kmin)

# print("Optimal number of clusters according to the Medoid Silhouette:", dm.bestk)

# k_medoid_results = kmedoids.fasterpam(dissimilarity,6)

# print(k_medoid_results)

In [None]:
# https://github.com/osgeokr/pySDM-geemap/blob/main/pySDM-geemap_Case%20Study%201_Pitta%20nympha.ipynb
# https://github.com/dennisbakhuis/Tutorials/blob/master/3_Covariance_PCA/Principle%20component%20analysis%20and%20the%20covariance%20matrix.ipynb
# https://www.geeksforgeeks.org/exploring-correlation-in-python/