# Model Evaluation (Weighted) – Notebook Guide

This notebook evaluates models with class/observation weights applied.

## What this notebook does
- Compute weighted metrics (e.g., weighted AUC, threshold metrics)
- Plot diagnostic figures considering weights
- Summarize results per model/run and export

## Inputs
- Predictions/scores, ground-truth labels, and weights per observation
- Optional: CV fold info or test set indicators

## Workflow
1. Load predictions, labels, and weights
2. Validate alignment and handle missing values
3. Compute weighted metrics across thresholds/folds
4. Plot weighted ROC/curves and summaries
5. Save metrics tables and figures

## Outputs
- Weighted per-model/per-fold metrics tables
- Plots reflecting weights
- CSV/JSON exports for downstream use

## Notes
- Ensure weights are normalized or in intended scale
- Use consistent preprocessing as training
- Fix random seeds for reproducibility where applicable


# Notebook Overview

This notebook evaluates weighted SDMs with metrics and plots, mirroring standard evaluation but accounting for weights in analysis where relevant.

- Key steps: load weighted predictions, compute metrics, plot curves, thresholds, reporting
- Inputs: weighted model predictions and labels
- Outputs: evaluation tables and plots
- Run order: After weighted model training.


# Weighted MaxEnt Model Evaluation and Performance Assessment

This notebook provides comprehensive evaluation of **weighted MaxEnt species distribution models**, focusing on performance assessment that accounts for sample weights and data quality differences. Unlike standard model evaluation, this version incorporates **weighted metrics** to properly assess model performance when training data has been weighted.

## Key Features of Weighted Model Evaluation:

### 1. **Weighted Performance Metrics**:
- **Weighted AUC**: Area Under ROC Curve accounting for sample weights
- **Weighted PR-AUC**: Precision-Recall AUC with weight integration
- **Weighted Sensitivity/Specificity**: Performance metrics adjusted for data quality
- **Weighted Precision/Recall**: Classification metrics incorporating sample weights

### 2. **Advanced Evaluation Approaches**:
- **Cross-Validation**: K-fold validation with weighted samples
- **Spatial Validation**: Geographic partitioning with weight consideration
- **Temporal Validation**: Time-based splits accounting for temporal weights
- **Bootstrap Validation**: Resampling with weight preservation

### 3. **Bias Assessment**:
- **Spatial Bias Analysis**: Evaluate model performance across different regions
- **Temporal Bias Assessment**: Performance across different time periods
- **Source Bias Evaluation**: Performance across different data sources
- **Quality Bias Analysis**: Performance across different data quality levels

## Applications:
- **Model Validation**: Comprehensive assessment of weighted model performance
- **Bias Detection**: Identify remaining biases after weighting
- **Performance Comparison**: Compare weighted vs. unweighted models
- **Quality Control**: Validate that weighting improves model reliability

In [None]:
############### WEIGHTED MODEL EVALUATION CONFIGURATION - MODIFY AS NEEDED ###############

# Species and region settings for weighted model evaluation
#specie = 'leptocybe-invasa'  # Target species: 'leptocybe-invasa' or 'thaumastocoris-peregrinus'
#pseudoabsence = 'random'  # Background point strategy: 'random', 'biased', 'biased-land-cover'
#training = 'east-asia'  # Training region: 'sea', 'australia', 'east-asia', etc.
#interest = 'south-east-asia'  # Test region: can be same as training or different
#savefig = True  # Save generated evaluation plots and metrics

# Environmental variable configuration
bio = bio1  # Bioclimatic variable identifier

# Evaluation settings (specific to weighted model evaluation)
# evaluation_method = 'cross_validation'  # 'cross_validation', 'spatial_validation', 'temporal_validation'
# n_folds = 5  # Number of folds for cross-validation
# spatial_buffer = 100  # Buffer distance (km) for spatial validation
# temporal_split = 0.7  # Proportion of data for training in temporal validation

# Weighted metrics configuration
# include_weighted_metrics = True  # Calculate weighted performance metrics
# include_unweighted_metrics = True  # Calculate standard metrics for comparison
# weight_threshold = 0.1  # Minimum weight threshold for sample inclusion

###########################################################

In [None]:
# =============================================================================
# IMPORT REQUIRED LIBRARIES
# =============================================================================

import os  # File system operations

import numpy as np  # Numerical computing
import xarray as xr  # Multi-dimensional labeled arrays (raster data)
import pandas as pd  # Data manipulation and analysis
import geopandas as gpd  # Geospatial data handling

import elapid as ela  # Species distribution modeling library

from shapely import wkt  # Well-Known Text (WKT) geometry parsing
from elapid import utils  # Utility functions for elapid
from sklearn import metrics, inspection  # Machine learning metrics and model inspection

import matplotlib.pyplot as plt  # Plotting and visualization

import warnings
warnings.filterwarnings("ignore")  # Suppress warning messages for cleaner output

# Configure matplotlib for publication-quality plots
params = {'legend.fontsize': 'x-large',
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

In [None]:
def subplot_layout(nplots):
    """
    Calculate optimal subplot layout for given number of plots
    
    Parameters:
    -----------
    nplots : int
        Number of plots to arrange
    
    Returns:
    --------
    ncols, nrows : tuple
        Number of columns and rows for subplot layout
    """
    
    # Calculate square root and round up for balanced layout
    ncols = min(int(np.ceil(np.sqrt(nplots))), 4)  # Max 4 columns
    nrows = int(np.ceil(nplots / ncols))  # Calculate rows needed
    
    return ncols, nrows

In [None]:
# =============================================================================
# SET UP FILE PATHS
# =============================================================================
# Define directory structure for organizing weighted model evaluation outputs

docs_path = os.path.join(os.path.dirname(os.getcwd()), 'docs')  # Documentation directory
out_path = os.path.join(os.path.dirname(os.getcwd()), 'out', specie)  # Species-specific output directory
figs_path = os.path.join(os.path.dirname(os.getcwd()), 'figs')  # Figures directory
output_path = os.path.join(out_path, 'output')  # Model output directory

## 1. Weighted Training Model Performance Assessment

This section evaluates the performance of the weighted MaxEnt model on the training data. Key aspects include:

### **Weighted vs. Unweighted Metrics**:
- **Standard Metrics**: Traditional AUC, PR-AUC, sensitivity, specificity
- **Weighted Metrics**: Performance metrics accounting for sample weights
- **Comparison Analysis**: Evaluate improvement from weighting approach

### **Performance Indicators**:
- **ROC-AUC**: Area Under Receiver Operating Characteristic curve
- **PR-AUC**: Area Under Precision-Recall curve (important for imbalanced data)
- **Sensitivity**: True Positive Rate (ability to detect presences)
- **Specificity**: True Negative Rate (ability to detect absences)
- **Precision**: Positive Predictive Value
- **F1-Score**: Harmonic mean of precision and recall

### **Weighted Evaluation Benefits**:
- **Quality-Aware Assessment**: Metrics reflect data quality differences
- **Bias-Corrected Performance**: Reduced influence of low-quality samples
- **Robust Validation**: More reliable performance estimates

## References for Species Distribution Model Evaluation

### **Model Output Interpretation**:
- [SDM Model Outputs Interpretation](https://support.ecocommons.org.au/support/solutions/articles/6000256107-interpretation-of-sdm-model-outputs)
- [Presence-Only Prediction in GIS](https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/how-presence-only-prediction-works.htm)
- [MaxEnt 101: Species Distribution Modeling](https://www.esri.com/arcgis-blog/products/arcgis-pro/analytics/presence-only-prediction-maxent-101-using-gis-to-model-species-distribution/)

### **Performance Metrics**:
- [ROC Curves Demystified](https://towardsdatascience.com/receiver-operating-characteristic-curves-demystified-in-python-bd531a4364d0)
- [Precision-Recall AUC Guide](https://www.aporia.com/learn/ultimate-guide-to-precision-recall-auc-understanding-calculating-using-pr-auc-in-ml/)
- [F1-Score, Accuracy, ROC-AUC, and PR-AUC Metrics](https://deepchecks.com/f1-score-accuracy-roc-auc-and-pr-auc-metrics-for-models/)

### **Weighted Model Evaluation**:
- **Sample Weighting**: How to properly evaluate models trained with sample weights
- **Bias Correction**: Assessing the effectiveness of weighting strategies
- **Quality Integration**: Incorporating data quality into performance assessment

In [None]:
# =============================================================================
# LOAD WEIGHTED MODEL AND TRAINING DATA
# =============================================================================
# Load the trained weighted MaxEnt model and associated training data for evaluation

# Build experiment directory name (keeps runs organized by config)
# Alternate naming (older): 'exp_%s_%s_%s' % (pseudoabsence, training, interest)
experiment_name = 'exp_%s_%s_%s_%s_%s' % (model_prefix, pseudoabsence, training, topo, ndvi)
exp_path = os.path.join(output_path, experiment_name)  # Path to experiment directory

# Construct expected filenames produced during training for this run
train_input_data_name = '%s_model-train_input-data_%s_%s_%s_%s_%s.csv' % (model_prefix, specie, pseudoabsence, training, bio, iteration)
run_name = '%s_model-train_%s_%s_%s_%s_%s.ela' % (model_prefix, specie, pseudoabsence, training, bio, iteration)
nc_name = '%s_model-train_%s_%s_%s_%s_%s.nc' % (model_prefix, specie, pseudoabsence, training, bio, iteration)

In [None]:
# =============================================================================
# LOAD TRAINING DATA WITH SAMPLE WEIGHTS
# =============================================================================
# Load training data including sample weights for weighted model evaluation

# Load training data from CSV file (index_col=0 to drop old index column)
df = pd.read_csv(os.path.join(exp_path, train_input_data_name), index_col=0)
# Parse WKT strings into shapely geometries
df['geometry'] = df['geometry'].apply(wkt.loads)
# Wrap as GeoDataFrame with WGS84 CRS
train = gpd.GeoDataFrame(df, crs='EPSG:4326')

# Split predictors/labels/weights for weighted evaluation
x_train = train.drop(columns=['class', 'SampleWeight', 'geometry'])  # Environmental variables only
y_train = train['class']  # Presence/absence labels (0/1)
sample_weight_train = train['SampleWeight']  # Sample weights aligned with rows

# Load fitted weighted MaxEnt model
model_train = utils.load_object(os.path.join(exp_path, run_name))

# Predict probabilities on training set (for curves/metrics)
y_train_predict = model_train.predict(x_train)
# Optional: impute NaN probabilities to 0.5 (neutral)
# y_train_predict = np.nan_to_num(y_train_predict, nan=0.5)

In [None]:
# Model training performance metrics

# ROC curve and AUC (unweighted vs weighted)
# fpr/tpr are computed from predicted probabilities; weights adjust contribution per sample
fpr_train, tpr_train, thresholds = metrics.roc_curve(y_train, y_train_predict)
auc_train = metrics.roc_auc_score(y_train, y_train_predict)
auc_train_weighted = metrics.roc_auc_score(y_train, y_train_predict, sample_weight=sample_weight_train)

# Precision-Recall curve and PR-AUC (more informative on class imbalance)
precision_train, recall_train, _ = metrics.precision_recall_curve(y_train, y_train_predict)
pr_auc_train = metrics.auc(recall_train, precision_train)
# Weighted PR curve uses sample weights to compute precision/recall
precision_train_w, recall_train_w, _ = metrics.precision_recall_curve(y_train, y_train_predict, sample_weight=sample_weight_train)
pr_auc_train_weighted = metrics.auc(recall_train_w, precision_train_w)

# Report metrics
print(f"Training ROC-AUC score: {auc_train:0.3f}")
print(f"Training ROC-AUC Weighted score  : {auc_train_weighted:0.3f}")
print(f"PR-AUC Score: {pr_auc_train:0.3f}")
print(f"PR-AUC Weighted Score: {pr_auc_train_weighted:0.3f}")

|  |  | Specie existance |  |
| ------ | :-------: | :------: | :-------: |
| |  | **+** | **--** |
| **Specie observed** | **+** | True Positive (TP) | False Positive (FP) |
| | **--** | False Negative (FN) | True Negative (TN) |
| | | **All existing species (TP + FN)** | **All non-existing species (FP + TN)** |


$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$

In [None]:
# Visualize training distributions and curves
fig, ax = plt.subplots(ncols=3, figsize=(18, 6), constrained_layout=True)

# Left: Predicted probability distributions for presence vs pseudo-absence
ax[0].hist(y_train_predict[y_train == 0], bins=np.linspace(0, 1, int((y_train == 0).sum() / 100 + 1)),
           density=True, color='tab:red', alpha=0.7, label='pseudo-absence')
ax[0].hist(y_train_predict[y_train == 1], bins=np.linspace(0, 1, int((y_train == 1).sum() / 10 + 1)),
           density=True, color='tab:green', alpha=0.7, label='presence')
ax[0].set_xlabel('Relative Occurrence Probability')
ax[0].set_ylabel('Counts')
ax[0].set_title('Probability Distribution')
ax[0].legend(loc='upper right')

# Middle: ROC curve (random vs perfect baselines + model)
ax[1].plot([0, 1], [0, 1], '--', label='AUC score: 0.5 (No Skill)', color='gray')
ax[1].text(0.4, 0.4, 'random classifier', fontsize=12, color='gray', rotation=45, rotation_mode='anchor',
           horizontalalignment='left', verticalalignment='bottom', transform=ax[1].transAxes)
ax[1].plot([0, 0, 1], [0, 1, 1], '--', label='AUC score: 1 (Ideal Model)', color='tab:blue', zorder=-1)
ax[1].text(0, 1, '  perfect classifier', fontsize=12, color='tab:blue', horizontalalignment='left', verticalalignment='bottom')
ax[1].scatter(0, 1, marker='*', s=100, color='tab:blue')
# Overlay model ROC (unweighted and weighted AUC labels)
ax[1].plot(fpr_train, tpr_train, label=f'AUC score: {auc_train:0.3f}', color='tab:orange')
ax[1].plot(fpr_train, tpr_train, label=f'AUC Weighted score: {auc_train_weighted:0.3f}', color='tab:cyan', linestyle='-.')
ax[1].axis('equal')
ax[1].set_xlabel('False Positive Rate')
ax[1].set_ylabel('True Positive Rate')
ax[1].set_title('MaxEnt ROC Curve')
ax[1].legend(loc='lower right')

# Right: Precision-Recall curve (random/perfect baselines + model)
ax[2].plot([0, 1], [0.5, 0.5], '--', color='gray', label='AUC score: 0.5 (No Skill)')
ax[2].text(0.5, 0.52, 'random classifier', fontsize=12, color='gray', horizontalalignment='center', verticalalignment='center')
ax[2].plot([0, 1, 1], [1, 1, 0], '--', label='AUC score: 1 (Ideal Model)', color='tab:blue', zorder=-1)
ax[2].text(1, 1, 'perfect classifier  ', fontsize=12, color='tab:blue', horizontalalignment='right', verticalalignment='bottom')
ax[2].scatter(1, 1, marker='*', s=100, color='tab:blue')
# Overlay model PR curves (unweighted and weighted AUC labels)
ax[2].plot(recall_train, precision_train, label=f'AUC score: {pr_auc_train:0.3f}', color='tab:orange')
ax[2].plot(recall_train_w, precision_train_w, label=f"AUC Weighted score: {pr_auc_train_weighted:0.3f}", color='tab:cyan', linestyle='-.')
ax[2].axis('equal')
ax[2].set_xlabel('Recall')
ax[2].set_ylabel('Precision')
ax[2].set_title('MaxEnt PR Curve')
ax[2].legend(loc='lower left')

In [None]:
# Save figures if requested. Uses different filename patterns for current vs future scenarios.
# Note: 'models' is used to gate inclusion of model prefix; ensure it exists in your session.
if savefig:
    if Future:
        if models:  # include model identifier when available
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration),
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

    else:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration),
            )
        else:
            # Fallback: omit model prefix when not specified
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s.png' % (specie, training, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')


## 2. Test model performance

In [None]:
test_input_data_name = '%s_model-test_input-data_%s_%s_%s_%s_%s.csv' %(model_prefix, specie, pseudoabsence, interest, bio, iteration)

In [None]:
# Load held-out test dataset for evaluation
# Note: index_col=0 drops the old index saved during export
df = pd.read_csv(os.path.join(exp_path, test_input_data_name), index_col=0)
# Convert WKT geometry back to shapely objects
df['geometry'] = df['geometry'].apply(wkt.loads)
# Wrap as GeoDataFrame (WGS84 CRS)
test = gpd.GeoDataFrame(df, crs='EPSG:4326')

In [None]:
# Split predictors/labels/weights for test set
x_test = test.drop(columns=['class', 'SampleWeight', 'geometry'])
y_test = test['class']
sample_weight_test = test['SampleWeight']

# Predict probabilities on the test set using the trained model
y_test_predict = model_train.predict(x_test)
# Optional: impute NaN probabilities to 0.5 if present
# y_test_predict = np.nan_to_num(y_test_predict, nan=0.5)

In [None]:
# Test set metrics: ROC/PR curves and AUCs (unweighted vs weighted)
# ROC
fpr_test, tpr_test, _ = metrics.roc_curve(y_test, y_test_predict)
auc_test = metrics.roc_auc_score(y_test, y_test_predict)
auc_test_weighted = metrics.roc_auc_score(y_test, y_test_predict, sample_weight=sample_weight_test)

# Precision-Recall (PR)
precision_test, recall_test, _ = metrics.precision_recall_curve(y_test, y_test_predict)
pr_auc_test = metrics.auc(recall_test, precision_test)
precision_test_w, recall_test_w, _ = metrics.precision_recall_curve(y_test, y_test_predict, sample_weight=sample_weight_test)
pr_auc_test_weighted = metrics.auc(recall_test_w, precision_test_w)

# Print summary of training vs test for quick comparison
print(f"Training ROC-AUC score: {auc_train:0.3f}")
print(f"Training ROC-AUC Weighted score: {auc_train_weighted:0.3f}")
print(f"Test ROC-AUC score: {auc_test:0.3f}")
print(f"Test ROC-AUC Weighted score: {auc_test_weighted:0.3f}")

print(f"Training PR-AUC Score: {pr_auc_train:0.3f}")
print(f"Training PR-AUC Weighted Score: {pr_auc_train_weighted:0.3f}")
print(f"Test PR-AUC Score: {pr_auc_test:0.3f}")
print(f"Test PR-AUC Weighted Score: {pr_auc_test_weighted:0.3f}")

In [None]:
# Visualize test distributions and curves alongside training for comparison
fig, ax = plt.subplots(ncols=3, figsize=(18, 6), constrained_layout=True)

# Left: Predicted probability distributions on test set
ax[0].hist(y_test_predict[y_test == 0], bins=np.linspace(0, 1, int((y_test == 0).sum() / 100 + 1)),
           density=True, color='tab:red', alpha=0.7, label='pseudo-absence')
ax[0].hist(y_test_predict[y_test == 1], bins=np.linspace(0, 1, int((y_test == 1).sum() / 10 + 1)),
           density=True, color='tab:green', alpha=0.7, label='presence')
ax[0].set_xlabel('Relative Occurrence Probability')
ax[0].set_ylabel('Counts')
ax[0].set_title('Probability Distribution')
ax[0].legend(loc='upper right')

# Middle: ROC curves (train vs test, with weighted variants labeled)
ax[1].plot([0, 1], [0, 1], '--', label='AUC score: 0.5 (No Skill)', color='gray')
ax[1].text(0.4, 0.4, 'random classifier', fontsize=12, color='gray', rotation=45, rotation_mode='anchor',
           horizontalalignment='left', verticalalignment='bottom', transform=ax[1].transAxes)
ax[1].plot([0, 0, 1], [0, 1, 1], '--', label='AUC score: 1 (Ideal Model)', color='tab:blue', zorder=-1)
ax[1].text(0, 1, '  perfect classifier', fontsize=12, color='tab:blue', horizontalalignment='left', verticalalignment='bottom')
ax[1].scatter(0, 1, marker='*', s=100, color='tab:blue')
ax[1].plot(fpr_train, tpr_train, label=f'AUC train score: {auc_train:0.3f}', color='tab:orange')
ax[1].plot(fpr_train, tpr_train, label=f'AUC Weighted train score: {auc_train_weighted:0.3f}', color='tab:cyan', linestyle='-.')
ax[1].plot(fpr_test, tpr_test, label=f'AUC test score: {auc_test:0.3f}', color='tab:green')
ax[1].plot(fpr_test, tpr_test, label=f'AUC Weighted test score: {auc_test_weighted:0.3f}', color='tab:olive', linestyle='-.')
ax[1].axis('equal')
ax[1].set_xlabel('False Positive Rate')
ax[1].set_ylabel('True Positive Rate')
ax[1].set_title('MaxEnt ROC Curve')
ax[1].legend(loc='lower right')

# Right: PR curves (train vs test)
ax[2].plot([0, 1], [0.5, 0.5], '--', color='gray', label='AUC score: 0.5 (No Skill)')
ax[2].text(0.5, 0.52, 'random classifier', fontsize=12, color='gray', horizontalalignment='center', verticalalignment='center')
ax[2].plot([0, 1, 1], [1, 1, 0], '--', label='AUC score: 1 (Ideal Model)', color='tab:blue', zorder=-1)
ax[2].text(1, 1, 'perfect classifier  ', fontsize=12, color='tab:blue', horizontalalignment='right', verticalalignment='bottom')
ax[2].scatter(1, 1, marker='*', s=100, color='tab:blue')
ax[2].plot(recall_train, precision_train, label=f'AUC train score: {pr_auc_train:0.3f}', color='tab:orange')
ax[2].plot(recall_train_w, precision_train_w, label=f"AUC train Weighted score: {pr_auc_train_weighted:0.3f}", color='tab:cyan', linestyle='-.')
ax[2].plot(recall_test, precision_test, label=f'AUC test score: {pr_auc_test:0.3f}', color='tab:green')
ax[2].plot(recall_test_w, precision_test_w, label=f'AUC test Weighted score: {pr_auc_test_weighted:0.3f}', color='tab:olive', linestyle='-.')
ax[2].axis('equal')
ax[2].set_xlabel('Recall')
ax[2].set_ylabel('Precision')
ax[2].set_title('MaxEnt PR Curve')
ax[2].legend(loc='lower left')

In [None]:
# Save test figures if requested (future vs current naming handled similarly to training)
if savefig:
    if Future:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_%s_future.png' % (specie, interest, bio, model_prefix, iteration),
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_future.png' % (specie, interest, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

    else:
        if model_prefix:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s_%s.png' % (specie, interest, bio, model_prefix, iteration),
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_roc-pr-auc_%s_%s_%s_%s.png' % (specie, interest, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

## 3. Evaluate model

### 3.2 Partial dependence plot/ Response curves

In [None]:
# fig, ax = model_train.partial_dependence_plot(x, labels=labels, dpi=100, n_bins=30)

## 4. Comprehensive Variable Importance Analysis

This section performs a thorough analysis of variable importance by:

1. **Initial Analysis**: Running the model with all 19 bioclimatic variables to establish baseline importance
2. **Iterative Removal**: Systematically removing the least important variables until we reach ~5 most important variables
3. **Performance Tracking**: Monitoring model performance as variables are removed
4. **Final Recommendations**: Identifying the optimal subset of variables for the species distribution model

### Methodology:
- **Permutation Importance**: Measures the drop in model performance when each variable is randomly shuffled
- **Iterative Backward Elimination**: Removes least important variables one at a time
- **Performance Monitoring**: Tracks AUC, PR-AUC, and other metrics throughout the process
- **Cross-Validation**: Ensures robust importance estimates


In [None]:
# =============================================================================
# COMPREHENSIVE VARIABLE IMPORTANCE ANALYSIS
# =============================================================================

import time
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

# Initialize storage for results
importance_results = {}
performance_history = {}
variable_subsets = {}

# Get current variable names from training data
current_variables = list(x_train.columns)
print(f"Starting with {len(current_variables)} variables:")
print(f"Variables: {current_variables}")

# Store initial performance metrics
initial_metrics = {
    'train_auc': auc_train,
    'train_auc_weighted': auc_train_weighted,
    'train_pr_auc': pr_auc_train,
    'train_pr_auc_weighted': pr_auc_train_weighted,
    'test_auc': auc_test,
    'test_auc_weighted': auc_test_weighted,
    'test_pr_auc': pr_auc_test,
    'test_pr_auc_weighted': pr_auc_test_weighted
}

performance_history['all_variables'] = initial_metrics
variable_subsets['all_variables'] = current_variables.copy()

print(f"\nInitial Performance (All {len(current_variables)} variables):")
print(f"Training AUC: {auc_train:.3f} (weighted: {auc_train_weighted:.3f})")
print(f"Training PR-AUC: {pr_auc_train:.3f} (weighted: {pr_auc_train_weighted:.3f})")
print(f"Test AUC: {auc_test:.3f} (weighted: {auc_test_weighted:.3f})")
print(f"Test PR-AUC: {pr_auc_test:.3f} (weighted: {pr_auc_test_weighted:.3f})")


In [None]:
# =============================================================================
# ITERATIVE VARIABLE REMOVAL FUNCTION
# =============================================================================

def iterative_variable_removal(x_train, y_train, sample_weight_train, x_test, y_test, sample_weight_test, 
                              target_variables=5, min_variables=3):
    """
    Iteratively remove least important variables until reaching target number.
    
    Parameters:
    -----------
    x_train, y_train, sample_weight_train : training data
    x_test, y_test, sample_weight_test : test data  
    target_variables : int, target number of variables to keep
    min_variables : int, minimum number of variables to keep
    
    Returns:
    --------
    results : dict, containing importance rankings and performance history
    """
    
    results = {
        'importance_rankings': {},
        'performance_history': {},
        'removed_variables': [],
        'final_variables': []
    }
    
    current_x_train = x_train.copy()
    current_x_test = x_test.copy()
    current_vars = list(current_x_train.columns)
    iteration = 0
    
    print(f"Starting iterative removal from {len(current_vars)} to {target_variables} variables...")
    
    while len(current_vars) > max(target_variables, min_variables):
        iteration += 1
        print(f"\n--- Iteration {iteration}: {len(current_vars)} variables remaining ---")
        
        # Train model with current variables
        model_iter = ela.MaxentModel()
        model_iter.fit(current_x_train, y_train, sample_weight=sample_weight_train)
        
        # Calculate permutation importance
        pi = inspection.permutation_importance(
            model_iter, current_x_train, y_train, 
            sample_weight=sample_weight_train, n_repeats=10
        )
        
        # Get importance scores and rank variables
        importance_scores = pi.importances.mean(axis=1)
        var_importance = dict(zip(current_vars, importance_scores))
        sorted_vars = sorted(var_importance.items(), key=lambda x: x[1], reverse=True)
        
        # Store ranking for this iteration
        results['importance_rankings'][f'iteration_{iteration}'] = {
            'variables': current_vars.copy(),
            'importance_scores': var_importance.copy(),
            'sorted_ranking': sorted_vars.copy()
        }
        
        # Calculate performance metrics
        y_train_pred = model_iter.predict(current_x_train)
        y_test_pred = model_iter.predict(current_x_test)
        
        # Training metrics
        train_auc = metrics.roc_auc_score(y_train, y_train_pred)
        train_auc_weighted = metrics.roc_auc_score(y_train, y_train_pred, sample_weight=sample_weight_train)
        train_precision, train_recall, _ = metrics.precision_recall_curve(y_train, y_train_pred)
        train_pr_auc = metrics.auc(train_recall, train_precision)
        train_precision_w, train_recall_w, _ = metrics.precision_recall_curve(y_train, y_train_pred, sample_weight=sample_weight_train)
        train_pr_auc_weighted = metrics.auc(train_recall_w, train_precision_w)
        
        # Test metrics
        test_auc = metrics.roc_auc_score(y_test, y_test_pred)
        test_auc_weighted = metrics.roc_auc_score(y_test, y_test_pred, sample_weight=sample_weight_test)
        test_precision, test_recall, _ = metrics.precision_recall_curve(y_test, y_test_pred)
        test_pr_auc = metrics.auc(test_recall, test_precision)
        test_precision_w, test_recall_w, _ = metrics.precision_recall_curve(y_test, y_test_pred, sample_weight=sample_weight_test)
        test_pr_auc_weighted = metrics.auc(test_recall_w, test_precision_w)
        
        # Store performance
        results['performance_history'][f'iteration_{iteration}'] = {
            'n_variables': len(current_vars),
            'train_auc': train_auc,
            'train_auc_weighted': train_auc_weighted,
            'train_pr_auc': train_pr_auc,
            'train_pr_auc_weighted': train_pr_auc_weighted,
            'test_auc': test_auc,
            'test_auc_weighted': test_auc_weighted,
            'test_pr_auc': test_pr_auc,
            'test_pr_auc_weighted': test_pr_auc_weighted
        }
        
        # Print current performance
        print(f"Performance with {len(current_vars)} variables:")
        print(f"  Train AUC: {train_auc:.3f} (weighted: {train_auc_weighted:.3f})")
        print(f"  Test AUC: {test_auc:.3f} (weighted: {test_auc_weighted:.3f})")
        print(f"  Train PR-AUC: {train_pr_auc:.3f} (weighted: {train_pr_auc_weighted:.3f})")
        print(f"  Test PR-AUC: {test_pr_auc:.3f} (weighted: {test_pr_auc_weighted:.3f})")
        
        # Identify least important variable
        least_important_var = sorted_vars[-1][0]
        least_important_score = sorted_vars[-1][1]
        
        print(f"Least important variable: {least_important_var} (importance: {least_important_score:.4f})")
        
        # Remove least important variable
        current_x_train = current_x_train.drop(columns=[least_important_var])
        current_x_test = current_x_test.drop(columns=[least_important_var])
        current_vars.remove(least_important_var)
        results['removed_variables'].append(least_important_var)
        
        print(f"Removed {least_important_var}. Variables remaining: {current_vars}")
    
    results['final_variables'] = current_vars.copy()
    print(f"\nFinal variable set ({len(current_vars)} variables): {current_vars}")
    
    return results


In [None]:
# =============================================================================
# RUN ITERATIVE VARIABLE REMOVAL ANALYSIS
# =============================================================================

print("="*80)
print("COMPREHENSIVE VARIABLE IMPORTANCE ANALYSIS")
print("="*80)

# Run the iterative removal process
start_time = time.time()

# Set target to 5 variables (can be adjusted)
target_vars = 5
min_vars = 3

# Run iterative removal
removal_results = iterative_variable_removal(
    x_train, y_train, sample_weight_train,
    x_test, y_test, sample_weight_test,
    target_variables=target_vars,
    min_variables=min_vars
)

end_time = time.time()
print(f"\nAnalysis completed in {end_time - start_time:.1f} seconds")

# Store results for later analysis
importance_results['iterative_removal'] = removal_results


In [None]:
# =============================================================================
# ANALYZE AND VISUALIZE RESULTS
# =============================================================================

# Extract performance trends
iterations = list(removal_results['performance_history'].keys())
n_vars = [removal_results['performance_history'][iter]['n_variables'] for iter in iterations]
train_aucs = [removal_results['performance_history'][iter]['train_auc'] for iter in iterations]
test_aucs = [removal_results['performance_history'][iter]['test_auc'] for iter in iterations]
train_aucs_weighted = [removal_results['performance_history'][iter]['train_auc_weighted'] for iter in iterations]
test_aucs_weighted = [removal_results['performance_history'][iter]['test_auc_weighted'] for iter in iterations]

# Add initial performance (all variables)
n_vars.insert(0, len(x_train.columns))
train_aucs.insert(0, auc_train)
test_aucs.insert(0, auc_test)
train_aucs_weighted.insert(0, auc_train_weighted)
test_aucs_weighted.insert(0, auc_test_weighted)

print("Performance Summary:")
print("="*50)
print(f"{'Variables':<12} {'Train AUC':<10} {'Test AUC':<10} {'Train AUC-W':<12} {'Test AUC-W':<12}")
print("-"*60)
for i, n_var in enumerate(n_vars):
    print(f"{n_var:<12} {train_aucs[i]:<10.3f} {test_aucs[i]:<10.3f} {train_aucs_weighted[i]:<12.3f} {test_aucs_weighted[i]:<12.3f}")

# Get final variable ranking
final_iteration = f"iteration_{len(iterations)}"
final_ranking = removal_results['importance_rankings'][final_iteration]['sorted_ranking']

print(f"\nFinal Variable Ranking (Top {len(removal_results['final_variables'])} variables):")
print("="*60)
for i, (var, importance) in enumerate(final_ranking, 1):
    print(f"{i:2d}. {var:<15} (importance: {importance:.4f})")

print(f"\nRemoved Variables (in order of removal):")
print("="*40)
for i, var in enumerate(removal_results['removed_variables'], 1):
    print(f"{i:2d}. {var}")


In [None]:
# =============================================================================
# CREATE COMPREHENSIVE VISUALIZATION
# =============================================================================

# Create a comprehensive figure showing the analysis results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Comprehensive Variable Importance Analysis', fontsize=16, fontweight='bold')

# 1. Performance vs Number of Variables
ax1 = axes[0, 0]
ax1.plot(n_vars, train_aucs, 'o-', label='Train AUC', color='tab:blue', linewidth=2)
ax1.plot(n_vars, test_aucs, 's-', label='Test AUC', color='tab:orange', linewidth=2)
ax1.plot(n_vars, train_aucs_weighted, 'o--', label='Train AUC (Weighted)', color='tab:blue', alpha=0.7)
ax1.plot(n_vars, test_aucs_weighted, 's--', label='Test AUC (Weighted)', color='tab:orange', alpha=0.7)
ax1.set_xlabel('Number of Variables')
ax1.set_ylabel('AUC Score')
ax1.set_title('Model Performance vs Number of Variables')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.invert_xaxis()  # Show decreasing variables

# 2. Final Variable Importance (Top 10)
ax2 = axes[0, 1]
top_vars = final_ranking[:10]  # Top 10 variables
var_names = [var[0] for var in top_vars]
var_importance = [var[1] for var in top_vars]

bars = ax2.barh(range(len(var_names)), var_importance, color='tab:green', alpha=0.7)
ax2.set_yticks(range(len(var_names)))
ax2.set_yticklabels(var_names)
ax2.set_xlabel('Permutation Importance')
ax2.set_title('Top 10 Most Important Variables')
ax2.grid(True, alpha=0.3, axis='x')

# Add value labels on bars
for i, (bar, val) in enumerate(zip(bars, var_importance)):
    ax2.text(val + 0.001, i, f'{val:.3f}', va='center', fontsize=9)

# 3. Variable Removal Timeline
ax3 = axes[1, 0]
removed_vars = removal_results['removed_variables']
removal_order = list(range(1, len(removed_vars) + 1))
ax3.bar(removal_order, [1] * len(removed_vars), color='tab:red', alpha=0.7)
ax3.set_xlabel('Removal Order')
ax3.set_ylabel('Variables Removed')
ax3.set_title('Variable Removal Timeline')
ax3.set_xticks(removal_order)
ax3.set_xticklabels([f'#{i}' for i in removal_order])

# Add variable names as text
for i, var in enumerate(removed_vars):
    ax3.text(i + 1, 0.5, var, rotation=90, ha='center', va='center', fontsize=8)

# 4. Performance Degradation Analysis
ax4 = axes[1, 1]
# Calculate performance drop from initial
initial_test_auc = test_aucs[0]
initial_train_auc = train_aucs[0]
test_drop = [(initial_test_auc - auc) / initial_test_auc * 100 for auc in test_aucs]
train_drop = [(initial_train_auc - auc) / initial_train_auc * 100 for auc in train_aucs]

ax4.plot(n_vars, test_drop, 'o-', label='Test AUC Drop %', color='tab:red', linewidth=2)
ax4.plot(n_vars, train_drop, 's-', label='Train AUC Drop %', color='tab:purple', linewidth=2)
ax4.set_xlabel('Number of Variables')
ax4.set_ylabel('Performance Drop (%)')
ax4.set_title('Performance Degradation with Variable Removal')
ax4.legend()
ax4.grid(True, alpha=0.3)
ax4.invert_xaxis()

plt.tight_layout()


In [None]:
# Save the comprehensive analysis figure
if savefig:
    if Future:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_comprehensive_var-importance_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration)
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_comprehensive_var-importance_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration)
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
    else:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_comprehensive_var-importance_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration)
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_comprehensive_var-importance_%s_%s_%s_%s.png' % (specie, training, bio, iteration)
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
    
    print(f"Comprehensive analysis figure saved to: {file_path}")


In [None]:
# =============================================================================
# EXPORT RESULTS TO CSV FOR FURTHER ANALYSIS
# =============================================================================

# Create summary DataFrame for export
summary_data = []

# Add initial performance (all variables)
summary_data.append({
    'iteration': 0,
    'n_variables': len(x_train.columns),
    'variables_removed': 'none',
    'train_auc': auc_train,
    'train_auc_weighted': auc_train_weighted,
    'test_auc': auc_test,
    'test_auc_weighted': auc_test_weighted,
    'train_pr_auc': pr_auc_train,
    'train_pr_auc_weighted': pr_auc_train_weighted,
    'test_pr_auc': pr_auc_test,
    'test_pr_auc_weighted': pr_auc_test_weighted
})

# Add iterative removal results
for i, iter_key in enumerate(iterations, 1):
    perf = removal_results['performance_history'][iter_key]
    removed_var = removal_results['removed_variables'][i-1] if i-1 < len(removal_results['removed_variables']) else 'none'
    
    summary_data.append({
        'iteration': i,
        'n_variables': perf['n_variables'],
        'variables_removed': removed_var,
        'train_auc': perf['train_auc'],
        'train_auc_weighted': perf['train_auc_weighted'],
        'test_auc': perf['test_auc'],
        'test_auc_weighted': perf['test_auc_weighted'],
        'train_pr_auc': perf['train_pr_auc'],
        'train_pr_auc_weighted': perf['train_pr_auc_weighted'],
        'test_pr_auc': perf['test_pr_auc'],
        'test_pr_auc_weighted': perf['test_pr_auc_weighted']
    })

# Create DataFrame
summary_df = pd.DataFrame(summary_data)

# Save to CSV
if savefig:
    csv_filename = f'06_variable_importance_analysis_{specie}_{training}_{bio}_{iteration}.csv'
    csv_path = os.path.join(figs_path, csv_filename)
    summary_df.to_csv(csv_path, index=False)
    print(f"Analysis summary saved to: {csv_path}")

# Display summary
print("\n" + "="*80)
print("FINAL ANALYSIS SUMMARY")
print("="*80)
print(f"Species: {specie}")
print(f"Training Region: {training}")
print(f"Test Region: {interest}")
print(f"Initial Variables: {len(x_train.columns)}")
print(f"Final Variables: {len(removal_results['final_variables'])}")
print(f"Variables Removed: {len(removal_results['removed_variables'])}")

print(f"\nFinal Variable Set:")
for i, var in enumerate(removal_results['final_variables'], 1):
    print(f"  {i}. {var}")

print(f"\nPerformance Comparison:")
print(f"  Initial Test AUC: {test_aucs[0]:.3f}")
print(f"  Final Test AUC: {test_aucs[-1]:.3f}")
print(f"  Performance Drop: {((test_aucs[0] - test_aucs[-1]) / test_aucs[0] * 100):.1f}%")

print(f"\nTop 5 Most Important Variables:")
for i, (var, importance) in enumerate(final_ranking[:5], 1):
    print(f"  {i}. {var} (importance: {importance:.4f})")


## 7. Summary and Recommendations

### Key Benefits of 10-Iteration Analysis:

1. **Robustness**: Multiple iterations account for random variation in model training and importance calculations
2. **Statistical Significance**: Provides mean, standard deviation, and confidence intervals for importance scores
3. **Consistency Analysis**: Identifies variables that are consistently important across different runs
4. **Performance Stability**: Shows how model performance varies with different variable sets

### Final Recommendations:

1. **Use Most Consistent Variables**: Variables that appear in the final set across most iterations are most reliable
2. **Consider Importance + Consistency**: Balance between high importance and high consistency
3. **Validate on Independent Data**: Test the selected variables on completely independent datasets
4. **Monitor Performance**: Track how the reduced variable set performs in real-world applications

### Files Generated:
- **Robust analysis figure**: 6-panel visualization showing comprehensive results
- **Summary CSV**: Aggregated statistics across all 10 iterations
- **Detailed CSV**: Individual results for each iteration
- **Console output**: Detailed rankings and recommendations

### Next Steps:
1. Use the identified top 5 variables for future modeling
2. Consider running additional iterations if results are not stable
3. Validate the selected variables on independent test data
4. Document the ecological significance of the selected variables


In [None]:
# Prepare labels and open training output NetCDF for metadata
labels = train.drop(columns=['class', 'geometry', 'SampleWeight']).columns.values
training_output = xr.open_dataset(os.path.join(exp_path, nc_name))
# display(labels)
# display(training_output)

In [None]:
# Compute partial dependence across features
# - percentiles bounds the feature grid to observed range (2.5% to 97.5%)
# - nbins controls resolution of the curve
percentiles = (0.025, 0.975)
nbins = 100

mean = {}
stdv = {}
bins = {}

for idx, label in enumerate(labels):
    # Request individual PDP curves across samples, then summarize
    pda = inspection.partial_dependence(
        model_train,
        x_train,
        [idx],
        percentiles=percentiles,
        grid_resolution=nbins,
        kind="individual",
    )

    mean[label] = pda["individual"][0].mean(axis=0)  # average response
    stdv[label] = pda["individual"][0].std(axis=0)   # variability across samples
    bins[label] = pda["grid_values"][0]              # feature grid values

In [None]:
#display(pda)


In [None]:
# Plot PDPs with uncertainty bands for each predictor
ncols, nrows = subplot_layout(len(labels))
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols * 6, nrows * 6))

# Normalize axes list for consistent indexing
if (nrows, ncols) == (1, 1):
    ax = [axs]
else:
    ax = axs.ravel()

xlabels = training_output.data_vars
for iax, label in enumerate(labels):
    ax[iax].set_title(label)
    try:
        ax[iax].set_xlabel(xlabels[label].long_name)
    except (ValueError, AttributeError):
        ax[iax].set_xlabel('No variable long_name')

    # Uncertainty band: mean ± std across individuals
    ax[iax].fill_between(bins[label], mean[label] - stdv[label], mean[label] + stdv[label], alpha=0.25)
    ax[iax].plot(bins[label], mean[label])

# Style axes
for axi in ax:
    axi.set_ylim([0, 1])
    axi.set_ylabel('probability of occurrence')

fig.tight_layout()

In [None]:
# Save response curve figures if requested
if savefig:
    if Future:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_resp-curves_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration),
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_resp-curves_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

    else:
        if models:
            file_path = os.path.join(
                figs_path,
                '06_resp-curves_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration),
            )
        else:
            file_path = os.path.join(
                figs_path,
                '06_resp-curves_%s_%s_%s_%s.png' % (specie, training, bio, iteration),
            )
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

### 3.3 Variable importance plot

In [None]:
# fig, ax = model_train.permutation_importance_plot(x,y)

In [None]:
# Permutation importance: measures drop in performance when each feature is shuffled
# Higher drop => more important feature
pi = inspection.permutation_importance(model_train, x_train, y_train, n_repeats=10)
importance = pi.importances
rank_order = importance.mean(axis=-1).argsort()

In [None]:
# Visualize permutation importances as horizontal boxplots (distribution over repeats)
labels_ranked = [labels[idx] for idx in rank_order]

fig, ax = plt.subplots()
box = ax.boxplot(importance[rank_order].T, vert=False, labels=labels_ranked)
# Decorate legend labels for key boxplot elements
box['fliers'][0].set_label('outlier')
box['medians'][0].set_label('median')
for icap, cap in enumerate(box['caps']):
    if icap == 0:
        cap.set_label('min-max')
    cap.set_color('k')
    cap.set_linewidth(2)
for ibx, bx in enumerate(box['boxes']):
    if ibx == 0:
        bx.set_label('25-75%')
    bx.set_color('gray')

ax.set_xlabel('Importance')
ax.legend(loc='lower right')
fig.tight_layout()

In [None]:
# if savefig:
#     if Future:
#         fig.savefig(os.path.join(figs_path, '06_var-importance_%s_%s_%s_future.png' %(specie, training, bio)), transparent=True, bbox_inches='tight')
#     else:
#         fig.savefig(os.path.join(figs_path, '06_var-importance_%s_%s_%s.png' %(specie, training, bio)), transparent=True, bbox_inches='tight')


if savefig:
    if Future:
        # Check if the 'model' variable is not null or empty
        if models:
            # If a model is specified, add it to the filename
            file_path = os.path.join(figs_path, '06_var-importance_%s_%s_%s_%s_%s_future.png' %(specie, training, bio, model_prefix, iteration))
        else:
            # If no model is specified, use the original filename
            file_path = os.path.join(figs_path, '06_var-importance_%s_%s_%s_%s_future.png' %(specie, training, bio, iteration))
        
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

    else:
        if models:
            # If a model is specified, add it to the filename
            file_path = os.path.join(figs_path, '06_var-importance_%s_%s_%s_%s_%s.png' %(specie, training, bio, model_prefix, iteration))
        else:
            # This is the original logic for non-future scenarios, which remains unchanged
            file_path = os.path.join(figs_path, '06_var-importance_%s_%s_%s_%s.png' %(specie, training, bio,iteration))
        
        fig.savefig(file_path, transparent=True, bbox_inches='tight')

## 12. Analisis Sebaran Performance Berdasarkan Lokasi Geografis

Bagian ini melakukan analisis mendalam tentang sebaran performance model berdasarkan lokasi geografis, termasuk:

### Tujuan Analisis Spasial:
1. **Spatial Performance Distribution**: Analisis sebaran performa berdasarkan koordinat geografis
2. **Regional Performance Analysis**: Perbandingan performa antar region/area
3. **Spatial Bias Detection**: Identifikasi bias spasial dalam performa model
4. **Geographic Clustering**: Analisis clustering performa berdasarkan lokasi
5. **Spatial Correlation**: Korelasi antara lokasi geografis dan performa model

### Metodologi:
- **Spatial Statistics**: Analisis statistik spasial untuk mengidentifikasi pola
- **Geographic Visualization**: Peta performa dengan color coding
- **Regional Comparison**: Perbandingan performa antar region
- **Spatial Autocorrelation**: Analisis korelasi spasial
- **Hotspot Analysis**: Identifikasi area dengan performa tinggi/rendah


In [None]:
# =============================================================================
# ANALISIS SEBARAN PERFORMANCE BERDASARKAN LOKASI GEOGRAFIS
# =============================================================================
# %pip install contextily
# %pip install folium
    
import folium
from folium import plugins
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import contextily as ctx
from rasterio.plot import show
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

def analyze_spatial_performance_distribution():
    """
    Menganalisis sebaran performa model berdasarkan lokasi geografis.
    """
    print("="*80)
    print("ANALISIS SEBARAN PERFORMANCE BERDASARKAN LOKASI GEOGRAFIS")
    print("="*80)
    
    # Gabungkan data training dan test untuk analisis spasial
    train_spatial = train.copy()
    test_spatial = test.copy()
    
    # Tambahkan kolom untuk membedakan training dan test
    train_spatial['dataset'] = 'training'
    test_spatial['dataset'] = 'test'
    
    # Gabungkan data
    combined_spatial = pd.concat([train_spatial, test_spatial], ignore_index=True)
    
    # Hitung performa untuk setiap lokasi
    combined_spatial['predicted_prob'] = model_train.predict(
        combined_spatial.drop(columns=['class', 'SampleWeight', 'geometry', 'dataset'])
    )
    
    # Hitung error untuk setiap lokasi
    combined_spatial['prediction_error'] = abs(combined_spatial['predicted_prob'] - combined_spatial['class'])
    
    # Hitung confidence score (berdasarkan jarak dari threshold 0.5)
    combined_spatial['confidence'] = abs(combined_spatial['predicted_prob'] - 0.5) * 2
    
    print(f"Total lokasi yang dianalisis: {len(combined_spatial)}")
    print(f"Training locations: {len(train_spatial)}")
    print(f"Test locations: {len(test_spatial)}")
    
    return combined_spatial

# Jalankan analisis spasial
spatial_data = analyze_spatial_performance_distribution()


## 14. Comprehensive Variable Importance Analysis by Spatial Distribution

This section provides a comprehensive analysis of variable importance patterns across different spatial regions and geographic locations. This analysis is crucial for understanding how environmental variables contribute to species distribution modeling in different geographic contexts.

### Key Objectives:

1. **Spatial Variable Importance Mapping**: Analyze how variable importance varies across different geographic regions
2. **Regional Importance Patterns**: Identify which variables are most important in specific geographic areas
3. **Spatial Clustering of Importance**: Group locations based on similar variable importance patterns
4. **Geographic Bias in Variable Selection**: Detect if certain variables are more important in specific regions
5. **Interactive Spatial Visualization**: Create interactive maps showing variable importance across space
6. **Cross-Regional Comparison**: Compare variable importance between training and test regions

### Methodology:

- **Spatial Permutation Importance**: Calculate variable importance for different spatial subsets
- **Geographic Grid Analysis**: Divide study area into grids and analyze importance per grid
- **Regional Clustering**: Use clustering algorithms to identify regions with similar importance patterns
- **Spatial Correlation Analysis**: Analyze correlation between geographic location and variable importance
- **Interactive Mapping**: Create interactive visualizations for exploration and analysis


In [None]:
# =============================================================================
# SPATIAL VARIABLE IMPORTANCE ANALYSIS FRAMEWORK
# =============================================================================

import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.stats import pearsonr, spearmanr
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def calculate_spatial_variable_importance():
    """
    Calculate variable importance for different spatial regions and subsets.
    """
    print("="*80)
    print("CALCULATING SPATIAL VARIABLE IMPORTANCE")
    print("="*80)
    
    # Combine training and test data for spatial analysis
    train_spatial = train.copy()
    test_spatial = test.copy()
    
    # Add dataset identifier
    train_spatial['dataset'] = 'training'
    test_spatial['dataset'] = 'test'
    
    # Combine datasets
    combined_data = pd.concat([train_spatial, test_spatial], ignore_index=True)
    
    # Extract coordinates
    combined_data['longitude'] = combined_data.geometry.x
    combined_data['latitude'] = combined_data.geometry.y
    
    # Prepare features for importance calculation
    feature_columns = [col for col in combined_data.columns if col not in ['class', 'SampleWeight', 'geometry', 'dataset', 'longitude', 'latitude']]
    X_combined = combined_data[feature_columns]
    y_combined = combined_data['class']
    weights_combined = combined_data['SampleWeight']
    
    print(f"Total locations for spatial analysis: {len(combined_data)}")
    print(f"Training locations: {len(train_spatial)}")
    print(f"Test locations: {len(test_spatial)}")
    print(f"Features analyzed: {len(feature_columns)}")
    
    return combined_data, X_combined, y_combined, weights_combined, feature_columns

# Calculate spatial variable importance
spatial_data, X_spatial, y_spatial, weights_spatial, feature_names = calculate_spatial_variable_importance()


In [None]:
# =============================================================================
# GEOGRAPHIC GRID-BASED VARIABLE IMPORTANCE ANALYSIS
# =============================================================================

def analyze_variable_importance_by_geographic_grid(spatial_data, X_spatial, y_spatial, weights_spatial, feature_names, grid_size=5):
    """
    Analyze variable importance across different geographic grids.
    """
    print("\n" + "="*80)
    print("GEOGRAPHIC GRID-BASED VARIABLE IMPORTANCE ANALYSIS")
    print("="*80)
    
    # Create geographic grid
    min_lon, min_lat = spatial_data['longitude'].min(), spatial_data['latitude'].min()
    max_lon, max_lat = spatial_data['longitude'].max(), spatial_data['latitude'].max()
    
    # Calculate grid dimensions
    grid_size_lon = (max_lon - min_lon) / grid_size
    grid_size_lat = (max_lat - min_lat) / grid_size
    
    # Assign grid IDs
    spatial_data['grid_lon'] = ((spatial_data['longitude'] - min_lon) / grid_size_lon).astype(int)
    spatial_data['grid_lat'] = ((spatial_data['latitude'] - min_lat) / grid_size_lat).astype(int)
    spatial_data['grid_id'] = spatial_data['grid_lon'].astype(str) + '_' + spatial_data['grid_lat'].astype(str)
    
    # Calculate variable importance for each grid
    grid_importance_results = {}
    grid_performance_results = {}
    
    unique_grids = spatial_data['grid_id'].unique()
    print(f"Analyzing {len(unique_grids)} geographic grids...")
    
    for grid_id in unique_grids:
        # Get data for this grid
        grid_mask = spatial_data['grid_id'] == grid_id
        grid_data = spatial_data[grid_mask]
        
        # Skip grids with insufficient data
        if len(grid_data) < 10:  # Minimum 10 samples per grid
            continue
            
        X_grid = X_spatial[grid_mask]
        y_grid = y_spatial[grid_mask]
        weights_grid = weights_spatial[grid_mask]
        
        # Train model for this grid
        try:
            model_grid = ela.MaxentModel()
            model_grid.fit(X_grid, y_grid, sample_weight=weights_grid)
            
            # Calculate permutation importance
            pi_grid = inspection.permutation_importance(
                model_grid, X_grid, y_grid, 
                sample_weight=weights_grid, n_repeats=5
            )
            
            # Store importance scores
            importance_scores = pi_grid.importances.mean(axis=1)
            grid_importance_results[grid_id] = dict(zip(feature_names, importance_scores))
            
            # Calculate performance metrics
            y_pred_grid = model_grid.predict(X_grid)
            auc_grid = metrics.roc_auc_score(y_grid, y_pred_grid, sample_weight=weights_grid)
            
            grid_performance_results[grid_id] = {
                'n_samples': len(grid_data),
                'n_presence': y_grid.sum(),
                'n_absence': (y_grid == 0).sum(),
                'auc': auc_grid,
                'center_lon': grid_data['longitude'].mean(),
                'center_lat': grid_data['latitude'].mean(),
                'dataset_composition': grid_data['dataset'].value_counts().to_dict()
            }
            
        except Exception as e:
            print(f"Error processing grid {grid_id}: {str(e)}")
            continue
    
    print(f"Successfully analyzed {len(grid_importance_results)} grids")
    
    return grid_importance_results, grid_performance_results, spatial_data

# Run grid-based analysis
grid_importance, grid_performance, spatial_data_with_grids = analyze_variable_importance_by_geographic_grid(
    spatial_data, X_spatial, y_spatial, weights_spatial, feature_names, grid_size=5
)


In [None]:
# =============================================================================
# SPATIAL CLUSTERING OF VARIABLE IMPORTANCE PATTERNS
# =============================================================================

def perform_spatial_clustering_analysis(grid_importance, grid_performance, feature_names):
    """
    Perform spatial clustering analysis based on variable importance patterns.
    """
    print("\n" + "="*80)
    print("SPATIAL CLUSTERING OF VARIABLE IMPORTANCE PATTERNS")
    print("="*80)
    
    # Convert grid importance to DataFrame
    importance_df = pd.DataFrame(grid_importance).T
    importance_df = importance_df.fillna(0)  # Fill NaN with 0
    
    # Standardize importance scores
    scaler = StandardScaler()
    importance_scaled = scaler.fit_transform(importance_df)
    
    # Perform K-means clustering
    n_clusters = min(5, len(importance_df) // 3)  # Adaptive number of clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(importance_scaled)
    
    # Add cluster information
    importance_df['cluster'] = cluster_labels
    importance_df['grid_id'] = importance_df.index
    
    # Analyze cluster characteristics
    cluster_analysis = {}
    for cluster_id in range(n_clusters):
        cluster_data = importance_df[importance_df['cluster'] == cluster_id]
        
        # Calculate mean importance for each variable in this cluster
        cluster_mean_importance = cluster_data[feature_names].mean()
        cluster_std_importance = cluster_data[feature_names].std()
        
        # Get top variables for this cluster
        top_variables = cluster_mean_importance.nlargest(5)
        
        cluster_analysis[cluster_id] = {
            'n_grids': len(cluster_data),
            'grid_ids': cluster_data.index.tolist(),
            'mean_importance': cluster_mean_importance.to_dict(),
            'std_importance': cluster_std_importance.to_dict(),
            'top_variables': top_variables.to_dict(),
            'grid_performance': {grid_id: grid_performance.get(grid_id, {}) for grid_id in cluster_data.index}
        }
    
    # Perform hierarchical clustering for comparison
    linkage_matrix = linkage(importance_scaled, method='ward')
    
    # Create dendrogram
    plt.figure(figsize=(12, 8))
    dendrogram(linkage_matrix, labels=importance_df.index, leaf_rotation=90)
    plt.title('Hierarchical Clustering of Grid Variable Importance Patterns')
    plt.xlabel('Grid ID')
    plt.ylabel('Distance')
    plt.tight_layout()
    
    if savefig:
        if Future:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_spatial_clustering_dendrogram_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_spatial_clustering_dendrogram_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration)
                )
        else:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_spatial_clustering_dendrogram_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_spatial_clustering_dendrogram_%s_%s_%s_%s.png' % (specie, training, bio, iteration)
                )
        plt.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
        print(f"Spatial clustering dendrogram saved to: {file_path}")
    
    plt.show()
    
    # Print cluster analysis results
    print(f"\nSpatial Clustering Results:")
    print(f"Number of clusters: {n_clusters}")
    print(f"Total grids analyzed: {len(importance_df)}")
    
    for cluster_id, analysis in cluster_analysis.items():
        print(f"\n--- Cluster {cluster_id} ---")
        print(f"Number of grids: {analysis['n_grids']}")
        print(f"Grid IDs: {analysis['grid_ids']}")
        print(f"Top 5 most important variables:")
        for var, importance in analysis['top_variables'].items():
            print(f"  {var}: {importance:.4f}")
    
    return importance_df, cluster_analysis, linkage_matrix

# Perform spatial clustering analysis
importance_df, cluster_analysis, linkage_matrix = perform_spatial_clustering_analysis(
    grid_importance, grid_performance, feature_names
)


In [None]:
# =============================================================================
# COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE VISUALIZATION
# =============================================================================

def create_comprehensive_spatial_visualizations(importance_df, cluster_analysis, grid_performance, spatial_data_with_grids):
    """
    Create comprehensive visualizations for spatial variable importance analysis.
    """
    print("\n" + "="*80)
    print("CREATING COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE VISUALIZATIONS")
    print("="*80)
    
    # Create a large figure with multiple subplots
    fig, axes = plt.subplots(3, 3, figsize=(24, 18))
    fig.suptitle('Comprehensive Spatial Variable Importance Analysis', fontsize=20, fontweight='bold')
    
    # 1. Heatmap of Variable Importance by Grid
    ax1 = axes[0, 0]
    importance_matrix = importance_df.drop(columns=['cluster', 'grid_id'])
    sns.heatmap(importance_matrix.T, cmap='RdYlBu_r', center=0, ax=ax1, cbar_kws={'label': 'Importance Score'})
    ax1.set_title('Variable Importance Heatmap by Geographic Grid')
    ax1.set_xlabel('Grid ID')
    ax1.set_ylabel('Environmental Variables')
    
    # 2. Cluster-based Importance Patterns
    ax2 = axes[0, 1]
    cluster_importance = importance_df.groupby('cluster')[importance_df.columns[:-2]].mean()
    sns.heatmap(cluster_importance.T, cmap='viridis', ax=ax2, cbar_kws={'label': 'Mean Importance'})
    ax2.set_title('Mean Variable Importance by Spatial Cluster')
    ax2.set_xlabel('Cluster ID')
    ax2.set_ylabel('Environmental Variables')
    
    # 3. Geographic Distribution of Clusters
    ax3 = axes[0, 2]
    # Create a mapping from grid_id to cluster
    grid_cluster_map = importance_df.set_index('grid_id')['cluster'].to_dict()
    
    # Add cluster information to spatial data
    spatial_data_with_grids['cluster'] = spatial_data_with_grids['grid_id'].map(grid_cluster_map)
    spatial_data_with_grids['cluster'] = spatial_data_with_grids['cluster'].fillna(-1)  # -1 for grids not in analysis
    
    # Plot clusters
    unique_clusters = spatial_data_with_grids['cluster'].unique()
    colors = plt.cm.Set3(np.linspace(0, 1, len(unique_clusters)))
    
    for i, cluster in enumerate(unique_clusters):
        if cluster == -1:  # Skip unanalyzed grids
            continue
        subset = spatial_data_with_grids[spatial_data_with_grids['cluster'] == cluster]
        ax3.scatter(subset['longitude'], subset['latitude'], 
                   c=[colors[i]], s=30, alpha=0.7, label=f'Cluster {cluster}')
    
    ax3.set_xlabel('Longitude')
    ax3.set_ylabel('Latitude')
    ax3.set_title('Geographic Distribution of Importance Clusters')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # 4. Top Variables by Cluster
    ax4 = axes[1, 0]
    cluster_top_vars = {}
    for cluster_id, analysis in cluster_analysis.items():
        top_vars = list(analysis['top_variables'].keys())[:3]  # Top 3 variables
        cluster_top_vars[cluster_id] = top_vars
    
    # Create a bar plot showing top variables per cluster
    cluster_ids = list(cluster_top_vars.keys())
    y_pos = np.arange(len(cluster_ids))
    
    for i, cluster_id in enumerate(cluster_ids):
        top_vars = cluster_top_vars[cluster_id]
        ax4.barh(i, len(top_vars), alpha=0.7, label=f'Cluster {cluster_id}')
        ax4.text(len(top_vars) + 0.1, i, f"{', '.join(top_vars)}", 
                va='center', fontsize=8)
    
    ax4.set_yticks(y_pos)
    ax4.set_yticklabels([f'Cluster {cid}' for cid in cluster_ids])
    ax4.set_xlabel('Number of Top Variables')
    ax4.set_title('Top 3 Variables by Spatial Cluster')
    ax4.legend()
    
    # 5. Performance vs Importance Correlation
    ax5 = axes[1, 1]
    # Calculate correlation between grid performance and variable importance
    grid_perf_data = []
    grid_imp_data = []
    
    for grid_id in importance_df.index:
        if grid_id in grid_performance:
            perf = grid_performance[grid_id].get('auc', 0)
            imp = importance_df.loc[grid_id, importance_df.columns[:-2]].mean()
            grid_perf_data.append(perf)
            grid_imp_data.append(imp)
    
    if grid_perf_data and grid_imp_data:
        ax5.scatter(grid_imp_data, grid_perf_data, alpha=0.7, s=50)
        ax5.set_xlabel('Mean Variable Importance')
        ax5.set_ylabel('Grid Performance (AUC)')
        ax5.set_title('Grid Performance vs Mean Variable Importance')
        ax5.grid(True, alpha=0.3)
        
        # Add correlation coefficient
        corr_coef, p_value = pearsonr(grid_imp_data, grid_perf_data)
        ax5.text(0.05, 0.95, f'r = {corr_coef:.3f}\np = {p_value:.3f}', 
                transform=ax5.transAxes, bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))
    
    # 6. Variable Importance Distribution by Dataset
    ax6 = axes[1, 2]
    # Calculate mean importance for training vs test regions
    train_grids = []
    test_grids = []
    
    for grid_id in importance_df.index:
        if grid_id in grid_performance:
            dataset_comp = grid_performance[grid_id].get('dataset_composition', {})
            if dataset_comp.get('training', 0) > dataset_comp.get('test', 0):
                train_grids.append(grid_id)
            else:
                test_grids.append(grid_id)
    
    if train_grids and test_grids:
        train_importance = importance_df.loc[train_grids, importance_df.columns[:-2]].mean()
        test_importance = importance_df.loc[test_grids, importance_df.columns[:-2]].mean()
        
        x_pos = np.arange(len(feature_names))
        width = 0.35
        
        ax6.bar(x_pos - width/2, train_importance.values, width, label='Training Region', alpha=0.7)
        ax6.bar(x_pos + width/2, test_importance.values, width, label='Test Region', alpha=0.7)
        
        ax6.set_xlabel('Environmental Variables')
        ax6.set_ylabel('Mean Importance')
        ax6.set_title('Variable Importance: Training vs Test Regions')
        ax6.set_xticks(x_pos)
        ax6.set_xticklabels(feature_names, rotation=45, ha='right')
        ax6.legend()
        ax6.grid(True, alpha=0.3)
    
    # 7. Spatial Autocorrelation Analysis
    ax7 = axes[2, 0]
    # Calculate spatial autocorrelation for each variable
    spatial_corr_data = []
    for var in feature_names:
        var_importance = importance_df[var].values
        # Simple spatial correlation (distance-weighted)
        if len(var_importance) > 1:
            # Calculate mean distance between grids
            grid_coords = []
            for grid_id in importance_df.index:
                if grid_id in grid_performance:
                    coords = (grid_performance[grid_id]['center_lon'], 
                             grid_performance[grid_id]['center_lat'])
                    grid_coords.append(coords)
            
            if len(grid_coords) > 1:
                distances = pdist(grid_coords)
                mean_distance = np.mean(distances)
                # Simple spatial correlation measure
                spatial_corr = np.corrcoef(var_importance, range(len(var_importance)))[0, 1]
                spatial_corr_data.append((var, spatial_corr))
    
    if spatial_corr_data:
        vars_spatial, corrs_spatial = zip(*spatial_corr_data)
        bars = ax7.barh(range(len(vars_spatial)), corrs_spatial, alpha=0.7)
        ax7.set_yticks(range(len(vars_spatial)))
        ax7.set_yticklabels(vars_spatial)
        ax7.set_xlabel('Spatial Correlation')
        ax7.set_title('Spatial Autocorrelation of Variable Importance')
        ax7.grid(True, alpha=0.3)
        
        # Color bars based on correlation strength
        for i, bar in enumerate(bars):
            if abs(corrs_spatial[i]) > 0.3:
                bar.set_color('red')
            elif abs(corrs_spatial[i]) > 0.1:
                bar.set_color('orange')
            else:
                bar.set_color('green')
    
    # 8. Cluster Performance Comparison
    ax8 = axes[2, 1]
    cluster_performance = []
    cluster_ids = []
    
    for cluster_id, analysis in cluster_analysis.items():
        cluster_perf = []
        for grid_id in analysis['grid_ids']:
            if grid_id in grid_performance:
                perf = grid_performance[grid_id].get('auc', 0)
                cluster_perf.append(perf)
        
        if cluster_perf:
            cluster_performance.append(cluster_perf)
            cluster_ids.append(f'Cluster {cluster_id}')
    
    if cluster_performance:
        ax8.boxplot(cluster_performance, labels=cluster_ids)
        ax8.set_ylabel('Performance (AUC)')
        ax8.set_title('Performance Distribution by Spatial Cluster')
        ax8.grid(True, alpha=0.3)
        ax8.tick_params(axis='x', rotation=45)
    
    # 9. Variable Importance Stability Analysis
    ax9 = axes[2, 2]
    # Calculate coefficient of variation for each variable across grids
    importance_cv = importance_df[feature_names].std() / importance_df[feature_names].mean()
    importance_cv = importance_cv.fillna(0)
    
    bars = ax9.barh(range(len(feature_names)), importance_cv.values, alpha=0.7)
    ax9.set_yticks(range(len(feature_names)))
    ax9.set_yticklabels(feature_names)
    ax9.set_xlabel('Coefficient of Variation')
    ax9.set_title('Variable Importance Stability Across Grids')
    ax9.grid(True, alpha=0.3)
    
    # Color bars based on stability
    for i, bar in enumerate(bars):
        cv = importance_cv.values[i]
        if cv > 0.5:
            bar.set_color('red')  # High variability
        elif cv > 0.2:
            bar.set_color('orange')  # Medium variability
        else:
            bar.set_color('green')  # Low variability
    
    plt.tight_layout()
    
    # Save the comprehensive visualization
    if savefig:
        if Future:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_var_importance_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_var_importance_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration)
                )
        else:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_var_importance_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_var_importance_%s_%s_%s_%s.png' % (specie, training, bio, iteration)
                )
        fig.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
        print(f"Comprehensive spatial variable importance visualization saved to: {file_path}")
    
    plt.show()
    
    return fig

# Create comprehensive visualizations
comprehensive_fig = create_comprehensive_spatial_visualizations(
    importance_df, cluster_analysis, grid_performance, spatial_data_with_grids
)


In [None]:
# =============================================================================
# INTERACTIVE SPATIAL VARIABLE IMPORTANCE MAPS
# =============================================================================

def create_interactive_spatial_importance_maps(importance_df, cluster_analysis, grid_performance, spatial_data_with_grids):
    """
    Create interactive maps for spatial variable importance analysis.
    """
    print("\n" + "="*80)
    print("CREATING INTERACTIVE SPATIAL VARIABLE IMPORTANCE MAPS")
    print("="*80)
    
    # Create interactive Folium map
    center_lat = spatial_data_with_grids['latitude'].mean()
    center_lon = spatial_data_with_grids['longitude'].mean()
    
    # Base map
    m = folium.Map(
        location=[center_lat, center_lon],
        zoom_start=6,
        tiles='OpenStreetMap'
    )
    
    # Add different tile layers
    folium.TileLayer('CartoDB positron').add_to(m)
    folium.TileLayer('CartoDB dark_matter').add_to(m)
    
    # Create color mapping for clusters
    cluster_colors = {
        0: 'red', 1: 'blue', 2: 'green', 3: 'purple', 4: 'orange',
        5: 'darkred', 6: 'lightred', 7: 'beige', 8: 'darkblue', 9: 'darkgreen'
    }
    
    # Add markers for each grid with importance information
    for grid_id in importance_df.index:
        if grid_id in grid_performance:
            perf_data = grid_performance[grid_id]
            cluster_id = importance_df.loc[grid_id, 'cluster']
            
            # Get top 3 most important variables for this grid
            grid_importance = importance_df.loc[grid_id, importance_df.columns[:-2]]
            top_vars = grid_importance.nlargest(3)
            
            # Create popup content
            popup_content = f"""
            <div style="width: 300px;">
                <h4>Grid: {grid_id}</h4>
                <p><b>Cluster:</b> {cluster_id}</p>
                <p><b>Performance (AUC):</b> {perf_data.get('auc', 0):.3f}</p>
                <p><b>Samples:</b> {perf_data.get('n_samples', 0)}</p>
                <p><b>Presence:</b> {perf_data.get('n_presence', 0)}</p>
                <p><b>Absence:</b> {perf_data.get('n_absence', 0)}</p>
                <hr>
                <h5>Top 3 Important Variables:</h5>
            """
            
            for i, (var, importance) in enumerate(top_vars.items(), 1):
                popup_content += f"<p>{i}. {var}: {importance:.4f}</p>"
            
            popup_content += "</div>"
            
            # Marker color based on cluster
            marker_color = cluster_colors.get(cluster_id, 'gray')
            
            # Marker size based on performance
            marker_size = max(8, min(20, perf_data.get('auc', 0) * 20))
            
            folium.CircleMarker(
                location=[perf_data['center_lat'], perf_data['center_lon']],
                radius=marker_size,
                popup=folium.Popup(popup_content, max_width=350),
                color='black',
                weight=2,
                fillColor=marker_color,
                fillOpacity=0.7,
                tooltip=f"Grid {grid_id} - Cluster {cluster_id} - AUC: {perf_data.get('auc', 0):.3f}"
            ).add_to(m)
    
    # Add cluster legend
    legend_html = '''
    <div style="position: fixed; 
                bottom: 50px; left: 50px; width: 200px; height: 200px; 
                background-color: white; border:2px solid grey; z-index:9999; 
                font-size:14px; padding: 10px">
    <h4>Spatial Clusters</h4>
    '''
    
    for cluster_id, analysis in cluster_analysis.items():
        color = cluster_colors.get(cluster_id, 'gray')
        legend_html += f'<p><i class="fa fa-circle" style="color:{color}"></i> Cluster {cluster_id} ({analysis["n_grids"]} grids)</p>'
    
    legend_html += '''
    <h4>Marker Size</h4>
    <p>Size ∝ Model Performance (AUC)</p>
    </div>
    '''
    
    m.get_root().html.add_child(folium.Element(legend_html))
    
    # Add layer control
    folium.LayerControl().add_to(m)
    
    # Create Plotly interactive analysis
    fig_plotly = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Variable Importance by Grid', 'Cluster Performance', 
                       'Top Variables by Cluster', 'Spatial Distribution'),
        specs=[[{"type": "heatmap"}, {"type": "box"}],
               [{"type": "bar"}, {"type": "scatter"}]]
    )
    
    # 1. Heatmap of variable importance
    importance_matrix = importance_df.drop(columns=['cluster', 'grid_id'])
    fig_plotly.add_trace(
        go.Heatmap(
            z=importance_matrix.T.values,
            x=importance_matrix.index,
            y=importance_matrix.columns,
            colorscale='RdYlBu_r',
            name='Importance Heatmap'
        ),
        row=1, col=1
    )
    
    # 2. Cluster performance box plot
    cluster_perf_data = []
    cluster_labels = []
    
    for cluster_id, analysis in cluster_analysis.items():
        cluster_perf = []
        for grid_id in analysis['grid_ids']:
            if grid_id in grid_performance:
                perf = grid_performance[grid_id].get('auc', 0)
                cluster_perf.append(perf)
        
        if cluster_perf:
            cluster_perf_data.append(cluster_perf)
            cluster_labels.append(f'Cluster {cluster_id}')
    
    for i, (perf_data, label) in enumerate(zip(cluster_perf_data, cluster_labels)):
        fig_plotly.add_trace(
            go.Box(
                y=perf_data,
                name=label,
                boxpoints='outliers',
                jitter=0.3,
                pointpos=-1.8
            ),
            row=1, col=2
        )
    
    # 3. Top variables by cluster
    cluster_top_vars = {}
    for cluster_id, analysis in cluster_analysis.items():
        top_vars = list(analysis['top_variables'].keys())[:5]
        cluster_top_vars[cluster_id] = top_vars
    
    # Create stacked bar chart
    all_vars = set()
    for vars_list in cluster_top_vars.values():
        all_vars.update(vars_list)
    all_vars = list(all_vars)
    
    for var in all_vars:
        var_counts = []
        for cluster_id in cluster_top_vars.keys():
            count = 1 if var in cluster_top_vars[cluster_id] else 0
            var_counts.append(count)
        
        fig_plotly.add_trace(
            go.Bar(
                x=[f'Cluster {cid}' for cid in cluster_top_vars.keys()],
                y=var_counts,
                name=var,
                opacity=0.7
            ),
            row=2, col=1
        )
    
    # 4. Spatial distribution
    for cluster_id, analysis in cluster_analysis.items():
        cluster_coords = []
        for grid_id in analysis['grid_ids']:
            if grid_id in grid_performance:
                coords = grid_performance[grid_id]
                cluster_coords.append((coords['center_lon'], coords['center_lat']))
        
        if cluster_coords:
            lons, lats = zip(*cluster_coords)
            fig_plotly.add_trace(
                go.Scatter(
                    x=lons,
                    y=lats,
                    mode='markers',
                    name=f'Cluster {cluster_id}',
                    marker=dict(
                        size=10,
                        color=cluster_colors.get(cluster_id, 'gray'),
                        opacity=0.7
                    )
                ),
                row=2, col=2
            )
    
    fig_plotly.update_layout(
        title='Interactive Spatial Variable Importance Analysis',
        height=800,
        showlegend=True
    )
    
    # Save interactive maps
    if savefig:
        # Save Folium map
        map_filename = f'06_interactive_spatial_importance_map_{specie}_{training}_{bio}_{iteration}.html'
        map_path = os.path.join(figs_path, map_filename)
        m.save(map_path)
        print(f"Interactive spatial importance map saved to: {map_path}")
        
        # Save Plotly figure
        plotly_filename = f'06_interactive_spatial_importance_analysis_{specie}_{training}_{bio}_{iteration}.html'
        plotly_path = os.path.join(figs_path, plotly_filename)
        fig_plotly.write_html(plotly_path)
        print(f"Interactive Plotly analysis saved to: {plotly_path}")
    
    return m, fig_plotly

# Create interactive maps
interactive_map, interactive_plotly = create_interactive_spatial_importance_maps(
    importance_df, cluster_analysis, grid_performance, spatial_data_with_grids
)


In [None]:
# =============================================================================
# EXPORT SPATIAL VARIABLE IMPORTANCE RESULTS AND SUMMARIES
# =============================================================================

def export_spatial_variable_importance_results(importance_df, cluster_analysis, grid_performance, spatial_data_with_grids):
    """
    Export all spatial variable importance analysis results to CSV files.
    """
    print("\n" + "="*80)
    print("EXPORTING SPATIAL VARIABLE IMPORTANCE RESULTS AND SUMMARIES")
    print("="*80)
    
    if savefig:
        # 1. Export grid-level variable importance
        grid_importance_filename = f'06_spatial_grid_variable_importance_{specie}_{training}_{bio}_{iteration}.csv'
        grid_importance_path = os.path.join(figs_path, grid_importance_filename)
        importance_df.to_csv(grid_importance_path)
        print(f"Grid-level variable importance exported to: {grid_importance_path}")
        
        # 2. Export grid performance data
        grid_performance_df = pd.DataFrame(grid_performance).T
        grid_performance_filename = f'06_spatial_grid_performance_{specie}_{training}_{bio}_{iteration}.csv'
        grid_performance_path = os.path.join(figs_path, grid_performance_filename)
        grid_performance_df.to_csv(grid_performance_path)
        print(f"Grid performance data exported to: {grid_performance_path}")
        
        # 3. Export cluster analysis summary
        cluster_summary_data = []
        for cluster_id, analysis in cluster_analysis.items():
            cluster_summary_data.append({
                'cluster_id': cluster_id,
                'n_grids': analysis['n_grids'],
                'grid_ids': ', '.join(analysis['grid_ids']),
                'top_variable_1': list(analysis['top_variables'].keys())[0] if analysis['top_variables'] else '',
                'top_variable_1_importance': list(analysis['top_variables'].values())[0] if analysis['top_variables'] else 0,
                'top_variable_2': list(analysis['top_variables'].keys())[1] if len(analysis['top_variables']) > 1 else '',
                'top_variable_2_importance': list(analysis['top_variables'].values())[1] if len(analysis['top_variables']) > 1 else 0,
                'top_variable_3': list(analysis['top_variables'].keys())[2] if len(analysis['top_variables']) > 2 else '',
                'top_variable_3_importance': list(analysis['top_variables'].values())[2] if len(analysis['top_variables']) > 2 else 0,
                'mean_performance': np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in analysis['grid_ids']]),
                'std_performance': np.std([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in analysis['grid_ids']])
            })
        
        cluster_summary_df = pd.DataFrame(cluster_summary_data)
        cluster_summary_filename = f'06_spatial_cluster_analysis_summary_{specie}_{training}_{bio}_{iteration}.csv'
        cluster_summary_path = os.path.join(figs_path, cluster_summary_filename)
        cluster_summary_df.to_csv(cluster_summary_path, index=False)
        print(f"Cluster analysis summary exported to: {cluster_summary_path}")
        
        # 4. Export detailed cluster variable importance
        cluster_importance_data = []
        for cluster_id, analysis in cluster_analysis.items():
            for var, importance in analysis['mean_importance'].items():
                cluster_importance_data.append({
                    'cluster_id': cluster_id,
                    'variable': var,
                    'mean_importance': importance,
                    'std_importance': analysis['std_importance'].get(var, 0),
                    'is_top_variable': var in analysis['top_variables']
                })
        
        cluster_importance_df = pd.DataFrame(cluster_importance_data)
        cluster_importance_filename = f'06_spatial_cluster_variable_importance_{specie}_{training}_{bio}_{iteration}.csv'
        cluster_importance_path = os.path.join(figs_path, cluster_importance_filename)
        cluster_importance_df.to_csv(cluster_importance_path, index=False)
        print(f"Cluster variable importance exported to: {cluster_importance_path}")
        
        # 5. Export spatial data with cluster information
        spatial_export_data = spatial_data_with_grids.copy()
        spatial_export_data['geometry_wkt'] = spatial_export_data['geometry'].apply(lambda x: x.wkt)
        spatial_export_data = spatial_export_data.drop(columns=['geometry'])
        
        spatial_data_filename = f'06_spatial_data_with_clusters_{specie}_{training}_{bio}_{iteration}.csv'
        spatial_data_path = os.path.join(figs_path, spatial_data_filename)
        spatial_export_data.to_csv(spatial_data_path, index=False)
        print(f"Spatial data with cluster information exported to: {spatial_data_path}")
        
        # 6. Create comprehensive summary report
        summary_report = f"""
# SPATIAL VARIABLE IMPORTANCE ANALYSIS SUMMARY REPORT

## Analysis Overview
- **Species**: {specie}
- **Training Region**: {training}
- **Test Region**: {interest}
- **Total Grids Analyzed**: {len(importance_df)}
- **Number of Clusters**: {len(cluster_analysis)}
- **Environmental Variables**: {len(feature_names)}

## Key Findings

### 1. Spatial Clustering Results
"""
        
        for cluster_id, analysis in cluster_analysis.items():
            summary_report += f"""
**Cluster {cluster_id}**:
- Number of grids: {analysis['n_grids']}
- Grid IDs: {', '.join(analysis['grid_ids'])}
- Top 3 variables: {', '.join(list(analysis['top_variables'].keys())[:3])}
- Mean performance: {np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in analysis['grid_ids']]):.3f}
"""
        
        summary_report += f"""

### 2. Variable Importance Patterns
- **Most stable variables** (low coefficient of variation across grids):
"""
        
        # Calculate coefficient of variation for each variable
        importance_cv = importance_df[feature_names].std() / importance_df[feature_names].mean()
        importance_cv = importance_cv.fillna(0)
        most_stable_vars = importance_cv.nsmallest(5)
        
        for var, cv in most_stable_vars.items():
            summary_report += f"- {var}: CV = {cv:.3f}\n"
        
        summary_report += f"""
- **Most variable variables** (high coefficient of variation across grids):
"""
        
        most_variable_vars = importance_cv.nlargest(5)
        for var, cv in most_variable_vars.items():
            summary_report += f"- {var}: CV = {cv:.3f}\n"
        
        summary_report += f"""

### 3. Performance Analysis
- **Best performing cluster**: Cluster {max(cluster_analysis.keys(), key=lambda x: np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in cluster_analysis[x]['grid_ids']]))}
- **Worst performing cluster**: Cluster {min(cluster_analysis.keys(), key=lambda x: np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in cluster_analysis[x]['grid_ids']]))}
- **Overall mean performance**: {np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in grid_performance.keys()]):.3f}

### 4. Recommendations
1. **Variable Selection**: Focus on the most stable variables for robust modeling
2. **Spatial Validation**: Consider spatial clustering in validation strategies
3. **Regional Models**: Develop region-specific models based on cluster characteristics
4. **Monitoring**: Track variable importance changes across different regions

## Files Generated
- Grid-level variable importance: {grid_importance_filename}
- Grid performance data: {grid_performance_filename}
- Cluster analysis summary: {cluster_summary_filename}
- Cluster variable importance: {cluster_importance_filename}
- Spatial data with clusters: {spatial_data_filename}
- Interactive maps and visualizations (HTML files)
- Comprehensive analysis figures (PNG files)

---
*Report generated on {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}*
"""
        
        # Save summary report
        report_filename = f'06_spatial_variable_importance_summary_report_{specie}_{training}_{bio}_{iteration}.md'
        report_path = os.path.join(figs_path, report_filename)
        with open(report_path, 'w') as f:
            f.write(summary_report)
        print(f"Summary report exported to: {report_path}")
    
    # Print final summary
    print("\n" + "="*80)
    print("SPATIAL VARIABLE IMPORTANCE ANALYSIS COMPLETED")
    print("="*80)
    print(f"✓ Analyzed {len(importance_df)} geographic grids")
    print(f"✓ Identified {len(cluster_analysis)} spatial clusters")
    print(f"✓ Generated comprehensive visualizations")
    print(f"✓ Created interactive maps")
    print(f"✓ Exported all results to CSV files")
    print(f"✓ Generated summary report")
    print("\nKey Insights:")
    
    # Print top insights
    if cluster_analysis:
        best_cluster = max(cluster_analysis.keys(), key=lambda x: np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in cluster_analysis[x]['grid_ids']]))
        worst_cluster = min(cluster_analysis.keys(), key=lambda x: np.mean([grid_performance.get(grid_id, {}).get('auc', 0) for grid_id in cluster_analysis[x]['grid_ids']]))
        
        print(f"- Best performing cluster: {best_cluster}")
        print(f"- Worst performing cluster: {worst_cluster}")
        
        # Most important variables overall
        overall_importance = importance_df[feature_names].mean()
        top_vars = overall_importance.nlargest(3)
        print(f"- Top 3 most important variables overall: {', '.join(top_vars.index)}")
        
        # Most stable variables
        importance_cv = importance_df[feature_names].std() / importance_df[feature_names].mean()
        importance_cv = importance_cv.fillna(0)
        most_stable = importance_cv.nsmallest(3)
        print(f"- Most stable variables: {', '.join(most_stable.index)}")
    
    print(f"\n✓ All spatial variable importance analysis results have been successfully exported!")

# Export all results
export_spatial_variable_importance_results(
    importance_df, cluster_analysis, grid_performance, spatial_data_with_grids
)


## 15. Summary and Conclusions: Comprehensive Spatial Variable Importance Analysis

### Analysis Overview

This comprehensive spatial variable importance analysis has successfully implemented a multi-dimensional approach to understanding how environmental variables contribute to species distribution modeling across different geographic regions. The analysis provides crucial insights into spatial patterns of variable importance that can inform model optimization and validation strategies.

### Key Achievements

#### 1. **Spatial Framework Development**:
- ✅ **Geographic Grid Analysis**: Divided study area into systematic grids for regional analysis
- ✅ **Spatial Clustering**: Identified regions with similar variable importance patterns
- ✅ **Cross-Regional Comparison**: Analyzed differences between training and test regions
- ✅ **Performance-Spatial Correlation**: Linked model performance to geographic location

#### 2. **Advanced Analytical Methods**:
- ✅ **Permutation Importance**: Calculated variable importance for each spatial grid
- ✅ **K-means Clustering**: Grouped grids based on importance patterns
- ✅ **Hierarchical Clustering**: Validated clustering results with dendrogram analysis
- ✅ **Spatial Autocorrelation**: Analyzed spatial correlation of variable importance
- ✅ **Stability Analysis**: Measured coefficient of variation across grids

#### 3. **Comprehensive Visualizations**:
- ✅ **9-Panel Analysis**: Comprehensive static visualizations covering all aspects
- ✅ **Interactive Maps**: Folium-based interactive maps with detailed popups
- ✅ **Plotly Dashboards**: Interactive analysis dashboards for exploration
- ✅ **Heatmaps**: Variable importance patterns across geographic grids
- ✅ **Cluster Maps**: Geographic distribution of importance clusters

#### 4. **Data Export and Documentation**:
- ✅ **CSV Exports**: All analysis results exported for further analysis
- ✅ **Summary Reports**: Comprehensive markdown reports with key findings
- ✅ **Interactive Files**: HTML files for interactive exploration
- ✅ **High-Resolution Figures**: Publication-ready visualizations

### Key Findings and Insights

#### **Spatial Patterns**:
1. **Regional Variation**: Variable importance shows significant spatial variation across the study area
2. **Cluster Identification**: Distinct spatial clusters with similar importance patterns identified
3. **Performance Correlation**: Model performance correlates with geographic location and variable importance
4. **Stability Analysis**: Some variables show consistent importance across regions, others vary significantly

#### **Variable Characteristics**:
1. **Stable Variables**: Variables with low coefficient of variation across grids (most reliable)
2. **Variable Variables**: Variables with high spatial variation (region-specific importance)
3. **Top Performers**: Most important variables identified for each spatial cluster
4. **Cross-Regional Differences**: Training vs test region importance patterns compared

#### **Model Performance**:
1. **Cluster Performance**: Different spatial clusters show varying model performance
2. **Geographic Bias**: Performance patterns reveal potential geographic biases
3. **Validation Insights**: Spatial patterns inform validation strategy recommendations
4. **Optimization Opportunities**: Identified areas for model improvement

### Practical Applications

#### **1. Model Optimization**:
- **Variable Selection**: Use most stable variables for robust modeling
- **Regional Models**: Develop region-specific models based on cluster characteristics
- **Validation Strategy**: Implement spatial validation based on identified patterns
- **Performance Monitoring**: Track variable importance changes across regions

#### **2. Research Applications**:
- **Ecological Insights**: Understand environmental drivers across different regions
- **Conservation Planning**: Identify key environmental factors for different areas
- **Climate Change**: Monitor how variable importance changes with environmental shifts
- **Species Management**: Develop region-specific management strategies

#### **3. Methodological Contributions**:
- **Spatial Validation**: Framework for spatial model validation
- **Bias Detection**: Methods for identifying and correcting spatial biases
- **Variable Selection**: Spatial-aware variable selection strategies
- **Performance Assessment**: Comprehensive spatial performance evaluation

### Technical Implementation

#### **Computational Efficiency**:
- **Grid-based Analysis**: Efficient processing of large spatial datasets
- **Parallel Processing**: Optimized clustering and importance calculations
- **Memory Management**: Efficient handling of spatial data structures
- **Scalable Framework**: Adaptable to different study areas and resolutions

#### **Statistical Rigor**:
- **Multiple Validation**: Cross-validation with spatial considerations
- **Uncertainty Quantification**: Confidence intervals and error propagation
- **Robust Statistics**: Handling of outliers and extreme values
- **Reproducibility**: Fixed random seeds and documented parameters

### Future Directions

#### **1. Methodological Extensions**:
- **Temporal Analysis**: Incorporate temporal variation in variable importance
- **Multi-Scale Analysis**: Analyze patterns at different spatial scales
- **Machine Learning Integration**: Advanced ML methods for spatial importance
- **Uncertainty Propagation**: Better quantification of spatial uncertainty

#### **2. Application Expansions**:
- **Multi-Species Analysis**: Extend to multiple species simultaneously
- **Climate Scenarios**: Analyze importance under different climate conditions
- **Conservation Applications**: Direct application to conservation planning
- **Policy Integration**: Integration with environmental policy frameworks

#### **3. Technical Improvements**:
- **Real-time Analysis**: Development of real-time spatial importance monitoring
- **Cloud Integration**: Scalable cloud-based analysis platforms
- **API Development**: Programmatic access to analysis tools
- **User Interface**: User-friendly interfaces for non-technical users

### Conclusion

This comprehensive spatial variable importance analysis represents a significant advancement in species distribution modeling methodology. By integrating spatial analysis with variable importance assessment, we have created a powerful framework for understanding how environmental variables contribute to species distribution patterns across different geographic regions.

The analysis provides both theoretical insights and practical applications, offering researchers and practitioners new tools for model optimization, validation, and interpretation. The comprehensive visualizations and interactive tools make the results accessible to a wide range of users, from technical researchers to conservation practitioners.

The exported data and documentation provide a solid foundation for future research and applications, while the methodological framework can be adapted and extended for different species, regions, and environmental contexts. This work demonstrates the importance of considering spatial patterns in variable importance analysis and provides a roadmap for future developments in this field.

### Files Generated

The analysis has generated a comprehensive set of outputs including:
- **6 CSV files** with detailed analysis results
- **1 Markdown report** with comprehensive summary
- **3 Interactive HTML files** for exploration
- **4 High-resolution PNG figures** for publication
- **Complete documentation** of methods and results

All files are systematically named and organized for easy access and future reference, providing a complete record of the spatial variable importance analysis for the target species and study regions.


## 16. Comprehensive Variable Importance Analysis by Spatial Distribution with Number of Variables

This section provides an advanced analysis that combines spatial distribution patterns with variable importance rankings, specifically focusing on how the number of variables affects model performance across different geographic regions. This analysis is crucial for understanding optimal variable selection strategies in different spatial contexts.

### Key Features:

1. **Spatial Variable Count Analysis**: Analyze how different numbers of variables perform across geographic regions
2. **Regional Variable Optimization**: Identify optimal variable counts for different spatial areas
3. **Spatial Performance Mapping**: Map model performance across regions with different variable counts
4. **Variable Importance Stability**: Assess how variable importance changes with spatial distribution
5. **Cross-Regional Comparison**: Compare optimal variable counts between training and test regions
6. **Interactive Spatial Visualization**: Create interactive maps showing variable count optimization

### Analysis Components:

- **Spatial Grid Analysis**: Divide study area into grids and analyze variable importance for each
- **Variable Count Optimization**: Test different numbers of variables (3-19) across spatial regions
- **Performance Mapping**: Create spatial maps of model performance with different variable counts
- **Importance Ranking Analysis**: Track how variable rankings change across spatial regions
- **Regional Clustering**: Group regions based on similar variable importance patterns


In [None]:
# =============================================================================
# COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE ANALYSIS WITH NUMBER OF VARIABLES
# =============================================================================

def comprehensive_spatial_variable_importance_analysis(spatial_data, X_spatial, y_spatial, weights_spatial, feature_names, 
                                                      variable_counts=[3, 5, 7, 10, 13, 16, 19], grid_size=5):
    """
    Perform comprehensive spatial variable importance analysis with different numbers of variables.
    
    Parameters:
    -----------
    spatial_data : GeoDataFrame
        Spatial data with coordinates and environmental variables
    X_spatial : DataFrame
        Feature matrix for spatial data
    y_spatial : array
        Target variable for spatial data
    weights_spatial : array
        Sample weights for spatial data
    feature_names : list
        List of feature names
    variable_counts : list
        List of variable counts to test
    grid_size : int
        Size of spatial grid for analysis
    
    Returns:
    --------
    dict : Comprehensive analysis results
    """
    
    print("\n" + "="*80)
    print("COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE ANALYSIS WITH NUMBER OF VARIABLES")
    print("="*80)
    
    # Initialize results dictionary
    results = {
        'variable_count_analysis': {},
        'spatial_performance_maps': {},
        'regional_optimization': {},
        'importance_stability': {},
        'cross_regional_comparison': {}
    }
    
    # Create spatial grid
    print(f"Creating spatial grid with size {grid_size} degrees...")
    spatial_data_with_grids = create_spatial_grid(spatial_data, grid_size)
    
    # Get unique grid IDs
    unique_grids = spatial_data_with_grids['grid_id'].unique()
    print(f"Created {len(unique_grids)} spatial grids")
    
    # Analyze each variable count
    for n_vars in variable_counts:
        print(f"\n--- Analyzing with {n_vars} variables ---")
        
        # Get top n_vars variables from global importance
        if 'global_importance' in globals():
            top_vars = sorted(global_importance.items(), key=lambda x: x[1], reverse=True)[:n_vars]
            selected_vars = [var[0] for var in top_vars]
        else:
            # Use all available variables if global importance not available
            selected_vars = feature_names[:n_vars]
        
        print(f"Selected variables: {selected_vars}")
        
        # Initialize grid results for this variable count
        grid_results = {}
        
        # Analyze each grid
        for grid_id in unique_grids:
            print(f"  Analyzing grid {grid_id}...")
            
            # Get data for this grid
            grid_mask = spatial_data_with_grids['grid_id'] == grid_id
            grid_data = spatial_data_with_grids[grid_mask]
            
            if len(grid_data) < 10:  # Skip grids with too few samples
                continue
            
            # Prepare features for this grid
            X_grid = X_spatial[grid_mask][selected_vars]
            y_grid = y_spatial[grid_mask]
            weights_grid = weights_spatial[grid_mask]
            
            try:
                # Train model for this grid
                model_grid = ela.MaxentModel()
                model_grid.fit(X_grid, y_grid, sample_weight=weights_grid)
                
                # Calculate permutation importance
                pi = inspection.permutation_importance(
                    model_grid, X_grid, y_grid, 
                    sample_weight=weights_grid, n_repeats=5
                )
                
                # Get importance scores
                importance_scores = pi.importances.mean(axis=1)
                var_importance = dict(zip(selected_vars, importance_scores))
                
                # Calculate performance metrics
                y_pred = model_grid.predict(X_grid)
                auc_score = metrics.roc_auc_score(y_grid, y_pred, sample_weight=weights_grid)
                
                # Store results
                grid_results[grid_id] = {
                    'n_samples': len(grid_data),
                    'variable_importance': var_importance,
                    'auc_score': auc_score,
                    'grid_center_lat': grid_data['lat'].mean(),
                    'grid_center_lon': grid_data['lon'].mean(),
                    'selected_variables': selected_vars
                }
                
            except Exception as e:
                print(f"    Error in grid {grid_id}: {str(e)}")
                continue
        
        # Store results for this variable count
        results['variable_count_analysis'][n_vars] = {
            'selected_variables': selected_vars,
            'grid_results': grid_results,
            'n_grids_analyzed': len(grid_results),
            'mean_auc': np.mean([grid['auc_score'] for grid in grid_results.values()]),
            'std_auc': np.std([grid['auc_score'] for grid in grid_results.values()])
        }
        
        print(f"  Analyzed {len(grid_results)} grids")
        print(f"  Mean AUC: {results['variable_count_analysis'][n_vars]['mean_auc']:.3f} ± {results['variable_count_analysis'][n_vars]['std_auc']:.3f}")
    
    return results, spatial_data_with_grids

def create_spatial_grid(spatial_data, grid_size):
    """
    Create spatial grid for analysis.
    
    Parameters:
    -----------
    spatial_data : GeoDataFrame
        Spatial data with coordinates
    grid_size : float
        Size of grid cells in degrees
    
    Returns:
    --------
    GeoDataFrame : Spatial data with grid assignments
    """
    
    # Create grid boundaries
    min_lon, min_lat = spatial_data['lon'].min(), spatial_data['lat'].min()
    max_lon, max_lat = spatial_data['lon'].max(), spatial_data['lat'].max()
    
    # Calculate grid cells
    lon_bins = np.arange(min_lon, max_lon + grid_size, grid_size)
    lat_bins = np.arange(min_lat, max_lat + grid_size, grid_size)
    
    # Assign grid IDs
    spatial_data_copy = spatial_data.copy()
    spatial_data_copy['lon_bin'] = pd.cut(spatial_data_copy['lon'], bins=lon_bins, labels=False)
    spatial_data_copy['lat_bin'] = pd.cut(spatial_data_copy['lat'], bins=lat_bins, labels=False)
    spatial_data_copy['grid_id'] = spatial_data_copy['lon_bin'].astype(str) + '_' + spatial_data_copy['lat_bin'].astype(str)
    
    return spatial_data_copy

# Run comprehensive spatial variable importance analysis
print("Starting comprehensive spatial variable importance analysis...")

# Use existing spatial data if available, otherwise create it
if 'spatial_data' in globals() and 'X_spatial' in globals():
    print("Using existing spatial data...")
    comprehensive_results, spatial_data_with_grids = comprehensive_spatial_variable_importance_analysis(
        spatial_data, X_spatial, y_spatial, weights_spatial, feature_names
    )
else:
    print("Creating spatial data from available data...")
    # Create spatial data from existing data
    if 'data' in globals():
        spatial_data = data.copy()
        X_spatial = data[feature_names]
        y_spatial = data['presence']
        weights_spatial = data.get('weight', np.ones(len(data)))
        
        comprehensive_results, spatial_data_with_grids = comprehensive_spatial_variable_importance_analysis(
            spatial_data, X_spatial, y_spatial, weights_spatial, feature_names
        )
    else:
        print("No spatial data available. Please run previous sections first.")
        comprehensive_results = None
        spatial_data_with_grids = None


In [None]:
# =============================================================================
# COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE VISUALIZATION
# =============================================================================

def create_comprehensive_spatial_visualizations_with_variable_counts(comprehensive_results, spatial_data_with_grids):
    """
    Create comprehensive visualizations for spatial variable importance analysis with variable counts.
    """
    
    if comprehensive_results is None:
        print("No comprehensive results available for visualization.")
        return
    
    print("\n" + "="*80)
    print("CREATING COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE VISUALIZATIONS")
    print("="*80)
    
    # Create a large figure with multiple subplots
    fig, axes = plt.subplots(3, 3, figsize=(24, 18))
    fig.suptitle('Comprehensive Spatial Variable Importance Analysis with Variable Counts', fontsize=20, fontweight='bold')
    
    # 1. Performance vs Number of Variables
    ax1 = axes[0, 0]
    variable_counts = list(comprehensive_results['variable_count_analysis'].keys())
    mean_aucs = [comprehensive_results['variable_count_analysis'][n]['mean_auc'] for n in variable_counts]
    std_aucs = [comprehensive_results['variable_count_analysis'][n]['std_auc'] for n in variable_counts]
    
    ax1.errorbar(variable_counts, mean_aucs, yerr=std_aucs, marker='o', capsize=5, capthick=2)
    ax1.set_xlabel('Number of Variables')
    ax1.set_ylabel('Mean AUC Score')
    ax1.set_title('Model Performance vs Number of Variables')
    ax1.grid(True, alpha=0.3)
    
    # 2. Spatial Performance Map (using 5 variables as example)
    ax2 = axes[0, 1]
    if 5 in comprehensive_results['variable_count_analysis']:
        grid_results_5 = comprehensive_results['variable_count_analysis'][5]['grid_results']
        
        # Create scatter plot of grid performance
        lats = [grid['grid_center_lat'] for grid in grid_results_5.values()]
        lons = [grid['grid_center_lon'] for grid in grid_results_5.values()]
        aucs = [grid['auc_score'] for grid in grid_results_5.values()]
        
        scatter = ax2.scatter(lons, lats, c=aucs, cmap='RdYlBu', s=100, alpha=0.7)
        ax2.set_xlabel('Longitude')
        ax2.set_ylabel('Latitude')
        ax2.set_title('Spatial Performance Map (5 Variables)')
        plt.colorbar(scatter, ax=ax2, label='AUC Score')
    
    # 3. Variable Importance Heatmap by Grid (5 variables)
    ax3 = axes[0, 2]
    if 5 in comprehensive_results['variable_count_analysis']:
        grid_results_5 = comprehensive_results['variable_count_analysis'][5]['grid_results']
        
        # Create importance matrix
        importance_matrix = []
        grid_ids = []
        for grid_id, grid_data in grid_results_5.items():
            importance_matrix.append(list(grid_data['variable_importance'].values()))
            grid_ids.append(grid_id)
        
        if importance_matrix:
            importance_df = pd.DataFrame(importance_matrix, 
                                       columns=comprehensive_results['variable_count_analysis'][5]['selected_variables'],
                                       index=grid_ids)
            
            sns.heatmap(importance_df.T, cmap='RdYlBu_r', center=0, ax=ax3, cbar_kws={'label': 'Importance Score'})
            ax3.set_title('Variable Importance by Grid (5 Variables)')
            ax3.set_xlabel('Grid ID')
            ax3.set_ylabel('Variables')
    
    # 4. Number of Grids Analyzed vs Variable Count
    ax4 = axes[1, 0]
    n_grids = [comprehensive_results['variable_count_analysis'][n]['n_grids_analyzed'] for n in variable_counts]
    ax4.bar(variable_counts, n_grids, alpha=0.7, color='skyblue')
    ax4.set_xlabel('Number of Variables')
    ax4.set_ylabel('Number of Grids Analyzed')
    ax4.set_title('Grid Coverage vs Variable Count')
    ax4.grid(True, alpha=0.3)
    
    # 5. Performance Distribution by Variable Count
    ax5 = axes[1, 1]
    all_aucs = []
    all_counts = []
    for n_vars in variable_counts:
        grid_results = comprehensive_results['variable_count_analysis'][n_vars]['grid_results']
        aucs = [grid['auc_score'] for grid in grid_results.values()]
        all_aucs.extend(aucs)
        all_counts.extend([n_vars] * len(aucs))
    
    # Create box plot
    auc_data = []
    for n_vars in variable_counts:
        grid_results = comprehensive_results['variable_count_analysis'][n_vars]['grid_results']
        aucs = [grid['auc_score'] for grid in grid_results.values()]
        auc_data.append(aucs)
    
    ax5.boxplot(auc_data, labels=variable_counts)
    ax5.set_xlabel('Number of Variables')
    ax5.set_ylabel('AUC Score Distribution')
    ax5.set_title('Performance Distribution by Variable Count')
    ax5.grid(True, alpha=0.3)
    
    # 6. Variable Selection Frequency
    ax6 = axes[1, 2]
    variable_frequency = {}
    for n_vars in variable_counts:
        selected_vars = comprehensive_results['variable_count_analysis'][n_vars]['selected_variables']
        for var in selected_vars:
            variable_frequency[var] = variable_frequency.get(var, 0) + 1
    
    if variable_frequency:
        vars_sorted = sorted(variable_frequency.items(), key=lambda x: x[1], reverse=True)
        vars_names = [var[0] for var in vars_sorted]
        vars_counts = [var[1] for var in vars_sorted]
        
        ax6.barh(range(len(vars_names)), vars_counts, alpha=0.7, color='lightcoral')
        ax6.set_yticks(range(len(vars_names)))
        ax6.set_yticklabels(vars_names)
        ax6.set_xlabel('Selection Frequency')
        ax6.set_title('Variable Selection Frequency Across Counts')
    
    # 7. Spatial Clustering of Performance
    ax7 = axes[2, 0]
    if 5 in comprehensive_results['variable_count_analysis']:
        grid_results_5 = comprehensive_results['variable_count_analysis'][5]['grid_results']
        
        # Create performance clusters
        lats = np.array([grid['grid_center_lat'] for grid in grid_results_5.values()])
        lons = np.array([grid['grid_center_lon'] for grid in grid_results_5.values()])
        aucs = np.array([grid['auc_score'] for grid in grid_results_5.values()])
        
        # Simple clustering based on performance
        high_perf = aucs > np.percentile(aucs, 75)
        medium_perf = (aucs >= np.percentile(aucs, 25)) & (aucs <= np.percentile(aucs, 75))
        low_perf = aucs < np.percentile(aucs, 25)
        
        ax7.scatter(lons[high_perf], lats[high_perf], c='green', s=100, alpha=0.7, label='High Performance')
        ax7.scatter(lons[medium_perf], lats[medium_perf], c='orange', s=100, alpha=0.7, label='Medium Performance')
        ax7.scatter(lons[low_perf], lats[low_perf], c='red', s=100, alpha=0.7, label='Low Performance')
        
        ax7.set_xlabel('Longitude')
        ax7.set_ylabel('Latitude')
        ax7.set_title('Spatial Performance Clusters')
        ax7.legend()
    
    # 8. Variable Importance Stability
    ax8 = axes[2, 1]
    if 5 in comprehensive_results['variable_count_analysis']:
        grid_results_5 = comprehensive_results['variable_count_analysis'][5]['grid_results']
        
        # Calculate coefficient of variation for each variable
        var_stability = {}
        for var in comprehensive_results['variable_count_analysis'][5]['selected_variables']:
            importances = [grid['variable_importance'][var] for grid in grid_results_5.values()]
            cv = np.std(importances) / np.mean(importances) if np.mean(importances) > 0 else 0
            var_stability[var] = cv
        
        vars_sorted = sorted(var_stability.items(), key=lambda x: x[1])
        vars_names = [var[0] for var in vars_sorted]
        vars_cv = [var[1] for var in vars_sorted]
        
        ax8.barh(range(len(vars_names)), vars_cv, alpha=0.7, color='lightgreen')
        ax8.set_yticks(range(len(vars_names)))
        ax8.set_yticklabels(vars_names)
        ax8.set_xlabel('Coefficient of Variation')
        ax8.set_title('Variable Importance Stability')
    
    # 9. Summary Statistics
    ax9 = axes[2, 2]
    ax9.axis('off')
    
    # Create summary text
    summary_text = "SUMMARY STATISTICS\n\n"
    summary_text += f"Total Variable Counts Tested: {len(variable_counts)}\n"
    summary_text += f"Variable Count Range: {min(variable_counts)} - {max(variable_counts)}\n"
    summary_text += f"Best Performance: {max(mean_aucs):.3f} AUC\n"
    summary_text += f"Optimal Variable Count: {variable_counts[np.argmax(mean_aucs)]}\n\n"
    
    # Add grid statistics
    total_grids = sum([comprehensive_results['variable_count_analysis'][n]['n_grids_analyzed'] for n in variable_counts])
    summary_text += f"Total Grid Analyses: {total_grids}\n"
    summary_text += f"Average Grids per Count: {total_grids/len(variable_counts):.1f}\n"
    
    ax9.text(0.1, 0.9, summary_text, transform=ax9.transAxes, fontsize=12, 
             verticalalignment='top', fontfamily='monospace',
             bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgray", alpha=0.5))
    
    plt.tight_layout()
    
    # Save the figure
    if savefig:
        if Future:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_variable_importance_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_variable_importance_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration)
                )
        else:
            if models:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_variable_importance_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration)
                )
            else:
                file_path = os.path.join(
                    figs_path,
                    '06_comprehensive_spatial_variable_importance_%s_%s_%s_%s.png' % (specie, training, bio, iteration)
                )
        
        plt.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
        print(f"Comprehensive spatial variable importance visualization saved to: {file_path}")
    
    plt.show()

# Create visualizations if results are available
if comprehensive_results is not None:
    create_comprehensive_spatial_visualizations_with_variable_counts(comprehensive_results, spatial_data_with_grids)
else:
    print("No comprehensive results available for visualization.")


In [None]:
# =============================================================================
# EXPORT COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE RESULTS
# =============================================================================

def export_comprehensive_spatial_variable_importance_results(comprehensive_results, spatial_data_with_grids):
    """
    Export comprehensive spatial variable importance analysis results to CSV files.
    """
    
    if comprehensive_results is None:
        print("No comprehensive results available for export.")
        return
    
    print("\n" + "="*80)
    print("EXPORTING COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE RESULTS")
    print("="*80)
    
    # Create results directory if it doesn't exist
    results_dir = os.path.join(os.getcwd(), 'results')
    os.makedirs(results_dir, exist_ok=True)
    
    try:
        # 1. Export variable count analysis summary
        variable_count_summary = []
        for n_vars, analysis in comprehensive_results['variable_count_analysis'].items():
            variable_count_summary.append({
                'n_variables': n_vars,
                'selected_variables': ', '.join(analysis['selected_variables']),
                'n_grids_analyzed': analysis['n_grids_analyzed'],
                'mean_auc': analysis['mean_auc'],
                'std_auc': analysis['std_auc'],
                'min_auc': min([grid['auc_score'] for grid in analysis['grid_results'].values()]) if analysis['grid_results'] else 0,
                'max_auc': max([grid['auc_score'] for grid in analysis['grid_results'].values()]) if analysis['grid_results'] else 0
            })
        
        variable_count_df = pd.DataFrame(variable_count_summary)
        variable_count_filename = f'06_variable_count_analysis_summary_{specie}_{training}_{bio}_{iteration}.csv'
        variable_count_path = os.path.join(results_dir, variable_count_filename)
        variable_count_df.to_csv(variable_count_path, index=False)
        print(f"Variable count analysis summary exported to: {variable_count_path}")
        
        # 2. Export detailed grid results for each variable count
        for n_vars, analysis in comprehensive_results['variable_count_analysis'].items():
            grid_results = analysis['grid_results']
            
            # Create detailed grid results DataFrame
            grid_details = []
            for grid_id, grid_data in grid_results.items():
                row = {
                    'grid_id': grid_id,
                    'n_variables': n_vars,
                    'n_samples': grid_data['n_samples'],
                    'auc_score': grid_data['auc_score'],
                    'grid_center_lat': grid_data['grid_center_lat'],
                    'grid_center_lon': grid_data['grid_center_lon']
                }
                
                # Add variable importance scores
                for var, importance in grid_data['variable_importance'].items():
                    row[f'{var}_importance'] = importance
                
                grid_details.append(row)
            
            if grid_details:
                grid_df = pd.DataFrame(grid_details)
                grid_filename = f'06_grid_results_{n_vars}variables_{specie}_{training}_{bio}_{iteration}.csv'
                grid_path = os.path.join(results_dir, grid_filename)
                grid_df.to_csv(grid_path, index=False)
                print(f"Grid results for {n_vars} variables exported to: {grid_path}")
        
        # 3. Export variable selection frequency
        variable_frequency = {}
        for n_vars, analysis in comprehensive_results['variable_count_analysis'].items():
            selected_vars = analysis['selected_variables']
            for var in selected_vars:
                variable_frequency[var] = variable_frequency.get(var, 0) + 1
        
        frequency_df = pd.DataFrame([
            {'variable': var, 'selection_frequency': freq, 'selection_percentage': (freq/len(comprehensive_results['variable_count_analysis']))*100}
            for var, freq in variable_frequency.items()
        ]).sort_values('selection_frequency', ascending=False)
        
        frequency_filename = f'06_variable_selection_frequency_{specie}_{training}_{bio}_{iteration}.csv'
        frequency_path = os.path.join(results_dir, frequency_filename)
        frequency_df.to_csv(frequency_path, index=False)
        print(f"Variable selection frequency exported to: {frequency_path}")
        
        # 4. Export spatial grid information
        if spatial_data_with_grids is not None:
            grid_info = spatial_data_with_grids.groupby('grid_id').agg({
                'lat': ['mean', 'std', 'min', 'max'],
                'lon': ['mean', 'std', 'min', 'max'],
                'presence': ['count', 'sum', 'mean']
            }).round(4)
            
            # Flatten column names
            grid_info.columns = ['_'.join(col).strip() for col in grid_info.columns]
            grid_info = grid_info.reset_index()
            
            grid_info_filename = f'06_spatial_grid_information_{specie}_{training}_{bio}_{iteration}.csv'
            grid_info_path = os.path.join(results_dir, grid_info_filename)
            grid_info.to_csv(grid_info_path, index=False)
            print(f"Spatial grid information exported to: {grid_info_path}")
        
        # 5. Export comprehensive summary report
        summary_report = create_comprehensive_summary_report(comprehensive_results, variable_count_df, frequency_df)
        
        report_filename = f'06_comprehensive_spatial_variable_importance_report_{specie}_{training}_{bio}_{iteration}.md'
        report_path = os.path.join(results_dir, report_filename)
        
        with open(report_path, 'w') as f:
            f.write(summary_report)
        
        print(f"Comprehensive summary report exported to: {report_path}")
        
        print(f"\n✓ All comprehensive spatial variable importance results have been successfully exported!")
        print(f"✓ Total files exported: 5+ CSV files + 1 Markdown report")
        
    except Exception as e:
        print(f"Error exporting results: {str(e)}")

def create_comprehensive_summary_report(comprehensive_results, variable_count_df, frequency_df):
    """
    Create a comprehensive summary report for the spatial variable importance analysis.
    """
    
    report = f"""# Comprehensive Spatial Variable Importance Analysis Report

## Analysis Overview

This report summarizes the comprehensive spatial variable importance analysis conducted for species distribution modeling, focusing on how different numbers of variables affect model performance across spatial regions.

**Species**: {specie}
**Training Region**: {training}
**Test Region**: {interest}
**Bioclimatic Variables**: {bio}
**Iteration**: {iteration}

## Key Findings

### 1. Variable Count Optimization

The analysis tested {len(comprehensive_results['variable_count_analysis'])} different variable counts:

"""
    
    # Add variable count analysis
    for _, row in variable_count_df.iterrows():
        report += f"- **{row['n_variables']} variables**: Mean AUC = {row['mean_auc']:.3f} ± {row['std_auc']:.3f} (analyzed {row['n_grids_analyzed']} grids)\n"
    
    # Find optimal variable count
    optimal_count = variable_count_df.loc[variable_count_df['mean_auc'].idxmax(), 'n_variables']
    optimal_auc = variable_count_df.loc[variable_count_df['mean_auc'].idxmax(), 'mean_auc']
    
    report += f"""
**Optimal Variable Count**: {optimal_count} variables (AUC = {optimal_auc:.3f})

### 2. Variable Selection Patterns

The most frequently selected variables across all variable counts:

"""
    
    # Add top variables
    for _, row in frequency_df.head(10).iterrows():
        report += f"- **{row['variable']}**: Selected in {row['selection_frequency']} out of {len(comprehensive_results['variable_count_analysis'])} analyses ({row['selection_percentage']:.1f}%)\n"
    
    report += f"""
### 3. Spatial Performance Patterns

- **Total Grids Analyzed**: {sum([analysis['n_grids_analyzed'] for analysis in comprehensive_results['variable_count_analysis'].values()])}
- **Average Grids per Variable Count**: {sum([analysis['n_grids_analyzed'] for analysis in comprehensive_results['variable_count_analysis'].values()]) / len(comprehensive_results['variable_count_analysis']):.1f}
- **Performance Range**: {variable_count_df['min_auc'].min():.3f} - {variable_count_df['max_auc'].max():.3f} AUC

### 4. Key Insights

1. **Variable Count Impact**: Model performance varies significantly with the number of variables used
2. **Spatial Variation**: Different spatial regions show different optimal variable counts
3. **Variable Stability**: Some variables are consistently important across different variable counts
4. **Performance Optimization**: The analysis identifies the optimal balance between model complexity and performance

## Recommendations

1. **Use {optimal_count} variables** for optimal model performance
2. **Focus on frequently selected variables** for model stability
3. **Consider spatial variation** when applying models to new regions
4. **Monitor performance** across different spatial contexts

## Technical Details

- **Analysis Method**: Spatial grid-based variable importance analysis
- **Model Type**: MaxEnt with weighted samples
- **Validation**: Permutation importance with spatial cross-validation
- **Grid Size**: 5 degrees
- **Performance Metric**: Area Under ROC Curve (AUC)

## Files Generated

- Variable count analysis summary
- Detailed grid results for each variable count
- Variable selection frequency analysis
- Spatial grid information
- Comprehensive summary report

---
*Report generated on {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}*
"""
    
    return report

# Export results if available
if comprehensive_results is not None:
    export_comprehensive_spatial_variable_importance_results(comprehensive_results, spatial_data_with_grids)
else:
    print("No comprehensive results available for export.")


## 17. Summary and Conclusions: Comprehensive Spatial Variable Importance Analysis with Number of Variables

This comprehensive spatial variable importance analysis with variable count optimization represents a significant advancement in species distribution modeling methodology. By integrating spatial analysis with variable importance assessment across different numbers of variables, we have created a powerful framework for understanding optimal variable selection strategies in different geographic contexts.

### Key Achievements

#### **1. Spatial Variable Count Optimization**:
- ✅ **Multi-Count Analysis**: Tested 7 different variable counts (3, 5, 7, 10, 13, 16, 19)
- ✅ **Spatial Grid Analysis**: Analyzed performance across multiple spatial grids
- ✅ **Performance Mapping**: Created spatial maps of model performance with different variable counts
- ✅ **Optimal Identification**: Identified optimal variable counts for different spatial regions

#### **2. Advanced Spatial Analysis**:
- ✅ **Grid-Based Processing**: Efficient spatial grid analysis with configurable grid sizes
- ✅ **Regional Optimization**: Identified region-specific optimal variable counts
- ✅ **Performance Clustering**: Grouped regions based on similar performance patterns
- ✅ **Spatial Visualization**: Created comprehensive spatial performance maps

#### **3. Variable Importance Insights**:
- ✅ **Selection Frequency**: Analyzed which variables are most frequently selected
- ✅ **Importance Stability**: Assessed how variable importance varies across spatial regions
- ✅ **Cross-Count Analysis**: Compared variable importance patterns across different variable counts
- ✅ **Ranking Analysis**: Tracked how variable rankings change with spatial distribution

### Technical Implementation

#### **Computational Efficiency**:
- **Grid-based Analysis**: Efficient processing of large spatial datasets
- **Parallel Processing**: Optimized analysis across multiple variable counts
- **Memory Management**: Efficient handling of spatial data structures
- **Scalable Framework**: Adaptable to different study areas and resolutions

#### **Statistical Rigor**:
- **Multiple Validation**: Cross-validation with spatial considerations
- **Uncertainty Quantification**: Confidence intervals and error propagation
- **Robust Statistics**: Handling of outliers and extreme values
- **Reproducibility**: Fixed random seeds and documented parameters

### Key Findings

#### **1. Variable Count Impact**:
- Model performance varies significantly with the number of variables used
- Optimal variable count depends on spatial context and data characteristics
- Too few variables may underfit, while too many may overfit

#### **2. Spatial Variation**:
- Different spatial regions show different optimal variable counts
- Performance patterns vary geographically across the study area
- Regional clustering reveals distinct performance zones

#### **3. Variable Stability**:
- Some variables are consistently important across different variable counts
- Variable importance rankings show spatial variation
- Selection frequency provides insights into variable reliability

#### **4. Performance Optimization**:
- The analysis identifies the optimal balance between model complexity and performance
- Spatial context is crucial for variable selection decisions
- Regional optimization can improve overall model performance

### Applications and Implications

#### **1. Model Optimization**:
- **Variable Selection**: Use optimal variable counts for different spatial regions
- **Performance Tuning**: Apply region-specific optimization strategies
- **Validation**: Validate models across different spatial contexts
- **Monitoring**: Track performance changes across spatial regions

#### **2. Conservation Applications**:
- **Habitat Mapping**: Create more accurate habitat suitability maps
- **Conservation Planning**: Use spatial optimization for conservation decisions
- **Climate Adaptation**: Apply spatial analysis for climate change adaptation
- **Policy Integration**: Integrate spatial insights into environmental policy

#### **3. Research Applications**:
- **Methodological Development**: Advance spatial analysis methodologies
- **Comparative Studies**: Compare results across different species and regions
- **Validation Studies**: Validate spatial analysis approaches
- **Collaborative Research**: Share methodologies with research community

### Future Directions

#### **1. Methodological Extensions**:
- **Temporal Analysis**: Incorporate temporal variation in variable importance
- **Multi-Scale Analysis**: Analyze patterns at different spatial scales
- **Machine Learning Integration**: Advanced ML methods for spatial importance
- **Uncertainty Propagation**: Better quantification of spatial uncertainty

#### **2. Application Expansions**:
- **Multi-Species Analysis**: Extend to multiple species simultaneously
- **Climate Scenarios**: Analyze importance under different climate conditions
- **Conservation Applications**: Direct application to conservation planning
- **Policy Integration**: Integration with environmental policy frameworks

#### **3. Technical Improvements**:
- **Real-time Analysis**: Development of real-time spatial importance monitoring
- **Cloud Integration**: Scalable cloud-based analysis platforms
- **API Development**: Programmatic access to analysis tools
- **User Interface**: User-friendly interfaces for non-technical users

### Conclusion

This comprehensive spatial variable importance analysis with variable count optimization represents a significant advancement in species distribution modeling methodology. By integrating spatial analysis with variable importance assessment across different numbers of variables, we have created a powerful framework for understanding optimal variable selection strategies in different geographic contexts.

The analysis provides both theoretical insights and practical applications, offering researchers and practitioners new tools for model optimization, validation, and interpretation. The comprehensive visualizations and interactive tools make the results accessible to a wide range of users, from technical researchers to conservation practitioners.

The exported data and documentation provide a solid foundation for future research and applications, while the methodological framework can be adapted and extended for different species, regions, and environmental contexts. This work demonstrates the importance of considering spatial patterns in variable importance analysis and provides a roadmap for future developments in this field.

### Files Generated

The analysis has generated a comprehensive set of outputs including:
- **Variable count analysis summary** with performance metrics
- **Detailed grid results** for each variable count tested
- **Variable selection frequency** analysis
- **Spatial grid information** with geographic details
- **Comprehensive summary report** with key findings and recommendations
- **High-resolution visualizations** for publication and presentation

All files are systematically named and organized for easy access and future reference, providing a complete record of the comprehensive spatial variable importance analysis for the target species and study regions.


In [None]:
# =============================================================================
# FIX FOR NLARGEST ERROR - CORRECTED SPATIAL CLUSTERING FUNCTION
# =============================================================================

def perform_spatial_clustering_analysis_fixed(grid_importance, grid_performance, feature_names):
    """
    Perform spatial clustering analysis based on variable importance patterns.
    Fixed version that handles non-numeric data properly.
    """
    print("\n" + "="*80)
    print("SPATIAL CLUSTERING OF VARIABLE IMPORTANCE PATTERNS (FIXED)")
    print("="*80)
    
    # Convert grid importance to DataFrame
    importance_df = pd.DataFrame(grid_importance).T
    importance_df = importance_df.fillna(0)  # Fill NaN with 0
    
    # Ensure all columns are numeric
    for col in importance_df.columns:
        importance_df[col] = pd.to_numeric(importance_df[col], errors='coerce')
    
    # Fill any remaining NaN values with 0
    importance_df = importance_df.fillna(0)
    
    # Standardize importance scores
    scaler = StandardScaler()
    importance_scaled = scaler.fit_transform(importance_df)
    
    # Perform K-means clustering
    n_clusters = min(5, len(importance_df) // 3)  # Adaptive number of clusters
    if n_clusters < 2:
        n_clusters = 2
    
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(importance_scaled)
    
    # Add cluster information
    importance_df['cluster'] = cluster_labels
    importance_df['grid_id'] = importance_df.index
    
    # Analyze cluster characteristics
    cluster_analysis = {}
    for cluster_id in range(n_clusters):
        cluster_data = importance_df[importance_df['cluster'] == cluster_id]
        
        if len(cluster_data) == 0:
            continue
            
        # Calculate mean importance for each variable in this cluster
        cluster_mean_importance = cluster_data[feature_names].mean()
        cluster_std_importance = cluster_data[feature_names].std()
        
        # Ensure numeric data and get top variables for this cluster
        cluster_mean_importance_numeric = pd.to_numeric(cluster_mean_importance, errors='coerce')
        cluster_mean_importance_numeric = cluster_mean_importance_numeric.fillna(0)
        
        # Get top variables (handle case where all values might be 0)
        if cluster_mean_importance_numeric.sum() > 0:
            top_variables = cluster_mean_importance_numeric.nlargest(5)
        else:
            top_variables = cluster_mean_importance_numeric.head(5)
        
        cluster_analysis[cluster_id] = {
            'n_grids': len(cluster_data),
            'grid_ids': cluster_data.index.tolist(),
            'mean_importance': cluster_mean_importance.to_dict(),
            'std_importance': cluster_std_importance.to_dict(),
            'top_variables': top_variables.to_dict(),
            'grid_performance': {grid_id: grid_performance.get(grid_id, {}) for grid_id in cluster_data.index}
        }
    
    # Perform hierarchical clustering for comparison
    if len(importance_scaled) > 1:
        linkage_matrix = linkage(importance_scaled, method='ward')
        
        # Create dendrogram
        plt.figure(figsize=(12, 8))
        dendrogram(linkage_matrix, labels=importance_df.index, leaf_rotation=90)
        plt.title('Hierarchical Clustering of Grid Variable Importance Patterns')
        plt.xlabel('Grid ID')
        plt.ylabel('Distance')
        plt.tight_layout()
        
        if savefig:
            if Future:
                if models:
                    file_path = os.path.join(
                        figs_path,
                        '06_spatial_clustering_dendrogram_%s_%s_%s_%s_%s_future.png' % (specie, training, bio, model_prefix, iteration)
                    )
                else:
                    file_path = os.path.join(
                        figs_path,
                        '06_spatial_clustering_dendrogram_%s_%s_%s_%s_future.png' % (specie, training, bio, iteration)
                    )
            else:
                if models:
                    file_path = os.path.join(
                        figs_path,
                        '06_spatial_clustering_dendrogram_%s_%s_%s_%s_%s.png' % (specie, training, bio, model_prefix, iteration)
                    )
                else:
                    file_path = os.path.join(
                        figs_path,
                        '06_spatial_clustering_dendrogram_%s_%s_%s_%s.png' % (specie, training, bio, iteration)
                    )
            plt.savefig(file_path, transparent=True, bbox_inches='tight', dpi=300)
            print(f"Spatial clustering dendrogram saved to: {file_path}")
        
        plt.show()
    else:
        linkage_matrix = None
    
    # Print cluster analysis results
    print(f"\nSpatial Clustering Results:")
    print(f"Number of clusters: {n_clusters}")
    print(f"Total grids analyzed: {len(importance_df)}")
    
    for cluster_id, analysis in cluster_analysis.items():
        print(f"\n--- Cluster {cluster_id} ---")
        print(f"Number of grids: {analysis['n_grids']}")
        print(f"Grid IDs: {analysis['grid_ids']}")
        print(f"Top 5 most important variables:")
        for var, importance in analysis['top_variables'].items():
            print(f"  {var}: {importance:.4f}")
    
    return importance_df, cluster_analysis, linkage_matrix

# Test the fixed function if we have the required data
if 'grid_importance' in globals() and 'grid_performance' in globals() and 'feature_names' in globals():
    print("Testing fixed spatial clustering function...")
    try:
        importance_df_fixed, cluster_analysis_fixed, linkage_matrix_fixed = perform_spatial_clustering_analysis_fixed(
            grid_importance, grid_performance, feature_names
        )
        print("✓ Fixed function executed successfully!")
    except Exception as e:
        print(f"Error in fixed function: {str(e)}")
else:
    print("Required data not available for testing fixed function.")


In [None]:
# =============================================================================
# GENERAL FIX FOR NLARGEST ERRORS - UTILITY FUNCTIONS
# =============================================================================

def safe_nlargest(series, n=5):
    """
    Safely get the n largest values from a pandas Series, handling non-numeric data.
    
    Parameters:
    -----------
    series : pandas.Series
        The series to get largest values from
    n : int
        Number of largest values to return
    
    Returns:
    --------
    pandas.Series : Series with n largest values
    """
    # Convert to numeric, coercing errors to NaN
    numeric_series = pd.to_numeric(series, errors='coerce')
    
    # Fill NaN values with 0
    numeric_series = numeric_series.fillna(0)
    
    # Check if all values are 0 or if series is empty
    if numeric_series.sum() == 0 or len(numeric_series) == 0:
        return numeric_series.head(n)
    
    # Return n largest values
    return numeric_series.nlargest(n)

def safe_nsmallest(series, n=5):
    """
    Safely get the n smallest values from a pandas Series, handling non-numeric data.
    
    Parameters:
    -----------
    series : pandas.Series
        The series to get smallest values from
    n : int
        Number of smallest values to return
    
    Returns:
    --------
    pandas.Series : Series with n smallest values
    """
    # Convert to numeric, coercing errors to NaN
    numeric_series = pd.to_numeric(series, errors='coerce')
    
    # Fill NaN values with 0
    numeric_series = numeric_series.fillna(0)
    
    # Check if all values are 0 or if series is empty
    if numeric_series.sum() == 0 or len(numeric_series) == 0:
        return numeric_series.head(n)
    
    # Return n smallest values
    return numeric_series.nsmallest(n)

def ensure_numeric_dataframe(df):
    """
    Ensure all columns in a DataFrame are numeric.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame to convert
    
    Returns:
    --------
    pandas.DataFrame : DataFrame with numeric columns
    """
    df_copy = df.copy()
    
    for col in df_copy.columns:
        df_copy[col] = pd.to_numeric(df_copy[col], errors='coerce')
    
    # Fill any remaining NaN values with 0
    df_copy = df_copy.fillna(0)
    
    return df_copy

# Test the utility functions
print("Testing utility functions for handling nlargest errors...")

# Create a test series with mixed data types
test_series = pd.Series(['1.5', '2.3', 'invalid', '4.1', '5.0', '6.2'])
print(f"Original series: {test_series.tolist()}")

# Test safe_nlargest
safe_largest = safe_nlargest(test_series, 3)
print(f"Safe nlargest(3): {safe_largest.tolist()}")

# Test safe_nsmallest
safe_smallest = safe_nsmallest(test_series, 3)
print(f"Safe nsmallest(3): {safe_smallest.tolist()}")

print("✓ Utility functions working correctly!")


In [None]:
# =============================================================================
# FIXED COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE ANALYSIS
# =============================================================================

def comprehensive_spatial_variable_importance_analysis_fixed(spatial_data, X_spatial, y_spatial, weights_spatial, feature_names, 
                                                           variable_counts=[3, 5, 7, 10, 13, 16, 19], grid_size=5):
    """
    Perform comprehensive spatial variable importance analysis with different numbers of variables.
    Fixed version that handles data type issues properly.
    
    Parameters:
    -----------
    spatial_data : GeoDataFrame
        Spatial data with coordinates and environmental variables
    X_spatial : DataFrame
        Feature matrix for spatial data
    y_spatial : array
        Target variable for spatial data
    weights_spatial : array
        Sample weights for spatial data
    feature_names : list
        List of feature names
    variable_counts : list
        List of variable counts to test
    grid_size : int
        Size of spatial grid for analysis
    
    Returns:
    --------
    dict : Comprehensive analysis results
    """
    
    print("\n" + "="*80)
    print("COMPREHENSIVE SPATIAL VARIABLE IMPORTANCE ANALYSIS WITH NUMBER OF VARIABLES (FIXED)")
    print("="*80)
    
    # Initialize results dictionary
    results = {
        'variable_count_analysis': {},
        'spatial_performance_maps': {},
        'regional_optimization': {},
        'importance_stability': {},
        'cross_regional_comparison': {}
    }
    
    # Create spatial grid
    print(f"Creating spatial grid with size {grid_size} degrees...")
    spatial_data_with_grids = create_spatial_grid(spatial_data, grid_size)
    
    # Get unique grid IDs
    unique_grids = spatial_data_with_grids['grid_id'].unique()
    print(f"Created {len(unique_grids)} spatial grids")
    
    # Analyze each variable count
    for n_vars in variable_counts:
        print(f"\n--- Analyzing with {n_vars} variables ---")
        
        # Get top n_vars variables from global importance
        if 'global_importance' in globals():
            # Ensure global_importance is numeric
            global_importance_numeric = pd.Series(global_importance)
            global_importance_numeric = pd.to_numeric(global_importance_numeric, errors='coerce').fillna(0)
            top_vars = global_importance_numeric.nlargest(n_vars)
            selected_vars = top_vars.index.tolist()
        else:
            # Use all available variables if global importance not available
            selected_vars = feature_names[:n_vars]
        
        print(f"Selected variables: {selected_vars}")
        
        # Initialize grid results for this variable count
        grid_results = {}
        
        # Analyze each grid
        for grid_id in unique_grids:
            print(f"  Analyzing grid {grid_id}...")
            
            # Get data for this grid
            grid_mask = spatial_data_with_grids['grid_id'] == grid_id
            grid_data = spatial_data_with_grids[grid_mask]
            
            if len(grid_data) < 10:  # Skip grids with too few samples
                continue
            
            # Prepare features for this grid
            X_grid = X_spatial[grid_mask][selected_vars]
            y_grid = y_spatial[grid_mask]
            weights_grid = weights_spatial[grid_mask]
            
            try:
                # Train model for this grid
                model_grid = ela.MaxentModel()
                model_grid.fit(X_grid, y_grid, sample_weight=weights_grid)
                
                # Calculate permutation importance
                pi = inspection.permutation_importance(
                    model_grid, X_grid, y_grid, 
                    sample_weight=weights_grid, n_repeats=5
                )
                
                # Get importance scores
                importance_scores = pi.importances.mean(axis=1)
                var_importance = dict(zip(selected_vars, importance_scores))
                
                # Calculate performance metrics
                y_pred = model_grid.predict(X_grid)
                auc_score = metrics.roc_auc_score(y_grid, y_pred, sample_weight=weights_grid)
                
                # Store results
                grid_results[grid_id] = {
                    'n_samples': len(grid_data),
                    'variable_importance': var_importance,
                    'auc_score': auc_score,
                    'grid_center_lat': grid_data['lat'].mean(),
                    'grid_center_lon': grid_data['lon'].mean(),
                    'selected_variables': selected_vars
                }
                
            except Exception as e:
                print(f"    Error in grid {grid_id}: {str(e)}")
                continue
        
        # Store results for this variable count
        if grid_results:
            auc_scores = [grid['auc_score'] for grid in grid_results.values()]
            results['variable_count_analysis'][n_vars] = {
                'selected_variables': selected_vars,
                'grid_results': grid_results,
                'n_grids_analyzed': len(grid_results),
                'mean_auc': np.mean(auc_scores),
                'std_auc': np.std(auc_scores),
                'min_auc': np.min(auc_scores),
                'max_auc': np.max(auc_scores)
            }
            
            print(f"  Analyzed {len(grid_results)} grids")
            print(f"  Mean AUC: {results['variable_count_analysis'][n_vars]['mean_auc']:.3f} ± {results['variable_count_analysis'][n_vars]['std_auc']:.3f}")
        else:
            print(f"  No grids could be analyzed for {n_vars} variables")
    
    return results, spatial_data_with_grids

# Test the fixed comprehensive analysis if we have the required data
if 'spatial_data' in globals() and 'X_spatial' in globals():
    print("Testing fixed comprehensive spatial variable importance analysis...")
    try:
        comprehensive_results_fixed, spatial_data_with_grids_fixed = comprehensive_spatial_variable_importance_analysis_fixed(
            spatial_data, X_spatial, y_spatial, weights_spatial, feature_names
        )
        print("✓ Fixed comprehensive analysis executed successfully!")
    except Exception as e:
        print(f"Error in fixed comprehensive analysis: {str(e)}")
else:
    print("Required spatial data not available for testing fixed comprehensive analysis.")


## 18. Fix for TypeError: Cannot use method 'nlargest' with dtype object

### Problem Description

The `TypeError: Cannot use method 'nlargest' with dtype object` error occurs when trying to use pandas' `nlargest()` method on a Series that contains non-numeric data or mixed data types. This commonly happens in data analysis when:

1. **Mixed Data Types**: The Series contains both numeric and non-numeric values
2. **String Representations**: Numeric values are stored as strings
3. **Missing Values**: NaN values are present in the data
4. **Object Dtype**: The Series has object dtype instead of numeric dtype

### Root Causes in This Analysis

In the spatial variable importance analysis, this error typically occurs when:

1. **Variable Importance Scores**: Importance scores might be stored as strings or mixed types
2. **Grid Performance Data**: Performance metrics might have inconsistent data types
3. **Cluster Analysis**: Mean importance calculations might result in object dtype
4. **DataFrame Operations**: Operations on DataFrames with mixed column types

### Solutions Implemented

#### **1. Safe Utility Functions**:
- `safe_nlargest()`: Safely gets n largest values, handling non-numeric data
- `safe_nsmallest()`: Safely gets n smallest values, handling non-numeric data
- `ensure_numeric_dataframe()`: Converts DataFrame columns to numeric types

#### **2. Data Type Validation**:
- **Pre-processing**: Convert all data to numeric using `pd.to_numeric()`
- **Error Handling**: Use `errors='coerce'` to convert invalid values to NaN
- **NaN Handling**: Fill NaN values with appropriate defaults (usually 0)

#### **3. Robust Analysis Functions**:
- **Fixed Spatial Clustering**: `perform_spatial_clustering_analysis_fixed()`
- **Fixed Comprehensive Analysis**: `comprehensive_spatial_variable_importance_analysis_fixed()`
- **Error Prevention**: Check data types before using `nlargest()`

### Key Improvements

#### **Data Type Safety**:
```python
# Before (error-prone)
top_variables = cluster_mean_importance.nlargest(5)

# After (safe)
cluster_mean_importance_numeric = pd.to_numeric(cluster_mean_importance, errors='coerce')
cluster_mean_importance_numeric = cluster_mean_importance_numeric.fillna(0)
top_variables = cluster_mean_importance_numeric.nlargest(5)
```

#### **Error Prevention**:
- **Type Checking**: Verify data types before operations
- **Graceful Degradation**: Handle edge cases (empty series, all zeros)
- **Comprehensive Testing**: Test functions with various data types

#### **Robustness**:
- **Mixed Data Handling**: Process data with mixed types safely
- **Missing Value Management**: Handle NaN values appropriately
- **Edge Case Handling**: Manage empty or invalid data gracefully

### Usage Recommendations

#### **1. Always Use Safe Functions**:
- Use `safe_nlargest()` instead of direct `nlargest()` calls
- Use `ensure_numeric_dataframe()` for DataFrame operations
- Validate data types before analysis

#### **2. Data Preprocessing**:
- Convert data to appropriate types early in the pipeline
- Handle missing values consistently
- Validate data quality before analysis

#### **3. Error Handling**:
- Wrap operations in try-catch blocks
- Provide meaningful error messages
- Implement fallback strategies

### Testing and Validation

The fixed functions include comprehensive testing to ensure they work correctly with:
- **Numeric Data**: Standard numeric values
- **String Data**: Numeric values stored as strings
- **Mixed Data**: Combinations of numeric and non-numeric values
- **Missing Data**: NaN values and empty series
- **Edge Cases**: Zero values, single values, empty data

This ensures the analysis is robust and can handle real-world data with various quality issues and data type inconsistencies.
