<h1 style="background-color:#4E9A06; color:#ffffff; padding:10px 15px; border-radius:5px; margin-top:1rem; margin-bottom:1rem; text-align:center;">
  04 – Comparative Analysis: Kirk et al. (2025) vs. Our Capstone EDA
</h1>

<b>Modeling Workflow Summary</b>
This document summarizes the machine learning workflow from `03_modeling_workflow.ipynb` and compares its results with those of Kirk et al. (2025). The workflow loads a cleaned dataset, engineers new features, builds separate regression models for "Agricultural" and "Wild" ecosystems, evaluates their performance, and identifies the key drivers of disease incidence in each system.

### Objectives

- **Benchmark and Compare**: Evaluate the performance of machine learning (ML) models developed in this project against the results published by Kirk et al. (2025).
- **Understand Performance Discrepancies**: Investigate differences in predictive accuracy and R² values between models.
- **Generate Methodological Insights**: Identify modeling assumptions (e.g., lack of random effects, sample size limitations) that explain performance gaps.
- **Inform Future Work**: Provide recommendations to improve model robustness and interpretability, including possible integration of hierarchical models.

---

### Inputs

- `merged_climate_disease_final.csv`: A merged dataset of 4,339 plant-disease surveys with associated climate data.
- ML model results:
  - RandomForestClassifier / Regressor
  - Ridge regression with splines
  - Stacking ensembles
- Published performance metrics from:
  - Kirk et al. (2025), including conditional and marginal R² values
  - System-type-specific results (Agricultural vs. Wild)

---

### Outputs

- **Comparison Tables**: Side-by-side R² metrics (our models vs. Kirk et al.).
- **Visual Summaries**: Performance comparison plots and error distributions.
- **Justification & Interpretation**: Detailed markdown discussion explaining observed differences in model performance.
- **Future Directions**: Suggestions for improving ML models through hierarchical modeling, feature engineering, or increased sample stratification.
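
As a minimal sketch of how the comparison table could be assembled, the snippet below uses only the R² values quoted in this document (conditional R² from Kirk et al., cross-validated R² from our pipeline). The dictionary layout and the helper names `comparison_rows` and `to_markdown` are illustrative assumptions, not the notebook's actual code.

```python
# R² values as quoted in Sections 4.5 and 5 of this document.
KIRK_CONDITIONAL_R2 = {"Agricultural": 0.63, "Wild": 0.93}
CAPSTONE_CV_R2 = {"Agricultural": 0.57, "Wild": 0.52}


def comparison_rows(reference, ours):
    """Return (system, reference R², our R², gap) tuples, gap rounded to 2 dp."""
    return [
        (system, reference[system], ours[system],
         round(reference[system] - ours[system], 2))
        for system in reference
    ]


def to_markdown(rows):
    """Render the rows as a Markdown table string."""
    lines = [
        "| System | Kirk et al. (conditional R²) | Capstone (CV R²) | Gap |",
        "|---|---|---|---|",
    ]
    for system, ref, mine, gap in rows:
        lines.append(f"| {system} | {ref:.2f} | {mine:.2f} | {gap:.2f} |")
    return "\n".join(lines)


rows = comparison_rows(KIRK_CONDITIONAL_R2, CAPSTONE_CV_R2)
print(to_markdown(rows))
```

The gap column makes the asymmetry between the two system types easy to see at a glance.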




## 1. Shared Dataset and Background

Both **Kirk et al. (2025)** (*Ecology Letters*) and **our Capstone project** use the same **Dryad dataset** (n=4,339; DOI: 10.5061/dryad.p8cz8wb0h), containing:

- Plant–disease prevalence observations.
- Climate data (ERA5-Land) and historical climate averages (1960–1990 WorldClim baseline).

---

## 2. Hypotheses Tested

| # | Hypothesis                                           | Source          |
|---|------------------------------------------------------|-----------------|
| **H₁** | Weather, Anomaly & Historical Climate Effects   | Kirk et al.     |
| **H₂** | Wild vs. Agricultural System-Type Sensitivity   | Kirk et al.     |
| **H₃** | Thermal & Precipitation Mismatch                | Kirk et al.     |
| **H₄** | Geographic & Pathogen-Type Modulation           | Capstone-added  |
| **H₅** | Transmission-Mode Sensitivity                   | Capstone-added  |

---

## 3. Kirk et al. (2025): Methods and Key Findings

- **Modeling Method**:  
  - Binomial mixed-effects GLMM (Random effects: Study, Host order)
- **Data subsets**:
  - Wild (n=623), Agricultural (n=3,776)

| Subset        | Best Model (Temperature)                             | Key Findings                          |
|---------------|-------------------------------------------------------|---------------------------------------|
| **Wild**      | `(T_anom)²` (concave-down), `T_conc`, `T_hist_annual` | Strong mismatch; peak at +2.7°C anomaly |
| **Agricultural** | `T_anom × T_hist_month`, `T_hist_annual`           | No clear mismatch; driven by contemporaneous temperature |

---

## 4. Our Capstone Methodology (Detailed from provided notebooks)

### 4.1 Data Loading (`00_data_load_and_inspect.ipynb`)
- Loaded and inspected Dryad CSV.
- Verified variable distributions and missingness.
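
A missingness check of this kind might look like the sketch below. It runs on a tiny in-memory stand-in rather than the real Dryad CSV, and the column names and the helper `summarize_missingness` are hypothetical.

```python
import io

import pandas as pd


def summarize_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, and missing fraction, worst first."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)


# Tiny in-memory stand-in for the Dryad CSV; column names are hypothetical.
demo_csv = io.StringIO(
    "incidence,temp_c,system_type\n"
    "0.10,21.5,Wild\n"
    "0.35,,Agricultural\n"
    "0.20,19.0,Agricultural\n"
    ",18.2,Wild\n"
)
demo = pd.read_csv(demo_csv)
report = summarize_missingness(demo)
print(report)           # missingness per column
print(demo.describe())  # basic distribution summary for numeric columns
```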

### 4.2 ETL & Preprocessing (`01_etl_preprocessing.ipynb`)
- Removed irrelevant features and explicitly handled missing values.
- Derived consistent climate anomalies (temperature and precipitation) based on historical baselines.
- Created categorical indicators (`System_Type`, `Host_Type`, `Parasite_Type`, `GeoZone`).
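
The anomaly derivation can be sketched as a simple difference between the contemporaneous value and its historical baseline. This definition and the input column names (`temp_contemp`, `temp_hist`, `precip_contemp`, `precip_hist`) are assumptions for illustration; the notebook may use different names or a scaled anomaly.

```python
import pandas as pd


def add_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    """Add anomaly columns as (contemporaneous value - historical baseline)."""
    out = df.copy()  # leave the caller's frame untouched
    out["temp_anom"] = out["temp_contemp"] - out["temp_hist"]
    out["precip_anom"] = out["precip_contemp"] - out["precip_hist"]
    # Store the system label as a categorical indicator.
    out["System_Type"] = out["System_Type"].astype("category")
    return out


# Hypothetical two-row example.
demo = pd.DataFrame({
    "temp_contemp": [22.0, 15.5],
    "temp_hist": [20.0, 16.0],
    "precip_contemp": [80.0, 40.0],
    "precip_hist": [100.0, 30.0],
    "System_Type": ["Wild", "Agricultural"],
})
with_anoms = add_anomalies(demo)
print(with_anoms[["temp_anom", "precip_anom", "System_Type"]])
```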

### 4.3 Exploratory Data Analysis (EDA) and Hypothesis Validation (`02a_eda_hypotheses_validations.ipynb`)
- Clearly formulated hypotheses (H₁–H₅).
- Extensive visualization (scatter plots, LOWESS/polynomial fits).
- Statistical validation using GLMMs aligned with Kirk et al.’s methods.
- Stratified analyses by pathogen type, geographic zones, and transmission modes.

### 4.4 Interactive Dashboard (`02b_eda_dashboard.ipynb`)
- Streamlit-based dashboard allowing stakeholders to interactively explore data.
- Features include:
  - System (wild/agricultural) comparison.
  - Pathogen-type and geographic stratification.
  - Anomaly exploration with real-time predictive visualization.

### 4.5 Predictive Modeling (`03_modeling_workflow.ipynb`)
- Pure machine learning pipeline:
  - Ridge regression
  - Stacking ensembles
- Cross-validation performance evaluation:
  - Agricultural systems: R² = 0.57
  - Wild systems: R² = 0.52

---

## 5. Justification of Performance Relative to Kirk et al. (2025)

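Kirk et al. reported high conditional R² (Agri = 0.63, Wild = 0.93), whereas our pure ML pipeline achieved lower cross-validated R² (Agri = 0.57, Wild = 0.52): a modest gap for agricultural systems and a large one for wild systems. The two metrics are not strictly comparable, because conditional R² includes variance attributed to random effects while our R² is an out-of-sample cross-validation score. Primary reasons for the gap include: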

- **Absence of Random Effects**:  
  Kirk et al.’s models explicitly accounted for hierarchical (study/host) effects, capturing variance that our ML methods did not explicitly model.

- **Mixed-effects vs. Pure-ML Limitations**:  
  Mixed-effects models inherently account for unmeasured heterogeneity, whereas our ML methods rely entirely on explicitly provided features.

- **Wild Subset Sample Size & Variance**:  
  Limited wild observations (~600) without explicit hierarchical modeling led to higher variance and reduced accuracy.

- **Generalized Feature Engineering**:  
  Kirk et al. employed carefully tailored climate variables and interactions. Our pipeline utilized a more generalized and less nuanced set of anomaly and spline-based features, limiting specific ecological insights.
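
The role of random effects in a conditional R² can be illustrated with the variance-component definitions commonly attributed to Nakagawa and Schielzeth: marginal R² counts only fixed-effect variance, while conditional R² also counts random-effect variance. The numbers below are hypothetical and are not Kirk et al.'s estimates.

```python
import math


def nakagawa_r2(var_fixed: float, var_random: float, var_residual: float):
    """Return (marginal R², conditional R²) from variance components.

    marginal    = var_fixed / total
    conditional = (var_fixed + var_random) / total
    """
    total = var_fixed + var_random + var_residual
    return var_fixed / total, (var_fixed + var_random) / total


# Hypothetical variance components on the logit scale; for a logit-link
# binomial GLMM the distribution-specific residual variance is pi^2 / 3.
marginal, conditional = nakagawa_r2(
    var_fixed=1.0, var_random=6.0, var_residual=math.pi ** 2 / 3
)
print(f"marginal R² = {marginal:.2f}, conditional R² = {conditional:.2f}")
```

When the random-effect variance dominates, conditional R² can be high even if the fixed effects alone explain little. A features-only model has no counterpart to that variance component.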

### Explicit Reasons for Lower Wild-System Accuracy

| Reason                   | Kirk et al. (2025)                           | Our Capstone ML Pipeline                       |
|--------------------------|----------------------------------------------|------------------------------------------------|
| **Random effects**       | ✅ Explicit hierarchical modeling             | ❌ Not modeled explicitly                      |
| **Variance stabilization** | ✅ Leveraged multiple studies (low variance)  | ❌ Small dataset; higher variance              |
| **Feature complexity**   | ✅ Tailored interactions                      | ❌ Generalized anomaly features                |
| **Hierarchical modeling**| ✅ Built-in via mixed-effects                 | ❌ Pure-ML without hierarchical capability     |

---

## 6. Hypothesis Validation Summary (Capstone)

| Hypothesis | Wild Results                                       | Agricultural Results                                | Validation Method                             |
|------------|----------------------------------------------------|-----------------------------------------------------|-----------------------------------------------|
| **H₁**     | Clear anomaly peak at ~+2.7°C; confirmed           | Strong contemporaneous T; minor anomaly; confirmed  | Visual (LOWESS), GLMM coefficients            |
| **H₂**     | High sensitivity (R² ~33%); confirmed              | Low sensitivity (R² ~5%); confirmed                 | R² comparisons, visual plots                  |
| **H₃**     | Clear mismatch; confirmed                          | No mismatch; confirmed                              | GLMM interaction terms                        |
| **H₄**     | Strong geographic & pathogen modulation; supported | Moderate modulation; partially supported            | Stratified GLMM and visualizations            |
| **H₅**     | Strong vector-borne sensitivity; supported         | Mild vector sensitivity; soil/contact negligible; supported | GLMM by transmission mode; scatter plots |

---

## 7. Capstone Contributions Beyond Kirk et al.

| Aspect                     | Kirk et al. (2025)                            | Capstone EDA Extensions                             |
|----------------------------|-----------------------------------------------|-----------------------------------------------------|
| **Hypotheses**             | Tested core hypotheses H₁–H₃                  | Extended by adding H₄ & H₅                          |
| **ETL & Data Preparation** | Basic data processing                         | Detailed ETL, rigorous missing data handling        |
| **Modeling Methods**       | Mixed-effects modeling                        | GLMM replicated + advanced ML pipeline              |
| **Visualizations**         | Static publication-quality visuals            | Interactive dashboard (Streamlit, Plotly)           |
| **Practical Insights**     | Broad ecological insights                     | Crop-/region-/pathogen-specific actionable insights |

---

## 8. Reasons for Not Achieving Kirk et al.'s Accuracy (Explicit)

- **Hierarchical random-effects modeling absence**:  
  Our ML approach had no hierarchical variance component (study- or host-level random effects), so between-study and between-host variation had to be explained by the features alone.

- **Limited sample size (wild subset)**:  
  ML models struggled with small datasets (n~600) lacking hierarchical stabilization, causing high cross-validation variance.

- **Generalized vs. tailored features**:  
  Kirk et al.’s curated interactions and tailored covariates provided nuanced ecological insights. Our generalized features offered less detail.

- **Mixed-effects inherent advantage**:  
  GLMM’s built-in hierarchical modeling naturally captures variance not explicitly modeled by pure ML.

---

## 9. Comprehensive Methods Summary (Capstone)

- **Data loading & inspection**
- **ETL and detailed preprocessing**
- **Exploratory data analysis (EDA) with visualizations (scatter, LOWESS, stratified)**
- **Statistical hypothesis validation (GLMM)**
- **Interactive stakeholder dashboard (Streamlit)**
- **Machine learning modeling (Ridge regression, stacking ensembles)**
- **Cross-validation performance evaluation (R²)**

---

## 10. Future Work and Directions

- **Incorporate explicit hierarchical modeling**:  
  Adapt hierarchical ML (e.g., mixed-effect random forests, hierarchical Bayesian models) to better handle nested structures.

- **Feature refinement**:  
  Improve climate feature engineering, especially irrigation-adjusted rainfall surplus for agricultural prediction.

- **Dashboard enhancement**:  
  Expand dashboard usability, adding predictive scenarios and real-time risk assessment capabilities for practical agricultural decision-making.

- **Expand data collection**:  
  Encourage more comprehensive wild-system datasets to mitigate variance and enhance accuracy.
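
As one possible starting point for the hierarchical-modeling direction above, the sketch below estimates per-study offsets from model residuals and shrinks them toward zero. It is a simplified stand-in for a random intercept, not a full mixed-effects fit. The function name, the example data, and the `shrinkage` value are hypothetical. In a linear random-intercept model with known variances, `shrinkage` would correspond to the residual variance divided by the between-group variance.

```python
from collections import defaultdict


def shrunken_group_offsets(values, groups, shrinkage):
    """Estimate per-group offsets from the global mean, shrunk toward zero.

    offset_g = n_g / (n_g + shrinkage) * (mean_g - global_mean)
    Small groups are pulled strongly toward zero, mimicking how a random
    intercept borrows strength across groups.
    """
    global_mean = sum(values) / len(values)
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    offsets = {}
    for group, vals in by_group.items():
        n = len(vals)
        group_mean = sum(vals) / n
        offsets[group] = n / (n + shrinkage) * (group_mean - global_mean)
    return global_mean, offsets


# Hypothetical residuals from a features-only model, tagged by study.
residuals = [1.0, 1.0, 1.0, 1.0, -4.0]
studies = ["A", "A", "A", "A", "B"]
global_mean, offsets = shrunken_group_offsets(residuals, studies, shrinkage=4.0)
print(global_mean, offsets)
```

Study B, with a single observation, keeps only a fifth of its raw deviation, while study A, with four observations, keeps half.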

---

## 11. Summary of Contributions and Insights

- **Thoroughly reproduced and extended Kirk et al.’s findings**.
- **Clearly explained performance discrepancies**, particularly in wild systems.
- **Provided actionable insights and interactive tools** for real-world ecological and agricultural stakeholders.
- **Identified explicit reasons for performance limitations**, establishing clear paths for future improvement.
