## Gridded Datasets

* ~~%tree, non-tree, non-veg (`MODIS/006/MOD44B`) from MODIS instead of LULC~~
* ~~Redo reprojection of MODIS datasets with nodata flag set so NaNs aren't propagated~~
* ~~**Need to extract pixel values at finer (500m - 1km) spatial resolution for the MODIS datasets - fragmented landscapes are causing havoc with the predictions since a forest surrounded by pasture/crop or vice versa throws off the 5 km re-scaled data e.g. Samford site, Boyagin**~~
* **Consider creating PFT specific models: grass, crops, shrubs, Tree, (tropical trees?)**
* ~~MODIS months since burn date~~
* GPM-IMERG instead of CHIRPS: 6-month and 12-month cumulative precipitation, Cumulative Water Deficit
* ~~Fill NaNs in fire disturbance with 120 values...or not?~~
* ~~Extract GOSIF GPP and MODIS GPP for comparisons~~
* Digital Elevation Model
* Coefficient of variation for rainfall (std.dev of annual P / Mean Annual P) (Page et al 2022 - memory effects)
* Extract MODIS surface reflectance bands:
    * ~~Add NDWI as a general proxy for soil moisture~~
    * Use NIRV instead of EVI
        * NIRV = (NDVI - 0.08) × NIR
* ~~Do my own Aridity Index instead of CGIAR's~~
* Update TerraClimate 'Moisture Index' with one derived from AWRA+CHIRPS (GPM-IMERG) data
* The last 4 months of TerraClimate 'CWD' are dodgy (`CWD_5km_monthly_2002_2021.nc`)
* Synoptic CO2 as predictor?
* Soil grids: upper horizon organic carbon and clay: `/g/data/fj4/Soil_Landscape_grid/TERNLandscape_90m`
* ~~Topographic wetness index https://portal.tern.org.au/topographic-wetness-index-dem-h/17355~~
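
Two of the derived predictors in the list above, NIRV and the rainfall coefficient of variation, are simple enough to sketch in plain Python. This is only an illustration of the arithmetic: the function names and input values are hypothetical, the reflectances are assumed to be on a 0-1 scale, and the choice of the sample standard deviation is an assumption (the note above only says "std.dev").

```python
from statistics import mean, stdev


def ndvi(nir, red):
    """Normalised difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def nirv(nir, red, offset=0.08):
    """NIRV as written in the list above: (NDVI - 0.08) x NIR."""
    return (ndvi(nir, red) - offset) * nir


def rainfall_cv(annual_totals):
    """Coefficient of variation: std. dev. of annual P / mean annual P.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    return stdev(annual_totals) / mean(annual_totals)


# Hypothetical reflectances (0-1 scale) and annual rainfall totals (mm).
print(nirv(nir=0.4, red=0.1))
print(rainfall_cv([400.0, 600.0, 500.0]))
```

On the real gridded data the same expressions would be applied per pixel, and the annual totals would first need to be summed from the monthly rainfall.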



## Modelling

* ~~Make an ET model as well~~
* ~~Implement spatial Cross-validation to prevent test points coming from the same EC tower as training points.~~
* Consider balancing landcover types in the EC data, i.e. using less data from Trees
* Consider implementing time-series splits for k-fold cross validation https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

If using categorical data like LULC:
* One-hot encoding is an option, but it may be a poor way of handling categorical variables in this case:
    * https://notebook.community/roaminsight/roamresearch/BlogPosts/Categorical_variables_in_tree_models/categorical_variables_post
    * https://notebook.community/marcotcr/lime/doc/notebooks/Tutorial_H2O_continuous_and_cat
* `LightGBM` handles categorical variables well: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features
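
To make the difference between the two encodings concrete, here is a small pure-Python sketch. One-hot encoding spreads a single LULC variable across one sparse column per class, while integer codes keep it in one column that a library with native categorical support could be told to treat as categorical. The class labels and function names are hypothetical, and the integer codes carry no ordering, so they are only useful if the model is told the column is categorical.

```python
def one_hot(labels):
    """Return (categories, rows) where each row is a 0/1 indicator vector."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows


def integer_codes(labels):
    """Return (categories, codes): a single integer column, one code per class."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    return categories, [index[label] for label in labels]


lulc = ["tree", "grass", "crop", "tree", "shrub"]  # hypothetical LULC classes
cats, encoded = one_hot(lulc)
_, codes = integer_codes(lulc)
print(cats)
print(encoded[0], codes)
```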

* Compare various CV techniques
    * ~~Random splits~~
    * ~~leave-one-group-out (spatial)~~
    * time-series-splits
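
The difference between random splits and leave-one-group-out can be sketched without any libraries. The pure-Python generators below mimic the idea of a shuffled k-fold and of holding out one tower at a time, then report which towers appear on both sides of each split. The site labels and function names are hypothetical; in practice scikit-learn's `LeaveOneGroupOut` provides the grouped version.

```python
import random


def random_kfold(n_rows, n_splits, seed=0):
    """Yield (train_idx, test_idx) from a shuffled k-fold split."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    for k in range(n_splits):
        test = sorted(idx[k::n_splits])
        held_out = set(test)
        train = [i for i in range(n_rows) if i not in held_out]
        yield train, test


def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out one group at a time."""
    for g in sorted(set(groups)):
        test = [i for i, s in enumerate(groups) if s == g]
        train = [i for i, s in enumerate(groups) if s != g]
        yield g, train, test


def shared_groups(groups, train, test):
    """Groups that appear on both sides of a split (potential leakage)."""
    return {groups[i] for i in train} & {groups[i] for i in test}


# Hypothetical tower labels: one entry per monthly observation.
sites = ["A"] * 6 + ["B"] * 6 + ["C"] * 6

for train, test in random_kfold(len(sites), n_splits=3):
    print("random:", sorted(shared_groups(sites, train, test)))
for site, train, test in leave_one_group_out(sites):
    print("held out", site, "->", sorted(shared_groups(sites, train, test)))
```

A random split will usually place the same tower in both train and test, whereas the grouped split never does by construction.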

## Plotting/Analysis

* ~~Make a map where the pixels are color coded to show which month each pixel has the largest carbon sink~~

## Reading

* ~~EC tower data QC papers by Peter Isaac~~
* ~~Cleverly's papers examining climate drivers of C fluxes~~
* ~~Biogeosciences special issues on OzFlux: https://bg.copernicus.org/articles/special_issue618.html~~
* ~~Li, X., Xiao, J. (2019) Mapping photosynthesis solely from solar-induced chlorophyll fluorescence: A global, fine-resolution dataset of gross primary production derived from OCO-2. Remote Sensing, 11(21), 2563; https://doi.org/10.3390/rs11212563.~~
* Drought rapidly diminishes the large net CO2 uptake in 2011 over semi-arid Australia
* Terrestrial carbon cycle model-data fusion: Progress and challenges
* Measuring fluxes of trace gases and energy between ecosystems and the atmosphere–the state and future of the eddy covariance method
* ~~An introduction to the Australian and New Zealand flux tower network – OzFlux~~
* Fire in Australian savannas: from leaf to landscape
* Multiple observation types reduce uncertainty in Australia’s terrestrial carbon and water cycles
* Dryland vegetation response to wet episode, not inherent shift in sensitivity to rainfall, behind Australia’s role in 2011 global carbon sink anomaly
* Learning from imbalanced data: open challenges and future directions

## Writing

* Literature review for chapter 1, upscaling EC Tower data
* Literature review for chapter 2, causal inference on time-series

## Questions

1. Need guidance on the best approach for cross-validation:
    * Time-series splits are difficult because the training data consist of 29 separate time-series.
    * Nested, random k-fold splits will include some data leakage, but less than would be typical because there are 29 separate time-series. Expect some over-estimation of test scores.
    * Spatial k-fold will exclude whole time-series from training and test on those. This guarantees no data leakage, but test scores will most likely be underestimated, and hyperparameters may then be biased towards the sites that are easier to predict.

2. Sometimes GPP in the flux tower data is negative...what is the best approach to dealing with this? Set to zero? If setting to zero, then should the other fluxes be modified?
3. Imbalance in the training data.
    * Should I consider better balancing the proportions of woody vs grassy ecosystems in the training data? For example by randomly sampling fewer years of data from the woody sites?
    * imbalance in number of large vs small fluxes? Oversample large fluxes?
    * Algorithmic: [SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
4. How to include 'synoptic' CO2 measurements in the modelling/predictions?
5. Do you think it's possible to validate these estimates of NEE using site-level NEE from GEM sites?
6. Months since burnt, only for Trees? What should the starting months be? Or should it be binary: 1 if burnt at any time in the last 10 years, 0 if not?
7. What is Ozwald GPP?
8. Have the EC sites been used to study the 2019 drought?
9. Should I implement an urban model? Suggested in Reitz et al. 2021
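
On question 1, one possible way around the "29 separate time-series" problem is to apply an expanding-window split inside each site and pool the resulting indices. The sketch below is pure Python and only illustrates the indexing. It assumes rows are already sorted by time within each site, the function and variable names are hypothetical, and any trailing remainder of a site's record is simply dropped.

```python
def per_site_time_splits(site_ids, n_splits=3):
    """Yield (train_idx, test_idx) with an expanding window inside each site.

    Assumption: rows are sorted by time within each site.
    """
    by_site = {}
    for i, s in enumerate(site_ids):
        by_site.setdefault(s, []).append(i)

    for k in range(1, n_splits + 1):
        train, test = [], []
        for rows in by_site.values():
            block = len(rows) // (n_splits + 1)
            if block == 0:
                continue  # record too short to split; skip this site
            train.extend(rows[: k * block])                  # everything earlier
            test.extend(rows[k * block : (k + 1) * block])   # the next block
        yield train, test


# Hypothetical example: site A has 8 time steps, site B has 4.
sites = ["A"] * 8 + ["B"] * 4
splits = list(per_site_time_splits(sites, n_splits=3))
for train, test in splits:
    print(train, test)
```

Within every site the test rows always come after the training rows, but every site still appears in both train and test, so this addresses temporal leakage only and not the spatial hold-out discussed above.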

virtual env path: `/g/data/os22/chad_tmp/NEE_modelling/env/nee`

`https://dap.tern.org.au/thredds/catalog.html`