# Data Preparation Overview

In this tutorial we will extract all the data we need to run both the uncalibrated SWIM-RS model and to calibrate and run a parameterized model. The model uses a daily soil water balance, and as such, it needs a daily estimate of meteorological drivers and some *a priori* information on the soils in our sample locations. It also needs an estimate of the state of the vegetation on the surface, for which we use Landsat-based NDVI. With this information, we will be able to run SWIM-RS to estimate the daily ET, soil water storage, recharge, runoff, and simulated irrigation.

To calibrate the model so it behaves more realistically, we use parameter inversion software (PEST++, in this case) alongside target data that provide *somewhat* independent estimates of ET and snow on the ground. Once the model is calibrated for a sample plot, these data are no longer needed, so the calibrated model can be run for periods before or after SNODAS or ETf data are available.

Extracting the remote sensing data (ETf and NDVI) is the most time-consuming step, as the data are drawn from potentially thousands of separate Landsat-like images in Earth Engine. The snow, soils, and irrigation data, on the other hand, are relatively quick to extract because the images are fewer: one CONUS-wide image per day in SNODAS, a few static images for the soils data, and one image per year for the irrigation products. However, thanks to Earth Engine, even the Landsat-based products are extracted quickly if the sample plots are few and clustered in space, as they are for this tutorial.

The SWIM-RS approach requires the following input datasets to run and/or calibrate:
1. **NDVI**: Normalized Difference Vegetation Index is a measure made using the red and near-infrared bands of a multispectral instrument. It is a good way to estimate the relative density and vigor of vegetation, which is highly correlated with transpiration. NDVI is used as a proxy for the transpirative component of the crop coefficient in SWIM-RS, Kcb. Here, we access NDVI information from Landsat satellite images in Earth Engine.
2. **ETf**: The rate of ET expressed as a fraction of reference/potential ET. This is also known in agricultural water use modeling as the 'crop coefficient', or Kc. For this tutorial we use SSEBop, accessed from Google Earth Engine. We could use results from any number of remote sensing-based modeling approaches (METRIC, OpenET ensemble, etc.). *FOR USE IN CALIBRATION ONLY*
3. **Soils**: We need an initial estimate of soil hydraulic properties that govern the way water behaves in our very simple model soil water reservoir.
4. **Irrigation**: We use an irrigation mask (IrrMapper or LANID) to constrain the data extraction of the irrigated and unirrigated portions of any given sample plot.
5. **Snow**: We use the SNODAS product to estimate the snow water equivalent (SWE) at a daily time step to calibrate the simple snow model in SWIM-RS. *FOR USE IN CALIBRATION ONLY*
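The two remote sensing quantities above reduce to simple ratios. The following is a minimal sketch of those definitions, not code from SWIM-RS; the function names and the input values are illustrative assumptions.

```python
def ndvi(red: float, nir: float) -> float:
    """Normalized Difference Vegetation Index from red and near-infrared reflectance."""
    denom = nir + red
    if denom == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / denom


def etf(et_mm: float, ref_et_mm: float) -> float:
    """ET fraction: ET divided by reference ET over the same period."""
    if ref_et_mm <= 0:
        return float("nan")  # undefined without positive reference ET
    return et_mm / ref_et_mm


# Illustrative values only (not taken from the tutorial data)
print(round(ndvi(red=0.05, nir=0.45), 2))       # dense, vigorous canopy
print(round(etf(et_mm=4.2, ref_et_mm=6.0), 2))  # ET at 70% of reference ET
```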

Note: For this tutorial, we use 'field' and 'plot' somewhat interchangeably. Indeed, the sample plots for this tutorial are fields. However, we could draw an arbitrary polygon over a location of interest and run the model there. Keep in mind that the data will represent the mean of the irrigated and unirrigated portions of the sample plot. Using sensible land-use features (e.g., individual agricultural fields) is therefore a good approach, because homogeneous land-use management within a single field is a reasonable assumption.

---

## Two Paths

1. **Run extraction** (requires Earth Engine access): Execute the cells below to download data from Earth Engine and GridMET
2. **Use pre-built data**: Skip to notebook 03 (data available in `data/prebuilt/`)

## 1. Import Libraries and Authorize Earth Engine

If you are new to Earth Engine, check out https://developers.google.com/earth-engine/guides/auth

In [8]:
import os
import sys
import ee

root = os.path.abspath('../..')
sys.path.append(root)

from swimrs.swim.config import ProjectConfig
from swimrs.data_extraction.ee.etf_export import clustered_sample_etf
from swimrs.data_extraction.ee.ndvi_export import clustered_sample_ndvi
from swimrs.data_extraction.ee.snodas_export import sample_snodas_swe
from swimrs.data_extraction.snodas.snodas import create_timeseries_json
from swimrs.data_extraction.ee.ee_props import get_irrigation, get_ssurgo, get_landcover
from swimrs.data_extraction.ee.ee_utils import is_authorized

sys.setrecursionlimit(5000)

In [9]:
if not is_authorized():
    ee.Authenticate()
ee.Initialize()

## 2. Configuration

Load project configuration from the TOML file. This provides all paths, date ranges, and bucket settings.

In [10]:
# Load project configuration
project_dir = os.path.abspath('.')
config_file = os.path.join(project_dir, '1_Boulder.toml')

cfg = ProjectConfig()
cfg.read_config(config_file, project_root_override=project_dir)

print(f"Project: {cfg.project_name}")
print(f"Bucket: {cfg.ee_bucket}")
print(f"Date range: {cfg.start_dt} to {cfg.end_dt}")
print(f"Shapefile: {cfg.fields_shapefile}")
print(f"ETf model: {cfg.etf_target_model}")

Project: 1_Boulder
Bucket: wudr
Date range: 2004-01-01 00:00:00 to 2022-12-31 00:00:00
Shapefile: /home/dgketchum/code/swim-rs/examples/1_Boulder/data/gis/mt_sid_boulder.shp
ETf model: ssebop


In [11]:
# Export destination - use Cloud Storage bucket (faster) or Google Drive
# Change to True to use Google Drive instead of a bucket
USE_DRIVE = False

# Export settings derived from config
export_dest = 'drive' if USE_DRIVE else 'bucket'
export_bucket = None if USE_DRIVE else cfg.ee_bucket
file_prefix = cfg.project_name  # Bucket path prefix

# Drive folder from config (or default to project name)
drive_folder = cfg.resolved_config.get('earth_engine', {}).get('drive_folder', cfg.project_name)

# Date range from config
start_year = cfg.start_dt.year
end_year = cfg.end_dt.year

# Optional: Limit extraction to specific fields for testing
select_fields = ['043_000130', '043_000128', '043_000161']

# running all fields will take a bit longer
# select_fields = None

---

# Part A: Remote Sensing Extraction (ETf and NDVI)

## Extract ETf Raster Data

Now we're ready to do 'zonal stats' on our fields. We provide a local shapefile, and the code converts it to a FeatureCollection internally, then extracts ETf/NDVI summaries per field to CSV.

We need to use an irrigated lands mask (IrrMapper or LANID) to find irrigated and unirrigated zones within the polygons of our shapefile. You see this implemented in the code below, where the `mask` argument is either `irr` for irrigated, or `inv_irr` for the inverse of the irrigated mask, which are unirrigated areas.
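To illustrate what the two mask types mean for a single polygon, here is a small pure-Python sketch. It is only a conceptual stand-in for the Earth Engine reduction, with made-up pixel values and a hypothetical helper name (`split_means`); the actual extraction happens server-side.

```python
from statistics import mean


def split_means(values, irrigated):
    """Return (irrigated_mean, unirrigated_mean) for the pixels in one polygon.

    values: per-pixel values (e.g., ETf).
    irrigated: matching booleans from an irrigation mask.
    A side with no pixels yields None.
    """
    irr = [v for v, m in zip(values, irrigated) if m]
    inv_irr = [v for v, m in zip(values, irrigated) if not m]
    return (mean(irr) if irr else None, mean(inv_irr) if inv_irr else None)


# Made-up example: three irrigated pixels and two unirrigated pixels
pixels = [0.9, 0.8, 1.0, 0.3, 0.2]
mask = [True, True, True, False, False]
irr_mean, inv_irr_mean = split_means(pixels, mask)
print(irr_mean, inv_irr_mean)
```

Keeping the two means separate is what prevents the irrigated signal from being diluted by unirrigated pixels in the same polygon.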

For the raster data extraction, there are three options to get at the data:

* **`clustered_sample_etf`**: This function finds all Landsat images intersecting the sample polygons (i.e., our fields). Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each. We use this in the tutorial since the sample from the Montana fields database is geographically constrained.

* **`sparse_sample_etf`**: This function assumes the samples (fields) are spread out over many different Landsat images. It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them. This is used when we extract data for widely-spaced sites across the Conterminous US in examples 4 and 5, or globally, as in example 6.

* **`export_etf_images`**: This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons. This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.

In [12]:
# Extract ETf for both irrigated and unirrigated masks
# This divides every sample into a 'purely' irrigated section (irr) and an unirrigated one (inv_irr)
# This allows us to build a model for irrigated areas that aren't contaminated by unirrigated areas.

for mask in ['inv_irr', 'irr']:
    print(f"Extracting ETf ({mask})...")
    
    if USE_DRIVE:
        clustered_sample_etf(
            cfg.fields_shapefile, bucket=None, debug=False, mask_type=mask, 
            check_dir=None, start_yr=start_year, end_yr=end_year, 
            feature_id=cfg.feature_id_col, select=select_fields, 
            model=cfg.etf_target_model, usgs_nhm=True, dest='drive', 
            state_col=cfg.state_col, drive_folder=drive_folder, drive_categorize=True
        )
    else:
        clustered_sample_etf(
            cfg.fields_shapefile, bucket=cfg.ee_bucket, debug=False, mask_type=mask, 
            check_dir=None, start_yr=start_year, end_yr=end_year, 
            feature_id=cfg.feature_id_col, select=select_fields, 
            model=cfg.etf_target_model, usgs_nhm=True, dest='bucket', 
            file_prefix=file_prefix, state_col=cfg.state_col, drive_categorize=False
        )

Extracting ETf (inv_irr)...
etf_inv_irr_2004
etf_inv_irr_2005
etf_inv_irr_2006
etf_inv_irr_2007
etf_inv_irr_2008
etf_inv_irr_2009
etf_inv_irr_2010
etf_inv_irr_2011
etf_inv_irr_2012
etf_inv_irr_2013
etf_inv_irr_2014
etf_inv_irr_2015
etf_inv_irr_2016
etf_inv_irr_2017
etf_inv_irr_2018
etf_inv_irr_2019
etf_inv_irr_2020
etf_inv_irr_2021
etf_inv_irr_2022
Extracting ETf (irr)...
etf_irr_2004
etf_irr_2005
etf_irr_2006
etf_irr_2007
etf_irr_2008
etf_irr_2009
etf_irr_2010
etf_irr_2011
etf_irr_2012
etf_irr_2013
etf_irr_2014
etf_irr_2015
etf_irr_2016
etf_irr_2017
etf_irr_2018
etf_irr_2019
etf_irr_2020
etf_irr_2021
etf_irr_2022


## Extract NDVI Raster Data

This is just like the ETf extraction, but for NDVI. This is a little more straightforward as we can get the data straight from the Landsat collection, and don't need special permissions or knowledge of where the data are stored.

As with the ETf code, the extraction has three options to get at the data:
* **`clustered_sample_ndvi`**: For clustered fields (what we use here)
* **`sparse_sample_ndvi`**: For fields spread across many Landsat images
* **`export_ndvi_images`**: For exporting the Landsat images themselves

In [13]:
for mask in ['inv_irr', 'irr']:
    print(f"Extracting NDVI ({mask})...")
    
    if USE_DRIVE:
        clustered_sample_ndvi(
            cfg.fields_shapefile, bucket=None, debug=False, mask_type=mask, 
            check_dir=None, start_yr=start_year, end_yr=end_year, 
            feature_id=cfg.feature_id_col, select=select_fields, 
            satellite='landsat', dest='drive', 
            drive_folder=drive_folder, drive_categorize=True
        )
    else:
        clustered_sample_ndvi(
            cfg.fields_shapefile, bucket=cfg.ee_bucket, debug=False, mask_type=mask, 
            check_dir=None, start_yr=start_year, end_yr=end_year, 
            feature_id=cfg.feature_id_col, select=select_fields, 
            satellite='landsat', dest='bucket', 
            file_prefix=file_prefix, drive_categorize=False
        )

Extracting NDVI (inv_irr)...
ndvi_2004_inv_irr
ndvi_2005_inv_irr
ndvi_2006_inv_irr
ndvi_2007_inv_irr
ndvi_2008_inv_irr
ndvi_2009_inv_irr
ndvi_2010_inv_irr
ndvi_2011_inv_irr
ndvi_2012_inv_irr
ndvi_2013_inv_irr
ndvi_2014_inv_irr
ndvi_2015_inv_irr
ndvi_2016_inv_irr
ndvi_2017_inv_irr
ndvi_2018_inv_irr
ndvi_2019_inv_irr
ndvi_2020_inv_irr
ndvi_2021_inv_irr
ndvi_2022_inv_irr
Extracting NDVI (irr)...
ndvi_2004_irr
ndvi_2005_irr
ndvi_2006_irr
ndvi_2007_irr
ndvi_2008_irr
ndvi_2009_irr
ndvi_2010_irr
ndvi_2011_irr
ndvi_2012_irr
ndvi_2013_irr
ndvi_2014_irr
ndvi_2015_irr
ndvi_2016_irr
ndvi_2017_irr
ndvi_2018_irr
ndvi_2019_irr
ndvi_2020_irr
ndvi_2021_irr
ndvi_2022_irr


---

# Part B: Snow, Irrigation, and Soils Extraction

For this raster data extraction, the main functions we need to run are:

* **`sample_snodas_swe`**: This function iterates over the daily SNODAS images in Earth Engine, extracting mean SWE for each sample plot for each day, September through May. (https://nsidc.org/data/g02158/versions/1)

* **`get_irrigation`**: This function uses IrrMapper to get statistics about the irrigation status of each plot for each year, including the fraction of the plot that was irrigated. (https://www.mdpi.com/2072-4292/12/14/2328)

* **`get_ssurgo`**: This function uses data summarized and put in a public Earth Engine asset by Charles Morton at Desert Research Institute from SSURGO to summarize plot-scale soil texture and hydraulic properties used by SWIM-RS.

Note: This part also runs `get_landcover` (see the Land Cover Data section). The module additionally has functions for extracting vegetation height (`get_landfire`) and crop type (`get_cdl`).

## SWE Data (SNODAS)

Notes:
- Ensure that your Cloud Storage bucket has the correct permissions for Earth Engine to write to it.
- This will produce one extract per month for September through May, each holding daily SWE values, regardless of SWE status at the sample plots.
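As a rough way to anticipate how many monthly extracts to expect, the sketch below enumerates the September through May months across a range of years. The helper name and the year range (borrowed from this tutorial's configuration) are assumptions for illustration; the date handling inside `sample_snodas_swe` itself is not shown here.

```python
def snow_season_months(start_year, end_year):
    """List (year, month) pairs for September through May, inclusive of both years."""
    snow_months = {9, 10, 11, 12, 1, 2, 3, 4, 5}
    return [
        (year, month)
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
        if month in snow_months
    ]


months = snow_season_months(2004, 2022)
print(len(months))   # nine snow-season months per calendar year
print(months[:3], months[-1])
```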

In [None]:
print("Extracting SNODAS SWE...")
sample_snodas_swe(
    cfg.fields_shapefile, bucket=export_bucket, debug=False, 
    check_dir=None, overwrite=False, feature_id=cfg.feature_id_col, 
    dest=export_dest, drive_folder=drive_folder, drive_categorize=True,
    file_prefix=file_prefix
)

## Irrigation Data (IrrMapper/LANID)

This will produce an annual dataset of the IrrMapper-estimated irrigated fraction for each sample plot.

In [None]:
print("Extracting irrigation data...")
get_irrigation(
    cfg.fields_shapefile, description=f'{cfg.project_name}_irr', debug=False, 
    selector=cfg.feature_id_col, lanid=True, dest=export_dest, 
    bucket=export_bucket, file_prefix=file_prefix,
    drive_folder=drive_folder, drive_categorize=True
)

## Land Cover Data

Extract dominant landcover (MODIS LC_Type1 + FROM-GLC10).

In [None]:
print("Extracting land cover data...")
get_landcover(
    cfg.fields_shapefile, description=f'{cfg.project_name}_landcover', debug=False, 
    selector=cfg.feature_id_col, dest=export_dest, 
    bucket=export_bucket, file_prefix=file_prefix,
    drive_folder=drive_folder, drive_categorize=True
)

## Soils Data (SSURGO)

This will produce a single dataset of the SSURGO-estimated soil properties for each sample plot.

In [None]:
print("Extracting SSURGO soil data...")
get_ssurgo(
    cfg.fields_shapefile, description=f'{cfg.project_name}_ssurgo', debug=False, 
    selector=cfg.feature_id_col, dest=export_dest, 
    bucket=export_bucket, file_prefix=file_prefix,
    drive_folder=drive_folder, drive_categorize=True
)

---

# Part C: Meteorology Data Extraction (GridMET)

In this section, we will:
1. Associate sample plots (fields) with their nearest GridMET cell
2. Extract GridMET bias-correction information from DRI's rasters
3. Download GridMET data from the THREDDS server, and NLDAS-2 hourly precipitation data

Read about GridMET: https://www.climatologylab.org/gridmet.html  
Read about NLDAS-2: https://ldas.gsfc.nasa.gov/nldas/nldas-2-model-data

## GridMET Cell Assignment

Our fields are in a pretty tight cluster. We're preparing to download meteorology from a 4-km resolution dataset (GridMET), so it's unnecessary to download a meteorology time series for each field, as many will just be copies. Rather, we'll identify the GridMET 'cells' with a shapefile, and find the closest cell to each field.
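The nearest-cell idea can be sketched in a few lines of plain Python. This is not the implementation of `assign_gridmet_and_corrections`; the coordinates, field names, and cell IDs below are hypothetical, and great-circle distance is just one reasonable choice of metric.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def assign_nearest_cell(fields, cells):
    """Map each field ID to the ID of the closest cell centroid.

    fields: {field_id: (lat, lon)}; cells: {cell_id: (lat, lon)}.
    """
    return {
        fid: min(cells, key=lambda cid: haversine_km(lat, lon, *cells[cid]))
        for fid, (lat, lon) in fields.items()
    }


# Hypothetical centroids roughly one 4-km cell apart
cells = {1001: (46.00, -112.00), 1002: (46.00, -111.96)}
fields = {"field_a": (46.01, -112.01), "field_b": (45.99, -111.97)}
mapping = assign_nearest_cell(fields, cells)
print(mapping)
print(f"{len(set(mapping.values()))} unique cells for {len(fields)} fields")
```

Only the unique cell IDs in the mapping need a meteorology download, which is where the savings come from.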

In addition to the raw meteorology data, we will also be accessing rasters that show the observed bias between AgriMet weather stations and GridMET's reference ET. These biases are due to the impacts of irrigated agriculture on the near-surface atmosphere, which tends to have relatively high humidity and low temperature compared to arid and semi-arid surroundings. This bias is documented in Blankenau et al. (2020; https://doi.org/10.1016/j.agwat.2020.106376). The bias correction surfaces were mapped over CONUS by Desert Research Institute and OpenET and are documented in Melton et al., 2021 (https://doi.org/10.1111/1752-1688.12956).

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import contextily as ctx
import cartopy.crs as ccrs

%matplotlib inline

In [None]:
from swimrs.data_extraction.gridmet.gridmet import assign_gridmet_and_corrections, download_gridmet

In [None]:
# Associate each field with nearest GridMET cell and extract bias correction factors
assign_gridmet_and_corrections(
    fields=cfg.fields_shapefile, 
    gridmet_points=cfg.gridmet_centroids, 
    gridmet_ras=cfg.gridmet_corr_dir, 
    fields_join=cfg.gridmet_mapping, 
    factors_js=cfg.gridmet_factors, 
    feature_id=cfg.feature_id_col
)

This should print 'Get gridmet for 4 target points', as there should be only four unique GridMET cells that are closest to each of the fields. Let's visualize this:

In [None]:
import random
import warnings
warnings.filterwarnings("ignore", category=UserWarning, message="The GeoDataFrame you are attempting to plot is empty.")

gdf = gpd.read_file(cfg.fields_shapefile)
cdf = gpd.read_file(cfg.gridmet_centroids)
gdf_gfid = gpd.read_file(cfg.gridmet_mapping)

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw={'projection': ccrs.epsg(5071)})

# Color fields by their assigned GridMET cell
unique_gfids = set(cdf['GFID'].unique()).union(gdf_gfid['GFID'].unique())
colors = ["#" + ''.join(random.choice('0123456789ABCDEF') for _ in range(6)) for _ in unique_gfids]
color_map = dict(zip(unique_gfids, colors))

for gfid, color in color_map.items():
    cdf[cdf['GFID'] == gfid].plot(ax=ax, edgecolor='black', color=color, transform=ccrs.epsg(5071))
    gdf_gfid[gdf_gfid['GFID'] == gfid].plot(ax=ax, edgecolor='black', color=color, transform=ccrs.epsg(5071))

# Hybrid basemap: satellite imagery + labels overlay
ctx.add_basemap(ax, source=ctx.providers.Esri.WorldImagery, crs=ccrs.epsg(5071))
ctx.add_basemap(ax, source=ctx.providers.Stadia.StamenTonerLabels, crs=ccrs.epsg(5071), alpha=0.8)

plt.title('Fields colored by GridMET Cell Assignment')
plt.show()

Each field's color should match that of the nearest GridMET centroid. The 'GFID' for each field has been saved in the output shapefile. This reduces the data we must download by a large factor.

## Download GridMET Data

Now we download the daily meteorological timeseries from GridMET's THREDDS server. This will probably take a few minutes.

Note: Under the hood, this code will also be downloading hourly precipitation data from NLDAS-2. This is helpful on days when there is precipitation and we want to know its intensity for modeling purposes (to estimate runoff/recharge).
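To see why hourly values matter, compare two days with the same daily total but very different intensity. The sketch below uses made-up hourly values and a hypothetical helper name; it does not reproduce how SWIM-RS partitions runoff and recharge.

```python
def precip_intensity(hourly_mm):
    """Return (daily_total_mm, peak_mm_per_hr, wet_hours) from 24 hourly values."""
    if len(hourly_mm) != 24:
        raise ValueError("expected 24 hourly values")
    total = sum(hourly_mm)
    peak = max(hourly_mm)
    wet_hours = sum(1 for v in hourly_mm if v > 0)
    return total, peak, wet_hours


# Two made-up days, each with 12 mm of precipitation
storm = [0.0] * 14 + [2.0, 8.0, 2.0] + [0.0] * 7   # short, intense burst
drizzle = [0.5] * 24                               # steady light rain

print(precip_intensity(storm))
print(precip_intensity(drizzle))
```

A daily total alone cannot distinguish these two cases, while the peak hourly rate can.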

In [None]:
os.makedirs(cfg.met_dir, exist_ok=True)

download_gridmet(
    cfg.gridmet_mapping, cfg.gridmet_factors, cfg.met_dir, 
    start='1987-01-01', end='2023-12-31',
    target_fields=None, overwrite=False, feature_id=cfg.feature_id_col
)

Let's look at one of the GridMET time series and see what we have.

In [None]:
import pandas as pd

# Find a GridMET file
met_files = [f for f in os.listdir(cfg.met_dir) if f.endswith('.csv')]
if met_files:
    met_data = os.path.join(cfg.met_dir, met_files[0])
    met_df = pd.read_csv(met_data, index_col='date')
    print(f"Loaded {met_files[0]}")
    print(met_df.head())
    print(f"\nColumns: {list(met_df.columns)}")

Here, we see the date, the location, and daily meteorological information from GridMET, including `tmin_c`, `tmax_c`, and `prcp_mm`. We see the critical reference ET estimates in 'uncorrected' form (`eto_mm_uncorr` and `etr_mm_uncorr`), which we could use over natural vegetation, and also in 'corrected' form (`eto_mm` and `etr_mm`), which we will be using over our irrigated study area. We also see `prcp_hr_XX`, which is the hourly NLDAS-2 precipitation estimate.

---

# Part D: Sync Data from Cloud Storage

Once your Earth Engine export tasks have completed (monitor at https://code.earthengine.google.com/tasks), sync the data from your Cloud Storage bucket to your local filesystem.

**Note**: If using Google Drive export (`USE_DRIVE = True`), you'll need to manually download files from Drive instead.
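Besides the web task manager, export progress can be checked from Python. The sketch below summarizes a list of task status dictionaries; the helper names and the sample statuses are made up for illustration. With an initialized Earth Engine session, the statuses could be gathered from `ee.batch.Task.list()` as noted in the comments, keeping in mind that this returns all recent tasks for the account, not only those from this notebook.

```python
from collections import Counter


def summarize_task_states(statuses):
    """Count export tasks by state from a list of status dictionaries."""
    return Counter(s.get("state", "UNKNOWN") for s in statuses)


def all_done(statuses):
    """True when there is at least one task and every task reports COMPLETED."""
    return bool(statuses) and all(s.get("state") == "COMPLETED" for s in statuses)


# With an initialized Earth Engine session, statuses could be gathered with:
#   import ee
#   statuses = [task.status() for task in ee.batch.Task.list()]
# The list below is made up for illustration.
statuses = [
    {"description": "etf_irr_2004", "state": "COMPLETED"},
    {"description": "etf_irr_2005", "state": "RUNNING"},
    {"description": "ndvi_2004_irr", "state": "COMPLETED"},
]
print(summarize_task_states(statuses))
print("Ready to sync:", all_done(statuses))
```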

In [None]:
# Preview what will be synced (dry run)
if not USE_DRIVE and cfg.ee_bucket:
    print("Preview of files to sync (dry run):")
    cfg.sync_from_bucket(dry_run=True)

In [None]:
# Actually sync data from bucket
if not USE_DRIVE and cfg.ee_bucket:
    print("Syncing data from bucket...")
    cfg.sync_from_bucket(dry_run=False)
    print("Sync complete!")

## Convert SNODAS to Time Series JSON

After syncing, convert the month-by-month SNODAS files to per-field daily time series.
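Conceptually, this step merges many monthly tables into one daily series per field. The sketch below shows that idea under a loudly hypothetical file layout (one row per field, an `FID` column, and one column per date); the real extract format and the internals of `create_timeseries_json` may differ.

```python
import csv
import io
import json


def merge_monthly_extracts(monthly_csv_texts, feature_id="FID"):
    """Merge monthly CSV texts into {field_id: {date: swe}} with dates sorted.

    Assumes (hypothetically) one row per field and one column per date.
    """
    series = {}
    for text in monthly_csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            fid = row.pop(feature_id)
            daily = {date: float(value) for date, value in row.items()}
            series.setdefault(fid, {}).update(daily)
    # ISO-formatted date strings sort chronologically
    return {fid: dict(sorted(days.items())) for fid, days in series.items()}


# Made-up monthly extracts, deliberately passed out of order
january = "FID,2004-01-01,2004-01-02\nfield_a,12.0,12.5\nfield_b,3.0,2.5\n"
december = "FID,2003-12-31\nfield_a,11.0\nfield_b,3.5\n"

merged = merge_monthly_extracts([january, december])
print(json.dumps(merged, indent=2))
```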

In [None]:
# Convert month-by-month SNODAS files to per-field daily time series
snow_extracts = os.path.join(cfg.data_dir, 'snow', 'snodas', 'extracts')
snow_out = os.path.join(cfg.data_dir, 'snow', 'snodas', 'snodas.json')

if os.path.exists(snow_extracts):
    os.makedirs(os.path.dirname(snow_out), exist_ok=True)
    create_timeseries_json(snow_extracts, snow_out, feature_id=cfg.feature_id_col)
    print(f"Created SNODAS time series: {snow_out}")
else:
    print(f"SNODAS extracts not found at {snow_extracts}")
    print("Run sync_from_bucket() after EE tasks complete.")

---

## Summary

Data extraction is complete. After syncing from the bucket, you should have:
- `data/remote_sensing/landsat/extracts/ssebop_etf/` - ETf CSVs by year and mask
- `data/remote_sensing/landsat/extracts/ndvi/` - NDVI CSVs by year and mask  
- `data/snow/snodas/snodas.json` - Per-field SWE time series
- `data/properties/` - Irrigation, landcover, and soils CSVs
- `data/met_timeseries/gridmet/` - GridMET meteorology CSVs
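A quick way to confirm that the outputs listed above are in place is to check for each path. This is a minimal sketch that assumes the paths are relative to the project directory; the helper name is hypothetical, and an empty directory is treated as missing.

```python
from pathlib import Path

# Paths copied from the summary list above (assumed relative to the project directory)
EXPECTED = [
    "data/remote_sensing/landsat/extracts/ssebop_etf",
    "data/remote_sensing/landsat/extracts/ndvi",
    "data/snow/snodas/snodas.json",
    "data/properties",
    "data/met_timeseries/gridmet",
]


def missing_outputs(project_dir, expected=EXPECTED):
    """Return the expected paths that are absent, or are empty directories."""
    root = Path(project_dir)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.exists() or (path.is_dir() and not any(path.iterdir())):
            missing.append(rel)
    return missing


print(missing_outputs("."))
```

An empty list means every expected output was found; anything else names the extraction step that still needs to be run or synced.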

**Next**: Run notebook 03 to ingest this data into the SwimContainer.