# Data Preparation Overview

In this tutorial we will extract all the data we need both to run the uncalibrated SWIM-RS model and to calibrate and run a parameterized model. The model uses a daily soil water balance, so it needs a daily estimate of meteorological drivers and some *a priori* information on the soils at our sample locations. It also needs an estimate of the state of the vegetation on the surface, for which we use Landsat-based NDVI. With this information, we will be able to run SWIM-RS to estimate daily ET, soil water storage, recharge, runoff, and simulated irrigation.
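To make the idea of a daily soil water balance concrete, here is a deliberately simplified, generic single-bucket sketch. It is not the SWIM-RS implementation: the function name `step_bucket`, its parameters, and the lumping of runoff and recharge into a single `excess` term are all assumptions made for illustration. It only shows how the three kinds of input (meteorology, soils, vegetation) enter a daily update.

```python
def step_bucket(storage, precip, ref_et, kc, capacity):
    """Advance a single-layer soil water store by one day (all depths in mm).

    Hypothetical toy model: precip and ref_et stand in for the meteorological
    drivers, capacity for the soil information, and kc for the vegetation state.
    """
    # Water enters the store; anything above capacity leaves as excess
    # (lumped here, not split into runoff and recharge).
    storage = storage + precip
    excess = max(0.0, storage - capacity)
    storage -= excess
    # ET demand is reference ET scaled by a crop coefficient,
    # limited by the water actually available in the store.
    et = min(kc * ref_et, storage)
    storage -= et
    return storage, et, excess


# Run three days of made-up forcing through the bucket.
state = 50.0
for precip, ref_et in [(10.0, 5.0), (0.0, 6.0), (70.0, 2.0)]:
    state, et, excess = step_bucket(state, precip, ref_et, kc=0.8, capacity=100.0)
    print(round(state, 2), round(et, 2), round(excess, 2))
```

Calibration, in this picture, amounts to adjusting quantities such as `capacity` until simulated ET and snow agree with the target data.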

To calibrate the model so it behaves more realistically, we must use parameter inversion software (PEST++, in this case) alongside target data that provide *somewhat* independent estimates of ET and snow on the ground. Once the model is calibrated for a sample plot, these data are no longer needed. Therefore, the calibrated model can be run for periods before or after SNODAS or ETf data are available.

Extracting the remote sensing data (ETf and NDVI) is the most time-consuming step, because the data come from potentially thousands of separate Landsat-like images in Earth Engine. The snow, soils, and irrigation data, on the other hand, are relatively quick to extract because there are fewer images: one CONUS-wide image per day in SNODAS, a few static images for the soils data, and one image per year for the irrigation products. However, thanks to Earth Engine, even the Landsat-based products are extracted quickly if the number of sample plots is small and the plots are clustered in space, as they are for this tutorial.

The SWIM-RS approach requires the following input datasets to run and/or calibrate:
1. NDVI: Normalized Difference Vegetation Index is a measure made using the red and near-infrared bands of a multispectral instrument. It is a good way to estimate the relative density and vigor of vegetation, which is highly correlated with transpiration. NDVI is used as a proxy for the transpirative component of the crop coefficient in SWIM-RS, Kcb. Here, we access NDVI information from Landsat satellite images in Earth Engine.
2. ETf: The rate of ET expressed as a fraction of reference/potential ET. This is also known in agricultural water use modeling as the 'crop coefficient', or Kc. For this tutorial we use SSEBop, accessed from Google Earth Engine. We could use results from any number of remote sensing-based modeling approaches (METRIC, OpenET ensemble, etc.). *FOR USE IN CALIBRATION ONLY*
3. Soils: We need an initial estimate of soil hydraulic properties that govern the way water behaves in our very simple model soil water reservoir.
4. Irrigation: We use an irrigation mask (IrrMapper or LANID) to constrain the data extraction of the irrigated and unirrigated portions of any given sample plot.
5. Snow: We use the SNODAS product to estimate the snow water equivalent (SWE) at a daily time step to calibrate the simple snow model in SWIM-RS. *FOR USE IN CALIBRATION ONLY*
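The two remote sensing quantities in the list above reduce to simple ratios. The sketch below is illustrative only: the function names `ndvi` and `etf` are hypothetical and not part of SWIM-RS, and the inputs are assumed to be surface reflectance values (for NDVI) and ET depths in matching units (for ETf).

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("nir + red must be non-zero")
    return (nir - red) / total


def etf(et: float, reference_et: float) -> float:
    """ET as a fraction of reference ET (both in the same units, e.g. mm/day)."""
    if reference_et <= 0:
        raise ValueError("reference_et must be positive")
    return et / reference_et


# Dense, vigorous vegetation reflects strongly in the NIR and weakly in the red.
print(ndvi(nir=0.45, red=0.05))
# A field evaporating 4 mm on a day with 5 mm of reference ET.
print(etf(et=4.0, reference_et=5.0))
```

NDVI is bounded between -1 and 1 for non-negative reflectances, while ETf is a non-negative ratio with no fixed upper bound.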

Note: For this tutorial, we use 'field' and 'plot' somewhat interchangeably. Indeed, the sample plots for this tutorial are fields. However, we could draw an arbitrary polygon over a location of interest and run the model there. Keep in mind that the data will represent the mean of the irrigated and unirrigated portions of the sample plot. Therefore, using sensible land-use features (e.g., individual agricultural fields) is a good approach, because assuming homogeneous land-use management within a single field is not a terrible assumption.

# Local Shapefile and Remote Sensing Extraction

In this notebook, we will:
1. Point to our local shapefile (no asset upload required).
2. Use the `clustered_sample_etf` function to perform SSEBop ETf extraction.
3. Use the `clustered_sample_ndvi` function to perform Landsat NDVI extraction.


Ensure that the Earth Engine Python API is authenticated and configured, and that the Earth Engine CLI is available in your environment.

## 1: Import Libraries and Authorize Earth Engine

In [2]:
import os
import sys
import ee

root = os.path.abspath('../../..')
sys.path.append(root)

from swimrs.data_extraction.ee.etf_export import clustered_sample_etf
from swimrs.data_extraction.ee.ndvi_export import clustered_sample_ndvi

from swimrs.data_extraction.ee.ee_utils import is_authorized
from swimrs.utils.google_bucket import list_and_copy_gcs_bucket

sys.setrecursionlimit(5000)

If you are new to Earth Engine, check out https://developers.google.com/earth-engine/guides/auth

In [3]:
if not is_authorized():
    ee.Authenticate()
ee.Initialize()

## 2: Point to Local Shapefile

Set a path to your local shapefile. The extraction functions will convert it to an Earth Engine FeatureCollection under the hood — no upload needed.

Once the path is set, you can proceed with the extraction steps below.
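A shapefile is really a set of files that share a base name, and a missing sidecar file is a common cause of read failures. As a quick pre-flight check, the sketch below looks for the mandatory `.shp`, `.shx`, and `.dbf` parts next to the path you plan to use. The helper name `missing_shapefile_parts` is hypothetical and not part of SWIM-RS, and the example path is a placeholder.

```python
import os

# Mandatory components of the shapefile format.
REQUIRED_EXTENSIONS = ('.shp', '.shx', '.dbf')


def missing_shapefile_parts(shp_path):
    """Return the required component files that are absent next to `shp_path`."""
    base, _ = os.path.splitext(shp_path)
    return [base + ext for ext in REQUIRED_EXTENSIONS if not os.path.exists(base + ext)]


# Placeholder path; substitute your own shapefile.
missing = missing_shapefile_parts(os.path.join('path', 'to', 'fields.shp'))
if missing:
    print('Missing shapefile components:', missing)
```

A `.prj` file is optional in the format, but without it the coordinate reference system is undefined, so it is worth confirming that one is present as well.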

## 3: Extract ETf Raster Data

Now we're ready to do 'zonal stats' on our fields. We provide a local shapefile, and the code converts it to a FeatureCollection internally, then extracts ETf/NDVI summaries per field to CSV. 

We need to use an irrigated lands mask (IrrMapper or LANID) to find irrigated and unirrigated zones within the polygons of our shapefile. You see this implemented in the code below, where the `mask_type` argument is either `irr` for irrigated areas, or `inv_irr` for the inverse of the irrigated mask, i.e., the unirrigated areas.
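The actual masking and averaging happen inside Earth Engine, but the idea can be shown with a few pixels in plain Python. In this sketch the function `masked_means` and the pixel values are made up for illustration; it simply splits one field's pixels by an irrigation flag and averages each group separately.

```python
def masked_means(values, irrigated):
    """Mean of `values` over irrigated pixels ('irr') and over the rest ('inv_irr')."""
    groups = {'irr': [], 'inv_irr': []}
    for value, is_irrigated in zip(values, irrigated):
        groups['irr' if is_irrigated else 'inv_irr'].append(value)
    # A zone with no pixels has no defined mean, so report None for it.
    return {name: (sum(vals) / len(vals) if vals else None)
            for name, vals in groups.items()}


# Five hypothetical ETf pixels inside one field, three of them irrigated.
pixel_etf = [0.9, 1.0, 0.8, 0.3, 0.2]
pixel_irrigated = [True, True, True, False, False]
print(masked_means(pixel_etf, pixel_irrigated))
```

Averaging all five pixels together would give a value that describes neither the irrigated nor the unirrigated part of the field, which is why the two zones are extracted separately.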

For the raster data extraction, there are three options to get at the data:

*   **`clustered_sample_etf`**:
    *   This function finds all Landsat images intersecting the sample polygons (i.e., our fields).
    *   Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each.
    *   We use this in the tutorial since the sample from the Montana fields database is geographically constrained.
*   **`sparse_sample_etf`**:
    *   This function assumes the samples (fields) are spread out over many different Landsat images.
    *   It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them.
    *   This is used when we extract data for John Volk's flux data set, whose sites are widely spaced across the conterminous US.
*   **`export_etf_images`**:
    *   This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons.
    *   This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.

In [4]:
# Change this to your own
user = 'dgketchum'
ee_account = f'ee-{user}'

# The shapefile
shapefile_path = os.path.join(root, 'examples', '1_Boulder', 'data', 'gis', 'mt_sid_boulder.shp')

# Export destination switch
# Set drive = True to export to Google Drive (the option most users will need).
# This run exports to a Cloud Storage bucket instead (faster); set your own bucket name:
drive = False
bucket = 'wudr'

# If you're using a bucket, specify the gsutil command location
# If you don't have gsutil, there is a workaround described below
command = os.path.join(os.path.expanduser('~'), 'google-cloud-sdk', 'bin', 'gsutil')

# Derived export settings used below
export_dest = 'drive' if drive else 'bucket'
export_bucket = None if drive else bucket
bucket_subdir = 'swim/examples/1_Boulder'
# Using local shapefile; no EE asset needed.

# We must specify which column in the shapefile represents the field's unique ID, in this case it is 'FID_1'
FEATURE_ID = 'FID_1'
# Limit extraction to a few fields for this tutorial
select_fields = ['043_000130', '043_000128', '043_000161']

### Notes
- Ensure that your Cloud Storage bucket has the correct permissions for Earth Engine to write to it.
- You can modify parameters in the `clustered_sample_etf` function for different masking and debugging options.
- The data is downloaded by year.

In [5]:
etf_dst = os.path.join(root, 'examples', '1_Boulder', 'data', 'landsat', 'extracts', 'etf')


In [6]:
# Here, we run the ETf extraction against the local shapefile (no asset upload needed).

# every sample is divided into a 'purely' irrigated section (i.e., 'irr') and an unirrigated one (i.e., 'inv_irr')
# this allows us to build a model for irrigated areas that aren't contaminated by unirrigated areas.
# for this tutorial, we're going to use both
for mask in ['inv_irr', 'irr']:

    # 'check_dir' tells the function to check the planned directory for existing data;
    # if a run fails for some reason, move what is complete from the bucket to the directory, then rerun,
    # and the function will skip what's already there.
    # Note: the calls below pass check_dir=None, so here 'chk' is only used to create the local directory.
    chk = os.path.join(etf_dst, mask)

    # create the directory if it's not already there
    os.makedirs(chk, exist_ok=True)

    # Export ETf tables (Drive by default). Show both options explicitly for clarity.
    if drive:
        # Drive export (recommended for most users)
        clustered_sample_etf(shapefile_path, bucket=None, debug=False, mask_type=mask, check_dir=None, start_yr=2004,
                          end_yr=2023, feature_id=FEATURE_ID, select=select_fields, model='ssebop', usgs_nhm=True, dest='drive',
                          state_col='STATE', drive_folder='swim', drive_categorize=True)
    else:
        # Cloud Storage export (faster, requires a bucket)
        clustered_sample_etf(shapefile_path, bucket=bucket, debug=False, mask_type=mask, check_dir=None, start_yr=2004,
                          end_yr=2023, feature_id=FEATURE_ID, select=select_fields, model='ssebop', usgs_nhm=True, dest='bucket', file_prefix=bucket_subdir,
                          state_col='STATE', drive_categorize=False)

etf_inv_irr_2004
etf_inv_irr_2005
etf_inv_irr_2006
etf_inv_irr_2007
etf_inv_irr_2008
etf_inv_irr_2009
etf_inv_irr_2010
etf_inv_irr_2011
etf_inv_irr_2012
etf_inv_irr_2013
etf_inv_irr_2014
etf_inv_irr_2015
etf_inv_irr_2016
etf_inv_irr_2017
etf_inv_irr_2018
etf_inv_irr_2019
etf_inv_irr_2020
etf_inv_irr_2021
etf_inv_irr_2022
etf_inv_irr_2023
etf_irr_2004
etf_irr_2005
etf_irr_2006
etf_irr_2007
etf_irr_2008
etf_irr_2009
etf_irr_2010
etf_irr_2011
etf_irr_2012
etf_irr_2013
etf_irr_2014
etf_irr_2015
etf_irr_2016
etf_irr_2017
etf_irr_2018
etf_irr_2019
etf_irr_2020
etf_irr_2021
etf_irr_2022
etf_irr_2023


## 4: Retrieve ETf Data from Google Drive (or Cloud Storage)

Monitor export progress in the Earth Engine Code Editor (https://code.earthengine.google.com/) or the task monitor (https://code.earthengine.google.com/tasks). When tasks finish, retrieve the CSVs to your local machine as follows:

Steps (Google Drive):
1) Open Google Drive and locate the export folders. With `drive_categorize=True`, files go under separate folders:
   - `swim_etf` for ETf exports
   - `swim_ndvi` for NDVI exports
   If categorization is disabled, look under the single `swim` folder.
2) Inside `swim_etf`, you should see CSVs named like `etf_irr_2004.csv`, `etf_inv_irr_2004.csv`, etc. Inside `swim_ndvi`, you should see `ndvi_irr_2004.csv`, `ndvi_inv_irr_2004.csv`, etc.
3) Create local folders in this project for the extracts (if they don’t exist):
   - `examples/1_Boulder/data/landsat/extracts/etf/irr`
   - `examples/1_Boulder/data/landsat/extracts/etf/inv_irr`
   - `examples/1_Boulder/data/landsat/extracts/ndvi/irr`
   - `examples/1_Boulder/data/landsat/extracts/ndvi/inv_irr`
4) Download CSVs from Drive and place them into the matching local folders:
   - `etf_irr_*.csv` → `.../extracts/etf/irr/`
   - `etf_inv_irr_*.csv` → `.../extracts/etf/inv_irr/`
   - `ndvi_irr_*.csv` → `.../extracts/ndvi/irr/`
   - `ndvi_inv_irr_*.csv` → `.../extracts/ndvi/inv_irr/`
5) Verify that years in your study period (e.g., 2004–2023) are present for both masks before proceeding.
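Steps 4 and 5 can be done by hand, but they are easy to script. The sketch below is one possible approach using only the standard library: `sort_extracts` moves CSVs from a download folder into the `<variable>/<mask>/` layout described above, and `missing_years` reports which years are absent. Both function names are hypothetical helpers, not part of SWIM-RS, and the file-name pattern is assumed to match the examples in step 2 exactly.

```python
import os
import re
import shutil

# File names follow the pattern shown above, e.g. 'etf_inv_irr_2004.csv'.
NAME_PATTERN = re.compile(r'^(etf|ndvi)_(inv_irr|irr)_(\d{4})\.csv$')


def sort_extracts(download_dir, extracts_dir):
    """Move exported CSVs from `download_dir` into <extracts_dir>/<variable>/<mask>/."""
    moved = []
    for name in sorted(os.listdir(download_dir)):
        match = NAME_PATTERN.match(name)
        if not match:
            continue  # leave unrelated files alone
        variable, mask, _ = match.groups()
        target_dir = os.path.join(extracts_dir, variable, mask)
        os.makedirs(target_dir, exist_ok=True)
        shutil.move(os.path.join(download_dir, name), os.path.join(target_dir, name))
        moved.append(name)
    return moved


def missing_years(extracts_dir, variable, mask, start_yr, end_yr):
    """List the years in [start_yr, end_yr] with no CSV for this variable and mask."""
    target_dir = os.path.join(extracts_dir, variable, mask)
    present = set(os.listdir(target_dir)) if os.path.isdir(target_dir) else set()
    return [yr for yr in range(start_yr, end_yr + 1)
            if f'{variable}_{mask}_{yr}.csv' not in present]
```

For example, after downloading to a folder of your choice, you could call `sort_extracts(download_dir, 'examples/1_Boulder/data/landsat/extracts')` and then check that `missing_years(..., 'etf', 'irr', 2004, 2023)` returns an empty list for each variable and mask.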

Tip: exporting to a Google Cloud Storage bucket is significantly faster, but most users will not have a bucket. If you do have a bucket, you can switch to `dest='bucket'` in the export calls and then move data with `gsutil` or the helper in `swimrs.utils.google_bucket`. For example:
- List and copy with gsutil (Cloud SDK): `gsutil -m cp gs://<bucket>/etf*inv_irr* examples/1_Boulder/data/landsat/extracts/etf/inv_irr/`
- Or use: `swimrs.utils.google_bucket.list_and_copy_gcs_bucket(cmd_path='gsutil', bucket_path='<bucket>', local_dir='.../extracts/etf/irr', glob='etf_irr')`

In [7]:
for mask in ['inv_irr', 'irr']:
    dst = os.path.join(etf_dst, '{}'.format(mask))
    glob_ = f'etf_{mask}'

    # list the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=True)


FileNotFoundError: [Errno 2] No such file or directory: '/home/dgketchum/google-cloud-sdk/bin/gsutil'
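The `FileNotFoundError` above means that no gsutil executable exists at the configured path. One way to guard against this, sketched here with the standard library only, is to confirm the path before calling the copy helper and fall back to whatever `gsutil` is on the system `PATH`. The function name `resolve_gsutil` is hypothetical and not part of SWIM-RS.

```python
import os
import shutil


def resolve_gsutil(preferred_path=None):
    """Return a usable path to gsutil, or None if it cannot be found."""
    # Use the explicit path when it points at an existing file.
    if preferred_path and os.path.isfile(preferred_path):
        return preferred_path
    # Otherwise fall back to whatever 'gsutil' is on the PATH, if any.
    return shutil.which('gsutil')


command = resolve_gsutil(
    os.path.join(os.path.expanduser('~'), 'google-cloud-sdk', 'bin', 'gsutil'))
if command is None:
    print('gsutil not found: install the Google Cloud SDK, or export to Drive instead')
```

If `command` is `None`, skip the bucket copy cells and retrieve the CSVs from Drive as described in the steps above.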

In [None]:
for mask in ['inv_irr', 'irr']:
    dst = os.path.join(root, 'examples', '1_Boulder', 'data', 'landsat', 'extracts', 'etf', f'{mask}')
    glob_ = f'etf_{mask}'

    # copy the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=False, overwrite=False)

## 5: Extract NDVI Raster Data

This is just like the ETf extraction, but for NDVI. This is a little more straightforward as we can get the data straight from the Landsat collection, and don't need special permissions or knowledge of where the data are stored.

As with the ETf code, the extraction has three options to get at the data, depending on the clustering of fields and user needs, and the functions are split up in the same way:

*   **`clustered_sample_ndvi`**:
    *   This function finds all Landsat images intersecting the sample polygons (i.e., our fields).
    *   Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each.
    *   We use this in the tutorial since the sample from the Montana fields database is geographically constrained.
*   **`sparse_sample_ndvi`**:
    *   This function assumes the samples (fields) are spread out over many different Landsat images.
    *   It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them.
    *   This is used when we extract data for John Volk's flux data set, whose sites are widely spaced across the conterminous US.
*   **`export_ndvi_images`**:
    *   This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons.
    *   This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.

In [None]:
ndvi_dst = os.path.join(root, 'examples', '1_Boulder', 'data', 'landsat', 'extracts', 'ndvi')

In [None]:
# Just like before, but with 'ndvi' instead of 'etf':
for mask in ['inv_irr', 'irr']:

    # 'check_dir' tells the function to check the planned directory for existing data;
    # if a run fails for some reason, move what is complete from the bucket to the directory, then rerun,
    # and the function will skip what's already there.
    # Note: the calls below pass check_dir=None, so here 'chk' is only used to create the local directory.
    chk = os.path.join(ndvi_dst, mask)

    # create the directory if it's not already there
    os.makedirs(chk, exist_ok=True)

    # Export NDVI tables (Drive by default). Show both options explicitly.
    if drive:
        clustered_sample_ndvi(shapefile_path, bucket=None, debug=False, mask_type=mask, check_dir=None, start_yr=2004,
                           end_yr=2023, feature_id=FEATURE_ID, select=select_fields, satellite='landsat', dest='drive', drive_folder='swim',
                           drive_categorize=True)
    else:
        clustered_sample_ndvi(shapefile_path, bucket=bucket, debug=False, mask_type=mask, check_dir=None, start_yr=2004,
                           end_yr=2023, feature_id=FEATURE_ID, select=select_fields, satellite='landsat', dest='bucket', file_prefix=bucket_subdir,
                           drive_categorize=False)

In [None]:
# Just like before, but with 'ndvi' instead of 'etf'
for mask in ['inv_irr', 'irr']:
    dst = os.path.join(ndvi_dst, f'{mask}')
    glob_ = f'ndvi_{mask}'

    # copy the data from a Cloud Storage bucket, or download from Drive manually
    if not drive and bucket:
        list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=False, overwrite=False)
    else:
        print('Drive exports: download NDVI CSVs from Drive into', dst)