# Data Preparation Overview

In this tutorial we will extract all the data we need both to run the uncalibrated SWIM-RS model and to calibrate and run a parameterized model. The model uses a daily soil water balance, so it needs daily estimates of meteorological drivers and some *a priori* information on the soils at our sample locations. It also needs an estimate of the state of the vegetation on the surface, for which we use Landsat-based NDVI. With this information, we will be able to run SWIM-RS to estimate daily ET, soil water storage, recharge, runoff, and simulated irrigation.

To calibrate the model so it behaves more realistically, we use parameter inversion software (PEST++, in this case) alongside target data that provide *somewhat* independent estimates of ET and snow on the ground. Once the model is calibrated for a sample plot, these data are no longer needed, so the calibrated model can be run for periods before or after SNODAS or ETf data are available.

Extracting the remote sensing data (ETf and NDVI) is the most time-consuming step, as the data are pulled from potentially thousands of separate Landsat-like images in Earth Engine. The snow, soils, and irrigation data, on the other hand, are relatively quick to extract because there are fewer images: one CONUS-wide image per day in SNODAS, a few static images for the soils data, and one image per year for the irrigation products. However, thanks to Earth Engine, even the Landsat-based products are extracted quickly if the sample plots are few and clustered in space, as they are for this tutorial.

The SWIM-RS approach requires the following input datasets to run and/or calibrate:
1. NDVI: Normalized Difference Vegetation Index is a measure made using the red and near-infrared bands of a multispectral instrument. It is a good way to estimate the relative density and vigor of vegetation, which is highly correlated with transpiration. NDVI is used as a proxy for the transpirative component of the crop coefficient in SWIM-RS, Kcb. Here, we access NDVI information from Landsat satellite images in Earth Engine.
2. ETf: The rate of ET expressed as a fraction of reference/potential ET. This is also known in agricultural water use modeling as the 'crop coefficient', or Kc. For this tutorial we use SSEBop, accessed from Google Earth Engine. We could use results from any number of remote sensing-based modeling approaches (METRIC, OpenET ensemble, etc.). *FOR USE IN CALIBRATION ONLY*
3. Soils: We need an initial estimate of soil hydraulic properties that govern the way water behaves in our very simple model soil water reservoir.
4. Irrigation: We use an irrigation mask (IrrMapper or LANID) to constrain the data extraction of the irrigated and unirrigated portions of any given sample plot.
5. Snow: We use the SNODAS product to estimate the snow water equivalent (SWE) at a daily time step to calibrate the simple snow model in SWIM-RS. *FOR USE IN CALIBRATION ONLY*
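
The two remote sensing quantities in the list above are simple ratios, and a small worked example may help fix the definitions. The sketch below uses the standard NDVI formula and the definition of ETf as ET divided by reference ET; the function names and input values are illustrative only and are not part of the SWIM-RS code.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        # Undefined when both bands are zero (e.g., masked pixels)
        return float('nan')
    return (nir - red) / total


def etf(et: float, reference_et: float) -> float:
    """ET expressed as a fraction of reference ET."""
    if reference_et <= 0:
        return float('nan')
    return et / reference_et


# Hypothetical values: dense vegetation reflects strongly in the near-infrared.
print(round(ndvi(nir=0.45, red=0.05), 2))       # 0.8
print(round(etf(et=4.5, reference_et=6.0), 2))  # 0.75
```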

Note: For this tutorial, we use 'field' and 'plot' somewhat interchangeably. Indeed, the sample plots for this tutorial are fields. However, we could draw an arbitrary polygon over a location of interest and run the model there. Keep in mind that the data will represent the mean of the irrigated and unirrigated portions of the sample plot. Therefore, using sensible land-use features (e.g., individual agricultural fields) is a good approach, because assuming homogeneous land-use management within a single field is not a terrible assumption.

# Earth Engine Asset Upload and Remote Sensing Extraction

In this notebook, we will:
1. Upload our data as an Earth Engine asset.
2. Use the `clustered_sample_etf` function to perform SSEBop ETf extraction.
3. Use the `clustered_sample_ndvi` function to perform Landsat NDVI extraction.


Ensure that the Earth Engine Python API is authenticated and configured, and that the Earth Engine CLI is available in your environment.

## 1: Import Libraries and Authorize Earth Engine

In [1]:
import os
import sys
import ee


# Add the repository root to the path so the project modules can be imported
root = os.path.abspath('../../..')
sys.path.insert(0, root)

from data_extraction.ee.etf_export import clustered_sample_etf
from data_extraction.ee.ndvi_export import clustered_sample_ndvi

from data_extraction.ee.ee_utils import is_authorized
from utils.google_bucket import list_and_copy_gcs_bucket

sys.setrecursionlimit(5000)

If you are new to Earth Engine, check out the authentication guide: https://developers.google.com/earth-engine/guides/auth

In [2]:
if not is_authorized():
    ee.Authenticate()
ee.Initialize()

Authorized


## 2: Upload Shapefile to Earth Engine Asset

Upload your shapefile as an asset in Earth Engine (https://developers.google.com/earth-engine/guides/table_upload).

After the upload is complete, you can proceed with the extraction steps below.

## 3: Extract ETf Raster Data

Now we're ready to do 'zonal stats' on our fields. Earth Engine makes this very easy: we provide the FeatureCollection (the asset from our shapefile upload) and a time period, and the code extracts summaries of ETf over each field object, returning them as a table (`.csv`).

We need to use an irrigated lands mask (IrrMapper or LANID) to find irrigated and unirrigated zones within the polygons of our shapefile. You see this implemented in the code below, where the `mask_type` argument is either `irr` for irrigated areas, or `inv_irr` for the inverse of the irrigated mask, i.e., the unirrigated areas.
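
To make the effect of the two mask types concrete, here is a minimal sketch of a masked zonal mean in plain Python. It is not how Earth Engine or the extraction functions implement the masking; the pixel values, the `masked_mean` helper, and the boolean irrigation flag are all hypothetical and only illustrate that `irr` and `inv_irr` split the pixels of one polygon into two complementary groups.

```python
from statistics import mean

# Hypothetical pixels inside one sample polygon: (ETf value, irrigated flag).
pixels = [(0.9, True), (0.8, True), (0.3, False), (0.2, False), (0.4, False)]


def masked_mean(pixels, mask_type):
    """Mean of the pixel values kept by the mask ('irr' or 'inv_irr')."""
    if mask_type == 'irr':
        kept = [v for v, irrigated in pixels if irrigated]
    elif mask_type == 'inv_irr':
        kept = [v for v, irrigated in pixels if not irrigated]
    else:
        raise ValueError(f'unknown mask_type: {mask_type}')
    # A polygon with no pixels of the requested type has no valid mean.
    return mean(kept) if kept else None


for mask in ['inv_irr', 'irr']:
    print(mask, round(masked_mean(pixels, mask), 2))
```

With these made-up values the irrigated mean is well above the unirrigated mean, which is the kind of contrast a single unmasked field average would blur.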

For the raster data extraction, there are three options to get at the data:

*   **`clustered_sample_etf`**:
    *   This function finds all Landsat images intersecting the sample polygons (i.e., our fields).
    *   Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each.
    *   We use this in the tutorial since the sample from the Montana fields database is geographically constrained.
*   **`sparse_sample_etf`**:
    *   This function assumes the samples (fields) are spread out over many different Landsat images.
    *   It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them.
    *   This is used when we extract data for John Volk's flux data set, whose sites are widely spaced across the conterminous US.
*   **`export_etf_images`**:
    *   This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons.
    *   This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.
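
The difference between the clustered and sparse approaches is mainly the order of iteration. The toy sketch below contrasts the two loop structures using a made-up mapping from image names to the plots they cover; the names `image_to_plots`, `clustered_order`, and `sparse_order` are hypothetical, and the real functions query Earth Engine rather than a dictionary.

```python
# Hypothetical footprints: which sample plots each Landsat image covers.
image_to_plots = {
    'img_a': ['f1', 'f2', 'f3'],
    'img_b': ['f1', 'f2', 'f3'],
    'img_c': ['f9'],
}


def clustered_order(image_to_plots, plots):
    """Image-first: visit each intersecting image once and extract all plots from it."""
    wanted = set(plots)
    passes = []
    for image, covered in image_to_plots.items():
        hits = [p for p in covered if p in wanted]
        if hits:
            passes.append((image, hits))
    return passes


def sparse_order(image_to_plots, plots):
    """Plot-first: for each plot, find the images that overlap it."""
    passes = []
    for plot in plots:
        images = [img for img, covered in image_to_plots.items() if plot in covered]
        passes.append((plot, images))
    return passes


plots = ['f1', 'f2', 'f3']
print(len(clustered_order(image_to_plots, plots)))  # 2 passes, one per intersecting image
print(len(sparse_order(image_to_plots, plots)))     # 3 passes, one per plot
```

When plots share the same images, the image-first loop does fewer passes; when every plot sits under different images, the plot-first loop avoids scanning images that cover only one plot.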

In [1]:
# Change this to your own
ee_account = 'ee-dgketchum'

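# If you don't have gsutil, there is a workaround described below
# Assumes the Google Cloud SDK is installed in your home directory; adjust the path if not
home = os.path.expanduser('~')
command = os.path.join(home, 'google-cloud-sdk', 'bin', 'gsutil')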

# Define Constants and Remote Sensing Data Paths
# TODO: remove hard-coded collections and use variables defined here
IRR = 'projects/ee-dgketchum/assets/IrrMapper/IrrMapperComp'
ETF = 'projects/usgs-gee-nhm-ssebop/assets/ssebop/landsat/c02'

bucket = 'wudr'
fields = 'projects/ee-dgketchum/assets/swim/mt_sid_boulder'

# We must specify which column in the shapefile represents the field's unique ID; in this case it is 'FID_1'
FEATURE_ID = 'FID_1'

### Notes
- Ensure that your Cloud Storage bucket has the correct permissions for Earth Engine to write to it.
- You can modify parameters in the `clustered_sample_etf` function for different masking and debugging options.
- The data are exported to the bucket one file per year.
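
Because the extraction produces one file per year, it is straightforward to check which years are already on disk before rerunning a failed job. The sketch below assumes the `etf_inv_irr_2004.csv` naming pattern that appears in the copy output later in this notebook; `missing_years` is a hypothetical helper written for illustration and is not part of the package.

```python
import os
import tempfile


def missing_years(check_dir, prefix, mask, start_yr, end_yr):
    """Return years in [start_yr, end_yr] with no '{prefix}_{mask}_{year}.csv' in check_dir."""
    missing = []
    for year in range(start_yr, end_yr + 1):
        name = f'{prefix}_{mask}_{year}.csv'
        if not os.path.exists(os.path.join(check_dir, name)):
            missing.append(year)
    return missing


# Demonstrate with a temporary directory holding two already-copied years.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2004, 2005):
        open(os.path.join(tmp, f'etf_inv_irr_{year}.csv'), 'w').close()
    todo = missing_years(tmp, 'etf', 'inv_irr', 2004, 2008)

print(todo)  # [2006, 2007, 2008]
```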

In [4]:
etf_dst = os.path.join(root, 'tutorials', '1_Boulder', 'data', 'landsat', 'extracts', 'etf')


In [None]:
# Here, we run the clustered_sample_etf function on the uploaded asset.

# Every sample is divided into a 'purely' irrigated section (i.e., 'irr') and an unirrigated one (i.e., 'inv_irr').
# This allows us to build a model for irrigated areas that isn't contaminated by unirrigated areas.
# For this tutorial, we're going to use both.
for mask in ['inv_irr', 'irr']:

    # 'check_dir' points the function at the planned local directory so it can check for existing data.
    # If a run fails for some reason, move what is complete from the bucket to that directory, then rerun
    # with check_dir=chk to skip what's already there. Here we pass check_dir=None, so nothing is skipped.
    chk = os.path.join(etf_dst, mask)

    # create the directory if it's not already there
    os.makedirs(chk, exist_ok=True)

    clustered_sample_etf(fields, bucket, debug=False, mask_type=mask, check_dir=None, start_yr=2004, end_yr=2023, feature_id=FEATURE_ID)

## 4: Move ETf Data from the Google Cloud Storage Bucket to Local Computer

We can monitor the data extraction on the Earth Engine code editor (https://code.earthengine.google.com/) or with their task monitor (https://code.earthengine.google.com/tasks).

Once the export tasks are complete, we can move the data from the bucket to our local directory in one of three ways. Options 2 and 3 require the Google Cloud SDK (https://cloud.google.com/sdk/docs/install).

1. Download the data manually: click the 'Open in GCS' button on the task list in the code editor, then click the download button on the bucket file.

2. Move the data from the command line with `gsutil`. For large transfers, this is much faster with the `-m` option, which runs the transfer in parallel. Run the `inv_irr` command first: the pattern `etf*irr*` also matches the `inv_irr` files, so running it first would move the irrigated and unirrigated files into the same directory:
   - `gsutil -m mv gs://wudr/etf*inv_irr* data/landsat/extracts/etf/inv_irr`
   - `gsutil -m mv gs://wudr/etf*irr* data/landsat/extracts/etf/irr`

3. Move the data programmatically using Python's subprocess. You can use `list_and_copy_gcs_bucket`. Use `dry_run=True` to just list what will get copied, and `dry_run=False` to download the data:

In [None]:
for mask in ['inv_irr', 'irr']:
    dst = os.path.join(etf_dst, '{}'.format(mask))
    glob_ = f'etf_{mask}'

    # list the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=True)
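
The ordering advice for the two `gsutil mv` commands above can be checked locally with Python's `fnmatch`, whose `*` wildcard behaves like the one in these patterns for simple object names. The file names below are assumed from the `etf_{mask}_{year}.csv` pattern used in this notebook; this is only a sketch of the matching behavior, not of `gsutil` itself.

```python
from fnmatch import fnmatchcase

# Assumed object names, following the etf_{mask}_{year}.csv pattern.
names = ['etf_inv_irr_2004.csv', 'etf_irr_2004.csv']

for pattern in ['etf*inv_irr*', 'etf*irr*']:
    matched = [n for n in names if fnmatchcase(n, pattern)]
    print(pattern, '->', matched)
```

The first pattern matches only the unirrigated file, while the second matches both, which is why the `inv_irr` files should be moved out of the bucket before the broader pattern is used.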


In [5]:
for mask in ['inv_irr', 'irr']:
    
    dst = os.path.join(etf_dst, mask)
    glob_ = f'etf_{mask}'

    # copy the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=False, overwrite=False)

/media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/etf_inv_irr_2004.csv exists, skipping
/media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/etf_inv_irr_2005.csv exists, skipping
/media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/etf_inv_irr_2006.csv exists, skipping
/media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/etf_inv_irr_2007.csv exists, skipping
Copying gs://wudr/etf_inv_irr_2008.csv: Copying gs://wudr/etf_inv_irr_2008.csv...
/ [1 files][ 46.0 KiB/ 46.0 KiB]                                                
Operation completed over 1 objects/46.0 KiB.                                     

Copied gs://wudr/etf_inv_irr_2008.csv to /media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr
/media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/etf_inv_irr_2009.csv exists, skipping
/media/research/IrrigationGIS/s

## 5: Extract NDVI Raster Data

This is just like the ETf extraction, but for NDVI. This is a little more straightforward as we can get the data straight from the Landsat collection, and don't need special permissions or knowledge of where the data are stored.

As with the ETf code, the extraction has three options to get at the data, depending on the clustering of fields and user needs, and the functions are split up in the same way:

*   **`clustered_sample_ndvi`**:
    *   This function finds all Landsat images intersecting the sample polygons (i.e., our fields).
    *   Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each.
    *   We use this in the tutorial since the sample from the Montana fields database is geographically constrained.
*   **`sparse_sample_ndvi`**:
    *   This function assumes the samples (fields) are spread out over many different Landsat images.
    *   It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them.
    *   This is used when we extract data for John Volk's flux data set, whose sites are widely spaced across the conterminous US.
*   **`export_ndvi_images`**:
    *   This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons.
    *   This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.

In [7]:
ndvi_dst = os.path.join(root, 'tutorials', '1_Boulder', 'data', 'landsat', 'extracts', 'ndvi')

In [None]:
# Just like before, but with 'ndvi' instead of 'etf':
for mask in ['inv_irr', 'irr']:

    # 'check_dir' points the function at the planned local directory so it can check for existing data.
    # If a run fails for some reason, move what is complete from the bucket to that directory, then rerun
    # with check_dir=chk to skip what's already there. Here we pass check_dir=None, so nothing is skipped.
    chk = os.path.join(ndvi_dst, mask)

    # create the directory if it's not already there
    os.makedirs(chk, exist_ok=True)

    clustered_sample_ndvi(fields, bucket, debug=False, mask_type=mask, check_dir=None, start_yr=2004, end_yr=2023, feature_id=FEATURE_ID)

In [8]:
# Just like before, but with 'ndvi' instead of 'etf'
for mask in ['inv_irr', 'irr']:

    dst = os.path.join(ndvi_dst, f'{mask}')
    glob_ = f'ndvi_{mask}'

    # copy the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=False, overwrite=False)

Copying gs://wudr/ndvi_inv_irr_2004.csv: Copying gs://wudr/ndvi_inv_irr_2004.csv...
/ [1 files][ 65.9 KiB/ 65.9 KiB]                                                
Operation completed over 1 objects/65.9 KiB.                                     

Copied gs://wudr/ndvi_inv_irr_2004.csv to /media/research/IrrigationGIS/swim/tutorial/step_2_earth_engine_extract/landsat/ndvi/inv_irr
Copying gs://wudr/ndvi_inv_irr_2005.csv: Copying gs://wudr/ndvi_inv_irr_2005.csv...
/ [1 files][ 62.3 KiB/ 62.3 KiB]                                                
Operation completed over 1 objects/62.3 KiB.                                     

Copied gs://wudr/ndvi_inv_irr_2005.csv to /media/research/IrrigationGIS/swim/tutorial/step_2_earth_engine_extract/landsat/ndvi/inv_irr
Copying gs://wudr/ndvi_inv_irr_2006.csv: Copying gs://wudr/ndvi_inv_irr_2006.csv...
/ [1 files][ 66.2 KiB/ 66.2 KiB]                                                
Operation completed over 1 objects/66.2 KiB.                         