# Data Preparation Overview

In this tutorial we extract all the data needed to run the uncalibrated model, and to calibrate and run a parameterized model. The model uses a daily soil water balance, so it needs daily estimates of meteorological drivers and some *a priori* information on the soils at our sample locations. It also needs an estimate of the state of the vegetation on the surface, for which we use Landsat-based NDVI. With this information, we can run SWIM-RS to estimate daily ET, soil water storage, recharge, runoff, and simulated irrigation.

To calibrate the model so it behaves more realistically, we use parameter inversion software (PEST++, in this case) alongside target data that provide *somewhat* independent estimates of ET and snow on the ground. Once the model is calibrated for a sample plot, these data are no longer needed. The calibrated model can therefore be run for periods before or after SNODAS or ETf data are available.

The SWIM-RS approach requires a few input datasets to run:
1. NDVI: The Normalized Difference Vegetation Index is computed from the red and near-infrared bands of a multispectral instrument. It is a good estimate of the relative density and vigor of vegetation, which is highly correlated with transpiration, so SWIM-RS uses it as a proxy for the transpirative component of the crop coefficient, Kcb. Here, we access NDVI information from Landsat satellite images in Earth Engine.
2. ETf: the rate of ET expressed as a fraction of reference/potential ET. This is also known in agricultural water use modeling as the 'crop coefficient', or Kc. For this tutorial we use SSEBop, accessed from Google Earth Engine. We could use results from any number of remote sensing-based modeling approaches (METRIC, OpenET ensemble, etc.). *FOR USE IN CALIBRATION ONLY*
3. Soils: We need an initial estimate of soil hydraulic properties that govern the way water behaves in our very simple model soil water reservoir.
4. Irrigation: We use an irrigation mask (IrrMapper or LANID) to constrain the data extraction of the irrigated and unirrigated portions of any given sample plot. ## TODO: make this optional
5. Snow: We use the SNODAS product to estimate the snow water equivalent (SWE) at a daily time step to calibrate the simple snow model in SWIM-RS. *FOR USE IN CALIBRATION ONLY*
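
To make the first two inputs concrete, here is a minimal sketch of how the two ratios are defined. NDVI uses the standard normalized-difference formula on near-infrared and red reflectance, and ETf is ET divided by reference ET, as described in the list above. The function names `ndvi` and `etf` are hypothetical helpers for illustration; they are not part of SWIM-RS, and the tutorial itself obtains both quantities from Earth Engine rather than computing them locally.

```python
import math


def ndvi(nir, red):
    """Normalized difference of near-infrared and red reflectance."""
    denom = nir + red
    if denom == 0:
        # undefined when both bands are zero
        return math.nan
    return (nir - red) / denom


def etf(et, eto):
    """ET expressed as a fraction of reference ET (hypothetical helper)."""
    if eto <= 0:
        # a non-positive reference ET gives no meaningful fraction
        return math.nan
    return et / eto


if __name__ == '__main__':
    # illustrative values only, not taken from the tutorial data
    print(ndvi(0.5, 0.1))
    print(etf(3.0, 6.0))
```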

# Earth Engine Asset Upload and Remote Sensing Extraction

In this notebook, we will:
1. Upload our data as an Earth Engine asset using the Earth Engine CLI.
2. Use the `clustered_sample_etf` function to perform SSEBop ETf extraction.
3. Use the `clustered_sample_ndvi` function to perform Landsat NDVI extraction.


Ensure that the Earth Engine Python API is authenticated and configured, and that the Earth Engine CLI is available in your environment.
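
One quick way to confirm the CLI requirement is to look for the `earthengine` executable on your `PATH`. The sketch below uses only the standard library; the helper name `cli_available` is an assumption made for illustration and is not part of SWIM-RS.

```python
import shutil


def cli_available(name='earthengine'):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == '__main__':
    if cli_available():
        print('Earth Engine CLI found')
    else:
        print('Earth Engine CLI not found; check your environment')
```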

## 1. Import Libraries and Authorize Earth Engine

In [1]:
import os
import sys
import ee

# append the project path to sys.path (mine is in '/home/dgketchum/PycharmProjects')
sys.path.append('/home/dgketchum/PycharmProjects/swim-rs')

from data_extraction.ee.etf_export import clustered_sample_etf
from data_extraction.ee.ee_utils import is_authorized
from utils.google_bucket import list_and_copy_gcs_bucket

sys.path.insert(0, os.path.abspath('../..'))
sys.setrecursionlimit(5000)

If you are new to Earth Engine, check out https://developers.google.com/earth-engine/guides/auth

In [2]:
if not is_authorized():
    ee.Authenticate()
ee.Initialize()

Authorized


## 2. Upload Shapefile to Earth Engine Asset

Upload your shapefile as an asset in Earth Engine (https://developers.google.com/earth-engine/guides/table_upload).
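
If you prefer the command line, the upload can also be driven from Python. The sketch below builds an `earthengine upload table` command and, by default, only prints it. It assumes the shapefile and its sidecar files (.dbf, .shx, .prj) have already been copied to a Cloud Storage bucket; the `gs://` path shown is hypothetical, and the helper names are illustrative rather than part of SWIM-RS.

```python
import subprocess


def build_upload_command(asset_id, gcs_shp):
    """Assemble the CLI call that ingests a shapefile in GCS as a table asset."""
    return ['earthengine', 'upload', 'table', f'--asset_id={asset_id}', gcs_shp]


def upload_table(asset_id, gcs_shp, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    cmd = build_upload_command(asset_id, gcs_shp)
    if dry_run:
        print(' '.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == '__main__':
    # the asset ID matches the one used later in this notebook;
    # the bucket path is a hypothetical location for the uploaded shapefile
    upload_table('projects/ee-dgketchum/assets/swim/mt_sid_boulder',
                 'gs://wudr/mt_sid_boulder.shp',
                 dry_run=True)
```

Starting with `dry_run=True` lets you inspect the command before launching an ingestion task.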

After the upload is complete, you can proceed with the extraction steps below.

## 3. Extract Raster Data

Now we're ready to do 'zonal stats' on our fields. We need to use an irrigated lands mask (IrrMapper or LANID) to find irrigated and unirrigated zones within the polygons of our shapefile.

For the raster data extraction, there are three options to get at the data:

*   **`clustered_sample_etf`**:
    *   This function finds all Landsat images intersecting the sample polygons (i.e., our fields).
    *   Since our fields are clustered together, this finds a reasonable number of images and iterates over them, extracting data from each.
    *   We use this in the tutorial since the sample from the Montana fields database is geographically constrained.
*   **`sparse_sample_etf`**:
    *   This function assumes the samples (fields) are spread out over many different Landsat images.
    *   It runs sample-by-sample, finding Landsat images overlapping each sample and extracting from them.
    *   This is used when we extract data for John Volk's flux data set, whose sites are widely spaced across the conterminous US.
*   **`export_etf_images`**:
    *   This function exports the Landsat images themselves, clipped to the bounds of a 'hopefully' clustered set of sample polygons.
    *   This is helpful for experimentation with buffering zones and so on, but not meant for large numbers of samples.
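
As a rough aid for choosing between the clustered and sparse functions, the sketch below measures how far apart the sample centroids are and compares that to the approximate width of a Landsat scene (about 185 km). This is only a rule of thumb assumed for illustration; it is not logic taken from SWIM-RS, and `suggest_extractor` and its threshold are hypothetical.

```python
import math

LANDSAT_SCENE_KM = 185.0  # approximate Landsat scene width


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometers."""
    r = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def max_extent_km(centroids):
    """Diagonal of the bounding box around a list of (lon, lat) centroids."""
    lons = [c[0] for c in centroids]
    lats = [c[1] for c in centroids]
    return haversine_km(min(lons), min(lats), max(lons), max(lats))


def suggest_extractor(centroids, threshold_km=LANDSAT_SCENE_KM):
    """Suggest the clustered function for compact samples, else the sparse one."""
    if max_extent_km(centroids) <= threshold_km:
        return 'clustered_sample_etf'
    return 'sparse_sample_etf'


if __name__ == '__main__':
    # illustrative coordinates only
    print(suggest_extractor([(-112.10, 46.20), (-112.00, 46.30)]))
    print(suggest_extractor([(-112.0, 46.0), (-81.0, 28.0)]))
```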

In [3]:
ee_account = 'ee-dgketchum'

# Define Constants and Remote Sensing Data Paths
IRR = 'projects/ee-dgketchum/assets/IrrMapper/IrrMapperComp'
ETF = 'projects/usgs-gee-nhm-ssebop/assets/ssebop/landsat/c02'

root = '/media/research/IrrigationGIS/swim'
bucket = 'wudr'
fields = 'projects/ee-dgketchum/assets/swim/mt_sid_boulder'

### Notes
- Ensure that your Cloud Storage bucket has the correct permissions for Earth Engine to write to it.
- You can modify parameters in the `clustered_sample_etf` function for different masking and debugging options.
- The data is downloaded by year.

In [4]:
# Here, we run the clustered_sample_etf function on the uploaded asset.

# every sample is divided into a 'purely' irrigated section (i.e., 'irr') and an unirrigated one (i.e., 'inv_irr')
# this allows us to build a model for irrigated areas that isn't contaminated by unirrigated areas.
# for this tutorial, we're going to use both
for mask in ['inv_irr', 'irr']:

    # the 'check_dir' will check the planned directory for the existence of the data
    # if a run fails for some reason, move what is complete from the bucket to the directory, then rerun
    # this will skip what's already there
    chk = os.path.join(root, 'examples/tutorial/landsat/extracts/etf/{}'.format(mask))

    # create the directory if it's not already there
    os.makedirs(chk, exist_ok=True)

    # pass the directory as 'check_dir' so a rerun skips years that are already present locally
    clustered_sample_etf(fields, bucket, debug=False, mask_type=mask, check_dir=chk, start_yr=2004, end_yr=2023)

etf_inv_irr_2004
etf_inv_irr_2005
etf_inv_irr_2006
etf_inv_irr_2007
etf_inv_irr_2008
etf_inv_irr_2009
etf_inv_irr_2010
etf_inv_irr_2011
etf_inv_irr_2012
etf_inv_irr_2013
etf_inv_irr_2014
etf_inv_irr_2015
etf_inv_irr_2016
etf_inv_irr_2017
etf_inv_irr_2018
etf_inv_irr_2019
etf_inv_irr_2020
etf_inv_irr_2021
etf_inv_irr_2022
etf_inv_irr_2023
etf_irr_2004
etf_irr_2005
etf_irr_2006
etf_irr_2007
etf_irr_2008
etf_irr_2009
etf_irr_2010
etf_irr_2011
etf_irr_2012
etf_irr_2013
etf_irr_2014
etf_irr_2015
etf_irr_2016
etf_irr_2017
etf_irr_2018
etf_irr_2019
etf_irr_2020
etf_irr_2021
etf_irr_2022
etf_irr_2023
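
The task names printed above follow the pattern `etf_{mask}_{year}`. A small sketch like the following can list the names you expect and report which ones are not yet present in a local directory, which is handy after a partially failed run. It assumes each export lands as a `.csv` file named after its task; the helper names are hypothetical and not part of SWIM-RS.

```python
import os


def expected_exports(masks, start_yr, end_yr, prefix='etf'):
    """Names of the per-year exports, one per mask and year (inclusive)."""
    return [f'{prefix}_{mask}_{yr}'
            for mask in masks
            for yr in range(start_yr, end_yr + 1)]


def missing_exports(directory, names, ext='.csv'):
    """Return the names whose files are not found in `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [n for n in names if n + ext not in present]


if __name__ == '__main__':
    names = expected_exports(['inv_irr', 'irr'], 2004, 2023)
    print(len(names), 'exports expected')
    # check the current directory as a stand-in for a local extracts directory
    print(len(missing_exports('.', names)), 'not found locally')
```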


## 4. Move Data from Bucket to Local Computer

We can monitor the data extraction on the Earth Engine code editor (https://code.earthengine.google.com/) or with their task monitor (https://code.earthengine.google.com/tasks).
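
Task progress can also be checked from Python. The sketch below tallies task states for exports whose description starts with `etf_`. The summary function works on plain dictionaries so it can be tried without a connection; `fetch_statuses` uses `ee.batch.Task.list()` and `Task.status()` from the Earth Engine Python API and assumes you are already authenticated. Both helper names are hypothetical.

```python
from collections import Counter


def summarize_states(statuses, prefix='etf_'):
    """Count task states for tasks whose description starts with `prefix`."""
    return Counter(s.get('state', 'UNKNOWN')
                   for s in statuses
                   if s.get('description', '').startswith(prefix))


def fetch_statuses():
    """Fetch status dictionaries for this account's Earth Engine tasks."""
    import ee  # imported here so the summary can be used without Earth Engine
    ee.Initialize()
    return [task.status() for task in ee.batch.Task.list()]


if __name__ == '__main__':
    # illustrative statuses; replace with fetch_statuses() when authenticated
    sample = [
        {'description': 'etf_irr_2004', 'state': 'COMPLETED'},
        {'description': 'etf_irr_2005', 'state': 'RUNNING'},
    ]
    print(dict(summarize_states(sample)))
```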

Once the export is complete, we can move the data from the bucket to the local directory in one of three ways:

1. Download the data manually: click the 'Open in GCS' button on the task list in the code editor, then click the download button on the bucket file.
2. Move the data from the command line (requires the Google Cloud SDK, https://cloud.google.com/sdk/docs/install) using, e.g., 'gsutil cp gs://wudr/etf_inv_irr*.csv /media/research/IrrigationGIS/swim/examples/tutorial/landsat/extracts/etf/inv_irr/'
3. Move the data programmatically using Python's subprocess (also requires the Google Cloud SDK). You can use `list_and_copy_gcs_bucket`. Use 'dry_run=True' to just list what will get copied, and 'dry_run=False' to download the data:

In [4]:
command = '/home/dgketchum/google-cloud-sdk/bin/gsutil'
for mask in ['inv_irr', 'irr']:
    dst = os.path.join(root, 'examples/tutorial/landsat/extracts/etf/{}'.format(mask))
    glob_ = f'etf_{mask}'

    # list the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=True)


gs://wudr/etf_inv_irr_2004.csv
gs://wudr/etf_inv_irr_2005.csv
gs://wudr/etf_inv_irr_2006.csv
gs://wudr/etf_inv_irr_2007.csv
gs://wudr/etf_inv_irr_2008.csv
gs://wudr/etf_inv_irr_2009.csv
gs://wudr/etf_inv_irr_2010.csv
gs://wudr/etf_inv_irr_2011.csv
gs://wudr/etf_inv_irr_2012.csv
gs://wudr/etf_inv_irr_2013.csv
gs://wudr/etf_inv_irr_2014.csv
gs://wudr/etf_inv_irr_2015.csv
gs://wudr/etf_inv_irr_2016.csv
gs://wudr/etf_inv_irr_2017.csv
gs://wudr/etf_inv_irr_2018.csv
gs://wudr/etf_inv_irr_2019.csv
gs://wudr/etf_inv_irr_2020.csv
gs://wudr/etf_inv_irr_2021.csv
gs://wudr/etf_inv_irr_2022.csv
gs://wudr/etf_inv_irr_2023.csv
gs://wudr/etf_irr_2004.csv
gs://wudr/etf_irr_2005.csv
gs://wudr/etf_irr_2006.csv
gs://wudr/etf_irr_2007.csv
gs://wudr/etf_irr_2008.csv
gs://wudr/etf_irr_2009.csv
gs://wudr/etf_irr_2010.csv
gs://wudr/etf_irr_2011.csv
gs://wudr/etf_irr_2012.csv
gs://wudr/etf_irr_2013.csv
gs://wudr/etf_irr_2014.csv
gs://wudr/etf_irr_2015.csv
gs://wudr/etf_irr_2016.csv
gs://wudr/etf_irr_2017.csv
gs

In [5]:
command = '/home/dgketchum/google-cloud-sdk/bin/gsutil'
for mask in ['inv_irr', 'irr']:
    
    dst = os.path.join(root, 'examples/tutorial/landsat/extracts/etf/{}'.format(mask))
    glob_ = f'etf_{mask}'

    # copy the data
    list_and_copy_gcs_bucket(command, bucket, dst, glob=glob_, dry_run=False)

Copying gs://wudr/etf_inv_irr_2004.csv: Copying gs://wudr/etf_inv_irr_2004.csv...
/ [1 files][ 52.6 KiB/ 52.6 KiB]                                                
Operation completed over 1 objects/52.6 KiB.                                     

Copying gs://wudr/etf_inv_irr_2005.csv: Copying gs://wudr/etf_inv_irr_2005.csv...
/ [1 files][ 50.9 KiB/ 50.9 KiB]                                                
Operation completed over 1 objects/50.9 KiB.                                     

Copying gs://wudr/etf_inv_irr_2006.csv: Copying gs://wudr/etf_inv_irr_2006.csv...
- [1 files][ 53.6 KiB/ 53.6 KiB]                                                
Operation completed over 1 objects/53.6 KiB.                                     

Copying gs://wudr/etf_inv_irr_2007.csv: Copying gs://wudr/etf_inv_irr_2007.csv...
/ [1 files][ 54.1 KiB/ 54.1 KiB]                                                
Operation completed over 1 objects/54.1 KiB.                                     

Copying gs://wud