# Machine Learning Methods and Tools

Addressing a key gap in the application of ML for SWE estimation, we introduce a novel two-step ML framework combining gradient boosted decision trees with feature optimization for the selection of training features to inform regionally optimized ANNs.
With a motivation to enhance SWE characterization targeting large-scale water resources management, we validate the framework by evaluating the performance of SWE estimates for 20,000 locations across the western U.S. at a 1-km scale and weekly temporal resolution.
This section explains the methodology for developing the national-scale SWE estimation model and dives deeper into the case application in the Colorado region that this tutorial executes.

## Study Area Description

Model development leverages data from SNOTEL sites located throughout 11 states, ranging from the US-Canada border to the US-Mexico border and from the Pacific coast to the eastern slope of the Rocky Mountains, to establish the study periphery.
The modeling domain includes all major mountain ranges but contains a greater portion of observations within the Sierra Nevada and the Rocky Mountains, where the water supply is dominated by seasonal snowpack and ongoing efforts by the Airborne Snow Observatory (ASO) provide ample data for data-driven model development.


## Machine Learning Approach

Snowfall varies across the western U.S., so we developed a distributed ML SWE estimation model to address the unique hydrometeorological variability observed in the modeling domain.
For example, the western U.S. contains snow climate classifications of coastal, coastal transitional, intermountain, and continental.
The modeling framework addresses the heterogeneity in snow processes by dividing the study area into 23 regions.
Dividing the model into sub-regions separates microclimates, which reduces the influence of one region's dynamics on other regions during model training.
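
The regional split can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the region names, the synthetic elevation and SWE values, and the stand-in least-squares line used in place of the actual regional models. It only shows the idea of training one independent model per region and routing each location to the model for its region.

```python
from collections import defaultdict

# Synthetic, illustrative records: (region, elevation_m, swe_cm).
# Region names and values are made up for this sketch.
records = [
    ("region_a", 2000.0, 20.0),
    ("region_a", 2500.0, 30.0),
    ("region_a", 3000.0, 40.0),
    ("region_b", 1000.0, 50.0),
    ("region_b", 1500.0, 70.0),
    ("region_b", 2000.0, 90.0),
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Split the observations by region so each model only sees its own region.
by_region = defaultdict(list)
for region, elevation, swe in records:
    by_region[region].append((elevation, swe))

# Train one independent (stand-in) model per region.
regional_models = {}
for region, rows in by_region.items():
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]
    regional_models[region] = fit_line(xs, ys)


def predict(region, elevation):
    """Route a location to the model trained for its region."""
    slope, intercept = regional_models[region]
    return slope * elevation + intercept


# The same elevation yields different estimates in different regions.
print(predict("region_a", 2000.0), predict("region_b", 2000.0))
```

Because each model is fit only on its own region's records, the elevation-SWE relationship learned in one region does not leak into another.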

<img align = 'center' src="./Images/Distribution.jpg" alt = 'drawing' width = '1000'/>

## Data
Machine learning models "learn" the relationships between independent and dependent variables through large amounts of data.
Data sourced for the model consisted of geographic and topographic information from the Copernicus Digital Elevation Model (90-m DEM) and ground measurement data from the NRCS Snow Telemetry and Snow Course program (i.e., SNOTEL), as well as from the California Department of Water Resources California Data Exchange Center (CDEC). 
In total, geographic and weekly SWE observational data from 594 SNOTEL sites and 106 CDEC sites from 2013-2017 are collated.
Weekly observations of the most recent date available at the same locations support near-real-time model inference. 

Model development investigated the use of climatological data from the National Oceanic and Atmospheric Administration's (NOAA) High-Resolution Rapid Refresh dataset (HRRR) and multispectral imagery from the Sentinel-1 remote sensing satellite mission.
However, the large computational resources required to load and process the data became limiting to the production of expeditious SWE estimates at scale.
The limitations were the result of the large-memory tiled datasets and the need for point-scale architecture required by the ML models to train, test, and predict.
The data processing requirements to convert observations into training datasets are computationally expensive and unfeasible for fast-paced development, even for high-performance computing. 

<img align = 'center' src="Images/Distribution_locations_number.jpg" alt = 'drawing' width = '1000'/>




## GeoWeaver


The model has been adapted to the open-source workflow management system [GeoWeaver](https://github.com/ESIPFed/Geoweaver). Geoweaver is a web-based application for interactive, full-stack machine learning workflow management. The app provides a user-friendly GUI for interacting with workflows and persistently stores script logs and code revision histories. The platform supports shell and Python scripts, as well as seamless integration with Jupyter.

A detailed overview of using the model within Geoweaver is provided in Chapter 7, [Workflow Management and Cloud Computing](./workflow.ipynb).



<img align = 'center' src="./Images/GeoWeaver.JPG" alt = 'drawing' width = '500'/>

## Machine Learning Models
There are many types of machine learning models for different applications, such as classification, regression, and clustering.
Because SWE is a continuous quantity, predicting 1-km gridded SWE is a regression problem.
While there are many regression-based machine learning algorithms, we use the [Light Gradient Boosting Machine (LightGBM)](https://lightgbm.readthedocs.io/en/v3.3.2/) and [multilayer perceptron (MLP) networks](https://www.tensorflow.org/tutorials/keras/regression).
Below is a brief description of each machine learning modeling methodology.


### LightGBM
Gradient boosted decision trees (GBDT) are a machine learning algorithm exhibiting impressive performance across various classification and regression applications.
The algorithm builds its solution from an ensemble of learners: weak learner trees, each trained on the residuals of the model built so far, are added iteratively to minimize the overall loss function (negative root-mean-squared error) via gradient descent.
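
The residual-fitting loop can be sketched in plain Python. This is a minimal illustration and not the LightGBM implementation: it assumes a single input feature, one-split trees (stumps), a squared-error loss, a constant initial prediction, and synthetic data. The names `fit_stump`, `boost`, and `rmse` are hypothetical helpers.

```python
import math


def fit_stump(xs, residuals):
    """Find the single-threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared-error loss with one-split trees."""
    base = sum(ys) / len(ys)  # assumed starting point: a constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Take a small step in the direction the new weak learner suggests.
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def rmse(ys, preds):
    """Root-mean-squared error between targets and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


# Synthetic data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.0, 5.0, 4.0, 9.0, 11.0, 10.0, 14.0]

base, stumps, preds = boost(xs, ys)
print("RMSE of constant model:", rmse(ys, [base] * len(ys)))
print("RMSE after boosting:   ", rmse(ys, preds))
```

Each round fits a weak learner to what the current ensemble still gets wrong, and the learning rate keeps any single tree from dominating the result.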

The LightGBM framework is an evolution of GBDT, and introduces Gradient-based One-Side Sampling (GOSS) to the boosting algorithm. 
GOSS focuses model learning on the training instances with larger gradients and randomly samples from the instances with small gradients, which provides a more efficient gain estimation than traditional gradient boosting while maintaining accuracy.
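
The sampling step of GOSS can be sketched as follows. This is a simplified illustration in plain Python, not LightGBM's internal code: the function name `goss_sample` is hypothetical, the gradients are synthetic, and the default rates of 0.2 and 0.1 are assumptions chosen to mirror LightGBM's `top_rate` and `other_rate` parameters.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top = order[:n_top]      # always kept: the poorly fit instances
    rest = order[n_top:]     # small-gradient instances, sampled at random

    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)

    # Up-weight the sampled small-gradient instances so that their total
    # contribution to the gain estimate stands in for the whole group.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights


# Synthetic gradients for illustration: instance i has gradient i / 100.
gradients = [i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
print(len(indices), "of", len(gradients), "instances used for this tree")
```

Only the selected instances, with their weights, would be used to evaluate candidate splits for the next tree, which is where the speed-up comes from.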

### MLP
The MLP is a classical type of feedforward ANN that has been applied successfully and frequently in environmental modeling applications.
The MLP regression model estimates a target variable by learning a non-linear function to describe the target from an input vector of features.
It performs learning via a back-propagation algorithm over a series of hidden layers containing interconnected nodes (neurons). 
Neurons in adjacent layers are connected by weights: each neuron computes a weighted sum of its inputs, and an activation function transforms the result before the output layer produces the predicted values.
During training, the model calculates the prediction error and adjusts the weights to minimize it, which supports the use of MLPs to describe a target variable with any function, continuous or discontinuous.
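
The forward pass and weight update described above can be sketched with a tiny network in plain Python. This is an illustration under stated assumptions and not the tutorial's TensorFlow model: it uses one input feature, one hidden layer of four `tanh` neurons, a linear output, a squared-error loss, per-sample gradient descent, and synthetic data. The function name `train_mlp` is hypothetical.

```python
import math
import random


def train_mlp(xs, ys, n_hidden=4, learning_rate=0.05, epochs=500, seed=0):
    """Train a one-hidden-layer MLP with back-propagation; return epoch losses."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]  # input -> hidden
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]  # hidden -> output
    b2 = 0.0

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: weighted sum, then activation, layer by layer.
            hidden = [math.tanh(w1[j] * x + b1[j]) for j in range(n_hidden)]
            pred = sum(w2[j] * hidden[j] for j in range(n_hidden)) + b2

            # Error for the loss 0.5 * (pred - y) ** 2.
            err = pred - y
            total += err * err

            # Backward pass: propagate the error and adjust every weight.
            for j in range(n_hidden):
                # d(tanh(z))/dz = 1 - tanh(z) ** 2
                grad_hidden = err * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= learning_rate * err * hidden[j]
                w1[j] -= learning_rate * grad_hidden * x
                b1[j] -= learning_rate * grad_hidden
            b2 -= learning_rate * err
        losses.append(total / len(xs))
    return losses


# Synthetic non-linear target for illustration: y = x ** 2.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x ** 2 for x in xs]

losses = train_mlp(xs, ys)
print("mean squared error, first epoch:", losses[0])
print("mean squared error, last epoch: ", losses[-1])
```

The falling loss shows the network adjusting its weights to approximate a non-linear relationship that a single weighted sum could not represent.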



## Dependencies (versions, environments)
The modeling framework was built using Python 3.8.
Below is a URL-linked list of the packages needed to process the data, train the model, process results, and visualize the model outputs.
Please take the time to review each package to understand its contribution in the machine learning pipeline.

| [os](https://docs.python.org/3/library/os.html)| [ulmo](https://ulmo.readthedocs.io/en/latest/)       | [pandas](https://pandas.pydata.org/)             |[io](https://docs.python.org/3/library/io.html)           | [shapely](https://pypi.org/project/shapely/)    | [datetime](https://docs.python.org/3/library/datetime.html)           |
|:-----------: | :--------: | :----------------: | :-----------: | :--------: | :----------------: |
| [re](https://docs.python.org/3/library/re.html) | [rasterio](https://pypi.org/project/rasterio/)   | [matplotlib.pyplot](https://pypi.org/project/matplotlib/)     | [copy](https://docs.python.org/3/library/copy.html)         | [lightgbm](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html)   |  [numpy](https://numpy.org/)             |
| [time](https://docs.python.org/3/library/time.html)         | [tensorflow](https://www.tensorflow.org/) |  [pystac_client](https://pystac-client.readthedocs.io/en/stable/)     | [tables](https://pypi.org/project/tables/) | [platform](https://docs.python.org/3/library/platform.html)   | [planetary_computer](https://pypi.org/project/planetary-computer/) |
| [xarray](https://pypi.org/project/xarray/)| [tqdm](https://pypi.org/project/tqdm/)       | [random](https://docs.python.org/3/library/random.html)             | [rioxarray](https://pypi.org/project/rioxarray/)    | [geopandas](https://geopandas.org/en/stable/getting_started/install.html)  | [requests](https://pypi.org/project/requests/) |
| [pyproj](https://pypi.org/project/pyproj/)       | [richdem](https://richdem.readthedocs.io/en/latest/)    | [cartopy](https://scitools.org.uk/cartopy/docs/latest/installing.html)            | [h5py](https://www.h5py.org/)         | [elevation](https://pypi.org/project/elevation/)  | [cmocean](https://pypi.org/project/cmocean/)            |
| [mpl_toolkits](https://matplotlib.org/2.2.2/mpl_toolkits/index.html) | [hdfdict](https://pypi.org/project/hdfdict/)    | [warnings](https://docs.python.org/3/library/warnings.html)            | [math](https://docs.python.org/3/library/math.html)         | [pickle](https://docs.python.org/3/library/pickle.html)     |  [contextily](https://contextily.readthedocs.io/en/latest/)        |
|[folium](https://pypi.org/project/folium/)        | [branca](https://pypi.org/project/branca/)     |  [earthpy](https://earthpy.readthedocs.io/en/latest/)           |[netCDF4](https://pypi.org/project/netCDF4/)       | [osgeo](https://pypi.org/project/osgeo/)      | [webbrowser](https://docs.python.org/3/library/webbrowser.html)          |
| [geojson](https://pypi.org/project/geojson/)    | [fiona](https://pypi.org/project/Fiona/)              |    |  |                    | |
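
Before running the notebooks, it can help to check which of these packages are available in the active environment. The sketch below is one possible way to do that with the standard library only. The helper name `missing_packages` is hypothetical, and the list uses the top-level import names from the table (for example `matplotlib` for `matplotlib.pyplot`).

```python
import importlib.util

# Top-level import names taken from the dependency table.
REQUIRED = [
    "os", "ulmo", "pandas", "io", "shapely", "datetime",
    "re", "rasterio", "matplotlib", "copy", "lightgbm", "numpy",
    "time", "tensorflow", "pystac_client", "tables", "platform",
    "planetary_computer", "xarray", "tqdm", "random", "rioxarray",
    "geopandas", "requests", "pyproj", "richdem", "cartopy", "h5py",
    "elevation", "cmocean", "mpl_toolkits", "hdfdict", "warnings",
    "math", "pickle", "contextily", "folium", "branca", "earthpy",
    "netCDF4", "osgeo", "webbrowser", "geojson", "fiona",
]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not found in this environment:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

This only confirms that each package can be located; it does not check versions or that the package imports cleanly.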


Now that we have a general understanding of the model framework, let's dig into a model example for the Sierra Nevada region.
[Model Development](./training.ipynb)
