![Mountain Photo](mountain.jpg)

# Capstone Project: Predicting Snow Depth in the Wasatch Mountain Range, UT, USA

**Subtitle:** A comprehensive analysis using Machine Learning techniques such as Random Forest and Bayesian Optimization  
**Author:** Audrey Malloy 

**Date:** April 8th, 2025

## Executive Summary

- **Problem Statement:** Accurately predicting snow depth is critical for environmental monitoring, watershed resource management, and decision-making.
- **Objective:** Develop a machine learning model that predicts snow depth with high accuracy.
- **Key Findings:**
  - Best model achieved an R² score of 0.99 and RMSE of ~3.00.
  - Feature importance analysis identified month, precipitation accumulation, and air temperature as critical contributors.
- **Recommendations:** 
  1. Implement real-time snow depth predictions.
  2. Use scenario analysis for drought and climate change preparedness.  
  3. Collect additional data for improved model accuracy.

## Introduction

**Problem Statement:**  
The unpredictable nature of snow depth poses challenges for resource planning in water management and recreational industries. Current prediction methods often lack accuracy in varying climates. A model that improves snow depth forecasts could help optimize water usage and support sustainable tourism. 

**Why It Matters:**  
Accurate snow depth predictions support disaster preparedness, resource management, and operational efficiency in snow-sensitive industries. Snow plays a critical role in maintaining a healthy ecosystem and adds vital moisture to the arid environment of Northern Utah. Snowmelt is also essential for regional watersheds that depend on runoff during the summer months. In particular, the Wasatch Range is a critical water source for major metropolitan areas like Salt Lake City. Snow is also a key economic resource, with the ski industry a main driver of tourism. For instance, Utah’s 2023-24 ski season attracted 6.7 million visitors. Given the rising interest in winter recreation, changes in snow depth may significantly impact local communities, industries, and governments. 

## Methodology

### Data Collection:
- Meteorological data (precipitation, air temperature, soil measurements)
- Geographical data (elevation, station names)
---
### Data Summary:
#### General Information
- **Date**: Represents the observation date.
- **Station Name**: Name of the weather station.
- **Elevation**: Elevation of the station above sea level.
- **Latitude**: Latitude coordinate of the station.
- **Longitude**: Longitude coordinate of the station.

#### Snow and Precipitation Metrics
- **Snow Depth**: Measurement of snow depth.
- **Precipitation Accumulation**: Total precipitation measured over a time period.
- **Precipitation Increment**: Incremental change in precipitation.

#### Air Temperature Metrics
- **Air Temperature Average**: Average air temperature over the observation period.
- **Air Temperature Max**: Maximum air temperature recorded.
- **Air Temperature Min**: Minimum air temperature recorded.
- **Air Temperature Observations**: Total air temperature observations recorded.

#### Soil Temperature and Moisture Metrics
- **Soil Temperature Observations**: Observations of soil temperature at the station.
- **Soil Moisture Average**: Average soil moisture over the observation period.
- **Soil Moisture Max**: Maximum soil moisture observed.
- **Soil Moisture Min**: Minimum soil moisture observed.
- **Soil Temperature Average**: Average soil temperature over the observation period.
- **Soil Temperature Max**: Maximum soil temperature recorded.
- **Soil Temperature Min**: Minimum soil temperature recorded.

#### Short-Term Metrics (7-Day Observations)
- **7-Day Air Temperature Average**  
- **7-Day Precipitation Average**  
- **7-Day Snow Depth Average**  
- **7-Day Soil Temperature Average**  
- **7-Day Standard Deviations**:
  - Air Temperature, Precipitation, Snow Depth, and Soil Temperature.
- **7-Day Variances**:
  - Metrics for air temperature, precipitation, snow depth, and soil temperature.
- **7-Day Sums**:
  - Sum metrics for air temperature, precipitation, snow depth, and soil temperature.
- **7-Day Medians**:
  - Median metrics for air temperature, precipitation, snow depth, and soil temperature.
- **7-Day Min and Max**:
  - Minimum and maximum values for air temperature, precipitation, snow depth, and soil temperature.

#### Long-Term Metrics (30-Day Observations)
- **30-Day Air Temperature Average**  
- **30-Day Precipitation Average**  
- **30-Day Snow Depth Average**  
- **30-Day Soil Temperature Average**  
- **30-Day Standard Deviations**:
  - Air Temperature, Precipitation, Snow Depth, and Soil Temperature.
- **30-Day Variances**:
  - Metrics for air temperature, precipitation, snow depth, and soil temperature.
- **30-Day Sums**:
  - Sum metrics for air temperature, precipitation, snow depth, and soil temperature.
- **30-Day Medians**:
  - Median metrics for air temperature, precipitation, snow depth, and soil temperature.
- **30-Day Min and Max**:
  - Minimum and maximum values for air temperature, precipitation, snow depth, and soil temperature.

#### Additional Features
- **Month**: Month of the observation.
- **Year**: Year of the observation.

### Data Preprocessing:
- Handling missing values.
- Scaling features and creating dummy variables for categorical features.
- Windowing the time series into 7-day and 30-day windows, each summarized by variance, standard deviation, mean, max, min, and sum.
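
The windowing and dummy-variable steps above can be sketched with pandas. This is a minimal illustration on synthetic data, not the project's actual pipeline: the raw column names (`date`, `station_name`, `airtemp`) and the helper `add_window_features` are assumptions, and the sketch assumes one gap-free row per station per day.

```python
import numpy as np
import pandas as pd

# Label used in the feature name -> pandas rolling method.
STATS = {"avg": "mean", "std": "std", "var": "var", "max": "max", "min": "min", "sum": "sum"}


def add_window_features(df, column, windows=(7, 30)):
    """Add per-station trailing-window statistics of `column` (hypothetical helper)."""
    df = df.sort_values(["station_name", "date"]).copy()
    grouped = df.groupby("station_name")[column]
    for days in windows:
        for label, method in STATS.items():
            # transform() returns values aligned to the original rows, so each
            # window only ever sees data from its own station.
            df[f"{days}d_{column}_{label}"] = grouped.transform(
                lambda s, d=days, m=method: getattr(s.rolling(d, min_periods=1), m)()
            )
    return df


# Synthetic daily data for two stations (values are made up).
dates = pd.date_range("2020-01-01", periods=40, freq="D")
raw = pd.DataFrame({
    "date": list(dates) * 2,
    "station_name": ["Brighton"] * 40 + ["Dry Fork"] * 40,
    "airtemp": np.random.default_rng(0).normal(25, 8, size=80),
})

features = add_window_features(raw, "airtemp")
# One-hot encode the categorical station column.
features = pd.get_dummies(features, columns=["station_name"])
```

Grouping by station before rolling keeps one station's readings from leaking into another station's window. With `min_periods=1`, the first rows of each station use shorter windows instead of producing missing values for most statistics.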



# Exploratory Data Analysis (EDA) Report

Exploratory Data Analysis (EDA) was conducted to gain insights into the snow depth dataset, identify trends, detect anomalies, and evaluate relationships between variables. This process guided feature selection and informed modeling decisions for snow depth predictions.

---

## 1. Variable Analysis

### 1.1 Time Series Exploration
- **Snowfall Months (October-May)**:
  - The dataset was limited to months with consistent snowfall (October-May) to avoid skewness from snow-free periods.
    

![Time Series Snow Depth Graph](time_snowdepth.jpg)

---

- **Seasonal trends**:
  - Only months with winter snow conditions were kept so that snow-free months would not skew the results.

![Snow Depth Winter Data vs Full Seasons Data](winter_vs_year.jpg)
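
Restricting the data to the October-May snow season could look like the following sketch. The `date` column name and the helper `keep_snow_months` are assumptions for illustration.

```python
import pandas as pd

SNOW_MONTHS = [10, 11, 12, 1, 2, 3, 4, 5]  # October through May


def keep_snow_months(df, date_col="date"):
    """Return only the rows whose date falls in October-May (hypothetical helper)."""
    return df[df[date_col].dt.month.isin(SNOW_MONTHS)].copy()


# One synthetic row per day of a non-leap year.
df = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-12-31", freq="D")})
winter = keep_snow_months(df)
```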


### 1.2 Station Data
- **Station Removal**:
  - Stations with insufficient data coverage or short timeframes were removed to ensure the dataset was robust and reliable. Mill Creek Canyon and Powder Mountain are newer stations and did not have established data for the study's date range.

| Station Name       | Count   | Mean       | Std Dev    |
|--------------------|---------|------------|------------|
| Brighton           | 3654.0  | 20.605911  | 25.590131  |
| Dry Fork           | 3654.0  | 14.387521  | 18.978090  |
| Mill Creek Canyon  | 195.0   | 4.984615   | 8.214569   |
| Mill-D North       | 3654.0  | 22.810619  | 28.538768  |
| Powder Mountain    | 1272.0  | 28.857704  | 35.025822  |
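
A per-station summary like the table above can be produced with a pandas `groupby`. The sketch below uses synthetic snow depths, reuses two of the row counts from the table, and applies a hypothetical coverage cutoff (`MIN_OBSERVATIONS`); the report does not state the threshold that was actually used.

```python
import numpy as np
import pandas as pd

MIN_OBSERVATIONS = 3000  # hypothetical cutoff, not taken from the report


def station_summary(df, min_count=MIN_OBSERVATIONS):
    """Count, mean, and std of snow depth per station, plus a keep/drop flag."""
    summary = df.groupby("station_name")["snow_depth"].agg(["count", "mean", "std"])
    summary["keep"] = summary["count"] >= min_count
    return summary


# Synthetic snow depths; only the row counts mirror the table.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "station_name": ["Brighton"] * 3654 + ["Mill Creek Canyon"] * 195,
    "snow_depth": rng.gamma(1.0, 20.0, size=3654 + 195),
})

summary = station_summary(df)
kept = df[df["station_name"].isin(summary.index[summary["keep"]])]
```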

### 1.3 Distribution Patterns
- Histograms and scatter plots were used to examine the distributions of variables such as snow depth, air temperature, and precipitation. The features and the target variable appeared normally distributed, so no data transformations were recommended.

---

## 2. Relationships Between Variables and Target

### 2.1 Temperature Observations
- **Analysis of Observation Windows**:
  - Evaluated air temperature metrics over 7-day and 30-day windows.
  - Insights:
    - 7-day metrics demonstrated stronger relevance to snow depth compared to 30-day metrics.

![7-day 30-day comparison](windows_airtemp.jpg)
  

### 2.2 Snow Depth Distribution
- **By Elevation and Stations**:
  - Violin plots and box plots revealed snow depth variability across different stations (e.g., Brighton, Mill-D North, Dry Fork) and elevation ranges.
  - Insights:
    - Higher elevations consistently showed greater snow depth.

![Station vs Snowdepth Violin Plot](station_violinplot.jpg)

### 2.3 Correlation Analysis
#### **Strong Positive Correlations**
- **7-Day Snow Depth Metrics**:
  - `7d_snowdepth_avg (0.988)`, `7d_snowdepth_sum (0.988)`, and `7d_snowdepth_max (0.988)` showed extremely strong correlations with the target variable.
  - *Insight*: Short-term metrics are closely tied to snow depth.
- **30-Day Snow Depth Metrics**:
  - `30d_snowdepth_min (0.923)` and `30d_snowdepth_avg (0.916)` demonstrated strong correlations, though slightly weaker than 7-day metrics.

#### **Moderate Positive Correlations**
- **Precipitation Variables**:
  - Metrics such as `30d_precip_std (0.620)` and `precip_accumulation (0.589)` showed moderate correlations.

#### **Negative Correlations**
- **Air Temperature**:
  - Metrics like `airtemp_avg (-0.279)` showed an inverse relationship with snow depth.
- **Soil Temperature**:
  - Features like `soiltemp_max (-0.454)` exhibited a stronger negative relationship compared to air temperature.

#### **Weak Correlations**
- **Geographical Variables**:
  - `Latitude (0.239)` and `elevation (0.236)` suggested weak-to-moderate influences.
- **Soil Moisture Variables**:
  - Metrics such as `soilmoisture_avg (0.073)` exhibited very weak correlations.
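
Correlations with the target, like those listed above, can be ranked as in the following sketch. The data are synthetic and the coefficients will not match the reported values; the target column name `snow_depth` and the helper `correlations_with_target` are assumptions, while the feature names follow the ones quoted in this section.

```python
import numpy as np
import pandas as pd


def correlations_with_target(df, target="snow_depth"):
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest by absolute value."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic data: one feature that tracks the target closely, one that is
# inversely related, and one that is unrelated.
rng = np.random.default_rng(0)
depth = rng.gamma(2.0, 10.0, size=500)
df = pd.DataFrame({
    "snow_depth": depth,
    "7d_snowdepth_avg": depth + rng.normal(0, 2, size=500),
    "airtemp_avg": 40 - 0.3 * depth + rng.normal(0, 10, size=500),
    "soilmoisture_avg": rng.normal(20, 5, size=500),
})

ranked = correlations_with_target(df)
```

Sorting by absolute value keeps strong negative relationships, such as temperature, near the top of the list instead of at the bottom.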

---

## Key Insights

1. **Short-Term Metrics Dominate**:
   - Snow depth-specific 7-day metrics had the strongest correlations with the target variable, emphasizing the importance of recent conditions.
2. **Negative Impact of Temperature**:
   - Higher soil and air temperatures consistently correlated negatively with snow depth, reflecting snowmelt effects.
3. **Precipitation’s Role**:
   - While precipitation impacts snow depth, its correlation was moderate compared to snow depth-specific features.
4. **Snow Depth and Elevation**:
   - Snow depth increased significantly with elevation across stations, particularly Brighton and Mill-D North.

---

EDA provided valuable insights into the key factors influencing snow depth, such as short-term snow metrics, soil and air temperature, and elevation. These findings guided feature selection and model development, ensuring robust and accurate snow depth predictions.



## Model Selection:
- **Model:** Random Forest Regressor (RMSE ~3.00)
- **Optimization Methods:** Bayesian Optimization (Optuna)

### Feature Importance Analysis:
#### Highly Important Features
- **Month (0.389)**: Strong seasonal influence on snow depth.
- **Precipitation Accumulation (0.353)**: Direct contributor to snowpack formation.
- **30-Day Air Temperature Metrics**:
  - **30d_airtemp_max (0.071)**: Helps understand snow accumulation and melting patterns.
  - **30d_airtemp_median (0.061)**: Influences snowmelt rates.

#### Moderately Important Features
- **30-Day Air Temperature Metrics**:
  - **30d_airtemp_avg (0.017)**: Captures sustained air temperature trends.
  - **30d_airtemp_sum (0.017)**: Summarizes broader temperature patterns.
  - **30d_airtemp_min (0.013)**: Critical for determining snowfall versus rainfall conditions.
- **Soil Moisture (0.012)**: Reflects conditions affecting snow retention or melting.
- **Year (0.016)**: Captures long-term climatic trends.

#### Less Important Features
- **Air and Soil Temperature Variables**:
  - **airtemp_obs (0.0017)**: Minimal contribution compared to broader metrics.
  - **soiltemp_avg (0.0011)**: Secondary effect on snow predictions.
- **Precipitation Increment (0.0006)**: Short-term changes are less impactful compared to cumulative precipitation.

---

### Key Insights
- **Seasonality Dominates**: The temporal feature **Month** and **Precipitation Accumulation** are the most significant predictors.
- **Air and Soil Temperatures**: Moderate to low importance, with soil temperature showing slightly stronger relevance to snowmelt.
- **Efficiency Recommendation**: Consider eliminating features with very low importance (e.g., **precip_increment**, **soiltemp_avg**) to simplify the model and improve efficiency.

![Feature Importance Graph](feature_importance.png)
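
Feature importances like those above can be read from a fitted scikit-learn Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which `precip_increment` is deliberately pure noise; the 0.01 cutoff for dropping features is a hypothetical value, not one taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: the target depends on month and accumulated precipitation only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "month": rng.integers(1, 13, size=600),
    "precip_accumulation": rng.uniform(0, 50, size=600),
    "precip_increment": rng.uniform(0, 1, size=600),  # pure noise in this sketch
})
y = 5 * X["month"] + 0.8 * X["precip_accumulation"] + rng.normal(0, 1, size=600)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

THRESHOLD = 0.01  # hypothetical cutoff for "very low importance"
low_importance = ranked[ranked < THRESHOLD].index.tolist()
X_reduced = X.drop(columns=low_importance)
```

After dropping features, the model should be refit and re-evaluated to confirm that the error does not get worse.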

### Evaluation Metrics:
-  Root Mean Squared Error (RMSE): 
    - Quantifies Error Magnitude: RMSE calculates the average magnitude of errors between predicted and observed values. By taking the square root of the mean of squared errors, it gives a measure in the same units as the variable being predicted (e.g., centimeters for snow depth).
    - Penalizes Large Errors: Because errors are squared before averaging, RMSE gives more weight to larger errors. This helps highlight situations where predictions deviate significantly from actual values, making it useful for scenarios where accuracy is critical.
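
As a small worked example of both points, RMSE is the square root of the mean of the squared differences between observed and predicted values. The numbers below are made up for illustration: both prediction sets have the same mean absolute error of 2, but the set with a single large miss has the higher RMSE.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


observed = [10, 20, 30, 40]

# Four errors of 2 units each: sqrt(mean([4, 4, 4, 4])) = 2
even_errors = rmse(observed, [12, 22, 32, 42])

# One error of 8 units, three perfect predictions: sqrt(64 / 4) = 4
one_big_error = rmse(observed, [10, 20, 30, 48])
```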

### Model Evaluation
Several regression models were tested to determine the best-performing approach:
1. **Decision Tree Regressor**: Handles non-linear relationships between features effectively and can capture the complex environmental factors influencing snow depth.
2. **Support Vector Machine (SVM)**: Effective at finding patterns in high-dimensional, complex data, such as snow depth influenced by geographical and temporal factors.
3. **Random Forest Regressor**: Reduces overfitting relative to a single decision tree, ranks feature importance, and generalizes well across varied scenarios.

### Hyperparameter Tuning
Used multiple optimization methods:
- **GridSearchCV**: Exhaustively explored predefined parameter grids to identify optimal configurations.
- **RandomizedSearchCV**: Sampled random parameter combinations, reducing computation time while maintaining performance.
- **Bayesian Optimization (Optuna)**: Applied probabilistic methods to refine hyperparameters for the Random Forest Regressor, achieving the best metrics.
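
To illustrate the general tuning workflow, here is a minimal sketch of the randomized-search variant with scikit-learn's `RandomizedSearchCV`. It runs on synthetic data, and the search ranges, number of iterations, and number of folds are assumptions chosen to keep the example fast; they are not the settings used in this project, and the Optuna run is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the snow depth data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 50 * X[:, 0] + 20 * X[:, 1] ** 2 + rng.normal(0, 1, size=300)

# Hypothetical search space covering the three tuned hyperparameters.
param_distributions = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # sample 5 of the 48 combinations
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
```

Swapping `RandomizedSearchCV` for `GridSearchCV` (with `param_grid` in place of `param_distributions` and no `n_iter`) evaluates every combination in the grid instead of a random sample.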


## Findings

### Model Performance: 

| **Model**                          | **Root Mean Squared Error (RMSE)** |
|------------------------------------|------------------------------------|
| Randomized Search                  | 3.00                              |
| Grid Search CV                     | 3.00                              |
| Random Forest (no hypertuning)     | 3.03                              |
| Bayesian Optimization              | 3.00                              |
| Support Vector Machine (SVM)       | 9.73                              |
| Decision Tree                      | 4.62                              |

---
### Final Model Selection
The **Random Forest Regressor**, optimized with Bayesian methods, outperformed other models in predictive accuracy and generalization. 

The best hyperparameters found by Bayesian Optimization were:
- `n_estimators` = 179
- `max_depth` = 19
- `min_samples_split` = 3
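
For reference, a Random Forest with these three hyperparameter values can be instantiated in scikit-learn as sketched below. Only the hyperparameter values come from this report; the synthetic data, the 80/20 train/test split, and the random seeds are assumptions, so the resulting RMSE will differ from the reported one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and snow depth target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 5))
y = 60 * X[:, 0] + 30 * X[:, 1] * X[:, 2] + rng.normal(0, 2, size=500)

# Assumed 80/20 split; the report does not state the split that was used.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(
    n_estimators=179,      # values reported for the Bayesian-optimized model
    max_depth=19,
    min_samples_split=3,
    random_state=42,
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
test_r2 = model.score(X_test, y_test)  # R² on the held-out data
```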

---
### **Key Takeaways**
- All Random Forest models (with or without hyperparameter tuning) show consistent and low RMSE values, indicating robust performance.
- **Support Vector Machine (SVM)** has the highest RMSE (9.73), suggesting it struggles more with predicting snow depth accurately.
- The **Decision Tree** model performs moderately well with an RMSE of 4.62 but lags behind the Random Forest models.

---
**Conclusion**: Bayesian Optimization and other hyperparameter-tuned Random Forest models provide the best predictive accuracy (RMSE ~3.00). The combination of robust model testing and advanced hyperparameter tuning identified the Random Forest Regressor as the optimal solution for snow depth prediction, ensuring accurate and reliable forecasts tailored to complex environmental dynamics.

<div style="text-align: center;">
    <img src="actual_predicted.jpg" alt="Predicted vs Actual Snow Depth" width="800" height="600">
</div>



### Scenario Analysis:
1. **Seasonal Trends:** Snow depth varies significantly between winter and summer months.
2. **Extreme Weather Events:** High precipitation leads to substantial snow accumulation.
3. **Long-term Air Temperature Effects:** Persistent cold temperatures increase snow depth; warmer conditions result in snow melt.
4. **Soil Temperature Influences:** Frozen soil supports higher snow accumulation than thawed soil.

# Predicted Snow Depth Across Scenarios

### Seasonal Trends
| elevation | precip_accumulation | soiltemp_max | predicted_snow_depth |
|-----------|----------------------|--------------|-----------------------|
| 8750      | 50                   | 10           | 131.555128            |
| 8750      | 5                    | 65           | 1.840968              |

---

### Extreme Weather Events
| elevation | precip_accumulation | soiltemp_max | predicted_snow_depth |
|-----------|----------------------|--------------|-----------------------|
| 8750      | 100                  | 5            | 63.335368             |
| 8750      | 1                    | 25           | 14.357263             |

---

### Long-term Air Temperature Effects
| elevation | precip_accumulation | soiltemp_max | predicted_snow_depth |
|-----------|----------------------|--------------|-----------------------|
| 8750      | 30                   | 0            | 93.563687             |
| 8750      | 30                   | 20           | 35.626638             |

---

### Soil Temperature Influences
| elevation | precip_accumulation | soiltemp_max | predicted_snow_depth |
|-----------|----------------------|--------------|-----------------------|
| 8750      | 30                   | -5           | 94.715084             |
| 8750      | 30                   | 20           | 95.015164             |

---
![Snow Depth across Scenarios Graph](predicted_scenarios.jpg)
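
Scenario tables like the ones above can be generated by building a small DataFrame of hypothetical inputs and passing it to a fitted model. In the sketch below the two scenario rows reuse the inputs from the Seasonal Trends table, but the model is trained on a synthetic relationship, so its predictions will not match the reported values. The `scenario` labels and the three-feature model are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["elevation", "precip_accumulation", "soiltemp_max"]

# Synthetic training data: more precipitation and colder soil mean deeper snow.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "elevation": rng.choice([7500, 8750, 9000], size=400),
    "precip_accumulation": rng.uniform(0, 100, size=400),
    "soiltemp_max": rng.uniform(-5, 70, size=400),
})
snow_depth = (
    1.2 * train["precip_accumulation"] - 0.8 * train["soiltemp_max"] + 40
).clip(lower=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[FEATURES], snow_depth)

# Inputs copied from the Seasonal Trends table; the labels are assumed.
scenarios = pd.DataFrame([
    {"scenario": "winter", "elevation": 8750, "precip_accumulation": 50, "soiltemp_max": 10},
    {"scenario": "summer", "elevation": 8750, "precip_accumulation": 5, "soiltemp_max": 65},
])
scenarios["predicted_snow_depth"] = model.predict(scenarios[FEATURES])
```

Because a Random Forest cannot extrapolate beyond the range of its training data, scenario inputs should stay within the range of conditions the model has seen.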


## Recommendations

1. Use the model for real-time snow depth predictions in environmental monitoring systems.
2. Implement scenario analysis for anticipating extreme snowfall events.
3. Enhance data collection systems to include additional features like wind speed and snowfall type.
4. Tailor the model to specific geographic regions to account for unique environmental factors (e.g., microclimates).
5. Combine snow depth predictions with water resource and hydrology models for comprehensive analysis of snowmelt impacts.

### Further Research Ideas:
- Expand datasets with new meteorological features.
- Explore advanced machine learning algorithms (e.g., deep learning).
- Study the model's performance under various climate scenarios to understand how global warming trends influence snow depth predictions.
- Compare results across diverse geographies to identify universal features vs region-specific drivers of snow depth. 

## Conclusion

The developed model achieves high accuracy in snow depth predictions, proving its value for environmental monitoring, disaster management, and resource optimization. By leveraging Bayesian optimization and feature selection, the model balances complexity and reliability.

The results suggest the importance of integrating environmental, temporal, and statistical data to enhance prediction accuracy. From mitigating risk of extreme snowfall events to supporting sustainable water resource management, this model establishes a strong foundation for future research and advancements. 

The exploration of deep learning algorithms, the incorporation of additional features, and the assessment of climate change impacts are a few examples of studies that could expand on the current analysis. These research directions would further refine the capability of predictive snow depth models and contribute to the overall wealth of knowledge in environmental modeling.  

![Predicted Snow Depth](snow_predictionTIME.jpg)

## References 
- USDA National Water and Climate Center: https://www.nrcs.usda.gov/resources/data-and-reports/snow-and-climate-monitoring-predefined-reports-and-maps