# Introduction to the Dataset

Understanding **evapotranspiration (ET)** is central to many environmental and water-resources applications. ET represents the combined process through which water leaves the land surface and enters the atmosphere in two ways:

- **Evaporation**: liquid water turning into vapor from soil or wet surfaces  
- **Transpiration**: water released by plants during physiological processes  

Because ET links the **water cycle**, **climate**, and **ecosystem functioning**, it is one of the most important variables for understanding how landscapes respond to environmental changes. ET controls how much water remains available in the soil, how vegetation uses water, and how energy is exchanged between the land surface and the atmosphere. Even small changes in ET can affect drought intensity, agricultural productivity, and regional water budgets.

## Why Study ET in a Semi-Arid Grassland?

This project focuses on the **US-SRG site (Santa Rita Grassland, Arizona)**, a semi-arid grassland ecosystem. These environments are strongly shaped by:
- limited water availability  
- high temperatures  
- strong seasonality driven by monsoon rainfall  

Because water is often the main limiting factor for vegetation growth, semi-arid ecosystems show a very **sensitive and dynamic relationship** between ET and climate. Studying ET here helps us understand how ecosystems cope with water scarcity and how environmental drivers (radiation, temperature, humidity, wind, precipitation) shape ET patterns.

This site was selected because it provides:
- continuous, high-quality environmental observations  
- a water-limited climate where ET responses are ecologically meaningful   

## Data Sources

This notebook uses two complementary datasets:

### 1. AmeriFlux – Environmental and Meteorological Data
AmeriFlux is a long-term network of ecological research towers across North America. It provides high-frequency ground measurements of variables that directly influence ET, including:
- precipitation  
- air temperature  
- solar radiation  
- vapor pressure deficit (VPD)  
- soil moisture and soil temperature   
- wind speed  

For this project, all environmental drivers come from the **US-SRG AmeriFlux site**.  
Reference: Russell Scott (2025). AmeriFlux FLUXNET-1F US-SRG Santa Rita Grassland, Ver. 5-7, AmeriFlux AMP, (Dataset). https://doi.org/10.17190/AMF/2204877

### 2. OpenET – Satellite-Based Evapotranspiration
ET itself is obtained from **OpenET**, a platform that uses satellite data and surface energy-balance modeling to generate reliable monthly and daily ET estimates across the Western United States.

While AmeriFlux provides *local, tower-measured environmental conditions*, OpenET provides **spatially consistent ET estimates**, making it ideal for examining monthly ET dynamics.

Reference: OpenET (2021). OpenET Monthly Evapotranspiration Dataset. OpenET Data Explorer. Retrieved from https://openetdata.org

### Time Window Alignment Between AmeriFlux and OpenET

The two datasets used in this project differ in their temporal coverage:

- **AmeriFlux (environmental drivers)**: monthly data available from 2008 to 2024 for the US-SRG site.  
- **OpenET (evapotranspiration)**: monthly ET estimates available from 2020 to mid-2025.

Because the analysis compares evapotranspiration (ET) with climate variables from AmeriFlux, both datasets must cover **exactly the same date range**. The analysis is therefore restricted to the period where the two records overlap, January 2020 to December 2024, so that each monthly ET value from OpenET can be aligned with the environmental conditions measured at the tower in the same month.
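
The shared window can also be derived from the two coverage ranges instead of being hard-coded. The snippet below is a minimal sketch under the coverage stated above; the OpenET end month (June 2025) is an assumption standing in for "mid-2025", and the variable names are illustrative only.

```python
import pandas as pd

# Coverage of each source as monthly periods.
# The OpenET end month (2025-06) is an assumption standing in for "mid-2025".
amer_months = pd.period_range("2008-01", "2024-12", freq="M")
openet_months = pd.period_range("2020-01", "2025-06", freq="M")

# The common window runs from the later start to the earlier end.
start = max(amer_months[0], openet_months[0])
end = min(amer_months[-1], openet_months[-1])
common = pd.period_range(start, end, freq="M")

print(start, end, len(common))
```

Under these assumptions the overlap runs from January 2020 to December 2024, which matches the date filters applied to both datasets in this notebook.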

## Objective of This Notebook

This first notebook focuses on **data cleaning and preprocessing** of both datasets (AmeriFlux and OpenET). Specifically, it will:

- load the raw environmental and ET datasets  
- inspect structure, variable names, and time coverage  
- convert timestamps and ensure alignment across datasets  
- handle missing or inconsistent values  
- prepare a clean, analysis-ready dataset for later exploration  

This preprocessing step ensures that the dataset is accurate, consistent, and suitable for analyzing how climatic conditions influence ET in a water-limited semi-arid grassland ecosystem.



In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# 1. AMERIFLUX DATA (Environmental Drivers)

In [None]:
# Missing values in AmeriFlux are typically encoded as -9999 or -9999.9, so we convert them to NaN on read.
amer_csv_path = Path("./data/raw/AMF_US-SRG_FLUXNET_FULLSET_MM_2008-2024_5-7.csv")
df_amer = pd.read_csv(amer_csv_path, na_values=[-9999, -9999.9])

# Filter to the project date range: Jan 2020 to Dec 2024
df_amer = (
    df_amer[(df_amer["TIMESTAMP"] >= 202001) & (df_amer["TIMESTAMP"] <= 202412)]
    .copy()
    .reset_index(drop=True)  # drop=True avoids keeping the old index as an extra column
)

df_amer.head()
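
The effect of the `na_values` argument used when reading the AmeriFlux file can be checked on a tiny in-memory CSV. This is only a sketch: the rows below are made up for illustration and are not taken from the US-SRG file.

```python
import io

import pandas as pd

# Tiny made-up CSV using the -9999 sentinel (values are illustrative only).
raw = io.StringIO(
    "TIMESTAMP,TA_F_MDS,TS_F_MDS_1\n"
    "202001,10.5,-9999\n"
    "202002,12.1,11.3\n"
)

demo = pd.read_csv(raw, na_values=[-9999, -9999.9])

# The sentinel is read as NaN, so it cannot distort means or correlations.
print(demo["TS_F_MDS_1"].isna().sum())
```

Without `na_values`, the sentinel would be kept as a real number and would bias any statistic computed on that column.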

## Select and reformat key environmental variables

AmeriFlux uses a standardized FLUXNET/ONEFlux variable naming convention, where each environmental variable is encoded using short, technical labels (e.g., `TA_F_MDS`, `SW_IN_F_MDS`, `VPD_F_MDS`). These codes are extremely useful for data management but can be difficult to interpret during analysis, especially when integrating data from multiple sources.

To make the dataset clearer and easier to work with, we rename these variables with column names that are more intuitive, self-descriptive, and aligned with the focus of this analysis (e.g., `AirTemp_C`, `Rain_mm`, `SoilH2O_m3m3`).

The official AmeriFlux variable descriptions are documented here:  https://ameriflux.lbl.gov/data/aboutdata/data-variables/

Additionally, the AmeriFlux monthly datasets encode time using a `TIMESTAMP` column in the format **YYYYMM**. Before analysis, this format is converted into a proper `datetime` object (the first day of each corresponding month), which enables indexing, resampling, and plotting using Python’s time-series tools.

Below, we create a cleaned subset of the AmeriFlux variables and immediately convert the timestamp into a usable datetime column.
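
As a quick illustration of the timestamp conversion described above, the sketch below applies it to three made-up `YYYYMM` values (not read from the dataset) and confirms that each result falls on the first day of its month.

```python
import pandas as pd

# Illustrative YYYYMM integers, as stored in the monthly TIMESTAMP column.
stamps = pd.Series([202001, 202002, 202412])

# Append "01" so each month maps to its first day, then parse with an explicit format.
dates = pd.to_datetime(stamps.astype(str) + "01", format="%Y%m%d")

print(dates.dt.strftime("%Y-%m-%d").tolist())
print(dates.dt.is_month_start.all())
```

Passing an explicit `format` keeps the parsing unambiguous and avoids pandas guessing the layout of the string.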


In [None]:
# Convert YYYYMM → datetime (assigning day = 1)
date_dt = pd.to_datetime(df_amer["TIMESTAMP"].astype(str) + "01", format="%Y%m%d")

# Selecting key variables
clean_amer = pd.DataFrame(
    {
        "Date": date_dt,  # proper datetime
        "Rain_mm": df_amer["P_F"],  # mm/month
        "AirTemp_C": df_amer["TA_F_MDS"],  # °C
        "SolarRad_kWm2": df_amer["SW_IN_F_MDS"] / 1000,  # W/m² → kW/m²
        "VPD_kPa": df_amer["VPD_F_MDS"] / 10,  # hPa → kPa
        "SoilTemp_10cm_C": df_amer["TS_F_MDS_1"],  # °C
        "SoilH2O_m3m3": df_amer["SWC_F_MDS_1"],  # volumetric soil water content
        "Wind_speed_ms": df_amer["WS_F"],  # m/s
    }
)

clean_amer.head()

## Computing Relative Humidity (RH) from Air Temperature and VPD

AmeriFlux does not directly include **relative humidity (RH)** in the monthly dataset, but it does provide **air temperature (°C)** and **vapor pressure deficit (VPD, kPa)**.  
From these two variables, RH can be estimated using the Tetens equation for saturation vapor pressure.

The saturation vapor pressure is computed as:

$$
e_{\text{sat}} = 0.6108 \cdot \exp\left( \frac{17.27 \cdot T_{\text{air}}}{T_{\text{air}} + 237.3} \right)
$$

where $e_{\text{sat}}$ is the saturation vapor pressure (kPa) and $T_{\text{air}}$ is the air temperature (°C).

Relative humidity is then calculated as:

$$
RH = \left(1 - \frac{VPD}{e_{\text{sat}}}\right) \times 100
$$

Finally, RH values are constrained between **0% and 100%**, ensuring physically meaningful results.
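
To make the two formulas concrete, here is a worked example for a single month. The inputs (25 °C and 1.5 kPa) are illustrative values chosen for the example, not observations from US-SRG.

```python
import math

# Illustrative inputs (not taken from the dataset): a warm, dry month.
tair_c = 25.0  # air temperature, °C
vpd_kpa = 1.5  # vapor pressure deficit, kPa

# Tetens saturation vapor pressure (kPa).
esat = 0.6108 * math.exp((17.27 * tair_c) / (tair_c + 237.3))

# RH from the deficit, clipped to the physical range 0-100 %.
rh = min(max((1 - vpd_kpa / esat) * 100, 0.0), 100.0)

print(round(esat, 2), round(rh, 1))
```

For these inputs the saturation vapor pressure is about 3.17 kPa, so a deficit of 1.5 kPa corresponds to a relative humidity of roughly 52.6%.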



In [None]:
def compute_rh(tair_c, vpd_kpa):
    esat = 0.6108 * np.exp((17.27 * tair_c) / (tair_c + 237.3))  # kPa
    rh = (1 - (vpd_kpa / esat)) * 100
    return np.clip(rh, 0, 100)


clean_amer["RelHum_%"] = compute_rh(clean_amer["AirTemp_C"], clean_amer["VPD_kPa"])

clean_amer.head()

## Data Quality Check: Missing Values and Variable Types

Before proceeding, a preliminary inspection of the AmeriFlux dataset was performed to verify variable types and the presence of missing values.

All variables were successfully converted to the expected numerical formats, and the time column (`Date`) was correctly parsed as a `datetime64` object. Most environmental variables exhibit complete coverage for the 2020–2024 period, with the exception of **Soil Temperature at 10 cm depth**, which contains **12 missing monthly values**.

Although this represents a small portion of the dataset, missing values in `SoilTemp_10cm_C` must be taken into account during exploratory analysis and especially when computing **correlations**, since:

- pandas correlation methods such as `DataFrame.corr` exclude missing values pairwise, so each coefficient may be based on a different number of months;  
- missing soil temperature values may reduce the number of valid observations in multi-variable analyses;  
- any trend or seasonal analysis involving soil temperature may also be affected.

These entries will not be removed at this stage, but their presence will be considered during the correlation step to ensure that statistical interpretations are not biased by differences in sample size.
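
The sample-size issue described above can be made visible by counting how many months contribute to each pair of variables. The sketch below uses a small made-up table (the values are illustrative, only the column names follow this notebook) in which soil temperature has two gaps.

```python
import numpy as np
import pandas as pd

# Made-up monthly values; soil temperature has gaps, the other columns do not.
demo = pd.DataFrame(
    {
        "AirTemp_C": [10.0, 14.0, 19.0, 25.0, 29.0, 27.0],
        "SoilTemp_10cm_C": [9.0, np.nan, 18.0, np.nan, 30.0, 26.0],
        "ET_mm_month": [8.0, 12.0, 20.0, 15.0, 60.0, 55.0],
    }
)

# DataFrame.corr() drops missing values pair by pair.
corr = demo.corr()

# Number of months actually used for each pair of variables.
valid = demo.notna().astype(int)
n_pairs = valid.T @ valid

print(n_pairs)
```

In this toy table, correlations involving soil temperature rest on 4 months while the others use all 6, so the coefficients are not computed on identical samples. The `min_periods` argument of `DataFrame.corr` can be used to require a minimum number of observations per pair.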


In [None]:
# Describing the data
print("\nColumn types:\n", clean_amer.dtypes)
print("\nMissing values:\n", clean_amer.isna().sum())

# 2. OPENET DATA (Evapotranspiration)

In [None]:
# Load OpenET monthly dataset
et_csv_path = "./data/raw/ET_monthly.csv"
df_et = pd.read_csv(et_csv_path)

# Convert Month column to datetime
df_et["Month"] = pd.to_datetime(df_et["Month"])

# Filter to match analysis window (2020–2024)
df_et = (
    df_et[(df_et["Month"] >= "2020-01-01") & (df_et["Month"] <= "2024-12-31")]
    .copy()
    .reset_index(drop=True)  # drop=True avoids keeping the old index as an extra column
)

df_et.head()

## Converting OpenET Evapotranspiration to Millimeters per Month

The OpenET dataset downloaded from the OpenET Data Explorer includes the column  
**"ET Units"**, which explicitly reports the units used for the evapotranspiration
values in each monthly record. In this dataset, the unit is `"in"` (inches), meaning
that **Ensemble ET is provided as total evapotranspiration in inches per month**.

To ensure consistency with the AmeriFlux variables (most of which are expressed in
SI units), ET values are converted from inches to millimeters using the standard
conversion factor:

$$
1 \text{ inch} = 25.4 \text{ mm}
$$

Thus:

$$
ET_{\text{mm/month}} = ET_{\text{in/month}} \times 25.4
$$

This yields evapotranspiration in **millimeters per month**, which is directly
comparable to precipitation and other hydrological variables in the analysis.
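
Because the conversion is only valid when every record is reported in inches, it can help to check the "ET Units" column before multiplying. The following is a minimal sketch on made-up rows that mimic the column names mentioned above; the ET values and the name `MM_PER_INCH` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows mimicking the OpenET export columns named in the text.
demo_et = pd.DataFrame(
    {
        "Month": ["2020-01-01", "2020-02-01"],
        "Ensemble ET": [0.5, 2.0],  # inches per month (illustrative values)
        "ET Units": ["in", "in"],
    }
)

# Guard against silently converting values that are already metric.
if not (demo_et["ET Units"] == "in").all():
    raise ValueError("Expected every ET record to be reported in inches")

MM_PER_INCH = 25.4
demo_et["ET_mm_month"] = demo_et["Ensemble ET"] * MM_PER_INCH

print(demo_et["ET_mm_month"].tolist())
```

With these inputs, 0.5 in and 2.0 in become 12.7 mm and 50.8 mm respectively.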


In [None]:
# Keep only Date + Ensemble ET
df_et = df_et[["Month", "Ensemble ET"]].copy()
df_et = df_et.rename(columns={"Month": "Date"})

# Convert inches → millimeters
df_et["ET_mm_month"] = df_et["Ensemble ET"] * 25.4

# Drop raw inches
df_et = df_et.drop(columns=["Ensemble ET"])

df_et.head()

# 3. Merge AmeriFlux and OpenET + Save Final Dataset

After cleaning both datasets independently and converting all variables to
consistent units and formats, the next step is to **merge the monthly
environmental data from AmeriFlux with the monthly evapotranspiration (ET)
values from OpenET**.

Both datasets now use the same `Date` column in `datetime` format, which allows
months to be aligned directly and without additional conversions. An **outer
merge** is used to preserve the full temporal coverage and avoid unintentionally
dropping months where one of the sources may have missing data.

Once the datasets are merged, the final cleaned dataset is saved in **pickle
(`.pkl`) format**, which preserves data types (including the `datetime` column)
and typically loads faster than CSV. This makes it convenient for the subsequent
exploratory analysis, visualization, and correlation steps.

This merged dataset serves as the foundation for all analyses performed in the
next notebooks.
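
One way to see what the outer merge preserves is the `indicator` option of `pd.merge`, which records whether each month came from one source or both. The sketch below uses two small made-up frames (values are illustrative; only the column names follow this notebook).

```python
import pandas as pd

# Made-up frames: the second one lacks February and adds April.
left = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
        "Rain_mm": [5.0, 0.0, 12.0],
    }
)
right = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-04-01"]),
        "ET_mm_month": [8.0, 20.0, 25.0],
    }
)

# indicator=True adds a `_merge` column naming the source(s) of each row.
merged = (
    pd.merge(left, right, on="Date", how="outer", indicator=True)
    .sort_values("Date")
    .reset_index(drop=True)
)

print(merged[["Date", "_merge"]])
```

Rows flagged `left_only` or `right_only` are months that an inner merge would have dropped; here they are kept, with `NaN` in the columns of the missing source.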



In [None]:
df_merged = (
    pd.merge(clean_amer, df_et, on="Date", how="outer").sort_values("Date").reset_index(drop=True)
)

df_merged.head()

In [None]:
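# Save final merged dataset (create the output folder first if it does not exist)
output_pkl = Path("./data/processed/US_SRG_merged_clean.pkl")
output_pkl.parent.mkdir(parents=True, exist_ok=True)
df_merged.to_pickle(output_pkl)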
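
The claim that pickle preserves data types can be verified with a small round trip. This is a sketch on a made-up two-row frame written to a temporary folder; the file names are hypothetical and unrelated to the project paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with a datetime column, standing in for the merged dataset.
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-01-01", "2020-02-01"]), "ET_mm_month": [8.0, 12.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "demo.pkl"  # hypothetical file name
    csv_path = Path(tmp) / "demo.csv"  # hypothetical file name
    demo.to_pickle(pkl_path)
    demo.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle keeps datetime64; a plain CSV read returns the dates as text.
print(from_pkl["Date"].dtype, from_csv["Date"].dtype)
```

The pickled frame comes back with a `datetime64` column, whereas the CSV version needs `parse_dates` (or a later `pd.to_datetime` call) to recover it.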