# Soil Moisture Data Sources

This notebook implements standardized retrievers for multiple soil moisture data sources. Each source has unique characteristics:

## Data Source Characteristics

### Data Assimilation 

### ERA5
- **Provider**: ECMWF (European Centre for Medium-Range Weather Forecasts)
- **Resolution**: 0.1° x 0.1° (approximately 9km)
- **Temporal Coverage**: 1979-present
- **Update Frequency**: Monthly updates, with 2-3 month delay
- **Key Features**: High spatial resolution, consistent reanalysis

### GLDAS
- **Provider**: NASA GSFC
- **Resolution**: 0.25° x 0.25°
- **Temporal Coverage**: 2000-present
- **Update Frequency**: 3-hourly
- **Key Features**: Global coverage, multiple soil layers

### NLDAS
- **Provider**: NASA/NOAA
- **Resolution**: 0.125° x 0.125°
- **Temporal Coverage**: 1979-present
- **Update Frequency**: Hourly
- **Key Features**: North American focus, high temporal resolution

### FLDAS
- **Provider**: NASA GSFC
- **Resolution**: 0.1° x 0.1°
- **Temporal Coverage**: 1982-present
- **Update Frequency**: Monthly
- **Key Features**: Africa-focused land data assimilation

### MERRA-2
- **Provider**: NASA GMAO
- **Resolution**: 0.5° x 0.625°
- **Temporal Coverage**: 1980-present
- **Update Frequency**: Monthly
- **Key Features**: Comprehensive atmospheric reanalysis

### Remote Sensing 

### SMAP
- **Provider**: NASA
- **Resolution**: 9km x 9km
- **Temporal Coverage**: 2015-present
- **Update Frequency**: 3-hourly
- **Key Features**: Direct satellite observations, high accuracy


In [3]:
#import necessary packages 
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import cdsapi
import netCDF4
import earthaccess
import os
import tempfile
import sys
import json
import urllib3
import certifi
import requests
from time import sleep
from http.cookiejar import CookieJar
import urllib.request
from urllib.parse import urlencode
import getpass
from datetime import datetime
import h5py
from tqdm.auto import tqdm
import concurrent.futures
import warnings
from typing import Tuple, Optional,List, Dict
from pathlib import Path
import ftplib
import ssl
import json
import sys
import time
import uuid
from tqdm import tqdm



  from .autonotebook import tqdm as notebook_tqdm


In [4]:
#Parameters
max_lat= 45.02
min_lat= 40.5
max_lon= -71.85
min_lon= -79.77
start_date = "2013-01-01"
end_date = "2023-12-31"
lat_bounds = (min_lat, max_lat)
lon_bounds = (min_lon, max_lon)
area=(max_lat, min_lon, min_lat, max_lon) # (max_lat, min_lon, min_lat, max_lon)
bbox=(min_lon, min_lat, max_lon, max_lat) # (min_lon, min_lat, max_lon, max_lat)
date_range=(start_date, end_date) #start_dat ,end_date

In [8]:
#calculate statistics for the datasets 
def calculate_statistics(data, var_name, var_attrs):
    """
    Calculate statistics for a variable based on its data type
    
    Parameters:
    -----------
    data : np.ndarray
        The data array to analyze
    var_name : str
        Name of the variable
    var_attrs : dict
        Variable attributes
        
    Returns:
    --------
    dict
        Dictionary containing the statistics
    """
    # Check if data is datetime type
    if np.issubdtype(data.dtype, np.datetime64):
        return {
            'type': 'datetime',
            'min': data.min(),
            'max': data.max(),
            'shape': data.shape
        }
    
    # For numeric data
    valid_data = data[~np.isnan(data)]
    if len(valid_data) > 0:
        return {
            'type': 'numeric',
            'mean': np.nanmean(data),
            'median': np.nanmedian(data),
            'std': np.nanstd(data),
            'var': np.nanvar(data),
            'min': np.nanmin(data),
            'max': np.nanmax(data),
            'valid_points': len(valid_data),
            'missing_points': np.sum(np.isnan(data)),
            'coverage': (len(valid_data) / data.size * 100),
            'shape': data.shape
        }
    
    return {
        'type': 'empty',
        'shape': data.shape
    }

In [9]:
# ERA5 data retrieval with progress tracking
def get_era5_data(dataset, request, output_file):
    """
    Retrieve ERA5 data with progress tracking using a custom download approach
    
    Parameters:
    -----------
    dataset : str
        Name of the ERA5 dataset
    request : dict
        Request parameters for ERA5 data
    output_file : str
        Path to save the downloaded file
    
    Returns:
    --------
    xarray.Dataset
        The loaded dataset with ERA5 data
    """
    print("\nRetrieving ERA5 data...")
    print(f"Dataset: {dataset}")
    print(f"Time range: {request['year']}-{request['month']}")
    print(f"Spatial bounds: {request['area']}")
    
    try:
        client = cdsapi.Client()
        
        # First, submit the request and get the result
        print("Submitting request to ERA5...")
        result = client.retrieve(dataset, request)
        
        # Download the file with manual progress tracking
        print("\nDownloading data...")
        with open(output_file, 'wb') as f:
            result.download(output_file)
        
        # Get file size after download
        file_size = os.path.getsize(output_file)
        print(f"Download complete. File size: {file_size/1024/1024:.2f} MB")
        
        # Load and analyze the downloaded data
        print("\nLoading dataset...")
        ds = xr.open_dataset(output_file)
        
        # Print dataset information
        print("\nDataset Information:")
        print("-" * 50)
        print(f"Dimensions: {dict(ds.dims)}")
        print("\nVariables:")
        for var in ds.data_vars:
            print(f"\nVariable: {var}")
            data = ds[var].values
            valid_data = data[~np.isnan(data)]
            
            # Get variable attributes
            attrs = ds[var].attrs
            units = attrs.get('units', 'unknown')
            long_name = attrs.get('long_name', var)
            
            print(f"Description: {long_name}")
            print(f"Units: {units}")
            print(f"Shape: {data.shape}")
            
            # Calculate statistics
            if len(valid_data) > 0:
                print("Statistics:")
                print(f"  Mean:     {np.nanmean(data):.4f}")
                print(f"  Median:   {np.nanmedian(data):.4f}")
                print(f"  Std Dev:  {np.nanstd(data):.4f}")
                print(f"  Variance: {np.nanvar(data):.4f}")
                print(f"  Min:      {np.nanmin(data):.4f}")
                print(f"  Max:      {np.nanmax(data):.4f}")
                print(f"  Valid Points:    {len(valid_data):,}")
                print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                print(f"  Data Coverage:   {(len(valid_data) / data.size * 100):.1f}%")
            else:
                print("No valid data points found")
        
        return ds
        
    except Exception as e:
        print(f"Error retrieving ERA5 data: {str(e)}")
        raise

In [10]:
#NLDAS data request 
def get_nldas_data(start_date, end_date, lat_bounds, lon_bounds):
    """
    Retrieve NLDAS soil moisture data with progress tracking
    
    Parameters:
    -----------
    start_date : str
        Start date in YYYY-MM-DD format
    end_date : str
        End date in YYYY-MM-DD format
    lat_bounds : tuple
        (min_lat, max_lat) for the region of interest
    lon_bounds : tuple
        (min_lon, max_lon) for the region of interest
    
    Returns:
    --------
    xarray.Dataset
        Combined dataset with NLDAS soil moisture data
    """
    print("\nRetrieving NLDAS soil moisture data...")
    print(f"Time range: {start_date} to {end_date}")
    print(f"Spatial bounds: lat {lat_bounds}, lon {lon_bounds}")
    
    try:
        # Authenticate with NASA Earthdata
        auth = earthaccess.login()
        
        # Search for granules
        print("\nSearching for NLDAS granules...")
        granules = earthaccess.search_data(
            short_name="NLDAS_NOAH0125_M",
            version="2.0",
            temporal=(start_date, end_date),
            bounding_box=(lon_bounds[0], lat_bounds[0], lon_bounds[1], lat_bounds[1])
        )
        
        if not granules:
            raise ValueError("No NLDAS granules found for the specified parameters")
            
        print(f"Found {len(granules)} granules")
        
        with tempfile.TemporaryDirectory() as temp_dir:
            # Download files
            print("\nDownloading granules...")
            downloaded_files = earthaccess.download(
                granules,
                local_path=temp_dir
            )
            
            if not downloaded_files:
                raise ValueError("Failed to download any granules")
            
            print(f"Successfully downloaded {len(downloaded_files)} files")
            
            # Process files
            print("\nProcessing downloaded files...")
            datasets = []
            
            # Use tqdm for processing progress
            for file_path in tqdm(downloaded_files, desc="Processing files", unit="file"):
                try:
                    ds = xr.open_dataset(file_path)
                    
                    # Select only soil moisture variables (SoilM_*)
                    soil_vars = [var for var in ds.data_vars if 'SoilM_' in var]
                    if not soil_vars:
                        print(f"Warning: No soil moisture variables found in {os.path.basename(file_path)}")
                        continue
                    ds = ds[soil_vars]
                    
                    # Apply spatial subsetting
                    if 'lat' in ds.dims:
                        ds = ds.sel(lat=slice(lat_bounds[0], lat_bounds[1]))
                    if 'lon' in ds.dims:
                        ds = ds.sel(lon=slice(lon_bounds[0], lon_bounds[1]))
                    
                    datasets.append(ds)
                except Exception as e:
                    print(f"Warning: Failed to process file {os.path.basename(file_path)}: {str(e)}")
                    continue
            
            if not datasets:
                raise ValueError("No valid soil moisture data found in downloaded files")
            
            # Combine datasets
            print("\nCombining datasets...")
            combined_ds = xr.concat(datasets, dim='time')
            
            # Print dataset information
            print("\nDataset Information:")
            print("-" * 50)
            print(f"Time range: {combined_ds.time.values[0]} to {combined_ds.time.values[-1]}")
            print(f"Number of timesteps: {len(combined_ds.time)}")
            print(f"Dimensions: {dict(combined_ds.sizes)}")
            
            # Print statistics for soil moisture variables
            print("\nSoil Moisture Statistics:")
            print("-" * 50)
            
            for var in combined_ds.data_vars:
                print(f"\nVariable: {var}")
                data = combined_ds[var].values
                
                # Get variable attributes
                attrs = combined_ds[var].attrs
                units = attrs.get('units', 'kg/m^2')
                long_name = attrs.get('long_name', var)
                
                print(f"Description: {long_name}")
                print(f"Units: {units}")
                print(f"Shape: {data.shape}")
                
                # Calculate statistics for numeric data
                if np.issubdtype(data.dtype, np.number):
                    valid_data = data[~np.isnan(data)]
                    if len(valid_data) > 0:
                        print("Statistics:")
                        print(f"  Mean:     {np.nanmean(data):.4f}")
                        print(f"  Median:   {np.nanmedian(data):.4f}")
                        print(f"  Std Dev:  {np.nanstd(data):.4f}")
                        print(f"  Min:      {np.nanmin(data):.4f}")
                        print(f"  Max:      {np.nanmax(data):.4f}")
                        print(f"  Valid Points:    {len(valid_data):,}")
                        print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                        print(f"  Data Coverage:   {(len(valid_data) / data.size * 100):.1f}%")
                    else:
                        print("No valid numeric data found")
            
            return combined_ds
            
    except Exception as e:
        print(f"\nError retrieving NLDAS soil moisture data: {str(e)}")
        raise

In [11]:
# GLDAS data retrieval with progress tracking
def validate_dates(start_date: str, end_date: str) -> tuple:
    """Validate and parse date strings."""
    try:
        start = datetime.strptime(start_date, '%Y-%m-%d')
        end = datetime.strptime(end_date, '%Y-%m-%d')
        if end < start:
            raise ValueError("End date must be after start date")
        return start, end
    except ValueError as e:
        raise ValueError(f"Invalid date format. Please use YYYY-MM-DD format. Error: {str(e)}")

def validate_bounds(lat_bounds: tuple, lon_bounds: tuple) -> None:
    """Validate spatial bounds for GLDAS data."""
    if not isinstance(lat_bounds, tuple) or not isinstance(lon_bounds, tuple):
        raise TypeError("Bounds must be tuples")
    if len(lat_bounds) != 2 or len(lon_bounds) != 2:
        raise ValueError("Bounds must contain exactly 2 values")
    if not (-60 <= lat_bounds[0] <= 90 and -60 <= lat_bounds[1] <= 90):
        raise ValueError("GLDAS latitude must be between -60 and 90 degrees")
    if not (-180 <= lon_bounds[0] <= 180 and -180 <= lon_bounds[1] <= 180):
        raise ValueError("Longitude must be between -180 and 180 degrees")
    if lat_bounds[0] >= lat_bounds[1]:
        raise ValueError("Minimum latitude must be less than maximum latitude")
    if lon_bounds[0] >= lon_bounds[1]:
        raise ValueError("Minimum longitude must be less than maximum longitude")

def export_dataset(ds: xr.Dataset, start_date: str, end_date: str, output_dir: str = None) -> str:
    """
    Export dataset to NetCDF file with date-specific filename.
    
    Parameters:
    -----------
    ds : xarray.Dataset
        Dataset to export
    start_date : str
        Start date string
    end_date : str
        End date string
    output_dir : str, optional
        Directory to save the file (default: current directory)
        
    Returns:
    --------
    str
        Path to the exported file
    """
    # Create output directory if it doesn't exist
    if output_dir:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
    else:
        output_dir = '.'
        
    # Create filename
    filename = f"gldas_data_{start_date}_{end_date}.nc"
    filepath = os.path.join(output_dir, filename)
    
    # Add export metadata
    ds.attrs['export_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    ds.attrs['data_period'] = f"{start_date} to {end_date}"
    
    # Export to NetCDF
    print(f"\nExporting data to {filepath}...")
    ds.to_netcdf(filepath)
    print("Export complete!")
    
    return filepath

def get_gldas_data(start_date: str, end_date: str, lat_bounds: tuple, lon_bounds: tuple, 
                   output_dir: str = None) -> xr.Dataset:
    """
    Retrieve GLDAS soil moisture data with progress tracking and validation.
    
    Parameters:
    -----------
    start_date : str
        Start date in YYYY-MM-DD format
    end_date : str
        End date in YYYY-MM-DD format
    lat_bounds : tuple
        (min_lat, max_lat) for the region of interest (between -60 and 90)
    lon_bounds : tuple
        (min_lon, max_lon) for the region of interest (-180 to 180)
    output_dir : str, optional
        Directory to save the output file (default: current directory)
    
    Returns:
    --------
    xarray.Dataset
        Combined dataset with GLDAS soil moisture data
    """
    try:
        # Validate inputs
        validate_dates(start_date, end_date)
        validate_bounds(lat_bounds, lon_bounds)
        
        print("\nRetrieving GLDAS soil moisture data...")
        print(f"Time range: {start_date} to {end_date}")
        print(f"Spatial bounds: lat {lat_bounds}, lon {lon_bounds}")
        
        # Authenticate with NASA Earthdata
        try:
            auth = earthaccess.login()
        except Exception as e:
            raise RuntimeError(f"Failed to authenticate with NASA Earthdata: {str(e)}")
        
        # Search for granules with retry mechanism
        max_retries = 3
        granules = None
        
        for attempt in range(max_retries):
            try:
                print(f"\nSearching for GLDAS granules (attempt {attempt + 1}/{max_retries})...")
                granules = earthaccess.search_data(
                    short_name="GLDAS_NOAH025_M",
                    version="2.1",
                    temporal=(start_date, end_date),
                    bounding_box=(lon_bounds[0], lat_bounds[0], lon_bounds[1], lat_bounds[1])
                )
                if granules:
                    break
            except Exception as e:
                if attempt == max_retries - 1:
                    raise RuntimeError(f"Failed to search for granules after {max_retries} attempts: {str(e)}")
                print(f"Attempt {attempt + 1} failed, retrying...")
                continue
        
        if not granules:
            raise ValueError("No GLDAS granules found for the specified parameters")
        
        print(f"Found {len(granules)} granules")
        
        # Process data in temporary directory
        with tempfile.TemporaryDirectory() as temp_dir:
            # Download files with progress tracking
            print("\nDownloading granules...")
            downloaded_files = earthaccess.download(
                granules,
                local_path=temp_dir
            )
            
            if not downloaded_files:
                raise RuntimeError("Failed to download any granules")
            
            print(f"Successfully downloaded {len(downloaded_files)} files")
            
            # Process files with detailed error handling
            print("\nProcessing downloaded files...")
            datasets = []
            failed_files = []
            processed_count = 0
            
            for file_path in tqdm(downloaded_files, desc="Processing files", unit="file"):
                try:
                    ds = xr.open_dataset(file_path)
                    
                    # Print dimensions and variables for first file
                    if processed_count == 0:
                        print(f"\nFile structure: {os.path.basename(file_path)}")
                        print("Dimensions:", list(ds.dims))
                        print("Available variables:", list(ds.data_vars))
                    
                    # Validate dataset structure
                    required_dims = {'time', 'lat', 'lon'}
                    if not all(dim in ds.dims for dim in required_dims):
                        raise ValueError(f"Missing required dimensions: {required_dims - set(ds.dims)}")
                    
                    # Select soil moisture variables
                    soil_vars = [var for var in ds.data_vars if 'SoilMoi' in var]
                    if not soil_vars:
                        print(f"Warning: No soil moisture variables found in {os.path.basename(file_path)}")
                        continue
                    
                    ds = ds[soil_vars]
                    
                    # Apply spatial subsetting with validation
                    ds = ds.sel(
                        lat=slice(lat_bounds[0], lat_bounds[1]),
                        lon=slice(lon_bounds[0], lon_bounds[1])
                    )
                    
                    # Validate data content
                    if ds.sizes['lat'] == 0 or ds.sizes['lon'] == 0:
                        raise ValueError("No data points within specified bounds")
                    
                    datasets.append(ds)
                    processed_count += 1
                    
                except Exception as e:
                    failed_files.append((os.path.basename(file_path), str(e)))
                    continue
            
            # Report processing results
            if failed_files:
                print("\nWarning: Some files failed to process:")
                for fname, error in failed_files:
                    print(f"  - {fname}: {error}")
            
            if not datasets:
                raise ValueError("No valid soil moisture data found in downloaded files")
            
            # Combine datasets with error handling
            print("\nCombining datasets...")
            try:
                combined_ds = xr.concat(datasets, dim='time')
                combined_ds = combined_ds.sortby('time')  # Ensure temporal ordering
            except Exception as e:
                raise RuntimeError(f"Failed to combine datasets: {str(e)}")
            
            # Generate comprehensive dataset report
            print("\nDataset Information:")
            print("-" * 50)
            print(f"Time range: {combined_ds.time.values[0]} to {combined_ds.time.values[-1]}")
            print(f"Time resolution: {np.median(np.diff(combined_ds.time.values)).astype('timedelta64[h]')}")
            print(f"Number of timesteps: {len(combined_ds.time)}")
            print(f"Spatial coverage: {combined_ds.sizes['lat']}x{combined_ds.sizes['lon']} grid points")
            print(f"Lat range: {float(combined_ds.lat.min().values):.3f} to {float(combined_ds.lat.max().values):.3f}")
            print(f"Lon range: {float(combined_ds.lon.min().values):.3f} to {float(combined_ds.lon.max().values):.3f}")
            
            # Calculate and report statistics for each layer
            print("\nSoil Moisture Statistics by Layer:")
            print("-" * 50)
            
            for var in combined_ds.data_vars:
                print(f"\nVariable: {var}")
                data = combined_ds[var].values
                
                # Get variable metadata
                attrs = combined_ds[var].attrs
                units = attrs.get('units', 'kg/m^2')
                long_name = attrs.get('long_name', var)
                
                print(f"Description: {long_name}")
                print(f"Units: {units}")
                print(f"Shape: {data.shape}")
                
                if np.issubdtype(data.dtype, np.number):
                    valid_data = data[~np.isnan(data)]
                    if len(valid_data) > 0:
                        percentiles = np.nanpercentile(data, [0, 25, 50, 75, 100])
                        print("Statistics:")
                        print(f"  Mean:     {np.nanmean(data):.4f}")
                        print(f"  Std Dev:  {np.nanstd(data):.4f}")
                        print(f"  Min (0th):   {percentiles[0]:.4f}")
                        print(f"  25th:     {percentiles[1]:.4f}")
                        print(f"  Median:   {percentiles[2]:.4f}")
                        print(f"  75th:     {percentiles[3]:.4f}")
                        print(f"  Max (100th):  {percentiles[4]:.4f}")
                        print(f"  Valid Points:    {len(valid_data):,}")
                        print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                        print(f"  Data Coverage:   {(len(valid_data) / data.size * 100):.1f}%")
                        
                        # Check for potentially anomalous values
                        q1, q3 = percentiles[1], percentiles[3]
                        iqr = q3 - q1
                        outliers = np.sum((data < (q1 - 1.5 * iqr)) | (data > (q3 + 1.5 * iqr)))
                        if outliers > 0:
                            print(f"  Potential outliers: {outliers:,} points")
                    else:
                        print("Warning: No valid numeric data found")
            
            # Export the dataset
            export_filepath = export_dataset(combined_ds, start_date, end_date, output_dir)
            print(f"\nData exported to: {export_filepath}")
            
            return combined_ds
            
    except Exception as e:
        error_msg = f"\nError retrieving GLDAS soil moisture data: {str(e)}"
        print(error_msg)
        raise RuntimeError(error_msg) from e

In [5]:
# FLDAS data retrieval with progress tracking
def validate_dates(start_date: str, end_date: str) -> tuple:
    """Validate and parse date strings."""
    try:
        start = datetime.strptime(start_date, '%Y-%m-%d')
        end = datetime.strptime(end_date, '%Y-%m-%d')
        if end < start:
            raise ValueError("End date must be after start date")
        return start, end
    except ValueError as e:
        raise ValueError(f"Invalid date format. Please use YYYY-MM-DD format. Error: {str(e)}")

def validate_bounds(lat_bounds: tuple, lon_bounds: tuple) -> None:
    """Validate spatial bounds for FLDAS data."""
    if not isinstance(lat_bounds, tuple) or not isinstance(lon_bounds, tuple):
        raise TypeError("Bounds must be tuples")
    if len(lat_bounds) != 2 or len(lon_bounds) != 2:
        raise ValueError("Bounds must contain exactly 2 values")
    if not (-60 <= lat_bounds[0] <= 90 and -60 <= lat_bounds[1] <= 90):
        raise ValueError("FLDAS latitude must be between -60 and 90 degrees")
    if not (-180 <= lon_bounds[0] <= 180 and -180 <= lon_bounds[1] <= 180):
        raise ValueError("Longitude must be between -180 and 180 degrees")
    if lat_bounds[0] >= lat_bounds[1]:
        raise ValueError("Minimum latitude must be less than maximum latitude")
    if lon_bounds[0] >= lon_bounds[1]:
        raise ValueError("Minimum longitude must be less than maximum longitude")

def export_dataset(ds: xr.Dataset, start_date: str, end_date: str, output_dir: str = None) -> str:
    """
    Export dataset to NetCDF file with date-specific filename.
    """
    if output_dir:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
    else:
        output_dir = '.'
        
    filename = f"fldas_data_{start_date}_{end_date}.nc"
    filepath = os.path.join(output_dir, filename)
    
    # Add export metadata
    ds.attrs['export_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    ds.attrs['data_period'] = f"{start_date} to {end_date}"
    ds.attrs['data_source'] = "FLDAS_NOAH01_C_GL_M.001"
    
    print(f"\nExporting data to {filepath}...")
    ds.to_netcdf(filepath)
    print("Export complete!")
    
    return filepath

def get_fldas_data(start_date: str, end_date: str, lat_bounds: tuple, lon_bounds: tuple, 
                   output_dir: str = None) -> xr.Dataset:
    """
    Retrieve FLDAS soil moisture data with progress tracking, validation, and export.
    
    Parameters:
    -----------
    start_date : str
        Start date in YYYY-MM-DD format
    end_date : str
        End date in YYYY-MM-DD format
    lat_bounds : tuple
        (min_lat, max_lat) for the region of interest
    lon_bounds : tuple
        (min_lon, max_lon) for the region of interest
    output_dir : str, optional
        Directory to save the output file (default: current directory)
    
    Returns:
    --------
    xarray.Dataset
        Combined dataset with FLDAS soil moisture data
    """
    try:
        # Validate inputs
        validate_dates(start_date, end_date)
        validate_bounds(lat_bounds, lon_bounds)
        
        print("\nRetrieving FLDAS soil moisture data...")
        print(f"Time range: {start_date} to {end_date}")
        print(f"Spatial bounds: lat {lat_bounds}, lon {lon_bounds}")
        
        # Authenticate with NASA Earthdata
        try:
            auth = earthaccess.login()
        except Exception as e:
            raise RuntimeError(f"Failed to authenticate with NASA Earthdata: {str(e)}")
        
        # Search for granules with retry mechanism
        max_retries = 3
        granules = None
        
        for attempt in range(max_retries):
            try:
                print(f"\nSearching for FLDAS granules (attempt {attempt + 1}/{max_retries})...")
                granules = earthaccess.search_data(
                    short_name="FLDAS_NOAH01_C_GL_M",
                    version="001",
                    temporal=(start_date, end_date),
                    bounding_box=(lon_bounds[0], lat_bounds[0], lon_bounds[1], lat_bounds[1])
                )
                if granules:
                    break
            except Exception as e:
                if attempt == max_retries - 1:
                    raise RuntimeError(f"Failed to search for granules after {max_retries} attempts: {str(e)}")
                print(f"Attempt {attempt + 1} failed, retrying...")
                continue
        
        if not granules:
            raise ValueError("No FLDAS granules found for the specified parameters")
        
        print(f"Found {len(granules)} granules")
        
        with tempfile.TemporaryDirectory() as temp_dir:
            # Download files
            print("\nDownloading granules...")
            downloaded_files = earthaccess.download(
                granules,
                local_path=temp_dir
            )
            
            if not downloaded_files:
                raise RuntimeError("Failed to download any granules")
            
            print(f"Successfully downloaded {len(downloaded_files)} files")
            
            # Process files
            print("\nProcessing downloaded files...")
            datasets = []
            failed_files = []
            processed_count = 0
            
            for file_path in tqdm(downloaded_files, desc="Processing files", unit="file"):
                try:
                    ds = xr.open_dataset(file_path)
                    
                    # Print information for first file
                    if processed_count == 0:
                        print(f"\nFile structure: {os.path.basename(file_path)}")
                        print("Dimensions:", list(ds.dims))
                        print("Available variables:", list(ds.data_vars))
                    
                    # Select soil moisture variables
                    soil_vars = [var for var in ds.data_vars if 'SoilMoi' in var and 'cm_tavg' in var]
                    if not soil_vars:
                        raise ValueError(f"No soil moisture variables found")
                    ds = ds[soil_vars]
                    
                    # Handle FLDAS specific coordinate system
                    if 'X' in ds.dims and 'Y' in ds.dims:
                        ds = ds.sel(
                            X=slice(lon_bounds[0], lon_bounds[1]),
                            Y=slice(lat_bounds[0], lat_bounds[1])
                        )
                    else:
                        raise ValueError("Expected X/Y coordinates not found in dataset")
                    
                    # Validate data content
                    if ds.sizes['X'] == 0 or ds.sizes['Y'] == 0:
                        raise ValueError("No data points within specified bounds")
                    
                    datasets.append(ds)
                    processed_count += 1
                    
                except Exception as e:
                    failed_files.append((os.path.basename(file_path), str(e)))
                    continue
            
            # Report processing results
            if failed_files:
                print("\nWarning: Some files failed to process:")
                for fname, error in failed_files:
                    print(f"  - {fname}: {error}")
            
            if not datasets:
                raise ValueError("No valid soil moisture data found in downloaded files")
            
            # Combine datasets
            print("\nCombining datasets...")
            try:
                combined_ds = xr.concat(datasets, dim='time')
                combined_ds = combined_ds.sortby('time')
            except Exception as e:
                raise RuntimeError(f"Failed to combine datasets: {str(e)}")
            
            # Dataset report
            print("\nDataset Information:")
            print("-" * 50)
            print(f"Time range: {combined_ds.time.values[0]} to {combined_ds.time.values[-1]}")
            print(f"Time resolution: {np.median(np.diff(combined_ds.time.values)).astype('timedelta64[h]')}")
            print(f"Number of timesteps: {len(combined_ds.time)}")
            print(f"Spatial coverage: {combined_ds.sizes['Y']}x{combined_ds.sizes['X']} grid points")
            print(f"Y range: {float(combined_ds.Y.min().values):.3f} to {float(combined_ds.Y.max().values):.3f}")
            print(f"X range: {float(combined_ds.X.min().values):.3f} to {float(combined_ds.X.max().values):.3f}")
            
            # Calculate statistics by layer
            print("\nSoil Moisture Statistics by Layer:")
            print("-" * 50)
            
            for var in combined_ds.data_vars:
                print(f"\nVariable: {var}")
                data = combined_ds[var].values
                
                # Get metadata
                attrs = combined_ds[var].attrs
                units = attrs.get('units', 'kg/m^2')
                long_name = attrs.get('long_name', var)
                
                print(f"Description: {long_name}")
                print(f"Units: {units}")
                print(f"Shape: {data.shape}")
                
                if np.issubdtype(data.dtype, np.number):
                    valid_data = data[~np.isnan(data)]
                    if len(valid_data) > 0:
                        percentiles = np.nanpercentile(data, [0, 25, 50, 75, 100])
                        print("Statistics:")
                        print(f"  Mean:     {np.nanmean(data):.4f}")
                        print(f"  Std Dev:  {np.nanstd(data):.4f}")
                        print(f"  Min (0th):   {percentiles[0]:.4f}")
                        print(f"  25th:     {percentiles[1]:.4f}")
                        print(f"  Median:   {percentiles[2]:.4f}")
                        print(f"  75th:     {percentiles[3]:.4f}")
                        print(f"  Max (100th):  {percentiles[4]:.4f}")
                        print(f"  Valid Points:    {len(valid_data):,}")
                        print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                        print(f"  Data Coverage:   {(len(valid_data) / data.size * 100):.1f}%")
                        
                        # Check for outliers
                        q1, q3 = percentiles[1], percentiles[3]
                        iqr = q3 - q1
                        outliers = np.sum((data < (q1 - 1.5 * iqr)) | (data > (q3 + 1.5 * iqr)))
                        if outliers > 0:
                            print(f"  Potential outliers: {outliers:,} points")
                    else:
                        print("Warning: No valid numeric data found")
            
            # Export the dataset
            export_filepath = export_dataset(combined_ds, start_date, end_date, output_dir)
            print(f"\nData exported to: {export_filepath}")
            
            return combined_ds
            
    except Exception as e:
        error_msg = f"\nError retrieving FLDAS soil moisture data: {str(e)}"
        print(error_msg)
        raise RuntimeError(error_msg) from e

In [12]:
# SMAP data retrieval with progress tracking
def get_smap_data(start_date, end_date, lat_bounds, lon_bounds, max_files=10):
    """
    Retrieve SMAP L4 soil moisture data with missing value handling
    """
    print("\nRetrieving SMAP soil moisture data...")
    print(f"Time range: {start_date} to {end_date}")
    print(f"Spatial bounds: lat {lat_bounds}, lon {lon_bounds}")
    
    try:
        auth = earthaccess.login()
        
        granules = earthaccess.search_data(
            count=max_files,
            short_name="SPL4SMGP",
            temporal=(start_date, end_date),
            bounding_box=(lon_bounds[0], lat_bounds[0], lon_bounds[1], lat_bounds[1])
        )
        
        if not granules:
            raise ValueError("No SMAP granules found")
            
        print(f"Found {len(granules)} granules")
        
        with tempfile.TemporaryDirectory() as temp_dir:
            downloaded_files = earthaccess.download(
                granules,
                local_path=temp_dir
            )
            
            if not downloaded_files:
                raise ValueError("Failed to download granules")
                
            datasets = []
            for file_path in tqdm(downloaded_files, desc="Processing"):
                try:
                    with h5py.File(file_path, 'r') as f:
                        if len(datasets) == 0:
                            print("\nFile structure:")
                            def print_structure(name, obj):
                                if isinstance(obj, h5py.Dataset):
                                    print(f"{name}:")
                                    print(f"  Shape: {obj.shape}")
                                    print(f"  Dtype: {obj.dtype}")
                                    if '_FillValue' in obj.attrs:
                                        print(f"  Fill Value: {obj.attrs['_FillValue']}")
                            f.visititems(print_structure)
                        
                        if 'Geophysical_Data' in f:
                            geo_data = f['Geophysical_Data']
                            sm_vars = ['sm_surface', 'sm_rootzone', 'sm_profile']
                            ds_dict = {}
                            
                            time_value = None
                            for attr in f.attrs.keys():
                                if 'time' in attr.lower():
                                    try:
                                        time_str = f.attrs[attr]
                                        if isinstance(time_str, bytes):
                                            time_str = time_str.decode('utf-8')
                                        time_value = pd.to_datetime(time_str)
                                        break
                                    except:
                                        continue
                            
                            if time_value is None:
                                time_value = pd.Timestamp(start_date)
                            
                            for var in sm_vars:
                                if var in geo_data:
                                    # Get the data and attributes
                                    data = geo_data[var][:]
                                    attrs = dict(geo_data[var].attrs)
                                    
                                    # Handle missing values
                                    # Check for _FillValue in attributes
                                    fill_value = attrs.get('_FillValue', -9999)
                                    # Replace both -9999 and the fill_value with NaN
                                    data = np.where(data == -9999, np.nan, data)
                                    if fill_value != -9999:
                                        data = np.where(data == fill_value, np.nan, data)
                                    
                                    y_size, x_size = data.shape
                                    coords = {
                                        'y': np.linspace(lat_bounds[0], lat_bounds[1], y_size),
                                        'x': np.linspace(lon_bounds[0], lon_bounds[1], x_size),
                                        'time': [time_value]
                                    }
                                    
                                    # Print statistics for this variable
                                    print(f"\nStatistics for {var}:")
                                    valid_data = data[~np.isnan(data)]
                                    if len(valid_data) > 0:
                                        print(f"  Mean:     {np.mean(valid_data):.4f}")
                                        print(f"  Std Dev:  {np.std(valid_data):.4f}")
                                        print(f"  Min:      {np.min(valid_data):.4f}")
                                        print(f"  Max:      {np.max(valid_data):.4f}")
                                        print(f"  Valid Points:    {len(valid_data):,}")
                                        print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                                        print(f"  Coverage:        {(len(valid_data) / data.size * 100):.1f}%")
                                        print(f"  Units:           {attrs.get('units', 'unknown')}")
                                    
                                    da = xr.DataArray(
                                        data[np.newaxis, :, :],
                                        dims=['time', 'y', 'x'],
                                        coords=coords,
                                        name=var,
                                        attrs=attrs
                                    )
                                    ds_dict[var] = da
                            
                            if ds_dict:
                                ds = xr.Dataset(ds_dict)
                                datasets.append(ds)
                                
                except Exception as e:
                    print(f"Warning: Failed to process {os.path.basename(file_path)}: {str(e)}")
            
            if not datasets:
                raise ValueError("No valid soil moisture data found")
            
            combined_ds = xr.concat(datasets, dim='time')
            print("\nRetrieved data summary:")
            print(f"Time period: {combined_ds.time.values[0]} to {combined_ds.time.values[-1]}")
            print("Variables:", list(combined_ds.data_vars))
            
            # Print final statistics for combined dataset
            print("\nFinal Dataset Statistics:")
            for var in combined_ds.data_vars:
                data = combined_ds[var].values
                valid_data = data[~np.isnan(data)]
                print(f"\n{var}:")
                if len(valid_data) > 0:
                    print(f"  Mean:     {np.mean(valid_data):.4f}")
                    print(f"  Std Dev:  {np.std(valid_data):.4f}")
                    print(f"  Min:      {np.min(valid_data):.4f}")
                    print(f"  Max:      {np.max(valid_data):.4f}")
                    print(f"  Valid Points:    {len(valid_data):,}")
                    print(f"  Missing Points:  {np.sum(np.isnan(data)):,}")
                    print(f"  Coverage:        {(len(valid_data) / data.size * 100):.1f}%")
            
            return combined_ds
            
    except Exception as e:
        print(f"\nError: {str(e)}")
        raise

In [13]:
# MERRA-2 data retrieval with progress tracking

def get_merra2_data(
    bbox: Tuple[float, float, float, float],
    date_range: Tuple[str, str],
    var_names: Optional[List[str]] = None,
    output_file: str = 'merra_2_soil_moisture_data.nc',
    product: str = 'M2T1NXLND_5.12.4'
) -> Optional[xr.Dataset]:
    """
    Fetch MERRA-2 data and perform basic statistical analysis.
    
    Parameters:
    -----------
    bbox : tuple
        Bounding box coordinates as (min_lon, min_lat, max_lon, max_lat)
        Valid ranges: longitude [-180, 180], latitude [-90, 90]
    
    date_range : tuple
        Start and end dates as ('YYYY-MM-DD', 'YYYY-MM-DD')
    
    var_names : list, optional
        List of variable names to fetch. If None, defaults to 
        ['SFMC', 'GWETTOP', 'PRMC', 'RZMC']
    
    output_file : str, optional
        Output filename for NetCDF data
        Default: 'merra_2_soil_moisture_data.nc'
    
    product : str, optional
        MERRA-2 product identifier
        Default: 'M2T1NXLND_5.12.4'
    
    Returns:
    --------
    xarray.Dataset or None
        Dataset containing the requested variables with statistics printed
        Returns None if the request fails
    
    Examples:
    --------
    >>> # Fetch data for New York State
    >>> ds = fetch_merra2_data(
    ...     bbox=(-79.77, 40.5, -71.85, 45.02),
    ...     date_range=('2020-01-01', '2020-01-31'),
    ...     var_names=['SFMC', 'PRMC']
    ... )
    """
    # Parameter validation
    try:
        # Validate and process bbox
        minlon, minlat, maxlon, maxlat = bbox
        if not (-180 <= minlon <= 180 and -180 <= maxlon <= 180):
            raise ValueError("Longitude must be between -180 and 180 degrees")
        if not (-90 <= minlat <= 90 and -90 <= maxlat <= 90):
            raise ValueError("Latitude must be between -90 and 90 degrees")
        if minlon >= maxlon or minlat >= maxlat:
            raise ValueError("Min values must be less than max values")

        # Validate and process dates
        start_date, end_date = date_range
        try:
            datetime.strptime(start_date, '%Y-%m-%d')
            datetime.strptime(end_date, '%Y-%m-%d')
        except ValueError:
            raise ValueError("Dates must be in YYYY-MM-DD format")
        
        if start_date > end_date:
            raise ValueError("Start date must be before end date")

        # Set default variables if none provided
        if var_names is None:
            var_names = ['SFMC', 'GWETTOP', 'PRMC', 'RZMC']
            print(f"Using default variables: {var_names}")

    except ValueError as e:
        print(f"Parameter validation failed: {e}")
        return None

    # Initialize urllib PoolManager and set base URL
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    url = 'https://disc.gsfc.nasa.gov/service/subset/jsonwsp'
    
    def get_http_data(request):
        hdrs = {'Content-Type': 'application/json',
                'Accept': 'application/json'}
        data = json.dumps(request)       
        r = http.request('POST', url, body=data, headers=hdrs)
        response = json.loads(r.data)   
        if response['type'] == 'jsonwsp/fault':
            print('API Error: faulty %s request' % response['methodname'])
            sys.exit(1)
        return response
    
    # Construct the subset request
    subset_request = {
        'methodname': 'subset',
        'type': 'jsonwsp/request',
        'version': '1.0',
        'args': {
            'role': 'subset',
            'start': start_date,
            'end': end_date,
            'box': [minlon, minlat, maxlon, maxlat],
            'crop': True,
            'data': [{'datasetId': product,
                      'variable': varName
                     } for varName in var_names]
        }
    }
    
    # Submit request and get job ID
    response = get_http_data(subset_request)
    myJobId = response['result']['jobId']
    print('Job ID:', myJobId)
    print('Initial status:', response['result']['Status'])
    
    # Monitor job status
    status_request = {
        'methodname': 'GetStatus',
        'version': '1.0',
        'type': 'jsonwsp/request',
        'args': {'jobId': myJobId}
    }
    
    while response['result']['Status'] in ['Accepted', 'Running']:
        time.sleep(5)
        response = get_http_data(status_request)
        status = response['result']['Status']
        percent = response['result']['PercentCompleted']
        print(f'Job status: {status} ({percent}% complete)')
    
    def download_netcdf(url, output_file):
        print(f"\nDownloading data to {output_file}...")
        try:
            response = requests.get(url)
            response.raise_for_status()
            with open(output_file, 'wb') as f:
                f.write(response.content)
            print(f"Successfully saved data to {output_file}")
            return True
        except requests.exceptions.RequestException as e:
            print(f"Error downloading data: {e}")
            return False
    
    # Get results and download data
    if response['result']['Status'] == 'Succeeded':
        print('Job Finished:', response['result']['message'])
        
        result = requests.get('https://disc.gsfc.nasa.gov/api/jobs/results/'+myJobId)
        try:
            result.raise_for_status()
            urls = result.text.split('\n')
            
            success = False
            for url in urls:
                if url.strip():
                    print("\nAttempting download from:", url)
                    success = download_netcdf(url, output_file)
                    if success:
                        break
            
            if success:
                # Load and analyze the data
                ds = xr.open_dataset(output_file)
                
                print("\n=== Dataset Information ===")
                print(f"Dimensions: {dict(ds.dims)}")
                print(f"\nVariables: {list(ds.data_vars)}")
                print(f"\nTime range: {ds.time.values[0]} to {ds.time.values[-1]}")
                print(f"Spatial extent: {ds.lon.values.min():.2f}°E to {ds.lon.values.max():.2f}°E, "
                      f"{ds.lat.values.min():.2f}°N to {ds.lat.values.max():.2f}°N")
                
                for var in ds.data_vars:
                    print(f"\n=== Statistics for {var} ===")
                    data = ds[var]
                    print(f"Shape: {data.shape}")
                    print(f"Missing values: {data.isnull().sum().values}")
                    print(f"Mean: {float(data.mean()):.4f}")
                    print(f"Min: {float(data.min()):.4f}")
                    print(f"Max: {float(data.max()):.4f}")
                    print(f"Standard deviation: {float(data.std()):.4f}")
                
                return ds
            else:
                print("Failed to download data. Please check your credentials and try again.")
                return None
            
        except requests.exceptions.RequestException as e:
            print('Error getting download URLs:', e)
            return None
    else:
        print('Job Failed:', response['fault']['code'])
        return None

In [14]:
#Parameters
max_lat= 45.02
min_lat= 40.5
max_lon= -71.85
min_lon= -79.77
start_date = "2013-01-01"
end_date = "2023-12-31"
lat_bounds = (min_lat, max_lat)
lon_bounds = (min_lon, max_lon)
area=(max_lat, min_lon, min_lat, max_lon) # (max_lat, min_lon, min_lat, max_lon)
bbox=(min_lon, min_lat, max_lon, max_lat) # (min_lon, min_lat, max_lon, max_lat)
date_range=(start_date, end_date) #start_dat ,end_date

In [23]:
# ERA5_land request:
request = {
    "variable": "volumetric_soil_water_layer_1",
    "product_type": "reanalysis",
    "year": "2023",
    "month": "01",
    "day": ["01", "02"],
    "time": [f"{hour:02d}:00" for hour in range(24)],
    "area": area,  # [north, west, south, east]
    "format": "netcdf"
 }
    
    # Get the data
try:
     era5_data = get_era5_data(
        dataset="reanalysis-era5-land",
        request=request,
        output_file="era5_soil_moisture.nc"
    )  
except Exception as e:
    print(f"Failed to retrieve ERA5 data: {str(e)}")


Retrieving ERA5 data...
Dataset: reanalysis-era5-land
Time range: 2023-01
Spatial bounds: (45.02, -79.77, 40.5, -71.85)


2024-11-01 14:42:15,533 INFO [2024-09-28T00:00:00] **Welcome to the New Climate Data Store (CDS)!** This new system is in its early days of full operations and still undergoing enhancements and fine tuning. Some disruptions are to be expected. Your 
[feedback](https://jira.ecmwf.int/plugins/servlet/desk/portal/1/create/202) is key to improve the user experience on the new CDS for the benefit of everyone. Thank you.
2024-11-01 14:42:15,536 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2024-11-01 14:42:15,537 INFO [2024-09-16T00:00:00] Remember that you need to have an ECMWF account to use the new CDS. **Your old CDS credentials will not work in new CDS!**


Submitting request to ERA5...


2024-11-01 14:42:16,137 INFO Request ID is 60147992-c040-415e-9070-32a650f93d3c
2024-11-01 14:42:16,294 INFO status has been updated to accepted
2024-11-01 14:42:19,258 INFO status has been updated to running
2024-11-01 14:42:21,828 INFO status has been updated to successful



Downloading data...


                                                                                      

Download complete. File size: 0.28 MB

Loading dataset...

Dataset Information:
--------------------------------------------------
Dimensions: {'valid_time': 48, 'latitude': 46, 'longitude': 80}

Variables:

Variable: swvl1
Description: Volumetric soil water layer 1
Units: m**3 m**-3
Shape: (48, 46, 80)
Statistics:
  Mean:     0.3555
  Median:   0.3956
  Std Dev:  0.0937
  Variance: 0.0088
  Min:      0.0100
  Max:      0.5200
  Valid Points:    170,784
  Missing Points:  5,856
  Data Coverage:   96.7%


  print(f"Dimensions: {dict(ds.dims)}")


In [12]:
# Example usage NLDAS:
if __name__ == "__main__":
    start_date = start_date
    end_date = end_date
    lat_bounds = lat_bounds
    lon_bounds = lon_bounds
    
    try:
        # Set up output file name
        output_file = f"nldas_data_{start_date}_{end_date}.nc"
        
        # Retrieve the data
        nldas_data = get_nldas_data(
            start_date=start_date,
            end_date=end_date,
            lat_bounds=lat_bounds,
            lon_bounds=lon_bounds
        )
        
        # Save the data
        print(f"\nSaving data to {output_file}...")
        nldas_data.to_netcdf(output_file)
        
        # Print file size
        file_size = os.path.getsize(output_file) / (1024 * 1024)  # Convert to MB
        print(f"File saved successfully. Size: {file_size:.2f} MB")
        
    except Exception as e:
        print(f"Failed to retrieve NLDAS data: {str(e)}")


Retrieving NLDAS soil moisture data...
Time range: 2013-01-01 to 2023-12-31
Spatial bounds: lat (40.5, 45.02), lon (-79.77, -71.85)

Searching for NLDAS granules...
Found 132 granules

Downloading granules...


QUEUEING TASKS | : 100%|██████████| 132/132 [00:00<00:00, 17833.15it/s]
PROCESSING TASKS | : 100%|██████████| 132/132 [03:02<00:00,  1.38s/it]
COLLECTING RESULTS | : 100%|██████████| 132/132 [00:00<00:00, 323015.24it/s]


Successfully downloaded 132 files

Processing downloaded files...


Processing files: 100%|██████████| 132/132 [00:03<00:00, 37.72file/s]



Combining datasets...

Dataset Information:
--------------------------------------------------
Time range: 2013-01-01T00:00:00.000000000 to 2023-12-01T00:00:00.000000000
Number of timesteps: 132
Dimensions: {'time': 132, 'lat': 36, 'lon': 63}

Soil Moisture Statistics:
--------------------------------------------------

Variable: SoilM_0_10cm
Description: Soil moisture content (0-10cm)
Units: kg m-2
Shape: (132, 36, 63)
Statistics:
  Mean:     29.3776
  Median:   28.3492
  Std Dev:  7.0062
  Min:      2.9674
  Max:      47.5999
  Valid Points:    267,960
  Missing Points:  31,416
  Data Coverage:   89.5%

Variable: SoilM_10_40cm
Description: Soil moisture content (10-40cm)
Units: kg m-2
Shape: (132, 36, 63)
Statistics:
  Mean:     88.4562
  Median:   85.0910
  Std Dev:  20.4660
  Min:      16.7697
  Max:      142.7994
  Valid Points:    267,960
  Missing Points:  31,416
  Data Coverage:   89.5%

Variable: SoilM_40_100cm
Description: Soil moisture content (40-100cm)
Units: kg m-2
Shape

In [5]:
# GLDAS example
gldas_data = get_gldas_data(start_date, end_date, lat_bounds, lon_bounds)


Retrieving GLDAS soil moisture data...
Time range: 2013-01-01 to 2023-12-31
Spatial bounds: lat (40.5, 45.02), lon (-79.77, -71.85)

Searching for GLDAS granules (attempt 1/3)...
Found 132 granules

Downloading granules...


QUEUEING TASKS | : 100%|██████████| 132/132 [00:00<00:00, 16019.45it/s]
PROCESSING TASKS | : 100%|██████████| 132/132 [06:58<00:00,  3.17s/it]
COLLECTING RESULTS | : 100%|██████████| 132/132 [00:00<00:00, 296068.52it/s]


Successfully downloaded 132 files

Processing downloaded files...


Processing files:   1%|          | 1/132 [00:00<00:22,  5.75file/s]


File structure: GLDAS_NOAH025_M.A201301.021.nc4
Dimensions: ['time', 'bnds', 'lon', 'lat']
Available variables: ['time_bnds', 'Swnet_tavg', 'Lwnet_tavg', 'Qle_tavg', 'Qh_tavg', 'Qg_tavg', 'Snowf_tavg', 'Rainf_tavg', 'Evap_tavg', 'Qs_acc', 'Qsb_acc', 'Qsm_acc', 'AvgSurfT_inst', 'Albedo_inst', 'SWE_inst', 'SnowDepth_inst', 'SoilMoi0_10cm_inst', 'SoilMoi10_40cm_inst', 'SoilMoi40_100cm_inst', 'SoilMoi100_200cm_inst', 'SoilTMP0_10cm_inst', 'SoilTMP10_40cm_inst', 'SoilTMP40_100cm_inst', 'SoilTMP100_200cm_inst', 'PotEvap_tavg', 'ECanop_tavg', 'Tveg_tavg', 'ESoil_tavg', 'RootMoist_inst', 'CanopInt_inst', 'Wind_f_inst', 'Rainf_f_tavg', 'Tair_f_inst', 'Qair_f_inst', 'Psurf_f_inst', 'SWdown_f_tavg', 'LWdown_f_tavg']


Processing files: 100%|██████████| 132/132 [00:02<00:00, 53.46file/s]



Combining datasets...

Dataset Information:
--------------------------------------------------
Time range: 2013-01-01T00:00:00.000000000 to 2023-12-01T00:00:00.000000000
Time resolution: 744 hours
Number of timesteps: 132
Spatial coverage: 18x32 grid points
Lat range: 40.625 to 44.875
Lon range: -79.625 to -71.875

Soil Moisture Statistics by Layer:
--------------------------------------------------

Variable: SoilMoi0_10cm_inst
Description: Soil moisture
Units: kg m-2
Shape: (132, 18, 32)
Statistics:
  Mean:     31.4795
  Std Dev:  6.4977
  Min (0th):   8.3211
  25th:     26.7874
  Median:   30.1789
  75th:     34.7768
  Max (100th):  47.5826
  Valid Points:    68,244
  Missing Points:  7,788
  Data Coverage:   89.8%
  Potential outliers: 1,444 points

Variable: SoilMoi10_40cm_inst
Description: Soil moisture
Units: kg m-2
Shape: (132, 18, 32)
Statistics:
  Mean:     95.2638
  Std Dev:  21.2855
  Min (0th):   20.6895
  25th:     80.3381
  Median:   90.7238
  75th:     107.9240
  Max (

In [6]:
# FLDAS example
fldas_data = get_fldas_data(start_date, end_date, lat_bounds, lon_bounds)


Retrieving FLDAS soil moisture data...
Time range: 2013-01-01 to 2023-12-31
Spatial bounds: lat (40.5, 45.02), lon (-79.77, -71.85)

Searching for FLDAS granules (attempt 1/3)...
Found 132 granules

Downloading granules...


QUEUEING TASKS | : 100%|██████████| 132/132 [00:00<00:00, 13187.75it/s]
PROCESSING TASKS | : 100%|██████████| 132/132 [33:46<00:00, 15.35s/it]
COLLECTING RESULTS | : 100%|██████████| 132/132 [00:00<00:00, 349084.57it/s]


Successfully downloaded 132 files

Processing downloaded files...


Processing files:   1%|          | 1/132 [00:00<00:22,  5.83file/s]


File structure: FLDAS_NOAH01_C_GL_M.A201301.001.nc
Dimensions: ['time', 'bnds', 'X', 'Y']
Available variables: ['time_bnds', 'Evap_tavg', 'LWdown_f_tavg', 'Lwnet_tavg', 'Psurf_f_tavg', 'Qair_f_tavg', 'Qg_tavg', 'Qh_tavg', 'Qle_tavg', 'Qs_tavg', 'Qsb_tavg', 'RadT_tavg', 'Rainf_f_tavg', 'SWE_inst', 'SWdown_f_tavg', 'SnowCover_inst', 'SnowDepth_inst', 'Snowf_tavg', 'Swnet_tavg', 'Tair_f_tavg', 'Wind_f_tavg', 'SoilMoi00_10cm_tavg', 'SoilMoi10_40cm_tavg', 'SoilMoi40_100cm_tavg', 'SoilMoi100_200cm_tavg', 'SoilTemp00_10cm_tavg', 'SoilTemp10_40cm_tavg', 'SoilTemp40_100cm_tavg', 'SoilTemp100_200cm_tavg']


Processing files: 100%|██████████| 132/132 [00:01<00:00, 70.00file/s]



Combining datasets...

Dataset Information:
--------------------------------------------------
Time range: 2013-01-01T00:00:00.000000000 to 2023-12-01T00:00:00.000000000
Time resolution: 744 hours
Number of timesteps: 132
Spatial coverage: 45x79 grid points
Y range: 40.550 to 44.950
X range: -79.750 to -71.950

Soil Moisture Statistics by Layer:
--------------------------------------------------

Variable: SoilMoi00_10cm_tavg
Description: soil moisture content
Units: m^3 m-3
Shape: (132, 45, 79)
Statistics:
  Mean:     0.3622
  Std Dev:  0.0497
  Min (0th):   0.1411
  25th:     0.3361
  Median:   0.3690
  75th:     0.3959
  Max (100th):  0.4678
  Valid Points:    420,156
  Missing Points:  49,104
  Data Coverage:   89.5%
  Potential outliers: 13,172 points

Variable: SoilMoi10_40cm_tavg
Description: soil moisture content
Units: m^3 m-3
Shape: (132, 45, 79)
Statistics:
  Mean:     0.3800
  Std Dev:  0.0672
  Min (0th):   0.0897
  25th:     0.3452
  Median:   0.3867
  75th:     0.4292
 

In [15]:
# SMAP example
smap_data = get_smap_data(start_date, end_date, lat_bounds, lon_bounds)


Retrieving SMAP soil moisture data...
Time range: 2013-01-01 to 2023-12-31
Spatial bounds: lat (40.5, 45.02), lon (-79.77, -71.85)
Found 10 granules


QUEUEING TASKS | : 100%|██████████| 10/10 [00:00<00:00, 1405.27it/s]
PROCESSING TASKS | : 100%|██████████| 10/10 [03:30<00:00, 21.01s/it]
COLLECTING RESULTS | : 100%|██████████| 10/10 [00:00<00:00, 91779.08it/s]
Processing:  10%|█         | 1/10 [00:00<00:01,  5.14it/s]


File structure:
EASE2_global_projection:
  Shape: (1,)
  Dtype: |S1
Geophysical_Data/baseflow_flux:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/depth_to_water_table_from_surface_in_peat:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/free_surface_water_on_peat_flux:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/heat_flux_ground:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/heat_flux_latent:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/heat_flux_sensible:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/height_lowatmmodlay:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/land_evapotranspiration_flux:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophysical_Data/land_fraction_saturated:
  Shape: (1624, 3856)
  Dtype: float32
  Fill Value: -9999.0
Geophy

Processing:  20%|██        | 2/10 [00:00<00:01,  5.86it/s]


Statistics for sm_surface:
  Mean:     0.2135
  Std Dev:  0.1389
  Min:      0.0043
  Max:      0.8869
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2416
  Std Dev:  0.1675
  Min:      0.0064
  Max:      0.9302
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9294
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2135
  Std Dev:  0.1389
  Min:      0.0039
  Max:      0.8894
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2416


Processing:  30%|███       | 3/10 [00:00<00:01,  6.30it/s]

  Std Dev:  0.1674
  Min:      0.0064
  Max:      0.9302
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0077
  Max:      0.9295
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2131
  Std Dev:  0.1391
  Min:      0.0027
  Max:      0.9020
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2415
  Std Dev:  0.1675
  Min:      0.0065
  Max:      0.9308
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:


Processing:  50%|█████     | 5/10 [00:00<00:00,  6.45it/s]

  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0077
  Max:      0.9296
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2125
  Std Dev:  0.1393
  Min:      0.0035
  Max:      0.8945
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2415
  Std Dev:  0.1675
  Min:      0.0063
  Max:      0.9305
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2559
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9296
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2121


Processing:  60%|██████    | 6/10 [00:00<00:00,  6.56it/s]

  Std Dev:  0.1392
  Min:      0.0067
  Max:      0.8925
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2414
  Std Dev:  0.1674
  Min:      0.0063
  Max:      0.9302
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2559
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9294
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2122
  Std Dev:  0.1393
  Min:      0.0093
  Max:      0.8780
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2414
  Std Dev:  0.1674
  Min:      0.0063
  Max:      0.9300
  Valid Points:    1,684,725


Processing:  80%|████████  | 8/10 [00:01<00:00,  6.58it/s]

  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2559
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9294
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2125
  Std Dev:  0.1394
  Min:      0.0094
  Max:      0.8894
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2415
  Std Dev:  0.1674
  Min:      0.0064
  Max:      0.9309
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9296
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'


Processing:  90%|█████████ | 9/10 [00:01<00:00,  6.43it/s]


Statistics for sm_surface:
  Mean:     0.2129
  Std Dev:  0.1394
  Min:      0.0088
  Max:      0.8882
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_rootzone:
  Mean:     0.2415
  Std Dev:  0.1674
  Min:      0.0065
  Max:      0.9308
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0076
  Max:      0.9296
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_surface:
  Mean:     0.2130
  Std Dev:  0.1394
  Min:      0.0062
  Max:      0.8893
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'


Processing: 100%|██████████| 10/10 [00:01<00:00,  6.38it/s]



Statistics for sm_rootzone:
  Mean:     0.2416
  Std Dev:  0.1674
  Min:      0.0066
  Max:      0.9306
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Statistics for sm_profile:
  Mean:     0.2560
  Std Dev:  0.1740
  Min:      0.0077
  Max:      0.9296
  Valid Points:    1,684,725
  Missing Points:  4,577,419
  Coverage:        26.9%
  Units:           b'm3 m-3'

Retrieved data summary:
Time period: 2013-01-01T00:00:00.000000000 to 2013-01-01T00:00:00.000000000
Variables: ['sm_surface', 'sm_rootzone', 'sm_profile']

Final Dataset Statistics:

sm_surface:
  Mean:     0.2129
  Std Dev:  0.1392
  Min:      0.0027
  Max:      0.9020
  Valid Points:    16,847,250
  Missing Points:  45,774,190
  Coverage:        26.9%

sm_rootzone:
  Mean:     0.2415
  Std Dev:  0.1674
  Min:      0.0063
  Max:      0.9311
  Valid Points:    16,847,250
  Missing Points:  45,774,190
  Coverage:        26.9%

sm_profile:
  Mean:     0.2560
  S

In [None]:
# MERRA-2 data 
merra2_data = get_merra2_data(
    bbox=bbox,  # (min_lon, min_lat, max_lon, max_lat)
    date_range= date_range,
    var_names=['SFMC', 'PRMC'] 
)

Job ID: 67239b288be99c3db6e2b0ad
Initial status: Accepted
Job status: Succeeded (100% complete)
Job Finished: Complete (M2T1NXLND_5.12.4)

Attempting download from: https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/M2T1NXLND.5.12.4/doc/MERRA2.README.pdf

Downloading data to merra_2_soil_moisture_data.nc...
Error downloading data: 404 Client Error: Not Found for url: https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/M2T1NXLND.5.12.4/doc/MERRA2.README.pdf%0D

Attempting download from: https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXLND.5.12.4/2020/01/MERRA2_400.tavg1_2d_lnd_Nx.20200101.nc4.nc4?SFMC[0:23][261:270][160:173],PRMC[0:23][261:270][160:173],time,lat[261:270],lon[160:173]

Downloading data to merra_2_soil_moisture_data.nc...
Successfully saved data to merra_2_soil_moisture_data.nc

=== Dataset Information ===
Dimensions: {'time': 24, 'lat': 10, 'lon': 14}

Variables: ['SFMC', 'PRMC']

Time range: 2020-01-01T00:30:00.000000000 to 2020-01-01T23:30:00.000000000


  print(f"Dimensions: {dict(ds.dims)}")
