# Update HYSETS Catchment Attributes

After checking for updated catchment polygons, we update the attributes associated with the monitored station polygons. Most of the preprocessing is described in {cite}`kovacek2024bcub`, and the [Jupyter Book associated with that publication](https://dankovacek.github.io/bcub_demo/0_intro.html) provides example notebooks covering each step. Where the preceding chapter found more recent boundary information for a monitoring station, the updated polygon is used to revise the catchment attributes. The effect is most pronounced for catchments carrying the HYSETS "artificial boundaries" flag, where attributes were taken from the nearest raster pixel or where the polygon was simply a square, centred on the reported station location, with an area equal to that reported in official sources.

## Compare updated results with HYSETS attributes

In [1]:
import os
import geopandas as gpd
import pandas as pd
import numpy as np
import rioxarray as rxr
from shapely.geometry import Point
import time  # cells below call time.time()
from attribute_processing_functions import *

from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.io import output_notebook
output_notebook()

### Load the original HYSETS data and the pre-processed results file

In [2]:
daymet_attributes = ['prcp', 'tmin', 'tmax', 'vp', 'swe', 'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration', 'high_prcp_freq']

In [3]:
hs_df = pd.read_csv('data/HYSETS_watershed_properties.txt', sep=';')
ab_flag_stns = hs_df[hs_df['Flag_Artificial_Boundaries'] == 1]['Official_ID'].values
print(f'{len(ab_flag_stns)}/{len(hs_df)} HYSETS boundaries have ab flag')

1633/14425 HYSETS boundaries have ab flag


Below we look ahead at the results of this step to compare the updated values with the HYSETS attributes.  The remainder of this chapter computes the updated catchment attributes.

In [4]:
updated_catchment_attribute_path = os.path.join('data', 'BCUB_watershed_attributes_updated.geojson')
attributes_fpath = updated_catchment_attribute_path.replace('.geojson', '.csv')
if os.path.exists(attributes_fpath):
    # use the cached attribute table if it has already been exported
    attributes_df = pd.read_csv(attributes_fpath)
else:
    # otherwise read the GeoJSON, drop the geometry column, and cache the table as csv
    results_df = gpd.read_file(updated_catchment_attribute_path)
    attributes_df = results_df[[c for c in results_df.columns if c != 'geometry']].copy()
    attributes_df.to_csv(attributes_fpath, index=False)

attributes_df.set_index('Official_ID', inplace=True)

DataSourceError: data/BCUB_watershed_attributes_updated.geojson: No such file or directory

In [None]:
# filter the unaltered HYSETS attributes for stations in the results dataframe
og_df = hs_df[hs_df['Official_ID'].isin(attributes_df.index)].copy()
og_df.set_index('Official_ID', inplace=True)
# rename the soil permeability and porosity columns to match the updated attribute names
og_df.rename({'Permeability_logk_m2': 'logk_ice_x100', 'Porosity_frac': 'porosity_x100'}, axis=1, inplace=True)
len(og_df)

In [5]:
# get the attributes of interest (climate are not included in original as such)
attributes = [
    'logk_ice_x100', 'porosity_x100',
    'Slope_deg', 'Aspect_deg', 'Elevation_m', 'Drainage_Area_km2', 
    'Land_Use_Forest_frac', 'Land_Use_Shrubs_frac', 'Land_Use_Grass_frac',
    'Land_Use_Wetland_frac', 'Land_Use_Crops_frac', 'Land_Use_Urban_frac',
    'Land_Use_Water_frac', 'Land_Use_Snow_Ice_frac']

In [6]:
def scatter_plot(df, a, ab_flag_stns):
    min_val, max_val = df.min().min(), df.max().max()
    # Create the figure once; drainage area spans orders of magnitude, so use log axes for it
    if a.lower() == 'drainage_area_km2':
        p = figure(title=a, x_axis_type='log', y_axis_type='log')
    else:
        p = figure(title=a)

    df['stn_id'] = df.index  # Make sure the index column is available for tooltips
    df['ab_flag'] = df.index.isin(ab_flag_stns)
    flag_df = df[df['ab_flag']].copy()
    noflag_df = df[~df['ab_flag']].copy()
    flag_source = ColumnDataSource(flag_df)
    noflag_source = ColumnDataSource(noflag_df)
    # Add a scatter renderer with circle markers
    p.scatter(
        x='original', y='revised', size=3, color="dodgerblue", alpha=0.6, source=noflag_source,
        legend_label='no_flag'
    )
    p.scatter(
        x='original', y='revised', size=3, color="orange", alpha=0.6, source=flag_source,
        legend_label='ab_flag'
    )

    # Add a HoverTool to show the index
    hover = HoverTool()
    hover.tooltips = [
        ("ID", "@stn_id"),
    ]
    p.add_tools(hover)

    
    x = np.linspace(min_val, max_val, 1000)
    y = x
    p.line(x, y, legend_label='1:1', color='red', line_width=3, line_dash='dashed')
    
    # Set axis labels
    p.xaxis.axis_label = 'original'
    p.yaxis.axis_label = 'updated'
    p.legend.click_policy = 'hide'
    p.legend.location = 'top_left'
    return p

In [7]:
plots = []
for a in attributes:
    result_a = a    
    if a.startswith('Land_Use'):
        result_a += '_2010'
    og_vals = og_df[[a]].copy().rename({a: 'original'}, axis=1)
    
    revised_vals = attributes_df[[result_a]].copy().rename({result_a: 'revised'}, axis=1)
    comp_df = pd.concat([og_vals, revised_vals], axis=1)
    comp_df.dropna(inplace=True, how='any')

    if a in ['logk_ice_x100', 'porosity_x100']:
        # the updated soil values are stored multiplied by 100; rescale to match HYSETS
        comp_df['revised'] /= 100
    plot = scatter_plot(comp_df, a, ab_flag_stns)
    plots.append(plot)


NameError: name 'og_df' is not defined

### View scatter plots of HYSETS vs. updated attributes

Approximately 25% of the stations we evaluated carry the "artificial boundaries" flag, meaning that catchment geometries were not available from official sources. These catchment boundaries were approximated by a square centred at the "centroid" coordinates, which the HYSETS paper states reflect the reported station location. Below, the attributes based on updated geometries differ substantially from the original values, particularly for the flagged catchments (ab_flag) whose boundaries were revised from official sources.
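
For intuition, the placeholder geometry described above can be sketched in a few lines. This is a minimal illustration and not the HYSETS code: the function name `square_boundary` and the example area and coordinates are hypothetical, and the sketch assumes a projected CRS in metres (such as EPSG:3005) so that side lengths and areas are meaningful.

```python
import math

def square_boundary(x_m, y_m, area_km2):
    """Return the corners of a square of the given area centred on (x_m, y_m).

    Hypothetical helper for illustration; coordinates are assumed to be in metres.
    """
    half_side = math.sqrt(area_km2 * 1e6) / 2.0  # km2 -> m2, then half the side length
    return [
        (x_m - half_side, y_m - half_side),
        (x_m + half_side, y_m - half_side),
        (x_m + half_side, y_m + half_side),
        (x_m - half_side, y_m + half_side),
    ]

# Example: a station reporting a 100 km2 drainage area (made-up values)
corners = square_boundary(1_000_000.0, 500_000.0, 100.0)
side_m = corners[1][0] - corners[0][0]
print(f'side length: {side_m / 1000:.1f} km')
```

A square of this kind ignores topography entirely, so attributes extracted from it may differ considerably from those of the true catchment.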

The soil attributes differ markedly between the two studies. One possible reason is that this study uses GLHYMPS version 2.0 {cite}`huscroft2018compiling` ([DOI: 10.1002/2017GL075860](https://doi.org/10.1002/2017GL075860)), whereas the source used in HYSETS (https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/DLGXYO) is {cite}`gleeson2018` ([DOI: 10.5683/SP2/DLGXYO](https://doi.org/10.5683/SP2/DLGXYO)).


```{note}
ab_flag = Artificial boundaries 
```

In [8]:
layout = gridplot(plots, ncols=3, width=350, height=325)
show(layout)

## Load updated catchment polygons

The data processing below is optional if you use the pre-processed (revised) attributes file `BCUB_watershed_attributes_updated.csv`.

The file `BCUB_watershed_bounds_updated.geojson` is the end result of the preceding chapter.

In [9]:
revised_catchment_geometry_fpath = 'data/BCUB_watershed_bounds_updated.geojson'
bcub_gdf = gpd.read_file(revised_catchment_geometry_fpath)
geom_updated_stns = bcub_gdf[bcub_gdf['geometry_updated'] == 1]['Official_ID'].values
print(len(geom_updated_stns))

1604


In [10]:
bcub_pts = bcub_gdf.copy()
bcub_pts['geometry'] = bcub_pts.apply(lambda row: Point(row['Centroid_Lon_deg_E'], row['Centroid_Lat_deg_N']), axis=1)
# the new point coordinates are geographic (EPSG:4326), so override the EPSG:3005 CRS inherited from the polygons
bcub_pts = bcub_pts.set_crs(4326, allow_override=True)
bcub_pts[['geometry']].head()
if 'index_right' in bcub_pts.columns:
    bcub_pts.drop('index_right', inplace=True, axis=1)

In [11]:
bcub_pts.columns

Index(['Watershed_ID', 'Source', 'Name', 'Official_ID', 'Centroid_Lat_deg_N',
       'Centroid_Lon_deg_E', 'Drainage_Area_km2', 'Drainage_Area_GSIM_km2',
       'Flag_GSIM_boundaries', 'Flag_Artificial_Boundaries', 'Elevation_m',
       'Slope_deg', 'Gravelius', 'Perimeter', 'Flag_Shape_Extraction',
       'Aspect_deg', 'Flag_Terrain_Extraction', 'Land_Use_Forest_frac',
       'Land_Use_Grass_frac', 'Land_Use_Wetland_frac', 'Land_Use_Water_frac',
       'Land_Use_Urban_frac', 'Land_Use_Shrubs_frac', 'Land_Use_Crops_frac',
       'Land_Use_Snow_Ice_frac', 'Flag_Land_Use_Extraction',
       'Permeability_logk_m2', 'Porosity_frac', 'Flag_Subsoil_Extraction',
       'region_code', 'geometry_updated', 'geometry'],
      dtype='object')

In [12]:
# organize the stations by the region they are contained in to reduce the number of raster loads.  
# import the BCUB (study) region boundary
# region_gdf = gpd.read_file('data/BCUB_regions_4326.geojson')
# region_gdf = region_gdf.to_crs(3005)
# # simplify the geometries (100m threshold) and add a small buffer (500m) to capture HYSETS station points recorded with low accuracy near boundaries
# region_gdf.geometry = region_gdf.simplify(100).buffer(500)
# region_gdf = region_gdf.to_crs(4326)

In [13]:
# # organize the stations by their containing study sub-region polygon
# assert region_gdf.crs == bcub_pts.crs

# for i, row in region_gdf.iterrows():
#     rc = row['region_code']
    
#     region_polygon = region_gdf.loc[[i]].copy()
#     region_polygon.to_crs(3005, inplace=True)
#     region_polygon.geometry = region_polygon.simplify(100).buffer(400)
#     region_polygon.to_crs(4326, inplace=True)
#     contained = gpd.sjoin(bcub_pts, region_polygon, how='inner', predicate='intersects')
#     if contained.empty:
#         continue
    
#     bcub_pts.loc[bcub_pts['Official_ID'].isin(contained['Official_ID'].values), 'region_code'] = rc
#     bcub_gdf.loc[bcub_gdf['Official_ID'].isin(contained['Official_ID'].values), 'region_code'] = rc

In [14]:
# assign region codes to the two added stations: 09AG003 (YKR) and 10ED002 (10E)
bcub_pts.loc[bcub_pts['Official_ID'] == '09AG003', 'region_code'] = 'YKR'
bcub_pts.loc[bcub_pts['Official_ID'] == '10ED002', 'region_code'] = '10E'
bcub_gdf.loc[bcub_gdf['Official_ID'] == '09AG003', 'region_code'] = 'YKR'
bcub_gdf.loc[bcub_gdf['Official_ID'] == '10ED002', 'region_code'] = '10E'

# make sure all rows have an associated region_code before collecting the codes
assert bcub_pts['region_code'].notna().all()

region_codes = sorted(bcub_pts['region_code'].unique())
print(len(bcub_pts), len(bcub_gdf))

1609 1609


### Extract terrain, climate, land cover, and soil attributes

Terrain attributes are extracted from the 1-arc-second (approximately 30 m) USGS 3DEP DEM, available from the USGS [National Map Downloader](https://apps.nationalmap.gov/downloader/#/).
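
One detail worth keeping in mind when averaging terrain attributes over a catchment is that aspect is an angle, so an arithmetic mean of pixel values breaks down near north: the mean direction of 350° and 10° should be north, not 180°. The sketch below shows a circular mean using only the standard library. It is an illustration under that assumption and does not claim to reproduce `calculate_slope_and_aspect`, whose implementation is not shown in this chapter; `circular_mean_deg` is a hypothetical name.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction of a set of angles in degrees, wrapped to the 0-360 range."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

aspects = [350.0, 10.0]                # two pixels facing just west and just east of north
print(sum(aspects) / len(aspects))     # arithmetic mean points south
print(circular_mean_deg(aspects))      # circular mean points (very nearly) north
```

Because of floating-point rounding, the second value may print as a number extremely close to 0 or to 360; both represent north.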

In the GLHYMPS dataset, the attribute names are truncated because shapefile (.shp) field names are limited to 10 characters:
* porosity: `Porosity_x`,
* permeability: `logK_Ice_x`
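
A catchment usually intersects several GLHYMPS polygons, so a single catchment value requires some form of aggregation. The sketch below assumes an area-weighted mean, which is a common choice but is not confirmed here as the method used by `get_soil_properties`; the function name, areas, and values are made up. It also shows the division by 100 that converts the `x100` fields back to physical values.

```python
def area_weighted_mean(values, areas):
    """Area-weighted mean of polygon attribute values (hypothetical helper)."""
    total_area = sum(areas)
    if total_area <= 0:
        raise ValueError('total area must be positive')
    return sum(v * a for v, a in zip(values, areas)) / total_area

# Made-up example: three clipped soil polygons inside one catchment
porosity_x100 = [19, 9, 28]        # values as stored in the truncated `Porosity_x` field
areas_km2 = [50.0, 30.0, 20.0]     # area of each clipped polygon

mean_x100 = area_weighted_mean(porosity_x100, areas_km2)
print(f'porosity_x100 = {mean_x100:.1f}, porosity fraction = {mean_x100 / 100:.3f}')
```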

In [15]:
# bcub_data_folder = '/home/danbot2/code_5820/large_sample_hydrology/bcub'
dem_folder = '/home/danbot/Documents/code/23/bcub/processed_data/processed_dem/'
local_data_folder = 'data/geospatial_layers/'
glhymps_path = os.path.join(local_data_folder, 'glhymps/GLHYMPS_clipped_3005.geojson')
nalcms_folder = os.path.join(local_data_folder, 'nalcms')
daymet_folder = os.path.join(local_data_folder, 'daymet')

In [16]:
nalcms_dict = {}
for y in [2010, 2015, 2020]:
    nalcms_fpath = os.path.join(nalcms_folder, f'NA_NALCMS_landcover_{y}_3005_clipped.tif')
    nalcms_dict[y] = rxr.open_rasterio(nalcms_fpath, mask_and_scale=True)

In [17]:
climate_dict = {}
for c in daymet_attributes:
    fpath = os.path.join(daymet_folder, f'{c}_mosaic_3005.tiff')
    climate_dict[c] = rxr.open_rasterio(fpath, mask_and_scale=True)

In [18]:
glhymps_data = gpd.read_file(glhymps_path)
glhymps_data.geometry = glhymps_data.geometry.make_valid()

### Re-process catchment attributes

In [19]:
from rioxarray.merge import merge_arrays

def get_merged_dem(basin_geom):
    # 10ED002 needs the DEMs of both the 10E and LRD regions
    r1_path = os.path.join(dem_folder, '10E_USGS_3DEP_3005.tif')
    r2_path = os.path.join(dem_folder, 'LRD_USGS_3DEP_3005.tif')
    # only the raster arrays are needed; the returned CRS and affine are unused
    r1_dem, _, _ = retrieve_raster(r1_path)
    r2_dem, _, _ = retrieve_raster(r2_path)

    merged_raster = merge_arrays([r1_dem, r2_dem])
    masked_raster = merged_raster.rio.clip(basin_geom.geometry, merged_raster.rio.crs)
    
    return masked_raster

In [20]:
def process_catchment_attributes(rc, row, region_dem, crs):

    stn_id = row['Official_ID']
    
    t0 = time.time()
    basin_data = {}
    basin_data['region'] = rc
    basin_data['Official_ID'] = stn_id
    basin_data['geometry'] = row['geometry']
    basin_data['Drainage_Area_km2'] = round(row['geometry'].area / 1e6, 1)
    basin_data['Centroid_Lon_deg_E'] = row['Centroid_Lon_deg_E']
    basin_data['Centroid_Lat_deg_N'] = row['Centroid_Lat_deg_N']
        
    basin_polygon = gpd.GeoDataFrame(geometry=[row['geometry']], crs=crs)  
    basin_polygon.geometry = basin_polygon.geometry.buffer(0)
    
    if not basin_polygon.is_valid.all():
        basin_polygon.geometry = basin_polygon.geometry.make_valid()
        if not basin_polygon.is_valid.all():
            raise ValueError(f'Could not repair invalid basin polygon geometry for {stn_id}.')
        else:
            print(f'Fixed invalid basin polygon geometry for {stn_id}.')

    # process soil attributes
    soil_masked = gpd.clip(glhymps_data, mask=basin_polygon)
    soil_masked = soil_masked[soil_masked.geometry.area > 1.0]   
    soil_masked.geometry = soil_masked.geometry.buffer(0)
    soil_masked.geometry = soil_masked.geometry.make_valid()
    
    assert all(soil_masked.is_valid)    
    porosity = get_soil_properties(soil_masked, 'Porosity_x')
    permeability = get_soil_properties(soil_masked, 'logK_Ice_x')
    basin_data['logk_ice_x100'] = round(permeability, 2)
    basin_data['porosity_x100'] = round(porosity, 5)
    del soil_masked
    
    # process NALCMS land cover
    for y in [2010, 2015, 2020]:
        # nalcms_fpath = os.path.join(nalcms_folder, f'NA_NALCMS_landcover_{y}_3005_clipped.tif')
        # clipped_land_cover = rxr.open_rasterio(nalcms_fpath, masked=True).rio.clip(basin_polygon.geometry, all_touched=True)
        clip_ok, clipped_nalcms = clip_raster_to_basin(basin_polygon, nalcms_dict[y])
        # row.name is the row's index label, which avoids relying on a global loop variable
        land_cover = process_lulc(row.name, basin_polygon, clipped_nalcms, y)
        land_cover = land_cover.to_dict('records')[0]
        basin_data.update(land_cover)

    del clipped_nalcms

    # process terrain
    # 10ED002 is a special case: its DEM is merged from the LRD and 10E rasters
    if stn_id == '10ED002':
        print(f'processing special case: {stn_id}')
        clipped_dem = get_merged_dem(basin_polygon)
    else:
        dem_fpath = os.path.join(dem_folder, f'{rc}_USGS_3DEP_3005.tif')
        assert os.path.exists(dem_fpath)
        clip_ok, clipped_dem = clip_raster_to_basin(basin_polygon, region_dem)

    # compute terrain attributes for every catchment, including the special case
    slope, aspect = calculate_slope_and_aspect(clipped_dem)
    basin_data['Slope_deg'] = slope
    basin_data['Aspect_deg'] = aspect

    mean_el, median_el, min_el, max_el = process_basin_elevation(clipped_dem)
    basin_data['median_el'] = median_el
    basin_data['mean_el'] = mean_el
    basin_data['max_el'] = max_el
    basin_data['min_el'] = min_el
    basin_data['Elevation_m'] = mean_el

    # process climate params
    del clipped_dem
    for climate_param in daymet_attributes:
        clip_ok, clipped_data = clip_raster_to_basin(basin_polygon, climate_dict[climate_param])
        # Check if the clipped raster is empty or has no data
        if clipped_data is None:
            print(f'clip is empty, finding nearest point from polygon centroid')
            # If the clipped raster is empty or contains only NaN, find the nearest value
            spatial_mean = find_nearest_raster_value(climate_dict[climate_param], basin_polygon)            
        else:
            spatial_mean = round(clipped_data.mean(dim=['y', 'x']).item(), 1)
            
        basin_data[climate_param] = spatial_mean
            # basin_polygon.to_file(f'{stn_id}_error.geojson')
            # raise Exception(f'issue with {climate_param}')
        
    return basin_data
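
The climate loop above falls back to `find_nearest_raster_value` when a clip returns nothing. That helper comes from `attribute_processing_functions` and is not shown here, so the following is only a sketch of one way such a lookup could work: a brute-force search for the closest non-missing cell centre. The grid layout, the function name `nearest_valid_value`, and the numbers are assumptions for illustration.

```python
import math

def nearest_valid_value(grid, x0, y0, cell_size, px, py):
    """Return the value of the non-missing cell whose centre is closest to (px, py).

    grid      : list of rows (top row first); missing cells are None or NaN
    x0, y0    : coordinates of the grid's upper-left corner
    cell_size : cell width and height in the same units as the coordinates
    """
    best_value, best_dist = None, math.inf
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue  # skip missing data
            cx = x0 + (c + 0.5) * cell_size  # cell-centre x
            cy = y0 - (r + 0.5) * cell_size  # cell-centre y (rows run north to south)
            dist = math.hypot(cx - px, cy - py)
            if dist < best_dist:
                best_value, best_dist = value, dist
    return best_value

# Made-up 2 x 3 grid of precipitation values with 1 km cells
prcp = [[float('nan'), 820.0, 790.0],
        [905.0, float('nan'), 760.0]]
print(nearest_valid_value(prcp, 0.0, 2000.0, 1000.0, 400.0, 1800.0))
```

The point in the example falls inside a missing cell, so the search returns the value of the closest cell that has data. A brute-force scan is adequate for a sketch, though a large raster would call for a spatial index or a windowed search.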


### Load the updated catchment geometries

In [21]:
all_basin_data = []
t0 = time.time()

results_df, processed_ids = pd.DataFrame(), []
if os.path.exists(updated_catchment_attribute_path):
    print(f'{updated_catchment_attribute_path.split("/")[-1]} exists, loading existing file.')
    results_df = gpd.read_file(updated_catchment_attribute_path)
    print(f'{len(results_df)} existing results loaded')
    processed_ids = results_df['Official_ID'].values.tolist()
else:
    print('Revisit the preceding chapter to generate revised catchment geometries.')

Revisit the preceding chapter to generate revised catchment geometries.


### Process monitored catchment attributes

In [None]:
batch_results = []
for rc in region_codes:
    print(f'Processing {rc} region catchments')
    batch_df = bcub_gdf[bcub_gdf['region_code'] == rc].copy()
    # batch_df = batch_df[~batch_df['Official_ID'].isin(processed_ids)].copy()
    dem_fpath = os.path.join(dem_folder, f'{rc}_USGS_3DEP_3005.tif')
    assert os.path.exists(dem_fpath)
    region_dem = rxr.open_rasterio(dem_fpath, mask_and_scale=True)
    for i, row in batch_df.iterrows():
        stn_id = row['Official_ID']

        result = process_catchment_attributes(rc, row, region_dem, bcub_gdf.crs)
        
        batch_results.append(result)
        processed_ids.append(stn_id)
        if (len(batch_results) % 200 == 0) | (len(processed_ids) >= len(bcub_gdf) - 1):
            new_results = gpd.GeoDataFrame(batch_results, crs='EPSG:3005')
            results_df = gpd.GeoDataFrame(pd.concat([results_df, new_results]), crs='EPSG:3005')    
            batch_results = []
            print('     ...saving output file.')
            results_df.to_file(updated_catchment_attribute_path, index=False)
            n_unique = results_df['Official_ID'].nunique()
            print(f'    ...saved {len(results_df)} results file ({n_unique} unique station ids).')

Processing 08A region catchments
Processing 08B region catchments
Processing 08C region catchments
Processing 08D region catchments
Processing 08E region catchments
Processing 08F region catchments
Processing 08G region catchments
Processing 10E region catchments
     ...saving output file.
    ...saved 400 results file (200 unique station ids).
processing special case: 10ED002


## Citations

```{bibliography}
:filter: docname in docnames
```