# Update HYSETS Catchment Polygons

The HYSETS dataset {cite}`arsenault2020comprehensive` includes an "artificial boundaries" flag that marks catchments whose boundary at the monitoring location is approximated, either because data are missing or because the delineation is uncertain, generally due to a small drainage area. Approximately 25% of the catchments in the study region (British Columbia and surrounding areas) carry this flag.

In July 2022, the Water Survey of Canada (WSC) published updated polygons for over 8000 catchments in Canada.  We find updated catchment boundaries from WSC and USGS where available.  Polygons updated from USGS and WSC official sources represent over 3/4 of the "artificial bounds" flagged catchments, and the remaining 92 are updated using reprocessed USGS 3DEP DEM and an algorithm matching the closest stream pixel (required for catchment delineation) to the reported drainage area. 

To start, download the HYSETS catchment polygons (`HYSETS_watershed_boundaries.zip`) from [that dataset's open data repository](https://osf.io/rpc3w/).

The resulting updated polygons are used in the next chapter/section to extract and validate attributes as part of the supporting information for technical validation of the associated publication.

## Import Data

The following files are needed for this notebook:

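* **HYSETS catchment polygons**: `HYSETS_watershed_boundaries.zip`, extracted to `data/catchment_polygons/HYSETS_watershed_boundaries/`
* **HYSETS station points**: `data/HYSETS_watershed_properties.txt`, which provides the station attributes and centroid coordinates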

### Import HYSETS data

In [1]:
import os
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from urllib.request import urlopen
import json

from bokeh.plotting import figure, show, output_file, save
from bokeh.io import output_notebook
import xyzservices.providers as xyz
output_notebook()

import data_processing_functions as dpf

### Set up working folders to organize temporary and final files

In [2]:
polygon_folder = os.path.join(os.getcwd(), 'data', 'catchment_polygons')
temp_folder = os.path.join(polygon_folder, 'temp')
updated_catchment_folder = os.path.join(polygon_folder, 'updated_catchment_set')
# create the working folders if they do not already exist
for f in [temp_folder, updated_catchment_folder]:
    os.makedirs(f, exist_ok=True)

In [3]:
# import the HYSETS attributes data
hysets_df = pd.read_csv('data/HYSETS_watershed_properties.txt', sep=';')
hysets_df.columns

Index(['Watershed_ID', 'Source', 'Name', 'Official_ID', 'Centroid_Lat_deg_N',
       'Centroid_Lon_deg_E', 'Drainage_Area_km2', 'Drainage_Area_GSIM_km2',
       'Flag_GSIM_boundaries', 'Flag_Artificial_Boundaries', 'Elevation_m',
       'Slope_deg', 'Gravelius', 'Perimeter', 'Flag_Shape_Extraction',
       'Aspect_deg', 'Flag_Terrain_Extraction', 'Land_Use_Forest_frac',
       'Land_Use_Grass_frac', 'Land_Use_Wetland_frac', 'Land_Use_Water_frac',
       'Land_Use_Urban_frac', 'Land_Use_Shrubs_frac', 'Land_Use_Crops_frac',
       'Land_Use_Snow_Ice_frac', 'Flag_Land_Use_Extraction',
       'Permeability_logk_m2', 'Porosity_frac', 'Flag_Subsoil_Extraction'],
      dtype='object')

### Import (BCUB) study region bounds

Get the region bounds from the BCUB dataset [https://doi.org/10.5683/SP3/JNKZVT](https://doi.org/10.5683/SP3/JNKZVT), or skip this step and use the pre-processed file (`data/study_region_stations.geojson`).

In [4]:
# import the BCUB (study) region boundary
region_gdf = gpd.read_file('data/BCUB_regions_4326.geojson')
region_gdf = region_gdf.to_crs(3005)
# simplify the geometries (100m threshold) and add a small buffer (500m) to 
# capture HYSETS station points recorded with low accuracy near boundaries
region_gdf.geometry = region_gdf.simplify(100).buffer(500)
# region_gdf = region_gdf.to_crs(4326)

In [5]:
# get the stations contained in the study region
centroids = hysets_df.apply(lambda x: Point(x['Centroid_Lon_deg_E'], x['Centroid_Lat_deg_N']), axis=1)
hysets_points = gpd.GeoDataFrame(hysets_df, geometry=centroids, crs='EPSG:4326')
hysets_points.to_crs(3005, inplace=True)
hysets_points.head(4)

Unnamed: 0,Watershed_ID,Source,Name,Official_ID,Centroid_Lat_deg_N,Centroid_Lon_deg_E,Drainage_Area_km2,Drainage_Area_GSIM_km2,Flag_GSIM_boundaries,Flag_Artificial_Boundaries,...,Land_Use_Water_frac,Land_Use_Urban_frac,Land_Use_Shrubs_frac,Land_Use_Crops_frac,Land_Use_Snow_Ice_frac,Flag_Land_Use_Extraction,Permeability_logk_m2,Porosity_frac,Flag_Subsoil_Extraction,geometry
0,1,HYDAT,SAINT JOHN RIVER AT FORT KENT,01AD002,47.25806,-68.59583,14703.9211,,0,0,...,0.0258,0.0089,0.0749,0.0242,0.0,1,-14.719327,0.180905,1,POINT (4899816.377 1923344.02)
1,2,HYDAT,ST. FRANCIS RIVER AT OUTLET OF GLASIER LAKE,01AD003,47.20661,-68.95694,1358.6435,,0,0,...,0.0219,0.0174,0.041,0.0414,0.0,1,-14.056491,0.20645,1,POINT (4884971.27 1899554.089)
2,3,HYDAT,MADAWASKA (RIVIERE) A 6 KM EN AVAL DU BARRAGE ...,01AD015,47.5385,-68.5918,2712.0,2693.814,1,0,...,0.0487,0.023,0.0351,0.06,0.0,1,-14.53739,0.165357,1,POINT (4877510.023 1944961.323)
3,4,HYDAT,FISH RIVER NEAR FORT KENT,01AE001,47.2375,-68.58278,2245.7638,,0,0,...,0.063,0.0115,0.0641,0.0528,0.0,1,-14.687869,0.170597,1,POINT (4902150.016 1922495.089)


Note that these are only the rows flagged as artificial bounds; below we also check for other corrections and updates from official sources.

### Load the original HYSETS polygons

In [7]:
hs_path = 'data/catchment_polygons/HYSETS_watershed_boundaries/HYSETS_watershed_boundaries_20200730.shp'
hs_polygons = gpd.read_file(hs_path)
hs_polygons = hs_polygons.set_crs(4326)

In [8]:
bcub_gdf = gpd.read_file('data/study_region_stations.geojson')
n_original_stns = len(bcub_gdf)
print(bcub_gdf.crs)
# the bcub geometries are (centroid) points from the HYSETS properties, 
# set the geometry to the HYSETS polygons instead
catchment_geometries = bcub_gdf.apply(lambda x: hs_polygons.loc[hs_polygons['OfficialID'] == x['Official_ID'], 'geometry'].values[0], axis=1)
bcub_gdf['geometry'] = catchment_geometries
bcub_gdf.to_crs(3005, inplace=True)
hs_polygons.to_crs(3005, inplace=True)

EPSG:4326


In [9]:
ab_df = bcub_gdf.loc[bcub_gdf['Flag_Artificial_Boundaries']== 1, :].copy()
print(f'{len(ab_df)}/{len(bcub_gdf)} catchment geometries are flagged "artificial bounds"')

408/1618 catchment geometries are flagged "artificial bounds"


In [10]:
gsim_df = bcub_gdf.loc[bcub_gdf['Flag_GSIM_boundaries']== 1, :].copy()
print(f'{len(gsim_df)}/{len(bcub_gdf)} catchment geometries are flagged as using GSIM boundaries')

243/1618 catchment geometries are flagged as using GSIM boundaries


In [11]:
wsc_ab = ab_df.loc[ab_df['Source'] == 'HYDAT', :].copy()
usgs_ab = ab_df.loc[ab_df['Source'] == 'USGS', :].copy()
print(f'{len(wsc_ab)}/{len(usgs_ab)} WSC/USGS artificial bounds flags ')

269/139 WSC/USGS artificial bounds flags 


### Check streamflow record length and completeness

In [12]:
for i, row in bcub_gdf.iterrows():
    stn_id = row['Official_ID']
    stn_df = dpf.get_timeseries_data(stn_id)
    bcub_gdf.loc[i, 'seasonal'] = dpf.check_if_seasonal(stn_df, 'time', stn_id)
    bcub_gdf.loc[i, 'n_complete_years'] = dpf.count_complete_years(stn_df, 'time', stn_id)

In [13]:
n_seasonal = len(bcub_gdf[bcub_gdf['seasonal']])
bcub_gdf = bcub_gdf[bcub_gdf['seasonal'] == False]
print(f'Filtering out {n_seasonal} seasonally recorded stations.  {len(bcub_gdf)} stations remaining.')

Filtering out 285 seasonally recorded stations.  1333 stations remaining.


In [14]:
record_lengths = []
# include the longest record itself (range excludes its upper bound)
for n in range(1, int(max(bcub_gdf['n_complete_years'])) + 1):
    n_stns = len(bcub_gdf[bcub_gdf['n_complete_years'] >= n])
    record_lengths.append((n, n_stns))
record_df = pd.DataFrame(record_lengths, columns=['n_years', 'n_stns'])

Visualize the tradeoff between record length and sample size.  The longer the minimum required record, the fewer stations there are to use in the sample.
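The counting behind this tradeoff can be sketched in plain Python. This is a minimal illustration only: the function name `stations_by_min_record` and the six record lengths are hypothetical and are not taken from the dataset.

```python
def stations_by_min_record(complete_years, max_years=None):
    """Return (n_years, n_stations) pairs, where n_stations counts the
    stations with at least n_years complete years of record."""
    if max_years is None:
        max_years = int(max(complete_years))
    return [
        (n, sum(1 for y in complete_years if y >= n))
        for n in range(1, max_years + 1)
    ]


# hypothetical record lengths (complete years) for six stations
example_years = [2, 5, 5, 12, 30, 41]
tradeoff = stations_by_min_record(example_years, max_years=12)
for n_years, n_stns in tradeoff:
    print(f"{n_years:>2} years: {n_stns} stations")
```

The station count can only stay the same or fall as the minimum record length increases, which is the shape the plot is expected to show.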

In [15]:
record_len_fig = figure(title='Number of (non-seasonal) stations by record length', width=400, height=250)
# x: minimum number of complete years required, y: stations meeting that minimum
record_len_fig.line(record_df['n_years'], record_df['n_stns'], line_width=3, color='dodgerblue')
record_len_fig.xaxis.axis_label = r'$$\text{Complete years of record}$$'
record_len_fig.yaxis.axis_label = r'$$\text{Number of stations}$$'
show(record_len_fig)

### Check for updated WSC polygons

The updated WSC catchments can be accessed at the Environment and Climate Change Canada (ECCC) [hydrometrics page](https://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www/HydrometricNetworkBasinPolygons/).  We only need to download the region files associated with the study region, corresponding to the first two digits of the station identifier (official id).

In [16]:
wsc_stn_df = bcub_gdf.loc[bcub_gdf['Source'] == 'HYDAT'].copy()
wsc_stns = wsc_stn_df['Official_ID']
prefixes = sorted(list(set([e[:2] for e in wsc_stns])))
prefixes

['05', '07', '08', '09', '10']

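Download the above files `<2-digit identifier>.zip` and extract them so that each station folder sits at `data/catchment_polygons/<2-digit identifier>/<station id>/`, the path expected by the functions below.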
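A minimal sketch of the extraction step, using only the standard library. The function name `extract_region_archives` and the `downloads` folder are assumptions for illustration. Whether each archive already contains a top-level folder named after the prefix is not stated here, so check the extracted layout against the path used in `retrieve_and_update_WSC_polygon`.

```python
import zipfile
from pathlib import Path


def extract_region_archives(prefixes, download_dir, target_dir):
    """Extract each `<prefix>.zip` found in download_dir into
    target_dir/<prefix>. Returns the prefixes with no archive on disk."""
    missing = []
    for prefix in prefixes:
        archive = Path(download_dir) / f"{prefix}.zip"
        if not archive.exists():
            missing.append(prefix)
            continue
        out_dir = Path(target_dir) / prefix
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
    return missing


# hypothetical download location; prefixes as listed for the study region
not_found = extract_region_archives(
    ["05", "07", "08", "09", "10"], "downloads", "data/catchment_polygons"
)
print("archives not found:", not_found)
```

Returning the missing prefixes rather than raising makes it easy to see which region files still need to be downloaded.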

In [17]:
# find updated catchment polygons
def retrieve_and_update_WSC_polygon(stn_id):
    """
    Returns an updated WSC polygon if it exists or returns an empty (geo)dataframe.
    """
    stn_prefix = stn_id[:2]
    catchment_path = f'data/catchment_polygons/{stn_prefix}/{stn_id}/{stn_id}_DrainageBasin_BassinDeDrainage.shp'
    if os.path.exists(catchment_path):
        print(f'Updated polygon found for {stn_id}')
        updated_polygon = gpd.read_file(catchment_path)
        updated_polygon.to_crs(3005, inplace=True)
        return updated_polygon
    return gpd.GeoDataFrame()

def retrieve_and_update_WSC_station_location(stn_id):
    """
    Returns an updated WSC station location if it exists or returns an empty (geo)dataframe.
    """
    stn_prefix = stn_id[:2]
    file_path = f'data/catchment_polygons/{stn_prefix}/{stn_id}/{stn_id}_Station.shp'
    if os.path.exists(file_path):
        print(f'Updated station location found for {stn_id}')
        updated_pt = gpd.read_file(file_path)
        updated_pt.to_crs(3005, inplace=True)
        return updated_pt
    return gpd.GeoDataFrame()

### Check for updated USGS polygons

In [18]:
# api url for nwis sites
usgs_api_url = 'https://labs.waterdata.usgs.gov/api/nldi/linked-data/nwissite/'

In [20]:
def retrieve_usgs_stn_data(stn):
    # query the NWIS with the station number to get the station coordinates    
    try:
        query_url = usgs_api_url + f'USGS-{stn}'
        usgs_data = pd.read_json(query_url)
        usgs_stn_loc = usgs_data['features'][0]['geometry']['coordinates']
        stn_pt = Point(*usgs_stn_loc)
        return gpd.GeoDataFrame(geometry=[stn_pt], crs=4326)
    except Exception as ex:
        msg = f'USGS station query failed for {stn}. {ex}'
        print(msg)
        return pd.DataFrame()
    

def usgs_basin_polygon_query(url):
    """USGS polygons are in EPSG:4326 crs"""
    response = urlopen(url)
    json_data = response.read().decode('utf-8', 'replace')
    d = json.loads(json_data)
    return gpd.GeoDataFrame.from_features(d['features'], crs='EPSG:4326')


def retrieve_usgs_basin_data(stn):
    """Retrieve the USGS basin polygon and station location from the NLDI API. 
    If there is no basin for the station, use the NLDI to retrieve upstream 
    and downstream boundaries.  
    Pick the one closest in (HYSETS published) area to the station location."""    
    
    # query the basin polygon from USGS
    basin_query = usgs_api_url + f'USGS-{stn}/basin?simplified=false&splitCatchment=false'    
    try:
        usgs_basin_df = usgs_basin_polygon_query(basin_query)
        # dissolve the basin polygons
        usgs_basin_df = usgs_basin_df.dissolve()
        usgs_basin_df = usgs_basin_df.to_crs(3005)
        # check if geometry is multipolygon
        if usgs_basin_df.geometry.type.values[0] == 'MultiPolygon':
            print('   ...MultiPolygon detected, attempting to make geometry valid.')
            usgs_basin_df = usgs_basin_df.explode()
            usgs_basin_df['area'] = usgs_basin_df.geometry.area / 1E6
            usgs_basin_df['area_pct'] = usgs_basin_df['area'] / usgs_basin_df['area'].sum()
            usgs_basin_df = usgs_basin_df[usgs_basin_df['area_pct'] > 0.95]
            if len(usgs_basin_df) > 1:
                raise Exception('USGS basin polygon query returned multiple polygons.')

        return usgs_basin_df
    except Exception as ex:
        print(f'USGS basin polygon query failed for {stn}.  {ex}')
        return pd.DataFrame()

In [21]:
intermediate_step_path = 'data/updated_geometries_intermediate.geojson'
if os.path.exists(intermediate_step_path):
    print('loading existing file')
    bcub_gdf = gpd.read_file(intermediate_step_path)
else:
    bcub_gdf['geometry_updated'] = False
    # set a variable to prevent jupyter book from 
    # processing during book building step
    process_step = True
    if process_step:
        for i, row in bcub_gdf.iterrows():
            stn_id = row['Official_ID']
            source = row['Source']
            if source == 'HYDAT':
                # pt = retrieve_and_update_WSC_station_location(stn_id)
                polygon = retrieve_and_update_WSC_polygon(stn_id)            
            elif source == 'USGS':
                # pt = retrieve_usgs_stn_data(stn_id)
                polygon = retrieve_usgs_basin_data(stn_id)    
            if not polygon.empty:
                assert polygon.crs == 'EPSG:3005'
                pt = polygon.geometry.centroid
                pt = pt.to_crs(4326)
                lon, lat = pt.geometry.x[0], pt.geometry.y[0]
                bcub_gdf.loc[i, 'Centroid_Lat_deg_N'] = lat
                bcub_gdf.loc[i, 'Centroid_Lon_deg_E'] = lon
                bcub_gdf.loc[i, 'geometry'] = polygon.geometry.values[0]
                bcub_gdf.loc[i, 'geometry_updated'] = True
                bcub_gdf.loc[i, 'Flag_Artificial_Boundaries'] = 0
                bcub_gdf.loc[i, 'Flag_GSIM_boundaries'] = 0
            else:
                polygon = hs_polygons.loc[hs_polygons['OfficialID'] == stn_id, 'geometry'].copy()
                bcub_gdf.loc[i, 'geometry'] = polygon.geometry.values[0]
    
    updated_official = bcub_gdf.copy()
    updated_official.to_file(intermediate_step_path)

loading existing file


In [22]:
# add back in 08MH045 and 12388650 to recheck
bcub_gdf.loc[bcub_gdf['Official_ID'].isin(['08MH045','12388650']), 'Flag_Artificial_Boundaries'] = 1

In [23]:
ab_df_remaining = bcub_gdf.loc[bcub_gdf['Flag_Artificial_Boundaries'] == 1, :].copy()
gsim_remaining = bcub_gdf.loc[bcub_gdf['Flag_GSIM_boundaries'] == 1, :].copy()
print(f'{len(ab_df_remaining)}/{len(bcub_gdf)} catchment geometries remain flagged "artificial bounds" ({len(ab_df) - len(ab_df_remaining)} updated.)')
print(f'{len(gsim_remaining)}/{len(bcub_gdf)} catchment geometries remain flagged as using GSIM bounds.')

87/1333 catchment geometries remain flagged "artificial bounds" (321 updated.)
4/1333 catchment geometries remain flagged as using GSIM bounds.


### Delineate the remaining catchments from 1 arc-second USGS 3DEP DEM.

For all remaining rows flagged by `Flag_Artificial_Boundaries` or `Flag_GSIM_boundaries`, follow the methodology in the [BCUB dataset demo](https://dankovacek.github.io/bcub_demo/notebooks/2_DEM_Preprocessing.html) and reprocess the catchment bounds.

The code below assumes the underlying DEM has been hydraulically conditioned and flow direction, flow accumulation, and stream rasters have been generated for the area corresponding to each monitoring station catchment.  
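Before processing, it can help to confirm that the expected rasters exist for each region. The sketch below is an assumption-laden helper, not part of the original workflow: the function name `find_missing_rasters` and the `processed_dem` folder are hypothetical, and only the flow direction (`fdir`) and flow accumulation (`accum`) file name patterns used later in this notebook are checked.

```python
from pathlib import Path


def find_missing_rasters(dem_folder, region_codes, products=("fdir", "accum")):
    """Return the expected raster paths that do not exist on disk."""
    missing = []
    for rc in region_codes:
        for product in products:
            path = Path(dem_folder) / f"{rc}_USGS_3DEP_3005_{product}.tif"
            if not path.exists():
                missing.append(path)
    return missing


# hypothetical folder and region codes, for illustration only
for path in find_missing_rasters("processed_dem", ["08A", "08B"]):
    print(f"missing: {path}")
```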

In [24]:
# given the station location, get the corresponding (sub-)region code
# since rasters are broken up into regions for dem processing and 
# catchment delineation
if 'index_right' in ab_df_remaining.columns:
    ab_df_remaining.drop(columns=['index_right'], axis=1, inplace=True)

# ensure that all remaining points have a region code
ab_stns = ab_df_remaining['Official_ID'].values
gsim_stns = gsim_remaining['Official_ID'].values

# make sure no stations are included twice
assert len(np.intersect1d(ab_stns, gsim_stns)) == 0
remaining_stns = gpd.GeoDataFrame(pd.concat([ab_df_remaining, gsim_remaining]), crs=ab_df_remaining.crs)
print(f'{len(remaining_stns)} stations remain to re-process')

91 stations remain to re-process


## Process stations individually

In order to facilitate an iterative validation process for delineating catchment bounds that are uncertain, the stations will be processed one by one as follows:

:::{prf:algorithm} Catchment delineation
:label: catchment-validation

**Inputs** An approximate streamflow monitoring station location $L$, a flow accumulation raster $C$ with sufficient spatial coverage, an allowable range of expected flow accumulation $[\text{acc}_{min}, \text{acc}_{max}]$, and a maximum allowable distance $d_{max}$ from the recorded station location.

**Output** a) HYSETS pour point (station location), b) HYSETS "artificial bounds" polygon, c) adjusted pour point, d) adjusted (approximate) catchment boundary, and e) (nearby) stream network vector lines.  These geometries are rendered in standalone interactive HTML documents for each station to facilitate manual validation and iteration.

1. For each station location $L$
    1. Find the stream pixel $C^*_{ij}$ closest to $L$ among all valid flow accumulation raster cells satisfying $\text{acc}_{min} \leq C_{ij} \leq \text{acc}_{max}$ and lying within $d_{max}$ of $L$.
    2. Delineate the catchment from $C^*_{ij}$ using the D8 flow direction raster.
:::
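The search step of the algorithm can be illustrated on a toy grid in plain Python. This is a sketch under stated assumptions, not the notebook's implementation: the function name `nearest_cell_in_range`, the 4x4 accumulation grid, the 100 m cell size, and the station coordinates are all hypothetical.

```python
import math


def nearest_cell_in_range(acc, origin, cell_size, station_xy,
                          acc_min, acc_max, d_max):
    """Find the cell closest to the station whose accumulation lies in
    [acc_min, acc_max]. `acc` is a list of rows, `origin` is the (x, y)
    of the upper-left grid corner. Returns ((row, col), distance), or
    (None, inf) if no qualifying cell lies within d_max."""
    x0, y0 = origin
    sx, sy = station_xy
    best, best_dist = None, math.inf
    for i, row in enumerate(acc):
        for j, value in enumerate(row):
            if not (acc_min <= value <= acc_max):
                continue
            # cell centre coordinates (y decreases as the row index grows)
            cx = x0 + (j + 0.5) * cell_size
            cy = y0 - (i + 0.5) * cell_size
            dist = math.hypot(cx - sx, cy - sy)
            if dist < best_dist:
                best, best_dist = (i, j), dist
    if best_dist > d_max:
        return None, math.inf
    return best, best_dist


# hypothetical accumulation grid (cell counts) with a stream on the diagonal
acc = [
    [1, 2, 1, 1],
    [1, 40, 3, 1],
    [1, 2, 95, 1],
    [1, 1, 2, 110],
]
cell, dist = nearest_cell_in_range(
    acc, origin=(0.0, 400.0), cell_size=100.0, station_xy=(120.0, 260.0),
    acc_min=80, acc_max=120, d_max=250.0,
)
print(cell, round(dist, 1))
```

In this toy case the cell with accumulation 40 is the closest stream cell to the station, but it is excluded because it falls outside the expected accumulation range, so the search moves further downstream.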


### Find the nearest pixel in the stream raster corresponding to the point geometry

Note that where the catchment geometry is missing in HYSETS, the centroid point is just the station location, and the geometry is simply a square centred at the station location and with an area equal to the drainage area reported by the official source {cite}`arsenault2020comprehensive`.  We then find the nearest stream pixel to the "centroid" which is not actually the centroid but the reported station location.  
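To make the placeholder geometry concrete, the sketch below builds the bounds of a square with a given area centred on a point. It assumes projected coordinates in metres; whether HYSETS constructs the square in projected or geographic coordinates is not stated here, and the function names and example coordinates are hypothetical.

```python
import math


def placeholder_square(x, y, area_km2):
    """Return (minx, miny, maxx, maxy) of a square centred at (x, y)
    with the given area. Coordinates are assumed to be in metres."""
    half_side = math.sqrt(area_km2 * 1e6) / 2
    return (x - half_side, y - half_side, x + half_side, y + half_side)


def bounds_centroid(bounds):
    """Centroid of an axis-aligned rectangle given as bounds."""
    minx, miny, maxx, maxy = bounds
    return ((minx + maxx) / 2, (miny + maxy) / 2)


# hypothetical station location and a reported drainage area of 25 km^2
bounds = placeholder_square(1_000_000.0, 500_000.0, 25.0)
print(bounds)
print(bounds_centroid(bounds))
```

The centroid of such a square is the station location itself, which is why the HYSETS "centroid" of a flagged catchment can be used as the pour point.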

Historical (discontinued) stations are difficult to align automatically with the correct stream location because their geographic locations (representing pour points) were often not recorded precisely.  Catchment delineation requires an x,y input that aligns precisely with the stream network raster, meaning greater pour point precision is required as resolution increases.  Fewer than 100 catchments in the study region remain flagged by `Flag_Artificial_Boundaries`, which is a reasonable number to validate "manually", as is done here.

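The process of validating these final catchments is as follows:

* save the original HYSETS catchment polygon and station location,
* snap the station location to the nearest stream cell with a flow accumulation in the expected range,
* delineate a new catchment from the adjusted pour point,
* extract the nearby stream network as vector lines, and
* plot these geometries together in an interactive HTML document for manual review.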

In [25]:
import time
import xarray as xr
import rioxarray as rxr
from shapely.geometry import Point, box

from attribute_processing_functions import clip_raster_to_basin

import whitebox 
wbt = whitebox.WhiteboxTools()
wbt.verbose = True

In [26]:
def retrieve_raster(filename):
    """
    Take in a file name and return the raster data, 
    the coordinate reference system, and the affine transform.
    """
    raster = rxr.open_rasterio(filename, mask_and_scale=True)
    crs = raster.rio.crs
    affine = raster.rio.transform(recalc=False)
    return raster, crs.to_epsg(), affine


def affine_map_vec(affine, x, y):
    a, b, c, d, e, f, _, _, _ = affine
    n = x.size
    new_x = np.zeros(n, dtype=np.float64)
    new_y = np.zeros(n, dtype=np.float64)
    for i in range(n):
        new_x[i] = x[i] * a + y[i] * b + c
        new_y[i] = x[i] * d + y[i] * e + f
    return new_x, new_y


def snap_pour_point(raster_filepath, pt, area, distance_tol=250):
    raster, crs, affine = retrieve_raster(raster_filepath)     
    acc = raster.squeeze()

    dx, dy = abs(raster.rio.resolution()[0]), abs(raster.rio.resolution()[1])
    
    # Determine area fraction
    area_frac = 2 if area > 10 else 5
    expected_cells = int((area * 1e6) / (dx * dy))
    
    # Calculate min and max cells
    min_cells = int((1 / area_frac) * expected_cells)
    max_cells = int(area_frac * expected_cells)
        
    # Get potential stream cells within the expected range
    yi, xi = np.where((acc >= min_cells) & (acc <= max_cells))
        
    if len(yi) == 0 or len(xi) == 0:
        print('No points returned meeting accumulation criteria.')
        return None, np.inf
    
    # Convert to coordinates
    affine_tup = tuple(raster.rio.transform(recalc=False))
    x_coords, y_coords = affine_map_vec(affine_tup, xi, yi)

    # Calculate distances and find the nearest stream cell
    stn_coords = (pt.geometry.x.values[0], pt.geometry.y.values[0])
    dists = np.sqrt((x_coords - stn_coords[0])**2 + (y_coords - stn_coords[1])**2)
        
    if len(dists) == 0 or np.min(dists) > distance_tol:
        print('No points found within distance tolerance.')
        return None, np.inf
    
    min_idx = np.argmin(dists)
    x_snap, y_snap = x_coords[min_idx] + 0.5 * dx, y_coords[min_idx] - 0.5 * dy
    
    return Point(x_snap, y_snap), np.min(dists)

In [27]:
dem_folder = '/home/danbot2/code_5820/large_sample_hydrology/bcub/processed_data/processed_dem/'
# dem_folder = '/home/danbot/Documents/code/23/bcub/processed_data/processed_dem/'

In [28]:
def delineate_new_catchment(pour_pt_path, rc, stn_id, stn_folder):
    d8_pntr = os.path.join(dem_folder, f'{rc}_USGS_3DEP_3005_fdir.tif')
    assert os.path.exists(d8_pntr)
    output = os.path.join(stn_folder, f'{stn_id}_basin.tif')
    
    if not os.path.exists(output):        
        wbt.watershed(
            d8_pntr, 
            pour_pt_path, 
            output, 
            esri_pntr=False, 
        )
    return output

In [29]:
def generate_stream_vectors(stn_data, adjusted_catchment_path, stn_folder):
    catchment = gpd.read_file(adjusted_catchment_path)
    # add a buffer to the clip to get a wider picture of streams
    catchment.geometry = catchment.geometry.buffer(500)
    stn_id = stn_data['Official_ID'].values[0]
    rc = stn_data['region_code'].values[0]
    
    # open and clip the stream raster
    acc_file = f'{rc}_USGS_3DEP_3005_accum.tif'
    acc_fpath = os.path.join(dem_folder, acc_file)
    acc_raster, crs, _ = retrieve_raster(acc_fpath)
    nodata_value = acc_raster.rio.nodata
    
    raster_res = acc_raster.rio.resolution()
    cell_area = abs(raster_res[0] * raster_res[1])
    
    assert acc_raster.rio.crs == catchment.crs

    geoms = [stn_data.geometry.values[0], catchment.geometry.values[0]]
    combined_geom = gpd.GeoDataFrame(geometry=geoms, crs=stn_data.crs)
    # cvx_hull = combined_geom.dissolve().convex_hull
    # Get the bounding box of the combined geometries
    bbox = combined_geom.total_bounds  # returns (minx, miny, maxx, maxy)

    # Create a Polygon from the bounding box and buffer it by 500 meters
    bbox_polygon = box(bbox[0], bbox[1], bbox[2], bbox[3])
    bbox_buffered = bbox_polygon.buffer(500)
    
    # Create a GeoDataFrame for the buffered bounding box
    bbox_mask = gpd.GeoDataFrame(geometry=[bbox_buffered], crs=stn_data.crs)
    
    # clip the raster with the catchment polygon    
    clip_ok, clipped_acc_raster = clip_raster_to_basin(bbox_mask, acc_raster)   
    
    # Save the clipped raster to a file
    acc_clip_fpath = os.path.join(stn_folder, f'{stn_id}_clipped_acc.tif')
    clipped_acc_raster.rio.to_raster(acc_clip_fpath, crs=acc_raster.rio.crs, nodata=np.nan)
    
    if clip_ok:
        # drop the raster to save memory
        del acc_raster
    # set the minimum area to 1 km^2 for filtering 
    # the accumulation for stream pixels 
    min_cells = int(1e6 / cell_area) 
    
    # render the streams raster from the clip
    streams_temp = os.path.join(stn_folder, f'{stn_id}_streams.tif')
    wbt.extract_streams(
        acc_clip_fpath, 
        streams_temp, 
        min_cells, 
        zero_background=False, 
    )
    
    assert os.path.exists(streams_temp)
    
    stream_vector_output = os.path.join(stn_folder, f'{stn_id}_stream_vectors.shp')
    
    wbt.raster_to_vector_lines(
        streams_temp, 
        stream_vector_output,
    )
    
    assert os.path.exists(stream_vector_output)
        
    return stream_vector_output
    

In [30]:
def raster_to_vector_basin(rc, catchment_raster_fpath, stn_id, stn_folder):
    """
    Convert a basin raster to a polygon and keep only the largest part,
    in case the raster-to-vector conversion yields more than one polygon.
    """
    raster, crs, affine = retrieve_raster(catchment_raster_fpath)
    polygon_path = os.path.join(stn_folder, f'{rc}_temp_polygon.shp')

    # vectorize the basin raster written by wbt.watershed
    if not os.path.exists(polygon_path):
        wbt.raster_to_vector_polygons(
            catchment_raster_fpath,
            polygon_path,
        )
    
    gdf = gpd.read_file(polygon_path)
    if gdf.crs is None:
        # fall back to the CRS of the source raster
        gdf = gdf.set_crs(crs)
    gdf = gdf.explode(index_parts=False)
    gdf.reset_index(inplace=True)
    gdf['area'] = gdf.geometry.area
    gdf = gdf[gdf.index == gdf['area'].idxmax()]
    # drop the raster from memory
    del raster
    return gdf

In [31]:
remaining_stns[remaining_stns['Centroid_Lat_deg_N'] < 0]

Unnamed: 0,Watershed_ID,Source,Name,Official_ID,Centroid_Lat_deg_N,Centroid_Lon_deg_E,Drainage_Area_km2,Drainage_Area_GSIM_km2,Flag_GSIM_boundaries,Flag_Artificial_Boundaries,...,Flag_Land_Use_Extraction,Permeability_logk_m2,Porosity_frac,Flag_Subsoil_Extraction,region_code,seasonal,n_complete_years,geometry_updated,geometry,index_right
1248,14159,USGS,Camas Creek near Hot Springs MT,12388650,-114.719459,47.500063,11.5513,,0,1,...,1,,,0,CLR,False,4.0,True,"POLYGON ((1852094.145 342272.647, 1852075.122 ...",


In [32]:
remaining_stns = remaining_stns.sort_values(by=['region_code'])
remaining_stns.reset_index(inplace=True, drop=True)
pour_pt_filenames = []

for i, row in remaining_stns.iterrows():
    
    stn_id = row['Official_ID']
    rc = row['region_code']
    area = row['Drainage_Area_km2']
    
    stn_folder = os.path.join(updated_catchment_folder, stn_id)

    if not os.path.exists(stn_folder):
        os.makedirs(stn_folder)
    # accumulation raster path
    acc_dem_path = os.path.join(dem_folder, f'{rc}_USGS_3DEP_3005_accum.tif')
    if not os.path.exists(acc_dem_path):
        print('missing ', acc_dem_path)
        continue
    assert os.path.exists(acc_dem_path), f'{acc_dem_path} not found'
    
    print(f'processing {stn_id} in {rc} region ({area:.2f} km²)')
    stn_data = remaining_stns.loc[[i]].copy()

    # 1) save the original HYSETS catchment geometry 
    hs_polygon_path = os.path.join(stn_folder, f'{stn_id}_HYSETS_polygon.geojson')
    if not os.path.exists(hs_polygon_path):
        stn_data.to_file(hs_polygon_path)

    # 2) save the original HYSETS station location
    hs_pt_path = os.path.join(stn_folder, f'{stn_id}_HYSETS_pt.geojson')
    if not os.path.exists(hs_pt_path):
        print('    ...processing HYSETS pour point')
        pt_x, pt_y = stn_data['Centroid_Lon_deg_E'].values[0], stn_data['Centroid_Lat_deg_N'].values[0]
        hs_pt = Point(pt_x, pt_y)
        if pt_y < 0:
            hs_pt = Point(pt_y, pt_x) # the order is reversed
        hs_pt_df = gpd.GeoDataFrame(geometry=[hs_pt], crs='EPSG:4326')
        hs_pt_df.to_crs(3005, inplace=True)
        hs_pt_df.to_file(hs_pt_path)
    else:
        hs_pt_df = gpd.read_file(hs_pt_path)

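    # NOTE: this `continue` skips steps 3-5 below for every station;
    # remove it to (re)process the adjusted pour points and catchments
    continue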

    # 3) find the nearest stream cell to the reported station location
    adjusted_ppt_path = os.path.join(stn_folder, f'{stn_id}_adjusted_ppt.geojson')
    adjusted_ppt_path_shp = os.path.join(stn_folder, f'{stn_id}_adjusted_ppt.shp')    
    if not (os.path.exists(adjusted_ppt_path) or os.path.exists(adjusted_ppt_path_shp)):
        print('    ...processing adjusted pour point')
        nearest_pt, distance = snap_pour_point(acc_dem_path, hs_pt_df, area, distance_tol=1000)
        if nearest_pt is None:
            print(f'{stn_id}: no point returned within expected drainage area range.')
            continue
        adj_pt = gpd.GeoDataFrame(geometry=[nearest_pt], crs=remaining_stns.crs)
        adj_pt['Official_ID'] = stn_id
        adj_pt.to_file(adjusted_ppt_path)
        adj_pt.to_file(adjusted_ppt_path_shp)
        
        
    # 4) delineate a new catchment from the adjusted point
    adjusted_catchment_path = os.path.join(stn_folder, f'{stn_id}_adjusted_catchment.geojson')
    if not os.path.exists(adjusted_catchment_path):
        print('    ...delineating basin raster')
        catchment_raster_fpath = delineate_new_catchment(adjusted_ppt_path_shp, rc, stn_id, stn_folder)
        adjusted_catchment = raster_to_vector_basin(rc, catchment_raster_fpath, stn_id, stn_folder)
        adjusted_catchment.to_file(adjusted_catchment_path)
        
    # 5) save streamlines as a vector within some distance of the catchment polygon / ppt.
    streams_path = os.path.join(stn_folder, f'{stn_id}_streams.geojson')
    if not os.path.exists(streams_path):
        print('    ...processing streams vectors')
        streams_temp_path = generate_stream_vectors(stn_data, adjusted_catchment_path, stn_folder)
        gdf = gpd.read_file(streams_temp_path)
        gdf.to_file(streams_path)
        
        remove_extensions = ['.dbf', '.prj', '.shp', '.shx', '.tif', '.cpg']
        if os.path.exists(adjusted_catchment_path):
            for f in os.listdir(stn_folder):
                if any([f.endswith(e) for e in remove_extensions]):
                    os.remove(os.path.join(stn_folder, f))
        
        print(f'   ...processed {stn_id}, saved to {streams_path}')  

processing 15195000 in 08A region (20.59 km²)
processing 15129600 in 08A region (6.47 km²)
processing 15129510 in 08A region (12.38 km²)
processing 15041200 in 08B region (17093.84 km²)
processing 15056095 in 08B region (7.56 km²)
processing 15056280 in 08B region (11.89 km²)
processing 15057580 in 08B region (26.16 km²)
processing 15106940 in 08B region (11.60 km²)
processing 15106920 in 08B region (26.42 km²)
processing 15102200 in 08B region (6.53 km²)
processing 15101490 in 08B region (22.33 km²)
processing 15101200 in 08B region (5.91 km²)
processing 15100000 in 08B region (45.32 km²)
processing 15099900 in 08B region (27.97 km²)
processing 15094000 in 08B region (19.19 km²)
processing 15093200 in 08B region (6.89 km²)
processing 15056070 in 08B region (24.16 km²)
processing 15087700 in 08B region (31.08 km²)
processing 15087690 in 08B region (26.16 km²)
processing 15087618 in 08B region (11.11 km²)
processing 15087590 in 08B region (7.04 km²)
processing 15087565 in 08B region (39

## Create plots to visualize the results of adjusting pour points to the derived stream network

### Create validation plots set

Create and save interactive HTML plots to compare the HYSETS "artificial bounds" with catchment boundaries delineated by snapping pour points to the nearest flow accumulation cell within 5% of the expected area for basins $> 10 \text{ km}^2$ (2.5% otherwise).  A maximum search distance tolerance of 250 m is also set.
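The accumulation window implied by an expected drainage area can be worked through as follows. This is a hedged sketch: the 25 m cell size and 12 km² area are hypothetical, and the function names are not from the notebook. Two forms are shown because this paragraph states the tolerance as a percentage of the expected area, while `snap_pour_point` brackets the expected cell count with a multiplicative factor.

```python
def expected_cells(area_km2, dx, dy):
    """Number of raster cells covering area_km2, for cells of dx by dy metres."""
    return (area_km2 * 1e6) / (dx * dy)


def window_by_fraction(n_cells, rel_tol):
    """Window of +/- rel_tol (e.g. 0.05 for 5%) around the expected count."""
    return n_cells * (1 - rel_tol), n_cells * (1 + rel_tol)


def window_by_factor(n_cells, factor):
    """Window from n_cells / factor to n_cells * factor."""
    return n_cells / factor, n_cells * factor


# hypothetical 12 km^2 basin on a 25 m raster
n = expected_cells(12.0, 25.0, 25.0)
print(n)
print(window_by_fraction(n, 0.05))
print(window_by_factor(n, 2))
```

A percentage window is much narrower than a factor window, so the choice between them determines how many candidate stream cells survive the accumulation filter before the distance tolerance is applied.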

In [33]:
usgs_tiles = xyz['USGS']['USTopo']

# dir(tiles)
tiles = usgs_tiles

In [34]:
def format_pt_for_plotting(pt_df):
    df = pt_df.copy()
    df.to_crs(3857, inplace=True)
    x, y = df.geometry.values[0].x, df.geometry.values[0].y
    return x, y

def format_poly_for_plotting(poly_df):
    df = poly_df.copy()
    df = df.explode(index_parts=False)
    df.reset_index(inplace=True, drop=True)
    df['area'] = df.geometry.area / 1e6
    df = df[df.index == df['area'].idxmax()]
    df.to_crs(3857, inplace=True)
    x, y = df.exterior.geometry.values[0].coords.xy
    return x, y


def format_linestring_for_plotting(line_df):
    # Prepare data for Bokeh
    df = line_df.copy()
    df.to_crs(3857, inplace=True)
    xs = []
    ys = []

    for geom in df.geometry:
        if geom.geom_type == 'MultiLineString':
            for line in geom.geoms:
                xs.append(list(line.xy[0]))
                ys.append(list(line.xy[1]))
        elif geom.geom_type == 'LineString':
            xs.append(list(geom.xy[0]))
            ys.append(list(geom.xy[1]))

    return xs, ys


def create_new_pt(x, y, pt_crs):
    pt = Point(x, y)
    return gpd.GeoDataFrame(geometry=[pt], crs=pt_crs)

    
def pour_point_plot(stn_id, hs_pt, hs_poly, adj_pt, adj_poly, streams, add_pt=None, pt_crs='EPSG:4326', adjusted=False):
    """
    Note if you add a point, it should be in decimal degrees (EPSG:4326) unless otherwise specified with pt_crs.
    """
    
    name = hysets_df[hysets_df['Official_ID'] == stn_id]['Name'].values[0]    
        
    p = figure(title=f'{stn_id} ({name})', x_axis_type="mercator", y_axis_type="mercator")
    
    ppt_x, ppt_y = format_pt_for_plotting(adj_pt)
    stn_x, stn_y = format_pt_for_plotting(hs_pt)
    new_poly_x, new_poly_y = format_poly_for_plotting(adj_poly)
    hs_poly_x, hs_poly_y = format_poly_for_plotting(hs_poly)
    
    stream_x, stream_y = format_linestring_for_plotting(streams)
    
    hs_da = hs_poly.geometry.area.values[0] / 1e6
    new_area = adj_poly.geometry.area.values[0] / 1e6
    
    # Add the transparent polygon to the plot
    p.patch(hs_poly_x, hs_poly_y, fill_alpha=0.3, line_color='green', fill_color='green',
           legend_label=f'HYSETS ({hs_da:.1f} km²)')
    p.patch(new_poly_x, new_poly_y, fill_alpha=0.3, line_color='orange', fill_color='orange',
           legend_label=f'BCUB ({new_area:.1f} km²)')
    
    p.scatter([stn_x], [stn_y], marker='o', size=14, color='green',
              legend_label=f'HYSETS stn')
    p.scatter([ppt_x], [ppt_y], marker='star', size=16, color='orange', line_color='black',
              legend_label=f'BCUB ppt')
    
    if add_pt is not None:
        print(add_pt)
        add_x, add_y = add_pt
        add_pt_df = create_new_pt(add_x, add_y, pt_crs)
        pt_x, pt_y = format_pt_for_plotting(add_pt_df)
        p.scatter([pt_x], [pt_y], marker='triangle', size=18, color='salmon', line_color='black',
              legend_label=f'Added pt')
        
    
    # Add a MultiLine glyph
    if new_area < 1000:
        p.multi_line(xs=stream_x, ys=stream_y, line_width=2, color='blueviolet', line_dash='dashed',
                    legend_label='3DEP streamline')    

    p.add_tile(tiles, retina=True)
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    p.legend.click_policy = 'hide'
    return p
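
The helpers above reproject to EPSG:3857 because the figure uses `x_axis_type="mercator"` with a web tile provider. For reference, the spherical Web Mercator forward transform can be written directly. The sketch below assumes the standard sphere radius of 6378137 m, and `lonlat_to_web_mercator` is a hypothetical helper name, not part of the notebook or of geopandas.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert decimal degrees (EPSG:4326) to Web Mercator metres (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# coordinates of one of the manually repositioned pour points
x, y = lonlat_to_web_mercator(-132.87492, 55.36158)
print(f'{x:.0f}, {y:.0f}')
```

In the notebook itself, `to_crs(3857)` remains the simpler choice because it handles whole GeoDataFrames and any input CRS.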

In [35]:
pour_pt_filenames = []
total = 0
plot = []

output_plot_folder = 'data/validation_plots'
stn_plot_data_folder = f'data/catchment_polygons/updated_catchment_set/'

print(f'Processing {len(os.listdir(stn_plot_data_folder))} plots')

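def process_plot(stn_id, adjusted=False):
    # skip entries without a station id
    if stn_id is None:
        print('phantom station with no name')
        return None
    fpath = os.path.join(output_plot_folder, f'{stn_id}.html')
    # skip plots that already exist unless the pour point was re-adjusted
    if os.path.exists(fpath) and not adjusted:
        return None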
        
    stn_folder = os.path.join(updated_catchment_folder, stn_id)
    hs_pt_file = os.path.join(stn_folder, f'{stn_id}_HYSETS_pt.geojson')
    hs_poly_file = os.path.join(stn_folder, f'{stn_id}_HYSETS_polygon.geojson')
    
    adj_pt_file = os.path.join(stn_folder, f'{stn_id}_adjusted_ppt.geojson')
    adj_poly_file = os.path.join(stn_folder, f'{stn_id}_adjusted_catchment.geojson')
    streams_file = os.path.join(stn_folder, f'{stn_id}_streams.geojson')
    
    if adjusted:
        adj_pt_file = os.path.join(stn_folder, f'{stn_id}_REadjusted_ppt.geojson')
        adj_poly_file = os.path.join(stn_folder, f'{stn_id}_REadjusted_catchment.geojson')
        streams_file = os.path.join(stn_folder, f'{stn_id}_streams_adjusted.geojson')
    
    # read the HYSETS and adjusted geometries for this station
    hs_pt = gpd.read_file(hs_pt_file)
    hs_poly = gpd.read_file(hs_poly_file)
    adj_pt = gpd.read_file(adj_pt_file)
    adj_poly = gpd.read_file(adj_poly_file)
    streams = gpd.read_file(streams_file)
    plt = pour_point_plot(stn_id, hs_pt, hs_poly, adj_pt, adj_poly, streams, adjusted=adjusted)
    fpath = os.path.join(output_plot_folder, f'{stn_id}.html')
    # Specify the output file
    output_file(fpath)
    # Save the plot to the HTML file
    save(plt)


Processing 97 plots


In [36]:
for stn_id in sorted(os.listdir(stn_plot_data_folder)):
    process_plot(stn_id)

## Show an example validation plot

In [37]:
from bokeh.io import output_notebook
output_notebook()

In [38]:
to_check = [e.split('.')[0] for e in os.listdir('data/validation_plots/recheck')]
pt_crs = 'EPSG:4326'
checked_stns = {
    '15081800': create_new_pt(-132.87492, 55.36158, pt_crs), # catch trib just DS of stn or no?
    '15039900': create_new_pt(-133.98292, 58.24594, pt_crs), # reposition at lake outlet
    '10CC001': create_new_pt(-122.53885, 58.835, pt_crs), # reposition below Muskwa Creek
    '15109048': create_new_pt(-134.67127, 58.286815, pt_crs), # capture more of North Fork (NF)
    # '15056095': create_new_pt(-135.18648, 59.52632, pt_crs), # reposition at lake outlet REMOVED BECAUSE RECORD IS SEASONAL
    # '15031000': create_new_pt(-133.88392, 58.18253, pt_crs), # reposition to capture three upper tribs REMOVED BECAUSE RECORD IS SEASONAL
    # '12212430': create_new_pt(-122.49944, 48.99979, pt_crs), # reposition to capture upper tribs REMOVED BECAUSE RECORD IS SEASONAL
    # '15054000': create_new_pt(-134.64325, 58.38685, pt_crs), # better isolate Auke Creek REMOVED BECAUSE RECORD IS SEASONAL
    # '15087500': create_new_pt(-132.87254, 56.79361, pt_crs), # isolate east fork of Hobo Creek  REMOVED BECAUSE RECORD IS SEASONAL
    # '12110400': create_new_pt(-122.08512, 47.35607, pt_crs), # reposition on south fork Jenkins (Cranmar Creek)  REMOVED BECAUSE RECORD IS SEASONAL
    # '15056030': create_new_pt(-135.19131, 59.009803, pt_crs), # not great resolving of stream but captures upper basin well  REMOVED BECAUSE RECORD IS SEASONAL
    '05AE040': create_new_pt(-113.54587, 49.01493, pt_crs), # isolate East branch of Lee Creek
    # '12388650': create_new_pt(-114.69554, 47.48851, pt_crs), # Camas Creek near Hot Springs MT REMOVED BECAUSE RECORD IS SEASONAL
    # '08MH045': create_new_pt(-122.25, 49.213, pt_crs), # Bouchier Creek REMOVED BECAUSE RECORD IS SEASONAL
    '15087200': None, # Hammer Slough at Petersburg doesn't render from effect of bypass road
    '07FD913': None, # Young Drainage Project near Spirit River -- no idea!
    '07FD912': None, # Whitburn Drainage Project Near Spirit River -- no idea!
    '15053200': None, # Duck Creek doesn't resolve in 30m DEM due to urban development
    '12113349': None, # Mill Creek doesn't render well with 1 arcsecond DEM in urban area
    '15052475': None, # Jordan Creek doesn't render well with 1 arcsecond DEM in urban area
    # '08EG013': create_new_pt(-130.0851, 54.19629, pt_crs), # reposition at lake outlet
}


### Additional validation notes

* **15024750: Goat Creek Near Wrangell AK**  
    - there is a large (5-7 km²) tributary just upstream of the station that appears in basemaps to drain to the north
    - there's a low point where a debris jam could cause flow to divert, or where past flows may have diverted
* **15129600: Ophir Creek NR Yakutat AK**   
    - the 3DEP dem doesn't align with USGS basemapping but the area captured is a reasonable approximation
* **15129510: Old Situk River Nr Yakutat AK**  
    - significantly larger catchment delineated from very close to the reported location.   
    - does the north fork exist / should it be included?  
* **10CC001: Fort Nelson River at Fort Nelson**
    - the reported location doesn't include North Branch (HYSETS does) -- Muskwa River
    - 10CC002 is named "above Muskwa River" which coincides with this location,  
    - the much greater flow magnitude of 10CC001 suggests it was downstream of the Muskwa confluence.
* **12212430: Unnamed Trib to Bertrand Cr. Near H St.**
    - naming isn't very descriptive, base maps look like monitoring station is road ditch
    - pour point shifted away from reported location
    - however upstream network matches well with USGS base mapping despite pour point locat
* **08EG013: Boneyard Creek at Outlet of Rainbow Lake**
    - naming is helpful to identify outlet of Rainbow Lake
    - very different drainage area compared to WSC reported value
    - record is very short (~ 2 seasons 1962-64)
    - appears to have a dam at the outlet which may have caused lake connectivity and doubling of catchment area
* **08MH045: Bouchier Creek (on the Stave Lake Road)**   
    - the 3DEP DEM doesn't capture a creek >= 1 km², and the **measured record is seasonal, so mean values will be very skewed**


In [39]:
# pick a station id
# stn_id = [e for e in to_check if e not in checked_stns][0]
stn_id = '15024750'
fpath = os.path.join(output_plot_folder, f'{stn_id}.html')
stn_folder = os.path.join(updated_catchment_folder, stn_id)

hs_pt_file = os.path.join(stn_folder, f'{stn_id}_HYSETS_pt.geojson')
hs_poly_file = os.path.join(stn_folder, f'{stn_id}_HYSETS_polygon.geojson')
adj_pt_file = os.path.join(stn_folder, f'{stn_id}_adjusted_ppt.geojson')
adj_poly_file = os.path.join(stn_folder, f'{stn_id}_adjusted_catchment.geojson')
streams_file = os.path.join(stn_folder, f'{stn_id}_streams.geojson')

hs_pt = gpd.read_file(hs_pt_file)
hs_poly = gpd.read_file(hs_poly_file)
adj_pt = gpd.read_file(adj_pt_file)
adj_poly = gpd.read_file(adj_poly_file)
streams = gpd.read_file(streams_file)
plt = pour_point_plot(stn_id, hs_pt, hs_poly, adj_pt, adj_poly, streams)
show(plt)

## Re-delineate basins with adjusted pour points

In [40]:
for stn_id, pt in checked_stns.items():
    if pt is None:
        continue
    
    bcub_data = bcub_gdf[bcub_gdf['Official_ID'] == stn_id].copy()
    rc = bcub_data['region_code'].values[0]
    area = bcub_data['Drainage_Area_km2'].values[0]
        
    pt.to_crs(3005, inplace=True)
    
    stn_folder = os.path.join(updated_catchment_folder, stn_id)
    if not os.path.exists(stn_folder):
        os.makedirs(stn_folder)
        
    # accumulation raster path
    acc_dem_path = os.path.join(dem_folder, f'{rc}_USGS_3DEP_3005_accum.tif')
    if not os.path.exists(acc_dem_path):
        print('missing ', acc_dem_path)
        continue
    
    print(f'processing {stn_id} in {rc} region ({area:.2f} km²)')
    stn_data = bcub_gdf[bcub_gdf['Official_ID'] == stn_id].copy()

    # 3) find the nearest stream cell to the UPDATED station location
    adjusted_ppt_path = os.path.join(stn_folder, f'{stn_id}_REadjusted_ppt.geojson')
    adjusted_ppt_path_shp = os.path.join(stn_folder, f'{stn_id}_REadjusted_ppt.shp')    
    # only process when neither output exists (explicit parentheses, same behaviour as before)
    if not (os.path.exists(adjusted_ppt_path) or os.path.exists(adjusted_ppt_path_shp)):
        print('    ...processing adjusted pour point')
        nearest_pt, distance = snap_pour_point(acc_dem_path, pt, area, distance_tol=1000)
        adj_pt = gpd.GeoDataFrame(geometry=[nearest_pt], crs=pt.crs)
        adj_pt['Official_ID'] = stn_id
        adj_pt.to_file(adjusted_ppt_path)
        adj_pt.to_file(adjusted_ppt_path_shp)
        
    # 4) delineate a new catchment from the adjusted point
    adjusted_catchment_path = os.path.join(stn_folder, f'{stn_id}_REadjusted_catchment.geojson')
    if not os.path.exists(adjusted_catchment_path):
        print('    ...delineating basin raster')
        catchment_raster_fpath = delineate_new_catchment(adjusted_ppt_path_shp, rc, stn_id, stn_folder)
        adjusted_catchment = raster_to_vector_basin(rc, catchment_raster_fpath, stn_id, stn_folder)
        adjusted_catchment.to_file(adjusted_catchment_path)
        
    # 5) save streamlines as a vector within some distance of the catchment polygon / ppt.
    streams_path = os.path.join(stn_folder, f'{stn_id}_streams_adjusted.geojson')
    if not os.path.exists(streams_path):
        print('    ...processing streams vectors')
        streams_temp_path = generate_stream_vectors(stn_data, adjusted_catchment_path, stn_folder)
        gdf = gpd.read_file(streams_temp_path)
        gdf.to_file(streams_path)
        
        remove_extensions = ['.dbf', '.prj', '.shp', '.shx', '.tif', '.cpg']
        if os.path.exists(adjusted_catchment_path):
            for f in os.listdir(stn_folder):
                if any([f.endswith(e) for e in remove_extensions]):
                    os.remove(os.path.join(stn_folder, f))
        
        print(f'   ...processed {stn_id}, saved to {streams_path}')  
    

processing 15081800 in 08D region (45.07 km²)
processing 15039900 in 08B region (28.49 km²)
processing 10CC001 in LRD region (43500.00 km²)
processing 15109048 in 08B region (11.21 km²)


IndexError: index 0 is out of bounds for axis 0 with size 0

### Revise the plots

In [None]:
for stn_id, pt in checked_stns.items():
    if pt is None:
        continue
    process_plot(stn_id, adjusted=True)

### Final review notes (excluded stations)

1. **15087200**: Hammer Slough at Petersburg doesn't render from effect of bypass road  
2. **07FD913**: Young Drainage Project near Spirit River -- no idea!  
3. **07FD912**: Whitburn Drainage Project Near Spirit River -- no idea!
4. **12113349**: Mill Creek doesn't resolve with 1 arcsecond DEM in urban area
5. **15052475**: Jordan Creek doesn't resolve with 1 arcsecond DEM in urban area
6. **12110400**: Jenkins Creek doesn't resolve with 1 arcsecond DEM in urban area
7. **15053200**: Duck Creek doesn't resolve with 1 arcsecond DEM in urban area
8. **12212430**: Unnamed Trib to Bertrand Creek doesn't resolve with 1 arcsecond DEM in urban area
9. **08EG013**: Boneyard Creek at Outlet of Rainbow Lake appears to have lakes connected by an outlet dam raising the water level
10. **08MH045**: Bouchier Creek doesn't resolve a catchment with 1 arcsecond DEM.
11. **12388650**: Camas Creek near Hot Springs MT is excluded because the record is seasonal.

## Assemble final catchment bounds into a dataframe to be used in subsequent computation

In [None]:
revised_geometry_fname = f'BCUB_watershed_bounds_updated.geojson'

excluded_stns = ['15087200', '07FD913', '07FD912', '12113349', '15052475',
                '12110400', '15053200', '12212430', '08EG013', '08MH045', '12388650']

revised_geometry_folder = os.path.join(os.getcwd(), 'data/catchment_polygons/updated_catchment_set')
revised_stns = [e for e in remaining_stns['Official_ID'] if e not in excluded_stns]
print(f'{len(revised_stns)} revised catchment bounds')
bcub_gdf.reset_index(drop=True, inplace=True)

for i, row in bcub_gdf.iterrows():
    stn_id = row['Official_ID']
    if stn_id not in revised_stns:
        continue
    stn_geometry_folder = os.path.join(revised_geometry_folder, stn_id)

    # if the station has been revised, there should be new geometry:
    # either it worked the first time, or the pour point was adjusted and
    # the catchment file is ...REadjusted_catchment.geojson
    revised_catchment_fpath = os.path.join(stn_geometry_folder, f'{stn_id}_REadjusted_catchment.geojson')
    if not os.path.exists(revised_catchment_fpath):
        revised_catchment_fpath = os.path.join(stn_geometry_folder, f'{stn_id}_adjusted_catchment.geojson')

    assert os.path.exists(revised_catchment_fpath), f'revised catchment file not found for {stn_id}'

    catchment_df = gpd.read_file(revised_catchment_fpath)
    # also need to update the centroid geom:
    # compute the centroid in the projected crs of the catchment file
    centroid_x, centroid_y = catchment_df.geometry.centroid.x.values[0], catchment_df.geometry.centroid.y.values[0]
    # then wrap it in a dataframe so it can be reprojected
    centroid_df = gpd.GeoDataFrame(geometry=[Point(centroid_x, centroid_y)], crs=catchment_df.crs)
    # then reproject back to decimal degrees (geographic crs)
    centroid_df.to_crs(4326, inplace=True)
    centroid_x, centroid_y = centroid_df.geometry.x.values[0], centroid_df.geometry.y.values[0]
    bcub_gdf.loc[i, 'geometry'] = catchment_df.geometry.values[0]
    bcub_gdf.loc[i, 'Centroid_Lat_deg_N'] = centroid_y
    bcub_gdf.loc[i, 'Centroid_Lon_deg_E'] = centroid_x
    bcub_gdf.loc[i, 'geometry_updated'] = True
            

bcub_gdf = bcub_gdf[~bcub_gdf['Official_ID'].isin(excluded_stns)]
output_fpath = os.path.join('data', revised_geometry_fname)
bcub_gdf.to_file(output_fpath)
print(f'file saved to {output_fpath}')
print(len(bcub_gdf))
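
The file-selection rule in the loop above (prefer the re-adjusted catchment, fall back to the first-pass one) can be isolated into a small helper. This is a sketch only: `pick_catchment_file` is a hypothetical name, and it assumes the `{stn_id}_REadjusted_catchment.geojson` and `{stn_id}_adjusted_catchment.geojson` naming used in this notebook.

```python
import tempfile
from pathlib import Path


def pick_catchment_file(stn_folder, stn_id):
    """Return the preferred catchment file for a station, or None if missing."""
    candidates = [
        Path(stn_folder) / f'{stn_id}_REadjusted_catchment.geojson',
        Path(stn_folder) / f'{stn_id}_adjusted_catchment.geojson',
    ]
    for path in candidates:
        if path.exists():
            return path
    return None


# demonstrate with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    first_pass = Path(tmp) / '10CC001_adjusted_catchment.geojson'
    first_pass.touch()
    only_first = pick_catchment_file(tmp, '10CC001')
    readjusted = Path(tmp) / '10CC001_REadjusted_catchment.geojson'
    readjusted.touch()
    both = pick_catchment_file(tmp, '10CC001')
    missing = pick_catchment_file(tmp, '08MH045')

print(only_first.name, both.name, missing)
```

Returning `None` instead of asserting lets the caller decide whether a missing file should stop the run or simply be reported.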

In [None]:
bcub_gdf[bcub_gdf['Official_ID'].isin(revised_stns)].head()
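
The centroid update above is computed in a projected CRS because planar area and centroid formulas assume Cartesian coordinates. As a reference for what is being computed on a simple ring, here is a sketch of the shoelace formulas. `ring_area_and_centroid` is a hypothetical helper that ignores holes and multipart geometries, both of which shapely handles.

```python
def ring_area_and_centroid(coords):
    """Signed area and centroid of a ring given as (x, y) tuples."""
    pts = list(coords)
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring
    cross_sum = cx_sum = cy_sum = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        cross = x0 * y1 - x1 * y0
        cross_sum += cross
        cx_sum += (x0 + x1) * cross
        cy_sum += (y0 + y1) * cross
    area = cross_sum / 2.0
    return area, (cx_sum / (6.0 * area), cy_sum / (6.0 * area))


# 4 km x 2 km rectangle in metres (projected coordinates)
area_m2, (cx, cy) = ring_area_and_centroid([(0, 0), (4000, 0), (4000, 2000), (0, 2000)])
print(area_m2 / 1e6, cx, cy)
```

Dividing the area in square metres by `1e6` gives km², matching the unit conversion used in the plotting helpers.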

## Citations

```{bibliography}
:filter: docname in docnames
```