# 3. Pour Point Extraction

```{figure} img/pour_points.png
---
width: 600px
---
In this notebook, we will derive a set of pour points describing confluences that will be used to derive basins and extract basin attributes.
```

In this notebook, we'll use the stream network generated in the previous notebook to find all river confluences. A set of input pour points is required for basin delineation in the next notebook.

The set of confluences will be filtered using the [National Hydrographic Network](https://natural-resources.canada.ca/science-and-data/science-and-research/earth-sciences/geography/topographic-information/geobase-surface-water-program-geeau/national-hydrographic-network/21361) waterbodies geometry to remove spurious confluences within lakes.

The remaining points will serve as pour points for basin delineation.  

The following files were pre-processed for this demonstration because the [original files cover all of Canada and are consequently very large](https://ftp.maps.canada.ca/pub/nrcan_rncan/vector/geobase_nhn_rhn/gpkg_en/CA/). The files below may need to be downloaded and saved to `content/notebooks/data/region_polygons/`.

* `Vancouver_Island.geojson`: the polygon describing Vancouver Island. It was spatially intersected with the NHN geometry to select only the waterbody geometries on Vancouver Island.
* `Vancouver_Island_lakes.geojson`: the set of water body polygons for Vancouver Island.


The steps in this notebook produce a set of river confluences, with spurious points within lakes filtered out. In the example figure, confluences are shown as green points, and the spurious lake points that are removed are shown in yellow for illustration.


```{note}
The stream network in the image above appears discontinuous only because of the screen resolution.
```

In [16]:
import os
from utilities import *
# open the stream layer
base_dir = os.path.dirname(os.getcwd())
dem_folder = os.path.join(base_dir, 'notebooks/data/DEM/')
base_dir

'/home/danbot/Documents/code/23/bcub/content'

```{note}
For clarity, some functions have been moved to a separate file. For details, see `utilities.py`.
```

In [17]:
# Create the folder where the pour point geometry information will be saved.
pour_pt_path = os.path.join(base_dir, 'notebooks/data/pour_points/')
# makedirs also creates missing parent folders and is a no-op if the path exists
os.makedirs(pour_pt_path, exist_ok=True)

## Import rasters (flow direction, accumulation, stream network)

In [3]:
# open the streams dem
region = 'Vancouver_Island'
d8_path = os.path.join(dem_folder, f'{region}_d8_pointer.tif')
acc_path = os.path.join(dem_folder, f'{region}_acc.tif')
stream_path = os.path.join(dem_folder, f'{region}_streams.tif')

stream_raster, stream_crs, affine = retrieve_raster(stream_path)
resolution = stream_raster.rio.resolution()
dx, dy = abs(resolution[0]), abs(resolution[1])
print(f'Raster resolution is {dx:.0f}x{dy:.0f}m')


Raster resolution is 22x22m


Here we'll set a minimum drainage area threshold of 5 $km^2$ to limit the number of confluences for the sake of this demonstration.

In [4]:
min_basin_area = 5 # km^2
# min number of cells comprising a basin
basin_threshold = int(min_basin_area * 1E6 / (dx * dy)) 
basin_threshold

10103
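
The conversion from an area threshold to a cell count can be written as a small helper. This is a minimal sketch: `area_to_cell_count` is a hypothetical name (it is not part of `utilities.py`), and the 25 m cell size is an illustrative value, not the resolution of the Vancouver Island raster.

```python
def area_to_cell_count(area_km2, dx_m, dy_m):
    """Number of whole raster cells needed to cover `area_km2`.

    dx_m and dy_m are the cell dimensions in metres (assumed values here).
    """
    area_m2 = area_km2 * 1e6          # 1 km^2 = 1,000,000 m^2
    return int(area_m2 / (dx_m * dy_m))


# illustrative cell size of 25 m x 25 m
cells_5km2 = area_to_cell_count(5, 25, 25)
cells_1km2 = area_to_cell_count(1, 25, 25)
print(cells_5km2, cells_1km2)
```

A cell whose flow accumulation exceeds this count drains at least the chosen area, which is how the threshold limits the number of candidate pour points.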

In [5]:
import time
import numpy as np
import pandas as pd

rt0 = time.time()

stream, _, _ = retrieve_raster(stream_path)
fdir, _, _ = retrieve_raster(d8_path)
acc, _, _ = retrieve_raster(acc_path)

# get raster data in matrix form
S = stream.data[0]
F = fdir.data[0]
A = acc.data[0]

rt1 = time.time()
print(f'   ...time to load resources: {rt1-rt0:.1f}s.')

   ...time to load resources: 8.5s.


Create a list of coordinates representing all the stream cells.

In [6]:
# get all the stream pixel indices
stream_px = np.argwhere(S == 1)

## Define confluence points in the stream network

Below we create a dictionary of potential pour points corresponding to confluences.  

We iterate through all the stream pixels, retrieve a 3x3 window of the flow direction raster around each one, and check whether more than one neighbouring stream cell points towards it.

In [8]:
ppts = {}
nn = 0

for (i, j) in stream_px:
    c_idx = f'{i},{j}'
    if c_idx not in ppts:
        ppts[c_idx] = {}
    ppt = ppts[c_idx]

    # Add river outlets, as these are by definition
    # confluences and especially prevalent in coastal regions
    focus_cell_acc = A[i, j]
    focus_cell_dir = F[i, j]

    ppt['acc'] = focus_cell_acc

    if focus_cell_dir == 0:
        # the focus cell is already known to be a stream cell, so a
        # direction value of 0 means it has no downstream neighbour,
        # i.e. it is an outlet cell.
        ppt['OUTLET'] = True
        # outlets are treated as confluences so they are kept as pour points
        ppt['CONF'] = True
    else:
        ppt['OUTLET'] = False

    # get the 3x3 boolean matrix of stream and d8 pointer 
    # cells centred on the focus cell
    S_w = S[max(0, i-1):i+2, max(0, j-1):j+2].copy()
    F_w = F[max(0, i-1):i+2, max(0, j-1):j+2].copy()
    
    # create a boolean matrix for cells that flow into the focal cell
    F_m = mask_flow_direction(S_w, F_w)
    
    # check if cell is a stream confluence
    # set the target cell to false by default
    ppts = check_for_confluence(i, j, ppts, S_w, F_m)    
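
The inflow test applied to each 3x3 window can be sketched in plain Python as follows. This is not the code in `utilities.py`: `INFLOW_CODES`, `count_inflows` and `is_confluence` are hypothetical names, and the sketch assumes the WhiteboxTools D8 pointer convention (NE=1, E=2, SE=4, S=8, SW=16, W=32, NW=64, N=128, with 0 meaning no flow). If your flow direction raster uses a different encoding, the lookup table must change accordingly.

```python
# Pointer value a neighbour must hold to drain into the centre cell,
# indexed by the neighbour's position in the 3x3 window.
# ASSUMPTION: WhiteboxTools D8 pointer convention.
INFLOW_CODES = [
    [4,    8,   16],   # NW flows SE, N flows S, NE flows SW
    [2, None,   32],   # W flows E, (centre), E flows W
    [1,  128,   64],   # SW flows NE, S flows N, SE flows NW
]


def count_inflows(S, F, i, j):
    """Count stream neighbours of cell (i, j) that drain into it."""
    n_rows, n_cols = len(S), len(S[0])
    count = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # skip the centre cell itself
            r, c = i + di, j + dj
            if not (0 <= r < n_rows and 0 <= c < n_cols):
                continue  # neighbour falls outside the raster
            if S[r][c] == 1 and F[r][c] == INFLOW_CODES[di + 1][dj + 1]:
                count += 1
    return count


def is_confluence(S, F, i, j):
    """A cell is a confluence when two or more stream cells flow into it."""
    return count_inflows(S, F, i, j) >= 2


# toy stream raster (1 = stream) and matching D8 pointer raster
S = [[1, 0, 1],
     [0, 1, 0],
     [0, 1, 0]]
F = [[4, 0, 16],
     [0, 8, 0],
     [0, 0, 0]]

print(is_confluence(S, F, 1, 1), is_confluence(S, F, 2, 1))
```

In the toy grid, the two top corner cells both drain into the centre cell, so it is flagged as a confluence, while the bottom cell receives flow from only one neighbour.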


Convert the dictionary of stream confluences to a geodataframe in the same CRS as our raster.

In [9]:
import geopandas as gpd

output_ppt_path = os.path.join(pour_pt_path, f'{region}_ppts.geojson')

if not os.path.exists(output_ppt_path):
    t0 = time.time()
    ppt_df = pd.DataFrame.from_dict(ppts, orient='index')
    ppt_df.index.name = 'cell_idx'
    ppt_df.reset_index(inplace=True) 
    
    # split the cell indices into columns and convert str-->int
    ppt_df['ix'] = [int(e.split(',')[0]) for e in ppt_df['cell_idx']]
    ppt_df['jx'] = [int(e.split(',')[1]) for e in ppt_df['cell_idx']]
    
    # filter for stream points that are an outlet or a confluence
    ppt_df = ppt_df[(ppt_df['OUTLET'] == True) | (ppt_df['CONF'] == True)]
    print(f' There are {len(ppt_df)} confluences and outlets combined in the {region} region.')
else:
    ppt_df = gpd.read_file(output_ppt_path)


 There are 22560 confluences and outlets combined in the Vancouver_Island region.


In [10]:
n_pts_tot = len(stream_px)
n_pts_conf = len(ppt_df[ppt_df['CONF']])
n_pts_outlet = len(ppt_df[ppt_df['OUTLET']])
# outlets are also flagged as confluences, so subtract them
# to count the confluences that are not outlets
n_pts_conf_only = n_pts_conf - n_pts_outlet

print(f'Of {n_pts_tot} total stream cells:')
print(f'    {n_pts_conf_only} ({100*n_pts_conf_only/n_pts_tot:.1f}%) are stream confluences,')
print(f'    {n_pts_outlet} ({100*n_pts_outlet/n_pts_tot:.1f}%) are stream outlets.')


Of 934567 total stream cells:
    21044 (2.3%) are stream confluences,
    1516 (0.2%) are stream outlets.


```{note}
The pour points are thus far described only by their raster pixel indices. We still need to apply a transform to map these indices to projected coordinates.
```

In [11]:
ppt_gdf = create_pour_point_gdf(region, stream, ppt_df, stream_crs, output_ppt_path)

creating 100 chunks for processing
    ...10/100 chunks processed in 0.0s
    ...20/100 chunks processed in 0.1s
    ...30/100 chunks processed in 0.1s
    ...40/100 chunks processed in 0.2s
    ...50/100 chunks processed in 0.2s
    ...60/100 chunks processed in 0.2s
    ...70/100 chunks processed in 0.3s
    ...80/100 chunks processed in 0.3s
    ...90/100 chunks processed in 0.3s
    ...100/100 chunks processed in 0.4s
    22560 pour points created.
   ...ppts geodataframe processed in0.4s
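
The index-to-coordinate mapping is an affine transform. The sketch below shows the arithmetic only; it is not the implementation of `create_pour_point_gdf`. The function name `pixel_to_xy` and the transform coefficients are assumptions chosen for illustration, using the common `(a, b, c, d, e, f)` ordering where `x = a*col + b*row + c` and `y = d*col + e*row + f`.

```python
def pixel_to_xy(row, col, transform):
    """Map a (row, col) pixel index to the projected (x, y) of the cell centre."""
    a, b, c, d, e, f = transform
    # add 0.5 to move from the cell's upper-left corner to its centre
    x = a * (col + 0.5) + b * (row + 0.5) + c
    y = d * (col + 0.5) + e * (row + 0.5) + f
    return x, y


# hypothetical north-up raster: 25 m cells, upper-left corner at (1000, 5000)
transform = (25.0, 0.0, 1000.0, 0.0, -25.0, 5000.0)

first_cell = pixel_to_xy(0, 0, transform)
other_cell = pixel_to_xy(2, 3, transform)
print(first_cell, other_cell)
```

Note that the row index controls `y` and the column index controls `x`, and that `e` is negative for a north-up raster because row numbers increase southward.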



In [None]:
# ta = time.time()
# polygon_path = os.path.join(base_dir, f'notebooks/data/region_polygons/{region}.geojson')
# region_polygon = gpd.read_file(polygon_path)
# # reproject to match nhn crs
# # region_polygon = region_polygon.to_crs(4617)
# tb = time.time()
# print(f'   ...region polygon opened in {tb-ta:.2f}s')


## Filter spurious confluences


One limitation of the stream network derivation is that it does not identify lakes. There are many lakes on Vancouver Island, and we want to remove the spurious confluence points that fall within them and instead find the locations where rivers flow into lakes. We can do this with hydrographic information from the [National Hydrographic Network](https://natural-resources.canada.ca/science-and-data/science-and-research/earth-sciences/geography/topographic-information/geobase-surface-water-program-geeau/national-hydrographic-network/21361).

```{tip}
Lake polygons for Vancouver Island are saved under `content/notebooks/data/region_polygons/Vancouver_Island_lakes.geojson`
```



### Get the water body geometries that contain confluence points

From the [NHN documentation](https://ftp.maps.canada.ca/pub/nrcan_rncan/vector/geobase_nhn_rhn/doc/GeoBase_nhn_en_Catalogue_1_2.pdf):

Permanency code:
* -1 unknown
* 0 no value available
* 1 permanent
* 2 intermittent

    
| Label | `water_definition` code | Definition |
|------------------|-------|-----------------|
| None | 0 | No Waterbody Type value available. |
| Canal | 1 | An artificial watercourse serving as a navigable waterway or to channel water. |
| Conduit | 2 | An artificial system, such as an Aqueduct, Penstock, Flume, or Sluice, designed to carry water for purposes other than drainage. |
| Ditch | 3 | Small, open manmade channel constructed through earth or rock for the purpose of conveying water. |
| *Lake | 4 | An inland body of water of considerable area. |
| *Reservoir | 5 | A wholly or partially manmade feature for storing and/or regulating and controlling water. |
| Watercourse | 6 | A channel on or below the earth's surface through which water may flow. |
| Tidal River | 7 | A river in which flow and water surface elevation are affected by the tides. |
| *Liquid Waste | 8 | Liquid waste from an industrial complex. |

```{warning}
The code "10" also exists, though I have not found a corresponding definition. From the image below, it appears to represent seasonal channels. Light blue regions are lakes (4) and watercourses (6).
```

```{figure} img/label_10.png
---
width: 400px
---
Darker grey polygons, labeled with the code "10", appear to be seasonal channels.
```

In [31]:
# read the pre-processed lakes polygon file
region_lakes_path = os.path.join(base_dir, f'notebooks/data/region_polygons/{region}_lakes.geojson')
lakes_df = gpd.read_file(region_lakes_path)
lakes_df = lakes_df[[c for c in lakes_df.columns if c not in ['index_right', 'index_left']]]
assert lakes_df.crs == ppt_gdf.crs

```{warning}
Here we apply some subjective criteria to improve the performance of the lake inflow point discovery:
1. Remove lakes smaller than 0.01 $km^2$ to speed up the spatial join.
2. Only consider lakes that contain confluence points, so that those points can be relocated to river mouths.
3. Apply a small buffer and simplify (or smooth) each water body polygon to reduce the number of river mouth points identified in heavily braided lake headwaters.
4. Check that points identified as river mouths are not too close to existing points (the code below uses a minimum spacing of 10 pixel widths).
5. Resample the shoreline of each lake polygon in order to find the nearest stream pixel crossing the line. If we interpolate too few points, we miss the intersecting point. When changing these parameters, consider that simplification eliminates vertices defining the polygon, so the line must be interpolated with enough points to find a stream pixel at the intersection.
```
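
To illustrate why the shoreline needs to be interpolated densely, the sketch below inserts evenly spaced points along a polyline so that no two consecutive points are farther apart than a chosen spacing. It is a plain-Python illustration under stated assumptions, not the `redistribute_vertices` function from `utilities.py`; `densify_line` is a hypothetical name and the coordinates are made up.

```python
import math


def densify_line(coords, spacing):
    """Return the polyline's points, adding intermediate points so that
    consecutive points are at most `spacing` apart."""
    out = [coords[0]]
    for (x0, y0), (x1, y1) in zip(coords[:-1], coords[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # number of equal pieces needed so each piece is <= spacing
        n = max(1, math.ceil(seg_len / spacing))
        for k in range(1, n + 1):
            t = k / n
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# a simplified shoreline edge 100 m long, resampled at a 25 m cell size
shore = densify_line([(0.0, 0.0), (100.0, 0.0)], 25.0)
print(shore)
```

With only the two original vertices, a stream crossing the edge at x = 50 would never be sampled. After resampling at the cell size, every raster cell along the edge is visited at least once.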

In [67]:
# remove lakes < 0.01 km^2
lakes_df['area'] = lakes_df.geometry.area
lakes_df = lakes_df[lakes_df['area'] >= 10000]


Filter out points within water bodies.

In [68]:
# filter for water_definition code (see code table above)
# filter out all confluence points in lakes and reservoirs
lakes_filter = (lakes_df['water_definition'] == 4) | (lakes_df['water_definition'] == 5) 
lakes_df = lakes_df[lakes_filter].copy()

# intersect the pour point and filtered lake geometries
lake_ppts = gpd.sjoin(ppt_gdf, lakes_df, how='left', predicate='within')
filtered_ppts = lake_ppts[lake_ppts['index_right'].isna()].copy()

print(f'    {len(filtered_ppts)}/{len(ppt_gdf)} confluence points are not in lakes ({len(ppt_gdf) - len(filtered_ppts)} points removed).')


    19184/22560 confluence points are not in lakes (3376 points removed).


Find all water body polygons that contain at least one confluence point.  

In [69]:
# find the set of lakes that contain points
lakes_with_pts = gpd.sjoin(lakes_df, ppt_gdf, how='left', predicate='intersects')

# the rows with index_right == nan are lake polygons containing no points
filtered_lakes = lakes_with_pts[~lakes_with_pts['index_right'].isna()].copy()
# get the set of all unique lake ids
lake_ids = list(set(filtered_lakes['id']))
filtered_lakes = lakes_df[lakes_df['id'].isin(lake_ids)].copy()

# merge contiguous (adjacent) polygons, then split the merged
# geometry back into one row per distinct lake polygon
filtered_lakes = gpd.GeoDataFrame(geometry=[filtered_lakes.geometry.unary_union], crs=lakes_df.crs)
filtered_lakes = filtered_lakes.explode(index_parts=False).reset_index(drop=True)

print(f'  There are {len(filtered_lakes)} unique lake geometries that contain confluence points found in the {region} waterbodies layer')

  There are 311 unique lake geometries that contain confluence points found in the Vancouver_Island waterbodies layer


### Find and add lake inflows

We'll only check lakes that have spurious confluences. The general idea is to shift each in-lake confluence to the inflow location. The method works best for large lake polygons and relatively smooth geometries where the stream network and NHN features align well, but it adds unnecessary points in other locations. A few examples of good and bad behaviour are shown below.

```{figure} img/lake_points_removed.png
---
width: 600px
---
Confluence points within lakes have been removed, while river mouths have been added.
```

```{figure} img/problem_points.png
---
width: 600px
---
Complex lake polygons and derived stream network disagreement result in points being added in unintentional locations.
```

The geometric manipulations below can be modified to address specific use cases.


In [82]:
from shapely.geometry import Point  # used below to build candidate inflow points

In [121]:
n = 0
tot_pts = 0
tb = time.time()
resolution = abs(acc.rio.resolution()[0])
min_acc_cells = 1E6 / (resolution**2)
# buffer and simplify each lake geometry, then collect candidate
# inflow points along its shoreline
points_to_check = []
for _, row in filtered_lakes.iterrows():
    n += 1
    if n % 50 == 0:
        print(f'   Processing lake group {n}/{len(filtered_lakes)}, {tot_pts} points so far...')
    
    # give a slight buffer to the polygon and smooth it to address 
    # complex braided lake headwaters
    lake_geom = row.geometry.buffer(resolution).simplify(resolution)
    # resample the shoreline vector to prevent missing confluence points
    resampled_shoreline = redistribute_vertices(lake_geom.exterior, resolution).coords.xy
    
    xs = resampled_shoreline[0].tolist()
    ys = resampled_shoreline[1].tolist()

    # find the nearest raster cell to each shoreline point (within 1/2 pixel)
    px_pts = acc.sel(x=xs, y=ys, method='nearest', tolerance=resolution/2)
    # unique (x, y) cell centres in projected coordinates (not lat/lon, despite the name)
    latlon = list(set(zip(px_pts.x.values, px_pts.y.values)))

    if len(latlon) == 0:
        continue

    for x, y in latlon:
        acc_val = acc.sel(x=x, y=y).squeeze()
        if (acc_val.item() > min_acc_cells):
            tot_pts += 1
            points_to_check += [(x, y, resolution)]
            
        
print(f'{len(points_to_check)} points identified as potential lake inflows')

   Processing lake group 50/311, 298 points so far...
   Processing lake group 100/311, 582 points so far...
   Processing lake group 150/311, 942 points so far...
   Processing lake group 200/311, 1285 points so far...
   Processing lake group 250/311, 1741 points so far...
   Processing lake group 300/311, 2011 points so far...
2078 points identified as potential lake inflows


In [125]:
n = 0
all_pts = []
for inp in points_to_check:
    n += 1
    if n % 250 == 0:
        print(f'{n}/{len(points_to_check)} points checked.')

    x, y, resolution = inp
    pt = Point(x, y)
    
    # index_right is the lake id the point is contained in.
    # Don't let adjacent points both be pour points, but only
    # measure distance to existing points that are outside lakes.
    pt_dists = filtered_ppts[filtered_ppts['index_right'].isna()].distance(pt).min()

    # reject candidates within 10 cell widths of an existing point
    min_spacing = 10 * resolution
    dist_check = pt_dists <= min_spacing
    
    # placeholder for an optional accumulation check (currently always passes)
    accum_check = True
    if accum_check and not dist_check:
        # keep the candidate only if it is not inside any lake polygon
        if not lakes_df.contains(pt).any():
            all_pts.append(pt)

250/2078 points checked.
500/2078 points checked.
750/2078 points checked.
1000/2078 points checked.
1250/2078 points checked.
1500/2078 points checked.
1750/2078 points checked.
2000/2078 points checked.


Format the river mouth points into a geodataframe and append it to the filtered set.

In [126]:
rpts = filtered_ppts[['geometry']].copy()
all_pts_filtered = []
n = 0
for pt in all_pts:
    n += 1
    if n % 200 == 0:
        print(f'{n}/{len(all_pts)}')
    dists = rpts.distance(pt)
    if (dists > min_spacing).all():
        ptg = gpd.GeoDataFrame(geometry=[pt], crs='EPSG:3005')
        # append the new point to the reference point dataframe to
        # update the set of points checked against.
        rpts = gpd.GeoDataFrame(pd.concat([rpts, ptg]), crs='EPSG:3005')
        all_pts_filtered.append(pt)
                


200/1668
400/1668
600/1668
800/1668
1000/1668
1200/1668
1400/1668
1600/1668
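
The filtering loop above is a greedy minimum-spacing filter: each accepted point is added to the reference set, so later candidates are also checked against it. A minimal plain-Python sketch of the same idea is given below. The function name `thin_points` and the coordinates are hypothetical, and Euclidean distance between coordinate tuples stands in for the GeoPandas `distance` call.

```python
import math


def thin_points(existing, candidates, min_spacing):
    """Greedily keep candidates farther than `min_spacing` from every
    existing point and from every previously kept candidate."""
    reference = list(existing)   # points to check against (grows as we accept)
    kept = []
    for pt in candidates:
        if all(math.dist(pt, ref) > min_spacing for ref in reference):
            kept.append(pt)
            reference.append(pt)
    return kept


existing = [(0.0, 0.0)]
candidates = [(5.0, 0.0), (100.0, 0.0), (105.0, 0.0), (300.0, 0.0)]
kept = thin_points(existing, candidates, min_spacing=50.0)
print(kept)
```

Because acceptance depends on what was accepted earlier, the result depends on the order of the candidates. Processing them in a different order can keep a different, equally valid subset.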


Save the combined set of pour points.

In [127]:
new_pts = gpd.GeoDataFrame(geometry=all_pts_filtered, crs=f'EPSG:{stream_crs}')
pour_points = gpd.GeoDataFrame(pd.concat([filtered_ppts, new_pts], axis=0), crs=f'EPSG:{stream_crs}')
pour_points.to_file(os.path.join(base_dir, f'notebooks/data/pour_points/{region}_pour_points3.geojson'))