# Data

## Introduction

```{figure} ../images/figure_1_study_region.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
name: study-region-fig
width: 500px
align: center
---
Study region polygons and WSC + USGS active (green triangles) and historical (yellow triangles) streamflow monitoring stations.  
```

```{note}
Before running the computations in this notebook, download the streamflow time series and (optionally) the catchment boundaries from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link).
```

The data used in this study come from *The Hydrometeorological Sandbox École de Technologie Supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments, can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).  We use a subset of approximately 1620 catchments contained in major basins covering and bounding British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`.  Ten climate indices were processed from [Daymet](https://daymet.ornl.gov/) for this subset; for details, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2024bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).  


For this experiment we use the following files:

* **Daily average streamflow time series**: filenames follow the convention `<official_id>.csv`.  These should be downloaded from the open data repository linked above and saved under `data/hysets_streamflow_timeseries/`.
* **Catchment attributes**: filename: `BCUB_watershed_attributes_updated.csv`.   This file is provided in the `data/` folder, and it was modified from the original file `HYSETS_watershed_properties.txt` in the HYSETS dataset with ten added climate indices as described in {cite}`kovacek2024bcub`.
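
Since the streamflow files must be downloaded separately, it can help to confirm that every station of interest has a file before running later cells.  The following is a minimal sketch, assuming only the `<official_id>.csv` naming convention described above; the helper name `find_missing_streamflow_files` is hypothetical and not part of this repository, and the demonstration uses a temporary folder with a placeholder file rather than real HYSETS data.

```python
import tempfile
from pathlib import Path


def find_missing_streamflow_files(station_ids, streamflow_dir):
    """Return the station IDs that have no `<official_id>.csv` file in the folder."""
    folder = Path(streamflow_dir)
    return [stn for stn in station_ids if not (folder / f"{stn}.csv").is_file()]


# demonstration in a temporary folder containing one placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "01AD002.csv").write_text("placeholder\n")
    missing = find_missing_streamflow_files(["01AD002", "01AD003"], tmp)

print(missing)
```

In practice the same helper could be called with the station IDs of the study region and the folder `data/hysets_streamflow_timeseries/`.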

## Data pre-processing overview

Note that these steps are optional and the end results of these pre-processing steps are provided in the open data repository.

### Get updated data sources and validate catchment attributes

1) Extract catchment attributes using updated catchment geometries where available (optional, updated catchment geometries are saved in `data/BCUB_watershed_bounds_updated.geojson`).
2) Process climate indices for HYSETS catchments in the study region (optional, pre-processed attributes are contained in `BCUB_watershed_attributes_updated.csv`).



### Pre-process streamflow data based on key assumptions
3) Define sensitivity test parameters:
    * **Bitrate:** the number of bits $b$ used to encode the streamflow time series, which sets the number of quantization levels $N_s = 2^b$.  This is done to test the sensitivity of the predictive model to information loss.   
    * **Prior**: when computing the Kullback-Leibler divergence ($D_{KL}(P||Q) = \sum_i P_i \log\frac{P_i}{Q_i}$), the simulated distribution $Q$ cannot contain zero probabilities in any state where $P$ is nonzero; in other words, we must prevent the model from saying observed states are impossible.  This is achieved by applying a prior distribution to $Q$ (the simulated series) in order to avoid division by zero.  The KL divergence is then computed on the posterior $Q'$.
    * **Minimum record length:** we set the minimum record length to 1 year in order to see the sensitivity of the model to record length.
    * **Record "completeness":** a (hydrological) year must be at least 90% complete in terms of daily mean observations.
    * **Partial counts**: another approach to incorporating measurement uncertainty is to assign a (relative) error interval around each observation.  The quantization step then computes counts based on the proportion of the error interval covered by each bin.  This has a smoothing effect on the discrete PMF, and the effect grows as the bitrate increases (more quantization levels) and the bin intervals shrink.

### Compute f-divergence measures for catchment pairs

4) Compute the (Shannon) entropy of the streamflow time series for each bitrate and prior.  
5) Compute the KL divergence (KLD), Earth Mover's Distance (EMD), and total variation distance (TVD) for each pair of stations meeting the minimum record / minimum concurrency criteria.  
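
The quantization, prior, and divergence steps can be illustrated on synthetic data.  The sketch below is not the implementation used in this study: the equal-width bins over a shared range, the base-2 logarithm, the uniform pseudo-count used as the prior, and all function names are assumptions made for illustration, and the two series are random draws rather than HYSETS records.

```python
import numpy as np


def quantize_counts(flow, bitrate, lo, hi):
    """Count observations in 2**bitrate equal-width bins on [lo, hi] (assumed binning)."""
    edges = np.linspace(lo, hi, 2**bitrate + 1)
    counts, _ = np.histogram(flow, bins=edges)
    return counts.astype(float)


def to_pmf(counts, pseudo_count=0.0):
    """Add a uniform pseudo-count to every bin (assumed prior) and normalize."""
    adjusted = counts + pseudo_count
    return adjusted / adjusted.sum()


def shannon_entropy(p):
    """Shannon entropy in bits, skipping empty bins."""
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))


def kl_divergence(p, q):
    """D_KL(P||Q) in bits; q must be positive wherever p is positive."""
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))


def total_variation(p, q):
    """Total variation distance: half the L1 distance between the PMFs."""
    return 0.5 * np.abs(p - q).sum()


def earth_movers_distance(p, q):
    """1-D EMD in units of bin widths: L1 distance between the two CDFs."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()


# synthetic stand-ins for two daily flow records (not HYSETS data)
rng = np.random.default_rng(0)
obs = rng.lognormal(mean=1.0, sigma=0.8, size=3650)
sim = rng.lognormal(mean=1.2, sigma=0.6, size=3650)

bitrate = 4  # 2**4 = 16 quantization levels
lo = min(obs.min(), sim.min())
hi = max(obs.max(), sim.max())
p = to_pmf(quantize_counts(obs, bitrate, lo, hi))
q = to_pmf(quantize_counts(sim, bitrate, lo, hi), pseudo_count=1.0)

print(f"H(P)   = {shannon_entropy(p):.3f} bits")
print(f"KLD    = {kl_divergence(p, q):.3f} bits")
print(f"TVD    = {total_variation(p, q):.3f}")
print(f"EMD    = {earth_movers_distance(p, q):.3f} bin widths")
```

The pseudo-count is applied only to the simulated distribution, so every bin of $Q'$ is strictly positive and the KL divergence stays finite even when the simulated series never visits a bin that the observed series does.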

## Import HYSETS catchment attributes

In [1]:
import os
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

In [2]:
# import the HYSETS attributes data
hysets_df = pd.read_csv('data/HYSETS_watershed_properties.txt', sep=';')

### Import the study region polygons

In [3]:
# import the BCUB (study) region boundary
region_gdf = gpd.read_file('data/BCUB_regions_4326.geojson')
region_gdf = region_gdf.to_crs(3005)
# simplify the geometries (100 m tolerance) and add a small buffer (500 m)
# to capture HYSETS station points recorded with low accuracy near boundaries
region_gdf.geometry = region_gdf.simplify(100).buffer(500)
region_gdf = region_gdf.to_crs(4326)

In [4]:
# get the stations contained in the study region
centroids = hysets_df.apply(lambda x: Point(x['Centroid_Lon_deg_E'], x['Centroid_Lat_deg_N']), axis=1)
hysets_gdf = gpd.GeoDataFrame(hysets_df, geometry=centroids, crs='EPSG:4326')
hysets_gdf.head(4)

Unnamed: 0,Watershed_ID,Source,Name,Official_ID,Centroid_Lat_deg_N,Centroid_Lon_deg_E,Drainage_Area_km2,Drainage_Area_GSIM_km2,Flag_GSIM_boundaries,Flag_Artificial_Boundaries,...,Land_Use_Water_frac,Land_Use_Urban_frac,Land_Use_Shrubs_frac,Land_Use_Crops_frac,Land_Use_Snow_Ice_frac,Flag_Land_Use_Extraction,Permeability_logk_m2,Porosity_frac,Flag_Subsoil_Extraction,geometry
0,1,HYDAT,SAINT JOHN RIVER AT FORT KENT,01AD002,47.25806,-68.59583,14703.9211,,0,0,...,0.0258,0.0089,0.0749,0.0242,0.0,1,-14.719327,0.180905,1,POINT (-68.59583 47.25806)
1,2,HYDAT,ST. FRANCIS RIVER AT OUTLET OF GLASIER LAKE,01AD003,47.20661,-68.95694,1358.6435,,0,0,...,0.0219,0.0174,0.041,0.0414,0.0,1,-14.056491,0.20645,1,POINT (-68.95694 47.20661)
2,3,HYDAT,MADAWASKA (RIVIERE) A 6 KM EN AVAL DU BARRAGE ...,01AD015,47.5385,-68.5918,2712.0,2693.814,1,0,...,0.0487,0.023,0.0351,0.06,0.0,1,-14.53739,0.165357,1,POINT (-68.5918 47.5385)
3,4,HYDAT,FISH RIVER NEAR FORT KENT,01AE001,47.2375,-68.58278,2245.7638,,0,0,...,0.063,0.0115,0.0641,0.0528,0.0,1,-14.687869,0.170597,1,POINT (-68.58278 47.2375)


### Find the stations within the study region

In [5]:
assert hysets_gdf.crs == region_gdf.crs

bcub_gdf = gpd.sjoin(hysets_gdf, region_gdf, how='inner', predicate='intersects')
print(len(bcub_gdf), len(set(bcub_gdf['Official_ID'])))

# Because of the buffer (to capture stations along the coast), 
# there's a duplicated 08GA065 that should be in 08G
bcub_gdf = bcub_gdf.drop_duplicates(subset=['Official_ID'])

1617 1616
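
The two counts differ by one because a point that falls inside two buffered region polygons is returned once per match.  Note that `drop_duplicates` keeps whichever matching row appears first, which is not necessarily the intended region.  The sketch below shows one way to inspect the duplicated rows and keep a specific region explicitly.  It uses a toy table in place of the real join result; the region code `OTHER`, the station `STN_B`, and the `preferred` mapping are hypothetical placeholders.

```python
import pandas as pd

# toy stand-in for the spatial join result; only 08GA065 / 08G come from the text
joined = pd.DataFrame({
    "Official_ID": ["08GA065", "08GA065", "STN_B"],
    "region_code": ["OTHER", "08G", "OTHER"],
})

# list every row that shares its Official_ID with another row
is_dupe = joined.duplicated("Official_ID", keep=False)
print(joined[is_dupe])

# keep the intended region for each duplicated station
preferred = {"08GA065": "08G"}
wrong_region = is_dupe & (joined["Official_ID"].map(preferred) != joined["region_code"])
resolved = joined[~wrong_region].reset_index(drop=True)
print(resolved)
```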


In [6]:
bcub_gdf.columns

Index(['Watershed_ID', 'Source', 'Name', 'Official_ID', 'Centroid_Lat_deg_N',
       'Centroid_Lon_deg_E', 'Drainage_Area_km2', 'Drainage_Area_GSIM_km2',
       'Flag_GSIM_boundaries', 'Flag_Artificial_Boundaries', 'Elevation_m',
       'Slope_deg', 'Gravelius', 'Perimeter', 'Flag_Shape_Extraction',
       'Aspect_deg', 'Flag_Terrain_Extraction', 'Land_Use_Forest_frac',
       'Land_Use_Grass_frac', 'Land_Use_Wetland_frac', 'Land_Use_Water_frac',
       'Land_Use_Urban_frac', 'Land_Use_Shrubs_frac', 'Land_Use_Crops_frac',
       'Land_Use_Snow_Ice_frac', 'Flag_Land_Use_Extraction',
       'Permeability_logk_m2', 'Porosity_frac', 'Flag_Subsoil_Extraction',
       'geometry', 'index_right', 'region_code'],
      dtype='object')

In [7]:
# add in two stations in the far north just outside the study region but 
# important to include since the northern region is so sparsely monitored
to_include = ['10ED002', '09AG003']
# copy the slice so that adding a geometry column does not modify a view of hysets_df
added_stns = hysets_df[hysets_df['Official_ID'].isin(to_include)].copy()
added_centroids = added_stns.apply(lambda x: Point(x['Centroid_Lon_deg_E'], x['Centroid_Lat_deg_N']), axis=1)
added_gdf = gpd.GeoDataFrame(added_stns, geometry=added_centroids, crs='EPSG:4326')
bcub_gdf = gpd.GeoDataFrame(pd.concat([bcub_gdf, added_gdf]), crs='EPSG:4326')
bcub_gdf.loc[bcub_gdf['Official_ID'] == '10ED002', 'region_code'] = '10E'
bcub_gdf.loc[bcub_gdf['Official_ID'] == '09AG003', 'region_code'] = 'YKR'
bcub_gdf.to_file('data/study_region_stations.geojson')

# get the number of unique stations in the dataset
unique_stations = np.unique(bcub_gdf['Official_ID'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')

1618 unique monitored catchments in the dataset


### Remove excluded stations

Catchments without geometry published by official sources (WSC, USGS) are validated in the next chapter.  The following stations are excluded because the catchment bounds could not be validated.

In [8]:
excluded_stns = ['15087200', '07FD913', '07FD912', '12113349', '15052475',
                '12110400', '15053200', '12212430', '08EG013', '08MH045']
bcub_gdf = bcub_gdf[~bcub_gdf['Official_ID'].isin(excluded_stns)]
print(len(bcub_gdf))

1608
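
The count drops from 1618 to 1608, which is consistent with all ten excluded IDs being present in the table.  A minimal sketch of making that check explicit is shown below, so that a mistyped ID does not go unnoticed.  The helper name `check_exclusions` and the toy IDs are hypothetical.

```python
def check_exclusions(station_ids, excluded):
    """Return the stations kept after exclusion and any excluded IDs not found."""
    excluded_set = set(excluded)
    not_found = sorted(excluded_set - set(station_ids))
    remaining = [stn for stn in station_ids if stn not in excluded_set]
    return remaining, not_found


# toy example: "X" is not in the station list, so it is reported
remaining, not_found = check_exclusions(["A", "B", "C", "D"], ["B", "X"])
print(remaining, not_found)
```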


In [9]:
# reproject the catchment centroids to Web Mercator (EPSG:3857)
# so they can be plotted over web map tiles
gdf = bcub_gdf.copy().to_crs(3857)
bbox = gdf.geometry.total_bounds

In [10]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)

show(p)

## Import streamflow timeseries


```{note}
At the top of `data_processing_functions.py`, update the `STREAMFLOW_DIR` variable to match where the HYSETS streamflow time series are stored.  
```


In [11]:
import data_processing_functions as dpf

In [12]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]

test_df = dpf.retrieve_nonconcurrent_data(s1, s2)
flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[s1], color='navy', legend_label=s1)
flow_fig.line(test_df.index, test_df[s2], color='dodgerblue', legend_label=s2)
flow_fig.yaxis.axis_label = r'$$\text{Flow } \frac{m^3}{s}$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)
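
Pairwise measures are only computed for station pairs meeting a minimum concurrency criterion, so it is useful to count the days on which both stations report a value.  The sketch below assumes a frame shaped like `test_df` as it is plotted above, with a datetime index and one column per station, and it further assumes that missing observations are stored as NaN, which the source does not confirm.  The column names and values are toy data.

```python
import numpy as np
import pandas as pd

# toy two-station frame: NaN marks days without an observation (assumed convention)
dates = pd.date_range("2000-01-01", periods=10, freq="D")
toy = pd.DataFrame(
    {
        "A": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, np.nan, 8.0, 9.0, 10.0],
        "B": [np.nan, 2.5, 3.5, 4.5, np.nan, np.nan, 7.5, 8.5, 9.5, np.nan],
    },
    index=dates,
)

# days observed at each station, and days observed at both
days_per_station = toy.notna().sum()
concurrent = toy.dropna(how="any")
n_concurrent = len(concurrent)

print(days_per_station.to_dict())
print(f"{n_concurrent} concurrent days")
```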

## Next steps

In the subsequent chapters, the monitoring network station catchments are updated, and the catchment attributes are re-extracted and compared against the HYSETS values.  The target variables are then derived from the streamflow time series before training the gradient boosting models to test their predictability from catchment attributes.

## Citations

```{bibliography}
:filter: docname in docnames
```