# Data

## Introduction

```{figure} ../images/study_basins_and_regions.png
---
alt: Study region polygons and HYSETS monitored catchment polygons.
class: bg-primary mb-1
name: study-region-fig
width: 500px
align: center
---
Study region polygons (purple outline) and HYSETS monitored catchments (yellow).  
```

```{note}
Before proceeding with the computations in the notebook, the streamflow time series and (optionally) catchment boundaries from the HYSETS dataset must be downloaded from the [HYSETS open data repository](https://osf.io/rpc3w/).  Some data are provided in the `data/` folder as part of this repository.  Data pre-processing can be skipped by downloading the input data files from (add dataset repository link)
```

The data used in this study come from *The Hydrometeorological Sandbox École de technologie supérieure* (HYSETS) {cite}`arsenault2020comprehensive`.  The HYSETS data, including streamflow time series and attributes for 14,425 catchments, can be accessed at [https://osf.io/rpc3w/](https://osf.io/rpc3w/).  We use a subset of approximately 1620 catchments contained in major basins covering and bounding British Columbia, as shown in {numref}`Figure {number} <study-region-fig>`.  Ten climate indices were processed from [Daymet](https://daymet.ornl.gov/) for this subset; for details, see *BCUB - A large sample ungauged basin attribute dataset for British Columbia, Canada* {cite}`kovacek2024bcub` ([https://doi.org/10.5194/essd-2023-508](https://doi.org/10.5194/essd-2023-508)).  


For this experiment we use the following files:

* **Daily average streamflow time series**: filenames follow the convention `<official_id>.csv`.  These should be downloaded from the open data repository linked above and saved under `data/hysets_streamflow_timeseries/`
* **Catchment attributes**: filename: `BCUB_watershed_properties_with_climate.csv`.   This file is provided in the `data/` folder, and it was modified from the original file `HYSETS_watershed_properties.txt` in the HYSETS dataset with ten added climate indices as described in {cite}`kovacek2024bcub`.

## Data pre-processing overview

The pre-processing steps below are optional: their results are provided in the open data repository.

1) Process climate indices for HYSETS catchments (optional, pre-processed attributes are contained in `BCUB_watershed_properties_with_climate.csv`)
2) Define testing parameters:
    * **Bitrate:** the bitrate $b$ sets the number of quantization levels, $N_s = 2^b$, used to encode the streamflow time series.  This is done to test the sensitivity of the predictive model to information loss.   
    * **Prior**: when computing the Kullback-Leibler divergence ($D_{KL} (P||Q) = \sum_i P(x_i) \log\frac{P(x_i)}{Q(x_i)}$), the simulated distribution $Q$ cannot assign zero probability to any state that $P$ observes; in other words, we must prevent the model from saying observed states are impossible.  This is achieved by applying a prior distribution to $Q$ (the simulated series) in order to avoid division by zero.  The KL divergence is then computed on the posterior $Q'$.
    * **Minimum record length:** we set the minimum record length to 1 year in order to see the sensitivity of the model to record length.
    * **Record "completeness":** a (hydrological) year must be at least 90% complete in terms of daily mean observations.
    * **Partial counts**: another approach to incorporating measurement uncertainty is to assign a (relative) error interval around each observation. The quantization step then computes counts based on the proportion of the error interval covered by each bin.  This has a smoothing effect on the discrete PMF, with an increasing effect as the bitrate (number of bins / quantization levels) increases and bin intervals shrink.
3) Compute the (Shannon) entropy of the streamflow time series for each bitrate and prior.
4) Compute the KL divergence (KLD), Wasserstein distance (WD), and total variation distance (TVD) for each pair of stations meeting the minimum record / minimum concurrency criteria.
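
The quantization and prior steps listed above can be illustrated with a small, dependency-free sketch. It is not the code in `data_processing_functions`: the helper names, the equal-width binning, the base-2 logarithm, and the made-up flow values are all assumptions chosen for illustration. The uniform prior is taken to add `10**exponent` pseudo-counts to every bin of the simulated distribution before normalizing.

```python
import math

def quantize_counts(values, bitrate, lo, hi):
    """Count values in 2**bitrate equal-width bins spanning [lo, hi] (assumed binning)."""
    n_bins = 2 ** bitrate
    width = (hi - lo) / n_bins
    counts = [0.0] * n_bins
    for v in values:
        # clamp the index so that v == hi falls in the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[max(idx, 0)] += 1.0
    return counts

def apply_uniform_prior(counts, exponent):
    """Add 10**exponent pseudo-counts to every bin, then normalize to a PMF."""
    pseudo = 10.0 ** exponent
    total = sum(counts) + pseudo * len(counts)
    return [(c + pseudo) / total for c in counts]

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; bins where P is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation_distance(p, q):
    """Half the L1 distance between two PMFs."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# hypothetical flows: the simulated series never reaches the highest bin
observed = [0.5, 1.2, 1.9, 2.4, 3.1, 3.3, 7.8]
simulated = [0.4, 0.9, 1.1, 1.5, 2.2, 2.8, 3.0]

bitrate = 2                      # 2 bits -> 4 quantization levels
lo, hi = 0.0, 8.0
p_counts = quantize_counts(observed, bitrate, lo, hi)
q_counts = quantize_counts(simulated, bitrate, lo, hi)

p = [c / sum(p_counts) for c in p_counts]
q_posterior = apply_uniform_prior(q_counts, exponent=-2)

dkl = kl_divergence(p, q_posterior)
tvd = total_variation_distance(p, q_posterior)
print(f'D_KL = {dkl:.3f} bits, TVD = {tvd:.3f}')
```

Without the prior, the observed value in the highest bin would be compared against a simulated probability of zero and the divergence would be undefined. A smaller exponent penalizes such "impossible" states more heavily, which is why the notebook tests a range of pseudo-count exponents.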

## Import and inspect catchment attributes

In [None]:
import os
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

In [None]:
# import the catchments used in the study
attributes_filename = 'BCUB_HYSETS_properties_with_climate.csv'
df = pd.read_csv(os.path.join('data', attributes_filename))

In [None]:
# get the number of unique stations in the dataset
unique_stations = np.unique(df['official_id'])
print(f'{len(unique_stations)} unique monitored catchments in the dataset')

In [None]:
# visualize the locations (centroids) of the catchments
centroids = df.apply(lambda row: Point(row['centroid_lon_deg_e'], row['centroid_lat_deg_n']), axis=1)
# convert to geodataframe
gdf = gpd.GeoDataFrame(df, geometry=centroids, crs='EPSG:4269')
# convert coordinate reference system to 3857 for plotting
gdf = gdf.to_crs(3857)
bbox = gdf.geometry.total_bounds

In [None]:
# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind
output_notebook()

# range bounds supplied in web mercator coordinates
p = figure(x_axis_type="mercator", y_axis_type="mercator", width=700, height=400,
          x_range=(bbox[0], bbox[2]), y_range=(bbox[1], bbox[3]))
p.add_tile("CartoDB Positron", retina=True)
p.scatter(x=gdf.geometry.x, y=gdf.geometry.y, color='orange', size=4)


show(p)

## Import streamflow timeseries

In [None]:
import data_processing_functions as dpf

# BASE_DIR = os.path.dirname(os.path.abspath(__file__))
BASE_DIR = os.getcwd()
STREAMFLOW_DIR = os.path.join(BASE_DIR, 'data', 'hysets_streamflow_timeseries')

In [None]:
# test loading streamflow time series for a pair of monitoring stations
s1, s2 = unique_stations[0], unique_stations[1]
print(s1, s2)
test_df = dpf.retrieve_nonconcurrent_data(s1, s2)
flow_fig = figure(width=700, height=350, x_axis_type='datetime')
flow_fig.line(test_df.index, test_df[s1], color='navy', legend_label=str(s1))
flow_fig.line(test_df.index, test_df[s2], color='dodgerblue', legend_label=str(s2))
flow_fig.yaxis.axis_label = r'$$\text{Flow } \frac{m^3}{s}$$'
flow_fig.xaxis.axis_label = r'$$\text{Date}$$'
show(flow_fig)

## Shannon entropy processing

Compute the Shannon entropy of individual streamflow time series.  The Shannon entropy is given by: $$H(X) = -\sum_i P(x_i) \log P(x_i)$$ where $P(x_i)$ is the probability of the $i$-th quantized state.

The entropy is computed for various quantization bit depths (`bitrate` parameter).  No prior is applied here.

From the scipy.stats [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html):
>*"If messages consisting of sequences of symbols from a set are to be encoded and transmitted over a noiseless channel, then the Shannon entropy H(pk) gives a tight lower bound for the average number of units of information [bits] needed per symbol if the symbols occur with frequencies governed by the discrete distribution pk."*
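
As a worked illustration of the entropy computation, the sketch below evaluates the Shannon entropy of a few hand-made count vectors. It is a minimal stand-in rather than the notebook's `dpf.compute_observed_series_entropy`: the function name is hypothetical, and the base-2 logarithm (entropy in bits) is an assumption.

```python
import math

def shannon_entropy(counts):
    """Entropy in bits of the PMF obtained by normalizing a vector of bin counts."""
    total = sum(counts)
    # empty bins are skipped, following the convention 0 * log(0) = 0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# 16 equally likely states: the maximum for a 4-bit quantization
uniform = shannon_entropy([1] * 16)

# every observation in one bin: no uncertainty
constant = shannon_entropy([10, 0, 0, 0])

# probabilities 1/2, 1/4, 1/8, 1/8
skewed = shannon_entropy([8, 4, 2, 2])

print(uniform, constant, skewed)
```

With a base-2 logarithm, a series quantized at $b$ bits can have at most $b$ bits of entropy, reached only when all $2^b$ levels are equally likely. This upper bound is a useful sanity check on the `H_{bitrate}_bits` columns.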

In [None]:
# create a new output filename 
output_filename = attributes_filename.replace('.csv', '_with_entropy.csv')
output_fpath = os.path.join('data', output_filename)

# the bitrate dictates the number of quantization levels = 2**b, i.e. 4 bits = 16 levels
quant_labels = []
if not os.path.exists(output_fpath):
    bitrates = [4, 6, 8]
    for bitrate in bitrates:
        label = f'H_{bitrate}_bits'
        print(f'Processing {bitrate} bit entropy')
        df[label] = df.apply(lambda row: dpf.compute_observed_series_entropy(row, bitrate), axis=1)
        quant_labels.append(label)
    # save the results
    df.to_csv(output_fpath, index=False)
else:
    df = pd.read_csv(output_fpath)
    # match only the entropy columns created above, e.g. 'H_4_bits'
    quant_labels = [e for e in df.columns if e.startswith('H_') and e.endswith('_bits')]

In [None]:
# plot the CDFs of entropy by quantization
fig = figure(width=600, height=350, x_axis_label=r'$$H(X)$$', y_axis_label=r'$$P(H)$$')
# the Colorblind palette is defined for 3 to 8 colours
palette = Colorblind[max(len(quant_labels), 3)]
for n, label in enumerate(quant_labels):
    x, y = dpf.compute_cdf(df[label])
    fig.line(x, y, legend_label=' '.join(label.split('_')[1:]), line_width=2, color=palette[n])
fig.legend.location = 'top_left'
show(fig)

## Pairwise f-divergence processing

There are about 1.3 million pairings.  To speed up the processing and avoid losing progress, we process these in parallel in batches and save the results intermittently.
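
The pairing count and the batching idea can be checked with a short sketch. The station count of 1620 is the approximate figure quoted in the introduction and the batch size of 5000 matches the notebook's `batch_size`. The `iter_batches` helper is hypothetical and is not part of `data_processing_functions`.

```python
import itertools
import math

def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# number of unordered station pairs: n * (n - 1) / 2
n_stations = 1620
n_pairs = math.comb(n_stations, 2)
n_batches = math.ceil(n_pairs / 5000)
print(f'{n_pairs} pairs -> {n_batches} batches of up to 5000')

# small demonstration with made-up station ids
station_ids = ['A', 'B', 'C', 'D', 'E']
pairs = itertools.combinations(station_ids, 2)
batches = list(iter_batches(pairs, batch_size=4))
print([len(b) for b in batches])
```

Because each batch is a self-contained list of pairs, its results can be written to disk as soon as it finishes, so an interrupted run only loses the batch in progress.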

In [None]:
import itertools

# generate all combinations of pairs of station ids
id_pairs = list(itertools.combinations(unique_stations, 2))
print(f'There are {len(id_pairs)} unique pairings in the dataset')

### Define input variables

In [None]:
# If true, the observation counts will incorporate observation error
# and divide observation counts based on proportion of bin covered 
# by the error range
partial_counts = [False, True]
# set a revision date for the results output file
revision_date = '20240717'

# how many pairs to compute in each batch
batch_size = 5000

# what fraction of 365 daily observations in a year counts as a "complete" year
completeness_threshold = 0.9
min_observations = 365 * completeness_threshold

# station pairs with fewer than min_years concurrent years of data are excluded (for concurrent analysis),
# stations with fewer than min_years of data are excluded (for non-concurrent analysis)
min_years = 1 #[2, 3, 4, 5, 10]

# a prior is applied to q as a uniform array of pseudo-counts equal to 10**c,
# for each exponent c listed below; the range of exponents is used to test
# the effect of the choice of prior on the model
pseudo_counts = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# set the number of quantization levels to test, equal to 2^bitrate
bitrates = [4, 6, 8]
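
The `partial_counts` option can be pictured with the following sketch, which spreads each observation's single count across the bins overlapped by its error interval. The symmetric relative error, the function name, and the bin edges are assumptions for illustration; the notebook's actual implementation lives in `data_processing_functions` and may differ.

```python
import bisect

def partial_bin_counts(values, edges, rel_error):
    """Distribute one count per value over bins in proportion to error-interval overlap."""
    counts = [0.0] * (len(edges) - 1)
    for v in values:
        lo, hi = v * (1 - rel_error), v * (1 + rel_error)
        span = hi - lo
        if span == 0:
            # zero flow has no interval under a purely relative error:
            # assign the whole count to the bin containing v
            idx = bisect.bisect_right(edges, v) - 1
            counts[max(min(idx, len(counts) - 1), 0)] += 1.0
            continue
        for i in range(len(counts)):
            overlap = min(hi, edges[i + 1]) - max(lo, edges[i])
            if overlap > 0:
                # any part of the interval outside the bin range is dropped
                counts[i] += overlap / span
    return counts

edges = [0.0, 2.0, 4.0, 6.0, 8.0]   # hypothetical bin edges (4 levels)
flows = [1.9, 5.0]                  # made-up observations
counts = partial_bin_counts(flows, edges, rel_error=0.1)
print([round(c, 3) for c in counts])
```

The observation at 1.9 sits close to a bin edge, so part of its count spills into the neighbouring bin, while the observation at 5.0 stays entirely within one bin. As bins narrow at higher bitrates, more intervals straddle edges, which is the smoothing effect described in the pre-processing overview.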


Review, organize, and separate the attribute and metadata columns.

In [None]:
print(df.columns)

In [None]:
attr_cols = [
    'drainage_area_km2', 'elevation_m', 'slope_deg', 'gravelius', 'perimeter', 'aspect_deg', 
    'land_use_forest_frac_2010','land_use_grass_frac_2010', 'land_use_wetland_frac_2010',
    'land_use_water_frac_2010', 'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 
    'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010', 'logk_ice_x100', 'porosity_x100']

climate_cols = [
    'tmax', 'tmin', 'prcp', 'srad', 'swe', 'vp', 
    'high_prcp_freq', 'low_prcp_freq', 'high_prcp_duration', 'low_prcp_duration',
]

flag_cols = ['flag_shape_extraction', 'flag_terrain_extraction', 'flag_subsoil_extraction', 'flag_gsim_boundaries', 'flag_artificial_boundaries', 'flag_land_use_extraction']
metadata_cols = [e for e in df.columns if e not in climate_cols + attr_cols]

In [None]:
# the 'process' variable is here so jupyter doesn't go computing 
# three million rows when the book is rebuilt.  This step is very time consuming.
from time import time

process = False
if process:
    for use_partial_counts in partial_counts:
        for b in bitrates:
            # start a timer for this bitrate / partial-count combination
            t0 = time()
            print(f'Processing pairs at {b} bits (partial counts={use_partial_counts})')
            results_fname = f'DKL_results_{b}bits_{revision_date}.csv'
            if use_partial_counts:
                results_fname = results_fname.replace('.csv', '_partial_counts.csv')

            out_fpath = os.path.join('data', results_fname)
            existing_results = dpf.check_processed_results(out_fpath)
            print(f'    {len(existing_results)} existing results loaded.')

            # skip pairs that already have saved results
            if existing_results.empty:
                id_pairs_filtered = id_pairs
            else:
                id_pairs_filtered = dpf.filter_processed_pairs(existing_results, id_pairs)

            inputs = [(proxy_stn, target_stn, b, completeness_threshold, min_years, use_partial_counts, attr_cols, climate_cols, pseudo_counts) for proxy_stn, target_stn in id_pairs_filtered]

            batch_files = dpf.process_pairwise_comparisons(inputs, b, results_fname, batch_size)

            print(f'    Processed {len(inputs)} pairs at ({b} bits) in {time() - t0:.1f} seconds')
            print(f'    Concatenating {len(batch_files)} batch files.')

            if len(batch_files) > 0:
                all_results = pd.concat([pd.read_csv(f, engine='pyarrow') for f in batch_files], axis=0)
                all_results.to_csv(out_fpath, index=False)
                if os.path.exists(out_fpath):
                    for f in batch_files:
                        os.remove(f)
                print(f'    Wrote {len(all_results)} results to {out_fpath}')
            else:
                print('    No new results to write to file.')

## Citations

```{bibliography}
:filter: docname in docnames
```