# Process Target Variables

The catchment attributes have been updated using revised catchment bounds from USGS and WSC where available, or bounds delineated from processed rasters where not. The target variables are the last input data to process before we can train the gradient boosted decision tree (GBDT) models, which take in attributes and predict the various target variables.

## Shannon entropy processing

Compute the Shannon entropy of individual streamflow time series.  The Shannon entropy is given by: 

$$H(X) = -\sum_{i=1}^n P(x_i) \log_2 P(x_i)$$

The entropy is computed for various quantization bit depths (`bitrate` parameter, $n=2^{\text{bitrate}}$ in the above summation).  No prior is applied here.
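
As a quick check on the definition, the sketch below computes the entropy of two hand-made count vectors over a 2-bit dictionary (4 symbols) and compares the result with `scipy.stats.entropy`. The helper name `shannon_entropy_bits` and the count vectors are illustrative assumptions, not part of the processing code.

```python
import numpy as np
from scipy.stats import entropy

def shannon_entropy_bits(counts):
    """Shannon entropy (in bits) of a discrete distribution given by bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    # by convention 0 * log2(0) = 0, so empty bins are dropped
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# a 2-bit dictionary (4 symbols) used uniformly reaches the 2-bit maximum
uniform_counts = [25, 25, 25, 25]
# a skewed use of the same dictionary carries less information per symbol
skewed_counts = [97, 1, 1, 1]

h_uniform = shannon_entropy_bits(uniform_counts)
h_skewed = shannon_entropy_bits(skewed_counts)
# scipy normalizes the counts internally before computing the entropy
h_scipy = entropy(skewed_counts, base=2)
print(h_uniform, h_skewed, h_scipy)
```

The uniform case shows why the entropy at a given bitrate can never exceed the bitrate itself: $\log_2 n$ bits is the maximum for a dictionary of $n$ symbols.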

From the scipy.stats [docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html):
>*"If messages consisting of sequences of symbols from a set are to be encoded and transmitted over a noiseless channel, then the Shannon entropy H(pk) gives a tight lower bound for the average number of units of information [bits] needed per symbol if the symbols occur with frequencies governed by the discrete distribution pk."*

Let's run through an example computation to see the difference between 4, 6, and 8 bit quantization, how each represents the total measurement range, and how each quantization aligns with your own expectation of heteroscedastic rating curve uncertainty.
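
To relate bit depth to measurement uncertainty before running the full example, the sketch below computes how wide each log-spaced bin is relative to its lower edge. The flow range (0.1 to 1000 m³/s) is a hypothetical value chosen for illustration, and `relative_bin_width` is not a function from `data_processing_functions`.

```python
import numpy as np

# hypothetical flow range spanning four orders of magnitude [m^3/s]
min_q, max_q = 0.1, 1000.0

def relative_bin_width(min_q, max_q, bitrate):
    """Width of each log-spaced bin as a fraction of its lower edge."""
    n_bins = 2 ** bitrate
    # n_bins equal-width bins in log10 space need n_bins + 1 edges
    edges = np.logspace(np.log10(min_q), np.log10(max_q), n_bins + 1)
    return edges[1:] / edges[:-1] - 1.0

for b in [4, 6, 8]:
    widths = relative_bin_width(min_q, max_q, b)
    print(f'{b} bits: {2**b} bins, each spanning {100 * widths[0]:.1f}% of its lower edge')
```

Because the bins are equal width in log space, the relative width is constant across the range. For this assumed range it is about 78% at 4 bits, 15% at 6 bits, and 4% at 8 bits, which can be compared directly against a percentage-based expectation of rating curve error.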

In [1]:
import os
from time import time
from scipy.stats import entropy
import data_processing_functions as dpf
import numpy as np
import pandas as pd
import geopandas as gpd

# visualize the catchment centroid locations
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.palettes import Colorblind, Sunset10
output_notebook()

BASE_DIR = os.getcwd()

In [2]:
stn = '05010500'
test_df = dpf.get_timeseries_data(stn)

test_fig = figure(title=f'Sample Distribution by Dictionary Size (Station ID {stn})',
                 width=800, height=300)
n = 0
for b in [4, 6, 8, 10, 12]:
    test_df.dropna(subset=[stn], inplace=True)
    # add a very small margin on the range so every value falls strictly
    # inside the outer edges: np.digitize leaves the right edge open by
    # default and would return 2**b + 1 for values equal to the max
    min_q, max_q = test_df[stn].min() - 1e-9, test_df[stn].max() + 1e-9
    assert min_q > 0
    # use equal width bins in log10 space (2**b bins need 2**b + 1 edges)
    log_edges = np.linspace(np.log10(min_q), np.log10(max_q), 2**b + 1)
    linear_edges = 10**log_edges
    test_df[f'{b}_bits_quantized'] = np.digitize(test_df[stn], linear_edges)
    unique, counts = np.unique(test_df[f'{b}_bits_quantized'], return_counts=True)
    count_dict = dict(zip(unique, counts))
    # bin indices returned by np.digitize run from 1 to 2**b
    frequencies = np.array([count_dict.get(e, 0) for e in range(1, 2**b + 1)])
    normed_frequencies = frequencies / frequencies.sum()
    H = entropy(normed_frequencies, base=2)
    bin_midpoints = (linear_edges[1:] + linear_edges[:-1]) / 2
    bottoms = [0 for _ in normed_frequencies]
    test_fig.quad(left=linear_edges[:-1], right=linear_edges[1:], top=normed_frequencies, bottom=bottoms, 
                  legend_label=f'{b} bits (H={H:.2f})', color=Sunset10[n], fill_alpha=0.5)
    
    test_fig.xaxis.axis_label = r'$$\text{Flow} \left[ m^3/s \right]$$'
    test_fig.yaxis.axis_label = r'P(X)'
    test_fig.legend.location = 'top_left'
    test_fig.legend.click_policy = 'hide'
    n += 2
    


show(test_fig)

In [3]:
# load the updated watershed attributes
attributes_filename = 'BCUB_watershed_attributes_updated.csv'
attributes_fpath = os.path.join(os.getcwd(), 'data', attributes_filename)
df = pd.read_csv(attributes_fpath)
df.columns = [e.lower() for e in df.columns]
df.columns

Index(['region', 'official_id', 'drainage_area_km2', 'centroid_lon_deg_e',
       'centroid_lat_deg_n', 'logk_ice_x100', 'porosity_x100',
       'land_use_forest_frac_2010', 'land_use_shrubs_frac_2010',
       'land_use_grass_frac_2010', 'land_use_wetland_frac_2010',
       'land_use_crops_frac_2010', 'land_use_urban_frac_2010',
       'land_use_water_frac_2010', 'land_use_snow_ice_frac_2010',
       'lulc_check_2010', 'land_use_forest_frac_2015',
       'land_use_shrubs_frac_2015', 'land_use_grass_frac_2015',
       'land_use_wetland_frac_2015', 'land_use_crops_frac_2015',
       'land_use_urban_frac_2015', 'land_use_water_frac_2015',
       'land_use_snow_ice_frac_2015', 'lulc_check_2015',
       'land_use_forest_frac_2020', 'land_use_shrubs_frac_2020',
       'land_use_grass_frac_2020', 'land_use_wetland_frac_2020',
       'land_use_crops_frac_2020', 'land_use_urban_frac_2020',
       'land_use_water_frac_2020', 'land_use_snow_ice_frac_2020',
       'lulc_check_2020', 'slope_deg', '

In [4]:
# the bitrate dictates the number of quantization levels = 2**b, i.e. 4 bits = 16 levels
quant_labels = []
filename = 'BCUB_watershed_attributes_updated.csv'
entropy_fpath = os.path.join(BASE_DIR, 'data', 'processed_divergence_inputs', filename)
if not os.path.exists(entropy_fpath):
    bitrates = [4, 6, 8, 9, 10, 11, 12]
    for bitrate in bitrates:
        label = f'H_{bitrate}_bits'
        print(f'Processing {bitrate} bit entropy')
        df[label] = df.apply(lambda row: dpf.compute_observed_series_entropy(row, bitrate), axis=1)
        quant_labels.append(label)
    # save the results
    df.to_csv(entropy_fpath, index=False)
else:
    df = pd.read_csv(entropy_fpath)
    quant_labels = [e for e in df.columns if e.startswith('H')]

In [5]:
# plot the CDFs of entropy by quantization
fig = figure(width=600, height=350, x_axis_label=r'$$H(X)$$', y_axis_label=r'$$P(H)$$')
n = 0
for l in quant_labels:
    x, y = dpf.compute_cdf(df[l])
    fig.line(x, y, legend_label=' '.join(l.split('_')[1:]), line_width=2, color=Colorblind[len(quant_labels)][n])
    n += 1
fig.legend.location = 'top_left'
show(fig)
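
The implementation of `dpf.compute_cdf` is not shown here. A minimal sketch of an empirical CDF, assuming the function returns the sorted values and their cumulative probabilities, could look like the following (`empirical_cdf` is a hypothetical stand-in):

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted finite values and their cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    # drop NaN or infinite entries (np.sort places NaN at the end)
    x = x[np.isfinite(x)]
    # the i-th smallest value has cumulative probability i / n
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = empirical_cdf([3.2, 1.5, np.nan, 2.7, 4.1])
print(x, y)
```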

## Pairwise f-divergence processing

There are about 1.3 million pairings.  To speed up the processing and avoid losing progress, we process these in parallel in batches and save the results intermittently.
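
The checkpointing idea can be sketched independently of the divergence computation: split the pairs into batches, write one file per batch, and skip any batch whose file already exists. In the sketch below, `run_in_batches` and `toy_process_batch` are hypothetical names, and the toy function stands in for the expensive pairwise step, which is where a worker pool (for example `multiprocessing.Pool.map`) would be used.

```python
import os
import tempfile
import numpy as np
import pandas as pd

def run_in_batches(pairs, batch_size, out_dir, process_batch):
    """Process pairs batch by batch, writing one CSV per batch and
    skipping any batch whose output file already exists."""
    os.makedirs(out_dir, exist_ok=True)
    n_batches = max(len(pairs) // batch_size, 1)
    batch_files = []
    index_batches = np.array_split(np.arange(len(pairs)), n_batches)
    for i, idx in enumerate(index_batches, start=1):
        fpath = os.path.join(out_dir, f'batch_{i:03d}.csv')
        if not os.path.exists(fpath):
            batch = [pairs[j] for j in idx]
            process_batch(batch).to_csv(fpath, index=False)
        batch_files.append(fpath)
    # combine the per-batch files into a single result table
    return pd.concat([pd.read_csv(f) for f in batch_files], ignore_index=True)

def toy_process_batch(batch):
    # hypothetical stand-in for the expensive pairwise computation
    return pd.DataFrame(batch, columns=['proxy', 'target'])

pairs = [(f'A{i}', f'B{i}') for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    first = run_in_batches(pairs, 4, tmp, toy_process_batch)
    # a second call finds all batch files on disk and recomputes nothing
    second = run_in_batches(pairs, 4, tmp, toy_process_batch)
print(len(first), len(second))
```

If the run is interrupted, only the batch in progress is lost; completed batches are read back from disk on the next run.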

In [6]:
import itertools
unique_stations = list(set(df['official_id'].values))
# generate all combinations of pairs of station ids
id_pairs = list(itertools.combinations(unique_stations, 2))
print(f' There are {len(id_pairs)} unique pairings in the dataset')
# shuffle the pairs to make testing smaller batches more robust
np.random.seed(42)
np.random.shuffle(id_pairs)

 There are 1293636 unique pairings in the dataset


### Define input variables

In [7]:
# If true, the observation counts will incorporate observation error
# and divide observation counts based on proportion of bin covered 
# by the error range
partial_counts = True # or False to generate simple bin counting
# set a revision date for the results output file
revision_date = '20240812'

# how many pairs to compute in each batch
batch_size = 5000

# what fraction of 365 observations in a year counts as a "complete" year
completeness_threshold = 0.9
min_observations = 365 * completeness_threshold

# station pairs with fewer than min_years concurrent years of data are excluded (for concurrent analysis),
# stations with fewer than min_years of data are excluded (for non-concurrent analysis)
min_years = 1 #[2, 3, 4, 5, 10]

# a prior is applied to q in the form of a uniform array of 10**c pseudo-counts,
# where c is each of the exponents listed below
# this prior is used to test the effect of the choice of prior on the model
pseudo_counts = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# set the number of quantization levels to test, equal to 2^bitrate
bitrates = [4, 6, 8, 9, 10, 11, 12]
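
To illustrate what the pseudo-count prior does, the sketch below adds a uniform $10^c$ pseudo-counts to every bin of $Q$ before normalizing, then computes the Kullback-Leibler divergence in bits. This is an assumption about the general approach, not the implementation in `data_processing_functions`: the function name, the direction $D_{KL}(P||Q)$, and the count vectors are all illustrative.

```python
import numpy as np

def kl_divergence_with_prior(p_counts, q_counts, pseudo_count_exponent):
    """D_KL(P||Q) in bits, with a uniform prior of 10**c pseudo-counts
    added to every bin of Q before normalizing."""
    p_counts = np.asarray(p_counts, dtype=float)
    q_counts = np.asarray(q_counts, dtype=float)
    p = p_counts / p_counts.sum()
    q_post = q_counts + 10.0 ** pseudo_count_exponent
    q = q_post / q_post.sum()
    # terms with p = 0 contribute nothing to the sum
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p_counts = [10, 30, 50, 10]
q_counts = [0, 40, 60, 0]  # Q never observes the outer bins
dkl_by_prior = {c: kl_divergence_with_prior(p_counts, q_counts, c) for c in [-5, 0, 5]}
for c, dkl in dkl_by_prior.items():
    print(f'c={c}: D_KL = {dkl:.3f} bits')
```

Without a prior, the empty bins in $Q$ would make the divergence infinite. A very small prior keeps it finite but large, while a very large prior pushes $Q$ toward a uniform distribution regardless of the observations, which is why a range of exponents is tested.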

In [8]:
# load the attributes file with catchment geometries
geom_file = 'BCUB_watershed_attributes_updated.geojson'
bcub_gdf = gpd.read_file(os.path.join(os.getcwd(), 'data', geom_file))
bcub_gdf.columns = [c.lower() for c in bcub_gdf.columns]

Review, organize, and separate the attribute and metadata columns.

In [9]:
attr_cols = [
    'drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg', 
    'land_use_forest_frac_2010','land_use_grass_frac_2010', 'land_use_wetland_frac_2010',
    'land_use_water_frac_2010', 'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 
    'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010', 'logk_ice_x100', 'porosity_x100'
]

climate_cols = [
    'tmax', 'tmin', 'prcp', 'srad', 'swe', 'vp', 
    'high_prcp_freq', 'low_prcp_freq', 'high_prcp_duration', 'low_prcp_duration',
]

flag_cols = ['flag_shape_extraction', 'flag_terrain_extraction', 'flag_subsoil_extraction', 'flag_gsim_boundaries', 'flag_artificial_boundaries', 'flag_land_use_extraction']
metadata_cols = [e for e in df.columns if e not in climate_cols + attr_cols]

## Process the data 


```{note}
This step is very time consuming. You can skip it by downloading the processed files as described at the [top of the page](https://dankovacek.github.io/divergence_measures/notebooks/1_data.html).
```

In [10]:
def input_batch_generator(df, id_pairs_filtered, bitrate, completeness_threshold, 
                    min_years, use_partial_counts, attr_cols, climate_cols, pseudo_counts):
    batch_inputs = []
    for proxy, target in id_pairs_filtered:
        # look up each station in the dataframe passed to the function
        proxy_dict = df.loc[df['official_id'] == proxy].to_dict(orient='records')[0]
        target_dict = df.loc[df['official_id'] == target].to_dict(orient='records')[0]

        batch = [
            proxy_dict, 
            target_dict, 
            bitrate, completeness_threshold, 
            min_years, use_partial_counts, attr_cols, climate_cols, 
            pseudo_counts
        ]
        batch_inputs.append(batch)
    return batch_inputs

In [11]:
temp_dir = os.path.join(os.getcwd(), 'data', 'temp')
os.makedirs(temp_dir, exist_ok=True)

In [12]:
# for b in bitrates:
#     results_fname = f'KL_results_{b}bits_{revision_date}.csv'
#     fpath = os.path.join('data/', 'processed_divergence_inputs', results_fname)
#     print(f'Loading {b} bits results....')
#     foo = pd.read_csv(fpath)
#     for pr in [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
#         min_kl, max_kl = foo[f'dkl_concurrent_post_{pr}R'].min(), foo[f'dkl_concurrent_post_{pr}R'].max()
#         print(f'    prior={pr}: {min_kl:.2f} - {max_kl:.2f}')
#     print('')

In [13]:
# the 'process' variable is here so jupyter doesn't go computing 
# a million rows per iteration when the book is built for pushing to github pages.
process = False
if process: 
    for bitrate in bitrates:
        print(f'Processing pairs at {bitrate} bits quantization (fractional counts={partial_counts})')
        results_fname = f'KL_results_{bitrate}bits_{revision_date}.csv'
        if partial_counts:
            results_fname = results_fname.replace('.csv', '_partial_counts.csv')

        out_fpath = os.path.join('data/', 'processed_divergence_inputs', results_fname)
        if os.path.exists(out_fpath):
            continue

        # existing_results = dpf.check_processed_results(out_fpath)

        # if existing_results.empty:
        #     id_pairs_filtered = id_pairs
        # else:
        #     id_pairs_filtered = dpf.filter_processed_pairs(existing_results, id_pairs)
        #     print(f'    {len(existing_results)} existing results loaded.')

        # set some number of batches to create inputs for multiprocessing
        # n_batches = max(len(id_pairs_filtered) // batch_size, 1)
        # batches = np.array_split(np.array(id_pairs_filtered, dtype=object), n_batches)
        # n_pairs = len(id_pairs_filtered)
        n_batches = max(len(id_pairs) // batch_size, 1)
        batches = np.array_split(np.array(id_pairs, dtype=object), n_batches)
        n_pairs = len(id_pairs)        
        print(
            f"    Processing {n_pairs} pairs in {n_batches} batches at {bitrate} bits (partial_counts={partial_counts})"
        )
        batch_no = 1
        batch_files = []
        t0 = time()
        for batch_ids in batches:
            print(f'Starting batch {batch_no}/{len(batches)} processing.')
            batch_fname = results_fname.replace('.csv', f'_batch_{batch_no:03d}.csv')
            batch_output_fpath = os.path.join(temp_dir, batch_fname)
            if os.path.exists(batch_output_fpath):
                batch_files.append(batch_output_fpath)
                batch_no += 1
                continue
            
            # define the input array for multiprocessing
            t1 = time()
            inputs = input_batch_generator(bcub_gdf, batch_ids, bitrate, completeness_threshold, 
                     min_years, partial_counts, attr_cols, climate_cols, pseudo_counts)
            batch_result = dpf.process_pairwise_comparisons(inputs, bitrate)
            if batch_result.empty:
                print('Empty batch.  Skipping')
            else:
                batch_result.to_csv(batch_output_fpath, index=False)
                # only track files that were actually written, otherwise
                # the concatenation step fails on a missing file
                batch_files.append(batch_output_fpath)
                print(f"    Saved {len(batch_result)} new results to file.")
            
            t2 = time()
            print(f'    Processed {len(batch_ids)} pairs at ({bitrate} bits) in {t2 - t1:.1f} seconds')
            batch_no += 1
            
        print(f'    Concatenating {len(batch_files)} batch files.')
        if len(batch_files) > 0:
            all_results = pd.concat([pd.read_csv(f, engine='pyarrow') for f in batch_files], axis=0)
            all_results.to_csv(out_fpath, index=False)
            if os.path.exists(out_fpath):
                for f in batch_files:
                    os.remove(f)
            print(f'    Wrote {len(all_results)} results to {out_fpath}')
        else:
            print('    No new results to write to file.')

Processing pairs at 4 bits quantization (fractional counts=True)
Processing pairs at 6 bits quantization (fractional counts=True)
Processing pairs at 8 bits quantization (fractional counts=True)
Processing pairs at 9 bits quantization (fractional counts=True)
Processing pairs at 10 bits quantization (fractional counts=True)
    Processing 1293636 pairs in 258 batches at 10 bits (partial_counts=True)
Starting batch 1/258 processing.
Starting batch 2/258 processing.
Starting batch 3/258 processing.
Starting batch 4/258 processing.
Starting batch 5/258 processing.
Starting batch 6/258 processing.
Starting batch 7/258 processing.
Starting batch 8/258 processing.
Starting batch 9/258 processing.
Starting batch 10/258 processing.
Starting batch 11/258 processing.
Starting batch 12/258 processing.
Starting batch 13/258 processing.
Starting batch 14/258 processing.
Starting batch 15/258 processing.
Starting batch 16/258 processing.
Starting batch 17/258 processing.
Starting batch 18/258 proces


## Citations

```{bibliography}
:filter: docname in docnames
```