Some rgispy functionality requires [RGIS](https://github.com/bmfekete/RGIS) to be installed in your local environment.

```
mamba create -n rgis python=3.9 gdal ipykernel geopandas xarray rasterio rioxarray sqlalchemy geoalchemy2 psycopg2 climata

~/my-conda-envs/rgis/bin/pip install git+https://github.com/dvignoles/rgispy@main
```

Switch this notebook to your `rgis` kernel.

In [1]:
import pandas as pd
import xarray as xr
from pathlib import Path

In [2]:
from rgispy.network import gdbn_to_netcdf_base
from rgispy.mask import get_mask_ds, get_point_mask_from_df
from rgispy.sample import sample_ds
from rgispy.postprocess import join_sampled_files, georeference_sampled, normalize_sampled_files, get_sampled_df_byattr

In [41]:
# Change to wherever you want the outputs of this notebook to end up
OUTPUT_DIR = Path.cwd().joinpath('demo_outputs')
OUTPUT_DIR.mkdir(exist_ok=True)

In [42]:
# The datastream we want to sample
ds = Path('/asrc/ecr/balazs/WBMdsFiles/CONUS/Network_03min/TCfull+WBM20WTempPrist/CONUS_Output_Discharge_TCfull+WBM20WTempPrist_03min_dTS2020.gds.gz')

# the WBM network we are working on
net_gdbn = Path('/asrc/ecr/balazs/GHAAS2/RGISarchive/CONUS/Network/HydroSTN30/03min/Static/CONUS_Network_HydroSTN30_03min_Static.gdbn.gz')

# The network converted to netcdf (we'll create this)
net_nc = OUTPUT_DIR.joinpath('CONUS_Network_HydroSTN30_Static.nc')

# The Mask we will use to sample the WBM output grids (we'll create this)
mask_nc = OUTPUT_DIR.joinpath('CONUS_Mask_HydroSTN30_Static.nc')

### Setup

To start, we need a representation of the network we are working with in Python.

In [43]:
help(gdbn_to_netcdf_base)

Help on function gdbn_to_netcdf_base in module rgispy.network:

gdbn_to_netcdf_base(in_gdbn: pathlib.Path, out_netcdf: pathlib.Path, project: str = '') -> pathlib.Path
    Convert .gdbn rgis network to netcdf network compatible with rgispy
    Raises:
        Exception: unable to encode maximum value
        Exception: unable to encode maximum value
    Returns:
        Path: Path to created netcdf network



In [44]:
if not net_nc.exists():
    gdbn_to_netcdf_base(net_gdbn, net_nc, project="Demo")

network = xr.open_dataset(net_nc)
network

To sample WBM outputs, you need to know which WBM network CellIDs you are interested in.

The gauges in the csv below have been "snapped" to the network and associated with a CellID. 

If you have a `gdbc` file of snapped features, you can use `rgis2table <gdbc file> > myfeatures.csv` to export them. 
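
As a minimal sketch of what the snapped table provides, the snippet below builds a three-row stand-in for the gauge CSV and pulls out the CellIDs. The in-memory CSV and the variable names are illustrative assumptions; the rows are copied from the demo subset, with station ids written with the leading zero that USGS site numbers carry.

```python
import io

import pandas as pd

# Illustrative stand-in for the snapped gauge CSV (three rows copied from the demo subset)
csv_text = """ID,Name,CellID,station_id
1,ALABAMA RIVER AT CLAIBORNE L&D NEAR MONROEVILLE,286259,02428400
2,"ALABAMA RIVER NEAR MONTGOMERY, AL.",286318,02420000
3,"Allegheny River at Franklin, PA",4807,03025500
"""

# Read station_id as a string so leading zeros are preserved
gauges = pd.read_csv(io.StringIO(csv_text), dtype={"station_id": str})

# The CellIDs are what tie each gauge to a cell of the WBM network
cell_ids = gauges["CellID"].tolist()
print(cell_ids)

# Each gauge in this subset should map to a distinct network cell
assert gauges["CellID"].is_unique
```

Reading `station_id` with `dtype=str` matters later, when the ids are passed to the USGS download.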

In [45]:
gauges_subset = pd.read_csv('input_data/CONUS_Gauges_HydroSTN30_03min_Static_Subset.csv', dtype={'station_id':str})
gauges_subset

Unnamed: 0.1,Unnamed: 0,ID,Name,CellID,XCoordOrig,XCoord03min,YCoordOrig,YCoord03min,station_id
0,0,1,ALABAMA RIVER AT CLAIBORNE L&D NEAR MONROEVILLE,286259,-87.550545,-87.574997,31.615158,31.674999,2428400
1,1,2,"ALABAMA RIVER NEAR MONTGOMERY, AL.",286318,-86.408302,-86.425003,32.411526,32.424999,2420000
2,2,3,"Allegheny River at Franklin, PA",4807,-79.820335,-79.775002,41.389503,41.325001,3025500
3,3,4,"Allegheny River at Kittanning, PA",3614,-79.531433,-79.525002,40.820343,40.825001,3036500
4,4,5,"Allegheny River at Natrona, PA",3260,-79.718384,-79.724998,40.615345,40.625,3049500
5,5,6,"Allegheny River at Parker, PA",3929,-79.68116,-79.675003,41.100616,41.125,3031500
6,6,7,"Allegheny River at West Hickory, PA",5263,-79.407822,-79.425003,41.570896,41.575001,3016000
7,7,8,"Allegheny River bl Conewango Creek at Warren, PA",5633,-79.149765,-79.175003,41.843948,41.825001,3015310
8,8,9,"ALTAMAHA RIVER AT DOCTORTOWN, GA",327749,-81.827888,-81.824997,31.654659,31.674999,2226000
9,9,10,"ALTAMAHA RIVER AT US 221, NR CHARLOTTEVILLE, GA",327763,-82.517029,-82.525002,31.957861,31.975,2224940


Using the CellIDs of the gauges, we create a mask of the network. We will then iterate over the records in our WBM output, keeping only the desired cells.
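
Conceptually, a point mask is a grid with the network's shape that marks the cells of interest. A minimal NumPy sketch of that idea follows. The tiny grid of cell IDs and the use of `-1` for cells outside the network are made up for illustration, and this is not necessarily how `get_point_mask_from_df` builds its mask internally.

```python
import numpy as np

# Hypothetical 3x4 network grid: each entry is a CellID, -1 marks cells outside the network
cell_id_grid = np.array([
    [1, 2, 3, -1],
    [4, 5, 6, 7],
    [-1, 8, 9, 10],
])

# CellIDs of the (hypothetical) gauges we want to keep
wanted = [2, 6, 9]

# True where a grid cell holds one of the wanted CellIDs
point_mask = np.isin(cell_id_grid, wanted)

# Row/column positions of the selected cells, in row-major order
rows, cols = np.nonzero(point_mask)
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)
```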

In [46]:
mask = get_mask_ds(network)
gauges_mask = get_point_mask_from_df(gauges_subset, network, wbm_fieldname='Cellid')
mask = mask.assign(Gauges=gauges_mask)
mask.to_netcdf(mask_nc)
mask

`sample_ds` iterates over the datastream and samples it with a list of masks. Here we pass the function our mask netCDF file and a single mask layer to sample, `'Gauges'`.
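
The sampling loop can be pictured as follows: for every time step of the gridded output, keep only the values under the mask and collect them into one column per date. This is a rough sketch with synthetic NumPy arrays standing in for the datastream records. The grid, dates, and values are invented, and the real `sample_ds` reads a datastream and writes CSVs rather than returning a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 network: CellIDs and a boolean mask selecting two of them
cell_id_grid = np.array([[1, 2, 3], [4, 5, 6]])
mask = np.isin(cell_id_grid, [2, 6])

# Synthetic "records": one 2D grid of values per time step (value = CellID + step)
dates = pd.date_range("2020-01-01", periods=3, freq="D")
records = [cell_id_grid + float(step) for step in range(len(dates))]

# Keep only the masked cells of each record; one column per date
columns = {date.strftime("%Y-%m-%d"): grid[mask] for date, grid in zip(dates, records)}
sampled = pd.DataFrame(columns, index=pd.Index(cell_id_grid[mask], name="cellid"))
print(sampled)
```

The result has one row per selected cell and one column per date, which is the wide layout of the sampled CSVs.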

If you would like to sample gdbc files, you can first convert them to datastreams with `rgis2ds --template <network gdbn file>` or use `rgispy.sample.sample_gdbc` which does the conversion at runtime. 

### Sampling Outputs

In [47]:
help(sample_ds)

Help on function sample_ds in module rgispy.sample:

sample_ds(mask_nc: pathlib.Path, file_in: Union[BinaryIO, pathlib.Path], mask_layers: List[str], output_dir: pathlib.Path, year: Optional[int], variable: str, time_step: str, csv_name: Union[str, pathlib.Path] = None, cell_area: numpy.ndarray = None) -> None
    Sample a datastream using a netcdf mask
    
    Args:
        mask_nc (Path): netcdf mask file
        file_in (Union[BinaryIO, Path]): datastream file object or pathlike
        mask_layers (List[str]): list of masks from mask_nc to sample with
        output_dir (Path): directory of output
        year (Optional[int]): year of datastream file
        variable (str): variable of datastream file (ie. Discharge, Temperature..)
        time_step (str): annual, monthly, daily, alt, or dlt
        csv_name (Union[str, Path]): Name of resulting sampled csv
        cell_area (Optional[np.ndarray]): Cell Area grid corresponding to mask. Needed for polygon masks to calculate weighte

In [48]:
sample_ds(
    mask_nc,
    ds, 
    ['Gauges',],
    OUTPUT_DIR,
    2020,
    'Discharge',
    'Daily',
)

### Sampled Results

The sampling outputs are in wide format, with the first column holding the CellID.

Each year of data (each from a different datastream) is written to its own CSV.

The cells below demonstrate some convenience pandas wrappers for reading these CSVs.
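
For reference, the wide layout can be reshaped into a long layout (one row per cell and date) with plain pandas. The sketch below uses two cells and two dates copied from the sampled discharge output of this notebook. It illustrates the reshape only and is not the implementation of the rgispy helpers.

```python
import pandas as pd

# Two cells and two dates copied from the sampled discharge output
wide_df = pd.DataFrame(
    {"2020-01-01": [207.36041, 179.9179], "2020-01-02": [168.32173, 219.45528]},
    index=pd.Index([5633, 327749], name="cellid"),
)

# Move the date columns into rows: one (cellid, date) pair per record
long_df = (
    wide_df.reset_index()
    .melt(id_vars="cellid", var_name="date", value_name="discharge")
    .assign(date=lambda df: pd.to_datetime(df["date"]))
    .set_index(["cellid", "date"])
    .sort_index()
)
print(long_df)
```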

In [49]:
discharge_csvs = sorted(OUTPUT_DIR.joinpath('Gauges', 'Daily').glob('Discharge*.csv'))
gauges_sample = join_sampled_files(discharge_csvs)
gauges_sample.head()

cellid,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05,2020-01-06,2020-01-07,2020-01-08,2020-01-09,2020-01-10,...,2020-12-22,2020-12-23,2020-12-24,2020-12-25,2020-12-26,2020-12-27,2020-12-28,2020-12-29,2020-12-30,2020-12-31
5633,207.36041,168.32173,169.76184,177.88289,194.61562,196.51176,198.42311,183.2935,175.5601,179.90372,...,97.96534,82.49037,113.95719,206.76605,272.9591,133.22107,55.805374,134.39664,98.61685,113.30947
327749,179.9179,219.45528,192.82738,213.50494,455.94635,499.14005,123.86987,135.79077,198.37408,253.01462,...,223.22884,205.23354,190.67566,187.05771,241.43396,235.68604,294.2166,321.5678,224.82028,198.84181
4807,318.65292,280.6538,252.5939,279.11957,295.9477,304.49747,297.4742,292.6431,269.40634,274.50165,...,169.31168,151.9631,160.82301,296.6238,385.25244,277.93283,101.065155,167.53857,215.77008,160.9241
286318,768.48315,785.91296,1289.8307,1640.0891,1269.8315,922.35657,1001.70135,920.2475,806.3162,565.05725,...,444.42938,474.1129,476.75662,616.18353,987.0025,910.68207,606.9061,698.96533,481.44556,408.20007
5263,251.79381,207.51556,199.81293,213.93773,230.97249,235.01498,236.87784,222.5459,211.23311,211.28032,...,133.72623,100.51388,143.27567,231.0501,317.471,201.42395,64.192184,149.42358,149.58693,118.84749


In [50]:
# extract lat lons from the network
georeference_sampled(gauges_sample, network)

cellid,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05,2020-01-06,2020-01-07,2020-01-08,2020-01-09,2020-01-10,...,2020-12-24,2020-12-25,2020-12-26,2020-12-27,2020-12-28,2020-12-29,2020-12-30,2020-12-31,longitude,latitude
5633,207.36041,168.32173,169.76184,177.88289,194.61562,196.51176,198.42311,183.2935,175.5601,179.90372,...,113.95719,206.76605,272.9591,133.22107,55.805374,134.39664,98.61685,113.30947,-79.175003,41.825001
327749,179.9179,219.45528,192.82738,213.50494,455.94635,499.14005,123.86987,135.79077,198.37408,253.01462,...,190.67566,187.05771,241.43396,235.68604,294.2166,321.5678,224.82028,198.84181,-81.824997,31.674999
4807,318.65292,280.6538,252.5939,279.11957,295.9477,304.49747,297.4742,292.6431,269.40634,274.50165,...,160.82301,296.6238,385.25244,277.93283,101.065155,167.53857,215.77008,160.9241,-79.775002,41.325001
286318,768.48315,785.91296,1289.8307,1640.0891,1269.8315,922.35657,1001.70135,920.2475,806.3162,565.05725,...,476.75662,616.18353,987.0025,910.68207,606.9061,698.96533,481.44556,408.20007,-86.425003,32.424999
5263,251.79381,207.51556,199.81293,213.93773,230.97249,235.01498,236.87784,222.5459,211.23311,211.28032,...,143.27567,231.0501,317.471,201.42395,64.192184,149.42358,149.58693,118.84749,-79.425003,41.575001
327763,163.46869,189.91391,216.10405,455.75305,485.27814,264.5694,170.53455,159.74155,219.21228,184.96793,...,185.61493,189.56291,239.06125,292.69922,222.8721,212.69754,176.28929,165.41496,-82.525002,31.975
286259,782.723,1089.4595,1200.2937,1583.0253,1736.0283,1237.5731,1192.0807,1280.6808,1240.3241,1269.3894,...,802.0588,892.2996,1039.5427,1140.5701,1002.3865,761.222,757.5161,668.8217,-87.574997,31.674999
3929,507.79218,433.1212,406.45108,437.91568,475.1067,471.0485,461.94394,450.14124,427.68448,427.1854,...,260.25586,438.52808,578.7573,430.21918,186.28235,255.03743,362.2795,266.0874,-79.675003,41.125
3260,707.3496,619.70917,570.3414,622.88043,670.77704,645.8341,612.37103,628.0267,588.2273,598.6996,...,317.7251,451.09512,632.9049,576.4431,306.36444,271.49802,416.9298,388.7026,-79.724998,40.625
3614,594.0001,508.11432,479.78113,499.5238,557.3517,532.7293,538.68774,511.2636,506.9046,482.81458,...,280.9904,425.37537,603.40137,499.14832,244.09265,242.29205,387.48407,319.95798,-79.525002,40.825001


In [51]:
normalize_sampled_files(discharge_csvs, 'Discharge', gauges_subset)

sampleid,date,discharge
1,2020-01-01,782.72300
1,2020-01-02,1089.45950
1,2020-01-03,1200.29370
1,2020-01-04,1583.02530
1,2020-01-05,1736.02830
...,...,...
10,2020-12-27,292.69922
10,2020-12-28,222.87210
10,2020-12-29,212.69754
10,2020-12-30,176.28929


In [52]:
# You can select by attribute (in this case station_id)
get_sampled_df_byattr(discharge_csvs, gauges_subset, 'station_id', '03036500', normalize=False, stacked=True, variable='Discharge',)

cellid,date,Discharge
3614,2020-01-01,594.00010
3614,2020-01-02,508.11432
3614,2020-01-03,479.78113
3614,2020-01-04,499.52380
3614,2020-01-05,557.35170
3614,...,...
3614,2020-12-27,499.14832
3614,2020-12-28,244.09265
3614,2020-12-29,242.29205
3614,2020-12-30,387.48407


## USGS Data

You can use the climata package to download USGS data. USGS discharge needs to be converted from cubic feet per second to cubic meters per second.
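
The conversion factor follows from the definition of the foot (exactly 0.3048 m), so one cubic foot is 0.3048³ ≈ 0.0283168 m³. A small worked example, using a flow of 40000 ft³/s (the first USGS value returned further down in this notebook); the constant names here are chosen for illustration:

```python
# One international foot is exactly 0.3048 m, so a cubic foot is 0.3048**3 cubic meters
FT_TO_M = 0.3048
FT3_TO_M3_EXACT = FT_TO_M ** 3

# Rounded factor, as used in this notebook
FT3_TO_M3_ROUNDED = 0.0283168

discharge_cfs = 40000.0  # ft^3/s
discharge_cms = discharge_cfs * FT3_TO_M3_ROUNDED  # m^3/s
print(f"{discharge_cms:.3f} m^3/s")  # prints 1132.672 m^3/s
```

The rounded factor differs from the exact one by less than 1e-7, which is negligible next to the uncertainty of gauged discharge.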

In [53]:
from climata.usgs import DailyValueIO

In [54]:
stations = gauges_subset.station_id
dates = pd.date_range(start="1/1/2020", end="12/31/2020", freq="D").tolist()

In [55]:
DISCHARGE = "00060" # ft^3/s
RIVTEMP = '00010' # Celsius
FT3_TO_M3 = 0.0283168

In [56]:
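from typing import Optional, Tuple

def download_usgs_df(station_id: str, param_id: str, date_list=None) -> Tuple[str, str, Optional[pd.DataFrame]]:
    """Download USGS daily values; returns (station_id, param_id, DataFrame or None)."""
    if date_list is None:
        date_list = pd.date_range(start="1/1/1990", end="12/31/2020", freq="D").tolist()

    data = DailyValueIO(
        start_date=date_list[0],
        end_date=date_list[-1],
        station=station_id,
        parameter=param_id,
    )

    # No series came back for this station/parameter combination
    if len(data.keys()) == 0:
        return (station_id, param_id, None)

    # If several series are returned, the last one is kept
    for series in data:
        value = [r[1] for r in series.data]
        dates = [r[0] for r in series.data]

    # Values indexed by date, tagged with the station they belong to
    df = pd.DataFrame(value, index=dates)
    df['station_id'] = station_id
    return (station_id, param_id, df)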

In [57]:
station, param, df = download_usgs_df(stations[0], DISCHARGE, dates)

In [58]:
df.head()

Unnamed: 0,0,station_id
2020-01-01,40000.0,2428400
2020-01-02,34200.0,2428400
2020-01-03,37600.0,2428400
2020-01-04,59100.0,2428400
2020-01-05,65100.0,2428400


In [59]:
usgs_discharge = OUTPUT_DIR.joinpath('usgs_discharge.csv')

if not usgs_discharge.exists():
    results = []
    nones = []
    
    for i, gauge in enumerate(stations):
        
        result = download_usgs_df(gauge, DISCHARGE, dates)
        if result[2] is None:
            nones.append(result)
        
        results.append(result)
        
        # progress as a percentage of the stations requested
        print(f"{100 * i // len(stations)} %")
        
    gauge_dfs = [x[2] for x in results if x[2] is not None]
    usgs_discharge_df = pd.concat(gauge_dfs)
    usgs_discharge_df = usgs_discharge_df.rename(columns={0: "discharge"}).set_index('station_id', append=True)
    usgs_discharge_df.index = usgs_discharge_df.index.rename(['date', 'station_id'])
    usgs_discharge_df = usgs_discharge_df.sort_values(['station_id', 'date'])

    # convert to m^3/s
    usgs_discharge_df['discharge'] = usgs_discharge_df['discharge'] * FT3_TO_M3
    usgs_discharge_df = usgs_discharge_df.rename(columns={'discharge':'usgs_discharge'})
    usgs_discharge_df.to_csv(usgs_discharge)

    print(f"{len(nones)} gauges returned no usgs results")
else:
    # reuse the download cached by a previous run
    usgs_discharge_df = pd.read_csv(usgs_discharge, dtype={'station_id': str}, parse_dates=['date']).set_index(['date', 'station_id'])

0 %
10 %
20 %
30 %
40 %
50 %
60 %
70 %
80 %
90 %
0 gauges returned no usgs results


In [60]:
usgs_discharge_df

date,station_id,usgs_discharge
2020-01-01,02224940,1056.21664
2020-01-02,02224940,1010.90976
2020-01-03,02224940,920.29600
2020-01-04,02224940,846.67232
2020-01-05,02224940,818.35552
...,...,...
2020-12-27,03049500,1095.86016
2020-12-28,03049500,1053.38496
2020-12-29,03049500,1166.65216
2020-12-30,03049500,1143.99872


From here you can compare the USGS discharge and WBM results directly. 
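
As one possible way to line the two up, the sketch below joins WBM and USGS discharge on station and date and computes the difference. It is self-contained: the small frames hold values copied from the outputs above for the Alabama River gauge at Claiborne (CellID 286259), and the column names `wbm_discharge` and `difference`, as well as the `gauge_lookup` frame, are chosen here for illustration.

```python
import pandas as pd

FT3_TO_M3 = 0.0283168

# WBM discharge (m^3/s) sampled at CellID 286259, values copied from the sampled output
wbm = pd.DataFrame({
    "cellid": [286259, 286259],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "wbm_discharge": [782.723, 1089.4595],
})

# USGS discharge for the same gauge, converted from ft^3/s to m^3/s
usgs = pd.DataFrame({
    "station_id": ["02428400", "02428400"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "usgs_discharge": [40000.0 * FT3_TO_M3, 34200.0 * FT3_TO_M3],
})

# Lookup from network cell to gauge, as provided by the snapped gauge table
gauge_lookup = pd.DataFrame({"cellid": [286259], "station_id": ["02428400"]})

# Attach station ids to the WBM samples, then align with USGS by station and date
compared = (
    wbm.merge(gauge_lookup, on="cellid")
    .merge(usgs, on=["station_id", "date"], how="inner")
)
compared["difference"] = compared["wbm_discharge"] - compared["usgs_discharge"]
print(compared[["station_id", "date", "wbm_discharge", "usgs_discharge", "difference"]])
```

An inner join keeps only the dates present in both sources, so gaps in the USGS record drop out of the comparison rather than showing up as missing values.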

In [61]:
# cleanup (delete everything)
import shutil
shutil.rmtree(OUTPUT_DIR)