# Explore and Get Global Hydrography Derived from TDX-Hydro 

TDX-Hydro is the best available global hydrographic datasuite, released in 2022 by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil) in collaboration with USACE ERDC and NASA, and derived from the 12 m resolution TanDEM-X elevation model.
- McCormack et al. 2022. [Validation of TDX-Hydro; a global, TanDEM-X derived, 12m resolution hydrographic data suite](https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1119749). AGU Abstract. 

The [GEOGlOWS ECMWF Streamflow Model](https://geoglows.ecmwf.int/) project is building their [v2.0 release](https://data.geoglows.org/geoglows-2-0) around a [modified version of TDX-Hydro](https://data.geoglows.org/dataset-descriptions/gis-streams-and-catchments) with added attributes (i.e. topological order) and slightly simplified headwater streamlines for improved modeling and mapping.

TDX-Hydro was built around [HydroSHEDS v1 HydroBASINS](https://www.hydrosheds.org/products/hydrobasins) Level 2 boundaries (continental sub-units).

This notebook explores how to accesss these datasets and how they are interelated.

# Imports & Setup

In [1]:
import os
from pathlib import Path
import urllib

import fsspec
import s3fs
import numpy as np
import pandas as pd
import geopandas as gpd
import pyogrio
import pyarrow as pa

import geoviews as gv
gv.extension(
    'bokeh', 
    # 'matplotlib',
)

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [2]:
# Confirm conda environment
os.environ['CONDA_DEFAULT_ENV']

'hydrography'

### Set Paths for Model Inputs/Outputs
Use the [`pathlib`](https://docs.python.org/3/library/pathlib.html) library, whose many benfits for managing paths over  `os` library or string-based approaches are described in [this blog post](https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f).
- [pathlib](https://docs.python.org/3/library/pathlib.html) user guide: https://realpython.com/python-pathlib/

In [3]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
data_dir = project_dir / 'data_temp' # a temporary data directory that we .gitignore
data_dir

PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp')

### Create local file system using `fsspec` library

We'll use the Filesystem Spec ([`fsspec`](https://filesystem-spec.readthedocs.io)) library and its extensions throughout this project to provide a unified pythonic interface to local, remote and embedded file systems and bytes storage.

In [41]:
# Create local file system using fsspec library
# local_fs = fsspec.implementations.local.LocalFileSystem()
local_fs = fsspec.filesystem('local') 

In [42]:
# List files in our temporary data directory
local_fs.ls(data_dir)

['/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/global_hydrography.qgz',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/geoglows-v2',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/.DS_Store',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/hydrobasins',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga']

In [6]:
# List file details
local_data_list = local_fs.ls(data_dir, detail=True)
# Show first item's details
local_data_list[0]

{'name': '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/hybas_ar_lev02_v1c.zip',
 'size': 2189459,
 'type': 'file',
 'created': 1715029836.387596,
 'islink': False,
 'mode': 33188,
 'uid': 502,
 'gid': 20,
 'mtime': 1715029836.387596,
 'ino': 262421410,
 'nlink': 1}

# HydroBASINS

The main purpose of fetching HydroBASINS v1 data is to understand the spatial organization of TDX-Hydro datafiles, which are organized around HydroBASINS Level 2 Continental Subunits.

Get and visualize HydroBASINS Level 2 Continental Subunits.
- https://www.hydrosheds.org/products/hydrobasins

Data are downloadable by continent at different levels, such as `Africa Level 02 - Standard (2MB)` is downloaded via:
- https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev02_v1c.zip

Refer to [HydroBASINS Technical Documentation](https://data.hydrosheds.org/file/technical-documentation/HydroBASINS_TechDoc_v1c.pdf) for attribute descriptions and coding and naming conventions.

## Select filename

In [7]:
# HydroBASINS Identifier & Region (Section 3.1 in Tech Docs)
hybas_region_dict = {
    'af': 'Africa',
    'ar': 'North American Arctic',
    'as': 'Central and South-East Asia',
    'au': 'Australia and Oceania',
    'eu': 'Europe and Middle East',
    'gr': 'Greenland',
    'na': 'North America and Caribbean sa South America',
    'si': 'Siberia',
}

In [8]:
# Construct URL patterns
hybas_root_url = 'https://data.hydrosheds.org/file/HydroBASINS'
hybas_format = 'standard'
hybas_url = f'{hybas_root_url}/{hybas_format}'

In [9]:
# Start with North America, to compare with NHDplus in Model My Watershed
region = 'na'
hybas_filename = f'hybas_{region}_lev02_v1c.zip'
hybas_filepath = f'{hybas_url}/{hybas_filename}'

In [10]:
# pathlib (or urllib) can be used for parsing string
Path(hybas_filepath).name

'hybas_na_lev02_v1c.zip'

## Get files with fsspec filesystem 

We'll use the Filesystem Spec ([`fsspec`](https://filesystem-spec.readthedocs.io)) library and its extensions throughout this project to provide a unified pythonic interface to local, remote and embedded file systems and bytes storage.

In [12]:
# HydroBASINS files need to be accessed one file at a time
hybas_fs = fsspec.filesystem(protocol='http')
hybas_fs.exists(hybas_filepath)

True

In [13]:
%%time
# getting remot info is fast
hybas_fs.info(hybas_filepath)

CPU times: user 3.43 ms, sys: 1.65 ms, total: 5.08 ms
Wall time: 552 ms


{'name': 'https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_na_lev02_v1c.zip',
 'size': 3103724,
 'mimetype': 'application/zip',
 'url': 'https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_na_lev02_v1c.zip',
 'type': 'file'}

In [14]:
%%time
# Check if we have it, else download
local_filepath = Path(data_dir / hybas_filename)
if local_filepath.exists():
    print('We have it!')
else:
    # Get the remote file and save to local directory, returns None
    hybas_fs.get(hybas_filepath, data_dir)
    print('Dowloaded!')
print(local_filepath, local_filepath.exists())

We have it!
/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/hybas_na_lev02_v1c.zip True
CPU times: user 567 µs, sys: 498 µs, total: 1.07 ms
Wall time: 653 µs


## Open all Level 2 files and combine

In [16]:
%%time
# Get all continents (level 2).
for region in hybas_region_dict.keys():
    hybas_filename = f'hybas_{region}_lev02_v1c.zip'
    hybas_filepath = f'{hybas_url}/{hybas_filename}'
    hybas_fs.get(hybas_filepath, data_dir)
    print(hybas_filename, (data_dir/hybas_filename).exists())

hybas_af_lev02_v1c.zip True
hybas_ar_lev02_v1c.zip True
hybas_as_lev02_v1c.zip True
hybas_au_lev02_v1c.zip True
hybas_eu_lev02_v1c.zip True
hybas_gr_lev02_v1c.zip True
hybas_na_lev02_v1c.zip True
hybas_si_lev02_v1c.zip True
CPU times: user 246 ms, sys: 130 ms, total: 376 ms
Wall time: 11.9 s


In [17]:
%%time
# Create list of GeoDataframes of all continental regions
gdf_list = []
for region in hybas_region_dict.keys():
    print(hybas_region_dict[region])
    hybas_filename = f'hybas_{region}_lev02_v1c.zip'
    print('  ', pyogrio.list_layers(data_dir/hybas_filename))
    gdf = gpd.read_file(
        data_dir/hybas_filename, 
        engine='pyogrio',
    )
    print('  ', gdf.PFAF_ID.values)
    gdf_list.append(gdf)

Africa
   [['hybas_af_lev02_v1c' 'Polygon']]
   [11 12 13 14 15 17 18 16]
North American Arctic
   [['hybas_ar_lev02_v1c' 'Polygon']]
   [81 82 83 35 84 85 86]
Central and South-East Asia
   [['hybas_as_lev02_v1c' 'Polygon']]
   [42 43 44 45 41 48 46 47 49]
Australia and Oceania
   [['hybas_au_lev02_v1c' 'Polygon']]
   [51 52 53 56 54 55 57]
Europe and Middle East
   [['hybas_eu_lev02_v1c' 'Polygon']]
   [21 22 23 24 25 26 27 28 29]
Greenland
   [['hybas_gr_lev02_v1c' 'Polygon']]
   [91]
North America and Caribbean sa South America
   [['hybas_na_lev02_v1c' 'Polygon']]
   [77 78 71 72 73 74 75 76]
Siberia
   [['hybas_si_lev02_v1c' 'Polygon']]
   [31 32 33 34 35 36]
CPU times: user 533 ms, sys: 139 ms, total: 672 ms
Wall time: 755 ms


In [18]:
# Concatenate the GeoDataFrames, which unfortuantly requires converting to Pandas and back
hybas_all_lev02_gdf = gpd.GeoDataFrame(pd.concat(gdf_list, ignore_index=True))
hybas_all_lev02_gdf.set_index('HYBAS_ID', inplace=True)
hybas_all_lev02_gdf.info()
hybas_all_lev02_gdf.crs

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 55 entries, 1020000010 to 3020024310
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   NEXT_DOWN  55 non-null     int64   
 1   NEXT_SINK  55 non-null     int64   
 2   MAIN_BAS   55 non-null     int64   
 3   DIST_SINK  55 non-null     float64 
 4   DIST_MAIN  55 non-null     float64 
 5   SUB_AREA   55 non-null     float64 
 6   UP_AREA    55 non-null     float64 
 7   PFAF_ID    55 non-null     int32   
 8   ENDO       55 non-null     int32   
 9   COAST      55 non-null     int32   
 10  ORDER      55 non-null     int32   
 11  SORT       55 non-null     int64   
 12  geometry   55 non-null     geometry
dtypes: float64(4), geometry(1), int32(4), int64(4)
memory usage: 5.2 KB


<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [19]:
hybas_all_lev02_gdf.head()

Unnamed: 0_level_0,NEXT_DOWN,NEXT_SINK,MAIN_BAS,DIST_SINK,DIST_MAIN,SUB_AREA,UP_AREA,PFAF_ID,ENDO,COAST,ORDER,SORT,geometry
HYBAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1020000010,0,1020000010,1020000010,0.0,0.0,3258330.6,3258330.6,11,0,1,0,1,"MULTIPOLYGON (((33.67778 27.62917, 33.67119 27..."
1020011530,0,1020011530,1020011530,0.0,0.0,4660080.9,4660080.9,12,0,1,0,2,"MULTIPOLYGON (((34.80278 -19.81667, 34.79279 -..."
1020018110,0,1020018110,1020018110,0.0,0.0,4900405.1,4900405.1,13,0,1,0,3,"MULTIPOLYGON (((5.64444 -1.47083, 5.62972 -1.4..."
1020021940,0,1020021940,1020021940,0.0,0.0,4046600.5,4046600.5,14,0,1,0,4,"MULTIPOLYGON (((0.97778 5.98750, 0.97022 5.988..."
1020027430,0,1020027430,1020027430,0.0,0.0,6923559.6,6923559.6,15,0,1,0,5,"MULTIPOLYGON (((23.28611 32.22083, 23.28133 32..."


In [20]:
hybas_all_lev02_gdf.to_parquet(
    data_dir / 'hybas_all_lev02_gdf.parquet',
    compression='brotli',
)

In [21]:
hybas_all_lev02_gdf.loc[1020011530]

NEXT_DOWN                                                    0
NEXT_SINK                                           1020011530
MAIN_BAS                                            1020011530
DIST_SINK                                                  0.0
DIST_MAIN                                                  0.0
SUB_AREA                                             4660080.9
UP_AREA                                              4660080.9
PFAF_ID                                                     12
ENDO                                                         0
COAST                                                        1
ORDER                                                        0
SORT                                                         2
geometry     MULTIPOLYGON (((34.8027777777778 -19.816666666...
Name: 1020011530, dtype: object

## Plot 

In [22]:
# Select basins by Pfafstetter code level 2
PFAF_IDs = [
    # 77, 78, 71, 72, 
    73, 
    # 74, 
    # 75, 76,
]
plot_gdf = hybas_all_lev02_gdf[hybas_all_lev02_gdf.PFAF_ID.isin(PFAF_IDs)]

In [None]:
basemap = gv.tile_sources.CartoLight().opts(width=600, height=400)

In [40]:
hybas_map = gv.Polygons(plot_gdf.geometry).opts(tools=["hover"])

In [41]:
basemap * hybas_map

In [43]:
hybas_all_lev02_gdf[hybas_all_lev02_gdf.PFAF_ID.isin(PFAF_IDs)]

Unnamed: 0_level_0,NEXT_DOWN,NEXT_SINK,MAIN_BAS,DIST_SINK,DIST_MAIN,SUB_AREA,UP_AREA,PFAF_ID,ENDO,COAST,ORDER,SORT,geometry
HYBAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
7020038340,0,7020038340,7020038340,0.0,0.0,1074536.2,1074536.2,73,0,1,0,5,"MULTIPOLYGON (((-76.24722 38.90000, -76.25478 ..."


# GEOGloWS v2 Streams and Boundaries

The [GEOGlOWS ECMWF Streamflow Model](https://geoglows.ecmwf.int/) project is building their [v2.0 release](https://data.geoglows.org/geoglows-2-0) around a [modified version of TDX-Hydro](https://data.geoglows.org/dataset-descriptions/gis-streams-and-catchments) with added attributes (i.e. topological order) and slightly simplified headwater streamlines for improved modeling and mapping.

[GEOGloWS v2 Data Guide](https://data.geoglows.org/geoglows-2-0)
- [Available Data: Datasets Quick Reference](https://data.geoglows.org/available-data)
- [GIS Streams and Catchments](https://data.geoglows.org/dataset-descriptions/gis-streams-and-catchments)

Amazon Web Services (AWS) Resources:
- AWS Marketplace: https://aws.amazon.com/marketplace/pp/prodview-aboaljwcz64zs  
- AWS Browse Bucket: http://geoglows-v2.s3-website-us-west-2.amazonaws.com/#streams/  

## Setup s3 filesystem

Use [S3Fs](https://s3fs.readthedocs.io/en/latest/), a file interface to AWS Simple Storage Service (S3), built upon fsspec.

In [7]:
bucket_uri = 's3://geoglows-v2'
region_name = 'us-west-2'
s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name=region_name))

In [8]:
s3.ls(bucket_uri)

['geoglows-v2/configs',
 'geoglows-v2/index.html',
 'geoglows-v2/licenses.md',
 'geoglows-v2/streams',
 'geoglows-v2/streams-global',
 'geoglows-v2/tables',
 'geoglows-v2/tdxhydro-processing',
 'geoglows-v2/v2-master-table.parquet']

In [9]:
s3.ls('geoglows-v2/tables')

['geoglows-v2/tables/package-metadata-table.parquet',
 'geoglows-v2/tables/v2-countries-table.parquet',
 'geoglows-v2/tables/v2-model-table.parquet']

In [10]:
s3.ls('geoglows-v2/tables', detail=True)

[{'Key': 'geoglows-v2/tables/package-metadata-table.parquet',
  'LastModified': datetime.datetime(2024, 4, 7, 23, 42, 10, tzinfo=tzutc()),
  'ETag': '"1dc44f550f4f5007b1d8cec610448ff2-17"',
  'Size': 138976855,
  'StorageClass': 'INTELLIGENT_TIERING',
  'type': 'file',
  'size': 138976855,
  'name': 'geoglows-v2/tables/package-metadata-table.parquet'},
 {'Key': 'geoglows-v2/tables/v2-countries-table.parquet',
  'LastModified': datetime.datetime(2024, 3, 27, 20, 14, 46, tzinfo=tzutc()),
  'ETag': '"364d9a7c66c05c6bca7a6dc830d922ea-14"',
  'Size': 116089831,
  'StorageClass': 'INTELLIGENT_TIERING',
  'type': 'file',
  'size': 116089831,
  'name': 'geoglows-v2/tables/v2-countries-table.parquet'},
 {'Key': 'geoglows-v2/tables/v2-model-table.parquet',
  'LastModified': datetime.datetime(2024, 3, 19, 19, 0, 16, tzinfo=tzutc()),
  'ETag': '"934f76158c2c007057e91f4e0d47f3d0-30"',
  'Size': 245506513,
  'StorageClass': 'INTELLIGENT_TIERING',
  'type': 'file',
  'size': 245506513,
  'name': 'geo

In [28]:
s3.ls('geoglows-v2/tdxhydro-processing', detail=True)

[{'Key': 'geoglows-v2/tdxhydro-processing/processing_options.xlsx',
  'LastModified': datetime.datetime(2024, 2, 1, 18, 33, 48, tzinfo=tzutc()),
  'ETag': '"98c7d2ef5fa26ab6535397904cf96b4e"',
  'Size': 13788,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 13788,
  'name': 'geoglows-v2/tdxhydro-processing/processing_options.xlsx'},
 {'Key': 'geoglows-v2/tdxhydro-processing/tdx_header_numbers.json',
  'LastModified': datetime.datetime(2023, 11, 3, 17, 50, 53, tzinfo=tzutc()),
  'ETag': '"6a92699d9234a0f00032af4028f5935d"',
  'Size': 1366,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 1366,
  'name': 'geoglows-v2/tdxhydro-processing/tdx_header_numbers.json'},
 {'Key': 'geoglows-v2/tdxhydro-processing/tdxhydro-lake-rivers.csv',
  'LastModified': datetime.datetime(2024, 2, 6, 20, 52, 38, tzinfo=tzutc()),
  'ETag': '"26a1d2318d6756b7ecb9e21ef0deb391"',
  'Size': 2013476,
  'StorageClass': 'INTELLIGENT_TIERING',
  'type': 'file',
  'size': 2013476,
  'name': 'geoglows-

## Get files

Saving to relative paths identical to how they are stored on S3.

In [29]:
filepath = 'geoglows-v2/tdxhydro-processing/terminal_node_vpu_list.csv'
s3.get(
    filepath,
    data_dir / filepath
)

[None]

In [30]:
(data_dir / filepath).exists()

True

In [None]:
# to be fleshed out as above


## Explore Meta-Data files

### v2_master_table

A metadata table for every stream reach, without geometry data, which is useful for traversing relationships among stream reaches.

Example use from [How To: GEOGLOWS Cookbook](https://data.geoglows.org/how-to):
- [Get a list of all rivers ID numbers in my watershed](https://data.geoglows.org/how-to#h.jplr2qu7ljvz)

In [59]:
filepath = 'geoglows-v2/v2-master-table.parquet'
# Get info from remote
s3.info(filepath)

{'Key': 'geoglows-v2/v2-master-table.parquet',
 'LastModified': datetime.datetime(2024, 3, 22, 1, 43, 35, tzinfo=tzutc()),
 'ETag': '"934f76158c2c007057e91f4e0d47f3d0-30"',
 'Size': 245506513,
 'StorageClass': 'INTELLIGENT_TIERING',
 'type': 'file',
 'size': 245506513,
 'name': 'geoglows-v2/v2-master-table.parquet'}

In [75]:
# Open file locally
local_fp = data_dir / filepath
local_fp.exists()

True

In [111]:
v2_master_table_df = pd.read_parquet(
    local_fp, 
    # dtype_backend='pyarrow', # arrow seems to slow down reading & increase memeory
)
v2_master_table_df.info()
v2_master_table_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6838900 entries, 0 to 6838899
Data columns (total 12 columns):
 #   Column                Dtype  
---  ------                -----  
 0   LINKNO                int64  
 1   DSLINKNO              int64  
 2   strmOrder             int64  
 3   USContArea            float64
 4   DSContArea            float64
 5   TDXHydroRegion        object 
 6   VPUCode               int64  
 7   TopologicalOrder      int64  
 8   LengthGeodesicMeters  float64
 9   TerminalLink          int64  
 10  musk_k                int64  
 11  musk_x                float64
dtypes: float64(4), int64(7), object(1)
memory usage: 626.1+ MB


Unnamed: 0,LINKNO,DSLINKNO,strmOrder,USContArea,DSContArea,TDXHydroRegion,VPUCode,TopologicalOrder,LengthGeodesicMeters,TerminalLink,musk_k,musk_x
0,110007873,110009186,2,1.083060e+07,4.209271e+07,1020000010,102,177181,13384.058990,110013122,53536,0.25
1,110010498,110011810,2,1.096513e+07,2.296918e+07,1020000010,102,177182,2201.649965,110013122,8807,0.25
2,110005251,110019682,2,1.111977e+07,1.703163e+07,1020000010,102,177183,1972.054239,110030180,7888,0.25
3,110018371,110014434,2,1.076590e+07,3.751075e+07,1020000010,102,177184,5305.002476,110030180,21220,0.25
4,110009187,110011811,2,1.321197e+07,2.302526e+07,1020000010,102,177185,3587.784457,110030180,14351,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...
6838895,420783553,420750784,9,8.182978e+11,8.183078e+11,4020006940,404,550688,12581.201310,420711866,13356,0.25
6838896,420750784,420790716,9,8.183141e+11,8.183305e+11,4020006940,404,550689,16356.933960,420711866,17364,0.25
6838897,420790716,420774331,9,8.183527e+11,8.183661e+11,4020006940,404,550690,10981.256921,420711866,11657,0.25
6838898,420774331,420711866,9,8.183722e+11,8.183836e+11,4020006940,404,550691,13729.151664,420711866,14574,0.25


In [112]:
# Check to see if there is a reason to keep as string object
v2_master_table_df.TDXHydroRegion.notnull().value_counts()

TDXHydroRegion
True    6838900
Name: count, dtype: int64

In [113]:
# Clean up to save
int32_cols = ['LINKNO', 'DSLINKNO', 'strmOrder', 'VPUCode', 'TopologicalOrder', 'TerminalLink', 'musk_k']
for col in int32_cols:
    v2_master_table_df[col] = v2_master_table_df[col].astype(np.int32)
    # v2_master_table_df[col] = v2_master_table_df[col].astype(pd.ArrowDtype(pa.int32()))
v2_master_table_df.TDXHydroRegion = v2_master_table_df.TDXHydroRegion.astype(np.int64)
v2_master_table_df.set_index('LINKNO', inplace=True)
v2_master_table_df.sort_index(inplace=True)
v2_master_table_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 6838900 entries, 110002637 to 820425136
Data columns (total 11 columns):
 #   Column                Dtype  
---  ------                -----  
 0   DSLINKNO              int32  
 1   strmOrder             int32  
 2   USContArea            float64
 3   DSContArea            float64
 4   TDXHydroRegion        int64  
 5   VPUCode               int32  
 6   TopologicalOrder      int32  
 7   LengthGeodesicMeters  float64
 8   TerminalLink          int32  
 9   musk_k                int32  
 10  musk_x                float64
dtypes: float64(4), int32(6), int64(1)
memory usage: 443.5 MB


In [114]:
v2_master_table_df.to_parquet(
    data_dir / 'geoglows-v2/v2-master-table_df.parquet',
    compression='brotli',
)

In [116]:
v2_master_table_df = pd.read_parquet(
    data_dir / 'geoglows-v2/v2-master-table_df.parquet', 
    # dtype_backend='pyarrow'
)

## Package Metadata

In [49]:
s3.ls('geoglows-v2/tables')

['geoglows-v2/tables/package-metadata-table.parquet',
 'geoglows-v2/tables/v2-countries-table.parquet',
 'geoglows-v2/tables/v2-model-table.parquet']

In [50]:
package_metadata_rfp = 'geoglows-v2/tables/package-metadata-table.parquet'

In [53]:
package_metadata_df = pd.read_parquet(
    data_dir / package_metadata_rfp, 
    # dtype_backend='pyarrow', # arrow seems to slow down reading & increase memeory
)
package_metadata_df.info()
package_metadata_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6838900 entries, 0 to 6838899
Data columns (total 4 columns):
 #   Column   Dtype  
---  ------   -----  
 0   LINKNO   int64  
 1   VPUCode  int64  
 2   lat      float64
 3   lon      float64
dtypes: float64(2), int64(2)
memory usage: 208.7 MB


Unnamed: 0,LINKNO,VPUCode,lat,lon
0,110007873,102,30.103444,32.364444
1,110010498,102,30.103222,32.385444
2,110005251,102,30.093111,32.284333
3,110018371,102,30.097778,32.224667
4,110009187,102,30.067556,32.388000
...,...,...,...,...
6838895,420783553,404,37.640111,118.594556
6838896,420750784,404,37.733222,118.703222
6838897,420790716,404,37.755111,118.795778
6838898,420774331,404,37.785000,118.923222


### Counties

In [54]:
df = pd.read_parquet(
    data_dir / 'geoglows-v2/tables/v2-countries-table.parquet', 
    # dtype_backend='pyarrow', # arrow seems to slow down reading & increase memeory
)
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6838900 entries, 0 to 6838899
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   LINKNO         int64  
 1   lon            float64
 2   lat            float64
 3   RiverCountry   object 
 4   OutletCountry  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 260.9+ MB


Unnamed: 0,LINKNO,lon,lat,RiverCountry,OutletCountry
0,110005251,32.284333,30.093111,Egypt,Egypt
1,110007873,32.364444,30.103444,Egypt,Egypt
2,110009186,32.385444,30.103222,Egypt,Egypt
3,110010498,32.385444,30.103222,Egypt,Egypt
4,110011810,32.436000,30.112889,Egypt,Egypt
...,...,...,...,...,...
6838895,820012086,-117.547634,52.288801,Canada,Canada
6838896,820008054,-117.524397,52.262690,Canada,Canada
6838897,820013433,-117.524397,52.262690,Canada,Canada
6838898,820014775,-117.461267,52.259801,Canada,Canada


### Model Table

In [55]:
df = pd.read_parquet(
    data_dir / 'geoglows-v2/tables/v2-model-table.parquet', 
    # dtype_backend='pyarrow', # arrow seems to slow down reading & increase memeory
)
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6838900 entries, 0 to 6838899
Data columns (total 12 columns):
 #   Column                Dtype  
---  ------                -----  
 0   LINKNO                int64  
 1   DSLINKNO              int64  
 2   strmOrder             int64  
 3   USContArea            float64
 4   DSContArea            float64
 5   TDXHydroRegion        object 
 6   VPUCode               int64  
 7   TopologicalOrder      int64  
 8   LengthGeodesicMeters  float64
 9   TerminalLink          int64  
 10  musk_k                int64  
 11  musk_x                float64
dtypes: float64(4), int64(7), object(1)
memory usage: 626.1+ MB


Unnamed: 0,LINKNO,DSLINKNO,strmOrder,USContArea,DSContArea,TDXHydroRegion,VPUCode,TopologicalOrder,LengthGeodesicMeters,TerminalLink,musk_k,musk_x
0,110007873,110009186,2,1.083060e+07,4.209271e+07,1020000010,102,177181,13384.058990,110013122,53536,0.25
1,110010498,110011810,2,1.096513e+07,2.296918e+07,1020000010,102,177182,2201.649965,110013122,8807,0.25
2,110005251,110019682,2,1.111977e+07,1.703163e+07,1020000010,102,177183,1972.054239,110030180,7888,0.25
3,110018371,110014434,2,1.076590e+07,3.751075e+07,1020000010,102,177184,5305.002476,110030180,21220,0.25
4,110009187,110011811,2,1.321197e+07,2.302526e+07,1020000010,102,177185,3587.784457,110030180,14351,0.25
...,...,...,...,...,...,...,...,...,...,...,...,...
6838895,420783553,420750784,9,8.182978e+11,8.183078e+11,4020006940,404,550688,12581.201310,420711866,13356,0.25
6838896,420750784,420790716,9,8.183141e+11,8.183305e+11,4020006940,404,550689,16356.933960,420711866,17364,0.25
6838897,420790716,420774331,9,8.183527e+11,8.183661e+11,4020006940,404,550690,10981.256921,420711866,11657,0.25
6838898,420774331,420711866,9,8.183722e+11,8.183836e+11,4020006940,404,550691,13729.151664,420711866,14574,0.25


## Stream Bluelines

### Living Atlas Optimized Streams

Provided as an ESRI web service for fast visualization.

See https://data.geoglows.org/tutorials/web-maps

#### Use pyogrio to read Geodatabase

https://pyogrio.readthedocs.io

https://pyogrio.readthedocs.io/en/stable/api.html#pyogrio.read_dataframe


In [40]:
atlas_fp = 'geoglows-v2/streams-global/geoglows-v2-map-optimized.gdb.zip'
# Get info from remote
s3.info(atlas_fp)

{'ETag': '"be0221e378820faa90f3e688e47653f4-237"',
 'LastModified': datetime.datetime(2024, 2, 7, 21, 44, 45, tzinfo=tzutc()),
 'size': 1979879683,
 'name': 'geoglows-v2/streams-global/geoglows-v2-map-optimized.gdb.zip',
 'type': 'file',
 'StorageClass': 'INTELLIGENT_TIERING',
 'VersionId': None,
 'ContentType': 'application/zip'}

In [None]:
# Check if we have it, else download
filepath = atlas_fp # This file takes 6 min to download
local_filepath = data_dir / filepath
if local_filepath.exists():
    print('We have it!')
else:
    # Get the remote file and save to local directory, returns None
    s3.get(filepath, data_dir)
    print('Dowloaded!')
print(local_filepath, local_filepath.exists())

In [41]:
# Open file locally
local_atlas_fp = data_dir / atlas_fp
local_atlas_fp.exists()

True

In [42]:
pyogrio.list_layers(local_fp)

array([['geoglowsv2', 'MultiLineString']], dtype=object)

In [43]:
pyogrio.read_info(local_fp)

{'crs': 'EPSG:3857',
 'encoding': 'UTF-8',
 'fields': array(['Shape_Length', 'LINKNO', 'DSLINKNO', 'strmOrder', 'DSContArea',
        'TDXHydroRegion', 'VPUCode', 'TopologicalOrder',
        'LengthGeodesicMeters', 'TerminalLink', 'musk_k', 'musk_x'],
       dtype=object),
 'dtypes': array(['float64', 'int32', 'int32', 'int32', 'float64', 'object', 'int32',
        'int32', 'float64', 'int32', 'int32', 'float64'], dtype=object),
 'geometry_type': 'MultiLineString',
 'features': 6838900,
 'total_bounds': (-19101586.0, -7437400.0, 19869419.0, 15751070.0),
 'driver': 'OpenFileGDB',
 'capabilities': {'random_read': True,
  'fast_set_next_by_index': True,
  'fast_spatial_filter': True,
  'fast_feature_count': True,
  'fast_total_bounds': True},
 'layer_metadata': None,
 'dataset_metadata': None}

In [57]:
# Have we already converted to Parquet?
local_atlas_parquet_fp = data_dir / 'geoglows-v2/streams-global/geoglows_v2_map_optimized_gdf.parquet'
local_atlas_parquet_fp.exists()

True

In [45]:
%%time
# Similar as opening the file as a dataframe.
# by default, reads from the first layer
# for these files, layer 1 can't be opened
if not local_atlas_parquet_fp.exists():
    # Reading this file takes ~15 minutes
    geoglows_v2_map_optimized_gdf = pyogrio.read_dataframe(
        local_fp, 
        layer=0,
        # fid_as_index=True,
        use_arrow=True,  # Speed read!
        arrow_to_pandas_kwargs={
            # 'dtype_backend':'pyarrow'
        },
    )
geoglows_v2_map_optimized_gdf.info()

CPU times: user 3min 36s, sys: 5min 41s, total: 9min 18s
Wall time: 14min 50s


In [54]:
# Clean up to save
geoglows_v2_map_optimized_gdf.TDXHydroRegion = geoglows_v2_map_optimized_gdf.TDXHydroRegion.astype('int')
geoglows_v2_map_optimized_gdf.set_index('LINKNO', inplace=True)
geoglows_v2_map_optimized_gdf.sort_index(inplace=True)
geoglows_v2_map_optimized_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 6838900 entries, 110002637 to 820425136
Data columns (total 12 columns):
 #   Column                Dtype   
---  ------                -----   
 0   Shape_Length          float64 
 1   DSLINKNO              int32   
 2   strmOrder             int32   
 3   DSContArea            float64 
 4   TDXHydroRegion        int64   
 5   VPUCode               int32   
 6   TopologicalOrder      int32   
 7   LengthGeodesicMeters  float64 
 8   TerminalLink          int32   
 9   musk_k                int32   
 10  musk_x                float64 
 11  geometry              geometry
dtypes: float64(4), geometry(1), int32(6), int64(1)
memory usage: 495.7 MB


In [56]:
# This takes 110 minutes and triples the size of the file!

# if not local_atlas_parquet_fp.exists():
#     # Saving this file takes ~110 minutes!
#     geoglows_v2_map_optimized_gdf.to_parquet(
#         local_atlas_parquet_fp,
#         compression='brotli',
#     )

In [148]:
# This takes longer to read than the GeoDataBase file
# test_gpd = gpd.read_parquet(local_atlas_parquet_fp)

## Basin Boundaries

### VPU Boundaries

See map (fig 3) at https://data.geoglows.org/tutorials/finding-river-numbers

NOTE: `pyogrio.read_dataframe()` seems 30% faster than `gpd.read_file()`, even when `use_arrow=False`

In [11]:
s3.ls('geoglows-v2/streams-global')

['geoglows-v2/streams-global/',
 'geoglows-v2/streams-global/geoglows-v2-map-optimized.gdb.zip',
 'geoglows-v2/streams-global/global_streams_simplified.gpkg',
 'geoglows-v2/streams-global/vpu-boundaries.gpkg']

In [18]:
vpu_boundaries_fp = 'geoglows-v2/streams-global/vpu-boundaries.gpkg'
s3.info(vpu_boundaries_fp)

{'Key': 'geoglows-v2/streams-global/vpu-boundaries.gpkg',
 'LastModified': datetime.datetime(2023, 11, 5, 3, 17, 33, tzinfo=tzutc()),
 'ETag': '"8eb0c676d48f7572a8ff20a7cfcb9e9b-109"',
 'Size': 1872367616,
 'StorageClass': 'INTELLIGENT_TIERING',
 'type': 'file',
 'size': 1872367616,
 'name': 'geoglows-v2/streams-global/vpu-boundaries.gpkg'}

In [19]:
# Check if we have it, else download
filepath = vpu_boundaries_fp
# This file takes 6 min to download
local_filepath = data_dir / filepath
if local_filepath.exists():
    print('We have it!')
else:
    # Get the remote file and save to local directory, returns None
    s3.get(filepath, data_dir)
    print('Dowloaded!')
print(local_filepath, local_filepath.exists())

We have it!
/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/geoglows-v2/streams-global/vpu-boundaries.gpkg True


In [20]:
pyogrio.read_info(data_dir / vpu_boundaries_fp)

{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['VPU'], dtype=object),
 'dtypes': array(['object'], dtype=object),
 'geometry_type': 'MultiPolygon',
 'features': 125,
 'total_bounds': (-171.831410742829,
  -55.3598044400256,
  178.490083230478,
  80.3259426819447),
 'driver': 'GPKG',
 'capabilities': {'random_read': True,
  'fast_set_next_by_index': True,
  'fast_spatial_filter': True,
  'fast_feature_count': True,
  'fast_total_bounds': True},
 'layer_metadata': {'GPKG_METADATA_ITEM_1': '<!DOCTYPE qgis PUBLIC \'http://mrcc.com/qgis.dtd\' \'SYSTEM\'>\n<qgis version="3.32.3-Lima">\n  <identifier></identifier>\n  <parentidentifier></parentidentifier>\n  <language></language>\n  <type>dataset</type>\n  <title></title>\n  <abstract></abstract>\n  <links/>\n  <dates/>\n  <fees></fees>\n  <encoding></encoding>\n  <crs>\n    <spatialrefsys nativeFormat="Wkt">\n      <wkt>GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble",MEMBER["World Geodetic System 1984 (Transit)"]

In [21]:
# Read GeoPackage to GeoDataframe
vpu_boundaries_gdf = pyogrio.read_dataframe(
    data_dir / vpu_boundaries_fp, 
    layer=0,
    use_arrow=True, # 50% faster, but doesn't seem to work with s3
    arrow_to_pandas_kwargs={
        # 'dtype_backend':'pyarrow',
    },
)
vpu_boundaries_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   VPU       125 non-null    object  
 1   geometry  125 non-null    geometry
dtypes: geometry(1), object(1)
memory usage: 2.1+ KB


In [36]:
# rename to align with other dataframes
vpu_boundaries_gdf.rename(
    columns={'VPU':'VPUCode'}, 
    inplace=True,
)
# Set dtype to align with GDB file, and for efficient storage
vpu_boundaries_gdf.VPUCode = vpu_boundaries_gdf.VPUCode.astype(np.int32)
# Set index
vpu_boundaries_gdf.set_index('VPUCode', inplace=True)
vpu_boundaries_gdf.sort_index(inplace=True)
vpu_boundaries_gdf.info()


<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 125 entries, 101 to 804
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   geometry  125 non-null    geometry
dtypes: geometry(1)
memory usage: 1.5 KB


In [37]:
local_vpu_boundaries_fp = (data_dir / vpu_boundaries_fp).with_suffix('.parquet')
local_vpu_boundaries_fp

PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/geoglows-v2/streams-global/vpu-boundaries.parquet')

In [38]:
# Save GeoDataframe
# Takes 3 minutes! But 5x less storage that GeoPakage
vpu_boundaries_gdf.to_parquet(
    local_vpu_boundaries_fp,
    compression='brotli',
)

In [39]:
test_gdf = gpd.read_parquet(local_vpu_boundaries_fp)

# NGA TDX-Hydro
National Geospatial-Intelligence Agency (NGA)
https://earth-info.nga.mil/

Download: **Africa**: 1020011530 `Drainage Basins`
- https://earth-info.nga.mil/php/download.php?file=1020011530-basins-gpkg
- https://earth-info.nga.mil/php/download.php?file=1020011530-streamnet-gpkg

In [44]:
tdx_dir = data_dir / 'nga'
local_fs.ls(tdx_dir)

['/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_1020011530_01.gpkg',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_7020038340_01.gpkg',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.gpkg-shm',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.gpkg-wal',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_1020011530_01.gpkg',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_1020040190_01.gpkg',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.gpkg',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_7020038340_01.gpkg-wal',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_7

In [57]:
tdx_basins_7020038340_fp = tdx_dir / 'TDX_streamreach_basins_7020038340_01.gpkg'
tdx_stream_7020038340_fp = tdx_dir / 'TDX_streamnet_7020038340_01.gpkg'

In [58]:
pyogrio.read_info(tdx_basins_7020038340_fp)

{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['streamID'], dtype=object),
 'dtypes': array(['int64'], dtype=object),
 'geometry_type': 'Unknown',
 'features': 140053,
 'total_bounds': (-89.8488333333333,
  24.5589444444433,
  -66.1410555555544,
  46.4619444444444),
 'driver': 'GPKG',
 'capabilities': {'random_read': True,
  'fast_set_next_by_index': True,
  'fast_spatial_filter': True,
  'fast_feature_count': True,
  'fast_total_bounds': True},
 'layer_metadata': None,
 'dataset_metadata': None}

In [59]:
pyogrio.read_info(tdx_stream_7020038340_fp)

{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['LINKNO', 'DSLINKNO', 'USLINKNO1', 'USLINKNO2', 'DSNODEID',
        'strmOrder', 'Length', 'Magnitude', 'DSContArea', 'strmDrop',
        'Slope', 'StraightL', 'USContArea', 'WSNO', 'DOUTEND', 'DOUTSTART',
        'DOUTMID'], dtype=object),
 'dtypes': array(['int32', 'int32', 'int32', 'int32', 'int64', 'int32', 'float64',
        'int32', 'float64', 'float64', 'float64', 'float64', 'float64',
        'int32', 'float64', 'float64', 'float64'], dtype=object),
 'geometry_type': 'LineString',
 'features': 140097,
 'total_bounds': (-89.8212222222222,
  24.5589999999989,
  -66.1413333333321,
  46.4454444444444),
 'driver': 'GPKG',
 'capabilities': {'random_read': True,
  'fast_set_next_by_index': True,
  'fast_spatial_filter': True,
  'fast_feature_count': True,
  'fast_total_bounds': True},
 'layer_metadata': {'DBF_DATE_LAST_UPDATE': '2021-12-08'},
 'dataset_metadata': None}

In [61]:
# Read GeoPackage to GeoDataframe
tdx_basins_7020038340_gdf = pyogrio.read_dataframe(
    tdx_basins_7020038340_fp, 
    layer=0,
    use_arrow=True, # 50% faster, but doesn't seem to work with s3
    arrow_to_pandas_kwargs={
        # 'dtype_backend':'pyarrow',
    },
)
tdx_basins_7020038340_gdf.info()
tdx_basins_7020038340_gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 140053 entries, 0 to 140052
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype   
---  ------    --------------   -----   
 0   streamID  140053 non-null  int64   
 1   geometry  140053 non-null  geometry
dtypes: geometry(1), int64(1)
memory usage: 2.1 MB


Unnamed: 0,streamID,geometry
0,1,"POLYGON ((-69.71706 46.42639, -69.71572 46.426..."
1,2,"POLYGON ((-69.71939 46.39428, -69.71928 46.394..."
2,3,"POLYGON ((-69.77483 46.30506, -69.77483 46.304..."
3,4,"POLYGON ((-69.70206 46.30194, -69.70183 46.301..."
4,5,"POLYGON ((-69.71272 46.28150, -69.71261 46.281..."
...,...,...
140048,325343,"POLYGON ((-80.63483 34.01172, -80.63439 34.011..."
140049,325935,"POLYGON ((-80.64750 34.00028, -80.64728 34.000..."
140050,326527,"POLYGON ((-77.93961 34.01417, -77.93917 34.014..."
140051,327119,"POLYGON ((-79.51194 33.99761, -79.51172 33.997..."


In [60]:
# Read GeoPackage to GeoDataframe
tdx_stream_7020038340_gdf = pyogrio.read_dataframe(
    tdx_stream_7020038340_fp, 
    layer=0,
    use_arrow=True, # 50% faster, but doesn't seem to work with s3
    arrow_to_pandas_kwargs={
        # 'dtype_backend':'pyarrow',
    },
)
tdx_stream_7020038340_gdf.info()
tdx_stream_7020038340_gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 140097 entries, 0 to 140096
Data columns (total 18 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   LINKNO      140097 non-null  int32   
 1   DSLINKNO    140097 non-null  int32   
 2   USLINKNO1   140097 non-null  int32   
 3   USLINKNO2   140097 non-null  int32   
 4   DSNODEID    140097 non-null  int64   
 5   strmOrder   140097 non-null  int32   
 6   Length      140097 non-null  float64 
 7   Magnitude   140097 non-null  int32   
 8   DSContArea  140097 non-null  float64 
 9   strmDrop    140097 non-null  float64 
 10  Slope       140097 non-null  float64 
 11  StraightL   140097 non-null  float64 
 12  USContArea  140097 non-null  float64 
 13  WSNO        140097 non-null  int32   
 14  DOUTEND     140097 non-null  float64 
 15  DOUTSTART   140097 non-null  float64 
 16  DOUTMID     140097 non-null  float64 
 17  geometry    140097 non-null  geometry
dtypes: float64(9), g

Unnamed: 0,LINKNO,DSLINKNO,USLINKNO1,USLINKNO2,DSNODEID,strmOrder,Length,Magnitude,DSContArea,strmDrop,Slope,StraightL,USContArea,WSNO,DOUTEND,DOUTSTART,DOUTMID,geometry
0,0,1777,-1,-1,-1,1,3847.9,1,9567845.0,42.07,0.010933,3233.7,5254867.5,0,45853.6,49701.4,47777.5,"LINESTRING (-69.67822 46.41356, -69.67822 46.4..."
1,1,2369,-1,-1,-1,1,2251.3,1,8768556.0,34.66,0.015397,1749.2,4320561.0,1,44802.7,47054.1,45928.4,"LINESTRING (-69.68589 46.40778, -69.68600 46.4..."
2,593,1777,-1,-1,-1,1,1469.3,1,8466694.0,11.98,0.008153,1286.2,4319318.0,593,45853.6,47322.9,46588.3,"LINESTRING (-69.67822 46.41356, -69.67811 46.4..."
3,1777,2369,0,593,-1,2,1050.9,2,19939082.0,0.91,0.000870,871.8,18034788.0,1777,44802.7,45853.6,45328.2,"LINESTRING (-69.68589 46.40778, -69.68589 46.4..."
4,2,4146,-1,-1,-1,1,3551.0,1,9120895.0,67.48,0.019002,2593.6,5267176.0,2,41041.1,44591.7,42816.4,"LINESTRING (-69.68700 46.37911, -69.68700 46.3..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140092,587,-1,-1,-1,-1,1,2354.1,1,10233235.0,0.00,0.000000,1721.9,7569312.0,587,0.0,2354.1,1177.1,"LINESTRING (-81.59922 24.64033, -81.59911 24.6..."
140093,1180,-1,-1,-1,-1,1,1326.7,1,9136435.0,0.00,0.000000,1072.3,4495984.5,1180,0.0,1326.7,663.4,"LINESTRING (-81.63022 24.61767, -81.63011 24.6..."
140094,1772,-1,-1,-1,-1,1,1000.1,1,4879280.0,0.00,0.000000,738.8,4387448.5,1772,0.0,1000.1,500.0,"LINESTRING (-81.60144 24.58478, -81.60156 24.5..."
140095,588,-1,-1,-1,-1,1,2044.7,1,5911555.0,0.76,0.000370,1396.2,4346421.0,588,0.0,2044.7,1022.4,"LINESTRING (-81.64478 24.57489, -81.64489 24.5..."


In [63]:
tdx_stream_7020038340_gdf[tdx_stream_7020038340_gdf.WSNO==588]

Unnamed: 0,LINKNO,DSLINKNO,USLINKNO1,USLINKNO2,DSNODEID,strmOrder,Length,Magnitude,DSContArea,strmDrop,Slope,StraightL,USContArea,WSNO,DOUTEND,DOUTSTART,DOUTMID,geometry
140095,588,-1,-1,-1,-1,1,2044.7,1,5911555.0,0.76,0.00037,1396.2,4346421.0,588,0.0,2044.7,1022.4,"LINESTRING (-81.64478 24.57489, -81.64489 24.5..."


In [64]:
10_000_000

10000000

# USGS NHDplus v2

## DRWI Stream Reaches

From Model My Watershed Pollution Assessment.
https://github.com/WikiWatershed/pollution-assessment


In [97]:
# Local directory of cloned repo
drwi_pa_dir = Path('/Users/aaufdenkampe/Documents/Python/pollution-assessment')

In [98]:
nhd_drwi_gdf = gpd.read_parquet(
    drwi_pa_dir / 'geography/reach_gdf.parquet',
)
nhd_drwi_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 19496 entries, 1748535 to 932040370
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   catchment_hectares  19496 non-null  float64 
 1   watershed_hectares  19496 non-null  float64 
 2   maflowv             19496 non-null  float64 
 3   geometry            19494 non-null  geometry
 4   cluster             17358 non-null  category
 5   sub_focusarea       186 non-null    Int64   
 6   nord                18870 non-null  Int64   
 7   nordstop            18844 non-null  Int64   
 8   huc12               19496 non-null  category
 9   streamorder         19496 non-null  int64   
 10  headwater           19496 non-null  int64   
 11  phase               4082 non-null   category
 12  fa_name             4082 non-null   category
 13  in_drb              19496 non-null  boolean 
 14  huc08               19496 non-null  category
 15  huc10               194

In [99]:
# Write to geopackage for use with QGIS on Mac
filename = 'nhd_drwi.gpkg'
nhd_drwi_gdf.to_file(
    filename,  # only takes file name, saving to CWD
    driver='GPKG',
    engine='pyogrio',
    layer='reaches',
)

In [100]:
# Use pathlib to move file by renaming path
# Create folder if necessary
nhd_dir = data_dir / 'nhdplus2'
if not nhd_dir.exists():
    nhd_dir.mkdir()
# rename filepath to move
(Path.cwd() / filename).rename(nhd_dir / filename )

PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nhdplus2/nhd_drwi.gpkg')

# End