# Get Global Hydrography from NGA TDX-Hydro 

This notebook demonstrates how to use functions in the [WikiWatershed/global-hydrography](https://github.com/WikiWatershed/global-hydrography) package to fetch data files from the TDX-Hydro datasets released by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil).

It uses processes that were explored in these notebooks:
- `sandbox/explore_data_sources.ipynb`
- `sandbox/reading_files.ipynb`

# Installation and Setup

Carefully follow our **[Installation Instructions](README.md#get-started)**, paying particular attention to:
- Creating a virtual environment for this repository (step 3)

## Python Imports
Using common conventions and following the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html): 
- https://google.github.io/styleguide/pyguide.html#s2.2-imports

In [34]:
import os
from pathlib import Path
from importlib import reload

import fsspec
# import s3fs
# import numpy as np
import pandas as pd
import geopandas as gpd
import pyogrio
import pyarrow as pa

In [2]:
# Confirm conda environment
os.environ['CONDA_DEFAULT_ENV']

'hydrography'

In [3]:
# Custom functions for Global Hydrography
import global_hydrography as gh

### If you get `ModuleNotFoundError`:

Then follow Installation instructions Step **4. Add your `global_hydrography` Path to Miniconda/Anaconda sites-packages** in the main ReadMe, running the following in your console, replacing `/your/path/to/global_hydrography/src` with your specific path.

```console
conda develop '/your/path/to/global_hydrography/src'
```

Then restart the kernel and rerun the imports above.


In [4]:
! conda develop '/Users/aaufdenkampe/Documents/Python/global-hydrography/src'

path exists, skipping /Users/aaufdenkampe/Documents/Python/global-hydrography/src
completed operation for: /Users/aaufdenkampe/Documents/Python/global-hydrography/src


In [5]:
# Explore the namespace for global-hydrography modules, functions, etc.
dir(gh)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'delineation',
 'io',
 'mnsi',
 'preprocess',
 'process']

## Set Paths for Data Inputs/Outputs
Use the [`pathlib`](https://docs.python.org/3/library/pathlib.html) library to manage paths. Its many benefits over the `os` library or string-based approaches are described in [this blog post](https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f).
- [pathlib](https://docs.python.org/3/library/pathlib.html) user guide: https://realpython.com/python-pathlib/
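
As a minimal sketch of the `pathlib` features this notebook relies on (the `/` operator for joining, plus `name`, `stem`, `suffix` and `exists()`), the example below builds a path without touching the disk. The directory layout is a hypothetical placeholder, not necessarily your own.

```python
from pathlib import Path

# Hypothetical relative layout, used only for illustration.
project_dir = Path('global-hydrography')
data_file = project_dir / 'data_temp' / 'nga' / 'hydrobasins_level2.geojson'

print(data_file.name)         # file name with extension
print(data_file.stem)         # file name without extension
print(data_file.suffix)       # extension, including the leading dot
print(data_file.parent.name)  # name of the containing directory

# `exists` is a method. Without parentheses it is a bound-method object,
# which is always truthy, so it must be called as `exists()`.
print(data_file.exists())
```

Because `exists` without parentheses never evaluates to `False`, a check such as `if not some_path.exists:` silently skips its body. It is worth double-checking for that pattern in download guards.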

In [59]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
# make a temporary data directory that we .gitignore
data_dir = project_dir / 'data_temp'
data_dir.mkdir(parents=True, exist_ok=True) # Required if it doesn't exist

## Create local file system using `fsspec` library

We'll use the Filesystem Spec ([`fsspec`](https://filesystem-spec.readthedocs.io)) library and its extensions throughout this project to provide a unified pythonic interface to local, remote and embedded file systems and bytes storage.

In [7]:
# Create local file system using fsspec library
# local_fs = fsspec.implementations.local.LocalFileSystem()
local_fs = fsspec.filesystem('local') 

In [8]:
# List files in our temporary data directory
local_fs.ls(data_dir)

['/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/global_hydrography.qgz',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/geoglows-v2',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/.DS_Store',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_downcast_gdf.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_gdf.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_pa_gdf.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_ga_pa_df.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_gpd_gdf.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/test_pa_geo_df.parquet',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/io_10m_annual_lulc',
 '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nhdplus2',
 '/Users/aaufdenkampe/Docume

In [9]:
# List file details (equivalent to file info)
local_data_list = local_fs.ls(data_dir, detail=True)
# Show first item's details
local_data_list[0]

{'name': '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/global_hydrography.qgz',
 'size': 136499,
 'type': 'file',
 'created': 1721235118.8378024,
 'islink': False,
 'mode': 33188,
 'uid': 502,
 'gid': 20,
 'mtime': 1721235118.8372164,
 'ino': 301838904,
 'nlink': 1}

# NGA TDX-Hydro

Data are downloadable from the National Geospatial-Intelligence Agency (NGA) Office for Geomatics website, https://earth-info.nga.mil/, under the "Geosciences" tab.

The [TDX-Hydro Technical Document](https://earth-info.nga.mil/php/download.php?file=tdx-hydro-technical-doc) provides detailed information on how the datasets were developed and validated.

In [60]:
# Create local data directory
tdx_dir = data_dir / 'nga'
tdx_dir.mkdir(parents=True, exist_ok=True)
tdx_dir

PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga')

In [61]:
local_fs.exists(tdx_dir)

True

## Get TDX-Hydro Metadata

TDX-Hydro datasets are organized into 62 continental sub-units using the same 10-digit Level 2 codes (HYBAS_ID) developed by [HydroSHEDS v1 HydroBASINS](https://www.hydrosheds.org/products/hydrobasins). More information on the semantics of these codes is provided in the [HydroBASINS Technical Documentation](https://data.hydrosheds.org/file/technical-documentation/HydroBASINS_TechDoc_v1c.pdf).

NGA’s TDX-Hydro v1 files are organized by TDXHydroRegion:
- TDXHydroRegion values are the HYBAS_ID Level 2 codes, of which there are 62 globally.
- The region boundaries can be downloaded from https://earth-info.nga.mil/php/download.php?file=hydrobasins_level2
- A crosswalk from TDXHydroRegion to PFAF_ID values is provided in the HydroBASINS Level 2 files (see above), such as https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev02_v1c.zip .

In [12]:
# Filenames and local paths
tdx_hydroregions_filename = Path('hydrobasins_level2.geojson')
tdx_hydroregions_filepath = tdx_dir / tdx_hydroregions_filename

In [13]:
tdx_root_url = 'https://earth-info.nga.mil/php/download.php'

# Download URL for Basin GeoJSON File with ID Numbers
tdx_hydroregions_url = f'{tdx_root_url}?file={tdx_hydroregions_filename.stem}'
tdx_hydroregions_url

'https://earth-info.nga.mil/php/download.php?file=hydrobasins_level2'
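
The URL pattern above can also be wrapped in a small helper, sketched here with the standard library's `urlencode` so that any file key containing special characters is escaped. The `tdx_download_url` function and the `TDX_ROOT_URL` constant are hypothetical names for illustration and are not part of the `global_hydrography` package.

```python
from urllib.parse import urlencode

# Root of the NGA download endpoint, as used in the cell above.
TDX_ROOT_URL = 'https://earth-info.nga.mil/php/download.php'


def tdx_download_url(file_key: str) -> str:
    """Return the download URL for a file key (hypothetical helper)."""
    # urlencode builds the `file=<key>` query string and escapes it if needed.
    return f'{TDX_ROOT_URL}?{urlencode({"file": file_key})}'


url = tdx_download_url('hydrobasins_level2')
print(url)
```

For the simple keys used in this notebook, the result is identical to the f-string version.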

In [14]:
# Set up an HTTP filesystem for TDX-Hydro. Unfortunately the server does not
# expose browsable directories, so files need to be accessed one at a time.
tdx_fs = fsspec.filesystem(protocol='http')

In [15]:
# Get info on the file, which should only take a few seconds
tdx_fs.info(tdx_hydroregions_url)

{'name': 'https://earth-info.nga.mil/php/download.php?file=hydrobasins_level2',
 'size': 95389402,
 'mimetype': 'application/octet-stream',
 'url': 'https://earth-info.nga.mil/php/download.php?file=hydrobasins_level2',
 'type': 'file'}

In [16]:
%%time
# Get the remote file and save to local directory, returns None
if not tdx_hydroregions_filepath.exists():
    tdx_fs.get(tdx_hydroregions_url, str(tdx_hydroregions_filepath))
else:
    print('We have it!')

We have it!
CPU times: user 196 µs, sys: 81 µs, total: 277 µs
Wall time: 260 µs


In [17]:
# Confirm info of local file matches remote file
local_fs.info(tdx_hydroregions_filepath)

{'name': '/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/hydrobasins_level2.geojson',
 'size': 95389402,
 'type': 'file',
 'created': 1721407912.568478,
 'islink': False,
 'mode': 33188,
 'uid': 502,
 'gid': 20,
 'mtime': 1715117902.7792468,
 'ino': 267369098,
 'nlink': 1}
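
Comparing the two `size` values by eye works for a single file. As a sketch of how the same check could be automated, the hypothetical helper below compares a local file's size against an expected byte count. In this notebook the expected value would come from `tdx_fs.info(tdx_hydroregions_url)['size']`, assuming the server reports a size, which not every HTTP server does.

```python
import tempfile
from pathlib import Path


def is_complete_download(local_path: Path, expected_size: int) -> bool:
    """True if local_path is a file whose size equals expected_size in bytes.

    Hypothetical helper: a size match does not prove the contents are
    identical, it only catches missing or truncated files.
    """
    return local_path.is_file() and local_path.stat().st_size == expected_size


# Demonstrate with a throwaway file instead of the real 95 MB download.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo.bin'
    demo.write_bytes(b'x' * 1024)
    ok = is_complete_download(demo, 1024)         # sizes agree
    truncated = is_complete_download(demo, 2048)  # local file too small
    missing = is_complete_download(Path(tmp) / 'absent.bin', 1024)

print(ok, truncated, missing)
```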

In [18]:
tdx_hydroregions_gdf = gpd.read_file(tdx_hydroregions_filepath)
tdx_hydroregions_gdf.info()
tdx_hydroregions_gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   HYBAS_ID  62 non-null     int64   
 1   SUB_AREA  62 non-null     float64 
 2   geometry  62 non-null     geometry
dtypes: float64(1), geometry(1), int64(1)
memory usage: 1.6 KB


| | HYBAS_ID | SUB_AREA | geometry |
|---|---|---|---|
| 0 | 1020000010 | 3258330.6 | MULTIPOLYGON (((38.1995 18.24379, 38.19861 18.... |
| 1 | 1020011530 | 4660080.9 | MULTIPOLYGON (((19.42136 -34.68525, 19.4203 -3... |
| 2 | 1020018110 | 4900405.1 | MULTIPOLYGON (((9.15742 -2.07022, 9.16221 -2.0... |
| 3 | 1020021940 | 4046600.5 | MULTIPOLYGON (((-16.1354 11.25485, -16.12926 1... |
| 4 | 1020027430 | 6923559.6 | MULTIPOLYGON (((-16.48427 19.64912, -16.48272 ... |
| ... | ... | ... | ... |
| 57 | 3020003790 | 2620366.5 | MULTIPOLYGON (((76.9259 72.15787, 76.925 72.16... |
| 58 | 3020005240 | 1173702.8 | MULTIPOLYGON (((84.71574 73.8423, 84.69252 73.... |
| 59 | 3020008670 | 2487156.5 | MULTIPOLYGON (((125.90113 73.46736, 125.98238 ... |
| 60 | 3020009320 | 2722031.4 | MULTIPOLYGON (((138.39167 56.69923, 138.39028 ... |


### Get TDXHydro Regions from HydroBASINS

This can give us extra metadata and names for guiding the ModelMW user.

The main purpose of fetching HydroBASINS v1c data is to understand the spatial organization of TDX-Hydro datafiles, which are organized around HydroBASINS Level 2 Continental Subunits.

Get and visualize HydroBASINS Level 2 Continental Subunits.
- https://www.hydrosheds.org/products/hydrobasins

Data are downloadable by continent at different levels. For example, `Africa Level 02 - Standard (2MB)` is downloaded via:
- https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev02_v1c.zip

Refer to [HydroBASINS Technical Documentation](https://data.hydrosheds.org/file/technical-documentation/HydroBASINS_TechDoc_v1c.pdf) for attribute descriptions and coding and naming conventions.

In [19]:
# HydroBASINS Identifier & Region (Section 3.1 in Tech Docs)
# 2-character identifier used for file naming
hybas_region_dict = {
    'af': 'Africa',
    'ar': 'North American Arctic',
    'as': 'Central and South-East Asia',
    'au': 'Australia and Oceania',
    'eu': 'Europe and Middle East',
    'gr': 'Greenland',
    'na': 'North America and Caribbean',
    'sa': 'South America',
    'si': 'Siberia',
}

In [20]:
# HydroBASINS Region Numbers (Section 3.1 in Tech Docs)
# Single digit prefix for HYBAS_ID values at all levels
hybas_region_number_dict = {
    1: 'Africa',
    2: 'Europe and Middle East',
    3: 'Siberia',
    4: 'Central and South-East Asia',
    5: 'Australia and Oceania',
    6: 'South America',
    7: 'North America and Caribbean',
    8: 'North American Arctic',
    9: 'Greenland',
}
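
As a sketch of how the region prefix can be used, the hypothetical `decode_hybas_id` helper below splits a 10-digit HYBAS_ID into its region number and level. It assumes that digits 2 and 3 hold the Pfafstetter level, which is my reading of the HydroBASINS Technical Documentation and is consistent with the Level 2 codes shown earlier (for example `1020000010`). Please verify against the Tech Docs before relying on it.

```python
# Region numbers copied from the dictionary above.
HYBAS_REGION_NUMBERS = {
    1: 'Africa',
    2: 'Europe and Middle East',
    3: 'Siberia',
    4: 'Central and South-East Asia',
    5: 'Australia and Oceania',
    6: 'South America',
    7: 'North America and Caribbean',
    8: 'North American Arctic',
    9: 'Greenland',
}


def decode_hybas_id(hybas_id: int) -> dict:
    """Split a HYBAS_ID into region and level (hypothetical helper)."""
    digits = str(hybas_id)
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f'Expected a 10-digit HYBAS_ID, got {hybas_id!r}')
    region_number = int(digits[0])
    return {
        'region_number': region_number,
        'region_name': HYBAS_REGION_NUMBERS[region_number],
        # Assumption: digits 2-3 encode the Pfafstetter level.
        'level': int(digits[1:3]),
    }


decoded = decode_hybas_id(1020000010)
print(decoded)
```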

In [21]:
# Construct URL patterns
hybas_root_url = 'https://data.hydrosheds.org/file/HydroBASINS'
hybas_format = 'standard'
hybas_url = f'{hybas_root_url}/{hybas_format}'

In [62]:
# Create local directory path
hybas_dir = data_dir / 'hydrobasins'
hybas_dir.mkdir(parents=True, exist_ok=True)

In [28]:
# Create remote filesystem
# HydroBASINS files need to be accessed one file at a time
hybas_fs = fsspec.filesystem(protocol='http')

In [27]:
def get_hybas_files(
    remote_filesystem: fsspec.AbstractFileSystem,
    local_dir: Path,
    hybas_region_dict: dict,
    level: str = '02',
) -> None:
    """Download HydroBASINS zip files for each region, skipping existing files.

    Relies on the `hybas_url` variable defined in a cell above.
    """
    for region in hybas_region_dict.keys():
        hybas_filename = f'hybas_{region}_lev{level}_v1c.zip'
        hybas_filepath = f'{hybas_url}/{hybas_filename}'
        if (local_dir / hybas_filename).exists():
            print(hybas_filename, 'We have it!')
        else:
            # Get the remote file and save to local directory, returns None
            # Use the filesystem passed in, not the global `hybas_fs`
            remote_filesystem.get(hybas_filepath, local_dir)
            print(hybas_filename, 'Downloaded!')

In [29]:
# Get files
get_hybas_files(
    hybas_fs,
    hybas_dir,
    hybas_region_dict,
)

hybas_af_lev02_v1c.zip We have it!
hybas_ar_lev02_v1c.zip We have it!
hybas_as_lev02_v1c.zip We have it!
hybas_au_lev02_v1c.zip We have it!
hybas_eu_lev02_v1c.zip We have it!
hybas_gr_lev02_v1c.zip We have it!
hybas_na_lev02_v1c.zip We have it!
hybas_sa_lev02_v1c.zip We have it!
hybas_si_lev02_v1c.zip We have it!
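
The download-only-if-missing pattern used above can be sketched in a library-independent form. In the example below, `fetch_if_missing` is a hypothetical helper that takes the download function as an argument. In this notebook that argument could be the `get` method of an `fsspec` filesystem such as `hybas_fs.get`, while here a stand-in function writes a placeholder file so that the sketch runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_if_missing(
    url: str,
    local_path: Path,
    fetch: Callable[[str, str], None],
) -> bool:
    """Call fetch(url, local_path) only if local_path is absent.

    Returns True if a download was attempted, False if the file existed.
    """
    if local_path.exists():
        return False
    local_path.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(local_path))
    return True


calls = []


def fake_fetch(url: str, lpath: str) -> None:
    """Stand-in for a real download: records the call, writes a placeholder."""
    calls.append(url)
    Path(lpath).write_text('placeholder')


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'hydrobasins' / 'hybas_af_lev02_v1c.zip'
    url = 'https://data.hydrosheds.org/file/HydroBASINS/standard/hybas_af_lev02_v1c.zip'
    first = fetch_if_missing(url, target, fake_fetch)   # downloads
    second = fetch_if_missing(url, target, fake_fetch)  # skipped

print(first, second, len(calls))
```

Passing the download function in as a parameter keeps the guard testable and avoids a hidden dependency on a global filesystem object.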


### Open all Level 2 files and combine

In [30]:
# Get all continents (level 2).
level = '02'
for region in hybas_region_dict.keys():
    hybas_filename = f'hybas_{region}_lev{level}_v1c.zip'
    hybas_filepath = f'{hybas_url}/{hybas_filename}'
    hybas_fs.get(hybas_filepath, hybas_dir)

In [39]:
%%time
# Create list of GeoDataFrames of all continental regions
# Insert 'REGION_NAME' as a new column
level = '02'
gdf_list = []
for region_id, region_name in hybas_region_dict.items():
    print(region_id, region_name)
    hybas_filename = f'hybas_{region_id}_lev{level}_v1c.zip'
    print('  ', pyogrio.list_layers(hybas_dir/hybas_filename))
    gdf = gpd.read_file(
        hybas_dir/hybas_filename, 
        engine='pyogrio',
    )
    gdf.insert(1, 'REGION_NAME', region_name)
    print('  ', gdf.PFAF_ID.values)
    gdf_list.append(gdf)

af Africa
   [['hybas_af_lev02_v1c' 'Polygon']]
   [11 12 13 14 15 17 18 16]
ar North American Arctic
   [['hybas_ar_lev02_v1c' 'Polygon']]
   [81 82 83 35 84 85 86]
as Central and South-East Asia
   [['hybas_as_lev02_v1c' 'Polygon']]
   [42 43 44 45 41 48 46 47 49]
au Australia and Oceania
   [['hybas_au_lev02_v1c' 'Polygon']]
   [51 52 53 56 54 55 57]
eu Europe and Middle East
   [['hybas_eu_lev02_v1c' 'Polygon']]
   [21 22 23 24 25 26 27 28 29]
gr Greenland
   [['hybas_gr_lev02_v1c' 'Polygon']]
   [91]
na North America and Caribbean
   [['hybas_na_lev02_v1c' 'Polygon']]
   [77 78 71 72 73 74 75 76]
sa South America
   [['hybas_sa_lev02_v1c' 'Polygon']]
   [61 62 63 64 65 66 67]
si Siberia
   [['hybas_si_lev02_v1c' 'Polygon']]
   [31 32 33 34 35 36]
CPU times: user 615 ms, sys: 57.4 ms, total: 673 ms
Wall time: 830 ms


In [40]:
pd.concat(gdf_list, ignore_index=True).info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   HYBAS_ID     62 non-null     int64   
 1   REGION_NAME  62 non-null     object  
 2   NEXT_DOWN    62 non-null     int64   
 3   NEXT_SINK    62 non-null     int64   
 4   MAIN_BAS     62 non-null     int64   
 5   DIST_SINK    62 non-null     float64 
 6   DIST_MAIN    62 non-null     float64 
 7   SUB_AREA     62 non-null     float64 
 8   UP_AREA      62 non-null     float64 
 9   PFAF_ID      62 non-null     int32   
 10  ENDO         62 non-null     int32   
 11  COAST        62 non-null     int32   
 12  ORDER        62 non-null     int32   
 13  SORT         62 non-null     int64   
 14  geometry     62 non-null     geometry
dtypes: float64(4), geometry(1), int32(4), int64(5), object(1)
memory usage: 6.4+ KB


In [42]:
# Concatenate the GeoDataFrames
# https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#appending
hybas_all_lev02_gdf = pd.concat(gdf_list, ignore_index=True)
hybas_all_lev02_gdf.REGION_NAME = hybas_all_lev02_gdf.REGION_NAME.astype('category')
hybas_all_lev02_gdf.info()
hybas_all_lev02_gdf.head()


<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   HYBAS_ID     62 non-null     int64   
 1   REGION_NAME  62 non-null     category
 2   NEXT_DOWN    62 non-null     int64   
 3   NEXT_SINK    62 non-null     int64   
 4   MAIN_BAS     62 non-null     int64   
 5   DIST_SINK    62 non-null     float64 
 6   DIST_MAIN    62 non-null     float64 
 7   SUB_AREA     62 non-null     float64 
 8   UP_AREA      62 non-null     float64 
 9   PFAF_ID      62 non-null     int32   
 10  ENDO         62 non-null     int32   
 11  COAST        62 non-null     int32   
 12  ORDER        62 non-null     int32   
 13  SORT         62 non-null     int64   
 14  geometry     62 non-null     geometry
dtypes: category(1), float64(4), geometry(1), int32(4), int64(5)
memory usage: 6.4 KB


| | HYBAS_ID | REGION_NAME | NEXT_DOWN | NEXT_SINK | MAIN_BAS | DIST_SINK | DIST_MAIN | SUB_AREA | UP_AREA | PFAF_ID | ENDO | COAST | ORDER | SORT | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1020000010 | Africa | 0 | 1020000010 | 1020000010 | 0.0 | 0.0 | 3258330.6 | 3258330.6 | 11 | 0 | 1 | 0 | 1 | MULTIPOLYGON (((33.67778 27.62917, 33.67119 27... |
| 1 | 1020011530 | Africa | 0 | 1020011530 | 1020011530 | 0.0 | 0.0 | 4660080.9 | 4660080.9 | 12 | 0 | 1 | 0 | 2 | MULTIPOLYGON (((34.80278 -19.81667, 34.79279 -... |
| 2 | 1020018110 | Africa | 0 | 1020018110 | 1020018110 | 0.0 | 0.0 | 4900405.1 | 4900405.1 | 13 | 0 | 1 | 0 | 3 | MULTIPOLYGON (((5.64444 -1.47083, 5.62972 -1.4... |
| 3 | 1020021940 | Africa | 0 | 1020021940 | 1020021940 | 0.0 | 0.0 | 4046600.5 | 4046600.5 | 14 | 0 | 1 | 0 | 4 | MULTIPOLYGON (((0.97778 5.9875, 0.97022 5.9884... |
| 4 | 1020027430 | Africa | 0 | 1020027430 | 1020027430 | 0.0 | 0.0 | 6923559.6 | 6923559.6 | 15 | 0 | 1 | 0 | 5 | MULTIPOLYGON (((23.28611 32.22083, 23.28133 32... |
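
The crosswalk from TDXHydroRegion (HYBAS_ID) to PFAF_ID amounts to a simple lookup. The sketch below builds one from the five Africa rows shown above, using a plain dictionary so that it runs without `geopandas`. With the full GeoDataFrame, an equivalent lookup could presumably be built with `dict(zip(gdf['HYBAS_ID'], gdf['PFAF_ID']))`.

```python
# (HYBAS_ID, PFAF_ID) pairs copied from the first five rows shown above.
rows = [
    (1020000010, 11),
    (1020011530, 12),
    (1020018110, 13),
    (1020021940, 14),
    (1020027430, 15),
]

# Forward lookup: TDXHydroRegion (HYBAS_ID) -> PFAF_ID
hybas_to_pfaf = dict(rows)

print(hybas_to_pfaf[1020018110])
```

Only the forward direction is built here. The per-region output earlier lists PFAF_ID 35 under both the North American Arctic and Siberia files, so a reverse lookup keyed on PFAF_ID alone may be ambiguous, which makes HYBAS_ID the safer key.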


In [63]:
# Save to NGA TDX hydro directory
tdx_processed_dir = tdx_dir / 'processed'
tdx_processed_dir.mkdir(parents=True, exist_ok=True)

hybas_all_lev02_gdf.to_parquet(
    tdx_processed_dir / 'tdx_regions.parquet',
    compression='zstd',
)

## Get TDX-Hydro GPKG files by HYBAS_ID

# End