# Generate a Modified Nested Set Index from NGA TDX-Hydro

This notebook demonstrates how to use functions in the [WikiWatershed/global-hydrography](https://github.com/WikiWatershed/global-hydrography) package to generate a modified nested set index using the TDX-Hydro datasets released by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil).

This example notebook assumes that you have already downloaded the applicable data using the example provided in the `1_GetData.ipynb` notebook. It also assumes that you have completed the setup steps outlined in the **[Installation Instructions](README.md#get-started)**, which are also covered in `1_GetData.ipynb`.

# Python Imports

In this step we import the Python dependencies needed for this example.

In [1]:
from pathlib import Path
import re
from importlib import reload

import pyogrio
import geopandas as gpd

from global_hydrography.delineation.mnsi import modified_nest_set_index
from global_hydrography.preprocess import TDXPreprocessor

# Compile files that need to be processed

In this step we will compile a list of the files that need to be processed to have a modified nested set index. Note this step assumes that you have downloaded the files to the same directory and used the same naming convention as the `1_GetData.ipynb` example notebook. If you have opted to use a different location or naming convention you will need to modify this step accordingly.

In [2]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
data_dir = project_dir / 'data_temp' # a temporary data directory that we .gitignore
tdx_dir = data_dir / 'nga'

In [3]:
# Scan the data directory and keep only the streamnet (blueline) files
files_to_process = []
for item in tdx_dir.iterdir():
    if item.is_file() and 'streamnet' in item.name and item.suffix=='.gpkg':
        files_to_process.append(item)

In [4]:
files_to_process

[PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_1020011530_01.gpkg'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.gpkg')]
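
The scan above can also be written with `Path.glob`, and the 10-digit basin ID that appears in each file name can be checked at the same time. The following is a minimal sketch, not part of the `global_hydrography` package: the helper names `find_streamnet_files` and `parse_basin_id` are hypothetical, and the files are empty placeholders created in a throwaway directory, named after the pattern seen in the listing above.

```python
import re
import tempfile
from pathlib import Path


def find_streamnet_files(directory: Path) -> list[Path]:
    """Return the streamnet GeoPackage files in `directory`, sorted by name."""
    return sorted(p for p in directory.glob("*streamnet*.gpkg") if p.is_file())


def parse_basin_id(path: Path) -> int:
    """Extract the 10-digit basin ID from a file name, or raise if absent."""
    match = re.search(r"\d{10}", path.name)
    if match is None:
        raise ValueError(f"No 10-digit basin ID found in {path.name!r}")
    return int(match.group(0))


# Demonstrate on empty placeholder files (hypothetical stand-ins for real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in (
        "TDX_streamnet_7020038340_01.gpkg",
        "TDX_streamnet_1020011530_01.gpkg",
        "some_other_layer_1020011530_01.gpkg",  # ignored: not a streamnet file
        "notes.txt",                            # ignored: not a GeoPackage
    ):
        (demo_dir / name).touch()

    found = find_streamnet_files(demo_dir)
    demo_names = [p.name for p in found]
    demo_ids = [parse_basin_id(p) for p in found]

print(demo_names)
print(demo_ids)
```

Sorting the result makes the order of `files_to_process` reproducible, which matters when a later cell selects a file by position.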

# Compute the modified nested set index

In this step we define a helper function that opens a file as a GeoDataFrame, applies the modified nested set algorithm, and writes the result to a new GeoParquet file (`<layer name>_mnsi.parquet`) in the same data directory. We then run the helper on one of the files to be processed. Note that this step assumes you have used the same file naming convention as the `1_GetData.ipynb` example notebook. If your naming convention is different, you may need to modify the code below.

In [5]:
# define a helper function for the operation
def compute_mnsi(file: Path, preprocessor: TDXPreprocessor) -> Path:

    # parse the file name to get the TDX Basin Id
    tdx_basin_id = int(re.search(r"\d{10}", file.name).group(0))

    # get file metadata
    info = pyogrio.read_info(file, layer=0)
    print(f"File read: layer = {info['layer_name']} last updated {info['layer_metadata']['DBF_DATE_LAST_UPDATE']}")
    
    #open the file as GeoDataFrame
    gdf = gpd.read_file(file, engine='pyogrio', layer=0, use_arrow=True)

    #apply preprocessing to make linkno globally unique
    preprocessor.tdx_to_global_linkno(gdf, tdx_basin_id)

    #apply preprocessing to drop columns with no value
    preprocessor.tdx_drop_useless_columns(gdf)

    #compute the modified nested set index
    gdf = modified_nest_set_index(gdf)
    print('Computed: modified nested set index')

    # Set 'LINKNO' as index, to speed reads
    gdf.set_index('LINKNO', inplace=True)
    gdf.sort_index(inplace=True)

    #write the result to a new GeoParquet file
    tdx_parquet_path = tdx_dir / f"{info['layer_name']}_mnsi.parquet"
    gdf.to_parquet(tdx_parquet_path, compression='zstd')
    print(f'File saved: {tdx_parquet_path.name}')

    return tdx_parquet_path

In [6]:
#initialize a preprocessor instance
#we want to reuse this object to take advantage of the cached TDX Basin Id crosswalk
preprocessor = TDXPreprocessor()

file = files_to_process[1]

tdx_parquet_path = compute_mnsi(file, preprocessor)

File read: layer = TDX_streamnet_7020038340_01 last updated 2021-12-08
Computed: modified nested set index
File saved: TDX_streamnet_7020038340_01_mnsi.parquet
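
To give a feel for what the `DISCOVER_TIME` and `FINISH_TIME` columns encode, here is a minimal sketch of a depth-first traversal on a toy network. It is an illustration inferred from the values shown in this notebook's output, not the implementation used by `modified_nest_set_index`: the function `nested_set_times` and the toy link IDs are hypothetical, and the order in which upstream links are visited is an arbitrary choice here.

```python
def nested_set_times(
    root: int, upstream: dict[int, list[int]]
) -> dict[int, tuple[int, int]]:
    """Assign (discover, finish) times by depth-first traversal from an outlet."""
    times: dict[int, tuple[int, int]] = {}
    clock = 1  # assumption: the clock starts at 1 for each outlet (root)

    def visit(link: int) -> None:
        nonlocal clock
        discover = clock
        clock += 1
        for up in upstream.get(link, []):
            visit(up)
        # finish is the clock value once the whole upstream subtree is visited
        times[link] = (discover, clock)

    visit(root)
    return times


# Toy network: outlet 10 has tributaries 20 and 30; link 30 has tributaries 40 and 50.
toy_upstream = {10: [20, 30], 30: [40, 50]}
toy_times = nested_set_times(10, toy_upstream)

for link, (discover, finish) in sorted(toy_times.items(), key=lambda kv: kv[1][0]):
    print(link, discover, finish)
```

In this scheme a headwater link has `finish == discover + 1`, and every link upstream of a given link has a discover time inside that link's half-open interval `[discover, finish)`. A recursive traversal like this one would hit Python's recursion limit on a large network, so a real implementation would likely need an explicit stack.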


In [7]:
# Get file size, in bytes
tdx_parquet_path.stat().st_size

280516333
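
A raw byte count is hard to read at a glance. As a small convenience sketch (the helper `format_size` is hypothetical and not part of the package), the size can be reported in both decimal megabytes and binary mebibytes:

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count in decimal megabytes (MB) and binary mebibytes (MiB)."""
    return f"{num_bytes / 1e6:.1f} MB ({num_bytes / 2**20:.1f} MiB)"


# The byte count reported by the cell above.
size_label = format_size(280516333)
print(size_label)  # 280.5 MB (267.5 MiB)
```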

In [8]:
# Open the file as GeoDataFrame
gdf = gpd.read_parquet(tdx_parquet_path)
gdf.info()
gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 140097 entries, 750000000 to 750327711
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   DSLINKNO       140097 non-null  int32   
 1   USLINKNO1      140097 non-null  int32   
 2   USLINKNO2      140097 non-null  int32   
 3   ROOT_ID        140097 non-null  int32   
 4   DISCOVER_TIME  140097 non-null  int32   
 5   FINISH_TIME    140097 non-null  int32   
 6   strmOrder      140097 non-null  int32   
 7   Length         140097 non-null  float64 
 8   Magnitude      140097 non-null  int32   
 9   DSContArea     140097 non-null  float64 
 10  strmDrop       140097 non-null  float64 
 11  Slope          140097 non-null  float64 
 12  StraightL      140097 non-null  float64 
 13  USContArea     140097 non-null  float64 
 14  DOUTEND        140097 non-null  float64 
 15  DOUTSTART      140097 non-null  float64 
 16  DOUTMID        140097 non-null  float64 
 

LINKNO,DSLINKNO,USLINKNO1,USLINKNO2,ROOT_ID,DISCOVER_TIME,FINISH_TIME,strmOrder,Length,Magnitude,DSContArea,strmDrop,Slope,StraightL,USContArea,DOUTEND,DOUTSTART,DOUTMID,geometry
750000000,750001777,-1,-1,750021317,52,53,1,3847.9,1,9.567845e+06,42.07,0.010933,3233.7,5.254868e+06,45853.6,49701.4,47777.5,"LINESTRING (-69.67822 46.41356, -69.67822 46.4..."
750000001,750002369,-1,-1,750021317,49,50,1,2251.3,1,8.768556e+06,34.66,0.015397,1749.2,4.320561e+06,44802.7,47054.1,45928.4,"LINESTRING (-69.68589 46.40778, -69.686 46.407..."
750000002,750004146,-1,-1,750021317,47,48,1,3551.0,1,9.120895e+06,67.48,0.019002,2593.6,5.267176e+06,41041.1,44591.7,42816.4,"LINESTRING (-69.687 46.37911, -69.687 46.379, ..."
750000003,-1,-1,-1,750000003,1,2,1,4169.5,1,1.726447e+07,44.30,0.010624,2960.9,4.655121e+06,0.0,4169.5,2084.8,"LINESTRING (-69.792 46.31622, -69.79189 46.316..."
750000004,750001188,-1,-1,750133844,2816,2817,1,2207.2,1,6.414576e+06,65.32,0.029596,1753.8,4.333366e+06,340509.1,342716.4,341612.8,"LINESTRING (-69.66944 46.27889, -69.66956 46.2..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
750325343,750325935,750321790,750257263,750288662,470,2165,5,3608.2,848,1.400987e+10,0.78,0.000217,1944.1,1.400788e+10,302611.0,306216.2,304413.6,"LINESTRING (-80.64644 33.99967, -80.64633 33.9..."
750325935,750318240,750325343,750143599,750288662,468,2165,5,8468.1,849,1.402642e+10,0.00,0.000000,5011.1,1.401650e+10,294145.8,302611.0,298378.4,"LINESTRING (-80.61611 33.96222, -80.616 33.962..."
750326527,750320016,750322382,750193327,750293970,31,2996,6,5620.7,1483,2.338440e+10,0.00,0.000000,5339.7,2.337541e+10,14210.7,19831.2,17020.9,"LINESTRING (-77.95067 33.96689, -77.95067 33.9..."
750327119,750327711,750323566,750163135,750301090,3001,5930,6,1926.6,1465,2.370814e+10,0.00,0.000000,1743.6,2.370614e+10,145996.2,147923.5,146959.8,"LINESTRING (-79.50722 33.983, -79.50722 33.983..."


## Explore Modified Nested Set

In [10]:
gdf[gdf.ROOT_ID==750288662].sort_values('DISCOVER_TIME')

LINKNO,DSLINKNO,USLINKNO1,USLINKNO2,ROOT_ID,DISCOVER_TIME,FINISH_TIME,strmOrder,Length,Magnitude,DSContArea,strmDrop,Slope,StraightL,USContArea,DOUTEND,DOUTSTART,DOUTMID,geometry
750288662,-1,750288070,750170262,750288662,1,4898,6,1273.1,2449,3.940295e+10,0.00,0.000000,1028.1,3.940148e+10,0.0,1273.1,636.6,"LINESTRING (-79.23711 33.13078, -79.23722 33.1..."
750170262,750288662,-1,-1,750288662,2,3,1,438.9,1,6.562896e+06,0.00,0.000000,431.8,4.331999e+06,1273.1,1712.0,1492.5,"LINESTRING (-79.24244 33.13889, -79.24244 33.1..."
750288070,750288662,750301685,750169670,750288662,3,4898,6,6269.9,2448,3.939491e+10,0.00,0.000000,5059.3,3.939078e+10,1273.1,7543.1,4408.1,"LINESTRING (-79.24244 33.13889, -79.24256 33.1..."
750169670,750288070,750169078,750168486,750288662,4,7,2,718.9,2,1.356416e+07,1.00,0.001386,595.9,1.348522e+07,7543.1,8262.0,7902.5,"LINESTRING (-79.28856 33.16289, -79.28856 33.1..."
750168486,750169670,-1,-1,750288662,5,6,1,336.1,1,4.427854e+06,0.04,0.000119,291.3,4.359848e+06,8262.0,8598.1,8430.1,"LINESTRING (-79.28778 33.15756, -79.28789 33.1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
750019844,750115156,750018660,750019252,750288662,4893,4898,2,2441.2,3,5.148408e+07,0.98,0.000400,1738.1,4.636870e+07,632110.8,634547.9,633329.4,"LINESTRING (-82.68056 35.00156, -82.68067 35.0..."
750019252,750019844,-1,-1,750288662,4894,4895,1,506.4,1,4.559639e+06,7.66,0.015123,482.9,4.342070e+06,634547.9,635053.5,634800.8,"LINESTRING (-82.699 34.99767, -82.69911 34.997..."
750018660,750019844,750017476,750018068,750288662,4895,4898,2,6750.6,2,4.180782e+07,24.27,0.003595,5175.3,1.942599e+07,634547.9,641287.2,637917.6,"LINESTRING (-82.699 34.99767, -82.69911 34.997..."
750018068,750018660,-1,-1,750288662,4896,4897,1,3314.0,1,8.503494e+06,233.83,0.070557,2555.0,4.741279e+06,641287.2,644596.4,642941.8,"LINESTRING (-82.75567 34.99633, -82.75578 34.9..."
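
The rows above suggest how the index can be used: the links upstream of a given link appear to be exactly those with the same `ROOT_ID` whose `DISCOVER_TIME` falls in that link's half-open interval `[DISCOVER_TIME, FINISH_TIME)`. The sketch below checks this reading against values copied from the displayed rows. It is an interpretation of the output, not a documented API; the helper `upstream_links` is hypothetical, and rows hidden behind the `...` are not included, so the results cover only the displayed excerpt.

```python
# (LINKNO, DISCOVER_TIME, FINISH_TIME) copied from the rows displayed above.
# All of them share ROOT_ID == 750288662, so no ROOT_ID filter is needed here.
rows = [
    (750288662, 1, 4898),
    (750170262, 2, 3),
    (750288070, 3, 4898),
    (750169670, 4, 7),
    (750168486, 5, 6),
    (750019844, 4893, 4898),
    (750019252, 4894, 4895),
    (750018660, 4895, 4898),
    (750018068, 4896, 4897),
]


def upstream_links(linkno: int, rows: list[tuple[int, int, int]]) -> list[int]:
    """Return the link itself plus all listed links discovered within its interval."""
    times = {link: (d, f) for link, d, f in rows}
    discover, finish = times[linkno]
    return [link for link, d, _ in rows if discover <= d < finish]


headwater = upstream_links(750170262, rows)  # interval [2, 3): only itself
branch = upstream_links(750019844, rows)     # interval [4893, 4898)

print(headwater)
print(branch)
```

On the full GeoDataFrame the same filter could be expressed as a boolean mask, for example `gdf[(gdf.ROOT_ID == root) & (gdf.DISCOVER_TIME >= d) & (gdf.DISCOVER_TIME < f)]`, where `root`, `d`, and `f` are taken from the row of the link of interest.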
