# Generate a Modified Nested Set Index from NGA TDX-Hydro

This notebook demonstrates how to use functions in the [WikiWatershed/global-hydrography](https://github.com/WikiWatershed/global-hydrography) package to generate a modified nested set index using the TDX-Hydro datasets released by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil).

This example notebook assumes that you have already downloaded the applicable data using the example provided in the `1_GetData.ipynb` notebook. It also assumes that you have completed the setup steps outlined in the **[Installation Instructions](README.md#get-started)**, which are also covered in `1_GetData.ipynb`.

# Python Imports

In this step we will import the Python dependencies needed for this example.

In [1]:
from pathlib import Path
import re
from importlib import reload

import pyogrio
import geopandas as gpd

from global_hydrography.delineation.mnsi import modified_nest_set_index
from global_hydrography.preprocess import TDXPreprocessor

# Compile files that need to be processed

In this step we will compile a list of the files for which a modified nested set index needs to be computed. Note that this step assumes you have downloaded the files to the same directory, and used the same naming convention, as the `1_GetData.ipynb` example notebook. If you have opted for a different location or naming convention, you will need to modify this step accordingly.

In [2]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
data_dir = project_dir / 'data_temp' # a temporary data directory that we .gitignore
tdx_dir = data_dir / 'nga'

In [3]:
#Scan the files in the data directory and pull out only the streamnet (blueline) files
files_to_process = []
for item in tdx_dir.iterdir():
    if item.is_file() and 'streamnet' in item.name and item.suffix=='.gpkg':
        files_to_process.append(item)

In [4]:
files_to_process

[PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_1020011530_01.gpkg'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.gpkg')]
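
`Path.iterdir()` makes no promise about the order in which files come back, so the position of a given basin in `files_to_process` can differ between machines. The sketch below is one possible alternative that uses `Path.glob` and sorts the result. The helper name `find_streamnet_files` is hypothetical, and the file names in the temporary directory are placeholders for illustration only.

```python
from pathlib import Path
import tempfile


def find_streamnet_files(directory: Path) -> list[Path]:
    """Return the streamnet GeoPackage files in `directory`, sorted by name."""
    return sorted(p for p in directory.glob("*streamnet*.gpkg") if p.is_file())


# Demonstrate on a throwaway directory with placeholder (empty) files
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in (
        "TDX_streamnet_7020038340_01.gpkg",
        "TDX_streamnet_1020011530_01.gpkg",
        "TDX_streambasins_1020011530_01.gpkg",  # not a streamnet file
        "notes.txt",  # wrong suffix
    ):
        (tmp_dir / name).touch()
    found = [p.name for p in find_streamnet_files(tmp_dir)]

print(found)
```

With a sorted list, an index such as `files_to_process[1]` refers to the same file on every run.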

# Compute the modified nested set index

In this step we will take each file to be processed, open it as a GeoDataFrame, apply the modified nested set algorithm, and then write the result to a new Parquet file (`<layer name>_mnsi.parquet`) in the same data directory. The example below runs this on a single file. Note that this step assumes you have used the same file naming convention as the `1_GetData.ipynb` example notebook. If your naming convention is different, you may need to modify the code below.

In [9]:
# define a helper function for the operation
def compute_mnsi(file:Path, preprocessor:TDXPreprocessor) -> Path:

    #parse the file name to get the TDX Basin Id (a 10-digit number)
    tdx_basin_id = int(re.search(r"\d{10}",file.name).group(0))

    # get file metadata
    info = pyogrio.read_info(file, layer=0)
    print(f"File read: layer = {info['layer_name']} last updated {info['layer_metadata']['DBF_DATE_LAST_UPDATE']}")
    
    #open the file as GeoDataFrame
    gdf = gpd.read_file(file, engine='pyogrio', layer=0, use_arrow=True)

    #apply preprocessing to make linkno globally unique
    preprocessor.tdx_to_global_linkno(gdf, tdx_basin_id)

    #apply preprocessing to drop columns with no value
    preprocessor.tdx_drop_useless_columns(gdf)

    #compute the modified nested set index
    gdf = modified_nest_set_index(gdf)
    print('Computed: modified nested set index')

    # Set 'LINKNO' as index, to speed reads
    gdf.set_index('LINKNO', inplace=True)
    gdf.sort_index(inplace=True)

    #write the result to a new Parquet file in the TDX data directory
    tdx_parquet_path = tdx_dir / f"{info['layer_name']}_mnsi.parquet"
    gdf.to_parquet(tdx_parquet_path, compression='zstd')
    print(f'File saved: {tdx_parquet_path.name}')

    return tdx_parquet_path
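
The basin id lookup in `compute_mnsi` relies on the file name containing a 10-digit number. If it does not, `re.search` returns `None` and the `.group(0)` call fails with an `AttributeError` that does not mention the file. A minimal sketch of a more explicit version follows; `parse_tdx_basin_id` is a hypothetical helper, not part of the `global_hydrography` package.

```python
import re

# Assumption: a TDX Basin Id is the first run of 10 digits in the file name
TDX_BASIN_ID_PATTERN = re.compile(r"\d{10}")


def parse_tdx_basin_id(file_name: str) -> int:
    """Extract the 10-digit TDX Basin Id from a file name, or raise ValueError."""
    match = TDX_BASIN_ID_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"No 10-digit TDX Basin Id found in {file_name!r}")
    return int(match.group(0))


basin_id = parse_tdx_basin_id("TDX_streamnet_7020038340_01.gpkg")
print(basin_id)  # 7020038340
```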

In [10]:
#initialize a preprocessor instance
#we want to reuse this object to take advantage of the cached TDX Basin Id crosswalk
preprocessor = TDXPreprocessor()

file = files_to_process[1]

tdx_parquet_path = compute_mnsi(file, preprocessor)

File read: layer = TDX_streamnet_7020038340_01 last updated 2021-12-08
Computed: modified nested set index
File saved: TDX_streamnet_7020038340_01_mnsi.parquet
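
To give a feel for what the index contains, here is a toy sketch of a nested set labelling on a six-link network. It is not the implementation in `global_hydrography.delineation.mnsi`. The numbering convention (a counter that restarts at each outlet, increases by one per link, and a `FINISH_TIME` one past the last upstream link) is an assumption inferred from the sample output shown later in this notebook. The function name `toy_nested_set_index` and the link numbers are made up for illustration.

```python
import pandas as pd


def toy_nested_set_index(links: pd.DataFrame) -> pd.DataFrame:
    """Toy illustration only: label links by depth-first traversal from each outlet."""
    # Group links by the link they drain into (DSLINKNO == -1 marks an outlet)
    upstream = {}
    for linkno, dslinkno in zip(links["LINKNO"], links["DSLINKNO"]):
        upstream.setdefault(dslinkno, []).append(linkno)

    discover, finish, root_id = {}, {}, {}
    for root in upstream.get(-1, []):
        counter = 0  # assumption: numbering restarts for every outlet
        stack = [(root, False)]
        while stack:
            link, upstream_done = stack.pop()
            if upstream_done:
                # Everything upstream has been numbered, so close the interval
                finish[link] = counter + 1
                continue
            counter += 1
            discover[link] = counter
            root_id[link] = root
            stack.append((link, True))  # revisit after the upstream links
            stack.extend((up, False) for up in upstream.get(link, []))

    indexed = links.set_index("LINKNO")
    indexed["DISCOVER_TIME"] = pd.Series(discover)
    indexed["FINISH_TIME"] = pd.Series(finish)
    indexed["ROOT_ID"] = pd.Series(root_id)
    return indexed


# Two basins: link 10 is an outlet fed by 11 and 12 (12 is fed by 13 and 14);
# link 20 is a single-link basin of its own.
toy_links = pd.DataFrame(
    {"LINKNO": [10, 11, 12, 13, 14, 20], "DSLINKNO": [-1, 10, 10, 12, 12, -1]}
)
toy = toy_nested_set_index(toy_links)
print(toy)
```

Under this convention, `FINISH_TIME - DISCOVER_TIME` equals the number of links in the sub-network draining through a link, itself included, so a headwater link always spans exactly one step.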


In [11]:
# Get file size, in bytes
tdx_parquet_path.stat().st_size

280516351

In [12]:
# Open the file as GeoDataFrame
gdf = gpd.read_parquet(tdx_parquet_path)
gdf.info()
gdf

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 140097 entries, 750000000 to 750327711
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   DSLINKNO       140097 non-null  int32   
 1   USLINKNO1      140097 non-null  int32   
 2   USLINKNO2      140097 non-null  int32   
 3   strmOrder      140097 non-null  int32   
 4   Length         140097 non-null  float64 
 5   Magnitude      140097 non-null  int32   
 6   DSContArea     140097 non-null  float64 
 7   strmDrop       140097 non-null  float64 
 8   Slope          140097 non-null  float64 
 9   StraightL      140097 non-null  float64 
 10  USContArea     140097 non-null  float64 
 11  DOUTEND        140097 non-null  float64 
 12  DOUTSTART      140097 non-null  float64 
 13  DOUTMID        140097 non-null  float64 
 14  geometry       140097 non-null  geometry
 15  DISCOVER_TIME  140097 non-null  int32   
 16  FINISH_TIME    140097 non-null  int32   
 

| LINKNO | DSLINKNO | USLINKNO1 | USLINKNO2 | strmOrder | Length | Magnitude | DSContArea | strmDrop | Slope | StraightL | USContArea | DOUTEND | DOUTSTART | DOUTMID | geometry | DISCOVER_TIME | FINISH_TIME | ROOT_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 750000000 | 750001777 | -1 | -1 | 1 | 3847.9 | 1 | 9.567845e+06 | 42.07 | 0.010933 | 3233.7 | 5.254868e+06 | 45853.6 | 49701.4 | 47777.5 | LINESTRING (-69.67822 46.41356, -69.67822 46.4... | 52 | 53 | 750021317 |
| 750000001 | 750002369 | -1 | -1 | 1 | 2251.3 | 1 | 8.768556e+06 | 34.66 | 0.015397 | 1749.2 | 4.320561e+06 | 44802.7 | 47054.1 | 45928.4 | LINESTRING (-69.68589 46.40778, -69.686 46.407... | 49 | 50 | 750021317 |
| 750000002 | 750004146 | -1 | -1 | 1 | 3551.0 | 1 | 9.120895e+06 | 67.48 | 0.019002 | 2593.6 | 5.267176e+06 | 41041.1 | 44591.7 | 42816.4 | LINESTRING (-69.687 46.37911, -69.687 46.379, ... | 47 | 48 | 750021317 |
| 750000003 | -1 | -1 | -1 | 1 | 4169.5 | 1 | 1.726447e+07 | 44.30 | 0.010624 | 2960.9 | 4.655121e+06 | 0.0 | 4169.5 | 2084.8 | LINESTRING (-69.792 46.31622, -69.79189 46.316... | 1 | 2 | 750000003 |
| 750000004 | 750001188 | -1 | -1 | 1 | 2207.2 | 1 | 6.414576e+06 | 65.32 | 0.029596 | 1753.8 | 4.333366e+06 | 340509.1 | 342716.4 | 341612.8 | LINESTRING (-69.66944 46.27889, -69.66956 46.2... | 2816 | 2817 | 750133844 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 750325343 | 750325935 | 750321790 | 750257263 | 5 | 3608.2 | 848 | 1.400987e+10 | 0.78 | 0.000217 | 1944.1 | 1.400788e+10 | 302611.0 | 306216.2 | 304413.6 | LINESTRING (-80.64644 33.99967, -80.64633 33.9... | 470 | 2165 | 750288662 |
| 750325935 | 750318240 | 750325343 | 750143599 | 5 | 8468.1 | 849 | 1.402642e+10 | 0.00 | 0.000000 | 5011.1 | 1.401650e+10 | 294145.8 | 302611.0 | 298378.4 | LINESTRING (-80.61611 33.96222, -80.616 33.962... | 468 | 2165 | 750288662 |
| 750326527 | 750320016 | 750322382 | 750193327 | 6 | 5620.7 | 1483 | 2.338440e+10 | 0.00 | 0.000000 | 5339.7 | 2.337541e+10 | 14210.7 | 19831.2 | 17020.9 | LINESTRING (-77.95067 33.96689, -77.95067 33.9... | 31 | 2996 | 750293970 |
| 750327119 | 750327711 | 750323566 | 750163135 | 6 | 1926.6 | 1465 | 2.370814e+10 | 0.00 | 0.000000 | 1743.6 | 2.370614e+10 | 145996.2 | 147923.5 | 146959.8 | LINESTRING (-79.50722 33.983, -79.50722 33.983... | 3001 | 5930 | 750301090 |
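
The three new columns are what make upstream queries cheap. As a rough sketch, and assuming the half-open interval convention that the rows above appear to follow (a link is upstream of a target when it shares the target's `ROOT_ID` and its `DISCOVER_TIME` falls in `[DISCOVER_TIME, FINISH_TIME)` of the target), the selection reduces to a boolean mask. The helper `select_upstream` is hypothetical, and the small DataFrame reuses a few rows copied from the output above with the other columns omitted.

```python
import pandas as pd

# A few rows copied from the output above (other columns omitted)
sample = pd.DataFrame(
    {
        "DISCOVER_TIME": [52, 1, 470, 468, 31],
        "FINISH_TIME": [53, 2, 2165, 2165, 2996],
        "ROOT_ID": [750021317, 750000003, 750288662, 750288662, 750293970],
    },
    index=pd.Index(
        [750000000, 750000003, 750325343, 750325935, 750326527], name="LINKNO"
    ),
)


def select_upstream(df: pd.DataFrame, linkno: int) -> pd.DataFrame:
    """Return the target link plus every link that drains through it."""
    target = df.loc[linkno]
    # Discover times restart per outlet, so restrict to the same basin first
    same_basin = df["ROOT_ID"] == target["ROOT_ID"]
    in_interval = (df["DISCOVER_TIME"] >= target["DISCOVER_TIME"]) & (
        df["DISCOVER_TIME"] < target["FINISH_TIME"]
    )
    return df[same_basin & in_interval]


upstream = select_upstream(sample, 750325935)
print(upstream.index.tolist())  # [750325343, 750325935]
```

The `ROOT_ID` filter matters because links in different basins can share the same discover times.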


## Explore Modified Nested Set