# Generate a Modified Nested Set Index from NGA TDX-Hydro

This notebook demonstrates how to use functions in the [WikiWatershed/global-hydrography](https://github.com/WikiWatershed/global-hydrography) package to generate a modified nested set index using the TDX-Hydro datasets released by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil).

This example notebook assumes that you have already downloaded the applicable data using the example provided in the `1_GetData.ipynb` notebook. It also assumes that you have completed the setup steps outlined in the **[Installation Instructions](README.md#get-started)** (these steps are also completed as part of `1_GetData.ipynb`).

# Python Imports

In this step we import the Python dependencies needed for this example.

In [6]:
from pathlib import Path
import re

import pyogrio

from global_hydrography.delineation.mnsi import modified_nest_set_index
from global_hydrography.delineation.preprocess import TDXPreprocessor

# Compile files that need to be processed

In this step we compile a list of the files that need a modified nested set index. Note that this step assumes you downloaded the files to the same directory, and with the same naming convention, as the `1_GetData.ipynb` example notebook. If you used a different location or naming convention, you will need to modify this step accordingly.

In [7]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
data_dir = project_dir / 'data_temp' # a temporary data directory that we .gitignore

In [8]:
# Scan the files in the data directory and pull out only the streamnet (blueline) files
files_to_process = []
for item in data_dir.iterdir():
    if item.is_file() and 'streamnet' in item.name:
        files_to_process.append(item)
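
To see what the filter keeps, here is a minimal, self-contained sketch that runs the same scan against a throwaway directory. The file names are hypothetical and only illustrate the `'streamnet'` substring match. The helper name `find_streamnet_files` and the sorting step are additions for this illustration, not part of the package.

```python
import tempfile
from pathlib import Path


def find_streamnet_files(directory: Path) -> list[Path]:
    """Return files in `directory` whose name contains 'streamnet', sorted by name."""
    return sorted(
        item
        for item in directory.iterdir()
        if item.is_file() and "streamnet" in item.name
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Hypothetical file names, for illustration only
    for name in [
        "1234567890-streamnet.gpkg",
        "1234567890-streamreach_basins.gpkg",
        "notes.txt",
    ]:
        (tmp_dir / name).touch()
    # A directory with a matching name is skipped by the is_file() check
    (tmp_dir / "streamnet_subdir").mkdir()

    found_names = [f.name for f in find_streamnet_files(tmp_dir)]

print(found_names)
```

Sorting makes the processing order reproducible, since `Path.iterdir()` yields entries in arbitrary order.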

# Compute the modified nested set index

In this step we loop through each of the files to be processed, open it as a GeoDataFrame, apply the modified nested set algorithm, and then write the result back to the original file. Note that this step assumes you have used the same file naming convention as the `1_GetData.ipynb` example notebook. If your naming convention is different, you may need to modify the code below.

In [20]:
# define a helper function for the operation
def compute_mnsi(file:Path, preprocessor:TDXPreprocessor) -> None:

    #parse the file name to get the TDX Basin Id (the first run of digits)
    tdx_basin_id = int(re.search(r"\d+", file.name).group(0))

    #open the file as GeoDataFrame
    gdf = pyogrio.read_dataframe(file, use_arrow=True)

    #apply preprocessing to make linkno globally unique
    gdf = preprocessor.tdx_to_global_linkno(gdf, tdx_basin_id)

    #compute the modified nested set index
    gdf = modified_nest_set_index(gdf)

    #write back to the file
    pyogrio.write_dataframe(gdf, file, use_arrow=True)
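
The basin ID parsing can be checked in isolation. The sketch below is a hedged illustration: `parse_basin_id` is a hypothetical helper, the file names are made up, and it assumes the basin ID is the first run of digits in the file name. A pattern that can match zero digits (such as `\d*`) returns an empty match when the name starts with a non-digit character, so this sketch requires at least one digit and raises a clear error otherwise.

```python
import re
from pathlib import Path


def parse_basin_id(file: Path) -> int:
    """Return the first run of digits in the file name as an integer."""
    match = re.search(r"\d+", file.name)
    if match is None:
        raise ValueError(f"No basin ID found in file name: {file.name}")
    return int(match.group(0))


# Hypothetical file names, for illustration only
basin_id = parse_basin_id(Path("data_temp/1234567890-streamnet.gpkg"))
prefixed_id = parse_basin_id(Path("TDX_streamnet_1234567890_01.gpkg"))
print(basin_id, prefixed_id)
```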

In [21]:
#initialize a preprocessor instance
#we want to reuse this object to take advantage of the cached TDX Basin Id crosswalk
preprocessor = TDXPreprocessor()

for file in files_to_process:
    compute_mnsi(file, preprocessor)
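
For intuition about why such an index is useful, the sketch below builds a classic nested set index on a toy stream network. This is a general illustration only, not the package's implementation: the function names, the `-1` outlet marker, and the `(discover, finish)` tuple layout are assumptions, and the modified algorithm in `global_hydrography` may differ in its details. Each link gets a discover number from a depth-first traversal that starts at the outlet and moves upstream, and its finish number is the largest discover number found in its upstream subtree. Everything upstream of a link then falls inside a single numeric range.

```python
from collections import defaultdict


def nested_set_index(downstream: dict[int, int]) -> dict[int, tuple[int, int]]:
    """Map each link to (discover, finish) using a depth-first walk upstream.

    `downstream` maps a link ID to the link it drains into, or -1 for an outlet.
    """
    upstream = defaultdict(list)
    outlets = []
    for link, ds_link in downstream.items():
        if ds_link == -1:
            outlets.append(link)
        else:
            upstream[ds_link].append(link)

    index: dict[int, tuple[int, int]] = {}
    counter = 0
    for outlet in sorted(outlets):
        stack = [(outlet, False)]
        while stack:
            link, children_done = stack.pop()
            if children_done:
                # All upstream links are numbered; record the last number used
                index[link] = (index[link][0], counter)
            else:
                counter += 1
                index[link] = (counter, counter)
                stack.append((link, True))
                for up_link in sorted(upstream[link], reverse=True):
                    stack.append((up_link, False))
    return index


def upstream_links(index: dict[int, tuple[int, int]], link: int) -> list[int]:
    """Return the link and everything upstream of it with one range test."""
    discover, finish = index[link]
    return sorted(k for k, (d, _) in index.items() if discover <= d <= finish)


# Toy network: links 4 and 5 drain to 2; links 2 and 3 drain to outlet 1
toy_network = {1: -1, 2: 1, 3: 1, 4: 2, 5: 2}
toy_index = nested_set_index(toy_network)
print(toy_index)
print(upstream_links(toy_index, 2))  # [2, 4, 5]
```

The range test replaces a recursive network traversal at query time, which is the main appeal of nested-set style indexes for watershed delineation.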



In [17]:
# gdf is local to compute_mnsi, so re-read one processed file to inspect the result
pyogrio.read_dataframe(files_to_process[0]).head()