# Process NGA TDX-Hydro Basins and Save to GeoParquet files

This notebook demonstrates how to use functions in the [WikiWatershed/global-hydrography](https://github.com/WikiWatershed/global-hydrography) package to pre-process TDX-Hydro basin boundary datasets released by the [US National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil).

This example notebook assumes that you have already downloaded the applicable data using the example provided in the `1_GetData.ipynb` notebook. It also assumes that you have completed the necessary setup steps outlined in the **[Installation Instructions](README.md#get-started)** (which are also completed as part of the `1_GetData.ipynb` notebook).

The functions introduced in this notebook were developed in the `sandbox/modified_nested_set_index.ipynb` notebook.

# Python Imports

In this step we will import the necessary Python dependencies for this example.

In [1]:
from pathlib import Path
import re
from importlib import reload

import pyogrio
import geopandas as gpd
import pandas as pd

import global_hydrography as gh
from global_hydrography.preprocess import TDXPreprocessor

In [2]:
# Explore the namespace for global-hydrography modules, functions, etc.
dir(gh)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'delineation',
 'io',
 'mnsi',
 'preprocess',
 'process']

# Compile files that need to be processed

In this step we will compile a list of the basins files that need to be processed. Note that this step assumes that you have downloaded the files to the same directory and used the same naming convention as the `1_GetData.ipynb` example notebook. If you have opted to use a different location or naming convention, you will need to modify this step accordingly.

In [3]:
# Confirm your current working directory (cwd) and repo/project directory
working_dir = Path.cwd()
project_dir = working_dir.parent
data_dir = project_dir / 'data_temp' # a temporary data directory that we .gitignore
tdx_dir = data_dir / 'nga'

In [4]:
# Scan the files in the data directory and select only the basins GeoPackage files
files_to_process = []
for item in tdx_dir.iterdir():
    if item.is_file() and 'basins' in item.name and item.suffix=='.gpkg':
        files_to_process.append(item)

In [5]:
files_to_process

[PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_7020038340_01.gpkg'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_1020011530_01.gpkg'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamreach_basins_1020040190_01.gpkg'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/streams_no_basins.gpkg')]
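
Note that the substring test above also picks up `streams_no_basins.gpkg`, which has no region ID in its name. A stricter filter could be sketched as follows. The `BASINS_PATTERN` regular expression and the `select_basins_files()` helper are illustrative assumptions based on the file names listed above, not part of the `global_hydrography` package.

```python
from pathlib import Path
import re

# Assumed naming convention, inferred from the file names listed above.
BASINS_PATTERN = re.compile(r"^TDX_streamreach_basins_\d{10}_\d{2}\.gpkg$")


def select_basins_files(directory: Path) -> list[Path]:
    """Return basins GeoPackage files whose names match the assumed TDX convention."""
    return sorted(
        item
        for item in directory.iterdir()
        if item.is_file() and BASINS_PATTERN.match(item.name)
    )
```

Calling `select_basins_files(tdx_dir)` would then skip any file that merely contains the word "basins".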

# Create Basins GeoParquet file

## `process_tdx_basins()` function

This helper function was developed in the `sandbox/modified_nested_set_index.ipynb` notebook, mimicking the `create_tdx_mnsi()` function from the `examples/3_GenerateModifiedNestedSetIndex.ipynb` notebook.


In [6]:
# define a helper function for the operation
def process_tdx_basins(file: Path, preprocessor: TDXPreprocessor) -> Path:
    ''' Creates a basins GeoParquet file from an original TDX-Hydro
    GeoPackage basins file.

    The new GeoParquet file renames 'streamID' to 'LINKNO', modifies it to be 
    globally unique, and sets it as the index for interoperability with streamnet data. 
    The GeoParquet file is saved with a filename in the form of:
    `f"TDX_streamreach_basins_{tdx_hydro_region}_01.parquet"`.

    file: The Path to an original TDX-Hydro GeoPackage basins file.
    preprocessor: A TDXPreprocessor class instance. 

    Return: The Path to the saved GeoParquet file.
    '''

    # parse the file name to get the TDXHydroRegion
    tdx_hydro_region = int(re.search(r"\d{10}", file.name).group(0))
    print(f"Processing TDXHydroRegion = {tdx_hydro_region}")

    # get file metadata
    info = pyogrio.read_info(file, layer=0)
    print(f"  Reading: layer = {info['layer_name']}")
    
    # open the file as GeoDataFrame
    gdf = gpd.read_file(file, engine='pyogrio', layer=0, use_arrow=True)

    # Rename 'streamID' to 'LINKNO' to facilitate interoperability 
    # with streamnet files
    gdf.rename(columns={'streamID':'LINKNO'}, inplace=True)
    
    # apply preprocessing to make linkno globally unique
    preprocessor.tdx_to_global_linkno(gdf, tdx_hydro_region)

    # Set 'LINKNO' as index, to facilitate selection
    gdf.set_index('LINKNO', inplace=True)

    # write the result to a new GeoParquet file in tdx_dir
    tdx_parquet_path = tdx_dir / f"TDX_streamreach_basins_{tdx_hydro_region}_01.parquet"
    gdf.to_parquet(tdx_parquet_path, compression='zstd')
    print(f'  File saved: {tdx_parquet_path.name}')

    return tdx_parquet_path
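
The function above assumes that every file name contains a 10-digit region ID. If it does not, `re.search()` returns `None` and the `.group(0)` call raises an `AttributeError`. A more defensive version of the parsing step might look like the sketch below, where `parse_tdx_hydro_region()` is a hypothetical helper name and not a function from the package.

```python
import re


def parse_tdx_hydro_region(file_name: str) -> int:
    """Extract the 10-digit TDX Hydro Region ID from a file name (hypothetical helper)."""
    match = re.search(r"\d{10}", file_name)
    if match is None:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError(f"No 10-digit TDX Hydro Region ID found in {file_name!r}")
    return int(match.group(0))


print(parse_tdx_hydro_region("TDX_streamreach_basins_7020038340_01.gpkg"))  # 7020038340
```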

In [7]:
# Select file
file = files_to_process[0]
file.name

'TDX_streamreach_basins_7020038340_01.gpkg'

In [8]:
# Get file size, in MB
file.stat().st_size / 1_000_000

2647.330816

In [9]:
#initialize a preprocessor instance
#we want to reuse this object to take advantage of the cached TDX Basin Id crosswalk
preprocessor = TDXPreprocessor()

# Process basin
tdx_parquet_path = process_tdx_basins(file, preprocessor)
# 1m 1.8s

Processing TDXHydroRegion = 7020038340
  Reading: layer = basins
  File saved: TDX_streamreach_basins_7020038340_01.parquet


In [10]:
# Get file size, in MB
tdx_parquet_path.stat().st_size / 1_000_000

676.472978

**GeoParquet file (with zstd compression) is 3.9x smaller than GeoPackage!**
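
As a quick check of that ratio, the two file sizes reported above can be compared directly. The variable names here are illustrative.

```python
# File sizes in MB, copied from the cell outputs above.
gpkg_mb = 2647.330816
parquet_mb = 676.472978

ratio = gpkg_mb / parquet_mb        # how many times smaller the GeoParquet file is
saving = 1 - parquet_mb / gpkg_mb   # fraction of disk space saved

print(f"{ratio:.1f}x smaller, {saving:.0%} less disk space")
# 3.9x smaller, 74% less disk space
```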

# Re-Read the Saved GeoParquet

In [11]:
# Open the file as GeoDataFrame
gdf = gpd.read_parquet(tdx_parquet_path)
gdf.info()
gdf
# 20.2s

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 140053 entries, 750000001 to 750327711
Data columns (total 1 columns):
 #   Column    Non-Null Count   Dtype   
---  ------    --------------   -----   
 0   geometry  140053 non-null  geometry
dtypes: geometry(1)
memory usage: 2.1 MB


LINKNO,geometry
750000001,"POLYGON ((-69.71706 46.42639, -69.71572 46.426..."
750000002,"POLYGON ((-69.71939 46.39428, -69.71928 46.394..."
750000003,"POLYGON ((-69.77483 46.30506, -69.77483 46.304..."
750000004,"POLYGON ((-69.70206 46.30194, -69.70183 46.301..."
750000005,"POLYGON ((-69.71272 46.2815, -69.71261 46.2815..."
...,...
750325343,"POLYGON ((-80.63483 34.01172, -80.63439 34.011..."
750325935,"POLYGON ((-80.6475 34.00028, -80.64728 34.0002..."
750326527,"POLYGON ((-77.93961 34.01417, -77.93917 34.014..."
750327119,"POLYGON ((-79.51194 33.99761, -79.51172 33.997..."


In [None]:
# Reading directly as a PyArrow Table is about 2x faster (see the timings below)
info, table = pyogrio.read_arrow(tdx_parquet_path)
print(info)
table

{'crs': 'EPSG:4326', 'encoding': 'UTF-8', 'fields': array(['LINKNO'], dtype=object), 'geometry_type': 'MultiPolygon', 'geometry_name': 'geometry', 'fid_column': 'OGC_FID'}


pyarrow.Table
geometry: binary
LINKNO: int64
----
geometry: [[010300000001000000710400006721FC3CE46D51C0E9933EE9933647407A5D9464CE6D51C0...

## Read Speed comparison

In [12]:
%%timeit
# Read entire file with Geopandas for comparison
gpd.read_parquet(tdx_parquet_path)
# 13.6 s ± 290 ms

13.6 s ± 290 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [13]:
%%timeit
# Use pyarrow for upper limit of read speed for geometries
pyogrio.read_arrow(tdx_parquet_path)
# 6.97 s
# 2x faster!

6.97 s ± 94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# Compare with Streamnet

In [43]:
# Scan the files in the data directory and select only the streamnet (blueline) GeoParquet files
files_to_process = []
for item in tdx_dir.iterdir():
    if item.is_file() and 'streamnet' in item.name and item.suffix=='.parquet':
        files_to_process.append(item)
files_to_process

[PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01.parquet'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01_mnsi_test.parquet'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/TDX_streamnet_7020038340_01_mnsi.parquet')]

In [44]:
tdx_stream_mnsi_path = files_to_process[1]
tdx_stream_mnsi_path.name

'TDX_streamnet_7020038340_01_mnsi_test.parquet'

In [45]:
stream_mnsi_gdf = gpd.read_parquet(tdx_stream_mnsi_path)
stream_mnsi_gdf.info()
stream_mnsi_gdf.index

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 140097 entries, 750000000 to 750000589
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   DSLINKNO       140097 non-null  int32   
 1   USLINKNO1      140097 non-null  int32   
 2   USLINKNO2      140097 non-null  int32   
 3   ROOT_ID        140097 non-null  int32   
 4   DISCOVER_TIME  140097 non-null  int32   
 5   FINISH_TIME    140097 non-null  int32   
 6   strmOrder      140097 non-null  int32   
 7   Length         140097 non-null  float64 
 8   Magnitude      140097 non-null  int32   
 9   DSContArea     140097 non-null  float64 
 10  strmDrop       140097 non-null  float64 
 11  Slope          140097 non-null  float64 
 12  StraightL      140097 non-null  float64 
 13  USContArea     140097 non-null  float64 
 14  DOUTEND        140097 non-null  float64 
 15  DOUTSTART      140097 non-null  float64 
 16  DOUTMID        140097 non-null  float64 
 

Index([750000000, 750000001, 750000593, 750001777, 750000002, 750000592,
       750000594, 750001185, 750001186, 750001778,
       ...
       750001178, 750001179, 750001770, 750001771, 750002362, 750000587,
       750001180, 750001772, 750000588, 750000589],
      dtype='int32', name='LINKNO', length=140097)

## Merge Data
Merge the basins and streamnet datasets to confirm that their LINKNO values match.

In [46]:
# Try merging data
basins_gdf = gdf.copy(deep=True)


In [47]:
columns_to_merge = ['DSContArea', 'USContArea']

# Merge confirms that their LINKNO values match
# Although there are not as many basins as there are stream reaches!
basins_test_gdf = pd.merge(
    basins_gdf, 
    stream_mnsi_gdf[columns_to_merge], 
    how='right', 
    on='LINKNO',
)
basins_test_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 140097 entries, 750000000 to 750000589
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype   
---  ------      --------------   -----   
 0   geometry    140053 non-null  geometry
 1   DSContArea  140097 non-null  float64 
 2   USContArea  140097 non-null  float64 
dtypes: float64(2), geometry(1)
memory usage: 4.3 MB


**44 streams have no basins!!**
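
One way to make this kind of check explicit is the `indicator=True` option of `pandas.merge()`, which labels each row by the side it came from. The sketch below uses two tiny, made-up tables in place of the real streamnet and basins data, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the streamnet and basins tables, both indexed by LINKNO.
streams = pd.DataFrame(
    {"DSContArea": [1.0, 2.0, 3.0]},
    index=pd.Index([750000000, 750000001, 750000002], name="LINKNO"),
)
basins = pd.DataFrame(
    {"geometry": ["polygon_a", "polygon_b"]},
    index=pd.Index([750000001, 750000002], name="LINKNO"),
)

# Keep every stream; the _merge column records which rows found a basin.
merged = pd.merge(
    streams, basins,
    how="left", left_index=True, right_index=True,
    indicator=True,
)
missing = merged.index[merged["_merge"] == "left_only"].tolist()
print(missing)  # [750000000]
```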

## Streams with no Basins

In [48]:
# Explore stream links with no basin geometry.
streams_no_basins_gdf = stream_mnsi_gdf[basins_test_gdf.geometry.isna()]
streams_no_basins_gdf.info()
streams_no_basins_gdf.head()

<class 'geopandas.geodataframe.GeoDataFrame'>
Index: 44 entries, 750000000 to 750020103
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   DSLINKNO       44 non-null     int32   
 1   USLINKNO1      44 non-null     int32   
 2   USLINKNO2      44 non-null     int32   
 3   ROOT_ID        44 non-null     int32   
 4   DISCOVER_TIME  44 non-null     int32   
 5   FINISH_TIME    44 non-null     int32   
 6   strmOrder      44 non-null     int32   
 7   Length         44 non-null     float64 
 8   Magnitude      44 non-null     int32   
 9   DSContArea     44 non-null     float64 
 10  strmDrop       44 non-null     float64 
 11  Slope          44 non-null     float64 
 12  StraightL      44 non-null     float64 
 13  USContArea     44 non-null     float64 
 14  DOUTEND        44 non-null     float64 
 15  DOUTSTART      44 non-null     float64 
 16  DOUTMID        44 non-null     float64 
 17  geometry       44 n

LINKNO,DSLINKNO,USLINKNO1,USLINKNO2,ROOT_ID,DISCOVER_TIME,FINISH_TIME,strmOrder,Length,Magnitude,DSContArea,strmDrop,Slope,StraightL,USContArea,DOUTEND,DOUTSTART,DOUTMID,geometry
750000000,750001777,-1,-1,750021317,52,53,1,3847.9,1,9567845.0,42.07,0.010933,3233.7,5254868.0,45853.6,49701.4,47777.5,"LINESTRING (-69.67822 46.41356, -69.67822 46.4..."
750100709,750101301,750090644,750068149,750100710,5,16,3,0.0,6,95732660.0,0.0,0.0,0.0,95732660.0,7889.2,7889.2,7889.2,"LINESTRING (-69.74322 43.89322, -69.74322 43.8..."
750155209,750155801,750154617,750113177,750170058,1767,4980,7,0.0,1607,24103590000.0,0.0,0.0,0.0,24103590000.0,234019.3,234019.3,234019.3,"LINESTRING (-73.75744 42.54056, -73.75744 42.5..."
750127463,750128055,750141672,750010841,750129283,3471,4886,6,0.0,708,10990470000.0,0.0,0.0,0.0,10990470000.0,423795.8,423795.8,423795.8,"LINESTRING (-78.23989 39.65089, -78.23989 39.6..."
750099079,750099671,750055269,750055861,750102638,89,96,2,0.0,4,87144990.0,0.0,0.0,0.0,87144990.0,68889.3,68889.3,68889.3,"LINESTRING (-76.08133 38.47233, -76.08133 38.4..."


In [49]:
streams_no_basins_gdf.Length.value_counts()

Length
0.0       43
3847.9     1
Name: count, dtype: int64

**NOTE:** All but one have zero stream length. The one with a nonzero length is a headwater stream at the edge of the TDXHydroRegion.

We will save these LINKNO values for further exploration, below.

# Alternate: Create Basins with MNSI

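The goal of this alternate approach is to read only the basins file when delineating the upstream watershed boundary, by storing the MNSI fields together with the basin geometries.

It uses new functions in the package source directory to streamline production.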

This helper function was developed in the `sandbox/modified_nested_set_index.ipynb` notebook.
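
To illustrate why the MNSI fields help with upstream delineation, here is a toy sketch in plain Python. It is not the package's implementation. It assumes a numbering in which each link takes one tick, so that a link's finish value equals its discover value plus the number of links in its upstream subtree. That assumption is inferred from the sample rows shown earlier (for example, the link with `Magnitude` 6 has `FINISH_TIME - DISCOVER_TIME = 11`, which is `2 * 6 - 1` links). The function names and the toy network are made up for this example.

```python
def assign_nested_set_index(
    downstream: dict[int, int], outlet: int
) -> dict[int, tuple[int, int]]:
    """Assign (discover, finish) pairs by walking upstream from the outlet.

    `downstream` maps each link to the link it drains into
    (a toy stand-in for the DSLINKNO column).
    """
    upstream: dict[int, list[int]] = {}
    for link, ds_link in downstream.items():
        upstream.setdefault(ds_link, []).append(link)

    index: dict[int, tuple[int, int]] = {}
    counter = 0

    def visit(link: int) -> None:
        nonlocal counter
        discover = counter
        counter += 1  # one tick per link
        for up_link in sorted(upstream.get(link, [])):
            visit(up_link)
        index[link] = (discover, counter)  # finish = discover + subtree size

    visit(outlet)  # recursion is fine for a toy network, not for large ones
    return index


def upstream_links(index: dict[int, tuple[int, int]], link: int) -> list[int]:
    """Select a link and everything upstream of it with one range test."""
    discover, finish = index[link]
    return sorted(k for k, (d, _) in index.items() if discover <= d < finish)


# Toy network: links 2 and 3 drain into 1; links 4 and 5 drain into 2.
toy_index = assign_nested_set_index({2: 1, 3: 1, 4: 2, 5: 2}, outlet=1)
print(toy_index[2])                  # (1, 4)
print(upstream_links(toy_index, 2))  # [2, 4, 5]
```

The point of the sketch is that, once the index is stored with each basin, selecting an upstream watershed becomes a simple range filter rather than a network traversal.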

In [63]:
reload(gh.process)

<module 'global_hydrography.process' from '/Users/aaufdenkampe/Documents/Python/global-hydrography/src/global_hydrography/process.py'>

## `process_tdx_streams_basins()` function

Process a pair of TDXHydro streamnet and streamreach_basins files for 
a given TDX Hydro Region, creating a set of GeoParquet files ready for use 
by Model My Watershed. This processing includes:
- Reads the 'TDX_streamnet*.gpkg' file provided by NGA, converts LINKNO fields to 
globally unique values, calculates and adds three new Modified Nested Set Index (MNSI) 
fields, drops useless fields, and sets LINKNO as the index.
- Reads the 'TDX_streamreach_basins*.gpkg' file provided by NGA, renames 'streamID' 
to LINKNO, converts LINKNO to globally unique values, and sets LINKNO as the index.
- Moves MNSI fields from streamnet to basins datasets, saving a dataset of streams
that don't have a matching basin geometry.
- Saves three output datasets to GeoParquet files in the output directory.

Parameters:  
- input_dir: Directory with raw TDX Hydro GeoPackage ('.gpkg') files. 
- output_dir: Directory for saving processed GeoParquet ('.parquet') files. 
- tdx_hydro_region: The 10-digit TDX Hydro Region 
- preprocessor: An instance of the TDXPreprocessor class. 

Returns: a list of output file paths  
- TDX_streamnet_*.parquet  
- TDX_streamreach_basins_mnsi_*.parquet  
- TDX_streams_no_basin_*.parquet  


In [6]:
# define a helper function for the operation
def process_tdx_streams_basins(
    input_dir: Path,
    output_dir: Path,
    tdx_hydro_region: int, 
    preprocessor:TDXPreprocessor
) -> list[Path]:
    """Process a pair of TDXHydro streamnet and streamreach_basins files for 
    a given TDX Hydro Region, creating a set of GeoParquet files ready for use 
    by Model My Watershed. This processing includes:
    - Reads the 'TDX_streamnet*.gpkg' file provided by NGA, converts LINKNO fields to 
    globally unique values, calculates and adds three new Modified Nested Set Index (MNSI) 
    fields, drops useless fields, and sets LINKNO as the index.
    - Reads the 'TDX_streamreach_basins*.gpkg' file provided by NGA, renames 'streamID' 
    to LINKNO, converts LINKNO to globally unique values, and sets LINKNO as the index.
    - Moves MNSI fields from streamnet to basins datasets, saving a dataset of streams
    that don't have a matching basin geometry.
    - Saves three output datasets to GeoParquet files in the output directory.

    Parameters:
        input_dir: Directory with raw TDX Hydro GeoPackage ('.gpkg') files.
        output_dir: Directory for saving processed GeoParquet ('.parquet') files.
        tdx_hydro_region: The 10-digit TDX Hydro Region
        preprocessor: An instance of the TDXPreprocessor class.

    Returns: a list of output file paths
        TDX_streamnet_*.parquet  
        TDX_streamreach_basins_mnsi_*.parquet  
        TDX_streams_no_basin_*.parquet  
    """
    # Get file paths
    print (f"Processing TDXHydroRegion = {tdx_hydro_region}")
    streamnet_file, basins_file = gh.process.select_tdx_files(
        input_dir, tdx_hydro_region,'.gpkg')
    

    ## Process streamnet file ##
    # get streamnet file metadata
    streamnet_info = pyogrio.read_info(streamnet_file, layer=0)
    print(f"  Reading: layer = {streamnet_info['layer_name']} " 
        f"last updated {streamnet_info['layer_metadata']['DBF_DATE_LAST_UPDATE']}"
    )
    
    # open streamnet file as GeoDataFrame
    streamnet_gdf = gpd.read_file(streamnet_file, engine='pyogrio', layer=0, use_arrow=True)

    # apply preprocessing to make linkno globally unique
    preprocessor.tdx_to_global_linkno(streamnet_gdf, tdx_hydro_region)

    # apply preprocessing to drop columns with no value
    preprocessor.tdx_drop_useless_columns(streamnet_gdf)

    # compute the modified nested set index
    print('  Computing: modified nested set index')
    streamnet_gdf = gh.mnsi.modified_nest_set_index(streamnet_gdf)

    # Set 'LINKNO' as index, to facilitate selection
    streamnet_gdf.set_index('LINKNO', inplace=True)


    ## Process basins file ##
    # get basins file metadata
    basins_info = pyogrio.read_info(basins_file, layer=0)
    print(f"  Reading: layer = {basins_info['layer_name']}")

    # open basins file as GeoDataFrame
    basins_gdf = gpd.read_file(basins_file, engine='pyogrio', layer=0, use_arrow=True)

    # Rename 'streamID' to 'LINKNO' to facilitate interoperability 
    # with streamnet files
    basins_gdf.rename(columns={'streamID':'LINKNO'}, inplace=True)
    
    # apply preprocessing to make linkno globally unique
    preprocessor.tdx_to_global_linkno(basins_gdf, tdx_hydro_region)

    # Set 'LINKNO' as index, to facilitate selection
    basins_gdf.set_index('LINKNO', inplace=True)

    
    ## Move MNSI fields from streamnet to basins ##
    print(f"  Moving MNSI fields from streamnet to basins datasets.")
    basins_mnsi_gdf, streams_no_basin_gdf = gh.process.create_basins_mnsi(
        basins_gdf,
        streamnet_gdf,
    )
    # Drop MNSI fields from streamnet_gdf
    streamnet_gdf.drop(columns=gh.mnsi.MNSI_FIELDS, inplace=True)


    ## Write GeoParquet files ##
    gdf_dict = {
        'streamnet': streamnet_gdf,
        'streamreach_basins_mnsi': basins_mnsi_gdf,
        'streams_no_basin': streams_no_basin_gdf,
    }
    parquet_paths = []
    for dataset, gdf in gdf_dict.items():
        path = output_dir / f"TDX_{dataset}_{tdx_hydro_region}_01.parquet"
        parquet_paths.append(path)
        gdf.to_parquet(path, compression='zstd')
        print(f'  File saved: {path.name}')

    return parquet_paths

In [7]:
reload(gh.mnsi)
reload(gh.process)

<module 'global_hydrography.process' from '/Users/aaufdenkampe/Documents/Python/global-hydrography/src/global_hydrography/process.py'>

In [9]:
#initialize a preprocessor instance
#we want to reuse this object to take advantage of the cached TDX Basin Id crosswalk
preprocessor = TDXPreprocessor()

In [10]:
# Try function
process_tdx_streams_basins(
    tdx_dir,
    tdx_dir / 'processed',
    7020038340,
    preprocessor,
)
# 1m 25s 

Processing TDXHydroRegion = 7020038340
  Reading: layer = TDX_streamnet_7020038340_01 last updated 2021-12-08
  Computing: modified nested set index
  Reading: layer = basins
  Moving MNSI files from streamnet to basins datasets.
  File saved: TDX_streamnet_7020038340_01.parquet
  File saved: TDX_streamreach_basins_mnsi_7020038340_01.parquet
  File saved: TDX_streams_no_basin_7020038340_01.parquet


[PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/processed/TDX_streamnet_7020038340_01.parquet'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/processed/TDX_streamreach_basins_mnsi_7020038340_01.parquet'),
 PosixPath('/Users/aaufdenkampe/Documents/Python/global-hydrography/data_temp/nga/processed/TDX_streams_no_basin_7020038340_01.parquet')]
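
To run the same processing for every downloaded region, the region IDs could be collected from the file names first. The sketch below assumes the `TDX_streamnet_*.gpkg` naming convention seen above, and `find_tdx_regions()` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
import re


def find_tdx_regions(input_dir: Path) -> list[int]:
    """Collect the unique 10-digit region IDs found in streamnet file names."""
    regions = set()
    for item in input_dir.glob("TDX_streamnet_*.gpkg"):
        match = re.search(r"\d{10}", item.name)
        if match:
            regions.add(int(match.group(0)))
    return sorted(regions)


# Usage sketch (not run here), reusing one preprocessor for every region:
# for region in find_tdx_regions(tdx_dir):
#     process_tdx_streams_basins(tdx_dir, tdx_dir / 'processed', region, preprocessor)
```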

## Read Only Selected Rows for Speed

This uses `**kwargs` in [`geopandas.read_parquet()`](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_parquet.html) that are passed to [`pyarrow.parquet.read_table()`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table).

Although partitioning the GeoParquet file can dramatically improve read performance, it is also possible to gain some benefit for non-partitioned files. See https://dzone.com/articles/parquet-data-filtering-with-pandas

In [13]:
tdx_mnsi_fp = (
    tdx_dir / 'processed' / 
    'TDX_streamreach_basins_mnsi_7020038340_01.parquet'
)



In [12]:
%%timeit
# Read entire file with Geopandas for comparison
gpd.read_parquet(tdx_mnsi_fp)
# 17.6 s ± 204 ms

17.6 s ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%%timeit
# Read 4897 rows where 'ROOT_ID'==750288662
gpd.read_parquet(
    tdx_mnsi_fp,
    filters=[('ROOT_ID', '==', 750288662)]
)
# 8.65 s ± 74.3 ms


8.65 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
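
The same `filters` mechanism could, in principle, narrow the read further to a single upstream watershed by combining `ROOT_ID` with a range on `DISCOVER_TIME`. The sketch below only builds the filter list. It assumes that the basins file carries a `DISCOVER_TIME` column and that upstream links fall in the half-open range from a link's `DISCOVER_TIME` up to, but not including, its `FINISH_TIME`. Both are assumptions inferred from the sample rows earlier in this notebook, and `upstream_filters()` is a hypothetical helper.

```python
def upstream_filters(root_id: int, discover_time: int, finish_time: int) -> list[tuple]:
    """Build a pyarrow-style filter list for one link's upstream watershed (sketch)."""
    return [
        ("ROOT_ID", "==", root_id),
        ("DISCOVER_TIME", ">=", discover_time),
        ("DISCOVER_TIME", "<", finish_time),  # assumed half-open range
    ]


# Values taken from the sample row for LINKNO 750100709 shown earlier.
filters = upstream_filters(750100710, 5, 16)

# Usage sketch (not run here):
# gpd.read_parquet(tdx_mnsi_fp, filters=filters)
```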
