Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Add Spatial Index support #7

Open
Maxxen opened this issue Feb 27, 2023 · 12 comments
Open

Feature: Add Spatial Index support #7

Maxxen opened this issue Feb 27, 2023 · 12 comments
Labels

Comments

@Maxxen
Copy link
Member

Maxxen commented Feb 27, 2023

Ideally we implement an R*-TREE, but we should research what the current best bulk-loading algorithm is, since we basically only do bulk-loads by indexing vectors at a time.

A first step might be to just use the ART index on a bounding box expression, or use some clever space-filling curve encoding like a hilbert curve or geohash. We could also look into alternative encodings like "quad keys" or "s2/h3 cells". Once we have some spatial predicate functions implemented we should implement optimizer rules to look for these sorts of indexes and compare how useful they are performance-wise. Its probably not gonna beat an R-Tree but will probably be easier to implement to start with.

@Maxxen Maxxen added the roadmap label Apr 3, 2023
@joshhopkins
Copy link

Would love to see S2/H3 cell support.

@igor-suhorukov
Copy link

My publication and OSM converter could be useful for H3 OpenStreetMap data indexing and as real world dataset for DuckDB Spatial extension.

@colinjamesstewart
Copy link

@Maxxen Excited to see this on the roadmap -- having spatial indices would make my organization seriously consider using DuckDB extensively. Is there a (very) approximate timeline for adding this feature? Much appreciated, thanks.

@Maxxen
Copy link
Member Author

Maxxen commented Oct 31, 2023

Hi @colinjamesstewart! Unfortunately not, we might maybe be able to start prototyping non-peristent in-memory only indexes before the end of the year. But would you mind telling me a little bit about the workflows you do that makes you think the lack of spatial indexes would be a blocker right now? We recently landed a pretty big optimization for spatial predicate joins, and we can do the same for within-range/knearest joins too.

@colinjamesstewart
Copy link

Thanks, @Maxxen. We have a number of spatial databases with millions or even billions of records across Canada and the US, and we often need to do spatial queries (e.g. ST_Contains, ST_Intersection, etc.) on specific regions, such as cities and metro areas. I'll look into how well DuckDB handles queries on these databases, but in my experience, spatial indexing is essential when working with this amount of data.

@Maxxen
Copy link
Member Author

Maxxen commented Nov 2, 2023

@colinjamesstewart For sure, I understand that once you reach billions of records, being able to prune out intersection candidates at the storage level becomes more or less mandatory, but at city-region scale we are pretty efficient at brute-forcing it. There's still a lot to figure out when it comes to custom indexes in DuckDB in general but we progressing towards that over at the core project and hopefully we will be able to utilize that work in spatial soon.

@colinjamesstewart
Copy link

@Maxxen I ran a quick spatial-query performance test (using ST_Intersects) to compare DuckDB with Sedona, which we'd previously been using. DuckDB was often 30x faster. Performance starts to degrade when going from a table with millions of records to one with billions, but the performance on tables with <50M records is pretty impressive given that there's no spatial indexing.

@sedot42
Copy link

sedot42 commented Feb 7, 2024

Just another use case here: We would like to do 2D points-in-polygon-filtering on a point cloud (xyz data) with 4B points. Performance decreases roughly linearly with table size. With 700k points a query takes about 0.07s, with 7M points about 0.7s, with 70M points about 7s. With 4B points, analysis is a bit cumbersome. If the data fits in memory, we can do much faster queries with GeoPandas and its spatial index (R-tree afaik).

@raphaellaude
Copy link

Does duckdb-spatial utilize flatgeobuf spatial indices if they're available? Since they're optionally part of the file itself supporting spatial indexing for flatgeobuf files might be a nice intermediate step on the way to full spatial index support.

@Maxxen
Copy link
Member Author

Maxxen commented Feb 27, 2024

Yes, since our flatgeobuf support is based on gdal we do support pruning when reading flatgeobuf files, there is an optional "filter_bbox" argument you can pass to st_read to use it.

@cboettig
Copy link

Do you think the h3 extension, https://github.com/isaacbrodsky/h3-duckdb, might become an official submodule of this module at some point? I think it could be very valuable, but unless I'm miss-understanding, presently users need to build that module with duckdb from source?

@Maxxen
Copy link
Member Author

Maxxen commented Mar 10, 2024

Yes and no. We might add some utility functions in spatial to make it easier to integrate with H3 andI've briefly exchanged a few words with Isaac about DuckDB labs eventually adopting and distributing the H3 extension as a separate extension. But we dont really have a set policy on adopting third party extensions taking into account e.g. security, maintenance and support.

However, we are actively working on enabling support for some form of custom extension repositories that you can opt-into so that you can do e.g. INSTALL <extension> FROM <some repository>, with the goal of us being able to set up an official "community" extension repository for commonly used third-party extensions, where do some basic vetting, set up nightly/weekly builds and host binaries. The H3 extension would be an ideal candidate to include in this repository of community extensions once we get something up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants