Feature: Add Spatial Index support #7
Comments
Would love to see S2/H3 cell support.
My publication and OSM converter could be useful for H3-based OpenStreetMap data indexing, and as a real-world dataset for the DuckDB Spatial extension.
@Maxxen Excited to see this on the roadmap -- having spatial indices would make my organization seriously consider using DuckDB extensively. Is there a (very) approximate timeline for adding this feature? Much appreciated, thanks.
Hi @colinjamesstewart! Unfortunately not; we might be able to start prototyping non-persistent, in-memory-only indexes before the end of the year. But would you mind telling me a little about the workflows you have that make you think the lack of spatial indexes would be a blocker right now? We recently landed a pretty big optimization for spatial predicate joins, and we can do the same for within-range/k-nearest joins too.
Thanks, @Maxxen. We have a number of spatial databases with millions or even billions of records across Canada and the US, and we often need to do spatial queries (e.g. ST_Contains, ST_Intersection, etc.) on specific regions, such as cities and metro areas. I'll look into how well DuckDB handles queries on these databases, but in my experience, spatial indexing is essential when working with this amount of data.
@colinjamesstewart For sure, I understand that once you reach billions of records, being able to prune out intersection candidates at the storage level becomes more or less mandatory, but at city-region scale we are pretty efficient at brute-forcing it. There's still a lot to figure out when it comes to custom indexes in DuckDB in general, but we are progressing towards that over at the core project, and hopefully we will be able to utilize that work in spatial soon.
@Maxxen I ran a quick spatial-query performance test (using ST_Intersects) to compare DuckDB with Sedona, which we'd previously been using. DuckDB was often 30x faster. Performance starts to degrade when going from a table with millions of records to one with billions, but the performance on tables with <50M records is pretty impressive given that there's no spatial indexing. |
Just another use case here: We would like to do 2D points-in-polygon-filtering on a point cloud (xyz data) with 4B points. Performance decreases roughly linearly with table size. With 700k points a query takes about 0.07s, with 7M points about 0.7s, with 70M points about 7s. With 4B points, analysis is a bit cumbersome. If the data fits in memory, we can do much faster queries with GeoPandas and its spatial index (R-tree afaik). |
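The linear scaling described above is exactly what a brute-force scan looks like: every point is tested against the polygon, so time grows in proportion to row count. A minimal, self-contained sketch (names and numbers invented for illustration, not DuckDB's implementation) of that filter, with a cheap bounding-box prefilter in front of the exact test:

```python
# Hypothetical sketch: brute-force 2D points-in-polygon filtering.
# Without a spatial index, every point is tested, so cost grows
# linearly with point count -- matching the ~10x time per 10x rows
# reported above. A bounding-box prefilter cheapens most rejections
# but does not change the asymptotic behaviour; an R-tree would.
import random

def point_in_polygon(x, y, poly):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_points(points, poly):
    # Cheap bounding-box prefilter first; exact test only for candidates.
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    minx, maxx, miny, maxy = min(xs), max(xs), min(ys), max(ys)
    return [
        (x, y) for (x, y) in points
        if minx <= x <= maxx and miny <= y <= maxy
        and point_in_polygon(x, y, poly)
    ]

random.seed(0)
square = [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]
pts = [(random.random(), random.random()) for _ in range(10_000)]
hits = filter_points(pts, square)
print(len(hits))  # roughly a quarter of the points fall inside
```

A spatial index such as GeoPandas' R-tree avoids visiting most points at all, which is why it wins once the data fits in memory.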
Does duckdb-spatial utilize flatgeobuf spatial indices if they're available? Since they're optionally part of the file itself, supporting spatial indexing for flatgeobuf files might be a nice intermediate step on the way to full spatial index support.
Yes, since our flatgeobuf support is based on GDAL, we do support pruning when reading flatgeobuf files; there is an optional "filter_bbox" argument you can pass to st_read to use it.
Do you think the h3 extension, https://github.com/isaacbrodsky/h3-duckdb, might become an official submodule of this extension at some point? I think it could be very valuable, but unless I'm misunderstanding, users presently need to build that module from source together with DuckDB?
Yes and no. We might add some utility functions in spatial to make it easier to integrate with H3, and I've briefly exchanged a few words with Isaac about DuckDB Labs eventually adopting and distributing the H3 extension as a separate extension. But we don't really have a set policy on adopting third-party extensions, taking into account e.g. security, maintenance and support. However, we are actively working on enabling support for some form of custom extension repositories that you can opt into, so that you can do e.g.
Ideally we'd implement an R*-tree, but we should research what the current best bulk-loading algorithm is, since we basically only do bulk loads by indexing one vector at a time.
A first step might be to just use the ART index on a bounding-box expression, or use some clever space-filling-curve encoding like a Hilbert curve or geohash. We could also look into alternative encodings like "quad keys" or "s2/h3 cells". Once we have some spatial predicate functions implemented, we should implement optimizer rules to look for these sorts of indexes and compare how useful they are performance-wise. It's probably not going to beat an R-tree, but it will probably be easier to implement to start with.
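The space-filling-curve idea above can be illustrated with the simplest such encoding, a Z-order (Morton) code: interleave the bits of quantized x and y into a single integer, which a 1-D ordered index like ART can then index directly. This is a hedged sketch with invented names, a simpler cousin of the Hilbert-curve and geohash encodings mentioned, not a proposed implementation:

```python
# Hypothetical sketch: map a 2D location to one integer key by bit
# interleaving (Z-order / Morton code). Nearby cells tend to share key
# prefixes, which is the pruning property geohash-style encodings rely on.

def interleave(x, y, bits=16):
    """Interleave the low `bits` bits of x and y; x takes even bit positions."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def morton_key(lon, lat, bits=16):
    """Quantize lon/lat onto a 2^bits x 2^bits grid, then interleave."""
    cells = (1 << bits) - 1
    x = round((lon + 180.0) / 360.0 * cells)
    y = round((lat + 90.0) / 180.0 * cells)
    return interleave(x, y, bits)

# Nearby points usually produce numerically closer keys than distant ones
# (Z-order has discontinuities, which is one reason Hilbert curves are
# often preferred).
a = morton_key(11.97, 57.70)   # Gothenburg
b = morton_key(11.98, 57.71)   # about a kilometre away
c = morton_key(-79.38, 43.65)  # Toronto
print(abs(a - b) < abs(a - c))  # True
```

An optimizer rule could then rewrite a bounding-box predicate into one or more Morton-key range scans, which is the comparison against a real R-tree that the issue proposes.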