
Function filter fragments returns the whole dataset #16

Closed
nagyrobir opened this issue Sep 30, 2023 · 11 comments

Comments

@nagyrobir

nagyrobir commented Sep 30, 2023

Hi! Thank you for the work you do for the community. I don't know how far along this project is, but I have installed the libraries. I imported two GeoParquet files (one generated with QGIS, the other with geopandas) and they read in fine. I might have misunderstood the code, but is the geometry input to the `filter_fragments` method of the GeoDataset class meant to be used as a filter, so that only the data intersecting the bounding box of that input is read? If so, I have tried the following:

```python
import geopandas as gpd
from pyarrow.parquet import read_table
import geoarrow.pyarrow as ga

# read the GeoParquet file and wrap it as a geoarrow dataset
tb = read_table("/home/parquet/buildings.parquet")
dataset = ga.dataset(tb, geometry_columns=["geometry"])

# use the first feature of a GeoPackage as the filter geometry (WKT)
gpdf_mask = gpd.read_file("/home/shapefiles/area_1.gpkg")
bnds = gpdf_mask.iloc[0].geometry.wkt
x = dataset.filter_fragments(bnds)
```

I have tried with datasets in both EPSG:4326 and EPSG:25833. When I run len(x.to_table()), the length of the result is the same as the length of the original dataset. The geometries are stored as WKB.

I have used the following files for testing. I don't know if this is a bug or something I am doing wrong.

@paleolimbot
Contributor

The GeoDataset is pretty experimental (and should be marked more clearly as such!)...it mostly works if you've taken care to write the data very carefully such that row groups are small and contain vaguely proximal features. There's no function implemented to actually do that, so it's not all that useful 😬 .

(I'll take a look at the next issue which is almost certainly a bug!)

@nagyrobir
Author

Oh shoot! It would be so ridiculously cool to have predicate pushdown at the geometry level that works as an "intersects" check, even if it is just against a bounding box. I think the people developing Apache Sedona did it, but having it implemented at the function level on a PC would be wonderful. Thanks for the response!

paleolimbot added a commit that referenced this issue Sep 30, 2023
@paleolimbot
Contributor

It could almost certainly be done as a Python user-defined function based on the functions that currently exist in this package. That's not currently a development priority but it's definitely a good idea!
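As a rough illustration of what such a user-defined function could do, here is a plain-Python toy sketch (the `filter_fragments` below is a hypothetical stand-in, not the geoarrow method): compute a bounding box per fragment and keep only the fragments whose box intersects the query box.

```python
# Toy sketch of fragment-level bbox pruning. Each "fragment" is a list
# of (x, y) points; real code would use per-row-group statistics.

def bbox(points):
    # Bounding box of a list of (x, y) points as (xmin, ymin, xmax, ymax).
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_intersect(a, b):
    # True when two (xmin, ymin, xmax, ymax) boxes overlap.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_fragments(fragments, query_box):
    # Keep only fragments whose bounding box touches the query box;
    # the survivors may still contain non-matching rows (a superset).
    return [f for f in fragments if bboxes_intersect(bbox(f), query_box)]
```

The key point is that this only prunes whole fragments: an exact intersection test on the remaining rows would still be needed afterwards.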

@paleolimbot
Contributor

Note that there is geoarrow.pyarrow.box(), which will calculate feature-level bounding boxes.

@nagyrobir
Author

I am not knowledgeable enough to know how difficult this would be to implement, but having the functionality of "select all geometries that intersect this bounding box" without reading in all the data would be a game changer for geospatial analysts working with millions of features. So I'll be watching for that PR. :D

@nagyrobir
Author

I was just wondering, @paleolimbot: how could I write the data so that the function works, based on the files that I have provided? Any ideas?

@nagyrobir nagyrobir reopened this Oct 16, 2023
@paleolimbot
Contributor

The easiest way that I know about is a "hilbert sort" (geopandas.GeoSeries.hilbert_distance()) and then pyarrow.parquet.write_table() with relatively small row groups.
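The "hilbert sort" idea can be sketched in plain Python (a toy illustration; `xy2d` and `hilbert_sort` are hypothetical helpers here, and GeoPandas computes the real thing via `hilbert_distance()`): map each point to its distance along a Hilbert curve and sort by that distance, so features that are close in space tend to end up close together in the file.

```python
def xy2d(n, x, y):
    # Classic iterative conversion: index of integer cell (x, y) along
    # a Hilbert curve filling an n-by-n grid (n a power of two).
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the curve connects up correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_sort(points, n=1 << 16):
    # Sort integer (x, y) points by Hilbert distance; neighbours in
    # space tend to become neighbours in the sorted order.
    return sorted(points, key=lambda p: xy2d(n, p[0], p[1]))
```

After sorting, writing with small row groups gives each row group a compact bounding box, which is what makes fragment-level pruning effective.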

@nagyrobir
Author

nagyrobir commented Oct 17, 2023

I tried so hard and got so far, but in the end... I couldn't make it work. It still returns the whole dataset. The row groups come back as batches of 50, which are grouped. Am I missing something? (*cries)

```python
import geopandas as gpd
from geoarrow.pyarrow.dataset import GeoDataset
from pyarrow.dataset import dataset as pads

# set all paths so nothing needs to be changed in the code
building_file = "buildings_wgs84.parquet"
building_reordered = "buildings_reordered.parquet"
filter_area = "area.gpkg"

# read the original dataset
pa_ds = pads(building_file)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

# hilbert_distance + sorting by index
gpd_ds = gpd.GeoDataFrame(ga_ds.to_table().to_pandas())
gpd_ds["geometry"] = gpd.GeoSeries.from_wkb(gpd_ds["geometry"])
sorted_idx = gpd_ds["geometry"].hilbert_distance().sort_values().index
gdf_sorted = gpd_ds.reindex(sorted_idx)

# write it back to GeoParquet
## COMMENT OUT IF YOU DO NOT WANT TO OVERWRITE
gdf_sorted.to_parquet(building_reordered, row_group_size=100)

# import the mask and create a WKT out of it
gpdf_mask = gpd.read_file(filter_area, engine="pyogrio").explode()
gpdf_mask = gpdf_mask.set_crs(epsg=4326)
bnds = gpdf_mask["geometry"].iloc[0].wkt

# read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

filtered_ds = ga_ds.filter_fragments(bnds).to_table()
len(filtered_ds) == len(gdf_sorted)
```

The length of the filtered table is still equal to the length of the full dataset.

@jorisvandenbossche

Just a quick naive question: are you sure the bnds are in the same CRS as the data in ga_ds? This won't be checked, and e.g. querying EPSG:4326 data with a box of projected coordinates can be a silent bug.
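The silent-CRS-mismatch pitfall can be shown with plain numbers (hypothetical extents; `bboxes_intersect` below is a toy helper, not geoarrow's code): a query box in projected metres never overlaps data stored in degrees, so a bbox filter silently matches nothing.

```python
def bboxes_intersect(a, b):
    # True when two (xmin, ymin, xmax, ymax) boxes overlap.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Data bbox in EPSG:4326 degrees (hypothetical extent in Germany):
data_deg = (13.0, 52.0, 13.5, 52.5)
# Roughly the same area expressed in EPSG:25833 metres:
query_m = (380000.0, 5760000.0, 420000.0, 5820000.0)

# The metre-valued box never overlaps the degree-valued one, so the
# filter silently drops everything; no error is raised.
assert not bboxes_intersect(data_deg, query_m)
```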

Can you provide a reproducible example? The link for the files above doesn't work anymore.

@nagyrobir
Author

Hi @jorisvandenbossche. I apologize for the unclear comment. I have updated the above code snippet, and now you can access the files again at the link provided in the beginning.

@nagyrobir
Author

nagyrobir commented Oct 17, 2023

I think I have sort of made it work. Instead of reading the dataset in as a pyarrow dataset and wrapping it as a GeoDataset (via `from pyarrow.dataset import dataset as pads`), I used `from geoarrow.pyarrow.dataset import dataset`.

```python
# read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])
filtered_ds = ga_ds.filter_fragments(bnds).to_table()
```

I've used this:

```python
from geoarrow.pyarrow.dataset import dataset

ga_ds = dataset("buildings_reordered.parquet", geometry_columns=["geometry"], use_row_groups=True)
filtered_ds = ga_ds.filter_fragments(bnds).to_table()
```

Of course the result in filtered_ds is much larger than the exact intersection, but that would be consistent with the documentation, where the intersection is applied at the bounding-box level.
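Since the filter prunes at the row-group (bounding-box) level, the result is a superset of the exact answer, and a second, exact pass is still needed. A plain-Python sketch of that refinement step (`exact_refine` is a hypothetical helper, not part of geoarrow):

```python
def exact_refine(rows, query_box):
    # rows: (x, y) points that survived coarse bbox pruning; keep only
    # those that actually fall inside the query box.
    xmin, ymin, xmax, ymax = query_box
    return [(x, y) for (x, y) in rows
            if xmin <= x <= xmax and ymin <= y <= ymax]
```

In practice this exact step would be an `intersects` test against the real filter geometry (e.g. with shapely), not just a point-in-box check.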
