
Function filter fragments returns the whole dataset #16

Closed
nagyrobir opened this issue Sep 30, 2023 · 11 comments

Comments

@nagyrobir

nagyrobir commented Sep 30, 2023

Hi! Thank you for the work you do for the community. I don't know how far along this project is, but I have installed the libraries. I imported two GeoParquet files (one generated with QGIS, the other with geopandas) and they read in fine. I might have misunderstood the code, but is the geometry input to the `filter_fragments` method of the GeoDataset class meant to be used as a filter, so that only the data intersecting the bounding box of that input is read? If so, I have tried the following:

```python
import geopandas as gpd
from pyarrow.parquet import read_table
import geoarrow.pyarrow as ga

# read the GeoParquet file and wrap it as a geoarrow dataset
tb = read_table("/home/parquet/buildings.parquet")
dataset = ga.dataset(tb, geometry_columns=["geometry"])

# use the first feature of a GeoPackage as the filter geometry (WKT)
gpdf_mask = gpd.read_file("/home/shapefiles/area_1.gpkg")
bnds = gpdf_mask.iloc[0].geometry.wkt
x = dataset.filter_fragments(bnds)
```

I have tried with datasets in both EPSG:4326 and EPSG:25833. When I run len(x.to_table()), the length of the result is the same as the length of the original dataset. The geometries are stored as WKB.

I have used the following files for testing. I don't know if this is a bug or something I am doing wrong.

@paleolimbot
Contributor

The GeoDataset is pretty experimental (and should be marked more clearly as such!)...it mostly works if you've taken care to write the data very carefully such that row groups are small and contain vaguely proximal features. There's no function implemented to actually do that, so it's not all that useful 😬 .

(I'll take a look at the next issue which is almost certainly a bug!)

@nagyrobir
Author

Oh shoot! It would be so ridiculously cool to have predicate pushdown at the geometry level that works as an "intersects" check, even if it is just against a bounding box. I think the people developing Apache Sedona did it, but having it implemented at the function level on a PC would be wonderful. Thanks for the response!

paleolimbot added a commit that referenced this issue Sep 30, 2023
@paleolimbot
Contributor

It could almost certainly be done as a Python user-defined function based on the functions that currently exist in this package. That's not currently a development priority but it's definitely a good idea!
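As a rough illustration of what such a user-defined function could do, here is a plain-Python toy sketch (the `filter_fragments` below is a hypothetical stand-in, not the geoarrow method): compute a bounding box per fragment and keep only the fragments whose box intersects the query box.

```python
# Toy sketch of fragment-level bbox pruning. Each "fragment" is a list
# of (x, y) points; real code would use per-row-group statistics.

def bbox(points):
    # Bounding box of a list of (x, y) points as (xmin, ymin, xmax, ymax).
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def bboxes_intersect(a, b):
    # True when two (xmin, ymin, xmax, ymax) boxes overlap.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_fragments(fragments, query_box):
    # Keep only fragments whose bounding box touches the query box;
    # the survivors may still contain non-matching rows (a superset).
    return [f for f in fragments if bboxes_intersect(bbox(f), query_box)]
```

The key point is that this only prunes whole fragments: an exact intersection test on the remaining rows would still be needed afterwards.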

@paleolimbot
Contributor

Note that there is geoarrow.pyarrow.box(), which will calculate feature-level bounding boxes.

@nagyrobir
Author

I am not knowledgeable enough to know how difficult this would be to implement, but having the functionality of "select all geometries that intersect this bounding box" without reading in all the data would be a game changer for geospatial analysts working with millions of features. So I'll be watching for that PR. :D

@nagyrobir
Author

I was just wondering, @paleolimbot: how could I write the data so that the function works, based on the files that I have provided? Any ideas?

@nagyrobir nagyrobir reopened this Oct 16, 2023
@paleolimbot
Contributor

The easiest way that I know about is a "hilbert sort" (geopandas.GeoSeries.hilbert_distance()) and then pyarrow.parquet.write_table() with relatively small row groups.
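The "hilbert sort" idea can be sketched in plain Python (a toy illustration; `xy2d` and `hilbert_sort` are hypothetical helpers here, and GeoPandas computes the real thing via `hilbert_distance()`): map each point to its distance along a Hilbert curve and sort by that distance, so features that are close in space tend to end up close together in the file.

```python
def xy2d(n, x, y):
    # Classic iterative conversion: index of integer cell (x, y) along
    # a Hilbert curve filling an n-by-n grid (n a power of two).
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the curve connects up correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_sort(points, n=1 << 16):
    # Sort integer (x, y) points by Hilbert distance; neighbours in
    # space tend to become neighbours in the sorted order.
    return sorted(points, key=lambda p: xy2d(n, p[0], p[1]))
```

After sorting, writing with small row groups gives each row group a compact bounding box, which is what makes fragment-level pruning effective.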

@nagyrobir
Author

nagyrobir commented Oct 17, 2023

I tried so hard and got so far, but in the end... I couldn't make it work. It still returns the whole dataset. The row groups come back as batches of 50, which are grouped. Am I missing something? (*cries)

```python
import geopandas as gpd
from geoarrow.pyarrow.dataset import GeoDataset
from pyarrow.dataset import dataset as pads

# set all paths so nothing needs to be changed in the code
building_file = "buildings_wgs84.parquet"
building_reordered = "buildings_reordered.parquet"
filter_area = "area.gpkg"

# read the original dataset
pa_ds = pads(building_file)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

# hilbert_distance + sorting by index
gpd_ds = gpd.GeoDataFrame(ga_ds.to_table().to_pandas())
gpd_ds["geometry"] = gpd.GeoSeries.from_wkb(gpd_ds["geometry"])
sorted_idx = gpd_ds["geometry"].hilbert_distance().sort_values().index
gdf_sorted = gpd_ds.reindex(sorted_idx)

# write it back to GeoParquet
## COMMENT OUT IF YOU DO NOT WANT TO OVERWRITE
gdf_sorted.to_parquet(building_reordered, row_group_size=100)

# import the mask and create a WKT out of it
gpdf_mask = gpd.read_file(filter_area, engine="pyogrio").explode()
gpdf_mask = gpdf_mask.set_crs(epsg=4326)
bnds = gpdf_mask["geometry"].iloc[0].wkt

# read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])

filtered_ds = ga_ds.filter_fragments(bnds).to_table()
len(filtered_ds) == len(gdf_sorted)
```

The length of the filtered table is still equal to the length of the full dataset.

@jorisvandenbossche

Just a quick naive question: are you sure the bnds are in the same CRS as the data in ga_ds? This won't be checked, and e.g. querying EPSG:4326 data with a box of projected coordinates can be a silent bug.
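The silent-CRS-mismatch pitfall can be shown with plain numbers (hypothetical extents; `bboxes_intersect` below is a toy helper, not geoarrow's code): a query box in projected metres never overlaps data stored in degrees, so a bbox filter silently matches nothing.

```python
def bboxes_intersect(a, b):
    # True when two (xmin, ymin, xmax, ymax) boxes overlap.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Data bbox in EPSG:4326 degrees (hypothetical extent in Germany):
data_deg = (13.0, 52.0, 13.5, 52.5)
# Roughly the same area expressed in EPSG:25833 metres:
query_m = (380000.0, 5760000.0, 420000.0, 5820000.0)

# The metre-valued box never overlaps the degree-valued one, so the
# filter silently drops everything; no error is raised.
assert not bboxes_intersect(data_deg, query_m)
```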

Can you provide a reproducible example? The link for the files above doesn't work anymore.

@nagyrobir
Author

Hi @jorisvandenbossche. I apologize for the unclear comment. I have updated the above code snippet, and now you can access the files again at the link provided in the beginning.

@nagyrobir
Author

nagyrobir commented Oct 17, 2023

I think I have sort of made it work. Instead of reading the dataset in as a pyarrow dataset and wrapping it as a GeoDataset (via `from pyarrow.dataset import dataset as pads`), I used `from geoarrow.pyarrow.dataset import dataset`.

```python
# read back the sorted parquet file
pa_ds = pads(building_reordered)
ga_ds = GeoDataset(pa_ds, geometry_columns=["geometry"])
filtered_ds = ga_ds.filter_fragments(bnds).to_table()
```

I've used this:

```python
from geoarrow.pyarrow.dataset import dataset

ga_ds = dataset("buildings_reordered.parquet", geometry_columns=["geometry"], use_row_groups=True)
filtered_ds = ga_ds.filter_fragments(bnds).to_table()
```

Of course the result in filtered_ds is much larger than the exact intersection, but that would be consistent with the documentation, where the intersection is applied at the bounding-box level.
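Since the filter prunes at the row-group (bounding-box) level, the result is a superset of the exact answer, and a second, exact pass is still needed. A plain-Python sketch of that refinement step (`exact_refine` is a hypothetical helper, not part of geoarrow):

```python
def exact_refine(rows, query_box):
    # rows: (x, y) points that survived coarse bbox pruning; keep only
    # those that actually fall inside the query box.
    xmin, ymin, xmax, ymax = query_box
    return [(x, y) for (x, y) in rows
            if xmin <= x <= xmax and ymin <= y <= ymax]
```

In practice this exact step would be an `intersects` test against the real filter geometry (e.g. with shapely), not just a point-in-box check.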
