Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geoparquet format reader #37

Closed
roger120981 opened this issue May 21, 2022 · 13 comments
Closed

Geoparquet format reader #37

roger120981 opened this issue May 21, 2022 · 13 comments

Comments

@roger120981
Copy link

It is possible to incorporate in your library the new format launched by the OGC, GeoParquet..This is the link https://github.com/opengeospatial/geoparquet

@pka
Copy link
Member

pka commented May 21, 2022

A GeoPartquet reader would be welcome for sure. It shouldn't be very difficult to implement, since there are Rust implementations for reading Parquet (https://github.com/apache/arrow-rs) and AFAIK geometries are stored as WKB, which is already supported in geozero.

@BKDaugherty
Copy link

BKDaugherty commented May 29, 2022

I'm new to geospatial, but I'm gonna take a look at trying this out! (assuming that pull requests are welcome)

@kylebarron
Copy link
Member

Just a quick note that in versions of geoparquet so far WKB is the standard, but in future versions we hope to also enable an Arrow-native geometry encoding, see https://github.com/geopandas/geo-arrow-spec/blob/main/format.md. The difference between the two is that a binary buffer of WKB in memory still needs to be parsed to be used. In contrast, an Arrow-native implementation would store geometries in a layout standardized by the Arrow project. This means that it's O(1) to access any individual coordinate.

What this means is that using geozero on a GeoParquet dataset with geometries stored as WKB is essentially already supported by geozero, you just iterate over the items in the array and use the existing geozero WKB reader. Note here that an improvement would be to figure out how to let the WKB reader read directly from the Arrow buffer. I seemed to have to copy the section of the arrow buffer into a Vec<u8> to get geozero's WKB reader to accept it.

A separate geozero-geoarrow implementation would be to add support for the Arrow-native encoding. Note that https://github.com/geopandas/geo-arrow-spec/blob/main/format.md is a provisional, unreleased spec (though it was implemented by GDAL in 3.5.0). So if we have feedback while implementing a Rust reader, we should comment on the project. For prior work, there are WIP native arrow implementations in Python, C++, R, and a data repo.

@apps4uco
Copy link
Contributor

I am starting work on a geoparquet crate that will use geozero and could potentially be merged into geozero at some point.

FYI: I also will be working on a geoarrow crate and geodatafusion crate, stay tuned for developments....

@kylebarron
Copy link
Member

kylebarron commented Oct 1, 2022

A few notes... I've already started planning how to implement geoparquet and geoarrow and I'd love to discuss each of these so that we can align our work. It might also be worthwhile to discuss this offline, e.g. in the georust discord: https://discord.gg/63ueuxHu.

On GeoParquet:

  • If you're ok with loading data into Arrow, in many cases you don't need a geoparquet crate at all! If you load a geoparquet file into arrow, you automatically have geoarrow.
  • If you want to support loading data from Parquet into geo-types structs, then I think using geozero's setup will likely make things much faster. But you're going to hit the orphan rule, which says that you can't implement an external trait onto an external type. Because geozero owns the geozero traits and arrow(2)/parquet(2) own the structs, you have to create your own traits, which are then less interoperable with the rest of the system.
  • Are you planning to use parquet and arrow or parquet2 and arrow2?

On GeoArrow:

  • I've already been prototyping how to design geoarrow (around the proposed struct approach: Allow storing points/coordinates as struct arrays in addition to interleaved layout geoarrow/geoarrow#26) to make it fast and need fewer casts, so I'd love to chat more about your thoughts on this.
  • Similarly, are you planning to use arrow or arrow2? I'm locked into the arrow2 ecosystem, because I'm building geopolars around polars, and polars uses arrow2. The crates really aren't compatible with each other
  • Again, if you want to define traits onto arrow structs, you're going to hit the orphan rule again. Or at least, long term, we need geo to support reading data directly from arrow memory, and the way that will happen is by having traits defined in the geo crate, and then you won't be able to implement external traits from geo onto external structs from arrow/arrow2

@apps4uco
Copy link
Contributor

apps4uco commented Oct 4, 2022

I dont quite understand the problem you are referring to. I have written a
struct that has the following:
impl GeozeroDatasource for GeoParquetReader { .... }
and there are other crates eg geozero-shp that implement readers for geozero

The implementation I have currently uses parquet, and I was planning on having a feature to be able to select between parquet/parquet2.

I do want to use parquet rather than loading it into memory as I am targeting the use case of files that will not fit into memory. (Although I know that the arrow file could be memory mapped)

@kylebarron
Copy link
Member

It's still worthwhile to have a geoparquet/geoarrow implementation in geozero, so I don't want to dissuade you from that. My point is that today the georust/geo algorithms all use geo-specific structs. It's useful to be able to load geoparquet/geoarrow into those structs, but it also misses out on a huge reason for Arrow, being easy to transfer memory across languages without serialization.

I think my discussion comment on geo here as well as this blog post serve as useful reasons for why I see support for geoparquet/geoarrow in geozero as only a short-term stepping stone.

As mentioned in that discussion, I have a WIP implementation of geoarrow here.

and there are other crates eg geozero-shp that implement readers for geozero

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Although I know that the arrow file could be memory mapped

This is true... but with the geozero approach you need to first load a chunk of data into memory and then copy all of it into geo structs, which is not very memory efficient. This is why I'm excited about the idea of geometry access traits, as mentioned in this discussion, so that you could actually memory map Arrow directly without a full copy.

@michaelkirk
Copy link
Member

Yes, but if you're implementing the GeozeroDatasource trait directly on Arrow structs, you won't be able to do that from a third party crate. If you're creating your own structs... then you need to copy the memory from Arrow into your structs, which kinda defeats the purpose.

Do you need to actually copy the memory though? Things like moves into a new-type struct are optimized away:

https://rust.godbolt.org/z/8GabGrhbT

@kylebarron
Copy link
Member

kylebarron commented Oct 5, 2022

Things like moves into a new-type struct are optimized away

That's super interesting. I've never looked into low-level compiled output before.

@apps4uco Re-reading my comment I think my tone might've come off too harsh... I'm fully supportive of both approaches! I've tried to closely follow spec discussions on both GeoParquet and GeoArrow and I'd be happy to review your PRs if you'd like 🙂. Keep in mind that neither spec has reached 1.0 yet. WKB-encoded GeoParquet is the most stable (native encodings are still up in the air) so I'd start with that one. You might even be able to leverage my previous geozero PR: #39

@kylebarron
Copy link
Member

Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

@kylebarron
Copy link
Member

Would love some feedback on #80

@apps4uco
Copy link
Contributor

Geoparquet is getting ready for a 1.0.0-beta.1 soon opengeospatial/geoparquet#147, and I'd love to see support for this in geozero. @apps4uco do you have any existing work to share? Just want to avoid duplication of work

Hi, I have been busy on other things, and so havent really advanced on the implementation, For what its work here is the git repo with my latest changes: github.com:apps4uco/geoparquet.git

@kylebarron
Copy link
Member

The geoarrow crate now includes a geozero integration for all GeoArrow arrays as well as a GeoParquet reader. The upcoming 0.2 release will include a GeoParquet writer (it was merged yesterday).

In #186 we removed the minimal, outdated geoarrow integration from the geozero crate and are now pointing users to the geoarrow crate. Therefore, I'll close this issue.

Feel free to create issues on the geoarrow crate for any bugs with the GeoParquet reader or writer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants