Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write Arrow Table/RecordBatchReader to GDAL #346

Merged
merged 58 commits into from Apr 25, 2024

Conversation

kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Jan 25, 2024

A very WIP implementation for writing Arrow. This is mostly to start discussion and is largely inspired/copied from OSGeo/gdal#9133.

Notes:

  • This uses the Arrow PyCapsule interface so that it can read from any object that exposes itself via Arrow. This means that it doesn't have a pyarrow dependency, and can read from pyarrow.Table, pyarrow.RecordBatchReader, and geoarrow-rs' GeoTable. The RecordBatchReader means that it can handle an iterator of batches, so they don't all have to be materialized at once in a Python Table.

Todo:

  • Figure out how to create the OGR Layer first
  • I'm sure my use of pointers is wildly incorrect.

References:

Closes #314.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this!
It have wanted to look at this as well, but as you might have noticed in other projects, my time for geo is a bit limited at the moment ;) But so a starting WIP PR is always a good reason to take a look! I hope my comments provide some useful pointers

pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_ogr.pxd Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated
Comment on lines 1741 to 1754
# Create output fields using CreateFieldFromArrowSchema()
static bool create_fields_from_arrow_schema(
OGRLayerH destLayer,
const struct ArrowSchema* schema,
char** options
):
# The schema object is a struct type where each child is a column.
for child in schema.n_children:
# Access the metadata for this column
const char *metadata = child.metadata

# TODO: I don't know how to parse this metadata in C... I guess I can just use Python APIs for this in Cython?
# https://github.com/OSGeo/gdal/pull/9133/files#diff-37bedc92ae1d5e04706c7b9f8ea9e9fcccf984ca0c9997e2020ff85f1b958433R1159-R1185
# if metadata:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's a pity that GDAL doesn't provide a helper to create a full layer definition from a schema, instead of only field by field (that would us to access the individual children of the ArrowSchema).

We might want to vendor some helpers from nanoarrow-c to extract the children, and the metadata etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing the child is something we can probably do quite easily, as it's something like schema.children[i]

For a first prototype, I think we could also ignore the metadata for a moment, and manually specify which are the geometry columns.

pyogrio/_io.pyx Outdated
char** options
):
# The schema object is a struct type where each child is a column.
for child in schema.n_children:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for child in schema.n_children:
for child in range(schema.n_children):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to change from

- for i in range(schema.n_children):
+ for child in schema.children:

I'm not sure if cython lets you iterate through a list of pointers as if it's a python list?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if cython lets you iterate through a list of pointers as if it's a python list?

No idea either, you will have to try and see ;)

@kylebarron
Copy link
Contributor Author

This now compiles on 3.8! I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.

@jorisvandenbossche
Copy link
Member

I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.

Hmm, looking at how it is done on the reading side, it's only the declaration in pxd that we put behind an IF, and then in the actual code we just raise an error at the beginning of the function, but still compile it regardless of the GDAL version (wondering if cython does something smart here)

@kylebarron
Copy link
Contributor Author

Yeah I was confused because I thought I was doing the same process as we have for reading

@kylebarron kylebarron changed the title WIP for writing Arrow to GDAL WIP for writing from Arrow Jan 26, 2024
@kylebarron
Copy link
Contributor Author

I was able to compile and install it locally with

virtualenv env
source ./env/bin/activate
pip install --no-cache-dir -e '.[dev,test,geopandas]'

but trying to use it with

import pyarrow as pa
from pyogrio._io import ogr_write_arrow

import geopandas as gpd
from geopandas.io.arrow import _geopandas_to_arrow
gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
table = _geopandas_to_arrow(gdf)

new_schema = table.schema.set(4, pa.field("geometry", pa.binary(), True, {"ARROW:extension:name": "ogc.wkb"}))
new_table = pa.Table.from_batches(table.to_batches(), new_schema)

test = ogr_write_arrow('test.fgb', 'test', 'FlatGeobuf', new_table, 'MultiPolygon', {}, {})

is crashing Python. Does anyone have any suggestions for how to debug? I don't know where to go from here.

@jorisvandenbossche
Copy link
Member

I fixed the segfault (gdb was pointing at exporting the schema, so stream.get_schema(stream, schema). My understanding is that we need to pass a schema as a pointer, but we need to have allocated it (and get_schema "fills it in"). So changed the declaration of schema and array a bit.
That fixed the segfault. Then there was still an error in create_fields_from_arrow_schema always erroring because of checking for a wrong error code. But with that fixed, your example script runs without any error. The strange thing is that it only doesn't write any file .. Not fully sure what I am missing here.

@jorisvandenbossche
Copy link
Member

Those annoying error return codes ... ;) We need to very carefully inspect every one while writing: the GDAL function OGR_L_WriteArrowBatch is returning a bool (true on success), but the ArrowArrayStream.get_next uses "0 on success, a non-zero error code otherwise.". While the code was doing it the other way around ..

Now it is writing actual content to the file. Just not for FlatGeobuf (that gives some error while writing), but for eg GPKG it is working.

I also opened #351 to more easily integrate this with all options for setting up the layer/dataset.

@kylebarron
Copy link
Contributor Author

I merged in #351 which simplified this a lot.

I'm still struggling with my dev environment though. In particular getting

ModuleNotFoundError: No module named 'pyogrio._ogr'

when trying to import pyogrio.raw. This is after pip install '.[dev,test,geopandas]'.

@kylebarron kylebarron changed the title WIP for writing from Arrow Write Arrow Table/RecordBatchReader to GDAL Feb 8, 2024
@jorisvandenbossche
Copy link
Member

Do you happen to do that import from within the project directory? Because in that case, you need to install it in editable mode (pip install -e .).

The issue is that when importing (or running tests) from the root of repo, python will find the pyogrio directory in the repo and try to import that, but the built extension module files are installed to site-packages.
This is actually a reason to prefer a "src-layout", i.e. nesting the package directory in a src/ directory, but we haven't done that here (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)

@kylebarron
Copy link
Contributor Author

Ah yes I've hit this so many times. It's also why compiled rust projects advise to use a different name for the folder containing python code.

@kylebarron
Copy link
Contributor Author

I added some conditional compilation statements around other functions that were trying to use GDAL Arrow struct definitions

IF CTE_GDAL_VERSION >= (3, 8, 0):

so let's see if we can get it to compile with that

pyogrio/_io.pyx Outdated
Comment on lines 2320 to 2322
exc = exc_check()
if exc:
raise exc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually typically catch the CPLE_.. error and reraise that with one of our generic error classes?

For example I see this pattern:

            except CPLE_BaseError as exc:
                raise FeatureError(str(exc))

But then what error class to use? (FeatureError or FieldError?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably DataSourceError since it is for the batch of features rather than individual features.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used DataLayerError following our documentation, as it involves an error with a specific layer ("Errors relating to working with a single OGRLayer") and not with the whole dataset ("Errors relating to opening or closing an OGRDataSource (with >= 1 layers)")

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 10, 2024

@himikof @brendan-ward thanks a lot for the extensive review! Already addressed a few of the comments, will need to go through the bulk of them one of the coming days

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

Can you please add a test when crs is not passed in to verify the warning is raised?

If possible, would be good to have that also test (or in a separate test case) if GDAL auto detects CRS from metadata on the geometry column (per the TODO comment).

Can you please add a test for a driver that doesn't support write? See test_raw_io.py::test_write_unsupported.

It would probably also be good to probe at close errors on write, similar to test_raw_io.py::test_write_gdalclose_error.

It might be good to test other drivers similar to test_raw_io.py::test_write_supported, though it looks like only FGB would be particularly unique there.

If possible, please add tests for append capability supported / unsupported, similar to test_raw_io.py::test_write_append / test_raw_io.py::test_write_append_unsupported

Is the idea to add support to write_dataframe in a later PR? (would be a good idea, this is already getting pretty big)

pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Show resolved Hide resolved
pyogrio/raw.py Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated
Comment on lines 672 to 673
if geometry_type is None:
raise ValueError("Need to specify 'geometry_type'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a 3-state variable make sense here, to enable writing without geometry?

  • "" (or some other arbitrary value to indicate unset): default, indicates not set by user and will raise exception
  • None: provided by user to indicate there is no geometry to write
  • Point ... GeometryCollection: valid geometry type

@@ -304,3 +310,270 @@ def test_arrow_bool_exception(tmpdir, ext):
else:
with open_arrow(filename):
pass


# Point(0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also gets used in a variety of places, perhaps this could be added as a function in conftest.py and then just call it when used below?

def get_wkb_points(count=1):
    return np.array([bytes.fromhex("010100000000000000000000000000000000000000")]*count, dtype=object)

We could hook it up to other usages in a different PR, this would just set us up for that while simplifying here.

pyogrio/tests/test_arrow.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Can you please add a test when crs is not passed in to verify the warning is raised?

If possible, would be good to have that also test (or in a separate test case) if GDAL auto detects CRS from metadata on the geometry column (per the TODO comment).

Added a test for not providing a crs. That directly tests the second aspect as well, because the table that we are writing in test was written with the result of read_arrow which will have this CRS metadata in its schema.

Can you please add a test for a driver that doesn't support write? See test_raw_io.py::test_write_unsupported.

It would probably also be good to probe at close errors on write, similar to test_raw_io.py::test_write_gdalclose_error.

It might be good to test other drivers similar to test_raw_io.py::test_write_supported, though it looks like only FGB would be particularly unique there.

If possible, please add tests for append capability supported / unsupported, similar to test_raw_io.py::test_write_append / test_raw_io.py::test_write_append_unsupported

Copied and adapted those tests from raw io (and fixed the append=True case, we were still incorrectly trying to add new fields to the existing layer). Only for geojson-like formats I get some error that I still have to report upstream.

Is the idea to add support to write_dataframe in a later PR? (would be a good idea, this is already getting pretty big)

Yes, that was my idea. I already have a branch locally that starts testing that (most errors are about the missing support for promoting the geometry type, but which is something for which we could add some custom shapely code in the geopandas -> arrow conversion), but indeed was planning to leave that for a separate PR ;)

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @jorisvandenbossche , this is getting really close. A few minor outstanding comments, but otherwise this looks ready to merge (and we can deal with outstanding issues around encoding - namely not allowing encoding for anything other than shapefiles - in #384 after this is merged)

pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/tests/test_arrow.py Outdated Show resolved Hide resolved
dataset_metadata, layer_metadata, metadata
)

# TODO: does GDAL infer CRS automatically from geometry metadata?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the associated test case, it looks like GDAL does not infer this automatically; comment can be removed?

Co-authored-by: Brendan Ward <bcward@astutespruce.com>
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brendan-ward brendan-ward merged commit 154466a into geopandas:main Apr 25, 2024
20 checks passed
@kylebarron kylebarron deleted the kyle/write-arrow branch April 26, 2024 01:07
@kylebarron
Copy link
Contributor Author

Honestly it was mostly @jorisvandenbossche , but I'm happy to have got the ball rolling!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support writing using the Arrow API
4 participants