ENH: Writing to in-memory multilayer GPKGs #2875

Open
AdnanAvdagic opened this issue Apr 20, 2023 · 5 comments

@AdnanAvdagic

AdnanAvdagic commented Apr 20, 2023

Is your feature request related to a problem?

At my company we use a PostGIS database to host all of our customers' data, and we let customers export it in a number of file formats, including GPKG. The problem is that we cannot use Python's io.BytesIO in-memory files when the exported GPKG needs multiple layers, and we would prefer not to write to disk.

Describe the solution you'd like

Is there any way for geopandas to write to in-memory multilayer GPKGs so they can be sent from the webserver?

API breaking implications

No idea if it breaks anything.

Describe alternatives you've considered

The alternative is to create a temp file on disk and write to that.
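
A minimal sketch of that temp-file workaround, assuming each layer is a GeoDataFrame (or any object with a compatible `to_file` method). The helper name `gpkg_bytes_via_tempfile` and the `export.gpkg` file name are hypothetical, and the sketch relies on a repeated `to_file` call with a new `layer` name adding a layer to a GeoPackage that already exists on disk:

```python
import io
import os
import tempfile


def gpkg_bytes_via_tempfile(layers):
    """Write several layers to a temporary GPKG and return it as BytesIO.

    `layers` maps a layer name to an object exposing
    `to_file(path, driver=..., layer=...)`, e.g. a GeoDataFrame.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name; it only lives inside the temp directory.
        path = os.path.join(tmp_dir, "export.gpkg")
        for name, gdf in layers.items():
            # Assumption: writing to an existing GPKG path with a new
            # layer name adds a layer instead of replacing the file.
            gdf.to_file(path, driver="GPKG", layer=name)
        with open(path, "rb") as f:
            data = f.read()
    # The temp directory is removed here; only the bytes remain in memory.
    return io.BytesIO(data)
```

The returned buffer is positioned at the start, so it can be handed to a web response as-is. The data still touches disk briefly, so this does not address the request itself.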

Additional context

tmp_file = io.BytesIO()
gdf.to_file(tmp_file, driver="GPKG", layer=name)
@AdnanAvdagic AdnanAvdagic changed the title ENH: ENH: Writing to in-memory multilayer GPKGs Apr 20, 2023
@m-richards
Member

m-richards commented Apr 24, 2023

Hi @AdnanAvdagic, this certainly seems like something that we could support. A PR would definitely be welcome if you're interested! Having had a quick look at the fiona case, I'm seeing something different from you.

import io
import geopandas as gpd

gdf = gpd.GeoDataFrame(gpd.read_file(gpd.datasets.get_path("nybb")))
tmp_file = io.BytesIO()
gdf.to_file(tmp_file, driver="GPKG")
gdf2 = gpd.read_file(tmp_file)
  File "C:\Users\Matt\.conda\envs\pandas-dev\lib\site-packages\fiona\collection.py", line 162, in __init__
    self.session.start(self, **kwargs)
  File "fiona\ogrext.pyx", line 540, in fiona.ogrext.Session.start
  File "fiona\_shim.pyx", line 90, in fiona._shim.gdal_open_vector
fiona.errors.DriverError: '/vsimem/ec35a40572b84db6bb280bf285beea47' not recognized as a supported file format.

I get an error on read instead in my local environment (with fiona 1.8.22).

But this seems to be in the geopandas wrapping of fiona. I'll try to have a look in more detail at some point.

There is also another crash if driver=None for io.BytesIO which we should handle better.

@jorisvandenbossche
Member

@m-richards I think that when you write to an in-memory file-like object and then want to read it back, you have to move the "current position" back to the start of the object. This works for me:

import io
import geopandas

gdf = geopandas.read_file(geopandas.datasets.get_path("nybb"))

tmp_file = io.BytesIO()
gdf.to_file(tmp_file, driver="GPKG")
# this line is needed
tmp_file.seek(0)

In [20]: geopandas.read_file(tmp_file)
Out[20]: 
   BoroCode       BoroName     Shape_Leng    Shape_Area                                           geometry
0         5  Staten Island  330470.010332  1.623820e+09  MULTIPOLYGON (((970217.022 145643.332, 970227....
1         4         Queens  896344.047763  3.045213e+09  MULTIPOLYGON (((1029606.077 156073.814, 102957...
2         3       Brooklyn  741080.523166  1.937479e+09  MULTIPOLYGON (((1021176.479 151374.797, 102100...
3         1      Manhattan  359299.096471  6.364715e+08  MULTIPOLYGON (((981219.056 188655.316, 980940....
4         2          Bronx  464392.991824  1.186925e+09  MULTIPOLYGON (((1012821.806 229228.265, 101278...
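
The rewind requirement is plain `io.BytesIO` behaviour and can be seen without geopandas. A minimal sketch using only the standard library (the payload bytes are a placeholder, not real GPKG content):

```python
import io

buf = io.BytesIO()
buf.write(b"layer data")   # writing leaves the position at the end

at_end = buf.read()        # reading from the end returns nothing

buf.seek(0)                # move the position back to the start
after_rewind = buf.read()  # now the full contents are returned

whole = buf.getvalue()     # full contents, independent of the position
```

For sending the result from a web server, `getvalue()` is often simpler, since it returns the full contents regardless of the current position.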

So this works for a single layer. The question is now if this can also work with multiple layers. Testing naively writing to the same buffer doesn't work:

df = geopandas.GeoDataFrame({"col": [1, 2], "geometry": geopandas.points_from_xy([1, 2], [1, 2])})
df1 = df.iloc[:1]
df2 = df.iloc[1:]

tmp_file = io.BytesIO()
df1.to_file(tmp_file, driver="GPKG", layer="layer1")
# it works neither with nor without the following line
# tmp_file.seek(0)
df2.to_file(tmp_file, driver="GPKG", layer="layer2")

tmp_file.seek(0)
geopandas.read_file(tmp_file, layer="layer1")

tmp_file.seek(0)
geopandas.read_file(tmp_file, layer="layer2")
# -> ValueError: Null layer: 'layer2'

Checking with pyogrio confirms that the file only has the first layer:

tmp_file.seek(0)
pyogrio.list_layers(tmp_file)
# -> array([['layer1', 'Point']], dtype=object)

For normal files, you don't have to explicitly ask to append: if you write to a GeoPackage file that already exists, it will automatically add a new layer (if you provide a different name; otherwise it overwrites the layer, I suppose).
But when explicitly asking to append, fiona raises an error that this isn't supported for in-memory file-like objects:

tmp_file = io.BytesIO()
df1.to_file(tmp_file, driver="GPKG", layer="layer1")
tmp_file.seek(0)
df2.to_file(tmp_file, driver="GPKG", layer="layer2", mode="a")
# -> OSError: Append mode is not supported for datasets in a Python file object.

So it would need some more investigation to see whether this is possible with either fiona or pyogrio.

@theroggy
Member

theroggy commented Apr 29, 2023

Funny, apparently I was also having a look at this at the same time as @jorisvandenbossche :-)...

I checked whether it would be possible using the "vsimem" feature of GDAL, but it seems that the second layer isn't added to the in-memory GeoPackage; the GeoPackage is just overwritten. Based on a quick scan of the code in pyogrio it should work, and when the path is a real file it works fine, so I suppose the issue (or the fact that it isn't supported) is in GDAL:

import pyogrio
import geopandas as gpd
from osgeo import gdal

gdf = gpd.GeoDataFrame(gpd.read_file(gpd.datasets.get_path("nybb")))

memfile_path = "/vsimem/memoryfile.gpkg"
# memfile_path = "C:/temp/memoryfile.gpkg"
try:
    gdf.iloc[[0, 1]].to_file(memfile_path, driver="GPKG", layer="test1", engine="pyogrio")
    layers = pyogrio.list_layers(memfile_path)
    gdf.iloc[[2, 3]].to_file(memfile_path, driver="GPKG", layer="test2", engine="pyogrio")
    layers = pyogrio.list_layers(memfile_path)
    print(layers)
    gdf1 = gpd.read_file(memfile_path, driver="GPKG", layer="test1")
    gdf2 = gpd.read_file(memfile_path, driver="GPKG", layer="test2")
    print(gdf1)
    print(gdf2)
finally:
    gdal.Unlink(memfile_path)

@jorisvandenbossche
Member

I think that for pyogrio, we actually don't yet support writing to an in-memory BytesIO or /vsimem at all (not even for a single file / layer). Opened geopandas/pyogrio#249 for this.

@m-richards
Member

Thanks both for clarifying this. I don't use BytesIO very often, and it clearly shows.


4 participants