Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Exporting a GeoDataframe with column 'fid' to GeoPackage will delete that column #1035

Open
joehuanguf opened this issue Jul 2, 2019 · 10 comments
Labels
Milestone

Comments

@joehuanguf
Copy link

joehuanguf commented Jul 2, 2019

Steps:

  1. Create a GeoDataframe with a column 'fid'

  2. Export as a GeoPackage using to_file()

  3. Read the exported GeoPackage

Bug

The 'fid' column should exist, but is missing.

Hunches

A scan of both the GeoPandas and the Fiona source code reveal no reason this bug should occur, so I suspect that it's actually an issue with GDAL.

@kuanb
Copy link

kuanb commented Jul 2, 2019

Example script to recreate:

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

columns = ['fid', 'gid', 'id']
data = [(1, 1.0, '1')]

df = pd.DataFrame(data, columns=columns)

p = Point(1,2)
gdf = gpd.GeoDataFrame(df, geometry=[p])

print(gdf, '\n')


gdf.to_file('ohm/test', driver='GPKG', limit=0)
gdf2 = gpd.read_file('ohm/test')

print(gdf2)

Output will be as follows:

   fid  gid id     geometry
0    1  1.0  1  POINT (1 2) 

   gid id     geometry
0  1.0  1  POINT (1 2)

fid is trimmed at some point during the export/import operation.

I noticed in ogr2ogr documentation the flag for preserving fid. I wonder if this is related to the need to pass through specific parameters/flags to fiona?

Link: https://gdal.org/programs/ogr2ogr.html#cmdoption-ogr2ogr-preserve-fid

@kuanb
Copy link

kuanb commented Jul 2, 2019

Update:

This is expected behavior for the driver and occurs on write. GPKG uses "fid" as the index. If present, it is assigned as the index "id." A more clear example would be to use a number other than 1 for the "fid" value in the above example. Let's say we use "12" instead.

The resulting output would have the "fid" column missing from the columns but the index value for the first row would be "12" instead of "0."

Reading in the file with fiona we can see:

with fiona.open('ohm/test') as f:
    feats = list(f)
    print(feats)

# [{'type': 'Feature', 'id': '12', 'properties': OrderedDict([('gid', 1.0), ('id', '1')]), 'geometry': {'type': 'Point', 'coordinates': (1.0, 2.0)}}]

In the above example we can see the "12" value being assigned to the feature id. But it is not preserved and carried through to the index in GeoPandas:

gpd.read_file('foo/test')

Returns...

  gid id geometry
0 1.0 1 POINT (1 2)

@kuanb
Copy link

kuanb commented Jul 3, 2019

I took some additional notes on what is happening and captured it here: http://kuanbutts.com/2019/07/02/gpkg-write-from-geopandas/

Ultimately, I am not sure if this is a decision GeoPandas should be making. I can see the argument that this is something that a user might want to specifically handle themselves.

The only item I think might be valid in terms of making a modification to GeoPandas would be that, when reading in a GPKG, create an additional column fid and use id values present to create values for fid.

@jorisvandenbossche
Copy link
Member

@kuanb thanks for the exploration, very useful!

There is also some related discussion to FIDs in fiona in Toblerity/Fiona#327. Additional problem in handling this consistently is that it probably also depends on the driver.
And some discussion about id's specific for GeoJSON:#390

Also not sure what is best here. We could set the ids as the index or as an 'fid' column, it's not that hard code-wise if we would decide that is best (although it is somewhat backwards incompatible). But I would prefer to not do too many driver-specific things to geopandas.
Eg roundtrip to Shapefile or GeoJSON do preserve the fid column, and their id's become simple integers starting at 0. But in that light, it is certainly confusing that GeoPackage behaves differently

@stdmn
Copy link

stdmn commented Oct 24, 2019

Any update on this? Would be great to have the option of explicity indexing by the GeoJSON feature IDs instead of just re-sequencing based on read order.

I'm specifically interested in this because I'm splitting the data between vector tiles and a Postgres Database, so if I can't specify the index manually, there's a chance that features could end up with different indices which would make it impossible to share information between the tiles and the DB.

@theroggy
Copy link
Member

theroggy commented Jun 2, 2020

Also interested in this. Mainly for the part about reading the fid values properly from an existing geopackage rather than writing the fid column.

I need to do some checks on the geometries in a geopackage file and print out which rows need to be manually checked, and the fid would be the logical thing to print out for users to find the rows to be checked... but not possible due to the current behaviour of just ditching the fid column.

At first sight, I think the safest solution would be to add an fid column rather than change the pandas index for (backwards)compatibility reasons...

@jorisvandenbossche
Copy link
Member

I am still a bit hesitant to always convert the id to an "fid" column, when reading from a GeoPackage file: 1) it introduces GPKG specific code in geopandas and 2) it adds an additional column for all users, also when you actually don't use/need it.

But I certainly understand that, if you want access to the FID of the GeoPackage file, the current situation is quite annoying..

So some option I can think of:

  • Simply always include a "fid" column on GPKG file reading (but only for GPKG, I think?)
  • Have an option to indicate to preserve the id when reading (then this could be done for all formats, I think, and we don't necessarily need to make it specific to GPK?)
  • Warn users when they write a GeoDataFrame with an "fid" column to GPKG that their column will be lost when reading back (this would be an option if we decide to not change reading, to avoid the roundtrip surprise, but of course in itself doesn't solve the use case of accessing GPKG's FID)

Thoughts? Somebody interested in doing a PR for one of those?

@theroggy
Copy link
Member

I forgot about this back then... but a new case where a solution for this would have made life a lot easier arrived...

So, in response to @jorisvandenbossche: I think your second option sounds great. It is safe backwards-compatibility wise, and as it is an index, it sounds reasonable that it would end up as the index of the resulting GeoDataFrame as well.

I also noticed pyogrio.read_dataframe supports it already (on the master branch, not yet in the released latest version), via a parameter called "fid_as_index=False":
https://pyogrio.readthedocs.io/en/latest/api.html#geopandas-integration

@MaxDragonheart
Copy link

Any solution? In my project fid is very important because contains unique values that I use for a lot of processes.

@MaxDragonheart
Copy link

This is my workaround, hope it can be usefull https://gist.github.com/MaxDragonheart/46445a150aac9d528dadd2ec877203a5

@martinfleis martinfleis added this to the 1.0 milestone Dec 30, 2022
@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, 1.1 Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants