[GH-2760] Extend OSM PBF reader to support additional metadata fields #2776
Pull request overview
Extends Sedona’s Spark OSM PBF datasource to surface standard OSM metadata (changeset, timestamp, uid/user, version, visible) for Nodes/Ways/Relations, including DenseNodes, and wires these fields through to the Spark output schema and rows.
Changes:
- Add 6 metadata columns to the Spark schema and map them into `InternalRow`.
- Populate `OSMEntity` metadata from `Osmformat.Info` (nodes/ways/relations) via new `InfoResolver`.
- Decode DenseNodes `DenseInfo` metadata (including delta-encoded fields) in `DenseNodeExtractor`, and update iterators to pass `PrimitiveBlock`/`date_granularity`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala | Adds tests for metadata presence and schema columns. |
| spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/SchemaProvider.scala | Adds metadata columns to datasource schema. |
| spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/OsmPartitionReader.scala | Maps new metadata fields into Spark InternalRow (incl. timestamp conversion). |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/model/OSMEntity.java | Adds nullable metadata fields + getters/setters. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/WayIterator.java | Populates Way metadata from Info using InfoResolver. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/RelationIterator.java | Populates Relation metadata from Info using InfoResolver. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/NodeIterator.java | Populates Node metadata from Info using InfoResolver. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/BlobIterator.java | Passes PrimitiveBlock/date_granularity through to iterators/extractor. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java | New utility to extract Info fields onto entities. |
| spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java | Decodes DenseInfo metadata (delta-coded fields) for DenseNodes. |
Extract changeset, timestamp, uid, user, version, and visible fields from the Info/DenseInfo protobuf messages that were previously ignored by the OSM PBF reader. These fields are part of the standard OSM PBF format and provide useful provenance metadata for each entity.
- Guard `DenseNodeExtractor` field access with `get*Count() > idx` to prevent `IndexOutOfBoundsException` when repeated fields are absent
- Default `visible` to true when not present, per the OSM PBF spec (applies to both `InfoResolver` and `DenseNodeExtractor`)
- Add metadata value assertions in the dense node test to verify that delta-decoded fields are actually populated
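The guard and the default described in this commit message can be pictured with a minimal Java sketch. It is not the code from this PR: the class `DenseInfoSketch`, its method names, and the use of plain arrays in place of the repeated protobuf `DenseInfo` fields are assumptions made purely for illustration.

```java
// Illustrative sketch only; not the DenseNodeExtractor from this PR.
// Plain arrays stand in for the repeated DenseInfo fields (assumption).
final class DenseInfoSketch {

    // Delta decoding: each stored value is the difference from the previous one,
    // so the decoded column is the running sum of the stored values.
    static long[] decodeAll(long[] deltas) {
        long[] decoded = new long[deltas.length];
        long running = 0L;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            decoded[i] = running;
        }
        return decoded;
    }

    // Length guard, analogous to checking get*Count() > idx before reading.
    // Returns null when the repeated field has no entry at idx.
    static Long valueAt(long[] decoded, int idx) {
        return decoded.length > idx ? Long.valueOf(decoded[idx]) : null;
    }

    // visible falls back to true when the repeated field is absent.
    static boolean visibleAt(boolean[] visible, int idx) {
        return visible.length > idx ? visible[idx] : true;
    }

    public static void main(String[] args) {
        long[] changesets = decodeAll(new long[] {100L, 5L, -2L});
        System.out.println(java.util.Arrays.toString(changesets));
        System.out.println(valueAt(changesets, 3));       // index past the end
        System.out.println(visibleAt(new boolean[0], 0)); // field absent
    }
}
```

Returning `null` for a missing entry keeps the metadata field nullable instead of failing the whole block, which matches the intent stated in the first bullet.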
dc520de to 5ae595e
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

    val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
    val uid = nodeWithMetadata.getAs[Long]("uid")
    val user = nodeWithMetadata.getAs[String]("user")
    val version = nodeWithMetadata.getAs[Long]("version")

In this dense-node metadata test, the selected columns have Spark types timestamp: TimestampType, uid: IntegerType, and version: IntegerType (per SchemaProvider). Using getAs[Long] for these fields will typically throw a ClassCastException at runtime. Read timestamp as java.sql.Timestamp (or use getTimestamp), and read uid/version as Int (or Integer) before doing range checks.

    - val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
    - val uid = nodeWithMetadata.getAs[Long]("uid")
    - val user = nodeWithMetadata.getAs[String]("user")
    - val version = nodeWithMetadata.getAs[Long]("version")
    + val timestampValue = nodeWithMetadata.getAs[java.sql.Timestamp]("timestamp")
    + val timestamp = timestampValue.getTime
    + val uidValue = nodeWithMetadata.getAs[Int]("uid")
    + val uid = uidValue.toLong
    + val user = nodeWithMetadata.getAs[String]("user")
    + val versionValue = nodeWithMetadata.getAs[Int]("version")
    + val version = versionValue.toLong

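The failure mode raised in this review comment can be reproduced without Spark. The sketch below is an analogy in plain Java, not Spark code: `GetAsCastSketch`, its `getAs` helper, and the `Object[]` standing in for a row are all hypothetical. It shows why an unchecked generic accessor lets a boxed `Integer` through and only fails where the caller treats it as a `Long`.

```java
// Illustrative analogy only; an Object[] stands in for a Spark Row (assumption).
final class GetAsCastSketch {

    // Unchecked generic accessor: the cast to T is erased at runtime,
    // so a type mismatch only surfaces where the caller uses the value.
    @SuppressWarnings("unchecked")
    static <T> T getAs(Object[] row, int i) {
        return (T) row[i];
    }

    // Returns true when reading the stored value as a Long fails.
    static boolean readingAsLongFails(Object[] row, int i) {
        try {
            Long value = getAs(row, i); // the compiler checks the cast to Long here
            value.longValue();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // The fix suggested in the review: read with the stored type, then widen.
    static long readIntAsLong(Object[] row, int i) {
        Integer value = getAs(row, i);
        return value.longValue();
    }

    public static void main(String[] args) {
        Object[] row = {Integer.valueOf(7)}; // e.g. a version stored as an int
        System.out.println(readingAsLongFails(row, 0));
        System.out.println(readIntAsLong(row, 0));
    }
}
```

Reading with the stored type and widening afterwards keeps the later range checks on `Long` values unchanged.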
Did you read the Contributor Guide?
Is this PR related to a ticket?
`[GH-XXX] my subject`. Closes #2760, "Extend OSM PBF reader to support additional fields (changeset, timestamp, uid, user, version, visible)".
What changes were proposed in this PR?
Extend the OSM PBF reader to extract the following metadata fields from `Info` (for Node/Way/Relation) and `DenseInfo` (for DenseNodes) protobuf messages:
- `changeset` (BIGINT): the changeset ID the entity belongs to
- `timestamp` (TIMESTAMP): when the entity was last modified
- `uid` (INT): the user ID of the last editor
- `user` (STRING): the username of the last editor (resolved from the string table via `user_sid`)
- `version` (INT): the entity version number
- `visible` (BOOLEAN): whether the entity is visible (relevant for history files)
These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.
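As a rough illustration of how two of these fields are derived, the Java sketch below scales a raw timestamp by `date_granularity` and resolves `user_sid` against a string table. It is a sketch under stated assumptions, not the PR's `InfoResolver` or `OsmPartitionReader`: the class and method names are hypothetical, a `String[]` stands in for the protobuf string table, and the final step assumes the reader targets Spark's internal microsecond timestamp representation.

```java
// Illustrative sketch only; names and the String[] string table are assumptions.
final class InfoFieldsSketch {

    // OSM PBF stores timestamps in units of date_granularity milliseconds
    // (the format's default granularity is 1000, i.e. whole seconds).
    static long toEpochMillis(long rawTimestamp, int dateGranularity) {
        return Math.multiplyExact(rawTimestamp, (long) dateGranularity);
    }

    // Spark's internal TimestampType value is microseconds since the epoch.
    static long toSparkMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1000L);
    }

    // Resolves user_sid against the block's string table; null if out of range.
    static String resolveUser(String[] stringTable, int userSid) {
        return (userSid >= 0 && userSid < stringTable.length) ? stringTable[userSid] : null;
    }

    public static void main(String[] args) {
        long millis = toEpochMillis(1_700_000_000L, 1000);
        System.out.println(millis);
        System.out.println(toSparkMicros(millis));
        System.out.println(resolveUser(new String[] {"", "alice"}, 1));
    }
}
```

`Math.multiplyExact` is used so that an implausible raw value fails loudly instead of silently overflowing.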
Key implementation details
- `InfoResolver`: new utility class to extract Info fields from `Osmformat.Info` for nodes, ways, and relations
- `DenseNodeExtractor`: decodes DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
- `OSMEntity`: nullable metadata fields with getters/setters, to avoid constructor bloat
- `SchemaProvider`: includes the 6 new columns in the output schema
- `OsmPartitionReader`: maps the new fields into Spark `InternalRow`
- `PrimitiveBlock` (instead of just `StringTable`) is passed to `WayIterator` and `RelationIterator` so they can access both the string table and `date_granularity` for timestamp conversion
How was this patch tested?
Added 3 new tests to `OsmReaderTest`:
- `version`, `timestamp`, `changeset` are non-null for all nodes/ways/relations in the Monaco PBF dataset, timestamps are in a reasonable range, version >= 1, changeset >= 0
All 10 existing tests continue to pass.
Did this PR include necessary documentation updates?