[GH-2760] Extend OSM PBF reader to support additional metadata fields#2776

Draft
jiayuasu wants to merge 2 commits into master from feature/osm-pbf-metadata-fields

Conversation

@jiayuasu
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Extend the OSM PBF reader to extract the following metadata fields from Info (for Node/Way/Relation) and DenseInfo (for DenseNodes) protobuf messages:

  • changeset (BIGINT) — the changeset ID the entity belongs to
  • timestamp (TIMESTAMP) — when the entity was last modified
  • uid (INT) — the user ID of the last editor
  • user (STRING) — the username of the last editor (resolved from the string table via user_sid)
  • version (INT) — the entity version number
  • visible (BOOLEAN) — whether the entity is visible (relevant for history files)

These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.
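For context on the user field above: in the OSM PBF format each PrimitiveBlock carries a single string table, and user_sid is an index into it. The following is a minimal self-contained illustration of that lookup, not the actual reader code; it uses stub data in place of the protobuf-generated classes, and treating index 0 (the reserved empty string) as "no user" is a choice made for this sketch:

```java
import java.util.List;

// Hypothetical illustration of resolving `user` from `user_sid`.
// Stub data stands in for the protobuf-generated StringTable class.
public class StringTableLookupSketch {
    public static String resolveUser(List<String> stringTable, int userSid) {
        // Index 0 is reserved for the empty string in the PBF format;
        // this sketch maps it (and out-of-range indices) to null.
        if (userSid <= 0 || userSid >= stringTable.size()) {
            return null;
        }
        return stringTable.get(userSid);
    }

    public static void main(String[] args) {
        List<String> table = List.of("", "alice", "bob");
        System.out.println(resolveUser(table, 2)); // prints: bob
    }
}
```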

Key implementation details

  • Added InfoResolver utility class to extract Info fields from Osmformat.Info for nodes, ways, and relations
  • Extended DenseNodeExtractor to decode DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
  • Added metadata fields with setters to OSMEntity to avoid constructor bloat
  • Updated SchemaProvider to include the 6 new columns in the output schema
  • Updated OsmPartitionReader to map the new fields into Spark InternalRow
  • Passed PrimitiveBlock (instead of just StringTable) to WayIterator and RelationIterator so they can access both the string table and date_granularity for timestamp conversion
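The delta decoding and date_granularity handling mentioned above can be sketched as follows. This is an illustrative standalone example, not the actual DenseNodeExtractor code; the sample values are made up. Per the OSM PBF spec, DenseInfo stores timestamp, changeset, uid, and user_sid as delta-encoded arrays (each value is the difference from the previous node's value), and timestamps are in units of date_granularity milliseconds, which defaults to 1000:

```java
// Hypothetical sketch of DenseInfo delta decoding (not the actual
// DenseNodeExtractor code).
public class DenseInfoDeltaDecodeSketch {
    public static long[] decodeDeltas(long[] deltas) {
        long[] out = new long[deltas.length];
        long acc = 0;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i]; // running sum restores absolute values
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three nodes whose raw timestamps are stored as deltas.
        long[] timestampDeltas = {1600000, 5, -2};
        long dateGranularity = 1000; // PBF default: 1000 ms units

        long[] raw = decodeDeltas(timestampDeltas); // 1600000, 1600005, 1600003
        for (long t : raw) {
            long epochMillis = t * dateGranularity;
            System.out.println(epochMillis);
        }
    }
}
```

The same running-sum loop applies to changeset, uid, and user_sid; only timestamp needs the extra date_granularity multiplication.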

How was this patch tested?

Added 3 new tests to OsmReaderTest:

  1. Metadata fields populated for all entities — verifies that version, timestamp, and changeset are non-null for all nodes/ways/relations in the Monaco PBF dataset, that timestamps fall within a plausible range, and that version >= 1 and changeset >= 0
  2. Schema includes metadata for dense nodes — verifies the 6 new fields appear in the schema when reading dense node PBF files
  3. Schema includes metadata for normal nodes — verifies the 6 new fields appear in the schema when reading normal node PBF files

All 10 existing tests continue to pass.

Did this PR include necessary documentation updates?

  • No. This PR does not affect any public API, so no documentation changes are needed. The new fields are automatically available in the output schema when reading OSM PBF files.

Copilot AI (Contributor) left a comment

Pull request overview

Extends Sedona’s Spark OSM PBF datasource to surface standard OSM metadata (changeset, timestamp, uid/user, version, visible) for Nodes/Ways/Relations, including DenseNodes, and wires these fields through to the Spark output schema and rows.

Changes:

  • Add 6 metadata columns to the Spark schema and map them into InternalRow.
  • Populate OSMEntity metadata from Osmformat.Info (nodes/ways/relations) via new InfoResolver.
  • Decode DenseNodes DenseInfo metadata (including delta-encoded fields) in DenseNodeExtractor, and update iterators to pass PrimitiveBlock/date_granularity.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Summary per file:

  • spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala: Adds tests for metadata presence and schema columns.
  • spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/SchemaProvider.scala: Adds metadata columns to datasource schema.
  • spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/OsmPartitionReader.scala: Maps new metadata fields into Spark InternalRow (incl. timestamp conversion).
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/model/OSMEntity.java: Adds nullable metadata fields + getters/setters.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/WayIterator.java: Populates Way metadata from Info using InfoResolver.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/RelationIterator.java: Populates Relation metadata from Info using InfoResolver.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/NodeIterator.java: Populates Node metadata from Info using InfoResolver.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/BlobIterator.java: Passes PrimitiveBlock/date_granularity through to iterators/extractor.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java: New utility to extract Info fields onto entities.
  • spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java: Decodes DenseInfo metadata (delta-coded fields) for DenseNodes.


@jiayuasu jiayuasu marked this pull request as draft March 22, 2026 22:17
Extract changeset, timestamp, uid, user, version, and visible fields
from the Info/DenseInfo protobuf messages that were previously ignored
by the OSM PBF reader. These fields are part of the standard OSM PBF
format and provide useful provenance metadata for each entity.
- Guard DenseNodeExtractor field access with get*Count() > idx to
  prevent IndexOutOfBoundsException when repeated fields are absent
- Default visible to true when not present per OSM PBF spec
  (applies to both InfoResolver and DenseNodeExtractor)
- Add metadata value assertions in dense node test to verify
  delta-decoded fields are actually populated
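The guard and visible-default fixes described in the commit message above can be illustrated roughly as follows. This is a sketch, not the actual DenseNodeExtractor code; a nested stub stands in for the protobuf-generated DenseInfo class, and the method names are made up for this example:

```java
// Illustrative sketch of guarded repeated-field access and the
// visible-defaults-to-true rule from the OSM PBF spec.
public class GuardSketch {

    // Stub: repeated DenseInfo fields may be shorter than the node
    // list, or empty when the writer omitted them entirely.
    static class DenseInfoStub {
        long[] timestamps = {};
        boolean[] visibles = {};

        int getTimestampCount() { return timestamps.length; }
        long getTimestamp(int i) { return timestamps[i]; }
        int getVisibleCount() { return visibles.length; }
        boolean getVisible(int i) { return visibles[i]; }
    }

    // Guard with get*Count() > idx to avoid IndexOutOfBoundsException.
    static Long timestampOrNull(DenseInfoStub info, int idx) {
        return info.getTimestampCount() > idx ? info.getTimestamp(idx) : null;
    }

    // Per the OSM PBF spec, `visible` defaults to true when absent
    // (non-history files usually omit it).
    static boolean visibleOrDefault(DenseInfoStub info, int idx) {
        return info.getVisibleCount() > idx ? info.getVisible(idx) : true;
    }

    public static void main(String[] args) {
        DenseInfoStub empty = new DenseInfoStub();
        System.out.println(timestampOrNull(empty, 0));  // prints: null
        System.out.println(visibleOrDefault(empty, 0)); // prints: true
    }
}
```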
@jiayuasu jiayuasu force-pushed the feature/osm-pbf-metadata-fields branch from dc520de to 5ae595e Compare March 24, 2026 07:52
@jiayuasu jiayuasu requested a review from Copilot March 24, 2026 07:57
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.



Comment on lines +310 to +313
val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
val uid = nodeWithMetadata.getAs[Long]("uid")
val user = nodeWithMetadata.getAs[String]("user")
val version = nodeWithMetadata.getAs[Long]("version")
Copilot AI, Mar 24, 2026:
In this dense-node metadata test, the selected columns have Spark types timestamp: TimestampType, uid: IntegerType, and version: IntegerType (per SchemaProvider). Using getAs[Long] for these fields will typically throw a ClassCastException at runtime. Read timestamp as java.sql.Timestamp (or use getTimestamp), and read uid/version as Int (or Integer) before doing range checks.

Suggested change (replace the getAs[Long] reads with type-correct accessors):

Remove:
    val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
    val uid = nodeWithMetadata.getAs[Long]("uid")
    val user = nodeWithMetadata.getAs[String]("user")
    val version = nodeWithMetadata.getAs[Long]("version")

Add:
    val timestampValue = nodeWithMetadata.getAs[java.sql.Timestamp]("timestamp")
    val timestamp = timestampValue.getTime
    val uidValue = nodeWithMetadata.getAs[Int]("uid")
    val uid = uidValue.toLong
    val user = nodeWithMetadata.getAs[String]("user")
    val versionValue = nodeWithMetadata.getAs[Int]("version")
    val version = versionValue.toLong


Successfully merging this pull request may close these issues.

Extend OSM PBF reader to support additional fields (changeset, timestamp, uid, user, version, visible)

2 participants