[GH-2760] Extend OSM PBF reader to support additional metadata fields by jiayuasu · Pull Request #2776 · apache/sedona

jiayuasu · 2026-03-21T20:08:21Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Developer Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes #2760, "Extend OSM PBF reader to support additional fields (changeset, timestamp, uid, user, version, visible)".

What changes were proposed in this PR?

Extend the OSM PBF reader to extract the following metadata fields from Info (for Node/Way/Relation) and DenseInfo (for DenseNodes) protobuf messages:

changeset (BIGINT) — the changeset ID the entity belongs to
timestamp (TIMESTAMP) — when the entity was last modified
uid (INT) — the user ID of the last editor
user (STRING) — the username of the last editor (resolved from the string table via user_sid)
version (INT) — the entity version number
visible (BOOLEAN) — whether the entity is visible (relevant for history files)

These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.

Key implementation details

Added InfoResolver utility class to extract Info fields from Osmformat.Info for nodes, ways, and relations
Extended DenseNodeExtractor to decode DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
Added metadata fields with setters to OSMEntity to avoid constructor bloat
Updated SchemaProvider to include the 6 new columns in the output schema
Updated OsmPartitionReader to map the new fields into Spark InternalRow
Passed PrimitiveBlock (instead of just StringTable) to WayIterator and RelationIterator so they can access both the string table and date_granularity for timestamp conversion
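
To illustrate why date_granularity is needed for timestamp conversion, here is a minimal sketch, not the code in this PR. It assumes the OSM PBF convention that a raw timestamp multiplied by date_granularity yields milliseconds since the Unix epoch (the format's default granularity is 1000), and that Spark's TimestampType is stored internally as microseconds. The class and method names are hypothetical.

```java
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of the PR. */
final class OsmTimestampSketch {
    /** OSM PBF default: with a granularity of 1000 ms, raw values are whole seconds. */
    static final int DEFAULT_DATE_GRANULARITY = 1000;

    /** raw timestamp * date_granularity = milliseconds since the Unix epoch. */
    static long toEpochMillis(long rawTimestamp, int dateGranularity) {
        return Math.multiplyExact(rawTimestamp, (long) dateGranularity);
    }

    /** Spark keeps TimestampType values as microseconds since the epoch. */
    static long toEpochMicros(long rawTimestamp, int dateGranularity) {
        return Math.multiplyExact(toEpochMillis(rawTimestamp, dateGranularity), 1000L);
    }

    public static void main(String[] args) {
        long raw = 1_700_000_000L; // illustrative raw value read from Info.timestamp
        long millis = toEpochMillis(raw, DEFAULT_DATE_GRANULARITY);
        System.out.println(Instant.ofEpochMilli(millis));
        System.out.println(toEpochMicros(raw, DEFAULT_DATE_GRANULARITY));
    }
}
```

Math.multiplyExact is used so that an implausibly large raw value fails loudly instead of silently overflowing.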

How was this patch tested?

Added 3 new tests to OsmReaderTest:

Metadata fields populated for all entities — verifies version, timestamp, changeset are non-null for all nodes/ways/relations in the Monaco PBF dataset, timestamps are in a reasonable range, version >= 1, changeset >= 0
Schema includes metadata for dense nodes — verifies the 6 new fields appear in the schema when reading dense node PBF files
Schema includes metadata for normal nodes — verifies the 6 new fields appear in the schema when reading normal node PBF files

All 10 existing tests continue to pass.

Did this PR include necessary documentation updates?

No, this PR does not affect any public API so no need to change the documentation. The new fields are automatically available in the output schema when reading OSM PBF files.

Copilot

Pull request overview

Extends Sedona’s Spark OSM PBF datasource to surface standard OSM metadata (changeset, timestamp, uid/user, version, visible) for Nodes/Ways/Relations, including DenseNodes, and wires these fields through to the Spark output schema and rows.

Changes:

Add 6 metadata columns to the Spark schema and map them into InternalRow.
Populate OSMEntity metadata from Osmformat.Info (nodes/ways/relations) via new InfoResolver.
Decode DenseNodes DenseInfo metadata (including delta-encoded fields) in DenseNodeExtractor, and update iterators to pass PrimitiveBlock/date_granularity.
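
The delta decoding mentioned above can be sketched as a running sum per column. This is an assumption-labelled illustration rather than the actual DenseNodeExtractor code: plain arrays stand in for the protobuf repeated fields, and the class, method names, and sample values are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of DenseInfo-style delta decoding; not the PR's code. */
final class DenseInfoDeltaSketch {
    /** Turns a delta-coded column into absolute values with a running sum. */
    static long[] decodeDeltas(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0L;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    /** Looks up each decoded user_sid in the block's string table. */
    static String[] resolveUsers(long[] userSids, String[] stringTable) {
        String[] users = new String[userSids.length];
        for (int i = 0; i < userSids.length; i++) {
            users[i] = stringTable[(int) userSids[i]];
        }
        return users;
    }

    public static void main(String[] args) {
        // Illustrative values only: three dense nodes in one block.
        long[] changesets = decodeDeltas(new long[] {100, 5, -3});
        long[] userSids = decodeDeltas(new long[] {1, 1, -1});
        String[] table = {"", "alice", "bob"};
        System.out.println(Arrays.toString(changesets));
        System.out.println(Arrays.toString(resolveUsers(userSids, table)));
    }
}
```

The same running sum applies to timestamp, changeset, uid, and user_sid; version is stored as plain values and needs no decoding.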

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

File	Description
spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala	Adds tests for metadata presence and schema columns.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/SchemaProvider.scala	Adds metadata columns to datasource schema.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/OsmPartitionReader.scala	Maps new metadata fields into Spark `InternalRow` (incl. timestamp conversion).
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/model/OSMEntity.java	Adds nullable metadata fields + getters/setters.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/WayIterator.java	Populates `Way` metadata from `Info` using `InfoResolver`.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/RelationIterator.java	Populates `Relation` metadata from `Info` using `InfoResolver`.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/NodeIterator.java	Populates `Node` metadata from `Info` using `InfoResolver`.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/BlobIterator.java	Passes `PrimitiveBlock`/`date_granularity` through to iterators/extractor.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java	New utility to extract `Info` fields onto entities.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java	Decodes DenseInfo metadata (delta-coded fields) for DenseNodes.

Extract changeset, timestamp, uid, user, version, and visible fields from the Info/DenseInfo protobuf messages that were previously ignored by the OSM PBF reader. These fields are part of the standard OSM PBF format and provide useful provenance metadata for each entity.

- Guard DenseNodeExtractor field access with get*Count() > idx to prevent IndexOutOfBoundsException when repeated fields are absent
- Default visible to true when not present per OSM PBF spec (applies to both InfoResolver and DenseNodeExtractor)
- Add metadata value assertions in dense node test to verify delta-decoded fields are actually populated
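
A minimal sketch of the two defensive patterns in this commit message, under stated assumptions: java.util.List stands in for a protobuf repeated field (where size() plays the role of get*Count()), and the class and method names are hypothetical rather than taken from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the count guard and the visible default. */
final class DenseInfoGuardSketch {
    /** Returns the element at idx, or null when the repeated field has no such entry. */
    static <T> T getOrNull(List<T> repeated, int idx) {
        // Mirrors the "get*Count() > idx" check: never index past the end.
        return repeated.size() > idx ? repeated.get(idx) : null;
    }

    /** A missing visible flag is treated as true. */
    static boolean visibleOrDefault(List<Boolean> visible, int idx) {
        Boolean flag = getOrNull(visible, idx);
        return flag == null || flag;
    }

    public static void main(String[] args) {
        List<Integer> noVersions = Collections.emptyList();
        System.out.println(getOrNull(noVersions, 0));
        System.out.println(visibleOrDefault(Collections.<Boolean>emptyList(), 0));
        System.out.println(visibleOrDefault(Arrays.asList(true, false), 1));
    }
}
```

Returning null for an absent entry keeps the corresponding column nullable, while visible always resolves to a concrete boolean.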

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Copilot · 2026-03-24T08:01:48Z

spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala

+      val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
+      val uid = nodeWithMetadata.getAs[Long]("uid")
+      val user = nodeWithMetadata.getAs[String]("user")
+      val version = nodeWithMetadata.getAs[Long]("version")


In this dense-node metadata test, the selected columns have Spark types timestamp: TimestampType, uid: IntegerType, and version: IntegerType (per SchemaProvider). Using getAs[Long] for these fields will typically throw a ClassCastException at runtime. Read timestamp as java.sql.Timestamp (or use getTimestamp), and read uid/version as Int (or Integer) before doing range checks.

Suggested change

-      val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
-      val uid = nodeWithMetadata.getAs[Long]("uid")
-      val user = nodeWithMetadata.getAs[String]("user")
-      val version = nodeWithMetadata.getAs[Long]("version")
+      val timestampValue = nodeWithMetadata.getAs[java.sql.Timestamp]("timestamp")
+      val timestamp = timestampValue.getTime
+      val uidValue = nodeWithMetadata.getAs[Int]("uid")
+      val uid = uidValue.toLong
+      val user = nodeWithMetadata.getAs[String]("user")
+      val versionValue = nodeWithMetadata.getAs[Int]("version")
+      val version = versionValue.toLong
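
The failure mode behind this suggestion can be reproduced without Spark. The sketch below is a plain-Java analogy, not Spark's Row implementation: the hypothetical getAs helper casts an untyped cell to whatever type the caller asks for, so a boxed Integer requested as Long fails at the cast.

```java
/** Hypothetical analogy for reading an IntegerType cell as Long. */
final class BoxedCastSketch {
    /** Mimics a generic accessor that casts an untyped cell to the requested type. */
    @SuppressWarnings("unchecked")
    static <T> T getAs(Object cell) {
        return (T) cell; // unchecked here; the real cast happens at the call site
    }

    /** Returns true when asking for Long on a boxed Integer throws ClassCastException. */
    static boolean readingIntAsLongFails() {
        Object cell = Integer.valueOf(3); // an integer column value, boxed
        try {
            Long wrong = BoxedCastSketch.<Long>getAs(cell);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    /** Reads the value with its real type first, then widens it explicitly. */
    static long readIntThenWiden() {
        Object cell = Integer.valueOf(3);
        Integer value = BoxedCastSketch.<Integer>getAs(cell);
        return value.longValue();
    }

    public static void main(String[] args) {
        System.out.println(readingIntAsLongFails());
        System.out.println(readIntThenWiden());
    }
}
```

Reading with the column's actual type and widening afterwards is the same shape as the suggested getAs[Int] followed by toLong.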

jiayuasu requested a review from Copilot March 21, 2026 20:09

jiayuasu added this to the sedona-1.9.0 milestone Mar 21, 2026

Copilot started reviewing on behalf of jiayuasu March 21, 2026 20:09

Copilot AI reviewed Mar 21, 2026

jiayuasu marked this pull request as draft March 22, 2026 22:17

jiayuasu added 2 commits March 24, 2026 00:40

jiayuasu force-pushed the feature/osm-pbf-metadata-fields branch from dc520de to 5ae595e Compare March 24, 2026 07:52

jiayuasu requested a review from Copilot March 24, 2026 07:57

Copilot started reviewing on behalf of jiayuasu March 24, 2026 07:58

Copilot AI reviewed Mar 24, 2026

jiayuasu wants to merge 2 commits into master from feature/osm-pbf-metadata-fields