[GH-2650] Fix warning message when reading shapefiles from S3#2655

Merged
jiayuasu merged 3 commits into master from fix/GH-2650-shapefile-s3-warning on Feb 15, 2026

@jiayuasu jiayuasu commented Feb 15, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

When reading shapefiles from S3 using Spark DataSource V2, users see spurious FileNotFoundException warnings:

WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://bucket/path/file.???

Root cause: Spark's FileTable.fileIndex (a lazy val) calls FileStreamSink.hasMetadata(), which tries to stat each input path as a directory. For shapefiles, ShapefileDataSource.transformPaths() converts .shp paths to glob patterns (e.g., file.???). When hasMetadata calls fs.getFileStatus(new Path("file.???")) on S3, nothing exists under that literal name, so the call throws FileNotFoundException, which is caught and logged as a WARN. This check is only relevant for streaming sinks, not for batch read-only sources.
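
The mismatch between a literal stat and a glob expansion can be reproduced without Spark or S3. The following is a minimal sketch on the local file system, assuming a POSIX-style file system where `?` is allowed in a path string. `toGlob` is a hypothetical stand-in for the conversion done by transformPaths, not Sedona's implementation, and `GlobVsStat` is an illustrative name.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object GlobVsStat {
  // Hypothetical stand-in for the .shp -> glob conversion described above:
  // "file.shp" becomes "file.???" so that sibling files (.shx, .dbf) match too.
  def toGlob(shpPath: String): String =
    if (shpPath.toLowerCase.endsWith(".shp")) shpPath.dropRight(3) + "???"
    else shpPath

  // Expand a glob pattern against the file names in one directory.
  def expand(dir: Path, pattern: String): List[String] = {
    val stream = Files.newDirectoryStream(dir, pattern)
    try stream.iterator().asScala.map(_.getFileName.toString).toList.sorted
    finally stream.close()
  }

  // Treat the pattern as a literal file name, as a plain stat call would.
  def literalExists(dir: Path, pattern: String): Boolean =
    Files.exists(dir.resolve(pattern))

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("shp-demo")
    Seq("datatypes1.shp", "datatypes1.shx", "datatypes1.dbf")
      .foreach(name => Files.createFile(dir.resolve(name)))

    val pattern = toGlob("datatypes1.shp")
    println(s"pattern:        $pattern")
    println(s"literal exists: ${literalExists(dir, pattern)}")
    println(s"glob matches:   ${expand(dir, pattern)}")
  }
}
```

The literal lookup finds nothing because no file is actually named `datatypes1.???`, while the glob expansion lists the three component files. The metadata probe behaves like the literal lookup, which is why it fails on a path that is perfectly valid as a glob.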

Fix: Override fileIndex in ShapefileTable, GeoPackageTable, and GeoParquetMetadataTable to construct the InMemoryFileIndex directly, skipping the irrelevant FileStreamSink.hasMetadata check. Since DataSource.checkAndGlobPathIfNecessary is package-private to org.apache.spark.sql, a bridge helper SedonaFileIndexHelper is placed in the org.apache.spark.sql.execution.datasources package within the spark/common module.
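
The shape of the fix can be pictured with a toy model that does not depend on Spark. In this sketch every name (`BaseTable`, `BatchOnlyTable`, `ToyFileIndex`, `hasMetadataProbe`) is a hypothetical stand-in; it only mirrors the control flow described above, where the base class probes for streaming metadata before listing files and the subclass overrides the lazy val to list files directly.

```scala
import scala.collection.mutable.ArrayBuffer

object FileIndexOverrideSketch {
  // Collected warnings, standing in for the WARN log output.
  val warnings: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Stand-in for a file listing result; not Spark's InMemoryFileIndex.
  final case class ToyFileIndex(rootPaths: Seq[String])

  // Stand-in for the streaming-metadata probe: it treats each path as a
  // literal name, so a glob pattern fails the lookup and a warning is recorded.
  def hasMetadataProbe(paths: Seq[String]): Boolean = {
    paths.filter(_.contains("?")).foreach { p =>
      warnings += s"Assume no metadata directory. Error while looking for metadata directory in the path: $p"
    }
    false // the inputs are never the output of a streaming sink in this model
  }

  // Models the default behaviour: probe first, then build the index.
  class BaseTable(val paths: Seq[String]) {
    lazy val fileIndex: ToyFileIndex =
      if (hasMetadataProbe(paths)) ToyFileIndex(paths.take(1))
      else ToyFileIndex(paths)
  }

  // Models the fix: a batch read-only table builds the index directly.
  class BatchOnlyTable(inputPaths: Seq[String]) extends BaseTable(inputPaths) {
    override lazy val fileIndex: ToyFileIndex = ToyFileIndex(paths)
  }

  def main(args: Array[String]): Unit = {
    val globbed = Seq("s3a://bucket/path/file.???")

    new BaseTable(globbed).fileIndex
    println(s"warnings after base table:     ${warnings.size}")

    warnings.clear()
    new BatchOnlyTable(globbed).fileIndex
    println(s"warnings after override table: ${warnings.size}")
  }
}
```

Both tables end up with the same root paths; the only difference in this model is whether the probe runs, which is the property the real override relies on.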

Changes:

  • New: SedonaFileIndexHelper.scala in spark/common -- bridge to access package-private DataSource.checkAndGlobPathIfNecessary
  • Modified: ShapefileTable.scala (4 Spark versions) -- override fileIndex
  • Modified: GeoPackageTable.scala (4 Spark versions) -- override fileIndex
  • Modified: GeoParquetMetadataTable.scala (4 Spark versions) -- override fileIndex
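
Why the helper has to live under Spark's package tree comes down to Scala's qualified access modifiers. The sketch below imitates that with nested objects instead of real packages so that it fits in one file; `SqlScope`, `DataSourceLike`, `Datasources` and `FileIndexBridge` are hypothetical names, and `globPaths` is a placeholder rather than the real checkAndGlobPathIfNecessary.

```scala
// Plays the role of the org.apache.spark.sql package.
object SqlScope {

  object DataSourceLike {
    // Visible only to code nested inside SqlScope, like a private[sql] member.
    private[SqlScope] def globPaths(paths: Seq[String]): Seq[String] =
      paths.map(_.trim).filter(_.nonEmpty)
  }

  // Plays the role of a sub-package such as execution.datasources.
  object Datasources {
    // The bridge: a public method placed inside the scope that is allowed
    // to call the restricted member, re-exposing it to outside callers.
    object FileIndexBridge {
      def resolve(paths: Seq[String]): Seq[String] =
        DataSourceLike.globPaths(paths)
    }
  }
}

object BridgeDemo {
  def main(args: Array[String]): Unit = {
    // Code outside SqlScope can call the bridge...
    println(SqlScope.Datasources.FileIndexBridge.resolve(Seq(" file.??? ", "")))
    // ...but a direct call such as
    //   SqlScope.DataSourceLike.globPaths(Seq("file.???"))
    // would be rejected by the compiler from here.
  }
}
```

The trade-off of this pattern is that the bridge depends on a member the upstream project does not treat as public, which is presumably why the PR keeps it in a single small helper rather than spreading the call across the table classes.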

How was this patch tested?

All existing tests pass across all 4 Spark versions (3.4, 3.5, 4.0, 4.1).

A new regression test was added to ShapefileTests in all 4 Spark versions: "reading shapefile by .shp path should not produce FileStreamSink metadata warning". The test:

  1. Attaches a custom Log4j appender to capture WARN messages at runtime
  2. Reads a shapefile by .shp path (triggering the transformPaths glob conversion to file.???)
  3. Asserts no "Assume no metadata directory" warning was captured
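
The capture-and-assert pattern in these steps can be sketched as follows. This is not the PR's test: it swaps in `java.util.logging` for Log4j so the example runs with the standard library alone, and `CapturingHandler`, `captureWarnings` and the logger name are hypothetical. With Log4j the same idea would use a custom appender attached to the relevant logger.

```scala
import java.util.logging.{Handler, Level, LogRecord, Logger}
import scala.collection.mutable.ArrayBuffer

object WarningCaptureSketch {
  // Records the message of every record at WARNING level or above.
  final class CapturingHandler extends Handler {
    val messages: ArrayBuffer[String] = ArrayBuffer.empty[String]

    override def publish(record: LogRecord): Unit =
      if (record.getLevel.intValue >= Level.WARNING.intValue) {
        messages.synchronized {
          messages += String.valueOf(record.getMessage)
          ()
        }
      }

    override def flush(): Unit = ()
    override def close(): Unit = ()
  }

  // Runs `body` with a capturing handler attached, and always detaches it.
  def captureWarnings(logger: Logger)(body: => Unit): List[String] = {
    val handler = new CapturingHandler
    logger.addHandler(handler)
    try body
    finally logger.removeHandler(handler)
    handler.messages.toList
  }

  def main(args: Array[String]): Unit = {
    val logger = Logger.getLogger("sketch.FileStreamSink") // hypothetical name

    val captured = captureWarnings(logger) {
      // The real test reads a shapefile by its .shp path here.
      logger.info("listing files for datatypes1.???")
    }

    assert(
      !captured.exists(_.contains("Assume no metadata directory")),
      s"unexpected metadata warning: $captured")
    println("no metadata warning captured")
  }
}
```

Detaching the handler in a `finally` block keeps a failing read from leaking the handler into later tests, which matters when many tests share one logging configuration.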

The test was verified to fail without the fix (capturing the exact warning about looking for metadata directory in the path datatypes1.???) and pass with the fix.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

@jiayuasu jiayuasu marked this pull request as draft February 15, 2026 07:57
@jiayuasu jiayuasu changed the title from "Fix warning message when reading shapefiles from S3" to "[GH-2650] Fix warning message when reading shapefiles from S3" Feb 15, 2026
Override fileIndex in ShapefileTable, GeoPackageTable, and
GeoParquetMetadataTable to skip FileStreamSink.hasMetadata check.

Spark's FileTable.fileIndex calls FileStreamSink.hasMetadata which tries
to stat the input paths as directories. For shapefile paths that get
transformed to glob patterns (e.g., file.???), this causes
FileNotFoundException warnings on S3.

The fix creates SedonaFileIndexHelper in the org.apache.spark.sql.execution.datasources package
to access the package-private DataSource.checkAndGlobPathIfNecessary
method directly, bypassing the streaming metadata check that is
irrelevant for batch read-only sources.

Fixes #2650

Add a test that attaches a Log4j appender to capture WARN messages, reads
a shapefile by .shp path (which triggers the glob transform to file.???),
and asserts that no 'Assume no metadata directory' warning is emitted.

The test fails without the fileIndex override fix and passes with it,
confirming the fix for #2650.

Copilot AI left a comment

Pull request overview

Fixes spurious FileNotFoundException WARN logs emitted by Spark’s FileStreamSink.hasMetadata() when Sedona file-based DataSource V2 tables (notably Shapefile) pass globbed paths (e.g., file.???) on cloud storage such as S3.

Changes:

  • Introduce SedonaFileIndexHelper to build an InMemoryFileIndex without triggering Spark’s streaming-metadata directory check.
  • Override fileIndex in Shapefile / GeoPackage / GeoParquetMetadata tables across Spark 3.4, 3.5, 4.0, 4.1 to use the helper.
  • Add regression tests (all Spark versions) asserting that reading a .shp path does not emit the “Assume no metadata directory” warning.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/SedonaFileIndexHelper.scala | New helper to construct a file index while bypassing FileStreamSink.hasMetadata() checks. |
| spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |

@jiayuasu jiayuasu marked this pull request as ready for review February 15, 2026 08:41
@jiayuasu jiayuasu merged commit 1e6303e into master Feb 15, 2026
40 checks passed

Development

Successfully merging this pull request may close these issues.

Warning message when reading shapefiles from public s3 buckets
