[GH-2650] Fix warning message when reading shapefiles from S3#2655
Merged
[GH-2650] Fix warning message when reading shapefiles from S3#2655
Conversation
Override fileIndex in ShapefileTable, GeoPackageTable, and GeoParquetMetadataTable to skip FileStreamSink.hasMetadata check. Spark's FileTable.fileIndex calls FileStreamSink.hasMetadata which tries to stat the input paths as directories. For shapefile paths that get transformed to glob patterns (e.g., file.???), this causes FileNotFoundException warnings on S3. The fix creates SedonaFileIndexHelper in the org.apache.spark.sql package to access the package-private DataSource.checkAndGlobPathIfNecessary method directly, bypassing the streaming metadata check that is irrelevant for batch read-only sources. Fixes #2650
Add a test that attaches a Log4j appender to capture WARN messages, reads a shapefile by .shp path (which triggers the glob transform to file.???), and asserts that no 'Assume no metadata directory' warning is emitted. The test fails without the fileIndex override fix and passes with it, confirming the fix for #2650.
0c94f0b to
99ae31d
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes spurious FileNotFoundException WARN logs emitted by Spark’s FileStreamSink.hasMetadata() when Sedona file-based DataSource V2 tables (notably Shapefile) pass globbed paths (e.g., file.???) on cloud storage such as S3.
Changes:
- Introduce
SedonaFileIndexHelperto build anInMemoryFileIndexwithout triggering Spark’s streaming-metadata directory check. - Override
fileIndexin Shapefile / GeoPackage / GeoParquetMetadata tables across Spark 3.4, 3.5, 4.0, 4.1 to use the helper. - Add regression tests (all Spark versions) asserting that reading a
.shppath does not emit the “Assume no metadata directory” warning.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/SedonaFileIndexHelper.scala | New helper to construct a file index while bypassing FileStreamSink.hasMetadata() checks. |
| spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.4/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-3.5/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.0/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/shapefile/ShapefileTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/main/scala/org/apache/sedona/sql/datasources/geopackage/GeoPackageTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/main/scala/org/apache/spark/sql/execution/datasources/v2/geoparquet/metadata/GeoParquetMetadataTable.scala | Override fileIndex to use SedonaFileIndexHelper. |
| spark/spark-4.1/src/test/scala/org/apache/sedona/sql/ShapefileTests.scala | Regression test capturing WARN logs to ensure no metadata-directory warning is emitted. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...common/src/main/scala/org/apache/spark/sql/execution/datasources/SedonaFileIndexHelper.scala
Outdated
Show resolved
Hide resolved
...common/src/main/scala/org/apache/spark/sql/execution/datasources/SedonaFileIndexHelper.scala
Outdated
Show resolved
Hide resolved
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Warning message when reading shapefiles from public s3 buckets #2650What changes were proposed in this PR?
When reading shapefiles from S3 using Spark DataSource V2, users see spurious
FileNotFoundExceptionwarnings:Root cause: Spark's
FileTable.fileIndexlazy val callsFileStreamSink.hasMetadata()which tries to stat the input paths as directories. For shapefiles,ShapefileDataSource.transformPaths()converts.shppaths to glob patterns (e.g.,file.???). WhenhasMetadatatriesfs.getFileStatus(new Path("file.???"))on S3, it throwsFileNotFoundExceptionwhich is caught and logged as a WARN. This check is only relevant for streaming sinks, not batch read-only sources.Fix: Override
fileIndexinShapefileTable,GeoPackageTable, andGeoParquetMetadataTableto construct theInMemoryFileIndexdirectly, skipping the irrelevantFileStreamSink.hasMetadatacheck. SinceDataSource.checkAndGlobPathIfNecessaryis package-private toorg.apache.spark.sql, a bridge helperSedonaFileIndexHelperis placed in theorg.apache.spark.sql.execution.datasourcespackage within thespark/commonmodule.Changes:
SedonaFileIndexHelper.scalainspark/common-- bridge to access package-privateDataSource.checkAndGlobPathIfNecessaryShapefileTable.scala(4 Spark versions) -- overridefileIndexGeoPackageTable.scala(4 Spark versions) -- overridefileIndexGeoParquetMetadataTable.scala(4 Spark versions) -- overridefileIndexHow was this patch tested?
All existing tests pass across all 4 Spark versions (3.4, 3.5, 4.0, 4.1).
A new regression test was added to
ShapefileTestsin all 4 Spark versions: reading shapefile by .shp path should not produce FileStreamSink metadata warning. The test:.shppath (triggering thetransformPathsglob conversion tofile.???)The test was verified to fail without the fix (capturing the exact warning about looking for metadata directory in the path
datatypes1.???) and pass with the fix.Did this PR include necessary documentation updates?