docs: Update Parquet scan documentation #3433

Open
andygrove wants to merge 14 commits into apache:main from andygrove:native_comet_docs

andygrove (Member) commented Feb 6, 2026

Overview

This PR removes all references to the deprecated native_comet scan implementation from the documentation and
configuration, and improves the accuracy and clarity of the Parquet scan documentation.

Changed Files

common/src/main/scala/org/apache/comet/CometConf.scala

  • Changed the category of spark.comet.scan.impl from CATEGORY_SCAN to CATEGORY_PARQUET
  • Rewrote the doc string to describe native_datafusion and native_iceberg_compat without referencing
    native_comet
  • Removed the .internal() marker, making this configuration visible to users
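
With the setting no longer internal, users may want to select a scan explicitly. The following is a minimal sketch of how `spark.comet.scan.impl` could be passed on a `spark-submit` command line. The helper name `scan_impl_conf_args` is hypothetical, and the list of accepted values is an assumption taken from this PR description (including the `auto` mode mentioned under the parquet_scans.md changes), not read from CometConf.scala.

```python
# Illustrative only: builds spark-submit arguments for the scan setting.
# The value list below is an assumption based on this PR description.
KNOWN_SCAN_IMPLS = ("native_datafusion", "native_iceberg_compat", "auto")


def scan_impl_conf_args(impl: str) -> list[str]:
    """Return the `--conf` arguments that select a Comet Parquet scan."""
    if impl not in KNOWN_SCAN_IMPLS:
        raise ValueError(f"unknown scan implementation: {impl!r}")
    return ["--conf", f"spark.comet.scan.impl={impl}"]


if __name__ == "__main__":
    # Prints the arguments to append to a spark-submit invocation.
    print(" ".join(scan_impl_conf_args("native_datafusion")))
```

Rejecting unknown values up front, including the removed `native_comet`, keeps a stale job script from silently carrying an unsupported setting.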

docs/source/contributor-guide/parquet_scans.md

Major rewrite of the Parquet scan documentation:

  • Removed all references to the deprecated native_comet scan (previously listed as one of three implementations)
  • Removed the comparison table that included native_comet and the "benefits over native_comet" section
  • Removed the separate native_comet S3 section (which described Hadoop-AWS-based S3 access)
  • Updated the S3 configuration and examples sections to reference both native_datafusion and native_iceberg_compat
    (previously only referenced native_datafusion)
  • Clarified that auto mode currently always selects native_iceberg_compat
  • Separated limitations into two clear categories:
    • Fallback to Spark (safe): unsupported features that cause Comet to fall back to Spark, producing correct
      results with reduced performance
    • Potential incorrect results: issues that do not fall back and may produce wrong answers (datetime rebasing
      for both scans, hard-coded config defaults for native_iceberg_compat)
  • Added previously undocumented native_datafusion limitations that cause fallback:
    • Dynamic Partition Pruning (DPP)
    • input_file_name(), input_file_block_start(), input_file_block_length() SQL functions
    • Spark metadata columns (e.g., _metadata.file_path)
  • Added Parquet encryption as a shared fallback limitation
  • Fixed misleading wording for ignoreMissingFiles/ignoreCorruptFiles (previously said "not compatible with Spark",
    now clarifies that Comet falls back to Spark when these options are set)
  • Removed stale issue links (#1545, #1758) that referenced old native_datafusion issues

docs/source/contributor-guide/ffi.md

  • Replaced reference to native_comet with a general description of scans that use mutable buffers

docs/source/contributor-guide/roadmap.md

  • Removed the "Removing the native_comet scan implementation" roadmap section (now completed)
  • Simplified the Iceberg integration description by removing the mention of the native_comet to
    native_iceberg_compat transition

andygrove and others added 5 commits February 6, 2026 07:08
Fix grammar, add encryption fallback and native_iceberg_compat
hard-coded config limitations, clarify S3 section applies to both
scan implementations, and remove orphaned link references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andygrove changed the title from "docs: remove all mentions of native_comet scan" to "docs: Update Parquet scan documentation" on Feb 9, 2026
andygrove added this to the 0.14.0 milestone on Feb 9, 2026
andygrove and others added 5 commits February 13, 2026 12:52
Clarify which limitations fall back to Spark vs which may produce
incorrect results. Add missing documented limitations for
native_datafusion (DPP, input_file_name, metadata columns). Fix
misleading wording for ignoreCorruptFiles/ignoreMissingFiles. Note
that auto mode currently always selects native_iceberg_compat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The section intro already states all limitations fall back to Spark,
so individual bullet points don't need to repeat it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Restructure shared and per-scan limitation lists into two clear
categories: features that fall back to Spark (safe) and issues that
may produce incorrect results without falling back. Remove redundant
"Comet falls back to Spark" from individual bullets where the section
intro already states it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andygrove marked this pull request as ready for review on February 13, 2026 20:07