Skip to content

Spark 4.1: native_datafusion bytesRead task metric off by 6-14x vs Spark #4194

@andygrove

Description

@andygrove

Sub-issue of #4098.

Description

Three tests fail in `Spark 4.1, JDK 17/auto [exec]` (`CometTaskMetricsSuite`):

  • `native_datafusion scan reports task-level input metrics matching Spark`
  • `input metrics aggregate across multiple native scans in a join`
  • `input metrics aggregate across multiple native scans in a union`

Symptom (one example):

```
9.6 was greater than or equal to 0.7, but 9.6 was not less than or equal to 1.3
bytesRead ratio out of range: comet=90498, spark=9427, ratio=9.6
```

Two more failures with similar 6.4 and 13.9 ratios.

Suspected root cause

Spark 4.1 changed what `inputMetrics.bytesRead` accounts for, most likely now reports a smaller subset (e.g. only bytes actually read into row buffers, versus full Parquet footer plus row group). Compare `ParquetFileReader` / `PartitionedFile` accounting between 4.0 and 4.1 and adjust Comet's metric source accordingly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:scanParquet scan / data readingbugSomething isn't workingpriority:lowMinor issues, test failures, tooling, cosmeticspark 4.1

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions