Conversation

@AryanBagade
Contributor

Rationale for this change

This PR enforces the clippy::needless_pass_by_value lint rule to prevent unnecessary data clones and improve performance in the datafusion-datasource crate. This is part of the effort tracked in #18503 to enforce this lint rule across all DataFusion crates.
Functions that take ownership of values (pass-by-value) when they only need to read them force callers to .clone() data unnecessarily, which degrades performance. By changing these functions to accept references instead, we eliminate these unnecessary clones.
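As a minimal sketch of the problem described above (the function names here are hypothetical and are not taken from the datafusion-datasource crate), compare a function that takes ownership of data it only reads with one that borrows it:

```rust
// Hypothetical helper names, for illustration only; not DataFusion APIs.

/// Takes ownership although it only reads: this is the shape that
/// `clippy::needless_pass_by_value` reports.
fn total_len_by_value(names: Vec<String>) -> usize {
    names.iter().map(|n| n.len()).sum()
}

/// Borrows instead, so callers can keep using their data without cloning.
fn total_len_by_ref(names: &[String]) -> usize {
    names.iter().map(String::len).sum()
}

/// Calls both variants and returns their results, plus the number of names
/// still available to the caller afterwards.
fn compare_call_sites() -> (usize, usize, usize) {
    let names = vec!["memory".to_string(), "source".to_string()];
    // The caller still needs `names` later, so the by-value version forces a deep clone.
    let by_value = total_len_by_value(names.clone());
    // The by-reference version reads the same data with no copy.
    let by_ref = total_len_by_ref(&names);
    (by_value, by_ref, names.len())
}
```

Both variants compute the same result; only the by-value call site has to pay for the clone.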

What changes are included in this PR?

  • Added lint rule enforcement to datafusion/datasource/src/mod.rs
  • Fixed 11 violations of clippy::needless_pass_by_value across 5 files:
    • file_scan_config.rs: 2 fixes
    • memory.rs: 3 fixes
    • source.rs: 1 fix
    • statistics.rs: 4 fixes
    • write/demux.rs: 1 fix
  • Updated callers in datafusion-core and datafusion-catalog-listing to pass references
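The exact attributes added to datafusion/datasource/src/mod.rs are not shown in this conversation, so the following is only a sketch of how such enforcement could look. It assumes a `deny` level and uses a hypothetical module and function; in a real crate the attribute would more typically be written as an inner attribute (`#![deny(...)]`) at the crate root.

```rust
// Sketch only: the lint level (`deny`) and the names below are assumptions,
// not the contents of datafusion/datasource/src/mod.rs.
#[deny(clippy::needless_pass_by_value)]
mod linted {
    /// Hypothetical function already in the fixed form: it borrows the
    /// slice instead of taking the owned collection, so the lint stays quiet.
    pub fn count_nonempty(groups: &[Vec<u32>]) -> usize {
        groups.iter().filter(|g| !g.is_empty()).count()
    }
}
```

The attribute only has an effect when Clippy runs (for example via `cargo clippy`); a plain compiler build accepts it without checking the lint.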

Are these changes tested?

Yes, all changes are tested:

  • ✅ All 82 unit tests pass (cargo test -p datafusion-datasource)
  • ✅ All 7 doc tests pass
  • ✅ Strict clippy checks pass with -D warnings
  • ✅ CI lint script passes (./dev/rust_lint.sh)
  • ✅ Dependent crates (datafusion-catalog-listing, datafusion-core) pass all tests and clippy checks

The changes are covered by existing tests, since this is a refactoring that changes internal function signatures without changing behavior.

Are there any user-facing changes?

No user-facing changes. All changes are internal to the datafusion-datasource crate. The public API remains unchanged - only internal function signatures were modified to accept references instead of owned values.

Fixes #18611
Part of #18503

…on-datasource

This commit enforces the `clippy::needless_pass_by_value` lint rule to prevent unnecessary data clones and improve performance in the datafusion-datasource crate.

Changes:
- Added lint rule to datafusion/datasource/src/mod.rs
- Fixed 11 violations across 5 files by changing pass-by-value to pass-by-reference
- Updated callers in datafusion-core and datafusion-catalog-listing

Fixes apache#18611
Part of apache#18503
@github-actions bot added the core (Core DataFusion crate), catalog (Related to the catalog crate), and datasource (Changes to the datasource crate) labels on Nov 13, 2025
@Jefffrey changed the title from "chore: Enforce lint rule clippy::needless_pass_by_value to datafusi…" to "chore: Enforce lint rule clippy::needless_pass_by_value to datafusion-datasource" on Nov 14, 2025
Contributor

@2010YOUY01 left a comment

Thank you for working on this. I suggest reverting several of the public API changes; otherwise it looks good.

pub fn project(
&mut self,
-    file_batch: RecordBatch,
+    file_batch: &RecordBatch,
Contributor

I suggest reverting this change and suppressing the lint with #[expect(clippy::needless_pass_by_value)] instead, for two reasons:

  1. This is a public API, so it is better to leave it unchanged; downstream DataFusion dependents then don't have to update their code during upgrades.
  2. Cloning a RecordBatch is shallow: the heavy inner payloads are shared rather than copied.
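Both points can be illustrated with a small, self-contained sketch. The `Batch` type below is a stand-in that only mimics the "columns behind reference-counted pointers" layout; it is not the Arrow `RecordBatch`, and `row_count` is a hypothetical function. The `#[expect]` attribute needs Rust 1.81 or later.

```rust
use std::sync::Arc;

/// Stand-in for a record batch (not the Arrow type): columns sit behind `Arc`,
/// so `clone()` copies pointers rather than column data.
#[derive(Clone)]
pub struct Batch {
    columns: Vec<Arc<Vec<i64>>>,
}

/// Hypothetical public function kept by value to preserve its signature.
/// `#[expect]` silences the lint here; unlike `#[allow]`, it lets Clippy
/// flag the attribute itself once the lint no longer applies.
#[expect(clippy::needless_pass_by_value)]
pub fn row_count(batch: Batch) -> usize {
    batch.columns.first().map_or(0, |c| c.len())
}

/// Returns the row count and whether the clone shares column storage
/// with the original.
pub fn shallow_clone_demo() -> (usize, bool) {
    let original = Batch { columns: vec![Arc::new(vec![1, 2, 3])] };
    // Cloning duplicates the `Vec` of pointers, not the column contents.
    let copy = original.clone();
    let shared = Arc::ptr_eq(&original.columns[0], &copy.columns[0]);
    (row_count(copy), shared)
}
```

Because the clone only bumps reference counts, keeping the by-value signature costs callers little while avoiding a breaking change.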

Contributor Author

done

}

/// Create a new execution plan from a list of constant values (`ValuesExec`)
pub fn try_new_as_values(
Contributor

ditto

/// batches are provided.
pub fn try_new_from_batches(
-    schema: SchemaRef,
+    schema: &SchemaRef,
Contributor

ditto

Self::new(&min_max_sort_order, &min_max_schema, &min_batch, &max_batch)
}

pub fn new(
Contributor

ditto

/// A tuple containing:
/// * The processed file groups with their individual statistics attached
/// * The summary statistics across all file groups, aka all files summary statistics
pub fn compute_all_files_statistics(
Contributor

ditto

- Reverted public API changes to maintain stability
- Added #[expect(clippy::needless_pass_by_value)] to public methods
- Kept all internal/private function improvements
- RecordBatch clone is shallow, performance impact is minimal
@github-actions bot removed the core (Core DataFusion crate) and catalog (Related to the catalog crate) labels on Nov 14, 2025
Contributor

@2010YOUY01 left a comment

Thanks again.

The CI failure on Linux might be due to #18692. It passes on macOS, so I don't think we should worry about it.

@AryanBagade
Contributor Author

AryanBagade commented Nov 14, 2025

The CI failure on Linux might be due to #18692. It passes on macOS, so I don't think we should worry about it.

yay all tests were passing on my device!!!
thanks

@Jefffrey Jefffrey added this pull request to the merge queue Nov 14, 2025
Merged via the queue into apache:main with commit 1dddf03 Nov 14, 2025
51 of 52 checks passed
jizezhang pushed a commit to jizezhang/datafusion that referenced this pull request Nov 15, 2025
…on-datasource (apache#18682)

Labels

datasource Changes to the datasource crate

Development

Successfully merging this pull request may close these issues.

Enforce lint rule clippy::needless_pass_by_value to datafusion-datasource

3 participants