Skip to content

Poor documentation and/or test coverage of TableProvider::statistics #13439

@rroelke

Description

@rroelke

The extent of the documentation for TableProvider::statistics in version 43.0.0 is:

Get statistics for this table, if available

This offers no explanation as to how the statistics will or will not be used.

A user with experience in analytical database engines writing a custom TableProvider implementation may suspect that TableProvider::statistics is used by the datafusion query optimizer to determine join orders, perhaps among other things.

However, this conclusion is apparently incorrect, which I deduce from the following pieces of evidence:

  1. I am a user fitting that description and found that my custom TableProvider::statistics was not called in the presence of a join query
  2. cargo check --workspace --tests runs with no errors if I remove the fn statistics declaration from the trait TableProvider definition
  3. having found the source code for the rule which changes join orders it is clear that it calls ExecutionPlan::statistics instead.

Expectation

The documentation should set appropriate expectations for what TableProvider::statistics is used for, so that developers can make informed choices about whether or not to implement it.

Additional context

The apparent answer to what TableProvider::statistics is used for is "nothing" based on the cargo check --workspace --tests comment above, but removing the trait method is a breaking change. Based on the slack discussion prior to filing this issue, at least one user is depending on TableProvider::statistics for their custom optimizer rules and removing it would require them to find a workaround.

Short of deprecating or removing the trait method, I would personally be satisfied just with updates to the method documentation.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions