Conversation
> Think of these as a funnel: `TableProvider::scan()` is called once during
> planning to create an `ExecutionPlan`, then `ExecutionPlan::execute()` is called
> once per partition to create a stream, and those streams are where rows are
> actually produced during execution.
Would it be useful to know how these types relate to physical vs logical planning?
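The funnel described in the quoted text can be sketched with plain Rust types. This is only an illustration under stated assumptions: `Table` and `Plan` are stand-ins, not DataFusion's real `TableProvider` and `ExecutionPlan` traits, and a plain iterator stands in for `SendableRecordBatchStream`.

```rust
// Toy model of the three layers: scan() runs once at plan time,
// execute() runs once per partition, and the returned iterator is
// where rows are actually produced.
struct Table;

struct Plan {
    partitions: usize,
}

impl Table {
    // Layer 1: called once during planning; only *describes* the work.
    fn scan(&self) -> Plan {
        Plan { partitions: 4 }
    }
}

impl Plan {
    // Layer 2: called once per partition; returns a lazy "stream".
    fn execute(&self, partition: usize) -> impl Iterator<Item = u64> {
        let start = (partition * 1000) as u64;
        start..start + 3 // Layer 3: rows are produced only when polled
    }
}

fn main() {
    let plan = Table.scan(); // planning: cheap, no I/O
    for p in 0..plan.partitions {
        let rows: Vec<u64> = plan.execute(p).collect(); // execution
        println!("partition {p}: {rows:?}");
    }
}
```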
> Since `execute()` is called once per partition, partitioning directly controls
> the parallelism of your table scan. Each partition runs on its own task, so
> more partitions means more concurrent work -- up to the number of available
> cores.
It would probably be good to briefly explain the relationship between tasks and threads. For example: is it ok to have a lot more tasks than cores, or should you cap the number of partitions (and thus tasks) you expose?
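A dependency-free sketch of the partition-per-task model in question. DataFusion actually schedules partitions as tokio tasks on a thread pool, not as raw OS threads; threads are used here only so the example is self-contained, and the oversubscription factor is arbitrary.

```rust
use std::thread;

// Run one "task" per partition and return each task's result.
// (Stand-in for tokio tasks: the point is that the scheduler
// multiplexes many tasks over a fixed number of cores.)
fn run_partitions(partitions: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..partitions)
        .map(|p| thread::spawn(move || p * 10))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // More partitions than cores is fine -- they just queue -- though very
    // large partition counts add per-task overhead for no extra parallelism.
    let results = run_partitions(cores * 4);
    println!("ran {} partitions on {cores} cores", results.len());
}
```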
**2010YOUY01** left a comment:
LGTM. I read through it and found the concepts well explained and easy to follow. One follow-up after publishing would be to link this blog from the doc comments of related APIs such as `TableProvider`.
> Here is a minimal but complete example of a custom table provider that generates
> data lazily during streaming:
>
> ```rust
Perhaps we can move it to datafusion-examples, or maybe we already have a similar one and can link it here directly.
> the `us-east-1` partition. If that partition holds 100 million rows, you have
> just eliminated 90% of the I/O. DataFusion still applies the `event_type`
> filter via `FilterExec` if you reported it as `Unsupported`.
> ### Only Push Down Filters When the Data Source Can Do Better
>
> DataFusion already pushes filters as close to the data source as possible, typically placing them directly above the scan. `FilterExec` is also highly optimized, with vectorized evaluation and type-specialized kernels for fast predicate evaluation.
>
> Because of this, you should only implement filter pushdown when your data source can do strictly better -- for example, by skipping data early using metadata to avoid I/O. If your data source cannot eliminate I/O in this way, it is usually better to let DataFusion handle the filter, as its in-memory execution is already highly efficient (unless there are additional opportunities for deeper, application-specific optimizations).
Here is one clarification about filter pushdown that I think is important to mention. I have drafted it for you to consider.
**pgwhalen** left a comment:
As someone who struggled in the past, I'm thrilled to see this get created now! I added some comments that highlight my biggest struggles.
> |---|---|---|
> | Already in `RecordBatch`es in memory | [`MemTable`] | Nothing -- just construct it |
> | An async stream of batches | [`StreamTable`] | A stream factory |
> | A table with known sort order | [`SortedTableProvider`] wrapping another provider | The inner provider |
`SortedTableProvider` is only implemented in tests, right? Not that it isn't useful as a reference, but this section makes it seem like it's meant to be built on top of.
> 3. **Resource management breaks down.** DataFusion manages concurrency and
>    memory during execution. Work done during planning bypasses these controls.
>
> ## Filter Pushdown: Doing Less Work
(Just a thought, since I understand this is not a trivial ask).
As someone who struggled with implementing filter pushdown in a custom table provider before, it would be helpful to have a section with examples of actually implementing it, not just specifying that the table provider supports it. Going from `filters: &[Expr]` to something more useful was pretty intimidating for me as a newbie. I ended up learning about `LiteralGuarantee::analyze` from the pruning_predicate.rs example, and that was the big unlock, but as someone not familiar with database internals terminology it was non-obvious initially.
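To make the comment above concrete, here is a toy illustration of the *kind* of analysis `LiteralGuarantee::analyze` performs: walking a filter and extracting "this column must equal one of these literals" facts a provider can prune with. The `Expr` enum below is a hypothetical stand-in, not DataFusion's real expression type.

```rust
// Stand-in expression tree (NOT datafusion::logical_expr::Expr).
#[derive(Debug)]
enum Expr {
    Eq(String, String),          // col = literal
    InList(String, Vec<String>), // col IN (literals)
    And(Box<Expr>, Box<Expr>),   // conjunction: both sides must hold
}

// Extract (column, allowed literal values) guarantees implied by the filter.
// Only conjunctions preserve guarantees; a real analysis also handles OR,
// negation, and non-literal terms.
fn guarantees(expr: &Expr) -> Vec<(String, Vec<String>)> {
    match expr {
        Expr::Eq(col, lit) => vec![(col.clone(), vec![lit.clone()])],
        Expr::InList(col, lits) => vec![(col.clone(), lits.clone())],
        Expr::And(l, r) => {
            let mut out = guarantees(l);
            out.extend(guarantees(r));
            out
        }
    }
}

fn main() {
    let filter = Expr::And(
        Box::new(Expr::Eq("region".into(), "us-east-1".into())),
        Box::new(Expr::InList(
            "event_type".into(),
            vec!["click".into(), "view".into()],
        )),
    );
    for (col, vals) in guarantees(&filter) {
        println!("{col} must be one of {vals:?}"); // drives partition pruning
    }
}
```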
> scratch. DataFusion provides building blocks that let you plug in at whatever
> level makes sense:
>
> | If your data is... | Start with | You implement |
You might consider adding "custom DataSource for a ListingTable" as an option here, or at least acknowledge it.
I wrote a custom table provider a year or two ago and struggled initially, mostly because I tried to make it a listing table. It was the wrong decision in retrospect, but at the time I was convinced it was solving problems for me, because I was reading from multiple mmapped files composing a logical table. I blindly emulated the ParquetSource and made my own versions of so many layers (FileOpener, FileFormat, FileSource), tying myself into lots of assumptions about partitioning etc. that I didn't fully understand up front.
> - [TableProvider API docs][`TableProvider`]
> - [ExecutionPlan API docs][`ExecutionPlan`]
> - [SendableRecordBatchStream API docs][`SendableRecordBatchStream`]
It looks like you meant to add links here.
> ### Keep `scan()` Lightweight
>
> This is a critical point: **`scan()` runs during planning, not execution.** It
> should return quickly. Best practice is to avoid performing I/O, network
> calls, or heavy computation here. The `scan` method's job is to *describe* how
> the data will be produced, not to produce it. All the real work belongs in the
> stream (Layer 3).
>
> A common pitfall is to fetch data or open connections in `scan()`. This blocks
> the planning thread and can cause timeouts or deadlocks, especially if the query
> involves multiple tables or subqueries that all need to be planned before
> execution begins.
I want to agree with this, but in practice doesn't ListingTable do this? Some of our examples only avoid it because they pre-load all of the data (which is not realistic for a production system) (e.g.).
We might also want to document `scan_with_args()` instead of `scan()`.
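A stdlib-only sketch of the describe-now/produce-later split the quoted section recommends. The function and closure names are illustrative; real code returns an `ExecutionPlan` whose `execute()` builds a `SendableRecordBatchStream` rather than a closure.

```rust
// scan() only captures *how* to get the data; nothing is opened or read
// here, so planning stays cheap even across many tables.
fn scan(path: String) -> impl FnOnce() -> Vec<String> {
    move || {
        // Stand-in for the expensive I/O, deferred to execution time:
        // this body does not run until the "stream" is actually consumed.
        vec![format!("row from {path}")]
    }
}

fn main() {
    let stream_factory = scan("events.parquet".to_string()); // planning
    let rows = stream_factory(); // execution: the real work happens here
    println!("{rows:?}");
}
```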
Closes apache/datafusion#16821
This blog post is designed to help new users of DataFusion write their own table providers and understand some of the core concepts.
Preview site: https://datafusion.staged.apache.org/blog/2026/03/20/writing-table-providers/