
Blog post on writing table providers#161

Open
timsaucer wants to merge 3 commits into main from site/writing-table-providers

Conversation


@timsaucer (Member) commented Mar 20, 2026

Closes apache/datafusion#16821

This blog post is designed to help new users of DataFusion write their own table providers and understand some of the core concepts.

Preview site: https://datafusion.staged.apache.org/blog/2026/03/20/writing-table-providers/

@timsaucer timsaucer marked this pull request as ready for review March 20, 2026 22:09

@stuhood left a comment


Thanks for doing this!

Comment on lines +49 to +52
Think of these as a funnel: `TableProvider::scan()` is called once during
planning to create an `ExecutionPlan`, then `ExecutionPlan::execute()` is called
once per partition to create a stream, and those streams are where rows are
actually produced during execution.
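The funnel described in this excerpt can be illustrated with a self-contained sketch. Note these are stand-in types and free functions, not DataFusion's actual `TableProvider`/`ExecutionPlan` traits; the point is only the call pattern: plan once, execute once per partition, produce rows lazily.

```rust
// Stand-in for the three layers -- NOT the real DataFusion API, just the shape.

// Layer 1: "TableProvider::scan()" runs once during planning and returns a plan.
struct Plan {
    partitions: usize,
}

fn scan() -> Plan {
    // Cheap: only describes how data will be produced.
    Plan { partitions: 4 }
}

// Layer 2: "ExecutionPlan::execute()" runs once per partition and returns a
// stream (modeled here as an iterator) that lazily produces rows.
fn execute(_plan: &Plan, partition: usize) -> impl Iterator<Item = u64> {
    let base = (partition * 10) as u64;
    // Layer 3: rows are only produced when the stream is polled.
    (0..3).map(move |i| base + i)
}

fn main() {
    let plan = scan(); // called once during planning
    let mut rows = Vec::new();
    for p in 0..plan.partitions {
        rows.extend(execute(&plan, p)); // called once per partition
    }
    assert_eq!(rows.len(), 12);
    println!("{rows:?}");
}
```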

Would it be useful to know how these types relate to physical vs logical planning?

Comment on lines +191 to +194
Since `execute()` is called once per partition, partitioning directly controls
the parallelism of your table scan. Each partition runs on its own task, so
more partitions means more concurrent work -- up to the number of available
cores.
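One way to act on the "up to the number of available cores" point is to cap the partition count a provider exposes. The helper below is hypothetical (not a DataFusion API), but `std::thread::available_parallelism` is standard Rust and is a reasonable ceiling when a table has many natural splits:

```rust
use std::thread;

// Hypothetical helper (not a DataFusion API): cap the number of partitions a
// provider exposes at the number of available cores, since partitions beyond
// that mostly add scheduling overhead rather than extra parallelism.
fn choose_partition_count(natural_splits: usize) -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    natural_splits.min(cores).max(1)
}

fn main() {
    // e.g. a table naturally split into 64 files on an 8-core machine
    // would expose 8 partitions.
    let n = choose_partition_count(64);
    assert!(n >= 1);
    println!("partitions: {n}");
}
```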

It would probably be good to briefly explain the relationship between tasks and threads. For example: is it ok to have a lot more tasks than cores, or should you cap the number of partitions (and thus tasks) you expose?

@2010YOUY01 (Contributor) left a comment


LGTM. I read through it and found the concepts well explained and easy to follow. One follow-up after publishing would be to link this blog from the doc comments of related APIs such as TableProvider.

Here is a minimal but complete example of a custom table provider that generates
data lazily during streaming:

```rust
// (example code elided in this quoted excerpt)
```

Perhaps we can move it to datafusion-examples, or maybe we already have a similar one that we could link directly here.

the `us-east-1` partition. If that partition holds 100 million rows, you have
just eliminated 90% of the I/O. DataFusion still applies the `event_type`
filter via `FilterExec` if you reported it as `Unsupported`.
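The partition-elimination idea in this excerpt can be shown concretely. The types below are stand-ins (not DataFusion's), but they capture the mechanism: when a filter constrains a partition column such as `region`, whole partitions can be skipped before any I/O happens.

```rust
// Stand-in sketch of partition pruning (hypothetical types, not DataFusion's).
struct Partition {
    region: &'static str,
    path: &'static str,
}

// Keep only partitions whose partition-column value matches the filter.
fn prune<'a>(parts: &'a [Partition], wanted_region: &str) -> Vec<&'a Partition> {
    parts.iter().filter(|p| p.region == wanted_region).collect()
}

fn main() {
    let parts = [
        Partition { region: "us-east-1", path: "data/us-east-1/" },
        Partition { region: "eu-west-1", path: "data/eu-west-1/" },
        Partition { region: "ap-south-1", path: "data/ap-south-1/" },
    ];
    // Only us-east-1 survives; the other partitions are never read.
    let kept = prune(&parts, "us-east-1");
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].path, "data/us-east-1/");
}
```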


Suggested change
### Only Push Down Filters When the Data Source Can Do Better
DataFusion already pushes filters as close to the data source as possible, typically placing them directly above the scan. `FilterExec` is also highly optimized, with vectorized evaluation and type-specialized kernels for fast predicate evaluation.
Because of this, you should only implement filter pushdown when your data source can do strictly better. For example, avoid I/O by skipping data early using metadata. If your data source cannot eliminate I/O in this way, it is usually better to let DataFusion handle the filter, as its in-memory execution is already highly efficient (unless there are additional opportunities for deeper, application-specific optimizations).

Here is one clarification about filter pushdown that I think is important to mention. I have drafted it for you to consider.


@pgwhalen left a comment


As someone who struggled in the past, I'm thrilled to see this get created now! I added some comments that highlight my biggest struggles.

| If your data is... | Start with | You implement |
|---|---|---|
| Already in `RecordBatch`es in memory | [`MemTable`] | Nothing -- just construct it |
| An async stream of batches | [`StreamTable`] | A stream factory |
| A table with known sort order | [`SortedTableProvider`] wrapping another provider | The inner provider |


SortedTableProvider is only implemented in tests, right? Not that it isn't useful as a reference, but this section makes it seem like it's meant to be built on top of.

3. **Resource management breaks down.** DataFusion manages concurrency and
memory during execution. Work done during planning bypasses these controls.

## Filter Pushdown: Doing Less Work


(Just a thought, since I understand this is not a trivial ask).

As someone who struggled with implementing filter pushdown in a custom table provider before, it would be helpful if there were a section with examples of actually implementing it, rather than just specifying that the table provider supports it. Going from `filters: &[Expr]` to something more useful was pretty intimidating for me as a newbie. I ended up learning about `LiteralGuarantee::analyze` from the pruning_predicate.rs example, and that was the big unlock, but as someone not familiar with database-internals terminology, it was non-obvious initially.
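A small sketch may help make the classification step less intimidating. DataFusion's `supports_filters_pushdown` answers with `Exact`, `Inexact`, or `Unsupported` per filter; the code below mirrors that three-way decision with stand-in types (the `Filter` struct and the column-based policy are hypothetical, not the real `Expr` analysis):

```rust
// Self-contained sketch mirroring DataFusion's three-way filter pushdown
// classification. `Filter` is a stand-in for `Expr`; real providers inspect
// `filters: &[&Expr]` and return `TableProviderFilterPushDown` values.
#[derive(Debug, PartialEq)]
enum PushDown {
    Exact,       // provider applies the filter fully; no FilterExec needed
    Inexact,     // provider prunes some data; DataFusion re-applies the filter
    Unsupported, // provider ignores it; DataFusion applies it in FilterExec
}

struct Filter {
    column: &'static str,
}

// Hypothetical policy: partition columns can be filtered exactly (by pruning
// whole partitions), indexed columns only approximately, all others not at all.
fn classify(f: &Filter, partition_cols: &[&str], indexed_cols: &[&str]) -> PushDown {
    if partition_cols.contains(&f.column) {
        PushDown::Exact
    } else if indexed_cols.contains(&f.column) {
        PushDown::Inexact
    } else {
        PushDown::Unsupported
    }
}

fn main() {
    let partition_cols = ["region"];
    let indexed_cols = ["event_time"];
    assert_eq!(
        classify(&Filter { column: "region" }, &partition_cols, &indexed_cols),
        PushDown::Exact
    );
    assert_eq!(
        classify(&Filter { column: "event_type" }, &partition_cols, &indexed_cols),
        PushDown::Unsupported
    );
}
```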

scratch. DataFusion provides building blocks that let you plug in at whatever
level makes sense:

| If your data is... | Start with | You implement |


You might consider adding "custom DataSource for a ListingTable" as an option here, or at least acknowledge it.

I wrote a custom table provider a year or two ago and struggled initially, mostly because I tried to make it a listing table. In retrospect it was the wrong decision, but at the time I was convinced it was solving problems for me, because I was reading from multiple mmapped files composing a logical table. I blindly emulated ParquetSource and made my own versions of many layers (FileOpener, FileFormat, FileSource), tying myself into lots of assumptions about partitioning etc. that I didn't fully understand up front.

Comment on lines +642 to +644
- [TableProvider API docs][`TableProvider`]
- [ExecutionPlan API docs][`ExecutionPlan`]
- [SendableRecordBatchStream API docs][`SendableRecordBatchStream`]


It looks like you meant to add links here.

Comment on lines +109 to +120
### Keep `scan()` Lightweight

This is a critical point: **`scan()` runs during planning, not execution.** It
should return quickly. Best practices are to avoid performing I/O, network
calls, or heavy computation here. The `scan` method's job is to *describe* how
the data will be produced, not to produce it. All the real work belongs in the
stream (Layer 3).

A common pitfall is to fetch data or open connections in `scan()`. This blocks
the planning thread and can cause timeouts or deadlocks, especially if the query
involves multiple tables or subqueries that all need to be planned before
execution begins.
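The quoted "describe, don't produce" guidance can be sketched with stand-in types (not the real `scan()`/`execute()` signatures): the plan only records *what* to read, and the expensive part runs lazily when the stream is consumed.

```rust
// Stand-in illustration (not the real DataFusion API): scan() only records
// what to read; work is deferred until the "stream" is consumed.
struct Scan {
    files: Vec<String>,
}

fn scan(files: &[&str]) -> Scan {
    // Cheap and synchronous: no I/O, just a description of the work.
    Scan { files: files.iter().map(|s| s.to_string()).collect() }
}

fn execute(scan: Scan) -> impl Iterator<Item = String> {
    // Deferred: "opening" each file happens only as the iterator is consumed,
    // i.e. during execution, not during planning.
    scan.files.into_iter().map(|f| format!("rows from {f}"))
}

fn main() {
    let plan = scan(&["a.parquet", "b.parquet"]); // planning: returns instantly
    let rows: Vec<_> = execute(plan).collect();   // execution: work happens here
    assert_eq!(rows.len(), 2);
    println!("{rows:?}");
}
```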

I want to agree with this, but in practice doesn't ListingTable do this? Some of our examples only avoid it because they pre-load all of the data, which is not realistic for a production system (e.g.).

We might also want to document scan_with_args() instead of scan().


Development

Successfully merging this pull request may close these issues.

RFC: What table provider features would be helpful in an example?

5 participants