Conversation
> Think of these as a funnel: `TableProvider::scan()` is called once during
> planning to create an `ExecutionPlan`, then `ExecutionPlan::execute()` is called
> once per partition to create a stream, and those streams are where rows are
> actually produced during execution.
Would it be useful to know how these types relate to physical vs logical planning?
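The funnel described in the quoted text can be sketched with plain Rust types. This is only an illustration under stated assumptions: `Table` and `Plan` are stand-ins, not DataFusion's real `TableProvider` and `ExecutionPlan` traits, and a plain iterator stands in for `SendableRecordBatchStream`.

```rust
// Toy model of the three layers: scan() runs once at plan time,
// execute() runs once per partition, and the returned iterator is
// where rows are actually produced.
struct Table;

struct Plan {
    partitions: usize,
}

impl Table {
    // Layer 1: called once during planning; only *describes* the work.
    fn scan(&self) -> Plan {
        Plan { partitions: 4 }
    }
}

impl Plan {
    // Layer 2: called once per partition; returns a lazy "stream".
    fn execute(&self, partition: usize) -> impl Iterator<Item = u64> {
        let start = (partition * 1000) as u64;
        start..start + 3 // Layer 3: rows are produced only when polled
    }
}

fn main() {
    let plan = Table.scan(); // planning: cheap, no I/O
    for p in 0..plan.partitions {
        let rows: Vec<u64> = plan.execute(p).collect(); // execution
        println!("partition {p}: {rows:?}");
    }
}
```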
> Since `execute()` is called once per partition, partitioning directly controls
> the parallelism of your table scan. Each partition runs on its own task, so
> more partitions means more concurrent work -- up to the number of available
> cores.
It would probably be good to briefly explain the relationship between tasks and threads. For example: is it ok to have a lot more tasks than cores, or should you cap the number of partitions (and thus tasks) you expose?
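A dependency-free sketch of the partition-per-task model in question. DataFusion actually schedules partitions as tokio tasks on a thread pool, not as raw OS threads; threads are used here only so the example is self-contained, and the oversubscription factor is arbitrary.

```rust
use std::thread;

// Run one "task" per partition and return each task's result.
// (Stand-in for tokio tasks: the point is that the scheduler
// multiplexes many tasks over a fixed number of cores.)
fn run_partitions(partitions: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..partitions)
        .map(|p| thread::spawn(move || p * 10))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // More partitions than cores is fine -- they just queue -- though very
    // large partition counts add per-task overhead for no extra parallelism.
    let results = run_partitions(cores * 4);
    println!("ran {} partitions on {cores} cores", results.len());
}
```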
**2010YOUY01** left a comment:
LGTM. I read through it and found the concepts well explained and easy to follow. One follow-up after publishing would be to link this blog from the doc comments of related APIs such as `TableProvider`.
> Here is a minimal but complete example of a custom table provider that generates
> data lazily during streaming:
>
> ```rust
Perhaps we can move it to datafusion-examples, or maybe we already have a similar one and can link it here directly.
> the `us-east-1` partition. If that partition holds 100 million rows, you have
> just eliminated 90% of the I/O. DataFusion still applies the `event_type`
> filter via `FilterExec` if you reported it as `Unsupported`.
> ### Only Push Down Filters When the Data Source Can Do Better
>
> DataFusion already pushes filters as close to the data source as possible, typically placing them directly above the scan. `FilterExec` is also highly optimized, with vectorized evaluation and type-specialized kernels for fast predicate evaluation.
>
> Because of this, you should only implement filter pushdown when your data source can do strictly better -- for example, by skipping data early using metadata to avoid I/O. If your data source cannot eliminate I/O in this way, it is usually better to let DataFusion handle the filter, as its in-memory execution is already highly efficient (unless there are additional opportunities for deeper, application-specific optimizations).
Here is one clarification about filter pushdown that I think is important to mention. I have drafted it for you to consider.
**pgwhalen** left a comment:
As someone who struggled in the past, I'm thrilled to see this get created now! I added some comments that highlight my biggest struggles.
> |---|---|---|
> | Already in `RecordBatch`es in memory | [`MemTable`] | Nothing -- just construct it |
> | An async stream of batches | [`StreamTable`] | A stream factory |
> | A table with known sort order | [`SortedTableProvider`] wrapping another provider | The inner provider |
`SortedTableProvider` is only implemented in tests, right? Not that it isn't useful as a reference, but this section makes it seem like it's meant to be built on top of.
> 3. **Resource management breaks down.** DataFusion manages concurrency and
>    memory during execution. Work done during planning bypasses these controls.
>
> ## Filter Pushdown: Doing Less Work
(Just a thought, since I understand this is not a trivial ask).
As someone who struggled with implementing filter pushdown in a custom table provider before, it would be helpful to have a section with examples of actually implementing it, not just specifying that the table provider supports it. Going from `filters: &[Expr]` to something more useful was pretty intimidating for me as a newbie. I ended up learning about `LiteralGuarantee::analyze` from the pruning_predicate.rs example, and that was the big unlock, but as someone not familiar with database internals terminology it was non-obvious initially.
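To make the comment above concrete, here is a toy illustration of the *kind* of analysis `LiteralGuarantee::analyze` performs: walking a filter and extracting "this column must equal one of these literals" facts a provider can prune with. The `Expr` enum below is a hypothetical stand-in, not DataFusion's real expression type.

```rust
// Stand-in expression tree (NOT datafusion::logical_expr::Expr).
#[derive(Debug)]
enum Expr {
    Eq(String, String),          // col = literal
    InList(String, Vec<String>), // col IN (literals)
    And(Box<Expr>, Box<Expr>),   // conjunction: both sides must hold
}

// Extract (column, allowed literal values) guarantees implied by the filter.
// Only conjunctions preserve guarantees; a real analysis also handles OR,
// negation, and non-literal terms.
fn guarantees(expr: &Expr) -> Vec<(String, Vec<String>)> {
    match expr {
        Expr::Eq(col, lit) => vec![(col.clone(), vec![lit.clone()])],
        Expr::InList(col, lits) => vec![(col.clone(), lits.clone())],
        Expr::And(l, r) => {
            let mut out = guarantees(l);
            out.extend(guarantees(r));
            out
        }
    }
}

fn main() {
    let filter = Expr::And(
        Box::new(Expr::Eq("region".into(), "us-east-1".into())),
        Box::new(Expr::InList(
            "event_type".into(),
            vec!["click".into(), "view".into()],
        )),
    );
    for (col, vals) in guarantees(&filter) {
        println!("{col} must be one of {vals:?}"); // drives partition pruning
    }
}
```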
> scratch. DataFusion provides building blocks that let you plug in at whatever
> level makes sense:
>
> | If your data is... | Start with | You implement |
You might consider adding "custom DataSource for a ListingTable" as an option here, or at least acknowledge it.
I wrote a custom table provider a year or two ago and struggled initially, mostly because I tried to make it a listing table. It was the wrong decision in retrospect, but at the time I was convinced it was solving problems for me, because I was reading from multiple mmapped files composing a logical table. I blindly emulated the ParquetSource and made my own versions of so many layers (FileOpener, FileFormat, FileSource), tying myself into lots of assumptions about partitioning etc. that I didn't fully understand up front.
> - [TableProvider API docs][`TableProvider`]
> - [ExecutionPlan API docs][`ExecutionPlan`]
> - [SendableRecordBatchStream API docs][`SendableRecordBatchStream`]
It looks like you meant to add links here.
> ### Keep `scan()` Lightweight
>
> This is a critical point: **`scan()` runs during planning, not execution.** It
> should return quickly. Best practice is to avoid performing I/O, network
> calls, or heavy computation here. The `scan` method's job is to *describe* how
> the data will be produced, not to produce it. All the real work belongs in the
> stream (Layer 3).
>
> A common pitfall is to fetch data or open connections in `scan()`. This blocks
> the planning thread and can cause timeouts or deadlocks, especially if the query
> involves multiple tables or subqueries that all need to be planned before
> execution begins.
I want to agree with this, but in practice doesn't ListingTable do this? Some of our examples only avoid it because they pre-load all of the data (which is not realistic for a production system) (e.g.).
We might also want to document `scan_with_args()` instead of `scan()`.
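A stdlib-only sketch of the describe-now/produce-later split the quoted section recommends. The function and closure names are illustrative; real code returns an `ExecutionPlan` whose `execute()` builds a `SendableRecordBatchStream` rather than a closure.

```rust
// scan() only captures *how* to get the data; nothing is opened or read
// here, so planning stays cheap even across many tables.
fn scan(path: String) -> impl FnOnce() -> Vec<String> {
    move || {
        // Stand-in for the expensive I/O, deferred to execution time:
        // this body does not run until the "stream" is actually consumed.
        vec![format!("row from {path}")]
    }
}

fn main() {
    let stream_factory = scan("events.parquet".to_string()); // planning
    let rows = stream_factory(); // execution: the real work happens here
    println!("{rows:?}");
}
```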
Closes apache/datafusion#16821
This blog post is designed to help new users of DataFusion write their own table providers and understand some of the core concepts.
Preview site: https://datafusion.staged.apache.org/blog/2026/03/20/writing-table-providers/