feat: Lazy per-column I/O for complex columns in Nimble (#677) #677

Open: prashantgolash wants to merge 1 commit into facebookincubator:main from prashantgolash:export-D100277342

prashantgolash commented Apr 27, 2026

Summary:
X-link: facebookincubator/velox#17350

Today, the Nimble selective reader loads all column streams upfront during stripe init, including the streams of columns wrapped in LazyVectors. The lazy contract defers only decoding; the underlying I/O is still eager. When a highly selective remaining filter eliminates most rows, the eagerly loaded data for output-only columns is never decoded, yet its I/O cost has already been paid.

This diff extends laziness from decoding to I/O. Complex lazy columns (MAP/ARRAY/ROW) without pushed-down filters get their streams enqueued into a per-column cloned BufferedInput, loaded only on first downstream access. If the filter eliminates all rows in a stripe, the deferred column's load() is never called — zero I/O for that column in that stripe.

How it works:

  • During column reader construction, qualifying columns have their streams enqueued into a cloned BufferedInput instead of the shared one.
  • The shared input is loaded during stripe init (eager columns only).
  • Each deferred column's clone is loaded independently via ColumnLoader when the LazyVector is first accessed.
  • Batch size estimation uses totalStreamBytes (compressed stream sizes from tablet metadata) for deferred columns since their decoders are not yet loaded.
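
The flow above can be modeled with a small, self-contained sketch. This is a toy model in Python, not the Nimble or Velox code: `BufferedInputModel`, `LazyColumnModel`, and `read_stripe` are hypothetical names, and stream sizes are plain integers standing in for bytes read from storage.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedInputModel:
    """Toy stand-in for a buffered input: enqueue streams, then load them in one batch."""
    enqueued: list = field(default_factory=list)
    bytes_read: int = 0
    loaded: bool = False

    def enqueue(self, stream_name, size):
        self.enqueued.append((stream_name, size))

    def load(self):
        # One batched read covering every enqueued stream; repeated calls are no-ops.
        if not self.loaded:
            self.bytes_read += sum(size for _, size in self.enqueued)
            self.loaded = True

    def clone(self):
        # A clone has its own queue, so it can be loaded independently of the original.
        return BufferedInputModel()


class LazyColumnModel:
    """Loads its private clone the first time the column is accessed."""

    def __init__(self, name, input_clone):
        self.name = name
        self.input = input_clone

    def access(self):
        self.input.load()
        return self.name


def read_stripe(columns, lazy_names, rows_passing_filter):
    """Return total bytes read for one stripe (hypothetical model)."""
    shared = BufferedInputModel()
    lazy_columns = []
    for name, size in columns:
        if name in lazy_names:
            clone = shared.clone()
            clone.enqueue(name, size)
            lazy_columns.append(LazyColumnModel(name, clone))
        else:
            shared.enqueue(name, size)
    shared.load()  # stripe init: eager columns only
    if rows_passing_filter > 0:
        # Downstream access happens only when some rows survive the filter.
        for col in lazy_columns:
            col.access()
    return shared.bytes_read + sum(c.input.bytes_read for c in lazy_columns)
```

In this model, a stripe whose rows are all filtered out pays only for the eager columns, while a stripe with surviving rows pays for both.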

Gated behind the lazy_column_io session property (default off).

Detailed analysis (naming changes, per-column vs shared clone tradeoff, code flow, shadow data): P2302893230

Reviewed By: HuamengJiang

Differential Revision: D100277342

meta-cla Bot added the CLA Signed label Apr 27, 2026

meta-codesync Bot commented Apr 27, 2026

@prashantgolash has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100277342.

prashantgolash changed the title from "Deferred per-column I/O for lazy FlatMap columns in Nimble" to "[nimble]feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" Apr 27, 2026
prashantgolash changed the title from "[nimble]feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" to "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" Apr 27, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request Apr 27, 2026
…ebookincubator#17350)

Summary:
X-link: facebookincubator/nimble#677


FlatMap columns (e.g. sparse_features) store each map key as separate streams — often hundreds of streams totaling GBs per stripe. Today, all streams are loaded eagerly during stripe setup, even for columns wrapped in LazyVectors. When a high-selectivity filter on a sibling column (e.g. element_at(pipeline_labels, key) IS NOT NULL with 99.98% selectivity) eliminates most rows, the FlatMap data is loaded but never used.

This diff implements per-column deferred I/O, gated behind the defer_flatmap_io session property (default off).

## How it works

**Before (eager):** All streams are loaded in one batched I/O during stripe setup. FlatMap data sits in memory even if the filter eliminates every row.

**After (deferred):** Each qualifying FlatMap column gets its own cloned BufferedInput. Its streams are enqueued but not loaded during stripe setup. On first lazy access, DeferredInput::load() issues a single batched I/O for all of that column's streams. If the filter eliminates all rows in a stripe, the load is never triggered — zero I/O for that column.

## What qualifies for deferral

A column is deferred when all of these are true:
- defer_flatmap_io session property is enabled
- Column is a top-level child of the root struct (eligible for LazyVector)
- At least one sibling has a pushed-down filter
- The column itself has no filter and is projected
- The column is a complex type (MAP, ARRAY, or ROW)
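
The conditions listed above amount to a single conjunction. A minimal sketch in Python follows; `ColumnSpec` and `qualifies_for_deferral` are hypothetical names used for illustration and do not come from the Nimble or Velox code.

```python
from dataclasses import dataclass

COMPLEX_TYPES = {"MAP", "ARRAY", "ROW"}


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical description of one column in the scan."""
    name: str
    type_kind: str
    is_top_level: bool
    has_filter: bool
    is_projected: bool


def qualifies_for_deferral(column, siblings, defer_enabled):
    """True only when every condition from the list holds."""
    return (
        defer_enabled                              # session property is on
        and column.is_top_level                    # child of the root struct
        and any(s.has_filter for s in siblings)    # a sibling has a pushed-down filter
        and not column.has_filter                  # the column itself is unfiltered
        and column.is_projected                    # and it is part of the output
        and column.type_kind in COMPLEX_TYPES      # MAP, ARRAY, or ROW
    )
```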

## Batch size estimation

Deferred columns' decoders are not loaded, so estimateMaterializedSize() cannot query them. Without handling this, the estimate fails and falls back to 1MB per row, which produces tiny batches and heavy per-batch overhead. The fix: skip deferred children and use their totalStreamBytes (compressed stream sizes from tablet metadata) as an approximation. When file-level vectorized stats exist, this code path is never reached, because stats-based estimation (Tier 1) takes precedence.
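
A rough sketch of that fallback logic, in Python. The function and field names are hypothetical, and dividing `total_stream_bytes` by the stripe row count is an assumption made for this illustration; the description above does not say how the compressed size is turned into a per-row figure.

```python
FALLBACK_BYTES_PER_ROW = 1 << 20  # the 1MB-per-row fallback described above


def estimate_row_bytes(children, stripe_rows):
    """Estimate bytes per row across child columns (hypothetical model).

    Each child is a dict with:
      'deferred'              -- True if its decoder is not loaded yet
      'decoder_bytes_per_row' -- decoder-based estimate, or None if unavailable
      'total_stream_bytes'    -- compressed stream size from metadata
    """
    if stripe_rows <= 0:
        return FALLBACK_BYTES_PER_ROW
    total = 0.0
    for child in children:
        if child["deferred"]:
            # Decoder not loaded: approximate from compressed stream size
            # (assumed here to be averaged over the stripe's rows).
            total += child["total_stream_bytes"] / stripe_rows
        elif child["decoder_bytes_per_row"] is None:
            # An eager child that cannot be estimated still fails the whole estimate.
            return FALLBACK_BYTES_PER_ROW
        else:
            total += child["decoder_bytes_per_row"]
    return total
```

The point of the sketch is only that a deferred child contributes an approximation instead of forcing the 1MB-per-row fallback.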

## Why per-column clones (not a shared clone)

Each deferred column gets its own cloned BufferedInput rather than sharing one clone across all deferred columns. A shared clone would preserve cross-column coalescing but has a critical flaw: when the remaining filter accesses one deferred column (e.g. pipeline_labels for element_at), the shared load() triggers I/O for ALL deferred columns — including output-only columns (e.g. sparse_features) that may never be needed if the remaining filter eliminates all rows.

Per-column clones load each column independently at the right time:
- pipeline_labels loads when the remaining filter accesses it
- sparse_features loads only when serialization needs it (after the remaining filter)
- If the remaining filter eliminates all rows, sparse_features is never loaded

Production validation confirmed: shared clone showed no I/O reduction (46TB vs 46TB), while per-column clones reduced storageRead from 46TB to 6TB (7.5x reduction).
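
The difference between the two designs can be illustrated with a toy model in Python. The function name and the byte counts are made up for illustration and are unrelated to the production numbers above.

```python
def deferred_bytes_read(column_sizes, accessed, shared_clone):
    """Bytes read for deferred columns under each design (hypothetical model).

    column_sizes -- {column name: stream bytes} for all deferred columns
    accessed     -- set of deferred columns actually touched downstream
    shared_clone -- True for one clone shared by all deferred columns
    """
    if shared_clone:
        # One clone holds every deferred stream, so the first access loads them all.
        return sum(column_sizes.values()) if accessed else 0
    # Per-column clones: only the touched columns are loaded.
    return sum(size for name, size in column_sizes.items() if name in accessed)


sizes = {"pipeline_labels": 5, "sparse_features": 500}
# The remaining filter touches pipeline_labels and then eliminates every row.
touched = {"pipeline_labels"}
shared_cost = deferred_bytes_read(sizes, touched, shared_clone=True)
per_column_cost = deferred_bytes_read(sizes, touched, shared_clone=False)
```

With a shared clone, touching the small filter column pulls in the large output-only column as well; with per-column clones, only the filter column is read.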

## Usage

SET SESSION hive.native_defer_flatmap_io = true;

Differential Revision: D100277342
prashantgolash added a commit to prashantgolash/velox that referenced this pull request Apr 27, 2026
meta-codesync Bot changed the title from "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" to "Deferred per-column I/O for lazy FlatMap columns in Nimble" Apr 27, 2026
prashantgolash changed the title from "Deferred per-column I/O for lazy FlatMap columns in Nimble" to "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" Apr 27, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request Apr 28, 2026
meta-codesync Bot changed the title from "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble" to "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble (#677)" Apr 28, 2026
prashantgolash added a commit to prashantgolash/nimble that referenced this pull request Apr 28, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request Apr 29, 2026
prashantgolash added a commit to prashantgolash/nimble that referenced this pull request Apr 29, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request Apr 29, 2026
prashantgolash added a commit to prashantgolash/nimble that referenced this pull request Apr 29, 2026
meta-codesync Bot changed the title from "feat: Deferred per-column I/O for lazy FlatMap columns in Nimble (#677)" to "Lazy per-column I/O for complex columns in Nimble" May 1, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request May 1, 2026
prashantgolash changed the title from "Lazy per-column I/O for complex columns in Nimble" to "feat: Lazy per-column I/O for complex columns in Nimble" May 1, 2026
prashantgolash added a commit to prashantgolash/velox that referenced this pull request May 4, 2026
meta-codesync Bot changed the title from "feat: Lazy per-column I/O for complex columns in Nimble" to "feat: Lazy per-column I/O for complex columns in Nimble (#677)" May 4, 2026
prashantgolash added a commit to prashantgolash/nimble that referenced this pull request May 4, 2026
…bator#677)

Summary:
X-link: facebookincubator/velox#17350


Differential Revision: D100277342
prashantgolash added a commit to prashantgolash/nimble that referenced this pull request May 8, 2026
…bator#677)

Summary:
X-link: facebookincubator/velox#17350


Differential Revision: D100277342
prashantgolash added a commit to prashantgolash/velox that referenced this pull request May 8, 2026
…bator#17350)

Summary:
Pull Request resolved: facebookincubator#17350

X-link: facebookincubator/nimble#677

Differential Revision: D100277342
…bator#677)

Summary:
X-link: facebookincubator/velox#17350


Reviewed By: HuamengJiang

Differential Revision: D100277342
prashantgolash added a commit to prashantgolash/velox that referenced this pull request May 12, 2026
…bator#17350)

Summary:

X-link: facebookincubator/nimble#677

Reviewed By: HuamengJiang

Differential Revision: D100277342
Labels

CLA Signed, fb-exported, meta-exported
