Skip to content

ESQL: tighten parsed-footer cache after review feedback#149045

Merged
costin merged 2 commits into
elastic:mainfrom
costin:esql/parsed-footer-cache-review-fixes
May 14, 2026
Merged

ESQL: tighten parsed-footer cache after review feedback#149045
costin merged 2 commits into
elastic:mainfrom
costin:esql/parsed-footer-cache-review-fixes

Conversation

@costin
Copy link
Copy Markdown
Member

@costin costin commented May 14, 2026

Follow-up to #149018 incorporating feedback. Three
behavioural/structural changes plus two doc improvements.

ParsedFooterCache

  • Lower DEFAULT_MAX_ENTRIES from 64 to 32. Worst case is now bounded
    closer to a few hundred MiB instead of multiple GiB when extremely wide
    files dominate the working set; typical workloads still sit at a few
    MiB total.
  • Replace 'intentionally conservative' wording with an explicit
    worst-case heap budget (column-chunk math, typical-vs-extreme split,
    weigher follow-up note) so the operational tradeoff is visible.
  • Extract rethrowStructural(ExecutionException) as a shared helper that
    reshapes Error/IOException/CircuitBreakingException/ElasticsearchException
    back to their original types. Format-specific RuntimeException wrapping
    stays at the call site since the two readers differ there.

ParquetFormatReader

  • Use ParsedFooterCache.rethrowStructural in loadFooter; the format-
    specific newInvalidParquetFileException now wraps only what rethrow
    could not rethrow structurally.
  • Document why loadFooter is non-static (captured 'this' threads the
    per-instance CircuitBreakerByteBufferAllocator into the loader) and
    why that is harmless under the current shared-parent-breaker layout.

OrcFormatReader

  • Use ParsedFooterCache.rethrowStructural in loadTail and rethrow
    RuntimeException directly; only non-structural causes get wrapped as
    IOException with a format-tagged message.
  • Document why the cold path deliberately parses the tail twice (the
    single-parse alternative would require re-implementing ORC's tail
    fetch protocol outside the library, fragile across versions).
  • Restrict setFileContext to the cache-miss branch in readRange,
    matching parquet style — no functional change but removes a redundant
    self-write on the hot per-producer path.

No SPI changes. No behavioural change to the cache hit path.

Developed using AI-assisted tooling

Follow-up to elastic#149018 incorporating pragmatic review feedback. Three
behavioural/structural changes plus two doc improvements.

ParsedFooterCache
- Lower DEFAULT_MAX_ENTRIES from 64 to 32. Worst case is now bounded
  closer to a few hundred MiB instead of multiple GiB when extremely wide
  files dominate the working set; typical workloads still sit at a few
  MiB total.
- Replace 'intentionally conservative' wording with an explicit
  worst-case heap budget (column-chunk math, typical-vs-extreme split,
  weigher follow-up note) so the operational tradeoff is visible.
- Extract rethrowStructural(ExecutionException) as a shared helper that
  reshapes Error/IOException/CircuitBreakingException/ElasticsearchException
  back to their original types. Format-specific RuntimeException wrapping
  stays at the call site since the two readers differ there.

ParquetFormatReader
- Use ParsedFooterCache.rethrowStructural in loadFooter; the format-
  specific newInvalidParquetFileException now wraps only what rethrow
  could not rethrow structurally.
- Document why loadFooter is non-static (captured 'this' threads the
  per-instance CircuitBreakerByteBufferAllocator into the loader) and
  why that is harmless under the current shared-parent-breaker layout.

OrcFormatReader
- Use ParsedFooterCache.rethrowStructural in loadTail and rethrow
  RuntimeException directly; only non-structural causes get wrapped as
  IOException with a format-tagged message.
- Document why the cold path deliberately parses the tail twice (the
  single-parse alternative would require re-implementing ORC's tail
  fetch protocol outside the library, fragile across versions).
- Restrict setFileContext to the cache-miss branch in readRange,
  matching parquet style — no functional change but removes a redundant
  self-write on the hot per-producer path.

No SPI changes. No behavioural change to the cache hit path.

Developed using AI-assisted tooling
@costin costin requested a review from bpintea May 14, 2026 06:37
@costin costin enabled auto-merge (squash) May 14, 2026 06:37
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 14, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @costin, I've created a changelog YAML for you.

@github-actions
Copy link
Copy Markdown
Contributor

🔍 Preview links for changed docs

⏳ Building and deploying preview... View progress

This comment will be updated with preview links when the build is complete.

@github-actions
Copy link
Copy Markdown
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

@costin costin disabled auto-merge May 14, 2026 09:27
@costin costin merged commit e6615a0 into elastic:main May 14, 2026
36 checks passed
@costin costin deleted the esql/parsed-footer-cache-review-fixes branch May 14, 2026 09:27
lukewhiting pushed a commit to lukewhiting/elasticsearch that referenced this pull request May 14, 2026
Follow-up to elastic#149018 incorporating feedback. Three
behavioural/structural changes plus two doc improvements.

ParsedFooterCache
- Lower DEFAULT_MAX_ENTRIES from 64 to 32. Worst case is now bounded
  closer to a few hundred MiB instead of multiple GiB when extremely wide
  files dominate the working set; typical workloads still sit at a few
  MiB total.
- Replace 'intentionally conservative' wording with an explicit
  worst-case heap budget (column-chunk math, typical-vs-extreme split,
  weigher follow-up note) so the operational tradeoff is visible.
- Extract rethrowStructural(ExecutionException) as a shared helper that
  reshapes Error/IOException/CircuitBreakingException/ElasticsearchException
  back to their original types. Format-specific RuntimeException wrapping
  stays at the call site since the two readers differ there.

ParquetFormatReader
- Use ParsedFooterCache.rethrowStructural in loadFooter; the format-
  specific newInvalidParquetFileException now wraps only what rethrow
  could not rethrow structurally.
- Document why loadFooter is non-static (captured 'this' threads the
  per-instance CircuitBreakerByteBufferAllocator into the loader) and
  why that is harmless under the current shared-parent-breaker layout.

OrcFormatReader
- Use ParsedFooterCache.rethrowStructural in loadTail and rethrow
  RuntimeException directly; only non-structural causes get wrapped as
  IOException with a format-tagged message.
- Document why the cold path deliberately parses the tail twice (the
  single-parse alternative would require re-implementing ORC's tail
  fetch protocol outside the library, fragile across versions).
- Restrict setFileContext to the cache-miss branch in readRange,
  matching parquet style — no functional change but removes a redundant
  self-write on the hot per-producer path.

No SPI changes. No behavioural change to the cache hit path.

Developed using AI-assisted tooling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement ES|QL|DS ES|QL datasources Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants