Skip to content

feat(storage): add DocumentFactory.documentExists#5085

Merged
aglinxinyuan merged 11 commits into
mainfrom
xinyuan-document-factory-exists
May 16, 2026
Merged

feat(storage): add DocumentFactory.documentExists#5085
aglinxinyuan merged 11 commits into
mainfrom
xinyuan-document-factory-exists

Conversation

@aglinxinyuan
Copy link
Copy Markdown
Contributor

@aglinxinyuan aglinxinyuan commented May 16, 2026

What changes were proposed in this PR?

Adds a documentExists-style helper to DocumentFactory in both the Scala and Python code paths, so callers can check whether an iceberg-backed document already exists at a vfs:// URI without catching exceptions from openDocument / open_document.

  • Scala: new DocumentFactory.documentExists(uri: URI): Boolean. Resolves the VFSResourceType to its iceberg namespace, then probes the catalog via IcebergCatalogInstance.getInstance().tableExists(TableIdentifier.of(namespace, storageKey)). Throws UnsupportedOperationException for non-vfs URI schemes; IllegalArgumentException for unsupported resource types.
  • Python: new DocumentFactory.document_exists(uri: str) -> bool. Same shape: probes via catalog.table_exists(f"{namespace}.{storage_key}"); raises NotImplementedError / ValueError symmetrically.
  • Refactor: extracted a private resolveNamespace (Scala) and _resolve_namespace (Python) so createDocument, openDocument, and the new helper share one resource-type → namespace mapping in each language.
  • Why Catalog.tableExists rather than loadTableMetadata: loadTableMetadata catches every exception and returns None, so a transient catalog error would have surfaced as a false-negative "doesn't exist" answer. Catalog.tableExists only returns false on actual not-found, and lets unexpected errors propagate.
  • The change in open_document from a hard-coded "vfs" literal to VFSURIFactory.VFS_FILE_URI_SCHEME aligns the three methods on the same scheme constant.

Any related issues, documentation, discussions?

Closes: #5089

How was this PR tested?

  • sbt "WorkflowCore/Test/compile" — clean.
  • sbt "WorkflowCore/testOnly *IcebergDocumentSpec" — 14/14 pass, including two new cases asserting documentExists returns true after createDocument, false on a fresh URI, and throws UnsupportedOperationException for an unsupported scheme.
  • sbt "WorkflowCore/testOnly *IcebergUtilSpec" — 13/13 pass (refactor did not touch IcebergUtil).
  • pytest amber/src/test/python/core/storage/test_document_factory.py — 11/11 pass, including four new cases covering document_exists returning true/false based on catalog.table_exists, raising ValueError on an unsupported resource type, and raising NotImplementedError on an unsupported scheme.
  • ruff check clean on document_factory.py and test_document_factory.py.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7 in compliance with ASF.

Enables "create only if absent" flows to test existence without
catching exceptions from openDocument. Splits off from #4206.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 00:43
aglinxinyuan added a commit that referenced this pull request May 16, 2026
The documentExists helper is being landed separately in #5085. This
branch keeps the callers in RegionExecutionCoordinator that depend on
it, so this PR now depends on #5085 merging first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a small storage helper to let callers check whether a VFS-backed document already exists at a URI, avoiding the current “probe by calling openDocument and catching exceptions” pattern. This fits into workflow-core’s storage abstractions around VFS/Iceberg-backed documents.

Changes:

  • Added DocumentFactory.documentExists(uri: URI): Boolean for the vfs scheme.
  • Implemented existence probing via Iceberg namespace resolution + IcebergUtil.loadTableMetadata.
  • Explicitly rejects unsupported URI schemes with UnsupportedOperationException.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 16, 2026

Codecov Report

❌ Patch coverage is 84.61538% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.14%. Comparing base (a820f67) to head (ff9d677).

Files with missing lines Patch % Lines
...he/texera/amber/core/storage/DocumentFactory.scala 64.70% 0 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5085      +/-   ##
============================================
+ Coverage     43.12%   43.14%   +0.01%     
+ Complexity     2208     2207       -1     
============================================
  Files          1045     1045              
  Lines         40249    40260      +11     
  Branches       4252     4250       -2     
============================================
+ Hits          17358    17369      +11     
  Misses        21817    21817              
  Partials       1074     1074              
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from d758abc
amber 43.77% <64.70%> (+0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 34.05% <ø> (ø) Carriedforward from d758abc
python 90.43% <100.00%> (+0.01%) ⬆️
workflow-compiling-service 56.81% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aglinxinyuan and others added 3 commits May 15, 2026 17:56
- Extract resolveNamespace helper so createDocument, openDocument, and
  documentExists share one resourceType → namespace mapping.
- Use catalog.tableExists via IcebergUtil instead of loadTableMetadata
  so transient catalog errors surface instead of becoming false negatives.
- Tweak the unsupported-scheme message to mention "checking document
  existence" rather than "checking the document".
- Add IcebergDocumentSpec cases for existing/fresh URIs and unsupported
  schemes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the single-use IcebergUtil.tableExists wrapper and calls
catalog.tableExists directly via TableIdentifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the Scala documentExists helper:
- Extract _resolve_namespace so create_document, open_document, and
  document_exists share one resource_type -> namespace mapping.
- document_exists calls catalog.table_exists directly so transient
  catalog errors surface instead of becoming false negatives.
- Add unit tests covering true/false catalog responses, unsupported
  resource type, and unsupported URI scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread amber/src/main/python/core/storage/document_factory.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xinyuan Lin <xinyual3@uci.edu>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aglinxinyuan aglinxinyuan marked this pull request as ready for review May 16, 2026 01:36
@aglinxinyuan aglinxinyuan requested review from Xiao-zhen-Liu and mengw15 and removed request for Xiao-zhen-Liu May 16, 2026 01:36
Copy link
Copy Markdown
Contributor

@mengw15 mengw15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some comments

Comment thread amber/src/main/python/core/storage/document_factory.py
Comment thread amber/src/main/python/core/storage/document_factory.py
@mengw15
Copy link
Copy Markdown
Contributor

mengw15 commented May 16, 2026

Also could we make Codecov 100%?

…ng asymmetry

Address review feedback on #5085:
- Document `@throws` cases on `DocumentFactory.documentExists` (Scala) and add a
  `Raises:` block on `DocumentFactory.document_exists` (Python), so callers gating
  a `createDocument` call know what to catch.
- Note in Python `_resolve_namespace` that only RESULT/STATE are mapped because
  CONSOLE_MESSAGES and RUNTIME_STATISTICS are written exclusively from the Scala
  runtime; the asymmetry vs. the Scala helper is intentional.
- Split the combined documentExists assert in `IcebergDocumentSpec` into two
  `it should` blocks ("true for a created URI" / "false for a never-created URI")
  so a regression names the failing scenario.
… of resolveNamespace

The original `IcebergDocumentSpec` only exercised the RESULT and STATE branches
of `DocumentFactory.resolveNamespace`. Codecov flagged CONSOLE_MESSAGES and
RUNTIME_STATISTICS as missing patch coverage. Add `documentExists` cases for
both URI types so all four mapped resource-type branches are now exercised.

The defensive `case _ =>` branch in `resolveNamespace` remains unreachable from
any well-formed VFS URI (`VFSURIFactory.decodeURI` already validates resource
types before reaching `resolveNamespace`); exercising it would require either
reflection-based test plumbing or restructuring the lookup. Keeping it as
defensive code for future enum additions.
…reflection

After covering CONSOLE_MESSAGES and RUNTIME_STATISTICS, the match in
`resolveNamespace` was still flagged as partial by jacoco because the
defensive `case _ =>` (unreachable from any well-formed VFS URI) was
never taken. Reflect into the private method and pass `null` to exercise
the wildcard branch, asserting it throws IllegalArgumentException as
documented. Keeps the defensive code intact while bringing branch
coverage on `resolveNamespace` to 100%.
@aglinxinyuan aglinxinyuan enabled auto-merge (squash) May 16, 2026 04:31
@aglinxinyuan aglinxinyuan merged commit e27b98a into main May 16, 2026
21 checks passed
@aglinxinyuan aglinxinyuan deleted the xinyuan-document-factory-exists branch May 16, 2026 04:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add DocumentFactory existence check (documentExists / document_exists)

4 participants