Skip to content

test(workflow-operator): pin Sklearn OpDesc registry strings#4827

Merged
aglinxinyuan merged 15 commits into
apache:mainfrom
aglinxinyuan:test-sklearn-op-desc-registry-spec
May 4, 2026
Merged

test(workflow-operator): pin Sklearn OpDesc registry strings#4827
aglinxinyuan merged 15 commits into
apache:mainfrom
aglinxinyuan:test-sklearn-op-desc-registry-spec

Conversation

@aglinxinyuan
Copy link
Copy Markdown
Contributor

@aglinxinyuan aglinxinyuan commented May 3, 2026

What changes were proposed in this PR?

Add SklearnOpDescRegistrySpec covering every concrete SklearnClassifierOpDesc (25 subclasses) and SklearnTrainingOpDesc (26 subclasses) with the exact (importStatement, userFriendlyModelName) pair each one returns. A typo in either string would silently misroute either the generated Python pipeline or the user-facing UI label; pinning them in one table makes that a test failure.

Also covers:

  • SklearnTrainingOpDesc base default (RandomForest)
  • generatePythonCode smoke test for a classifier and a training subclass, verifying the import string is embedded
  • Completeness check: classpath scan (via PythonReflectionUtils.scanCandidates, the same scanner PythonCodeRawInvalidTextSpec uses) verifies that classifierEntries / trainingEntries cover every concrete subclass on the classpath. Adding a new concrete OpDesc without also adding it to the manual list now fails this spec with a "drift" message listing the missing class names.

Note: SklearnClassifierOpDesc's empty-string base defaults are not exercised here. The earlier attempt to do so via new SklearnClassifierOpDesc {} created an anonymous test-package class under org.apache.texera.amber.operator.sklearn, which the PythonCodeRawInvalidTextSpec classpath scan picked up as a descriptor candidate and failed to instantiate (anonymous classes have no accessible no-arg constructor). The base default is never observable in production anyway, since every concrete subclass overrides both methods. A spec-file comment documents this so a future contributor doesn't reflexively re-add it.

Also includes a small fix to SklearnDummyClassifierOpDesc / SklearnTrainingDummyClassifierOpDesc (corrects from sklearn.dummy import dummyimport DummyClassifier) and SklearnTrainingBaggingOpDesc (corrects the duplicated word in Training: Bagging TrainingTraining: Bagging), surfaced while writing the registry table.

Any related issues, documentation, discussions?

Closes #4826

How was this PR tested?

sbt "WorkflowOperator/Test/testOnly org.apache.texera.amber.operator.sklearn.SklearnOpDescRegistrySpec"
# → 107 tests, all pass

sbt "WorkflowOperator/Test/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"
# → raw-invalid SUMMARY ok=116/116, py-compile SUMMARY ok=116/116

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Add SklearnOpDescRegistrySpec covering every concrete
SklearnClassifierOpDesc (24) and SklearnTrainingOpDesc (26) with the
exact (importStatement, userFriendlyModelName) pair each one returns,
plus the base-class defaults and a generatePythonCode smoke test for
both pipelines (UDFOperatorV2 / UDFTableOperator). A typo in either
string would silently misroute either the generated Python pipeline
or the user-facing UI label; pinning them in one table makes that a
test failure.

Closes apache#4826

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 47.77%. Comparing base (810263c) to head (95a382b).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4827      +/-   ##
============================================
+ Coverage     42.03%   47.77%   +5.73%     
- Complexity     2154     2159       +5     
============================================
  Files           980      818     -162     
  Lines         36292    25972   -10320     
  Branches       3783     2346    -1437     
============================================
- Hits          15257    12407    -2850     
+ Misses        20110    12813    -7297     
+ Partials        925      752     -173     
Flag Coverage Δ
access-control-service 39.53% <ø> (ø)
amber 42.86% <100.00%> (+0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 33.24% <ø> (ø)
workflow-compiling-service 47.72% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a ScalaTest spec to prevent regressions in the string “registry” contract for scikit-learn operator descriptors, since typos in (importStatement, userFriendlyModelName) can silently break generated Python pipelines or UI labels.

Changes:

  • Introduces SklearnOpDescRegistrySpec to pin (import statement, user-friendly model name) for classifier and training OpDesc subclasses.
  • Adds assertions for base-class defaults on SklearnClassifierOpDesc and SklearnTrainingOpDesc.
  • Adds generatePythonCode smoke tests verifying the expected import statement is embedded.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ing name

Per Copilot feedback on apache#4827:
- SklearnDummyClassifierOpDesc and SklearnTrainingDummyClassifierOpDesc
  imported the symbol "dummy" from sklearn.dummy, which does not exist.
  Correct to "DummyClassifier" to produce a runnable Python pipeline.
- SklearnTrainingBaggingOpDesc duplicated the word "Training" in the
  user-friendly model name ("Training: Bagging Training"). Correct to
  "Training: Bagging" to match the rest of the SklearnTraining* registry
  pattern.
- SklearnDummyClassifierOpDesc was missing from the spec's classifier
  registry; add it (now 25 entries) so a typo in this OpDesc is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aglinxinyuan aglinxinyuan enabled auto-merge (squash) May 3, 2026 08:47
@aglinxinyuan aglinxinyuan requested review from Yicong-Huang and removed request for Yicong-Huang May 4, 2026 04:08
Yicong-Huang and others added 11 commits May 3, 2026 21:11
…eRawInvalidTextSpec

CI on apache#4827 was failing in `PythonCodeRawInvalidTextSpec` with:

  - org.apache.texera.amber.operator.sklearn.SklearnOpDescRegistrySpec$$anon$1
      cannot instantiate (needs an accessible no-arg constructor or
      must be a Scala object)

The spec's "base class default" test was instantiating the abstract
`SklearnClassifierOpDesc` via `new SklearnClassifierOpDesc {}` — an
anonymous test-package class under
`org.apache.texera.amber.operator.sklearn`. The
`PythonCodeRawInvalidTextSpec` classpath scan picks up everything in
`org.apache.texera.amber.operator` that extends
`PythonOperatorDescriptor` and tries to instantiate it via no-arg
constructor (or a Scala object MODULE$); anonymous classes have no
accessible no-arg constructor (they capture an enclosing instance), so
the scan fails on it.

The base-class empty-string default is never observable in production
(every concrete subclass below overrides both methods), so this fix
just deletes the test and leaves a comment explaining why no future
contributor should re-add an anonymous-subclass instantiation here.
PythonCodeRawInvalidTextSpec now reports ok=116/116 (was 116/117).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot feedback on apache#4827:

- Completeness: add classpath-scan completeness assertions for both
  classifierEntries and trainingEntries. Reuse PythonReflectionUtils
  .scanCandidates (the same scanner PythonCodeRawInvalidTextSpec uses)
  so a new concrete SklearnClassifierOpDesc / SklearnTrainingOpDesc
  subclass cannot silently bypass this registry — adding one without
  also adding it to the manual list will fail with a "drift" message
  listing the missing class names. SklearnTrainingOpDesc itself is
  concrete (used as a default fallback) so it is excluded from the
  diff for the training side.

- Grammar: rephrase the file-level scaladoc from "would silently
  misroute downstream UI labels and breakage of the generated Python
  pipeline" to "...labels and cause breakage in the generated Python
  pipeline".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aglinxinyuan aglinxinyuan requested a review from Copilot May 4, 2026 08:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aglinxinyuan aglinxinyuan merged commit ef0634d into apache:main May 4, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add unit test coverage for Sklearn OpDesc registry

4 participants