test(workflow-operator): pin Sklearn OpDesc registry strings#4827
Conversation
Add SklearnOpDescRegistrySpec covering every concrete SklearnClassifierOpDesc (24) and SklearnTrainingOpDesc (26) with the exact (importStatement, userFriendlyModelName) pair each one returns, plus the base-class defaults and a generatePythonCode smoke test for both pipelines (UDFOperatorV2 / UDFTableOperator). A typo in either string would silently misroute either the generated Python pipeline or the user-facing UI label; pinning them in one table makes that a test failure. Closes apache#4826 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4827 +/- ##
============================================
+ Coverage 42.03% 47.77% +5.73%
- Complexity 2154 2159 +5
============================================
Files 980 818 -162
Lines 36292 25972 -10320
Branches 3783 2346 -1437
============================================
- Hits 15257 12407 -2850
+ Misses 20110 12813 -7297
+ Partials 925 752 -173
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a ScalaTest spec to prevent regressions in the string “registry” contract for scikit-learn operator descriptors, since typos in (importStatement, userFriendlyModelName) can silently break generated Python pipelines or UI labels.
Changes:
- Introduces
SklearnOpDescRegistrySpecto pin(import statement, user-friendly model name)for classifier and training OpDesc subclasses. - Adds assertions for base-class defaults on
SklearnClassifierOpDescandSklearnTrainingOpDesc. - Adds
generatePythonCodesmoke tests verifying the expected import statement is embedded.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ing name Per Copilot feedback on apache#4827: - SklearnDummyClassifierOpDesc and SklearnTrainingDummyClassifierOpDesc imported the symbol "dummy" from sklearn.dummy, which does not exist. Correct to "DummyClassifier" to produce a runnable Python pipeline. - SklearnTrainingBaggingOpDesc duplicated the word "Training" in the user-friendly model name ("Training: Bagging Training"). Correct to "Training: Bagging" to match the rest of the SklearnTraining* registry pattern. - SklearnDummyClassifierOpDesc was missing from the spec's classifier registry; add it (now 25 entries) so a typo in this OpDesc is caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eRawInvalidTextSpec CI on apache#4827 was failing in `PythonCodeRawInvalidTextSpec` with: - org.apache.texera.amber.operator.sklearn.SklearnOpDescRegistrySpec$$anon$1 cannot instantiate (needs an accessible no-arg constructor or must be a Scala object) The spec's "base class default" test was instantiating the abstract `SklearnClassifierOpDesc` via `new SklearnClassifierOpDesc {}` — an anonymous test-package class under `org.apache.texera.amber.operator.sklearn`. The `PythonCodeRawInvalidTextSpec` classpath scan picks up everything in `org.apache.texera.amber.operator` that extends `PythonOperatorDescriptor` and tries to instantiate it via no-arg constructor (or a Scala object MODULE$); anonymous classes have no accessible no-arg constructor (they capture an enclosing instance), so the scan fails on it. The base-class empty-string default is never observable in production (every concrete subclass below overrides both methods), so this fix just deletes the test and leaves a comment explaining why no future contributor should re-add an anonymous-subclass instantiation here. PythonCodeRawInvalidTextSpec now reports ok=116/116 (was 116/117). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address Copilot feedback on apache#4827: - Completeness: add classpath-scan completeness assertions for both classifierEntries and trainingEntries. Reuse PythonReflectionUtils .scanCandidates (the same scanner PythonCodeRawInvalidTextSpec uses) so a new concrete SklearnClassifierOpDesc / SklearnTrainingOpDesc subclass cannot silently bypass this registry — adding one without also adding it to the manual list will fail with a "drift" message listing the missing class names. SklearnTrainingOpDesc itself is concrete (used as a default fallback) so it is excluded from the diff for the training side. - Grammar: rephrase the file-level scaladoc from "would silently misroute downstream UI labels and breakage of the generated Python pipeline" to "...labels and cause breakage in the generated Python pipeline". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
What changes were proposed in this PR?
Add
SklearnOpDescRegistrySpeccovering every concreteSklearnClassifierOpDesc(25 subclasses) andSklearnTrainingOpDesc(26 subclasses) with the exact(importStatement, userFriendlyModelName)pair each one returns. A typo in either string would silently misroute either the generated Python pipeline or the user-facing UI label; pinning them in one table makes that a test failure.Also covers:
SklearnTrainingOpDescbase default (RandomForest)generatePythonCodesmoke test for a classifier and a training subclass, verifying the import string is embeddedPythonReflectionUtils.scanCandidates, the same scannerPythonCodeRawInvalidTextSpecuses) verifies thatclassifierEntries/trainingEntriescover every concrete subclass on the classpath. Adding a new concrete OpDesc without also adding it to the manual list now fails this spec with a "drift" message listing the missing class names.Note:
SklearnClassifierOpDesc's empty-string base defaults are not exercised here. The earlier attempt to do so vianew SklearnClassifierOpDesc {}created an anonymous test-package class underorg.apache.texera.amber.operator.sklearn, which thePythonCodeRawInvalidTextSpecclasspath scan picked up as a descriptor candidate and failed to instantiate (anonymous classes have no accessible no-arg constructor). The base default is never observable in production anyway, since every concrete subclass overrides both methods. A spec-file comment documents this so a future contributor doesn't reflexively re-add it.Also includes a small fix to
SklearnDummyClassifierOpDesc/SklearnTrainingDummyClassifierOpDesc(correctsfrom sklearn.dummy import dummy→import DummyClassifier) andSklearnTrainingBaggingOpDesc(corrects the duplicated word inTraining: Bagging Training→Training: Bagging), surfaced while writing the registry table.Any related issues, documentation, discussions?
Closes #4826
How was this PR tested?
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)