[SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py #34671
We should probably separate the ml and mllib parts and add umbrella tickets for both for bookkeeping.
Additionally, we'll need return types for all functions.
cc @HyukjinKwon @ueshin @xinrong-databricks FYI
Just to be clear, are you saying I should split this PR into And then have an umbrella ticket for adding type annotations to all of
For context: we're in the middle of migrating type hints from stub files to inline annotations. At the moment we have two umbrella tickets, SPARK-36845 and SPARK-37094, for SQL and core respectively. We should follow this convention for ml and mllib as well. It should be OK to have two tickets (
The number of |
That's not optimal, but expected (see #34680 (comment)). If you encounter a case where there is no ongoing migration work and you can avoid ignores, it should be OK to extend the stub. Otherwise, we'll do another pass (ignores on
Ignoring minor issues, LGTM.
This reverts commit 2f54cf4.
python/pyspark/ml/common.py (outdated)
@@ -15,11 +15,15 @@
# limitations under the License.
#

from typing import Any, Callable
from pyspark.ml._typing import C, JavaObjectOrPickleDump
This import should happen in a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pyspark.ml._typing import C, JavaObjectOrPickleDump

Consequently, JavaObjectOrPickleDump and C have to be quoted when used ("JavaObjectOrPickleDump"), i.e.

def _java2py(sc: SparkContext, r: "JavaObjectOrPickleDump", encoding: str = "bytes") -> Any: ...

That's because objects in stubs have no runtime equivalents.
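For readers who want to try the pattern without a Spark checkout, here is a minimal sketch. It assumes `fractions.Fraction` as a stand-in for the stub-only names in `pyspark.ml._typing`, and `describe` is a hypothetical function, not part of PySpark.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime.
    # Fraction is a stand-in for names such as JavaObjectOrPickleDump.
    from fractions import Fraction


def describe(value: "Fraction", encoding: str = "bytes") -> Any:
    # The quoted annotation is stored as a plain string, so the missing
    # runtime name is never looked up when the function is defined.
    return f"{value!r} ({encoding})"


# The name imported under TYPE_CHECKING does not exist at runtime.
fraction_defined_at_runtime = "Fraction" in globals()
```

The function is still callable with any object at runtime, since annotations are not enforced; only the type checker resolves `"Fraction"` against the guarded import.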
Ah OK. The TYPE_CHECKING block addresses the namespace pollution issue, I suppose.
But can you elaborate on why the type needs to be quoted? I understand that's for when the type is not known at that point in time (e.g. a self-referential type), but that isn't the case here.
In general, whatever goes into a TYPE_CHECKING block is not imported during normal execution. So as is, all the names we use would be undefined when these scripts are imported.
PEP 563 introduces the concept of postponed evaluation, but we're not ready to go there yet (for starters, we still haven't formally dropped 3.6 support; I am working on cleaning the code, and then we have some components that might require code changes).
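For comparison, a rough sketch of what postponed evaluation would allow once Python 3.6 is out of the picture (the `__future__` import requires 3.7+). The module body is kept in a string only so that the snippet runs standalone; in a real file the `__future__` import would simply be the first statement. `Fraction` and `describe` are again hypothetical stand-ins, not PySpark names.

```python
import textwrap

# Source of a hypothetical module that relies on PEP 563.
MODULE_SOURCE = textwrap.dedent(
    '''
    from __future__ import annotations  # annotations are kept as strings

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        from fractions import Fraction  # stand-in for pyspark.ml._typing names

    def describe(value: Fraction, encoding: str = "bytes") -> str:
        # No quotes needed: the annotation is not evaluated at definition time.
        return f"{value!r} ({encoding})"
    '''
)

# Execute the module body in its own namespace, as an import would.
namespace: dict = {}
exec(compile(MODULE_SOURCE, "<pep563_demo>", "exec"), namespace)
describe = namespace["describe"]
```

With the future import in place, the unquoted `Fraction` annotation is stored as the string `"Fraction"`, which is the same effect the manual quoting achieves today.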
I'll merge it tomorrow, unless there are any further comments.
Merged to master. Thanks all!
What changes were proposed in this pull request?
This PR inlines the type annotations for {ml, mllib}/common.py.

Why are the changes needed?
This allows us to run type checks against the code within both versions of common.py. This would help contributors catch some issues more easily, like this one: #34606 (comment)

Does this PR introduce any user-facing change?
Potentially. The C TypeVar is now public.

How was this patch tested?
Existing tests.