
[SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py #34671

Conversation

nchammas (Contributor)

What changes were proposed in this pull request?

This PR inlines the type annotations for {ml, mllib}/common.py.
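
For illustration, a minimal sketch of the pattern, assuming the signature discussed later in this review (the real file differs in detail). The hints previously lived in a separate common.pyi stub; after this change they sit directly on the implementations:

# common.pyi (stub, removed by this PR):
#     def _java2py(sc: SparkContext, r: JavaObjectOrPickleDump, encoding: str = "bytes") -> Any: ...

# common.py, with the annotation now written inline on the real implementation:
from typing import TYPE_CHECKING, Any

from pyspark import SparkContext

if TYPE_CHECKING:
    from pyspark.ml._typing import JavaObjectOrPickleDump


def _java2py(sc: SparkContext, r: "JavaObjectOrPickleDump", encoding: str = "bytes") -> Any:
    ...  # runtime body unchanged; only the signature gains annotations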

Why are the changes needed?

This allows us to run type checks against the code within both versions of common.py.

This would help contributors catch some issues more easily, like this one: #34606 (comment)

Does this PR introduce any user-facing change?

Potentially. The C TypeVar is now public.

How was this patch tested?

Existing tests.

python/pyspark/ml/common.py (resolved review thread, outdated)
@zero323 (Member) left a comment


We should probably separate the ml and mllib parts and add umbrella tickets for both, for bookkeeping.

Additionally, we'll need return types for all functions.
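
For instance (a hypothetical before/after, not taken from the actual diff), a helper whose parameters are annotated would also gain an explicit return annotation:

from py4j.java_gateway import JavaObject
from pyspark import RDD

# Before: parameter annotated, return type left implicit (inferred as Any)
def _to_java_object_rdd(rdd: RDD):
    ...

# After: explicit return type, so callers of the helper are checked too
def _to_java_object_rdd(rdd: RDD) -> JavaObject:
    ...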

@zero323 (Member) commented Nov 19, 2021

cc @HyukjinKwon @ueshin @xinrong-databricks FYI

python/mypy.ini (two resolved review threads, outdated)
@nchammas (Contributor, Author)

> We should probably separate the ml and mllib parts and add umbrella tickets for both, for bookkeeping.

Just to be clear, are you saying I should split this PR into ml/common.py vs. mllib/common.py?

And then have an umbrella ticket for adding type annotations to all of ml/, and another one for mllib/?

@SparkQA commented Nov 19, 2021

Test build #145468 has finished for PR 34671 at commit 8bb6757.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49940/

@zero323 (Member) commented Nov 19, 2021

> Just to be clear, are you saying I should split this PR into ml/common.py vs. mllib/common.py?
>
> And then have an umbrella ticket for adding type annotations to all of ml/, and another one for mllib/?

For context ‒ we're in the middle of migrating type hints from separate stubs to inline annotations. At the moment we have two umbrella tickets ‒ SPARK-36845 and SPARK-37094, for SQL and core respectively. We should follow this convention for ml and mllib as well.

It should be OK to have two tickets (ml.common and mllib.common) and resolve both in this PR, since you've already started working on that.

@SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49940/

@zero323 changed the title from [SPARK-37393][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py to [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py on Nov 20, 2021
@nchammas force-pushed the SPARK-37393-inline-ml-common-type-annotations branch from 8bb6757 to db6a5b9 on November 22, 2021 15:47
@SparkQA commented Nov 22, 2021

Test build #145514 has finished for PR 34671 at commit db6a5b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49985/

@SparkQA commented Nov 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49985/

python/pyspark/ml/_typing.pyi (resolved review thread)
python/pyspark/ml/common.py (resolved review thread)
@nchammas (Contributor, Author)

The number of ignore[attr-defined] hints required seems a little wrong. But I suppose addressing that would require changes to SparkContext, which is out of scope for this PR.

@SparkQA commented Nov 22, 2021

Test build #145516 has finished for PR 34671 at commit ed56686.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 (Member) commented Nov 22, 2021

> The number of ignore[attr-defined] hints required seems a little wrong. But I suppose addressing that would require changes to SparkContext, which is out of scope for this PR.

That's not optimal, but expected (see #34680 (comment)). If you encounter a case where there is no ongoing migration work and you can avoid the ignores, it should be OK to extend the stub. Otherwise, we'll do another pass (ignores on _jvm are likely to stay, even if with a different error code, because we cannot look into the JVM at this stage).
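
As a rough, hypothetical illustration of the kind of ignore being discussed (the function name here is made up; ml/common.py reaches the JVM through sc._jvm in a similar way):

from typing import Any

from pyspark import SparkContext


def _example_call_into_jvm(sc: SparkContext, data: bytes) -> Any:
    # _jvm is populated dynamically through Py4J, so mypy cannot see its
    # attributes; the access is silenced with a targeted ignore rather than
    # by loosening the SparkContext stubs.
    return sc._jvm.org.apache.spark.ml.python.MLSerDe.loads(data)  # type: ignore[attr-defined]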

@zero323 (Member) left a comment


Ignoring minor issues, LGTM.

@SparkQA commented Nov 22, 2021

Test build #145518 has finished for PR 34671 at commit 9b08ad0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/

@SparkQA commented Nov 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/

@SparkQA commented Nov 22, 2021

Test build #145519 has finished for PR 34671 at commit 15b5fc4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49988/

@SparkQA commented Nov 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49990/

@SparkQA commented Nov 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/

@@ -15,11 +15,15 @@
# limitations under the License.
#

from typing import Any, Callable
from pyspark.ml._typing import C, JavaObjectOrPickleDump
@zero323 (Member) commented Nov 22, 2021


This import should happen in a TYPE_CHECKING block:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pyspark.ml._typing import C, JavaObjectOrPickleDump

Consequently, JavaObjectOrPickleDump and C have to be quoted when used ("JavaObjectOrPickleDump"), i.e.

def _java2py(sc: SparkContext, r: "JavaObjectOrPickleDump", encoding: str = "bytes") -> Any: ...

That's because objects in stubs have no runtime equivalents.

@nchammas (Contributor, Author)


Ah OK. The TYPE_CHECKING block addresses the namespace pollution issue, I suppose.

But can you elaborate on why the type needs to be quoted? I understand that's for when the type is not known at that point in time (e.g. a self-referential type), but that isn't the case here.

@zero323 (Member)


In general, whatever goes into a TYPE_CHECKING block is not imported during normal execution. So, as is, all the names we use would be undefined when these modules are imported.

PEP 563 introduces a concept of postponed evaluation, but we're not ready to go there yet (for starters, we still haven't formally dropped 3.6 support ‒ I am working on cleaning up the code, and then we have some components that might require code changes).
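
For context, a sketch of what postponed evaluation would allow once it can be relied on (not something this PR uses): with PEP 563 every annotation is stored as a string and never evaluated at import time, so TYPE_CHECKING-only names need no manual quoting.

from __future__ import annotations  # PEP 563: postponed evaluation of annotations

from typing import TYPE_CHECKING, Any

from pyspark import SparkContext

if TYPE_CHECKING:
    from pyspark.ml._typing import JavaObjectOrPickleDump


# No quotes needed: the annotation stays a string that only the type checker resolves.
def _java2py(sc: SparkContext, r: JavaObjectOrPickleDump, encoding: str = "bytes") -> Any:
    ...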

@SparkQA commented Nov 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49991/

@SparkQA commented Nov 23, 2021

Test build #145550 has finished for PR 34671 at commit 61fe9fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/

@SparkQA commented Nov 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/

@zero323 (Member) left a comment


I'll merge it tomorrow, unless there are any further comments.

@zero323 closed this in 3565c3a on Nov 24, 2021
@zero323 (Member) commented Nov 24, 2021

Merged to master.

Thanks all!
