
[SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat#34354

Closed
zero323 wants to merge 5 commits into apache:master from zero323:SPARK-37085

Conversation

@zero323
Member

@zero323 zero323 commented Oct 21, 2021

What changes were proposed in this pull request?

This PR adds overloads to the following pyspark.sql.functions:

  • array
  • struct
  • create_map
  • map_concat

to support calls with a single list or tuple argument, i.e.

array(["foo", "bar"])
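As a sketch of what such an overload pair looks like (a simplified, hypothetical stand-in — `Column` here is a toy class, not the exact signatures merged in this PR):

```python
from typing import List, Tuple, TypeVar, Union, overload


class Column:
    """Toy stand-in for pyspark.sql.Column, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name


ColumnOrName = Union[Column, str]
ColumnOrName_ = TypeVar("ColumnOrName_", bound=ColumnOrName)


@overload
def array(*cols: ColumnOrName) -> Column: ...
@overload
def array(__cols: Union[List[ColumnOrName_], Tuple[ColumnOrName_, ...]]) -> Column: ...


def array(*cols):
    # Runtime behavior mirrors PySpark's: a single list/tuple argument
    # is unpacked into individual columns before processing.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    names = ", ".join(c.name if isinstance(c, Column) else c for c in cols)
    return Column("array(%s)" % names)


# Both call styles now satisfy the type checker:
array("foo", "bar")
array(["foo", "bar"])
```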

Why are the changes needed?

These calls are supported by the current implementation, but don't type check.

Does this PR introduce any user-facing change?

Type checker only, as described above.

How was this patch tested?

Existing tests and manual tests (to be added in SPARK-36989)

@zero323
Member Author

zero323 commented Oct 21, 2021

New annotations are already implemented, but I think we might have to redefine ColumnOrName to fully support these, so I'll keep this as a draft for now.

FYI @ueshin, @HyukjinKwon, @xinrong-databricks

@SparkQA

SparkQA commented Oct 21, 2021

Test build #144500 has finished for PR 34354 at commit cdc1965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48972/

@SparkQA

SparkQA commented Oct 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48972/

@ueshin
Member

ueshin commented Oct 22, 2021

I think we might have to redefine ColumnOrName to fully support these

What's your idea like?

@zero323
Member Author

zero323 commented Oct 22, 2021

I think we might have to redefine ColumnOrName to fully support these

What's your idea like?

Long story short, I've been looking into different scenarios for using aliases and types. Adding inline hints definitely introduced use cases that we didn't have before, most notably casts (which further split into cases where we have generics, types bound from the function signature, and none of the above). And there are variance issues, which pop up here and there.

I suspect that some of the cases where invariant generics hit us might be addressed with bounded type vars:

```python
from typing import List, TypeVar, Union

from pyspark.sql import Column
from pyspark.sql.functions import col

ColumnOrName = Union[str, Column]
ColumnOrName_ = TypeVar("ColumnOrName_", bound=ColumnOrName)

def array(__cols: List[ColumnOrName_]) -> Column: ...

column_names = ["a", "b", "c"]
array(column_names)

columns = [col(x) for x in column_names]
array(columns)
```

but these are not universal and there might be some caveats that I don't see at the moment.

I hope there will be an opportunity to discuss this stuff in a more interactive manner.

(Note: ColumnOrName is still needed for casts and other annotations in contexts where ColumnOrName_ would be unbound, like functions without ColumnOrName_ in their arguments.)
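To illustrate the invariance problem the bounded TypeVar works around (a toy sketch, not PySpark's actual stubs — the function names here are hypothetical):

```python
from typing import List, TypeVar, Union


class Column:
    """Toy stand-in for pyspark.sql.Column."""


ColumnOrName = Union[str, Column]

# With a plain Union parameter, List is invariant, so mypy rejects
# List[str] where List[Union[str, Column]] is expected:
def array_union(cols: List[ColumnOrName]) -> None: ...

# With a bounded TypeVar, mypy solves T = str or T = Column per call
# site, so homogeneous lists of either type are accepted:
ColumnOrName_ = TypeVar("ColumnOrName_", bound=ColumnOrName)

def array_bound(cols: List[ColumnOrName_]) -> None: ...


names: List[str] = ["a", "b"]
array_union(names)  # mypy: error (List invariance)
array_bound(names)  # mypy: OK
```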

zero323 added a commit that referenced this pull request Nov 21, 2021
### What changes were proposed in this pull request?

This PR changes `RDD[~T]` and `DStream[~T]` to `RDD[+T]` and `DStream[+T]`, respectively.

### Why are the changes needed?

To improve usability of the current annotations and simplify further development of type hints. Let's take a simple `RDD`-to-`DataFrame` conversion as an example. Currently, the following code will not type check

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, 2)])
reveal_type(rdd)

spark.createDataFrame(rdd)
```

with

```
main.py:8: note: Revealed type is "pyspark.rdd.RDD[Tuple[builtins.int, builtins.int]]"
main.py:10: error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "RDD[Tuple[int, int]]"; expected "Union[RDD[Tuple[Any, ...]], Iterable[Tuple[Any, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

To type check, `rdd` would have to be annotated with a specific type matching the signature of the `createDataFrame` method:

```python
rdd: RDD[Tuple[Any, ...]] = sc.parallelize([(1, 2)])
```

Alternatively, one could inline the definition:

```python
spark.createDataFrame(sc.parallelize([(1, 2)]))
```

Similarly, with `pyspark.mllib`:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import SparseVector, Vectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([
    Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
    Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])

KMeans.train(rdd, 2)
```

we'd get

```
main.py:14: error: Argument 1 to "train" of "KMeans" has incompatible type "RDD[SparseVector]"; expected "RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

but this time, we'd need a much more complex annotation (inlining would work as well):

```python
rdd: RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]] = sc.parallelize([
    Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
    Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])
```

This happens because

- RDD is invariant in terms of stored type.
- mypy doesn't look ahead to infer types of objects based on the usage context (similar to the Scala console / spark-shell, but unlike the standalone Scala compiler, which allows us to have [examples like this](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala))
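The invariance point can be sketched with toy generics (hypothetical classes standing in for the real `RDD`):

```python
from typing import Generic, TypeVar


class Animal: ...
class Dog(Animal): ...


T_inv = TypeVar("T_inv")                   # invariant, like the old RDD[~T]
T_cov = TypeVar("T_cov", covariant=True)   # covariant, like the new RDD[+T]


class RDDInv(Generic[T_inv]): ...
class RDDCov(Generic[T_cov]): ...


def take_inv(rdd: "RDDInv[Animal]") -> None: ...
def take_cov(rdd: "RDDCov[Animal]") -> None: ...


dogs_inv: "RDDInv[Dog]" = RDDInv()
dogs_cov: "RDDCov[Dog]" = RDDCov()

take_inv(dogs_inv)  # mypy: error -- RDDInv[Dog] is not an RDDInv[Animal]
take_cov(dogs_cov)  # mypy: OK once T is covariant
```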

This not only makes things verbose, but also fragile and dependent on implementation details. In the first example, where we have a top-level `Union`, we can just use `RDD[...]` and ignore the other members.

In the second case, where the `Union` is a type parameter, we have to match all of its components (it could be simpler if we didn't use `RDD[VectorLike]` but defined something like `RDD[ndarray] | RDD[Vector] | RDD[List[float]] | RDD[Tuple[float, ...]]`, which would make it closer to the first case, though not semantically equivalent to the current signature).

Theoretically, we could partially address this with different definitions of the aliases, for example using type bounds (see the discussion under #34354), but that doesn't scale well and requires the same steps to be taken by every library that depends on PySpark.

See also related discussion about Scala counterpart ‒ SPARK-1296

### Does this PR introduce _any_ user-facing change?

Type hints only.

Users will be able to use both subclasses of `RDD` / `DStream` in certain contexts, without explicit annotations or casts (both examples will pass type checker in their original form).

### How was this patch tested?

Existing tests and not-yet-released data tests (SPARK-36989).

Closes #34374 from zero323/SPARK-37104.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50043/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50043/

@SparkQA

SparkQA commented Nov 24, 2021

Test build #145577 has finished for PR 34354 at commit 2693310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50050/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50050/

@SparkQA

SparkQA commented Nov 24, 2021

Test build #145589 has finished for PR 34354 at commit 7fe1908.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50061/

@SparkQA

SparkQA commented Nov 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50061/

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146150 has finished for PR 34354 at commit 635b230.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50623/

@SparkQA

SparkQA commented Dec 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50623/

@zero323 zero323 changed the title [WIP][SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat [SPARK-37085][PYTHON][SQL] Add list/tuple overloads to array, struct, create_map, map_concat Dec 14, 2021
@zero323 zero323 marked this pull request as ready for review December 14, 2021 01:53
@zero323 zero323 requested a review from ueshin December 14, 2021 01:53
@zero323 zero323 force-pushed the SPARK-37085 branch 2 times, most recently from f8b396f to 7a646b1 Compare December 19, 2021 07:37
@SparkQA

SparkQA commented Dec 19, 2021

Test build #146374 has finished for PR 34354 at commit 7a646b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50848/

@SparkQA

SparkQA commented Dec 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50849/

@SparkQA

SparkQA commented Dec 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50849/

@SparkQA

SparkQA commented Dec 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50848/

@zero323 zero323 requested a review from HyukjinKwon December 22, 2021 18:08
@HyukjinKwon
Member

@itholic can you take a look please?

@itholic
Contributor

itholic commented Dec 29, 2021

Could you rebase this branch onto master?

The mypy annotation tests don't seem to pass in the current state.

@zero323
Member Author

zero323 commented Dec 29, 2021

Could you rebase this branch onto master?

The mypy annotation tests don't seem to pass in the current state.

@itholic Done. It seems to pass everything in CI. Did you have any particular issues in mind? (I saw some weird local issues until I cleared the mypy cache after the upgrade.)

@itholic
Contributor

itholic commented Jan 5, 2022

@zero323 Hmm, I didn't see anything particularly strange except for a few issues caused by a different mypy version. It seems to be working fine now.

@itholic
Contributor

itholic commented Jan 5, 2022

Oh, python/pyspark/pandas/indexes/base.py:48: error: Module "pandas._libs" has no attribute "lib" [attr-defined] has started to fail. Do you see the same issue locally?

Comment on lines +1679 to +1681
```python
@overload
def struct(__cols: Union[List["ColumnOrName_"], Tuple["ColumnOrName_", ...]]) -> Column:
    ...
```
Contributor

@itholic itholic Jan 5, 2022

It seems like List["ColumnOrName"] alone works fine.

Could you briefly explain why we need the Union of List and Tuple?

And could I ask what the ... means in ["ColumnOrName_", ...]?

Member Author

We still want to support (it was a common user request in the past) calls like

struct(("foo", "bar"))

which shouldn't be accepted without Tuple (or some supertype).

If you try

```diff
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 006d10c9fc..caf17a84b3 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1677,13 +1677,11 @@ def struct(*cols: "ColumnOrName") -> Column:
 
 
 @overload
-def struct(__cols: Union[List["ColumnOrName_"], Tuple["ColumnOrName_", ...]]) -> Column:
+def struct(__cols: Union[List["ColumnOrName_"]]) -> Column:
     ...
 
 
-def struct(
-    *cols: Union["ColumnOrName", Union[List["ColumnOrName_"], Tuple["ColumnOrName_", ...]]]
-) -> Column:
+def struct(*cols: Union["ColumnOrName", Union[List["ColumnOrName_"]]]) -> Column:
     """Creates a new struct column.
 
     .. versionadded:: 1.4.0
diff --git a/python/pyspark/sql/tests/typing/test_functions.yml b/python/pyspark/sql/tests/typing/test_functions.yml
index efb3293472..f5f2f13c4f 100644
--- a/python/pyspark/sql/tests/typing/test_functions.yml
+++ b/python/pyspark/sql/tests/typing/test_functions.yml
@@ -66,6 +66,8 @@
     create_map(col_objs)
     map_concat(col_objs)
 
+    struct(("foo", "bar"))
+
   out: |
     main:29: error: No overload variant of "array" matches argument types "List[Column]", "List[Column]"  [call-overload]
     main:29: note: Possible overload variant:
```
you should see an error in the data tests:

```
___________________________ varargFunctionsOverloads ___________________________
/path/to/spark/python/pyspark/sql/tests/typing/test_functions.yml:19: 
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output: 
E   Actual:
E     main:50: error: No overload variant of "struct" matches argument type "Tuple[str, str]"  [call-overload] (diff)
E     main:50: note: Possible overload variants:    (diff)
E     main:50: note:     def struct(*cols: Union[Column, str]) -> Column (diff)
E     main:50: note:     def [ColumnOrName_] struct(List[ColumnOrName_]) -> Column (diff)
E   Expected:
E     (empty)
=========================== short test summary info ============================
```

And could I ask what the ... means in ["ColumnOrName_", ...]?

Tuples are typed like product types, so Tuple[ColumnOrName_] matches a tuple with exactly one Column-or-str element. In contrast, Tuple["ColumnOrName_", ...] matches tuples of arbitrary length, as long as all elements are columns or strings (there is a mypy docs section that discusses this syntax further).
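As a small, self-contained illustration of the fixed-length vs. variadic distinction (using a stand-in `Column` class and hypothetical function names, not the real PySpark ones):

```python
from typing import Tuple, Union, get_args


class Column:
    """Toy stand-in for pyspark.sql.Column."""


ColumnOrName = Union[str, Column]

# Product type: matches tuples with exactly one ColumnOrName element.
ExactlyOne = Tuple[ColumnOrName]
# Variadic form: matches tuples of any length, all elements ColumnOrName.
AnyLength = Tuple[ColumnOrName, ...]


def needs_one(t: ExactlyOne) -> None: ...
def needs_any(t: AnyLength) -> None: ...


needs_any(("foo",))               # mypy: OK
needs_any(("foo", "bar", "baz"))  # mypy: OK
# needs_one(("foo", "bar"))       # mypy: error -- too many elements

# At runtime, the `...` literal is visible as the second type argument:
assert get_args(AnyLength) == (ColumnOrName, Ellipsis)
```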

Contributor

Got it. Thanks for the comment!! 🙏

@zero323
Member Author

zero323 commented Jan 5, 2022

@zero323 Hmm, I didn't see anything particularly strange except for a few issues caused by a different mypy version. It seems to be working fine now.

I don't think I've seen this one. In my local env I primarily see some errors caused by output mismatches in data tests, dynamically defined methods, and the stuff covered by #34946, but all of these are non-deterministic and rare, so I don't even have a good place to start debugging and asking questions.

Contributor

@itholic itholic left a comment

It seems to be the best for now, until mypy itself supports looser typing conventions.

Member

@HyukjinKwon HyukjinKwon left a comment

Haven't taken a close look but I am fine with this.

@zero323 zero323 closed this in 9d9253c Jan 8, 2022
@zero323
Member Author

zero323 commented Jan 8, 2022

Merged into master.

Thanks everyone!

@zero323 zero323 deleted the SPARK-37085 branch January 8, 2022 23:54
@zero323
Member Author

zero323 commented Jan 9, 2022

It seems to be the best for now, until mypy itself supports looser typing conventions.

As @ueshin pointed out, using Sequence might provide a more general approach (and allow us to forget about variance), but there are two problems:

  • It is a high-level interface that potentially covers more than lists and tuples. So this would require a change to the logic, and to provide a consistent UX we should probably do the same in other places where sequence-ish input is accepted.
  • There is a concern about ambiguity (str is a Sequence[str]), which causes my irrational fear that it could break in some hard-to-contain ways (most likely, I am overthinking it).
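The str-ambiguity in the second point is easy to demonstrate with the standard library alone (a sketch of the concern, with a hypothetical interpretation of what a Sequence-based overload could match):

```python
from collections.abc import Sequence

# A str is itself a Sequence of str (its characters), so an overload
# accepting Sequence[ColumnOrName] would also match a bare string:
assert isinstance("abc", Sequence)
assert list("abc") == ["a", "b", "c"]

# i.e. a hypothetical Sequence-based struct("abc") could be read either
# as a struct over one column named "abc", or as a struct over three
# columns "a", "b", "c" -- the ambiguity mentioned above.
```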

dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022
… create_map, map_concat

### What changes were proposed in this pull request?

This PR adds overloads to the following `pyspark.sql.functions`:

- `array`
- `struct`
- `create_map`
- `map_concat`

to support calls with a single `list` or `tuple` argument, i.e.

```python
array(["foo", "bar"])
```

### Why are the changes needed?

These calls are supported by the current implementation, but don't type check.

### Does this PR introduce _any_ user-facing change?

Type checker only, as described above.

### How was this patch tested?

Existing tests and manual tests (to be added in SPARK-36989)

Closes apache#34354 from zero323/SPARK-37085.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>