[SPARK-37104][PYTHON] Make RDD and DStream covariant
### What changes were proposed in this pull request?

This PR changes `RDD[~T]` and `DStream[~T]` to `RDD[+T]` and `DStream[+T]` respectively.

### Why are the changes needed?

To improve usability of the current annotations and simplify further development of type hints.

Let's take a simple `RDD`-to-`DataFrame` conversion as an example. Currently, the following code will not type check:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, 2)])
reveal_type(rdd)
spark.createDataFrame(rdd)
```

failing with

```
main.py:8: note: Revealed type is "pyspark.rdd.RDD[Tuple[builtins.int, builtins.int]]"
main.py:10: error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "RDD[Tuple[int, int]]"; expected "Union[RDD[Tuple[Any, ...]], Iterable[Tuple[Any, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

To type check, `rdd` would have to be annotated with a specific type matching the signature of the `createDataFrame` method:

```python
rdd: RDD[Tuple[Any, ...]] = sc.parallelize([(1, 2)])
```

Alternatively, one could inline the definition:

```python
spark.createDataFrame(sc.parallelize([(1, 2)]))
```

Similarly, with `pyspark.mllib`:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import SparseVector, Vectors

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([
    Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
    Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])

KMeans.train(rdd, 2)
```

we'd get

```
main.py:14: error: Argument 1 to "train" of "KMeans" has incompatible type "RDD[SparseVector]"; expected "RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]]"
Found 1 error in 1 file (checked 1 source file)
```

but this time we'd need a much more complex annotation (inlining would work as well):

```python
rdd: RDD[Union[ndarray[Any, Any], Vector, List[float], Tuple[float, ...]]] = sc.parallelize([
    Vectors.sparse(10, [1, 3, 5], [1, 1, 1]),
    Vectors.sparse(10, [2, 4, 6], [1, 1, 1]),
])
```

This happens because:

- `RDD` is invariant in terms of the stored type.
- mypy doesn't look forward to infer the types of objects depending on the usage context (similarly to the Scala console / spark-shell, but unlike the standalone Scala compiler, which allows us to have [examples like this](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/KMeansExample.scala)).

This not only makes things verbose, but also fragile and dependent on implementation details. In the first example, where we have a top-level `Union`, we can just use `RDD[...]` and ignore the other members. In the second case, where the `Union` is a type parameter, we have to match all of its components (it could be simpler if we didn't use `RDD[VectorLike]` but defined something like `RDD[ndarray] | RDD[Vector] | RDD[List[float]] | RDD[Tuple[float, ...]]`, which should make it closer to the first case, though not semantically equivalent to the current signature).

Theoretically, we could partially address this with different definitions of aliases, like using type bounds (see the discussion under #34354), but that doesn't scale well and requires the same steps to be taken by every library that depends on PySpark.

See also the related discussion about the Scala counterpart ‒ SPARK-1296.

### Does this PR introduce _any_ user-facing change?

Type hints only. Users will be able to use subclasses of `RDD` / `DStream` in certain contexts without explicit annotations or casts (both examples above will pass the type checker in their original form).

### How was this patch tested?

Existing tests and not-yet-released data tests (SPARK-36989).

Closes #34374 from zero323/SPARK-37104.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
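The invariance-versus-covariance distinction driving this change can be illustrated with plain `typing` constructs. This is a minimal sketch using hypothetical `Box` classes as stand-ins, not PySpark's actual `RDD` definition:

```python
from typing import Generic, TypeVar

# Invariant type variable, like RDD[~T] before this PR.
T = TypeVar("T")
# Covariant type variable, like RDD[+T] after this PR.
T_co = TypeVar("T_co", covariant=True)


class InvariantBox(Generic[T]):
    """Stand-in for the old RDD[~T]: to mypy, InvariantBox[SparseVector]
    is NOT a subtype of InvariantBox[Union[SparseVector, ...]]."""


class CovariantBox(Generic[T_co]):
    """Stand-in for the new RDD[+T]: CovariantBox[SparseVector] IS a
    subtype of CovariantBox[Vector] whenever SparseVector <: Vector,
    so it can be passed where the wider type is expected."""


# Variance is a purely static (type-checker-level) property, but the
# declaration is recorded on the TypeVar object at runtime:
print(T.__covariant__)     # False
print(T_co.__covariant__)  # True
```

Covariance is safe here because these types are read-only in the relevant positions: a consumer that expects elements of the wider type can always handle elements of the narrower one.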
Showing 3 changed files with 88 additions and 82 deletions.