
[SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic #46129

Closed
wants to merge 1 commit into master from HyukjinKwon/SPARK-47909

Conversation

@HyukjinKwon (Member) commented Apr 19, 2024

What changes were proposed in this pull request?

This PR proposes to have a parent pyspark.sql.DataFrame class which pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.classic.dataframe.DataFrame inherit.

Note that, for backward compatibility, pyspark.sql.DataFrame(...) will still return a Spark Classic DataFrame.

Before

  1. pyspark.sql.DataFrame (Spark Classic)

    • docstrings
    • Spark Classic logic
  2. pyspark.sql.connect.dataframe.DataFrame (Spark Connect)

    • Spark Connect logic
  3. Users can only see the type hints from pyspark.sql.DataFrame.

After

  1. pyspark.sql.DataFrame (Common)

    • docstrings
    • Support classmethod usages (dispatch to either Spark Connect or Spark Classic; see the sketch after this list)
  2. pyspark.sql.classic.dataframe.DataFrame (Spark Classic)

    • Spark Classic logic
  3. pyspark.sql.connect.dataframe.DataFrame (Spark Connect)

    • Spark Connect logic
  4. Users can only see the type hints from pyspark.sql.DataFrame.
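Below is a minimal, self-contained sketch of this layered structure. It is illustrative only: the class names `ClassicDataFrame` and `ConnectDataFrame` and the dispatch mechanics are simplified stand-ins, not the actual PySpark implementation.

```python
class DataFrame:
    """Parent class: docstrings and user-facing type hints live here."""

    def __new__(cls, *args, **kwargs):
        if cls is DataFrame:
            # Backward compatibility: constructing the parent directly
            # still yields a Spark Classic DataFrame.
            return super().__new__(ClassicDataFrame)
        return super().__new__(cls)

    def union(self, other: "DataFrame") -> "DataFrame":
        # The shared docstring would live here; dispatching on the runtime
        # type lets unbound calls like DataFrame.union(df, df) reach the
        # concrete subclass implementation.
        return type(self).union(self, other)


class ClassicDataFrame(DataFrame):
    def union(self, other: "DataFrame") -> "DataFrame":
        return self  # JVM-backed Spark Classic logic would go here


class ConnectDataFrame(DataFrame):
    def union(self, other: "DataFrame") -> "DataFrame":
        return self  # Spark Connect logic would go here


df = ConnectDataFrame()
assert isinstance(df, DataFrame)              # isinstance against the parent works
assert DataFrame.union(df, df) is df          # dispatches to ConnectDataFrame.union
assert type(DataFrame()) is ClassicDataFrame  # parent constructor -> Classic
```

Dispatching on `type(self)` inside the parent's method body is what makes the unbound `DataFrame.union(df, df)` call shown in the next section work for both backends.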

Why are the changes needed?

This fixes two issues in the current Spark Connect structure:

First, it supports the usage of regular methods as class methods, e.g.,

```python
from pyspark.sql import DataFrame
df = spark.range(10)
DataFrame.union(df, df)
```

Before

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 4809, in union
    return DataFrame(self._jdf.union(other._jdf), self.sparkSession)
                     ^^^^^^^^^
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1724, in __getattr__
    raise PySparkAttributeError(
pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
```

After

```
DataFrame[id: bigint]
```

Second, it supports isinstance calls:

```python
from pyspark.sql import DataFrame
isinstance(spark.range(1), DataFrame)
```

Before

```
False
```

After

```
True
```

Does this PR introduce any user-facing change?

Yes, as described above.

How was this patch tested?

Manually tested, and CI should verify them.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member, Author)

```python
def explain(
    self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None
) -> None:
    if extended is not None and mode is not None:
```
Contributor:

we can move such complex preprocessing to the superclasses later
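
For illustration, one shape that hoisting could take: the parent performs the shared argument validation once and delegates to a small hook. This is a hedged sketch, not the actual PySpark code; the `_explain` hook name is hypothetical.

```python
from typing import Optional, Union


class DataFrame:
    def explain(
        self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None
    ) -> None:
        # Shared validation/normalization, done once in the parent.
        if extended is not None and mode is not None:
            raise ValueError("extended and mode should not be set together.")
        self._explain(extended=extended, mode=mode)

    def _explain(
        self, extended: Optional[Union[bool, str]], mode: Optional[str]
    ) -> None:
        raise NotImplementedError  # subclass-specific plan rendering
```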

python/pyspark/sql/classic/dataframe.py (resolved)
@HyukjinKwon (Member, Author):

Will fix up the tests soon.

python/pyspark/sql/dataframe.py (outdated; resolved)
python/pyspark/sql/utils.py (resolved)
```diff
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":

     active.__doc__ = PySparkSession.active.__doc__

-    def table(self, tableName: str) -> DataFrame:
+    def table(self, tableName: str) -> ParentDataFrame:
```
Member:

I guess we can leave it as-is? And the following changes?

@HyukjinKwon (Member, Author):

This was the way MyPy complained least, IIRC. Let me take a look again...

@HyukjinKwon (Member, Author):

It seems the arguments cannot have a more specific type, and the return types can't be wider (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides). So it complains about the argument.

Let me just keep them all as the parent DataFrame for simplicity, because those types aren't user-facing anyway.

@HyukjinKwon (Member, Author):

Here is one example of the error:

```
python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame"  [override]
```
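
A standalone reproduction of that rule, with hypothetical `Parent`/`Child` names: mypy rejects an override that narrows an argument type (arguments are contravariant), while the narrower return type is accepted (returns are covariant).

```python
class Parent:
    def exceptAll(self, other: "Parent") -> "Parent":
        return self


class Child(Parent):
    # mypy: error: Argument 1 of "exceptAll" is incompatible with supertype
    # "Parent"; supertype defines the argument type as "Parent"  [override]
    def exceptAll(self, other: "Child") -> "Child":  # narrower argument: rejected
        return self
```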

python/pyspark/sql/connect/dataframe.py (resolved)
python/pyspark/sql/classic/dataframe.py (resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
Comment on lines 507 to 509:

```python
@overload
def repartition(self, numPartitions: int, *cols: "ColumnOrName") -> "ParentDataFrame":
    ...
```
Member:

I'm wondering if we need @overload definitions in the subclasses?

@HyukjinKwon (Member, Author) commented Apr 20, 2024:

I initially added them, then removed them because MyPy complained too much. I will take another look.

@HyukjinKwon (Member, Author):

It seems that, by rights, we should redefine the overloads here (python/mypy#5146, python/mypy#10699). However, we're using the pyspark.sql.DataFrame type hints even within our own codebase, so I think it's better not to define them here for now.
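
A tiny illustration of the trade-off, with simplified signatures (hypothetical, not the actual PySpark stubs): a subclass overriding an overloaded method with a single compatible implementation is accepted by mypy, but type checking against the subclass then sees only that one signature unless the `@overload` stubs are restated.

```python
from typing import Union, overload


class Parent:
    @overload
    def repartition(self, numPartitions: int) -> "Parent": ...
    @overload
    def repartition(self, numPartitions: str) -> "Parent": ...

    def repartition(self, numPartitions: Union[int, str]) -> "Parent":
        return self


class Child(Parent):
    # Compatible override, but the parent's @overload stubs are not
    # inherited for lookups on Child (python/mypy#5146, python/mypy#10699).
    def repartition(self, numPartitions: Union[int, str]) -> "Parent":
        return self
```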

@HyukjinKwon (Member, Author) commented Apr 21, 2024

Should be ready for a look. All tests passed. I squashed/rebased the commits.

```
@@ -333,6 +333,11 @@ def test_observe(self):
        from pyspark.sql.connect.observation import Observation

        class MockDF(DataFrame):
```
@HyukjinKwon (Member, Author):

This might be a breaking change if somebody previously inherited from pyspark.sql.DataFrame and defined their own __init__. However, __init__ is not really an API, and users shouldn't customize, use, or invoke it directly.

@HyukjinKwon (Member, Author):

Merged to master.

I will follow up if there are more comments to address.

@dongjoon-hyun (Member) left a comment:

Can we have a different name than classic?

@dongjoon-hyun (Member):

classic sounds like too limited a wording, because it has no clear meaning and is not extensible from a long-term perspective.

HyukjinKwon added a commit that referenced this pull request Apr 23, 2024
…c` references

### What changes were proposed in this pull request?

This PR is a followup of #46129 that moves `pyspark.classic` references into the actual test methods so they are not referenced during the `pyspark-connect`-only test (which does not have the `pyspark.classic` package).

### Why are the changes needed?

To recover the CI: https://github.com/apache/spark/actions/runs/8789489804/job/24119356874

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46171 from HyukjinKwon/SPARK-47909-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Apr 23, 2024
… Classic

### What changes were proposed in this pull request?

Same as #46129 but for `Column` class.

### Why are the changes needed?

Same as #46129

### Does this PR introduce _any_ user-facing change?

Same as #46129

### How was this patch tested?

Manually tested, and CI should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46155 from HyukjinKwon/SPARK-47933.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ianmcook added a commit to ianmcook/spark that referenced this pull request May 9, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ct and Spark Classic

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…c` references

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… Classic
HyukjinKwon pushed a commit that referenced this pull request Jun 3, 2024
…and Spark Classic

### What changes were proposed in this pull request?
 Parent Window class for Spark Connect and Spark Classic

### Why are the changes needed?
Same as #46129

### Does this PR introduce _any_ user-facing change?
Same as #46129

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #46841 from zhengruifeng/py_parent_window.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…and Spark Classic