
[SPARK-47909][PYTHON][CONNECT] Parent DataFrame class for Spark Connect and Spark Classic #46129

Closed
wants to merge 1 commit into master from HyukjinKwon/SPARK-47909

Conversation

@HyukjinKwon (Member) commented Apr 19, 2024

What changes were proposed in this pull request?

This PR proposes to have a parent pyspark.sql.DataFrame class which pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.classic.dataframe.DataFrame inherit.

Note that, for backward compatibility, pyspark.sql.DataFrame(...) will still return a Spark Classic DataFrame.

Before

  1. pyspark.sql.DataFrame (Spark Classic)

    • docstrings
    • Spark Classic logic
  2. pyspark.sql.connect.dataframe.DataFrame (Spark Connect)

    • Spark Connect logic
  3. Users can only see the type hints from pyspark.sql.DataFrame.

After

  1. pyspark.sql.DataFrame (Common)

    • docstrings
    • Support classmethod usages (dispatch to either Spark Connect or Spark Classic; see the sketch after this list)
  2. pyspark.sql.classic.dataframe.DataFrame (Spark Classic)

    • Spark Classic logic
  3. pyspark.sql.connect.dataframe.DataFrame (Spark Connect)

    • Spark Connect logic
  4. Users can only see the type hints from pyspark.sql.DataFrame.
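Below is a minimal, self-contained sketch of this layered structure. It is illustrative only: the class names `ClassicDataFrame` and `ConnectDataFrame` and the dispatch mechanics are simplified stand-ins, not the actual PySpark implementation.

```python
class DataFrame:
    """Parent class: docstrings and user-facing type hints live here."""

    def __new__(cls, *args, **kwargs):
        if cls is DataFrame:
            # Backward compatibility: constructing the parent directly
            # still yields a Spark Classic DataFrame.
            return super().__new__(ClassicDataFrame)
        return super().__new__(cls)

    def union(self, other: "DataFrame") -> "DataFrame":
        # The shared docstring would live here; dispatching on the runtime
        # type lets unbound calls like DataFrame.union(df, df) reach the
        # concrete subclass implementation.
        return type(self).union(self, other)


class ClassicDataFrame(DataFrame):
    def union(self, other: "DataFrame") -> "DataFrame":
        return self  # JVM-backed Spark Classic logic would go here


class ConnectDataFrame(DataFrame):
    def union(self, other: "DataFrame") -> "DataFrame":
        return self  # Spark Connect logic would go here


df = ConnectDataFrame()
assert isinstance(df, DataFrame)              # isinstance against the parent works
assert DataFrame.union(df, df) is df          # dispatches to ConnectDataFrame.union
assert type(DataFrame()) is ClassicDataFrame  # parent constructor -> Classic
```

Dispatching on `type(self)` inside the parent's method body is what makes the unbound `DataFrame.union(df, df)` call shown in the next section work for both backends.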

Why are the changes needed?

This fixes two issues in the current Spark Connect structure:

First, it supports the usage of regular methods as class methods, e.g.,

```python
from pyspark.sql import DataFrame
df = spark.range(10)
DataFrame.union(df, df)
```

Before

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 4809, in union
    return DataFrame(self._jdf.union(other._jdf), self.sparkSession)
                     ^^^^^^^^^
  File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1724, in __getattr__
    raise PySparkAttributeError(
pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
```

After

```
DataFrame[id: bigint]
```

Second, it supports isinstance calls:

```python
from pyspark.sql import DataFrame
isinstance(spark.range(1), DataFrame)
```

Before

```
False
```

After

```
True
```

Does this PR introduce any user-facing change?

Yes, as described above.

How was this patch tested?

Manually tested, and CI should verify them.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member, Author)

```python
def explain(
    self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None
) -> None:
    if extended is not None and mode is not None:
```
Contributor:

we can move such complex preprocessing to the superclasses later
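
For illustration, one shape that hoisting could take: the parent performs the shared argument validation once and delegates to a small hook. This is a hedged sketch, not the actual PySpark code; the `_explain` hook name is hypothetical.

```python
from typing import Optional, Union


class DataFrame:
    def explain(
        self, extended: Optional[Union[bool, str]] = None, mode: Optional[str] = None
    ) -> None:
        # Shared validation/normalization, done once in the parent.
        if extended is not None and mode is not None:
            raise ValueError("extended and mode should not be set together.")
        self._explain(extended=extended, mode=mode)

    def _explain(
        self, extended: Optional[Union[bool, str]], mode: Optional[str]
    ) -> None:
        raise NotImplementedError  # subclass-specific plan rendering
```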

python/pyspark/sql/classic/dataframe.py (resolved)
@HyukjinKwon (Member, Author):

Will fix up the tests soon.

python/pyspark/sql/dataframe.py (outdated; resolved)
python/pyspark/sql/utils.py (resolved)
```diff
@@ -325,7 +325,7 @@ def active(cls) -> "SparkSession":

     active.__doc__ = PySparkSession.active.__doc__

-    def table(self, tableName: str) -> DataFrame:
+    def table(self, tableName: str) -> ParentDataFrame:
```
Member:

I guess we can leave it as-is? And the following changes?

@HyukjinKwon (Member, Author):

This was the way MyPy complained least, IIRC. Let me take a look again...

@HyukjinKwon (Member, Author):

It seems the arguments cannot have a more specific type, and the return types can't be wider (https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides). So it complains about the argument.

Let me just keep them all as the parent DataFrame for simplicity, because those types aren't user-facing anyway.

@HyukjinKwon (Member, Author):

Here is one example of the error:

```
python/pyspark/sql/classic/dataframe.py:276: error: Argument 1 of "exceptAll" is incompatible with supertype "DataFrame"; supertype defines the argument type as "DataFrame"  [override]
```
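
A standalone reproduction of that rule, with hypothetical `Parent`/`Child` names: mypy rejects an override that narrows an argument type (arguments are contravariant), while the narrower return type is accepted (returns are covariant).

```python
class Parent:
    def exceptAll(self, other: "Parent") -> "Parent":
        return self


class Child(Parent):
    # mypy: error: Argument 1 of "exceptAll" is incompatible with supertype
    # "Parent"; supertype defines the argument type as "Parent"  [override]
    def exceptAll(self, other: "Child") -> "Child":  # narrower argument: rejected
        return self
```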

python/pyspark/sql/connect/dataframe.py (resolved)
python/pyspark/sql/classic/dataframe.py (resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
python/pyspark/sql/classic/dataframe.py (outdated; resolved)
Comment on lines 507 to 509:

```python
@overload
def repartition(self, numPartitions: int, *cols: "ColumnOrName") -> "ParentDataFrame":
    ...
```
Member:

I'm wondering if we need @overload definitions in the subclasses?

@HyukjinKwon (Member, Author) commented Apr 20, 2024:

I initially added them, then removed them because MyPy complained too much. I will take another look.

@HyukjinKwon (Member, Author):

It seems that, by rights, we should redefine the overloads here (python/mypy#5146, python/mypy#10699). However, we're using the pyspark.sql.DataFrame type hints even within our own codebase, so I think it's better not to define them here for now.
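
A tiny illustration of the trade-off, with simplified signatures (hypothetical, not the actual PySpark stubs): a subclass overriding an overloaded method with a single compatible implementation is accepted by mypy, but type checking against the subclass then sees only that one signature unless the `@overload` stubs are restated.

```python
from typing import Union, overload


class Parent:
    @overload
    def repartition(self, numPartitions: int) -> "Parent": ...
    @overload
    def repartition(self, numPartitions: str) -> "Parent": ...

    def repartition(self, numPartitions: Union[int, str]) -> "Parent":
        return self


class Child(Parent):
    # Compatible override, but the parent's @overload stubs are not
    # inherited for lookups on Child (python/mypy#5146, python/mypy#10699).
    def repartition(self, numPartitions: Union[int, str]) -> "Parent":
        return self
```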

@HyukjinKwon (Member, Author) commented Apr 21, 2024

Should be ready for a look. All tests passed. I squashed/rebased the commits.

```
@@ -333,6 +333,11 @@ def test_observe(self):
        from pyspark.sql.connect.observation import Observation

        class MockDF(DataFrame):
```
@HyukjinKwon (Member, Author):

This might be a breaking change if somebody previously inherited from pyspark.sql.DataFrame and defined their own __init__. However, __init__ is not really an API, and users shouldn't customize, use, or invoke it directly.

@HyukjinKwon (Member, Author):

Merged to master.

I will follow up if there are more comments to address.

@dongjoon-hyun (Member) left a comment:

Can we have a different name than classic?

@dongjoon-hyun (Member):

classic sounds like too limited a wording, because it has no clear meaning and is not extensible from a long-term perspective.

HyukjinKwon added a commit that referenced this pull request Apr 23, 2024
…c` references

### What changes were proposed in this pull request?

This PR is a followup of #46129 that moves `pyspark.classic` references into the actual test methods so they are not referenced during the `pyspark-connect`-only test (which does not have the `pyspark.classic` package).

### Why are the changes needed?

To recover the CI: https://github.com/apache/spark/actions/runs/8789489804/job/24119356874

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46171 from HyukjinKwon/SPARK-47909-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Apr 23, 2024
… Classic

### What changes were proposed in this pull request?

Same as #46129 but for `Column` class.

### Why are the changes needed?

Same as #46129

### Does this PR introduce _any_ user-facing change?

Same as #46129

### How was this patch tested?

Manually tested, and CI should verify them.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46155 from HyukjinKwon/SPARK-47933.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ianmcook added a commit to ianmcook/spark that referenced this pull request May 9, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…ct and Spark Classic

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…c` references

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… Classic
HyukjinKwon pushed a commit that referenced this pull request Jun 3, 2024
…and Spark Classic

### What changes were proposed in this pull request?
 Parent Window class for Spark Connect and Spark Classic

### Why are the changes needed?
Same as #46129

### Does this PR introduce _any_ user-facing change?
Same as #46129

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes #46841 from zhengruifeng/py_parent_window.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…and Spark Classic