
[SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package #45053

Closed
wants to merge 15 commits

Conversation

@HyukjinKwon (Member) commented Feb 7, 2024

What changes were proposed in this pull request?

This PR proposes to release a separate pyspark-connect package; see also SPIP: Pure Python Package in PyPI (Spark Connect).

Today's PySpark package is roughly as follows:

pyspark
├── *.py               # Core / No Spark Connect support
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming

There will be two packages available, pyspark and pyspark-connect.

pyspark

Same as today’s PySpark, but the Core module is factored out to pyspark.core.*; the user-facing interface stays the same at pyspark.* (see the sketch after the tree below).

pyspark
├── core               # Core / No Spark Connect support
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML 
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming
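
For illustration, backward compatibility can be kept with re-exports in pyspark/__init__.py. A minimal sketch, assuming the core modules keep their current names under pyspark.core (the real file re-exports far more):

# pyspark/__init__.py -- minimal illustrative sketch, not the actual file.
# The implementation now lives under pyspark.core; re-exports keep the
# user-facing import path pyspark.* unchanged.
from pyspark.core.conf import SparkConf
from pyspark.core.context import SparkContext
from pyspark.core.rdd import RDD

User code such as from pyspark import SparkContext therefore continues to work unmodified.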

pyspark-connect

This is the package after excluding the modules that do not support Spark Connect, and also excluding the jars (that is, ml is included but without jars); a packaging sketch follows the tree below:

pyspark
├── ml
│   └── connect
├── pandas
└── sql
    └── connect
        └── streaming

Why are the changes needed?

To provide a pure Python library that does not depend on the JVM.

See also SPIP: Pure Python Package in PyPI (Spark Connect).

Does this PR introduce any user-facing change?

Yes, users can install the pure Python library via pip install pyspark-connect.

How was this patch tested?

Manually ran a basic set of tests:

# Start a local Spark Connect server using the freshly built SNAPSHOT jar.
./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
# Build the pyspark-connect source distribution.
cd python
python packaging/connect/setup.py sdist
# Install it into a clean Python 3.11 environment.
cd dist
conda create -y -n clean-py-3.11 python=3.11
conda activate clean-py-3.11
pip install pyspark-connect-4.0.0.dev0.tar.gz
python
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>>> spark.range(10).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

Tests will be separately added and set up as a scheduled job in CI.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package [SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package Apr 2, 2024
@HyukjinKwon (Member, Author) commented:

cc @zhengruifeng @grundprinzip @ueshin @hvanhovell @itholic @WeichenXu123 @mengxr @allisonwang-db @xinrong-meng @gatorsmile @cloud-fan This is ready for a look (before merging, we should wait one more day for the SPIP vote to pass, though).

@HyukjinKwon (Member, Author) commented:

I restored the references for our internal API. Explicitly private attributes starting with _ are not restored.
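
Illustratively, a restored reference means a non-private internal name remains importable from both its old and new locations; a hypothetical REPL session (the exact aliases are an assumption):

>>> from pyspark import RDD                      # old user-facing path, restored
>>> from pyspark.core.rdd import RDD as CoreRDD  # new internal location
>>> RDD is CoreRDD                               # same object, re-exported
True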

@HyukjinKwon (Member, Author) commented:

Merged to master.

HyukjinKwon added a commit that referenced this pull request May 2, 2024
…spark-connect` package

### What changes were proposed in this pull request?

This PR is a follow-up of #45053, which unintentionally includes `lib/py4j*zip` in the package. The zip is currently picked up by https://github.com/apache/spark/blob/master/python/MANIFEST.in#L26. For other files, we do not create a `deps` directory in `setup.py` for `pyspark-connect`, so they are not included; `lib`, however, is.

### Why are the changes needed?

To exclude unrelated files.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

Manually packaged, and checked the contents via `vi`.
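
As a complement to the manual inspection, a small Python check could verify the sdist contents; a sketch, with the tarball name taken from the test steps in #45053:

# Sketch: confirm the built sdist no longer bundles lib/py4j*zip.
import tarfile
with tarfile.open("pyspark-connect-4.0.0.dev0.tar.gz") as tf:
    leftovers = [name for name in tf.getnames() if "lib/py4j" in name]
print(leftovers)  # expected to be empty after this follow-up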

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46331 from HyukjinKwon/SPARK-47683-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…spark-connect` package