[DO-NOT-MERGE][POC] RDD in Python Spark Connect #55888
Draft
HyukjinKwon wants to merge 1 commit into
dongjoon-hyun (Member) left a comment:
It's a great breakthrough. Thank you, @HyukjinKwon. Looking forward to seeing the final status.
cc @peter-toth
HyukjinKwon (Member, Author) commented:
Tests should pass now. I ran a rough benchmark: it would be 2~4x slower.
What changes were proposed in this pull request?
- `RDD` over pickled SQL / Arrow execution; parity tests.
- `SparkContext` from `SparkSession.sparkContext`; classic facade routing.

Why are the changes needed?
Expose `RDD` and `spark.sparkContext` on PySpark Spark Connect while documenting what stays JVM-only or Connect-specific.

Explicit gaps (Spark Connect)
SparkContext - absent vs classic JVM SparkContext
Missing methods:
- `accumulator`, `binaryRecords`, `broadcast`, `dump_profiles`, `getLocalProperty`, `hadoopFile`, `hadoopRDD`, `newAPIHadoopFile`, `newAPIHadoopRDD`, `runJob`, `sequenceFile`, `setJobDescription`, `setLocalProperty`, `setLogLevel`, `show_profiles`, `statusTracker`

Missing properties:
- `applicationId`, `listArchives`, `listFiles`, `resources`, `uiWebUrl`

SparkContext - rejected, ignored, or warned (implementation)
Constructor parameters:
- `SparkContext(ConnectSparkSession, ...extra ctor kwargs...)` -> `PySparkTypeError` (wrapped Connect session cannot be combined with other constructor arguments).
- `batchSize` -> `UserWarning`, ignored.
- `serializer` when not effectively the passive `CPickleSerializer` default -> `UserWarning`, ignored (`CloudPickleSerializer` fixed path for pickled RDD columns).
- `profiler_cls`, `udf_profiler_cls`, `memory_profiler_cls` -> `UserWarning`, each ignored if passed.

Other runtime behavior:
- `addFile(..., recursive=True)` -> `PySparkNotImplementedError`. Non-recursive `addFile` uses `SparkSession.addArtifacts`; `addPyFile` / `addArchive` similarly where supported.
- `setJobGroup(..., interruptOnCancel=True)` -> `UserWarning`: JVM executor thread interruption is unavailable; cancellation uses Spark Connect tagging / interrupt APIs.
- `setInterruptOnCancel(True)` -> `UserWarning`: same.
- `wholeTextFiles(..., use_unicode=False)` -> `UserWarning`: rows are still decoded as Unicode strings.
- `pickleFile` on Connect `SparkContext` reads output from Connect `RDD.saveAsPickleFile` only and does not interoperate with classic JVM pickle-object files (see method documentation).

RDD - PySparkNotImplementedError
- `toLocalIterator(prefetchPartitions=True)` only (`prefetchPartitions=False` supported).
- `cleanShuffleDependencies`
- `name` / `setName`
- `saveAsHadoopDataset`, `saveAsHadoopFile`, `saveAsNewAPIHadoopDataset`, `saveAsNewAPIHadoopFile`, `saveAsSequenceFile`
- `checkpoint` / `localCheckpoint` (errors direct to `DataFrame.checkpoint` / `DataFrame.localCheckpoint`).

Checkpoint stubs (limited semantics, no error):
- `isCheckpointed` / `isLocallyCheckpointed`: always false.
- `getCheckpointFile`: `None`.

Other parity notes:
- `countApprox` / `sumApprox` / `meanApprox`: not JVM `PartialRDD` streaming semantics; see docstrings.

Does this PR introduce any user-facing change?
Yes. The Connect `RDD` and `SparkContext` surface moves closer to classic PySpark naming and flows; the gaps above remain missing or intentionally different versus JVM Spark.

How was this patch tested?
Reuse of classic mixins and Connect parity subclasses (`ReusedConnectTestCase`), plus artifact and job parity tests.

Connect parity modules:
- `python/pyspark/tests/connect/test_parity_rdd.py`
- `python/pyspark/tests/connect/test_parity_rddbarrier.py`
- `python/pyspark/tests/connect/test_parity_rddsampler.py`
- `python/pyspark/tests/connect/test_parity_serializers.py`
- `python/pyspark/tests/connect/test_parity_shuffle_sort.py`
- `python/pyspark/tests/connect/test_parity_statcounter.py`
- `python/pyspark/tests/connect/test_parity_taskcontext.py`
- `python/pyspark/tests/connect/test_parity_join.py`
- `python/pyspark/tests/connect/test_parity_binary_files.py`
- `python/pyspark/tests/connect/test_parity_spark_context_artifacts.py`

Artifacts / Spark Connect SQL client:
- `python/pyspark/sql/tests/connect/client/test_artifact.py` (includes `ArtifactViaSparkContextCheckMixin`)
- `python/pyspark/sql/tests/connect/test_parity_collection.py`
- `python/pyspark/sql/tests/connect/test_parity_job_cancellation.py`

Was this patch authored or co-authored using generative AI tooling?
No.
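
Illustration only: a minimal, standard-library sketch of the reject / warn-and-ignore behavior listed under "Explicit gaps" for constructor parameters, plus the checkpoint stubs. Everything here is an assumption made for illustration. `SketchConnectContext` and `SketchRDD` are hypothetical stand-ins and not the classes in this PR, the built-in `TypeError` and `NotImplementedError` stand in for `PySparkTypeError` and `PySparkNotImplementedError`, and the `serializer` handling is omitted.

```python
import warnings

# Constructor kwargs the description says are warned about and then ignored.
_IGNORED_CTOR_KWARGS = (
    "batchSize",
    "profiler_cls",
    "udf_profiler_cls",
    "memory_profiler_cls",
)


class SketchConnectContext:
    """Hypothetical stand-in for a Connect-backed SparkContext facade."""

    def __init__(self, session=None, **kwargs):
        passed = {k: v for k, v in kwargs.items() if v is not None}
        if session is not None and passed:
            # Stand-in for PySparkTypeError: a wrapped session cannot be
            # combined with other constructor arguments.
            raise TypeError(
                "wrapped Connect session cannot be combined with other "
                "constructor arguments: " + ", ".join(sorted(passed))
            )
        for name in _IGNORED_CTOR_KWARGS:
            if name in passed:
                warnings.warn(
                    f"{name} is ignored on Spark Connect",
                    UserWarning,
                    stacklevel=2,
                )
        self.session = session
        # Record which arguments were dropped, purely for inspection here.
        self.ignored = sorted(n for n in passed if n in _IGNORED_CTOR_KWARGS)


class SketchRDD:
    """Hypothetical stand-in showing the checkpoint stubs and the error path."""

    def isCheckpointed(self):
        return False  # stub: always false, no error

    def getCheckpointFile(self):
        return None  # stub: no checkpoint file is ever reported

    def checkpoint(self):
        # Stand-in for PySparkNotImplementedError; the message points the
        # caller at the DataFrame API, as the description says.
        raise NotImplementedError("use DataFrame.checkpoint instead")


if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        ctx = SketchConnectContext(batchSize=10)
    print(ctx.ignored, [str(w.message) for w in caught])
```

Under these assumptions, callers that pass a legacy argument such as `batchSize` keep running and get a `UserWarning`, while combining a wrapped session with extra arguments fails immediately.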