
[DO-NOT-MERGE][POC] RDD in Python Spark Connect#55888

Draft
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:connect-rdd



@HyukjinKwon HyukjinKwon commented May 14, 2026

What changes were proposed in this pull request?

  • RDD: classic vs Connect packages plus facade routing; Connect-backed RDD over pickled SQL / Arrow execution; parity tests.
  • SparkContext: Connect session-scoped SparkContext from SparkSession.sparkContext; classic facade routing.

Why are the changes needed?

Expose RDD and spark.sparkContext on PySpark Spark Connect while documenting what stays JVM-only or Connect-specific.

Explicit gaps (Spark Connect)

SparkContext - absent vs classic JVM SparkContext

Missing methods:

  • accumulator
  • binaryRecords
  • broadcast
  • dump_profiles
  • getLocalProperty
  • hadoopFile
  • hadoopRDD
  • newAPIHadoopFile
  • newAPIHadoopRDD
  • runJob
  • sequenceFile
  • setJobDescription
  • setLocalProperty
  • setLogLevel
  • show_profiles
  • statusTracker

Missing properties:

  • applicationId
  • listArchives
  • listFiles
  • resources
  • uiWebUrl
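
Code that has to run against both classic PySpark and Spark Connect could probe for these members before using them. The following is a minimal sketch and is not part of this PR: `call_if_present` and `_StubConnectContext` are hypothetical names, and the sketch assumes the members listed above are simply absent on the Connect SparkContext (so attribute lookup finds nothing) rather than present and raising.

```python
from typing import Any

_MISSING = object()  # sentinel: distinguishes "attribute absent" from "attribute is None"


def call_if_present(obj: Any, method: str, *args: Any, default: Any = None, **kwargs: Any) -> Any:
    """Call obj.<method>(*args, **kwargs) if it exists and is callable, else return default."""
    fn = getattr(obj, method, _MISSING)
    if fn is _MISSING or not callable(fn):
        return default
    return fn(*args, **kwargs)


class _StubConnectContext:
    """Hypothetical stand-in for a Connect SparkContext; it has no setLogLevel."""

    def parallelize(self, data):
        return list(data)


sc = _StubConnectContext()
# setLogLevel is in the "missing methods" list above, so the default is returned.
level_result = call_if_present(sc, "setLogLevel", "WARN", default="skipped")
# parallelize exists on the stub, so it is called normally.
data = call_if_present(sc, "parallelize", range(3))
```

With a real session, `sc` would be `spark.sparkContext`; the stub only keeps the sketch runnable without a Spark installation.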

SparkContext - rejected, ignored, or warned (implementation)

Constructor parameters:

  • SparkContext(ConnectSparkSession, ...extra ctor kwargs...) -> PySparkTypeError (wrapped Connect session cannot be combined with other constructor arguments).
  • batchSize -> UserWarning, ignored.
  • serializer, when set to anything other than what is effectively the default CPickleSerializer -> UserWarning, ignored (pickled RDD columns always go through the fixed CloudPickleSerializer path).
  • profiler_cls, udf_profiler_cls, memory_profiler_cls -> UserWarning, each ignored if passed.
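
Because the ignored parameters only surface as UserWarning, a caller may want to capture those warnings explicitly rather than let them scroll past. A minimal sketch using the standard `warnings` module follows; `call_recording_user_warnings` and `_stub_context_factory` are hypothetical, and the warning text in the stub is invented for illustration, not the message emitted by this PR.

```python
import warnings
from typing import Any, Callable, List, Tuple


def call_recording_user_warnings(
    fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Any, List[str]]:
    """Run fn and return (result, messages of every UserWarning it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let default filters drop repeats
        result = fn(*args, **kwargs)
    messages = [str(w.message) for w in caught if issubclass(w.category, UserWarning)]
    return result, messages


def _stub_context_factory(batchSize: int = 0) -> str:
    """Hypothetical stand-in for constructing a Connect SparkContext."""
    if batchSize:
        # Invented message text; the real wording may differ.
        warnings.warn("batchSize is ignored on Spark Connect", UserWarning)
    return "context"


ctx, ignored = call_recording_user_warnings(_stub_context_factory, batchSize=10)
```

In a test suite, the same effect can be had more strictly with `warnings.simplefilter("error", UserWarning)`, which turns any ignored parameter into a failure.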

Other runtime behavior:

| Item | Behavior |
| --- | --- |
| addFile(..., recursive=True) | PySparkNotImplementedError. Non-recursive addFile uses SparkSession.addArtifacts; addPyFile / addArchive similarly where supported. |
| setJobGroup(..., interruptOnCancel=True) | UserWarning - JVM executor thread interruption is unavailable; cancellation uses Spark Connect tagging / interrupt APIs. |
| setInterruptOnCancel(True) | UserWarning - same. |
| wholeTextFiles(..., use_unicode=False) | UserWarning - rows are still decoded as Unicode strings. |
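
Since recursive addFile is rejected, one possible workaround is to pack a directory into a single archive and ship that instead. The sketch below is an assumption, not something this PR prescribes: `archive_directory` is a hypothetical helper built on the standard library, and whether addArchive accepts the result on a given Connect deployment is not confirmed here. Note that an archive is unpacked differently from a recursively added directory, so paths on the executor side may not match.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def archive_directory(src_dir: str, out_dir: str, name: str = "bundle") -> str:
    """Zip src_dir recursively into out_dir/<name>.zip and return the archive path."""
    base = str(Path(out_dir) / name)
    return shutil.make_archive(base, "zip", root_dir=src_dir)


with tempfile.TemporaryDirectory() as work:
    # Build a small nested directory to stand in for the files to distribute.
    src = Path(work) / "src"
    (src / "nested").mkdir(parents=True)
    (src / "a.txt").write_text("a")
    (src / "nested" / "b.txt").write_text("b")

    out = Path(work) / "out"  # kept outside src so the archive does not include itself
    out.mkdir()

    archive_path = archive_directory(str(src), str(out))
    with zipfile.ZipFile(archive_path) as zf:
        # Skip explicit directory entries; keep only files.
        archived_names = sorted(n for n in zf.namelist() if not n.endswith("/"))
    # With a real session one might then call (assumed usage): sc.addArchive(archive_path)
```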

pickleFile on Connect SparkContext reads output from Connect RDD.saveAsPickleFile only and does not interoperate with classic JVM pickle-object files (see method documentation).

RDD - PySparkNotImplementedError

  • toLocalIterator(prefetchPartitions=True) only; prefetchPartitions=False is supported.
  • cleanShuffleDependencies
  • name / setName
  • saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile, saveAsSequenceFile
  • checkpoint / localCheckpoint (the error messages point to DataFrame.checkpoint / DataFrame.localCheckpoint instead).

Checkpoint stubs (limited semantics, no error):

  • isCheckpointed / isLocallyCheckpointed: always False.
  • getCheckpointFile: None.

Other parity notes:

  • countApprox / sumApprox / meanApprox: not JVM PartialRDD streaming semantics; see docstrings.

Does this PR introduce any user-facing change?

Yes - Connect RDD and SparkContext surface moves closer to classic PySpark naming and flows; gaps above remain missing or intentionally different versus JVM Spark.

How was this patch tested?

Reuse of classic mixins and Connect parity subclasses (ReusedConnectTestCase), plus artifact and job parity tests.

Connect parity modules:

  • python/pyspark/tests/connect/test_parity_rdd.py
  • python/pyspark/tests/connect/test_parity_rddbarrier.py
  • python/pyspark/tests/connect/test_parity_rddsampler.py
  • python/pyspark/tests/connect/test_parity_serializers.py
  • python/pyspark/tests/connect/test_parity_shuffle_sort.py
  • python/pyspark/tests/connect/test_parity_statcounter.py
  • python/pyspark/tests/connect/test_parity_taskcontext.py
  • python/pyspark/tests/connect/test_parity_join.py
  • python/pyspark/tests/connect/test_parity_binary_files.py
  • python/pyspark/tests/connect/test_parity_spark_context_artifacts.py

Artifacts / Spark Connect SQL client:

  • python/pyspark/sql/tests/connect/client/test_artifact.py (includes ArtifactViaSparkContextCheckMixin)
  • python/pyspark/sql/tests/connect/test_parity_collection.py
  • python/pyspark/sql/tests/connect/test_parity_job_cancellation.py
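
The mixin-plus-parity-subclass layout described above can be sketched with plain `unittest`. This is an illustration of the pattern only: the class names are hypothetical, and the two `make_values` backends are stand-ins for what would be a classic session and a ReusedConnectTestCase-provided Connect session in the real modules.

```python
import io
import unittest


class SumTestsMixin:
    """Backend-agnostic tests; subclasses supply make_values()."""

    def make_values(self, data):
        raise NotImplementedError

    def test_sum(self):
        self.assertEqual(sum(self.make_values([1, 2, 3])), 6)

    def test_empty(self):
        self.assertEqual(list(self.make_values([])), [])


class ClassicSumTests(SumTestsMixin, unittest.TestCase):
    # Stand-in for the classic backend.
    def make_values(self, data):
        return list(data)


class ConnectParitySumTests(SumTestsMixin, unittest.TestCase):
    # Stand-in for the Connect backend; inherits every test from the mixin unchanged.
    def make_values(self, data):
        return tuple(data)


def run_parity_suite() -> unittest.TestResult:
    """Run the same mixin tests against both stand-in backends."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (ClassicSumTests, ConnectParitySumTests):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_parity_suite()
```

The point of the layout is that a parity subclass adds no test bodies of its own, so any behavioral difference between backends shows up as a failure in a shared test.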

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon marked this pull request as draft May 14, 2026 21:42
@HyukjinKwon HyukjinKwon force-pushed the connect-rdd branch 29 times, most recently from 4aa36ac to 9d21c2a Compare May 19, 2026 07:00
@HyukjinKwon HyukjinKwon force-pushed the connect-rdd branch 2 times, most recently from d6e9e1b to ec6fca5 Compare May 19, 2026 10:07

@dongjoon-hyun dongjoon-hyun left a comment


It's a great breakthrough. Thank you, @HyukjinKwon. Looking forward to seeing the final status.

cc @peter-toth

@HyukjinKwon (Member, Author) commented

Tests should pass now. I ran a rough benchmark; it would be about 2-4x slower.
