[SPARK-22340][PYTHON] Save localProperties in thread.local #24705
Conversation
add to whitelist

CC @ueshin

Test build #105775 has finished for PR 24705 at commit
I added a few comments, but I'm not familiar with this part of the codebase. My main question: there are other methods in self.ctx._jvm.PythonRDD which get called from rdd.py. Will those still behave the same way with the changes in context.py, or will they no longer have the expected jobGroup, local properties, etc.? If the behavior has changed, what's the best way to keep it consistent? CC @ueshin?

Also, Takuya, do you want to use SPARK-22340 for this or a new JIRA? Thank you!!
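For context, here are a couple of other call sites of that kind, paraphrased from rdd.py/context.py of this era (exact signatures may differ); both enter the JVM through PythonRDD and raise the same question about local-property propagation:

    # Paraphrased call sites (not exact quotes from this PR's diff):
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
    sock_info = self.ctx._jvm.PythonRDD.toLocalIteratorAndServe(self._jrdd.rdd())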
@@ -159,7 +159,12 @@ private[spark] object PythonRDD extends Logging {
    * @return 2-tuple (as a Java array) with the port number of a local socket which serves the
    *         data collected from this job, and the secret for authentication.
    */
-  def collectAndServe[T](rdd: RDD[T]): Array[Any] = {
+  def collectAndServe[T](
I'll let @ueshin judge, but my guess is that, even though this is a "private" API, we'll want to add a new collectAndServe with the two arguments, leaving the old one in place in case third-party libraries use the private API.
run_job(group_A_name, 0)
self.assertFalse(is_job_cancelled[0], "job didn't succeed.")

for i in range(num_threads):
I'd explain what this is testing in a comment.
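For instance, a hypothetical wording for such a comment (mine, not the author's):

    # Each worker thread sets its own job group via setJobGroup(); because
    # local properties now live in threading.local(), cancelling one group
    # must not cancel jobs submitted from the other threads.
    run_job(group_A_name, 0)
    self.assertFalse(is_job_cancelled[0], "job didn't succeed.")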
I think it would be better to just clarify the limitation rather than re-implementing the get/set local property logic on the Python side.
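For reference, a minimal sketch of the limitation under discussion (my illustration, assuming the pre-PR behavior where Python threads can share JVM-side thread state through Py4J):

    import threading
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def run(pool):
        # Each Python thread intends its own scheduler pool, but without
        # per-thread bookkeeping on the Python side the property may land
        # on a shared (or wrong) JVM thread and leak between Python threads.
        sc.setLocalProperty("spark.scheduler.pool", pool)
        sc.parallelize(range(10)).count()

    threads = [threading.Thread(target=run, args=("pool_%d" % i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()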
python/pyspark/rdd.py
Outdated
java_map = MapConverter().convert(self.context.getLocalProperties(),
                                  self.context._gateway._gateway_client)
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(
    self._jrdd.rdd(), java_map)
Would that work if we use a UDF + count and use TaskContext's local property access?
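A rough sketch of how I read that suggestion (my interpretation; assumes an active spark session and that TaskContext.get().getLocalProperty is available inside tasks):

    from pyspark import TaskContext
    from pyspark.sql.functions import udf, col

    spark.sparkContext.setLocalProperty("test_key", "test_value")

    @udf("string")
    def read_prop(_):
        # Runs inside a task on the executor and reads the job's local property.
        return TaskContext.get().getLocalProperty("test_key")

    # Filtering on the UDF output forces it to actually run; count() is the action.
    assert spark.range(10).filter(read_prop(col("id")) == "test_value").count() == 10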
python/pyspark/rdd.py
Outdated
java_map = MapConverter().convert(self.context.getLocalProperties(),
                                  self.context._gateway._gateway_client)
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(
    self._jrdd.rdd(), java_map)
There are multiple actions like toPandas. If these changes are needed for all of them, it sounds like we're re-implementing the local property logic on the Python side.
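To make the concern concrete: every action would otherwise repeat the same conversion. A hypothetical shared helper (my sketch, building on the getLocalProperties accessor this PR adds) would at least centralize it:

    from py4j.java_collections import MapConverter

    def _local_properties_java_map(ctx):
        # Hypothetical helper: convert the Python-side thread-local properties
        # (via the getLocalProperties accessor added in this PR) into a Java map
        # that JVM entry points like collectAndServe can apply before the job runs.
        return MapConverter().convert(ctx.getLocalProperties(),
                                      ctx._gateway._gateway_client)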
Test build #105878 has finished for PR 24705 at commit
Closing this PR for now. We will design this more carefully.
What changes were proposed in this pull request?

Store local properties in threading.local() in SparkContext, so that setLocalProperty, setJobGroup, and setJobDescription take effect per Python thread. When an action such as collect runs, the calling thread's properties are read with getLocalProperties and applied on the JVM side via setLocalProperty in collectAndServe.
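A minimal sketch of the described approach (my illustration, not the PR's actual code; the method names mirror the ones mentioned above):

    import threading

    class _ThreadLocalProperties(object):
        """Sketch: per-thread local properties, as described above."""

        def __init__(self):
            self._local = threading.local()

        def setLocalProperty(self, key, value):
            if not hasattr(self._local, "properties"):
                self._local.properties = {}
            self._local.properties[key] = value

        def getLocalProperties(self):
            # Snapshot handed to the JVM (e.g. by collectAndServe) at job submission.
            return dict(getattr(self._local, "properties", {}))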
How was this patch tested?

Added a unit test in test_rdd.py.
Please review https://spark.apache.org/contributing.html before opening a pull request.