Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-22340][PYTHON] Save localProperties in thread.local #24705

Closed
wants to merge 3 commits into from

Conversation

lu-wang-dl
Copy link
Contributor

What changes were proposed in this pull request?

  • Add threading.local() in SparkContext
  • Save the value to thread local variables when calling setLocalProperty, setJobGroup, and setJobDescription
  • Add new API getLocalProperties
  • Send local properties to jvm side when calling collect
  • On jvm side setLocalProperty in collectAndServe

How was this patch tested?

Add one unit test in test_rdd.py

Please review https://spark.apache.org/contributing.html before opening a pull request.

@jkbradley
Copy link
Member

add to whitelist

@jkbradley
Copy link
Member

CC @ueshin

@SparkQA
Copy link

SparkQA commented May 24, 2019

Test build #105775 has finished for PR 24705 at commit 937a4a6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments, but I'm not familiar with this part of the codebase. My main question is: There are other methods in self.ctx._jvm.PythonRDD which get called from rdd.py. Will those still behave the same way with the changes in context.py? Or will those no longer have the expected jobGroup, local properties, etc.? If the behavior has changed, what's the best way to keep it consistent? CC @ueshin ?

Also, Takuya, do you want to use SPARK-22340 for this or a new JIRA? Thank you!!

@@ -159,7 +159,12 @@ private[spark] object PythonRDD extends Logging {
* @return 2-tuple (as a Java array) with the port number of a local socket which serves the
* data collected from this job, and the secret for authentication.
*/
def collectAndServe[T](rdd: RDD[T]): Array[Any] = {
def collectAndServe[T](
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let @ueshin judge, but my guess is that, even though this is a "private" API, we'll want to add a new collectAndServe with the 2 arguments, leaving the old one in case 3rd-party libraries use the private API.

run_job(group_A_name, 0)
self.assertFalse(is_job_cancelled[0], "job didn't succeeded.")

for i in range(num_threads):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd explain what this is testing in a comment.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should better just clarify the limitation rather than re-implementing get-set local property logic within Python side ..

java_map = MapConverter().convert(self.context.getLocalProperties(),
self.context._gateway._gateway_client)
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(
self._jrdd.rdd() ,java_map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that work if we use UDF + count and use TaskContext's local property access?

java_map = MapConverter().convert(self.context.getLocalProperties(),
self.context._gateway._gateway_client)
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(
self._jrdd.rdd() ,java_map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple actions like toPandas. If the changes are needed to all of them, it sounds like we're re-implementing the local property logic within Python side.

@SparkQA
Copy link

SparkQA commented May 28, 2019

Test build #105878 has finished for PR 24705 at commit 892d4ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lu-wang-dl
Copy link
Contributor Author

Close this PR now. We will design this more carefully.

@lu-wang-dl lu-wang-dl closed this May 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants