-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker (to support Python UDFs) #24946
Conversation
This comment has been minimized.
This comment has been minimized.
There's similar case that uses |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@HyukjinKwon How does |
It works after this PR as below: from pyspark.sql.functions import col, pandas_udf, PandasUDFType
@pandas_udf("int", PandasUDFType.SCALAR_ITER)
def the_udf(iterator):
for col1_batch in iterator:
yield col1_batch
spark \
.readStream \
.format("rate") \
.load() \
.withColumn("foo", the_udf(col("value"))) \
.writeStream \
.format("console") \
.trigger(continuous="5 second").start() Before:
After:
Because each epoch couldn't be referred in writer thread (to Python process). Each UDF will be executed each execution per each epoch. |
#24945 is merged. Let me rebase. |
e3d9908
to
aa34d9e
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we handle other thread local variables that are not InheritableThreadLocal, such as org.apache.spark.TaskContext.get
?
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala
Outdated
Show resolved
Hide resolved
In case of
and the information are (de)serialized into Python worker, I think it works too but I thought it's better to isolate this logic out of Python. Python runners are in core and this code is in SQL FWIW. We should move the codes around to mimic this approach. |
This comment has been minimized.
This comment has been minimized.
gentle ping .. :-) .. |
This comment has been minimized.
This comment has been minimized.
Test build #108000 has finished for PR 24946 at commit
|
LGTM |
Merged to master. Thanks all. |
@HyukjinKwon Was problem fixed in spark 3.0? What can i do to fix it in spark2.4.3? |
This fix will be included in Apache Spark 3.0. I think you should upgrade it later when this is released. |
In the dev mailing list, this issue is discussed for 2.4.4. I'll follow the decision from @HyukjinKwon and @zsxwing . |
Am fine with backporting but @zsxwing WDYT? |
I'm fine with backporting this small fix. |
Thank you, @zsxwing and @HyukjinKwon . |
Yup. BTW |
…h in EpochTracker (to support Python UDFs) This PR proposes to use `InheritableThreadLocal` instead of `ThreadLocal` for current epoch in `EpochTracker`. Python UDF needs threads to write out to and read it from Python processes and when there are new threads, previously set epoch is lost. After this PR, Python UDFs can be used at Structured Streaming with the continuous mode. The test cases were written on the top of apache#24945. Unit tests were added. Manual tests. Closes apache#24946 from HyukjinKwon/SPARK-27234. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR proposes to use
InheritableThreadLocal
instead ofThreadLocal
for current epoch inEpochTracker
. Python UDF needs threads to write out to and read it from Python processes and when there are new threads, previously set epoch is lost.After this PR, Python UDFs can be used at Structured Streaming with the continuous mode.
How was this patch tested?
The test cases were written on the top of #24945.
Unit tests were added.
Manual tests.