
[SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker (to support Python UDFs) #24946

Closed
wants to merge 3 commits into apache:master from HyukjinKwon:SPARK-27234

Conversation

@HyukjinKwon (Member) commented Jun 24, 2019

What changes were proposed in this pull request?

This PR proposes to use InheritableThreadLocal instead of ThreadLocal for the current epoch in EpochTracker. Python UDFs need separate threads to write to and read from the Python processes, and when new threads are created, the previously set epoch is lost.

After this PR, Python UDFs can be used in Structured Streaming with continuous mode.
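
For illustration, here is a minimal sketch of the idea in Scala (simplified names, not the exact EpochTracker code): an InheritableThreadLocal copies the parent thread's value into any thread created from it, so a writer thread spawned inside the task still sees the epoch set on the task thread.

// Hedged sketch of a thread-local holder for the current epoch.
object EpochTrackerSketch {
  // Before this PR the holder was a plain ThreadLocal, so a freshly created
  // thread saw None, and calling .get on that result failed with
  // java.util.NoSuchElementException: None.get (the error shown later in this thread).
  private val currentEpoch = new InheritableThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }

  def setCurrentEpoch(epoch: Long): Unit = currentEpoch.set(Some(epoch))
  def getCurrentEpoch: Option[Long] = currentEpoch.get()
}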

How was this patch tested?

The test cases were written on top of #24945.
Unit tests were added.

Manual tests.

@HyukjinKwon (Member Author) commented Jun 24, 2019

There's a similar case that uses InheritableThreadLocal: input_file_name(). cc @cloud-fan as well.
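
For anyone following along, here is a quick standalone demo of the difference (plain JDK thread-local behavior, not Spark code): a value set in a ThreadLocal is invisible to a child thread, while an InheritableThreadLocal value is copied into the child when it is created.

object InheritableThreadLocalDemo extends App {
  val plain = new ThreadLocal[String]
  val inheritable = new InheritableThreadLocal[String]

  // Set both on the "parent" (task) thread.
  plain.set("epoch-5")
  inheritable.set("epoch-5")

  val child = new Thread(() => {
    println(s"plain       = ${Option(plain.get())}")        // prints None
    println(s"inheritable = ${Option(inheritable.get())}")  // prints Some(epoch-5)
  })
  child.start()
  child.join()
}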

@HyukjinKwon HyukjinKwon changed the title [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker (to support Python UDFs) Jun 24, 2019
@mengxr (Contributor) commented Jun 24, 2019

@HyukjinKwon How does SCALAR_ITER Pandas UDF work with continuous processing after this PR? Do we only run initialization code once per task?

@HyukjinKwon (Member Author)

It works after this PR as below:

from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf("int", PandasUDFType.SCALAR_ITER)
def the_udf(iterator):
    for col1_batch in iterator:
        yield col1_batch

spark \
    .readStream \
    .format("rate") \
    .load() \
    .withColumn("foo", the_udf(col("value"))) \
    .writeStream \
    .format("console") \
    .trigger(continuous="5 second").start() 

Before:

...
Caused by: java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:366)
	at scala.None$.get(Option.scala:364)
	at org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader.next(ContinuousQueuedDataReader.scala:116)
	at org.apache.spark.sql.execution.streaming.continuous.ContinuousDataSourceRDD$$anon$1.getNext(ContinuousDataSourceRDD.scala:93)
	at org.apache.spark.sql.execution.streaming.continuous.ContinuousDataSourceRDD$$anon$1.getNext(ContinuousDataSourceRDD.scala:91)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:701)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
...

After:

...
-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+-----+---+
|           timestamp|value|foo|
+--------------------+-----+---+
|2019-06-25 09:05:...|    9|  9|
|2019-06-25 09:05:...|    2|  2|
|2019-06-25 09:05:...|   10| 10|
|2019-06-25 09:05:...|   11| 11|
|2019-06-25 09:05:...|    5|  5|
|2019-06-25 09:05:...|    3|  3|
|2019-06-25 09:05:...|    1|  1|
|2019-06-25 09:05:...|    6|  6|
|2019-06-25 09:05:...|    4|  4|
|2019-06-25 09:05:...|    8|  8|
|2019-06-25 09:05:...|    7|  7|
|2019-06-25 09:05:...|    0|  0|
+--------------------+-----+---+
...

This was because the current epoch could not be read from the writer thread (the thread that writes to the Python process). Each UDF is executed once per execution of each epoch.

@HyukjinKwon (Member Author)

#24945 is merged. Let me rebase.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-27234 branch 2 times, most recently from e3d9908 to aa34d9e Compare June 25, 2019 03:03
@zsxwing (Member) left a comment

How do we handle other thread local variables that are not InheritableThreadLocal, such as org.apache.spark.TaskContext.get?

@HyukjinKwon (Member Author) commented Jun 25, 2019

In the case of org.apache.spark.TaskContext, it looks like it's manually set for Python's writer thread:

The information is then (de)serialized into the Python worker, where a TaskContext instance is constructed and used later.

I think that approach would work too, but I thought it better to keep this logic isolated from Python. FWIW, the Python runners are in core while this code is in SQL, so we would have to move code around to mimic that approach.
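
For comparison, here is a generic sketch of that manual-propagation pattern (illustrative names only, not Spark's actual code): the task thread captures the thread-local value and the child writer thread re-installs it before doing any work, which is roughly how TaskContext reaches the Python writer thread.

object ManualPropagationSketch {
  // A plain ThreadLocal, as EpochTracker used before this PR.
  private val currentEpoch = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }

  def setCurrentEpoch(epoch: Long): Unit = currentEpoch.set(Some(epoch))

  // Capture the value on the calling (task) thread and re-set it on the
  // writer thread before running the write loop.
  def startWriter(writeLoop: () => Unit): Thread = {
    val parentEpoch = currentEpoch.get()
    val writer = new Thread(() => {
      currentEpoch.set(parentEpoch)
      writeLoop()
    })
    writer.start()
    writer
  }
}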

@HyukjinKwon (Member Author)

gentle ping .. :-) ..

@SparkQA commented Jul 22, 2019

Test build #108000 has finished for PR 24946 at commit b103acc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Jul 22, 2019

LGTM

@HyukjinKwon (Member Author)

Thanks for the review and approval, @zsxwing.

Just FYI, there's a similar fix and discussion going on at #24958.

@HyukjinKwon (Member Author)

Merged to master.

Thanks all.

@aurorazl

@HyukjinKwon Was this problem fixed in Spark 3.0? What can I do to fix it in Spark 2.4.3?

@HyukjinKwon (Member Author)

This fix will be included in Apache Spark 3.0. I think you should upgrade once it is released.

@dongjoon-hyun (Member)

This issue is being discussed on the dev mailing list for 2.4.4. I'll follow the decision from @HyukjinKwon and @zsxwing.

@HyukjinKwon (Member Author)

I'm fine with backporting, but @zsxwing, WDYT?

@zsxwing (Member) commented Aug 14, 2019

I'm fine with backporting this small fix.

@dongjoon-hyun (Member)

Thank you, @zsxwing and @HyukjinKwon.
Could you make a backport PR against branch-2.4, @HyukjinKwon?

@HyukjinKwon (Member Author)

Yup. BTW, IntegratedUDFTestUtils doesn't exist in branch-2.4, so I will have to manually test and backport.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Aug 15, 2019
[SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker (to support Python UDFs)

This PR proposes to use `InheritableThreadLocal` instead of `ThreadLocal` for the current epoch in `EpochTracker`. Python UDFs need separate threads to write to and read from the Python processes, and when new threads are created, the previously set epoch is lost.

After this PR, Python UDFs can be used in Structured Streaming with continuous mode.

The test cases were written on top of apache#24945.
Unit tests were added.

Manual tests.

Closes apache#24946 from HyukjinKwon/SPARK-27234.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon deleted the SPARK-27234 branch March 3, 2020 01:18