[SPARK-38679][CORE] Expose the number partitions in a stage to TaskContext #35995

vkorukanti · 2022-03-29T00:15:35Z

What changes were proposed in this pull request?

Add a new API in TaskContext that exposes the total partition count of the stage the task belongs to.

Why are the changes needed?

Add a new API to expose the total partition count of the stage the task belongs to in TaskContext, so that the task knows what fraction of the computation it is doing.

With this extra information, users can generate 32-bit unique int ids as below, rather than using monotonically_increasing_id, which generates 64-bit long ids.

rdd.mapPartitions { rowsIter =>
  val partitionId = TaskContext.get().partitionId()
  val numPartitions = TaskContext.get().numPartitions()
  var i = 0
  rowsIter.map { row =>
    val rowId = partitionId + i * numPartitions
    i += 1
    (rowId, row)
  }
}
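
To see why the ids above do not collide, here is a minimal pure-Scala sketch that simulates the same scheme without Spark. The object name `StridedRowIds` and the helper `assignIds` are hypothetical and exist only for this illustration; nested `Seq`s stand in for the RDD partitions. Every id produced in partition `p` is congruent to `p` modulo `numPartitions`, so ids from different partitions never overlap. The sketch also assumes overflow should fail loudly: `i * numPartitions` can exceed the `Int` range on large partitions, so it uses the exact-arithmetic helpers from `java.lang.Math`.

```scala
// Simulation of the strided id scheme; no Spark dependency.
object StridedRowIds {
  /** Assigns partitionId + i * numPartitions to the i-th row of each partition. */
  def assignIds[A](partitions: Seq[Seq[A]]): Seq[(Int, A)] = {
    val numPartitions = partitions.length
    partitions.zipWithIndex.flatMap { case (rows, partitionId) =>
      rows.zipWithIndex.map { case (row, i) =>
        // multiplyExact/addExact throw ArithmeticException on Int overflow
        // instead of silently wrapping around and producing duplicate ids.
        val rowId = java.lang.Math.addExact(
          partitionId, java.lang.Math.multiplyExact(i, numPartitions))
        (rowId, row)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Three simulated partitions of uneven size.
    val partitions = Seq(Seq("a", "b", "c"), Seq("d"), Seq("e", "f"))
    assignIds(partitions).foreach(println)
  }
}
```

The ids are unique but not contiguous when partitions have uneven sizes, which is the same trade-off the snippet in the description makes.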

Does this PR introduce any user-facing change?

Yes. We add a new API TaskContext.numPartitions.

How was this patch tested?

Added new unit tests to verify that the number of partitions retrieved from TaskContext is as expected.

jiangxb1987

LGTM

jiangxb1987 · 2022-03-29T05:36:17Z

It looks like barrier execution can also use this api to simplify the implementation of numTasks : )

vkorukanti · 2022-03-29T15:54:19Z

Thank you @jiangxb1987 for reviewing. Could you please take another look at the change? I had to exclude the new method from the binary compatibility check.

zsxwing · 2022-03-29T18:17:46Z

@cloud-fan @Ngone51 Although this one adds a new API, it's a pretty straightforward change. It looks pretty safe to me to backport into 3.3. What do you think? Also cc @MaxGekk

MaxGekk · 2022-03-29T18:21:51Z

> What do you think? Also cc @MaxGekk

Since this is neither a bug fix nor in the allow list https://lists.apache.org/thread/zrd7lcm5f5f3md7wffjy7x6w2pdmxxp7, we cannot just silently merge it to branch-3.3. @zsxwing @vkorukanti Could you write an email to the thread in the dev list and explain why we need this in 3.3 and cannot postpone it to 3.4?

zsxwing · 2022-03-29T19:49:23Z

@MaxGekk Thanks for the feedback. We will not merge this to 3.3.

cloud-fan · 2022-03-30T03:01:42Z

core/src/main/scala/org/apache/spark/BarrierTaskContext.scala

@@ -215,6 +215,8 @@ class BarrierTaskContext private[spark] (

  override def partitionId(): Int = taskContext.partitionId()

+  override def numPartitions(): Int = taskContext.numPartitions()


We can remove lazy val numTasks in this file and use numPartitions() directly.

Updated in the latest commit.

Add a new API to expose the total partition count in a task, so that the task knows what fraction of the computation it is doing.

With this extra information, users can generate 32-bit unique int ids as below, rather than using `monotonically_increasing_id`, which generates 64-bit long ids.

```scala
val rdd = ...
rdd.mapPartitions { rowsIter =>
  val partitionId = TaskContext.get().partitionId()
  val numPartitions = TaskContext.get().numPartitions()
  var i = 0
  rowsIter.map { row =>
    val rowId = partitionId + i * numPartitions
    i += 1
    (rowId, row)
  }
}
```

Test: Added new unit tests to verify that the number of partitions retrieved from TaskContext is as expected.

cloud-fan · 2022-03-31T13:44:34Z

thanks, merging to master!

github-actions bot added CORE SQL STRUCTURED STREAMING labels Mar 29, 2022

HeartSaVioR changed the title ~~[SPARK-38679] Expose the number partitions in a stage to TaskContext~~ [SPARK-38679][CORE] Expose the number partitions in a stage to TaskContext Mar 29, 2022

vkorukanti force-pushed the task_numPartitions branch from c86b83e to ecf7112 Compare March 29, 2022 04:29

jiangxb1987 approved these changes Mar 29, 2022

vkorukanti force-pushed the task_numPartitions branch from 7354419 to b48a670 Compare March 29, 2022 15:53

github-actions bot added the BUILD label Mar 29, 2022

cloud-fan reviewed Mar 30, 2022

cloud-fan approved these changes Mar 30, 2022

Ngone51 approved these changes Mar 30, 2022

mridulm approved these changes Mar 30, 2022

vkorukanti added 5 commits March 30, 2022 16:24

fix

9cace3f

fix

848d22c

exclude the new method from binary compatibility check

37736c8

Update BarrierTaskContext to use the numPartitions API

c975242

vkorukanti force-pushed the task_numPartitions branch from b48a670 to c975242 Compare March 30, 2022 23:25

yaooqinn approved these changes Mar 31, 2022

cloud-fan closed this in f486029 Mar 31, 2022

vkorukanti deleted the task_numPartitions branch March 31, 2022 16:46

HyukjinKwon mentioned this pull request Mar 31, 2022

[SPARK-38679][SQL][TESTS][FOLLOW-UP]Add numPartitions parameter to TaskContextImpl at SubexpressionEliminationSuite #36031

Closed

LuciferYang mentioned this pull request Sep 1, 2022

[SPARK-40283][INFRA] Make MiMa check default exclude private object and bump previousSparkVersion to 3.3.0 #37741

Closed
