[SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE #33429

Closed
wants to merge 1 commit into from

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jul 20, 2021

What changes were proposed in this pull request?

This PR proposes the following renames:

  • Rename *Reader/*reader to *Read/*read for rules and execution plans (user-facing doc/config names remain untouched)
    • *ShuffleReaderExec -> *ShuffleReadExec
    • isLocalReader -> isLocalRead
    • ...
  • Rename CustomShuffle* prefix to AQEShuffle*
  • Rename OptimizeLocalShuffleReader rule to OptimizeShuffleWithLocalRead

Why are the changes needed?

There are multiple problems in the current naming:

  • CustomShuffle* -> AQEShuffle*
    The old name sounds like a pluggable API, but it is actually only used by AQE.

  • OptimizeLocalShuffleReader -> OptimizeShuffleWithLocalRead
    It is the name of a rule, but it can be misread as a reader, which is counterintuitive.

  • *ReaderExec -> *ReadExec
    "Reader execution" reads a bit oddly; "read execution" is better and matches nodes such as ScanExec, ProjectExec and FilterExec. I can't find a reason to name the node after an actor rather than the action it performs. See also the generated plans:

    Before:

    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    

    After:

    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
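
The before/after plans above differ only in the displayed node name. As a rough illustration, a minimal sketch of that mapping applied to saved explain text (for example, when updating expected plan output in tests) might look like the following. `PlanNodeRename` and `rewrite` are hypothetical names, not part of Spark; only the `CustomShuffleReader` -> `AQEShuffleRead` pair is taken from the plans above.

```scala
// Hypothetical helper, NOT part of Spark: rewrites the pre-rename node name in
// saved explain output to the name introduced by this PR.
object PlanNodeRename {
  // Old display name -> new display name, taken from the Before/After plans.
  val renames: Seq[(String, String)] = Seq(
    "CustomShuffleReader" -> "AQEShuffleRead"
  )

  // Applies every rename in order to the given plan text.
  def rewrite(planText: String): String =
    renames.foldLeft(planText) { case (text, (from, to)) =>
      text.replace(from, to)
    }
}
```

For instance, `PlanNodeRename.rewrite("+- CustomShuffleReader (11)")` yields `"+- AQEShuffleRead (11)"`; lines without the old name pass through unchanged.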
    

Does this PR introduce any user-facing change?

No, internal refactoring.

How was this patch tested?

Existing unit tests should cover the changes.

@HyukjinKwon HyukjinKwon marked this pull request as draft July 20, 2021 04:10
@HyukjinKwon HyukjinKwon changed the title [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader [WIP][SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader Jul 20, 2021
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader [WIP][SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE Jul 20, 2021
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE Jul 20, 2021
@HyukjinKwon HyukjinKwon marked this pull request as ready for review July 20, 2021 04:38
@HyukjinKwon
Member Author

cc @cloud-fan and @maryannxue can you take a look please?

@HyukjinKwon
Member Author

cc @ulysses-you too FYI

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (from my side). I agree with the reasons for the renaming given by @HyukjinKwon.

@HyukjinKwon
Member Author

Let me rebase. It seems it couldn't detect my GitHub Actions job.

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45811/

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141300 has finished for PR 33429 at commit 2840b47.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45811/

@@ -139,21 +139,21 @@ class AdaptiveQueryExecSuite
     }
   }

-  private def checkNumLocalShuffleReaders(
-      plan: SparkPlan, numShufflesWithoutLocalReader: Int = 0): Unit = {
+  private def checkNumLocalShuffleReads(
Contributor

"num readers" seems more natural?

Contributor

or "num shuffles with local read", but it's longer

Member Author

I think it's fine. The number of local shuffle reads refers to the number of (local) AQEShuffleReadExec nodes, i.e., the number of AQE read plans.
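
To make that reading of the name concrete, here is a minimal sketch of counting local shuffle reads over a toy plan tree. The classes below are simplified stand-ins that only borrow names from this PR (`AQEShuffleRead`, `isLocalRead`); they are not Spark's real `SparkPlan` classes, and `numLocalShuffleReads` is a hypothetical helper rather than the test utility in `AdaptiveQueryExecSuite`.

```scala
// Toy stand-ins for physical plan nodes; illustrative only, NOT Spark classes.
sealed trait Plan { def children: Seq[Plan] }

case class AQEShuffleRead(child: Plan, isLocalRead: Boolean) extends Plan {
  def children: Seq[Plan] = Seq(child)
}

case class ShuffleQueryStage(id: Int) extends Plan {
  def children: Seq[Plan] = Nil
}

case class Join(left: Plan, right: Plan) extends Plan {
  def children: Seq[Plan] = Seq(left, right)
}

object LocalReadCount {
  // Walks the whole tree and counts AQEShuffleRead nodes flagged as local.
  def numLocalShuffleReads(plan: Plan): Int = {
    val self = plan match {
      case AQEShuffleRead(_, true) => 1
      case _                       => 0
    }
    self + plan.children.map(numLocalShuffleReads).sum
  }
}
```

In this model the count is per read node, so a join with one local and one non-local read counts as one local shuffle read.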

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45814/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45814/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45818/

@HyukjinKwon
Member Author

BTW, it looks like GitHub Actions changed something; all tests apparently fail because of OOM.

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141297 has finished for PR 33429 at commit f2b0bab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45818/

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141304 has finished for PR 33429 at commit 2840b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@HyukjinKwon
Member Author

(rebased to resolve conflicts)

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45850/

@SparkQA

SparkQA commented Jul 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45850/

@SparkQA

SparkQA commented Jul 20, 2021

Test build #141335 has finished for PR 33429 at commit d64ed95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@HyukjinKwon
Member Author

I'll merge in a few days if there are no more comments.

@maryannxue
Contributor

I am fine with "Rename OptimizeLocalShuffleReader rule to OptimizeShuffleWithLocalRead"

But can we still use "reader" for the physical node name, i.e., AQEShuffleReaderExec?

@HyukjinKwon
Member Author

Hmm, I feel like that's inconsistent with the current SparkPlan naming rules. For example, there are no -er names; one example is V2TableWriteExec. As far as I can tell, almost no plan is named after an actor: ScanExec, ProjectExec, ExchangeExec, AggregateExec, etc. I believe using Read is more consistent with the other naming.

@SparkQA

SparkQA commented Jul 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46073/

@SparkQA

SparkQA commented Jul 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46073/

@SparkQA

SparkQA commented Jul 23, 2021

Test build #141555 has finished for PR 33429 at commit e0b8fa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@HyukjinKwon
Member Author

Will merge tomorrow if there are no more comments @maryannxue

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46152/

@SparkQA

SparkQA commented Jul 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46152/

@SparkQA

SparkQA commented Jul 26, 2021

Test build #141636 has finished for PR 33429 at commit 19d6dee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AQEShuffleReadRule extends Rule[SparkPlan]
  • case class CoalesceShufflePartitions(session: SparkSession) extends AQEShuffleReadRule

@cloud-fan
Contributor

thanks, merging to master/3.2! (otherwise backporting AQE changes will be very hard)

@cloud-fan cloud-fan closed this in 6e3d404 Jul 26, 2021
cloud-fan pushed a commit that referenced this pull request Jul 26, 2021
…eReader in AQE

### What changes were proposed in this pull request?

This PR proposes to rename:

- Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plans (user-facing doc/config names remain untouched)
  - `*ShuffleReaderExec` -> `*ShuffleReadExec`
  - `isLocalReader` -> `isLocalRead`
  - ...
- Rename `CustomShuffle*` prefix to `AQEShuffle*`
- Rename `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead`

### Why are the changes needed?

There are multiple problems in the current naming:

- `CustomShuffle*` -> `AQEShuffle*`
    The old name sounds like a pluggable API, but it is actually only used by AQE.
- `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`
    It is the name of a rule, but it can be misread as a reader, which is counterintuitive.
- `*ReaderExec` -> `*ReadExec`
    "Reader execution" reads a bit oddly; "read execution" is better and matches nodes such as `ScanExec`, `ProjectExec` and `FilterExec`. I can't find a reason to name the node after an actor rather than the action it performs. See also the generated plans:

    Before:

    ```
    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    ```

    After:

    ```
    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
    ```

### Does this PR introduce _any_ user-facing change?

No, internal refactoring.

### How was this patch tested?

Existing unit tests should cover the changes.

Closes #33429 from HyukjinKwon/SPARK-36217.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e3d404)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member Author

Thanks Wenchen!

@HyukjinKwon HyukjinKwon deleted the SPARK-36217 branch January 4, 2022 00:52
5 participants