
[HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.#1289

Merged

vinothchandar merged 1 commit into apache:master from prashantwason:pw_dag_names on Jul 19, 2020

Conversation

@prashantwason
Member

What is the purpose of the pull request

Hudi DAG stages do not have names, so the Spark History Server UI shows these stages with only the Hudi Java filename and line number.

This change provides descriptive names for the stages, which makes it easier to visualize the Hudi DAG.

Brief change log

  1. All code locations that use JavaSparkContext DAG-creation methods (e.g. parallelize) have been updated to set a descriptive name for the job with the following API:
    jsc.setJobGroup(title, description);

  2. The title is the class name, so it is easy to identify; the description is the activity for that stage.

  3. Unit test code has been updated to write Spark event logs, so that even the unit tests can be visualized in a locally installed Spark History Server UI.

  4. The unit tests need to be run as follows:

mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log

Once the unit tests complete, the Spark History Server shows the application logs for all completed unit tests. The unit tests themselves can be identified by the test-class-name as the application name.
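As a sketch of the call-site pattern described above (the class name and the title/description strings below are illustrative, not the exact ones in this PR):

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical call site: label the upcoming Spark jobs before triggering a DAG.
public class ExampleActionExecutor {
  public JavaRDD<String> loadPartitions(JavaSparkContext jsc, List<String> partitions) {
    // Title = class name (easy to identify); description = the activity for this stage.
    jsc.setJobGroup(this.getClass().getSimpleName(), "Loading partitions to process");
    return jsc.parallelize(partitions, partitions.size());
  }
}
```

Any jobs triggered on this thread after the setJobGroup call show up in the Spark UI under that group name and description.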

Verify this pull request

This pull request is already covered by existing tests.

  1. All existing unit tests pass.
  2. If the SPARK_EVLOG_DIR property is not set, Spark event logging is not enabled (this is the default behavior).
  3. Manually verified the change by running unit tests locally and observing the application logs on the locally installed Spark History Server.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@n3nash
Contributor

n3nash commented Jan 29, 2020

@bvaradar @vinothchandar can you guys help review this ?

@n3nash
Contributor

n3nash commented Jan 29, 2020

@prashantwason I think you tried another approach using aspects to make the code look cleaner, right? Could you please briefly describe that approach (pros and cons) here so reviewers are also aware of it?

@n3nash n3nash self-requested a review January 29, 2020 19:00
@prashantwason
Member Author

A DAG stage name and description can be set using the JavaSparkContext.setJobGroup(...) method. The same name/description is used for all stages launched from the same thread until it is updated (another call to setJobGroup) or cleared (clearJobGroup).

In this PR, I am using the class name as the stage name and a textual description derived from the method logic. Hudi classes have very descriptive names, so this works well.

There are two ways this may be done:

  1. Manually (this PR), by adding code to set the name/description before any DAG stages are started.
  2. Using Java AOP to automatically find code locations matching some pattern and augment them with the call to setJobGroup.

To use the AOP approach, we can create a separate AspectJ file containing the Pointcuts (code locations to augment) and Advices (code to insert). A separate AspectJ compiler can then change the class bytecode at runtime to add the Advices.

Pros of AOP approach:

  1. Does not require any change in current code
  2. Also covers future code automatically
  3. Easy to undo (just don't run the AspectJ compiler as part of build)
  4. Can be extended to more use-cases like automating Metrics.

Cons of AOP approach:

  1. Requires AspectJ and its compiler to be integrated into the Hudi build chain
  2. The Advice cannot be dynamic. Hence we cannot provide descriptions to the DAG stages (we can still use the class name as the DAG stage name).

Since the code has a manageable number of places where a DAG is created, I prefer the simpler manual approach. It also ends up documenting the code.
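For comparison, the AOP alternative discussed above might look roughly like the following AspectJ aspect. This is a hypothetical sketch: the pointcut pattern and class names are illustrative, not from this PR, and it shows the limitation noted in the cons: the advice can only derive a generic name (the enclosing class), not a per-stage description.

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical aspect: sets a job group before every JavaSparkContext.parallelize(..)
// call site. Because the advice is not dynamic, only the enclosing class name is
// available as a label; a meaningful per-stage description cannot be generated.
@Aspect
public class DagNamingAspect {

  @Around("call(* org.apache.spark.api.java.JavaSparkContext.parallelize(..)) && target(jsc)")
  public Object nameStage(ProceedingJoinPoint pjp, JavaSparkContext jsc) throws Throwable {
    // Label with the class that contains the call site.
    jsc.setJobGroup(pjp.getSourceLocation().getWithinType().getSimpleName(), "");
    return pjp.proceed();
  }
}
```

Weaving this into the Hudi build would require integrating the AspectJ compiler (or load-time weaver) into the build chain, which is the first con listed above.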

@vinothchandar
Member

@prashantwason this is a great contribution for anyone debugging hudi writing... Can you post some screenshots for how upsert/bulk_insert dags now show up on the UI?

also @n3nash if you want to review this, feel free to grab this from me

@prashantwason
Member Author

Spark History Server Screenshots

List of tests

Event timeline

TestHoodieClientCopyOnWrite

@vinothchandar
Member

@prashantwason this is so awesome! Started reviewing this ..

@vinothchandar
Member

vinothchandar commented Feb 6, 2020

The unit tests need to be run as follows:
mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log

lets add this to the README, under a new section Running Tests

@vinothchandar left a comment

Made one pass and left some general comments to clarify the detail/description text..

Few high level questions

  1. It seems like you are only covering cases where an RDD is getting created? Is it possible to change the job group between stages? For example, we have quite a few stages in HoodieBloomIndex, and I wonder if we can show them all: https://cwiki.apache.org/confluence/display/HUDI/def~bloom-index
  2. Related to 1, if the answer is no, then how does a typical bulk_insert, upsert, or insert DAG look? I am wondering if, say, the entire set of bloom index stages gets named "Obtain file ranges as range pruning is enabled", since that was the last call we made to set the name...

@prashantwason
Member Author

  1. It seems like you are only covering cases where an RDD is getting created? Is it possible to change the job group between stages?

The setJobGroup() description applies to the thread and is used until it is either updated or removed. So we need to label each stage, and they should then show up correctly.

We can label RDD creations as well as operations on them.
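Since the label is thread-scoped, it can simply be reset before each stage on the same thread. A minimal sketch of that pattern (the group names and descriptions below are illustrative, loosely modeled on the bloom-index example in this thread):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class StageLabelingSketch {

  static void run(JavaSparkContext jsc) {
    // Label the first stage before triggering it.
    jsc.setJobGroup("HoodieBloomIndex", "Obtain file ranges, if range pruning is enabled");
    JavaRDD<Integer> keys = jsc.parallelize(Arrays.asList(1, 2, 3));
    keys.count(); // this job appears in the UI under the label above

    // Update the label on the same thread before the next stage.
    jsc.setJobGroup("HoodieBloomIndex", "Compute candidate key/file comparisons");
    keys.map(k -> k * 2).count(); // this job appears under the new label
  }
}
```

Each action inherits whichever job group was most recently set on its calling thread, so consecutive stages get distinct names as long as the label is refreshed between them.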

@vinothchandar
Member

So we need to label each stage and they should show up correctly.

if we could do that, and if you can post the upsert() dag for example, that would be great..

@vinothchandar
Member

@prashantwason Wondering if you have some screenshots for the upsert DAG.. (I can try running the PR locally if not)

@prashantwason
Member Author

prashantwason commented Feb 24, 2020 via email

@vinothchandar
Member

@prashantwason still driving this? Can I help get this moving along?

@prashantwason
Member Author

@vinothchandar Yep, I would like this to move forward. Let me revive this, as it seems there are merge conflicts now.

We don't have this deployed yet, so the only DAG screenshots I can provide are from unit tests. On your end, can you point me to specific unit tests that exercise the DAGs you are interested in? I can generate the screenshots from those.

@vinothchandar
Member

TestMergeOnReadTable or TestClientCopyOnWriteStorage, etc., which run a full upsert DAG for COW and MOR, are good starting points.. but really, we need to run an upsert with a real job to ensure these names also show up in real deployments.

@vinothchandar
Member

@prashantwason this will be a good candidate to fast track into the next release. are you still working on this?

@vinothchandar vinothchandar changed the title [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi. [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi. May 5, 2020
@nsivabalan
Contributor

@prashantwason: Once you update the PR, do let @lamber-ken know that it's ready for review.
@lamber-ken : would you mind reviewing this PR.

@prashantwason
Member Author

@lamber-ken I have updated the PR and added screenshots from the Spark History Server UI for some of the MOR table operations. Please have a look.

@prashantwason
Member Author

List of executed tests - the name is made up of the test and the method
Screenshot (1)

Various DAGs
Screenshot (2)
Screenshot (3)
Screenshot (4)
Screenshot (5)

@codecov-commenter

codecov-commenter commented May 30, 2020

Codecov Report

Merging #1289 into master will increase coverage by 0.00%.
The diff coverage is 20.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1289   +/-   ##
=========================================
  Coverage     18.19%   18.19%           
  Complexity      856      856           
=========================================
  Files           348      348           
  Lines         15344    15359   +15     
  Branches       1523     1523           
=========================================
+ Hits           2792     2795    +3     
- Misses        12195    12207   +12     
  Partials        357      357           
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../org/apache/hudi/client/CompactionAdminClient.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...c/main/java/org/apache/hudi/table/HoodieTable.java | 39.53% <0.00%> (-0.63%) | 22.00 <0.00> (ø) |
| ...e/hudi/table/action/clean/CleanActionExecutor.java | 10.52% <0.00%> (-0.17%) | 5.00 <0.00> (ø) |
| ...ction/compact/HoodieMergeOnReadTableCompactor.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...on/rollback/MergeOnReadRollbackActionExecutor.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...che/hudi/table/action/rollback/RollbackHelper.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...able/action/savepoint/SavepointActionExecutor.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...n/java/org/apache/hudi/index/HoodieIndexUtils.java | 47.61% <100.00%> (+2.61%) | 3.00 <1.00> (ø) |
| .../org/apache/hudi/index/bloom/HoodieBloomIndex.java | 59.00% <100.00%> (+0.41%) | 16.00 <0.00> (ø) |
| ...he/hudi/table/action/commit/UpsertPartitioner.java | 55.71% <100.00%> (+0.31%) | 15.00 <0.00> (ø) |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a0d3f1...ed384e6. Read the comment docs.

@n3nash n3nash changed the title [WIP] [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi. [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi. Jun 12, 2020
@n3nash
Contributor

n3nash commented Jun 12, 2020

@vinothchandar could you take a look at the screenshots and see if that provides what you were looking for ?

@vinothchandar
Member

@n3nash this has been since reassigned.. feel free to grab this review, I have few others queued up before I can get to this.

@n3nash
Contributor

n3nash commented Jul 8, 2020

@prashantwason can you rebase and push please ? I can then merge this

@prashantwason
Member Author

@n3nash I have rebased the changes. Build is green.

@vinothchandar vinothchandar merged commit b71f25f into apache:master Jul 19, 2020
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Mar 24, 2025