Improve jitdump functionality #9120

Closed
22 of 23 tasks
fjeremic opened this issue Apr 3, 2020 · 22 comments · Fixed by #12203

@fjeremic
Contributor

fjeremic commented Apr 3, 2020

Background

The jitdump is a dump agent [1] that collects JIT trace logs to help investigate OpenJ9 issues. This dump agent is enabled by default for general protection faults (gpf) and aborts [2].

A jitdump can typically help under two scenarios:

  1. A crash during a JIT compilation
  2. A crash in a JIT compiled method

For both of these scenarios we typically require a JIT trace log of the method in question for further investigation. Sometimes this is an iterative process, especially for case 2, as we may not know which area of the JIT compiler was responsible for generating the faulty logic in the JIT compiled method assembly. The iterative process may require us to learn more about the problem from every log and to suggest additional tracing options until we can pinpoint the problem.

For case 1, we often need additional tracing enabled for the area of the JIT in which we crashed, in addition to having the JIT IL trees at hand.

Due to the dynamic nature of the JVM runtime environment, and the fact that the JIT compiler is guided by profiling information, a JIT compilation of a method in one JVM invocation may behave differently than a JIT compilation of the same method in a subsequent invocation of the JVM, even when the same environment and application are being run. This is a problem for servicing such issues if the first incident data collection did not capture enough information to effectively service the issue and provide a resolution.

The typical result of failing to obtain useful logging on first incident is that developers/service engineers must work with the stakeholder to reproduce the issue with additional tracing. This can take time and resources for both parties. A properly generated jitdump has a very high chance of reproducing the exact same compilation as the original, but with tracing enabled, because it runs in the same JVM process which produced the original faulty compilation. Therefore it is highly desirable to generate a useful jitdump on first incident to speed up the investigation of issues in the JIT.

[1] https://www.eclipse.org/openj9/docs/xdump/#dump-agents
[2] https://www.eclipse.org/openj9/docs/xdump/#default-dump-agents

Problems

There are several limitations when jitdump trace files are created:

  1. The jitdump file is empty
  2. The jitdump only contains a partial trace due to a recursive crash unrelated to the original problem
  3. The jitdump file fails to trace the right area of the JIT for finer grained information
  4. The jitdump does not trace the full backtrace of interesting methods
  5. The jitdump trace does not reproduce the original trace file
  6. The options used for the jitdump generation were different than the options used for the original compilation
  7. The jitdump compilation fails to complete due to JVM shutdown

Goal

The goal of this effort is to resolve the problems outlined in the previous section, and to always generate a useful jitdump so that developers/service engineers can make use of the trace information obtained during first incident data collection. Success will be measured by the reduction in the time it takes developers/service engineers to obtain a JIT trace log which contains enough information to make progress on fixing a defect. Another goal of this effort is to improve documentation and code quality of the jitdump process in the JIT compiler.

Issues / PRs

@fjeremic
Contributor Author

fjeremic commented Apr 28, 2020

Adding new PR (eclipse/omr#5135) to the list which will address some newline issues seen when jitdumps are generated.

@fjeremic
Contributor Author

I've reopened #9227 as we'll need to avoid printing snippets after a crash, since we cannot reliably print them before binary encoding. This is further explained in eclipse/omr#5111, which will be addressed at some point in the future. For now, we still want to avoid recursive crashes so that we get a proper jitdump out, so I'll be addressing that issue in the next few days.

@fjeremic
Contributor Author

Adding new issue (#9386) on a proposal to enable paranoid opt. check for jitdump recompilations.

@fjeremic
Contributor Author

Adding new PR (#9387) to address inconsistency in generation of jitdump vs. javacore and other dump triggers. That is, the messages reported and how they are reported are now consistent with javacore, Snap dump, heapdump, etc., and there are no redundant prefixes in the messages.

In addition we use the same function naming convention as javacore and snap dumps to remain consistent with other parts of the JVM.

@fjeremic
Contributor Author

fjeremic commented May 1, 2020

Adding new issue (#9428) to improve programmatic setting of tracing options for jitdump compiles.

@fjeremic
Contributor Author

fjeremic commented May 6, 2020

Adding new issue (#9479) to support passing sub-options to jitdump through the -Xdump framework, so as to enable custom tracing for arbitrary failures.

@fjeremic
Contributor Author

Adding new issue (#9522) to avoid compilation interruptions, such as the JVM wanting to shut down, when generating jitdumps. This is often seen in JUnit-style tests where, for example, a crash happens in the JIT, or an exception thrown in a test reaches main. In such scenarios JUnit reports the error and may terminate the remaining tests at that point. The JVM then wants to shut down while jitdumps are still being generated, which results in truncated jitdumps that are not useful for diagnosing the problem.
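
As a rough illustration of the race described above, here is a minimal Java sketch. It is not OpenJ9 code and does not reflect how #9522 is resolved; the class `DumpAwareShutdown` and its methods are hypothetical names invented for this example. It only shows the general idea: a shutdown path that waits, with a bound, for an in-flight dump writer instead of letting it be cut short.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: make the shutdown path wait for an in-flight dump. */
public class DumpAwareShutdown {
    // Released once the dump writer stops, whether or not it wrote everything.
    private final CountDownLatch dumpDone = new CountDownLatch(1);
    private final AtomicInteger linesWritten = new AtomicInteger();

    /** Simulates a slow dump writer that produces one trace line per delay period. */
    public void startDump(int lines, long delayMillis) {
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < lines; i++) {
                    Thread.sleep(delayMillis);      // stand-in for tracing work
                    linesWritten.incrementAndGet(); // stand-in for writing a line
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                dumpDone.countDown();
            }
        }, "dump-writer");
        // Daemon thread: without an explicit wait, JVM exit would not wait
        // for it, which mimics a truncated dump.
        writer.setDaemon(true);
        writer.start();
    }

    /** Shutdown path: block (bounded) until the writer has stopped. */
    public boolean awaitDump(long timeoutMillis) {
        try {
            return dumpDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public int linesWritten() {
        return linesWritten.get();
    }

    public static void main(String[] args) {
        DumpAwareShutdown shutdown = new DumpAwareShutdown();
        shutdown.startDump(5, 10);
        boolean finished = shutdown.awaitDump(5_000);
        System.out.println("finished=" + finished + " lines=" + shutdown.linesWritten());
    }
}
```

The bounded wait is a deliberate choice in this sketch: an unbounded wait would trade truncated dumps for a possible hang if the writer itself gets stuck.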

@fjeremic
Contributor Author

fjeremic commented Jun 5, 2020

Just a quick update on where things stand. I currently have several PRs up which I'm waiting to get merged before forging on. I think the most important issue to work on following this bulk of PRs getting merged is #9136.

@fjeremic
Contributor Author

Another update on #9136. I've gotten to the bottom of the major issue for one of the deadlocks. Still need to investigate the other much less common, and more artificial deadlock described in the latest comment in #9136. I'd like to fix them both to close off that item which is a major milestone in this work.

@fjeremic
Contributor Author

Back to trying to finish this off in the next month or so. Trying to knock off the easier items first, so I'm resuming #9428.

@fjeremic
Contributor Author

Another update from me. I do still have this on my radar but have been distracted by some machine migration that must be performed by end of September. I hope to get back to working on this in the next few weeks. I will post an update once I get back to doing something meaningful in this area.

@fjeremic
Contributor Author

The changes delivered here are already starting to show their benefit. For example, a 0/420 defect was able to produce a useful jitdump on first failure data capture over in #10630, which will aid in debugging the assert there.

@liqunl
Contributor

liqunl commented Oct 6, 2020

I found one problem in a crash. The original crash is in AOT compilation, but the replay is for JIT compilation which finishes without error.

@liqunl
Contributor

liqunl commented Oct 7, 2020

It would be good if the trace log and jitdump tell us if it is an AOT compile. Another problem is that, if the crash is in ilgen, no trees will be printed out before replay. I guess if replay happens in the right context, that is not a problem, but it will still be good if we can print some information.

@dsouzai
Contributor

dsouzai commented Oct 8, 2020

> I found one problem in a crash. The original crash is in AOT compilation, but the replay is for JIT compilation which finishes without error.

Opened #10852

@fjeremic
Contributor Author

fjeremic commented Jan 25, 2021

Getting back to this work in the last few days as I'm trying to polish this off given we are so close to completing everything. I started back looking at #9522 and that problem is mostly fixed, but during my stress testing around that area I discovered several issues which I've documented in #11765, #11770, and #11772. I have a firm understanding of the various problems now and I have solutions for each of them which I will try to deliver in the next few days. We are much closer to having robust JitDump generation.

@fjeremic
Contributor Author

I've dug myself out of the hole and have emerged with a ton of goodies. I've opened up #11825, which addresses what I believe to be all issues revolving around generation of JitDumps from crashed compilations. It will also help in the case of application thread crashes. This is the area I am going to stress test next, to ensure every JIT compiled body on the stack of an application crash gets a JitDump recompilation. This will be the final step in this saga, after which I expect every single JIT defect to have a useful JitDump accompanying it.

@fjeremic fjeremic self-assigned this Feb 16, 2021
@fjeremic
Contributor Author

fjeremic commented Mar 4, 2021

All the issues I could find with compilation thread crashes have been resolved. Going to take a look at application thread crashes and see if there is anything to fix on that front. If not, I'll do another refactoring pass to clean everything up, add documentation and proper tracing, then close off the Epic.

@fjeremic
Contributor Author

We are almost done here. I've opened #12203 as a final refactoring PR. Once that PR is merged my contribution to this Epic is complete. Thanks to all who followed along!

@liqunl
Contributor

liqunl commented Mar 12, 2021

I wonder if you can add a change to turn on TR_DebugInliner when the crash is in inliner? Or could you point me where to set this option? A lot of inlining traces are guarded by TR_DebugInliner and they're not printed in a jitdump.

@fjeremic
Contributor Author

> I wonder if you can add a change to turn on TR_DebugInliner when the crash is in inliner? Or could you point me where to set this option? A lot of inlining traces are guarded by TR_DebugInliner and they're not printed in a jitdump.

Implemented in #12208.

@JamesKingdon
Contributor

Hoping that late is still better than never - Thank you for all this work!!
