v1.34.0 JVM crash #7144

Open
zBart opened this issue Jun 10, 2024 · 18 comments
@zBart

zBart commented Jun 10, 2024

We recently upgraded dd-trace-java from 1.31.2 to 1.34.0 and have since seen a crash. We've only seen this happen once so far.

We're not completely sure whether this is a Datadog issue or a JVM issue.

Here is the log:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (codeCache.cpp:654), pid=14, tid=430
#  guarantee(is_result_safe || is_in_asgct()) failed: unsafe access to zombie method
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.11.9.1 (17.0.11+9) (build 17.0.11+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.11.9.1 (17.0.11+9-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x5b8d61]  CodeCache::find_blob(void*)+0xc1
#
# The JFR repository may contain useful JFR files. Location: /dumps/2024_05_30_12_10_17_14
#

[...]

---------------  T H R E A D  ---------------

Current thread (0x00007f36d80e94f0):  JavaThread "pool-18-thread-63" [_thread_in_Java, id=430, stack(0x00007f369c4b5000,0x00007f369c5b6000)]

Stack: [0x00007f369c4b5000,0x00007f369c5b6000],  sp=0x00007f369c5b3400,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5b8d61]  CodeCache::find_blob(void*)+0xc1
V  [libjvm.so+0xeee39b]  JavaThread::pd_get_top_frame_for_signal_handler(frame*, void*, bool)+0x1ab
V  [libjvm.so+0x6ef823]  AsyncGetCallTrace+0x1c3
C  [libjavaProfiler9259657795778516739.so+0x26926]  Profiler::getJavaTraceAsync(void*, ASGCT_CallFrame*, int, StackContext*, bool*)+0x176
C  [libjavaProfiler9259657795778516739.so+0x27811]  Profiler::recordSample(void*, unsigned long long, int, int, Event*)+0x1e1
C  [libjavaProfiler9259657795778516739.so+0x429d2]  WallClock::sharedSignalHandler(int, siginfo_t*, void*)+0x1a2
C  [libpthread.so.0+0x118e0]
C  [linux-vdso.so.1+0xc62]  clock_gettime+0x242
C  [libc.so.6+0xfc426]  __clock_gettime+0x26

Registers:
[...]
@richardstartin
Member

Hi @zBart did you change anything else when you did the upgrade? Could you open a support ticket so I can get access to the hs_err file please?

@richardstartin
Member

We have identified the cause of this bug and are working on releasing a fix

@foameraserblue

foameraserblue commented Jun 14, 2024

I have similar symptoms. Is it the same cause?

I recently upgraded the JDK from 17 to 23.

JDK used: openjdk:23-jdk-slim

dd-java-agent information: /datadog/dd-java-agent.jar 'https://dtdg.co/latest-java-tracer'
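
In other words, the image fetches the latest tracer from that URL and attaches it at startup, roughly like this (paths are illustrative):

wget -O /datadog/dd-java-agent.jar https://dtdg.co/latest-java-tracer
java -javaagent:/datadog/dd-java-agent.jar -jar app.jar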

Error message:

[thread 52 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# JRE version: OpenJDK Runtime Environment (23.0+26) (build 23-ea+26-2269)
# Java VM: OpenJDK 64-Bit Server VM (23-ea+26-2269, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, serial gc, linux-amd64)
#
# Core dump will be written. Default location: /app/core.7
#
[65.819s][warning][jfr] Unable to create an emergency dump file at the location set by dumppath=/app
# The JFR repository may contain useful JFR files. Location: /tmp/2024_06_14_02_02_54_7
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid7.log
[65.908s][warning][os ] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Aborted

@richardstartin
Member

@foameraserblue there's nothing to suggest this is the same issue or that it is related to the profiler. If you have the backtrace, we would be able to determine whether it's related or not. Since you're running an early-access JDK 23 build, it could simply be a JVM problem.

@zBart
Author

zBart commented Jun 18, 2024

We have identified the cause of this bug and are working on releasing a fix

Hey Richard. Great! Do you still need me to send the hs_err file?

@credpath-seek

credpath-seek commented Jun 20, 2024

We had a similar issue recently that affected some of our services running Java 21. The same services had been running for months on Java 21, and for years on Java 17 before that, when they suddenly started crashing repeatedly during startup. We have staging versions of these services which were also crashing, but less consistently.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa336da70, pid=1, tid=52
#
# JRE version: OpenJDK Runtime Environment (21.0.3+10) (build 21.0.3+10-LTS)
# Java VM: OpenJDK 64-Bit Server VM (21.0.3+10-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [libpthread.so.0+0xba70]	
#	
# Core dump will be written. Default location: /workdir/core.1
#
# JFR recording file will be written. Location: /workdir/hs_err_pid1.jfr
#
# An error report file with more information is saved as:
# /workdir/hs_err_pid1.log	
[17.998s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bell-sw.com/support
#
[error occurred during error reporting (), id 0x5, SIGTRAP (0x5) at pc=0x0000ffffa31fc9c0]

We could not pinpoint a cause. We tried downgrading to Java 17 and had the same issue. Since our services run in AWS Fargate, we were not able to access the error log files.

Luckily we found these past issues which suggested dd-trace-java as a possible cause:
#677
#2997
#4978
#5077
#5449
#6382
#6899
#6948

particularly this comment from @richardstartin:

For transparency's sake, since GA, this is now the second time we have had crashes reported related to the serialisation of JFR events in our native profiler after adding new event types, and preventing it from crashing the profiled process again is going to be our top priority in the short term. We have a change in the pipeline to change the behaviour when there is a buffer overflow, which would result in a truncated recording (which we have metrics for so can react to) rather than risk crashing the process or writing to arbitrary memory locations. In the longer term, we will completely rewrite the event serialisation to prioritise safety.

Our build pulls the latest dd-trace-java into our images at build time, so it seemed like this could be the cause, since no other dependencies had been updated. It would have been running 1.34.0 at the time of the crashes.

We then noticed that services which did not have profiling enabled were not crashing. Also, our staging services receive only 1% of the traffic of our prod services, and DD_TRACE_SAMPLE_RATE was 0.001, which provided some explanation as to why the staging services were sometimes able to succeed (by chance, traffic didn't hit the profiler).

Setting DD_PROFILING_ENABLED=false resolved the issue immediately.

Going forward, our workaround is to not use the profiling feature.
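
For anyone needing the same stop-gap, the change amounts to environment variables on the service (a minimal sketch of our setup; DD_TRACE_SAMPLE_RATE is shown only for context and is unrelated to the workaround):

DD_PROFILING_ENABLED=false   # turns the profiler off entirely
DD_TRACE_SAMPLE_RATE=0.001   # our existing trace sampling setting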

We have identified the cause of this bug and are working on releasing a fix

This is great to hear; we would appreciate any updates that would give us confidence to re-enable the feature.

@richardstartin
Member

Hi @credpath-seek, this is unexpected. We test the profiler in lots of environments, and it runs in many other environments without these issues, so there is probably an incompatibility with something in your environment which we'd like to get to the bottom of to avoid this happening again. If we can get the backtrace and the siginfo sections from the hs_err files, we would be able to pinpoint the cause. Just based on the error message, the cause is different from the crash reported above (which is within AsyncGetCallTrace). If you do have this information available, please either reply here or open a support ticket so we can get the underlying issue fixed (and tested for).

You could always re-enable profiling but set DD_PROFILING_DDPROF_ENABLED: false, which would make the profiler fall back to JFR. The profiles will be less detailed, but both crashes reported in this issue (one in AsyncGetCallTrace, as well as yours) would have been avoided.
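
Concretely, that combination looks something like this (paths are illustrative, and the -Ddd.* system properties are the usual equivalents of the environment variables, so check the property names against the docs for your tracer version):

DD_PROFILING_ENABLED=true          # keep profiling on
DD_PROFILING_DDPROF_ENABLED=false  # fall back to the JFR-based profiler

or, as JVM arguments:

java -javaagent:/path/to/dd-java-agent.jar -Ddd.profiling.enabled=true -Ddd.profiling.ddprof.enabled=false -jar app.jar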

@r1viollet

@zBart hs_err files are welcome. They help us identify how we can further protect the usage of AsyncGetCallTrace. Reproducers are even better, though I can see that they are hard to come by.

Just to give more context on the profiling direction and what @richardstartin mentioned: we continue to maintain two flavours of profiling.

  • One based on this library (DD_PROFILING_DDPROF_ENABLED: true)
  • One based on JFR events (DD_PROFILING_DDPROF_ENABLED: false)

So by switching to the JFR events you will have a stable profiling experience. If you feel you are missing some features, we would be happy to get that feedback.
There are also longer-term initiatives ongoing to improve the stability of observing Java.

@zBart
Author

zBart commented Jun 24, 2024

Hi @zBart did you change anything else when you did the upgrade? Could you open a support ticket so I can get access to the hs_err file please?

@zBart hs_err files are welcome. They help us identify how we can further protect the usage of AsyncGetCallTrace. Reproducers are even better, though I can see how that is hard to come by.

Done, ticket ID is: #1749443

Any idea for an ETA for a fix?

@jbachorik
Contributor

Hi @zBart - the fix is in DataDog/java-profiler#107 which will be coming to dd-trace-java in the upcoming release (beginning of July 2024)

@oriy

oriy commented Jun 30, 2024

Is this fixed by #7229?

@jbachorik
Contributor

@oriy If you mean the one reported by @zBart, then yes, the fix is there and available in dd-trace-java 1.36.0.

@oriy

oriy commented Jul 1, 2024

@oriy If you mean the one reported by @zBart, then yes, the fix is there and available in dd-trace-java 1.36.0.

I meant to ask whether #7144 is fixed by #7229.
We have tried deploying 1.36.0 (which includes profiler 1.9) and still get a SIGSEGV crash.

Running the latest Temurin 11 with the G1 GC on linux-aarch64.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff843c8c98, pid=1, tid=117
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.23+9 (11.0.23+9) (build 11.0.23+9)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.23+9 (11.0.23+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x608c98]  frame::safe_for_sender(JavaThread*)+0x278
#
# Core dump will be written. Default location: /opt/app/core.1
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
#

@jbachorik
Contributor

@oriy This is a different failure mode, unfortunately :(
Would you be able to file a support ticket and include the whole hs_err file there?

@ivan-sukhomlyn

Got the same errors on the latest AWS Graviton CPUs (7g instances) with the Datadog profiler enabled.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff96d828a8, pid=1, tid=298
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.10.0.7-1) (17.0.10+7) (build 17.0.10+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.10.0.7-1) (17.0.10+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, parallel gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x64f8a8]  Dictionary::find_class(unsigned int, Symbol*)+0x0
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffffa372e798, pid=1, tid=251
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.10.0.7-1) (17.0.10+7) (build 17.0.10+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.10.0.7-1) (17.0.10+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, parallel gc, linux-aarch64)
# Problematic frame:
# C  [libc.so.6+0x28798]  __GI___memcpy_simd+0xd8

And there are no errors after disabling the DD profiler by setting DD_PROFILING_DDPROF_ENABLED=false.

@jbachorik
Contributor

@ivan-sukhomlyn Hi, thanks for the report. This is a different failure mode: V [libjvm.so+0x64f8a8] Dictionary::find_class(unsigned int, Symbol*)+0x0 vs. V [libjvm.so+0x5b8d61] CodeCache::find_blob(void*)+0xc1.

Would you be able to open a support ticket and submit the full hs_err.log file there so we can properly analyze the full crash?
Thanks!

@oriy

oriy commented Jul 3, 2024

@oriy This is a different failure mode, unfortunately :( Would you be able to file a support ticket and include the whole hs_err file there?

Sure @jbachorik, submitted ticket 1760710
Thanks 🙏

@ivan-sukhomlyn

ivan-sukhomlyn commented Jul 5, 2024

Hi @jbachorik
Thank you for the suggestion.

This is an answer from the support team regarding the requests mentioned in this issue.

Thanks to this, our team took a look at this and it seems to be unrelated to the tracer and the profiler.
In error-log-4 we see this is related to class transformation, and not profiling. Specifically the failure is in JVMTI ClassFileLoadHook internals and this is outside of anything the tracer or profiler can do.
Considering that this issue started happening after upgrading the Graviton processors to v3, it is very likely that this is a bug in the JVMTI implementation manifesting on Graviton 3. The reason why the issue is resolved by setting DD_PROFILING_DDPROF_ENABLED=false is that the ClassFileLoadHook is no longer set up, so the buggy JVMTI code is not executed.

The recommended step here is to open a support ticket with the JDK vendor and let us know how it goes. In the ticket you can add that the Datadog Engineering team looked at this and it might be related to https://bugs.openjdk.org/browse/JDK-8307315, but that's not certain. Adding this in case it might help find the root cause.
Let me know if you have any questions on this.
