DDR should detect truncated core files #8983

Closed
pshipton opened this issue Mar 25, 2020 · 21 comments

@pshipton
Member

pshipton commented Mar 25, 2020

extended.system testing has been failing on AIX jdk11+ for a while due to OOM problems related to setting MALLOCOPTIONS. I tried to look at some of the core files from xlinux, but all the ones I've tried show "No JRE". I also recall @dmitripivkine got a core for a crash on AIX, and had a similar problem. I couldn't find the issue, but maybe Dmitri can track it down.

At the time Dmitri found the core I tried a simple test, but the core file produced was readable across platforms.

I'm wondering if there is something going wrong when trying to locate the JRE in the core. Perhaps not all the memory segments are being found properly. Seems to me there was a problem in this area fairly recently.

The allocateRASStruct() code first tries to place the RAS struct at 0x30000000. The lowest address ranges in the core are listed below. This must be incorrect, because there has to be memory below 4G to support compressed refs.

0x0000010000000000  0x0000010000017acb  0x0000000000017acc  (96,972)
0x0000010010000000  0x000001008fffa43f  0x000000007fffa440  (2,147,460,160)
0x08001000a0000450  0x08001000a01de817  0x00000000001de3c8  (1,958,856)
0x0900000000000500  0x090000000044acfe  0x000000000044a7ff  (4,499,455)

Looking at a core created by a simple command (java -Xdump:system), I see low memory

0x0000000030000000  0x000000003fffffff  0x0000000010000000  (268,435,456)
0x0000000080000000  0x00000000ffffffff  0x0000000080000000  (2,147,483,648)
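
The "is there any memory below 4G" comparison between the two cores can be sketched as follows. This is only an illustration: the class and method names are hypothetical and not part of DDR or DTFJ, and the address ranges are copied from the two listings in this comment.

```java
/** Sketch: does a core's memory map contain any range that starts below 4 GiB? */
final class LowMemoryCheck {
    static final long FOUR_GIB = 0x1_0000_0000L;

    /** Each row is {start, end}; addresses are compared as unsigned 64-bit values. */
    static boolean hasMemoryBelow4G(long[][] ranges) {
        for (long[] range : ranges) {
            if (Long.compareUnsigned(range[0], FOUR_GIB) < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lowest ranges reported for the problematic core.
        long[][] problematic = {
            {0x0000010000000000L, 0x0000010000017acbL},
            {0x0000010010000000L, 0x000001008fffa43fL},
            {0x08001000a0000450L, 0x08001000a01de817L},
            {0x0900000000000500L, 0x090000000044acfeL},
        };
        // Lowest ranges reported for the core from `java -Xdump:system`.
        long[][] simple = {
            {0x0000000030000000L, 0x000000003fffffffL},
            {0x0000000080000000L, 0x00000000ffffffffL},
        };
        System.out.println("problematic core has low memory: " + hasMemoryBelow4G(problematic));
        System.out.println("simple core has low memory: " + hasMemoryBelow4G(simple));
    }
}
```

With the ranges above, the problematic core reports no low memory while the simple core does, which matches the observation that the segments expected at 0x30000000 and 0x80000000 are missing.
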
@pshipton
Member Author

pshipton commented Mar 26, 2020

dbx seems to agree there isn't any memory at 0x30000000 in the problematic core. I didn't find any way to get a memory map from dbx, although map shows the modules' text and data segments.

@keithc-ca
Contributor

@pshipton Can you please check if #9026 fixed this?

@gacholio
Contributor

gacholio commented Apr 1, 2020

For the record, I'm not expecting my change to have any effect on non-mixed builds.

@pshipton
Member Author

pshipton commented Apr 2, 2020

@keithc-ca I got a core from last night's build and it's working. I don't know if this is because it's fixed, or this build just didn't experience the problem. Given that a couple of previous builds of this nature I checked didn't work, it looks promising. For the record can you please point out which part of the change may have fixed it.

@keithc-ca
Contributor

Prior to #9026, allocateRASStruct() (called by J9RASInitialize()) would proceed as if -Xnocompressedrefs was in effect, but J9RelocateRASData() would see that -Xcompressedrefs was in effect. J9RelocateRASData() would allocate a new block and copy the (uninitialized) _j9ras_ global into it. The VM points at this new block. The originally allocated block (with the eye-catcher) points at the VM, but the VM does not point back, so the eye-catcher is treated as a false positive. DTFJ continues searching, but there is no qualifying RAS structure to be found.
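
To make the mutual-pointer check concrete, here is a minimal model of it. Everything in it is hypothetical (the `Vm` and `Ras` classes, the field names, the `isValidCandidate` method); it is not the DTFJ code. It also assumes the copied, uninitialized block carries no eye-catcher, modelled here as an empty string.

```java
/** Sketch of the eye-catcher plus back-pointer validation; all types are hypothetical. */
final class RasBackPointerCheck {
    static final String EYE_CATCHER = "J9VMRAS";

    static final class Vm {
        Ras ras; // the RAS block the VM points at
    }

    static final class Ras {
        final String eyeCatcher;
        final Vm vm; // the VM this RAS block points at (may be null)

        Ras(String eyeCatcher, Vm vm) {
            this.eyeCatcher = eyeCatcher;
            this.vm = vm;
        }
    }

    /** A candidate qualifies only if it has the eye-catcher and the VM points back at it. */
    static boolean isValidCandidate(Ras candidate) {
        return EYE_CATCHER.equals(candidate.eyeCatcher)
                && candidate.vm != null
                && candidate.vm.ras == candidate;
    }

    public static void main(String[] args) {
        Vm vm = new Vm();
        Ras original = new Ras(EYE_CATCHER, vm); // block with the eye-catcher
        Ras copy = new Ras("", null);            // copy of the uninitialized global (assumed empty)
        vm.ras = copy;                           // the VM points at the new block

        System.out.println("original qualifies: " + isValidCandidate(original));
        System.out.println("copy qualifies: " + isValidCandidate(copy));
    }
}
```

In this model neither block qualifies: the original has the eye-catcher but no back pointer from the VM, and the copy has the back pointer but no eye-catcher.
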

@pshipton
Member Author

pshipton commented Apr 3, 2020

@keithc-ca @gacholio was there a prior change that broke it? We need to check if any release branches need to be fixed.

@gacholio
Contributor

gacholio commented Apr 3, 2020

Yes, my previous change to the RAS init is what broke this.

@gacholio
Contributor

gacholio commented Apr 3, 2020

#5783 is the culprit.

@pshipton
Member Author

pshipton commented Apr 3, 2020

That was delivered in May 2019. We should put a fix for this problem into the 0.20.0 release so we can service the release.

@DanHeidinga fyi

@gacholio
Contributor

gacholio commented Apr 3, 2020

So, the required backports are #9026 and #9111

gacholio assigned pshipton and unassigned gacholio on Apr 3, 2020
@gacholio
Contributor

gacholio commented Apr 3, 2020

@pshipton has graciously volunteered to do the backporting.

@gacholio
Contributor

gacholio commented Apr 3, 2020

Backporting the rasdump.c change from #9026 is probably the safest course - all of the other changes should have no effect on single-mode builds.

pshipton pushed a commit to pshipton/openj9 that referenced this issue Apr 3, 2020
Originally broken by eclipse-openj9#5783, fixed by eclipse-openj9#9026. Port the rasdump.c changes
from eclipse-openj9#9026 to the 0.20.0 release branch to resolve eclipse-openj9#8983, AIX core files
which cannot be read by DDR.

[ci skip]

Signed-off-by: Peter Shipton <Peter_Shipton@ca.ibm.com>
@pshipton
Member Author

pshipton commented Apr 3, 2020

Created #9117 for the backport.

@pshipton
Member Author

pshipton commented Apr 4, 2020

#9117 didn't seem to fix anything. A core from the JVM containing #9117 is still unreadable (No JRE).
info mmap shows there is no memory below 4GB, which doesn't match the javacore. The heap should be at 0x80000000, and the RAS struct at 0x30000000.

@keithc-ca
Contributor

I looked at core.20200403.164329.8323202.0001.dmp from https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk11_j9_extended.system_ppc64_aix_Personal/5/system_test_output.tar.gz.

The J9VMRAS eye-catcher is at file offset 0x8bb993ba, but the file header indicates the core file is truncated:

00000000  00 f7 00 00 0f ee dd b2  00 00 00 00 00 00 18 70  |...............p|
-------------^^

According to [1], the second byte is c_flag, the high-bit of which is CORE_TRUNC indicating truncation. AIXDumpReader checks that bit but doesn't share its findings with anyone.

[1] https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/filesreference/core.html
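
A check of that flag could look like the sketch below. It is not the AIXDumpReader code: the class and method names are hypothetical, and the value 0x80 for CORE_TRUNC is an assumption derived from the "high bit of the second byte" description above.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: test the truncation flag in the header of an AIX core file. */
final class AixCoreTruncationCheck {
    // Assumption: CORE_TRUNC is the high bit of c_flag, the second header byte.
    static final int CORE_TRUNC = 0x80;

    /** Checks the flag given at least the first two bytes of the file. */
    static boolean isTruncated(byte[] header) {
        if (header.length < 2) {
            throw new IllegalArgumentException("need at least 2 header bytes");
        }
        int cFlag = header[1] & 0xff;
        return (cFlag & CORE_TRUNC) != 0;
    }

    /** Reads the first two bytes of the file and checks the flag. */
    static boolean isTruncated(Path coreFile) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(coreFile))) {
            byte[] header = new byte[2];
            in.readFully(header);
            return isTruncated(header);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println("truncated: " + isTruncated(Paths.get(args[0])));
        } else {
            // First two bytes of the header shown in the hex dump above.
            byte[] header = {0x00, (byte) 0xf7};
            System.out.println("truncated: " + isTruncated(header));
        }
    }
}
```

For the header bytes 00 f7 from the dump above, the flag byte 0xf7 has its high bit set, so the sketch reports the core as truncated.
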

@pshipton
Member Author

pshipton commented Apr 7, 2020

Obviously it would be useful to see a message that the core is truncated.

Opening the core in dbx, I see a message about truncation. I thought I had done the same with a previous core that DDR could not read (that core is now gone), but I didn't see any truncation message then.

I'll keep this open for adding a truncation message, but move it to the next milestone.

@keithc-ca
Contributor

jextract does comment on truncated core files; jdmpview should do likewise.

$ jextract aix.core.dmp 
Loading dump file...
Warning: dump file is truncated. Extracted information may be incomplete.

@keithc-ca
Contributor

With #9199, jdmpview will also generate a warning about truncated core files.

@DanHeidinga
Member

This is still open until truncation messages are added for other platforms.

pshipton changed the title from "DDR failure to read AIX core files" to "DDR should detect truncated core files" on Feb 10, 2022
@pshipton
Member Author

@keith am I correct that DDR detects truncated core files on AIX and Linux (ELF)? If so, I'd be inclined to close this until somebody complains about a problem on another platform.

@keithc-ca
Contributor

Yes, DDR detects truncated core files on AIX and Linux.

pshipton removed this from the Backlog milestone on Feb 10, 2022