Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fatal error on running "man" remotely #356

Closed
lusitania opened this issue Sep 4, 2015 · 65 comments
Closed

fatal error on running "man" remotely #356

lusitania opened this issue Sep 4, 2015 · 65 comments

Comments

@lusitania
Copy link

Hi

I'm getting a fatal when calling man on a remote Ubuntu Trusty:

~$ man bash
      0 [main] ssh 9412 C:\Program Files\Git\usr\bin\ssh.exe: *** fatal error - cmalloc would have returned NULL
angup

However, its fine to do the same on an Ubuntu Precise running in a Vagrant box.

Host: Windows 10 x64
Git4Win 2.5.0

@lusitania lusitania changed the title fatal error an remote start of "man" fatal error on running "man" remotely Sep 4, 2015
@dscho
Copy link
Member

dscho commented Sep 6, 2015

Render me absolutely confused. Ubuntu? What does that have to do with Git for Windows? Please clarify. In fact, you really want to volunteer information that allows others to understand the problem. If you expect others to ask for clarification, you will find that most people just don't.

@lusitania
Copy link
Author

It is the ssh.exe bundled with Git4Win that raises the error. I connected to a Linux/Ubuntu machine using ssh.exe from within a Git Bash shell (Mingw). I assume the issues does not originate from Git itself rather than MinGW under particular* circumstances. If that is the case you may need to update your dependencies... hopefully the MinGW team is aware of this and you'd just pull in the proper fix.

Best,
L

*) After I reduced my screen buffer from 10k down to 1k lines the error wasn't raised any more.

@dscho
Copy link
Member

dscho commented Sep 7, 2015

Unfortunately, this is wrong on at least three accounts:

  • there is still no MCVE, despite the fact that it should be possible for you to install Vagrant and test whether it already fails with that. I do expect a little assistance in resolving the bug, after all
  • ssh is not a MinGW program. It is an MSys2 program. This may be important, so it is really necessary to be very precise in communicating.
  • "hopefully the MinGW team is aware" just cries out loud for some proper investigation rather than hoping.

@lusitania
Copy link
Author

Unfortunately I fail to understand what you are trying to tell me. MCVE is an abbreviation unknown to me. And whether ssh is a part of MinGW or MSys2 is not transparent to me. All I am able to see is MinGW's prompt. So there is no way for me to be any more precise than done already. You are the one who has the answers. I can only guess. In the end it is your package that breaks, and again, I can only tell you.

Best

@dscho
Copy link
Member

dscho commented Sep 8, 2015

An MCVE is a Minimal, Complete and Verifiable Example. You cannot seriously expect a maintainer of a highly popular software to spend three or more days to chase a bug for you. You really want to make it as easy as possible for others to reproduce your problem. You are asking other to spend time for you, after all. It should be self-evident that you are more likely helped if you respect the time and expertise of those you ask for help.

Now, I gave you a very good hint how to make such an MCVE: by using Vagrant, you essentially install an Ubuntu Virtual Machine into which you can ssh. That is not exactly the intention of Vagrant, but it just happens to be the fastest way for you to set up a reproducible environment.

And just for fun, I did ssh into an Ubuntu Trusty machine I have access to (which you should not have assumed) and I did call man bash and (probably surprising to you, but not me) it worked without a hitch. No problem. Certainly no crash.

Of course, my account would be incomplete without the following information (which BTW should be provided with every bug report, as per our guidelines linked from our Contribute section on our home page):

Git Bash 64-bit on Windows Server 2008 R2, connecting to an Ubuntu Trusty 64-bit machine. Works.

@kumarharsh
Copy link

@dscho seems like I'm getting the same errors while using newer builds of Git for Windows which has the bundled ssh:

  1 [main] ssh 5852 C:\bin\Git\usr\bin\ssh.exe: *** fatal error - cmalloc would have returned NULL

1194 [main] ssh 5852 cygwin_exception::open_stackdumpfile: Dumping stack trace to ssh.exe.stackdump

The stackdump mentioned in the dump just has the same text.

This happens usually when I have ssh'd into a server and try to cat or less a large file (not always, but rather frequently).

Try downloading this file and less-ing it: https://1drv.ms/u/s!ArGqE3U0XhYmu5wtGK-qFiiPvUvfdQ

Setup:

ConEmu 180114 Alpha - 64-bit
> git --version
git version 2.16.1.windows.4
> ssh -V
OpenSSH_7.6p1, OpenSSL 1.0.2n  7 Dec 2017

@dscho
Copy link
Member

dscho commented Feb 22, 2018

@kumarharsh is ConEmu important here? Or can you reproduce with Git Bash or Git CMD? I ask because I do not want to spend the time setting up ConEmu (which I personally do not prefer) just to reproduce this issue (and I do ssh into a Linux VM all the time, without any of the indicated problems, although I did hit the error message recently when calling vim.exe but the problem went away on its own accord.).

@kumarharsh
Copy link

kumarharsh commented Feb 22, 2018

@dscho - yes, it happens with vim too from time to time. As for you, it also goes away for me after some time. I'll try to reproduce it with git bash.


Maddeningly enough, the less is now working when I've put the file in the onedrive folder.

@dscho
Copy link
Member

dscho commented Feb 22, 2018

Maddeningly enough, the less is now working when I've put the file in the onedrive folder.

Which file?

@kumarharsh
Copy link

The same one I shared with you above (onedrive link). I moved the file from my Downloads folder to the Onedrive folder for uploading it. Now, the less command is working on it. Weird thing is that:

  1. less on the same file was crashing when I had ssh'd into a linux machine and was running less on the same file. This was causing ssh.exe to also crash, producing the stackdump I posted above.
  2. My local machine's less.exe from the Git for Windows installation also was crashing on the same file, with a similar stackdump.
  3. I moved the file from it's then location to my Onedrive folder.
  4. After about 4-5 hours (since I posted this comment), when I tried to less the same file again, it was magically working.
  5. It's also started working magically if I ssh into the linux machine (described in 1). I'm doing all this from the same directory as in point 1, and no environment variable as changed.

@dscho
Copy link
Member

dscho commented Feb 22, 2018

The same one I shared with you above (onedrive link).

Ah. For the record, I tried to reproduce the issue with that file, and failed. I used a Hyper-V Ubuntu for my test, but it should not matter, should it.

Now, I investigated a little where this issue comes from. Turns out that it is not altogether clear how this issue comes about, it could be a lot of things.

First things first: this is where the error message is written: https://github.com/git-for-windows/msys2-runtime/blob/27e5516d7e7bb17c5c9c69462e1b5d5cebd4d2a0/winsup/cygwin/cygheap.cc#L407

Obviously, the call stack is very important. As the error message says cmalloc, this is most likely the caller: https://github.com/git-for-windows/msys2-runtime/blob/27e5516d7e7bb17c5c9c69462e1b5d5cebd4d2a0/winsup/cygwin/cygheap.cc#L425

And it triggers only when this function returns NULL (and there are a couple of candidate sites that do that): https://github.com/git-for-windows/msys2-runtime/blob/27e5516d7e7bb17c5c9c69462e1b5d5cebd4d2a0/winsup/cygwin/cygheap.cc#L326-L361

Now, the only caller of cmalloc() that would result in such an error message is this one (which is the only one passing in an fn parameter that is not NULL): https://github.com/git-for-windows/msys2-runtime/blob/27e5516d7e7bb17c5c9c69462e1b5d5cebd4d2a0/winsup/cygwin/cygheap.cc#L434-L438

Sadly, there are many, many, many callers of cmalloc_abort(): https://github.com/Alexpux/Cygwin/search?q=cmalloc_abort&type=Code&utf8=%E2%9C%93

It is not altogether clear that it is important which is the exact code path leading to this problem, either. It might be more important to know the size of the block that needs to be allocated.

Another thing that may play a role is this sad issue with the DLL base address. See https://github.com/git-for-windows/git/wiki/32-bit-issues for details. As suggested in a post to the Cygwin mailing list (https://cygwin.com/ml/cygwin/2013-08/msg00514.html), there may be another .dll involved that occupies the favored DLL base address of the MSYS2 runtime (obviously, Cygwin talks about cygwin1.dll, but we use MSYS2, a derivative of Cygwin, that renamed it to msys-2.0.dll to avoid clashes). The post suggests to use listdlls.exe to figure out where msys-2.0.dll is loaded. According to rebase -i /usr/bin/msys-2.0.dll, it is supposed to be loaded to base 0x000180040000 size 0x005d1000, but according to ListDLLs, it gets loaded to 0x0000000080040000 0x5d1000 instead (but I could imagine that ListDLLs somehow casts the 64-bit value to 32-bit first, as all the addresses start with eight zeroes). Maybe in a case where you can reproduce this, your msys-2.0.dll does not get loaded to its preferred base address?

@dakotahawkins
Copy link

@dscho This has been happening to me a lot with less and vim. It would probably happen with other things, it's just that's where I always see it.

With respect to dlls preferred addresses, I thought that basically wasn't a thing anymore since Windows Vista and address space layout randomization (ASLR), or is ASLR probably disabled for cygwin/msys?

This has been a relatively recent problem, but it takes a while to happen and so I can't reproduce it on-demand AFAIK.

It would be nice to find the right people to get together to figure this out, it's a pretty disruptive annoyance when it happens!

Do you think this is really some generic msys2 or cygwin bug?

@kumarharsh
Copy link

Are there any steps you can recommend to take or record logs or stuff in case of future occurences. I'll keep them in mind for the next time it happens.

Also, its a little more heartening to see the bug affecting more than me 😋

@dscho
Copy link
Member

dscho commented Feb 22, 2018

With respect to dlls preferred addresses, I thought that basically wasn't a thing anymore since Windows Vista and address space layout randomization (ASLR), or is ASLR probably disabled for cygwin/msys?

It is not disabled for Cygwin/MSYS2 specifically. Instead, it is not enabled (because Cygwin's fork() emulation won't work with ASLR).

This has been a relatively recent problem, but it takes a while to happen and so I can't reproduce it on-demand AFAIK.

Yeah, I only encountered it after updating the MSYS2 runtime to v2.10.0, but this bug has been reported a lot earlier. So maybe something made it more likely. Maybe there has been a change in the base address? no, that has not changed. Has always been 0x61000000 for 32-bit and 0x180040000 for 64-bit...

@dakotahawkins
Copy link

@dscho Do you have any suggestions for trying to debug this? Because it always fails immediately in a forked process I can't think of a way to catch it in my debugger (tried an extension that's supposed to auto-attach to child processes but I don't think it works for however these are created).

Do you think using the SDK to build everything in debug would get me any closer?

I find that once it happens in a bash.exe, it will apparently always happen from that bash.exe (I've been killing them and making new ones usually, today I tried to see what I could do from a "bad" one for a while).

@dscho
Copy link
Member

dscho commented Feb 25, 2018

@dakotahawkins does the ListDLLs tool show a different base address than usual in that case?

@dakotahawkins
Copy link

dakotahawkins commented Feb 26, 2018

@dscho Even not in that case they're all loaded at 0x0000000080040000, and I see the same preferred address you do.

I can't find anything loaded at 0x0000000180040000, so it doesn't seem like a conflict.

Could whatever loads that dll be truncating that address to its low half?

Edit: ListDlls doesn't report anything with a base address starting with 0x0000000[1-9], so maybe it's just misreporting for 64-bit DLLs.

Edit2: Process Explorer reports the expected base address. I'll check with that the next time this happens. I kind-of expect it to still be loaded at the correct address though I don't think I'll be able to catch it if the spawned process that will fail tries to load it again at a different address or something like that.

@dscho
Copy link
Member

dscho commented Feb 26, 2018

Yes, my take on the upper 32-bits of ListDLLs is the same: it probably truncates inappropriately.

So the next thing is that we might really be out of heap space. That could happen if the memory fragmentation gets too large, if too many individual too-large blocks were allocated. I have no idea so far how to test this hypothesis, though...

@dakotahawkins
Copy link

dakotahawkins commented Feb 26, 2018

There's some back-and-forth in this email soup where they discuss raising the size of "cygheap" to 2 megs. They're also talking about upping /HKLM/Software/Cygwin/heap_chunk_in_mb and working around the problem, so it sounds like maybe cygheap tracks individual allocated heap chunks? Maybe either bigger chunks or a bigger cygheap could work around a problem if you're running out of the space you're using to reference allocated memory?

Even if either of those helped because it's actually running out of heap space, I think the actual root cause would have to be some kind of recently exacerbated memory leak/mismanagement.

@dscho
Copy link
Member

dscho commented Feb 27, 2018

Even if either of those helped because it's actually running out of heap space, I think the actual root cause would have to be some kind of recently exacerbated memory leak/mismanagement.

I would agree. But I have nothing to back up that hunch.

BTW the email thread you mentioned is from 2011... ;-)

@kumarharsh
Copy link

Can you set this issue as "open" again?

@dscho dscho reopened this Feb 27, 2018
@dscho
Copy link
Member

dscho commented Feb 27, 2018

@kumarharsh can you research how to debug these cygheap issues?

@dakotahawkins
Copy link

@dscho Is the SDK installer in a decent state / recommended? Looks like it hasn't changed in a while so I figured I'd ask :)

dakotahawkins added a commit to dakotahawkins/msys2-runtime that referenced this issue Jul 14, 2018
Modified .dll and .pdb are placed in windebug subfolder.

Fixes git-for-windows/git#356

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
dakotahawkins added a commit to dakotahawkins/MSYS2-packages that referenced this issue Jul 14, 2018
Modified .dll and .pdb are placed in windebug subfolder.

Fixes git-for-windows/git#356

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
@dakotahawkins
Copy link

dakotahawkins commented Jul 16, 2018

@dscho It's telling that dumpbin doesn't like msys-2.0.dll after cv2pdb messes with it:

C:\Program Files\Git\usr\bin>dumpbin /summary msys-2.0.dll
Microsoft (R) COFF/PE Dumper Version 14.14.26433.0
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file msys-2.0.dll

File Type: DLL
msys-2.0.dll : fatal error LNK1106: invalid file or disk full: cannot seek to 0x1A1800C

It doesn't have a problem when built with the old .dbg generation. It I had hoped to compare dumpbin outputs to see what cv2pdb might be doing wrong, but obviously I didn't get that far. There have been some recent fixes/improvements to cv2pdb I'm trying out, but no luck yet.

@dakotahawkins
Copy link

dakotahawkins commented Jul 17, 2018

objdump complains that the dll is truncated, and what I think might happen is that some of the cygheap section (or similar) gets cut off.

The 64-bit .dll should have an image base address of 0x180040000 (as discussed in previously). cv2pdb, however, calls exe.relocateDebugLineInfo(0x400000) where 0x400000 is an image base address (and the Windows default base address for .DLLs).

Further, relocateDebugLineInfo takes an unsigned int, so 0x180040000 won't fit in it. Even worse, there are lots of unsigned int variables used to calculate offsets that involve doing math with the image base address (in these cases the base address wasn't hard coded, so I assume it's correct) and I suspect a lot of these over/underflow and cause most of the problem.

I'm not sure how much work would be involved to fix all those types in cv2pdb, but another option might be to build with the windows-default base address and then set it after the fact. I'm not sure if there's a mingw equivalent to editbin which might let you do that.

@dakotahawkins
Copy link

Update: I've tried using objcopy to change various addresses, but it doesn't have any apparent effect on the resulting DLL generated by cv2pdb.

@dakotahawkins
Copy link

dakotahawkins commented Jul 17, 2018

Epilogue:

FYI @elieux @dscho

Even with my "fix" to restore msys-2.0.dll, this can still happen.

With scrollback lines set really high and running the python snippet from above to 15000 lines instead of 8000:

In mintty (so, the "default" way to run), nothing crashes. Of note:

  • msys2: best experience. exactly what you'd expect.
  • cygwin and git bash: no crash, but less doesn't come up until a key is pressed, and when you quit it the scrollback is truncated (in my case from what was 14999 lines to part of the way through line 14600-ish. Here's a screenshot.

Under ConEmu, all 3 crash. They'll do 8000 lines over and over if the scrollback max is 9999, but if you set something high for scrollback lines (ConEmu supports 32766, so I used that) everything has the same problem. Tasks are set up like so:

  • msys2: set MSYSTEM=MSYS & set MSYS=winsymlinks:nativestrict & set MSYS2_PATH_TYPE=inherit & set MSYSCON= & "%_MSYS%\usr\bin\bash.exe" --login -i
  • git bash: "%_GIT%\bin\sh.exe" --login -i
  • cygwin: set CHERE_INVOKING=1 & %ConEmuDrive%\cygwin64\bin\sh.exe --login -i

So, at least with my changes I can go back to 9999 and not have my terminal permanently die when I try to edit a commit message or rebase -i. That's a marked improvement, and since all 3 have the same problem with a higher ConEmu scrollback I don't think there could be many other important differences between the GfW/MSYS2 runtime builds. The lack of problems under mintty and the different behavior running python ... && less ... are weird, and possibly some other small difference, but they're not weird enough to make me want to investigate them right now. :)

Edit: @Maximus5 I assume the ConEmu behavior difference here can be explained by some way I've configured it... tagging you in case you happen to be interested and have any insight or have seen similar behavior before.

dakotahawkins added a commit to dakotahawkins/msys2-runtime that referenced this issue Jul 17, 2018
Modified .dll and .pdb are placed in windebug subfolder.

Fixes git-for-windows/git#356

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
dakotahawkins added a commit to dakotahawkins/MSYS2-packages that referenced this issue Jul 17, 2018
Modified .dll and .pdb are placed in windebug subfolder.

Fixes git-for-windows/git#356

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
@dscho
Copy link
Member

dscho commented Jul 18, 2018

@dakotahawkins thanks so much for your tenacity (and @elieux, too). I would never have dreamed that the .pdb generation might be responsible for that.

@dakotahawkins
Copy link

Frankly I'm amazed it works to the extent that it does. I'd be kind-of interested to find a tool that dumps whatever sections are in the .dll to confirm exactly what's missing/wrong with it.

dakotahawkins added a commit to dakotahawkins/msys2-runtime that referenced this issue Jul 22, 2018
Fixes git-for-windows/git#356

This reverts commit 27e5516, reversing
changes made to a1a4754.

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
dakotahawkins added a commit to dakotahawkins/MSYS2-packages that referenced this issue Jul 22, 2018
Fixes git-for-windows/git#356

This reverts commit c8e710e.

Signed-off-by: Dakota Hawkins <dakotahawkins@gmail.com>
@dscho dscho added this to the v2.18.0(2) milestone Aug 21, 2018
@lukeu
Copy link

lukeu commented Aug 26, 2018

Out of curiosity, is this fix included in the v2.19.0-rc0.windows.2 release made 2 days ago? I read:

So here is a test: is this robust enough for end users? Please test
thoroughly, in particular any rebase/stash scenarios you use reguarly.

Those are my favourite 2 commands! :-) I'd be happy to beta test this if it also saves me from having to restart my console 3-4 times a day due to the cmalloc error.

(FYI: I run cygwin.bat still (so cygwin's bash in the Windows terminal) to avoid various problems with git-for-windows, so I'm very excited to hear this might be solved! I'd always assumed it was some cygwin vs Mingw incompatibility & unlikely to ever be solved.)

@dscho
Copy link
Member

dscho commented Aug 27, 2018

Out of curiosity, is this fix included in the v2.19.0-rc0.windows.2 release made 2 days ago?

Yep.

I'd be happy to beta test this

Looking forward to it!

I run cygwin.bat still

Please be advised that Cygwin and Git for Windows are considered incompatible. That is, if you run Cygwin, you should use Cygwin's Git. It might work for you if you run Git for Windows from inside Cygwin's Bash, but sooner or later you will run into trouble, and all I can say to help you, then, is that you had been warned. :-)

@dscho
Copy link
Member

dscho commented Aug 27, 2018

@dakotahawkins thank you so much for your detailed analysis (and @elieux for joining in!). AFAICT cmalloc() is a really old malloc() implementation that suffers from quite a few limitations, and nobody fixed it because it did not cause problems so far. Of course, changing the base address as cv2pdb does leads to problems, so we did the best thing we could in a reasonable amount of time by simply not using cv2pdb on msys-2.0.dll.

@dakotahawkins
Copy link

@dscho Happy to help! I'm super-excited for the next release :)

@leonyu
Copy link

leonyu commented Aug 29, 2018

I downloaded the RC and tried it. It is a magnitude improvement, it no longer crashes at around buffer height of 2500 lines, which was very restrictive.

However, it still crashes at higher buffer height in ConEmu. For me that is around 13106 (happen to be max value / 2.5). This might be a ConEmu limitation?

I am still excited to have 10000 buffer height, than the previous!!!

@dakotahawkins
Copy link

@leonyu That's what I found.

#356 (comment)

Under ConEmu, all 3 crash. They'll do 8000 lines over and over if the scrollback max is 9999, but if you set something high for scrollback lines (ConEmu supports 32766, so I used that) everything has the same problem.

I think I also found it was somewhere around 13k where it started happening.

It might not be a ConEmu issue, it might just be ConEmu allows you to set more than msys can handle. @Maximus5 care to weigh in on this? Do you think it might be a ConEmu issue?

@Maximus5
Copy link

Conhost supports up to 32K lines (short is used in console WinAPI). Properly written console applications should not crash on such buffers.

@dakotahawkins
Copy link

That's kind-of what I suspected, thanks!

@lukeu
Copy link

lukeu commented Aug 30, 2018

I'd be happy to beta test this

Looking forward to it!

Well I've been heavily rebasing for 2 days with the RC and so far so good! 9000-lines of history. Not conclusive yet, but I'd be guessing something like "2 sigma" significance c.f. the crashes I saw before.

[Off-topic]

Please be advised that Cygwin and Git for Windows are considered incompatible.

Consider me well-advised ;-) I've been happily running this way 5-6 years or something.

The only thing I found that falls over really horribly is the cygwin's default terminal. I enjoy being able to update a few key tools (git, python) independently from cygwin - like this RC. (I've developed an unfortunate "if it ain't broke don't fix it" attitude with the rest of Cygwin so it might fall many years behind, but the rest of my tool-chain doesn't need bleeding edge updates anyway.)

So using the slow Windows terminal to side-step the issues is not a big deal for me, I realise this might not be workable for others... But then again it, maybe it could be soon though? (interesting development)

@kumarharsh
Copy link

Wow, thanks for the link @lukeu. It seems Windows might get PTY support in the next build (October?). This is great news.

@dakotahawkins
Copy link

https://github.com/git-for-windows/git/releases/tag/v2.19.0.windows.1

image

mjcheetham pushed a commit to mjcheetham/git that referenced this issue May 24, 2021
…te-and-cache-server

scalar: reorder `delete`, `help` and `version` to come before `cache-server`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests