Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate git daemon problems #304

Closed
dscho opened this issue Aug 22, 2015 · 37 comments
Closed

Investigate git daemon problems #304

dscho opened this issue Aug 22, 2015 · 37 comments

Comments

@dscho
Copy link
Member

dscho commented Aug 22, 2015

It had been reported in msysgit#70 (with a quite large patch that I did not manage to review due to its sheer size) that git daemon might close the socket before the client read everything, and as a consequence, the transaction fails.

I still need to be able to reproduce; if I manage that, I will try to come up with a substantially smaller patch myself.

@linquize
Copy link

Not easy to report a MCVE, because after a few times of retry, it succeeds.
But in some cases, it still fails with 100 consecutive times of retry.

@wq9
Copy link

wq9 commented Aug 23, 2015

I always get this problem.

Client (git version 2.5.0.windows.1):
image

Server (git version 2.4.6.windows.1):

C:\vpk>git daemon --verbose --base-path=C:\
[2652] Ready to rumble
[3296] Connection from <client>:59471
[3296] Extended attributes (32 bytes) exist <host=vm1-kykbkmi4.cloudapp.net>
[3296] Request upload-pack for '/vpk/repo'
[2652] [3296] Disconnected
[1776] Connection from <client>:59486
[1776] Extended attributes (32 bytes) exist <host=vm1-kykbkmi4.cloudapp.net>
[1776] Request upload-pack for '/vpk/repo'
[2652] [1776] Disconnected

@dscho
Copy link
Member Author

dscho commented Aug 24, 2015

@wq9 the trick, now, is to put in some wait cycles at the appropriate point to trigger this always (i.e. also here)... ;-)

@Radrik5
Copy link

Radrik5 commented Mar 2, 2016

I also get this problem when I try fetching from remote machine. I cannot reproduce it using localhost (it works fine) but for remote machine git fetch git:// always fails. With Git 1.9.5 it worked well in most cases (only sometimes it failed).

Remote machine: Windows 7 64bit, Git 2.7.0 64bit
Local machine: Windows 7 64bit, Git 2.7.2 64bit

Command on remote machine:

git daemon --base-path=. --export-all --reuseaddr --informative-errors --verbose

Command on local machine:

$ GIT_TRACE=1 git fetch git://host/ test:host
13:27:40.171077 git.c:348               trace: built-in: git 'fetch' 'git://host/' 'test:host'
13:27:40.366077 run-command.c:343       trace: run_command: 'rev-list' '--objects' '--stdin' '--not' '--all' '--quiet'
remote: Counting objects: 65, done.
remote: Compressing objects: 100% (65/65), done.
13:27:40.759077 run-command.c:343       trace: run_command: 'unpack-objects' '--pack_header=2,65'
13:27:40.782077 git.c:348               trace: built-in: git 'unpack-objects' '--pack_header=2,65'
fatal: read error: Invalid argument
fatal: early EOFs:  23% (15/65)
fatal: unpack-objects failed

I think you can try any repo on remote machine and git fetch should fail. The point here is not to use localhost network.

@linquize
Copy link

linquize commented Mar 2, 2016

Still a problem in 2.7

@guban
Copy link

guban commented Jan 21, 2017

I prepared the following MCVE:

My setup is

Windows 10 64-bit.

Git-2.11.0.3-64-bit.exe

$ git --version
git version 2.11.0.windows.3

The problem is reproducible when cloning from a local machine. To
reproduce the problem, first run the following script as a batch file so
as to generate a git repository and start git daemon:

set work_dir=%cd%
set repo_dir=%work_dir%\my-repo

:: Uncomment this for cleaning-up after previous run.
::rmdir /s /q %repo_dir%

mkdir %repo_dir%
cd %repo_dir%
git init
echo 1 > .git\git-daemon-export-ok

:: This populates a.txt with the help text for the "for" command.
help for > a.txt

:: Can't reproduce without generating quite a bit of history.
:: Tweak the last number in the triplet to vary the history size.
:: Beware of exponential growth. Value 12 results in a 64M repository.
for /l %%x in (1, 1, 12) do (
  type a.txt >> a.txt
  git add a.txt
  git commit -m "update %%x"
)
set GIT_TRACE=1
git daemon --verbose --reuseaddr --base-path=%work_dir:\=/% %work_dir:\=/%

Then, try cloning this repository:

set GIT_TRACE=1
git clone git://127.0.0.1/my-repo cloned

Most of the time (but not always), the above git clone command fails:

09:33:24.142211 git.c:371               trace: built-in: git 'clone' 'git://127.0.0.1/my-repo' 'cloned'
09:33:24.201829 run-command.c:350       trace: run_command: 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 8892 on oleg' '--check-self-contained-and-connected'
Cloning into 'cloned'...
09:33:24.210832 git.c:371               trace: built-in: git 'index-pack' '--stdin' '--fix-thin' '--keep=fetch-pack 8892 on oleg' '--check-self-contained-and-connected'
remote: 09:33:24.210832 git.c:371               trace: built-in: git 'pack-objects' '--revs' '--thin' '--stdout' '--delta-base-offset'        
fatal: early EOF
fatal: read error: Invalid argument
fatal: index-pack failed

The corresponding git daemon output is

[7296] Connection from 127.0.0.1:26663
[7296] unable to set SO_KEEPALIVE on socket: No such file or directory
[7296] Extended attributes (16 bytes) exist <host=127.0.0.1>
[7296] Request upload-pack for '/my-repo'
09:33:24.175817 run-command.c:350       trace: run_command: 'upload-pack' '--strict' '--timeout=0' '.'
[7296] 09:33:24.185821 git.c:607               trace: exec: 'git-upload-pack' '--strict' '--timeout=0' '.'
[7296] 09:33:24.185821 run-command.c:350       trace: run_command: 'git-upload-pack' '--strict' '--timeout=0' '.'
[7296] 09:33:24.199827 run-command.c:350       trace: run_command: 'pack-objects' '--revs' '--thin' '--stdout' '--delta-
base-offset'

Sometimes, git clone fails with another error:

fatal: early EOFs: 69% (53/76)
fatal: read error: Invalid argument
fatal: index-pack failed

I also observe the following surprising behavior. When I select any text
in the console window where git daemon is running, subsequent cloning
operations always succeed. Then, when I clear the selection with Escape
key, git clone begins to fail intermittently again. Apparently, the
text selection suppresses git daemon's console output, and that somehow
helps to avoid the condition that leads to git clone failure.

The problem is also reproducible in 32-bit version 2.11.0.windows.3.

@cezariuszmarek
Copy link

cezariuszmarek commented Mar 27, 2017

It's 100% reproducible on my 6 MB repository with 183 history entries. Can't you remove the mcve-required label having @guban script?

@guban
Copy link

guban commented Apr 1, 2017

If there is a reason for keeping this issue as "mcve-required", it would be nice to let me know what needs to be improved in my MCVE that was provided back in January.

@dscho
Copy link
Member Author

dscho commented Apr 4, 2017

I did not yet have time to come back to this issue. So I am reluctant to remove that tag, especially given that no contributor who would be able to fix the underlying problem gave the MCVE a whirl. Sorry, busy times!

@edropps
Copy link

edropps commented Feb 16, 2018

We have a team of ~ 10 people who run into this exact issue daily on windows Server 2016 on multiple Git versions through Git-2.16.1.3-64-bit I support removing the mcve-required tag.

@dscho
Copy link
Member Author

dscho commented Feb 19, 2018

@edropps are you saying that you are willing to contribute time and effort to see this problem fixed? If so, do the steps outlined in #304 (comment) reproduce the problem on your machine? Do you have a Git for Windows SDK set up already to investigate further?

@guban
Copy link

guban commented Feb 24, 2018

A dirty fix to the problem is to add sleep(1); right after
upload_pack(); at the bottom of upload-pack.c:

    case protocol_v0:
      upload_pack();
+     sleep(1);
      break;

Without this fix, I reproduce the problem about 75% of the time
following the steps from my MCVE. After applying the fix, I could not
reproduce the problem at all.

@guban
Copy link

guban commented Feb 25, 2018

I think toshiyuki-ogawa determined the root cause correctly:
upload-pack closes its TCP socket immediately after sending the
requested data. On Windows, that immediately destroys the OS-allocated
buffer even when the latter still contains the last chunk of the data to
be sent. We have a race condition between sending that last chunk and
destroying the OS socket object. This explains why the failures are
intermittent.

To eliminate the race condition, we need to wait in upload-pack until
the client receives all data and closes connection on its side. One way
to achieve waiting is by replacing sleep(1) in my dirty hack with

read(1, &buffer, 1);

where buffer is a dummy char buffer;. Yes, reading from stdout
seems odd, but stdout in our case is a socket, so an attempt to read
from it blocks until the client closes connection on its side. I
verified that by adding traces and an artificial delay in index-pack.
One major drawback of this solution is that read() blocks
indefinitely, so a malicious client could cause multiple upload-pack
processes waiting for sockets that never close on the client side.

Then, to introduce a timeout, I replaced read() with poll():

struct pollfd pfd;
pfd.fd = 1;
pfd.events = POLLIN;
poll(&pfd, 1, 10 * 1000);

This works just like read() (i.e., it fixes the bug), and also it
unblocks after the given timeout. I verified that it does unblock if the
poll() timeout is smaller than the artificial delay in index-pack. A
technical detail: since poll() uses winsock, we have to ensure it is
initialized. In my experiment, I just called gethostname(0, 0);,
relying on the fact that the latter performs winsock initialization.

I also tried an alternative solution that involves using the SO_LINGER
socket option. I set this option on a socket returned by accept() call
in git-daemon. The setsockopt() call succeeds, but the option has no
effect: I can still reproduce the bug with this option set. Of course,
it might not work because I did something wrong.

To summarize, I think that polling with timeout in upload-pack so as
to wait for the client's disconnect is a good solution: it is minimal,
and it eliminates the root cause.

@dscho
Copy link
Member Author

dscho commented Feb 27, 2018

@guban excellent work! Thank you very much.

To summarize, I think that polling with timeout in upload-pack so as
to wait for the client's disconnect is a good solution: it is minimal,
and it eliminates the root cause.

I agree. Would you mind opening a Pull Request? Most likely, this comment can be turned into a stellar commit message with minimal effort.

A technical detail: since poll() uses winsock, we have to ensure it is initialized. In my experiment, I just called gethostname(0, 0);, relying on the fact that the latter performs winsock initialization.

That is a good workaround. I think we can do even better, though: there is already an ensure_socket_initialization() function in compat/mingw.c. How about renaming this function to ensure_winsock_initialized(), making it non-static, declare it in compat/win32/winsock-helper.h and use it both in poll() in compat/poll/poll.c as well as in compat/mingw.c?

That way, all users of poll() benefit from the fix.

In short: it would be awesome if you could prepare a Pull Request with one patch that teaches poll() to ensure that winsock was initialized, and one patch that adds the polling to git-daemon.

@dscho dscho added bug and removed mcve-required labels Feb 27, 2018
guban added a commit to guban/git that referenced this issue Feb 28, 2018
…nnects

This fixes the issue identified in

	git-for-windows#304

On Windows, closing a socket destroys its OS-allocated buffer even when
the latter still contains the last chunk of the data to be sent. Hence,
closing the socket immediately after sending the pack results in a race
condition between sending that last chunk and destroying the OS socket
object.

When the OS socket object is destroyed first, git-daemon fails to
actually send the last chunk of the pack to the client, and the latter
fails with the `early EOF` / `index-pack failed`. That is, the above
race condition results in intermittent failures to clone a repository
hosted on a Windows machine by git-daemon.

To eliminate the race condition, `upload-pack` now sends all data and
then waits until the client closes connection from its side.
guban added a commit to guban/git that referenced this issue Feb 28, 2018
…nnects

This fixes the issue identified in

	git-for-windows#304

On Windows, closing a socket destroys its OS-allocated buffer even when
the latter still contains the last chunk of the data to be sent. Hence,
closing the socket immediately after sending the pack results in a race
condition between sending that last chunk and destroying the OS socket
object.

When the OS socket object is destroyed first, git-daemon fails to
actually send the last chunk of the pack to the client, and the latter
fails with the `early EOF` / `index-pack failed`. That is, the above
race condition results in intermittent failures to clone a repository
hosted on a Windows machine by git-daemon.

To eliminate the race condition, `upload-pack` now sends all data and
then waits until the client closes connection from its side.

Signed-off-by: Oleg Gubanov <oleg.gubanov@gmail.com>
@guban
Copy link

guban commented Feb 28, 2018

@dscho you are welcome!

@kgybels
Copy link

kgybels commented Feb 28, 2018

@guban I propose a different solution. Can you try kgybels/git@b7fb4db?

This way the logic stays isolated to git daemon, and upload-pack doesn't need to worry about its stdout being a network connection.

@guban
Copy link

guban commented Mar 1, 2018

@kgybels your solution works. I couldn't reproduce the bug after about 300 attempts. I also verified that I can still reproduce the bug after reverting your changes.

I agree that your solution is better. Unlike mine, it does not need to explain why polling for read from stdout would make sense :) Accepting your proposal is the way to go.

@dscho
Copy link
Member Author

dscho commented Mar 5, 2018

Excellent. Thanks, both!

@dscho
Copy link
Member Author

dscho commented Apr 4, 2018

@kgybels what is the state of your patch? I am eager to get this into Git for Windows...

@kgybels
Copy link

kgybels commented Apr 8, 2018

@dscho I have some time this Thursday to continue were I left off.

@kgybels
Copy link

kgybels commented Apr 12, 2018

I submitted my changes to the git mailing list:
[PATCH 0/2] Fix early EOF with GfW daemon

I tested on Windows 10 and Ubuntu 17.10.

kgybels added a commit to kgybels/git that referenced this issue Apr 19, 2018
On Windows, a connection is shutdown when the last open handle to it is
closed. When that last open handle is stdout of our child process, an
abortive shutdown is triggered when said process exits. Ensure a
graceful shutdown of the client connection by keeping an open handle
until we detect our child process has finished. This allows all the data
to be sent to the client, instead of being discarded.

Fixes git-for-windows#304

Signed-off-by: Kim Gybels <kgybels@infogroep.be>
@dscho
Copy link
Member Author

dscho commented May 31, 2018

Thank you. This stalled, though. You hinted at something you wanted to teach xpoll() (and implicitly that you agreed with @gitster that xpoll() makes sense)?

@InBetweenNames
Copy link

Any chance we could get the workaround applied to only Git for Windows and not upstream git in the interim? It would be really useful to have.

@dartero
Copy link

dartero commented Sep 19, 2018

Same situation here.
My organization has an internal screenless no-session-initiated Windows Server. I'm trying to define a Windows service (using daemon) to make it work without having to do the console workaround with no luck.

Will it be possible to provide "alpha-pre-release" git-daemon binary version mentioned in [PATCH 2/2] daemon: graceful shutdown of client connection with xpoll() implementation? Last message is from April but at v2.19.0 bug still exists.

Anyway, a hint to compile project would be helpful. Doc at INSTALL and Makefile refeer to *nix. I would want to try @guban code untill @kgybels is up to standards.
I'm sysop, not dev, so a bit lost at compile time :'(

@PhilipOakley
Copy link

Go to the Git-for-Windows home page, scroll down to Contribute, install the SDK (which will do the first full compile..) and the whole thing is right there for you to try.

Search all the issues for SDK, and you'll see a number of repeats of the (fairly easy, step by step) instructions[1] of how to build and test a new version. (so simple even an engineer like me can do it ;-) See also the wiki/FAQs.

HTH

[1] https://github.com/git-for-windows/git/wiki/Building-Git

@dartero
Copy link

dartero commented Sep 19, 2018

OK, thanks for the fast response and sorry if question seemed "dumb". Just didn't look at web, only the repositories. 🤷‍♂️

InBetweenNames pushed a commit to InBetweenNames/git that referenced this issue Sep 19, 2018
On Windows, a connection is shutdown when the last open handle to it is
closed. When that last open handle is stdout of our child process, an
abortive shutdown is triggered when said process exits. Ensure a
graceful shutdown of the client connection by keeping an open handle
until we detect our child process has finished. This allows all the data
to be sent to the client, instead of being discarded.

Fixes git-for-windows#304

Signed-off-by: Kim Gybels <kgybels@infogroep.be>
@InBetweenNames
Copy link

Done and done, thanks! I rebased @kgybels patches onto master and I was able to produce an installer using the process described. I have uploaded my installer at this link: https://github.com/InBetweenNames/git/releases/tag/v2.19.0.windows.1.4.g3828fefffa.shane

@dartero
Copy link

dartero commented Sep 19, 2018

Wow! fast replies!
@PhilipOakley, thanks for pointing SDK. I've tried it but, inside my organization, could not install it (weird https 443 port problem to github through git.exe).
@InBetweenNames, thanks for the compilation. Checked it in Virustotal. Results:
Report for InBetweenNames installer - One engine detected, probably because InnoSetup installer. as official download also fails at the same engine Official package report
Extracted files report does not fail at all.
@kgybels, thanks for the patch. I read it still has the termination issue, but it's fine to me as daemon only has to end when machine reboots.
@dscho, thanks for the hard work with this project. I'll stay tuned for final resolution of the bug.

InBetweenNames pushed a commit to InBetweenNames/git that referenced this issue Sep 23, 2018
On Windows, a connection is shutdown when the last open handle to it is
closed. When that last open handle is stdout of our child process, an
abortive shutdown is triggered when said process exits. Ensure a
graceful shutdown of the client connection by keeping an open handle
until we detect our child process has finished. This allows all the data
to be sent to the client, instead of being discarded.

Fixes git-for-windows#304

Signed-off-by: Kim Gybels <kgybels@infogroep.be>
@kgybels
Copy link

kgybels commented Sep 24, 2018

@dartero thanks for the patch. I read it still has the termination issue, but it's fine to me as daemon only has to end when machine reboots.

There is no issue that I know of with the patches for Windows builds. They, however, introduce a problem for linux builds, where connections would remain open, as explained on the git mailing list:

There is another issue with the existing code that this new "xpoll" will need
to take into account. If a SIGCHLD arrives between the call to
check_dead_children and poll, the poll will not be interupted by it, resulting
in the child not being reaped until another child terminates or a client
connects. Currently, the effect is just a zombie process for a longer time,
however, the proposed patch (daemon: graceful shutdown of client connection)
relies on the cleanup to close the client connection.

When I have time, I will reroll including a change to ppoll.

Maybe it is time for that reroll, anyone up for some pair programming? :)

@dartero
Copy link

dartero commented Sep 28, 2018

@kgybels There is no issue that I know of with the patches for Windows builds. They, however, introduce a problem for linux builds, where connections would remain open, as explained on the git mailing list:

@InBetweenNames Done and done, thanks! I rebased @kgybels patches onto master and I was able to produce an installer using the process described.

Ok, I misread it in mailing list.
Currently I'm using it at a preproduction deployment, running the daemon as a service, with no issues.
I've been testing thoroughly to find proper config, since @InBetweenNames release lacks some items (i.e.: git-lfs.exe) and seems to be builded with different compiling options.
I got the best results by overwritting:
git.exe
git-daemon.exe
in
C:\Program Files\Git\mingw64\libexec\git-core
from @InBetweenNames release, over the official release files.

Thanks again for your work.

@dscho
Copy link
Member Author

dscho commented Dec 26, 2018

@kgybels would you kindly give me a summary of the current state?

@kgybels
Copy link

kgybels commented Dec 27, 2018

@dscho would you kindly give me a summary of the current state?

No change since the patches on the git mailing list, to make the change correct for linux, the code needs to be changed to use ppoll, however, for windows this is problematic, since we don't have ppoll in compat.

@dscho
Copy link
Member Author

dscho commented Feb 26, 2019

for windows this is problematic, since we don't have ppoll in compat.

So how about introducing a HAVE_PPOLL constant and using that for conditional code blocks? Then we could have your original fix in case of !HAVE_PPOLL?

@kgybels
Copy link

kgybels commented Feb 26, 2019

@dscho So how about introducing a HAVE_PPOLL constant and using that for conditional code blocks? Then we could have your original fix in case of !HAVE_PPOLL?

Just to be clear, I'm not working on this, nor do I plan on doing so.

@dscho
Copy link
Member Author

dscho commented Oct 13, 2019

Okay, so I have to admit openly, to myself as well as to everybody else, that I won't find the time to work on this and drive this forward.

I'll leave it open as an "up for grabs", as @kgybels' patch provides a really good head start for anybody who is actually interested. (And by "actually interested" I don't mean "likes to talk about it" but "likes to do something about it".)

@dscho dscho removed this from the Post-release cleanup milestone Oct 18, 2019
@dscho
Copy link
Member Author

dscho commented Oct 18, 2019

I removed this from the milestone because, realistically, I won't have time to complete this, ever.

Having said that, I will be glad to help everybody who is interested in driving this to the finish line.

jeffhostetler pushed a commit to jeffhostetler/git that referenced this issue May 14, 2021
…b check

This check verifies that the ODB matches what we supplied, but there are
some subtleties around Windows path names that can cause inexact matches
to be logically the same. Since this check is really intended only for
debugging and development purposes, let's remove it for now as a quick
workaround.

This was originally part of git-for-windows#304, as I hit it more often in that effort. However,
this was hit during the Scalar release process, so I'll need to generate new
installers.
@dscho
Copy link
Member Author

dscho commented Mar 17, 2022

Let's just close this as stale.

@dscho dscho closed this as completed Mar 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests