Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTCC_SSH regression in v3.4 #434

Closed
KSmanis opened this issue Jul 25, 2021 · 9 comments · Fixed by #438
Closed

DISTCC_SSH regression in v3.4 #434

KSmanis opened this issue Jul 25, 2021 · 9 comments · Fixed by #438

Comments

@KSmanis
Copy link

KSmanis commented Jul 25, 2021

Version 3.4 seems to be affected by a regression regarding DISTCC_SSH, according to which the environment variable is only honored during the first connection attempt of distcc and ignored during any subsequent runs (assuming the distcc server is temporarily unavailable).

I accidentally stumbled upon this issue when my CI started failing once Gentoo stabilized the relevant version.

Relevant CI log (truncated):

distcc-ssh-client_1  | distcc[10] (dcc_trace_version) distcc 3.4 i686-pc-linux-gnu; built Jul  4 2021 03:18:57
distcc-ssh-client_1  | distcc[10] (dcc_recursion_safeguard) safeguard level=0
distcc-ssh-client_1  | distcc[10] (main) compiler name is "distcc"
distcc-ssh-client_1  | distcc[10] (dcc_scan_args) scanning arguments: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[10] (dcc_scan_args) found input file "test.c"
distcc-ssh-client_1  | distcc[10] (dcc_scan_args) found object/output file "test-distcc.o"
distcc-ssh-client_1  | distcc[10] compile from test.c to test-distcc.o
distcc-ssh-client_1  | distcc[10] (dcc_get_hostlist) read hosts from environment
distcc-ssh-client_1  | distcc[10] (dcc_parse_hosts) found ssh token "distcc-ssh@distcc-ssh-server"
distcc-ssh-client_1  | distcc[10] (dcc_lock_host) got cpu lock on distcc-ssh@distcc-ssh-server slot 0 as fd3
distcc-ssh-client_1  | distcc[10] (dcc_lock_host) got cpu lock on localhost slot 0 as fd4
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 3, file "(NULL)", host "localhost"
distcc-ssh-client_1  | distcc[10] (dcc_strip_dasho) result: gcc -c test.c
distcc-ssh-client_1  | distcc[10] (dcc_spawn_child) forking to execute: gcc -E test.c
distcc-ssh-client_1  | distcc[10] (dcc_spawn_child) child started as pid11
distcc-ssh-client_1  | distcc[10] (dcc_strip_local_args) result: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[10] exec on distcc-ssh@distcc-ssh-server: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 2, file "test.c", host "distcc-ssh-server"
distcc-ssh-client_1  | distcc[11] (dcc_new_pgrp) entered process group
distcc-ssh-client_1  | distcc[11] (dcc_increment_safeguard) setting safeguard: _DISTCC_SAFEGUARD=1
distcc-ssh-client_1  | distcc[10] (dcc_ssh_connect) connecting to distcc-ssh-server using ssh
distcc-ssh-client_1  | distcc[10] (dcc_run_piped_cmd) execute: ssh -o StrictHostKeyChecking=no -v -l distcc-ssh distcc-ssh-server distccd --inetd --enable-tcp-insecure
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 4, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send DIST00000001
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGC00000005
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000003
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'gcc'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string '-c'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000006
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'test.c'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string '-o'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV0000000d
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'test-distcc.o'
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 3, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | OpenSSH_8.6p1, OpenSSL 1.1.1k  25 Mar 2021
distcc-ssh-client_1  | debug1: Reading configuration data /etc/ssh/ssh_config
distcc-ssh-client_1  | debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) cpp child 11 terminated with status 0
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) cpp times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[10] cpp test.c on localhost completed ok
distcc-ssh-client_1  | distcc[10] (dcc_unlock) release lock fd4
distcc-ssh-client_1  | distcc[10] (dcc_x_file) send 16555 byte file /tmp/distcc_fec22928.i with token DOTI and compression 69
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send DOTI000040ab
distcc-ssh-client_1  | distcc[10] (dcc_compile_remote) client finished sending request to server
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 5, file "(NULL)", host "distcc-ssh-server"
distcc-ssh-client_1  | distcc[10] (dcc_select_for_read) select for read on fd7 for 300s
distcc-ssh-client_1  | debug1: Connecting to distcc-ssh-server [172.18.0.2] port 22.
distcc-ssh-client_1  | debug1: connect to address 172.18.0.2 port 22: Connection refused
distcc-ssh-client_1  | ssh: connect to host distcc-ssh-server port 22: Connection refused
distcc-ssh-client_1  | distcc[10] (dcc_readx) ERROR: unexpected eof on fd7
distcc-ssh-client_1  | distcc[10] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc-ssh-client_1  | distcc[10] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Is the server performing authentication and your client isn't? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc-ssh-client_1  | distcc[10] 16555 bytes from test.c compiled on distcc-ssh-server in 0.0266s, rate 607kB/s
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) ssh child 13 terminated with status 0xff00
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) ssh times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[10] (dcc_unlock) release lock fd3
distcc-ssh-client_1  | distcc[10] (dcc_get_hostlist) read hosts from environment
distcc-ssh-client_1  | distcc[10] (dcc_parse_hosts) found ssh token "distcc-ssh@distcc-ssh-server"
distcc-ssh-client_1  | distcc[10] (dcc_lock_host) got cpu lock on distcc-ssh@distcc-ssh-server slot 0 as fd3
distcc-ssh-client_1  | distcc[10] (dcc_lock_host) got cpu lock on localhost slot 0 as fd4
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 3, file "(NULL)", host "localhost"
distcc-ssh-client_1  | distcc[10] (dcc_strip_dasho) result: gcc -c test.c
distcc-ssh-client_1  | distcc[10] (dcc_spawn_child) forking to execute: gcc -E test.c
distcc-ssh-client_1  | distcc[10] (dcc_spawn_child) child started as pid14
distcc-ssh-client_1  | distcc[14] (dcc_new_pgrp) entered process group
distcc-ssh-client_1  | distcc[14] (dcc_increment_safeguard) setting safeguard: _DISTCC_SAFEGUARD=1
distcc-ssh-client_1  | distcc[10] (dcc_strip_local_args) result: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[10] exec on distcc-ssh@distcc-ssh-server: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 2, file "test.c", host "distcc-ssh-server"
distcc-ssh-client_1  | distcc[10] (dcc_ssh_connect) connecting to distcc-ssh-server using ssh
distcc-ssh-client_1  | distcc[10] (dcc_run_piped_cmd) execute: ssh -l distcc-ssh distcc-ssh-server distccd --inetd --enable-tcp-insecure
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 4, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send DIST00000001
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGC00000005
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000003
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'gcc'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string '-c'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000006
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'test.c'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string '-o'
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send ARGV0000000d
distcc-ssh-client_1  | distcc[10] (dcc_x_token_string) send string 'test-distcc.o'
distcc-ssh-client_1  | distcc[10] (dcc_note_state) note state 3, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | ssh: connect to host distcc-ssh-server port 22: Connection refused
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) cpp child 14 terminated with status 0
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) cpp times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[10] cpp test.c on localhost completed ok
distcc-ssh-client_1  | distcc[10] (dcc_unlock) release lock fd4
distcc-ssh-client_1  | distcc[10] (dcc_x_file) send 16555 byte file /tmp/distcc_6b452928.i with token DOTI and compression 69
distcc-ssh-client_1  | distcc[10] (dcc_x_token_int) send DOTI000040ab
distcc-ssh-client_1  | distcc[10] (dcc_writex) ERROR: failed to write: Broken pipe
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) ssh child 15 terminated with status 0xff00
distcc-ssh-client_1  | distcc[10] (dcc_collect_child) ssh times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[10] (dcc_unlock) release lock fd3
@shawnl
Copy link
Collaborator

shawnl commented Jul 25, 2021

distcc-ssh-client_1 | ssh: connect to host distcc-ssh-server port 22: Connection refused

There is your problem

according to which the environment variable is only honored during the first connection attempt of distcc and ignored during any subsequent runs

the only state distcc has is the lock files to avoid spamming a host, and marking failed hosts in the same way, which has a timeout.

@KSmanis
Copy link
Author

KSmanis commented Jul 25, 2021

I see, but given that the CI builds started failing after v3.4, there is a change in behavior somewhere. Looking at successful build logs prior to v3.4, distcc used to fail with exit code 107 after failing to connect:

distcc-ssh-client_1  | + gcc -c test.c -o test-gcc.o
distcc-ssh-client_1  | + retry 6 distcc gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | + max_retries=6
distcc-ssh-client_1  | + shift
distcc-ssh-client_1  | + retries=0
distcc-ssh-client_1  | + delay=1
distcc-ssh-client_1  | + distcc gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[21] (dcc_trace_version) distcc 3.3.3 powerpc64le-unknown-linux-gnu; built Jul  3 2021 03:45:29
distcc-ssh-client_1  | distcc[21] (dcc_recursion_safeguard) safeguard level=0
distcc-ssh-client_1  | distcc[21] (main) compiler name is "distcc"
distcc-ssh-client_1  | distcc[21] (dcc_scan_args) scanning arguments: gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[21] (dcc_scan_args) found input file "test.c"
distcc-ssh-client_1  | distcc[21] (dcc_scan_args) found object/output file "test-distcc.o"
distcc-ssh-client_1  | distcc[21] compile from test.c to test-distcc.o
distcc-ssh-client_1  | distcc[21] (dcc_gcc_rewrite_fqn) Re-writing call to 'gcc' to 'powerpc64le-unknown-linux-gnu-gcc' to support cross-compilation.
distcc-ssh-client_1  | distcc[21] (dcc_get_hostlist) read hosts from environment
distcc-ssh-client_1  | distcc[21] (dcc_parse_hosts) found ssh token "distcc-ssh@distcc-ssh-server"
distcc-ssh-client_1  | distcc[21] (dcc_lock_host) got cpu lock on distcc-ssh@distcc-ssh-server slot 0 as fd3
distcc-ssh-client_1  | distcc[21] (dcc_lock_host) got cpu lock on localhost slot 0 as fd4
distcc-ssh-client_1  | distcc[21] (dcc_note_state) note state 3, file "(NULL)", host "localhost"
distcc-ssh-client_1  | distcc[21] (dcc_strip_dasho) result: powerpc64le-unknown-linux-gnu-gcc -c test.c
distcc-ssh-client_1  | distcc[21] (dcc_spawn_child) forking to execute: powerpc64le-unknown-linux-gnu-gcc -E test.c
distcc-ssh-client_1  | distcc[21] (dcc_spawn_child) child started as pid24
distcc-ssh-client_1  | distcc[24] (dcc_new_pgrp) entered process group
distcc-ssh-client_1  | distcc[24] (dcc_increment_safeguard) setting safeguard: _DISTCC_SAFEGUARD=1
distcc-ssh-client_1  | distcc[21] (dcc_strip_local_args) result: powerpc64le-unknown-linux-gnu-gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[21] exec on distcc-ssh@distcc-ssh-server: powerpc64le-unknown-linux-gnu-gcc -c test.c -o test-distcc.o
distcc-ssh-client_1  | distcc[21] (dcc_note_state) note state 2, file "test.c", host "distcc-ssh-server"
distcc-ssh-client_1  | distcc[21] (dcc_ssh_connect) connecting to distcc-ssh-server using ssh
distcc-ssh-client_1  | distcc[21] (dcc_run_piped_cmd) execute: ssh -o StrictHostKeyChecking=no -v -l distcc-ssh distcc-ssh-server distccd --inetd --enable-tcp-insecure
distcc-ssh-client_1  | distcc[21] (dcc_note_state) note state 4, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send DIST00000001
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGC00000005
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGV00000021
distcc-ssh-client_1  | distcc[21] (dcc_x_token_string) send string 'powerpc64le-unknown-linux-gnu-gcc'
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[21] (dcc_x_token_string) send string '-c'
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGV00000006
distcc-ssh-client_1  | distcc[21] (dcc_x_token_string) send string 'test.c'
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGV00000002
distcc-ssh-client_1  | distcc[21] (dcc_x_token_string) send string '-o'
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send ARGV0000000d
distcc-ssh-client_1  | distcc[21] (dcc_x_token_string) send string 'test-distcc.o'
distcc-ssh-client_1  | distcc[21] (dcc_note_state) note state 3, file "(NULL)", host "(NULL)"
distcc-ssh-client_1  | OpenSSH_8.6p1, OpenSSL 1.1.1k  25 Mar 2021
distcc-ssh-client_1  | debug1: Reading configuration data /etc/ssh/ssh_config
distcc-ssh-client_1  | debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling
distcc-ssh-client_1  | debug1: Connecting to distcc-ssh-server [172.18.0.2] port 22.
distcc-ssh-client_1  | debug1: connect to address 172.18.0.2 port 22: Connection refused
distcc-ssh-client_1  | ssh: connect to host distcc-ssh-server port 22: Connection refused
distcc-ssh-client_1  | distcc[21] (dcc_collect_child) cpp child 24 terminated with status 0
distcc-ssh-client_1  | distcc[21] (dcc_collect_child) cpp times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[21] cpp test.c on localhost completed ok
distcc-ssh-client_1  | distcc[21] (dcc_unlock) release lock fd4
distcc-ssh-client_1  | distcc[21] (dcc_x_file) send 16361 byte file /tmp/distcc_4544debd.i with token DOTI and compression 69
distcc-ssh-client_1  | distcc[21] (dcc_x_token_int) send DOTI00003fe9
distcc-ssh-client_1  | distcc[21] (dcc_writex) ERROR: failed to write: Broken pipe
distcc-ssh-client_1  | distcc[21] (dcc_collect_child) ssh child 27 terminated with status 0xff00
distcc-ssh-client_1  | distcc[21] (dcc_collect_child) ssh times: user 0.000000s, system 0.000000s, 0 minflt, 0 majflt
distcc-ssh-client_1  | distcc[21] (dcc_unlock) release lock fd3
distcc-ssh-client_1  | distcc[21] (dcc_build_somewhere) ERROR: failed to distribute and fallbacks are disabled
distcc-ssh-client_1  | distcc[21] elapsed compilation time 0.469821s
distcc-ssh-client_1  | distcc[21] (dcc_exit) exit: code 107; self: 0.047215 user 0.007869 sys; children: 0.395984 user 0.020005 sys
distcc-ssh-client_1  | distcc[21] (dcc_cleanup_tempfiles_inner) deleted 2 temporary files
distcc-ssh-client_1  | + exit=107
distcc-ssh-client_1  | + retries=1
distcc-ssh-client_1  | + '[' 1 -lt 6 ']'
distcc-ssh-client_1  | + echo '[1/6] `distcc gcc -c test.c -o test-distcc.o` failed with exit code 107; sleeping for 1s...'
distcc-ssh-client_1  | [1/6] `distcc gcc -c test.c -o test-distcc.o` failed with exit code 107; sleeping for 1s...
distcc-ssh-client_1  | + sleep 1

Post-v3.4 distcc seems to keep on trying forever to distribute the job. Am I missing something obvious here?

@shawnl
Copy link
Collaborator

shawnl commented Jul 25, 2021

distcc-ssh-client_1 | OpenSSH_8.6p1, OpenSSL 1.1.1k 25 Mar 2021
distcc-ssh-client_1 | debug1: Reading configuration data /etc/ssh/ssh_config
distcc-ssh-client_1 | debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling
distcc-ssh-client_1 | debug1: Connecting to distcc-ssh-server [172.18.0.2] port 22.
distcc-ssh-client_1 | debug1: connect to address 172.18.0.2 port 22: Connection refused
distcc-ssh-client_1 | ssh: connect to host distcc-ssh-server port 22: Connection refused

I think your ssh is not configured correctly.

@KSmanis
Copy link
Author

KSmanis commented Jul 26, 2021

The SSH issue is expected, since the distcc server container is started at the same time as the client container. Both of these containers are part of a smoke test where the client container is designed to try reconnecting a few times before failing.

The only reason I brought this issue up is because I noticed a change in the behavior of distcc from v3.3.3 to v3.4: in v3.3.3 the client container would occasionally fail to connect and distcc would exit with 107 (but that was expected and it was designed to retry after a few seconds). Since v3.4, however, distcc does not exit, but keeps on trying forever.

This is more of a heads-up rather than an issue, so feel free to close anytime if you find it irrelevant. I reported it because I didn't notice any relevant change in the changelog.

@shawnl
Copy link
Collaborator

shawnl commented Jul 26, 2021

Since v3.4, however, distcc does not exit, but keeps on trying forever.

If that is the case that is indeed a bug.

SSH mode is quite inefficient, and I do not use it, and thus my motivation to work on this is little, however. If you want security I recommend you use WireGuard. I would love to be able to integrate WireGuard directly into distcc, but that is tricky because it would require a full TCP stack in distcc.

@asheplyakov
Copy link
Contributor

The SSH issue is expected, since the distcc server container is started at the same time as the client container.

Knock-knock!
Race condition!
Who's there?

@shawnl
Copy link
Collaborator

shawnl commented Jul 26, 2021 via email

@KSmanis
Copy link
Author

KSmanis commented Jul 28, 2021

Using git bisect I traced this change in behavior to commit a7dd5cf using the following command: DISTCC_BACKOFF_PERIOD=0 DISTCC_FALLBACK=0 DISTCC_VERBOSE=1 DISTCC_HOSTS='@localhost' ./distcc gcc -c test.c -o test.o

Prior to this commit, distcc would exit after failing to distribute to @localhost (where no distccd is running). Starting with that commit it keeps retrying forever.

Again, I don't know if this change in behavior was intended or not, I'm just bringing it up for consideration.

@asheplyakov
Copy link
Contributor

Using git bisect I traced this change in behavior to commit a7dd5cf using the following command: DISTCC_BACKOFF_PERIOD=0 DISTCC_FALLBACK=0 DISTCC_VERBOSE=1 DISTCC_HOSTS='@localhost' ./distcc gcc -c test.c -o test.o

With DISTCC_BACKOFF_PERIOD=0 distcc won't mark failed connection attempts.
As a result when no hosts can be reached distcc will retry forever. Before a7dd5cf distcc would fallback to the local compilation after the first failed connection attempt (which might be a bit annoying).

asheplyakov added a commit to asheplyakov/distcc that referenced this issue Aug 30, 2021
Since the commit a7dd5cf ("Try a new host after failing to
connect instead of immediately falling back to localhost") distcc
tries to choose a different host on connection/protocol errors.
However when backoff is disabled (DISTCC_BACKOFF_PERIOD=0)
distcc keeps retrying forever.

To avoid the problem limit the number of retries (currently to 3)
when backoff is disabled.

Closes: distcc#434
vt-alt pushed a commit to altlinux/specs that referenced this issue Sep 15, 2022
- Repaired IP based access control (Closes: #42251)
- Added systemd unit file (Closes: #40669)
- Improved --allow-private, see distcc/distcc#451
- Removed clients.allow, commands.allow.sh: these have never ever worked
- Avoid infinite loop when DISTCC_BACKOFF is disabled, see
  distcc/distcc#434
- Refuse to distribute files with the `.incbin` assembler directive, see
  distcc/distcc#461
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants