Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: fix the FreeBSD build #1308

Closed
wants to merge 2 commits into from
Closed

Conversation

dscho
Copy link
Member

@dscho dscho commented Jul 29, 2022

Since 3a251ba (trace2: only include "fsync" events if we git_fsync(), 2022-07-18), the FreeBSD builds are failing in t5351.6. See https://cirrus-ci.com/task/4577761405698048 for an example. The run at https://cirrus-ci.com/task/6004115347079168 shows that this patch fixes the bug.

While verifying the fix on Windows, I noticed a recent, rather terrible performance regression: t5351 all of a sudden takes almost half an hour to run on Windows. I found a fix, and it now passes in less than half a minute again.

cc: Derrick Stolee derrickstolee@github.com
cc: Carlo Marcelo Arenas Belón carenas@gmail.com
cc: "brian m. carlson" sandals@crustytoothpaste.net

On FreeBSD, this mode is not supported. But since 3a251ba (trace2:
only include "fsync" events if we git_fsync(), 2022-07-18) t5351 will
fail if this mode is unsupported.

Let's address this in the minimal fashion, by detecting that that mode
is unsupported and expecting a different count of hardware flushes in
that case.

This fixes the CI/PR builds on FreeBSD again.

Note: A better way would be to test only what is relevant in t5351.6
"unpack big object in stream (core.fsyncmethod=batch)" again instead of
blindly comparing the output against some exact text. But that would
pretty much revert the idea of above-mentioned commit, and that commit
was _just_ accepted into Git's main branch so one must assume that it
was intentional.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho self-assigned this Jul 29, 2022
The `test_cmp` function is meant to provide nicer output than `cmp` when
expected and actual output of Git commands disagree. The implicit
assumption is that the output is line-based and human readable.

However, aaf8122 (unpack-objects: use stream_loose_object() to
unpack large objects, 2022-06-11) introduced a call that compares the
contents of pack files, which are distinctly not line-based nor human
readable.

This causes problems because on Windows, we hand off to the Bash
function `mingw_test_cmp` that compares the lines while ignoring line
ending differences. And this Bash function spends an insane amount of
cycles trying to read in that binary pack file, so that it is almost
indistinguishable from an infinite loop.

For example, t5351 took 1486 seconds in the CI run at
https://github.com/git/git/runs/7398490747?check_suite_focus=true#step:5:171,
to complete. And yes, that is almost half an hour.

Since Git's tests already use `cmp` consistently when comparing pack
files, let's change this instance to use `cmp` instead of `test_cmp`,
too, and fix that performance problem.

Now t5351 takes all of 22 seconds.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho
Copy link
Member Author

dscho commented Jul 29, 2022

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

Submitted as pull.1308.git.1659097724.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1308/dscho/fix-t5351-on-freebsd-v1

To fetch this version to local tag pr-1308/dscho/fix-t5351-on-freebsd-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1308/dscho/fix-t5351-on-freebsd-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/29/2022 8:28 AM, Johannes Schindelin via GitGitGadget wrote:
> Since 3a251bac0d1a (trace2: only include "fsync" events if we git_fsync(),
> 2022-07-18), the FreeBSD builds are failing in t5351.6. See
> https://cirrus-ci.com/task/4577761405698048 for an example. The run at
> https://cirrus-ci.com/task/6004115347079168 shows that this patch fixes the
> bug.

Thanks for noticing and fixing this bug. The FreeBSD build is slow
and flaky enough that I sometimes ignore its output before submitting
a series. Good that it will be green again.
 
> While verifying the fix on Windows, I noticed a recent, rather terrible
> performance regression: t5351 all of a sudden takes almost half an hour
> [https://github.com/git/git/runs/7398490747?check_suite_focus=true#step:5:171]
> to run on Windows. I found a fix, and it now passes in less than half a
> minute
> [https://github.com/gitgitgadget/git/runs/7578071365?check_suite_focus=true#step:5:125]
> again.

Thanks for speeding this up!

-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

User Derrick Stolee <derrickstolee@github.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> performance regression: t5351 all of a sudden takes almost half an hour
> [https://github.com/git/git/runs/7398490747?check_suite_focus=true#step:5:171]
> to run on Windows. I found a fix, and it now passes in less than half a
> minute

;-)  That's impressive.

@@ -340,6 +340,8 @@ void fsync_loose_object_bulk_checkin(int fd, const char *filename)
*/
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> On FreeBSD, this mode is not supported. But since 3a251bac0d1a (trace2:
> only include "fsync" events if we git_fsync(), 2022-07-18) t5351 will
> fail if this mode is unsupported.
>
> Let's address this in the minimal fashion, by detecting that that mode
> is unsupported and expecting a different count of hardware flushes in
> that case.
>
> This fixes the CI/PR builds on FreeBSD again.
>
> Note: A better way would be to test only what is relevant in t5351.6
> "unpack big object in stream (core.fsyncmethod=batch)" again instead of
> blindly comparing the output against some exact text. But that would
> pretty much revert the idea of above-mentioned commit, and that commit
> was _just_ accepted into Git's main branch so one must assume that it
> was intentional.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  bulk-checkin.c                  |  2 ++
>  t/t5351-unpack-large-objects.sh | 10 ++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)

I am inclined to take this as-is for now.

But it inherits from 3a251bac0d1a the general "we should be able to
count the value" expectation, which may not be the best approach to
run this test; asking Acks from those originally involved in this
area and possibly ideas to test this in a more robust way.

Thanks.


> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 98ec8938424..855b68ec23b 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -340,6 +340,8 @@ void fsync_loose_object_bulk_checkin(int fd, const char *filename)
>  	 */
>  	if (!bulk_fsync_objdir ||
>  	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
> +		if (errno == ENOSYS)
> +			warning(_("core.fsyncMethod = batch is unsupported on this platform"));
>  		fsync_or_die(fd, filename);
>  	}
>  }
> diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh
> index f785cb06173..dd7ebc40072 100755
> --- a/t/t5351-unpack-large-objects.sh
> +++ b/t/t5351-unpack-large-objects.sh
> @@ -70,9 +70,15 @@ test_expect_success 'unpack big object in stream (core.fsyncmethod=batch)' '
>  	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
>  	GIT_TEST_FSYNC=true \
>  		git -C dest.git $BATCH_CONFIGURATION unpack-objects <pack-$PACK.pack &&
> -	check_fsync_events trace2.txt <<-\EOF &&
> +	if grep "core.fsyncMethod = batch is unsupported" trace2.txt
> +	then
> +		flush_count=7
> +	else
> +		flush_count=1
> +	fi &&
> +	check_fsync_events trace2.txt <<-EOF &&
>  	"key":"fsync/writeout-only","value":"6"
> -	"key":"fsync/hardware-flush","value":"1"
> +	"key":"fsync/hardware-flush","value":"$flush_count"
>  	EOF
>  
>  	test_dir_is_empty dest.git/objects/pack &&

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, "brian m. carlson" wrote (reply to this):

On 2022-07-29 at 16:07:46, Junio C Hamano wrote:
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
> 
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > On FreeBSD, this mode is not supported. But since 3a251bac0d1a (trace2:
> > only include "fsync" events if we git_fsync(), 2022-07-18) t5351 will
> > fail if this mode is unsupported.
> >
> > Let's address this in the minimal fashion, by detecting that that mode
> > is unsupported and expecting a different count of hardware flushes in
> > that case.
> >
> > This fixes the CI/PR builds on FreeBSD again.
> >
> > Note: A better way would be to test only what is relevant in t5351.6
> > "unpack big object in stream (core.fsyncmethod=batch)" again instead of
> > blindly comparing the output against some exact text. But that would
> > pretty much revert the idea of above-mentioned commit, and that commit
> > was _just_ accepted into Git's main branch so one must assume that it
> > was intentional.
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  bulk-checkin.c                  |  2 ++
> >  t/t5351-unpack-large-objects.sh | 10 ++++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> I am inclined to take this as-is for now.

I think this patch is a strict improvement over the status quo, despite
what I'm going to mention below.

> But it inherits from 3a251bac0d1a the general "we should be able to
> count the value" expectation, which may not be the best approach to
> run this test; asking Acks from those originally involved in this
> area and possibly ideas to test this in a more robust way.

While it may not matter in this test, I noticed that the metrics we use
need not be accurate in multi-threaded programs.  We use integers of
static intmax_t which isn't necessarily atomic across threads.  Even if
we assumed word-sized increments were atomic[0], this type might be 64
bits on a 32-bit system.

I am aware that we don't use threads in many places, but in general we
should be safely able to assume that we can perform an fsync on any
thread without thread safety problems because that's the assumption a
reasonable Unix programmer would make.  Even if it's not a problem now,
it could well be in the future, and we should either fix this (say, with
C11 atomic integers[1]) or put a note on the metrics functions that they
may be wrong and stop using them in tests.

I would, in general, prefer that if we add wrappers that wrap common
Unix functions we ensure that they provide the same guarantees as the
common Unix functions we're wrapping.  I realize I may sound fussy, but
I've been fixing giant thread-safety problems recently and it's not fun.

[0] This is not, in general, a safe assumption to make, since most RISC
architectures are load-store.
[1] This would necessitate moving to C11, which is fine with me (and
already required on Windows), but others may have objections to.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

On the Git mailing list, Carlo Marcelo Arenas Belón wrote (reply to this):

On Fri, Jul 29, 2022 at 08:58:46AM -0400, Derrick Stolee wrote:
> On 7/29/2022 8:28 AM, Johannes Schindelin via GitGitGadget wrote:
> > Since 3a251bac0d1a (trace2: only include "fsync" events if we git_fsync(),
> > 2022-07-18), the FreeBSD builds are failing in t5351.6. See
> > https://cirrus-ci.com/task/4577761405698048 for an example. The run at
> > https://cirrus-ci.com/task/6004115347079168 shows that this patch fixes the
> > bug.
> 
> Thanks for noticing and fixing this bug. The FreeBSD build is slow

It usually takes a little more than 7 minutes for a full run, which is IMHO
less (at least wall time) than the whole CI does; could you elaborate on
why being "slow" would warrant ignoring its failures?

> and flaky enough that I sometimes ignore its output before submitting
> a series. Good that it will be green again.

I'd noticed that because it runs outside GitHub actions it sometimes has
synchronization(ex [1]) issues, but that might be some bug on the integration
with Cirrus which is easily avoided by looking instead directly to their
status page:

  https://cirrus-ci.com/github/git/git

Carlo

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

User Carlo Marcelo Arenas Belón <carenas@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

User "brian m. carlson" <sandals@crustytoothpaste.net> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

This branch is now known as js/t5351-freebsd-fix.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

This patch series was integrated into seen via git@9135aa1.

@gitgitgadget gitgitgadget bot added the seen label Jul 29, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 29, 2022

There was a status update in the "New Topics" section about the branch js/t5351-freebsd-fix on the Git mailing list:

Some tests assumed that core.fsyncMethod=batch is supported
everywhere, which broke FreeBSD.

Will merge to 'next'.
source: <pull.1308.git.1659097724.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 1, 2022

This patch series was integrated into seen via git@bad9312.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 1, 2022

This patch series was integrated into next via git@b47609e.

@gitgitgadget gitgitgadget bot added the next label Aug 1, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Aug 1, 2022

There was a status update in the "Cooking" section about the branch js/t5351-freebsd-fix on the Git mailing list:

Some tests assumed that core.fsyncMethod=batch is supported
everywhere, which broke FreeBSD.

Will merge to 'master'.
source: <pull.1308.git.1659097724.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 3, 2022

This patch series was integrated into seen via git@b40deae.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 5, 2022

This patch series was integrated into seen via git@9d250de.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 6, 2022

There was a status update in the "Cooking" section about the branch js/t5351-freebsd-fix on the Git mailing list:

Some tests assumed that core.fsyncMethod=batch is supported
everywhere, which broke FreeBSD.

Will merge to 'master'.
source: <pull.1308.git.1659097724.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 7, 2022

This patch series was integrated into seen via git@e89e751.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 8, 2022

This patch series was integrated into seen via git@1b53bea.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 8, 2022

This patch series was integrated into master via git@1b53bea.

@gitgitgadget
Copy link

gitgitgadget bot commented Aug 8, 2022

This patch series was integrated into next via git@1b53bea.

@gitgitgadget gitgitgadget bot added the master label Aug 8, 2022
@gitgitgadget gitgitgadget bot closed this Aug 8, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Aug 8, 2022

Closed via 1b53bea.

@dscho dscho deleted the fix-t5351-on-freebsd branch August 9, 2022 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant