
[DRAFT] for testing: Fix 4GB limit for large files on Git for Windows #2179

Open · wants to merge 14 commits into base: main
23 changes: 23 additions & 0 deletions t/t-large-files-on-windows.sh
@@ -0,0 +1,23 @@
#!/bin/sh
Member:

After I read this commit message, I wonder how we can munge it to look more like the other commit messages in Git. For example, if you compare the commit message of 581d2fd to this:

 Add test for large files on windows

Original test by Thomas.
Add the extra fsck to get diagnostics after the add.
Verify the pack at earliest opportunity
Slight confusion as to why index-pack vs verify-pack...
It's -v (verbose) not --verify
Specify an output file to index-pack, otherwise it clashes with
the existing index file.
Check that the sha1 matches the existing value;-)

then it is hard not to spot a stark difference. Maybe we could write something like this instead:

Add a test for files >2GB

There is nothing in the C specification that claims that `long` should
have the same size as `void *`. For that reason, Git's over-use of
`unsigned long` when it tries to store addresses or offsets is
problematic.

This issue is most prominent on Windows, where the size of `long` is
indeed 32 bits, even on 64-bit Windows.

Let's add a test that demonstrates problems with our current code base
when `sizeof(long)` is different from `sizeof(void *)`.

Thoughts?

BTW I have no idea what you mean by this:

> Check that the sha1 matches the existing value;-)

I do not see this anywhere in this patch.

Author:

The sha1 check was one of the reported values in the test but I only manually checked it against what I'd seen via WSL.

The test (and hence the message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags).

Member:

> The test (and hence the message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags).

Would a change of those flags constitute a regression? If not (which I think is the case), then it should not be tested.

> The sha1 check was one of the reported values in the test but I only manually checked it against what I'd seen via WSL.

Ah, okay. I am not sure that is strictly interesting information for the commit message; it is more suited to the cover letter.


test_description='test large file handling on windows'
. ./test-lib.sh

test_expect_success SIZE_T_IS_64BIT 'require 64bit size_t' '
Member:

The commit message claims that this is for Windows, but the prereq does not actually test that. You will want to use the prereq MINGW for that.

Besides, I think that SIZE_T_IS_64BIT is only part of the required prereqs: we probably also want to ensure that sizeof(long) == 4, because otherwise this test case will trivially succeed even if nothing in Git's source code was fixed.

To that end, I would like to imitate the way we test for `time_t-is64bit` via the test-tool.

And when we do that, we do not even need the MINGW prereq, nor do we need to mention Windows specifically: it is no longer about Windows, it is about systems where the size of long differs from that of void *.

Finally, I think it would make sense to fold the commit that introduces the SIZE_T_IS_64BIT prereq into this commit, and probably even to move it into this file, as that would make it easier on reviewers (both present and future ones).
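
For reference, this is how the existing `time_t` prereq is wired up in `t/test-lib.sh`; an analogous check for `long` could follow the same pattern, though the `test-tool sizeof` command sketched below is hypothetical and would have to be added first:

```sh
# Existing prereq in t/test-lib.sh:
test_lazy_prereq TIME_T_IS_64BIT 'test-tool date time_t-is64bit'

# Hypothetical analogue (the "test-tool sizeof" helper does not exist
# yet and would need to be implemented as a new test-tool command):
test_lazy_prereq LONG_IS_32BIT 'test-tool sizeof long-is32bit'
```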

Author:

Yes, the initial test was incomplete, and just sufficient to ensure I could see all the steps that include a `long` conversion. When I looked at the use of zlib, there were 28 places! So a lot of different parts of the test could be impacted, hence the checks everywhere.

I also wanted to explicitly test that the zlib compile flags were the problematic ones.

Member:

Yes, that's indeed what you should do while developing the test case. Once you're done, it needs to be polished to serve its particular use case: allowing easy debugging of future regressions. And that is where I think the `git log --stat`, the compile flags, and stuff like that won't help, so it needs to go before we merge.

Author:

The compile flags should be part of the prerequisite, to detect that `size_t` is 64-bit but zlib is only 32-bit, which is currently part of the >4GB failing case. You're right that it's not in there yet, but it should be (not sure how to do it yet, though).
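
One conceivable approach (a sketch only, not part of this PR): zlib's `zlibCompileFlags()` function reports the compiled sizes of its basic types (bits 2-3 of the return value encode `sizeof(uLong)`), so a hypothetical test-tool helper could expose that to the test suite:

```sh
# Hypothetical prereq: assumes a new test-tool command that calls
# zlib's zlibCompileFlags() and prints sizeof(uLong); no such
# helper exists in test-tool today.
test_lazy_prereq ZLIB_ULONG_IS_32BIT '
	test 4 = "$(test-tool zlib sizeof-ulong)"
'
```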


dd if=/dev/zero of=file bs=1M count=4100 &&
git config core.compression 0 &&
git config core.looseCompression 0 &&
git add file &&
git verify-pack -s .git/objects/pack/*.pack &&
git fsck --verbose --strict --full &&
git commit -m msg file &&
git log --stat &&
Member:

I'd rather leave this out. It might have made sense for debugging when this patch was developed, but the test case should be optimized for catching, diagnosing and debugging future regressions, nothing more, nothing less.

Author:

That depends on the testing criteria.
Maybe we have a full-fledged 'check everything at every step' commit, immediately followed by a commit that trims it back to 'just detect an end-to-end failure', with a message stating that the debug version is the previous commit?

Member:

I regularly investigate test failures in Git's test suite, as I am one of the very few who even looks at our CI builds.

You should try it, too. It's very educative.

I can tell you precisely how I feel about test cases like this one, but you would have to hide younger children first.

In short: no, let's not do this. Let's not make life hard on the developers who will inevitably have to investigate regressions; let's make it easy instead. The shorter, more concise, and more meaningful we can make the test case, the better.

Remember: regression tests are not so much about finding regressions as they are about helping regressions get fixed. That is the criterion you should aim to optimize for. Always.
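
(For illustration only: a trimmed-down, end-to-end shape along the lines the reviewer is asking for might look like the sketch below. The prereq and file names are assumptions, not part of this PR.)

```sh
# Minimal end-to-end check: create a >4GB file, add and commit it,
# then let fsck validate the resulting objects.
test_expect_success EXPENSIVE 'add, commit and fsck a >4GB file' '
	dd if=/dev/zero of=big bs=1M count=4100 &&
	git config core.compression 0 &&
	git add big &&
	git commit -m big &&
	git fsck --strict
'
```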

git gc &&
Member:

Do we really want to `git gc` both here and 3 lines further down? Would this really help debugging regressions in the future?

Author:

Almost every step is error-prone, and some failures do not reproduce outside of the test environment (differing `core.bigFileThreshold` effects, I think), i.e. pack vs. loose object.
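
(Background on that pack-vs-loose difference: `git add` streams blobs larger than `core.bigFileThreshold` (default 512 MiB) directly into a pack, while smaller blobs are written as loose objects, so the threshold determines which code path a test exercises. A quick way to observe both paths, each in a fresh repository with a test file named `file`:)

```sh
# File exceeds the (lowered) threshold: streamed straight into a pack.
git -c core.bigFileThreshold=1m add file
ls .git/objects/pack/   # a .pack/.idx pair appears

# File stays under the (raised) threshold: stored as a loose object.
git -c core.bigFileThreshold=8g add file
ls .git/objects/        # a loose-object fan-out directory appears
```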

Member:

Then we really need a more robust test. Otherwise a developer might be tempted to think that they fixed a flawed regression test, all the while a real regression has crept in.

So please indulge me: what exactly is this regression test case supposed to test?

Author:

This is an expansion of the 'does it perform correctly' test that Thomas B initially developed.

Each of the fsck, verify-pack, log, and gc commands can barf. Plus, if run manually, it actually performs a slightly different test (add to loose vs. pack), which I didn't get to the bottom of!

In this case he had deliberately set the compression to zero so that the compressed file would exceed the limit. We also 'need' (if we are to be complete) to check both packed and loose encoding, etc., including breakages at the various process stages (add vs. commit, …).

I'd agree that the testing still has a long way to go before we have a smooth set of tests that satisfies everyone. These are all low-level code issues that can pop up in many places.

It may be that we have a two-phase plan: first make the code patches visible (if the code isn't working...), then make the testing acceptable (to standards as yet unknown/undiscussed/undecided).

git fsck --verbose --strict --full &&
git index-pack -v -o test.idx .git/objects/pack/*.pack &&
git gc &&
git fsck
'

test_done
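
(To run the new script on its own, the usual Git test-suite invocation applies; note that for `make test` to pick it up automatically, the file would also need the conventional `tNNNN-` numeric prefix:)

```sh
cd t && sh ./t-large-files-on-windows.sh -v -i
```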