Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting in TreeModifier.set_done() seems to be inconsistent with git's own sorting -> breaks repository #369

Closed
ddanier opened this issue Dec 8, 2015 · 8 comments

Comments

@ddanier
Copy link
Contributor

ddanier commented Dec 8, 2015

git enforces tree objects to be sorted by name. GitPython enables users to sort their tree's by calling TreeModifier.set_done(), see:
https://github.com/gitpython-developers/GitPython/blob/master/git/objects/tree.py#L45

This sorting somehow seems to be inconsistent with git's own sorting mechanism, which means adding a tree might break the git repository. This means "git fsck" will throw errors and some remotes (gitlab for example) will no accept the commit. Anyways this bug leaves the git object store in an broken state which cannot be fixed easily, especially if there is much history after the broken commit/tree. I had to delete multiple branches from projects I manage because of their broken state so far.

About the problem itself:

GitPython uses core Python functions to sort the file list. This looks correct, but has one major difference to what git itself does. If you have two files beginning with the same chars (for example "file" and "file.second") their ordering will differ:

What git does:

  1. "file.second"
  2. "file"

What GitPython does:

  1. "file"
  2. "file.second"

We found this problem using fabdeploit for deployments using git, which allows us to add/removed files to/from the git repository. This way tree's will be modified which means sorting is essential.

Please fix this. ;-)

@Byron
Copy link
Member

Byron commented Dec 13, 2015

Thanks for the detailed and informative issue ! This is indeed a critical one, whose fix is on the way.

To the interested audience, this is how git compares tree entries:

int name_compare(const char *name1, size_t len1, const char *name2, size_t len2)
{
    size_t min_len = (len1 < len2) ? len1 : len2;
    int cmp = memcmp(name1, name2, min_len);
    if (cmp)
        return cmp;
    if (len1 < len2)
        return -1;
    if (len1 > len2)
        return 1;
    return 0;
}

Now git-python does it similarly. It's worth noting that the most difficult was to make it work in Py3 as well as in Py2, which contributes to most of the added code.

@Byron Byron closed this as completed in d82a6c5 Dec 13, 2015
@Byron
Copy link
Member

Byron commented Dec 13, 2015

I am hoping to hear from you if it does indeed fix the issue on your end :).
Thanks again.

@ddanier
Copy link
Contributor Author

ddanier commented Dec 13, 2015

Whoa, hadn't thought this would be this complicated. Big thanks for the quick response!

I will test if the bug is fixed for our repositories, but I'm pretty certain we will have no more issues.

@ddanier
Copy link
Contributor Author

ddanier commented Dec 13, 2015

Sadly I still get errors. I took one tree which produces errors and used "git show" to see what is stored in object store. The attached files correspond to the four different ways I used to create the tree:

  • git.txt - Original git version, this is correct
  • gitpython.txt - Generated by gitpython directly downloaded form git (including d82a6c5), this should be correct
  • gitpython_reversed.txt - same as above, but I switched len_b - len_a to len_a - len_b (just tried this, could have worked)
  • gitpython_old.txt - Original version of gitpython I used when first discovering the bug

git seems to do some additional magic? Or am I missing something?

@Byron
Copy link
Member

Byron commented Dec 13, 2015

Thanks a lot ! With the new, improved dataset, I am sure there will soon be an actual fix.
This makes me curious what's so special about how git does it, after all I was looking at git's source code and thought I now did it by the book.
Maybe it's the sorting algorithm - git uses quick sort, and I am not quite sure what python's List.sort is actually doing.

@Byron Byron reopened this Dec 13, 2015
@Byron Byron self-assigned this Dec 13, 2015
Byron added a commit that referenced this issue Dec 14, 2015
The problem is that a per-tree modification API cannot work properly, as the sorting is based
on full paths of all entries within the repository. This feat can only be achieved by the index,
which to my knowledge already does it correctly.

The only fix is to remove the misleading API entirely, which will happen in the next commit.

Related to #369
@Byron Byron closed this as completed in b295c13 Dec 14, 2015
@Byron
Copy link
Member

Byron commented Dec 14, 2015

Please have a look at the two commits related to the issue to see if this can indeed solve the problem.

Long story short: The API used here shouldn't have existed in the first place, as it's not possible to sort individual trees correctly. Instead, please resort to modifying index entries and use IndexFile.write_tree() to write out the tree objects for use in a commit.

Of course, if this still is not working or creates invalid repositories, please let me know right away so I can have another look.

Thanks for your effort so far, I found the input very valuable.

@ghost
Copy link

ghost commented Jan 11, 2024

Don't know if you have known, but during my research, you can keep use default sort instead merge sort here , the point git sort diff with GitPython is git sort will add "/" after folder name, so "file.second/" < "file/", just use sort and add "/" after file name if a object type is tree. fyi~

@Byron
Copy link
Member

Byron commented Jan 11, 2024

You are probably right and I invite you to create a test that shows how GitPython creates an incorrect tree, maybe even in comparison to Git itself. From there it should be easy to fix and to assure this issue won't be re-introduced again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants