Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetching Submodule in Tree raises AttributeError: "Cannot retrieve the name of a submodule if it was not set initially" #1092

Open
ryan-williams opened this issue Nov 26, 2020 · 5 comments

Comments

@ryan-williams
Copy link

Check out toy example repo w/ 2 submodules (named generate-random-ints and sort):

git clone --recurse-submodules https://gitlab.com/gsmo/examples/submodule-demo.git
cd submodule-demo
git checkout e6f6d8f
git submodule update
python

In Python REPL:

import git
repo = git.Repo()
tree = repo.tree()
tree['sort']
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/ryan/.local/lib/python3.8/site-packages/git/objects/submodule/base.py", line 169, in __repr__
#     % (type(self).__name__, self._name, self.path, self.url, self.branch_path)
#   File "/Users/ryan/.pyenv/versions/3.8.5/lib/python3.8/site-packages/gitdb/util.py", line 253, in __getattr__
#     self._set_cache_(attr)
#   File "/Users/ryan/.local/lib/python3.8/site-packages/git/objects/submodule/base.py", line 132, in _set_cache_
#     raise AttributeError("Cannot retrieve the name of a submodule if it was not set initially")
# AttributeError: Cannot retrieve the name of a submodule if it was not set initially
# >>> tree['generate-random-ints']
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/ryan/.local/lib/python3.8/site-packages/git/objects/submodule/base.py", line 169, in __repr__
#     % (type(self).__name__, self._name, self.path, self.url, self.branch_path)
#   File "/Users/ryan/.pyenv/versions/3.8.5/lib/python3.8/site-packages/gitdb/util.py", line 253, in __getattr__
#     self._set_cache_(attr)
#   File "/Users/ryan/.local/lib/python3.8/site-packages/git/objects/submodule/base.py", line 132, in _set_cache_
#     raise AttributeError("Cannot retrieve the name of a submodule if it was not set initially")
# AttributeError: Cannot retrieve the name of a submodule if it was not set initially

It would be nice (and consistent with the Git CLI) to expose Submodules in Tree objects; analogous Git CLI operation:

git ls-tree HEAD
# 100644 blob 53d0e7837b21f6c964832001a428e2eac023ae3e	.gitmodules
# 160000 commit 1dcac12d613a0e405db510a28123e23fda6bd1c6	generate-random-ints
# 040000 tree a0440560b281de673b23324a9693782f848791b1	nbs
# 100644 blob 85543297722cd8f862ca50aaaf6ca54c5ea9055b	run.ipynb
# 160000 commit de538ba56a5929cab0465ea29478bd6af2ba974b	sort

It seems like this is maybe supposed to work, and the AttributeError above is just a bug?

In any case, given this issue, I can't find a way to access the submodule commit at a given parent commit. The natural interface would seem to be e.g.:

tree = repo.tree('6e2e388')
tree['sort']  # should return a reference to Submodule commit 36a9654716fe3a317aae4a4a96f95bdff4805625, instead raises AttributeError as above

Git CLI makes it easy to introspect trees' submodule states at any prior commit:

git ls-tree 6e2e388
# 100644 blob 53d0e7837b21f6c964832001a428e2eac023ae3e	.gitmodules
# 160000 commit 3ed135bf1b90b90b007f4cbd47c0f9496d7a4924	generate-random-ints
# 100644 blob 85543297722cd8f862ca50aaaf6ca54c5ea9055b	run.ipynb
# 160000 commit 36a9654716fe3a317aae4a4a96f95bdff4805625	sort
@ryan-williams
Copy link
Author

Debugging this, the error comes from referencing Submodule._name in Submodule.__str__ or Submodule.__repr__; I think accessing the Submodule from the Tree works as expected, but displaying it is crashing. Continuing to dig…

@Byron
Copy link
Member

Byron commented Nov 27, 2020

Thanks for looking into this!

It looks like a potential cause is that IterableList will happily return methods defined on itself, making anything named 'sort' impossible to retrieve. Maybe related to #1091.

For now as a workaround, I recommend using git directly at the expense of having to parse its output manually.

I am hopeful that your inquiries will result in a PR which can fix this apparently long-standing issue.

@ryan-williams
Copy link
Author

I think I'm close to a PR on this one! It's not really related to #1091, fwiw; this issue is reproducible with more innocuously-names submodules, I just made it confusing by using a module named sort in the example above. I'll try to update that to make it clearer.

The question of whether a Tree can truly contain a Submodule seems subtle: a Tree stores a Submodule's commit SHA, but [what repo that SHA belongs to] and [what the submodule's name is] live in .gitmodules in the repo root, which a Tree has no access to. In principle, two different commits can have different names for the same Submodule (pointed at the same {commit,path,url}), but share the Tree that contains the Submodule at that commit, meaning there isn't even one "right" answer to "given a Tree containing a Submodule, what is the Submodule's name?"). You need a specific Commit context to get at a Submodule's name (and URL; path you do know based on a Tree context though).

I've improved things a bit by letting Submodules lazily populate _name by parsing .gitmodules, matching on the submodule's path, and parsing name from the config section name (which looks like submodule "mymodule"). This requires Submodule.path, which we know from the parent Tree context. However, parent_commit defaults to repo HEAD when config parsing, and when accessing a Submodule under a Tree, I don't think we have any way to get at the logical Commit context (which we presumably got to the Tree from, but which is not stored on the Tree; storing a Commit reference on a Tree is complicated; a great thing about Trees is they can be shared by multiple Commits…).

An additional mitigation I'm playing with (and interested in your thoughts on) involves defining Commit.__getitem__ (currently undefined) to behave just like Tree.__getitem__, but with the added benefit of actually resolving Submodules fully+correctly in this kind of situation. I think this is a generally intuitive API that I remember trying to use when I was newer to GitPython (before learning that I had to additionally traverse from Commit to Tree before grabbing specific Blobs by path).

In that world, Tree.getitem could continue to be exhibit somewhat undefined behavior when interacting with Submodules (or we could pin down some reasonable defaults), and Commit.getitem would be the recommended API/work-around. I guess a semi-defined Submodule reference from a Tree can always be bound to its parent commit (and then properly populated automatically) using Submodule.set_parent_commit (which I also have several fixes to).

Sorry for the word salad, I ran in circles on this for a while yesterday while also learning the codebase, and writing this was also very helpful 😀

@ryan-williams
Copy link
Author

fwiw here's a recent snapshot, though I've reworked things since: https://github.com/gitpython-developers/GitPython/compare/master...ryan-williams:sm?expand=1 hoping to get it cleaned up and PR'd this wknd

@Byron
Copy link
Member

Byron commented Nov 30, 2020

Apologies for the late reply! But here I go…

The question of whether a Tree can truly contain a Submodule seems subtle: a Tree stores a Submodule's commit SHA, but [what repo that SHA belongs to] and [what the submodule's name is] live in .gitmodules in the repo root, which a Tree has no access to. In principle, two different commits can have different names for the same Submodule (pointed at the same {commit,path,url}), but share the Tree that contains the Submodule at that commit, meaning there isn't even one "right" answer to "given a Tree containing a Submodule, what is the Submodule's name?"). You need a specific Commit context to get at a Submodule's name (and URL; path you do know based on a Tree context though).

I find this to be such a fantastic description that I hope this would end up in the GitPython documentation somewhere. It's so fitting that I would hope it is conserved somewhere, I suppose.

I've improved things a bit by letting Submodules lazily populate _name by parsing .gitmodules, matching on the submodule's path, and parsing name from the config section name (which looks like submodule "mymodule"). This requires Submodule.path, which we know from the parent Tree context. However, parent_commit defaults to repo HEAD when config parsing, and when accessing a Submodule under a Tree, I don't think we have any way to get at the logical Commit context (which we presumably got to the Tree from, but which is not stored on the Tree; storing a Commit reference on a Tree is complicated; a great thing about Trees is they can be shared by multiple Commits…).

I absolutely agree, and believe that having an API like this and all it's implications is bound to fail for that exact reason. It can't actually be done.

An additional mitigation I'm playing with (and interested in your thoughts on) involves defining Commit.getitem (currently undefined) to behave just like Tree.getitem, but with the added benefit of actually resolving Submodules fully+correctly in this kind of situation. I think this is a generally intuitive API that I remember trying to use when I was newer to GitPython (before learning that I had to additionally traverse from Commit to Tree before grabbing specific Blobs by path).

In that world, Tree.getitem could continue to be exhibit somewhat undefined behavior when interacting with Submodules (or we could pin down some reasonable defaults), and Commit.getitem would be the recommended API/work-around. I guess a semi-defined Submodule reference from a Tree can always be bound to its parent commit (and then properly populated automatically) using Submodule.set_parent_commit (which I also have several fixes to).

To me it sounds very reasonable to add an API that is actually working, providing submodule information starting on commit level. I would of course be happy for any fixes to the tree based API that you happen to have laying around, and would hope there could be a deprecation warning going with it that informs about the new API (knowing that the old API will probably never be removed).

Sorry for the word salad, I ran in circles on this for a while yesterday while also learning the codebase, and writing this was also very helpful 😀

If this is word salad to you, I don't know what I should call what I am writing most of the time :D - thanks for sharing, I found these thoughts a pleasure to read and easy to follow, with the added benefit of knowing once again how git submodules are (supposed to be) working :).

And yes, I am absolutely looking forward to a PR - please feel free to submit WIPs as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants