Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

git-p4: speed up search for branch parent #1013

Closed
wants to merge 2 commits into from

Conversation

jkuebart
Copy link
Contributor

@jkuebart jkuebart commented Apr 28, 2021

In this iteration, I have added more context and measurements to the commit message.

I have also made small improvements to the code suggested by reviewers.

I enhanced t9801-git-p4-branch.sh to test for the functionality, namely that branches are branched off at the correct point in their parents' history.

Signed-off-by: Joachim Kuebart joachim.kuebart@gmail.com

cc: Joachim Kuebart joachim.kuebart@gmail.com

cc: Luke Diamand luke@diamand.org
cc: Joachim Kuebart joachim.kuebart@gmail.com

@gitgitgadget-git
Copy link

Welcome to GitGitGadget

Hi @jkuebart, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@dscho
Copy link
Member

dscho commented Apr 28, 2021

/allow

@gitgitgadget-git
Copy link

User jkuebart is now allowed to use GitGitGadget.

WARNING: jkuebart has no public email address set on GitHub

@jkuebart
Copy link
Contributor Author

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1013.git.git.1619640416533.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-1013/jkuebart/p4-faster-parent-v1

To fetch this version to local tag pr-git-1013/jkuebart/p4-faster-parent-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-1013/jkuebart/p4-faster-parent-v1

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Joachim Kuebart <joachim.kuebart@gmail.com>

Thanks.  As git-p4 is not in my area of expertise, I'll make a style
critique first, while pinging Luke as an area expert (you can learn
who they are with "git shortlog --no-merges --since=18.months.ago
git-p4.py").

> Previously, the code iterated through the parent branch commits and
> compared each one to the target tree using diff-tree.

It is customary in this project to describe the problem in the
present tense.  In other words, whoever is writing the log message
still lives in the world without this patch applied to the system.

    The code iterates through the parent commits and compares each of
    them to the target tree using diff-tree.

But before that sentence, please prepare the reader with a bit
larger picture.  A reader may not know what purpose the comparison
serves.  Do we know that the tree of one of the parents of the
commit must match the tree of the target, and trying to see which
parent is the one with the same tree?  What is helped by learning
which parent has the same tree?

Perhaps

    The method searchParent() is used to find a commit in the
    history of the given 'parent' commit whose tree exactly matches
    the tree of the given 'target commit.  The code iterates through
    the commits in the history and compares each of them to the
    target tree by invoking diff-tree.

And then our log message would make observation, pointing out what
is wrong with it (after all comparing with diff-tree is not giving
us a wrong result---the point of this change is that spawning diff-tree
for each commit is wasteful when we only want to see exact matches).

    Because we only are interested in finding a tree that is exactly
    the same, and not interested in how other trees are different,
    having to spawn diff-tree for each and every commit is wasteful. 

> This patch outputs the revision's tree hash along with the commit hash,
> thereby saving the diff-tree invocation. This results in a considerable
> speed-up, at least on Windows.

And then our log message would order the codebase to "become like
so", in order to resolve the issue(s) pointed out in the
observation.  Perhaps

    Use the "--format" option of "rev-list" to find out the tree
    object name of each commit in the history, and find the tree
    whose name is exactly the same as the tree of the target commit
    to optimize this.

When making a claim on performance, it is helpful to our readers to
give some numbers, even in a limited test, e.g.

    In a sample history where ~100 commits needed to be traversed to
    find the fork point on my Windows box, the current code took
    10.4 seconds to complete, while the new code yields the same
    result in 1.8 seconds, which is a significant speed-up.

or something along these lines.

> Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
>  git-p4.py | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac401..dbe94e6fb83b 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
>          return True
>  
>      def searchParent(self, parent, branch, target):
> +        for tree in read_pipe_lines(["git", "rev-parse",
> +                                     "{}^{{tree}}".format(target)]):
> +            targetTree = tree.strip()

It looks very strange to run a commit that you expect a single line
of output, and read the result in a loop.  Doesn't git-p4.py supply
a more suitable helper to read a single line output from a command?

> +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
>                                       "--no-merges", parent]):

This is not a new problem you introduced, but when we hear word
"blob" in the context of this project, it reminds us of the "blob"
object, while the 'blob' variable used in this loop has nothing to
do with it.  Perhaps rename it to say 'line' or something?

> +            if blob[:7] == "commit ":
> +                continue

Perhaps blob.startswith("commit ") to avoid hardcoded string length?

> +            blob = blob.strip().split(" ")
> +            if blob[1] == targetTree:
>                  if self.verbose:
> +                    print("Found parent of %s in commit %s" % (branch, blob[0]))
> +                return blob[0]
> +        return None

@gitgitgadget-git
Copy link

On the Git mailing list, Joachim Kuebart wrote (reply to this):

On Thu, 29 Apr 2021 at 04:22, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Joachim Kuebart <joachim.kuebart@gmail.com>
>
> Thanks.  As git-p4 is not in my area of expertise, I'll make a style
> critique first, while pinging Luke as an area expert (you can learn
> who they are with "git shortlog --no-merges --since=18.months.ago
> git-p4.py").

Hi Junio, thanks for your timely and thorough review and for putting
up with my greenhorn mistakes ;-)

> > Previously, the code iterated through the parent branch commits and
> > compared each one to the target tree using diff-tree.
>
> It is customary in this project to describe the problem in the
> present tense.  In other words, whoever is writing the log message
> still lives in the world without this patch applied to the system.

I will rephrase the commit message and give better details as you
mentioned. Thanks a lot for your suggestions!

> When making a claim on performance, it is helpful to our readers to
> give some numbers, even in a limited test, e.g.
>
>     In a sample history where ~100 commits needed to be traversed to
>     find the fork point on my Windows box, the current code took
>     10.4 seconds to complete, while the new code yields the same
>     result in 1.8 seconds, which is a significant speed-up.
>
> or something along these lines.

I will add some measurements.

> > Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
> >  git-p4.py | 22 +++++++++++-----------
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/git-p4.py b/git-p4.py
> > index 09c9e93ac401..dbe94e6fb83b 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
> >          return True
> >
> >      def searchParent(self, parent, branch, target):
> > +        for tree in read_pipe_lines(["git", "rev-parse",
> > +                                     "{}^{{tree}}".format(target)]):
> > +            targetTree = tree.strip()
>
> It looks very strange to run a commit that you expect a single line
> of output, and read the result in a loop.  Doesn't git-p4.py supply
> a more suitable helper to read a single line output from a command?

You're absolutely right that this isn't very readable. I had a quick
look around for a function that reads a single-line response, but I'll
look again and come up with a clearer solution.

> > +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
> >                                       "--no-merges", parent]):
>
> This is not a new problem you introduced, but when we hear word
> "blob" in the context of this project, it reminds us of the "blob"
> object, while the 'blob' variable used in this loop has nothing to
> do with it.  Perhaps rename it to say 'line' or something?

Will do, thanks!

> > +            if blob[:7] == "commit ":
> > +                continue
>
> Perhaps blob.startswith("commit ") to avoid hardcoded string length?

Yes, that's the name of the function that I can never think of when I need it.

Thanks again for your comments,

Joachim

@gitgitgadget-git
Copy link

User Joachim Kuebart <joachim.kuebart@gmail.com> has been added to the cc: list.

@gitgitgadget-git
Copy link

On the Git mailing list, Luke Diamand wrote (reply to this):



On 29/04/2021 07:48, Joachim Kuebart wrote:
> On Thu, 29 Apr 2021 at 04:22, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Joachim Kuebart via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Joachim Kuebart <joachim.kuebart@gmail.com>
>>
>> Thanks.  As git-p4 is not in my area of expertise, I'll make a style
>> critique first, while pinging Luke as an area expert (you can learn
>> who they are with "git shortlog --no-merges --since=18.months.ago
>> git-p4.py").
> 
> Hi Junio, thanks for your timely and thorough review and for putting
> up with my greenhorn mistakes ;-)
> 
>>> Previously, the code iterated through the parent branch commits and
>>> compared each one to the target tree using diff-tree.
>>
>> It is customary in this project to describe the problem in the
>> present tense.  In other words, whoever is writing the log message
>> still lives in the world without this patch applied to the system.
> 
> I will rephrase the commit message and give better details as you
> mentioned. Thanks a lot for your suggestions!
> 
>> When making a claim on performance, it is helpful to our readers to
>> give some numbers, even in a limited test, e.g.
>>
>>      In a sample history where ~100 commits needed to be traversed to
>>      find the fork point on my Windows box, the current code took
>>      10.4 seconds to complete, while the new code yields the same
>>      result in 1.8 seconds, which is a significant speed-up.
>>
>> or something along these lines.
> 
> I will add some measurements.
> 
>>> Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
>>>   git-p4.py | 22 +++++++++++-----------
>>>   1 file changed, 11 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/git-p4.py b/git-p4.py
>>> index 09c9e93ac401..dbe94e6fb83b 100755
>>> --- a/git-p4.py
>>> +++ b/git-p4.py
>>> @@ -3600,19 +3600,19 @@ def importNewBranch(self, branch, maxChange):
>>>           return True
>>>
>>>       def searchParent(self, parent, branch, target):
>>> +        for tree in read_pipe_lines(["git", "rev-parse",
>>> +                                     "{}^{{tree}}".format(target)]):
>>> +            targetTree = tree.strip()
>>
>> It looks very strange to run a commit that you expect a single line
>> of output, and read the result in a loop.  Doesn't git-p4.py supply
>> a more suitable helper to read a single line output from a command?
> 
> You're absolutely right that this isn't very readable. I had a quick
> look around for a function that reads a single-line response, but I'll
> look again and come up with a clearer solution.

I don't think there is one - git-p4 has lots of functions for calling 
`p4', but for calling git, it just uses Python's Popen() API.

A good question is whether we can start taking advantage of the newer 
features in Python3 which will obviously break backward compatibility.

> 
>>> +        for blob in read_pipe_lines(["git", "rev-list", "--format=%H %T",
>>>                                        "--no-merges", parent]):
>>
>> This is not a new problem you introduced, but when we hear word
>> "blob" in the context of this project, it reminds us of the "blob"
>> object, while the 'blob' variable used in this loop has nothing to
>> do with it.  Perhaps rename it to say 'line' or something? >
> Will do, thanks!

It confused me as well.

> 
>>> +            if blob[:7] == "commit ":
>>> +                continue
>>
>> Perhaps blob.startswith("commit ") to avoid hardcoded string length?
> 
> Yes, that's the name of the function that I can never think of when I need it.
> 
> Thanks again for your comments,
> 
> Joachim
> 

There are existing tests for importing branches which should cover this. 
I don't know if they need to be extended or not, you might want to check.

Looks good otherwise.


@gitgitgadget-git
Copy link

User Luke Diamand <luke@diamand.org> has been added to the cc: list.

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Luke Diamand <luke@diamand.org> writes:

>>>>       def searchParent(self, parent, branch, target):
>>>> +        for tree in read_pipe_lines(["git", "rev-parse",
>>>> +                                     "{}^{{tree}}".format(target)]):
>>>> +            targetTree = tree.strip()
>>>
>>> It looks very strange to run a commit that you expect a single line
>>> of output, and read the result in a loop.  Doesn't git-p4.py supply
>>> a more suitable helper to read a single line output from a command?
>> You're absolutely right that this isn't very readable. I had a quick
>> look around for a function that reads a single-line response, but I'll
>> look again and come up with a clearer solution.
>
> I don't think there is one - git-p4 has lots of functions for calling
> `p4', but for calling git, it just uses Python's Popen() API.

OK.  It just felt "strange", not "wrong", so I am OK with the
construct at least for now.

> There are existing tests for importing branches which should cover
> this. I don't know if they need to be extended or not, you might want
> to check.
>
> Looks good otherwise.

Thanks for a prompt review.


@gitgitgadget-git
Copy link

On the Git mailing list, Joachim Kuebart wrote (reply to this):

On Thu, 29 Apr 2021 at 10:22, Luke Diamand <luke@diamand.org> wrote:
>
> There are existing tests for importing branches which should cover this.
> I don't know if they need to be extended or not, you might want to check.

I enhanced t9801-git-p4-branch to check for this functionality, i.e.
that the branches are branched off at the correct commits from their
parents. As far as I could see, there was no test for this before.

Thank you as well for your quick response!

Cheers,

Joachim

@gitgitgadget-git
Copy link

User Joachim Kuebart <joachim.kuebart@gmail.com> has been added to the cc: list.

@gitgitgadget-git
Copy link

On the Git mailing list, Joachim Kuebart wrote (reply to this):

On Thu, 29 Apr 2021 at 10:31, Junio C Hamano <gitster@pobox.com> wrote:
>
> Luke Diamand <luke@diamand.org> writes:
> >
> > Looks good otherwise.
>
> Thanks for a prompt review.

I addressed your comments and updated my PR at
https://github.com/git/git/pull/1013, but CI seems stuck and hasn't
kicked off yet. I'd like to see it pass especially since I modified a
test. Is there anything else I need to do?

Joachim

@keerati808080

This comment has been minimized.

@jkuebart
Copy link
Contributor Author

jkuebart commented May 4, 2021

Hi @dscho, is there anything I can do to kick off CI again? I modified one of the tests and would like to ensure they pass before submitting to the list.

@dscho
Copy link
Member

dscho commented May 4, 2021

thanks for the ping, I kicked it off.

When importing a branch from p4, git-p4 searches the history of the parent
branch for the branch point. The test for the complex branch structure
ensures all files have the expected contents, but doesn't examine the
branch structure.

Check for the correct branch structure by making sure that the initial
commit on each branch is empty. This ensures that the initial commit's
parent is indeed the correct branch-off point.

Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
For every new branch that git-p4 imports, it needs to find the commit
where it branched off its parent branch. While p4 doesn't record this
information explicitly, the first changelist on a branch is usually an
identical copy of the parent branch.

The method searchParent() tries to find a commit in the history of the
given "parent" branch whose tree exactly matches the initial changelist
of the new branch, "target". The code iterates through the parent
commits and compares each of them to this initial changelist using
diff-tree.

Since we already know the tree object name we are looking for, spawning
diff-tree for each commit is wasteful.

Use the "--format" option of "rev-list" to find out the tree object name
of each commit in the history, and find the tree whose name is exactly
the same as the tree of the target commit to optimize this.

This results in a considerable speed-up, at least on Windows. On one
Windows machine with a fairly large repository of about 16000 commits in
the parent branch, the current code takes over 7 minutes, while the new
code only takes just over 10 seconds for the same changelist:

Before:

    $ time git p4 sync
    Importing from/into multiple branches
    Depot paths: //depot
    Importing revision 31274 (100.0%)
    Updated branches: b1

    real    7m41.458s
    user    0m0.000s
    sys     0m0.077s

After:

    $ time git p4 sync
    Importing from/into multiple branches
    Depot paths: //depot
    Importing revision 31274 (100.0%)
    Updated branches: b1

    real    0m10.235s
    user    0m0.000s
    sys     0m0.062s

Signed-off-by: Joachim Kuebart <joachim.kuebart@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Luke Diamand <luke@diamand.org>
@jkuebart
Copy link
Contributor Author

jkuebart commented May 4, 2021

thanks for the ping, I kicked it off.

unfortunately, due to a syntax error I'm going to require yet another run…

@dscho
Copy link
Member

dscho commented May 5, 2021

thanks for the ping, I kicked it off.

unfortunately, due to a syntax error I'm going to require yet another run…

Sorry for the delay...

@jkuebart
Copy link
Contributor Author

jkuebart commented May 5, 2021

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1013.v2.git.git.1620215786.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-1013/jkuebart/p4-faster-parent-v2

To fetch this version to local tag pr-git-1013/jkuebart/p4-faster-parent-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-1013/jkuebart/p4-faster-parent-v2

@gitgitgadget-git
Copy link

This branch is now known as jk/p4-locate-branch-point-optim.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 879323c.

@gitgitgadget-git gitgitgadget-git bot added the seen label May 6, 2021
@gitgitgadget-git
Copy link

This patch series was integrated into seen via 31efcaa.

@gitgitgadget-git
Copy link

There was a status update in the "New Topics" section about the branch jk/p4-locate-branch-point-optim on the Git mailing list:

"git p4" learned to find branch points more efficiently.

Will merge to 'next'.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 902a9ec.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 0fa60c3.

@gitgitgadget-git gitgitgadget-git bot added the next label May 7, 2021
@gitgitgadget-git
Copy link

This patch series was integrated into seen via 1664182.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via afae0c1.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch jk/p4-locate-branch-point-optim on the Git mailing list:

"git p4" learned to find branch points more efficiently.

Will merge to 'master'.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 9ab332c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 588379c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via e289f68.

@gitgitgadget-git
Copy link

This patch series was integrated into next via e289f68.

@gitgitgadget-git
Copy link

This patch series was integrated into master via e289f68.

@gitgitgadget-git
Copy link

Closed via e289f68.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants