Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reftable support git-core #539

Closed
wants to merge 15 commits into from
Closed

Reftable support git-core #539

wants to merge 15 commits into from

Conversation

hanwen
Copy link

@hanwen hanwen commented Jan 23, 2020

This adds the reftable library, and hooks it up as a ref backend.

This is still based on next, as there are conflicts basing on master.

Includes testing support, to test: make -C t/ GIT_TEST_REFTABLE=1

Summary
455 tests fail

v24

  • rebase
  • FETCH_HEAD special casing

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 23, 2020

Welcome to GitGitGadget

Hi @hanwen, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "commit:", and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@hanwen
Copy link
Author

hanwen commented Jan 23, 2020

hi there @dscho !

this probably will fail CI spectacularly, but it does enough to get some first feedback. Could you do the /allow , /submit dance?

@dscho
Copy link
Member

dscho commented Jan 23, 2020

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 23, 2020

User hanwen is now allowed to use GitGitGadget.

@dscho
Copy link
Member

dscho commented Jan 23, 2020

Could you do the /allow , /submit dance?

You'll have to do the /submit yourself, GitGitGadget only allows the PR owner to submit (as only the owner can make subsequent changes as required).

this probably will fail CI spectacularly

It does. I will leave a couple suggestions to help you with that.

@hanwen
Copy link
Author

hanwen commented Jan 23, 2020

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 23, 2020

dscho
dscho previously requested changes Jan 23, 2020
Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of suggestions how you can fix the build. Fixing this will go a long way of attracting reviewers; I, for one, am unlikely to even look at patches that break the build or the test suite.

reftable/file.c Outdated Show resolved Hide resolved
reftable/file.c Outdated
struct file_block_source *b = (struct file_block_source *)v;
assert(off + size <= b->size);
dest->data = malloc(size);
if (pread(b->fd, dest->data, size, off) != size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't compile on Windows, but there is xpread() which you probably want to use, or pread_in_full().

Based on your usage of the pread() function, I strongly suspect that you may want to use the pread_in_full() function instead.

Or do you want to keep the code in reftable/ as independent from the rest of Git's source code as possible? If that is the case, you will have to reimplement the pread_in_full()/xpread() functionality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is

#define pread git_pread

in your compat header. Does that mean I can just keep using pread?

in general, is there a way to compile this for windows without actually rigging up a Windows installation?

reftable/record.c Outdated Show resolved Hide resolved
reftable/record.c Outdated Show resolved Hide resolved
reftable/stack.c Outdated Show resolved Hide resolved
reftable/stack.c Outdated Show resolved Hide resolved
reftable/block.c Outdated Show resolved Hide resolved
@dscho
Copy link
Member

dscho commented Jan 23, 2020

/submit

A little premature, I would think. It does not even compile yet...

@hanwen
Copy link
Author

hanwen commented Jan 23, 2020

The idea is to keep this separate. Thanks for the pointers on portability; I'll have a look.

at this point, I am more interested in resolving the questions in places marked with XXX in the last commit of the chain, though.

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 23, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This adds the reftable library, and hooks it up as a ref backend.

Just a quick impression before getting into details of individual
steps.

 * With this series, the reftable backend seems to take over as the
   default and only backend.  We would need to design and decide how
   repositories would specify which backend it uses (I personally do
   not think we need to allow more than one backend to be active at
   the same time) before we take this series out of RFC status.

 * What's reftable/VERSION file?  Does it really have what you
   intended to add?

 * Mixed indentation and many whitespace breakages make the code
   distracting to review.

 * Comparison with 0 is written as "if (!strcmp(a, b))" in this
   codebase, and never "if (0 != strcmp(a, b))".

 * We unfortunately do not use var defn "for (int i = 0; ..." yet.

 * We do not use // comments.

Thanks.


> Han-Wen Nienhuys (5):
>   setup.c: enable repo detection for reftable
>   create .git/refs in files-backend.c
>   Document how ref iterators and symrefs interact
>   Add reftable library
>   Reftable support for git-core
>
>  Makefile                  |   23 +-
>  builtin/init-db.c         |    2 -
>  refs.c                    |   18 +-
>  refs.h                    |    2 +
>  refs/files-backend.c      |    4 +
>  refs/refs-internal.h      |    4 +
>  refs/reftable-backend.c   |  780 ++++++++++++++++++++++++++++
>  reftable/README.md        |   12 +
>  reftable/VERSION          |  293 +++++++++++
>  reftable/basics.c         |  180 +++++++
>  reftable/basics.h         |   38 ++
>  reftable/block.c          |  382 ++++++++++++++
>  reftable/block.h          |   71 +++
>  reftable/block_test.c     |  145 ++++++
>  reftable/blocksource.h    |   20 +
>  reftable/bytes.c          |    0
>  reftable/constants.h      |   27 +
>  reftable/dump.c           |  102 ++++
>  reftable/file.c           |   98 ++++
>  reftable/iter.c           |  211 ++++++++
>  reftable/iter.h           |   56 ++
>  reftable/merged.c         |  262 ++++++++++
>  reftable/merged.h         |   34 ++
>  reftable/merged_test.c    |  247 +++++++++
>  reftable/pq.c             |  116 +++++
>  reftable/pq.h             |   34 ++
>  reftable/reader.c         |  672 ++++++++++++++++++++++++
>  reftable/reader.h         |   52 ++
>  reftable/record.c         | 1030 +++++++++++++++++++++++++++++++++++++
>  reftable/record.h         |   79 +++
>  reftable/record_test.c    |  313 +++++++++++
>  reftable/reftable.h       |  396 ++++++++++++++
>  reftable/reftable_test.c  |  434 ++++++++++++++++
>  reftable/slice.c          |  179 +++++++
>  reftable/slice.h          |   39 ++
>  reftable/slice_test.c     |   38 ++
>  reftable/stack.c          |  931 +++++++++++++++++++++++++++++++++
>  reftable/stack.h          |   40 ++
>  reftable/stack_test.c     |  265 ++++++++++
>  reftable/test_framework.c |   60 +++
>  reftable/test_framework.h |   63 +++
>  reftable/tree.c           |   60 +++
>  reftable/tree.h           |   24 +
>  reftable/tree_test.c      |   54 ++
>  reftable/writer.c         |  586 +++++++++++++++++++++
>  reftable/writer.h         |   46 ++
>  setup.c                   |   20 +-
>  47 files changed, 8530 insertions(+), 12 deletions(-)
>  create mode 100644 refs/reftable-backend.c
>  create mode 100644 reftable/README.md
>  create mode 100644 reftable/VERSION
>  create mode 100644 reftable/basics.c
>  create mode 100644 reftable/basics.h
>  create mode 100644 reftable/block.c
>  create mode 100644 reftable/block.h
>  create mode 100644 reftable/block_test.c
>  create mode 100644 reftable/blocksource.h
>  create mode 100644 reftable/bytes.c
>  create mode 100644 reftable/constants.h
>  create mode 100644 reftable/dump.c
>  create mode 100644 reftable/file.c
>  create mode 100644 reftable/iter.c
>  create mode 100644 reftable/iter.h
>  create mode 100644 reftable/merged.c
>  create mode 100644 reftable/merged.h
>  create mode 100644 reftable/merged_test.c
>  create mode 100644 reftable/pq.c
>  create mode 100644 reftable/pq.h
>  create mode 100644 reftable/reader.c
>  create mode 100644 reftable/reader.h
>  create mode 100644 reftable/record.c
>  create mode 100644 reftable/record.h
>  create mode 100644 reftable/record_test.c
>  create mode 100644 reftable/reftable.h
>  create mode 100644 reftable/reftable_test.c
>  create mode 100644 reftable/slice.c
>  create mode 100644 reftable/slice.h
>  create mode 100644 reftable/slice_test.c
>  create mode 100644 reftable/stack.c
>  create mode 100644 reftable/stack.h
>  create mode 100644 reftable/stack_test.c
>  create mode 100644 reftable/test_framework.c
>  create mode 100644 reftable/test_framework.h
>  create mode 100644 reftable/tree.c
>  create mode 100644 reftable/tree.h
>  create mode 100644 reftable/tree_test.c
>  create mode 100644 reftable/writer.c
>  create mode 100644 reftable/writer.h
>
>
> base-commit: 232378479ee6c66206d47a9be175e3a39682aea6
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-539%2Fhanwen%2Freftable-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-539/hanwen/reftable-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/539

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 23, 2020

On the Git mailing list, Stephan Beyer wrote (reply to this):

Hi Han-Wen,

On 1/23/20 8:41 PM, Han-Wen Nienhuys via GitGitGadget wrote:
> This adds the reftable library, and hooks it up as a ref backend.
>
> Han-Wen Nienhuys (5):
>   setup.c: enable repo detection for reftable
>   create .git/refs in files-backend.c
>   Document how ref iterators and symrefs interact
>   Add reftable library
>   Reftable support for git-core

I am most of the time just a curious reader on this list but as someone
who has no idea what the reftable library does (except that it can serve
as a "ref backend"), I would expect much more elaborate commit messages.
In particular, patch 4/5 (i.e. the commit message of the "add reftable
library" commit) should describe the purpose of the reftable library at
least briefly; and patch 5/5 should not contain shell commands and
output without context but a short description why the patch is doing
what it is doing. For example, if the use of the reftable library makes
certain git commands a lot faster, it should state that.

Thanks.

Stephan

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2020

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Thu, Jan 23, 2020 at 10:44 PM Junio C Hamano <gitster@pobox.com> wrote:
> > This adds the reftable library, and hooks it up as a ref backend.
>
> Just a quick impression before getting into details of individual
> steps.
>
>  * With this series, the reftable backend seems to take over as the
>    default and only backend.  We would need to design and decide how
>    repositories would specify which backend it uses (I personally do
>    not think we need to allow more than one backend to be active at
>    the same time) before we take this series out of RFC status.

Yes, obviously. Would you have concrete ideas on how this should work?

>  * What's reftable/VERSION file?  Does it really have what you
>    intended to add?

Fixed, and explained in reftable/VERSION.

>  * Mixed indentation and many whitespace breakages make the code
>    distracting to review.

Do you mean in the changes to refs/reftable-backend.c? Or the imported code?

The imported code is uniformly formatted with clang-format.

Is there a clang-format setting for uniformly formatting to Linux/Git
style? I really love automated formatting, so we don't have to spend
time debating irrelevant details.


>
>  * Comparison with 0 is written as "if (!strcmp(a, b))" in this
>    codebase, and never "if (0 != strcmp(a, b))".
>
>  * We unfortunately do not use var defn "for (int i = 0; ..." yet.
>
>  * We do not use // comments.
>

fixed.

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2020

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

I added some more background in one of the commit messages. You can
find extensive discussion here:

  https://github.com/google/reftable#background-reading


On Thu, Jan 23, 2020 at 11:45 PM Stephan Beyer <s-beyer@gmx.net> wrote:
>
> Hi Han-Wen,
>
> On 1/23/20 8:41 PM, Han-Wen Nienhuys via GitGitGadget wrote:
> > This adds the reftable library, and hooks it up as a ref backend.
> >
> > Han-Wen Nienhuys (5):
> >   setup.c: enable repo detection for reftable
> >   create .git/refs in files-backend.c
> >   Document how ref iterators and symrefs interact
> >   Add reftable library
> >   Reftable support for git-core
>
> I am most of the time just a curious reader on this list but as someone
> who has no idea what the reftable library does (except that it can serve
> as a "ref backend"), I would expect much more elaborate commit messages.
> In particular, patch 4/5 (i.e. the commit message of the "add reftable
> library" commit) should describe the purpose of the reftable library at
> least briefly; and patch 5/5 should not contain shell commands and
> output without context but a short description why the patch is doing
> what it is doing. For example, if the use of the reftable library makes
> certain git commands a lot faster, it should state that.
>
> Thanks.
>
> Stephan

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2020

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Mon, Jan 27, 2020 at 2:52 PM Han-Wen Nienhuys <hanwen@xs4all.nl> wrote:
> >  * Mixed indentation and many whitespace breakages make the code
> >    distracting to review.
>
> Do you mean in the changes to refs/reftable-backend.c? Or the imported code?
>
> The imported code is uniformly formatted with clang-format.
>
> Is there a clang-format setting for uniformly formatting to Linux/Git
> style? I really love automated formatting, so we don't have to spend
> time debating irrelevant details.

Ah, there is a .clang-format entry in the git repository already. Thanks!

@hanwen
Copy link
Author

hanwen commented Jan 27, 2020

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2020

@hanwen hanwen requested a review from dscho January 27, 2020 19:16
@@ -226,8 +226,6 @@ static int create_default_files(const char *template_path,
* We need to create a "refs" dir in any case so that older
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Han-Wen Nienhuys <hanwen@google.com>
> Subject: Re: [PATCH v2 2/5] create .git/refs in files-backend.c
>
> This prepares for supporting the reftable format, which creates a file
> in that place.

The idea is sound, I think.  We want to let each backend to be
responsible for creating and maintaining what is at .git/refs on the
filesystem.

> Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>


> Change-Id: I2fc47c89f5ec605734007ceff90321c02474aa92

Do we need to keep this, which is pretty much private name for the
patch that is not valid for most of the people on the list?

> ---
>  builtin/init-db.c    | 2 --
>  refs/files-backend.c | 4 ++++
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/init-db.c b/builtin/init-db.c
> index 944ec77fe1..45bdea0589 100644
> --- a/builtin/init-db.c
> +++ b/builtin/init-db.c
> @@ -226,8 +226,6 @@ static int create_default_files(const char *template_path,
>  	 * We need to create a "refs" dir in any case so that older
>  	 * versions of git can tell that this is a repository.
>  	 */
> -	safe_create_dir(git_path("refs"), 1);
> -	adjust_shared_perm(git_path("refs"));
>  
>  	if (refs_init_db(&err))
>  		die("failed to set up refs db: %s", err.buf);
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 0ea66a28b6..f49b6f2ab6 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -3158,6 +3158,10 @@ static int files_init_db(struct ref_store *ref_store, struct strbuf *err)
>  		files_downcast(ref_store, REF_STORE_WRITE, "init_db");
>  	struct strbuf sb = STRBUF_INIT;
>  
> +	files_ref_path(refs, &sb, "refs");
> +	safe_create_dir(sb.buf, 1);
> + 	// XXX adjust_shared_perm ?

I am not sure what's there to wonder about with the question mark.

If this step is meant to be a preparation before we actually allow
a different backend to be used, shouldn't the updated and prepared
code behave identically in externally visible ways?

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Mon, Jan 27, 2020 at 11:28 PM Junio C Hamano <gitster@pobox.com> wrote:

> > Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
>
>
> > Change-Id: I2fc47c89f5ec605734007ceff90321c02474aa92
>
> Do we need to keep this, which is pretty much private name for the
> patch that is not valid for most of the people on the list?

No, I can make do with like gitgitgadget.

> > -     safe_create_dir(git_path("refs"), 1);
> > -     adjust_shared_perm(git_path("refs"));
> >
> >       if (refs_init_db(&err))
> >               die("failed to set up refs db: %s", err.buf);
> > diff --git a/refs/files-backend.c b/refs/files-backend.c
> > index 0ea66a28b6..f49b6f2ab6 100644
> > --- a/refs/files-backend.c
> > +++ b/refs/files-backend.c
> > @@ -3158,6 +3158,10 @@ static int files_init_db(struct ref_store *ref_store, struct strbuf *err)
> >               files_downcast(ref_store, REF_STORE_WRITE, "init_db");
> >       struct strbuf sb = STRBUF_INIT;
> >
> > +     files_ref_path(refs, &sb, "refs");
> > +     safe_create_dir(sb.buf, 1);
> > +     // XXX adjust_shared_perm ?
>
> I am not sure what's there to wonder about with the question mark.

I forgot why I put the XXX, but note that safe_create_dirs runs
adjust_shared_perms implicitly.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Han-Wen Nienhuys <hanwen@google.com> writes:

>> > +     files_ref_path(refs, &sb, "refs");
>> > +     safe_create_dir(sb.buf, 1);
>> > +     // XXX adjust_shared_perm ?
>
> I forgot why I put the XXX, but note that safe_create_dirs runs
> adjust_shared_perms implicitly.

That piece of information would make an excellent part of the log
message, which is to describe why "questionable" changes are made.

"git init" is designed to be runnable in an existing repository, but
safe_create_dir() would not run adjust_shared_perm() when mkdir()
says EEXIST, and I think that is why the original makes a call to
adjust_shared_perm() after safe_create_dir() returns.

Thanks.


@@ -269,6 +269,9 @@ int refs_rename_ref_available(struct ref_store *refs,
* to the next entry, ref_iterator_advance() aborts the iteration,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Han-Wen Nienhuys <hanwen@google.com>
> Subject: Re: [PATCH v2 3/5] Document how ref iterators and symrefs interact

"Subject: refs: document how ...", perhaps?

Also isn't it more like iterators and symrefs do not interact in any
unexpected way, is it?  iterators while enumerating refs when they
see a symref, they do not dereference and give the underlying ref
the symref is pointing at---the underlying ref will be listed when
it comes its turn to be shown as an ordinary ref.  I am not sure
what is there to single out and document...

> Change-Id: Ie3ee63c52254c000ef712986246ca28f312b4301
> Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
> ---
>  refs/refs-internal.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/refs/refs-internal.h b/refs/refs-internal.h
> index ff2436c0fb..fc18b12340 100644
> --- a/refs/refs-internal.h
> +++ b/refs/refs-internal.h
> @@ -269,6 +269,9 @@ int refs_rename_ref_available(struct ref_store *refs,
>   * to the next entry, ref_iterator_advance() aborts the iteration,
>   * frees the ref_iterator, and returns ITER_ERROR.
>   *
> + * Ref iterators cannot return symref targets, so symbolic refs must be
> + * dereferenced during the iteration.

What this says is not techincally incorrect.  Iterators do not give
each_ref_fn what underlying ref a symref is pointing at.  But it
also is misleading.  If your callback wants to know what object each
ref is pointing at do not need to do anything extra when it sees a
symref, as name of the object pointed at by the underlying ref is
given to it.  Only callbacks that wants to know the other ref, not
the value of the other ref, needs to dereference when called with a
symref.

But I wonder if we need to even say this.  Isn't it obvious from the
each_ref_fn API that nothing other than the refname, object id, and
what kind of ref it is, will be given to the user of the API, so it
would be natural for a caller that wants to do extra things it needs
to do for symrefs must act when it learns a ref is a symref?  After
all, that is why the flags word is one of the parameters given to an
each_ref_fn.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Mon, Jan 27, 2020 at 11:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Han-Wen Nienhuys via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Han-Wen Nienhuys <hanwen@google.com>
> > Subject: Re: [PATCH v2 3/5] Document how ref iterators and symrefs interact
>
> "Subject: refs: document how ...", perhaps?
>
> Also isn't it more like iterators and symrefs do not interact in any
> unexpected way, is it?  iterators while enumerating refs when they
> see a symref, they do not dereference and give the underlying ref
> the symref is pointing at---the underlying ref will be listed when
> it comes its turn to be shown as an ordinary ref.  I am not sure
> what is there to single out and document...

I mean, you can't return a zero OID and set REF_ISSYMLINK. Instead,
the ref backend has to follow the symlink recursively until it finds a
ref pointing to a OID.

> > Change-Id: Ie3ee63c52254c000ef712986246ca28f312b4301
> > Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
> > ---
> >  refs/refs-internal.h | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/refs/refs-internal.h b/refs/refs-internal.h
> > index ff2436c0fb..fc18b12340 100644
> > --- a/refs/refs-internal.h
> > +++ b/refs/refs-internal.h
> > @@ -269,6 +269,9 @@ int refs_rename_ref_available(struct ref_store *refs,
> >   * to the next entry, ref_iterator_advance() aborts the iteration,
> >   * frees the ref_iterator, and returns ITER_ERROR.
> >   *
> > + * Ref iterators cannot return symref targets, so symbolic refs must be
> > + * dereferenced during the iteration.
>
> What this says is not techincally incorrect.  Iterators do not give
> each_ref_fn what underlying ref a symref is pointing at.  But it
> also is misleading.  If your callback wants to know what object each
> ref is pointing at do not need to do anything extra when it sees a
> symref, as name of the object pointed at by the underlying ref is
> given to it.  Only callbacks that wants to know the other ref, not
> the value of the other ref, needs to dereference when called with a
> symref.
>
> But I wonder if we need to even say this.  Isn't it obvious from the
> each_ref_fn API that nothing other than the refname, object id, and
> what kind of ref it is, will be given to the user of the API, so it
> would be natural for a caller that wants to do extra things it needs
> to do for symrefs must act when it learns a ref is a symref?  After
> all, that is why the flags word is one of the parameters given to an
> each_ref_fn.

Maybe it is obvious to you, but it was not obvious to me, coming from
JGit working on the RefDatabase. One could imagine that the caller
needs to do something special to handle a symref, and returning a zero
object ID is OK, and this is a natural assumption if you implement a
ref backend, as dereferencing the symref to find out the final OID may
do more work than the caller needs.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Han-Wen Nienhuys <hanwen@google.com> writes:

>> But I wonder if we need to even say this.  Isn't it obvious from the
>> each_ref_fn API that nothing other than the refname, object id, and
>> what kind of ref it is, will be given to the user of the API, so it
>> would be natural for a caller that wants to do extra things it needs
>> to do for symrefs must act when it learns a ref is a symref?  After
>> all, that is why the flags word is one of the parameters given to an
>> each_ref_fn.
>
> Maybe it is obvious to you, but it was not obvious to me, coming from
> JGit working on the RefDatabase.

If you found each_ref_fn API docs lacking, that probably is where
the information should go (i.e. regardless of type of refs, the
object name is filled correctly).  We need to make it obvious to the
developers who needs to use and work on the code with the API
defined and described in refs.h

Thanks.


@@ -813,6 +813,7 @@ TEST_SHELL_PATH = $(SHELL_PATH)
LIB_FILE = libgit.a
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Jan 27, 2020 at 02:22:24PM +0000, Han-Wen Nienhuys via GitGitGadget wrote:

> [...]

I'll try to add my 2 cents to all of the XXX spots you flagged.

> @@ -1839,10 +1839,22 @@ static struct ref_store *lookup_ref_store_map(struct hashmap *map,
>  static struct ref_store *ref_store_init(const char *gitdir,
>  					unsigned int flags)
>  {
> -	const char *be_name = "files";
> -	struct ref_storage_be *be = find_ref_storage_backend(be_name);
> +	struct strbuf refs_path = STRBUF_INIT;
> +
> +        /* XXX this should probably come from a git config setting and not
> +           default to reftable. */
> +	const char *be_name = "reftable";

I think the scheme here needs to be something like:

  - "struct repository" gets a new "ref_format" field

  - setup.c:check_repo_format() learns about an extensions.refFormat
    config key, which we use to set repo->ref_format

  - init/clone should take a command-line option for the ref format of
    the new repository. Anybody choosing reftables would want to set
    core.repositoryformatversion to "1" and set the extensions.refFormat
    key.

  - likewise, there should be a config setting to choose the default ref
    format. It would obviously stay at "files" for now, but we'd
    eventually perhaps flip the default to "reftable".

Some thoughts on compatibility:

The idea of the config changes is that older versions of Git will either
complain about repositoryformatversion (if very old), or will complain
that they don't know about extensions.refFormat. But the changes you
made in patch 1 indicate that existing versions of Git won't consider it
to be a Git repository at all!

I think this is slightly non-ideal, in that we'd continue walking up the
tree looking for a Git repo. And either:

  - we'd find one, which would be confusing and possibly wrong

  - we wouldn't, in which case the error would be something like "no git
    repository", and not "your git repository is too new"

So it would be really nice if we continued to have a HEAD file (just
with some sentinel value in it) and a refs/ directory, so that existing
versions of Git realize "there's a repository here, but it's too new for
me".

There's a slight downside in that tools which _aren't_ careful about
repositoryformatversion might try to operate on the repository, writing
into refs/ or whatever. But such tools are broken, and I'm not sure we
should be catering to them (besides which, the packed-refs ship sailed
long ago, so they're already potentially dangerous).

> +/* XXX which ordering are these? Newest or oldest first? */
>  int for_each_reflog_ent(const char *refname, each_reflog_ent_fn fn, void *cb_data);
>  int for_each_reflog_ent_reverse(const char *refname, each_reflog_ent_fn fn, void *cb_data);

The top one is chronological order (i.e., reading the file from start to
finish), and the latter is reverse-chronological (seeking backwards in
the file).

> +static struct ref_iterator *
> +reftable_ref_iterator_begin(struct ref_store *ref_store, const char *prefix,
> +			    unsigned int flags)
> +{
> +	struct reftable_ref_store *refs =
> +		(struct reftable_ref_store *)ref_store;
> +	struct reftable_iterator *ri = xcalloc(1, sizeof(*ri));
> +	struct merged_table *mt = NULL;
> +	int err = 0;
> +	if (refs->err) {
> +		/* how to propagate errors? */
> +		return NULL;
> +	}
> +
> +	mt = stack_merged_table(refs->stack);
> +
> +	/* XXX something with flags? */

I think the flags here are DO_FOR_EACH_*. You might need to consider
INCLUDE_BROKEN here (or it might be dealt with at a layer above).

There's also PER_WORKTREE_ONLY; I think you'd need to check ref_type()
there, like the files backend does.

> +	err = merged_table_seek_ref(mt, &ri->iter, prefix);
> +	/* XXX what to do with err? */

Hmm, I don't think you can return an error from iterator_begin(). It
would probably be OK to record the error in your local struct, and then
later return ITER_ERROR from iterator_advance().

I notice that the more-recent dir_iterator_begin() returns NULL in case
of an error, but looking at the callers of the ref iterators, they don't
seem to be ready to handle NULL.

> +static int write_transaction_table(struct writer *writer, void *arg)
> +{
> +	struct ref_transaction *transaction = (struct ref_transaction *)arg;
> +	struct reftable_ref_store *refs =
> +		(struct reftable_ref_store *)transaction->ref_store;
> +	uint64_t ts = stack_next_update_index(refs->stack);
> +	int err = 0;
> +	/* XXX - are we allowed to mutate the input data? */
> +	qsort(transaction->updates, transaction->nr,
> +	      sizeof(struct ref_update *), ref_update_cmp);

I don't offhand know of anything that would break, but it's probably a
bad idea to do so. If you need a sorted view, can you make an auxiliary
array-of-pointers? Also, the QSORT() macro is a little shorter and has
some extra checks.

> +	for (int i = 0; i < transaction->nr; i++) {
> +		struct ref_update *u = transaction->updates[i];
> +		if (u->flags & REF_HAVE_NEW) {
> +			struct object_id out_oid = {};
> +			int out_flags = 0;
> +			/* XXX who owns the memory here? */
> +			const char *resolved = refs_resolve_ref_unsafe(
> +				transaction->ref_store, u->refname, 0, &out_oid,
> +				&out_flags);

In the "unsafe" version, the memory belongs to a static buffer inside
the function. You shouldn't need to free it.

> +static int
> +reftable_reflog_ref_iterator_advance(struct ref_iterator *ref_iterator)
> [...]
> +		/* XXX const? */
> +		memcpy(&ri->oid.hash, ri->log.new_hash, GIT_SHA1_RAWSZ);

You'd want probably want to use hashcpy() here (or better yet, use
"struct object_id" in ri->log, too, and then use oidcpy()).

But that raises a question: how ready are reftables to handle non-sha1
object ids? I see a lot of GIT_SHA1_RAWSZ, and I think the on-disk
format actually has binary sha1s, right? In theory if those all become
the_hash_algo->rawsz, then it might "Just Work" to read and write
slightly larger entries.

> +reftable_for_each_reflog_ent_newest_first(struct ref_store *ref_store,
> +					  const char *refname,
> +					  each_reflog_ent_fn fn, void *cb_data)
> [...]
> +			/* XXX committer = email? name? */
> +			if (fn(&old_oid, &new_oid, log.name, log.time,
> +			       log.tz_offset, log.message, cb_data)) {
> +				err = -1;
> +				break;
> +			}

The committer entry we pass back to each_reflog_ent_fn should be the
full "Human Name <email@host>".

> +static int reftable_reflog_expire(struct ref_store *ref_store,
> +				  const char *refname,
> +				  const struct object_id *oid,
> +				  unsigned int flags,
> +				  reflog_expiry_prepare_fn prepare_fn,
> +				  reflog_expiry_should_prune_fn should_prune_fn,
> +				  reflog_expiry_cleanup_fn cleanup_fn,
> +				  void *policy_cb_data)
> +{
> +	/*
> +	  XXX
> +
> +	  This doesn't fit with the reftable API. If we are expiring for space
> +	  reasons, the expiry should be combined with a compaction, and there
> +	  should be a policy that can be called for all refnames, not for a
> +	  single ref name.
> +
> +	  If this is for cleaning up individual entries, we'll have to write
> +	  extra data to create tombstones.
> +	 */
> +	return 0;
> +}

I agree that we'd generally want to expire and compact all entries at
once. Unfortunately I think this per-ref interface is exposed by
git-reflog. I.e., you can do "git reflog expire refs/heads/foo".

Could we add an extra API call for "expire everything", and then:

  - have refs/files-backend.c implement that by just iterating over all
    of the refs and pruning them one by one

  - have builtin/reflog.c trigger the "expire everything" API when it
    sees "--all"

  - teach refs/reftables-backend.c a crappy version of the per-ref
    expiration, where it just inserts tombstones but doesn't do any
    compaction. It doesn't have to be fast or produce a good output;
    it's just for compatibility.

    Alternatively, I'd be OK with die("sorry, reftables doesn't support
    per-ref reflog expiration") as a first step.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Martin Fick wrote (reply to this):

On Tuesday, January 28, 2020 2:31:00 AM MST Jeff King wrote:
> Some thoughts on compatibility:
> 
> The idea of the config changes is that older versions of Git will either
> complain about repositoryformatversion (if very old), or will complain
> that they don't know about extensions.refFormat. But the changes you
> made in patch 1 indicate that existing versions of Git won't consider it
> to be a Git repository at all!
> 
> I think this is slightly non-ideal, in that we'd continue walking up the
> tree looking for a Git repo. And either:
> 
>   - we'd find one, which would be confusing and possibly wrong
> 
>   - we wouldn't, in which case the error would be something like "no git
>     repository", and not "your git repository is too new"
> 
> So it would be really nice if we continued to have a HEAD file (just
> with some sentinel value in it) and a refs/ directory, so that existing
> versions of Git realize "there's a repository here, but it's too new for
> me".
> 
> There's a slight downside in that tools which _aren't_ careful about
> repositoryformatversion might try to operate on the repository, writing
> into refs/ or whatever. But such tools are broken, and I'm not sure we
> should be catering to them (besides which, the packed-refs ship sailed
> long ago, so they're already potentially dangerous).

Could you elaborate on this a bit because it seems on the surface that these 
tools aren't very dangerous, and therefore likely many of them exist?

What are the dangers today of tools that understand/operate on loose and 
packed refs without reading repositoryformatversion?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Tue, Jan 28, 2020 at 8:31 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jan 27, 2020 at 02:22:24PM +0000, Han-Wen Nienhuys via GitGitGadget wrote:
>
> > [...]
>
> I'll try to add my 2 cents to all of the XXX spots you flagged.
>

thanks! that was really helpful.

>
> Some thoughts on compatibility:
>
> The idea of the config changes is that older versions of Git will either
> complain about repositoryformatversion (if very old), or will complain
> that they don't know about extensions.refFormat. But the changes you
> made in patch 1 indicate that existing versions of Git won't consider it
> to be a Git repository at all!
>
> I think this is slightly non-ideal, in that we'd continue walking up the
> tree looking for a Git repo. And either:
>
>   - we'd find one, which would be confusing and possibly wrong
>
>   - we wouldn't, in which case the error would be something like "no git
>     repository", and not "your git repository is too new"
>
> So it would be really nice if we continued to have a HEAD file (just
> with some sentinel value in it) and a refs/ directory, so that existing
> versions of Git realize "there's a repository here, but it's too new for
> me".

You have good points.

JGit currently implements what we have here, as this is what's spelled
out in the spec that Shawn posted  back in the day. It's probably
acceptable to this, though, as the reftable support has only landed in
JGit very recently and will probably appear very experimental to
folks.

How would the layout be then? We'll have

  HEAD - dummy file
  reftable/ - the tables
  refs/ - dummy dir

where shall we store the reftable list? maybe in a file called

  reftable-list

If we have both HEAD/refs + (refable/reftable-list), what should we
put there to ensure that no git version actually manages to use the
repository? (what happens if someone deletes the version setting from
the .git/config file)

> > +/* XXX which ordering are these? Newest or oldest first? */
> >  int for_each_reflog_ent(const char *refname, each_reflog_ent_fn fn, void *cb_data);
> >  int for_each_reflog_ent_reverse(const char *refname, each_reflog_ent_fn fn, void *cb_data);
>
> The top one is chronological order (i.e., reading the file from start to
> finish), and the latter is reverse-chronological (seeking backwards in
> the file).

thanks, I put in a commit to clarify this.

> > +     err = merged_table_seek_ref(mt, &ri->iter, prefix);
> > +     /* XXX what to do with err? */
>
> Hmm, I don't think you can return an error from iterator_begin(). It
> would probably be OK to record the error in your local struct, and then
> later return ITER_ERROR from iterator_advance().

good point. I did that.

> > +     /* XXX - are we allowed to mutate the input data? */
> > +     qsort(transaction->updates, transaction->nr,
> > +           sizeof(struct ref_update *), ref_update_cmp);
>
> I don't offhand know of anything that would break, but it's probably a
> bad idea to do so. If you need a sorted view, can you make an auxiliary
> array-of-pointers? Also, the QSORT() macro is a little shorter and has
> some extra checks.

Done.

> In the "unsafe" version, the memory belongs to a static buffer inside
> the function. You shouldn't need to free it.


OK.

> > +static int
> > +reftable_reflog_ref_iterator_advance(struct ref_iterator *ref_iterator)
> > [...]
> > +             /* XXX const? */
> > +             memcpy(&ri->oid.hash, ri->log.new_hash, GIT_SHA1_RAWSZ);
>
> You'd want probably want to use hashcpy() here (or better yet, use
> "struct object_id" in ri->log, too, and then use oidcpy()).

It's more practical in the library to use pointers rather than fixed
size arrays, so it'll be hashcpy.

> But that raises a question: how ready are reftables to handle non-sha1
> object ids? I see a lot of GIT_SHA1_RAWSZ, and I think the on-disk
> format actually has binary sha1s, right? In theory if those all become
> the_hash_algo->rawsz, then it might "Just Work" to read and write
> slightly larger entries.

The format fixes the reftable at 20 bytes, and there is not enough
framing information to just write more data. We'll have to encode the
hash size in the version number somehow, eg. we could use the  higher
order bit of the version byte to encode it, for example.

But it needs a new version of the spec. I think it's premature to do
this while v1 of reftable isn't in git-core yet.


> The committer entry we pass back to each_reflog_ent_fn should be the
> full "Human Name <email@host>".

thx, done.

> I agree that we'd generally want to expire and compact all entries at
> once. Unfortunately I think this per-ref interface is exposed by
> git-reflog. I.e., you can do "git reflog expire refs/heads/foo".
>
> Could we add an extra API call for "expire everything", and then:

I took note of your remarks, and added a BUG() for now.

I'll focus on getting the CI on gitgitgadget to build this code first.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 28, 2020 at 08:36:53AM -0700, Martin Fick wrote:

> > There's a slight downside in that tools which _aren't_ careful about
> > repositoryformatversion might try to operate on the repository, writing
> > into refs/ or whatever. But such tools are broken, and I'm not sure we
> > should be catering to them (besides which, the packed-refs ship sailed
> > long ago, so they're already potentially dangerous).
> 
> Could you elaborate on this a bit because it seems on the surface that these 
> tools aren't very dangerous, and therefore likely many of them exist?
> 
> What are the dangers today of tools that understand/operate on loose and 
> packed refs without reading repositoryformatversion?

I was mostly thinking of hacky scripts that tried to touch .git/refs
directly. And there are a few levels of dangerous there:

  - if you're doing "echo $sha1 >.git/refs/heads/master", then you're
    not locking properly. But it would probably work most of the time.

  - if you're properly taking a lock on ".git/refs/heads/master.lock"
    and renaming into place but not looking at packed-refs, then you
    might overwrite somebody else's update which is in the packed file

  - if you're trying to read refs and not reading packed-refs, obviously
    you might miss some values

If you're actually doing the correct locking and packed-refs read (which
"real" implementations like libgit2 do) then no, I don't think that's
dangerous. And I think libgit2 properly complains when it sees a
repositoryformatversion above 0. I don't know offhand about JGit, or any
of the lesser-used ones like dulwich or go-git.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 28, 2020 at 04:56:26PM +0100, Han-Wen Nienhuys wrote:

> JGit currently implements what we have here, as this is what's spelled
> out in the spec that Shawn posted  back in the day. It's probably
> acceptable to this, though, as the reftable support has only landed in
> JGit very recently and will probably appear very experimental to
> folks.
> 
> How would the layout be then? We'll have
> 
>   HEAD - dummy file
>   reftable/ - the tables
>   refs/ - dummy dir
> 
> where shall we store the reftable list? maybe in a file called
> 
>   reftable-list
> 
> If we have both HEAD/refs + (refable/reftable-list), what should we
> put there to ensure that no git version actually manages to use the
> repository? (what happens if someone deletes the version setting from
> the .git/config file)

Yeah, it would be nice to have something that an older version of Git
would totally choke on, but I'm not sure we have a lot of leeway. What
we put in HEAD has to be syntactically legitimate enough to appease
validate_headref(), so our options are either "ref:
refs/something/bogus" or an object hash that we don't have (e.g.,
0{40}). The former would be preferable because it would (in theory)
prevent us from writing to HEAD, as well.

I wondered what would happen if you put in a syntactically invalid ref,
like "ref: refs/.not/.valid" (leading dots are not allowed in path
components of refnames). It does cause _some_ parts of Git to choke, but
sadly "git update-ref HEAD $sha1" actually writes to .git/refs/.not/.valid.

Even "refs/../../dangerous" doesn't give it pause. Yikes. It seems we're
pretty willing to accept symref destinations without further checking.

Making "refs" a file instead of a directory does work nicely, as any
attempts to read or write would get ENOTDIR. And we can fool
is_git_directory() as long as it's marked executable. That's OK on POSIX
systems, but I'm not sure how it would work on Windows (or maybe it
would work just fine, since we presumably just say "yep, everything is
executable").

So perhaps that's enough, and what we put in HEAD won't matter (since
nobody will be able to write into refs/ anyway).

> > But that raises a question: how ready are reftables to handle non-sha1
> > object ids? I see a lot of GIT_SHA1_RAWSZ, and I think the on-disk
> > format actually has binary sha1s, right? In theory if those all become
> > the_hash_algo->rawsz, then it might "Just Work" to read and write
> > slightly larger entries.
> 
> The format fixes the reftable at 20 bytes, and there is not enough
> framing information to just write more data. We'll have to encode the
> hash size in the version number somehow, eg. we could use the  higher
> order bit of the version byte to encode it, for example.
> 
> But it needs a new version of the spec. I think it's premature to do
> this while v1 of reftable isn't in git-core yet.

I don't know that we technically need the reftables file to say how long
the hashes are. The git config will tell us which hash we're using, and
everything else is supposed to follow. So I think it would work OK as
long as you're able to be told by the rest of Git that hashes are N
bytes, and just use that to compute the fixed-size records.

That said, it might make for easier debugging if the reftables file
declares the size it assumes.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Feb 04, 2020 at 08:06:57PM +0100, Han-Wen Nienhuys wrote:

> On Wed, Jan 29, 2020 at 11:47 AM Jeff King <peff@peff.net> wrote:
> 
> > That said, it might make for easier debugging if the reftables file
> > declares the size it assumes.
> 
> If there is a mismatch in size, the reftable will look completely
> corrupted data.  I think this will be a bad experience.

Yes, but I think it would take a bunch of other failures to get there.
And it's true of all of the other parts of Git, too; the master switch
is a config setting that says "I'm a sha-256 repository".

That said, I'm not at all opposed to a header in the reftables file
saying "I'm a sha-256 reftable". We shouldn't ever see a mismatch there,
but that's what I meant by "easier debugging".

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Feb 04, 2020 at 07:54:12PM +0100, Han-Wen Nienhuys wrote:

> > PS I don't know if it's set in stone yet, but if we're changing things
> >    around, is it an option to put reftable-list inside the reftable
> >    directory, just to keep the top-level uncluttered?
> 
> I put up https://git.eclipse.org/r/c/157167/ to update the spec inside
> JGit. Please review.

That looks quite reasonable. The one change I'd make is to put
"refs/heads/.invalid" into HEAD (instead of "refs/.invalid"). In my
experiments, existing Git is happy to write into an invalid refname via
"git update-ref HEAD". But by putting it past the non-directory "heads",
such a write would always fail.

It's also supposed to be a requirement that HEAD only puts to
refs/heads/, though our validate_headref() does not enforce that. I
don't know if any other implementations might, though.

I did compare earlier your "make refs/heads a file" versus "make refs/ a
file and just give it an executable bit". And I think the "refs/heads"
solution does behave slightly better in a few cases. But if I understand
correctly, just giving "refs/" an executable bit would mean we retain
compatibility with the existing JGit implementation.

It does end up playing games with permissions, which I warned against,
but I think it might be tolerable:

  1. Unlike read vs write, there's not generally a reason users would
     need to separate read and execute.

  2. Portability-wise, POSIX permissions give us what we need, and
     non-POSIX systems tend to just say everything is executable.

So I could also live with that direction, if switching the JGit behavior
at this point is too much of a pain.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Tue, Jan 28, 2020 at 8:31 AM Jeff King <peff@peff.net> wrote:
>
> >  {
> > -     const char *be_name = "files";
> > -     struct ref_storage_be *be = find_ref_storage_backend(be_name);
> > +     struct strbuf refs_path = STRBUF_INIT;
> > +
> > +        /* XXX this should probably come from a git config setting and not
> > +           default to reftable. */
> > +     const char *be_name = "reftable";
>
> I think the scheme here needs to be something like:
>
>   - "struct repository" gets a new "ref_format" field
>
>   - setup.c:check_repo_format() learns about an extensions.refFormat
>     config key, which we use to set repo->ref_format
>
>   - init/clone should take a command-line option for the ref format of
>     the new repository. Anybody choosing reftables would want to set
>     core.repositoryformatversion to "1" and set the extensions.refFormat
>     key.

I did this, but are you sure this works? Where would the
repo->ref_storage_format get set? I tried adding it to repo_init(),
but this doesn't get called in a normal startup sequence.

Breakpoint 3, ref_store_init (gitdir=0x555555884b70 ".git",
be_name=0x5555557942ca "reftable", flags=15) at refs.c:1841
warning: Source file is more recent than executable.
1841 {
(gdb) up
#1  0x00005555556de2c8 in get_main_ref_store (r=0x555555871d40
<the_repo>) at refs.c:1862
(gdb) p r->ref_storage_format
$1 = 0x0
}


I'm sending the revised patch series now.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Han-Wen Nienhuys wrote (reply to this):

On Tue, Feb 4, 2020 at 9:06 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Feb 04, 2020 at 07:54:12PM +0100, Han-Wen Nienhuys wrote:
>
> > > PS I don't know if it's set in stone yet, but if we're changing things
> > >    around, is it an option to put reftable-list inside the reftable
> > >    directory, just to keep the top-level uncluttered?
> >
> > I put up https://git.eclipse.org/r/c/157167/ to update the spec inside
> > JGit. Please review.
>
> That looks quite reasonable. The one change I'd make is to put
> "refs/heads/.invalid" into HEAD (instead of "refs/.invalid"). In my
> experiments, existing Git is happy to write into an invalid refname via
> "git update-ref HEAD". But by putting it past the non-directory "heads",
> such a write would always fail.

thanks, incorporated.

> It's also supposed to be a requirement that HEAD only puts to
> refs/heads/, though our validate_headref() does not enforce that. I
> don't know if any other implementations might, though.
>
> I did compare earlier your "make refs/heads a file" versus "make refs/ a
> file and just give it an executable bit". And I think the "refs/heads"
> solution does behave slightly better in a few cases. But if I understand
> correctly, just giving "refs/" an executable bit would mean we retain
> compatibility with the existing JGit implementation.

I hadn't looked at the JGit discovery in detail, but it looks like the
currently proposed approach works with JGit. (I haven't adapted JGit
for the new path setup yet.)

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Feb 04, 2020 at 09:22:36PM +0100, Han-Wen Nienhuys wrote:

> >   - init/clone should take a command-line option for the ref format of
> >     the new repository. Anybody choosing reftables would want to set
> >     core.repositoryformatversion to "1" and set the extensions.refFormat
> >     key.
> 
> I did this, but are you sure this works? Where would the
> repo->ref_storage_format get set? I tried adding it to repo_init(),
> but this doesn't get called in a normal startup sequence.
> 
> Breakpoint 3, ref_store_init (gitdir=0x555555884b70 ".git",
> be_name=0x5555557942ca "reftable", flags=15) at refs.c:1841
> warning: Source file is more recent than executable.
> 1841 {
> (gdb) up
> #1  0x00005555556de2c8 in get_main_ref_store (r=0x555555871d40
> <the_repo>) at refs.c:1862
> (gdb) p r->ref_storage_format
> $1 = 0x0
> }

Hmm. Looks like repo_init() is only used for submodules. Which is pretty
horrid, because that setup sequence is basically duplicated between
there and setup_git_directory(), which is the traditional mechanism.
That's something that ought to be cleaned up, but it's definitely
outside the scope of your series.

The patch below was enough to make something like this work:

  git init --reftable repo
  cd repo
  git commit --allow-empty -m foo
  git for-each-ref

I had to make some tweaks to the init code, too:

 - setting the_repository->ref_storage_format based on the --reftable
   flag

 - we can still recognize a reinit by the presence of a HEAD file
   (whether it's a real one or a dummy). And we should call
   create_symref() in either case, either to create HEAD or to create
   the HEAD symlink inside reftables

I also fixed a bug in your first patch, where the creation of refs/
moves into files_init_db(). The strbuf needs reset there, or we just
append to it, and end up with a doubled path (you could see it by
running the same commands above without --reftable, but of course not
with your patch series as-is because it will _always_ choose reftable).

diff --git a/builtin/init-db.c b/builtin/init-db.c
index dcea74610c..ea2a333c4a 100644
--- a/builtin/init-db.c
+++ b/builtin/init-db.c
@@ -213,6 +213,8 @@ static int create_default_files(const char *template_path,
 	is_bare_repository_cfg = init_is_bare_repository;
 	if (init_shared_repository != -1)
 		set_shared_repository(init_shared_repository);
+	if (flags & INIT_DB_REFTABLE)
+		the_repository->ref_storage_format = xstrdup("reftable");
 
 	/*
 	 * We would have created the above under user's umask -- under
@@ -222,6 +224,15 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
+	/*
+	 * Check to see if .git/HEAD exists; this must happen before
+	 * initializing the ref db, because we want to see if there is an
+	 * existing HEAD.
+	 */
+	path = git_path_buf(&buf, "HEAD");
+	reinit = (!access(path, R_OK) ||
+		  readlink(path, junk, sizeof(junk) - 1) != -1);
+
 	/*
 	 * We need to create a "refs" dir in any case so that older
 	 * versions of git can tell that this is a repository.
@@ -234,18 +245,14 @@ static int create_default_files(const char *template_path,
 	 * Create the default symlink from ".git/HEAD" to the "master"
 	 * branch, if it does not exist yet.
 	 */
-	if (flags & INIT_DB_REFTABLE) {
-		reinit = 0; /* XXX - how do we recognize a reinit,
-			     * and what should we do? */
-	} else {
-		path = git_path_buf(&buf, "HEAD");
-		reinit = (!access(path, R_OK) ||
-			  readlink(path, junk, sizeof(junk) - 1) != -1);
-	}
-
 	if (!reinit) {
 		if (create_symref("HEAD", "refs/heads/master", NULL) < 0)
 			exit(1);
+	} else {
+		/*
+		 * XXX should check whether our ref backend matches the
+		 * original one; if not, either die() or convert
+		 */
 	}
 
 	/* This forces creation of new config file */
diff --git a/refs.c b/refs.c
index 493d3fb673..d6ddbbcc67 100644
--- a/refs.c
+++ b/refs.c
@@ -1859,10 +1859,9 @@ struct ref_store *get_main_ref_store(struct repository *r)
 		BUG("attempting to get main_ref_store outside of repository");
 
 	r->refs = ref_store_init(r->gitdir,
-				 /* XXX r->ref_storage_format == NULL. Where
-				  * should the config file be parsed out? */
-				 r->ref_storage_format ? r->ref_storage_format :
-							 "reftable",
+				 r->ref_storage_format ?
+					r->ref_storage_format :
+					"files",
 				 REF_STORE_ALL_CAPS);
 	return r->refs;
 }
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 0c53b246e8..6f9efde9ca 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -3166,6 +3166,7 @@ static int files_init_db(struct ref_store *ref_store, struct strbuf *err)
 	/*
 	 * Create .git/refs/{heads,tags}
 	 */
+	strbuf_reset(&sb);
 	files_ref_path(refs, &sb, "refs/heads");
 	safe_create_dir(sb.buf, 1);
 
diff --git a/repository.c b/repository.c
index aff47302c9..ff0988dac8 100644
--- a/repository.c
+++ b/repository.c
@@ -174,9 +174,7 @@ int repo_init(struct repository *repo,
 	if (worktree)
 		repo_set_worktree(repo, worktree);
 
-	repo->ref_storage_format = format.ref_storage != NULL ?
-					   xstrdup(format.ref_storage) :
-					   "files"; /* XXX */
+	repo->ref_storage_format = xstrdup_or_null(format.ref_storage);
 
 	clear_repository_format(&format);
 	return 0;
diff --git a/setup.c b/setup.c
index a339186d83..58c5cd3bf0 100644
--- a/setup.c
+++ b/setup.c
@@ -450,7 +450,7 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			data->partial_clone = xstrdup(value);
 		} else if (!strcmp(ext, "worktreeconfig")) {
 			data->worktree_config = git_config_bool(var, value);
-		} else if (!strcmp(ext, "refStorage")) {
+		} else if (!strcmp(ext, "refstorage")) {
 			data->ref_storage = xstrdup(value);
 		} else
 			string_list_append(&data->unknown_extensions, ext);
@@ -1184,8 +1184,11 @@ const char *setup_git_directory_gently(int *nongit_ok)
 				gitdir = DEFAULT_GIT_DIR_ENVIRONMENT;
 			setup_git_env(gitdir);
 		}
-		if (startup_info->have_repository)
+		if (startup_info->have_repository) {
 			repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
+			the_repository->ref_storage_format =
+				xstrdup_or_null(repo_fmt.ref_storage);
+		}
 	}
 
 	strbuf_release(&dir);

@dscho
Copy link
Member

dscho commented Jan 28, 2020 via email

@hanwen hanwen force-pushed the reftable branch 5 times, most recently from 7ab3c4d to fd06ade Compare January 29, 2020 13:48
The reftable/ directory is structured as a library, so it cannot
crash on misuse. Instead, it returns an error codes.

In addition, the error code can be used to signal conditions from lower levels
of the library to be handled by higher levels of the library. For example, a
transaction might legitimately write an empty reftable file, but in that case,
we'd want to shortcut the transaction overhead.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
This commit provides basic utility classes for the reftable library.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The reftable format is usually used with files for storage. However, we abstract
away this using the blocksource data structure. This has two advantages:

* log blocks are zlib compressed, and handling them is simplified if we can
  discard byte segments from within the block layer.

* for unittests, it is useful to read and write in-memory. The blocksource
  allows us to abstract the data away from on-disk files.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
hanwen and others added 10 commits March 12, 2021 20:15
The reftable format is structured as a sequence of blocks, and each block
contains a sequence of prefix-compressed key-value records. There are 4 types of
records, and they have similarities in how they must be handled. This is
achieved by introducing a polymorphic 'record' type that encapsulates ref, log,
index and object records.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
The reftable format is structured as a sequence of block. Within a block,
records are prefix compressed, with an index of offsets for fully expand keys to
enable binary search within blocks.

This commit provides the logic to read and write these blocks.

Includes a code snippet copied from zlib

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
The reftable format includes support for an (OID => ref) map. This map can speed
up visibility and reachability checks. In particular, various operations along
the fetch/push path within Gerrit have ben sped up by using this structure.

The map is constructed with help of a binary tree. Object IDs are hashes, so
they are uniformly distributed. Hence, the tree does not attempt forced
rebalancing.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
This supports reading a single reftable file.

The commit introduces an abstract iterator type, which captures the usecases
both of reading individual refs, and iterating over a segment of the ref
namespace.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
With support for reading and writing files in place, we can construct files (in
memory) and attempt to read them back.

Because some sections of the format are optional (eg. indices, log entries), we
have to exercise this code using multiple sizes of input data

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
For background, see the previous commit introducing the library.

This introduces the file refs/reftable-backend.c containing a reftable-powered
ref storage backend.

It can be activated by passing --ref-storage=reftable to "git init", or setting
GIT_TEST_REFTABLE in the environment.

Example use: see t/t0031-reftable.sh

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Junio Hamano <gitster@pobox.com>
Co-authored-by: Jeff King <peff@peff.net>
In our git-prompt script we strive to use Bash builtins wherever
possible, because fork()-ing subshells for command substitutions and
fork()+exec()-ing Git commands are expensive on some platforms.  We
even read and parse '.git/HEAD' using Bash builtins to get the name of
the current branch [1].  However, the upcoming reftable refs backend
won't use '.git/HEAD' at all, but will write an invalid refname as
placeholder for backwards compatibility instead, which will break our
git-prompt script.

Update the git-prompt script to recognize the placeholder '.git/HEAD'
written by the reftable backend (its content is specified in the
reftable specs), and then fall back to use 'git symbolic-ref' to get
the name of the current branch.

[1] 3a43c4b (bash prompt: use bash builtins to find out current
    branch, 2011-03-31)

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
This command dumps individual tables or a stack of of tables.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 20, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 23, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 31, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 6, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

There was a status update about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

What's the status of this topic?

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 13, 2021

There was a status update in the "Cooking" section about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

Waiting for reviews.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 16, 2021

There was a status update in the "Cooking" section about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

Waiting for reviews.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 19, 2021

There was a status update in the "Cooking" section about the branch hn/reftable on the Git mailing list:

The "reftable" backend for the refs API.

Reroll exists; needs picking up.
cf. <pull.847.v7.git.git.1618832276.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 20, 2021

This patch series was integrated into seen via git@597c232.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 21, 2021

This patch series was integrated into seen via git@eb39b33.

@hanwen
Copy link
Author

hanwen commented Apr 26, 2021

this is now being worked on further in git#847

@hanwen hanwen closed this Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants