Add struct strmap and associated utility functions #835

newren · 2020-08-21T16:45:03Z

Here I introduce new strmap, strintmap, and strset types.

Changes since v5:

Fixed a typo in forward declaration of struct mem_pool, spotted by Phillip. (Usage via pointers meant gcc & clang wouldn't complain.)

[1] https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

CC: Jeff King peff@peff.net
cc: Elijah Newren newren@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com
cc: Chris Torek chris.torek@gmail.com

newren · 2020-08-21T18:51:31Z

/submit

gitgitgadget-git · 2020-08-21T18:52:54Z

Submitted as pull.835.git.git.1598035949.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-835/newren/strmap-v1

To fetch this version to local tag pr-git-835/newren/strmap-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-835/newren/strmap-v1

hashmap.h

Makefile

strmap.c

gitgitgadget-git · 2020-08-21T20:19:45Z

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Aug 21, 2020 at 06:52:24PM +0000, Elijah Newren via GitGitGadget wrote:

> Here I introduce a new strmap type, which my new merge backed, merge-ort,
> uses heavily. (I also made significant use of it in my changes to
> diffcore-rename). This strmap type was based on Peff's proposal from a
> couple years ago[1], but has additions that I made as I used it. I also
> start the series off with a quick documentation improvement to hashmap.c to
> differentiate between hashmap_free() and hashmap_free_entries(), since I
> personally had difficulty understanding them and it affects how
> strmap_clear()/strmap_free() are written.

I like the direction overall (unsurprisingly), but left a bunch of
comments. I do think if we're going to do this that it may be worth
cleaning up hashmap a bit first, especially around its clear/free
semantics, and its ability to lazy-allocate the table.

I'm happy to work on that, but don't want to step on your toes.

I also wonder if you looked at the khash stuff at all. Especially for
storing integers, it makes things much more natural. You'd do something
like:

  /* you might even be able to just write !strcmp in the macro below */
  static inline int streq(const char *a, const char *b)
  {
          return !strcmp(a, b);
  }

  KHASH_INIT(strint_map, char *, int, 1, strhash, streq);

and then you'd probably want a "put" wrapper that makes a copy of the
string. khash has its own charming awkwardness, but I'm just curious if you
looked at it and found it more awkward than hashmap.c, or if you just
didn't look at it.

-Peff

gitgitgadget-git · 2020-08-21T21:38:35Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Aug 21, 2020 at 1:16 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 21, 2020 at 06:52:24PM +0000, Elijah Newren via GitGitGadget wrote:
>
> > Here I introduce a new strmap type, which my new merge backed, merge-ort,
> > uses heavily. (I also made significant use of it in my changes to
> > diffcore-rename). This strmap type was based on Peff's proposal from a
> > couple years ago[1], but has additions that I made as I used it. I also
> > start the series off with a quick documentation improvement to hashmap.c to
> > differentiate between hashmap_free() and hashmap_free_entries(), since I
> > personally had difficulty understanding them and it affects how
> > strmap_clear()/strmap_free() are written.
>
> I like the direction overall (unsurprisingly), but left a bunch of
> comments. I do think if we're going to do this that it may be worth
> cleaning up hashmap a bit first, especially around its clear/free
> semantics, and its ability to lazy-allocate the table.
>
> I'm happy to work on that, but don't want to step on your toes.

I have patches which introduce hashmap_clear() and
hashmap_clear_entries() to hashmap.[ch], which allowed me to simplify
strmap_clear(); instead of needing to call both
hashmap_free[_entries]() && strmap_init(), I could just call
hashmap_clear[_entries]().  Doing that surprised me with a significant
performance impact (in a good direction), at which point I started
adding mem-pool integration into hashmap for storing the entries that
hashmap.c allocates and got further good speedups.

I thought those were better explained when I got to the performance
stuff, so I had held off on those patches.  I could pull them out and
submit them first.

However, there's an important difference here between what I've done
and what you've suggested for hashmap: my method did not deallocate
hashmap->table in hashmap_clear() and then use lazy initialization.
In fact, I think not deallocating the table was part of the charm --
the table had already naturally grown to the right size, and because
the repository has approximately the same number of paths in various
commits, this provided me a way of getting a table preallocated to a
reasonable size for all merges after the first (and there are multiple
merges either when recursiveness is needed due to multiple merge
bases, OR when rebasing or cherry-picking a sequence of commits).
This prevented, as hashmap.h puts it, "expensive resizing".

So, once again, my performance ideas might be clashing with some of
your desires for the API.  Any clever ideas for resolving that?

Also, since you want to see hashmap cleanup first, should I submit
just the hashmap_clear[_entries()] stuff, or should I also submit the
API additions to allow mem-pool integration in hashmap (it's pretty
small and self-contained, but it'll be a while before I submit the
patches that use it...)?

> I also wonder if you looked at the khash stuff at all. Especially for
> storing integers, it makes things much more natural. You'd do something
> like:
>
>   /* you might even be able to just write !strcmp in the macro below */
>   static inline int streq(const char *a, const char *b)
>   {
>           return !strcmp(a, b);
>   }
>
>   KHASH_INIT(strint_map, char *, int, 1, strhash, streq);
>
> and then you'd probably want a "put" wrapper that makes a copy of the
> string. khash has its own charming awkwardness, but I'm just curious if you
> looked at it and found it more awkward than hashmap.c, or if you just
> didn't look at it.

I did look at it, but only briefly.  I had a further investigation on
my TODO list for months, along with several other improvement ideas.
But it seemed like my TODO list was really long, and my new merge
backend hasn't benefited anyone yet.  At some point, I decided to punt
on it and other ideas and start cleaning up my code and submitting.  I
believe merge-ort is more accurate than merge-recursive (it fixes
several test_expect_failures) and is a lot faster as well for the
cases I'm looking at.  So, for now, I've pulled it off my radar.

But I'd be really happy if someone else wanted to jump in and try
switching out hashmap for khash in the strmap API and see if it helps
merge-ort performance.  :-)

gitgitgadget-git · 2020-08-21T22:29:11Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Aug 21, 2020 at 2:33 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 1:16 PM Jeff King <peff@peff.net> wrote:
> >
> > On Fri, Aug 21, 2020 at 06:52:24PM +0000, Elijah Newren via GitGitGadget wrote:
> >
> > > Here I introduce a new strmap type, which my new merge backed, merge-ort,
> > > uses heavily. (I also made significant use of it in my changes to
> > > diffcore-rename). This strmap type was based on Peff's proposal from a
> > > couple years ago[1], but has additions that I made as I used it. I also
> > > start the series off with a quick documentation improvement to hashmap.c to
> > > differentiate between hashmap_free() and hashmap_free_entries(), since I
> > > personally had difficulty understanding them and it affects how
> > > strmap_clear()/strmap_free() are written.
> >
> > I like the direction overall (unsurprisingly), but left a bunch of
> > comments. I do think if we're going to do this that it may be worth
> > cleaning up hashmap a bit first, especially around its clear/free
> > semantics, and its ability to lazy-allocate the table.
> >
> > I'm happy to work on that, but don't want to step on your toes.
>
> I have patches which introduce hashmap_clear() and
> hashmap_clear_entries() to hashmap.[ch], which allowed me to simplify
> strmap_clear(); instead of needing to call both
> hashmap_free[_entries]() && strmap_init(), I could just call
> hashmap_clear[_entries]().  Doing that surprised me with a significant
> performance impact (in a good direction), at which point I started
> adding mem-pool integration into hashmap for storing the entries that
> hashmap.c allocates and got further good speedups.
>
> I thought those were better explained when I got to the performance
> stuff, so I had held off on those patches.  I could pull them out and
> submit them first.
>
> However, there's an important difference here between what I've done
> and what you've suggested for hashmap: my method did not deallocate
> hashmap->table in hashmap_clear() and then use lazy initialization.
> In fact, I think not deallocating the table was part of the charm --
> the table had already naturally grown to the right size, and because
> the repository has approximately the same number of paths in various
> commits, this provided me a way of getting a table preallocated to a
> reasonable size for all merges after the first (and there are multiple
> merges either when recursiveness is needed due to multiple merge
> bases, OR when rebasing or cherry-picking a sequence of commits).
> This prevented, as hashmap.h puts it, "expensive resizing".
>
> So, once again, my performance ideas might be clashing with some of
> your desires for the API.  Any clever ideas for resolving that?
>
> Also, since you want to see hashmap cleanup first, should I submit
> just the hashmap_clear[_entries()] stuff, or should I also submit the
> API additions to allow mem-pool integration in hashmap (it's pretty
> small and self-contained, but it'll be a while before I submit the
> patches that use it...)?

Nevermind, I mis-remembered.  The mempool integration was added
specifically to strmap, not to hashmap, because strmap_put() does the
allocation of the str_entry.  So I'll just pull out the
hashmap_clear[_entries]() stuff and send it up.

>
> > I also wonder if you looked at the khash stuff at all. Especially for
> > storing integers, it makes things much more natural. You'd do something
> > like:
> >
> >   /* you might even be able to just write !strcmp in the macro below */
> >   static inline int streq(const char *a, const char *b)
> >   {
> >           return !strcmp(a, b);
> >   }
> >
> >   KHASH_INIT(strint_map, char *, int, 1, strhash, streq);
> >
> > and then you'd probably want a "put" wrapper that makes a copy of the
> > string. khash has its own charming awkwardness, but I'm just curious if you
> > looked at it and found it more awkward than hashmap.c, or if you just
> > didn't look at it.
>
> I did look at it, but only briefly.  I had a further investigation on
> my TODO list for months, along with several other improvement ideas.
> But it seemed like my TODO list was really long, and my new merge
> backend hasn't benefited anyone yet.  At some point, I decided to punt
> on it and other ideas and start cleaning up my code and submitting.  I
> believe merge-ort is more accurate than merge-recursive (it fixes
> several test_expect_failures) and is a lot faster as well for the
> cases I'm looking at.  So, for now, I've pulled it off my radar.
>
> But I'd be really happy if someone else wanted to jump in and try
> switching out hashmap for khash in the strmap API and see if it helps
> merge-ort performance.  :-)

gitgitgadget-git · 2020-08-28T07:04:46Z

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Aug 21, 2020 at 02:33:54PM -0700, Elijah Newren wrote:

> However, there's an important difference here between what I've done
> and what you've suggested for hashmap: my method did not deallocate
> hashmap->table in hashmap_clear() and then use lazy initialization.
> In fact, I think not deallocating the table was part of the charm --
> the table had already naturally grown to the right size, and because
> the repository has approximately the same number of paths in various
> commits, this provided me a way of getting a table preallocated to a
> reasonable size for all merges after the first (and there are multiple
> merges either when recursiveness is needed due to multiple merge
> bases, OR when rebasing or cherry-picking a sequence of commits).
> This prevented, as hashmap.h puts it, "expensive resizing".
> 
> So, once again, my performance ideas might be clashing with some of
> your desires for the API.  Any clever ideas for resolving that?

If the magic is in pre-sizing the hash, then it seems like the callers
ought to be feeding the size hint. That does make a little more work for
them, but I think there's real value in having consistent semantics for
"clear" across our data structures.

However, one cheat would be to free the memory but retain the size hint
after a clear. And then if we lazy-init, grow immediately to the hint
size. That's more expensive than a true reuse, because we do reallocate
the memory. But it avoids the repeated re-allocation during growth.

It may also be a sign that we should be growing the hash more
aggressively in the first place. Of course all of this is predicated
having some benchmarks. It would be useful to know which part actually
provided the speedup.

-Peff

gitgitgadget-git · 2020-08-28T07:04:48Z

User Jeff King <peff@peff.net> has been added to the cc: list.

gitgitgadget-git · 2020-08-28T15:32:22Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Aug 28, 2020 at 12:03 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 21, 2020 at 02:33:54PM -0700, Elijah Newren wrote:
>
> > However, there's an important difference here between what I've done
> > and what you've suggested for hashmap: my method did not deallocate
> > hashmap->table in hashmap_clear() and then use lazy initialization.
> > In fact, I think not deallocating the table was part of the charm --
> > the table had already naturally grown to the right size, and because
> > the repository has approximately the same number of paths in various
> > commits, this provided me a way of getting a table preallocated to a
> > reasonable size for all merges after the first (and there are multiple
> > merges either when recursiveness is needed due to multiple merge
> > bases, OR when rebasing or cherry-picking a sequence of commits).
> > This prevented, as hashmap.h puts it, "expensive resizing".
> >
> > So, once again, my performance ideas might be clashing with some of
> > your desires for the API.  Any clever ideas for resolving that?
>
> If the magic is in pre-sizing the hash, then it seems like the callers
> ought to be feeding the size hint. That does make a little more work for
> them, but I think there's real value in having consistent semantics for
> "clear" across our data structures.

I thought about adding a size hint from the callers, but the thing is
I don't know how to get a good one short of running a merge and
querying how big things were sized in that merge.  (In some common
cases I can get an upper bound, but I can't get it in all cases and
that upper bound might be a couple orders of magnitude too big.)
Thus, it's really a case where I just punt on pre-sizing for the first
merge, and use the size from the previous merge for subsequent ones.
If you have a non-recursive merge or are cherry-picking only a single
commit, then no sizing hint is used.

> However, one cheat would be to free the memory but retain the size hint
> after a clear. And then if we lazy-init, grow immediately to the hint
> size. That's more expensive than a true reuse, because we do reallocate
> the memory. But it avoids the repeated re-allocation during growth.
>
> It may also be a sign that we should be growing the hash more
> aggressively in the first place. Of course all of this is predicated
> having some benchmarks. It would be useful to know which part actually
> provided the speedup.

Your thoughts here are great; I also had another one this past week --
I could introduce a hashmap_partial_clear() (in addition to
hashmap_clear()) for the special usecase I have of leaving the table
allocated and pre-sized.  It'd prevent people from accidentally using
it and forgetting to free stuff, while still allowing me to take
advantage.  But, as you say, more benchmarks would be useful to find
which parts provided the speedup before taking any of these steps.

gitgitgadget-git · 2020-08-28T15:32:24Z

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

gitgitgadget-git · 2020-09-01T09:30:54Z

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Aug 28, 2020 at 08:29:44AM -0700, Elijah Newren wrote:

> > It may also be a sign that we should be growing the hash more
> > aggressively in the first place. Of course all of this is predicated
> > having some benchmarks. It would be useful to know which part actually
> > provided the speedup.
> 
> Your thoughts here are great; I also had another one this past week --
> I could introduce a hashmap_partial_clear() (in addition to
> hashmap_clear()) for the special usecase I have of leaving the table
> allocated and pre-sized.  It'd prevent people from accidentally using
> it and forgetting to free stuff, while still allowing me to take
> advantage.  But, as you say, more benchmarks would be useful to find
> which parts provided the speedup before taking any of these steps.

Yeah, having a separate function to explicitly do "remove all elements
but keep the table allocated" would be fine with me. My big desire is
that clear() should do the safe, non-leaking thing by default.

-Peff

manda2639 · 2020-09-01T10:49:28Z

Excellent job! Thank you for sharing and your outstanding dedication. On Tue, Sep 1, 2020, 4:33 AM gitgitgadget-git[bot] <notifications@github.com> wrote:

…

On the Git mailing list ***@***.***>, Jeff King wrote (reply to this <https://github.com/gitgitgadget/gitgitgadget/wiki/ReplyToThis>): On Fri, Aug 28, 2020 at 08:29:44AM -0700, Elijah Newren wrote: > > It may also be a sign that we should be growing the hash more > > aggressively in the first place. Of course all of this is predicated > > having some benchmarks. It would be useful to know which part actually > > provided the speedup. > > Your thoughts here are great; I also had another one this past week -- > I could introduce a hashmap_partial_clear() (in addition to > hashmap_clear()) for the special usecase I have of leaving the table > allocated and pre-sized. It'd prevent people from accidentally using > it and forgetting to free stuff, while still allowing me to take > advantage. But, as you say, more benchmarks would be useful to find > which parts provided the speedup before taking any of these steps. Yeah, having a separate function to explicitly do "remove all elements but keep the table allocated" would be fine with me. My big desire is that clear() should do the safe, non-leaking thing by default. -Peff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#835 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQIEWFO4ZO5GX4J2HM4KCSTSDS5YBANCNFSM4QHO4NVA> .

The existence of hashmap_free() and hashmap_free_entries() confused me, and the docs weren't clear enough. We are dealing with a map table, entries in that table, and possibly also things each of those entries point to. I had to consult other source code examples and the implementation. Add a brief note to clarify the differences. This will become even more important once we introduce a new hashmap_partial_clear() function which will add the question of whether the table itself has been freed. Signed-off-by: Elijah Newren <newren@gmail.com>

newren · 2020-10-13T00:39:42Z

/submit

gitgitgadget-git · 2020-10-13T00:41:23Z

Submitted as pull.835.v2.git.git.1602549650.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-835/newren/strmap-v2

To fetch this version to local tag pr-git-835/newren/strmap-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-835/newren/strmap-v2

gitgitgadget-git · 2020-10-13T21:54:45Z

This branch is now known as en/strmap.

gitgitgadget-git · 2020-10-13T21:54:46Z

This patch series was integrated into seen via 67250e0.

gitgitgadget-git · 2020-10-15T20:28:58Z

This patch series was integrated into seen via d335da2.

gitgitgadget-git · 2020-10-16T23:02:19Z

This patch series was integrated into seen via 2294c72.

gitgitgadget-git · 2020-10-16T23:21:31Z

This patch series was integrated into seen via 5b2f544.

gitgitgadget-git · 2020-10-18T21:31:35Z

This patch series was integrated into seen via 3afbd9f.

gitgitgadget-git · 2020-10-19T17:52:50Z

This patch series was integrated into seen via ca48df1.

gitgitgadget-git · 2020-10-20T22:53:52Z

This patch series was integrated into seen via 0207fe2.

gitgitgadget-git · 2020-11-06T22:07:36Z

This patch series was integrated into seen via 11b1c72.

gitgitgadget-git · 2020-11-08T19:40:42Z

This patch series was integrated into seen via 28cce76.

gitgitgadget-git · 2020-11-09T23:53:27Z

This patch series was integrated into seen via b637f1e.

gitgitgadget-git · 2020-11-10T21:46:08Z

This patch series was integrated into seen via fac590c.

gitgitgadget-git · 2020-11-11T01:03:42Z

This patch series was integrated into seen via 870b4e5.

strmap.c

gitgitgadget-git · 2020-11-11T17:36:59Z

User Phillip Wood <phillip.wood123@gmail.com> has been added to the cc: list.

For heavy users of strmaps, allowing the keys and entries to be allocated from a memory pool can provide significant overhead savings. Add an option to strmap_init_with_options() to specify a memory pool. Signed-off-by: Elijah Newren <newren@gmail.com>

By default, we do not use a mempool and strdup_strings is true; in this case, we can avoid both an extra allocation and an extra free by just over-allocating for the strmap_entry leaving enough space at the end to copy the key. FLEXPTR_ALLOC_STR exists for exactly this purpose, so make use of it. Also, adjust the case when we are using a memory pool and strdup_strings is true to just do one allocation from the memory pool instead of two so that the strmap_clear() and strmap_remove() code can just avoid freeing the key in all cases. Signed-off-by: Elijah Newren <newren@gmail.com>

Now that hashamp has lazy initialization and a HASHMAP_INIT macro, hashmaps allocated on the stack can be initialized without a call to hashmap_init() and in some cases makes the code a bit shorter. Convert some callsites over to take advantage of this. Signed-off-by: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>

newren · 2020-11-11T20:01:12Z

/submit

gitgitgadget-git · 2020-11-11T20:03:06Z

Submitted as pull.835.v6.git.git.1605124942.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-git-835/newren/strmap-v6

To fetch this version to local tag pr-git-835/newren/strmap-v6:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-git-835/newren/strmap-v6

gitgitgadget-git · 2020-11-11T20:10:13Z

On the Git mailing list, Jeff King wrote (reply to this):

On Wed, Nov 11, 2020 at 08:02:06PM +0000, Elijah Newren via GitGitGadget wrote:

> Here I introduce new strmap, strintmap, and strset types.
> 
> Changes since v5:
> 
>  * Fixed a typo in forward declaration of struct mem_pool, spotted by
>    Phillip. (Usage via pointers meant gcc & clang wouldn't complain.)

Yep, this version looks good to me. I think this is probably ready for
'next'.

-Peff

gitgitgadget-git · 2020-11-11T20:37:20Z

User Chris Torek <chris.torek@gmail.com> has been added to the cc: list.

gitgitgadget-git · 2020-11-11T22:20:24Z

This patch series was integrated into seen via 0a6426c.

gitgitgadget-git · 2020-11-12T21:19:23Z

This patch series was integrated into seen via 0a12bba.

gitgitgadget-git · 2020-11-12T21:19:24Z

This patch series was integrated into next via 41519a5.

gitgitgadget-git · 2020-11-17T23:46:43Z

This patch series was integrated into seen via 349b680.

gitgitgadget-git · 2020-11-18T22:28:50Z

This patch series was integrated into seen via 8b4a8be.

gitgitgadget-git · 2020-11-20T01:42:24Z

This patch series was integrated into seen via d018753.

gitgitgadget-git · 2020-11-23T23:00:55Z

This patch series was integrated into seen via bf0a430.

gitgitgadget-git · 2020-11-23T23:00:57Z

This patch series was integrated into next via bf0a430.

gitgitgadget-git · 2020-11-23T23:00:58Z

This patch series was integrated into master via bf0a430.

gitgitgadget-git · 2020-11-23T23:01:01Z

Closed via bf0a430.