Hash Table Instrumentation #64
I like this idea, but would tweak the API a bit. Instead of an in/out parameter, I'd suggest a separate accessor function:

```c
size_t
cork_hash_table_bucket_count(struct cork_hash_table *table);
```

That on its own might give you what you need. After each mutator function, you'd call this to get the current bucket count, and compare it to the value from the previous call (instead of comparing to 0). If you want to know precisely when a reallocation occurs, I'd go for a hook function:

```c
typedef void
(*cork_hash_table_resized_f)(void *user_data, size_t old_size, size_t new_size);

void
cork_hash_table_add_resized_hook(struct cork_hash_table *table,
                                 cork_hash_table_resized_f resized,
                                 void *user_data);
```
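For illustration, here is a minimal sketch of how such a hook might be used for profiling. Note that `cork_hash_table_add_resized_hook` is only a proposal in this thread, not an existing libcork function, and the stats struct and callback names are hypothetical:

```c
#include <stdio.h>
#include <stddef.h>

struct cork_hash_table;   /* provided by libcork */

/* Proposed hook API from the comment above; it does not exist in libcork yet. */
typedef void
(*cork_hash_table_resized_f)(void *user_data, size_t old_size, size_t new_size);

void
cork_hash_table_add_resized_hook(struct cork_hash_table *table,
                                 cork_hash_table_resized_f resized,
                                 void *user_data);

/* Hypothetical profiling state updated by the callback. */
struct resize_stats {
    size_t  resize_count;
    size_t  current_bucket_count;
};

static void
record_resize(void *user_data, size_t old_size, size_t new_size)
{
    struct resize_stats  *stats = user_data;
    stats->resize_count++;
    stats->current_bucket_count = new_size;
    fprintf(stderr, "hash table resized: %zu -> %zu buckets\n",
            old_size, new_size);
}

/* After creating the table:
 *     static struct resize_stats  stats = { 0, 0 };
 *     cork_hash_table_add_resized_hook(table, record_resize, &stats);
 */
```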
Either of those would work, but the accessor function adds a second function call to each potentially modifying call. How about this signature for the callback instead:

```c
typedef void
(*cork_hash_table_resized_f)(void *user_data,
                             size_t new_bucket_count,
                             size_t new_entry_count);
```

That way, you can track both the density (loading) and the bucket count with a single callback. If you care about the previous size (or if it actually matters), you should be able to maintain it yourself from one callback to the next. Yet another alternative would be to add variants of the mutator functions that have the desired functionality, but this causes a proliferation of interface functions.
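A similar sketch for this variant (again hypothetical, not part of libcork), using the entry count delivered at each resize to log the load factor that triggered the growth:

```c
#include <stdio.h>
#include <stddef.h>

/* Alternative callback signature proposed above; also not part of libcork. */
typedef void
(*cork_hash_table_resized_f)(void *user_data,
                             size_t new_bucket_count,
                             size_t new_entry_count);

static void
track_density(void *user_data, size_t new_bucket_count, size_t new_entry_count)
{
    (void) user_data;
    /* Entries per bucket immediately after the resize; just before the resize
     * the density was roughly twice this (assuming the bucket count doubled). */
    double  load = (double) new_entry_count / (double) new_bucket_count;
    fprintf(stderr, "resized to %zu buckets, %zu entries (load %.2f)\n",
            new_bucket_count, new_entry_count, load);
}
```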
True, but that means you only pay the price to grab the bucket count if you need to for profiling purposes. (The function call won't be any more expensive than the
Good call. I'll throw
I don't think this is correct. The function call will have to be made every time a mutator is called by a user tracking reallocations. The original intent was that the in/out parameter be optional and could be left strictly alone unless a reallocation was performed. If the optional parameter is the last one (I realize that C does not suffer these gladly, but they are used with the variable-number-of-arguments mechanism), users not tracking reallocations are unaffected, while users tracking reallocations add the parameter. The only place it needs to be detected and used is in the reallocation code, where the additional tests are likely to be insignificant relative to reallocating and rehashing the table. This is effectively the same effort required to determine whether the hook needs to be exercised. I suspect that the hook is preferable, as the variable-number-of-arguments methods are pretty kludgy from what I can tell.
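For concreteness, a sketch of where that single test would live, assuming the hypothetical out-parameter were threaded down into the rehash routine (names are placeholders, not libcork code):

```c
#include <stddef.h>

/* Hypothetical helper called from inside the rehash path: report the new bin
 * count through the caller's optional out-parameter.  Callers that do not
 * track reallocations pass NULL and pay only this one pointer test, and only
 * when a reallocation actually happens. */
static void
report_rehash(size_t *new_bucket_count, size_t new_bin_count)
{
    if (new_bucket_count != NULL) {
        *new_bucket_count = new_bin_count;
    }
}
```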
I think that some of this discussion is a red herring. In general, I don't want to examine every entry-creation call; I want a plausible explanation for time-consuming behaviors. By adding a reallocation count to the hash table data structure, and a hash table statistics function that can be called to extract performance information for the table, we should be able to get this information at the end of a run.

Having looked at the code, I'd like to be able to get the entry and bin counts as well as the reallocation count. It might be useful to track collision statistics too, since the reallocation trigger is an average number of entries per bin rather than a hard limit. Since we traverse the entry chains in the rehashing process, we could at least get a high-water mark, a triple <reallocation number, bin index, entry count>, that would show how badly (or well) the collision chain management was performing. This should be fairly low cost, since it only occurs when reallocation is needed and should not add significantly to its operation count. If it should turn out to be prohibitive, we can instead gather it by making a traversal in the statistics function, which we expect to be called only once in the lifetime of the table.

An alternative might be to define several levels of debug, with debug level 2 causing log messages when reallocation occurs. These messages might include collision chain information for the pre- (and possibly post-) reallocation entries at a still higher debug level.
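As a strawman, the statistics record and accessor being suggested might look something like this (the struct layout and function name are hypothetical, not libcork API):

```c
#include <stddef.h>

struct cork_hash_table;    /* provided by libcork */

/* Hypothetical per-table performance counters. */
struct cork_hash_table_stats {
    size_t  entry_count;          /* entries currently in the table */
    size_t  bin_count;            /* current size of the bucket array */
    size_t  realloc_count;        /* times the bucket array has been rebuilt */
    /* Collision-chain high-water mark, captured during rehashing:
     * <reallocation number, bin index, entry count>. */
    size_t  hwm_realloc_number;
    size_t  hwm_bin_index;
    size_t  hwm_chain_length;
};

/* Hypothetical accessor: copy the counters out of an existing table,
 * typically called once at the end of a run. */
void
cork_hash_table_get_stats(const struct cork_hash_table *table,
                          struct cork_hash_table_stats *stats);
```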
Some extra performance counters would definitely be nice. One thing that I've wanted to do is make the struct cork_hash_table type opaque, which would change initializing a table from

```c
struct cork_hash_table  table;
cork_hash_table_init(&table, 0, hasher, comparator);
```

to

```c
struct cork_hash_table  *table;
table = cork_hash_table_new(0, hasher, comparator);
```

That's a backwards-incompatible change, but I think it's worth it. Thoughts?
See https://gist.github.com/JohnMcHugh/823161 for a statistics run.
I have been looking through the code of the hash tables some more. Whether these observations matter or not depends greatly on the number of entries that actually appear in the tables and on the ratio of entry creation to entry lookup. To a limited extent, some of these issues can be addressed by the profiler, but the costs of a given hash table may be difficult to determine. These observations are based on timings for runs that created a 2^20 entry hash table and showed that setting the initial bin count to avoid bin reallocation reduced the run time by about half. I suspect that further reductions would follow from tuning the memory pool for the entries as well. Note that most of the reallocation time is spent in the last reallocation, since there are more items to move then.

Observation 1: Hash tables are relatively heavyweight for small aggregations. There is a point below which a linear search is better; above that, a binary search may be the next choice, and above that point the hash table may be indicated. Setting the default bin count at 8 (49 entries to the first reallocation) is probably never appropriate. An initial bin count of 2^14 would allow nearly 100K entries before reallocation; if we expect a million entries, 2^18 bins is a good starting point. With a linear collision search and unordered collision chains, a hash plus on the order of 5-6 comparisons (the worst case seems to be 20 or so) is required to find an entry. Linear search on an ordered list would probably be better up to 8-12 entries, binary search up to a few hundred.

Observation 2: Using an unordered collision chain requires a search to the end of the chain to determine whether an entry is present. Ordering the chain does not add to the cost, since the comparisons on insertion will occur anyhow. On a lookup, we will find an existing entry in an unsorted chain, on average, after looking at half the entries, and discover a nonexistent entry only after looking at all of them. If the values in a collision chain are uniformly distributed in the key space, an ordered list will typically find both existent and nonexistent entries in about the same number of comparisons, about half the chain length.

Observation 3: A 4K memory pool block only gives us about 128 table entries if we assume 8-byte pointers at 4 pointers per entry: key, value, forward, and back. It might be better to allocate the memory pool in blocks that are more appropriate to the number of entries supported by the bin allocation. This might result in an initial memory pool size of a megabyte or so to support 16K initial bins with no additional malloc calls. If we are going to double the bin space at each table reallocation, we might want to treat the memory pool block size the same way. Note that it is not necessary to link the pool block space onto a free list; the free list need only contain pool entries that have been allocated and then freed. The free space in the pool block is contiguous and can be allocated from a single free pointer that is compared to a limit value to determine whether a new block is needed. If the free list is non-empty, freed space is used preferentially, then unallocated block space, then a new block is allocated as necessary. This avoids the performance hiccough that occurs when a linked list containing a large number of items has to be created in the middle of an operation.

Observation 4: It is not necessary to save the old bin space and allocate new space. The existing space can be extended in place (with realloc, for example). The rehashing process essentially uncovers a new high-order bin index bit for the existing hash collision chain entries. If this bit is 0, an entry stays in its current bin; if it is 1, it moves to the corresponding bin in the newly added half of the table.
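A minimal sketch of the bin-splitting rehash described in Observation 4, assuming power-of-two bin counts and a hash value stored in each entry; the entry layout here is illustrative rather than libcork's actual structure:

```c
#include <stddef.h>

struct entry {
    size_t         hash;   /* full hash value, kept so the key need not be rehashed */
    struct entry  *next;
    /* key and value omitted */
};

/* Split each collision chain of a power-of-two bin array that has just been
 * extended in place from old_bins to 2 * old_bins buckets.  With bin index
 * computed as (hash & (bins - 1)), the newly uncovered index bit is exactly
 * (hash & old_bins): entries with that bit clear stay in bin i, entries with
 * it set move to bin i + old_bins. */
static void
split_bins(struct entry **bins, size_t old_bins)
{
    for (size_t i = 0; i < old_bins; i++) {
        struct entry  *curr = bins[i];
        struct entry  *keep = NULL;   /* new bit == 0: stays in bin i */
        struct entry  *move = NULL;   /* new bit == 1: goes to bin i + old_bins */
        while (curr != NULL) {
            struct entry  *next = curr->next;
            if (curr->hash & old_bins) {
                curr->next = move;
                move = curr;
            } else {
                curr->next = keep;
                keep = curr;
            }
            curr = next;
        }
        bins[i] = keep;
        bins[i + old_bins] = move;
    }
}
```

No temporary copy of the bin array is needed, since every entry either stays in its current bin or moves to the single corresponding bin in the new half.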
When monitoring the performance of a program that uses dynamic hash tables, it is useful to track the space allocated to the table by catching table reallocations. Since reallocations occur when a new item is entered into a table that has no room for it, the tracking could be a side effect of the entry-adding and size-setting actions.
For the cork hash tables, these are the calls for adding entries and the size-setting operations. By adding an optional "in/out" parameter to these operations, the actual size of the hash array (as opposed to the number of entries currently present in the table) could be returned whenever the table is allocated or reallocated. If the call-time value is set to 0, a simple test for a non-zero returned value identifies a reallocation point. In addition, by tracking the number of entries in the table when reallocation occurs, the typical table capacity (or load factor) for a given size can be tracked as well.
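A sketch of the call pattern being proposed, under the assumption that a mutator gained the extra parameter; `add_entry` and `record_reallocation` are placeholders rather than real cork functions:

```c
#include <stddef.h>

struct cork_hash_table;   /* provided by libcork */

/* Hypothetical signatures: add_entry() stands in for whichever cork mutator
 * would gain the optional bucket-count parameter, and record_reallocation()
 * is whatever the profiling code wants to do with the number. */
void  add_entry(struct cork_hash_table *table, void *key, void *value,
                size_t *bucket_count);
void  record_reallocation(size_t new_bucket_count);

static void
insert_tracked(struct cork_hash_table *table, void *key, void *value)
{
    size_t  bucket_count = 0;      /* 0 == "no reallocation during this call" */
    add_entry(table, key, value, &bucket_count);
    if (bucket_count != 0) {
        /* The hash array was (re)allocated inside this call; bucket_count is
         * its new size, as opposed to the number of entries in the table. */
        record_reallocation(bucket_count);
    }
}
```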
Although I have not inspected the current hash code, reallocation typically involves doubling the size of the store for the table and rehashing all of the current entries since their index in the table will go from N to N+1 bits as the size of the table is increased from 2^N to 2^(N+1) entries. Instrumenting the reallocation process is a first step in evaluating the performance of algorithms that use the hash table facilities.