Hash Table Instrumentation #64
I like this idea, but would tweak the API a bit. Instead of an in/out parameter, I'd suggest a separate accessor function:

```c
size_t
cork_hash_table_bucket_count(struct cork_hash_table *table);
```

That on its own might give you what you need. After each mutator function, you'd call this to get the current bucket count, and compare it to the value from the previous call (instead of comparing to 0). If you want to know precisely when a reallocation occurs, I'd go for a hook function:

```c
typedef void
(*cork_hash_table_resized_f)(void *user_data, size_t old_size, size_t new_size);

void
cork_hash_table_add_resized_hook(struct cork_hash_table *table,
                                 cork_hash_table_resized_f resized,
                                 void *user_data);
```
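For illustration, here is a minimal sketch of how such a hook might be used for profiling. Note that `cork_hash_table_add_resized_hook` is only a proposal in this thread, not an existing libcork function, and the stats struct and callback names are hypothetical:

```c
#include <stdio.h>
#include <stddef.h>

struct cork_hash_table;   /* provided by libcork */

/* Proposed hook API from the comment above; it does not exist in libcork yet. */
typedef void
(*cork_hash_table_resized_f)(void *user_data, size_t old_size, size_t new_size);

void
cork_hash_table_add_resized_hook(struct cork_hash_table *table,
                                 cork_hash_table_resized_f resized,
                                 void *user_data);

/* Hypothetical profiling state updated by the callback. */
struct resize_stats {
    size_t  resize_count;
    size_t  current_bucket_count;
};

static void
record_resize(void *user_data, size_t old_size, size_t new_size)
{
    struct resize_stats  *stats = user_data;
    stats->resize_count++;
    stats->current_bucket_count = new_size;
    fprintf(stderr, "hash table resized: %zu -> %zu buckets\n",
            old_size, new_size);
}

/* After creating the table:
 *     static struct resize_stats  stats = { 0, 0 };
 *     cork_hash_table_add_resized_hook(table, record_resize, &stats);
 */
```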
Either of those would work, but the accessor function adds a second function call to each potentially modifying call. How about this signature for the callback instead:

```c
typedef void
(*cork_hash_table_resized_f)(void *user_data,
                             size_t new_bucket_count,
                             size_t new_entry_count);
```

That way, you can track both the density (loading) and the bucket count with a single callback. If you care about the previous size (or if it actually matters), you should be able to maintain it yourself from one callback to the next. Yet another alternative would be to add variants of the mutator functions that have the desired functionality, but this causes a proliferation of interface functions.
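A similar sketch for this variant (again hypothetical, not part of libcork), using the entry count delivered at each resize to log the load factor that triggered the growth:

```c
#include <stdio.h>
#include <stddef.h>

/* Alternative callback signature proposed above; also not part of libcork. */
typedef void
(*cork_hash_table_resized_f)(void *user_data,
                             size_t new_bucket_count,
                             size_t new_entry_count);

static void
track_density(void *user_data, size_t new_bucket_count, size_t new_entry_count)
{
    (void) user_data;
    /* Entries per bucket immediately after the resize; just before the resize
     * the density was roughly twice this (assuming the bucket count doubled). */
    double  load = (double) new_entry_count / (double) new_bucket_count;
    fprintf(stderr, "resized to %zu buckets, %zu entries (load %.2f)\n",
            new_bucket_count, new_entry_count, load);
}
```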
True, but that means you only pay the price to grab the bucket count if you need to for profiling purposes. (The function call won't be any more expensive than the
Good call. I'll throw
I don't think this is correct. The function call will have to be made every time a mutator is called by a user tracking reallocations. The original intent was that the in/out parameter be optional and could be left strictly alone unless a reallocation was performed. If the optional parameter is the last one (I realize that C does not suffer these gladly, but they are used with the variable-number-of-arguments mechanism), users not tracking reallocations are unaffected, while users tracking reallocations add the parameter. The only place it needs to be detected and used is in the reallocation code, where the additional tests are likely to be insignificant relative to reallocating and rehashing the table. This is effectively the same effort required to determine whether the hook needs to be exercised. I suspect that the hook is preferable, as the variable-number-of-arguments methods are pretty kludgy from what I can tell.
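For concreteness, a sketch of where that single test would live, assuming the hypothetical out-parameter were threaded down into the rehash routine (names are placeholders, not libcork code):

```c
#include <stddef.h>

/* Hypothetical helper called from inside the rehash path: report the new bin
 * count through the caller's optional out-parameter.  Callers that do not
 * track reallocations pass NULL and pay only this one pointer test, and only
 * when a reallocation actually happens. */
static void
report_rehash(size_t *new_bucket_count, size_t new_bin_count)
{
    if (new_bucket_count != NULL) {
        *new_bucket_count = new_bin_count;
    }
}
```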
I think that some of this discussion is a red herring. In general, I don't want to examine every entry-creation call; I want a plausible explanation for time-consuming behaviors. By adding a reallocation count to the hash table data structure, and a hash table statistics function that can be called to extract performance information for the table, we should be able to get this information at the end of a run.

Having looked at the code, I'd like to be able to get the entry and bin counts as well as the reallocation count. It might be useful to track collision statistics too, since the reallocation trigger is an average number of entries per bin rather than a hard limit. Since we traverse the entry chains in the rehashing process, we could at least get a high-water mark, a triple <reallocation number, bin index, entry count>, that would show how badly (or well) the collision chain management was performing. This should be fairly low cost, since it only occurs when reallocation is needed and should not add significantly to its operation count. If it should turn out to be prohibitive, we can instead gather it by making a traversal in the statistics function, which we expect to be called only once in the lifetime of the table.

An alternative might be to define several levels of debug, with debug level 2 causing log messages when reallocation occurs. These messages might include collision chain information for the pre- (and possibly post-) reallocation entries at a still higher debug level.
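As a strawman, the statistics record and accessor being suggested might look something like this (the struct layout and function name are hypothetical, not libcork API):

```c
#include <stddef.h>

struct cork_hash_table;    /* provided by libcork */

/* Hypothetical per-table performance counters. */
struct cork_hash_table_stats {
    size_t  entry_count;          /* entries currently in the table */
    size_t  bin_count;            /* current size of the bucket array */
    size_t  realloc_count;        /* times the bucket array has been rebuilt */
    /* Collision-chain high-water mark, captured during rehashing:
     * <reallocation number, bin index, entry count>. */
    size_t  hwm_realloc_number;
    size_t  hwm_bin_index;
    size_t  hwm_chain_length;
};

/* Hypothetical accessor: copy the counters out of an existing table,
 * typically called once at the end of a run. */
void
cork_hash_table_get_stats(const struct cork_hash_table *table,
                          struct cork_hash_table_stats *stats);
```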
Some extra performance counters would definitely be nice. One thing that I've wanted to do is make the struct cork_hash_table type opaque, which would change initializing a table from

```c
struct cork_hash_table  table;
cork_hash_table_init(&table, 0, hasher, comparator);
```

to

```c
struct cork_hash_table  *table;
table = cork_hash_table_new(0, hasher, comparator);
```

That's a backwards-incompatible change, but I think it's worth it. Thoughts?
See https://gist.github.com/JohnMcHugh/823161 for a statistics run.
I have been looking through the code of the hash tables some more. Whether these observations matter or not depends greatly on the number of entries that actually appear in the tables and on the ratio of entry creation to entry lookup. To a limited extent, some of these issues can be addressed by the profiler, but the costs of a given hash table may be difficult to determine. These observations are based on timings for runs that created a 2^20 entry hash table and showed that setting the initial bin count to avoid bin reallocation reduced the run time by about half. I suspect that further reductions would follow from tuning the memory pool for the entries as well. Note that most of the reallocation time is spent in the last reallocation, since there are more items to move then.

Observation 1: Hash tables are relatively heavyweight for small aggregations. There is a point below which a linear search is better; above that, a binary search may be the next choice, and above that point the hash table may be indicated. Setting the default bin count at 8 (49 entries to the first reallocation) is probably never appropriate. An initial bin count of 2^14 would allow nearly 100K entries before reallocation; if we expect a million entries, 2^18 bins is a good starting point. With a linear collision search and unordered collision chains, a hash plus on the order of 5-6 comparisons (the worst case seems to be 20 or so) is required to find an entry. Linear search on an ordered list would probably be better up to 8-12 entries, binary search up to a few hundred.

Observation 2: Using an unordered collision chain requires a search to the end of the chain to determine whether an entry is present. Ordering the chain does not add to the cost, since the comparisons on insertion will occur anyhow. On a lookup, we will find an existing entry in an unsorted chain, on average, after looking at half the entries, and discover a nonexistent entry only after looking at all of them. If the values in a collision chain are uniformly distributed in the key space, an ordered list will typically find both existent and nonexistent entries in about the same number of comparisons, about half the chain length.

Observation 3: A 4K memory pool block only gives us about 128 table entries if we assume 8-byte pointers at 4 pointers per entry: key, value, forward, and back. It might be better to allocate the memory pool in blocks that are more appropriate to the number of entries supported by the bin allocation. This might result in an initial memory pool size of a megabyte or so to support 16K initial bins with no additional malloc calls. If we are going to double the bin space at each table reallocation, we might want to treat the memory pool block size the same way. Note that it is not necessary to link the pool block space onto a free list; the free list need only contain pool entries that have been allocated and then freed. The free space in the pool block is contiguous and can be allocated from a single free pointer that is compared to a limit value to determine whether a new block is needed. If the free list is non-empty, freed space is used preferentially, then unallocated block space, then a new block is allocated as necessary. This avoids the performance hiccough that occurs when a linked list containing a large number of items has to be created in the middle of an operation.

Observation 4: It is not necessary to save the old bin space and allocate new space. The existing space can be extended in place (with realloc, for example). The rehashing process essentially uncovers a new high-order bin index bit for the existing hash collision chain entries. If this bit is 0, an entry stays in its current bin; if it is 1, it moves to the corresponding bin in the newly added half of the table.
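A minimal sketch of the bin-splitting rehash described in Observation 4, assuming power-of-two bin counts and a hash value stored in each entry; the entry layout here is illustrative rather than libcork's actual structure:

```c
#include <stddef.h>

struct entry {
    size_t         hash;   /* full hash value, kept so the key need not be rehashed */
    struct entry  *next;
    /* key and value omitted */
};

/* Split each collision chain of a power-of-two bin array that has just been
 * extended in place from old_bins to 2 * old_bins buckets.  With bin index
 * computed as (hash & (bins - 1)), the newly uncovered index bit is exactly
 * (hash & old_bins): entries with that bit clear stay in bin i, entries with
 * it set move to bin i + old_bins. */
static void
split_bins(struct entry **bins, size_t old_bins)
{
    for (size_t i = 0; i < old_bins; i++) {
        struct entry  *curr = bins[i];
        struct entry  *keep = NULL;   /* new bit == 0: stays in bin i */
        struct entry  *move = NULL;   /* new bit == 1: goes to bin i + old_bins */
        while (curr != NULL) {
            struct entry  *next = curr->next;
            if (curr->hash & old_bins) {
                curr->next = move;
                move = curr;
            } else {
                curr->next = keep;
                keep = curr;
            }
            curr = next;
        }
        bins[i] = keep;
        bins[i + old_bins] = move;
    }
}
```

No temporary copy of the bin array is needed, since every entry either stays in its current bin or moves to the single corresponding bin in the new half.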
When monitoring the performance of a program that uses dynamic hash tables, it is useful to track the space allocated to the table by catching table reallocations. Since reallocations occur when a new item is entered into a table that has no room for it, the tracking could be a side effect of the entry-adding and size-setting actions.
For the cork hash tables, these are the calls for adding entries and the size-setting operations. By adding an optional "in/out" parameter to these operations, the actual size of the hash array (as opposed to the number of entries currently present in the table) could be returned whenever the table is allocated or reallocated. If the call-time value is set to 0, a simple test for a non-zero returned value identifies a reallocation point. In addition, by tracking the number of entries in the table when reallocation occurs, the typical table capacity (or load factor) for a given size can be tracked as well.
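A sketch of the call pattern being proposed, under the assumption that a mutator gained the extra parameter; `add_entry` and `record_reallocation` are placeholders rather than real cork functions:

```c
#include <stddef.h>

struct cork_hash_table;   /* provided by libcork */

/* Hypothetical signatures: add_entry() stands in for whichever cork mutator
 * would gain the optional bucket-count parameter, and record_reallocation()
 * is whatever the profiling code wants to do with the number. */
void  add_entry(struct cork_hash_table *table, void *key, void *value,
                size_t *bucket_count);
void  record_reallocation(size_t new_bucket_count);

static void
insert_tracked(struct cork_hash_table *table, void *key, void *value)
{
    size_t  bucket_count = 0;      /* 0 == "no reallocation during this call" */
    add_entry(table, key, value, &bucket_count);
    if (bucket_count != 0) {
        /* The hash array was (re)allocated inside this call; bucket_count is
         * its new size, as opposed to the number of entries in the table. */
        record_reallocation(bucket_count);
    }
}
```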
Although I have not inspected the current hash code, reallocation typically involves doubling the size of the store for the table and rehashing all of the current entries since their index in the table will go from N to N+1 bits as the size of the table is increased from 2^N to 2^(N+1) entries. Instrumenting the reallocation process is a first step in evaluating the performance of algorithms that use the hash table facilities.