Move to hashing instead of generating URL keys #17
Instead of producing a `string` for each URL to use as a key, it is sensible to produce a hash.

I pulled in the SpookyV2 hash library, which is reasonably compact and public domain, and produces 128-bit hashes. We need a hash of at least 128 bits: by the birthday bound, a 64-bit hash will probably collide after about 2^32 inputs, which would be about 256 GB of URLs with an average length of 64 bytes. This may have been fine, but to be safe, a 128-bit hash function moves that bound to about 2^64 inputs, which lets us process ~1'000'000 PB of URLs, which is definitely fine.
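As a rough check of the numbers above, here is a minimal sketch of the birthday-bound arithmetic. The function names are made up for illustration and are not part of this PR; it assumes an ideal hash and the 64-byte average URL length used in the estimate.

```cpp
#include <cassert>

// 2^e as a long double (exact for the exponents used here).
constexpr long double pow2(unsigned e) {
    return e == 0 ? 1.0L : 2.0L * pow2(e - 1);
}

// Birthday bound: an ideal hash with `bits` output bits is likely to
// see its first collision after roughly 2^(bits/2) distinct inputs.
constexpr long double birthday_bound_inputs(unsigned bits) {
    return pow2(bits / 2);
}

// Total bytes of URLs at that point, assuming `avg_url_bytes` per URL
// (64 is the assumption from the estimate, not a measured value).
constexpr long double birthday_bound_bytes(unsigned bits, long double avg_url_bytes) {
    return birthday_bound_inputs(bits) * avg_url_bytes;
}

// Runtime self-check of the two figures quoted in the description.
inline void check_estimates() {
    // 64-bit hash: 2^32 URLs * 64 B = 2^38 B = 256 GiB.
    assert(birthday_bound_bytes(64, 64.0L) / pow2(30) == 256.0L);
    // 128-bit hash: 2^64 URLs * 64 B = 2^70 B = 2^20 PiB (~1'000'000 PB).
    assert(birthday_bound_bytes(128, 64.0L) / pow2(50) == 1048576.0L);
}
```

The bound is a rule of thumb for where collisions become likely, not a guarantee that none occur earlier.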
I also changed the `is_asset` and `is_number` functions to accept `string_view`s, since that will work with both `string`s and `string_view`s.

This PR depends on my last one. That is my mistake, and I don't know how to fix it. Let me know if you need me to do something about that.
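To illustrate the `string_view` change to `is_asset` and `is_number`, here is a minimal sketch of the signatures. The function bodies and the extension list are hypothetical stand-ins, since the real implementations are in the diff; only the parameter type reflects what this PR describes. It assumes C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical body: true if s is non-empty and consists only of decimal digits.
inline bool is_number(std::string_view s) {
    return !s.empty() &&
           std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Hypothetical body: true if path ends with one of a few asset extensions.
// The extension list is an assumption made for this example.
inline bool is_asset(std::string_view path) {
    static constexpr std::string_view exts[] = {".css", ".js", ".png", ".jpg"};
    for (std::string_view ext : exts) {
        if (path.size() >= ext.size() &&
            path.substr(path.size() - ext.size()) == ext) {
            return true;
        }
    }
    return false;
}

// Both std::string and std::string_view arguments work without copying,
// because std::string converts implicitly to std::string_view.
inline void accepts_both_kinds() {
    const std::string owned = "/static/site.css";
    const std::string_view view = "12345";
    assert(is_asset(owned));
    assert(is_number(view));
}
```

Taking the parameter by value is the usual convention for `string_view`, since it is just a pointer and a length.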