Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to hashing instead of generating URL keys #17

Closed
wants to merge 5 commits into from
Closed

Move to hashing instead of generating URL keys #17

wants to merge 5 commits into from

Conversation

larskraemer
Copy link
Contributor

Instead of producing a string for each URL to use as a key, it is sensible to produce a hash.
I pulled in the SpookyV2 hash library, which is reasonably compact and public domain, and produces 128-bit hashes. We definitely need a 128-bit or greater hash, as a 64-bit hash will probably collide after about 2^32 inputs, which would be about 256 GB of URLs with an average length of 64 bytes. This may have been fine, but to be safe, a 128-bit hash function lets us process ~1'000'000 PB of URLs, which is definitely fine.
I also changed the is_asset and is_number functions to accept string_views, since that will work with both strings and string_views.

This PR depends on my last one. That is my mistake, and I don't know how to fix it.. Let me know if you need me to do something about that.

@larskraemer
Copy link
Contributor Author

I updated this with a better input method which requires fewer allocations. @ameenmaali

@ameenmaali
Copy link
Owner

Hey @larskraemer, thanks so much for the work here :D - I have been super busy/offline the past couple weeks as you've seen. I'll take a look at this when I get back to it in a few days. As a side note/first glance, is there any solution that doesn't require 3rd party code? That was one of my goals building this (hence why a lot of stuff could probably be easily done in an existing library). I've checked out the std::hash lib a bit, but not sure if it's got some issues. No worries if you think this is better, this looks pretty solid to me at the moment.

@larskraemer
Copy link
Contributor Author

@ameenmaali
I noticed :D
As for not using third party libraries; As mentioned above, we probably want a 128 bit hash. I don't think the standard library provides one, so we're stuck with either writing our own, or using somebody else's solution.
As for writing your own hash function, that's hard, and the result will probably be pretty bad (not because you or I are dumb, but because, as mentioned, it is hard to write a good hash function).
I get that this is a learning project, but in this case, I think the best lesson is probably "Don't write your own hash function" :D

@larskraemer larskraemer closed this by deleting the head repository Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants