Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add faster Digest#hexfinal #9292

Merged
merged 13 commits into from
Jun 22, 2022
Merged

Conversation

didactic-drunk
Copy link
Contributor

md5 final.hexstring 16   4.39M (227.78ns) (± 1.29%)  80.0B/op   1.13× slower
       md5 hexfinal 16   4.77M (209.85ns) (± 1.32%)  48.0B/op   1.04× slower
  md5 hexfinal(buf) 16   4.96M (201.77ns) (± 1.31%)   0.0B/op        fastest

@waj
Copy link
Member

waj commented May 13, 2020

Why are these methods called hexfinal instead of using the (deprecated) hexdigest?

@asterite
Copy link
Member

Also, what's the benchmark source code?

@didactic-drunk
Copy link
Contributor Author

The name hexfinal matches it's use and other methods with the same semantics (Digest#final, Cipher#final).

#hexdigest is safe to call multiple times but copies the context incurring significant (4x) overhead. One of the reasons for it's deprecation.

@didactic-drunk
Copy link
Contributor Author

Also, what's the benchmark source code?

Here. It's a bit of a mess but the unused portions may be of interest when comparing #digest and #final.

src/openssl/digest/digest.cr Outdated Show resolved Hide resolved
src/digest/base.cr Outdated Show resolved Hide resolved
didactic-drunk and others added 2 commits May 14, 2020 08:42
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
src/digest/base.cr Outdated Show resolved Hide resolved
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
src/digest/base.cr Outdated Show resolved Hide resolved
@didactic-drunk
Copy link
Contributor Author

It would be nice if these could make it in to the coming release as those getting a deprecated warning will see use hexfinal instead of use final.hexstring pointing them to slightly faster methods. If it doesn't make in to the next release those who upgrade to final.hexstring won't be warned of the new hexfinal methods unless they read the documentation.

Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a note about being callable only once and the .dup workaround to all *final methods' docs?

@didactic-drunk
Copy link
Contributor Author

Done.

@didactic-drunk
Copy link
Contributor Author

Soooo..... This didn't get merged and what I predicted would happen is (based on gitter chat).

i noticed they deprecated hexdigest in favor of Bytes#hexstring. While I can kind of understand that change, it does make porting ruby code to crystal just slightly more work, than if crystal tried to maintain compatibility. Also not sure why they wouldn't leave hexstring in, but mark it @[AlwaysInline] or such.

I suppose hexfinal could be renamed back to hexdigest. If they use it twice they get an Exception.

This should probably be resolved sooner rather than later.

@didactic-drunk
Copy link
Contributor Author

Bump.

@didactic-drunk
Copy link
Contributor Author

Stalled?

@didactic-drunk
Copy link
Contributor Author

Should I rename hexfinal back to hexdigest and let programmers handle rare double calls? A 2nd call may be embedded in Log statements that aren't in a ci or normal call path.

@beta-ziliani beta-ziliani self-assigned this May 5, 2022
@beta-ziliani
Copy link
Member

We discussed it with the team, and decided to push for it. We discussed the name, being hexdigest! the preferred one in the team. But we are also ok with hexdigest, final_hexdigest, hexdigest_final. Happy to hear anyone's opinion on this matter (in particular, yours @didactic-drunk !).

@didactic-drunk
Copy link
Contributor Author

didactic-drunk commented May 5, 2022

  • hexdigest for compatibility with raise on double use would probably make more people happy

The dup approach ruby uses doesn't seem worth preserving especially not under the same name as ported ruby code because the code will end up less performant. 95-99% of use cases don't need it.

My opinion is either:

  • Choose a new name and let porters read the docs (hexdigest not available)
  • Use hexdigest and raise on double use

hexdigest_final and final_hexdigest seem too long.

@didactic-drunk
Copy link
Contributor Author

Maybe I should read my own comments (it's been so long)

hexfinal was chosen to match final and provide familiarity to ruby porting efforts

src/digest/digest.cr Outdated Show resolved Hide resolved
Comment on lines 157 to 158
sary = uninitialized StaticArray(UInt8, 64)
tmp = sary.to_slice[0, digest_size]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering is there some effective limitation on digest_size that would ensure it's less than 64 bytes? Implementations could technically use whatever size they like. Practical implications might make digests bigger than 64 bytes unlikely, but I presume it would be possible?

If we can't excuse the case for digest_size > 64 with good certainty, there should be an alternative here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so.

We'll have to allocate on the heap when digest_size > sary.size.

Copy link
Member

@straight-shoota straight-shoota May 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe a crazier alternative? In order to avoid an additional heap allocation we could re-purpose the the string buffer which already needs to be allocated for building the string (in Slice#hexstring).

string_size = digest_size * 2
string = String.new(string_size) do |buffer|
  # put the binary data in the back half of the string buffer so values don't get overwritten when the data is transformed to hex characters
  tmp = Slice.new(buffer + digest_size, digest_size)
  final tmp
  tmp.hexstring(buffer)
  {string_size, string_size}
end

Copy link
Contributor Author

@didactic-drunk didactic-drunk May 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's possible but unlikely. I'd say leave it to those implementations to override #hexfinal or fix the issue in a future PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense 👍
We should add a note in the docs for digest_size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output size limitations removed except for hexfinal(io)

src/digest/digest.cr Outdated Show resolved Hide resolved
def final(dst : Bytes) : Bytes
check_finished
@finished = true
final_impl dst
dst
end

# Returns a hexadecimal-encoded digest in a new `String`.
#
# This method can only be called once and raises `FinalizedError` on subsequent calls.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "can only be called once" is not limited to this method, but includes all finalizing methods. You can only call one of them, not each method once.

We could document that with references to #finalize in all methods' docs, or references to all methods from all methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See latest "documentation" commit

didactic-drunk and others added 3 commits May 6, 2022 10:28
Co-authored-by: Johannes Müller <straightshoota@gmail.com>
Co-authored-by: Johannes Müller <straightshoota@gmail.com>
Comment on lines 193 to 195
def hexfinal(io : IO) : Nil
sary = uninitialized StaticArray(UInt8, 128)
tmp = sary.to_slice[0, digest_size * 2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should address the size restriction in this method. It's not possible to use the same optimization as in the String overload.
I think it's probably best to raise a meaningful error if digest_size * 2 > 128 and mention this in the documentations (that implementing types can override this method for a larger buffer size).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@didactic-drunk Would you mind to address this? Just adding a proper exception + documentation should be good. I think then it should be ready to merge.
Let me know if you want me to take over.

@straight-shoota
Copy link
Member

@didactic-drunk

hexfinal was chosen to match final and provide familiarity to ruby porting efforts

What kind of familiarity are you addressing? I can't find a hexfinal method in Ruby. Is it something else you mean here?

In fact, Ruby has Digest#hexdigest! which is the name proposed in #9292 (comment)
Its semantics also include an implicit call to #reset. After finalizing, you can't make any use of the Digest instance until its reset. So I believe this makes a lot of sense and is pracitcally useful.
It also simplifies the "can only get called once" logic. It implicitly resets, afterwards you can start building a new digest.

@didactic-drunk
Copy link
Contributor Author

didactic-drunk commented May 12, 2022

@didactic-drunk

hexfinal was chosen to match final and provide familiarity to ruby porting efforts

What kind of familiarity are you addressing? I can't find a hexfinal method in Ruby. Is it something else you mean here?

hex(action) matches Ruby.
action = final matches Crystal.

In fact, Ruby has Digest#hexdigest! which is the name proposed in #9292 (comment) Its semantics also include an implicit call to #reset. After finalizing, you can't make any use of the Digest instance until its reset. So I believe this makes a lot of sense and is pracitcally useful. It also simplifies the "can only get called once" logic. It implicitly resets, afterwards you can start building a new digest.

Auto reset may cause subtle bugs with misuse. Consider:

# Incorrect usage may be buried in a Log statement and hard to see
Log.info { "... #{digest.hexdigest!} ..." }
# Incorrect result / data corruption
digest.hexdigest! 

If you swap the order of Log and the last hexdigest!, the data is correct but logs are incorrect and not caught until someone looks.

Compared with:

Log.info { "... #{digest.hexfinal} ..." }
# Raises - error is immediately visible
digest.hexfinal

Or this, which appears correct but the order is wrong. This may occur randomly if Log.info is evaluated async

digest.hexfinal
Log.info { "... #{digest.dup.hexfinal} ..." }

With hexfinal dup copies the final state and raises. With auto reset the error is silent.

Here's another bug that's easy to create but hard to replicate with auto reset:

@@digests = ThreadLocalValue<Digest>... # Or Pool or other class that reuses digests

def fiber_handler
  digest = @@digests.get...
  digest.update header
  digest.update data.decrypt
  digest.hexdigest!
end

So where's the bug?
data.decrypt may raise leaving digest partially updated. The next call to fiber_handler running on the same Thread will compute an incorrect hash value. Extremely hard to debug and something I've already encountered by putting reset after *final instead of before. Also, your specs may pass (because there aren't network or transient errors) then fail infrequently in production. And because it's not the current request but some future request (not necessarily the next) that fails, figuring it out is very time consuming.

Ok, so what's the harm in auto reset for code that can't raise. Reuse requires reset before using update. Auto reset means double reset and unnecessary overhead. Also, can't raise today doesn't mean can't raise tomorrow.

hexdigest!with auto reset:

  • May lead to subtle bugs
  • The bugs listed aren't theoretical. I've created/encountered the last 2.
  • Doesn't indicate to users final is any part of the method (but it does match ruby).
  • Incurs #reset overhead for single use digests. Example: Hashing in HTTP handlers or other Fiber spawned handlers
  • Incurs #reset overhead for reused objects alternatively creates subtle bugs.
  • Logging inserted later may silently corrupt or mislog depending on ordering or async/fiber use.

Fundamentally auto reset assumes the object will be reused. Except it can't be reused reliably with auto reset.

Personally I think the Ruby's OpenSSL::Digest class has a horrible design with additional overhead for rare uses. I would rather see the rare uses have an explicit reset in the correct place.

If you want to solve stale updates by putting reset in correct place, I guess this would work.

# Reset at start of block

digest.open do |d|
# Or
digest.reset do |d|
  digest.update ...
  digest.final
end
# Or
digest.hexdigest do |d|
  d.update ...
end

This won't work for classes that hold a reference to @digest and update incrementally as data comes in. I have a few classes like this, but the methods would cover many cases where all data is available and digested in a single block.

There may be other options but please not auto reset.

Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the long, very convincing explanation, and for the patience <3

@straight-shoota straight-shoota added this to the 1.5.0 milestone Jun 21, 2022
@straight-shoota straight-shoota changed the title Add faster Digest#hexfinal, Digest#hexfinal(dst) Add faster Digest#hexfinal Jun 22, 2022
@straight-shoota straight-shoota merged commit ff55fb1 into crystal-lang:master Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants