Skip to content

Conversation

@pjfanning
Copy link
Member

current code uses 3 different ways to lowercase a char

  • parboiled2 CharUtils method
  • an internal method on HttpHeaderParser
  • uses Java built-in Character.toLowerCase

I did a jmh benchmark in my pekko-bench project
https://github.com/pjfanning/pekko-bench/blob/main/src/main/scala/example/LowerCaseBench.scala

And this showed that function in HttpHeaderParser was better than or matched the perf of the others on all Java versions.
Parboiled2 was always slower. The Java built-in was as good for Java 21 and 25 but slowest approach in java 17 and below.

This change switches all code to use the pekko-http internal method.

@pjfanning
Copy link
Member Author

The link validator issue seems a temp issue - the supposed broken link works for me.

@pjfanning pjfanning requested review from He-Pin and raboof October 1, 2025 15:50
Copy link
Member

@raboof raboof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change, but it'd be good to double-check that this doesn't allow sneaking in harmful characters that were sanitized out before. How did parboiled behave for non-7-bit-ascii characters?


(char1 | char2) < 0x80 &&
Character.toLowerCase(char1) == Character.toLowerCase(char2) &&
CharUtils.toLowerCase(char1) == CharUtils.toLowerCase(char2) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 clearly safe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in hindsight, this might be the riskiest change. CharUtils.toLowerCase only lower cases the 26 standard letters of the latin alphabet (the ones in 7 bit ASCII table) while java.lang.Character.toLowerCase seems to support other letters that appear in Unicode table. Let me re-review this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually with (char1 | char2) < 0x80 in the check, I think we can assume that it doesn't matter that Character.toLowerCase can handle non-ASCII chars.

///////////// helpers /////////////

private def appendLowered(): Rule0 = rule { run(sb.append(CharUtils.toLowerCase(lastChar))) }
private def appendLowered(): Rule0 = rule { run(sb.append(CU.toLowerCase(lastChar))) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assume lastChar is 7-bit ascii here? or are we OK with letting other characters through?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parboiled CharUtils only deals with the 26 standard letters of the latin alphabet (the ones in 7 bit ASCII table) - same as the pekko-http internal toLowerCase.

@pjfanning pjfanning marked this pull request as draft October 1, 2025 18:44
@pjfanning pjfanning marked this pull request as ready for review October 1, 2025 20:49
@pjfanning
Copy link
Member Author

@raboof I've looked again and I think these changes are safe. wdyt?

@pjfanning pjfanning added this to the 2.0.0 milestone Oct 2, 2025
@raboof
Copy link
Member

raboof commented Oct 6, 2025

I'm not sure why d55eed0 makes sense? It seems 'correct' but looks like it'd be slower if anything?

@pjfanning
Copy link
Member Author

I'm not sure why d55eed0 makes sense? It seems 'correct' but looks like it'd be slower if anything?

The main part d55eed0 was to change to calling byteChar function. This should get inlined by Hotspot compiler. It is used in lots of places already and this usage seems strange not to use it. I can remove this change though and come back to it. Maybe the correct thing to do is actually make the inlining of byteChar explicit (by declaring inline def byteChar and @inline def byteChar which means slitting the code into Scala 2 and 3 versions of the package.scala).

Copy link
Member

@raboof raboof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

Update CharUtils.scala

Update CharUtils.scala

use byteChar

Remove unused import from HttpHeaderParser.scala

revert change to use byteChar fn

Update HttpHeaderParser.scala
@pjfanning pjfanning merged commit a6b03ea into apache:main Oct 6, 2025
5 checks passed
@pjfanning pjfanning deleted the lowercase branch October 6, 2025 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants