Skip to content

Added performant impl for string upcase/downcase :ascii mode. #7680

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 15, 2018

Conversation

tckb
Copy link
Contributor

@tckb tckb commented May 12, 2018

the existing implementation with binary comprehensions turned out to be
far slower than the other modes.

The current implementation is >= 2.5X faster than the earlier implementation


this PR is a consequence of the discussions in #7673

PS: this is my first PR in elixir

the existing implementation with binary comprehensions turned out to be
_far_ slower than the other modes.

The current implementation is >= 2.5X faster than the earlier
implementation
@josevalim
Copy link
Member

@tckb Although I said to go with dn4_1, @michalmuskala brought good points about dn4_2 and memory use, so can you please migrate to dn4_2 instead? Thank you!

@tckb
Copy link
Contributor Author

tckb commented May 13, 2018

@josevalim on further benchmarking with larger & different inputs, I have not seen any consistent performance difference between dn4_1(downcase_ascii_patternMatch) and dn4_2(downcase_ascii_iodata). For memory consumption, dn4_2 has slightly better memory usage than the other. In one case (all characters of same case), I have noticed dn4_1 with better memory usage.

https://gist.github.com/tckb/ee13867e8a50e069afb04b25fc9efd85
https://gist.github.com/tckb/4cfd89a3c2e10f307f681679b528b099

Regardless, these new implementations have outperformed the standard library's impl by several folds.
https://gist.github.com/tckb/6e0a55e9482e5de785059e3a2de2b4e2

cc: @michalmuskala

@josevalim
Copy link
Member

It is your call @michalmuskala.

@michalmuskala
Copy link
Member

Let's go with the body-recursive version. That's the implementation we use for Enum.map (through :lists.map), so let's be consistent.

@josevalim
Copy link
Member

To clarify, we are going with dn4_2 then. Can you please @tckb? thanks for the extra benchmarks!

@tckb
Copy link
Contributor Author

tckb commented May 14, 2018

thanks @michalmuskala @josevalim I've changed the impl to dn4_2 I'm assuming the test cases related to this method are already in-place. I haven't created one

@michalmuskala
Copy link
Member

The upcase_ascii function is implemented in-between the upcase clauses. This causes a compiler warning. Let's move upcase_ascii after all the upcase clauses. Similar with downcase_ascii.

@tckb
Copy link
Contributor Author

tckb commented May 14, 2018

Ah! good catch @michalmuskala , was wondering about that. Is there any specific side-effect of this?

@michalmuskala
Copy link
Member

Not really, just tidiness.

Copy link
Contributor

@eksperimental eksperimental left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename c to char or to another descriptive name? Thank you

end

def upcase(string, mode) when mode in @conditional_mappings do
String.Casing.upcase(string, [], mode)
end

defp upcase_ascii(<<c, rest::bits>>) when c >= ?a and c <= ?z, do: [c - 32 | upcase_ascii(rest)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do

defp upcase_ascii(<<c, rest::bits>>) when c in ?a..?z, do: ...

The in seems clearer to me and is compiled to the same thing so we should be fine performance-wise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within the same minute we both commented the same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do not compile the same (there is an additional is_integer check which hmay be removed by the compiler or not) but we can also run into bootstrap issues, unfortunataely.

Copy link
Contributor Author

@tckb tckb May 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, the Kernel.".."/2 macros has the integer checks for the range. personally, I like the <= >= syntax. It's more pragmatic to me ;)

https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/kernel.ex#L3023

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the extra check is not supposed to matter in future erlang versions anyway but the biggest issue here is bootstrapping (in may not be available when we compile the String module).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the is_integer check on extracting data from a binary will only be eliminated on OTP 21 (I contributed that optimisation recently).

for <<x <- string>>,
do: if(x >= ?a and x <= ?z, do: <<x - 32>>, else: <<x>>),
into: ""
IO.iodata_to_binary(upcase_ascii(string))
end

def upcase(string, mode) when mode in @conditional_mappings do
String.Casing.upcase(string, [], mode)
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not with a computer but would when char in ?a..?z, do: work? If so that will read better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eksperimental normally, yes -- but as per @josevalim using in at this level will cause bootstrapping issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked the additional is_integer checks. So i would not recommend using in

@tckb
Copy link
Contributor Author

tckb commented May 14, 2018

okay, here are the todo's

  • change c to more "descriptive" name
  • use in instead of explicit >, < not considering as it might cause issue while bootstrapping.

I'll hold off the push for a day just in-case there are further changes ;)

@josevalim @michalmuskala @whatyouhide @eksperimental :D

@tckb
Copy link
Contributor Author

tckb commented May 15, 2018

Changes committed

@josevalim josevalim merged commit 95ffb4c into elixir-lang:master May 15, 2018
@josevalim
Copy link
Member

❤️ 💚 💙 💛 💜

josevalim pushed a commit that referenced this pull request May 31, 2018
The existing implementation with binary comprehensions turned
out to be slower than the other modes.

The current implementation is >= 2.5X faster than the earlier
implementation.

Signed-off-by: José Valim <jose.valim@plataformatec.com.br>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants