New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in String.replace_leading/3 and String.replace_trailing/3 #5022

Merged
merged 1 commit into from Jul 19, 2016

Conversation

Projects
None yet
7 participants
@whatyouhide
Copy link
Member

whatyouhide commented Jul 17, 2016

Closes #5019.

These function would end up in infinite recursion if the "match" argument to replace was an empty string. With this commit, we explicitly check for that empty string and possibly raise an ArgumentError exception.

@@ -618,6 +618,10 @@ defmodule String do
@spec replace_leading(t, t, t) :: t
def replace_leading(string, match, replacement)
when is_binary(string) and is_binary(match) and is_binary(replacement) do
if match == "" do

This comment has been minimized.

@michalmuskala

michalmuskala Jul 17, 2016

Member

Wouldn't it be better to do this in a guard?

This comment has been minimized.

@whatyouhide

This comment has been minimized.

@michalmuskala

michalmuskala Jul 17, 2016

Member

It seems like using a guard (or pattern matching in a function) is the usual way of dealing with invalid input.

def replace_leading(_string, "", _replacement) do
  raise ArgumentError, "cannot use an empty string as the match to replace"
end

def replace_leading(string, match, replacement)
    when is_binary(string) and is_binary(match) and is_binary(replacement) do
  # implementation
end

looks to me the most idiomatic. I would also suspect guards to introduce less overhead (although that's marginally important in that case).

This comment has been minimized.

@whatyouhide

whatyouhide Jul 17, 2016

Member

I wouldn't worry about overhead. The problem with your example is that then we say that the problem with replace_leading(:not_a_binary, "", %{hello: "mom, look not a binary!"}) will be that the match is "" instead of the other two arguments not being binaries. So the solution would be to add is_binary/1 clauses to the "" clause as well, but at that point I prefer the if version. ¯_(ツ)_/¯ I can change if other people prefer that, but personally I think the difference is not very strong anyways.

This comment has been minimized.

@eksperimental

eksperimental Jul 17, 2016

Member

yes. using if saves us from two extra is_binary guard checks (one for string, one for replacement),
if you want to use guards, which seems reasonable, is that you need to create a private function, check only once in the public function, and not check anything in the private one.

def replace_leading(string, match, replacement) when is_binary(string) and is_binary(match) and is_binary(replacement),
  do: do_replace_leading(string, match, replacement)

defp do_replace_leading(_string, "", _replacement) do
  raise ArgumentError, "cannot use an empty string as the match to replace"
end

defp do_replace_leading(string, match, replacement) do
  # implementation
end

This comment has been minimized.

@whatyouhide

whatyouhide Jul 17, 2016

Member

Yes, but here I think the if is the cleanest and easiest thing to read :)

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

I'm not so sure about this behaviour.
I think if replacement is "", we should return the original string.

I will do a bit of research.

another weird behaviour is

iex>  Regex.replace(~r//, "123", "x")
"x1x2x3x"
@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental if we return the original string, it will be as if we considered "" to not have matched, but that makes no sense since "" always matches. I discussed this with @lexmag as well before submitting the PR and we both agreed on this, hence why I went with the raising.

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

Ok, it makes sense, but there's another case.
String.replace_leading("", "", "found")
if empty string is found, and it was same as original, shouldn't we return "found", instead of raising?

I have tried previous regex in Perl, and it works as in Elixir

$ echo "123a456" | perl -pe "s//X/g" 
X1X2X3XaX4X5X6X
@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental the thing is that String.replace_leading/3 (and _trailing) accept the "match" as a controlled input, meaning the user is supposed to control it and check it's something meaningful. I think it's reasonable to never allow "" because semantically it makes no sense. And in the replace_leading("", "", "foo") example, I still think "" matches infinite times even if the string is an empty string itself: infinite empty strings concatenated are still an empty string and match everything infinite times, even the empty string itself. Did I explain what I mean? :)

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

yes. I can see what you mean. it makes sense.
Could we please add String.replace_leading("", "", "found") as a test?

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental good call, updated with a new test.

@@ -657,6 +661,10 @@ defmodule String do
@spec replace_trailing(t, t, t) :: t

This comment has been minimized.

@eksperimental

eksperimental Jul 17, 2016

Member

return should be t | :no_return
since we are raising

@@ -618,6 +618,10 @@ defmodule String do
@spec replace_leading(t, t, t) :: t

This comment has been minimized.

@eksperimental

eksperimental Jul 17, 2016

Member

return should be t | :no_return
since we are raising

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

@whatyouhide we need to update the docs explaining this special case.

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

Let's take a step back and understand the usage of "" throughout String:

ex(2)> String.split "foo", ""
["f", "o", "o", ""]
iex(3)> String.contains? "foo", ""
true
iex(4)> String.ends_with? "foo", ""
true

If a string contains "", shouldn't replacing "" mean something? How are other languages tackling this?

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

I am pinging @ThomasArts because he may be able to encode this discussion into properties that may resonate with everyone. :)

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

@josevalim then it should behave same way as my Regex example,
and it will also be in agreement with that @whatyouhide says: #5022 (comment)
there are infinite empty strings between each letter, so slot in between gets a replacement inserted

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

In Ruby, it's:

'foo'.split('')
#=> ['f', 'o', 'o']

In Node, it's:

'foo'.split('')
//=> ['f', 'o', 'o']

In Python (thankfully someone makes sense here 😄):

>>> 'foo'.split('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator

I think the correct behaviour might be to raise for empty separator in String.split/2 as well. If you want codepoints/graphemes, we have String.codepoints/1 and String.graphemes/1 so achieving that through String.split/2 seems like a weird way to do it.

As for contains?/2 and ends_with?/2, they seem to behave consistently in all languages (e.g. every string contains the empty string and every string ends/starts with the empty string). These properties may make sense as the empty string has no size so every string contains infinite empty strings. And this is the reason why we raise in replace_leading but not replace_prefix (we can replace one prefix, but not all of them because there's infinite). Makes sense? :)

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

@whatyouhide I was mostly concerned about how replace_leading and replace_prefix works on other languages with empty strings. :)

To put things in perspective I would expect this to happen: if String.ends_with?(string, value) then String.replace_trailing(string, value, another) |> String.ends_with?(another) should be true. The code in this PR violates this property. However, I do understand the matter of infinity which happens with replace_trailing which it does not happen with replace_suffix. Should we at least agree that replace_prefix and replace_suffix should accept empty strings and add tests for those too?

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

Yeah I agree for replace_prefix and replace_suffix, those make sense to me as they replace the first occurrence of the match so it makes sense with "" as the match as well.

The property you showed would be nice to have but I think this is an intrinsic problem with having String.replace_leading/3 and String.replace_trailing/3 in the stdlib. Other languages (at least the ones I'm looking at) appear to have no similar functionality and you have to build that yourself, treating the empty string in the way that better fits you. Not sure what to do then 😕

@lexmag

This comment has been minimized.

Copy link
Member

lexmag commented Jul 17, 2016

I think we should raise on empty string in String.split/2 as well; today we have a "hack" here to handle "" match, since :binary.split("asd", "", [:global]) raises ArgumentError.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@lexmag that's good news. I can take care of that, and I can add tests for a "" match to replace_prefix and replace_suffix.

Now we only need to decide what to do with replace_leading and replace_trailing.

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

@whatyouhide we are not changing String.split. If we want to, we should open up a discussion apart from this one but I would very likely veto it since it is well documented and I see no reason why we should stick with the behaviour described in :binary.match/2. Specially given we changed String.contains? to match :binary.match in the past and then @ThomasArts proved with properties that not following :binary.match was the best way to go.

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

I fail to see the problem with the infinity of empty strings.
let's say [IES] (Infinite empty strings)
"foo" is [IES]f[IES]o[IES]o[IES]

String.prefix("foo", "", "X), replaces the first empty space in [IES].
should be interpreted as:
``String.prefix([IES]f[IES]o[IES]o[IES], [IES], [IES]X[IES]) should return:[IES]X[IES][IES]f[IES]o[IES]o[IES]` which reads as `"Xfoo"`

String.leading("foo", "", "X"), replaces all empty spaces in [IES].
should be interpreted as:
``String.prefix([IES]f[IES]o[IES]o[IES], [IES], [IES]X[IES]) should return:[IES]X[IES]f[IES]o[IES]o[IES]` which reads as `"Xfoo"`

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

We can of course look into this from the properties perspective, that would be helpful I think. But which property String.split/2 holds, that sounds harder :) Also, String.split("", "") == [""], this stuff is a slippery slope: does that mean that "" is actually an empty string followed by an empty string? Or an empty string preceded by an empty string?

The empty string is dangerous and I think we should treat it as such.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental you're making IES (infinite empty string) finite like that :)

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

@whatyouhide how come?

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental you're just treating it like a codepoint in your examples.

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

@whatyouhide in my examples I am treating it as infinite codepoints, and every code point will be surrounded by infinite empty codepoints.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

It's a slippery slope dealing with infinites @eksperimental. In your reasonings, ∞−∞ = 0, which is not a truth in math (it's indeterminate). This math.stackexchange.com answer explains this nicely :)

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

but where in my examples I end up with 0? It always ends up with inifite empty spaces

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

@eksperimental you replace infinite empty strings with infinite empty strings considering them equal because they're both infinite empty strings, but as I said above, ∞−∞ = 0 (which means ∞ = ∞) is not true and indeterminate in math. E.g. you do String.prefix([IES]f[IES]o[IES]o[IES], [IES], [IES]X[IES]) and replace [IES] at the beginning, under the wrong assumption that IES == IES, that's where you apply ∞−∞ = 0.

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

String.split("", "") == [""] an empty string cannot be further split. Split also always returns a non-empty list. I would say this result is actually quite reasonable.

Here is another property we could consider:

if String.split(string, value) |> Enum.at(-1) == "" then
  String.split(String.replace_trailing(string, value, another), another) |> Enum.at(-1) == ""

Given that String.split on "" does not return infinite empty strings at the end, I think it is reasonable for replace_trailing with an empty string to behave the same as replace_suffix with an empty string.

@lexmag

This comment has been minimized.

Copy link
Member

lexmag commented Jul 17, 2016

Considering how String.split/2 behaves today, I think not raising on empty string in replace_trailing and replace_leading is consistent.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

Yes we either raise in all of them or in none of them (raising in all three would be consistent as well, just saying 😄). So what shall we go with?

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 17, 2016

I think it should behave like replace_suffix and replace_prefix (basically the value is appended or prepended).

@eksperimental

This comment has been minimized.

Copy link
Member

eksperimental commented Jul 17, 2016

@whatyouhide sorry, i don't agree. My reasoning for _prefix I that I replace one of the empty strings in the series of infinite strings, with non-empty strings which is surrounded by infinite empty strings.

String.prefix([IES_1][IES_2][IES_3][IES_...]f[IES_...]o[IES_...]o[IES_...], [IES_...], [IES_...]X[IES_...])
returns: [IES_...]X[IES_...][IES_2][IES_3][IES_...]f[IES_...]o[IES_...]o[IES_...]
which ends up being squeezed up as: [IES_...]X[IES_...]f[IES_...]o[IES_...]o[IES_...]

for a more practical way of explaining _leading (which is pretty much the same as for _prefix).. think of grabbing an infinite number of empty strings, split it right in the middle and insert a non-empty string which is surrounded by infinite number of empty strings.
it makes no difference whether is _leading or _prefix. you will always will end up with the non-empty string surrounded by infinite empty strings.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 17, 2016

Ok @josevalim, let's go with that then.

@eksperimental you're doing the same you did before; you can't replace IES_1 with IES_... because you have to replace matching strings and IES_1 (supposedly a single empty string) doesn't match IES_.... Also, I could argue that IES_1 is not one empty string but infinite empty strings; no way to disproof that :) Math is hard, so let's not try to justify this with it and let's just go with prepending/appending and document this well, and be sad that chaos rules this world. ¯_(ツ)_/¯

@ThomasArts

This comment has been minimized.

Copy link

ThomasArts commented Jul 17, 2016

The property I typically would like to see hold is:

property "replace all leading occurrence of strings" do
    forall {ls, n, s, rs}  <- {string(), nat(), string(), string()} do
      implies not String.starts_with?(s, ls) do
         string = String.duplicate(ls, n) <> s 
         replaced = String.duplicate(rs, n) <> s
         ensure String.replace_leading(string, ls, rs) == replaced
        end
    end
end

In Elixir version 1.3.2, this property passes 100,000 tests without problem.

In other words, if we take a random strings lsand s such that ls is not a prefix of s, then I expect that replace leading replaces n copies of ls with ncopies of rs. No matter what value n and rs have.

My string generator does indeed generate empty strings.

As a consequence of the implies operator, this means that we never call the replace_leading function with the empty string, since it makes no sense to predict the outcome.

@ThomasArts

This comment has been minimized.

Copy link

ThomasArts commented Jul 17, 2016

Another property... which reveals the problem that you discuss is:

  property "replace all multiple occurrences of string" do
    forall {ls, n, rs}  <- {string(), nat(), string()} do
        string = String.duplicate(ls, n) 
        replaced = String.duplicate(rs, n)
        ensure String.replace_leading(string, ls, rs) == replaced
      end
  end

Given a string that is just n copies of a random string ls, we expect this to result in n copies of rs. However, this property fails with a timeout. The replacement takes too longif ls is the empty string.

This is a good reason to raise an exception when ls is the empty string. After all, how would you replace 4 copies of "" by 4 copies of "foo"?

@ThomasArts

This comment has been minimized.

Copy link

ThomasArts commented Jul 18, 2016

Thanks for getting me into the discussion. I was a bit late with my reaction, but I added properties to the discussion and let me know if you want more of such.

I came to the same conclusion as you reached: raise an exception when the leading string is empty.
Cheers
Thomas

On 17 Jul 2016, at 15:40, Eksperimental notifications@github.com wrote:

@josevalim https://github.com/josevalim then it should behave same way as my Regex example,
and it will also be in agreement with that @whatyouhide https://github.com/whatyouhide says: #5022 (comment) #5022 (comment)
there are infinite empty strings between each letter, so slot in between gets a replacement inserted


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub #5022 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AAm8P_xN8_jtqYxUfPu2bEZMf1IjSq9qks5qWjDYgaJpZM4JONOS.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 18, 2016

Thanks a lot @ThomasArts! 💟 This was superhelpful. And your latter property shows pretty well that things start to break (conceptually I mean) when we allow to replace all the leading occurrences of "" in a string. We may go with the solution @josevalim proposed as well (make it behave like replace_prefix/3) for the sake of just "making it work", but I am not convinced.

Thoughts?

@ThomasArts

This comment has been minimized.

Copy link

ThomasArts commented Jul 18, 2016

For the replace_prefix, I expect the following property to hold:

 property "replace one prefix occurrences of string" do
    forall {ls, rs, s}  <- {string(), string(), string()} do
        string = ls <> s
        replaced = rs <> s
        ensure String.replace_prefix(string, ls, rs) == replaced
    end
  end

Which it does, even for ls being the empty string. The reason is that the empty string appended to an arbitrary string always at most represents one string.
However, in replace_leading, the same string can represent zero or more leading empty strings. The information on the amount of occurrences is not present when string length is zero.

In my opinion, allowing the empty string as prefix makes sense, It turns replace_prefix in the <> operator.
Not very useful, but consistent.

In general, the simpler the specification, the easier for people to understand what the function does. If one adds complexity to the specification, such as:

 property "replace one prefix occurrences of string" do
    forall {ls, rs, s}  <- {non_empty(string()), string(), string()} do
        string = ls <> s
        replaced = rs <> s
        ensure String.replace_prefix(string, ls, rs) == replaced
    end
  end

Then the user has to program extra code to check and avoid the first argument to be empty or to reason that this in the specific application will never happen.

/Thomas

On 18 Jul 2016, at 10:38, Andrea Leopardi notifications@github.com wrote:

Thanks a lot @ThomasArts https://github.com/ThomasArts! 💟 This was superhelpful. And your latter property shows pretty well that things start to break (conceptually I mean) when we allow to replace all the leading occurrences of "" in a string. We may go with the solution @josevalim https://github.com/josevalim proposed as well (make it behave like replace_prefix/3) for the sake of just "making it work", but I am not convinced.

Thoughts?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub #5022 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/AAm8P80KOUhyvtVLtG2YiL_4tuu26zElks5qWztvgaJpZM4JONOS.

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 18, 2016

@ThomasArts yes, for replace_prefix we agreed that "" is allowed and it turns it in <> basically, as you said. So no need for empty string checks and such. But for replace_leading, as you showed, it's a bit more complicated and we may need the users to do their empty string checks in order to be able to have a consistent function.

@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 18, 2016

That was really helpful @ThomasArts, thank you!

So to tidy up the discussion:

  • replace_prefix and replace_suffix will work with empty strings
  • replace_leading and replace_trailing will raise for empty strings

Let's make sure to add documentation that says it is impossible to replace multiple (leading|trailing) empty strings, hence the error.

@ericentin

This comment has been minimized.

Copy link
Contributor

ericentin commented Jul 18, 2016

I like the current proposed solution, as I think that even though it is not "mathematically' consistent, it is pragmatic and will prevent people from doing something that doesn't really make sense to do. 👍

@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 18, 2016

Small update: String.replace_prefix/3 and String.replace_suffix/3 now (93bc364) officially support "" as the match, in which case they prepend/append the given replacement, respectively.

Fix a bug in String.replace_leading/3 and String.replace_trailing/3
These function would end up in infinite recursion if the "match"
argument to replace was an empty string. With this commit, we explicitly
check for that empty string and possibly raise an ArgumentError exception.
@whatyouhide

This comment has been minimized.

Copy link
Member

whatyouhide commented Jul 18, 2016

Update the PR so that replace_leading/3 and replace_trailing/3 raise if the match is ""; I also update the typespec with no_return for both and documented the new behaviour.

@josevalim josevalim merged commit 070e395 into elixir-lang:master Jul 19, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@josevalim

This comment has been minimized.

Copy link
Member

josevalim commented Jul 19, 2016

❤️ 💚 💙 💛 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment