Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String canonicalization #985

Open
eernstg opened this issue May 26, 2020 · 7 comments
Open

String canonicalization #985

eernstg opened this issue May 26, 2020 · 7 comments
Labels
enhanced-const Requests or proposals about enhanced constant expressions question Further information is requested specification technical-debt Dealing with a part of the language which needs clarification or adjustments

Comments

@eernstg
Copy link
Member

eernstg commented May 26, 2020

The degree to which canonicalization occurs in Dart, in particular for strings, has always been somewhat unclear.

We have explicit rules requiring that the following forms of constant expressions must be evaluated to the same (canonical) object when evaluating any of the occurrences of such an expression:

  • Symbol literals (such as #foo or #+) are canonicalized.
  • List, set, and map literals with the same elements/keys/values are canonicalized.
  • The value of a <constObjectExpression> (such as const C(2)) is canonicalized.

The definition of identical() ensures that instances of bool, int, and double are canonicalized, in the sense that they are considered identical when they represent the same value (and it is not observable whether this happens because the representation is a tagged bit string, or because identical treats distinct boxed representations in a special way).

Strings have traditionally been treated differently:

void main() {
  var s1 = "ab";
  var s2 = "a" "b"; // or `"a" + "b"`.
  print(identical(s1, s2)); // 'false' with dart, 'true' with dart2js.
}

The specification of identical actually requires that identical(c1, c2) must evaluate to true when

c1 and c2 are constant strings and c1 == c2

which implies that the behavior of dart is wrong, and dart2js is correct: "ab" is a constant expression of type String, "a" "b" is a constant expression of type String, and the two strings are equal according to operator ==. (We haven't specified the behavior of operator == on strings, but we hardly want to make those two strings unequal).

I believe that all non-string objects are canonicalized or not in a way which is well-defined: It is specified for the basic forms mentioned above that every constant expression of these forms has a canonicalized value, and constant expressions do not allow for creating composite objects from existing objects, canonicalized or not, other than constant collection literals (e.g., we can't have const C(x)). In particular, we do not have to specify any additional rules saying that the value of a constant variable is canonicalized.

It is unfortunate that different tools behave differently, but it is also likely to be a serious breaking change to change the behavior of any of these tools. So do we wish to proceed and change the specification to make identical() implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?

@munificent, @lrhn, @leafpetersen, WDYT?

@eernstg eernstg added question Further information is requested specification labels May 26, 2020
@lrhn
Copy link
Member

lrhn commented May 26, 2020

The phrasing of "string canonicalization" has always been vague. In fact, the specification does not require canonicalization, it only requires that identical(c1, c2) behaves in a specific way if c1 and c2 are constant expressions (so the identical invocation is a constant invocation). There is nothing technically prohibiting:

  const a = "s";
  const b = "s";
  print(identical(a, b));  // must be true because `identical(a, b)` is a constant expression
  var c = a;
  print(identical(c, b));  // false.  Not bound by the specification.

We wouldn't want that, but the specification allows it because it doesn't talk about canonicalization of strings at all.

If we read the specification as implicitly defining canonicalization of strings through their behavior wrt identical, then it only states that compile-time constant string expressions must canonicalized their values. (At least if that value reaches a constant identical invocation).

I believe the current VM behavior is that they canonicalize all constant string literals, and they canonicalize the value of constant string expressions in constant contexts. That is const s = "a" "b"; will be canonicalized and var s = "a" "b"; will not. A single literal like "ab" is always canonicalized.
The canonicalization is part of constant evaluation, and the VM does not handle "constant expressions" like "a" + "b" that are not required to be constant in the constant evaluation.

This does ensure that const ["a" "b"] and const ["ab"] become the same object. Any string which is part of a composite constant object will be canonicalized.
It just doesn't guarantee that the element of that list is identical to var x = "a" "b";.

The JS compilation is limited by the underlying platform. A JS program cannot distinguish two strings with the same content in any way, so they effectively canonicalize all string values. We do not want the VM to do the same, so we will definitely need to allow more canonicalization than what the specification requires.

We should probably specify the exact behavior that we require.
If we go for the current VM behavior, then we will have to change the definition of compile-time constant expressions to only include those that need to be constant (it's based on context, not syntax), rather than promising that 1 + 2 is a compile-time constant expression even when it's not needed to be.
I think that's actually an improvement.

A compile-time constant expression is:

  • The initializer expression of a constant variable declaration.
  • A list, map, or set literal preceded by const.
  • A constructor invocation preceded by const or @.
  • The expression of a case clause.
  • Any sub-expression of a compile-time constant expression.

It is a compile-time error if a compile-time constant expression is not one of ....
(listing all the currently approved forms).

Then we can say that:

If two compile-time constant expressions evaluate to instances of String with the same content, they evaluate to the same instance. That is, constant strings are canonicalized.
Strings created by non-constant expressions may be canonicalized, but they are not required to be so. (Compilation to JavaScript effectively canonicalize all strings that are equal.)

@Cat-sushi
Copy link

@eernstg

We haven't specified the behavior of operator == on strings

The API reference seems to show the behavior of operator == on String

https://api.dart.dev/stable/2.8.3/dart-core/String/operator_equals.html

@lrhn
Copy link
Member

lrhn commented May 27, 2020

We have indeed specified both String.== and identical in the library documentation.
The behavior of identical interacts with the language specification which specifies certain things must be identical. If we assume both specifications are correct, then that effectively specifies canonicalization rules (library documentation for identical says it's true if it's the same object, or if it's a num with the same representation, language spec says that certainidentical invocations must be true).

It's still better to specify canonicalization explicitly instead of implicitly. (There have been a lot of different ways to read that part of the spec over time).

@leafpetersen
Copy link
Member

So do we wish to proceed and change the specification to make identical() implementation dependent on strings, or do we wish to choose a specific behavior and get that implemented?

@munificent, @lrhn, @leafpetersen, WDYT?

Neither? Why is this a priority right now?

@munificent
Copy link
Member

It's definitely not a priority. I think it probably came up because we just migrated the language/identity tests which have some tests around strings and identical() that are more precise than the language requires and that I think fail on some platforms.

@eernstg eernstg added the technical-debt Dealing with a part of the language which needs clarification or adjustments label Aug 6, 2021
@eernstg eernstg added the enhanced-const Requests or proposals about enhanced constant expressions label Feb 22, 2024
@eernstg
Copy link
Member Author

eernstg commented Feb 22, 2024

Feb 2024: Note that the original example prints 'true' in many (perhaps all?) configurations including dart, dart compile js, and dart compile exe. This could imply that the discrepancy is disappearing over time, and more constant expressions are actually evaluated using constant evaluation:

void main() {
  var s1 = "ab";
  var s2 = "a" "b"; // or `"a" + "b"`.
  print(identical(s1, s2)); // 'true'.
}

@lrhn
Copy link
Member

lrhn commented Feb 26, 2024

That used to work, but the VM still does not canonicalize var s3 = "a" + "b";, where it does canonicalize const s3 = "a" + "b";.

It's not that "a" + "b" is not a constant expression, it's allowed in void foo([String s = "a" + "b"]){}, it's just a constant expression which is not canonicalized. (Possibly because it's not recognized and evaluated as constant unless necessary.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhanced-const Requests or proposals about enhanced constant expressions question Further information is requested specification technical-debt Dealing with a part of the language which needs clarification or adjustments
Projects
None yet
Development

No branches or pull requests

5 participants