feature: String interpolation #7343

TD5 · 2023-06-01T14:53:02Z

Adds four kinds of string interpolation split over two axes (utf-8 binary or unicode codepoint list, and user-facing or developer-facing formatting).

The result is four general classes of syntax with interpolated values:

% binary format
<<"A utf-8 binary string: 4"/utf8>> =
  bf"A utf-8 binary string: ~2 + 2~"

% list format
"A unicode codepoint list string: 4" =
  lf"A unicode codepoint list string: ~2 + 2~"

% binary debug
<<"A utf-8 binary string: {4, foo, [x, y, z]}"/utf8>> =
  bd"A utf-8 binary string: ~{2 + 2, foo, [x, y, z]}~"

% list debug
"A unicode codepoint list string: {4, foo, [x, y, z]}" =
  ld"A unicode codepoint list string: ~{2 + 2, foo, [x, y, z]}~"

Arbitrary expressions can be nested inside string interpolation substitutions, including variables, function calls, macros and even further string interpolation expressions.

Design

Why list- and binary-strings?

In the string module from the stdlib, a string is represented by unicode:chardata(), that is, a list of codepoints, binaries with UTF-8-encoded codepoints (UTF-8 binaries), or a mix of the two.

With this in mind, the list- and binary-oriented string interpolation syntaxes accept either type of interpolated value, but the user of the interpolation determines whether they want to generate a unicode:char_list() or unicode:unicode_binary() based on which kind of interpolation they use (bf"..." and bd"..." to create binaries, or lf"..." and ld"..." to create lists).

List-strings are most useful for backwards compatibility and convenience. Binary-strings are most useful for memory-compactness and IO.
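As a rough illustration of the two result types, the sketch below flattens the same mixed unicode:chardata() value into a UTF-8 binary and into a codepoint list using the existing unicode module. The module name chardata_demo and its functions are made up for this example; they are not part of the PR.

```erlang
-module(chardata_demo).
-export([as_binary/1, as_list/1, demo/0]).

%% Flatten any unicode:chardata() into a UTF-8 binary
%% (the kind of value bf"..." and bd"..." produce).
as_binary(CharData) ->
    unicode:characters_to_binary(CharData).

%% Flatten any unicode:chardata() into a list of codepoints
%% (the kind of value lf"..." and ld"..." produce).
as_list(CharData) ->
    unicode:characters_to_list(CharData).

demo() ->
    %% A mix of codepoint lists, a UTF-8 binary and a nested list.
    Mixed = ["sm", <<"ö"/utf8>>, [$r, "gås"]],
    <<"smörgås"/utf8>> = as_binary(Mixed),
    "smörgås" = as_list(Mixed),
    ok.
```

Either function accepts the same input; only the caller's choice decides the representation of the result, which mirrors the choice between the b- and l-prefixed syntaxes.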

Why user- and developer-oriented strings?

There are two similar, but distinct cases where developers typically want to format strings: when logging/debugging, and when displaying data to users.

When logging or debugging, the most important features are typically that any kind of term can be printed, and that the output round-trips losslessly and can be read unambiguously by developers. This means retaining runtime type information, e.g. keeping strings quoted when formatting them and printing floats with full range and resolution.

When displaying to users, the most important features are typically that the output is always human-readable and cleanly formatted. This means formatting strings verbatim, without quotation marks, and not retaining any Erlang-isms (e.g. we don't want to print Erlang tuples, because they won't make much sense to the average application consumer). For such terms we'd rather get a badarg error, to push the developer to make an explicit formatting decision.
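The difference between the two axes can be approximated with the existing io_lib:format control sequences. This is only a stand-in, not the formatter added by this PR: ~tp plays the developer-facing role and ~ts the user-facing one, and the module name format_axes_demo is invented for the example.

```erlang
-module(format_axes_demo).
-export([debug/1, display/1, demo/0]).

%% Developer-facing: ~tp prints any term in Erlang syntax,
%% so strings keep their quotation marks.
debug(Term) ->
    unicode:characters_to_binary(io_lib:format("~tp", [Term])).

%% User-facing: ~ts prints character data verbatim and raises
%% badarg for terms such as tuples.
display(CharData) ->
    unicode:characters_to_binary(io_lib:format("~ts", [CharData])).

demo() ->
    <<"\"hello\"">> = debug("hello"),
    <<"hello">> = display("hello"),
    <<"{4,foo,[x,y,z]}">> = debug({4, foo, [x, y, z]}),
    %% A tuple has no user-facing form here, so formatting fails.
    {'EXIT', {badarg, _}} = (catch display({4, foo})),
    ok.
```

Note that ~tp prints the tuple without spaces after the commas, unlike the output shown in the PR's own examples.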

Why no formatting options?

Let's consider the two use-cases introduced earlier:

- Logging/debugging: Typically you want to fire-and-forget: give whatever value you care about to the formatter and let it print that value unambiguously, so there's no need to tweak formatting options: bd"~Timestamp~: ~Query~ returned ~Result~"
- Displaying to users: Typically you want to tightly control formatting, and you probably want to do so in a modular and reusable way. In that case, factoring out your formatting decision into a function and interpolating the result of that function is probably the best way to go: bf"Your account balance is now ~my_app:format_balance(Currency, Balance)~".

Notably, nothing in the design and implementation here precludes the future introduction of formatting options such as bf"float: ~.2f(MyFloat)~" as one might do with io_lib:format etc. But existing stdlib functions can offer similar functionality, e.g. bf"float: ~float_to_binary(MyFloat, [{decimals, 2}, compact])~", and can be factored out into their own reusable functions.
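A minimal sketch of the "factor it out into a function" approach, using only existing BIFs. The helper below is a hypothetical stand-in for my_app:format_balance/2; its name, argument types and output layout are assumptions made for illustration.

```erlang
-module(balance_format_demo).
-export([format_balance/2, demo/0]).

%% Hypothetical helper in the spirit of my_app:format_balance/2:
%% the formatting decision (two decimals) lives in one reusable place.
format_balance(Currency, Balance)
  when is_binary(Currency), is_float(Balance) ->
    Amount = float_to_binary(Balance, [{decimals, 2}]),
    <<Currency/binary, " ", Amount/binary>>.

demo() ->
    <<"EUR 12.50">> = format_balance(<<"EUR">>, 12.5),
    %% The compact option trims trailing zeros.
    <<"12.5">> = float_to_binary(12.5, [{decimals, 2}, compact]),
    %% The io_lib:format equivalent of a fixed two-decimal float.
    "3.14" = lists:flatten(io_lib:format("~.2f", [3.14159])),
    ok.
```

With the proposed syntax, the result of such a helper would simply be interpolated, so the string itself carries no formatting options.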

Implementation

To parse interpolated strings, the scanner tracks additional state recording whether it is currently inside an interpolated string. While it is, the scanner recognises ~ as the delimiter for interpolated expressions and generates new tokens which represent the various components of an interpolated string.

Early during compilation and shell evaluation, interpolated strings are desugared into calls to functions from the io_lib module, and therefore don't impact later stages of compilation or evaluation.

The new string interpolation syntax was not previously valid syntax, so it should be entirely backwards compatible with existing source code.
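To give a feel for what desugaring means here, the sketch below is a hand-written equivalent of bf"A utf-8 binary string: ~2 + 2~" in today's Erlang. It is not the expansion the PR generates: the PR calls new io_lib functions that are not shown in this description, so to_chardata/1 is a hypothetical stand-in covering only a few types.

```erlang
-module(desugar_sketch).
-export([interpolated/0, to_chardata/1]).

%% Hand-written equivalent of bf"A utf-8 binary string: ~2 + 2~":
%% literal fragments and formatted values are joined into one binary.
interpolated() ->
    unicode:characters_to_binary(
      [<<"A utf-8 binary string: ">>, to_chardata(2 + 2)]).

%% Hypothetical stand-in for a user-facing formatter: integers and
%% character data are accepted, anything else is rejected with badarg.
to_chardata(V) when is_integer(V) -> integer_to_binary(V);
to_chardata(V) when is_binary(V); is_list(V) -> V;
to_chardata(V) -> erlang:error(badarg, [V]).
```

Calling desugar_sketch:interpolated() returns the same binary as the first example in the PR description.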

github-actions · 2023-06-01T14:54:02Z

CT Test Results

      3 files   375 suites 46m 10s ⏱️
2 561 tests 2 504 ✔️ 47 💤 10 ❌
7 021 runs 6 961 ✔️ 50 💤 10 ❌

For more details on these failures, see this check.

Results for commit 9922f03.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.

// Erlang/OTP Github Action Bot

TD5 · 2023-06-01T15:33:31Z

I see there are a bunch of test failures outside the tests I ran locally. I'll investigate.


kikofernandez · 2023-09-13T13:05:25Z

Thanks for this contribution.

At the moment, this PR involves many small decisions that need to be consistent and well thought out, e.g. the symbol used for sigils, the complexity of the lexer and parser given that it is defined in a recursive manner, etc.

I do not think we can figure out all these details before OTP-27, so we are marking it as "stalled" and will take small steps until all the design decisions are 100% clear.

We may reach out to you in our internal channel to get a better understanding of things, but it seems unlikely that this PR will make it into OTP-27.

rvirding · 2023-09-25T14:50:40Z

Which checks failed and why?

mikpe · 2023-09-29T18:39:06Z

I see that a lot of changes spill to parts outside of the scanner and parser proper. Why? The feature should be reducible to the base language, and the parser should be able to hide that translation from the rest of the system.
It seems the new feature uses a new term formatter not available via io_lib:format. That doesn't seem right to me.

TD5 · 2023-10-02T13:28:13Z

> The feature should be reducible to the base language, and the parser should be able to hide that translation from the rest of the system.

Putting it all in the parser is a bit of a trade-off. As I see it, the options are:

Confine it to the parser, and therefore have the parser return abstract forms which do not closely match the true source syntax (this can make subsequent operations which need to be source-syntax-aware hard, such as linting).
Let the parser handle source syntax, and let later stages of the compiler perform re-writing / de-sugaring. Preserving the high-level structure beyond the parser can also allow further optimisations to be applicable, since, for example, we are guaranteed to know statically the number of interpolated sub-expressions to be formatted, which isn't true of formatting in general (e.g. with io_lib:format, even the number of arguments can be determined dynamically).
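The point about io_lib:format can be seen in a small example: both the format string and the argument list below are built at runtime, so the compiler cannot know how many values will be formatted. The module and function names are invented for illustration.

```erlang
-module(dynamic_format_demo).
-export([join_values/1, demo/0]).

%% One "~p" per element, separated by ", ". Both the format string
%% and the argument list depend on the runtime length of Values.
join_values(Values) ->
    Format = lists:append(lists:join(", ", ["~p" || _ <- Values])),
    lists:flatten(io_lib:format(Format, Values)).

demo() ->
    "1, two" = join_values([1, two]),
    "1, two, {3}" = join_values([1, two, {3}]),
    ok.
```

An interpolated string, by contrast, has a fixed number of substitutions visible in the source.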

Regarding the other spillage, that's for supporting erl_lint, the compiler's partial evaluation logic, etc. All things which I think should be updated for a change like this.

> It seems the new feature uses a new term formatter not available via io_lib:format. That doesn't seem right to me.

This is essentially because the original io_lib:format formatter lacked features to meet the design I laid out in the EEP. The implementation here adds new functions to the io_lib module which implement these. Notably, these new functions are aware of types with a "natural user-facing format", and the implementation makes use of an explicit accumulator for binaries which leverages recent optimisations for binaries in the runtime. This means we get the existing benefits of binaries (compactness & locality), plus efficient construction (which is optimised by the runtime into mutations rather than naive copying). What's more, the information is there statically to allow even more future optimisations, for example by pre-allocating the binary to minimise re-sizing of the underlying mutable binary.
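The accumulator pattern referred to here looks roughly like the sketch below, written with existing constructs only. It is not the PR's implementation; it just shows the append-to-an-accumulator shape, which the runtime can often handle by growing the binary rather than copying it on every step. The module name is invented for the example.

```erlang
-module(binary_acc_demo).
-export([concat/1, demo/0]).

%% Append each part to a binary accumulator, left to right.
concat(Parts) ->
    lists:foldl(
      fun(Part, Acc) ->
              Bin = unicode:characters_to_binary(Part),
              <<Acc/binary, Bin/binary>>
      end, <<>>, Parts).

demo() ->
    <<"id=42;ok">> = concat(["id=", integer_to_binary(42), <<";ok">>]),
    ok.
```

Because the number of parts in an interpolated string is known statically, a compiler could in principle also compute a size hint up front, which is the pre-allocation idea mentioned above.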

This was referenced Jun 1, 2023

Create eep-0062.md: String interpolation syntax erlang/eep#45

Merged

Create eep-0063.md: Lightweight UTF-8 binary string literals and patterns erlang/eep#46

Closed

TD5 force-pushed the string-interpolation branch from 9922f03 to 7236940 Compare June 2, 2023 13:07

rickard-green added team:VM Assigned to OTP team VM team:PS Assigned to OTP team PS labels Jun 5, 2023

jhogberg assigned bjorng, kikofernandez, jhogberg and frazze-jobb Jun 5, 2023

IngelaAndin added stalled waiting for input by the Erlang/OTP team team:PO Assigned to OTP team PO labels Jun 14, 2023
