Implement format/interpolation strings #822

kwshi · 2021-05-07T18:19:31Z

Resolves #11.

This PR builds off of @casey's existing work on parsing interpolation strings:

- Fixes parsing of interpolation fragments
- Handles un-indenting & escape sequence processing
- Implements evaluation of interpolation strings
- Adds some parser tests

While it appears to work correctly, the code I've written to do so is pretty patchy, and a few more things are needed for reliability & maintainability:

- Add some evaluation tests?
- Possibly refactor the un-indenting code.

Because of the way interpolated strings are structured (Vec<Fragment>), it's not really possible to re-use the existing logic in unindent.rs, which acts on individual &strs. (It's not sufficient to run unindent on each one of the Text fragments individually, because the common indentation needs to be calculated across the entire string, i.e. all fragments, not simply all lines for a given fragment.)

As such, a lot of the logic of unindent had to be reimplemented, but with a lot more consideration for edge-cases, etc. The resulting code is pretty messy and hard to read, and I have an idea about how to re-architect the FormatString type to be able to re-use the existing unindent code cleanly, but it requires some heavier refactoring, so I've held off on it for now to see if other contributors have any better thoughts/comments on this challenge.
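To illustrate the point about computing the common indentation across the entire string rather than per fragment, here is a minimal sketch. It is not the code in this PR: `Fragment` is a simplified, hypothetical stand-in (the interpolation holds a `String` instead of an `Expression`), and it assumes that only spaces and tabs count as indentation and that whitespace-only lines are ignored.

```rust
/// Hypothetical, simplified stand-in for the PR's fragment type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Fragment {
    Text(String),
    Interpolation(String),
}

/// Record the indentation of a line whose first non-whitespace item was just reached.
fn record(min: &mut Option<usize>, current: usize) {
    *min = Some(min.map_or(current, |m| m.min(current)));
}

/// Smallest leading-whitespace count over all non-blank lines of the whole
/// format string, where a line may span several fragments.
fn common_indentation(fragments: &[Fragment]) -> usize {
    let mut min: Option<usize> = None;
    let mut current = 0; // leading whitespace seen so far on the current line
    let mut in_indent = true; // true until the line's first non-whitespace item

    for fragment in fragments {
        match fragment {
            // An interpolation counts as content, so it ends the indentation.
            Fragment::Interpolation(_) => {
                if in_indent {
                    record(&mut min, current);
                    in_indent = false;
                }
            }
            Fragment::Text(text) => {
                for c in text.chars() {
                    match c {
                        '\n' => {
                            current = 0;
                            in_indent = true;
                        }
                        ' ' | '\t' if in_indent => current += 1,
                        _ => {
                            if in_indent {
                                record(&mut min, current);
                                in_indent = false;
                            }
                        }
                    }
                }
            }
        }
    }

    min.unwrap_or(0)
}
```

For fragments like `Text("  a\n")`, `Interpolation(x)`, `Text("\n    b")`, the second line consists only of the interpolation and has no indentation, so the common indentation is 0, even though each text fragment taken alone would suggest 2 or 4.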

@casey if it's not too much trouble, perhaps you can take a look at this & offer some suggestions on how to tidy up the code? Thanks!

kwshi · 2021-05-07T18:28:26Z

Some of the code diffs come from my editor auto-running rustfmt on save, and I accidentally committed them. I didn't bother picking them out afterward, but if you want me to, I can revert those!

casey · 2021-05-08T02:01:36Z

Nice! Today's a little busy, but I should definitely be able to check this out tomorrow.

casey · 2021-05-09T02:57:14Z

src/compilation_error.rs

+      UnterminatedBacktick => {
+        writeln!(f, "Unterminated backtick")?;
+      },
+      UnterminatedFormatString => {


There should probably be an UnterminatedFormatBacktick variant as well.

casey · 2021-05-09T03:03:46Z

src/lexer.rs

+    // TODO:
+    // - what token should format backticks be?
+    // - how are format strings dedented?
+    // - brag about recursive format strings
+    // - test empty format string interpolation error
+    // - handle backticks
+    // - test format string evaluation
+    // - format strings with text immediatle before and after inteprolation
+    //   start/end
+    // - test format strings inside of recipe bodies
+    // - test multi-line format strings inside of recipe bodies
+    // - test open delimiters with multiple lines inside of recipe bodies
+    // - test multi-line strings inside of recipe bodies
+    // - combine backticks and string literals
+    // - must avoid parsing format string or backtick in shell setting
+


Suggested change: delete the TODO block above.

I'll comment with the ones in this list that are still to do.


casey · 2021-05-09T05:32:58Z

This looks great so far.

There were some errors from clippy on CI. I have clippy turned up to "diabolical", so sometimes it complains about silly things, in which case I just add an allow to main.rs. The things it's complaining about currently seem reasonable, but if any of them seem unreasonable or the code to fix them is worse, we can turn off those lints globally or on a case-by-case-basis.
I use nightly rustfmt for Just, which supports more formatting options. You can do rustup install nightly to install it and cargo +nightly fmt --all to run it. I pushed a commit that fixes the clippy warnings.
The bin/lint script that runs on CI will fail if things like 'todo' or dbg!(…) are found in source code. I use this to avoid committing notes and debug code.
I feel like the contents of the unindent module can be made generic over both &str and &[StringFragment]. I added an Unindent trait that is implemented for str. I claim (but cannot prove!) that this could be implemented for [StringFragment]. It's getting late, so I didn't give it a shot. It will be tricky, because in Unindent::slice, start and end are character indices. I think that for the purpose of these indices, an interpolation would have length 1, and a text fragment would have length text.len().
This needs a bunch of integration tests, check out the tests directory. I added tests/format_strings.rs with a test that checks that unknown variables inside format strings produce a correctly formatted error. I like to have integration tests for all aspects of all features, as well as for all errors. For errors, one thing to check for is that the underlined token, if there is one, makes sense. Some test ideas off the top of my head:
- Circular variable dependency inside format strings: foo := f"{{foo}}"
- Empty format string foo := f"{{}}"
- Unterminated format string: foo := f"
- Unterminated format string interpolation: foo := f"{{
- Backtick evaluation tests
- Unindentation tests
- Unindentation test that makes sure that escape-sequence produced whitespace is not unindented
- Escape processing edge case: foo := f"\{{""}}n" # this probably shouldn't be a newline
- Format strings inside of recipe bodies
- Weird, multi-line, indented format strings inside recipe bodies
- Recursive format strings
- Check that set shell := [f"{{foo}}"] does not work. (The value of set shell must be evaluated without invoking the shell, so no format strings.)
Feel free to add some of these, and I can also add some.
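
The index convention suggested above for an Unindent implementation over fragments (an interpolation has length 1, a text fragment has length text.len()) might look roughly like the sketch below. This is only an illustration under assumptions: `StringFragment` here is a simplified stand-in whose interpolation holds a `String` rather than an `Expression`, the helper names `fragment_len`, `total_len`, and `locate` are hypothetical, and lengths are byte lengths because that is what `text.len()` returns.

```rust
/// Hypothetical, simplified stand-in for the fragment type discussed above.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum StringFragment {
    Text(String),
    Interpolation(String),
}

/// Logical length of one fragment: text counts its bytes, an interpolation counts as 1.
fn fragment_len(fragment: &StringFragment) -> usize {
    match fragment {
        StringFragment::Text(text) => text.len(),
        StringFragment::Interpolation(_) => 1,
    }
}

/// Logical length of the whole format string.
fn total_len(fragments: &[StringFragment]) -> usize {
    fragments.iter().map(fragment_len).sum()
}

/// Map a logical index to (fragment index, offset inside that fragment)
/// with a linear scan; returns None if the index is out of range.
fn locate(fragments: &[StringFragment], index: usize) -> Option<(usize, usize)> {
    let mut start = 0;
    for (i, fragment) in fragments.iter().enumerate() {
        let len = fragment_len(fragment);
        if index < start + len {
            return Some((i, index - start));
        }
        start += len;
    }
    None
}
```

A slice operation could call `locate` for its start and end positions and then split the text fragments at the returned offsets; that part is left out here.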

casey · 2021-05-09T20:01:10Z

Just pushed a couple more commits removing the old unindent functions. I promise I'm done pushing to this branch for now.

kwshi · 2021-05-10T21:12:00Z

Thanks for the tips! I'll take a look & work on this in the next couple days.

kwshi · 2021-05-16T01:51:18Z

Running into a few uncertainties trying to implement Unindent for &[StringFragment]:

slice performance concerns. Since each chunk/fragment will have a variable length, indexing must be computed via a fold over the fragment lengths (e.g., for a chunk [Text("hello"), Interpolation(..), Text("hi")], trying to find the index i=7 requires checking fragment_start <= i && i < fragment_start + fragment.len() for each fragment with fragment_start accumulating over each visited fragment length).

This operation takes O(n) (cf. O(1) for plain &strs) time. Since unindent calls slice once for each line in the string, unindent's runtime cost would be O(n^2) on format strings. Granted, the constants are small--n is estimated relative to the number of fragments and lines, typically under 100--so the practical impact of this difference may be negligible. But still, I think it could be improved.
- One possible solution to this drawback is to pre-compute index positions (e.g., do a single, initial fold over the fragments to figure out which fragment each index maps to). The two ways I can think of doing so:
  1. Store a vector directly mapping each index in the range [0, len) (len is the total byte-length of the string, + 1 for each interpolation) to the corresponding position (fragment index, position within fragment for text fragments). Takes O(n) pre-computing, O(n) extra storage space; makes subsequent indexing/slicing operations O(1), and therefore unindent O(n), same as &str.
  2. Store a "boundaries" vector holding the starting index of each fragment ([0, 5, 6] in the [Text("hello"), Interpolation(..), Text("hi")] example). For each given index, binary-search the boundaries vector to determine which fragment to index into, and then index (subtracting offset) within the correct fragment. Takes O(n) pre-computing, O(n) extra storage (but much less--one usize for each fragment instead of for each byte); makes subsequent index/slice ops O(log n), and therefore unindent O(n log n).
  However, both of these options require introducing a wrapper struct to store the pre-computed information, so the trait would have to be implemented not on [StringFragment] directly but on the wrapper struct, making some of the function signatures a little awkward (e.g. slice and from_str returning &Self instead of Self).
- An alternative solution I'd like to propose is to introduce something like
```
enum InterpolationChar {
  Text(char), // or u8
  Interpolation(Expression),
}
```
  and implement Unindent on [InterpolationChar] instead--this way, there is a more obvious/direct analogy between &str as a slice of chars (or bytes) and format-strings as a slice of InterpolationChars: each element corresponds to one character, indexing on the string matches indexing on the slice itself, etc.
  
  I believe this implicitly incurs an extra O(n) storage (for marking the enum tag on each character), but I think the implementation becomes much more elegant/simple/straightforward this way--no more fancy piecewise finagling to compute index locations. What do you think about this?
join ownership. The join method signature (and some trait lifetime restrictions) requires it to return an owned type (Self::Output). In particular, given (possibly discontiguous) strings, it has to clone the contents of those strings to a separate location in order to return an owned type (in the str implementation, this is done in the .collect() call). This means that, in order to implement the analogous logic for StringFragment, StringFragment::Interpolation { expression: Expression } also needs to be clone-able. In particular, we will need to add #[derive(Clone)] to StringFragment and Expression. Are we... cool with that? (I'm relatively new to Rust, so I'm not sure what common design patterns are, I just have a vague urge to avoid allowing large things to be cloned here and there.)
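
As a rough sketch of the "boundaries" option (item 2 above), the start index of each fragment can be precomputed once and then binary-searched per lookup. The names `Boundaries`, `starts`, and `locate` are hypothetical, `StringFragment` is a simplified stand-in whose interpolation holds a `String` instead of an `Expression`, and lengths are byte lengths with each interpolation counting as 1.

```rust
/// Hypothetical, simplified stand-in for the fragment type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum StringFragment {
    Text(String),
    Interpolation(String),
}

/// Precomputed start index of every fragment, plus the total logical length.
struct Boundaries {
    starts: Vec<usize>,
    len: usize,
}

impl Boundaries {
    /// One O(n) pass over the fragments.
    fn new(fragments: &[StringFragment]) -> Self {
        let mut starts = Vec::with_capacity(fragments.len());
        let mut start = 0;
        for fragment in fragments {
            starts.push(start);
            start += match fragment {
                StringFragment::Text(text) => text.len(),
                StringFragment::Interpolation(_) => 1,
            };
        }
        Self { starts, len: start }
    }

    /// O(log n) lookup: the last fragment whose start is <= index.
    /// Returns (fragment index, offset inside that fragment).
    fn locate(&self, index: usize) -> Option<(usize, usize)> {
        if index >= self.len {
            return None;
        }
        // `starts` is sorted, so the predicate is true for a prefix of it.
        // Since starts[0] == 0 <= index, the partition point is at least 1.
        let fragment = self.starts.partition_point(|&start| start <= index) - 1;
        Some((fragment, index - self.starts[fragment]))
    }
}
```

For `[Text("hello"), Interpolation(..), Text("hi")]` this yields `starts == [0, 5, 6]`, matching the example above, and index 7 resolves to offset 1 inside the third fragment.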

casey · 2021-05-16T05:01:47Z

Good points!

I think that the issue with slicing being O(n) isn't something we should worry about. As you mentioned, N is likely to be small enough that even if our algorithm is terrible, it will still be fast enough that nobody will notice. Also, in general, I think it's a good idea to worry about being correct before worrying about being fast. After we have a correct and slow implementation with a bunch of tests, we can replace the correct and slow implementation with a correct and fast implementation.

You propose some good optimizations, but optimizations should always be led by profiling that shows that an operation is slow, needs to be optimized, and that the optimizations actually improve things.

That being said, if there's a cleaner or simpler implementation, then that is worth considering up front.

I think that InterpolationChar would be fairly large. An enum is the same size as its largest variant, plus the enum tag, and Expression is pretty large. It could be made smaller by doing enum InterpolationChar { Text(char), Interpolation(Box<Expression>) }. I would use char instead of u8, because the slicing logic is more complicated for u8: you would have to respect character boundaries and panic if an &[InterpolationChar] was sliced mid-character.
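
The size argument can be checked with `std::mem::size_of`. In this sketch `Expression` is a hypothetical 64-byte stand-in, not Just's real type, and the enum names `Inline` and `Boxed` are made up for the comparison; exact sizes depend on the compiler's layout choices, so only the relative sizes are meaningful.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a large `Expression` type (64 bytes of payload).
#[allow(dead_code)]
struct Expression {
    payload: [u64; 8],
}

/// Every element is at least as large as `Expression`, even for plain text.
#[allow(dead_code)]
enum Inline {
    Text(char),
    Interpolation(Expression),
}

/// The large payload lives on the heap; each element only holds a pointer.
#[allow(dead_code)]
enum Boxed {
    Text(char),
    Interpolation(Box<Expression>),
}

/// Returns the sizes of (Expression, Inline, Boxed) in bytes.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<Expression>(),
        size_of::<Inline>(),
        size_of::<Boxed>(),
    )
}
```

Calling `sizes()` and printing the result shows that the boxed variant is considerably smaller per element than the inline one, at the cost of one heap allocation per interpolation.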

I'm kind of ambivalent as to whether an Unindent implementation over &[InterpolationChar] would be simpler than over &[Fragment]. If the complexity is reasonable and confined to one or two functions, I'd be inclined to go with &[Fragment], since it avoids needing a new type. But since you have your hands on the code, I think you're in a better position to decide, I'm happy with both.

Regarding join ownership, adding #[derive(Clone)] and cloning things is fine. I think the same logic around optimization applies here too. The clones are likely to be so fast that nobody will ever notice they're there.

kwshi · 2021-06-22T03:06:06Z

Sorry I've left this to stagnate for a bit-- a bunch of other obligations have popped up in my life, leaving me with little leisure to work on this. I'm not abandoning this PR, but it might be another month or so until I'll be able to come back to it--if anyone else comes across this and wants to pick up the work, feel free! Otherwise, see you in a month.

casey · 2021-06-22T03:27:35Z

No worries at all, life happens! And the original issue for this feature was opened in 2016, so what's another month or few.

casey · 2023-01-16T05:34:29Z

This has gone pretty stale. Feel free to reopen!

casey and others added 11 commits April 25, 2021 18:00

Git rid of interpolation stack

02650b4

Mostly implement format strings in lexer

c1a909d

Test unterminated string and backtick errors

5635692

More work

745f477

Stuff

e055dbd

Note

08c185f

Fix compilation warnings and add TODOs

a465a2a

implement hacky mvp of format strings

9369ad4

Merge remote-tracking branch 'origin' into format-strings

620dc83

merge master (again)

910a633

remove commented draft code

236d2dd

casey added 5 commits May 8, 2021 17:59

Run clippy

52eac68

Fix more clippy lints

ef5e2b2

Fix src/summary.rs

3faca6d

Fix clippy

dce6d1a

Merge branch 'master' into format-strings

eafa7f0

casey reviewed May 9, 2021


casey added 5 commits May 8, 2021 21:13

Remove TODO

8bb8918

Add Unindent implementation for str

e4427a1

Use Self::Output instead of Self::ToOwned

c2266d1

Add unknown_variable_in_format_string_test

2af5024

Remove unnecessary --evaluate

d5e79c0

casey added 2 commits May 9, 2021 12:56

Use new generic unindent implementation

d365b13

Use other new Unindent functions

27170be

casey added this to In progress in Main via automation May 29, 2021

casey mentioned this pull request Jun 4, 2021

export keyword outside recipes not setting environment variable #854

Closed

casey mentioned this pull request Jul 13, 2021

Is there a way to access Just variables in a backtick command evaluation #905

Closed

casey mentioned this pull request Aug 27, 2021

if-else is not doing string substitution #954

Closed

casey mentioned this pull request Sep 17, 2021

argument using a just variable not showing #972

Closed

casey mentioned this pull request Oct 12, 2021

Inline just variable in backtick invocation #995

Closed

casey mentioned this pull request Oct 30, 2021

Variable expansion in strings #1012

Closed

funkyfuture mentioned this pull request Feb 2, 2022

Function Ideas #876

Closed

casey mentioned this pull request Jul 28, 2022

How to use variables in commands with backticks? #1292

Closed

casey mentioned this pull request Oct 9, 2022

feat: evaluate nested variables #1365

Closed

casey closed this Jan 16, 2023
