Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix formatting of strings containing control characters #1543

Merged
merged 1 commit into from
Nov 17, 2019

Conversation

sjakobi
Copy link
Collaborator

@sjakobi sjakobi commented Nov 15, 2019

If a string contains characters that must not appear in multi-line
strings, like control characters, it is always formatted with
double-quotes.

This fixes the formatting of the Prelude.Text.show example discovered
in dhall-lang/dhall-lang#822.

@@ -1,6 +1,6 @@
{ x =
''

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware that leading whitespace would be ignored by the parser when I originally added this testcase.

where
anyText predicate = any (\(text, _) -> Text.any predicate text) a || Text.any predicate b

-- If we format the text literal as a multi-line string, would any line
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-line Text literals do not preclude leading semantically-significant whitespace. For example:

''
 a
b
''

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think I was confused by the leadingTabs test case then.

Does the explanation on that test case simply mean that tabs are treated just like spaces for dedenting purposes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why does this typecheck?

assert :
''

 ''
===
"\n"
$ dhall --file spaces.dhall 
assert : "\n" ≡ "\n"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjakobi: Yes, dedenting will strip the longest shared prefix of tabs and/or spaces for each line

The reason that example type-checks is because leading tabs/spaces before the opening '' do not count for dedenting purposes, so the dedenter only sees two line prefixes, both of which begin with two spaces, so those two-space prefixes are stripped, leaving behind only a newline.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't get it. Maybe this clearer:

$ dhall format
''
 ''
<Ctrl-D>
""

There's no whitespace before the opening '', so the leading space before the closing '' shouldn't actually be subtracted, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjakobi: The whitespace before the opening '' is ignored for the purposes of computing the longest shared prefix. So the only whitespace prefix is the one right before the closing '' and that prefix is stripped because there are no other whitespaces prefixes to compare to, therefore it must be the longest whitespace prefix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Gabriel439. I had never read that document and somehow forgot that it existed at all. Maybe we should add a link to it from the grammar?

So if I understood correctly we must only avoid multi-line formatting if the last line has leading whitespace. Then this would indeed be a bug:

$ echo '"\n "' | dhall format
''

 ''

@sjakobi sjakobi removed the merge me label Nov 15, 2019
@Gabriella439
Copy link
Collaborator

@sjakobi: I think we need to step back a second and review the original failure you ran into on the branch

The issue you ran into was that a multi-line string literal with an unescaped null byte failed to parse. So before we even get into the question of how to format string literals, we need to first answer the question: "According to the standard, are unescaped null bytes legal within any string literal?"

The answer is: "no". I think we both agree on that, but I'm just going through the steps pedantically here.

The next question is: "Was the original input expression syntactically legal?"

The answer is: "yes". The null byte was escaped in the original string literal.

The next question is: "At what stage did things first go wrong?". Specifically, there are four possible relevant places we could have introduced error:

  1. When ./scripts/lint-prelude parsed the original unformatted code
  2. When ./scripts/lint-prelude pretty-printed the formatted code
  3. When CI parsed the formatted code that you committed to your branch

In this case the answer is (2), when ./scripts/lint-prelude pretty-printed the formatted code because we can see that the formatted code you checked into your branch is not syntactically valid. We would technically need to look at the intermediate Haskell syntax tree to rule out (1) but I'm reasonably certain that is not where the error was introduced.

So the next question is: "What went wrong when pretty-printing the code?"

The two relevant places were things could have gone wrong:

  1. When selecting whether to render as a multi-line string literal or an ordinary double-quoted string literal
  2. When escaping special characters

The answer should never be (1). This is because the decision to format a string literal as a multi-line string literal should never be semantically significant and should only be a cosmetic formatting choice. Every Text should be representable using either normal or multi-line Text literals and inability to do so is a standard bug in my opinion.

So that leaves (2): the null byte was not properly escaped when formatting the multi-line string literal. That does not necessarily imply that we should have used a multi-line string literal but the decision about whether to use a multi-line string literal is orthogonal to fixing this bug.

That leads to the next question: "Was the absence of a null byte escape due to an issue with the standard or an issue with the Haskell implementation?"

The answer is that it's not a bug in the standard, because the standard does not specify how to render string literals. It only specifies how to desugar multi-line string literals to ordinary string literals when parsing.

So the bug that needs to be fixed here is that the Haskell implementation is not escaping special characters like a null byte when rendering string literals.

@sjakobi
Copy link
Collaborator Author

sjakobi commented Nov 15, 2019

Right! The first commit in this PR is indeed all about the bug discovered in dhall-lang/dhall-lang#822 and AFAIK fixes it.

I should have clarified that the second commit is in response to a test failure I saw while running the formatting idempotence test to check the correctness of the first commit. Sorry for not making that clearer.

Nevertheless we do have a bug where we incorrectly choose a multi-line layout for strings that have leading whitespace before the closing ''. For clarity's sake, I'll move the fix for that to a follow-up PR.

@sjakobi sjakobi changed the title Fixes for text literal formatting Fix formatting of strings containing control characters Nov 15, 2019
@Gabriella439
Copy link
Collaborator

I still think the bug isn't fully fixed, though. This is masking the bug by selecting the multi-line string literal representation, but if somebody later makes a change to select a multi-line string literal then it will unmask the same bug. I think properly fixing this requires escaping the null byte for multi-line string literals.

The other reason I suggest this is because I think we should be using the multi-line string literal representation if there is a newline and a null byte in the string

@sjakobi
Copy link
Collaborator Author

sjakobi commented Nov 15, 2019

I think properly fixing this requires escaping the null byte for multi-line string literals.

Oh, I didn't know this was possible at all! What does it look like?

@Gabriella439
Copy link
Collaborator

@sjakobi: It's awkward, but it's possible to escape an arbitrary byte using ${"\u…"}

@@ -0,0 +1,3 @@
''
${"\u0000"} $ \
☺''
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit funny that this triggers #1545 now! ;)

Should I add the fix for this to this PR or should I keep it separate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1550.

-- >>> splitOnPredicate (== 'x') "xx"
-- ([("","xx")],"")
--
-- prop> \(Fun _ p) s -> let {t = Text.pack s; (as, b) = splitOnPredicate p t} in foldMap (uncurry (<>)) as <> b == t
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately doctest doesn't support multiline properties yet: sol/doctest#131

@sjakobi
Copy link
Collaborator Author

sjakobi commented Nov 16, 2019

BTW @amarrella, your comment above looks like it ended up on the wrong issue!

@amarrella
Copy link

Whoops, deleting it since it's not relevant, thanks @sjakobi :)

Since control characters like '\b' cannot be escaped directly in
multi-line strings, we move them into string interpolations. E.g.

    "\n\b"

is formatted as

    ''

    ${"\b"}''

This partially addresses the malformatting of the Prelude.Text.show
example discovered in dhall-lang/dhall-lang#822.

This sample string is now malformatted due to #1545 though.
@sjakobi sjakobi merged commit 145b7b8 into master Nov 17, 2019
@sjakobi sjakobi deleted the sjakobi/fix-prettyChunks branch November 17, 2019 14:54
sjakobi added a commit to dhall-lang/dhall-lang that referenced this pull request Nov 17, 2019
This changes the formatting of Dhall source files affected
by dhall-lang/dhall-haskell#183
or dhall-lang/dhall-haskell#1400.

The change in Prelude/Text/show is due to
dhall-lang/dhall-haskell#1508, but also relies the bug fixes
dhall-lang/dhall-haskell/pull/1543 and
dhall-lang/dhall-haskell/pull/1550.
sjakobi added a commit to dhall-lang/dhall-lang that referenced this pull request Nov 23, 2019
This changes the formatting of Dhall source files affected
by dhall-lang/dhall-haskell#183
or dhall-lang/dhall-haskell#1400.

The change in Prelude/Text/show is due to
dhall-lang/dhall-haskell#1508, but also relies the bug fixes
dhall-lang/dhall-haskell/pull/1543 and
dhall-lang/dhall-haskell/pull/1550.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants