Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode leads to incorrect indentation #2945

Open
4 tasks
pbiggar opened this issue Aug 8, 2023 · 6 comments
Open
4 tasks

Unicode leads to incorrect indentation #2945

pbiggar opened this issue Aug 8, 2023 · 6 comments

Comments

@pbiggar
Copy link
Contributor

pbiggar commented Aug 8, 2023

Issue created from fantomas-online

Code

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"; c "a"; c "lͮ̒ͫ"; c "g"; c "o" ]

Result

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"
                                                                             c "a"
                                                                             c "lͮ̒ͫ"
                                                                             c "g"
                                                                             c "o" ]

Problem description

This is formatted weiiiiiird! Though the code is valid still.

FYI, we have lots of F# unicode edgecases in the darklang codebase, esp in string.dark

Extra information

  • The formatted result breaks my code.
  • The formatted result gives compiler warnings.
  • I or my company would be willing to help fix this.
  • I would like a release if this problem is solved.

Options

Fantomas main branch at 1/1/1990

    { config with
                IndentSize = 2 }
@nojaf
Copy link
Contributor

nojaf commented Aug 16, 2023

Wow, this one is quite the sight 🙈.
I don't have any immediate idea what could be going wrong here.
We normally just copy the string straight from the source text. So I didn't expect unicode strings to matter in this case.
String.toList "Zzzzzzzzzzzz" = [ c "Z̤͔ͧ̑̓"; c "ä͖̭̈̇"; c "lͮ̒ͫ"; c "ǧ̗͚̚"; c "o̙̔ͮ̇͐̇" ] didn't reproduce it for me, so I'm not sure what to make of this one.

@pbiggar
Copy link
Contributor Author

pbiggar commented Aug 16, 2023

What I imagine is happening here is that we're using the Length of a string to determine indentation, as opposed to the number of Extended Grapheme Clusters (which correspond to visual on-screen characters).

I checked to see how we do this in Darklang and found this:

System.Globalization.StringInfo(s).LengthInTextElements

@nojaf
Copy link
Contributor

nojaf commented Aug 16, 2023

That is an interesting pointer, thanks! I'll try and follow-up on this.

@nojaf
Copy link
Contributor

nojaf commented Aug 17, 2023

Hmm, the range of "ä͖̭̈̇" in c "ä͖̭̈̇"; a seems to be wrong. (online tool)

SynExpr.Const(constant = SynConst.String(text = "ä͖̭̈̇", synStringKind = SynStringKind.Regular, range = R("(1,2--1,10)")), range = R("(1,2--1,10)"))

That is definitely part of the problem and would need a fix on the compiler side.

Most likely lhs parseState in doesn't do unicode well.
https://github.com/dotnet/fsharp/blob/681069fbebcdff312645e61e4970a6dd403ff0ee/src/Compiler/pars.fsy#L3289-L3291

EDIT: Maybe not. Not sure what to make of that.

@pbiggar
Copy link
Contributor Author

pbiggar commented Aug 18, 2023

I guess the range is using the number of bytes, and I'm guessing this is an 8 byte unicode character. Fantomas clearly needs to use unicode length, but I don't know about the compiler. Is the compiler's range field supposed to be length in bytes or length on screen? If it's trying to do error reporting, it might be length on screen (in which case it's incorrect here).

Could fantomas use the text to get the EGC length instead of using range? I'm guessing that would fix it (though it might miss other cases like eg Match Patterns).

@nojaf
Copy link
Contributor

nojaf commented Aug 18, 2023

Is the compiler's range field supposed to be length in bytes or length on screen?

I believe that will be the length on the screen.

Could fantomas use the text to get the EGC length instead of using range?

Possibly, so Fantomas doesn't use the string value that is stored in the AST because it is an optimized representation of the value. There are multiple scenarios where this is beneficial.

When we grab the string from the source text we probably need to do something more clever when there is unicode involved in https://github.com/fsprojects/fantomas/blob/e671f3d7c68a258d80f6440ea82aaada2c48a34d/src/Fantomas.Core/ISourceTextExtensions.fs

We can detect the difference between EGC and range:
image

in

let mkConstString (creationAide: CreationAide) (stringKind: SynStringKind) (value: string) (range: range) =
let escaped = Regex.Replace(value, "\"{1}", "\\\"")
let fallback () =
match stringKind with
| SynStringKind.Regular -> sprintf "\"%s\"" escaped
| SynStringKind.Verbatim -> sprintf "@\"%s\"" escaped
| SynStringKind.TripleQuote -> sprintf "\"\"\"%s\"\"\"" escaped

I'm just not really sure, how to extract the right thing from the ISourceText.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants