Unicode leads to incorrect indentation #2945

pbiggar · 2023-08-08T02:49:30Z

Code

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"; c "a"; c "lͮ̒ͫ"; c "g"; c "o" ]

Result

String.toList "Zalͮ̒ͫgoZalͮ̒ͫgo" = [ c "Z"
                                                                             c "a"
                                                                             c "lͮ̒ͫ"
                                                                             c "g"
                                                                             c "o" ]

Problem description

This is formatted weiiiiiird! Though the code is valid still.

FYI, we have lots of F# Unicode edge cases in the Darklang codebase, especially in string.dark.

Extra information

The formatted result breaks my code.
The formatted result gives compiler warnings.
I or my company would be willing to help fix this.
I would like a release if this problem is solved.

Options

Fantomas main branch at 1/1/1990

    { config with
                IndentSize = 2 }

nojaf · 2023-08-16T10:14:32Z

Wow, this one is quite the sight 🙈.
I don't have any immediate idea what could be going wrong here.
We normally just copy the string straight from the source text. So I didn't expect unicode strings to matter in this case.
String.toList "Zzzzzzzzzzzz" = [ c "Z̤͔ͧ̑̓"; c "ä͖̭̈̇"; c "lͮ̒ͫ"; c "ǧ̗͚̚"; c "o̙̔ͮ̇͐̇" ] didn't reproduce it for me, so I'm not sure what to make of this one.

pbiggar · 2023-08-16T14:09:25Z

What I imagine is happening here is that we're using the Length of a string to determine indentation, as opposed to the number of Extended Grapheme Clusters (which correspond to visual on-screen characters).

I checked to see how we do this in Darklang and found this:

System.Globalization.StringInfo(s).LengthInTextElements
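
To make the difference concrete, here is a minimal sketch (not taken from Fantomas or Darklang) that compares the two counts for the "l" with three combining marks from the report. The code points in the escape sequences are an assumption about what the sample string contains.

```fsharp
open System.Globalization

// "l" followed by three combining marks (assumed to be U+036E, U+0312, U+036B):
// one character on screen, but four UTF-16 code units.
let zalgoL = "l\u036E\u0312\u036B"

// String.Length counts UTF-16 code units.
let codeUnits = zalgoL.Length

// LengthInTextElements counts text elements (a base character plus its combining marks).
let textElements = StringInfo(zalgoL).LengthInTextElements

printfn "Length = %d, LengthInTextElements = %d" codeUnits textElements
// prints "Length = 4, LengthInTextElements = 1"
```

Any column computed from Length would therefore be three too large for every such character on the line.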

nojaf · 2023-08-16T14:54:53Z

That is an interesting pointer, thanks! I'll try and follow-up on this.

nojaf · 2023-08-17T07:04:51Z

Hmm, the range of "ä͖̭̈̇" in c "ä͖̭̈̇"; a seems to be wrong. (online tool)

SynExpr.Const(constant = SynConst.String(text = "ä͖̭̈̇", synStringKind = SynStringKind.Regular, range = R("(1,2--1,10)")), range = R("(1,2--1,10)"))

That is definitely part of the problem and would need a fix on the compiler side.

Most likely lhs parseState in the parser rule linked below doesn't handle Unicode well.
https://github.com/dotnet/fsharp/blob/681069fbebcdff312645e61e4970a6dd403ff0ee/src/Compiler/pars.fsy#L3289-L3291

EDIT: Maybe not. Not sure what to make of that.

pbiggar · 2023-08-18T05:07:30Z

I guess the range is using the number of bytes, and I'm guessing this is an 8 byte unicode character. Fantomas clearly needs to use unicode length, but I don't know about the compiler. Is the compiler's range field supposed to be length in bytes or length on screen? If it's trying to do error reporting, it might be length on screen (in which case it's incorrect here).

Could fantomas use the text to get the EGC length instead of using range? I'm guessing that would fix it (though it might miss other cases like eg Match Patterns).
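
As a rough sketch of that idea: `visualWidth` below is a hypothetical helper, not an existing Fantomas function, and the prefix is a shortened stand-in for the line in the report. It only shows how a column derived from text elements differs from one derived from Length. Text elements are not the same as display cells (wide East Asian characters, for example, still count as one), so this is an approximation.

```fsharp
open System.Globalization

// Hypothetical helper (not part of Fantomas): width of a piece of source text
// counted in text elements instead of UTF-16 code units.
let visualWidth (text: string) : int =
    StringInfo(text).LengthInTextElements

// Everything up to and including "[ ", i.e. the column the list items align to.
// The "l" carries three combining marks (assumed code points).
let prefix = "String.toList \"Zal\u036E\u0312\u036Bgo\" = [ "

let columnByLength = prefix.Length
let columnByTextElements = visualWidth prefix

printfn "by Length: %d, by text elements: %d" columnByLength columnByTextElements
// The two columns differ by 3, one for each combining mark.
```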

nojaf · 2023-08-18T06:55:09Z

> Is the compiler's range field supposed to be length in bytes or length on screen?

I believe that will be the length on the screen.

> Could fantomas use the text to get the EGC length instead of using range?

Possibly. Fantomas doesn't use the string value that is stored in the AST, because that is an optimized representation of the value. There are multiple scenarios where taking the text from the source instead is beneficial.

When we grab the string from the source text we probably need to do something more clever when there is unicode involved in https://github.com/fsprojects/fantomas/blob/e671f3d7c68a258d80f6440ea82aaada2c48a34d/src/Fantomas.Core/ISourceTextExtensions.fs

We can detect the difference between EGC and range in fantomas/src/Fantomas.Core/ASTTransformer.fs (lines 87 to 94 in 0ce91b7):

    
    let mkConstString (creationAide: CreationAide) (stringKind: SynStringKind) (value: string) (range: range) =
        let escaped = Regex.Replace(value, "\"{1}", "\\\"")
        let fallback () =
            match stringKind with
            | SynStringKind.Regular -> sprintf "\"%s\"" escaped
            | SynStringKind.Verbatim -> sprintf "@\"%s\"" escaped
            | SynStringKind.TripleQuote -> sprintf "\"\"\"%s\"\"\"" escaped

I'm just not really sure how to extract the right thing from the ISourceText.

nojaf added bug (soundness) bug (stylistic) and removed bug (soundness) labels Aug 16, 2023
