Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formatter: Unicode width is too high #6499

Closed
Tracked by #6069
konstin opened this issue Aug 11, 2023 · 5 comments
Closed
Tracked by #6069

Formatter: Unicode width is too high #6499

konstin opened this issue Aug 11, 2023 · 5 comments
Assignees
Labels
formatter Related to the formatter needs-decision Awaiting a decision from a maintainer

Comments

@konstin
Copy link
Member

konstin commented Aug 11, 2023

The following snippet fits visually in pycharm and is kept in this layout by black

function_call(
    "aaaaaaaaaaaaaaaaaaaa", "我隻氣墊船裝滿晒鱔.txt", "मेरी मँडराने वाली नाव सर्पमीनों से भरी ह"
)

ruff formats it as

function_call(
    "aaaaaaaaaaaaaaaaaaaa",
    "我隻氣墊船裝滿晒鱔.txt",
    "मेरी मँडराने वाली नाव सर्पमीनों से भरी ह",
)

We likely compute the unicode width of the string too high so that we assume the line is too long

@konstin konstin added bug Something isn't working formatter Related to the formatter labels Aug 11, 2023
@MichaReiser
Copy link
Member

How did you measure the width in Pycharm? I understand that Pycharms cursor position (row:column) is character based and not width based.

@konstin
Copy link
Member Author

konstin commented Aug 11, 2023

Really just visually:

Untitled

The chinese characters line up with two latin characters, the hindi doesn't.

I don't know by which rules which tool measures, but we should try to match what black does

@MichaReiser
Copy link
Member

MichaReiser commented Aug 11, 2023

Black migrates to use unicode-width in preview style, but only for strings (not identifiers).

Edit: But we should look into this.

@konstin
Copy link
Member Author

konstin commented Aug 11, 2023

black does also break in preview mode! I guess it makes more sense by the unicode text width algorithm but it's also odd because it's different from the actual text rendering

@MichaReiser MichaReiser added this to the Formatter: Alpha milestone Aug 17, 2023
@konstin konstin added needs-decision Awaiting a decision from a maintainer and removed bug Something isn't working labels Aug 21, 2023
@konstin konstin assigned MichaReiser and unassigned konstin Aug 22, 2023
@MichaReiser
Copy link
Member

MichaReiser commented Aug 22, 2023

@konstin and I talked about this and we must admit, we're confused. I will close this issue because I believe we're doing the right thing. Please open a new issue and link this issue if you're more familiar with CJK characters and/or have reasons to believe that the implementation is incorrect. We'll hopefully be able to figure out the correct behavior together. Reasons for closing:

  • According to @konstin. Our implementation matches black preview style, which is good
  • The Unicode spec recommends to tread the ambiguous characters as half-width when displaying but the context cannot be inferred (e.g. by the used font).

    Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

  • The formatter uses width which treads ambiguous characters as half-width.

I also went ahead and pasted the example into my IDE and it seems it exceeds the column width for my configuration (the xx... is exactly 88 characters wide)
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
formatter Related to the formatter needs-decision Awaiting a decision from a maintainer
Projects
None yet
Development

No branches or pull requests

2 participants