challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

faultyserver · 2023-11-17T09:16:47Z

See the end of this PR description for a note on reviewing this PR. The GitHub file viewer will not like it if you try to open it as a whole...

Summary

About a dozen of the tests in the prettier suite use files that include a Unicode BOM character. Prettier's support for BOM characters is simply reading a file, stripping the BOM if it sees one, then doing all the work, and finally re-adding the BOM at the start of the result if it found one earlier. That works, but isn't easily adaptable into Biome as a more comprehensive tool.

BOM characters are a way of indicating which encoding a file is using, but it's also not a perfect mechanism. So much so that Unicode has generally stopped recommending the use of BOM characters unless really necessary, which is almost never. In UTF-8 files, the BOM generally has no meaning anyway, since byte ordering does not matter. Windows (and specifically Windows text editors like Visual Studio) however, for a variety of reasons, will often encode UTF-8 with a BOM, because historically, Windows used UTF-16LE, so it tries to disambiguate.

The presence of a BOM character generally has no meaning and no effect on tools, other than occasionally breaking them when the tool doesn't know that a BOM character might show up at the start of a file. But for exactly that reason (to avoid breaking on accident), it is worth respecting the presence of a BOM character, both when parsing and when formatting.

When parsing, the BOM can be read and stored alongside the parse tree. Then, when formatting, whatever BOM is included with the tree gets formatted back out at the start of the result. This ensures that BOMs are preserved when given, and not added or removed unintentionally.

A fun tangent in my explorations of this revealed that JSON explicitly forbids BOM characters (when transmitting JSON over the network). This would imply that, as a technicality, allowing a BOM character while parsing JSON would be invalid. However, again for the sake of consistency with other tools and to avoid breaking unintentionally (if, say, a Windows users creates a JSON file that inadvertently gets a BOM character when saved to a file), I think it makes sense to respect the BOM and just let it pass by like the other languages.

Approach

It took me a while to figure out how it would even be feasible to handle BOMs across all languages and ways that text could be sent to Biome to be parsed, linted, and formatted. Originally, I wanted to do it as abstractly as possible; in one place, high up when first reading text from a file. But, this quickly seemed like a rough idea, especially since the formatting methods for creating format results don't actually operate on files, but purely on a parse result. I thought about adding it as state in the Lexer as well, just seeing if the stream starts with a BOM, and then other components can use that state, but that also didn't seem feasible since the Lexer is really an internal detail of parsing, and formatters don't directly know about the Lexer right now.

So the BOM data needs to be passed along with the tree no matter what, which led to adding it as a token in the root nodes of each language tree. This has the unfortunate side effect that every parse will have a slot for the BOM token, which will almost always be empty, but that's also true with the interpreter token right now, so I felt it was an acceptable means to the end here.

If there is an easier, less intrusive way of going about this, I'm all for it lol.

One other benefit of parsing a BOM token and including it in the AST is that Biome can also use that in lint rules to warn against the usage of BOMs. For example, it could raise an error in JSON files that have a BOM token, since that is almost always going to be invalid for various tools. Also, if any BOM other than plain UTF-8 is encountered, it could warn that the encoding doesn't match what Biome expects.

Test Plan

To review this PR, I've split it up into a few specific commits. The first commit should contain all of the logic changes. Everything with adding the unicode_bom crate to each parser, implementing the UNICODE_BOM token kind, lexing it, storing it in the parse nodes, and writing them out in the formatter from the root nodes of each language. This is really the only interesting thing in this PR, and it's much smaller.

The second and third commits are just updating the .rast test snapshots for all of the spec tests to include the bom_token slot on each node.

The fourth commit is the removal of the prettier test snapshots for the tests that are now passing completely. See the last commit for all of the removed tests. They are:

That these tests now pass (and all of the others as well), should indicate that BOMs are being respected in files where they are present and not affecting parsing or formatting unintentionally.

ematipico

The solution looks good to me! Thank you very much for the thorough explanation! I am sure people learned a lot from this PR (included myself)

crates/biome_css_parser/Cargo.toml

crates/biome_css_parser/src/lexer/mod.rs

Conaclos

This is a valuable contribution!

parse and format BOM in json format bom token in json re-add unicode_bom in json parse and format BOM in js parse BOM in CSS track exact bom length in lexer state

fix json bom spec test

update CSS parser tests with BOM update js parser tests with bom token add bom tests for js_formatter

faultyserver · 2023-11-17T21:49:40Z

Sorry for the many pushes and build notifications, but I think this has addressed the feedback, fixed the tests, and cleaned up some unused changes i'd made originally before going in the final direction here. I think it's set to go now.

github-actions bot added A-Core Area: core A-Parser Area: parser A-Formatter Area: formatter A-Tooling Area: internal tools L-JavaScript Language: JavaScript and super languages L-CSS Language: CSS L-JSON Language: JSON and super languages labels Nov 17, 2023

faultyserver mentioned this pull request Nov 17, 2023

☂️ Prettier code challenge hub #720

Closed

ematipico approved these changes Nov 17, 2023

View reviewed changes

crates/biome_css_parser/Cargo.toml Outdated Show resolved Hide resolved

crates/biome_css_parser/src/lexer/mod.rs Show resolved Hide resolved

Conaclos self-requested a review November 17, 2023 16:45

Conaclos approved these changes Nov 17, 2023

View reviewed changes

github-actions bot added the A-LSP Area: language server protocol label Nov 17, 2023

faultyserver added 6 commits November 17, 2023 18:25

parse and format BOM tokens in all languages

06f9637

parse and format BOM in json format bom token in json re-add unicode_bom in json parse and format BOM in js parse BOM in CSS track exact bom length in lexer state

update json tests with BOM

9b8d282

fix json bom spec test

update test snapshots for BOM tokens in parse results

50725df

update CSS parser tests with BOM update js parser tests with bom token add bom tests for js_formatter

update prettier snapshots for BOM

52c1bdb

fix document_lifecycle test and other un-updated .rasts

170a075

use single workspace version for unicode-bom

0f76090

faultyserver force-pushed the challenge/bom branch from 040faa9 to 0f76090 Compare November 17, 2023 18:25

faultyserver added 4 commits November 17, 2023 18:39

update css snaps with bom token

8129cc3

fix cargo.toml formatting

5efde11

remove original attempt at BOM as trivia

96b1af0

add explanation for slicing bytes for BOM

7e1f97a

github-actions bot removed the A-Core Area: core label Nov 17, 2023

ematipico merged commit 3aa88e8 into biomejs:main Nov 17, 2023
13 checks passed

faultyserver deleted the challenge/bom branch November 17, 2023 22:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

faultyserver commented Nov 17, 2023

ematipico left a comment

Conaclos left a comment

faultyserver commented Nov 17, 2023

challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

Conversation

faultyserver commented Nov 17, 2023

Summary

Approach

Test Plan

ematipico left a comment

Choose a reason for hiding this comment

Conaclos left a comment

Choose a reason for hiding this comment

faultyserver commented Nov 17, 2023