Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

challenge(formatter): (also parser) Respect, parse, and format BOM characters from source files #750

Merged
merged 10 commits into from Nov 17, 2023

Conversation

faultyserver
Copy link
Contributor

See the end of this PR description for a note on reviewing this PR. The GitHub file viewer will not like it if you try to open it as a whole...

Summary

About a dozen of the tests in the prettier suite use files that include a Unicode BOM character. Prettier's support for BOM characters is simply reading a file, stripping the BOM if it sees one, then doing all the work, and finally re-adding the BOM at the start of the result if it found one earlier. That works, but isn't easily adaptable into Biome as a more comprehensive tool.

BOM characters are a way of indicating which encoding a file is using, but it's also not a perfect mechanism. So much so that Unicode has generally stopped recommending the use of BOM characters unless really necessary, which is almost never. In UTF-8 files, the BOM generally has no meaning anyway, since byte ordering does not matter. Windows (and specifically Windows text editors like Visual Studio) however, for a variety of reasons, will often encode UTF-8 with a BOM, because historically, Windows used UTF-16LE, so it tries to disambiguate.

The presence of a BOM character generally has no meaning and no effect on tools, other than occasionally breaking them when the tool doesn't know that a BOM character might show up at the start of a file. But for exactly that reason (to avoid breaking on accident), it is worth respecting the presence of a BOM character, both when parsing and when formatting.

When parsing, the BOM can be read and stored alongside the parse tree. Then, when formatting, whatever BOM is included with the tree gets formatted back out at the start of the result. This ensures that BOMs are preserved when given, and not added or removed unintentionally.

A fun tangent in my explorations of this revealed that JSON explicitly forbids BOM characters (when transmitting JSON over the network). This would imply that, as a technicality, allowing a BOM character while parsing JSON would be invalid. However, again for the sake of consistency with other tools and to avoid breaking unintentionally (if, say, a Windows users creates a JSON file that inadvertently gets a BOM character when saved to a file), I think it makes sense to respect the BOM and just let it pass by like the other languages.

Approach

It took me a while to figure out how it would even be feasible to handle BOMs across all languages and ways that text could be sent to Biome to be parsed, linted, and formatted. Originally, I wanted to do it as abstractly as possible; in one place, high up when first reading text from a file. But, this quickly seemed like a rough idea, especially since the formatting methods for creating format results don't actually operate on files, but purely on a parse result. I thought about adding it as state in the Lexer as well, just seeing if the stream starts with a BOM, and then other components can use that state, but that also didn't seem feasible since the Lexer is really an internal detail of parsing, and formatters don't directly know about the Lexer right now.

So the BOM data needs to be passed along with the tree no matter what, which led to adding it as a token in the root nodes of each language tree. This has the unfortunate side effect that every parse will have a slot for the BOM token, which will almost always be empty, but that's also true with the interpreter token right now, so I felt it was an acceptable means to the end here.

If there is an easier, less intrusive way of going about this, I'm all for it lol.

One other benefit of parsing a BOM token and including it in the AST is that Biome can also use that in lint rules to warn against the usage of BOMs. For example, it could raise an error in JSON files that have a BOM token, since that is almost always going to be invalid for various tools. Also, if any BOM other than plain UTF-8 is encountered, it could warn that the encoding doesn't match what Biome expects.

Test Plan

To review this PR, I've split it up into a few specific commits. The first commit should contain all of the logic changes. Everything with adding the unicode_bom crate to each parser, implementing the UNICODE_BOM token kind, lexing it, storing it in the parse nodes, and writing them out in the formatter from the root nodes of each language. This is really the only interesting thing in this PR, and it's much smaller.

The second and third commits are just updating the .rast test snapshots for all of the spec tests to include the bom_token slot on each node.

The fourth commit is the removal of the prettier test snapshots for the tests that are now passing completely. See the last commit for all of the removed tests. They are:

That these tests now pass (and all of the others as well), should indicate that BOMs are being respected in files where they are present and not affecting parsing or formatting unintentionally.

@github-actions github-actions bot added A-Core Area: core A-Parser Area: parser A-Formatter Area: formatter A-Tooling Area: internal tools L-JavaScript Language: JavaScript and super languages L-CSS Language: CSS L-JSON Language: JSON and super languages labels Nov 17, 2023
Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution looks good to me! Thank you very much for the thorough explanation! I am sure people learned a lot from this PR (included myself)

crates/biome_css_parser/Cargo.toml Outdated Show resolved Hide resolved
crates/biome_css_parser/src/lexer/mod.rs Show resolved Hide resolved
@Conaclos Conaclos self-requested a review November 17, 2023 16:45
Copy link
Member

@Conaclos Conaclos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valuable contribution!

@github-actions github-actions bot added the A-LSP Area: language server protocol label Nov 17, 2023
parse and format BOM in json

format bom token in json

re-add unicode_bom in json

parse and format BOM in js

parse BOM in CSS

track exact bom length in lexer state
fix json bom spec test
update CSS parser tests with BOM

update js parser tests with bom token

add bom tests for js_formatter
@github-actions github-actions bot removed the A-Core Area: core label Nov 17, 2023
@faultyserver
Copy link
Contributor Author

Sorry for the many pushes and build notifications, but I think this has addressed the feedback, fixed the tests, and cleaned up some unused changes i'd made originally before going in the final direction here. I think it's set to go now.

@ematipico ematipico merged commit 3aa88e8 into biomejs:main Nov 17, 2023
13 checks passed
@faultyserver faultyserver deleted the challenge/bom branch November 17, 2023 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Formatter Area: formatter A-LSP Area: language server protocol A-Parser Area: parser A-Tooling Area: internal tools L-CSS Language: CSS L-JavaScript Language: JavaScript and super languages L-JSON Language: JSON and super languages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants