[Parse] Improve Lexer's UTF-8 BOM handling #13483
Merged
The current `Lexer` has a function to skip a UTF-8 BOM, but its behavior has some bugs.
`isAtStartOfLine`, but is not with a BOM.
All these bugs come from the `Lexer` implementation, in which `BufferStart` is regarded as the beginning of the source code even if a BOM was skipped. In a text file with a BOM, the end of the BOM should be regarded as the beginning of the text content.
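To make the idea concrete, here is a minimal, self-contained sketch of treating the end of the BOM as the start of the text content. It is not the actual `swift::Lexer` code: `MiniLexer`, `bomLength`, and `startsWithShebang` are hypothetical names invented for illustration, and only `BufferStart`, `ContentStart`, and `isAtStartOfLine` echo names used in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical miniature lexer state; NOT the real swift::Lexer.
struct MiniLexer {
  const char *BufferStart;  // first byte of the file
  const char *BufferEnd;    // one past the last byte of the file
  const char *ContentStart; // first byte of text content (after any BOM)

  explicit MiniLexer(std::string_view Buffer)
      : BufferStart(Buffer.data()),
        BufferEnd(Buffer.data() + Buffer.size()),
        ContentStart(Buffer.data() + bomLength(Buffer)) {}

  // Returns 3 if the buffer begins with the UTF-8 BOM (EF BB BF), else 0.
  static std::size_t bomLength(std::string_view Buffer) {
    constexpr std::string_view BOM = "\xEF\xBB\xBF";
    return Buffer.substr(0, BOM.size()) == BOM ? BOM.size() : 0;
  }

  // A position starts a line if it is the start of the content
  // (not the start of the buffer) or directly follows a newline.
  bool isAtStartOfLine(const char *Ptr) const {
    if (Ptr == ContentStart)
      return true;
    return Ptr > ContentStart && (Ptr[-1] == '\n' || Ptr[-1] == '\r');
  }

  // A shebang is looked for at the start of the content, so a leading
  // BOM does not hide it.
  bool startsWithShebang() const {
    return BufferEnd - ContentStart >= 2 && ContentStart[0] == '#' &&
           ContentStart[1] == '!';
  }
};
```

With this shape, a buffer such as `"\xEF\xBB\xBF#!/usr/bin/swift\n"` and the same text without a BOM give the same answers for the start-of-line and shebang checks, because both checks are anchored at `ContentStart` rather than `BufferStart`.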
This PR fixes these issues.
The design is described below.
In a text file, the beginning of the text content is ordinarily the beginning of the file.
But if there is a BOM, it is the end of the BOM.
So, to represent the beginning of the text content, a `ContentStart` field is added to `Lexer`.

Note
Two bugs involving BOM + shebang are in the skipping logic of both libParse and libSyntax (`Lexer::lexTrivia`). This PR only fixes libParse and adds test cases for it.
I think that libSyntax should not skip the BOM and should keep it as LeadingTrivia.
But implementing this idea needs more changes, so I plan to submit it as another PR in the future.
No existing reports of these issues were found on bugs.swift.org.