Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parse] Improve Lexer's UTF-8 BOM handling #13483

Merged
merged 2 commits into from
Dec 18, 2017

Conversation

omochi
Copy link
Collaborator

@omochi omochi commented Dec 16, 2017

Current Lexer has a function to skip UTF-8 BOM.
But its behavior has some bugs.

  • It skips shebang at the beginning of file, but does not with BOM.
  • A token should be prefix operator at the beginning of file, but is become infix operator with BOM.
  • It skips conflict marker, but does not with BOM.
  • A token at the beginning of file should be isAtStartOfLine, but is not with BOM.

All these bugs are come from implementation of Lexer that BufferStart is regarded as the beginning of source code even if BOM was skipped.
In text file with BOM, the end of BOM should be regarded as the beginnig of text content.

This PR fix these issues.

A design is shown below.

In text file, the beginning of text content is ordinary the beginning of file.
But if there is a BOM, the end of BOM is.

So to represent the beginning of text content, ContentStart field is added to Lexer.

note

Two bugs about BOM + shebang are at skipping logic in libParse and libSyntax(Lexer::lexTrivia) both.
This PR only fix and add testcase for libParse.

I think that libSyntax should not skip BOM and should keep it as LeadingTrivia.
But to implement this idea needs more changes,
So I plan to submit it as another PR in future.


Reported issues is not found in bugs.swift.org.

@CodaFi
Copy link
Member

CodaFi commented Dec 18, 2017

@harlanhaskins I definitely agree that libSyntax should retain the BOM as leading trivia for the first syntax node.

@swift-ci please smoke test

@harlanhaskins
Copy link
Collaborator

It might be fine to just use TriviaKind::GarbageText for this. It’s not meant to be parsed by swiftc and should remain transparent, but is necessary for a full source-accurate reprint.

Copy link
Member

@rintaro rintaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thank you @omochi !

@rintaro
Copy link
Member

rintaro commented Dec 18, 2017

@rintaro rintaro merged commit ed58c15 into apple:master Dec 18, 2017
@omochi
Copy link
Collaborator Author

omochi commented Dec 18, 2017

Thanks to review and merge.

@omochi omochi deleted the lexer-refactor-bom-handling branch December 18, 2017 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants