[Parse] Improve Lexer's UTF-8 BOM handling #13483

omochi · 2017-12-16T05:55:48Z

The current Lexer has a function that skips a UTF-8 BOM.
But its behavior has some bugs.

The Lexer skips a shebang at the beginning of a file, but not when the file starts with a BOM.
An operator token at the beginning of a file should be a prefix operator, but it becomes an infix operator when preceded by a BOM.
The Lexer skips a conflict marker, but not when it follows a BOM.
A token at the beginning of a file should be isAtStartOfLine, but it is not when preceded by a BOM.

All of these bugs come from the Lexer's implementation, which treats BufferStart as the beginning of the source code even when a BOM was skipped.
In a text file with a BOM, the end of the BOM should be regarded as the beginning of the text content.
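
To make the cause concrete, here is a minimal C++17 sketch of the shebang case. It is not the Lexer's actual code: `bomLength`, `hasShebangAtBufferStart`, and `hasShebangAtContentStart` are hypothetical helpers written only for illustration. A check anchored at the first byte of the buffer misses a shebang that follows a BOM, while a check anchored at the end of the BOM finds it.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length of a UTF-8 BOM (bytes EF BB BF) at the start of `buffer`,
// or 0 if the buffer does not start with one.
inline std::size_t bomLength(std::string_view buffer) {
  return buffer.substr(0, 3) == "\xEF\xBB\xBF" ? 3 : 0;
}

// Hypothetical check that mirrors the buggy behavior: it looks for "#!"
// at the very first byte of the buffer, so a leading BOM hides the shebang.
inline bool hasShebangAtBufferStart(std::string_view buffer) {
  return buffer.substr(0, 2) == "#!";
}

// The same check, anchored at the first byte after the BOM instead.
inline bool hasShebangAtContentStart(std::string_view buffer) {
  const std::size_t skip = bomLength(buffer);
  assert(skip <= buffer.size());
  return hasShebangAtBufferStart(buffer.substr(skip));
}
```

For a buffer that starts with a BOM followed by `#!/usr/bin/swift`, the first check returns false and the second returns true. For a buffer without a BOM, both agree.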

This PR fixes these issues.

The design is shown below.

In a text file, the beginning of the text content is ordinarily the beginning of the file.
But if there is a BOM, the beginning of the text content is the end of the BOM.

So, to represent the beginning of the text content, a ContentStart field is added to Lexer.
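
The idea can be sketched with a toy class. This is an illustration under stated assumptions, not the code in this PR: `ToyLexer`, `contentOffset`, and this simplified `isAtStartOfLine` are hypothetical, and only the BufferStart and ContentStart names come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Toy stand-in for a lexer. The caller must keep the buffer alive for as
// long as the ToyLexer is used, since only pointers into it are stored.
class ToyLexer {
  const char *BufferStart;  // first byte of the file
  const char *BufferEnd;    // one past the last byte of the file
  const char *ContentStart; // first byte after an optional UTF-8 BOM

public:
  explicit ToyLexer(std::string_view buffer)
      : BufferStart(buffer.data()),
        BufferEnd(buffer.data() + buffer.size()),
        ContentStart(BufferStart) {
    // Skip the BOM once, and remember where the text content begins.
    if (buffer.substr(0, 3) == "\xEF\xBB\xBF")
      ContentStart += 3;
    assert(ContentStart <= BufferEnd);
  }

  // Offset of the text content from the start of the buffer (0 or 3).
  std::size_t contentOffset() const {
    return static_cast<std::size_t>(ContentStart - BufferStart);
  }

  // A position is at the start of a line if it is the start of the text
  // content, or if the previous byte is a line break.
  bool isAtStartOfLine(std::size_t offset) const {
    const char *ptr = BufferStart + offset;
    assert(ContentStart <= ptr && ptr <= BufferEnd);
    if (ptr == ContentStart)
      return true;
    return ptr[-1] == '\n' || ptr[-1] == '\r';
  }
};
```

With this shape, every "am I at the beginning of the source?" question compares against ContentStart rather than BufferStart, so a file with a BOM and a file without one behave the same way.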

note

The two BOM + shebang bugs are in the skipping logic of both libParse and libSyntax (Lexer::lexTrivia).
This PR only fixes libParse and adds test cases for it.

I think that libSyntax should not skip the BOM and should keep it as LeadingTrivia.
But implementing this idea needs more changes,
so I plan to submit it as another PR in the future.

No existing reports of these issues were found on bugs.swift.org.

CodaFi · 2017-12-18T06:18:34Z

@harlanhaskins I definitely agree that libSyntax should retain the BOM as leading trivia for the first syntax node.

@swift-ci please smoke test

harlanhaskins · 2017-12-18T06:20:54Z

It might be fine to just use TriviaKind::GarbageText for this. It’s not meant to be parsed by swiftc and should remain transparent, but is necessary for a full source-accurate reprint.

rintaro

Good catch. Thank you @omochi !

rintaro · 2017-12-18T08:20:47Z

We also need to update here:
https://github.com/omochi/swift/blob/035e47851fe6b70877ece7de275953e30cabedee/lib/Parse/Lexer.cpp#L2396
I will do that in #13500

omochi · 2017-12-18T09:11:28Z

Thanks for the review and merge.

omochi added 2 commits December 17, 2017 02:21

[Parse] add BOM handling testcases

60dc411

[Parse] add ContentStart to Lexer for BOM handling

035e478

omochi force-pushed the lexer-refactor-bom-handling branch from 0446c46 to 035e478 Compare December 16, 2017 17:22

omochi mentioned this pull request Dec 17, 2017

[Parse] Refactor Lexer's skipToEndOfline #13495

Merged

rintaro approved these changes Dec 18, 2017

rintaro merged commit ed58c15 into apple:master Dec 18, 2017

omochi deleted the lexer-refactor-bom-handling branch December 18, 2017 09:11
