Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust lexer has to be able to handle singular quotes '/" #6091

Conversation

matthiasblaesing
Copy link
Contributor

In a valid Rust programm quotes occur in pairs to form character, string or byte string literals. However while inputting the programm the lexer must still be able to handle them. The lexer has to produce a token in that case so that the NetBeans infrastructure relying on the token stream can work.

It is not a problem, that the resulting token stream can't be parsed.

Concrete cases:

Removing parts of single quote structure. The pipe symbols are not part of the code. The area from the first pipe to the second pipe is selected and removed.

fn main() {
    println!('|x');
}
|

Enter single quote ('). The pipe symbol is not part of the code, at that position a single quote is entered.

fn main() {
    println!(|

This partially reverts b62889b and introduces a different fix, that handles the "lonely" characters as explicit tokens.

In a valid Rust programm quotes occur in pairs to form character,
string or byte string literals. However while inputting the programm
the lexer must still be able to handle them. The lexer has to produce
a token in that case so that the NetBeans infrastructure relying on the
token stream can work.

It is not a problem, that the resulting token stream can't be parsed.

Concrete cases:

Removing parts of single quote structure. The pipe symbols are not
part of the code. The area from the first pipe to the second pipe is
selected and removed.

fn main() {
    println!('|x');
}
|

Enter single quote ('). The pipe symbol is not part of the code, at
that position a single quote is entered.

fn main() {
    println!(|

This partially reverts b62889b and
introduces a different fix, that handles the "lonely" characters as
explicit tokens.
@matthiasblaesing matthiasblaesing added the Rust [ci] enable Rust tests label Jun 18, 2023
@matthiasblaesing matthiasblaesing added this to the NB19 milestone Jun 18, 2023
@BradWalker
Copy link
Member

BradWalker commented Jun 18, 2023

Hey @matthiasblaesing ,

I could use some clarification here. I thought that single quote, ', deals with Rust lifetime rules..

https://doc.rust-lang.org/nomicon/lifetimes.html

Does this agree with your interpretation?

@matthiasblaesing
Copy link
Contributor Author

matthiasblaesing commented Jun 18, 2023

Totally not a Rust expert here, but to me the specification implies, that 'x is a lifetime or loop label, while ' is not. The lexer has to split the cases:

  • ' - a lonely single quote
  • 'dummy - a singlequote followed by characters, that can be used as identifiers
  • 'd' - a single quote followed by a character, followed by a single quote

This is decideable with a look ahead of 2. This should be trivial for the code generation of antlr.

While reading your reply I realise, that the currently used grammar does not map lifetimes at all. That needs to be handled later. I had another look and lifetime parsing lexing is there.

@vieiro
Copy link
Contributor

vieiro commented Jun 20, 2023

I'll need a few days to look at this one, I'm afraid...

@vieiro
Copy link
Contributor

vieiro commented Jun 24, 2023

As a quick test, I cloned and opened the regex project (https://github.com/rust-lang/regex) and browsed some files.

@matthiasblaesing , when opening this file: https://github.com/rust-lang/regex/blob/master/regex-syntax/src/unicode_tables/age.rs the lexer reported a syntax error, and the whole editor system broke (the SwingEDT behaves badly after that).

This happens in master as well. The Lexer reports an error, and this is catastrophic for the ANTLR4 Lexer Support module (the editor freezes).

We may want to investigate what's going on in ANTLR4 Lexer Support (how to handle failures in the antlr4 lexers). Can't tell if we want to merge this one before or not.

@matthiasblaesing
Copy link
Contributor Author

Thank you for the test file. This is indeed a worst-case. I think, that we are seeing a problem in the NetBeans lexer infrastructure when characters outside the basic Unicode Plane are encountered.

One problematic character in that file is the code point 66349. That code point is:

https://www.compart.com/de/unicode/U+1032D

In java we normally deal with chars or character arrays. The 16bit size of char limits it to the BMP. For that reason Java encodes codepoints outside of the BMP as surrogate pairs (https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF).

It seems, that at least one problem is, that the antlr lexer does not expect surrogate pairs and so we would need to recombine first and feed the resulting int into the lexer. A quick test looks promising. However the rendering is still off. I'll look further into this.

@vieiro
Copy link
Contributor

vieiro commented Jun 24, 2023

I'll take a look at how ANTLR4 Support handles exceptions as time permits. We may want to return some sort of "error token" or similar in ANTLR4Support on these cases.

Copy link
Contributor

@vieiro vieiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can safely merge this one. It's clever creating "error" tokens to handle mismatched single and double quotes. This is the way to go, I tink. Good to see them categorized in the Error category in RustTokenID.

Regarding Unicode support, it seems ANTLR 4 has specific ways to handle these codes (see https://github.com/antlr/antlr4/blob/dev/doc/unicode.md). We may want to tackle this in another PR, if we want to support this.

@matthiasblaesing
Copy link
Contributor Author

Thank you all for review. I agree, that the lexer problem with characters outside the BMP should be tackled outside this PR.

@matthiasblaesing matthiasblaesing merged commit 0259759 into apache:master Jun 25, 2023
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Rust [ci] enable Rust tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants