Language independent replace/reduce #2948

Open
kfsone opened this issue Oct 27, 2020 · 0 comments

kfsone commented Oct 27, 2020

Consider a scenario where we wish to have strings with their quotation marks removed, so that "hello" is surfaced in the target language as a STRING token with value hello, rather than "hello" or 'hello':

fragment SingleQuote: '\'';
fragment DoubleQuote: '"';

STRING
   : SingleQuote ~'\''*? SingleQuote   // ~ takes a literal or set; ANTLR 4 rejects a rule reference here
   | DoubleQuote ~'"'*? DoubleQuote
   ;

Today this requires language-specific actions or careful use of hidden/skip, and neither approach is especially intuitive. I'd like to suggest a couple of language-independent alternatives:

<tokenname>=(term)

STRING  : SingleQuote <STRING>=(~SingleQuote*?) SingleQuote;  // Requires <name-of-token>

ANTLR currently warns about a collision if you write STRING=..., so I used angle brackets to make the syntax more distinctive. I thought requiring the parens added to that clarity, but I can live without them.
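
To make the intended semantics concrete, here is a minimal Python sketch (plain `re`, not ANTLR, and not a claim about how ANTLR would implement this). The quotes take part in the match, but only a named group becomes the token value, which is roughly what `<STRING>=(...)` is meant to express. The names `STRING_PATTERN` and `string_token_value` are hypothetical.

```python
import re

# Hypothetical stand-in for the proposed rule:
#   STRING : SingleQuote <STRING>=(~SingleQuote*?) SingleQuote
#          | DoubleQuote <STRING>=(~DoubleQuote*?) DoubleQuote ;
# The quotes are matched, but only the named group is kept as the value.
STRING_PATTERN = re.compile(r"'(?P<sq>[^']*)'|\"(?P<dq>[^\"]*)\"")


def string_token_value(text):
    """Return the STRING token value for text, or None if it is not a string."""
    match = STRING_PATTERN.fullmatch(text)
    if match is None:
        return None
    # Exactly one of the two groups took part in the match.
    sq = match.group("sq")
    return sq if sq is not None else match.group("dq")
```

With this sketch, both `'hello'` and `"hello"` yield the value `hello`, with no quote characters left for the parser or the target-language code to strip.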

Replacement

A gross simplification of the escape-handling rule in https://github.com/antlr/antlr4/blob/master/doc/faq/lexical.md becomes possible by allowing a second equals sign for literal substitution:

STR :   '"'
        (   '\\'
            (   'r'     {buf.append('\r');}
            |   'n'     {buf.append('\n');}
            |   't'     {buf.append('\t');}
            |   '\\'    {buf.append('\\');}
            |   '\"'   {buf.append('"');}
            )
        |   ~('\\'|'"') {buf.append((char)_input.LA(-1));}
        )*
        '"'
        {setText(buf.toString()); buf.setLength(0); System.out.println(getText());}
    ;

becomes:

DoubleQuote : '"' -> channel(HIDDEN);
EscapeR     : <EscapeR>='\\r'='\r';
EscapeN     : <EscapeN>='\\n'='\n';
Escaped     : '\\' <Escaped>=.;
STR         : DoubleQuote <STR>=( EscapeR | EscapeN | Escaped | ~DoubleQuote )* DoubleQuote;
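
As a reference for what the rewritten rule is expected to produce, the following Python sketch performs the same reduction by hand: the surrounding quotes are dropped, `\r` and `\n` map to their control characters, and any other escaped character stands for itself. `SUBSTITUTIONS` and `reduce_str` are hypothetical names. Note that, unlike the FAQ rule, the shortened grammar above has no case for `\t`, so it falls through to `Escaped` and yields a plain `t`; the sketch mirrors that.

```python
# Literal substitutions, mirroring EscapeR and EscapeN in the grammar above.
SUBSTITUTIONS = {"r": "\r", "n": "\n"}


def reduce_str(lexeme):
    """Return the token text the proposed STR rule would surface for lexeme."""
    if len(lexeme) < 2 or lexeme[0] != '"' or lexeme[-1] != '"':
        raise ValueError("not a double-quoted string: %r" % lexeme)
    body = lexeme[1:-1]  # DoubleQuote contributes nothing to the value
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            escaped = body[i + 1]
            # EscapeR / EscapeN substitute; Escaped keeps the next character.
            out.append(SUBSTITUTIONS.get(escaped, escaped))
            i += 2
        elif ch == '"':
            raise ValueError("unescaped quote inside string")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

The point of the proposal is that this loop would not have to be written once per target language; the grammar alone would carry the substitutions.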

Alternative syntax: (lit "AS" sub)

Fairly succinct and clean, with a special case for discarding tokens by reducing them to the empty string ('').

EscapeR : ('\\r' AS '\r');
EscapeN: ('\\n' AS '\n');
Escaped: ('\\' AS '') . ;  // Escaped itself evaluates to the character after the backslash
STR : ('"' AS '') ( ( EscapeR | EscapeN | Escaped | ~DoubleQuote )* AS STR ) ('"' AS '');

Passthru/Inline

The first syntax could also allow for passthru, a case where the user wants a named production but doesn't want it to appear in the parse tree:

STRING : <STRING>=(SQ_STRING) | <STRING>=DQ_STRING;   // just trying with/without parens
SQ_STRING : SingleQuote <SQ_STRING>=(~SingleQuote*?) SingleQuote;
DQ_STRING : DoubleQuote <DQ_STRING>=(~DoubleQuote*?) DoubleQuote;

SQ_STRING and DQ_STRING would be inlined entirely into STRING, making them transparent (and, in targets like Python, reducing function-call overhead).
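
A rough Python analogy of the inlining idea, under the assumption that "passthru" means the helper rules never surface as token types of their own: the two helpers exist only as pattern fragments that are folded into one `STRING` pattern when the lexer is built. `tokenize` and the pattern names are hypothetical, and this says nothing about how ANTLR would generate code.

```python
import re

# Hypothetical helper rules: they exist only as named pattern fragments.
SQ_STRING = r"'(?P<sq>[^']*)'"
DQ_STRING = r'"(?P<dq>[^"]*)"'

# Inlining: both helpers are folded into a single STRING pattern at build time.
STRING = re.compile(SQ_STRING + "|" + DQ_STRING)


def tokenize(text):
    """Yield (type, value) pairs; only STRING is ever reported."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace is skipped, not tokenized
            continue
        match = STRING.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        value = match.group("sq")
        if value is None:
            value = match.group("dq")
        yield ("STRING", value)
        pos = match.end()
```

Consumers of `tokenize` only ever see `STRING`; whether the text was single- or double-quoted is an internal detail, which is the transparency the passthru form is after.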

The second syntax could also do this:

STRING : (SQ_STRING | DQ_STRING) AS STRING;  // the 'AS' term must match the token name.

Nota Bene

Both syntax candidates are offered without prejudice: one extrapolates from current ANTLR syntax, and the other was probably triggered by a neuron that knows SQL.
