Language independent replace/reduce #2948

kfsone · 2020-10-27T21:40:11Z

Consider a scenario where we wish to have strings with their quotation marks removed, so that "hello" is surfaced in the target language as a STRING token with value hello, rather than "hello" or 'hello':

fragment SingleQuote: '\'';
fragment DoubleQuote: '"';

STRING
   : SingleQuote ~SingleQuote*? SingleQuote
   | DoubleQuote ~DoubleQuote*? DoubleQuote
   ;

Today you need language-specific actions or careful use of hidden/skip to pull this off, and neither is especially intuitive. I'd like to suggest two possible alternatives that are language independent:

<tokenname>=(term)

STRING  : SingleQuote <STRING>=(~SingleQuote*?) SingleQuote;  // Requires <name-of-token>

ANTLR currently warns about a collision if you write STRING=..., so I used angle brackets to make the syntax more distinctive. I thought requiring the parens added to that clarity, but I can live without them.
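
To make the intended semantics concrete, here is a minimal Python sketch that emulates what `<STRING>=(term)` would do: the rule matches the whole quoted literal, but only the captured sub-match becomes the token text. This is not ANTLR code and does not use any ANTLR runtime; the `Token` class and `lex_string` helper are hypothetical names for illustration, and a regular expression stands in for the lexer rule.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for a lexer token; not part of any ANTLR runtime.
@dataclass
class Token:
    type: str
    text: str


# Emulates the proposed rule for both quote styles:
#   STRING : SingleQuote <STRING>=(~SingleQuote*?) SingleQuote
#          | DoubleQuote <STRING>=(~DoubleQuote*?) DoubleQuote ;
# The named group "value" plays the role of the <STRING>=(...) capture.
STRING_RULE = re.compile(r"(?P<quote>['\"])(?P<value>.*?)(?P=quote)", re.DOTALL)


def lex_string(source: str, pos: int = 0) -> Optional[Token]:
    """Match a quoted string at pos and surface only the captured part."""
    m = STRING_RULE.match(source, pos)
    if m is None:
        return None
    # The quotes were consumed by the match but are not part of the token text.
    return Token("STRING", m.group("value"))
```

With this sketch, `lex_string('"hello"')` and `lex_string("'hello'")` both yield a STRING token whose text is `hello`, which is the behaviour the proposal asks the grammar itself to express.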

Replacement

A gross simplification of https://github.com/antlr/antlr4/blob/master/doc/faq/lexical.md becomes possible by allowing a second equals sign for literal substitution:

STR :   '"'
        (   '\\'
            (   'r'     {buf.append('\r');}
            |   'n'     {buf.append('\n');}
            |   't'     {buf.append('\t');}
            |   '\\'    {buf.append('\\');}
            |   '\"'   {buf.append('"');}
            )
        |   ~('\\'|'"') {buf.append((char)_input.LA(-1));}
        )*
        '"'
        {setText(buf.toString()); buf.setLength(0); System.out.println(getText());}
    ;

becomes:

DoubleQuote : '"' -> channel(HIDDEN);
EscapeR : <EscapeR>='\\r'='\r';
EscapeN : <EscapeN>='\\n'='\n';
Escaped : '\\' <Escaped>=.;
STR : DoubleQuote <STR>=( EscapeR | EscapeN | Escaped | ~DoubleQuote )* DoubleQuote;
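
As a rough illustration of what the substitution rules would compute, the following Python sketch emulates the proposed STR rule by hand. It is an assumption-laden model of the semantics, not generated ANTLR output: `SUBSTITUTIONS`, `PIECE`, and `str_token_text` are hypothetical names, and the regular expression approximates the alternatives `EscapeR | EscapeN | Escaped | ~DoubleQuote`.

```python
import re
from typing import Optional

# Literal substitutions, mirroring <EscapeR>='\\r'='\r' and <EscapeN>='\\n'='\n'.
SUBSTITUTIONS = {"\\r": "\r", "\\n": "\n"}

# One piece of string content: a backslash followed by any character
# (EscapeR, EscapeN, Escaped), or any single character other than a quote.
PIECE = re.compile(r'\\(?P<escaped>.)|(?P<plain>[^"])', re.DOTALL)


def str_token_text(source: str) -> Optional[str]:
    """Return the text the proposed STR rule would surface, or None if no match."""
    if not source.startswith('"'):
        return None
    out = []
    pos = 1
    while pos < len(source) and source[pos] != '"':
        m = PIECE.match(source, pos)
        escaped = m.group("escaped")
        if escaped is not None:
            # EscapeR/EscapeN substitute a literal; Escaped keeps only the
            # character that follows the backslash.
            out.append(SUBSTITUTIONS.get("\\" + escaped, escaped))
        else:
            out.append(m.group("plain"))
        pos = m.end()
    if pos >= len(source):
        return None  # no closing quote
    return "".join(out)
```

Note that the rules as written above define no tab substitution, so in this model `\t` falls through to Escaped and surfaces as a plain `t`; an `EscapeT` rule in the same style would be needed to match the FAQ example exactly.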

Alternative syntax: (lit "AS" sub)

Fairly succinct and clean, with a special case for discarding tokens by reducing them to the empty string ('').

EscapeR : ('\\r' AS '\r');
EscapeN: ('\\n' AS '\n');
Escaped: ('\\' AS '') . ;  // Escaped itself evaluates 
STR : ('"' AS '') ( ( EscapeR | EscapeN | Escaped | ~DoubleQuote )* AS STR ) ('"' AS '');

Passthru/Inline

The first syntax could also allow for passthru, a case where the user wants a named production but doesn't want it to appear in the parse tree:

STRING : <STRING>=(SQ_STRING) | <STRING>=DQ_STRING;   // just trying with/without parens
SQ_STRING : SingleQuote <SQ_STRING>=(~SingleQuote*?) SingleQuote;
DQ_STRING : DoubleQuote <DQ_STRING>=(~DoubleQuote*?) DoubleQuote;

SQ_STRING and DQ_STRING would be entirely inlined to STRING, making them transparent (and in languages like Python reducing the function call overhead).

The second syntax could also do this:

STRING : (SQ_STRING | DQ_STRING) AS STRING;  // 'as' term must match token name.
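
Both spellings describe the same inlining, which can be modelled in a few lines of Python. This is only a sketch of the idea under the assumption that inlining means splicing the sub-rule bodies into the parent rule; `SQ_STRING`, `DQ_STRING`, and `lex` here are illustrative regex fragments and a hypothetical helper, not ANTLR constructs.

```python
import re
from typing import Optional, Tuple

# The sub-rules exist only as named fragments of pattern text.
SQ_STRING = r"'(?P<sq>[^']*)'"
DQ_STRING = r'"(?P<dq>[^"]*)"'

# Inlining: the fragments are spliced into a single STRING pattern, so
# matching never dispatches to a separate SQ_STRING or DQ_STRING routine.
STRING = re.compile(f"{SQ_STRING}|{DQ_STRING}")


def lex(source: str) -> Optional[Tuple[str, str]]:
    """Return ("STRING", text) with the quotes removed, or None if no match."""
    m = STRING.match(source)
    if m is None:
        return None
    text = m.group("sq") if m.group("sq") is not None else m.group("dq")
    # Only STRING is ever reported; the sub-rule names stay invisible.
    return ("STRING", text)
```

The caller only ever sees STRING, which is the transparency the passthru form is after.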

Nota Bene

Syntax candidates provided without prejudice, one demonstrating an extrapolation of current ANTLR syntax and the other probably triggered by a neuron that knows SQL.
