Skip to content

Latest commit

 

History

History
80 lines (63 loc) · 4.28 KB

case-insensitive-lexing.md

File metadata and controls

80 lines (63 loc) · 4.28 KB

Case-Insensitive Lexing

In some languages, keywords are case insensitive meaning that BeGiN means the same thing as begin or BEGIN. ANTLR has two mechanisms to support building grammars for such languages:

  1. Build lexical rules that match either upper or lower case.
  • Advantage: no changes required to ANTLR, makes it clear in the grammar that the language in this case insensitive.
  • Disadvantage: might have a small efficiency cost and grammar is a more verbose and more of a hassle to write.
  1. Build lexical rules that match keywords in all uppercase and then parse with a custom character stream that converts all characters to uppercase before sending them to the lexer (via the LA() method). Care must be taken not to convert all characters in the stream to uppercase because characters within strings and comments should be unaffected. All we really want is to trick the lexer into thinking the input is all uppercase.
  • Advantage: Could have a speed advantage depending on implementation, no change required to the grammar.
  • Disadvantage: Requires that the case-insensitive stream and grammar are used in correctly in conjunction with each other, makes all characters appear as uppercase/lowercase to the lexer but some grammars are case sensitive outside of keywords, errors new case insensitive streams and language output targets (java, C#, C++, ...).

For the 4.7.1 release, we discussed both approaches in detail and even possibly altering the ANTLR metalanguage to directly support case-insensitive lexing. We discussed including the case insensitive streams into the runtime but not all would be immediately supported. I decided to simply make documentation that clearly states how to handle this and include the appropriate snippets that people can cut-and-paste into their grammars.

Case-insensitive grammars

As a prime example of a grammar that specifically describes case insensitive keywords, see the SQLite grammar. To match a case insensitive keyword, there are rules such as

K_UPDATE : U P D A T E;

that will match UpdaTE and upDATE etc... as the update keyword. This rule makes use of some generically useful fragment rules that you can cut-and-paste into your grammars:

fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];

No special streams are required to use this mechanism for case insensitivity.

Custom character streams approach

The other approach is to use lexical rules that match either all uppercase or all lowercase, such as:

K_UPDATE : 'UPDATE';

Then, when creating the character stream to parse from, we need a custom class that overrides methods used by the lexer. Below you will find custom character streams for a number of the targets that you can copy into your projects, but here is how to use the streams in Java as an example:

CharStream s = CharStreams.fromPath(Paths.get('test.sql'));
CaseChangingCharStream upper = new CaseChangingCharStream(s, true);
Lexer lexer = new SomeSQLLexer(upper);

Here are implementations of CaseChangingCharStream in various target languages: