Add a new CharStream that converts the symbols to upper or lower case. #2046

bramp · 2017-10-06T14:56:06Z

This is useful for many of the case insensitive grammars found at https://github.com/antlr/grammars-v4/ which assume the input would be all upper or lower case. Related discussion can be found at antlr/antlr4test-maven-plugin#1

It would be used like so:

input, _ := antlr.NewFileStream("filename")

in = antlr.NewCaseChangingStream(is, true) // true forces upper case symbols, false forces lower case.

lexer := parser.NewAbnfLexer(in)

While writing this, I found other people have written their own similar implementations (go, java). It makes sense to place this in the core, so everyone can use it.

I would love for the grammar to have a option that says the lexer should upper/lower case all input, and then this code could be moved into the generated Lexer, and no user would need to explicitly use a CaseChangingStream (similar to what's discussed in #1002).

KvanTTT

It would be great if you also add such streams for other runtimes (Java, C#, JavaScript, Python2, Python3, C++, Swift). At least for a few of them.

KvanTTT · 2017-10-06T20:15:08Z

runtime/Go/antlr/case_changing_stream.go

+		return in
+	}
+	if is.upper {
+		return int(unicode.ToUpper(rune(in)))


Maybe it's better to convert case at the beginning one time and use upper/lower values from a prepared array for the performance reason? LA function can be called multiple times.

I could cache the result, but I didn't want this to work by reading the full input up front, because this data could be streamed, or extremely large.

However, I don't think caching will save much, as this is quite a cheap operation for the CPU. But I'm happy to benchmark it.

Ok, got it. Maybe it's right.

bramp · 2017-10-06T20:25:08Z

Sure, I'd be happy to contribute, Java, C++ and Python[23] (I don't know / have access to C# or Swift). Let's resolve this one first, then I'll work on the other languages.

bramp · 2017-10-07T20:35:32Z

I just added a similar version for Java #2048

parrt · 2017-10-21T20:21:23Z

hi gang. How is this different than simply running tolower() or whatever on the input string?

bramp · 2017-10-21T20:25:29Z

In my example, I want to use a NewFileStream, which gives me no opportunity to tolower the input. Basically this just wraps an existing stream and does that for me.

parrt · 2017-10-21T20:26:58Z

It seems to me that what one really wants is insensitivity during the lexing process but to leave the actual input as is, right?

bramp · 2017-10-21T20:37:05Z

Sure, if the lexer could treat the input in a case insensitive way, that would be ideal. I don't like sticking an extra stream in front, I'd rather the author of the lexer hide this detail from me. But currently there are multiple grammars that require lower or upper case input, and this helps me use them.

Alternatively as a step in that direction, this stream could be kept inside the lexer, and put there by antlr during the code generation.

parrt · 2017-10-21T20:40:15Z

It seems like we should simply tweak LA() so input.LA(1) returns toupper or change getExistingTargetState() and computeTargetState() and any other places that test characters in the LexerATNSimulator. of course that is a lot scarier to change

bramp · 2017-10-21T20:53:17Z

And there would be a change to the grammar file to indicate that input should always be uppercased? Because presumably some grammars would want to be case sensitive.

parrt · 2017-10-21T21:03:17Z

That's a good question. Is it a function of the grammar purely or could the same grammar be used in both case-sensitive and insensitive situations?

bramp · 2017-10-21T21:05:58Z

The samples I've seen on https://github.com/antlr/grammars-v4 have all been written assuming the input would all be one case, and not something you'd want to selectively turn on and off. They do this because it makes the grammar easier to write. So in those cases it would be a function of the grammar. However, I can't speak for all exotic grammars.

parrt · 2017-10-21T21:18:04Z

Hmm...yeah,I can imagine it being a function of the grammar. That said, given my limited time I think we have a simple solution which is to manually alter the string going into the upper or lower case. The programmer can do this easily with their own wrapper class or a simple function call if they have the input as a string. In that case, I'm going to close. Sorry but thanks for the effort.

bramp · 2017-10-21T21:43:24Z

Sure, the programmer can write their own wrapper, but providing one in the core library makes that so much easier. When I first started using antlr, I took a pre-written grammar (the mysql one in fact), and it took me a while to realise I had to uppercase everything, then it took me another while to understand the Stream interface, and figure out how to adapt the LA function to do what I need... Lets avoid that burden on a first time user, and make this easy for them with this pre-written and tested wrapper class.

parrt · 2017-10-21T21:50:56Z

Well,I have a lot of targets that we need to be updated and documentation and tests and so on. Probably not worth it when I personally just call toupper after reading in a string and before passing to ANTLR.

bramp · 2017-10-21T22:14:25Z

Ok sounds reasonable. I am willing to provide implementations for some of the other targets, but not all of them. However, I'll maintain this code outside of the core library, until there is an appetite for it. Thanks.

KvanTTT · 2017-10-30T14:39:55Z

@parrt, this stream save chars case. It only influences on lexer, but not on the parser or parse tree bypassing. It's roughly just to call "toupper" for an input string.

On my opinion, it's very demanded feature. See the list of issues in grammars and main repository:
antlr/grammars-v4#854
antlr/grammars-v4#878
antlr/grammars-v4#919

parrt · 2017-10-30T17:27:30Z

hiya. I can see how it's useful occasionally but it seems like we would need to do this for all targets not just Go before adding it. Maybe I can come up with a Java version to see how it feels.

bramp · 2017-10-30T17:49:28Z

I sent a PR for a Java implementation here: #2048

KvanTTT · 2017-10-30T19:35:55Z

I've implemented such streams for C#, C# optimized, Go, Java, JavaScript, Python2, Python3 runtimes. See AntlrCaseInsensitiveInputStream in respective runtime directories: https://github.com/KvanTTT/DAGE/tree/master/AntlrGrammarEditor/Runtimes. С++ and Swift are lacking for now.

Also, grammar tests can be found here: https://github.com/antlr/antlr4test-maven-plugin/tree/master/src/test/resources

KvanTTT · 2017-10-31T19:45:27Z

Yet another case insensitive issue: antlr/grammars-v4#921

parrt · 2017-11-01T17:22:02Z

Ok, so this wrapper implementation that @bramp has provided seems flexible actually. It works with existing streams but we just wrap that stream in an object that upper or lower case is things as needed without changing the actual stream. If we could bring other targets in line with #2048 and bring it all together as a PR, I think that would be acceptable.

parrt · 2017-12-04T17:36:37Z

btw, @sharwell used a stream approach which makes me think it's also not a hack. His approach does toUpperCase() once on copy of stream which is faster and also supports beta becoming two char 'SS'. Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity.

davesisson · 2017-12-04T17:39:46Z

How would the stream approach work with embedded string content (I'm presuming you want to preserve case in text)?

…

On Mon, Dec 4, 2017 at 9:36 AM Terence Parr ***@***.***> wrote: btw, @sharwell <https://github.com/sharwell> used a stream approach which makes me think it's also not a hack. His approach does toUpperCase() once on copy of stream which is faster and also supports beta becoming two char 'SS'. Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2046 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALUC_SpBDQFMb7pKtiHI3Qk0qU1cSpFTks5s9C2sgaJpZM4Pwmzt> .

parrt · 2017-12-04T17:40:34Z

yep, streams leave text alone (in functionality) but make the lexer think everything is upper or lower.

KvanTTT · 2017-12-04T17:47:14Z

Given a unicode string, shouldn't toUpperCase() (without locale) work correctly?

The Character.toUpperCase('а')); is not depend on the users locale.

Disadvantage: must replicate across streams and also that grammar doesn't indicate case-insensitivity.

Other runtimes also have a culture-invariant character case changing methods. I checked it at least for Java, C#, JavaScript, Go (I can attach confirming links later). We also can introduce case insensitive option in grammar and use such streams if it's set up.

Maybe it makes sense to add not only case sensitivity option, but also cultureor locale option (invariant by default, Russian ru, Turkish tr, Deutch de etc.) . But I'm afraid it may not work across all runtimes.

btw, @sharwell used a stream approach which makes me think it's also not a hack.

Yes, this is a working solution :)

parrt · 2017-12-04T17:49:10Z

So Character.toUpperCase(x)) will work for any unicode x? Why do they support a locale I wonder?

davesisson · 2017-12-04T17:52:11Z

For an example where uppercasing will fail, try the Turkish dotted i: http://www.i18nguy.com/unicode/turkish-i18n.html

…

On Mon, Dec 4, 2017 at 9:49 AM Terence Parr ***@***.***> wrote: So Character.toUpperCase(x)) will work for any unicode x? Why do they support a locale I wonder? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2046 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ALUC_SV2I4wJUtikC-ZUe-EeSaOCB1T4ks5s9DCegaJpZM4Pwmzt> .

parrt · 2017-12-04T17:55:27Z

Yeah, but the unicode dotted i should upper case properly as it's not the english i right? toUpperCase should handle it

parrt · 2017-12-04T17:58:44Z

Actually, doesn't work. Turkish reuses english 'i' I think. i get same upper case char here when it should be two diff ones.

class T {
     public static void main(String []args) {
        System.out.printf("%d\n", (int)Character.toUpperCase('\u0069'));
        System.out.printf("%d\n", (int)Character.toUpperCase('\u0131'));
     }
}
beast:/tmp $ javac T.java; java T
73
73

KvanTTT · 2017-12-04T18:06:53Z

So Character.toUpperCase(x)) will work for any unicode x? Why do they support a locale I wonder?

The Locale influence only on string toUpperCase method, but not on the character toUpperCase method (see my code above). I can prepare a table for all Unicode characters.

For an example where uppercasing will fail, try the Turkish dotted i:

It's failed because of culture dependece, see https://ideone.com/IvtDOd

Invariant: 
UPPER_CHAR(i) == I
UPPER_STR(i) == I

Turkish: 
LOWER_CHAR(I) == i
LOWER_STR(I) == ı
UPPER_CHAR(i) == I
UPPER_STR(i) == İ

Deutsch: 
UPPER_CHAR(ß) == ß
UPPER_STR(ß) == SS

Russian: 
UPPER_CHAR(а) == А
UPPER_STR(а) == А

parrt · 2017-12-04T18:09:15Z

thanks! So it would seem option { caseInsensitive=TR; } or whatever would be needed, not just case insensitive. that makes sense as grammar should indicate the exact characters/sets it is matching. A stream approach hides all of that and makes it easy to mismatch stream / grammar I'd say.

would we want the lower char or lower string version? ß or SS?

KvanTTT · 2017-12-04T18:13:32Z

Maybe something like this?

options
{
   caseInsensitive = true; // or false by default
   culture = TR; // by default en or invariant but other options are also available
}

In theory, culture can be also used in other cases.

parrt · 2017-12-04T18:19:16Z

I wonder if the simpler solution is just have antlr's IDE alter a rule inplace or have a trivial script that does it.

parrt · 2017-12-04T22:40:21Z

Referring to Mike's:

fragment A: 'A' | 'a';
fragment B: 'B' | 'b';
...
KEYWORD1: K E Y W O R D '1';

All that is really needed is the fragment rules because:

KEYWORD1: K E Y W O R D '1';

is no more burden than

KEYWORD1: 'keyword1' ;

I would say. I can easily add the fragment generator to the IDE and/or we can simply add that list to the doc for cut/paste. And we can refer people to these PR for streams and include them in the doc if we want. I think that satisfies everything. I just can't see beta to two chars right + Turkish etc...

KvanTTT · 2017-12-04T23:05:41Z

KEYWORD1: K E Y W O R D '1';

A new unreachable token warning from #2066 won't work with such fragment rules.

Ok, you have the freedom to do what you think right. But I won't transform tokens in existing T-SQL, PL/SQL, and other case insensitive grammars. Maybe somebody else...

parrt · 2017-12-04T23:14:20Z

Ok, I'll sleep on it. I think maybe doc is the right way to go on this rather than enforcing a specific vision in the tool.

mike-lischke · 2017-12-05T07:59:40Z

A word about locale: don't mix that with charset. A charset describes a mapping of numbers to charactes. Unicode itself is a charset (however, usually ones speaks of "encoding" instead). A locale is a collection of rules that describe various aspects about a culture/language and has nothing to do how characters are mapped. A locale determines time + number formats, sorting rules (collation) and others (here's the C++11 locale description with more details). This is what makes this so complicated with Unicode. And my question (and suggestion) about focusing on keywords tried to reflect that.

By focusing on keywords we don't need to worry about locales at all. We don't touch text that doesn't need to be case folded. We usually don't have trouble with special cases like the turkish I or the german ß (which is called: ess-zet or sharp s btw.), since they rarely are part of keywords. In general we can simplify things a lot by just focusing on keywords. We don't need comments or strings to be case insensitive, as we never have to match them in a lexer. Converting them in the input stream is just a big waste of CPU cycles and memory slots and therefore also Sam's solution is not good either.

@KvanTTT: by using a new caseInsensitive option you wouldn't need to have letter fragments. The lexer internally would handle that. As I wrote before it would probably boil down to implicitly extending the interval sets used to match a character. So, from an implementation standpoint it should be relatively easy to do (and can be done fully in the Java code, no target involved then).

@parrt One thing remains to be clarified then: what is a keyword? That's a weak point in my approach. It's tricky to say this is a keyword and that is not, in a lexer grammar. You would need to limit where the new option can be applied to. In any case, I would leave out culture/locale processing, as ANTLR is not an NLP tool, but made for programming languages, where some common rules exist (like: keywords use the ASCII charset) that simplify processing. So, one way could be to say: case insensitivity can only be applied to lexer rules that only match ASCII input. That should still cover most usage scenarios, but keeps implementation work small.

parrt · 2017-12-05T21:45:47Z

Thanks everybody! I think I'm going to try out the documentation cut/paste angle ala @mike-lischke's approach and send in a pull request that incorporates streams in the doc provided by the other folks as options. then we can comment on that before I pull it in.

parrt · 2017-12-05T23:33:26Z

For your review gentlemen, #2146 I have tried to incorporate with cherry picking the resources created by @bramp so that he can get credit for the contribution. It is a purely documentation based solution to this problem but requires no source code changes to ANTLR etc...

balahsen · 2017-12-06T14:15:05Z

Hi,
I agree with mike-lischke:
It's about lexer characters for reserved (or key) word for some grammar that is case insensitive like Oracle (PL/SQL).
The lexer should accept upper or lower key word without syntax error for case insensitive grammar.
it's very important: do not change case of client code ( the parsed code).

parrt · 2017-12-06T20:28:23Z

Closing As I'm going the documentation route #2146

…rmed: S E L F -> 'SELF';

KvanTTT · 2021-09-12T14:48:33Z

@mike-lischke

Converting them in the input stream is just a big waste of CPU cycles and memory slots and therefore also Sam's solution is not good either.

I've made a benchmark (C# Runtime) for comparison of grammar and runtime case comparison approaches. It shows that grammar based approach a bit faster:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.17763.2114 (1809/October2018Update/Redstone5)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT
  DefaultJob : .NET Core 3.1.11 (CoreCLR 4.700.20.56602, CoreFX 4.700.20.56604), X64 RyuJIT


|                     Method |     Mean |     Error |    StdDev | Ratio |
|--------------------------- |---------:|----------:|----------:|------:|
|   CaseInsensitiveLexerTest | 6.769 ms | 0.0282 ms | 0.0236 ms |  1.00 |
| CaseInsensitiveGrammarTest | 6.258 ms | 0.1018 ms | 0.0903 ms |  0.92 |

But, if grammar approach is used, the generated parser is a bit bigger.

So, one way could be to say: case insensitivity can only be applied to lexer rules that only match ASCII input.

It's not completely true. I know well-known language in Russia (1С) that have case insensitive cyrillic lexemes (А а Б б В в ...). But I've found a simple solution for not dealing with complex locale case comparison: just do not change case if length of resulting char (actually string) more than 1 or use culture inveriant case changing if possible. It keeps equal length of input and normalized string. For instance, the following string:

A a ß Д д

is being transformed to

A A ß Д Д

You can take a look at my implementation.

Python, JavaScript, PHP require one-length character check, but C#, Java, Go, Dart do correct transfomation using system functions.

KvanTTT reviewed Oct 6, 2017

View reviewed changes

bramp mentioned this pull request Oct 7, 2017

[Java] Add a new CharStream that converts the symbols to upper or lower case. #2048

Closed

parrt closed this Oct 21, 2017

parrt added the status:not-fixing label Oct 21, 2017

hvogeler mentioned this pull request Oct 30, 2017

TSQL data_type tokens in upper case only antlr/grammars-v4#920

Closed

parrt reopened this Nov 1, 2017

parrt added comp:runtime lexers and removed status:not-fixing labels Nov 1, 2017

vonloxley mentioned this pull request Nov 11, 2017

TSQL count(*) not recognized antlr/grammars-v4#934

Closed

parrt mentioned this pull request Dec 5, 2017

Case insensitivity doc #2146

Merged

parrt closed this Dec 6, 2017

parrt removed this from the 4.7.1 milestone Dec 6, 2017

n8pease mentioned this pull request May 14, 2018

Tickets/dm 13960 lsst/qserv#395

Closed

KvanTTT referenced this pull request in antlr/grammars-v4 Jul 2, 2018

Removed fragment rules from PL/SQL and T-SQL grammars. Tokens transfo…

44aa3b6

…rmed: S E L F -> 'SELF';

PHaroZ mentioned this pull request Aug 3, 2018

mysql lexer, enable lower case letters in variables antlr/grammars-v4#1197

Merged

zy735537 mentioned this pull request Feb 13, 2019

TSQL: Parse error when IDs with lowercase antlr/grammars-v4#1367

Closed

KvanTTT mentioned this pull request Sep 12, 2021

[Dart] Add case changing char stream example #3246

Merged

KvanTTT mentioned this pull request Dec 12, 2021

Implement caseInsensitive option #3399

Merged

Add a new CharStream that converts the symbols to upper or lower case. #2046

Add a new CharStream that converts the symbols to upper or lower case. #2046

Conversation

bramp commented Oct 6, 2017

KvanTTT left a comment

Choose a reason for hiding this comment

KvanTTT Oct 6, 2017

Choose a reason for hiding this comment

bramp Oct 6, 2017

Choose a reason for hiding this comment

KvanTTT Oct 6, 2017

Choose a reason for hiding this comment

bramp commented Oct 6, 2017

bramp commented Oct 7, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

parrt commented Oct 21, 2017

bramp commented Oct 21, 2017

KvanTTT commented Oct 30, 2017

parrt commented Oct 30, 2017

bramp commented Oct 30, 2017

KvanTTT commented Oct 30, 2017 • edited

KvanTTT commented Oct 31, 2017

parrt commented Nov 1, 2017

parrt commented Dec 4, 2017

davesisson commented Dec 4, 2017 via email

parrt commented Dec 4, 2017

KvanTTT commented Dec 4, 2017 • edited

parrt commented Dec 4, 2017

davesisson commented Dec 4, 2017 via email

parrt commented Dec 4, 2017

parrt commented Dec 4, 2017

KvanTTT commented Dec 4, 2017 • edited

parrt commented Dec 4, 2017 • edited

KvanTTT commented Dec 4, 2017 • edited

parrt commented Dec 4, 2017

parrt commented Dec 4, 2017

KvanTTT commented Dec 4, 2017 • edited

parrt commented Dec 4, 2017

mike-lischke commented Dec 5, 2017 • edited

parrt commented Dec 5, 2017 • edited

parrt commented Dec 5, 2017 • edited

balahsen commented Dec 6, 2017 • edited

parrt commented Dec 6, 2017

KvanTTT commented Sep 12, 2021 • edited

KvanTTT commented Oct 30, 2017 •

edited

KvanTTT commented Dec 4, 2017 •

edited

KvanTTT commented Dec 4, 2017 •

edited

parrt commented Dec 4, 2017 •

edited

KvanTTT commented Dec 4, 2017 •

edited

KvanTTT commented Dec 4, 2017 •

edited

mike-lischke commented Dec 5, 2017 •

edited

parrt commented Dec 5, 2017 •

edited

parrt commented Dec 5, 2017 •

edited

balahsen commented Dec 6, 2017 •

edited

KvanTTT commented Sep 12, 2021 •

edited