does not work syntax highlighting when used with \b and the Russian alphabet #3291

ava57r · 2016-10-23T17:00:32Z

Good afternoon. Create syntax highlighting for 1C:Enterprise, which supports the Russian key words.
When we use \b(Если|If)\b in github https://github-lightshow.herokuapp.com keywords are not highlighted. Do like this (?<=[^\w-а-яё\.]|^)(Если|If)(?=[^\w-а-яё\.]|$) works. Files in UTF-8.

The text was updated successfully, but these errors were encountered:

Alhadis · 2016-10-23T17:04:55Z

Well, the effect of the \b sequence is partly dependent on the engine's Unicode settings... the Lightshow app is using a PCRE-driven implementation, as opposed to the Oniguruma-based one used by others.

Can you post an example of the code you're trying to highlight?

nixel2007 · 2016-10-23T17:08:44Z

Hello. Most of the code in 1C Enterprise is written in russian - https://github.com/silverbulleters/vanessa-agiler/blob/master/src/cf/CommonModules/%D0%9E%D0%B1%D1%89%D0%B5%D0%B3%D0%BE%D0%9D%D0%B0%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%D0%9A%D0%BB%D0%B8%D0%B5%D0%BD%D1%82%D0%A1%D0%B5%D1%80%D0%B2%D0%B5%D1%80/Ext/Module.bsl

Grammar was added in #2773

When we started that PR in december, we saw, that lightshow doen't work properly with \b and cyrillic, so we changed the regexps from \b to positive look behind/forward.

\b with cyrillic works in Atom, VSC and Sublime text.

Is it a bug with lightshow or Github itself can't handle it?

Alhadis · 2016-10-23T17:15:15Z

Ah! That proves it then. Yes, I'd say it's because of the two different engines being used. Oniguruma is Unicode-aware by default, which means it matches "word boundaries" on a more world-wise definition (instead of [A-Za-z_0-9]). PCRE, on the other hand, probably doesn't do this, although I'm not familiar with its inner workings.

I do know, however, that \b sequences are a pain in the neck... =) I like to think \b really stands for \BAD.

Assertions are definitely the best way to go, in all cases.

Alhadis · 2016-10-23T17:20:19Z

I have to point out, too, that Lightshow can sometimes give very misleading results. See #3130.

The app is closed source, sadly, so I have no way of knowing what's really going on in there...

nixel2007 · 2016-10-23T17:39:34Z

The question is what engine github uses. :)

Alhadis · 2016-10-23T17:40:44Z

Oh! Sorry, I wasn't clear.

GitHub uses PCRE, same as Lightshow. =)

nixel2007 · 2016-10-23T17:42:42Z

Okay... Is it easy to change one engine to another? Can we just ask the Github guys to make this change?

Alhadis · 2016-10-23T17:57:13Z

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

More info on that here.

nixel2007 · 2016-10-23T18:08:25Z

@Alhadis got 404 on your link
UPDATE: need to change htm to html at the end

arfon · 2016-10-23T18:19:30Z

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

I do not sorry. @aroben might though 😁

ava57r · 2016-10-24T05:51:13Z

Alhadis · 2016-10-24T05:59:07Z

Wouldn't it be easier to narrow it down to what characters are valid, instead of which ones aren't...?

For example:

(?xi)

# Following:
(?<= ^       # Start of line
   | \s      # Whitespace
   | \(      # Open bracket
   | [-+~/*] # Possible operators
   | ...     # Etc
   )

(Неопределено|Undefined|Истина|True|Ложь|False|NULL)

# Followed by:
(?= $        # End of line
  | \s       # Whitespace
  | ...      # Etc...

I'm not familiar with the language's syntax, so I can't be more specific than that, but I think you get the idea. =)

nixel2007 · 2016-10-24T07:40:20Z

@Alhadis (?<=[^\wа-яё\.]|^) literally is a \b-before-word realisation. \b is ^\w, so we've just complete it with а-яё and adiotional . to avoid highlighting of properties with the same name as keyword.

Alhadis · 2016-10-24T07:45:40Z

Thing is, different engines have different notions of what constitutes a "word character". When writing portable expressions, I avoid the assumption that consuming engines will be Unicode-aware. Mostly because of scenarios like this.

In any case, there's no reason for the site's highlighting engine to not be Unicode-aware, given the multicultural nature of GitHub.

aroben · 2016-10-24T13:51:40Z

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

@vmg may be able to answer this

Alhadis · 2016-10-24T13:52:33Z

If @vmg pings @arfon for an answer, I'm outta here.

vmg · 2016-10-24T14:08:52Z

If @vmg pings @arfon for an answer, I'm outta here.

Hahaha. I can answer this: we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

Alhadis · 2016-10-24T14:14:36Z

Ah right, that makes sense then. I wasn't aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I'd say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise's one does). Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

nixel2007 · 2016-10-24T14:21:16Z

ersonally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

this is a good variant. @vmg @arfon is it possible to make that thing?

ava57r · 2016-10-24T14:23:21Z

Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

It would be great!

arfon · 2016-10-24T14:44:19Z

this is a good variant. @vmg @arfon is it possible to make that thing?

That may be the right approach but I'd rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

Alhadis · 2016-10-24T14:48:49Z

Don't screw it up. The Russians are watching.

nixel2007 · 2016-10-25T10:06:21Z

@arfon what are our next steps?
What kind of plan do you need?

In my world the plan is:

add a new entry in languages.yml
add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
add flag-reader at the github-highlighter side.

anything else?

arfon · 2016-10-25T10:18:55Z

@arfon what are our next steps?

Right now we wait. @vmg above pointed out that we need to do some work on the GitHub side (i.e. not here) to add support for this.

add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)

add flag-reader at the github-highlighter side.

Possibly. Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them 😄

Alhadis · 2016-10-25T11:05:14Z

Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them

How about these in the mean time? :D

Just so we're guarded against the ol' syntax-highlighting related questions...

ava57r · 2016-12-03T14:10:00Z

@vmg Can you provide a link to the issue on the Github side?

ava57r · 2017-05-26T06:30:31Z

@Alhadis Hello.
Any news? There is a chance to solve this problem.

Alhadis · 2017-05-26T13:35:07Z

What do you mean?

ava57r · 2017-05-26T17:01:08Z

@arfon write #3291 (comment)
There is a solution? More than half a year has passed.

ghost · 2018-04-06T19:25:43Z

(apologies in advance, i don't intent to hijack this conversation)

In C# too, we are struggling to have C# 7.2 syntax support due to the fact that upstream has moved to Oniguruma grammer. @damieng has chalked out some ideas at atom/language-csharp#112 (comment) by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project?

lildude · 2018-04-07T08:22:43Z

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

worldbeater · 2018-04-09T17:00:31Z

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting and it would be nice to figure out how this can be fixed in the nearest future. Other engines like BitBucket or GitLab provide a much better highlighting, sad.

Thanks in advance!

Alhadis · 2018-04-09T18:37:46Z

providing support for Oniguruma and other complicated regex syntaxes?

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you'd be asking us to provide support for PCRE's "complicated syntax" too, because the Oniguruma engine isn't good enough.

worldbeater · 2018-04-13T11:32:09Z

PCRE supports every feature that Oniguruma does, it just uses different syntax

Well, now I understand, thanks! Could you please provide more info on which PCRE version GitHub is using now? Seems work on porting Oniguruma-based syntax to PCRE-based is stuck due to some incompatibilities, I don't know: atom/language-csharp#112 (comment)

That is all because Microsoft uses Oniguruma expressions in grammar for Visual Studio Code and Atom editors right now, but it can't be simply copied and pasted to play well with GitHub.

Thanks in advance, @Alhadis!

Alhadis · 2018-04-13T12:06:10Z

@worldbeater I doubt the PCRE version has anything to do with this. It's running in ASCII-only mode, that's all I know (I'm not staff). But of check the manpage for pcresyntax(3), you'll find documentation for variances between regular expression engines (starting at the section "Backreferences").

These are the likely discrepancies that that're affecting the C# grammar (indeed, most TextMate grammars which use Oniguruma extensions):

Syntax	Description	Engines
`(?R)`	recurse whole pattern	All
`(?n)`	call subpattern by absolute number	All
`(?+n)`	call subpattern by relative number	All
`(?-n)`	call subpattern by relative number	All
`(?&name)`	call subpattern by name	Perl, PCRE
`(?P>name)`	call subpattern by name	Python
`\g<name>`	call subpattern by name	Oniguruma
`\g'name'`	call subpattern by name	Oniguruma
`\g<n>`	call subpattern by absolute number	Oniguruma
`\g'n'`	call subpattern by absolute number	Oniguruma
`\g<+n>`	call subpattern by relative number	PCRE
`\g'+n'`	call subpattern by relative number	PCRE
`\g<-n>`	call subpattern by relative number	PCRE
`\g'-n'`	call subpattern by relative number	PCRE

vmg · 2018-04-17T12:33:28Z

Thank you for that accurate list, @Alhadis. The most common occurrence of incompatible syntax is the \g<name> that some VSCode grammars use. Obviously it's trivially fixable by simply referring to the subpattern by absolute number instead of name -- but the replacement can be tricky.

stale · 2018-11-06T07:14:25Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

nixel2007 · 2018-11-06T07:42:50Z

i think this issue should be opened. the problem still exists.

stale · 2018-12-06T08:10:37Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

nixel2007 · 2018-12-06T08:19:19Z

Any news form GitHub.com?
If this issue has no place in gh backlog it could be closed (sadly)

stale · 2019-02-04T11:46:32Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

nixel2007 · 2019-02-04T12:36:47Z

One more ping, guys. (sorry)

ava57r · 2019-02-05T19:48:13Z

Need help this issue

lildude · 2019-02-06T10:27:41Z

Nothing has changed on the GitHub side of things... #3291 (comment) still applies:

... we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

This isn't likely to change in the immediate future. If you really want this issue resolved sooner rather than later, the best option is to change the grammar as discussed in various points earlier in this issue.

stale · 2019-03-08T10:55:01Z

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale · 2019-03-22T11:45:10Z

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

ava57r mentioned this issue Oct 23, 2016

Сделать более короткие шаблоны регулярных выражений 1c-syntax/1c-syntax#214

Open

lildude mentioned this issue Nov 29, 2017

New Grammars Compiler #3915

Merged

worldbeater mentioned this issue Apr 9, 2018

Add support for C# 7.2 keyword contexts atom/language-csharp#112

Closed

stale bot added the Stale label Nov 6, 2018

stale bot removed the Stale label Nov 6, 2018

stale bot added the Stale label Dec 6, 2018

stale bot removed the Stale label Dec 6, 2018

stale bot added the Stale label Feb 4, 2019

stale bot removed the Stale label Feb 4, 2019

stale bot added the Stale label Mar 8, 2019

stale bot closed this as completed Mar 22, 2019

Alhadis mentioned this issue May 2, 2019

SourcePawn stock repository detected as partially C++ #3156

Closed

Alhadis mentioned this issue Oct 14, 2019

Muse: do not allow letters after closing "*" Alhadis/language-emacs-lisp#6

Closed

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024

does not work syntax highlighting when used with \b and the Russian alphabet #3291

does not work syntax highlighting when used with \b and the Russian alphabet #3291

Comments

ava57r commented Oct 23, 2016

Alhadis commented Oct 23, 2016

nixel2007 commented Oct 23, 2016

Alhadis commented Oct 23, 2016

Alhadis commented Oct 23, 2016

nixel2007 commented Oct 23, 2016

Alhadis commented Oct 23, 2016

nixel2007 commented Oct 23, 2016

Alhadis commented Oct 23, 2016

nixel2007 commented Oct 23, 2016 • edited Loading

arfon commented Oct 23, 2016

ava57r commented Oct 24, 2016

Alhadis commented Oct 24, 2016

nixel2007 commented Oct 24, 2016

Alhadis commented Oct 24, 2016 • edited Loading

aroben commented Oct 24, 2016

Alhadis commented Oct 24, 2016

vmg commented Oct 24, 2016

Alhadis commented Oct 24, 2016 • edited Loading

nixel2007 commented Oct 24, 2016

ava57r commented Oct 24, 2016

arfon commented Oct 24, 2016

Alhadis commented Oct 24, 2016

nixel2007 commented Oct 25, 2016

arfon commented Oct 25, 2016

Alhadis commented Oct 25, 2016

ava57r commented Dec 3, 2016

ava57r commented May 26, 2017

Alhadis commented May 26, 2017

ava57r commented May 26, 2017

ghost commented Apr 6, 2018

lildude commented Apr 7, 2018

worldbeater commented Apr 9, 2018

Alhadis commented Apr 9, 2018

worldbeater commented Apr 13, 2018 • edited Loading

Alhadis commented Apr 13, 2018 • edited Loading

vmg commented Apr 17, 2018

stale bot commented Nov 6, 2018

nixel2007 commented Nov 6, 2018

stale bot commented Dec 6, 2018

nixel2007 commented Dec 6, 2018

stale bot commented Feb 4, 2019

nixel2007 commented Feb 4, 2019

ava57r commented Feb 5, 2019

lildude commented Feb 6, 2019

stale bot commented Mar 8, 2019

stale bot commented Mar 22, 2019

nixel2007 commented Oct 23, 2016 •

edited

Loading

Alhadis commented Oct 24, 2016 •

edited

Loading

Alhadis commented Oct 24, 2016 •

edited

Loading

worldbeater commented Apr 13, 2018 •

edited

Loading

Alhadis commented Apr 13, 2018 •

edited

Loading