Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

does not work syntax highlighting when used with \b and the Russian alphabet #3291

Closed
ava57r opened this issue Oct 23, 2016 · 48 comments
Closed
Labels

Comments

@ava57r
Copy link

ava57r commented Oct 23, 2016

Good afternoon. Create syntax highlighting for 1C:Enterprise, which supports the Russian key words.
When we use \b(Если|If)\b in github https://github-lightshow.herokuapp.com keywords are not highlighted. Do like this (?<=[^\w-а-яё\.]|^)(Если|If)(?=[^\w-а-яё\.]|$) works. Files in UTF-8.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2016

Well, the effect of the \b sequence is partly dependent on the engine's Unicode settings... the Lightshow app is using a PCRE-driven implementation, as opposed to the Oniguruma-based one used by others.

Can you post an example of the code you're trying to highlight?

@nixel2007
Copy link
Contributor

Hello. Most of the code in 1C Enterprise is written in russian - https://github.com/silverbulleters/vanessa-agiler/blob/master/src/cf/CommonModules/%D0%9E%D0%B1%D1%89%D0%B5%D0%B3%D0%BE%D0%9D%D0%B0%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%D0%9A%D0%BB%D0%B8%D0%B5%D0%BD%D1%82%D0%A1%D0%B5%D1%80%D0%B2%D0%B5%D1%80/Ext/Module.bsl

Grammar was added in #2773

When we started that PR in december, we saw, that lightshow doen't work properly with \b and cyrillic, so we changed the regexps from \b to positive look behind/forward.

\b with cyrillic works in Atom, VSC and Sublime text.

Is it a bug with lightshow or Github itself can't handle it?

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2016

Ah! That proves it then. Yes, I'd say it's because of the two different engines being used. Oniguruma is Unicode-aware by default, which means it matches "word boundaries" on a more world-wise definition (instead of [A-Za-z_0-9]). PCRE, on the other hand, probably doesn't do this, although I'm not familiar with its inner workings.

I do know, however, that \b sequences are a pain in the neck... =) I like to think \b really stands for \BAD.

Assertions are definitely the best way to go, in all cases.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2016

I have to point out, too, that Lightshow can sometimes give very misleading results. See #3130.

The app is closed source, sadly, so I have no way of knowing what's really going on in there...

@nixel2007
Copy link
Contributor

The question is what engine github uses. :)

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2016

Oh! Sorry, I wasn't clear.

GitHub uses PCRE, same as Lightshow. =)

@nixel2007
Copy link
Contributor

Okay... Is it easy to change one engine to another? Can we just ask the Github guys to make this change?

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 23, 2016

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

More info on that here.

@nixel2007
Copy link
Contributor

nixel2007 commented Oct 23, 2016

@Alhadis got 404 on your link
UPDATE: need to change htm to html at the end

@arfon
Copy link
Contributor

arfon commented Oct 23, 2016

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

I do not sorry. @aroben might though 😁

@ava57r
Copy link
Author

ava57r commented Oct 24, 2016

regular expression with \b
(?i:\b(Неопределено|Undefined|Истина|True|Ложь|False|NULL)\b)
regular expression with positive look behind/forward
(?i:(?<=[^\wа-яё\.]|^)(Неопределено|Undefined|Истина|True|Ложь|False|NULL)(?=[^\wа-яё\.]|$))

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 24, 2016

Wouldn't it be easier to narrow it down to what characters are valid, instead of which ones aren't...?

For example:

(?xi)

# Following:
(?<= ^       # Start of line
   | \s      # Whitespace
   | \(      # Open bracket
   | [-+~/*] # Possible operators
   | ...     # Etc
   )

(Неопределено|Undefined|Истина|True|Ложь|False|NULL)

# Followed by:
(?= $        # End of line
  | \s       # Whitespace
  | ...      # Etc...

I'm not familiar with the language's syntax, so I can't be more specific than that, but I think you get the idea. =)

@nixel2007
Copy link
Contributor

@Alhadis (?<=[^\wа-яё\.]|^) literally is a \b-before-word realisation. \b is ^\w, so we've just complete it with а-яё and adiotional . to avoid highlighting of properties with the same name as keyword.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 24, 2016

Thing is, different engines have different notions of what constitutes a "word character". When writing portable expressions, I avoid the assumption that consuming engines will be Unicode-aware. Mostly because of scenarios like this.

In any case, there's no reason for the site's highlighting engine to not be Unicode-aware, given the multicultural nature of GitHub.

@aroben
Copy link
Contributor

aroben commented Oct 24, 2016

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

@vmg may be able to answer this

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 24, 2016

If @vmg pings @arfon for an answer, I'm outta here.

@vmg
Copy link
Contributor

vmg commented Oct 24, 2016

If @vmg pings @arfon for an answer, I'm outta here.

Hahaha. I can answer this: we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 24, 2016

Ah right, that makes sense then. I wasn't aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I'd say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise's one does). Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

@nixel2007
Copy link
Contributor

ersonally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

this is a good variant. @vmg @arfon is it possible to make that thing?

@ava57r
Copy link
Author

ava57r commented Oct 24, 2016

Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

It would be great!

@arfon
Copy link
Contributor

arfon commented Oct 24, 2016

this is a good variant. @vmg @arfon is it possible to make that thing?

That may be the right approach but I'd rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 24, 2016

Don't screw it up. The Russians are watching.

@nixel2007
Copy link
Contributor

@arfon what are our next steps?
What kind of plan do you need?

In my world the plan is:

  • add a new entry in languages.yml
  • add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
  • add flag-reader at the github-highlighter side.

anything else?

@arfon
Copy link
Contributor

arfon commented Oct 25, 2016

@arfon what are our next steps?

Right now we wait. @vmg above pointed out that we need to do some work on the GitHub side (i.e. not here) to add support for this.

  • add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
  • add flag-reader at the github-highlighter side.

Possibly. Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them 😄

@Alhadis
Copy link
Collaborator

Alhadis commented Oct 25, 2016

Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them

How about these in the mean time? :D

Just so we're guarded against the ol' syntax-highlighting related questions...

@ava57r
Copy link
Author

ava57r commented Dec 3, 2016

@vmg Can you provide a link to the issue on the Github side?

@ava57r
Copy link
Author

ava57r commented May 26, 2017

@Alhadis Hello.
Any news? There is a chance to solve this problem.

@Alhadis
Copy link
Collaborator

Alhadis commented May 26, 2017

What do you mean?

@ava57r
Copy link
Author

ava57r commented May 26, 2017

@arfon write #3291 (comment)
There is a solution? More than half a year has passed.

@ghost
Copy link

ghost commented Apr 6, 2018

(apologies in advance, i don't intent to hijack this conversation)

In C# too, we are struggling to have C# 7.2 syntax support due to the fact that upstream has moved to Oniguruma grammer. @damieng has chalked out some ideas at atom/language-csharp#112 (comment) by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project?

@lildude
Copy link
Member

lildude commented Apr 7, 2018

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

@worldbeater
Copy link
Contributor

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting and it would be nice to figure out how this can be fixed in the nearest future. Other engines like BitBucket or GitLab provide a much better highlighting, sad.

Thanks in advance!

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 9, 2018

providing support for Oniguruma and other complicated regex syntaxes?

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you'd be asking us to provide support for PCRE's "complicated syntax" too, because the Oniguruma engine isn't good enough.

@worldbeater
Copy link
Contributor

worldbeater commented Apr 13, 2018

PCRE supports every feature that Oniguruma does, it just uses different syntax

Well, now I understand, thanks! Could you please provide more info on which PCRE version GitHub is using now? Seems work on porting Oniguruma-based syntax to PCRE-based is stuck due to some incompatibilities, I don't know: atom/language-csharp#112 (comment)

That is all because Microsoft uses Oniguruma expressions in grammar for Visual Studio Code and Atom editors right now, but it can't be simply copied and pasted to play well with GitHub.

Thanks in advance, @Alhadis!

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 13, 2018

@worldbeater I doubt the PCRE version has anything to do with this. It's running in ASCII-only mode, that's all I know (I'm not staff). But of check the manpage for pcresyntax(3), you'll find documentation for variances between regular expression engines (starting at the section "Backreferences").

These are the likely discrepancies that that're affecting the C# grammar (indeed, most TextMate grammars which use Oniguruma extensions):

Syntax Description Engines
(?R) recurse whole pattern All
(?n) call subpattern by absolute number All
(?+n) call subpattern by relative number All
(?-n) call subpattern by relative number All
(?&name) call subpattern by name Perl, PCRE
(?P>name) call subpattern by name Python
\g<name> call subpattern by name Oniguruma
\g'name' call subpattern by name Oniguruma
\g<n> call subpattern by absolute number Oniguruma
\g'n' call subpattern by absolute number Oniguruma
\g<+n> call subpattern by relative number PCRE
\g'+n' call subpattern by relative number PCRE
\g<-n> call subpattern by relative number PCRE
\g'-n' call subpattern by relative number PCRE

@vmg
Copy link
Contributor

vmg commented Apr 17, 2018

Thank you for that accurate list, @Alhadis. The most common occurrence of incompatible syntax is the \g<name> that some VSCode grammars use. Obviously it's trivially fixable by simply referring to the subpattern by absolute number instead of name -- but the replacement can be tricky.

@stale
Copy link

stale bot commented Nov 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Nov 6, 2018
@nixel2007
Copy link
Contributor

i think this issue should be opened. the problem still exists.

@stale stale bot removed the Stale label Nov 6, 2018
@stale
Copy link

stale bot commented Dec 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Dec 6, 2018
@nixel2007
Copy link
Contributor

Any news form GitHub.com?
If this issue has no place in gh backlog it could be closed (sadly)

@stale stale bot removed the Stale label Dec 6, 2018
@stale
Copy link

stale bot commented Feb 4, 2019

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Feb 4, 2019
@nixel2007
Copy link
Contributor

One more ping, guys. (sorry)

@stale stale bot removed the Stale label Feb 4, 2019
@ava57r
Copy link
Author

ava57r commented Feb 5, 2019

Need help this issue

@lildude
Copy link
Member

lildude commented Feb 6, 2019

Nothing has changed on the GitHub side of things... #3291 (comment) still applies:

... we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

This isn't likely to change in the immediate future. If you really want this issue resolved sooner rather than later, the best option is to change the grammar as discussed in various points earlier in this issue.

@stale
Copy link

stale bot commented Mar 8, 2019

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

@stale stale bot added the Stale label Mar 8, 2019
@stale
Copy link

stale bot commented Mar 22, 2019

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

8 participants