Skip to content
This repository was archived by the owner on Aug 9, 2024. It is now read-only.

Patch numeric underscores #112

Merged

Conversation

takenobu-hs
Copy link
Contributor

Dear authors,

Sorry if it is not the correct procedure.
Could you please fix this for syntax highlighting of numeric literals?

This patch contains three fixes:

  • Add constant.numeric.hexfloat.haskell for HexFloatLiterals extension 1
  • Add constant.numeric.binary.haskell for BinaryLiterals extension 2
  • Modify lit_num for NumericUnderscores extension 3 4

I visually checked this patch with the these files 5 6.

Thanks for your wonderful work

* Add constant.numeric.hexfloat.haskell for HexFloatLiterals extension [1]
* Add constant.numeric.binary.haskell for BinaryLiterals extension [2]
* Modify lit_num for NumericUnderscores extension [3][4]

[1]: https://ghc.haskell.org/trac/ghc/ticket/9224
[2]: https://ghc.haskell.org/trac/ghc/ticket/13126
[3]: https://ghc.haskell.org/trac/ghc/ticket/14473
[4]: https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0009-numeric-underscores.rst
@lierdakil
Copy link
Contributor

Hello. Thank you for your contribution. I can't not notice though that except BinaryLiterals other proposals aren't implemented in any of the currently-released GHC versions (please correct me if I'm wrong). I'm a bit reluctant to include experimental GHC features here.

There are also some questions about regular expressions used, which I will post shortly.

name: 'constant.numeric.hexadecimal.haskell'
match: '0[xX][0-9a-fA-F]+'
match: '0[xX]_*[0-9a-fA-F](_*[0-9a-fA-F])*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure why you're using this rather convoluted expression here (and other places). For all intents and purposes, 0[xX]_*[0-9a-fA-F](_*[0-9a-fA-F])* is the same as 0[xX](?:_*[0-9a-fA-F])+. And if we're not aiming for technical correctness (which we aren't), latter expression is roughly equivalent to 0[xX][_0-9a-fA-F]+, which would also accept cases like 0x_ (which I don't think is valid Haskell), but is measurably faster.

Also please don't introduce new capturing groups if you can avoid it, it can potentially break some stuff, and test coverage isn't 100%.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for telling me your policy.

I'm not entirely sure why you're using this rather convoluted expression here (and other places)

I simply implemented it according to the accepted formal specification.

... but is measurably faster.

You mean that this project has adopted a policy that gives priority to speed over accuracy.
I understand.
It is one of the valuable decisions for many platforms.

Also please don't introduce new capturing groups if you can avoid it, it can potentially break some stuff, and test coverage isn't 100%.

OK. Is it also better to merge the binary with octal?

In consideration of speed, I propose the simplified two plans below.
Are you acceptable for this?

Plan A: BinaryLiterals + NumericUnderscores extension (without HexFloatLiterals)

   lit_num: [
     name: 'constant.numeric.hexadecimal.haskell'
-    match: '0[xX][0-9a-fA-F]+'
+    match: '0[xX][_0-9a-fA-F]+'
   ,
     name: 'constant.numeric.octal.haskell'
-    match: '0[oO][0-7]+'
+    match: '0[oObB][_0-7]+'
   ,
     name: 'constant.numeric.float.haskell'
-    match: '[0-9]+(\\.[0-9]+[eE][+-]?|\\.|[eE][+-]?)[0-9]+'
+    match: '[_0-9]+(\\.[_0-9]+[eE][+-]?|\\.|[eE][+-]?)[_0-9]+'
   ,
     name: 'constant.numeric.decimal.haskell'
-    match: '[0-9]+'
+    match: '[0-9][_0-9]*'
   ]

Plan B: Plan A + HexFloatLiterals

This is difference for Plan A.

     name: 'constant.numeric.float.haskell'
-    match: '[_0-9]+(\\.[_0-9]+[eE][+-]?|\\.|[eE][+-]?)[_0-9]+'
+    match: '(?:0[xX])?[_0-9a-fA-F]+(?:\\.[_0-9a-fA-F]+[eEpP][+-]?|\\.|[eEpP][+-]?)[_0-9a-fA-F]+'

Thank you for your time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it also better to merge the binary with octal?

No, those are different beasts, and parser has a rather obvious choice based on 2-character lookahead.

Same goes for hexadecimal floats, those are obviously (from point of view of parser) syntactically different from decimal floats. There's no need to create horrifyingly convoluted regular expressions (also apparently incorrect, (?:0[xX])?[_0-9a-fA-F]+(?:\\.[_0-9a-fA-F]+[eEpP][+-]?|\\.|[eEpP][+-]?)[_0-9a-fA-F]+ matches 0xfade1, which I don't think it should). Just abstract away a common pattern for floats. F.ex.:

floatPattern = (digit, exp) -> "#{digit}+(\\.#{digit}+#{exp}[+-]?|\\.|#{exp}[+-]?)#{digit}+"

and then use it for both regular floats and hexfloats:

float: "[0-9]#{floatPattern('[0-9_]', '[eE]')}" # see below why [0-9] at the start
hexfloat: "0[xX]#{floatPattern('[0-9a-fA-F_]','[pP]')}"

Everything else seems reasonable. Except maybe in 'Plan A' I would use the same approach in float as in decimal (only for the first digit mind)

[0-9][_0-9]*(\\.[_0-9]+[eE][+-]?|\\.|[eE][+-]?)[_0-9]+

so that it could bail early on holes (like f _x _y z = z)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, those are different beasts, and parser has a rather obvious choice based on 2-character lookahead.

Thank you. I understand.
I was misunderstanding the matter of the capturing groups.

Just abstract away a common pattern for floats. F.ex.:

Beautiful!
I realized what you pointed out.

Everything else seems reasonable. Except maybe in 'Plan A' I would use the same approach in float as in decimal

OK, the policy of pattern matching is below.

  • The leading of the pattern should be accurate for identification with other tokens.
  • Subsequent patterns prioritize speed.

I will push the revised version after this.

@@ -496,17 +496,23 @@ module.exports=
name: 'punctuation.separator.comma.haskell'
match: /,/
lit_num: [
name: 'constant.numeric.hexfloat.haskell'
match: '0[xX]_*[0-9a-fA-F](_*[0-9a-fA-F])*((\\.[0-9a-fA-F](_*[0-9a-fA-F])*(_*[pP][+-]?[0-9](_*[0-9])*)*)|(_*[pP][+-]?[0-9](_*[0-9])*))'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather hard to read frankly, and I'm not a fan of duplication. Abstracting general "floating point" pattern into a function accepting a character class could be a good idea (as far as I can tell, character class is the only difference)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will comment together in the next place.

@takenobu-hs
Copy link
Contributor Author

@lierdakil thank you for your kind explanation.

Hello. Thank you for your contribution. I can't not notice though that except BinaryLiterals other proposals aren't implemented in any of the currently-released GHC versions (please correct me if I'm wrong). I'm a bit reluctant to include experimental GHC features here.

Sorry that there is little information.
HexFloatLiterals and NumericUnderscores were officially accepted the ghc-proposal.

HexFloatLiterals was implemented in master and ghc8.4.1 branch.
It will be shipped with ghc 8.4.1.

The implementation of NumericUnderscores is under review.

I believe that this feature will be widely used in the future.

@takenobu-hs
Copy link
Contributor Author

Thank you again and again.
I pushed the revised version. here
Could you please review again?

I modifed floatPattern for hexfloat pattern from your suggestion. here

Because the exponent part of hexfloat should be also decimal numbers.
If it is complicated, please tell me. I will further simplify it.

Thank you for taking so much time.

@@ -22,4 +22,8 @@ balanced = (name, left, right, inner, ignore = '') ->
else
"(?<#{name}>(?:[^#{left}#{right}#{ignore}]|#{left}\\g<#{name}>#{right})*)"

module.exports = {list, listMaybe, concat, balanced}
floatPattern = (digit, exp) ->
"#{digit}*(?:(?:\\.#{digit}+)?#{exp}[+-]?[0-9_]+|\\.#{digit}+)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, now this expression here doesn't match floats without explicit exponents, e.g. 3.1415.

In general, Haskell floating points have syntactical from similar to one of:

  • 3.1415
  • 314e-2
  • 31.415e-1

I would imagine it's similar for hexfloats?

Anyway, we'd like to catch all of those, that's why original float regular expression is kinda convoluted -- you can have either fractional part, exponent part, or both of them, but you can't have a float without at least one. In quasi-EBNF terms, this would be somewhat simple to express:

float → digit+ ( fraction | exponent | fraction exponent )
fraction → "." digit+
exponent → [eE] [+-]? digit+

Equivalent regular expression was trivial to factorize, because all alternatives end on digit+, but if digit in exponent can differ from digit elsewhere, this won't work, so you'll need to write out alternatives.

However, I should point out, fraction and fraction exponent have the same prefix, and hence would require backtracking. We can avoid that by rewriting expression a little:

float → digit+ ( fraction exponent? | exponent )

this way, regexp engine can bail from any branch early

Copy link
Contributor

@lierdakil lierdakil Dec 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, disregard the first line, I made a mistake when testing. Still, not starting branches on optional parts will on average be about twice as effective, i.e.

float → digit+ ( fraction exponent? | exponent )

is more effective than

float → digit+ ( fraction? exponent | fraction )

which you apparently have here (notice common prefix, remember that regexp parses left-to-right, and realize that fraction part will be effectively matched twice if there is no exponent)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which you apparently have here (notice common prefix, remember that regexp parses left-to-right, and realize that fraction part will be effectively matched twice if there is no exponent)

Thank you for teaching me about effective regexp.
You gave me a wonderful insight about behavior of parsing regex.

I will revise the pattern.

@takenobu-hs
Copy link
Contributor Author

I revised the pattern according to your suggestion. here
Please review again when you have time.
Thank you very much.

Copy link
Contributor

@lierdakil lierdakil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for bearing with my nitpicks. It looks like everything's in order, so I'll merge this.

@lierdakil lierdakil merged commit f90eb7d into atom-haskell:master Dec 6, 2017
@lierdakil
Copy link
Contributor

Released as v1.15.0.

@takenobu-hs
Copy link
Contributor Author

Okay, thanks for bearing with my nitpicks. It looks like everything's in order, so I'll merge this.

I'm grateful for your kind response.
Your insight make the user experience for Haskell community better.
Thank you very much for taking the time :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants