Add parsing of hex literals in strings #612

markuspf · 2016-02-13T13:20:01Z

With this patch it is possible to give character values by using
the common '\xHH' escape where H is a hexadecimal digit.

If this is accepted, I wonder whether there is also interest in Python's \u? It might not be as easy as this one, as \u parses to more than one byte.

I think this might break existing code though, because

gap> "\xff";
"xff"

markuspf · 2016-02-13T14:25:55Z

Apparently gcc 4.x doesn't like inline function definitions that are in headers and not guarded. Well, I'll move the function to system.c then

rbehrends · 2016-02-13T15:01:54Z

Not sure what you mean by "not guarded", but may the issue simply be that the function needs to be "static inline" instead of just "inline"?
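
A minimal sketch of the `static inline` suggestion, assuming the helper in question is a small hex-digit conversion function. The name `hex_digit_value` is hypothetical and the body is illustrative, not the PR's code. A likely explanation for the GCC 4.x complaint is that those compilers default to GNU89 inline semantics, where a plain `inline` definition in a header produces an external definition in every file that includes it; `static inline` avoids that by giving each file a private copy.

```c
#include <assert.h>
#include <ctype.h>

/* Would live in a header. `static inline` gives every translation unit
   that includes the header its own private copy, so the linker never
   sees two external definitions of the same function.
   Assumes an ASCII-compatible character set ('0' < 'A' < 'a'). */
static inline int hex_digit_value(char ch)
{
    unsigned char uc = (unsigned char)ch;
    assert(isxdigit(uc));   /* precondition: '0'..'9', 'A'..'F', 'a'..'f' */
    if (uc >= 'a') {
        return uc - 'a' + 10;
    } else if (uc >= 'A') {
        return uc - 'A' + 10;
    } else {
        return uc - '0';
    }
}
```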

markuspf · 2016-02-13T15:28:37Z

m( of course. Thanks for pointing that out.

stevelinton · 2016-02-13T21:27:13Z

Happy with this. It would be nice to see a comprehensive approach to dealing with Unicode strings (for instance one which allowed them to be viewed as a list of unicode characters) and then it might make sense to add some syntactic support for them, but I think it would be good to have the grand design first.

markuspf · 2016-02-13T22:32:01Z

This PR is also missing documentation changes and tests. I will add those soon.

ChrisJefferson · 2016-02-15T07:57:38Z

While it doesn't directly fit in this patch, is there any reason not to make '\a', for all characters 'a' which do not have a special meaning, an error, rather than just making the '\' get ignored?

markuspf · 2016-02-15T12:21:05Z

I wondered that too. It should be easy to add either as a separate PR if people think it's sensible.

frankluebeck · 2016-02-15T15:03:03Z

I don't remember why hex characters in string literals were not introduced long ago. At least now I cannot see a reason against it. I would consider the slight incompatibility of "\x" acceptable.

That the \ is ignored if it is not followed by a special character was already the case in very early versions of GAP. Nevertheless, I guess that it would not disturb many people if this were changed. This would have the advantage that afterwards one could introduce further special characters without breaking backward compatibility again.

GAP strings are really sequences of bytes. Introducing \uXXXX for unicode characters does not really make sense (because a GAP string has no encoding context).
Note that there is functionality for unicode strings:

?Unicode Strings

The Unicode function was originally called U but I changed it after people complained about a new one letter function.
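
To illustrate why a \uXXXX escape is a different kind of feature from a single-byte hex escape: a code point only becomes bytes once an encoding is chosen. The sketch below assumes UTF-8, which is purely an assumption for illustration since a GAP string carries no encoding; the function name `utf8_encode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is not a
   valid scalar value (a surrogate, or above U+10FFFF). */
static size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this assumption U+00E9 becomes the two bytes C3 A9, so one escape would expand to a variable number of string entries, whereas a hex escape always yields exactly one byte.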

stevelinton · 2016-02-15T16:15:27Z

@frankluebeck great. We should add some cross-references from the strings section of the reference manual to the GAPDoc manual. Is there a case for splitting GAPDoc into several packages? XML, Unicode, help viewers, help compiler?

Do we need kernel support for (for instance) reading large UTF-8 files into a Unicode string?

fingolfin · 2016-02-15T17:09:44Z

@stevelinton These are all interesting points, but perhaps for the mailing list, not on this PR, which will (hopefully) soon be merged, and then the discussion is "lost".

fingolfin · 2016-02-15T17:18:11Z

doc/ref/string.xml

-such digits. The meaning is given in the following list
+such digits.
+If it is the character <C>x</C>, then there must be two hexadecimal
+digits. The meaning is given in the following list


This sounds weird, in particular the bit saying "then there must be two hexadecimal such digits", which seems to be missing something. Finally, the sentence "They consist of two characters." was already before this patch quite misleading. How about this:

There are a number of <E>special character sequences</E> that can be used between the singlequotes of a character literal or between the doublequotes of a string literal to specify characters. They consist of a backslash <C>\</C>, followed by a second character indicating the type of special character sequence, and, depending on this type, possibly some more characters. The following special character sequences are currently defined. Any other sequence starting with a backslash results in an error.

fingolfin · 2016-02-15T17:22:25Z

I think this is a good idea, thanks. Just have some minor quibbles with the documentation.

markuspf · 2016-02-15T20:35:03Z

Well this is fun:

On Sun, Feb 14, 2016 at 11:57:38PM -0800, Christopher Jefferson wrote:

While it doesn't directly fit in this patch, is there any reason not to make '\a'
for all characters 'a' which do not have a special meaning an error, rather
than just making the '\' get ignored.

I tried patching this in and it turns out that in the library and packages there
are a few places that have "\!", "\*", "\-", "\#" and others. While I suspect
that these are actually bugs, the fallout from such a patch might be a tad
larger than I thought.

I'll make a separate PR for this...

markuspf · 2016-02-17T11:38:03Z

I should probably pull the refactoring from #619 into this PR before we merge it.

One small issue I see is that the error message is currently less specific. That could be addressed though if people are worried about it.

markuspf · 2016-02-18T10:47:59Z

I updated this PR in the following way:

Change the escape sequence for hexadecimal values in strings and characters to \0xHH
Include the refactoring done in Stricter string escape #619
Display a syntax warning if a \ is followed by anything that is not a valid escape sequence.
Fix all issues that were showing because of the added warning in the library

markuspf · 2016-02-25T14:53:41Z

I have fixed qaos and sent an email to Thomas Breuer about ctbllib.

frankluebeck · 2016-02-26T18:21:29Z

Looks mostly fine to me. Using \0x?? for the hex characters is similar to the syntax in other languages. And it has the advantage that there is no backward compatibility problem. Of course, the refactoring is also sensible.

But as discussed in the thread on "Stricter string escape" I'm not happy with the warning if a backslash is used with non-special characters (and the corresponding change of the documentation). This change seems purely cosmetic, and it breaks without real need a behaviour that has been documented for 30 years (and makes changes in several places necessary).

markuspf · 2016-02-29T11:16:02Z

Is there any preference regarding whether we just revert to the old behaviour with respect to \ or display a warning if \ is followed by a letter that is not in a valid sequence?

I'd quite like to move this PR forward.

stevelinton · 2016-02-29T11:27:19Z

Mild preference for the warning.

Steve

markuspf · 2016-03-04T16:04:45Z

Now updated to only warn if a backslash is followed by a letter.

Tests fail because of minor errors in qaos (for which there is a new version available that does not have this error) and ctbllib (which I reported to Thomas Breuer).

fingolfin · 2016-11-05T16:12:11Z

What is the status of this? It seems like it was basically done (though travis failed), and nobody really had complaints left?

@markuspf Perhaps you can rebase it, so that codecov gets run; I'd be happy to review this once more afterwards, too.

olexandr-konovalov · 2016-11-05T16:28:17Z

IIRC, the new version of qaos contained a fix for the problem addressed by this PR, but ctbllib will fail the bugfix.tst test:

testing: /home/travis/build/gap-system/gap/tst/teststandard/bugfix.tst
rev: ########> Diff in /home/travis/build/gap-system/gap/tst/teststandard/bugfix.ts\
t:2884
# Input is:
if LoadPackage("ctbllib", false) <> fail then
     if Irr( CharacterTable( "WeylD", 4 ) )[1] <>
          [ 3, -1, 3, -1, 1, -1, 3, -1, -1, 0, 0, -1, 1 ] then
       Print( "problem with Irr( CharacterTable( \"WeylD\", 4 ) )[1]\n" );
     fi;
   fi;
# Expected output:
# But found:
Syntax warning: Alphabet letter after \ in /home/travis/build/gap-system/gap/p\
\
kg/ctbllib/data/ctgeneri.tbl:915
"in this paper (the characters $\Theta_l$ and $\Lambda_u$ on the classes\n",
                                ^
Syntax warning: Alphabet letter after \ in /home/travis/build/gap-system/gap/p\
\
kg/ctbllib/data/ctgeneri.tbl:915
"in this paper (the characters $\Theta_l$ and $\Lambda_u$ on the classes\n",
                                               ^
########

fingolfin · 2016-11-06T12:11:29Z

So, how about completely disabling the warning for now (leave it in, but commented out)? We can re-evaluate enabling it in the future, when/if ctbllib gets changed. In the meantime, I don't think we should hold up a useful feature (support for hex literals) because of this...

markuspf · 2016-11-06T13:20:22Z

@fingolfin there you go. I disabled the syntax warning for now, though I have to admit I am uncomfortable with bugwards-compatibility hacks.

codecov-io · 2016-11-06T13:46:30Z

Current coverage is 48.70% (diff: 74.07%)

Merging #612 into master will increase coverage by <.01%

@@             master       #612   diff @@
==========================================
  Files           424        424          
  Lines        222109     222119    +10   
  Methods        3426       3429     +3   
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits         108166     108193    +27   
+ Misses       113943     113926    -17   
  Partials          0          0

Powered by Codecov. Last update 54ae310...0d7a403

frankluebeck · 2016-11-07T14:55:39Z

I have not changed my opinion compared to my comment on Feb 26. The example from ctbllib mentioned above even demonstrates that GAP's (documented!) behaviour can be useful.

fingolfin

Overall looks good to me. So now we need to decide whether we want to break backwards compatibility and documented behavior for the sake of hypothetical future enhancements or not.

Right now, there seems no urgent reason to do so; so perhaps postpone it, but still ask package authors to adjust their packages, and revisit this in the future?

fingolfin · 2016-11-07T15:35:49Z

doc/ref/string.xml

    This is translated to the character corresponding to the number
    <C>X * 64 + Y * 8 + Z modulo 256</C>.
    This can be used to specify and store arbitrary binary data as a string
    in &GAP;.
 </Item>
 <Mark>
+<Index><C>\0xYZ</C></Index>
+<Index>hexadecimal character codes</Index>
+<C>\xYZ</C></Mark>


This still says \xYZ, should be \0xYZ.

fingolfin · 2016-11-07T15:35:51Z

doc/ref/string.xml

 <Index>escaping non-special characters</Index>
 other</Mark>
 <Item>
-    For any other character the backslash is simply ignored.
+  For any other character the backslash is ignored. If the character
+  is a letter, that is one of <C>a..zA..Z</C>, then a warning is displayed.


Same as above: If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.

fingolfin · 2016-11-07T15:36:08Z

doc/ref/string.xml

+They consist of a backslash <C>\</C> followed by a second character
+indicating the type of special character sequence, and possibly more characters.
+The following special character sequences are currently defined. Any other sequence
+starting with a backslash results in an error.


If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.

fingolfin · 2016-11-07T15:37:03Z

lib/helpbase.gi

@@ -416,7 +416,7 @@ InstallGlobalFunction(SIMPLE_STRING, function(str)
 "efghijklmnopqrstuvwxyz[\000]^_\000abcdefghijklmnopqrstuvwxyz{ }~",
 "\177\200\201\202",
 "\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225",
-"\226\227\230\231\232\233\234\235\236\237\238",
+"\226\227\230\231\232\233\234\235\236\237\240",


What is the purpose of this change?

\238 is not a valid octal escape, whereas \240 is (for the same number, if one were to allow overflows in the octal digits).

ahhhh of course! facepalm
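
The arithmetic behind that exchange can be checked with the documented rule `X * 64 + Y * 8 + Z modulo 256`. This is a worked sketch only: the function name `octal_escape_value` is hypothetical, and accepting the digit 8 here merely mimics the "overflow" reading mentioned above, not necessarily what the scanner does.

```c
#include <assert.h>

/* Value of a three-character escape \XYZ under the documented rule
   X*64 + Y*8 + Z modulo 256. Digits are taken at face value, so a
   non-octal digit such as '8' overflows into the next place. */
static unsigned char octal_escape_value(char x, char y, char z)
{
    assert(x >= '0' && x <= '9');
    assert(y >= '0' && y <= '9');
    assert(z >= '0' && z <= '9');
    return (unsigned char)(((x - '0') * 64 + (y - '0') * 8 + (z - '0')) % 256);
}
```

With this rule `\238` gives 2*64 + 3*8 + 8 = 160 and `\240` gives 2*64 + 4*8 + 0 = 160, so the replacement keeps the same byte (the one after `\237`, which is 159) while using only valid octal digits.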

fingolfin · 2016-11-07T15:40:44Z

src/scanner.c

+    GET_CHAR();
+    c += GetOctalDigits();
+  } else {
+      /* This warning is currently disabled for backwards compatibility */


Perhaps elaborate a bit, and/or reference this PR. E.g.:
"This warning is currently disabled for backwards compatibility: It turns out there are some packages which still rely on this."

fingolfin · 2016-11-07T15:44:19Z

src/system.h

+**  '0..9', 'A..F', or 'a..f' and 0 otherwise.
+*/
+#define IsHexDigit(ch)     (isxdigit((unsigned int)ch))
+


OK. Technically, there might be systems out there which don't have isxdigit, but we'll just deal with them if we ever encounter them. So this is fine.

isxdigit is in ISO C90, I thought we're doing C99 even? Am I mixing something up?

fingolfin · 2016-11-07T15:47:18Z

src/system.h

+    } else {
+        return (ch - '0');
+    }
+};


This is only used in one place... so unless we foresee use for it, why not just move it to scanner.c? If something else ever needs it, we can still move it to a header (and then contemplate whether it needs extra safeguards or not etc.)

fingolfin · 2016-11-07T15:49:20Z

tst/testinstall/strings.tst

+gap> x:='\0xFF';
+'\377'
+gap> x:='\0xab';
+'\253'



These test cases only cover positives, but not malformed inputs, which we should also test. E.g. \0yAB, \0X12, \090, \0a0, \0x0, \009, \00a, \00x, \0x1g, ..
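
As a rough illustration of what such negative cases exercise, here is a strict stand-alone parser for the text following the backslash. It is a sketch under stated assumptions, not the scanner's logic: the name `parse_hex_escape` is hypothetical, and rejecting an uppercase `X` or falling through on inputs like `090` is a choice made here, whereas the real scanner may treat some of these as octal escapes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Parse the text after a backslash as a hex escape of the form 0xHH.
   Returns the byte value 0..255, or -1 if s does not start with '0',
   a lowercase 'x', and two hexadecimal digits. */
static int parse_hex_escape(const char *s)
{
    int i, value = 0;
    assert(s != NULL);
    if (s[0] != '0' || s[1] != 'x')
        return -1;
    for (i = 2; i < 4; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isxdigit(c))
            return -1;          /* also catches a premature end of string */
        value = value * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    return value;
}
```

The two positive cases from the test file come out as 255 (octal 377) and 171 (octal 253), while inputs such as `0yAB`, `0X12`, `0x0` and `0x1g` are rejected by this sketch.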

markuspf · 2016-11-07T16:36:57Z

I adapted the PR according to @fingolfin's comments. I should probably do some history rearrangement before we merge this (If everything is ok).

fingolfin · 2016-11-09T13:39:46Z

Looks good to me now, thanks. If you want to clean up the history, go ahead. Other than that, I think it can be merged now.

This commit allows hexadecimal escapes of the form `\0xHH` in string and character literals, where `H` is a hexadecimal digit. The common code paths for parsing the escape sequences in GetStr and GetChar are factored out into a common function GetEscapedChar. If an invalid escape sequence is read, a SyntaxWarning is displayed, but the old and documented behaviour to just ignore the backslash is preserved.

Code to display a warning is left commented out.

markuspf force-pushed the string-hex-literal branch from 64d1e9d to 8e67616 Compare February 13, 2016 15:28

markuspf force-pushed the string-hex-literal branch from 8e67616 to 7c7b050 Compare February 13, 2016 16:39

markuspf force-pushed the string-hex-literal branch from 096cad7 to a6a420f Compare February 14, 2016 15:01

fingolfin added the kind: enhancement and kind: new feature labels Feb 15, 2016

fingolfin reviewed Feb 15, 2016

markuspf force-pushed the string-hex-literal branch from e7624c1 to 8c6286d Compare February 15, 2016 20:13

markuspf mentioned this pull request Feb 16, 2016

Stricter string escape #619

Closed

markuspf force-pushed the string-hex-literal branch from 8c6286d to 75fb611 Compare February 18, 2016 10:45

markuspf added this to the GAP 4.9.0 milestone Mar 7, 2016

markuspf force-pushed the string-hex-literal branch 2 times, most recently from 64138bc to 72ec6bb Compare May 13, 2016 10:27

markuspf force-pushed the string-hex-literal branch from 72ec6bb to 5a0301b Compare November 6, 2016 13:18

fingolfin reviewed Nov 7, 2016

markuspf added 12 commits November 9, 2016 21:35

Document hexadecimal character entry

f67ac31

Test hexadecimal string entries.

8ff2dda

Replace invalid octal escape \238 by \240

b21d44a

Remove spurious \

6a504a2

Remove \! from helpt2t.gi

4f5c877

Replace "\in" by "\\in"

5a39df4

Replace \* by *

ea04251

Correct escape for \.

0ef496c

Backslashes are ignored if they are in a non-special escape sequence.

54d373d

Code to display a warning is left commented out.

Move CharHexDigit to scanner.c, adapt comment

eaf8f79

Add some negative testcases for hexadecimal and octal parsing

0d7a403

markuspf force-pushed the string-hex-literal branch from 22adfad to 0d7a403 Compare November 9, 2016 21:35

markuspf merged commit 5f18c3d into gap-system:master Nov 9, 2016

olexandr-konovalov mentioned this pull request Nov 12, 2016

strings.tst not robust enough #947

Closed

markuspf deleted the string-hex-literal branch February 5, 2017 12:31

fingolfin mentioned this pull request Sep 7, 2017

Release notes for GAP 4.9 #1699

Closed

olexandr-konovalov added the release notes: added label Jan 20, 2018
