Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parsing of hex literals in strings #612

Merged
merged 12 commits into from
Nov 9, 2016

Conversation

markuspf
Copy link
Member

With this patch it is possible to give character values by using
the common '\xHH' escape where H is a hexadecimal digit.

If this is accepted, I wonder whether there is also an interest for python's \u? It might not be as easy as this one as \u parses to more than one byte.

I think this might break existing code though, because

gap> "\xff";
"xff"

@markuspf
Copy link
Member Author

Apparently gcc 4.x doesn't like inline function definitions that are in headers and not guarded. Well, I'll move the function to system.c then

@rbehrends
Copy link
Contributor

Not sure what you mean by "not guarded", but may the issue simply be that the function needs to be "static inline" instead of just "inline"?

@markuspf
Copy link
Member Author

m( of course. Thanks for pointing that out.

@stevelinton
Copy link
Contributor

Happy with this. It would be nice to see a comprehensive approach to dealing with Unicode strings (for instance one which allowed them to be viewed as a list of unicode characters) and then it might make sense to add some syntactic support for them, but I think it would be good to have the grand design first.

@markuspf
Copy link
Member Author

markuspf commented Feb 13, 2016 via email

@ChrisJefferson
Copy link
Contributor

While it doesn't directly fit in this patch, is there any reason not to make '\a' for all characters 'a' which do not have a special meaning an error, rather than just making the '' get ignored.

@markuspf
Copy link
Member Author

I wondered that too. It should be easy to add either as a separate PR if people think it's sensible.

@frankluebeck
Copy link
Member

I don't remember why hex characters in string literals were not introduced long ago. At least now I cannot see a reason against it. I would consider the slight incompatibility of "\x" acceptable.

That the \ is ignored if it is not followed by a special character was already in very early versions of GAP. Nevertheless, I guess that it will not disturb many people if this was changed. This had the advantage that afterwards one could introduce further special characters without breaking backward compatibility again.

GAP strings are really sequences of bytes. Introducing \uXXXX for unicode characters does not really make sense (because a GAP string has no encoding context).
Note that there is functionality for unicode strings:

?Unicode Strings

The Unicode function was originally called U but I changed it after people complained about a new one letter function.

@stevelinton
Copy link
Contributor

@frankluebeck great. We should add some cross-references from the strings section of the reference manual to the GAPDoc manual. Is there are a case for splitting GAPDoc into several packages? XML, Unicode, help viewers, help compiler?

Do we need kernel support for (for instance) reading large UTF-8 files into a Unicode string?

@fingolfin fingolfin added kind: enhancement Label for issues suggesting enhancements; and for pull requests implementing enhancements kind: new feature labels Feb 15, 2016
@fingolfin
Copy link
Member

@stevelinton These are all interesting points, but perhaps for the mailing list, not on this PR, which will (hopefully) soon be merged, and then the discussion is "lost".

such digits. The meaning is given in the following list
such digits.
If it is the character <C>x</C>, then there must be two hexadecimal
digits. The meaning is given in the following list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds weird, in particular the bit saying "then there must be two hexadecimal such digits", which seems to be missing something. Finally, the sentence "They consist of two characters." was already before this patch quite misleading. How about this:

There are a number of <E>special character sequences</E> that can be used
 between the singlequotes of a character literal or between the
 doublequotes of a string literal to specify characters.
They consists of a backslash <C>\</C>, followed by a second character indicating the type of
special character sequence, and, depending on this type, possibly some more characters.

The following special character sequencs are currently defined. Any other sequence starting with
a backslash results in an error.

@fingolfin
Copy link
Member

I think this is a good idea, thanks. Just have some minor quibbles with the documentation.

@markuspf
Copy link
Member Author

Well this is fun:

On Sun, Feb 14, 2016 at 11:57:38PM -0800, Christopher Jefferson wrote:

While it doesn't directly fit in this patch, is there any reason not to make '
a' for all characters 'a' which do not have a special meaning an error, rather
than just making the '' get ignored.

I tried patching this in and it turns out that in the library and packages there
are a few places that have "!", "*", "-", "#" and others. While I suspect
that these are acutally bugs, the fallout from such a patch might be a tad
larger than I thought.

I'll make a separate PR for this...

@markuspf markuspf mentioned this pull request Feb 16, 2016
@markuspf
Copy link
Member Author

I should probably pull the refactoring from #619 into this PR before we merge it.

One small issue I see is that the error message is currently less specific. That could be addressed though if people are worried about it.

@markuspf
Copy link
Member Author

I updated this PR in the following way

  • Change the escape sequence for hexadecimal values in strings and characters to \0xHH
  • Include the refactoring done in Stricter string escape #619
  • Display a syntax warning if a \\ is followed by anything that is not a valid escape sequence.
  • Fix all issues that were showing because of the added warning in the library

@markuspf
Copy link
Member Author

I have fixed qaos and sent an email to Thomas Breuer about ctbllib.

@frankluebeck
Copy link
Member

Looks mostly fine to me. Using \0x?? for the hex characters is similar to the syntax in other languages. And it has the advantage that there is no backward compatibility problem. Of course, the refactoring is also sensible.

But as discussed in the thread on "Stricter string escape" I'm not happy with the warning if a backslash is used with non-special characters (and the corresponding change of the documentation). This change seems purely cosmetic and it breaks without real need a behaviour that is documented since 30 years (and makes changes in several places necessary).

@markuspf
Copy link
Member Author

Is there any preference regarding whether we just revert to the old behaviour with respect to \\ or display a warning if \\ is followed by a letter that is not in a valid sequence?

I'd quite like to move this PR forward.

@stevelinton
Copy link
Contributor

On 29 Feb 2016, at 11:16, Markus Pfeiffer notifications@github.com wrote:

Is there any preference regarding whether we just revert to the old behaviour with respect to \ or display a warning if \ is followed by a letter that is not in a valid sequence?

I'd quite like to move this PR forward.


Reply to this email directly or view it on GitHub.

Mild preference for the warning.

Steve

@markuspf
Copy link
Member Author

markuspf commented Mar 4, 2016

Now updated to only warn if a backslash is followed by a letter.

Tests fail because of minor errors in qaos (for which there is a new version available that does not have this error) and ctbllib (which I reported to Thomas Breuer).

@markuspf markuspf added this to the GAP 4.9.0 milestone Mar 7, 2016
@markuspf markuspf force-pushed the string-hex-literal branch 2 times, most recently from 64138bc to 72ec6bb Compare May 13, 2016 10:27
@fingolfin
Copy link
Member

What is the status of this? It seems like it was basically done (though travis failed), and nobody really had complaints left?

@markuspf Perhaps you can rebase it, so that codecov gets run; I'd be happy to review this once more afterwards, too.

@olexandr-konovalov
Copy link
Member

IIRC, the new version of qaos contained a fix for the problem addresses by this PR, but ctbllib will fail the bugfix.tst test:

testing: /home/travis/build/gap-system/gap/tst/teststandard/bugfix.tst
rev: ########> Diff in /home/travis/build/gap-system/gap/tst/teststandard/bugfix.ts\
t:2884
# Input is:
if LoadPackage("ctbllib", false) <> fail then
     if Irr( CharacterTable( "WeylD", 4 ) )[1] <>
          [ 3, -1, 3, -1, 1, -1, 3, -1, -1, 0, 0, -1, 1 ] then
       Print( "problem with Irr( CharacterTable( \"WeylD\", 4 ) )[1]\n" );
     fi;
   fi;
# Expected output:
# But found:
Syntax warning: Alphabet letter after \ in /home/travis/build/gap-system/gap/p\
\
kg/ctbllib/data/ctgeneri.tbl:915
"in this paper (the characters $\Theta_l$ and $\Lambda_u$ on the classes\n",
                                ^
Syntax warning: Alphabet letter after \ in /home/travis/build/gap-system/gap/p\
\
kg/ctbllib/data/ctgeneri.tbl:915
"in this paper (the characters $\Theta_l$ and $\Lambda_u$ on the classes\n",
                                               ^
########

@fingolfin
Copy link
Member

So, how about completly disabling the warning for now (leave it in, but commented out)? We can re-evaluate enabling it in the future, when/if ctbllib gets changed. In the meantime, I don't think we should hold up a useful feature (support for hex literals) because of this...

@markuspf
Copy link
Member Author

markuspf commented Nov 6, 2016

@finglfin there you go. I disabled the syntax warning for now, though I have to admit I am uncomfortable with bugwards-compatibility hacks.

@codecov-io
Copy link

codecov-io commented Nov 6, 2016

Current coverage is 48.70% (diff: 74.07%)

Merging #612 into master will increase coverage by <.01%

@@             master       #612   diff @@
==========================================
  Files           424        424          
  Lines        222109     222119    +10   
  Methods        3426       3429     +3   
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits         108166     108193    +27   
+ Misses       113943     113926    -17   
  Partials          0          0          

Powered by Codecov. Last update 54ae310...0d7a403

@frankluebeck
Copy link
Member

I have not changed my opinion compared to my comment on Feb 26. The example from ctblib mentioned above even demonstrates that GAPs (documented!) behaviour can be useful.

Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. So now we need to decide whether we want to break backwards compatibility and documented behavior for the sake of hypothetical future enhancements or not.

Right now, there seems no urgent reason to do so; so perhaps postpone it, but still ask package authors to adjust their packages, and revisit this in the future?

This is translated to the character corresponding to the number
<C>X * 64 + Y * 8 + Z modulo 256</C>.
This can be used to specify and store arbitrary binary data as a string
in &GAP;.
</Item>
<Mark>
<Index><C>\0xYZ</C></Index>
<Index>hexadecimal character codes</Index>
<C>\xYZ</C></Mark>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still says \xYZ, should be \0xYZ.

<Index>escaping non-special characters</Index>
other</Mark>
<Item>
For any other character the backslash is simply ignored.
For any other character the backslash is ignored. If the character
is a letter, that is one of <C>a..zA..Z</C>, then a warning is displayed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.

They consist of a backslash <C>\</C> followed by a second character
indicating the type of special character sequence, and possibly more characters.
The following special character sequences are currently defined. Any other sequence
starting with a backslash results in an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't agree on changing the documented and actual behaviour, then this last sentence is controversial. In that case, we should drop it, but can keep the rest of this change.

@@ -416,7 +416,7 @@ InstallGlobalFunction(SIMPLE_STRING, function(str)
"efghijklmnopqrstuvwxyz[\000]^_\000abcdefghijklmnopqrstuvwxyz{ }~",
"\177\200\201\202",
"\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225",
"\226\227\230\231\232\233\234\235\236\237\238",
"\226\227\230\231\232\233\234\235\236\237\240",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\238 is not a valid octal escape, whereas \240 is (for the same number, if one were to allow overflows in the octal digits).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhhh of course! facepalm

GET_CHAR();
c += GetOctalDigits();
} else {
/* This warning is currently disabled for backwards compatibility */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps elaborate a bit, and/or reference this PR. E.g.:
"This warning is currently disabled for backwards compatibility: It turns out there are some packages which still rely on this."

** '0..9', 'A..F', or 'a..f' and 0 otherwise.
*/
#define IsHexDigit(ch) (isxdigit((unsigned int)ch))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Technically, there might be systems out there which don't have isxdigit, but we'll just deal with them if we ever encounter them. So this is fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isxdigit is in ISO C90, I thought we're doing C99 even? Am I mixing something up?

} else {
return (ch - '0');
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in one place... so unless we forsee use for it, why not just move it to scanner.c ? If something else ever needs it, we can still move it to a header (and then contemplate whether it needs extra safe guards or not etc.)

gap> x:='\0xFF';
'\377'
gap> x:='\0xab';
'\253'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test cases only cover positives, but not malformed inputs, which we should also test. E.g. \0yAB, \0X12, \090, \0a0, \0x0, \009, \00a, \00x, \0x1g, ..

@markuspf
Copy link
Member Author

markuspf commented Nov 7, 2016

I adapted the PR according to @fingolfin's comments. I should probably do some history rearrangement before we merge this (If everything is ok).

@fingolfin
Copy link
Member

Looks good to me now, thanks. If you want to cleanup the history, go ahead. Other than that, i think it can be merged now.

This commit allows hexadecimal escapes of the form `0xHH` in string
and character literals where `H` is a hexadecimal digit.

The common codepaths for parsing the escape sequences in GetStr and GetChar
are factored out into a common function GetEscapedChar.

If an invalid escape sequence is read, a SyntaxWarning is displayed, but
the old and documented behaviour to just ignore the backslash is preserved.
Code to display a warning is left commented out.
@markuspf markuspf merged commit 5f18c3d into gap-system:master Nov 9, 2016
@markuspf markuspf deleted the string-hex-literal branch February 5, 2017 12:31
@olexandr-konovalov olexandr-konovalov added the release notes: added PRs introducing changes that have since been mentioned in the release notes label Jan 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: enhancement Label for issues suggesting enhancements; and for pull requests implementing enhancements kind: new feature release notes: added PRs introducing changes that have since been mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants