ByteArray# literals #135

phadej · 2018-05-14T20:19:38Z

This is a proposal to introduce ByteArray# and (# Word#, Addr# #) literals, and slightly change Addr# literals. In short you'll be able to write

"Literals"#b           -- ByteArray#
"\xef\xbb\xbf"#abytes  -- Addr#
"Юникод"#ucp1251       -- (# Int#, Addr# #)

Rendered

cartazio · 2018-05-14T20:55:29Z

I like this a lot

bgamari · 2018-05-14T21:33:53Z

proposals/0000-bytearray-literals.rst

+One proposed alternative is to use double hash syntax ``"foo"##`` to represent
+UTF8 ``ByteArray#``.  That variant is very limited in power compared to proposed.
+
+In fact, we might introduce similar syntax for the number literals, e.g. ``120#i8``, if we get `Int8#` primitive type.


I agree that the syntax that you propose above is much preferred to the "keep-adding-more-hashes" approach. I would really like to see this syntax adopted for numeric literals at some point as well.

that'd be nice, agreed

bgamari · 2018-05-14T21:37:05Z

proposals/0000-bytearray-literals.rst

+
+Therefore, code with current primitive strings won't break.
+
+*Unresolved:* Should there be a (``-Wall``) warning in this case, asking user to be explicit?


I'm not sure I see the advantage. Implicit is fine given this syntax has a long history here, and it avoids the nasty compatibility issues that warnings generally bring.

nickkuk · 2018-05-15T16:31:45Z

proposals/0000-bytearray-literals.rst

+.. code-block:: haskell
+
+  "primitive"#           -- Addr# in utf8
+  "string                -- String or IsString a => a


Thanks for spotting. I corrected others as well.

nickkuk · 2018-05-15T16:33:02Z

proposals/0000-bytearray-literals.rst

+
+.. code-block:: haskell
+
+  "hello#


nickkuk · 2018-05-15T16:33:33Z

proposals/0000-bytearray-literals.rst

+
+These literals are ``[Word8]`` literals, *primitive string literal must contain only characters <= '\xFF'*.
+
+Ordinary strings, like ``"hello``, ``"Юникод"``, ``"\NUL"`` are then desugared as


No quotes after "hello".

nickkuk · 2018-05-15T16:35:26Z

proposals/0000-bytearray-literals.rst

+
+.. highlight:: haskell
+
+This proposal is `discussed at this pull request <https://github.com/ghc-proposals/ghc-proposals/pull/134>`_.


134 should be 135.

cartazio · 2018-05-15T17:04:13Z

Question: How will escaping be handled? Or is that already addressed via the proposal implicitly?


phadej · 2018-05-15T18:56:41Z

@cartazio what do you mean by escaping?

simonpj · 2018-05-15T20:12:33Z

The proposal helpfully lists a bunch of motivations with Trac tickets. I like that it resolves a bunch of related issues. But could you add a section going through these tickets one by one and explaining how the tickets are thereby resolved?

Trac #5218 speaks of

{-# LANGUAGE OverloadedStrings #-}
module Foo where
import Data.ByteString.Char8
foo = "abc" :: ByteString

How will that desugar? Will good things happen?
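For reference, a minimal sketch of how such a literal is elaborated today (this illustrates current GHC behaviour for an ASCII literal, not the desugaring proposed in this PR; the names `viaLiteral` and `byHand` are made up for the example):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as BC
import Data.String (fromString)
import GHC.Exts (unpackCString#)

-- With OverloadedStrings, the literal goes through IsString's fromString.
viaLiteral :: BC.ByteString
viaLiteral = "abc"

-- Roughly what the compiler produces for the literal above: an Addr#
-- literal, unpacked to a String, then handed to fromString.
byHand :: BC.ByteString
byHand = fromString (unpackCString# "abc"#)

main :: IO ()
main = print (viaLiteral == byHand)
```

The intermediate `String` is the part that a `ByteArray#` or length-carrying literal would let instances skip.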

simonmar · 2018-05-16T13:43:02Z

The motivation has the example

"Юникод"#ucp1251       -- (# Int#, Addr# #)

but cp1251 isn't one of the allowed encodings according to the proposal.

What does "algebraic" mean in the table in the motivation section?

Under "string syntax desugaring", the proposal says we'll use ByteArray#, but I'm a bit worried about the potential increase in code size here. The proposal refers to https://ghc.haskell.org/trac/ghc/ticket/5218#comment:95 for evidence that the overhead is small, but I think we should re-check that: it only considers the addition of a single word per string literal, but we would need 2 (the header + size).

Why not use (# Addr#, Int# #) for string literals, as we've discussed before? The proposal says "Desugaring to ByteArray# literals will allow GHC to eliminate common (sub)-expressions.", is this referring to the problem in https://ghc.haskell.org/trac/ghc/ticket/11312? We need to be really careful here - ByteArray# could have the same problem, if the compiler can inline and duplicate a ByteArray# literal. Anyway, I think we need to consider this point very carefully, and the proposal should elaborate on it.

Would these ByteArray#s return True for isByteArrayPinned#? (presumably yes)

The proposal doesn't really motivate the addition of the (# Addr#, Int# #) version. What would it be used for, if not string literals?
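One possible use, sketched under assumptions: an `Addr#` literal carries no length, so data with embedded NUL bytes cannot be recovered by scanning for a terminator. The pair below is written by hand to stand in for what a length-carrying literal might produce; `embeddedNul` and the two decoders are hypothetical names, not part of the proposal.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main where

import GHC.Exts (Addr#, Int#, unpackCString#, unpackNBytes#)

-- Hand-written stand-in for a literal that knows its own size in bytes.
-- (A function, because top-level bindings cannot have an unboxed type.)
embeddedNul :: () -> (# Addr#, Int# #)
embeddedNul () = (# "a\0b"#, 3# #)

-- Uses the explicit length, so the embedded NUL is preserved.
decodeWithLength :: (# Addr#, Int# #) -> String
decodeWithLength (# addr, len #) = unpackNBytes# addr len

-- Ignores the length and stops at the first NUL byte.
decodeNulTerminated :: (# Addr#, Int# #) -> String
decodeNulTerminated (# addr, _ #) = unpackCString# addr

main :: IO ()
main = do
  print (decodeWithLength (embeddedNul ()))
  print (decodeNulTerminated (embeddedNul ()))
```

The two results differ in length, which is the information a bare `Addr#` literal loses.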

nomeata · 2018-06-23T17:41:18Z

There is some good feedback. @phadej, are you going to incorporate it into the proposal?

phadej · 2018-06-23T23:14:26Z

I will. A bit busy with other stuff right now.


fumieval · 2018-12-19T09:43:38Z

Any updates? A lot of people have shot themselves in the foot due to IsString ByteString with multi-byte strings. This proposal seems like a good solution.
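A small sketch of the pitfall, using the existing `IsString ByteString` instance from the bytestring package (the names `sample` and `truncated` are only for illustration):

```haskell
module Main where

import qualified Data.ByteString as B
import Data.String (fromString)

sample :: String
sample = "Юникод"

-- The IsString instance keeps only the low 8 bits of each code point,
-- so every Cyrillic letter silently becomes an unrelated ASCII byte.
truncated :: B.ByteString
truncated = fromString sample

main :: IO ()
main = do
  print (B.unpack truncated)
  print (B.length truncated)
```

Here the six characters become six bytes, whereas a UTF-8 encoding of the same text would need twelve; no warning or error is raised.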

fumieval · 2018-12-19T09:44:38Z

proposals/0000-bytearray-literals.rst

+
+  data ShortByteString = SBS ByteArray#
+
+  instance FromString ShortByteString where


I think you are talking about the IsString class.

phadej · 2018-12-19T14:36:03Z

@fumieval I might look into updating the proposal during the holidays

hsyl20 · 2019-02-26T11:12:42Z

It seems to me that the proposed literals are just TH expression quasi-quoters with some very thin syntactic sugar:

 "abc123"#enc <=> [enc|abc123|]

So I think it would be better to reuse and to extend quasi-quoters instead:

Add more built-in quasi-quoters

Currently GHC (via TH) supports "e", "t", "d" and "p" quasi-quoters. Add some common ones, for instance:

| QuasiQuoter | Input | Expr Type |
|---|---|---|
| cstring | any string | `Addr# {- Null terminated UTF-8 encoded string -}` |
| bytes | ASCII string | `(# Addr#, Word# #)` |
| hexbin | [0-9A-F]* | `(# Addr#, Word# #)` |
| utf8 | any string | `(# Addr#, Word# {- Size in bytes -}, Word# {- Number of code-points -} #)` |
| utf16le | any string | `(# Addr#, Word# {- Size in bytes -}, Word# {- Number of code-points -} #)` |
| ... | ... | ... |

As usual we can export names from ghc-prim for built-in quasi-quoters, so that an "import" is required to bring those quasi-quoters into scope instead of polluting the global namespace.

Make it extensible: allow GHC plugins to define and register "builtin" quasi-quoters.

Built-in or plugged-in quasi-quoters don't need template-haskell (which is not always available).

(Optional) Add syntactic sugar: "xyz"#enc ==> [enc|xyz|]
Make it more efficient

Redefine QuasiQuoter to be:

{-# LANGUAGE MagicHash #-}
import GHC.Exts (Addr#, Word#)
import Language.Haskell.TH (Q, Exp, Pat, Type, Dec)

data QuasiQuoter = QuasiQuoter
   { -- For backward compatibility
     quoteExp :: String -> Q Exp
   , quotePat :: String -> Q Pat
   , quoteType :: String -> Q Type
   , quoteDec :: String -> Q [Dec]
   -- Operations on raw UTF-8 buffers (address and size in bytes).
   -- Record fields cannot carry defaults, so the intended fallbacks
   -- are given as comments.
   , quoteExpRaw :: Addr# -> Word# -> Q Exp
   -- default: quoteExpRaw a sz = quoteExp (unpackNBytes# a (word2Int# sz))
   , quotePatRaw :: Addr# -> Word# -> Q Pat
   -- default: quotePatRaw a sz = quotePat (unpackNBytes# a (word2Int# sz))
   , quoteTypeRaw :: Addr# -> Word# -> Q Type
   -- default: quoteTypeRaw a sz = quoteType (unpackNBytes# a (word2Int# sz))
   , quoteDecRaw :: Addr# -> Word# -> Q [Dec]
   -- default: quoteDecRaw a sz = quoteDec (unpackNBytes# a (word2Int# sz))
   }

and use the "raw" operations.

phadej · 2019-02-26T11:50:46Z

@hsyl20 AFAIK QuasiQuotes cannot be used with a stage1 compiler, i.e. you cannot use them in bytestring.

Also this proposal is not only about syntax.

hsyl20 · 2019-02-26T12:19:02Z

@phadej yes I know. See point (2): "Built-in or plugged-in quasi-quoters don't need template-haskell (which is not always available)."

The idea is to extract quasi-quoters from TH to put them into GHC in order to generalize them and to reuse their syntax:

built-in GHC quasi-quoters: cstring, utf8, bytes, whatever (as in your proposal)
when TH is available: e,p,t,d
extensible via GHC plugins: because inevitably someone will require an encoding that isn't built-in into GHC
when TH is available: extensible via TH as currently

phadej · 2019-02-26T12:24:52Z

@hsyl20

Make it extensible: allow GHC plugins to define and register "builtin" quasi-quoters.

That is worth its own proposal. I won't even try to sneak it in through this one.

phadej · 2019-04-30T12:24:08Z

I won't be able to finish this proposal in the foreseeable future.

vdukhovni · 2021-08-27T19:01:52Z

I see this has become dormant... Any chance it will be revived?

Initial bytearray literals

a09f4f3

phadej force-pushed the bytearray-literals branch from fcd81e7 to a09f4f3 on May 14, 2018 20:20

bgamari reviewed May 14, 2018


nickkuk reviewed May 15, 2018


Correct typos

297f234

bravit added the Proposal label Nov 11, 2018

bravit removed the Proposal label Dec 3, 2018

fumieval suggested changes Dec 19, 2018


phadej closed this Apr 30, 2019

andrewthad mentioned this pull request Nov 12, 2019

ByteArray Literals #292

Open

hsyl20 mentioned this pull request Oct 28, 2021

Sized literals #451

Merged
