ByteArray# literals #135

Closed

phadej commented May 14, 2018

This is a proposal to introduce ByteArray# and (# Word#, Addr# #) literals, and slightly change Addr# literals. In short you'll be able to write

"Literals"#b           -- ByteArray#
"\xef\xbb\xbf"#abytes  -- Addr#
"Юникод"#ucp1251       -- (# Int#, Addr# #)

@cartazio

I like this a lot

One proposed alternative is to use double hash syntax ``"foo"##`` to represent
UTF8 ``ByteArray#``. That variant is very limited in power compared to proposed.

In fact, we might introduce similar syntax for the number literals, e.g. ``120#i8``, if we get ``Int8#`` primitive type.

I agree that the syntax you propose above is much preferable to the "keep-adding-more-hashes" approach. I would really like to see this syntax adopted for numeric literals at some point as well.

that'd be nice, agreed


Therefore, code with current primitive strings won't break.

*Unresolved:* Should there be a (``-Wall``) warning in this case, asking user to be explicit?

I'm not sure I see the advantage. Implicit is fine, given that this syntax has a long history here, and it avoids the nasty compatibility issues that warnings generally bring.

.. code-block:: haskell

"primitive"# -- Addr# in utf8
"string -- String or IsString a => a

No quotes?

phadej commented May 15, 2018

Thanks for spotting. I corrected others as well.


.. code-block:: haskell

"hello#

No quotes?


These literals are ``[Word8]`` literals, *primitive string literal must contain only characters <= '\xFF'*.

Ordinary strings, like ``"hello``, ``"Юникод"``, ``"\NUL"`` are then desugared as

No quotes after "hello".


.. highlight:: haskell

This proposal is `discussed at this pull request <https://github.com/ghc-proposals/ghc-proposals/pull/134>`_.

134 should be 135.

cartazio commented May 15, 2018 via email

phadej commented May 15, 2018

@cartazio what do you mean by escaping?

simonpj commented May 15, 2018

The proposal helpfully lists a bunch of motivations with Trac tickets. I like that it resolves a bunch of related issues. But could you add a section going through these tickets one by one and explaining how the tickets are thereby resolved?

Trac #5218 speaks of

{-# LANGUAGE OverloadedStrings #-}
module Foo where
import Data.ByteString.Char8
foo = "abc" :: ByteString

How will that desugar? Will good things happen?
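As background for that question, here is a minimal sketch of how such a literal is handled today under ``OverloadedStrings``: the literal is elaborated to ``fromString`` applied to an ordinary ``String``. The ``Bytes`` type and its instance are hypothetical stand-ins for ``ByteString`` so that the sketch depends only on ``base``; they are not taken from the proposal or from the ``bytestring`` package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (ord)
import Data.String (IsString (..))
import Data.Word (Word8)

-- Hypothetical stand-in for ByteString (an assumption for illustration only).
newtype Bytes = Bytes [Word8]
  deriving (Eq, Show)

-- Each Char is narrowed to one byte, keeping only its low 8 bits.
instance IsString Bytes where
  fromString = Bytes . map (fromIntegral . ord)

-- What the user writes.
foo :: Bytes
foo = "abc"

-- Roughly what the literal is elaborated to today.
fooDesugared :: Bytes
fooDesugared = fromString ("abc" :: String)
```

Both definitions denote the same value; the point of the sketch is that the conversion goes through a ``String`` and a class method rather than through a byte-array literal.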

@simonmar

The motivation has the example

"Юникод"#ucp1251       -- (# Int#, Addr# #)

but cp1251 isn't one of the allowed encodings according to the proposal.

What does "algebraic" mean in the table in the motivation section?

Under "string syntax desugaring", the proposal says we'll use ByteArray#, but I'm a bit worried about the potential increase in code size here. The proposal refers to https://ghc.haskell.org/trac/ghc/ticket/5218#comment:95 for evidence that the overhead is small, but I think we should re-check that: it only considers the addition of a single word per string literal, but we would need 2 (the header + size).

Why not use (# Addr#, Int# #) for string literals, as we've discussed before? The proposal says "Desugaring to ByteArray# literals will allow GHC to eliminate common (sub)-expressions." Is this referring to the problem in https://ghc.haskell.org/trac/ghc/ticket/11312? We need to be really careful here: ByteArray# could have the same problem if the compiler can inline and duplicate a ByteArray# literal. Anyway, I think we need to consider this point very carefully, and the proposal should elaborate on it.

Would these ByteArray#s return True for isByteArrayPinned#? (presumably yes)

The proposal doesn't really motivate the addition of the (# Addr#, Int# #) version. What would it be used for, if not string literals?
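For readers less familiar with the existing ``Addr#`` literals, the following sketch shows one thing an address paired with a length would buy. It uses only the current ``MagicHash`` syntax, not the proposed one. The helper ``scanToNul`` is a hypothetical name; it recovers a length the way C's ``strlen`` does, by walking the bytes until it meets a NUL, and it assumes the literal is stored NUL-terminated.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Addr#, Int (I#), Int#, eqChar#, indexCharOffAddr#, isTrue#, (+#))

-- Hypothetical helper: count the bytes before the first NUL by scanning.
-- An Addr# alone carries no length, so this is linear in the string size.
scanToNul :: Addr# -> Int
scanToNul addr = go 0#
  where
    go :: Int# -> Int
    go i
      | isTrue# (eqChar# (indexCharOffAddr# addr i) '\0'#) = I# i
      | otherwise = go (i +# 1#)

-- A current-style primitive string literal of type Addr#.
helloLength :: Int
helloLength = scanToNul "hello"#
```

A scan like this stops at the first NUL byte, so it cannot measure data that contains embedded NULs; carrying the length next to the address avoids both the scan and that limitation.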

nomeata commented Jun 23, 2018

There is some good feedback. @phadej, are you going to incorporate it into the proposal?

phadej commented Jun 23, 2018 via email

bravit removed the Proposal label Dec 3, 2018

@fumieval

Any updates? A lot of people have shot themselves in the foot because of IsString ByteString with multi-byte strings. This proposal seems like a good solution.
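The pitfall mentioned here can be sketched without the ``bytestring`` package. The functions below are hypothetical helpers, not library API: ``truncateToBytes`` mimics a conversion that keeps only the low 8 bits of each ``Char`` (which, as far as I understand, is how the ``IsString ByteString`` instance behaves), and ``fitsInBytes`` is the check quoted from the proposal, that a literal contains only characters ``<= '\xFF'``.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: narrow every Char to one byte, silently dropping
-- everything above the low 8 bits of its code point.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

-- Hypothetical helper: True when no information would be lost by narrowing.
fitsInBytes :: String -> Bool
fitsInBytes = all ((<= 0xFF) . ord)
```

For example, ``truncateToBytes "Юникод"`` yields the bytes of the ASCII text ``.=8:>4``, because each Cyrillic code point (U+042E, U+043D, ...) loses its high byte, while ``fitsInBytes "Юникод"`` is ``False``. A literal form that rejects such input at compile time, or encodes it explicitly, would avoid the silent corruption.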


data ShortByteString = SBS ByteArray#

instance FromString ShortByteString where

I think you are talking about the IsString class.

phadej commented Dec 19, 2018

@fumieval I might look into updating the proposal during the holidays

hsyl20 commented Feb 26, 2019

It seems to me that the proposed literals are just TH expression quasi-quoters with some very thin syntactic sugar:

 "abc123"#enc <=> [enc|abc123|]

So I think it would be better to reuse and extend quasi-quoters instead:

  1. Add more built-in quasi-quoters

Currently GHC (via TH) supports "e", "t", "d" and "p" quasi-quoters. Add some common ones, for instance:

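===========  ============  ===========================================================
QuasiQuoter  Input         Expr Type
===========  ============  ===========================================================
cstring      any string    Addr# {- Null-terminated UTF-8 encoded string -}
bytes        ASCII string  (# Addr#, Word# #)
hexbin       [0-9A-F]*     (# Addr#, Word# #)
utf8         any string    (# Addr#, Word# {- Size in bytes -}, Word# {- Number of code points -} #)
utf16le      any string    (# Addr#, Word# {- Size in bytes -}, Word# {- Number of code points -} #)
...          ...           ...
===========  ============  ===========================================================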

As usual, we can export names from ghc-prim for built-in quasi-quoters, so that an "import" is required to bring those quasi-quoters into scope instead of polluting the global namespace.

  2. Make it extensible: allow GHC plugins to define and register "builtin" quasi-quoters.

Built-in or plugged-in quasi-quoters don't need template-haskell (which is not always available).

  3. (Optional) Add syntactic sugar: "xyz"#enc ==> [enc|xyz|]

  4. Make it more efficient

Redefine QuasiQuoter to be:

data QuasiQuoter = QuasiQuoter
   { -- For backward compatibility
     quoteExp :: String -> Q Exp
   , quotePat :: String -> Q Pat
   , quoteType :: String -> Q Type
   , quoteDec :: String -> Q [Dec]
     -- Operations on raw UTF-8 buffers. Record fields cannot carry default
     -- definitions, so the intended default is given as a comment above each field.
     -- default: quoteExpRaw a sz = quoteExp (unpackNBytes# a (word2Int# sz))
   , quoteExpRaw :: Addr# -> Word# -> Q Exp
     -- default: quotePatRaw a sz = quotePat (unpackNBytes# a (word2Int# sz))
   , quotePatRaw :: Addr# -> Word# -> Q Pat
     -- default: quoteTypeRaw a sz = quoteType (unpackNBytes# a (word2Int# sz))
   , quoteTypeRaw :: Addr# -> Word# -> Q Type
     -- default: quoteDecRaw a sz = quoteDec (unpackNBytes# a (word2Int# sz))
   , quoteDecRaw :: Addr# -> Word# -> Q [Dec]
   }

and use the "raw" operations.
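To make the comparison with quasi-quoters concrete, here is a rough sketch of what a ``bytes`` quoter could look like with today's Template Haskell API. It is an illustration under assumptions, not the design suggested above: it uses the existing four-field ``QuasiQuoter`` record, it expands to a plain ``Addr#`` literal rather than the ``(# Addr#, Word# #)`` pair listed for ``bytes``, and the error messages are made up.

```haskell
import Data.Char (ord)
import Language.Haskell.TH
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- Sketch of an expression-only quasi-quoter: [bytes|abc|] would splice the
-- primitive string literal "abc"# (type Addr#) into the program.
bytes :: QuasiQuoter
bytes =
  QuasiQuoter
    { quoteExp = \s ->
        if all ((<= 0x7F) . ord) s
          then litE (stringPrimL (map (fromIntegral . ord) s))
          else fail "bytes: non-ASCII character in literal"
    , quotePat = unsupported "pattern"
    , quoteType = unsupported "type"
    , quoteDec = unsupported "declaration"
    }
  where
    -- The quoter only makes sense in expression position.
    unsupported :: String -> String -> Q a
    unsupported ctx _ = fail ("bytes: not usable in a " ++ ctx ++ " context")
```

Because of the stage restriction, the quoter has to live in a different module from the one that writes ``[bytes|abc|]``, and using it requires the ``QuasiQuotes`` extension and a compiler with Template Haskell support.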

phadej commented Feb 26, 2019

@hsyl20 AFAIK QuasiQuotes cannot be used with a stage1 compiler, i.e. you cannot use them in bytestring.

Also this proposal is not only about syntax.

hsyl20 commented Feb 26, 2019

@phadej yes I know. See point (2): "Built-in or plugged-in quasi-quoters don't need template-haskell (which is not always available)."

The idea is to extract quasi-quoters from TH to put them into GHC in order to generalize them and to reuse their syntax:

  • built-in GHC quasi-quoters: cstring, utf8, bytes, whatever (as in your proposal)
  • when TH is available: e,p,t,d
  • extensible via GHC plugins: because inevitably someone will require an encoding that isn't built-in into GHC
  • when TH is available: extensible via TH as currently

phadej commented Feb 26, 2019

@hsyl20

    Make it extensible: allow GHC plugins to define and register "builtin" quasi-quoters.

That is worth its own proposal. I won't even try to sneak it in through this one.

phadej commented Apr 30, 2019

I won't be able to finish this proposal in the foreseeable future.

phadej closed this Apr 30, 2019
andrewthad mentioned this pull request Nov 12, 2019

@vdukhovni

I see this has become dormant... Any chance it will be revived?

hsyl20 mentioned this pull request Oct 28, 2021