ByteArray Literals #292

andrewthad · 2019-11-12T15:10:57Z

This is a variant of the now-closed ByteArray Literals proposal written by @phadej some time ago. It copies a lot of the text from that proposal, and I'd like to make sure @phadej is credited for that. It deviates considerably from the original proposal:

The suggested syntax is different: "foo"utf8## instead of "foo"#butf8 for a ByteArray#.
Fewer encodings are required: Modified UTF-8, UTF-8, and native-endian UTF-16.
Casing on ByteArray# literals is required.
The desugaring of string literals is left alone as work for a future proposal.
No encoding-conversion functions are included.
It will not silently break any existing programs.

All of this can be summarized by saying that this proposal is less ambitious than the original, and most of the benefit this variant provides is for users who are using comfortable writing out ByteArray# and Addr# literals. It does lay the groundwork for improving the desugaring of string literals, but the specifics of that are left for a future proposal.

proposals/0000-bytearray-literals.rst

hsyl20 · 2019-11-12T17:17:59Z

Thanks for the proposal. I would also be very interested to have ByteArray# literals at the Core level for unrelated purpose (to speed-up big Integers).

Syntax In my opinion we shouldn't have to modify the parser every time someone wants another encoding or target type supported. We already have a generic way to support any format in expressions and patterns: quasi-quoters.
So I would argue that we should reuse this syntax. Your examples become something like:

  [mutf8Addr#|Literals|]      -- Addr# (Modified UTF-8)
  [utf8Addr#|\xef\xbb\xbf|]   -- Addr# (UTF-8)
  [utf8ByteArray#|Юникод|]    -- ByteArray# (UTF-8)
  [utf16ByteArray#|Юникод|]   -- ByteArray# (UTF-16, native endian)

Template-Haskell. We can already implement [m]utf8Addr# quasi-quoters with TH's Bytes literals which don't exist in Haskell syntax. Once we support ByteArray# literals in Core, we can easily make them available from template-haskell and support the other quasi-quoters too.
Without TH We may want to use these literals without having to use TH. In this case, it could make sense to provide some wired-in quasi-quoters that would be supported even when TH isn't. The encodings you suggest are good candidates. Arrays of Words with native endianness/size would be useful too.

My implementation plan would be:

Add support for ByteArray# literals in Core and TH
Implement quasi-quoters in a normal package
(optional) Add support for wired-in quasi-quoters

andrewthad

Responded to some concerns.

andrewthad · 2019-11-12T18:28:57Z

@hsyl20 That is a very reasonable alternative. I will include your suggested syntax as an alternative and leave that question to the committee. I've never used quasiquotation (outside of working with TH-heavy libraries that require it), so it is unclear to me whether or not there are any drawbacks to that approach. If this is the decided route, wired-in quasiquoters would be essential since TH is a pain for users who cross-compile. Also, wired-in quasiquoters would be essential for any future change in the desugaring of string literals (that is, changing how IsString works).

AndreasPK · 2020-01-17T12:25:36Z

Is there a link to the current rendered proposal?

andrewthad · 2020-01-17T13:45:02Z

I've added a top-level link. I still need to make some changes to this proposal.

cartazio · 2020-01-31T16:18:44Z

i like the notations you suggest, afaict

[octets#|fe01bce8|] -- ByteArray# (four bytes)
[utf8#|Araña|]      -- ByteArray# (UTF-8)
[utf16#|Araña|]     -- ByteArray# (UTF-16, native endian)
[utf16le#|Araña|]   -- ByteArray# (UTF-16, little endian)
[utf16be#|Araña|]   -- ByteArray# (UTF-16, big endian)

my initial reading of the proposal and discussion didn't highlight providing a binary/hex encodedd data syntax :)

andrewthad · 2020-03-26T19:54:15Z

@hsyl20 Is there any precedent for the wired-in quasiquoters you describe? Are there any wired-in quasiquoters that exist today in GHC that can be used without turning on template haskell?

hsyl20 · 2020-03-27T12:02:23Z

@andrewthad I don't think so. But it should be straightforward to implement. In GHC.Rename.Splice.runRnSplice we can match on HsQuasiQuote with known (wired-in) quoter names and instead of running their expression with TH we would just have to transform their string parameter into a literal.

Quasi-quoters already have their own extension, so we would just have to allow it without also enabling TH.

andrewthad · 2020-03-28T21:02:32Z

@hsyl20 Since the wired-in quasiquoters would have type QuasiQuoter, I think that QuasiQuoter would need to be moved into base to avoid users of bytearray literals from having to depend on the template-haskell library. That doesn't seem too bad though.

To everyone, I've laid some groundwork for this feature in MR 2971. There is still a lot of work, particularly cmm-to-cmm and cmm-to-asm, that needs to happen to make this actually work. I'm not going to have time to implement this personally, but this might be a Summer of Haskell project if an interested student would like to do this.

I've cleaned this up a little more. I'll submit this to the committee in a week if there is no further discussion.

andrewthad · 2020-05-24T11:49:03Z

I would like to submit this proposal to the committee.

sjakobi

Thanks for the proposal and sorry for the late review! :)

I'd love to see GHC produce fast code for case-expressions on (byte)string literals. (I don't have a clear understanding of the status quo though.) Since this proposal seems to move things in that direction, I'm in favour! :)

proposals/0000-bytearray-literals.rst

Correct a sentence Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

Correct spelling of scrutinize Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

Make effects and interactions agree in number Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

nomeata · 2020-05-24T21:52:26Z

/remind @bravit to check progress in two weeks

reminders-prs · 2020-05-24T21:52:30Z

@nomeata set a reminder for Jun 7th 2020

bravit · 2020-05-31T14:24:23Z

Hi Andrew, thanks for the proposal! Unfortunately, I'm not very excited about it. I believe that such substantial changes as extending GHC Core, dealing with code generation, moving QuasiQuotes-guarded syntax to plain GHC/Haskell, etc. require much stronger motivation. Moreover, it's not immediately clear to me how many performance benefits we could get after implementing this proposal and whether they outweigh the significance of the changes you propose.

As for now, I have a couple of suggestions. First, could you please elaborate on how this proposal solves the following problem:

There is no O(1) way to get the length of a primitive string literal.

Second, GHC proposals usually have the "Alternatives" section. It would be nice to see there some traces of not so syntactically simple ways to get around things. I think that having such a section could help Committee members to decide on this proposal.

bravit · 2020-06-05T12:01:59Z

Hi @andrewthad, do you have any comments or objections?

…als into bytearray-literals

andrewthad · 2020-06-05T14:38:24Z

I've elaborated more in the motivations section. The hyperlinks that I've added have the real information though. Reading those should provide some sense of what there is to gain. Basically, it's unlocking Core-to-Core optimizations are reducing bloat in generated binaries.

I've removed the comment about "There is no O(1) way to get the length of a primitive string literal". I had copied that from an older version of this proposal. I resolved that issue in GHC MR 2165, so it is no longer relevant.

moving QuasiQuotes-guarded syntax to plain GHC/Haskell

I did not intend to suggest this. This syntax would still require the QuasiQuotes extension.

extending GHC Core

I don't think this change is as significant as you interpret it to be. This isn't extending GHC Core in the same sense that join points or linear types is. It's adding a data constructor to Literal, not to Expr. In GHC MR 2971, I started on this, and it's a pretty simple change. The difficulty lies in:

Code gen for pattern matches (cmm-to-cmm)
Code gen for literals themselves (cmm-to-asm? I think)

The impact on Core is minimal.

I can think about alternatives more, but the only alternatives I'm aware of are different user-facing syntaxes for this. Concerning Core, there's only one way to go about this. If you have another idea for how to resolve the motivations issues, write it up and I can add it as an alternative.

simonpj · 2020-06-05T15:05:38Z

The link "This proposal is discussed at this pull request." in the rendered proposal is broken

simonpj · 2020-06-05T15:19:46Z

I'm wondering: what makes this a GHC Proposal. After all, it doesn't change the language, only the libraries, and perhaps the compiler implementation.

In particular, we have accepted proposal #125, Type annotated quoters. I think this is the canonical link, or maybe this. (I wish there was an easy way to tell.)

All is good, except that the type TExp ByteArray# instantiates a polymoprhic type TExp with an unlifted type ByteArray. So maybe we can get these literals to return ByteArray not ByteArray#.

Now the only issue is: will GHC compile certain idioms efficiently. Maybe, maybe not. Let's see. And if not, let's see if we can modify GHC until it does. Only if that proves impossible, and we need some language design change, should a GHC proposal be necessary.

TL;DR: maybe you can skip the GHC Proposal process, and just make GHC optimise better!

reminders-prs · 2020-06-07T09:12:41Z

👋 @bravit, check progress

bravit · 2020-06-07T17:52:38Z

Hi @andrewthad,

I think that @simonpj is right. This looks like implementing several quasi quoters, guarded by QuasiQuotes extension, available by default through Prelude, and supported by GHC implementation. So it turns out there are no GHC/Haskell changes proposed.

I think we can label this proposal as dormant and see if there is something that we should consider here later.

simonpj · 2020-06-08T08:07:43Z

Actually, I wasn't quite right. I pointed to a proposal about Typed Template Haskell, but we want these bytearray literals to appear in patterns, so it must be an untyped TH thing. So there's no issue with instantiating a polymorphic type with TExp.

Rather the issue is this: how do you write the quoters octets etc? Those quoters need to produce a TH Syntax tree. But the data type Language.Haskell.TH.Syntax.Lit has no constructor for ByteArray literals, so we are stuck.

Or are we? In fact the Lit has

data Lit = ...
         | StringPrimL [Word8]  -- ^ A primitive C-style string, type 'Addr#'
         | BytesPrimL Bytes     -- ^ Some raw bytes, type 'Addr#':
         ...

-- | Raw bytes embedded into the binary.
--
-- Avoid using Bytes constructor directly as it is likely to change in the
-- future. Use helpers such as `mkBytes` in Language.Haskell.TH.Lib instead.
data Bytes = Bytes
   { bytesPtr    :: ForeignPtr Word8 -- ^ Pointer to the data
   , bytesOffset :: Word             -- ^ Offset from the pointer
   , bytesSize   :: Word             -- ^ Number of bytes
   -- Maybe someday:
   -- , bytesAlignement  :: Word -- ^ Alignement constraint
   -- , bytesReadOnly    :: Bool -- ^ Shall we embed into a read-only
   --                            --   section or not
   -- , bytesInitialized :: Bool -- ^ False: only use `bytesSize` to allocate
   --                            --   an uninitialized region
   }

Now, that Bytes thing was introduced only last year

commit 224a6b864c6aa0d851fcbf79469e5702b1116dbc
Author: Sylvain Henry <sylvain@haskus.fr>
Date:   Fri Jan 18 12:30:31 2019 +0100

    TH: support raw bytes literals (#14741)
    
    GHC represents String literals as ByteString internally for efficiency
    reasons. However, until now it wasn't possible to efficiently create
    large string literals with TH (e.g. to embed a file in a binary, cf #14741):
    TH code had to unpack the bytes into a [Word8] that GHC then had to re-pack
    into a ByteString.
    
    This patch adds the possibility to efficiently create a "string" literal
    from raw bytes. We get the following compile times for different sizes
    of TH created literals:
    
    || Size || Before || After  || Gain ||
    || 30K  || 2.307s || 2.299  || 0%   ||
    || 3M   || 3.073s || 2.400s || 21%  ||
    || 30M  || 8.517s || 3.390s || 60%  ||
    
    Ticket #14741 can be fixed if the original code uses this new TH feature.

I'm not sure if there was a GHC proposal about it -- the commit message is silent.

I'm a bit lost in ByteString vs ByteArray land, but this looks like what you need, right?

I'm not sure what it turns into when converting to Core, nor about the efficiency of the Core but you could check that.

andrewthad · 2020-06-08T12:58:58Z

@simonpj What I've done is extend GHC.Types.Literal.Lit with a LitByteArray data constructor and extend Language.Haskell.Syntax.TH.Lit with a ByteArrayPrimL data constructor. The second change doesn't impact Core, but technically the first one does, although the impact is minor. I'm not sure if that merits a proposal.

simonpj · 2020-06-08T13:41:04Z

extend Language.Haskell.Syntax.TH.Lit with a ByteArrayPrimL data constructor.

This is what I am doubtful about. Now we have THREE ways in TH Lit for describing a sequence of bytes. Do we need three?

goldfirere · 2020-06-24T10:23:13Z

In considering this proposal, the committee is struggling to figure out whether this really needs to be a proposal, or whether the goals could be accomplished just by adding quasi-quoters to some library (perhaps base, in which case this goes to the libraries mailing list). The proposal does mention a change to Core, though, and that suggests that it really does require a proposal. Yet this change is not in the Proposed Change Specification, my first port of call on any proposal.

So: could I ask the author to clarify, in the proposal text itself (not just in a comment here), what makes this a proper proposal? If a change to Core is warranted, could that be spelled out, too? I recognize that the proposal format doesn't really have a spot for a proposed change to Core (as these should be very rare), but something would have to go in the Proposed Change Specification.

Thanks!

andrewthad · 2020-06-24T11:02:45Z

I'll ask here before clarifying in the proposal, since I myself am not really sure what counts as changing Core. Does adding a data constructor to GHC.Types.Literal.Literal count as changing Core? (Expr has a Literal in the Lit data constructor, and Literal will get a new data constructor under this proposal)

simonpj · 2020-06-24T12:33:36Z

Does adding a data constructor to GHC.Types.Literal.Literal count as changing Core?

Yes, I think it does. Every GHC API user might need to update their code.

And same for changing TH Syntax, if you plan to do that. The latter esp is a user facing change

As I remark above, we already have two ways of describing byte-array literals in TH syntax, and one (LitString) in Core. It would be helpful to explain why we need another.

bravit · 2020-06-24T15:47:30Z

We have the following description of our scope:

A GHC Proposal is a document describing a proposed change to the compiler, the GHC/Haskell language, or the libraries in the GHC.* module namespace. These include,

A syntactic change to GHC/Haskell (e.g. the various ShortImports proposals, do expressions without $)
A major change to the user-visible behaviour of the compiler (e.g. the recent change in super-class solving, and -Wall behavior)
The addition of major features to the compiler (e.g. -XTypeInType, GHCi commands, type-indexed Typeable representations)

Changes to the GHC API or the plugin API are not automatically within the scope of the committee, and can be contributed following the usual GHC workflow. Should the GHC maintainers deem a change significant or controversial enough to warrant that, they may, at their discretion, involve the committee and ask the contributor to write a formal proposal.

So, affecting GHC API is not an immediate reason to consider this change. As for changes to GHC Core, I'm not sure that the proposed change accounts for a major feature unless we consider any change to Core as major.

Anyway, I think that I should put the needs revision label on this proposal.

Revive bytearray-literals proposal

415a520

phadej reviewed Nov 12, 2019

View reviewed changes

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved

phadej reviewed Nov 12, 2019

View reviewed changes

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved

andrewthad commented Nov 12, 2019

View reviewed changes

hvr mentioned this pull request Dec 31, 2019

Surprising behavior of ByteString literals via IsString haskell/bytestring#140

Open

Rewrite proposal

17c5df5

update proposal

05daf80

andrewthad mentioned this pull request May 24, 2020

Compute length at compile time for literal strings haskell/bytestring#191

Merged

sjakobi reviewed May 24, 2020

View reviewed changes

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved

andrewthad and others added 3 commits May 24, 2020 17:41

Update proposals/0000-bytearray-literals.rst

b5238bc

Correct a sentence Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

Update proposals/0000-bytearray-literals.rst

75fa93f

Correct spelling of scrutinize Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

Update proposals/0000-bytearray-literals.rst

6dba9ba

Make effects and interactions agree in number Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

nomeata assigned bravit May 24, 2020

nomeata changed the title ~~ByteArray Literals~~ ByteArray Literals (under review) May 24, 2020

nomeata added the Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion label May 24, 2020

reminders-prs bot added the reminder label May 24, 2020

andrewthad mentioned this pull request Jun 5, 2020

Suboptimal generated code with OverloadedStrings haskell/text#287

Closed

andrewthad added 2 commits June 5, 2020 10:20

Improve motivations

9ebed21

Merge branch 'bytearray-literals' of github.com:andrewthad/ghc-propos…

ab6a271

…als into bytearray-literals

reminders-prs bot removed the reminder label Jun 7, 2020

bravit added Needs revision The proposal needs changes in response to shepherd or committee review feedback and removed Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion labels Jun 24, 2020

hsyl20 mentioned this pull request Oct 28, 2021

Sized literals #451

Merged

adamgundry changed the title ~~ByteArray Literals (under review)~~ ByteArray Literals Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ByteArray Literals #292

ByteArray Literals #292

andrewthad commented Nov 12, 2019 •

edited by reminders-prs bot

hsyl20 commented Nov 12, 2019

andrewthad left a comment

andrewthad commented Nov 12, 2019

AndreasPK commented Jan 17, 2020

andrewthad commented Jan 17, 2020

cartazio commented Jan 31, 2020

andrewthad commented Mar 26, 2020

hsyl20 commented Mar 27, 2020

andrewthad commented Mar 28, 2020

andrewthad commented May 24, 2020

sjakobi left a comment

nomeata commented May 24, 2020

reminders-prs bot commented May 24, 2020

bravit commented May 31, 2020

bravit commented Jun 5, 2020

andrewthad commented Jun 5, 2020

simonpj commented Jun 5, 2020

simonpj commented Jun 5, 2020

reminders-prs bot commented Jun 7, 2020

bravit commented Jun 7, 2020

simonpj commented Jun 8, 2020

andrewthad commented Jun 8, 2020

simonpj commented Jun 8, 2020

goldfirere commented Jun 24, 2020

andrewthad commented Jun 24, 2020

simonpj commented Jun 24, 2020

bravit commented Jun 24, 2020

ByteArray Literals #292

Are you sure you want to change the base?

ByteArray Literals #292

Conversation

andrewthad commented Nov 12, 2019 • edited by reminders-prs bot

hsyl20 commented Nov 12, 2019

andrewthad left a comment

Choose a reason for hiding this comment

andrewthad commented Nov 12, 2019

AndreasPK commented Jan 17, 2020

andrewthad commented Jan 17, 2020

cartazio commented Jan 31, 2020

andrewthad commented Mar 26, 2020

hsyl20 commented Mar 27, 2020

andrewthad commented Mar 28, 2020

andrewthad commented May 24, 2020

sjakobi left a comment

Choose a reason for hiding this comment

nomeata commented May 24, 2020

reminders-prs bot commented May 24, 2020

bravit commented May 31, 2020

bravit commented Jun 5, 2020

andrewthad commented Jun 5, 2020

simonpj commented Jun 5, 2020

simonpj commented Jun 5, 2020

reminders-prs bot commented Jun 7, 2020

bravit commented Jun 7, 2020

simonpj commented Jun 8, 2020

andrewthad commented Jun 8, 2020

simonpj commented Jun 8, 2020

goldfirere commented Jun 24, 2020

andrewthad commented Jun 24, 2020

simonpj commented Jun 24, 2020

bravit commented Jun 24, 2020

andrewthad commented Nov 12, 2019 •

edited by reminders-prs bot