Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteArray Literals #292

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

andrewthad
Copy link
Contributor

@andrewthad andrewthad commented Nov 12, 2019

Rendered Link

This is a variant of the now-closed ByteArray Literals proposal written by @phadej some time ago. It copies a lot of the text from that proposal, and I'd like to make sure @phadej is credited for that. It deviates considerably from the original proposal:

  • The suggested syntax is different: "foo"utf8## instead of "foo"#butf8 for a ByteArray#.
  • Fewer encodings are required: Modified UTF-8, UTF-8, and native-endian UTF-16.
  • Casing on ByteArray# literals is required.
  • The desugaring of string literals is left alone as work for a future proposal.
  • No encoding-conversion functions are included.
  • It will not silently break any existing programs.

All of this can be summarized by saying that this proposal is less ambitious than the original, and most of the benefit this variant provides is for users who are using comfortable writing out ByteArray# and Addr# literals. It does lay the groundwork for improving the desugaring of string literals, but the specifics of that are left for a future proposal.

@hsyl20
Copy link
Contributor

hsyl20 commented Nov 12, 2019

Thanks for the proposal. I would also be very interested to have ByteArray# literals at the Core level for unrelated purpose (to speed-up big Integers).

  1. Syntax In my opinion we shouldn't have to modify the parser every time someone wants another encoding or target type supported. We already have a generic way to support any format in expressions and patterns: quasi-quoters.
    So I would argue that we should reuse this syntax. Your examples become something like:
  [mutf8Addr#|Literals|]      -- Addr# (Modified UTF-8)
  [utf8Addr#|\xef\xbb\xbf|]   -- Addr# (UTF-8)
  [utf8ByteArray#|Юникод|]    -- ByteArray# (UTF-8)
  [utf16ByteArray#|Юникод|]   -- ByteArray# (UTF-16, native endian)
  1. Template-Haskell. We can already implement [m]utf8Addr# quasi-quoters with TH's Bytes literals which don't exist in Haskell syntax. Once we support ByteArray# literals in Core, we can easily make them available from template-haskell and support the other quasi-quoters too.

  2. Without TH We may want to use these literals without having to use TH. In this case, it could make sense to provide some wired-in quasi-quoters that would be supported even when TH isn't. The encodings you suggest are good candidates. Arrays of Words with native endianness/size would be useful too.

My implementation plan would be:

  1. Add support for ByteArray# literals in Core and TH
  2. Implement quasi-quoters in a normal package
  3. (optional) Add support for wired-in quasi-quoters

Copy link
Contributor Author

@andrewthad andrewthad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to some concerns.

@andrewthad
Copy link
Contributor Author

@hsyl20 That is a very reasonable alternative. I will include your suggested syntax as an alternative and leave that question to the committee. I've never used quasiquotation (outside of working with TH-heavy libraries that require it), so it is unclear to me whether or not there are any drawbacks to that approach. If this is the decided route, wired-in quasiquoters would be essential since TH is a pain for users who cross-compile. Also, wired-in quasiquoters would be essential for any future change in the desugaring of string literals (that is, changing how IsString works).

@AndreasPK
Copy link

Is there a link to the current rendered proposal?

@andrewthad
Copy link
Contributor Author

I've added a top-level link. I still need to make some changes to this proposal.

@cartazio
Copy link

i like the notations you suggest, afaict

[octets#|fe01bce8|] -- ByteArray# (four bytes)
[utf8#|Araña|]      -- ByteArray# (UTF-8)
[utf16#|Araña|]     -- ByteArray# (UTF-16, native endian)
[utf16le#|Araña|]   -- ByteArray# (UTF-16, little endian)
[utf16be#|Araña|]   -- ByteArray# (UTF-16, big endian)

my initial reading of the proposal and discussion didn't highlight providing a binary/hex encodedd data syntax :)

@andrewthad
Copy link
Contributor Author

@hsyl20 Is there any precedent for the wired-in quasiquoters you describe? Are there any wired-in quasiquoters that exist today in GHC that can be used without turning on template haskell?

@hsyl20
Copy link
Contributor

hsyl20 commented Mar 27, 2020

@andrewthad I don't think so. But it should be straightforward to implement. In GHC.Rename.Splice.runRnSplice we can match on HsQuasiQuote with known (wired-in) quoter names and instead of running their expression with TH we would just have to transform their string parameter into a literal.

Quasi-quoters already have their own extension, so we would just have to allow it without also enabling TH.

@andrewthad
Copy link
Contributor Author

@hsyl20 Since the wired-in quasiquoters would have type QuasiQuoter, I think that QuasiQuoter would need to be moved into base to avoid users of bytearray literals from having to depend on the template-haskell library. That doesn't seem too bad though.

To everyone, I've laid some groundwork for this feature in MR 2971. There is still a lot of work, particularly cmm-to-cmm and cmm-to-asm, that needs to happen to make this actually work. I'm not going to have time to implement this personally, but this might be a Summer of Haskell project if an interested student would like to do this.

I've cleaned this up a little more. I'll submit this to the committee in a week if there is no further discussion.

@andrewthad
Copy link
Contributor Author

I would like to submit this proposal to the committee.

Copy link

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal and sorry for the late review! :)

I'd love to see GHC produce fast code for case-expressions on (byte)string literals. (I don't have a clear understanding of the status quo though.) Since this proposal seems to move things in that direction, I'm in favour! :)

proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved
proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved
proposals/0000-bytearray-literals.rst Outdated Show resolved Hide resolved
andrewthad and others added 3 commits May 24, 2020 17:41
Correct a sentence

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
Correct spelling of scrutinize

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
Make effects and interactions agree in number

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
@nomeata nomeata changed the title ByteArray Literals ByteArray Literals (under review) May 24, 2020
@nomeata nomeata added the Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion label May 24, 2020
@nomeata
Copy link
Contributor

nomeata commented May 24, 2020

/remind @bravit to check progress in two weeks

@reminders-prs
Copy link

reminders-prs bot commented May 24, 2020

@nomeata set a reminder for Jun 7th 2020

@bravit
Copy link
Contributor

bravit commented May 31, 2020

Hi Andrew, thanks for the proposal! Unfortunately, I'm not very excited about it. I believe that such substantial changes as extending GHC Core, dealing with code generation, moving QuasiQuotes-guarded syntax to plain GHC/Haskell, etc. require much stronger motivation. Moreover, it's not immediately clear to me how many performance benefits we could get after implementing this proposal and whether they outweigh the significance of the changes you propose.

As for now, I have a couple of suggestions. First, could you please elaborate on how this proposal solves the following problem:

There is no O(1) way to get the length of a primitive string literal.

Second, GHC proposals usually have the "Alternatives" section. It would be nice to see there some traces of not so syntactically simple ways to get around things. I think that having such a section could help Committee members to decide on this proposal.

@bravit
Copy link
Contributor

bravit commented Jun 5, 2020

Hi @andrewthad, do you have any comments or objections?

@andrewthad
Copy link
Contributor Author

I've elaborated more in the motivations section. The hyperlinks that I've added have the real information though. Reading those should provide some sense of what there is to gain. Basically, it's unlocking Core-to-Core optimizations are reducing bloat in generated binaries.

I've removed the comment about "There is no O(1) way to get the length of a primitive string literal". I had copied that from an older version of this proposal. I resolved that issue in GHC MR 2165, so it is no longer relevant.

moving QuasiQuotes-guarded syntax to plain GHC/Haskell

I did not intend to suggest this. This syntax would still require the QuasiQuotes extension.

extending GHC Core

I don't think this change is as significant as you interpret it to be. This isn't extending GHC Core in the same sense that join points or linear types is. It's adding a data constructor to Literal, not to Expr. In GHC MR 2971, I started on this, and it's a pretty simple change. The difficulty lies in:

  • Code gen for pattern matches (cmm-to-cmm)
  • Code gen for literals themselves (cmm-to-asm? I think)

The impact on Core is minimal.

I can think about alternatives more, but the only alternatives I'm aware of are different user-facing syntaxes for this. Concerning Core, there's only one way to go about this. If you have another idea for how to resolve the motivations issues, write it up and I can add it as an alternative.

@simonpj
Copy link
Contributor

simonpj commented Jun 5, 2020

The link "This proposal is discussed at this pull request." in the rendered proposal is broken

@simonpj
Copy link
Contributor

simonpj commented Jun 5, 2020

I'm wondering: what makes this a GHC Proposal. After all, it doesn't change the language, only the libraries, and perhaps the compiler implementation.

In particular, we have accepted proposal #125, Type annotated quoters. I think this is the canonical link, or maybe this. (I wish there was an easy way to tell.)

All is good, except that the type TExp ByteArray# instantiates a polymoprhic type TExp with an unlifted type ByteArray. So maybe we can get these literals to return ByteArray not ByteArray#.

Now the only issue is: will GHC compile certain idioms efficiently. Maybe, maybe not. Let's see. And if not, let's see if we can modify GHC until it does. Only if that proves impossible, and we need some language design change, should a GHC proposal be necessary.

TL;DR: maybe you can skip the GHC Proposal process, and just make GHC optimise better!

@reminders-prs reminders-prs bot removed the reminder label Jun 7, 2020
@reminders-prs
Copy link

reminders-prs bot commented Jun 7, 2020

👋 @bravit, check progress

@bravit
Copy link
Contributor

bravit commented Jun 7, 2020

Hi @andrewthad

I think that @simonpj is right. This looks like implementing several quasi quoters, guarded by QuasiQuotes extension, available by default through Prelude, and supported by GHC implementation. So it turns out there are no GHC/Haskell changes proposed.

I think we can label this proposal as dormant and see if there is something that we should consider here later.

@simonpj
Copy link
Contributor

simonpj commented Jun 8, 2020

Actually, I wasn't quite right. I pointed to a proposal about Typed Template Haskell, but we want these bytearray literals to appear in patterns, so it must be an untyped TH thing. So there's no issue with instantiating a polymorphic type with TExp.

Rather the issue is this: how do you write the quoters octets etc? Those quoters need to produce a TH Syntax tree. But the data type Language.Haskell.TH.Syntax.Lit has no constructor for ByteArray literals, so we are stuck.

Or are we? In fact the Lit has

data Lit = ...
         | StringPrimL [Word8]  -- ^ A primitive C-style string, type 'Addr#'
         | BytesPrimL Bytes     -- ^ Some raw bytes, type 'Addr#':
         ...

-- | Raw bytes embedded into the binary.
--
-- Avoid using Bytes constructor directly as it is likely to change in the
-- future. Use helpers such as `mkBytes` in Language.Haskell.TH.Lib instead.
data Bytes = Bytes
   { bytesPtr    :: ForeignPtr Word8 -- ^ Pointer to the data
   , bytesOffset :: Word             -- ^ Offset from the pointer
   , bytesSize   :: Word             -- ^ Number of bytes
   -- Maybe someday:
   -- , bytesAlignement  :: Word -- ^ Alignement constraint
   -- , bytesReadOnly    :: Bool -- ^ Shall we embed into a read-only
   --                            --   section or not
   -- , bytesInitialized :: Bool -- ^ False: only use `bytesSize` to allocate
   --                            --   an uninitialized region
   }

Now, that Bytes thing was introduced only last year

commit 224a6b864c6aa0d851fcbf79469e5702b1116dbc
Author: Sylvain Henry <sylvain@haskus.fr>
Date:   Fri Jan 18 12:30:31 2019 +0100

    TH: support raw bytes literals (#14741)
    
    GHC represents String literals as ByteString internally for efficiency
    reasons. However, until now it wasn't possible to efficiently create
    large string literals with TH (e.g. to embed a file in a binary, cf #14741):
    TH code had to unpack the bytes into a [Word8] that GHC then had to re-pack
    into a ByteString.
    
    This patch adds the possibility to efficiently create a "string" literal
    from raw bytes. We get the following compile times for different sizes
    of TH created literals:
    
    || Size || Before || After  || Gain ||
    || 30K  || 2.307s || 2.299  || 0%   ||
    || 3M   || 3.073s || 2.400s || 21%  ||
    || 30M  || 8.517s || 3.390s || 60%  ||
    
    Ticket #14741 can be fixed if the original code uses this new TH feature.

I'm not sure if there was a GHC proposal about it -- the commit message is silent.

I'm a bit lost in ByteString vs ByteArray land, but this looks like what you need, right?

I'm not sure what it turns into when converting to Core, nor about the efficiency of the Core but you could check that.

@andrewthad
Copy link
Contributor Author

@simonpj What I've done is extend GHC.Types.Literal.Lit with a LitByteArray data constructor and extend Language.Haskell.Syntax.TH.Lit with a ByteArrayPrimL data constructor. The second change doesn't impact Core, but technically the first one does, although the impact is minor. I'm not sure if that merits a proposal.

@simonpj
Copy link
Contributor

simonpj commented Jun 8, 2020

extend Language.Haskell.Syntax.TH.Lit with a ByteArrayPrimL data constructor.

This is what I am doubtful about. Now we have THREE ways in TH Lit for describing a sequence of bytes. Do we need three?

@goldfirere
Copy link
Contributor

In considering this proposal, the committee is struggling to figure out whether this really needs to be a proposal, or whether the goals could be accomplished just by adding quasi-quoters to some library (perhaps base, in which case this goes to the libraries mailing list). The proposal does mention a change to Core, though, and that suggests that it really does require a proposal. Yet this change is not in the Proposed Change Specification, my first port of call on any proposal.

So: could I ask the author to clarify, in the proposal text itself (not just in a comment here), what makes this a proper proposal? If a change to Core is warranted, could that be spelled out, too? I recognize that the proposal format doesn't really have a spot for a proposed change to Core (as these should be very rare), but something would have to go in the Proposed Change Specification.

Thanks!

@andrewthad
Copy link
Contributor Author

I'll ask here before clarifying in the proposal, since I myself am not really sure what counts as changing Core. Does adding a data constructor to GHC.Types.Literal.Literal count as changing Core? (Expr has a Literal in the Lit data constructor, and Literal will get a new data constructor under this proposal)

@simonpj
Copy link
Contributor

simonpj commented Jun 24, 2020

Does adding a data constructor to GHC.Types.Literal.Literal count as changing Core?

Yes, I think it does. Every GHC API user might need to update their code.

And same for changing TH Syntax, if you plan to do that. The latter esp is a user facing change

As I remark above, we already have two ways of describing byte-array literals in TH syntax, and one (LitString) in Core. It would be helpful to explain why we need another.

@bravit
Copy link
Contributor

bravit commented Jun 24, 2020

We have the following description of our scope:

A GHC Proposal is a document describing a proposed change to the compiler, the GHC/Haskell language, or the libraries in the GHC.* module namespace. These include,

A syntactic change to GHC/Haskell (e.g. the various ShortImports proposals, do expressions without $)
A major change to the user-visible behaviour of the compiler (e.g. the recent change in super-class solving, and -Wall behavior)
The addition of major features to the compiler (e.g. -XTypeInType, GHCi commands, type-indexed Typeable representations)

Changes to the GHC API or the plugin API are not automatically within the scope of the committee, and can be contributed following the usual GHC workflow. Should the GHC maintainers deem a change significant or controversial enough to warrant that, they may, at their discretion, involve the committee and ask the contributor to write a formal proposal.

So, affecting GHC API is not an immediate reason to consider this change. As for changes to GHC Core, I'm not sure that the proposed change accounts for a major feature unless we consider any change to Core as major.

Anyway, I think that I should put the needs revision label on this proposal.

@bravit bravit added Needs revision The proposal needs changes in response to shepherd or committee review feedback and removed Pending shepherd recommendation The shepherd needs to evaluate the proposal and make a recommendataion labels Jun 24, 2020
@hsyl20 hsyl20 mentioned this pull request Oct 28, 2021
@adamgundry adamgundry changed the title ByteArray Literals (under review) ByteArray Literals Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs revision The proposal needs changes in response to shepherd or committee review feedback
Development

Successfully merging this pull request may close these issues.

None yet

10 participants