spec: more bytes operations #164

alandonovan · 2021-02-12T00:00:12Z

spec: more bytes operations

This change defines the semantics of:
- str(bytes) -- UTF-8 decoding with U+FFFD replacement
- bytes(str) -- UTF-8 encoding with U+FFFD replacement
- bytes.elems() -- iterable of int values of byte elements
- hash(bytes) -- 32-bit FNV-1a hash
- bytes in bytes -- substring test
- int in bytes -- element membership test

Updates #112
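
As a rough illustration of the element and membership operations listed above, here is a small sketch that uses Python's built-in bytes type as a stand-in for Starlark bytes. This is an assumption made for illustration only: Starlark's bytes.elems() is modelled with list(), and the variable names are hypothetical.

```python
# Python bytes used as a stand-in for Starlark bytes (illustrative assumption).
data = b"hello"

# "bytes in bytes": substring test.
has_substring = b"ell" in data

# "int in bytes": element membership test (0x68 is the byte value of "h").
has_element = 0x68 in data

# bytes.elems(): the int values of the byte elements, modelled with list().
elems = list(data)

print(has_substring, has_element, elems)
```

The two `in` forms answer different questions: the first looks for a contiguous run of bytes, the second for a single element value.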

merge upstream

google-cla · 2021-02-12T00:00:16Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment `@googlebot I fixed it.`. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

alandonovan · 2021-02-12T00:41:33Z

Some comments on what is (and what is not) in this PR:

- There is still no spec for the ord and chr functions.
- chr is easy enough to define: it takes an int and returns a 1-codepoint string. If you want the bytes of the result, call bytes(chr(x)), for now.
- codepoint_chr is not needed. Use chr.
- Python's ord function returns the numeric value of a 1-element string. (Python 2 string elements are bytes; Python 3 strings are sequences of code points, and Python 3 bytes strings are sequences of bytes.) Given that Starlark's text string elements are not code points, I think it risks confusion for us to define ord. One can use bytes.elems()[0] to obtain the same result, at some constant overhead, and this works on strings longer than 1 as well. If string.elem_ords() existed in the spec (and not just the Go impl), you could use that too. I think we should specify a method like elem_ords for strings, but ideally it would be called string.elems, for symmetry with bytes.elems. Unfortunately, that method already exists and returns 1-element substrings, which is really not useful (and would be an error for non-ASCII in Rust). I propose we use elem_ords for now, but plan longer term to make elems equivalent to elem_ords.
- codepoint_ord(s) can be achieved using s.codepoint_ords()[0], at least for now. We should specify string.codepoint_ords, which only exists in the Go impl. (Tricky to implement in Java due to Bazel's latin1 hack.)
- bytes(str) and str(bytes): this PR defines these conversions, with U+FFFD behavior, not fail-fast. Why do we not want to support these operations, as the comments in #112 (spec: new 'bytes' data type) say?
- If we ever want to support other encodings (and I think that's unlikely), we needn't pollute the str() and bytes() functions with extra parameters. We could introduce explicit encode/decode methods in that case.
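
To make the distinction between element values and code point values concrete, here is a minimal Python model of the ideas in the comment above. It assumes string elements are UTF-8 bytes (one possible encoding; an implementation with UTF-16 elements would give different element values). The function names mirror the methods discussed but are hypothetical Python helpers, not Starlark APIs.

```python
from typing import List


def elem_ords(s: str) -> List[int]:
    """Model of string.elem_ords, assuming string elements are UTF-8 bytes."""
    return list(s.encode("utf-8"))


def codepoint_ords(s: str) -> List[int]:
    """Model of string.codepoint_ords: one int per Unicode code point."""
    return [ord(c) for c in s]


def bytes_of_chr(x: int) -> bytes:
    """Model of bytes(chr(x)): UTF-8 encoding of a 1-codepoint string."""
    return chr(x).encode("utf-8")


s = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
print(elem_ords(s))          # two UTF-8 element values
print(codepoint_ords(s))     # one code point value
print(bytes_of_chr(0x20AC))  # UTF-8 bytes of the euro sign
```

For ASCII text the two views agree, which is why the difference is easy to overlook; they diverge as soon as a code point needs more than one element.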

brandjon · 2021-02-12T19:24:10Z

spec.md

@@ -2110,19 +2114,20 @@ these operators.
 #### Membership tests

 ```text
-      any in     sequence		(list, tuple, dict, string)
+      any in     sequence		(list, tuple, dict, string, bytes)


(Here and immediately below) Also range objects, and possibly other view-like objects (even an elems() iterable).

Separately, I wonder if the spec should say anything about which operations an application-defined type may elect to participate in.

Ideally the spec would not get into the business of exhaustively listing all possible types, instead referring to abstractions like "indexable sequence" and giving examples of one or two instances.

I've added range here and in a couple places, but it's a slippery slope.

> which operations an application-defined type may elect to participate in.

I think that's entirely up to the implementation, and essentially all operators are fair game.

brandjon · 2021-02-12T19:59:12Z

spec.md

+If x is a `bytes`, the result is x.
+
+If x is a string, the result is a `bytes` whose elements are
+the UTF-8 encoding of the string. Each element of the string that is


There are multiple ways to apply U+FFFD replacement to invalid UTF-8 data. Looking at Section 3.9 of the Unicode Standard (see the heading "Constraints on Conversion Process"), it appears the only hard requirement is that no valid sequence of code units encoding a single code point be consumed by the replacement process. But there's flexibility on whether each code unit that is not part of a valid sequence gets its own replacement character, or whether consecutive such units get batched into just one replacement character.

I imagine the algorithm used might vary across the string encoding libraries of the host languages of different Starlark implementations. Which means we may want to allow for implementation-defined behavior here.

Of course, that's for decoding, not encoding. I suppose bytes(str) is thought of as applying a composition of decoding the string's elements from UTF-K and re-encoding them to UTF-8, even when K = 8.

Yes, the Unicode spec doesn't prescribe it exactly, and this question has a history; see https://hsivonen.fi/broken-utf-8/Unicode. But it's safer for us to prescribe the behavior here, as that very document points out:

> Web standards these days tend to avoid implementation-defined behavior due to the painful experience of Web sites developing dependencies on the quirks of particular browsers in areas that old specs considered error situations not worth spelling out precise processing requirements for. Therefore, there has been a long push towards well-defined behavior even in error situations on the Web without debating each particular error case...

I chose this behavior because it's easy to implement, and it's the one required by the Go spec, and if I recall correctly, Java's String.getBytes(utf8) uses '?' as its replacement (!!), and thus isn't compatible with any reading of Unicode, so we'll have to handle the conversion ourselves anyway.

If this is consistent with practices in Go, it's good enough for me.
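
The two replacement policies discussed in this thread can be sketched as follows. This is an illustrative Python model, not the code of any Starlark implementation: `decode_utf8_lossy` and its `batch` flag are hypothetical names. With `batch=False` every byte that does not begin a valid UTF-8 sequence gets its own U+FFFD (the per-element reading); with `batch=True` a run of consecutive such bytes collapses into a single U+FFFD (the alternative mentioned above).

```python
REPLACEMENT = "\ufffd"


def decode_utf8_lossy(data: bytes, batch: bool = False) -> str:
    """Decode UTF-8, substituting U+FFFD for bytes that are not part of a
    valid sequence. Valid sequences are never consumed by the replacement."""
    out = []
    i = 0
    prev_invalid = False
    while i < len(data):
        # Try to decode a valid sequence of 1 to 4 bytes starting at i.
        for width in (1, 2, 3, 4):
            try:
                text = data[i:i + width].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(text)
            i += width
            prev_invalid = False
            break
        else:
            # No valid sequence starts here: replace this single byte,
            # unless batching and the previous byte was already replaced.
            if not (batch and prev_invalid):
                out.append(REPLACEMENT)
            prev_invalid = True
            i += 1
    return "".join(out)


# A euro sign (E2 82 AC) truncated to its first two bytes, between "a" and "b".
broken = b"a\xe2\x82b"
print(ascii(decode_utf8_lossy(broken)))
print(ascii(decode_utf8_lossy(broken, batch=True)))
```

Because the two policies give different strings for the same input, leaving the choice implementation-defined would let programs observe which host library is in use, which is the portability concern raised in the reply above.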

brandjon · 2021-02-12T20:11:12Z

spec.md

 In other words `x == y` implies `hash(x) == hash(y)`.
+Any other type of argument in an error, even if it is suitable as the key of a dict.
+
 In the interests of reproducibility of Starlark program behavior over time and


> In the interests of reproducibility

Which is exactly what Python aims to avoid for the sake of thwarting collision DoS attacks... Can't be helped, I suppose. Do we have the option to extend hash() with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.

Yes, but Starlark prizes determinism higher than DoS protection. This was a material problem when I implemented Herb: execution of BUILD files would vary based on the hash function, causing (for example) different elements to be chosen from a set, causing different results, and in some cases, errors only in one implementation.

> Do we have the option to extend hash() with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.

No, we don't have an option.

Well, put it this way: If Starlark is ever used by an application that wants DoS protection and values it over determinism, they'll do what they want in the implementation (to the extent that they control their interpreter) and ignore the spec. So it's just a choice of whether we bless it or not, and I don't feel strongly either way.
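
For reference, the 32-bit FNV-1a function named in the PR description is small enough to sketch in full. The version below is a plain Python rendering of the published algorithm with its standard offset basis and prime. The function name is hypothetical, and whether an implementation reports the result as a signed or unsigned integer is not covered here.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for b in data:
        h ^= b
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the low 32 bits
    return h


print(hex(fnv1a_32(b"")))
print(hex(fnv1a_32(b"a")))
```

The function takes no seed and has no hidden state, so the same bytes hash to the same value in every run and every implementation. That is the determinism being traded against collision resistance in the discussion above.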

spec.md

This change defines the semantics of:
- str(bytes) -- UTF-k decoding with U+FFFD replacement
- bytes(str) -- UTF-k encoding with U+FFFD replacement
- bytes.elems() -- iterable of int values of byte elements
- hash(bytes) -- 32-bit FNV-1a hash
- bytes in bytes -- substring test
- int in bytes -- element membership test

Updates bazelbuild#112

Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7

Merge pull request #1 from bazelbuild/master

57bfcac

merge upstream

alandonovan requested a review from laurentlb as a code owner February 12, 2021 00:00

adonovan force-pushed the bytes2 branch from 9a4e744 to d484a24 Compare February 12, 2021 00:05

alandonovan requested a review from brandjon February 12, 2021 00:23

adonovan force-pushed the bytes2 branch from d484a24 to 5915dd2 Compare February 12, 2021 00:52

brandjon approved these changes Feb 12, 2021

adonovan force-pushed the bytes2 branch from 5915dd2 to 65e182f Compare February 12, 2021 21:22

alandonovan merged commit 0b9b3d0 into bazelbuild:master Feb 12, 2021
