Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spec: more bytes operations #164

Merged
merged 2 commits into from
Feb 12, 2021
Merged

spec: more bytes operations #164

merged 2 commits into from
Feb 12, 2021

Conversation

alandonovan
Copy link
Contributor

@alandonovan alandonovan commented Feb 12, 2021

spec: more bytes operations

This change defines the semantics of: 
- str(bytes) - UTF-8 decoding with U+FFFD replacement
- bytes(str) - UTF-8 encoding with U+FFFD replacement
- bytes.elems() -- iterable of int values of byte elements
- hash(bytes) -- 32-bit FNV-1a hash
- bytes in bytes -- substring test
- int in bytes -- element membership test

Updates #112

@google-cla
Copy link

google-cla bot commented Feb 12, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla
Copy link

google-cla bot commented Feb 12, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@alandonovan
Copy link
Contributor Author

Some comments on what is (and what is not) in this PR:

  • There is still no spec for ord nor chr functions.

  • chr is easy enough to define: it takes an int, and returns a 1-codepoint string. If you want bytes of the result, call bytes(chr(x)), for now.

  • codepoint_chr is not needed. Use chr.

  • Python's ord function returns the numeric value of a 1-element string. (Python2 string elements are bytes; Python3 strings are of code points, and Python3 bytes strings of bytes.) Given that Starlark's text string elements are not code points, I think it risks confusion for us to define ord. One can use bytes.elems()[0] to obtain the same result, at some constant overhead, and this works on strings longer than 1 as well. If string.elem_ords() existed in the spec (and not just the Go impl), you could use that too. I think we should specify a method like elem_ords for strings, but ideally it would be called string.elems, for symmetry with bytes.elems. Unfortunately, that method already exists and returns 1-element substrings, which is really not useful (and would be an error for non-ASCII in Rust.) I propose we use elem_ords for now, but plan longer term to make elems equivalent to elem_ords.

  • codepoint_ord(s) can be achieved using s.codepoint_ords()[0], at least for now. We should specify string.codepoint_ords, which only exists in the Go impl. (Tricky to implement in Java due to Bazel latin1 hack.)

  • bytes(str) and str(bytes): this PR defines these conversions, with U+FFFD behavior, not fail-fast. Why do we not want to support these operations, as comments in spec: new 'bytes' data type #112 say?

  • If we ever want to support other encodings (and I think that's unlikely), we needn't pollute the str() and bytes() functions with extra parameters. We could introduce explicit encode/decode methods in that case.

@google-cla
Copy link

google-cla bot commented Feb 12, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated
@@ -2110,19 +2114,20 @@ these operators.
#### Membership tests

```text
any in sequence (list, tuple, dict, string)
any in sequence (list, tuple, dict, string, bytes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Here and immediately below) Also range objects, and possibly other view-like objects (even an elems() iterable).

Separately, I wonder if the spec should say anything about which operations an application-defined type may elect to participate in.

Copy link
Contributor Author

@alandonovan alandonovan Feb 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the spec would get into the business of exhaustively listing all possible types, instead referring to abstractions like "indexable sequence" and given examples of one or to instances.

I've added range here and in a couple places, but it's a slippery slope.

which operations an application-defined type may elect to participate in.

I think that's entirely up to the implementation, and essentially all operators are fair game.

If x is a `bytes`, the result is x.

If x is a string, the result is a `bytes` whose elements are
the UTF-8 encoding of the string. Each element of the string that is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's multiple ways to apply U+FFFD replacement to invalid UTF-8 data. Looking at Section 3.9 of the unicode standard (see heading "Constraints on Conversion Process"), it appears the only hard requirement is that no valid sequence of code units encoding a single code point be consumed by the replacement process. But there's flexibility on whether each code unit that is not part of a valid sequence gets its own replacement character, or whether consecutive such units get batched into just one replacement character.

I imagine the algorithm used might vary across string encoding libraries in different Starlark implementation's host languages. Which means we may want to allow for implementation-defined behavior here.

Of course, that's for decoding, not encoding. I suppose bytes(str) is thought of as applying a composition of decoding the string's elements from UTF-K and re-encoding them to UTF-8, even when K = 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Unicode spec doesn't prescribe it exactly, and this question has a history; see https://hsivonen.fi/broken-utf-8/Unicode. But it's safer for us to prescribe the behavior here, as that very document points out:

Web standards these days tend to avoid implementation-defined behavior due to the painful experience of Web sites developing dependencies on the quirks of particular browsers in areas that old specs considered error situations not worth spelling out precise processing requirements for. Therefore, there has been a long push towards well-defined behavior even in error situations on the Web without debating each particular error case...

I chose this behavior because it's easy to implement, and it's the one required by the Go spec, and if I recall correctly, Java's String.getBytes(utf8) uses '?' as its replacement (!!), and thus isn't compatible with any reading of Unicode, so we'll have to handle the conversion ourselves anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is consistent with practices in Go, it's good enough for me.

In other words `x == y` implies `hash(x) == hash(y)`.
Any other type of argument in an error, even if it is suitable as the key of a dict.

In the interests of reproducibility of Starlark program behavior over time and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interests of reproducibility

Which is exactly what Python aims to avoid for the sake of thwarting collision DoS attacks... Can't be helped, I suppose. Do we have the option to extend hash() with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but Starlark prizes determinism higher than DoS protection. This was a material problem when I implemented Herb: execution of BUILD files would vary based on the hash function, causing (for example) different elements to be chosen from a set, causing different results, and in some cases, errors only in one implementation.

Do we have the option to extend hash() with a seed argument in the future? Or perhaps at that point you'd just run the interpreter in a mode where it's allowed to substitute an arbitrary hash function.

No, we don't have an option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, put it this way: If Starlark is ever used by an application that wants DoS protection and values it over determinism, they'll do what they want in the implementation (to the extent that they control their interpreter) and ignore the spec. So it's just a choice of whether we bless it or not, and I don't feel strongly either way.

spec.md Show resolved Hide resolved
spec.md Show resolved Hide resolved
This change defines the semantics of:
- str(bytes) -- UTF-k decoding with U+FFFD replacement
- bytes(str) -- UTF-k encoding with U+FFFD replacement
- bytes.elems() -- iterable of int values of byte elements
- hash(bytes) -- 32-bit FNV-1a hash
- bytes in bytes -- substring test
- int in bytes -- element membership test

Updates bazelbuild#112

Change-Id: Ide3459c4115fff718197001c381da4da7a45a9d7
@google-cla
Copy link

google-cla bot commented Feb 12, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@alandonovan alandonovan merged commit 0b9b3d0 into bazelbuild:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants