spec: bytes data type #161

alandonovan · 2021-02-05T20:23:10Z

This change adds initial specification of the bytes data type
following lengthy discussion in #112.
It also explains the implementation-dependent encoding of
text strings, and the \u \U \X escapes.

More will follow, but let's get the easy parts out of the way first.

Updates #112
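As a rough illustration of why the encoding of text strings is worth spelling out, the Python sketch below builds one string from a `\u` and a `\U` escape and counts its elements three ways. It is not taken from the spec; the choice of UTF-8 and UTF-16 as the two candidate encodings is an assumption made for the example.

```python
# One string written with a \u escape (4 hex digits) and a \U escape (8 hex digits).
s = "\u00e9\U0001F600"  # U+00E9 followed by U+1F600

# Number of Unicode code points, independent of any encoding.
code_points = len(s)

# Number of elements if the string is stored as UTF-8 bytes (assumed encoding).
utf8_units = len(s.encode("utf-8"))

# Number of elements if the string is stored as UTF-16 code units (assumed encoding).
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf8_units, utf16_units)
```

The same literal therefore has a different length depending on the representation, which is the kind of difference the spec has to call out as implementation-dependent.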

brandjon

(Still reviewing)

brandjon · 2021-02-10T16:44:14Z

spec.md

@@ -419,6 +475,41 @@ b"		# "a\\\nb"
 It is an error for a backslash to appear within a string literal other


Github won't let me comment on the above lines so quoting here:

> Regardless of the platform's convention for text line endings---for example, a linefeed (\n) on UNIX, or a carriage return followed by a linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a multiline string literal always denotes a line feed (\n).

Do we specify what constitutes a line ending in a Starlark source file? Specifically, what the algorithm is for converting a raw U+000D or U+000A or a pair of them, into a single U+000A?

> Starlark also supports raw string literals, which look like an ordinary single- or double-quotation preceded by r. Within a raw string literal, there is no special processing of backslash escapes, other than an escaped quotation mark (which denotes a literal quotation mark), or an escaped newline (which denotes a backslash followed by a newline). This form of quotation is typically used when writing strings that contain many quotation marks or backslashes (such as regular expressions or shell commands) to reduce the burden of escaping:

`r'\''` denotes a literal backslash and quote, not a quote by itself. Also, raw string literals don't help you when there are many literal quotes, since you still have to escape them (if they match the string's opening and closing quote type), but they do help you when there are many literal backslashes.
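Python's raw string literals behave the way this comment describes, so the claim can be checked there. The snippet below is only an illustration in Python, not text from the spec, and the variable names are made up for the example.

```python
# In a raw literal, \' keeps the backslash: the value is backslash + quote.
escaped_quote = r'\''

# Raw literals do help with backslashes: no doubling is needed.
regex_like = r'\d+\.\d+'

# A literal quote of the other kind needs no escape at all.
other_quote = r"it's"

print(len(escaped_quote), regex_like, other_quote)
```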

> Do we specify what constitutes a line ending in a Starlark source file? Specifically, what the algorithm is for converting a raw U+000D or U+000A or a pair of them, into a single U+000A?

What more needs to be said? The scanner needs to recognize line endings (however they are defined by the platform), in three places:

1. escaped, in which case they are ignored;
2. unescaped, outside a string literal, where they make a NEWLINE token;
3. unescaped, in a multiline string literal, where they make a \n (as the quoted paragraph explains).
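A minimal Python sketch of the first and third cases, applied to the body of a non-raw multiline string literal, might look like this. The function name `decode_line_endings` is hypothetical, and treating `\r\n`, a lone `\r`, and a lone `\n` as the set of line endings is an assumption that the thread has not settled.

```python
import re

# Assumed definition of a line ending: CRLF, lone CR, or lone LF.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def decode_line_endings(body: str) -> str:
    """Apply the line-ending cases to the body of a multiline string literal.

    Other backslash escapes are copied through undecoded; only line
    endings are handled here.
    """
    out = []
    i = 0
    while i < len(body):
        escaped = body[i] == "\\"
        # Look for a line ending right here, or right after a backslash.
        m = _LINE_ENDING.match(body, i + 1 if escaped else i)
        if m:
            if not escaped:
                out.append("\n")  # unescaped line ending denotes \n
            # An escaped line ending contributes nothing.
            i = m.end()
        elif escaped:
            out.append(body[i:i + 2])  # keep other escapes as a pair
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


print(repr(decode_line_endings("a\\\r\nb\r\nc")))
```

Copying other escapes as a two-character pair keeps an escaped backslash from being mistaken for the start of an escaped line ending.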

The scanner is part of the spec, no? I believe Python defines newlines in a platform-independent way, and we should probably do the same. A raw \r\n on unix should still produce a single \n, not a \r\n, inside a multiline string literal.

A TODO is fine for unblocking this PR.
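The platform-independent treatment suggested here could be sketched in Python as a normalization pass over the source text before scanning. This is an assumption about one possible algorithm, not something the spec states; `normalize_line_endings` is a hypothetical name, and whether a lone `\r` should count as a line ending is left open in the discussion.

```python
def normalize_line_endings(source: str) -> str:
    """Map CRLF and lone CR to LF, regardless of the host platform."""
    # Replace CRLF first so its CR is not converted a second time.
    return source.replace("\r\n", "\n").replace("\r", "\n")


# A raw CRLF yields a single LF, even when run on UNIX.
print(repr(normalize_line_endings('x = """a\r\nb"""\r\n')))
```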

My point is that the only place that needs to take a stance on the concrete representation of a line ending is case 3, which already spells it out thus:

> Regardless of the platform's convention for text line endings---for
> example, a linefeed (\n) on UNIX, or a carriage return followed by a
> linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
> multiline string literal always denotes a line feed (\n).

brandjon · 2021-02-10T21:30:43Z

Driveby: "type(x) returns a string describing the type of its operand." -> backticks around type(x).

brandjon · 2021-02-10T21:31:37Z

Also:

Operand = identifier
        | int | float | string
        | ListExpr | ListComp
        | DictExpr | DictComp
        | '(' [Expression [',']] ')'

Needs a | bytes there.

alandonovan

Thanks Jon; PTAL.

This change adds initial specification of the bytes data type
following lengthy discussion in bazelbuild#112.
It also explains the implementation-dependent encoding of
text strings, and the \u and \U escapes.

More will follow, but let's get the easy parts out of the way first.

Updates bazelbuild#112

Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a

brandjon · 2021-02-11T18:25:50Z

spec.md

@@ -419,6 +475,41 @@ b"		# "a\\\nb"
 It is an error for a backslash to appear within a string literal other


The scanner is part of the spec, no? I believe Python defines newlines in a platform-independent way, and we should probably do the same. A raw \r\n on unix should still produce a single \n, not a \r\n, inside a multiline string literal.

alandonovan · 2021-02-11T18:47:27Z

> Driveby: "type(x) returns a string describing the type of its operand." -> backticks around type(x).

Done.

> Needs a | bytes there.

Done.

> The scanner is part of the spec, no? I believe Python defines newlines in a platform-independent way, and we should probably do the same. A raw \r\n on unix should still produce a single \n, not a \r\n, inside a multiline string literal.

Yes. We say that here:
"""
Regardless of the platform's convention for text line endings---for
example, a linefeed (\n) on UNIX, or a carriage return followed by a
linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
multiline string literal always denotes a line feed (\n).
"""

Change-Id: I868d12b5c9c02a86903541d1cdf7907fbed5f56e

alandonovan requested a review from laurentlb as a code owner February 5, 2021 20:23

alandonovan requested a review from brandjon February 5, 2021 20:23

adonovan force-pushed the bytes branch 2 times, most recently from 794ca8b to 4a69c45 Compare February 5, 2021 22:33

alandonovan mentioned this pull request Feb 5, 2021

starlark: add 'bytes' data type, for binary strings google/starlark-go#330

Merged

adonovan force-pushed the bytes branch from 4a69c45 to f7de256 Compare February 5, 2021 22:51

illicitonion reviewed Feb 5, 2021

adonovan force-pushed the bytes branch from f7de256 to e683647 Compare February 8, 2021 18:37

brandjon reviewed Feb 9, 2021

brandjon reviewed Feb 10, 2021

alandonovan commented Feb 10, 2021

brandjon reviewed Feb 11, 2021

brandjon approved these changes Feb 11, 2021

review

5cdf36f

Change-Id: I868d12b5c9c02a86903541d1cdf7907fbed5f56e

adonovan force-pushed the bytes branch from c52f5d9 to 5cdf36f Compare February 11, 2021 18:48

alandonovan merged commit 7f53743 into bazelbuild:master Feb 11, 2021
