spec: bytes data type #161

Merged
merged 2 commits into from Feb 11, 2021

alandonovan
Contributor

This change adds an initial specification of the bytes data type
following lengthy discussion in #112.
It also explains the implementation-dependent encoding of
text strings, and the \u \U \X escapes.

More will follow, but let's get the easy parts out of the way first.

Updates #112
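
As a rough illustration of the distinction the description draws, here is a minimal sketch in Python, used only as a stand-in for Starlark. The variable names are illustrative, and nothing here is taken from the spec text under review. It shows a bytes literal next to a text string written with `\u` and `\U` escapes, and why the number of elements per code point depends on the encoding an implementation chooses (UTF-8 and UTF-16 are used as examples).

```python
# Python stand-in, not Starlark: bytes vs. text, and the \u / \U escapes.
text = "\u00e9"          # \u takes 4 hex digits: U+00E9
astral = "\U0001F600"    # \U takes 8 hex digits: U+1F600

data = b"\xc3\xa9"       # a bytes literal holding two raw byte values

# Encoding the one-code-point text as UTF-8 yields exactly those two bytes.
utf8 = text.encode("utf-8")

# The same code point occupies a different number of elements
# depending on the encoding used to represent the string.
utf8_len = len(astral.encode("utf-8"))               # count of UTF-8 bytes
utf16_units = len(astral.encode("utf-16-le")) // 2   # count of UTF-16 code units
```

In Python a `str` is always a sequence of code points, so this only approximates the implementation-dependent behavior the description mentions.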

Member

brandjon left a comment

(Still reviewing)

@@ -419,6 +475,41 @@ b" # "a\\\nb"
It is an error for a backslash to appear within a string literal other
Member

GitHub won't let me comment on the above lines, so quoting here:

> Regardless of the platform's convention for text line endings---for example, a linefeed (\n) on UNIX, or a carriage return followed by a linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a multiline string literal always denotes a line feed (\n).

Do we specify what constitutes a line ending in a Starlark source file? Specifically, what the algorithm is for converting a raw U+000D or U+000A or a pair of them, into a single U+000A?

> Starlark also supports raw string literals, which look like an ordinary single- or double-quotation preceded by r. Within a raw string literal, there is no special processing of backslash escapes, other than an escaped quotation mark (which denotes a literal quotation mark), or an escaped newline (which denotes a backslash followed by a newline). This form of quotation is typically used when writing strings that contain many quotation marks or backslashes (such as regular expressions or shell commands) to reduce the burden of escaping:

r'\'' denotes a literal backslash and quote, not a quote by itself. Also, raw string literals don't help you when there are many literal quotes, since you still have to escape them (if they match the string's opening and closing quote type), but they do help you when there are many literal backslashes.
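
The point about `r'\''` can be checked in Python, whose raw-string rules resemble the quoted paragraph. Treating Python's behavior as representative of Starlark is an assumption of this sketch; the variable names are illustrative.

```python
# Python stand-in: in a raw string, a backslash before a quote keeps the
# string open, but the backslash itself remains part of the value.
s = r'\''
# s holds two characters: a backslash followed by a single quote.

# Raw strings help with many backslashes...
pattern = r"\d+\.\d+"

# ...but not with many quotes: a quote matching the delimiter must still
# be preceded by a backslash, and that backslash stays in the result.
quoted = r'it\'s'
```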

Contributor Author

> Do we specify what constitutes a line ending in a Starlark source file? Specifically, what the algorithm is for converting a raw U+000D or U+000A or a pair of them, into a single U+000A?

What more needs to be said? The scanner needs to recognize line endings (however they are defined by the platform), in three places:

  1. escaped, in which case they are ignored;
  2. unescaped, outside a string literal, where they make a NEWLINE token;
  3. unescaped, in a multiline string literal, where they make a \n (as the quoted paragraph explains).
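
The three cases above can be sketched as a toy classifier. This is a hypothetical illustration, not the scanner of any Starlark implementation: the function name `classify_line_endings` and the event labels are invented here, line endings are assumed to already be a single `\n`, and the sketch only knows about `"""` strings (it ignores comments, raw strings, single-line strings, and escaped quotes).

```python
def classify_line_endings(src):
    """Label each line ending in src according to the three cases.

    Toy sketch: assumes "\n" line endings and only triple-double-quoted
    multiline strings.
    """
    events = []
    in_multiline = False
    i = 0
    while i < len(src):
        if src.startswith('"""', i):
            # Entering or leaving a multiline string literal.
            in_multiline = not in_multiline
            i += 3
        elif src.startswith("\\\n", i):
            # Case 1: escaped line ending, ignored.
            events.append("ignored")
            i += 2
        elif src[i] == "\n":
            # Case 3 inside a multiline string, case 2 otherwise.
            events.append("string-newline" if in_multiline else "NEWLINE")
            i += 1
        else:
            i += 1
    return events


example = 'x = 1 + \\\n 2\ny = """a\nb"""\n'
events = classify_line_endings(example)
```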

Member

The scanner is part of the spec, no? I believe Python defines newlines in a platform-independent way, and we should probably do the same. A raw \r\n on unix should still produce a single \n, not a \r\n, inside a multiline string literal.
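
To make the request concrete, here is a small Python sketch. The first part shows CPython's own behavior: a raw `\r\n` inside a triple-quoted literal in source text comes out as a single `\n`. The second part is a hypothetical pre-pass, `normalize_line_endings`, that a scanner could apply to get the same platform-independent result. Treating a lone `\r` as a line ending is an assumption of this sketch, not something the spec text in this PR states.

```python
import re

# CPython normalizes line endings in source text before tokenizing, so a
# raw CR LF inside a multiline literal evaluates to a single "\n".
value = eval('"""a\r\nb"""')

# Hypothetical pre-pass: map CR LF, lone CR, and LF to a single LF.
# The CR LF alternative comes first so the pair is replaced as one unit.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def normalize_line_endings(source):
    """Return source with every line ending rewritten as "\n"."""
    return _LINE_ENDING.sub("\n", source)


normalized = normalize_line_endings('x = """a\r\nb"""\r\n')
```

With a pre-pass like this, later scanner stages see the same input on every platform.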

Member

A TODO is fine for unblocking this PR.

Contributor Author

My point is that the only place that needs to take a stance on the concrete representation of a line ending is case 3, which already spells it out thus:

> Regardless of the platform's convention for text line endings---for
> example, a linefeed (\n) on UNIX, or a carriage return followed by a
> linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
> multiline string literal always denotes a line feed (\n).

brandjon
Member

Driveby: "type(x) returns a string describing the type of its operand." -> backticks around type(x).

brandjon
Member

Also:

Operand = identifier
        | int | float | string
        | ListExpr | ListComp
        | DictExpr | DictComp
        | '(' [Expression [',']] ')'

Needs a | bytes there.

Contributor Author

alandonovan left a comment

Thanks Jon; PTAL.

This change adds an initial specification of the bytes data type
following lengthy discussion in bazelbuild#112.
It also explains the implementation-dependent encoding of
text strings, and the \u and \U escapes.

More will follow, but let's get the easy parts out of the way first.

Updates bazelbuild#112

Change-Id: I8cfbb4910c2f85a1076f9b8bdf1081c89dd5948a

alandonovan
Contributor Author

> Driveby: "type(x) returns a string describing the type of its operand." -> backticks around type(x).

Done.

> Needs a | bytes there.

Done.

> The scanner is part of the spec, no? I believe Python defines newlines in a platform-independent way, and we should probably do the same. A raw \r\n on unix should still produce a single \n, not a \r\n, inside a multiline string literal.

Yes. We say that here:
"""
Regardless of the platform's convention for text line endings---for
example, a linefeed (\n) on UNIX, or a carriage return followed by a
linefeed (\r\n) on Microsoft Windows---an unescaped line ending in a
multiline string literal always denotes a line feed (\n).
"""

Change-Id: I868d12b5c9c02a86903541d1cdf7907fbed5f56e