Skip to content

Commit

Permalink
Unicode and UTF-8 clarifications from toml-lang#924
Browse files Browse the repository at this point in the history
  • Loading branch information
eksortso committed Oct 27, 2022
1 parent 487b38c commit 5f0236a
Show file tree
Hide file tree
Showing 2 changed files with 14 additions and 7 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

## unreleased

- Clarify Unicode and UTF-8 references.
- Allow newline after key/values in inline tables.
- Allow trailing comma in inline tables.
- Clarify where and how dotted keys define tables.
Expand Down
20 changes: 13 additions & 7 deletions toml.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,15 @@ should be easy to parse into data structures in a wide variety of languages.

## Spec

A TOML file must be a valid UTF-8 encoded Unicode document. Specifically this
means that, should a file as a whole not form a
[well-formed code-unit sequence](https://unicode.org/glossary/#well_formed_code_unit_sequence),
the file must be rejected (preferably) or ill-formed byte sequences must be
replaced with U+FFFD as per the Unicode spec.

- TOML is case-sensitive.
- A TOML file must be a valid UTF-8 encoded Unicode document.
- Whitespace means tab (0x09) or space (0x20).
- Newline means LF (0x0A) or CRLF (0x0D 0x0A).
- Whitespace means tab (U+0009) or space (U+0020).
- Newline means LF (U+000A) or CRLF (U+000D U+000A).

## Comment

Expand Down Expand Up @@ -265,7 +270,7 @@ The above TOML maps to the following JSON.
## String

There are four ways to express strings: basic, multi-line basic, literal, and
multi-line literal. All strings must contain only valid UTF-8 characters.
multi-line literal. All strings must contain only Unicode characters.

**Basic strings** are surrounded by quotation marks (`"`). Any Unicode character
may be used except those that must be escaped: quotation mark, backslash, and
Expand Down Expand Up @@ -293,7 +298,7 @@ For convenience, some popular characters have a compact escape sequence.
```

Any Unicode character may be escaped with the `\xHH`, `\uHHHH`, or `\UHHHHHHHH`
forms. The escape codes must be valid Unicode
forms. The escape codes must be Unicode
[scalar values](https://unicode.org/glossary/#unicode_scalar_value).

All other escape sequences not listed above are reserved; if they are used, TOML
Expand Down Expand Up @@ -417,8 +422,9 @@ str = ''''That,' she said, 'is still pointless.''''
```

Control characters other than tab are not permitted in a literal string. Thus,
for binary data, it is recommended that you use Base64 or another suitable ASCII
or UTF-8 encoding. The handling of that encoding will be application-specific.
for binary data, it is recommended that you use Base64 or another suitable
binary-to-text encoding. The handling of that encoding will be
application-specific.

## Integer

Expand Down

0 comments on commit 5f0236a

Please sign in to comment.