| Field | Value |
|---|---|
| DIP: | 1026 |
| Review Count: | 0 |
| Author: | Dennis Korpel dkorpel@gmail.com |
| Implementation: | |
| Status: | Community Review Round 1 |
D is intended to have a context-free grammar allowing for easy lexing and parsing for text editors or syntax highlighters. However, delimited strings are context sensitive. Since they are rarely used, it is proposed that delimited strings be deprecated and later removed from the language.
- Definitions
- Rationale
- Prior Work
- Description
- Breaking Changes and Deprecations
- Reference
- Copyright & License
- Reviews
This DIP discusses string literals starting with a q, of which there are multiple variants in D.
They will be referred to by the following names:
// "identifier-delimited strings"
q"EOS
text
EOS";
// "bracket-delimited strings"
q"(text)";
q"<text>";
q"[text]";
q"{text}";
// "character-delimited strings"
q"/text/";
// "delimited strings" refers to any of the above
// "token strings"
q{if () else {}};
In the reference compiler's error messages, identifier-delimited strings are referred to as 'heredocs'. In this DIP, "Here document" refers to the practice of using string literals to include large bodies of text directly in source code, rather than to a specific type of string literal.
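As a minimal sketch of what the forms named above evaluate to, the following compile-time checks compare each literal with an ordinary double-quoted string. The name `heredoc` is illustrative only and does not come from the compiler or the specification.

```d
// Identifier-delimited: the text starts after the line break that follows
// the opening identifier and runs up to the line holding the same identifier.
enum heredoc = q"EOS
text
EOS";
static assert(heredoc == "text\n");

// Bracket-delimited: each of the four bracket pairs nests.
static assert(q"(text)" == "text");
static assert(q"<text>" == "text");
static assert(q"[text]" == "text");
static assert(q"{text}" == "text");
static assert(q"(f(x))" == "f(x)");

// Character-delimited: the same character opens and closes the text.
static assert(q"/text/" == "text");

// Token string: the body must consist of valid D tokens.
static assert(q{if () else {}} == "if () else {}");
```

Because every check is a `static assert`, compiling the file is enough to confirm the equalities.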
Regarding language design, Walter Bright has stated:
Things that are true gods:
Context free grammars. What this really means is the code should be parseable
without having to look things up in a symbol table. C++ is famously not a
context free grammar. A context free grammar, besides making things a lot
simpler, means that IDEs can do syntax highlighting without integrating in
most of a compiler front end, i.e. third party tools become much more likely
to exist.
D is intended to be parseable with a context-free grammar, but some users have expressed doubts about that, because certain constructs (such as static versus associative arrays) cannot be told apart by a context-free grammar. Replies to those doubts point out that the parser only decides the basic structure of code, and that such type-related distinctions are made during semantic analysis. However, as pointed out by a Stack Overflow user, there is one construct that, without a doubt, proves that D's grammar is not actually context-free:
DelimitedString:
q" Delimiter WysiwygCharacters MatchingDelimiter "
Considering that a delimiter can be an identifier, this is almost a textbook example of a context-sensitive construct.
The situation is not so bad that a symbol table is required, but Delimiter and MatchingDelimiter stand out as informal terms in an otherwise formal context-free grammar.
Consequently, in the D language grammar example from Pegged where every rule has to be formally defined, delimited strings are assumed not to exist.
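The difference can be sketched with two small scanners. This is a simplified illustration, not how the reference compiler's lexer is written; the function names `heredocText` and `bracketText` are hypothetical, and only ASCII identifiers are handled. The first scanner has to remember the identifier it read in order to build the terminator, while the second only tracks a nesting depth against closers that are known in advance.

```d
import std.algorithm.searching : startsWith;
import std.ascii : isAlpha, isAlphaNum;
import std.string : indexOf;

/// Returns the text of an identifier-delimited literal, or null if `src`
/// is not one. ASCII identifiers only, which is a simplification.
string heredocText(string src)
{
    if (!src.startsWith(`q"`) || src.length < 3) return null;
    if (!isAlpha(src[2]) && src[2] != '_') return null;

    size_t i = 3;
    while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
    string delimiter = src[2 .. i]; // context that must be remembered
    if (i == src.length || src[i] != '\n') return null;

    // The terminator is built from text read earlier, so no fixed rule names it.
    string terminator = "\n" ~ delimiter ~ `"`;
    auto pos = src[i .. $].indexOf(terminator);
    if (pos < 0) return null;
    return src[i + 1 .. i + pos + 1];
}

/// Returns the text of a bracket-delimited literal, or null if `src` is not one.
string bracketText(string src)
{
    if (!src.startsWith(`q"`) || src.length < 3) return null;
    immutable opener = src[2];
    char closer;
    switch (opener)
    {
        case '(': closer = ')'; break;
        case '[': closer = ']'; break;
        case '{': closer = '}'; break;
        case '<': closer = '>'; break;
        default: return null;
    }
    size_t depth = 0;
    foreach (i; 2 .. src.length)
    {
        if (src[i] == opener) ++depth;
        else if (src[i] == closer && --depth == 0)
            return (i + 1 < src.length && src[i + 1] == '"') ? src[3 .. i] : null;
    }
    return null;
}

unittest
{
    assert(heredocText("q\"EOS\ntext\nEOS\"") == "text\n");
    assert(heredocText("q\"EOS\ntext\nEOF\"") is null); // identifiers differ
    assert(bracketText(`q"(f(x))"`) == "f(x)");
}
```

The checks can be run with `dmd -unittest -main -run` followed by the file name.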
Luckily, delimited strings are not very commonly used in D code. To quantify the usage, a scan of string literal usage was performed using libdparse. The search is limited to:
- a snapshot from July 2019 of all packages registered on dub
- repositories that could be cloned with git (no private/deleted/mercurial repositories)
- files with a `.d` extension
- string literals directly in source code, not inside a string mixin or build artifact
With these limitations the result does not give the full picture, but it still gives a reasonable impression. The following observations were made:
- There are 714 identifier-delimited or character-delimited strings, 509 of which are in the package 'DStep'
- There are 152 bracket-delimited strings (`q"( like this )"`)
- Identifier-delimited strings are used at least once in 28/1586 packages
- Bracket-delimited strings are used at least once in 32/1586 packages
- Identifier-delimited strings make up 0.06% of all string literals (0.02% without DStep)
A spreadsheet with the full breakdown is linked in the Reference section ("String literal usage breakdown per dub project").
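To put the counts above into proportion, here is a small worked calculation using only the numbers quoted in the observations. The constant names are illustrative. The 0.06% figure is not recomputed because the total number of string literals is not stated here.

```d
enum packages = 1586;     // dub packages scanned
enum withIdentifier = 28; // packages using identifier-delimited strings
enum withBracket = 32;    // packages using bracket-delimited strings
enum identOrChar = 714;   // identifier- or character-delimited literals
enum inDStep = 509;       // of those, found in DStep

enum double identifierShare = 100.0 * withIdentifier / packages; // about 1.8 %
enum double bracketShare = 100.0 * withBracket / packages;       // about 2.0 %
enum double dstepShare = 100.0 * inDStep / identOrChar;          // about 71 %

static assert(identOrChar - inDStep == 205); // literals outside DStep
static assert(identifierShare > 1.76 && identifierShare < 1.77);
static assert(bracketShare > 2.01 && bracketShare < 2.02);
static assert(dstepShare > 71.2 && dstepShare < 71.3);
```

In other words, roughly one package in fifty uses either form, and about seven in ten of the identifier-delimited or character-delimited literals come from a single package.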
Possible reasons for the rare usage of delimited strings include:
- Here documents are not needed as often as short strings
- String literals (even regular "" ones) can span multiple lines in D
- `q{}` strings cover D code and DSLs with similar lexical syntax, such as JSON or GLSL.
- WYSIWYG strings like `text` cover many other cases: Pegged grammars, SQL, regex, English text.
- If a backtick (`) is needed, it can be concatenated using the `~` operator, or an `r""` string can be used.
- The `import("file.txt")` expression completely eliminates the need for any escaping, and even allows binary data. It also encourages separation between code and data.
- Even if the programmer is looking for delimited strings, chances are they won't easily find them.
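The alternatives listed above can be sketched as follows. The strings and the names `multiLine`, `json`, and `spliced` are made up for illustration.

```d
// A regular double-quoted literal may span lines; the line break is kept.
enum multiLine = "first
second";
static assert(multiLine == "first\nsecond");

// Token string: the body only has to lex as D tokens, which this JSON does.
enum json = q{{"name": "example", "tags": [1, 2]}};
static assert(json == `{"name": "example", "tags": [1, 2]}`);

// WYSIWYG strings take backslashes literally, which suits regexes.
static assert(`\d+\.\d+` == "\\d+\\.\\d+");
static assert(r"\d+\.\d+" == `\d+\.\d+`);

// A backtick can be spliced in with `~`, or written inside an r"" string.
enum spliced = `run ` ~ "`" ~ `dub build` ~ "`";
static assert(spliced == r"run `dub build`");

// import("file.txt") would embed a file verbatim. It needs the -J compiler
// switch and an existing file, so it is only mentioned here.
```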
Removing identifier-delimited strings would mean context-free parser generators could be used for D, with no asterisks. It would also reduce language complexity and make the specification more formal. The downside is that it would break some code in cases where there is no direct replacement. However, as shown above, in practice there are not many of these string literals and they can often be easily replaced with other string literals.
Hexstring literals have been deprecated since version 2.079 (March 1 2018), under the rationale that "Hexstrings are used so seldom that they don't warrant a language feature". As of July 2019, there are still 1035 hex string literals spread across 15 registered dub packages.
The deprecation of context-sensitive string literals can be considered similar to the deprecation of hexstring literals, with a few differences:
- Deprecation warnings for hexstring literals have a trivial fix (use `std.conv: hexString` instead), but there is no trivial fix for all identifier-delimited string literal deprecation warnings. However, in practice, identifier-delimited strings are often easily replaceable with bracket-delimited strings or WYSIWYG strings.
- Hexstring literals are context-free; identifier-delimited strings add much more language complexity since they are not context-free. Therefore, removing the latter simplifies the language more than the removal of hexstring literals did.
Rosetta Code has a comparison of Here documents in different languages. D's q-strings are mentioned there, but the syntax highlighting in the code snippet is messy, nicely illustrating the problem with them. Notable languages having identifier-delimited string literals are C++, Perl, and PHP. Other languages such as Java, JavaScript, Python, C#, and Rust feature only context-free string literals.
Identifier-delimited strings, like hex strings before them, become deprecated following the usual deprecation cycle.
Character-delimited strings also become deprecated.
Even though using a single character as a delimiter is still context free, it requires adding a parser state for every possible character, of which there are many.
Token strings (q{}) and strings with brackets as delimiters (q"{}", q"()", q"<>", q"[]") are unaffected and can be suggested as replacements.
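A migration along these lines might look like the sketch below. The SQL text and the names `before` and `after` are invented for illustration. The rewrite is only this direct when the brackets inside the text are balanced.

```d
// Identifier-delimited literal, deprecated under this proposal.
enum before = q"SQL
SELECT name FROM users
WHERE id = 1
SQL";

// Bracket-delimited replacement with identical contents.
enum after = q"(SELECT name FROM users
WHERE id = 1
)";
static assert(before == after);

// Character-delimited literal and an equivalent WYSIWYG string.
static assert(q"/a\b/" == `a\b`);

// Text with an unbalanced ')' needs a different bracket pair.
static assert(q"[:-)]" == ":-)");
```

Where no bracket pair fits, a WYSIWYG string or an import expression remains available.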
The current rule for delimited strings is removed and rules are added for specifically allowed delimiters.
DelimitedString:
- q" Delimiter WysiwygCharacters MatchingDelimiter "
+ q"( WysiwygCharacters )"
+ q"{ WysiwygCharacters }"
+ q"[ WysiwygCharacters ]"
+ q"< WysiwygCharacters >"
Any code that uses identifier-delimited or character-delimited strings must be changed to use a different string literal or an import expression. Below is a list of DUB packages that use at least one of these literals and thus need to be updated:
dstep
pgator
orange
tsv-utils
dentist
jwtd-es
jwtd
ae
dtools
button
msgpack-rpc
dmd
arsd-official
ddmd-experimental
dini
gpgme-d
phobos
dalicious
dlangide
remarkify
dcaptcha
ddata
doveralls
dtest
easing
flod
libdparse
nwn-lib-d
- "Here document" on Wikipedia
- Walter Bright on context-free grammars
- Context-free grammar: yes
- "is the advertising as 'context-free grammar' wrong?"
- Stack Overflow post describing the issue
- D language grammar example from Pegged
- String literal usage breakdown per dub project
- Here documents in different languages
Copyright (c) 2019 by the D Language Foundation
Licensed under Creative Commons Zero 1.0
The DIP Manager will supplement this section with a summary of each review stage of the DIP process beyond the Draft Review.