Initial implementation of raw string literals #1304

SlaterLatiao · 2022-06-01T18:49:54Z

Combined the simple string and block string rules in lexer.lpp, and implemented raw string literals.
Modified ParseBlockStringLiteral() and UnescapeStringLiteral() in string_helpers.cpp to support raw string literals.

josh11b

I've not finished reading the code, but I thought I'd send you what I have. I would encourage you to think about ways of writing code that uses fewer boolean flags. They make the code harder to read and reason about.

explorer/syntax/lexer.lpp

…tring starting with #+\'\'\'

explorer/syntax/lexer.lpp

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

jonmeow

This is looking good! I'm mostly making code style comments. Please don't be bothered by that -- code review like this is often how we end up teaching style. If you have any questions, I should be available over IM.

explorer/syntax/lexer.lpp

common/string_helpers.h

explorer/syntax/lexer.lpp

explorer/testdata/string/fail_hex_lower.carbon

jonmeow

This is looking good! Just a few more comments.

common/string_helpers.cpp

explorer/syntax/lex_helper.h

explorer/syntax/lex_scan_helper.cpp

explorer/syntax/lex_scan_helper.h

jonmeow · 2022-06-08T17:58:54Z

explorer/syntax/lex_scan_helper.h

+  // EOF.
+  auto Advance() -> bool;
+  // Returns last scanned char.
+  auto last_char() -> int { return str_.back(); };


Is there a reason this is int rather than char?

The original ReadChar has return type of int. Switched to char instead.

explorer/syntax/lexer.lpp

explorer/syntax/lex_scan_helper.h

josh11b · 2022-06-08T20:19:13Z

common/string_helpers.cpp

@@ -27,87 +27,86 @@ static auto FromHex(char c) -> std::optional<char> {
  return std::nullopt;
 }

-auto UnescapeStringLiteral(llvm::StringRef source, bool is_block_string)
-    -> std::optional<std::string> {
+auto UnescapeStringLiteral(llvm::StringRef source, const size_t hashtag_num,


Use int instead of size_t. Avoid unsigned types unless they are needed: https://google.github.io/styleguide/cppguide.html#Integer_Types

Replaced size_t with int.
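
For context on the style-guide point, here is a minimal sketch of the kind of surprise unsigned arithmetic can cause. The helper names are hypothetical and do not appear in the PR; the only fact relied on is that unsigned subtraction wraps around instead of going negative.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper, not from the PR: how many hashtags are still expected.
// With unsigned operands, `expected - seen` wraps around when seen > expected.
inline auto RemainingUnsigned(std::size_t expected, std::size_t seen)
    -> std::size_t {
  return expected - seen;
}

// The same computation with int yields a negative number, which an ordinary
// `< 0` comparison can detect.
inline auto RemainingSigned(int expected, int seen) -> int {
  return expected - seen;
}
```

With these definitions, RemainingUnsigned(0, 1) is the largest std::size_t value, while RemainingSigned(0, 1) is -1.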

explorer/syntax/lex_scan_helper.cpp

josh11b · 2022-06-08T20:22:47Z

explorer/syntax/lexer.lpp

+  const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
+  const size_t leading_quotes = std::count(s.begin(), s.end(), '"');


I think the current code works because we know the string has to satisfy the regexp #*(\"\"\"|\") due to line 266.

No comment on whether the multi_line flag is easier to read. It is a case where I think a boolean flag is reasonable, particularly if it can be declared const so you know that it never changes.
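
One way to avoid depending on the regexp guarantee is to count by position rather than with std::count over the whole token. The sketch below is only an illustration: OpeningDelimiter and CountOpeningDelimiter are hypothetical names, and std::string_view is used in place of whatever string type lexer.lpp actually holds.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type; the PR keeps these in two local variables.
struct OpeningDelimiter {
  int hashtag_num = 0;
  int leading_quotes = 0;
};

// Counts the run of '#' at the start of `s`, then the run of '"' that
// immediately follows it. Unlike std::count over the whole token, the result
// does not change if other '#' or '"' characters appear later in `s`.
inline auto CountOpeningDelimiter(std::string_view s) -> OpeningDelimiter {
  OpeningDelimiter result;
  std::size_t pos = 0;
  while (pos < s.size() && s[pos] == '#') {
    ++result.hashtag_num;
    ++pos;
  }
  while (pos < s.size() && s[pos] == '"') {
    ++result.leading_quotes;
    ++pos;
  }
  return result;
}
```

Both counters can be stored in const locals at the call site, which keeps the "declared const so you know that it never changes" property mentioned above.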

josh11b · 2022-06-08T20:23:28Z

explorer/syntax/lexer.lpp

+    if (Carbon::ReadHashTags(str_lex_helper, hashtag_num)) {
+      return Carbon::ProcessSingleLineString(str_lex_helper.str(), context,
+                                             hashtag_num);
+    } else if (str_lex_helper.is_eof()) {


ReadHashTags(...) calls Advance().

explorer/syntax/lex_scan_helper.h

josh11b · 2022-06-08T20:41:16Z

explorer/syntax/lex_scan_helper.cpp

+auto ReadHashTags(Carbon::StringLexHelper& scan_helper,
+                  const size_t hashtag_num) -> bool {
+  for (size_t i = 0; i < hashtag_num; ++i) {
+    if (!scan_helper.Advance() || scan_helper.last_char() != '#') {


This is backwards from what I expect:

Suggested change

-    if (!scan_helper.Advance() || scan_helper.last_char() != '#') {
+    if (scan_helper.last_char() != '#' || !scan_helper.Advance()) {

That way only # characters would be consumed.

Advance() needs to be called before calling ReadHashTags() if hashtags are checked first. It is still possible to consume a non # char after switching the order.
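
The disagreement in this thread comes from Advance() consuming a character before it can be inspected. As a sketch only: if the scanner could look at the next character without consuming it, ReadHashTags could consume `#` characters and nothing else. ToyScanner below is a hypothetical in-memory stand-in for StringLexHelper, and Peek() is an assumed capability; the real helper reads from flex and may not be able to offer it.

```cpp
#include <cstddef>
#include <string_view>

// Toy stand-in for the PR's StringLexHelper. It scans an in-memory buffer,
// whereas the real helper pulls characters from flex. Peek() is an assumption.
class ToyScanner {
 public:
  explicit ToyScanner(std::string_view input) : input_(input) {}

  auto is_eof() const -> bool { return pos_ >= input_.size(); }
  // Returns the next character without consuming it. Requires !is_eof().
  auto Peek() const -> char { return input_[pos_]; }
  // Consumes one character. Returns false at EOF.
  auto Advance() -> bool {
    if (is_eof()) {
      return false;
    }
    ++pos_;
    return true;
  }
  // Number of characters consumed so far.
  auto consumed() const -> int { return static_cast<int>(pos_); }

 private:
  std::string_view input_;
  std::size_t pos_ = 0;
};

// Consumes up to `hashtag_num` '#' characters and nothing else. Returns true
// only if all `hashtag_num` hashtags were present.
inline auto ReadHashTags(ToyScanner& scanner, int hashtag_num) -> bool {
  for (int i = 0; i < hashtag_num; ++i) {
    if (scanner.is_eof() || scanner.Peek() != '#') {
      return false;
    }
    scanner.Advance();
  }
  return true;
}
```

On the input `#x` with hashtag_num set to 2, this version returns false after consuming only the single `#`, leaving `x` for the caller.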

josh11b · 2022-06-08T20:42:02Z

explorer/syntax/lex_scan_helper.h

+// Reads and returns a single character. Reports an error on EOF.
+auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int;
+
+// Tries to Read [hashtag_num] hashtags. Returns true on success.


I'd generally write `hashtag_num` instead of [hashtag_num].

Should document how much scan_helper is Advance()d. The change I recommend would make it clear that it advances past up to hashtag_num hashtags, and no other characters.

Edited as suggested and added comment on how many characters are read.

josh11b · 2022-06-08T20:52:21Z

explorer/syntax/lexer.lpp

+  const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
+  const size_t leading_quotes = std::count(s.begin(), s.end(), '"');
+  if (leading_quotes == 3 && hashtag_num > 0) {
+    // Check if it's a single-line string, like #"""#.


I think this logic isn't quite smart enough to handle #""""# which should be treated as the string with two double-quote characters: "". I think the block string specification limits what can be after #""", otherwise it is a single-line string. @zygoloid probably can say what the exact rule is.

https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md :

A block string literal starts with """, followed by an optional file type indicator, followed by a newline, ...

A file type indicator is any sequence of non-whitespace characters other than " or #.

Please add tests and implement this rule.

@SlaterLatiao I think this is what I'd discussed with you. I would still encourage you to gauge how much of a change it is -- even if it's a significant refactor, it may be easier to review as a delta from current work. In that case, I'd still suggest adding a TODO for this issue, finishing cleanups on this PR, getting it merged, and then adding support in a new PR. Closing out the review and getting it in can be valuable.

Added the TODO.
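
For whoever picks up the TODO, the quoted rule could be sketched roughly as follows. StartsBlockString is a hypothetical helper, not part of this PR, and it reads the quoted text strictly: nothing but file-type-indicator characters may appear between the opening `"""` and the newline.

```cpp
#include <cctype>
#include <string_view>

// Returns true if `rest`, the text immediately after an opening `"""`, begins
// a block string literal under the rule quoted above: an optional file type
// indicator (non-whitespace characters other than '"' or '#') followed by a
// newline. Hypothetical helper, not part of the PR.
inline auto StartsBlockString(std::string_view rest) -> bool {
  for (char c : rest) {
    if (c == '\n') {
      return true;
    }
    if (c == '"' || c == '#' ||
        std::isspace(static_cast<unsigned char>(c)) != 0) {
      return false;
    }
  }
  // Reached the end of the input without seeing a newline.
  return false;
}
```

For `#""""#`, the text after the opening `"""` is `"#`, so this returns false and the literal would be lexed as a single-line string containing `""`, which is the behavior described in the review comment.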

explorer/syntax/lex_scan_helper.cpp

josh11b · 2022-06-08T21:04:15Z

explorer/syntax/lex_scan_helper.cpp

+auto ProcessSingleLineString(llvm::StringRef str,
+                             Carbon::ParseAndLexContext& context,
+                             const size_t hashtag_num)
+    -> Carbon::Parser::symbol_type {


Save a copy of str for error messages before you consume the front and back. Also for ProcessMultiLineString.

Instead of copying str (its parameter type was changed to llvm::StringRef to avoid such copies), the string used for the error message is reconstructed by prepending and appending the quotes. The hashtags are not added, for consistency with ProcessMultiLineString: its error messages are produced in ParseBlockStringLiteral, and the hashtags have already been removed by the time that function is called.

Copying a llvm::StringRef should be cheap, and not involve copying the string.

Updated with copy of llvm::StringRef.
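
To illustrate why the copy is cheap, here is a small sketch that uses std::string_view as a stand-in for llvm::StringRef, so the example has no LLVM dependency; both types are a pointer plus a length. StrippedLiteral and StripSingleLineDelimiters are hypothetical names, not the PR's code.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch.
struct StrippedLiteral {
  std::string_view original;  // Full token, kept for error messages.
  std::string_view body;      // Token with the delimiters removed.
};

// Removes `hashtag_num` hashtags and one quote from each end of `str`.
// Assumes `str` is a well-formed single-line literal such as #"abc"#.
inline auto StripSingleLineDelimiters(std::string_view str, int hashtag_num)
    -> StrippedLiteral {
  // Copies only the pointer and the length, not the characters.
  const std::string_view original = str;
  const auto delimiter_size = static_cast<std::size_t>(hashtag_num) + 1;
  str.remove_prefix(delimiter_size);
  str.remove_suffix(delimiter_size);
  return {original, str};
}
```

Because `original` still points into the caller's buffer, it stays valid only as long as that buffer does, which is the same lifetime constraint llvm::StringRef has.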

josh11b · 2022-06-12T14:19:00Z

explorer/syntax/lex_scan_helper.h

+// Reads and returns a single character. Reports an error on EOF.
+auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int;


How many places is this function called from?

It is only called by Advance. I removed ReadChar and merged its logic into Advance.

explorer/syntax/lex_scan_helper.h

explorer/syntax/lex_scan_helper.cpp

josh11b · 2022-06-22T19:26:06Z

explorer/syntax/lexer.lpp

+  const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
+  const size_t leading_quotes = std::count(s.begin(), s.end(), '"');
+  if (leading_quotes == 3 && hashtag_num > 0) {
+    // Check if it's a single-line string, like #"""#.


https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md :

A block string literal starts with """, followed by an optional file type indicator, followed by a newline, ...

A file type indicator is any sequence of non-whitespace characters other than " or #.

Please add tests and implement this rule.

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

explorer/syntax/lexer.lpp

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* test cases for raw string literals
* raw string literal implementation
* match as block string if starting with triple ", and better error message for simple string except for *#"""#*
* fix broken test case
  block string literal cannot be one line
* test cases for raw string literals
* raw string literal implementation
* match as block string if starting with triple ", and better error message for simple string except for *#"""#*
* fix broken test case
  block string literal cannot be one line
* removed unused initial value
* rename flag to indicate multi-line string and remove comment
* use * to get value from std::optional
* clean-ups
* removed skip_scan flag and directly return in case of a single line string starting with #+\'\'\'
* Updated error message: simple string -> single-line string.
  Co-authored-by: josh11b <josh11b@users.noreply.github.com>
* Updated test cases according to changes in error message
* Removed counting_hashtag flag.
* Implemented ScanHelper class to handle scanning
* Fixed explanation of ReadHashTags.
* Addressed PR comment.
* Clarify that scan_helper holds the source text.
* Addressed PR comments.
* Updated error messages in test cases.
* Added const keyword to return type of GetCurrentStr().
* addressed PR comments.
  1. Moved ScanHelper class to lex_scan_helper.h and lex_scan_helper.cpp.
  2. Moved ReadHashTags and Process* functions to lex_scan_helper.cpp. Moved YY_USER_ACTION, SIMPLE_TOKEN and ARG_TOKEN to lex_helper.h. Added a wrapper function YyinputWrapper to call static function yyinput in lexer.lpp.
  3. Renamed ScanHelper with StringLexHelper.
  4. Modified BUILD accordingly.
  5. Renamed data members and functions.
* Addressed PR comments.
  1. Adjusted order to keep ret usage close.
  2. Used resize to construct the string to avoid creation of temp string.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Removed the multi_line flag and skip_read field to improve readability.
* Copied default parameter value to definition of UnescapeStringLiteral.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Copied default parameter value to definition of ParseBlockStringLiteral.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Prefix CARBON_ to SIMPLE_TOKEN and ARG_TOKEN macros.
* Rollback redefinition of arguments.
* Updated comment on the flex macro.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Updated wording.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Moved the EOF error out of the loop.
* Removed duplicated declaration.
* Changed type of `hashtag_num` and `leading_quotes` to int.
* Minor fix: string copy.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Added comment on YyinputWrapper.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Garmmar in comment.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Added check of eof before readling next char.
* Minor updates based on PR comments.
* Minor changes to address PR comments.
* Used a clearer way to calculate `hashtag_num` and `leading_quotes`. Switched back to indicate muti-line string with a flag.
* Directly copy StringRef for compilation error message.
* Make str_with_quote const as we don't change it.
  Co-authored-by: josh11b <josh11b@users.noreply.github.com>
* Added TODO for unsupported cases.
* Fixed a typo.
  Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

SlaterLatiao added 9 commits May 31, 2022 23:28

test cases for raw string literals

a4ff50d

raw string literal implementation

a31dda2

match as block string if starting with triple ", and better error mes…

d3be382

…sage for simple string except for *#"""#*

fix broken test case

ece0b7f

block string literal cannot be one line

test cases for raw string literals

4d56ea6

raw string literal implementation

7a32eec

match as block string if starting with triple ", and better error mes…

5745c55

…sage for simple string except for *#"""#*

fix broken test case

d5f69af

block string literal cannot be one line

Merge branch 'raw_string' of github.com:SlaterLatiao/carbon-lang into…

75bcbcf

… raw_string

SlaterLatiao requested a review from a team as a code owner June 1, 2022 18:49

josh11b reviewed Jun 2, 2022

SlaterLatiao added 5 commits June 3, 2022 00:14

removed unused initial value

ac735be

rename flag to indicate multi-line string and remove comment

ebdf8f7

use * to get value from std::optional

743aed6

clean-ups

d509aa5

removed skip_scan flag and directly return in case of a single line s…

d51377d

…tring starting with #+\'\'\'

josh11b reviewed Jun 3, 2022

SlaterLatiao and others added 7 commits June 3, 2022 20:46

Updated error message: simple string -> single-line string.

855fe32

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

Updated test cases according to changes in error message

17bc3cf

Removed counting_hashtag flag.

9ac7418

Implemented ScanHelper class to handle scanning

43ab9a6

Fixed explanation of ReadHashTags.

750b034

Addressed PR comment.

266359a

Clarify that scan_helper holds the source text.

8437c5c

jonmeow reviewed Jun 6, 2022

SlaterLatiao added the "explorer" label (Action items related to Carbon explorer code) Jun 7, 2022

SlaterLatiao added 3 commits June 7, 2022 00:53

Addressed PR comments.

51b2af9

Updated error messages in test cases.

b5791ad

Added const keyword to return type of GetCurrentStr().

1914495

josh11b reviewed Jun 7, 2022

jonmeow reviewed Jun 8, 2022

josh11b reviewed Jun 8, 2022

josh11b reviewed Jun 12, 2022

SlaterLatiao and others added 18 commits June 13, 2022 13:42

Copied default parameter value to definition of UnescapeStringLiteral.

9cf8448

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Copied default parameter value to definition of ParseBlockStringLiteral.

54c46c1

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Prefix CARBON_ to SIMPLE_TOKEN and ARG_TOKEN macros.

78da5b9

Merge branch 'raw_string' of github.com:SlaterLatiao/carbon-lang into…

3c2e90d

… raw_string

Rollback redefinition of arguments.

70709cd

Updated comment on the flex macro.

6f45efc

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Updated wording.

00401d8

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Moved the EOF error out of the loop.

65facf5

Removed duplicated declaration.

0c91724

Changed type of hashtag_num and leading_quotes to int.

f8e8054

Minor fix: string copy.

7cc8cbb

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Added comment on YyinputWrapper.

a45fd15

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Garmmar in comment.

24d3149

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Added check of eof before readling next char.

ec4477b

Minor updates based on PR comments.

1346f92

Minor changes to address PR comments.

bb63820

Used a clearer way to calculate hashtag_num and leading_quotes. S…

3a8d488

…witched back to indicate muti-line string with a flag.

Directly copy StringRef for compilation error message.

aa6e246

josh11b reviewed Jun 22, 2022

SlaterLatiao and others added 2 commits June 22, 2022 14:34

Make str_with_quote const as we don't change it.

6a77fea

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

Added TODO for unsupported cases.

eae97d5

jonmeow approved these changes Jun 22, 2022

SlaterLatiao and others added 2 commits June 22, 2022 15:29

Merged upstream trunk into raw_string.

2aba1f6

Fixed a typo.

4e238e8

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

SlaterLatiao merged commit e1fa153 into carbon-language:trunk Jun 22, 2022

SlaterLatiao deleted the raw_string branch June 22, 2022 22:39
