Initial implementation of raw string literals (#1304)

* test cases for raw string literals
* raw string literal implementation
* match as block string if starting with triple ", and better error message for simple string except for *#"""#*
* fix broken test case block string literal cannot be one line
* removed unused initial value
* rename flag to indicate multi-line string and remove comment
* use * to get value from std::optional
* clean-ups
* removed skip_scan flag and directly return in case of a single line string starting with #+\'\'\'
* Updated error message: simple string -> single-line string. Co-authored-by: josh11b <josh11b@users.noreply.github.com>
* Updated test cases according to changes in error message
* Removed counting_hashtag flag.
* Implemented ScanHelper class to handle scanning
* Fixed explanation of ReadHashTags.
* Addressed PR comment.
* Clarify that scan_helper holds the source text.
* Addressed PR comments.
* Updated error messages in test cases.
* Added const keyword to return type of GetCurrentStr().
* addressed PR comments.
  1. Moved ScanHelper class to lex_scan_helper.h and lex_scan_helper.cpp.
  2. Moved ReadHashTags and Process* functions to lex_scan_helper.cpp. Moved YY_USER_ACTION, SIMPLE_TOKEN and ARG_TOKEN to lex_helper.h. Added a wrapper function YyinputWrapper to call static function yyinput in lexer.lpp.
  3. Renamed ScanHelper to StringLexHelper.
  4. Modified BUILD accordingly.
  5. Renamed data members and functions.
* Addressed PR comments.
  1. Adjusted order to keep ret usage close.
  2. Used resize to construct the string to avoid creation of temp string. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Removed the multi_line flag and skip_read field to improve readability.
* Copied default parameter value to definition of UnescapeStringLiteral. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Copied default parameter value to definition of ParseBlockStringLiteral. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Prefix CARBON_ to SIMPLE_TOKEN and ARG_TOKEN macros.
* Rollback redefinition of arguments.
* Updated comment on the flex macro. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Updated wording. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Moved the EOF error out of the loop.
* Removed duplicated declaration.
* Changed type of `hashtag_num` and `leading_quotes` to int.
* Minor fix: string copy. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Added comment on YyinputWrapper. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Grammar in comment. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* Added check of eof before reading next char.
* Minor updates based on PR comments.
* Minor changes to address PR comments.
* Used a clearer way to calculate `hashtag_num` and `leading_quotes`. Switched back to indicate multi-line string with a flag.
* Directly copy StringRef for compilation error message.
* Make str_with_quote const as we don't change it. Co-authored-by: josh11b <josh11b@users.noreply.github.com>
* Added TODO for unsupported cases.
* Fixed a typo. Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
carbon-language · Jun 22, 2022 · 8d0f336
1 parent a1be2a8
Showing 22 changed files with 683 additions and 236 deletions.
diff --git a/common/string_helpers.cpp b/common/string_helpers.cpp
@@ -27,87 +27,86 @@ static auto FromHex(char c) -> std::optional<char> {
   return std::nullopt;
 }
 
-auto UnescapeStringLiteral(llvm::StringRef source, bool is_block_string)
-    -> std::optional<std::string> {
+auto UnescapeStringLiteral(llvm::StringRef source, const int hashtag_num,
+                           bool is_block_string) -> std::optional<std::string> {
   std::string ret;
   ret.reserve(source.size());
+  std::string escape = "\\";
+  escape.resize(hashtag_num + 1, '#');
   size_t i = 0;
   while (i < source.size()) {
     char c = source[i];
-    switch (c) {
-      case '\\':
-        ++i;
-        if (i == source.size()) {
-          return std::nullopt;
-        }
-        switch (source[i]) {
-          case 'n':
-            ret.push_back('\n');
-            break;
-          case 'r':
-            ret.push_back('\r');
-            break;
-          case 't':
-            ret.push_back('\t');
-            break;
-          case '0':
-            if (i + 1 < source.size() && llvm::isDigit(source[i + 1])) {
-              // \0[0-9] is reserved.
-              return std::nullopt;
-            }
-            ret.push_back('\0');
-            break;
-          case '"':
-            ret.push_back('"');
-            break;
-          case '\'':
-            ret.push_back('\'');
-            break;
-          case '\\':
-            ret.push_back('\\');
-            break;
-          case 'x': {
-            i += 2;
-            if (i >= source.size()) {
-              return std::nullopt;
-            }
-            std::optional<char> c1 = FromHex(source[i - 1]);
-            std::optional<char> c2 = FromHex(source[i]);
-            if (c1 == std::nullopt || c2 == std::nullopt) {
-              return std::nullopt;
-            }
-            ret.push_back(16 * *c1 + *c2);
-            break;
+    if (i + hashtag_num < source.size() &&
+        source.slice(i, i + hashtag_num + 1).equals(escape)) {
+      i += hashtag_num + 1;
+      if (i == source.size()) {
+        return std::nullopt;
+      }
+      switch (source[i]) {
+        case 'n':
+          ret.push_back('\n');
+          break;
+        case 'r':
+          ret.push_back('\r');
+          break;
+        case 't':
+          ret.push_back('\t');
+          break;
+        case '0':
+          if (i + 1 < source.size() && llvm::isDigit(source[i + 1])) {
+            // \0[0-9] is reserved.
+            return std::nullopt;
+          }
+          ret.push_back('\0');
+          break;
+        case '"':
+          ret.push_back('"');
+          break;
+        case '\'':
+          ret.push_back('\'');
+          break;
+        case '\\':
+          ret.push_back('\\');
+          break;
+        case 'x': {
+          i += 2;
+          if (i >= source.size()) {
+            return std::nullopt;
           }
-          case 'u':
-            CARBON_FATAL() << "\\u is not yet supported in string literals";
-          case '\n':
-            if (!is_block_string) {
-              return std::nullopt;
-            }
-            break;
-          default:
-            // Unsupported.
+          std::optional<char> c1 = FromHex(source[i - 1]);
+          std::optional<char> c2 = FromHex(source[i]);
+          if (c1 == std::nullopt || c2 == std::nullopt) {
             return std::nullopt;
+          }
+          ret.push_back(16 * *c1 + *c2);
+          break;
         }
-        break;
-
-      case '\t':
-        // Disallow non-` ` horizontal whitespace:
-        // https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/whitespace.md
-        // TODO: This doesn't handle unicode whitespace.
-        return std::nullopt;
-
-      default:
-        ret.push_back(c);
-        break;
+        case 'u':
+          CARBON_FATAL() << "\\u is not yet supported in string literals";
+        case '\n':
+          if (!is_block_string) {
+            return std::nullopt;
+          }
+          break;
+        default:
+          // Unsupported.
+          return std::nullopt;
+      }
+    } else if (c == '\t') {
+      // Disallow non-` ` horizontal whitespace:
+      // https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/whitespace.md
+      // TODO: This doesn't handle unicode whitespace.
+      return std::nullopt;
+    } else {
+      ret.push_back(c);
     }
     ++i;
   }
   return ret;
 }
 
-auto ParseBlockStringLiteral(llvm::StringRef source) -> ErrorOr<std::string> {
+auto ParseBlockStringLiteral(llvm::StringRef source, const int hashtag_num)
+    -> ErrorOr<std::string> {
   llvm::SmallVector<llvm::StringRef> lines;
   source.split(lines, '\n', /*MaxSplit=*/-1, /*KeepEmpty=*/true);
   if (lines.size() < 2) {
@@ -150,8 +149,9 @@ auto ParseBlockStringLiteral(llvm::StringRef source) -> ErrorOr<std::string> {
     }
     // Unescaping with \n appended to handle things like \\<newline>.
     llvm::SmallVector<char> buffer;
-    std::optional<std::string> unescaped = UnescapeStringLiteral(
-        (line + "\n").toStringRef(buffer), /*is_block_string=*/true);
+    std::optional<std::string> unescaped =
+        UnescapeStringLiteral((line + "\n").toStringRef(buffer), hashtag_num,
+                              /*is_block_string=*/true);
     if (!unescaped.has_value()) {
       return Error("Invalid escaping in " + line);
     }

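The rewritten loop in `UnescapeStringLiteral` treats a backslash followed by exactly `hashtag_num` hashes as the escape introducer, so in a raw literal a bare `\n` stays as two literal characters. The following is a minimal standalone sketch of that idea, not the code from this commit: `MakeEscapeIntroducer` and `UnescapeSketch` are hypothetical names, `std::string_view` stands in for `llvm::StringRef`, and only a few escapes are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: builds the escape introducer for a literal delimited
// by `hashtag_num` hashes, e.g. 0 -> "\" and 2 -> "\##".
inline std::string MakeEscapeIntroducer(int hashtag_num) {
  std::string escape = "\\";
  escape.resize(static_cast<std::size_t>(hashtag_num) + 1, '#');
  return escape;
}

// Simplified stand-in for the unescaping loop. It only understands n, t, "
// and \ after the introducer and returns std::nullopt for anything else.
inline std::optional<std::string> UnescapeSketch(std::string_view source,
                                                 int hashtag_num) {
  const std::string escape = MakeEscapeIntroducer(hashtag_num);
  std::string ret;
  ret.reserve(source.size());
  std::size_t i = 0;
  while (i < source.size()) {
    if (source.substr(i, escape.size()) == std::string_view(escape)) {
      i += escape.size();
      if (i == source.size()) {
        return std::nullopt;  // Introducer with nothing after it.
      }
      switch (source[i]) {
        case 'n':
          ret.push_back('\n');
          break;
        case 't':
          ret.push_back('\t');
          break;
        case '"':
          ret.push_back('"');
          break;
        case '\\':
          ret.push_back('\\');
          break;
        default:
          return std::nullopt;  // Unsupported in this sketch.
      }
    } else {
      // Not an escape at this hash level: copy the character through.
      ret.push_back(source[i]);
    }
    ++i;
  }
  return ret;
}

// Small self-check mirroring the new unit tests in this commit.
inline void RunUnescapeSketchExamples() {
  assert(*UnescapeSketch("test\\#n", 1) == "test\n");
  assert(*UnescapeSketch("test\\n", 1) == "test\\n");
  assert(!UnescapeSketch("\\#q", 1).has_value());
}
```

Building the introducer once up front, as the diff does with `escape.resize(hashtag_num + 1, '#')`, keeps the per-character check to a single substring comparison.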
diff --git a/common/string_helpers.h b/common/string_helpers.h
@@ -19,11 +19,13 @@ namespace Carbon {
 // Unescapes Carbon escape sequences in the source string. Returns std::nullopt
 // on bad input. `is_block_string` enables escaping unique to block string
 // literals, such as \<newline>.
-auto UnescapeStringLiteral(llvm::StringRef source, bool is_block_string = false)
+auto UnescapeStringLiteral(llvm::StringRef source, int hashtag_num = 0,
+                           bool is_block_string = false)
     -> std::optional<std::string>;
 
 // Parses a block string literal in `source`.
-auto ParseBlockStringLiteral(llvm::StringRef source) -> ErrorOr<std::string>;
+auto ParseBlockStringLiteral(llvm::StringRef source, int hashtag_num = 0)
+    -> ErrorOr<std::string>;
 
 // Returns true if the pointer is in the string ref (including equality with
 // `ref.end()`). This should be used instead of `<=` comparisons for

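One thing worth noting about the new declaration: `hashtag_num` is inserted ahead of `is_block_string`, and both are defaulted. Because `bool` converts implicitly to `int` in C++, a call written against the old two-parameter form that passes the flag positionally would still compile but bind it to `hashtag_num`. The caller shown in string_helpers.cpp is updated in this diff; whether other positional callers exist is not visible here. The sketch below uses a hypothetical `RecordArgs` function with the same parameter list purely to show the binding.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical record of how the arguments were bound.
struct BoundArgs {
  int hashtag_num;
  bool is_block_string;
};

// Hypothetical stand-in with the same parameter list as the new declaration.
inline BoundArgs RecordArgs(std::string_view /*source*/, int hashtag_num = 0,
                            bool is_block_string = false) {
  return BoundArgs{hashtag_num, is_block_string};
}

inline void RunDefaultArgumentExamples() {
  // Old-style positional call: `true` becomes hashtag_num == 1.
  const BoundArgs old_style = RecordArgs("x", true);
  assert(old_style.hashtag_num == 1);
  assert(!old_style.is_block_string);

  // Spelling out both arguments binds the flag as intended.
  const BoundArgs new_style = RecordArgs("x", 0, /*is_block_string=*/true);
  assert(new_style.hashtag_num == 0);
  assert(new_style.is_block_string);
}
```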
diff --git a/common/string_helpers_test.cpp b/common/string_helpers_test.cpp
@@ -28,6 +28,8 @@ TEST(UnescapeStringLiteral, Valid) {
   EXPECT_THAT(UnescapeStringLiteral("test\\\\n"), Optional(Eq("test\\n")));
   EXPECT_THAT(UnescapeStringLiteral("\\xAA"), Optional(Eq("\xAA")));
   EXPECT_THAT(UnescapeStringLiteral("\\x12"), Optional(Eq("\x12")));
+  EXPECT_THAT(UnescapeStringLiteral("test", 1), Optional(Eq("test")));
+  EXPECT_THAT(UnescapeStringLiteral("test\\#n", 1), Optional(Eq("test\n")));
 }
 
 TEST(UnescapeStringLiteral, Invalid) {
@@ -43,6 +45,7 @@ TEST(UnescapeStringLiteral, Invalid) {
   EXPECT_THAT(UnescapeStringLiteral("\\xaa"), Eq(std::nullopt));
   // Reserved.
   EXPECT_THAT(UnescapeStringLiteral("\\00"), Eq(std::nullopt));
+  EXPECT_THAT(UnescapeStringLiteral("\\#00", 1), Eq(std::nullopt));
 }
 
 TEST(UnescapeStringLiteral, Nul) {
@@ -90,6 +93,11 @@ TEST(ParseBlockStringLiteral, FailInvalidEscaping) {
      """)";
   EXPECT_THAT(ParseBlockStringLiteral(Input).error().message(),
               Eq("Invalid escaping in \\q"));
+  constexpr char InputRaw[] = R"("""
+     \#q
+     """)";
+  EXPECT_THAT(ParseBlockStringLiteral(InputRaw, 1).error().message(),
+              Eq("Invalid escaping in \\#q"));
 }
 
 TEST(ParseBlockStringLiteral, OkEmptyString) {

diff --git a/explorer/syntax/BUILD b/explorer/syntax/BUILD
@@ -53,6 +53,9 @@ cc_library(
 cc_library(
     name = "syntax",
     srcs = [
+        "lex_helper.h",
+        "lex_scan_helper.cpp",
+        "lex_scan_helper.h",
         "lexer.cpp",
         "lexer.h",
         "parse.cpp",

diff --git a/explorer/syntax/lex_helper.h b/explorer/syntax/lex_helper.h
@@ -0,0 +1,25 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef CARBON_EXPLORER_SYNTAX_LEX_HELPER_H_
+#define CARBON_EXPLORER_SYNTAX_LEX_HELPER_H_
+
+// Flex expands this macro immediately before each action.
+//
+// Advances the current token position by yyleng columns without changing
+// the line number, and takes us out of the after-whitespace / after-operand
+// state.
+#define YY_USER_ACTION                                             \
+  context.current_token_position.columns(yyleng);                  \
+  if (YY_START == AFTER_WHITESPACE || YY_START == AFTER_OPERAND) { \
+    BEGIN(INITIAL);                                                \
+  }
+
+#define CARBON_SIMPLE_TOKEN(name) \
+  Carbon::Parser::make_##name(context.current_token_position);
+
+#define CARBON_ARG_TOKEN(name, arg) \
+  Carbon::Parser::make_##name(arg, context.current_token_position);
+
+#endif  // CARBON_EXPLORER_SYNTAX_LEX_HELPER_H_
diff --git a/explorer/syntax/lex_scan_helper.cpp b/explorer/syntax/lex_scan_helper.cpp
@@ -0,0 +1,68 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "explorer/syntax/lex_scan_helper.h"
+
+#include "common/string_helpers.h"
+#include "explorer/syntax/lex_helper.h"
+#include "llvm/Support/FormatVariadic.h"
+
+namespace Carbon {
+
+auto StringLexHelper::Advance() -> bool {
+  CARBON_CHECK(is_eof_ == false);
+  const char c = YyinputWrapper(yyscanner_);
+  if (c <= 0) {
+    context_.RecordSyntaxError("Unexpected end of file");
+    is_eof_ = true;
+    return false;
+  }
+  str_.push_back(c);
+  return true;
+}
+
+auto ReadHashTags(Carbon::StringLexHelper& scan_helper,
+                  const size_t hashtag_num) -> bool {
+  for (size_t i = 0; i < hashtag_num; ++i) {
+    if (!scan_helper.Advance() || scan_helper.last_char() != '#') {
+      return false;
+    }
+  }
+  return true;
+}
+
+auto ProcessSingleLineString(llvm::StringRef str,
+                             Carbon::ParseAndLexContext& context,
+                             const size_t hashtag_num)
+    -> Carbon::Parser::symbol_type {
+  std::string hashtags(hashtag_num, '#');
+  const auto str_with_quote = str;
+  CARBON_CHECK(str.consume_front(hashtags + "\"") &&
+               str.consume_back("\"" + hashtags));
+
+  std::optional<std::string> unescaped =
+      Carbon::UnescapeStringLiteral(str, hashtag_num);
+  if (unescaped == std::nullopt) {
+    return context.RecordSyntaxError(
+        llvm::formatv("Invalid escaping in string: {0}", str_with_quote));
+  }
+  return CARBON_ARG_TOKEN(string_literal, *unescaped);
+}
+
+auto ProcessMultiLineString(llvm::StringRef str,
+                            Carbon::ParseAndLexContext& context,
+                            const size_t hashtag_num)
+    -> Carbon::Parser::symbol_type {
+  std::string hashtags(hashtag_num, '#');
+  CARBON_CHECK(str.consume_front(hashtags) && str.consume_back(hashtags));
+  Carbon::ErrorOr<std::string> block_string =
+      Carbon::ParseBlockStringLiteral(str, hashtag_num);
+  if (!block_string.ok()) {
+    return context.RecordSyntaxError(llvm::formatv(
+        "Invalid block string: {0}", block_string.error().message()));
+  }
+  return CARBON_ARG_TOKEN(string_literal, *block_string);
+}
+
+}  // namespace Carbon
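`ProcessSingleLineString` strips `hashtag_num` hashes plus a quote from each end before unescaping, and checks that both delimiters are present. A rough standalone equivalent is sketched below, assuming `std::string_view` in place of `llvm::StringRef`; `StripSingleLineDelimiters` is a hypothetical name, and it returns `std::nullopt` where the real code uses `CARBON_CHECK`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: removes the opening `#...#"` and closing `"#...#`
// delimiters. The returned view points into the caller's buffer.
inline std::optional<std::string_view> StripSingleLineDelimiters(
    std::string_view str, std::size_t hashtag_num) {
  const std::string hashtags(hashtag_num, '#');
  const std::string open = hashtags + "\"";
  const std::string close = "\"" + hashtags;
  if (str.size() < open.size() + close.size()) {
    return std::nullopt;
  }
  if (str.substr(0, open.size()) != std::string_view(open)) {
    return std::nullopt;
  }
  if (str.substr(str.size() - close.size()) != std::string_view(close)) {
    return std::nullopt;
  }
  str.remove_prefix(open.size());
  str.remove_suffix(close.size());
  return str;
}

inline void RunStripDelimiterExamples() {
  // #"a\#n"# with one hash leaves the body a\#n for the unescaper.
  assert(*StripSingleLineDelimiters("#\"a\\#n\"#", 1) == "a\\#n");
  // A missing closing hash is rejected.
  assert(!StripSingleLineDelimiters("#\"x\"", 1).has_value());
}
```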
diff --git a/explorer/syntax/lex_scan_helper.h b/explorer/syntax/lex_scan_helper.h
@@ -0,0 +1,58 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef CARBON_EXPLORER_SYNTAX_LEX_SCAN_HELPER_H_
+#define CARBON_EXPLORER_SYNTAX_LEX_SCAN_HELPER_H_
+
+#include <string>
+
+#include "explorer/syntax/parse_and_lex_context.h"
+#include "explorer/syntax/parser.h"
+
+// Exposes yyinput; defined in lexer.lpp.
+extern auto YyinputWrapper(yyscan_t yyscanner) -> int;
+
+namespace Carbon {
+
+class StringLexHelper {
+ public:
+  StringLexHelper(const char* text, yyscan_t yyscanner,
+                  Carbon::ParseAndLexContext& context)
+      : str_(text), yyscanner_(yyscanner), context_(context), is_eof_(false) {}
+  // Advances yyscanner by one char. Sets is_eof to true and returns false on
+  // EOF.
+  auto Advance() -> bool;
+  // Returns the last scanned char.
+  auto last_char() -> char { return str_.back(); };
+  // Returns the scanned string.
+  auto str() -> const std::string& { return str_; };
+
+  auto is_eof() -> bool { return is_eof_; };
+
+ private:
+  std::string str_;
+  yyscan_t yyscanner_;
+  Carbon::ParseAndLexContext& context_;
+  // True once Advance() has reached end of file.
+  bool is_eof_;
+};
+
+// Tries to read `hashtag_num` hashtags. Returns true on success.
+// Reads `hashtag_num` characters on success, and number of consecutive hashtags
+// (< `hashtag_num`) + 1 characters on failure.
+auto ReadHashTags(Carbon::StringLexHelper& scan_helper, size_t hashtag_num)
+    -> bool;
+
+// Removes the delimiters from a single-line string and unescapes it. Reports
+// an error on invalid escaping.
+auto ProcessSingleLineString(llvm::StringRef str,
+                             Carbon::ParseAndLexContext& context,
+                             size_t hashtag_num) -> Carbon::Parser::symbol_type;
+auto ProcessMultiLineString(llvm::StringRef str,
+                            Carbon::ParseAndLexContext& context,
+                            size_t hashtag_num) -> Carbon::Parser::symbol_type;
+
+}  // namespace Carbon
+
+#endif  // CARBON_EXPLORER_SYNTAX_LEX_SCAN_HELPER_H_
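The comment on `ReadHashTags` describes how many characters are consumed on success and on failure. The sketch below replays that behaviour over an in-memory string so the counts can be checked without flex. `FakeScanner` and `ReadHashTagsSketch` are hypothetical stand-ins for `StringLexHelper` and `ReadHashTags`, and they omit error reporting. Note that when the input ends early, no mismatching character is consumed, so the count is just the number of hashes read.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical in-memory replacement for the flex-backed StringLexHelper.
class FakeScanner {
 public:
  explicit FakeScanner(std::string input) : input_(std::move(input)) {}

  // Consumes one character. Returns false at end of input.
  bool Advance() {
    if (pos_ >= input_.size()) {
      return false;
    }
    consumed_.push_back(input_[pos_++]);
    return true;
  }

  // Only valid after a successful Advance().
  char last_char() const { return consumed_.back(); }
  std::size_t consumed() const { return consumed_.size(); }

 private:
  std::string input_;
  std::string consumed_;
  std::size_t pos_ = 0;
};

// Same loop shape as ReadHashTags in the diff.
inline bool ReadHashTagsSketch(FakeScanner& scanner, std::size_t hashtag_num) {
  for (std::size_t i = 0; i < hashtag_num; ++i) {
    if (!scanner.Advance() || scanner.last_char() != '#') {
      return false;
    }
  }
  return true;
}

inline void RunReadHashTagsExamples() {
  FakeScanner ok("##x");
  assert(ReadHashTagsSketch(ok, 2) && ok.consumed() == 2);

  // One hash, then a mismatch: 1 + 1 characters consumed.
  FakeScanner mismatch("#x#");
  assert(!ReadHashTagsSketch(mismatch, 2) && mismatch.consumed() == 2);
}
```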