New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation of raw string literals #1304
Conversation
…sage for simple string except for *#"""#*
block string literal cannot be one line
…sage for simple string except for *#"""#*
block string literal cannot be one line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not finished reading the code, but I thought I'd send you what I have. I would encourage you to think about ways of writing code that uses fewer boolean flags. They make the code harder to read and reason about.
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! I'm mostly making code style comments. Please don't be bothered by that -- code review like this is often how we end up teaching style. If you have any questions, I should be available over IM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! Just a few more comments.
explorer/syntax/lex_scan_helper.h
Outdated
// EOF. | ||
auto Advance() -> bool; | ||
// Returns last scanned char. | ||
auto last_char() -> int { return str_.back(); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this is int
rather than char
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original ReadChar
has return type of int. Switched to char instead.
common/string_helpers.cpp
Outdated
@@ -27,87 +27,86 @@ static auto FromHex(char c) -> std::optional<char> { | |||
return std::nullopt; | |||
} | |||
|
|||
auto UnescapeStringLiteral(llvm::StringRef source, bool is_block_string) | |||
-> std::optional<std::string> { | |||
auto UnescapeStringLiteral(llvm::StringRef source, const size_t hashtag_num, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use int
instead of size_t
. Avoid unsigned types unless they are needed: https://google.github.io/styleguide/cppguide.html#Integer_Types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced size_t
with int
.
explorer/syntax/lexer.lpp
Outdated
const size_t hashtag_num = std::count(s.begin(), s.end(), '#'); | ||
const size_t leading_quotes = std::count(s.begin(), s.end(), '"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current code works because we know the string has to satisfy the regexp #*(\"\"\"|\")
due to line 266.
No comment on whether the multi_line
flag is easier to read. It is a case where I think a boolean flag is reasonable, particularly if it can be declared const
so you know that it never changes.
if (Carbon::ReadHashTags(str_lex_helper, hashtag_num)) { | ||
return Carbon::ProcessSingleLineString(str_lex_helper.str(), context, | ||
hashtag_num); | ||
} else if (str_lex_helper.is_eof()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadHashTags(
...)
calls Advance()
.
auto ReadHashTags(Carbon::StringLexHelper& scan_helper, | ||
const size_t hashtag_num) -> bool { | ||
for (size_t i = 0; i < hashtag_num; ++i) { | ||
if (!scan_helper.Advance() || scan_helper.last_char() != '#') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is backwards from what I expect:
if (!scan_helper.Advance() || scan_helper.last_char() != '#') { | |
if (scan_helper.last_char() != '#' || !scan_helper.Advance()) { |
That way only #
characters would be consumed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Advance()
needs to be called before calling ReadHashTags()
if hashtags are checked first. It is still possible to consume a non #
char after switching the order.
explorer/syntax/lex_scan_helper.h
Outdated
// Reads and returns a single character. Reports an error on EOF. | ||
auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int; | ||
|
||
// Tries to Read [hashtag_num] hashtags. Returns true on success. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd generally write `hashtag_num`
instead of [hashtag_num]
.
Should document how much scan_helper
is Advance()
d. The change I recommend would make it clear that it advances past up to hashtag_num
hashtags, and no other characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edited as suggested and added comment on how many characters are read.
const size_t hashtag_num = std::count(s.begin(), s.end(), '#'); | ||
const size_t leading_quotes = std::count(s.begin(), s.end(), '"'); | ||
if (leading_quotes == 3 && hashtag_num > 0) { | ||
// Check if it's a single-line string, like #"""#. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic isn't quite smart enough to handle #""""#
which should be treated as the string with two double-quote characters: ""
. I think the block string specification limits what can be after #"""
, otherwise it is a single-line string. @zygoloid probably can say what the exact rule is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A block string literal starts with
"""
, followed by an optional file type indicator, followed by a newline, ...
A file type indicator is any sequence of non-whitespace characters other than
"
or#
.
Please add tests and implement this rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SlaterLatiao I think this is what I'd discussed with you. I would still encourage you to gauge how much of a change it is -- even if it's a significant refactor, it may be easier to review as a delta from current work. In that case, I'd still suggest adding a TODO for this issue, finishing cleanups on this PR, getting it merged, and then adding support in a new PR. Closing out the review and getting it in can be valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the TODO.
auto ProcessSingleLineString(llvm::StringRef str, | ||
Carbon::ParseAndLexContext& context, | ||
const size_t hashtag_num) | ||
-> Carbon::Parser::symbol_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Save a copy of str
for error messages before you consume the front and back. Also for ProcessMultiLineString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of copying str
(the parameter type of str is changed to llvm::StringRef to avoid such copies), the string used for error message will be reconstructed by prepending and appending the quotes. The hashtags are not added, to be consistent with ProcessMultiLineString
. The error messages in ProcessMultiLineString
are handled in ParseBlockStringLiteral
, where the hashtags are already removed when calling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copying a llvm::StringRef
should be cheap, and not involve copying the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with copy of llvm::StringRef
.
explorer/syntax/lex_scan_helper.h
Outdated
// Reads and returns a single character. Reports an error on EOF. | ||
auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many places is this function called from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is only called by Advance
. I removed ReadChar
and merged its logic into Advance
.
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
…witched back to indicate muti-line string with a flag.
const size_t hashtag_num = std::count(s.begin(), s.end(), '#'); | ||
const size_t leading_quotes = std::count(s.begin(), s.end(), '"'); | ||
if (leading_quotes == 3 && hashtag_num > 0) { | ||
// Check if it's a single-line string, like #"""#. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A block string literal starts with
"""
, followed by an optional file type indicator, followed by a newline, ...
A file type indicator is any sequence of non-whitespace characters other than
"
or#
.
Please add tests and implement this rule.
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
* test cases for raw string literals * raw string literal implementation * match as block string if starting with triple ", and better error message for simple string except for *#"""#* * fix broken test case block string literal cannot be one line * test cases for raw string literals * raw string literal implementation * match as block string if starting with triple ", and better error message for simple string except for *#"""#* * fix broken test case block string literal cannot be one line * removed unused initial value * rename flag to indicate multi-line string and remove comment * use * to get value from std::optional * clean-ups * removed skip_scan flag and directly return in case of a single line string starting with #+\'\'\' * Updated error message: simple string -> single-line string. Co-authored-by: josh11b <josh11b@users.noreply.github.com> * Updated test cases according to changes in error message * Removed counting_hashtag flag. * Implemented ScanHelper class to handle scanning * Fixed explanation of ReadHashTags. * Addressed PR comment. * Clarify that scan_helper holds the source text. * Addressed PR comments. * Updated error messages in test cases. * Added const keyword to return type of GetCurrentStr(). * addressed PR comments. 1. Moved ScanHelper class to lex_scan_helper.h and lex_scan_helper.cpp. 2. Moved ReadHashTags and Process* functions to lex_scan_helper.cpp. Moved YY_USER_ACTION, SIMPLE_TOKEN and ARG_TOKEN to lex_helper.h. Added a wrapper function YyinputWrapper to call static function yyinput in lexer.lpp. 3. Renamed ScanHelper with StringLexHelper. 4. Modified BUILD accordingly. 5. Renamed data members and functions. * Addressed PR comments. 1. Adjusted order to keep ret usage close. 2. Used resize to construct the string to avoid creation of temp string. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Removed the multi_line flag and skip_read field to improve readability. * Copied default parameter value to definition of UnescapeStringLiteral. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Copied default parameter value to definition of ParseBlockStringLiteral. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Prefix CARBON_ to SIMPLE_TOKEN and ARG_TOKEN macros. * Rollback redefinition of arguments. * Updated comment on the flex macro. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Updated wording. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Moved the EOF error out of the loop. * Removed duplicated declaration. * Changed type of `hashtag_num` and `leading_quotes` to int. * Minor fix: string copy. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Added comment on YyinputWrapper. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Garmmar in comment. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> * Added check of eof before readling next char. * Minor updates based on PR comments. * Minor changes to address PR comments. * Used a clearer way to calculate `hashtag_num` and `leading_quotes`. Switched back to indicate muti-line string with a flag. * Directly copy StringRef for compilation error message. * Make str_with_quote const as we don't change it. Co-authored-by: josh11b <josh11b@users.noreply.github.com> * Added TODO for unsupported cases. * Fixed a typo. Co-authored-by: Jon Ross-Perkins <jperkins@google.com> Co-authored-by: josh11b <josh11b@users.noreply.github.com> Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Combined simple string and block string in lexer.lpp, and implemented raw string literals.
Modified ParseBlockStringLiteral() and UnescapeStringLiteral() in string_helper.cpp to allow raw string literals.