Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of raw string literals #1304

Merged
merged 50 commits into from Jun 22, 2022
Merged

Initial implementation of raw string literals #1304

merged 50 commits into from Jun 22, 2022

Conversation

SlaterLatiao
Copy link
Contributor

Combined simple string and block string in lexer.lpp, and implemented raw string literals.
Modified ParseBlockStringLiteral() and UnescapeStringLiteral() in string_helper.cpp to allow raw string literals.

@SlaterLatiao SlaterLatiao requested a review from a team as a code owner June 1, 2022 18:49
Copy link
Contributor

@josh11b josh11b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not finished reading the code, but I thought I'd send you what I have. I would encourage you to think about ways of writing code that uses fewer boolean flags. They make the code harder to read and reason about.

explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good! I'm mostly making code style comments. Please don't be bothered by that -- code review like this is often how we end up teaching style. If you have any questions, I should be available over IM.

explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
common/string_helpers.h Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
@SlaterLatiao SlaterLatiao added the explorer Action items related to Carbon explorer code label Jun 7, 2022
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/testdata/string/fail_hex_lower.carbon Outdated Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good! Just a few more comments.

common/string_helpers.cpp Outdated Show resolved Hide resolved
common/string_helpers.cpp Outdated Show resolved Hide resolved
explorer/syntax/lex_helper.h Outdated Show resolved Hide resolved
explorer/syntax/lex_helper.h Outdated Show resolved Hide resolved
explorer/syntax/lex_scan_helper.cpp Show resolved Hide resolved
explorer/syntax/lex_scan_helper.h Outdated Show resolved Hide resolved
// EOF.
auto Advance() -> bool;
// Returns last scanned char.
auto last_char() -> int { return str_.back(); };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is int rather than char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original ReadChar has return type of int. Switched to char instead.

explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
explorer/syntax/lex_scan_helper.h Show resolved Hide resolved
@@ -27,87 +27,86 @@ static auto FromHex(char c) -> std::optional<char> {
return std::nullopt;
}

auto UnescapeStringLiteral(llvm::StringRef source, bool is_block_string)
-> std::optional<std::string> {
auto UnescapeStringLiteral(llvm::StringRef source, const size_t hashtag_num,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use int instead of size_t. Avoid unsigned types unless they are needed: https://google.github.io/styleguide/cppguide.html#Integer_Types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced size_t with int.

explorer/syntax/lex_scan_helper.cpp Show resolved Hide resolved
Comment on lines 272 to 273
const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
const size_t leading_quotes = std::count(s.begin(), s.end(), '"');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current code works because we know the string has to satisfy the regexp #*(\"\"\"|\") due to line 266.

No comment on whether the multi_line flag is easier to read. It is a case where I think a boolean flag is reasonable, particularly if it can be declared const so you know that it never changes.

if (Carbon::ReadHashTags(str_lex_helper, hashtag_num)) {
return Carbon::ProcessSingleLineString(str_lex_helper.str(), context,
hashtag_num);
} else if (str_lex_helper.is_eof()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadHashTags(...) calls Advance().

explorer/syntax/lex_scan_helper.h Show resolved Hide resolved
auto ReadHashTags(Carbon::StringLexHelper& scan_helper,
const size_t hashtag_num) -> bool {
for (size_t i = 0; i < hashtag_num; ++i) {
if (!scan_helper.Advance() || scan_helper.last_char() != '#') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is backwards from what I expect:

Suggested change
if (!scan_helper.Advance() || scan_helper.last_char() != '#') {
if (scan_helper.last_char() != '#' || !scan_helper.Advance()) {

That way only # characters would be consumed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Advance() needs to be called before calling ReadHashTags() if hashtags are checked first. It is still possible to consume a non # char after switching the order.

// Reads and returns a single character. Reports an error on EOF.
auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int;

// Tries to Read [hashtag_num] hashtags. Returns true on success.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd generally write `hashtag_num` instead of [hashtag_num].

Should document how much scan_helper is Advance()d. The change I recommend would make it clear that it advances past up to hashtag_num hashtags, and no other characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edited as suggested and added comment on how many characters are read.

const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
const size_t leading_quotes = std::count(s.begin(), s.end(), '"');
if (leading_quotes == 3 && hashtag_num > 0) {
// Check if it's a single-line string, like #"""#.
Copy link
Contributor

@josh11b josh11b Jun 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic isn't quite smart enough to handle #""""# which should be treated as the string with two double-quote characters: "". I think the block string specification limits what can be after #""", otherwise it is a single-line string. @zygoloid probably can say what the exact rule is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md :

A block string literal starts with """, followed by an optional file type indicator, followed by a newline, ...

A file type indicator is any sequence of non-whitespace characters other than " or #.

Please add tests and implement this rule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SlaterLatiao I think this is what I'd discussed with you. I would still encourage you to gauge how much of a change it is -- even if it's a significant refactor, it may be easier to review as a delta from current work. In that case, I'd still suggest adding a TODO for this issue, finishing cleanups on this PR, getting it merged, and then adding support in a new PR. Closing out the review and getting it in can be valuable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the TODO.

explorer/syntax/lex_scan_helper.cpp Show resolved Hide resolved
auto ProcessSingleLineString(llvm::StringRef str,
Carbon::ParseAndLexContext& context,
const size_t hashtag_num)
-> Carbon::Parser::symbol_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save a copy of str for error messages before you consume the front and back. Also for ProcessMultiLineString.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying str (the parameter type of str is changed to llvm::StringRef to avoid such copies), the string used for error message will be reconstructed by prepending and appending the quotes. The hashtags are not added, to be consistent with ProcessMultiLineString. The error messages in ProcessMultiLineString are handled in ParseBlockStringLiteral, where the hashtags are already removed when calling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying a llvm::StringRef should be cheap, and not involve copying the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with copy of llvm::StringRef.

Comment on lines 40 to 41
// Reads and returns a single character. Reports an error on EOF.
auto ReadChar(yyscan_t yyscanner, Carbon::ParseAndLexContext& context) -> int;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many places is this function called from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only called by Advance. I removed ReadChar and merged its logic into Advance.

explorer/syntax/lex_scan_helper.h Show resolved Hide resolved
SlaterLatiao and others added 18 commits June 13, 2022 13:42
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
…witched back to indicate muti-line string with a flag.
explorer/syntax/lex_scan_helper.cpp Outdated Show resolved Hide resolved
const size_t hashtag_num = std::count(s.begin(), s.end(), '#');
const size_t leading_quotes = std::count(s.begin(), s.end(), '"');
if (leading_quotes == 3 && hashtag_num > 0) {
// Check if it's a single-line string, like #"""#.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md :

A block string literal starts with """, followed by an optional file type indicator, followed by a newline, ...

A file type indicator is any sequence of non-whitespace characters other than " or #.

Please add tests and implement this rule.

SlaterLatiao and others added 2 commits June 22, 2022 14:34
explorer/syntax/lexer.lpp Outdated Show resolved Hide resolved
SlaterLatiao and others added 2 commits June 22, 2022 15:29
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
@SlaterLatiao SlaterLatiao merged commit e1fa153 into carbon-language:trunk Jun 22, 2022
@SlaterLatiao SlaterLatiao deleted the raw_string branch June 22, 2022 22:39
chandlerc pushed a commit that referenced this pull request Jun 28, 2022
* test cases for raw string literals

* raw string literal implementation

* match as block string if starting with triple ", and better error message for simple string

except for *#"""#*

* fix broken test case

block string  literal cannot be one line

* test cases for raw string literals

* raw string literal implementation

* match as block string if starting with triple ", and better error message for simple string

except for *#"""#*

* fix broken test case

block string  literal cannot be one line

* removed unused initial value

* rename flag to indicate multi-line string and remove comment

* use * to get value from std::optional

* clean-ups

* removed skip_scan flag and directly return in case of a single line string starting with #+\'\'\'

* Updated error message: simple string -> single-line string.

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

* Updated test cases according to changes in error message

* Removed counting_hashtag flag.

* Implemented ScanHelper class to handle scanning

* Fixed explanation of ReadHashTags.

* Addressed PR comment.

* Clarify that scan_helper holds the source text.

* Addressed PR comments.

* Updated error messages in test cases.

* Added const keyword to return type of GetCurrentStr().

* addressed PR comments.

1. Moved ScanHelper class to lex_scan_helper.h and lex_scan_helper.cpp.
2. Moved ReadHashTags and Process* functions to lex_scan_helper.cpp. Moved YY_USER_ACTION, SIMPLE_TOKEN and ARG_TOKEN to lex_helper.h. Added a wrapper function YyinputWrapper to call static function yyinput in lexer.lpp.
3. Renamed ScanHelper with StringLexHelper.
4. Modified BUILD accordingly.
5. Renamed data members and functions.

* Addressed PR comments.

1. Adjusted order to keep ret usage close.
2. Used resize to construct the string to avoid creation of temp string.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Removed the multi_line flag and skip_read field to improve readability.

* Copied default parameter value to definition of UnescapeStringLiteral.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Copied default parameter value to definition of ParseBlockStringLiteral.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Prefix CARBON_ to SIMPLE_TOKEN and ARG_TOKEN macros.

* Rollback redefinition of arguments.

* Updated comment on the flex macro.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Updated wording.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Moved the EOF error out of the loop.

* Removed duplicated declaration.

* Changed type of `hashtag_num` and `leading_quotes` to int.

* Minor fix: string copy.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Added comment on YyinputWrapper.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Garmmar in comment.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

* Added check of eof before readling next char.

* Minor updates based on PR comments.

* Minor changes to address PR comments.

* Used a clearer way to calculate `hashtag_num` and `leading_quotes`. Switched back to indicate muti-line string with a flag.

* Directly copy StringRef for compilation error message.

* Make str_with_quote const as we don't change it.

Co-authored-by: josh11b <josh11b@users.noreply.github.com>

* Added TODO for unsupported cases.

* Fixed a typo.

Co-authored-by: Jon Ross-Perkins <jperkins@google.com>

Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Co-authored-by: Jon Ross-Perkins <jperkins@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
explorer Action items related to Carbon explorer code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants