Fix comment parsing inside command substitutions and brackets #8695
Well, I'm a fan, nice job, cool! The only thing I'd want to think about is whether this is the simplest, best fix. These beyond-the-parser hinterlands of our codebase are super convoluted and tricky.
I refactored my changes in 1b17cda; they are now the simplest I could get them without refactoring unrelated parsing code. I also added regression tests to
Looks pretty nice! 👍
I think so.
Nice work! I have found one minor issue. Also, consider squashing the commits, so the changes are easier to follow for third parties (including
@@ -679,6 +682,10 @@ maybe_t<tok_t> tokenizer_t::next() {
return result;
}

bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next) {
return c == L'(' || !tok_is_string_character(c, is_first, next);
Interesting. This function deviates from tok_is_string_character() because today (a)(b)(c) is a single token at parsing time. Command substitutions are reparsed and split further only during execution. We could change this in the future.
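To make that deviation concrete, here is a minimal sketch. `is_string_char_sketch` is a hypothetical, heavily simplified stand-in for tok_is_string_character() (the real function also takes `is_first` and `next` and knows more cases), so this only illustrates why `(` has to be special-cased if `#` right after it should start a comment.

```cpp
// Hypothetical, simplified stand-in for tok_is_string_character().
// Whitespace and a few shell metacharacters end a token; everything else,
// including '(' and ')', is treated as part of the token. That mirrors the
// fact that (a)(b)(c) is a single token at parse time.
bool is_string_char_sketch(wchar_t c) {
    switch (c) {
        case L'\0':
        case L' ':
        case L'\t':
        case L'\n':
        case L'\r':
        case L';':
        case L'|':
        case L'<':
        case L'>':
            return false;
        default:
            return true;
    }
}

// Same shape as is_token_delimiter() in the hunk above: '(' is listed
// explicitly because the string-character test alone would not report it.
bool is_token_delimiter_sketch(wchar_t c) {
    return c == L'(' || !is_string_char_sketch(c);
}
```

With this stand-in, `(` and a space both count as delimiters, while `a` and `)` do not.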
Here is an alternative solution I played around with, but I think we should just discard it, because is_comment() doesn't deal with backslash escapes itself but requires the caller to do so, which is weird.
diff --git a/src/parse_util.cpp b/src/parse_util.cpp
index c626804d0..82870c6d9 100644
--- a/src/parse_util.cpp
+++ b/src/parse_util.cpp
@@ -90,8 +90,6 @@ size_t parse_util_get_offset(const wcstring &str, int line, long line_offset) {
static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, const wchar_t **end,
bool allow_incomplete, bool *inout_is_quoted) {
bool escaped = false;
- bool is_first = true;
- bool is_token_begin = true;
bool syntax_error = false;
int paran_count = 0;
std::vector<int> quoted_cmdsubs;
@@ -126,7 +124,7 @@ static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, co
if (!process_opening_quote(*pos)) break;
} else if (*pos == L'\\') {
escaped = true;
- } else if (*pos == L'#' && is_token_begin) {
+ } else if (is_comment(pos, in)) {
pos = comment_end(pos) - 1;
} else {
if (*pos == L'(') {
@@ -167,12 +165,9 @@ static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, co
}
}
}
- is_token_begin = is_token_delimiter(pos[0], is_first, pos[1]);
} else {
escaped = false;
- is_token_begin = false;
}
- is_first = false;
}
syntax_error |= (paran_count < 0);
diff --git a/src/tokenizer.cpp b/src/tokenizer.cpp
index 861c57f61..9d48cdb7e 100644
--- a/src/tokenizer.cpp
+++ b/src/tokenizer.cpp
@@ -154,7 +154,6 @@ tok_t tokenizer_t::read_string() {
int slice_offset = 0;
const wchar_t *const buff_start = this->token_cursor;
bool is_first = true;
- bool is_token_begin = true;
auto process_opening_quote = [&](wchar_t quote) -> const wchar_t * {
const wchar_t *end = quote_end(this->token_cursor, quote);
@@ -193,7 +192,7 @@ tok_t tokenizer_t::read_string() {
// has been explicitly ignored (escaped).
else if (c == L'\\') {
mode |= tok_modes::char_escape;
- } else if (c == L'#' && is_token_begin) {
+ } else if (is_comment(this->token_cursor, buff_start)) {
this->token_cursor = comment_end(this->token_cursor) - 1;
} else if (c == L'(') {
paran_offsets.push_back(this->token_cursor - this->start);
@@ -281,7 +280,6 @@ tok_t tokenizer_t::read_string() {
FLOGF(error, msg.c_str(), c, c, int(mode_begin), int(mode));
#endif
- is_token_begin = is_token_delimiter(this->token_cursor[0], is_first, this->token_cursor[1]);
is_first = false;
this->token_cursor++;
}
@@ -682,8 +680,12 @@ maybe_t<tok_t> tokenizer_t::next() {
return result;
}
-bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next) {
- return c == L'(' || !tok_is_string_character(c, is_first, next);
+bool is_comment(const wchar_t *pos, const wchar_t *tok_start) {
+ if (*pos != L'#') return false;
+ if (pos == tok_start) return true;
+ wchar_t prev = *(pos - 1);
+ bool prev_is_first = (pos - 1) == tok_start;
+ return prev == L'(' || prev == L')' || !tok_is_string_character(prev, prev_is_first, *pos);
}
wcstring tok_first(const wcstring &str) {
diff --git a/src/tokenizer.h b/src/tokenizer.h
index fccff61db..5322a2d32 100644
--- a/src/tokenizer.h
+++ b/src/tokenizer.h
@@ -133,8 +133,8 @@ class tokenizer_t : noncopyable_t {
}
};
-/// Tests if this character can delimit tokens.
-bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next);
+/// Tests if the given position starts a comment (possible inside a command substitution).
+bool is_comment(const wchar_t *pos, const wchar_t *tok_start);
/// Returns only the first token from the specified string. This is a convenience function, used to
/// retrieve the first token of a string. This can be useful for error messages, etc. On failure,
I like your solution! My only reservation is that ) is interpreted as a delimiter; as I said in my previous comment, I think it shouldn't be.
Escapes are not handled in tok_is_string_character() either. For example, it wouldn't know if c = ';' is escaped; it assumes that it isn't. So I think it is reasonable to burden the caller with escaping. We can also document that, perhaps like this:
/// \c pos must not be an escaped character.
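For illustration, a minimal sketch of what "burdening the caller" could look like. All names ending in `_sketch`, and `first_comment_offset`, are hypothetical; `is_string_char_sketch` is a simplified stand-in for tok_is_string_character(), and `)` is left out of the delimiter check, following the reservation above. Quotes are not handled.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for tok_is_string_character().
bool is_string_char_sketch(wchar_t c) {
    switch (c) {
        case L'\0': case L' ': case L'\t': case L'\n': case L'\r':
        case L';': case L'|': case L'<': case L'>':
            return false;
        default:
            return true;
    }
}

/// Tests if the given position starts a comment.
/// \c pos must not be an escaped character; the caller checks that.
bool is_comment_sketch(const wchar_t *pos, const wchar_t *tok_start) {
    if (*pos != L'#') return false;
    if (pos == tok_start) return true;
    wchar_t prev = *(pos - 1);
    return prev == L'(' || !is_string_char_sketch(prev);
}

// Caller that tracks backslash escapes itself and only asks
// is_comment_sketch() about unescaped characters.
// Returns the offset of the first comment start, or -1 if there is none.
std::ptrdiff_t first_comment_offset(const wchar_t *str) {
    bool escaped = false;
    for (const wchar_t *pos = str; *pos; pos++) {
        if (escaped) {
            escaped = false;  // this character is escaped, never a comment
            continue;
        }
        if (*pos == L'\\') {
            escaped = true;
            continue;
        }
        if (is_comment_sketch(pos, str)) return pos - str;
    }
    return -1;
}
```

Note that the sketch inspects the raw previous character, so a `#` directly after an escaped space would still be reported as a comment; it is not meant to be complete.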
> Escapes are not handled in tok_is_string_character() either
Yeah, but that doesn't contradict that function's name.
Meanwhile is_comment() is a straight-up lie if quoted/escaped, and I couldn't find a better name.
So I'd just merge as-is.
This change avoids parsing brackets and quotes within comments.
This fixes the following examples (as well as those described in #7866 and #8022):
a b
a b
a[b #c]
Fixes #7866, and fixes #8022.
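As a rough illustration of the idea (not the actual parse_util_locate_cmdsub() code), the sketch below finds the closing parenthesis of a command substitution while skipping comments, so that brackets inside a comment are not counted. `matching_paren_sketch` and its delimiter list are assumptions made for this example; quotes are deliberately not handled.

```cpp
#include <cstddef>

// Hypothetical helper: returns the offset of the ')' that closes the first
// '(' in str, or -1 if it is never closed. A '#' at the beginning of a token
// starts a comment that runs to the end of the line and is skipped.
std::ptrdiff_t matching_paren_sketch(const wchar_t *str) {
    int depth = 0;
    bool escaped = false;
    bool token_begin = true;
    for (const wchar_t *pos = str; *pos; pos++) {
        wchar_t c = *pos;
        if (escaped) {
            // An escaped character is ordinary token content.
            escaped = false;
            token_begin = false;
            continue;
        }
        if (c == L'\\') {
            escaped = true;
            continue;
        }
        if (c == L'#' && token_begin) {
            // Skip the comment body; stop just before the newline (or the end).
            while (pos[1] != L'\0' && pos[1] != L'\n') pos++;
            continue;
        }
        if (c == L'(') {
            depth++;
        } else if (c == L')') {
            if (--depth == 0) return pos - str;
        }
        // Simplified delimiter test: '(' plus whitespace and ';'.
        token_begin = (c == L'(' || c == L' ' || c == L'\t' || c == L'\n' || c == L';');
    }
    return -1;
}
```

In this sketch a `)` inside a comment is ignored, whereas a `#` in the middle of a word (as in `a#b`) is ordinary text and does not hide anything that follows it.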
TODOs: