Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix comment parsing inside command substitutions and brackets #8695

Merged
merged 1 commit into from
Feb 8, 2022
Merged

Fix comment parsing inside command substitutions and brackets #8695

merged 1 commit into from
Feb 8, 2022

Conversation

Shai-Aviv
Copy link
Contributor

@Shai-Aviv Shai-Aviv commented Feb 4, 2022

This change avoids parsing brackets and quotes within comments.

This fixes the following examples (as well as those described in #7866 and #8022):

Script Before After
echo a (
    echo b  # I'm using an apostrophe 
)
(line 1): Unexpected end of string, quotes are not balanced a b
echo a (
    echo b  # Oops, closing paren :)
)
(line 3): Unexpected ')' for unopened parenthesis a b
echo a[b #c]
a[b #c] Unexpected end of string, square brackets do not match

Fixes #7866, and fixes #8022.

TODOs:

  • Changes to fish usage are reflected in user documentation/manpages.
    • There are no changes in usage.
  • Tests have been added for regressions fixed
  • User-visible changes noted in CHANGELOG.rst

@floam
Copy link
Member

floam commented Feb 5, 2022

Well I'm a fan, nice job, cool!

The only thing I'd want to think on is if this is the simplest, best fix. These beyond-the-parser hinterlands of our codebase are super convoluted and tricky.

@Shai-Aviv
Copy link
Contributor Author

Shai-Aviv commented Feb 5, 2022

I refactored my changes in 1b17cda - they are now the simplest I could get them without refactoring unrelated parsing code. I also added regression tests to basic.fish.

@krobelus
Copy link
Member

krobelus commented Feb 6, 2022

Looks pretty nice! 👍

Should we note this change?

I think so.

src/parse_util.cpp Outdated Show resolved Hide resolved
tests/checks/basic.fish Show resolved Hide resolved
@Shai-Aviv Shai-Aviv changed the title Fix comment parsing Fix comment parsing inside command substitutions Feb 7, 2022
@Shai-Aviv Shai-Aviv changed the title Fix comment parsing inside command substitutions Fix comment parsing inside command substitutions and brackets Feb 7, 2022
@krobelus
Copy link
Member

krobelus commented Feb 8, 2022

Nice work! I have found one minor issue

Also, consider squashing the commits, so the changes are easier to follow for third parties (including
future you and future me).

@@ -679,6 +682,10 @@ maybe_t<tok_t> tokenizer_t::next() {
return result;
}

bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next) {
return c == L'(' || !tok_is_string_character(c, is_first, next);
Copy link
Member

@krobelus krobelus Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. This function deviates from tok_is_string_character() because today (a)(b)(c) is a single token at parsing time. Only during execution, command substitutions are reparsed and further split. We could change this in future.

Here is an alternative solution I played around with, but I think we should just discard it, because
is_comment() doesn't deal with backslash escapes itself but requires the caller to do so, which is weird.

diff --git a/src/parse_util.cpp b/src/parse_util.cpp
index c626804d0..82870c6d9 100644
--- a/src/parse_util.cpp
+++ b/src/parse_util.cpp
@@ -90,8 +90,6 @@ size_t parse_util_get_offset(const wcstring &str, int line, long line_offset) {
 static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, const wchar_t **end,
                                     bool allow_incomplete, bool *inout_is_quoted) {
     bool escaped = false;
-    bool is_first = true;
-    bool is_token_begin = true;
     bool syntax_error = false;
     int paran_count = 0;
     std::vector<int> quoted_cmdsubs;
@@ -126,7 +124,7 @@ static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, co
                 if (!process_opening_quote(*pos)) break;
             } else if (*pos == L'\\') {
                 escaped = true;
-            } else if (*pos == L'#' && is_token_begin) {
+            } else if (is_comment(pos, in)) {
                 pos = comment_end(pos) - 1;
             } else {
                 if (*pos == L'(') {
@@ -167,12 +165,9 @@ static int parse_util_locate_cmdsub(const wchar_t *in, const wchar_t **begin, co
                     }
                 }
             }
-            is_token_begin = is_token_delimiter(pos[0], is_first, pos[1]);
         } else {
             escaped = false;
-            is_token_begin = false;
         }
-        is_first = false;
     }

     syntax_error |= (paran_count < 0);
diff --git a/src/tokenizer.cpp b/src/tokenizer.cpp
index 861c57f61..9d48cdb7e 100644
--- a/src/tokenizer.cpp
+++ b/src/tokenizer.cpp
@@ -154,7 +154,6 @@ tok_t tokenizer_t::read_string() {
     int slice_offset = 0;
     const wchar_t *const buff_start = this->token_cursor;
     bool is_first = true;
-    bool is_token_begin = true;

     auto process_opening_quote = [&](wchar_t quote) -> const wchar_t * {
         const wchar_t *end = quote_end(this->token_cursor, quote);
@@ -193,7 +192,7 @@ tok_t tokenizer_t::read_string() {
         // has been explicitly ignored (escaped).
         else if (c == L'\\') {
             mode |= tok_modes::char_escape;
-        } else if (c == L'#' && is_token_begin) {
+        } else if (is_comment(this->token_cursor, buff_start)) {
             this->token_cursor = comment_end(this->token_cursor) - 1;
         } else if (c == L'(') {
             paran_offsets.push_back(this->token_cursor - this->start);
@@ -281,7 +280,6 @@ tok_t tokenizer_t::read_string() {
         FLOGF(error, msg.c_str(), c, c, int(mode_begin), int(mode));
 #endif

-        is_token_begin = is_token_delimiter(this->token_cursor[0], is_first, this->token_cursor[1]);
         is_first = false;
         this->token_cursor++;
     }
@@ -682,8 +680,12 @@ maybe_t<tok_t> tokenizer_t::next() {
     return result;
 }

-bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next) {
-    return c == L'(' || !tok_is_string_character(c, is_first, next);
+bool is_comment(const wchar_t *pos, const wchar_t *tok_start) {
+    if (*pos != L'#') return false;
+    if (pos == tok_start) return true;
+    wchar_t prev = *(pos - 1);
+    bool prev_is_first = (pos - 1) == tok_start;
+    return prev == L'(' || prev == L')' || !tok_is_string_character(prev, prev_is_first, *pos);
 }

 wcstring tok_first(const wcstring &str) {
diff --git a/src/tokenizer.h b/src/tokenizer.h
index fccff61db..5322a2d32 100644
--- a/src/tokenizer.h
+++ b/src/tokenizer.h
@@ -133,8 +133,8 @@ class tokenizer_t : noncopyable_t {
     }
 };

-/// Tests if this character can delimit tokens.
-bool is_token_delimiter(wchar_t c, bool is_first, maybe_t<wchar_t> next);
+/// Tests if the given position starts a comment (possible inside a command substitution).
+bool is_comment(const wchar_t *pos, const wchar_t *tok_start);

 /// Returns only the first token from the specified string. This is a convenience function, used to
 /// retrieve the first token of a string. This can be useful for error messages, etc. On failure,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your solution! My only reservation is that ) is interpreted as a delimiter - as I said in my previous comment, I think it shouldn't.

Escapes are not handled in tok_is_string_character() either - for example, it wouldn't know if c = ';' is escaped - it assumes that it isn't. So I think it is reasonable to burden the caller with escaping. We can also document that, perhaps like this:

/// \c pos must not be an escaped character.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escapes are not handled in tok_is_string_character() either

Yeah, but that doesn't contradict that function's name.
Meanwhile is_comment() is a straight up lie if quoted/escaped, and I couldn't find a better name.

So I'd just merge as-is.

src/tokenizer.cpp Show resolved Hide resolved
@krobelus krobelus merged commit 2ef12af into fish-shell:master Feb 8, 2022
@krobelus krobelus added the bug Something that's not working as intended label Feb 8, 2022
@krobelus krobelus added this to the fish 3.4.0 milestone Feb 8, 2022
@Shai-Aviv Shai-Aviv deleted the fix-comment-parsing branch February 11, 2022 09:12
@IlanCosman IlanCosman mentioned this pull request Feb 18, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something that's not working as intended
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bracket pairing breaks comments Parser checks quotes in command substitution comments
3 participants