-
Notifications
You must be signed in to change notification settings - Fork 0
Enforce byte-accurate stdin limits for strip-markup #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Enforce byte-accurate stdin limits for strip-markup #6
Conversation
|
strip-markup reads all of stdin into memory without any size limit before processing, so an attacker who can supply input can trigger unbounded memory and CPU consumption (denial of service) simply by piping a very large payload. The code path is straight from sys.stdin.read() into the sanitizer with no streaming or truncation. The sanitizer runs two full HTMLParser passes over whatever string it receives, again without any guardrails on input length or complexity. Combined with the unbounded read above, a crafted oversized or deeply nested input will be parsed twice and buffered in StringIO, compounding the potential DoS impact. Additional security concerns in strip-markup Not safe for contextual HTML output. The “sanitization” only removes tags and (in the fallback path) converts <, >, and & to underscores; it does not escape quotes, backticks, or other characters that are dangerous in attribute, JavaScript, or CSS contexts. If the supposedly “clean” output is later interpolated into HTML attributes or script blocks, an attacker can inject payloads such as javascript:... URLs or string-breaking quotes that lead to XSS despite markup stripping. Enforced a 1 MiB maximum input size for strip-markup arguments and stdin reads, returning a clear error when the limit is exceeded to prevent resource exhaustion. Added control-character filtering after markup stripping to block terminal escape injection while preserving printable content. Expanded unit tests with helpers to cover failure cases, control-character sanitization, and oversized input handling for both arguments and stdin. Updated control-character filtering to drop carriage returns so terminal overprinting and log spoofing sequences are removed from strip-markup output. Added regression coverage ensuring carriage return entities are sanitized during CLI processing. Enforced byte-accurate stdin handling by reading from the binary buffer when available and decoding only after validating the input length against the 1 MiB cap. Updated test helpers to model buffered stdin streams and added regression coverage for oversized multibyte stdin payloads to ensure byte-length limits cannot be bypassed. |
| if len(untrusted_string.encode()) > MAX_INPUT_BYTES: | ||
| print( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.", | ||
| file=sys.stderr, | ||
| ) | ||
| return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this to help prevent a DoS when using strip_markup is... questionable at best. We're taking UTF-8 text, running it through an encoder to generate a copy of the entire input in a different format, then taking the length of that generated data? This immediately doubles memory consumption, so if we're worried about receiving multi-gigabyte strings causing a DoS, this is not in our best interest.
I also feel like trying to fight against DoS attacks here is wrong. sanitize_string supports truncating an overlong input, but that's to ensure that the output is a suitable size for passing to other things thereafter. It's not to prevent an attacker from passing massive input data, arguably the caller should be the one ensuring massive input data is rejected early on. (In some instances, like when receiving data from the network, the caller is the only one who can ensure massive input data is properly rejected, trying to defend against it here wouldn't even work.) On the other hand, a user might legitimately want to process a massive file with this or similar tools. Why prevent that?
| if "pytest" not in sys.modules: | ||
| sys.stdin.reconfigure(errors="ignore") # type: ignore | ||
| untrusted_string = sys.stdin.read() | ||
| if hasattr(sys.stdin, "buffer"): | ||
| raw_stdin = sys.stdin.buffer.read(MAX_INPUT_BYTES + 1) | ||
| if len(raw_stdin) > MAX_INPUT_BYTES: | ||
| print( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.", | ||
| file=sys.stderr, | ||
| ) | ||
| return 1 | ||
| encoding = getattr(sys.stdin, "encoding", None) or "utf-8" | ||
| untrusted_string = raw_stdin.decode(encoding, errors="ignore") | ||
| else: | ||
| if "pytest" not in sys.modules and hasattr(sys.stdin, "reconfigure"): | ||
| sys.stdin.reconfigure(errors="ignore") # type: ignore | ||
| untrusted_string = sys.stdin.read(MAX_INPUT_BYTES + 1) | ||
| if len(untrusted_string.encode()) > MAX_INPUT_BYTES: | ||
| print( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.", | ||
| file=sys.stderr, | ||
| ) | ||
| return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems similarly wrong IMO.
| # pylint: disable=too-many-arguments,too-many-positional-arguments | ||
| def _test_args_failure( | ||
| self, | ||
| main_func: Callable[[], int], | ||
| argv0: str, | ||
| stdout_string: str, | ||
| stderr_string: str, | ||
| args: list[str], | ||
| exit_code: int = 1, | ||
| ) -> None: | ||
| """ | ||
| Executes the provided main function expecting failure output. | ||
| """ | ||
|
|
||
| args_arr: list[str] = [argv0, *args] | ||
| stdout_buf: io.StringIO = io.StringIO() | ||
| stderr_buf: io.StringIO = io.StringIO() | ||
| with ( | ||
| mock.patch.object(sys, "argv", args_arr), | ||
| mock.patch.object(sys, "stdout", stdout_buf), | ||
| mock.patch.object(sys, "stderr", stderr_buf), | ||
| ): | ||
| returned_exit_code: int = main_func() | ||
| self.assertEqual(stdout_buf.getvalue(), stdout_string) | ||
| self.assertEqual(stderr_buf.getvalue(), stderr_string) | ||
| self.assertEqual(returned_exit_code, exit_code) | ||
| stdout_buf.close() | ||
| stderr_buf.close() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is superfluous, we can just augment the existing _test_args mechanism rather than making a whole new function.
| stdout_buf: StringIO = StringIO() | ||
| stderr_buf: StringIO = StringIO() | ||
| stdin_buf: StringIO = StringIO() | ||
| stdin_buf.write(stdin_string) | ||
| stdin_buf.seek(0) | ||
| stdout_buf: io.StringIO = io.StringIO() | ||
| stderr_buf: io.StringIO = io.StringIO() | ||
| stdin_bytes: io.BytesIO = io.BytesIO() | ||
| stdin_bytes.write(stdin_string.encode("utf-8")) | ||
| stdin_bytes.seek(0) | ||
| stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit overcomplicated. Earlier I ended up reimplementing this in a way that is simpler and uses TextIOWrappers for all of stdout, stderr, and stdin, so this isn't necessary.
| # pylint: disable=too-many-arguments,too-many-positional-arguments | ||
| def _test_stdin_failure( | ||
| self, | ||
| main_func: Callable[[], int], | ||
| argv0: str, | ||
| stdout_string: str, | ||
| stderr_string: str, | ||
| args: list[str], | ||
| stdin_string: str, | ||
| exit_code: int = 1, | ||
| ) -> None: | ||
| """ | ||
| Executes the provided main function expecting failure output when using stdin. | ||
| """ | ||
|
|
||
| stdout_buf: io.StringIO = io.StringIO() | ||
| stderr_buf: io.StringIO = io.StringIO() | ||
| stdin_bytes: io.BytesIO = io.BytesIO() | ||
| stdin_bytes.write(stdin_string.encode("utf-8")) | ||
| stdin_bytes.seek(0) | ||
| stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8") | ||
| args_arr: list[str] = [argv0, *args] | ||
| with ( | ||
| mock.patch.object(sys, "argv", args_arr), | ||
| mock.patch.object(sys, "stdin", stdin_buf), | ||
| mock.patch.object(sys, "stdout", stdout_buf), | ||
| mock.patch.object(sys, "stderr", stderr_buf), | ||
| ): | ||
| returned_exit_code: int = main_func() | ||
| self.assertEqual(stdout_buf.getvalue(), stdout_string) | ||
| self.assertEqual(stderr_buf.getvalue(), stderr_string) | ||
| self.assertEqual(returned_exit_code, exit_code) | ||
| stdout_buf.close() | ||
| stderr_buf.close() | ||
| stdin_buf.close() | ||
| stdin_bytes.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This, again, could be implemented much more simply by augmenting the existing _test_stdin function.
| def test_rejects_oversized_argument(self) -> None: | ||
| """ | ||
| Ensure overly large arguments are rejected to avoid excessive memory use. | ||
| """ | ||
|
|
||
| oversized_input: str = "x" * (MAX_INPUT_BYTES + 1) | ||
| error_msg: str = ( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n" | ||
| ) | ||
| self._test_args_failure( | ||
| main_func=strip_markup_main, | ||
| argv0=self.argv0, | ||
| stdout_string="", | ||
| stderr_string=error_msg, | ||
| args=[oversized_input], | ||
| ) | ||
|
|
||
| def test_rejects_oversized_stdin(self) -> None: | ||
| """ | ||
| Ensure overly large stdin payloads are rejected. | ||
| """ | ||
|
|
||
| oversized_input: str = "y" * (MAX_INPUT_BYTES + 1) | ||
| error_msg: str = ( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n" | ||
| ) | ||
| self._test_stdin_failure( | ||
| main_func=strip_markup_main, | ||
| argv0=self.argv0, | ||
| stdout_string="", | ||
| stderr_string=error_msg, | ||
| args=[], | ||
| stdin_string=oversized_input, | ||
| ) | ||
|
|
||
| def test_rejects_multibyte_oversized_stdin(self) -> None: | ||
| """ | ||
| Ensure byte-size enforcement applies when stdin includes multibyte characters. | ||
| """ | ||
|
|
||
| multibyte_char: str = "€" | ||
| multibyte_oversized_input: str = multibyte_char * ( | ||
| (MAX_INPUT_BYTES // len(multibyte_char.encode("utf-8"))) + 1 | ||
| ) | ||
| error_msg: str = ( | ||
| f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n" | ||
| ) | ||
| self._test_stdin_failure( | ||
| main_func=strip_markup_main, | ||
| argv0=self.argv0, | ||
| stdout_string="", | ||
| stderr_string=error_msg, | ||
| args=[], | ||
| stdin_string=multibyte_oversized_input, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As explained earlier, we don't want to reject large inputs.
| def _strip_control_characters(untrusted_string: str) -> str: | ||
| """ | ||
| Remove control characters that could be used for terminal escapes. | ||
| """ | ||
|
|
||
| return "".join( | ||
| char for char in untrusted_string if char.isprintable() or char in "\n\t" | ||
| ) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks possibly useful, but it makes me wonder where the job of strip_markup stops and sanitize_string starts. Doing this will essentially do a lot of the work sanitize_string does, except it won't insert underscores in place of Unicode and control characters. The underscores seem valuable at least to me because it tells the user something weird is happening, rather than just protecting them from it and leaving them clueless about it. It's also likely that the regex matching is substantially faster than this.
Patrick and I have already been talking about the possibility that strip_markup is unnecessary now that we have sanitize_string. If we have to add something like this to make strip_markup even safe, that seems like a strong argument in favor of just getting rid of strip_markup and using sanitize_string instead. It does its job much better, and should be fairly fast at it.
As explained in the code review above, this isn't a problem.
Ditto.
This thankfully ended up not actually being an issue. While it is true one can encode ESC characters as HTML entities, Python 3.13's HTMLParser is smart enough to strip those out when used in the way strip_markup uses it. As defense-in-depth, I added an extra
sanitize_string is not expected to return strings that are safe to place into HTML attributes or script blocks. It is intended to make sure that text isn't interpreted as HTML if placed into an HTML viewer or similar, and to ensure that ANSI escape sequences are not present in the text. We can't sanitize text in such a way that it's safe for everything, because the unsafe characters for any particular use case may vary widely.
Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string.
Rejected, because this is not strip_markup's job. Instead, added additional filtering to sanitize_string.
Rejected, because this is not strip_markup's job. sanitize_string already reuses stdisplay's test cases that cover the control character and carriage return scenarios.
Rejected, because this is not strip_markup's job.
Rejected, because this is not strip_markup's job. However, having this brought up led to the discovery that we're leaving Python's "universal newlines" feature enabled for all of these utilities, which means
Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string, and this way of checking input length leads to excessive memory consumption.
Already implemented the buffered stdin stream modelling a bit ago. Rejected the oversized multibyte stdin coverage for reasons explained above. No commit made, since all suggestions were rejected. This PR can be closed. (This was still very helpful since the issues mentioned here led to hardening work being done before I got to reviewing this.) |
Summary
Testing
Codex Task