Conversation

@assisted-by-ai
Owner

Summary

  • read stdin via the binary buffer when available so byte-size limits are enforced before decoding
  • keep compatibility with non-buffered streams while still enforcing the byte-length limit, and adjust tests to match buffered stdin behavior
  • add regression coverage for multibyte stdin inputs that would otherwise bypass the byte cap

Testing

  • PYTHONPATH=usr/lib/python3/dist-packages python -m unittest strip_markup.tests.strip_markup

Codex Task

@assisted-by-ai
Owner Author

strip-markup reads all of stdin into memory without any size limit before processing, so an attacker who can supply input can trigger unbounded memory and CPU consumption (denial of service) simply by piping a very large payload. The code path is straight from sys.stdin.read() into the sanitizer with no streaming or truncation.

The sanitizer runs two full HTMLParser passes over whatever string it receives, again without any guardrails on input length or complexity. Combined with the unbounded read above, a crafted oversized or deeply nested input will be parsed twice and buffered in StringIO, compounding the potential DoS impact.

Additional security concerns in strip-markup
Terminal/console escape injection remains possible. The sanitizer removes only markup tags and replaces <, >, and & with underscores when it detects suspicious transformations, but it leaves all other control characters untouched. Because StripMarkupEngine runs with convert_charrefs = True, an attacker can supply inputs containing HTML entities for escape characters (e.g., &#27; for ESC), which will be decoded and written back out unchanged via sys.stdout.write. This can produce ANSI escape sequences in logs or terminals, enabling log forgery or malicious terminal control even though HTML tags were stripped.

Not safe for contextual HTML output. The “sanitization” only removes tags and (in the fallback path) converts <, >, and & to underscores; it does not escape quotes, backticks, or other characters that are dangerous in attribute, JavaScript, or CSS contexts. If the supposedly “clean” output is later interpolated into HTML attributes or script blocks, an attacker can inject payloads such as javascript:... URLs or string-breaking quotes that lead to XSS despite markup stripping.

Enforced a 1 MiB maximum input size for strip-markup arguments and stdin reads, returning a clear error when the limit is exceeded to prevent resource exhaustion.

Added control-character filtering after markup stripping to block terminal escape injection while preserving printable content.

Expanded unit tests with helpers to cover failure cases, control-character sanitization, and oversized input handling for both arguments and stdin.

Updated control-character filtering to drop carriage returns so terminal overprinting and log spoofing sequences are removed from strip-markup output.

Added regression coverage ensuring carriage return entities are sanitized during CLI processing.

Enforced byte-accurate stdin handling by reading from the binary buffer when available and decoding only after validating the input length against the 1 MiB cap.

Updated test helpers to model buffered stdin streams and added regression coverage for oversized multibyte stdin payloads to ensure byte-length limits cannot be bypassed.

Comment on lines +58 to +63
if len(untrusted_string.encode()) > MAX_INPUT_BYTES:
    print(
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
        file=sys.stderr,
    )
    return 1

Using this to help prevent a DoS when using strip_markup is... questionable at best. We're taking UTF-8 text, running it through an encoder to generate a copy of the entire input in a different format, then taking the length of that generated data? This immediately doubles memory consumption, so if we're worried about receiving multi-gigabyte strings causing a DoS, this is not in our best interest.

I also feel like trying to fight against DoS attacks here is wrong. sanitize_string supports truncating an overlong input, but that's to ensure that the output is a suitable size for passing to other things thereafter. It's not to prevent an attacker from passing massive input data, arguably the caller should be the one ensuring massive input data is rejected early on. (In some instances, like when receiving data from the network, the caller is the only one who can ensure massive input data is properly rejected, trying to defend against it here wouldn't even work.) On the other hand, a user might legitimately want to process a massive file with this or similar tools. Why prevent that?
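
As an aside on the memory point above: if a byte cap were kept at all, the full-copy encode could be avoided by counting UTF-8 bytes in chunks. A minimal sketch (hypothetical helper, not part of this PR):

def utf8_size_exceeds(text: str, limit: int, chunk_chars: int = 1 << 16) -> bool:
    """
    Return True if text's UTF-8 encoding is larger than limit bytes,
    encoding in chunks so a full second copy of the input is never held.
    """

    total = 0
    for start in range(0, len(text), chunk_chars):
        total += len(text[start:start + chunk_chars].encode("utf-8"))
        if total > limit:
            return True
    return False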

Comment on lines -59 to +87
if "pytest" not in sys.modules:
sys.stdin.reconfigure(errors="ignore") # type: ignore
untrusted_string = sys.stdin.read()
if hasattr(sys.stdin, "buffer"):
raw_stdin = sys.stdin.buffer.read(MAX_INPUT_BYTES + 1)
if len(raw_stdin) > MAX_INPUT_BYTES:
print(
f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
file=sys.stderr,
)
return 1
encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
untrusted_string = raw_stdin.decode(encoding, errors="ignore")
else:
if "pytest" not in sys.modules and hasattr(sys.stdin, "reconfigure"):
sys.stdin.reconfigure(errors="ignore") # type: ignore
untrusted_string = sys.stdin.read(MAX_INPUT_BYTES + 1)
if len(untrusted_string.encode()) > MAX_INPUT_BYTES:
print(
f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
file=sys.stderr,
)
return 1

This seems similarly wrong IMO.

Comment on lines +53 to +81
# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_args_failure(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    exit_code: int = 1,
) -> None:
    """
    Executes the provided main function expecting failure output.
    """

    args_arr: list[str] = [argv0, *args]
    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()

This is superfluous; we can just augment the existing _test_args mechanism rather than adding a whole new function.
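
A minimal sketch of that augmentation, assuming the existing _test_args looks roughly like _test_args_failure minus the expected exit code (names and body are illustrative, not the repository's actual helper):

# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_args(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    exit_code: int = 0,  # new parameter; defaults to success so existing callers stay unchanged
) -> None:
    """
    Executes the provided main function and checks its output and exit code.
    """

    args_arr: list[str] = [argv0, *args]
    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()

Failure tests would then pass exit_code=1 instead of calling a separate helper.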

Comment on lines -68 to +102
-stdout_buf: StringIO = StringIO()
-stderr_buf: StringIO = StringIO()
-stdin_buf: StringIO = StringIO()
-stdin_buf.write(stdin_string)
-stdin_buf.seek(0)
+stdout_buf: io.StringIO = io.StringIO()
+stderr_buf: io.StringIO = io.StringIO()
+stdin_bytes: io.BytesIO = io.BytesIO()
+stdin_bytes.write(stdin_string.encode("utf-8"))
+stdin_bytes.seek(0)
+stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8")

This is a bit overcomplicated. Earlier I ended up reimplementing this in a way that is simpler and uses TextIOWrappers for all of stdout, stderr, and stdin, so this isn't necessary.
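
For reference, a sketch of the simpler shape being described, backing all three streams with BytesIO wrapped in TextIOWrapper (hypothetical helper, not the actual reimplementation):

def _make_std_streams(
    stdin_string: str,
) -> tuple[io.TextIOWrapper, io.TextIOWrapper, io.TextIOWrapper]:
    """
    Build text-mode stdin/stdout/stderr test doubles backed by byte buffers.
    """

    stdin = io.TextIOWrapper(io.BytesIO(stdin_string.encode("utf-8")), encoding="utf-8")
    stdout = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", write_through=True)
    stderr = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", write_through=True)
    return stdin, stdout, stderr

Captured output can then be read back after the code under test runs, e.g. stdout.buffer.getvalue().decode("utf-8").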

Comment on lines +119 to +154
# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_stdin_failure(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    stdin_string: str,
    exit_code: int = 1,
) -> None:
    """
    Executes the provided main function expecting failure output when using stdin.
    """

    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    stdin_bytes: io.BytesIO = io.BytesIO()
    stdin_bytes.write(stdin_string.encode("utf-8"))
    stdin_bytes.seek(0)
    stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8")
    args_arr: list[str] = [argv0, *args]
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdin", stdin_buf),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()
    stdin_buf.close()
    stdin_bytes.close()

This, again, could be implemented much more simply by augmenting the existing _test_stdin function.

Comment on lines +467 to +521
def test_rejects_oversized_argument(self) -> None:
    """
    Ensure overly large arguments are rejected to avoid excessive memory use.
    """

    oversized_input: str = "x" * (MAX_INPUT_BYTES + 1)
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_args_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[oversized_input],
    )

def test_rejects_oversized_stdin(self) -> None:
    """
    Ensure overly large stdin payloads are rejected.
    """

    oversized_input: str = "y" * (MAX_INPUT_BYTES + 1)
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_stdin_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[],
        stdin_string=oversized_input,
    )

def test_rejects_multibyte_oversized_stdin(self) -> None:
    """
    Ensure byte-size enforcement applies when stdin includes multibyte characters.
    """

    multibyte_char: str = "€"
    multibyte_oversized_input: str = multibyte_char * (
        (MAX_INPUT_BYTES // len(multibyte_char.encode("utf-8"))) + 1
    )
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_stdin_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[],
        stdin_string=multibyte_oversized_input,
    )

As explained earlier, we don't want to reject large inputs.

Comment on lines +45 to +54
def _strip_control_characters(untrusted_string: str) -> str:
    """
    Remove control characters that could be used for terminal escapes.
    """

    return "".join(
        char for char in untrusted_string if char.isprintable() or char in "\n\t"
    )


This looks possibly useful, but it makes me wonder where the job of strip_markup stops and sanitize_string starts. Doing this will essentially do a lot of the work sanitize_string does, except it won't insert underscores in place of Unicode and control characters. The underscores seem valuable, at least to me, because they tell the user something weird is happening, rather than just protecting them from it and leaving them clueless about it. It's also likely that the regex matching is substantially faster than this.

Patrick and I have already been talking about the possibility that strip_markup is unnecessary now that we have sanitize_string. If we have to add something like this to make strip_markup even safe, that seems like a strong argument in favor of just getting rid of strip_markup and using sanitize_string instead. It does its job much better, and should be fairly fast at it.
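
For comparison, a regex-based filter in the spirit of that suggestion, substituting underscores rather than silently dropping characters (the character class here is an assumption; sanitize_string's actual pattern is not shown in this diff):

import re

# C0 controls except tab and newline, plus DEL and the C1 range.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def replace_control_characters(untrusted_string: str) -> str:
    """
    Substitute underscores so the user can see that something was removed.
    """

    return _CONTROL_CHARS.sub("_", untrusted_string)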

@ArrayBolt3

ArrayBolt3 commented Dec 4, 2025

strip-markup reads all of stdin into memory without any size limit before processing, so an attacker who can supply input can trigger unbounded memory and CPU consumption (denial of service) simply by piping a very large payload. The code path is straight from sys.stdin.read() into the sanitizer with no streaming or truncation.

As explained in the code review above, this isn't a problem.

The sanitizer runs two full HTMLParser passes over whatever string it receives, again without any guardrails on input length or complexity. Combined with the unbounded read above, a crafted oversized or deeply nested input will be parsed twice and buffered in StringIO, compounding the potential DoS impact.

Ditto.

Additional security concerns in strip-markup: Terminal/console escape injection remains possible. The sanitizer removes only markup tags and replaces <, >, and & with underscores when it detects suspicious transformations, but it leaves all other control characters untouched. Because StripMarkupEngine runs with convert_charrefs = True, an attacker can supply inputs containing HTML entities for escape characters (e.g., &#27; for ESC), which will be decoded and written back out unchanged via sys.stdout.write. This can produce ANSI escape sequences in logs or terminals, enabling log forgery or malicious terminal control even though HTML tags were stripped.

This thankfully ended up not actually being an issue. While it is true one can encode ESC characters as HTML entities, Python 3.13's HTMLParser is smart enough to strip those out when used in the way strip_markup uses it. As defense-in-depth, I added an extra stdisplay call to sanitize_string to prevent issues in the event ESC characters do slip through the initial stdisplay and strip_markup calls, so if someone figures out how to encode an ESC some other way, it still won't work.
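
A quick way to confirm that locally is to feed entity-encoded ESC sequences through an HTMLParser configured the same way; the expected result below is an assumption based on CPython's handling of disallowed control-character references, so verify it on the Python version in use:

from html.parser import HTMLParser


class EntityProbe(HTMLParser):
    """
    Collects decoded text the way a convert_charrefs parser sees it.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


probe = EntityProbe()
probe.feed("&#27;[31mlooks alarming&#27;[0m")
probe.close()
print("\x1b" in "".join(probe.chunks))  # expected: False, the ESC references are dropped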

Not safe for contextual HTML output. The “sanitization” only removes tags and (in the fallback path) converts <, >, and & to underscores; it does not escape quotes, backticks, or other characters that are dangerous in attribute, JavaScript, or CSS contexts. If the supposedly “clean” output is later interpolated into HTML attributes or script blocks, an attacker can inject payloads such as javascript:... URLs or string-breaking quotes that lead to XSS despite markup stripping.

sanitize_string is not expected to return strings that are safe to place into HTML attributes or script blocks. It is intended to make sure that text isn't interpreted as HTML if placed into an HTML viewer or similar, and to ensure that ANSI escape sequences are not present in the text. We can't sanitize text in such a way that it's safe for everything, because the unsafe characters for any particular use case may vary widely.
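
For example, text destined for an HTML attribute still needs context-specific escaping at the point of use, which is a separate step from anything sanitize_string provides (illustrative only, using the standard library's html.escape):

import html

clean = 'say "hi" & <wave>'  # whatever the sanitizer returned
attr_safe = html.escape(clean, quote=True)  # quote=True also escapes quotes for attribute contexts
print(f'<input value="{attr_safe}">')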

Enforced a 1 MiB maximum input size for strip-markup arguments and stdin reads, returning a clear error when the limit is exceeded to prevent resource exhaustion.

Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string.

Added control-character filtering after markup stripping to block terminal escape injection while preserving printable content.

Rejected, because this is not strip_markup's job. Instead, added additional filtering to sanitize_string.

Expanded unit tests with helpers to cover failure cases, control-character sanitization, and oversized input handling for both arguments and stdin.

Rejected, because this is not strip_markup's job. sanitize_string already reuses stdisplay's test cases that cover the control character and carriage return scenarios.

Updated control-character filtering to drop carriage returns so terminal overprinting and log spoofing sequences are removed from strip-markup output.

Rejected, because this is not strip_markup's job. stdisplay strips out carriage return characters for sanitize_string.

Added regression coverage ensuring carriage return entities are sanitized during CLI processing.

Rejected, because this is not strip_markup's job. However, having this brought up led to the discovery that we're leaving Python's "universal newlines" feature enabled for all of these utilities, which means \r and \r\n line endings are translated to \n on input just like plain \n. While useful for portability, this leads to inconsistency (cat file | stcat has different output than stcat file if file contains carriage returns), so we should probably fix this.
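
One possible direction for that fix, assuming the goal is to see carriage returns untranslated on stdin (a sketch, not a committed change):

import sys

# newline="" keeps line endings untranslated, so \r reaches the sanitizer
# instead of being folded into \n by universal newlines.
if hasattr(sys.stdin, "reconfigure"):
    sys.stdin.reconfigure(newline="")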

Enforced byte-accurate stdin handling by reading from the binary buffer when available and decoding only after validating the input length against the 1 MiB cap.

Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string, and this way of checking input length leads to excessive memory consumption.

Updated test helpers to model buffered stdin streams and added regression coverage for oversized multibyte stdin payloads to ensure byte-length limits cannot be bypassed.

Already implemented the buffered stdin stream modelling a bit ago. Rejected the oversized multibyte stdin coverage for reasons explained above.

No commit made, since all suggestions were rejected. This PR can be closed. (This was still very helpful since the issues mentioned here led to hardening work being done before I got to reviewing this.)
