Conversation

@assisted-by-ai
Owner

Summary

  • read stdin via the binary buffer when available so byte-size limits are enforced before decoding
  • keep compatibility with non-buffered streams while still enforcing the byte-length limit, and adjust tests to match buffered stdin behavior
  • add regression coverage for multibyte stdin inputs that would otherwise bypass the byte cap

Testing

  • PYTHONPATH=usr/lib/python3/dist-packages python -m unittest strip_markup.tests.strip_markup

Codex Task

@assisted-by-ai
Owner Author

strip-markup reads all of stdin into memory without any size limit before processing, so an attacker who can supply input can trigger unbounded memory and CPU consumption (denial of service) simply by piping a very large payload. The code path is straight from sys.stdin.read() into the sanitizer with no streaming or truncation.

The sanitizer runs two full HTMLParser passes over whatever string it receives, again without any guardrails on input length or complexity. Combined with the unbounded read above, a crafted oversized or deeply nested input will be parsed twice and buffered in StringIO, compounding the potential DoS impact.

Additional security concerns in strip-markup
Terminal/console escape injection remains possible. The sanitizer removes only markup tags and replaces <, >, and & with underscores when it detects suspicious transformations, but it leaves all other control characters untouched. Because StripMarkupEngine runs with convert_charrefs = True, an attacker can supply inputs containing HTML entities for escape characters (e.g., &#27; for ESC), which will be decoded and written back out unchanged via sys.stdout.write. This can produce ANSI escape sequences in logs or terminals, enabling log forgery or malicious terminal control even though HTML tags were stripped.

Not safe for contextual HTML output. The “sanitization” only removes tags and (in the fallback path) converts <, >, and & to underscores; it does not escape quotes, backticks, or other characters that are dangerous in attribute, JavaScript, or CSS contexts. If the supposedly “clean” output is later interpolated into HTML attributes or script blocks, an attacker can inject payloads such as javascript:... URLs or string-breaking quotes that lead to XSS despite markup stripping.

Enforced a 1 MiB maximum input size for strip-markup arguments and stdin reads, returning a clear error when the limit is exceeded to prevent resource exhaustion.

Added control-character filtering after markup stripping to block terminal escape injection while preserving printable content.

Expanded unit tests with helpers to cover failure cases, control-character sanitization, and oversized input handling for both arguments and stdin.

Updated control-character filtering to drop carriage returns so terminal overprinting and log spoofing sequences are removed from strip-markup output.

Added regression coverage ensuring carriage return entities are sanitized during CLI processing.

Enforced byte-accurate stdin handling by reading from the binary buffer when available and decoding only after validating the input length against the 1 MiB cap.

Updated test helpers to model buffered stdin streams and added regression coverage for oversized multibyte stdin payloads to ensure byte-length limits cannot be bypassed.

Comment on lines +58 to +63
if len(untrusted_string.encode()) > MAX_INPUT_BYTES:
    print(
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
        file=sys.stderr,
    )
    return 1

Using this to help prevent a DoS when using strip_markup is... questionable at best. We're taking UTF-8 text, running it through an encoder to generate a copy of the entire input in a different format, then taking the length of that generated data? This immediately doubles memory consumption, so if we're worried about receiving multi-gigabyte strings causing a DoS, this is not in our best interest.

I also feel like trying to fight against DoS attacks here is wrong. sanitize_string supports truncating an overlong input, but that's to ensure that the output is a suitable size for passing to other things thereafter. It's not to prevent an attacker from passing massive input data, arguably the caller should be the one ensuring massive input data is rejected early on. (In some instances, like when receiving data from the network, the caller is the only one who can ensure massive input data is properly rejected, trying to defend against it here wouldn't even work.) On the other hand, a user might legitimately want to process a massive file with this or similar tools. Why prevent that?
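
As an aside on the memory point above: if a byte cap were kept at all, the full-copy encode could be avoided by counting UTF-8 bytes in chunks. A minimal sketch (hypothetical helper, not part of this PR):

def utf8_size_exceeds(text: str, limit: int, chunk_chars: int = 1 << 16) -> bool:
    """
    Return True if text's UTF-8 encoding is larger than limit bytes,
    encoding in chunks so a full second copy of the input is never held.
    """

    total = 0
    for start in range(0, len(text), chunk_chars):
        total += len(text[start:start + chunk_chars].encode("utf-8"))
        if total > limit:
            return True
    return False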

Comment on lines -59 to +87
if "pytest" not in sys.modules:
sys.stdin.reconfigure(errors="ignore") # type: ignore
untrusted_string = sys.stdin.read()
if hasattr(sys.stdin, "buffer"):
raw_stdin = sys.stdin.buffer.read(MAX_INPUT_BYTES + 1)
if len(raw_stdin) > MAX_INPUT_BYTES:
print(
f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
file=sys.stderr,
)
return 1
encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
untrusted_string = raw_stdin.decode(encoding, errors="ignore")
else:
if "pytest" not in sys.modules and hasattr(sys.stdin, "reconfigure"):
sys.stdin.reconfigure(errors="ignore") # type: ignore
untrusted_string = sys.stdin.read(MAX_INPUT_BYTES + 1)
if len(untrusted_string.encode()) > MAX_INPUT_BYTES:
print(
f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.",
file=sys.stderr,
)
return 1

This seems similarly wrong IMO.

Comment on lines +53 to +81
# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_args_failure(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    exit_code: int = 1,
) -> None:
    """
    Executes the provided main function expecting failure output.
    """

    args_arr: list[str] = [argv0, *args]
    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()

This is superfluous; we can just augment the existing _test_args mechanism rather than adding a whole new function.
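
A minimal sketch of that augmentation, assuming the existing _test_args looks roughly like _test_args_failure minus the expected exit code (names and body are illustrative, not the repository's actual helper):

# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_args(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    exit_code: int = 0,  # new parameter; defaults to success so existing callers stay unchanged
) -> None:
    """
    Executes the provided main function and checks its output and exit code.
    """

    args_arr: list[str] = [argv0, *args]
    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()

Failure tests would then pass exit_code=1 instead of calling a separate helper.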

Comment on lines -68 to +102
-stdout_buf: StringIO = StringIO()
-stderr_buf: StringIO = StringIO()
-stdin_buf: StringIO = StringIO()
-stdin_buf.write(stdin_string)
-stdin_buf.seek(0)
+stdout_buf: io.StringIO = io.StringIO()
+stderr_buf: io.StringIO = io.StringIO()
+stdin_bytes: io.BytesIO = io.BytesIO()
+stdin_bytes.write(stdin_string.encode("utf-8"))
+stdin_bytes.seek(0)
+stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8")

This is a bit overcomplicated. Earlier I ended up reimplementing this in a way that is simpler and uses TextIOWrappers for all of stdout, stderr, and stdin, so this isn't necessary.
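
For reference, a sketch of the simpler shape being described, backing all three streams with BytesIO wrapped in TextIOWrapper (hypothetical helper, not the actual reimplementation):

def _make_std_streams(
    stdin_string: str,
) -> tuple[io.TextIOWrapper, io.TextIOWrapper, io.TextIOWrapper]:
    """
    Build text-mode stdin/stdout/stderr test doubles backed by byte buffers.
    """

    stdin = io.TextIOWrapper(io.BytesIO(stdin_string.encode("utf-8")), encoding="utf-8")
    stdout = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", write_through=True)
    stderr = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", write_through=True)
    return stdin, stdout, stderr

Captured output can then be read back after the code under test runs, e.g. stdout.buffer.getvalue().decode("utf-8").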

Comment on lines +119 to +154
# pylint: disable=too-many-arguments,too-many-positional-arguments
def _test_stdin_failure(
    self,
    main_func: Callable[[], int],
    argv0: str,
    stdout_string: str,
    stderr_string: str,
    args: list[str],
    stdin_string: str,
    exit_code: int = 1,
) -> None:
    """
    Executes the provided main function expecting failure output when using stdin.
    """

    stdout_buf: io.StringIO = io.StringIO()
    stderr_buf: io.StringIO = io.StringIO()
    stdin_bytes: io.BytesIO = io.BytesIO()
    stdin_bytes.write(stdin_string.encode("utf-8"))
    stdin_bytes.seek(0)
    stdin_buf: io.TextIOWrapper = io.TextIOWrapper(stdin_bytes, encoding="utf-8")
    args_arr: list[str] = [argv0, *args]
    with (
        mock.patch.object(sys, "argv", args_arr),
        mock.patch.object(sys, "stdin", stdin_buf),
        mock.patch.object(sys, "stdout", stdout_buf),
        mock.patch.object(sys, "stderr", stderr_buf),
    ):
        returned_exit_code: int = main_func()
    self.assertEqual(stdout_buf.getvalue(), stdout_string)
    self.assertEqual(stderr_buf.getvalue(), stderr_string)
    self.assertEqual(returned_exit_code, exit_code)
    stdout_buf.close()
    stderr_buf.close()
    stdin_buf.close()
    stdin_bytes.close()

This, again, could be implemented much more simply by augmenting the existing _test_stdin function.

Comment on lines +467 to +521
def test_rejects_oversized_argument(self) -> None:
    """
    Ensure overly large arguments are rejected to avoid excessive memory use.
    """

    oversized_input: str = "x" * (MAX_INPUT_BYTES + 1)
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_args_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[oversized_input],
    )

def test_rejects_oversized_stdin(self) -> None:
    """
    Ensure overly large stdin payloads are rejected.
    """

    oversized_input: str = "y" * (MAX_INPUT_BYTES + 1)
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_stdin_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[],
        stdin_string=oversized_input,
    )

def test_rejects_multibyte_oversized_stdin(self) -> None:
    """
    Ensure byte-size enforcement applies when stdin includes multibyte characters.
    """

    multibyte_char: str = "€"
    multibyte_oversized_input: str = multibyte_char * (
        (MAX_INPUT_BYTES // len(multibyte_char.encode("utf-8"))) + 1
    )
    error_msg: str = (
        f"strip-markup: input exceeds maximum size of {MAX_INPUT_BYTES} bytes.\n"
    )
    self._test_stdin_failure(
        main_func=strip_markup_main,
        argv0=self.argv0,
        stdout_string="",
        stderr_string=error_msg,
        args=[],
        stdin_string=multibyte_oversized_input,
    )

As explained earlier, we don't want to reject large inputs.

Comment on lines +45 to +54
def _strip_control_characters(untrusted_string: str) -> str:
    """
    Remove control characters that could be used for terminal escapes.
    """

    return "".join(
        char for char in untrusted_string if char.isprintable() or char in "\n\t"
    )


This looks possibly useful, but it makes me wonder where the job of strip_markup stops and sanitize_string starts. Doing this will essentially do a lot of the work sanitize_string does, except it won't insert underscores in place of Unicode and control characters. The underscores seem valuable, at least to me, because they tell the user something weird is happening, rather than just protecting them from it and leaving them clueless about it. It's also likely that the regex matching is substantially faster than this.

Patrick and I have already been talking about the possibility that strip_markup is unnecessary now that we have sanitize_string. If we have to add something like this to make strip_markup even safe, that seems like a strong argument in favor of just getting rid of strip_markup and using sanitize_string instead. It does its job much better, and should be fairly fast at it.
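
For comparison, a regex-based filter in the spirit of that suggestion, substituting underscores rather than silently dropping characters (the character class here is an assumption; sanitize_string's actual pattern is not shown in this diff):

import re

# C0 controls except tab and newline, plus DEL and the C1 range.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def replace_control_characters(untrusted_string: str) -> str:
    """
    Substitute underscores so the user can see that something was removed.
    """

    return _CONTROL_CHARS.sub("_", untrusted_string)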

@ArrayBolt3

ArrayBolt3 commented Dec 4, 2025

strip-markup reads all of stdin into memory without any size limit before processing, so an attacker who can supply input can trigger unbounded memory and CPU consumption (denial of service) simply by piping a very large payload. The code path is straight from sys.stdin.read() into the sanitizer with no streaming or truncation.

As explained in the code review above, this isn't a problem.

The sanitizer runs two full HTMLParser passes over whatever string it receives, again without any guardrails on input length or complexity. Combined with the unbounded read above, a crafted oversized or deeply nested input will be parsed twice and buffered in StringIO, compounding the potential DoS impact.

Ditto.

Additional security concerns in strip-markup: Terminal/console escape injection remains possible. The sanitizer removes only markup tags and replaces <, >, and & with underscores when it detects suspicious transformations, but it leaves all other control characters untouched. Because StripMarkupEngine runs with convert_charrefs = True, an attacker can supply inputs containing HTML entities for escape characters (e.g., &#27; for ESC), which will be decoded and written back out unchanged via sys.stdout.write. This can produce ANSI escape sequences in logs or terminals, enabling log forgery or malicious terminal control even though HTML tags were stripped.

This thankfully ended up not actually being an issue. While it is true one can encode ESC characters as HTML entities, Python 3.13's HTMLParser is smart enough to strip those out when used in the way strip_markup uses it. As defense-in-depth, I added an extra stdisplay call to sanitize_string to prevent issues in the event ESC characters do slip through the initial stdisplay and strip_markup calls, so if someone figures out how to encode an ESC some other way, it still won't work.
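
A quick way to confirm that locally is to feed entity-encoded ESC sequences through an HTMLParser configured the same way; the expected result below is an assumption based on CPython's handling of disallowed control-character references, so verify it on the Python version in use:

from html.parser import HTMLParser


class EntityProbe(HTMLParser):
    """
    Collects decoded text the way a convert_charrefs parser sees it.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


probe = EntityProbe()
probe.feed("&#27;[31mlooks alarming&#27;[0m")
probe.close()
print("\x1b" in "".join(probe.chunks))  # expected: False, the ESC references are dropped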

Not safe for contextual HTML output. The “sanitization” only removes tags and (in the fallback path) converts <, >, and & to underscores; it does not escape quotes, backticks, or other characters that are dangerous in attribute, JavaScript, or CSS contexts. If the supposedly “clean” output is later interpolated into HTML attributes or script blocks, an attacker can inject payloads such as javascript:... URLs or string-breaking quotes that lead to XSS despite markup stripping.

sanitize_string is not expected to return strings that are safe to place into HTML attributes or script blocks. It is intended to make sure that text isn't interpreted as HTML if placed into an HTML viewer or similar, and to ensure that ANSI escape sequences are not present in the text. We can't sanitize text in such a way that it's safe for everything, because the unsafe characters for any particular use case may vary widely.
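
For example, text destined for an HTML attribute still needs context-specific escaping at the point of use, which is a separate step from anything sanitize_string provides (illustrative only, using the standard library's html.escape):

import html

clean = 'say "hi" & <wave>'  # whatever the sanitizer returned
attr_safe = html.escape(clean, quote=True)  # quote=True also escapes quotes for attribute contexts
print(f'<input value="{attr_safe}">')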

Enforced a 1 MiB maximum input size for strip-markup arguments and stdin reads, returning a clear error when the limit is exceeded to prevent resource exhaustion.

Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string.

Added control-character filtering after markup stripping to block terminal escape injection while preserving printable content.

Rejected, because this is not strip_markup's job. Instead, added additional filtering to sanitize_string.

Expanded unit tests with helpers to cover failure cases, control-character sanitization, and oversized input handling for both arguments and stdin.

Rejected, because this is not strip_markup's job. sanitize_string already reuses stdisplay's test cases that cover the control character and carriage return scenarios.

Updated control-character filtering to drop carriage returns so terminal overprinting and log spoofing sequences are removed from strip-markup output.

Rejected, because this is not strip_markup's job. stdisplay strips out carriage return characters for sanitize_string.

Added regression coverage ensuring carriage return entities are sanitized during CLI processing.

Rejected, because this is not strip_markup's job. However, having this brought up led to the discovery that we're leaving Python's "universal newlines" feature enabled for all of these utilities, which means \r and \r\n line endings are translated to \n on input just like plain \n. While useful for portability, this leads to inconsistency (cat file | stcat has different output than stcat file if file contains carriage returns), so we should probably fix this.
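
One possible direction for that fix, assuming the goal is to see carriage returns untranslated on stdin (a sketch, not a committed change):

import sys

# newline="" keeps line endings untranslated, so \r reaches the sanitizer
# instead of being folded into \n by universal newlines.
if hasattr(sys.stdin, "reconfigure"):
    sys.stdin.reconfigure(newline="")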

Enforced byte-accurate stdin handling by reading from the binary buffer when available and decoding only after validating the input length against the 1 MiB cap.

Rejected, because processing of large input is a legitimate use case for strip_markup and sanitize_string, and this way of checking input length leads to excessive memory consumption.

Updated test helpers to model buffered stdin streams and added regression coverage for oversized multibyte stdin payloads to ensure byte-length limits cannot be bypassed.

Already implemented the buffered stdin stream modelling a bit ago. Rejected the oversized multibyte stdin coverage for reasons explained above.

No commit made, since all suggestions were rejected. This PR can be closed. (This was still very helpful since the issues mentioned here led to hardening work being done before I got to reviewing this.)
