fix(fetchers): harden youtube transcript handling #110

Merged
chaliy merged 3 commits into main from 2026-05-17-fix-youtube-transcript-handling-vulnerability
May 17, 2026


@chaliy chaliy commented May 17, 2026

Motivation

  • The YouTube timedtext transcript path previously fetched the timedtext XML and built the transcript text without any size limit, then truncated the result with a byte-indexed slice. This allowed excessive memory use and could panic when the cut landed inside a multibyte UTF-8 character, either of which can cause a denial of service.
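
To illustrate the panic described above: Rust's `&s[..n]` panics unless `n` is a character boundary of `s`. The helper below is a hypothetical name, not part of fetchkit. It reports whether a given byte index would trigger that panic, using only the standard `str::is_char_boundary` method.

```rust
/// Hypothetical helper (not from the PR): returns true when `&s[..max_bytes]`
/// would panic, i.e. `max_bytes` is past the end of `s` or falls inside a
/// multibyte UTF-8 sequence.
fn byte_slice_would_panic(s: &str, max_bytes: usize) -> bool {
    // `is_char_boundary` is true for 0, for `s.len()`, and for the first
    // byte of every character; it is false for anything else, including
    // indices beyond the end of the string.
    !s.is_char_boundary(max_bytes)
}
```

For example, in `"aé"` the character `é` occupies bytes 1 and 2, so a byte-indexed cut at 2 lands inside it, while cuts at 1 or 3 are safe.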

Description

  • Pass options.max_body_size into fetch_transcript(...) and bail out when the timedtext XML exceeds the configured cap to avoid unbounded memory use.
  • Introduce MAX_TRANSCRIPT_CHARS and replace unsafe byte-slicing truncation with a safe_truncate_utf8 helper to guarantee UTF-8-safe truncation in format_youtube_response.
  • Keep the transcript parsing logic unchanged, but enforce the size check before joining segments and return None for oversized payloads.
  • Add a regression unit test test_safe_truncate_utf8_multibyte_boundary that validates truncation does not cut inside a multibyte code point.
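
The changes listed above could look roughly like the following. This is a minimal sketch and not the merged code: the diff is not shown here, so the value of `MAX_TRANSCRIPT_CHARS`, the signature of `safe_truncate_utf8`, the reading of the limit as a byte budget, and the `join_segments_capped` helper are all assumptions made for illustration.

```rust
/// Assumed value; the actual constant in the PR is not shown.
const MAX_TRANSCRIPT_CHARS: usize = 50_000;

/// Sketch of a UTF-8-safe truncation: returns the longest prefix of `s`
/// that is at most `max_bytes` long and ends on a character boundary.
fn safe_truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from the requested cut until it sits on a boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Hypothetical stand-in for the size check in `fetch_transcript`:
/// reject an oversized timedtext payload before joining any segments.
fn join_segments_capped(xml: &str, segments: &[&str], max_body_size: usize) -> Option<String> {
    if xml.len() > max_body_size {
        return None; // bail out instead of building an unbounded string
    }
    Some(segments.join(" "))
}

/// Applies the assumed transcript cap when formatting a response.
fn truncate_transcript(transcript: &str) -> &str {
    safe_truncate_utf8(transcript, MAX_TRANSCRIPT_CHARS)
}
```

Backing off to the previous boundary means the result can be a few bytes shorter than the budget, but it never exceeds it and never splits a code point.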

Testing

  • Ran cargo fmt --all to format changes successfully.
  • Ran cargo test -p fetchkit test_format_youtube_response_truncates_long_transcript and it passed.
  • Ran cargo test -p fetchkit test_safe_truncate_utf8_multibyte_boundary and it passed.

Codex Task

@chaliy chaliy merged commit 7cb86ed into main May 17, 2026
11 checks passed
@chaliy chaliy deleted the 2026-05-17-fix-youtube-transcript-handling-vulnerability branch May 17, 2026 17:38