Skip to content

Binary detection samples only the first 1024 bytes; mixed files are corrupted #9

@dolph

Description

@dolph

Summary

File.Read() uses golang.org/x/tools/godoc/util.IsText on the first 1024 bytes to decide whether to treat a file as text:

var buf [1024]byte
n, err := handle.Read(buf[0:])
if err != nil || !util.IsText(buf[0:n]) {
    return ""
}

Many real files have a textual prefix and binary tail (e.g., embedded thumbnails in HTML, certain log formats, __DATA__ sections in Perl scripts). These will be classified as text and the entire file will be passed through strings.Replace, which is byte-safe but treats the binary tail as bytes to potentially clobber if find happens to occur there as a substring.

The dependency golang.org/x/tools/godoc/util is also documented as deprecated/internal-ish (godoc/util is mostly removed/maintained for back-compat), and pulling in golang.org/x/tools for one heuristic is a heavy dependency.

Impact (Reliability: Medium)

  • Risk of corrupting binary content (esp. zip/png/etc embedded in larger files).
  • Heavy dependency for a few hundred lines of trivial logic.
  • err != nil from a 1024-byte read on a smaller file is a false negative for short text files (e.g., a 200-byte text file reads io.EOF and is classified non-text). Wait — Read does return io.EOF only at the end; for a partial read of a short file, n is the size and err is nil. But for an empty file, Read returns 0, io.EOF. So an empty file is currently classified as binary and skipped — also wrong, but harmless.

Suggested Fix

  • Replace util.IsText with a small in-tree NUL-byte / valid-UTF-8 check on the read prefix.
  • If the prefix is text, optionally verify (or rely on the streaming rewriter from Whole file is buffered into memory; large files cause OOM #8) before committing the rewrite.
  • Drop the golang.org/x/tools dependency.

Files

  • file_handling.go:50-74 (Read)
  • go.mod

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions