Detect binary formats by their file signatures #647

Merged
audreyfeldroy merged 5 commits into main from fix-png-misclassification on Mar 8, 2026

Conversation

@audreyfeldroy
Collaborator

Summary

BinaryOrNot now checks the first bytes of each file against 45 known binary format signatures (PNG, JPEG, PDF, ZIP, ELF, Mach-O, and more) before running the statistical decision tree. Files that match a known signature are classified as binary immediately. The signatures come from binary_formats.csv, the same source of truth the training script already uses.
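A minimal sketch of the guard described above. It is an illustration, not the project's code: the CSV column names (`name`, `magic_hex`), the inline sample rows, and the function names `load_signatures` and `has_known_signature` are assumptions, and the real layout of binary_formats.csv may differ.

```python
import csv
import io

# Hypothetical excerpt; the real binary_formats.csv may use other columns.
SAMPLE_CSV = """name,magic_hex
PNG,89504e470d0a1a0a
JPEG,ffd8ff
PDF,25504446
ZIP,504b0304
ELF,7f454c46
"""


def load_signatures(csv_text: str) -> list[bytes]:
    """Parse the hex-encoded magic bytes from each CSV row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [bytes.fromhex(row["magic_hex"]) for row in rows]


# Parsed once at import time, so the per-file check is a prefix comparison.
SIGNATURES = load_signatures(SAMPLE_CSV)


def has_known_signature(chunk: bytes) -> bool:
    """Return True if the chunk starts with any known binary signature."""
    return any(chunk[: len(sig)] == sig for sig in SIGNATURES)
```

A caller would return "binary" as soon as `has_known_signature(chunk)` is true and only fall through to feature extraction otherwise.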

The detector also reads 512 bytes per file instead of 128. At 128 bytes, file header metadata (null bytes in PNG IHDR, length fields in chunk headers) dominated the statistical features and produced misleading signals. With 4x more data, the byte ratios and entropy calculations reflect actual file content rather than header artifacts. The tree was retrained from scratch on 512-byte samples.
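The dilution effect can be illustrated with a synthetic sample. The sketch below builds a real PNG signature and IHDR chunk, then pads with seeded random bytes standing in for compressed image data; the helper names and the random filler are assumptions for the example, and null ratio is used as one representative statistic rather than the detector's exact feature.

```python
import random
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def synthetic_png_prefix(total: int, seed: int = 0) -> bytes:
    """PNG signature + IHDR chunk, then random bytes standing in for IDAT data."""
    # IHDR payload: 512x512, 8-bit depth, color type 4 (grayscale + alpha).
    ihdr = struct.pack(">IIBBBBB", 512, 512, 8, 4, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    header = (
        PNG_SIGNATURE
        + struct.pack(">I", len(ihdr))  # chunk length field
        + chunk
        + struct.pack(">I", zlib.crc32(chunk))  # chunk CRC
    )
    rng = random.Random(seed)
    filler = bytes(rng.getrandbits(8) for _ in range(total - len(header)))
    return header + filler


def null_ratio(chunk: bytes) -> float:
    """Fraction of bytes in the chunk that are 0x00."""
    return chunk.count(0) / len(chunk)


data = synthetic_png_prefix(512)
print(f"null ratio, first 128 bytes: {null_ratio(data[:128]):.3f}")
print(f"null ratio, first 512 bytes: {null_ratio(data[:512]):.3f}")
```

The 33-byte header contributes at least 12 null bytes, so it weighs far more heavily on a 128-byte sample than on a 512-byte one.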

The tree gains a 24th feature, has_magic_signature, which tells it whether the chunk starts with a known binary header. The tree uses this at 8.3% importance, making it the 4th most important feature. This gives the tree a second path to the right answer when statistical features are ambiguous, independent of the pre-tree guard.
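A rough sketch of how such a feature could be appended to an existing feature vector. The signature tuple is an illustrative subset, and `with_signature_feature` is a hypothetical helper; the project's actual feature extraction is not shown in this PR description.

```python
from typing import Sequence

# Illustrative subset; the real table is loaded from binary_formats.csv.
SIGNATURES = (b"\x89PNG\r\n\x1a\n", b"%PDF", b"PK\x03\x04")


def has_magic_signature(chunk: bytes) -> float:
    """1.0 if the chunk starts with a known binary header, else 0.0."""
    return 1.0 if chunk.startswith(SIGNATURES) else 0.0


def with_signature_feature(stat_features: Sequence[float], chunk: bytes) -> list[float]:
    """Append the boolean feature after the existing statistical features."""
    return [*stat_features, has_magic_signature(chunk)]


# With 23 statistical features, the new value lands at index 23.
vector = with_signature_feature([0.0] * 23, b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
print(len(vector), vector[23])
```

Because the feature only encodes "matches some known signature", its meaning stays the same when rows are added to the signature table.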

The root cause of #642: a 512x512 grayscale+alpha PNG has an IHDR chunk with enough null bytes that the first 128 bytes accidentally decode as UTF-16. The tree follows a Shift-JIS branch and concludes "text." All three layers of this PR independently prevent that misclassification.
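To see why a successful decode is weak evidence here, the sketch below decodes the start of a 512x512 PNG header as UTF-16. It uses Python's `utf-16-le` codec for illustration; the detector's actual encoding checks may work differently.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Signature, IHDR length field, chunk type, then width and height (512 x 512).
header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)
    + b"IHDR"
    + struct.pack(">II", 512, 512)
)

# UTF-16 decoding only fails on odd lengths or unpaired surrogates, so these
# binary bytes decode without raising UnicodeDecodeError.
decoded = header.decode("utf-16-le")
print(len(header), "bytes decoded into", len(decoded), "code points")
```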

Closes #642

Test plan

  • Downloaded the reporter's exact PNG from the issue and added it as tests/files/issue-642.png
  • Red/green TDD for each layer:
    • Magic bytes guard: test_binary_png_issue_642 and test_png_signature_with_adversarial_content fail, then pass after adding the guard
    • 512-byte chunks: 4 tests fail (Russian RST, EOT, Big5, Latin), then pass after retraining
    • Magic signature feature: 3 TestFeatureVector tests fail (wrong feature count), then pass after adding the feature at index 23
  • Full suite: 220 passed, 5 xfailed

Files starting with known binary signatures (PNG, JPEG, PDF, ZIP,
etc.) are now identified before the statistical decision tree runs.
The guard loads 45 signatures from binary_formats.csv at import time
and checks the first few bytes of each chunk.

Closes #642, where a PNG file's IHDR chunk bytes happened to decode
as UTF-16, causing the tree to classify it as text. The guard catches
this because the chunk starts with the PNG signature (89 50 4e 47).

Key design decisions:
- Signatures come from binary_formats.csv (single source of truth,
  already used by the training script)
- Guard runs before feature extraction, so it's a fast path that
  skips all the decoding and entropy work
- The reporter's actual PNG is included as a test fixture so this
  specific regression stays caught

The tree now reads and trains on 512-byte chunks instead of 128.
Larger chunks push file-header artifacts (null bytes in PNG IHDR,
structured metadata fields) into a smaller fraction of the sample,
giving statistical features like entropy and encoding validity more
representative data to work with.

Key design decisions:
- Training strategies updated to generate proportionally larger
  samples (text up to 256 chars, CJK up to 200 chars, binary
  strategies up to 512 bytes)
- Chunk size centralized as CHUNK_SIZE constant in the training
  script to prevent 128/512 mismatches
- Tree depth settled at 8 (cross-validation selected), down from
  the previous tree's effective depth, because longer chunks need
  fewer splits to discriminate

The feature vector grows from 23 to 24 features with a new
has_magic_signature boolean (feature index 23). The retrained tree
uses it at 8.3% importance, particularly for high-null-ratio files
where byte statistics alone are ambiguous. This gives the tree a
second line of defense alongside the pre-tree magic bytes guard.

CHUNK_SIZE is now defined once in helpers.py and imported by both
the training script and tests, replacing the separate constant that
the training script maintained.

Ruff formats the slice as `chunk[: len(sig)]`, with a space after `:`.

The signature table grows from 45 to 55 formats. The new entries
reflect that binary-or-text detection matters beyond developer
tooling: Apple binary plists, network captures (PCAP), data
interchange (Arrow IPC, Avro), compression (LZ4, LZMA, Snappy),
archives (ar, CPIO), and security stores (Java KeyStore).

Each format with a stable magic signature includes a minimal test
fixture generated from Python stdlib or raw bytes. Snappy is the
only entry without a fixture (requires a C library to produce a
valid framed stream), but its 10-byte signature is tested via the
existing magic-bytes parametrized test.
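As an illustration of generating fixtures from the stdlib or raw bytes, here is a
minimal sketch for three of the formats named above. The function names and the
fixture file names are hypothetical; the PR's actual fixture generator is not
shown here.

```python
import plistlib
import struct


def make_bplist() -> bytes:
    """Apple binary plist, produced by the stdlib; starts with b'bplist00'."""
    return plistlib.dumps({"key": "value"}, fmt=plistlib.FMT_BINARY)


def make_ar() -> bytes:
    """Empty Unix ar archive: just the 8-byte global header."""
    return b"!<arch>\n"


def make_pcap() -> bytes:
    """Little-endian pcap global header with no packets (24 bytes)."""
    # magic, version 2.4, thiszone, sigfigs, snaplen, linktype (1 = Ethernet)
    return struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)


# Hypothetical fixture names mapped to their generated contents.
FIXTURES = {
    "sample.plist": make_bplist(),
    "sample.a": make_ar(),
    "sample.pcap": make_pcap(),
}
```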

No tree retraining needed. The pre-tree guard and the existing
has_magic_signature feature (index 23) pick up new signatures
automatically at runtime.

@audreyfeldroy audreyfeldroy merged commit a1ff8d0 into main Mar 8, 2026
11 checks passed
@audreyfeldroy audreyfeldroy deleted the fix-png-misclassification branch March 8, 2026 16:13

Development

Successfully merging this pull request may close these issues.

The is_binary() outcome for a PNG file changed with 0.5.0