Detect binary formats by their file signatures #647
Merged
audreyfeldroy merged 5 commits into main on Mar 8, 2026
Files starting with known binary signatures (PNG, JPEG, PDF, ZIP, etc.) are now identified before the statistical decision tree runs. The guard loads 45 signatures from binary_formats.csv at import time and checks the first few bytes of each chunk.

Closes #642, where a PNG file's IHDR chunk bytes happened to decode as UTF-16, causing the tree to classify it as text. The guard catches this because the chunk starts with the PNG signature (89 50 4e 47).

Key design decisions:
- Signatures come from binary_formats.csv (single source of truth, already used by the training script)
- Guard runs before feature extraction, so it's a fast path that skips all the decoding and entropy work
- The reporter's actual PNG is included as a test fixture so this specific regression stays caught
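The guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the CSV column names (`name`, `magic_hex`), the inline sample rows, and the function names `load_signatures`, `starts_with_known_signature`, and `is_binary` are all assumptions made for the example.

```python
import csv
import io
from collections.abc import Callable

# Hypothetical stand-in for binary_formats.csv; the real file's columns may differ.
SAMPLE_CSV = """name,magic_hex
PNG,89504e47
PDF,25504446
ZIP,504b0304
"""


def load_signatures(text: str) -> list[bytes]:
    """Parse hex-encoded magic bytes from CSV text into raw byte strings."""
    reader = csv.DictReader(io.StringIO(text))
    return [bytes.fromhex(row["magic_hex"]) for row in reader]


# Loaded once at import time, mirroring the "load at import" design above.
SIGNATURES = load_signatures(SAMPLE_CSV)


def starts_with_known_signature(chunk: bytes) -> bool:
    """Return True if the chunk begins with any known magic signature."""
    return any(chunk[: len(sig)] == sig for sig in SIGNATURES)


def is_binary(chunk: bytes, fallback: Callable[[bytes], bool]) -> bool:
    """Fast path: a signature match short-circuits the statistical fallback."""
    if starts_with_known_signature(chunk):
        return True
    return fallback(chunk)
```

Because the signature check runs first, the `fallback` callable (standing in for feature extraction plus the decision tree) is never invoked for files that match.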
The tree now reads and trains on 512-byte chunks instead of 128. Larger chunks push file-header artifacts (null bytes in PNG IHDR, structured metadata fields) into a smaller fraction of the sample, giving statistical features like entropy and encoding validity more representative data to work with.

Key design decisions:
- Training strategies updated to generate proportionally larger samples (text up to 256 chars, CJK up to 200 chars, binary strategies up to 512 bytes)
- Chunk size centralized as CHUNK_SIZE constant in the training script to prevent 128/512 mismatches
- Tree depth settled at 8 (cross-validation selected), down from the previous tree's effective depth, because longer chunks need fewer splits to discriminate
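To illustrate why a larger chunk dilutes header artifacts, the sketch below computes a null-byte ratio over the first 128 and the first 512 bytes of a synthetic buffer. The buffer layout (a 32-byte header that is mostly null, followed by non-null content) and the `null_ratio` helper are assumptions for the example; the actual features in the PR are not shown in this conversation.

```python
def null_ratio(chunk: bytes) -> float:
    """Fraction of bytes in the chunk that are 0x00."""
    return chunk.count(0) / len(chunk) if chunk else 0.0


# Synthetic file: 4 signature bytes plus 28 null bytes of "header",
# followed by 1020 bytes of non-null content.
header = bytes.fromhex("89504e47") + b"\x00" * 28
body = bytes(range(1, 256)) * 4
data = header + body

small = data[:128]   # old chunk size
large = data[:512]   # new chunk size

ratio_small = null_ratio(small)  # 28 nulls out of 128 bytes
ratio_large = null_ratio(large)  # the same 28 nulls out of 512 bytes
```

The same 28 header nulls account for about 22% of the 128-byte sample but only about 5% of the 512-byte sample, so the statistic moves closer to what the file body actually looks like.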
The feature vector grows from 23 to 24 features with a new has_magic_signature boolean (feature index 23). The retrained tree uses it at 8.3% importance, particularly for high-null-ratio files where byte statistics alone are ambiguous. This gives the tree a second line of defense alongside the pre-tree magic bytes guard.

CHUNK_SIZE is now defined once in helpers.py and imported by both the training script and tests, replacing the separate constant that the training script maintained.
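As a rough sketch of how a boolean signature feature could be appended at index 23, assuming the other 23 features are already computed as floats. The three-entry `SIGNATURES` tuple and the `extend_features` helper are hypothetical and stand in for whatever the project actually does.

```python
# Hypothetical subset of signatures; the project loads these from binary_formats.csv.
SIGNATURES = (b"\x89PNG", b"%PDF", b"PK\x03\x04")

N_STATISTICAL = 23  # features that existed before this change


def has_magic_signature(chunk: bytes) -> float:
    """1.0 if the chunk starts with a known signature, else 0.0."""
    return 1.0 if any(chunk[: len(sig)] == sig for sig in SIGNATURES) else 0.0


def extend_features(statistical: list[float], chunk: bytes) -> list[float]:
    """Append the signature flag so it lands at feature index 23."""
    if len(statistical) != N_STATISTICAL:
        raise ValueError(f"expected {N_STATISTICAL} features, got {len(statistical)}")
    return statistical + [has_magic_signature(chunk)]
```

Encoding the flag as a float keeps the vector homogeneous, which is what most decision tree implementations expect.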
Ruff requires spaces around `:` in `chunk[: len(sig)]`.
The signature table grows from 45 to 55 formats. The new entries reflect that binary-or-text detection matters beyond developer tooling: Apple binary plists, network captures (PCAP), data interchange (Arrow IPC, Avro), compression (LZ4, LZMA, Snappy), archives (ar, CPIO), and security stores (Java KeyStore).

Each format with a stable magic signature includes a minimal test fixture generated from Python stdlib or raw bytes. Snappy is the only entry without a fixture (requires a C library to produce a valid framed stream), but its 10-byte signature is tested via the existing magic-bytes parametrized test.

No tree retraining needed. The pre-tree guard and the existing has_magic_signature feature (index 23) pick up new signatures automatically at runtime.
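A few fixtures of this kind can be produced with the standard library alone. The sketch below is an assumption about how such fixtures might be built, not the PR's fixture code; in particular, the `lzma` module's default output is the xz container, which may or may not be the variant the signature table calls "LZMA".

```python
import lzma
import plistlib


def build_fixtures() -> dict[str, bytes]:
    """Build minimal in-memory samples whose first bytes are well-known magic values."""
    return {
        # Apple binary property list, serialized by the stdlib.
        "bplist": plistlib.dumps({"key": "value"}, fmt=plistlib.FMT_BINARY),
        # xz container (the lzma module's default format) around an empty payload.
        "xz": lzma.compress(b""),
        # Unix ar archive: the global header alone is a valid empty archive.
        "ar": b"!<arch>\n",
    }


# Magic bytes each fixture is expected to start with.
EXPECTED_MAGIC = {
    "bplist": b"bplist00",
    "xz": b"\xfd7zXZ\x00",
    "ar": b"!<arch>\n",
}

fixtures = build_fixtures()
```

Writing each value to a file under the test directory would then give a parametrized test something concrete to open and classify.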
Summary
BinaryOrNot now checks the first bytes of each file against 45 known binary format signatures (PNG, JPEG, PDF, ZIP, ELF, Mach-O, and more) before running the statistical decision tree. Files that match a known signature are classified as binary immediately. The signatures come from `binary_formats.csv`, the same source of truth the training script already uses.

The detector also reads 512 bytes per file instead of 128. At 128 bytes, file header metadata (null bytes in PNG IHDR, length fields in chunk headers) dominated the statistical features and produced misleading signals. With 4x more data, the byte ratios and entropy calculations reflect actual file content rather than header artifacts. The tree was retrained from scratch on 512-byte samples.

The tree gains a 24th feature, `has_magic_signature`, which tells it whether the chunk starts with a known binary header. The tree uses this at 8.3% importance, making it the 4th most important feature. This gives the tree a second path to the right answer when statistical features are ambiguous, independent of the pre-tree guard.

The root cause of #642: a 512x512 grayscale+alpha PNG has an IHDR chunk with enough null bytes that the first 128 bytes accidentally decode as UTF-16. The tree follows a Shift-JIS branch and concludes "text." All three layers of this PR independently prevent that misclassification.
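The UTF-16 accident is easy to reproduce on the opening bytes of a PNG. The sketch below rebuilds the first 24 bytes of a 512x512 PNG (signature, IHDR length, chunk type, width, height) and decodes them as little-endian UTF-16. It is an illustration only: it does not use the reporter's file, and the choice of the `utf-16-le` codec is an assumption about what "decodes as UTF-16" means in the detector.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89504e470d0a1a0a")

# IHDR chunk start: 4-byte big-endian length (13), chunk type, width, height.
ihdr_start = struct.pack(">I4sII", 13, b"IHDR", 512, 512)
header = PNG_SIGNATURE + ihdr_start  # 24 bytes in total

# Every 2-byte unit here falls outside the surrogate range, so a strict
# UTF-16 decode succeeds even though the data is not text.
decoded = header.decode("utf-16-le")

# The magic-bytes check does not depend on decoding at all.
looks_like_png = header[:4] == bytes.fromhex("89504e47")
```

A successful decode is therefore weak evidence of text on its own, which is why checking the signature first is the more direct fix.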
Closes #642
Test plan
- `tests/files/issue-642.png`
- `test_binary_png_issue_642` and `test_png_signature_with_adversarial_content` fail, then pass after guard
- `TestFeatureVector` tests fail (wrong count), then pass after adding feature 23