Detect binary formats by their file signatures by audreyfeldroy · Pull Request #647 · binaryornot/binaryornot

audreyfeldroy · 2026-03-08T15:30:23Z

Summary

BinaryOrNot now checks the first bytes of each file against 45 known binary format signatures (PNG, JPEG, PDF, ZIP, ELF, Mach-O, and more) before running the statistical decision tree. Files that match a known signature are classified as binary immediately. The signatures come from binary_formats.csv, the same source of truth the training script already uses.
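A minimal sketch of the pre-tree guard described above. The signature table is a hand-picked, hypothetical subset (the real detector loads its table from binary_formats.csv), and the function name `starts_with_known_signature` is an assumption for illustration, not the library's API.

```python
# Hypothetical subset of signatures; the real table comes from binary_formats.csv.
SIGNATURES = (
    b"\x89PNG",        # PNG (89 50 4e 47)
    b"\xff\xd8\xff",   # JPEG
    b"%PDF",           # PDF
    b"PK\x03\x04",     # ZIP local file header
    b"\x7fELF",        # ELF
)


def starts_with_known_signature(chunk: bytes) -> bool:
    """Return True if the chunk begins with any signature in the table."""
    return any(chunk[: len(sig)] == sig for sig in SIGNATURES)


print(starts_with_known_signature(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # True
print(starts_with_known_signature(b"Hello, world\n"))                      # False
```

Because the check only compares a handful of leading bytes, it can run before any decoding or entropy work.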

The detector also reads 512 bytes per file instead of 128. At 128 bytes, file header metadata (null bytes in PNG IHDR, length fields in chunk headers) dominated the statistical features and produced misleading signals. With 4x more data, the byte ratios and entropy calculations reflect actual file content rather than header artifacts. The tree was retrained from scratch on 512-byte samples.
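To illustrate why a longer sample dilutes header artifacts, here is a small sketch on synthetic data. The helper names `null_ratio` and `shannon_entropy` are assumptions for illustration and are not taken from the BinaryOrNot source; the sample bytes are made up.

```python
import math
from collections import Counter


def null_ratio(chunk: bytes) -> float:
    """Fraction of bytes in the chunk that are 0x00."""
    return chunk.count(0) / len(chunk) if chunk else 0.0


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


# Synthetic file: a 32-byte null-heavy "header" followed by null-free content.
sample = b"\x00" * 24 + b"\x01" * 8 + b"A" * 480

print(null_ratio(sample[:128]))  # 0.1875
print(null_ratio(sample[:512]))  # 0.046875
```

The same 24 null bytes make up almost 19% of a 128-byte sample but under 5% of a 512-byte one, so the statistic moves toward what the body of the file looks like.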

The tree gains a 24th feature, has_magic_signature, which tells it whether the chunk starts with a known binary header. The tree uses this at 8.3% importance, making it the 4th most important feature. This gives the tree a second path to the right answer when statistical features are ambiguous, independent of the pre-tree guard.

The root cause of #642: a 512x512 grayscale+alpha PNG has an IHDR chunk with enough null bytes that the first 128 bytes accidentally decode as UTF-16. The tree follows a Shift-JIS branch and concludes "text." All three layers of this PR independently prevent that misclassification.
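For a sense of where those null bytes come from, the following sketch builds the signature and IHDR chunk of a 512x512 grayscale+alpha PNG according to the PNG chunk layout. The 8-bit depth is an assumption (the issue does not state it here), and this does not reproduce the misclassification itself; it only shows how null-heavy the leading bytes are.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def ihdr_chunk(width: int, height: int, bit_depth: int, color_type: int) -> bytes:
    """Build an IHDR chunk: length, type, 13 data bytes, CRC-32."""
    # Compression, filter, and interlace methods are all 0 here.
    data = struct.pack(">IIBBBBB", width, height, bit_depth, color_type, 0, 0, 0)
    crc = zlib.crc32(b"IHDR" + data)
    return struct.pack(">I", len(data)) + b"IHDR" + data + struct.pack(">I", crc)


# Color type 4 is grayscale with alpha; bit depth 8 is assumed for illustration.
header = PNG_SIGNATURE + ihdr_chunk(512, 512, 8, 4)

before_crc = header[:29]      # signature + length + type + 13 data bytes
print(before_crc.hex(" "))
print(before_crc.count(0))    # 12
```

Twelve of the first 29 bytes are null, which is the kind of pattern that can resemble a wide-character encoding when only a short prefix is examined.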

Closes #642

Test plan

- Downloaded the reporter's exact PNG from the issue and added it as tests/files/issue-642.png
- Red/green TDD for each layer:
  - Magic bytes guard: test_binary_png_issue_642 and test_png_signature_with_adversarial_content fail, then pass after guard
  - 512-byte chunks: 4 tests fail (Russian RST, EOT, Big5, Latin), then pass after retraining
  - Magic signature feature: 3 TestFeatureVector tests fail (wrong count), then pass after adding feature 23
- Full suite: 220 passed, 5 xfailed

Files starting with known binary signatures (PNG, JPEG, PDF, ZIP, etc.) are now identified before the statistical decision tree runs. The guard loads 45 signatures from binary_formats.csv at import time and checks the first few bytes of each chunk.

Closes #642, where a PNG file's IHDR chunk bytes happened to decode as UTF-16, causing the tree to classify it as text. The guard catches this because the chunk starts with the PNG signature (89 50 4e 47).

Key design decisions:
- Signatures come from binary_formats.csv (single source of truth, already used by the training script)
- Guard runs before feature extraction, so it's a fast path that skips all the decoding and entropy work
- The reporter's actual PNG is included as a test fixture so this specific regression stays caught
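A sketch of loading signatures from a CSV once, at import time. The column names `name` and `signature_hex`, the inline CSV text, and the `load_signatures` helper are all assumptions for illustration; the actual layout of binary_formats.csv may differ.

```python
import csv
import io

# Hypothetical stand-in for binary_formats.csv; real column names may differ.
CSV_TEXT = """name,signature_hex
PNG,89504e47
PDF,25504446
ZIP,504b0304
"""


def load_signatures(csv_file) -> list[bytes]:
    """Parse hex-encoded signatures from a CSV file object into bytes."""
    reader = csv.DictReader(csv_file)
    return [
        bytes.fromhex(row["signature_hex"])
        for row in reader
        if row["signature_hex"]  # skip formats without a stable signature
    ]


# Evaluated once when the module is imported, then reused for every file.
SIGNATURES = load_signatures(io.StringIO(CSV_TEXT))
print(SIGNATURES)
```

Keeping the table in one CSV means the runtime guard and the training script cannot drift apart on which formats count as known.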

The tree now reads and trains on 512-byte chunks instead of 128. Larger chunks push file-header artifacts (null bytes in PNG IHDR, structured metadata fields) into a smaller fraction of the sample, giving statistical features like entropy and encoding validity more representative data to work with.

Key design decisions:
- Training strategies updated to generate proportionally larger samples (text up to 256 chars, CJK up to 200 chars, binary strategies up to 512 bytes)
- Chunk size centralized as CHUNK_SIZE constant in the training script to prevent 128/512 mismatches
- Tree depth settled at 8 (cross-validation selected), down from the previous tree's effective depth, because longer chunks need fewer splits to discriminate

The feature vector grows from 23 to 24 features with a new has_magic_signature boolean (feature index 23). The retrained tree uses it at 8.3% importance, particularly for high-null-ratio files where byte statistics alone are ambiguous. This gives the tree a second line of defense alongside the pre-tree magic bytes guard.

CHUNK_SIZE is now defined once in helpers.py and imported by both the training script and tests, replacing the separate constant that the training script maintained.
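A sketch of how a boolean signature feature could be appended to an existing statistical feature vector. The function `with_magic_feature` and the placeholder statistics are hypothetical; only the feature count (24) and the index (23) come from the description above.

```python
MAGIC_FEATURE_INDEX = 23  # the 24th feature, per the description above


def with_magic_feature(
    statistical_features: list[float],
    chunk: bytes,
    signatures: tuple[bytes, ...],
) -> list[float]:
    """Append has_magic_signature (1.0 or 0.0) to the statistical features."""
    has_magic = any(chunk[: len(sig)] == sig for sig in signatures)
    return statistical_features + [1.0 if has_magic else 0.0]


# Placeholder values standing in for the 23 statistical features.
stats = [0.0] * 23
vector = with_magic_feature(stats, b"\x89PNG\r\n\x1a\n", (b"\x89PNG", b"%PDF"))

print(len(vector))                  # 24
print(vector[MAGIC_FEATURE_INDEX])  # 1.0
```

Computing the feature from the same signature table as the guard is what lets new signatures reach the tree without changing the vector's shape.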

The signature table grows from 45 to 55 formats. The new entries reflect that binary-or-text detection matters beyond developer tooling: Apple binary plists, network captures (PCAP), data interchange (Arrow IPC, Avro), compression (LZ4, LZMA, Snappy), archives (ar, CPIO), and security stores (Java KeyStore). Each format with a stable magic signature includes a minimal test fixture generated from Python stdlib or raw bytes. Snappy is the only entry without a fixture (requires a C library to produce a valid framed stream), but its 10-byte signature is tested via the existing magic-bytes parametrized test. No tree retraining needed. The pre-tree guard and the existing has_magic_signature feature (index 23) pick up new signatures automatically at runtime.
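As an illustration of generating minimal fixtures from the Python stdlib or raw bytes, the sketch below produces a binary plist with `plistlib` and an empty ar archive from its global header. The helper names are hypothetical, and the PR's actual fixture scripts may be built differently.

```python
import plistlib


def make_bplist_fixture() -> bytes:
    """Serialize a tiny dict as an Apple binary property list."""
    return plistlib.dumps({"example": True}, fmt=plistlib.FMT_BINARY)


def make_ar_fixture() -> bytes:
    """An ar archive with no members is just its 8-byte global header."""
    return b"!<arch>\n"


bplist = make_bplist_fixture()
print(bplist[:8])          # b'bplist00'
print(make_ar_fixture())   # b'!<arch>\n'
```

Each fixture can then be written to the test files directory and checked against the corresponding entry in the signature table.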

audreyfeldroy added 5 commits March 8, 2026 23:09

Apply ruff formatting to slice expression

2a31e62

Ruff requires spaces around `:` in `chunk[: len(sig)]`.

audreyfeldroy merged commit a1ff8d0 into main Mar 8, 2026
11 checks passed

audreyfeldroy deleted the fix-png-misclassification branch March 8, 2026 16:13
