
junk-detector-v6 #2818

Merged
tballison merged 10 commits into main from junk-detector-v6 on May 14, 2026

Conversation

@tballison
Contributor

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and is placed in front of your commit messages, surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running ./mvnw clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add a new module that downstream users will depend upon, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

tballison and others added 10 commits May 13, 2026 11:59
Five read-only tools that report training-corpus statistics used to
inform per-script F1 sizing decisions.  None of these are wired into
the main trainer or model output; they're invoked manually.

* CountPerScriptBigrams - distinct (cpA,cpB) pair counts per script,
  with coverage curves and per-cutoff model-size estimates for several
  candidate storage schemes (MPHF+val, MPHF+fp+val, open-addressing).

* AnalyzeHanByBlock - bucket HAN bigrams by the Unicode block of each
  codepoint, with ASCII split into digit/letter/punct.  Surfaces the
  CJK Unified / Hiragana / Katakana / ASCII composition of the HAN
  pool.

* ScriptCensus - per-line dominant-script histogram for one or more
  text files (gz or plain).  Used to verify how BuildJunkTrainingData
  routes mixed-script languages like Japanese (a minimal sketch of the
  per-line classification appears after this list).

* LineScriptFractions - for each *.train.gz, histogram of the per-line
  target-script-fraction, with cumulative drop percentages at
  thresholds 10/20/30/50/70/90/100.  Identifies scripts whose corpora
  are mostly off-target (e.g. GOTHIC: 40% of lines are <5% Gothic).

* BoundaryBigramAudit - classify every bigram in *.train.gz as
  in-script / script-boundary / foreign-interior / pure-Latin-letter-
  run, and report distinct-pair drop counts under two candidate filter
  rules.

All five build under existing checkstyle; no test fixtures added.
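
For context, the per-line dominant-script classification these tools
rely on can be approximated with java.lang.Character.UnicodeScript.
The sketch below is illustrative only, not the committed ScriptCensus
code; the class and method names are invented here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: return the script that accounts for the most
// codepoints on a line, ignoring COMMON/INHERITED/UNKNOWN so that
// punctuation and digits do not dominate the tally.
final class DominantScriptSketch {
    static UnicodeScript dominantScript(String line) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        line.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED
                    && s != UnicodeScript.UNKNOWN) {
                counts.merge(s, 1, Integer::sum);
            }
        });
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(UnicodeScript.UNKNOWN);
    }
}
```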

Co-authored-by: Cursor <cursoragent@cursor.com>
A new optional flag prunes, from the codepoint-bigram hash table and
Bloom filter, any F1 bigram whose global per-pair count is below the
threshold.  Unigram counts and backoff are unaffected.

When the flag is omitted (or set to 1), behavior is byte-identical to
the previous code path; the existing 2-arg overload of
trainCodepointHashTables is preserved as a thin wrapper.

When >= 2, the trainer makes a pre-pass over all *.train.gz files to
tally per-pair occurrence counts in a HashMap<Long,long[]>, then in
the main pass only emits bigrams whose tally meets the cutoff.  Pre-
pass memory is bounded by the distinct-pair count (~450K pairs on the
current 34-script madlad corpus, ~50 MB heap).
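
A minimal sketch of what such a pre-pass could look like, assuming a
packed (cpA, cpB) long key; the class and method names here are
hypothetical, not the trainer's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-pass shape: pack each (cpA, cpB) codepoint pair into
// a single long key and tally occurrences; the main pass then emits only
// pairs whose tally meets the cutoff.
final class BigramPrePassSketch {
    private final Map<Long, long[]> pairCounts = new HashMap<>();

    static long packPair(int cpA, int cpB) {
        return ((long) cpA << 32) | (cpB & 0xFFFFFFFFL);
    }

    void tally(int cpA, int cpB) {
        pairCounts.computeIfAbsent(packPair(cpA, cpB), k -> new long[1])[0]++;
    }

    boolean meetsCutoff(int cpA, int cpB, long minBigramCount) {
        long[] c = pairCounts.get(packPair(cpA, cpB));
        return c != null && c[0] >= minBigramCount;
    }
}
```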

Rationale: ablation on the dev split (held-out from training) shows
that min_bigram_count=3 cuts the v6 model from 1456 KB -> 889 KB
(-39%) and macro FPR from 0.018 -> 0.007 (-61%) with macro TPR only
moving 0.890 -> 0.883.  Per-distortion Cohen's d goes up on the
realistic junk modes (byte-shuffle, byte-swap, wrong-codec) and only
down on the synthetic inject distortion, where baseline d ~ 11.86
saturates well past any operating threshold anyway.  See discussion
in 20260514-junk-retrain-v6.md.

The singletons dropped are mostly OCR artifacts, proper nouns, and
typos that inflate the clean-side distribution tail without
contributing real distributional information.

All 24 existing tests pass with the change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces per-tool CLI flags for durable training/build parameters with a
single committed config class.  The CLI surface of the two tools shrinks to
data-dir, output(-dir), and (for BuildJunkTrainingData) --dry-run.  Any
attempt to pass a now-removed flag like --total-budget-bytes or
--min-bigram-count is rejected with a pointer to the config file.

Rationale: we've repeatedly burned cycles asking "wait, which run was
that?" when a model file's identity depended on shell history rather
than tracked source.  With this change every parameter that affects the
model lives in code that's reviewed and grep-able from a commit hash.

The config values pin the current shipping setup: 500 MB total budget
with a 5 MB per-language cap, 5% target-script-fraction line filter,
GOTHIC and THAANA dropped, min_bigram_count = 3, 16 Mbit Bloom.  These
together produce macro Cohen's d = 12.11 / FPR = 0.004 / TPR = 0.894 on
the dev split (vs. honest v6 baseline of 9.81 / 0.017 / 0.865).
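
For illustration, a committed config of this shape might look like the
sketch below; the class and constant names are invented here, and the
values simply restate the ones listed above.

```java
import java.util.Set;

// Hypothetical shape of a committed training config; constant names are
// illustrative, values are the ones described in this commit message.
public final class JunkDetectorTrainingConfigSketch {
    public static final long TOTAL_BUDGET_BYTES = 500L * 1024 * 1024;     // 500 MB corpus budget
    public static final long PER_LANGUAGE_CAP_BYTES = 5L * 1024 * 1024;   // 5 MB per-language cap
    public static final double MIN_TARGET_SCRIPT_FRACTION = 0.05;         // drop lines <5% target script
    public static final Set<String> DROPPED_SCRIPTS = Set.of("GOTHIC", "THAANA");
    public static final int MIN_BIGRAM_COUNT = 3;                         // prune rare F1 pairs
    public static final long BLOOM_FILTER_BITS = 16L * 1024 * 1024;       // 16 Mbit Bloom filter

    private JunkDetectorTrainingConfigSketch() {}
}
```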

The smoke rerun produced a model file whose MD5 matches the prior CLI-
flag-driven v2 model byte-for-byte; the refactor is demonstrably
behavior-preserving.

Format-tied constants (V6_BIGRAM_BUCKETS, V6_FNV_SEED, etc.) stay in
TrainJunkModel — they're part of the v6 binary protocol, not tunable
training choices, and moving them would muddy the distinction.

Test JunkDetectorTrainingConfigTest pins the current values so any
future change has to land alongside an explicit assertion update.
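
A pinning test along those lines might look like the following sketch;
it asserts against the hypothetical config sketch above rather than the
real JunkDetectorTrainingConfig, and the committed test's assertions may
differ.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Set;
import org.junit.jupiter.api.Test;

// Hypothetical pinning test: changing any pinned value forces a matching,
// reviewable assertion update in the same commit.
class ConfigPinningSketchTest {
    @Test
    void pinnedValuesMatchShippingModel() {
        assertEquals(3, JunkDetectorTrainingConfigSketch.MIN_BIGRAM_COUNT);
        assertEquals(0.05, JunkDetectorTrainingConfigSketch.MIN_TARGET_SCRIPT_FRACTION, 0.0);
        assertEquals(5L * 1024 * 1024, JunkDetectorTrainingConfigSketch.PER_LANGUAGE_CAP_BYTES);
        assertEquals(Set.of("GOTHIC", "THAANA"), JunkDetectorTrainingConfigSketch.DROPPED_SCRIPTS);
    }
}
```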

29 tests pass (24 previous + 5 new).

Co-authored-by: Cursor <cursoragent@cursor.com>
tballison merged commit 465bc76 into main on May 14, 2026
5 checks passed
tballison deleted the junk-detector-v6 branch on May 14, 2026 at 21:18