
junk-detector-v6 #2818

Merged
tballison merged 10 commits into main from junk-detector-v6 on May 14, 2026

Conversation

@tballison
Contributor

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and is placed in front of your commit messages, surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running ./mvnw clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add a new module that downstream users will depend upon, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

tballison and others added 10 commits May 13, 2026 11:59
Five read-only tools that report training-corpus statistics used to
inform per-script F1 sizing decisions.  None of these are wired into
the main trainer or model output; they're invoked manually.

* CountPerScriptBigrams - distinct (cpA,cpB) pair counts per script,
  with coverage curves and per-cutoff model-size estimates for several
  candidate storage schemes (MPHF+val, MPHF+fp+val, open-addressing).

* AnalyzeHanByBlock - bucket HAN bigrams by the Unicode block of each
  codepoint, with ASCII split into digit/letter/punct.  Surfaces the
  CJK Unified / Hiragana / Katakana / ASCII composition of the HAN
  pool.

* ScriptCensus - per-line dominant-script histogram for one or more
  text files (gz or plain).  Used to verify how BuildJunkTrainingData
  routes mixed-script languages like Japanese (a minimal sketch of the
  per-line classification appears after this list).

* LineScriptFractions - for each *.train.gz, histogram of the per-line
  target-script-fraction, with cumulative drop percentages at
  thresholds 10/20/30/50/70/90/100.  Identifies scripts whose corpora
  are mostly off-target (e.g. GOTHIC: 40% of lines are <5% Gothic).

* BoundaryBigramAudit - classify every bigram in *.train.gz as
  in-script / script-boundary / foreign-interior / pure-Latin-letter-
  run, and report distinct-pair drop counts under two candidate filter
  rules.

All five build under existing checkstyle; no test fixtures added.
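
For context, the per-line dominant-script classification these tools
rely on can be approximated with java.lang.Character.UnicodeScript.
The sketch below is illustrative only, not the committed ScriptCensus
code; the class and method names are invented here.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: return the script that accounts for the most
// codepoints on a line, ignoring COMMON/INHERITED/UNKNOWN so that
// punctuation and digits do not dominate the tally.
final class DominantScriptSketch {
    static UnicodeScript dominantScript(String line) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        line.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED
                    && s != UnicodeScript.UNKNOWN) {
                counts.merge(s, 1, Integer::sum);
            }
        });
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(UnicodeScript.UNKNOWN);
    }
}
```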

Co-authored-by: Cursor <cursoragent@cursor.com>
A new optional flag prunes, from the codepoint-bigram hash table and
Bloom filter, any F1 bigram whose global per-pair count is below the
threshold.  Unigram counts and backoff are unaffected.

When the flag is omitted (or set to 1), behavior is byte-identical to
the previous code path; the existing 2-arg overload of
trainCodepointHashTables is preserved as a thin wrapper.

When >= 2, the trainer makes a pre-pass over all *.train.gz files to
tally per-pair occurrence counts in a HashMap<Long,long[]>, then in
the main pass only emits bigrams whose tally meets the cutoff.  Pre-
pass memory is bounded by the distinct-pair count (~450K pairs on the
current 34-script madlad corpus, ~50 MB heap).
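
A minimal sketch of what such a pre-pass could look like, assuming a
packed (cpA, cpB) long key; the class and method names here are
hypothetical, not the trainer's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-pass shape: pack each (cpA, cpB) codepoint pair into
// a single long key and tally occurrences; the main pass then emits only
// pairs whose tally meets the cutoff.
final class BigramPrePassSketch {
    private final Map<Long, long[]> pairCounts = new HashMap<>();

    static long packPair(int cpA, int cpB) {
        return ((long) cpA << 32) | (cpB & 0xFFFFFFFFL);
    }

    void tally(int cpA, int cpB) {
        pairCounts.computeIfAbsent(packPair(cpA, cpB), k -> new long[1])[0]++;
    }

    boolean meetsCutoff(int cpA, int cpB, long minBigramCount) {
        long[] c = pairCounts.get(packPair(cpA, cpB));
        return c != null && c[0] >= minBigramCount;
    }
}
```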

Rationale: ablation on the dev split (held-out from training) shows
that min_bigram_count=3 cuts the v6 model from 1456 KB -> 889 KB
(-39%) and macro FPR from 0.018 -> 0.007 (-61%) with macro TPR only
moving 0.890 -> 0.883.  Per-distortion Cohen's d goes up on the
realistic junk modes (byte-shuffle, byte-swap, wrong-codec) and only
down on the synthetic inject distortion, where baseline d ~ 11.86
saturates well past any operating threshold anyway.  See discussion
in 20260514-junk-retrain-v6.md.

The singletons dropped are mostly OCR artifacts, proper nouns, and
typos that inflate the clean-side distribution tail without
contributing real distributional information.

All 24 existing tests pass with the change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replaces per-tool CLI flags for durable training/build parameters with a
single committed config class.  The CLI surface of the two tools shrinks to
data-dir, output(-dir), and (for BuildJunkTrainingData) --dry-run.  Any
attempt to pass a now-removed flag like --total-budget-bytes or
--min-bigram-count is rejected with a pointer to the config file.

Rationale: we've repeatedly burned cycles asking "wait, which run was
that?" when a model file's identity depended on shell history rather
than tracked source.  With this change every parameter that affects the
model lives in code that's reviewed and grep-able from a commit hash.

The config values pin the current shipping setup: 500 MB total budget
with a 5 MB per-language cap, 5% target-script-fraction line filter,
GOTHIC and THAANA dropped, min_bigram_count = 3, 16 Mbit Bloom.  These
together produce macro Cohen's d = 12.11 / FPR = 0.004 / TPR = 0.894 on
the dev split (vs. honest v6 baseline of 9.81 / 0.017 / 0.865).
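
For illustration, a committed config of this shape might look like the
sketch below; the class and constant names are invented here, and the
values simply restate the ones listed above.

```java
import java.util.Set;

// Hypothetical shape of a committed training config; constant names are
// illustrative, values are the ones described in this commit message.
public final class JunkDetectorTrainingConfigSketch {
    public static final long TOTAL_BUDGET_BYTES = 500L * 1024 * 1024;     // 500 MB corpus budget
    public static final long PER_LANGUAGE_CAP_BYTES = 5L * 1024 * 1024;   // 5 MB per-language cap
    public static final double MIN_TARGET_SCRIPT_FRACTION = 0.05;         // drop lines <5% target script
    public static final Set<String> DROPPED_SCRIPTS = Set.of("GOTHIC", "THAANA");
    public static final int MIN_BIGRAM_COUNT = 3;                         // prune rare F1 pairs
    public static final long BLOOM_FILTER_BITS = 16L * 1024 * 1024;       // 16 Mbit Bloom filter

    private JunkDetectorTrainingConfigSketch() {}
}
```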

The smoke rerun produced a model file whose MD5 matches the prior CLI-
flag-driven v2 model byte-for-byte; the refactor is demonstrably
behavior-preserving.

Format-tied constants (V6_BIGRAM_BUCKETS, V6_FNV_SEED, etc.) stay in
TrainJunkModel — they're part of the v6 binary protocol, not tunable
training choices, and moving them would muddy the distinction.

Test JunkDetectorTrainingConfigTest pins the current values so any
future change has to land alongside an explicit assertion update.
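
A pinning test along those lines might look like the following sketch;
it asserts against the hypothetical config sketch above rather than the
real JunkDetectorTrainingConfig, and the committed test's assertions may
differ.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Set;
import org.junit.jupiter.api.Test;

// Hypothetical pinning test: changing any pinned value forces a matching,
// reviewable assertion update in the same commit.
class ConfigPinningSketchTest {
    @Test
    void pinnedValuesMatchShippingModel() {
        assertEquals(3, JunkDetectorTrainingConfigSketch.MIN_BIGRAM_COUNT);
        assertEquals(0.05, JunkDetectorTrainingConfigSketch.MIN_TARGET_SCRIPT_FRACTION, 0.0);
        assertEquals(5L * 1024 * 1024, JunkDetectorTrainingConfigSketch.PER_LANGUAGE_CAP_BYTES);
        assertEquals(Set.of("GOTHIC", "THAANA"), JunkDetectorTrainingConfigSketch.DROPPED_SCRIPTS);
    }
}
```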

29 tests pass (24 previous + 5 new).

Co-authored-by: Cursor <cursoragent@cursor.com>
tballison merged commit 465bc76 into main on May 14, 2026
5 checks passed
tballison deleted the junk-detector-v6 branch on May 14, 2026 at 21:18