junk-detector-v6 #2818
Merged
Five read-only tools that report training-corpus statistics used to inform per-script F1 sizing decisions. None of these are wired into the main trainer or model output; they're invoked manually.

* CountPerScriptBigrams - distinct (cpA,cpB) pair counts per script, with coverage curves and per-cutoff model-size estimates for several candidate storage schemes (MPHF+val, MPHF+fp+val, open-addressing).
* AnalyzeHanByBlock - buckets HAN bigrams by the Unicode block of each codepoint, with ASCII split into digit/letter/punct. Surfaces the CJK Unified / Hiragana / Katakana / ASCII composition of the HAN pool.
* ScriptCensus - per-line dominant-script histogram for one or more text files (gz or plain). Used to verify how BuildJunkTrainingData routes mixed-script languages like Japanese. (A minimal sketch of the classification follows after this list.)
* LineScriptFractions - for each *.train.gz, histogram of the per-line target-script fraction, with cumulative drop percentages at thresholds 10/20/30/50/70/90/100. Identifies scripts whose corpora are mostly off-target (e.g. GOTHIC: 40% of lines are <5% Gothic).
* BoundaryBigramAudit - classifies every bigram in *.train.gz as in-script / script-boundary / foreign-interior / pure-Latin-letter-run, and reports distinct-pair drop counts under two candidate filter rules.

All five build under existing checkstyle; no test fixtures added.

Co-authored-by: Cursor <cursoragent@cursor.com>
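A minimal sketch of the per-line dominant-script classification ScriptCensus is described as performing, assuming letters alone determine a line's dominant script; the class name and tie-breaking are illustrative, not the committed tool:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of a per-line dominant-script decision:
// count letters by Unicode script and take the most frequent one.
final class DominantScriptSketch {
    static Character.UnicodeScript dominant(String line) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        line.codePoints()
            .filter(Character::isLetter) // digits/punct/whitespace don't vote
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(Character.UnicodeScript.UNKNOWN);
    }
}
```

A histogram over dominant(line) for every line of a file is then enough to see where a mixed-script corpus like Japanese lands (HAN vs. HIRAGANA vs. KATAKANA), which is what the BuildJunkTrainingData routing check needs.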
New optional flag prunes F1 bigrams whose global per-pair count is below the threshold from the codepoint-bigram hash table and Bloom filter. Unigram counts and backoff are unaffected.

When the flag is omitted (or set to 1), behavior is byte-identical to the previous code path; the existing 2-arg overload of trainCodepointHashTables is preserved as a thin wrapper. When >= 2, the trainer makes a pre-pass over all *.train.gz files to tally per-pair occurrence counts in a HashMap<Long,long[]>, then in the main pass only emits bigrams whose tally meets the cutoff (sketched below). Pre-pass memory is bounded by the distinct-pair count (~450K pairs on the current 34-script madlad corpus, ~50 MB heap).

Rationale: ablation on the dev split (held-out from training) shows that min_bigram_count=3 cuts the v6 model from 1456 KB -> 889 KB (-39%) and macro FPR from 0.018 -> 0.007 (-61%) with macro TPR only moving 0.890 -> 0.883. Per-distortion Cohen's d goes up on the realistic junk modes (byte-shuffle, byte-swap, wrong-codec) and only down on the synthetic inject distortion, where baseline d ~ 11.86 saturates well past any operating threshold anyway. See discussion in 20260514-junk-retrain-v6.md.

The singletons dropped are mostly OCR artifacts, proper nouns, and typos that inflate the clean-side distribution tail without contributing real distributional information.

All 24 existing tests pass with the change.

Co-authored-by: Cursor <cursoragent@cursor.com>
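A minimal sketch of the two-pass shape described above. The long key packing and the class/method names are assumptions for illustration; only the HashMap<Long,long[]> tally and the cutoff semantics come from the change itself:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pre-pass tally and the main-pass cutoff check.
// long[] is a single-element mutable counter, avoiding Long re-boxing per hit.
final class BigramPrePassSketch {
    private final Map<Long, long[]> tally = new HashMap<>();

    // Pack a (cpA, cpB) codepoint pair into one long key.
    private static long key(int cpA, int cpB) {
        return ((long) cpA << 32) | (cpB & 0xFFFFFFFFL);
    }

    // Pre-pass: called for every bigram across all *.train.gz files.
    void count(int cpA, int cpB) {
        tally.computeIfAbsent(key(cpA, cpB), k -> new long[1])[0]++;
    }

    // Main pass: emit the bigram only if its global tally meets the cutoff.
    boolean meetsCutoff(int cpA, int cpB, int minBigramCount) {
        long[] c = tally.get(key(cpA, cpB));
        return c != null && c[0] >= minBigramCount;
    }
}
```

With minBigramCount = 1 every observed pair passes, which matches the byte-identical default; the preserved 2-arg wrapper can simply delegate with that value.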
Replaces per-tool CLI flags for durable training/build parameters with a single committed config class. CLI surface of the two tools shrinks to data-dir, output(-dir), and (for BuildJunkTrainingData) --dry-run. Any attempt to pass a now-removed flag like --total-budget-bytes or --min-bigram-count is rejected with a pointer to the config file.

Rationale: we've repeatedly burned cycles asking "wait, which run was that?" when a model file's identity depended on shell history rather than tracked source. With this change every parameter that affects the model lives in code that's reviewed and grep-able from a commit hash.

The config values pin the current shipping setup: 500 MB total budget with a 5 MB per-language cap, 5% target-script-fraction line filter, GOTHIC and THAANA dropped, min_bigram_count = 3, 16 Mbit Bloom. These together produce macro Cohen's d = 12.11 / FPR = 0.004 / TPR = 0.894 on the dev split (vs. honest v6 baseline of 9.81 / 0.017 / 0.865). The smoke-rerun produced a model file whose MD5 matches the prior CLI-flag-driven v2 model byte-for-byte; the refactor is provably behavior-preserving.

Format-tied constants (V6_BIGRAM_BUCKETS, V6_FNV_SEED, etc.) stay in TrainJunkModel; they're part of the v6 binary protocol, not tunable training choices, and moving them would muddy the distinction.

Test JunkDetectorTrainingConfigTest pins the current values so any future change has to land alongside an explicit assertion update (a sketch of the pattern follows below). 29 tests pass (24 previous + 5 new).

Co-authored-by: Cursor <cursoragent@cursor.com>
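A sketch of the committed-config-plus-pinning-test pattern. The values are the ones quoted above; the field names, class layout, and test body are assumptions, not the committed source:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical shape of the committed config; the real
// JunkDetectorTrainingConfig may name and group these differently.
final class JunkDetectorTrainingConfig {
    static final long TOTAL_BUDGET_BYTES = 500L * 1024 * 1024;   // 500 MB corpus budget
    static final long PER_LANGUAGE_CAP_BYTES = 5L * 1024 * 1024; // 5 MB per language
    static final double MIN_TARGET_SCRIPT_FRACTION = 0.05;       // 5% line filter
    static final int MIN_BIGRAM_COUNT = 3;                       // F1 singleton prune
    static final long BLOOM_BITS = 16L * 1024 * 1024;            // 16 Mbit Bloom filter
}

// Pinning test: changing any shipping parameter forces an explicit,
// reviewable edit here as well as in the config class.
class JunkDetectorTrainingConfigPinSketchTest {
    @Test
    void pinsShippingValues() {
        assertEquals(3, JunkDetectorTrainingConfig.MIN_BIGRAM_COUNT);
        assertEquals(16L * 1024 * 1024, JunkDetectorTrainingConfig.BLOOM_BITS);
        assertEquals(0.05, JunkDetectorTrainingConfig.MIN_TARGET_SCRIPT_FRACTION, 0.0);
    }
}
```

The point of the pinning test is not to re-verify arithmetic but to make silent drift impossible: any parameter change shows up as a two-file diff that a reviewer has to approve.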
Thanks for your contribution to Apache Tika! Your help is appreciated!
Before opening the pull request, please verify that
* the issue ID is referenced in the pull request title (e.g. [TIKA-XXXX] Issue or pull request title)
* ./mvnw clean test passes on the pull request branch against the main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
* new dependencies are added to tika-bom/pom.xml

We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!