From 9f3e79b770859de51a87bc22ecfd3c10cca328c2 Mon Sep 17 00:00:00 2001
From: tallison
Date: Thu, 23 Apr 2026 20:51:01 -0400
Subject: [PATCH] improve legacy charset detector to benefit from features of StandardHtmlEncodingDetector

---
 .../advanced/charset-detection-design.adoc | 19 +-
 .../configuration/encoding-detectors.adoc | 20 +-
 .../parser/html/HtmlEncodingDetector.java | 8 +-
 .../parser/html/TikaHtmlCharsetAliases.java | 172 ++++++++++++++++++
 .../html/charsetdetector/PreScanner.java | 17 --
 .../StandardHtmlEncodingDetector.java | 66 +------
 .../tika/config/TikaEncodingDetectorTest.java | 43 +----
 ...273-exclude-encoding-detector-default.json | 2 +-
 .../tika-config-html-standalone-bom.json | 9 -
 .../parser/html/HtmlEncodingDetectorTest.java | 58 ++++++
 .../StandardHtmlEncodingDetectorTest.java | 30 +--
 11 files changed, 274 insertions(+), 170 deletions(-)
 create mode 100644 tika-encoding-detectors/tika-encoding-detector-html/src/main/java/org/apache/tika/parser/html/TikaHtmlCharsetAliases.java
 delete mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/resources/configs/tika-config-html-standalone-bom.json

diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index 64e2bd4c66a..2df8226786c 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -60,12 +60,13 @@ every detector runs regardless of what the others returned, and the
   (≤ 50 bytes).
 
 | 4
-| `StandardHtmlEncodingDetector`
+| `HtmlEncodingDetector`
 | `tika-encoding-detector-html`
-| Scans HTML `` / `` tags.
-  Returns a DECLARATIVE result. Skips BOM detection by default
-  (`skipBOM=true`) so that `BOMDetector` owns that signal; set `skipBOM=false`
-  for standalone use without `BOMDetector`.
+| Scans HTML `` / `` tags with a
+  fast lenient regex matcher. Returns a DECLARATIVE result. Applies a
+  curated subset of WHATWG label aliases (see <>).
+  An alternative, spec-strict implementation — `StandardHtmlEncodingDetector`
+  — is available opt-in for users who need the full WHATWG prescan algorithm.
 
 | 5
 | `CharSoupEncodingDetector`
@@ -503,10 +504,10 @@ Reads the first 4 bytes and detects:
 | `FE FF` | UTF-16-BE
 |===
 
-Returns a DECLARATIVE result. `StandardHtmlEncodingDetector` skips BOM
-detection by default (`skipBOM=true`) so that `BOMDetector` is the sole source
-of BOM evidence. This separation allows `CharSoupEncodingDetector` to
-arbitrate when a BOM and a `` tag disagree.
+Returns a DECLARATIVE result. The HTML detectors do not handle BOMs on their
+own: `BOMDetector` is the sole source of BOM evidence, which lets
+`CharSoupEncodingDetector` arbitrate when a BOM and a `` tag
+disagree.
 
 == Performance and accuracy
 
diff --git a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
index 25999b3acdc..80105a77edc 100644
--- a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
+++ b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
@@ -40,9 +40,9 @@ The default chain when `tika-charset-detectors-core` is on the classpath:
 |A UTF-8, UTF-16 LE/BE, or UTF-32 LE/BE byte-order mark is present.
 
 |3
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
 |An HTML `` or `Content-Type` http-equiv tag is found
-(WHATWG spec prescan algorithm).
+(fast lenient regex matcher, curated WHATWG label aliases).
 
 |4
 |`ml-encoding-detector`
@@ -100,9 +100,10 @@ referenced by name in JSON configuration.
 |`tika-charset-detectors-core`
 |Byte-order mark detection (UTF-8/16/32). In the default chain.
 
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
 |`tika-charset-detectors-core`
-|WHATWG-spec HTML charset prescan. In the default chain.
+|Fast lenient regex matcher for `` / `http-equiv` tags, with a
+curated subset of WHATWG label aliases. In the default chain.
 
 |`ml-encoding-detector`
 |`tika-charset-detectors-core`
@@ -114,10 +115,11 @@ In the default chain.
 |State-machine structural prober; wraps the `com.github.albfernandez:juniversalchardet`
 fork. Auto-registers when the module jar is on the classpath.
 
-|`html-encoding-detector`
+|`standard-html-encoding-detector`
 |`tika-charset-detectors-core`
-|Older regex-based HTML meta-charset detector. Not in the default chain
-(use `standard-html-encoding-detector` instead).
+|Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in
+explicitly if you need strict WHATWG tokenisation (e.g. ignoring charset
+declarations inside comments or other contexts the lenient regex may match).
 
 |`icu4j-encoding-detector`
 |`tika-charset-detectors-icu4j`
@@ -159,7 +161,7 @@ statistical chain:
   "encoding-detectors": [
     {"http-header-encoding-detector": {}},
     {"bom-encoding-detector": {}},
-    {"standard-html-encoding-detector": {}},
+    {"html-encoding-detector": {}},
     {"ml-encoding-detector": {}}
   ]
 }
@@ -177,7 +179,7 @@ large `