diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index 64e2bd4c66a..2df8226786c 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -60,12 +60,13 @@ every detector runs regardless of what the others returned, and the
  (≤ 50 bytes).
 
 | 4
-| `StandardHtmlEncodingDetector`
+| `HtmlEncodingDetector`
 | `tika-encoding-detector-html`
-| Scans HTML `` / `` tags.
- Returns a DECLARATIVE result. Skips BOM detection by default
- (`skipBOM=true`) so that `BOMDetector` owns that signal; set `skipBOM=false`
- for standalone use without `BOMDetector`.
+| Scans HTML `` / `` tags with a
+ fast lenient regex matcher. Returns a DECLARATIVE result. Applies a
+ curated subset of WHATWG label aliases (see <>).
+ An alternative, spec-strict implementation — `StandardHtmlEncodingDetector`
+ — is available opt-in for users who need the full WHATWG prescan algorithm.
 
 | 5
 | `CharSoupEncodingDetector`
@@ -503,10 +504,10 @@ Reads the first 4 bytes and detects:
 | `FE FF` | UTF-16-BE
 |===
 
-Returns a DECLARATIVE result. `StandardHtmlEncodingDetector` skips BOM
-detection by default (`skipBOM=true`) so that `BOMDetector` is the sole source
-of BOM evidence. This separation allows `CharSoupEncodingDetector` to
-arbitrate when a BOM and a `` tag disagree.
+Returns a DECLARATIVE result. The HTML detectors do not handle BOMs on their
+own: `BOMDetector` is the sole source of BOM evidence, which lets
+`CharSoupEncodingDetector` arbitrate when a BOM and a `` tag
+disagree.
 
 == Performance and accuracy
 
diff --git a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
index 25999b3acdc..80105a77edc 100644
--- a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
+++ b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
@@ -40,9 +40,9 @@ The default chain when `tika-charset-detectors-core` is on the classpath:
 |A UTF-8, UTF-16 LE/BE, or UTF-32 LE/BE byte-order mark is present.
 
 |3
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
 |An HTML `` or `Content-Type` http-equiv tag is found
-(WHATWG spec prescan algorithm).
+(fast lenient regex matcher, curated WHATWG label aliases).
 
 |4
 |`ml-encoding-detector`
@@ -100,9 +100,10 @@ referenced by name in JSON configuration.
 |`tika-charset-detectors-core`
 |Byte-order mark detection (UTF-8/16/32). In the default chain.
 
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
 |`tika-charset-detectors-core`
-|WHATWG-spec HTML charset prescan. In the default chain.
+|Fast lenient regex matcher for `` / `http-equiv` tags, with a
+curated subset of WHATWG label aliases. In the default chain.
 
 |`ml-encoding-detector`
 |`tika-charset-detectors-core`
@@ -114,10 +115,11 @@ In the default chain.
 |State-machine structural prober; wraps the
 `com.github.albfernandez:juniversalchardet` fork. Auto-registers when the
 module jar is on the classpath.
 
-|`html-encoding-detector`
+|`standard-html-encoding-detector`
 |`tika-charset-detectors-core`
-|Older regex-based HTML meta-charset detector. Not in the default chain
-(use `standard-html-encoding-detector` instead).
+|Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in
+explicitly if you need strict WHATWG tokenisation (e.g. ignoring charset
+declarations inside comments or other contexts the lenient regex may match).
 
 |`icu4j-encoding-detector`
 |`tika-charset-detectors-icu4j`
@@ -159,7 +161,7 @@ statistical chain:
 "encoding-detectors": [
  {"http-header-encoding-detector": {}},
  {"bom-encoding-detector": {}},
- {"standard-html-encoding-detector": {}},
+ {"html-encoding-detector": {}},
  {"ml-encoding-detector": {}}
 ]
 }
@@ -177,7 +179,7 @@ large `