diff --git a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
index 64e2bd4c66a..2df8226786c 100644
--- a/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
+++ b/docs/modules/ROOT/pages/advanced/charset-detection-design.adoc
@@ -60,12 +60,13 @@ every detector runs regardless of what the others returned, and the
(≤ 50 bytes).
| 4
-| `StandardHtmlEncodingDetector`
+| `HtmlEncodingDetector`
| `tika-encoding-detector-html`
-| Scans HTML `<meta charset>` / `<meta http-equiv>` tags.
- Returns a DECLARATIVE result. Skips BOM detection by default
- (`skipBOM=true`) so that `BOMDetector` owns that signal; set `skipBOM=false`
- for standalone use without `BOMDetector`.
+| Scans HTML `<meta charset>` / `<meta http-equiv>` tags with a
+ fast lenient regex matcher. Returns a DECLARATIVE result. Applies a
+ curated subset of WHATWG label aliases (see <>).
+ An alternative, spec-strict implementation, `StandardHtmlEncodingDetector`,
+ is available as an opt-in for users who need the full WHATWG prescan algorithm.
| 5
| `CharSoupEncodingDetector`
@@ -503,10 +504,10 @@ Reads the first 4 bytes and detects:
| `FE FF` | UTF-16-BE
|===
-Returns a DECLARATIVE result. `StandardHtmlEncodingDetector` skips BOM
-detection by default (`skipBOM=true`) so that `BOMDetector` is the sole source
-of BOM evidence. This separation allows `CharSoupEncodingDetector` to
-arbitrate when a BOM and a `<meta charset>` tag disagree.
+Returns a DECLARATIVE result. The HTML detectors do not handle BOMs on their
+own: `BOMDetector` is the sole source of BOM evidence, which lets
+`CharSoupEncodingDetector` arbitrate when a BOM and a `<meta charset>` tag
+disagree.
== Performance and accuracy
diff --git a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
index 25999b3acdc..80105a77edc 100644
--- a/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
+++ b/docs/modules/ROOT/pages/configuration/encoding-detectors.adoc
@@ -40,9 +40,9 @@ The default chain when `tika-charset-detectors-core` is on the classpath:
|A UTF-8, UTF-16 LE/BE, or UTF-32 LE/BE byte-order mark is present.
|3
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
|An HTML `<meta charset>` or `Content-Type` http-equiv tag is found
-(WHATWG spec prescan algorithm).
+(fast lenient regex matcher, curated WHATWG label aliases).
|4
|`ml-encoding-detector`
@@ -100,9 +100,10 @@ referenced by name in JSON configuration.
|`tika-charset-detectors-core`
|Byte-order mark detection (UTF-8/16/32). In the default chain.
-|`standard-html-encoding-detector`
+|`html-encoding-detector`
|`tika-charset-detectors-core`
-|WHATWG-spec HTML charset prescan. In the default chain.
+|Fast lenient regex matcher for `<meta charset>` / `http-equiv` tags, with a
+curated subset of WHATWG label aliases. In the default chain.
|`ml-encoding-detector`
|`tika-charset-detectors-core`
@@ -114,10 +115,11 @@ In the default chain.
|State-machine structural prober; wraps the `com.github.albfernandez:juniversalchardet`
fork. Auto-registers when the module jar is on the classpath.
-|`html-encoding-detector`
+|`standard-html-encoding-detector`
|`tika-charset-detectors-core`
-|Older regex-based HTML meta-charset detector. Not in the default chain
-(use `standard-html-encoding-detector` instead).
+|Spec-strict WHATWG prescan algorithm. Not in the default chain — opt in
+explicitly if you need strict WHATWG tokenisation (e.g. ignoring charset
+declarations inside comments or other contexts the lenient regex may match).
|`icu4j-encoding-detector`
|`tika-charset-detectors-icu4j`
@@ -159,7 +161,7 @@ statistical chain:
"encoding-detectors": [
{"http-header-encoding-detector": {}},
{"bom-encoding-detector": {}},
- {"standard-html-encoding-detector": {}},
+ {"html-encoding-detector": {}},
{"ml-encoding-detector": {}}
]
}
@@ -177,7 +179,7 @@ large `