Skip to content

Commit

Permalink
html: another shot at security doc
Browse files Browse the repository at this point in the history
Be clearer about the operation of the tokenizer and the parser (and
their differences), and be explicit about the need for re-serialization
when they are being used in security contexts.

Change-Id: Ieb8f2a9d4806fb7a8849a15671667396e81c53b9
Reviewed-on: https://go-review.googlesource.com/c/net/+/484795
Auto-Submit: Roland Shoemaker <roland@golang.org>
Reviewed-by: Damien Neil <dneil@google.com>
Run-TryBot: Roland Shoemaker <roland@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
  • Loading branch information
rolandshoemaker authored and gopherbot committed Apr 17, 2023
1 parent 9001ca7 commit eb1572c
Showing 1 changed file with 14 additions and 8 deletions.
22 changes: 14 additions & 8 deletions html/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -99,14 +99,20 @@ Care should be taken when parsing and interpreting HTML, whether full documents
or fragments, within the framework of the HTML specification, especially with
regard to untrusted inputs.
This package provides both a tokenizer and a parser. Only the parser constructs
a DOM according to the HTML specification, resolving malformed and misplaced
tags where appropriate. The tokenizer simply tokenizes the HTML presented to it,
and as such does not resolve issues that may exist in the processed HTML,
producing a literal interpretation of the input.
If your use case requires semantically well-formed HTML, as defined by the
WHATWG specification, the parser should be used rather than the tokenizer.
This package provides both a tokenizer and a parser, which implement the
tokenization, and tokenization and tree construction stages of the WHATWG HTML
parsing specification respectively. While the tokenizer parses and normalizes
individual HTML tokens, only the parser constructs the DOM tree from the
tokenized HTML, as described in the tree construction stage of the
specification, dynamically modifying or extending the docuemnt's DOM tree.
If your use case requires semantically well-formed HTML documents, as defined by
the WHATWG specification, the parser should be used rather than the tokenizer.
In security contexts, if trust decisions are being made using the tokenized or
parsed content, the input must be re-serialized (for instance by using Render or
Token.String) in order for those trust decisions to hold, as the process of
tokenization or parsing may alter the content.
*/
package html // import "golang.org/x/net/html"

Expand Down

0 comments on commit eb1572c

Please sign in to comment.