-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
sirreal
wants to merge
52
commits into
WordPress:trunk
Choose a base branch
from
sirreal:html-api/auto-escape-javascript-json
base: trunk
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+796
−33
Open
Changes from all commits
Commits
Show all changes
52 commits
Select commit
Hold shift + click to select a range
4a8ca98
Auto-escape JavaScript and JSON script tags when necessary
sirreal 4a28ef2
Update ticket number
sirreal 0bef687
Fix those lints
sirreal cdba027
No trailing function commas
sirreal ba54ae4
Remove JSON_THROW_ON_ERROR constant
sirreal a697e9e
fixup! Remove JSON_THROW_ON_ERROR constant
sirreal 2b3d0d0
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal 8246439
Remove JSON +json subtype handling
sirreal 3f7f227
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal 1362451
Add JS and JSON script tag tests
Copilot 8d30680
Fix typo in comment
sirreal 3b5ef4e
Improve comment
sirreal f8cfdf9
Clean up and fix tests
sirreal 501d201
Clean up and improve type/language logic
sirreal bcc02ae
Fix svg SCRIPT tag tests
sirreal ea03441
Add todo on MIME type parsing
sirreal c7d1827
Update since tags 🤞
sirreal a134e82
Make is_{javascript,json}_script_tag methods private
sirreal d4bd4b3
Add language whitespace test
sirreal edae8d5
How'd that extra space get there, remove it!
sirreal 02ca3c0
Name search parts
sirreal d058a78
Clean up escaping tests
sirreal dfb63af
Add ignore and todo tags to new private is_*_script_tag functions
sirreal 11f51c9
Improve example comment
sirreal a3e0e27
Update regex comment with named groups
sirreal 288b952
Add details to "other" failure to escape comment
sirreal a7495dd
Improve consistency of comment quoting and differentiate types of dou…
sirreal d8c320c
Describe JavaScript escaping strategy in detail
sirreal 369eefc
Use the updated_html value in round-trip test
sirreal 55d4e47
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal 2ef0bf0
Fix demonstration comment
sirreal cb6b990
Extract and document escaping functions
sirreal e399bf6
Tweak workaround documentation
sirreal 83c1fab
Add ASCII chart about HTML script tags with original source
sirreal 1872681
Improve linking between escapes
sirreal 83ff62f
Fix comments, typos, lints
sirreal 5979782
fixup! Fix comments, typos, lints
sirreal d469ae4
fixup! fixup! Fix comments, typos, lints
sirreal 402ae9f
Fix \c -> \r (carriage return) typo
sirreal 6b2c9ba
Add note about not parsing MIME types for JS script tags
sirreal eef0ccb
Add todo comment to is_json_script_tag
sirreal 71d2686
Re-order tag name termination chars to match elsewhere
sirreal d4693a2
Fix typo
sirreal eb4e091
Merge branch 'trunk' into html-api/auto-escape-javascript-json
sirreal 4c3b0b2
Update comments on tag prefixes matching search pattern
sirreal bfa60cf
Use content-type identification and escape without PCRE
dmsnell 3252160
fixup! Use content-type identification and escape without PCRE
dmsnell 0b9b2bd
fixup! Use content-type identification and escape without PCRE
dmsnell b2533e1
fixup! Use content-type identification and escape without PCRE
dmsnell ddeb301
fixup! Use content-type identification and escape without PCRE
dmsnell b36fc32
fixup! Use content-type identification and escape without PCRE
dmsnell 1249af0
fixup! Use content-type identification and escape without PCRE
dmsnell File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3811,35 +3811,29 @@ public function set_modifiable_text( string $plaintext_content ): bool { | |
|
|
||
| switch ( $this->get_tag() ) { | ||
| case 'SCRIPT': | ||
| /** | ||
| * This is over-protective, but ensures the update doesn't break | ||
| * the HTML structure of the SCRIPT element. | ||
| * | ||
| * More thorough analysis could track the HTML tokenizer states | ||
| * and to ensure that the SCRIPT element closes at the expected | ||
| * SCRIPT close tag as is done in {@see ::skip_script_data()}. | ||
| * | ||
| * A SCRIPT element could be closed prematurely by contents | ||
| * like `</script>`. A SCRIPT element could be prevented from | ||
| * closing by contents like `<!--<script>`. | ||
| * | ||
| * The following strings are essential for dangerous content, | ||
| * although they are insufficient on their own. This trade-off | ||
| * prevents dangerous scripts from being sent to the browser. | ||
| * It is also unlikely to produce HTML that may confuse more | ||
| * basic HTML tooling. | ||
| */ | ||
| if ( | ||
| false !== stripos( $plaintext_content, '</script' ) || | ||
| false !== stripos( $plaintext_content, '<script' ) | ||
| ) { | ||
| return false; | ||
| $escaped_content = $plaintext_content; | ||
| $script_content_type = $this->get_script_content_type(); | ||
|
|
||
| switch ( $script_content_type ) { | ||
| case 'javascript': | ||
| case 'json': | ||
| $escaped_content = self::escape_javascript_script_contents( $plaintext_content ); | ||
| break; | ||
|
|
||
| default: | ||
| // There is no safe escaping for unrecognized script content types. | ||
| if ( | ||
| false !== stripos( $plaintext_content, '<script' ) || | ||
| false !== stripos( $plaintext_content, '</script' ) | ||
| ) { | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( | ||
| $this->text_starts_at, | ||
| $this->text_length, | ||
| $plaintext_content | ||
| $escaped_content | ||
| ); | ||
|
|
||
| return true; | ||
|
|
@@ -3891,6 +3885,277 @@ static function ( $tag_match ) { | |
| return false; | ||
| } | ||
|
|
||
| /** | ||
| * Returns the content type of the currently-matched SCRIPT tag, if matched and | ||
| * recognized, otherwise returns `null` to indicate an unrecognized content type. | ||
| * | ||
| * Note! This concept is related but distinct from the MIME type of the script. | ||
| * Parsing MUST match the specific algorithm in the HTML specification, which | ||
| * relies on exact string comparison in some cases. | ||
| * | ||
| * Only 'javascript' and 'json' content types are currently recognized. | ||
| * | ||
| * @see https://html.spec.whatwg.org/multipage/scripting.html#prepare-the-script-element | ||
| * | ||
| * @since 7.0.0 | ||
| * | ||
| * @return 'javascript'|'json'|null Type of script element content if matched and recognized. | ||
| */ | ||
| private function get_script_content_type(): ?string { | ||
| // SVG and MathML SCRIPT elements are not recognized. | ||
| if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) { | ||
| return null; | ||
| } | ||
|
|
||
| /* | ||
| * > If any of the following are true: | ||
| * > - el has a type attribute whose value is the empty string; | ||
| * > - el has no type attribute but it has a language attribute and that attribute's | ||
| * > value is the empty string; or | ||
| * > - el has neither a type attribute nor a language attribute, | ||
| * > then let the script block's type string for this script element be "text/javascript". | ||
| */ | ||
| $type = $this->get_attribute( 'type' ); | ||
| $lang = $this->get_attribute( 'language' ); | ||
|
|
||
| if ( true === $type || '' === $type ) { | ||
| return 'javascript'; | ||
| } | ||
|
|
||
| if ( null === $type && ( null === $lang || true === $lang || '' === $lang ) ) { | ||
| return 'javascript'; | ||
| } | ||
|
|
||
| /* | ||
| * > Otherwise, if el has a type attribute, then let the script block's type string be | ||
| * > the value of that attribute with leading and trailing ASCII whitespace stripped. | ||
| * > Otherwise, el has a non-empty language attribute; let the script block's type string | ||
| * > be the concatenation of "text/" and the value of el's language attribute. | ||
| */ | ||
| $type_string = is_string( $type ) ? trim( $type, " \t\f\r\n" ) : "text/{$lang}"; | ||
|
|
||
| // All matches are ASCII case-insensitive; eagerly lower-case for comparison. | ||
| $type_string = strtolower( $type_string ); | ||
|
|
||
| /* | ||
| * > If the script block's type string is a JavaScript MIME type essence match, then | ||
| * > set el's type to "classic". | ||
| * | ||
| * > A string is a JavaScript MIME type essence match if it is an ASCII case-insensitive | ||
| * > match for one of the JavaScript MIME type essence strings. | ||
| * | ||
| * > A JavaScript MIME type is any MIME type whose essence is one of the following: | ||
| * > | ||
| * > - application/ecmascript | ||
| * > - application/javascript | ||
| * > - application/x-ecmascript | ||
| * > - application/x-javascript | ||
| * > - text/ecmascript | ||
| * > - text/javascript | ||
| * > - text/javascript1.0 | ||
| * > - text/javascript1.1 | ||
| * > - text/javascript1.2 | ||
| * > - text/javascript1.3 | ||
| * > - text/javascript1.4 | ||
| * > - text/javascript1.5 | ||
| * > - text/jscript | ||
| * > - text/livescript | ||
| * > - text/x-ecmascript | ||
| * > - text/x-javascript | ||
| * | ||
| * @see https://mimesniff.spec.whatwg.org/#javascript-mime-type-essence-match | ||
| * @see https://mimesniff.spec.whatwg.org/#javascript-mime-type | ||
| */ | ||
| switch ( $type_string ) { | ||
| case 'application/ecmascript': | ||
| case 'application/javascript': | ||
| case 'application/x-ecmascript': | ||
| case 'application/x-javascript': | ||
| case 'text/ecmascript': | ||
| case 'text/javascript': | ||
| case 'text/javascript1.0': | ||
| case 'text/javascript1.1': | ||
| case 'text/javascript1.2': | ||
| case 'text/javascript1.3': | ||
| case 'text/javascript1.4': | ||
| case 'text/javascript1.5': | ||
| case 'text/jscript': | ||
| case 'text/livescript': | ||
| case 'text/x-ecmascript': | ||
| case 'text/x-javascript': | ||
| return 'javascript'; | ||
|
|
||
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for | ||
| * > the string "module", then set el's type to "module". | ||
| * | ||
| * A module is evaluated as JavaScript. | ||
| */ | ||
| case 'module': | ||
| return 'javascript'; | ||
|
|
||
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap". | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules". | ||
| * | ||
| * These conditions indicate JSON content. | ||
| */ | ||
| case 'importmap': | ||
| case 'speculationrules': | ||
| return 'json'; | ||
|
|
||
| /** @todo Rely on a full MIME parser for determining JSON content. */ | ||
| case 'application/json': | ||
| case 'text/json': | ||
| return 'json'; | ||
| } | ||
|
|
||
| /* | ||
| * > Otherwise, return. (No script is executed, and el's type is left as null.) | ||
| */ | ||
| return null; | ||
| } | ||
|
|
||
| /** | ||
| * Escape JavaScript script tag contents. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'll need to document that this is suitable for JavaScript and JSON escaping. |
||
| * | ||
| * Prevent JavaScript text from modifying the HTML structure of a document and | ||
| * ensure that it's contained within its enclosing SCRIPT tag as intended. | ||
| * | ||
| * JavaScript can be safely escaped with a few exceptions. This is achieved by | ||
| * replacing dangerous sequences like "<script" and "</script" with a form | ||
| * using a Unicode escape sequence "<\u0073cript>" and "</\u0073cript>". | ||
| * | ||
| * This text may appear in the JavaScript in limited ways, all of which support | ||
| * the use of Unicode escape sequences on the "s" character. The escaping is safe | ||
| * to perform in all JavaScript and the modified JavaScript maintains identical | ||
| * behavior with a few exceptions: | ||
| * | ||
| * - Comments. | ||
| * - Tagged templates like `String.raw()` that access “raw” strings. | ||
| * - The `source` property of a RegExp object. | ||
| * | ||
| * For example, this input JavaScript: | ||
| * | ||
| * // A comment: "</script>" | ||
| * | ||
| * console.log( String.raw`</script>` ); | ||
| * | ||
| * const regex = /<script>/; | ||
| * console.log( regex.source ); | ||
| * | ||
| * Is transformed to: | ||
| * | ||
| * // A comment: "</\u0073cript>" | ||
| * | ||
| * console.log( String.raw`</\u0073cript>` ); | ||
| * | ||
| * const regex = /<\u0073cript>/; | ||
| * console.log( regex.source ); | ||
| * | ||
| * Note that the RegExp's matching behavior is equivalent, meaning that | ||
| * `regex.test( '<script>' ) === true` in both the unescaped and | ||
| * escaped versions. | ||
| * | ||
| * JavaScript that relies on behavior affected by this escaping must provide | ||
| * safe script contents in order to avoid this escaping. For example, a raw string | ||
| * may be split up to make its contents safe or avoided altogether: | ||
| * | ||
| * console.log( String.raw`</script>` ); // !!UNSAFE!! Will be escaped. | ||
| * console.log( String.raw`</\u0073cript>` ); // "</\u0073cript>" | ||
| * console.log( String.raw`</scr` + String.raw`ipt>` ); // "</script>" | ||
| * console.log( String.raw`</${"script"}>` ); // "</script>" | ||
| * console.log( "\x3C/script>" ); // "</script>" | ||
| * console.log( "<\/script>" ); // "</script>" | ||
| * | ||
| * The following graph is a simplified interpretation of how HTML interprets the contents | ||
| * of a SCRIPT tag and identifies the closing tag. It is useful to understand what text | ||
| * is dangerous inside of a SCRIPT tag and why different approaches to escaping work. | ||
| * | ||
| * Open script | ||
| * │ | ||
| * │ | ||
| * ▼ | ||
| * ╔═════════════════════════════════════════╗ <!--(…)> | ||
| * ║ ║ (all dashes) | ||
| * ║ script ║ ───────────────┐ | ||
| * ║ data ║ │ | ||
| * ┌────────── ║ ║ ◀──────────────┘ | ||
| * │ ╚═════════════════════════════════════════╝ | ||
| * │ │ ▲ ▲ | ||
| * │ │ <!-- │ --> └─────┐ | ||
| * │ ▼ │ │ | ||
| * │ ┌─────────────────────────────────────────┐ │ | ||
| * │ </script† │ escaped │ │ | ||
| * │ └─────────────────────────────────────────┘ │ | ||
| * │ │ ▲ │ │ --> | ||
| * │ │ </script† │ </script† │ <script† │ | ||
| * │ ▼ │ ▼ │ | ||
| * │ ╔══════════════╗ │ ┌───────────┐ │ | ||
| * │ ║ Close script ║ │ │ double │ │ | ||
| * └─────────▶ ║ ║ └────────── │ escaped │ ─┘ | ||
| * ╚══════════════╝ └───────────┘ | ||
| * | ||
| * † = Case insensitive 'script' followed by one of ' \t\f\r\n/>' | ||
| * | ||
| * The original source of this graph is included at the bottom of this file. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Needs to be updated. |
||
| * | ||
| * @see https://html.spec.whatwg.org/#restrictions-for-contents-of-script-elements | ||
| * @see wp_html_api_script_element_escaping_diagram_source() | ||
| * | ||
| * @since 7.0.0 | ||
| * | ||
| * @param string $sourcecode Raw contents intended to be serialized into an HTML SCRIPT element. | ||
| * @return string Escaped form of input contents which will not lead to premature closing of the containing SCRIPT element. | ||
| */ | ||
| public static function escape_javascript_script_contents( string $sourcecode ): string { | ||
| $at = 0; | ||
| $was_at = 0; | ||
| $end = strlen( $sourcecode ); | ||
| $escaped = ''; | ||
|
|
||
| /* | ||
| * Replace all instances of the ASCII case-insensitive match of "<script" | ||
| * and "</script", when followed by whitespace or "/" or ">", by using a | ||
| * character replacement for the "s" (or the "S"). | ||
| */ | ||
| while ( $at < $end ) { | ||
| $tag_at = strpos( $sourcecode, '<', $at ); | ||
| if ( false === $tag_at ) { | ||
| break; | ||
| } | ||
|
|
||
| $tag_name_at = $tag_at + 1; | ||
| $has_closing_slash = $tag_name_at < $end && '/' === $sourcecode[ $tag_name_at ]; | ||
| $tag_name_at += $has_closing_slash ? 1 : 0; | ||
|
|
||
| if ( 0 !== substr_compare( $sourcecode, 'script', $tag_name_at, 6, true ) ) { | ||
| $at = $tag_at + 1; | ||
| continue; | ||
| } | ||
|
|
||
| if ( 1 !== strspn( $sourcecode, " \t\f\r\n/>", $tag_name_at + 6, 1 ) ) { | ||
| $at = $tag_name_at + 5; | ||
| continue; | ||
| } | ||
|
|
||
| $escaped .= substr( $sourcecode, $was_at, $tag_name_at - $was_at ); | ||
| $escaped .= 's' === $sourcecode[ $tag_name_at ] ? '\u0073' : '\u0053'; | ||
| $was_at = $tag_name_at + 1; | ||
| $at = $tag_name_at + 7; | ||
| } | ||
|
|
||
| if ( '' === $escaped ) { | ||
| return $sourcecode; | ||
| } | ||
|
|
||
| if ( $was_at < $end ) { | ||
| $escaped .= substr( $sourcecode, $was_at ); | ||
| } | ||
|
|
||
| return $escaped; | ||
| } | ||
|
|
||
| /** | ||
| * Updates or creates a new attribute on the currently matched tag with the passed value. | ||
| * | ||
|
|
||
30 changes: 30 additions & 0 deletions
30
tests/phpunit/data/html-api/script-element-escaping-diagram.dot
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| digraph { | ||
| rankdir=TB; | ||
|
|
||
| // Entry point | ||
| entry [shape=plaintext label="Open script"]; | ||
| entry -> script_data; | ||
|
|
||
| // Double-circle states arranged more compactly | ||
| data [shape=doublecircle label="Close script"]; | ||
| script_data [shape=doublecircle color=blue label="script\ndata"]; | ||
| script_data_escaped [shape=circle color=orange label="escaped"]; | ||
| script_data_double_escaped [shape=circle color=red label="double\nescaped"]; | ||
|
|
||
| // Group related nodes on same ranks where possible | ||
| {rank=same; script_data script_data_escaped script_data_double_escaped} | ||
|
|
||
| script_data -> script_data [label="<!--(…)>\n(all dashes)"]; | ||
| script_data -> script_data_escaped [label="<!--"]; | ||
| script_data -> data [label="</script†"]; | ||
|
|
||
| script_data_escaped -> script_data [label="-->"]; | ||
| script_data_escaped -> script_data_double_escaped [label="<script†"]; | ||
| script_data_escaped -> data [label="</script†"]; | ||
|
|
||
| script_data_double_escaped -> script_data [label="-->"]; | ||
| script_data_double_escaped -> script_data_escaped [label="</script†"]; | ||
|
|
||
| label="† = Case insensitive 'script' followed by one of ' \\t\\f\\r\\n/>'"; | ||
| labelloc=b; | ||
| } |
13 changes: 13 additions & 0 deletions
13
tests/phpunit/data/html-api/script-element-escaping-diagram.php
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| <?php | ||
|
|
||
| /** | ||
| * This is the original Graphviz source for the SCRIPT tag | ||
| * parsing behavior, used in the documentation for the HTML API. | ||
| * | ||
| * @see WP_HTML_Tag_Processor::escape_javascript_script_contents() | ||
| * | ||
| * @return string | ||
| */ | ||
| function wp_html_api_script_element_escaping_diagram_source() { | ||
| return file_get_contents( __DIR__ . '/script-element-escaping-diagram.dot' ); | ||
| } |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also good to include here initially:
These and other such MIME types would be handled by the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for these “MIME type essence” matches it’s okay to add, but I wanted to somewhat limit how much further we go parsing them until we do actual MIME parsing, otherwise we’ll just end up duplicating a bunch of effort.
perhaps we try and merge #10640 before this one and then we can eliminate the TODO as well as the special cases