This repository was archived by the owner on Apr 10, 2025. It is now read-only.
HTML URL attributes with multi-byte characters may be misinterpreted #425
Closed
Description
What steps will reproduce the problem?
1. Code up a filter that scans href attributes.
2. Send a UTF-8 document through that filter.
What is the expected output? What do you see instead?
Calling DecodedValueOrNull on an href that contains multi-byte characters
returns NULL. I probably expected to see an HTML-escaped single-byte version of
the attribute, but I'm not sure about that :)
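A filter can guard against the NULL return by falling back to the raw, still-escaped attribute text. The sketch below only illustrates that pattern: AttributeView, RewriteInput and ValueForRewrite are hypothetical stand-ins invented for this example, not PageSpeed classes, and the real attribute API may behave differently.

```cpp
#include <string>

// Hypothetical stand-in for an HTML attribute (NOT the real PageSpeed class).
// `decoded` plays the role of a DecodedValueOrNull()-style result and is null
// when decoding failed; `escaped` is the raw text as it appeared in the HTML.
struct AttributeView {
  const char* decoded;
  std::string escaped;
};

// Hypothetical result type: the text to rewrite, plus a flag telling the
// caller whether that text is still HTML-escaped.
struct RewriteInput {
  std::string text;
  bool is_escaped;
};

// Prefer the decoded value; fall back to the escaped text when decoding failed.
RewriteInput ValueForRewrite(const AttributeView& attr) {
  if (attr.decoded != nullptr) {
    return {attr.decoded, false};
  }
  return {attr.escaped, true};
}
```

The is_escaped flag matters because a value taken from the fallback path must be written back without escaping it a second time.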
What version of the product are you using (please check X-Mod-Pagespeed
header)?
On what operating system?
Ubuntu Server 11.10
Which version of Apache?
Apache Traffic Server / custom implementation
Which MPM?
none
Please provide any additional information below, especially a URL or an
HTML file that exhibits the problem.
I created a custom filter to rebase documents to a new domain and remove any
base tag found while doing that. It works fine, except when attribute values
that need to be rewritten contain multi-byte characters.
Currently, I have a workaround that rewrites escaped_value() instead of
DecodedValueOrNull(). Do I need to re-encode the stream to a single-byte
character set before passing it to ParseText (possibly creating HTML escapes
in the process)? Or should this be handled by PageSpeed, since it sees the
response headers that specify the character set?
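For reference, the re-encoding option asked about above could look roughly like the following. This is a minimal sketch, assuming the input is UTF-8 and that numeric character references are acceptable in the output. The names Utf8ToAsciiWithReferences and AppendNumericReference are made up for this example and are not part of PageSpeed. Whether such preprocessing is actually required before ParseText is exactly the open question of this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends a hexadecimal numeric character reference such as "&#xE9;".
void AppendNumericReference(uint32_t code_point, std::string* out) {
  assert(code_point <= 0x10FFFF);
  static const char kHex[] = "0123456789ABCDEF";
  std::string digits;
  do {
    digits.insert(digits.begin(), kHex[code_point & 0xF]);
    code_point >>= 4;
  } while (code_point != 0);
  out->append("&#x").append(digits).append(";");
}

// Copies ASCII bytes unchanged and replaces every multi-byte UTF-8 sequence
// with a numeric character reference. Returns false on truncated or malformed
// input. Simplification: overlong forms and surrogates are not rejected.
bool Utf8ToAsciiWithReferences(const std::string& in, std::string* out) {
  out->clear();
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    if (lead < 0x80) {  // Plain ASCII, including any existing '&' escapes.
      out->push_back(in[i]);
      ++i;
      continue;
    }
    std::size_t extra = 0;  // Number of continuation bytes expected.
    uint32_t code_point = 0;
    if ((lead & 0xE0) == 0xC0) {
      extra = 1;
      code_point = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      extra = 2;
      code_point = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      extra = 3;
      code_point = lead & 0x07;
    } else {
      return false;  // Stray continuation byte or invalid lead byte.
    }
    if (i + extra >= in.size()) {
      return false;  // Sequence is cut off at the end of the input.
    }
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char byte = static_cast<unsigned char>(in[i + k]);
      if ((byte & 0xC0) != 0x80) {
        return false;  // Not a continuation byte.
      }
      code_point = (code_point << 6) | (byte & 0x3F);
    }
    if (code_point > 0x10FFFF) {
      return false;  // Outside the Unicode range.
    }
    AppendNumericReference(code_point, out);
    i += extra + 1;
  }
  return true;
}
```

With this sketch, the UTF-8 bytes for "/café" (ending in 0xC3 0xA9) become "/caf&#xE9;". Existing ASCII text, including entities already present in the document, passes through untouched, so the output stays valid HTML in any ASCII-compatible character set.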
Original issue reported on code.google.com by osch...@gmail.com
on 26 Apr 2012 at 8:08