This repository was archived by the owner on Apr 10, 2025. It is now read-only.

HTML URL attributes with multi-byte characters may be misinterpreted #425

Closed
@GoogleCodeExporter

What steps will reproduce the problem?
1. Code up a filter that scans href attributes.
2. Send a UTF-8 document through that filter.

What is the expected output? What do you see instead?
Calling DecodedValueOrNull on an href that contains multi-byte characters 
returns NULL. I probably expected to see an HTML-escaped, single-byte version 
of the attribute, but I'm not sure about that :)
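To illustrate what an "HTML-escaped, single-byte version" of an attribute value could look like, here is a minimal sketch. It assumes the input is UTF-8 and rewrites each multi-byte sequence as a decimal numeric character reference. `EscapeNonAsciiUtf8` is a hypothetical helper written for this report, not a PageSpeed function, and it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of PageSpeed): replaces each structurally
// valid multi-byte UTF-8 sequence with a decimal numeric character reference
// and leaves ASCII bytes untouched. Bytes that do not start a complete
// sequence are copied through unchanged so no input is silently dropped.
std::string EscapeNonAsciiUtf8(const std::string& in) {
  std::string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    if (lead < 0x80) {  // plain ASCII
      out.push_back(in[i]);
      ++i;
      continue;
    }
    // Determine the sequence length and payload bits from the lead byte.
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

    // Every continuation byte must look like 10xxxxxx.
    bool ok = len != 0 && i + len <= in.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80) {
        ok = false;
      } else {
        cp = (cp << 6) | (c & 0x3F);
      }
    }
    if (!ok) {  // malformed or truncated: pass the byte through as-is
      out.push_back(in[i]);
      ++i;
      continue;
    }
    out += "&#" + std::to_string(cp) + ";";
    i += len;
  }
  return out;
}
```

With this sketch, the UTF-8 bytes `C3 A9` (é) in `/caf\xC3\xA9` come out as `/caf&#233;`, so the attribute value is pure ASCII.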

What version of the product are you using (please check X-Mod-Pagespeed
header)?

On what operating system?
Ubuntu Server 11.10
Which version of Apache?
Apache Traffic Server / custom implementation
Which MPM?
none
Please provide any additional information below, especially a URL or an
HTML file that exhibits the problem.

I created a custom filter that rebases documents to a new domain and removes 
any base tag it finds along the way. It works fine, except when attribute 
values that need to be rewritten contain multi-byte characters. Currently, I 
have a workaround based on rewriting escaped_value() instead of 
DecodedValueOrNull(). Do I need to re-encode the stream to a single-byte 
character set before passing it to ParseText (possibly creating HTML escapes 
in the process)? Or should this be handled by PageSpeed, given that it sees 
the response headers, which specify the character set?
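For reference, the workaround described above can be sketched roughly as follows. `FakeAttribute` is a hypothetical stand-in that models only the two accessors named in this report (their exact signatures in PageSpeed are an assumption here), and `ValueForRewrite` and `Rebase` are illustrative helpers, not the actual filter code.

```cpp
#include <string>

// Hypothetical stand-in for the attribute object a filter receives. Only the
// two accessors mentioned in this report are modeled; signatures are assumed.
struct FakeAttribute {
  const char* decoded;  // nullptr models the failed-decode case
  const char* escaped;  // raw attribute text as it appeared in the document
  const char* DecodedValueOrNull() const { return decoded; }
  const char* escaped_value() const { return escaped; }
};

// Picks the text to rewrite: the decoded value when decoding succeeded,
// otherwise the escaped value, otherwise an empty string.
std::string ValueForRewrite(const FakeAttribute& attr) {
  if (const char* d = attr.DecodedValueOrNull()) {
    return d;
  }
  const char* e = attr.escaped_value();
  return e != nullptr ? e : "";
}

// Illustrative rebase step: swaps a leading old origin for the new one and
// leaves every other URL (for example, a relative one) untouched.
std::string Rebase(const std::string& url,
                   const std::string& old_origin,
                   const std::string& new_origin) {
  if (url.compare(0, old_origin.size(), old_origin) == 0) {
    return new_origin + url.substr(old_origin.size());
  }
  return url;
}
```

One caveat with this approach: the escaped value may still contain entities such as `&amp;`, so a prefix swap like this is only safe when the part being replaced contains no escapable characters.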

Original issue reported on code.google.com by osch...@gmail.com on 26 Apr 2012 at 8:08
