Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing unprintable characters? #65

Closed
swanson opened this issue Apr 6, 2014 · 6 comments
Closed

Consider removing unprintable characters? #65

swanson opened this issue Apr 6, 2014 · 6 comments

Comments

@swanson
Copy link
Contributor

swanson commented Apr 6, 2014

We've encountered some issues with HTML containing unprintable characters (namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295). The issue manifests itself further down the chain once we try to load the HTML with unprintable characters as JSON.

We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after running the contents through loofah. I think this makes sense to add this as part of loofah's sanitizing process - any thoughts before I take a stab at a PR?

@flavorjones
Copy link
Owner

In principle this feature makes sense to me; however I'm curious as to why
it matters whether these characters are present or not. Can you provide a
bit of context to help me understand?
On Apr 6, 2014 6:00 PM, "matt swanson" notifications@github.com wrote:

We've encountered some issues with HTML containing unprintable characters
(namely \u2028\u2029) over on the Stringer project (stringer-rss/stringer#295stringer-rss/stringer#295).
The issue manifests itself further down the chain once we try to load the
HTML with unprintable characters as JSON.

We've resolved our issues by adding a gsub(/[^[:print:]]/, '') call after
running the contents through loofah. I think this makes sense to add this
as part of loofah's sanitizing process - any thoughts before I take a
stab at a PR?

Reply to this email directly or view it on GitHubhttps://github.com//issues/65
.

@swanson
Copy link
Contributor Author

swanson commented Apr 9, 2014

Sure - the problem occurs downstream. In our particular case, we are parsing RSS feeds and then sanitizing the post content and storing it in a database. Later on, we are converting that post content to JSON (no problems yet, the unprintable characters are valid JSON) and then using a client-side Javascript framework (Backbone.js in our case) to render the data.

The issues occur when we try to run those unprintable characters through the Javascript parser. More details here: http://timelessrepo.com/json-isnt-a-javascript-subset but the long and short of it: Javascript cannot handle these unprintable characters even though JSON can. "No string in JavaScript can contain a literal U+2028 or a U+2029."

@flavorjones
Copy link
Owner

That's crazy! OK, got it loud and clear.

How about this proposed usage:

doc = Loofah.fragment(thing)
doc.scrub!(:unprintable)

so that the non-printable characters can be removed by chaining .scrub!(:unprintable) to the end of the currently-used scrubbers.

If that usage makes sense, then are you able to take a first whack at an implementation? If not, I'll try to get to it.

@flavorjones
Copy link
Owner

Possibly useful reference: here's how Rack::JSONP addressed this issue:

https://github.com/rack/rack-contrib/pull/37/files

@swanson
Copy link
Contributor Author

swanson commented Apr 9, 2014

👍 sounds good to me. I can probably get going on a pull request next week.

@flavorjones
Copy link
Owner

Let me know if you have questions on where to start. Should be as easy as
adding a new class to lib/loofah/scrubbers.rb, and a corresponding entry
in MAP at the bottom of that file.

On Wed, Apr 9, 2014 at 12:25 PM, matt swanson notifications@github.comwrote:

[image: 👍] sounds good to me. I can probably get going on a pull
request next week.

Reply to this email directly or view it on GitHubhttps://github.com//issues/65#issuecomment-39984807
.

flavorjones added a commit that referenced this issue Apr 23, 2014
Add new scrubber to remove unprintable Unicode characters, fixes #65
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants