Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain unexpected characters in an entry title can cause a stack trace #1535

Closed
narration-sd opened this issue Mar 15, 2017 · 16 comments

Comments

@narration-sd
Copy link
Contributor

commented Mar 15, 2017

Description

I'm going to write this up here to save @takobell from having to parse a detailed late-night note. It's not likely to show up unless you are doing as follows (common enough possible occurances, though):

  • pasting in title text from some web source
  • pasting in from a Mac's or other rich text
  • loading the title from an external database, with typical nasty customer data (my case)

What will happen if in particular your title text includes a Latin-1 non-breaking space (hex a0), is that a php preg_replace will mysteriously error when creating the slug, returng a null which will give the stack trace exception. Other odd characters will likely do the same thing, so very possibly this Entity range ought to be filtered out -- which preg_replace actually will handle. Apparently it's the html tag-removing matcher that blows up.

Steps to reproduce

  1. try a title with an embedded hex A0 character
    • fill in the rest of an entry and save
    • get a stack trace on null return from StringHelpers::stripHtml()

Additional info

PHP version 7.1.2-4+deb.sury.orgxenial+1
Database driver & version MySQL 5.7.17-0ubuntu0.16.04.1
Image driver & version GD 7.1.2-4+deb.sury.org
xenial+1
Craft edition & version Craft Personal 3.0.0-beta.6
Yii version 2.0.11.2
Twig version 2.1.0
Guzzle version 6.2.1
Imagine version 0.7-dev
Plugins: yes, proprietary StandardsAgent, a big one near full conversion, and this string with the divot is coming from an external database it watches for reliable and timely creation, update, delete, etc. tracking.

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 15, 2017

You know, I may have slightly mis-reported this. The actual content in the original database is hex C2A0, which is however the Unicode representation of the same reported, A0 -- Latin-1 nonbreaking space.

I'm including here a text file of what is actually in the external database being read from (look between BS and EN), and it should have propagated through as the full C2A0 anyway by the time Craft saw it, but not sure about that, or which/both are causing your stripHTML preg_remove to blow up.

It's a tedious thing to set up the test case actually in situ, if aided well by PhpStorm's breakpoint from conditioned breakpoint nice abilities, so I'll try to close this loop later, possibly also by a separated example.

As you'll know, discerning what is actually present in the world of various viewers and editors and database interfaces vs. Unicode can be pretty tricky. In any case, that preg ought to be made safe, or safetied, imagine you'd agree.

nbsp problem.txt

@angrybrad angrybrad self-assigned this Mar 15, 2017
@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 16, 2017

Ok, I've pretty well verified it is indeed a single 0xA0 that exists in the text sent to the stripHtml regex during slugbuilding, which causes it to fail.

Apparently the MsSql database driver just changes original Unicode into equivalent Latin-1, when it can - in this case U+C2A0 to 0xA0. 'Nice' -- not exactly, but sort of their style. Hence a lot of fun matching out bad characters.

I'm still going to use my preg_replace matcher on this single \hA0 char to replace it with a space at least for the present,, as the HtmlPurify in StringHelper::convertToUtf8() while safe entirely takes this Latin-1 nbsp out.

Thanks for the extra notes offline, Brad -- much appreciated and leads to some still-advancing thinking here.

@angrybrad

This comment has been minimized.

Copy link
Member

commented Mar 27, 2017

I can see the offending whitespace in the sample text file you provided, but C3 beta 8 handles it gracefully for me either via an auto-generated slug or if I manually insert the whitespace into the slug.

Maybe something to do with the version of PCRE installed? Running 8.38 2015-11-23 here.

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2017

Hmm - could well be.

Handy beta 8 phpinfo says I'm running PCRE 8.39 2016-06-14 on this fragrant , flagrantly up-to-date-by-Otwell-standards Vagrant.

Without looking too deeply into it, I did find some murky water having to do with decisions on how to handle various POSIX (?) classes and/or a bug in the jit, which is turned on in this php load.

Two links -- first has a plod through for alternative issues, second lists what they fixed on 8.40. It sounds a bit of a mess, actually.

https://bugzilla.redhat.com/show_bug.cgi?id=1400267
https://fossies.org/diffs/pcre/8.39_vs_8.40/ChangeLog-diff.html

I guess you could do as I did, and just convert that particular character to safety Craft titles -- as people could paste that in as they did in this client database. Or if Macish make it in some case.

Otherwise, I liked the idea of tuning up use of HtmlPurifier some year when there might be time to sensibly do that -- and if it didn't reinvent similar problems...what is it using if not REs.

I want a Yogi Berra-ism 'judgement is hard when you need to decide', or something, for a case like this - think you have had a few of them...and I really like the way you guys choose, quite dependably...

@angrybrad

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

@narration-sd Can you confirm if replacing StringHelper::stripHtml() with PHP's strip_tags() fixes the issue for you? Looks like it does basically the same thing and accounts for null spaces as well and will be much more light-weight that spinning up HTML Purifier.

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2017

Sigh. Hope you didn't get emailed with my first results, which were incorrect since I didn't read your note carefully at this hour.

Answer, unfortunately what you propose doesn't work.

StringHelper::stripHtml() fails out with a TypeError because it is trying to return null.

Null in turn probably means the same (PCRE?) fault as drives preg_replace() to fail is occurring somewhere underneath. Strip_tags() didn't get the chance to be involved.

I will keep my special two-step breakpoint around, if disabled. And now that I have more of the plugin up, while still having the dirty database, it's not so much trouble to run a test.

@angrybrad

This comment has been minimized.

Copy link
Member

commented Mar 30, 2017

So you're saying:

public static function stripHtml($str)
{
    return strip_tags($str);
}

returns null with your null space sample input?

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2017

No, php raised an exception, the TypeError on StringHelper::stripHtml() -- because of its attempt to return null.

I'm not sure I like the type handling system zapping an old use of null for strings, but there it is.

And again, the null return has the scent of the preg_replace thing we are trying to work around.

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2017

You know, I[m still not reading you straight.

Slowing down and doing it right. (also still-db-related vision issues about to be well treated, but still my fault) -- though the reference to replacing StringHelper::stripHtml() also threw me, as I'd been talking about StringHelper::convertToUtf8().

Am setting up the test again to do it right. What i just said about stripHtml() is true I believe, though - will verify that as well, maybe trace deeper....

@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2017

Ok, now some more detailed information.

  • StringHelper::stripHtml() indeed also crashes out on the \xa0/C2A0 nbsp character. It happens directly from its quite simple preg_replace call, same Return value of craft\helpers\StringHelper::stripHtml() must be of the type string, null returned message, as expected. So, it acts the same as the preg_replace fault I originally complained about -- guess you were checking me by control of stripHtml(), reparsing your message again. Yogi Berra is definitely in there somewhere....

  • now, second part. Php \strip_tage()j is more opaque, does not crash out. But it also doesn't do anything about this nbsp character -- this is not an $nbsp;, but rather the Latin-1 high-range character equivalent of that. At 0xa0. So doesn't solve my problem, because the character is still there to trip up later PCRE uses -- as in the original issue situation, that Craft 3 (properly in my opinion) doesn't allow it in a title.

  • I went further, after suddenly wondering if I was causing something with my belt and suspenders remover(s) of the diamond-questionmark out-of-band utf8 don't-know mark, occuring before these problem calls. But this is not so. Upon placing it first in line, stripHtml()'s preg_replace immediately coughed up with the same TypeExceptioned null return on the diamond-? character, the first time encountered, so the same bad behaviour.

I don't know if it's something we might be missing about tuning PCRE 8.40, or rather the rude basket of known faults with my version, reported by that team links above -- they ve pretty evidently abandoned even trying with it, in favor of producing PCRE 10.x, not in public.

Brad, I know it's messy, I don't know if the simple /\xa0/ - replace w/ordinarly-space remover I put in can be a general answer for titles inside of Craft. If we were sure PCRE wouldn't get it mixed up as part of a longer utf8 entity, this could be ok. Really, it shouldn't, and my Chinese string thrown at it gave a first approximation, but that wasn't a real test. I guess I should find some Chinese characters with A0s in them...can't now for this afternoon.

@angrybrad

This comment has been minimized.

Copy link
Member

commented Apr 5, 2017

Yeah... seems like when you add UTF-8 into the mix, there's just not an easy answer.

https://stackoverflow.com/questions/1279774/to-strip-whitespaces-inside-a-variable-in-php

http://stackoverflow.com/a/1176923/684

Given that it seems to be edge case (tried on multiple computers and couldn't reproduce), no one else has reported a similar issue and all of the solutions I've found have their own issues and potential side-effects, going to go ahead and close this. Can always re-open if necessary.

@angrybrad angrybrad closed this Apr 5, 2017
@narration-sd

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2017

Well, pretty much agree that's sound judgement, Brad. I do have one more thing on a note to try here, just to be sure I actually covered all the bases. News at some 11 if I learn anything :)

And thanks for all the consideration...

@sokolovstas

This comment has been minimized.

Copy link

commented Jul 25, 2019

Hi!

I have same error but with russian chars in slug field. It works fine on local dev machine but fail on hosting:

Return value of craft\helpers\StringHelper::stripHtml() must be of the type string, null returned

// Apply normal slug rules
$slug = ElementHelper::createSlug($slug);
 // Remove HTML tags
$str = StringHelper::stripHtml($str);
public static function stripHtml(string $str): string
{
   return preg_replace('/<(.*?)>/u', '', $str);
}
@sokolovstas

This comment has been minimized.

Copy link

commented Jul 25, 2019

If remove /u flag then regex return "Сертификация" -> "-е---------к------"

@sokolovstas

This comment has been minimized.

Copy link

commented Jul 25, 2019

@narration-sd @angrybrad

For my this fixed by adding:
mb_internal_encoding('UTF-8') at index.php

Also there is error in createSlug but it was fixed in latest commit

brandonkelly added a commit that referenced this issue Jul 25, 2019
@brandonkelly

This comment has been minimized.

Copy link
Member

commented Jul 25, 2019

@sokolovstas I just tried that slug locally and got a “Malformed UTF-8 characters, possibly incorrectly encoded” error, even with mb_internal_encoding set to UTF-8. Turns out р (Cyrillic ‘Er’) shares a byte with smart quote characters (, , , ), so this line was breaking up that character:

$str = preg_replace('/[\'"‘’“”\[\]\(\)\{\}:]/', '', $str);

I’ve updated that line to be unicode-aware for the next release, which fixes this.

To get the fix early, change your craftcms/cms requirement in composer.json to:

"require": {
  "craftcms/cms": "dev-develop#ab3095b6447dadcf17ce7a3197975b31efe609b1 as 3.2.6",
  "...": "..."
}

Then run composer update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.