Accents/special characters in payloads causing problems #39

roynasser · 2012-02-01T23:01:04Z

Due to the way json_encode() works, strings containing accents and some other special characters that are not valid UTF-8 cause it to return null values.

This poses a problem when passing messages that include these characters, such as text in foreign languages and perhaps more. One workaround is to UTF-8 encode all of the payload data.

I have tried this in development and it works fine. Is there any downside that you guys can point out?
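For anyone wanting to reproduce the symptom, here is a minimal sketch. It assumes the payload string arrives as ISO-8859-1 bytes (the variable names are made up for illustration). On PHP 5.5 and later json_encode() returns false for the whole payload; older PHP 5 releases are reported to emit null for the offending value instead, which matches the empty payloads described here.

```php
<?php
// "café" in ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence.
$latin1Payload = ['name' => "caf\xE9"];
$failed = json_encode($latin1Payload);
$error = json_last_error();

var_dump($failed);                    // false on PHP 5.5+
var_dump($error === JSON_ERROR_UTF8); // the failure is reported as a UTF-8 error

// The same text as valid UTF-8 (0xC3 0xA9) encodes normally.
$utf8Payload = ['name' => "caf\xC3\xA9"];
$ok = json_encode($utf8Payload);
echo $ok, "\n";
```

Checking the return value and json_last_error() at the point of encoding makes the failure visible instead of silently queueing an empty payload.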

The function I am using was obtained from the PHP manual pages and I haven't even tweaked it yet, but the idea is to add something such as:

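public static function array_utf8_encode_recursive($dat)
{
    // Strings are the only values that actually get converted.
    if (is_string($dat)) {
        return utf8_encode($dat);
    }

    // Objects: convert each accessible property.
    // Note that $new is the same object as $dat, so the caller's object is modified.
    if (is_object($dat)) {
        $ovs = get_object_vars($dat);
        $new = $dat;
        foreach ($ovs as $k => $v) {
            $new->$k = self::array_utf8_encode_recursive($new->$k);
        }
        return $new;
    }

    // Anything else that is not an array (int, float, bool, null) passes through unchanged.
    if (!is_array($dat)) {
        return $dat;
    }

    // Arrays: rebuild with the same keys and converted values.
    $ret = array();
    foreach ($dat as $i => $d) {
        $ret[$i] = self::array_utf8_encode_recursive($d);
    }
    return $ret;
}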

and then encode the payload through this function (in order to retain the array structure).

So far I have identified that this is necessary in resque.php (function push), in job.php (circa line 250) and worker.php (circa line 500).

Another idea is to UTF-8 encode before sending to Resque (which I will do tomorrow for compatibility's sake, since I use an "enqueue wrapper function" in my app), but I thought I'd contribute this anyhow in case anyone has issues.

pilif · 2012-02-14T14:55:14Z

You can't unconditionally utf-8 encode: If the input data is already in utf-8 (likely), this would lead to a double encode which, again, breaks characters.
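To make the double-encode problem concrete, here is a small sketch. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which is the same conversion utf8_encode() performs, and assumes the mbstring extension is available.

```php
<?php
// "é" already encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Encoding again treats each byte as a separate ISO-8859-1 character
// ("Ã" and "©") and converts both, giving four bytes instead of two.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9
```

The result is still valid UTF-8, so json_encode() accepts it, but the text now reads "Ã©" rather than "é".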

You can of course detect whether the input is already in utf-8 (mb_detect_encoding()), but it's not the quickest operation of them all.
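A rough sketch of the conditional approach, using mb_check_encoding() to validate rather than mb_detect_encoding() to guess. The helper name ensure_utf8() is hypothetical, and the fallback assumes any non-UTF-8 input is ISO-8859-1, which will not hold for every application.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function ensure_utf8($value)
{
    if (!is_string($value) || mb_check_encoding($value, 'UTF-8')) {
        return $value; // non-strings and valid UTF-8 pass through untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1 = ensure_utf8("caf\xE9");     // invalid as UTF-8, so it is converted
$fromUtf8   = ensure_utf8("caf\xC3\xA9"); // already valid UTF-8, left unchanged

var_dump($fromLatin1 === $fromUtf8);
```

This avoids the double encode for input that is already UTF-8, but it is still a guess: an ISO-8859-1 string whose bytes happen to form valid UTF-8 sequences would be left unconverted.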

I would recommend this being handled on the client side for performance reasons, especially because more and more projects are using utf-8 internally.

roynasser · 2012-02-14T22:18:54Z

@pilif Hey! That is what I ended up doing... in my enqueuer I encode the strings, and I always decode them on my job workers.
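For readers who want to do the same, the encode-on-enqueue, decode-in-worker round trip might look roughly like this. The helper convert_args() is hypothetical, the application data is assumed to be ISO-8859-1, and json_encode()/json_decode() stand in for what happens to the arguments on their way through the queue. It needs mbstring and PHP 7 or later.

```php
<?php
// Hypothetical helper: convert every string value in a nested array.
// Array keys are not touched, only the values.
function convert_args(array $args, string $to, string $from): array
{
    array_walk_recursive($args, function (&$value) use ($to, $from) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, $to, $from);
        }
    });
    return $args;
}

// Application data, assumed here to be ISO-8859-1.
$args = ['name' => "Jos\xE9", 'tags' => ["caf\xE9", 'plain'], 'id' => 7];

// Enqueue side: convert to UTF-8 so the JSON encoding succeeds.
$wire = json_encode(convert_args($args, 'UTF-8', 'ISO-8859-1'));

// Worker side: decode the JSON, then convert back to the application's encoding.
$back = convert_args(json_decode($wire, true), 'ISO-8859-1', 'UTF-8');

var_dump($back === $args);
```

Because the calling code knows which encoding its own data uses, no detection is needed at either end.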

Anyways, I thought it was useful to post here as it took me a little while to figure out what was going on with the empty payloads...

danhunsaker · 2013-06-19T17:09:35Z

Even mb_detect_encoding() isn't enough, sometimes. The trouble is that detecting character encodings is always a bit of a hack, relying on a lot of guess-work. This is because the actual encoding of a string isn't actually stored anywhere, and most encodings are extremely similar to one another. A string of UTF-8 encoded text that doesn't happen to use any characters outside the range shared by ASCII, for example (anything with a byte value between [IIRC] 32 and 126, decimal - most English text qualifies here), could be detected as either ASCII or UTF-8, or perhaps even a host of other encodings that use the same range of bytes in the same way.
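A quick illustration of that ambiguity, assuming the mbstring extension: the same plain English string validates under several encodings at once, so the bytes alone cannot say which one was intended.

```php
<?php
$plain = 'Hello, world';

// Check the same bytes against three different encodings.
$results = [];
foreach (['ASCII', 'UTF-8', 'ISO-8859-1'] as $encoding) {
    $results[$encoding] = mb_check_encoding($plain, $encoding);
}

var_dump($results); // every entry is true for this string
```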

The only piece of code in a position to know for sure is the piece that actually accepts the input. In most PHP use cases, this is the browser. Most of the rest depend on the encoding of whatever other application (MySQL, Redis, web API, etc) the data is coming from. Some of these will tell you the encoding they've used. Some of them won't. And sadly, sometimes the ones that do mention an encoding are simply wrong.

Ultimately, the only sane location to handle encoding conversions is in the Resque client, not the library itself.

danhunsaker · 2013-12-18T14:34:08Z

Is this still an issue, or can it be closed?

chrisboulton closed this as completed Oct 11, 2016
