Accents/special characters in payloads causing problems #39

Closed
roynasser opened this issue Feb 1, 2012 · 4 comments

Comments

@roynasser

Due to the way json_encode() works, accents and some other special characters cause it to return null values: it expects UTF-8 input, so strings in another encoding (ISO-8859-1, for example) cannot be encoded.
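For anyone trying to reproduce this, here is a minimal sketch of the failure. The sample string and variable names are made up for illustration. On PHP 5.5 and later json_encode() returns false for the whole value instead of a null entry; the JSON_PARTIAL_OUTPUT_ON_ERROR flag brings back the older "null in place of the string" output described above.

```php
<?php
// "café" encoded as ISO-8859-1: the é is the single byte 0xE9, which is not valid UTF-8.
$latin1 = "caf\xE9";

// PHP 5.5+: the whole call fails and json_last_error() reports a UTF-8 problem.
$strict = json_encode(array('name' => $latin1));
$error  = json_last_error();

// With partial output, the unencodable string is replaced by null.
$partial = json_encode(array('name' => $latin1), JSON_PARTIAL_OUTPUT_ON_ERROR);

var_dump($strict);                     // bool(false)
var_dump($error === JSON_ERROR_UTF8);  // bool(true)
echo $partial, "\n";                   // {"name":null}
```

Checking json_last_error() right after encoding a payload is a cheap way to notice the problem before an empty job reaches the queue.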

This poses a problem when passing messages that include these characters, such as text in foreign languages. One workaround is to UTF-8 encode all of the payload data.

I have tried this in development and it works fine. Is there any downside that you guys can point out?

The function I am using was obtained from the PHP manual pages and I haven't even tweaked it yet, but the idea is to add something such as:

    // Recursively UTF-8 encode every string in a payload, keeping its structure.
    // Note: utf8_encode() assumes the input is ISO-8859-1.
    public static function array_utf8_encode_recursive($dat)
    {
        if (is_string($dat)) {
            return utf8_encode($dat);
        }

        if (is_object($dat)) {
            // Objects are handles, so this updates the properties of $dat in place.
            foreach (get_object_vars($dat) as $k => $v) {
                $dat->$k = self::array_utf8_encode_recursive($v);
            }
            return $dat;
        }

        if (!is_array($dat)) {
            return $dat;
        }

        $ret = array();
        foreach ($dat as $i => $d) {
            $ret[$i] = self::array_utf8_encode_recursive($d);
        }
        return $ret;
    }

and then encode the payload through this function (in order to retain the array structure).

So far I have identified that this is necessary in resque.php (function push), in job.php (around line 250) and in worker.php (around line 500).

Another idea is to UTF-8 encode before sending to Resque (which I will do tomorrow for compatibility's sake, since I use an "enqueue wrapper function" in my app), but I thought I'd contribute this anyhow in case anyone has issues.

@pilif

pilif commented Feb 14, 2012

You can't unconditionally utf-8 encode: If the input data is already in utf-8 (likely), this would lead to a double encode which, again, breaks characters.
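To make the double-encode problem concrete, here is a small sketch. It uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() (which performs the same ISO-8859-1 to UTF-8 conversion) and assumes the mbstring extension is available; the sample string is illustrative.

```php
<?php
// "café" that is already valid UTF-8: é is the two bytes 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

// Treating those bytes as ISO-8859-1 and encoding again turns each byte
// of the é into its own character ("Ã" and "©").
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";      // 636166c3a9
echo bin2hex($double), "\n";    // 636166c383c2a9
echo json_encode($utf8), "\n";  // "caf\u00e9"
echo json_encode($double), "\n"; // "caf\u00c3\u00a9"
```

The double-encoded string is still valid UTF-8, so json_encode() accepts it without complaint and the corruption only shows up when the text is displayed.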

You can of course detect whether the input is already in utf-8 (mb_detect_encoding()), but it's not the quickest operation of them all.
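A rough sketch of such a conditional conversion follows. It swaps in mb_check_encoding() for mb_detect_encoding(), since only one question is being asked ("is this valid UTF-8?"). The helper name ensure_utf8 is hypothetical, and the fallback encoding of ISO-8859-1 is an assumption that will not hold for every application.

```php
<?php
// Hypothetical helper, not part of php-resque.
// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function ensure_utf8($value)
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1 = ensure_utf8("caf\xE9");     // ISO-8859-1 input gets converted
$fromUtf8   = ensure_utf8("caf\xC3\xA9"); // UTF-8 input passes through untouched

var_dump($fromLatin1 === $fromUtf8); // bool(true)
```

The check runs over every string in every payload, which is the per-job cost being weighed here.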

I would recommend this being handled on the client side for performance reasons, especially because more and more projects are using utf-8 internally.

@roynasser
Author

@pilif Hey! That is what I ended up doing... in my enqueuer I encode the strings, and I always decode them on my job workers.
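For reference, a sketch of what that client-side round trip could look like. The helper name payload_convert is hypothetical, and the code assumes the application works in ISO-8859-1 internally and that the mbstring extension is loaded. The json_encode()/json_decode() pair stands in for the trip through the queue.

```php
<?php
// Hypothetical helper: convert every string in a nested array between encodings.
function payload_convert($data, $to, $from)
{
    if (is_string($data)) {
        return mb_convert_encoding($data, $to, $from);
    }
    if (is_array($data)) {
        $out = array();
        foreach ($data as $key => $value) {
            $out[$key] = payload_convert($value, $to, $from);
        }
        return $out;
    }
    return $data; // ints, bools, null and so on pass through unchanged
}

// Job arguments as the application holds them (ISO-8859-1 bytes).
$args = array('name' => "Jos\xE9", 'tags' => array("ma\xF1ana"), 'retries' => 3);

// Enqueue side: convert to UTF-8 before the payload is JSON encoded.
$wire = json_encode(payload_convert($args, 'UTF-8', 'ISO-8859-1'));

// Worker side: decode the JSON, then convert back to the internal encoding.
$received = payload_convert(json_decode($wire, true), 'ISO-8859-1', 'UTF-8');

var_dump($received === $args); // bool(true)
```

Because both ends belong to the same application, the source encoding is known rather than guessed, which is what makes this safe to do on the client.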

Anyways, I thought it was useful to post here as it took me a little while to figure out what was going on with the empty payloads...

@danhunsaker
Contributor

Even mb_detect_encoding() isn't enough, sometimes. The trouble is that detecting character encodings is always a bit of a hack, relying on a lot of guesswork. This is because the encoding of a string isn't stored anywhere, and many encodings are extremely similar to one another. A string of UTF-8 encoded text that doesn't happen to use any characters outside the range shared with ASCII, for example (anything with a byte value between [IIRC] 32 and 126, decimal - most English text qualifies here), could be detected as either ASCII or UTF-8, or perhaps even a host of other encodings that use the same range of bytes in the same way.
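A quick illustration of that ambiguity, assuming the mbstring extension is available. The sample strings and variable names are made up; mb_check_encoding() only says whether the bytes are valid in a given encoding, which is the most any detector has to go on.

```php
<?php
// Plain English text is byte-for-byte valid in all three encodings.
$english = "Hello, world";
$validFor = array();
foreach (array('ASCII', 'UTF-8', 'ISO-8859-1') as $encoding) {
    $validFor[$encoding] = mb_check_encoding($english, $encoding);
}
var_dump($validFor); // all three entries are true

// The same two bytes are "é" in UTF-8 and "Ã©" in ISO-8859-1.
// Both readings are valid; nothing in the bytes says which one was meant.
$bytes = "\xC3\xA9";
$validUtf8   = mb_check_encoding($bytes, 'UTF-8');      // true
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1'); // true
$validAscii  = mb_check_encoding($bytes, 'ASCII');      // false
```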

The only piece of code in a position to know for sure is the piece that actually accepts the input. In most PHP use cases, this is the browser. Most of the rest depend on the encoding of whatever other application (MySQL, Redis, web API, etc) the data is coming from. Some of these will tell you the encoding they've used. Some of them won't. And sadly, sometimes the ones that do mention an encoding are simply wrong.

Ultimately, the only sane location to handle encoding conversions is in the Resque client, not the library itself.

@danhunsaker
Contributor

Is this still an issue, or can it be closed?


4 participants