Workaround for Yajl encoding problems #69

Closed
wants to merge 2 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

rogerbraun commented Jan 25, 2012

I don't really know what triggers this, but with Picky integrated in my Rails app, I get this error when dumping the index:

ruby-1.9.3-p0 :006 >   BookIndex.dump
Encoding::UndefinedConversionError: "\xE5" from ASCII-8BIT to UTF-8
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/yajl-ruby-1.1.0/lib/yajl.rb:72:in `write'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/yajl-ruby-1.1.0/lib/yajl.rb:72:in `encode'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/yajl-ruby-1.1.0/lib/yajl.rb:72:in `encode'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/backends/memory/json.rb:35:in `block in dump_json'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/backends/memory/json.rb:34:in `open'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/backends/memory/json.rb:34:in `dump_json'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/backends/memory/json.rb:27:in `dump'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/bundle_indexing.rb:34:in `dump'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/category.rb:80:in `dump'
        from (__DELEGATION__):2:in `block in dump'
        from (__DELEGATION__):2:in `each'
        from (__DELEGATION__):2:in `dump'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/picky-4.1.0/lib/picky/index.rb:92:in `dump'
        from (irb):4
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/railties-3.1.1/lib/rails/commands/console.rb:45:in `start'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/railties-3.1.1/lib/rails/commands/console.rb:8:in `start'
        from /home/roger/.rvm/gems/ruby-1.9.3-p0/gems/railties-3.1.1/lib/rails/commands.rb:40:in `<top (required)>'
        from script/rails:6:in `require'
        from script/rails:6:in `<main>'ruby-1.9.3-p0 :005 > 

This seems to be an error in Yajl, as changing the way the file is written makes it work. This is a workaround for this problem.

Owner

floere commented Jan 26, 2012

Thanks a lot for this – however, before pulling, I'd like to know what the actual problem is as this change more likely results in large strings to be created (if I am not mistaken).

Owner

floere commented Jan 27, 2012

As a reminder for myself: http://bugs.ruby-lang.org/issues/2313

Owner

floere commented Apr 17, 2012

Hi @rogerbraun, sorry for dragging my feet on this one. Do you have a minimal failing case for this, perhaps?

I can confirm this bug, and that fix works.

Owner

floere commented Apr 26, 2012

Thanks @BadMinus. My point here is that I first need to understand why it doesn't (and does) work before this goes into Picky. So if you can help me (us) with understanding the issue, that would be great :)

I'd pull this in in a second, but it raises new issues with big indexes, sadly.

The thing is, most current pull requests or issues are related to encoding. And Picky itself does not do much with encodings, deliberately. So I'd like to know why and how Rails triggers this problem. And with that information, solve it in the best way possible.
(I have a very strong feeling the solution lies in Rails and setting Ruby's encodings)

What do you guys think?

I don't know too where is problem, and I tried many different possible solutions in rails and in yabl, but come to only this

        def dump_json internal
          Yajl::Encoder.encode(internal) do |chunk|
              ::File.open(cache_path, 'w') do |out_file|
                out_file.write chunk.force_encoding Encoding::UTF_8
              end
          end
        end

How that will work with big indexes?

Owner

floere commented Apr 26, 2012

I'm currently trying to come up with a minimal failing test case: 2c50a3b

Owner

floere commented Apr 26, 2012

@rogerbraun @BadMinus What version of Rails are you guys using?

With my code I look at chunks that Yajl encoding and in the beginning encoding is ASCII-8BIT but next chunk in same file UTF-8.
There is no cyrillic symbols in price model, only numbers.
It seems problem in yajl.
I use 3.2.3 and mongoid

puts cache_path
           Yajl::Encoder.encode(internal) do |chunk|
             puts chunk.encoding 
             puts chunk
             puts chunk.encoding
             ::File.open(cache_path, 'w:utf-8') do |out_file|
                chunk = chunk.to_s.force_encoding Encoding::UTF_8
                out_file.write chunk
              end
          end
/Users/onemanstartup/code/projects/finisher/index/development/events/minimum_price_partial_inverted.memory.json
ASCII-8BIT
{"100":["4f942e33b0207a9f2f00000f","4f944761b0207a9f2f000010","4f944778b0207a9f2f000011"
....
,"4f957350b0207a0ba100003c","4f957350b0207a0ba100004f","4f957350b0207a0ba1000054"
ASCII-8BIT
UTF-8
,"4f957350b0207a0ba1000061","4f957350b0207a0ba1000067","4f957350b0207a0ba1000069","4f957350b0207a0ba100006d"],"50":
....
["4f957350b0207a0ba100002c","4f957350b0207a0ba1000043","4f957350b0207a0ba1000060","4f957350b0207a0ba100006e"]}
UTF-8

I just found another solution.
In documentation to Ruby
"If length is omitted or is nil, it reads until EOF and the encoding conversion is applied. It returns a string even if EOF is met at beginning."
I don't know how it affects performance, but it definitely encoding issues in this method I think.

# backends/prepared/text.rb
        def retrieve
          id    = nil
          token = nil
          ::File.open(cache_path, 'r:utf-8') do |file|
            file.read
            file.each_line do |line|
              id, token = line.split ?,, 2
              yield id, (token.chomp! || token)
            end
          end
        end
Owner

floere commented Apr 26, 2012

@BadMinus Regarding your latest comment: Where exactly is the problem here? (I'm not sure I understand)

Owner

floere commented Apr 26, 2012

So, to restate – the original problem occurred with Picky inside Rails.
To find out what the problem is:

  1. What if the same code runs outside of Rails?
  2. Can we find a minimal test case where this fails? If yes, what is it? Can we show the problem is indeed in Yajl?
Contributor

rogerbraun commented Apr 26, 2012

I used Rails 3.2.0. If I remember correctly, the problem only occurred on OS X with Picky running inside Rails.

I'm sorry I can't help much more, I 'solved' this problem by using a separate Picky server and don't really know how to reproduce it.

Owner

floere commented Apr 26, 2012

@rogerbraun No worries. I am wondering though: Why did you conclude it is Yajl when the same code worked when not using it within Rails?

Contributor

rogerbraun commented Apr 26, 2012

I have no idea...

Owner

floere commented Apr 27, 2012

No worries.

Owner

floere commented Apr 27, 2012

I'm getting closer to the issue. Rails explicitly sets

Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

while Picky doesn't. I'm considering doing this to finally squash the encoding issues.
(Just an FYI I am working on this)

BadMinus commented May 1, 2012

Any progress?
It's crazy, but I came up to third solution that I don't know how works :)

        def dump_json internal
          ::File.open(cache_path, 'w') do |out_file|
            ::JSON.dump  internal, out_file
            #Yajl::Encoder.encode internal, out_file
          end
        end

but solution with file.read works faster. With that it indexed in 0 seconds and other solutions indexing in 1 second

Owner

floere commented May 2, 2012

@BadMinus 0-1 second? Are you indexing your mp3 collection? ;)

Some progress information: Picky in its current version has UTF8 as internal encoding, and an external encoding whatever is set by ENV variables etc. This is the basic encoding state of Ruby. It is sensible in that if you have a strange encoding set in your env, anything incoming will be transcoded into the internal encoding.

I am currently experimenting with setting the external encoding to UTF8, like Rails does.
Also, I've asked @brianmario about Yajl 2.0 which seems to deal with encodings a bit better (if I got that right). I am rather enamored with Yajl (due to its performance), and would like to know if Yajl is indeed "the problem" before abandoning it without reason. However, the inclusion of multi_json is also planned, so all this might change anyway.
(Even though multi_json sounds great, it is cause for worry, as e.g. problems with a specific json lib will be obfuscated, making it even harder to determine the cause for a problem)

Sorry about all this!
But feel free to monkey-patch. This method is part of an internal interface and will not be changed without at least a minor or even major version jump. I do not intend to change the interface any time soon, so it will remain at least in 5.0, and I will announce any changes to it in the history.textile file.

BadMinus commented May 2, 2012

@floere it's fast or not? I don't understand :) I'm indexing around 100 simple objects.

Solutions works so it's ok, I just wanted to help and understand how picky works. It's hard to understand for me yet.

Owner

floere commented May 2, 2012

@BadMinus It's ok – there's a bit of overhead on an indexing run, so 1 second even is ok.

Don't get me wrong, I appreciate your help a lot! :) If you have questions about Picky, don't hesitate to ask. it is a big framework, after all. But, I hope, well enough structured such that it is readable.

gokulj commented May 23, 2012

I believe the solution is to open the caching file in binary mode?
http://stackoverflow.com/questions/4988724/ruby-on-rails-upload-file-problem-odd-utf8-conversion-error
From the answer to that question, it looks ike the json_dump method (see below) in Picky's code
should open the file in binary mode to resolve this?

picky/backends/memory/json.rb:
18 #
19 def load
20: Yajl::Parser.parse ::File.open(cache_path, 'r') # , symbolize_keys: true # TODO Symbols.
21 end
22
..
31 #
32 def dump_json internal
33: ::File.open(cache_path, 'w') do |out_file|
34 Yajl::Encoder.encode internal, out_file
35 end

Owner

floere commented Jun 20, 2012

Sadly, I cannot reproduce the original problem. See https://github.com/floere/picky/blob/cff4f81264e0171158334619c4850fd58d082715/server/spec/functional/backends/memory_json_utf8_spec.rb#L16-34 for details. Both internal and external encoding are set to the Rails defaults. If I forcefully encode, I get the error mentioned above, but not when saving with Yajl.
Thoughts, criticism?

Owner

floere commented Jun 20, 2012

I'm now very interested in: How do you load the data to be indexed, @rogerbraun? Thanks for any info!

Owner

floere commented Jul 1, 2012

Hi @rogerbraun, @BadMinus, @gokulj (@kschiess), I've just released Picky 4.5.0 – it uses the multi_json gem, and thus allows you to choose which library is used. For example, set MultiJson.use :yajl if you'd like to continue using Yajl. This is not a solution for the issue at hand (the problem is that Rails explicitly sets encodings), but it will definitely open up your options.
Thanks again, @rogerbraun, for entering this – I cannot pull it in as it is, but I will definitely add your contribution to the contributors wiki page!

floere closed this Jul 1, 2012

floere was assigned Dec 6, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment