
forward error error=#<Encoding::UndefinedConversionError: "\xE6" from ASCII-8BIT to UTF-8> error_class=Encoding::UndefinedConversionError #31

Open · breath-co2 opened this issue Dec 4, 2015 · 7 comments

@breath-co2

I telnet to the server and send this data:

["test.abc",[[1449207484,{"cid":1,"time":1402497769,"name":"\u6155\u5bb9\u5fb7\u5eb7","ctime":1402110157}]],{"chunk":"a14492074850006"}]

Error info:

2015-12-04 13:38:04 +0800 [warn]: emit transaction failed: error_class=Encoding::UndefinedConversionError error="\"\\xE6\" from ASCII-8BIT to UTF-8" tag="test.abc"
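
For reference, the same payload can be sent without telnet; a minimal sketch in Ruby, assuming in_forward is listening on Fluentd's default port 24224 (the <source> block isn't shown in this report):

require 'socket'

# in_forward accepts newline-delimited JSON arrays as well as msgpack,
# which is what pasting the payload over telnet relies on.
payload = '["test.abc",[[1449207484,{"cid":1,"time":1402497769,"name":"\u6155\u5bb9\u5fb7\u5eb7","ctime":1402110157}]],{"chunk":"a14492074850006"}]'
TCPSocket.open("localhost", 24224) { |s| s.puts(payload) }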
@tagomoris
Member

Could you paste what you did, along with Fluentd's configuration? And do you have a stack trace for that error?

@selerite

I have the same problem. Let me show how it occurs.
Say I tail a file into HDFS using webhdfs.
td-agent.conf:

<source>
  type tail
  pos_file /home/lvjin/workspace/work/td-agent/pos_files/test_log.pos
  format json
  path /home/lvjin/workspace/work/log_producer/test_log/test_log.json
  tag test_log
</source>
<match test_log>
  type webhdfs
  host 192.168.1.245
  port 50070
  path /ehualu/logs/watch_log/watch_log.%Y%m%d.json
  output_include_time false
  output_include_tag false
  flush_interval 10s
</match>

Each line of the tailed file (test_log.json in this demo) is a UTF-8-encoded JSON object like:

{"close_time":"019:00","device_tags":[{"tag":"学习用品"},{"tag":"学校"},{"tag":"大屏"},{"tag":"学校"},{"tag":"汽车"},{"tag":"加油站"},{"tag":"教科书"},{"tag":"汽车"}],"start_time":"08:30","daily_h_traffic":7379,"device_size":"600*2000","device_intr":"位于CBD核心地带","screen_size":56,"device_ratio":"3:4","device_height":40,"visiable_angle":50,"device_resolution ":"1280*720","daily_car_traffic":5861,"visable_distance":10,"is_corner":false,"weekly_price":11452,"geo":[{"province":"安徽","city":"常州","coordinates":[{"lat":"12.96391","lon":"121.48462"}],"district":"锡山区"}],"id":1,"is_disturbed":true,"device_id":"09:2B:DD:6E:AD:F8"}

Here is the stack trace:

2015-12-10 09:15:28 +0800 [warn]: emit transaction failed: error_class=Encoding::UndefinedConversionError error="\"\\xE5\" from ASCII-8BIT to UTF-8" tag="test_log"
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `encode'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `to_json'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:85:in `stringify_record'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-mixin-plaintextformatter-0.2.6/lib/fluent/mixin/plaintextformatter.rb:115:in `format'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:551:in `block in emit'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:128:in `block in each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event.rb:127:in `each'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/output.rb:542:in `emit'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/event_router.rb:88:in `emit_stream'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:230:in `receive_lines'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:322:in `wrap_receive_lines'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:514:in `on_notify'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:347:in `on_notify'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `call'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:448:in `on_change'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
  2015-12-10 09:15:28 +0800 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.12/lib/fluent/plugin/in_tail.rb:215:in `run'

Additional info: Ruby 2.1.5, td-agent 2.2.x.

When I change the output from webhdfs to Kafka, the problem doesn't occur, and I can read the correct data from Kafka. So I suspect it is the webhdfs output plugin that causes this. I read the source code a moment ago, and I have nearly no knowledge of Ruby, but I was wondering: is msgpack needed here? If I am wrong, forgive my ignorance. Looking forward to your reply!
Sincerely!

@tagomoris
Member

Your data contains characters that are invalid for UTF-8 (JSON requires valid UTF-8 characters).
I could add an option to ignore and skip such records... but is that what you need?
Another option is to scrub the strings (convert invalid characters to '?').
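
For illustration, a minimal sketch of that scrubbing approach in plain Ruby (String#scrub exists since Ruby 2.1, which matches the td-agent 2 environment reported above):

require 'json'

# Tag the bytes as UTF-8, then replace any invalid sequences with '?'.
s = "valid \xE6 broken".force_encoding("UTF-8")
s.scrub("?")          # => "valid ? broken"
s.scrub("?").to_json  # now serializes without raising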

@selerite

Thanks for replying, but I can guarantee my data consists of valid UTF-8 characters. I can correctly output my data to a file with out_file, and to Kafka with out_kafka(_buffered), but not to HDFS with out_webhdfs. I compared the source code of out_file.rb and out_webhdfs.rb, and found this difference:
out_file.rb:

Plugin.new_formatter(@format)

out_webhdfs.rb:

include Fluent::Mixin::PlainTextFormatter

The two formatters are different, and the error appears exactly in Fluent::Mixin::PlainTextFormatter.
Could Fluent::Mixin::PlainTextFormatter be causing the error?

Sincerely

@tagomoris
Member

I know about Fluent::Mixin::PlainTextFormatter because it's also my product...
PlainTextFormatter uses Ruby's JSON module, while Fluentd's default formatter (used in out_file) uses Yajl (yajl-ruby). Yajl always ignores invalid UTF-8 characters.
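
A minimal repro sketch of that difference (the bytes "\xE6\x85\x95" are the UTF-8 encoding of "\u6155" from the data above, force-tagged as ASCII-8BIT because that is the encoding the error message reports at format time):

require 'json'
require 'yajl'

# The bytes are valid UTF-8, but the string is tagged as binary.
record = { "name" => "\xE6\x85\x95".force_encoding("ASCII-8BIT") }

Yajl.dump(record)  # no error: Yajl does no Ruby-level transcoding
record.to_json     # raises Encoding::UndefinedConversionError:
                   #   "\xE6" from ASCII-8BIT to UTF-8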

@selerite

I replaced Ruby's JSON module in the PlainTextFormatter with Yajl, and it works well:

record.to_json

replaced by

Yajl.dump(record)

I am wondering why you chose Ruby's JSON module instead of Yajl?
Another question: have you tested your out_webhdfs plugin with non-Latin data, such as Japanese or Chinese?

Anyway, it works well now, thanks for your help!

Sincerely,
a noob of Ruby

@btwood

btwood commented Mar 25, 2016

I'm also having this issue. It took me a while to figure out, but some of my raw logs contain escaped bytes like '\xAE'.
The character conversion in both out_forward and out_file happens correctly; this plugin is inconsistent with the others.
Replacing the bytes with "?" isn't really an option for me, because I expect any "garbage" to be propagated through my system.

It would seem that \xAE doesn't get padded or interpreted as \u00AE for some reason. Any single-byte character code above 0x7F seems to fail at the to_json conversion, possibly because a padded/wide character is expected and a single byte is given.

I get a warning in the logs relating to JSON::GeneratorError, and the record isn't emitted.
This is a problem, because I'm now missing records in Hadoop that would otherwise have been written to file.

Since \x00-\xFF map one-to-one onto U+0000-U+00FF in Latin-1, why not allow them as valid Unicode characters, as many other implementations do? I guess this is a bug in the Ruby JSON package then.
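
For what it's worth, the Latin-1 reading described above can be expressed directly in Ruby; a sketch of the mapping being asked for, not of anything the plugin currently does:

raw = "\xAE".force_encoding("ASCII-8BIT")
raw.encode("UTF-8")
# => raises Encoding::UndefinedConversionError (no mapping from binary)
raw.force_encoding("ISO-8859-1").encode("UTF-8")
# => "®", i.e. U+00AE: Latin-1 maps \x00-\xFF onto U+0000-U+00FF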

I'm looking into making the edit above, but at this point it may be faster for me to deploy Kafka to my cluster and use that instead.

I hope this can be resolved in a future td-agent package release. Since I'm using the rpm, I seem to be locked into this bug.

Is there a reason you don't use the same Yajl record writer as the other plugins?
