Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behaviour of IO#set_encoding has been changed as of Ruby 3.3 #4058

Open
ashie opened this issue Feb 16, 2023 · 8 comments
Open

Behaviour of IO#set_encoding has been changed as of Ruby 3.3 #4058

ashie opened this issue Feb 16, 2023 · 8 comments
Assignees
Labels

Comments

@ashie
Copy link
Member

ashie commented Feb 16, 2023

Describe the bug

2 tests of child_process helper always fail in Ruby 3.3.0-dev after https://redmine.ruby-lang.org/issues/18899 is landed.

===============================================================================================================================================================
Failure: test: can scrub characters without exceptions(ChildProcessTest)
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:531:in `block (2 levels) in <class:ChildProcessTest>'
     528:       end
     529:       sleep TEST_WAIT_INTERVAL_FOR_BLOCK_RUNNING until m.locked? || ran
     530:       m.lock
  => 531:       assert_equal Encoding.find('utf-8'), str.encoding
     532:       expected = "\xEF\xBF\xBD\xEF\xBF\xBD\x00\xEF\xBF\xBD\xEF\xBF\xBD".force_encoding("utf-8")
     533:       assert_equal expected, str
     534:       @d.stop; @d.shutdown; @d.close; @d.terminate
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:189:in `block in timeout'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `block in catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:198:in `timeout'
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:520:in `block in <class:ChildProcessTest>'
<#<Encoding:UTF-8>> expected but was
<#<Encoding:ASCII-8BIT>>

diff:
? #<Encoding:UTF  -8   >
?            ASCII  BIT 
?            ???  +++ 
===============================================================================================================================================================
F
===============================================================================================================================================================
Failure: test: can scrub characters without exceptions and replace specified chars(ChildProcessTest)
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:552:in `block (2 levels) in <class:ChildProcessTest>'
     549:       end
     550:       sleep TEST_WAIT_INTERVAL_FOR_BLOCK_RUNNING until m.locked? || ran
     551:       m.lock
  => 552:       assert_equal Encoding.find('utf-8'), str.encoding
     553:       expected = "??\x00??".force_encoding("utf-8")
     554:       assert_equal expected, str
     555:       @d.stop; @d.shutdown; @d.close; @d.terminate
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:189:in `block in timeout'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `block in catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:36:in `catch'
/home/aho/.rbenv/versions/3.3.0-dev/lib/ruby/3.3.0+0/timeout.rb:198:in `timeout'
/home/aho/Projects/Fluentd/fluentd/test/plugin_helper/test_child_process.rb:541:in `block in <class:ChildProcessTest>'
<#<Encoding:UTF-8>> expected but was
<#<Encoding:ASCII-8BIT>>

diff:
? #<Encoding:UTF  -8   >
?            ASCII  BIT 
?            ???  +++ 
===============================================================================================================================================================

To Reproduce

$ rbenv install 3.3.0-dev
$ rbenv local 3.3.0-dev
$ bundle exec rake test TEST=test/plugin_helper/test_child_process.rb

Expected behavior

These tests shouldn't fail.

Your Environment

- Fluentd version: 8bb38b5c5cb13d2d9446f94708940f83f56a4b6f
- TD Agent version: N/A
- Operating system: Ubuntu 22.04, Windows 10
- Kernel version: 5.15.0-60-generic

Your Configuration

N/A

Your Error Log

Described in `Describe the bug` section.

Additional context

No response

@ashie ashie self-assigned this Mar 13, 2023
@daipom daipom self-assigned this Mar 16, 2023
@ashie ashie added the bug label Mar 31, 2023
@daipom daipom changed the title 2 tests of child_process helper always fail in Ruby 3.3.0-dev CI: 2 tests of child_process helper always fail in Ruby 3.3.0-dev Apr 10, 2023
@daipom
Copy link
Contributor

daipom commented May 30, 2023

@ashie ashie changed the title CI: 2 tests of child_process helper always fail in Ruby 3.3.0-dev Behaviour of IO#set_encoding has been changed as of Ruby 3.3 Jan 10, 2024
@ashie
Copy link
Member Author

ashie commented Jan 10, 2024

After reading following articles, I've got the reason why these tests are failed on Ruby 3.3:

IO#set_encoding ignores the second argument (int_enc) when ASCII-8bit is specified as ext_enc as of Ruby 3.3.
I'm not sure but I'm now doubting reasonability of this Ruby's change because the documentation about Encoding Options implies possibility of transcoding ASCII-8BIT -> UTF-8 with invalid: :replace or undef: :replace.
I'll open a new issue at https://redmine.ruby-lang.org/issues then ask about it.
I'll describe the detail of this issue at there.

On the other hand, I'm now doubting the specification of Fluent::PluginHelper::ChildProcess#child_process_execute too.
Although Fluentd's default internal & external encoding are ASCII-8BIT (workers are executed by ruby -Eascii-8bit:ascii-8bit), Fluent::PluginHelper::ChildProcess#child_process_execute uses UTF-8 as default internal encoding. They seem inconsistent. It would be better that Fluentd doesn't modify incoming data implicitly by default. How to modify data should be determined by each plugins & users.

So I'm not sure how to resolve this issue for now.
I'm considering to mark these tests as pending until these problems are solved.

@daipom
Copy link
Contributor

daipom commented Jan 10, 2024

Thanks!

I'm considering to mark these tests as pending until these problems are solved.

I agree.

@ashie
Copy link
Member Author

ashie commented Mar 25, 2024

After further investigation, I found that probably it won't work as expected even on Ruby 3.2 or former.

test-set-conding.rb:

#!/usr/bin/env ruby
STDIN.set_encoding('ascii-8bit', 'utf-8', invalid: :replace, undef: :replace) 
str = STDIN.read
p str.encoding
p str
p str.force_encoding("ascii-8bit")

Results with the input string foo\xFFbar are following:

Ruby 3.2:

$ ruby -e 'STDOUT.write("foo\xFFbar")' | ruby test-set-conding.rb
#<Encoding:UTF-8>
"foo�bar"
"foo\xEF\xBF\xBDbar"
"foo�bar"

Ruby 3.3

$ ruby -e 'STDOUT.write("foo\xFFbar")' | ruby test-set-conding.rb
#<Encoding:ASCII-8BIT>
"foo\xFFbar"
"foo\xFFbar"
"foo\xFFbar"

Looking at this result alone, it appears to be working as expected on Ruby 3.2.

But when I test it with the input string あ\xFFい, the result seems questionable:

Ruby 3.2

$ ruby -e 'STDOUT.write("あ\xFFい")' | ruby test-set-conding.rb
#<Encoding:UTF-8>
"�������"
"\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD"
"�������"

Ruby 3.3

$ ruby -e 'STDOUT.write("あ\xFFい")' | ruby test-set-conding.rb
#<Encoding:ASCII-8BIT>
"\xE3\x81\x82\xFF\xE3\x81\x84"
"\xE3\x81\x82\xFF\xE3\x81\x84"
"あ\xFFい"

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

@ashie
Copy link
Member Author

ashie commented Apr 3, 2024

related issue: #4460

@ashie
Copy link
Member Author

ashie commented Apr 3, 2024

Now I think Ruby 3.3's change seems better than before.

But when I test it with the input string あ\xFFい, the result seems questionable:

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

This result on Ruby 3.2 isn't questionable, because :invalid and :undef are for internal_encoding, so that all non-ASCII characters should be replaced. But such conversion doesn't seem make sense in most cases for ASCII-8Bit (it means binary). It would be better not to modifier input data.

@daipom
Copy link
Contributor

daipom commented Apr 3, 2024

Now I think Ruby 3.3's change seems better than before.

But when I test it with the input string あ\xFFい, the result seems questionable:

Probably the current Fluentd's test expects あ�い for this byte sequence、but it replaces all bytes to even if the input string includes valid UTF-8 sequence in actual.

This result on Ruby 3.2 isn't questionable, because :invalid and :undef are for internal_encoding, so that all non-ASCII characters should be replaced. But such conversion doesn't seem make sense in most cases for ASCII-8Bit (it means binary). It would be better not to modifier input data.

I see!
Yes! I think so too!

@ashie
Copy link
Member Author

ashie commented Apr 3, 2024

- internal_encoding
+ external_encoding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: To-Do
Development

No branches or pull requests

2 participants