[WIP] Fix encoding bugs and add tests to String::center
#1731
Open: lopopolo wants to merge 5 commits into trunk from lopopolo/gh-1634-followup
lopopolo added labels on Mar 11, 2022:
A-ruby-core (Area: Ruby Core types.)
A-spec (Area: ruby/spec infrastructure and completeness.)
S-wip (Status: This pull request is a work in progress.)
C-bug (Category: This is a bug.)
lopopolo force-pushed the lopopolo/gh-1634-followup branch from e79bc65 to d69db32 (March 31, 2022 02:03)
lopopolo force-pushed the lopopolo/gh-1634-followup branch from d69db32 to 70af0c9 (June 27, 2022 00:20)
Some more tests:
$ irb
[3.1.2] > utf8 = "abc"
=> "abc"
[3.1.2] > utf8.encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb = "你好"
=> "你好"
[3.1.2] > utf8mb.encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii = "xyz".force_encoding(Encoding::ASCII)
=> "xyz"
[3.1.2] > ascii.encoding
=> #<Encoding:US-ASCII>
[3.1.2] > binary = "\xFF\xFE".b
=> "\xFF\xFE"
[3.1.2] > binary.encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > utf8.center(10, utf8)
=> "abcabcabca"
[3.1.2] > utf8.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, utf8mb)
=> "你好你abc你好你好"
[3.1.2] > utf8.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, ascii)
=> "xyzabcxyzx"
[3.1.2] > utf8.center(10, ascii).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8.center(10, binary)
=> "\xFF\xFE\xFFabc\xFF\xFE\xFF\xFE"
[3.1.2] > utf8.center(10, binary).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > utf8.center(10, utf8mb.b)
=> "\xE4\xBD\xA0abc\xE4\xBD\xA0\xE5"
[3.1.2] > utf8.center(10, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(10, utf8)
=> "abcxyzabca"
[3.1.2] > ascii.center(10, utf8).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > ascii.center(10, utf8mb)
=> "你好你xyz你好你好"
[3.1.2] > ascii.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii.center(10, ascii)
=> "xyzxyzxyzx"
[3.1.2] > ascii.center(10, ascii).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > ascii.center(10, binary)
=> "\xFF\xFE\xFFxyz\xFF\xFE\xFF\xFE"
[3.1.2] > ascii.center(10, binary).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(10, ascii.b)
=> "xyzxyzxyzx"
[3.1.2] > ascii.center(10, ascii.b).encoding
=> #<Encoding:US-ASCII>
[3.1.2] > utf8.center(10, ascii.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > ascii.center(11, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(12, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(13, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(14, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(16, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(15, utf8mb.b).encoding
=> #<Encoding:ASCII-8BIT>
[3.1.2] > ascii.center(15, utf8mb.b)
=> "\xE4\xBD\xA0\xE5\xA5\xBDxyz\xE4\xBD\xA0\xE5\xA5\xBD"
[3.1.2] > "喜欢".center 5, "打球".b
(irb):37:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):37:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".center 10, "打球".b
(irb):38:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):38:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".center 10, utf8mb.b
(irb):39:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):39:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > "喜欢".encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8)
=> "abca你好abca"
[3.1.2] > utf8mb.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8.b)
=> "abca你好abca"
[3.1.2] > utf8mb.center(10, utf8.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8mb)
=> "你好你好你好你好你好"
[3.1.2] > utf8mb.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, utf8mb.b)
(irb):47:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):47:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > utf8mb.center(10, ascii)
=> "xyzx你好xyzx"
[3.1.2] > utf8mb.center(10, ascii).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, ascii.b)
=> "xyzx你好xyzx"
[3.1.2] > utf8mb.center(10, ascii.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8mb.center(10, binary)
(irb):52:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):52:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
[3.1.2] > utf8invalid = "\xFF"
=> "\xFF"
[3.1.2] > utf8invalid.encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8)
=> "abca\xFFabcab"
[3.1.2] > utf8invalid.center(10, utf8).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8.b).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8mb)
=> "你好你好\xFF你好你好你"
[3.1.2] > utf8invalid.center(10, utf8mb).encoding
=> #<Encoding:UTF-8>
[3.1.2] > utf8invalid.center(10, utf8mb.b)
(irb):60:in `center': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from (irb):60:in `<main>'
from /usr/local/var/rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/irb-1.4.1/exe/irb:11:in `<top (required)>'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `load'
from /usr/local/var/rbenv/versions/3.1.2/bin/irb:25:in `<main>'
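The outcomes in the session above look consistent with what `Encoding.compatible?` reports for each receiver/padding pair. The following is a minimal sketch of that comparison, assuming MRI (CRuby) and a UTF-8 source file. The `center_encoding` helper and the pair names are hypothetical and exist only for this illustration.

```ruby
# Assumes MRI (CRuby) with the default UTF-8 source encoding.
utf8   = "abc"
utf8mb = "你好"
ascii  = "xyz".dup.force_encoding(Encoding::US_ASCII)
binary = "\xFF\xFE".b

# Hypothetical helper: the encoding of the centered string, or nil when
# String#center raises Encoding::CompatibilityError.
def center_encoding(receiver, padding)
  receiver.center(10, padding).encoding
rescue Encoding::CompatibilityError
  nil
end

# Receiver/padding pairs taken from the irb session.
pairs = {
  "utf8/utf8"       => [utf8, utf8],
  "utf8/utf8mb"     => [utf8, utf8mb],
  "utf8/ascii"      => [utf8, ascii],
  "utf8/binary"     => [utf8, binary],
  "ascii/utf8"      => [ascii, utf8],
  "ascii/utf8mb"    => [ascii, utf8mb],
  "ascii/binary"    => [ascii, binary],
  "utf8mb/utf8.b"   => [utf8mb, utf8.b],
  "utf8mb/binary"   => [utf8mb, binary],
  "utf8mb/utf8mb.b" => [utf8mb, utf8mb.b],
}

# What String#center actually produces for each pair.
results = pairs.transform_values { |(recv, pad)| center_encoding(recv, pad) }
# What Encoding.compatible? reports for the same pair (nil if incompatible).
predicted = pairs.transform_values { |(recv, pad)| Encoding.compatible?(recv, pad) }

results.each do |name, enc|
  puts "#{name}: center=#{enc.inspect} compatible?=#{predicted[name].inspect}"
end
```

If the two hashes agree for every pair, the encoding negotiation could be specified in terms of that compatibility check rather than a fixed table of encoding pairs. That is an inference from these observations, not something the PR states.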
lopopolo force-pushed the lopopolo/gh-1634-followup branch from 70af0c9 to 973b81f (May 17, 2023 22:01)
This is necessary but has been stale for a while. Just pushed up a rebase to latest trunk (while there are still no conflicts 😅).
Followup to #1634 and also a big bugfix to String#center.
String#center is an encoding-aware API. Both the receiver and the padding need to be one of:
(Encoding::Utf8, Encoding::Utf8)
(Encoding::Ascii | Encoding::Binary, Encoding::Ascii | Encoding::Binary)
This means that character counts are encoding-aware. This is in contrast to the current implementation which treats padding as raw bytes.
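To make the character-versus-byte distinction concrete, here is a small sketch of how MRI (CRuby) treats the same padding bytes under two encodings. It assumes a UTF-8 source file, and the variable names are illustrative only.

```ruby
# Assumes MRI (CRuby) with the default UTF-8 source encoding.
pad = "你好" # 2 characters, 6 bytes in UTF-8

# UTF-8 padding is repeated character by character.
by_chars = "abc".center(10, pad)

# The same bytes tagged as binary are repeated byte by byte, which can
# cut a multi-byte sequence in half.
by_bytes = "abc".center(10, pad.b)

puts by_chars.length    # 10 characters
puts by_chars.bytesize  # 24 bytes: 7 padding characters * 3 bytes + "abc"
puts by_bytes.bytesize  # 10 bytes
puts by_bytes.encoding
puts by_bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding? # false
```

An implementation that always pads byte by byte would produce the second result even for UTF-8 padding, which appears to be the class of bug this PR targets.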
For example, consider the difference between MRI and Artichoke for these UTF-8 combinations:
MRI:
Artichoke:
Additional tests and comparisons with MRI have revealed that the given padding width is the desired width of the returned string, not the number of padding bytes to add.
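The width semantics described above can be sketched in plain Ruby. This is a hypothetical illustration, not Artichoke's or MRI's implementation: `center_sketch` is an invented name, and it assumes the receiver and the padding already share a compatible encoding, so it does no encoding negotiation.

```ruby
# Hypothetical sketch, not Artichoke's or MRI's implementation. It assumes
# the receiver and the padding already share a compatible encoding.
def center_sketch(str, width, pad = " ")
  raise ArgumentError, "zero width padding" if pad.empty?

  # `width` is the target length of the result, counted in characters.
  missing = width - str.length
  return str.dup if missing <= 0

  left  = missing / 2        # when the split is uneven, the left side is shorter
  right = missing - left
  pad_chars = pad.chars
  # Repeat the padding character by character until n characters are produced.
  fill = ->(n) { Array.new(n) { |i| pad_chars[i % pad_chars.length] }.join }

  fill.call(left) + str + fill.call(right)
end

puts center_sketch("abc", 10, "你好") # 你好你abc你好你好
puts center_sketch("你好", 10, "abc") # abca你好abca
```

Both outputs match the MRI results recorded for `utf8.center(10, utf8mb)` and `utf8mb.center(10, utf8)` in the irb session in this thread.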