New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add StringScanner#read_char
and #read_byte
#11785
base: master
Are you sure you want to change the base?
Conversation
StringScanner#read_char
and #read_byte
@@ -73,7 +75,7 @@ class StringScanner | |||
|
|||
# Returns the current position of the scan offset. | |||
def offset : Int32 | |||
@str.byte_index_to_char_index(@byte_offset).not_nil! | |||
@byte_offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this representative of the character offset, and not byte_offset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, but it raises error when calling #read_byte
to a multibyte character then calling #offset
in the current implementation.
I concern this behavior is expected or not.
require "string_scanner"
s = StringScanner.new("あ")
s.read_byte
s.offset #=> Unhandled exception: Nil assertion failed (NilAssertionError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think this change would be a breaking change so in theory we can'd do this.
However, I consider the existing definition of offset
to be incorrect. offset
should actually return the byte offset because that's more useful, and it's the only correct thing we can return if one can advance byte per byte. So we can consider this change a bugfix instead of a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useful to who? Isn't it pretty useful to use the same index values when parsing String
s as the String
s themselves use when indexing with String#[]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the suggested change is inconsistent with offset=
, so if this change is wanted then that also needs to be updated.
src/string_scanner.cr
Outdated
# s.read_char # => "a" | ||
# s.read_char # => "b" | ||
# ``` | ||
def read_char : String? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit unintuitive that read_char
returns a String?
and not a Char?
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to return Char?
.
src/string_scanner.cr
Outdated
# s.read_byte # => "\x81" | ||
# s.read_byte # => "\x82" | ||
# ``` | ||
def read_byte : String? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what situation is this something that is reasonable (rather than reading a full utf8 character)? The only situations I can think of is where the data isn't actually a valid string, and I'd argue that if that is the case a solution working directly on Slice
would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that if we go with this, we should return UInt8?
, not String?
But yes, it would be nice to know the actual use case for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to return UInt8?
.
src/string_scanner.cr
Outdated
# s.read_char # => "b" | ||
# ``` | ||
def read_char : String? | ||
scan(/./) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/./
might actually miss a few characters. The safest way is to do it in Crystal with something like String#char_bytesize_at
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about /./m
? /./
misses newline characters, so changing it multiline mode, it detects newline characters. It also changes behaviors of ^
and $
, but it doesn't matter.
src/string_scanner.cr
Outdated
@@ -68,12 +70,12 @@ class StringScanner | |||
# Sets the *position* of the scan offset. | |||
def offset=(position : Int) | |||
raise IndexError.new unless position >= 0 | |||
@byte_offset = @str.char_index_to_byte_index(position) || @str.bytesize | |||
@byte_offset = [position, @str.bytesize].min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Math.min or tuple instead of array
src/string_scanner.cr
Outdated
# s.read_char # => 'b' | ||
# ``` | ||
def read_char : Char? | ||
scan(/./m).try &.[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Char::Reader should be used here
src/string_scanner.cr
Outdated
s = @str.byte_at(@byte_offset) | ||
@byte_offset += 1 | ||
s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using one letter variable names, please.
s = @str.byte_at(@byte_offset) | |
@byte_offset += 1 | |
s | |
byte = @str.byte_at(@byte_offset) | |
@byte_offset += 1 | |
byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I corrected it.
src/string_scanner.cr
Outdated
c = reader.current_char | ||
@byte_offset += c.bytesize | ||
c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
c = reader.current_char | |
@byte_offset += c.bytesize | |
c | |
char = reader.current_char | |
@byte_offset += c.bytesize | |
char |
please refer #11259