Add StringScanner#read_char and #read_byte #11785

Open
wants to merge 5 commits into
base: master

Conversation

Kanezoh
Contributor

@Kanezoh Kanezoh commented Jan 30, 2022

Please refer to #11259.

@straight-shoota straight-shoota changed the title add #read_char and #read_byte Add StringScanner#read_char and #read_byte Jan 30, 2022
@@ -73,7 +75,7 @@ class StringScanner

   # Returns the current position of the scan offset.
   def offset : Int32
-    @str.byte_index_to_char_index(@byte_offset).not_nil!
+    @byte_offset
Contributor

Isn't this representative of the character offset, and not byte_offset?

Contributor Author

I think so, but in the current implementation, calling #read_byte on a multibyte character and then calling #offset raises an error.
I'm not sure whether this behavior is expected.

require "string_scanner"

s = StringScanner.new("あ")
s.read_byte
s.offset #=> Unhandled exception: Nil assertion failed (NilAssertionError)

Member

Right, I think this change would be a breaking change, so in theory we can't do this.

However, I consider the existing definition of offset to be incorrect. offset should actually return the byte offset because that's more useful, and it's the only correct thing we can return if one can advance byte per byte. So we can consider this change a bugfix instead of a breaking change.

Contributor

Useful to who? Isn't it pretty useful to use the same index values when parsing Strings as the Strings themselves use when indexing with String#[]?

Contributor

@yxhuvud yxhuvud Jan 31, 2022

Also, the suggested change is inconsistent with offset=, so if this change is wanted then that also needs to be updated.
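
The difference between the two kinds of offset discussed above can be seen on a plain String, without StringScanner. This is a minimal sketch, assuming a two-character sample string chosen only for illustration; it shows that a byte index in the middle of a multibyte character has no corresponding character index, which is why the nil assertion in the original `offset` can fail.

```crystal
str = "あい" # two characters, three bytes each in UTF-8

# Character index -> byte index is defined for every valid character index.
byte_index = str.char_index_to_byte_index(1)

# Byte index -> character index is only defined on character boundaries.
on_boundary = str.byte_index_to_char_index(3)
mid_character = str.byte_index_to_char_index(1)

p byte_index    # => 3
p on_boundary   # => 1
p mid_character # => nil
```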

# s.read_char # => "a"
# s.read_char # => "b"
# ```
def read_char : String?
Contributor

I find it a bit unintuitive that read_char returns a String? and not a Char?.

Contributor Author

Changed to return Char?.

# s.read_byte # => "\x81"
# s.read_byte # => "\x82"
# ```
def read_byte : String?
Contributor

In what situation is this reasonable (rather than reading a full UTF-8 character)? The only situation I can think of is one where the data isn't actually a valid string, and I'd argue that in that case a solution working directly on Slice would be more appropriate.

Member

I also think that if we go with this, we should return UInt8?, not String?

But yes, it would be nice to know the actual use case for this.

Contributor Author

Changed to return UInt8?.
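
As a rough illustration of what a UInt8?-returning byte read looks like, here is a minimal sketch built on `String#byte_at?`. The helper `read_byte_at` is hypothetical and exists only for this example; it is not the PR's implementation and does not track scanner state.

```crystal
# Hypothetical helper for illustration; not part of StringScanner.
def read_byte_at(str : String, byte_offset : Int32) : UInt8?
  # byte_at? returns nil instead of raising when the index is out of bounds.
  str.byte_at?(byte_offset)
end

str = "あ" # encoded as the three bytes 0xE3 0x81 0x82

p read_byte_at(str, 0) # => 227
p read_byte_at(str, 1) # => 129
p read_byte_at(str, 3) # => nil
```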

# s.read_char # => "b"
# ```
def read_char : String?
scan(/./)
Contributor

/./ might actually miss a few characters. The safest way is to do it in Crystal with something like String#char_bytesize_at.

Contributor Author

How about /./m? /./ does not match newline characters, but in multiline mode it does. Multiline mode also changes the behavior of ^ and $, but that doesn't matter here.

@@ -68,12 +70,12 @@ class StringScanner
   # Sets the *position* of the scan offset.
   def offset=(position : Int)
     raise IndexError.new unless position >= 0
-    @byte_offset = @str.char_index_to_byte_index(position) || @str.bytesize
+    @byte_offset = [position, @str.bytesize].min
Member

Please use Math.min or a tuple instead of an array.
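
For reference, both suggested forms clamp the value without building an Array, which would be allocated on the heap. This is a small sketch with made-up local values standing in for `position` and `@str.bytesize`.

```crystal
bytesize = 5 # stand-in for @str.bytesize
position = 8 # stand-in for the requested position

clamped_with_math = Math.min(position, bytesize)
clamped_with_tuple = {position, bytesize}.min

p clamped_with_math  # => 5
p clamped_with_tuple # => 5
```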

# s.read_char # => 'b'
# ```
def read_char : Char?
scan(/./m).try &.[0]
Member

Char::Reader should be used here.
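
A minimal sketch of reading one character at a byte offset with Char::Reader, which also handles newlines without any regex flags. The helper `read_char_at` is hypothetical and stands alone for illustration; the PR's actual method lives inside StringScanner and updates `@byte_offset` instead of returning the new offset.

```crystal
# Hypothetical standalone helper; not the PR's implementation.
# Returns the character at byte_offset and the offset just past it,
# or nil when byte_offset is at or beyond the end of the string.
def read_char_at(str : String, byte_offset : Int32) : Tuple(Char, Int32)?
  return nil if byte_offset >= str.bytesize

  reader = Char::Reader.new(str, byte_offset)
  {reader.current_char, byte_offset + reader.current_char_width}
end

str = "a\nあ"
offset = 0
chars = [] of Char

while result = read_char_at(str, offset)
  char, offset = result
  chars << char
end

p chars  # => ['a', '\n', 'あ']
p offset # => 5
```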

@Kanezoh Kanezoh requested a review from asterite February 7, 2022 01:51
Comment on lines 296 to 298
s = @str.byte_at(@byte_offset)
@byte_offset += 1
s
Contributor

Avoid using one letter variable names, please.

Suggested change
- s = @str.byte_at(@byte_offset)
- @byte_offset += 1
- s
+ byte = @str.byte_at(@byte_offset)
+ @byte_offset += 1
+ byte

Contributor Author

OK, I corrected it.

Comment on lines 311 to 313
c = reader.current_char
@byte_offset += c.bytesize
c
Contributor

Ditto.

Suggested change
- c = reader.current_char
- @byte_offset += c.bytesize
- c
+ char = reader.current_char
+ @byte_offset += char.bytesize
+ char

7 participants