Add StringScanner#read_char and #read_byte #11785

Open
Wants to merge 5 commits into base: master
Changes from 1 commit
28 changes: 28 additions & 0 deletions spec/std/string_scanner_spec.cr
@@ -274,3 +274,31 @@ describe StringScanner, "#terminate" do
    s.eos?.should eq(true)
  end
end

describe StringScanner, "#read_byte" do
  it "returns one byte from the current offset and advances the offset" do
    s = StringScanner.new("あ")
    s.read_byte.should eq "\xE3"
    s.offset.should eq 1
    s.read_byte.should eq "\x81"
    s.offset.should eq 2
    s.read_byte.should eq "\x82"
    s.offset.should eq 3
    s.read_byte.should be_nil
    s.offset.should eq 3
    s.eos?.should eq(true)
  end
end

describe StringScanner, "#read_char" do
  it "returns a char from the current offset and advances the offset" do
    s = StringScanner.new("ab")
    s.read_char.should eq "a"
    s.offset.should eq 1
    s.read_char.should eq "b"
    s.offset.should eq 2
    s.read_byte.should be_nil
    s.offset.should eq 2
    s.eos?.should eq(true)
  end
end
32 changes: 31 additions & 1 deletion src/string_scanner.cr
@@ -38,6 +38,8 @@
# * `#scan_until`
# * `#skip`
# * `#skip_until`
# * `#read_byte`
# * `#read_char`
#
# Methods that look ahead:
# * `#peek`
@@ -73,7 +75,7 @@ class StringScanner

  # Returns the current position of the scan offset.
  def offset : Int32
-    @str.byte_index_to_char_index(@byte_offset).not_nil!
+    @byte_offset
Contributor

Isn't this representative of the character offset, and not byte_offset?

Contributor Author

I think so, but in the current implementation it raises an error when #read_byte is called partway through a multibyte character and #offset is called afterwards.
I'm not sure whether this behavior is expected.

require "string_scanner"

s = StringScanner.new("あ")
s.read_byte
s.offset #=> Unhandled exception: Nil assertion failed (NilAssertionError)
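
For context, a minimal sketch of why the old `offset` body can fail here. It assumes the scanner's internal byte offset is 1 after a single `read_byte` on a three-byte character, and it calls `String#byte_index_to_char_index` directly on a plain string rather than going through `StringScanner`:

```crystal
str = "あ" # one character, three bytes in UTF-8

# Byte indices 0 and 3 sit on character boundaries, so they map to char indices.
p str.byte_index_to_char_index(0) # => 0
p str.byte_index_to_char_index(3) # => 1

# Byte index 1 falls inside the character, so there is no char index for it.
# Calling `.not_nil!` on this result is what raises NilAssertionError.
p str.byte_index_to_char_index(1) # => nil
```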

Member

Right, I think this change would be a breaking change, so in theory we can't do this.

However, I consider the existing definition of offset to be incorrect. offset should actually return the byte offset because that's more useful, and it's the only correct thing we can return if one can advance byte per byte. So we can consider this change a bugfix instead of a breaking change.

Contributor

Useful to who? Isn't it pretty useful to use the same index values when parsing Strings as the Strings themselves use when indexing with String#[]?
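
To make the two kinds of index concrete, a small sketch using only plain `String` methods (the sample string is an assumption chosen for illustration). `String#[]` takes character indices, while the byte-oriented methods take byte indices, and the two diverge as soon as the string contains a multibyte character:

```crystal
str = "あい" # two characters, six bytes in UTF-8

# Character index 1 is the second character.
p str[1] # => 'い'

# The same character starts at byte index 3.
p str.char_index_to_byte_index(1) # => 3

# Byte index 3 holds the first byte of 'い' (0xE3).
p str.byte_at(3) # => 227
```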

Contributor
@yxhuvud, Jan 31, 2022

Also, the suggested change is inconsistent with offset=, so if this change is wanted then that also needs to be updated.

  end

  # Tries to match with *pattern* at the current position. If there's a match,
@@ -280,6 +282,34 @@ class StringScanner
    @str.byte_slice(@byte_offset, @str.bytesize - @byte_offset)
  end

  # Returns one byte from current offset.
  # ```
  # require "string_scanner"
  #
  # s = StringScanner.new("あ")
  # s.read_byte # => "\xE3"
  # s.read_byte # => "\x81"
  # s.read_byte # => "\x82"
  # ```
  def read_byte : String?
Contributor

In what situation is this reasonable (rather than reading a full UTF-8 character)? The only situations I can think of are those where the data isn't actually a valid string, and I'd argue that in that case a solution working directly on Slice would be more appropriate.

Member

I also think that if we go with this, we should return UInt8?, not String?

But yes, it would be nice to know the actual use case for this.

Contributor Author

changed to return UInt8?.

    return nil if eos?
    s = @str.byte_slice(@byte_offset, 1)
    @byte_offset += 1
    s
  end

  # Returns one char from current offset.
  # ```
  # require "string_scanner"
  #
  # s = StringScanner.new("ab")
  # s.read_char # => "a"
  # s.read_char # => "b"
  # ```
  def read_char : String?
Contributor

I find it a bit unintuitive that read_char returns a String? and not a Char?.

Contributor Author

changed to return Char?.

    scan(/./)
Contributor

/./ might actually miss a few characters. The safest way is to do it in Crystal with something like String#char_bytesize_at.

Contributor Author

How about /./m? /./ misses newline characters, but in multiline mode it matches them as well. That also changes the behavior of ^ and $, but it doesn't matter here.
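
The two options discussed in this thread can be sketched as follows. The helper `read_char_at` is hypothetical and is not the PR's implementation; it uses `Char::Reader` as a stand-in for the suggested `String#char_bytesize_at` approach, on the assumption that decoding the character directly avoids depending on regex options at all:

```crystal
# `.` does not match a newline by default, but does with the `m` option.
p("\n" =~ /./)  # => nil
p("\n" =~ /./m) # => 0

# Hypothetical helper, not part of StringScanner: decodes the character that
# starts at `byte_offset` and returns it together with the next byte offset.
# Assumes `byte_offset` is on a character boundary.
def read_char_at(str : String, byte_offset : Int32) : {Char, Int32}?
  return nil if byte_offset >= str.bytesize

  reader = Char::Reader.new(str, pos: byte_offset)
  {reader.current_char, byte_offset + reader.current_char_width}
end

str = "a\nあ"
offset = 0
while result = read_char_at(str, offset)
  char, offset = result
  p char # prints 'a', then '\n', then 'あ'
end
```

A scanner method built this way would advance its byte offset by the decoded character's width, so newlines and multibyte characters are handled the same way as any other character.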

  end

  # Writes a representation of the scanner.
  #
  # Includes the current position of the offset, the total size of the string,