Add `StringScanner#read_char` and `#read_byte` #11785

Kanezoh · 2022-01-30T13:32:34Z

Please refer to #11259.

spec/std/string_scanner_spec.cr

src/string_scanner.cr

caspiano · 2022-01-30T23:21:22Z

src/string_scanner.cr

@@ -73,7 +75,7 @@ class StringScanner

  # Returns the current position of the scan offset.
  def offset : Int32
-    @str.byte_index_to_char_index(@byte_offset).not_nil!
+    @byte_offset


Isn't this representative of the character offset, and not byte_offset?

I think so, but in the current implementation it raises an error if you call #read_byte on a multibyte character and then call #offset.
I'm not sure whether this behavior is expected.

require "string_scanner"

s = StringScanner.new("あ")
s.read_byte
s.offset # => Unhandled exception: Nil assertion failed (NilAssertionError)

Right, I think this would be a breaking change, so in theory we can't do this.

However, I consider the existing definition of offset to be incorrect. offset should actually return the byte offset because that's more useful, and it's the only correct thing we can return if one can advance byte per byte. So we can consider this change a bugfix instead of a breaking change.

Useful to who? Isn't it pretty useful to use the same index values when parsing Strings as the Strings themselves use when indexing with String#[]?

Also, the suggested change is inconsistent with offset=, so if this change is wanted then that also needs to be updated.
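The disagreement above turns on the difference between character indices (what String#[] uses) and byte offsets (what the scanner tracks internally). A minimal sketch of how the two relate for a multibyte string, using only the String methods that already appear in the diff; the sample string is an assumption chosen for illustration:

```crystal
str = "あa" # "あ" occupies 3 bytes in UTF-8, "a" occupies 1

chars = str.size     # number of characters: 2
bytes = str.bytesize # number of bytes: 4

# Character index -> byte index
byte_index = str.char_index_to_byte_index(1) # => 3

# Byte index -> character index
char_index = str.byte_index_to_char_index(3) # => 1

# A byte index that falls inside a multibyte character has no character index.
# This is the case that makes `not_nil!` raise after `read_byte`.
inside = str.byte_index_to_char_index(1) # => nil
```

Returning the byte offset from #offset avoids the nil case, at the cost of no longer matching the indices accepted by String#[].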

yxhuvud · 2022-01-31T10:25:44Z

src/string_scanner.cr

+  # s.read_char # => "a"
+  # s.read_char # => "b"
+  # ```
+  def read_char : String?


I find it a bit unintuitive that read_char returns a String? and not a Char?.

changed to return Char?.

yxhuvud · 2022-01-31T10:33:57Z

src/string_scanner.cr

+  # s.read_byte # => "\x81"
+  # s.read_byte # => "\x82"
+  # ```
+  def read_byte : String?


In what situation is this reasonable (rather than reading a full UTF-8 character)? The only situation I can think of is where the data isn't actually a valid string, and I'd argue that in that case a solution working directly on Slice would be more appropriate.

I also think that if we go with this, we should return UInt8?, not String?

But yes, it would be nice to know the actual use case for this.

changed to return UInt8?.
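For comparison, a small sketch of reading individual bytes without a scanner, either through String#byte_at? or by working on the underlying Slice as suggested above. It illustrates the UInt8? return shape discussed here and is not the code from the PR:

```crystal
str = "あ" # encoded as the bytes 0xE3 0x81 0x82

# `byte_at?` returns a UInt8, or nil when the index is out of range.
first = str.byte_at?(0)   # => 227 (0xE3)
missing = str.byte_at?(3) # => nil

# Working directly on the bytes as a read-only Slice(UInt8).
bytes = str.to_slice
second = bytes[1] # => 129 (0x81)
third = bytes[2]  # => 130 (0x82)
```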

HertzDevil · 2022-01-31T13:24:07Z

src/string_scanner.cr

+  # s.read_char # => "b"
+  # ```
+  def read_char : String?
+    scan(/./)


/./ might actually miss a few characters. The safest way is to do it in Crystal with something like String#char_bytesize_at.

How about /./m? /./ does not match newline characters, but in multiline mode it does. Multiline mode also changes the behavior of ^ and $, but that doesn't matter here.
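The newline behavior mentioned above can be checked directly. A minimal sketch; the variable names are only for illustration:

```crystal
plain = /./
multiline = /./m

# Without the `m` flag, `.` does not match a newline.
plain_match = plain.match("\n") # => nil

# With the `m` flag, `.` matches a newline as well.
multiline_match = multiline.match("\n").try &.[0] # => "\n"
```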

asterite · 2022-02-06T23:22:21Z

src/string_scanner.cr

@@ -68,12 +70,12 @@ class StringScanner
  # Sets the *position* of the scan offset.
  def offset=(position : Int)
    raise IndexError.new unless position >= 0
-    @byte_offset = @str.char_index_to_byte_index(position) || @str.bytesize
+    @byte_offset = [position, @str.bytesize].min


Please use Math.min or a tuple instead of an array.
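A short sketch of the alternatives named in this comment. All three produce the same value; the array literal additionally allocates an Array on the heap, which is presumably the reason for the request. The values of `position` and `bytesize` are assumptions for illustration:

```crystal
position = 10
bytesize = "あ".bytesize # 3

with_math = Math.min(position, bytesize) # => 3
with_tuple = {position, bytesize}.min    # => 3
with_array = [position, bytesize].min    # => 3, but allocates an Array
```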

asterite · 2022-02-06T23:22:42Z

src/string_scanner.cr

+  # s.read_char # => 'b'
+  # ```
+  def read_char : Char?
+    scan(/./m).try &.[0]


Char::Reader should be used here
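A minimal sketch of what reading one character at a byte offset with Char::Reader could look like. The helper `char_at_byte_offset` is hypothetical and is not the implementation merged in this PR; it only illustrates the approach, including the fact that newlines need no special handling:

```crystal
# Hypothetical helper: returns the character starting at `byte_offset`,
# or nil when the offset is at or past the end of the string.
def char_at_byte_offset(str : String, byte_offset : Int32) : Char?
  return nil if byte_offset >= str.bytesize
  Char::Reader.new(str, byte_offset).current_char
end

str = "あ\nb"
offset = 0
chars = [] of Char

# Advance by each character's UTF-8 byte size, as the scanner would.
while char = char_at_byte_offset(str, offset)
  chars << char
  offset += char.bytesize
end
```

After the loop, `chars` holds the three characters of the string in order and `offset` equals `str.bytesize`.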

Sija · 2022-02-07T04:35:36Z

src/string_scanner.cr

+    s = @str.byte_at(@byte_offset)
+    @byte_offset += 1
+    s


Avoid using one letter variable names, please.

Suggested change

-    s = @str.byte_at(@byte_offset)
-    @byte_offset += 1
-    s
+    byte = @str.byte_at(@byte_offset)
+    @byte_offset += 1
+    byte

OK, I corrected it.

Sija · 2022-02-07T04:35:52Z

src/string_scanner.cr

+    c = reader.current_char
+    @byte_offset += c.bytesize
+    c


ditto

Suggested change

-    c = reader.current_char
-    @byte_offset += c.bytesize
-    c
+    char = reader.current_char
+    @byte_offset += char.bytesize
+    char

add #read_char and #read_byte

d276f5f

Kanezoh mentioned this pull request Jan 30, 2022

Add missing StringScanner methods #11259

Open

Blacksmoke16 added kind:feature topic:stdlib:text labels Jan 30, 2022

straight-shoota changed the title ~~add #read_char and #read_byte~~ Add StringScanner#read_char and #read_byte Jan 30, 2022

caspiano reviewed Jan 30, 2022


fix comment

96a8fd3

yxhuvud reviewed Jan 31, 2022


HertzDevil reviewed Jan 31, 2022


change return type

45d0506

Kanezoh requested review from asterite and HertzDevil February 6, 2022 14:26

asterite reviewed Feb 6, 2022


use Math.min and Char::Reader

ac6bde4

Kanezoh requested a review from asterite February 7, 2022 01:51

Sija reviewed Feb 7, 2022


avoid one letter variable names

1204472


