Documentation: BufferedTokenizer does not work with longer delimiters #338


I agree with the documentation update. This was quite frustrating for me as I was doing some stress tests.

stakach commented Jun 30, 2012

Well, it can take anything that String#split can take, so it accepts regular expressions too.

With longer delimiters, the boundary of a received packet can fall inside the delimiter itself. This breaks the current implementation.

stakach commented Jun 30, 2012

?? Isn't that the point of buffering? Always works for me!

def receive_data(data)
  @buf ||= BufferedTokenizer.new("\r\n")
  tokens = @buf.extract(data)   # complete messages, split on "\r\n"
  tokens.each { |item| process_response(item) }
end

def process_response(item)
  # process buffered tokens here
end

irb(main):005:0> t = BufferedTokenizer.new("xx")
=> #<BufferedTokenizer:0x000001008caa90 @delimiter="xx", @size_limit=nil, @input=[], @input_size=0>
irb(main):006:0> t.extract("x")
=> []
irb(main):007:0> t.extract("x")
=> []
irb(main):008:0> t.extract("x")
=> []

stakach commented Jun 30, 2012

That is desired behaviour.

Try:

irb(main):005:0> t = BufferedTokenizer.new("xx")
=> #<BufferedTokenizer:0x000001008caa90 @delimiter="xx", @size_limit=nil, @input=[], @input_size=0>
irb(main):006:0> t.extract("abx")
=> []
irb(main):007:0> t.extract("xcd")
=> ["ab"]
irb(main):008:0> t.extract("xx")
=> ["cd"]

I still see this:

irb(main):003:0> t = BufferedTokenizer.new("xx")
=> #<BufferedTokenizer:0x00000102037048 @delimiter="xx", @size_limit=nil, @input=[], @input_size=0>
irb(main):004:0> t.extract("abx")
=> []
irb(main):005:0> t.extract("xcd")
=> []
irb(main):006:0> EventMachine::VERSION
=> "0.12.10"
irb(main):120:0> RUBY_VERSION
=> "1.9.3"

stakach commented Jun 30, 2012

Install it this way: gem install eventmachine --pre

Should be version: 1.0.0.RC.4

Example doesn't work:

irb(main):010:0> t = BufferedTokenizer.new("xx")
=> #<BufferedTokenizer:0x00000001d786f0 @delimiter="xx", @size_limit=nil, @input=[], @input_size=0>
irb(main):011:0> t.extract("abx")
=> []
irb(main):012:0> t.extract("xcd")
=> []
irb(main):013:0> EM::VERSION
=> "1.0.0.rc.4"
irb(main):014:0> RUBY_VERSION
=> "1.9.3"

Note that the BufferedTokenizer doesn't tokenize the joined buffer:

irb(main):047:0> t = BufferedTokenizer.new("xx")
=> #<BufferedTokenizer:0x00000002590680 @delimiter="xx", @size_limit=nil, @input=[], @input_size=0>
irb(main):050:0> t.extract "abx"
=> []
irb(main):051:0> t.extract "xcd"
=> []
irb(main):052:0> t
=> #<BufferedTokenizer:0x00000002590680 @delimiter="xx", @size_limit=nil, @input=["abx", "xcd"], @input_size=0>
irb(main):053:0> t.extract "xx"
=> ["abxxcd"]

So there are two solutions I can think of: either we update BufferedTokenizer#extract to tokenize the joined buffer, or we update the docs. I'd prefer the former, and wouldn't mind sending a pull request later.
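For what it's worth, the first option could look roughly like this. This is a minimal standalone sketch, not EventMachine's actual implementation; the class name JoinedBufferTokenizer is made up for illustration:

```ruby
# Sketch of a tokenizer that re-scans the joined buffer on every call,
# so a delimiter split across two extract() calls is still found.
# Illustrative only -- not EventMachine's real BufferedTokenizer.
class JoinedBufferTokenizer
  def initialize(delimiter)
    @delimiter = delimiter
    @buffer = ""
  end

  def extract(data)
    @buffer << data
    # Limit of -1 keeps trailing empty fields, so a buffer ending
    # exactly on the delimiter leaves an empty remainder behind.
    tokens = @buffer.split(@delimiter, -1)
    # The last element is the unterminated remainder; keep it buffered.
    @buffer = tokens.pop
    tokens
  end
end
```

With this approach the failing case above behaves as expected: extract("abx") returns [], and extract("xcd") returns ["ab"], because the joined buffer "abxxcd" is re-split as a whole.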

stakach commented Jul 1, 2012

That is a little disturbing.
I guess we need to write a test that describes the scenario and work on a patch.

In my experience, multi-character delimiters are quite common; HTTP's header terminator "\r\n\r\n" comes to mind.
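To make that concrete, here's a tiny illustration of a "\r\n\r\n" terminator straddling a simulated TCP read boundary (the response string and split point are made up for the example):

```ruby
# A header terminator "\r\n\r\n" split across two simulated TCP reads.
response = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\nbody"
split_at = response.index("\r\n\r\n") + 3   # boundary lands inside the delimiter
packet1  = response[0...split_at]           # ends with "...\r\n\r"
packet2  = response[split_at..-1]           # begins with "\nbody"

packet1.include?("\r\n\r\n")                # => false
packet2.include?("\r\n\r\n")                # => false
(packet1 + packet2).include?("\r\n\r\n")    # => true
```

Scanning each packet on its own never sees the full delimiter; only the joined buffer does, which is exactly the case the current extract misses.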

@stakach Wrote up a preliminary commit here with the fix: #342

sodabrew commented Feb 3, 2015

Resolved by #547

@sodabrew sodabrew closed this Feb 3, 2015

@sodabrew sodabrew added this to the v1.0.6 milestone Feb 3, 2015
