Merge pull request #547 from sferik/buftok
Update buftok.rb
sodabrew committed Feb 3, 2015
2 parents 88fda90 + e385d0a commit cc7af74
Showing 3 changed files with 36 additions and 89 deletions.
119 changes: 34 additions & 85 deletions lib/em/buftok.rb
@@ -1,110 +1,59 @@
 # BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
 # by default. It allows input to be spoon-fed from some outside source which
 # receives arbitrary length datagrams which may-or-may-not contain the token
-# by which entities are delimited.
-#
-# By default, new BufferedTokenizers will operate on lines delimited by "\n" by default
-# or allow you to specify any delimiter token you so choose, which will then
-# be used by String#split to tokenize the input data
-#
-# @example Using BufferedTokernizer to parse lines out of incoming data
-#
-#   module LineBufferedConnection
-#     def receive_data(data)
-#       (@buffer ||= BufferedTokenizer.new).extract(data).each do |line|
-#         receive_line(line)
-#       end
-#     end
-#   end
-#
-# @author Tony Arcieri
-# @author Martin Emde
+# by which entities are delimited. In this respect it's ideally paired with
+# something like EventMachine (http://rubyeventmachine.com/).
 class BufferedTokenizer
-  # @param [String] delimiter
-  # @param [Integer] size_limit
-  def initialize(delimiter = "\n", size_limit = nil)
-    @delimiter = delimiter
-    @size_limit = size_limit
-
-    # The input buffer is stored as an array. This is by far the most efficient
-    # approach given language constraints (in C a linked list would be a more
-    # appropriate data structure). Segments of input data are stored in a list
-    # which is only joined when a token is reached, substantially reducing the
-    # number of objects required for the operation.
+  # New BufferedTokenizers will operate on lines delimited by a delimiter,
+  # which is by default the global input delimiter $/ ("\n").
+  #
+  # The input buffer is stored as an array. This is by far the most efficient
+  # approach given language constraints (in C a linked list would be a more
+  # appropriate data structure). Segments of input data are stored in a list
+  # which is only joined when a token is reached, substantially reducing the
+  # number of objects required for the operation.
+  def initialize(delimiter = $/)
+    @delimiter = delimiter
     @input = []
-
-    # Size of the input buffer
-    @input_size = 0
+    @tail = ''
+    @trim = @delimiter.length - 1
   end

   # Extract takes an arbitrary string of input data and returns an array of
-  # tokenized entities, provided there were any available to extract.
-  #
-  # @example
+  # tokenized entities, provided there were any available to extract. This
+  # makes for easy processing of datagrams using a pattern like:
   #
-  #   tokenizer.extract(data).
-  #     map { |entity| Decode(entity) }.each { ... }
+  #   tokenizer.extract(data).map { |entity| Decode(entity) }.each do ...
   #
-  # @param [String] data
+  # Using -1 makes split to return "" if the token is at the end of
+  # the string, meaning the last element is the start of the next chunk.
   def extract(data)
-    # Extract token-delimited entities from the input string with the split command.
-    # There's a bit of craftiness here with the -1 parameter. Normally split would
-    # behave no differently regardless of if the token lies at the very end of the
-    # input buffer or not (i.e. a literal edge case) Specifying -1 forces split to
-    # return "" in this case, meaning that the last entry in the list represents a
-    # new segment of data where the token has not been encountered
-    entities = data.split @delimiter, -1
-
-    # Check to see if the buffer has exceeded capacity, if we're imposing a limit
-    if @size_limit
-      raise 'input buffer full' if @input_size + entities.first.size > @size_limit
-      @input_size += entities.first.size
+    if @trim > 0
+      tail_end = @tail.slice!(-@trim, @trim) # returns nil if string is too short
+      data = tail_end + data if tail_end
     end

-    # Move the first entry in the resulting array into the input buffer. It represents
-    # the last segment of a token-delimited entity unless it's the only entry in the list.
-    @input << entities.shift
-
-    # If the resulting array from the split is empty, the token was not encountered
-    # (not even at the end of the buffer). Since we've encountered no token-delimited
-    # entities this go-around, return an empty array.
-    return [] if entities.empty?
-
-    # At this point, we've hit a token, or potentially multiple tokens. Now we can bring
-    # together all the data we've buffered from earlier calls without hitting a token,
-    # and add it to our list of discovered entities.
-    entities.unshift @input.join
+    @input << @tail
+    entities = data.split(@delimiter, -1)
+    @tail = entities.shift

-    # Now that we've hit a token, joined the input buffer and added it to the entities
-    # list, we can go ahead and clear the input buffer. All of the segments that were
-    # stored before the join can now be garbage collected.
-    @input.clear
-
-    # The last entity in the list is not token delimited, however, thanks to the -1
-    # passed to split. It represents the beginning of a new list of as-yet-untokenized
-    # data, so we add it to the start of the list.
-    @input << entities.pop
-
-    # Set the new input buffer size, provided we're keeping track
-    @input_size = @input.first.size if @size_limit
+    unless entities.empty?
+      @input << @tail
+      entities.unshift @input.join
+      @input.clear
+      @tail = entities.pop
+    end

-    # Now we're left with the list of extracted token-delimited entities we wanted
-    # in the first place. Hooray!
     entities
   end

   # Flush the contents of the input buffer, i.e. return the input buffer even though
-  # a token has not yet been encountered.
-  #
-  # @return [String]
+  # a token has not yet been encountered
   def flush
+    @input << @tail
     buffer = @input.join
     @input.clear
+    @tail = "" # @tail.clear is slightly faster, but not supported on 1.8.7
     buffer
   end
-
-  # @return [Boolean]
-  def empty?
-    @input.empty?
-  end
 end
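
The rewritten extract logic is compact enough to exercise in isolation. The sketch below condenses the class from the new side of the diff above (doc comments omitted, input strings illustrative) and feeds it datagrams split mid-line and mid-delimiter; the CRLF case shows why the @trim/tail_end slicing is needed when a multi-character delimiter straddles two datagrams.

```ruby
# Condensed from the new buftok.rb above; comments note what each step does.
class BufferedTokenizer
  def initialize(delimiter = $/)
    @delimiter = delimiter
    @input = []        # joined only when a token is finally seen
    @tail = ''         # unterminated trailing segment from the last call
    @trim = @delimiter.length - 1
  end

  def extract(data)
    # Re-prepend the tail of the previous chunk so a delimiter split
    # across two datagrams (e.g. "\r" then "\n") is still matched.
    if @trim > 0
      tail_end = @tail.slice!(-@trim, @trim) # nil if @tail is too short
      data = tail_end + data if tail_end
    end

    @input << @tail
    # -1 keeps the trailing "" when data ends with the delimiter, so the
    # last element is always the (possibly empty) start of the next chunk.
    entities = data.split(@delimiter, -1)
    @tail = entities.shift

    unless entities.empty?
      @input << @tail
      entities.unshift @input.join
      @input.clear
      @tail = entities.pop
    end

    entities
  end

  def flush
    @input << @tail
    buffer = @input.join
    @input.clear
    @tail = ''
    buffer
  end
end

# A line split across two datagrams is reassembled:
tok = BufferedTokenizer.new
p tok.extract("foo\nba")  # => ["foo"]
p tok.extract("r\nbaz")   # => ["bar"]
p tok.flush               # => "baz"

# A two-character delimiter split across datagrams is still detected:
crlf = BufferedTokenizer.new("\r\n")
p crlf.extract("one\r")   # => []
p crlf.extract("\ntwo")   # => ["one"]
```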
5 changes: 2 additions & 3 deletions lib/em/protocols/line_and_text.rb
@@ -32,7 +32,6 @@ module Protocols
     # for a version which is optimized for correctness with regard to binary text blocks
     # that can switch back to line mode.
     class LineAndTextProtocol < Connection
-      MaxLineLength = 16*1024
       MaxBinaryLength = 32*1024*1024

       def initialize *args
@@ -42,7 +41,7 @@ def initialize *args
       def receive_data data
         if @lbp_mode == :lines
           begin
-            @lpb_buffer.extract(data).each do |line|
+            @lpb_buffer.extract(data).each do |line|
               receive_line(line.chomp) if respond_to?(:receive_line)
             end
           rescue Exception
@@ -116,7 +115,7 @@ def set_binary_mode size = nil
       #--
       # For internal use, establish protocol baseline for handling lines.
       def lbp_init_line_state
-        @lpb_buffer = BufferedTokenizer.new("\n", MaxLineLength)
+        @lpb_buffer = BufferedTokenizer.new("\n")
        @lbp_mode = :lines
      end
      private :lbp_init_line_state
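
With the size_limit argument gone from BufferedTokenizer, LineAndTextProtocol no longer gets a tokenizer-enforced cap on buffered line length; a caller that still wants the old MaxLineLength guard has to check extracted lines itself. A hypothetical sketch of that pattern (guarded_lines and LineSplitter are illustrative names, not EventMachine API):

```ruby
MAX_LINE_LENGTH = 16 * 1024

# Minimal stand-in for the tokenizer: yields complete lines only.
class LineSplitter
  def extract(data)
    data.split("\n")
  end
end

# Reject any extracted line longer than the cap, roughly what the old
# size_limit check did inside the tokenizer itself.
def guarded_lines(tokenizer, data)
  tokenizer.extract(data).map do |line|
    raise 'line too long' if line.bytesize > MAX_LINE_LENGTH
    line
  end
end

p guarded_lines(LineSplitter.new, "ok\nfine\n")  # => ["ok", "fine"]
```

The trade-off is that the check now runs per extracted line rather than per buffered byte, so a pathological peer can still make the tokenizer buffer an arbitrarily long unterminated line before the guard fires.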
1 change: 0 additions & 1 deletion lib/em/protocols/linetext2.rb
@@ -37,7 +37,6 @@ module LineText2
       # When we get around to that, call #receive_error if the user defined it, otherwise
       # throw exceptions.

-      MaxLineLength = 16*1024
       MaxBinaryLength = 32*1024*1024

       #--
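
Both protocol files keep splitting on an explicit "\n". Since the rewritten tokenizer defaults to $/, which Ruby initializes to "\n", passing "\n" explicitly is equivalent unless a program reassigns $/. A quick stand-alone check of that assumption:

```ruby
# $/ is Ruby's input record separator and defaults to "\n", so
# BufferedTokenizer.new and BufferedTokenizer.new("\n") tokenize alike.
p $/ == "\n"                                        # => true
p "one\ntwo\n".split($/, -1) == ["one", "two", ""]  # => true
```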
