Use codepoints instead of chars in the lexer.
Grand wizard overlord @whitequark recommended this as it bypasses the need
to create an individual String instance for every character (at least not until
needed). This becomes noticeable on large inputs (e.g. 100 MB of XML).
Previously these would result in the kernel OOM killing the process. With
codepoints, memory usage increases by a "mere" 1-1.5 GB.
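
As a rough illustration of the difference described above (a sketch only, not code from this commit; the sample input and variable names are made up), `chars` yields one String object per character while `codepoints` yields plain Integers, and `pack('U*')` turns a slice of codepoints back into a String only when one is actually needed:

```ruby
# Illustrative input; the escape keeps the literal UTF-8 regardless of
# the source file encoding.
data = "<p>\u00e9</p>"

chars      = data.chars.to_a  # one String object per character
codepoints = data.codepoints  # one Integer per character

# Extracting the same character range from either representation.
from_chars      = chars[3...4].join('')
from_codepoints = codepoints[3...4].pack('U*')

puts from_chars == from_codepoints
puts codepoints.length == data.length
```

Both representations have one element per character, so offsets into the input are unchanged; only the element type differs.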
Yorick Peterse committed Mar 23, 2014
1 parent cdf5f1d commit a2452b6
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions lib/oga/lexer.rl
@@ -95,7 +95,7 @@ module Oga
# @return [Array]
#
def lex(data)
- @data = data.chars.to_a
+ @data = data.codepoints
lexer_start = self.class.lexer_start
eof = data.length

@@ -152,7 +152,7 @@ module Oga
# @return [String]
#
def text(start = @ts, stop = @te)
- return @data[start...stop].join('')
+ return @data[start...stop].pack('U*')
end

##
@@ -223,6 +223,7 @@ module Oga
%%{
# Use instance variables for `ts` and friends.
access @;
+ getkey (@data[p] || 0);

newline = '\n' | '\r\n';
whitespace = [ \t];
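
The `getkey` expression added in the last hunk reads the current codepoint and falls back to 0 when the index yields `nil`. The commit does not say why the fallback is needed, so the following is only a sketch of what the expression evaluates to; `key_at` is a hypothetical helper, not part of Oga or Ragel:

```ruby
data = "<a>".codepoints

# Hypothetical helper mirroring the expression `(@data[p] || 0)`:
# Array#[] returns nil for an out-of-range index, which `|| 0` replaces
# with 0 so the caller always receives an Integer.
def key_at(data, p)
  data[p] || 0
end

inside   = key_at(data, 1)            # codepoint of "a"
past_end = key_at(data, data.length)  # index one past the last element

puts inside
puts past_end
```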