XMLLexer

do- edited this page Oct 16, 2022 · 55 revisions

Description

XMLLexer is a very low-level asynchronous XML parser, lower than SAX. It only splits incoming text into fragments (lexemes):

  • text nodes
  • angle-bracket-delimited tags of all kinds
    • including comments, processing instructions and all weird <! stuff inherited from SGML
      • and, yes, <![CDATA[ belongs here.

In application code, XMLLexer can be used:

  • directly (which makes sense for huge XML files with primitive structure);
  • as data source for SAXEventEmitter.

Technically it is a transform

  • from a Readable stream representing the content of an XML (HTML and so on) document;
  • to an object mode stream of strings (primitive ones, not wrapped).

Usage

const lexer = new XMLLexer ({...options})

someInputStream.pipe (lexer)
for await (const tag of lexer) if (/* the tag is OK */) {
  // do something with the tag
}

// or

someInputStream.pipe (lexer).pipe (someObjectOutputStream)

// or

lexer.on  ('data', s => console.log (s))
lexer.end ('<root>...</root>')

Options

Name | Default | Description
maxLength | 1 << 20 | Maximum lexeme length, in characters
encoding | utf8 | Encoding for output lexemes, used with binary input

Lexeme types

XMLLexer splits the incoming stream into strings matching the following patterns:

Template | Description | Note
<? ... ?> | Processing instruction or prolog | <?xml ... ?>
<!-- ... --> | Comment |
<! ... [ ... [ ... ]]> | Conditional section, complex doctype or CDATA section | <![CDATA[ ... ]]>
<! ... [ ... ]> | Medium doctype |
<! ... > | Doctype, atomic |
< ... > | Element tag (opening, closing or both) |
... | Text node |
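
To make the table concrete, here is a minimal sketch of how application code might tell the lexeme kinds apart once XMLLexer has produced them. The function name `lexemeKind` and the labels it returns are assumptions made for this illustration; they are not part of the XMLLexer API.

```javascript
// Hypothetical helper, not part of XMLLexer: returns a label for the row
// of the table above that a given lexeme string falls into.
function lexemeKind (s) {
  if (!s.startsWith ('<'))        return 'text'
  if (s.startsWith ('<?'))        return 'processing instruction'
  if (s.startsWith ('<!--'))      return 'comment'
  if (s.startsWith ('<![CDATA[')) return 'cdata'
  if (s.startsWith ('<!'))        return 'declaration'  // doctype and other <! constructs
  return 'tag'
}

const kinds = [
  '<?xml version="1.0"?>',
  '<!-- note -->',
  '<![CDATA[1 < 2]]>',
  '<!DOCTYPE html>',
  '<a href="#">',
  'plain text',
].map (s => lexemeKind (s))
```

The order of the checks matters: the more specific prefixes (`<!--`, `<![CDATA[`) have to be tested before the generic `<!`.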

Implementation

Members

Name | Type | Description
body | string | What's left unparsed
beforeBody | BigInt | Number of chars or bytes already consumed before the current body contents
state | int | Current state
position | int | Last position analyzed for [ in state ST_TAG_X
awaited | Buffer | Enclosing sequence

Constants

States (state values)

XMLLexer is basically a finite state machine with the following states:

Constant | Body from start | Description
ST_TEXT | unknown or anything but < | Find the nearest <, slice the text
ST_LT | < | Look ahead at the next byte to switch to ST_LT_X (if !) or ST_TAG (otherwise)
ST_LT_X | <! | Look ahead at the next byte to switch to ST_TAG with awaited --> (if -) or ST_TAG_X (otherwise)
ST_TAG | <... | Scan for >, find the one preceded by the awaited sequence, slice the tag
ST_TAG_X | <!... | Scan through all bytes, adjust awaited whenever [ occurs, watch for >, finally slice the tag

The following transitions are possible:

From | To | Awaited | Explanation
ST_TEXT | ST_LT | | < occurred, buffered text (if any) is given out
ST_TEXT | ST_TEXT | | some other char occurred
ST_LT | ST_LT_X | | ! occurred
ST_LT | ST_TAG | ?> | ? occurred
ST_LT | ST_TAG | > | neither ! nor ?, normal tag
ST_LT_X | ST_TAG | --> | - occurred, must be a comment (unless broken)
ST_LT_X | ST_TAG_X | > / ]> / ]]> | anything else occurred, awaited depends on counting [s
ST_TAG | ST_TEXT | | > occurred and the previous characters match awaited: the tag is given out
ST_TAG_X | ST_TEXT | | > occurred and the previous characters match awaited: the tag is given out
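
The states and transitions above can be sketched as a small synchronous tokenizer. This is only an illustration under simplifying assumptions, not the actual XMLLexer code: it works on one complete string instead of chunked Buffers, keeps `awaited` as a string, and the function name `lex` is made up for the example.

```javascript
const ST_TEXT = 0, ST_LT = 1, ST_LT_X = 2, ST_TAG = 3, ST_TAG_X = 4

// Hypothetical whole-string version of the state machine described above.
function lex (src) {
  const out = []
  let state = ST_TEXT, awaited = '', start = 0, pos = 0

  // Emit src [start, end) if not empty, then continue from `end` as text
  const publishTo = end => {
    if (end > start) out.push (src.slice (start, end))
    start = end
    state = ST_TEXT
  }

  while (pos < src.length) {
    const c = src [pos]
    if (state === ST_TEXT) {
      if (c === '<') { publishTo (pos); state = ST_LT }
      pos ++
    }
    else if (state === ST_LT) {          // look at the char right after '<'
      if (c === '!') { state = ST_LT_X; pos ++ }
      else { state = ST_TAG; awaited = c === '?' ? '?' : '' }
    }
    else if (state === ST_LT_X) {        // look at the char right after '<!'
      if (c === '-') { state = ST_TAG; awaited = '--' }
      else { state = ST_TAG_X; awaited = '' }
    }
    else {                               // ST_TAG or ST_TAG_X
      if (state === ST_TAG_X && c === '[') awaited = awaited === '' ? ']' : ']]'
      pos ++
      // a '>' closes the tag only when preceded by the awaited sequence
      if (c === '>' && src.endsWith (awaited, pos - 1)) publishTo (pos)
    }
  }
  publishTo (src.length)                 // the rest, if any, goes out as is
  return out
}
```

For instance, `lex ('<!-- a > b --><root>')` yields the comment as one lexeme, because the first `>` is not preceded by `--`.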

Enclosing sequences (awaited values)

Possible enclosing sequences (with the trailing > chopped off) are stored as Buffer instances:

Name | Value | Description
CL_DEAFULT | '' | default
CL_PI | ? | for <?
CL_COMMENT | -- | for <!-
CL_SQ_1 | ] | if [ occurred in ST_TAG_X state and CL_DEAFULT awaited
CL_SQ_2 | ]] | if [ occurred in ST_TAG_X state and CL_SQ_1 awaited
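
As an illustration of how such a Buffer can be matched against the bytes preceding a `>`, here is a minimal sketch. It assumes the body is available as a Buffer and passes everything as explicit parameters, which need not be how the real isClosing method is written.

```javascript
// Values mirror the table above; only two of them are needed for the example
const CL_DEAFULT = Buffer.from ('')
const CL_COMMENT = Buffer.from ('--')

// Hypothetical standalone check: do the bytes of `body` right before
// index `pos` equal the `awaited` sequence?
function isClosing (body, pos, awaited) {
  const from = pos - awaited.length
  if (from < 0) return false
  return body.subarray (from, pos).equals (awaited)
}

const body = Buffer.from ('<!-- x > y -->')
const inner = isClosing (body,  7, CL_COMMENT)  // the '>' between x and y
const outer = isClosing (body, 13, CL_COMMENT)  // the final '>'
```

With an empty awaited sequence the compared slice is empty too, so any `>` closes a normal tag.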

Methods

Name, Params | Type | Description
setState (state, awaited) | | set the state and awaited members
isClosing (pos) | boolean | do the bytes preceding pos match awaited
publishTo (pos) | | push the slice from start to pos to the output stream, then start after pos
parse () | | scan the body from start, publish lexemes found, move start after the last one
checkMaxLength () | | throw an error if the body is longer than maxLength
_transform (chunk, encoding, callback) | | append the incoming chunk, parse the body, then trim it, reset start to 0 and invoke callback
_flush (callback) | | publish the rest of the body, invoke callback
getPosition () | BigInt | position of the current lexeme in the incoming stream (for error reporting)
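
How these methods can fit together in a Node.js Transform is sketched below. `SketchLexer` is a hypothetical, heavily simplified stand-in and not the real implementation: it only separates `<...>` from text (no comments, CDATA or doctypes), keeps the body as a string, trims it on every publish instead of tracking a start offset, and uses an assumed error message.

```javascript
const {Transform}     = require ('stream')
const {StringDecoder} = require ('string_decoder')

// Hypothetical simplified lexer: bytes in, strings (lexemes) out
class SketchLexer extends Transform {

  constructor ({maxLength = 1 << 20, encoding = 'utf8'} = {}) {
    super ({readableObjectMode: true})
    this.maxLength  = maxLength
    this.decoder    = new StringDecoder (encoding)
    this.body       = ''   // what's left unparsed
    this.beforeBody = 0n   // chars already consumed before the body
  }

  publishTo (pos) {        // push body [0, pos), keep the rest
    if (pos > 0) this.push (this.body.slice (0, pos))
    this.body = this.body.slice (pos)
    this.beforeBody += BigInt (pos)
  }

  parse () {
    for (;;) {
      if (this.body.startsWith ('<')) {
        const gt = this.body.indexOf ('>')
        if (gt < 0) return              // the tag is not complete yet
        this.publishTo (gt + 1)
      }
      else {
        const lt = this.body.indexOf ('<')
        if (lt < 0) return              // the text may continue in the next chunk
        this.publishTo (lt)
      }
    }
  }

  _transform (chunk, encoding, callback) {
    this.body += this.decoder.write (chunk)
    this.parse ()
    if (this.body.length > this.maxLength) return callback (new Error ('Lexeme too long'))
    callback ()
  }

  _flush (callback) {
    this.body += this.decoder.end ()
    this.publishTo (this.body.length)   // the rest of the body goes out as is
    callback ()
  }

}
```

Such a class is used like any other stream, for example by attaching a 'data' listener and calling `end ('<a>hello</a>')`. Lexemes split across chunk boundaries stay in the body until the next chunk completes them.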

Comparison to XMLIterator

XMLLexer and XMLIterator are both low-level XML parsers that split angle-bracket-delimited text into syntactically atomic tokens. But:

Name | Proto | XML Source | Pro | Contra
XMLLexer | Transform | Readable | limited memory footprint with any XML size | asynchronous by nature
XMLIterator | Iterable | String | can be used in synchronous for ... of, e.g. in object constructors | limited size XML only

So, XMLLexer vs. XMLIterator is basically like fs.createReadStream vs. fs.readFileSync.