XMLLexer
do- edited this page Oct 16, 2022 · 55 revisions
XMLLexer is a lowest-level asynchronous XML parser, lower than SAX. It only splits incoming text into fragments (lexemes):
- text nodes
- pointy brace delimited tags of all kinds
  - including comments, processing instructions and all the weird `<!` stuff inherited from SGML
    - and, yes, `<![CDATA[` belongs here.
In application code, XMLLexer can be used:
- directly (which makes sense for huge XML files with a primitive structure);
- as a data source for SAXEventEmitter.
Technically, it is a Transform stream:
- from a Readable stream representing the content of an XML (HTML and so on) document;
- to an object mode stream of strings (primitive ones, not wrapped).
```js
const lexer = new XMLLexer ({...options})

someInputStream.pipe (lexer)

for await (const tag of lexer) if (/* the tag is OK */) {
  // do something with the tag
}

// or

someInputStream.pipe (lexer).pipe (someObjectOutputStream)

// or

lexer.on ('data', s => console.log (s))
lexer.end ('<root>...</root>')
```
The constructor accepts the following options:

Name | Default | Description |
---|---|---|
maxLength | 1 << 20 | Maximum lexeme length, in characters |
encoding | utf8 | Encoding for output lexemes, used with binary input |
XMLLexer splits the incoming stream into strings matching the following patterns:

Template | Description | Note |
---|---|---|
`<? ... ?>` | Processing instruction | or prolog `<?xml ... ?>` |
`<!-- ... -->` | Comment | |
`<! ... [ ... [ ... ]]>` | Conditional, complex doctype | or `<![CDATA[ ... ]]>` |
`<! ... [ ... ]>` | Medium doctype | |
`<! ... >` | Doctype, atomic | |
`< ... >` | Element tag (opening, closing or both) | |
... | Text node | |
Internally, an XMLLexer instance keeps the following properties:

Name | Type | Description |
---|---|---|
body | string | What's left unparsed |
beforeBody | BigInt | Number of chars or bytes already consumed before the current body contents |
state | int | Current state |
position | int | Last position analyzed for `[` in state ST_TAG_X |
awaited | Buffer | Enclosing sequence |
XMLLexer is basically a finite state machine with the following states:
Constant | Body from start | Description |
---|---|---|
ST_TEXT | unknown or anything but `<` | Find the nearest `<`, slice the text |
ST_LT | `<` | Look ahead at the next byte to switch to ST_LT_X (if `!`) or ST_TAG (otherwise) |
ST_LT_X | `<!` | Look ahead at the next byte to switch to ST_TAG with awaited `-->` (if `-`) or ST_TAG_X (otherwise) |
ST_TAG | `<...` | Scan for `>`, find the one preceded by the awaited sequence, slice the tag |
ST_TAG_X | `<!...` | Scan through all bytes, adjust awaited when `[` occurs, watch for `>`, finally slice the tag |
The following transitions are possible:

From | To | Awaited | Explanation |
---|---|---|---|
ST_TEXT | ST_LT | | `<` occurred, buffered text is given out (if any) |
ST_TEXT | ST_TEXT | | some other char occurred |
ST_LT | ST_LT_X | | `!` occurred |
ST_LT | ST_TAG | `?>` | `?` occurred |
ST_LT | ST_TAG | `>` | neither `!` nor `?`, normal tag |
ST_LT_X | ST_TAG | `-->` | `-` occurred, must be a comment (unless broken) |
ST_LT_X | ST_TAG_X | `>` / `]>` / `]]>` | counting `[`s |
ST_TAG | ST_TEXT | | `>` occurred, previous characters match, giving out the tag |
ST_TAG_X | ST_TEXT | | `>` occurred, previous characters match, giving out the tag |
Possible enclosing sequences (with the trailing `>` chopped) are stored as Buffer instances:

Name | Value | Description |
---|---|---|
CL_DEAFULT | `''` | default |
CL_PI | `?` | for `<?` |
CL_COMMENT | `--` | for `<!-` |
CL_SQ_1 | `]` | if `[` occurred in ST_TAG_X state and CL_DEAFULT awaited |
CL_SQ_2 | `]]` | if `[` occurred in ST_TAG_X state and CL_SQ_1 awaited |
XMLLexer implements the following methods:

Name, Params | Type | Description |
---|---|---|
setState (state, awaited) | | set the state and awaited members |
isClosing (pos) | boolean | do bytes preceding pos match awaited |
publishTo (pos) | | push the slice from start to pos into the output stream, then start after pos |
parse () | | scan the body from start, publish lexemes found, move start after the last one |
checkMaxLength () | | throw an error if body.size () > maxLength |
_transform (chunk, encoding, callback) | | append the incoming chunk, parse the body, then trim it, reset start to 0 and invoke callback |
_flush (callback) | | publish the rest of the body, invoke callback |
getPosition () | BigInt | position of the current lexeme in the incoming stream (for error reporting) |
XMLLexer and XMLIterator are both low-level XML parsers splitting pointy bracket delimited text into syntactically atomic tokens. But:
Name | Proto | XML Source | Pro | Contra |
---|---|---|---|---|
XMLLexer | Transform | Readable | limited memory footprint with any XML size | asynchronous by nature |
XMLIterator | Iterable | String | can be used in synchronous `for ... of`, e. g. in object constructors | limited size XML only |
So, XMLLexer vs. XMLIterator is basically like fs.createReadStream vs. fs.readFileSync.