Use Case: Reading a Record List
Consider the following XML:

```xml
<Data>
    <Record>
        <Id>1</Id>
        <Name>John Doe</Name>
    </Record>
    <!-- ... millions of records, same structure ... -->
</Data>
```
The main goal is to transform `xmlSource`, a readable UTF-8 byte stream representing the input XML, into `records`: a stream of objects like

```js
// ...
{
    Id: "1",
    Name: "John Doe",
},
// ...
```
```js
const {XMLReader, XMLNode} = require ('xml-toolkit')

const records = new XMLReader ({
    filterElements : 'Record',
    map            : XMLNode.toObject ({})
}).process (xmlSource)

// ...then:
//   await someLoader.load (records)
// ...or
//   for await (const record of records) { // pull parser mode
// ...or
//   records.on ('error', e => console.log (e))
//   records.pipe (nextStream)
// ...or
//   records.on ('error', e => console.log (e))
//   records.on ('data', record => doSomethingWith (record))
```
Here:
- an `XMLReader` is created;
- with the `filterElements` option that tells it to handle only `Record` elements (skipping the root `Data` node);
- and the `map` option requesting the `XMLNode.toObject` transformation;
- the `process` method implicitly creates an `XMLLexer` instance, performs all the necessary piping and produces the desired object-mode readable stream.
Property names can be adjusted with `XMLNode.toObject`'s `getName` option. For example, by setting

```js
map: XMLNode.toObject ({
    getName: s => s.toLowerCase (),
    // ...
}),
```

we'll obtain `{id: "1", name: "John Doe"}` instead of `{Id: "1", Name: "John Doe"}`.
Namespaces are supported, but not reflected in property names by default. To calculate JS object property names from the XML local name and namespace URI, use the 2-argument `getName`, for example:

```js
map: XMLNode.toObject ({
    getName: (localName, namespaceURI) => `{${namespaceURI}}${localName}`,
    // ...
}),
```
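Taken on its own, such a naming function produces Clark notation. A minimal standalone sketch, independent of `xml-toolkit` (the sample URI is made up for illustration):

```js
// A 2-argument name mapper combining namespace URI and local name
// into Clark notation; plain JS, no xml-toolkit required
const getName = (localName, namespaceURI) => `{${namespaceURI}}${localName}`

console.log (getName ('Id', 'urn:example:records'))
// → {urn:example:records}Id
```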
In general, post-processing records is what transform streams are for. But in simple cases, using `XMLNode.toObject`'s `map` option is more handy. For example, by setting

```js
let ord = 0
// ...
map: XMLNode.toObject ({
    map: r => ({...r, ord: ++ord}),
    // ...
}),
```

we'll add the record counter named `ord`:

```js
{Id: "1", Name: "John Doe", ord: 1},
{Id: "989", Name: "Mary Sue", ord: 2},
// ...
```
The input

```xml
<Data>
    <Record Id="1" Name="John Doe" />
    <!-- ... and so on ... -->
</Data>
```

can be processed with literally the same code as shown above, with the same result: `XMLNode.toObject` treats attributes and unique children equally. However, for flat, namespace-less, attribute-only XML like this, there is a faster parsing method described in the next subsection.
With no child elements in use, one doesn't have to build any fragment of a DOM tree at all. Attribute-only XML is very similar to CSV: lines delimited with some character sequences. In this case, the low-level `XMLLexer` and `SAXEvent` can be used instead of `XMLReader` and `XMLNode`:

```js
const {XMLLexer, SAXEvent} = require ('xml-toolkit')

const lex = new XMLLexer ()

lex.on ('data', s => {
    const e = new SAXEvent (s); if (!e.isSelfEnclosed) return
    const {attributes} = e
    // ...do whatever with attributes
    // it's a Map, not a plain Object
})

xmlSource.pipe (lex)
```
This technique lets us skip creating some objects and thus slightly outperform saxophone (the fastest streaming XML parser for node.js known to the author so far) when reading multi-gigabyte files. The POJO creation is not shown here because it would probably be a wrong step on the way to maximum performance: `Map`s are faster, which is why `SAXEvent` provides attributes in this form.
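Should a plain object be needed anyway, a `Map` converts in one call. A minimal stdlib-only sketch, with a hand-built `Map` standing in for the attributes a `SAXEvent` would provide:

```js
// A hand-built Map standing in for SAXEvent's attributes
const attributes = new Map ([['Id', '1'], ['Name', 'John Doe']])

// One-call conversion to a POJO (costs an extra allocation per record)
const record = Object.fromEntries (attributes)

console.log (record) // → { Id: '1', Name: 'John Doe' }
```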
As for data types, the rule is simple: all XML content is presented as primitive JS strings, except for empty strings, which are mapped to `null` (see `XMLNode.toObject` for details). In most cases, that works fine for further loading into a database in ETL scenarios. If you need `Number`s, `Date`s etc., transform the output (see above).
In this use case (reading a flat record list as a stream), the `filterElements` option is mandatory. And the root element must not match the filter condition: otherwise, the full document tree will be constructed, which may lead to memory exhaustion.
Filtering is not limited to element names: the `filterElements` option can be any function mapping an `XMLNode` to a Boolean. For example, by setting

```js
filterElements: e => e.level === 1,
```

we'll obtain all the root element's children, whatever their names are. By tuning `filterElements`, one can obtain most of the XPath filtering functionality available for a streaming parser.
Unlike `filterElements`, the `map` option is not mandatory, though setting it to `XMLNode.toObject (...)` should be useful in most cases. Without any `map` set, the stream will provide parsed content in the form of `XMLNode` instances instead of plain JS objects. Application developers are free to explore the inner details and build custom mappers.
To process an in-memory XML string, just supply it as the `xmlSource` parameter.
UTF-8 is not the only supported encoding, but the choice is limited by node.js native support. If you have XML presented as a binary readable stream compatible with the standard `StringDecoder`, you can set its encoding as a `process` parameter (effectively, an `XMLLexer` option):

```js
.process (xmlSource, {
    encoding: 'latin1',
})
```

For any other encoding, the source has to be preprocessed with something like iconv-lite. This is not `xml-toolkit`'s task.
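To check in advance whether node.js can decode a given encoding natively (and thus whether it can be passed as the `encoding` option at all), the stdlib `Buffer.isEncoding` helper can be used; a minimal sketch:

```js
// Encodings accepted by StringDecoder can be probed with Buffer.isEncoding
console.log (Buffer.isEncoding ('latin1'))  // → true
console.log (Buffer.isEncoding ('win1251')) // → false: preprocess with iconv-lite
```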