Use Case: Reading a Record List

do- edited this page Dec 27, 2021 · 30 revisions

Input

Consider the following XML:

<Data>
  <Record>
    <Id>1</Id>
    <Name>John Doe</Name>
  </Record>
  <!-- ... millions of records, same structure ...-->
</Data>

Problem

The goal is to transform xmlSource, a readable UTF-8 byte stream representing the input XML, into records: an object mode stream of items like

//...
{
  Id: "1",
  Name: "John Doe",
}
//...

Basic Solution

const {XMLReader, XMLNode} = require ('xml-toolkit')

const records = new XMLReader ({
  filterElements : 'Record', 
  map            : XMLNode.toObject ({})
}).process (xmlSource)

// ...then:
// await someLoader.load (records)

// ...or
// for await (const record of records) {...} // pull parser mode

// ...or
// records.on ('error', e => console.log (e))
// records.pipe (nextStream)

// ...or
// records.on ('error', e => console.log (e))
// records.on ('data', record => doSomethingWith (record))

Explanation

Here:

  • an XMLReader is created;
  • with the filterElements option that tells it to handle only Record elements (skipping the root Data node);
    • and the map option applying the XMLNode.toObject transformation;
  • the process method implicitly creates an XMLLexer instance, performs all the necessary piping and produces the desired object mode readable stream.

Q&A

How to alter field names?

With XMLNode.toObject's getName option. For example, by setting

map: XMLNode.toObject ({
  getName: s => s.toLowerCase (),
  //...
}),

we'll obtain {id: "1", name: "John Doe"} instead of {Id: "1", Name: "John Doe"}.
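Outside of xml-toolkit, the renaming effect can be sketched with plain js. The helper below only illustrates what getName does to property names; it is not the library's code:

```javascript
// Illustration only: xml-toolkit applies getName internally,
// this just shows the renaming effect on a sample record
const getName = s => s.toLowerCase ()

const source  = {Id: '1', Name: 'John Doe'}
const renamed = Object.fromEntries (
  Object.entries (source).map (([name, value]) => [getName (name), value])
)
// renamed is {id: '1', name: 'John Doe'}
```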

Are XML namespaces supported?

Yes, but not by default. To compute js object property names from the XML local name and namespace URI, use the 2-argument getName, for example:

map: XMLNode.toObject ({
  getName: (localName, namespaceURI) => `{${namespaceURI}}${localName}`,
  //...
}),

How to adjust output records content?

In general, this is what transform streams are for.

But in simple cases, using XMLNode.toObject's map option is more handy. For example, by setting

map: XMLNode.toObject ({
  // assumes `let ord = 0` is declared in the enclosing scope
  map: r => ({...r, ord: ++ord})
  //...
}),

we'll add the record counter named ord:

{Id: "1",   Name: "John Doe", ord: 1},
{Id: "989", Name: "Mary Sue", ord: 2},
...

What if record fields are presented as attributes, not nested elements?

The input

<Data>
  <Record Id="1" Name="John Doe" />
  <!-- ... and so on ...-->
</Data>

can be processed with literally the same code as shown above, with the same result: XMLNode.toObject treats attributes and unique children equally.

However, for a flat namespaceless attribute-only XML like this, there is a faster parsing method described in the next subsection.

Using XMLLexer

Without any child elements in use, one doesn't have to build any fragment of the DOM tree at all. Attribute-only XML is very similar to CSV: lines delimited by known character sequences. In this case, the low level XMLLexer and SAXEvent can be used instead of XMLReader and XMLNode:

const {XMLLexer, SAXEvent} = require ('xml-toolkit')

const lex = new XMLLexer ()

lex.on ('data', s => {
  const e = new SAXEvent (s); if (!e.isSelfEnclosed) return
  const {attributes} = e
  // ...do whatever with attributes
  //      (it's a Map, not a plain Object)
})

xmlSource.pipe (lex)

This technique lets us skip creating some objects and thus slightly outperform saxophone (the fastest streaming XML parser for node.js known to the author so far) when reading multi-gigabyte files.

The POJO creation is not shown here because it would likely be a wrong step on the way to maximum performance: Maps are faster, which is why SAXEvent provides attributes in this form.
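For illustration, here is what working with such a Map looks like. The attribute values below are made up; only the Map shape matches what SAXEvent provides:

```javascript
// Made-up attribute values; only the Map shape matches SAXEvent's output
const attributes = new Map ([['Id', '1'], ['Name', 'John Doe']])

const id = attributes.get ('Id')               // direct access, no POJO

// only if a plain object is really needed, despite the extra cost:
const record = Object.fromEntries (attributes)
```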

What are type mapping rules for field values?

Simple: all XML content is presented as primitive js strings, except for empty strings, which are mapped to null. See XMLNode.toObject for details.

In most cases, this works fine for further loading into a database in ETL scenarios.

If you need Numbers, Dates etc., transform the output (see above).

Is the filterElements option mandatory?

In this use case (reading a flat record list as a stream), yes, it's mandatory.

And the root element must not match the filter condition: otherwise, the full document tree will be constructed, which may lead to memory exhaustion.

Is it possible to filterElements on something other than the element's local name?

Yes.

The filterElements option can be any function mapping XMLNode to Boolean.

For example, by setting

filterElements: e => e.level === 1,

we'll obtain all of the root element's children, whatever their names are.

By tuning filterElements, one can obtain most of the XPath filtering functionality available to a streaming parser.
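A couple of predicate sketches, written as plain functions. The `level` property is shown in the example above, while the element name property (called `name` here) is an assumption to be checked against the XMLNode documentation:

```javascript
// `level` is shown in the example above; the element name property
// (`name` here) is an assumption, check the XMLNode documentation
const topLevelOnly = e => e.level === 1
const namedRecords = e => e.level === 1 && e.name === 'Record'

// quick check against mock nodes:
topLevelOnly ({level: 1})                  // true
namedRecords ({level: 1, name: 'Record'}) // true
namedRecords ({level: 2, name: 'Record'}) // false
```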

Is the map option mandatory?

No, it's not. Still, setting it to XMLNode.toObject (...) is useful in most cases.

Without any map set, the stream will provide parsed content in the form of XMLNode instances instead of plain js objects. Application developers are free to explore the inner details and build custom mappers.

I have a limited length XML presented as a string. How to parse it?

Just supply the string as the xmlSource parameter.

Is utf-8 the only supported encoding for streams?

Not the only one, but the choice is limited to node.js native support.

If you have XML presented as a binary readable stream compatible with the standard StringDecoder, you can set its encoding as a process parameter (effectively, an XMLLexer option):

.process (xmlSource, {
  encoding: 'latin1',
})

For any other encoding, the source must be preprocessed with something like iconv-lite. This is not xml-toolkit's job.
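To check whether a given label is natively supported, and thus usable as the encoding option, Buffer.isEncoding is a quick probe:

```javascript
// Buffer.isEncoding tells whether node.js supports an encoding natively,
// i.e. whether it can be passed as the `encoding` option
Buffer.isEncoding ('latin1')   // true: usable directly
Buffer.isEncoding ('win1251')  // false: preprocess with iconv-lite first
```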