Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also .

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also .
base repository: antchfx/xmlquery
base: v1.3.0
Choose a base ref
head repository: antchfx/xmlquery
compare: v1.3.1
Choose a head ref
  • 3 commits
  • 2 files changed
  • 3 contributors

Commits on Aug 30, 2020

Commits on Sep 13, 2020

  1. Fix a bug in xml stream parsing where a previously unmatched node cau…

    …sing all subsequent valid matches fail.
    
    Recall that for streaming mode, we have two xpaths: one for matching the element, the other (optionally) for
    adding additional filtering. Imagine the following example, where the xml doc is:
    
    ```
    	<ROOT>
    		<AAA>
    			<CCC>c1</CCC>
    			<BBB>b1</BBB>
    			<DDD>d1</DDD>
    			<BBB>b2<ZZZ z="1">z1</ZZZ></BBB>
    			<BBB>b3</BBB>
    		</AAA>
    		<ZZZ>
    			<BBB>b4</BBB>
    			<BBB>b5</BBB>
    			<CCC>c3</CCC>
    		</ZZZ>
    	</ROOT>
    ```
    
    The stream parser is created as:
    
    ```
    	CreateStreamParser(strings.NewReader(s), "/ROOT/*/BBB", "/ROOT/*/BBB[. != 'b3']")
    ```
    
    Basically we want the stream parser to return all the `BBB` nodes whose text aren't `b3`. By looking at the sample
    XML, we know it should return: the `<BBB>` nodes whose texts are `b1`, `b2`, `b4`, and `b5`.
    
    However, the current code only returns `b1` and `b2`.
    
    The problem lies in the stream element matching inside `case xml.StartElement`.
    
    Currently the code does this:
    
    ```
    case xml.StartElement:
    	...
    	...
    	if p.streamElementXPath != nil {
    		if p.streamNode == nil {
    			if QuerySelector(p.doc, p.streamElementXPath) != nil {
    				// Now we assume this node is the stream node candidate.
    			}
    ```
    We originally under the assumption that if the `streamElementXPath` query returns anything, it must be this node itself;
    thus if it returns, this node is the stream node candidate.
    
    But it's clearly wrong in this `b3` example above. For the node `<BBB>b3</BBB>` it is first considered as the stream
    candidate, but later filtering (`[. != 'b3']`) removes its stream node status, and treats it just like any other non-stream
    nodes, and keeps it in the node tree. But the problem is, by keeping it in the tree, any future XML element start will
    **always** "matches" `streamElementXPath`. So in the example above, the node `<ZZZ>` is now erroneously considered stream
    node, and any child nodes are not even tested for streaming anymore.
    
    There are two fixes:
    
    1) In `xml.StartElement` stream match, instead of just doing `QuerySelector(...) != nil` check, we need to issue a
    `QuerySelectorAll(...)` call and examine all the returned nodes, if the current node is one of them, then this current node
    is considered stream candidate.
    
    2) Simpler: if a stream candidate is later filtered out inside `case xml.EndElement` handling, then simply remove
    it from the node tree, thus preventing future erroneous matches.
    
    1) seems an overkill: if a stream candidate gets filtered out, it's hard to imagine caller would like to interact with it
    in any capacity. Thus chose fix 2).
    jf-tech committed Sep 13, 2020

Commits on Sep 14, 2020

  1. Merge pull request #40 from jf-tech/jf-tech/streamparser-fix

    Fix a bug in xml stream parsing where a previously unmatched node causing all subsequent valid matches fail.
    zhengchun committed Sep 14, 2020