XML file loaded incorrectly, results missing #131

Closed
katanacrimson opened this issue Dec 10, 2012 · 16 comments

@katanacrimson

For some strange reason, when attempting to load some very heavy XML pages for algorithmic stuff, I'm seeing parts of the XML tree just...disappear, unable to be selected at all.

http://steamcommunity.com/id/katana_/games/?tab=all&xml=1

Here's a copy of the XML in case the above changes: https://gist.github.com/4248909

And, some results of tinkering in REPL with it:

{ '0':
   { type: 'tag',
     name: 'games',
     attribs: {},
     children: [ [Object], [Object] ],
     prev:
      { type: 'tag',
        name: 'gamesList',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Circular],
        parent: [Object] },
     next:
      { type: 'tag',
        name: 'storeLink',
        attribs: {},
        children: [Object],
        prev: [Circular],
        next: null,
        parent: [Object] },
     parent:
      { type: 'tag',
        name: 'root',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Object],
        parent: [Object] } },
  length: 1 }
> $('steamID')
{ length: 0 }
> $('steamID64')
{ length: 0 }

So, it thinks the <games> tree is there, but not its parent, nor its own siblings.

And yes, I've got xmlMode enabled.

@matthewmueller
Member

Ahh, yah.. the main focus/demand has been for HTML parsing and manipulation and as a result, the XML piece of the library hasn't been well maintained.

I haven't had a use case for XML parsing yet and probably won't have time to fix this one. Definitely accepting pull requests. Sorry about that.

@katanacrimson
Author

hmm. looking through this...it's just like everything before the <games> element is just...removed, almost as if it's focusing on the last child only.

I've even tried injecting another element as a root element here, but no dice. Wonder what the underlying bug is here.

Are there any methods to get a simple "tree" structure of the entire loaded DOM - something simple to throw at util.inspect, perhaps? Would help with debugging in several cases.

@matthewmueller
Member

Take a look at var parse = require('cheerio').parse. This will take in a string and output the raw XML/DOM.

You can find an example of the parsing at https://github.com/MatthewMueller/cheerio-select.

There's no "pretty" print, but it should be pretty easy to write one with the raw DOM object. Good luck
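A minimal sketch of such a printer, assuming the raw nodes are plain objects with `type`, `name`, and `children` fields as in the util.inspect dumps above. `printTree` and the hand-built `sample` object are hypothetical names used only for illustration; they are not part of cheerio.

```javascript
// Hypothetical helper: render a raw DOM node as an indented outline.
// Assumes each node is a plain object with `type`, optional `name`,
// and optional `children`, like the dumps shown in this thread.
function printTree(node, depth) {
  depth = depth || 0;
  // Tags and the root carry a name; other nodes are labelled by type.
  var label = (node.type === 'tag' || node.type === 'root')
    ? node.name
    : '#' + node.type;
  var lines = [new Array(depth + 1).join('  ') + label];
  (node.children || []).forEach(function (child) {
    lines = lines.concat(printTree(child, depth + 1));
  });
  return lines;
}

// Hand-built stand-in for parser output, only to exercise the helper.
var sample = {
  type: 'root', name: 'root', children: [
    { type: 'tag', name: 'gamesList', children: [
      { type: 'tag', name: 'steamID64', children: [
        { type: 'text', data: '123' }
      ] },
      { type: 'tag', name: 'games', children: [] }
    ] }
  ]
};

console.log(printTree(sample).join('\n'));
```

Because it walks `children` directly rather than going through selectors, a printer like this should still show nodes that the selector engine fails to find.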

@katanacrimson
Author

Examining the structure a bit, trying to step through the tree - after injecting a <root> element toplevel into the XML and parsing, this is what I'm seeing...

> $('root').children()
{ '0':
   { type: 'tag',
     name: 'gamesList',
     attribs: {},
     children: [ [Object], [Object] ],
     prev: null,
     next:
      { type: 'tag',
        name: 'games',
        attribs: {},
        children: [Object],
        prev: [Circular],
        next: [Object],
        parent: [Object] },
     parent:
      { type: 'tag',
        name: 'root',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Object],
        parent: [Object] } },
  '1':
   { type: 'tag',
     name: 'games',
     attribs: {},
     children: [ [Object], [Object] ],
     prev:
      { type: 'tag',
        name: 'gamesList',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Circular],
        parent: [Object] },
     next:
      { type: 'tag',
        name: 'storeLink',
        attribs: {},
        children: [Object],
        prev: [Circular],
        next: null,
        parent: [Object] },
     parent:
      { type: 'tag',
        name: 'root',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Object],
        parent: [Object] } },
  '2':
   { type: 'tag',
     name: 'storeLink',
     attribs: {},
     children: [ [Object] ],
     prev:
      { type: 'tag',
        name: 'games',
        attribs: {},
        children: [Object],
        prev: [Object],
        next: [Circular],
        parent: [Object] },
     next: null,
     parent:
      { type: 'tag',
        name: 'root',
        attribs: {},
        children: [Object],
        prev: null,
        next: [Object],
        parent: [Object] } },
  length: 3 }

So, looking at this we've got a messed up structure interpretation. Really need a way to generate a tree here - give me a while, I'll see if I can come up with something if you don't have anything already. Won't be pretty though.
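One way to check whether the structure really is inconsistent is to compare each node's `parent`, `prev`, and `next` links against its position in the parent's `children` array. The sketch below assumes the node shape shown in the dumps above; `checkLinks` and `link` are hypothetical helpers, and the tree is hand-built rather than real parser output.

```javascript
// Hypothetical checker: report nodes whose parent/prev/next links
// disagree with their position in the parent's `children` array.
function checkLinks(node, problems) {
  problems = problems || [];
  var kids = node.children || [];
  kids.forEach(function (child, i) {
    var label = child.name || child.type;
    var expectedPrev = i > 0 ? kids[i - 1] : null;
    var expectedNext = i < kids.length - 1 ? kids[i + 1] : null;
    if (child.parent !== node) problems.push(label + ': wrong parent');
    if (child.prev !== expectedPrev) problems.push(label + ': wrong prev');
    if (child.next !== expectedNext) problems.push(label + ': wrong next');
    checkLinks(child, problems);
  });
  return problems;
}

// Test scaffolding only: wire up the links on a hand-built tree.
function link(node) {
  (node.children || []).forEach(function (child, i, kids) {
    child.parent = node;
    child.prev = kids[i - 1] || null;
    child.next = kids[i + 1] || null;
    link(child);
  });
  return node;
}

var tree = link({ type: 'root', name: 'root', children: [
  { type: 'tag', name: 'gamesList', children: [] },
  { type: 'tag', name: 'games', children: [] },
  { type: 'tag', name: 'storeLink', children: [] }
] });

console.log(checkLinks(tree));   // consistent tree: no problems reported
tree.children[1].prev = null;    // simulate a broken sibling link
console.log(checkLinks(tree));   // reports the broken `prev` on games
```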

@katanacrimson
Author

> buildDOMTree = function(elem, depth) { var ret = []; ret.push(new Array(depth + 1).join('-') + " " + elem.name); ret = ret.concat($(elem).children().map(function(i, el) { return buildDOMTree(el, (depth + 1)) })); return ret.join("\n"); }
[Function]
> console.log(buildDOMTree($('root'), 1))
- undefined
-- gamesList
--- steamID64
--- steamID
-- games
--- game
---- appID
---- name
--- logo
-- storeLink
undefined

Structure is strange as hell - the root object has three children (gamesList, games, storeLink), not one. gamesList isn't coming up in the selectors at all when directly queried, but it's listed under the children just fine - and it's a sibling of games.

I wonder if this is due to the extensive use of the CDATA? Has that been tested?

@katanacrimson
Author

Confirmed - use of <![CDATA[]]> breaks parsing. Regexp stripping of the CDATA regains proper structure.
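The exact regexp used isn't shown here, so the following is only a sketch of one way to do it. `unwrapCdata` is a hypothetical helper that replaces each CDATA section with its content, escaping `&`, `<`, and `>` so the text can't be mistaken for markup once the CDATA wrapper is gone.

```javascript
// Hypothetical workaround: unwrap CDATA sections before parsing.
// The content is entity-escaped so it stays plain text after unwrapping.
function unwrapCdata(xml) {
  return xml.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, function (match, content) {
    return content
      .replace(/&/g, '&amp;')   // must run first, before adding new entities
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');
  });
}

var input = '<name><![CDATA[Half-Life 2 & <Friends>]]></name>';
console.log(unwrapCdata(input));
// prints: <name>Half-Life 2 &amp; &lt;Friends&gt;</name>
```

The cleaned string can then be passed to the parser as usual. This is only a stopgap for parser versions affected by the CDATA bug.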

> console.log(buildDOMTree($('root'), 0))
 undefined
- gamesList
-- steamID64
-- steamID
-- games
--- game
---- appID
---- name
---- logo
---- storeLink
---- hoursLast2Weeks
---- hoursOnRecord
---- statsLink
---- globalStatsLink
--- game
---- appID
---- name
---- logo
---- storeLink
---- hoursLast2Weeks
---- hoursOnRecord
---- statsLink
---- globalStatsLink
--- game
---- appID
---- name
---- logo

[...]

@katanacrimson
Author

okay, so @matthewmueller - is this a problem with htmlparser2, or cheerio? Would like to know so a bug report can go upstream if necessary.

@matthewmueller
Member

Ahh okay, yah, it would be fb55/node-htmlparser then. A quick look at the parser shows that @fb55 has code for CDATA, so there must just be a bug.

This should get you started:
https://github.com/fb55/node-htmlparser/blob/master/lib/Parser.js#L149

@katanacrimson
Author

It appears that this has been fixed upstream - waiting on @fb55 to confirm when the fix will be shipped.

@fb55
Member

fb55 commented Feb 16, 2013

It was already published as 2.5.2.

@bencevans

@damianb does this work for you now (with htmlparser2@2.5.2)? I'm experiencing what I believe may be the same problem.

@katanacrimson
Author

@bencevans It was working fine for me at the time, but I don't currently have any code relying on cheerio for XML parsing. Got something you can turn into a reproducible test case?

@bencevans

Test Script:

var cheerio = require('cheerio');

var body = "<?xml version=\"1.0\" ?>\n<?xml-stylesheet type=\"text/xsl\" href=\"/xml/review.xsl\"?><ZPSupportInfo><ZPInfo><ZoneName>Office</ZoneName><ZoneIcon>x-rincon-roomicon:office</ZoneIcon><Configuration>1</Configuration><LocalUID>RINCON_000E585B7C9801400</LocalUID><SerialNumber>00-0E-58-5B-7C-98:D</SerialNumber><SoftwareVersion>21.4-61160c</SoftwareVersion><MinCompatibleVersion>21.1-00000</MinCompatibleVersion><HardwareVersion>1.16.4.1-2</HardwareVersion><IPAddress>192.168.2.22</IPAddress><MACAddress>00:0E:58:5B:7C:98</MACAddress><Copyright>© 2004-2007 Sonos, Inc. All Rights Reserved.</Copyright><ExtraInfo>OTP: 1.1.1(1-16-4-zp5s-0.5)</ExtraInfo><HTAudioInCode>0</HTAudioInCode><IdxTrk></IdxTrk></ZPInfo></ZPSupportInfo>";

var $ = cheerio.load(body, {
  xmlMode: true
});

console.log($('ZoneName'));

npm ls:

sonos@0.2.0 /home/bencevans/Development/Open/node-sonos
├─┬ cheerio@0.10.7 extraneous
│ ├─┬ cheerio-select@0.0.3
│ │ └─┬ CSSselect@0.3.1
│ │   └── CSSwhat@0.1.1
│ ├── entities@0.2.0
│ ├─┬ htmlparser2@2.5.2
│ │ ├── domelementtype@1.1.1
│ │ ├── domhandler@2.0.2
│ │ └── domutils@1.0.1
│ └── underscore@1.4.4
...

node@v0.8.19

@damianb that should(n't) do it. :P

@fb55
Member

fb55 commented Mar 9, 2013

@bencevans Your script doesn't contain any CDATA, so this is another issue, #147 to be precise. In case this doesn't lead to collisions, enable the lowerCaseTagNames option and it should work.
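To judge whether lowercasing would cause collisions in a given document, one option is to collect the tag names and look for names that differ only by case. This is a rough sketch; `findCaseCollisions` is a hypothetical helper, not part of cheerio or htmlparser2.

```javascript
// Hypothetical helper: group tag names by their lowercased form and
// return the groups that contain more than one distinct spelling.
function findCaseCollisions(tagNames) {
  var byLower = Object.create(null);   // no prototype, so any key is safe
  tagNames.forEach(function (name) {
    var key = name.toLowerCase();
    byLower[key] = byLower[key] || [];
    if (byLower[key].indexOf(name) === -1) byLower[key].push(name);
  });
  return Object.keys(byLower)
    .filter(function (key) { return byLower[key].length > 1; })
    .map(function (key) { return byLower[key]; });
}

// Made-up names for illustration; `zonename` is not in the posted XML.
var names = ['ZoneName', 'ZoneIcon', 'zonename', 'ZoneName'];
console.log(findCaseCollisions(names));   // ZoneName / zonename collide
```

An empty result suggests that lowercasing the tag names would not merge distinct elements.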

@bencevans

@fb55 Aaah yeh, thanks. Unfortunately lowerCaseTagNames isn't doing the job though.

@katanacrimson
Author

@bencevans loading it all into REPL works for me, through $._root though. Querying in the tree is borked it seems.

$ cat test.js && echo '---'
var cheerio = require('cheerio'),
body = "<?xml version=\"1.0\" ?>\n<?xml-stylesheet type=\"text/xsl\" href=\"/xml/review.xsl\"?><ZPSupportInfo><ZPInfo><ZoneName>Office</ZoneName><ZoneIcon>x-rincon-roomicon:office</ZoneIcon><Configuration>1</Configuration><LocalUID>RINCON_000E585B7C9801400</LocalUID><SerialNumber>00-0E-58-5B-7C-98:D</SerialNumber><SoftwareVersion>21.4-61160c</SoftwareVersion><MinCompatibleVersion>21.1-00000</MinCompatibleVersion><HardwareVersion>1.16.4.1-2</HardwareVersion><IPAddress>192.168.2.22</IPAddress><MACAddress>00:0E:58:5B:7C:98</MACAddress><Copyright>© 2004-2007 Sonos, Inc. All Rights Reserved.</Copyright><ExtraInfo>OTP: 1.1.1(1-16-4-zp5s-0.5)</ExtraInfo><HTAudioInCode>0</HTAudioInCode><IdxTrk></IdxTrk></ZPInfo></ZPSupportInfo>"

module.exports = cheerio.load(body, {
  xmlMode: true,
  lowerCaseTagNames: false,
})
---

$ node
> $ = require('./test')
> $._root.children[2]
{ type: 'tag',
  name: 'ZPSupportInfo',
  attribs: {},
  children:
   [ { type: 'tag',
       name: 'ZPInfo',
       attribs: {},
       children: [Object],
       prev: null,
       next: null,
       parent: [Circular] } ],
  prev:
   { name: '?xml-stylesheet',
     data: '?xml-stylesheet type="text/xsl" href="/xml/review.xsl"?',
     type: 'directive',
     parent:
      { type: 'root',
        name: 'root',
        parent: null,
        prev: null,
        next: null,
        children: [Object] },
     prev:
      { name: '?xml',
        data: '?xml version="1.0" ?',
        type: 'directive',
        parent: [Object],
        prev: null,
        next: [Circular] },
     next: [Circular] },
  next: null,
  parent:
   { type: 'root',
     name: 'root',
     parent: null,
     prev: null,
     next: null,
     children: [ [Object], [Object], [Circular] ] } }
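Since the raw tree looks intact, a possible stopgap while selector queries misbehave is to walk it by hand with an exact, case-sensitive name match. The sketch assumes nodes shaped like the dump above, with text nodes carrying their content in `data`. `findByName` and `textOf` are hypothetical helpers, and `root` is a hand-built stand-in for `$._root`.

```javascript
// Hypothetical helper: collect all tag nodes with an exact name match.
function findByName(node, name, found) {
  found = found || [];
  (node.children || []).forEach(function (child) {
    if (child.type === 'tag' && child.name === name) found.push(child);
    findByName(child, name, found);
  });
  return found;
}

// Hypothetical helper: concatenate the text content below a node.
function textOf(node) {
  if (node.type === 'text') return node.data;
  return (node.children || []).map(textOf).join('');
}

// Hand-built stand-in for $._root, trimmed to a single element.
var root = { type: 'root', name: 'root', children: [
  { type: 'directive', name: '?xml', data: '?xml version="1.0" ?' },
  { type: 'tag', name: 'ZPSupportInfo', children: [
    { type: 'tag', name: 'ZPInfo', children: [
      { type: 'tag', name: 'ZoneName', children: [
        { type: 'text', data: 'Office' }
      ] }
    ] }
  ] }
] };

console.log(textOf(findByName(root, 'ZoneName')[0]));   // prints: Office
console.log(findByName(root, 'zonename').length);       // prints: 0
```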
