XML file loaded incorrectly, results missing #131
Comments
Ahh, yah... the main focus/demand has been for HTML parsing and manipulation, and as a result the XML piece of the library hasn't been well maintained. I haven't had a use case for XML parsing yet and probably won't have time to fix this one. Definitely accepting pull requests. Sorry about that.
hmm. looking through this...it's just like everything before the I've even tried injecting another element as a root element here, but no dice. Wonder what the underlying bug is here. Are there any methods to get a simple "tree" structure of the entire loaded DOM - something simple to throw at util.inspect, perhaps? Would help with debugging in several cases. |
Take a look at You can find an example of the parsing at https://github.com/MatthewMueller/cheerio-select. There's no "pretty" print, but it should be pretty easy to write one with the raw DOM object. Good luck.
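As a rough illustration of the "pretty print" idea above, here is a minimal sketch of a tree dumper for the raw DOM object. It assumes nodes shaped like the dumps in this thread (`type`, `name`, `data`, `children`); `formatTree` is a hypothetical helper name, not something cheerio provides, and the sample object is hand-built rather than real parser output.

```javascript
// Hypothetical helper (not part of cheerio): render a raw DOM node as
// an array of indented lines, one per node.
function formatTree(node, depth) {
  depth = depth || 0;
  var indent = new Array(depth + 1).join('  ');
  var label;
  if (node.type === 'text') {
    label = '#text ' + JSON.stringify(node.data);
  } else if (node.name) {
    label = node.type + ' <' + node.name + '>';
  } else {
    label = String(node.type);
  }
  var lines = [indent + label];
  // Descend through `children` only. `parent`, `prev` and `next` are
  // back-references, so following them would never terminate.
  (node.children || []).forEach(function (child) {
    lines = lines.concat(formatTree(child, depth + 1));
  });
  return lines;
}

// Hand-built sample with a circular `parent` link, as in the real dumps.
var text = { type: 'text', data: 'Portal' };
var games = { type: 'tag', name: 'games', children: [text] };
var gamesList = { type: 'tag', name: 'gamesList', children: [games] };
text.parent = games;
games.parent = gamesList;

console.log(formatTree(gamesList).join('\n'));
```

Because the walk ignores the sibling and parent links, the output stays readable where util.inspect would fill up with [Circular] and [Object] markers.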
Examining the structure a bit, trying to step through the tree - after injecting a
{ '0':
{ type: 'tag',
name: 'gamesList',
attribs: {},
children: [ [Object], [Object] ],
prev: null,
next:
{ type: 'tag',
name: 'games',
attribs: {},
children: [Object],
prev: [Circular],
next: [Object],
parent: [Object] },
parent:
{ type: 'tag',
name: 'root',
attribs: {},
children: [Object],
prev: null,
next: [Object],
parent: [Object] } },
'1':
{ type: 'tag',
name: 'games',
attribs: {},
children: [ [Object], [Object] ],
prev:
{ type: 'tag',
name: 'gamesList',
attribs: {},
children: [Object],
prev: null,
next: [Circular],
parent: [Object] },
next:
{ type: 'tag',
name: 'storeLink',
attribs: {},
children: [Object],
prev: [Circular],
next: null,
parent: [Object] },
parent:
{ type: 'tag',
name: 'root',
attribs: {},
children: [Object],
prev: null,
next: [Object],
parent: [Object] } },
'2':
{ type: 'tag',
name: 'storeLink',
attribs: {},
children: [ [Object] ],
prev:
{ type: 'tag',
name: 'games',
attribs: {},
children: [Object],
prev: [Object],
next: [Circular],
parent: [Object] },
next: null,
parent:
{ type: 'tag',
name: 'root',
attribs: {},
children: [Object],
prev: null,
next: [Object],
parent: [Object] } },
length: 3 }
So, looking at this, we've got a messed-up structure interpretation. Really need a way to generate a tree here - give me a while, I'll see if I can come up with something if you don't have anything already. Won't be pretty though.
Structure is strange as hell - root object has two children, not one. gamesList isn't coming up in the selectors at all when directly queried, but it's listed under the children just fine - and it's a sibling of games. I wonder if this is due to the extensive use of the CDATA? Has that been tested?
Confirmed - use of
okay, so @matthewmueller - is this a problem with htmlparser2, or cheerio? Would like to know so a bug report can go upstream if necessary.
Ahh okay, yah it would be This should get you started: |
This appears to be fixed upstream - waiting on @fb55 to confirm when the fix will be shipped.
It was already published as 2.5.2.
@damianb does this work for you now (with htmlparser2@2.5.2)? I'm experiencing what I believe may be the same problem.
@bencevans It was working fine for me at the time, but I don't have any code relying on cheerio for XML parsing anymore. Got something you can turn into a reproducible test case?
Test Script:
var cheerio = require('cheerio');
var body = "<?xml version=\"1.0\" ?>\n<?xml-stylesheet type=\"text/xsl\" href=\"/xml/review.xsl\"?><ZPSupportInfo><ZPInfo><ZoneName>Office</ZoneName><ZoneIcon>x-rincon-roomicon:office</ZoneIcon><Configuration>1</Configuration><LocalUID>RINCON_000E585B7C9801400</LocalUID><SerialNumber>00-0E-58-5B-7C-98:D</SerialNumber><SoftwareVersion>21.4-61160c</SoftwareVersion><MinCompatibleVersion>21.1-00000</MinCompatibleVersion><HardwareVersion>1.16.4.1-2</HardwareVersion><IPAddress>192.168.2.22</IPAddress><MACAddress>00:0E:58:5B:7C:98</MACAddress><Copyright>© 2004-2007 Sonos, Inc. All Rights Reserved.</Copyright><ExtraInfo>OTP: 1.1.1(1-16-4-zp5s-0.5)</ExtraInfo><HTAudioInCode>0</HTAudioInCode><IdxTrk></IdxTrk></ZPInfo></ZPSupportInfo>";
var $ = cheerio.load(body, {
  xmlMode: true
});
console.log($('ZoneName'));
node@v0.8.19
@damianb that should(n't) do it. :P
@bencevans Your script doesn't contain any CDATA, so this is another issue, #147 to be precise. In case this doesn't lead to collisions, enable the |
@fb55 Aaah yeh, thanks. Unfortunately |
@bencevans loading it all into REPL works for me, through
$ cat test.js && echo '---'
var cheerio = require('cheerio'),
body = "<?xml version=\"1.0\" ?>\n<?xml-stylesheet type=\"text/xsl\" href=\"/xml/review.xsl\"?><ZPSupportInfo><ZPInfo><ZoneName>Office</ZoneName><ZoneIcon>x-rincon-roomicon:office</ZoneIcon><Configuration>1</Configuration><LocalUID>RINCON_000E585B7C9801400</LocalUID><SerialNumber>00-0E-58-5B-7C-98:D</SerialNumber><SoftwareVersion>21.4-61160c</SoftwareVersion><MinCompatibleVersion>21.1-00000</MinCompatibleVersion><HardwareVersion>1.16.4.1-2</HardwareVersion><IPAddress>192.168.2.22</IPAddress><MACAddress>00:0E:58:5B:7C:98</MACAddress><Copyright>© 2004-2007 Sonos, Inc. All Rights Reserved.</Copyright><ExtraInfo>OTP: 1.1.1(1-16-4-zp5s-0.5)</ExtraInfo><HTAudioInCode>0</HTAudioInCode><IdxTrk></IdxTrk></ZPInfo></ZPSupportInfo>"
module.exports = cheerio.load(body, {
  xmlMode: true,
  lowerCaseTagNames: false,
})
---
$ node
> $ = require('./test')
> $._root.children[2]
{ type: 'tag',
name: 'ZPSupportInfo',
attribs: {},
children:
[ { type: 'tag',
name: 'ZPInfo',
attribs: {},
children: [Object],
prev: null,
next: null,
parent: [Circular] } ],
prev:
{ name: '?xml-stylesheet',
data: '?xml-stylesheet type="text/xsl" href="/xml/review.xsl"?',
type: 'directive',
parent:
{ type: 'root',
name: 'root',
parent: null,
prev: null,
next: null,
children: [Object] },
prev:
{ name: '?xml',
data: '?xml version="1.0" ?',
type: 'directive',
parent: [Object],
prev: null,
next: [Circular] },
next: [Circular] },
next: null,
parent:
{ type: 'root',
name: 'root',
parent: null,
prev: null,
next: null,
children: [ [Object], [Object], [Circular] ] } }
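To tell a parsing problem apart from a selector problem, it can help to search the raw DOM by exact tag name, bypassing the selector engine entirely. The sketch below does that with a case-sensitive comparison. `findByName` is a hypothetical helper, and the sample tree is hand-built to mirror the REPL output above; with cheerio you would presumably pass `$._root` instead.

```javascript
// Hypothetical helper (not part of cheerio): collect every tag node
// whose name matches exactly, including case, by walking `children`.
function findByName(node, name, found) {
  found = found || [];
  if (node.type === 'tag' && node.name === name) {
    found.push(node);
  }
  (node.children || []).forEach(function (child) {
    findByName(child, name, found);
  });
  return found;
}

// Hand-built stand-in for the parsed document shown in the REPL session.
var root = {
  type: 'root',
  name: 'root',
  children: [
    { type: 'directive', name: '?xml', data: '?xml version="1.0" ?' },
    {
      type: 'tag',
      name: 'ZPSupportInfo',
      children: [
        {
          type: 'tag',
          name: 'ZPInfo',
          children: [
            {
              type: 'tag',
              name: 'ZoneName',
              children: [{ type: 'text', data: 'Office' }]
            }
          ]
        }
      ]
    }
  ]
};

var exact = findByName(root, 'ZoneName');
var lowered = findByName(root, 'zonename');
console.log(exact.length, lowered.length); // prints: 1 0
```

If the exact-case lookup finds the node but a selector does not, the tree itself is intact and the mismatch is more likely in how tag names are compared.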
For some strange reason, when attempting to load some very heavy XML pages for algorithmic stuff, I'm seeing parts of the XML tree just...disappear, unable to be selected at all.
http://steamcommunity.com/id/katana_/games/?tab=all&xml=1
Here's a copy of the XML in case the above changes: https://gist.github.com/4248909
And, some results of tinkering in REPL with it:
So, it thinks the <games> tree is there, but not its parent, nor its own siblings. And yes, I've got xmlMode enabled.