HTML that doesn't parse correctly (but doesn't fail either) #45

Closed
atifaziz opened this issue Aug 23, 2015 · 3 comments

@atifaziz
Owner

Originally reported on Google Code with ID 45

I've been using Fizzler with great success, but today I came across some HTML that silently
failed to parse correctly.

I was selecting all of the <a> elements and noticed that one was being ignored. Here
are the repro steps:

1. Load the HTML from http://pastebin.com/T1Lsr6w6 (this is the "View Source" for http://www.diapers.com/product/productdetail.aspx?productid=16913)
2. Try to query the selector "#pdp"
3. Run this example code (assuming the string variable html holds the HTML above):

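using HtmlAgilityPack;                  // HtmlDocument, HtmlNode
using Fizzler.Systems.HtmlAgilityPack;  // QuerySelector extension method

var doc = new HtmlDocument();
doc.LoadHtml(html);
var dom = doc.DocumentNode;
var pdpElement = dom.QuerySelector("#pdp"); // comes back null instead of the <a id="pdp"> node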


What is the expected output? What do you see instead?
Expect pdpElement to be an HtmlNode of <a href="http://c1.diapers.com/images/products/p/pg/pg-256_1z.jpg"
class="MagicZoomPlus" id="pdp" title="Pampers Sensitive Thick Baby Wipes Refill 360ct."
target="_blank">

Instead, it doesn't find a match.

What version of the product are you using? On what operating system?
Fizzler 0.9

Please provide any additional information below.

Reported by portman.wills on 2011-04-06 19:36:38

@atifaziz
Owner Author

I narrowed down the error slightly.

Using VisualFizzler (neat tool!) I can see that everything up to line 282 is selectable
(for example "#siteNav").

But from line 283 onward, I can't select anything (for example "div.topToolBox").

So the issue has to do with long lines like the one on line 283 of that pastebin example.

Reported by portman.wills on 2011-04-06 19:59:05

@atifaziz
Owner Author

Sure enough, when I remove this line (#283) from the HTML, everything works perfectly.
It's pathologically long (51,553 characters in fact!!) so this is probably a defect
in one of the underlying framework classes that Fizzler is using.

In the meantime, I've changed my code to chop long lines at 1024 characters before
handing off to Fizzler, and everything is working again. But you still might want to
investigate what precisely is going wrong on that long line, so I'll keep the issue
open.
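
For anyone hitting the same problem, here is a minimal sketch of the line-chopping workaround described above. The helper name ChopLongLines and the exact splitting strategy are assumptions made for illustration, since the original code was not posted. It simply inserts a newline whenever a line reaches the limit, so a break can land inside a tag or attribute value; treat it as a stopgap, not a fix.

```csharp
using System;
using System.Text;

static class LongLineWorkaround
{
    // Hypothetical helper; not part of Fizzler or HtmlAgilityPack.
    // Inserts '\n' so that no line exceeds maxLength characters.
    public static string ChopLongLines(string html, int maxLength = 1024)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        if (maxLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxLength));

        var result = new StringBuilder(html.Length + html.Length / maxLength + 1);
        var column = 0; // characters emitted since the last line break

        foreach (var ch in html)
        {
            if (ch == '\n')
            {
                // Existing line break: keep it and start counting again.
                result.Append(ch);
                column = 0;
                continue;
            }

            if (column == maxLength)
            {
                // Line is full: force a break before emitting the next character.
                result.Append('\n');
                column = 0;
            }

            result.Append(ch);
            column++;
        }

        return result.ToString();
    }
}
```

It would be applied just before loading, for example doc.LoadHtml(LongLineWorkaround.ChopLongLines(html)); and the selectors queried as before.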

Reported by portman.wills on 2011-04-06 20:08:17

@atifaziz
Owner Author

We're using HTMLAgilityPack so it's probably an issue there, but it should be fairly
trivial to swap out HTMLAgilityPack for another parser. It could also be that this
issue has been fixed by a more recent version of HTMLAgilityPack than the one in the
download.

Reported by info%colinramsay.co.uk@gtempaccount.com on 2011-04-07 13:48:49

@atifaziz closed this as not planned on Jul 11, 2024