[Question] Best practice to cleanup HTML Spaghetti code #630

JpEncausse · 2023-08-09T11:00:49Z

Question or comment

I need to clean up an arbitrary HTML page to extract its readable content. Modern websites use a lot of spaghetti HTML. For instance:

<div> <div><div> <a href="/"> <div>Title of the site</div> </a> <div lazyload="event"><!--lazy <div class="headerPageHtml"><a href="/include/news.xml" target="_blank"><img id="socialRss" alt="access to rss" src="/asset/social/rss.png"></a></div></div></div>

In this example I don't want all the <div> wrappers or the formatting markup such as <div><img></div>.
Should I strip all the <div> tags? Or is there a cleverer way?


boutell · 2023-08-09T11:54:56Z

It's unclear what you're trying to do exactly, but sanitize-html is quite good at keeping only the tags and attributes you approve, as you can see in the documentation. If you want to do more subtle things, there are transformation features. If your needs exceed that, then you might consider using sanitize-html as a first pass and then cheerio for the transformations.
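As a rough sketch of the "keep only what you approve" approach: the options object below uses sanitize-html's `allowedTags` and `allowedAttributes` settings. The particular tags and attributes listed are assumptions chosen for illustration, not a recommendation from the thread. With the default settings, tags that are not on the list (such as the nested `<div>` wrappers and the `<img>`) should be dropped while their text content is kept.

```javascript
// Options for sanitize-html: keep only a small set of content tags.
// The tag and attribute choices here are illustrative assumptions.
const options = {
  // Anything not listed (div, img, span, ...) is discarded; its text is kept.
  allowedTags: ['a', 'p', 'h1', 'h2', 'h3', 'ul', 'ol', 'li', 'b', 'i', 'strong', 'em'],
  // Only href survives on links; target, class, id, etc. are removed.
  allowedAttributes: { a: ['href'] },
};

// Usage (requires `npm install sanitize-html`):
//   const sanitizeHtml = require('sanitize-html');
//   const clean = sanitizeHtml(dirtyHtml, options);
```

If the result still needs restructuring, the cleaned string can then be handed to cheerio as a second pass.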

coreyward · 2023-10-02T21:45:36Z

I came here to report a similar issue. An unclosed attribute (missing final double-quote) will cause everything from the start of that tag through to the end of the input to be stripped by sanitize-html.

//                                   ↓ Missing double-quote
sanitize(`Hello, world. <a href="/this>this</a> is a demo of this behavior. <b>I won't be in the output!</b>`)
// => 'Hello, world. '

boutell · 2023-10-03T14:49:09Z

Angle brackets are not forbidden in quoted HTML attributes, and in fact this document produces the expected title on hover in Chrome:

<h4 title="this is a title<containing><punctuation>">h4 body</h4>

If both the standard and actual browsers permit it, then sanitize-html can't reliably detect that it is "wrong" (because it isn't, strictly speaking). This behavior also comes from the htmlparser2 module rather than from sanitize-html itself, but keep in mind that it is not a bug before reporting anything there.

coreyward · 2023-10-04T21:16:01Z

@boutell Got it. So in the case of invalid HTML (the double quote never closes anywhere), is there any way to get an error back instead of having large portions of the input stripped out?

boutell · 2023-10-06T13:50:18Z

If you mean at the very end of the fragment, when you're absolutely sure no closing quote is coming, it looks like htmlparser2 always tidies up at the end by closing whatever isn't closed, and we're downstream of that. There may or may not be htmlparser2 options that modify this behavior.
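One possible workaround is to check the input yourself before sanitizing it. The sketch below is a heuristic, not a feature of sanitize-html or htmlparser2. The function name `findUnterminatedAttributeQuote` is hypothetical. It only catches the case where a quote opened inside a tag is still open at the very end of the input. It does not understand comments or script contents, and if another matching quote appears later in the text it will treat the attribute as closed, just as a browser would.

```javascript
// Hypothetical helper, not part of sanitize-html or htmlparser2.
// Returns the index of the "<" of the tag whose attribute quote is still
// open at the end of the input, or -1 if every opened quote gets closed.
function findUnterminatedAttributeQuote(html) {
  let tagStart = -1; // index of the "<" of the tag being scanned, or -1
  let quote = null;  // the quote character currently open, or null

  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (quote !== null) {
      // Inside a quoted attribute value: only the matching quote ends it.
      if (ch === quote) quote = null;
    } else if (tagStart !== -1) {
      // Inside a tag but outside any quoted value.
      if (ch === '"' || ch === "'") quote = ch;
      else if (ch === '>') tagStart = -1;
    } else if (ch === '<' && /[A-Za-z\/]/.test(html[i + 1] || '')) {
      // A "<" followed by a letter or "/" is treated as the start of a tag.
      tagStart = i;
    }
  }
  return quote !== null ? tagStart : -1;
}

const input = `Hello, world. <a href="/this>this</a> is a demo of this behavior. <b>I won't be in the output!</b>`;
const badTagIndex = findUnterminatedAttributeQuote(input);
if (badTagIndex !== -1) {
  console.log(`Unterminated attribute quote in tag starting at index ${badTagIndex}`);
}
```

When the check returns something other than -1, the caller can reject the input or raise its own error instead of passing it on to the sanitizer.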


coreyward · 2023-10-12T00:00:08Z

Okay, I think we can live with that for now. Thank you!

JpEncausse added the question label Aug 9, 2023

boutell closed this as completed Oct 3, 2023
