Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing HTML content inside <noscript> tags #1797

Closed
wants to merge 2 commits into from

Conversation

paulsmithkc
Copy link

The scriptingEnabled flag was added to parse5 in version 5.0.0 parse5: ParserOptions

scriptingEnabled=true will parse <script> tags as javascript and <noscript> tags as raw text.
scriptingEnabled=false will parse <script> tags as raw text and <noscript> tags as HTML.

The later is the preferred default behavior for cheerio. As we do not want to execute the javascript, but do want to view the page as a scripts-disabled browser would.

See #1105 for discussion on this issue.

@5saviahv
Copy link
Contributor

I can't understand why change Cheerios current default behaviour scriptingEnabled=true to scriptingEnabled=false.

@paulsmithkc
Copy link
Author

paulsmithkc commented Mar 31, 2021

The default behavior is not the correct or expected behavior. It doesn't make sense for users to need to opt-out of bad/abnormal behavior.

Cheerio does not execute javascript anyway and is used primarily for web-scraping, which is exactly what the contents of the noscript is meant to be delivered to. Not parsing noscript goes against the whole purpose of it.

@5saviahv
Copy link
Contributor

I am not against it but I simply dont like to change current default behaviour, since you have already option to explicitly pass in option scriptingEnabled=false.

@paulsmithkc
Copy link
Author

paulsmithkc commented Mar 31, 2021

But that behavior also is not documented anywhere in cheerio's docs. I checked. @joeybaker found it buried in parse5.

@5saviahv
Copy link
Contributor

true, documentation should be better.

@fb55
Copy link
Member

fb55 commented May 14, 2021

I don't think making this the default behavior is the right choice here. The current documentation should make it easier to find the option. Happy to accept further updates to the docs to make it clear.

@fb55 fb55 closed this May 14, 2021
@joeybaker
Copy link

@fb55 why do you believe this is the wrong default behavior? I'd think it's unlikely that cheerio users will not want <noscript> tags parsed as HTML?

@fb55
Copy link
Member

fb55 commented May 20, 2021

The current parsing behavior is in line with what a browser would present with JS enabled. Finding additional elements might come as a surprise. For users that do want this behavior, there is an opt-in.

I am not stuck to this, but to me this trade-off makes the most sense at the moment.

@paulsmithkc
Copy link
Author

@fb55 Both me and @joeybaker agree that we expect HTML content of to be parsed as HTML and that the current behavior is unexpected.

Perhaps it would be better to ask what your users/customers expect?

Personally, I expect cheerio to behave like a JS disabled browser. I expect it to parse all of the HTML. Not to pick and choose what HTML it parses.

@5saviahv
Copy link
Contributor

Cheerio is not a web browser ... it simply tries to give a similar results than jQuery in browser.

I dont think majority will expect Cheerio bring some additional tags into dom if they load web page with Cheerio. You expect same results than jQuery gives you.

In concept Cheerio acting like "JS disabled browser" maybe good, but making it default behaviour will be breaking change with no good reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants