-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing HTML content inside <noscript> tags #1797
Conversation
I can't understand why change Cheerios current default behaviour |
The default behavior is not the correct or expected behavior. It doesn't make sense for users to need to opt-out of bad/abnormal behavior. Cheerio does not execute javascript anyway and is used primarily for web-scraping, which is exactly what the contents of the |
I am not against it but I simply dont like to change current default behaviour, since you have already option to explicitly pass in option |
But that behavior also is not documented anywhere in cheerio's docs. I checked. @joeybaker found it buried in parse5. |
true, documentation should be better. |
I don't think making this the default behavior is the right choice here. The current documentation should make it easier to find the option. Happy to accept further updates to the docs to make it clear. |
@fb55 why do you believe this is the wrong default behavior? I'd think it's unlikely that cheerio users will not want |
The current parsing behavior is in line with what a browser would present with JS enabled. Finding additional elements might come as a surprise. For users that do want this behavior, there is an opt-in. I am not stuck to this, but to me this trade-off makes the most sense at the moment. |
@fb55 Both me and @joeybaker agree that we expect HTML content of to be parsed as HTML and that the current behavior is unexpected. Perhaps it would be better to ask what your users/customers expect? Personally, I expect cheerio to behave like a JS disabled browser. I expect it to parse all of the HTML. Not to pick and choose what HTML it parses. |
Cheerio is not a web browser ... it simply tries to give a similar results than jQuery in browser. I dont think majority will expect Cheerio bring some additional tags into dom if they load web page with Cheerio. You expect same results than jQuery gives you. In concept Cheerio acting like "JS disabled browser" maybe good, but making it default behaviour will be breaking change with no good reason. |
The
scriptingEnabled
flag was added to parse5 in version 5.0.0 parse5: ParserOptionsscriptingEnabled=true
will parse<script>
tags as javascript and<noscript>
tags as raw text.scriptingEnabled=false
will parse<script>
tags as raw text and<noscript>
tags as HTML.The later is the preferred default behavior for cheerio. As we do not want to execute the javascript, but do want to view the page as a scripts-disabled browser would.
See #1105 for discussion on this issue.