Can't reach anything under noscript tag #1105

Closed
rightaway opened this issue Nov 5, 2017 · 11 comments
Comments

@rightaway

I've got cheerio.load(pageContent, { decodeEntities: false }), but then when I try to match an element that's under a noscript tag it doesn't work. I thought setting decodeEntities to false would allow this. How do I match that element?

@enricopolanski

Try cheerio.load(pageContent, { xmlMode: true });

@fb55 fb55 closed this as completed Feb 22, 2018
@ghost

ghost commented Mar 14, 2018

I'm seeing this too. This should not have been closed.

@fb55
Member

fb55 commented Mar 14, 2018

Enabling xml mode fixes this; otherwise parse5 will always strip noscript tags.

edit: See #1105 (comment) for how to fix this.

@ghost

ghost commented Mar 14, 2018

Right, but it's still an issue, right? It should stay open until it gets resolved.

@mitin001

Try to match the noscript tag itself, get the html by calling .html() and then load that html with cheerio again. That way you'll be able to match any element under the noscript tag.

@paulsmithkc

This is definitely still a bug with cheerio, as cheerio is essentially a browser without JavaScript support. The noscript tag is intended to provide content for browsers that do not support javascript (which would include search engines, web crawlers, and web scrapers).

"xmlMode: true" has several other side effects (see documentation) which can cause most pages, and especially those in question, to fail to parse.

@joeybaker

From a quick look, it seems https://github.com/cheeriojs/cheerio/blob/208bce1ee8ed921dbd0fc2988644fd3a68bf8bd1/lib/parse.js needs to be updated to set scriptingEnabled: false, as described in https://github.com/inikulin/parse5/blob/master/packages/parse5/docs/options/parser-options.md

I'm not sure if there are security implications of doing that though.

@paulsmithkc

paulsmithkc commented Mar 31, 2021

<noscript> tags should be parsed as HTML as that is what they contain, not text or javascript.

paulsmithkc added a commit to paulsmithkc/cheerio that referenced this issue Mar 31, 2021
The `scriptingEnabled` flag was added to parse5 in version 5.0.0 [parse5: ParserOptions](https://github.com/inikulin/parse5/blob/master/packages/parse5/docs/options/parser-options.md)  

`scriptingEnabled=true` will parse `<script>` tags as javascript and `<noscript>` tags as raw text.
`scriptingEnabled=false` will parse `<script>` tags as raw text and `<noscript>` tags as HTML.

The latter is the preferred default behavior for cheerio, as we do not want to execute the javascript, but do want to view the page as a scripts-disabled browser would.

See cheeriojs#1105 for discussion on this issue.
@dsmmcken

dsmmcken commented Jul 11, 2022

This is the first google result for "cheerio noscript". If you are attempting to parse html inside a noscript tag with cheerio, see the doc page linked below.

Set the scriptingEnabled option to false.

const $ = cheerio.load(data, { scriptingEnabled: false });

@ggorlen

ggorlen commented Feb 5, 2023

xmlMode is deprecated in favor of xml.

@rishi-raj-jain

9 participants