feat: parseWithCheerio pierces iframe elements #2507

Closed
barjin opened this issue May 31, 2024 · 2 comments · Fixed by #2542
Labels: feature (issues that represent new features or improvements to existing features), t-tooling (issues with this label are in the ownership of the tooling team)

Comments

barjin (Contributor) commented May 31, 2024

Following the recent update adding the shadow root expansion to the parseWithCheerio method, it would be nice if we could expand iframe elements (their contents) in a similar manner.

While replacing the iframe element in a browser could cause some styling / security issues, I'm assuming it's just a simple XML tree modification for Cheerio.

barjin added the feature and t-tooling labels on May 31, 2024
B4nan (Member) commented Jun 11, 2024

It looks like, unless explicitly allowed, we can't access the content of an iframe from another origin at all. I was trying to get the HTML from inside the giscus comments iframe, and iframe.contentDocument only returns null there.

barjin (Contributor, Author) commented Jun 11, 2024

You're talking about the DOM API, right? I think there should be some way to look into the iframes from the Playwright orchestration (at least the docs say so).
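
A minimal sketch of what that could look like, assuming Playwright's `elementHandle.contentFrame()` and `frame.content()` are called from the automation side instead of reading `iframe.contentDocument` inside the page. The `PageLike`, `IframeHandleLike` and `FrameLike` interfaces and the `collectIframeHtml` helper are hypothetical names used for illustration; they only describe the few methods the sketch touches, and this is not necessarily how Crawlee does it.

```typescript
// Hypothetical structural types covering only the methods used below.
// Playwright's Page, ElementHandle and Frame are assumed to satisfy them.
interface FrameLike {
    content(): Promise<string>;
}
interface IframeHandleLike {
    contentFrame(): Promise<FrameLike | null>;
}
interface PageLike {
    $$(selector: string): Promise<IframeHandleLike[]>;
}

// Returns the serialized HTML of every iframe element found on the page,
// in the order the selector query returns them. An entry is null when no
// content frame could be resolved for that element.
async function collectIframeHtml(page: PageLike): Promise<(string | null)[]> {
    const handles = await page.$$('iframe');
    const htmls: (string | null)[] = [];
    for (const handle of handles) {
        const frame = await handle.contentFrame();
        htmls.push(frame ? await frame.content() : null);
    }
    return htmls;
}
```

With a real Playwright `page`, this would be called as `await collectIframeHtml(page)`; whether a given cross-origin frame is actually reachable this way should be verified against the Playwright docs.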

barjin self-assigned this on Jun 18, 2024
barjin added a commit that referenced this issue on Jun 20, 2024
Replaces the `iframe` elements with their contents in `<div class="crawlee-iframe-replacement"></div>` element.

Closes #2507
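
The replacement described in the commit message can be pictured as a plain tree rewrite. Below is a minimal sketch over a hypothetical in-memory node type; it is not Cheerio's actual API and not the code from the commit. `HtmlNode`, `replaceIframes` and the `frameContents` lookup (keyed by the iframe's `src`) are assumptions made for illustration. Only the `crawlee-iframe-replacement` class name comes from the commit message.

```typescript
// Hypothetical minimal element tree; real code would operate on Cheerio's DOM instead.
interface HtmlNode {
    tag: string;
    attrs: Record<string, string>;
    children: (HtmlNode | string)[];
}

// Returns a new tree in which every <iframe> is replaced by
// <div class="crawlee-iframe-replacement"> wrapping that frame's content.
// frameContents maps an iframe's src to its already-parsed content (an
// assumption of this sketch); iframes without an entry get an empty div.
function replaceIframes(
    node: HtmlNode,
    frameContents: Map<string, (HtmlNode | string)[]>,
): HtmlNode {
    if (node.tag === 'iframe') {
        const content = frameContents.get(node.attrs['src'] ?? '') ?? [];
        return {
            tag: 'div',
            attrs: { class: 'crawlee-iframe-replacement' },
            children: content,
        };
    }
    // Any other element is rebuilt with its children processed recursively.
    return {
        ...node,
        children: node.children.map((child) =>
            typeof child === 'string' ? child : replaceIframes(child, frameContents),
        ),
    };
}
```

The input tree is left unmodified, so the original iframe nodes stay available if a caller still needs their attributes.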