feat: add iframe expansion to parseWithCheerio in browsers #2542

Merged
barjin merged 6 commits into master from feat/cheerio-iframe-expansion
Jun 20, 2024

barjin (Contributor) commented Jun 18, 2024

Replaces each iframe element with its contents, wrapped in a <div class="crawlee-iframe-replacement"></div> element.

Closes #2507
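
A minimal sketch of the replacement step described above. It is not the code merged in this PR: the types `FrameLike` and `IframeHandleLike`, the `replaceWith` helper, and the function name `expandIframes` are hypothetical stand-ins invented for illustration, not Playwright, Puppeteer, or crawlee APIs. The sketch only shows the idea of wrapping each iframe's HTML in the `crawlee-iframe-replacement` div and skipping frames that cannot be read.

```typescript
// Hypothetical structural types, for illustration only. Real browser
// handles expose a much larger API than this.
interface FrameLike {
    content(): Promise<string>;
}

interface IframeHandleLike {
    contentFrame(): Promise<FrameLike | null>;
    // Hypothetical helper standing in for an in-page DOM replacement.
    replaceWith(html: string): Promise<void>;
}

const REPLACEMENT_CLASS = 'crawlee-iframe-replacement';

// Replaces every readable iframe with a wrapper div holding its HTML.
// Returns the number of iframes that were actually expanded.
async function expandIframes(iframes: IframeHandleLike[]): Promise<number> {
    let expanded = 0;

    await Promise.all(iframes.map(async (handle) => {
        try {
            const frame = await handle.contentFrame();
            if (!frame) return; // no content frame available, leave the iframe as-is

            const html = await frame.content();
            await handle.replaceWith(`<div class="${REPLACEMENT_CLASS}">${html}</div>`);
            expanded++;
        } catch {
            // Assumption: one unreadable frame should not fail the whole parse.
            // A real implementation would likely log this; the sketch skips it.
        }
    }));

    return expanded;
}
```

Once the iframes are expanded this way, their former contents can be queried with an ordinary selector scoped to `.crawlee-iframe-replacement`.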

barjin self-assigned this Jun 18, 2024
github-actions bot added this to the 92nd sprint - Tooling team milestone Jun 18, 2024
github-actions bot added labels Jun 18, 2024: t-tooling (Issues with this label are in the ownership of the tooling team.), tested (Temporary label used only programmatically for some analytics.)
barjin force-pushed the feat/cheerio-iframe-expansion branch from f20915e to 4f11d0f June 18, 2024 12:42
barjin requested a review from janbuchar June 18, 2024 12:43
@@ -191,6 +191,26 @@ export async function injectJQuery(page: Page, options?: { surviveNavigations?:
export async function parseWithCheerio(page: Page, ignoreShadowRoots = false): Promise<CheerioRoot> {
    ow(page, ow.object.validate(validators.browserPage));

    if (page.frames().length > 1) {

Contributor

Do we really need to duplicate this function?

Contributor Author

The thing is, @crawlee/playwright and @crawlee/puppeteer are separate packages, so we would have to create a new package for this shared code (no other crawlee package depends on, or can depend on, playwright or puppeteer(?)).

I see that these two are verbatim copies, but that's only because here we're using the subsets of the PW / PP interfaces that are identical... other utility methods differ between PW and PP. I like to think of these as platform-specific ports of the same features.

Contributor

Couldn't it be put in @crawlee/browser-crawler somehow?

Contributor Author

Because of what I mentioned above, it would be very awkward - see here:

export async function extractUrlsFromPage(
    // eslint-disable-next-line @typescript-eslint/ban-types
    page: { $$eval: Function },
    selector: string,
    baseUrl: string,
): Promise<string[]> {

Or here:

export interface CommonPage {
    close(...args: unknown[]): Promise<unknown>;
    url(): string | Promise<string>;
}

Dependency injection... or something, I guess.

With this as an alternative, I'm more than happy to have "duplicate" separate implementations for both libraries.
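
For readers following the trade-off, the dependency-injection approach could be sketched roughly like this. It reuses the `CommonPage` interface quoted above; the helper name `closeAndGetUrl` and the two fake pages are hypothetical and exist only to show that shared code must be written against the narrowest interface both libraries satisfy.

```typescript
// Same shape as the CommonPage interface quoted above.
interface CommonPage {
    close(...args: unknown[]): Promise<unknown>;
    url(): string | Promise<string>;
}

// Hypothetical shared helper: reads the URL, closes the page, returns the URL.
async function closeAndGetUrl(page: CommonPage): Promise<string> {
    // Awaiting a plain string resolves to that string, so one code path
    // covers both the synchronous and the promise-returning variant.
    const url = await page.url();
    await page.close();
    return url;
}

// Two fake pages, one per url() variant, standing in for real browser pages.
const syncPage: CommonPage = {
    close: async () => undefined,
    url: () => 'https://example.com/sync',
};

const asyncPage: CommonPage = {
    close: async () => undefined,
    url: async () => 'https://example.com/async',
};
```

The helper stays library-agnostic, but every additional method it needs has to be added to the interface by hand, which is the boilerplate cost weighed in this thread.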

Contributor

Hm, I guess you'd have to write quite a lot of boilerplate types. I guess I'm equally unhappy with both approaches.

Member

The @crawlee/browser package has optional peer dependencies on both playwright and puppeteer, so you can surely have code inside it that works with both of them. But to do that without hacks like ts-ignore comments and dynamic imports, you would need to introduce separate exports for each library that wouldn't be exported from the root index file. Probably not worth it now.

barjin requested a review from B4nan June 20, 2024 11:28

Member

B4nan left a comment

lgtm, I am curious if this works with the iframe we have in the crawlee docs (giscus) too

barjin merged commit 328d085 into master Jun 20, 2024
9 checks passed
barjin deleted the feat/cheerio-iframe-expansion branch June 20, 2024 12:29
Development

Successfully merging this pull request may close these issues.

feat: parseWithCheerio pierces iframe elements
3 participants