
Full-page screenshot when extracting page URL #24

Open
michael-supreme opened this issue May 29, 2024 · 4 comments

Comments

michael-supreme commented May 29, 2024

I'm running thepipe locally to extract some page URLs for processing with GPT-4o, and it seems that the image generated for each page only captures the content above the fold (see example below). Is there a way to have it capture the entire page? (Perhaps an argument such as fullPage=True/False.)

My token limit for GPT4o as part of my plan is 10M, so I'm not overly concerned with hitting limits.

Example image: https://imgur.com/a/a06g3lh
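For reference, if thepipe drives the browser with Playwright (which the extractor.py snippet further down suggests), Playwright's screenshot API already exposes exactly this kind of flag, `full_page=True`. A minimal sketch, assuming a Playwright sync-API `Page` object (the helper name is made up, not thepipe's API):

```python
def capture_full_page(page, path="page.png"):
    """Capture the whole page in one image rather than per-viewport chunks.

    `page` is assumed to be a Playwright sync-API Page; full_page=True is
    Playwright's built-in option for a scrolling full-page screenshot.
    """
    return page.screenshot(path=path, full_page=True)
```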

emcf (Owner) commented May 31, 2024


Hey @michael-supreme, this should be the default behaviour already. In extractor.py there is:

# Get the viewport size and document size to scroll
viewport_height = page.viewport_size['height']
total_height = page.evaluate("document.body.scrollHeight")
# Scroll to the bottom of the page and take screenshots
current_scroll_position = 0
scrolldowns, max_scrolldowns = 0, 10 # in case of infinite scroll
while current_scroll_position < total_height and scrolldowns < max_scrolldowns:
    # rest of code...
    current_scroll_position += viewport_height
    page.evaluate(f"window.scrollTo(0, {current_scroll_position})")
    scrolldowns += 1

If it is not scrolling automatically for you, you can post the link you're trying to extract and I can take a closer look.
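The chunking arithmetic of that loop can be seen in isolation: it takes a screenshot at each viewport-sized offset, stopping once it passes the document height or hits the scrolldown cap. A standalone sketch of just the offsets it visits:

```python
def screenshot_offsets(viewport_height, total_height, max_scrolldowns=10):
    # Offsets at which the loop takes a screenshot: every viewport_height
    # pixels, stopping at the document height or at the infinite-scroll
    # safety cap, whichever comes first.
    offsets = []
    current = 0
    while current < total_height and len(offsets) < max_scrolldowns:
        offsets.append(current)
        current += viewport_height
    return offsets

# e.g. a 2000px-tall page through a 720px viewport yields three chunks,
# at scroll offsets 0, 720, and 1440.
```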

michael-supreme (Author) commented Jun 4, 2024

@emcf Seems it works on some pages but not others. For example, on this contact-us page, I get the full page captured in multiple screenshots, one for every 720px of page height.

But on this homepage, it stops after the second chunk (wondering if it fails due to scripts or animations on the page?).

Also, the homepage in the original post has the same issue, where it stops after the second screenshot.

emcf (Owner) commented Jun 13, 2024

@michael-supreme Thanks for providing these links to reproduce the issue; I'm still investigating.

michael-supreme (Author) commented

@emcf Just wanted to let you know that the issue also happens when setting the extraction to text_only=True: it appears to only extract the text content for the first 720px of the page.
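If the text_only path reuses the per-viewport scroll loop, that truncation shouldn't be necessary at all: the document's full visible text is available from the DOM in a single call, independent of scroll position. A hedged sketch (`innerText` is a standard DOM property; the helper name is made up):

```python
def extract_all_text(page):
    # Pull the whole document's visible text in one evaluate call; unlike
    # screenshotting, text extraction doesn't need to scroll the viewport.
    return page.evaluate("document.body.innerText")
```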
