Add browser rendering using headless chrome #115
Comments
@untoldbyte What is the use case for using puppeteer to render the HTML? Do you require page interaction?
Yes.
Is there any specific reason to use Chrome/puppeteer instead of Selenium?
@Ziinc We also did not use Selenium for real scraping back when I was working with Scrapy. My personal view of it:
So my understanding is that Selenium is not suitable for web crawling, or at least it was not when I last looked at it. I would try to build something similar to what is described here: https://blog.scrapinghub.com/how-to-use-a-proxy-in-puppeteer. It should be similar to Splash. (I was about to start on the task, but this summer was a bit chaotic for me :()
While there is no doubt that puppeteer would be great for browser automation, and there aren't many browser monoculture issues (since they are adding Firefox support soon), I am not particularly against adding puppeteer as a Fetcher module. I am only concerned about how we would reconcile page interaction with HTML parsing. The key pain point in resolving this issue is how to obtain further HTML from page interactions. For example, suppose the data we want is only rendered in modals that are toggled open with buttons. Using puppeteer, we can open each modal and scrape the data in a single request using a single Node.js script. However, doesn't that render the spider's A few ideas I have for this problem:
I think idea 2 is the better option, and idea 3 could be a nice-to-have.
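To make the modal scenario above concrete, here is a minimal sketch of a fetch step that clicks a list of buttons and returns one HTML snapshot per interaction, so the parsing side still receives plain HTML. Everything in it is an assumption made for illustration: `Page`, `FakePage`, and `fetch_with_interactions` are hypothetical names, not part of puppeteer or of this project, and the fake page only simulates a browser in memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


class Page(Protocol):
    """Hypothetical browser-page interface; not a real puppeteer API."""

    def click(self, selector: str) -> None: ...

    def content(self) -> str: ...


@dataclass
class FakePage:
    """In-memory stand-in for a headless browser page (illustration only)."""

    base_html: str
    modals: Dict[str, str]  # button selector -> HTML shown after clicking it
    _open: Optional[str] = None

    def click(self, selector: str) -> None:
        # Clicking a known button opens its modal; anything else closes it.
        self._open = selector if selector in self.modals else None

    def content(self) -> str:
        # The page HTML, plus the currently open modal if there is one.
        if self._open is None:
            return self.base_html
        return self.base_html + self.modals[self._open]


def fetch_with_interactions(page: Page, selectors: List[str]) -> List[str]:
    """Return the initial HTML plus one snapshot per button clicked."""
    snapshots = [page.content()]
    for selector in selectors:
        page.click(selector)
        snapshots.append(page.content())
    return snapshots
```

With this shape, each snapshot could be handed to the existing parsing code unchanged; whether that fits the Fetcher design is exactly the open question in this thread.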
@Ziinc Maybe you're right. I don't have a full picture of it yet. I would try to build a prototype and see what is required for production usage. From my experience of scraping, I would say that scraping something from modal windows is an extremely rare use case (I would even say almost never needed). In the vast majority of cases, a simple request to something like Splash (which is also scriptable, and more or less allows executing JS on the client side) was enough. For now I would consider headless Chrome as a way to overcome bans. Currently, our Amazon spiders are blocked after 2000-3000 requests, so I am looking for a standard way to do it better.
We could consider Microsoft's Playwright, which is cross-browser. It might make rotating the user agent easier.
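As a rough illustration of what rotating the user agent means here, the sketch below cycles through a list of user-agent strings and yields one set of request headers per request. The strings are placeholders and `rotating_headers` is a hypothetical helper; nothing here is Playwright or puppeteer API.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator

# Placeholder user-agent strings, assumed for illustration only.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


def rotating_headers(user_agents: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield request headers forever, cycling through the user agents in order."""
    for user_agent in cycle(user_agents):
        yield {"User-Agent": user_agent}
```

Each outgoing request would take the next item from the generator. Round-robin is only one possible policy; random selection would work just as well for this purpose.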
Hi
FeatureRequest - We can replace splash with headless chrome via puppeteer