Add puppeteer util to block ads, trackers, and annoyances #600
Hi @remusao, thanks for the very well structured PR. It's always a pleasure to review those. Before I start digging into it, I see that you're using request interception to block ads. Sadly, unless something changed without my knowledge, request interception still disables cache in Puppeteer, which means that the performance gains from blocking ads would be diminished. Could you explain why you're not using simple URL pattern blocking, as we do with
Thanks.
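For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of request interception in Puppeteer. It relies on `page.setRequestInterception()`, the `request` event, and `request.abort()` / `request.continue()`. The helper names `makeInterceptionHandler` and `enableBlocking` and the `shouldBlock` predicate are hypothetical and are not taken from the PR.

```javascript
// Build a handler for Puppeteer's 'request' event.
// `shouldBlock` is a hypothetical predicate: (url: string) => boolean.
function makeInterceptionHandler(shouldBlock) {
  return (request) => {
    if (shouldBlock(request.url())) {
      request.abort(); // drop the request before it hits the network
    } else {
      request.continue(); // let everything else through unchanged
    }
  };
}

// Wire the handler onto a Puppeteer page. Interception must be enabled
// first, otherwise abort()/continue() are rejected by Puppeteer.
async function enableBlocking(page, shouldBlock) {
  await page.setRequestInterception(true);
  page.on('request', makeInterceptionHandler(shouldBlock));
}
```

Every request passes through the handler once interception is on, which is what makes arbitrary blocking logic possible and is also the mode that the comment above reports as disabling the cache.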
Hi @mnmkng, Thank you for your quick feedback. Let me try to address your initial points.
I was not aware of this limitation, and it is a sad one... It could be that the speed benefit still outweighs the cost of not caching on some websites, but probably not on all.
I was honestly not aware of this API but after looking at the documentation it seems that it is much too limited to handle the use-case of adblocking. It only supports a list of patterns with optional use of
And more... Combinations of those features are also possible. If you are interested, you can have a look at this blog post which dives deeper into how the rules of adblockers are defined as well as how There is also a compatibility matrix here listing the features supported by the filtering engine of
Best,
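To illustrate why a flat list of URL patterns falls short, here is a minimal sketch of two features from the Adblock Plus-style filter syntax used by lists such as Easylist: the `||hostname^` anchor and the `$third-party` option. This is a deliberately simplified assumption-laden example, not the filtering engine used in the PR; `parseFilter`, `hostnameMatches` and `shouldBlock` are hypothetical names, and third-party detection is reduced to a plain hostname comparison.

```javascript
// Parse a tiny subset of Adblock-style network filters:
//   ||hostname^              -> match the hostname and its subdomains
//   ||hostname^$third-party  -> same, but only for third-party requests
function parseFilter(line) {
  const [pattern, options = ''] = line.split('$');
  if (!pattern.startsWith('||') || !pattern.endsWith('^')) {
    return null; // anything else is unsupported in this sketch
  }
  return {
    hostname: pattern.slice(2, -1),
    thirdPartyOnly: options.split(',').includes('third-party'),
  };
}

// True if `hostname` is `anchor` itself or one of its subdomains.
function hostnameMatches(hostname, anchor) {
  return hostname === anchor || hostname.endsWith('.' + anchor);
}

// Decide whether a request should be blocked given the page it came from.
function shouldBlock(filters, requestUrl, pageUrl) {
  const requestHost = new URL(requestUrl).hostname;
  const pageHost = new URL(pageUrl).hostname;
  // Simplification: real engines compare registrable domains (eTLD+1).
  const isThirdParty = requestHost !== pageHost;
  return filters.some(
    (f) => hostnameMatches(requestHost, f.hostname) && (!f.thirdPartyOnly || isThirdParty),
  );
}
```

The decision depends on the page that issued the request as well as on the request URL, which a URL-only pattern list cannot express.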
Thanks for the resources @remusao. We don't have the time to do the necessary performance tests now, but let me get back to you in March and we'll see whether it's faster to keep cache or block ads.
Just to add a comment to the pull request, I am using the uBlock extension to block ads and trackers. That way Puppeteer still caches the requests and I can use uBlock to block unwanted requests without using request interception. This is the only good workaround I could come up with to be able to cache requests and block unwanted ones. I don't really see a way to add the functionality to apify, or think it is a good idea, but maybe it will be good enough to just update the documentation and add a page on how to block unwanted requests while keeping the built-in caching that Puppeteer has.
@Nicklason Are you running Puppeteer headful? Or has there been a change and Puppeteer now allows extensions even in headless?
@mnmkng I am very happy that you asked about that because I just assumed that it would work in headless too, but it looks like extensions are disabled when running headless. I will try and play around with it and see if I can come up with something.
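The extension workaround described above could be sketched roughly as follows. This assumes an unpacked copy of the extension on disk and uses the Chrome flags `--disable-extensions-except` and `--load-extension`; the helper name `extensionLaunchOptions` and the `./ublock` path are hypothetical, and the browser has to run headful, as noted in the comment above.

```javascript
const path = require('path');

// Build Puppeteer launch options that load one unpacked extension.
// `extensionDir` is a hypothetical path to the unpacked extension folder.
function extensionLaunchOptions(extensionDir) {
  const dir = path.resolve(extensionDir);
  return {
    headless: false, // extensions were reported above as disabled in headless mode
    args: [
      `--disable-extensions-except=${dir}`,
      `--load-extension=${dir}`,
    ],
  };
}

// Usage (requires puppeteer to be installed):
// const browser = await puppeteer.launch(extensionLaunchOptions('./ublock'));
```

Because no request interception is involved, the browser cache stays active; the trade-off is the need for a display (or a virtual one) to run headful.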
As a follow-up to #456, I would like to propose the following implementation for an integration of blocking ads, trackers and annoyances into Apify. The expected benefits are faster crawls thanks to the reduced number of requests (all ads and trackers can be blocked, leading to fewer requests, less JavaScript to run and less data to download). The breakage should also be fairly low thanks to the use of actively maintained lists from the Easylist and uBlock Origin projects (updated continuously).

The implementation is strongly inspired by the existing utils used to block requests (e.g. blockRequests()) and all the new logic is located inside of src/puppeteer_utils.js. The new blockAdsAndTrackers(page, [options]) can be used to enable blocking of requests based on rules found in the widely-used subscriptions: Easylist, uBlock Origin filters, etc. (see the docstring for more information). The following behaviors are implemented and can be selected when calling blockAdsAndTrackers():
- … (blockTrackers options)
- … (blockAnnoyances)

Currently, the implementation will perform requests to GitHub (where lists of rules are hosted) to initialize the blocking engine, then cache the state on disk. This behavior can be tweaked in different ways depending on what the maintainers judge acceptable:
- … postinstall hook instead of dynamically,

There are also things which I did not commit in this initial PR as I was not sure if I should do it, or if this would be done as part of the releasing process:
I am happy to receive any feedback on this and make changes accordingly,
Best,