Add puppeteer util to block ads, trackers, and annoyances #600

Closed
wants to merge 1 commit into from

@remusao remusao commented Feb 23, 2020

As a follow-up to #456, I would like to propose the following implementation for integrating ad, tracker, and annoyance blocking into Apify. The expected benefit is faster crawls: blocking ads and trackers means fewer requests, less JavaScript to run, and less data to download. Breakage should also be fairly low thanks to the use of actively maintained, continuously updated lists from the EasyList and uBlock Origin projects.

The implementation is strongly inspired by the existing utils used to block requests (e.g. blockRequests()), and all the new logic is located in src/puppeteer_utils.js. The new blockAdsAndTrackers(page, [options]) can be used to enable blocking of requests based on rules found in widely used subscriptions: EasyList, uBlock Origin filters, etc. (see the docstring for more information). The following behaviors are implemented and can be selected when calling blockAdsAndTrackers():

  • Block ads only (default behavior)
  • Block ads and trackers (using the blockTrackers option)
  • Block ads, trackers and annoyances, such as cookie popups, banners, etc. (using the blockAnnoyances option)
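
To make the three modes concrete, here is a minimal sketch of how the options might map to rule categories. The helper name selectRuleCategories and the category names are hypothetical and not taken from the PR, and the sketch assumes the two flags are independent, which the PR description does not confirm.

```javascript
// Hypothetical helper (not part of the PR): maps the options described
// above to the rule categories that would be loaded.
// Assumption: blockTrackers and blockAnnoyances are independent flags.
function selectRuleCategories(options = {}) {
    const { blockTrackers = false, blockAnnoyances = false } = options;
    const categories = ['ads']; // ads are blocked in every mode (default behavior)
    if (blockTrackers) categories.push('trackers');
    if (blockAnnoyances) categories.push('annoyances');
    return categories;
}

// Intended call shape of the proposed util, per the PR description
// (requires a Puppeteer `page`, so it is left as a comment):
//   await blockAdsAndTrackers(page, { blockTrackers: true, blockAnnoyances: true });

const defaultMode = selectRuleCategories();
const fullMode = selectRuleCategories({ blockTrackers: true, blockAnnoyances: true });
```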

Currently, the implementation will perform requests to GitHub (where lists of rules are hosted) to initialize the blocking engine, then cache the state on disk. This behavior can be tweaked in different ways depending on what the maintainers judge acceptable:

  • Hosting the rules as static files in the apify repository to avoid any network requests.
  • Customizing caching (disabling it, using some Apify API to persist the data, customizing the cache path, etc.).
  • Building the engine in a postinstall hook instead of dynamically.
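
The "build once, then cache on disk" behavior can be sketched roughly as follows. This is an illustration under assumptions rather than the PR's code: loadOrBuildEngineState is a hypothetical name, the builder is a stand-in for downloading the lists and serializing the engine, and everything is synchronous for brevity.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: returns the cached engine state if it exists,
// otherwise builds it and writes it to disk for the next run.
function loadOrBuildEngineState(cachePath, buildState) {
    if (fs.existsSync(cachePath)) {
        return fs.readFileSync(cachePath); // Buffer holding the serialized state
    }
    const state = buildState(); // stand-in for: fetch lists, build and serialize engine
    fs.mkdirSync(path.dirname(cachePath), { recursive: true });
    fs.writeFileSync(cachePath, state);
    return state;
}

// Usage with a stand-in builder and a fresh temporary directory.
const cacheDir = fs.mkdtempSync(path.join(os.tmpdir(), 'adblock-cache-'));
const cachePath = path.join(cacheDir, 'engine.bin');
let builds = 0;
const buildState = () => { builds += 1; return Buffer.from('serialized-engine'); };

const firstLoad = loadOrBuildEngineState(cachePath, buildState); // builds and caches
const secondLoad = loadOrBuildEngineState(cachePath, buildState); // served from disk
```

Making the cache path a parameter is what would allow the customizations listed above (a different location, or no cache at all if the helper is bypassed).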

There are also things which I did not commit in this initial PR because I was not sure whether I should, or whether they are done as part of the release process:

  • Re-generating the docs
  • Re-generating the types
  • Updating CHANGELOG.md

I am happy to receive any feedback on this and make changes accordingly,
Best,

@mnmkng
Member

mnmkng commented Feb 23, 2020

Hi @remusao,

thanks for the very well structured PR. It's always a pleasure to review those.

Before I start digging into it, I see that you're using request interception to block ads. Sadly, unless something changed without my knowledge, request interception still disables cache in Puppeteer, which means that the performance gains from blocking ads would be diminished.
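
For readers unfamiliar with the mechanism, request interception in Puppeteer looks roughly like the sketch below. The isBlocked predicate and the fake request objects are hypothetical stand-ins so the handler can be exercised without a browser; page.setRequestInterception(), request.abort() and request.continue() are the Puppeteer calls involved.

```javascript
// Hypothetical decision function; a real one would consult the blocking engine.
const isBlocked = (url) => url.includes('/ads/');

// Builds a handler for Puppeteer's 'request' event. Once interception is
// enabled, every request has to be either aborted or continued.
function createInterceptionHandler(shouldBlock) {
    return (request) => (shouldBlock(request.url()) ? request.abort() : request.continue());
}

// Wiring it up (requires a Puppeteer `page`, so it is left as a comment):
//   await page.setRequestInterception(true);
//   page.on('request', createInterceptionHandler(isBlocked));

// Stand-in request object that records which method the handler called.
function fakeRequest(url) {
    const calls = [];
    return {
        url: () => url,
        abort: () => calls.push('abort'),
        continue: () => calls.push('continue'),
        calls,
    };
}

const adRequest = fakeRequest('https://example.com/ads/banner.js');
const pageRequest = fakeRequest('https://example.com/index.html');
const handler = createInterceptionHandler(isBlocked);
handler(adRequest);
handler(pageRequest);
```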

Could you explain why you're not using simple URL pattern blocking, as we do with blockRequests(), which is done in browser and does not engage request interception? I assume there's a good reason.

Thanks.

@remusao
Author

remusao commented Feb 24, 2020

Hi @mnmkng,

Thank you for your quick feedback. Let me try to address your initial points.

> Before I start digging into it, I see that you're using request interception to block ads. Sadly, unless something changed without my knowledge, request interception still disables cache in Puppeteer, which means that the performance gains from blocking ads would be diminished.

I was not aware of this limitation, and it is a sad one... It could be that the speed benefit still outweighs the cost of not caching on some websites, but probably not on all.

> Could you explain why you're not using simple URL pattern blocking, as we do with blockRequests(), which is done in browser and does not engage request interception? I assume there's a good reason.

I was honestly not aware of this API, but after looking at the documentation it seems much too limited to handle the adblocking use case. It only supports a list of patterns with optional use of * for globbing (and each pattern implicitly starts and ends with *). In contrast, the rules used for ad and tracker blocking are much more fine-grained; here are a few examples of things they allow:

  • Matching on a specific hostname (including its subdomains).
  • Matching at the beginning of the URL.
  • Matching at the end of the URL.
  • Matching specific types of requests (e.g. image, script, or combinations of those).
  • Matching based on partiness (i.e. is the request first-party or third-party?).
  • Cancelling blocking using exception rules which take precedence over blocking rules.
  • Matching only on pages of a specific domain (maybe less problematic for Apify since this is controlled).
  • Matching based on a RegExp pattern (less frequent than other rules though).
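
As a rough illustration of why plain glob patterns fall short, the sketch below combines a few of the features listed above: hostname matching including subdomains, request types, third-party checks, and exception rules. The rule shape and all function names are hypothetical and heavily simplified; this is not how @cliqz/adblocker represents or matches its filters.

```javascript
// Hypothetical, simplified rule shape. Real filter lists use a text syntax
// and support many more options than shown here.
const rules = [
    { hostname: 'tracker.example', types: ['script', 'image'], thirdParty: true },
    { hostname: 'cdn.tracker.example', exception: true },
];

// True if `host` equals `ruleHost` or is one of its subdomains.
function hostnameMatches(host, ruleHost) {
    return host === ruleHost || host.endsWith(`.${ruleHost}`);
}

// Naive check based on hostnames only; a real engine would compare
// registrable domains (eTLD+1) instead.
function isThirdParty(requestHost, pageHost) {
    return !hostnameMatches(requestHost, pageHost) && !hostnameMatches(pageHost, requestHost);
}

function ruleMatches(rule, request) {
    const host = new URL(request.url).hostname;
    if (!hostnameMatches(host, rule.hostname)) return false;
    if (rule.types && !rule.types.includes(request.type)) return false;
    if (rule.thirdParty !== undefined) {
        const pageHost = new URL(request.pageUrl).hostname;
        if (rule.thirdParty !== isThirdParty(host, pageHost)) return false;
    }
    return true;
}

// Exception rules take precedence over blocking rules.
function shouldBlock(request) {
    const matching = rules.filter((rule) => ruleMatches(rule, request));
    if (matching.some((rule) => rule.exception)) return false;
    return matching.length > 0;
}

const thirdPartyPixel = { url: 'https://ads.tracker.example/pixel.gif', type: 'image', pageUrl: 'https://news.example/article' };
const exceptedScript = { url: 'https://cdn.tracker.example/lib.js', type: 'script', pageUrl: 'https://news.example/' };
const firstPartyScript = { url: 'https://tracker.example/a.js', type: 'script', pageUrl: 'https://tracker.example/' };
```

None of these three decisions can be expressed with a single list of *-delimited URL substrings, since they depend on the request type, the page that issued the request, and the interaction between rules.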

And more... Combinations of those features are also possible. If you are interested, you can have a look at this blog post, which dives deeper into how adblocker rules are defined and how @cliqz/adblocker's engine performs matching efficiently: https://0x65.dev/blog/2019-12-20/not-all-adblockers-are-born-equal.html

There is also a compatibility matrix here listing the features supported by the filtering engine of @cliqz/adblocker: https://github.com/cliqz-oss/adblocker/wiki/Compatibility-Matrix

Best,

@mnmkng
Member

mnmkng commented Feb 24, 2020

Thanks for the resources @remusao. We don't have the time to do the necessary performance tests now, but let me get back to you in March and we'll see whether it's faster to keep cache or block ads.

@Nicklason

Just to add a comment to the pull request: I am using the uBlock extension to block ads and trackers. That way Puppeteer still caches requests, and uBlock blocks the unwanted ones without the use of request interception. This is the only good workaround I could come up with to be able to cache requests and block unwanted ones.
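
A minimal sketch of what launching with an unpacked extension might look like is shown below. The helper name buildLaunchOptions and the extension directory are hypothetical; --disable-extensions-except and --load-extension are standard Chromium switches, and the sketch assumes headful mode on the understanding that headless Chrome did not load extensions.

```javascript
const path = require('path');

// Hypothetical helper: builds Puppeteer launch options that load an
// unpacked extension from the given directory.
function buildLaunchOptions(extensionDir) {
    const extensionPath = path.resolve(extensionDir);
    return {
        headless: false, // assumption: extensions are not available in headless mode
        args: [
            `--disable-extensions-except=${extensionPath}`,
            `--load-extension=${extensionPath}`,
        ],
    };
}

// './ublock-unpacked' is a placeholder for wherever the extension was unpacked.
const launchOptions = buildLaunchOptions('./ublock-unpacked');

// With Puppeteer installed, the options would be passed to launch():
//   const browser = await puppeteer.launch(launchOptions);
```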

I don't really see a way to add this functionality to Apify, nor do I think it is a good idea, but maybe it would be good enough to update the documentation and add a page on how to block unwanted requests while keeping Puppeteer's built-in caching.

@mnmkng
Member

mnmkng commented Jun 5, 2020

@Nicklason Are you running Puppeteer headful? Or has there been a change and Puppeteer now allows extensions even in headless?

@Nicklason

@mnmkng I am very happy that you asked about that, because I just assumed it would work in headless too, but it looks like extensions are disabled when running headless. I will try to play around with it and see if I can come up with something.
